Indie Announcement

May 15, 2016

We've gone indie! Which shouldn't change anything about the podcast that you know and love, but we're super excited to keep bringing you Linear Digressions as a fully independent podcast.

Some links mentioned in the show:
https://twitter.com/lindigressions
https://twitter.com/benjaffe
https://twitter.com/multiarmbandit
https://soundcloud.com/linear-digressions
http://lineardigressions.com/

The Cocktail Party Problem

May 15, 2016

Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!

Links: Deep learning machine solves the cocktail party problem

The cocktail party effect

Portrait Beauty

May 15, 2016

It's Da Vinci meets Skynet: what makes a portrait beautiful, according to a machine learning algorithm. Snap a selfie and give us a listen.

Link: The beauty of capturing faces: rating the quality of digital portraits

A Criminally Short Introduction to Semi-Supervised Learning

May 15, 2016

Because there are more interesting problems than there are labeled datasets, semi-supervised learning provides a framework for getting feedback from the environment as a proxy for labels of what's "correct." Of all the machine learning methodologies, it might also be the closest to how humans usually learn--we go through the world, getting (noisy) feedback on the choices we make and learn from the outcomes of our actions.

Link: David Silver's Reinforcement Learning course

Thresholdout: Down with Overfitting

May 15, 2016

Overfitting to your training data can be avoided by evaluating your machine learning algorithm on a holdout test dataset, but what about overfitting to the test data? Turns out it can be done, easily, and you have to be very careful to avoid it. But an algorithm from the field of privacy research shows promise for keeping your test data safe from accidental overfitting.

Link: The Reusable Holdout: preserving validity in adaptive data analysis

The State of Data Science

May 08, 2016

How many data scientists are there, where do they live, where do they work, what kind of tools do they use, and how do they describe themselves? RJMetrics wanted to know the answers to these questions, so they decided to find out and share their analysis with the world. In this very special interview episode, we welcome Tristan Handy, VP of Marketing at RJMetrics, who will talk about "The State of Data Science Report."

Link: The State of Data Science

Data Science for Making the World a Better Place

May 08, 2016

There's a good chance that great data science is going on close to you, and that it's going toward making your city, state, country, and planet a better place. Not all the data science questions being tackled out there are about finding the sleekest new algorithm or billion-dollar company idea--there's a whole world of social data science that just wants to make the world a better place to live in.

Link:

Driven Data

Kalman Runners

May 08, 2016

The Kalman Filter is an algorithm for taking noisy measurements of dynamic systems and using them to get a better idea of the underlying dynamics than you could get from a simple extrapolation. If you've ever run a marathon, or been a nuclear missile, you probably know all about these challenges already. By the way, we neglected to mention in the episode: Katie's marathon time was 3:54:27!

Link:

How a Kalman filter works, in pictures

Neural Net Inception

May 08, 2016

When you sleep, the neural pathways in your brain take the "white noise" of your resting brain, mix in your experiences and imagination, and the result is dreams (that is a highly unscientific explanation, but you get the idea). What happens when neural nets are put through the same process? Train a neural net to recognize pictures, and then send through an image of white noise, and it will start to see some weird (but cool!) stuff.

Links:

Google Research Blog, Inceptionism: Going Deeper into Neural Networks

Benford's Law

May 08, 2016

Sometimes numbers are... weird. Benford's Law is a favorite example of this for us--it's a law that governs the distribution of the first digit in certain types of numbers. As it turns out, if you're looking up the length of a river, the population of a country, the price of a stock... not all first digits are created equal.

Links:

Guinness

May 08, 2016

Not to oversell it, but the student's t-test has got to have the most interesting history of any statistical test. Which is saying a lot, right? Add some boozy statistical trivia to your arsenal in this epsiode.

Link:

The Statistical Significance of Beer

PFun with P Values

May 08, 2016

Doing some science, and want to know if you might have found something? Or maybe you've just accomplished the scientific equivalent of going fishing and reeling in an old boot? Frequentist p-values can help you distinguish between "eh" and "oooh interesting". Also, there's a lot of physics in this episode, nerds.

Link:

Gelman, P Values and Statistical Practice

Watson

May 08, 2016

This machine learning algorithm beat the human champions at Jeopardy. What is... Watson?

Links: IEEE Xplore: This Is Watson [paywalled :(]

Bayesian Psychics

May 08, 2016

Come get a little "out there" with us this week, as we use a meta-study of extrasensory perception (or ESP, often used in the same sentence as "psychics") to chat about Bayesian vs. frequentist statistics.

Link:

Utts, Norris, Suess and Johnson: The Strength of Evidence Versus the Power of Belief: Are we all Bayesians?

Troll Detection

May 08, 2016

Ever found yourself wasting time reading online comments from trolls? Of course you have; we've all been there (it's 4 AM but I can't turn off the computer and go to sleep--someone on the internet is WRONG!). Now there's a way to use machine learning to automatically detect trolls, and minimize the impact when they try to derail online conversations.

Link: Justin Cheng, Danescu-Niculescu-Mizil, Jure Leskovec: Antisocial Behavior in Online Discussion Communities

Yiddish Translation

May 08, 2016

Imagine a language that is mostly spoken rather than written, contains many words in other languages, and has relatively little written overlap with English. Now imagine writing a machine-learning-based translation system that can convert that language to English. That's the problem that confronted researchers when they set out to automatically translate between Yiddish and English; the tricks they used help us understand a lot about machine translation.

Link:

Genzel, Macherey, and Uszkoreit: Creating a High-Quality Machine Translation System for a Low-Resource Language: Yiddish

Modeling Particles in Atomic Bombs

February 04, 2016

In a fun historical journey, Katie and Ben explore the history of the Manhattan Project, discuss the difficulties in modeling particle movement in atomic bombs with only punch-card computers and ingenuity, and eventually come to present-day uses of the Metropolis-Hastings algorithm... mentioning Solitaire along the way.

Links:

Random Number Generation

February 04, 2016

Let's talk about randomness! Although randomness is pervasive throughout the natural world, it's surprisingly difficult to generate random numbers. And even if your numbers look random (but actually aren't), it can have interesting consequences on the security of systems, and the accuracy of models and research.

In this episode, Katie and Ben talk about randomness, its place in machine learning and computation in general, along with some random digressions of their own.

Links: Mersenne Twister

Electoral Insights Part 2

February 04, 2016

Following up on our last episode about how experiments can be performed in political science, now we explore a high-profile case of an experiment gone wrong.

An extremely high-profile paper that was published in 2014, about how talking to people can convince them to change their minds on topics like abortion and gay marriage, has been exposed as the likely product of a fraudulently produced dataset. We’ll talk about a cool data science tool called the Kolmogorov-Smirnov test, which a pair of graduate students used to reverse-engineer the likely way that the fraudulent data was generated.

But a bigger question still remains—what does this whole episode tell us about fraud and oversight in science?

Links:

Irregularities in LaCour

The Case of the Amazing Gay Marriage Data: How a graduate student reluctantly uncovered a huge scientific fraud

Electoral Insights Part 1

February 04, 2016

The first of our two-parter discussing the recent electoral data fraud case. The results of the study in question were covered widely, including by This American Life (who later had to issue a retraction).

Data science for election research involves studying voters, who are people, and people are tricky to study—every one of them is different, and the same treatment can have different effects on different voters. But with randomized controlled trials, small variations from person to person can even out when you look at a larger group. With the advent of randomized experiments in elections a few decades ago, a whole new door was opened for studying the most effective ways to campaign.

Link: The Victory Lab