Stein's Paradox

When you're estimating something about some object that's a member of a larger group of similar objects (say, the batting average of a baseball player, who belongs to a baseball team), how should you estimate it: use measurements of the individual, or get some extra information from the group?  The James-Stein estimator tells you how to combine individual and group information to make predictions that, taken over the whole group, are more accurate than if you treated each individual, well, individually.
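
If you'd like to see the shrinkage in action, here's a minimal sketch.  It assumes the positive-part version of the estimator that shrinks toward the group mean, a known noise variance, and at least four individuals in the group.  The data is simulated, and the names (`james_stein`, `total_squared_error`) are ours, not from the episode.

```python
import random

def james_stein(observations, sigma2):
    """Shrink each observation toward the group mean (needs at least 4 values)."""
    k = len(observations)
    grand_mean = sum(observations) / k
    spread = sum((x - grand_mean) ** 2 for x in observations)
    # Shrinkage factor: 0 means "use the group mean", 1 means "use the individual".
    shrink = max(0.0, 1.0 - (k - 3) * sigma2 / spread)
    return [grand_mean + shrink * (x - grand_mean) for x in observations]

def total_squared_error(estimates, truth):
    return sum((e - t) ** 2 for e, t in zip(estimates, truth))

# Simulated groups: 10 individuals each, one noisy measurement per individual.
rng = random.Random(0)
sigma = 1.0
raw_error = 0.0
js_error = 0.0
for _ in range(2000):
    true_means = [rng.gauss(0.0, 0.5) for _ in range(10)]
    observed = [rng.gauss(m, sigma) for m in true_means]
    raw_error += total_squared_error(observed, true_means)
    js_error += total_squared_error(james_stein(observed, sigma ** 2), true_means)

print("individual estimates:", raw_error)
print("James-Stein estimates:", js_error)
```

The comparison is on total error over the whole group: any single individual can still end up worse off after shrinking.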

Empirical Bayes

Say you're looking to use some Bayesian methods to estimate parameters of a system.  You've got the normalization figured out, and the likelihood, but the prior... what should you use for a prior?  Empirical Bayes has an elegant answer: look to your previous experience, and use past measurements as a starting point in your prior.

Scratching your head about some of those terms, and why they matter?  Lucky for you, you're standing in front of a podcast episode that unpacks all of this.
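
To make "use past measurements as a starting point in your prior" a little more concrete, here's a rough sketch for a success rate like a batting average.  It assumes a Beta prior fit to past rates by the method of moments (one common choice, not necessarily the one discussed in the episode), and the numbers are made up.

```python
from statistics import mean, pvariance

def fit_beta_prior(past_rates):
    """Method-of-moments fit of a Beta(alpha, beta) prior to previously observed rates."""
    m = mean(past_rates)
    v = pvariance(past_rates)
    common = m * (1 - m) / v - 1  # plays the role of alpha + beta
    return m * common, (1 - m) * common

def posterior_mean(successes, trials, alpha, beta):
    """Posterior mean of the rate after seeing new data, under the Beta prior."""
    return (alpha + successes) / (alpha + beta + trials)

# Hypothetical past batting averages (illustrative values only).
past_rates = [0.24, 0.26, 0.25, 0.28, 0.22, 0.27, 0.23, 0.25]
alpha, beta = fit_beta_prior(past_rates)

# A new player goes 4 for 10: the raw rate is 0.400, but the prior pulls it back.
estimate = posterior_mean(4, 10, alpha, beta)
print(alpha, beta, estimate)
```

This shortcut treats the past rates as if they were measured exactly, which is a simplification.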

Endogenous Variables and Measuring Protest Effectiveness

Have you been out protesting lately, or watching the protests, and wondered how much effect they might have on lawmakers?  It's a tricky question to answer, since usually we need randomly distributed treatments (e.g. big protests) to understand causality, but there's no reason to believe that big protests are actually randomly distributed.  In other words, protest size is endogenous to legislative response, and understanding cause and effect is very challenging.

So, what to do?  Well, at least in the case of studying Tea Party protest effectiveness, researchers have used rainfall, of all things, to understand the impact of a big protest.  In other words, rainfall is the instrumental variable in this analysis that cracks the scientific case open.  What does rainfall have to do with protests?  Do protests actually matter?  What do we mean when we talk about endogenous and instrumental variables?  We wouldn't be very good podcasters if we answered all those questions here--you gotta listen to this episode to find out.
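
No spoilers, but here's a toy simulation of why an instrument helps.  Everything in it is invented for illustration: the effect sizes, the unobserved "enthusiasm" variable, and the assumption that rain affects the legislative response only by changing protest size.

```python
import random

def covariance(a, b):
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

rng = random.Random(1)
true_effect = 0.5  # made-up causal effect of protest size on response

rain, size, response = [], [], []
for _ in range(20000):
    z = rng.gauss(0, 1)                  # rainfall (the instrument)
    u = rng.gauss(0, 1)                  # unobserved local enthusiasm
    x = -1.0 * z + u + rng.gauss(0, 1)   # protest size: rain shrinks it, enthusiasm grows it
    y = true_effect * x + 2.0 * u + rng.gauss(0, 1)  # enthusiasm also drives the response
    rain.append(z)
    size.append(x)
    response.append(y)

# Naive regression slope of response on size: biased by the hidden enthusiasm.
naive_estimate = covariance(size, response) / covariance(size, size)

# Instrumental variable (Wald) estimate: only uses the variation in size caused by rain.
iv_estimate = covariance(rain, response) / covariance(rain, size)

print("naive:", naive_estimate, "IV:", iv_estimate, "truth:", true_effect)
```

In this setup the naive slope overstates the effect, while the rainfall-based estimate should land close to the true value.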

Rock the ROC Curve

This week: everybody's favorite WWII-era classifier metric!  But it's not just for winning wars; it's a fantastic go-to metric for all your classifier quality needs.
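
For the curious, here's a small sketch of how an ROC curve and the area under it can be computed by hand.  It assumes binary labels (1 for positive, 0 for negative) and no tied scores.  The function names and the toy data are ours.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one example at a time, starting from the highest score.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid rule over consecutive ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
print(points)
print(area_under_curve(points))  # 8 of the 9 positive/negative pairs are ranked correctly
```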

Ensemble Algorithms

If one machine learning model is good, are two models better? In a lot of cases, the answer is yes. If you build many ok models, and then bring them all together and use them in combination to make your final predictions, you've just created an ensemble model. It feels a little bit like cheating, like you just got something for nothing, but the results don't lie: algorithms like Random Forests and Gradient Boosting Trees (two types of ensemble algorithms) are some of the strongest out-of-the-box algorithms for classic supervised classification problems. What makes a Random Forest random, and what does it mean to gradient boost a tree? Have a listen and find out.
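
Here's a back-of-the-envelope sketch of the "something for nothing" feeling.  It assumes an idealized ensemble: an odd number of models that each get a yes/no question right with the same probability, make their mistakes independently, and vote by simple majority.  Real models' errors are correlated, so treat this as an upper-bound intuition rather than a description of Random Forests or boosting.

```python
from math import comb

def majority_vote_accuracy(n_models, p_correct):
    """Probability that a majority of n independent models is right (n should be odd)."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct ** k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

# Each model alone is right 60% of the time (a made-up figure).
for n in (1, 3, 25, 101):
    print(n, majority_vote_accuracy(n, 0.6))
```

The more independent voters you add, the more the individual mistakes wash out.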

How to evaluate a translation: BLEU scores

As anyone who's encountered a badly translated text could tell you, not all translations are created equal.  Some translations are smooth, fluent and sound like a poet wrote them; some are jerky, non-grammatical and awkward.  When a machine is doing the translating, it's awfully easy to end up with a robotic-sounding text; as the state of the art in machine translation improves, though, a natural question to ask is: according to what measure?  How do we quantify a "good" translation?

Enter the BLEU score, which is the standard metric for quantifying the quality of a machine translation.  BLEU rewards translations that have large overlap with human translations of sentences, with some extra heuristics thrown in to guard against weird pathologies (like full sentences getting translated as one word, redundancies, and repetition).  Nowadays, if there's a machine translation being evaluated or a new state-of-the-art system (like the Google neural machine translation we've discussed on this podcast before), chances are that there's a BLEU score going into that assessment.
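
To give a feel for the overlap-plus-heuristics idea, here's a stripped-down sketch.  It assumes a single reference translation, whitespace tokenization, n-grams up to length 2, and no smoothing.  Real BLEU is normally computed over a whole corpus with n-grams up to length 4, so the numbers here won't match a standard implementation.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with repeats capped."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def simple_bleu(candidate, reference, max_n=2):
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geometric_mean = exp(sum(log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates that are shorter than the reference.
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = exp(1 - len(reference) / len(candidate))
    return geometric_mean * brevity

reference = "the cat is on the mat".split()
print(simple_bleu(reference, reference))                      # perfect overlap
print(simple_bleu("the cat".split(), reference))              # right words, far too short
print(simple_bleu("the the the the the the".split(), reference))  # repetition gets no credit
```

The clipping handles the repetition pathology, and the brevity penalty handles the one-word-translation pathology.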

Zero Shot Translation

Take Google-size data, the flexibility of a neural net, and all (well, most) of the languages of the world, and what you end up with is a pile of surprises.  This episode is about some interesting features of Google's new neural machine translation system, namely that with minimal tweaking, it can accommodate many different languages in a single neural net, that it can do a half-decent job of translating between language pairs it's never been explicitly trained on, and that it seems to have its own internal representation of concepts that's independent of the language those concepts are being represented in.  Intrigued?  You should be...

Google Neural Machine Translation

Recently, Google swapped out the backend for Google Translate, moving from a statistical phrase-based method to a recurrent neural network.  This marks a big change in methodology: the tried-and-true statistical translation methods that have been in use for decades are giving way to a neural net that, across the board, appears to be giving more fluent and natural-sounding translations.  This episode recaps statistical phrase-based methods, digs into the RNN architecture a little bit, and recaps the impressive results that are making us all sound a little better in our non-native languages.

Data + Healthcare + Government = The Future of Medicine : Interview with Precision Medicine Initiative researcher Matt Might

Today we are delighted to bring you an interview with Matt Might, computer scientist and medical researcher extraordinaire and architect of President Obama's Precision Medicine Initiative.  As the Obama Administration winds down, we're talking with Matt about the goals and accomplishments of precision medicine (and related projects like the Cancer Moonshot) and what he foresees as the future marriage of data and medicine.  Many thanks to Matt, our friends over at Partially Derivative (hi, Jonathon!) and the White House for arranging this opportunity to chat.  Enjoy!

Special Crossover Episode: Partially Derivative Interview with White House Chief Data Scientist DJ Patil

We have the pleasure of bringing you a very special crossover episode this week: our friends at Partially Derivative (another great podcast about data science, you should check it out) recently interviewed White House Chief Data Scientist DJ Patil.  We think DJ's message about the importance and impact of data science is worth spreading, so it's our pleasure to bring it to you today.  A huge thanks to Jonathon Morgan and Partially Derivative for sharing this interview with us--enjoy!

Relevant links:
http://partiallyderivative.com/podcast/2016/12/13/dj-patil

How to Lose at Kaggle

Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists.  Losing unexpectedly at the very end of the contest is also something that a lot of us have experienced.  It's not just bad luck: a very specific combination of overfitting on popular competitions can take someone who is in the top few spots in the final days of a contest and bump them down hundreds of slots in the final tally.
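
Here's a toy simulation of one way this can happen.  We're assuming the overfitting in question is tuning against a small public leaderboard, and every number below (test set sizes, number of submissions, the 80% true accuracy) is invented for illustration.

```python
import random

rng = random.Random(42)

def accuracy_on(n_examples, true_accuracy):
    """Observed accuracy of a model on a random test set of n_examples."""
    correct = sum(rng.random() < true_accuracy for _ in range(n_examples))
    return correct / n_examples

true_accuracy = 0.80  # every submission is secretly equally good

# Each submission gets a noisy public score (small set) and a private score (large set).
submissions = []
for _ in range(200):
    public = accuracy_on(200, true_accuracy)
    private = accuracy_on(2000, true_accuracy)
    submissions.append((public, private))

# Pick the submission that looks best on the public leaderboard.
best_public, private_of_best = max(submissions)
print("public score of the chosen submission:", best_public)
print("private score of the same submission:", private_of_best)
```

The chosen submission looks great on the public board mostly because it got lucky there, and that luck doesn't carry over to the final tally.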

Attacking Discrimination in Machine Learning

Imagine there's an important decision to be made about someone, like a bank deciding whether to extend a loan, or a school deciding to admit a student--unfortunately, we're all too aware that discrimination can sneak into these situations (even when everyone is acting with the best of intentions!).  Now, these decisions are often made with the assistance of machine learning and statistical models, but unfortunately these algorithms pick up on the discrimination in the world (it sneaks in through the data, which can capture inequities, which the algorithms then learn) and reproduce it.

This podcast covers some of the most common ways we can try to minimize discrimination, and why none of those ways is perfect at fixing the problem.  Then we'll get to a new idea called "equality of opportunity," which came out of Google recently and takes a pretty practical and well-aimed approach to machine learning bias.  
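
As a rough sketch of the idea: equality of opportunity asks that people who truly qualify get a "yes" at the same rate in every group, that is, equal true positive rates.  The example below assumes a score-based classifier and simply picks a separate cutoff per group to hit a target true positive rate.  The data and function names are made up, and the published method is more careful than this.

```python
def true_positive_rate(scores, labels, threshold):
    """Share of truly qualified people (label 1) scoring at or above the threshold."""
    qualified = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in qualified) / len(qualified)

def threshold_for_tpr(scores, labels, target_tpr):
    """Highest cutoff whose true positive rate reaches the target."""
    for t in sorted(set(scores), reverse=True):
        if true_positive_rate(scores, labels, t) >= target_tpr:
            return t
    return min(scores)

# Hypothetical scores and outcomes for two groups.
scores_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels_a = [1, 1, 1, 0, 1, 0]
scores_b = [0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels_b = [1, 1, 0, 1, 1, 0]

# One shared cutoff treats qualified people in the two groups differently.
print(true_positive_rate(scores_a, labels_a, 0.6), true_positive_rate(scores_b, labels_b, 0.6))

# Group-specific cutoffs can equalize the true positive rates.
cutoff_a = threshold_for_tpr(scores_a, labels_a, 0.75)
cutoff_b = threshold_for_tpr(scores_b, labels_b, 0.75)
print(cutoff_a, cutoff_b)
```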

Stealing a PIN with Signal Processing and Machine Learning

Want another reason to be paranoid when using the free coffee shop wifi?  Allow us to introduce WindTalker, a system that cleverly combines a dose of signal processing with a dash of machine learning to (potentially) steal the PIN from your phone transactions without ever having physical access to your phone.  This episode has it all, folks--channel state information, ICMP echo requests, low-pass filtering, PCA, dynamic time warps, and the PIN for your phone.
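
Of that list, dynamic time warping is probably the easiest piece to show in a few lines.  Here's a generic sketch of the textbook algorithm for comparing two one-dimensional signals that have the same shape but different timing.  This is not WindTalker's code, and the signals are toy data.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    # cost[i][j] is the best alignment cost of the first i items of a with the first j of b.
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances, b repeats
                                    cost[i][j - 1],      # b advances, a repeats
                                    cost[i - 1][j - 1])  # both advance
    return cost[len(a)][len(b)]

bump = [0, 1, 2, 1, 0]
slow_bump = [0, 0, 1, 2, 2, 1, 0]  # same shape, stretched in time
dip = [2, 1, 0, 1, 2]              # a different shape

print(dtw_distance(bump, slow_bump))
print(dtw_distance(bump, dip))
```

The stretched copy lines up with the original at no cost, while the differently shaped signal doesn't, which is what makes this kind of distance handy for matching waveforms recorded at slightly different speeds.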

Deep Blue

In 1997, Deep Blue was the IBM algorithm/computer that did what no one, at the time, thought possible: it beat the world's best chess player.  It turns out, though, that one of the most important moves in the matchup, where Deep Blue psyched out its opponent with a weird move, might not have been so inspired after all.  It might have been nothing more than a bug in the program, and it changed computer science history.

Organizing Google's Datasets

If you're a data scientist, there's a good chance you're used to working with a lot of data. But there's a lot of data, and then there's Google-scale amounts of data. Keeping all that data organized is a Google-sized task, and as it happens, they've built a system for that organizational challenge. This episode is all about that system, called Goods, and in particular we'll dig into some of the details of what makes this so tough.

Data for fighting cancer: followup

A few months ago, Katie started on a project for the Vice President's Cancer Moonshot surrounding how data can be used to better fight cancer.  The project is all wrapped up now, so we wanted to tell you about how that work went and what changes to cancer data policy were suggested to the Vice President.

Note: after this episode was recorded, but before release, the Vice President's office issued a summary report to the President, encompassing all recommendations received as part of the Cancer Moonshot and making final suggestions of what work will be most critical for fighting cancer more effectively (that includes our recommendations, and many others).  The second strategic goal, entitled "Unleash the Power of Data," is summarized here.

The 19-year-old determining the election

Sick of the presidential election yet?  We are too, but there's still almost a month to go, so let's just embrace it together.  In this episode, we'll talk about one of the presidential polls, which has been kind of an outlier for quite a while.  This week, the NY Times took a closer look at this poll, and was able to figure out the reason it's such an outlier.  It all goes back to a 19-year-old African American man, living in Illinois, who really likes Donald Trump...

Relevant Links:

followup article from LA Times, released after recording: