Feature Processing for Text Analytics

It seems like every day there are more and more machine learning problems that involve learning on text data, but raw text makes for fairly lousy input to machine learning algorithms.  That's why there are text vectorization algorithms, which reformat text data so it's ready to use in machine learning.  In this episode, we'll go over some of the most common and useful ways to preprocess text data for machine learning.
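
To make "text vectorization" a bit more concrete, here is a minimal sketch of two common approaches, bag-of-words counts and TF-IDF weighting. This is an illustration and not necessarily the exact set of methods covered in the episode; the function names (`tokenize`, `bag_of_words`, `tfidf`) and the toy documents are made up for this example, and the unsmoothed `log(N / df)` weighting is one of several variants in use.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(docs):
    """Return (vocabulary, count vectors), one count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

def tfidf(docs):
    """Weight each count by log(N / document frequency) (assumed variant)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    doc_freq = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    weights = [[c * w for c, w in zip(vec, idf)] for vec in counts]
    return vocab, weights

# Hypothetical toy corpus, purely for illustration.
docs = ["The cat sat.", "The dog sat.", "The dog barked!"]
vocab, counts = bag_of_words(docs)
_, weights = tfidf(docs)
```

With this variant, a word that shows up in every document (like "the" here) gets a TF-IDF weight of zero, which is the point: it carries no information for telling documents apart.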

Education Analytics

This week we'll hop into the rapidly developing industry around predictive analytics for education.  For many of the students who eventually drop out, data science is showing that there might be early warning signs that the student is in trouble--we'll talk about what some of those signs are, and then dig into the meatier questions around discrimination, who owns a student's data, and correlation vs. causation.  Spoiler: we have more questions than we have answers on this one.

Bonus appearance from Maeby the dog, who is pictured below (on her way home from the orphanage when she got adopted).

A Technical Deep Dive on Stanley, the First Self-Driving Car

In our follow-up episode to last week's introduction to the first self-driving car, we will be doing a technical deep dive this week and talking about the most important systems for getting a car to drive itself 140 miles across the desert.  Lidar?  You betcha!  Drive-by-wire?  Of course!  Probabilistic terrain reconstruction?  Absolutely!  All this and more this week on Linear Digressions.

An Introduction to Stanley, the First Self-Driving Car

In October 2005, 23 cars lined up in the desert for a 140-mile race.  Not one of those cars had a driver.  This was the DARPA Grand Challenge to see if anyone could build an autonomous vehicle capable of navigating a desert route (and if so, whose car could do it the fastest); the winning car, Stanley, now sits in the Smithsonian Museum in Washington DC as arguably the world's first real self-driving car.  In this episode (part one of a two-parter), we'll revisit the 2005 DARPA Grand Challenge and the rules and constraints that shaped what it took for Stanley to win the competition.  Next week, we'll do a deep dive into Stanley's control systems and overall operation, and the key systems that allowed Stanley to win the race.

Feature Importance

Figuring out which features actually matter in a model is harder than you might first guess.  When a human makes a decision, you can just ask them--why did you do that?  But with machine learning models, not so much.  That's why we wanted to talk a bit about both regularization (again) and also other ways that you can figure out which features have the biggest impact on the predictions of your model.
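
One of those "other ways" is permutation importance: shuffle one feature's column, and see how much worse the model's error gets. The sketch below is a standard-library illustration of that idea under made-up assumptions (a hypothetical toy `model` that only looks at the first feature, and mean squared error as the metric); it isn't necessarily the method discussed in the episode.

```python
import random

def mse(model, X, y):
    """Mean squared error of model predictions on rows X against targets y."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical toy setup: the target depends only on feature 0.
X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [3 * row[0] for row in X]
model = lambda row: 3 * row[0]  # stands in for a trained model

importances = permutation_importance(model, X, y)
```

Shuffling the feature the model ignores changes nothing, so its importance comes out as exactly zero, while shuffling the feature the model relies on makes the error jump.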

Space Codes!

It's hard to get information to and from Mars.  Mars is very far away, and expensive to get to, and the bandwidth for passing messages with Earth is not huge.  The messages you do pass have to traverse millions of miles, which provides ample opportunity for the message to get corrupted or scrambled.  How, then, can you encode messages so that errors can be detected and corrected?  How does the decoding process allow you to actually find and correct the errors?  In this episode, we'll talk about three pieces of the process (Reed-Solomon codes, convolutional codes, and Viterbi decoding) that allow the scientists at NASA to talk to our rovers on Mars.
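
For a taste of how two of those pieces fit together, here is a small sketch of a convolutional encoder and a hard-decision Viterbi decoder. The particular code used (rate 1/2, constraint length 3, generators 111 and 101) is a textbook toy chosen for this example, not the code NASA actually flies, and Reed-Solomon is left out entirely. The function names are made up for illustration.

```python
def conv_encode(bits):
    """Rate-1/2 convolutional encoder with generators 111 and 101."""
    s1 = s2 = 0  # the two most recent input bits (encoder state)
    out = []
    for b in bits + [0, 0]:  # two flush bits return the encoder to state 00
        out += [b ^ s1 ^ s2, b ^ s2]
        s1, s2 = b, s1
    return out

def viterbi_decode(received):
    """Find the input bits whose encoding is closest to `received`."""
    INF = float("inf")
    # Best Hamming distance so far, and the input bits that achieve it,
    # for every encoder state (s1, s2). Decoding starts in state 00.
    metrics = {(0, 0): 0, (0, 1): INF, (1, 0): INF, (1, 1): INF}
    paths = {state: [] for state in metrics}
    for i in range(0, len(received), 2):
        r0, r1 = received[i], received[i + 1]
        new_metrics = {state: INF for state in metrics}
        new_paths = {state: [] for state in metrics}
        for (s1, s2), metric in metrics.items():
            if metric == INF:
                continue  # state not reachable yet
            for b in (0, 1):
                expected = (b ^ s1 ^ s2, b ^ s2)
                cost = metric + (expected[0] != r0) + (expected[1] != r1)
                nxt = (b, s1)
                if cost < new_metrics[nxt]:  # keep only the survivor path
                    new_metrics[nxt] = cost
                    new_paths[nxt] = paths[(s1, s2)] + [b]
        metrics, paths = new_metrics, new_paths
    return paths[(0, 0)][:-2]  # encoder ended in state 00; drop flush bits

message = [1, 0, 1, 1, 0, 0, 1]
sent = conv_encode(message)
corrupted = sent[:]
corrupted[3] ^= 1   # flip two bits to simulate a noisy channel
corrupted[10] ^= 1
decoded = viterbi_decode(corrupted)
```

The encoder doubles the message length, and that redundancy is what lets the decoder recover the original bits here even though two of the transmitted bits were flipped.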

Finding (and Studying) Wikipedia Trolls

You may be shocked to hear this, but sometimes, people on the internet can be mean.  For some of us this is just a minor annoyance, but if you're a maintainer or contributor of a large project like Wikipedia, abusive users can be a huge problem.  Fighting the problem starts with understanding it, and understanding it starts with measuring it; the thing is, for a huge website like Wikipedia, there can be millions of edits and comments where abuse might happen, so measurement isn't a simple task.  That's where machine learning comes in: by building an "abuse classifier," and pointing it at the Wikipedia edit corpus, researchers at Jigsaw and the Wikimedia Foundation are for the first time able to estimate abuse rates and curate a dataset of abusive incidents.  Then those researchers, and others, can use that dataset to study the pathologies and effects of Wikipedia trolls.

A Sprint Through What's New in Neural Nets

Advances in neural networks are moving fast enough that, even though it seems like we talk about them all the time around here, it also always seems like we're barely keeping up.  So this week we have another installment in our "neural nets: they so smart!" series, talking about three topics.  And all the topics this week were listener suggestions, too!

Stein's Paradox

When you're estimating something about some object that's a member of a larger group of similar objects (say, the batting average of a baseball player, who belongs to a baseball team), how should you estimate it: use measurements of the individual, or get some extra information from the group?  The James-Stein estimator tells you how to combine individual and group information to make predictions that, taken over the whole group, are more accurate than if you treated each individual, well, individually.
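
Here is a minimal sketch of that combination, using the common form of the estimator that shrinks each individual measurement toward the group mean. It assumes every measurement has the same known noise variance `sigma2`, which is a simplification for illustration, and the numbers are invented rather than real batting averages.

```python
def james_stein(measurements, sigma2):
    """Shrink each measurement toward the group mean.

    Assumes each measurement is its true value plus noise with known
    variance sigma2. Shrinking toward the estimated mean needs at least
    4 measurements for the (p - 3) factor to be positive.
    """
    p = len(measurements)
    if p < 4:
        raise ValueError("need at least 4 measurements")
    mean = sum(measurements) / p
    spread = sum((z - mean) ** 2 for z in measurements)
    shrink = 1 - (p - 3) * sigma2 / spread
    return [mean + shrink * (z - mean) for z in measurements]

# Invented numbers, for illustration only.
observed = [0.20, 0.25, 0.30, 0.35, 0.40]
estimates = james_stein(observed, sigma2=0.002)
```

Every estimate gets pulled part of the way toward the group mean of 0.30; the noisier the individual measurements are relative to the spread of the group, the harder the pull.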

Empirical Bayes

Say you're looking to use some Bayesian methods to estimate parameters of a system.  You've got the normalization figured out, and the likelihood, but the prior... what should you use for a prior?  Empirical Bayes has an elegant answer: look to your previous experience, and use past measurements as a starting point in your prior.

Scratching your head about some of those terms, and why they matter?  Lucky for you, you're standing in front of a podcast episode that unpacks all of this.
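
As a rough sketch of what "use past measurements as a starting point in your prior" can look like, the example below fits a Beta prior to a handful of past success rates (by matching mean and variance) and then updates it with a new, small sample. The Beta-binomial setup and the method-of-moments fit are assumptions chosen for this illustration, and the numbers are invented.

```python
import statistics

def fit_beta_prior(past_rates):
    """Method-of-moments Beta(alpha, beta) fit to previously observed rates."""
    m = statistics.mean(past_rates)
    v = statistics.pvariance(past_rates)
    common = m * (1 - m) / v - 1  # equals alpha + beta
    return m * common, (1 - m) * common

def posterior_mean(successes, trials, alpha, beta):
    """Posterior mean of a rate under a Beta prior and binomial likelihood."""
    return (alpha + successes) / (alpha + beta + trials)

# Invented past measurements, for illustration only.
past_rates = [0.20, 0.25, 0.30, 0.35, 0.40]
alpha, beta = fit_beta_prior(past_rates)

# A new case with only 10 trials: the raw rate is 0.4,
# but the prior pulls the estimate back toward past experience.
estimate = posterior_mean(successes=4, trials=10, alpha=alpha, beta=beta)
```

With only 10 trials, the estimate lands much closer to the historical average than to the raw 4-out-of-10; with thousands of trials, the data would dominate the prior instead.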

Endogenous Variables and Measuring Protest Effectiveness

Have you been out protesting lately, or watching the protests, and wondered how much effect they might have on lawmakers?  It's a tricky question to answer, since usually we need randomly distributed treatments (e.g. big protests) to understand causality, but there's no reason to believe that big protests are actually randomly distributed.  In other words, protest size is endogenous to legislative response, and understanding cause and effect is very challenging.

So, what to do?  Well, at least in the case of studying Tea Party protest effectiveness, researchers have used rainfall, of all things, to understand the impact of a big protest.  In other words, rainfall is the instrumental variable in this analysis that cracks the scientific case open.  What does rainfall have to do with protests?  Do protests actually matter?  What do we mean when we talk about endogenous and instrumental variables?  We wouldn't be very good podcasters if we answered all those questions here--you gotta listen to this episode to find out.
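
To see why an instrument helps, here is a simulation sketch on purely synthetic data (not the researchers' data or their actual analysis). An unobserved confounder pushes both "protest size" and "response" around, so a naive regression gets the wrong answer, while the simple instrumental-variable ratio cov(z, y) / cov(z, x) recovers the effect that was built into the simulation. All variable names and coefficients are made up.

```python
import random

def covariance(a, b):
    """Population covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

def simulate(n=20000, true_effect=2.0, seed=0):
    """Synthetic data: rain affects response only through protest size."""
    rng = random.Random(seed)
    rain, size, response = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)           # unobserved confounder
        z = rng.gauss(0, 1)           # instrument (stand-in for rainfall)
        x = -z + u + rng.gauss(0, 1)  # rain shrinks the protest
        y = true_effect * x + 3 * u + rng.gauss(0, 1)
        rain.append(z)
        size.append(x)
        response.append(y)
    return rain, size, response

def ols_slope(x, y):
    """Naive regression slope of y on x (biased by the confounder here)."""
    return covariance(x, y) / covariance(x, x)

def iv_slope(z, x, y):
    """Instrumental-variable estimate using z as the instrument for x."""
    return covariance(z, y) / covariance(z, x)

rain, size, response = simulate()
naive = ols_slope(size, response)
iv = iv_slope(rain, size, response)
```

In this setup the naive slope comes out well above the built-in effect of 2.0, because the confounder inflates it, while the instrumental-variable estimate lands close to 2.0.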

Rock the ROC Curve

This week: everybody's favorite WWII-era classifier metric!  But it's not just for winning wars, it's a fantastic go-to metric for all your classifier quality needs.  

Ensemble Algorithms

If one machine learning model is good, are two models better? In a lot of cases, the answer is yes. If you build many ok models, and then bring them all together and use them in combination to make your final predictions, you've just created an ensemble model. It feels a little bit like cheating, like you just got something for nothing, but the results don't lie: algorithms like Random Forests and Gradient Boosting Trees (two types of ensemble algorithms) are some of the strongest out-of-the-box algorithms for classic supervised classification problems. What makes a Random Forest random, and what does it mean to gradient boost a tree? Have a listen and find out.
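
The "something for nothing" feeling can be checked with a little arithmetic. The sketch below shows a plain majority vote, plus the probability that a majority of models is right when each one is right with the same probability and, as a big simplifying assumption, they make their mistakes independently. Real ensembles don't satisfy that assumption exactly, and this is only the general voting intuition, not how Random Forests or Gradient Boosting work in detail.

```python
from collections import Counter
from math import comb

def majority_vote(predictions):
    """Return the label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]

def majority_accuracy(n_models, p_correct):
    """P(majority is right) for an odd number of independent models,
    each correct with probability p_correct (binary classification)."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

vote = majority_vote(["spam", "ham", "spam"])
single = majority_accuracy(1, 0.6)
ensemble = majority_accuracy(25, 0.6)
```

Under these assumptions, 25 models that are each right only 60% of the time give a majority vote that is right well over 80% of the time.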

How to evaluate a translation: BLEU scores

As anyone who's encountered a badly translated text could tell you, not all translations are created equal.  Some translations are smooth, fluent and sound like a poet wrote them; some are jerky, non-grammatical and awkward.  When a machine is doing the translating, it's awfully easy to end up with a robotic-sounding text; as the state of the art in machine translation improves, though, a natural question to ask is: according to what measure?  How do we quantify a "good" translation?

Enter the BLEU score, which is the standard metric for quantifying the quality of a machine translation.  BLEU rewards translations that have large overlap with human translations of sentences, with some extra heuristics thrown in to guard against weird pathologies (like full sentences getting translated as one word, redundancies, and repetition).  Nowadays, if there's a machine translation being evaluated or a new state-of-the-art system (like the Google neural machine translation we've discussed on this podcast before), chances are that there's a BLEU score going into that assessment.
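
Here is a simplified sketch of the idea: clipped n-gram precision (so repeating a matching word over and over doesn't pay off) combined with a brevity penalty (so very short translations don't win by default). It scores a single sentence with no smoothing, whereas BLEU is usually computed over a whole corpus, so treat it as an illustration of the ingredients and not a reference implementation. The function names are made up for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Fraction of candidate n-grams found in a reference, with each
    n-gram's count clipped to its maximum count in any one reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    c = len(candidate)
    # Reference length closest to the candidate length (shorter wins ties).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat is on the mat".split()
perfect = bleu(reference, [reference])
repetitive = modified_precision("the the the the the the the".split(), [reference], 1)
```

A candidate identical to the reference scores 1.0, while the all-"the" candidate only gets credit for the two occurrences of "the" that the reference actually contains.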

Zero Shot Translation

Take Google-size data, the flexibility of a neural net, and all (well, most) of the languages of the world, and what you end up with is a pile of surprises.  This episode is about some interesting features of Google's new neural machine translation system, namely that with minimal tweaking, it can accommodate many different languages in a single neural net, that it can do a half-decent job of translating between language pairs it's never been explicitly trained on, and that it seems to have its own internal representation of concepts that's independent of the language those concepts are being represented in.  Intrigued?  You should be...

Google Neural Machine Translation

Recently, Google swapped out the backend for Google Translate, moving from a statistical phrase-based method to a recurrent neural network.  This marks a big change in methodology: the tried-and-true statistical translation methods that have been in use for decades are giving way to a neural net that, across the board, appears to be giving more fluent and natural-sounding translations.  This episode recaps statistical phrase-based methods, digs into the RNN architecture a little bit, and recaps the impressive results that are making us all sound a little better in our non-native languages.

Data + Healthcare + Government = The Future of Medicine : Interview with Precision Medicine Initiative researcher Matt Might

Today we are delighted to bring you an interview with Matt Might, computer scientist and medical researcher extraordinaire and architect of President Obama's Precision Medicine Initiative.  As the Obama Administration winds down, we're talking with Matt about the goals and accomplishments of precision medicine (and related projects like the Cancer Moonshot) and what he foresees as the future marriage of data and medicine.  Many thanks to Matt, our friends over at Partially Derivative (hi, Jonathon!) and the White House for arranging this opportunity to chat.  Enjoy!

Special Crossover Episode: Partially Derivative Interview with White House Chief Data Scientist DJ Patil

We have the pleasure of bringing you a very special crossover episode this week: our friends at Partially Derivative (another great podcast about data science, you should check it out) recently interviewed White House Chief Data Scientist DJ Patil.  We think DJ's message about the importance and impact of data science is worth spreading, so it's our pleasure to bring it to you today.  A huge thanks to Jonathon Morgan and Partially Derivative for sharing this interview with us--enjoy!

Relevant links:
http://partiallyderivative.com/podcast/2016/12/13/dj-patil

How to Lose at Kaggle

Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists.  Losing unexpectedly at the very end of the contest is also something that a lot of us have experienced.  It's not just bad luck: a very specific combination of overfitting on popular competitions can take someone who is in the top few spots in the final days of a contest and bump them down hundreds of slots in the final tally.
