Improving Upon a First-Draft Data Science Analysis

There are a lot of good resources out there for getting started with data science and machine learning, which walk you through starting with a dataset and ending up with a model and a set of predictions.  Think something like the homework for your favorite machine learning class, or your most recent online machine learning competition.  However, if you've ever tried to maintain a machine learning workflow (as opposed to building it from scratch), you know that taking a simple modeling script and turning it into clean, well-structured, and maintainable software is way harder than most people give it credit for.  That said, if you're a professional data scientist (or want to be one), this is one of the most important skills you can develop.

In this episode, we'll walk through a workshop Katie is giving at the Open Data Science Conference in San Francisco in November 2017, which covers building a machine learning workflow that's more maintainable than a simple script.  If you'll be at ODSC, come say hi, and if you're not, here's a sneak preview!

Survey Raking

It's quite common for survey respondents not to be representative of the larger population from which they are drawn.  But if you're a researcher, you need to study the larger population using data from your survey respondents, so what should you do?  Reweighting the survey data, so that things like demographic distributions look similar between the survey and general populations, is a standard technique, and in this episode we'll talk about survey raking, a way to calculate survey weights when there are several distributions of interest that need to be matched.
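
As a taste of the mechanics, here's a minimal sketch of raking (also known as iterative proportional fitting) in Python.  The survey rows, category labels, and target margins below are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy survey: 8 respondents with two demographic variables.
survey = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "F", "M", "F", "M"],
    "age":    ["18-34", "35+", "18-34", "35+", "35+", "18-34", "18-34", "35+"],
})

# Known population shares for each margin (hypothetical numbers).
targets = {
    "gender": {"F": 0.5, "M": 0.5},
    "age":    {"18-34": 0.3, "35+": 0.7},
}

weights = np.ones(len(survey))
for _ in range(50):  # raking usually converges in a handful of passes
    for var, margin in targets.items():
        grand_total = weights.sum()
        for category, target_share in margin.items():
            mask = (survey[var] == category).to_numpy()
            # Rescale this category so its weighted share hits the target.
            weights[mask] *= target_share * grand_total / weights[mask].sum()

# The weighted shares now match both margins simultaneously.
for var in targets:
    print(pd.Series(weights).groupby(survey[var]).sum() / weights.sum())
```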

Relevant links:

Re-release: Kalman Runners

In honor of the Chicago marathon this weekend (and due in large part to Katie recovering from running in it...) we have a re-release of an episode about Kalman filters, which are part algorithm, part elaborate metaphor for figuring out how fast you're going when you're running a race but don't have a watch.
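
To make the metaphor a bit more concrete, here's a minimal sketch of a one-dimensional Kalman filter in Python: estimate your (roughly constant) pace from noisy mile splits.  All the numbers are invented for illustration.

```python
# Noisy minutes-per-mile readings from a hypothetical race.
observations = [9.2, 8.7, 9.5, 9.0, 9.3]
process_var = 0.01  # how much we believe the true pace drifts mile to mile
obs_var = 0.25      # how noisy each individual split is

estimate, estimate_var = observations[0], 1.0  # initial guess + uncertainty
for z in observations[1:]:
    # Predict: pace is assumed roughly constant, but uncertainty grows.
    estimate_var += process_var
    # Update: blend the prediction with the new split via the Kalman gain.
    gain = estimate_var / (estimate_var + obs_var)
    estimate += gain * (z - estimate)
    estimate_var *= 1 - gain
    print(f"split {z:.1f} -> estimated pace {estimate:.2f} min/mile")
```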

Katie's Chicago race report:

  • miles 1-13: light ankle pain, lovely cool weather, the most fun imaginable
  • miles 13-17: no more ankle pain but quads start getting tight, it's a little more effort
  • miles 17-20: oof, really tight legs but still plenty of gas in the tank
  • miles 20-23: it's warmer out now, legs hurt a lot but running through Pilsen and Chinatown is too fun to notice
  • mile 24: ugh, cramp, everything hurts
  • miles 25-26.2: awesome crowd support, really tired and loving every second

Final time: 3:54:35

Neural Net Dropout

Neural networks are complex models with many parameters and can be prone to overfitting.  There's a surprisingly simple way to guard against this: randomly drop hidden units (and their connections) during training, a technique known as dropout.  It seems counterintuitive that undermining the structural integrity of the neural net makes it robust against overfitting, but in the world of neural nets, weirdness is just how things go sometimes.
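
For a concrete picture of the mechanics, here's a minimal sketch of (inverted) dropout on a single hidden layer in NumPy; the layer sizes and drop rate are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 8))  # a batch of 4 examples, 8 hidden units
p_drop = 0.5

# Training: zero each unit independently with probability p_drop, and scale
# the survivors by 1/(1 - p_drop) ("inverted dropout") so the expected
# activation is unchanged and no rescaling is needed at test time.
mask = rng.random(hidden.shape) >= p_drop
hidden_train = hidden * mask / (1.0 - p_drop)

# Test time: use the full network, no masking.
hidden_test = hidden
```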

Relevant links:

Disciplined Data Science

As data science matures as a field, it's becoming clearer what attributes a data science team needs to have to elevate their work to the next level.  Most of our episodes are about the cool work being done by other people, but this one summarizes some thinking Katie's been doing herself around how to guide data science teams toward more mature, effective practices.  We'll go through five key characteristics of great data science teams, which we collectively refer to as "disciplined data science," and why they matter.

Relevant links:

Hurricane Forecasting

It's been a busy hurricane season in the southeastern United States, with millions of people making life-or-death decisions based on forecasts of where the hurricanes will hit and with what intensity.  In this episode we'll deconstruct those forecasts, talking about the different types of models, the theory behind them, and how they've evolved through the years.

Relevant links:

Spy Planes

There are law enforcement surveillance aircraft circling over the United States every day, and in this episode, we'll talk about how some folks at BuzzFeed used public data and machine learning to find them.  The fun thing here, in our opinion, is the blend of intrigue (spy planes!) with tech journalism and a heavy dash of publicly available and reproducible analysis code so that you (yes, you!) can see exactly how BuzzFeed identified the surveillance planes.

Relevant links:

Data Lineage

Software engineers are familiar with the idea of versioning code, so you can go back later and revive a past state of the system.  For data scientists who might want to reconstruct past models, though, it's not just about keeping the modeling code.  It's also about saving a version of the data that made the model.  There are a lot of other benefits to keeping track of datasets, so in this episode we'll talk about data lineage, or data provenance.
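
To make the idea concrete, here's a minimal sketch of one way to record lineage in Python: fingerprint the exact training data with a content hash and store it alongside the model's metadata.  The file names, model name, and metadata layout are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

def file_sha256(path):
    """Content hash of the training data; it changes whenever the data does."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

data_path = "data/customers_2017-09.csv"  # hypothetical training dataset
lineage = {
    "model": "churn_model_v3",  # hypothetical model name
    "trained_at": datetime.now(timezone.utc).isoformat(),
    "training_data": data_path,
    "data_sha256": file_sha256(data_path),
}
with open("churn_model_v3.lineage.json", "w") as f:
    json.dump(lineage, f, indent=2)
```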

Relevant links:

Adversarial Examples for Machine Learning

Even as we rely more and more on machine learning algorithms to help with everyday decision-making, we're learning more and more about how they're frighteningly easy to fool sometimes.  Today we have a roundup of a few successful efforts to create robust adversarial examples, including what it means for an adversarial example to be robust and what this might mean for machine learning in the future.
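
As one concrete recipe (not necessarily the exact methods covered in the episode), here's a minimal sketch of the fast gradient sign method, a classic way to build adversarial examples, applied to a toy logistic-regression classifier in NumPy.  The weights and input are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(784)  # weights of a toy linear classifier
x = rng.random(784)           # a "clean" input with features in [0, 1]
y = 1.0                       # the true label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient of the logistic loss with respect to the input x.
grad_x = (sigmoid(w @ x) - y) * w

# FGSM: nudge every feature a tiny step in the direction that most
# increases the loss, then clip back to the valid input range.
eps = 0.1
x_adv = np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

print("clean score:", sigmoid(w @ x))
print("adversarial score:", sigmoid(w @ x_adv))
```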

Relevant links:

Jupyter Notebooks: A Data Scientist's Best Friend

This week's episode is just in time for JupyterCon in NYC, August 22-25...

Jupyter notebooks are probably familiar to a lot of data nerds out there as a great open-source tool for exploring data, doing quick visualizations, and packaging code snippets with explanations for sharing your work with others.  If you're not a data person, or you are but you haven't tried out Jupyter notebooks yet, here's your nudge to go give them a try.  In this episode we'll go back to the old days, before notebooks, and talk about all the ways that data scientists like to work that weren't particularly well-suited to the command line + text editor setup, and how notebooks have evolved over their lifetime to become even more powerful and well-suited to the data scientist's workflow.

Relevant links:

Curing Cancer with Machine Learning is Super Hard

Today, a dispatch on what can go wrong when machine learning hype outpaces reality: a high-profile partnership between IBM Watson and MD Anderson Cancer Center has recently hit the rocks, as it turns out to be tougher than expected to cure cancer with artificial intelligence.  There are enough conflicting accounts in the media to make it tough to say exactly what went wrong, but it's a good chance to remind ourselves that even in a post-AI world, hard problems remain hard.

Relevant links:

Kullback-Leibler Divergence

Kullback-Leibler divergence, or KL divergence, is a measure of the information lost when you try to approximate one distribution with another.  It comes to us originally from information theory, but today it underpins other, more machine-learning-focused algorithms like t-SNE.  And boy oh boy can it be tough to explain.  But we're trying our hardest in this episode!
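
For the formula-inclined, here's a minimal worked example in Python: for discrete distributions, D_KL(P||Q) = Σ p(x) log(p(x)/q(x)), the expected extra "surprise" from using Q where P is the truth.  The distributions below are made up.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # the "true" distribution
q = np.array([0.4, 0.4, 0.2])  # the approximating distribution

kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

print(f"D_KL(P||Q) = {kl_pq:.4f} nats")  # information lost approximating P by Q
print(f"D_KL(Q||P) = {kl_qp:.4f} nats")  # not the same: KL is not symmetric
```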

Relevant links:

Sabermetrics

It's moneyball time!  SABR (the Society for American Baseball Research) is the world's largest organization of statistics-minded baseball enthusiasts, who are constantly applying the craft of scientific analysis to figuring out which baseball teams and players are the best.  It can be hard to objectively measure sports greatness, but baseball has a data-rich history and plenty of nerdy fans interested in analyzing that data.  In this episode we'll dissect a few of the metrics from standard baseball and compare them to related metrics from sabermetrics, so you can nerd out more effectively at your next baseball game.
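
As a small taste, here's a minimal sketch in Python contrasting a traditional stat (batting average) with a sabermetric favorite, OPS (on-base plus slugging).  The stat line below is invented, not a real player's.

```python
# A made-up season: 150 hits in 500 at-bats, broken out by hit type.
hits, at_bats = 150, 500
walks, hbp, sac_flies = 70, 5, 5
singles, doubles, triples, home_runs = 100, 30, 5, 15  # sums to 150 hits

batting_avg = hits / at_bats  # ignores walks and extra-base power entirely

# On-base percentage credits walks and hit-by-pitches...
obp = (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)
# ...and slugging percentage credits extra-base power.
total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
slg = total_bases / at_bats
ops = obp + slg

print(f"AVG {batting_avg:.3f}  OBP {obp:.3f}  SLG {slg:.3f}  OPS {ops:.3f}")
```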

Relevant links:

What Data Scientists Can Learn from Software Engineers

We're back again with friend of the pod Walt, former software engineer extraordinaire and current data scientist extraordinaire, to talk about some best practices from software engineering that are ready to jump the fence over to data science.  If last week's episode was for software engineers who are interested in becoming more like data scientists, then this week's episode is for data scientists who are looking to improve their game with best practices from software engineering.

From Software Engineering to Data Science

Data scientists and software engineers often work side by side, building out and scaling technical products and services that are data-heavy but also require a lot of software engineering to build and maintain.  In this episode, we'll chat with a Friend of the Pod named Walt, who started out as a software engineer but works as a data scientist now.  We'll talk about that transition from software engineering to data science, and what special capabilities software engineers have that data scientists might benefit from knowing about (and vice versa).

Re-Release: Fighting Cholera with Data, 1854

This episode was first released in November 2014.

In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a common but deadly disease: cholera.

When a cholera outbreak in London killed scores of people, a doctor named John Snow used it as a chance to study whether the cause might be very small organisms that were spreading through the water supply (the prevailing theory at the time was miasma, or “bad air”). By tracing the geography of all the deaths from the outbreak, Snow was practicing elementary data science--and stumbled upon one of history’s most famous outliers.

In this episode, we’ll tell you more about this single data point, a case of cholera that cracked the case wide open for Snow and provided critical validation for the germ theory of disease.

Relevant links:

Re-release: The Enron Dataset

This episode was first released in February 2015.

In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other markets. By 2002, it was apparent that many bad apples had been cooking the books, and billions of dollars and thousands of jobs disappeared.

In the aftermath, surprisingly, one of the greatest datasets in all of machine learning was born--the Enron email corpus. Hundreds of thousands of emails amongst top executives were made public; there's no realistic chance any dataset like this will ever be made public again.

But the dataset that was released has gone on to immortality, serving as the basis for a huge variety of advances in machine learning and other fields.

Relevant links:

Anscombe's Quartet

Anscombe's Quartet is a set of four datasets that have the same mean, variance and correlation but look very different.  It's easy to think that having a good set of summary statistics (like mean, variance and correlation) can tell you everything important about a dataset, or at least enough to know if two datasets are extremely similar or extremely different, but Anscombe's Quartet will always be standing behind you, laughing at how silly that idea is.

Anscombe's Quartet was devised in 1973 as an example of how summary statistics can be misleading, but today we can do one better: the Datasaurus Dozen is a set of twelve datasets, all extremely visually distinct, that have the same summary stats as a source dataset that, there's no other way to put this, looks like a dinosaur.  It's an example of how datasets can be generated to look like almost anything while still preserving arbitrary summary statistics.  In other words, Anscombe's Quartets can be generated at will, and we all should be reminded to visualize our data (not just compute summary statistics) if we want to claim to really understand it.
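
If you want to see it for yourself, here's a minimal check in Python using two of the four published quartet datasets: nearly identical summary statistics, completely different shapes once you plot them.

```python
import numpy as np

# Sets I and II of Anscombe's Quartet (they share the same x values).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for name, y in [("I", y1), ("II", y2)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"set {name}: mean={y.mean():.2f} var={y.var(ddof=1):.2f} corr={r:.3f}")

# Both sets print mean ~7.50, variance ~4.13, correlation ~0.816 -- but set I
# is a noisy line and set II is a clean parabola.  Plot them and see!
```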

Relevant links: