Demystifying AI for the intelligently curious

Scikit + Optimization = Scikit-Optimize

September 11, 2016

We're excited to welcome a guest, Tim Head, who is one of the maintainers of the scikit-optimize package. With all the talk about optimization lately, it felt appropriate to get in a few words with someone who's out there making it happen for python.

Relevant links:

Two Cultures: Machine Learning and Statistics

September 11, 2016

It's a funny thing to realize, but data science modeling is usually about either explainability, interpretation and understanding, or it's about predictive accuracy. But usually not both--optimizing for one tends to compromise the other. Leo Breiman was one of the titans of both kinds of modeling, a statistician who helped bring machine learning into statistics and vice versa. In this episode, we unpack one of his seminal papers from 2001, when machine learning was just beginning to take root, and talk about how he made clear what machine learning could do for statistics and why it's so important.

Relevant link:

Statistical Modeling: The Two Cultures

Optimization Solutions

August 30, 2016

You've got an optimization problem to solve, and a less-than-forever amount of time in which to solve it. What do? Use a heuristic optimization algorithm, like a hill climber or simulated annealing--we cover both in this episode!

Relevant links:

Heuristic optimization algorithms for fun and (academic) profit

Optimization Problems

August 30, 2016

If modeling is about predicting the unknown, optimization tries to answer the question of what to do, what decision to make, to get the best results out of a given situation. Sometimes that's straightforward, but sometimes... not so much. What makes an optimization problem easy or hard, and what are some of the methods for finding optimal solutions to problems? Glad you asked! May we recommend our latest podcast episode to you?

Multi-level modeling for DEADLY RADIOACTIVE GAS

August 30, 2016

Ok, this episode is only sort of about DEADLY RADIOACTIVE GAS. It's mostly about multilevel modeling, which is a way of building models with data that has distinct, related subgroups within it. What are multilevel models used for? Elections (we can't get enough of 'em these days), understanding the effect that a good teacher can have on their students, and DEADLY RADIOACTIVE GAS.

Relevant links:

Multilevel (hierarchical) modeling: what it can and cannot do

How the polls got Brexit "wrong"

August 21, 2016

Continuing the discussion of how polls do (and sometimes don't) tell us what to expect in upcoming elections--let's take a concrete example from the recent past, shall we? The Brexit referendum was, by and large, expected to shake out for "remain", but when the votes were counted, "leave" came out ahead. Everyone was shocked (SHOCKED!) but maybe the polls weren't as wrong as the pundits like to claim.

Relevant links:

Election Forecasting

August 20, 2016

Not sure if you heard, but there's an election going on right now. Polls, surveys, and projections about, as far as the eye can see. How to make sense of it all? How are the projections made? Which are some good ones to follow? We'll be your trusty guides through a crash course in election forecasting.

Relevant links:

Machine Learning for Genomics

August 20, 2016

Genomics data is some of the biggest #bigdata, and doing machine learning on it is unlocking new ways of thinking about evolution, genomic diseases like cancer, and what really makes each of us different for everyone else. This episode touches on some of the things that make machine learning on genomics data so challenging, and the algorithms designed to do it anyway.

Climate Modeling

August 20, 2016

Hot enough for you? Climate models suggest that it's only going to get warmer in the coming years. This episode unpacks those models, so you understand how they work.

A lot of the episodes we do are about fun studies we hear about, like "if you're interested, this is kinda cool"--this episode is much more important than that. Understanding these models, and taking action on them where appropriate, will have huge implications in the years to come.

Relevant links:

climatesight.org/

Reinforcement Learning Gone Wrong

August 20, 2016

Last week’s episode on artificial intelligence gets a huge payoff this week—we’ll explore a wonderful couple of papers about all the ways that artificial intelligence can go wrong. Malevolent actors? You bet. Collateral damage? Of course. Reward hacking? Naturally! It’s fun to think about, and the discussion starting now will have reverberations for decades to come.

Relevant links:

How to create a malevolent artificial intelligence
Unethical research: how to create a malevolent artificial intelligence
Concrete problems in AI safety

Reinforcement Learning for Artificial Intelligence

August 20, 2016

There’s a ton of excitement about reinforcement learning, a form of semi-supervised machine learning that underpins a lot of today’s cutting-edge artificial intelligence algorithms. Here’s a crash course in the algorithmic machinery behind AlphaGo, and self-driving cars, and major logistical optimization projects—and the robots that, tomorrow, will clean our houses and (hopefully) not take over the world…

Differential Privacy: how to study people without being weird and gross

August 20, 2016

Apple wants to study iPhone users' activities and use it to improve performance. Google collects data on what people are doing online to try to improve their Chrome browser. Do you like the idea of this data being collected? Maybe not, if it's being collected on you--but you probably also realize that there is some benefit to be had from the improved iPhones and web browsers. Differential privacy is a set of policies that walks the line between individual privacy and better data, including even some old-school tricks that scientists use to get people to answer embarrassing questions honestly.

Relevant link:

RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response

How the sausage gets made

July 30, 2016

Something a little different in this episode--we'll be talking about the technical plumbing that gets our podcast from our brains to your ears. As it turns out, it's a multi-step bucket brigade process of RSS feeds, links to downloads, and lots of hand-waving when it comes to trying to figure out how many of you (listeners) are out there.

SMOTE: makin' yourself some fake minority data

July 30, 2016

Machine learning on imbalanced classes: surprisingly tricky. Many (most?) algorithms tend to just assign the majority class label to all the data and call it a day. SMOTE is an algorithm for manufacturing new minority class examples for yourself, to help your algorithm better identify them in the wild.

Relevant links:

SMOTE: synthetic minority over-sampling technique

Conjoint analysis: like AB testing, but on steroids

July 30, 2016

Conjoint analysis is like AB tester, but more bigger more better: instead of testing one or two things, you can test potentially dozens of options. Where might you use something like this? Well, if you wanted to design an entire hotel chain completely from scratch, and to do it in a data-driven way. You'll never look at Courtyard by Marriott the same way again.

Relevant links:

Courtyard by Marriott

Traffic Metering Algorithms

July 30, 2016

This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don't get overloaded with cars and clog up. If you're someone who listens to podcasts while commuting, and especially if your area has on-ramp metering, you'll never look at highway access control the same way again (yeah, we know this is super nerdy; it's also super awesome).

Relevant links:

Um Detector 2: the dynamic time warp

July 30, 2016

One tricky thing about working with time series data, like the audio data in our "um" detector (remember that? because we barely do...), is that sometimes events look really similar but one is a little bit stretched and squeezed relative to the other. Besides having an amazing name, the dynamic time warp is a handy algorithm for aligning two time series sequences that are close in shape, but don't quite line up out of the box.

Relevant links:

Using dynamic time warping to find patterns in time series

Inside a data analysis: fraud hunting at Enron

July 30, 2016

It's storytime this week--the story, from beginning to end, of how Katie designed and built the main project for Udacity's Intro to Machine Learning class, when she was developing the course. The project was to use email and financial data to hunt for signatures of fraud at Enron, one of the biggest cases of corporate fraud in history; that description makes the project sound pretty clean but getting the data into the right shape, and even doing some dataset merging (that hadn't ever been done before), made this project much more interesting to design than it might appear. Here's the story of what a data analysis like this looks like...from the inside.

What's the biggest #bigdata?

July 30, 2016

Data science and is often mentioned in the same breath as big data. But how big is big data? And who has the biggest big data? CERN? Youtube?

... Something (or someone) else?

Relevant links:

Big data: astronomical or genomical?

Data Contamination

July 30, 2016

Supervised machine learning assumes that the features and labels used for building a classifier are isolated from each other--basically, that you can't cheat by peeking. Turns out this can be easier said than done. In this episode, we'll talk about the many (and diverse!) cases where label information contaminates features, ruining data science competitions along the way.

Relevant links:

Leakage in data mining: formulation, detection, and avoidance