Facial Recognition with Eigenfaces

A true classic topic in ML: facial recognition data is very high-dimensional, since each picture can contain millions of pixels, each of which can serve as a single feature. Dealing with all those features is computationally expensive and invites overfitting problems. PCA (principal components analysis) is a classic dimensionality reduction tool that compresses these many dimensions into the few that contain the most variation in the data, and those principal components are often then fed into a classic ML algorithm like an SVM.

One of the best things about eigenfaces is the great example code that you can find in sklearn--you can be distinguishing pictures of world leaders yourself in just a few minutes!
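
If you'd like a feel for the pipeline before clicking through, here's a minimal sketch in the spirit of that example; the dataset parameters and SVM settings below are illustrative choices, not the only ones that work.

    # Eigenfaces sketch: PCA compresses raw pixels, then an SVM classifies.
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Labeled Faces in the Wild, restricted to people with >= 70 images.
    lfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
    X_train, X_test, y_train, y_test = train_test_split(
        lfw.data, lfw.target, random_state=42)

    # Compress ~1,850 pixel features into 150 principal components
    # ("eigenfaces"); whitening rescales each component to unit variance.
    pca = PCA(n_components=150, whiten=True).fit(X_train)

    # Train an RBF-kernel SVM on the compressed representation.
    clf = SVC(kernel="rbf", C=1000.0, gamma=0.001)
    clf.fit(pca.transform(X_train), y_train)
    print("test accuracy:", clf.score(pca.transform(X_test), y_test))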

Link: Scikit-learn example on eigenfaces 

Statistics of World Series Streaks

Baseball is characterized by a high degree of parity between teams; even the best teams might only win 55% of their games (contrast this with college football, where teams go undefeated pretty regularly). In this regime, where the two outcomes of a game (Giants win/Giants lose) are approximately equally likely, we can model the win/loss record with a binomial distribution.

Using the binomial distribution, we can calculate an interesting little result: what's the chance of the World Series ending in only 4 games? 5? 6? Going all the way to 7? Then we can compare to decades' worth of World Series data, to see how well the data follows the binomial assumption.
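
Here's a minimal sketch of that calculation, assuming evenly matched teams and independent games: for the series to end at game k, the eventual winner must take game k plus exactly 3 of the first k - 1 games.

    # Chance that a best-of-7 series ends in exactly k games, when one
    # team wins each game independently with probability p.
    from math import comb

    def p_series_length(k, p=0.5):
        # The winner takes game k and exactly 3 of the first k - 1 games;
        # the two terms cover either team being the winner.
        return comb(k - 1, 3) * (p**4 * (1 - p)**(k - 4)
                                 + (1 - p)**4 * p**(k - 4))

    for k in range(4, 8):
        print(k, p_series_length(k))
    # With p = 0.5: 4 -> 0.125, 5 -> 0.25, 6 -> 0.3125, 7 -> 0.3125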

The result tells us a lot about sports psychology--if each game is an independent coin flip between evenly matched teams, series of 4, 5, 6, and 7 games should occur about 12.5%, 25%, 31.25%, and 31.25% of the time, respectively. The data shows a different trend: 4 and 7 game series happen significantly more often than the binomial model predicts, at the expense of 5 and 6 game series. There's a powerful psychological effect at play--everybody loves the 7th game of the World Series, or a good sweep. And it turns out that the baseball teams, whether they intend it or not, oblige our love of short (4) and long (7) series!

Link: Phil Birnbaum: Winning the world series in X games 

Computers Try to Tell Jokes

Computers are capable of many impressive feats, but making you laugh is usually not one of them. Or could it be? This episode will talk about a custom-built machine learning algorithm that searches through text and writes jokes based on what it finds. 

The jokes are formulaic: they're all of the form "I like my X like I like my Y: Z", where X and Y are nouns and Z is an adjective that can describe both X and Y. For (dumb) example, "I like my men like I like my coffee: steaming hot." The joke is funny when "Z X" and "Z Y" are both very common phrases, but X and Y are rarely seen together.
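
To make that scoring intuition concrete, here's a toy sketch; the co-occurrence counts and the scoring function below are made up for illustration, and the paper's actual model is more sophisticated.

    # Toy triplet scorer: prefer (X, Y, Z) where adjective Z co-occurs
    # often with both nouns, but the nouns rarely co-occur with each other.
    from itertools import combinations

    adj_noun = {  # hypothetical (adjective, noun) co-occurrence counts
        ("hot", "coffee"): 500, ("hot", "men"): 200, ("hot", "lava"): 400,
        ("strong", "coffee"): 450, ("strong", "men"): 300,
    }
    noun_noun = {  # hypothetical noun-pair co-occurrence counts
        frozenset(["coffee", "men"]): 3, frozenset(["coffee", "lava"]): 40,
    }

    def score(x, y, z):
        # High when Z goes with both nouns, low when the nouns go together.
        return (adj_noun.get((z, x), 0) * adj_noun.get((z, y), 0)
                / (1 + noun_noun.get(frozenset([x, y]), 0)))

    nouns, adjectives = ["coffee", "men", "lava"], ["hot", "strong"]
    x, y, z = max(((x, y, z) for x, y in combinations(nouns, 2)
                   for z in adjectives), key=lambda t: score(*t))
    print(f"I like my {x} like I like my {y}: {z}.")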

So, given a large enough corpus of text, the algorithm looks for triplets of words that fit this description and writes jokes based on them. Are the jokes funny? You be the judge...

Link: Petrovic and Matthews: Unsupervised joke generation from big data 

How Outliers Helped Defeat Cholera

In the 1850s, there were a lot of things we didn’t know yet: how to build an airplane, how to split an atom, and how to control the spread of a common but deadly disease--cholera.

When the 1854 cholera outbreak in London’s Soho district killed hundreds of people, a doctor named John Snow used it as a chance to study whether the cause might be very small organisms spreading through the water supply (the prevailing theory at the time was miasma, or “bad air”). By tracing the geography of all the deaths from the outbreak, Snow was practicing elementary data science--and stumbled upon one of history’s most famous outliers.

In this episode, we’ll tell you more about this single data point--a cholera death that cracked the case wide open for Snow and provided critical validation for the germ theory of disease.

Link: Wikipedia article on the Broad Street cholera outbreak

Hunting for the Higgs

Machine learning and particle physics go together like peanut butter and jelly--but this is a relatively new development.  

For many decades, physicists combed through their fairly large datasets using the laws of physics to guide their exploration; that tradition continues today, but as datasets grow ever larger, machine learning becomes a more tractable way to deal with the deluge.

With this in mind, ATLAS (one of the major experiments at CERN, the European Organization for Nuclear Research and home laboratory of the recently discovered Higgs boson) ran a machine learning contest over the summer, to see what advances could come from opening up the dataset to non-physicists.

The results were impressive--physicists are smart folks, but there are clearly lots of advances yet to be made as machine learning and physics learn from one another. And who knows--maybe more Nobel prizes to win as well!

Link: Kaggle Higgs Boson Challenge