Electoral Insights Part 2

Following up on our last episode about how experiments can be performed in political science, we now explore a high-profile case of an experiment gone wrong.

An extremely high-profile paper that was published in 2014, about how talking to people can convince them to change their minds on topics like abortion and gay marriage, has been exposed as the likely product of a fraudulently produced dataset. We’ll talk about a cool data science tool called the Kolmogorov-Smirnov test, which a pair of graduate students used to reverse-engineer the likely way that the fraudulent data was generated. 
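If you want to play with the idea yourself, here's a minimal sketch (made-up numbers, not the students' actual analysis) of how a two-sample Kolmogorov-Smirnov test can flag a "new" dataset that is suspiciously indistinguishable from an existing one:

```python
# Illustrative only: a fabricated "survey" built by resampling an existing
# public dataset is nearly indistinguishable from it, while a genuinely new
# survey of a different population usually is not.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Stand-in for an existing public dataset (e.g., feeling-thermometer scores).
public_data = rng.normal(loc=60, scale=15, size=3000)

# A genuinely new survey would usually have a somewhat different distribution.
genuine_survey = rng.normal(loc=55, scale=18, size=3000)

# A fabricated "survey" made by resampling the public data plus a little noise.
fabricated_survey = rng.choice(public_data, size=3000) + rng.normal(scale=0.5, size=3000)

# Large KS statistic / tiny p-value: the distributions look different.
print("genuine vs public:   ", ks_2samp(genuine_survey, public_data))
# Tiny KS statistic / large p-value: suspiciously similar to the public data.
print("fabricated vs public:", ks_2samp(fabricated_survey, public_data))
```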

But a bigger question still remains—what does this whole episode tell us about fraud and oversight in science?

Links: 

Irregularities in LaCour

The Case of the Amazing Gay Marriage Data: How a graduate student reluctantly uncovered a huge scientific fraud

Electoral Insights Part 1

The first of our two-parter discussing the recent electoral data fraud case. The results of the study in question were covered widely, including by This American Life (who later had to issue a retraction).

Data science for election research involves studying voters, who are people, and people are tricky to study—every one of them is different, and the same treatment can have different effects on different voters.  But with randomized controlled trials, small variations from person to person can even out when you look at a larger group.  With the advent of randomized experiments in elections a few decades ago, a whole new door was opened for studying the most effective ways to campaign.
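To see why randomization helps, here's a toy simulation (our own illustration, not from any particular study) where every voter responds differently to a hypothetical mailer, but the simple difference in group turnout still recovers the average effect:

```python
# Toy simulation: heterogeneous individual effects average out under
# random assignment, so the difference in group means estimates the
# average treatment effect.
import numpy as np

rng = np.random.default_rng(0)
n_voters = 100_000

# Each voter has their own baseline turnout probability and their own
# (hypothetical) response to receiving a campaign mailer.
baseline = rng.uniform(0.2, 0.8, size=n_voters)
individual_effect = rng.normal(loc=0.02, scale=0.05, size=n_voters)

# Randomly assign half the voters to receive the mailer.
treated = rng.random(n_voters) < 0.5
p_vote = np.clip(baseline + treated * individual_effect, 0, 1)
voted = rng.random(n_voters) < p_vote

# With enough voters, these two numbers come out very close.
estimated_effect = voted[treated].mean() - voted[~treated].mean()
print(f"true average effect: {individual_effect.mean():.3f}")
print(f"estimated effect:    {estimated_effect:.3f}")
```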

Link: The Victory Lab

Reporter Bot

There’s a big difference between a table of numbers or statistics, and the underlying story that a human might tell about how those numbers were generated. 

Think about a baseball game—the game stats and a newspaper story are describing the same thing, but one is a good input for a machine learning algorithm and the other is a good story to read over your morning coffee. Data science and machine learning are starting to bridge this gap, taking the raw data on things like baseball games, financial scenarios, etc. and automatically writing human-readable stories that are increasingly indistinguishable from what a human would write. 
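The simplest version of the idea looks something like the sketch below: a template filled in from a table of game stats. Real reporter-bots are far more sophisticated, but the input and output are the same kind of thing.

```python
# A deliberately simple sketch: turn a row of (made-up) game stats into a
# human-readable sentence with a template.
game = {
    "winner": "Cubs", "loser": "Mets",
    "winner_runs": 7, "loser_runs": 3,
    "star_player": "Rizzo", "star_hits": 3, "star_at_bats": 4,
}

story = (
    f"The {game['winner']} beat the {game['loser']} "
    f"{game['winner_runs']}-{game['loser_runs']}, led by {game['star_player']}, "
    f"who went {game['star_hits']}-for-{game['star_at_bats']} at the plate."
)
print(story)
```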

In this episode, we’ll talk about some examples of auto-generated content—you’ll be amazed at how sophisticated some of these reporter-bots can be. By the way, this summary was written by a human. (Or was it?)


Careers in Data Science

Let’s talk money. As a “hot” career right now, data science can pay pretty well. But for a particular person in a specific job or industry, how much should they expect to make?

Since Katie has been on the job market lately, this is something she’s been researching, and it turns out that data science itself (in particular, linear regression) has some answers.

In this episode, we go through a survey of hundreds of data scientists, who report on their job duties, industry, skills, education, location, etc., along with their salaries. Then we talk about how this data was fed into a linear regression, so that you (yes, you!) can use the patterns in the data to estimate what kind of salary any particular kind of data scientist might expect.
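As a rough illustration (with made-up rows, not the actual survey responses), here's how categorical survey answers can be one-hot encoded and fed into a linear regression to predict salary:

```python
# A minimal sketch: one-hot encode categorical survey answers and fit a
# linear regression for salary. The data below is invented for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "industry": ["tech", "finance", "tech", "healthcare", "finance"],
    "region": ["CA", "NY", "CA", "TX", "NY"],
    "knows_python": [1, 1, 0, 1, 0],
    "years_experience": [3, 7, 2, 5, 10],
    "salary": [120_000, 150_000, 90_000, 110_000, 140_000],
})

model = Pipeline([
    ("encode", ColumnTransformer(
        [("cats", OneHotEncoder(), ["industry", "region"])],
        remainder="passthrough")),
    ("regress", LinearRegression()),
])
model.fit(df.drop(columns="salary"), df["salary"])

# Predict a salary for a hypothetical respondent.
new_person = pd.DataFrame([{"industry": "tech", "region": "NY",
                            "knows_python": 1, "years_experience": 4}])
print(model.predict(new_person))
```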

Link: 2014 O'Reilly Data Science Salary Survey

Neural Nets Part 2

In the last episode, we zipped through neural nets and got a quick idea of how they work and why they can be so powerful. Here’s the real payoff of that work:

In this episode, we’ll talk about a brand-new pair of results, one from Stanford and one from Google, that use neural nets to perform automated picture captioning. One neural net does the object and relationship recognition of the image, a second neural net handles the natural language processing required to express that in an English sentence, and when you put them together you get an automated captioning tool. Two heads are better than one indeed...
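To make the two-network idea concrete, here's a schematic sketch (not the Stanford or Google architectures) of a small convolutional encoder feeding an LSTM decoder that scores caption words:

```python
# Schematic only: a tiny "vision" CNN turns an image into a feature vector,
# and a "language" LSTM turns that vector (plus the caption so far) into
# scores over a vocabulary of words.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # hypothetical vocabulary of caption words
EMBED_DIM = 256

class ImageEncoder(nn.Module):
    """A tiny stand-in CNN that maps an image to a feature vector."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, EMBED_DIM)

    def forward(self, images):                  # (batch, 3, H, W)
        features = self.conv(images).flatten(1)
        return self.fc(features)                # (batch, EMBED_DIM)

class CaptionDecoder(nn.Module):
    """An LSTM that turns the image feature into word scores."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, EMBED_DIM, batch_first=True)
        self.to_vocab = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, image_features, word_ids):
        # Feed the image feature as the first "word", then the caption so far.
        inputs = torch.cat([image_features.unsqueeze(1), self.embed(word_ids)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.to_vocab(hidden)             # scores over the vocabulary

# One forward pass with random data, just to show the shapes line up.
images = torch.randn(2, 3, 64, 64)
captions = torch.randint(0, VOCAB_SIZE, (2, 5))
scores = CaptionDecoder()(ImageEncoder()(images), captions)
print(scores.shape)  # (2, 6, VOCAB_SIZE)
```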


Neural Nets Part 1

There is no known learning algorithm that is more flexible and powerful than the human brain. That's quite inspirational, if you think about it--to level up machine learning, maybe we should be going back to biology and letting millions of years of evolution guide the structure of our algorithms.

This is the idea behind neural nets, which mock up the structure of the brain and are some of the most studied and powerful algorithms out there. In this episode, we’ll lay out the building blocks of the neural net (called neurons, naturally) and the networks that are built out of them. 
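For the curious, here's a minimal numpy sketch of those building blocks: a neuron is a weighted sum of its inputs pushed through a nonlinearity, and a network is just layers of neurons feeding into each other. (The weights below are random; a real net learns them from data.)

```python
import numpy as np

def sigmoid(x):
    # The nonlinearity that squashes a neuron's output into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def layer(inputs, weights, biases):
    # Each row of `weights` is one neuron's weights.
    return sigmoid(weights @ inputs + biases)

rng = np.random.default_rng(0)
x = rng.random(4)                      # 4 input features

# A tiny network: 4 inputs -> 3 hidden neurons -> 1 output neuron.
hidden = layer(x, rng.normal(size=(3, 4)), rng.normal(size=3))
output = layer(hidden, rng.normal(size=(1, 3)), rng.normal(size=1))
print(output)
```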

We’ll also explore the results that neural nets get when they're used to do object recognition in photographs.

Link: Lee, Grosse, Ranganath, Ng: Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations

Inferring Authorship Part 2

Now that we’re up to speed on the classic author ID problem (who wrote the unsigned Federalist Papers?), we move on to a couple more contemporary examples.

First, J.K. Rowling was famously outed using computational linguistics (and Twitter) when she wrote a book under the pseudonym Robert Galbraith. 

Second, we’ll talk about a mystery that still endures--who is Satoshi Nakamoto? Satoshi is the mysterious person (or people) behind an extremely lucrative cryptocurrency (aka internet money) called Bitcoin; no one knows who he, she or they are, but we have plenty of writing samples in the form of whitepapers and Bitcoin forum posts. We’ll discuss some attempts to link Satoshi Nakamoto with a cryptocurrency expert and computer scientist named Nick Szabo; the links are tantalizing, but not a smoking gun. “Who is Satoshi” remains an example of attempted author identification where the threads are tangled, the conclusions inconclusive and the stakes high.

Links:

Juola: Rowling and "Galbraith": an authorial analysis

LikeInAMirror Blog: Occam's Razor: who is most likely to be Satoshi Nakamoto?

Inferring Authorship Part 1

This episode is inspired by one of our projects for Intro to Machine Learning: given a writing sample, can you use machine learning to identify who wrote it? Turns out the answer is yes: a person’s writing style is as distinctive as their vocal inflection or their gait when they walk.

By tracing the vocabulary used in a given piece, and comparing the word choices to the word choices in writing samples where we know the author, it can be surprisingly clear who is the more likely author of a given piece of text. 

We’ll use a seminal paper from the 1960s as our example here, where the Naive Bayes algorithm was used to determine whether Alexander Hamilton or James Madison was the more likely author of a number of anonymous Federalist Papers.
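Here's a toy sketch of the general approach (not Mosteller and Wallace's actual analysis): count up word usage in texts of known authorship, train a Naive Bayes classifier, and score a disputed text. The example texts below are invented.

```python
# Toy authorship attribution: word counts -> Naive Bayes -> predicted author.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled writing samples.
known_texts = [
    "upon the whole it is evident that the people ...",
    "upon reflection the states must surely agree ...",
    "while the constitution provides for such cases ...",
    "while it may be objected that the legislature ...",
]
known_authors = ["Hamilton", "Hamilton", "Madison", "Madison"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(known_texts)
classifier = MultinomialNB().fit(X, known_authors)

# Score a disputed (also invented) passage.
disputed = ["while the people of the states ..."]
print(classifier.predict(vectorizer.transform(disputed)))
print(classifier.predict_proba(vectorizer.transform(disputed)))
```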

Link: Mosteller and Wallace, Inference in an Authorship Problem

Statistical Mistakes and the Challenger Disaster

After the Challenger exploded in 1986, killing all 7 astronauts aboard, an investigation into the cause was immediately launched. 

In the cold temperatures the night before the launch, the O-rings that seal the joints of the solid rocket boosters became inflexible, so they did not seal properly; hot gases escaped, which led to the explosion of the external fuel tank. NASA knew that there could be O-ring problems, but analyzed their data incorrectly and ended up massively underestimating the risk associated with the cold temperatures.

In this episode, we'll unpack the mistakes they made. We'll talk about how they excluded data points that they thought were irrelevant but which actually were critical to recognizing a fatal pattern.
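Here's an illustrative sketch with made-up numbers (not the actual flight records) of how throwing away the "boring" launches hides the pattern:

```python
# Illustrative only: if you discard the launches with zero O-ring damage,
# you lose the baseline you need to see how sharply risk rises in the cold.
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical launches: damage is rare when it's warm, common when it's cold.
temps = rng.uniform(50, 82, size=300)
p_damage = 1 / (1 + np.exp((temps - 62) / 3.0))
damaged = rng.random(300) < p_damage

cold, warm = temps < 65, temps >= 65

# With every launch included, the pattern is unmistakable...
print(f"damage rate, cold launches: {damaged[cold].mean():.2f}")
print(f"damage rate, warm launches: {damaged[warm].mean():.2f}")

# ...but keeping only the damaged launches, which span a wide range of
# temperatures, gives no sense of how much riskier the cold actually is,
# because the clean launches that supply the comparison are gone.
print(f"temperature range of damaged launches alone: "
      f"{temps[damaged].min():.0f}-{temps[damaged].max():.0f} F")
```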

Link: Robison, Boisjoly, Hoecker, Young, Representation and Misrepresentation: Tufte and Morton Thiokol Engineers on the Challenger



Introducing Hidden Markov Models (HMMs Part 1)

Wikipedia says, "A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states." What does that even mean?

In part one of a special two-parter on HMMs, Katie, Ben, and special guest Francesco explain the basics of HMMs, and some simple applications of them in the real world. This episode sets the stage for part two, where we explore the use of HMMs in modern genetics, and possibly Katie's "Um Detector."
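If the Wikipedia definition still feels abstract, here's a tiny numpy sketch of a made-up two-state HMM and the forward algorithm, which computes the probability of an observed sequence while summing over all the hidden states you can't see:

```python
# A made-up two-state HMM: hidden states emit observations, and the forward
# algorithm scores an observation sequence by marginalizing over the states.
import numpy as np

start      = np.array([0.6, 0.4])                 # P(first hidden state)
transition = np.array([[0.7, 0.3],                # P(next state | current state)
                       [0.4, 0.6]])
emission   = np.array([[0.9, 0.1],                # P(observation | hidden state)
                       [0.2, 0.8]])

def forward(observations):
    """Probability of the observation sequence under the model."""
    alpha = start * emission[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ transition) * emission[:, obs]
    return alpha.sum()

print(forward([0, 0, 1, 0]))
```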

Link: Eddy, What is a hidden Markov model?

Monte Carlo for Physicists

This is another physics-centered episode, about an ML-backed particle identification tool that we use to figure out what kind of particle caused a particular blob in the detector. But in this case, as in many cases, it looks hard at the outset to use ML because we don't have labeled training data. Monte Carlo to the rescue!

Monte Carlo (MC) is fake data that we generate for ourselves, usually following a certain set of rules (often a Markov chain; in physics, we generate MC according to the laws of physics as we understand them). Since we generated the event, we "know" what the correct label is.

Of course, it's a lot of work to validate your MC, but the payoff is that then you can use machine learning where you never could before.
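Here's a stripped-down sketch of the workflow (with invented "physics" rules, not a real detector simulation): simulate labeled events, train a classifier on the simulation, then apply it to unlabeled "real" data.

```python
# Simulate labeled events according to rules you trust (our "Monte Carlo"),
# train a classifier on them, then label real, unlabeled detector data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)

def simulate_events(n, particle):
    """Generate fake detector blobs; because we simulated them, we know the label."""
    if particle == "electron":
        energy, width = rng.normal(5, 1, n), rng.normal(0.2, 0.05, n)
    else:  # "pion"
        energy, width = rng.normal(3, 1.5, n), rng.normal(0.5, 0.1, n)
    return np.column_stack([energy, width])

mc_features = np.vstack([simulate_events(5000, "electron"),
                         simulate_events(5000, "pion")])
mc_labels = np.array(["electron"] * 5000 + ["pion"] * 5000)

classifier = RandomForestClassifier().fit(mc_features, mc_labels)

# "Real" data arrives without labels; the MC-trained classifier supplies them.
real_data = simulate_events(3, "pion")   # stand-in for actual detector output
print(classifier.predict(real_data))
```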

Random Kanye

Ever feel like you could randomly assemble words from a certain vocabulary and make semi-coherent Kanye West lyrics? Or technical documentation, imitations of local newscasters, your politically outspoken uncle, etc.? Wonder no more, there's a way to do this exact type of thing: it's called a Markov chain, and it's a surprisingly powerful way to generate made-up data that you can then use for fun and profit.

The idea behind a Markov chain is that you probabilistically generate a sequence of steps, numbers, words, etc., where each next step/number/word depends only on the previous one, which makes it fast and efficient to generate computationally. Usually Markov chains are put to serious academic uses, but this ain't one of them: here they're used to randomly generate rap lyrics based on Kanye West lyrics.
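If you want to roll your own, here's a bare-bones word-level Markov chain in Python (the general technique, not the specific generator linked below):

```python
# Learn which words follow which in a corpus, then walk the chain to
# generate new text. Swap in whatever corpus you want to imitate.
import random
from collections import defaultdict

corpus = """replace this with whatever text you want to imitate
lyrics technical documentation or your uncle's posts all work the same way"""

words = corpus.split()
transitions = defaultdict(list)
for current_word, next_word in zip(words, words[1:]):
    transitions[current_word].append(next_word)

def generate(seed, length=15):
    word, output = seed, [seed]
    for _ in range(length):
        followers = transitions.get(word)
        if not followers:            # dead end: no observed follower
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

print(generate("replace"))
```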

Link: The Genesis of Kanye

Lie Detectors

Often machine learning discussions center around algorithms, or features, or datasets--this one centers around interpretation, and ethics. 

Suppose you could use a technology like fMRI to see what regions of a person's brain are active when they answer questions. Now also suppose that you could run trials where you watch their brain activity while they lie about some minor issue (say, whether the card in their hand is a spade or a club)--could you use machine learning to analyze those images, and use the patterns in them for lie detection? Well, you certainly can try, and indeed researchers have done just that.

There are important problems, though--the brain images can be high-variance, meaning that for any given person, there might not be a lot of certainty about whether they're lying or not. It's also open to debate whether the training set (in this case, test subjects with playing cards in their hands) really generalizes well to the more important cases, like a person accused of a crime.

So while machine learning has yielded some impressive gains in lie detection, it is not a solution to these thornier scientific issues.

Link: Bizzi et al, Using Imaging to Identify Deceit: Scientific and Ethical Questions

The Enron Dataset

In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other markets.  By 2002, it was apparent that many bad apples had been cooking the books, and billions of dollars and thousands of jobs disappeared.

In the aftermath, surprisingly, one of the greatest datasets in all of machine learning was born--the Enron emails corpus.  Hundreds of thousands of emails amongst top executives were made public; there's no realistic chance any dataset like this will ever be made public again.  

But the dataset that was released has gone on to immortality, serving as the basis for a huge variety of advances in machine learning and other fields. 

Link: MIT Technology Review: The Immortal Life of the Enron E-mails


Labels and Where to Find Them

Supervised classification is built on the backs of labeled datasets, but a good set of labels can be hard to find.  Great data is everywhere, but the corresponding labels can sometimes be really tricky.  Take a few examples we've already covered, like lie detection with an fMRI machine (you have to take pictures of someone's brain while they try to lie--not a trivial task) or automated image captioning (so many images!  So many valid labels!).

In this episode, we'll dig into this topic in depth, talking about some of the standard ways to get a labeled dataset if your project requires labels and you don't already have them.

Link: Higgs Hunters 


Um Detector

So, um... what about machine learning for audio applications?  In the course of starting this podcast, we've edited out a lot of "um"s from our raw audio files.  It's now gotten to the point that, when we see the waveform in soundstudio, we can almost identify an "um" by eye.  Which makes it an interesting problem for machine learning--is there a way we can train an algorithm to recognize the "um" pattern, too?

This has become a little side project for Katie, which is very much still a work in progress.  We'll talk about what's been accomplished so far, some design choices Katie made in getting the project off the ground, and (of course) mistakes made and hopefully corrected.  We always say that the best way to learn something is by doing it, and this is our chance to try our own machine learning project instead of just telling you about what someone else did!
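For a flavor of how a problem like this might be framed (a sketch of one possible setup, not Katie's actual implementation), you could chop the audio into short windows, compute simple features for each window, and train a classifier on hand-labeled windows:

```python
# One possible framing: window the audio, compute crude per-window features,
# and train a classifier on windows labeled "um" / "not um". The audio and
# labels below are random stand-ins; in practice you'd load your recording
# and your hand labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
sample_rate, window = 16_000, 1_600            # 0.1-second windows

audio = rng.normal(size=sample_rate * 10)                 # stand-in waveform
labels = rng.integers(0, 2, size=len(audio) // window)    # 1 = "um" window

def window_features(samples):
    """Crude per-window features: loudness and how often the signal crosses zero."""
    energy = np.mean(samples ** 2)
    zero_crossings = np.mean(np.abs(np.diff(np.sign(samples)))) / 2
    return [energy, zero_crossings]

windows = audio[: len(labels) * window].reshape(len(labels), window)
features = np.array([window_features(w) for w in windows])

classifier = RandomForestClassifier().fit(features, labels)
print(classifier.predict(features[:5]))        # which windows look like "um"s
```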

Better Facial Recognition with Fisherfaces

Now that we know about eigenfaces (if you don't, listen to the previous episode), let's talk about how it breaks down. 

Variations that are trivial to humans when identifying faces can really mess up computer-driven facial ID--expressions, lighting, and angle are a few. Something that can easily happen is that an algorithm optimizes to identify one of those traits, rather than the underlying question of whether the person is the same (for example, if the training image is me smiling, the algorithm may reject an image of me frowning but accidentally approve an image of another woman smiling).

Fisherfaces uses a Fisher linear discriminant to find the directions in the data that best separate the different people--maximizing the separation between classes relative to the variation within each class, rather than maximizing the overall variation the way eigenfaces does (we'll unpack this statement)--and it is much more robust than our pal eigenfaces when there are shadows, cut-off images, expressions, etc.
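For a hands-on comparison (a rough sketch, not the protocol from the paper linked below), scikit-learn makes it easy to pit a PCA-only pipeline against a PCA-plus-LDA pipeline on a small face dataset:

```python
# Eigenfaces vs. Fisherfaces in spirit: PCA keeps directions of largest
# overall variation; LDA keeps directions that best separate the people.
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

faces = fetch_olivetti_faces()   # 400 images of 40 people (downloads on first use)
X_train, X_test, y_train, y_test = train_test_split(
    faces.data, faces.target, stratify=faces.target, random_state=0)

eigenfaces = make_pipeline(PCA(n_components=50),
                           KNeighborsClassifier(n_neighbors=1))
fisherfaces = make_pipeline(PCA(n_components=50),   # PCA first to avoid singular scatter matrices
                            LinearDiscriminantAnalysis(),
                            KNeighborsClassifier(n_neighbors=1))

for name, model in [("eigenfaces ", eigenfaces), ("fisherfaces", fisherfaces)]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```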

Link: Belhumeur, Hespanha, and Kriegman: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection