Survival analysis is all about studying how long until an event occurs--it's used in marketing to study how long a customer stays with a service, in epidemiology to estimate the duration of survival of a patient with some illness, and in social science to understand how the characteristics of a war inform how long the war goes on. This episode talks about the special challenges associated with survival analysis, and the tools that (data) scientists use to answer all kinds of duration-related questions.
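One challenge that comes up in almost any survival analysis is censoring: some subjects have not had the event yet when the data is collected, so their true duration is unknown. Whether the episode covers this particular estimator is an assumption on our part, but the Kaplan-Meier estimator is a standard way to handle it. Here is a minimal sketch, with made-up customer data and a hypothetical function name:

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations[i] is how long subject i was followed; observed[i] is True if
    the event happened at that time, False if the subject was censored.
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Group everyone who leaves the study at the same time t
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                events += 1
            leaving += 1
            i += 1
        if events:
            # Multiply by the fraction of at-risk subjects who survive past t
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without lowering the curve
        at_risk -= leaving
    return curve


# Hypothetical data: months until a customer cancels; False = still subscribed
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
for time, prob in curve:
    print(time, round(prob, 3))
```

Censored customers still count toward the at-risk total up to the point they drop out of view, which is what separates this from simply averaging the observed durations.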
Gravitational Waves
All aboard the gravitational waves bandwagon--with the first direct observation of gravitational waves announced this week, Katie's dusting off her physics PhD for a very special gravity-related episode. Discussed in this episode: what are gravitational waves, how are they detected, and what does this announcement mean for future studies of the universe.
Relevant links:
http://www.nytimes.com/2016/02/12/science/ligo-gravitational-waves-black-holes-einstein.html
https://www.ligo.caltech.edu/news/ligo20160211
The Turing Test
Let's imagine a future in which a truly intelligent computer program exists. How would it convince us (humanity) that it was intelligent? Alan Turing's answer to this question, proposed over 60 years ago, is that the program could convince a human conversational partner that it, the computer, was in fact a human. 60 years later, the Turing Test endures as a gold standard of artificial intelligence. It hasn't been beaten, either--yet.
Relevant links:
https://en.wikipedia.org/wiki/Turing_test
http://commonsensereasoning.org/winograd.html
http://consumerist.com/2015/09/29/its-not-just-you-robots-are-also-bad-at-assembling-ikea-furniture/
Item Response Theory: How Smart ARE You?
Psychometrics is all about measuring the psychological characteristics of people; for example, scholastic aptitude. How is this done? Tests, of course! But there's a chicken-and-egg problem here: you need to know both how hard a test is, and how smart the test-taker is, in order to get the results you want. How to solve this problem, one equation with two unknowns? Item response theory--the data science behind such tests as the GRE.
Relevant links:
https://en.wikipedia.org/wiki/Item_response_theory
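To make the "one equation with two unknowns" idea concrete, here is a rough sketch of the simplest item response model, the one-parameter (Rasch) model, in which the chance of a correct answer depends only on the gap between a test-taker's ability and an item's difficulty. The joint gradient-ascent fit, the function names, and the response matrix are all illustrative assumptions, not how any real testing organization does it:

```python
import math


def p_correct(ability, difficulty):
    """Rasch model: chance that a test-taker answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fit_rasch(responses, steps=500, lr=0.1):
    """Jointly estimate abilities and difficulties by gradient ascent on the
    log-likelihood. responses[i][j] is 1 if person i got item j right."""
    n_people, n_items = len(responses), len(responses[0])
    ability = [0.0] * n_people
    difficulty = [0.0] * n_items
    for _ in range(steps):
        # Residual = observed answer minus the model's predicted probability
        resid = [[responses[i][j] - p_correct(ability[i], difficulty[j])
                  for j in range(n_items)] for i in range(n_people)]
        for i in range(n_people):
            ability[i] += lr * sum(resid[i])
        for j in range(n_items):
            difficulty[j] -= lr * sum(resid[i][j] for i in range(n_people))
    return ability, difficulty


# Hypothetical answers: 4 test-takers (rows) by 3 questions (columns)
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
]
ability, difficulty = fit_rasch(responses)
print([round(a, 2) for a in ability])
print([round(d, 2) for d in difficulty])
```

Both sets of unknowns are estimated from the same response matrix at once: people who answer more questions correctly end up with higher ability, and questions that fewer people get right end up with higher difficulty.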
Go!
As you may have heard, a computer beat a world-class human player in Go last week. As recently as a year ago the prediction was that it would take a decade to get to this point, yet here we are, in 2016. We'll talk about the history and strategy of game-playing computer programs, and what makes Google's AlphaGo so special.
Relevant link:
http://googleresearch.blogspot.com/2016/01/alphago-mastering-ancient-game-of-go.html
Great Social Networks in History
The Medici were one of the great ruling families of Europe during the Renaissance. How did they come to rule? Not through power, or money, or armies, but through the strength of their social network. And speaking of great historical social networks, analysis of the network of letter-writing during the Enlightenment is helping humanities scholars track the dispersion of great ideas across the world during that time, from Voltaire to Benjamin Franklin and everyone in between.
Relevant links:
https://www2.bc.edu/~jonescq/mb851/Mar12/PadgettAnsell_AJS_1993.pdf
http://republicofletters.stanford.edu/index.html
How Much to Pay a Spy
A few small encores on auction theory, and then--how can you value a piece of information before you know what it is? Decision theory has some pointers. Some highly relevant information if you are trying to figure out how much to pay a spy.
Relevant links:
https://tuecontheoryofnetworks.wordpress.com/2013/02/25/the-origin-of-the-dutch-auction/
http://www.nowozin.net/sebastian/blog/the-fair-price-to-pay-a-spy-an-introduction-to-the-value-of-information.html
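One standard decision-theory way to price information you haven't seen yet is the expected value of perfect information: compare your best expected payoff if you must act now against your expected payoff if you could learn the truth first. The sketch below assumes that framing; the scenario, probabilities, and payoffs are invented for illustration:

```python
def expected_value_of_perfect_information(prior, payoffs):
    """prior[state] is a probability; payoffs[action][state] is a payoff."""
    # Best you can do acting now: pick the single action with the highest
    # expected payoff under the prior
    best_without_info = max(
        sum(prior[s] * payoffs[a][s] for s in prior) for a in payoffs
    )
    # Best you can do if told the state first: pick the best action per
    # state, then average over how likely each state is
    best_with_info = sum(
        prior[s] * max(payoffs[a][s] for a in payoffs) for s in prior
    )
    return best_with_info - best_without_info


# Hypothetical scenario: will the rival attack, and should we fortify?
prior = {"attack": 0.25, "no_attack": 0.75}
payoffs = {
    "fortify": {"attack": 0, "no_attack": -20},
    "do_nothing": {"attack": -100, "no_attack": 0},
}
spy_value = expected_value_of_perfect_information(prior, payoffs)
print(spy_value)
```

Under these assumptions, the number that comes out is an upper bound on what a perfectly reliable spy is worth; a spy who is sometimes wrong is worth less.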
Sold! Auctions Part 2
The Google ads auction is a special kind of auction, one you might not know as well as the famous English auction (which we talked about in the last episode). But if it's what Google uses to sell billions of dollars of ad space in real time, you know it must be pretty cool.
Relevant links:
https://en.wikipedia.org/wiki/English_auction
http://people.ischool.berkeley.edu/~hal/Papers/2006/position.pdf
http://www.benedelman.org/publications/gsp-060801.pdf
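The papers linked above analyze the generalized second-price auction, where ad slots go to bidders in order of bid and each winner pays the next-highest bid. Here is a stripped-down sketch of that rule; it assumes bids alone determine ranking (real ad systems also weigh other factors such as ad quality), and the bidder names and amounts are made up:

```python
def generalized_second_price(bids, num_slots):
    """Assign slots by descending bid; each winner pays the next-highest bid.

    bids maps bidder name to bid. Returns [(slot, bidder, price)].
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    results = []
    for slot in range(min(num_slots, len(ranked))):
        bidder = ranked[slot][0]
        # Price is the bid just below this one, or 0 if nobody bid lower
        price = ranked[slot + 1][1] if slot + 1 < len(ranked) else 0.0
        results.append((slot + 1, bidder, price))
    return results


# Hypothetical bids, in dollars per click
bids = {"alice": 4.0, "bob": 2.5, "carol": 1.0}
allocation = generalized_second_price(bids, num_slots=2)
for slot, bidder, price in allocation:
    print(slot, bidder, price)
```

With a single slot this reduces to the familiar second-price auction: the top bidder wins and pays the runner-up's bid.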
Going Once, Going Twice: Auctions Part 1
The Google AdWords algorithm is (famously) an auction system for allocating a massive amount of online ad space in real time--with that fascinating use case in mind, this episode is part one in a two-part series all about auctions. We dive into the theory of auctions, and what makes a "good" auction.
Relevant links:
https://en.wikipedia.org/wiki/English_auction
http://people.ischool.berkeley.edu/~hal/Papers/2006/position.pdf
http://www.benedelman.org/publications/gsp-060801.pdf
Chernoff Faces and Minard Maps
A data visualization extravaganza in this episode, as we discuss Chernoff faces (you: "faces? huh?" us: "oh just you wait") and the greatest data visualization of all time, or at least the Napoleonic era.
Relevant links:
http://lya.fciencias.unam.mx/rfuentes/faces-chernoff.pdf
https://en.wikipedia.org/wiki/Charles_Joseph_Minard
t-SNE: Reduce Your Dimensions, Keep Your Clusters
Ever tried to visualize a cluster of data points in 40 dimensions? Or even 4, for that matter? We prefer to stick to 2, or maybe 3 if we're feeling well-caffeinated. The t-SNE algorithm is one of the best tools on the market for doing dimensionality reduction when you have clustering in mind.
Relevant links:
https://www.youtube.com/watch?v=RJVL80Gg3lA
The [Expletive Deleted] Problem
The town of [expletive deleted], England, is responsible for the clbuttic [expletive deleted] problem. This week on Linear Digressions: we try really hard not to swear too much.
Related links:
https://en.wikipedia.org/wiki/Scunthorpe_problem
https://www.washingtonpost.com/news/worldviews/wp/2016/01/05/where-is-russia-actually-mordor-in-the-world-of-google-translate/
Unlabeled Supervised Learning--whaaa?
In order to do supervised learning, you need a labeled training dataset. Or do you...?
Relevant links:
http://www.cs.columbia.edu/~dplewis/candidacy/goldman00enhancing.pdf
Hacking Neural Nets
Machine learning: it can be fooled, just like you or me. Here's one of our favorite examples, a study into hacking neural networks.
Relevant links:
http://arxiv.org/pdf/1412.1897v4.pdf
Zipf's Law
Zipf's law describes how word usage is distributed: a word's frequency is roughly inversely proportional to its rank in the frequency table. As it turns out, this is also strikingly reminiscent of how income is distributed, and populations of cities, and bug reports in software, as well as tons of other phenomena that we all interact with every day.
Relevant links:
http://economix.blogs.nytimes.com/2010/04/20/a-tale-of-many-cities/
http://arxiv.org/pdf/cond-mat/0412004.pdf
https://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/
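If you want to check Zipf's law on your own text, one rough approach is to rank words by count and compare each count against the top count divided by the rank. The sketch below uses the idealized form of the law (exponent exactly 1) and a contrived toy sentence, so treat it as a starting point rather than evidence:

```python
from collections import Counter


def rank_frequency(text):
    """Return [(word, count)] sorted from most to least frequent."""
    words = text.lower().split()
    return Counter(words).most_common()


def zipf_prediction(top_count, rank):
    """Idealized Zipf's law: count falls off as 1 / rank."""
    return top_count / rank


# Contrived toy text, chosen so the counts are easy to follow
sample = "the the the the of of and"
table = rank_frequency(sample)
for rank, (word, count) in enumerate(table, start=1):
    predicted = zipf_prediction(table[0][1], rank)
    print(rank, word, count, round(predicted, 2))
```

On a real corpus you would swap `sample` for a much longer text and plot count against rank on log-log axes, where Zipf-like data falls close to a straight line.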
Indie Announcement
We've gone indie! Which shouldn't change anything about the podcast that you know and love, but we're super excited to keep bringing you Linear Digressions as a fully independent podcast.
Some links mentioned in the show:
https://twitter.com/lindigressions
https://twitter.com/benjaffe
https://twitter.com/multiarmbandit
https://soundcloud.com/linear-digressions
http://lineardigressions.com/
The Cocktail Party Problem
Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!
Links: Deep learning machine solves the cocktail party problem
Portrait Beauty
It's Da Vinci meets Skynet: what makes a portrait beautiful, according to a machine learning algorithm. Snap a selfie and give us a listen.
Link: The beauty of capturing faces: rating the quality of digital portraits
A Criminally Short Introduction to Semi-Supervised Learning
Because there are more interesting problems than there are labeled datasets, semi-supervised learning provides a framework for getting feedback from the environment as a proxy for labels of what's "correct." Of all the machine learning methodologies, it might also be the closest to how humans usually learn--we go through the world, getting (noisy) feedback on the choices we make and learning from the outcomes of our actions.
Thresholdout: Down with Overfitting
Evaluating your machine learning algorithm on a holdout test dataset guards against overfitting to your training data, but what about overfitting to the test data? Turns out it can happen, easily, and you have to be very careful to avoid it. But an algorithm from the field of privacy research shows promise for keeping your test data safe from accidental overfitting.
Link: The Reusable Holdout: preserving validity in adaptive data analysis
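The core move in Thresholdout can be sketched in a few lines: only consult the holdout set when the training estimate disagrees with it by more than a noisy threshold, and add noise to whatever holdout value you reveal. This is a simplified sketch, not the paper's exact algorithm: the noise scales are simplified, the budget accounting is left out, and the default parameter values are placeholders:

```python
import random


def laplace(scale, rng):
    """Laplace noise, built as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def thresholdout(train_value, holdout_value, threshold=0.04, sigma=0.01,
                 rng=random):
    """Answer one query (e.g. a model's accuracy) without leaking the holdout.

    train_value and holdout_value are the same statistic computed on the
    training set and on the holdout set.
    """
    if abs(train_value - holdout_value) < threshold + laplace(sigma, rng):
        # The two sets agree, so the training value reveals nothing new
        return train_value
    # They disagree (a sign of overfitting): report a noisy holdout value
    return holdout_value + laplace(sigma, rng)


# Hypothetical accuracies for two candidate models
rng = random.Random(0)
print(thresholdout(0.80, 0.81, rng=rng))  # sets roughly agree
print(thresholdout(0.95, 0.70, rng=rng))  # training looks too good
```

Because the analyst mostly sees training values and only ever sees noisy holdout values, it is much harder to accidentally tune a model to the quirks of the holdout set.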