The Cathedral and the Bazaar

Imagine you have two choices of how to build something: top-down and controlled, with a few people playing a master designer role, or bottom-up and free-for-all, with nobody playing an explicit architect role. Which one do you think would make the better product? “The Cathedral and the Bazaar” is an essay exploring this question for open source software, and making an argument for the bottom-up approach. It’s not entirely intuitive that projects like Linux or scikit-learn, with many contributors and an open-door policy for modifying the code, would be able to resist the chaos of many cooks in the kitchen. So what makes it work in some cases? And sometimes not work in others? That’s the topic of discussion this week.

AlphaStar

It’s time for the latest installment in our series on artificial intelligence agents beating humans at games that we thought were safe from the robots. In this case, the game is StarCraft, and the AI agent is AlphaStar, from the same team that built the Go-playing AlphaGo AI. StarCraft presents some interesting challenges, though: the gameplay is continuous, there are many different kinds of actions a player must take, and of course there are the usual complexities of playing strategy games and contending with human opponents. AlphaStar overcame all of these challenges, and more, to notch another win for the computers.

Are machine learning engineers the new data scientists?

For many data scientists, maintaining models and workflows in production is both a huge part of their job and not something they necessarily trained for if their background is more in statistics or machine learning methodology. Productionizing and maintaining data science code has more in common with software engineering than traditional science, and to reflect that, there’s a new-ish role, and corresponding job title, that you should know about. It’s called machine learning engineer, and it’s what a lot of data scientists are becoming.

Interview with Alex Radovic, particle physicist turned machine learning researcher

You’d be hard-pressed to find a field with bigger, richer, and more scientifically valuable data than particle physics. Years before “data scientist” was even a term, particle physicists were inventing technologies like the World Wide Web and cloud computing grids to help them distribute and analyze the datasets required to make particle physics discoveries. Somewhat counterintuitively, though, deep learning has only really debuted in particle physics in the last few years, although it’s making up for lost time with many exciting new advances.

This episode of Linear Digressions is a little different from most, as we’ll be interviewing a guest, one of my (Katie’s) friends from particle physics, Alex Radovic. Alex and his colleagues have been at the forefront of machine learning in physics over the last few years, and his perspective on the strengths and shortcomings of those two fields together is a fascinating one.

K Nearest Neighbors

K Nearest Neighbors is an algorithm with secrets. On one hand, the algorithm itself is as straightforward as possible: find the labeled points nearest the point that you need to predict, and make a prediction that’s the average of their answers. On the other hand, what does “nearest” mean when you’re dealing with complex data? How do you decide whether a man and a woman of the same age are “nearer” to each other than two women several years apart? What if you convert all your monetary columns from dollars to cents, your distances from miles to nanometers, your weights from pounds to kilograms? Can your definition of “nearest” hold up under these types of transformations? We’re discussing all this, and more, in this week’s episode.
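
To make the units question concrete, here’s a minimal sketch. The people, ages, and incomes are made up for illustration, and plain Euclidean distance is an assumption (it’s just one common choice of “nearest”). The only thing that changes between the two lookups is whether income is written in dollars or in thousands of dollars.

```python
import math

def nearest(query, points, k=1):
    """Return the indices of the k points closest to query (Euclidean distance)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return order[:k]

# Hypothetical rows: (age in years, income in dollars)
people = [(30, 52_000), (55, 50_500), (31, 58_000)]
query = (32, 50_000)

# With income in dollars, the income column dominates the distance.
raw_neighbor = nearest(query, people)

# Same people, but income expressed in thousands of dollars.
rescaled = [(age, income / 1000) for age, income in people]
rescaled_query = (query[0], query[1] / 1000)
rescaled_neighbor = nearest(rescaled_query, rescaled)

print(raw_neighbor, rescaled_neighbor)
```

In this toy data the nearest neighbor flips from the 55-year-old to the 30-year-old just because of a unit change, which is why features are often put on a comparable scale before running a nearest-neighbor search.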

Not every deep learning paper is great. Is that a problem?

Deep learning is a field that’s growing quickly. That’s good! There are lots of new deep learning papers put out every day. That’s good too… right? What if not every paper out there is particularly good? What even makes a paper good in the first place? It’s an interesting thing to think about, and debate, since there’s no clean-cut answer and there are worthwhile arguments both ways. Wherever you find yourself coming down in the debate, though, you’ll appreciate the good papers that much more.

The assumptions of ordinary least squares

Ordinary least squares (OLS) is often used synonymously with linear regression. If you’re a data scientist, machine learner, or statistician, you bump into it daily. If you haven’t had the opportunity to build up your understanding from the foundations, though, listen up: there are a number of assumptions underlying OLS that you should know and love. They’re interesting, force you to think about data and statistics, and help you know when you’re out of “good” OLS territory and into places where you could run into trouble.

Quantile regression

Linear regression is a great tool if you want to make predictions about the mean value that an outcome will have given certain values for the inputs. But what if you want to predict the median? Or the 10th percentile? Or the 90th percentile? You need quantile regression, which is similar to ordinary least squares regression in some ways but with some really interesting twists that make it unique. This week, we’ll go over the concept of quantile regression, and also a bit about how it works and when you might use it.
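
One of those twists is the loss function. Quantile regression is usually described as minimizing the “pinball” loss, which penalizes over- and under-predictions asymmetrically, instead of squared error. The sketch below is a simplified illustration, not a full regression: the data are made up, and it only searches for the best constant prediction, to show that the minimizer lands on the requested quantile.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss for a quantile level q between 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        err = y - p
        # Under-predictions cost q per unit, over-predictions cost (1 - q).
        total += q * err if err >= 0 else (q - 1) * err
    return total / len(y_true)

# Hypothetical outcomes with one large outlier
y = [1, 2, 3, 4, 100]

def best_constant(q):
    """Pick the observed value that minimizes pinball loss as a constant prediction."""
    return min(y, key=lambda c: pinball_loss(y, [c] * len(y), q))

median_guess = best_constant(0.5)
upper_guess = best_constant(0.9)
print(median_guess, upper_guess)
```

With q = 0.5 the best constant is the median, which ignores the outlier in a way the mean would not. Raising q pushes the answer toward the upper tail.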

Heterogeneous Treatment Effects

When data scientists use a linear regression to look for causal relationships between a treatment and an outcome, what they’re usually finding is the so-called average treatment effect. In other words, on average, here’s what the treatment does in terms of making a certain outcome more or less likely to happen. But there’s more to life than averages: sometimes the relationship works one way in some cases, and another way in other cases, such that the average isn’t giving you the whole story. In that case, you want to start thinking about heterogeneous treatment effects, and this is the podcast episode for you.
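
Here’s a small, made-up illustration of how an average can hide the story. The segments and outcomes are hypothetical, and the estimate is a simple difference in means between treated and untreated rows (which assumes treatment was assigned at random), not a regression.

```python
from statistics import mean

# Hypothetical records: (segment, treated?, outcome)
records = [
    ("new", 1, 12), ("new", 1, 14), ("new", 0, 8), ("new", 0, 10),
    ("returning", 1, 6), ("returning", 1, 8),
    ("returning", 0, 10), ("returning", 0, 12),
]

def effect(rows):
    """Difference in mean outcome between treated and untreated rows."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)

# One number for everybody...
average_effect = effect(records)

# ...versus one number per segment.
by_segment = {
    seg: effect([r for r in records if r[0] == seg])
    for seg in ("new", "returning")
}

print(average_effect, by_segment)
```

In this toy data the average effect is zero, yet the treatment helps one segment and hurts the other by the same amount.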

Pre-training language models for natural language processing problems

When you build a model for natural language processing (NLP), such as a recurrent neural network, it helps a ton if you’re not starting from zero. In other words, if you can draw upon other datasets for building your understanding of word meanings, and then use your training dataset just for subject-specific refinements, you’ll get farther than just using your training dataset for everything. This idea of starting with some pre-trained resources has an analogue in computer vision, where initializations from ImageNet used for the first few layers of a CNN have become the new standard. There’s a similar progression under way in NLP, where simple(r) embeddings like word2vec are giving way to more advanced pre-training methods that aim to capture more sophisticated understanding of word meanings, contexts, language structure, and more.

Facial recognition, society, and the law

Not too long ago, facial recognition in everyday life seemed far off. Now it’s being deployed widely and advancing quickly, which means that our technical capabilities are starting to outpace (if they haven’t already) our consensus as a society about what is acceptable in facial recognition and what isn’t. The threats to privacy, fairness, and freedom are real, and Microsoft has become one of the first large companies using this technology to speak out in specific support of its regulation through legislation. Their arguments are interesting and provocative, and even if you don’t agree with every point they make, or you harbor some skepticism, there’s a lot to think about in what they’re saying.

Re-release: Word2Vec

Bringing you another old classic this week, as we gear up for 2019! See you next week with new content.

Word2Vec is probably the go-to algorithm for vectorizing text data these days. Which makes sense, because it is wicked cool. Word2Vec has it all: neural networks, skip-grams and bag-of-words implementations, a multiclass classifier that gets swapped out for a binary classifier, made-up dummy words, and a model that isn't actually used to predict anything (usually). And all that's before we get to the part about how Word2Vec allows you to do algebra with text. Seriously, this stuff is cool.

Re-release: The Cold Start Problem

We’re taking a break for the holidays, chilling with the dog and an eggnog (Katie) and the cat and some spiced cider (Ben). Here’s an episode from a while back for you to enjoy. See you again in 2019!

You might sometimes find that it's hard to get started doing something, but once you're going, it gets easier. Turns out machine learning algorithms, and especially recommendation engines, feel the same way. The more they "know" about a user, like what movies they watch and how they rate them, the better they do at suggesting new movies, which is great until you realize that you have to start somewhere. The "cold start" problem will be our focus in this episode, both the heuristic solutions that help deal with it and a bit of realism about the importance of skepticism when someone claims a great solution to cold starts.

Convex (and non-convex) optimization

Convex optimization is one of the keys to data science, both because some problems straight-up call for optimization solutions and because popular approaches, like a gradient descent solution to ordinary least squares, are built on optimization techniques. But there are all kinds of subtleties, starting with the difference between convex and non-convex functions, why fitting a model with gradient descent is really an optimization problem, and what that means for your average data scientist or statistician.
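
As a rough sketch of the gradient descent solution to ordinary least squares, the code below fits a straight line by repeatedly stepping downhill on the mean squared error. The data, learning rate, and step count are illustrative assumptions. Because the squared-error loss for a linear model is convex, there is a single global minimum for the descent to find.

```python
# Hypothetical data generated from y = 2x + 1 with no noise
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def fit_line(xs, ys, lr=0.05, steps=5000):
    """Minimize the mean squared error of y ~ slope * x + intercept by gradient descent."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error
        grad_slope = 2 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2 / n * sum(residuals)
        # Step against the gradient
        slope -= lr * grad_slope
        intercept -= lr * grad_intercept
    return slope, intercept

slope, intercept = fit_line(xs, ys)
print(round(slope, 4), round(intercept, 4))
```

On a non-convex loss the same loop could settle into a local minimum that depends on the starting point, which is one of the subtleties in question.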

The Normal Distribution and the Central Limit Theorem

When you think about it, it’s pretty amazing that we can draw conclusions about huge populations, even the whole world, based on datasets that are comparatively very small (a few thousand, or a few hundred, or even sometimes a few dozen). That’s the power of statistics, though. This episode is kind of a two-for-one but we’re excited about it—first we’ll talk about the Normal or Gaussian distribution, which is maybe the most famous probability distribution function out there, and then turn to the Central Limit Theorem, which is one of the foundational tenets of statistics and the real reason why the Normal distribution is so important.
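
A quick simulation gives a feel for the Central Limit Theorem. This is a sketch under assumed settings: the exponential distribution, a sample size of 50, and 5,000 repetitions are all arbitrary choices for illustration. Individual draws are heavily skewed, but the sample means cluster around the true mean with a spread close to the theoretical value of 1/sqrt(50).

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from a skewed (exponential) distribution with mean 1."""
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

n = 50
means = [sample_mean(n) for _ in range(5000)]

center = statistics.mean(means)   # theory: close to 1
spread = statistics.stdev(means)  # theory: close to 1 / sqrt(50)

# For a Normal distribution, about 68% of values fall within one
# standard deviation of the center.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(center, spread, within_one_sd)
```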

Software 2.0

Neural nets are a way you can model a system, sure, but if you take a step back, squint, and tilt your head, they can also be called… software? Not in the sense that they’re written in code, but in the sense that the neural net itself operates under the same set of general requirements as does software that a human would write. Namely, neural nets take inputs and create outputs from them according to a set of rules, but the thing about the inside of the neural net black box is that it’s written by a computer, whereas the software we’re more familiar with is written by a human. Neural net researcher and Tesla director of AI Andrej Karpathy has taken to calling neural nets “Software 2.0” as a result, and the implications from this connection are really cool. We’ll talk about it this week.

Building Data Science Teams

At many places, data scientists don’t work solo anymore—it’s a team sport. But data science teams aren’t simply teams of data scientists working together. Instead, they’re usually cross-functional teams with engineers, managers, data scientists, and sometimes others all working together to build tools and products around data science. This episode talks about some of those roles on a typical data science team, what the responsibilities are for each role, and what skills and traits are most important for each team member to have.

Optimized Optimized Web Crawling

Last week’s episode, about methods for optimized web crawling logic, left off on a bit of a cliffhanger: the data scientists had found a solution to the problem, but it wasn’t something that the engineers (who own the search codebase, remember) liked very much. It was black-boxy, hard to parallelize, and introduced a lot of complexity to their code. This episode takes a second crack, where we formulate the problem a little differently and end up with a different, arguably more elegant solution.

Optimized Web Crawling

Got a fun optimization problem for you this week! It’s a two-for-one: how do you optimize the web crawling logic of an operation like Google search so that the results are, on average, as up-to-date as possible, and how do you optimize your solution of choice so that it’s maintainable by software engineers in a huge distributed system? We’re following an excellent post from the Unofficial Google Data Science blog going through this problem.
