Re-release: Word2Vec

Bringing you another old classic this week, as we gear up for 2019! See you next week with new content.

Word2Vec is probably the go-to algorithm for vectorizing text data these days.  Which makes sense, because it is wicked cool.  Word2Vec has it all: neural networks, skip-grams and bag-of-words implementations, a multiclass classifier that gets swapped out for a binary classifier, made-up dummy words, and a model that isn't actually used to predict anything (usually).  And all that's before we get to the part about how Word2Vec allows you to do algebra with text.  Seriously, this stuff is cool.
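
If you want to try the algebra-with-text trick yourself, here's a minimal sketch using gensim's pretrained Google News vectors (gensim and that particular model are our choices for illustration, not anything prescribed by the episode):

```python
# A minimal sketch of "algebra with text" using pretrained Word2Vec
# vectors via gensim (our tooling choice, not the episode's).
import gensim.downloader as api

# Downloads ~1.6 GB on first run: the classic Google News vectors.
wv = api.load("word2vec-google-news-300")

# The famous analogy: king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Under the hood, skip-gram with negative sampling trains a binary
# classifier ("did this word appear near that one?") and then throws
# the classifier away, keeping only the learned vectors.
```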

Re-release: The Cold Start Problem

We’re taking a break for the holidays, chilling with the dog and an eggnog (Katie) and the cat and some spiced cider (Ben). Here’s an episode from a while back for you to enjoy. See you again in 2019!

You might sometimes find that it's hard to get started doing something, but once you're going, it gets easier. Turns out machine learning algorithms, and especially recommendation engines, feel the same way. The more they "know" about a user, like what movies they watch and how they rate them, the better they do at suggesting new movies, which is great until you realize that you have to start somewhere. The "cold start" problem will be our focus in this episode, both the heuristic solutions that help deal with it and a bit of realism about the importance of skepticism when someone claims a great solution to cold starts.
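
To make the heuristic side concrete, here's a hypothetical sketch of one of the most common fallbacks: recommend globally popular items until a user has enough history to personalize (the names, threshold, and `personalized_model` interface are all made up for illustration):

```python
# Hypothetical sketch of one common cold-start heuristic: fall back to
# globally popular items until a user has enough history to personalize.
from collections import Counter

MIN_RATINGS = 5  # arbitrary threshold for "enough history"

def recommend(user_id, ratings, personalized_model, k=10):
    """ratings: list of (user_id, item_id, score) tuples."""
    user_history = [r for r in ratings if r[0] == user_id]
    if len(user_history) >= MIN_RATINGS:
        return personalized_model.top_k(user_id, k)  # hypothetical API
    # Cold start: rank items by how often they've been rated highly.
    popular = Counter(item for (_, item, score) in ratings if score >= 4)
    return [item for item, _ in popular.most_common(k)]
```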

Convex (and non-convex) optimization

Convex optimization is one of the keys to data science, both because some problems straight-up call for optimization solutions and because popular algorithms, like a gradient descent solution to ordinary least squares, are built on optimization techniques. But there are all kinds of subtleties, starting with the difference between convex and non-convex functions, why gradient descent is really solving an optimization problem, and what that means for your average data scientist or statistician.
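
As a concrete example, here's a minimal numpy sketch (our own, for illustration) of gradient descent on the ordinary least squares loss, which is convex, so the local minimum it finds is the global one:

```python
# Gradient descent on the OLS loss L(w) = ||Xw - y||^2 / n.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the mean squared error
    w -= lr * grad

print(w)  # should land close to true_w
```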

The Normal Distribution and the Central Limit Theorem

When you think about it, it’s pretty amazing that we can draw conclusions about huge populations, even the whole world, based on datasets that are comparatively very small (a few thousand, or a few hundred, or even sometimes a few dozen). That’s the power of statistics, though. This episode is kind of a two-for-one but we’re excited about it—first we’ll talk about the Normal or Gaussian distribution, which is maybe the most famous probability distribution function out there, and then turn to the Central Limit Theorem, which is one of the foundational tenets of statistics and the real reason why the Normal distribution is so important.
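
If you want to see the Central Limit Theorem in action, here's a quick simulation (our own toy example): sample means drawn from a decidedly non-normal distribution still come out looking normal:

```python
# Means of samples from an exponential distribution (very skewed, very
# non-normal) end up approximately normally distributed.
import numpy as np

rng = np.random.default_rng(42)
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# The CLT predicts mean ~ 1 and standard deviation ~ 1/sqrt(50) ~ 0.141.
print(sample_means.mean(), sample_means.std())
```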

Software 2.0

Neural nets are a way you can model a system, sure, but if you take a step back, squint, and tilt your head, they can also be called… software? Not in the sense that they’re written in code, but in the sense that the neural net itself operates under the same set of general requirements as does software that a human would write. Namely, neural nets take inputs and create outputs from them according to a set of rules, but the thing about the inside of the neural net black box is that it’s written by a computer, whereas the software we’re more familiar with is written by a human. Neural net researcher and Tesla director of AI Andrej Karpathy has taken to calling neural nets “Software 2.0” as a result, and the implications from this connection are really cool. We’ll talk about it this week.

Building Data Science Teams

At many places, data scientists don’t work solo anymore—it’s a team sport. But data science teams aren’t simply teams of data scientists working together. Instead, they’re usually cross-functional teams with engineers, managers, data scientists, and sometimes others all working together to build tools and products around data science. This episode talks about some of those roles on a typical data science team, what the responsibilities are for each role, and what skills and traits are most important for each team member to have.

Optimized Optimized Web Crawling

Last week’s episode, about methods for optimized web crawling logic, left off on a bit of a cliffhanger: the data scientists had found a solution to the problem, but it wasn’t something that the engineers (who own the search codebase, remember) liked very much. It was black-boxy, hard to parallelize, and introduced a lot of complexity to their code. This episode takes a second crack, where we formulate the problem a little differently and end up with a different, arguably more elegant solution.

Optimized Web Crawling

Got a fun optimization problem for you this week! It’s a two-for-one: how do you optimize the web crawling logic of an operation like Google search so that the results are, on average, as up-to-date as possible, and how do you optimize your solution of choice so that it’s maintainable by software engineers in a huge distributed system? We’re following an excellent post from the Unofficial Google Data Science blog going through this problem.
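
For a taste of what this kind of problem looks like, here's a hedged sketch of one standard formulation (ours, not necessarily the blog post's): pages change as Poisson processes, a page crawled at rate c is fresh with probability c / (c + λ) under those memoryless assumptions, and we split a fixed crawl budget to maximize importance-weighted freshness:

```python
# Toy crawl-scheduling optimization: allocate a fixed crawl budget
# across pages with different change rates and importance weights.
import numpy as np
from scipy.optimize import minimize

lam = np.array([0.1, 1.0, 5.0])      # how fast each page changes
weight = np.array([1.0, 2.0, 1.0])   # how much we care about each page
budget = 3.0                          # total crawls per unit time

def neg_freshness(c):
    # Probability page i is fresh at a random moment: c[i] / (c[i] + lam[i]).
    return -np.sum(weight * c / (c + lam))

res = minimize(
    neg_freshness,
    x0=np.full(3, budget / 3),
    bounds=[(1e-9, budget)] * 3,
    constraints=[{"type": "eq", "fun": lambda c: c.sum() - budget}],
)
print(res.x)  # crawl-rate allocation across the three pages
```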

Better Know a Distribution: The Poisson Distribution

The Poisson distribution is a probability distribution function used for events that happen in time or space. It’s super handy because it’s pretty simple to use and is applicable for tons of things—there are a lot of interesting processes that boil down to “events that happen in time or space.” This episode is a quick introduction to the distribution, and then a focus on two of our favorite applications: using the Poisson distribution to identify supernovas and to study army deaths from horse kicks.
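
Here's what the horse-kick application looks like in practice, fitting a Poisson distribution to von Bortkiewicz's classic counts of Prussian cavalry deaths (scipy is our tooling choice here):

```python
# Fit a Poisson distribution to the classic horse-kick data: counts of
# corps-years in which 0, 1, 2, 3, or 4 soldiers died from horse kicks.
import numpy as np
from scipy.stats import poisson

deaths = np.array([0, 1, 2, 3, 4])
observed = np.array([109, 65, 22, 3, 1])  # 200 corps-years total

# The maximum-likelihood estimate of the rate is just the sample mean.
lam = (deaths * observed).sum() / observed.sum()  # ~0.61 deaths per corps-year

expected = poisson.pmf(deaths, lam) * observed.sum()
print(np.round(expected, 1))  # ~[108.7, 66.3, 20.2, 4.1, 0.6]: a close fit
```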

Searching for Datasets with Google

If you wanted to find a dataset of jokes, how would you do it? What about a dataset of podcast episodes? If your answer was “I’d try Google,” you might have been disappointed—Google is a great search engine for many types of web data, but it didn’t have any special tools to navigate the particular challenges of, well, dataset data. But all that is different now: Google recently announced Google Dataset Search, an effort to unify metadata tagging around datasets and complementary efforts on the search side to recognize and organize datasets in a way that’s useful and intuitive. So whether you’re an academic looking for an economics or physics or biology dataset, or a big old nerd modeling jokes or analyzing podcasts, there’s an exciting new way for you to find data.
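
On the metadata-tagging side, the system leans on schema.org “Dataset” markup; here's a minimal, hypothetical example of that JSON-LD, built as a plain Python dict (all names and URLs below are placeholders):

```python
# Hypothetical schema.org "Dataset" metadata, the kind of markup that
# makes a dataset discoverable by Google Dataset Search.
import json

dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "A corpus of podcast episode descriptions",  # placeholder
    "description": "Episode titles and show notes for a data science podcast.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.com/episodes.csv",  # placeholder
    },
}

# Publishers embed this in a <script type="application/ld+json"> tag.
print(json.dumps(dataset_metadata, indent=2))
```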

Happy 4th birthday to us

We started Linear Digressions 4 years ago… this isn’t a technical episode, just two buddies shooting the breeze about something we’ve somehow built together.

Gigantic Searches in Particle Physics

This week, we’re dusting off the ol’ particle physics PhD to bring you an episode about ambitious new model-agnostic searches for new particles happening at CERN. Traditionally, new particles have been discovered by “targeted searches,” where scientists have a hypothesis about the particle they’re looking for and where it might be found. However, with the huge amounts of data coming out of CERN, a new type of broader search algorithm is starting to be deployed. It’s a strategy that casts a very wide net, looking in many different places at the same time, which also introduces all kinds of interesting questions—even a one-in-a-thousand occurrence happens when you’re looking in many thousands of places.
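
That last point, sometimes called the look-elsewhere effect, is easy to see in a quick simulation (our own toy example, not from the episode):

```python
# If you run thousands of independent searches under the null
# hypothesis, some one-in-a-thousand flukes are almost guaranteed.
import numpy as np

rng = np.random.default_rng(7)
n_experiments, n_places = 1_000, 5_000

# p-values under the null hypothesis are uniform on [0, 1].
p_values = rng.uniform(size=(n_experiments, n_places))
frac_with_fluke = (p_values.min(axis=1) < 1e-3).mean()

# Theory: 1 - (1 - 0.001)^5000 ≈ 0.993, so nearly every experiment
# "discovers" something purely by chance.
print(frac_with_fluke)
```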

Data Engineering

If you’re a data scientist, you know how important it is to keep your data orderly, clean, moving smoothly between different systems, and well-documented… there’s a ton of work that goes into building and maintaining databases and data pipelines. This job, that of owner and maintainer of the data being used for analytics, is often the realm of data engineers. From extract, transform, and load (ETL) procedures to the data storage strategy and even the definitions of key data quantities that serve as focal points for a whole organization, data engineers keep the plumbing of data analytics running smoothly.
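
Here's a toy, self-contained sketch of what one ETL step can look like (pandas and sqlite are our choices for illustration, and all table and column names are hypothetical):

```python
# Extract raw events, transform them into a tidy daily summary, and
# load the result into a warehouse table.
import sqlite3
import pandas as pd

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE raw_events (user_id TEXT, ts TEXT, amount REAL)")
source.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [("u1", "2018-09-01 10:00", 5.0),
     ("u1", "2018-09-01 15:30", 3.0),
     ("u2", "2018-09-02 09:00", 7.5)],
)

# Extract: pull raw events out of the source database.
raw = pd.read_sql("SELECT user_id, ts, amount FROM raw_events", source)

# Transform: fix types and compute a key quantity the whole org agrees
# on (here, daily revenue per user).
raw["ts"] = pd.to_datetime(raw["ts"])
daily = (raw.assign(day=raw["ts"].dt.date)
            .groupby(["user_id", "day"], as_index=False)["amount"].sum())

# Load: write the tidy table into the analytics warehouse.
warehouse = sqlite3.connect(":memory:")
daily.to_sql("daily_revenue", warehouse, if_exists="replace", index=False)
print(daily)
```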

Text Analysis for Guessing the NYTimes Op-Ed Author

A very intriguing op-ed was published in the NY Times recently, in which the author (a senior official in the Trump White House) claimed to be a minor saboteur of sorts, acting with his or her colleagues to undermine some of Donald Trump’s worst instincts and tendencies. Pretty stunning, right? So who is the author? It’s a mystery—the op-ed was published anonymously. That hasn’t stopped people from speculating though, and some machine learning on the vocabulary used in the op-ed is one way to get clues.
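
Here's a hedged sketch of what that kind of analysis can look like, comparing the op-ed's word usage against writing samples from candidate authors with TF-IDF vectors and cosine similarity (scikit-learn and this particular method are our choices for illustration; the texts are placeholders):

```python
# Simple stylometry: which candidate's known writing is most similar
# to the anonymous op-ed, as measured by TF-IDF cosine similarity?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

candidates = {
    "candidate_a": "text of candidate A's known writing ...",
    "candidate_b": "text of candidate B's known writing ...",
}
op_ed = "text of the anonymous op-ed ..."

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
docs = list(candidates.values()) + [op_ed]
vectors = vectorizer.fit_transform(docs)

# Similarity of each candidate's writing to the op-ed.
sims = cosine_similarity(vectors[:-1], vectors[-1])
for name, sim in zip(candidates, sims.ravel()):
    print(name, round(float(sim), 3))
```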

The Three Types of Data Scientists, and What They Actually Do

If you've been in data science for more than a year or two, chances are you've noticed changes in the field as it's grown and matured. And if you're newer to the field, you may feel like there's a disconnect between lots of different stories about what data scientists should know, or do, or expect from their job. This week, we cover two thought pieces, one that arose from interviews with 35(!) data scientists speaking about what their jobs actually are (and aren't), and one from the head of data science at Airbnb organizing core data science work into three main specialties.

Agile Development for Data Scientists, Part 2: Where Modifications Help

There's just too much interesting stuff at the intersection of agile software development and data science for us to be able to cover it all in one episode, so this week we're picking up where we left off last time. We'll give a quick overview of agile for those who missed last week or still have some questions, and then cover some of the aspects of agile that don't work well out-of-the-box when applied to data analytics. Fortunately, though, there are some straightforward modifications to agile that make it work really nicely for data analytics!

Agile Development for Data Scientists, Part 1: The Good

If you're a data scientist at a firm that does a lot of software building, chances are good that you've seen or heard engineers sometimes talking about "agile software development." If you don't work at a software firm, agile practices might be newer to you. In either case, we wanted to go through a great series of blog posts about some of the practices from agile that are relevant for how data scientists work, in hopes of inspiring some transfer learning from software development to data science. 

Re-release: How to Lose at Kaggle

We've got a classic for you this week as we take a week off for the dog days of summer. See you again next week!

Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists. Losing unexpectedly at the very end of the contest is also something that a lot of us have experienced. It's not just bad luck: on popular competitions, a very specific kind of overfitting to the public leaderboard can take someone who is in the top few spots in the final days of a contest and bump them down hundreds of slots in the final tally.
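
Here's a toy simulation of that mechanism (our own illustration): pick whichever of many noise-only submissions looks best on a small public test set, and watch it turn out completely ordinary on the private set:

```python
# Public-leaderboard overfitting: the "best" of many equally mediocre
# submissions is just the luckiest one, and its luck doesn't carry over.
import numpy as np

rng = np.random.default_rng(1)
n_submissions, n_public, n_private = 200, 100, 900

# Each submission's per-example "scores" are pure noise: no submission
# is genuinely better than any other.
public = rng.normal(size=(n_submissions, n_public)).mean(axis=1)
private = rng.normal(size=(n_submissions, n_private)).mean(axis=1)

best = public.argmax()
print(public[best])    # looks great: it was selected for being lucky
print(private[best])   # ~0 on average: no better than anyone else
print(np.corrcoef(public, private)[0, 1])  # ~0: public rank predicts nothing
```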

Troubling Trends in Machine Learning Scholarship

There are a lot of great machine learning papers coming out every day, and, if we're being honest, some papers that are not as great as we'd wish. In some ways this is symptomatic of a field that's growing really quickly, but it's also an artifact of strange incentive structures in academic machine learning, and the fact that sometimes machine learning is just really hard. At the same time, high-quality academic work is critical for maintaining the reputation of the field, so in this episode we walk through a recent paper that spells out some of the most common shortcomings of academic machine learning papers and what we can do to make things better.
