Advice to those trying to get a first job in data science

We often hear from folks wondering what advice we can give them as they search for their first job in data science. What does a hiring manager look for? Should someone focus on taking classes online, doing a bootcamp, reading books, something else? How can they stand out in a crowd?

There’s no single answer, because so much depends on the person asking in the first place, but that doesn’t stop us from giving some perspective. So in this episode we’re sharing that advice out more widely, so hopefully more of you can benefit from it.

Relevant link:

Re-release: Machine Learning Technical Debt

This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows.  If you've worked in software before, you're probably familiar with the idea of technical debt, which are inefficiencies that crop up in the code when you're trying to go fast.  You take shortcuts, hard-code variable values, skimp on the documentation, and generally write not-that-great code in order to get something done quickly, and then end up paying for it later on.  This is technical debt, and it's particularly easy to accrue with machine learning workflows.  That's the premise of this episode's paper.

Relevant links:

Using statistics to understand why your software projects are always running late

If you’re like most software engineers and, especially, data scientists, you find it really hard to make accurate estimates of how long a project will take to complete. Don’t feel bad: statistics is most likely actively working against your best efforts to give your boss an accurate delivery date. This week, we’ll talk through a great blog post that digs into the underlying probability and statistics assumptions that are probably driving your estimates, versus the ones that maybe should be driving them.

Relevant links:

The Algorithm That Made the Black Hole Picture

53.5 million light-years away, there’s a gigantic galaxy called M87 with something interesting going on inside it. Between Einstein’s theory of relativity and the motion of a group of stars in the galaxy (the motion is characteristic of there being a huge gravitational mass present), scientists have believed for years that there is a supermassive black hole at the center of that galaxy. However, black holes are really hard to see directly because they aren’t a light source like a star or a supernova. They suck up all the light around them, and moreover, even though they’re really massive, they’re small in volume.

That’s why it was so amazing a few weeks ago when scientists announced that they had reconstructed an image of a black hole for the first time ever. The image was the result of many measurements combined together with a clever reconstruction strategy, and giving scientists, engineers, and all the rest of us something to marvel at.

Relevant links:

The Role of Structure in AI

As artificial intelligence algorithms get applied to more and more domains, a question that often arises is whether to somehow build structure into the algorithm itself to mimic the structure of the problem. There’s usually some amount of knowledge we already have of each domain, an understanding of how it usually works, but it’s not clear how (or even if) to lend this knowledge to an AI algorithm to help it get started. Sure, it may get the algorithm caught up to where we already were on solving that problem, but will it eventually become a limitation where the structure and assumptions prevent the algorithm from surpassing human performance?

It’s a problem without a universal answer. This week, we’ll talk about the question in general, and especially recommend a recent discussion between Christopher Manning and Yann LeCun, two AI researchers who hold different opinions on whether structure is a necessary good or a necessary evil.

Relevant links:

The Great Data Science Specialist vs. Generalist Debate

It’s not news that data scientists are expected to be capable in many different areas (writing software, designing experiments, analyzing data, talking to non-technical stakeholders). One thing that has been changing, though, as the field becomes a bit older and more mature, is our ideas about what data scientists should focus on to stay relevant. Should they specialize in a particular area (if so, which one)? Should they instead stay general and work across many different areas? In either case, what are the costs and benefits?

This question has prompted a number of think pieces lately, which are sometimes advocating for specializing, and sometimes pointing out the benefits of generalists. In short, if you’re trying to figure out what to actually do, you might be hearing some conflicting opinions. In this episode, we break apart the arguments both ways, and maybe (hopefully?) reach a little resolution about where to go from here.

Relevant links:

How to think about moonshot projects

If you work in data science, you’re well aware of the sheer volume of high-risk, high-reward projects that are hypothetically possible. The fact that they’re high-reward means they’re exciting to think about, and the payoff would be huge if they succeed, but the high-risk piece means that you have to be smart about what you choose to work on and be wary of investing all your resources in projects that fail entirely or starve other, higher-value projects.

This episode focuses mainly on Google X, the so-called “Moonshot Factory” at Google that is a modern-day heir to the research legacies of Bell Labs and Xerox PARC. It’s an organization entirely focused on rapidly imagining, prototyping, invalidating, and, occasionally, successfully creating game-changing technologies. The process and philosophy behind Google X are useful for anyone thinking about how to stay aggressive and “responsibly irresponsible,” which includes a lot of you data science folks out there.

Relevant links:

Statistical significance in hypothesis testing

When you are running an AB test, one of the most important questions is how much data to collect. Collect too little, and you can end up drawing the wrong conclusion from your experiment. But in a world where experimenting is generally not free, and you want to move quickly once you know the answer, there is such a thing as collecting too much data. Statisticians have been solving this problem for decades, and their best practices are encompassed in the ideas of power, statistical significance, and especially how to generally think about hypothesis testing. This week, we’re going over these important concepts, so your next AB test is just as data-intensive as it needs to be.

Relevant links:

The language model too dangerous to release

OpenAI recently created a cutting-edge new natural language processing model, but unlike all their other projects so far, they have not released it to the public. Why? It seems to be a little too good. It can answer reading comprehension questions, summarize text, translate from one language to another, and generate realistic fake text. This last case, in particular, raised concerns inside OpenAI that the raw model could be dangerous if bad actors had access to it, so researchers will spend the next six months studying the model (and reading comments from you, if you have strong opinions here) to decide what to do next. Regardless of where this lands from a policy perspective, it’s an impressive model and the snippets of released auto-generated text are quite impressive. We’re covering the methodology, the results, and a bit of the policy implications in our episode this week.

Relevant links:

The Cathedral and the Bazaar

Imagine you have two choices of how to build something: top-down and controlled, with a few people playing a master designer role, or bottom-up and free-for-all, with nobody playing an explicit architect role. Which one do you think would make the better product? “The Cathedral and the Bazaar” is an essay exploring this question for open source software, and making an argument for the bottom-up approach. It’s not entirely intuitive that projects like Linux or scikit-learn, with many contributors and an open-door policy for modifying the code, would be able to resist the chaos of many cooks in the kitchen. So what makes it work in some cases? And sometimes not work in others? That’s the topic of discussion this week.

Relevant links:

AlphaStar

It’s time for our latest installation in the series on artificial intelligence agents beating humans at games that we thought were safe from the robots. In this case, the game is StarCraft, and the AI agent is AlphaStar, from the same team that built the Go-playing AlphaGo AI last year. StarCraft presents some interesting challenges though: the gameplay is continuous, there are many different kinds of actions a player must take, and of course there’s the usual complexities of playing strategy games and contending with human opponents. AlphaStar overcame all of these challenges, and more, to notch another win for the computers.

Relevant links:

Are machine learning engineers the new data scientists?

For many data scientists, maintaining models and workflows in production is both a huge part of their job and not something they necessarily trained for if their background is more in statistics or machine learning methodology. Productionizing and maintaining data science code has more in common with software engineering than traditional science, and to reflect that, there’s a new-ish role, and corresponding job title, that you should know about. It’s called machine learning engineer, and it’s what a lot of data scientists are becoming.

Relevant links:

Interview with Alex Radovic, particle physicist turned machine learning researcher

You’d be hard-pressed to find a field with bigger, richer, and more scientifically valuable data than particle physics. Years before “data scientist” was even a term, particle physicists were inventing technologies like the world wide web and cloud computing grids to help them distribute and analyze the datasets required to make particle physics discoveries. Somewhat counterintuitively, though, deep learning has only really debuted in particle physics in the last few years, although it’s making up for lost time with many exciting new advances.

This episode of Linear Digressions is a little different from most, as we’ll be interviewing a guest, one of my (Katie’s) friends from particle physics, Alex Radovic. Alex and his colleagues have been at the forefront of machine learning in physics over the last few years, and his perspective on the strengths and shortcomings of those two fields together is a fascinating one.

Relevant links:

K Nearest Neighbors

K Nearest Neighbors is an algorithm with secrets. On one hand, the algorithm itself is as straightforward as possible: find the labeled points nearest the point that you need to predict, and make a prediction that’s the average of their answers. On the other hand, what does “nearest” mean when you’re dealing with complex data? How do you decide whether a man and a woman of the same age are “nearer” to each other than two women several years apart? What if you convert all your monetary columns from dollars to cents, your distances from miles to nanometers, your weights from pounds to kilograms? Can your definition of “nearest” hold up under these types of transformations? We’re discussing all this, and more, in this week’s episode.

Not every deep learning paper is great. Is that a problem?

Deep learning is a field that’s growing quickly. That’s good! There are lots of new deep learning papers put out every day. That’s good too… right? What if not every paper out there is particularly good? What even makes a paper good in the first place? It’s an interesting thing to think about, and debate, since there’s no clean-cut answer and there are worthwhile arguments both ways. Wherever you find yourself coming down in the debate, though, you’ll appreciate the good papers that much more.

Relevant links:

The assumptions of ordinary least squares

Ordinary least squares (OLS) is often used synonymously with linear regression. If you’re a data scientist, machine learner, or statistician, you bump into it daily. If you haven’t had the opportunity to build up your understanding from the foundations, though, listen up: there are a number of assumptions underlying OLS that you should know and love. They’re interesting, force you to think about data and statistics, and help you know when you’re out of “good” OLS territory and into places where you could run into trouble.

Relevant links:

Quantile regression

Linear regression is a great tool if you want to make predictions about the mean value that an outcome will have given certain values for the inputs. But what if you want to predict the median? Or the 10th percentile? Or the 90th percentile. You need quantile regression, which is similar to ordinary least squares regression in some ways but with some really interesting twists that make it unique. This week, we’ll go over the concept of quantile regression, and also a bit about how it works and when you might use it.

Relevant links:

Heterogeneous Treatment Effects

When data scientists use a linear regression to look for causal relationships between a treatment and an outcome, what they’re usually finding is the so-called average treatment effect. In other words, on average, here’s what the treatment does in terms of making a certain outcome more or less likely to happen. But there’s more to life than averages: sometimes the relationship works one way in some cases, and another way in other cases, such that the average isn’t giving you the whole story. In that case, you want to start thinking about heterogeneous treatment effects, and this is the podcast episode for you.

Relevant links:

Pre-training language models for natural language processing problems

When you build a model for natural language processing (NLP), such as a recurrent neural network, it helps a ton if you’re not starting from zero. In other words, if you can draw upon other datasets for building your understanding of word meanings, and then use your training dataset just for subject-specific refinements, you’ll get farther than just using your training dataset for everything. This idea of starting with some pre-trained resources has an analogue in computer vision, where initializations from ImageNet used for the first few layers of a CNN have become the new standard. There’s a similar progression under way in NLP, where simple(r) embeddings like word2vec are giving way to more advanced pre-processing methods that aim to capture more sophisticated understanding of word meanings, contexts, language structure, and more.

Relevant links:

Facial recognition, society, and the law

Facial recognition being used in everyday life seemed far-off not too long ago. Increasingly, it’s being used and advanced widely and with increasing speed, which means that our technical capabilities are starting to outpace (if they haven’t already) our consensus as a society about what is acceptable in facial recognition and what isn’t. The threats to privacy, fairness, and freedom are real, and Microsoft has become one of the first large companies using this technology to speak out in specific support of its regulation through legislation. Their arguments are interesting, provocative, and even if you don’t agree with every point they make or harbor some skepticism, there’s a lot to think about in what they’re saying.

Relevant links: