Network effects re-release: when the power of a public health measure lies in widespread adoption

This week’s episode is a re-release of a recent episode, which we don’t usually do but it seems important for understanding what we can all do to slow the spread of covid-19. In brief, public health measures for infectious diseases get most of their effectiveness from their widespread adoption: most of the protection you get from a vaccine, for example, comes from all the other people who also got the vaccine.

That’s why measures like social distancing are so important right now: even if you’re not in a high-risk group for covid-19, you should still stay home and avoid in-person socializing because your good behavior lowers the risk for those who are in high-risk groups. If we all take these kinds of measures, the risk lowers dramatically. So stay home, work remotely if you can, avoid physical contact with others, and do your part to manage this crisis. We’re all in this together.

Relevant links:

Causal inference when you can't experiment: difference-in-differences and synthetic controls

When you need to untangle cause and effect, but you can’t run an experiment, it’s time to get creative. This episode covers difference in differences and synthetic controls, two observational causal inference techniques that researchers have used to understand causality in complex real-world situations.

Relevant links:

Better know a distribution: the Poisson distribution

This is a re-release of an episode that originally ran on October 21, 2018

The Poisson distribution is a probability distribution function used to for events that happen in time or space. It’s super handy because it’s pretty simple to use and is applicable for tons of things—there are a lot of interesting processes that boil down to “events that happen in time or space.” This episode is a quick introduction to the distribution, and then a focus on two of our favorite everyday applications: using the Poisson distribution to identify supernovas and study army deaths from horse kicks.

The Lottery Ticket Hypothesis

Recent research into neural networks reveals that sometimes, not all parts of the neural net are equally responsible for the performance of the network overall. Instead, it seems like (in some neural nets, at least) there are smaller subnetworks present where most of the predictive power resides. The fascinating thing is that, for some of these subnetworks (so-called “winning lottery tickets”), it’s not the training process that makes them good at their classification or regression tasks: they just happened to be initialized in a way that was very effective. This changes the way we think about what training might be doing, in a pretty fundamental way. Sometimes, instead of crafting a good fit from wholecloth, training might be finding the parts of the network that always had predictive power to begin with, and isolating and strengthening them. This research is pretty recent, having only come to prominence in the last year, but nonetheless challenges our notions about what it means to train a machine learning model.

Relevant links:

Interesting technical issues prompted by GDPR and data privacy concerns

Data privacy is a huge issue right now, after years of consumers and users gaining awareness of just how much of their personal data is out there and how companies are using it. Policies like GDPR are imposing more stringent rules on who can use what data for what purposes, with an end goal of giving consumers more control and privacy around their data. This episode digs into this topic, but not from a security or legal perspective—this week, we talk about some of the interesting technical challenges introduced by a simple idea: a company should remove a user’s data from their database when that user asks to be removed. We talk about two topics, namely using Bloom filters to efficiently find records in a database (and what Bloom filters are, for that matter) and types of machine learning algorithms that can un-learn their training data when it contains records that need to be deleted.

Relevant links:

Thinking of data science initiatives as innovation initiatives

Put yourself in the shoes of an executive at a big legacy company for a moment, operating in virtually any market vertical: you’re constantly hearing that data science is revolutionizing the world and the firms that survive and thrive in the coming years are those that execute on a data strategy. What does this mean for your company? How can you best guide your established firm through a successful transition to becoming data-driven? How do you balance the momentum your firm has right now, and the need to support all your current products, customers and operations, against a new and relatively unknown future?

If you’re working as a data scientist at a mature and well-established company, these are the worries on the mind of your boss’s boss’s boss. The worries on your mind may be similar: you’re trying to understand where your work fits into the bigger picture, you need to break down silos, you’re often running into cultural headwinds created by colleagues who don’t understand or trust your work. Congratulations, you’re in the midst of a classic set of challenges encountered by innovation initiatives everywhere. Harvard Business School professor Clayton Christensen wrote a classic business book (The Innovator’s Dilemma) explaining the paradox of trying to innovate in established companies, and why the structure and incentives of those companies almost guarantee an uphill climb to innovate. This week’s episode breaks down the innovator’s dilemma argument, and what it means for data scientists working in mature companies trying to become more data-centric.

Relevant links:

Building a curriculum for educating data scientists: Interview with Prof. Xiao-Li Meng

As demand for data scientists grows, and it remains as relevant as ever that practicing data scientists have a solid methodological and technical foundation for their work, higher education institutions are coming to terms with what’s required to educate the next cohorts of data scientists. The heterogeneity and speed of the field makes it challenging for even the most talented and dedicated educators to know what a data science education “should” look like.

This doesn’t faze Xiao-Li Meng, Professor of Statistics at Harvard University and founding Editor-in-Chief of the Harvard Data Science Review. He’s our interview guest in this episode, talking about the pedagogically distinct classes of data science and how he thinks about designing curricula for making anyone more data literate. From new initiatives in data science to dealing with data science FOMO, this wide-ranging conversation with a leading scholar gives us a lot to think about.

Relevant links:

Running experiments when there are network effects

Traditional A/B tests assume that whether or not one person got a treatment has no effect on the experiment outcome for another person. But that’s not a safe assumption, especially when there are network effects (like in almost any social context, for instance!) SUTVA, or the stable treatment unit value assumption, is a big phrase for this assumption and violations of SUTVA make for some pretty interesting experiment designs. From news feeds in LinkedIn to disentangling herd immunity from individual immunity in vaccine studies, indirect (i.e. network) effects in experiments can be just as big as, or even bigger than, direct (i.e. individual effects). And this is what we talk about this week on the podcast.

Relevant links:

Model robustness in the face of adversaries

Adversarial examples are really, really weird: pictures of penguins that get classified with high certainty by machine learning algorithms as drumsets, or random noise labeled as pandas, or any one of an infinite number of mistakes in labeling data that humans would never make but computers make with joyous abandon. What gives? A compelling new argument makes the case that it’s not the algorithms so much as the features in the datasets that holds the clue. This week’s episode goes through several papers pushing our collective understanding of adversarial examples, and giving us clues to what makes these counterintuitive cases possible.

Relevant links:

UMAP: The latest in dimensionality reduction for clustering

Dimensionality reduction redux: this episode covers UMAP, an unsupervised algorithm designed to make high-dimensional data easier to visualize, cluster, etc. It’s similar to t-SNE but has some advantages. This episode gives a quick recap of t-SNE, especially the connection it shares with information theory, then gets into how UMAP is different (many say better).

Between the time we recorded and released this episode, an interesting argument made the rounds on the internet that UMAP’s advantages largely stem from good initialization, not from advantages inherent in the algorithm. We don’t cover that argument here obviously, because it wasn’t out there when we were recording, but you can find a link to the paper below.

Relevant links:

Data scientists: beware of simple metrics

Picking a metric for a problem means defining how you’ll measure success in solving that problem. Which sounds important, because it is, but oftentimes new data scientists only get experience with a few kinds of metrics when they’re learning and those metrics have real shortcomings when you think about what they tell you, or don’t, about how well you’re really solving the underlying problem. This episode takes a step back and says, what are some metrics that are popular with data scientists, why are they popular, and what are their shortcomings when it comes to the real world? There’s been a lot of great thinking and writing recently on this topic, and we cover a lot of that discussion along with some perspective of our own.

Relevant links:

Communicating data science, from academia to industry

For something as multifaceted and ill-defined as data science, communication and sharing best practices across the field can be extremely valuable but also extremely, well, multifaceted and ill-defined. That doesn’t bother our guest today, Prof. Xiao-Li Meng of the Harvard statistics department, who is leading an effort to start an open-access Data Science Review journal in the model of the Harvard Business Review or Law Review. This episode features Xiao-Li talking about the need he sees for a central gathering place for data scientists in academia, industry, and government to come together to learn from (and teach!) each other.

Relevant links:

Optimizing for the short-term vs. the long-term

When data scientists run experiments, like A/B tests, it’s really easy to plan on a period of a few days to a few weeks for collecting data. The thing is, the change that’s being evaluated might have effects that last a lot longer than a few days or a few weeks—having a big sale might increase sales this week, but doing that repeatedly will teach customers to wait until there’s a sale and never buy anything at full price, which could ultimately drive down revenue in the long term. Increasing the volume of ads on a website might lead people to click on more ads in the short term, but in the long term they’ll be more likely to visually block the ads out and learn to ignore them. But these long-term effects aren’t apparent from the short-term experiment, so this week we’re talking about a paper from Google research that confronts the short-term vs. long-term tradeoff, and how to measure long-term effects from short-term experiments.

Relevant links:

Interview with Prof. Andrew Lo, on using data science to inform complex business decisions

This episode features Prof. Andrew Lo, the author of a paper that we discussed recently on Linear Digressions, in which Prof. Lo uses data to predict whether a medicine in the development pipeline will eventually go on to win FDA approval. This episode gets into the story behind that paper: how the approval prospects of different drugs inform the investment decisions of pharma companies, how to stitch together siloed and incomplete datasts to form a coherent picture, and how the academics building some of these models think about when and how their work can make it out of academia and into industry. Professor Lo is an expert in business (he teaches at the MIT Sloan School of Management) and work like his shows how data science can open up new ways of doing business.

Relevant links:

Using machine learning to predict drug approvals

One of the hottest areas in data science and machine learning right now is healthcare: the size of the healthcare industry, the amount of data it generates, and the myriad improvements possible in the healthcare system lay the groundwork for compelling, innovative new data initiatives. One spot that drives much of the cost of medicine is the riskiness of developing new drugs: drug trials can cost hundreds of millions of dollars to run and, especially given that numerous medicines end up failing to get approval from the FDA, pharmaceutical companies want to have as much insight as possible about whether a drug is more or less likely to make it through clinical trials and on to approval. Professor Andrew Lo and collaborators at MIT Sloan School of Management is taking a look at this prediction task using machine learning, and has an article in the Harvard Data Science Review showing what they were able to find. It’s a fascinating example of how data science can be used to address business needs in creative but very targeted and effective ways.

Relevant links:

Facial recognition, society, and the law

This is a re-release of an episode that originally ran on January 6, 2019.

Facial recognition being used in everyday life seemed far-off not too long ago. Increasingly, it’s being used and advanced widely and with increasing speed, which means that our technical capabilities are starting to outpace (if they haven’t already) our consensus as a society about what is acceptable in facial recognition and what isn’t. The threats to privacy, fairness, and freedom are real, and Microsoft has become one of the first large companies using this technology to speak out in specific support of its regulation through legislation. Their arguments are interesting, provocative, and even if you don’t agree with every point they make or harbor some skepticism, there’s a lot to think about in what they’re saying.

Relevant links:

Lessons learned from doing data science, at scale, in industry

If you’ve taken a machine learning class, or read up on A/B tests, you likely have a decent grounding in the theoretical pillars of data science. But if you’re in a position to have actually built lots of models or run lots of experiments, there’s almost certainly a bunch of extra “street smarts” insights you’ve had that go beyond the “books smarts” of more academic studies. The data scientists at Booking.com, who run build models and experiments constantly, have written a paper that bridges the gap and talks about what non-obvious things they’ve learned from that practice. In this episode we read and digest that paper, talking through the gotchas that they don’t always teach in a classroom but that make data science tricky and interesting in the real world.

Relevant links:

Varsity A/B Testing

When you want to understand if doing something causes something else to happen, like if a change to a website causes and dip or rise in downstream conversions, the gold standard analysis method is to use randomized controlled trials. Once you’ve properly randomized the treatment and effect, the analysis methods are well-understood and there are great tools in R and python (and other languages) to find the effects. However, when you’re operating at scale, the logistics of running all those tests, and reaching correct conclusions reliably, becomes the main challenge—making sure the right metrics are being computed, you know when to stop an experiment, you minimize the chances of finding spurious results, and many other issues that are simple to track for one or two experiments but become real challenges for dozens or hundreds of experiments. Nonetheless, the reality is that there might be dozens or hundreds of experiments worth running. So in this episode, we’ll work through some of the most important issues for running experiments at scale, with strong support from a series of great blog posts from Airbnb about how they solve this very issue.

Relevant links:

The Care and Feeding of Data Scientists: Growing Careers

In the third and final installment of a conversation with Michelangelo D’Agostino, VP of Data Science and Engineering at Shoprunner, about growing and mentoring data scientists on your team. Some of our topics of conversation include how to institute hack time as a way to learn new things, what career growth looks like in data science, and how to institutionalize professional growth as part of a career ladder. As with the other episodes in this series, the topics we cover today are also covered in the O’Reilly report linked below.

Relevant links:

The Care and Feeding of Data Scientists: Recruiting and Hiring Data Scientists

This week’s episode is the second in a three-part interview series with Michelangelo D’Agostino, VP of Data Science at Shoprunner. This discussion centers on building a team, which means recruiting, interviewing and hiring data scientists. Since data science talent is in such high demand, and data scientists are understandably choosy about where they go to work, a good recruiting and hiring program can have a big impact on the size and quality of the team. Our chat covers much a couple of sections in our dual-authored O’Reilly report, “The Care and Feeding of Data Scientists,” which you can read at the link below.

Relevant links: