The Poisson distribution is a probability distribution function used to for events that happen in time or space. It’s super handy because it’s pretty simple to use and is applicable for tons of things—there are a lot of interesting processes that boil down to “events that happen in time or space.” This episode is a quick introduction to the distribution, and then a focus on two of our favorite applications: using the Poisson distribution to identify supernovas and study army deaths from horse kicks.
If you wanted to find a dataset of jokes, how would you do it? What about a dataset of podcast episodes? If your answer was “I’d try Google,” you might have been disappointed—Google is a great search engine for many types of web data, but it didn’t have any special tools to navigate the particular challenges of, well, dataset data. But all that is different now: Google recently announced Google Dataset Search, an effort to unify metadata tagging around datasets and complementary efforts on the search side to recognize and organize datasets in a way that’s useful and intuitive. So whether you’re an academic looking for an economics or physics or biology dataset, or a big old nerd modeling jokes or analyzing podcasts, there’s an exciting new way for you to find data.
We started Linear Digressions 4 years ago… this isn’t a technical episode, just two buddies shooting the breeze about something we’ve somehow built together.
This week, we’re dusting off the ol’ particle physics PhD to bring you an episode about ambitious new model-agnostic searches for new particles happening at CERN. Traditionally, new particles have been discovered by “targeted searches,” where scientists have a hypothesis about the particle they’re looking for and where it might be found. However, with the huge amounts of data coming out of CERN, a new type of broader search algorithm is starting to be deployed. It’s a strategy that casts a very wide net, looking in many different places at the same time, which also introduces all kinds of interesting questions—even a one-in-a-thousand occurrence happens when you’re looking in many thousands of places.
If you’re a data scientist, you know how important it is to keep your data orderly, clean, moving smoothly between different systems, well-documented… there’s a ton of work that goes into building and maintaining databases and data pipelines. This job, that of owner and maintainer of the data being used for analytics, is often the realm of data engineers. From data extraction, transform and loading procedures to the data storage strategy and even the definitions of key data quantities that serve as focal points for a whole organization, data engineers keep the plumbing of data analytics running smoothly.
A very intriguing op-ed was published in the NY Times recently, in which the author (a senior official in the Trump White House) claimed to be a minor saboteur of sorts, acting with his or her colleagues to undermine some of Donald Trump’s worst instincts and tendencies. Pretty stunning, right? So who is the author? It’s a mystery—the op-ed was published anonymously. That hasn’t stopped people from speculating though, and some machine learning on the vocabulary used in the op-ed is one way to get clues.
If you've been in data science for more than a year or two, chances are you've noticed changes in the field as it's grown and matured. And if you're newer to the field, you may feel like there's a disconnect between lots of different stories about what data scientists should know, or do, or expect from their job. This week, we cover two thought pieces, one that arose from interviews with 35(!) data scientists speaking about what their jobs actually are (and aren't), and one from the head of data science at AirBnb organizing core data science work into three main specialties.
- What Data Scientists Really Do, According to 35 Data Scientists
- One Data Science Job Doesn't Fit All
Civis Analytics data scientist openings:
There's just too much interesting stuff at the intersection of agile software development and data science for us to be able to cover it all in one episode, so this week we're picking up where we left off last time. We'll give a quick overview of agile for those who missed last week or still have some questions, and then cover some of the aspects of agile that don't work well out-of-the-box when applied to data analytics. Fortunately, though, there are some straightforward modifications to agile that make it work really nicely for data analytics!
If you're a data scientist at a firm that does a lot of software building, chances are good that you've seen or heard engineers sometimes talking about "agile software development." If you don't work at a software firm, agile practices might be newer to you. In either case, we wanted to go through a great series of blog posts about some of the practices from agile that are relevant for how data scientists work, in hopes of inspiring some transfer learning from software development to data science.
We've got a classic for you this week as we take a week off for the dog days of summer. See you again next week!
Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists. Losing unexpectedly at the very end of the contest is also something that a lot of us have experienced. It's not just bad luck: a very specific combination of overfitting on popular competitions can take someone who is in the top few spots in the final days of a contest and bump them down hundreds of slots in the final tally.
There's a lot of great machine learning papers coming out every day--and, if we're being honest, some papers that are not as great as we'd wish. In some ways this is symptomatic of a field that's growing really quickly, but it's also an artifact of strange incentive structures in academic machine learning, and the fact that sometimes machine learning is just really hard. At the same time, a high quality of academic work is critical for maintaining the reputation of the field, so in this episode we walk through a recent paper that spells out some of the most common shortcomings of academic machine learning papers and what we can do to make things better.
The stars aligned for me (Katie) this past weekend: I raced my first half-marathon in a long time and got to read a great article from the NY Times about a new running shoe that Nike claims can make its wearers run faster. Causal claims like this one are really tough to verify, because even if the data suggests that people wearing the shoe are faster that might be because of correlation, not causation, so I loved reading this article that went through an analysis of thousands of runners' data in 4 different ways. Each way has a great explanation with pros and cons (as well as results, of course), so be sure to read the article after you check out this episode!
When you're using an AB test to understand the effect of a treatment, there are a lot of assumptions about how the treatment (and control, for that matter) get applied. For example, it's easy to think that everyone who was assigned to the treatment arm actually gets the treatment, everyone in the control arm doesn't, and that the two groups get their treatment instantaneously. None of these things happen in real life, and if you really care about measuring your treatment effect then that's something you want to understand and correct. In this post we'll talk through a great blog post that outlines this for mobile experiments. Oh, and Ben sings.
We like learning on vacation. And we're on vacation, so we thought we'd re-air this episode about how to learn.
Original Summary: If you're anything like us, you a) always are curious to learn more about data science and machine learning and stuff, and b) are usually overwhelmed by how much content is out there (not all of it very digestible). We hope this podcast is a part of the solution for you, but if you're looking to go farther (who isn't?) then we have a few new resources that are presenting high-quality content in a fresh, accessible way. Boring old PDFs full of inscrutable math notation, your days are numbered!
We're on vacation on Mars, so we won't be communicating with you all directly this week. Though, if we wanted to, we could probably use this episode to help get started.
Original Episode: http://lineardigressions.com/episodes/2017/3/19/space-codes
Original Summary: It's hard to get information to and from Mars. Mars is very far away, and expensive to get to, and the bandwidth for passing messages with Earth is not huge. The messages you do pass have to traverse millions of miles, which provides ample opportunity for the message to get corrupted or scrambled. How, then, can you encode messages so that errors can be detected and corrected? How does the decoding process allow you to actually find and correct the errors? In this episode, we'll talk about three pieces of the process (Reed-Solomon codes, convolutional codes, and Viterbi decoding) that allow the scientists at NASA to talk to our rovers on Mars.
We're on vacation, so we hope you enjoy this episode while we each sip cocktails on the beach.
Original Episode: http://lineardigressions.com/episodes/2017/6/18/anscombes-quartet
Original Summary: Anscombe's Quartet is a set of four datasets that have the same mean, variance and correlation but look very different. It's easy to think that having a good set of summary statistics (like mean, variance and correlation) can tell you everything important about a dataset, or at least enough to know if two datasets are extremely similar or extremely different, but Anscombe's Quartet will always be standing behind you, laughing at how silly that idea is.
Anscombe's Quartet was devised in 1973 as an example of how summary statistics can be misleading, but today we can even do one better: the Datasaurus Dozen is a set of twelve datasets, all extremely visually distinct, that have the same summary stats as a source dataset that, there's no other way to put this, looks like a dinosaur. It's an example of how datasets can be generated to look like almost anything while still preserving arbitrary summary statistics. In other words, Anscombe's Quartets can be generated at-will and we all should be reminded to visualize our data (not just compute summary statistics) if we want to claim to really understand it.
Now that hurricane season is upon us again (and we are on vacation), we thought a look back on our hurricane forecasting episode was prudent. Stay safe out there.
Original Episode: http://lineardigressions.com/episodes/2017/9/17/hurricane-forecasting
Original Summary: It's been a busy hurricane season in the Southeastern United States, with millions of people making life-or-death decisions based on the forecasts around where the hurricanes will hit and with what intensity. In this episode we'll deconstruct those models, talking about the different types of models, the theory behind them, and how they've evolved through the years.
In this episode, we talk about some of the potential ramifications of GRPD in the world of data science.
If you're a data scientist, chances are good that you've heard of git, which is a system for version controlling code. Chances are also good that you're not quite as up on git as you want to be--git has a strong following among software engineers but, in our anecdotal experience, data scientists are less likely to know how to use this powerful tool. Never fear: in this episode we'll talk through some of the basics, and what does (and doesn't) translate from version control for regular software to version control for data science software.