Five and a half years have passed since Linear Digressions went on hiatus, and in that time... nothing has changed. Just kidding. Katie is joined by Phoebe to trace the surprisingly winding research path that led to ChatGPT. Here's a fun fact: GPT-3, the model ChatGPT was ultimately built on, already existed in 2020, and it was technically more powerful than the version that took over the world. So what happened in between? Why couldn't a 175-billion-parameter model trained on essentially the entire internet reliably answer "how do I bake a cake?"
The answer involves Atari games, simulated robots learning to walk, 40 contractors, and a series of papers stretching from 2017 to 2022 that quietly built the recipe every major AI assistant uses today. We trace that arc from reinforcement learning from human preferences all the way to the app that got a hundred million users in two months.
References mentioned in this episode:
1. Ouyang et al. 2022 (InstructGPT paper): https://arxiv.org/abs/2203.02155
2. OpenAI blog post (more readable): https://openai.com/index/instruction-following/
3. Christiano et al. 2017 (Deep Reinforcement Learning from Human Preferences – this is where they teach AI to walk and play Atari): https://arxiv.org/abs/1706.03741
4. Stiennon et al. 2020 (Learning to Summarize from Human Feedback): https://arxiv.org/abs/2009.01325
5. Ziegler et al. 2019 (Fine-Tuning Language Models from Human Preferences): https://arxiv.org/abs/1909.08593