What's actually happening when an LLM "thinks out loud"? Research on human decision-making suggests that much of the reasoning we believe drives our choices is actually post hoc rationalization — we decide first, explain later. Katie and Ben get curious about whether the same might be true for large language models: when you watch a model reason through a problem in real time, is that chain of thought the genuine process, or just a plausible-sounding story told after the fact? It's a deceptively deep question with real stakes for how much we should trust model explanations.
Miles Turpin et al., "Language Models Don't Always Say What They Think: Unfaithful Explanations in
Chain-of-Thought Prompting" (NeurIPS 2023, NYU and Anthropic): arxiv.org/abs/2305.04388
Anthropic, "Reasoning Models Don't Always Say What They Think" (Alignment Faking research, 2025):
www.anthropic.com/research/reasoni…s-dont-say-think