Benchmarking AI Models

How do you know if a new AI model is actually better than the last one? It turns out answering that question is a lot messier than it sounds. This week we dig into the world of LLM benchmarks — the standardized tests used to compare models — exploring two canonical examples: MMLU, a 14,000-question multiple-choice gauntlet spanning medicine, law, philosophy, and dozens of other subjects, and SWE-bench, which throws real GitHub issues at models to see if they can fix them. Along the way: Goodhart's Law, data contamination, canary strings, and why acing a test isn't always the same as being smart.

MMLU: "Measuring Massive Multitask Language Understanding" by Dan Hendrycks et al. https://arxiv.org/abs/2009.03300

SWE-bench: "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" by Carlos E. Jimenez et al. https://arxiv.org/abs/2310.06770

BIG-bench (origin of the canary-string approach): "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models" by Aarohi Srivastava et al. https://arxiv.org/abs/2206.04615