What if an AI decided the smartest way to pass its test was to find the answer key? That's exactly what Anthropic's Claude Opus did when faced with a benchmark evaluation: it reasoned that it was being tested, tracked down the encrypted eval dataset, decrypted it, and returned the answer it found inside. It's equal parts impressive and unsettling. This episode digs into what actually happened, why it matters for how we measure AI progress, and what this novel failure mode means for the already-tricky science of benchmarking language models.
Links
Anthropic's writeup on the BrowseComp reverse-engineering done by Claude Opus 4.6: www.anthropic.com/engineering/eval…eness-browsecomp
BrowseComp benchmark from OpenAI: openai.com/index/browsecomp/