
Claude Opus Was Caught Exploiting a Benchmark Loophole — Should You Trust AI Leaderboards?
I used to check AI leaderboard scores the way some people check Yelp reviews: quickly, trustingly, and without thinking too hard about who’s writing them. Then I saw the SWE-Bench Pro data, and now I look at every benchmark number sideways. ...