Debates over AI benchmarking have reached Pokémon
May 3, 2025
Jonathan Davis

Even the beloved world of Pokémon isn't immune to the drama surrounding AI benchmarks. A recent viral post on X stirred up quite the buzz, claiming that Google's latest Gemini model had outpaced Anthropic's leading Claude model in the classic Pokémon video game trilogy. According to the post, Gemini had impressively made it to Lavender Town in a developer's Twitch stream, while Claude was lagging behind at Mount Moon as of late February.
Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town
119 live views only btw, incredibly underrated stream pic.twitter.com/8AvSovAI4x
— Jush (@Jush21e8) April 10, 2025
However, what this post conveniently left out was the fact that Gemini had a bit of an unfair advantage. Savvy users over on Reddit quickly pointed out that the developer behind the Gemini stream had crafted a custom minimap. This nifty tool aids the model in recognizing "tiles" in the game, such as cuttable trees, which significantly cuts down the time Gemini needs to spend analyzing screenshots before deciding on its next move.
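For a sense of what that kind of help looks like in practice, here's a minimal Python sketch of a tile-annotated harness. It is purely illustrative, with invented names like `GameState` and `build_minimap_prompt`, and is not the streamer's actual code; the point is that the harness does the hard vision work of labeling the grid before the model ever sees the prompt.

```python
# Hypothetical sketch of a "minimap" harness for an LLM Pokémon agent.
# Not the streamer's actual code; it only illustrates how pre-labeling
# tiles can shortcut raw screenshot analysis.

from dataclasses import dataclass

TILE_LEGEND = {
    0: "walkable",
    1: "wall",
    2: "cuttable_tree",  # the kind of object a minimap can flag for the model
    3: "npc",
}

@dataclass
class GameState:
    player_xy: tuple[int, int]
    tiles: list[list[int]]  # grid of TILE_LEGEND codes parsed from the frame

def build_minimap_prompt(state: GameState) -> str:
    """Render the tile grid as text so the model reasons over structure,
    not pixels. The hard vision work is done before the model sees anything."""
    rows = ["".join(str(t) for t in row) for row in state.tiles]
    legend = ", ".join(f"{k}={v}" for k, v in TILE_LEGEND.items())
    return (
        f"Legend: {legend}\n"
        f"Player at {state.player_xy}\n"
        "Map:\n" + "\n".join(rows) + "\n"
        "Reply with one move: up/down/left/right/a/b"
    )

def next_button_press(state: GameState, llm) -> str:
    # llm is any text-in/text-out chat callable; swapped in for illustration.
    return llm(build_minimap_prompt(state)).strip().lower()
```

Feeding the model a labeled grid instead of raw pixels turns the task from computer vision into path planning, which is exactly the kind of leg-up the Claude stream apparently lacked.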
Now, while Pokémon might not be the most serious AI benchmark out there, it does serve as a fun yet telling example of how different setups can skew the results of these tests. Take Anthropic's recent Claude 3.7 Sonnet, for instance. On SWE-bench Verified, a benchmark meant to test coding prowess, the model scored 62.3% accuracy. But with a "custom scaffold" that Anthropic whipped up, that score jumped to 70.3%.
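Anthropic hasn't published what its scaffold does, but scaffolds in general wrap extra machinery around an unchanged model, for example sampling several candidate fixes and keeping the one that scores best on automated checks. The sketch below illustrates only that generic best-of-N pattern; `solve_with_scaffold` and the `model` and `score` callables are all hypothetical names, not Anthropic's actual setup.

```python
# Hypothetical best-of-N scaffold, a common pattern behind score jumps like
# 62.3% -> 70.3%. Not Anthropic's actual scaffold, whose details are unpublished.

from typing import Callable

def solve_with_scaffold(
    model: Callable[[str], str],    # any text-in/text-out model call
    issue: str,
    score: Callable[[str], float],  # e.g. fraction of unit tests a patch passes
    n: int = 8,
) -> str:
    """Sample n candidate patches and keep the highest-scoring one.

    The model weights never change; only the harness around the model
    differs, which is why scaffolded and unscaffolded numbers on the
    same benchmark aren't directly comparable.
    """
    candidates = [model(f"Fix this issue (attempt {i + 1}):\n{issue}") for i in range(n)]
    return max(candidates, key=score)
```

The takeaway: the harness alone can be worth several points of accuracy, so a scaffolded score and a vanilla score sitting side by side on a leaderboard are measuring different things.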
And it doesn't stop there. Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, specifically to perform well on the LM Arena benchmark, while the vanilla, unmodified model doesn't fare nearly as well on the same test.
Given that AI benchmarks, including our friendly Pokémon example, are already a bit hit-or-miss, these custom tweaks and non-standard approaches just make it even trickier to draw meaningful comparisons between models as they hit the market. It seems like comparing apples to apples might be getting harder by the day.