Debates over AI benchmarking have reached Pokémon

Even the beloved world of Pokémon isn't immune to the drama surrounding AI benchmarks. A recent viral post on X claimed that Google's latest Gemini model had outpaced Anthropic's flagship Claude model in the original Pokémon video game trilogy. According to the post, Gemini had reached Lavender Town in a developer's Twitch stream, while Claude was still stuck at Mount Moon as of late February.
Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town
119 live views only btw, incredibly underrated stream pic.twitter.com/8AvSovAI4x
— Jush (@Jush21e8) April 10, 2025
What the post left out, however, is that Gemini had an advantage. Users on Reddit quickly pointed out that the developer behind the Gemini stream had built a custom minimap. The tool helps the model identify "tiles" in the game, such as cuttable trees, which reduces how much time Gemini has to spend analyzing screenshots before deciding on its next move.
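To make the minimap idea concrete, here is a minimal Python sketch of how a harness might hand a model labeled tiles as text instead of raw pixels. It is an illustration only, not the stream developer's actual tool: the tile names, the `LEGEND` symbols, and the `render_minimap` function are all hypothetical.

```python
# Hypothetical tile legend. The real stream's tile set is not documented here.
LEGEND = {
    "walkable": ".",
    "wall": "#",
    "cuttable_tree": "T",
    "player": "@",
}


def render_minimap(grid):
    """Turn a 2D grid of tile names into text, one line per grid row.

    Unknown tile names are rendered as "?" so gaps in the legend stay visible.
    """
    return "\n".join(
        "".join(LEGEND.get(tile, "?") for tile in row) for row in grid
    )


# A tiny made-up 3x3 area around the player.
grid = [
    ["wall", "wall", "wall"],
    ["walkable", "cuttable_tree", "walkable"],
    ["walkable", "player", "walkable"],
]

print(render_minimap(grid))
```

A model given this text can see that a cuttable tree sits directly above the player without having to infer it from a screenshot, which is the kind of shortcut that makes results from different setups hard to compare.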
Pokémon is hardly the most serious AI benchmark, but it is a fun and telling example of how different setups can skew test results. Take Anthropic's recent model, Claude 3.7 Sonnet. On SWE-bench Verified, a benchmark designed to evaluate coding ability, it scored 62.3% accuracy. With a "custom scaffold" that Anthropic developed, that score rose to 70.3%.
It doesn't stop there. Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, specifically for the LM Arena benchmark. The vanilla version of the model didn't fare nearly as well on the same test.
AI benchmarks, Pokémon included, are imperfect measures to begin with, and custom tweaks and non-standard setups make it even harder to draw meaningful comparisons between models as they are released. Apples-to-apples comparisons seem to be getting harder by the day.
Comments (9)
Are they really using Pokémon in AI benchmarks? 😂 It sounds odd, but I'm curious how they do it. Do they have it play Pokémon Red/FireRed to see how many badges it earns without getting lost? That would be fun, although in the end these rankings sometimes feel like nothing more than a marketing war between the big tech companies. I want to see an official tournament of AIs playing! 🎮
But honestly, comparing AIs on Pokémon? 😂 It's like judging a Michelin-starred chef on their ability to make nuggets. This benchmark race is getting absurd. What's next, testing them on Candy Crush? In any case, it shows how desperately the labs are looking for original ways to stand out.
Whoa, AI playing Pokémon? That's wild! I wonder if Gemini's got a secret Pikachu strategy or just brute-forced its way through. Gotta catch 'em all, I guess! ⚡️
Debates over AI benchmarking in Pokémon? That's wild! I never thought I'd see the day when AI models are compared using Pokémon games. It's fun but kinda confusing. Can someone explain how Gemini outpaced Claude? 🤯