Debates over AI benchmarking have reached Pokémon

Even the beloved world of Pokémon isn't immune to the drama surrounding AI benchmarks. A recent viral post on X stirred up quite the buzz, claiming that Google's latest Gemini model had outpaced Anthropic's flagship Claude model in the original Pokémon video games. According to the post, Gemini had made it to Lavender Town in a developer's Twitch stream, while Claude was still stuck at Mount Moon as of late February.
Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town
119 live views only btw, incredibly underrated stream pic.twitter.com/8AvSovAI4x
— Jush (@Jush21e8) April 10, 2025
However, what this post conveniently left out was the fact that Gemini had a bit of an unfair advantage. Savvy users over on Reddit quickly pointed out that the developer behind the Gemini stream had crafted a custom minimap. This nifty tool aids the model in recognizing "tiles" in the game, such as cuttable trees, which significantly cuts down the time Gemini needs to spend analyzing screenshots before deciding on its next move.
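To make the advantage concrete, here's a rough sketch of what a tile-labeling "minimap" layer might look like. To be clear, this is purely illustrative: the stream's actual implementation hasn't been published, and every name, tile ID, and function here is hypothetical. The point is just that once the game world is pre-labeled as symbols, the model can plan moves without parsing raw screenshots.

```python
# Hypothetical sketch of a "minimap" overlay: raw tile IDs from the game
# are mapped to labels an agent can reason over directly, instead of the
# model having to recognize trees and walls from pixels each turn.

TILE_LABELS = {
    0: "floor",      # walkable
    1: "wall",       # blocked
    2: "cut_tree",   # passable only after using Cut
    3: "water",      # passable only after using Surf
}

def render_minimap(tile_grid):
    """Convert a 2D grid of raw tile IDs into readable labels."""
    return [[TILE_LABELS.get(tile, "unknown") for tile in row] for row in tile_grid]

def walkable_neighbors(tile_grid, pos, can_cut=False):
    """List adjacent positions the agent could move to from `pos`."""
    rows, cols = len(tile_grid), len(tile_grid[0])
    r, c = pos
    moves = []
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        nr, nc = r + dr, c + dc
        if not (0 <= nr < rows and 0 <= nc < cols):
            continue
        tile = tile_grid[nr][nc]
        if tile == 0 or (tile == 2 and can_cut):
            moves.append((nr, nc))
    return moves

grid = [
    [0, 1, 0],
    [0, 2, 0],
    [0, 0, 0],
]
print(render_minimap(grid)[1])           # ['floor', 'cut_tree', 'floor']
print(walkable_neighbors(grid, (1, 0)))  # [(0, 0), (2, 0)]
```

Whether a given harness exposes this kind of structured state is exactly the sort of setup difference that makes cross-stream comparisons shaky.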
Now, while Pokémon might not be the most serious AI benchmark out there, it does serve as a fun yet telling example of how different setups can skew the results of these tests. Take Anthropic's recent model, Claude 3.7 Sonnet, for instance. On the SWE-bench Verified benchmark, which is meant to test coding prowess, it scored 62.3% accuracy. But with a "custom scaffold" that Anthropic developed, that score jumped to 70.3%.
And it doesn't stop there. Meta tuned a version of one of its newer models, Llama 4 Maverick, specifically for the LM Arena benchmark; the vanilla version of the model scored considerably worse on the same test.
Given that AI benchmarks, including our friendly Pokémon example, are already a bit hit-or-miss, these custom tweaks and non-standard approaches just make it even trickier to draw meaningful comparisons between models as they hit the market. It seems like comparing apples to apples might be getting harder by the day.