AI Benchmarks: Should We Ignore Them for Now?

April 10, 2025

Welcome to TechCrunch's regular AI newsletter! We're taking a little break, but don't worry, you can still get all our AI coverage, including my columns, daily analysis, and breaking news, right here at TechCrunch. Want to get these stories straight to your inbox every day? Just sign up for our daily newsletters here.

This week, Elon Musk's AI startup, xAI, released its latest flagship AI model, Grok 3, which powers the company's Grok chatbot apps. Trained on a whopping 200,000 GPUs, it outperforms a number of other top models, including some from OpenAI, on benchmarks for math, coding, and more.

But let's talk about what these benchmarks actually mean.

Here at TC, we report on these benchmark numbers, even if we're not always thrilled about it, because they're one of the few ways the AI industry tries to show how its models are improving. The problem is that popular AI benchmarks tend to test obscure knowledge and produce scores that don't really reflect how well a model handles the tasks people actually care about.

Ethan Mollick, a professor at Wharton, took to X to say there's a real need for better tests and independent groups to run them. He pointed out that AI companies often report their own benchmark results, which makes it hard to trust them completely.

"Public benchmarks are both 'meh' and saturated, leaving a lot of AI testing to be like food reviews, based on taste," Mollick wrote. "If AI is critical to work, we need more."

There are plenty of folks out there trying to come up with new benchmarks for AI, but no one can agree on what's best. Some think benchmarks should focus on economic impact to be useful, while others believe real-world adoption and usefulness are the true measures of success.

This debate could go on forever. Maybe, as X user Roon suggests, we should just pay less attention to new models and benchmarks unless there's a major AI breakthrough. It might be better for our sanity, even if it means missing out on some AI hype.

As mentioned, This Week in AI is taking a break. Thanks for sticking with us, readers, through all the ups and downs. Until next time.

News

OpenAI is trying to "uncensor" ChatGPT. Max wrote about how they're changing their approach to AI development to embrace "intellectual freedom," even on tough or controversial topics.

Mira Murati, former CTO of OpenAI, has a new startup called Thinking Machines Lab. They're working on tools to "make AI work for [people's] unique needs and goals."

xAI released Grok 3 and added new features to the Grok apps for iOS and the web.

Meta is hosting its first developer conference focused on generative AI this spring. It's called LlamaCon, after their Llama models, and it's happening on April 29.

Paul wrote about OpenEuroLLM, a project by around 20 organizations to build foundation models for "transparent AI in Europe" that respects the "linguistic and cultural diversity" of all EU languages.

Research paper of the week

OpenAI researchers have come up with a new AI benchmark called SWE-Lancer to test how well AI can code. It's made up of over 1,400 freelance software engineering tasks, from fixing bugs and adding features to proposing technical implementations.

OpenAI says the top-performing model, Anthropic's Claude 3.5 Sonnet, scored only 40.3% on the full SWE-Lancer benchmark, which suggests AI still has a long way to go. The researchers didn't test newer models like OpenAI's o3-mini or Chinese AI company DeepSeek's R1.

Model of the week

A Chinese AI company called Stepfun released an "open" AI model named Step-Audio that can understand and generate speech in Chinese, English, and Japanese. Users can even tweak the emotion and dialect of the synthetic audio, including singing.

Stepfun is one of several well-funded Chinese AI startups releasing models with permissive licenses. Founded in 2023, the company recently closed a funding round worth hundreds of millions of dollars from investors, including Chinese state-owned private equity firms.

Grab bag

Nous Research, an AI research group, claims to have released one of the first AI models that combines reasoning with "intuitive language model capabilities."

Their model, DeepHermes-3 Preview, can switch between short and long "chains of thought" to trade off accuracy against computational cost. In "reasoning" mode, it takes more time to solve harder problems and shows its thought process along the way.

Anthropic is reportedly planning to release a similar model soon, and OpenAI says it's on their near-term roadmap.

Comments (61)
JonathanDavis August 19, 2025 at 2:26:53 AM EDT

AI benchmarks are getting so hyped, but are they even reliable yet? 🤔 Feels like companies just cherry-pick numbers to flex. I’d rather see real-world use cases than some random leaderboard scores.

EdwardWalker August 19, 2025 at 1:00:59 AM EDT

AI benchmarks are getting so hyped, but are they even reliable yet? Feels like we're chasing numbers instead of real progress. 🤔 What do you all think—should we just ignore them for now?

HarrySmith August 11, 2025 at 3:00:59 PM EDT

AI benchmarks are cool, but are they just tech flexing? I’d rather see real-world uses than numbers on a chart. 🤔

BillyLewis August 4, 2025 at 2:01:00 AM EDT

AI benchmarks sound cool, but are they just overhyped numbers? I’m curious if they really tell us anything useful about real-world performance. 🧐

JimmyWilson July 31, 2025 at 10:48:18 PM EDT

AI benchmarks sound fancy, but are they just tech flexing? I mean, cool numbers, but do they really tell us how AI vibes in the real world? 🤔

JohnTaylor July 27, 2025 at 9:20:02 PM EDT

AI benchmarks sound fancy, but are they just tech flexing? I mean, cool numbers, but do they really tell us how AI impacts daily life? 🤔 Curious if we’re hyping stats over real-world use.
