AI Benchmarks: Should We Ignore Them for Now?


April 10, 2025

Welcome to TechCrunch's regular AI newsletter! We're taking a little break, but don't worry, you can still get all our AI coverage, including my columns, daily analysis, and breaking news, right here at TechCrunch. Want to get these stories straight to your inbox every day? Just sign up for our daily newsletters here.

This week, Elon Musk's AI startup, xAI, dropped their latest flagship AI model, Grok 3, which is powering the company's Grok chatbot apps. They trained it on a whopping 200,000 GPUs, and it's outperforming a bunch of other top models, including some from OpenAI, in benchmarks for math, coding, and more.

But let's talk about what these benchmarks actually mean.

Here at TC, we report on these benchmark numbers, even if we're not always thrilled about it, because they're one of the few relatively standardized ways the AI industry shows how its models are improving. The thing is, popular AI benchmarks often test esoteric knowledge and produce aggregate scores that say little about how well a model handles the tasks people actually care about.

Ethan Mollick, a professor at Wharton, took to X to say there's a real need for better tests and independent groups to run them. He pointed out that AI companies often report their own benchmark results, which makes it hard to trust them completely.

"Public benchmarks are both 'meh' and saturated, leaving a lot of AI testing to be like food reviews, based on taste," Mollick wrote. "If AI is critical to work, we need more."

There are plenty of folks out there trying to come up with new benchmarks for AI, but no one can agree on what's best. Some think benchmarks should focus on economic impact to be useful, while others believe real-world adoption and usefulness are the true measures of success.

This debate could go on forever. Maybe, as X user Roon suggests, we should just pay less attention to new models and benchmarks unless there's a major AI breakthrough. It might be better for our sanity, even if it means missing out on some AI hype.

As mentioned, This Week in AI is taking a break. Thanks for sticking with us, readers, through all the ups and downs. Until next time.

News

OpenAI is trying to "uncensor" ChatGPT. Max wrote about how they're changing their approach to AI development to embrace "intellectual freedom," even on tough or controversial topics.

Mira Murati, former CTO of OpenAI, has a new startup called Thinking Machines Lab. They're working on tools to "make AI work for [people's] unique needs and goals."

xAI released Grok 3 and added new features to the Grok apps for iOS and the web.

Meta is hosting its first developer conference focused on generative AI this spring. It's called LlamaCon, after their Llama models, and it's happening on April 29.

Paul wrote about OpenEuroLLM, a project by around 20 organizations to build foundation models for "transparent AI in Europe" that respect the "linguistic and cultural diversity" of all EU languages.

Research paper of the week

OpenAI researchers have come up with a new AI benchmark called SWE-Lancer to test how well AI can code. It's made up of over 1,400 freelance software engineering tasks, from fixing bugs and adding features to proposing technical implementations.

OpenAI says the top-performing model, Anthropic's Claude 3.5 Sonnet, scored only 40.3% on the full SWE-Lancer benchmark, which suggests AI still has a long way to go. The researchers didn't test newer models such as OpenAI's o3-mini or R1 from the Chinese AI company DeepSeek.

Model of the week

A Chinese AI company called Stepfun released an "open" AI model named Step-Audio that can understand and generate speech in Chinese, English, and Japanese. Users can adjust the emotion and even the dialect of the synthetic audio it creates, including singing.

Stepfun is one of several well-funded Chinese AI startups releasing models with permissive licenses. Founded in 2023, the company recently closed a funding round worth hundreds of millions from investors, including Chinese state-owned private equity firms.

Grab bag

Nous Research, an AI research group, claims to have released one of the first AI models that combines reasoning with "intuitive language model capabilities."

Their model, DeepHermes-3 Preview, can switch between short and long "chains of thought" to trade accuracy against computational cost. In "reasoning" mode, it takes more time to solve harder problems and shows its thought process along the way.

Anthropic is reportedly planning to release a similar model soon, and OpenAI says it's on their near-term roadmap.
