
AI Benchmarks: Should We Ignore Them for Now?

April 10, 2025
Mark Wilson

Welcome to TechCrunch's regular AI newsletter! We're taking a little break, but don't worry, you can still get all our AI coverage, including my columns, daily analysis, and breaking news, right here at TechCrunch. Want to get these stories straight to your inbox every day? Just sign up for our daily newsletters here.

This week, Elon Musk's AI startup, xAI, released its latest flagship AI model, Grok 3, which powers the company's Grok chatbot apps. Trained on a whopping 200,000 GPUs, the model beats a number of other top models, including some from OpenAI, on benchmarks for math, coding, and more.

But let's talk about what these benchmarks actually mean.

Here at TC, we report these benchmark numbers, even if we're not always thrilled about it, because they're one of the few ways the AI industry tries to show that its models are improving. The trouble is that popular AI benchmarks tend to measure esoteric knowledge, and their scores say little about how well a model handles the tasks most people actually care about.

Ethan Mollick, a professor at Wharton, took to X to say there's a real need for better tests and independent groups to run them. He pointed out that AI companies often report their own benchmark results, which makes it hard to trust them completely.

"Public benchmarks are both 'meh' and saturated, leaving a lot of AI testing to be like food reviews, based on taste," Mollick wrote. "If AI is critical to work, we need more."

There are plenty of folks out there trying to come up with new benchmarks for AI, but no one can agree on what's best. Some think benchmarks should focus on economic impact to be useful, while others believe real-world adoption and usefulness are the true measures of success.

This debate could go on forever. Maybe, like X user Roon suggests, we should just pay less attention to new models and benchmarks unless there's a major AI breakthrough. It might be better for our sanity, even if it means missing out on some AI hype.

As mentioned, This Week in AI is taking a break. Thanks for sticking with us, readers, through all the ups and downs. Until next time.

News

Image Credits: Nathan Laine/Bloomberg / Getty Images
OpenAI is trying to "uncensor" ChatGPT. Max wrote about how the company is changing its approach to AI development to embrace "intellectual freedom," even on difficult or controversial topics.

Mira Murati, former CTO of OpenAI, has a new startup called Thinking Machines Lab. The company is building tools to "make AI work for [people's] unique needs and goals."

xAI released Grok 3 and added new features to the Grok apps for iOS and the web.

Meta is hosting its first developer conference focused on generative AI this spring. Called LlamaCon, after the company's Llama family of models, it's happening on April 29.

Paul wrote about OpenEuroLLM, a project by around 20 organizations to build foundation models for "transparent AI in Europe" that respects the "linguistic and cultural diversity" of all EU languages.

Research paper of the week

OpenAI ChatGPT website displayed on a laptop screen is seen in this illustration photo.

Image Credits: Jakub Porzycki/NurPhoto / Getty Images
OpenAI researchers have come up with a new AI benchmark called SWE-Lancer to test how well AI can code. It's made up of over 1,400 freelance software engineering tasks, from fixing bugs and adding features to proposing technical implementations.

OpenAI says the best-performing model, Anthropic's Claude 3.5 Sonnet, scored only 40.3% on the full SWE-Lancer benchmark, suggesting AI still has a long way to go. The researchers didn't test newer models like OpenAI's o3-mini or Chinese AI lab DeepSeek's R1.

Model of the week

A Chinese AI company called Stepfun released an "open" AI model named Step-Audio that can understand and generate speech in Chinese, English, and Japanese. Users can even tweak the emotion and dialect of the synthetic audio, including singing.

Stepfun is one of several well-funded Chinese AI startups releasing models under permissive licenses. Founded in 2023, the company recently closed a funding round worth several hundred million dollars from investors that include Chinese state-owned private equity firms.

Grab bag

Nous Research DeepHermes

Image Credits: Nous Research
Nous Research, an AI research group, claims to have released one of the first AI models that combines reasoning with "intuitive language model capabilities."

The group's model, DeepHermes-3 Preview, can switch between short and long "chains of thought" to balance answer quality against compute. In "reasoning" mode, it takes more time to work through harder problems and shows its thought process along the way.
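
To make that toggle concrete, here's a minimal sketch of how a prompt-switched reasoning mode is typically driven with Hugging Face's transformers library. The repo name and system-prompt wording below are assumptions for illustration, not the exact strings Nous Research documents; the point is that the mode switch amounts to swapping the system prompt and allowing a larger generation budget.

```python
# Minimal, hypothetical sketch of a prompt-toggled "reasoning" mode.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "NousResearch/DeepHermes-3-Llama-3-8B-Preview"  # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def ask(question: str, reasoning: bool) -> str:
    # The mode switch is just a different system prompt: one asks for a long
    # chain of thought before the answer, the other for a direct reply.
    system = (
        "You are a deep-thinking assistant. Reason step by step inside "
        "<think></think> tags, then give your final answer."  # hypothetical wording
        if reasoning
        else "You are a helpful assistant. Answer concisely."
    )
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Reasoning mode needs a larger token budget, since the chain of thought
    # is generated before the answer itself.
    output = model.generate(input_ids, max_new_tokens=2048 if reasoning else 256)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

print(ask("Is 9.11 greater than 9.9?", reasoning=True))
```

In practice, the "balance" Nous describes is a per-request tradeoff: the caller opts into more latency and more generated tokens in exchange for better answers on hard problems.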

Anthropic is reportedly planning to release a similar model soon, and OpenAI says one is on its near-term roadmap.

Comments (55)
FredAnderson April 10, 2025 at 1:30:25 PM GMT

Honestly, AI Benchmarks can be a bit misleading sometimes. I signed up for the daily newsletter hoping for some clarity, but it's just more of the same hype. Maybe we should indeed ignore them for now until there's a more reliable standard. Keep up the good work on the coverage though!

WilliamYoung April 11, 2025 at 3:44:49 AM GMT

Can AI benchmarks really be trusted? I signed up for the daily newsletter, but I didn't get as much useful information as I'd hoped. Maybe it's better to ignore them until more reliable standards come along. The rest of your coverage is great, though!

ChristopherDavis April 10, 2025 at 1:20:05 PM GMT

AI benchmarks can sometimes be misleading. I subscribed to the daily newsletter hoping for more clarity, but it's just more of the same. Maybe we should ignore them for now until there's a more reliable standard. Keep up the good work on the coverage!

StephenLee April 10, 2025 at 8:29:13 PM GMT

AI benchmarks can be a bit misleading at times. I signed up for the daily newsletter hoping for some clarity, but it's just more of the same hype. Maybe we really should ignore them for now until there's a more reliable standard. Keep up the good work on the coverage!

TimothyRoberts April 11, 2025 at 6:46:34 AM GMT

Honestly, AI benchmarks can sometimes be misleading. I signed up for the daily newsletter hoping for more clarity, but all I got was more hype. Maybe we should set them aside for now until a more reliable standard emerges. Your broader coverage is excellent, though!

NoahGreen April 11, 2025 at 12:48:46 PM GMT

I used to rely on AI benchmarks to gauge the performance of new tech, but this article made me think twice. Maybe we're focusing too much on numbers and not enough on practical use. Still, it's a good read for anyone in the AI field. Worth a ponder!
