
AI Benchmarks: Should We Ignore Them for Now?

April 10, 2025
Mark Wilson

Welcome to TechCrunch's regular AI newsletter! We're taking a little break, but don't worry, you can still get all our AI coverage, including my columns, daily analysis, and breaking news, right here at TechCrunch. Want to get these stories straight to your inbox every day? Just sign up for our daily newsletters here.

This week, Elon Musk's AI startup, xAI, released its latest flagship AI model, Grok 3, which powers the company's Grok chatbot apps. Trained on a whopping 200,000 GPUs, the model beats a number of other top models, including some from OpenAI, on benchmarks for math, coding, and more.

But let's talk about what these benchmarks actually mean.

Here at TC, we report these benchmark numbers, even if we're not always thrilled about it, because they're one of the few ways the AI industry tries to show that its models are improving. The trouble is that popular AI benchmarks tend to measure esoteric knowledge, and their scores say little about how well a model handles the tasks most people actually care about.

Ethan Mollick, a professor at Wharton, took to X to say there's a real need for better tests and independent groups to run them. He pointed out that AI companies often report their own benchmark results, which makes it hard to trust them completely.

"Public benchmarks are both 'meh' and saturated, leaving a lot of AI testing to be like food reviews, based on taste," Mollick wrote. "If AI is critical to work, we need more."

There are plenty of folks out there trying to come up with new benchmarks for AI, but no one can agree on what's best. Some think benchmarks should focus on economic impact to be useful, while others believe real-world adoption and usefulness are the true measures of success.

This debate could go on forever. Maybe, like X user Roon suggests, we should just pay less attention to new models and benchmarks unless there's a major AI breakthrough. It might be better for our sanity, even if it means missing out on some AI hype.

As mentioned, This Week in AI is taking a break. Thanks for sticking with us, readers, through all the ups and downs. Until next time.

News

Image Credits: Nathan Laine/Bloomberg / Getty Images
OpenAI is trying to "uncensor" ChatGPT. Max wrote about how the company is changing its approach to AI development to embrace "intellectual freedom," even on difficult or controversial topics.

Mira Murati, former CTO of OpenAI, has a new startup called Thinking Machines Lab. The company is building tools to "make AI work for [people's] unique needs and goals."

xAI released Grok 3 and added new features to the Grok apps for iOS and the web.

Meta is hosting its first developer conference focused on generative AI this spring. Called LlamaCon, after the company's Llama family of models, it's happening on April 29.

Paul wrote about OpenEuroLLM, a project by around 20 organizations to build foundation models for "transparent AI in Europe" that respects the "linguistic and cultural diversity" of all EU languages.

Research paper of the week

OpenAI ChatGPT website displayed on a laptop screen is seen in this illustration photo.

Image Credits: Jakub Porzycki/NurPhoto / Getty Images
OpenAI researchers have come up with a new AI benchmark called SWE-Lancer to test how well AI can code. It's made up of over 1,400 freelance software engineering tasks, from fixing bugs and adding features to proposing technical implementations.

OpenAI says the best-performing model, Anthropic's Claude 3.5 Sonnet, scored only 40.3% on the full SWE-Lancer benchmark, suggesting AI still has a long way to go. The researchers didn't test newer models like OpenAI's o3-mini or Chinese AI lab DeepSeek's R1.

Model of the week

A Chinese AI company called Stepfun released an "open" AI model named Step-Audio that can understand and generate speech in Chinese, English, and Japanese. Users can even tweak the emotion and dialect of the synthetic audio, including singing.

Stepfun is one of several well-funded Chinese AI startups releasing models under permissive licenses. Founded in 2023, the company recently closed a funding round worth several hundred million dollars from investors that include Chinese state-owned private equity firms.

Grab bag

Nous Research DeepHermes

Image Credits: Nous Research
Nous Research, an AI research group, claims to have released one of the first AI models that combines reasoning with "intuitive language model capabilities."

The group's model, DeepHermes-3 Preview, can switch between short and long "chains of thought" to balance answer quality against compute. In "reasoning" mode, it takes more time to work through harder problems and shows its thought process along the way.
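
To make that toggle concrete, here's a minimal sketch of how a prompt-switched reasoning mode is typically driven with Hugging Face's transformers library. The repo name and system-prompt wording below are assumptions for illustration, not the exact strings Nous Research documents; the point is that the mode switch amounts to swapping the system prompt and allowing a larger generation budget.

```python
# Minimal, hypothetical sketch of a prompt-toggled "reasoning" mode.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "NousResearch/DeepHermes-3-Llama-3-8B-Preview"  # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def ask(question: str, reasoning: bool) -> str:
    # The mode switch is just a different system prompt: one asks for a long
    # chain of thought before the answer, the other for a direct reply.
    system = (
        "You are a deep-thinking assistant. Reason step by step inside "
        "<think></think> tags, then give your final answer."  # hypothetical wording
        if reasoning
        else "You are a helpful assistant. Answer concisely."
    )
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    # Reasoning mode needs a larger token budget, since the chain of thought
    # is generated before the answer itself.
    output = model.generate(input_ids, max_new_tokens=2048 if reasoning else 256)
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

print(ask("Is 9.11 greater than 9.9?", reasoning=True))
```

In practice, the "balance" Nous describes is a per-request tradeoff: the caller opts into more latency and more generated tokens in exchange for better answers on hard problems.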

Anthropic is reportedly planning to release a similar model soon, and OpenAI says one is on its near-term roadmap.

Comments (55)
FredAnderson April 10, 2025 at 1:30:25 PM GMT

Honestly, AI Benchmarks can be a bit misleading sometimes. I signed up for the daily newsletter hoping for some clarity, but it's just more of the same hype. Maybe we should indeed ignore them for now until there's a more reliable standard. Keep up the good work on the coverage though!

WilliamYoung April 11, 2025 at 3:44:49 AM GMT

Can AI benchmarks really be trusted? I signed up for the daily newsletter, but I didn't get as much useful information as I'd hoped. Maybe it's better to ignore them until more reliable standards come along. The rest of your coverage is great, though!

ChristopherDavis April 10, 2025 at 1:20:05 PM GMT

AI benchmarks can sometimes be misleading. I subscribed to the daily newsletter hoping for more clarity, but it's just more of the same. Maybe we should ignore them for now until there's a more reliable standard. Keep up the good work on the coverage!

StephenLee April 10, 2025 at 8:29:13 PM GMT

AI benchmarks can be a bit misleading at times. I signed up for the daily newsletter hoping for some clarity, but it's just more of the same hype. Maybe we really should ignore them for now until there's a more reliable standard. Keep up the good work on the coverage!

TimothyRoberts April 11, 2025 at 6:46:34 AM GMT

Honestly, AI benchmarks can sometimes be misleading. I signed up for the daily newsletter hoping for more clarity, but all I got was more hype. Maybe we should set them aside for now until a more reliable standard emerges. Your broader coverage is excellent, though!

NoahGreen April 11, 2025 at 12:48:46 PM GMT

I used to rely on AI benchmarks to gauge the performance of new tech, but this article made me think twice. Maybe we're focusing too much on numbers and not enough on practical use. Still, it's a good read for anyone in the AI field. Worth a ponder!
