AI Benchmarks: Should We Ignore Them for Now?
Welcome to TechCrunch's regular AI newsletter! We're taking a little break, but don't worry, you can still get all our AI coverage, including my columns, daily analysis, and breaking news, right here at TechCrunch. Want to get these stories straight to your inbox every day? Just sign up for our daily newsletters here.
This week, Elon Musk's AI startup, xAI, dropped its latest flagship AI model, Grok 3, which powers the company's Grok chatbot apps. Trained on around 200,000 GPUs, Grok 3 beats a number of leading models, including some from OpenAI, on benchmarks for math, coding, and more.
But let's talk about what these benchmarks actually mean.
Here at TC, we report on these benchmark numbers, even if we're not always thrilled about it, because they're one of the few ways the AI industry tries to show off how their models are improving. The thing is, these popular AI benchmarks often focus on obscure stuff and give scores that don't really reflect how well the AI does the things people actually care about.
Ethan Mollick, a professor at Wharton, took to X to say there's a real need for better tests and independent groups to run them. He pointed out that AI companies often report their own benchmark results, which makes it hard to trust them completely.
"Public benchmarks are both 'meh' and saturated, leaving a lot of AI testing to be like food reviews, based on taste," Mollick wrote. "If AI is critical to work, we need more."
There are plenty of folks out there trying to come up with new benchmarks for AI, but no one can agree on what's best. Some think benchmarks should focus on economic impact to be useful, while others believe real-world adoption and usefulness are the true measures of success.
This debate could go on forever. Maybe, like X user Roon suggests, we should just pay less attention to new models and benchmarks unless there's a major AI breakthrough. It might be better for our sanity, even if it means missing out on some AI hype.
As mentioned, This Week in AI is taking a break. Thanks for sticking with us, readers, through all the ups and downs. Until next time.
News

OpenAI is trying to "uncensor" ChatGPT. Max wrote about how the company is changing its approach to AI development to embrace "intellectual freedom," even on tough or controversial topics.
Mira Murati, former CTO of OpenAI, has a new startup called Thinking Machines Lab. The startup is working on tools to "make AI work for [people's] unique needs and goals."
xAI released Grok 3 and added new features to the Grok apps for iOS and the web.
Meta is hosting its first developer conference focused on generative AI this spring. It's called LlamaCon, after the company's Llama family of models, and it takes place on April 29.
Paul wrote about OpenEuroLLM, a project by around 20 organizations to build foundation models for "transparent AI in Europe" that respects the "linguistic and cultural diversity" of all EU languages.
Research paper of the week

OpenAI researchers have come up with a new AI benchmark called SWE-Lancer to test how well AI can code. It's made up of over 1,400 freelance software engineering tasks, from fixing bugs and adding features to proposing technical implementations.
OpenAI says the top-performing model, Anthropic's Claude 3.5 Sonnet, scored just 40.3% on the full SWE-Lancer benchmark, a sign that AI still has a long way to go. The researchers didn't test newer models such as OpenAI's o3-mini or Chinese AI lab DeepSeek's R1.
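Headline figures like that 40.3% are typically an aggregate over many per-task pass/fail results, and the aggregation method matters: a raw pass rate and a difficulty-weighted score can diverge quite a bit. Here's a minimal, hypothetical sketch of both kinds of aggregation. The task names, payouts, and results below are invented for illustration and are not SWE-Lancer's actual data or scoring pipeline.

```python
# Hypothetical benchmark scoring sketch: collapse per-task pass/fail
# results into a single headline percentage. All task data is made up.
tasks = [
    {"name": "fix-login-bug", "payout": 250, "passed": True},
    {"name": "add-export-feature", "payout": 1000, "passed": False},
    {"name": "propose-db-migration", "payout": 500, "passed": True},
    {"name": "patch-xss-issue", "payout": 250, "passed": False},
]

# Unweighted pass rate: the fraction of tasks the model solved.
pass_rate = sum(t["passed"] for t in tasks) / len(tasks)

# Payout-weighted score: better-paid (presumably harder) tasks count
# for more, so failing one big task outweighs passing two small ones.
earned = sum(t["payout"] for t in tasks if t["passed"])
total = sum(t["payout"] for t in tasks)
weighted = earned / total

print(f"pass rate: {pass_rate:.1%}, payout-weighted: {weighted:.1%}")
# → pass rate: 50.0%, payout-weighted: 37.5%
```

Since SWE-Lancer's tasks are drawn from real freelance jobs with real dollar values attached, a value-weighted view like the second metric is one plausible reading of "40.3% of the benchmark," though the paper's exact methodology may differ.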
Model of the week
A Chinese AI company called Stepfun released an "open" AI model named Step-Audio that can understand and generate speech in Chinese, English, and Japanese. Users can even tweak the emotion and dialect of the synthetic audio, including singing.
Stepfun is one of several well-funded Chinese AI startups releasing models with permissive licenses. Founded in 2023, the company recently closed a funding round worth hundreds of millions of dollars from investors including Chinese state-owned private equity firms.
Grab bag

Nous Research, an AI research group, claims to have released one of the first AI models that combines reasoning with "intuitive language model capabilities."
The group's model, DeepHermes-3 Preview, can switch between short and long "chains of thought" to trade off accuracy against compute. In "reasoning" mode, it takes more time to work through harder problems and shows its thought process along the way.
Anthropic is reportedly planning to release a similar model soon, and OpenAI says such a model is on its near-term roadmap.
Comments (58)
BillyLewis
August 4, 2025 at 2:01:00 AM EDT
AI benchmarks sound cool, but are they just overhyped numbers? I’m curious if they really tell us anything useful about real-world performance. 🧐
JimmyWilson
July 31, 2025 at 10:48:18 PM EDT
AI benchmarks sound fancy, but are they just tech flexing? I mean, cool numbers, but do they really tell us how AI vibes in the real world? 🤔
JohnTaylor
July 27, 2025 at 9:20:02 PM EDT
AI benchmarks sound fancy, but are they just tech flexing? I mean, cool numbers, but do they really tell us how AI impacts daily life? 🤔 Curious if we’re hyping stats over real-world use.
ChristopherThomas
April 26, 2025 at 1:57:18 AM EDT
I'm on the fence about AI benchmarks. They seem useful but also kinda miss the point sometimes. It's like judging a book by its cover. Still, it's good to have some metrics, right? Maybe we should take them with a grain of salt for now. 🤔
BrianWalker
April 25, 2025 at 3:19:34 PM EDT
I'm not sure about AI benchmarks. They seem useful, but sometimes they also miss the main point. It's like judging a book by its cover. Still, having some metrics is good, right? Maybe we should view them with a bit of skepticism for now. 🤔
CharlesMartinez
April 22, 2025 at 11:01:53 AM EDT
I'm on the fence about AI benchmarks. They seem useful, but sometimes they also miss the point. It's like judging a book by its cover. Still, it's good to have some metrics, right? Maybe we should take them with a grain of salt for now. 🤔