option
Home
News
AI 'Reasoning' Models Surge, Driving Up Benchmarking Costs

AI 'Reasoning' Models Surge, Driving Up Benchmarking Costs

April 22, 2025
162

AI

The Rising Costs of Benchmarking AI Reasoning Models

AI labs like OpenAI have been touting their advanced "reasoning" AI models, which are designed to tackle complex problems step by step. These models, particularly effective in fields like physics, are indeed impressive. However, they come with a hefty price tag when it comes to benchmarking, making it challenging for independent verification of their capabilities.

According to data from Artificial Analysis, a third-party AI testing firm, the cost to evaluate OpenAI's o1 reasoning model across seven popular AI benchmarks is a staggering $2,767.05. These benchmarks include MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500. In contrast, benchmarking Anthropic's "hybrid" reasoning model, Claude 3.7 Sonnet, on the same tests cost $1,485.35, while OpenAI's o3-mini-high was significantly cheaper at $344.59.

Not all reasoning models are equally expensive to test. For instance, Artificial Analysis spent only $141.22 evaluating OpenAI’s o1-mini. However, the costs of these models tend to be high on average. Artificial Analysis has shelled out around $5,200 to evaluate about a dozen reasoning models, which is nearly double the $2,400 spent on analyzing over 80 non-reasoning models.

For comparison, the non-reasoning GPT-4o model from OpenAI, released in May 2024, cost Artificial Analysis just $108.85 to evaluate, while Claude 3.6 Sonnet, the non-reasoning predecessor to Claude 3.7 Sonnet, cost $81.41.

George Cameron, co-founder of Artificial Analysis, shared with TechCrunch that the organization is prepared to increase its benchmarking budget as more AI labs continue to develop reasoning models. "At Artificial Analysis, we run hundreds of evaluations monthly and devote a significant budget to these," Cameron stated. "We are planning for this spend to increase as models are more frequently released."

Artificial Analysis isn't alone in facing these escalating costs. Ross Taylor, CEO of AI startup General Reasoning, recently spent $580 to evaluate Claude 3.7 Sonnet on around 3,700 unique prompts. Taylor estimates that a single run-through of MMLU Pro, a benchmark designed to test language comprehension, would exceed $1,800.

Taylor highlighted a growing concern in a recent post on X, stating, "We’re moving to a world where a lab reports x% on a benchmark where they spend y amount of compute, but where resources for academics are << y. No one is going to be able to reproduce the results."

Why Are Reasoning Models So Expensive to Benchmark?

The primary reason for the high cost of testing reasoning models is their tendency to generate a substantial number of tokens. Tokens are units of raw text; for example, the word "fantastic" might be broken down into "fan," "tas," and "tic." According to Artificial Analysis, OpenAI's o1 model generated over 44 million tokens during their tests, which is roughly eight times the number of tokens generated by the non-reasoning GPT-4o model.

Most AI companies charge for model usage based on the number of tokens, which quickly adds up. Additionally, modern benchmarks are designed to elicit a high number of tokens by including questions that involve complex, multi-step tasks. Jean-Stanislas Denain, a senior researcher at Epoch AI, explained to TechCrunch, "Today’s benchmarks are more complex even though the number of questions per benchmark has overall decreased. They often attempt to evaluate models’ ability to do real-world tasks, such as write and execute code, browse the internet, and use computers."

Denain also pointed out that the cost per token for the most expensive models has been rising. For example, when Anthropic's Claude 3 Opus was released in May 2024, it cost $75 per million output tokens. In contrast, OpenAI's GPT-4.5 and o1-pro, launched earlier that year, cost $150 and $600 per million output tokens, respectively.

Despite the increasing cost per token, Denain noted, "Since models have gotten better over time, it’s still true that the cost to reach a given level of performance has greatly decreased over time. But if you want to evaluate the best largest models at any point in time, you’re still paying more."

The Integrity of Benchmarking

Many AI labs, including OpenAI, offer free or subsidized access to their models for benchmarking purposes. However, this practice raises concerns about the integrity of the evaluation process. Even without evidence of manipulation, the mere suggestion of an AI lab's involvement can cast doubt on the objectivity of the results.

Ross Taylor expressed this concern on X, asking, "From a scientific point of view, if you publish a result that no one can replicate with the same model, is it even science anymore? (Was it ever science, lol)"

The high costs and potential biases in AI benchmarking underscore the challenges facing the field as it strives to develop and validate increasingly sophisticated models.

Related article
Anthropic's experimental AI Claude completes negotiations and transactions in e-commerce test Anthropic's experimental AI Claude completes negotiations and transactions in e-commerce test As artificial intelligence advances rapidly, Anthropic quietly rolled out an internal experiment called "Project Deal" last Friday, showcasing AI's potential in e-commerce. The experiment had its AI model Claude autonomously handle buying, selling, a
DeepSeek Code poised for launch DeepSeek Code poised for launch As AI technology accelerates, DeepSeek is at a thrilling juncture. The AI company recently revealed it has secured over 70 billion yuan in funding. Leadership has emphasized a commitment to groundbreaking AI research over immediate commercial gains.
Musk’s Grok: 1.5 Trillion Parameters and Cursor Code Absorption—Game Changer or Bluff? Musk’s Grok: 1.5 Trillion Parameters and Cursor Code Absorption—Game Changer or Bluff? Elon Musk is finally making a move.In the AI programming race, OpenAI and Anthropic are accelerating, while xAI appears to be lagging. Musk has often stated his aim to rival Claude, yet despite multiple updates to the Grok4.X series, the results look
Related Special Topic Recommendations
Business Best AI Recruiting Tools: Screen Resumes & Automate Candidate Interview Scheduling
Best AI Recruiting Tools: Screen Resumes & Automate Candidate Interview Scheduling

Discover the 2026 latest top-rated AI recruiting tools on XIX.AI. Our curated list features powerful, game-changing solutions for screening resumes and automating candidate interview scheduling. Compare free vs paid options with real-world tests and weekly updated rankings. Find your perfect hiring assistant and streamline your recruitment today!

10 tools
xix.ai
Productivity AI Personal Wellness & Focus Coaches: Manage Burnout & Boost Mental Energy Levels
AI Personal Wellness & Focus Coaches: Manage Burnout & Boost Mental Energy Levels

Discover the 2026 best AI personal wellness and focus coaches on XIX.AI. Our curated rankings feature top-rated, game-changing tools to manage burnout and boost mental energy. Compare free vs paid options with real-world insights. Unlock your path to peak productivity and well-being today.

10 tools
xix.ai
chatbot Top-Rated AI Romantic Chatbots: Build Long-Term Relationships with Consistent Personalities
Top-Rated AI Romantic Chatbots: Build Long-Term Relationships with Consistent Personalities

Discover the 2026 latest top-rated AI romantic chatbots for building genuine, long-term connections. Our curated list features powerful, consistent personalities, free vs paid comparisons, and real-world tests. Find your perfect companion and start building today at XIX.AI.

10 tools
xix.ai
Education and Learning Best AI Data Science Mentors: Master SQL, Pandas & Machine Learning Workflows
Best AI Data Science Mentors: Master SQL, Pandas & Machine Learning Workflows

Discover the 2026 best AI data science mentors to master SQL, Pandas & ML workflows. Explore our top-rated, curated selection at XIX.AI for powerful, game-changing guidance. Compare free vs paid options with real-world insights. Unlock your data science mastery today.

10 tools
xix.ai
chatbot Best AI Flirting & Conversation Trainers: Improve Social Charisma and Confidence in Real-Time
Best AI Flirting & Conversation Trainers: Improve Social Charisma and Confidence in Real-Time

Discover the 2026 best AI flirting and conversation trainers on XIX.AI. Our curated, top-rated selection helps you build social charisma and confidence in real-time. Explore must-try, game-changing tools with free vs paid comparisons and weekly updated rankings. Unlock your social edge today.

10 tools
xix.ai
code Best AI Tools for Automated Unit Testing: Generate Jest, PyTest & JUnit Test Cases in One Click
Best AI Tools for Automated Unit Testing: Generate Jest, PyTest & JUnit Test Cases in One Click

Discover the 2026 latest top-rated AI tools for automated unit testing. Our curated selection features powerful, game-changing solutions to generate Jest, PyTest & JUnit test cases instantly. Compare free vs paid options with real-world tests and weekly updated rankings on XIX.AI. Unlock your AI edge and boost development productivity today.

10 tools
xix.ai
Comments (17)
0/500
FrankJackson
FrankJackson August 10, 2025 at 5:01:00 AM EDT

These AI reasoning models are impressive for tackling complex physics problems step by step, but the surging benchmarking costs could stifle innovation for smaller labs. 😟 Reminds me of how tech giants dominate—maybe we need more affordable alternatives?

DouglasRodriguez
DouglasRodriguez July 27, 2025 at 9:20:21 PM EDT

These AI reasoning models sound cool, but the skyrocketing benchmarking costs are wild! 😳 Makes me wonder if smaller labs can even keep up with the big players like OpenAI.

StevenGonzalez
StevenGonzalez April 24, 2025 at 8:58:05 AM EDT

These AI reasoning models are impressive, but the rising costs of benchmarking are a real bummer. It's great for fields like physics, but I hope they find a way to make it more affordable. Otherwise, it's just for the big players. 😕

JackPerez
JackPerez April 24, 2025 at 3:52:48 AM EDT

Esses modelos de raciocínio de IA são impressionantes, mas o aumento dos custos de benchmarking é uma decepção. É ótimo para áreas como a física, mas espero que encontrem uma maneira de torná-lo mais acessível. Caso contrário, será apenas para os grandes jogadores. 😕

GregoryJones
GregoryJones April 24, 2025 at 3:10:43 AM EDT

AI推論モデルは素晴らしいけど、ベンチマーキングのコストが上がるのは残念です。物理分野には良いけど、もっと手頃な価格になる方法を見つけてほしいです。さもないと、大手企業だけのものになってしまいますね。😕

SamuelRoberts
SamuelRoberts April 24, 2025 at 12:23:58 AM EDT

Esses modelos de raciocínio de IA parecem legais, mas o aumento dos custos de benchmarking? Não tanto. Será que podemos ter os benefícios sem falir? 🤔

OR