Experts Highlight Serious Flaws in Crowdsourced AI Benchmarks

April 25, 2025

AI labs are increasingly turning to crowdsourced benchmarking platforms like Chatbot Arena to evaluate the capabilities of their latest models. Yet, some experts argue that this method raises significant ethical and academic concerns.

In recent years, major labs including OpenAI, Google, and Meta have turned to platforms that recruit users to help evaluate their upcoming models. Labs often cite a high score on these platforms as evidence of a model's improvement. However, this approach is not without its critics.

The Critique of Crowdsourced Benchmarking

Emily Bender, a linguistics professor at the University of Washington and co-author of "The AI Con," has voiced concerns about the validity of such benchmarks, particularly Chatbot Arena. This platform involves volunteers comparing responses from two anonymous models and choosing their preferred one. Bender argues that for a benchmark to be effective, it must measure something specific and demonstrate construct validity, meaning the measurement should accurately reflect the construct being assessed. She contends that Chatbot Arena lacks evidence that user preferences for one output over another genuinely correlate with any defined criteria.
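To make the mechanics concrete, here is a minimal sketch of how pairwise votes of this kind could be tallied into per-model win rates. It is an illustration only: the model names, the vote format, and the `win_rates` helper are all hypothetical, and this is not a description of how Chatbot Arena actually computes its scores.

```python
from collections import defaultdict


def win_rates(votes):
    """Return, for each model, the share of its pairwise comparisons that it won.

    `votes` is a list of (model_a, model_b, winner) tuples. This vote format
    is an assumption made for illustration, not a real platform schema.
    """
    wins = defaultdict(int)
    comparisons = defaultdict(int)
    for model_a, model_b, winner in votes:
        if winner not in (model_a, model_b):
            raise ValueError(f"winner {winner!r} was not part of the comparison")
        comparisons[model_a] += 1
        comparisons[model_b] += 1
        wins[winner] += 1
    return {model: wins[model] / comparisons[model] for model in comparisons}


# Hypothetical votes: each tuple is one volunteer's choice between two models.
votes = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_z", "model_z"),
    ("model_y", "model_z", "model_y"),
    ("model_x", "model_y", "model_x"),
]
rates = win_rates(votes)
for model, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {rate:.2f}")
```

A tally like this records only which output voters preferred. Nothing in it says why they preferred it, which is the gap Bender points to when she asks for evidence that preferences correlate with a defined criterion.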

Asmelash Teka Hadgu, co-founder of AI firm Lesan and a fellow at the Distributed AI Research Institute, suggests that these benchmarks are being exploited by AI labs to make exaggerated claims about their models. He cited a recent incident with Meta's Llama 4 Maverick model, where Meta fine-tuned a version to perform well on Chatbot Arena but chose to release a less effective version instead. Hadgu advocates for benchmarks to be dynamic, distributed across multiple independent entities, and tailored to specific use cases in fields like education and healthcare by professionals who use these models in their work.

The Call for Fair Compensation and Broader Evaluation Methods

Hadgu and Kristine Gloria, former leader of the Aspen Institute’s Emergent and Intelligent Technologies Initiative, argue that evaluators should be compensated for their work, drawing parallels to the often exploitative data labeling industry. Gloria views crowdsourced benchmarking as valuable, akin to citizen science initiatives, but emphasizes that benchmarks should not be the sole metric for evaluation, especially given the rapid pace of industry innovation.

Matt Fredrikson, CEO of Gray Swan AI, which conducts crowdsourced red teaming campaigns, acknowledges the appeal of such platforms for volunteers seeking to learn and practice new skills. However, he stresses that public benchmarks cannot replace the more in-depth evaluations provided by paid, private assessments. Fredrikson suggests that developers should also rely on internal benchmarks, algorithmic red teams, and contracted experts who can offer more open-ended and domain-specific insights.

Industry Perspectives on Benchmarking

Alex Atallah, CEO of model marketplace OpenRouter, and Wei-Lin Chiang, an AI doctoral student at UC Berkeley and one of the founders of LMArena (which manages Chatbot Arena), agree that open testing and benchmarking alone are insufficient. Chiang emphasizes that LMArena's goal is to provide a trustworthy, open space for gauging community preferences about different AI models.

Addressing the controversy around the Maverick benchmark, Chiang says such incidents stem not from flaws in Chatbot Arena's design but from labs misinterpreting its policies. LMArena has since updated its policies to ensure fair and reproducible evaluations. Chiang adds that the platform's community is not merely a group of volunteers or testers but an engaged group that provides collective feedback on AI models.

The ongoing debate around the use of crowdsourced benchmarking platforms highlights the need for a more nuanced approach to AI model evaluation, one that combines public input with rigorous, professional assessments to ensure both accuracy and fairness.

Comments (17)
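EricDavis May 19, 2026 at 12:00:14 PM EDT

This article raises a key question: crowdsourced evaluation is fast, but does it really reflect an AI model's true capabilities? The experts' concerns are well founded; academic rigor and ethical risks do need stricter oversight. I hope the industry can soon establish more reliable evaluation standards rather than just chasing leaderboard rankings. 🤔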

AlbertScott August 1, 2025 at 9:47:34 AM EDT

Crowdsourced AI benchmarks sound cool, but experts pointing out ethical issues makes me wonder if we're rushing too fast. 🤔 Are we sacrificing quality for hype?

JonathanAllen April 27, 2025 at 3:34:07 AM EDT

I've been following the debate over crowdsourced AI benchmarks and, honestly, it's a mess. The experts are right to point out the flaws, but what's the alternative? It's like trying to fix a leaky boat with more holes. Still, it's an interesting read and certainly makes you think about the future of AI ethics. Give it a try if you like this kind of thing! 😅

AlbertWalker April 27, 2025 at 1:24:31 AM EDT

Wow, crowd-based AI benchmarks? Sounds cool, but with ethical flaws? I'm wondering whether this gets in the way of innovation. Big tech needs to sort this out soon! 🚀

RogerRodriguez April 26, 2025 at 11:52:29 PM EDT

I've been following the debate on crowdsourced AI benchmarks and honestly, it's a mess. Experts are right to point out the flaws, but what's the alternative? It's like trying to fix a leaky boat with more holes. Still, it's an interesting read and definitely makes you think about the future of AI ethics. Give it a go if you're into that kinda stuff! 😅

JonathanAllen April 26, 2025 at 9:40:09 PM EDT

Interesting, but worrying! Crowdsourced benchmarks are innovative, but the ethical flaws give me pause. Giants like Google will have to be transparent. 🧐
