New AGI Test Proves Challenging, Stumps Majority of AI Models
April 10, 2025
WillPerez
The Arc Prize Foundation, co-founded by renowned AI researcher François Chollet, recently unveiled a new benchmark called ARC-AGI-2 in a blog post. This test aims to push the boundaries of AI's general intelligence, and so far, it's proving to be a tough nut to crack for most AI models.
According to the Arc Prize leaderboard, even advanced "reasoning" AI models like OpenAI's o1-pro and DeepSeek's R1 are only managing scores between 1% and 1.3%. Meanwhile, powerful non-reasoning models such as GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash are hovering around the 1% mark.
ARC-AGI tests challenge AI systems with puzzle-like problems, requiring them to identify visual patterns in grids of different-colored squares and generate the correct "answer" grid. These problems are designed to test an AI's ability to adapt to new, unseen challenges.
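For readers curious what such a task looks like in practice, here is a minimal sketch of one ARC-style puzzle, loosely following the JSON layout of the original public ARC dataset; the exact ARC-AGI-2 schema, and the toy rule and `solve` function below, are assumptions for illustration only. Each grid is a 2D array of integers 0-9, one integer per colored cell, and each task pairs a few demonstration input/output grids with a test input whose output the solver must produce.

```python
# A minimal sketch of an ARC-style task, loosely following the JSON layout
# of the public ARC dataset; the exact ARC-AGI-2 schema is an assumption.
# Each grid is a list of rows, and each cell is an integer 0-9 standing in
# for one of ten colors.

task = {
    "train": [  # demonstration pairs the solver may study
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [  # the solver sees only "input" and must produce "output"
        {"input": [[3, 0], [0, 3]]},
    ],
}

def solve(grid):
    """Hypothetical solver for this toy task: the rule demonstrated by the
    training pairs above is simply reversing each row of the grid."""
    return [row[::-1] for row in grid]

# Verify the inferred rule against the demonstration pairs, then apply it.
for pair in task["train"]:
    assert solve(pair["input"]) == pair["output"]

print(solve(task["test"][0]["input"]))  # -> [[0, 3], [3, 0]]
```

Real ARC-AGI-2 tasks are, of course, far harder than this toy example: the rule must be inferred fresh for every task, which is precisely what makes memorization ineffective.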
To establish a human baseline, the Arc Prize Foundation had more than 400 people take the ARC-AGI-2 test. On average, these human panels answered 60% of the questions correctly, significantly outperforming the AI models.

A sample question from ARC-AGI-2. Image Credits: Arc Prize

François Chollet took to X to claim that ARC-AGI-2 is a more accurate measure of an AI model's true intelligence compared to its predecessor, ARC-AGI-1. The Arc Prize Foundation's tests are designed to assess whether an AI can efficiently learn new skills beyond its training data.
Chollet emphasized that ARC-AGI-2 prevents AI models from relying on "brute force" computing power to solve problems, a flaw he acknowledged in the first test. To address this, ARC-AGI-2 introduces an efficiency metric and requires models to interpret patterns on the fly rather than relying on memorization.
In a blog post, Arc Prize Foundation co-founder Greg Kamradt stressed that intelligence isn't just about solving problems or achieving high scores. "The efficiency with which those capabilities are acquired and deployed is a crucial, defining component," he wrote. "The core question being asked is not just, 'Can AI acquire [the] skill to solve a task?' but also, 'At what efficiency or cost?'"
ARC-AGI-1 remained unbeaten for about five years until December 2024, when OpenAI's advanced reasoning model, o3, surpassed all other AI models and matched human performance. However, o3's success on ARC-AGI-1 came at a significant cost: the version that scored an impressive 75.7% on ARC-AGI-1, o3 (low), managed only a paltry 4% on ARC-AGI-2 while using $200 worth of computing power per task.

Comparison of frontier AI model performance on ARC-AGI-1 and ARC-AGI-2. Image Credits: Arc Prize

The introduction of ARC-AGI-2 comes at a time when many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Thomas Wolf, co-founder of Hugging Face, recently told TechCrunch that the AI industry lacks sufficient tests to measure key traits of artificial general intelligence, such as creativity.
Alongside the new benchmark, the Arc Prize Foundation announced the Arc Prize 2025 contest, challenging developers to achieve 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task.
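To put the two dollar figures in this article side by side, here is a small, purely illustrative Python sketch of the kind of cost-aware comparison the efficiency framing implies. It uses only numbers quoted above and is not the Arc Prize Foundation's actual scoring code.

```python
# Illustrative only: a toy cost-aware summary of ARC-AGI-2 results,
# using the figures quoted in this article. This is not the Arc Prize
# Foundation's actual scoring code.

results = [
    # (entrant, ARC-AGI-2 accuracy, estimated cost per task in USD)
    ("o3 (low)", 0.04, 200.00),
    ("human panel (average)", 0.60, None),  # cost not reported
]

TARGET_ACCURACY = 0.85  # Arc Prize 2025 accuracy target
TARGET_COST = 0.42      # Arc Prize 2025 spending cap, dollars per task

for entrant, accuracy, cost in results:
    meets = (accuracy >= TARGET_ACCURACY
             and cost is not None and cost <= TARGET_COST)
    cost_str = f"${cost:.2f}/task" if cost is not None else "cost n/a"
    print(f"{entrant}: {accuracy:.0%} at {cost_str} "
          f"-> meets 2025 target: {meets}")
```

Under this framing, a model must clear both bars at once, which is why o3 (low)'s $200-per-task showing falls so far short of the contest's $0.42-per-task goal.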
Comments (35)
StephenMartinez
April 10, 2025 at 3:27:48 PM GMT
The new AGI test from the Arc Prize Foundation is seriously tough! It's great to see AI being pushed to its limits, but man, it's humbling to see how many models can't crack it. François Chollet's work is always pushing the envelope. Keep at it, AI devs!
StevenSanchez
April 10, 2025 at 3:27:48 PM GMT
The Arc Prize Foundation's new AGI test is really difficult! It's wonderful to see AI pushed to its limits, but watching so many models fail to solve it is humbling. François Chollet's work is always breaking new ground. Keep at it, AI developers!
AndrewHernández
April 10, 2025 at 3:27:48 PM GMT
The Arc Prize Foundation's new AGI test is really hard! It's great to see AI pushed to its limits, but seeing so many models unable to solve it is humbling. François Chollet's work is always pioneering new territory. Keep up the effort, AI developers!
BrianGarcia
April 10, 2025 at 3:27:48 PM GMT
The new AGI test from the Arc Prize Foundation is seriously difficult! It's great to see AI being pushed to its limit, but man, it's humbling to see how many models can't solve it. François Chollet's work is always pushing the limits. Keep it up, AI developers!
GeorgeEvans
April 10, 2025 at 3:27:48 PM GMT
The new AGI test from the Arc Prize Foundation is seriously difficult! It's great to see AI pushed to its limits, but man, it's humbling to see how many models can't solve it. François Chollet's work is always pushing the envelope. Keep going, AI developers!
StevenLopez
April 11, 2025 at 12:18:46 AM GMT
This ARC-AGI-2 test is seriously tough! I tried it with a bunch of AI models and most of them just couldn't handle it. It's cool to see how it challenges the limits of AI, but man, it's frustrating when even the smart ones fail. Maybe next time, right?