AI 'Reasoning' Models Tested with NPR Sunday Puzzle Questions

Home

News

April 10, 2025

JamesJones

132

# ai # research # NPR

Every Sunday, NPR's Will Shortz, the mastermind behind The New York Times' crossword puzzles, engages thousands of listeners with his segment, the Sunday Puzzle. These puzzles are crafted to be solvable with general knowledge, yet they pose a significant challenge even to seasoned puzzle solvers.

This complexity is why some experts believe the Sunday Puzzle could serve as a valuable tool for testing the boundaries of AI's problem-solving capabilities.

In a recent study, researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor developed an AI benchmark using riddles from the Sunday Puzzle. Their findings revealed intriguing behaviors in reasoning models, including OpenAI's o1, which occasionally "give up" and offer incorrect answers knowingly.

Arjun Guha, a computer science professor at Northeastern and a co-author of the study, explained to TechCrunch that the goal was to create a benchmark that could be understood by anyone with general knowledge. He noted, "We wanted to develop a benchmark with problems that humans can understand with only general knowledge."

The AI industry currently faces a challenge with benchmarking, as many tests focus on advanced skills like PhD-level math and science, which aren't relevant to most users. Moreover, even recently released benchmarks are nearing saturation.

The Sunday Puzzle offers a unique advantage because it doesn't rely on specialized knowledge, and its format prevents AI models from simply regurgitating memorized answers, according to Guha. He elaborated, "I think what makes these problems hard is that it's really difficult to make meaningful progress on a problem until you solve it — that's when everything clicks together all at once. That requires a combination of insight and a process of elimination."

However, the Sunday Puzzle isn't without its limitations. It's centered around U.S. culture and uses only English, and there's a risk that models trained on these puzzles could "cheat" if they've seen the questions before. Guha reassures, though, that he hasn't found evidence of this yet. He added, "New questions are released every week, and we can expect the latest questions to be truly unseen. We intend to keep the benchmark fresh and track how model performance changes over time."

The researchers' benchmark, featuring about 600 Sunday Puzzle riddles, showed that reasoning models like o1 and DeepSeek's R1 significantly outperformed other models. These models meticulously fact-check themselves, which helps them avoid common pitfalls. However, this thoroughness means they take longer to reach a solution — typically a few seconds to minutes more.

Interestingly, DeepSeek's R1 sometimes admits defeat, saying "I give up," before offering a random incorrect answer — a reaction many humans can empathize with. Other peculiar behaviors observed include models giving a wrong answer, retracting it, attempting another guess, and failing again. Some models get stuck in endless loops of "thinking," provide nonsensical explanations, or correctly answer a question only to then explore alternative answers unnecessarily.

Guha remarked on R1's behavior, saying, "On hard problems, R1 literally says that it's getting 'frustrated.' It was funny to see how a model emulates what a human might say. It remains to be seen how 'frustration' in reasoning can affect the quality of model results."

NPR benchmark

R1 getting “frustrated” on a question in the Sunday Puzzle challenge set.Image Credits:Guha et al.

The current top performer on the benchmark is o1, achieving a 59% score, followed by the recently released o3-mini set to high "reasoning effort" at 47%. R1 scored 35%. The researchers plan to expand their testing to more reasoning models, hoping to pinpoint areas for improvement.

NPR benchmark

The scores of the models the team tested on their benchmark.Image Credits:Guha et al.

Guha emphasized the importance of accessible benchmarks, stating, "You don't need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don't require PhD-level knowledge. A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are — and aren't — capable of."

"Dot AI Companion App Announces Closure, Discontinues Personalized Service" Dot, an AI companion application designed to function as a personal friend and confidant, will cease operations, according to a Friday announcement from its developers. New Computer, the startup behind Dot, stated on its website that the service will

Anthropic Resolves Legal Case Over AI-Generated Book Piracy Anthropic has reached a resolution in a significant copyright dispute with US authors, agreeing to a proposed class action settlement that avoids a potentially costly trial. The agreement, filed in court documents this Tuesday, stems from allegations

Figma Releases AI-Powered App Builder Tool to All Users Figma Make, the innovative prompt-to-app development platform unveiled earlier this year, has officially exited beta and rolled out to all users. This groundbreaking tool joins the ranks of AI-powered coding assistants like Google's Gemini Code Assis

Comments (11)

0/200

Submit

StephenRamirez

July 22, 2025 at 2:33:07 AM EDT

NPR's Sunday Puzzle with AI? Sounds like a brain teaser showdown! I wonder if these models can outsmart Will Shortz’s tricky wordplay. 🤔

PaulTaylor

April 19, 2025 at 5:13:34 PM EDT

¡Esta herramienta de IA que resuelve los rompecabezas de los domingos de NPR es genial! Es como tener un amigo listo que ama los rompecabezas tanto como yo. A veces se equivoca, pero ¿quién no? ¡Sigue así, IA! 😄

StephenScott

April 19, 2025 at 6:57:20 AM EDT

This AI tool tackling NPR's Sunday Puzzles is super cool! It's like having a brainy friend who loves puzzles as much as I do. Sometimes it gets the answers wrong, but hey, who doesn't? Keep up the good work, AI! 🤓

CharlesThomas

April 18, 2025 at 10:09:55 PM EDT

NPRのサンデーパズルに挑戦するこのAIツール、めっちゃ面白い！パズル好きの友達がいるみたいで嬉しい。たまに答えを間違えるけど、誰でもそうなるよね。頑張ってね、AI！😊

JackMartin

April 13, 2025 at 6:51:16 AM EDT

NPRのサンデーパズルをAIで解くのは驚きです！これらのトリッキーな質問をモデルがどれだけうまく処理するかを見るのはクールです。時々間違えることもありますが、それでも印象的です。アルゴリズムを調整し続けてくださいね！🤓

RichardRoberts

April 13, 2025 at 4:54:45 AM EDT

Sử dụng AI để giải các câu đố Chủ Nhật của NPR thật là đáng kinh ngạc! Thật tuyệt khi thấy các mô hình xử lý tốt những câu hỏi khó khăn này. Đôi khi chúng sai, nhưng vẫn rất ấn tượng. Cứ tiếp tục điều chỉnh các thuật toán đó, các bạn! 🤓