AI 'Reasoning' Models Tested with NPR Sunday Puzzle Questions
Every Sunday, NPR's Will Shortz, the mastermind behind The New York Times' crossword puzzles, engages thousands of listeners with his segment, the Sunday Puzzle. These puzzles are crafted to be solvable with general knowledge, yet they pose a significant challenge even to seasoned puzzle solvers.
This complexity is why some experts believe the Sunday Puzzle could serve as a valuable tool for testing the boundaries of AI's problem-solving capabilities.
In a recent study, researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor developed an AI benchmark using riddles from the Sunday Puzzle. Their findings revealed intriguing behaviors in reasoning models, including OpenAI's o1, which occasionally "give up" and offer answers they know are incorrect.
Arjun Guha, a computer science professor at Northeastern and a co-author of the study, explained to TechCrunch that the goal was to create a benchmark that could be understood by anyone with general knowledge. He noted, "We wanted to develop a benchmark with problems that humans can understand with only general knowledge."
The AI industry currently faces a challenge with benchmarking, as many tests focus on advanced skills like PhD-level math and science, which aren't relevant to most users. Moreover, even recently released benchmarks are nearing saturation.
The Sunday Puzzle offers a unique advantage because it doesn't rely on specialized knowledge, and its format prevents AI models from simply regurgitating memorized answers, according to Guha. He elaborated, "I think what makes these problems hard is that it's really difficult to make meaningful progress on a problem until you solve it — that's when everything clicks together all at once. That requires a combination of insight and a process of elimination."
However, the Sunday Puzzle isn't without its limitations. It's centered on U.S. culture and uses only English, and there's a risk that models that have seen these questions in their training data could "cheat." Guha says, though, that he hasn't found evidence of this yet. He added, "New questions are released every week, and we can expect the latest questions to be truly unseen. We intend to keep the benchmark fresh and track how model performance changes over time."
The researchers' benchmark, featuring about 600 Sunday Puzzle riddles, showed that reasoning models such as o1 and DeepSeek's R1 significantly outperformed other models. These models meticulously fact-check themselves, which helps them avoid common pitfalls, but that thoroughness also means they take longer to reach a solution, typically by anywhere from a few extra seconds to several extra minutes.
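The paper's evaluation harness isn't reproduced here, but a minimal sketch of how a riddle benchmark like this could be scored might look like the following. The function names, the exact-match grading, and the ask_model callback are illustrative assumptions, not the authors' actual pipeline.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers still match."""
    return " ".join(text.lower().split())


def score(benchmark: list[tuple[str, str]], ask_model: Callable[[str], str]) -> float:
    """Return the fraction of riddles answered correctly under exact-match grading."""
    correct = sum(
        normalize(ask_model(riddle)) == normalize(expected)
        for riddle, expected in benchmark
    )
    return correct / len(benchmark)


if __name__ == "__main__":
    # Tiny made-up example; the real benchmark contains roughly 600 riddles.
    sample = [("Name a U.S. state whose name ends in a vowel other than 'a'.", "Ohio")]
    print(score(sample, lambda riddle: "Ohio"))  # 1.0 with this stand-in "model"
```

A real harness would also need to handle answer variants and response time or token limits, which a simple exact-match check like this glosses over.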
Interestingly, DeepSeek's R1 sometimes admits defeat, saying "I give up," before offering a random incorrect answer — a reaction many humans can empathize with. Other peculiar behaviors observed include models giving a wrong answer, retracting it, attempting another guess, and failing again. Some models get stuck in endless loops of "thinking," provide nonsensical explanations, or correctly answer a question only to then explore alternative answers unnecessarily.
Guha remarked on R1's behavior, saying, "On hard problems, R1 literally says that it's getting 'frustrated.' It was funny to see how a model emulates what a human might say. It remains to be seen how 'frustration' in reasoning can affect the quality of model results."

R1 getting "frustrated" on a question in the Sunday Puzzle challenge set. Image Credits: Guha et al.
The current top performer on the benchmark is o1, achieving a 59% score, followed by the recently released o3-mini set to high "reasoning effort" at 47%. R1 scored 35%. The researchers plan to expand their testing to more reasoning models, hoping to pinpoint areas for improvement.

The scores of the models the team tested on their benchmark. Image Credits: Guha et al.
Guha emphasized the importance of accessible benchmarks, stating, "You don't need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don't require PhD-level knowledge. A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are — and aren't — capable of."
Comments (11)
StephenRamirez
July 22, 2025 at 2:33:07 AM EDT
NPR's Sunday Puzzle with AI? Sounds like a brain teaser showdown! I wonder if these models can outsmart Will Shortz’s tricky wordplay. 🤔
PaulTaylor
April 19, 2025 at 5:13:34 PM EDT
This AI tool that solves NPR's Sunday puzzles is great! It's like having a clever friend who loves puzzles as much as I do. Sometimes it gets things wrong, but who doesn't? Keep it up, AI! 😄
StephenScott
April 19, 2025 at 6:57:20 AM EDT
This AI tool tackling NPR's Sunday Puzzles is super cool! It's like having a brainy friend who loves puzzles as much as I do. Sometimes it gets the answers wrong, but hey, who doesn't? Keep up the good work, AI! 🤓
CharlesThomas
April 18, 2025 at 10:09:55 PM EDT
This AI tool taking on NPR's Sunday Puzzle is so much fun! It feels like having a puzzle-loving friend. It gets the answer wrong sometimes, but that happens to everyone, right? Keep it up, AI! 😊
JackMartin
April 13, 2025 at 6:51:16 AM EDT
Solving NPR's Sunday Puzzle with AI is amazing! It's cool to see how well the models handle these tricky questions. They get things wrong sometimes, but it's still impressive. Keep tuning those algorithms! 🤓
RichardRoberts
April 13, 2025 at 4:54:45 AM EDT
Using AI to solve NPR's Sunday puzzles is incredible! It's great to see how well the models handle these tough questions. Sometimes they get it wrong, but it's still very impressive. Keep tuning those algorithms, folks! 🤓