

Former DeepSeek researcher and collaborators release new method for training reliable AI agents: RAGEN
May 4, 2025
David Martínez

The Year of AI Agents: A Closer Look at 2025's Expectations and Realities
2025 was heralded by many experts as the year when AI agents—specialized AI systems powered by advanced large language and multimodal models from companies like OpenAI, Anthropic, Google, and DeepSeek—would finally take center stage. However, according to a recent VentureBeat poll on the social network X, most AI agents are still languishing in experimental stages, caught in a sort of corporate limbo.
But there's a glimmer of hope on the horizon. A collaborative effort from researchers at Northwestern University, Microsoft, Stanford, and the University of Washington, including Zihan Wang, a former DeepSeek researcher now pursuing a PhD in computer science at Northwestern, has introduced RAGEN. This new system aims to train and evaluate AI agents to make them more reliable and adaptable for real-world, enterprise use.
RAGEN: A New Approach to Training AI Agents
Unlike static tasks such as math problem solving or code generation, RAGEN focuses on dynamic, multi-turn interactions where agents need to adapt, remember, and reason amid uncertainty. The system is built on a custom reinforcement learning (RL) framework called StarPO (State-Thinking-Actions-Reward Policy Optimization), which emphasizes learning through experience rather than rote memorization. StarPO looks at entire decision-making sequences, not just single-step responses.
StarPO operates in two phases: a rollout stage where the LLM generates complete interaction sequences guided by reasoning, and an update stage where the model is optimized using normalized cumulative rewards. This approach offers a more stable and interpretable learning loop compared to traditional policy optimization methods.
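To make the two-phase loop concrete, here is a minimal, illustrative sketch in Python. It is not the RAGEN codebase; the policy and environment interfaces, and the simple batch normalization of returns, are assumptions made for the sake of the example.

```python
import numpy as np

def starpo_style_iteration(policy, env, num_rollouts=8, max_turns=10):
    """Illustrative two-phase loop: rollout, then update on normalized returns.

    `policy` and `env` are stand-ins: the policy generates a reasoning-plus-action
    message each turn, and the environment returns an observation and a reward.
    """
    trajectories = []

    # Phase 1: rollout. The model generates complete multi-turn interaction
    # sequences, including its reasoning, before any parameters are updated.
    for _ in range(num_rollouts):
        obs, turns, rewards = env.reset(), [], []
        for _ in range(max_turns):
            message = policy.generate(obs)           # reasoning + chosen action
            obs, reward, done = env.step(message)
            turns.append((obs, message))
            rewards.append(reward)
            if done:
                break
        trajectories.append({"turns": turns, "return": sum(rewards)})

    # Phase 2: update. Cumulative rewards are normalized across the batch so the
    # optimization signal compares whole trajectories, not single steps.
    returns = np.array([t["return"] for t in trajectories], dtype=np.float32)
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    policy.update(trajectories, advantages)          # e.g. a PPO-style step
    return returns.mean()
```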
The researchers tested this framework using fine-tuned versions of Alibaba’s Qwen models, specifically Qwen 1.5 and Qwen 2.5, chosen for their open weights and strong instruction-following capabilities. This choice facilitated reproducibility and consistent baseline comparisons across symbolic tasks.
The Echo Trap: A Challenge in Reinforcement Learning
Zihan Wang highlighted a critical issue in RL training in a widely shared X thread titled "Why does your RL training always collapse?" The team identified that while LLM agents initially produce well-reasoned responses, RL systems often reward shortcuts, leading to repetitive behaviors that degrade performance—a phenomenon they dubbed the "Echo Trap."
This regression is fueled by feedback loops where certain phrases or strategies earn high rewards early on, encouraging overuse and stifling exploration. The symptoms are clear: reward variance cliffs, gradient spikes, and disappearing reasoning traces.
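Those symptoms are measurable, which suggests a simple way to watch for collapse during training. The sketch below tracks reward variance, gradient norms, and reasoning-trace length over a sliding window; the specific thresholds are placeholders, not values from the paper.

```python
from collections import deque

class CollapseMonitor:
    """Illustrative early-warning monitor for the symptoms described above:
    a sudden drop in reward variance, gradient spikes, and shrinking
    reasoning traces. Thresholds are placeholders, not from the paper."""

    def __init__(self, window=50):
        self.reward_var = deque(maxlen=window)
        self.grad_norms = deque(maxlen=window)
        self.trace_lens = deque(maxlen=window)

    def record(self, batch_reward_variance, grad_norm, mean_reasoning_tokens):
        self.reward_var.append(batch_reward_variance)
        self.grad_norms.append(grad_norm)
        self.trace_lens.append(mean_reasoning_tokens)

    def warnings(self):
        alerts = []
        if len(self.reward_var) == self.reward_var.maxlen:
            # Reward variance cliff: variance collapses relative to the window start.
            if self.reward_var[0] > 0 and self.reward_var[-1] < 0.1 * self.reward_var[0]:
                alerts.append("reward variance cliff")
            # Gradient spike: latest norm far above the window average.
            if self.grad_norms[-1] > 5 * (sum(self.grad_norms) / len(self.grad_norms)):
                alerts.append("gradient spike")
            # Disappearing reasoning: traces much shorter than at the window start.
            if self.trace_lens[-1] < 0.3 * self.trace_lens[0]:
                alerts.append("reasoning traces shrinking")
        return alerts
```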
RAGEN's Test Environments
To study these behaviors in a controlled setting, RAGEN evaluates agents across three symbolic environments:
- Bandit: A single-turn, stochastic task that tests symbolic risk-reward reasoning.
- Sokoban: A multi-turn, deterministic puzzle involving irreversible decisions.
- Frozen Lake: A stochastic, multi-turn task requiring adaptive planning.
Each environment is designed to minimize real-world priors and focus solely on decision-making strategies developed during training. For example, in the Bandit environment, agents must reason symbolically about Dragon and Phoenix arms representing different reward distributions, interpreting them as "strength" and "hope" to predict outcomes.
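A toy version of that Bandit setup illustrates the idea: the arm names carry no real-world meaning, so the agent must infer which symbolic option pays off from feedback alone. The reward distributions below are invented for illustration and are not the ones used in RAGEN.

```python
import random

class SymbolicBanditEnv:
    """Toy single-turn bandit in the spirit of RAGEN's Bandit task: two
    symbolically named arms ("Dragon", "Phoenix") with hidden, different
    reward distributions. The distributions below are illustrative only."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        # Hidden payouts: one arm is high-mean but risky, the other steadier.
        # The agent only ever sees the names, never these numbers.
        self.arms = {
            "Dragon":  lambda: self.rng.gauss(mu=1.0, sigma=2.0),
            "Phoenix": lambda: self.rng.gauss(mu=0.7, sigma=0.2),
        }

    def reset(self):
        return "Choose an arm: Dragon or Phoenix."

    def step(self, action: str):
        reward = self.arms[action]() if action in self.arms else -1.0
        return None, reward, True  # single-turn: the episode ends immediately


env = SymbolicBanditEnv(seed=0)
print(env.reset())
_, r, done = env.step("Dragon")
print(f"reward={r:.2f}, done={done}")
```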
Stabilizing Reinforcement Learning with StarPO-S
To combat training collapse, the researchers introduced StarPO-S, a stabilized version of the original framework. StarPO-S includes three key interventions:
- Uncertainty-based rollout filtering: Prioritizing rollouts where the agent shows outcome uncertainty.
- KL penalty removal: Allowing the model to deviate more freely from its original policy and explore new behaviors.
- Asymmetric PPO clipping: Amplifying high-reward trajectories more than low-reward ones to boost learning.
These changes help delay or eliminate training collapse and improve performance across all three tasks. As Wang put it, "StarPO-S… works across all 3 tasks. Relieves collapse. Better reward."
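Of the three interventions, asymmetric clipping is the easiest to show in code. A standard PPO update clips the policy ratio symmetrically; the sketch below widens the upper bound so that strongly rewarded trajectories can shift the policy further than weakly rewarded ones pull it back. The clip values are illustrative assumptions, not settings from the paper.

```python
import torch

def asymmetric_ppo_loss(log_probs_new, log_probs_old, advantages,
                        clip_low=0.2, clip_high=0.28):
    """Illustrative asymmetric PPO clipping: the upper clip bound is wider
    than the lower one, so high-advantage (high-reward) trajectories are
    amplified more than low-advantage ones are suppressed.
    Clip values are placeholders, not from the RAGEN paper."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    # PPO takes the pessimistic (minimum) objective; negate it to get a loss.
    return -torch.mean(torch.min(unclipped, clipped))
```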
What Makes a Good Agentic AI Model?
The success of RL training depends not only on the architecture but also on the quality of the data generated by the agents. The team identified three crucial dimensions that significantly impact training:
- Task diversity: Exposing the model to a wide range of initial scenarios improves generalization.
- Interaction granularity: Allowing multiple actions per turn enables more meaningful planning.
- Rollout freshness: Keeping training data aligned with the current model policy avoids outdated learning signals.
These factors contribute to a more stable and effective training process. An interactive demo site on GitHub visualizes agent rollouts as full dialogue turns, including not just actions but the step-by-step thought process that precedes them. For instance, in solving a math problem, an agent might first 'think' about isolating a variable before submitting an answer like 'x = 5'. These intermediate thoughts are visible and traceable, adding transparency to how agents make decisions.
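To give a sense of what such a trace looks like, the snippet below shows one plausible shape for a logged rollout turn, with the agent's intermediate thought kept alongside its final action. The field names are assumptions for illustration, not RAGEN's actual logging schema.

```python
# One plausible shape for a logged multi-turn rollout, with the agent's
# intermediate reasoning kept alongside each action. Field names are
# illustrative assumptions, not RAGEN's actual schema.
rollout = {
    "task": "solve for x in 2x + 3 = 13",
    "turns": [
        {
            "observation": "Equation: 2x + 3 = 13",
            "think": "Subtract 3 from both sides to isolate the term with x, "
                     "giving 2x = 10, then divide by 2.",
            "action": "x = 5",
            "reward": 1.0,
        },
    ],
}

for turn in rollout["turns"]:
    print("THINK :", turn["think"])
    print("ACTION:", turn["action"])
```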
When Reasoning Runs Out
While explicit reasoning enhances performance in simple, single-turn tasks like Bandit, it tends to decay during multi-turn training. Despite using structured prompts and tokens, reasoning traces often shrink or vanish unless directly rewarded. This highlights a limitation in how rewards are typically designed: focusing on task completion may neglect the quality of the process behind it. The team experimented with format-based penalties to encourage better-structured reasoning but acknowledges that more refined reward shaping is likely needed.
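A simple form of such a format-based signal can be written as a reward-shaping term that docks trajectories whose responses omit a substantive reasoning block. The tag format, token threshold, and penalty weight below are assumptions for illustration, not the paper's actual scheme.

```python
import re

def shaped_reward(task_reward: float, response: str,
                  penalty: float = 0.1, min_think_tokens: int = 10) -> float:
    """Illustrative reward shaping: keep the task reward, but subtract a small
    penalty when the response lacks a substantive <think>...</think> block.
    The tag format, token threshold, and penalty weight are assumptions."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    thinking = match.group(1).split() if match else []
    if len(thinking) < min_think_tokens:
        return task_reward - penalty
    return task_reward


# Example: a correct answer with no visible reasoning is worth slightly less.
print(shaped_reward(1.0, "<answer>x = 5</answer>"))                     # 0.9
print(shaped_reward(1.0, "<think>" + "step " * 12 + "</think> x = 5"))  # 1.0
```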
Open Tools and Future Directions
RAGEN, along with its StarPO and StarPO-S frameworks, is now available as an open-source project at https://github.com/RAGEN-AI/RAGEN. However, at the time of writing, no explicit license is listed in the GitHub repository, which may limit its use or redistribution by others.
The system provides a valuable foundation for those interested in developing AI agents that not only complete tasks but also think, plan, and evolve. As AI moves toward greater autonomy, projects like RAGEN help illuminate what it takes to train models that learn from the consequences of their own actions.
Outstanding Questions for Real-World Enterprise Adoption
While the RAGEN paper offers a detailed technical roadmap, several practical questions remain for those looking to apply these methods in enterprise settings. For instance, how transferable is RAGEN’s approach beyond stylized, symbolic tasks? Would businesses need to design entirely new environments and reward functions to use this system in workflows like invoice processing or customer support?
Wang, in a direct message to VentureBeat on X, suggested that improving task diversity could help, since the current game-like tasks share similar grid representations but lack semantic information. He also expressed optimism about businesses designing their own training exercises for AI agents with RAGEN, noting that the GitHub repository provides a simple introduction to adding new environments.
Another critical area is scalability. Even with the enhancements provided by StarPO-S, the paper acknowledges that training still eventually collapses over longer horizons. This raises the question: is there a theoretical or practical path to sustaining reasoning over open-ended or continuously evolving task sequences?
As noted above, the absence of an explicit license in the RAGEN repository leaves open questions about usage rights. Nonetheless, RAGEN stands out not just as a technical contribution but as a conceptual step toward more autonomous, reasoning-capable AI agents. Whether it becomes part of the enterprise AI stack remains to be seen, but its insights into agent learning dynamics are already helping redefine the frontier of LLM training.