

Former DeepSeek researcher and collaborators release new method for training reliable AI agents: RAGEN
May 4, 2025
David Martínez

The Year of AI Agents: A Closer Look at 2025's Expectations and Realities
2025 was heralded by many experts as the year when AI agents—specialized AI systems powered by advanced large language and multimodal models from companies like OpenAI, Anthropic, Google, and DeepSeek—would finally take center stage. However, according to a recent VentureBeat poll on the social network X, most AI agents are still languishing in experimental stages, caught in a sort of corporate limbo.
But there's a glimmer of hope on the horizon. A collaborative effort from researchers at Northwestern University, Microsoft, Stanford, and the University of Washington, including Zihan Wang, a former DeepSeek researcher now pursuing a PhD in computer science at Northwestern, has introduced RAGEN. This new system aims to train and evaluate AI agents to make them more reliable and adaptable for real-world, enterprise use.
RAGEN: A New Approach to Training AI Agents
Unlike static tasks such as math problem solving or code generation, RAGEN focuses on dynamic, multi-turn interactions where agents need to adapt, remember, and reason amid uncertainty. The system is built on a custom reinforcement learning (RL) framework called StarPO (State-Thinking-Actions-Reward Policy Optimization), which emphasizes learning through experience rather than rote memorization. StarPO looks at entire decision-making sequences, not just single-step responses.
StarPO operates in two phases: a rollout stage where the LLM generates complete interaction sequences guided by reasoning, and an update stage where the model is optimized using normalized cumulative rewards. This approach offers a more stable and interpretable learning loop compared to traditional policy optimization methods.
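To make the two-phase loop concrete, here is a minimal, illustrative sketch in Python. It is not the RAGEN codebase; the policy and environment interfaces, and the simple batch normalization of returns, are assumptions made for the sake of the example.

```python
import numpy as np

def starpo_style_iteration(policy, env, num_rollouts=8, max_turns=10):
    """Illustrative two-phase loop: rollout, then update on normalized returns.

    `policy` and `env` are stand-ins: the policy generates a reasoning-plus-action
    message each turn, and the environment returns an observation and a reward.
    """
    trajectories = []

    # Phase 1: rollout. The model generates complete multi-turn interaction
    # sequences, including its reasoning, before any parameters are updated.
    for _ in range(num_rollouts):
        obs, turns, rewards = env.reset(), [], []
        for _ in range(max_turns):
            message = policy.generate(obs)           # reasoning + chosen action
            obs, reward, done = env.step(message)
            turns.append((obs, message))
            rewards.append(reward)
            if done:
                break
        trajectories.append({"turns": turns, "return": sum(rewards)})

    # Phase 2: update. Cumulative rewards are normalized across the batch so the
    # optimization signal compares whole trajectories, not single steps.
    returns = np.array([t["return"] for t in trajectories], dtype=np.float32)
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    policy.update(trajectories, advantages)          # e.g. a PPO-style step
    return returns.mean()
```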
The researchers tested this framework using fine-tuned versions of Alibaba’s Qwen models, specifically Qwen 1.5 and Qwen 2.5, chosen for their open weights and strong instruction-following capabilities. This choice facilitated reproducibility and consistent baseline comparisons across symbolic tasks.
The Echo Trap: A Challenge in Reinforcement Learning
Zihan Wang highlighted a critical issue in RL training in a widely shared X thread titled "Why does your RL training always collapse?" The team identified that while LLM agents initially produce well-reasoned responses, RL systems often reward shortcuts, leading to repetitive behaviors that degrade performance—a phenomenon they dubbed the "Echo Trap."
This regression is fueled by feedback loops where certain phrases or strategies earn high rewards early on, encouraging overuse and stifling exploration. The symptoms are clear: reward variance cliffs, gradient spikes, and disappearing reasoning traces.
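Those symptoms are measurable, which suggests a simple way to watch for collapse during training. The sketch below tracks reward variance, gradient norms, and reasoning-trace length over a sliding window; the specific thresholds are placeholders, not values from the paper.

```python
from collections import deque

class CollapseMonitor:
    """Illustrative early-warning monitor for the symptoms described above:
    a sudden drop in reward variance, gradient spikes, and shrinking
    reasoning traces. Thresholds are placeholders, not from the paper."""

    def __init__(self, window=50):
        self.reward_var = deque(maxlen=window)
        self.grad_norms = deque(maxlen=window)
        self.trace_lens = deque(maxlen=window)

    def record(self, batch_reward_variance, grad_norm, mean_reasoning_tokens):
        self.reward_var.append(batch_reward_variance)
        self.grad_norms.append(grad_norm)
        self.trace_lens.append(mean_reasoning_tokens)

    def warnings(self):
        alerts = []
        if len(self.reward_var) == self.reward_var.maxlen:
            # Reward variance cliff: variance collapses relative to the window start.
            if self.reward_var[0] > 0 and self.reward_var[-1] < 0.1 * self.reward_var[0]:
                alerts.append("reward variance cliff")
            # Gradient spike: latest norm far above the window average.
            if self.grad_norms[-1] > 5 * (sum(self.grad_norms) / len(self.grad_norms)):
                alerts.append("gradient spike")
            # Disappearing reasoning: traces much shorter than at the window start.
            if self.trace_lens[-1] < 0.3 * self.trace_lens[0]:
                alerts.append("reasoning traces shrinking")
        return alerts
```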
RAGEN's Test Environments
To study these behaviors in a controlled setting, RAGEN evaluates agents across three symbolic environments:
- Bandit: A single-turn, stochastic task that tests symbolic risk-reward reasoning.
- Sokoban: A multi-turn, deterministic puzzle involving irreversible decisions.
- Frozen Lake: A stochastic, multi-turn task requiring adaptive planning.
Each environment is designed to minimize real-world priors and focus solely on decision-making strategies developed during training. For example, in the Bandit environment, agents must reason symbolically about Dragon and Phoenix arms representing different reward distributions, interpreting them as "strength" and "hope" to predict outcomes.
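A toy version of that Bandit setup illustrates the idea: the arm names carry no real-world meaning, so the agent must infer which symbolic option pays off from feedback alone. The reward distributions below are invented for illustration and are not the ones used in RAGEN.

```python
import random

class SymbolicBanditEnv:
    """Toy single-turn bandit in the spirit of RAGEN's Bandit task: two
    symbolically named arms ("Dragon", "Phoenix") with hidden, different
    reward distributions. The distributions below are illustrative only."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        # Hidden payouts: one arm is high-mean but risky, the other steadier.
        # The agent only ever sees the names, never these numbers.
        self.arms = {
            "Dragon":  lambda: self.rng.gauss(mu=1.0, sigma=2.0),
            "Phoenix": lambda: self.rng.gauss(mu=0.7, sigma=0.2),
        }

    def reset(self):
        return "Choose an arm: Dragon or Phoenix."

    def step(self, action: str):
        reward = self.arms[action]() if action in self.arms else -1.0
        return None, reward, True  # single-turn: the episode ends immediately


env = SymbolicBanditEnv(seed=0)
print(env.reset())
_, r, done = env.step("Dragon")
print(f"reward={r:.2f}, done={done}")
```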
Stabilizing Reinforcement Learning with StarPO-S
To combat training collapse, the researchers introduced StarPO-S, a stabilized version of the original framework. StarPO-S includes three key interventions:
- Uncertainty-based rollout filtering: Prioritizing rollouts where the agent shows outcome uncertainty.
- KL penalty removal: Allowing the model to deviate more freely from its original policy and explore new behaviors.
- Asymmetric PPO clipping: Amplifying high-reward trajectories more than low-reward ones to boost learning.
These changes help delay or eliminate training collapse and improve performance across all three tasks. As Wang put it, "StarPO-S… works across all 3 tasks. Relieves collapse. Better reward."
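Of the three interventions, asymmetric clipping is the easiest to show in code. A standard PPO update clips the policy ratio symmetrically; the sketch below widens the upper bound so that strongly rewarded trajectories can shift the policy further than weakly rewarded ones pull it back. The clip values are illustrative assumptions, not settings from the paper.

```python
import torch

def asymmetric_ppo_loss(log_probs_new, log_probs_old, advantages,
                        clip_low=0.2, clip_high=0.28):
    """Illustrative asymmetric PPO clipping: the upper clip bound is wider
    than the lower one, so high-advantage (high-reward) trajectories are
    amplified more than low-advantage ones are suppressed.
    Clip values are placeholders, not from the RAGEN paper."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high) * advantages
    # PPO takes the pessimistic (minimum) objective; negate it to get a loss.
    return -torch.mean(torch.min(unclipped, clipped))
```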
What Makes a Good Agentic AI Model?
The success of RL training depends not only on the architecture but also on the quality of the data generated by the agents. The team identified three crucial dimensions that significantly impact training:
- Task diversity: Exposing the model to a wide range of initial scenarios improves generalization.
- Interaction granularity: Allowing multiple actions per turn enables more meaningful planning.
- Rollout freshness: Keeping training data aligned with the current model policy avoids outdated learning signals.
These factors contribute to a more stable and effective training process. An interactive demo site on GitHub visualizes agent rollouts as full dialogue turns, including not just actions but the step-by-step thought process that precedes them. For instance, in solving a math problem, an agent might first 'think' about isolating a variable before submitting an answer like 'x = 5'. These intermediate thoughts are visible and traceable, adding transparency to how agents make decisions.
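To give a sense of what such a trace looks like, the snippet below shows one plausible shape for a logged rollout turn, with the agent's intermediate thought kept alongside its final action. The field names are assumptions for illustration, not RAGEN's actual logging schema.

```python
# One plausible shape for a logged multi-turn rollout, with the agent's
# intermediate reasoning kept alongside each action. Field names are
# illustrative assumptions, not RAGEN's actual schema.
rollout = {
    "task": "solve for x in 2x + 3 = 13",
    "turns": [
        {
            "observation": "Equation: 2x + 3 = 13",
            "think": "Subtract 3 from both sides to isolate the term with x, "
                     "giving 2x = 10, then divide by 2.",
            "action": "x = 5",
            "reward": 1.0,
        },
    ],
}

for turn in rollout["turns"]:
    print("THINK :", turn["think"])
    print("ACTION:", turn["action"])
```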
When Reasoning Runs Out
While explicit reasoning enhances performance in simple, single-turn tasks like Bandit, it tends to decay during multi-turn training. Despite using structured prompts and tokens, reasoning traces often shrink or vanish unless directly rewarded. This highlights a limitation in how rewards are typically designed: focusing on task completion may neglect the quality of the process behind it. The team experimented with format-based penalties to encourage better-structured reasoning but acknowledges that more refined reward shaping is likely needed.
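A simple form of such a format-based signal can be written as a reward-shaping term that docks trajectories whose responses omit a substantive reasoning block. The tag format, token threshold, and penalty weight below are assumptions for illustration, not the paper's actual scheme.

```python
import re

def shaped_reward(task_reward: float, response: str,
                  penalty: float = 0.1, min_think_tokens: int = 10) -> float:
    """Illustrative reward shaping: keep the task reward, but subtract a small
    penalty when the response lacks a substantive <think>...</think> block.
    The tag format, token threshold, and penalty weight are assumptions."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    thinking = match.group(1).split() if match else []
    if len(thinking) < min_think_tokens:
        return task_reward - penalty
    return task_reward


# Example: a correct answer with no visible reasoning is worth slightly less.
print(shaped_reward(1.0, "<answer>x = 5</answer>"))                     # 0.9
print(shaped_reward(1.0, "<think>" + "step " * 12 + "</think> x = 5"))  # 1.0
```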
Open Tools and Future Directions
RAGEN, along with its StarPO and StarPO-S frameworks, is now available as an open-source project at https://github.com/RAGEN-AI/RAGEN. However, at the time of writing, no explicit license is listed in the GitHub repository, which may limit its use or redistribution by others.
The system provides a valuable foundation for those interested in developing AI agents that not only complete tasks but also think, plan, and evolve. As AI moves toward greater autonomy, projects like RAGEN help illuminate what it takes to train models that learn from the consequences of their own actions.
Outstanding Questions for Real-World Enterprise Adoption
While the RAGEN paper offers a detailed technical roadmap, several practical questions remain for those looking to apply these methods in enterprise settings. For instance, how transferable is RAGEN’s approach beyond stylized, symbolic tasks? Would businesses need to design entirely new environments and reward functions to use this system in workflows like invoice processing or customer support?
Wang, in a direct message to VentureBeat on X, suggested that improving task diversity could help, since the current game-like tasks share similar grid representations but lack semantic information. He also expressed optimism about businesses designing their own training exercises for AI agents with RAGEN, noting that the GitHub repository provides a simple introduction to adding new environments.
Another critical area is scalability. Even with the enhancements provided by StarPO-S, the paper acknowledges that training still eventually collapses over longer horizons. This raises the question: is there a theoretical or practical path to sustaining reasoning over open-ended or continuously evolving task sequences?
As noted above, the absence of an explicit license in the RAGEN repository leaves open questions about usage rights. Nonetheless, RAGEN stands out not just as a technical contribution but as a conceptual step toward more autonomous, reasoning-capable AI agents. Whether it becomes part of the enterprise AI stack remains to be seen, but its insights into agent learning dynamics are already helping redefine the frontier of LLM training.