Procedural Memory Slashes AI Agent Costs and Complexity
A new technique developed by Zhejiang University and Alibaba Group equips large language model (LLM) agents with dynamic memory, boosting their efficiency and effectiveness in handling complex tasks. Named Memp, this approach provides agents with a "procedural memory" that updates continuously as they accumulate experience, mirroring the way humans learn through repeated practice.
Memp establishes a lifelong learning system where agents no longer need to begin from zero for each new task. As they face new scenarios in real-world environments, they steadily improve and become more efficient, a critical feature for dependable enterprise automation.
The importance of procedural memory in AI agents
LLM agents show great potential for automating intricate, multi-step business operations. However, in practice, these extended tasks can be prone to failure. The researchers highlight that unexpected issues—such as network interruptions, user interface updates, or changing data formats—can disrupt the entire workflow. Currently, this often forces agents to restart from the beginning each time, leading to delays and higher costs.
At the same time, many complex tasks look different on the surface but share underlying structural similarities. The researchers emphasize that rather than relearning these patterns every time, an agent should be able to draw on and reuse its past experiences, both successes and failures. This calls for a specialized "procedural memory," similar to human long-term memory for skills like typing or cycling, which become second nature with repetition.

Starting from scratch (top) vs. using procedural memory (bottom) (source: arXiv)
Most current agent systems lack this functionality. Their procedural knowledge is usually manually programmed by developers, stored in inflexible prompt templates, or embedded in the model’s parameters, which are costly and slow to modify. Even existing frameworks with memory enhancements offer only broad abstractions and fail to properly address how skills should be developed, indexed, refined, and managed throughout an agent's lifecycle.
As the researchers state in their paper, "there is no principled way to quantify how efficiently an agent evolves its procedural repertoire or to guarantee that new experiences improve rather than erode performance."
How Memp works
Memp is a task-agnostic framework that treats procedural memory as a fundamental, optimizable component. It operates through three core stages that form a continuous cycle: building, retrieving, and updating memory.
Memories are constructed from an agent’s past experiences, or "trajectories." The team investigated storing these memories in two forms: as exact, step-by-step actions, or as higher-level, script-like abstractions that summarize those actions. For retrieval, when presented with a new task, the agent searches its memory for the most relevant prior experience. The researchers tested several retrieval methods, including vector search that matches the new task’s description against past queries, and keyword extraction to find the closest match.
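The build and retrieve stages can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the Memp implementation: the names MemoryEntry, ProceduralMemory, tokenize and cosine are hypothetical, and a simple bag-of-words cosine similarity stands in for the vector search and keyword matching the researchers tested.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """Hypothetical record for one stored experience."""
    task: str         # description of the task that produced the trajectory
    steps: list[str]  # exact step-by-step actions
    script: str       # higher-level, script-like summary of those actions


def tokenize(text: str) -> Counter:
    """Lowercased word counts; a toy stand-in for a real text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ProceduralMemory:
    """Assumed minimal store: build entries, then retrieve by task similarity."""

    def __init__(self) -> None:
        self.entries: list[MemoryEntry] = []

    def build(self, task: str, steps: list[str], script: str) -> None:
        self.entries.append(MemoryEntry(task, steps, script))

    def retrieve(self, new_task: str, k: int = 1) -> list[MemoryEntry]:
        # Rank stored entries by how closely their task text matches the new task.
        query = tokenize(new_task)
        ranked = sorted(
            self.entries,
            key=lambda e: cosine(query, tokenize(e.task)),
            reverse=True,
        )
        return ranked[:k]


memory = ProceduralMemory()
memory.build(
    "heat an egg and put it on the table",
    ["go to fridge", "take egg", "heat egg in microwave", "put egg on table"],
    "Find the item, heat it in the microwave, then place it at the target.",
)
memory.build(
    "book a flight from Paris to Rome",
    ["open booking site", "search flights", "select flight", "pay"],
    "Search for options, pick one, then confirm payment.",
)
best = memory.retrieve("heat a potato and put it on the counter")[0]
print(best.script)
```

In this toy run the new task shares most of its wording with the stored egg task, so that entry's script is what the agent would reuse. A production system would more likely swap tokenize and cosine for a real embedding model and a vector index.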
The update mechanism is the most vital element. Memp incorporates several strategies to ensure the agent’s memory evolves. As an agent finishes more tasks, its memory can be updated by adding the new experience, filtering for only successful outcomes, or—most effectively—by analyzing failures to correct and improve the original memory.
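The three update strategies described above can be sketched as follows. This is a hedged illustration rather than the researchers' code: Trajectory, update_add, update_filter, update_reflect and toy_revise are hypothetical names, and the revise callback is a placeholder for whatever reflection step (most plausibly an LLM call) rewrites a memory after a failure.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trajectory:
    """Hypothetical record of one finished task."""
    task: str
    steps: list[str]
    success: bool


def update_add(memory: list[Trajectory], new: Trajectory) -> list[Trajectory]:
    """Strategy 1: add every new experience, successful or not."""
    return memory + [new]


def update_filter(memory: list[Trajectory], new: Trajectory) -> list[Trajectory]:
    """Strategy 2: keep the new experience only if it succeeded."""
    if new.success:
        return memory + [new]
    return memory


def update_reflect(
    memory: list[Trajectory],
    new: Trajectory,
    used_index: int,
    revise: Callable[[Trajectory, Trajectory], Trajectory],
) -> list[Trajectory]:
    """Strategy 3: on failure, correct the memory entry that was retrieved."""
    if new.success:
        return memory + [new]
    updated = list(memory)  # copy so the caller's list is left untouched
    updated[used_index] = revise(memory[used_index], new)
    return updated


def toy_revise(old: Trajectory, failed: Trajectory) -> Trajectory:
    """Toy stand-in for reflection: note the last failing step to avoid."""
    return Trajectory(old.task, old.steps + [f"avoid: {failed.steps[-1]}"], True)


memory = [Trajectory("heat an egg", ["take egg", "use microwave"], True)]
failure = Trajectory("heat a potato", ["take potato", "use fridge"], False)

print(len(update_add(memory, failure)))     # the failure is stored as-is
print(len(update_filter(memory, failure)))  # the failure is dropped
print(update_reflect(memory, failure, 0, toy_revise)[0].steps)
```

Each function returns a new list instead of mutating its input, which makes it easy to compare the strategies side by side on the same starting memory.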

Memp framework (source: arXiv)
This emphasis on dynamic, adaptable memory situates Memp within a growing research area focused on enhancing the reliability of AI agents for long-term assignments. The project aligns with other initiatives, such as Mem0, which extracts essential information from lengthy conversations and organizes it into structured facts and knowledge graphs to maintain consistency. Similarly, A-MEM allows agents to autonomously generate and connect "memory notes" from their interactions, building a sophisticated knowledge network over time.
Yet, co-author Runnan Fang points out a crucial difference between Memp and similar frameworks.
"Mem0 and A-MEM are excellent works… but they focus on remembering salient content within a single trajectory or conversation," Fang explained to VentureBeat. Essentially, they help an agent recall "what" occurred. "Memp, by contrast, targets cross-trajectory procedural memory." It concentrates on "how-to" knowledge that can be applied across comparable tasks, eliminating the need for the agent to repeatedly start from square one.
"By distilling past successful workflows into reusable procedural priors, Memp raises success rates and shortens steps," Fang continued. "Crucially, we also introduce an update mechanism so that this procedural memory keeps improving— after all, practice makes perfect for agents too."
Overcoming the cold-start challenge
While learning from past experiences is a powerful concept, it poses a practical challenge: How does an agent develop its initial memory when no flawless examples are available? The research team tackles this "cold-start" issue with a practical solution.
Fang described how developers can begin by defining a robust evaluation metric instead of needing a perfect "gold" trajectory from the start. This metric, which may be rule-based or powered by another LLM, rates the quality of an agent's performance. "Once that metric is in place, we let state-of-the-art models explore within the agent workflow and retain the trajectories that achieve the highest scores," Fang said. This method quickly assembles an initial collection of valuable memories, enabling a new agent to become proficient without extensive manual coding.
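The cold-start recipe Fang describes, scoring explored trajectories with a metric and keeping the best ones, can be sketched in a few lines of Python. The function names and the scoring rule here are assumptions made for illustration; in practice the metric would be task-specific and could be computed by another LLM.

```python
from typing import Callable


def bootstrap_memory(
    trajectories: list[list[str]],
    score: Callable[[list[str]], float],
    keep: int,
) -> list[list[str]]:
    """Rank explored trajectories by the metric and retain the top `keep`."""
    return sorted(trajectories, key=score, reverse=True)[:keep]


def rule_based_score(steps: list[str]) -> float:
    """Toy rule-based metric (an assumption): reward finishing, penalize length."""
    finished = 1.0 if steps and steps[-1] == "done" else 0.0
    return finished - 0.01 * len(steps)


# Trajectories a strong model might produce while exploring the workflow.
explored = [
    ["look", "look", "look", "done"],  # finishes, but wanders first
    ["look", "done"],                  # finishes quickly
    ["look", "look"],                  # never finishes
]

seed_memory = bootstrap_memory(explored, rule_based_score, keep=2)
print(seed_memory)
```

Here the two trajectories that reach "done" are retained, with the shorter one ranked first, and the unfinished one is discarded. No hand-written "gold" trajectory is needed, only the metric.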
Memp in action
To evaluate the framework, the team integrated Memp with leading LLMs like GPT-4o, Claude 3.5 Sonnet, and Qwen2.5, testing them on demanding tasks such as household chores in the ALFWorld benchmark and information gathering in TravelPlanner. The outcomes revealed that by building and accessing procedural memory, an agent could effectively summarize and reuse its previous experience.
In testing, agents using Memp not only reached higher success rates but also operated far more efficiently. They cut out unproductive exploration and guesswork, significantly lowering both the number of steps and the token usage needed to finish a task.

Using procedural memory (right) enables agents to complete tasks with fewer steps and reduced token usage (source: arXiv)
A particularly notable discovery for business use is that procedural memory can be transferred. In one test, procedural memory created by the high-powered GPT-4o was provided to a much smaller model, Qwen2.5-14B. The smaller model’s performance improved markedly, increasing its success rate and cutting down the steps required to accomplish tasks.
According to Fang, this is possible because smaller models typically handle straightforward, single-step actions competently but struggle with long-term planning and reasoning. The procedural memory from the larger model effectively bridges this gap. This implies that knowledge can be gathered using a top-tier model and then applied to smaller, more budget-friendly models without sacrificing the advantages of that learned experience.
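One way to picture this transfer is to treat procedural memory as plain data that one model writes and another reads. The Python sketch below assumes, purely for illustration, that memories are stored as JSON and handed to the smaller model inside its prompt. The article does not say how Memp serializes or injects memory, and export_memory, import_memory and build_prompt are hypothetical helpers.

```python
import json


def export_memory(entries: list[dict]) -> str:
    """Serialize memories gathered with the larger model."""
    return json.dumps(entries, ensure_ascii=False, indent=2)


def import_memory(payload: str) -> list[dict]:
    """Load the same memories for use by a smaller model."""
    return json.loads(payload)


def build_prompt(task: str, entries: list[dict]) -> str:
    """Prepend reusable procedures to the new task (assumed prompt layout)."""
    lines = ["You can reuse these procedures from earlier tasks:"]
    for i, entry in enumerate(entries, start=1):
        lines.append(f"{i}. Task: {entry['task']}")
        lines.append(f"   Procedure: {entry['script']}")
    lines.append(f"New task: {task}")
    return "\n".join(lines)


# Memory produced while running the larger model.
large_model_memory = [
    {
        "task": "heat an egg and put it on the table",
        "script": "Find the item, heat it in the microwave, place it at the target.",
    }
]

payload = export_memory(large_model_memory)
small_model_memory = import_memory(payload)
print(build_prompt("heat a potato and put it on the counter", small_model_memory))
```

Because the memory is model-independent text, the long-horizon plan comes from the stronger model while the smaller model only has to carry out the individual steps.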
Advancing toward fully autonomous agents
By integrating memory-update functions, the Memp framework enables agents to continually develop and polish their procedural knowledge as they work in live settings. The researchers observed that this gave the agent a "continual, almost linear mastery of the task."
Still, complete autonomy faces another obstacle: many real-world tasks, like generating a research report, don't have a clear-cut success indicator. To keep improving, an agent must know whether its performance was satisfactory. Fang believes the future solution lies in using LLMs as evaluators.
"Today we often combine powerful models with hand-crafted rules to compute completion scores," he observes. "This works, but hand-written rules are inflexible and difficult to generalize."
An LLM acting as a judge could offer the detailed, supervisory feedback necessary for an agent to self-correct on intricate, subjective assignments. This would make the overall learning process more scalable and resilient, representing a vital move toward creating the durable, flexible, and genuinely autonomous AI workers essential for advanced enterprise automation.