Physical Intelligence Unveils Robot Brain Capable of Learning Unseen Tasks
Physical Intelligence, a two-year-old robotics startup based in San Francisco that has emerged as one of the Bay Area's most closely watched AI companies, released new research on Thursday. The findings show that its latest model can guide robots through tasks they were never specifically trained for, a capability that even the company's own researchers say took them by surprise.
The new model, named π0.7, marks what the company calls an early yet significant stride toward the long-standing ambition of a general-purpose robot brain. This system can be directed toward an unfamiliar task, instructed using simple language, and successfully complete it. If these results withstand scrutiny, they indicate that robotic AI may be nearing a turning point akin to the evolution of large language models—where abilities begin to compound in ways that surpass what the foundational data might suggest.
At the heart of the paper is the concept of compositional generalization: the ability to blend skills learned in distinct contexts to tackle entirely new problems. Traditionally, robot training has relied on rote memorization—gathering data for a specific task, training a specialized model on it, and repeating the process for each new chore. Physical Intelligence asserts that π0.7 breaks this cycle.
"Once it crosses the threshold from merely replicating the exact data it was trained on to creatively recombining elements in novel ways," explains Sergey Levine, a co-founder of Physical Intelligence and a UC Berkeley professor specializing in AI for robotics, "capabilities start to increase at a rate greater than linear relative to the data volume. This more favorable scaling dynamic is something we've observed in other fields, like language and vision."
The paper's most compelling demonstration involves an air fryer the model had virtually no exposure to during training. Upon investigation, the team found only two relevant instances in the entire dataset: one where a different robot simply pushed the air fryer's door shut, and another from an open-source dataset where a robot placed a plastic bottle inside one on command. Somehow, the model synthesized these fragments, along with broader web-based pretraining data, into a practical understanding of how the appliance operates.
"It's incredibly difficult to pinpoint exactly where the knowledge originates or predict where it will succeed or fail," notes Ashwin Balakrishna, a research scientist at Physical Intelligence and a Stanford computer science PhD student. Nevertheless, without any prior coaching, the model made a credible attempt at using the appliance to cook a sweet potato. When provided with step-by-step verbal instructions—essentially, a human talking the robot through the process as one would train a new employee—it completed the task successfully.
This coaching capability matters because it suggests robots could be deployed in unfamiliar settings and corrected in real time through verbal instruction, without additional data collection or model retraining.
The researchers are upfront about the model's limitations and cautious about overstating its progress. In at least one instance, they attribute a failure to their own team rather than to the model.
"Sometimes the failure isn't due to the robot or the model," Balakrishna says. "It's on us—not being skilled at prompt engineering." He cites an early air fryer experiment that achieved only a 5% success rate. After spending roughly thirty minutes refining how the task was explained to the model, the success rate soared to 95%.

Image Credits: Physical Intelligence
The model is also not yet able to autonomously execute complex, multi-step tasks from a single high-level command. "You can't just tell it, 'Go make me some toast,'" Levine states. "But if you guide it through the steps—'open this part of the toaster, press that button, do this'—then it tends to perform quite well."
The team also acknowledges the lack of standardized benchmarks in robotics, which complicates external validation of their claims. Instead, the company compared π0.7 against its own earlier specialist models—systems custom-built and trained for individual tasks—and found that the generalist model matched their performance across a variety of complex activities, including making coffee, folding laundry, and assembling boxes.
Perhaps the most remarkable aspect of the research—taking the researchers at their word—is not any single demonstration, but the extent to which the results astonished the very people whose job is to know the training data inside out and, consequently, what the model should and shouldn't be capable of.
"My experience has always been that when I have a deep understanding of the data, I can usually predict what the model will be able to do," Balakrishna reflects. "I'm rarely surprised. But the past few months have been the first time I've been genuinely taken aback. I randomly bought a gear set and asked the robot, 'Can you rotate this gear?' And it just worked."
Levine recalls the moment researchers first witnessed GPT-2 generate a story about unicorns in the Andes. "Where on earth did it learn about unicorns in Peru?" he says. "It's such an odd combination. Seeing that kind of emergent capability in robotics is truly special."
Naturally, critics will highlight an inherent asymmetry: language models were trained on the entire internet. Robots do not have that luxury, and no amount of clever prompting can fully bridge that gap. However, when asked where he anticipates skepticism, Levine points in a different direction entirely.
"The criticism that can always be leveled at any robotic generalization demo is that the tasks seem somewhat mundane," he observes. "The robot isn't doing a backflip." He challenges this perspective, arguing that the difference between a flashy robot demo and a system that genuinely generalizes is precisely the point. True generalization, he suggests, will always appear less dramatic than a carefully orchestrated stunt—but it is far more practical.
The paper itself employs cautious language throughout, describing π0.7 as exhibiting "early signs" of generalization and "initial demonstrations" of new capabilities. These are research findings, not a commercial product, and Physical Intelligence has been consistently reserved about its timeline for commercialization.
When asked directly when a system based on this research might be ready for real-world use, Levine declines to speculate. "There's good reason for optimism, and progress is certainly faster than I anticipated a couple of years ago," he says. "But it's very difficult for me to give a definitive answer."
To date, Physical Intelligence has raised over $1 billion and was most recently valued at $5.6 billion. A significant portion of the investor excitement surrounding the company is linked to co-founder Lachy Groom, who spent years as one of Silicon Valley's most respected angel investors—backing companies like Figma, Notion, and Ramp—before concluding that Physical Intelligence was the venture he had been seeking. This pedigree has helped the startup attract substantial institutional funding, even as it has refrained from providing investors with a specific commercialization roadmap.
The company is now reportedly in talks for a new funding round that would nearly double its valuation to $11 billion. The team declined to comment on the matter.