Meta FAIR Unveils Five Breakthroughs Advancing Human-Like AI
Meta's Fundamental AI Research (FAIR) team has unveiled five new projects that push forward its work in advanced machine intelligence (AMI).
These latest releases concentrate on improving AI perception—how machines process sensory input—alongside progress in language models, robotics, and collaborative AI agents.
Meta explained its objective is to build machines "capable of acquiring, processing, and interpreting sensory data from our world, and using that information to make decisions with human-like intelligence and speed."
The five new initiatives represent a range of interconnected efforts to reach this ambitious target.
Perception Encoder: Sharpening AI's Visual Intelligence
A cornerstone of the new releases is the Perception Encoder, a large-scale vision encoder built to perform exceptionally across diverse image and video tasks.
Vision encoders act as the "eyes" of AI systems, enabling them to comprehend visual information.
Meta points out the growing difficulty of creating encoders for advanced AI, which need to connect vision with language, handle both images and videos proficiently, and stay reliable under tough conditions, including adversarial attacks.
According to Meta, the ideal encoder should recognize a broad spectrum of concepts while also picking up on fine details—like noticing "a stingray hidden under the seafloor, identifying a small goldfinch in an image's background, or detecting a fast-moving agouti on a night-vision wildlife camera."
Meta states the Perception Encoder delivers "outstanding performance on zero-shot image and video classification and retrieval, surpassing all current open-source and proprietary models for these tasks."
Additionally, its perceptual abilities reportedly enhance performance on language tasks.
When paired with a large language model (LLM), the encoder is said to outdo other vision encoders in areas like visual question answering (VQA), captioning, document understanding, and grounding (linking text to specific parts of an image). It also reportedly improves performance on tasks where LLMs typically struggle, such as understanding spatial relationships (e.g., "if one object is behind another") or camera movement relative to an object.
"As the Perception Encoder starts to be integrated into new applications, we look forward to seeing how its advanced visual capabilities will power even more sophisticated AI systems," Meta commented.
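For readers unfamiliar with the term, zero-shot classification is commonly done by embedding an image and a set of candidate text labels into the same vector space, then picking the label closest to the image. The sketch below illustrates that general idea only. The article does not describe the Perception Encoder's internals, so the function names and the three-dimensional embeddings are invented for illustration and stand in for whatever a real encoder would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the image embedding."""
    scores = {
        label: cosine(image_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings, made up for this example (a real encoder would supply them).
label_embeddings = {
    "stingray": [0.9, 0.1, 0.0],
    "goldfinch": [0.1, 0.9, 0.1],
    "agouti": [0.0, 0.2, 0.9],
}
image_embedding = [0.2, 0.8, 0.2]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(best_label)  # goldfinch
```

No label-specific training is involved: adding a new category only requires a new text embedding, which is what makes the approach "zero-shot."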
Perception Language Model (PLM): Advancing Open Vision-Language Research
Working alongside the encoder is the Perception Language Model (PLM), an open and reproducible vision-language model designed for intricate visual recognition tasks.
PLM was trained using extensive synthetic data alongside open vision-language datasets, deliberately avoiding knowledge distilled from external proprietary models.
Acknowledging shortcomings in existing video understanding data, the FAIR team assembled 2.5 million new, human-labeled samples focused on detailed video question answering and spatio-temporal captioning. Meta claims this is the "largest dataset of its kind to date."
PLM is available in versions with 1 billion, 3 billion, and 8 billion parameters, aimed at academic research that requires full transparency.
Along with the models, Meta is releasing PLM-VideoBench, a new benchmark specifically crafted to test capabilities often overlooked by existing benchmarks, namely "fine-grained activity understanding and spatiotemporally grounded reasoning."
Meta hopes that providing open models, a large dataset, and a challenging benchmark will strengthen the open-source community.
Meta Locate 3D: Providing Robots with Situational Awareness
Bridging language commands and physical action is Meta Locate 3D. This end-to-end model is designed to let robots accurately find objects in 3D space based on open-ended natural language queries.
Meta Locate 3D processes 3D point clouds directly from RGB-D sensors (like those on some robots or depth-sensing cameras). Given a text prompt, such as "flower vase near the TV console," the system analyzes spatial relationships and context to identify the correct object instance, differentiating it from, for example, a "vase on the table."
The system consists of three main components: a preprocessing step that converts 2D features into 3D featurized point clouds; the 3D-JEPA encoder (a pre-trained model that creates a contextualized 3D world representation); and the Locate 3D decoder, which uses the 3D representation and the language query to generate bounding boxes and masks for the specified objects.
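The first of those three components, lifting 2D features into a featurized point cloud, can be pictured with a small sketch. This is not Meta's preprocessing code, which the article does not detail. It assumes a standard pinhole camera model, and the function name `lift_to_point_cloud`, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the tiny 2x2 inputs are all hypothetical.

```python
def lift_to_point_cloud(depth, features, fx, fy, cx, cy):
    """Back-project each pixel with a valid depth into 3D and attach its 2D feature.

    depth:    H x W nested list of depths in metres (0 means no reading)
    features: H x W nested list of per-pixel feature vectors
    Returns a list of (x, y, z, feature) tuples in the camera frame.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels where the depth sensor returned nothing
            # Pinhole back-projection (assumed camera model).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, features[v][u]))
    return points


# Tiny 2x2 depth map and matching 2-dimensional per-pixel features (illustrative).
depth = [[2.0, 0.0],
         [2.0, 4.0]]
features = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 1.0], [0.5, 0.5]]]

cloud = lift_to_point_cloud(depth, features, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
for point in cloud:
    print(point)
```

In the pipeline the article describes, a point cloud like this would then be passed to the 3D-JEPA encoder, and the decoder would combine the resulting representation with the text query.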
Alongside the model, Meta is releasing a substantial new dataset for object localization based on referring expressions. It includes 130,000 language annotations across 1,346 scenes from the ARKitScenes, ScanNet, and ScanNet++ datasets, effectively doubling the existing annotated data in this field.
Meta views this technology as essential for developing more capable robotic systems, including its own PARTNR robot project, facilitating more natural human-robot interaction and teamwork.
Dynamic Byte Latent Transformer: Efficient and Robust Language Modeling
Following research published in late 2024, Meta is now releasing the model weights for its 8-billion-parameter Dynamic Byte Latent Transformer.
This architecture departs from traditional tokenization-based language models by operating directly at the byte level. Meta claims the method matches the performance of tokenizer-based models at scale while offering significant gains in inference efficiency and robustness.
Conventional LLMs split text into 'tokens,' an approach that can struggle with misspellings, new words, or adversarial inputs. Byte-level models process raw bytes instead, potentially offering greater resilience.
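A toy comparison shows why a typo can hurt a token-based view of text more than a byte-based one. The word-level vocabulary below is a deliberate simplification invented for this example. Production tokenizers use subword pieces and would typically fragment a misspelling into unfamiliar pieces rather than emit an "unknown" ID, and nothing here reflects how the Dynamic Byte Latent Transformer itself groups bytes.

```python
# Hypothetical word-level vocabulary; real tokenizers use learned subword units.
TOY_VOCAB = {"the": 0, "quick": 1, "fox": 2}
UNK = -1  # ID used for any word missing from the vocabulary


def toy_tokenize(text):
    """Map each whitespace-separated word to its vocabulary ID, or UNK."""
    return [TOY_VOCAB.get(word, UNK) for word in text.split()]


def to_bytes(text):
    """Return the UTF-8 byte values of the text."""
    return list(text.encode("utf-8"))


clean = "the quick fox"
typo = "the quikc fox"  # two letters transposed

print(toy_tokenize(clean))  # [0, 1, 2]
print(toy_tokenize(typo))   # [0, -1, 2]

clean_bytes = to_bytes(clean)
typo_bytes = to_bytes(typo)
changed = sum(a != b for a, b in zip(clean_bytes, typo_bytes))
print(len(clean_bytes), len(typo_bytes), changed)  # 13 13 2
```

At the token level the misspelled word loses its identity entirely, while at the byte level 11 of the 13 values are unchanged, so most of the signal is still available to the model.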
Meta reports that the Dynamic Byte Latent Transformer "outperforms tokenizer-based models across various tasks, showing an average robustness advantage of +7 points (on perturbed HellaSwag), and reaching up to +55 points on tasks from the CUTE token-understanding benchmark."
By releasing the weights along with the previously shared codebase, Meta encourages the research community to explore this alternative approach to language modeling.
Collaborative Reasoner: Advancing Socially-Intelligent AI Agents
The final release, Collaborative Reasoner, addresses the complex challenge of creating AI agents that can work effectively with humans or other AIs.
Meta notes that human collaboration often produces better outcomes and aims to equip AI with similar capabilities for tasks like assisting with homework or preparing for a job interview.
Such collaboration requires not just problem-solving but also social skills like communication, empathy, giving feedback, and understanding others' perspectives (theory-of-mind), typically unfolding over multiple conversational turns.
Current LLM training and evaluation methods often overlook these social and collaborative dimensions. Moreover, gathering relevant conversational data is costly and challenging.
Collaborative Reasoner provides a framework to evaluate and improve these skills. It includes goal-oriented tasks that require multi-step reasoning achieved through dialogue between two agents. The framework tests abilities like constructive disagreement, persuasion, and arriving at a mutually optimal solution.
Meta's evaluations showed that current models often fail to use collaboration consistently to improve results. To tackle this, Meta proposes a self-improvement technique based on synthetic interaction data, in which an LLM agent collaborates with itself.
Generating this data at scale is made possible by Matrix, a new high-performance model-serving engine. Applying the method to math, scientific, and social reasoning tasks reportedly led to improvements of up to 29.4% over the standard 'chain-of-thought' performance of a single LLM.
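The self-collaboration idea can be pictured as a simple turn-taking loop in which the same agent plays both roles and the conversation ends once two consecutive turns agree. The sketch below is a loose illustration under heavy assumptions: `collaborate` and `stub_agent` are hypothetical names, the "agent" is a hard-coded stand-in for an LLM call, and the agreement rule is invented. It does not reproduce Meta's pipeline or the Matrix engine.

```python
def collaborate(agent_a, agent_b, task, max_turns=6):
    """Alternate turns between two agents until two consecutive answers match."""
    transcript = []
    last_answer = None
    speakers = [("A", agent_a), ("B", agent_b)]
    for turn in range(max_turns):
        name, agent = speakers[turn % 2]
        answer = agent(task, transcript)
        transcript.append((name, answer))
        if answer == last_answer:
            return transcript, answer  # agreement reached
        last_answer = answer
    return transcript, None  # no agreement within the turn budget


def stub_agent(task, transcript):
    """Stand-in for an LLM: a hasty first guess, then a careful re-check."""
    a, b = task
    if not transcript:
        return a + b - 1  # deliberately wrong opening guess
    return a + b          # corrected once there is something to review


# The same agent plays both sides, mirroring "collaborates with itself".
transcript, answer = collaborate(stub_agent, stub_agent, task=(3, 4))
print(transcript)  # [('A', 6), ('B', 7), ('A', 7)]
print(answer)      # 7
```

Transcripts that end in a correct, agreed answer are the kind of synthetic interaction data that could then be fed back into training.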
By open-sourcing the data generation and modeling pipeline, Meta aims to accelerate research into developing truly "social agents that can partner with humans and other agents."
Together, these five releases highlight Meta's substantial ongoing investment in fundamental AI research, particularly in building the foundational components for machines that can perceive, understand, and interact with the world in more human-like ways.
Comments (1)
So these advancements focus on perception and reasoning, huh? As someone who deals with automation at work, I find the 'AMI' goal both exciting and a bit unsettling. It feels like we're closing the loop between what a machine 'sees' and what it 'understands', which could revolutionize everything from logistics to creative tools. But honestly, I hope the focus stays on augmenting human ability rather than just chasing benchmarks that sound cool in research papers. The ethics of human-like perception need to be front and center. 🧠