Meituan Unveils LongCat-Next AI Model with Unified Vision and Speech Architecture

On April 3, the MiTi team officially launched the native multimodal large model LongCat-Next. This model moves beyond the conventional "language foundation plus plugins" approach by converting images, audio, and text into a unified stream of discrete tokens. This allows the AI to natively "see" and "hear" the physical world, processing these inputs just as it does text.
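One common way to build a single discrete token stream is to give each modality its own slice of one shared vocabulary. The sketch below illustrates that idea only. The vocabulary sizes, the offset layout, and the helper names (`to_unified`, `from_unified`, `build_stream`) are assumptions made for illustration, not details published for LongCat-Next.

```python
# Hypothetical vocabulary sizes; the article does not state the real ones.
TEXT_VOCAB = 1000
IMAGE_VOCAB = 512
AUDIO_VOCAB = 256

SIZES = {"text": TEXT_VOCAB, "image": IMAGE_VOCAB, "audio": AUDIO_VOCAB}
# Each modality occupies a contiguous, non-overlapping range of token ids.
OFFSETS = {
    "text": 0,
    "image": TEXT_VOCAB,
    "audio": TEXT_VOCAB + IMAGE_VOCAB,
}
TOTAL_VOCAB = TEXT_VOCAB + IMAGE_VOCAB + AUDIO_VOCAB


def to_unified(modality, local_ids):
    """Map modality-local token ids into the shared id space."""
    offset, size = OFFSETS[modality], SIZES[modality]
    for i in local_ids:
        if not 0 <= i < size:
            raise ValueError(f"id {i} out of range for {modality}")
    return [offset + i for i in local_ids]


def from_unified(token_id):
    """Recover (modality, local id) from a shared token id."""
    if not 0 <= token_id < TOTAL_VOCAB:
        raise ValueError(f"token id {token_id} outside shared vocabulary")
    # Check the highest offset first so each id matches exactly one range.
    for modality in ("audio", "image", "text"):
        if token_id >= OFFSETS[modality]:
            return modality, token_id - OFFSETS[modality]


def build_stream(segments):
    """Concatenate (modality, local_ids) segments into one token sequence."""
    stream = []
    for modality, local_ids in segments:
        stream.extend(to_unified(modality, local_ids))
    return stream


stream = build_stream([("text", [5, 17]), ("image", [0, 511]), ("audio", [3])])
print(stream)  # [5, 17, 1000, 1511, 1515]
```

Because every token lives in the same id space, a single autoregressive model can be trained on the sequence with one next-token objective, whichever modality each position came from.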
Technical Core: DiNA Architecture Enables "Modality Internalization"
To eliminate barriers between different data types, MiTi developed the DiNA (Discrete Native Autoregressive) architecture, achieving a deep unification in multimodal modeling:
Complete Modality Unification: The model uses the same parameters, attention mechanisms, and loss functions for text, images, and audio.
Symmetry of Understanding and Generation: Within a single mathematical framework, predicting the next text token constitutes "understanding," while predicting the next image token constitutes "generation." According to the team, the two objectives reinforce each other during training.
Extreme Compression: Using the dNaViT Visual Tokenizer, the model handles inputs at any resolution. An 8-layer residual vector quantization process, in which each layer quantizes the residual left by the layers before it, achieves up to 28x compression in pixel space while preserving critical details for tasks like OCR and financial document analysis.
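The residual vector quantization idea mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the dNaViT implementation: the codebooks here are random rather than learned, and the codebook size, vector dimension, per-stage scaling, and function names are all hypothetical. Only the 8-stage depth is taken from the article.

```python
import random


def make_codebooks(num_stages=8, codebook_size=16, dim=4, seed=0):
    """Build toy codebooks (random here; a real tokenizer would learn them)."""
    rng = random.Random(seed)
    codebooks = []
    for stage in range(num_stages):
        scale = 0.5 ** stage  # later stages model finer residuals
        # A zero entry guarantees that a stage never increases the error.
        book = [[0.0] * dim]
        for _ in range(codebook_size - 1):
            book.append([rng.uniform(-scale, scale) for _ in range(dim)])
        codebooks.append(book)
    return codebooks


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rvq_encode(vector, codebooks):
    """Quantize a vector into one code index per stage."""
    residual = list(vector)
    codes = []
    for book in codebooks:
        # Pick the entry closest to what is still unexplained.
        idx = min(range(len(book)), key=lambda i: sq_dist(residual, book[i]))
        codes.append(idx)
        # Pass the leftover error on to the next stage.
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry of each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, book in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out


codebooks = make_codebooks()
x = [0.3, -0.7, 0.1, 0.9]
codes = rvq_encode(x, codebooks)
err_one_stage = sq_dist(x, rvq_decode(codes[:1], codebooks[:1]))
err_all_stages = sq_dist(x, rvq_decode(codes, codebooks))
print(len(codes), err_all_stages <= err_one_stage)
```

Each input vector is stored as eight small integers instead of raw values, which is where the compression comes from. Adding stages refines the reconstruction without enlarging any single codebook.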
Empirical Performance: Discrete Modeling Has No Inherent Limits
LongCat-Next delivers performance that surpasses specialized models across multiple benchmarks, effectively challenging the traditional notion that "discretization inevitably causes information loss":
Fine-Grained Perception: On the OmniDocBench for dense text scenarios, it outperforms not only Qwen3-Omni but also the specialized vision model Qwen3-VL.
Visual Reasoning: It scored 83.1 on MathVista, a benchmark for mathematical reasoning in visual contexts.
Cross-Modal Collaboration: While maintaining leading language capabilities (C-Eval 86.80), it supports low-latency parallel generation of text and speech, along with customizable voice cloning.
Industry Insight: A Foundation for Physical-World AI
Large language models have long been centered on text. The contribution of LongCat-Next is its demonstration that physical-world information can be discretized and modeled like language. The team's argument is that an AI with a unified "native language" is better placed to use tools, write code, and interpret complex charts.
MiTi has now open-sourced the LongCat-Next model and the dNaViT tokenizer. This efficient, high-potential native discrete architecture provides developers with essential tools for building AI that can perceive and interact with the real world.
Comments (1)
Interesting approach! Unifying vision and speech into a single stream sounds like a step towards more 'native' multimodal understanding, unlike just bolting on separate modules. Makes me wonder how this affects real-time processing efficiency for delivery robots or AR navigation apps. Could be a game-changer for Meituan's on-demand services if it works smoothly in the wild. 🧐












