Meituan Unveils LongCat-Next AI Model with Unified Vision and Speech Architecture

April 12, 2026

On April 3, the Meituan team officially launched LongCat-Next, a native multimodal large model. Rather than following the conventional "language foundation plus plugins" approach, the model converts images, audio, and text into a unified stream of discrete tokens. This lets it natively "see" and "hear" the physical world and process these inputs the same way it processes text.
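To make the idea of a "unified stream of discrete tokens" concrete, here is a minimal Python sketch. It is not LongCat-Next's actual tokenizer: the vocabulary sizes, the offset scheme, and the function names (`to_unified`, `modality_of`, `build_stream`) are all assumptions chosen for illustration. It shows one common way to place text, image, and audio token IDs into a single shared ID space so that one autoregressive model can consume them as one sequence.

```python
# Hypothetical vocabulary sizes, chosen only for illustration.
SIZES = {"text": 1000, "image": 512, "audio": 256}

# Each modality gets its own contiguous ID range in one shared vocabulary.
OFFSETS = {}
_next_offset = 0
for _name, _size in SIZES.items():
    OFFSETS[_name] = _next_offset
    _next_offset += _size


def to_unified(modality, local_ids):
    """Map modality-local token IDs into the shared ID space."""
    size = SIZES[modality]
    for i in local_ids:
        if not 0 <= i < size:
            raise ValueError(f"{modality} token {i} is outside [0, {size})")
    return [OFFSETS[modality] + i for i in local_ids]


def modality_of(unified_id):
    """Recover which modality a shared-space token ID belongs to."""
    for name, size in SIZES.items():
        if OFFSETS[name] <= unified_id < OFFSETS[name] + size:
            return name
    raise ValueError(f"token {unified_id} is outside the shared vocabulary")


def build_stream(segments):
    """Concatenate (modality, local_ids) segments into one flat token list."""
    stream = []
    for modality, local_ids in segments:
        stream.extend(to_unified(modality, local_ids))
    return stream


stream = build_stream([("text", [5, 7]), ("image", [0, 511]), ("audio", [3])])
print(stream)                            # [5, 7, 1000, 1511, 1515]
print([modality_of(t) for t in stream])  # ['text', 'text', 'image', 'image', 'audio']
```

Because every token lives in one ID space, a single embedding table and a single next-token objective can cover all three modalities, which is the property the article attributes to the model.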

Technical Core: DiNA Architecture Enables "Modality Internalization"

To eliminate barriers between different data types, Meituan developed the DiNA (Discrete Native Autoregressive) architecture, which unifies multimodal modeling in three ways:

Complete Modality Unification: The model uses the same parameters, attention mechanisms, and loss functions for text, images, and audio.

Symmetry of Understanding and Generation: Within a single mathematical framework, predicting the next text token constitutes "understanding," while predicting an image token is "generation." Both processes show significant synergistic benefits during training.

Extreme Compression: The dNaViT visual tokenizer handles inputs at any resolution. Using 8-layer residual vector quantization, it achieves up to 28x compression in pixel space while preserving critical details for tasks like OCR and financial document analysis.
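The residual vector quantization mentioned above can be sketched generically in a few lines of Python. This is a toy illustration of the technique, not the dNaViT implementation: the two-dimensional vectors, the hand-built codebooks from the hypothetical `make_codebooks` helper, and the example input are all assumptions. Only the stage count of 8 comes from the article. Each stage quantizes whatever error the previous stages left behind, so the reconstruction improves as stages are added.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def make_codebooks(num_stages, dim=2):
    """Toy codebooks (hypothetical): stage k holds the zero vector plus
    axis-aligned steps of size 2**-k in both directions."""
    books = []
    for k in range(num_stages):
        step = 2.0 ** -k
        book = [[0.0] * dim]
        for axis in range(dim):
            for sign in (1.0, -1.0):
                entry = [0.0] * dim
                entry[axis] = sign * step
                book.append(entry)
        books.append(book)
    return books


def sq_error(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


vec = [0.7, -0.3]
codebooks = make_codebooks(8)        # 8 stages, matching the count in the article
codes = rvq_encode(vec, codebooks)   # one small integer per stage
recon = rvq_decode(codes, codebooks)
print(codes, recon, sq_error(vec, recon))
```

In this sketch the vector is stored as 8 small integers instead of raw floats, which is where the compression comes from. A real tokenizer would learn its codebooks from data rather than constructing them by hand.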

Empirical Performance: Discrete Modeling Has No Inherent Limits

LongCat-Next delivers performance that surpasses specialized models across multiple benchmarks, effectively challenging the traditional notion that "discretization inevitably causes information loss":

Fine-Grained Perception: On OmniDocBench, which covers dense-text scenarios, it outperforms not only Qwen3-Omni but also the specialized vision model Qwen3-VL.

Visual Reasoning: It scored 83.1 on MathVista, indicating strong logical reasoning over visual inputs.

Cross-Modal Collaboration: While maintaining leading language capabilities (C-Eval 86.80), it supports low-latency parallel generation of text and speech, along with customizable voice cloning.

Industry Insight: A Foundation for Physical-World AI

Large language models have long been centered on text. The breakthrough of LongCat-Next is its proof that physical-world information can be discretized and modeled like language. When an AI possesses a unified "native language," it becomes more intelligent and intuitive when using tools, writing code, or interpreting complex charts.

Meituan has now open-sourced the LongCat-Next model and the dNaViT tokenizer. This native discrete architecture gives developers tools for building AI that can perceive and interact with the real world.

Comments (1)
CharlesHernández May 16, 2026 at 2:00:15 PM EDT

Interesting approach! Unifying vision and speech into a single stream sounds like a step towards more 'native' multimodal understanding, unlike just bolting on separate modules. Makes me wonder how this affects real-time processing efficiency for delivery robots or AR navigation apps. Could be a game-changer for Meituan's on-demand services if it works smoothly in the wild. 🧐
