Xiaomi's OmniVoice Open-Source TTS Model Enables Zero-Shot Cloning Across 600+ Languages
Xiaomi's next-generation Kaldi team (k2-fsa) has open-sourced OmniVoice, a massively multilingual zero-shot text-to-speech (TTS) model that supports more than 600 languages. The team reports state-of-the-art results on several key benchmarks for Chinese, English, and multilingual synthesis.
Leading Performance: Chinese WER as Low as 0.84%, Outperforming Mainstream Models in Multilingual Tests
On the Seed-TTS Chinese test set, OmniVoice achieves a word error rate (WER) of just 0.84%. In multilingual evaluations, its speaker similarity (SIM-o) and WER scores surpass those of well-known commercial models such as ElevenLabs v2 and MiniMax, indicating strong voice similarity and intelligibility.
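For readers unfamiliar with the metric, the sketch below shows how a WER-style score is conventionally computed: the edit distance between a reference transcript and a recognized transcript, divided by the reference length. It is a generic illustration, not the benchmark's evaluation script. The article does not describe which speech recognizer or text normalization was used, and the sample sentences are made up. For Chinese, scores of this kind are commonly computed over characters rather than space-separated words, which the last line illustrates.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """(substitutions + deletions + insertions) / number of reference tokens."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Word-level example (made-up sentences): one substitution in six words.
reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on a mat".split()
wer = error_rate(reference, hypothesis)

# Character-level example, as is common for Chinese text.
cer = error_rate(list("你好世界"), list("你好世界"))
```

A score of 0.84% therefore means fewer than one error per hundred reference tokens.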

Ultra-Fast Inference: RTF as Low as 0.025, 40x Faster Than Real-Time
OmniVoice reports a real-time factor (RTF) as low as 0.025. RTF is the ratio of synthesis time to the duration of the audio produced, so 0.025 corresponds to roughly 25 ms of compute per second of speech, or about 40 times faster than real time. That headroom makes rapid generation of long-form speech practical in real applications.
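The arithmetic behind the "40x" figure can be checked with a few lines. This is a minimal sketch with hypothetical helper names. Only the 0.025 value comes from the article, and the 10-minute clip is an arbitrary example.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def speedup_over_real_time(rtf):
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf


reported_rtf = 0.025                            # figure quoted in the article
speedup = speedup_over_real_time(reported_rtf)  # about 40

# Illustrative only: compute time for a 10-minute (600 s) clip at that RTF.
compute_seconds = reported_rtf * 600            # about 15 seconds
```

Actual throughput depends on hardware, batch size, and text length, none of which the article specifies.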
Core Architectural Innovation: Discrete Non-Autoregressive Design Inspired by Diffusion Models
OmniVoice employs a discrete non-autoregressive architecture inspired by diffusion language models. It generates speech tokens directly from text in a single stage, rather than passing through the intermediate semantic tokens used by traditional pipelines. This streamlined design simplifies the pipeline while maintaining high output quality. A full-codebook random masking strategy, combined with initialization from a pre-trained LLM, further improves training efficiency and the clarity and intelligibility of the generated speech.
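The article does not spell out how the masking strategy works. As a rough illustration of the general idea behind masked training on multi-codebook audio tokens, the sketch below hides a random subset of positions in every codebook and records the hidden tokens as prediction targets. Everything here is an assumption made for illustration: the `MASK` id, the function name, the per-position independent masking, and the uniformly sampled mask ratio. It should not be read as OmniVoice's actual training code.

```python
import random

MASK = -1  # hypothetical id standing in for a mask token


def random_mask_all_codebooks(tokens, mask_ratio, rng):
    """Mask positions at random in every codebook of a token grid.

    tokens: list of codebooks, each a list of integer token ids (one per frame).
    Returns (masked_grid, targets); targets holds the original id at masked
    positions and None where the token stays visible.
    """
    masked_grid, targets = [], []
    for codebook in tokens:
        masked_row, target_row = [], []
        for tok in codebook:
            if rng.random() < mask_ratio:
                masked_row.append(MASK)   # hidden from the model
                target_row.append(tok)    # the model is trained to recover this
            else:
                masked_row.append(tok)    # left visible as context
                target_row.append(None)   # no loss at visible positions
        masked_grid.append(masked_row)
        targets.append(target_row)
    return masked_grid, targets


rng = random.Random(0)
# Toy grid: 4 codebooks x 8 frames, ids drawn from a vocabulary of 1024.
grid = [[rng.randrange(1024) for _ in range(8)] for _ in range(4)]
ratio = rng.random()  # a fresh mask ratio per training example (assumption)
masked_grid, targets = random_mask_all_codebooks(grid, ratio, rng)
```

At inference time, models of this family typically start from a fully masked grid and fill it in over a small number of parallel refinement passes, which is where the speed advantage over token-by-token autoregressive decoding comes from.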
Flexible Voice Cloning & Customization: Works with Just 3-10 Seconds of Audio
The model supports high-quality zero-shot voice cloning using only 3-10 seconds of reference audio. Users can also customize voice attributes through natural language prompts, specifying gender, age, pitch, accent, dialect, and even special effects like whispering.
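Before handing a clip to any cloning pipeline, it can help to confirm that the reference audio falls in the quoted 3-10 second range. The sketch below does this for WAV files using only Python's standard `wave` module. The function names are hypothetical, and treating 3 and 10 seconds as hard bounds is an assumption; the article gives the range as guidance and does not say how the model behaves outside it.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a WAV file: number of frames divided by the sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(duration_s, min_s=3.0, max_s=10.0):
    """True if the clip length falls inside the range quoted in the article."""
    return min_s <= duration_s <= max_s
```

A caller might combine the two as `is_usable_reference(wav_duration_seconds("speaker.wav"))`, where `speaker.wav` is a placeholder path.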
Handles Non-Linguistic Symbols & Fine-Grained Pronunciation Control
OmniVoice can process non-linguistic symbols, such as [laughter], and supports pronunciation correction via pinyin or phonetic symbols. This makes it particularly well-suited for precise synthesis in Chinese and various dialects.
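To make the idea of inline non-linguistic symbols concrete, the sketch below splits an input string into spoken text and bracketed event tags. Only the `[laughter]` tag comes from the article. The regular expression, the segment format, and the function name are assumptions for illustration; this is not OmniVoice's own text front end, and the article does not show the exact syntax for pinyin or phonetic overrides, so those are left out.

```python
import re

# Assumed tag syntax: lowercase letters or underscores inside square brackets.
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def split_text_and_tags(text):
    """Split input into ("text", ...) and ("tag", ...) segments, in order."""
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        spoken = text[pos:match.start()].strip()
        if spoken:
            segments.append(("text", spoken))
        segments.append(("tag", match.group(1)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


segments = split_text_and_tags("That is hilarious [laughter] tell me more.")
# segments == [("text", "That is hilarious"), ("tag", "laughter"),
#              ("text", "tell me more.")]
```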
Support for 600+ Languages: Aiding Digital Preservation of Minority and Endangered Languages
A key highlight of OmniVoice is its broad language coverage, spanning both major languages and many low-resource ones. For minority and endangered languages, it can generate high-quality speech from only a small amount of data, offering significant potential for digital language preservation and cultural protection.
OmniVoice's code and pre-trained models are now open-sourced on GitHub and Hugging Face, enabling developers to deploy it locally or integrate it into applications. AIbase will continue to monitor community feedback and real-world use cases. Developers are encouraged to share their experiences.
Project Link: https://github.com/k2-fsa/OmniVoice