Microsoft's VibeVoice AI Family Goes Open Source, Handles 90-Minute Dialogues, Tops 27K GitHub Stars
Microsoft has open-sourced VibeVoice, a family of voice AI models covering automatic speech recognition (ASR) and text-to-speech (TTS). The project has quickly drawn developer interest for its long-audio processing, natural multi-speaker dialogue generation, and low-latency real-time synthesis, and it has already collected around 27,000 stars on GitHub.
Released as an open-source research framework under the MIT license, VibeVoice supports local deployment with no cloud subscription fees, aiming to foster collaboration and innovation in speech synthesis. The model family comprises three core members, each addressing specific challenges in traditional voice AI, such as long-sequence handling, speaker consistency, and natural fluency.

VibeVoice-ASR-7B: A Powerful Tool for Structured Speech-to-Text, Handling Up to 60 Minutes of Audio
VibeVoice-ASR-7B is a unified speech-to-text model capable of processing audio files up to 60 minutes long in a single pass, directly outputting structured transcripts. The output identifies the speaker, provides precise timestamps, and details the spoken content, while supporting custom hotwords to improve accuracy for proper nouns or technical terms. Supporting over 50 languages, it is well-suited for complex scenarios like lengthy meeting recordings and podcast transcription.
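A structured transcript of this kind can be pictured as a list of segments, each carrying a speaker label, a time range, and the spoken text. The sketch below is only an illustration of that shape: the `Segment` class, its field names, and the rendered line format are assumptions made for this example, not the model's actual output schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical fields; the real VibeVoice-ASR output schema may differ.
    speaker: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str


def fmt_time(seconds: float) -> str:
    """Render a second offset as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def render(segments: list[Segment]) -> str:
    """Produce one line per segment: [start - end] speaker: text."""
    return "\n".join(
        f"[{fmt_time(s.start)} - {fmt_time(s.end)}] {s.speaker}: {s.text}"
        for s in segments
    )


# Made-up sample data, used only to show the layout.
segments = [
    Segment("Speaker 1", 0.0, 4.2, "Welcome to the weekly sync."),
    Segment("Speaker 2", 4.2, 9.8, "Thanks, let's start with the roadmap."),
]
print(render(segments))
```

Keeping speaker, timing, and text together in one record is what makes such output easy to turn into meeting minutes or subtitles without a separate diarization step.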
Community developers have already built practical tools on this model, such as a voice input method called Vibing for macOS and Windows. User feedback indicates strong performance in speed and accuracy, significantly boosting daily voice input efficiency.
VibeVoice-TTS-1.5B: Expressive Speech Generation for Up to 90 Minutes with Multiple Speakers
VibeVoice-TTS-1.5B is the core text-to-speech model. It can generate up to 90 minutes of continuous audio in a single pass and supports up to four distinct speakers for natural dialogue simulation. The synthesized speech is expressive and fluent, with realistic pauses, emphasis, and emotional shifts, which makes it a good fit for podcasts, long narratives, audiobooks, and multi-character dialogues.
Unlike many traditional TTS models, which are limited to one or two speakers, VibeVoice-TTS maintains consistency across long-form, multi-speaker audio. Its architecture uses continuous speech tokenizers (acoustic and semantic) that run at a low frame rate of 7.5 Hz, which keeps long sequences short enough to process efficiently.
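The effect of the low frame rate is easiest to see with a little arithmetic. The sketch below counts how many frames a 90-minute recording produces at 7.5 Hz; the 50 Hz figure is a hypothetical comparison rate chosen only for illustration and does not describe any particular model.

```python
FRAME_RATE_HZ = 7.5  # frame rate stated in the article


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


low_rate = frames_for(90)         # 90 min * 60 s * 7.5 Hz
high_rate = frames_for(90, 50.0)  # hypothetical 50 Hz rate, for comparison only

print(low_rate)              # 40500
print(high_rate)             # 270000
print(high_rate / low_rate)  # about 6.67x more frames at the higher rate
```

At 7.5 Hz, 90 minutes of audio is about 40,500 frames, which is why a single-pass, long-form generation becomes tractable.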
VibeVoice-Realtime-0.5B: Real-Time TTS with Around 300 Milliseconds of Latency
VibeVoice-Realtime-0.5B is designed for real-time applications. It accepts streaming text input and delivers the first audio after approximately 300 milliseconds, while still being able to generate up to 10 minutes of audio. This makes it well suited to interactive applications that need immediate feedback, such as real-time voice assistants or live-stream dubbing.
The project has also added experimental speakers, including multilingual voices and several English style variations, giving developers more customization options.
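First-audio latency is the time between submitting text and receiving the first chunk of audio. A minimal way to measure it against any streaming source is sketched below. Note that `fake_tts_stream` is a stand-in written for this example, with an artificial delay and placeholder bytes; it is not the VibeVoice API, and the numbers it prints say nothing about the real model.

```python
import time
from typing import Iterable, Iterator


def fake_tts_stream(text_chunks: Iterable[str], delay_s: float = 0.01) -> Iterator[bytes]:
    """Stand-in for a streaming TTS engine (hypothetical, not the VibeVoice API)."""
    for chunk in text_chunks:
        time.sleep(delay_s)          # simulate synthesis time per chunk
        yield chunk.encode("utf-8")  # placeholder bytes instead of real audio


def first_audio_latency(stream: Iterator[bytes]) -> tuple[float, list[bytes]]:
    """Return (seconds until the first chunk arrived, all chunks in order)."""
    start = time.perf_counter()
    first = next(stream)  # blocks until the first chunk is produced
    latency = time.perf_counter() - start
    return latency, [first, *stream]


latency, audio = first_audio_latency(fake_tts_stream(["Hello, ", "world."]))
print(f"first-audio latency: {latency * 1000:.1f} ms, chunks: {len(audio)}")
```

Measuring only up to the first chunk, rather than the full generation, is what distinguishes a streaming latency figure from total synthesis time.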
AIbase Review: By open-sourcing VibeVoice, Microsoft lowers the barrier to entry for high-performance voice AI and provides a complete local deployment option. The project was briefly taken down over potential misuse risks and was relaunched after adding safeguards such as audio watermarks and audible disclaimers, reflecting responsible AI development principles. Developers can now obtain model weights from GitHub and Hugging Face and quickly test them via platforms like Colab.
With ongoing contributions from the open-source community, including optimizations for Apple Silicon, VibeVoice is poised to accelerate adoption in content creation, accessibility tools, and voice interaction. Interested developers can visit Microsoft's official project page for further exploration.
Project Address: https://github.com/microsoft/VibeVoice