Alibaba's Tongyi Lab Open-Sources Fun-CineForge, Solves Multi-Speaker Dubbing Challenge
Traditional AI voice dubbing often falls short in demanding productions such as films and animation, where capturing emotional peaks and tightly synchronized lip movements matters most. To address this industry challenge, Tongyi Lab has launched and open-sourced Fun-CineForge, a film-grade multimodal dubbing model built for multiple scenarios.
Bridging the Audio-Visual Gap: A Four-Pillar Framework for Seamless Sync
Rather than relying on basic text-to-speech, Fun-CineForge is engineered to master four critical dimensions of professional dubbing:
Lip Sync: Ensures synthesized speech aligns with on-screen character mouth movements with exceptional precision.
Emotional Expression: Infuses the voice with authentic human-like emotion by analyzing facial cues and contextual instructions.
Voice Consistency: Maintains a stable, recognizable vocal identity for specific characters across complex multi-speaker dialogue scenes.
Time Alignment: Enables millisecond-accurate insertion of dialogue, even when the speaker is off-screen or partially obscured.
Core Innovation: Pioneering "Time Modality" and a High-Fidelity Dataset
The technical leap of Fun-CineForge stems from its unique "data + model" co-design philosophy:

The CineDub High-Quality Dataset: Tongyi Lab has also open-sourced the automated pipeline used to construct the CineDub dataset. Using a chain-of-thought error correction mechanism, the pipeline reduces transcription error rates for Chinese and English text to approximately 1% to 2% and cuts speaker diarization errors to as low as 1.2%.
Four-Modality Fusion Architecture: The model introduces a "time modality" and models it jointly with the three conventional ones: visual inputs (lip shape and expression), text (dialogue and emotional context), and audio (voice reference). This fusion allows precise synchronization in challenging scenes, including those without visible faces.
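To make the role of the time modality more concrete, the sketch below bundles the four kinds of input for each line of dialogue and derives a per-frame map of who is speaking when. It is only an illustration under stated assumptions: the `DubSegment` schema, the `speaker_activity` helper, the 40 ms frame length, and the sample lines are all hypothetical and are not taken from the Fun-CineForge code or its input format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DubSegment:
    """One line of dialogue to be dubbed (hypothetical schema)."""
    speaker_id: str                   # who speaks; ties the line to one voice
    start_ms: int                     # time: when the line begins
    end_ms: int                       # time: when the line ends
    text: str                         # text: the dialogue itself
    emotion_hint: str                 # text: contextual instruction
    face_visible: bool                # visual: are lip cues available here?
    ref_audio: Optional[str] = None   # audio: path to a voice reference


def speaker_activity(segments: List[DubSegment], clip_ms: int,
                     frame_ms: int = 40) -> Dict[str, List[bool]]:
    """Return {speaker_id: [bool per frame]} marking when each speaker talks."""
    n_frames = -(-clip_ms // frame_ms)  # ceiling division
    activity: Dict[str, List[bool]] = {}
    for seg in segments:
        if not 0 <= seg.start_ms < seg.end_ms <= clip_ms:
            raise ValueError(f"segment outside the clip: {seg}")
        mask = activity.setdefault(seg.speaker_id, [False] * n_frames)
        first = seg.start_ms // frame_ms
        last = -(-seg.end_ms // frame_ms)  # exclusive upper bound
        for i in range(first, last):
            mask[i] = True
    return activity


# Two speakers in a 2-second clip; speaker B is off-screen, so only the
# timestamps say where B's line belongs.
segments = [
    DubSegment("A", 0, 1200, "Where were you?", "anxious", face_visible=True),
    DubSegment("B", 1000, 2000, "Right behind you.", "calm", face_visible=False),
]
activity = speaker_activity(segments, clip_ms=2000)
```

In this toy layout the off-screen line still has a well-defined position in the clip, which is the kind of information that lip cues alone cannot supply.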
Demonstrated Excellence: Pioneering Authentic Multi-Character Dialogue Dubbing
Benchmark results show that Fun-CineForge substantially outperforms baseline models such as DeepDubber-V1 across key metrics: word and character error rate (WER/CER), lip synchronization (LSE-C and LSE-D), and voice similarity. Notably, it is presented as the first model of its kind to handle two-speaker and multi-person dialogues with precision, and it remains robust on video clips up to 30 seconds long.
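For readers unfamiliar with the intelligibility metrics, WER and CER are conventionally defined as the edit distance between a reference transcript and the transcript recognized from the generated speech, divided by the reference length. WER counts words and is the usual choice for English; CER counts characters and is the usual choice for Chinese. The sketch below follows that textbook definition. It is not the benchmark's scoring script, and it assumes no case or punctuation normalization.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level distance, spaces ignored."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


english_wer = wer("the cat sat on the mat", "the cat sat on mat")  # 1 deletion / 6 words
chinese_cer = cer("你好世界", "你好视界")  # 1 substitution / 4 characters
```

Lower is better for both; a value of 0.0 means the recognized transcript matches the reference exactly.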
GitHub: https://github.com/FunAudioLLM/FunCineForge
HuggingFace: https://huggingface.co/FunAudioLLM/Fun-CineForge
ModelScope: https://www.modelscope.cn/models/FunAudioLLM/Fun-CineForge/