Alibaba Tongyi unveils voice model with 'FreeStyle' natural language control
Today, Alibaba Tongyi Lab's Speech Team introduced two groundbreaking voice generation models: Fun-CosyVoice3.5 and Fun-AudioGen-VD. The standout feature of these models is their support for "FreeStyle" commands. Instead of complex parameter adjustments, users can precisely control vocal expression styles or build intricate audio scenes from scratch using simple natural language descriptions.

Each model serves a distinct purpose:
Fun-CosyVoice3.5: Multilingual Replication and Fine-Grained Control
This enhanced version of CosyVoice achieves core breakthroughs in understanding speech expression nuances.
Command-Driven Generation: Users can input instructions like "speak more confidently" or "slow down with emotional variation" for real-time vocal adjustments.
Language Expansion: Adds support for Thai, Indonesian, Portuguese, and Vietnamese while maintaining industry-leading word error rate (WER) and voice similarity across 13 languages.
Rare Character Optimization: Specialized training reduced error rates for uncommon characters from 15.2% to 5.3%.
Performance Boost: First-packet latency decreased by 35%, making real-time interaction noticeably more fluid.
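For readers unfamiliar with the metrics quoted above, here is a minimal sketch of how a word error rate is conventionally computed (word-level edit distance divided by the number of reference words) and what the reported drop from 15.2% to 5.3% means in relative terms. The function names and the sample sentences are illustrative assumptions, not Tongyi Lab's evaluation code, and the lab's rare-character figure may well be measured per character rather than per word.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was eliminated."""
    return (before - after) / before


# Illustrative sentences only: one substitution and one deletion in six words.
sample_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Figures reported in the article: 15.2% down to 5.3%.
rare_char_gain = relative_reduction(15.2, 5.3)
print(f"sample WER: {sample_wer:.3f}, relative reduction: {rare_char_gain:.1%}")
```

On the reported numbers, the rare-character error rate fell by roughly 65% in relative terms, which is a larger improvement than the 9.9-point absolute drop might suggest at first glance.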
Fun-AudioGen-VD: Comprehensive Sound Design
This model acts as an "audio director," generating integrated audio combining "characters + environments."
Voice Customization: Users can specify gender, age, accent, and detailed characteristics such as a "hoarse, deep, or low-pitched" voice.
Emotion and Role Play: Simulates roles including customer service agents, broadcasters, and children, even conveying complex states like "outward calm with internal tension."
Immersive Environments: Adds background sounds (battlefield chaos, café murmurs) and spatial effects (cathedral reverb, underwater acoustics) for full spatial simulation.
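To make the idea of describing a scene in plain language concrete, the sketch below assembles the kinds of attributes listed above into a single free-text description. Everything here is a hypothetical illustration: the `SceneDescription` class, its field names, and the sentence template are assumptions, and the article does not specify the actual prompt format or API that Fun-AudioGen-VD accepts.

```python
from dataclasses import dataclass


@dataclass
class SceneDescription:
    """Hypothetical container for the attributes the article mentions."""
    voice: str           # e.g. timbre characteristics
    emotion: str         # e.g. an emotional state or role
    background: str      # e.g. ambient background sound
    spatial_effect: str  # e.g. reverb or acoustic setting

    def to_text(self) -> str:
        # Render the attributes as one natural-language sentence.
        # The wording is an assumed template, not a documented format.
        return (
            f"A {self.voice} voice, {self.emotion}, "
            f"with {self.background} in the background and {self.spatial_effect}."
        )


scene = SceneDescription(
    voice="hoarse, deep",
    emotion="outwardly calm with internal tension",
    background="café murmurs",
    spatial_effect="cathedral reverb",
)
description = scene.to_text()
print(description)
```

The point of the sketch is only that the whole specification collapses into one sentence of ordinary language rather than a set of numeric parameters.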
Tongyi Lab notes these models will democratize high-quality voice creation, offering powerful AI support for podcasting, game development, and film post-production.