OpenAI upgrades its transcription and voice-generating AI models
OpenAI is rolling out new AI models for transcription and voice generation via its API, promising significant improvements over their earlier versions. These updates are part of OpenAI's larger "agentic" vision, which focuses on creating autonomous systems capable of performing tasks independently for users. While the term "agent" can be debated, OpenAI's Head of Product, Olivier Godement, sees it as a chatbot that can interact with a business's customers.
"We're going to see more and more agents emerge in the coming months," Godement shared with TechCrunch during a briefing. "The overarching goal is to assist customers and developers in utilizing agents that are useful, accessible, and precise."
OpenAI's latest text-to-speech model, dubbed "gpt-4o-mini-tts," not only aims to produce more lifelike and nuanced speech but is also more adaptable than its predecessors. Developers can now guide the model using natural language commands, such as "speak like a mad scientist" or "use a serene voice, like a mindfulness teacher." This level of control allows for a more personalized voice experience.
Here’s a sample of a "true crime-style," weathered voice:
And here’s an example of a female "professional" voice:
Jeff Harris, a member of OpenAI's product team, emphasized to TechCrunch that the objective is to enable developers to customize both the voice "experience" and "context." "In various scenarios, you don't want a monotonous voice," Harris explained. "For instance, in a customer support setting where the voice needs to sound apologetic for a mistake, you can infuse that emotion into the voice. We strongly believe that developers and users want to control not just the content, but the manner of speech."
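A rough sketch of what steering a voice this way might look like through the API follows. The voice name, the request shape, and the `instructions` field are assumptions for illustration based on the article's description, not confirmed API details:

```python
# Hypothetical request to the speech endpoint via the OpenAI Python SDK.
# The voice name ("alloy") and the `instructions` field are assumptions,
# not details confirmed by the article.

TTS_REQUEST = {
    "model": "gpt-4o-mini-tts",
    "voice": "alloy",  # assumed built-in voice name
    "input": "We're sorry about the mix-up with your order.",
    # The natural-language style prompt described above:
    "instructions": "Speak in a calm, apologetic customer-support tone.",
}


def synthesize(outfile: str = "apology.mp3") -> None:
    """Send the request and save the audio; requires OPENAI_API_KEY to be set."""
    from openai import OpenAI  # imported lazily so the sketch loads without the SDK

    client = OpenAI()
    response = client.audio.speech.create(**TTS_REQUEST)
    response.write_to_file(outfile)
```

Swapping in a different `instructions` string would, in principle, produce the "mad scientist" or "mindfulness teacher" styles quoted above.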
Moving to OpenAI's new speech-to-text offerings, "gpt-4o-transcribe" and "gpt-4o-mini-transcribe," these models are set to replace the outdated Whisper transcription model. Trained on a diverse array of high-quality audio data, they claim to better handle accented and varied speech, even in noisy settings. Additionally, these models are less prone to "hallucinations," a problem where Whisper would sometimes invent words or entire passages, adding inaccuracies like racial commentary or fictitious medical treatments to transcripts.
"These models show significant improvement over Whisper in this regard," Harris noted. "Ensuring model accuracy is crucial for a dependable voice experience, and by accuracy, we mean the models correctly capture the spoken words without adding unvoiced content."
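A minimal sketch of calling the new transcription models might look like the following. It assumes the endpoint keeps the shape of the existing Whisper (`whisper-1`) transcription API, which the article does not confirm:

```python
# Hypothetical transcription call via the OpenAI Python SDK; assumes the
# new models are served through the same transcriptions endpoint as
# whisper-1, which is an inference, not a detail from the article.

TRANSCRIBE_MODEL = "gpt-4o-transcribe"  # or "gpt-4o-mini-transcribe"


def transcribe(path: str) -> str:
    """Return the transcript text for an audio file; requires OPENAI_API_KEY."""
    from openai import OpenAI  # imported lazily so the sketch loads without the SDK

    client = OpenAI()
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            model=TRANSCRIBE_MODEL,
            file=audio,
        )
    return result.text
```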
However, performance may vary across languages. OpenAI's internal benchmarks indicate that gpt-4o-transcribe, the more precise of the two, has a "word error rate" nearing 30% for Indic and Dravidian languages like Tamil, Telugu, Malayalam, and Kannada. This suggests that about three out of every ten words might differ from a human transcription in these languages.
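Word error rate is the standard transcription metric: the number of word-level substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the reference length. A self-contained way to compute it (my own sketch, not OpenAI's benchmark code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) needed to
    turn the hypothesis into the reference, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` is 1/6, or roughly 0.17; a 30% rate means about three edits per ten reference words. Note that WER can exceed 1.0 when the hypothesis contains many inserted words, which is exactly what hallucinated passages look like to this metric.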

The results from OpenAI transcription benchmarking. Image Credits: OpenAI
In a departure from its usual practice, OpenAI won't be making these new transcription models freely available. Historically, it has released new versions of Whisper under an MIT license for commercial use. Harris pointed out that gpt-4o-transcribe and gpt-4o-mini-transcribe are significantly larger than Whisper, making them unsuitable for open release.
"These models are too big to run on a typical laptop like Whisper could," Harris added. "When we release models openly, we want to do it thoughtfully, ensuring they're tailored for specific needs. We see end-user devices as a prime area for open-source models."
Updated March 20, 2025, 11:54 a.m. PT to clarify the language around word error rate and update the benchmark results chart with a more recent version.
Comments (30)
ThomasBaker
April 12, 2025 at 12:00:00 AM GMT
OpenAI's new transcription and voice models are a game-changer! 🎤 The improvements are legit, making my workflow so much smoother. Can't wait to see what else they come up with in their 'agentic' vision. Keep it up, OpenAI! 🚀
EmmaTurner
April 12, 2025 at 12:00:00 AM GMT
OpenAI's new transcription and voice-generation models are revolutionary! 🎤 The improvements are real, and my work has become much smoother. I'm looking forward to what they'll put out next with the 'agentic' vision. Keep it up, OpenAI! 🚀
DanielThomas
April 11, 2025 at 12:00:00 AM GMT
OpenAI's new transcription and voice-generation models are innovative! 🎤 The improvements are real, so my workflow has become much smoother. I can't wait to see what they come up with next in the 'agentic' vision. Keep going, OpenAI! 🚀
JasonMartin
April 15, 2025 at 12:00:00 AM GMT
OpenAI's new transcription and voice-generation models are revolutionary! 🎤 The improvements are real, making my workflow much smoother. I can't wait to see what else they'll release in the 'agentic' vision. Keep it up, OpenAI! 🚀
RobertLewis
April 10, 2025 at 12:00:00 AM GMT
OpenAI's new transcription and voice-generation models are revolutionary! 🎤 The improvements are real, making my workflow much easier. I can't wait to see what else they'll bring in the 'agentic' vision. Keep going, OpenAI! 🚀
OliverPhillips
April 12, 2025 at 12:00:00 AM GMT
OpenAI's new transcription and voice models sound promising! I'm excited to see how these upgrades will improve my workflow. The idea of autonomous systems is cool, but I hope they don't get too creepy. 🤖