OpenAI upgrades its transcription and voice-generating AI models
April 10, 2025
CharlesWhite
OpenAI is rolling out new AI models for transcription and voice generation via its API, promising significant improvements over its earlier versions. These updates are part of OpenAI's larger "agentic" vision, which focuses on creating autonomous systems capable of performing tasks independently for users. While the term "agent" can be debated, OpenAI's Head of Product, Olivier Godement, sees it as a chatbot that can interact with a business's customers.
"We're going to see more and more agents emerge in the coming months," Godement shared with TechCrunch during a briefing. "The overarching goal is to assist customers and developers in utilizing agents that are useful, accessible, and precise."
OpenAI's latest text-to-speech model, dubbed "gpt-4o-mini-tts," not only aims to produce more lifelike and nuanced speech but is also more adaptable than its predecessors. Developers can now guide the model using natural language commands, such as "speak like a mad scientist" or "use a serene voice, like a mindfulness teacher." This level of control allows for a more personalized voice experience.
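To make the steering concrete, here is a minimal sketch of how such a styled request could be assembled with the OpenAI Python SDK. The parameter names (`model`, `voice`, `input`, `instructions`) follow OpenAI's audio/speech API, but treat the exact signature as an assumption and check the current API reference; the helper function and example text are illustrative only.

```python
# Sketch: steering gpt-4o-mini-tts with a natural-language style
# instruction. The `instructions` field carries the "how to say it"
# direction described in the article; the request shape is an
# assumption based on OpenAI's audio/speech endpoint.

def build_tts_request(text: str, style: str, voice: str = "alloy") -> dict:
    """Assemble keyword arguments for a text-to-speech request."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,            # a preset base voice
        "input": text,             # what to say
        "instructions": style,     # how to say it, in plain language
    }

params = build_tts_request(
    "I'm so sorry about the mix-up with your order.",
    "Speak in a calm, apologetic customer-support tone.",
)

# With an API key configured, the actual call would look like:
#   from openai import OpenAI
#   client = OpenAI()
#   audio = client.audio.speech.create(**params)
#   audio.write_to_file("apology.mp3")
```

Separating the content (`input`) from the delivery (`instructions`) is exactly the split Harris describes below: the same sentence can be rendered apologetic, upbeat, or deadpan without changing a word of it.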
[Audio sample: a "true crime-style," weathered voice]
[Audio sample: a female "professional" voice]
Jeff Harris, a member of OpenAI's product team, emphasized to TechCrunch that the objective is to enable developers to customize both the voice "experience" and "context." "In various scenarios, you don't want a monotonous voice," Harris explained. "For instance, in a customer support setting where the voice needs to sound apologetic for a mistake, you can infuse that emotion into the voice. We strongly believe that developers and users want to control not just the content, but the manner of speech."
Moving to OpenAI's new speech-to-text offerings, "gpt-4o-transcribe" and "gpt-4o-mini-transcribe," these models are set to replace the outdated Whisper transcription model. Trained on a diverse array of high-quality audio data, they claim to better handle accented and varied speech, even in noisy settings. Additionally, these models are less prone to "hallucinations," a problem where Whisper would sometimes invent words or entire passages, adding inaccuracies like racial commentary or fictitious medical treatments to transcripts.
"These models show significant improvement over Whisper in this regard," Harris noted. "Ensuring model accuracy is crucial for a dependable voice experience, and by accuracy, we mean the models correctly capture the spoken words without adding unvoiced content."
However, performance may vary across languages. OpenAI's internal benchmarks indicate that gpt-4o-transcribe, the more precise of the two, has a "word error rate" nearing 30% for Indic and Dravidian languages like Tamil, Telugu, Malayalam, and Kannada. This suggests that about three out of every ten words might differ from a human transcription in these languages.
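The ~30% figure is a word error rate (WER): word-level edits (substitutions, insertions, deletions) needed to turn the model's transcript into the human reference, divided by the reference length. A minimal sketch of the standard computation, using a word-level edit distance (the sentences below are made up for illustration):

```python
# Minimal word error rate (WER): edit distance over words, divided by
# the reference length. A WER near 0.30 means roughly three of every
# ten reference words are substituted, dropped, or spurious.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# Three errors against a ten-word reference -> WER of 0.3
ref = "the quick brown fox jumps over the lazy sleeping dog"
hyp = "the quick brown fox jumped over a lazy dog"
print(wer(ref, hyp))  # 0.3
```

Note that WER counts surface-level word mismatches, so morphologically rich languages (like the Dravidian languages cited) tend to score worse even for transcripts a human would judge acceptable.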
[Chart: the results from OpenAI transcription benchmarking. Image Credits: OpenAI]
In a departure from its usual practice, OpenAI won't be making these new transcription models freely available. Historically, it released new Whisper versions under an MIT license for commercial use. Harris pointed out that gpt-4o-transcribe and gpt-4o-mini-transcribe are significantly larger than Whisper, making them unsuitable for open release.
"These models are too big to run on a typical laptop like Whisper could," Harris added. "When we release models openly, we want to do it thoughtfully, ensuring they're tailored for specific needs. We see end-user devices as a prime area for open-source models."
Updated March 20, 2025, 11:54 a.m. PT to clarify the language around word error rate and update the benchmark results chart with a more recent version.
Comments (20)
ThomasBaker
April 11, 2025 at 6:32:00 PM GMT
OpenAI's new transcription and voice models are a game-changer! 🎤 The improvements are legit, making my workflow so much smoother. Can't wait to see what else they come up with in their 'agentic' vision. Keep it up, OpenAI! 🚀
EmmaTurner
April 11, 2025 at 9:05:15 PM GMT
OpenAI's new transcription and voice generation models are revolutionary! 🎤 The improvements are real, and my work has become much smoother. Looking forward to what they put out next in the 'agentic' vision. Keep it up, OpenAI! 🚀
DanielThomas
April 10, 2025 at 7:20:36 PM GMT
OpenAI's new transcription and voice generation models are innovative! 🎤 The improvements are real, so my workflow has gotten much smoother. I'm looking forward to what they'll put out next in the 'agentic' vision. Keep going, OpenAI! 🚀
JasonMartin
April 14, 2025 at 9:30:18 PM GMT
OpenAI's new transcription and voice generation models are revolutionary! 🎤 The improvements are real, making my workflow much smoother. I can't wait to see what else they'll release in the 'agentic' vision. Keep it up, OpenAI! 🚀
RobertLewis
April 10, 2025 at 3:34:07 PM GMT
OpenAI's new transcription and voice generation models are revolutionary! 🎤 The improvements are real, which has made my workflow much easier. I can't wait to see what else they'll bring in the 'agentic' vision. Keep going, OpenAI! 🚀
OliverPhillips
April 11, 2025 at 5:06:16 PM GMT
OpenAI's new transcription and voice models sound promising! I'm excited to see how these upgrades will improve my workflow. The idea of autonomous systems is cool, but I hope they don't get too creepy. 🤖