OpenAI upgrades its transcription and voice-generating AI models
OpenAI is rolling out new AI models for transcription and voice generation via its API, promising significant improvements over their earlier versions. These updates are part of OpenAI's larger "agentic" vision, which focuses on creating autonomous systems capable of performing tasks independently for users. While the term "agent" can be debated, OpenAI's Head of Product, Olivier Godement, sees it as a chatbot that can interact with a business's customers.
"We're going to see more and more agents emerge in the coming months," Godement shared with TechCrunch during a briefing. "The overarching goal is to assist customers and developers in utilizing agents that are useful, accessible, and precise."
OpenAI's latest text-to-speech model, dubbed "gpt-4o-mini-tts," not only aims to produce more lifelike and nuanced speech but is also more adaptable than its predecessors. Developers can now guide the model using natural language commands, such as "speak like a mad scientist" or "use a serene voice, like a mindfulness teacher." This level of control allows for a more personalized voice experience.
OpenAI shared audio samples of the model in action, including a "true crime-style," weathered voice and a female "professional" voice.
Jeff Harris, a member of OpenAI's product team, emphasized to TechCrunch that the objective is to enable developers to customize both the voice "experience" and "context." "In various scenarios, you don't want a monotonous voice," Harris explained. "For instance, in a customer support setting where the voice needs to sound apologetic for a mistake, you can infuse that emotion into the voice. We strongly believe that developers and users want to control not just the content, but the manner of speech."
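For developers, that steering happens through the API. Below is a minimal sketch using the OpenAI Python SDK; the voice preset, prompt wording, and output file name are illustrative assumptions, not values from the announcement:

```python
# Minimal sketch: steering gpt-4o-mini-tts with natural-language
# instructions via the OpenAI Python SDK. The voice preset and
# file name below are placeholder assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the built-in preset voices
    input="I'm so sorry about the mix-up with your order. Let's get that fixed right away.",
    instructions="Sound like a sympathetic customer support agent, apologetic but upbeat.",
) as response:
    response.stream_to_file("support_reply.mp3")
```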
On the speech-to-text side, the new "gpt-4o-transcribe" and "gpt-4o-mini-transcribe" models are set to replace the aging Whisper transcription model. Trained on a diverse array of high-quality audio data, they are said to better handle accented and varied speech, even in noisy settings. OpenAI also says they are less prone to "hallucinations," a problem where Whisper would sometimes invent words or entire passages, adding inaccuracies like racial commentary or fictitious medical treatments to transcripts.
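Calling the new transcription models looks much like the existing Whisper endpoint. A minimal sketch with the OpenAI Python SDK, where the audio file name is a placeholder:

```python
# Minimal sketch: transcribing audio with gpt-4o-transcribe via the
# OpenAI Python SDK. "meeting.wav" is a placeholder file name.
from openai import OpenAI

client = OpenAI()

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower cost
        file=audio_file,
    )

print(transcript.text)
```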
"These models show significant improvement over Whisper in this regard," Harris noted. "Ensuring model accuracy is crucial for a dependable voice experience, and by accuracy, we mean the models correctly capture the spoken words without adding unvoiced content."
However, performance may vary across languages. OpenAI's internal benchmarks indicate that gpt-4o-transcribe, the more precise of the two, has a "word error rate" nearing 30% for Indic and Dravidian languages like Tamil, Telugu, Malayalam, and Kannada. This suggests that about three out of every ten words might differ from a human transcription in these languages.

The results from OpenAI transcription benchmarking. Image Credits: OpenAI
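For context on that metric: word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the model's output into a reference transcript, divided by the reference's word count. The toy function below (illustrative only, not OpenAI's benchmark code) shows how a roughly 30% WER translates to about three in ten words going wrong:

```python
# Toy word-error-rate calculation: WER = (S + D + I) / N, computed
# here as word-level Levenshtein distance over the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution plus one deletion against a six-word reference: ~0.33.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```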
In a departure from its usual practice, OpenAI won't be making these new transcription models openly available. Historically, it has released new Whisper versions under an MIT license for commercial use. Harris pointed out that gpt-4o-transcribe and gpt-4o-mini-transcribe are significantly larger than Whisper, making them unsuitable for open release.
"These models are too big to run on a typical laptop like Whisper could," Harris added. "When we release models openly, we want to do it thoughtfully, ensuring they're tailored for specific needs. We see end-user devices as a prime area for open-source models."
Updated March 20, 2025, 11:54 a.m. PT to clarify the language around word error rate and update the benchmark results chart with a more recent version.
Related articles
Former OpenAI Engineer Shares Insights on Company Culture and Rapid Growth
Three weeks ago, Calvin French-Owen, an engineer who contributed to a key OpenAI product, left the company. He recently shared a compelling blog post detailing his year at OpenAI, including the intense…
Google Unveils Production-Ready Gemini 2.5 AI Models to Rival OpenAI in Enterprise Market
Google intensified its AI strategy Monday, launching its advanced Gemini 2.5 models for enterprise use and introducing a cost-efficient variant to compete on price and performance. The Alphabet-owned company…
Meta Offers High Pay for AI Talent, Denies $100M Signing Bonuses
Meta is attracting AI researchers to its new superintelligence lab with substantial multimillion-dollar compensation packages. However, claims of $100 million "signing bonuses" are untrue, per a recruiter…
Comments (31)
BenHernández
July 23, 2025 at 4:50:48 AM EDT
Wow, OpenAI's new transcription and voice models sound like a game-changer! I'm curious how these 'agentic' systems will stack up against real-world tasks. Could they finally nail natural-sounding convos? 🤔
GeorgeTaylor
April 20, 2025 at 3:57:07 PM EDT
OpenAI's new transcription and voice-generation models are a game changer! I'm using them on my podcast and the improvements are impressive. The only downside? They're a bit pricey, but if you can afford it, they're worth every penny! 🎙️💸
GregoryAllen
April 17, 2025 at 12:50:37 AM EDT
OpenAI's new transcription and voice models are a game changer! I've been using them for my podcast and the improvements are night and day. The only downside? They're a bit pricey, but if you can swing it, they're worth every penny! 🎙️💸
StevenAllen
April 17, 2025 at 12:38:26 AM EDT
OpenAI's new speech recognition and voice generation models are truly innovative! I'm using them on my podcast and the improvement is noticeable. The downside is that they're a bit expensive, but if you can afford it, they're worth it! 🎙️💸
NicholasClark
April 16, 2025 at 1:54:41 AM EDT
OpenAI's new speech recognition and voice generation models are revolutionary! I'm using them on my podcast and the improvement is dramatic. The only drawback is that they're a bit expensive, but if you can pay for them, they're well worth it! 🎙️💸
SamuelRoberts
April 15, 2025 at 5:24:36 PM EDT
OpenAI's new transcription and voice-generation models are incredible! The quality has improved a lot compared with previous versions. I just wish they were a bit faster, but overall I'm very satisfied! 😊