ChatGPT's Advanced Voice Mode Amazes Testers with Realistic Speech



OpenAI has begun rolling out an alpha version of its Advanced Voice Mode (AVM) to a select group of ChatGPT Plus subscribers. This feature, designed to make AI conversations more natural and responsive, allows real-time voice interactions where users can interrupt the AI mid-sentence. Despite early criticism for its simulated emotional expressiveness and a legal spat with actress Scarlett Johansson over voice imitation, initial user feedback has been overwhelmingly positive.


Testers have reported that AVM can respond to emotional cues through vocal tone, provide sound effects during storytelling, and even simulate taking a breath while speaking. Tech writer Cristiano Giardina noted on X that ChatGPT pauses to catch its breath during lengthy counts, mirroring human speech patterns. This realistic feature results from training the AI on vast amounts of human audio, enabling it to imitate natural speech effectively.


Giardina also highlighted AVM's minimal latency and its ability to perform accents and sound effects, though it still retains an American accent when speaking other languages. A user named Kesku shared examples of ChatGPT using different voices for multiple characters and creating atmospheric sound effects in a sci-fi story. AI advocate Manuel Sainsily praised the feature for its real-time responses and integration with camera input, describing it as akin to FaceTiming a knowledgeable friend.


While AVM demonstrates impressive capabilities, it has limitations. It can occasionally generate incorrect responses, a known issue with large language models. To prevent the AI from impersonating individuals or public figures, OpenAI has restricted its output to four preset voices. Additionally, filters block requests to generate music or other copyrighted audio, addressing potential legal concerns.


To refine AVM, OpenAI collaborated with over 100 external testers from 29 geographical areas, who collectively speak 45 different languages. Some testers have reported audio "leakage" in the form of unintended background music, but the safety measures and broad testing are intended to ensure a secure experience for a diverse user base.

