OpenAI has officially launched GPT-Realtime, calling it their “most advanced speech-to-speech model yet.” The model, announced in a blog post on Thursday (Aug. 28), delivers highly natural, expressive speech and lightning-fast response times.
Unlike traditional systems that chain together separate speech-to-text and text-to-speech models, GPT-Realtime processes everything in one unified model. This slashes latency and keeps speech sounding more human and nuanced.
OpenAI says GPT-Realtime was built in close collaboration with customers. It’s designed to shine in real-world tasks like customer support, education, and virtual assistance. The model handles complex instructions with greater precision and now supports function calling, image inputs, and remote MCP servers, making AI voice agents more useful and context-aware than ever.
Two new voices, Cedar (male) and Marin (female), join the lineup, along with updates to the original eight voices. Developers can tweak expressiveness with simple text instructions.
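As a rough sketch of what "tweaking expressiveness with text instructions" looks like in practice: the article doesn't show the API surface, but in the Realtime API's beta event schema, voice and delivery are set by sending a `session.update` event over the API's WebSocket connection. The field names below follow that beta schema and are an assumption, not something guaranteed by this article or the GA release.

```python
import json

# Hypothetical session.update payload; "voice" and "instructions" follow
# the Realtime API's beta event schema and may differ in the GA release.
session_update = {
    "type": "session.update",
    "session": {
        "voice": "marin",  # one of the two new voices
        "instructions": "Speak in a warm, upbeat tone at a brisk pace.",
    },
}

# The JSON text that would be sent over the open WebSocket:
print(json.dumps(session_update, indent=2))
```

Because the instructions are plain text, swapping "warm and upbeat" for, say, "clipped and formal" changes the delivery without any code changes.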
GPT-Realtime is not just faster; it is smarter. It can recognize non-verbal cues like laughter, switch languages mid-sentence, and mimic a user’s tone. The model also outperforms previous versions at handling alphanumeric data in multiple languages, such as Chinese, Spanish, and French. According to OpenAI, GPT-Realtime scored 82.8% on the Big Bench Audio benchmark, up from its predecessor’s 65.6%.
The new model is available exclusively through OpenAI’s Realtime API, which is now out of beta and open to all developers. The API lets developers build low-latency, multimodal voice experiences with ease, including phone calling through SIP integration. Pricing starts at $32 per million input tokens and $64 per million output tokens, with cached input tokens at just $0.40 per million. With GPT-Realtime, OpenAI isn’t just advancing speech tech; it is redefining what voice AI can do.
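To make those rates concrete, here is a small back-of-the-envelope cost estimator using only the per-million-token prices quoted above (the helper function name and the sample token counts are illustrative, not part of any OpenAI SDK):

```python
# Hypothetical helper: estimate a session's cost from the published
# Realtime API rates ($32/M input, $64/M output, $0.40/M cached input).
def realtime_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    INPUT_RATE = 32.00 / 1_000_000   # USD per fresh input token
    OUTPUT_RATE = 64.00 / 1_000_000  # USD per output token
    CACHED_RATE = 0.40 / 1_000_000   # USD per cached input token
    return (input_tokens * INPUT_RATE
            + output_tokens * OUTPUT_RATE
            + cached_tokens * CACHED_RATE)

# Example session: 50k fresh input, 20k output, 200k cached context tokens
print(round(realtime_cost(50_000, 20_000, 200_000), 2))  # estimated cost in USD
```

The numbers show why the cached rate matters: a long-running voice agent that keeps reusing the same conversation context pays 80x less for that context than for fresh input.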