Google Unveils Flash TTS: Text-to-Speech With Natural Tone Control Beats Eleven v3 Latency by 40%
Google released Flash TTS on April 20. The new text-to-speech model lets users control speaking style, pace, pitch, and emotional emphasis through natural-language instructions instead of preset voice selections. Traditional TTS systems force you to choose from 20–50 pre-built voices; with Flash TTS you can ask for "an excited sales pitch voice with a slight Southern accent" and the model generates speech to match.

The technical breakthrough is sub-50ms latency per request, which makes real-time conversation viable: a speaker can interrupt mid-sentence and the system generates new audio without a jarring delay. Early developers report that Flash TTS is 40% faster than ElevenLabs Conversational v3 while offering more granular style control.

Pricing is $0.50 per 1M characters, competitive with Google's existing TTS and significantly cheaper than ElevenLabs at $8 per 1M characters. The model supports 70+ languages, with regional accent support (Indian English, Brazilian Portuguese, Nigerian English, etc.).

Use cases are multiplying quickly: podcast production (generating intro/outro voice variations without a voice actor), audiobook narration (style control enables distinct character voices), accessibility tools (narrating documents in the reader's preferred tone), and agentic customer service (agents that sound natural and conversational, not robotic).

The market implication: ElevenLabs' competitive moat narrows. Flash TTS is enterprise-grade, cheaper, and faster. Expect ElevenLabs to respond with lower pricing or new features within 30 days. For startups building voice apps, Flash TTS is now the default choice.
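To put the quoted prices in concrete terms, here is a minimal Python sketch of the arithmetic. It uses only the two per-million-character figures stated above. The 500,000-character audiobook is a hypothetical workload chosen for illustration, and the helper name `synthesis_cost` is likewise an assumption, not part of any vendor SDK. Real invoices may differ because of tiers, minimums, or plan bundles that this article does not cover.

```python
# Per-million-character prices in USD, as quoted in the article.
PRICE_PER_MILLION = {
    "Flash TTS": 0.50,
    "ElevenLabs": 8.00,
}


def synthesis_cost(characters: int, price_per_million: float) -> float:
    """Return the USD cost of synthesizing `characters` characters
    at a flat per-million-character rate (no tiers or minimums)."""
    return characters / 1_000_000 * price_per_million


# Hypothetical workload: an audiobook of roughly 500,000 characters.
book_chars = 500_000
for name, price in PRICE_PER_MILLION.items():
    print(f"{name}: ${synthesis_cost(book_chars, price):.2f}")

# How many times more expensive the higher quoted rate is.
ratio = PRICE_PER_MILLION["ElevenLabs"] / PRICE_PER_MILLION["Flash TTS"]
print(f"Price ratio: {ratio:.0f}x")
```

At these flat rates the hypothetical book costs $0.25 on Flash TTS versus $4.00 on ElevenLabs, a 16x difference, which is the gap behind the "significantly cheaper" claim.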