Microsoft unveiled its first in-house speech-to-text model as part of a trio of MAI (Microsoft AI) foundation models released in late April 2026, and the benchmark results are striking. The model achieves an average Word Error Rate (WER) of 3.8% on FLEURS, the industry-standard multilingual transcription benchmark, across the top 25 languages by Microsoft product usage. That beats OpenAI's Whisper-large-v3 on all 25 languages, Google's Gemini 3.1 Flash on 22 of 25, and both ElevenLabs' Scribe v2 and OpenAI's GPT-Transcribe on 15 of 25.

For any product that needs accurate, multilingual transcription at scale (meeting notes, customer service call logs, video captions, voice-first interfaces), Microsoft now has a model built entirely in-house that outperforms every major competitor on the industry benchmark.

The strategic context matters as much as the numbers. Until October 2025, Microsoft's partnership with OpenAI contractually limited it from independently pursuing artificial general intelligence. The MAI speech model is the clearest evidence yet that those constraints are gone and that Microsoft is building seriously at the model layer.

For enterprise customers running Teams, Azure, or any Microsoft communication surface, the model is expected to replace Whisper as the default transcription backend, at no additional licensing cost beyond existing Microsoft cloud contracts. Vendors who have built transcription tools on top of Whisper should take note: Microsoft's best-in-class model is now in-house and integrated across every Microsoft product.
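To put the 3.8% figure in context: Word Error Rate is the standard transcription metric, defined as the word-level edit distance (substitutions + deletions + insertions) between a model's output and a reference transcript, divided by the number of reference words. A minimal sketch of the computation (this is illustrative, not FLEURS' or Microsoft's actual scoring code, and the example sentences are made up):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five reference words -> WER of 0.2 (20%)
print(word_error_rate("please transcribe this short clip",
                      "please transcribe the short clip"))
```

A 3.8% average WER means roughly one error per 26 words of reference transcript, averaged across the 25 languages tested.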