New Benchmark Reveals GPT and Gemini Strengths in Real-World Voice Agent Tasks

Research | voice agent | Analyzed: Apr 7, 2026 21:06
Published: Apr 7, 2026 04:00
ArXiv Audio Speech

Analysis

This research introduces a benchmark that evaluates voice agents on real human speech, including natural disfluencies such as stuttering and self-corrections. It pushes top-tier models like GPT-Realtime and Gemini Live 3.1 to handle complex, multi-step tool use, which moves the industry closer to truly conversational AI. The focus on 'full-duplex' capabilities (listening and thinking while speaking) marks a significant step toward seamless human-computer interaction.
Reference / Citation
"GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5%); Gemini Live 3.1 achieves the fastest Latency (4.25 s) but the lowest turn-take rate (78.0%)."
ArXiv Audio Speech, Apr 7, 2026 04:00
* Cited for critical analysis under Article 32.