Performance · Voice AI · Technology

Why Latency Shapes Whether Voice AI Sounds Natural

Latency is one of the biggest factors in how natural a voice agent sounds. Here’s what it is, why response time matters, and what to ask vendors.

2 min read · Callex Team
When people ask why a Callex agent sounds natural, the answer is not only the voice or the script. A big part of it is latency.

What latency means in a voice call

Latency is the time between the moment a caller stops speaking and the moment the agent starts responding.

In human conversation, the natural pause is often around 200–400ms. Past roughly 700ms, the conversation can start to feel off. Beyond about 1.5 seconds, some callers begin to wonder if the line dropped.

What creates latency

Each turn of an AI call moves through three stages:

Caller speech → STT (speech-to-text) → LLM (reasoning) → TTS (text-to-speech) → response

Each stage adds time:

  • STT: 100–300ms
  • LLM: 200–800ms, depending on model size and server load
  • TTS: 100–400ms with streaming

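Run back to back, the stages total anywhere from 400ms to 1.5s. The low end sits within a natural conversational pause; the high end is where some callers begin to wonder if the line dropped.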
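
To make the arithmetic concrete, here is a minimal sketch that adds up the per-stage ranges listed above. The dictionary and function names are ours, chosen for illustration, and the sum assumes the stages run strictly one after another with no overlap.

```python
# Per-stage latency ranges in milliseconds, copied from the list above.
STAGE_RANGES_MS = {
    "STT": (100, 300),
    "LLM": (200, 800),
    "TTS": (100, 400),
}


def total_range_ms(stages):
    """Sum best-case and worst-case latency across sequential stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best_ms, worst_ms = total_range_ms(STAGE_RANGES_MS)
print(f"best case: {best_ms} ms, worst case: {worst_ms} ms")
# best case: 400 ms, worst case: 1500 ms
```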

How Callex keeps latency low

Streaming at every stage

Instead of waiting for the previous stage to finish, each stage starts working on the first chunk of output it receives. As the LLM generates, TTS begins speaking the opening words, so by the time the LLM finishes, part of the sentence may already be spoken.
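
The idea can be illustrated with a toy pipeline built from Python generators. This is a sketch of the general streaming pattern, not Callex's implementation: llm_tokens and tts_stream are hypothetical stand-ins that only record the order of events, and the three-word chunk size is an arbitrary assumption.

```python
from typing import Iterator, List, Tuple


def llm_tokens(reply: str, log: List[str]) -> Iterator[str]:
    """Stand-in for a streaming LLM: yields one word at a time."""
    for word in reply.split():
        log.append(f"llm:{word}")
        yield word
    log.append("llm:done")


def tts_stream(tokens: Iterator[str], log: List[str],
               chunk_size: int = 3) -> Iterator[str]:
    """Stand-in for streaming TTS: speaks once chunk_size words arrive."""
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) == chunk_size:
            phrase = " ".join(buffer)
            log.append(f"tts:{phrase}")
            yield phrase
            buffer = []
    if buffer:  # speak whatever is left when the LLM finishes
        phrase = " ".join(buffer)
        log.append(f"tts:{phrase}")
        yield phrase


def run_pipeline(reply: str) -> Tuple[List[str], List[str]]:
    """Chain the two stages and return (spoken phrases, event log)."""
    log: List[str] = []
    spoken = list(tts_stream(llm_tokens(reply, log), log))
    return spoken, log


spoken, log = run_pipeline("We are open nine to five today")
print(spoken)
print(log.index("tts:We are open") < log.index("llm:done"))
```

In the event log, the first spoken phrase appears before the LLM's "done" marker, which is the overlap that lets a caller hear audio sooner than the sum of the stage times would suggest.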

Right-sized models for the first turn

Not every question needs a frontier model. “What are your hours?” is a simple lookup. We route simple intents to small, fast models and reserve heavier reasoning for when it’s actually needed.
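
A routing rule of this kind can be very small. The sketch below assumes the caller's intent has already been classified upstream; the intent labels and the two model tier names are hypothetical placeholders, not real model identifiers.

```python
# Hypothetical intent labels that a lookup can answer without heavy reasoning.
SIMPLE_INTENTS = {"hours", "address", "phone_number"}


def pick_model(intent: str) -> str:
    """Return a hypothetical model tier for an already-classified intent."""
    if intent in SIMPLE_INTENTS:
        return "small-fast"       # placeholder name for a small, fast model
    return "large-reasoning"      # placeholder name for a heavier model


print(pick_model("hours"))            # small-fast
print(pick_model("billing_dispute"))  # large-reasoning
```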

Infrastructure close to the caller

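Multi-region deployment keeps the network distance between the caller and the servers short. Every turn pays the network round trip on top of the processing time, so even small improvements can matter in a live conversation.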
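
As a rough illustration, region selection can be as simple as picking the lowest measured round-trip time. The region names and the millisecond values below are invented for the example, and a production system would likely weigh capacity and failover as well.

```python
from typing import Dict


def nearest_region(rtt_ms: Dict[str, float]) -> str:
    """Pick the region with the lowest measured round-trip time."""
    if not rtt_ms:
        raise ValueError("need at least one region measurement")
    return min(rtt_ms, key=rtt_ms.get)


# Hypothetical round-trip measurements from one caller's network.
measured = {"region-a": 38.0, "region-b": 112.0, "region-c": 205.0}
print(nearest_region(measured))  # region-a
```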

How to evaluate a vendor

Ask one question: “What is your agent’s P95 latency?”

P95 is the latency that 95% of responses come in under, so it captures the slow tail that an average hides. If a vendor can’t answer, it may mean they are not measuring it consistently.

  • Strong: around 600ms P95 or lower
  • Workable: around 900ms
  • Worth checking carefully: noticeably above that
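
If a vendor shares raw response times, you can check the figure yourself. The sketch below uses the nearest-rank definition of a percentile, which is one common convention among several, and turns the rough bands above into hard cutoffs purely for illustration. The sample latencies are made up.

```python
from typing import List


def p95(latencies_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of a list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n gives the 1-based nearest rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def rate(p95_ms: float) -> str:
    """Map a P95 value onto the rough bands from the list above."""
    if p95_ms <= 600:
        return "strong"
    if p95_ms <= 900:
        return "workable"
    return "check carefully"


# Made-up sample: mostly fast responses with a slow tail.
samples = [500] * 18 + [850, 2000]
print(p95(samples), rate(p95(samples)))  # 850 workable
```

Note that the average of this sample is under 600ms, while the P95 is 850ms. That gap is why the percentile is the more useful number to ask for.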

Bottom line

Latency is one of the hardest things to forgive in an AI call. People can adjust to an imperfect voice, but it is harder to adjust to a conversation that feels delayed.

When you evaluate a voice AI vendor, ask to hear a live call, not only a recorded demo.
