Technology

How AI voice agents work: no-jargon explanation

AI voice agents work by combining four technology layers that run in sequence during a live phone call: speech recognition converts the caller's words to text, a language model interprets the text and generates a response, text-to-speech converts that response to audio, and a telephony layer delivers it back through the call. The whole sequence runs in under two seconds on a well-configured system. Understanding how each layer works tells you what can go wrong and which vendors have actually solved the hard problems versus which ones are still selling demos.
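The four-layer sequence can be sketched as a single turn loop. The functions below are runnable stubs standing in for real ASR, language model, and TTS services; none of the names correspond to an actual vendor API.

```python
# One conversational turn in a voice agent pipeline (illustrative sketch).
# transcribe / generate_response / synthesise are stubs for real services.

def transcribe(audio: bytes) -> str:                 # 1. ASR: speech -> text
    return audio.decode("utf-8")

def generate_response(history: list[dict]) -> str:   # 2. LLM: text -> reply
    last = history[-1]["content"]
    return f"You said: {last}"

def synthesise(text: str) -> bytes:                  # 3. TTS: reply -> audio
    return text.encode("utf-8")

def handle_turn(audio_chunk: bytes, history: list[dict]) -> bytes:
    text = transcribe(audio_chunk)
    history.append({"role": "user", "content": text})
    reply = generate_response(history)
    history.append({"role": "assistant", "content": reply})
    # 4. The telephony layer streams the returned audio back into the call.
    return synthesise(reply)

history: list[dict] = []
audio_out = handle_turn(b"I want to book an appointment", history)
```

In production each stage streams rather than running strictly one after the other, which is how well-configured systems keep the whole sequence under two seconds.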

What happens in the first 500 milliseconds of a call?

When a caller dials a number routed through an AI voice agent, the telephony provider receives the call and begins streaming the audio in real time to the voice AI platform. The platform does not wait for the caller to stop speaking. It begins processing the audio as soon as it arrives, using voice activity detection to identify when speech starts and when it ends.

The telephony layer handles more than just audio streaming. It manages the SIP connection that keeps the call open, handles transfers to human agents, manages hold states, and terminates the call cleanly when the conversation ends. Problems at this layer (dropped packets, high-latency connections, codec mismatches) affect every call. This is why testing with the actual telephony provider in the actual geography before going live matters. A system that works perfectly in a UK datacentre test environment may perform differently when a caller dials from a mobile on a 4G connection in a rural area.

Voice activity detection is the component that determines when the caller has stopped speaking and the AI should respond. Tuning this badly produces two failure modes: the agent interrupts callers mid-sentence because it detected silence too aggressively, or the agent waits three seconds after the caller finishes speaking because the silence threshold is too long. Getting this right requires testing with real caller speech patterns, not a clean microphone in a quiet room.
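The end-of-turn logic described above can be sketched with a simple energy threshold plus a hang time. The threshold and timing values here are invented examples, not tuned production numbers.

```python
# Illustrative end-of-turn detector: treat the caller as finished once
# audio energy stays below a threshold for `silence_ms` in a row.

class EndOfTurnDetector:
    def __init__(self, energy_threshold: float = 0.01,
                 silence_ms: int = 700, frame_ms: int = 20):
        self.energy_threshold = energy_threshold
        self.silence_frames_needed = silence_ms // frame_ms
        self.silent_frames = 0

    def feed(self, frame_energy: float) -> bool:
        """Return True when the caller appears to have stopped speaking."""
        if frame_energy < self.energy_threshold:
            self.silent_frames += 1
        else:
            self.silent_frames = 0   # any speech resets the silence run
        return self.silent_frames >= self.silence_frames_needed
```

Set `silence_ms` too low and the agent interrupts callers mid-sentence; set it too high and every turn gains dead air, which is exactly the pair of failure modes described above.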

How does speech recognition work in a voice agent?

Speech recognition (technically automatic speech recognition, or ASR) converts the caller's audio stream into text. The models used in production voice agents in 2026 are trained on large corpora of spoken language and can transcribe natural speech in near real time with low word error rates on standard accents and clear audio.

The quality floor drops significantly on regional accents, rapid speech, background noise, and non-native speakers. A transcription error at this stage cascades through the entire system. If the caller says "I want to cancel my appointment on Thursday" and the ASR transcribes it as "I want to cancel my appointment on Tuesday", the language model receives the wrong input and the response will be wrong regardless of how good the language model is.
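The standard way to quantify transcription errors is word error rate (WER): word-level edit distance divided by the length of the reference transcript. A minimal implementation shows how a single Thursday/Tuesday substitution in an eight-word utterance already costs 12.5%.

```python
# Word error rate via word-level Levenshtein distance (textbook algorithm).

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

error = wer("i want to cancel my appointment on thursday",
            "i want to cancel my appointment on tuesday")
```

A WER number alone hides where the errors fall: one wrong weekday is far more damaging than three wrong filler words, which is why testing against real call recordings matters more than the headline figure.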

The models used in SME voice agent deployments in 2026 are primarily Deepgram and Google Speech-to-Text. Deepgram has lower latency and performs well on professional voice calls. Google Speech handles a broader range of accents out of the box. Some voice agent platforms allow you to choose which ASR model powers the transcription. Others bundle a single model. If your caller population includes strong regional accents or non-native English speakers, testing the specific ASR model against recordings of real calls before deployment is not optional.

What does the language model actually do on a call?

The language model is the intelligence layer. It receives the transcribed text from the ASR, the conversation history from the current call, and a system prompt that defines the agent's role and knowledge base. It generates a text response that reflects the appropriate next action: answer a question, ask a clarifying question, confirm a booking, or trigger a transfer to a human.

The system prompt is where the operator configures the agent's behaviour. It defines what the agent knows about the business, what it is authorised to do, how it should handle situations outside its scope, and what language and tone it should use. A poorly written system prompt produces an agent that is technically functional but practically useless. A well-written system prompt is the difference between an agent that handles 70% of calls correctly and one that handles 40%.
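A sketch of how the system prompt and running conversation might be assembled into the model's input each turn. The prompt wording and business name are invented examples, not a recommended template.

```python
# Invented example system prompt: role, scope, escalation rule, tone.
SYSTEM_PROMPT = (
    "You are the phone receptionist for Oakfield Dental. "
    "You can answer opening-hours questions and book appointments. "
    "For anything else, offer to transfer the caller to a colleague. "
    "Keep replies under two sentences."
)

def build_messages(history: list[dict], latest_transcript: str) -> list[dict]:
    """Assemble the model input: system prompt + history + newest turn."""
    return ([{"role": "system", "content": SYSTEM_PROMPT}]
            + history
            + [{"role": "user", "content": latest_transcript}])

messages = build_messages([], "What time do you open on Saturday?")
```

Note that everything the agent "knows" and everything it is allowed to do lives in that one string, which is why prompt quality moves the handled-correctly rate as much as the text suggests.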

The language model also receives function call definitions that describe the integrations available to it. If the agent is configured to book appointments, the function definition tells the model what information it needs to collect and how to call the calendar integration. If the model determines from the conversation that the caller wants to book an appointment, it calls that function with the collected parameters. The function executes, writes the booking, and returns a confirmation. The model then formulates a confirmation response.
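The function-call flow can be sketched as a tool definition plus a dispatcher. The schema shape loosely follows the JSON-schema style several LLM APIs use; the `book_appointment` function and its parameters are invented for illustration.

```python
# Invented tool definition describing what the model must collect.
book_appointment_tool = {
    "name": "book_appointment",
    "description": "Book an appointment in the business calendar",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string", "description": "ISO date, e.g. 2026-03-12"},
            "time": {"type": "string", "description": "24h time, e.g. 14:30"},
            "name": {"type": "string", "description": "Caller's name"},
        },
        "required": ["date", "time", "name"],
    },
}

def book_appointment(date: str, time: str, name: str) -> dict:
    # A real implementation would write to the calendar integration here.
    return {"status": "confirmed", "reference": f"BK-{date}-{time}"}

def dispatch(call: dict) -> dict:
    """Route a model-emitted function call to the matching handler."""
    handlers = {"book_appointment": book_appointment}
    return handlers[call["name"]](**call["arguments"])

result = dispatch({"name": "book_appointment",
                   "arguments": {"date": "2026-03-12", "time": "14:30",
                                 "name": "Alex"}})
```

The model never executes anything itself: it emits the name and arguments, the platform runs the function, and the returned confirmation goes back into the conversation for the model to phrase aloud.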

How does text-to-speech convert the response to voice?

Text-to-speech (TTS) converts the language model's text output into audio that is delivered through the call. The perceptible quality of the AI voice agent depends more on this component than on any other. A high-quality TTS engine with natural prosody and realistic voice qualities produces interactions callers describe as natural. A low-quality TTS engine produces interactions callers describe as robotic even if the underlying logic is correct.

The two TTS providers used in premium SME deployments in 2026 are ElevenLabs and Cartesia. Both offer voices that pass casual listening tests with most callers. The main differentiation between vendors is latency: how quickly the TTS engine produces audio from the input text. A TTS engine that takes 400 milliseconds to start producing audio adds perceptible delay to every turn in the conversation. ElevenLabs uses a streaming approach that begins delivering audio before the full text is converted, which reduces the perceived wait time.
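The streaming idea can be illustrated with a toy generator: playback begins at the first chunk rather than after the whole response is synthesised. `synth_chunks` is a stand-in for a real streaming TTS client, not an actual SDK call.

```python
# Toy streaming synthesis: yield "audio" in small chunks so playback
# can start before the full text has been converted.

def synth_chunks(text: str, chunk_words: int = 3):
    words = text.split()
    for i in range(0, len(words), chunk_words):
        yield " ".join(words[i:i + chunk_words])   # one audio chunk per slice

def play_streamed(text: str, play_fn) -> int:
    """Play chunks as they arrive; return how many chunks were played."""
    chunks = 0
    for chunk in synth_chunks(text):
        play_fn(chunk)        # playback starts at the very first chunk
        chunks += 1
    return chunks
```

With batch synthesis the caller waits for the whole utterance to be converted; with streaming, the wait is only until the first chunk, which is why time-to-first-audio is the figure worth comparing across vendors.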

The voice selection matters beyond quality. A medical clinic that deploys an AI receptionist with a voice that sounds young and casual will face different caller expectations than one that uses a voice that sounds measured and professional. The voice is part of the brand experience. This is worth spending an afternoon on before deployment, not a decision made by defaulting to whatever the platform offers.

What breaks in a production AI voice agent deployment?

Silence gaps between turns are the most common caller experience problem. When the language model is generating a response, the caller hears nothing. If the silence exceeds 1.5 seconds, most callers say "hello", repeat their question, or start speaking over the agent. The fix is a filler phrase delivered immediately, something like "one moment", or an architectural choice to use a faster model or a streaming response approach where audio begins before the full response is generated.
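One way the filler-phrase fix might be wired up, sketched with a timer thread: if the model's reply has not arrived within a threshold, the agent speaks the filler first. The names, timings, and mechanism here are illustrative, not a real platform feature.

```python
import threading

def respond_with_filler(generate_fn, speak_fn,
                        filler: str = "One moment.",
                        threshold_s: float = 0.8) -> None:
    """Speak a filler if the reply takes longer than threshold_s to produce."""
    done = threading.Event()

    def maybe_fill():
        # wait() returns False only if the reply is still pending at timeout
        if not done.wait(threshold_s):
            speak_fn(filler)

    timer = threading.Thread(target=maybe_fill)
    timer.start()
    reply = generate_fn()        # blocking language-model call
    done.set()                   # cancel the filler if it hasn't fired yet
    timer.join()
    speak_fn(reply)
```

Fast replies go out unadorned; slow ones get "One moment." within the threshold, keeping the silence gap under the point where callers start talking over the agent.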

Latency compounds across the pipeline. If ASR takes 200 milliseconds, the language model takes 600 milliseconds, and TTS takes 400 milliseconds, the total response time is 1.2 seconds before audio starts playing. Add network latency and the caller is waiting 1.5 to 2 seconds after they stop speaking before they hear anything. That is fine for a transactional conversation. It feels like a broken phone line for a complex multi-turn conversation.
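The budget arithmetic from this example, written out. The network figure is an assumed round number for illustration; the others come from the example above.

```python
# Per-turn latency budget: each stage adds to the time before first audio.
latency_ms = {
    "asr": 200,       # finalise transcription after end of speech
    "llm": 600,       # generate the response text
    "tts": 400,       # time to first audio from the TTS engine
    "network": 300,   # telephony round trips (assumed figure)
}

time_to_first_audio = sum(latency_ms.values())   # 1500 ms total
```

Because the stages add up, shaving 200 ms off any single stage moves the whole turn; this is why vendors compete on per-stage latency rather than raw accuracy alone.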

Context window management is a problem that only shows up in longer calls. Language models have limits on how much conversation history they can process. In a standard booking call of four to six turns, this is not a problem. In a longer call with multiple topics, the model may begin losing context from earlier in the conversation. This produces responses that contradict something the caller said three minutes earlier. The fix is careful conversation design that keeps individual call flows focused and brief.
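Alongside careful conversation design, platforms typically trim older turns to fit the model's limit. A minimal sketch, assuming a crude four-characters-per-token estimate rather than a real tokenizer: keep the system prompt plus only the most recent turns that fit the budget.

```python
# Keep the system prompt and as many recent turns as fit a token budget.
# The 4-chars-per-token estimate is a rough heuristic, not a real tokenizer.

def trim_history(system_prompt: str, history: list[dict],
                 max_tokens: int = 4000) -> list[dict]:
    def est_tokens(text: str) -> int:
        return max(1, len(text) // 4)

    budget = max_tokens - est_tokens(system_prompt)
    kept: list[dict] = []
    for turn in reversed(history):        # walk newest turns first
        cost = est_tokens(turn["content"])
        if cost > budget:
            break                         # oldest turns are dropped
        kept.append(turn)
        budget -= cost
    kept.reverse()
    return [{"role": "system", "content": system_prompt}] + kept
```

The trade-off is visible in the loop: anything dropped is exactly the "something the caller said three minutes earlier" that the model can then contradict, which is why short, focused call flows avoid the problem entirely.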

Integration failures are the quietest and most damaging failure mode. The AI collects all the right information, the booking function is called, the calendar write fails silently, and the caller receives a confirmation for an appointment that was never created. The fix is explicit error handling in the integration layer and a confirmation read-back before the call ends. The agent should read back the confirmed time, date, and any reference number before ending the call. If the write failed, the agent will not have a reference number to read back, which triggers a fallback to human handling.
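The read-back safeguard can be sketched as follows: the agent only confirms a booking when the integration actually returned a reference, and a silent failure routes to a human instead. `write_fn` is a stand-in for the real calendar integration.

```python
def complete_booking(date: str, time: str, write_fn) -> str:
    """Return the sentence the agent speaks to close the call.

    write_fn is the calendar integration: it returns a booking
    reference on success, or None / raises on failure.
    """
    try:
        reference = write_fn(date, time)
    except Exception:
        reference = None

    if reference is None:
        # Silent write failure: never confirm, hand off to a human instead.
        return ("I wasn't able to confirm that booking. "
                "Let me transfer you to a colleague.")

    # Explicit read-back: the reference only exists if the write succeeded.
    return f"You're booked for {date} at {time}. Your reference is {reference}."

confirmed = complete_booking("2026-03-12", "14:30", lambda d, t: "BK-1042")
failed = complete_booking("2026-03-12", "14:30", lambda d, t: None)
```

The key property is that the confirmation sentence is constructed from the integration's return value, so the agent physically cannot read back a reference for an appointment that was never created.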

How does call routing and escalation work?

Call routing in an AI voice agent determines what happens once the AI concludes it cannot handle the call. The escalation path is the most important conversation design decision in any deployment. An agent that cannot gracefully escalate to a human when needed is worse than no agent at all, because it traps callers in a loop.

A well-designed escalation has three components. First, a detection condition: a phrase the caller uses, a question type outside the configured scope, or a caller who explicitly asks for a human. Second, a transfer mechanism: the call is routed to a specific number, queue, or agent. Third, a context pass: the receiving human gets a brief summary of what the caller said and what they needed.

The context pass is the part most deployments skip. A caller who has already explained their situation to an AI agent and then has to repeat it to a human when they get transferred has a worse experience than a caller who just waited in a queue. A 30-word context note delivered to the receiving agent before the call connects makes the human handoff feel seamless rather than like a restart.
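The three components, detection, transfer trigger, and context pass, can be sketched together. The trigger phrases and the 30-word cap are invented example values.

```python
# Illustrative escalation logic: detect, then prepare the context note
# that travels to the human before the transfer connects.

ESCALATION_PHRASES = ("speak to a human", "talk to a person", "real person")

def should_escalate(caller_text: str, in_scope: bool) -> bool:
    """Escalate on an explicit request or any out-of-scope question."""
    text = caller_text.lower()
    return (not in_scope) or any(p in text for p in ESCALATION_PHRASES)

def build_context_note(history: list[dict], max_words: int = 30) -> str:
    """Short summary handed to the receiving human before the call connects."""
    user_turns = [t["content"] for t in history if t["role"] == "user"]
    words = " ".join(user_turns).split()
    note = " ".join(words[:max_words])
    return note + ("..." if len(words) > max_words else "")
```

A real deployment would summarise with the language model rather than truncate, but the structural point holds: the note is generated before the transfer fires, so the human never starts from zero.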

FAQ

How long does it take an AI voice agent to respond?

A well-configured AI voice agent responds in 1.2 to 2 seconds after the caller stops speaking. The components that affect this are ASR latency (100 to 300ms for modern models), language model generation time (300 to 800ms depending on response length and model), and TTS streaming latency (200 to 500ms to first audio). Telephony adds network latency on top of all of these. The total is perceptible but not disruptive in a transactional call context.

Can AI voice agents handle multiple callers at once?

Yes. Unlike a human receptionist who can handle one call at a time, an AI voice agent handles concurrent calls limited only by the platform's infrastructure. For an SME with 20 concurrent inbound callers, all 20 can be handled simultaneously. This is one of the clearest operational advantages over a human front-desk model.

Do AI voice agents work on standard UK phone numbers?

Yes. AI voice agents can be connected to standard UK landline numbers (01/02 prefixes), freephone numbers (0800), and mobile numbers. The telephony provider, typically Twilio or a similar provider, handles the number provisioning. For existing business numbers, calls are forwarded to the AI voice agent platform. The caller experience is identical to calling any other UK number.

For the operator guide on deploying voice agents for your business, see AI voice agents and AI receptionist.

For a comparison of the tools available, see AI voice agent tools comparison and best AI voice agents in 2026.

Related reading
- AI voice agents
- What is an AI voice agent?
- AI voice agent tools comparison
- AI customer service
- AI strategy consultant

How AI voice agents work: no-jargon explanation | twohundred.ai