AI Voice Agent Glossary

Plain-English definitions of every AI voice agent term you need — 29 entries across the voice pipeline, telephony, compliance, architecture, and pricing. Updated for 2026.

Voice pipeline

Voice pipeline terms

STT (Speech-to-Text)

Converts caller audio into text in real time.

Also called automatic speech recognition (ASR). Modern streaming STT services like Deepgram, Google Speech-to-Text, and AssemblyAI return partial transcripts every 100–200 ms while the caller is still speaking. Quality differentiators include accent handling, noise robustness, and end-of-utterance detection.

TTS (Text-to-Speech)

Converts the AI agent’s response text into spoken audio.

Modern providers (ElevenLabs, OpenAI TTS, Cartesia, Play.ht) produce natural prosody, pauses, and inflection. Streaming TTS starts speaking the first sentence while later text is still generating, shaving 200–400 ms off perceived latency.

LLM (Large Language Model)

The brain — interprets intent and generates the agent’s reply.

In voice agents the LLM also calls tools (calendar, CRM, knowledge base) when the caller’s request requires real-world action. Common choices: Claude Sonnet/Haiku, GPT-4o family, Gemini Flash, and specialty voice-tuned LLMs.

VAD (Voice Activity Detection)

Detects when the caller is speaking vs. silent.

Silero VAD is a common open-source choice. Good VAD is what lets the agent know when to listen, when to interrupt, and when the caller has finished an utterance.
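To make the idea concrete, here is a minimal energy-threshold sketch. It is not how Silero VAD works (that is a neural model); the frame format, the `is_speech` name, and the 0.02 threshold are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in -1.0..1.0)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(frame: Sequence[float], threshold: float = 0.02) -> bool:
    """Crude VAD: call a frame speech when its energy exceeds the threshold."""
    return frame_rms(frame) > threshold


# Hypothetical 20 ms frames at 8 kHz: near-silence vs. a louder 440 Hz tone.
silence = [0.001] * 160
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(160)]
print(is_speech(silence), is_speech(tone))
```

A fixed energy threshold breaks down on noisy lines, which is one reason production stacks tend to use a trained model instead.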

Barge-in

The caller talking over the agent and getting heard.

Barge-in support means callers can interrupt the AI mid-sentence and the agent will stop speaking, listen, and respond. Without it, calls feel rigid and impatient callers hang up.
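The stop, listen, respond behavior can be sketched as a tiny state holder. The `AgentTurn` class and its log strings are hypothetical; a real platform would cancel an audio stream rather than append to a list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentTurn:
    """Tracks whether the agent is speaking and handles caller interruptions."""

    speaking: bool = False
    log: List[str] = field(default_factory=list)

    def start_reply(self, text: str) -> None:
        """Agent begins speaking a reply."""
        self.speaking = True
        self.log.append(f"tts_start:{text}")

    def on_caller_speech(self) -> None:
        """Caller speech detected: stop any playback, then listen."""
        if self.speaking:
            # Barge-in: the caller spoke while the agent was talking.
            self.speaking = False
            self.log.append("tts_stop")
        self.log.append("listen")


turn = AgentTurn()
turn.start_reply("Our hours are nine to five")
turn.on_caller_speech()
print(turn.log)
```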

End-of-utterance detection

Knowing when the caller actually stopped talking.

Cut someone off after a 200 ms pause and you sound rude; wait 1.5 seconds and you sound dead. Tuning end-of-utterance is one of the harder voice-AI engineering problems and a major differentiator between platforms.
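The simplest form of this is a silence timeout on top of per-frame VAD decisions. The sketch below assumes 20 ms frames and a 600 ms timeout; both values, and the function name, are illustrative rather than any platform's actual defaults.

```python
from typing import Iterable, Optional


def end_of_utterance_ms(
    speech_flags: Iterable[bool],
    frame_ms: int = 20,
    silence_timeout_ms: int = 600,
) -> Optional[int]:
    """Return the time (ms) at which the utterance is declared finished.

    speech_flags holds one VAD decision per frame. The utterance ends once
    the caller has spoken and then stayed silent for silence_timeout_ms.
    Returns None if that never happens.
    """
    heard_speech = False
    silent_ms = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= silence_timeout_ms:
                return (i + 1) * frame_ms
    return None


# Hypothetical call: speech, a 200 ms mid-sentence pause, more speech, silence.
flags = [True] * 10 + [False] * 10 + [True] * 10 + [False] * 40
print(end_of_utterance_ms(flags))                          # waits out the pause
print(end_of_utterance_ms(flags, silence_timeout_ms=200))  # cuts the caller off
```

With the 200 ms timeout the function fires during the mid-sentence pause, which is exactly the "rude" failure described above. Real systems usually combine the timer with cues from the transcript itself.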

Round-trip latency

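Time from when the caller stops speaking to when the AI starts replying.

A well-tuned 2026 stack typically runs 500–900 ms total. The per-stage budget is roughly STT 100–250 ms, LLM 200–500 ms, and TTS first audio 100–250 ms, so the stages can sum to anywhere from 400 ms to 1,000 ms. Under 800 ms feels human; 1.5 seconds or more feels broken.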
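The arithmetic behind those figures is easy to check. This sketch sums the per-stage ranges quoted above; the `budget_ms` and `feel` helpers are hypothetical, and the "noticeable lag" label for the middle band is an assumption, since the entry only names the two ends.

```python
from typing import Dict, Tuple

# Per-stage latency ranges in milliseconds, taken from the figures above.
STAGES_MS: Dict[str, Tuple[int, int]] = {
    "stt": (100, 250),
    "llm": (200, 500),
    "tts_first_audio": (100, 250),
}


def budget_ms(stages: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Best-case and worst-case totals if every stage runs back to back."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


def feel(total_ms: int) -> str:
    """Map a round-trip time onto the thresholds quoted in this entry."""
    if total_ms < 800:
        return "human"
    if total_ms >= 1500:
        return "broken"
    return "noticeable lag"  # assumed label for the band in between


print(budget_ms(STAGES_MS))
print(feel(700), feel(1600))
```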

Streaming pipeline

Each stage starts before the previous one finishes.

STT emits partial transcripts; the LLM starts generating before STT finalizes; TTS starts speaking the first sentence while the LLM continues. Streaming is the difference between a usable voice agent and a chatbot bolted onto a phone line.
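One piece of this is handing the LLM's output to TTS sentence by sentence instead of waiting for the full reply. A minimal sketch using a Python generator follows; the token list is made up, and splitting on `.`, `?`, and `!` is an assumption, not any vendor's chunking rule.

```python
from typing import Iterable, Iterator


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "?", "!")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush any trailing partial sentence


# Hypothetical LLM output arriving token by token.
llm_tokens = ["Sure", ",", " I", " can", " help", ".", " What", " day", " works", "?"]
for sentence in sentences_from_tokens(llm_tokens):
    print("send to TTS:", sentence)
```

Because the generator yields as soon as the first sentence closes, TTS can start on "Sure, I can help." while the second sentence is still being generated. A naive punctuation split like this would also break on abbreviations and prices such as "$4.50", so treat it as a starting point.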

Telephony

Telephony terms

IVR (Interactive Voice Response)

"Press 1 for sales" — fixed-tree phone menus.

Pre-AI phone-routing systems. Callers navigate by DTMF (touch-tone) or fixed-keyword speech. Hang-up rates are notoriously high. See our deeper take in /blog/ivr-vs-ai-voice-agent.

DTMF

The touch-tone key presses callers make.

Stands for Dual-Tone Multi-Frequency. AI voice agents typically don’t require DTMF (callers just speak), but DTMF capture is still useful for sensitive inputs like SSN digits or credit card numbers, where callers may prefer to key digits in rather than say them aloud.
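"Dual-tone" means each key is the sum of two sine waves: one from a row frequency and one from a column frequency on the standard keypad grid. The sketch below looks up the pair for a key and synthesizes the tone; the function names, 8 kHz sample rate, and 100 ms duration are assumptions for illustration.

```python
import math
from typing import List, Tuple

ROW_HZ = (697, 770, 852, 941)        # low-group frequencies, one per keypad row
COL_HZ = (1209, 1336, 1477, 1633)    # high-group frequencies, one per column
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key: str) -> Tuple[int, int]:
    """Return the (row, column) frequency pair in Hz for one keypad key."""
    if len(key) != 1:
        raise ValueError("key must be a single character")
    for row, keys in enumerate(KEYPAD):
        col = keys.find(key.upper())
        if col != -1:
            return ROW_HZ[row], COL_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_samples(key: str, duration_s: float = 0.1, sample_rate: int = 8000) -> List[float]:
    """Synthesize the two-tone signal for a key as floats in -1.0..1.0."""
    low, high = dtmf_frequencies(key)
    count = round(duration_s * sample_rate)
    return [
        0.5 * math.sin(2 * math.pi * low * n / sample_rate)
        + 0.5 * math.sin(2 * math.pi * high * n / sample_rate)
        for n in range(count)
    ]


print(dtmf_frequencies("5"), dtmf_frequencies("#"))
```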

SIP (Session Initiation Protocol)

The protocol that connects modern phone calls over IP.

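Most AI voice agent platforms use SIP trunks via providers like Twilio, Telnyx, or Vonage. SIP handles the signaling (setting up, transferring, and ending the call), while the audio itself typically travels over RTP between the caller, the platform, and any human transfer endpoints.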

Conditional forwarding

Send calls to AI only when you don’t answer.

Set on your existing business line: ring 4 times, then forward to the AI agent’s number. It is a lower-risk way to deploy AI without changing your published number.

Warm transfer

AI hands a live call to a human with full context.

Better than a cold transfer because the human picks up already knowing who’s calling, what they need, and what the AI has already collected. The caller doesn’t have to repeat themselves.
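What "full context" looks like in practice is a small payload handed to the human alongside the call. The sketch below is one possible shape; the `TransferContext` class, its fields, and the sample caller are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TransferContext:
    """Hypothetical hand-off payload for a warm transfer."""

    caller_name: str
    caller_number: str
    reason: str
    collected: Dict[str, str] = field(default_factory=dict)

    def briefing(self) -> str:
        """One-line summary the human sees (or hears) before picking up."""
        details = "; ".join(f"{k}: {v}" for k, v in self.collected.items())
        return (
            f"{self.caller_name} ({self.caller_number}) is calling about "
            f"{self.reason}. Collected so far: {details or 'nothing yet'}."
        )


ctx = TransferContext(
    caller_name="Dana Lee",
    caller_number="+1 555 0100",
    reason="a leaking water heater",
    collected={"address": "12 Oak St", "preferred time": "tomorrow morning"},
)
print(ctx.briefing())
```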

Architecture

Architecture terms

Knowledge base

The body of business info the AI is allowed to draw on.

Hours, services, pricing, policies, FAQs, escalation rules, accepted insurance — everything the AI uses to answer caller questions. Quality of the knowledge base is usually the biggest determinant of how good the agent feels.

Tool call (function call)

When the LLM triggers a real-world action.

Booking a calendar slot, looking up a customer in a CRM, sending an SMS confirmation, transferring to a human. Tool calls are what separate AI voice agents from chatbots that just chat.
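Mechanically, the LLM emits a structured request and the platform runs the matching function. Here is a minimal dispatcher sketch; the `{"name": ..., "arguments": ...}` JSON shape and the `book_slot` tool are assumptions for illustration, not a specific provider's format.

```python
import json
from typing import Any, Callable, Dict


def book_slot(date: str, time: str) -> Dict[str, Any]:
    """Hypothetical tool: pretend to book a calendar slot."""
    return {"status": "booked", "date": date, "time": time}


# Registry mapping tool names the LLM may request to real functions.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {"book_slot": book_slot}


def dispatch(tool_call_json: str) -> Dict[str, Any]:
    """Parse a tool call emitted by the LLM and run the matching function."""
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call["name"])
    if tool is None:
        return {"error": f"unknown tool: {call['name']}"}
    return tool(**call["arguments"])


request = '{"name": "book_slot", "arguments": {"date": "2026-03-02", "time": "14:00"}}'
print(dispatch(request))
```

The result is normally fed back to the LLM so it can confirm the booking to the caller in natural language.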

Guardrails

Hard constraints the AI is not allowed to violate.

Configured rules like "never give medical advice," "always escalate emergencies to a human," "never quote prices outside the published range." Good agents are honest about what they can’t do and route accordingly.
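Two of the example rules above can be expressed as plain checks that run outside the LLM. This is a deliberately simple sketch: the keyword list, the price range, and both function names are hypothetical, and real guardrails usually need more than substring matching.

```python
from typing import Tuple

# Hypothetical configuration for one business.
EMERGENCY_PHRASES = ("emergency", "chest pain", "gas leak")
PUBLISHED_PRICE_RANGE: Tuple[float, float] = (80.0, 250.0)


def must_escalate(caller_text: str) -> bool:
    """'Always escalate emergencies to a human': flag matching phrases."""
    text = caller_text.lower()
    return any(phrase in text for phrase in EMERGENCY_PHRASES)


def price_allowed(
    quote: float, published: Tuple[float, float] = PUBLISHED_PRICE_RANGE
) -> bool:
    """'Never quote prices outside the published range'."""
    low, high = published
    return low <= quote <= high


print(must_escalate("I smell a gas leak in the kitchen"))
print(price_allowed(300.0))
```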

No-code platform

Set up AI agents without writing code.

Configuration via plain English, drag-and-drop flow builders, and integrations connected through OAuth. JagCall, Goodcall, and Synthflow are no-code-leaning. Bland and Vapi are API-first (developer required).

API-first platform

Build the agent with code (engineers required).

You wire STT/LLM/TTS yourself, define logic in code, and deploy on the platform’s runtime. Maximum flexibility, slowest onboarding. Best when you have engineers and a custom workflow.

After-hours coverage

AI handles calls when humans are unavailable.

Most service businesses get 25–40% of calls outside 9–5. AI handling those calls — even just for booking and overflow — typically pays for itself in the first month. See /use-cases/after-hours-support.

Pricing

Pricing terms

Per-minute pricing

Charged for each minute of call time.

Common for API-first platforms: $0.07–$0.20 per minute plus telephony costs. The per-minute rate is predictable, but the bill scales directly with usage, so a busy month can surprise you.
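As a quick worked example, the monthly bill is just minutes multiplied by the combined rate. The helper name and the 1,000-minute volume are assumptions; the rates are the ends of the range quoted above.

```python
def per_minute_cost(minutes: float, ai_rate: float, telephony_rate: float = 0.0) -> float:
    """Monthly cost in dollars: minutes times (AI rate + telephony rate)."""
    return round(minutes * (ai_rate + telephony_rate), 2)


# Hypothetical month with 1,000 call minutes, ignoring telephony.
print(per_minute_cost(1000, 0.07))  # low end of the quoted range
print(per_minute_cost(1000, 0.20))  # high end of the quoted range
```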

Plan-based pricing

Flat monthly fee with included minutes.

Common for SMB-focused no-code platforms: $49–$199/month all-in. Predictable monthly bill; overage rates kick in if you exceed included minutes.
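The overage mechanic is worth seeing in numbers. In this sketch the $99 fee sits inside the range quoted above, but the 500 included minutes and the $0.25 overage rate are made-up values; check a specific plan for its real terms.

```python
def plan_cost(
    minutes_used: float,
    monthly_fee: float,
    included_minutes: float,
    overage_rate: float,
) -> float:
    """Monthly cost in dollars: flat fee plus any minutes over the allowance."""
    overage_minutes = max(0, minutes_used - included_minutes)
    return round(monthly_fee + overage_minutes * overage_rate, 2)


# Hypothetical plan: $99/month, 500 included minutes, $0.25 per extra minute.
print(plan_cost(400, 99.0, 500, 0.25))  # under the allowance: flat fee only
print(plan_cost(650, 99.0, 500, 0.25))  # 150 overage minutes added
```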

Telephony fees

The cost of carrying the actual phone call.

Separate from the AI cost. Inbound minutes on a DID (direct inward dialing) number are typically $0.005–$0.02/min via Twilio or similar. Outbound calls cost more. Some plan-based platforms bundle telephony; some bill it separately.

Ready to deploy an AI voice agent?

Try JagCall free for 14 days. Most owners are answering live calls within an hour of signup.