AI Voice Agent Glossary
Plain-English definitions of every AI voice agent term you need — 28 entries across the voice pipeline, telephony, compliance, architecture, and pricing. Updated for 2026.
Voice pipeline
STT (Speech-to-Text)
Converts caller audio into text in real time.
Also called automatic speech recognition (ASR). Modern streaming STT services like Deepgram, Google Speech-to-Text, and AssemblyAI return partial transcripts every 100–200 ms while the caller is still speaking. Quality differentiators include accent handling, noise robustness, and end-of-utterance detection.
TTS (Text-to-Speech)
Converts the AI agent’s response text into spoken audio.
Modern providers (ElevenLabs, OpenAI TTS, Cartesia, Play.ht) produce natural prosody, pauses, and inflection. Streaming TTS starts speaking the first sentence while later text is still generating, shaving 200–400 ms off perceived latency.
LLM (Large Language Model)
The brain — interprets intent and generates the agent’s reply.
In voice agents the LLM also calls tools (calendar, CRM, knowledge base) when the caller’s request requires real-world action. Common choices: Claude Sonnet/Haiku, GPT-4o family, Gemini Flash, and specialty voice-tuned LLMs.
VAD (Voice Activity Detection)
Detects when the caller is speaking vs silent.
Silero VAD is a common open-source choice. Good VAD is what lets the agent know when to listen, when to interrupt, and when the caller has finished an utterance.
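To make the idea concrete, here is a toy energy-based detector. Production models like Silero VAD are neural networks and far more robust; the frame size and threshold below are illustrative values, not tuned constants.

```python
def is_speech(frame, threshold=500.0):
    """Classify one audio frame (a list of 16-bit PCM samples) by RMS energy.
    Threshold is illustrative; real VADs learn this decision from data."""
    rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
    return rms > threshold

silence = [0] * 160            # 10 ms of silence at 16 kHz
speech = [4000, -4000] * 80    # loud alternating samples

print(is_speech(silence))  # False
print(is_speech(speech))   # True
```

A pure energy gate fails on background noise and quiet speakers, which is exactly why learned VADs dominate in production.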
Barge-in
The caller talking over the agent and getting heard.
Barge-in support means callers can interrupt the AI mid-sentence and the agent will stop speaking, listen, and respond. Without it, calls feel rigid and impatient callers hang up.
End-of-utterance detection
Knowing when the caller actually stopped talking.
Cut someone off after a 200 ms pause and you sound rude; wait 1.5 seconds and you sound dead. Tuning end-of-utterance is one of the harder voice-AI engineering problems and a major differentiator between platforms.
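The simplest version of this tuning problem is a silence timer over VAD output. A minimal sketch, with an assumed 20 ms frame size and an illustrative 600 ms cutoff chosen between the "rude" and "dead" extremes above:

```python
FRAME_MS = 20
EOU_SILENCE_MS = 600  # illustrative tuning knob, between 200 ms and 1.5 s

def find_end_of_utterance(vad_frames):
    """Return the index of the frame where the utterance ended, or None.
    vad_frames is a sequence of booleans (True = speech detected)."""
    needed = EOU_SILENCE_MS // FRAME_MS
    silent_run = 0
    heard_speech = False
    for i, speech in enumerate(vad_frames):
        if speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                # The utterance ended where the silence run began.
                return i - needed + 1
    return None

# 200 ms of speech followed by 800 ms of silence: utterance ends at frame 10.
print(find_end_of_utterance([True] * 10 + [False] * 40))  # 10
```

Real systems layer semantics on top of this timer (is the sentence grammatically complete? did the caller trail off mid-thought?), which is where the hard engineering lives.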
Round-trip latency
Time from caller stops speaking to AI starts replying.
A well-tuned 2026 stack runs 500–900 ms total: STT ~100–250 ms + LLM 200–500 ms + TTS first audio 100–250 ms. Sub-800 ms feels human. 1.5+ seconds feels broken.
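A quick sanity check on those thresholds, as a sketch (the classification bands come from the sentence above; the sample stage timings are illustrative):

```python
def feels_human(stt_ms, llm_ms, tts_ms):
    """Sum stage latencies and classify against the sub-800 ms / 1.5 s bands."""
    total = stt_ms + llm_ms + tts_ms
    if total < 800:
        return total, "human"
    if total < 1500:
        return total, "acceptable"
    return total, "broken"

print(feels_human(150, 300, 150))  # (600, 'human')
print(feels_human(250, 900, 400))  # (1550, 'broken')
```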
Streaming pipeline
Each stage starts before the previous one finishes.
STT emits partial transcripts; the LLM starts generating before STT finalizes; TTS starts speaking the first sentence while the LLM continues. Streaming is the difference between a usable voice agent and a chatbot bolted onto a phone line.
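The overlap can be sketched with async generators, where each stage consumes its upstream as items arrive instead of waiting for the full result. The stage bodies here are toy stand-ins, not real STT/LLM/TTS calls:

```python
import asyncio

async def stt(audio_chunks):
    for chunk in audio_chunks:            # emit partial transcripts
        yield f"word{chunk}"

async def llm(transcript_stream):
    async for word in transcript_stream:  # start generating before STT finishes
        yield word.upper()

async def tts(token_stream):
    out = []
    async for token in token_stream:      # "speak" each token as it lands
        out.append(f"audio({token})")
    return out

async def main():
    return await tts(llm(stt([1, 2, 3])))

print(asyncio.run(main()))  # ['audio(WORD1)', 'audio(WORD2)', 'audio(WORD3)']
```

The key property: the first audio frame can play while the last transcript chunk is still in flight, which is where the latency savings come from.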
Telephony
IVR (Interactive Voice Response)
"Press 1 for sales" — fixed-tree phone menus.
Pre-AI phone-routing systems. Callers navigate by DTMF (touch-tone) or fixed-keyword speech. Hang-up rates are notoriously high. See our deeper take in /blog/ivr-vs-ai-voice-agent.
DTMF
The touch-tone key presses callers make.
Stands for Dual-Tone Multi-Frequency. AI voice agents typically don’t require DTMF (callers just speak), but DTMF capture is still useful for sensitive inputs like SSN digits or credit card numbers where typing is preferred.
SIP (Session Initiation Protocol)
The protocol that connects modern phone calls over IP.
Most AI voice agent platforms use SIP trunks via providers like Twilio, Telnyx, or Vonage. SIP carries the audio between the caller, the platform, and any human transfer endpoints.
Conditional forwarding
Send calls to AI only when you don’t answer.
Set on your existing business line: ring 4 times, then forward to the AI agent’s number. Lower-risk way to deploy AI without changing your published number.
Warm transfer
AI hands a live call to a human with full context.
Better than a cold transfer because the human picks up already knowing who’s calling, what they need, and what the AI has already collected. The caller doesn’t have to repeat themselves.
Compliance & legal
BAA (Business Associate Agreement)
The contract a HIPAA-covered business signs with a vendor.
Required by HIPAA whenever a third party handles protected health information (PHI) on your behalf. Reputable healthcare-focused voice AI platforms will sign one. See HHS BAA standard provisions for what must be included.
TCPA (Telephone Consumer Protection Act)
US law governing automated outbound calls and texts.
Outbound calls to consumers — including AI-driven ones — generally require prior express consent. The FCC has issued specific rulings on AI-generated voice calls. Reputable platforms help with consent tracking, identification, opt-outs, and DNC compliance.
PHI (Protected Health Information)
Any health data tied to an individual.
In US healthcare, PHI must be protected per HIPAA. For voice AI, this means encrypted call recording storage, configurable retention, role-based access, audit logs, and a signed BAA.
ABA Model Rule 1.18
Confidentiality owed to prospective clients.
Lawyers owe a baseline confidentiality duty even to people who only inquire about representation. AI legal intake systems must treat every caller’s inputs as protected and design conflict checks accordingly.
AI disclosure laws
Some states require telling callers they’re talking to AI.
California SB 1001 and similar state laws require AI agents to identify themselves on commercial calls. Best practice: identify as AI in your opening greeting on every call regardless of jurisdiction.
TRS (Telecommunications Relay Services)
Federal accessibility requirements for phone systems.
FCC TRS rules require accessibility for callers with hearing or speech disabilities. AI voice systems should support TTY relay, real-time text, and human escalation paths.
Architecture
Knowledge base
The body of business info the AI is allowed to draw on.
Hours, services, pricing, policies, FAQs, escalation rules, accepted insurance — everything the AI uses to answer caller questions. Quality of the knowledge base is usually the biggest determinant of how good the agent feels.
Tool call (function call)
When the LLM triggers a real-world action.
Booking a calendar slot, looking up a customer in a CRM, sending an SMS confirmation, transferring to a human. Tool calls are what separate AI voice agents from chatbots that just chat.
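A hedged sketch of the loop: the LLM emits a structured call, the agent runtime executes it, and the result goes back to the model. The tool name, arguments, and JSON shape below are illustrative; every provider formats tool calls slightly differently.

```python
import json

def book_appointment(name, slot):
    """Illustrative tool: in production this would hit a calendar API."""
    return {"status": "booked", "name": name, "slot": slot}

TOOLS = {"book_appointment": book_appointment}

# What a model's tool-call output might look like (shape varies by provider):
llm_output = json.dumps({
    "tool": "book_appointment",
    "arguments": {"name": "Dana", "slot": "2026-03-04T10:00"},
})

call = json.loads(llm_output)
result = TOOLS[call["tool"]](**call["arguments"])
print(result["status"])  # booked
```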
Guardrails
Hard constraints the AI is not allowed to violate.
Configured rules like "never give medical advice," "always escalate emergencies to a human," "never quote prices outside the published range." Good agents are honest about what they can’t do and route accordingly.
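A toy guardrail pass over a drafted reply, checking rules like the ones above before any audio is spoken. The emergency keywords, price band, and fallback phrasings are all illustrative:

```python
import re

PRICE_RANGE = (79, 199)  # illustrative published price band, in dollars
EMERGENCY_WORDS = {"chest pain", "fire", "bleeding"}

def apply_guardrails(caller_text, draft_reply):
    """Return the draft reply, or a safe replacement if a rule fires."""
    if any(w in caller_text.lower() for w in EMERGENCY_WORDS):
        return "Escalating you to a human right away."
    for price in re.findall(r"\$(\d+)", draft_reply):
        if not PRICE_RANGE[0] <= int(price) <= PRICE_RANGE[1]:
            return "Pricing depends on the job; let me connect you with the office."
    return draft_reply

print(apply_guardrails("I have chest pain", "Our hours are 9-5."))
print(apply_guardrails("How much?", "It costs $500."))
print(apply_guardrails("How much?", "It costs $99."))
```

Note the ordering: deterministic checks run after the LLM drafts a reply, so a rule firing always wins over whatever the model wanted to say.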
No-code platform
Set up AI agents without writing code.
Configuration via plain English, drag-and-drop flow builders, and integrations connected through OAuth. JagCall, Goodcall, and Synthflow are no-code-leaning. Bland and Vapi are API-first (developer required).
API-first platform
Build the agent with code (engineers required).
You wire STT/LLM/TTS yourself, define logic in code, deploy on the platform’s runtime. Maximum flexibility, minimum onboarding speed. Best when you have engineers and a custom workflow.
After-hours coverage
AI handles calls when humans are unavailable.
Most service businesses get 25–40% of calls outside 9–5. AI handling those calls — even just for booking and overflow — typically pays for itself in the first month. See /use-cases/after-hours-support.
Pricing
Per-minute pricing
Charged for each minute of call time.
Common for API-first platforms: $0.07–$0.20 per minute plus telephony costs. Predictable for high-volume operations; can surprise you on a busy month.
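A back-of-envelope monthly estimate at those rates. The call volume and per-minute prices below are illustrative, picked from inside the ranges quoted above:

```python
def monthly_cost(calls, avg_minutes, ai_rate, telephony_rate=0.01):
    """Total monthly spend: call minutes times (AI rate + telephony rate)."""
    minutes = calls * avg_minutes
    return round(minutes * (ai_rate + telephony_rate), 2)

# 600 calls/month, 3 minutes each, at $0.10/min AI plus $0.01/min telephony:
print(monthly_cost(600, 3, 0.10))  # 198.0
```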
Plan-based pricing
Flat monthly fee with included minutes.
Common for SMB-focused no-code platforms: $49–$199/month all-in. Predictable monthly bill; overage rates kick in if you exceed included minutes.
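The overage math, as a sketch with illustrative numbers (base fee, included minutes, and overage rate vary by platform):

```python
def plan_bill(base_fee, included_minutes, used_minutes, overage_rate):
    """Flat monthly fee plus per-minute overage beyond the included bucket."""
    overage = max(0, used_minutes - included_minutes)
    return round(base_fee + overage * overage_rate, 2)

# $99/month plan with 500 included minutes, 650 used, $0.15/min overage:
print(plan_bill(99, 500, 650, 0.15))  # 121.5
```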
Telephony fees
The cost of carrying the actual phone call.
Separate from the AI cost. Inbound DID minutes are typically $0.005–$0.02/min via Twilio or similar. Outbound calls cost more. Some plan-based platforms bundle telephony; some bill it separately.
Want more depth?
Start with the pillar guide, or dig into a specific topic.
Ready to deploy an AI voice agent?
Try JagCall free for 14 days. Most owners are answering live calls within an hour of signup.