Voice · MCP · Tools · App Intents · Realtime
Two directions, one discipline. We integrate conversational voice agents inside your apps, and expose your products as tools so ChatGPT, Claude, Siri or Gemini can run actions on the user's behalf.
Foundations
Voice AI is the technology that lets machines listen, understand and speak with humans in natural language, without friction, in real time. We're not talking about rigid IVR commands ("press 1 for sales"), nor about text chatbots dressed up with robotic TTS. We're talking about conversational voice agents that combine speech recognition (STT), language models (LLMs) and neural text-to-speech (TTS) in an end-to-end pipeline that responds in under a second, with human intonation and natural turn-taking.
2026 is the year this technology stopped being experimental. OpenAI's Realtime API and Gemini Live, cloned voices from ElevenLabs and Cartesia, WebRTC transport over global infrastructure and a new generation of dialogue-optimized models have made building a conversational voice experience 90% cheaper than it was two years ago. For any company that deals with customers by phone —support, scheduling, technical assistance, billing, onboarding or service access— ignoring voice AI is the equivalent of ignoring the web in 1998.
A production voice agent isn't a single model — it's six pieces fitted together: Voice Activity Detection (VAD) to know when the user is speaking and when they stop; streaming Speech-to-Text (STT) with multi-speaker diarization and support for 50+ languages; an LLM with function calling and RAG over your corporate knowledge base, which decides what to say and which actions to execute; neural Text-to-Speech (TTS) with cloned voices and chunked streaming so the response starts playing while still being generated; real-time transport via WebRTC, SIP or PSTN; and the hardest layer — turn-taking and barge-in management, which lets the user cut off the agent whenever they take the floor again.
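The six pieces above can be sketched as a single turn loop. Every function here is a stub standing in for a real streaming service (the frame format, tool name and replies are invented for illustration, not any vendor's API):

```python
# Minimal sketch of one conversational turn through the six-stage pipeline.
# In production every stage streams; here each runs to completion for clarity.

def vad(frames):
    """Voice Activity Detection stand-in: keep only frames flagged as speech."""
    return [f for f in frames if f["is_speech"]]

def stt(speech_frames):
    """Streaming STT stand-in: join the words carried by each frame."""
    return " ".join(f["text"] for f in speech_frames)

def llm(transcript, tools):
    """LLM stand-in: decide what to say (and, in reality, which tools to call)."""
    if "order" in transcript and "lookup_order" in tools:
        return "Your order shipped yesterday."
    return "Could you repeat that?"

def tts(reply):
    """Chunked TTS stand-in: yield audio chunks as they are 'synthesized'."""
    for word in reply.split():
        yield f"<audio:{word}>"

def run_turn(frames, tools):
    transcript = stt(vad(frames))
    reply = llm(transcript, tools)
    return list(tts(reply))  # the transport layer would stream these chunks

frames = [
    {"is_speech": True,  "text": "where"},
    {"is_speech": False, "text": ""},
    {"is_speech": True,  "text": "is my order"},
]
chunks = run_turn(frames, {"lookup_order"})  # 4 audio chunks, played as they arrive
```

The real win of this shape is that `tts` is a generator: playback starts on the first chunk while the rest of the reply is still being synthesized.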
Latency is the product. If the user waits more than 800 milliseconds between the end of their sentence and the start of the agent's response, the experience breaks: it feels like the machine doesn't understand. A well-built agent stays under 500 ms end-to-end on 4G mobile and under 300 ms on WiFi. Achieving that requires streaming at every layer, edge peering, the right model and audio codec, and fine-grained jitter buffer tuning. It isn't a technical detail: it's what separates a cool demo from a usable product.
The IVR of the last twenty years is built on rigid decision trees: "press 1", "say a keyword", "wait 10 seconds". It works for two or three options and collapses at any off-script request. A voice agent powered by generative AI understands intent, context and nuance. It can resolve in a single sentence what a traditional IVR needs four nested menus for, escalate to a human when it detects it can't help, and personalize the conversation with real-time CRM data —customer name, history, preferences, order status— without the user having to type anything.
The economic impact is measurable. Early voice AI deployments in call centers show 40-70% reductions in cost per handled call, resolution times in seconds instead of minutes, and —critically— satisfaction scores equal to or above those of human agents for structured tasks: order lookup, appointment changes, tier-1 incidents, recurring payments, restocking, onboarding. Humans are reserved for the cases where they truly add differential value: complex sales, major incidents, strategic account relationships.
Voice AI delivers the highest return when three factors combine: volume (hundreds or thousands of interactions per month currently handled by phone or email), structure (interactions follow repeatable —though not identical— patterns) and urgency (the user values an immediate answer). 24/7 customer support, healthcare triage, drive-thru and retail, hands-free automotive assistants, conversational language tutors and accessibility for users with reduced mobility or vision are the cases with the fastest payback. In contrast, one-off interactions, those with high complexity or zero error tolerance —medical interventions, binding legal decisions, large-scale financial operations— remain firmly human territory.
At Dribba we've been shipping voice AI to production since 2024, combining the most mature APIs on the market with our experience in Flutter apps, high-performance backends and integration with CRMs, ERPs and enterprise telephony. If you have a use case —an overloaded phone line, an app that could answer by voice, a repetitive process eating hours of your team— the first step is a 45-minute session where we analyze feasibility, the recommended stack and expected return. No forms, no commitment.
Technologies
A production voice agent is not a single model but six pieces precisely fitted together: perception, reasoning, speech and real-time transport.
Speech-to-Text
Streaming Speech-to-Text with per-word confidence, speaker diarization and multilingual models. First-token latency under 300 ms.
Text-to-Speech
Neural Text-to-Speech with cloned voices, prosody control and chunked streaming. Voices that sound human on iOS, Android and telephony.
Language Model
The agent's brain: function calling, RAG over your knowledge base, guardrails and system prompts tuned for spoken dialogue — not chat.
Voice Activity Detection
Voice Activity Detection robust to noise and echo. Detects when the user starts and stops talking to trigger transcription and close turns without cutting off sentences.
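The "close turns without cutting off sentences" part usually comes down to a hangover window: a few sub-threshold frames are tolerated before the turn is declared over. A toy energy-threshold sketch (illustrative only; production VADs like Silero are learned models):

```python
def vad_segments(energies, threshold=0.3, hangover=2):
    """Toy energy-threshold VAD.

    energies:  per-frame energy values in [0, 1].
    hangover:  quiet frames tolerated before closing a speech turn,
               so short pauses inside a sentence don't cut it off.
    Returns (start, end) frame index pairs for detected speech, end exclusive.
    """
    segments, start, quiet = [], None, 0
    for i, e in enumerate(energies):
        if e >= threshold:
            if start is None:
                start = i          # speech begins
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:   # pause long enough: close the turn
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:          # speech ran to the end of the stream
        segments.append((start, len(energies) - quiet))
    return segments

# Frame 3 is a short mid-sentence dip: absorbed. Frames 5-7 end the turn.
segs = vad_segments([0.1, 0.5, 0.6, 0.1, 0.7, 0.1, 0.1, 0.1, 0.4])
```

Tuning `hangover` is the whole trade-off: too short and the agent talks over mid-sentence pauses, too long and every response feels laggy.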
Low-latency Transport
WebRTC, WebSockets and SIP to carry bidirectional audio with minimal latency. Integration with LiveKit, Daily, Twilio and the public telephone network.
Barge-in & Flow Control
The hard part: barge-in, interruptions, natural pauses and turn management. What separates a usable agent from a modern IVR.
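Stripped to its core, turn management is a small state machine: the agent may only speak in one state, and user speech in that state must cut playback immediately. A minimal sketch with invented event names (real events come from the VAD and the media pipeline):

```python
# Toy turn-taking state machine: decides when the agent may speak and
# when the user's voice must interrupt it (barge-in).
LISTENING, THINKING, SPEAKING = "listening", "thinking", "speaking"

class TurnManager:
    def __init__(self):
        self.state = LISTENING

    def on_event(self, event):
        """Map (state, event) to an action for the audio pipeline."""
        if self.state == LISTENING and event == "user_silence":
            self.state = THINKING
            return "run_llm"
        if self.state == THINKING and event == "reply_ready":
            self.state = SPEAKING
            return "start_playback"
        if self.state == SPEAKING and event == "user_speech":
            self.state = LISTENING   # barge-in: the user took the floor
            return "stop_playback"   # cut TTS immediately, flush the queue
        if self.state == SPEAKING and event == "playback_done":
            self.state = LISTENING
            return "wait"
        return "ignore"

tm = TurnManager()
actions = [tm.on_event(e)
           for e in ["user_silence", "reply_ready", "user_speech"]]
```

The third transition is the one that separates a usable agent from an IVR: on `user_speech` while SPEAKING, playback stops in one hop, with no "let me finish" grace period.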
Agent integrations
The other direction: exposing your product so that external agents —ChatGPT, Claude, Siri, Gemini or a custom orchestrator— can invoke actions on the user's behalf. MCP, App Intents, App Actions and webhooks, done right.
Model Context Protocol
We build MCP servers that expose your app's capabilities as typed tools discoverable and invokable in real time by Claude Desktop, Cursor, ChatGPT or any MCP client.
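Conceptually, an MCP server exposes two surfaces: `tools/list` for discovery and `tools/call` for invocation. This stdlib-only sketch mimics those handlers with plain dicts; the tool name and schema are invented, and a real server would use an MCP SDK over stdio or HTTP:

```python
# Hypothetical tool a product might expose to MCP clients.
TOOLS = {
    "get_order_status": {
        "description": "Look up the status of a customer order.",
        "inputSchema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
        "handler": lambda args: {"order_id": args["order_id"],
                                 "status": "shipped"},
    },
}

def handle(request):
    """Dispatch a JSON-RPC-shaped request the way an MCP server would."""
    if request["method"] == "tools/list":
        return [{"name": name,
                 "description": tool["description"],
                 "inputSchema": tool["inputSchema"]}
                for name, tool in TOOLS.items()]
    if request["method"] == "tools/call":
        tool = TOOLS[request["params"]["name"]]
        return tool["handler"](request["params"]["arguments"])
    raise ValueError(f"unknown method: {request['method']}")

listing = handle({"method": "tools/list"})
result = handle({"method": "tools/call",
                 "params": {"name": "get_order_status",
                            "arguments": {"order_id": "A-42"}}})
```

The typed `inputSchema` is what makes the tool discoverable: the client model reads it at runtime to decide when and how to call you.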
OpenAI Apps SDK
We build GPT Actions with OpenAPI and Apps for ChatGPT with the Apps SDK. OAuth 2.0 auth, scopes, rate limits and validated schemas so your product lives inside your customers' ChatGPT.
Tool Use · Computer Use
We integrate your app with Claude via Tool Use and, where it fits, Computer Use for browser tasks. Guardrails, deterministic retries and per-turn logging to take it to production.
Siri · Apple Intelligence
We implement App Intents in Swift so your app can be invoked from Siri, Apple Intelligence, Shortcuts, Spotlight and the lock screen. Parameters, results and live views.
Gemini · Google Assistant
We register App Actions so Gemini and Google Assistant can launch flows in your Android app, using built-in intents (ORDER_MENU_ITEM, GET_ORDER, etc.) or custom intents where no built-in one fits.
n8n · LangGraph · Zapier
For multi-agent orchestration we connect to n8n, LangGraph, Pipedream, Zapier or Make. Bidirectional webhooks, retries, idempotency and per-event observability.
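Idempotency is the piece orchestrators punish you for skipping: n8n, Zapier and friends all retry on timeout, so the same event can arrive twice. A minimal sketch (the event shape and field names are illustrative; production would back `processed` with a durable store):

```python
# Idempotent webhook handling: each event carries a unique id, and a
# redelivered event must return the original result, not run again.
processed = {}   # event_id -> cached result; use a durable store in production

def handle_event(event):
    event_id = event["id"]
    if event_id in processed:        # retry or duplicate delivery
        return processed[event_id]   # replay the original result, no side effects
    result = {"status": "done", "action": event["action"]}  # do the work once
    processed[event_id] = result
    return result

first  = handle_event({"id": "evt_1", "action": "create_ticket"})
replay = handle_event({"id": "evt_1", "action": "create_ticket"})
```

Returning the cached result (rather than an error) is deliberate: the retrying orchestrator sees a success and stops redelivering.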
Use cases
Inbound and outbound voice agents that resolve FAQs, book appointments, qualify leads and escalate to a human. Integrated with CRMs, Zendesk, HubSpot and Twilio telephony.
Voice assistants for pre-visit triage, medication reminders and post-discharge follow-up. GDPR and HIPAA compliance, and connection to existing EHRs.
Hands-free assistants for CarPlay and Android Auto. Voice control of navigation, climate, music and OEM functions with safety and eyes-on-road focus.
Voice-first interfaces for IoT, Matter and accessibility. Custom wake word, optional on-device and support for users with reduced mobility or vision.
Voice order-taking at drive-thrus and self-service kiosks, integrated with POS and ERP. Multilingual, robust to traffic noise and adaptable to local menus.
Conversational tutors for language practice with pronunciation correction, CEFR feedback and role-play. No friction: speak and learn, don't type.
Why it matters
01
Past 800 ms of response time, the user feels the machine "doesn't understand them". We design the full pipeline — streaming STT, LLM, chunked TTS, WebRTC — to stay under 500 ms end-to-end.
02
Anyone can plug in Whisper and ElevenLabs. The hard part is cutting off the agent when the user speaks, not stepping on sentences, handling natural pauses and stopping the model from "hallucinating" answers without context.
03
Café noise, accents, elderly users, unstable 4G, echoing Bluetooth headsets. We train and test against real conditions, not a studio microphone.
04
We work with OpenAI Realtime, Gemini Live, ElevenLabs, Deepgram, LiveKit and Pipecat in production projects. We know which stack suits each case and which combinations are a trap.
Our technical stack
Frequently asked questions
With Realtime APIs (OpenAI or Gemini Live), streaming STT and chunked TTS over WebRTC we get 400–600 ms end-to-end on 4G mobile and under 300 ms on WiFi. On PSTN telephony, 600–900 ms depending on the carrier.
English, Spanish, Catalan, French, Portuguese, Italian and German work at production quality. For other languages we analyze which STT/TTS/LLM combination performs best before committing.
We combine noise suppression (RNNoise / NVIDIA Broadcast), robust VAD (Silero), domain adaptation in the prompt and real test sets recorded in the client's environments. We evaluate WER per cohort before launch.
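The per-cohort evaluation hinges on Word Error Rate: word-level edit distance divided by reference length. A minimal stdlib implementation (the test sentences are invented):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word ("an") and one substitution: 2 errors over 5 words.
score = wer("book an appointment for tuesday",
            "book appointment for thursday")
```

Computing this per cohort (elderly speakers, noisy environments, each accent) rather than as one global number is what surfaces the failure modes before launch.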
Yes. We integrate via Twilio Voice, Vonage, direct SIP trunk or WebRTC embedded in your Flutter / web app. We also link to your CRM, ERP, EHR or proprietary backend via function calling and webhooks.
For HIPAA, banking or defense cases we deploy into your VPC (Azure, AWS, GCP) with self-hosted models (Whisper, Llama, local voices). We also run hybrid modes: on-device STT and LLM in a European cloud.
From €30,000 for a voice agent MVP with a focused use case and clear metrics. Enterprise projects with telephony integration, multilingual and SLA typically start from €80,000.
Yes. We build MCP servers so Claude Desktop, Cursor and ChatGPT (via MCP and the Apps SDK) can invoke your app. For Siri and Apple Intelligence we implement App Intents in Swift; for Gemini and Google Assistant, App Actions on Android. We also ship GPT Actions with OpenAPI if you prefer a traditional integration.
A voice agent lives inside your app: the user talks to your product. An agent integration inverts the direction: your product becomes a tool that ChatGPT, Claude, Siri or Gemini can invoke to run actions on the user's behalf. Both complement each other and are often shipped together.
Tell us the case, expected volume and channels. We'll tell you if it makes sense, which stack we recommend and what it would cost.