Voice · MCP · Tools · App Intents · Realtime
Two directions, one discipline. We integrate conversational voice agents inside your apps, and expose your products as tools so ChatGPT, Claude, Siri or Gemini can run actions on the user's behalf.
Foundations
Voice AI is the technology that lets machines listen, understand and speak with humans in natural language, without friction, in real time. We're not talking about rigid IVR commands ("press 1 for sales"), nor about text chatbots dressed up with robotic TTS. We're talking about conversational voice agents that combine speech recognition (STT), language models (LLMs) and neural text-to-speech (TTS) in an end-to-end pipeline that responds in under a second, with human intonation and natural turn-taking.
2026 is the year this technology stopped being experimental. OpenAI's Realtime API and Gemini Live, cloned voices from ElevenLabs and Cartesia, WebRTC transport over global infrastructure and a new generation of dialogue-optimized models have made building a conversational voice experience 90% cheaper than it was two years ago. For any company that deals with customers by phone —support, scheduling, technical assistance, billing, onboarding or service access— ignoring voice AI is the equivalent of ignoring the web in 1998.
A production voice agent isn't a single model — it's six pieces fitted together: Voice Activity Detection (VAD) to know when the user is speaking and when they stop; streaming Speech-to-Text (STT) with multi-speaker diarization and support for 50+ languages; an LLM with function calling and RAG over your corporate knowledge base, which decides what to say and which actions to execute; neural Text-to-Speech (TTS) with cloned voices and chunked streaming so the response starts playing while still being generated; real-time transport via WebRTC, SIP or PSTN; and the hardest layer — turn-taking and barge-in management, which lets the user cut off the agent whenever they take the floor again.
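The six pieces above can be sketched as a single turn loop. Every function here is a stub standing in for a real streaming service (the frame format, tool name and replies are invented for illustration, not any vendor's API):

```python
# Minimal sketch of one conversational turn through the six-stage pipeline.
# In production every stage streams; here each runs to completion for clarity.

def vad(frames):
    """Voice Activity Detection stand-in: keep only frames flagged as speech."""
    return [f for f in frames if f["is_speech"]]

def stt(speech_frames):
    """Streaming STT stand-in: join the words carried by each frame."""
    return " ".join(f["text"] for f in speech_frames)

def llm(transcript, tools):
    """LLM stand-in: decide what to say (and, in reality, which tools to call)."""
    if "order" in transcript and "lookup_order" in tools:
        return "Your order shipped yesterday."
    return "Could you repeat that?"

def tts(reply):
    """Chunked TTS stand-in: yield audio chunks as they are 'synthesized'."""
    for word in reply.split():
        yield f"<audio:{word}>"

def run_turn(frames, tools):
    transcript = stt(vad(frames))
    reply = llm(transcript, tools)
    return list(tts(reply))  # the transport layer would stream these chunks

frames = [
    {"is_speech": True,  "text": "where"},
    {"is_speech": False, "text": ""},
    {"is_speech": True,  "text": "is my order"},
]
chunks = run_turn(frames, {"lookup_order"})  # 4 audio chunks, played as they arrive
```

The real win of this shape is that `tts` is a generator: playback starts on the first chunk while the rest of the reply is still being synthesized.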
Latency is the product. If the user waits more than 800 milliseconds between the end of their sentence and the start of the agent's response, the experience breaks: it feels like the machine doesn't understand. A well-built agent stays under 500 ms end-to-end on 4G mobile and under 300 ms on WiFi. Achieving that requires streaming at every layer, edge peering, the right model and audio codec, and fine-grained jitter buffer tuning. It isn't a technical detail: it's what separates a cool demo from a usable product.
The IVR of the last twenty years is built on rigid decision trees: "press 1", "say a keyword", "wait 10 seconds". It works for two or three options and collapses at any off-script request. A voice agent powered by generative AI understands intent, context and nuance. It can resolve in a single sentence what a traditional IVR needs four nested menus for, escalate to a human when it detects it can't help, and personalize the conversation with real-time CRM data —customer name, history, preferences, order status— without the user having to type anything.
The economic impact is measurable. Early voice AI deployments in call centers show 40-70% reductions in cost per handled call, resolution times in seconds instead of minutes, and —critically— satisfaction scores equal to or above those of human agents for structured tasks: order lookup, appointment changes, tier-1 incidents, recurring payments, restocking, onboarding. Humans are reserved for the cases where they truly add differential value: complex sales, major incidents, strategic account relationships.
Voice AI delivers the highest return when three factors combine: volume (hundreds or thousands of interactions per month currently handled by phone or email), structure (interactions follow repeatable —though not identical— patterns) and urgency (the user values an immediate answer). 24/7 customer support, healthcare triage, drive-thru and retail, hands-free automotive assistants, conversational language tutors and accessibility for users with reduced mobility or vision are the cases with the fastest payback. In contrast, one-off interactions, those with high complexity or zero error tolerance —medical interventions, binding legal decisions, large-scale financial operations— remain firmly human territory.
At Dribba we've been shipping voice AI to production since 2024, combining the most mature APIs on the market with our experience in Flutter apps, high-performance backends and integration with CRMs, ERPs and enterprise telephony. If you have a use case —an overloaded phone line, an app that could answer by voice, a repetitive process eating hours of your team— the first step is a 45-minute session where we analyze feasibility, the recommended stack and expected return. No forms, no commitment.
Technologies
A production voice agent is not a single model but six pieces precisely fitted together: perception, reasoning, speech and real-time transport.
Speech-to-Text
Streaming Speech-to-Text with per-word confidence, speaker diarization and multilingual models. First-token latency under 300 ms.
Text-to-Speech
Neural Text-to-Speech with cloned voices, prosody control and chunked streaming. Voices that sound human on iOS, Android and telephony.
Language Model
The agent's brain: function calling, RAG over your knowledge base, guardrails and system prompts tuned for spoken dialogue — not chat.
Voice Activity Detection
Voice Activity Detection robust to noise and echo. Detects when the user starts and stops talking to trigger transcription and close turns without cutting off sentences.
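The "close turns without cutting off sentences" part usually comes down to a hangover window: a few sub-threshold frames are tolerated before the turn is declared over. A toy energy-threshold sketch (illustrative only; production VADs like Silero are learned models):

```python
def vad_segments(energies, threshold=0.3, hangover=2):
    """Toy energy-threshold VAD.

    energies:  per-frame energy values in [0, 1].
    hangover:  quiet frames tolerated before closing a speech turn,
               so short pauses inside a sentence don't cut it off.
    Returns (start, end) frame index pairs for detected speech, end exclusive.
    """
    segments, start, quiet = [], None, 0
    for i, e in enumerate(energies):
        if e >= threshold:
            if start is None:
                start = i          # speech begins
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:   # pause long enough: close the turn
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:          # speech ran to the end of the stream
        segments.append((start, len(energies) - quiet))
    return segments

# Frame 3 is a short mid-sentence dip: absorbed. Frames 5-7 end the turn.
segs = vad_segments([0.1, 0.5, 0.6, 0.1, 0.7, 0.1, 0.1, 0.1, 0.4])
```

Tuning `hangover` is the whole trade-off: too short and the agent talks over mid-sentence pauses, too long and every response feels laggy.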
Low-latency Transport
WebRTC, WebSockets and SIP to carry bidirectional audio with minimal latency. Integration with LiveKit, Daily, Twilio and the public telephone network.
Barge-in & Flow Control
The hard part: barge-in, interruptions, natural pauses and turn management. What separates a usable agent from a modern IVR.
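Stripped to its core, turn management is a small state machine: the agent may only speak in one state, and user speech in that state must cut playback immediately. A minimal sketch with invented event names (real events come from the VAD and the media pipeline):

```python
# Toy turn-taking state machine: decides when the agent may speak and
# when the user's voice must interrupt it (barge-in).
LISTENING, THINKING, SPEAKING = "listening", "thinking", "speaking"

class TurnManager:
    def __init__(self):
        self.state = LISTENING

    def on_event(self, event):
        """Map (state, event) to an action for the audio pipeline."""
        if self.state == LISTENING and event == "user_silence":
            self.state = THINKING
            return "run_llm"
        if self.state == THINKING and event == "reply_ready":
            self.state = SPEAKING
            return "start_playback"
        if self.state == SPEAKING and event == "user_speech":
            self.state = LISTENING   # barge-in: the user took the floor
            return "stop_playback"   # cut TTS immediately, flush the queue
        if self.state == SPEAKING and event == "playback_done":
            self.state = LISTENING
            return "wait"
        return "ignore"

tm = TurnManager()
actions = [tm.on_event(e)
           for e in ["user_silence", "reply_ready", "user_speech"]]
```

The third transition is the one that separates a usable agent from an IVR: on `user_speech` while SPEAKING, playback stops in one hop, with no "let me finish" grace period.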
Agent integrations
The other direction: exposing your product so that external agents —ChatGPT, Claude, Siri, Gemini or a custom orchestrator— can invoke actions on the user's behalf. MCP, App Intents, App Actions and webhooks, done right.
Model Context Protocol
We build MCP servers that expose your app's capabilities as typed tools discoverable and invokable in real time by Claude Desktop, Cursor, ChatGPT or any MCP client.
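Conceptually, an MCP server exposes two surfaces: `tools/list` for discovery and `tools/call` for invocation. This stdlib-only sketch mimics those handlers with plain dicts; the tool name and schema are invented, and a real server would use an MCP SDK over stdio or HTTP:

```python
# Hypothetical tool a product might expose to MCP clients.
TOOLS = {
    "get_order_status": {
        "description": "Look up the status of a customer order.",
        "inputSchema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
        "handler": lambda args: {"order_id": args["order_id"],
                                 "status": "shipped"},
    },
}

def handle(request):
    """Dispatch a JSON-RPC-shaped request the way an MCP server would."""
    if request["method"] == "tools/list":
        return [{"name": name,
                 "description": tool["description"],
                 "inputSchema": tool["inputSchema"]}
                for name, tool in TOOLS.items()]
    if request["method"] == "tools/call":
        tool = TOOLS[request["params"]["name"]]
        return tool["handler"](request["params"]["arguments"])
    raise ValueError(f"unknown method: {request['method']}")

listing = handle({"method": "tools/list"})
result = handle({"method": "tools/call",
                 "params": {"name": "get_order_status",
                            "arguments": {"order_id": "A-42"}}})
```

The typed `inputSchema` is what makes the tool discoverable: the client model reads it at runtime to decide when and how to call you.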
OpenAI Apps SDK
We build GPT Actions with OpenAPI and Apps for ChatGPT with the Apps SDK. OAuth 2.0 auth, scopes, rate limits and validated schemas so your product lives inside your customers' ChatGPT.
Tool Use · Computer Use
We integrate your app with Claude via Tool Use and, where it fits, Computer Use for browser tasks. Guardrails, deterministic retries and per-turn logging to take it to production.
Siri · Apple Intelligence
We implement App Intents in Swift so your app can be invoked from Siri, Apple Intelligence, Shortcuts, Spotlight and the lock screen. Parameters, results and live views.
Gemini · Google Assistant
We register App Actions so Gemini and Google Assistant can launch flows in your Android app, using built-in intents (ORDER_MENU_ITEM, GET_ORDER, etc.) or custom intents where no built-in one fits.
n8n · LangGraph · Zapier
For multi-agent orchestration we connect to n8n, LangGraph, Pipedream, Zapier or Make. Bidirectional webhooks, retries, idempotency and per-event observability.
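Idempotency is the piece orchestrators punish you for skipping: n8n, Zapier and friends all retry on timeout, so the same event can arrive twice. A minimal sketch (the event shape and field names are illustrative; production would back `processed` with a durable store):

```python
# Idempotent webhook handling: each event carries a unique id, and a
# redelivered event must return the original result, not run again.
processed = {}   # event_id -> cached result; use a durable store in production

def handle_event(event):
    event_id = event["id"]
    if event_id in processed:        # retry or duplicate delivery
        return processed[event_id]   # replay the original result, no side effects
    result = {"status": "done", "action": event["action"]}  # do the work once
    processed[event_id] = result
    return result

first  = handle_event({"id": "evt_1", "action": "create_ticket"})
replay = handle_event({"id": "evt_1", "action": "create_ticket"})
```

Returning the cached result (rather than an error) is deliberate: the retrying orchestrator sees a success and stops redelivering.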
Use cases
Inbound and outbound voice agents that resolve FAQs, book appointments, qualify leads and escalate to a human. Integrated with CRMs, Zendesk, HubSpot and Twilio telephony.
Voice assistants for pre-visit triage, medication reminders and post-discharge follow-up. GDPR and HIPAA compliance, and connection to existing EHRs.
Hands-free assistants for CarPlay and Android Auto. Voice control of navigation, climate, music and OEM functions with safety and eyes-on-road focus.
Voice-first interfaces for IoT, Matter and accessibility. Custom wake word, optional on-device and support for users with reduced mobility or vision.
Voice order-taking at drive-thrus and self-service kiosks, integrated with POS and ERP. Multilingual, robust to traffic noise and adaptable to local menus.
Conversational tutors for language practice with pronunciation correction, CEFR feedback and role-play. No friction: speak and learn, don't type.
Why it matters
01
Past 800 ms of response time, the user feels the machine "doesn't understand them". We design the full pipeline — streaming STT, LLM, chunked TTS, WebRTC — to stay under 500 ms end-to-end.
02
Anyone can plug in Whisper and ElevenLabs. The hard part is cutting off the agent when the user speaks, not stepping on sentences, handling natural pauses and stopping the model from "hallucinating" answers without context.
03
Café noise, accents, elderly users, unstable 4G, echoing Bluetooth headsets. We train and test against real conditions, not a studio microphone.
04
We work with OpenAI Realtime, Gemini Live, ElevenLabs, Deepgram, LiveKit and Pipecat in production projects. We know which stack suits each case and which combinations are a trap.
Our technical stack
Frequently asked questions
With Realtime APIs (OpenAI or Gemini Live), streaming STT and chunked TTS over WebRTC we get 400–600 ms end-to-end on 4G mobile and under 300 ms on WiFi. On PSTN telephony, 600–900 ms depending on the carrier.
English, Spanish, Catalan, French, Portuguese, Italian and German work at production quality. For other languages we analyze which STT/TTS/LLM combination performs best before committing.
We combine noise suppression (RNNoise / NVIDIA Broadcast), robust VAD (Silero), domain adaptation in the prompt and real test sets recorded in the client's environments. We evaluate WER per cohort before launch.
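The per-cohort evaluation hinges on Word Error Rate: word-level edit distance divided by reference length. A minimal stdlib implementation (the test sentences are invented):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word ("an") and one substitution: 2 errors over 5 words.
score = wer("book an appointment for tuesday",
            "book appointment for thursday")
```

Computing this per cohort (elderly speakers, noisy environments, each accent) rather than as one global number is what surfaces the failure modes before launch.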
Yes. We integrate via Twilio Voice, Vonage, direct SIP trunk or WebRTC embedded in your Flutter / web app. We also link to your CRM, ERP, EHR or proprietary backend via function calling and webhooks.
For HIPAA, banking or defense cases we deploy into your VPC (Azure, AWS, GCP) with self-hosted models (Whisper, Llama, local voices). We also run hybrid modes: on-device STT and LLM in a European cloud.
From €30,000 for a voice agent MVP with a focused use case and clear metrics. Enterprise projects with telephony integration, multilingual and SLA typically start from €80,000.
Yes. We build MCP servers so Claude Desktop, Cursor and ChatGPT (via MCP and the Apps SDK) can invoke your app. For Siri and Apple Intelligence we implement App Intents in Swift; for Gemini and Google Assistant, App Actions on Android. We also ship GPT Actions with OpenAPI if you prefer a traditional integration.
A voice agent lives inside your app: the user talks to your product. An agent integration inverts the direction: your product becomes a tool that ChatGPT, Claude, Siri or Gemini can invoke to run actions on the user's behalf. Both complement each other and are often shipped together.
Tell us the case, expected volume and channels. We'll tell you if it makes sense, which stack we recommend and what it would cost.