Realtime AI voice calls with latency under two seconds.
A low-latency voice system that combines streaming telephony, turn detection, entity extraction, and observability for production conversations.
The voice system moved from a slow demo to a production-ready conversation loop.
— Product team
Conversation delay cut from 7 seconds to under 2
WebSocket streaming stack
Custom speech boundary logic
Entities captured during calls
Voice AI fails quickly when the conversation feels slow.
The system started with too much latency for natural back-and-forth. Users had to wait, interruptions were hard to handle, and operations had limited visibility into call quality.
The build needed to improve speed without losing extraction, routing, or production observability.
Transport, speech detection, model calls, and synthesis each add delay unless the loop is designed tightly.
Poorly tuned voice activity detection (VAD) creates interruptions, awkward silences, or repeated responses.
Named entity recognition (NER) had to capture useful information without slowing the reply path.
Debugging calls requires timestamps, traces, and failure modes across the stack.
We treated latency as the product experience, not a backend metric.
The stack was shaped around streaming audio, tighter VAD, faster response orchestration, and NER that could run without blocking the conversation.
Instrumentation made every call easier to inspect: where time was spent, what failed, and which moments needed fallback behavior.
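As a rough illustration of entity extraction that does not block the conversation, the sketch below starts extraction as a background task and returns the reply without waiting for it. Everything here is an assumption for illustration: `generate_reply`, `extract_entities`, and `handle_turn` are hypothetical names, the regex is a toy stand-in for a real NER model, and the sleeps simulate model latency.

```python
import asyncio
import re

background_tasks = set()  # keep strong references so tasks are not garbage collected


async def generate_reply(transcript: str) -> str:
    # Hypothetical stand-in for the model call on the latency-critical reply path.
    await asyncio.sleep(0.01)
    return f"Got it: {transcript}"


async def extract_entities(transcript: str) -> dict:
    # Hypothetical stand-in for NER (a toy regex), deliberately slower than the reply.
    await asyncio.sleep(0.05)
    return {"numbers": re.findall(r"\d+", transcript)}


async def handle_turn(transcript: str, entity_log: list) -> str:
    # Start extraction in the background so the reply never waits on it.
    task = asyncio.create_task(extract_entities(transcript))
    background_tasks.add(task)

    def _record(done_task):
        background_tasks.discard(done_task)
        if not done_task.cancelled() and done_task.exception() is None:
            entity_log.append(done_task.result())

    task.add_done_callback(_record)
    return await generate_reply(transcript)


async def demo():
    entity_log = []
    reply = await handle_turn("book a table for 4 at 7", entity_log)
    logged_at_reply = len(entity_log)  # extraction has not finished yet
    await asyncio.sleep(0.1)           # give the background task time to complete
    return reply, logged_at_reply, entity_log


reply, logged_at_reply, entity_log = asyncio.run(demo())
```

In this sketch the caller hears the reply first, and the entities arrive in `entity_log` a moment later, where call notes or routing logic could pick them up.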
- Twilio WebSocket voice stack for realtime conversations.
- Custom voice activity detection for cleaner turn-taking.
- NER pipeline for capturing useful entities during calls.
- Latency reduced from 7 seconds to under 2 seconds with observability in place.
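To make the streaming piece concrete, here is a minimal sketch of decoding one inbound frame from a Twilio Media Streams WebSocket. It assumes the documented message shape (JSON with an `event` field and, for `media` events, a base64 `media.payload` of 8 kHz mu-law audio). The helper name `decode_media_frame`, the `streamSid` value, and the 160-byte sample frame are illustrative assumptions, not taken from the project.

```python
import base64
import json


def decode_media_frame(raw_message: str):
    """Return raw mu-law audio bytes for a 'media' event, otherwise None."""
    message = json.loads(raw_message)
    if message.get("event") != "media":
        # Non-media events (for example 'connected', 'start', 'stop') carry no audio.
        return None
    return base64.b64decode(message["media"]["payload"])


# Synthetic frame for illustration: 160 bytes of mu-law silence (0xFF),
# which is 20 ms of audio at 8 kHz.
sample = json.dumps({
    "event": "media",
    "streamSid": "MZ-example",  # hypothetical identifier
    "media": {"payload": base64.b64encode(b"\xff" * 160).decode("ascii")},
})
audio = decode_media_frame(sample)
```

Decoded frames like `audio` would then be fed to turn detection and speech recognition as they arrive, rather than after the caller finishes speaking.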
Latency-first architecture
Every step was measured against the live conversation experience.
Custom VAD
Turn boundaries were tuned for the call flow instead of generic defaults.
Parallel extraction
NER ran alongside the voice loop where possible.
Streaming transport
WebSocket audio reduced wait time and improved responsiveness.
Observability
Per-call metrics made failures easier to diagnose and cost easier to control.
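The custom VAD point above can be sketched as a small state machine: a frame counts as speech when its energy crosses a threshold, and a turn ends only after a run of quiet frames. This is an energy-based simplification, not the project's detector. The class name `TurnDetector`, the threshold, and the silence window length are all hypothetical values chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TurnDetector:
    energy_threshold: float = 500.0  # hypothetical RMS cutoff for 16-bit PCM
    silence_frames_to_end: int = 3   # hypothetical silence window, in frames
    in_speech: bool = False
    silent_run: int = 0

    def push(self, samples):
        """Feed one frame of PCM samples; return 'start', 'end', or None."""
        rms = math.sqrt(sum(s * s for s in samples) / len(samples))
        if rms >= self.energy_threshold:
            self.silent_run = 0  # any speech resets the silence window
            if not self.in_speech:
                self.in_speech = True
                return "start"
            return None
        if self.in_speech:
            self.silent_run += 1
            if self.silent_run >= self.silence_frames_to_end:
                self.in_speech = False
                self.silent_run = 0
                return "end"
        return None


loud, quiet = [2000] * 160, [10] * 160
detector = TurnDetector()
frames = [quiet, loud, loud, quiet, loud, quiet, quiet, quiet]
events = [detector.push(frame) for frame in frames]
# events == [None, "start", None, None, None, None, None, "end"]
```

The single quiet frame in the middle does not end the turn, which is the behavior that avoids cutting callers off during short pauses. A longer window makes the system more patient at the cost of added delay before the reply.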
Latency audit
Measured delay across transport, speech recognition (ASR), the language model (LLM), speech synthesis (TTS), and handoff points.
Streaming loop
Reworked Twilio WebSocket handling and call state.
Turn detection
Tuned VAD and interruption behavior.
Extraction path
Added entity extraction without blocking the response loop.
Observability
Instrumented latency, failures, and cost signals.
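A latency audit of the kind described in these steps can be approximated with a per-call trace that times each stage. The sketch below is an assumption-heavy illustration: `CallTrace` and its methods are hypothetical names, and the sleeps stand in for real ASR, LLM, and TTS calls.

```python
import time
from contextlib import contextmanager


class CallTrace:
    """Accumulates wall-clock time per named stage for one call."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock  # injectable so the trace can be tested deterministically
        self.stages = {}

    @contextmanager
    def stage(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raised an exception.
            self.stages[name] = self.stages.get(name, 0.0) + self.clock() - start

    def slowest(self):
        return max(self.stages, key=self.stages.get)

    def total(self):
        return sum(self.stages.values())


trace = CallTrace()
for name, delay in [("asr", 0.01), ("llm", 0.05), ("tts", 0.02)]:
    with trace.stage(name):
        time.sleep(delay)  # stand-in for the real stage
```

Reading `trace.stages` after each turn shows where the time went, and `trace.slowest()` points at the stage to optimize first.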
- Twilio
- WebSockets
- Audio chunks
- Call state
- VAD
- Interruptions
- Silence windows
- Retries
- ASR
- LLM
- NER
- TTS
- Metrics
- Logs
- Fallbacks
- Cost controls
The voice stack only works when streaming, turn detection, model calls, and observability are designed as one loop. Every extra delay is part of the user experience.
Usable conversation speed: latency dropped from 7 seconds to under 2 seconds.
Production control: custom VAD, NER, metrics, and fallbacks made the system easier to operate under real call conditions.
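One common way to implement a fallback on the reply path is a time budget: if the model does not answer in time, the caller hears a short canned line instead of silence. The sketch below assumes that pattern. `reply_with_fallback`, the canned text, and the budgets are hypothetical and not taken from the project.

```python
import asyncio

FALLBACK_REPLY = "Sorry, could you say that again?"  # hypothetical canned line


async def reply_with_fallback(make_reply, budget_s: float):
    """Return (reply, used_fallback), falling back if the budget is exceeded."""
    try:
        return await asyncio.wait_for(make_reply(), timeout=budget_s), False
    except asyncio.TimeoutError:
        return FALLBACK_REPLY, True


async def fast_model():
    await asyncio.sleep(0.01)  # simulated quick response
    return "Your order shipped."


async def slow_model():
    await asyncio.sleep(1.0)  # simulated stall
    return "too late"


fast_result = asyncio.run(reply_with_fallback(fast_model, 0.5))
slow_result = asyncio.run(reply_with_fallback(slow_model, 0.05))
```

The `used_fallback` flag is worth logging per call, since a rising fallback rate is an early sign that one stage is getting slower.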
- Phone calls
- Twilio streams
- Caller context
- Knowledge rules
- WebSockets
- VAD
- ASR
- NER
- LLM response
- Tool calls
- Fallbacks
- TTS output
- Live call audio
- Call notes
- Alerts
- Logs
- Latency traces
- Cost controls
- Escalation rules
- Call history