Case study

Realtime voice · Twilio automation

Realtime AI voice calls with latency under two seconds.

A low-latency voice system that combines streaming telephony, turn detection, entity extraction, and observability for production conversations.

[ Client review ]

The voice system moved from a slow demo to a production-ready conversation loop.

— Product team

AI Voice System source visual showing a chat and call interface

Client

AI Voice System

Realtime voice · Twilio automation

Engagement

Product narrative

Voice architecture · latency tuning · observability

Role

AI builder

Realtime voice engineering

Year

2026

Project positioning

Buyer case: latency and reliability outcomes

7s to <2s

Latency reduced

Conversation delay cut sharply

Twilio

Voice transport

WebSocket streaming stack

VAD

Turn detection

Custom speech boundary logic

NER

Extraction

Entities captured during calls

[ 01 ] The Problem

Voice AI fails quickly when the conversation feels slow.

The system started with too much latency for natural back-and-forth. Users had to wait, interruptions were hard to handle, and operations had limited visibility into call quality.

The build needed to improve speed without losing extraction, routing, or production observability.

[ 02 ] Why This Was Hard

01 Latency compounds everywhere

Transport, speech detection, model calls, and synthesis each add delay unless the loop is designed tightly.

02 Turn-taking is fragile

Bad VAD creates interruptions, awkward silences, or repeated responses.

03 Extraction cannot block speech

NER had to capture useful information without slowing the reply path.

04 Voice needs observability

Debugging calls requires timestamps, traces, and failure modes across the stack.
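
To make the compounding point concrete, here is a minimal Python sketch of a latency budget. The stage names echo the ones used in this case study, but the millisecond values are invented placeholders rather than measurements from the project, and the helper names are hypothetical.

```python
# Hypothetical per-stage delays in milliseconds (illustrative, not measured).
STAGES_MS = {
    "transport": 150,
    "turn_detection": 400,
    "asr": 300,
    "llm_first_token": 600,
    "tts_first_audio": 300,
}


def total_latency_ms(stages):
    """Sum sequential stage delays; the caller waits for all of them."""
    return sum(stages.values())


def over_budget_ms(stages, budget_ms):
    """Milliseconds by which the loop exceeds the budget (0 if within it)."""
    return max(0, total_latency_ms(stages) - budget_ms)


def slowest_stage(stages):
    """Name of the stage that contributes the most delay."""
    return max(stages, key=stages.get)
```

With these placeholder numbers the loop totals 1,750 ms, which fits a 2,000 ms budget, and the slowest stage is the first place to look when it does not.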

[ 03 ] Approach

We treated latency as the product experience, not a backend metric.

The stack was shaped around streaming audio, tighter VAD, faster response orchestration, and NER that could run without blocking the conversation.

Instrumentation made every call easier to inspect: where time was spent, what failed, and which moments needed fallback behavior.

Twilio WebSocket voice stack for realtime conversations.
Custom voice activity detection for cleaner turn-taking.
NER pipeline for capturing useful entities during calls.
Latency reduced from 7 seconds to under 2 seconds with observability in place.

[ 04 ] Key Decisions

Latency-first architecture

Every step was measured against the live conversation experience.

Custom VAD

Turn boundaries were tuned for the call flow instead of generic defaults.

Parallel extraction

NER ran alongside the voice loop where possible.

Streaming transport

WebSocket audio reduced wait time and improved responsiveness.

Observability

Per-call metrics made failures and cost easier to control.
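
The parallel extraction decision can be illustrated with a short asyncio sketch. It assumes the reply path and the NER step are both awaitable; `generate_reply` and `extract_entities` are hypothetical stand-ins (a sleep and a regex), not the project's actual components.

```python
import asyncio
import re


async def generate_reply(transcript):
    """Stand-in for the LLM and TTS reply path (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"Got it: {transcript}"


async def extract_entities(transcript):
    """Stand-in for NER: pulls digit runs out of the transcript."""
    await asyncio.sleep(0.1)
    return re.findall(r"\d+", transcript)


async def handle_turn(transcript):
    # Start extraction in the background; the reply does not wait for it.
    ner_task = asyncio.create_task(extract_entities(transcript))
    reply = await generate_reply(transcript)
    return reply, ner_task


async def demo():
    reply, ner_task = await handle_turn("order 4821 for 2 items")
    reply_came_first = not ner_task.done()
    entities = await ner_task  # collected later, e.g. for call notes
    return reply, reply_came_first, entities
```

The design choice is that the caller hears the reply as soon as it is ready, while extracted entities arrive afterwards and can be written to notes or logs.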

[ 05 ] How We Shipped

Week 1-2

Latency audit

Measured delay across transport, ASR, LLM, TTS, and handoff points.

Week 2-3

Streaming loop

Reworked Twilio WebSocket handling and call state.

Week 3-4

Turn detection

Tuned VAD and interruption behavior.

Week 4-5

Extraction path

Added entity extraction without blocking the response loop.

Week 5-6

Observability

Instrumented latency, failures, and cost signals.
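
As a rough idea of what per-call instrumentation can look like, the sketch below records how long each stage took and whether it failed. `CallTrace` and its fields are hypothetical names chosen for illustration; a production setup would more likely export these spans to a tracing or metrics backend.

```python
import time
from contextlib import contextmanager


class CallTrace:
    """Collects per-stage durations for one call (illustrative names)."""

    def __init__(self, call_id, clock=time.perf_counter):
        self.call_id = call_id
        self.clock = clock
        self.spans = []  # list of (stage, seconds, ok)

    @contextmanager
    def stage(self, name):
        start = self.clock()
        ok = True
        try:
            yield
        except Exception:
            ok = False  # record the failure, then let it propagate
            raise
        finally:
            self.spans.append((name, self.clock() - start, ok))

    def summary(self):
        """Total seconds spent per stage across the call."""
        totals = {}
        for name, seconds, _ok in self.spans:
            totals[name] = totals.get(name, 0.0) + seconds
        return totals
```

Wrapping each step as `with trace.stage("asr"): ...` gives a per-call record of where time went and which stage raised.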

[ 06 ] Value Profile

Latency: Response time moved from awkward to usable.

Call reliability: Streaming, turn-taking, and retries were hardened.

Operational visibility: Observability made call issues easier to debug.

Cost control: Infrastructure was tuned for lower runtime waste.

[ 07 ] How It Works

[ 01 ] Stream

Voice transport

Twilio
WebSockets
Audio chunks
Call state
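
For readers unfamiliar with the transport layer, here is a minimal sketch of how the JSON frames of a Twilio Media Stream might be folded into call state. The `start`, `media`, and `stop` events and the base64 `payload` field follow Twilio's Media Streams message format; the `CallState` class and `handle_message` function are hypothetical, and the WebSocket server itself is left out.

```python
import base64
import json


class CallState:
    """Minimal per-call state; field names are illustrative."""

    def __init__(self):
        self.stream_sid = None
        self.audio = bytearray()  # inbound audio bytes as sent by Twilio
        self.ended = False


def handle_message(state, raw):
    """Apply one JSON text frame from the media stream to the call state."""
    msg = json.loads(raw)
    event = msg.get("event")
    if event == "start":
        state.stream_sid = msg["start"]["streamSid"]
    elif event == "media":
        # Payload is base64-encoded audio (mu-law at 8 kHz by default).
        state.audio.extend(base64.b64decode(msg["media"]["payload"]))
    elif event == "stop":
        state.ended = True
    # Other events (for example "connected" or "mark") are ignored here.
    return state
```

Keeping the handler a pure function of state and message makes it easy to test without a live call.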

[ 02 ] Detect

Turn handling

VAD
Interruptions
Silence windows
Retries
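
One common way to implement a silence window is an energy threshold with a hangover counter, sketched below. This is a generic illustration with assumed values for the threshold and window length, not the custom boundary logic built for this project.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class TurnDetector:
    """Energy-threshold VAD that ends a turn after N quiet frames in a row."""

    def __init__(self, threshold=500.0, silence_frames=3):
        self.threshold = threshold            # assumed value, tune per call flow
        self.silence_frames = silence_frames  # length of the silence window
        self.in_speech = False
        self.quiet_run = 0

    def push(self, frame):
        """Feed one frame; return 'start', 'end', or None."""
        if rms(frame) >= self.threshold:
            self.quiet_run = 0
            if not self.in_speech:
                self.in_speech = True
                return "start"
            return None
        if self.in_speech:
            self.quiet_run += 1
            if self.quiet_run >= self.silence_frames:
                self.in_speech = False
                self.quiet_run = 0
                return "end"
        return None
```

A shorter window makes the system reply faster but risks cutting the caller off mid-pause; a longer one adds delay to every turn.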

[ 03 ] Respond

AI voice loop

[ 04 ] Operate

Production layer

Metrics
Logs
Fallbacks
Cost controls
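
A fallback can be as simple as a time limit on the reply path. The sketch below assumes the reply is produced by an awaitable; `reply_with_fallback` and the canned wording are hypothetical, and the timeout value would need tuning against real calls.

```python
import asyncio

FALLBACK_REPLY = "Sorry, could you say that again?"  # hypothetical wording


async def reply_with_fallback(make_reply, timeout_s):
    """Return (reply, used_fallback); fall back if the reply is slow or fails."""
    try:
        reply = await asyncio.wait_for(make_reply(), timeout=timeout_s)
        return reply, False
    except Exception:  # includes asyncio.TimeoutError raised by wait_for
        return FALLBACK_REPLY, True
```

Returning the `used_fallback` flag alongside the reply lets the metrics layer count how often the fallback fires.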

The voice stack only works when streaming, turn detection, model calls, and observability are designed as one loop. Every extra delay is part of the user experience.

[ 08 ] Outcome

Usable conversation speed: latency dropped from 7 seconds to under 2 seconds.

Production control: custom VAD, NER, metrics, and fallbacks made the system easier to operate under real call conditions.

[ 09 ] Stack

Sources

Phone calls
Twilio streams
Caller context
Knowledge rules

Processing

WebSockets
VAD
ASR
NER

Answer layer

LLM response
Tool calls
Fallbacks
TTS output

Delivery

Live call audio
Call notes
Alerts
Logs

Governance

Latency traces
Cost controls
Escalation rules
Call history
