Building a Sub-300ms Voice Agent: What We Learned Shipping for Healthcare

In a voice conversation, latency is the whole experience. Past roughly a second of silence, callers assume the line dropped. In healthcare, a hesitant agent does not just feel slow; it feels untrustworthy. Hitting a sub-300ms perceived response meant treating every millisecond as a budget to defend.

The latency budget

Perceived latency has to cover speech-to-text, the model, and text-to-speech combined. The only way under the budget is to stream every stage and start speaking before the full answer is ready, rather than running the pipeline as a chain of blocking calls.
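A minimal sketch of the streaming idea, assuming Python and asyncio. The stages are stand-ins: `fake_model` and `fake_tts` are hypothetical functions with simulated delays, not real model or TTS clients. It shows how flushing text to speech at phrase boundaries lets the first audio go out well before the model has finished its answer.

```python
import asyncio
import time
from typing import AsyncIterator


async def fake_model(prompt: str) -> AsyncIterator[str]:
    """Hypothetical model stand-in: yields one token every 20 ms."""
    for token in ["Your ", "appointment ", "is ", "confirmed, ",
                  "see ", "you ", "Tuesday."]:
        await asyncio.sleep(0.02)
        yield token


async def phrases(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group tokens into speakable phrases, flushing at punctuation."""
    buf = ""
    async for tok in tokens:
        buf += tok
        if buf.rstrip().endswith((",", ".", "?", "!")):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()


async def fake_tts(phrase_stream: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Hypothetical TTS stand-in: emits one audio chunk per phrase."""
    async for phrase in phrase_stream:
        await asyncio.sleep(0.01)  # simulated synthesis time
        yield phrase.encode("utf-8")


async def respond(prompt: str):
    """Return (time to first audio, total time, audio chunks)."""
    start = time.perf_counter()
    first_audio = None
    chunks = []
    async for audio in fake_tts(phrases(fake_model(prompt))):
        if first_audio is None:
            first_audio = time.perf_counter() - start
        chunks.append(audio)  # a real agent would write this to the call
    return first_audio, time.perf_counter() - start, chunks


first_audio, total, chunks = asyncio.run(respond("confirm my appointment"))
print(f"first audio after {first_audio * 1000:.0f} ms, done after {total * 1000:.0f} ms")
```

The number to watch is time to first audio, not total generation time: the caller hears the first phrase while the rest of the answer is still being produced.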

Figure: A live healthcare voice agent session with real-time transcript and escalation.

What separates production from a demo

Barge-in: the caller can interrupt and the agent stops talking instantly.
Streaming everywhere: no synchronous handoffs between STT, model, and TTS.
Backchannel: natural pacing and acknowledgements, not robotic turns.
Graceful fallback: a clean recovery when a service times out mid-sentence.
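Barge-in from the list above can be sketched with asyncio task cancellation. This is an illustration under assumptions, not a production implementation: `speak` is a hypothetical playback loop, and the `caller_speaking` event stands in for whatever voice-activity signal the telephony or STT layer provides.

```python
import asyncio


async def speak(chunks: list, played: list) -> None:
    """Hypothetical playback loop: 'plays' one chunk every 50 ms."""
    for chunk in chunks:
        await asyncio.sleep(0.05)  # stands in for writing audio to the line
        played.append(chunk)


async def speak_with_barge_in(chunks: list, caller_speaking: asyncio.Event):
    """Play chunks until done or until the caller starts talking."""
    played: list = []
    playback = asyncio.create_task(speak(chunks, played))
    interrupt = asyncio.create_task(caller_speaking.wait())

    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = interrupt in done

    # Cancel whichever task is still running so playback stops immediately.
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played, interrupted


async def demo():
    caller_speaking = asyncio.Event()
    # Simulate the caller starting to speak 120 ms into the agent's reply.
    asyncio.get_running_loop().call_later(0.12, caller_speaking.set)
    return await speak_with_barge_in(
        ["one", "two", "three", "four", "five"], caller_speaking
    )


played, interrupted = asyncio.run(demo())
print(f"interrupted={interrupted}, chunks played before stopping: {played}")
```

Keeping playback in its own cancellable task is the design choice that matters here: the agent can stop mid-sentence, and the list of chunks actually played tells the next turn what the caller did and did not hear.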

Compliance can’t be bolted on

In healthcare, the agent should only ever touch the minimum data the task requires: identity is verified first, PHI (protected health information) is scoped to the task, and every interaction is logged for audit. Redaction sits on the boundary so the model reasons over the conversation without holding data it should never see.
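One way to picture redaction on the boundary is a small function that swaps sensitive values for placeholders before text reaches the model and restores them afterwards. The sketch below is illustrative only: the three regex patterns and the `redact` and `restore` helpers are assumptions made for this example and are far from a complete PHI detector.

```python
import re

# Illustrative patterns only (assumptions for this sketch, not full PHI coverage).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "DOB": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def redact(text: str):
    """Replace matches with placeholders; return the text and a vault mapping."""
    vault = {}
    for label, pattern in PATTERNS.items():
        def swap(match, label=label):
            token = f"[{label}_{len(vault) + 1}]"
            vault[token] = match.group(0)  # the real value stays outside the model
            return token
        text = pattern.sub(swap, text)
    return text, vault


def restore(text: str, vault: dict) -> str:
    """Put the original values back, for example before speaking to the caller."""
    for token, value in vault.items():
        text = text.replace(token, value)
    return text


utterance = "My date of birth is 4/12/1985 and my number is 555-867-5309."
redacted, vault = redact(utterance)
print(redacted)
```

The model only ever sees the placeholder form, while the vault stays on the trusted side of the boundary alongside the audit log.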

Callers forgive a voice agent that doesn’t know everything. They don’t forgive one that won’t let them reach a person.

Escalate like you mean it

Detect frustration and intent to escalate before the caller has to demand it.
Transfer with a summary so the patient never repeats themselves.
Decide in advance what happens when STT, the model, or TTS times out mid-call.
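The last two points above can be sketched together: wrap each stage in a timeout and, when it fires, escalate with a summary instead of leaving dead air. Everything named here is hypothetical (`Call`, `escalate`, `with_fallback`, the fallback wording, and the 50 ms timeout chosen to keep the demo short); only `asyncio.wait_for` is a real library call.

```python
import asyncio
from dataclasses import dataclass, field

FALLBACK_LINE = "I'm having trouble on my end. Let me connect you with a team member."


@dataclass
class Call:
    transcript: list = field(default_factory=list)
    escalated: bool = False
    handoff_summary: str = ""


def escalate(call: Call, reason: str) -> str:
    """Mark the call for transfer and attach a summary for the human agent."""
    call.escalated = True
    recent = " | ".join(call.transcript[-3:])
    call.handoff_summary = f"Reason: {reason}. Recent turns: {recent}"
    return FALLBACK_LINE


async def with_fallback(call: Call, stage_name: str, stage_coro, timeout_s: float):
    """Run one pipeline stage; on timeout, escalate instead of going silent."""
    try:
        return await asyncio.wait_for(stage_coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return escalate(call, f"{stage_name} timed out after {timeout_s}s")


async def slow_model(prompt: str) -> str:
    """Hypothetical model call that takes far too long."""
    await asyncio.sleep(1.0)
    return "late answer"


async def demo():
    call = Call(transcript=[
        "caller: I need to reschedule",
        "agent: Sure, which day?",
        "caller: Friday",
    ])
    reply = await with_fallback(call, "model", slow_model("Friday"), timeout_s=0.05)
    return call, reply


call, reply = asyncio.run(demo())
print(reply)
print(call.handoff_summary)
```

The summary travels with the transfer, so the person who picks up already knows why the call was escalated and what the caller last asked for.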

Nail latency, interruptions, and escalation, and the agent stops feeling like an obstacle between the caller and help. It starts feeling like the fastest way to get it.

Frequently asked questions

How fast does a voice agent need to respond?
Aim for a perceived response well under a second; sub-300ms feels conversational. Streaming every stage and starting playback before the full response is generated is what makes that achievable.

How do you keep a healthcare voice agent compliant?
Verify identity, scope PHI to the minimum the task needs, log every interaction for audit, and redact on the boundary so the model never holds data it should not see.

How does the agent handle interruptions?
With barge-in: the agent detects speech and stops talking instantly, then processes the new input. Without it, the experience feels robotic and callers disengage.

What happens when a service fails mid-call?
Plan the fallback explicitly: a graceful message and a handoff beat dead air. The bad call is the one to design for, because it defines the caller’s impression.
