In a voice conversation, latency is the whole experience. Past roughly a second of silence, callers assume the line dropped. In healthcare, a hesitant agent does not just feel slow, it feels untrustworthy. Hitting a sub-300 ms perceived response meant treating every millisecond as part of a budget to defend.
The latency budget
Perceived latency has to cover speech-to-text (STT), the model, and text-to-speech (TTS) combined. The only way to stay under the budget is to stream every stage and start speaking before the full answer is ready, rather than running the pipeline as a chain of blocking calls.
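
A rough way to see why streaming matters is to compare the two pipeline shapes with back-of-the-envelope numbers. The sketch below is illustrative only: the stage timings are assumed values, not measurements, and `Stage`, `blocking_latency_ms`, and `streaming_latency_ms` are hypothetical names. It also simplifies by treating time to first audio as the sum of each stage's time to its first partial output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int  # time until the stage emits its first partial result
    total_ms: int         # time until the stage has completely finished


def blocking_latency_ms(stages):
    """Each stage waits for the previous one to finish entirely."""
    return sum(stage.total_ms for stage in stages)


def streaming_latency_ms(stages):
    """Each stage starts as soon as the previous one emits its first partial output."""
    return sum(stage.first_output_ms for stage in stages)


# Assumed, illustrative timings; real values depend on the providers and the network.
PIPELINE = [
    Stage("stt", first_output_ms=80, total_ms=400),
    Stage("model", first_output_ms=120, total_ms=900),
    Stage("tts", first_output_ms=70, total_ms=600),
]
BUDGET_MS = 300

if __name__ == "__main__":
    print("blocking:", blocking_latency_ms(PIPELINE), "ms")
    print("streaming:", streaming_latency_ms(PIPELINE), "ms")
    print("streaming fits budget:", streaming_latency_ms(PIPELINE) <= BUDGET_MS)
```

With these assumed numbers the blocking chain takes 1,900 ms before the caller hears anything, while the streamed pipeline reaches first audio in 270 ms. The stages are no faster; they just overlap.
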

What separates production from a demo
- Barge-in: the caller can interrupt and the agent stops talking instantly.
- Streaming everywhere: no synchronous handoffs between STT, model, and TTS.
- Backchannel: natural pacing and acknowledgements, not robotic turns.
- Graceful fallback: a clean recovery when a service times out mid-sentence.
Compliance can’t be bolted on
In healthcare the agent should only ever work with the minimum it needs: the caller's identity is verified, access to protected health information (PHI) is scoped, and every interaction is logged for audit. Redaction sits on the boundary, so the model reasons over the conversation without holding data it should never see.
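
Redaction on the boundary might look something like the following sketch. The patterns are illustrative assumptions and nowhere near a complete PHI detector, and `PATTERNS`, `redact`, and the placeholder format are hypothetical. The idea is that the model only sees placeholders, while the original values stay in a mapping held outside the model.

```python
import re

# Illustrative patterns only; real PHI detection needs far broader coverage.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "DOB": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def redact(text):
    """Replace matches with placeholders; return the redacted text and the mapping."""
    vault = {}   # placeholder -> original value, kept outside the model
    counts = {}  # per-label counter so each placeholder is unique

    def make_sub(label):
        def sub(match):
            counts[label] = counts.get(label, 0) + 1
            token = f"[{label}_{counts[label]}]"
            vault[token] = match.group(0)
            return token
        return sub

    for label, pattern in PATTERNS.items():
        text = pattern.sub(make_sub(label), text)
    return text, vault


if __name__ == "__main__":
    utterance = "My date of birth is 4/12/1980 and my number is 555-867-5309."
    safe_text, vault = redact(utterance)
    print(safe_text)
```

Only `safe_text` would be passed to the model. The `vault` would stay with the component that is allowed to hold PHI, so placeholders can be resolved after the model has responded.
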
Callers forgive a voice agent that doesn’t know everything. They don’t forgive one that won’t let them reach a person.
Escalate like you mean it
- Detect frustration and intent to escalate before the caller has to demand it.
- Transfer with a summary so the patient never repeats themselves.
- Decide in advance what happens when STT, the model, or TTS times out mid-call.
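
The escalation points above can be sketched as a per-stage deadline that turns a timeout into a handoff carrying a summary. Everything named here is an assumption for illustration: the `TIMEOUTS_S` values, the `Handoff` shape, the keyword list in `wants_human` (a crude stand-in for real frustration and intent detection), and `summarize`, which simply joins the caller's lines.

```python
import asyncio
from dataclasses import dataclass

# Hypothetical per-stage deadlines in seconds; real values depend on the stack.
TIMEOUTS_S = {"stt": 0.05, "model": 0.05, "tts": 0.05}

# Illustrative phrases only; keyword matching is a crude stand-in for an intent model.
ESCALATION_PHRASES = ("speak to a person", "real person", "human", "representative")


@dataclass
class Handoff:
    reason: str   # why the call is being transferred
    summary: str  # what the human agent should know before picking up


def wants_human(utterance):
    """Return True if the caller appears to be asking for a person."""
    text = utterance.lower()
    return any(phrase in text for phrase in ESCALATION_PHRASES)


def summarize(transcript):
    """Naive summary: the caller's own lines, so they do not have to repeat them."""
    caller_lines = [text for speaker, text in transcript if speaker == "caller"]
    return "Caller said: " + " | ".join(caller_lines)


async def run_stage(name, work, transcript):
    """Return the stage result, or a Handoff if the stage misses its deadline."""
    try:
        return await asyncio.wait_for(work, timeout=TIMEOUTS_S[name])
    except asyncio.TimeoutError:
        return Handoff(reason=f"{name}_timeout", summary=summarize(transcript))


async def slow_model():
    await asyncio.sleep(0.5)  # simulates a model call that blows its deadline
    return "late answer"


async def fast_model():
    return "on-time answer"


if __name__ == "__main__":
    transcript = [
        ("caller", "I need to reschedule my appointment"),
        ("agent", "Sure, one moment"),
    ]
    print(asyncio.run(run_stage("model", slow_model(), transcript)))
```

Deciding the timeout behavior in one place means a stalled stage produces a transfer with context instead of dead air.
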
Nail latency, interruptions, and escalation, and the agent stops feeling like an obstacle between the caller and help. It starts feeling like the fastest way to get it.


