Keeping government call data in-country: a self-hosted voice AI architecture

#ai #security #privacy #architecture

If you're wiring up a voice agent for a government helpline, the first question is where the audio goes, before you ever get to which model sounds best. A recording of someone calling about their benefits eligibility or a criminal record is about as sensitive as data gets, and the moment it leaves the country you have a procurement problem no feature list can fix.

Most hosted voice platforms make that leave-the-country decision for you. You point your phone line at their API, and every recording, every transcript, and every field the agent extracts lands in their cloud, usually somewhere you don't control and often on another continent. For a private startup that's fine. For a public agency bound by data residency rules, it's a non-starter before accuracy even enters the conversation.

So here's the architecture that keeps all of it in-country.

The data path is the whole design

A voice call moves through three model stages. Speech comes in and gets transcribed. The transcript goes to a language model that decides what to say and what to pull from your systems. The reply gets turned back into speech. Each of those stages is a place the audio or its text can leak out to a third party.

The fix is to run all three on infrastructure you already accredit. Whisper handles transcription locally. An open-weight LLM (Llama, Qwen, whatever your security team has cleared) does the reasoning on the same network. An open TTS voice speaks the reply. The caller's audio hits your servers, gets processed, and the extracted fields land in your own case system. No part of the call is sent to a model vendor, so nothing crosses the border.
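One way to keep that property honest is to check it in code before the agent takes a call. The sketch below is an illustration, not part of any particular platform: the `Stage` type, the endpoint URLs, and the rule that only literal private-range IP addresses count as internal are all assumptions. A private address doesn't prove a server is in-country, so treat this as a first gate that catches an accidentally configured hosted endpoint, not as a residency proof.

```python
import ipaddress
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Stage:
    name: str      # "stt", "llm", or "tts"
    endpoint: str  # URL the agent calls for this stage (hypothetical)


def is_internal(endpoint: str) -> bool:
    """True only when the endpoint host is a literal private-range IP."""
    host = urlparse(endpoint).hostname
    if host is None:
        return False
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        # Hostnames are not resolved in this sketch, so they fail closed.
        return False


def external_stages(stages: list[Stage]) -> list[str]:
    """Names of stages whose endpoint is not provably internal."""
    return [s.name for s in stages if not is_internal(s.endpoint)]


pipeline = [
    Stage("stt", "http://10.0.0.5:9000/transcribe"),
    Stage("llm", "http://10.0.0.6:8000/v1/chat"),
    Stage("tts", "https://api.hosted-voice.example/v1/speak"),
]
print(external_stages(pipeline))  # ['tts']
```

Run as a startup check, a non-empty result would be a reason to refuse to boot rather than a warning to log.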

That's the part hosted SaaS can't give you, no matter how many compliance badges sit on the pricing page. If the inference runs in their cloud, the data goes to their cloud.

Colocate for latency, not just sovereignty

Keeping the models in-country has a second payoff infra people care about, which is latency. Voice is unforgiving. A caller notices half a second of dead air. If your STT is in one region, your LLM in another, and your TTS somewhere else, you pay round-trip time over and over, plus a trip out to a hosted vendor and back.

Colocating the transcription, the LLM, and the voice on the same box or the same rack collapses that. Every hop between stages stays on one local network, so the agent can wait out a slow-speaking caller and handle barge-in (the caller interrupting mid-reply) without the awkward gaps that make people hang up. Data residency and low latency end up wanting the same thing, which is the whole stack running close together on hardware you own.
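To make the arithmetic concrete, here is a rough per-turn latency budget. Every number in it is an illustrative assumption, not a measurement: the per-stage compute times, the 60 ms cross-region round trip, and the 1 ms same-rack round trip are placeholders to swap for your own figures.

```python
def turn_latency_ms(stage_ms: dict[str, int], hop_rtt_ms: list[int]) -> int:
    """Compute time for every stage plus the network round trip for each hop."""
    return sum(stage_ms.values()) + sum(hop_rtt_ms)


# Assumed compute time per stage; identical in both layouts.
STAGE_MS = {"stt": 150, "llm": 300, "tts": 120}

# Three hops per turn: agent to STT, agent to LLM, agent to TTS.
spread = turn_latency_ms(STAGE_MS, [60, 60, 60])  # stages in different regions
colocated = turn_latency_ms(STAGE_MS, [1, 1, 1])  # stages on the same rack

print(spread, colocated)  # 750 573
```

The compute cost is the same either way. Under these assumptions the whole difference is network time, which is the part colocation removes.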

Open source is what makes it auditable

Here's the part that closes a government security review. When the model weights and the pipeline code are open, your team can read exactly what the agent does with a call. Where the recording gets written. How long it's kept. What gets sent to which internal API. Which fields get extracted and where they land.

You can't audit a closed vendor's pipeline. You get a SOC 2 report and a promise. With an open stack the review becomes a code review, and the answer to "who can subpoena this recording, where does it get replicated, how long does it live" stays inside your own retention and access rules where an auditor can actually check.
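As a small illustration of what "the review becomes a code review" can look like, here is a retention rule written as plain code. The names (`Recording`, `RETENTION`, `expired`) and the 90-day window are hypothetical, chosen for the sketch and not taken from any specific platform or regulation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Recording:
    call_id: str
    stored_at: datetime  # timezone-aware time the recording was written


# Hypothetical retention window; the real value comes from your own policy.
RETENTION = timedelta(days=90)


def expired(recordings: list[Recording], now: datetime,
            retention: timedelta = RETENTION) -> list[str]:
    """Call IDs whose recordings have outlived the retention window."""
    return [r.call_id for r in recordings if now - r.stored_at > retention]


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
stored = [
    Recording("call-a", datetime(2025, 1, 1, tzinfo=timezone.utc)),
    Recording("call-b", datetime(2025, 5, 1, tzinfo=timezone.utc)),
]
print(expired(stored, now))  # ['call-a']
```

An auditor can read a rule like this, check it against the written policy, and test it, which isn't possible with a retention setting buried inside a vendor's service.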

Where Dograh fits

This is the constraint Dograh is built around. It's BSD 2-Clause licensed and fully self-hostable, so you run the whole agent in your own environment and bring your own models: Whisper for transcription, an open LLM for reasoning, an open voice for playback. You pay for infrastructure instead of a per-minute platform fee, which for a high, steady government line changes the annual bill substantially. Most hosted platforms meter around 5 to 7 cents a minute just for the platform layer before any AI usage, climbing toward 15 cents all-in at volume. Own the stack and that meter stops running.
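Here is a quick worked example of what those per-minute rates add up to over a year. The rates come from the figures above. The call volume of one million minutes a year is a made-up assumption, so plug in your own line's number.

```python
def annual_platform_fee_usd(minutes_per_year: int, cents_per_minute: int) -> int:
    """Yearly metered fee in whole US dollars."""
    return minutes_per_year * cents_per_minute // 100


MINUTES_PER_YEAR = 1_000_000  # hypothetical call volume

for label, cents in [("platform layer, low", 5),
                     ("platform layer, high", 7),
                     ("all-in at volume", 15)]:
    print(f"{label}: ${annual_platform_fee_usd(MINUTES_PER_YEAR, cents):,}")
# platform layer, low: $50,000
# platform layer, high: $70,000
# all-in at volume: $150,000
```

This only models the metered side. A fair comparison would set these totals against what the servers, GPUs, and operations staff for a self-hosted stack cost you, which this sketch does not estimate.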

The buying decision for citizen data comes down to whether the audio ever leaves the building. Self-hosting with open models is how you keep it inside.

If you want the fuller picture (the six citizen-facing use cases this architecture serves, the cost math, and the CJIS and FISMA angles), we get into it in the full write-up.