Every major voice model lab hosts in us-east or us-west. If you're building voice agents in Sydney, Mumbai, or Jakarta, your audio round-trips to Virginia and back before a single word is spoken. For real-time conversation, that's not a tradeoff. That's a dead product.
We're SLNG. We're building the global execution layer for voice agents — regional edge infrastructure combined with the leading global models for hyper-low latency, fewer tokens, and less cost. We run Deepgram, Cartesia, Speechmatics, InWorld, and AssemblyAI on sovereign compute in 11 regions.
This is the story of why that business exists, what we got wrong early, and what we've learned about the surprisingly US-centric infrastructure underneath AI.
The control plane problem nobody talks about
When we launched, we assumed we could move fast by working with inference providers. Spin up a provider's API in a target region, route traffic locally, done.
What we found was stranger than expected, and at the time, deeply frustrating.
Every major inference provider routes through US-based control planes by default. Even when you're provisioned on international compute, your traffic may briefly traverse US infrastructure. Authentication, scheduling, logging — all routed through Virginia or Oregon before your workload touches the GPU you're actually paying for.
Which defeats the entire purpose if your compliance requirement is data sovereignty, or your latency budget is under 200ms.
They all did it. Every single one. The only provider we've seen start to move on this is Modal, which now offers control plane access in three regions. Credit where it's due.
But that discovery changed our trajectory. We couldn't outsource our way to 11 sovereign regions. We had to build the infrastructure team far earlier than any early-stage startup wants to.
The GPU supply crunch hit differently outside the US
Working with inference providers isn't just about API routing. You're outsourcing compute acquisition to their procurement teams. When we couldn't rely on that, we had to acquire GPU capacity directly.
This is the standard challenge of any early-stage startup — competing for resources against companies with bigger balance sheets and longer time horizons. Nothing unique to us. But the GPU supply crunch added a regional dimension that caught us off guard. Many regions already had constrained supply. Some became simply impossible to source from. Minimum commitments got larger, lock-in periods got longer, and prices went up in tandem.
At one point, a major hyperscaler was charging us for compute they hadn't provisioned. They didn't have the GPUs on hand. Hadn't allocated them. But still thought they had the right to invoice for it.
That part was new.
The default is US compute. Every layer.
To be clear, this isn't a post about being pro- or anti-US. So much of the AI innovation driving this space comes out of the US ecosystem. But even having worked at an inference provider myself, I hadn't grasped the full depth of the problem — or more accurately, the giant chasm between "we support your region" and actually building compliant local compute that doesn't touch US infrastructure at any point in the chain.
It gets harder the more you look at it.
Real-time voice agents don't run on a single service. They chain together an orchestrator (whose servers have a geography), a runtime (which has a geography), and then the STT, LLM, and TTS components — each with its own geography.
Look at any single layer in that stack. The leading providers are US voice labs, US LLM labs, or orchestration layers like LiveKit and Pipecat, which are also US-built.
So every time something gets built, the default starting point is US compute. Not because anyone chose it deliberately. Because that is the origin of the entire supply chain.
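To make the geography argument concrete, here's a rough sketch of how each stage's region compounds into a single turn's network delay. The region names and per-hop latencies are illustrative placeholders, not measurements, and real pipelines stream through an orchestrator rather than fanning out from the client — this only sketches the compounding effect:

```python
# One-way network latency between regions, in ms (hypothetical values).
HOP_MS = {
    ("sydney", "sydney"): 2,
    ("sydney", "us-east"): 100,
}

def hop(a, b):
    # Latency is symmetric; look up the pair in either order.
    return HOP_MS.get((a, b)) or HOP_MS.get((b, a), 0)

def turn_latency(client, stages):
    """Sum the round-trips one conversational turn makes
    through each pipeline stage's region."""
    total = 0
    for _name, region in stages:
        total += 2 * hop(client, region)  # there and back
    return total

us_stack = [("stt", "us-east"), ("llm", "us-east"), ("tts", "us-east")]
local_stack = [("stt", "sydney"), ("llm", "sydney"), ("tts", "sydney")]

print(turn_latency("sydney", us_stack))     # 600 ms of network alone
print(turn_latency("sydney", local_stack))  # 12 ms
```

Even with generous assumptions, the US-routed stack burns its entire conversational latency budget on the network before any model runs.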
Double the price, worse experience
The cost inflation isn't one thing. It's three things compounding simultaneously.
Latency tax. A voice agent round-tripping from Sydney to Virginia adds 150–300ms per turn. In a real-time conversation, that's the pause that makes humans hang up. To compensate, you over-engineer: larger buffers, more aggressive endpointing, retry logic. All of it costs compute.
Token tax. Higher latency doesn't just slow things down — it breaks conversations. VAD misfires on the added delay, triggering partial transcriptions that each hit the LLM. Users repeat themselves. Endpointing fails and retries stack up. We've measured roughly 50% more LLM token consumption on US-routed voice calls versus locally-routed ones. Same intent. Same outcome. 50% more tokens because the conversation needed more turns to get there.
Egress tax. Audio data crossing regions isn't free. Streaming bidirectional audio from APAC through US endpoints and back accumulates egress charges that don't show up on any model lab's pricing page.
Add them up: the same voice agent that costs $0.04/min on US compute easily doubles to $0.08–0.10/min when routed internationally through US infrastructure.
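As a back-of-envelope model of how the three taxes stack: the $0.04/min base rate and the ~50% token overhead come from the numbers above; the latency-tax multiplier, the LLM share of cost, and the egress rate are assumptions I've plugged in for illustration:

```python
BASE_USD_PER_MIN = 0.04  # same agent on in-region US compute

def us_routed_cost(base=BASE_USD_PER_MIN,
                   latency_overhead=0.50,    # extra compute for buffers/retries (assumed)
                   token_overhead=0.50,      # ~50% more LLM tokens, per the measurement above
                   llm_share=0.5,            # fraction of cost that is LLM tokens (assumed)
                   egress_usd_per_min=0.01): # cross-region audio egress (assumed)
    cost = base * (1 + latency_overhead)       # latency tax
    cost += base * llm_share * token_overhead  # token tax on the LLM share
    cost += egress_usd_per_min                 # egress tax
    return round(cost, 3)

print(us_routed_cost())  # 0.08 — the low end of the $0.08–0.10 range
```

None of the three taxes looks fatal alone; it's the compounding that doubles the bill.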
Local compute changes the economics fundamentally. When the audio never leaves the region, the latency tax, token tax, and egress tax all disappear. The saving isn't an optimisation layered on top; it is the architecture.
Why the model labs haven't fixed this
This isn't incompetence. It's incentive structure.
Model labs sell API calls. Their business model is usage-based. If your architecture generates more round-trips, more tokens, more egress — that's more revenue for them. Regional distribution would cannibalise their own margin.
They also don't have compliance teams for 11 jurisdictions. They don't want to navigate Australian data residency requirements, Indian RBI regulations, or Indonesian sovereignty laws. That's expensive, slow, and distracts from model development.
There's a deeper structural issue too. Most model labs are US-based, operating in a relatively narrow window of languages and a massive abundance of compute. Their models are optimised for that environment. When you move outside it — into Hindi, Bahasa, Tamil, Arabic — you're running models that weren't built for your language, on hardware that's harder to source, in regions where the compliance requirements are stricter. Everything compounds.
So the gap persists. Not because it's technically impossible, but because nobody with the models has the incentive to close it.
What we've learned (and what surprised us)
The partnerships are sometimes harder than the infrastructure. Getting the leading voice model labs to let you run their models on your own compute, in your own regions, under your own compliance frameworks — that takes trust, not just engineering. Every lab has different terms, different technical requirements, different views on what running their models outside their cloud means. Building those relationships took longer than we expected.
Compliance varies wildly — and it's not just data residency. Regional data sovereignty requirements are table stakes. The harder layer is industry-specific compliance. The companies that care most about where their voice data is processed are in healthcare and financial services — and those industries have their own local regulatory regimes on top of regional data laws. Indian financial services voice data has RBI requirements. Australian healthcare has its own privacy framework. We had to scope routing, storage, and compute independently per region, then layer industry compliance on top of each one.
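The "residency first, industry regime on top" layering can be sketched as a lookup. Every region name and rule identifier here is made up for illustration — this is not SLNG's actual configuration:

```python
# Base layer: regional data-residency requirements (table stakes).
REGION_BASE = {
    "in": {"residency": "india"},
    "au": {"residency": "australia"},
}

# Overlay: industry-specific regimes layered on top of residency.
INDUSTRY_OVERLAY = {
    ("in", "finserv"): ["rbi-data-localisation"],
    ("au", "healthcare"): ["au-privacy-act-health"],
}

def compliance_scope(region, industry=None):
    """Resolve the rule set for a (region, industry) pair:
    residency rules always apply; industry rules stack on top."""
    scope = dict(REGION_BASE[region])
    scope["industry_rules"] = INDUSTRY_OVERLAY.get((region, industry), [])
    return scope

print(compliance_scope("in", "finserv"))
# {'residency': 'india', 'industry_rules': ['rbi-data-localisation']}
```

The point of the shape: you can't flatten this into one global policy, because the overlay keys on the pair, not the region alone.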
Hardware variance is real. The same model doesn't perform identically across GPU types and configurations. We benchmark every model on every hardware profile in every region — and what's available in each region continuously shifts. We generally have more flexibility on the STT side because the unit economics of smaller GPUs align well with STT models, which are generally compact. TTS is a different problem. The models require larger GPUs as a baseline, and the provider licensing fees are higher, so the unit economics of STT and TTS are fundamentally different across regions. That tuning work is invisible to customers but critical. It's also where our margin often sits.
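"Every model on every hardware profile in every region" is, mechanically, a cross-product that gets re-run whenever regional supply shifts. A minimal sketch, with made-up model names, GPU types, and a stubbed benchmark function:

```python
import itertools

# Illustrative inventory; real per-region GPU availability shifts continuously.
MODELS = ["stt-small", "tts-large"]  # hypothetical model names
GPUS_BY_REGION = {
    "au": ["l4", "a100"],
    "in": ["l4"],
}

def benchmark(model, gpu):
    """Stub: in practice this would measure latency/throughput
    of `model` on `gpu`. Placeholder numbers only."""
    return {"rtf": 0.1 if "small" in model else 0.4}

results = {
    (region, model, gpu): benchmark(model, gpu)
    for region, gpus in GPUS_BY_REGION.items()
    for model, gpu in itertools.product(MODELS, gpus)
}

# 2 models x (2 + 1) regional GPU profiles = 6 cells to keep fresh.
print(len(results))  # 6
```

The matrix stays small here, but it grows multiplicatively with models, hardware profiles, and regions — which is why the tuning work never really finishes.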
Where the constraints led us
These constraints are also where we've found our deepest innovation. Having to solve for all of them simultaneously — not in a single well-resourced US region, but across 11 — has forced us to build something more powerful than what you'd design if you started with abundant compute in Virginia.
Sovereign infrastructure across India, Australia, EU, UK, US, and Indonesia as primary hubs. Compliant compute, routing, and data storage scoped to each jurisdiction. Every major voice model available locally. A single API to access all voice models, where you can switch at any time between models, regions, and modalities.
By serving voice agents across these regions with exceptionally low latency, at a fraction of the usual cost, and without sacrificing quality across model labs, we believe we've built an execution layer that will reshape how voice agents get built globally.
If you're building voice pipelines across jurisdictions
We'd genuinely like to hear what you're hitting. Not a sales pitch — we're trying to understand the problem space better.
- What regions are you deploying to?
- What's your latency budget and are you meeting it?
- Have you run into the control plane routing issue?
- What compliance requirements are shaping your architecture?
Drop a comment or reach out. We're building this in the open because the problem is too big and too distributed for one company to solve by guessing.
I'm Luke, founder of SLNG. Everything above is what led us to build a drop-in voice agent execution layer: in-region STT, TTS, noise reduction, VAD, language detection, endpointing, voice cloning, audio streaming, compliance routing, and LLM call optimisation. $0.01/min. One API. Bring your own orchestrator and LLM. Customers report 50% less latency, more than 50% less cost, and fewer tokens per conversation.