Graham Morley

Posted on • Originally published at morleymedia.dev

AI Data Residency: When Cloud APIs Don't Meet Your Compliance Requirements

This guide covers the distinction between data residency and data sovereignty, the three real infrastructure options for AI compliance, and the operational reality of running self-hosted inference. It is written for engineering leaders and compliance teams evaluating whether cloud AI APIs meet their regulatory requirements, and what the alternatives look like if they do not.

For regulatory detail specific to your jurisdiction, see our regional guides.

Data Residency vs. Data Sovereignty

These terms are often used interchangeably, but the distinction matters for compliance. The examples that follow are Canadian; the structural problem they illustrate applies in every jurisdiction.

Data residency means data is stored in a specific geographic location. Your cloud provider has a Canadian region, your database is in ca-central-1, your data physically sits on a server in Montreal.

Data sovereignty means data is subject to the laws of the country where it is stored, and only those laws. This is the harder requirement. A US-headquartered cloud provider operating a Canadian datacenter satisfies data residency. It does not necessarily satisfy data sovereignty, because the provider's parent company may be subject to foreign legal process that can compel disclosure regardless of where the data is physically stored.

Every major regulatory framework that touches AI data handling, including HIPAA, GDPR, GLBA, PIPEDA, and the EU AI Act, imposes requirements that depend on understanding this distinction. The specific requirements vary by jurisdiction (covered in our regional guides linked above), but the structural problem is the same everywhere: storing data in a local datacenter operated by a foreign-headquartered company does not insulate it from that company's home jurisdiction.

The CLOUD Act Problem

The US CLOUD Act (Clarifying Lawful Overseas Use of Data Act) is the specific legal mechanism that makes "data residency" insufficient for many compliance requirements outside the United States.

The CLOUD Act permits US authorities to compel production of data within the "possession, custody or control" of a covered entity, regardless of where that data is physically stored. A US-headquartered company operating a datacenter in Frankfurt, Toronto, or Sydney is still subject to CLOUD Act demands on that data.

This is not theoretical. On June 10, 2025, Microsoft France's Director of Public and Legal Affairs, Anton Carniaux, testified under oath before a French Senate inquiry commission investigating digital sovereignty in public procurement. When asked whether he could guarantee that data belonging to French citizens, hosted under government procurement agreements, would not be transmitted to US authorities without French authorization, his response was "No, I cannot guarantee it." This was a senior legal official under oath in a formal parliamentary proceeding.

Canada's Treasury Board has stated the same conclusion: "as long as a CSP that operates in Canada is subject to the laws of a foreign country, Canada will not have full sovereignty over its data." The Balsillie Papers research published in March 2026 went further, noting that Canadian government data can be compelled by US authorities without Canadian judicial review or governmental notification.

Every major cloud AI service (Azure OpenAI, Amazon Bedrock, Google Vertex AI) and every major AI API provider (OpenAI, Anthropic, Google) is operated by a US-headquartered parent company subject to CLOUD Act jurisdiction. Regional deployments from these providers satisfy data residency. None of them resolve the CLOUD Act jurisdiction question for organizations outside the US.

For US-based organizations, the CLOUD Act is not a problem in the same way: it is US law applied through US legal process to US companies. It becomes relevant when you serve international customers whose regulators care about foreign government access to their data. Our US regulatory guide covers this angle.

Customer-managed encryption keys (CMEK) are often positioned as a mitigation. In theory, if you hold the encryption keys, the provider cannot decrypt your data even under compulsion. In practice, CMEK does not fully protect against CLOUD Act orders in most implementations. The provider still has access to metadata, account information, file names, sharing structures, and activity logs. A CLOUD Act order can compel production of all of this. CMEK is a meaningful layer of defence, but it is not a complete solution to the jurisdictional exposure.

Three Infrastructure Options

Option A: Cloud Provider Residency with Contractual Controls

Select a regional deployment from a hyperscaler, execute the appropriate Data Processing Agreement and Business Associate Agreement (where applicable), implement CMEK where available, and document your risk acceptance.

When this is sufficient: Your regulator accepts contractual controls and documented risk assessments. Your threat model does not include foreign government compulsion as a primary concern. Your data classification does not require the strictest sovereignty controls. You need to move fast and your compliance team has signed off on the residual risk.

What it does not solve: The jurisdictional exposure described above. For many commercial workloads, this risk is accepted with appropriate documentation. For government data, healthcare data in certain jurisdictions, or financial data subject to stricter regulatory interpretation, it may not be.

Cost: Lowest. Standard cloud compute and API pricing with no capital expenditure on hardware.

Option B: Dedicated Single-Tenant Cloud Deployments

Azure OpenAI offers provisioned throughput deployments. AWS Bedrock offers dedicated throughput. These run on reserved capacity rather than shared multi-tenant inference endpoints, and, depending on configuration, keep data within the specified region.

When this makes sense: You need guaranteed throughput and latency SLAs. Your compliance team is comfortable with the cloud provider's jurisdiction but wants network isolation from other tenants. You want to use frontier proprietary models that are not available as downloadable weights.

What it does not solve: The underlying jurisdiction question is the same as Option A. You get network isolation and dedicated compute, not jurisdictional independence.

Cost: Significantly higher than on-demand API pricing. Provisioned and dedicated throughput require committed spend, typically thousands to tens of thousands per month depending on model and throughput requirements.

Option C: Self-Hosted Infrastructure

You run the hardware. You run the models. You control the network. The inference endpoint is on infrastructure that is not subject to foreign jurisdiction because the entity that owns and operates it is domestic.

When this is the right answer: Your regulator requires that no foreign-jurisdiction entity can be compelled to disclose your data. Your threat model explicitly includes foreign government compulsion. You are processing data classified at a level that precludes third-party cloud processing. You need to run custom or fine-tuned models. Your inference volume is high enough that the capital expenditure breaks even against API costs within a reasonable timeframe.

What it costs you: Capital expenditure on GPU hardware, colocation or facility costs, power and cooling, a team capable of operating bare-metal infrastructure, and the ongoing operational burden of keeping it running.

The Self-Hosted Decision Matrix

Before committing to self-hosted infrastructure, work through these questions.

1. What is your regulator actually requiring?

There is a meaningful difference between "data must be stored in our country" (residency), "data must not be accessible to foreign governments" (sovereignty), and "data must be processed on infrastructure with no foreign corporate parent" (full-stack sovereignty). Read your specific regulatory guidance. Many organizations over-index on sovereignty requirements that their regulator has not actually imposed, and under-index on requirements like audit logging and access controls that the regulator cares about deeply.

2. Who is the adversary in your threat model?

If you are protecting against commercial data breaches and unauthorized access, cloud providers with SOC 2 Type II and ISO 27001 certifications are likely more secure than anything you will operate yourself. If you are protecting against foreign government compulsion via legal process, cloud provider certifications are irrelevant because the compulsion is lawful within the provider's jurisdiction. The threat model determines the infrastructure choice.

3. What is your inference volume?

Self-hosted GPU infrastructure has high fixed costs and low marginal costs. Cloud API pricing has low fixed costs and high marginal costs. There is a crossover point. For light, intermittent usage, cloud APIs win on total cost. For sustained high-volume inference, self-hosted hardware pays for itself. The break-even depends on your specific model size, throughput requirements, and GPU utilization rate. As a rough framework: if you are spending more than $15,000 to $20,000 per month on cloud AI API costs with consistent utilization, a capital expenditure analysis on self-hosted hardware is worth running.
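The crossover point can be sketched with simple arithmetic. All figures below are illustrative assumptions, not quotes; in a real analysis, monthly opex should include staffing, power, and colocation, and the capex figure should reflect expected hardware lifetime.

```python
def breakeven_months(hardware_capex: float,
                     monthly_opex: float,
                     monthly_api_spend: float) -> float:
    """Months until cumulative self-hosted cost drops below cloud API cost.

    Returns float('inf') if self-hosting never breaks even
    (monthly opex meets or exceeds the API bill it replaces).
    """
    monthly_savings = monthly_api_spend - monthly_opex
    if monthly_savings <= 0:
        return float("inf")
    return hardware_capex / monthly_savings

# Hypothetical: $120k of GPU hardware, $4k/month in colo, power, and
# remote hands, replacing an $18k/month cloud API bill.
months = breakeven_months(120_000, 4_000, 18_000)
print(f"Break-even after ~{months:.1f} months")  # ~8.6 months
```

If the result lands past your planned hardware refresh cycle (often three to five years for GPUs), the capex does not pay for itself before replacement.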

4. Do you have the team to operate it?

Self-hosted GPU infrastructure is not "set up a server and walk away." It requires ongoing hardware monitoring, firmware and driver updates, model serving software maintenance, network security operations, and capacity planning. If your organization does not have infrastructure operations experience, you either need to hire it, contract it, or accept that you are taking on operational risk.


Need help evaluating your AI data residency requirements? We build and operate compliant AI infrastructure for regulated industries. We have multi-year production experience running physical server infrastructure across Canadian and European datacenters, and we built an AI compliance platform that achieved SOC 2 Type 1 and ISO 27001 certifications. From initial compliance assessment through hardware planning, deployment, and ongoing operations, we handle the full stack. Talk to our team.

Hardware Selection Criteria

The hardware landscape changes faster than blog posts age. We cover selection criteria that remain stable regardless of which specific generation is current, rather than recommending specific models or listing prices that will be outdated in months. We select the best fit for each engagement based on the client's budget, model requirements, deployment location, and throughput needs.

Memory capacity is the primary constraint for inference. The model's parameters must fit in GPU memory (VRAM). A model that does not fit on a single GPU requires tensor parallelism across multiple GPUs, which introduces inter-GPU communication overhead and operational complexity. Quantization (running the model at reduced numerical precision) shrinks memory requirements significantly, but it affects output quality to varying degrees depending on the model and method. The tradeoff between memory capacity, quantization level, and output quality is the first decision point.
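The memory math above reduces to a back-of-envelope formula: parameter count times bytes per parameter, plus headroom. This sketch uses illustrative numbers (the 70B model and 20% overhead figure are assumptions; real KV-cache usage scales with batch size and context length):

```python
def vram_needed_gb(params_billions: float,
                   bytes_per_param: float,
                   overhead_fraction: float = 0.2) -> float:
    """Back-of-envelope GPU memory needed to serve a model.

    bytes_per_param: 2.0 for FP16/BF16, 1.0 for 8-bit, 0.5 for 4-bit quantization.
    overhead_fraction: rough headroom for KV cache, activations, and
    framework buffers (an assumption; size it from real measurements).
    """
    weights_gb = params_billions * bytes_per_param  # 1B params at 1 byte/param ~ 1 GB
    return weights_gb * (1 + overhead_fraction)

# A hypothetical 70B-parameter model:
print(f"FP16:  {vram_needed_gb(70, 2.0):.0f} GB")   # 168 GB -> multi-GPU territory
print(f"4-bit: {vram_needed_gb(70, 0.5):.0f} GB")   # 42 GB  -> fits one large GPU
```

The two print lines make the first decision point concrete: the same model either forces tensor parallelism or fits on a single card, depending on quantization.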

Memory bandwidth determines inference throughput. Once the model fits in memory, the speed at which the GPU can read model weights during each forward pass determines tokens-per-second. For autoregressive language models, inference is memory-bandwidth-bound, not compute-bound. A GPU with more memory bandwidth will often outperform a GPU with more raw compute at the same price point for inference workloads specifically.
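The bandwidth-bound claim gives a quick way to estimate a throughput ceiling: each generated token requires reading roughly all the model weights once, so single-stream decode speed cannot exceed bandwidth divided by weight size. The GPU figures below are placeholders, and real throughput lands below this bound once KV-cache reads and kernel overheads are counted:

```python
def tokens_per_second_ceiling(weight_bytes_gb: float,
                              memory_bandwidth_gbps: float) -> float:
    """Upper bound on single-stream decode speed for a memory-bound model.

    Batching amortizes weight reads across requests, so aggregate
    throughput can exceed this; per-stream speed cannot.
    """
    return memory_bandwidth_gbps / weight_bytes_gb

# Hypothetical: 14 GB of 4-bit weights on a GPU with ~1000 GB/s bandwidth.
print(f"~{tokens_per_second_ceiling(14, 1000):.0f} tokens/sec per stream, at best")
```

This is why a higher-bandwidth GPU often beats a higher-FLOPS one for inference: the numerator in this ratio is what moves.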

Power and cooling are non-negotiable constraints. Current-generation datacenter GPUs draw 600W to 1000W+ per unit. An 8-GPU server can draw 10kW or more. This requires appropriate power delivery (typically 208V or 240V three-phase), cooling infrastructure (liquid cooling is increasingly mandatory for the latest generation, not optional), and a facility that can support the power density. Standard office buildings and most commodity colocation cannot accommodate this without modifications.

Colocation vs. owned facility. Unless your organization already operates datacenter space, colocation is the practical choice. You ship your hardware to a facility that provides power, cooling, network connectivity, and physical security. You retain ownership and control of the servers. The colocation provider does not have logical access to your systems. Evaluate colocation providers on: power density per rack (you need more than standard 5-10kW racks), network connectivity (redundant uplinks, low-latency peering), physical security and compliance certifications, and whether they can support liquid cooling if your hardware requires it.

Redundancy planning. GPU hardware fails. Power supplies, fans, memory modules, and the GPUs themselves all have failure rates. Plan for N+1 redundancy at minimum for production workloads: enough spare capacity that losing a single GPU server does not take your inference endpoint offline.


Practical advice: Before purchasing hardware, run your target models on rented cloud GPU instances to establish baseline performance requirements. Measure tokens-per-second, latency percentiles, and memory utilization under realistic load. Use those numbers to spec your purchase, rather than sizing from spec sheets alone.
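Summarizing those load-test measurements is straightforward with the standard library. This sketch assumes you have already collected per-request wall-clock latencies; the sample values are placeholders, not real benchmark data:

```python
import statistics

def latency_report(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize per-request latencies collected during a load test."""
    # quantiles(n=100) returns 99 cut points; index 49 -> p50,
    # index 94 -> p95, index 98 -> p99.
    q = statistics.quantiles(latencies_ms, n=100)
    return {"p50": q[49], "p95": q[94], "p99": q[98],
            "mean": statistics.fmean(latencies_ms)}

# Feed this the measured latencies of real requests against a rented
# GPU instance running your target model; these numbers are synthetic.
sample = [120, 135, 128, 410, 131, 125, 140, 122, 138, 129] * 10
print(latency_report(sample))
```

Spec against p95 or p99, not the mean: tail latency is what your users feel when the batch queue is full.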

Model Selection for On-Premise Inference

The open-weight model ecosystem is mature enough that several model families are viable for production on-premise deployment. We do not recommend specific models here because the landscape shifts every few months, but the selection criteria are stable.

License terms. Some open-weight models are released under permissive licenses (Apache 2.0, MIT) that allow unrestricted commercial use. Others have custom licenses with usage restrictions, such as monthly active user caps or prohibitions on specific use cases. Read the license before committing infrastructure to a model. If you are building a product on top of it, license terms affect your business, not just your engineering.

Model size vs. hardware fit. Match the model's parameter count and precision requirements to your available GPU memory. A model that requires tensor parallelism across four GPUs to serve a single request is operationally more complex and more expensive per token than a model that fits on one or two. Mixture-of-experts (MoE) architectures are relevant here: a model with a large total parameter count but a small number of active parameters per token needs the memory of a large model but the compute of a smaller one.

Quality gap from frontier proprietary models. There is still a gap between the best open-weight models and the best proprietary models on certain tasks, particularly complex multi-step reasoning and nuanced instruction following. For many production use cases (classification, extraction, summarization, structured data generation, customer-facing chat with bounded scope), the quality gap is small enough to be irrelevant. For tasks at the frontier of model capability, it may matter. Evaluate on your actual use case, not on benchmark leaderboards.

Inference serving software. The model needs to be served through an inference engine that handles batching, quantization, and the HTTP/gRPC API layer. As of early 2026, vLLM is the production default for most deployments, using PagedAttention for efficient GPU memory management. SGLang is a strong alternative that outperforms vLLM by roughly 29% on throughput for workloads with shared context (chatbots, RAG, agents) through its RadixAttention caching. Hugging Face's Text Generation Inference (TGI) entered maintenance mode in December 2025, with Hugging Face explicitly recommending vLLM or SGLang for new deployments. llama.cpp remains the standard for running models on consumer hardware or CPU-based inference. The choice of serving software affects performance as much as the choice of GPU.

The honest assessment: If your use case requires the absolute best available model quality and you are not constrained by sovereignty requirements, proprietary cloud APIs will outperform what you can self-host. If your use case requires sovereignty, or if your inference volume makes self-hosting economically attractive, the open-weight ecosystem is good enough for most production workloads and improving rapidly.

Security Architecture for Self-Hosted AI

Running your own inference infrastructure means you own the entire security surface. This section covers the architecture decisions specific to self-hosted AI. For general server hardening, see our security checklist.

Network isolation. The inference cluster should be on an isolated network segment with no direct internet access. Client applications reach the inference API through a reverse proxy or API gateway in a DMZ. The inference servers themselves should not be able to initiate outbound connections. This limits the blast radius of a compromise and prevents data exfiltration through the inference layer.

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Client App    │────>│  API Gateway /  │────>│   Inference     │
│                 │     │  Reverse Proxy  │     │   Cluster       │
│                 │<────│  (DMZ)          │<────│  (Isolated Net) │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                              │                        │
                              ▼                        ▼
                        ┌───────────┐           ┌─────────────┐
                        │  Audit    │           │ No outbound │
                        │  Logging  │           │ access      │
                        └───────────┘           └─────────────┘

Key management. Encryption keys for data at rest (model weights if encrypted, input/output logs, cached embeddings) should be managed through a hardware security module (HSM) or a dedicated key management service. For the highest security requirements, an HSM provides tamper-evident, FIPS 140-2/3 validated key storage. For most production deployments, a software-based KMS (HashiCorp Vault or equivalent) is sufficient if configured with appropriate access controls and audit logging. The critical requirement: key material should not be stored on the inference servers themselves.

Audit logging. Every inference request should be logged with a timestamp, the requesting user or service identity, and input/output token counts. Whether to log input and output content depends on your retention policy and regulatory requirements. If you do log content, encrypt those logs at rest with keys managed separately from the inference infrastructure. Ship logs to a centralized logging system that is not on the same network segment as the inference cluster.

Air-gapped deployment patterns. For the most sensitive workloads, the inference cluster has no network connectivity to the internet at all. Model updates, software patches, and configuration changes are delivered via physical media or a one-way data diode. This is operationally expensive and only justified for classified workloads or environments with the strictest regulatory requirements.

Input/output filtering. Self-hosted models do not come with the same content filtering and safety layers that cloud APIs provide by default. If your use case involves end-user-facing interactions, you need to implement your own input validation, output filtering, and guardrails. This is additional development work that is easy to underestimate.
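As a sense of the scale of that work, here is a deliberately minimal filtering sketch. Real deployments layer dedicated guardrail models and policy engines on top of pattern checks like these; the patterns below catch only the most obvious identifier formats and are illustrative, not a complete PII taxonomy:

```python
import re

# Illustrative patterns only: US-SSN-style and card-number-style strings.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),
]

def redact(text: str) -> str:
    """Mask obvious identifier patterns before logging or returning output."""
    for pat in PII_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text

print(redact("Card 4111 1111 1111 1111 on file for SSN 123-45-6789"))
# -> Card [REDACTED] on file for SSN [REDACTED]
```

Even this toy version illustrates the design question you now own: does filtering happen at the gateway, in the application, or both, and who reviews the patterns when they inevitably miss something.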

Operational Reality

This is the section that differentiates "we have read about self-hosted AI" from "we have operated physical infrastructure."

We have run physical servers across Canadian and European datacenters for multiple years, maintaining blockchain node infrastructure that processed real financial transactions with real money at stake. That infrastructure included ErgoPad (a token launchpad that reached over $20M in total value locked), Paideia (a DAO governance platform operating across multiple blockchain networks), and several other production systems. Across all of these deployments, over multiple years of continuous operation, we had zero security exploits. We also built and shipped Crystal aOS, an AI legal compliance platform that achieved SOC 2 Type 1 and ISO/IEC 27001:2022 certifications, with document ingestion pipelines, RAG, and data residency controls built in from the start.

That track record required operational discipline that is directly transferable to AI inference infrastructure, because the failure modes are the same: hardware fails, software needs patching, networks go down, and the systems need to keep running regardless.

Hardware monitoring is continuous, not periodic. GPU temperatures, memory utilization, fan speeds, power draw, and error rates need real-time monitoring. GPU memory errors (ECC corrections and uncorrectable errors) are early indicators of hardware failure. A GPU accumulating ECC errors will eventually fail. You want to replace it during a planned maintenance window, not during a production outage. IPMI/BMC access for out-of-band management is essential: you need to power cycle a server, access its console, and check hardware health without relying on the operating system being functional.
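The ECC-error check above is easy to automate. On NVIDIA hardware, `nvidia-smi` can emit machine-readable health data (a command along the lines of `nvidia-smi --query-gpu=index,temperature.gpu,ecc.errors.corrected.volatile.total --format=csv,noheader,nounits`; exact field names vary by driver generation, so verify against your installed version). A sketch of parsing that output, with the sample string standing in for live command output and the 80°C threshold an arbitrary example:

```python
import csv
import io

# Stand-in for live `nvidia-smi` CSV output: index, temp (C), corrected ECC errors.
SAMPLE = "0, 62, 0\n1, 88, 3\n"

def flag_unhealthy(raw: str, max_temp_c: int = 80) -> list[int]:
    """Return GPU indices running hot or accumulating ECC corrections."""
    flagged = []
    for row in csv.reader(io.StringIO(raw)):
        idx, temp, ecc = (int(field.strip()) for field in row)
        if temp > max_temp_c or ecc > 0:
            flagged.append(idx)
    return flagged

print(flag_unhealthy(SAMPLE))  # GPU 1 is hot and showing ECC corrections
```

Wire a check like this into your alerting on a short interval: a GPU that trips the ECC condition goes on the replacement list before it takes the endpoint down.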

Driver and firmware updates are not optional. GPU drivers, firmware, BIOS updates, and inference serving software all receive regular updates that affect performance, stability, and security. These updates need to be tested in a staging environment before rolling to production, and they occasionally require reboots that take servers offline. Plan for regular maintenance windows.

Capacity planning requires forecasting. GPU procurement lead times can be weeks to months depending on availability. If your inference load is growing, you need to be ordering hardware before you need it, not when you run out of capacity.

Incident response for hardware is different from software. When a cloud VM has a problem, you click a button and get a new one. When a physical GPU server has a problem, someone needs to physically access the machine. If it is in a colocation facility, that means either driving to the datacenter or filing a remote-hands ticket and waiting. Factor this into your SLA calculations. If your colocation is in a different city, remote-hands response time and capability become critical vendor selection criteria.

Backups and disaster recovery are your problem. Model weights can be re-downloaded (assuming you have not fine-tuned them). Your fine-tuned models, your RAG indexes, your configuration, and your audit logs cannot be re-downloaded. Back them up. Test restoring from backups regularly. Have a documented procedure for rebuilding the inference stack from scratch on replacement hardware.


The operational cost that gets underestimated: It is not the hardware purchase, it is the ongoing human cost of keeping the infrastructure running. A production inference cluster requires monitoring, maintenance, security patching, capacity planning, and incident response. If you are budgeting for self-hosted AI, budget for the people, not just the servers.

When NOT to Self-Host

Self-hosting is the right answer for a specific set of requirements. It is the wrong answer more often than it is the right one.

Your compliance team has signed off on cloud provider residency with contractual controls. If your DPA, BAA, and risk assessment satisfy your regulator, the operational overhead of self-hosting is not justified. Most organizations fall into this category.

Your inference volume is low or intermittent. Self-hosted GPU hardware sits idle when you are not running inference. Cloud API pricing is per-token with no idle cost. If your usage is bursty or low-volume, you will pay more in depreciation and power for idle hardware than you would pay in API fees.

You do not have the team to operate it. Self-hosted infrastructure without operational expertise is a liability, not an asset. A misconfigured, unpatched, unmonitored inference cluster is worse for your security posture than a well-managed cloud deployment.

You need frontier model quality and it materially affects your product. The best proprietary models are available only through cloud APIs. If your use case requires the absolute best available model performance and the quality gap matters for your specific task, cloud APIs are the answer. Evaluate on your actual workload, not on published benchmarks.

You are prototyping or validating a use case. Use cloud APIs to validate that the AI feature works and that users want it. Migrate to self-hosted infrastructure after you have proven the use case and have the volume to justify the capital expenditure. Optimizing infrastructure before validating demand is a common and expensive mistake.

Summary

The decision framework:

  1. Determine what your regulator actually requires. Read the specific guidance for your jurisdiction and sector (US, Canada, UK/EU).
  2. Identify the adversary in your threat model. Commercial breach vs. foreign government compulsion are different problems with different solutions.
  3. If cloud APIs satisfy your compliance requirements, use them. The operational simplicity is worth it.
  4. If you need single-tenant isolation but can accept cloud provider jurisdiction, evaluate dedicated throughput offerings.
  5. If you need full sovereignty, self-hosted on domestically owned and operated infrastructure is the path. Budget for the hardware, the facility, the team, and the ongoing operational cost.
  6. Whichever path you choose, get the security fundamentals right: encryption at rest and in transit, audit logging, access controls, and incident response planning.
