
Your SOC 2 cert covers the vendor’s infrastructure — not what your users paste into prompts. The moment someone feeds client data into a cloud model, the liability is yours. The fix is architectural.
Here are the three options, and when to use each.
On-Premise
Model runs on your hardware. Nothing leaves your network. The only option that satisfies air-gap requirements and strict data residency mandates.
Use it when:
• Air-gap or strict residency mandate applies
• Gov / defense / intelligence data is involved
• >2M tokens/day makes infrastructure TCO competitive with API spend
Reality check:
$80K–$250K+ upfront · 3–6 months to production · 0.5–1 FTE DevOps ongoing · hardware refresh every 3–4 years
OpenAI-compatible endpoint on your own hardware
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct \
--host 0.0.0.0 --port 8000 --tensor-parallel-size 4
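With the server up, any OpenAI-compatible client can call it over localhost. A minimal stdlib sketch (the URL and model name mirror the command above; error handling is omitted, and nothing here leaves your network):

```python
import json
import urllib.request

# vLLM's OpenAI-compatible endpoint from the command above (assumed local).
VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

def build_payload(prompt: str) -> dict:
    """OpenAI-style chat completion body understood by the vLLM server."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def ask(prompt: str) -> str:
    """POST the prompt to the local server and return the completion text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the endpoint speaks the OpenAI API, the same client code later ports to a gateway or cloud provider by changing one URL.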
Proxy / Gateway
Doesn’t move the model — moves the control plane. Every request flows through a central gateway where PII is redacted, access is controlled, and every interaction is logged before anything reaches the cloud model.
Use it when:
• Shadow AI is your main risk (employees using AI without governance)
• You need governance live this quarter, not next year
• OPEX budget, not CAPEX
Good options:
• LiteLLM — open-source, Presidio PII built-in, 100+ providers
• Portkey — managed, analytics, fallback routing
• Kong AI Gateway — enterprise-grade API layer
LiteLLM with PII guardrails (litellm_config.yaml)
guardrails:
  - guardrail_name: pii-masking
    litellm_params:
      guardrail: presidio
      mode: pre_call  # redact BEFORE the model sees the prompt
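Because the proxy exposes the same OpenAI API shape, adoption is a base-URL swap rather than a rewrite. A sketch of the only client-side change needed (the URLs, environment variable names, and key placeholder are assumptions; LiteLLM's proxy listens on port 4000 by default):

```python
import os

# Route all traffic through the gateway instead of calling providers directly.
# The api_key is a gateway-issued "virtual key", so real provider credentials
# never reach client machines; the guardrails above run before forwarding.
def client_settings() -> dict:
    """Connection settings for any OpenAI-compatible SDK or HTTP client."""
    return {
        "base_url": os.environ.get("LLM_GATEWAY_URL", "http://localhost:4000/v1"),
        "api_key": os.environ.get("LLM_GATEWAY_KEY", "sk-virtual-key-placeholder"),
    }
```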
Hybrid — Local Redaction + Cloud Inference
Mask sensitive data locally. Send anonymized text to cloud model. Get frontier quality without residency violations. This is where most regulated enterprises are landing.
1. Local Presidio agent anonymizes all data before it leaves your infra
2. LLM Gateway enforces RBAC and logs the interaction
3. Cloud model processes clean anonymized text — never sees PII
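Step 1 is the part teams most often underestimate. In production it is handled by Presidio's NER-backed analyzer and anonymizer engines; the toy regex sketch below is my stand-in, not Presidio's API, and only shows the shape of the pattern: detect, replace with typed placeholders, forward the clean text.

```python
import re

# Toy stand-in for a local redaction agent: detect a few PII patterns and
# replace them with typed placeholders before the text leaves your infra.
# Real deployments use Presidio's analyzers, not bare regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def anonymize(text: str) -> str:
    """Replace detected PII spans with <TYPE> placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

Only the anonymized string is sent to the cloud model; the gateway logs the interaction, and a local mapping table can restore placeholders in the response if needed.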
More detail on the compliance logic behind this pattern: Questa AI — Cloud vs. On-Premise AI Compliance.
At a Glance

                    On-Premise        Proxy / Gateway    Hybrid
Data leaves?        Never             Anonymized only    Anonymized only
Air-gap safe?       Yes               No                 No
Setup time          3–6 months        2–6 weeks          4–10 weeks
Cost                $80K–$250K+       Low (software)     Medium
Frontier models?    No                Yes                Yes
Best for            Strict residency  Shadow AI / gov    Regulated + cloud
Further Reading
→ Full decision framework + infrastructure specs: LinkedIn Pulse
→ Leadership/compliance version: Substack
→ Technical deep-dive with full code: Hashnode