Three weeks ago I was benchmarking GPT-4o against a local Llama model. I was copying prompts from a real support ticket database to make the test realistic. Midway through the run I glanced at the terminal and saw this in the logs:
prompt="Hi, my name is Sarah Johnson, my account number is 4532-1234-5678-9012..."
provider=cloud
model=gpt-4o
A real customer's name. A real credit card number. Already sent to OpenAI.
I had not noticed because the benchmark UI just showed a token count, not the actual prompt content. The PII was in the data. I had forgotten to sanitise it. OpenAI's API terms say they don't train on API data, but that's not the point — the data left my infrastructure. Under GDPR, that's a potential breach.
I spent the rest of that weekend building a firewall so it could never happen again. This post is the full story of what I built, how it works, and how you can run it in one command.
The code is at github.com/sochaty/llm-governance-engine — tag governance-post-1.
The Problem With Every Existing LLM Tool
Every LLM observability tool I have used — LangSmith, Helicone, Arize Phoenix — works the same way: it records what happened after the fact. You get a dashboard, a trace, a cost breakdown. None of them stop the request.
That distinction matters enormously under GDPR, HIPAA, and the EU AI Act. "We logged that PII was sent" is not a compliance posture. "PII was blocked before it left the building" is.
By the end of this post you will have:
- A FastAPI backend that scans every prompt with Microsoft Presidio before it reaches any model
- A YAML-based policy engine where a rule file controls what gets blocked, warned, or alerted
- A PostgreSQL audit vault with every inference logged — PII flag, safety score, cost, latency
- Webhook alerts to Slack or Teams when a rule fires
- An Angular 21 dashboard for real-time cloud vs local benchmarks
- A 10-dimension governance radar chart on every run
Everything runs with docker compose up.
Architecture
The key insight is where enforcement happens: before the model call, not after.
User Prompt
    │
    ▼
FastAPI /benchmark/stream
    │
    ├── enforce_governance_policy()   ← Presidio scan + policy evaluation
    │       │
    │       ├── PII detected + cloud model → HTTP 403 (prompt never sent)
    │       ├── Safety score low → warn + log + continue
    │       └── All rules pass → verdict returned to endpoint
    │
    ├── LLMOrchestrator.get_streaming_response()
    │       │
    │       ├── OpenAI / Groq / Google / Anthropic (cloud)
    │       └── Ollama (local)
    │
    └── AuditService → PostgreSQL
The enforce_governance_policy function is a FastAPI dependency, injected into the streaming endpoint with Depends(). If a blocking rule fires, it raises HTTP 403 before the orchestrator is even called, so the prompt never touches the wire.
The YAML Policy DSL
The entire governance model is a YAML file. No code changes, no restarts — edit the file, POST /api/v1/policies/reload, rules are live.
# policies/default.yaml
version: "1.0"
name: "default"
rules:
  - id: pii-cloud-block
    name: "Block PII from cloud models"
    condition: pii_detected
    threshold: 0.7          # Presidio confidence ≥ 0.7 triggers this rule
    models: [cloud, gpt-4o]
    action: block           # returns HTTP 403
    severity: critical
    webhook_url: null       # set to your Slack URL to get alerted

  - id: low-safety-warn
    name: "Warn on low safety score"
    condition: safety_score_below
    threshold: 0.5
    action: warn            # logs + audits, passes through
    severity: medium

  - id: pii-local-alert
    name: "Alert on PII sent to local models"
    condition: pii_detected
    threshold: 0.85
    models: [local]
    action: alert           # fires webhook, does not block
    severity: high
Four conditions: pii_detected, safety_score_below, cost_exceeds, model_is.
Three actions: block (HTTP 403), warn (audit + continue), alert (webhook + continue).
Starter templates are shipped in the repo for GDPR (policies/gdpr.yaml) and HIPAA (policies/hipaa.yaml).
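To make the rule semantics concrete, here is a minimal sketch of how a single rule could be matched against a request. Rule and rule_fires are hypothetical names for illustration, not the repo's API. The sketch assumes that a rule without a models list applies to every model, and that entries in models may match either the provider or the model id, which is one reading of models: [cloud, gpt-4o] above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Rule:
    """Hypothetical in-memory form of one entry under `rules:`."""
    id: str
    condition: str                      # pii_detected | safety_score_below | cost_exceeds | model_is
    action: str                         # block | warn | alert
    threshold: Optional[float] = None
    models: List[str] = field(default_factory=list)  # empty = all models (assumption)


def rule_fires(rule, provider, model_id, pii_confidence, safety_score, cost_usd):
    """Return True when the rule is in scope for this request and its condition holds."""
    # Scope check: a non-empty `models` list must name the provider or the model id.
    if rule.models and provider not in rule.models and model_id not in rule.models:
        return False

    threshold = rule.threshold or 0.0
    if rule.condition == "pii_detected":
        return pii_confidence > 0 and pii_confidence >= threshold
    if rule.condition == "safety_score_below":
        return safety_score < threshold
    if rule.condition == "cost_exceeds":
        return cost_usd > threshold
    if rule.condition == "model_is":
        # The scope check already matched; an empty list matches nothing here.
        return bool(rule.models)
    raise ValueError(f"unknown condition: {rule.condition}")


block = Rule(id="pii-cloud-block", condition="pii_detected", action="block",
             threshold=0.7, models=["cloud", "gpt-4o"])

print(rule_fires(block, "cloud", "gpt-4o", 0.95, 0.12, 0.0))           # True
print(rule_fires(block, "local", "llama3.2:latest", 0.95, 0.12, 0.0))  # False
```

The same prompt fires the block rule on the cloud provider and is out of scope on the local one, which is the behaviour the default policy file describes.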
PII Detection: Microsoft Presidio
Presidio is Microsoft's open-source PII detection library. It runs locally — no API call, no data leaving your machine.
It detects 50+ entity types out of the box: PERSON, EMAIL_ADDRESS, CREDIT_CARD, US_SSN, PHONE_NUMBER, IBAN_CODE, IP_ADDRESS, and more. It uses a combination of regex patterns, checksums, and a spaCy NLP model for name recognition.
The scan returns a confidence score per entity. The policy engine compares that score against the rule's threshold. An entity with 0.95 confidence on CREDIT_CARD and a threshold of 0.7 triggers the pii-cloud-block rule.
# backend/app/services/audit_service.py (simplified)
from presidio_analyzer import AnalyzerEngine


class AuditService:
    def __init__(self):
        self.analyzer = AnalyzerEngine()

    def scan_for_pii_details(self, text: str) -> ScanResult:
        results = self.analyzer.analyze(text=text, language="en")
        detected = len(results) > 0
        entities = [
            EntityResult(
                entity_type=r.entity_type,
                confidence=r.score,
                start=r.start,
                end=r.end,
            )
            for r in results
        ]
        max_confidence = max((r.score for r in results), default=0.0)
        return ScanResult(
            detected=detected,
            entities=entities,
            max_confidence=max_confidence,
        )
The safety score is calculated separately — it is a 0.0–1.0 measure that combines PII confidence, entity density, and sensitive keyword presence. A score below 0.5 triggers the low-safety-warn rule.
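The post does not give the exact formula, so the following is only an illustrative sketch of how three such signals could be blended into a 0.0–1.0 score. The weights, the keyword list, and the name sketch_safety_score are all assumptions made for this example; the repo's calculate_safety_score may differ.

```python
# Illustrative keyword list; an assumption, not the repo's list.
SENSITIVE_KEYWORDS = ("password", "ssn", "credit card", "diagnosis")


def sketch_safety_score(text, pii_confidences,
                        w_conf=0.6, w_density=0.2, w_keyword=0.2):
    """Blend three risk signals into a score where 1.0 is safe and 0.0 is risky.

    pii_confidences: one confidence value per detected entity.
    The weights are made up for illustration and sum to 1.0.
    """
    word_count = max(len(text.split()), 1)

    # Signal 1: the strongest single detection.
    max_conf = max(pii_confidences, default=0.0)
    # Signal 2: entities per word, capped at 1.0.
    density = min(len(pii_confidences) / word_count, 1.0)
    # Signal 3: whether any sensitive keyword appears at all.
    lowered = text.lower()
    keyword_hit = 1.0 if any(k in lowered for k in SENSITIVE_KEYWORDS) else 0.0

    risk = w_conf * max_conf + w_density * density + w_keyword * keyword_hit
    return round(max(0.0, 1.0 - risk), 2)


clean = sketch_safety_score("What is the capital of France?", [])
risky = sketch_safety_score("My SSN is 123-45-6789", [0.85])
print(clean, risky)
```

With these made-up weights, a clean prompt scores 1.0 and the SSN prompt lands below the 0.5 threshold, so it would trip the low-safety-warn rule.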
The Policy Engine
The engine follows a Chain of Responsibility pattern. Each rule evaluates the GovernanceContext independently:
# backend/app/governance/policy/schema.py
from dataclasses import dataclass
from typing import List, Optional

from pydantic import BaseModel


@dataclass
class GovernanceContext:
    prompt: str
    provider: str
    model_id: str
    pii_detected: bool
    pii_entity_types: List[str]
    pii_max_confidence: float
    safety_score: float
    estimated_prompt_cost_usd: float


# ViolatedRule (rule_id, rule_name, severity, message) is omitted here for brevity.
class PolicyVerdict(BaseModel):
    passed: bool
    violated_rules: List[ViolatedRule] = []
    blocking_rule: Optional[ViolatedRule] = None
    warnings: List[str] = []
The DefaultPolicyEngine.evaluate() iterates all rules in order. Block rules short-circuit. Warn and alert rules accumulate into the verdict. The verdict is returned to the FastAPI dependency, which raises HTTP 403 if blocking_rule is set.
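The evaluation loop described above can be sketched in a few lines. This is a simplified stand-in, not the repo's DefaultPolicyEngine: SketchRule, SketchVerdict, and the dict-based context are hypothetical, and rule ids stand in for the full ViolatedRule objects.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SketchRule:
    rule_id: str
    action: str                       # "block" | "warn" | "alert"
    fires: Callable[[dict], bool]     # predicate over the request context


@dataclass
class SketchVerdict:
    passed: bool = True
    violated_rules: List[str] = field(default_factory=list)
    blocking_rule: Optional[str] = None
    warnings: List[str] = field(default_factory=list)


def evaluate(rules, context):
    """Walk the rules in order; a firing block rule ends evaluation immediately."""
    verdict = SketchVerdict()
    for rule in rules:
        if not rule.fires(context):
            continue
        verdict.violated_rules.append(rule.rule_id)
        if rule.action == "block":
            verdict.passed = False
            verdict.blocking_rule = rule.rule_id
            return verdict            # short-circuit: later rules never run
        # warn and alert rules accumulate and let the request continue
        verdict.warnings.append(f"{rule.action}: {rule.rule_id}")
    return verdict


evaluated_after_block = []  # records whether the last rule was ever evaluated

rules = [
    SketchRule("low-safety-warn", "warn", lambda c: c["safety_score"] < 0.5),
    SketchRule("pii-cloud-block", "block",
               lambda c: c["provider"] == "cloud" and c["pii_max_confidence"] >= 0.7),
    SketchRule("never-reached", "alert",
               lambda c: evaluated_after_block.append(True) or True),
]

verdict = evaluate(rules, {"provider": "cloud",
                           "pii_max_confidence": 0.95,
                           "safety_score": 0.12})
print(verdict.blocking_rule)   # pii-cloud-block
print(verdict.warnings)        # ['warn: low-safety-warn']
```

Because block rules short-circuit, rule order matters: a warn rule placed before a block rule is recorded in the verdict, while anything after the block is skipped.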
The FastAPI Enforcement Dependency
This is the part that makes everything composable. One line wires the entire governance stack into any endpoint:
# backend/app/api/benchmark_router.py
@router.get("/stream")
async def stream_benchmark(
    verdict: PolicyVerdict = Depends(enforce_governance_policy),
    db: AsyncSession = Depends(get_db),
):
    # If we reach here, the prompt passed all blocking rules.
    # verdict.warnings contains any non-blocking rule hits.
    ...
The dependency itself:
# backend/app/governance/policy/enforcement.py (simplified)
from typing import Annotated

from fastapi import Depends, HTTPException, Query
from sqlalchemy.ext.asyncio import AsyncSession


async def enforce_governance_policy(
    prompt: Annotated[str, Query(min_length=1)],
    provider: Annotated[str, Query(pattern="^(cloud|local)$")] = "cloud",
    db: AsyncSession = Depends(get_db),
) -> PolicyVerdict:
    engine = get_policy_engine()
    audit = _get_audit_service()

    # Scan first, then build the context every rule is evaluated against.
    scan = audit.scan_for_pii_details(prompt)
    context = GovernanceContext(
        prompt=prompt,
        provider=provider,
        model_id="gpt-4o" if provider == "cloud" else "llama3.2:latest",
        pii_detected=scan.detected,
        pii_entity_types=[e.entity_type for e in scan.entities],
        pii_max_confidence=scan.max_confidence,
        safety_score=audit.calculate_safety_score(prompt),
        # Rough per-word estimate; local inference is treated as free.
        estimated_prompt_cost_usd=(
            len(prompt.split()) * 0.00003 if provider == "cloud" else 0.0
        ),
    )

    verdict = engine.evaluate(context)

    # Persist every violation, blocking or not, before deciding the response.
    for violation in verdict.violated_rules:
        webhook_url = _get_webhook_url(engine, violation.rule_id)
        await _record_violation(db, violation, context, webhook_url)

    if not verdict.passed and verdict.blocking_rule:
        br = verdict.blocking_rule
        raise HTTPException(
            status_code=403,
            detail={
                "error": "governance_violation",
                "rule_id": br.rule_id,
                "rule_name": br.rule_name,
                "severity": br.severity,
                "message": br.message,
            },
        )

    return verdict
Every violation — blocked or not — is persisted to policy_violations in PostgreSQL before the function returns. Webhook delivery is fire-and-forget via asyncio.create_task() so it never adds latency to the response path.
Webhook Alerts
When a rule fires with a webhook_url, a CloudEvents-compatible payload is POSTed:
{
  "specversion": "1.0",
  "type": "com.governance.policy.violation",
  "source": "llm-governance-engine",
  "id": "uuid",
  "time": "2026-06-19T09:00:00Z",
  "data": {
    "rule_id": "pii-cloud-block",
    "rule_name": "Block PII from cloud models",
    "severity": "critical",
    "action": "block",
    "message": "PII detected (CREDIT_CARD, confidence=0.95) on cloud provider",
    "provider": "cloud",
    "model_id": "gpt-4o"
  }
}
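Delivery is attempted up to three times with exponential backoff. One caveat: Slack incoming webhooks expect a top-level text or blocks field, and Teams and PagerDuty have their own payload schemas, so a raw CloudEvents envelope usually needs a thin adapter (or a workflow step on the receiving side) before those tools will render it.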
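As a rough sketch of the two pieces involved, the code below builds the envelope and retries delivery with exponential backoff. It is synchronous for brevity, and build_violation_event, deliver_with_backoff, the injected send callable, and the 0.5-second base delay are all assumptions for illustration rather than the repo's implementation.

```python
import time
import uuid
from datetime import datetime, timezone


def build_violation_event(rule_id, rule_name, severity, action,
                          message, provider, model_id):
    """Wrap one violation in a CloudEvents 1.0 style envelope."""
    return {
        "specversion": "1.0",
        "type": "com.governance.policy.violation",
        "source": "llm-governance-engine",
        "id": str(uuid.uuid4()),
        "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "data": {
            "rule_id": rule_id,
            "rule_name": rule_name,
            "severity": severity,
            "action": action,
            "message": message,
            "provider": provider,
            "model_id": model_id,
        },
    }


def deliver_with_backoff(send, event, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send(event) up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            send(event)
            return True
        except Exception:  # broad on purpose in this sketch; narrow it in real code
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return False


event = build_violation_event(
    "pii-cloud-block", "Block PII from cloud models", "critical", "block",
    "PII detected (CREDIT_CARD, confidence=0.95) on cloud provider",
    "cloud", "gpt-4o",
)

# Simulated endpoint that fails twice, then accepts the event.
calls = {"n": 0}

def flaky_send(evt):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated outage")

delays = []
ok = deliver_with_backoff(flaky_send, event, sleep=delays.append)
print(ok, delays)   # True [0.5, 1.0]
```

Injecting send and sleep keeps the retry logic testable without a network or real waiting; in the service, send would be whatever performs the HTTP POST.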
Running It
git clone https://github.com/sochaty/llm-governance-engine
cd llm-governance-engine
git checkout governance-post-1
cp .env.example .env
# Add your OPENAI_API_KEY (or any provider key)
docker compose up
Dashboard → http://localhost:4200
API docs → http://localhost:8000/docs
Pull a local model to enable the side-by-side comparison:
curl -X POST http://localhost:11434/api/pull -d '{"name":"llama3.2:latest"}'
Trigger your first governance block:
Open the dashboard, type a prompt containing a fake SSN — My SSN is 123-45-6789 — select the Cloud provider and hit Run. You will get a red Governance Violation banner instead of a response. The prompt never reached GPT-4o.
Open http://localhost:8000/api/v1/policies/violations to see the audit record of the block.
What the Audit Vault Captures
Every inference — blocked or not — is stored in PostgreSQL:
| Field | Example |
|---|---|
| prompt (preview) | "My SSN is 123-45..." |
| provider | cloud |
| model_name | gpt-4o |
| pii_detected | true |
| safety_score | 0.12 |
| latency_ms | 0 (blocked before model) |
| estimated_cost | $0.0000 |
| version_tag | openai/gpt-4o |
The Audit Vault page in the dashboard is filterable by prompt, provider, and PII flag. Every row has a "Generate Report" button that exports a PDF — useful when a compliance officer asks for evidence.
Multi-Provider Routing
The orchestrator supports five provider types with a single interface:
| Provider | How it connects |
|---|---|
| OpenAI | AsyncOpenAI (native) |
| Groq | AsyncOpenAI(base_url="https://api.groq.com/openai/v1") |
| Google Gemini | AsyncOpenAI(base_url="https://generativelanguage.googleapis.com/v1beta/openai") |
| Anthropic | Lazy import anthropic — separate streaming path |
| Ollama (local) | AsyncOpenAI(base_url="http://ollama-service:11434/v1", api_key="ollama") |
API keys are stored in PostgreSQL (Fernet-encrypted) and resolved live on every request via settings_service.get(). Change a key in the Settings UI — no restart needed, effective on the next request.
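The routing table above boils down to a lookup from provider name to client arguments, with the key resolved at call time. The sketch below shows that idea with the standard library only; client_kwargs, the setting names such as groq_api_key, and the plain dict standing in for settings_service are hypothetical.

```python
# Base URLs taken from the table above; None means "use the SDK default".
PROVIDER_BASE_URLS = {
    "openai": None,
    "groq": "https://api.groq.com/openai/v1",
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai",
    "ollama": "http://ollama-service:11434/v1",
}


def client_kwargs(provider, get_setting):
    """Build keyword arguments for an OpenAI-compatible client.

    get_setting is any callable mapping a setting name to its current value,
    so a rotated key is picked up on the next call without a restart.
    """
    if provider == "anthropic":
        raise ValueError("Anthropic uses its own SDK path, not the OpenAI-compatible client")
    if provider not in PROVIDER_BASE_URLS:
        raise KeyError(f"unknown provider: {provider}")

    # Ollama ignores the key, but the client still requires a non-empty value.
    api_key = "ollama" if provider == "ollama" else get_setting(f"{provider}_api_key")

    kwargs = {"api_key": api_key}
    base_url = PROVIDER_BASE_URLS[provider]
    if base_url is not None:
        kwargs["base_url"] = base_url
    return kwargs


settings = {"openai_api_key": "sk-demo", "groq_api_key": "key-v1"}

first = client_kwargs("groq", settings.get)
settings["groq_api_key"] = "key-v2"          # simulates a change in the Settings UI
second = client_kwargs("groq", settings.get)

print(first["api_key"], second["api_key"])   # key-v1 key-v2
```

The returned dict could be passed straight to the client constructor, for example AsyncOpenAI(**client_kwargs("groq", settings.get)). Resolving the key inside the function rather than at import time is what makes key changes effective on the next request.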
What's Next
The codebase is production-ready for single-tenant use. The roadmap from here:
- v1.1: RAGAS hallucination scoring on every response, with local Ollama as the free evaluation judge, so there is no marginal cost. faithfulness_score populates in the audit log 2–3 seconds after the benchmark completes.
- v1.2: FinOps dashboard. Daily cost trends per model, Z-score anomaly detection, budget circuit breakers. A rule that says "if GPT-4o spend exceeds $500 this week, route to Llama" is enforced at the proxy layer, not just raised as a dashboard alert.
- v2.0: Multi-tenant workspaces with JWT + RBAC. PostgreSQL Row-Level Security for tenant isolation. OIDC/SAML for Okta and Azure AD.
The incident that started this — a real customer's credit card number sent to GPT-4o because I forgot to sanitise a test dataset — took about 30 seconds to happen and would have taken weeks to untangle from a compliance perspective.
The fix took a weekend. It should have existed before the first prompt was ever sent.
Full code: github.com/sochaty/llm-governance-engine
Reproduce this post exactly: git checkout governance-post-1
PRs and issues welcome. If you build a custom Presidio recogniser for your domain (medical records, legal documents, financial instruments), I would love to include it in the default policy templates.
All my writing lives at blogs.sourishchakraborty.com — subscribe there for future posts.