<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ChrisRemo</title>
    <description>The latest articles on DEV Community by ChrisRemo (@chrisremo85).</description>
    <link>https://dev.to/chrisremo85</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3840876%2F98df10df-c3a0-4db0-a253-672563f62e89.jpg</url>
      <title>DEV Community: ChrisRemo</title>
      <link>https://dev.to/chrisremo85</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chrisremo85"/>
    <language>en</language>
    <item>
      <title>EU AI Act and LLM Proxies - What Your Infrastructure Layer Needs to Know</title>
      <dc:creator>ChrisRemo</dc:creator>
      <pubDate>Thu, 02 Apr 2026 12:38:07 +0000</pubDate>
      <link>https://dev.to/chrisremo85/eu-ai-act-and-llm-proxies-what-your-infrastructure-layer-needs-to-know-4jil</link>
      <guid>https://dev.to/chrisremo85/eu-ai-act-and-llm-proxies-what-your-infrastructure-layer-needs-to-know-4jil</guid>
      <description>&lt;p&gt;The EU AI Act is &lt;a href="https://artificialintelligenceact.eu/implementation-timeline/" rel="noopener noreferrer"&gt;rolling out in stages&lt;/a&gt;. Prohibited practices since February 2025, GPAI rules since August 2025, high-risk system obligations from August 2026. Full enforcement across all risk categories hits in August 2027.&lt;/p&gt;

&lt;p&gt;If you're running a proxy or gateway between your applications and LLM providers, some of this applies to you - maybe not in the way you'd expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who is responsible for what?
&lt;/h2&gt;

&lt;p&gt;The AI Act doesn't assign blame to a single layer. It distributes responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model providers&lt;/strong&gt; (OpenAI, Anthropic, or your self-hosted vLLM) handle model safety, training data, and GPAI obligations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt; (your proxy/gateway) provides secure access control, usage tracking, and routing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Your application&lt;/strong&gt; owns the use case, risk classification, transparency, and monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End users&lt;/strong&gt; use the service within the terms you set&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A proxy doesn't train models or decide how outputs are used. But it's not a completely neutral pipe either - RBAC controls, rate limits, and routing decisions shape the compliance landscape. The legal classification of AI infrastructure providers is still being refined. This is where things stand as of Q2 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real problem: prompt logging
&lt;/h2&gt;

&lt;p&gt;Here's where it gets practical. Most LLM gateways log prompts and responses. Some by default, some as an opt-in "observability" feature, some because it was easier to log everything than to think about what to exclude.&lt;/p&gt;

&lt;p&gt;Under GDPR, prompt content becomes personal data the moment someone types a name, email address, or anything identifiable. Once your proxy stores that, you're on the hook for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A legal basis to process it&lt;/li&gt;
&lt;li&gt;Records of processing activities&lt;/li&gt;
&lt;li&gt;Data subject access requests (someone asks "what data do you have on me?")&lt;/li&gt;
&lt;li&gt;Right to deletion ("delete everything about me")&lt;/li&gt;
&lt;li&gt;Breach notification scope (a data breach now includes prompt content)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't theoretical. If your proxy vendor stores prompts in a database, all of this has to be reflected in your Data Processing Agreement (DPA).&lt;/p&gt;

&lt;h2&gt;
  
  
  What the AI Act actually requires for logging
&lt;/h2&gt;

&lt;p&gt;Article 12 requires "automatic recording of events" for traceability - but primarily for high-risk AI systems. The key word here is &lt;em&gt;events&lt;/em&gt;, not &lt;em&gt;content&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;What the regulation expects: who made the request, when, which model, how many tokens, what happened. Metadata.&lt;/p&gt;

&lt;p&gt;What most proxies do: store the full conversation. That's not what was asked for, and it creates obligations that weren't necessary.&lt;/p&gt;
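&lt;p&gt;To make the distinction concrete, an event-level usage record might look like this (the field names are illustrative, not a prescribed schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Traceability metadata only - there is no field where content could even go.
function buildUsageEvent(req, resp) {
  return {
    timestamp: new Date().toISOString(),
    keyId: req.keyId,                        // who made the request
    model: req.model,                        // which model
    promptTokens: resp.promptTokens,         // how many tokens
    completionTokens: resp.completionTokens,
    latencyMs: resp.latencyMs,
    status: resp.status                      // what happened
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;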

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Relevant for compliance&lt;/th&gt;
&lt;th&gt;Common in proxies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Usage tracking (who, when, which model)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limiting per user/team&lt;/td&gt;
&lt;td&gt;Recommended&lt;/td&gt;
&lt;td&gt;Sometimes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit logs for admin actions&lt;/td&gt;
&lt;td&gt;Yes (high-risk)&lt;/td&gt;
&lt;td&gt;Rare&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model access control (RBAC)&lt;/td&gt;
&lt;td&gt;Recommended&lt;/td&gt;
&lt;td&gt;Basic or none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt content logging&lt;/td&gt;
&lt;td&gt;Not the baseline requirement&lt;/td&gt;
&lt;td&gt;Often default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost tracking per team/org&lt;/td&gt;
&lt;td&gt;Recommended&lt;/td&gt;
&lt;td&gt;Sometimes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In certain high-risk deployments, more detailed documentation may be required for incident investigation. But that obligation sits with the deployer's application layer - the proxy doesn't have the context to decide what needs to be recorded for a specific use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data minimization as architecture
&lt;/h2&gt;

&lt;p&gt;GDPR Article 5(1)(c) requires data minimization - collect only what you need. Article 25 requires data protection by design and by default - build it into the system, don't bolt it on later.&lt;/p&gt;

&lt;p&gt;For a proxy, this means: if you don't need prompt content, don't store it. Not "we have a toggle to disable logging" - there should be no logging code to toggle. The proxy reads the &lt;code&gt;model&lt;/code&gt; field for routing, streams bytes to the provider and back, tracks metadata, and forgets the content.&lt;/p&gt;
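&lt;p&gt;In sketch form, with a hypothetical &lt;code&gt;forward&lt;/code&gt; function standing in for the upstream call (an illustration of the flow, not VoidLLM's actual code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const events = [];
function recordMetadata(event) { events.push(event); }

// The only field the proxy parses is "model" (for routing). The body is
// forwarded as-is and never written anywhere; only metadata survives.
function handleRequest(rawBody, forward) {
  const model = JSON.parse(rawBody).model;
  const started = Date.now();
  const response = forward(model, rawBody);      // stream bytes to provider
  recordMetadata({ model: model, latencyMs: Date.now() - started });
  return response;                               // passed through, not stored
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;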

&lt;p&gt;This is the approach we took with &lt;a href="https://github.com/voidmind-io/voidllm" rel="noopener noreferrer"&gt;VoidLLM&lt;/a&gt;. Not because the AI Act forced us - we made this decision before the regulation was finalized - but because zero-knowledge is the right architecture for infrastructure that handles other people's data.&lt;/p&gt;

&lt;p&gt;The practical benefit: your Data Processing Agreement covers metadata only. No prompt content in scope means no content-related breach notifications, no content-related access requests, no content retention policies for the proxy layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a proxy should give you
&lt;/h2&gt;

&lt;p&gt;For AI Act readiness at the infrastructure level:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Access control&lt;/strong&gt; - org/team/user/key hierarchy so you know who has access to what&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Usage tracking&lt;/strong&gt; - metadata per request, not content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limits and token budgets&lt;/strong&gt; - constrain usage at every level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logs&lt;/strong&gt; - track administrative actions (who changed permissions, created keys, modified models)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; - Prometheus or similar for monitoring and alerting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Health checks&lt;/strong&gt; - know when your upstream providers are degraded&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What a proxy should not do
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Validate AI outputs (that's the model provider's or your responsibility)&lt;/li&gt;
&lt;li&gt;Detect prohibited use cases (that's governance, not infrastructure)&lt;/li&gt;
&lt;li&gt;Inject AI disclosure labels (your app handles user-facing transparency)&lt;/li&gt;
&lt;li&gt;Assess risk levels (the use case determines classification, not the proxy)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A useful principle
&lt;/h2&gt;

&lt;p&gt;Log content where you understand and take responsibility for it. For most architectures, that's your application - where you have context about the use case, the user, and what needs to be recorded. Not your proxy, which sees bytes flowing through.&lt;/p&gt;

&lt;p&gt;Some regulated industries may want content logging for their own reasons. That's a legitimate choice based on risk assessment. The point is it should be a conscious decision, not a proxy default you didn't know was on.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're engineers, not lawyers. If you're evaluating proxies for a regulated environment, the architecture matters more than the feature list. We wrote a more detailed analysis with diagrams on our &lt;a href="https://voidllm.ai/blog/eu-ai-act-llm-proxies-where-voidllm-stands" rel="noopener noreferrer"&gt;blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>privacy</category>
      <category>compliance</category>
      <category>llm</category>
    </item>
    <item>
      <title>VoidLLM vs LiteLLM - An Honest Comparison from the Builder's Perspective</title>
      <dc:creator>ChrisRemo</dc:creator>
      <pubDate>Wed, 01 Apr 2026 18:36:21 +0000</pubDate>
      <link>https://dev.to/chrisremo85/voidllm-vs-litellm-an-honest-comparison-from-the-builders-perspective-405c</link>
      <guid>https://dev.to/chrisremo85/voidllm-vs-litellm-an-honest-comparison-from-the-builders-perspective-405c</guid>
      <description>&lt;p&gt;If you're running LLMs in production, you've probably evaluated LiteLLM. It's the most popular gateway out there - 100+ providers, massive community, used by companies like Stripe and Netflix.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/voidmind-io/voidllm" rel="noopener noreferrer"&gt;VoidLLM&lt;/a&gt; with a different set of priorities. Here's an honest comparison - including where LiteLLM is ahead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I built something different
&lt;/h2&gt;

&lt;p&gt;We were running self-hosted models in Kubernetes, hitting vLLM directly. No proxy; network policies were the only access control. It worked until we needed to know which team was burning through GPU hours.&lt;/p&gt;

&lt;p&gt;LiteLLM was the obvious first choice, but the Python runtime, startup time, and dependency tree felt heavy for what we needed. We also had a hard GDPR requirement - no prompt content could be stored anywhere.&lt;/p&gt;

&lt;p&gt;So we built VoidLLM in Go.&lt;/p&gt;

&lt;h2&gt;
  
  
  What VoidLLM does differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Privacy by architecture.&lt;/strong&gt; There's no "disable content logging" toggle - because there's no content logging code. The proxy reads the &lt;code&gt;model&lt;/code&gt; field from the request body, streams bytes between client and upstream, and forgets. Usage events track who, which model, how many tokens - nothing else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single binary.&lt;/strong&gt; One Go binary (~25MB) with the admin UI embedded. No Python, no pip, no virtualenv. Download, configure, run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance.&lt;/strong&gt; Under 500 microseconds of proxy overhead at 2000 RPS. We benchmarked this with Vegeta at sustained load on a 12-core machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in admin UI.&lt;/strong&gt; Key management, usage tracking, model configuration, playground, team management - all embedded in the binary. Not a separate service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Gateway.&lt;/strong&gt; VoidLLM doubles as an MCP gateway - register external MCP servers, proxy tool calls with scoped access control. Plus Code Mode: AI agents write JavaScript that orchestrates multiple MCP tool calls in a single WASM-sandboxed execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RBAC.&lt;/strong&gt; Org/team/user/key hierarchy with four roles. Rate limits, token budgets, and model access control at every level. Most-restrictive-wins inheritance.&lt;/p&gt;
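&lt;p&gt;Most-restrictive-wins can be sketched as taking the smallest limit defined anywhere in the hierarchy (a simplified model, not the exact semantics):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Simplified most-restrictive-wins: the effective limit is the minimum of
// whatever limits are defined at the org, team, user, and key levels.
function effectiveLimit(levels) {
  var result = Infinity;
  for (const level of levels) {
    if (level.rpm !== undefined) {
      result = Math.min(result, level.rpm);
    }
  }
  return result;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An org limit of 1000 RPM, a team limit of 200, and a key limit of 500 resolve to 200.&lt;/p&gt;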

&lt;p&gt;&lt;strong&gt;Load balancing.&lt;/strong&gt; Multi-deployment models with round-robin, least-latency, weighted, and priority routing. Automatic failover with per-deployment circuit breakers.&lt;/p&gt;
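&lt;p&gt;For illustration, least-latency selection with circuit breakers reduces to something like this (simplified; the deployment fields are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Skip deployments whose circuit breaker is open, then pick the one with
// the lowest observed latency. Returns null if nothing is available.
function pickDeployment(deployments) {
  var best = null;
  for (const d of deployments) {
    if (d.circuitOpen) { continue; }
    if (best === null) { best = d; continue; }
    if (d.p50Ms === Math.min(d.p50Ms, best.p50Ms)) { best = d; }
  }
  return best;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;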

&lt;h2&gt;
  
  
  Where LiteLLM is better
&lt;/h2&gt;

&lt;p&gt;I'll be honest about this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provider coverage.&lt;/strong&gt; LiteLLM supports 100+ providers. VoidLLM supports 6 (OpenAI, Anthropic, Azure, Ollama, vLLM, custom). If you need native Bedrock, VertexAI, or Cohere integration, LiteLLM has us beat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Community.&lt;/strong&gt; Thousands of users, extensive docs, large contributor base. VoidLLM is new. Our docs are solid but our community is just getting started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python SDK.&lt;/strong&gt; If your stack is Python-native and you want a library you can import directly, LiteLLM's SDK is a natural fit. VoidLLM is a standalone proxy - you point your SDK at it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability integrations.&lt;/strong&gt; LiteLLM connects to Langfuse, Lunary, MLflow for request-level observability. VoidLLM deliberately avoids content-level logging - that's the privacy trade-off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;VoidLLM&lt;/th&gt;
&lt;th&gt;LiteLLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Proxy overhead&lt;/td&gt;
&lt;td&gt;&amp;lt; 500us P50&lt;/td&gt;
&lt;td&gt;~8ms P95&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Providers&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;100+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content logging&lt;/td&gt;
&lt;td&gt;Never (by design)&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Single binary&lt;/td&gt;
&lt;td&gt;Python runtime + deps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Admin UI&lt;/td&gt;
&lt;td&gt;Embedded&lt;/td&gt;
&lt;td&gt;Separate service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MCP Gateway&lt;/td&gt;
&lt;td&gt;Built-in + Code Mode&lt;/td&gt;
&lt;td&gt;Recent addition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RBAC&lt;/td&gt;
&lt;td&gt;Org/team/user/key&lt;/td&gt;
&lt;td&gt;Virtual keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load balancing&lt;/td&gt;
&lt;td&gt;4 strategies + failover&lt;/td&gt;
&lt;td&gt;Retry/fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;BSL 1.1&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Switching is easy
&lt;/h2&gt;

&lt;p&gt;Both are OpenAI-compatible. Switching from LiteLLM to VoidLLM (or back) is a base URL change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="c1"&gt;# Before (LiteLLM)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://litellm:4000/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# After (VoidLLM)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://voidllm:8080/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vl_uk_...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your application code stays the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom line
&lt;/h2&gt;

&lt;p&gt;If you need coverage for 100+ LLM providers and a Python SDK with a large ecosystem, use LiteLLM.&lt;/p&gt;

&lt;p&gt;If you care about privacy by design, want zero operational overhead (one binary, SQLite default), need sub-millisecond proxy performance, or want an MCP gateway built in - take a look at &lt;a href="https://github.com/voidmind-io/voidllm" rel="noopener noreferrer"&gt;VoidLLM&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Different problems, different trade-offs. Pick what fits.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>proxy</category>
      <category>go</category>
      <category>privacy</category>
    </item>
    <item>
      <title>Code Mode: Batching MCP Tool Calls in a WASM Sandbox to Cut LLM Token Usage by 30-80%</title>
      <dc:creator>ChrisRemo</dc:creator>
      <pubDate>Sun, 29 Mar 2026 00:19:42 +0000</pubDate>
      <link>https://dev.to/chrisremo85/code-mode-batching-mcp-tool-calls-in-a-wasm-sandbox-to-cut-llm-token-usage-by-30-80-18g7</link>
      <guid>https://dev.to/chrisremo85/code-mode-batching-mcp-tool-calls-in-a-wasm-sandbox-to-cut-llm-token-usage-by-30-80-18g7</guid>
      <description>&lt;h3&gt;
  
  
  The Problem: One Tool Call Per Turn Is Expensive
&lt;/h3&gt;

&lt;p&gt;If you've worked with LLMs and tool use, you know the pattern. The model decides it needs to call a tool. It emits a tool call. Your system executes it, returns the result. The model reads the result, reasons about it, and decides it needs another tool call. Repeat.&lt;/p&gt;

&lt;p&gt;Every round trip burns tokens. The model re-reads the entire conversation history each time. For workflows that touch 5-10 tools — think "look up the customer, check their subscription, fetch recent invoices, calculate usage, draft a summary" — you're paying for the same context window over and over. The token cost adds up fast, and latency compounds with each turn.&lt;/p&gt;
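&lt;p&gt;A back-of-the-envelope model, assuming the full history is re-sent on every turn:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Rough cost model: with one tool call per turn, the model re-reads the
// growing history each time, so prompt tokens scale roughly quadratically.
function totalPromptTokens(turns, baseContext, tokensPerResult) {
  var total = 0;
  for (var i = 0; i !== turns; i++) {
    total += baseContext + i * tokensPerResult;  // history grows each turn
  }
  return total;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Ten tool-call turns on a 2,000-token context with 500-token tool results cost 42,500 prompt tokens; a single batched turn pays for the 2,000-token context once. The numbers are made up - the quadratic growth is the point.&lt;/p&gt;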

&lt;h3&gt;
  
  
  The Solution: Let the LLM Write the Orchestration
&lt;/h3&gt;

&lt;p&gt;Code Mode flips the pattern. Instead of one tool call per LLM turn, the model writes a short JavaScript program that orchestrates multiple tool calls in a single execution. The model gets the results all at once and reasons over the complete picture.&lt;/p&gt;

&lt;p&gt;This is inspired by &lt;a href="https://blog.cloudflare.com/code-mode-mcp/" rel="noopener noreferrer"&gt;Cloudflare's Code Mode concept&lt;/a&gt;. The difference: VoidLLM's implementation is fully self-hosted, runs in a WASM-sandboxed runtime, and integrates with any MCP server you're already running.&lt;/p&gt;

&lt;p&gt;In practice, this reduces token usage by 30-80% depending on the complexity of the tool workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;The execution pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The LLM receives auto-generated TypeScript type declarations describing all available MCP tools&lt;/li&gt;
&lt;li&gt;The LLM emits a JavaScript block that calls the tools it needs&lt;/li&gt;
&lt;li&gt;VoidLLM executes the JS inside a &lt;strong&gt;QuickJS runtime compiled to WebAssembly&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Tool calls within the JS are dispatched to upstream MCP servers via Streamable HTTP&lt;/li&gt;
&lt;li&gt;Results are collected and returned to the LLM in a single response&lt;/li&gt;
&lt;/ol&gt;
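&lt;p&gt;Step 1 might look roughly like this - flattening an MCP tool's JSON Schema input into a signature string the model can read (a hypothetical sketch, not the generator VoidLLM ships):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical sketch: map a tool's JSON Schema properties into a
// TypeScript-style declaration for the LLM's system context.
function toDeclaration(alias, tool) {
  const props = tool.inputSchema.properties;
  const params = Object.keys(props).map(function (name) {
    return name + ": " + props[name].type;
  });
  return "tools." + alias + "." + tool.name +
         "(args: { " + params.join("; ") + " }): any";
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;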

&lt;p&gt;The WASM layer is powered by &lt;a href="https://wazero.io/" rel="noopener noreferrer"&gt;Wazero&lt;/a&gt;, a pure Go WebAssembly runtime. No CGO, no external dependencies. VoidLLM stays a single static binary.&lt;/p&gt;

&lt;p&gt;Tools are exposed through an &lt;strong&gt;ES6 Proxy pattern&lt;/strong&gt; — the LLM can call any tool by name without per-tool bindings.&lt;/p&gt;
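&lt;p&gt;In sketch form (simplified relative to the real implementation), the pattern looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Simplified ES6 Proxy: any tools.server.tool_name(args) call resolves
// dynamically, so no per-tool binding has to be generated.
function makeTools(dispatch) {
  return new Proxy({}, {
    get: function (target, serverAlias) {
      return new Proxy({}, {
        get: function (inner, toolName) {
          return function (args) {
            return dispatch(serverAlias, toolName, args);
          };
        }
      });
    }
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Any &lt;code&gt;tools.server.tool_name(args)&lt;/code&gt; expression resolves at call time, so new tools on an upstream server are callable without regenerating bindings.&lt;/p&gt;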

&lt;h3&gt;
  
  
  Code Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;crm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_customer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cust_8a3f&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;billing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_invoices&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cust_8a3f&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;churnRisk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;low&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inv&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;inv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;overdue&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;analytics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_usage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cust_8a3f&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;period&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;30d&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;churnRisk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trend&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;declining&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;high&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;medium&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Evaluated &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; invoices`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;churnRisk&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three tool calls, conditional logic, parallel execution — all in one LLM turn instead of three. The &lt;code&gt;console.log&lt;/code&gt; output is captured and returned alongside the result.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;The QuickJS WASM runtime has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No filesystem access&lt;/strong&gt; — cannot read or write files on the host&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No network access&lt;/strong&gt; — only dispatched MCP tool calls go through VoidLLM's controlled proxy layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No host access&lt;/strong&gt; — the WASM module runs in an isolated memory space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On top of the sandbox, admins get a &lt;strong&gt;per-tool blocklist&lt;/strong&gt;. You can allow Code Mode access to your CRM tools but block it from calling &lt;code&gt;database.execute_raw_sql&lt;/code&gt;. Configuration is managed through VoidLLM's admin API and UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;p&gt;Add your MCP servers to &lt;code&gt;voidllm.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;mcp_servers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS Knowledge&lt;/span&gt;
    &lt;span class="na"&gt;alias&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://knowledge-mcp.global.api.aws&lt;/span&gt;
    &lt;span class="na"&gt;auth_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;none&lt;/span&gt;

&lt;span class="na"&gt;settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mcp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;code_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;pool_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;max_tool_calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Code Mode endpoint lives at &lt;code&gt;/api/v1/mcp&lt;/code&gt;. Connect your IDE (Claude Code, Cursor, Windsurf) and the LLM will have &lt;code&gt;list_servers&lt;/code&gt;, &lt;code&gt;search_tools&lt;/code&gt;, and &lt;code&gt;execute_code&lt;/code&gt; available alongside your regular tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations
&lt;/h3&gt;

&lt;p&gt;Being upfront about what Code Mode doesn't do yet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SSE transport not supported&lt;/strong&gt; — only Streamable HTTP. Deprecated SSE servers are auto-detected and flagged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No OAuth for upstream MCP servers&lt;/strong&gt; — API keys and custom headers only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single instance only&lt;/strong&gt; — WASM pool is in-memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Try It
&lt;/h3&gt;

&lt;p&gt;VoidLLM is a lightweight LLM proxy written in Go — less than 2ms overhead, org/team/user hierarchy, key management, and usage tracking. Code Mode is the newest addition.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/voidmind-io/voidllm" rel="noopener noreferrer"&gt;https://github.com/voidmind-io/voidllm&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>performance</category>
    </item>
    <item>
      <title>Why I built a privacy-first LLM proxy</title>
      <dc:creator>ChrisRemo</dc:creator>
      <pubDate>Tue, 24 Mar 2026 00:05:58 +0000</pubDate>
      <link>https://dev.to/chrisremo85/why-i-built-a-privacy-first-llm-proxy-4llj</link>
      <guid>https://dev.to/chrisremo85/why-i-built-a-privacy-first-llm-proxy-4llj</guid>
      <description>&lt;p&gt;Every LLM gateway I evaluated had the same problem: they logged my prompts.&lt;/p&gt;

&lt;p&gt;I'd spin up a proxy, route my team's requests through it, check the dashboard - and there they were. Full request bodies, full response bodies, sitting in someone's database. Sometimes on someone else's infrastructure.&lt;/p&gt;

&lt;p&gt;For a lot of use cases that's fine. But when you're working with customer data, internal documents, or anything remotely sensitive, "we store everything by default" isn't a feature. It's a liability.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;I needed something simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Route LLM requests through a single endpoint&lt;/li&gt;
&lt;li&gt;Manage API keys so developers don't share raw provider keys&lt;/li&gt;
&lt;li&gt;Track who's using what, how much it costs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Never touch the actual prompts&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sounds basic, right? But every solution I found either:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Logged everything&lt;/strong&gt; - full request/response bodies in their database&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Charged per-request&lt;/strong&gt; - on top of what I'm already paying the provider&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Were way too complex&lt;/strong&gt; - sprawling microservice architectures, dozens of config files, hours of setup for something that should take minutes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I just wanted a proxy. A dumb pipe with access control and a dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I built one
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/voidmind-io/voidllm" rel="noopener noreferrer"&gt;VoidLLM&lt;/a&gt; is a single Go binary that sits between your apps and LLM providers. It's OpenAI-compatible - change your base URL, keep your SDK.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;VOIDLLM_ADMIN_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;openssl rand &lt;span class="nt"&gt;-hex&lt;/span&gt; 32&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;VOIDLLM_ENCRYPTION_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;openssl rand &lt;span class="nt"&gt;-base64&lt;/span&gt; 32&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/voidmind-io/voidllm:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. SQLite database, admin UI, API - all in a 63MB Docker image.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Access control:&lt;/strong&gt; Org → Team → Key hierarchy with 4 RBAC roles. Create API keys per user, per team, or per service account. Keys are HMAC-SHA256 hashed — we never store the raw key.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage tracking:&lt;/strong&gt; Every request logs who made it, which model, how many tokens, how long it took, what it cost. No prompt content. No response content. Just metadata.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3u2klcj0oxhp7krnl4rb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3u2klcj0oxhp7krnl4rb.jpg" alt="VoidLLM Usage Analytics" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting:&lt;/strong&gt; Per org, per team, per key. Token budgets (daily/monthly) and request limits (per minute/per day). In-memory for single instance, Redis for distributed.&lt;/p&gt;
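&lt;p&gt;The token-budget idea in miniature - a single-goroutine Go sketch with hypothetical names (the real limiter adds locking, window resets, and the Redis backend):&lt;/p&gt;

```go
package main

import "fmt"

// budget tracks tokens consumed per key against a fixed cap.
// Sketch only: no locking, no daily reset.
type budget struct {
	used  map[string]int
	limit int
}

// allow reserves tokens if the key stays under its cap.
func (b budget) allow(key string, tokens int) bool {
	if b.used[key]+tokens > b.limit {
		return false
	}
	b.used[key] += tokens
	return true
}

func main() {
	b := budget{used: map[string]int{}, limit: 1000}
	fmt.Println(b.allow("team-a", 800)) // true
	fmt.Println(b.allow("team-a", 300)) // false: would blow the cap
	fmt.Println(b.allow("team-b", 300)) // true: budgets are per key
}
```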

&lt;p&gt;&lt;strong&gt;Provider adapters:&lt;/strong&gt; OpenAI (passthrough), Anthropic (full message format translation), Azure OpenAI (URL mapping), Ollama, vLLM, or any OpenAI-compatible endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  What it doesn't do
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;It never sees your prompts.&lt;/strong&gt; This isn't a toggle. There's no "enable content logging" option. The proxy reads the request body only to extract the &lt;code&gt;model&lt;/code&gt; field for routing, then passes it through. Request and response bodies exist in memory for the duration of the request - they're never written to disk, never logged, never stored anywhere.&lt;/p&gt;

&lt;p&gt;This is a hard architectural constraint, not a policy. You can audit the code - there's no code path that persists content.&lt;/p&gt;
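&lt;p&gt;The routing step above fits in a few lines - a hedged sketch of the idea, not the actual VoidLLM code:&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
)

// routeTarget decodes only the "model" field from the request body.
// Everything else - messages, prompts, attachments - is never parsed,
// and the raw body is forwarded upstream untouched.
func routeTarget(body []byte) (string, error) {
	var probe struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &probe); err != nil {
		return "", err
	}
	return probe.Model, nil
}

func main() {
	body := []byte(`{"model":"gpt-4o","messages":[{"role":"user","content":"sensitive prompt"}]}`)
	model, _ := routeTarget(body)
	fmt.Println(model) // gpt-4o
}
```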

&lt;h2&gt;
  
  
  The privacy angle
&lt;/h2&gt;

&lt;p&gt;I keep emphasizing this because it matters more than people think.&lt;/p&gt;

&lt;p&gt;If you're in the EU, GDPR applies to LLM prompts that contain personal data. If your proxy logs those prompts, congratulations - you now have a data processing operation that needs a legal basis, a retention policy, a deletion mechanism, and probably a DPIA.&lt;/p&gt;

&lt;p&gt;Or you could just... not log them.&lt;/p&gt;

&lt;p&gt;VoidLLM is GDPR-compliant by architecture. There's nothing to delete because there's nothing stored. The DPO's favorite proxy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech decisions
&lt;/h2&gt;

&lt;p&gt;A few choices that might be interesting if you're building something similar:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Go + Fiber v3:&lt;/strong&gt; Fiber runs on fasthttp, not net/http. The proxy overhead is under 2ms. For a pass-through proxy where every millisecond counts (especially with streaming), this matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQLite default:&lt;/strong&gt; Zero dependencies to get started. &lt;code&gt;modernc.org/sqlite&lt;/code&gt; is pure Go — no CGO, no shared libraries. For single-instance deployments (which is most people), it just works. PostgreSQL is there when you need to scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded UI:&lt;/strong&gt; The React admin dashboard is compiled into the Go binary via &lt;code&gt;embed.FS&lt;/code&gt;. No separate frontend deployment, no CORS, no reverse proxy config. One binary, one port.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HMAC-SHA256 for API keys:&lt;/strong&gt; Not bcrypt. The auth check is on the hot path of every proxy request. HMAC with a server-side secret gives O(1) lookup with constant-time comparison. Bcrypt would add 100ms+ per request.&lt;/p&gt;
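&lt;p&gt;For illustration, the HMAC approach looks roughly like this (function and variable names are mine, not VoidLLM's):&lt;/p&gt;

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashKey derives a deterministic digest from the raw API key using a
// server-side secret, so the database stores only the digest.
func hashKey(serverSecret []byte, rawKey string) string {
	mac := hmac.New(sha256.New, serverSecret)
	mac.Write([]byte(rawKey))
	return hex.EncodeToString(mac.Sum(nil))
}

// verifyKey recomputes the digest and compares in constant time.
func verifyKey(serverSecret []byte, rawKey, storedDigest string) bool {
	computed := hashKey(serverSecret, rawKey)
	return subtle.ConstantTimeCompare([]byte(computed), []byte(storedDigest)) == 1
}

func main() {
	secret := []byte("server-side-secret")
	digest := hashKey(secret, "sk-example-123")
	fmt.Println(len(digest))                                 // 64: hex of SHA-256
	fmt.Println(verifyKey(secret, "sk-example-123", digest)) // true
	fmt.Println(verifyKey(secret, "sk-wrong", digest))       // false
}
```

Because the digest is deterministic, the lookup is a plain indexed query rather than a per-row bcrypt comparison.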

&lt;p&gt;&lt;strong&gt;Ed25519 license keys:&lt;/strong&gt; Enterprise features are gated by a signed JWT that's verified offline. No license server call on the hot path. Daily heartbeat refreshes the key in the background.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The project is still very early. Some things on the roadmap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model routing / fallback chains&lt;/li&gt;
&lt;li&gt;More provider adapters&lt;/li&gt;
&lt;li&gt;Documentation site&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're running LLMs (self-hosted or managed) and need access control without the prompt logging, give it a try:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/voidmind-io/voidllm" rel="noopener noreferrer"&gt;github.com/voidmind-io/voidllm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'm looking for early adopters.&lt;/strong&gt; If you're willing to test it in your setup and share honest feedback, I'll give you a free Enterprise license.&lt;br&gt;
Open an issue or reach out directly — I want to know what breaks, what's missing, and what you'd actually need before you'd trust this in production.&lt;/p&gt;

&lt;p&gt;It's BSL 1.1 licensed — source available, self-hosting permitted. Converts to Apache 2.0 after 4 years.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This project was built with significant assistance from AI (Claude by Anthropic).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>go</category>
      <category>selfhosted</category>
      <category>proxy</category>
    </item>
  </channel>
</rss>
