Every time you call an LLM API, your prompt travels in plaintext. The provider sees it. Their logs see it. Anyone with access to the infrastructure sees it.
For most use cases this is fine. But when you're generating legal documents, discussing medical cases, or brainstorming business strategy — "trust us" isn't good enough.
I wanted true end-to-end encryption for LLM inference. So I built a proxy that makes it happen.
What if the model couldn't see your plaintext either?
Venice AI runs E2EE models inside Trusted Execution Environments (TEEs) — hardware-isolated enclaves where even Venice's own engineers can't access the data being processed. The model's memory is encrypted at the hardware level.
The catch: the E2EE protocol requires client-side cryptography — ECDH key exchange, AES-256-GCM encryption, streaming decryption — that standard OpenAI SDKs don't support.
So if you're using Python's openai library, LangChain, or any other OpenAI-compatible client, you can't use E2EE models out of the box.
The proxy approach
Instead of forking every SDK, I built a local proxy that sits between your app and Venice:
Your app (plaintext) --> localhost:5111 --> [E2EE Proxy] --> Venice API (encrypted)
Your app sends normal plaintext requests to localhost:5111. The proxy handles the entire E2EE handshake transparently. From your app's perspective, it's just talking to another OpenAI-compatible endpoint.
How the crypto works
Here's the full flow for each request:
1. Key generation
The proxy generates an ephemeral secp256k1 key pair — the same curve Bitcoin uses. This happens once per session (sessions auto-rotate every hour).
2. TEE attestation
Before trusting any public key, the proxy fetches hardware attestation from Venice. This proves the model is actually running inside a TEE, not on a regular server. The attestation includes a nonce to prevent replay attacks.
3. ECDH key exchange
Using the model's attested public key and our ephemeral private key, we compute a shared secret via Elliptic Curve Diffie-Hellman. Neither party ever transmits the shared secret — it's derived independently on both sides.
4. Key derivation + encryption
The shared secret is fed through HKDF-SHA256 to derive a 256-bit AES key. Each message is encrypted with AES-256-GCM using a random 12-byte nonce. The encrypted payload looks like:
[ephemeral public key (65B)] [nonce (12B)] [ciphertext + auth tag]
5. Streaming decryption
Venice streams SSE responses with encrypted chunks. The proxy decrypts each chunk in real-time and forwards it as standard SSE to your app. You see plaintext streaming in — but it was encrypted on the wire.
Security decisions
A few things I was deliberate about:
- Constant-time crypto: all ECDH operations use @noble/curves, an audited library with constant-time implementations. No timing side channels.
- Ephemeral key zeroing: the private key is overwritten with zeros immediately after deriving the shared secret.
- No plaintext logging: prompts and responses never appear in logs. Only metadata (model name, message count) is logged.
- Localhost only: the server binds to 127.0.0.1, with CORS restricted to localhost origins.
- Request size limits: 10 MB max body, 60s upstream timeout.
- Race condition prevention: a per-model session mutex prevents duplicate key exchanges under concurrent requests.
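That last point can be sketched as a promise-based lock. The names here are mine, not the proxy's: concurrent requests for the same model share one in-flight key-exchange promise instead of starting duplicates.

```javascript
// Per-model session cache: the Map stores the *promise* of the key
// exchange, so a second caller awaits the first handshake instead of
// starting its own.
const sessions = new Map();

async function getSession(model, performKeyExchange) {
  if (!sessions.has(model)) {
    const pending = performKeyExchange(model).catch((err) => {
      sessions.delete(model); // a failed handshake may be retried
      throw err;
    });
    sessions.set(model, pending);
  }
  return sessions.get(model);
}

// Demo: two concurrent callers trigger exactly one handshake
let handshakes = 0;
const fakeExchange = async (model) => ({ model, id: ++handshakes });
Promise.all([
  getSession("e2ee-demo-model", fakeExchange),
  getSession("e2ee-demo-model", fakeExchange),
]).then(([a, b]) => console.log(handshakes, a === b)); // 1 true
```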
How to use it
Setup
git clone https://github.com/x1m4x/e2ee-llm-proxy.git
cd e2ee-llm-proxy
npm install
VENICE_API_KEY="your-key" node server.js
Python
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:5111/v1", api_key="unused")
response = client.chat.completions.create(
    model="e2ee-gpt-oss-120b-p",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
Node.js
import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "http://127.0.0.1:5111/v1",
  apiKey: "unused",
});
const stream = await client.chat.completions.create({
  model: "e2ee-gpt-oss-120b-p",
  messages: [{ role: "user", content: "Hello!" }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || "");
}
Any framework that speaks OpenAI format (LangChain, LlamaIndex, etc.) works the same way — just change base_url.
Limitations
These are inherent to Venice's E2EE protocol, not the proxy:
- Streaming only — non-streaming is not supported
- No function calling — tools, functions, structured outputs don't work
- No file uploads or vision — text in, text out
- TEE attestation caveat — attestation is verified server-side by Venice. For maximum trust, you'd want client-side verification against hardware root certificates (Intel TDX / AMD SEV-SNP). This is documented in the README.
The numbers
The whole thing is ~300 lines of JavaScript, with 82 tests and no dependencies beyond the @noble crypto suite. MIT licensed.
Available E2EE models include GPT-OSS 120B ($0.13/$0.65 per 1M tokens), Qwen3.5 122B, and GLM 4.7 Flash with 198K context.
GitHub: https://github.com/x1m4x/e2ee-llm-proxy
Feedback welcome.