I spent three hours debugging a "hallucination" in my agent's daily briefing only to realize the agent wasn't hallucinating at all. It had simply failed to access my local financial spreadsheets because of a tool denylist I'd configured for security, and instead of admitting it couldn't see the data, it had tried to "guess" based on a few fragments it had previously cached in a cloud-based session. Even worse, I discovered that a fallback trigger in my orchestration layer had sent a summarized snippet of my private data to a cloud API because the local inference node had a momentary timeout.
If you're building AI agents that touch real-world data, the "happy path" you see in every demo is just a prompt and an API key. The reality is a minefield of data leaks, prompt injections, and silent failures that send your private keys or bank statements to a third-party server because a local GPU pod decided to restart.
This is a problem for anyone running autonomous agents that have read or write access to a local filesystem. If your routing logic is flawed, your privacy isn't a policy; it's a coin flip.
The Wrong Way: Trusting the Orchestrator
My first attempt at "privacy" was naive. I used a simple conditional in my agent's logic: if the query contained words like "bank," "password," or "private," route it to a local Ollama instance. Otherwise, send it to GPT-4o.
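For reference, the gate amounted to something like this (a reconstruction, not the original file; route-to-local.sh is a stand-in for whatever calls your Ollama instance):

#!/bin/bash
# naive-router.sh - the "wrong way": keyword-based routing
QUERY="$1"
# Keyword matching is the entire "privacy" layer here
if echo "$QUERY" | grep -qiE 'bank|password|private'; then
  ./route-to-local.sh "$QUERY"   # local Ollama instance
else
  ./route-to-cloud.sh "$QUERY"   # GPT-4o via the cloud API
fi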
This failed immediately for three reasons. First, keyword filtering is a joke. A user (or a prompt injection) can easily bypass "bank" by asking about "financial liquidity instruments." Second, I assumed the orchestrator was a neutral party. In reality, the orchestrator often handles the context window, meaning the sensitive data is already in the prompt before the routing decision is even made. Third, I had no fail-safe. When the local model timed out, the system defaulted to the cloud provider to ensure "high availability." In a privacy-first system, unavailability is better than exposure.
I also hit a wall with tool access. I had disabled sandbox.mode to let my agents actually do work, but I quickly found that built-in tools like read and edit can be manipulated to bypass exec allowlists. I saw a specific instance where a prompt injection convinced the agent to use a read-chunk command (a hidden utility in some knowledge base scripts) to dump raw data from a file that should have been summarized first.
The Actual Solution: Two-Tier Privacy Routing
The only way to actually guarantee privacy is to move the routing logic as close to the data as possible and treat the cloud LLM as an untrusted guest. I implemented a two-tier architecture: a local "Privacy Gate" and a reference-only knowledge base.
1. The Reference-Only Knowledge Base
Instead of feeding raw files to the LLM, I use a system where the LLM never sees the original document. I use poppler-utils for PDF extraction and a local embedding model to populate a Qdrant vector store. The agent queries the vector store, but the results are filtered through a local script before being sent to any inference engine.
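Ingestion is a one-time job (pdftotext from poppler-utils, then embed and upsert into Qdrant); the query path is where privacy matters. Here is a minimal sketch of that path, assuming a hypothetical documents collection with a text payload field, nomic-embed-text as the local embedding model, and placeholder internal hostnames:

#!/bin/bash
# kb-query.sh - query the reference-only knowledge base (sketch; collection,
# payload field, endpoints, and embedding model are placeholders)
QUERY="$1"
QDRANT="http://qdrant.internal:6333"
EMBED_ENDPOINT="http://ollama-gpu-node.internal/api/embeddings"

# 1. Embed the query locally; the text never leaves the cluster
VECTOR=$(curl -s "$EMBED_ENDPOINT" \
  -d "{\"model\": \"nomic-embed-text\", \"prompt\": \"$QUERY\"}" | jq -c '.embedding')

# 2. Pull the top matching chunks from Qdrant
curl -s -X POST "$QDRANT/collections/documents/points/search" \
  -H "Content-Type: application/json" \
  -d "{\"vector\": $VECTOR, \"limit\": 5, \"with_payload\": true}" \
  | jq -r '.result[].payload.text' > /tmp/kb-context.txt

# 3. The filtering script works on /tmp/kb-context.txt; the original
#    documents are never handed to any inference engine.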
2. The Privacy Gate (Routing Layer)
I wrote a wrapper, knowledge.sh, that handles the routing. It doesn't rely on keywords. It relies on the data source. If the data comes from a "Sensitive" tagged volume in my cluster, the request is hard-pinned to the local GPU node.
Here is a simplified version of how I handle a private query:
#!/bin/bash
# knowledge.sh query - Local-first routing
# The --private flag is attached by the caller based on the data source
# (anything read from a "Sensitive" tagged volume), not by keyword matching.
QUERY="$1"
MODEL="qwen2.5:14b"
# The local endpoint is a dedicated GPU node in my K8s cluster
LOCAL_ENDPOINT="http://ollama-gpu-node.internal/v1/chat/completions"

# Check if the query requires sensitive data access
if [[ "$QUERY" == *"--private"* ]]; then
  echo "Routing to local inference..."
  # We use a local model and a local endpoint. No cloud fallback.
  curl -X POST "$LOCAL_ENDPOINT" \
    -H "Content-Type: application/json" \
    -d "{
      \"model\": \"$MODEL\",
      \"messages\": [{\"role\": \"user\", \"content\": \"$QUERY\"}],
      \"stream\": false
    }"
else
  # Non-sensitive queries can go to the cloud orchestrator
  ./route-to-cloud.sh "$QUERY"
fi
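Usage is boring on purpose; the caller decides the route, not the model (example queries, obviously):

./knowledge.sh "Summarize last month's account activity --private"   # pinned to the local GPU node
./knowledge.sh "Outline a blog post about Kubernetes upgrades"       # allowed to hit the cloud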
3. Hardening the Execution
To prevent the "hallucination via missing data" problem, I stopped letting the LLM handle the final delivery of sensitive reports. I use a pattern where the LLM generates a template or a summary, but a local Python script handles the actual data insertion and delivery.
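A minimal sketch of that split, using envsubst for the local insertion step (the placeholder name, template path, and value are made up for illustration):

# The LLM only ever produces a template with placeholders, e.g.
#   /tmp/briefing-template.txt: "Accounts: net change of ${NET_CHANGE} since yesterday."
# The local script owns the real values and fills them in:
export NET_CHANGE="-142.37"   # example value from the local collection step
envsubst < /tmp/briefing-template.txt > /tmp/briefing.txt   # consumed by the delivery script below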
For my daily briefings, I use a wrapper script that ensures the data collection is isolated from the cloud inference:
#!/bin/bash
# life-briefing-run.sh
set -euo pipefail   # fail closed: if any step breaks, nothing gets sent

# 1. Collect raw data locally (Private)
./daily-briefing.sh --collect-only

# 2. Format the data using a local script (No LLM involved here)
# This prevents the LLM from accidentally leaking raw data in its output
python3 /opt/scripts/format-and-send-briefing.py
And the Python script handles the delivery via a secure API (like Telegram) without ever sending the raw content to a third-party LLM for "polishing":
import requests

def send_telegram_message(message):
    # Tokens are managed via SealedSecrets in K8s
    bot_token = 'ANONYMIZED_TOKEN'
    chat_id = 'ANONYMIZED_ID'
    url = f'https://api.telegram.org/bot{bot_token}/sendMessage'
    payload = {
        'chat_id': chat_id,
        'text': message,
        'parse_mode': 'Markdown'
    }
    # Fail loudly instead of silently dropping the briefing
    response = requests.post(url, json=payload, timeout=10)
    response.raise_for_status()

# Load the locally generated briefing
with open('/tmp/briefing.txt', 'r') as f:
    content = f.read()

send_telegram_message(content)
Why This Works
This approach works because it removes the "decision" from the LLM. If you ask an LLM "Should I send this to the cloud?", it will eventually say yes. By moving the routing to a bash wrapper and a Python script, the logic is deterministic.
The use of a local model like qwen2.5:14b via Ollama provides enough reasoning capability to summarize private data without needing the massive parameter counts of GPT-4. I've found that for most RAG (Retrieval-Augmented Generation) tasks, a 14B model is the sweet spot between performance and the VRAM limits of my GPU nodes.
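If you want to sanity-check that fit before hard-pinning your routing to a node, loading the model and watching VRAM is enough (assuming the node runs the standard NVIDIA tooling):

ollama pull qwen2.5:14b
ollama run qwen2.5:14b "Reply with OK"   # forces the model into VRAM
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader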
By separating the synthesis (LLM) from the delivery (Python script), I've created a circuit breaker. Even if the LLM is compromised via prompt injection, it cannot "leak" the data to the cloud because it doesn't have the API keys for the cloud provider; those are held by the orchestrator, which is gated by the knowledge.sh script.
For those managing the underlying hardware, ensuring these local models stay performant requires a stable infrastructure. I've written about how I handle GPU passthrough on Proxmox and why the NVIDIA Container Toolkit is non-negotiable for this to work in a Kubernetes environment.
Lessons Learned
The biggest surprise was how often "convenience" features in agent frameworks are actually security holes. For example, I found that sessionKey in some cron-job implementations is often misunderstood. I assumed it provided hard isolation, but it turns out it's often just a routing hint. To get actual isolation, you have to explicitly set the session to isolated, or you risk your private data bleeding into the "main" session context, which might be shared with a cloud-connected agent.
Another gotcha was the Qdrant MCP. I hit several "Not existing vector name" errors during the rollout. This wasn't a bug in my code but a version mismatch between the MCP server and the Qdrant instance. In a bare-metal K8s setup, pinning your versions is the only way to avoid waking up to a broken pipeline.
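Pinning here just means never letting an image tag float; something like this, where the namespace and deployment names are illustrative and the tag is only an example:

kubectl -n vector-store set image deployment/qdrant qdrant=qdrant/qdrant:v1.9.2
kubectl -n vector-store rollout status deployment/qdrant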
If I were to do this again, I'd implement a more formal "Taint and Toleration" system in Kubernetes. I'd taint my GPU nodes with privacy=high and only allow pods with the corresponding toleration to run there. This would prevent a non-private, cloud-connected pod from ever being scheduled on the same physical hardware where my sensitive local models are processing data in memory.
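Sketched out, with placeholder node and pod names:

# Taint the GPU node so nothing schedules there by default
kubectl taint nodes gpu-node-01 privacy=high:NoSchedule

# Only pods that explicitly tolerate the taint can land on that hardware
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ollama-private
spec:
  containers:
  - name: ollama
    image: ollama/ollama   # pin a specific tag in practice
  tolerations:
  - key: "privacy"
    operator: "Equal"
    value: "high"
    effect: "NoSchedule"
EOF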
For those looking to scale this into a professional environment, this kind of architecture is a core part of what I do in AI agent and infrastructure consulting. Moving from a "it works on my machine" script to a production-grade, privacy-routed pipeline is where most of the complexity lives.
The takeaway is simple: if the data is sensitive, the cloud is a liability. Build your gate, pin your models, and never let your LLM decide where your data goes.