I built a localhost proxy that stops LangChain from hallucinating on dead RAG data (and cuts token burn by 50%)

#webdev #ai #devops #architecture

The problem nobody's proxy was solving

Every RAG pipeline I'd worked with had the same blind spot: once a document made it into the retrieval index, the LLM treated it as gospel forever. A 2022 blog post about a deprecated API, a GitHub issue that got resolved eight months ago, or an arXiv paper that has since been contradicted does not carry a "this might be stale" flag by the time it lands in your prompt. The model just reasons over it, confidently, as if it were current.

That isn't a retrieval problem. It's a freshness problem, and nothing in the standard LLM-app stack owns it. So I built KU-Gateway. It is an OpenAI-compatible proxy that sits in front of your model calls. It scores every piece of your embedded context for staleness and strips out anything that is too decayed to trust before a single token ever gets sent upstream.

The shape of it

KU-Gateway is deliberately boring to integrate. It's a FastAPI service that speaks the same /v1/chat/completions contract as OpenAI, so adopting it is a one-line change in any existing client:

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-openai-key",
    base_url="http://localhost:8000/v1",   # ← the only line that changes
)

Behind that familiar interface, every request goes through a four-stage pipeline:

Your App  ──▶  KU-Gateway (:8000)  ──▶  LLM Provider (OpenAI / Anthropic / Gemini / local)
                    │
                    ├─▶ extract <context> blocks from the request
                    ├─▶ score each block via the Knowledge Universe API
                    ├─▶ drop anything past the decay threshold
                    └─▶ forward the trimmed request + print live telemetry

The scoring itself is delegated to the Knowledge Universe API, which returns a decay_score (0 = fresh, 1 = fully decayed). It also provides a knowledge_velocity label (like frozen, slow, moderate, fast, or hypersonic ) and a conflict_detected flag for content that actively contradicts more current knowledge. KU-Gateway's entire job is turning that score into a clean keep or drop decision so you do not have to think about it.

Extraction: finding the context in the first place

Requests don't arrive pre-tagged as "this part is retrieved context, this part is the user's actual question." KU-Gateway assumes anything wrapped in <context>...</context> tags inside a message is fair game for scoring:

def extract_context_chunks(messages: List[Dict[str, Any]]) -> List[ContextChunk]:
    chunks = []
    context_pattern = r"<context>(.*?)</context>"

    for msg in messages:
        content = msg.get("content", "")
        matches = re.findall(context_pattern, content, re.DOTALL)

        for i, match in enumerate(matches):
            chunk = ContextChunk(
                id=f"chunk_{i+1}_{hash(match[:100])}",
                content=match.strip(),
                source=None,
                url=None,
                title=None
            )
            chunks.append(chunk)

    return chunks

If there is nothing to extract, the request passes straight through. This setup completely avoids wasting API calls on plain conversational turns. This part of the codebase is definitely a v1 implementation. It currently assumes flat, unnested context blocks. If you want to use more elaborate embedding schemes like nested tags or structured JSON blobs, you will need a more capable parser. That upgrade is already on the roadmap.

Filtering: the actual gate

Once chunks come back scored, the decision logic is refreshingly simple. A chunk survives if its decay score is lower than the specific threshold set for its source:

def filter_chunks(self, chunks, results):
    fresh_chunks, fresh_results, blocked_chunks = [], [], []
    result_map = {r.chunk_id: r for r in results}

    for chunk in chunks:
        result = result_map.get(chunk.id)
        if result is None:
            # No evaluation result — fail open, keep the chunk
            fresh_chunks.append(chunk)
            continue

        threshold = self.get_threshold_for_source(chunk.source)

        if result.decay_score < threshold:
            fresh_chunks.append(chunk)
            fresh_results.append(result)
        else:
            blocked_chunks.append((chunk, result))

Two design decisions worth calling out:

Per-source thresholds. A single global cutoff doesn't make sense across domains because arxiv content ages differently than a Stack Overflow answer. KU_SOURCE_THRESHOLDS lets you set {"arxiv": 0.7, "github": 0.3} and the gateway will respect it per chunk.
Fail-open, deliberately. If the evaluator can't produce a result for a chunk at all (an uncaught exception, not a scoring failure), that chunk is kept rather than silently dropped. A gateway that fails closed on infrastructure hiccups is a gateway that quietly breaks your app in production. If the KU API itself is reachable but errors, it instead returns a neutral decay_score=0.5, confidence=0.0. This just goes through normal threshold comparison rather than getting special-cased.

What it actually looks like in the terminal

This isn't a hypothetical pipeline. Here is a real request against a running gateway, unmodified terminal output:

)

That request came in with 5 context chunks and 42 original tokens. Four chunks got blocked with decay scores of 0.80, 0.88, 0.73, and 0.54. leaving one fresh chunk and a clean payload of 21 tokens. 50% of the tokens in that request were pure staleness, stripped before the model ever saw them. Average decay across the blocked chunks: 0.74. That's the whole value proposition in one log line.

Every request gets this treatment automatically, and the same numbers are queryable over HTTP:

curl http://localhost:8000/v1/telemetry

{
  "requests": 12,
  "total_tokens_saved": 340,
  "total_cost_saved": 0.00068,
  "total_conflicts": 2,
  "recent": [ ... ]
}

A confession: the startup banner was dead code

Here's the kind of thing that only surfaces when you actually trace the call graph instead of trusting your own memory of what you built. telemetry.py has a fully-implemented print_startup() method that covers API key tier detection, decay thresholds, supported source counts. It even uses a Nice Rich-rendered panel. Except main.py never called it. There was no @app.on_event("startup") handler at all, only a shutdown hook that printed the session summary. The banner had been sitting there, fully written, entirely unreachable, since the initial commit.

The fix was small:

@app.on_event("startup")
async def startup_event():
    if settings.telemetry_enabled:
        telemetry.print_startup(settings.ku_api_key, settings.decay_threshold, settings.port)

I'm including this not because it's impressive, since it's a two-line fix, but because "does the thing I documented actually run" turned out to be a more useful question than "does the code compile." Worth asking of your own projects before you write the README.

Configuration, in brief

Everything is environment-variable driven via pydantic-settings, loaded from .env. The essentials:

Variable	Default	What it does
`KU_API_KEY`	(required)	Your Knowledge Universe key, must start with `ku_`
`KU_DECAY_THRESHOLD`	`0.5`	Global cutoff — chunks at or above this are blocked
`KU_SOURCE_THRESHOLDS`	`{}`	Per-source overrides as JSON
`UPSTREAM_LLM_BASE_URL`	`https://api.openai.com`	Where cleaned requests get forwarded
`KU_REDIS_ENABLED`	`false`	Cache decay scores instead of re-scoring identical chunks every call

Full reference is in the repo README.

Shipping it

There's a Dockerfile and a docker-compose.yml that spins up the gateway alongside a mock KU API and a mock LLM echo server. This setup is genuinely useful for kicking the tires without live credentials:

make docker-up

And for a real cluster, kubernetes/ has Deployment, Service, Ingress, ConfigMap, and Secret manifests ready to kubectl apply -f.

What's not done yet

I'd rather tell you this than have you find out the hard way:

Rate limiting and auth middleware are stubs. KU_RATE_LIMIT doesn't do anything yet. The middleware classes exist but pass every request through.
BYOK vault support is scaffolded, not built. KU_VAULT_ENABLED has no effect.
YAML config isn't wired up. There's a .ku-gateway.yaml.example in the repo describing a config format the code doesn't actually read yet. Environment variables are the only real input right now.
Message reconstruction is intentionally simple. It handles flat <context> blocks well; deeply nested or structured context embedding will need more work.

None of this blocks the core use case of request in, stale context stripped, clean request out. However, I'd rather be upfront about the beta-ness than have you discover a stub in production.

Try it

git clone https://github.com/VLSiddarth/KU-Gateway.git
cd KU-Gateway
python -m venv venv && source venv/bin/activate
make install
cp .env.example .env # Mint your free KU_API_KEY at https://api.knowledgeuniverse.tech/dashboard.html
make run

It's MIT-licensed and I'd genuinely welcome issues, PRs, or just someone telling me the decay thresholds are wrong for their domain. If you're building anything that hands retrieved documents to an LLM and you've never once asked "how old is this, actually?" then that is exactly the gap this fills.

Repo: https://github.com/VLSiddarth/KU-Gateway