How to Cheat LLM Context: A Lightweight AI Doc Assistant Architecture

#ai #architecture #llm #rag

Dropping your entire Markdown documentation folder into an LLM prompt sounds easy - until you see the API bill. Large contexts mean large costs, especially when users ask repetitive or highly specific questions.

When building the documentation assistant for my project, LinkShift.app (a programmable redirect and link-mapping platform running on the edge), I knew the learning curve would be steep for users dealing with Regex, Liquid templates, and edge routing rules. Instead of taking the easy route and watching my API budget melt, I designed a multi-tier, ultra-low-cost AI agent architecture.

Here is how I solved token bloat and kept response times blazing fast.

The Tiered Architecture at a Glance

Instead of throwing a massive model at the full chat history and documentation for every single query, the system filters the request through three distinct phases:

User Request -> [1. Receptionist (gpt-5.4-nano)] -> Intent Filtering & File Routing
                                                                  |
                                                                  v
User Response <- [3. Response Gen (gpt-5.4-mini)] <- [2. Inject Relevant Files (Usually 3-6)]

Step 1: Smart Data Ingestion (Preprocessing)

Feeding raw Markdown files dynamically to an LLM is incredibly inefficient.

All 28 Markdown files of my documentation were pre-processed and summarized beforehand using a tiny gpt-5.4-nano model.
For the OpenAPI/API Reference, I split the main schema by tags (endpoints). Each section got its own highly compressed summary.

Step 2: The "AI Receptionist" Guardrail

When a user asks a question, it doesn't touch the main, more expensive LLM right away. The first line of defense is a gpt-5.4-nano model acting as a "receptionist." It handles two critical tasks:

Intent Validation: It verifies if the query is actually relevant to LinkShift. This ensures no one is using my API budget to do their computer science homework.
File Routing: It scans the pre-made lightweight summaries and pinpoints the exact documentation files needed to answer the question. I set a safe upper limit of 10 files, but the model usually dynamically selects just 3-6 highly relevant ones.

The result? We only pass a fraction of the total documentation into the next stage.

Step 3: Precise Generation & The "Low-Token Context Hack"

Only now does the slightly heavier model, gpt-5.4-mini, enter the scene. It ingests the user's query and only the specific files isolated by the receptionist to compile a high-quality, hallucination-free answer.

The Chat History Hack:
Keeping full chat logs in memory quickly bloats the context window. To bypass this, every time gpt-5.4-mini responds, it also generates a single-sentence micro-summary of the conversation so far. On the next turn, I inject only this micro-summary instead of the entire chat history.

This keeps the context perfectly intact, answers lightning-fast, and the API bill down to literally pennies.

The Over-Engineering Syndrome

The best part about this whole setup? I spent days obsessing over this architecture, refining prompts, and stress-testing edge cases - despite currently having exactly zero users (free or paid).

It’s the classic indie hacker / software engineer trap: building a hyper-optimized, infinitely scalable infrastructure for massive traffic before making a single dollar.

On the bright side, the system is bulletproof, safe from wallet-draining exploits, and ready for whatever comes next.

If you want to test it out, try to break the receptionist guardrail, or just see how it handles technical queries, feel free to play with it here: linkshift.app/docs.

How do you handle context costs in your own LLM projects? Do you use a similar routing system, or do you prefer standard vector databases (RAG)? Let’s discuss in the comments!