<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Piotr Zielinski</title>
    <description>The latest articles on DEV Community by Piotr Zielinski (@p-zielinski).</description>
    <link>https://dev.to/p-zielinski</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3963384%2Fe6e6c08f-a49b-4e2f-b3c9-ed425e9a1e1c.jpg</url>
      <title>DEV Community: Piotr Zielinski</title>
      <link>https://dev.to/p-zielinski</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/p-zielinski"/>
    <language>en</language>
    <item>
      <title>How to Cheat LLM Context: A Lightweight AI Doc Assistant Architecture</title>
      <dc:creator>Piotr Zielinski</dc:creator>
      <pubDate>Tue, 02 Jun 2026 19:48:40 +0000</pubDate>
      <link>https://dev.to/p-zielinski/how-to-cheat-llm-context-a-lightweight-ai-doc-assistant-architecture-3hl1</link>
      <guid>https://dev.to/p-zielinski/how-to-cheat-llm-context-a-lightweight-ai-doc-assistant-architecture-3hl1</guid>
      <description>&lt;p&gt;Dropping your entire Markdown documentation folder into an LLM prompt sounds easy - until you see the API bill. Large contexts mean large costs, especially when users ask repetitive or highly specific questions.&lt;/p&gt;

&lt;p&gt;When building the documentation assistant for my project, &lt;strong&gt;&lt;a href="https://linkshift.app/" rel="noopener noreferrer"&gt;LinkShift.app&lt;/a&gt;&lt;/strong&gt; (a programmable redirect and link-mapping platform running on the edge), I knew the learning curve would be steep for users dealing with Regex, Liquid templates, and edge routing rules. Instead of taking the easy route and watching my API budget melt, I designed a multi-tier, ultra-low-cost AI agent architecture.&lt;/p&gt;

&lt;p&gt;Here is how I solved token bloat and kept response times blazing fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tiered Architecture at a Glance
&lt;/h2&gt;

&lt;p&gt;Instead of throwing a massive model at the full chat history and documentation for every single query, the system filters the request through three distinct phases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Request -&amp;gt; [1. Receptionist (gpt-5.4-nano)] -&amp;gt; Intent Filtering &amp;amp; File Routing
                                                                  |
                                                                  v
User Response &amp;lt;- [3. Response Gen (gpt-5.4-mini)] &amp;lt;- [2. Inject Relevant Files (Usually 3-6)]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 1: Smart Data Ingestion (Preprocessing)
&lt;/h3&gt;

&lt;p&gt;Feeding raw Markdown files dynamically to an LLM is incredibly inefficient.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All 28 Markdown files of my documentation were pre-processed and &lt;strong&gt;summarized&lt;/strong&gt; beforehand using a tiny &lt;code&gt;gpt-5.4-nano&lt;/code&gt; model.&lt;/li&gt;
&lt;li&gt;For the OpenAPI/API Reference, I split the main schema by tags (endpoints). Each section got its own highly compressed summary.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: The "AI Receptionist" Guardrail
&lt;/h3&gt;

&lt;p&gt;When a user asks a question, it doesn't touch the main, more expensive LLM right away. The first line of defense is a &lt;code&gt;gpt-5.4-nano&lt;/code&gt; model acting as a "receptionist." It handles two critical tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Intent Validation:&lt;/strong&gt; It verifies if the query is actually relevant to &lt;a href="https://linkshift.app/" rel="noopener noreferrer"&gt;LinkShift&lt;/a&gt;. This ensures no one is using my API budget to do their computer science homework.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Routing:&lt;/strong&gt; It scans the pre-made lightweight summaries and pinpoints the exact documentation files needed to answer the question. I set a safe upper limit of 10 files, but the model usually dynamically selects just 3-6 highly relevant ones.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result? We only pass a fraction of the total documentation into the next stage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Precise Generation &amp;amp; The "Low-Token Context Hack"
&lt;/h3&gt;

&lt;p&gt;Only now does the slightly heavier model, &lt;code&gt;gpt-5.4-mini&lt;/code&gt;, enter the scene. It ingests the user's query and only the specific files isolated by the receptionist to compile a high-quality, hallucination-free answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Chat History Hack:&lt;/strong&gt;&lt;br&gt;
Keeping full chat logs in memory quickly bloats the context window. To bypass this, every time &lt;code&gt;gpt-5.4-mini&lt;/code&gt; responds, it also generates a single-sentence micro-summary of the conversation so far. On the next turn, I inject &lt;em&gt;only&lt;/em&gt; this micro-summary instead of the entire chat history.&lt;/p&gt;

&lt;p&gt;This keeps the context perfectly intact, answers lightning-fast, and the API bill down to literally pennies.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Over-Engineering Syndrome
&lt;/h2&gt;

&lt;p&gt;The best part about this whole setup? I spent days obsessing over this architecture, refining prompts, and stress-testing edge cases - despite currently having exactly &lt;strong&gt;zero users&lt;/strong&gt; (free or paid).&lt;/p&gt;

&lt;p&gt;It’s the classic indie hacker / software engineer trap: building a hyper-optimized, infinitely scalable infrastructure for massive traffic before making a single dollar.&lt;/p&gt;

&lt;p&gt;On the bright side, the system is bulletproof, safe from wallet-draining exploits, and ready for whatever comes next.&lt;/p&gt;

&lt;p&gt;If you want to test it out, try to break the receptionist guardrail, or just see how it handles technical queries, feel free to play with it here: &lt;a href="https://linkshift.app/docs" rel="noopener noreferrer"&gt;linkshift.app/docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you handle context costs in your own LLM projects? Do you use a similar routing system, or do you prefer standard vector databases (RAG)? Let’s discuss in the comments!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz39cju0mlsnmv18n2bn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz39cju0mlsnmv18n2bn.png" alt=" " width="800" height="576"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
