<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: parupati madhukar reddy</title>
    <description>The latest articles on DEV Community by parupati madhukar reddy (@parupati).</description>
    <link>https://dev.to/parupati</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2890899%2Ff8c95dc1-8f92-4d3b-9917-a3011dadf0b3.jpg</url>
      <title>DEV Community: parupati madhukar reddy</title>
      <link>https://dev.to/parupati</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/parupati"/>
    <language>en</language>
    <item>
      <title>The Token Tax Problem: How I Built a Super Memory Layer for AI Coding Assistants using LLM Wiki</title>
      <dc:creator>parupati madhukar reddy</dc:creator>
      <pubDate>Wed, 06 May 2026 21:40:49 +0000</pubDate>
      <link>https://dev.to/parupati/the-token-tax-problem-how-i-built-a-super-memory-layer-for-ai-coding-assistants-using-llm-wiki-3c5g</link>
      <guid>https://dev.to/parupati/the-token-tax-problem-how-i-built-a-super-memory-layer-for-ai-coding-assistants-using-llm-wiki-3c5g</guid>
      <description>&lt;h2&gt;
  
  
  The Token Tax Problem: How I Built a Super Memory Layer for AI Coding Assistants
&lt;/h2&gt;

&lt;h2&gt;
  
  
  We Solved the Wrong Problem First
&lt;/h2&gt;

&lt;p&gt;When AI coding assistants arrived, we celebrated. Faster delivery. Less repetitive work. Developers doing more meaningful things.&lt;/p&gt;

&lt;p&gt;Then the invoices arrived.&lt;/p&gt;

&lt;p&gt;Token utilization had quietly become one of the fastest-growing line items in engineering costs. Every session, every agent, every code suggestion — all of it burning through context tokens. And the root cause was embarrassingly simple: &lt;strong&gt;we were paying for AI tools to re-learn our codebase from scratch, over and over again.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Round One: The Obvious Fixes
&lt;/h2&gt;

&lt;p&gt;We started with the basics. Things that genuinely helped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context window hygiene&lt;/strong&gt; — Being deliberate about &lt;em&gt;what&lt;/em&gt; goes into context rather than dumping entire file trees at every agent invocation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model switching&lt;/strong&gt; — Using faster, cheaper models for repetitive low-complexity tasks and reserving powerful models for architecture decisions and complex debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preprocessed context&lt;/strong&gt; — Writing structured markdown instruction files that encode team conventions once and reuse them everywhere, instead of expecting agents to infer them from raw code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scoped agents&lt;/strong&gt; — Purpose-built agents for specific tasks (test generation, code review, planning) rather than one general-purpose agent doing everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These helped. But they didn't solve the fundamental issue. Agents were still spending tokens &lt;em&gt;exploring&lt;/em&gt; the codebase before doing any real work.&lt;/p&gt;

&lt;p&gt;We needed something closer to a cache layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Idea: A Super Memory Layer
&lt;/h2&gt;

&lt;p&gt;The inspiration came from &lt;strong&gt;Andrej Karpathy's concept of the LLM Wiki&lt;/strong&gt; — the idea that an AI system benefits enormously from a persistent, structured knowledge index rather than re-reading raw source on every request.&lt;/p&gt;

&lt;p&gt;Think of it like &lt;strong&gt;CloudFront or Redis in front of your origin server&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of every agent making expensive round trips into raw source code, they read from a pre-built knowledge graph. That graph becomes a &lt;strong&gt;shared memory layer&lt;/strong&gt; — a single source of architectural truth accessible by any AI tool: Copilot, Factory, Claude, Cursor, or whatever comes next.&lt;/p&gt;

&lt;p&gt;For the implementation, I used &lt;strong&gt;Graphify&lt;/strong&gt; (&lt;a href="https://github.com/safishamsi/graphify" rel="noopener noreferrer"&gt;github.com/safishamsi/graphify&lt;/a&gt;), an open-source tool that converts a codebase into a knowledge graph:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nodes&lt;/strong&gt; — functions, components, hooks, utilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edges&lt;/strong&gt; — relationships between them (imports, calls, dependencies)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt; — a plain-language report, interactive visualization, and GraphRAG-ready JSON&lt;/li&gt;
&lt;/ul&gt;
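&lt;p&gt;To make this concrete, here is a minimal sketch of what such a graph looks like and how a script might summarize it. The field names (&lt;code&gt;nodes&lt;/code&gt;, &lt;code&gt;edges&lt;/code&gt;, &lt;code&gt;type&lt;/code&gt;) are assumptions about the JSON shape, not Graphify's documented schema; check your actual output.&lt;/p&gt;

```python
from collections import Counter

# A tiny graph in the general shape described above. Field names here
# are assumptions, not Graphify's documented schema.
graph = {
    "nodes": [
        {"id": "useAuth", "type": "hook"},
        {"id": "formatDate", "type": "utility"},
        {"id": "LoginForm", "type": "component"},
    ],
    "edges": [
        {"source": "LoginForm", "target": "useAuth", "type": "calls"},
        {"source": "LoginForm", "target": "formatDate", "type": "imports"},
    ],
}

def summarize(graph):
    """Count nodes and tally edge relationship types."""
    edge_types = Counter(e["type"] for e in graph["edges"])
    return len(graph["nodes"]), len(graph["edges"]), edge_types

n_nodes, n_edges, edge_types = summarize(graph)
print(n_nodes, n_edges, dict(edge_types))  # 3 2 {'calls': 1, 'imports': 1}
```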




&lt;h2&gt;
  
  
  The POC: Steps We Actually Followed
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1 — Full Codebase Attempt (Hit a Wall)
&lt;/h3&gt;

&lt;p&gt;First instinct: run it on the entire codebase at once.&lt;/p&gt;

&lt;p&gt;The corpus (900+ files) immediately exceeded the tool's recommended limits. This is actually a healthy constraint — feeding an LLM a massive undifferentiated codebase produces poor graph quality anyway.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Large codebases need a per-module strategy.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Step 2 — Module-by-Module Analysis
&lt;/h3&gt;

&lt;p&gt;We split the codebase by independent modules and ran the graph pipeline on each one separately.&lt;/p&gt;

&lt;p&gt;Each run was &lt;strong&gt;completely free&lt;/strong&gt; — Graphify's AST extraction is pure static analysis with zero LLM API calls. The graph structure emerged from the code itself:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Source Files&lt;/th&gt;
&lt;th&gt;Nodes&lt;/th&gt;
&lt;th&gt;Edges&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Module A&lt;/td&gt;
&lt;td&gt;354&lt;/td&gt;
&lt;td&gt;606&lt;/td&gt;
&lt;td&gt;1,599&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Module B&lt;/td&gt;
&lt;td&gt;318&lt;/td&gt;
&lt;td&gt;549&lt;/td&gt;
&lt;td&gt;1,501&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Module C&lt;/td&gt;
&lt;td&gt;166&lt;/td&gt;
&lt;td&gt;248&lt;/td&gt;
&lt;td&gt;509&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Module D&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;td&gt;193&lt;/td&gt;
&lt;td&gt;514&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Module E&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Step 3 — Debugging the Tool Itself
&lt;/h3&gt;

&lt;p&gt;During a couple of runs, report generation failed due to API signature changes between Graphify versions. We patched the calls and kept moving.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Pin your open-source tooling versions. APIs shift.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Step 4 — Merging Module Graphs
&lt;/h3&gt;

&lt;p&gt;With individual module graphs ready, we wrote a merge script to combine them into a single unified knowledge graph.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First attempt had a subtle bug&lt;/strong&gt; — the script accidentally read the same module's extract file multiple times (once per module), producing a graph full of duplicates. We caught it because every module's node set turned out to be identical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Rebuilt the merge from each module's actual AST cache files, prefixing node IDs with the module name to prevent collisions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;node_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;module_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;::&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;original_node_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
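&lt;p&gt;A sketch of the corrected merge logic, with in-memory stand-ins for the per-module AST cache files (the real script reads them from disk; the dict shapes here are illustrative assumptions):&lt;/p&gt;

```python
def merge_module_graphs(module_graphs):
    """Merge per-module graphs into one unified graph, prefixing every
    node ID with its module name ('module::node') so identical local
    IDs from different modules cannot collide."""
    merged = {"nodes": [], "edges": []}
    seen = set()
    for module_name, graph in module_graphs.items():
        for node in graph["nodes"]:
            node_id = f"{module_name}::{node['id']}"
            if node_id in seen:  # guards against re-reading the same extract
                continue
            seen.add(node_id)
            merged["nodes"].append({**node, "id": node_id})
        for edge in graph["edges"]:
            merged["edges"].append({
                "source": f"{module_name}::{edge['source']}",
                "target": f"{module_name}::{edge['target']}",
            })
    return merged

# Two modules that both define a local 'utils' node: after merging they
# stay distinct ('module_a::utils' vs 'module_b::utils').
module_graphs = {
    "module_a": {"nodes": [{"id": "utils"}], "edges": []},
    "module_b": {"nodes": [{"id": "utils"}],
                 "edges": [{"source": "api", "target": "utils"}]},
}
merged = merge_module_graphs(module_graphs)
print(len(merged["nodes"]), len(merged["edges"]))  # 2 1
```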



&lt;p&gt;&lt;strong&gt;Correct merged result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1,600+ unique nodes&lt;/strong&gt; across all modules&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;4,000+ edges&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;28 structural communities&lt;/strong&gt; detected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero LLM tokens&lt;/strong&gt; consumed to build it&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Step 5 — Discovering the God Nodes
&lt;/h3&gt;

&lt;p&gt;The most valuable output wasn't the graph itself — it was what the graph &lt;strong&gt;revealed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;God nodes&lt;/strong&gt; are the most connected abstractions in the codebase. The functions, utilities, and components that everything else depends on. Most experienced developers know these intuitively but have never seen them mapped explicitly.&lt;/p&gt;

&lt;p&gt;Once you know your god nodes, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prioritise documentation specifically for these high-impact functions&lt;/li&gt;
&lt;li&gt;Instruct agents to proceed carefully whenever changes touch them&lt;/li&gt;
&lt;li&gt;Use them as &lt;strong&gt;architectural anchors&lt;/strong&gt; in any context window&lt;/li&gt;
&lt;/ul&gt;
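&lt;p&gt;God nodes fall out of a simple degree count over the edge list. A minimal sketch (the edge list below is hypothetical):&lt;/p&gt;

```python
from collections import Counter

def find_god_nodes(edges, top_n=3):
    """Rank nodes by total degree (incoming + outgoing edges); the most
    connected abstractions are the 'god nodes'."""
    degree = Counter()
    for edge in edges:
        degree[edge["source"]] += 1
        degree[edge["target"]] += 1
    return degree.most_common(top_n)

# Hypothetical edge list for illustration.
edges = [
    {"source": "LoginForm", "target": "apiClient"},
    {"source": "Dashboard", "target": "apiClient"},
    {"source": "Dashboard", "target": "useAuth"},
    {"source": "useAuth", "target": "apiClient"},
]
print(find_god_nodes(edges))  # apiClient ranks first with degree 3
```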




&lt;h3&gt;
  
  
  Step 6 — Wiring the Graph into Agent Instructions
&lt;/h3&gt;

&lt;p&gt;We updated the agent instruction files used by each tool (GitHub Copilot, Factory/droid, etc.) to point at the merged graph report as their primary architecture reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Before answering architecture or codebase questions,
read the merged graph report at graphify-out/GRAPH_REPORT_MERGED.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means any agent that loads these instructions &lt;strong&gt;starts with architectural knowledge already loaded&lt;/strong&gt; — without scanning source files to build that understanding itself.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;strong&gt;9KB markdown report&lt;/strong&gt; replacing several megabytes of source scanning. Every session.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Step 7 — Running the Token Experiment
&lt;/h3&gt;

&lt;p&gt;To quantify the impact, we set up an A/B test. We commented out the graph instructions from all agent configuration files, then ran identical tasks in both configurations and compared token consumption.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- graphify section disabled for token utilization analysis
     re-enable when experiment is complete --&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results from the experiment will follow in a separate post.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Outputs: What Agents Actually Consume
&lt;/h2&gt;

&lt;p&gt;Every Graphify run produces three files, each serving a different consumer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Typical Size&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GRAPH_REPORT.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~9 KB&lt;/td&gt;
&lt;td&gt;Copilot, Cursor, any LLM reading markdown&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;graph.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~1 MB&lt;/td&gt;
&lt;td&gt;GraphRAG queries, programmatic traversal, MCP tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;graph.html&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~1 MB&lt;/td&gt;
&lt;td&gt;Human review, architecture walkthroughs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For token efficiency, agents read only the markdown (~9KB). The JSON is available for tools that can query it selectively.&lt;/p&gt;




&lt;h2&gt;
  
  
  Honest Pros and Cons
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ What Works
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero build cost&lt;/strong&gt; — AST extraction consumes no LLM tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-agnostic&lt;/strong&gt; — Works with any tool that reads files (Copilot, Factory, Claude, Cursor)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared memory&lt;/strong&gt; — One knowledge base, many consumers; no duplication of analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;God node awareness&lt;/strong&gt; — Agents automatically know which abstractions are highest-impact&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community detection&lt;/strong&gt; — Related code clusters surface naturally without manual documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ⚠️ What Doesn't
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stale risk&lt;/strong&gt; — Graph must be regenerated after structural changes; a stale graph actively misleads agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Velocity tension&lt;/strong&gt; — Codebases with rapid daily structural changes will find frequent regeneration expensive in &lt;em&gt;time&lt;/em&gt;, even if not in tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Corpus size limits&lt;/strong&gt; — Large repos must be split by module; cross-module edges are inferred, not extracted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No semantic understanding&lt;/strong&gt; — AST-only extraction misses business intent and domain meaning; semantic extraction adds LLM cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge complexity&lt;/strong&gt; — Combining module graphs requires care; duplicate nodes and ID collisions are easy mistakes to make&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Suggestions to Take This Further
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Incremental updates&lt;/strong&gt; — Re-extract only changed files after each commit, not the full module&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate in CI&lt;/strong&gt; — Regenerate affected module graphs as a post-merge pipeline step, triggered only when source files change&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selective semantic enrichment&lt;/strong&gt; — Run LLM-assisted extraction only on shared utilities and god nodes, not on every file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add a wiki navigation layer&lt;/strong&gt; — Generate a navigable &lt;code&gt;index.md&lt;/code&gt; so agents load only the relevant section of the graph rather than the full report&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commit only the report&lt;/strong&gt; — The markdown report is the token-saver; the JSON and HTML can stay gitignored to avoid bloating the repo&lt;/li&gt;
&lt;/ol&gt;
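&lt;p&gt;Suggestions 1 and 2 boil down to mapping changed files to stale module graphs. A minimal sketch of that mapping step (the module roots and file paths are hypothetical; in CI the changed list would come from something like &lt;code&gt;git diff --name-only&lt;/code&gt;):&lt;/p&gt;

```python
def modules_to_regenerate(changed_files, module_roots):
    """Return the modules whose graphs are stale, so CI re-extracts
    only the affected modules instead of the whole codebase."""
    stale = set()
    for path in changed_files:
        for module, root in module_roots.items():
            if path.startswith(root):
                stale.add(module)
    return sorted(stale)

# Hypothetical inputs: two tracked modules, three changed files.
changed = ["src/auth/login.ts", "src/auth/token.ts", "docs/readme.md"]
roots = {"auth": "src/auth/", "billing": "src/billing/"}
print(modules_to_regenerate(changed, roots))  # ['auth']
```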




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Token cost is the new technical debt of the AI-assisted development era.&lt;/p&gt;

&lt;p&gt;Every pattern that reduces it — pre-processed context, structured instructions, scoped agents — points in the same direction:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Give agents knowledge, not raw data.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A project knowledge graph is one concrete implementation of that principle. It is not magic, and it is not free to maintain. But as a cache layer between your codebase and your AI tools, it fundamentally changes the economics of agent-assisted development.&lt;/p&gt;

&lt;p&gt;The experiment is ongoing. I'll share the token comparison numbers once the A/B test wraps up.&lt;/p&gt;

&lt;p&gt;If you're working on similar token efficiency problems or have taken a different approach, I'd love to hear about it in the comments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🔧 &lt;a href="https://github.com/safishamsi/graphify" rel="noopener noreferrer"&gt;Graphify on GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💡 Concept inspiration: Andrej Karpathy — LLM Wiki&lt;/li&gt;
&lt;li&gt;📊 Token experiment results: coming soon&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>aiengineering</category>
      <category>productivity</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Tracing the agent flow in Openai-agents</title>
      <dc:creator>parupati madhukar reddy</dc:creator>
      <pubDate>Wed, 06 May 2026 21:39:18 +0000</pubDate>
      <link>https://dev.to/parupati/tracing-the-agent-flow-in-openai-agents-5e3k</link>
      <guid>https://dev.to/parupati/tracing-the-agent-flow-in-openai-agents-5e3k</guid>
      <description></description>
      <category>ai</category>
      <category>a2a</category>
      <category>openai</category>
      <category>agents</category>
    </item>
    <item>
      <title>AI Job Hunt Match Agent in n8n (Using AI_Job_Hunt_Agent_N8N)</title>
      <dc:creator>parupati madhukar reddy</dc:creator>
      <pubDate>Sat, 11 Apr 2026 02:48:24 +0000</pubDate>
      <link>https://dev.to/parupati/ai-job-hunt-match-agent-in-n8n-using-aijobhuntagentn8n-1fnh</link>
      <guid>https://dev.to/parupati/ai-job-hunt-match-agent-in-n8n-using-aijobhuntagentn8n-1fnh</guid>
      <description>&lt;p&gt;I updated my workflow to use the &lt;strong&gt;&lt;code&gt;AI_Job_Hunt_Agent_N8N&lt;/code&gt;&lt;/strong&gt; file as the source of truth.&lt;/p&gt;

&lt;p&gt;Instead of generating tailored resumes for every role, this version focuses on &lt;strong&gt;job-match intelligence&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pull jobs&lt;/li&gt;
&lt;li&gt;compare JD vs resume profile&lt;/li&gt;
&lt;li&gt;score fit&lt;/li&gt;
&lt;li&gt;send ranked opportunities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/parupati/AI_Job_Hunt_N8N" rel="noopener noreferrer"&gt;https://github.com/parupati/AI_Job_Hunt_N8N&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What the workflow does
&lt;/h2&gt;

&lt;p&gt;Every day at &lt;strong&gt;7:00 AM&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scrapes fresh jobs from SerpAPI (&lt;code&gt;google_jobs&lt;/code&gt;) for &lt;strong&gt;AI / Sr Full Stack Engineer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Loads a structured resume profile from a code node (summary, skills, experience, achievements)&lt;/li&gt;
&lt;li&gt;Sends each job description + resume profile to GPT-4o&lt;/li&gt;
&lt;li&gt;Parses AI response into structured fields like:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;match_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;match_tier&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;apply_recommendation&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Sorts by score and selects the top 5 opportunities&lt;/li&gt;
&lt;li&gt;Sends a daily HTML email report with:

&lt;ul&gt;
&lt;li&gt;company&lt;/li&gt;
&lt;li&gt;role&lt;/li&gt;
&lt;li&gt;location&lt;/li&gt;
&lt;li&gt;posting time&lt;/li&gt;
&lt;li&gt;match %&lt;/li&gt;
&lt;li&gt;tier&lt;/li&gt;
&lt;li&gt;recommendation&lt;/li&gt;
&lt;li&gt;job link&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  n8n node flow
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schedule Trigger&lt;/strong&gt; (daily at 7 AM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Request&lt;/strong&gt; (SerpAPI jobs endpoint)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code: Load Resume Data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code: Prepare Jobs for Match Analysis&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI (GPT-4o)&lt;/strong&gt; for JD-vs-resume fit analysis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code: Parse &amp;amp; Enrich Result&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IF + Code&lt;/strong&gt; (filter/sort/top 5)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Aggregate&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gmail&lt;/strong&gt; (send report)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Local setup
&lt;/h2&gt;

&lt;p&gt;I run n8n with Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;n8n&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;n8nio/n8n:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5678:5678"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:5678
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why this helped
&lt;/h2&gt;

&lt;p&gt;The workflow doesn’t auto-apply to jobs.&lt;br&gt;&lt;br&gt;
It automates job triage so I can spend time only on high-fit opportunities.&lt;/p&gt;

&lt;p&gt;This gave me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;daily ranked shortlist instead of random browsing&lt;/li&gt;
&lt;li&gt;consistent JD-vs-resume evaluation&lt;/li&gt;
&lt;li&gt;faster decision-making on where to apply&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next improvements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;enforce score threshold directly in IF node&lt;/li&gt;
&lt;li&gt;add company blacklist/whitelist&lt;/li&gt;
&lt;li&gt;generate optional cover note for top matches&lt;/li&gt;
&lt;li&gt;send Slack + email notifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy to share my importable &lt;code&gt;AI_Job_Hunt_Agent_N8N.sanitized.json&lt;/code&gt; workflow and setup checklist if anyone wants it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>career</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Building a RAG System from Scratch: Turning Aviation Disruption Data into an AI-Powered Q&amp;A App</title>
      <dc:creator>parupati madhukar reddy</dc:creator>
      <pubDate>Mon, 09 Mar 2026 04:37:04 +0000</pubDate>
      <link>https://dev.to/parupati/building-a-rag-system-from-scratch-turning-aviation-disruption-data-into-an-ai-powered-qa-app-4e8n</link>
      <guid>https://dev.to/parupati/building-a-rag-system-from-scratch-turning-aviation-disruption-data-into-an-ai-powered-qa-app-4e8n</guid>
      <description>&lt;p&gt;I recently built a Retrieval-Augmented Generation (RAG) system that lets you ask natural language questions about the 2026 Iran-US conflict's impact on global civil aviation — and get accurate, source-backed answers in seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try the live demo:&lt;/strong&gt; &lt;a href="https://parupati.com/aviationRag" rel="noopener noreferrer"&gt;https://parupati.com/aviationRag&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source code:&lt;/strong&gt; &lt;a href="https://github.com/parupati/IranUSAviationDisruptionRAG" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, I'll walk through the architecture, the decisions I made, and what I learned along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.kaggle.com/datasets/zkskhurram/global-civil-aviation-disruption2026-iranus-war" rel="noopener noreferrer"&gt;Global Civil Aviation Disruption 2026&lt;/a&gt; dataset on Kaggle contains 6 CSV files with 218 records covering airline financial losses, airport disruptions, airspace closures, flight cancellations, reroutes, and a timeline of conflict events.&lt;/p&gt;

&lt;p&gt;Raw CSV data isn't exactly user-friendly. If you wanted to know "Which airline suffered the most?" or "What airports in Iran were closed?", you'd have to manually dig through spreadsheets. I wanted to make this data conversational — ask a question, get a clear answer with sources.&lt;/p&gt;

&lt;p&gt;That's exactly what RAG does.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is RAG?
&lt;/h2&gt;

&lt;p&gt;RAG (Retrieval-Augmented Generation) is a pattern that combines two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval&lt;/strong&gt; — Find the most relevant pieces of information from your data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation&lt;/strong&gt; — Feed those pieces to an LLM to produce a human-readable answer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight: instead of fine-tuning a model on your data (expensive, slow), you just give the LLM the right context at query time. The model doesn't need to "know" your data — it just needs to read it.&lt;/p&gt;
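&lt;p&gt;The retrieval half of that pattern can be sketched in a few lines: embed everything once, then rank stored chunks by cosine similarity to the query vector. The toy 3-dimensional vectors below stand in for real embedding-model output.&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, corpus, top_k=2):
    """Step 1 (retrieval): rank stored chunks by similarity to the query.
    Step 2 (generation) would paste the winners into the LLM prompt."""
    ranked = sorted(corpus, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in ranked[:top_k]]

# Toy 3-d 'embeddings' standing in for real model output.
corpus = [
    {"text": "Emirates lost $4.2M daily", "vec": [0.9, 0.1, 0.0]},
    {"text": "Tehran airport closed",      "vec": [0.1, 0.9, 0.0]},
    {"text": "Flights rerouted via Egypt", "vec": [0.2, 0.2, 0.9]},
]
print(retrieve([1.0, 0.0, 0.1], corpus, top_k=1))  # the airline-loss chunk
```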




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Here's what I built:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CSV Files (6 tables, 218 records)
  → Python ingestion script converts each row to natural language
  → HuggingFace sentence-transformers embeds each chunk (all-MiniLM-L6-v2)
  → ChromaDB stores the vectors locally
  → FastAPI serves the /query endpoint
  → Angular frontend provides the chat UI
  → Deployed on Hugging Face Spaces (Docker)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tech Stack
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;LangChain&lt;/td&gt;
&lt;td&gt;Mature RAG framework, pluggable components&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;HuggingFace all-MiniLM-L6-v2&lt;/td&gt;
&lt;td&gt;Fast, runs on CPU, no GPU needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector Store&lt;/td&gt;
&lt;td&gt;ChromaDB&lt;/td&gt;
&lt;td&gt;Zero-config, file-based, perfect for small-medium datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;OpenAI GPT-4o&lt;/td&gt;
&lt;td&gt;Best answer quality for generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;FastAPI&lt;/td&gt;
&lt;td&gt;Async, auto-generates Swagger docs, production-ready&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;Angular&lt;/td&gt;
&lt;td&gt;Integrated into my existing portfolio site&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment&lt;/td&gt;
&lt;td&gt;Hugging Face Spaces (Docker)&lt;/td&gt;
&lt;td&gt;Free tier, auto-scaling, git-based deploys&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Interesting Part: Structured Data + RAG
&lt;/h2&gt;

&lt;p&gt;Most RAG tutorials use PDFs or text documents. My dataset was &lt;strong&gt;structured CSV data&lt;/strong&gt; — rows and columns, not paragraphs. This required an extra step: converting each row into a natural language sentence before embedding.&lt;/p&gt;

&lt;p&gt;For example, a row from &lt;code&gt;airline_losses_estimate.csv&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Emirates, UAE, 4200000, 18, 62, 2835200, 9180
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Becomes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Emirates (UAE) faces an estimated daily financial loss of $4,200,000 USD due to the Iran-US conflict. 18 flights were cancelled and 62 were rerouted, incurring $2,835,200 in additional fuel costs. Approximately 9,180 passengers were impacted."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is important because embedding models understand natural language, not CSV columns. Each of the 6 CSV files has its own conversion function that produces a descriptive sentence with all the context needed for retrieval.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building It: Step by Step
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Ingestion
&lt;/h3&gt;

&lt;p&gt;The ingestion script reads all 6 CSVs, converts each row to a natural language chunk, and stores it in ChromaDB with metadata (source file, category, original field values).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Each CSV file has a dedicated row-to-text converter
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;row_to_text_airline_losses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;airline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;) faces an estimated daily &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;financial loss of $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;estimated_daily_loss_usd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; USD...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;218 documents across 6 categories — small enough to fit in a single ChromaDB collection, large enough to need proper retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Embedding
&lt;/h3&gt;

&lt;p&gt;I used &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; from HuggingFace's sentence-transformers. It produces 384-dimensional vectors and runs comfortably on CPU. No GPU, no cloud embedding API, no cost.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HuggingFaceEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all-MiniLM-L6-v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;device&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Retrieval + Generation
&lt;/h3&gt;

&lt;p&gt;At query time, the user's question is embedded with the same model, and ChromaDB returns the top-k most similar chunks. These chunks are injected into a prompt template and sent to GPT-4o:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;format_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prompt instructs the model to act as an aviation intelligence analyst and to answer using ONLY the provided context, which keeps responses grounded in the retrieved documents rather than the model's prior knowledge.&lt;/p&gt;
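&lt;p&gt;As a sketch, a grounding prompt along these lines could look like the following (the project's exact wording may differ):&lt;/p&gt;

```python
# Hypothetical sketch of the grounding prompt template -- the wording the
# project actually uses may differ.
PROMPT_TEMPLATE = """You are an aviation intelligence analyst.
Answer the question using ONLY the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(context: str, question: str) -> str:
    """Fill the template with retrieved chunks and the user's question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

&lt;p&gt;The filled-in prompt is what gets piped into the LLM step of the chain shown above.&lt;/p&gt;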

&lt;h3&gt;
  
  
  4. API
&lt;/h3&gt;

&lt;p&gt;FastAPI wraps the RAG pipeline into a clean REST endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /query
{
  "question": "Which airline had the highest financial loss?",
  "k": 5
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response includes the answer and the source documents used to generate it — full transparency.&lt;/p&gt;
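&lt;p&gt;The response shape can be sketched as a plain dictionary (field names here are illustrative, not necessarily the endpoint's exact schema):&lt;/p&gt;

```python
# Illustrative response assembly -- field names are assumptions, not the
# API's exact schema.
def build_response(answer: str, source_docs: list[dict]) -> dict:
    """Bundle the generated answer with the chunks used to produce it."""
    return {
        "answer": answer,
        "sources": [
            {"content": d["content"], "category": d.get("category")}
            for d in source_docs
        ],
    }
```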

&lt;h3&gt;
  
  
  5. Deployment
&lt;/h3&gt;

&lt;p&gt;The entire system is containerized with Docker and deployed on Hugging Face Spaces (free tier). The vector store is built during the Docker build phase, so it's baked into the image — no cold-start database initialization.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Structured data needs extra love in RAG.&lt;/strong&gt; You can't just throw CSVs at an embedding model. Converting rows to natural language sentences dramatically improves retrieval quality.&lt;/p&gt;
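&lt;p&gt;The row-to-sentence step can be sketched like this (the column names mirror the loss figures shown earlier, but the exact CSV schema is an assumption):&lt;/p&gt;

```python
import csv
import io

# Illustrative row-to-sentence conversion; column names are assumptions
# modeled on the dataset's loss figures.
def row_to_sentence(row: dict) -> str:
    """Turn one structured record into prose an embedding model handles well."""
    return (
        f"On {row['date']}, {row['airline']} cancelled "
        f"{row['cancelled_flights']} flights, an estimated financial loss of "
        f"${int(row['estimated_daily_loss_usd']):,} USD."
    )

csv_text = (
    "date,airline,cancelled_flights,estimated_daily_loss_usd\n"
    "2026-03-01,Emirates,42,1500000\n"
)
sentences = [row_to_sentence(r) for r in csv.DictReader(io.StringIO(csv_text))]
```

&lt;p&gt;Each sentence then gets embedded as one document, instead of asking the model to make sense of raw comma-separated values.&lt;/p&gt;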

&lt;p&gt;&lt;strong&gt;2. You don't need a GPU for embeddings.&lt;/strong&gt; &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; runs in milliseconds on CPU for small datasets. Don't over-engineer the infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. ChromaDB is perfect for prototyping.&lt;/strong&gt; Zero config, runs embedded in your Python process, persists to disk. For 218 documents, it's instant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Hugging Face Spaces is underrated for API hosting.&lt;/strong&gt; Free Docker-based deployment with auto-generated URLs. The cold-start after inactivity (30-60 seconds) is the main trade-off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Context-stuffing beats RAG for small data.&lt;/strong&gt; I also built a portfolio chatbot endpoint on the same API — it just stuffs the entire markdown file into the system prompt. No embeddings, no vector store. When your data fits in the context window, keep it simple.&lt;/p&gt;
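&lt;p&gt;Context-stuffing is as simple as it sounds; a minimal sketch (file path and message shape are illustrative):&lt;/p&gt;

```python
from pathlib import Path

# Illustrative context stuffing: put the whole document into the system
# prompt instead of retrieving chunks. Path and message shape are assumptions.
def build_messages(doc_path: str, question: str) -> list[dict]:
    """Build a chat payload with the entire file as system context."""
    doc = Path(doc_path).read_text(encoding="utf-8")
    return [
        {"role": "system",
         "content": f"Answer questions using this document:\n\n{doc}"},
        {"role": "user", "content": question},
    ]
```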




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Live Demo:&lt;/strong&gt; &lt;a href="https://parupati.com/aviationRag" rel="noopener noreferrer"&gt;https://parupati.com/aviationRag&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example questions to try:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Which airline suffered the highest daily financial loss?"&lt;/li&gt;
&lt;li&gt;"What airports in Iran were closed?"&lt;/li&gt;
&lt;li&gt;"How many flights were cancelled from Dubai on March 1st?"&lt;/li&gt;
&lt;li&gt;"What was the aviation impact of the Natanz airstrike?"&lt;/li&gt;
&lt;li&gt;"Which countries closed their airspace and for how long?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt; &lt;a href="https://github.com/parupati/IranUSAviationDisruptionRAG" rel="noopener noreferrer"&gt;https://github.com/parupati/IranUSAviationDisruptionRAG&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Docs:&lt;/strong&gt; &lt;a href="https://parupati-iran-us-aviation-rag.hf.space/docs" rel="noopener noreferrer"&gt;https://parupati-iran-us-aviation-rag.hf.space/docs&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Adding &lt;strong&gt;hybrid search&lt;/strong&gt; (vector + keyword) via Azure AI Search for better retrieval&lt;/li&gt;
&lt;li&gt;Exploring &lt;strong&gt;streaming responses&lt;/strong&gt; for a more interactive chat experience&lt;/li&gt;
&lt;li&gt;Evaluating retrieval quality with metrics like precision@k and MRR&lt;/li&gt;
&lt;/ul&gt;
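&lt;p&gt;Hybrid search fuses a vector ranking with a keyword ranking. One common recipe is reciprocal rank fusion, sketched here in plain Python (Azure AI Search performs its own fusion internally; this only illustrates the idea):&lt;/p&gt;

```python
# Reciprocal rank fusion (RRF): combine two ranked lists of document ids.
# k=60 is the conventional smoothing constant for RRF.
def rrf_fuse(vector_ranked: list[str], keyword_ranked: list[str],
             k: int = 60) -> list[str]:
    """Score each doc by summed 1/(k + rank) across both rankings."""
    scores: dict[str, float] = {}
    for ranking in (vector_ranked, keyword_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

&lt;p&gt;Documents that rank well in both lists float to the top, which is exactly the behavior you want when a query mixes semantic intent with exact terms like airport codes.&lt;/p&gt;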

&lt;p&gt;If you're building your first RAG system, start small — a few CSVs, a local vector store, and a cloud LLM. Get the pipeline working end-to-end, then optimize. The fundamentals transfer directly to production-scale systems.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Python, LangChain, ChromaDB, HuggingFace, OpenAI GPT-4o, FastAPI, Angular, and Hugging Face Spaces.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Connect with me on &lt;a href="https://www.linkedin.com/in/parupati/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or check out more projects on &lt;a href="https://github.com/parupati" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>webdev</category>
      <category>python</category>
    </item>
    <item>
      <title>Building Production-Ready AI Agents with OpenAI Agents SDK and FastAPI</title>
      <dc:creator>parupati madhukar reddy</dc:creator>
      <pubDate>Mon, 20 Oct 2025 02:41:39 +0000</pubDate>
      <link>https://dev.to/parupati/building-production-ready-ai-agents-with-openai-agents-sdk-and-fastapi-abd</link>
      <guid>https://dev.to/parupati/building-production-ready-ai-agents-with-openai-agents-sdk-and-fastapi-abd</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This guide demonstrates how to leverage the OpenAI Agents SDK with FastAPI to create scalable, production-ready AI agent systems. The OpenAI Agents SDK provides a robust framework for building structured AI agents, while FastAPI offers high-performance API exposure with automatic documentation and validation.&lt;/p&gt;

&lt;h2&gt;
  
  
  🤖 OpenAI Agents SDK: The Foundation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;p&gt;The OpenAI Agents SDK provides several key abstractions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt;: the core AI entity with specific instructions and capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runner&lt;/strong&gt;: the execution engine for running agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AgentOutputSchema&lt;/strong&gt;: structured output validation using Pydantic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Integration&lt;/strong&gt;: seamless integration with OpenAI's latest models&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  🏗️ Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The system follows a clean separation of concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend&lt;/strong&gt;: Python-based AI agents orchestrated through FastAPI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend&lt;/strong&gt;: Modern web application with responsive design&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration&lt;/strong&gt;: RESTful APIs connecting the two layers&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  🤖 Building AI Agents: The Core Pattern
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Agent Definition Structure
&lt;/h3&gt;

&lt;p&gt;Every agent in the system follows a consistent pattern using the &lt;code&gt;agents&lt;/code&gt; framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentOutputSchema&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Structured output schema for the agent&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="n"&gt;AGENT_INSTRUCTIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a specialized AI agent that performs specific tasks.
Your instructions define the agent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s behavior and expertise.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SpecializedAgent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;AGENT_INSTRUCTIONS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AgentOutputSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentOutput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strict_json_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Real-World Example: Database Schema Agent
&lt;/h3&gt;

&lt;p&gt;Here's how the project implements a database schema generation agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;database_research.create_db_schema&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DatabaseSchema&lt;/span&gt;

&lt;span class="n"&gt;SCHEMA_GENERATION_INSTRUCTIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a senior database architect specialized in creating database schemas 
from functional requirements. Analyze requirements and generate normalized 
database schemas with proper relationships and constraints.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;create_db_schema_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CreateDBSchemaAgent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;SCHEMA_GENERATION_INSTRUCTIONS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;AgentOutputSchema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DatabaseSchema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strict_json_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Agent Execution Pattern
&lt;/h3&gt;

&lt;p&gt;Agents are executed using the &lt;code&gt;Runner&lt;/code&gt; pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_database_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DatabaseSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;input_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Functional Requirements:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;create_db_schema_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;final_output_as&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DatabaseSchema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🚀 Exposing Agents via FastAPI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. FastAPI Server Setup
&lt;/h3&gt;

&lt;p&gt;The project uses FastAPI to create a production-ready API server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HTTPException&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI Agents API&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/agent/process&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_with_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Execute agent logic
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;run_agent_workflow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Multi-Agent Orchestration
&lt;/h3&gt;

&lt;p&gt;The system implements complex workflows that chain multiple agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PRDResearchManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prd_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Step 1: Extract requirements
&lt;/span&gt;        &lt;span class="n"&gt;requirements&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_functional_requirements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prd_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 2: Generate database schema
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_database_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_instructions&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Step 3: Generate API contracts
&lt;/span&gt;        &lt;span class="n"&gt;contracts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_api_contracts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prd_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requirements&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;contracts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;contracts&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Streaming Responses
&lt;/h3&gt;

&lt;p&gt;For long-running agent operations, the system supports streaming responses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingResponse&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/research/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_stream_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ResearchRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_updates&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;manager&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ResearchManager&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;update&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;manager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;update&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StreamingResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;generate_updates&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; 
        &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/plain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cache-Control&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no-cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🔧 Configuration Management
&lt;/h2&gt;

&lt;p&gt;The system implements secure configuration management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_openai_api_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Try config file first
&lt;/span&gt;        &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai_api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Fallback to environment variable
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_environment_variables&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_openai_api_key&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎨 Frontend Integration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Modern Web Interface
&lt;/h3&gt;

&lt;p&gt;The frontend provides an intuitive interface for interacting with AI agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;callAgentAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`/api/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Example: Generate database schema&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateDatabaseSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;callAgentAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;database&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;prd_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;prdContent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;db_instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;instructions&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="nf"&gt;displayResults&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Proxy Architecture
&lt;/h3&gt;

&lt;p&gt;The UI server acts as a proxy to the AI agents backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// UI Server (Node.js/Express)&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/database&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://localhost:8000/database&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🛠️ Key Agent Types in the System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Research Agent
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose&lt;/strong&gt;: Web search and information gathering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input&lt;/strong&gt;: Search queries and research topics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: Comprehensive research reports&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Database Schema Agent
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose&lt;/strong&gt;: Generate database schemas from requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input&lt;/strong&gt;: Functional requirements and constraints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: Complete database schema with tables and relationships&lt;/li&gt;
&lt;/ul&gt;
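
&lt;p&gt;To make the input shape concrete, here is a minimal sketch of the payload this agent receives, reusing the &lt;code&gt;prd_text&lt;/code&gt; and &lt;code&gt;db_instructions&lt;/code&gt; fields from the frontend example above. The helper function itself is hypothetical:&lt;/p&gt;

```python
# Hypothetical helper: builds the request body for the /database
# endpoint. The field names (prd_text, db_instructions) come from the
# frontend example; the validation is illustrative.
def build_database_payload(prd_text, db_instructions=""):
    if not prd_text.strip():
        raise ValueError("prd_text must not be empty")
    return {"prd_text": prd_text, "db_instructions": db_instructions}

payload = build_database_payload("Users can create and share notes.")
```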

&lt;h3&gt;
  
  
  3. Contract Generation Agent
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose&lt;/strong&gt;: Create OpenAPI specifications from database schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input&lt;/strong&gt;: Database schema and business requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: Complete OpenAPI 3.0.3 specification&lt;/li&gt;
&lt;/ul&gt;
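
&lt;p&gt;For orientation, this is the minimal envelope of an OpenAPI 3.0.3 document that a contract agent would populate with paths and schemas. Only the spec version comes from the article; the builder function is a sketch:&lt;/p&gt;

```python
# Illustrative skeleton of an OpenAPI 3.0.3 document. A real contract
# agent would fill in paths and component schemas derived from the
# database schema; this only shows the required top-level structure.
def openapi_skeleton(title, version="1.0.0"):
    return {
        "openapi": "3.0.3",
        "info": {"title": title, "version": version},
        "paths": {},
        "components": {"schemas": {}},
    }

spec = openapi_skeleton("Notes API")
```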

&lt;h3&gt;
  
  
  4. Sequence Diagram Agent
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose&lt;/strong&gt;: Generate PlantUML sequence diagrams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input&lt;/strong&gt;: Architecture requirements and constraints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output&lt;/strong&gt;: PlantUML diagram code&lt;/li&gt;
&lt;/ul&gt;
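
&lt;p&gt;LLM output for diagrams often needs light post-processing before rendering. A hypothetical cleanup step, assuming the agent returns raw PlantUML text:&lt;/p&gt;

```python
# Hypothetical post-processing: make sure the agent's PlantUML output
# is wrapped in @startuml/@enduml markers so a renderer accepts it.
def wrap_plantuml(body):
    body = body.strip()
    if not body.startswith("@startuml"):
        body = "@startuml\n" + body + "\n@enduml"
    return body

diagram = wrap_plantuml("User -> API: POST /database")
```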

&lt;h2&gt;
  
  
  📊 Benefits of This Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each agent is independently deployable&lt;/li&gt;
&lt;li&gt;FastAPI provides async support for concurrent requests&lt;/li&gt;
&lt;li&gt;Streaming responses prevent timeouts on long operations&lt;/li&gt;
&lt;/ul&gt;
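
&lt;p&gt;The streaming point can be sketched with a plain async generator: yielding a long response in chunks keeps the connection alive instead of timing out. In the real backend such a generator would feed a FastAPI streaming response; here it is shown standalone with only the standard library:&lt;/p&gt;

```python
import asyncio

# Sketch of the streaming pattern: yield the result in fixed-size
# chunks, handing control back to the event loop between chunks.
async def stream_chunks(text, size=8):
    for i in range(0, len(text), size):
        yield text[i:i + size]
        await asyncio.sleep(0)  # give the event loop a turn

async def collect(gen):
    return [chunk async for chunk in gen]

chunks = asyncio.run(collect(stream_chunks("a long generated schema...", 8)))
```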

&lt;h3&gt;
  
  
  Maintainability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Clear separation between agent logic and API exposure&lt;/li&gt;
&lt;li&gt;Consistent patterns across all agents&lt;/li&gt;
&lt;li&gt;Comprehensive error handling and logging&lt;/li&gt;
&lt;/ul&gt;
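
&lt;p&gt;One concrete form the "consistent patterns" idea can take is a shared error envelope, so the frontend handles every agent's failures the same way. The shape and field names below are illustrative, not taken from the project:&lt;/p&gt;

```python
# Hedged sketch of a uniform error envelope returned by every agent
# endpoint. Field names are illustrative.
def error_response(message, status_code=500):
    return {
        "status": "error",
        "status_code": status_code,
        "detail": message,
    }

resp = error_response("OpenAI API key missing", status_code=401)
```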

&lt;h3&gt;
  
  
  Flexibility
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Framework-agnostic OpenAPI specifications&lt;/li&gt;
&lt;li&gt;Multiple output formats (JSON, streaming, files)&lt;/li&gt;
&lt;li&gt;Easy to add new agents following established patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🚀 Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Set Up the Backend&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;cd &lt;/span&gt;server
   pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
   &lt;span class="nb"&gt;cp &lt;/span&gt;config.template.json config.json
   &lt;span class="c"&gt;# Add your OpenAI API key to config.json&lt;/span&gt;
   python server.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Set Up the Frontend&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="nb"&gt;cd &lt;/span&gt;UI
   npm &lt;span class="nb"&gt;install
   &lt;/span&gt;npm start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Test the System&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Navigate to &lt;code&gt;http://localhost:3000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Try generating a database schema from requirements&lt;/li&gt;
&lt;li&gt;Explore the generated API contracts&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  💡 Best Practices Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agent Design&lt;/strong&gt;: Keep agents focused on single responsibilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Handling&lt;/strong&gt;: Always provide meaningful error responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration&lt;/strong&gt;: Use secure configuration management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Design&lt;/strong&gt;: Follow REST conventions and provide clear documentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend Integration&lt;/strong&gt;: Use proxy patterns for clean separation&lt;/li&gt;
&lt;/ol&gt;
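
&lt;p&gt;The configuration practice can be sketched as: read the key from &lt;code&gt;config.json&lt;/code&gt; (matching the setup steps above) but let an environment variable override it, so no secret is hard-coded. The function and the &lt;code&gt;openai_api_key&lt;/code&gt; field name are assumptions for illustration:&lt;/p&gt;

```python
import json
import os

# Illustrative config loader: environment variable wins, config.json
# is the fallback. The "openai_api_key" field name is an assumption.
def load_api_key(path="config.json"):
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        return key
    with open(path) as f:
        return json.load(f)["openai_api_key"]
```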

&lt;p&gt;This architecture demonstrates how to build production-ready AI agent systems that can scale from prototype to enterprise deployment. The combination of structured agents, robust APIs, and modern web interfaces creates a powerful platform for AI-driven development workflows.&lt;/p&gt;

</description>
      <category>api</category>
      <category>openai</category>
      <category>python</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
