DEV Community: Venu gopal varma Bhupathiraju

Building My Memory Layer - The Thinking Process Behind Every Layer

Venu gopal varma Bhupathiraju — Mon, 20 Jul 2026 14:28:29 +0000

Last week, I decided not to start by writing code.

Instead, I wanted to understand one simple question.

If humans remember naturally, why does an AI assistant forget so easily?

I promised myself that I wouldn't copy an existing framework or library. Every time I thought I had found the answer, I kept asking one more question: Why does this still fail?

That simple habit completely changed the direction of the project.

Instead of designing a memory system, I slowly discovered that every solution created a new problem. Every new problem forced me to design a new layer. Looking back now, I realize the Memory Layer wasn't built in one step. It was built through six failures.

Memory Layer (1/6)
The Question That Started Everything

The journey started with something everyone experiences.

An AI assistant remembers everything during a conversation, but the next day it behaves as if it is meeting you for the first time.

The first failure wasn't forgetting information.

The real failure was making people repeat themselves.

If someone has to explain the same preferences, goals and background again and again, the assistant never truly becomes personal.

That realization gave birth to the very first layer.

A place where important user information can live beyond one conversation.

Memory Layer (2/6)
My First Solution Failed

Once I knew memory had to exist, my first solution felt obvious.

Why not simply send every previous conversation back to the AI every time?

At first it looked perfect.

Nothing would ever be forgotten.

Then another problem appeared.

Most old conversations have nothing to do with today's question.

As conversations become larger, the assistant spends more time reading things that no longer matter.

I wasn't solving forgetting anymore.

I was creating noise.

That failure changed one important principle.

Don't send everything. Send only what matters.

Memory Layer (3/6)
Retrieval Looked Like the Answer

The next idea was much smarter.

Instead of sending every conversation, why not retrieve only the memories related to the user's question?

Now the assistant receives less information.

The response becomes faster.

The conversation stays focused.

Again, it looked perfect.

Until I imagined a simple situation.

Someone once lived in one city.

Months later they moved somewhere else.

The system successfully retrieved the old city because it matched the question.

The retrieval wasn't broken.

The answer was.

That became one of the biggest turning points.

Finding a memory doesn't mean it should be trusted.

Memory Layer (4/6)
The Decision Engine

This question stayed in my head for a long time.

If retrieving a memory isn't enough, then who decides whether it should still be used?

That single question gave birth to an entirely new layer.

Not another storage system.

Not another search system.

A decision system.

Its only job is to decide.

Should this memory be stored?

Should it replace an older one?

Should both memories exist together?

Should the user be asked for confirmation?

Should it simply be ignored?

I realized memory isn't about saving information.

Memory is about making good decisions.
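The five questions above can be sketched as a small decision function. This is only an illustration of the idea, not the actual implementation: the `Candidate` fields, the `decide` function, and the 0.7 confidence threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    STORE = "store"
    REPLACE = "replace"
    KEEP_BOTH = "keep_both"
    ASK_USER = "ask_user"
    IGNORE = "ignore"


@dataclass
class Candidate:
    key: str                    # hypothetical slot name, e.g. "home_city"
    value: str
    durable: bool               # likely to matter beyond today's conversation?
    confidence: float           # 0.0 to 1.0, how sure the extraction step is
    multi_valued: bool = False  # can several values be true at the same time?


def decide(candidate: Candidate, existing: Optional[str]) -> Decision:
    """Map one candidate memory to exactly one of the five outcomes."""
    if not candidate.durable:
        return Decision.IGNORE
    if existing is None:
        return Decision.STORE
    if existing == candidate.value:
        return Decision.IGNORE        # nothing new to record
    if candidate.multi_valued:
        return Decision.KEEP_BOTH
    if candidate.confidence < 0.7:    # arbitrary illustrative threshold
        return Decision.ASK_USER
    return Decision.REPLACE
```

The point of the sketch is that storage and search never appear in it: the function only chooses an outcome, and other layers act on that choice.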

Memory Layer (5/6)
Memory Is Alive

Even after building the Decision Engine, something still didn't feel right.

Every memory was being treated exactly the same.

But humans don't remember everything in the same way.

Some memories stay for years.

Some disappear after a day.

Some become part of our history.

Some change as life changes.

That completely changed my thinking.

Instead of storing memories forever, every memory should have its own life.

Some memories become stronger.

Some slowly become less important.

Some move into history.

Some become active again when needed.

Memory shouldn't be permanent.

Memory should evolve.
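One possible way to give each memory "its own life" is a strength score that decays while the memory is idle and recovers when it is used. The sketch below assumes exponential decay with a 30-day half-life and an archive threshold of 0.2; both numbers, and the names `Memory`, `tick`, and `touch`, are invented for illustration.

```python
from dataclasses import dataclass

HALF_LIFE_DAYS = 30.0   # assumed: strength halves after 30 idle days
ARCHIVE_BELOW = 0.2     # assumed: weaker memories move into history


@dataclass
class Memory:
    text: str
    strength: float = 1.0
    last_used_day: float = 0.0
    active: bool = True


def current_strength(memory: Memory, today: float) -> float:
    """Strength after exponential decay over the idle period."""
    idle_days = max(0.0, today - memory.last_used_day)
    return memory.strength * 0.5 ** (idle_days / HALF_LIFE_DAYS)


def tick(memory: Memory, today: float) -> None:
    """Move a memory into history once it has decayed below the threshold."""
    memory.active = current_strength(memory, today) >= ARCHIVE_BELOW


def touch(memory: Memory, today: float) -> None:
    """Using a memory reinforces it and makes it active again."""
    memory.strength = min(1.0, current_strength(memory, today) + 0.5)
    memory.last_used_day = today
    memory.active = True
```

A memory that is never used fades into history, while one that keeps being retrieved stays close, which mirrors the four behaviours listed above.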

Memory Layer (6/6)
The Final Memory Layer

After solving each failure one by one, the final picture became much clearer.

Understanding what the user means.

Deciding whether it deserves memory.

Storing it safely.

Finding it again when needed.

Checking if it is still correct.

Only then allowing it to influence the response.

I stopped thinking of memory as a database.

I started thinking of it as a living system where every part has one responsibility and every decision exists because a previous idea failed.

The final Memory Layer wasn't designed all at once.

It emerged naturally by solving one failure after another.

Why Each Layer Exists

Imagine meeting a friend after several years.

They tell you:

"I recently moved to a new city because I joined a new company."

A human brain doesn't simply remember every sentence forever.

It naturally decides what matters.

My Memory Layer follows the same idea.

1. Input Understanding Layer

Before remembering anything, the conversation is understood as a complete thought.

Instead of remembering individual sentences, it first understands what actually happened.

Real-life example

A friend says,

"I moved because of a new job."

You don't remember two separate facts.

You remember one complete event.

2. Decision Engine

Every piece of information asks a simple question.

Does this deserve memory?

If yes, how should it be stored?

If no, let it go.

Real-life example

A friend casually says,

"I'm having pizza today."

You probably won't remember it next month.

But if they say,

"I'm allergic to peanuts."

You will.

3. Memory Network

Once something is important, it needs a place to live.

Not inside today's conversation.

Somewhere that lasts.

Real-life example

You remember your friend's birthday long after today's conversation ends.

4. Active and Non-Active Memory

Not every memory needs attention every day.

Some stay close because they matter now.

Others quietly move into the background until they become useful again.

Real-life example

You don't think about your first school every morning.

But if someone asks where you studied, the memory quickly comes back.

5. Retrieval Engine

When a new question arrives, the system doesn't open every memory.

It looks for only the ones connected to the current situation.

Real-life example

If someone asks,

"Where does your friend live?"

You don't remember their favorite movie first.

You remember their city.

6. Context Validation Layer

Finding a memory isn't enough.

The system asks one final question.

Is this memory still true today?

Real-life example

You remember your friend lived in Delhi.

Then you also remember they recently moved to Bengaluru.

You naturally use the newer memory instead of the older one.
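A minimal sketch of this validation step, assuming every retrieved fact carries the date it was recorded and that the newest value for a given attribute wins. The `Fact` type, its field names, and the dates in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: str
    recorded_on: str  # ISO date "YYYY-MM-DD", so string order is date order


def keep_current(facts: Iterable[Fact]) -> List[Fact]:
    """Keep only the most recently recorded fact for each (subject, attribute)."""
    latest: Dict[Tuple[str, str], Fact] = {}
    for fact in facts:
        slot = (fact.subject, fact.attribute)
        if slot not in latest or fact.recorded_on > latest[slot].recorded_on:
            latest[slot] = fact
    return list(latest.values())
```

Given a Delhi fact recorded on 2025-01-10 and a Bengaluru fact recorded on 2026-03-02 for the same friend, only the Bengaluru fact survives. Attributes that can legitimately hold several values at once would need a different rule.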

7. Language Model

Only after all these steps does the assistant prepare its response.

Instead of relying on every past conversation, it receives only memories that are relevant, current, and verified.

Real-life example

A good friend doesn't answer by repeating everything they know about you.

They answer using only what matters in that moment.

ML Mindset

Venu gopal varma Bhupathiraju — Fri, 03 Jul 2026 18:05:29 +0000

the pragmatic ml roadmap: from business case to production rollback
Most machine learning projects fail before writing a single line of code. They fail because the team starts with the technology instead of the business problem.

Before training a model, you need to define the financial or operational goal. If the model improves prediction accuracy by five percent, what does that mean for the bottom line?

If you cannot calculate this value, you should not build the model. A high accuracy score on a validation set does not pay the bills. You need to connect the model output directly to a business decision.

starting with business value
Every machine learning project should begin with a business hypothesis. You need to know what decision the model will automate or improve.

If you are building a recommendation engine, the goal is not to improve the precision score. The goal is to increase user engagement or sales. You must establish a baseline using existing historical data before writing code. This baseline helps you estimate whether a model is worth the investment.

model selection: the case for simplicity
When you start building, choose the simplest model first. Do not start with a deep neural network or a large language model.

Start with a simple heuristic or a linear regression. This simple baseline gives you something to measure your progress against. It also helps you understand the data.

You should only increase model complexity when a simple model cannot meet the business requirements. Complex models require more compute and more debugging time. They also require more training data. The extra performance must justify these costs.

explainable predictions over black boxes
A model that cannot be explained is a business liability. If a model rejects a loan application or flags a transaction as fraud, you must be able to explain why.

Using interpretable features helps build trust with users and auditors. It also helps developers debug the system when predictions go wrong.

You can use simple models like decision trees to keep the system explainable. If you must use a complex model, use tools like SHAP or LIME to explain the predictions. If you cannot explain the model decisions, the model is too risky to deploy.

evaluation: testing for failure
Standard model evaluation is often misleading. Training and testing on historical data does not guarantee success when the model meets the real world.

You must evaluate your model on future data. This step reveals how the model handles data drift and changes in user behavior.
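One common way to approximate "future data" is a time-based split: train on everything before a cutoff date and evaluate only on what came after. The sketch below assumes each row is a dictionary with a `day` field; that field name and the `time_split` helper are illustrative.

```python
from datetime import date
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def time_split(rows: List[Row], cutoff: date) -> Tuple[List[Row], List[Row]]:
    """Train on rows strictly before the cutoff, test on rows from the cutoff on."""
    train = [row for row in rows if row["day"] < cutoff]
    test = [row for row in rows if row["day"] >= cutoff]
    return train, test
```

Unlike a random split, no test row is older than any training row, so the evaluation cannot quietly benefit from information that would not have been available at prediction time.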

Evaluating a model is not just about computing average error. You need to identify edge cases and failure modes. What happens when the input data is corrupt? What happens when a user inputs unexpected values?

Finally, you must justify the development cost. Compare the cost of training and maintaining the model against the business value it creates. If the maintenance cost is higher than the value, the model is a failure.

production: the operational safety net
Writing model code is ten percent of the work. The rest is building the operational safety net to keep the system running.

You must containerize your model. This makes the environment predictable and easy to deploy across different servers.

You must also set up versioning and rollback procedures. If a new model version begins to fail, you need to revert to the previous version in seconds.
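As a rough illustration of why rollback can take seconds, the sketch below keeps every registered version in memory and treats "live" as a pointer that can move back. `ModelRegistry` is a hypothetical in-process class, not a real library; in practice the same idea is usually implemented with a model registry service or with container image tags.

```python
from typing import Any, Callable, Dict, List, Optional


class ModelRegistry:
    """Keeps all registered versions and a history of which one was live."""

    def __init__(self) -> None:
        self._versions: Dict[str, Callable[[Any], Any]] = {}
        self._history: List[str] = []  # promoted versions, newest last

    def register(self, version: str, model: Callable[[Any], Any]) -> None:
        self._versions[version] = model

    def promote(self, version: str) -> None:
        if version not in self._versions:
            raise KeyError(f"unknown version: {version}")
        self._history.append(version)

    def rollback(self) -> str:
        """Drop the current live version and return to the previous one."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def live(self) -> Optional[str]:
        return self._history[-1] if self._history else None

    def predict(self, features: Any) -> Any:
        if self.live is None:
            raise RuntimeError("no model has been promoted yet")
        return self._versions[self.live](features)
```

Because the old version is never deleted on promotion, reverting is a pointer change rather than a retrain or a rebuild.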

Monitoring is essential. You need to track both system metrics like latency and machine learning metrics like feature drift.
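Feature drift can be tracked with many statistics. One option is the population stability index (PSI), which compares how a feature is distributed across bins in training data versus live traffic. The sketch below is a plain-Python version under stated assumptions: the bin edges are chosen by you, and the `1e-6` floor is only there to avoid taking the logarithm of zero.

```python
import math
from typing import List, Sequence


def _bin_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values in each bin defined by sorted edges."""
    if not values:
        raise ValueError("need at least one value")
    counts = [0] * (len(edges) + 1)
    for value in values:
        index = sum(value >= edge for edge in edges)  # edges at or below value
        counts[index] += 1
    return [max(count / len(values), 1e-6) for count in counts]


def psi(reference: Sequence[float], live: Sequence[float],
        edges: Sequence[float]) -> float:
    """Population stability index between a reference sample and a live sample."""
    ref = _bin_shares(reference, edges)
    cur = _bin_shares(live, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Identical distributions give a PSI of zero and the value grows as live data moves away from the reference. The alert threshold is a team decision and belongs in the monitoring playbook.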

A production system is incomplete without documentation. You need a clear README file and API documentation. You also need a deployment guide and a monitoring playbook. The monitoring playbook should explain exactly what to do when an alert triggers. This documentation allows the engineering team to manage the model without constantly relying on the data science team.

Why do we import 100MB of frameworks to run a 50-line LLM reasoning loop?

Venu gopal varma Bhupathiraju — Thu, 25 Jun 2026 15:31:08 +0000

Stop Importing Bloated Frameworks: Build a Python AI Agent from Scratch

You want to build an AI agent.

So you head to the docs of a popular orchestration framework, copy the boilerplate, import 20 modules, and spin up an agent. It works—until it doesn't.

Suddenly, you're looking at a 50-line stack trace originating from a library wrapper. You don't know where the query failed, what the exact prompt was, or why the tool call failed to parse.

Here is the truth: You don't need AutoGen, LangChain, or CrewAI to build a working AI agent.

You just need vanilla Python and a basic understanding of the three core pillars of agentic design.

The Three Pillars of an AI Agent

Any basic agent can be broken down into three simple components:

The State (Memory): A list of message dictionaries (role and content) passed to and from the LLM.
The Schema (Tools): A dictionary mapping tool names to standard Python functions.
The Loop (Reasoning): A standard while loop that calls the LLM, checks if it wants to use a tool, runs the tool if requested, appends the result to the State, and repeats until the LLM returns a final answer.

Let’s build one.

Coding the Agent (under 60 lines of Python)

This example uses the official openai SDK, but the same logic applies to Anthropic, Gemini, or local models running via Ollama.


```python
import os
import json
from openai import OpenAI

# Initialize client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# 1. Define the tools our agent can use
def get_weather(location):
    if "tokyo" in location.lower():
        return "Tokyo is sunny and 25°C."
    return "Cool and rainy, 15°C."

# Map the function name to the actual function object
tools_map = {
    "get_weather": get_weather
}

# Define the JSON schema so the LLM knows how to call it
tool_definition = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            },
            "required": ["location"]
        }
    }
}

# 2. The Agent reasoning loop
def run_agent(user_prompt):
    # Initialize the State (Memory)
    messages = [
        {"role": "system", "content": "You are a helpful assistant. Call tools when necessary."},
        {"role": "user", "content": user_prompt}
    ]

    # Run the loop (max 5 turns to prevent infinite runs)
    for _ in range(5):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=[tool_definition]
        )

        message = response.choices[0].message
        messages.append(message)

        # Check if the model wants to call a tool
        if message.tool_calls:
            for tool_call in message.tool_calls:
                name = tool_call.function.name
                args = json.loads(tool_call.function.arguments)

                print(f"[*] Calling tool: {name} with args: {args}")
                tool_output = tools_map[name](**args)

                # Append tool response back to state
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "name": name,
                    "content": tool_output
                })
        else:
            # If no tool was called, this is the final answer
            return message.content

    # The turn limit was hit before the model produced a final answer
    return "Stopped: reached the turn limit without a final answer."

# Run it
if __name__ == "__main__":
    result = run_agent("What is the weather like in Tokyo right now?")
    print(f"\n[Agent Response]: {result}")
```

Token economics: from model internals to agent costs

Venu gopal varma Bhupathiraju — Wed, 24 Jun 2026 17:25:35 +0000

LLM debates usually focus on parameter counts and benchmark scores. But in production, a much simpler constraint dictates performance and cost: tokens.
Every prompt, system message, tool call, and history block consumes tokens. If you do not manage them, you get a system that works in testing but is too expensive to run.
Here is how tokens behave under the hood, why they cost so much, and how to reduce consumption.

The mechanics: how models process text
To do anything with text, a model has to convert it to numbers. This is tokenization.
Most modern models use Byte Pair Encoding (BPE). It merges common character pairs into subwords. Common words get their own tokens; rare words get chopped up.
The problem is that BPE is semantically blind. It merges based on how often adjacent symbols appear together, not on meaning. A word like "intelligently" might split into "intelligent" and "ly," or into an odd set of characters depending on what the tokenizer was trained on.
New methods like Attention Guided BPE try to fix this by adding semantic rules to the merges, which keeps word boundaries intact and makes the vocabulary more efficient.
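To make the frequency-driven merge concrete, here is a sketch of a single plain BPE training step (not the attention-guided variant). The toy corpus and its counts are made up for the example; real tokenizers repeat this step thousands of times over byte-level data.

```python
from collections import Counter
from typing import Dict, Tuple

Word = Tuple[str, ...]


def most_frequent_pair(words: Dict[Word, int]) -> Tuple[str, str]:
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words: Dict[Word, int], pair: Tuple[str, str]) -> Dict[Word, int]:
    """Rewrite every word with the chosen pair fused into one symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus: each word split into characters, with a made-up frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
best = most_frequent_pair(corpus)
corpus_after = merge_pair(corpus, best)
```

On this toy corpus the pair chosen first is `("e", "s")`, simply because it is one of the most common pairs. Nothing in the procedure knows that "est" is a suffix, which is the semantic blindness described above.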

Once a token has an ID, the model maps it to a dense vector (an embedding) that represents its meaning. Words with similar usage, like "king" and "queen," end up near each other in vector space. Transformers adjust these embeddings during training to capture context.

Inside the model, processing happens in parallel. For each token, the transformer calculates query, key, and value vectors. The query represents what the token wants, the key represents what it offers, and the value holds the actual content. Attention scores weight how much each token should focus on every other token. Because this parallel work loses track of sequence, positional encodings are added to preserve the order of words.

Even though models process subwords, they build full words internally. Research into "detokenization" shows that early and middle layers group subwords back into whole concepts. If you pass a split word like "un" + "h" + "appiness," you can decode the full word "unhappiness" from the representation in the later layers. This suggests models hold a latent lexicon. You can exploit this to expand a vocabulary without retraining by fusing multi token representations.

The math: why the bills spike
Model APIs charge per million input and output tokens. That looks cheap, but agentic loops compound the cost.
A 10,000 token system prompt sent on every turn of a 50 turn session consumes half a million tokens before the model generates a single word of output.
When agents run in a loop, every step is a new API call. If a five step task runs 500 times, you are paying for the history over and over. One fintech startup ended up spending $4,100 a month on Q&A inference. As traffic grew, the projected bill hit $13,000.
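The arithmetic behind the half-million figure can be reproduced in a few lines. The sketch assumes the system prompt and the full history are resent on every turn; `tokens_per_turn` and the price passed to `cost_usd` are placeholders, not real rates.

```python
def input_tokens(system_prompt_tokens: int, turns: int,
                 tokens_per_turn: int = 0) -> int:
    """Total input tokens when the prompt and all prior turns are resent each turn."""
    total = 0
    history = 0
    for _ in range(turns):
        total += system_prompt_tokens + history
        history += tokens_per_turn  # this turn becomes history for the next one
    return total


def cost_usd(tokens: int, usd_per_million: float) -> float:
    """Convert a token count into dollars at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million


prompt_only = input_tokens(10_000, 50)          # system prompt alone
with_history = input_tokens(10_000, 50, 200)    # plus 200 tokens added per turn
```

`prompt_only` comes to 500,000 tokens, matching the figure above. Adding just 200 tokens of history per turn pushes the total to 745,000, because history grows with every step and is paid for again each time.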

Ten ways to cut token usage

Prompt caching
Resending the same system prompt, tool schemas, or documentation on every turn is wasteful. Caching those blocks on Anthropic or OpenAI saves about 90% on input costs.
You just have to structure the prompt so the stable blocks sit at the very beginning. In a test of 500 agent sessions, caching cut API bills by 40% to 80% and sped up time to first token by up to 30%.
Model routing
You do not need Claude Opus or GPT-4 for simple jobs. A router can inspect incoming queries and send easy tasks to cheaper models.
The cost gap is huge: Claude Haiku is $1 per million input tokens, while Opus is $15.
In that fintech setup, 61% of queries were simple tasks like database lookups or text formatting. Routing those to smaller models cut overall costs by 57%.
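A router can start out as a few lines of heuristics. In the sketch below the keyword list, the word limit, and the model labels `small-model` and `large-model` are assumptions; production routers often use a small classifier instead. The second function checks the savings arithmetic using the two prices quoted above, assuming all traffic previously went to the expensive model and queries are of similar size.

```python
SIMPLE_HINTS = ("look up", "lookup", "format", "convert", "status of")


def route(query: str, max_simple_words: int = 40) -> str:
    """Send short, lookup-style queries to the cheaper model."""
    text = query.lower()
    is_short = len(text.split()) <= max_simple_words
    looks_simple = any(hint in text for hint in SIMPLE_HINTS)
    return "small-model" if is_short and looks_simple else "large-model"


def blended_saving(share_simple: float, cheap_price: float,
                   expensive_price: float) -> float:
    """Fractional saving versus sending every query to the expensive model."""
    blended = share_simple * cheap_price + (1 - share_simple) * expensive_price
    return 1 - blended / expensive_price


saving = blended_saving(0.61, 1.0, 15.0)
```

With 61% of traffic at $1 and the rest at $15 per million input tokens, the blended price is $6.46, a saving of roughly 57% on input tokens, which is in line with the reported figure.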
Semantic caching
Many users ask the same questions using different words. "What is the KYC rule?" and "Explain the know-your-customer process" want the same answer.
A semantic cache embeds incoming questions and checks them against previous ones. If a new query is close enough to an old one, the system returns the cached answer instantly, bypassing the model entirely.
In the fintech system, the cache hit rate hit 34% within two weeks. That is a third of all queries answered for free.
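The lookup logic of a semantic cache is small. The sketch below uses a toy bag-of-words embedding so that it runs without any external service. That toy embedding only captures word overlap, so it would not match "KYC" with "know-your-customer"; a real system would pass an embedding model in through the `embed` parameter. The class name and the 0.8 threshold are assumptions.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts. Replace with a real embedding model."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Counter] = toy_embed,
                 threshold: float = 0.8) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        """Return the cached answer of the closest stored query, if close enough."""
        vector = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(vector, e[0]), default=None)
        if best is not None and cosine(vector, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: the caller falls through to the model
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, set it too high and the hit rate collapses.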
Compacting history
Sending the entire conversation history on every turn is a bad default. By turn twelve, you are paying for turns one through eleven again.
Instead, try these approaches:
Keep only the last few turns (sliding window).
Summarize older turns using a cheap model.
Save facts in a database and only inject the specific rows you need.
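The first two approaches can be combined in one helper: keep the system prompt, keep the last few turns, and replace everything older with a single summary message. In this sketch the `summarize` callback stands in for a call to a cheap model and is an assumption; without it, a placeholder note is inserted instead.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


def compact(messages: List[Message], keep_last: int = 4,
            summarize: Optional[Callable[[List[Message]], str]] = None
            ) -> List[Message]:
    """Keep system messages and the last `keep_last` turns; summarize the rest."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    old, recent = rest[:-keep_last], rest[-keep_last:]
    if not old:
        return system + recent
    if summarize is not None:
        summary = summarize(old)
    else:
        summary = f"[{len(old)} earlier messages omitted]"
    note = {"role": "system", "content": "Summary of earlier turns: " + summary}
    return system + [note] + recent
```

The prompt size now stays roughly constant as the session grows, instead of growing with every turn.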
Prompt compression
You can trim prompts before they reach the API. LLMLingua removes low information tokens while preserving the meaning. On fintech queries, it compressed prompts by 38% with almost no loss in quality.
For outputs, techniques like CROP penalize output length during prompt optimization. This cut output tokens by 80% in tests without hurting accuracy.
Auditing tool schemas
Tool definitions are just tokens. Descriptions are often bloated. One team audited 11 tools and found descriptions averaged 180 tokens. Shortening them saved 1,400 tokens on every turn.
Loop caps and budget guards
Runaway agent loops are the fastest way to blow a budget. An agent retrying a broken tool 14 times is worse than 200 normal runs.
Set a hard limit on loops, use exponential backoff, and show a clear error to the user when things fail.
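A retry wrapper with a hard cap and exponential backoff might look like the sketch below. The function name, the default of three attempts, and the half-second base delay are illustrative choices. The `sleep` parameter exists so the wait can be swapped out in tests.

```python
import time
from typing import Any, Callable


def call_with_retries(tool: Callable[[], Any], max_attempts: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call `tool`, retrying with exponential backoff up to a hard attempt cap."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return tool()
        except Exception as exc:  # broad on purpose: any tool failure counts
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    # Surface a clear error instead of silently looping forever.
    raise RuntimeError(f"Tool failed after {max_attempts} attempts: {last_error}")
```

When the cap is reached, the caller receives one explicit error it can show to the user, rather than a fourteenth attempt at the same broken tool.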
Short-circuiting outputs
If your system parses the model's output programmatically, you do not need to wait for the full response. Stream the tokens and abort the call as soon as you have the fields you need. This saved 22% of output tokens in tests.
Self hosting simple tasks
Small tasks like classification, embedding generation, and reranking run fine on open models. You can host these yourself for a flat monthly fee instead of paying API rates.
Optimizing structured data
JSON is heavy on syntax: braces, quotes, and repeated keys. Using simpler formats for input context can shrink the token count.

The results
Here is how one team got a 73% cost reduction:

Prompt caching: 31%
Model routing: 19%
Self hosting simple tasks: 11%
History compaction: 6%
Auditing tool schemas: 3%
Loop caps: 2%
Output short-circuiting: 1%

Another team dropped their bill from $4,100 to $1,560:

Prompt compression saved $162 a day.
Model routing handled 61% of tasks.
Semantic caching answered 34% of queries.
Using GraphRAG also cut token usage by 80% compared to standard RAG.

Where to start
If you are optimizing an LLM system from scratch:

Log everything. You need per query token counts and cache hit rates to know what to fix.
Enable prompt caching. It is the easiest change and the biggest win.
Set up model routing. Keep the expensive model for hard reasoning.
Compact history. Stop resending the entire chat.
Set loop caps. Protect against runaway budgets.

Token optimization is not a single fix; it is a set of small adjustments. Five small wins are easier to build and ship than one massive change, and they add up quickly. Measure where the waste is and start there.

7 AI Models Are Quietly Running Your Workflow. Do You Know Which One Should Be?

Venu gopal varma Bhupathiraju — Sat, 20 Jun 2026 06:39:14 +0000

Here's an uncomfortable truth: most developers pick one AI model, fall in love with it, and then use it for everything — debugging, writing docs, brainstorming, research, even creative work. That's like using a hammer to turn a screw. It works, technically. It's just not optimal.
AI isn't one tool anymore. It's an entire toolbox, and each model in it was built with different tradeoffs in mind — speed vs. depth, openness vs. polish, real-time data vs. careful reasoning. Knowing which tool fits which job is quickly becoming as important a skill as knowing how to prompt one in the first place.
So let's break down the major players shaping how work gets done in 2026 — not as a popularity ranking, but as a field guide for when to reach for what.
The landscape, model by model
ChatGPT (OpenAI)
The generalist. Strong for writing, research, day-to-day guidance, and rapid prototyping of ideas. If you need a flexible all-rounder and don't want to think too hard about which tool to open, this is usually the default.
Claude (Anthropic)
Built with a heavy emphasis on safety, nuanced reasoning, and handling long, complex context without losing the thread. Developers tend to reach for it on coding tasks that involve large codebases, multi-step reasoning, or anything where you need the model to stay precise over a long conversation.
Gemini (Google DeepMind)
Less a standalone chatbot, more an ambient layer across the tools you already use — Search, Docs, YouTube. Its strength is integration: AI assistance baked directly into the workflow you're already in, instead of a separate tab you have to context-switch to.
DeepSeek
An efficient, open model that punches above its weight on reasoning and logic-heavy tasks. A favorite for teams that want strong performance without the overhead (or cost) of closed, proprietary systems.
Mixtral (Mistral AI)
A mixture-of-experts architecture built for speed and scale. It's less about raw creative flair and more about throughput — good for applications that need to serve a lot of requests, fast.
Llama (Meta)
Open-source and built for tinkering. If you want to fine-tune, self-host, or build research on top of a model rather than just consume it through an API, Llama's openness is the draw.
Grok (xAI)
Plugged directly into real-time social signal. Where other models reason over static training data, Grok leans into "what's happening right now" — useful for trend-aware or fast-moving contexts.
The real skill isn't picking a favorite — it's knowing when to switch
None of these models are strictly "better." They're optimized for different jobs:

Long, complex reasoning or large codebases → Claude
Fast, general-purpose writing and brainstorming → ChatGPT
Work that lives inside Google's ecosystem → Gemini
Cost-efficient reasoning at scale → DeepSeek
High-throughput applications → Mixtral
Full control, fine-tuning, self-hosting → Llama
Real-time, trend-aware context → Grok

AI is moving fast, and the developers who get the most out of it aren't the ones who memorized one model's quirks — they're the ones who treat these tools like a toolbox and match the model to the moment.
Over to you
Which model is doing the heavy lifting in your stack right now, and where do you think it's falling short? Drop it in the comments. I'm always curious whether people's real-world usage matches the "official" strengths of each model.

Agentic RAG Isn't Just Fancy Autocomplete. It's a Whole New Infrastructure Problem.

Venu gopal varma Bhupathiraju — Fri, 19 Jun 2026 02:53:40 +0000

We've all read the headlines. "Agentic RAG is the next big thing." "AI systems that think for themselves." It sounds like magic.

But let’s be honest: have you actually tried to build one?

I’ve spent the last few weeks in the trenches with this stuff, going from a simple RAG prototype to trying to build a genuinely "agentic" system. And I can tell you, the reality is a lot more humbling than the hype suggests.

Most of the conversations around Agentic RAG feel like a bait-and-switch. One minute you're reading a blog post that says it's just RAG with "extra steps" like booking a flight or drafting a post. The next, you're looking at a tangled mess of agent loops and scratching your head, trying to figure out why it hallucinated your customer's invoice. The leap from a "smart librarian" to a "personal project manager" is an infrastructure nightmare.

The core insight from the cohort material is simple: RAG gives an LLM memory, but agents give it hands. That's the killer feature. An Agentic RAG system isn't just fetching documents; it's looking at your question, deciding which of multiple data sources to query, writing that query, retrieving the results, and then doing something with that information. This is an "observe-think-act" loop that keeps running until the task is complete.

This is where things get interesting for a developer. It's no longer about just writing a prompt. It's about building a state machine.

I decided to test this out. I wanted a system that could take a vague question like, "What's the status of invoice inv_8891?" and do something useful with it, like check the customer's history and then draft an email.

My mental model shifted from "one-and-done" to a multi-turn loop:

Observe: The system receives the user's query.

Think: The LLM (the brain) analyzes the query and its available tools. It sees a tool called get_customer and another called get_invoice.

Act: The system triggers the first tool call to get the customer ID.

Observe: The tool returns the customer's data and any related invoice IDs.

Think: The LLM determines it has the right invoice ID and calls the get_invoice tool.

Act: The invoice is retrieved.

Think: The LLM checks a knowledge base for the refund policy.

Act: It drafts a response and sends it back.
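The loop above can be sketched as a tiny state machine. To keep it runnable without an API key, the "think" step here is a scripted planner standing in for the LLM, and both tools return hard-coded, made-up data. Every name in the block is hypothetical apart from `get_customer` and `get_invoice`, and the refund-policy lookup is left out for brevity.

```python
from typing import Any, Dict, Tuple


def get_customer(invoice_ref: str) -> Dict[str, Any]:
    # Made-up data: a real tool would query a CRM or database.
    return {"customer_id": "cust_42", "name": "Example Customer",
            "invoice_ids": [invoice_ref]}


def get_invoice(invoice_id: str) -> Dict[str, Any]:
    return {"invoice_id": invoice_id, "status": "overdue"}


TOOLS = {"get_customer": get_customer, "get_invoice": get_invoice}


def scripted_planner(state: Dict[str, Any]) -> Tuple[str, str]:
    """Stand-in for the LLM's 'think' step: pick the next action from the state."""
    if "get_customer" not in state:
        return "get_customer", state["invoice_ref"]
    if "get_invoice" not in state:
        return "get_invoice", state["get_customer"]["invoice_ids"][0]
    invoice, customer = state["get_invoice"], state["get_customer"]
    return "finish", (f"Invoice {invoice['invoice_id']} for "
                      f"{customer['name']} is {invoice['status']}.")


def run(invoice_ref: str, max_turns: int = 5) -> str:
    state: Dict[str, Any] = {"invoice_ref": invoice_ref}
    for _ in range(max_turns):                 # hard cap on turns
        action, arg = scripted_planner(state)  # think
        if action == "finish":
            return arg
        state[action] = TOOLS[action](arg)     # act, then observe the result
    return "Stopped: turn limit reached."
```

Calling `run("inv_8891")` walks through both tools and returns a one-line status. With `max_turns=1` it stops early instead of looping, which is the boundary the infinite-loop problem calls for.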

This is a world away from a standard RAG pipeline. In LangChain, for instance, this process is managed by a graph, where each "turn" either returns a final answer or calls a tool. Each iteration chews up tokens and time.

The dirty secret I discovered is that building this isn't just about stringing API calls together. You run into real system design headaches:

Tool Routing: How does the agent know which of the 10 databases or APIs to query first? In a simple RAG setup, the answer is pre-configured. In an Agentic system, the LLM has to decide this on the fly. This "smart routing" is where a ton of complexity hides.

The Infinite Loop: Without careful boundaries, your agent can get stuck. It'll call a tool, get a result, think it needs more info, call another tool, and never actually return a final answer. You need to set hard limits on how many "thinking" steps (or "turns") it can take.

Latency: This "observe-think-act" loop is not fast. Each loop requires a round trip to the LLM and back. A simple question that takes 2 seconds in a standard RAG setup can take 15-20 seconds in an Agentic system. The user experience suffers.

The takeaway here is one of the "bitter lessons" from the course: a simpler architecture (like a standard RAG pipeline) using a more powerful LLM will often outperform a complex Agentic system, especially for simple tasks. You don't build an Agentic RAG system because it's cool. You build it because you have a problem that requires multi-step reasoning and tool use.

So, if you're jumping into this world, don't think you're just building a smarter chatbot. You are building a distributed system. You are building an orchestrator. You're now a systems engineer for an AI that has a mind of its own. And that is a whole new kind of fun.