DEV Community: Upayan Ghosh

From Tokens to Attention: My First Real Mental Model of LLMs

Upayan Ghosh — Sat, 23 May 2026 22:03:24 +0000

NOTE - I intentionally simplified the vector mathematics concept here to keep things simple for a greater audience.

I wanted to learn LLMs properly.

Not just use an API. Not just call generate() from a library and pretend I understood what happened underneath. I wanted to know how a model takes plain text, turns it into numbers, reads context, and predicts the next word.

The end goal is ambitious: build a mini LLM from scratch.

But before touching PyTorch, I realized I needed the mental model first. Code without understanding becomes copy-paste gymnastics. So I started with the real basics.

Text Is Not Text to a Model

A language model does not read words the way we do.

If I write:

hello

The model cannot directly understand those letters. First, the text has to be converted into numbers. That conversion is called tokenization.

The simplest version is character-level tokenization.

If our tiny training text is:

hello world

We collect every unique character and assign each one an ID.

For example:

d -> 0
e -> 1
h -> 2
l -> 3
o -> 4
r -> 5
w -> 6
space -> 7

Now the word:

led

becomes:

[3, 1, 0]

That was the first click for me. Tokenization is basically a dictionary lookup. Text goes in, integer IDs come out.

And decoding is just the reverse. If encoding turns "he" into [2, 1], decoding turns [2, 1] back into "he".

So the first pipeline looks like this:

Raw text -> tokenizer -> token IDs

But token IDs alone are not enough.

Why Token IDs Are Not Meaning

If h = 2 and l = 3, that does not mean l is somehow "greater" than h.

Those numbers are just labels. The model needs a richer representation.

That is where embeddings come in.

An embedding turns each token ID into a vector, which is just a list of numbers.

Instead of:

cat -> 12

We get something more like:

cat -> [0.9, 0.2, 0.5, ...]

A useful way to imagine this is a hidden map of meaning.

Words used in similar ways slowly move closer together during training. So after enough learning, words like:

king and queen
apple and orange
cat and dog

end up close in vector space.

The model does not start with this knowledge. The embeddings begin as random numbers. Training adjusts them until useful patterns emerge.

That was the second click: embeddings are not manually written meanings. They are learned coordinates.

The Order Problem

Once tokens become embeddings, there is still a huge problem.

Transformers process tokens in parallel. That is powerful, but it means the model does not automatically know word order.

These two sentences contain the same words:

The cat ate the mouse.
The mouse ate the cat.

But they mean completely different things.

So we need to inject position information.

The obvious idea is to use indexes:

The -> position 0
cat -> position 1
ate -> position 2

At first, this sounds perfect. Arrays already have indexes, so why not just use them?

The issue is scale.

If we directly add raw indexes, later positions can become huge. A word at position 1999 gets a massive position number compared to the small values inside its embedding. The position can overpower the meaning.

A normalized index also causes trouble.

For a 3-word sentence:

index 2 / length 3 = 0.666

For a 100-word sentence:

index 2 / length 100 = 0.02

Same index. Completely different value.

That means the model has to deal with position values that shift depending on sentence length.

Sine and cosine positional encodings solve this in a neat way.

A sine wave always stays between -1 and 1, so the values never explode. Also, the position value depends on the index and a fixed rhythm, not on the total sentence length.

For example:

sin(index * frequency)

If index = 2 and frequency = 0.5:

sin(2 * 0.5) = sin(1) = 0.841

That value stays the same whether the sentence has 3 words or 100 words.

Real transformers use many sine and cosine waves with different frequencies. Fast waves capture small position changes. Slow waves help distinguish positions farther apart.

That was the third click: positional encoding gives each token a position signature without depending on sentence length.

Attention Is Context Mixing

Now comes the heart of the transformer: attention.

The easiest sentence to understand attention is:

The bank of the river was muddy.

The word bank is ambiguous. It could mean a financial institution, or it could mean the edge of a river.

Attention lets bank look at surrounding words and decide which ones matter.

In this sentence, river is important. So the representation of bank gets pulled toward the water and geography meaning.

The mechanism uses three vectors for every token:

Query: What am I looking for?
Key: What information do I contain?
Value: What information do I pass along?

A simple analogy:

Query = search question
Key = article title
Value = article content

If the query for bank matches the key for river, then bank receives a strong contribution from the value of river.

Mathematically, attention does this:

score = Query dot Key
weights = softmax(scores)
output = weights * Values

The softmax step turns scores into percentages.

So if bank gives 80 percent attention to river, it absorbs a large part of river's value vector.

That was the fourth click: attention is not magic. It is weighted context blending.

Why Masked Attention Exists

For GPT-style models, there is one more important rule.

They predict the next token.

If the training sentence is:

The cat sat on the mat

and the model is learning from:

The cat sat

it should predict:

on

But during training, the full sentence already exists in memory.

So without a mask, the model could cheat. The token sat could look ahead and see on.

That would be like taking an exam with the answer sheet open.

Masked attention prevents this. Each token can only look at itself and previous tokens.

So in:

The cat sat on the mat

when processing sat, the model can attend to:

The, cat, sat

It cannot attend to:

on, the, mat

During real conversation, future tokens do not exist yet. The model generates one token at a time. But during training, future tokens do exist, so we hide them.

That was the fifth click: masking makes training behave like real generation.

Prediction Is a Probability Game

After all the transformer layers finish processing the context, the model predicts the next token.

If the prompt is:

The cat sat

The model does not think there is only one possible answer.

It produces probabilities:

on        35%
down      30%
suddenly  15%
furiously 5%
magically 1%

Then decoding settings decide how to choose.

At low temperature, the model picks the most likely token.

At higher temperature, it samples more creatively from the probability distribution.

So if the prompt changes to:

The wizard waved his wand and the cat sat

then magically becomes much more likely because the earlier context changed the probability distribution.

That was the sixth click: generation is not fixed autocomplete. It is probability shaped by context.

Personality Should Not Be Static

Upayan Ghosh — Sat, 23 May 2026 20:34:50 +0000

Most AI assistants treat personality like a costume: write a clever system prompt, pick a tone, and hope it stays useful. A real personal AI needs something deeper. It needs behavioural continuity that can evolve without letting the model rewrite itself into chaos.

Most AI personality is fake in a very specific way.

It is usually a paragraph.

You tell the model:

Be helpful, concise, warm, technically accurate, and a little witty.

That works for a few turns.

Then reality shows up.

The user is tired one day and wants shorter answers. The next day they are debugging something hard and need detail. They mix languages. They develop habits. They correct the assistant. They praise a certain style. They hate a certain phrase. They have private context, work context, emotional context, project context, and repeated patterns that do not fit cleanly into "custom instructions."

A static prompt cannot keep up with that.

It can imitate a personality, but it cannot learn a relationship.

That is the problem a serious personal AI architecture has to solve.

The interesting idea is not "give the AI a fun personality."

The interesting idea is this:

Personality should be a living system around the model, not a frozen prompt inside the model.

Memory Is Not Enough

AI memory usually starts with facts.

The user prefers short answers during deep work.

The user often switches between planning, debugging, and writing.

The user prefers concise answers.

The user likes technical depth.

That is useful, but it is only one layer of personalization. A list of facts does not tell the assistant how the user thinks, how they correct things, when they want speed, when they need care, what style makes them feel understood, or what kinds of responses repeatedly fail.

Facts answer:

What should the AI remember?

Behavioral continuity answers:

How should the AI relate to this person over time?

That difference matters.

A chatbot with memory can say, "I remember your project."

A personal AI with behavioral continuity can say the right amount, in the right style, with the right level of certainty, through the right model, while respecting what should stay private.

That is a harder product. It is also the more useful one.

The Static Prompt Trap

The most common version of assistant personality is a giant system prompt.

At first, this feels powerful. You can define tone, rules, examples, formatting preferences, safety boundaries, and role identity in one place.

Then the prompt grows.

Every correction becomes another line.

Every preference becomes another bullet.

Every edge case becomes another paragraph.

Eventually the prompt becomes a junk drawer. Some instructions are stale. Some contradict each other. Some are too vague. Some should have been structured data. Some should have been tests. Some should have been tool policy. Some should have been memory.

Worst of all, a static prompt cannot tell the difference between a permanent preference and a temporary mood.

"Keep it short" might mean:

I am busy right now.

Or it might mean:

In general, this user hates long answers.

Those are different signals. Treating both as one more sentence in a prompt is sloppy engineering.

If you are building a serious personal AI, the better question is:

Which parts of personality should be configured, which parts should be learned, and which parts should be protected from automatic edits?

The cleaner answer is to make personality layered.

How a Layered Personality System Works

In the reference architecture I studied, personality is treated as a subsystem.

The profile manager stores a set of current JSON layers:

core_identity
linguistic
emotional_state
domain
interaction
vocabulary
exemplars
meta

That structure matters because each layer changes at a different speed.

Core identity is intentionally protected. In ProfileManager.save_layer(), programmatic writes to core_identity are blocked. That means the assistant cannot casually rewrite its base identity just because a conversation drifted.

Emotional state is volatile. It can change quickly based on recent mood signals.

Linguistic style evolves more slowly. It tracks things like language mix, average message length, emoji frequency, and drift over recent batches.

Interaction patterns track active hours, response length preferences, routines, and correction rules.

Domain context tracks active interests, important projects, people, and stable identity notes.

Vocabulary keeps the user's current language alive.

Exemplars keep selected interaction pairs that show the assistant how to respond in a familiar style.

Meta tracks profile versioning and batch history.

This is the key architectural move: personality is not one blob. It is a set of typed layers with different update rules.

That makes the system easier to reason about.

It also makes it safer.

Realtime Learning Handles the Next Reply

Some signals should affect the assistant quickly.

If the user says:

too long

waiting for a nightly job is annoying. The next answer should already be shorter.

The architecture handles this through realtime processing and implicit feedback detection.

The realtime processor runs on each message. It estimates language, sentiment, and mood. It can hot update the emotional profile layer without blocking the chat pipeline.

The feedback detector looks for natural corrections like:

stop being formal
keep it short
too casual
explain more

When it detects one, it updates the relevant profile layer. A length correction changes the interaction layer. A tone correction changes the linguistic style. Praise reinforces the current style.

This is small, but important.

The assistant is not waiting for the user to open settings.

It is treating conversational feedback as product input.

That is how a personal AI starts to feel less like a chatbot and more like software that adapts.

Batch Learning Handles the Pattern

Realtime updates are useful, but they are also noisy.

A user can be frustrated for one turn. That does not mean the assistant should permanently become somber.

A user can ask for detail once. That does not mean every future answer should become a technical essay.

That is why the same architecture also needs a batch processor.

The batch processor periodically analyzes accumulated messages. In the code comments, its triggers are every 50 new user messages, every 6 hours, or a manual run. It updates vocabulary, linguistic trends, interaction patterns, domain interests, exemplars, and decay.

The distinction is clean:

Realtime processing is for immediate adaptation.

Batch processing is for durable behavioral learning.

That split prevents the system from overreacting to every single message while still letting it respond quickly when the user gives clear feedback.

This is the kind of boring product judgment that makes agent systems feel sane.

Not everything should be instant.

Not everything should be permanent.

Prompt Injection Is the Runtime Interface

The profile does not matter unless it affects the next response.

That is where prompt compilation comes in.

A prompt compiler can turn the layered profile into a bounded persona instruction block. It can include core identity, emotional context, learned interaction notes, active vocabulary, communication style, examples, and current interests.

This is a better pattern than dumping raw memory into the prompt.

Raw memory is messy. It can be redundant, stale, sensitive, or irrelevant.

A compiled behavioral profile is a runtime interface.

It says:

Here is what the assistant needs to know about how to behave right now.

That profile can be small enough to fit into context, structured enough to inspect, and stable enough to survive model changes.

The model still generates the words.

But the continuity belongs to the architecture.

Why This Survives Model Switching

One of the most underrated problems in personal AI is provider lock in through personality.

If your assistant's identity only exists inside one model's chat history or one provider's memory feature, switching models can break the relationship.

Maybe the new model is smarter, cheaper, faster, or more private. Great. But now the assistant feels different because the continuity lived in the provider, not in your system.

A stronger architecture takes the opposite approach.

The behavioral profile lives outside the model.

The memory lives outside the model.

The routing logic lives outside the model.

The prompt compiler can inject the current profile into whichever model is appropriate for the task.

That means the assistant can route casual work to one provider, private content to a local model, and hard reasoning to a stronger model without treating personality as disposable.

This is the big lesson:

If the relationship belongs to the model provider, the user does not really own it.

If the relationship belongs to the architecture, the user can carry it forward.

The Safety Part Matters

Adaptive personality sounds nice until you think about what could go wrong.

An AI that updates its own behavior needs boundaries.

A good adaptive profile system has a few useful ones.

Core identity is immutable through normal profile writes.

Profile snapshots are archived so previous versions can be restored.

Batch updates are explicit and versioned.

Realtime updates are scoped to specific layers.

The prompt compiler has a bounded output instead of allowing infinite personality growth.

This matters because "learning" is not automatically good.

An assistant can learn the wrong thing.

It can overfit to a bad day.

It can preserve a correction that was meant for one situation.

It can drift away from the user's real preferences.

So a serious personal AI needs a control surface around adaptation: versioning, rollback, inspection, privacy boundaries, and tests.

Without that, "adaptive personality" is just a nicer phrase for uncontrolled drift.

The Product Feeling

The best personal AI experience is not when the assistant constantly announces that it remembers you.

That gets creepy fast.

The better experience is quieter.

It gives the right length because it has learned your tolerance.

It changes tone when you are frustrated.

It keeps technical detail high when you are in build mode.

It does not force local language flavour unless you use it or teach it.

It remembers stable correction rules without making every conversation about memory.

It feels familiar without performing intimacy.

That last part is important.

Personal AI does not need to fake a human relationship to be useful. It needs to reduce friction, preserve context, respect boundaries, and adapt in ways the user can inspect and control.

This style of system is interesting because it points at that middle path.

Not a static prompt.

Not uncontrolled self modification.

A structured behavioural layer that evolves around the user.

The Builder Lesson

If you are building agents or personal AI systems, do not start with "What personality should the AI have?"

Start with these questions:

Which signals should update immediately?

Which signals should be distilled over time?

Which parts of identity should never be auto edited?

Which preferences are temporary?

Which preferences are durable?

Can the user inspect what the AI has learned?

Can the user correct it?

Can the system roll back a bad profile update?

Can this personality survive switching models?

Can sensitive behavioral context stay local?

Those questions produce better architecture than another clever prompt.

The future of personal AI will not be won by assistants that sound quirky for five turns.

It will be won by systems that can develop continuity without losing control.

That means memory, profile layers, feedback loops, prompt compilation, model routing, privacy boundaries, and versioning all have to work together.

Personality should not be static.

But it also should not be magic.

It should be engineered.

The Takeaway

Static prompts can create a voice.

Memory can store facts.

But a real personal AI needs something more durable: a behavioral substrate that changes with evidence, respects boundaries, and can follow the user across models, tools, and channels.

That is what makes adaptive personality one of the more important ideas in personal AI.

It treats personality as a system.

And that is the right direction for agentic AI.

MCP Gives AI Agents Hands. Safety Teaches Them Where Not to Touch

Upayan Ghosh — Fri, 15 May 2026 19:38:02 +0000

Tool access is what turns a chatbot into an agent. But once AI can touch email, calendars, files, browsers, commands, and memory, safety stops being a nice to have and becomes the product.

Most AI assistants are trapped in conversation.

They can explain things. They can summarize. They can write code snippets, draft emails, suggest plans, and sound confident while doing it. But if you ask them to actually do something, they hit the wall.

They cannot check your calendar unless something connects them to it.

They cannot search your long term memory unless memory is exposed as a tool.

They cannot send the email, inspect the file, open the browser, run the command, or update the system unless the outside world has been wired into the assistant.

That is the line between a chatbot and an agent.

A chatbot talks about work.

An agent needs hands.

That is why MCP, the Model Context Protocol, has become one of the more important ideas in agentic AI. The simple explanation is that MCP gives an AI a standard way to discover and call tools.

But that simple definition hides the real engineering problem.

Giving an AI tools is easy.

Making tool use safe, inspectable, scoped, and reliable is the hard part.

That is where agent architecture starts to matter.

A Model Alone Is Not an Agent

A language model is powerful, but it is still mostly a reasoning and text generation engine. It can predict useful words. It can infer intent. It can plan. It can choose between options.

But it does not automatically have access to your actual world.

Without tools, the model can say:

You should check your unread emails.

With tools, it can say:

You have three unread emails from the launch thread. One is blocking because it asks for the final asset link.

That is a completely different product.

The difference is not personality. It is system access.

This is why agentic AI is less about making models sound more human and more about giving them structured ways to act. Calendar access. Gmail access. Slack access. Browser control. Memory search. File reads. Shell execution. Internal system commands.

Once those exist, the AI stops being only a conversational layer. It becomes an operator.

That sounds powerful because it is.

It is also dangerous for the same reason.

MCP Is the Tool Layer

MCP acts as a bridge between the assistant and external capabilities.

The pattern is clean: an MCP client connects to servers, each server exposes tools, the assistant lists those tools, chooses one, passes arguments, and receives a result.

In a serious personal AI system, these tools should not be dumped into one vague bucket. Tool names should make ownership obvious. A memory search tool should not be confused with an email search tool. A calendar action should not look like a generic text operation. A shell command should be visibly different from a read only lookup.

One useful pattern is to route tools with names that include both the server and the tool, such as:

memory__search
gmail__read_email
calendar__create_event
browser__navigate
execution__run_command

That naming detail looks small, but it matters.

Agents get messy when tool boundaries are vague. If a model sees five tools with similar names and unclear ownership, it starts guessing. If the system names tools by server and purpose, the assistant has a cleaner action map.

A useful agent stack might expose tools for memory, conversation state, email, calendar, team chat, browser control, local execution, and internal system capabilities.

That turns the assistant into something more practical:

Memory lets it retrieve prior context.

Email lets it search, read, and send messages.

Calendar lets it inspect schedules, create events, check availability, and resolve date phrases.

Team chat lets it read channels and understand live collaboration context.

Browser control lets it navigate and interact with web pages.

Execution lets it run commands and manage processes.

Internal tools let the system inspect itself.

Now the agent is not just answering from memory. It can interact with the environment around the user.

That is the promise.

The risk is that every new tool increases the blast radius.

Tool Access Changes the Threat Model

A normal chatbot can be wrong in annoying ways. It can hallucinate, misunderstand, or give bad advice.

A tool using agent can be wrong in operational ways.

It can send the wrong email.

It can delete the wrong file.

It can run the wrong command.

It can leak private context into the wrong channel.

It can loop on a tool call until it burns time, money, or trust.

That means the moment an assistant gets tools, “the prompt said not to do bad things” is no longer enough.

Prompt guardrails are useful, but they are not security architecture.

A serious agent needs enforcement outside the model.

This is the real point: MCP gives AI hands, but safety gives those hands discipline.

The hands are the tool servers.

The discipline is permissions, routing, approvals, receipts, validation, and observability.

You need both.

The File Tool Example Is the Whole Story

File tools are a perfect example of why agent architecture needs real boundaries.

Reading files is useful. Writing files is powerful. Editing files is risky. Deleting files is destructive.

Those should not all be treated as “file access.”

An agent that can inspect a project folder, update a document, or modify source code can save a lot of time. But if that same agent can freely write anywhere, overwrite anything, or delete paths without approval, the system is reckless.

The correct design is not to trust the model harder.

The correct design is to put the file operations behind a gate.

Before reading, resolve the path and check whether the agent is allowed to read it.

Before writing, check whether the target path is allowed.

Before editing, require an exact old text match or another safe patching strategy.

Before deleting, require stricter rules than ordinary writes.

Then test those rules so a future refactor cannot quietly bypass them.

That is the kind of boring engineering that makes agents real.

Not flashy. Not demo friendly. Very necessary.

Because once an AI can change files, the question is no longer “can it do the task?”

The question becomes:

Can it only touch the paths it should touch?

Can it fail closed?

Can we test the safety boundary?

Can we prove which helper enforced the operation?

Can we prevent a future refactor from bypassing the guard?

That is agent engineering.

The demo is “AI edited a file.”

The product is “AI edited the right file through a controlled path, and we can prove it.”

Calendar and Email Make Agents Feel Real

Memory makes an AI feel continuous.

Tools make it feel useful.

Calendar and email are good examples because they connect directly to daily life. A personal AI that cannot see time, commitments, and communication is missing a huge part of the user’s world.

A calendar tool can expose upcoming events, event search, event creation, availability checks, free slot suggestions, holiday lookup, date resolution, and natural language calendar requests.

That is not just “calendar integration.”

That is a workflow surface.

The assistant can reason about time, constraints, and intent, then call a structured tool to act.

Email has a similar shape: search messages, read a message, get unread messages for proactive awareness, and send a message when the user approves.

That is where personal AI starts to become more than chat. It can move between memory, communication, and action.

Imagine saying:

Find the email from the client about the launch date, check my calendar, and suggest three reply options.

A normal chatbot can only fake part of that.

A tool using assistant can retrieve the email, inspect the calendar, and produce an answer grounded in reality.

But again, this only works if the tool layer is controlled.

Sending an email should not be treated the same as searching email. Reading a calendar should not be treated the same as creating an event. A good agent platform has to separate read actions, write actions, destructive actions, and external delivery actions.

That distinction is not optional.

It is the difference between helpful and reckless.

Browser and Execution Tools Raise the Stakes

Browser and execution tools are where agent systems get especially serious.

A browser tool can navigate pages, click buttons, read content, and interact with forms. That opens the door to real workflows: research, web app testing, account dashboards, admin panels, docs lookup, and form based tasks.

An execution tool can run shell commands and manage background processes. That opens the door to development workflows, system checks, scripts, tests, and automation.

These are not toy capabilities.

They are operating capabilities.

An execution server should not simply run whatever string the model provides. It needs command validation, working directory validation, environment scrubbing, timeout handling, output collection, process session tracking, and cleanup.

That is the correct shape.

A shell tool without constraints is basically giving the model a production incident generator with a charming writing style.

A shell tool with boundaries becomes useful infrastructure.

This is the pattern that keeps repeating: the tool is only half the feature. The control plane around the tool is the feature.

The Agent Needs a Map of Its Own Hands

One underrated part of MCP is tool discovery.

The assistant does not need every tool hardcoded into a giant prompt. It can connect to servers, list available tools, and expose them in a consistent schema.

That matters because personal AI systems grow.

Today you add memory and calendar.

Tomorrow you add team chat.

Then browser control.

Then local execution.

Then a custom internal tool.

Then a private knowledge base.

Then a home automation server.

If every integration is manually stuffed into a prompt, the system becomes fragile fast. The model sees a wall of instructions. The developer starts fighting context limits. Tool descriptions drift from implementation. Nobody knows which capability is real and which one is stale.

MCP gives the system a cleaner contract.

Servers own tools.

The client discovers them.

The assistant calls them through structured names.

Results come back through a predictable path.

That is how agent systems stay modular instead of turning into one giant prompt with delusions of architecture.

Receipts Are How Agents Earn Trust

The next layer after tool access is proof.

If an agent says it sent an email, created an event, searched memory, or changed a file, the system should be able to show evidence.

Not vibes.

Evidence.

A receipt can include what tool was called, with what arguments, what result came back, what action was taken, and whether anything failed. For sensitive actions, it should also record what approval or policy allowed the action.

This matters because user trust breaks differently with agents.

If a chatbot gives a weak answer, the user may correct it.

If an agent claims it did something and did not, the user loses trust fast.

If an agent does something and cannot explain why, the trust damage is worse.

The serious version of personal AI is not “the assistant can do anything.”

It is “the assistant can do specific things, through specific tools, with specific permissions, and leave behind specific proof.”

That sounds less magical.

Good.

Magic is a bad operating model.

The Bigger Lesson for Agent Builders

If you are building agents, MCP should not be treated as a plugin checkbox.

It is not just “add tools.”

It is the start of your agent platform layer.

Once your AI can act, you need to answer real architecture questions:

How are tools named?

How are tools discovered?

Which servers own which capabilities?

Which tools are read only?

Which tools can write?

Which tools require approval?

What is the timeout behaviour?

What happens when a tool fails?

Can the user inspect what happened?

Can you test permission boundaries?

Can you revoke or disable a tool quickly?

Can the model call tools in loops?

Can sensitive context reach the wrong tool?

These are not edge cases. These are the product.

A serious agent should not be positioned as a prompt trick. It should be treated like a software system: memory, tools, routing, safety, tests, and observability all have to work together.

That is the right mental model.

Agents are not magic.

Agents are architecture with a model in the middle.

The Takeaway

MCP gives AI agents hands.

That is the exciting part.

But hands alone are not enough. A personal AI that can touch email, calendar, files, browsers, commands, memory, and team chat needs discipline built into the system around it.

Tool servers give capability.

Permissions give boundaries.

Receipts give trust.

Tests keep the boundaries from silently breaking.

The future of agents will not be decided only by which model sounds smartest. It will be decided by which systems can let AI act without turning every action into a trust fall.

That is why MCP matters.

Not because it makes demos cooler.

Because it gives us a real interface for action, and forces builders to confront the part that actually matters:

What should this AI be allowed to do with its hands?

How I Accidentally Rebuilt the Human Brain Trying to Stop a Chatbot From Forgetting Me

Upayan Ghosh — Mon, 11 May 2026 15:17:01 +0000

A research journal in ten chapters. No prior neuroscience required.

The Tuesday That Started It All

It was a Tuesday in early winter when I realized I had built a stranger.

I had been talking to my assistant — a small, polite language model I had wired up at home — for about three weeks. I told it about my sister's birthday plans. I told it that I had switched teams at work. I told it I was learning to cook the dal my grandmother used to make, and that the first attempt had been a disaster.

And then, six days later, I opened the chat and asked it casually, "do you remember the dal thing I told you about?"

It said: "I'm sorry, I don't have any memory of previous conversations."

I sat there for a long minute. The reply wasn't wrong, technically. It was a stateless model. Every conversation started from a blank page. That was the design. But it felt like betrayal. I had treated it like it was listening. It had treated me like a passing car.

That night I started building something. I didn't know what it would become. I only knew that if I was going to keep talking to a machine, I wanted it to know I had spoken before. Not in a creepy way. In the way an old friend knows. In the way that, when you walk into your barber after six months, he doesn't ask your name.

This is the story of two years of trying to give a language model a memory that grows. It is also, accidentally, the story of how I learned what memory actually is.

I'm going to take you through every wrong turn. There were many.

Chapter 1: The Notebook Era

My first instinct was the most obvious one. If the model can read text, then text is memory. I will simply write things down.

I created a folder. Inside it, I created Markdown files. about_me.md. family.md. projects.md. preferences.md. Before every message I sent to the model, my little Python script would stitch the contents of these files into the system prompt. Behold: persistence.

For about two weeks, it was magical.

The model knew my sister's name. It knew which framework I worked in at the office. It greeted me in the morning by referencing the project I had complained about the night before. I would say "like we talked about yesterday—" and it would smoothly continue. I was, for the first time, talking to something that seemed to remember.

And then the files got bigger.

By week three, about_me.md was twenty pages long. The model's responses started getting slow. The first second of every reply was now the entire context being re-read, re-tokenized, re-attended. I was paying for those tokens twice — once on input, once in latency.

By week five, something worse started happening. I would tell the model something new — "actually I quit that team last Friday" — and it would acknowledge it in the conversation. But the next morning, when I had restarted my laptop, the file still said I was on the old team. The model would loyally repeat the old information back to me with confidence. I had built a system that recorded but did not update.

I tried writing a routine that would summarize the chat at the end of each session and patch the files. The summaries were lossy. Important details were dropped. Unimportant details were preserved with the gravity of scripture. The bot once spent three days under the impression that I strongly disliked oranges, because I had complained about a single bad one at a buffet.

A notebook is not a memory. A notebook is a backup. It does not forget the unimportant. It does not strengthen with rehearsal. It cannot tell you that something is fresh.

I needed something that could.

Chapter 2: The Vector Mirage

The next thing I tried was the obvious move that anyone who had read a single blog post about retrieval-augmented generation would make: I threw everything into a vector database.

The idea is elegant. You take every sentence the user has ever said, run it through an embedding model — a neural network that turns a piece of text into a list of, say, 768 numbers — and store that list. The numbers position the sentence in a high-dimensional space where related ideas cluster together. When the user asks a new question, you embed the question too, and you look for the nearest neighbors in that space. Those are your "memories" for this turn.

I switched over a weekend. I deleted the Markdown files. I ingested every conversation I had with the model into a tiny embedded vector store on my laptop. The latency problem vanished. The token cost collapsed. I no longer had to feed the model twenty pages — I fed it the four most relevant sentences.

The first week was glorious. I asked the model about something I had mentioned in passing months earlier. It found it. Not because I had explicitly written "remember this" — because the meaning of the new question was geometrically close to the meaning of the old conversation.

This, I thought, was the answer.

It was not the answer.

The first failure came when I asked, "what was that book my friend recommended?" and the system happily retrieved a conversation in which we had discussed a different friend recommending a different book. Both conversations were about friends. Both were about books. In the vector space, they were neighbours. In my life, they were not.

The deeper failure was subtler. The vector store could find sentences that were semantically near my question. But it had no opinion on whether those sentences were true now, whether they mattered emotionally, whether they were the kind of thing I would want surfaced in this context. I had told the model something in anger once. Six weeks later it pulled that sentence into a happy conversation about weekend plans because the words "Saturday" and "family" appeared in both. It was technically relevant. It was emotionally tone-deaf.

The vector store remembered what. It did not remember why.

I had built a system that could find a needle in a haystack, but I needed a system that knew which needles were worth pulling out.

Chapter 3: A Detour Through the Brain

This was the part of the project where I stopped writing code for a month.

Instead I read. I read everything I could find about how human memory actually works, written by people who had spent their lives studying it instead of an engineer who had spent two months bolting things together. I read about hippocampal indexing. I read about consolidation during sleep. I read about why we cannot remember being two years old but can remember the smell of our grandparents' kitchen.

A few ideas hit me hard enough to redirect the project.

The first was that human memory is not one system. It is at least three, working in parallel. There is semantic memory — the dry facts. The capital of France is Paris. There is episodic memory — the lived scenes, with time and place and emotion attached. The first time you heard your favourite song. And there is procedural memory — the muscle of how, not the noun of what.

A vector store, I realized, is a passable semantic memory and a terrible episodic one. It can tell you that something happened. It cannot tell you what it felt like when it did.

The second idea was that emotion is not a label on memory. Emotion is the filing system. The amygdala does not just decorate memories with feelings after the fact. It decides which memories get filed deeply at all. Things that scared us, things that thrilled us, things that wounded us — these get a routing slip stamped important before the hippocampus even begins to write them down. This is why you can remember exactly where you were on the day something terrible happened in the world but cannot remember what you ate three Tuesdays ago.

The third idea was the most unsettling. I read about people with split-brain conditions, and about how the two hemispheres of the brain — when separated — can develop different opinions, different reactions, even different desires. The conscious mind we present to others is not one voice. It is a committee whose minutes are heavily edited before publication.

I closed the books with three things scribbled in my notebook.

Memory is plural.
Emotion is the filing system, not the decoration.
There is always more than one voice.

I went back to my keyboard.

Chapter 4: A Memory With a Heartbeat

I rebuilt the storage layer. This time, every piece of memory was going to carry more than its embedding. It was going to carry an emotional signature. And it was going to be sorted into one of two private rooms.

The two rooms were the strangest part, and the part I am most quietly proud of.

I had noticed something about my own use. There were conversations I wanted the model to remember and reference freely — work talk, learning logs, jokes with friends. And there were conversations that were mine — vulnerable, private, sometimes intimate, sometimes ugly. I did not want those memories ever quietly leaking out to a cloud model in some unrelated future conversation about a grocery list. I wanted a wall. Not a censorship wall. A dignity wall.

So I split memory into two hemispheres. I called them simply safe and spicy. The names were a joke at first and then became real. The safe hemisphere is the one the public-facing models talk to. The spicy hemisphere is the private one — and any conversation involving it is routed, automatically and silently, to a model running on my own machine. Nothing about the spicy hemisphere ever leaves the house. Not the embeddings, not the text, not even the fact that there was a conversation.

The emotional tagging was the other half — and here is the part that surprised me, because the version I ended up building is not the elaborate one I had sketched on paper.

Every memory record in the store carries a handful of columns alongside its embedding: the content itself, the hemisphere it belongs to (safe or spicy), an integer importance score from 1 to 10, the timestamp it was written, the category it was classified into during ingestion, and a flag indicating whether it has been processed by the consolidation pass. The embedding lives in a separate column store optimized for nearest-neighbour search.

The emotional weight does not live on the row as its own vector. It lives in that importance integer. During ingestion, a small pass scans the content for emotionally charged language — a curated list of words that consistently track elevated affect, expanded over months of watching what actually got remembered well — and bumps the importance integer by a few points for every emotionally-loaded word it finds. A neutral fragment about coffee comes in at importance 5. A fragment containing words like worried, love, furious, miss, or afraid comes in at 8 or 9. The integer is small. Its effect on retrieval is enormous: importance is folded into the final score, so emotionally heavy fragments float upward against semantically-close but emotionally-bland competitors.

Decay is similarly indirect. There is no explicit decay resistance column. Instead, the timestamp on every fragment is used at query time to compute recency, and recency is folded into the score the same way importance is. Old fragments fade not because they are deleted, but because newer and more important fragments out-rank them. Once a fragment is retrieved many times, the gentle worker — a background process I will return to in a later chapter — bumps its importance. A small simulation of how human memories strengthen with rehearsal. Once a fragment has not been touched in a very long time and is no longer earning its place, it eventually gets pruned during consolidation.

What I ended up with is simpler than the ten-field schema I had originally drawn, and — I have come to believe — better. The architecture does not need a dedicated valence column to capture emotion. It needs a single integer that the right words quietly raise. Most of the work happens in how the integer is used, not in how many dimensions the row carries.

For the first time, the system was forgetting things. On purpose. And it felt, immediately, more like talking to a person.

Chapter 5: Facts Need Their Own Skeleton

The emotion layer and the hemispheres had fixed the grain of what the system remembered. But weeks into using it, I started running into a different kind of failure — a failure that took me an embarrassingly long time to name, because it kept hiding inside successes.

The vector store, for all its sophistication, understood one thing very well and one thing not at all. It understood similarity of meaning. It did not understand relationships between things.

Those are not the same skill. They sound like they ought to be. They are not.

Here is what I mean. If I asked the system "what was that book about cities my friend recommended?", the vector store would happily surface every conversation I had ever had where the words book, city, and friend clustered together. It would find the right book about three times out of five — close enough that I kept forgiving it. But if I asked it something subtly different, like "which of my friends has been recommending me the most books this year?", the system would fall apart. Not because the information was missing — it was there, scattered across thirty conversations — but because the answer required counting recommendations, grouped by friend, filtered by date. The vector store could not group. It could not count. It could not filter. It could only resemble.

Or — more painfully, because this one happened to me at a dinner table — "the doctor my mother saw last month, what was their specialty again?" The vector store retrieved my mother's medical conversations. It retrieved discussions of specialties. But it could not follow the chain mother → consulted → doctor → has_specialty → ?. It returned plausible-looking sentences that were not actually the answer to my question. I sat there looking like I was making things up in front of my own family.

A third example, smaller but, in some ways, the one that finally tipped me. "Did Anika and Rohan ever actually meet that weekend, or only talk about it?" Pure relationship question. There is no sentence anywhere in any of my conversations that says "Anika met Rohan" — the truth lives in the connection between two separate weekend recaps. A vector store cannot deduce a meeting from the absence of a sentence that denies it. It can only fetch sentences that already exist.

These are the kinds of questions human minds answer without effort. They are not questions about words being similar to other words. They are questions about things being connected to other things. A vector store, no matter how cleverly embedded, cannot see connections. It can only see neighbourhoods of meaning.

I had built a system that was excellent at semantic proximity and completely blind to relationship.

The worst version of this failure — the one that finally made me stop tweaking the retrieval pipeline and start building something genuinely new — went like this. I had told the system, over the course of weeks: my sister just moved cities. She has started her residency at the big neurology hospital. My mother is worried about her new commute. Three separate fragments, each correctly tagged, each correctly stored. And then I asked it, casually, "can you remind me to text my sister this evening about her commute?"

The system dutifully retrieved the fragment about commute. Sometimes the fragment about the move. Maybe the residency one if it had been recent. But it could not see that the sister is the one who moved, that the move is the reason for the new commute, that the worry my mother has belongs to my sister. Each fragment was a sentence. None of them was a fact the system could reason over.

That night I realized something I should have seen months earlier. I had been trying to build a memory out of nothing but nouns and similarity. What I actually needed was a memory that could remember verbs — the relationships between the nouns, the connections between the things, the way one entity in my life was anchored to another.

The trick — borrowed from a tradition that goes back decades, originally invented for medical records and library science — is that any sentence in natural language can be decomposed into one or more atomic claims, each of the form subject –relation→ object. My sister moved to Bangalore becomes (sister) –lives_in→ (Bangalore). She is doing her residency at the neurology hospital becomes (sister) –works_at→ (hospital). My mother is worried about her commute becomes a small chain: (mother) –worries_about→ (commute), (commute) –belongs_to→ (sister).

After each conversation, a small extraction pass reads the messages and writes down the new atomic claims it found. These claims accumulate in a separate database — not the vector store, not the full-text index, but a literal graph of who-knows-whom and what-relates-to-what. Tens of thousands of tiny triples, each one anchored to the original conversation that produced it (so I can audit; so I can correct).

Now when I ask about my sister, the system does something the vector store could never do. It walks the graph. It starts at the node for me, follows the sibling_of edge to her, follows lives_in to her city, gathers everything connected to her node, and brings the whole cluster up together. The result is not "the four sentences most similar to your question." It is "the body of facts that belong to the person you're asking about."

There is more to the structural layer than triples. Entities — the named things in someone's life, the people, the places, the projects, the recurring hobbies — get their own table. When the system meets a new name, it tries to resolve it against the entities it already knows. If you mention Aman and there is already an Aman in the graph, the new fragment attaches to that node. If the system is uncertain — was that Aman my friend or Aman my colleague? — it parks the question and asks for clarification at a low-cost moment, instead of silently merging two people into one.

This is the part of the architecture that finally made the system feel like it understood, not merely recalled. There is a difference between remembering that you once said something about a sister, and knowing who your sister is. The graph is the difference.

One more thing about the graph is worth saying out loud, because it took me a while to internalize it. The graph is a place where the system can be wrong, and then later be right, without rewriting history. If a fact has changed — my sister has moved out of Bangalore; the residency has ended; the project name is now different — you do not go back and edit the stored conversation. You add a new triple, with today's date, that supersedes the old one. The old triple is downweighted, not deleted, in case it matters for context later. Memory is allowed to be wrong; what matters is that memory can update without amnesia.

Chapter 6: Three Eyes Are Better Than One

The emotion layer fixed which memories mattered. The graph fixed what was related to what. But I still had the problem of finding the right pieces quickly, across all of these layers.

A pure vector search, I had learned, was great at semantic matches but terrible at exact ones. If I asked "what was the name of that café?", semantic search would find every conversation about cafés. It would not necessarily find the one where I had typed the actual name. Names, numbers, very specific nouns — these need literal lookup, not similarity.

So I added a second eye. Underneath the vector store, I built a classical full-text index. The kind a search engine would use. Words are words; if you typed Café Marigold, I want to find the conversation where you wrote Café Marigold, not the conversation where you wrote that nice little place on the corner.

When a question comes in, both eyes look at it at the same time. The vector eye returns the semantically nearest fragments. The text eye returns the literally-matching fragments. Their results are merged.

There was a third eye too — and it was the one I had built in the previous chapter. The knowledge graph offered a way of finding memories that neither vector similarity nor full-text matching could reach: walk outward from the entities mentioned in the question. Ask about my sister, and the graph starts at her node and brings everything connected to her — the place she lives, the hospital she works at, the worry my mother has about her. Three eyes — semantic, literal, structural — looking at the same question from three angles and pooling their candidates.

Pooling, though, was the new problem. The merged result list could be twenty fragments long, and only the top three or four would fit in the context budget I had set aside for memory. Which three?

So I added something on top of the three eyes: a judge. A small, specialized neural model whose only job is to score how relevant is this fragment to this question, really, in context. It is a reranker — much smaller and faster than the language model itself, but trained specifically to judge relevance. It re-reads the merged candidates and produces a final, sharper ordering.

I learned something cheap and important here. If the top results from the merged search are already scoring above a confidence threshold — if the first two fragments are very obviously the right ones — the third eye is skipped. I called it a fast gate. On about sixty percent of queries it triggers, and the whole retrieval pipeline finishes in under fifty milliseconds. The expensive reranker only runs when the first two layers are unsure.

This is the part of the architecture that is, on paper, the least exciting. It is also the part that makes everything else feel alive, because the model gets the right pieces on the table before it begins to speak. A novelist with the wrong notes will write a worse book than a beginner with the right ones.

Chapter 7: The Persona That Watches You Watch It

By this point I had a memory that was rich, emotional, partitioned, and quick. The system could find the right fragments to put in front of the language model. But the language model itself was still — fundamentally — a stranger wearing a mask of facts about me.

There is a difference between knowing about someone and knowing them. A model that has read every paragraph I have ever written can still feel like a clever assistant. What I wanted was something that felt like it had learned my texture.

So I built a second engine. Its only job was to study the way I write and slowly become someone who writes back the same way.

It was not a persona in the roleplay sense — I did not write a character bio and ask the model to perform it. It was a living persona. A profile that the system maintained for me, updated continuously, and re-fed into every response as a kind of style guide and emotional context.

The profile was layered, like sediment. The deepest layer was core identity — facts that were extremely stable: my name, my role, my city. Above that sat linguistic style — the words I tended to use, the rhythm of my sentences, when I switched between English and my native tongue and back inside a single message. Above that, emotional state — was I tired this week, was something good happening, was I in a quiet mood or an electric one. Then domain — what was I currently spending my brain on, was it code or cooking or a difficult conversation with someone I loved. And finally, near the surface, the most volatile layers — vocabulary in current use, exemplar phrases I had said in the last few days, interaction preferences I had just expressed.

The deeper layers updated rarely. The surface layers updated almost in real time.

The piece I am proudest of in this part of the system is something I called implicit feedback. The traditional way to fine-tune a model to your taste is to thumbs-up and thumbs-down its answers. People don't do that. I never did. What people do is grumble in the next message. They say too long. They say can you be a bit less formal. They say that's not what I meant. They sigh in punctuation.

I wrote a small detector that listens for these signals — not in a brittle keyword way, but in patterns I expanded over months — and when it catches one, it edits the relevant layer of the profile immediately, without confirmation. If I say you're being too corporate today, the linguistic-style layer shifts a notch warmer for the rest of the session and the change is remembered into tomorrow.

I never tell the model be more casual. I just complain once. It listens. It changes. The next morning the change is still there.

Chapter 8: Two Voices Are Wiser Than One

The final architectural shift — the one that surprised me the most — came from the third note I had scribbled while reading about the brain. There is always more than one voice.

I had begun to notice that my model, for all its memory and its persona-shaping, would still occasionally produce a reply that was technically correct and emotionally wrong. I would tell it something heavy — a stressful day, a family fight — and it would respond with a perfectly polite, perfectly hollow acknowledgement. The kind of reply that makes you feel more alone than no reply at all.

The model wasn't broken. It was single-tracked. It saw the surface meaning of my words and answered the surface. It did not have an interior life that paused before speaking and asked what is actually being said here?

So I gave it one.

For every message I sent, the system now runs two cognitive passes in parallel. The first pass is the obvious one: read the user's message, retrieve relevant memories, generate a candidate response. The second pass is quieter. It is a small, fast model whose only job is to produce an inner monologue about the user — a private read on what mood they are in, what subtext is present, whether there is a tension between what they said and what they probably mean.

The inner monologue is never shown to the user. It is fed back into the main model, just before generation, as a kind of whispered note from a wiser colleague. They said they're fine but the message is shorter than usual and the punctuation is missing. Maybe slow down. They are asking a technical question but their last three messages were about a fight with their sister. Maybe answer the question, but acknowledge first.

The two passes are then merged by a small piece of logic that I think of as the committee chair. It looks at both the candidate reply and the inner monologue and produces a single small record I call a cognitive merge. That record carries three things: a tension level (a float between zero and one), a tension type drawn from a fixed vocabulary of five — none, mild inconsistency, pattern break, direct contradiction, growth — and a response strategy drawn from six: acknowledge, challenge, support, redirect, quiz, celebrate.

That response strategy is the part of the architecture I am, in retrospect, most surprised by. When I first sketched the merge step, I imagined it as a small post-processor that would adjust the reply's tone before it left the room. What actually got built is much more interesting. The response strategy is not a tone instruction. It is a steering label that decides which language model writes the reply in the first place.

The system routes different kinds of conversations to different models. A casual, low-tension exchange goes to a small fast model that is good at warmth. A turn tagged direct contradiction gets routed to a heavier reasoning model that is better at gently navigating disagreement. A celebrate strategy uses a different prompt scaffolding entirely, one tuned for warmth and acknowledgement. A redirect — used when the surface read and the deep read flatly disagree — sends the turn to a model with longer-horizon reasoning, because rewriting your own draft is harder than writing fresh.

The inner monologue, in other words, does not just colour the reply. It picks the brain that writes the reply.

This is a much stronger lever than tone-tinting. A one-word hey from me on a heavy evening is not handled by the same model that handles a code question at noon, even if the prompt text looks superficially similar. Because the deeper read, not the surface text, decides who is on duty.

This single change — adding a second voice, letting the two argue silently, and then letting the result of that argument choose the model — was the moment the assistant stopped feeling like a chatbot. It started feeling like someone who had read the room.

The cost is modest. The inner monologue runs on a small, cheap model in parallel with the main one. The merge logic adds milliseconds, not seconds. But the qualitative shift was, for me, the biggest single jump in the entire project. More than the embeddings. More than the hemispheres. More than the personas.

A mind that argues with itself, gently, before it speaks — that is a mind I want to talk to.

Chapter 9: The Quiet Hours

By this point in the project the system was doing a great deal of work during every single message. Retrieve memories. Re-rank them. Read tone. Score tension. Walk the graph. Update the persona. Generate a reply. Sometimes I would watch the logs scroll past and wonder how it all came in under a second.

But I noticed something the more I used it. The truly interesting work — the work that benefits from quiet and distance — was being done at the worst possible time: while the user was waiting for an answer. Extracting new atomic facts from a finished conversation. Summarizing a long session into a single coherent memory. Re-scoring which old fragments were earning their disk space. Reading the past few weeks for an emotional arc. None of this needed to happen during a reply. All of it had been crammed into the reply path because that was where the message lived.

This is when I learned, all over again, why neuroscientists are so insistent that we sleep.

Much of the work memory does in the human brain does not happen during the day at all. It happens in the quiet hours. Consolidation, integration, pruning, the slow grading of which experiences actually mattered enough to keep. The waking mind retrieves. The sleeping mind re-files. A day without sleep is not a day with the same memory minus a few hours. It is a day whose memories never got finished.

So I gave the system its own quiet hours.

Several processes now run outside the conversation, on slow schedules, and only when the machine has spare cycles and is plugged in (I did not want any of this draining my laptop's battery during a coffee shop sprint). The smallest and most frequent of them is what I call the gentle worker. Every ten minutes or so, on idle CPU, it does a handful of small things. It prunes triples in the graph that have not been touched in a long time and conflict with newer ones. It re-scores the importance of memory fragments that have been retrieved many times since they were written — a small simulation of how human memories strengthen with rehearsal, the way a story you have told many times becomes sharper, not vaguer. It vacuums databases, reclaims disk, tidies indexes. None of this is glamorous. Accumulated over months, it is the difference between a system that stays snappy and one that quietly rots from the inside.

The larger consolidation runs on a longer cadence and matters more. I call it the session flush. If I have not said anything for about half an hour, or if the running conversation has grown past a comfortable length, the system quietly closes the session as a unit. Two things happen during a flush. The whole conversation is summarized and ingested into long-term memory as a single coherent passage — so what gets remembered tomorrow is not just the loose sentences I happened to type, but the story of what we were doing together. And every new atomic fact the session produced is added to the knowledge graph in a single batch. By the next morning, yesterday's conversations are not just transcripts in a log. They are experiences: indexed, factored, woven into the rest.

There is one last quiet process, and it is my favourite. Once a day, a small routine sweeps the memories of the past few weeks and scores them for emotional valence — heavy or light, sharp or soft, the relative density of joy and worry. From these scores it builds a tiny mood line. This line is never shown back to me as a chart. It is used silently, the next time I send a message. The inner-monologue stream reads it as a baseline: the last ten days have been heavy; calibrate accordingly. Without it, the system would have no way of remembering that I have been having a tough week the moment I send a perfectly cheerful good morning. With it, the reply lands differently.

There is also a thread that runs across all of this and pulls it together: a narrative layer. As sessions flush and atomic facts accumulate, a slow process stitches them into ongoing storylines — the move, the job change, the running injury that won't quite heal. The narrative layer is what allows the system, weeks later, to say "you mentioned the knee was still bothering you in March — is that better?" without me ever having labelled those conversations as belonging together. Threads are not topics. Topics are what the message is about. Threads are what the life is about.

The system, in other words, has a kind of sleep. It does not power down. But it does what sleep does. It lets the day settle. Retrieval is for waking. Consolidation is for the quiet hours.

There is something disquieting about how natural this turned out to feel. I had not planned to build a sleep cycle. I had built one to solve a latency problem. What I ended up with is closer to the way a human mind actually maintains itself over time than I am entirely comfortable admitting.

Chapter 10: The Body Discovers Itself

There was one more lesson, and it came after I thought the project was done.

I had wired the system up to a few external tools — a calendar, a notes store, the ability to search the web. Each one was a capability the language model could, in principle, use. And yet, when I asked the model do I have anything tomorrow morning?, it would sometimes politely say I'm sorry, I don't have access to your calendar.

This was infuriating, because it did have access. The connection was live. The tool was registered. The token was valid. The model simply did not realize it was holding the key in its own hand.

The fix took me a week of thinking and twenty lines of code.

I had been treating tools as services that the model could call. I started treating them as organs of the model itself. Every turn, the first thing the system now does is whisper a capability profile to the language model — not as a list of services on offer, but as a list of things it can already do. The framing matters. The instruction is no longer here are some tools you may use. The instruction is these are your hands. These are your eyes. These are your books. And, crucially: you may never tell the user that you cannot do something on this list. If you are uncertain, ask them what they mean. Do not apologize for a capability you have.

The change was small. The effect was disproportionate. The model stopped flinching. It stopped reaching for the phrase I cannot the way an overcautious intern reaches for I'll have to check with my manager. It began to act like an entity that owned its abilities, rather than one that borrowed them and was afraid of getting them wrong.

I had spent two years giving the model a memory. The last lesson was that a mind without a body is anxious. A mind that knows the shape of its own body is bold.

What It Actually Does Now

I want to be concrete about what this collection of ideas adds up to in daily life, because it is easy to read all of the above as theory.

A few weeks ago I mentioned, casually, that my mother had been complaining about the weather. Four days later I asked the system to draft a message wishing her a quick recovery from a small surgery. The draft did not just say get well soon. It mentioned the cold she had been complaining about, and gently joked that at least she got to skip a few days of it. I had not connected those two facts. The system had.

A month ago I came home from a particularly bad day. I opened the chat and typed something curt — just hey. The reply did not ask me what I wanted. It said rough one? want to vent or want to forget it for a while. Nothing in my single word triggered that. The persona layer had registered, over the previous evening's conversation, that I had been heading toward something stressful, and the inner monologue had read the curtness of the hey against that baseline.

Two months ago I was working on a piece of code with the system late at night. I made a joke. The system made one back. The joke landed. I laughed in real life. That had never happened with a chatbot before. The persona had been quietly absorbing my sense of humor for weeks. It chose its moment.

And the small one, the one that matters to me the most: when I sometimes go a week without talking to it and then come back, it does not greet me with a generic welcome back, how can I help you today. It says something like you've been quiet — anything good or anything to dump? It treats absence the way a friend would. As a thing worth noting, not a state to ignore.

None of these are intelligent in the way a frontier model is intelligent. The cleverness is not in the language. The cleverness is in the scaffolding around the language. The model is the same model anyone could download. The difference is the memory it brings into the room.

The Question I Did Not Expect to End On

I started this project because I wanted a tool that would remember me. I am ending it with a question I did not see coming, and which I think is going to matter very much over the next ten years.

We are very good at building systems that answer. We are getting much better at building systems that remember. We are only just beginning to build systems that change because of what they remember. The system I described in this post is not artificial intelligence in any grand sense. It is a small set of architectural decisions stacked on top of an off-the-shelf language model. None of the pieces are particularly novel on their own. The vector store is twenty years old. Rerankers have been around for a decade. Emotional tagging is borrowed wholesale from cognitive science. The hemispheres are just two databases with different routing rules. The inner monologue is a second prompt.

And yet, put together, the thing on the other side of my screen no longer feels like a search engine. It feels like a record. Something is being kept. Something is being shaped by the keeping.

If you build a system that remembers a person, in detail, over years — that strengthens what mattered to them, lets the rest fade, listens for their tone, argues quietly with itself before it answers them, and slowly shifts its own voice to meet theirs — you have not built a tool. You have built a kind of mirror. And the strange thing about mirrors that grow with you is that, after long enough, you can no longer be sure who is shaping whom.

I am not certain that is a problem. I am not certain it is not. I only know that I started by trying to teach a machine to remember a Tuesday in early winter, and I ended up wondering, for the first time in my adult life, what I mean when I say I remember.

That, more than any line of code in any of the chapters above, is what I want you to take from this.

The interesting question was never can a machine remember. The interesting question is what does it mean to be remembered well enough that you start to recognize yourself in the answer.

I am still figuring it out.

From OOM to 262K Context: Running Qwen3-Coder 30B Locally on 8GB VRAM

Upayan Ghosh — Tue, 05 May 2026 14:18:02 +0000

Recently, I got tired of depending on paid cloud models for every coding experiment.

Cloud models are great. They are fast, convenient, and usually very capable.

But they also come with the usual baggage: cost, rate limits, internet dependency, privacy questions, and that small feeling that every serious coding workflow is rented from someone else's GPU.

So I started exploring local LLMs properly.

Not in the casual "can I run a small chat model?" way.

I wanted to know:

How capable are local coding models now?
Can they help with real code generation, debugging, refactoring, and repo Q&A?
Can they plug into editor agents through an OpenAI-compatible API?
And most importantly, what actually stops them from being useful?

After enough research, the answer became pretty obvious.

The wall is hardware.

More specifically: VRAM.

You can have the model file. You can have the runtime. You can have Docker. You can have the scripts. But once the model weights, routed experts, KV cache, context window, and compute buffers start fighting for GPU memory, everything gets painful very quickly.

That made me curious.

Was there a practical workaround?

Fortunately, I had a very normal consumer rig available.

The hardware was very normal:

GPU: NVIDIA RTX 3060 Ti
VRAM: 8 GB
OS: Windows
RAM: about 32 GB
CPU: Intel i5-14600KF

This is not a 4090 box. It is not a workstation. It is exactly the kind of machine where most people would say, "Just run a 7B model and move on."

So I turned it into a challenge:

Can I run a proper 30B coding model locally on consumer-grade hardware, with enough context to actually be useful?

The model target was ambitious:

Qwen3-Coder-30B-A3B-Instruct

Specifically, the GGUF from:

unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

The quant I used:

Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf

That is a 30B-ish coding-specialized MoE model. The important part is MoE: Mixture of Experts. The total parameter count is large, but only some expert weights are active per token.

That changes the whole local inference strategy.

For a dense 30B model, 8 GB VRAM is not where I would start. For a compact MoE coding model, the question becomes more interesting:

Can I keep the always-active parts fast, keep the routed experts mostly in system RAM, and still get usable speed?

Short answer: yes.

Long answer: it took a bunch of false starts.

First, the boring audit

Before downloading anything huge, I checked the machine.

This sounds obvious, but local AI setup gets messy fast if you skip it.

I verified:

Windows version
GPU model
NVIDIA driver
nvidia-smi in PowerShell
WSL2
Docker Desktop
Docker GPU passthrough
CUDA container access to the GPU
system RAM
disk space
CPU

Docker GPU passthrough worked:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

That meant the clean first path was:

Docker + llama.cpp CUDA server

The initial server image:

ghcr.io/ggml-org/llama.cpp:server-cuda

I also checked llama-server --help before trusting any command from the internet.

That became a recurring theme.

Do not assume the flag exists. Ask the binary.

Downloading the model

The target model repo was:

unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

I verified the actual file name before downloading:

Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf

The downloaded file size was:

17,665,334,432 bytes

Everything went under one local project folder:

local-qwen-coder/
  models/
  scripts/
  configs/
  docs/

No global mystery folder. No "where did this 17 GB file go?" moment.

Small win.

First real blocker: Docker memory

The first serious issue was not the GPU.

It was Docker memory.

Windows had about 32 GB RAM available, but Docker Desktop was exposing only about 16 GB RAM plus 4 GB swap to its Linux VM.

That mattered because my first instinct was to use:

--no-mmap
--mlock

That is a good idea when you want the model loaded into RAM instead of page-faulting from disk later.

Except the container did not have enough RAM.

It got killed.

Exit code:

Docker inspect confirmed:

OOMKilled=true

So the first fix was not glamorous:

Keep mmap enabled for the Docker path.

The "technically better" flag was wrong for the actual container memory limit.

Getting a stable stock llama.cpp server

With stock llama.cpp Docker, the model loaded and served an OpenAI-compatible endpoint.

Base URL:

http://127.0.0.1:8080/v1

The important MoE flag was:

--cpu-moe

This keeps MoE expert weights on CPU.

The model became usable, but not fast enough yet.

Baseline:

Mode	Prompt eval	Generation
`--cpu-moe`	~2.78 tok/s	~13.38 tok/s

Generation was okay. Prompt eval was painful.

Then came the next knob:

--n-cpu-moe N

This keeps the first N MoE layers on CPU and allows more expert weights to live on GPU.

Lower N usually means more GPU residency, more speed, and less VRAM headroom.

So I benchmarked it.

MoE offload tuning

Here are the useful results:

Mode	VRAM used	VRAM free	Prompt eval	Generation
`--cpu-moe`	4388 MiB	3637 MiB	2.78 tok/s	13.38 tok/s
`--n-cpu-moe 48`	4392 MiB	3633 MiB	2.51 tok/s	13.83 tok/s
`--n-cpu-moe 46`	5224 MiB	2801 MiB	6.03 tok/s	18.75 tok/s
`--n-cpu-moe 44`	5893 MiB	2132 MiB	38.36 tok/s	29.40 tok/s
`--n-cpu-moe 42`	6568 MiB	1457 MiB	44.49 tok/s	30.26 tok/s
`--n-cpu-moe 40`	7265 MiB	760 MiB	51.63 tok/s	32.49 tok/s
`--n-cpu-moe 38`	7664 MiB	361 MiB	53.14 tok/s	33.64 tok/s

The fastest tested value was:

--n-cpu-moe 38

But it only left around 361 MiB free VRAM.

Too tight.

The practical winner was:

--n-cpu-moe 40

That gave around 32.49 tok/s generation with about 760 MiB free VRAM.

At this point, I had a good local coding backend.

But I did not have the thing I actually wanted.

The real target: 262K context

Qwen3-Coder-30B-A3B supports long context natively.

The model metadata showed:

n_ctx_train = 262144

So the question became:

Can I actually run it at 262K context on 8 GB VRAM?

The stock Docker build could not get me there in the way I wanted.

I could lower KV cache precision using normal llama.cpp types like:

q8_0
q4_0
iq4_nl

But the video I had watched was talking about TurboQuant.

That was the key difference.

And this is where I almost fooled myself.

I was not actually using TurboQuant yet

I checked the stock Docker image:

docker run --rm --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda --help

The supported KV cache types were:

f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

No turbo3.

No turbo4.

No tbq3_0.

No tbq4_0.

So the answer was clear:

The stock runtime was not doing TurboQuant.

TurboQuant is not model-weight quantization. It does not require changing the GGUF model file.

It changes how the runtime stores the KV cache.

Same model.

Different runtime.

Different cache format.

That was the real pivot.

Finding a TurboQuant runtime

I found a Windows CUDA runtime build:

atomicmilkshake/llama-cpp-turboquant-binaries

The downloaded file:

llama-turboquant-triattention-win-cu13-x64.zip

I extracted it under:

runtimes/turboquant/win-cu13

Then I tried:

.\llama-server.exe --help

It failed instantly.

No useful output.

The process exit code was:

0xc0000135

That usually means a missing DLL on Windows.

The README confirmed the likely issue:

cublasLt64_13.dll

The build needed the CUDA 13 cuBLASLt runtime.

I did not want to install the full CUDA Toolkit globally just for one DLL.

So I pulled the official NVIDIA cuBLAS wheel:

python -m pip download nvidia-cublas==13.4.0.1 --only-binary=:all:

Then I extracted:

cublasLt64_13.dll

and copied it into the local runtime folder next to llama-server.exe.

After that:

.\llama-server.exe --help

worked.

And this time the cache types included:

turbo2, turbo3, turbo4

for both:

--cache-type-k
--cache-type-v

That was the moment where the setup changed from "normal llama.cpp tuning" to "actual TurboQuant path."

The final 262K launch

The final command shape was:

.\runtimes\turboquant\win-cu13\llama-server.exe `
  -m .\models\qwen3-coder-30b-a3b\Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf `
  --alias qwen3-coder-30b-a3b-turbo-262k `
  --host 127.0.0.1 `
  --port 8080 `
  --jinja `
  --gpu-layers all `
  --cpu-moe `
  --flash-attn on `
  --ctx-size 262144 `
  --cache-type-k turbo4 `
  --cache-type-v turbo3 `
  --parallel 1 `
  --batch-size 256 `
  --ubatch-size 64 `
  --temp 0.3 `
  --top-p 0.8 `
  --top-k 20 `
  --repeat-penalty 1.05 `
  --fit off `
  --cache-ram 0 `
  --no-mmap `
  --mlock

I forced:

--fit off

because I did not want llama.cpp quietly shrinking the context and pretending everything was fine.

If it loaded, it had to really load at 262144.

And it did.

The proof

The runtime logs showed:

llama_context: n_ctx         = 262144
llama_context: n_ctx_seq     = 262144
llama_context: n_batch       = 256
llama_context: n_ubatch      = 64

The KV cache line was the real proof:

llama_kv_cache: size = 5664.00 MiB (262144 cells, 48 layers, 1/1 seqs), K (turbo4): 3264.00 MiB, V (turbo3): 2400.00 MiB

VRAM after load:

7525 MiB used
500 MiB free

Very tight.

But loaded.

Then I sent a small coding prompt through the OpenAI-compatible endpoint.

It answered.

Timings:

prompt eval time = 1125.54 ms / 46 tokens = 40.87 tokens per second
eval time        = 3672.56 ms / 107 tokens = 29.13 tokens per second

That was the win.

Qwen3-Coder-30B-A3B.

262K context.

8 GB VRAM.

Local endpoint.

Same model file.

TurboQuant KV cache.

The repeatable script

I wrapped the TurboQuant launch into:

scripts/run-qwen-coder-turboquant.ps1

So the repeatable command is:

.\scripts\run-qwen-coder-turboquant.ps1 -Replace

The stock Docker fallback still exists:

.\scripts\run-qwen-coder-docker.ps1 -Profile daily-fast

The Docker route is useful for a safer daily profile.

The TurboQuant route is the full-context profile.

Important caveats

This is not magic.

The 262K profile is VRAM-tight.

It leaves roughly 500 MiB free on my RTX 3060 Ti. That means:

single client only
do not run multiple editor agents at once
close GPU-heavy apps
expect this to be less forgiving than the 32K profile

Also, I have not yet proven that this setup is great at real-world coding tasks.

The infrastructure works.

The endpoint works.

The context loads.

The smoke test passes.

But the next test is actual development work:

Can it refactor a real repo?
Can it debug Unity C# sanely?
Can it handle multi-file context without drifting?
Can it stay stable across longer sessions?

That is the next milestone.

What I learned

The big lesson is that local AI infra is not just:

download model
run server
profit

The defaults are often the bottleneck.

In this setup:

MoE placement mattered.
Docker memory limits mattered.
KV cache format mattered.
Runtime build mattered.
llama-server --help mattered a lot.

The 30B model was not the whole problem.

The runtime strategy was.

And sometimes the difference between "impossible" and "working" is one missing DLL plus the right KV cache type.

Repo

I published the setup as a GitHub repo with:

launch scripts
benchmark notes
troubleshooting docs
client settings
reproducible setup notes

GitHub link:

UpayanGhosh / local-qwen-coder-turboquant

Local Qwen3-Coder 30B TurboQuant setup for 8GB VRAM coding workflows

Local Qwen Coder TurboQuant Setup

Practical Windows setup notes and scripts for running Qwen3-Coder-30B-A3B-Instruct as a local coding-only OpenAI-compatible backend on an 8 GB NVIDIA GPU.

This repo documents the journey from a stable stock llama.cpp Docker setup to a full-context TurboQuant KV-cache runtime:

RTX 3060 Ti, 8 GB VRAM
Windows
Qwen3-Coder-30B-A3B-Instruct GGUF
MoE expert CPU/GPU residency tuning
OpenAI-compatible local endpoint
Verified 262144 context with TurboQuant KV cache

What Is Included

PowerShell scripts for launching and testing the backend
Client settings for Cline, Continue, Roo Code, OpenCode, and generic OpenAI-compatible clients
Benchmark notes
TurboQuant research and troubleshooting notes
LinkedIn post draft documenting the build story

What Is Not Included

This repo intentionally does not track:

GGUF model files
CUDA/runtime DLLs
downloaded wheels/zips
logs
local caches

Those files are large and/or machine-specific. See .gitignore.

Key Result

Verified TurboQuant profile:

Context: 262144
KV cache: K=turbo4, V=turbo3
VRAM: ~7525 MiB used /

…

View on GitHub

The repo will not include the GGUF model, CUDA DLLs, wheels, or downloaded binaries. Those are too large and machine-specific.

Closing thought

This started as:

"Can I make a useful local coding backend?"

Then it became:

"Can I get the full 262K context working on 8 GB VRAM?"

The first version merely ran.

The final version actually hit the target.

I am calling that a win.