DEV Community: Ritvika Mishra

I built an external context layer for AI agents - most of it already exists, here's what doesn't

Ritvika Mishra — Wed, 20 May 2026 02:48:06 +0000

Imagine you're deep into a brainstorming session with an AI, going back and forth for an hour - and then the free tier rate limit hits. Now you have to switch to another AI and re-explain everything from scratch. The context, the problem, what you've tried, where you're stuck. All of it.

Because you're the only one who knows what you're working on, to query any AI assistant you have to carry all that knowledge alone, every single time.

So two weeks ago, during an offline hackathon, I built a small version of my idea: Meniscus, a layer that sits outside your tools and holds that knowledge for you. One shared picture of your current working state: what you've tried, what's blocking you, what's been decided. Any AI you switch to reads from it instead of starting from zero.

The core idea:

Context shouldn't live inside individual tools. It should live outside as a separate external layer and AI tools should read from it.

Right now, you're the only one who knows what you're working on, so every time you query an AI assistant you have to carry all that knowledge alone. Meniscus is the layer that finally holds that knowledge for you.

It captures user activity across tools, structures it into threads - which are the core working units representing what the user is actively doing - and lets you retrieve relevant context as a subgraph instead of raw history.

The architecture consists of three primitives:

An event is the atomic unit - an immutable, timestamped record of one thing you did. Asked ChatGPT something, watched a YouTube video, made a GitHub commit, updated a Notion page -- each one captured, normalized, stored.
An entity is a meaningful concept extracted from an event. Not the full text -- just the signal. "JWT", "middleware", "refresh token", "auth" - the keywords that actually tell you what the event was about.
A thread is a cluster of related events - not similar in the textual sense, but connected by shared work context. "I am debugging a JWT auth bug" is a thread. It spans a GitHub commit, a ChatGPT conversation, a Notion architecture doc, a YouTube video on token expiry. Individually those events look unrelated; together they're one line of work.
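To make the three primitives concrete, here is a minimal sketch of how they could be modelled. This is not the code from the repo: the class names, field names, and the choice to store entities as a plain set of keywords on each event are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: an event cannot be modified once recorded
class Event:
    id: str
    source: str                        # e.g. "chatgpt", "github", "youtube", "notion"
    text: str                          # normalized content of the activity
    timestamp: datetime
    entities: frozenset = frozenset()  # extracted keywords, e.g. {"jwt", "auth"}


@dataclass
class Thread:
    id: str
    label: str                                   # e.g. "debugging a JWT auth bug"
    event_ids: list = field(default_factory=list)
    entities: set = field(default_factory=set)   # union of member-event entities

    def add(self, event: Event) -> None:
        """Attach an event and fold its entities into the thread."""
        self.event_ids.append(event.id)
        self.entities |= event.entities


# Two events from different tools that belong to the same line of work.
commit = Event("e1", "github", "fix refresh token expiry check",
               datetime(2026, 5, 1, 10, 0, tzinfo=timezone.utc),
               frozenset({"jwt", "refresh token"}))
chat = Event("e2", "chatgpt", "why does my middleware reject the token?",
             datetime(2026, 5, 1, 10, 30, tzinfo=timezone.utc),
             frozenset({"jwt", "middleware"}))

thread = Thread("t1", "debugging a JWT auth bug")
thread.add(commit)
thread.add(chat)
```

The event is frozen because it is meant to be an immutable record; the thread stays mutable because it keeps growing as new events arrive.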

The pipeline goes like this:

→ activity comes in
→ entities get extracted
→ each new event gets compared against existing threads by entity overlap and temporal proximity
→ the event gets assigned to the best-matching thread, or a new thread gets created
→ the whole thing is stored as a graph with explicit edges between events, entities, and threads.
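The assignment step can be sketched as a scoring function. The specifics here are my assumptions for illustration, not the repo's logic: Jaccard overlap for the entity comparison, an exponential decay with a 12-hour half-life for temporal proximity, and a threshold of 0.2 below which a new thread should be created.

```python
from datetime import datetime, timedelta
from typing import Optional


def jaccard(a: set, b: set) -> float:
    """Entity overlap: shared entities divided by all entities."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_score(event_entities: set, event_time: datetime,
                thread_entities: set, thread_last_seen: datetime,
                half_life: timedelta = timedelta(hours=12)) -> float:
    """Overlap weighted by recency; the weight halves every `half_life`."""
    gap = abs(event_time - thread_last_seen)
    recency = 0.5 ** (gap / half_life)  # timedelta / timedelta gives a float
    return jaccard(event_entities, thread_entities) * recency


def assign(event_entities: set, event_time: datetime,
           threads: dict, threshold: float = 0.2) -> Optional[str]:
    """Return the best-matching thread id, or None if a new thread is needed.

    `threads` maps thread id -> (entity set, time of the thread's last event).
    """
    best_id, best_score = None, 0.0
    for thread_id, (entities, last_seen) in threads.items():
        score = match_score(event_entities, event_time, entities, last_seen)
        if score > best_score:
            best_id, best_score = thread_id, score
    return best_id if best_score >= threshold else None


threads = {
    "jwt-bug": ({"jwt", "auth", "middleware"}, datetime(2026, 5, 1, 10, 0)),
    "ml-course": ({"gradient descent", "backprop"}, datetime(2026, 5, 1, 9, 0)),
}
match = assign({"jwt", "refresh token", "auth"}, datetime(2026, 5, 1, 11, 0), threads)
no_match = assign({"docker"}, datetime(2026, 5, 1, 11, 0), threads)
```

Here `match` comes back as the JWT thread, and `no_match` is `None`, which is the signal to create a new thread.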

When an agent queries Meniscus, it doesn't get a raw dump of your history. It gets a subgraph - the relevant thread, its events, its entities. A bounded, structured slice of context instead of everything at once. The agent injects that into its prompt and answers grounded in your actual work.
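A rough sketch of what "returning a subgraph" could look like, assuming the graph is stored as a flat list of `(source, relation, target)` edges. The edge format, the relation names `contains` and `mentions`, and the `max_events` cap are all hypothetical.

```python
def thread_subgraph(edges: list, thread_id: str, max_events: int = 5) -> dict:
    """Collect one thread's events and the entities those events mention."""
    events = [dst for src, rel, dst in edges
              if src == thread_id and rel == "contains"]
    events = events[-max_events:]  # bounded: keep the latest, assuming insertion order
    kept = set(events)
    entities = sorted({dst for src, rel, dst in edges
                       if rel == "mentions" and src in kept})
    return {"thread": thread_id, "events": events, "entities": entities}


def to_prompt(subgraph: dict) -> str:
    """Render the slice as text an agent could prepend to its prompt."""
    return (f"Thread: {subgraph['thread']}\n"
            f"Events: {', '.join(subgraph['events'])}\n"
            f"Entities: {', '.join(subgraph['entities'])}")


edges = [
    ("t1", "contains", "e1"), ("t1", "contains", "e2"),
    ("t2", "contains", "e3"),
    ("e1", "mentions", "jwt"), ("e2", "mentions", "auth"),
    ("e2", "mentions", "jwt"), ("e3", "mentions", "pytorch"),
]
context = thread_subgraph(edges, "t1")
```

Nothing from thread `t2` leaks into `context`, which is the point: the agent sees a slice, not the whole history.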

...

For the demo of the project, here's what I did:

Simulated events from ChatGPT, YouTube and GitHub -- 4 threads from a realistic day of work, 12 events total.
The query system routes through three modes: Retrieve (traverses the graph through the entities) --> Overview (cross-thread summary of what you've been doing) --> General (conversational; if there's nothing relevant to retrieve, it says "i don't know").
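One way the three-mode routing could work, as a sketch. I am assuming simple substring checks here: a query that mentions a known entity goes to Retrieve, a query with an "overview" cue goes to Overview, and everything else falls through to General. The cue list and the function name are made up for illustration; the demo's real routing may differ.

```python
# Hypothetical phrases that signal a cross-thread summary request.
OVERVIEW_CUES = ("working on", "been doing", "summary", "overview")


def route(query: str, known_entities: set) -> str:
    """Pick a mode: 'retrieve', then 'overview', then 'general' as fallback."""
    q = query.lower()
    if any(entity in q for entity in known_entities):
        return "retrieve"   # the query touches an entity in the graph
    if any(cue in q for cue in OVERVIEW_CUES):
        return "overview"   # the query asks for a cross-thread summary
    return "general"        # nothing relevant to retrieve


entities = {"jwt", "middleware"}
modes = [
    route("Why does my JWT refresh fail?", entities),
    route("What have I been working on today?", entities),
    route("What's the capital of France?", entities),
]
```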

Whenever I start a project, I like to think along the lines of SHOULDs -- how something should be done -- and question my own decisions aggressively at every step before I arrive at an architecture that I deem good enough.

What was shown in the demo was only a small part of the implementation I had planned. Two days after the hackathon ended, I decided to sit down with my project once again and finish the rest. However, I found some major gaps and realized that what I was doing was not really different from what already exists - Zep, Mem0, Supermemory, Rewind, etc.

Most of what I built is already there, and in better shape than I could ship.

External memory layers, graph storage, episodic retrieval, agent APIs - these are solved or being actively solved by well-funded teams. So there's no point in redoing the same work.

However, there's one specific architectural component that is the only differentiating factor, and there's an open question around it that I need to research thoroughly before coming to a conclusion.

Most existing systems retrieve primarily by similarity - cosine distance, semantic search, ranked chunks. You ask a question, and the system finds the most textually similar pieces of your history and hands them back.

But "what am I currently working on" isn't a similarity problem. It's a working state problem. The agent doesn't need the most similar chunks. It needs the current state of an ongoing task - what the goal is, what's been tried, what's blocking progress, what's been decided. Those are different questions, and similarity search doesn't answer them cleanly.
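To show what a "working state" object might hold, here is a small sketch built from the four questions above: goal, tried, blocked, decided. The class name `ThreadState`, its fields, and the rendering format are my assumptions, not a finished design.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadState:
    goal: str
    tried: list = field(default_factory=list)
    blockers: list = field(default_factory=list)
    decisions: list = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the state as a compact block for an agent's prompt."""
        lines = [f"Goal: {self.goal}"]
        sections = (("Tried", self.tried),
                    ("Blocked by", self.blockers),
                    ("Decided", self.decisions))
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items or ["(none)"])
        return "\n".join(lines)


state = ThreadState(
    goal="fix JWT refresh bug",
    tried=["extended token expiry"],
    blockers=["middleware rejects refreshed token"],
)
packet = state.to_prompt()
```

The packet answers "where am I on this task" directly, instead of leaving the agent to infer it from a pile of similar-looking chunks.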

Then the obvious question is - what about just dumping everything into a long context window? Frontier models like Gemini and Claude have massive context windows. Why not hand them your entire history and let them figure out the working state themselves?

And honestly, they'd do a decent job. Give Claude enough of your activity and it can synthesize "what you're working on" reasonably well.
But there are three problems:

1st, cost. Sending hundreds of thousands of tokens on every single query isn't free at scale.
2nd, the lost-in-the-middle problem - it is empirically documented that models perform worse on information buried in the middle of long contexts. More tokens don't mean better reasoning over those tokens.
3rd, even if the model synthesizes working state correctly from raw history, it's doing that work fresh every single time you query it. Meniscus does it once and maintains it continuously. The synthesis is already done when the agent needs it.

Here are two hypotheses -

Thread-state packet retrieval produces better agent answers to active working state queries than hybrid search.
Thread-state packet retrieval injects fewer tokens for the same query - because a structured state object already exists for the agent to retrieve.

I have guesses about the answers, but I need to be sure. The honest thing to do is build a benchmark, compare thread-state packet retrieval against state-of-the-art retrieval methods on active working state queries, measure token count and answer quality, and write about what I find.
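The benchmark could start from a harness as small as the sketch below. Everything in it is an assumption: a retriever is any function from a query to a context string, token count is approximated by whitespace splitting (a real run would use the target model's tokenizer), and answer quality is delegated to a `score` function that still has to be designed.

```python
from statistics import mean
from typing import Callable

Retriever = Callable[[str], str]      # query -> context injected into the prompt
Scorer = Callable[[str, str], float]  # (query, context) -> answer-quality score


def approx_tokens(text: str) -> int:
    """Crude token proxy: whitespace-separated words."""
    return len(text.split())


def compare(queries: list, retrievers: dict, score: Scorer) -> dict:
    """Run every retriever on every query; report average tokens and score."""
    report = {}
    for name, retrieve in retrievers.items():
        contexts = [retrieve(q) for q in queries]
        report[name] = {
            "avg_tokens": mean(approx_tokens(c) for c in contexts),
            "avg_score": mean(score(q, c) for q, c in zip(queries, contexts)),
        }
    return report
```

Calling `compare(queries, {"thread_state": ..., "hybrid": ...}, score)` would give one row per method, which maps directly onto the two hypotheses: `avg_score` for the first, `avg_tokens` for the second.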

Thanks for reading :)

github repo: https://github.com/magic-bubblez/meniscus-

Local Voice Controlled AI Agent

Ritvika Mishra — Wed, 15 Apr 2026 10:13:59 +0000

a voice controlled agent is a piece of software that sits on your machine, listens to what you say, and actually does things in response. not a chatbot that talks back — an agent that acts. it hears intent, decides what to do, and does the thing. speech goes in, the filesystem moves, apps open, files get written, screens get captured. the gap between a thought and a side effect on your computer shrinks to a sentence.

i decided to build one of my own. here's what it can do right now:

create files and folders, sandboxed to a dedicated output directory so nothing on the machine gets stomped
generate code from a spoken description — "write a python function that reverses a string and save it as reverse.py" — and pop the file straight open in VS Code so you can see what was made
open apps on the device — Preview, Spotify, Terminal, Firefox, Chrome, whatever's installed — aliases and fuzzy names handled
open websites in the browser, with or without a specific browser named; search Google when asked to "look something up online"
open local files by searching for them via macOS Spotlight — "open the DDIA pdf" finds it wherever it lives
take screenshots — full screen, a window you pick, or a region you drag — auto-opens in Preview
summarize local files by actually reading them (pypdf for PDFs, plain read for text) and feeding the contents to the model
general chat when you just want to ask it something and have it respond in words
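the sandboxing in the first item can be sketched with pathlib. this is an illustration of the idea, not the project's actual check: the function name `resolve_in_sandbox` and the choice to raise `PermissionError` are assumptions.

```python
from pathlib import Path


def resolve_in_sandbox(sandbox: Path, requested: str) -> Path:
    """Resolve a requested path and refuse anything outside the sandbox."""
    root = sandbox.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check;
    # an absolute `requested` replaces `root` entirely and is caught below
    target = (root / requested).resolve()
    if target != root and root not in target.parents:
        raise PermissionError(f"{requested!r} escapes the sandbox")
    return target


output_dir = Path("output")  # hypothetical name for the dedicated directory
safe = resolve_in_sandbox(output_dir, "notes/reverse.py")
```

a request like `../secret.txt` or `/etc/passwd` resolves to somewhere outside the root and gets rejected before any handler touches the disk.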

baby steps. nothing close to what i imagine an ideal version looking like — some hardware constraints and a few deliberate tradeoffs carved the scope down to what you see. the sections below are a slow ramble on how the thing got constructed; the questions that surfaced while building, the choices that were made, the moments where self-posed questions looped until something clicked.

i like thinking by asking myself questions and answering them and re-questioning the answers — repeating that until there's some satisfaction (which, there never really is).

implementation was the comparatively easy part (much love to my dear claudette!). the actually interesting work was in the shape of the design, the thinking, and the elaborate planning.

the shape of the thing

the whole system is a pipeline. five layers. audio comes in one end, actions come out the other, and every layer in between does exactly one job and hands its output to the next.

audio bytes
→ [speech-to-text] → text string
→ [intent classifier] → structured JSON plan
→ [orchestrator] → handler calls
→ [tool handlers] → side effects on the machine
→ [UI] → displays every stage back to you

each arrow is a contract. "i promise to give you something of this shape, and you promise to produce something of that shape. neither of us cares how the other works internally." those contracts are the whole point. they're what lets the pipeline be swappable — pull out any single box, replace its implementation, and nothing else in the system notices.
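a sketch of what those contracts could look like in code, assuming each stage is just a function with a fixed input and output shape. the type aliases, `run_pipeline`, and the stub stages are hypothetical names, not the project's.

```python
from typing import Callable

Transcriber = Callable[[str], str]   # audio path -> text
Classifier = Callable[[str], list]   # text -> plan (list of action dicts)
Executor = Callable[[list], list]    # plan -> results


def run_pipeline(audio_path: str, transcribe: Transcriber,
                 classify: Classifier, execute: Executor) -> dict:
    """Chain the stages; each one only sees the previous stage's output."""
    text = transcribe(audio_path)
    plan = classify(text)
    results = execute(plan)
    return {"transcription": text, "plan": plan, "results": results}


# stub stages: any function with the same shape can be swapped in
def stub_transcribe(path: str) -> str:
    return "create a file called notes.txt"


def stub_classify(text: str) -> list:
    return [{"action": "create_file", "params": {"path": "notes.txt"}}]


def stub_execute(plan: list) -> list:
    return [f"ran {step['action']}" for step in plan]


outcome = run_pipeline("clip.wav", stub_transcribe, stub_classify, stub_execute)
```

replacing `stub_transcribe` with a real speech-to-text function changes nothing else in `run_pipeline`, which is the swappability the contracts buy.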

now layer by layer.

1. audio input

input capture. the browser already knows how to do this — a microphone button captures audio, an upload field takes a file. the web ui hands off bytes. no custom audio code to write. running in a browser means the browser solves threading, permissions, and buffering — problems that don't belong to the agent.

2. speech-to-text

bytes become words. a whisper model runs locally (faster-whisper, specifically — the CTranslate2-backed variant, which avoids pulling in a heavy deep-learning framework just to run inference). input: audio path. output: a string. stateless. deterministic-ish. no memory across calls. this layer is essentially a pure function.
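a minimal sketch of this layer using faster-whisper's `WhisperModel`. the model size (`base`), the cpu device, and the `int8` compute type are assumptions picked for a modest machine, not necessarily what the project runs.

```python
def load_model(size: str = "base"):
    """Load a local whisper model (size, device, compute type are assumed)."""
    # imported lazily so the rest of the module loads without the dependency
    from faster_whisper import WhisperModel
    return WhisperModel(size, device="cpu", compute_type="int8")


def transcribe(model, audio_path: str) -> str:
    """Audio path in, plain string out. No state is kept between calls."""
    # model.transcribe returns a lazy iterator of segments plus an info object
    segments, _info = model.transcribe(audio_path)
    return " ".join(segment.text.strip() for segment in segments)
```

usage would be `transcribe(load_model(), "clip.wav")`. passing the model in as an argument keeps `transcribe` close to the pure function described above, and loading it once avoids paying the startup cost on every call.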

3. intent classifier

the only layer where real intelligence lives. a local llm reads the transcribed text and produces a structured JSON plan — a list of actions with parameters. this is the hard layer. natural language is infinite in its phrasings; the output has to be a small, bounded, structured thing. the llm is doing compression here — from an unbounded input space to a finite one.

the classifier does not execute anything. it does not touch the filesystem. it does not know how any of the tools work internally. it only describes what the user wants in a format the rest of the system can read. the understanding of language is cleanly separated from the taking of action.
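since the classifier's output is the contract everything downstream relies on, it helps to validate it before anything runs. a sketch, assuming the plan is a JSON list of objects with an `action` string and an optional `params` object; that schema and the `parse_plan` name are my guesses, and the action names are taken from the registry in the orchestrator section.

```python
import json

ALLOWED_ACTIONS = {"create_file", "write_code", "summarize",
                   "open_app", "screenshot", "general_chat"}


def parse_plan(raw: str) -> list:
    """Parse the llm's raw output and reject anything outside the bounded set."""
    plan = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(plan, list):
        raise ValueError("plan must be a JSON list of actions")
    for step in plan:
        action = step.get("action") if isinstance(step, dict) else None
        if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown or malformed action: {step!r}")
        step.setdefault("params", {})  # handlers can rely on params existing
    return plan


raw_output = ('[{"action": "write_code", "params": {"filename": "reverse.py"}},'
              ' {"action": "open_app"}]')
plan = parse_plan(raw_output)
```

the validator is where "unbounded input, finite output" gets enforced: whatever the llm says, only the known action types make it through.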

4. orchestrator

the interface between intent and action. the layer that takes the plan and makes it happen. zero intelligence here. no llm calls. the orchestrator is a table lookup:

  # dispatch table: action type from the plan -> handler function that runs it
  REGISTRY = {
      "create_file":  handle_filesystem,
      "write_code":   handle_llm_generate,
      "summarize":    handle_llm_generate,
      "open_app":     handle_open_app,
      "screenshot":   handle_screenshot,
      "general_chat": handle_llm_generate,
  }

action type goes in, handler function comes out. call it. collect the result. move to the next action. when actions depend on each other (generate code before saving it to a file), a tiny set of ordering rules reshuffles the list. that's the entire orchestrator.

this is called a registry pattern. dispatch tables instead of if/elif chains. the benefit is quiet but enormous — adding a new capability is one entry in the table. the orchestrator itself never changes when the system grows.
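a runnable sketch of the dispatch loop, with stub handlers standing in for the real ones. the `PRIORITY` table is my assumption about what an ordering rule could look like (generation before file creation); the handler bodies and the "no handler" message are placeholders.

```python
def handle_filesystem(params: dict) -> str:
    return f"created {params['path']}"           # stub: no real file is written


def handle_llm_generate(params: dict) -> str:
    return f"generated: {params['prompt']}"      # stub: no real llm call


REGISTRY = {
    "create_file": handle_filesystem,
    "write_code": handle_llm_generate,
}

# assumed ordering rule: lower numbers run first, so code exists before saving
PRIORITY = {"write_code": 0, "create_file": 1}


def run_plan(plan: list) -> list:
    """Order the actions, look each one up in the table, call its handler."""
    # sorted() is stable, so actions with equal priority keep their plan order
    ordered = sorted(plan, key=lambda step: PRIORITY.get(step["action"], 0))
    results = []
    for step in ordered:
        handler = REGISTRY.get(step["action"])
        if handler is None:
            results.append(f"no handler for {step['action']}")
            continue
        results.append(handler(step.get("params", {})))
    return results


spoken_plan = [
    {"action": "create_file", "params": {"path": "reverse.py"}},
    {"action": "write_code", "params": {"prompt": "reverse a string"}},
]
results = run_plan(spoken_plan)
```

adding a capability is one more line, for example `REGISTRY["open_app"] = handle_open_app`, and `run_plan` stays untouched.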

5. the UI

a thin layer. reads the results of the pipeline and lays them out: transcription, detected intent, actions taken, final output, per-stage timing, session history. the ui has no opinions about what anything means — it just formats whatever the pipeline produces.

intents, tools, handlers — three words, three meanings

this distinction unlocks a lot of the design:

intent: what the user wants. user-facing. effectively infinite — every new phrasing is a new intent.
tool (or capability): what the system can do. system-facing. finite. small.
handler: the code that implements a tool. one python function.

a single intent might need multiple tools. many intents collapse onto a single tool. the classifier is what maps from the infinite space of possible intents to the finite space of tools. without that mapping layer, you'd have to write a handler for every way a user could phrase something — which is impossible, because the set of phrasings is unbounded.

compress where the complexity actually is. language is complex; keep it in one layer. execution is simple; keep it in another. don't spread intelligence across the system.

what makes a design good

a few questions worth asking any system you build:

can a new capability be added without touching existing code?
is each capability isolated — can one be removed without breaking others?
can any implementation be swapped without cascading changes through the rest of the system?
is there a single place where a given concern lives — not smeared across multiple layers?
do the boundaries between layers stay clean — no internals of one layer leaking into another?

these aren't independent. they compound. clean interfaces enable isolation. isolation enables swapping. swapping enables growth. a system that answers "yes" to all five can be modified without a rewrite, and that property — the cost of change — is the actual test of whether the design is doing its job.

the opposite property has a name too: abstraction leakage. when one layer has to know the internals of another to function, the boundary has been violated. every violation makes the next change a little harder. accumulate enough of them and you can't reason about any piece in isolation — the whole system becomes one tangled thing, and adding a capability means understanding all of it.

thanks for reading ^^