DEV Community

Gokul Jinu


Why we built tag-graph memory for AI agents — and shipped a Python SDK for it

I spent most of last year trying to solve a deceptively narrow problem: how do you give an LLM agent persistent memory that's bounded, predictable, and doesn't blow your token bill?

I tried a lot of things. Vector DBs gave me fuzzy results that were impossible to token-budget. Raw conversation history blew past context windows in 5 turns. "Summarize and re-inject" silently dropped the one fact the agent needed three turns later.

Today I shipped the first Python SDK for what we ended up building — MME (Memory Management Engine). It's a bounded tag-graph memory engine, and it's a different shape of memory than the vector-DB-by-default story you hear everywhere.

This is a writeup of the design choices, why each one matters in production, and what's in the SDK if you want to try it.

The three problems with vector-search agent memory

Vector retrieval is the default answer to "how do I give my agent memory" because embeddings are universal and pgvector / Pinecone / Weaviate are easy to host. But for agent memory specifically (as opposed to RAG over documents), three things keep biting you:

1. You can't token-budget the result

You ask for top-K = 5 documents. You get five chunks back. Each chunk could be 80 tokens or 800 tokens. You don't know until you tokenize the response. So either you over-budget (waste money on every call) or under-budget (truncate mid-sentence and the LLM gets garbage).
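The mismatch is easy to see in a toy example: fixing K bounds the number of chunks, not the tokens (a whitespace token count stands in here for a real tokenizer):

```python
# Two chunks occupying the same "top-K slot" with wildly different token cost.
chunks = [
    "Short note.",
    "A much longer chunk " * 40,
]
token_counts = [len(c.split()) for c in chunks]
total = sum(token_counts)  # top-K gave you K chunks, not a token bound
```

With K fixed, `total` can swing by an order of magnitude call to call, which is exactly why a fixed per-call budget can't be guaranteed.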

2. Cosine similarity rewards the wrong things

For a question like "what are my food preferences?", a chunk containing the literal phrase "food preferences" beats a chunk that says "I'm allergic to peanuts and I prefer dark chocolate" — even though the second chunk is what you actually want. Embeddings encode lexical similarity at least as much as semantic relevance.

3. There's no learning loop

Every retrieval is independent. The system never improves from "this pack worked for that query" feedback. To improve, you re-train embeddings or re-chunk — both are heavy ops you don't run on every accepted pack.

The shape that worked: a bounded tag-graph

The core idea: instead of embeddings, store memories as structured tag sets, and retrieve by walking a graph from query tags to memory tags.

When you save a memory like "I prefer dark chocolate", MME extracts a small set of structured tags — food, preference, dark_chocolate, food_item — with weights. These tags become nodes in a graph; their co-occurrence creates weighted edges.
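As an illustration (the specific weights and the min-based edge rule here are made up for the sketch; the actual tags and weights come from MME's server-side tagger), the extracted tag set and its induced edges might look like:

```python
from itertools import combinations

# Hypothetical shape of an extracted tag set: tag -> weight.
memory_tags = {
    "food": 0.9,
    "preference": 0.8,
    "dark_chocolate": 1.0,
    "food_item": 0.6,
}

# Tags that co-occur on the same memory become connected nodes,
# so each pair contributes a weighted edge to the graph.
edges = {
    pair: min(memory_tags[pair[0]], memory_tags[pair[1]])
    for pair in combinations(memory_tags, 2)
}
```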

When you query "what are my food preferences?", MME does the same tag extraction on the query (yielding seed tags S like food, preference), then walks the graph:

  • From each seed, follow up to M = 32 highest-weight edges
  • Repeat to depth D = 2 with a decay factor α applied per hop
  • Trim the activated tag set to a beam width B = 128
  • Find memories whose tags are in the activated set
  • Score by activation × recency × importance − diversity penalty
  • Pack greedily until the token budget is hit (exact tiktoken count)

The bound is mathematical: O(|S| · M^D) tags activated, hard cap at beam width. In practice this gives p95 latency of 135 ms across a 25-minute soak of 150K requests with 0% errors. (I obsessed over this; the bounds aren't decorative.)
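The walk above can be sketched in plain Python. This is a toy in-memory version under assumed semantics (max-propagation of activation, per-hop decay); the real engine's data structures and scoring live server-side:

```python
from heapq import nlargest

def activate(graph, seeds, M=32, D=2, alpha=0.5, B=128):
    """Bounded spreading activation: from each seed tag, follow up to M
    highest-weight edges, repeat to depth D with per-hop decay alpha,
    and trim each frontier to beam width B."""
    # graph: tag -> list of (neighbor_tag, edge_weight)
    activation = {tag: 1.0 for tag in seeds}
    frontier = dict(activation)
    for _ in range(D):
        nxt = {}
        for tag, act in frontier.items():
            # follow at most the M strongest edges out of this tag
            for nbr, w in nlargest(M, graph.get(tag, []), key=lambda e: e[1]):
                contrib = act * w * alpha  # decay per hop
                if contrib > nxt.get(nbr, 0.0):
                    nxt[nbr] = contrib
        # beam trim: keep only the B strongest newly activated tags
        frontier = dict(nlargest(B, nxt.items(), key=lambda kv: kv[1]))
        for tag, act in frontier.items():
            activation[tag] = max(activation.get(tag, 0.0), act)
    return activation

# Toy graph over the tags from the earlier example.
graph = {
    "food": [("preference", 0.9), ("dark_chocolate", 0.7)],
    "preference": [("dark_chocolate", 0.8)],
}
acts = activate(graph, ["food"], M=2, D=2, alpha=0.5)
```

Per iteration each frontier tag expands at most M edges and the frontier is capped at B, which is where the O(|S| · M^D) bound with a hard beam cap comes from.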

Why each piece matters

Bounded propagation. Without a depth + beam cap, graph walks degenerate to "activate everything" on dense graphs. The cap means latency is predictable regardless of graph size. This is the single biggest reason MME is production-runnable.

Token-budgeted packs. The packer is a hard constraint, not a soft target. You ask for 1024 tokens, you get ≤ 1024. Items that don't fit are skipped, not truncated. This means you can prompt-engineer with confidence: your context window allocation is real.
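A minimal version of that packing rule (a whitespace token count stands in here for the exact tiktoken count MME uses; item shape and scores are illustrative):

```python
def pack(items, token_budget, count_tokens=lambda s: len(s.split())):
    """Greedy hard-budget packer: items are (score, text), taken
    highest-score first; items that don't fit are skipped whole,
    never truncated."""
    chosen, used = [], 0
    for score, text in sorted(items, key=lambda it: it[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= token_budget:
            chosen.append(text)
            used += cost
    return chosen, used

items = [
    (0.9, "I'm allergic to peanuts."),
    (0.8, "I prefer dark chocolate."),
    (0.3, "A long rambling note that would not fit in a small budget at all"),
]
chosen, used = pack(items, token_budget=9)
```

The invariant is `used <= token_budget` on every return, which is what makes a fixed context-window allocation trustworthy.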

Online learning. When the agent accepts a pack and the downstream call succeeds, MME updates edge weights via EMA from the feedback signal. After a few hundred accepted packs, the graph self-tunes to your usage patterns. No retraining, no embedding refreshes, no offline pipeline.
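The update itself is just an exponential moving average per edge (β here is an assumed learning rate, and the signal shaping is internal to MME; this is only the general form):

```python
def ema_update(weight, signal, beta=0.1):
    """Move an edge weight toward the feedback signal.
    signal might be 1.0 for an accepted pack, 0.0 for a rejected one."""
    return (1 - beta) * weight + beta * signal

w = 0.5
for _ in range(3):  # three accepted packs in a row
    w = ema_update(w, 1.0)
```

Each accepted pack nudges the edges that produced it upward, so frequently useful paths get cheaper to find over time, with no offline retraining step.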

Online tagging. New memories get tagged at write time by a small LLM-backed tagger that knows the existing tag vocabulary. Tags are reused where possible (the dark_chocolate tag persists across users in the same scope), so the graph densifies as you save more memories — which is what you want.
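The reuse step can be thought of as canonicalizing candidate tags against the existing vocabulary before minting new ones (a rough sketch under assumed normalization rules; MME's tagger is LLM-backed and server-side):

```python
def canonicalize(candidates, vocab):
    """Reuse an existing tag when a candidate matches it after
    normalization; otherwise add the candidate as a new tag."""
    out = []
    for tag in candidates:
        norm = tag.strip().lower().replace(" ", "_")
        if norm not in vocab:
            vocab.add(norm)   # graph grows a new node
        out.append(norm)      # reused tags densify the graph around them
    return out

vocab = {"food", "preference", "dark_chocolate"}
tags = canonicalize(["Dark Chocolate", "allergy"], vocab)
```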

The Python SDK (shipped today)

pip install railtech-mme

Three calls cover 90% of what you'll do:

from railtech_mme import MME

with MME() as mme:
    # Save a memory — auto-tagged on the server
    mme.save("I prefer dark chocolate.")
    mme.save("I'm allergic to peanuts.")

    # Inject — get a token-budgeted pack
    pack = mme.inject(
        "What are my food preferences and allergies?",
        token_budget=1024,
    )
    for item in pack.items:
        print(item.excerpt)

    # Feedback — close the learning loop
    mme.feedback(pack_id=pack.pack_id, accepted=True)

There's a parallel AsyncMME for async stacks, full Pydantic models on every request/response, and an exception taxonomy (MMEAuthError, MMERateLimitError, MMETimeoutError, etc.) so you can write proper error handling.

LangChain integration

The integration is a first-class extra, not a wrapper:

pip install 'railtech-mme[langchain]'

from railtech_mme.langchain import MMEInjectTool, MMESaveTool
from langgraph.prebuilt import create_react_agent

# llm is any LangChain chat model you already have configured
tools = [MMEInjectTool(), MMESaveTool()]
agent = create_react_agent(llm, tools)

Both tools have proper Pydantic schemas, so the LLM sees clean parameter descriptions when deciding whether to call them. MMEInjectTool returns a token-budgeted pack; MMESaveTool lets the agent persist new memories with optional section/source tags.

What's not yet there (honest beat)

  • The SDK is one day old. v0.1.0 shipped yesterday; v0.1.1 today after end-to-end verification surfaced two real bugs (recent() was crashing on real responses, and the README quickstart was using a paraphrase that didn't activate the cold-start tag graph). Both fixed.
  • Docs are minimal. The README has a quickstart and the dashboard at mme.railtech.io has a Python section, but you'll find missing pieces; please open issues.
  • The backend has been in production for ~6 months, so the server is mature. The Python client is what's new and what I'd love feedback on.
  • LangGraph examples beyond basic tool-binding aren't in the repo yet. They're next.

Why I'm sharing this

Two reasons:

First, I think tag-graph memory is genuinely a different design point than vector search, and I'd like more people to push on it — find where it breaks, find where it shines. The math is in the README; the code is Apache-2.0 on GitHub.

Second, this is launch day for the Python SDK. If you're building agents in Python and you've felt the pain of "my agent doesn't remember things well," I'd love it if you tried it and told me what's clunky about the API.


Happy to answer technical questions in the comments — the bounded retrieval math, the LangChain tool design, why we chose tag-graph over hybrid vector+keyword, or anything about the prod observability stack.
