LLMs are stateless by design. You send a message, you get a reply, and the model instantly forgets everything. Every conversation starts cold.
That's fine for one-off tasks. It's a real problem when you're building anything personal — a coding assistant that knows your stack, a writing tool that remembers your style, an agent that tracks what you've decided across sessions.
The usual answers are: roll your own RAG pipeline, use a cloud memory service, or spend a weekend stitching together embeddings, a vector database, and prompt injection logic. None of those feel like the answer.
So I built MemoryWeave — an open-source Python library that gives any LLM long-term memory in three lines of code.
```python
from memoryweave import MemoryWeave

memory = MemoryWeave()
memory.add("My name is Ravi. I prefer Python and FastAPI.")

ctx = memory.get("What stack should I recommend?")
print(ctx.summary)
# → Relevant memories:
# → - My name is Ravi. I prefer Python and FastAPI. (relevance: 0.94)
```
`ctx.summary` is a ready-to-inject string. Paste it into your system prompt. Done.
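Wiring that into a chat call by hand is one string concatenation. A minimal sketch, with the summary text hard-coded from the example output above:

```python
# ctx.summary, hard-coded here from the example output above
memory_block = (
    "Relevant memories:\n"
    "- My name is Ravi. I prefer Python and FastAPI. (relevance: 0.94)"
)

# Prepend your base instructions, append the memories, send as the system prompt
system_prompt = "You are a helpful assistant.\n\n" + memory_block
print(system_prompt)
```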
## Why not just vector search?
Most memory libraries are thin wrappers around a vector database. You embed text, store vectors, and retrieve by cosine similarity. It works, but it has a blind spot.
Vector search finds similar text. It struggles with related facts.
Say you store: "Ravi uses FastAPI" and "FastAPI uses Uvicorn". If you query "What server does Ravi use?", a pure vector search will miss the inference. The connection lives in the relationship between facts, not in any single embedding.
MemoryWeave solves this with a dual-retrieval architecture.
## How it works
Here's the full pipeline — both `add()` and `get()` in one view:

```
memory.add(text)
│
├── spaCy NLP             → extract entities + subject-verb-object facts
├── sentence-transformers → embed text → 384-dim vector
├── Vector store          → save embedding (InMemory or ChromaDB)
└── Knowledge graph       → add entities and facts as nodes/edges (NetworkX)

memory.get(query)
│
├── Embed query
├── Vector search → top-k similar memories by cosine similarity
├── Graph query   → related facts by keyword overlap
└── Ranker        → fuse scores: 0.6 × vector + 0.4 × graph → MemoryContext
```
Let's walk through each part.
### 1. NLP extraction (spaCy)
When you call memory.add(text), the first thing that happens is a spaCy pass over the raw text. It extracts:
- Named entities — people, places, organizations, tech names
- Subject-verb-object triples — structured facts like `(Ravi, prefers, Python)`
These become nodes and edges in a knowledge graph (NetworkX under the hood). This is what makes relational queries possible later.
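To make the graph side concrete, here's a toy sketch (mine, not MemoryWeave's internals) of how extracted triples become a NetworkX graph, and why relational queries then reduce to traversal:

```python
import networkx as nx

# Triples as a spaCy dependency parse might yield them (hypothetical output)
triples = [
    ("Ravi", "prefers", "Python"),
    ("Ravi", "uses", "FastAPI"),
    ("FastAPI", "uses", "Uvicorn"),
]

g = nx.DiGraph()
for subj, verb, obj in triples:
    g.add_edge(subj, obj, relation=verb)

# "What server does Ravi use?" becomes a traversal, not a similarity lookup:
print(sorted(nx.descendants(g, "Ravi")))  # ['FastAPI', 'Python', 'Uvicorn']
```

Note how `Uvicorn` is reachable from `Ravi` through `FastAPI` even though no single stored sentence mentions both — exactly the inference pure vector search misses.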
### 2. Embedding (sentence-transformers)
In parallel, the same text is embedded using all-MiniLM-L6-v2 — a compact, fast sentence-transformers model that produces 384-dimensional vectors. These go into either an in-memory store (great for development) or ChromaDB (for production, persistence across restarts).
Everything runs locally. No API keys, no data sent to any external service.
### 3. Deduplication
Before storing anything, MemoryWeave checks cosine similarity against existing embeddings. If a new entry scores ≥ 0.98 against something already stored, it's silently dropped. This keeps memory clean when the same fact gets re-added across sessions.
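The check itself is simple cosine arithmetic. A sketch of the idea with NumPy (not the library's actual code):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_duplicate(new_vec, stored_vecs, threshold=0.98):
    # Drop the new entry if it's near-identical to anything already stored
    return any(cosine(new_vec, v) >= threshold for v in stored_vecs)

stored = [np.array([1.0, 0.0, 0.0])]
print(is_duplicate(np.array([0.99, 0.01, 0.0]), stored))  # True
print(is_duplicate(np.array([0.0, 1.0, 0.0]), stored))    # False
```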
### 4. Retrieval and fusion
When you call `memory.get(query)`:

- The query is embedded with the same model
- Vector search returns the top-k most similar memories
- Graph query does a keyword-overlap walk across the knowledge graph, surfacing related facts that may not be textually similar to the query
- A weighted ranker fuses both: `final_score = 0.6 × vector_score + 0.4 × graph_score`
The weights are configurable. If your use case is mostly factual (e.g., a personal knowledge base), bump graph_weight up. If you're doing more semantic search over long-form text, keep vector weight dominant.
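The fusion step is plain arithmetic. A sketch with the default weights:

```python
def fuse(vector_score, graph_score, vector_weight=0.6, graph_weight=0.4):
    """Weighted fusion of the two retrieval scores."""
    return vector_weight * vector_score + graph_weight * graph_score

# A strong textual match with no graph support:
print(round(fuse(vector_score=0.9, graph_score=0.0), 2))  # 0.54
# A weaker textual match backed by a related graph fact:
print(round(fuse(vector_score=0.6, graph_score=0.9), 2))  # 0.72
```

With the defaults, a memory that's only moderately similar in text but well-connected in the graph can outrank a purely textual near-match.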
The result is a `MemoryContext` object:

| Field | Description |
|---|---|
| `summary` | Ready-to-inject string for your system prompt |
| `entries` | Vector search hits with scores |
| `facts` | Graph facts with scores |
| `has_results` | `False` if nothing was found |
## Plugging into OpenAI or Anthropic
MemoryWeave ships with first-class adapters for both:
```python
# OpenAI
from memoryweave.adapters.openai import OpenAIAdapter

adapter = OpenAIAdapter(memory, system_prompt="You are a helpful assistant.")
messages = adapter.prepare(messages)  # injects memory into the system prompt
# ... call OpenAI ...
adapter.remember(messages)  # stores the turn for next time
```

```python
# Anthropic
from memoryweave.adapters.anthropic import AnthropicAdapter

adapter = AnthropicAdapter(memory)
system, messages = adapter.prepare(messages)
# ... call Anthropic with system= ...
adapter.remember(messages)
```
The adapters handle prompt injection automatically. You don't touch the system prompt manually.
## Multi-user sessions
Every MemoryWeave instance is scoped to a `session_id`, so sessions never bleed into each other:

```python
from memoryweave import MemoryWeave, MemoryConfig

alice = MemoryWeave(MemoryConfig(default_session_id="alice"))
bob = MemoryWeave(MemoryConfig(default_session_id="bob"))

alice.add("Alice likes TypeScript.")
bob.add("Bob prefers Rust.")

print(alice.get("language").summary)  # → TypeScript
print(bob.get("language").summary)    # → Rust
```
## REST API + TypeScript SDK
If your app isn't Python, MemoryWeave also ships a FastAPI server and a TypeScript SDK:
```shell
# Start the server
uvicorn memoryweave.server:app --reload

# Optional: lock it with an API key
MEMORYWEAVE_API_KEY=my-secret uvicorn memoryweave.server:app
```
```typescript
import { MemoryWeave } from "@memoryweave/sdk";

const memory = new MemoryWeave({ baseUrl: "http://localhost:8000", sessionId: "user-1" });

await memory.add("Ravi prefers Python.");
const ctx = await memory.get("What language?");
console.log(ctx.summary);
```
## Current state
The library is at v1.1.0: 248 tests, 91% coverage, and a green CI matrix across Python 3.10–3.12. The full phase list:
✅ Phase 1 — Foundation
✅ Phase 2 — NLP extraction (spaCy)
✅ Phase 3 — Storage layer (vector + knowledge graph)
✅ Phase 4 — Core memory API v0.1.0
✅ Phase 5 — TypeScript SDK
✅ Phase 6 — FastAPI REST server
✅ Phase 7 — Documentation
✅ Phase 8 — Launch v1.0.0
✅ Phase 9 — Deduplication, async methods, LLM adapters, server auth v1.1.0
## What's next
A few things on the roadmap:
- Forgetting strategies — time-decay and relevance-decay so stale memories don't pollute retrieval
- Streaming support — auto-extract and store from streamed LLM responses
- Memory summaries — periodic compression of older memories into higher-level facts
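One plausible shape for time-decay — my sketch, not the planned implementation — is exponential decay with a configurable half-life, applied to retrieval scores:

```python
import time

def decayed_score(raw_score, stored_at, now, half_life_days=30.0):
    """Halve a memory's relevance every `half_life_days` (hypothetical strategy)."""
    age_days = (now - stored_at) / 86400
    return raw_score * 0.5 ** (age_days / half_life_days)

now = time.time()
print(decayed_score(0.9, stored_at=now, now=now))               # 0.9  — fresh memory
print(decayed_score(0.9, stored_at=now - 30 * 86400, now=now))  # 0.45 — one half-life old
```

Decaying scores at query time (rather than deleting rows) keeps forgetting reversible: a stale memory can still surface if nothing fresher matches.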
## Try it
```shell
pip install memoryweave
python -m spacy download en_core_web_sm
```
GitHub: github.com/ravii-k/memoryweave
If you're building something with it — or you've hit the same problem and solved it differently — I'd genuinely like to hear about it in the comments.