LLMs are stateless by design. You send a message, you get a reply, and the model instantly forgets everything. Every conversation starts cold.
That's fine for one-off tasks. It's a real problem when you're building anything personal — a coding assistant that knows your stack, a writing tool that remembers your style, an agent that tracks what you've decided across sessions.
The usual answers are: roll your own RAG pipeline, use a cloud memory service, or spend a weekend stitching together embeddings, a vector database, and prompt injection logic. None of those feel like the answer.
So I built MemoryWeave — an open-source Python library that gives any LLM long-term memory in three lines of code.
```python
from memoryweave import MemoryWeave

memory = MemoryWeave()
memory.add("My name is Ravi. I prefer Python and FastAPI.")

ctx = memory.get("What stack should I recommend?")
print(ctx.summary)
# → Relevant memories:
# → - My name is Ravi. I prefer Python and FastAPI. (relevance: 0.94)
```
`ctx.summary` is a ready-to-inject string. Paste it into your system prompt. Done.
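Wiring that into a chat call by hand is one string concatenation. A minimal sketch, with the summary text hard-coded from the example output above:

```python
# ctx.summary, hard-coded here from the example output above
memory_block = (
    "Relevant memories:\n"
    "- My name is Ravi. I prefer Python and FastAPI. (relevance: 0.94)"
)

# Prepend your base instructions, append the memories, send as the system prompt
system_prompt = "You are a helpful assistant.\n\n" + memory_block
print(system_prompt)
```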
## Why not just vector search?
Most memory libraries are thin wrappers around a vector database. You embed text, store vectors, and retrieve by cosine similarity. It works, but it has a blind spot.
Vector search finds similar text. It struggles with related facts.
Say you store: "Ravi uses FastAPI" and "FastAPI uses Uvicorn". If you query "What server does Ravi use?", a pure vector search will miss the inference. The connection lives in the relationship between facts, not in any single embedding.
MemoryWeave solves this with a dual-retrieval architecture.
## How it works
Here's the full pipeline — both `add()` and `get()` in one view:

```
memory.add(text)
│
├── spaCy NLP             → extract entities + subject-verb-object facts
├── sentence-transformers → embed text → 384-dim vector
├── Vector store          → save embedding (InMemory or ChromaDB)
└── Knowledge graph       → add entities and facts as nodes/edges (NetworkX)

memory.get(query)
│
├── Embed query
├── Vector search → top-k similar memories by cosine similarity
├── Graph query   → related facts by keyword overlap
└── Ranker        → fuse scores: 0.6 × vector + 0.4 × graph → MemoryContext
```
Let's walk through each part.
### 1. NLP extraction (spaCy)
When you call memory.add(text), the first thing that happens is a spaCy pass over the raw text. It extracts:
- Named entities — people, places, organizations, tech names
- Subject-verb-object triples — structured facts like `(Ravi, prefers, Python)`
These become nodes and edges in a knowledge graph (NetworkX under the hood). This is what makes relational queries possible later.
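To make the graph side concrete, here's a toy sketch (mine, not MemoryWeave's internals) of how extracted triples become a NetworkX graph, and why relational queries then reduce to traversal:

```python
import networkx as nx

# Triples as a spaCy dependency parse might yield them (hypothetical output)
triples = [
    ("Ravi", "prefers", "Python"),
    ("Ravi", "uses", "FastAPI"),
    ("FastAPI", "uses", "Uvicorn"),
]

g = nx.DiGraph()
for subj, verb, obj in triples:
    g.add_edge(subj, obj, relation=verb)

# "What server does Ravi use?" becomes a traversal, not a similarity lookup:
print(sorted(nx.descendants(g, "Ravi")))  # ['FastAPI', 'Python', 'Uvicorn']
```

Note how `Uvicorn` is reachable from `Ravi` through `FastAPI` even though no single stored sentence mentions both — exactly the inference pure vector search misses.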
### 2. Embedding (sentence-transformers)
In parallel, the same text is embedded using all-MiniLM-L6-v2 — a compact, fast sentence-transformers model that produces 384-dimensional vectors. These go into either an in-memory store (great for development) or ChromaDB (for production, persistence across restarts).
Everything runs locally. No API keys, no data sent to any external service.
### 3. Deduplication
Before storing anything, MemoryWeave checks cosine similarity against existing embeddings. If a new entry scores ≥ 0.98 against something already stored, it's silently dropped. This keeps memory clean when the same fact gets re-added across sessions.
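The check itself is simple cosine arithmetic. A sketch of the idea with NumPy (not the library's actual code):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_duplicate(new_vec, stored_vecs, threshold=0.98):
    # Drop the new entry if it's near-identical to anything already stored
    return any(cosine(new_vec, v) >= threshold for v in stored_vecs)

stored = [np.array([1.0, 0.0, 0.0])]
print(is_duplicate(np.array([0.99, 0.01, 0.0]), stored))  # True
print(is_duplicate(np.array([0.0, 1.0, 0.0]), stored))    # False
```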
### 4. Retrieval and fusion
When you call `memory.get(query)`:

- The query is embedded with the same model
- Vector search returns the top-k most similar memories
- Graph query does a keyword-overlap walk across the knowledge graph, surfacing related facts that may not be textually similar to the query
- A weighted ranker fuses both: `final_score = 0.6 × vector_score + 0.4 × graph_score`
The weights are configurable. If your use case is mostly factual (e.g., a personal knowledge base), bump graph_weight up. If you're doing more semantic search over long-form text, keep vector weight dominant.
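The fusion step is plain arithmetic. A sketch with the default weights:

```python
def fuse(vector_score, graph_score, vector_weight=0.6, graph_weight=0.4):
    """Weighted fusion of the two retrieval scores."""
    return vector_weight * vector_score + graph_weight * graph_score

# A strong textual match with no graph support:
print(round(fuse(vector_score=0.9, graph_score=0.0), 2))  # 0.54
# A weaker textual match backed by a related graph fact:
print(round(fuse(vector_score=0.6, graph_score=0.9), 2))  # 0.72
```

With the defaults, a memory that's only moderately similar in text but well-connected in the graph can outrank a purely textual near-match.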
The result is a `MemoryContext` object:

| Field | Description |
|---|---|
| `summary` | Ready-to-inject string for your system prompt |
| `entries` | Vector search hits with scores |
| `facts` | Graph facts with scores |
| `has_results` | `False` if nothing was found |
## Plugging into OpenAI or Anthropic
MemoryWeave ships with first-class adapters for both:
```python
# OpenAI
from memoryweave.adapters.openai import OpenAIAdapter

adapter = OpenAIAdapter(memory, system_prompt="You are a helpful assistant.")
messages = adapter.prepare(messages)  # injects memory into the system prompt
# ... call OpenAI ...
adapter.remember(messages)  # stores the turn for next time
```

```python
# Anthropic
from memoryweave.adapters.anthropic import AnthropicAdapter

adapter = AnthropicAdapter(memory)
system, messages = adapter.prepare(messages)
# ... call Anthropic with system= ...
adapter.remember(messages)
```
The adapters handle prompt injection automatically. You don't touch the system prompt manually.
## Multi-user sessions
Every MemoryWeave instance is scoped to a `session_id`, so sessions never bleed into each other:

```python
from memoryweave import MemoryWeave, MemoryConfig

alice = MemoryWeave(MemoryConfig(default_session_id="alice"))
bob = MemoryWeave(MemoryConfig(default_session_id="bob"))

alice.add("Alice likes TypeScript.")
bob.add("Bob prefers Rust.")

print(alice.get("language").summary)  # → TypeScript
print(bob.get("language").summary)    # → Rust
```
## REST API + TypeScript SDK
If your app isn't Python, MemoryWeave also ships a FastAPI server and a TypeScript SDK:
```shell
# Start the server
uvicorn memoryweave.server:app --reload

# Optional: lock it with an API key
MEMORYWEAVE_API_KEY=my-secret uvicorn memoryweave.server:app
```
```typescript
import { MemoryWeave } from "@memoryweave/sdk";

const memory = new MemoryWeave({ baseUrl: "http://localhost:8000", sessionId: "user-1" });

await memory.add("Ravi prefers Python.");
const ctx = await memory.get("What language?");
console.log(ctx.summary);
```
## Current state
The library is at v1.1.0: 248 tests, 91% coverage, and a green CI matrix across Python 3.10–3.12. The full phase list:
✅ Phase 1 — Foundation
✅ Phase 2 — NLP extraction (spaCy)
✅ Phase 3 — Storage layer (vector + knowledge graph)
✅ Phase 4 — Core memory API v0.1.0
✅ Phase 5 — TypeScript SDK
✅ Phase 6 — FastAPI REST server
✅ Phase 7 — Documentation
✅ Phase 8 — Launch v1.0.0
✅ Phase 9 — Deduplication, async methods, LLM adapters, server auth v1.1.0
## What's next
A few things on the roadmap:
- Forgetting strategies — time-decay and relevance-decay so stale memories don't pollute retrieval
- Streaming support — auto-extract and store from streamed LLM responses
- Memory summaries — periodic compression of older memories into higher-level facts
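One plausible shape for time-decay — my sketch, not the planned implementation — is exponential decay with a configurable half-life, applied to retrieval scores:

```python
import time

def decayed_score(raw_score, stored_at, now, half_life_days=30.0):
    """Halve a memory's relevance every `half_life_days` (hypothetical strategy)."""
    age_days = (now - stored_at) / 86400
    return raw_score * 0.5 ** (age_days / half_life_days)

now = time.time()
print(decayed_score(0.9, stored_at=now, now=now))               # 0.9  — fresh memory
print(decayed_score(0.9, stored_at=now - 30 * 86400, now=now))  # 0.45 — one half-life old
```

Decaying scores at query time (rather than deleting rows) keeps forgetting reversible: a stale memory can still surface if nothing fresher matches.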
## Try it
```shell
pip install memoryweave
python -m spacy download en_core_web_sm
```
GitHub: github.com/ravii-k/memoryweave
If you're building something with it — or you've hit the same problem and solved it differently — I'd genuinely like to hear about it in the comments.