As someone who runs multiple AI agents for day-to-day tasks, I kept hitting the same wall: every new conversation was a total memory wipe. I'd ask my agent about a project we discussed yesterday, and it had no idea what I was talking about. Frustrating.
Sure, you can try to feed context into each prompt manually, or build custom plugins for each agent framework. But that's brittle and doesn't scale. I needed something that worked across any agent — Hermes, Claude Code, even my own custom scripts — without invasive surgery.
That's why I built Memory Sidecar.
The Core Idea
Memory Sidecar is a separate process that runs alongside your agent. It watches your conversations, extracts what's important, and builds a long-term memory structure. When your agent needs context, the sidecar injects relevant information into the system prompt, all without patching the agent or rewriting its internals.
How It Works
The agent writes sessions to a data directory (state.db plus plain-text files). The sidecar periodically checks for new data, processes it, and updates three memory layers:
- Hot layer: A fast, temporary buffer (5KB limit) that holds the most recent interactions. It's like your agent's short-term memory.
- Warm layer: A PostgreSQL-backed store using Hindsight for semantic similarity. This is where all the "I think I saw something like that last week" queries land.
- Cold layer: A persistent knowledge graph (g-brain) with full-text search via SQLite FTS5. This stores long-term knowledge (people's names, project details, recurring problems) as interconnected nodes.
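To make the hot layer concrete, here's a rough sketch of a byte-bounded buffer that drops the oldest interactions once the budget is exceeded. This is not the actual Memory Sidecar code: the `HotBuffer` class, its method names, and the eviction policy are my assumptions for illustration. Only the 5KB limit comes from the description above.

```python
from collections import deque


class HotBuffer:
    """Hypothetical byte-bounded buffer of recent interactions."""

    def __init__(self, max_bytes: int = 5 * 1024):
        if max_bytes <= 0:
            raise ValueError("max_bytes must be positive")
        self.max_bytes = max_bytes
        self._entries = deque()  # (text, size_in_bytes) pairs, oldest first
        self._size = 0

    def add(self, text: str) -> None:
        data = text.encode("utf-8")
        if len(data) > self.max_bytes:
            # A single oversized entry keeps only its tail; partial UTF-8
            # sequences at the cut point are dropped.
            text = data[-self.max_bytes:].decode("utf-8", errors="ignore")
            data = text.encode("utf-8")
        self._entries.append((text, len(data)))
        self._size += len(data)
        # Evict oldest entries until the buffer fits the budget again.
        while self._size > self.max_bytes:
            _, old_size = self._entries.popleft()
            self._size -= old_size

    def snapshot(self) -> list:
        """Return buffered interactions, oldest first."""
        return [text for text, _ in self._entries]

    @property
    def size_bytes(self) -> int:
        return self._size
```

Measuring in bytes rather than entries keeps the prompt overhead predictable no matter how long individual messages are.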
When a new query comes in, the sidecar performs a tiered retrieval: it first checks the hot layer for immediate context, then does a semantic search over the warm layer, and finally queries the cold knowledge graph for entity-based relationships. The results get compacted into the agent's system prompt, so the agent can act on them naturally.
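The tiered lookup can be sketched in a few lines. Treat this as an illustration under assumptions, not the sidecar's real API: each layer is modeled as a plain callable that takes a query and returns text snippets, and `tiered_retrieve`, `build_system_prompt`, and the `max_chars` budget are hypothetical names I picked for the example.

```python
from typing import Callable, List

# Assumption: every layer can be queried as "query in, snippets out".
Retriever = Callable[[str], List[str]]


def tiered_retrieve(query: str, hot: Retriever, warm: Retriever,
                    cold: Retriever, max_chars: int = 2000) -> str:
    """Query hot, then warm, then cold; dedupe and stop at the size budget."""
    seen = set()
    lines = []
    used = 0
    for label, layer in (("hot", hot), ("warm", warm), ("cold", cold)):
        for snippet in layer(query):
            if snippet in seen:
                continue  # a faster layer already returned this snippet
            line = f"[{label}] {snippet}"
            if used + len(line) + 1 > max_chars:
                return "\n".join(lines)  # budget exhausted
            seen.add(snippet)
            lines.append(line)
            used += len(line) + 1  # +1 for the newline separator
    return "\n".join(lines)


def build_system_prompt(base_prompt: str, memory_block: str) -> str:
    """Append the compacted memory to the agent's existing system prompt."""
    if not memory_block:
        return base_prompt
    return f"{base_prompt}\n\nRelevant memory:\n{memory_block}"
```

Because the layers are visited in order, the freshest context claims the budget first, and older knowledge only fills whatever space is left.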
Why This Matters for Production
Memory Sidecar is not a toy prototype. It's at version 3.1.1, built for reliability. Checkpointing ensures no session is processed twice. The three-tier architecture lets each lookup trade latency against recall depth: the hot layer answers fastest, while the cold graph reaches furthest back.
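For a feel of how checkpointing can prevent double-processing, here's a minimal sketch that records finished session IDs in SQLite. It's an assumption-laden example rather than the project's implementation: the `Checkpoint` class, the `processed` table, and `process_new` are all hypothetical names.

```python
import sqlite3


class Checkpoint:
    """Hypothetical record of which sessions have already been processed."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS processed (session_id TEXT PRIMARY KEY)"
        )
        self.conn.commit()

    def is_done(self, session_id: str) -> bool:
        row = self.conn.execute(
            "SELECT 1 FROM processed WHERE session_id = ?", (session_id,)
        ).fetchone()
        return row is not None

    def mark_done(self, session_id: str) -> None:
        self.conn.execute(
            "INSERT OR IGNORE INTO processed (session_id) VALUES (?)",
            (session_id,),
        )
        self.conn.commit()


def process_new(session_ids, checkpoint, handler):
    """Run handler on unseen sessions only; return the IDs handled this pass."""
    handled = []
    for session_id in session_ids:
        if checkpoint.is_done(session_id):
            continue
        handler(session_id)
        # Mark only after the handler succeeds, so a crash mid-session
        # leads to a retry on the next pass instead of a silent skip.
        checkpoint.mark_done(session_id)
        handled.append(session_id)
    return handled
```

Marking a session after handling it means a crash can cause one retry, so in this sketch the handler itself should be safe to run again on the same session.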
Because it's a separate process, you can restart your agent without losing memory. You can also scale the warm layer independently, host the cold graph on a bigger machine, or swap out components as needed.
Getting Started
You need Python 3.9+ and a running agent that writes sessions to a directory. Clone the repo, install dependencies, and configure the sidecar to watch that directory. The project includes a ready-to-use runner script.
For exact setup, check the documentation. But here's the gist: point the sidecar at your agent's output dir, and it starts building memory immediately. You'll see your agent's recall improve within a few interactions.
Who It's For (and Who It's Not)
If you're running an agent in a daily workflow and you're tired of repeating yourself, this will save you time. It's especially useful for personal assistants, coding agents, and any scenario where the agent needs to remember projects, preferences, or past decisions.
It's not for you if you need zero-latency responses or if you're running on extremely memory-constrained devices. The sidecar itself is lightweight, but the retrieval adds a few milliseconds to inference.
Conclusion
Giving an agent memory doesn't have to mean ripping out its internals. A sidecar approach scales and works across different agents. Memory Sidecar gave my agents a persistent knowledge base, and I've been using it daily for months.
If you're looking to solve the same problem, give it a try. Check out the repo: Memory Sidecar on GitHub
P.S. — The architecture is well-documented in the repo, including a full architecture breakdown. I recommend skimming that before diving into the code.