## TL;DR
@Mosescreates posted his Hermes fleet — six agent profiles across two machines, all writing to a shared memory layer made of mem0 + Qdrant + Ollama. I spent the weekend reproducing a stripped-down version with two profiles on a single MacBook. The architecture held.
Load-bearing pieces (don't skip):
- Shared Qdrant collection
- Local Ollama embeddings
- mem0 as memory abstraction
- Claude Code Stop hook writing each turn
- Native-only OpenRouter provider pinning
- Local LLM fallback (Gemma 2 9B 4-bit in my case)
Everything else — second machine, launchd units, backup cron, fleet status CLIs — is nice to have.
## Architecture

```
┌────────────────┐          ┌──────────────────┐
│  Claude Code   │          │  Telegram Agent  │
└───────┬────────┘          └────────┬─────────┘
        │ Stop hook                  │ on every reply
        ▼                            ▼
┌────────────────────────────────────────────┐
│              mem0 (Python SDK)             │
│ LLM: OpenRouter (primary) / Ollama fallback│
│ Embedder: Ollama nomic-embed-text (768d)   │
└─────────────────────┬──────────────────────┘
                      ▼
         ┌─────────────────────────┐
         │     Qdrant (Docker)     │
         │ collection: fleet_memory│
         └─────────────────────────┘
```
Every tool is a reader and a writer of the same store. That's the whole idea.
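The reader side fits in one function: any other process (the Telegram bot, a cron job) queries with the same config it writes with. A minimal sketch; `recall` and `pick_memories` are my illustrative names, and the `{"results": [...]}` response shape follows recent mem0 SDK versions, so check yours.

```python
import os

def pick_memories(hits):
    """Normalize mem0 search output ({"results": [...]} or a bare list)."""
    results = hits.get("results") if isinstance(hits, dict) else hits
    return [h.get("memory", "") for h in (results or [])]

def recall(query, user_id="fleet", limit=5, config_path="~/.mem0/config.yaml"):
    """Any process running this sees every other process's writes."""
    import yaml           # pip install pyyaml
    from mem0 import Memory  # pip install mem0ai
    with open(os.path.expanduser(config_path)) as f:
        m = Memory.from_config(yaml.safe_load(f))
    return pick_memories(m.search(query, user_id=user_id, limit=limit))

if __name__ == "__main__":
    print(recall("what was I working on?"))
```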
## Bringing up the three services

```bash
# Qdrant
docker run -d --name qdrant -p 6333:6333 \
  -v "$HOME/qdrant_data:/qdrant/storage" qdrant/qdrant

# Ollama: start the server first, then pull the embedding model
ollama serve &
ollama pull nomic-embed-text

# mem0
python3.11 -m venv ~/.venv/fleet && source ~/.venv/fleet/bin/activate
pip install mem0ai
```
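Before wiring anything further, it's worth confirming both services actually answer. A small sketch using the two services' standard HTTP endpoints (`/collections` for Qdrant, `/api/tags` for Ollama); the helper name is mine.

```python
from urllib.request import urlopen

# Endpoints for the two services started above, on their default ports.
CHECKS = {
    "qdrant": "http://127.0.0.1:6333/collections",
    "ollama": "http://127.0.0.1:11434/api/tags",
}

def check(name, url, timeout=2.0):
    """Return (name, True) iff the service answers HTTP 200."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return name, resp.status == 200
    except OSError:  # connection refused, timeout, DNS, etc.
        return name, False

if __name__ == "__main__":
    for name, url in CHECKS.items():
        print(check(name, url))
```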
## Config that actually works

```yaml
# ~/.mem0/config.yaml
vector_store:
  provider: qdrant
  config:
    host: 127.0.0.1
    port: 6333
    collection_name: fleet_memory
    embedding_model_dims: 768   # <- don't omit this
llm:
  provider: ollama
  config:
    model: llama3.1:8b
    ollama_base_url: http://127.0.0.1:11434
embedder:
  provider: ollama
  config:
    model: nomic-embed-text
    ollama_base_url: http://127.0.0.1:11434
```
Three things that differ from the original snippet:

1. All three blocks (`vector_store`, `llm`, `embedder`) are specified. mem0's defaults are OpenAI `gpt-4o` + `text-embedding-3-small`. If you only override the embedder, mem0 silently still calls OpenAI for the LLM piece and you get a confusing 401.
2. `embedding_model_dims: 768` is explicit. nomic-embed-text returns 768-dim vectors; mem0's Qdrant default assumes 1536 (OpenAI). Missing this causes silent insert failures (see mem0 issue #3441).
3. `127.0.0.1` over `localhost`. Happy Eyeballs bit me multiple times on macOS when services resolved to both IPv6 and IPv4.
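The first two failure modes are checkable up front. A sketch (helper names are mine) that validates a loaded config dict and, when Qdrant is up, confirms the live collection was actually created with 768-dim vectors via Qdrant's `GET /collections/<name>` REST response:

```python
import json
from urllib.request import urlopen

def validate(cfg):
    """Flag the silent-failure modes above in a loaded config dict."""
    problems = []
    if cfg.get("llm", {}).get("provider") in (None, "openai"):
        problems.append("llm block missing: mem0 will default to OpenAI")
    dims = cfg.get("vector_store", {}).get("config", {}).get("embedding_model_dims")
    if dims != 768:
        problems.append(f"embedding_model_dims is {dims}; nomic-embed-text emits 768")
    return problems

def collection_dims(info):
    """Vector size from a Qdrant GET /collections/<name> response."""
    return info["result"]["config"]["params"]["vectors"]["size"]

if __name__ == "__main__":
    import os, yaml  # pip install pyyaml
    with open(os.path.expanduser("~/.mem0/config.yaml")) as f:
        print(validate(yaml.safe_load(f)))
    with urlopen("http://127.0.0.1:6333/collections/fleet_memory", timeout=2) as r:
        assert collection_dims(json.load(r)) == 768, "recreate the collection"
```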
## The Claude Code Stop hook

This is the piece that wires Claude Code into the shared memory. In `~/.claude/settings.json`:
```json
{
  "hooks": {
    "Stop": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "/Users/you/.venv/fleet/bin/python3 /Users/you/.claude/hooks/mem_broadcast.py"
          }
        ]
      }
    ]
  }
}
```
Important: use the venv Python explicitly. The hook runs in a non-login shell environment, so `python3` resolves to the system Python, which doesn't have mem0 installed.
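One way to catch a wrong interpreter early is to read the hook commands back out of the settings file and try importing mem0 with each one. `hook_commands` is an illustrative helper of mine, not part of Claude Code; it just walks the JSON structure shown above.

```python
import json, os, shlex, subprocess

def hook_commands(settings):
    """Extract Stop-hook command strings from a Claude Code settings dict."""
    cmds = []
    for entry in settings.get("hooks", {}).get("Stop", []):
        for hook in entry.get("hooks", []):
            if hook.get("type") == "command":
                cmds.append(hook["command"])
    return cmds

if __name__ == "__main__":
    with open(os.path.expanduser("~/.claude/settings.json")) as f:
        for cmd in hook_commands(json.load(f)):
            interpreter = shlex.split(cmd)[0]
            ok = subprocess.run([interpreter, "-c", "import mem0"]).returncode == 0
            print(f"{interpreter}: mem0 {'found' if ok else 'MISSING'}")
```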
## The hook script

The original blog post uses `payload.get("transcript", [])`. The actual Stop hook payload gives you `transcript_path`, the path to a JSONL file, which you have to open and parse yourself.
```python
#!/usr/bin/env python3
"""Claude Code Stop hook -> mem0 writer with redaction + idempotency."""
import json
import os
import re
import sys
from pathlib import Path

import yaml
from mem0 import Memory

REDACT_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
    re.compile(r"ghp_[A-Za-z0-9]{36,}"),
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]+"),
]


def redact(text: str) -> str:
    for pat in REDACT_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text


def role_and_content(turn: dict):
    # Transcript lines may be flat or nest role/content under "message",
    # depending on the Claude Code version; handle both shapes.
    msg = turn.get("message", turn)
    return msg.get("role"), msg.get("content", "")


def main() -> int:
    try:
        payload = json.load(sys.stdin)
    except json.JSONDecodeError:
        return 0

    # Prevent infinite loops
    if payload.get("stop_hook_active"):
        return 0

    transcript_path = payload.get("transcript_path")
    if not transcript_path or not Path(transcript_path).exists():
        return 0

    turns = []
    with open(transcript_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                turns.append(json.loads(line))
            except json.JSONDecodeError:
                continue

    user_turn = next((t for t in reversed(turns) if role_and_content(t)[0] == "user"), None)
    assistant_turn = next((t for t in reversed(turns) if role_and_content(t)[0] == "assistant"), None)
    if not user_turn or not assistant_turn:
        return 0

    session_id = payload.get("session_id", "unknown")
    turn_index = len(turns)
    user_text = redact(str(role_and_content(user_turn)[1]))[:2000]
    assistant_text = redact(str(role_and_content(assistant_turn)[1]))[:2000]

    # Memory.from_config takes a config dict, not a path -- load the YAML first.
    with open(os.path.expanduser("~/.mem0/config.yaml"), "r", encoding="utf-8") as f:
        m = Memory.from_config(yaml.safe_load(f))
    m.add(
        f"User asked: {user_text}\nAssistant answered: {assistant_text}",
        user_id="fleet",
        metadata={
            "source": "claude_code",
            "session_id": session_id,
            "turn_index": turn_index,
            "idempotency_key": f"{session_id}:{turn_index}",
        },
    )
    return 0


if __name__ == "__main__":
    sys.exit(main())
```
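You can exercise the hook without a live session by faking the payload and transcript. A sketch under the assumption that transcript lines are either flat `{"role": ...}` objects or nest them under `"message"` (as in newer Claude Code versions); `norm` and `fake_payload` are my names.

```python
import json
import os
import tempfile

def norm(turn):
    """Transcript lines may be flat or nest role/content under "message"."""
    msg = turn.get("message", turn)
    return msg.get("role"), msg.get("content", "")

def fake_payload(turns, session_id="test-session"):
    """Write turns to a temp JSONL file; build a Stop-hook-shaped payload."""
    f = tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False)
    for t in turns:
        f.write(json.dumps(t) + "\n")
    f.close()
    return {"session_id": session_id, "transcript_path": f.name,
            "stop_hook_active": False}

if __name__ == "__main__":
    payload = fake_payload([
        {"role": "user", "content": "ping"},
        {"type": "assistant", "message": {"role": "assistant", "content": "pong"}},
    ])
    # Pipe this JSON into the hook to test it end to end:
    #   python make_payload.py | python mem_broadcast.py
    print(json.dumps(payload))
```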
## Known Stop hook bug

Claude Code issue #11786 reports a regression in v2.0.42+ where prompt-based Stop hooks can't access transcript content. Reading `transcript_path` directly (as above) mostly dodges it, but I occasionally saw the last JSONL turn not yet flushed when the hook fired; adding `time.sleep(0.2)` at the top of `main()` smoothed it over in practice.
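A fixed sleep works, but a bounded retry reads a little better: re-read the file until every line parses, then give up and keep what did. A sketch of that idea (`read_turns` is my name, not part of the hook above):

```python
import json
import tempfile  # only used in the demo below
import time

def read_turns(path, retries=3, delay=0.2):
    """Read a JSONL transcript, retrying briefly if the tail isn't flushed yet."""
    turns = []
    for _ in range(retries):
        turns = []
        clean = True
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    turns.append(json.loads(line))
                except json.JSONDecodeError:
                    clean = False  # half-written line; wait and re-read
                    break
        if clean:
            return turns
        time.sleep(delay)
    return turns  # give up: return whatever parsed cleanly

if __name__ == "__main__":
    demo = tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False)
    demo.write('{"role": "user", "content": "hi"}\n')
    demo.close()
    print(read_turns(demo.name))
```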
## The four-line YAML that matters most
OpenRouter defaults to falling back to cheaper/faster providers. That's fine for a chat UI, catastrophic for an agent that writes to shared memory.
```yaml
primary:
  provider: openrouter
  model: qwen/qwen-3-72b-instruct
  provider_config:
    only: ["alibaba"]
    allow_fallbacks: false
fallback:
  provider: ollama
  model: gemma2:9b-instruct-q4_0
```
OpenRouter's provider routing docs officially support `provider.only` + `allow_fallbacks: false`. If the pinned provider is down, the call fails loudly instead of silently drifting to another one. Loud failure is what you want when memory consistency is on the line.
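Outside of a config file, the same pinning goes into the request body of the chat-completions call, per OpenRouter's provider-routing fields (`provider.only`, `allow_fallbacks`). A sketch; the API key is a placeholder and the helper name is mine.

```python
import json
import urllib.request

def openrouter_body(model, messages, pinned_provider):
    """Chat-completions body that fails loudly instead of rerouting."""
    return {
        "model": model,
        "messages": messages,
        "provider": {"only": [pinned_provider], "allow_fallbacks": False},
    }

if __name__ == "__main__":
    body = openrouter_body(
        "qwen/qwen-3-72b-instruct",
        [{"role": "user", "content": "ping"}],
        "alibaba",
    )
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": "Bearer <OPENROUTER_API_KEY>",  # placeholder
            "Content-Type": "application/json",
        },
    )
    # urllib.request.urlopen(req) raises on a provider outage; no silent drift
```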
## Local fallback from day one

Thirty seconds without network in the middle of a conversation, and the user barely noticed: memory writes kept going against local Gemma, and recovery was silent. Pull the model once:

```bash
ollama pull gemma2:9b-instruct-q4_0
```
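The fallback itself can be a thin wrapper: try the pinned cloud call, and on any failure run the local model instead. A sketch where `primary` and `fallback` are any callables taking a prompt (wiring them to OpenRouter and Ollama is up to you):

```python
def with_fallback(primary, fallback):
    """Wrap two LLM callables; report which one actually answered."""
    def call(prompt):
        try:
            return "primary", primary(prompt)
        except Exception:
            # Loud primary failure (pinned provider down, no network)
            # degrades to the local model instead of dropping the write.
            return "fallback", fallback(prompt)
    return call
```

Returning the source alongside the answer makes it easy to log how often you actually ran local.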
## Three lessons the hard way

1. Happy Eyeballs is real. Pin services to `127.0.0.1`, set `AddressFamily inet` in `~/.ssh/config`, use `curl -4`.
2. Idempotency keys aren't optional. I skipped them for day one and by Sunday I had duplicate memory entries from mem0 retries after a Qdrant timeout. `session_id:turn_index`, do it from the start.
3. Self-hosted embeddings are self-ownership. Cost was a trivial amount of electricity. Privacy is total. If OpenRouter vanishes tomorrow, my memory layer is untouched.
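Writing the key as metadata doesn't by itself prevent duplicates; the guard has to live on the writer side. A minimal sketch of one way to do it: a tiny on-disk set of seen keys, checked before each write (class and file names are mine).

```python
import json
import os
import tempfile  # only used in the demo below

class SeenKeys:
    """Tiny on-disk set of idempotency keys, so retries don't double-write."""

    def __init__(self, path):
        self.path = path
        self.keys = set()
        if os.path.exists(path):
            with open(path) as f:
                self.keys = set(json.load(f))

    def claim(self, key):
        """Return True exactly once per key, persisting across restarts."""
        if key in self.keys:
            return False
        self.keys.add(key)
        with open(self.path, "w") as f:
            json.dump(sorted(self.keys), f)
        return True

if __name__ == "__main__":
    seen = SeenKeys(os.path.join(tempfile.gettempdir(), "fleet_seen.json"))
    key = "demo-session:1"
    print("write" if seen.claim(key) else "skip duplicate")
```

In the hook, you'd call `claim(f"{session_id}:{turn_index}")` and return early on `False` before ever touching mem0.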
## The qualitative part
After roughly 10 minutes of Claude Code work, I asked my Telegram bot "what was I working on?" and it answered coherently from the shared store. I kept catching myself context-switching without re-explaining.
None of this is novel tech. The work is in the wiring — making every tool a reader/writer of the same store, plus enough discipline around idempotency and redaction to trust what's in there.
## Closing
Moshe's "nothing lock-in" point hits harder after doing it. When something broke I fixed it. When I wanted a new profile I copied four lines of YAML. No platform was in the loop.
Start with Qdrant + Ollama + mem0. Get the Claude Code Stop hook writing. Everything else builds from there.
Credit: Moshe (@Mosescreates) for the original Hermes thread. mem0 / Qdrant / Ollama / Claude Code teams for the underlying pieces. This is just a weekend reproduction.
Full Korean writeup with additional troubleshooting: https://qjc.app/blog/hermes-fleet-self-hosted-agent-stack