DEV Community

정상록

Building a Hermes Fleet: Reproducing a Self-Hosted Agent Stack in a Weekend (mem0 + Qdrant + Ollama + Claude Code Stop hook)

TL;DR

@Mosescreates posted his Hermes fleet — six agent profiles across two machines, all writing to a shared memory layer made of mem0 + Qdrant + Ollama. I spent the weekend reproducing a stripped-down version with two profiles on a single MacBook. The architecture held.

Load-bearing pieces (don't skip):

  1. Shared Qdrant collection
  2. Local Ollama embeddings
  3. mem0 as memory abstraction
  4. Claude Code Stop hook writing each turn
  5. Native-only OpenRouter provider pinning
  6. Local LLM fallback (Gemma 2 9B 4-bit in my case)

Everything else — second machine, launchd units, backup cron, fleet status CLIs — is nice to have.

Architecture

┌────────────────┐        ┌──────────────────┐
│ Claude Code    │        │ Telegram Agent   │
└────────┬───────┘        └────────┬─────────┘
         │ Stop hook               │ on every reply
         ▼                         ▼
┌────────────────────────────────────────────┐
│             mem0 (Python SDK)              │
│ LLM: OpenRouter (primary) / Ollama fallback│
│ Embedder: Ollama nomic-embed-text (768d)   │
└────────────────────┬───────────────────────┘
                     ▼
         ┌─────────────────────────┐
         │ Qdrant (Docker)         │
         │ collection: fleet_memory│
         └─────────────────────────┘

Every tool is a reader and a writer of the same store. That's the whole idea.

Bringing up the three services

# Qdrant
docker run -d --name qdrant -p 6333:6333 \
  -v "$HOME/qdrant_data:/qdrant/storage" qdrant/qdrant

# Ollama + embedding model
ollama pull nomic-embed-text
ollama serve &

# mem0
python3.11 -m venv ~/.venv/fleet && source ~/.venv/fleet/bin/activate
pip install mem0ai
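Before wiring any hooks, it's worth confirming all three endpoints actually answer. A minimal smoke test, assuming the default ports used above (`/healthz` is Qdrant's health endpoint; `/api/tags` is Ollama's model-listing endpoint):

```python
# Smoke-test the local services before wiring hooks into them.
from urllib.request import urlopen
from urllib.error import URLError

def is_up(url: str, timeout: float = 2.0) -> bool:
    """True if an HTTP GET to `url` returns a 2xx response."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, OSError):
        return False

if __name__ == "__main__":
    for name, url in {
        "qdrant": "http://127.0.0.1:6333/healthz",
        "ollama": "http://127.0.0.1:11434/api/tags",
    }.items():
        print(f"{name}: {'up' if is_up(url) else 'DOWN'}")
```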

Config that actually works

# ~/.mem0/config.yaml
vector_store:
  provider: qdrant
  config:
    host: 127.0.0.1
    port: 6333
    collection_name: fleet_memory
    embedding_model_dims: 768   # <- don't omit this

llm:
  provider: ollama
  config:
    model: llama3.1:8b
    ollama_base_url: http://127.0.0.1:11434

embedder:
  provider: ollama
  config:
    model: nomic-embed-text
    ollama_base_url: http://127.0.0.1:11434

Three things that differ from the original snippet:

  • All three blocks (vector_store, llm, embedder) are specified. mem0's defaults are OpenAI gpt-4o + text-embedding-3-small. If you only override the embedder, mem0 silently still calls OpenAI for the LLM piece and you get a confusing 401.
  • embedding_model_dims: 768 is explicit. nomic-embed-text returns 768-dim vectors; mem0's Qdrant default assumes 1536 (OpenAI). Missing this causes silent insert failures — see mem0 issue #3441.
  • 127.0.0.1 over localhost. Happy Eyeballs bit me repeatedly on macOS: localhost resolves to both ::1 and 127.0.0.1, and clients sometimes raced to the address the service wasn't listening on.
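The dims pitfall is easy to guard against with a few lines before handing the config to mem0. This validator is my own sketch (not a mem0 feature), and the known-dims table covers only the two models mentioned above:

```python
# Guard against the dims mismatch: check the config dict before
# passing it to Memory.from_config.
KNOWN_DIMS = {"nomic-embed-text": 768, "text-embedding-3-small": 1536}

def check_dims(config: dict) -> list:
    """Return a list of human-readable problems (empty == looks fine)."""
    problems = []
    vs = config.get("vector_store", {}).get("config", {})
    emb = config.get("embedder", {}).get("config", {})
    declared = vs.get("embedding_model_dims")
    model = emb.get("model", "")
    expected = KNOWN_DIMS.get(model)
    if declared is None:
        problems.append("vector_store.config.embedding_model_dims is missing "
                        "(mem0's Qdrant default assumes 1536)")
    elif expected is not None and declared != expected:
        problems.append(f"{model} returns {expected}-dim vectors, "
                        f"but the config declares {declared}")
    return problems
```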

The Claude Code Stop hook

This is the piece that wires Claude Code into the shared memory. Settings:

{
  "hooks": {
    "Stop": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "/Users/you/.venv/fleet/bin/python3 /Users/you/.claude/hooks/mem_broadcast.py"
          }
        ]
      }
    ]
  }
}

Important: use the venv Python explicitly. The hook runs with a non-login shell env, so python3 resolves to system Python, which doesn't have mem0 installed.

The hook script

The original blog post uses payload.get("transcript", []). The actual Stop hook payload gives you transcript_path, the path to a JSONL transcript file, which you have to open and parse yourself.

#!/usr/bin/env python3
"""Claude Code Stop hook → mem0 writer with redaction + idempotency."""
import json, os, re, sys
from pathlib import Path
from mem0 import Memory

REDACT_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
    re.compile(r"ghp_[A-Za-z0-9]{36,}"),
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]+"),
]

def redact(text: str) -> str:
    for pat in REDACT_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text

def main() -> int:
    try:
        payload = json.load(sys.stdin)
    except json.JSONDecodeError:
        return 0

    # Prevent infinite loops
    if payload.get("stop_hook_active"):
        return 0

    transcript_path = payload.get("transcript_path")
    if not transcript_path or not Path(transcript_path).exists():
        return 0

    turns = []
    with open(transcript_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line: continue
            try:
                turns.append(json.loads(line))
            except json.JSONDecodeError:
                continue

    user_turn = next((t for t in reversed(turns) if t.get("role") == "user"), None)
    assistant_turn = next((t for t in reversed(turns) if t.get("role") == "assistant"), None)
    if not user_turn or not assistant_turn:
        return 0

    session_id = payload.get("session_id", "unknown")
    turn_index = len(turns)
    user_text = redact(str(user_turn.get("content", "")))[:2000]
    assistant_text = redact(str(assistant_turn.get("content", "")))[:2000]

    m = Memory.from_config(os.path.expanduser("~/.mem0/config.yaml"))
    m.add(
        f"User asked: {user_text}\nAssistant answered: {assistant_text}",
        user_id="fleet",
        metadata={
            "source": "claude_code",
            "session_id": session_id,
            "turn_index": turn_index,
            "idempotency_key": f"{session_id}:{turn_index}",
        },
    )
    return 0

if __name__ == "__main__":
    sys.exit(main())
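Before trusting the redaction with real transcripts, exercise the patterns standalone. The snippet below duplicates the hook's REDACT_PATTERNS so it runs on its own; the sample credentials are fabricated:

```python
# Standalone sanity check for the hook's redaction patterns (copied verbatim).
import re

REDACT_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
    re.compile(r"ghp_[A-Za-z0-9]{36,}"),
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]+"),
]

def redact(text: str) -> str:
    for pat in REDACT_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text

# Fake credentials only -- never paste real ones into a test.
sample = "key sk-" + "a" * 24 + " header Bearer abc.def123 token ghp_" + "b" * 36
cleaned = redact(sample)
assert "sk-" not in cleaned and "ghp_" not in cleaned and "Bearer " not in cleaned
```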

Known Stop hook bug

Claude Code issue #11786 reports a regression in v2.0.42+ where prompt-based Stop hooks can't access transcript content. Reading transcript_path directly (as above) mostly dodges it, but I occasionally saw the last JSONL turn not yet flushed when the hook fired — adding time.sleep(0.2) at the top of main() smoothed it over in practice.
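Instead of a fixed sleep, a short retry loop on the transcript tail handles the flush race a bit more gracefully. This is my own sketch, not part of the original hook:

```python
# Retry reading the last JSONL entry while the writer may still be flushing.
import json
import time
from pathlib import Path
from typing import Optional

def last_entry(path: str, retries: int = 3, delay: float = 0.2) -> Optional[dict]:
    """Return the last parseable JSONL entry, or None if it never settles."""
    for _ in range(retries):
        lines = [l for l in Path(path).read_text(encoding="utf-8").splitlines()
                 if l.strip()]
        if lines:
            try:
                return json.loads(lines[-1])
            except json.JSONDecodeError:
                pass  # tail line not fully written yet; wait and retry
        time.sleep(delay)
    return None
```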

The four-line YAML that matters most

OpenRouter defaults to falling back to cheaper/faster providers. That's fine for a chat UI, catastrophic for an agent that writes to shared memory.

primary:
  provider: openrouter
  model: qwen/qwen-3-72b-instruct
  provider_config:
    only: ["alibaba"]
    allow_fallbacks: false

fallback:
  provider: ollama
  model: gemma2:9b-instruct-q4_0

OpenRouter's provider routing docs officially support provider.only + allow_fallbacks: false. If the pinned provider is down, the call fails loudly instead of silently drifting to another one. Loud failure is what you want when memory consistency is on the line.
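As control flow, the primary/fallback pair is just try-then-failover, with the switch explicit in your own code rather than hidden inside a routing layer. A sketch where call_primary and call_fallback are placeholders for your OpenRouter and Ollama clients:

```python
# Explicit failover: try the pinned provider, fall back to local only on error.
from typing import Callable, Tuple

def call_with_fallback(call_primary: Callable[[str], str],
                       call_fallback: Callable[[str], str],
                       prompt: str) -> Tuple[str, str]:
    """Return (provider_name, response); the failover shows up in your logs
    instead of happening silently inside the router."""
    try:
        return ("openrouter", call_primary(prompt))
    except Exception as exc:
        print(f"primary failed ({exc!r}); falling back to local model")
        return ("ollama", call_fallback(prompt))
```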

Local fallback from day one

At one point I lost network for about 30 seconds mid-conversation and barely noticed: memory writes kept going against local Gemma, and recovery was silent. Pull the model once:

ollama pull gemma2:9b-instruct-q4_0

Three lessons the hard way

  1. Happy Eyeballs is real. Pin services to 127.0.0.1, set AddressFamily inet in ~/.ssh/config, use curl -4.
  2. Idempotency keys aren't optional. I skipped them for day one and by Sunday I had duplicate memory entries from mem0 retries after a Qdrant timeout. session_id:turn_index — do it from the start.
  3. Self-hosted embeddings are self-ownership. Cost was a trivial amount of electricity. Privacy is total. If OpenRouter vanishes tomorrow, my memory layer is untouched.
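Lesson 2 in code: compute the key before the write and gate on it. The in-memory set below is a sketch; in practice you'd persist it, or query Qdrant for the idempotency_key metadata the hook already writes:

```python
# Dedupe gate keyed on session_id:turn_index, matching the hook's metadata.
def should_write(session_id: str, turn_index: int, seen: set) -> bool:
    """Return True exactly once per (session_id, turn_index) pair."""
    key = f"{session_id}:{turn_index}"
    if key in seen:
        return False  # a retry of a turn we already stored
    seen.add(key)
    return True
```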

The qualitative part

After roughly 10 minutes of Claude Code work, I asked my Telegram bot "what was I working on?" and it answered coherently from the shared store. I kept catching myself switching contexts between tools without having to re-explain anything.

None of this is novel tech. The work is in the wiring — making every tool a reader/writer of the same store, plus enough discipline around idempotency and redaction to trust what's in there.

Closing

Moshe's "nothing lock-in" point hits harder after doing it. When something broke I fixed it. When I wanted a new profile I copied four lines of YAML. No platform was in the loop.

Start: Qdrant + Ollama + mem0. Get Claude Code Stop hook writing. Everything else builds from there.


Credit: Moshe (@Mosescreates) for the original Hermes thread. mem0 / Qdrant / Ollama / Claude Code teams for the underlying pieces. This is just a weekend reproduction.

Full Korean writeup with additional troubleshooting: https://qjc.app/blog/hermes-fleet-self-hosted-agent-stack
