Andrew

Posted on May 26 • Originally published at andrew.ooo

Honcho Review: Plastic Labs' Agent Memory Layer (2026)

#agentmemory #honcho #plasticlabs #statefulagents


TL;DR

Honcho is open-source memory infrastructure for stateful AI agents, built by Plastic Labs and released under an open-source license. It's currently trending on GitHub (4,301 stars total, 644 of them new this week), and it just posted state-of-the-art numbers on three independent agent-memory benchmarks. The highlights:

Reasoning-first memory: extracts conclusions from conversations and events, not just chunks to match later
Peer-centric model: users, AI agents, groups, projects, and ideas are all first-class "peers" that change over time
90.4% on LongMem S (92.6% with Gemini 3 Pro), 89.9% on LoCoMo, top scores on BEAM — all while using a median 5% of the available context
Managed or self-hosted: hit api.honcho.dev ($100 free credits on signup) or run the FastAPI server yourself
First-class integrations for Claude Code, OpenCode, Cursor, Windsurf, Cline, and any MCP client
Python + TypeScript SDKs, PostgreSQL + pgvector for storage, Redis for caching
Open-source licensing, described variously as Apache 2.0 or AGPL-style; the two differ sharply in their copyleft obligations, so check the repo for current terms before commercial deployment

If you've ever shipped an AI assistant that forgets your user between sessions — and then watched retention crater because of it — Honcho is the most credible open-source attempt yet at fixing that.

Quick Reference


Repository	github.com/plastic-labs/honcho
Vendor	Plastic Labs (VC-backed, NYC)
Language	Python (FastAPI server) + TS/Python SDKs
Stars	4,301 (+644 this week)
Install (Python)	`pip install honcho-ai`
Install (Node)	`npm install @honcho-ai/sdk`
Hosted endpoint	`api.honcho.dev` (signup → $100 free credits)
Self-host stack	FastAPI + PostgreSQL (pgvector) + Redis
MCP endpoint	`https://mcp.honcho.dev`

What Is Honcho?

Most "agent memory" libraries on the market are basically wrappers around a vector database. You store messages, embed them, and retrieve the top-k similar chunks at query time. That works fine for trivia recall, but it fails the moment your user says "I prefer concise answers" in session 1 and you want session 47 to act on that preference — the relevant chunk is buried, the embedding is fuzzy, and you'd need a perfect query to surface it.
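To make the failure mode concrete, here is a minimal sketch of the store, embed, and retrieve-top-k pattern. It is not taken from any particular library: `embed`, `cosine`, and `top_k` are hypothetical helpers, and a bag-of-words count stands in for a real embedding model. Real embeddings degrade more gradually than this toy does, but the shape of the problem is the same.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k stored chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "I prefer concise answers",
    "Can you explain how pgvector indexes work",
    "What is the capital of France",
]

# A query that shares words with the stored preference finds it.
hit = top_k("does the user prefer concise answers", chunks, k=1)

# A differently worded query about the same preference shares no words with
# any chunk, so every similarity is 0.0 and the ranking carries no signal.
miss_scores = [cosine(embed("keep replies short"), embed(c)) for c in chunks]
```

The second query is the session-47 situation: the preference is stored, but nothing in the request lines up with it closely enough to surface it.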

Honcho takes a different angle. It treats memory as a reasoning problem, not a retrieval problem. When messages arrive, a small fine-tuned model extracts latent information — preferences, beliefs, facts, contradictions — and writes it into a structured representation of the speaker. In the background, the system "dreams" across ingested messages and prior reasoning, drawing new inferences over time. When you query Honcho, you don't search a vector store; you ask a research agent a natural-language question, and it returns a synthesized answer.

The other thing Honcho gets right is multi-peer modeling. Most memory libraries assume a single "user" entity. Honcho's primitive is the peer — and a peer can be a human, an AI agent, a project, or even an idea. Sessions are many-to-many with peers, which means you can model "what does Alice know about Bob" or "what does the support-agent persona think the customer wants" cleanly. For multi-agent systems this is a much better fit than shoehorning everything into a user/assistant pair.
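A rough way to picture the many-to-many relationship is a pair of plain data classes. This is an illustrative sketch only, not Honcho's schema: `Peer`, `Session`, `sessions_for`, and `shared_sessions` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Any entity worth modeling: a human, an agent, a project, an idea."""
    peer_id: str
    kind: str


@dataclass
class Session:
    """A conversation or event stream that several peers take part in."""
    session_id: str
    peer_ids: set[str] = field(default_factory=set)


def sessions_for(peer_id: str, sessions: list[Session]) -> list[str]:
    """IDs of every session a peer participates in."""
    return [s.session_id for s in sessions if peer_id in s.peer_ids]


def shared_sessions(a: str, b: str, sessions: list[Session]) -> list[str]:
    """Sessions where both peers were present, i.e. where a could observe b."""
    return [s.session_id for s in sessions if {a, b} <= s.peer_ids]


peers = [Peer("alice", "human"), Peer("bob", "human"), Peer("support-agent", "agent")]
sessions = [
    Session("ticket-1", {"alice", "support-agent"}),
    Session("ticket-2", {"bob", "support-agent"}),
    Session("standup", {"alice", "bob"}),
]

agent_sessions = sessions_for("support-agent", sessions)   # one peer, many sessions
alice_and_bob = shared_sessions("alice", "bob", sessions)  # one session, many peers
```

A user/assistant pair can express the first two sessions, but the standup, where two humans talk and no assistant is present, has no natural home in that model.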

Why It's Trending Now

Three forces are converging on agent memory in mid-2026:

Long-context isn't a memory replacement. Frontier models now ship with 1M+ token windows, but Plastic Labs' own LongMem benchmark runs show that simply loading the full history into context drops Claude Haiku 4.5 from 89.2% (oracle, relevant sessions only) to 62.6% (full haystack), a 26.6-point drop. More tokens ≠ better recall. Models need a structured memory layer to perform well at scale.
MCP is the new integration substrate. Honcho ships a hosted MCP endpoint at mcp.honcho.dev, so Claude Code, Cursor, Cline, Windsurf, and Codex CLI can all add it with a single command. The team has also published official plugins for Claude Code, OpenCode, and Hermes.
The market matured past "Mem0 or roll your own." The first wave of memory libraries (Mem0, Letta, Zep) trained the market on the concept of agent memory. The second wave — Honcho, Hindsight, Supermemory, Holographic — is competing on benchmarks and reasoning depth. Honcho's recent benchmark results put it at the top of the leaderboard for LongMem and LoCoMo.

The end result: every team building a stateful assistant in mid-2026 is shopping for a memory stack. Honcho is one of maybe four credible options.

Key Features (With Code)

1. The Honcho Loop: Store → Reason → Query → Inject

The mental model is a four-step loop. You store messages, Honcho reasons in the background, you query for context or insights, and you inject the result into your model of choice.

import os
from honcho import Honcho

honcho = Honcho(
    workspace_id="my-app-testing",
    api_key=os.environ["HONCHO_API_KEY"],
)

# 1. Store: peers and messages on a session
alice = honcho.peer("alice")
tutor = honcho.peer("tutor")
session = honcho.session("session-1")
session.add_messages([
    alice.message("Hey there — can you help me with my math homework?"),
    tutor.message("Absolutely. Send me your first problem!"),
])

# 2. Reason: happens asynchronously in the background.

# 3. Query: ask Honcho what it knows, or pull prompt-ready context.
answer = alice.chat("What learning styles does the user respond to best?")
context = session.context(summary=True, tokens=10_000)

# 4. Inject: hand the context to your model of choice.
from openai import OpenAI
client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=context.to_openai(assistant=tutor),
)

Notice what's missing: any direct calls to a vector DB, any prompt-engineering boilerplate to extract "preferences," any explicit memory invalidation. Honcho does all of that asynchronously.

2. Peer-Centric Modeling

The peer abstraction unlocks multi-agent and multi-user scenarios that are genuinely awkward in Mem0-style libraries:

# Peers are first-class: humans, agents, projects, ideas
customer = honcho.peer("customer-12345")
support_agent = honcho.peer("support-agent")
billing_bot = honcho.peer("billing-bot")

# Configurable "observation" — which peers see which others
session = honcho.session("ticket-87234", peers=[customer, support_agent, billing_bot])
session.add_messages([...])

# Ask one peer what it knows about another
intel = support_agent.chat(
    "What does the customer seem to actually want here?",
    target=customer,
)

This is Honcho's "theory of mind" feature — modeling what one peer knows about another — and it's what differentiates the project from straightforward retrieval-augmented memory.
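One way to think about "what one peer knows about another" is a store keyed by (observer, observed), where a conclusion only reaches the peers that were present when it was formed. The sketch below is a mental model under that assumption, not Honcho's implementation; `views`, `record`, and `what_does` are hypothetical.

```python
from collections import defaultdict

# (observer, observed) -> conclusions the observer could have formed
views: dict[tuple[str, str], list[str]] = defaultdict(list)


def record(session_peers: set[str], speaker: str, conclusion: str) -> None:
    """Attach a conclusion about `speaker` to every peer present in the session."""
    for observer in session_peers:
        views[(observer, speaker)].append(conclusion)


def what_does(observer: str, target: str) -> list[str]:
    """What `observer` has had the chance to learn about `target`."""
    return views.get((observer, target), [])


# Two separate sessions, each with a different agent in the room.
record({"customer", "support-agent"}, "customer", "wants a refund, not a credit")
record({"customer", "billing-bot"}, "customer", "card on file has expired")

support_view = what_does("support-agent", "customer")
billing_view = what_does("billing-bot", "customer")
```

Each agent ends up with a different picture of the same customer, which is the distinction a single shared "user memory" cannot express.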

3. Hybrid Search (BM25 + Vector)

When you do need raw retrieval, Honcho exposes hybrid search out of the box, combining keyword and dense-vector recall:

results = session.search("pricing complaints from last month")
# or scoped to a peer
results = customer.search("billing")
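The article does not say how Honcho merges the keyword and vector result lists. A common technique for this kind of merge is reciprocal rank fusion (RRF), sketched below as an assumption about how hybrid search can work in general, not as Honcho's method. The message IDs are made up, and `k = 60` is the conventional default for RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


keyword_hits = ["msg-7", "msg-2", "msg-9"]  # hypothetical BM25 order
vector_hits = ["msg-2", "msg-4", "msg-7"]   # hypothetical embedding-similarity order

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that appear in both lists (`msg-2`, `msg-7`) outrank those found by only one retriever, which is the practical benefit of combining keyword and dense recall.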

4. Drop-in MCP Server

For coding agents, the easiest path is the hosted MCP endpoint:

# Claude Code's documented form is: claude mcp add --transport http <name> <url>
claude mcp add --transport http honcho "https://mcp.honcho.dev" \
  --header "Authorization: Bearer hch-your-key-here" \
  --header "X-Honcho-User-Name: YourName"

Once installed, Claude Code (or any MCP client — Cursor, Cline, Windsurf, Codex CLI) gains persistent memory across sessions. You can also use the richer Claude Code plugin:

/plugin marketplace add plastic-labs/claude-honcho
/plugin install honcho@honcho

Architecture & How It Works

Honcho ships as a FastAPI server with a few moving parts you need to be aware of if you self-host:

PostgreSQL with pgvector — stores messages, peers, sessions, and embeddings
Redis — caches and coordinates the async background workers
An LLM — used for ingestion-time reasoning, session summarization, and the chat endpoint's research agent. The published benchmarks use gemini-2.5-flash-lite for ingestion and claude-haiku-4-5 for the chat endpoint
An embedding model — for the hybrid search component

The interesting architectural choice is the two-stage reasoning pipeline:

Ingest-time: a small fine-tuned model captures latent information from each new message — preferences, claims, observations — and updates the peer's representation immediately. This is fast.
Dream-time: a separate background process periodically revisits prior messages and reasoning, drawing new deductions. This is where Honcho gets its high-recall, high-reasoning numbers.
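The split between the two stages can be sketched with a toy in-memory version. Everything here is a stand-in: `PeerRepresentation`, `ingest`, and `dream` are hypothetical names, and simple keyword rules take the place of the fine-tuned model and the background LLM reasoning.

```python
from dataclasses import dataclass, field


@dataclass
class PeerRepresentation:
    observations: list[str] = field(default_factory=list)  # written at ingest time
    deductions: list[str] = field(default_factory=list)    # written at dream time


def ingest(rep: PeerRepresentation, message: str) -> None:
    """Fast path: record an observation as soon as a message arrives.

    A keyword rule stands in for the small fine-tuned model.
    """
    text = message.lower()
    if "prefer" in text or "i like" in text:
        rep.observations.append(message)


def dream(rep: PeerRepresentation) -> None:
    """Slow path: revisit stored observations and derive something new.

    A hard-coded rule stands in for LLM reasoning across observations.
    """
    joined = " ".join(rep.observations).lower()
    if "terse" in joined and "bullet" in joined:
        deduction = "Likely wants short, skimmable replies"
        if deduction not in rep.deductions:
            rep.deductions.append(deduction)


alice = PeerRepresentation()
for msg in ["I prefer terse answers.", "What's 2 + 2?", "I like bullet points."]:
    ingest(alice, msg)  # per message, immediately
dream(alice)            # periodically, in the background
```

The point of the separation is that the deduction depends on two messages that may arrive weeks apart; no single ingest call sees both, so a later pass over the accumulated observations is needed to connect them.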

Token efficiency falls out of this design. On LongMem S, Honcho answers correctly 90.4% of the time while using a median of 5% of the available context per question (mean 11%). For the answering model, that is roughly a twentieth of the input tokens of a full-context query at the median (about a ninth at the mean), which adds up fast at scale.
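A quick back-of-the-envelope calculation shows what those percentages mean for the answering model's input bill. The history size and the price below are hypothetical placeholders, not quoted rates, and the sketch ignores what Honcho itself spends on ingestion and background reasoning.

```python
def input_cost(context_tokens: int, fraction_used: float, usd_per_million: float) -> float:
    """Input-token cost of one query that sends only part of the available context."""
    return context_tokens * fraction_used * usd_per_million / 1_000_000


AVAILABLE_TOKENS = 100_000  # hypothetical size of the full conversation history
PRICE = 1.00                # hypothetical USD per million input tokens

full = input_cost(AVAILABLE_TOKENS, 1.00, PRICE)
median = input_cost(AVAILABLE_TOKENS, 0.05, PRICE)  # median usage reported above
mean = input_cost(AVAILABLE_TOKENS, 0.11, PRICE)    # mean usage reported above

print(f"full context: ${full:.4f}")
print(f"median query: ${median:.4f} ({full / median:.0f}x cheaper)")
print(f"mean query:   ${mean:.4f} ({full / mean:.1f}x cheaper)")
```

The ratio is independent of the placeholder values: whatever the history size and token price, a query that sends 5% of the context costs a twentieth as much on the input side.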

Benchmarks: How Honcho Stacks Up

Plastic Labs published full benchmarks at evals.honcho.dev. The headline numbers (mid-2026):

Benchmark	Honcho	Notes
LongMem S	90.4%	(92.6% with Gemini 3 Pro)
LongMem Oracle	91.8%	Beats the underlying model alone
LoCoMo	89.9%	Beating their own prior 86.9%
BEAM	Top scores	Across all subtests

The most important number is actually the non-headline one: Claude Haiku 4.5 alone scores 62.6% on LongMem S, but 89.2% on LongMem Oracle (only the relevant sessions in context). Honcho with Haiku 4.5 underneath scores 90.4% — better than the oracle case. That means the memory layer is genuinely improving the model's reasoning, not just retrieving relevant chunks.

For comparison, recent third-party comparison work (Vectorize, glukhov.org, Atlan) places Mem0 around the 60-70% range on LongMemEval and similar benchmarks, with Letta and Zep clustered in the 65-85% band depending on configuration. Honcho's numbers are at the top of the leaderboard as of May 2026.

Real-World Use Cases (From the Community)

From discussions in the Plastic Labs Discord, Reddit r/LocalLLaMA, and Hermes/OpenClaw integration docs:

AI tutoring apps that need to remember a student's misconceptions across weeks of sessions — the canonical Plastic Labs demo
Customer support agents modeling individual customer history and preferences across tickets
Coding agents (via the Claude Code / Cursor plugins) maintaining persistent memory of project conventions, your coding style, and prior decisions
Multi-agent simulations where agents need to model what other agents know — Honcho's peer model is uniquely well-suited
Personal AI projects (the Hermes integration, OpenClaw integration) where the agent represents you and needs identity-level continuity

First Impressions From the Community

From the Reddit and DEV.to comparison threads (r/gluk thread on agent memory providers, the DEV.to comparison):

"You want the agent to model how you think → Honcho." — Vectorize comparison guide, March 2026

"Honcho and Mem0 require the most moving parts" — glukhov.org, noting Honcho's PostgreSQL + Redis + LLM + embedding dependency stack

"The peer model is the right abstraction for what we're building" — common refrain in the Plastic Labs Discord from multi-agent simulation developers

The consensus is positive on capability and benchmarks, more measured on operational complexity. If you want the lowest-friction managed memory, Mem0 is still ahead on time-to-first-query. If you want the deepest reasoning and the cleanest multi-peer model, Honcho is the pick.

Getting Started

The fastest path is the managed service:

# 1. Get an API key at app.honcho.dev (you get $100 free credits)
export HONCHO_API_KEY="hch-..."

# 2. Install
pip install honcho-ai

# 3. Start storing and querying — that's it
python -c "
from honcho import Honcho
h = Honcho(workspace_id='demo')
alice = h.peer('alice')
s = h.session('s1')
s.add_messages([alice.message('I prefer terse answers.')])
print(alice.chat('What kind of answers does the user prefer?'))
"

For self-hosting, clone the repo and use the provided Docker Compose:

git clone https://github.com/plastic-labs/honcho
cd honcho
docker compose up -d
# Server will be on http://localhost:8000

You'll still need to point Honcho at an LLM provider (OpenAI, Anthropic, Gemini, or a custom endpoint) and an embedding model. The repo's .env.example walks through the configuration.

Who Should Use Honcho (And Who Shouldn't)

Use Honcho if:

You're building a stateful AI assistant that needs to understand users, not just remember facts about them
You're shipping a multi-agent system where modeling cross-agent knowledge matters
You care about token efficiency — Honcho's 5%-median context usage is a significant cost lever at scale
You want an open-source, self-hostable option (no vendor lock-in), subject to the license terms in the repo
You're already in the MCP ecosystem (Claude Code, Cursor, Cline, Windsurf)

Skip Honcho if:

You need memory in under 30 seconds with zero infra → use Mem0's managed tier
You can't run PostgreSQL + Redis + an LLM dependency (e.g., browser-only or strict-edge environments)
Your "memory" needs are really just chat-history-with-summaries — overkill for that case
You need a single-binary, no-external-service option → look at Hindsight instead

Comparison With Alternatives

Tool	Memory Model	Self-host	Headline benchmark	Setup time
Honcho	Reasoning-first, peer-centric	✅ (FastAPI + Postgres + Redis)	90.4% LongMem S	~30 min
Mem0	Vector + LLM extraction	✅ Managed + OSS	~65% LongMemEval	~30 sec
Letta (MemGPT)	Hierarchical / OS-style	✅	83.2% (per Atlan)	~15 min
Zep	Temporal knowledge graph	✅ + managed	63.8% temporal LongMemEval	~10 min
Hindsight	Bundled, minimal deps	✅ (single binary)	91.4% overall	~5 min
Supermemory	Hybrid, managed-first	Managed	85.4%	~30 sec

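Honcho's position: reasoning depth and benchmark scores at or near the top of this group, with more operational complexity than the lightweight options. The headline numbers in the table come from different benchmarks, so treat them as rough guides rather than a strict ranking. If memory quality matters more than setup speed, it is the stronger pick.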

FAQ

Q: Is Honcho open source and self-hostable, or is it just a managed API?
Both. The FastAPI server is open source on GitHub (see the repo for current license terms — historically Apache-2.0-style). You can run it on your own infrastructure with Docker Compose, or use the managed service at api.honcho.dev with $100 of free credits on signup.

Q: How does Honcho compare to Mem0 for production use?
Mem0 wins on setup time and operational simplicity — you can be up and running in 30 seconds. Honcho wins on reasoning depth, multi-peer modeling, and benchmark performance (90.4% vs ~65% on LongMem-class evals). For a "remember last week's preferences" assistant, both will work. For multi-agent or identity-modeling use cases, Honcho is the better abstraction.

Q: Can I use Honcho with Claude Code or Cursor?
Yes — Plastic Labs ships an official Claude Code plugin (/plugin install honcho@honcho after adding the plastic-labs/claude-honcho marketplace) and a hosted MCP endpoint at https://mcp.honcho.dev that works with any MCP-compatible client including Cursor, Cline, Windsurf, and Codex CLI.

Q: What's the actual cost at scale?
You're paying for: (a) the LLM you point Honcho at — ingestion model and chat endpoint model are separate and can be different (e.g., gemini-2.5-flash-lite for ingestion, claude-haiku-4-5 for chat); (b) embedding API calls; (c) PostgreSQL + Redis hosting if self-hosted. Honcho's ~5% median context-window usage per query keeps the chat-endpoint costs significantly lower than naive "dump everything into context" approaches.
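If you want to turn that list into a number for your own workload, a small estimator along these lines is enough. It is a sketch under stated assumptions: `MonthlyUsage` and `monthly_cost` are hypothetical helpers, and every token volume and price in the example is a placeholder to be replaced with your provider's real rates.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    ingest_tokens: int     # tokens processed by the ingestion model
    chat_tokens: int       # tokens sent to the chat-endpoint model
    embedding_tokens: int  # tokens embedded for search
    hosting_usd: float     # PostgreSQL + Redis hosting when self-hosted


def monthly_cost(usage: MonthlyUsage, ingest_price: float, chat_price: float,
                 embed_price: float) -> dict[str, float]:
    """Break a month's bill into its buckets; prices are USD per million tokens."""
    per_million = 1_000_000
    parts = {
        "llm_ingest": usage.ingest_tokens * ingest_price / per_million,
        "llm_chat": usage.chat_tokens * chat_price / per_million,
        "embeddings": usage.embedding_tokens * embed_price / per_million,
        "hosting": usage.hosting_usd,
    }
    parts["total"] = sum(parts.values())
    return parts


# Placeholder workload and prices, for illustration only.
usage = MonthlyUsage(
    ingest_tokens=50_000_000,
    chat_tokens=10_000_000,
    embedding_tokens=20_000_000,
    hosting_usd=60.0,
)
bill = monthly_cost(usage, ingest_price=0.10, chat_price=1.00, embed_price=0.02)

for item, usd in bill.items():
    print(f"{item:>11}: ${usd:.2f}")
```

Keeping the ingestion and chat models as separate line items matters because they can be different models at different prices, as in the published benchmark setup.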

Q: Does Honcho work with non-OpenAI/Anthropic models?
Yes. Honcho is LLM-provider-agnostic — you can point it at OpenAI, Anthropic, Gemini, or a custom OpenAI-compatible endpoint (which means most local servers via Ollama, vLLM, or LM Studio also work).

Q: How does the "dreaming" background process affect cost?
Dreaming is a configurable background process that re-reasons across stored messages to surface new deductions. It runs asynchronously and you can tune the token budget per workspace. Plastic Labs' benchmark configs are published on GitHub at honcho-benchmarks if you want to see real numbers.

Honcho is the cleanest open-source memory layer I've seen in 2026. It's not the simplest to operate — you're committing to PostgreSQL, Redis, and an LLM bill on top of the application — but if you're building anything beyond a single-turn chatbot, the reasoning-first architecture and peer model will save you from re-inventing this exact wheel a year from now. Try it on the managed tier first with the $100 of free credits, then self-host when you've validated the fit.

GitHub: github.com/plastic-labs/honcho · Docs: honcho.dev · Evals: evals.honcho.dev
