This is a submission for the Hermes Agent Challenge: Write About Hermes Agent
The header looks trivial. One line. But it's doing something architecturally significant. Here's exactly what happens when you pass X-Hermes-Session-Id to Hermes — and why it matters more than it appears.
The Naive Mental Model (and Why It's Wrong)
Most developers assume persistent session = stored chat history. Request comes in → look up conversation log → prepend to messages → send to LLM. Like a database-backed chatbot.
That's not what Hermes does.
The naive model has a cost problem. The prompt sent on each request grows linearly with the number of turns:
Turn 1: send 100 tokens
Turn 10: send 1,000 tokens
Turn 100: send 10,000 tokens
Turn N: send N × average_turn_length tokens
At 1,000 turns of about 100 tokens each, that is roughly 100,000 tokens, a short novel, on every request. This is why "just store the history" breaks for long-running agents.
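To make the arithmetic concrete, here is a small sketch of the replay cost. It assumes every turn averages 100 tokens (the figure used in the list above) and ignores model replies and system prompts; the function names are illustrative only.

```python
AVG_TURN_TOKENS = 100  # assumption: matches the 100-token turns listed above


def replay_prompt_tokens(turn: int, avg: int = AVG_TURN_TOKENS) -> int:
    """Tokens sent on a given turn when the full history is replayed."""
    return turn * avg


def replay_total_tokens(turns: int, avg: int = AVG_TURN_TOKENS) -> int:
    """Tokens sent across all turns so far: avg * (1 + 2 + ... + turns)."""
    return avg * turns * (turns + 1) // 2


for n in (1, 10, 100, 1000):
    print(n, replay_prompt_tokens(n), replay_total_tokens(n))
```

The per-request size grows linearly, but the running total grows quadratically: under these assumptions, a 1,000-turn conversation has sent about 50 million tokens in total.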
What's Actually Happening: Compressed State, Not Transcript Replay
Hermes maintains a continuously updated compressed state per session ID — not a raw transcript that grows without bound.
Prior turns are distilled into the model's retained understanding. The context window stays bounded regardless of how many turns have occurred. New inputs are processed against accumulated understanding, not against a raw replay of every prior message.
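How Hermes distills turns internally is not covered here, so the following is only a toy model of one property: state whose size stays bounded while the transcript keeps growing. The `BoundedState` class, its `budget`, and its overwrite-and-evict rule are assumptions made for illustration, not Hermes's implementation.

```python
from collections import OrderedDict


class BoundedState:
    """Toy stand-in for a compressed session state (hypothetical, not Hermes's code)."""

    def __init__(self, budget: int = 8) -> None:
        self.budget = budget          # maximum number of retained facts
        self.facts = OrderedDict()    # topic -> latest understanding of that topic

    def update(self, topic: str, note: str) -> None:
        # A newer statement about the same topic replaces the older one.
        self.facts[topic] = note
        self.facts.move_to_end(topic)
        # Hard cap: drop the least recently updated topic once over budget.
        while len(self.facts) > self.budget:
            self.facts.popitem(last=False)


transcript = []
state = BoundedState(budget=8)
for turn in range(200):
    transcript.append(f"turn {turn}")                        # grows with every turn
    state.update(f"topic-{turn % 5}", f"note from turn {turn}")

print(len(transcript), len(state.facts))
```

After 200 turns the transcript holds 200 entries while the state holds one entry per topic. A real system would distill rather than simply overwrite, but the size behavior is the point of the sketch.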
The practical effect:
```python
# Turn 1: the facts are stated explicitly
await chat("My name is Alex. I'm building a distributed cache in Rust.", "user:alex")

# Turn 200: two months and 199 interactions later.
# No history sent. No RAG lookup. Just the session ID.
await chat("What tech stack are we using again?", "user:alex")
# "You're building a distributed cache in Rust."
```
The model doesn't "find" that fact. It retained it.
The Session ID as a Namespace
Each unique X-Hermes-Session-Id value is a completely isolated memory namespace. Sessions never bleed into each other. This makes session IDs a first-class design primitive.
```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:11434/v1", api_key="hermes")


async def chat(message: str, session_id: str) -> str:
    """Send one message; the header selects which session's memory is used."""
    response = await client.chat.completions.create(
        model="hermes",
        messages=[{"role": "user", "content": message}],
        extra_headers={"X-Hermes-Session-Id": session_id},
    )
    return response.choices[0].message.content


async def main() -> None:
    # These sessions are completely isolated brains
    await chat("Commit: removed Redis cache, caused 3 outages", "repo:acme/backend")
    await chat("Commit: added Redis cache layer for performance", "repo:widgets/frontend")

    # Each query draws only from its own session
    result = await chat("What cache decisions were made?", "repo:acme/backend")
    # Knows about the Redis removal; knows nothing about widgets/frontend
    print(result)


asyncio.run(main())
```
Map session IDs to your domain:
| Domain | Session ID Pattern |
|---|---|
| Per-user memory | user:{user_id} |
| Per-repository memory | repo:{owner}/{name} |
| Per-customer support | support:{customer_id} |
| Per-project context | project:{id}:v{version} |
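A small helper keeps these patterns consistent across a codebase. The sketch below is one way to do it. The function names and the validation rule (letters, digits, and a few punctuation characters only, since the value travels in an HTTP header) are assumptions, not something Hermes requires.

```python
import re

# Assumption: restrict components to characters that are safe in a header value.
_SAFE_PART = re.compile(r"[A-Za-z0-9._/-]+")


def _part(value: object) -> str:
    """Convert a component to text and reject anything outside the safe set."""
    text = str(value)
    if not _SAFE_PART.fullmatch(text):
        raise ValueError(f"unsafe session ID component: {text!r}")
    return text


def user_session(user_id: object) -> str:
    return f"user:{_part(user_id)}"


def repo_session(owner: str, name: str) -> str:
    return f"repo:{_part(owner)}/{_part(name)}"


def support_session(customer_id: object) -> str:
    return f"support:{_part(customer_id)}"


def project_session(project_id: object, version: int) -> str:
    return f"project:{_part(project_id)}:v{int(version)}"


print(repo_session("acme", "backend"))  # repo:acme/backend
print(project_session(42, 3))           # project:42:v3
```

Building IDs in one place also makes it harder for two features to write into the same namespace by accident.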
What Gets Retained and How
Every message sent through a session is processed and distilled. Hermes prioritizes retention of:
Explicit facts — names, decisions, stated preferences, numbers
"We use PostgreSQL 15 on RDS with read replicas in us-east-1"
→ retained verbatim
Causal relationships — X was done because of Y
"Removed Redis because cache invalidation bugs caused stale product prices"
→ the causal link is retained, not just the removal
Temporal markers — when things happened relative to each other
"Tried GraphQL in Q1, reverted in Q2 due to N+1 issues"
→ the sequence and the reason are retained together
Contradictions — when new information conflicts with what's stored
Prior: "We're committed to microservices"
New: "Merged all services back into a monolith"
→ Hermes flags this as a reversal when asked about architecture decisions
This is the distinction from retrieval. RAG finds text. Hermes retains understanding of relationships between facts.
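These retention claims are easy to spot-check against your own deployment. The sketch below is a hypothetical probe harness: it takes any `ask(message, session_id)` callable (for example a synchronous wrapper around a chat helper), states a fact, asks about it afterwards, and checks the reply for expected keywords. The `Probe` type, the keyword check, and the fake backend are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Ask = Callable[[str, str], str]  # (message, session_id) -> reply


@dataclass
class Probe:
    statement: str             # fact stated early in the session
    question: str              # asked afterwards, with no history attached
    expected: Tuple[str, ...]  # keywords the reply should contain


def run_probe(ask: Ask, session_id: str, probe: Probe) -> bool:
    """State the fact, ask the question, and check the reply for every keyword."""
    ask(probe.statement, session_id)
    reply = ask(probe.question, session_id).lower()
    return all(word.lower() in reply for word in probe.expected)


def make_fake_ask() -> Ask:
    """Fake backend so the harness can be exercised without a server."""
    memory = {}

    def ask(message: str, session_id: str) -> str:
        seen = memory.setdefault(session_id, [])
        reply = " ".join(seen)  # echo everything seen so far in this session
        seen.append(message)
        return reply

    return ask


causal = Probe(
    statement="Removed Redis because cache invalidation bugs caused stale product prices",
    question="Why did we remove Redis?",
    expected=("cache invalidation", "stale"),
)
print(run_probe(make_fake_ask(), "repo:acme/backend", causal))
```

Keyword matching is a crude check. It tells you whether the causal detail survived, not whether the answer is well reasoned.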
The Cron Integration: Memory Meets Autonomy
Hermes's /api/jobs endpoint connects the session memory system to time. A registered job is a prompt that fires on a schedule — and crucially, it runs through the same accumulated session context.
```python
import httpx

# Register a job that runs against its own accumulated memory
response = httpx.post(
    "http://localhost:11434/api/jobs",
    headers={"Authorization": "Bearer hermes"},
    json={
        "name": "weekly-pattern-report",
        "schedule": "0 9 * * 1",  # cron syntax: 09:00 every Monday
        "prompt": (
            "You are the Shadow CTO for acme/backend. "
            "Review the engineering decisions you have stored in memory "
            "from the past week. Identify any recurring failure patterns "
            "or decisions that were reversed. Prepare a concise report."
        ),
    },
)
response.raise_for_status()  # fail loudly if the job was not registered
```
The agent isn't querying an external database. It's asking itself what it remembers. This is the architecture that enables genuinely autonomous behavior — not polling, not retrieval, not RAG. Introspection over accumulated memory.
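The schedule string `0 9 * * 1` reads as standard five-field cron: minute 0, hour 9, any day of month, any month, day of week 1 (Monday). As a rough sanity check, the sketch below computes the next firing time for a weekly expression of that shape. It assumes Hermes follows conventional cron semantics and ignores time zones; `next_weekly_run` is a hypothetical helper, not part of any Hermes client.

```python
from datetime import datetime, timedelta


def next_weekly_run(now: datetime, cron_dow: int, hour: int, minute: int) -> datetime:
    """Next firing time for a 'minute hour * * dow' cron expression."""
    # cron counts 0 (or 7) as Sunday; Python's weekday() counts 0 as Monday.
    target_weekday = (cron_dow - 1) % 7
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(target_weekday - now.weekday()) % 7)
    if candidate <= now:  # this week's slot has already passed
        candidate += timedelta(days=7)
    return candidate


minute, hour, _dom, _month, dow = "0 9 * * 1".split()
print(next_weekly_run(datetime(2025, 1, 1, 12, 0), int(dow), int(hour), int(minute)))
```

Starting from Wednesday 1 January 2025 at noon, the next run lands on Monday 6 January 2025 at 09:00.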
Streaming: The Architecture Underneath
For user-facing features, always stream the response (`stream=True`). Hermes reasons before answering, so on questions about accumulated history a full response can take 10–20 seconds. Streaming lets the user start reading as soon as the first tokens arrive instead of waiting for the whole answer.
```python
# Streaming via SSE in FastAPI: pass this generator to
# StreamingResponse(..., media_type="text/event-stream")
async def generate_sse(session_id: str, question: str):
    stream = await client.chat.completions.create(
        model="hermes",
        messages=[{"role": "user", "content": question}],
        stream=True,
        extra_headers={"X-Hermes-Session-Id": session_id},
    )
    async for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        delta = chunk.choices[0].delta.content
        if delta:
            # Escape newlines for the SSE wire format: a raw newline would end the data line
            escaped = delta.replace("\n", "\\n")
            yield f"data: {escaped}\n\n"
    yield "data: [DONE]\n\n"
```
The frontend side is a standard EventSource; it needs to turn each escaped `\n` back into a newline before rendering. The user sees the answer build chunk by chunk, which feels fast even when total generation takes 15 seconds.
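One caveat with the newline escaping in the generator: if the model's output already contains a literal backslash followed by `n` (common in code samples), the client cannot tell it apart from an escaped newline. A different technique, not the one used above, is to JSON-encode each chunk so the payload is always a single line and decodes unambiguously. The sketch below shows that alternative; the `delta` key and both function names are assumptions.

```python
import json


def sse_event(text: str) -> str:
    """Frame one chunk as an SSE event with a JSON-encoded payload."""
    payload = json.dumps({"delta": text})  # newlines become \n inside the JSON string
    return f"data: {payload}\n\n"


def parse_sse_event(frame: str) -> str:
    """Inverse of sse_event, as a client would apply it to each message."""
    payload = frame[len("data: "):].rstrip("\n")
    return json.loads(payload)["delta"]


chunk = 'print("a\\n")\nnext line'  # holds a literal backslash-n and a real newline
frame = sse_event(chunk)
print(parse_sse_event(frame) == chunk)
```

On the browser side, the equivalent step is `JSON.parse(event.data).delta` inside the EventSource message handler.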
When This Architecture Wins vs. RAG
| Scenario | RAG Better | Hermes Session Better |
|---|---|---|
| Search across 10k static documents | ✅ | ❌ |
| Remember context across 6 months of activity | ❌ | ✅ |
| Precise source citation with page numbers | ✅ | ⚠️ |
| Understanding causality and sequence over time | ❌ | ✅ |
| "What changed and why" questions | ❌ | ✅ |
| Real-time document ingestion at scale | ✅ | ⚠️ |
| Autonomous scheduled analysis | ❌ | ✅ |
| Detecting reversals and contradictions | ❌ | ✅ |
The OpenAI Compatibility Layer
Because Hermes exposes an OpenAI-compatible API, migrating existing OpenAI code takes very little work:
```python
# Before: OpenAI, stateless (snippet runs inside an async function)
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=conversation_history,  # you manage this
)
```

```python
# After: Hermes, persistent (snippet runs inside an async function)
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="hermes",
)
response = await client.chat.completions.create(
    model="hermes",
    messages=[{"role": "user", "content": latest_message}],  # just the new message
    extra_headers={"X-Hermes-Session-Id": user_session_id},  # Hermes handles the rest
)
```
You drop the conversation history management. You add one header. Tool use, function calling, and streaming patterns all work unchanged.
Summary
X-Hermes-Session-Id isn't a database lookup key. It's a namespace for a persistent reasoning state that accumulates understanding rather than replaying transcripts. The cost is bounded. The knowledge compounds. The autonomy follows naturally from the scheduling integration.
That's the architectural bet Hermes is making: that the future of AI agents is stateful participants that get smarter over time, not stateless query engines that start from zero on every call.
Based on what you can build with a single header, it's a bet worth taking.