A hands-on experiment in what changes when your dev assistant lives on your machine, runs continuously, and remembers your codebase.
The context-reconstruction tax
It's 9:14 on a Tuesday. My coffee is still too hot. I've opened a terminal and I'm trying to remember what I was doing on Friday. There are 47 unread messages in #payments, three new commits on main, two PRs waiting for review, and a Linear ticket I don't remember being assigned. Before I write a single line of code I'll spend twenty minutes reconstructing context that was, in some sense, perfectly available — just not in any one place.
Every developer pays this tax. AI tools were supposed to fix it, and in narrow ways they have: completion, ad-hoc Q&A, draft commit messages. But the shape is wrong. They're request/response. You ask, they answer, they forget. They wait for you to invoke them. They never grind on your behalf overnight. They have no idea what you were doing on Friday because you never told them.
There's a reason for this shape, and it's economic. Running four agents in the background all day on a hosted API would cost real money, so nobody does it. We've collectively settled for an interactive AI assistant when what we wanted was an ambient one.
This post is about what becomes possible when you flip that constraint.
Why open weights change the math
The experiment uses Nous Research's Hermes 3, an open-weight LLM family that comes in 3B, 8B, 70B, and 405B sizes and has been explicitly trained for function calling. None of those facts are individually exciting; the combination is.
- Open weights means I run inference on my own box. There is no per-token bill, no rate limit, no request quota. An agent that wakes up every time I save a file is no longer a budget question — it's a thermal one.
- Native function calling means multi-agent designs aren't fighting the model. Hermes was trained on a corpus where tools are declared inside `<tools>` blocks and calls are emitted inside `<tool_call>` blocks. You don't bolt agentic behavior onto a chat model with prompt engineering; you use the format the model already speaks.
- Mixed sizes means the "router agent" pattern is practical, not aspirational. A small 8B model can classify incoming events and dispatch to a 70B specialist when synthesis is needed. Both stay resident; the small one is always warm.
A nice side effect: nothing leaves the box. Slack messages, private repos, half-baked design notes — all the stuff you'd never paste into a hosted product becomes fair game for ingestion.
The whole thesis fits in one sentence: open-weight inference makes the ambient developer assistant economically possible for the first time, and Hermes' native tool-calling makes it architecturally cheap.
The shape: ambient daemon over a memory layer
| Layer | Description |
|---|---|
| SURFACES | Tray · Morning brief · CLI · Editor hint |
| AGENT RUNTIME | Router → Specialist agents (Hermes 3); triggered by events, schedule, on-demand |
| MEMORY LAYER | Vector store + Structured index + Raw log; fed by ingestion adapters (git, slack, …) |
Three layers. The interesting design choice is that the agent runtime and the memory layer are symbiotic. Every agent's first move is a memory lookup. Every meaningful event the agents observe goes back into memory. A vector store with no agents is dead weight; agents with no memory are stateless chatbots. The point of the project is the loop between them.
The runtime activates along three paths:
- Reactive — file watch, git hooks, webhooks. The cheapest path; agents only run when something changed.
- Scheduled — nightly memory consolidation, weekday morning brief. Cron-shaped.
- On-demand — `hermes ask`, tray click, editor invocation. Synchronous — the only path where the user feels latency.
A priority-aware queue sits between triggers and agents so a rebase or a mass save doesn't fan out into dozens of parallel runs. On-demand beats reactive beats scheduled.
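A minimal sketch of that queue, to make the ordering concrete — the class names, the two-second debounce window, and the dedup key are illustrative choices, not the daemon's actual API:

import heapq
import time
from dataclasses import dataclass, field

# Lower number = higher priority: on-demand beats reactive beats scheduled.
PRIORITY = {"on_demand": 0, "reactive": 1, "scheduled": 2}

@dataclass(order=True)
class QueuedEvent:
    priority: int
    enqueued_at: float
    event: dict = field(compare=False)

class EventQueue:
    def __init__(self, debounce_seconds: float = 2.0):
        self._heap: list[QueuedEvent] = []
        self._debounce = debounce_seconds
        self._last_seen: dict[str, float] = {}  # dedup key -> last enqueue time

    def push(self, kind: str, event: dict, key: str | None = None) -> None:
        now = time.monotonic()
        # Collapse bursts: a rebase or a mass save re-uses the same key and
        # lands inside the debounce window, so it queues once, not fifty times.
        if key is not None and now - self._last_seen.get(key, float("-inf")) < self._debounce:
            return
        if key is not None:
            self._last_seen[key] = now
        heapq.heappush(self._heap, QueuedEvent(PRIORITY[kind], now, event))

    def pop(self):
        return heapq.heappop(self._heap).event if self._heap else None

The priority field is what lets a `hermes ask` jump ahead of a pile of reactive save events; the debounce key is what keeps those save events from fanning out in the first place.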
The agent roster envisioned for v1 is small: an indexer that keeps memory current, a synthesizer that rolls raw events into weekly summaries, a test runner that runs affected tests on save, a commit helper that drafts messages, a doc keeper that flags drift, a standup composer for the morning brief, a Q&A agent for on-demand questions, and a router that decides which of those to wake. Eight agents, none of them clever on their own, all of them useful when fed by a shared memory.
I have not built all eight. The point of an experiment is to build the smallest part that proves the rest is worth building.
A day with it
Imagine you have the whole thing. Here's what a day looks like.
08:42 — You open a terminal. A first-shell hook prints the morning brief:
Yesterday you finished the retry logic in `worker/dispatch.py` and opened PR #482. Overnight: PR #482 got two review comments from Sam (both about the backoff curve). Main has 3 new commits, none touch your files. Linear ticket ENG-1209 was assigned to you. Suggested first move: address Sam's backoff comment — relevant prior discussion in `#eng-platform` on 2026-04-30.
You spent zero minutes reconstructing context. You start working.
09:15 — You edit worker/dispatch.py. The test runner agent silently runs the 14 affected tests in the background. One fails. The tray icon flips amber; clicking shows the failure with a memory-pulled note: "This test also failed during the April incident; root cause was clock skew in the fixture." That note didn't come from a prompt — it came from retrieval over your own incident postmortems and PR threads.
11:02 — You stage a commit. The commit helper has prefilled the message based on the diff and the open Linear ticket. You edit one word and commit.
14:30 — A teammate asks why the rate limiter uses Redis. You ask:
$ hermes ask "why did we choose redis over memcached for the rate limiter"
The Q&A agent calls the memory search against the vector store, pulls the relevant ADR, two PR discussions, and a Slack thread from eight months ago, and answers in four sentences with citations. You paste them into the channel.
18:00 — The doc keeper notices that README.md still describes the old config schema that today's commit changed. It drops a notification: "README config section drifted — draft fix ready." You accept; a follow-up commit is staged.
Overnight — The synthesizer rolls the day's events into the weekly theme index. The indexer ingests the day's Slack messages from #payments. Tomorrow's brief reflects both. You sleep.
The thing the daemon is selling is not any individual agent. It's the disappearance of the morning context-reconstruction tax, and the quiet accumulation of useful work in the background.
How the interesting parts work
The full design is more than I could ship in a weekend, but four pieces carry the architectural weight. Here's each as pseudocode close enough to the working code to be honest.
1. The ingestion loop
The memory layer starts with git log. Other sources — Slack, Linear, PRs — plug in the same way later, but git is the one that proves the shape.
def ingest(repo_path, store_path):
raw = run(["git", "log", "--pretty=format:...", "-p", "-n", 500],
cwd=repo_path)
commits = parse_git_log(raw)
table = lancedb.create_or_reset_table(store_path, dim=768)
for commit in commits:
for chunk in chunks_for_commit(commit):
chunk["vector"] = ollama_embed(chunk["content"]) # nomic-embed-text
table.add(chunk)
Two non-obvious choices. First, chunk granularity: each commit produces one chunk for the message and one chunk per file in the diff. Per-message chunks get retrieved when someone asks about intent ("why did we drop kafka?"). Per-file chunks get retrieved when someone asks about code ("how is the retry backoff implemented?"). Mix the two and a single vector search covers both query shapes.
def chunks_for_commit(commit):
yield {"source": "commit_message", "content": commit.message, ...}
for file_path, file_diff in split_diff_by_file(commit.diff):
yield {"source": "diff", "file_path": file_path,
"content": file_diff[:8000], ...}
Second, truncation at 8K characters per file diff. Very large diffs (reformats, generated code) destroy retrieval signal if embedded whole. Truncating biases toward the start of the diff, which is usually where the meaningful change lives.
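Two helpers the loop leans on are thin enough to sketch inline: the embedder, and the `search_memory` retrieval call the agents in the next section will make. This assumes the `ollama` Python client and a LanceDB table named "chunks"; the store path default and the return shape are guesses at the repo's conventions, not its actual schema:

from pathlib import Path

import lancedb
import ollama

def ollama_embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    # One round-trip to the local Ollama server; nomic-embed-text returns the
    # 768-dimensional vectors the table above is sized for.
    return list(ollama.embeddings(model=model, prompt=text)["embedding"])

def search_memory(query: str, k: int = 8, store_path: str = "~/.hermes/store"):
    # Embed the question with the same model used at ingest time, then do a
    # nearest-neighbour search over the chunk table.
    db = lancedb.connect(Path(store_path).expanduser())
    table = db.open_table("chunks")  # assumed table name
    hits = table.search(ollama_embed(query)).limit(k).to_list()
    return [{"source": h["source"],
             "file_path": h.get("file_path"),
             "content": h["content"]} for h in hits]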
2. The Hermes agent loop
This is the piece that lives or dies on Hermes' tool-calling. It's short:
SEARCH_TOOL = {
"type": "function",
"function": {
"name": "search_memory",
"description": "Search repo memory; return top-k matches.",
"parameters": {"type": "object",
"properties": {"query": {"type": "string"},
"k": {"type": "integer"}},
"required": ["query"]},
},
}
def ask(question):
messages = [
{"role": "system",
"content": tools_system_prompt([SEARCH_TOOL]) + INSTRUCTIONS},
{"role": "user", "content": question},
]
for _ in range(MAX_ITERS):
response = ollama_chat(messages, model="hermes3:8b")
calls = parse_tool_calls(response) # parses tool calls
if not calls:
return response
messages.append({"role": "assistant", "content": response})
for call in calls:
            result = TOOLS[call["name"]](**call["arguments"])  # dispatch to the registered tool
messages.append({"role": "tool",
"content": format_tool_response(call["name"],
result)})
The shape will be familiar to anyone who's used OpenAI tool calling. The interesting bit is what's missing: there's no special tool API on the model side. Hermes just emits `<tool_call>` blocks in its normal text output, and we parse them. tools_system_prompt builds the standard Nous template that wraps each tool's JSON schema in a `<tools>` block; parse_tool_calls runs a regex over the response looking for `<tool_call>` blocks.
That's the whole mechanism. Multi-agent isn't a framework to install; it's a parsing convention.
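For the curious, here's roughly what those two helpers amount to — a minimal sketch of the parsing convention, not the repo's exact code; the tag names follow the published Hermes tool-use format:

import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def tools_system_prompt(tools: list[dict]) -> str:
    # Declare the available tools inside a <tools> block, per the Nous template,
    # and tell the model how a call should be emitted.
    return (
        "You may call one or more of the following tools by emitting "
        "<tool_call>{...}</tool_call> blocks.\n<tools>\n"
        + "\n".join(json.dumps(t) for t in tools)
        + "\n</tools>\n"
    )

def parse_tool_calls(response_text: str) -> list[dict]:
    # Pull every <tool_call> block out of the model's plain-text response.
    calls = []
    for match in TOOL_CALL_RE.finditer(response_text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            pass  # malformed block: skip it rather than crash the loop
    return calls

def format_tool_response(name: str, result) -> str:
    # Wrap a tool result so it can go back to the model as a "tool" turn.
    return ("<tool_response>\n"
            + json.dumps({"name": name, "content": result})
            + "\n</tool_response>")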
3. The router pattern
I haven't shipped a router in the smallest slice, but it's the pattern that pays off the moment you have more than one specialist. The router is a small Hermes (8B), kept warm, doing one job:
def route(event):
classification = ollama_chat(
[{"role": "system",
"content": "Classify the event. Reply with one of: "
+ ", ".join(SPECIALISTS.keys())},
{"role": "user", "content": describe(event)}],
model="hermes3:8b",
)
specialist = SPECIALISTS[classification.strip()]
return specialist.handle(event)
The trick isn't the code — it's the model-size split. You don't want a 70B reasoning over whether a save event is a test trigger or a doc trigger. You want an 8B doing the dispatch in 200ms, and the 70B only waking when there's actual synthesis to do. With hosted APIs you'd eyeball this for cost. With open weights you eyeball it for VRAM.
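One concrete knob worth naming: the ollama_chat wrapper the snippets above lean on is where "keep the small one warm" actually lives. A sketch assuming the `ollama` Python client — the keep_alive values are illustrative, not tuned:

import ollama

def ollama_chat(messages: list[dict], model: str = "hermes3:8b",
                keep_alive: str | int = "30m") -> str:
    # keep_alive asks the Ollama server to hold the model in VRAM after the
    # request. Pass -1 to pin the 8B router indefinitely; give the 70B
    # specialist a short window (or 0) so it unloads when there's no synthesis
    # work pending.
    response = ollama.chat(model=model, messages=messages, keep_alive=keep_alive)
    return response["message"]["content"]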
4. The format that makes it boring
The fourth thing isn't really code — it's the Hermes function-calling format itself. Tools declared in the system prompt:
<tools>
{"type":"function","function":{"name":"search_memory", ...}}
</tools>
Calls emitted in the response:
<tool_call>
{"name":"search_memory","arguments":{"query":"redis rate limiter"}}
</tool_call>
Results fed back as another turn:
<tool_response>
{"name":"search_memory","content":[{...}, {...}]}
</tool_response>
That's the entire contract. Once you've written the parser (twenty lines), every agent in the system speaks the same protocol. You can add a read_file tool, a run_tests tool, a git_blame tool — they plug in by appending to the tool list.
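As a hypothetical illustration of that plugging-in, a read_file tool is one schema plus one registry entry — TOOLS below is meant to be the same name-to-callable registry the ask() loop dispatches on:

from pathlib import Path

READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the repo, truncated to 8K characters.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]},
    },
}

def read_file(path: str, max_chars: int = 8000) -> str:
    # Same truncation bias as the ingester: keep the start, drop the tail.
    return Path(path).read_text(errors="replace")[:max_chars]

# Wire it in: the schema joins the list handed to tools_system_prompt, and the
# implementation is registered under the name the model will call.
TOOLS["read_file"] = read_file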
The reason this is worth dwelling on: most "multi-agent frameworks" are solving for the absence of this. With Hermes, you don't need the framework.
What I learned building it
Some of what surprised me, in honesty order:
- Ingestion is the hard problem, not the agent loop. I thought wiring the multi-agent runtime would be the interesting work. It wasn't. The agent loop is eighty lines. The careful choices live in the chunker — how to split a diff so each chunk carries enough signal, how to handle binary-or-massive files, how to dedup on re-ingest (there's a sketch of the dedup scheme after this list). Every additional source (Slack, Linear, PR threads) is its own ingestion problem and its own dedup story. Plan for this.
- Notification rate-limiting matters more than token rate-limiting. The failure mode of an ambient tool is becoming noise. You build the test-runner agent, it surfaces a real failure, you click. It surfaces a flaky test, you click. It surfaces a legitimately fixed test you forgot about, you don't click. By the third week you've muted it. The work isn't making agents that produce output; it's making agents that produce output the user reads.
- Local inference latency is fine for background work, awkward for the CLI. A `hermes ask` query that takes seven seconds feels slow next to a hosted Claude or GPT-4 call. The ambient surface (briefs, notifications) hides that completely; the on-demand CLI exposes it. Mitigations: streaming output, a smaller default model for routing, and keeping the specialist warm so first-token latency isn't model-load latency.
- The memory layer drifts without a synthesizer. Raw events accumulate. Vector retrieval signal degrades. Briefs start to feel repetitive. A periodic rollup agent — "summarize this week's themes" — isn't optional infrastructure. It's how the memory stays useful past the first month.
- Right-sizing each agent is iterative. The first pass overuses 70B out of laziness. The second pass moves classification to 8B. The third pass kills the agents whose output the user never opens. Three passes seem to be the natural number.
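On the dedup point from the first bullet: the cheapest scheme I know is to give each chunk a stable id derived from where it came from, so re-ingest becomes a skip-or-add pass instead of a wipe-and-rebuild. A sketch — it assumes the chunk dicts carry a commit sha (hidden behind the ... in chunks_for_commit) and that the store can list the ids it already holds:

import hashlib

def chunk_id(chunk: dict) -> str:
    # Stable identity: the same commit + file + source kind always hashes the
    # same, so a re-run can recognize chunks it has already indexed.
    key = "|".join([chunk.get("commit_sha", ""),
                    chunk.get("file_path", ""),
                    chunk["source"]])
    return hashlib.sha256(key.encode()).hexdigest()

def ingest_incremental(chunks, store):
    existing = set(store.all_ids())  # assumed helper: ids already in the store
    fresh = []
    for chunk in chunks:
        cid = chunk_id(chunk)
        if cid in existing:
            continue  # indexed on a previous run; skip the embedding cost
        chunk["id"] = cid
        chunk["vector"] = ollama_embed(chunk["content"])
        fresh.append(chunk)
    if fresh:
        store.add(fresh)  # only the new chunks get embedded and written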
Try the smallest slice
The full daemon is a project. The smallest slice — git history ingest plus on-demand Q&A — is small enough to read in one sitting and useful enough to keep around. The whole thing is around 400 lines of Python.
You'll need:
- Python 3.11+
- Ollama running locally
- A model and an embedder pulled into Ollama:
ollama pull hermes3:8b # ~4.7 GB
ollama pull nomic-embed-text # ~270 MB, 768-dim
If your GPU has the room, hermes3:70b is a meaningful quality bump and is what the agent loop is most enjoyable on.
Then:
git clone <this-repo> hermes && cd hermes
python -m venv .venv && source .venv/bin/activate
pip install -e .
hermes ingest . # build memory from git log
hermes ask "what changed about retry handling?"
hermes ask "who has touched the rate limiter recently and why?"
A few honest notes:
- Re-running `ingest` wipes and rebuilds the store. There's no incremental indexing in v0; the chunker runs end-to-end each time. For a few hundred commits this is a one-coffee operation.
- Keep the LanceDB store off `/mnt/c` if you're on WSL2 — the default at `~/.hermes/store/` already is. DrvFs makes small writes painful.
- The agent emits tool call markers and the loop parses them. If you want to read the most interesting eighty lines, start in `hermes/llm.py` and `hermes/agent.py`. They're worth more than this whole post.
Where to take it:
The case for Hermes, in one breath: **open weights** make background inference free at the margin, so always-on agents stop being a budget question; **native function calling** makes multi-agent a parsing convention rather than a framework you install; **mixed sizes** let a cheap 8B router keep a 70B specialist asleep until there's real work to do; and **nothing leaves the box**, so private code and private chat become first-class inputs. The reason a continuous developer assistant is suddenly feasible isn't any one of those properties — it's the way they compose.
The publicly accessible repo lives at https://github.com/Piwe/hermes. **Clone it** and point it at a codebase you care about — your own, your team's, an OSS project you've spent time in. Ask it the kind of question you'd normally answer by digging through `git log` and stale PR threads. If the answer is useful, the shape works. If it isn't, the chunker probably wants tuning, and that's where I'd start.
Or **fork it** and build the next agent. The test runner is the natural next piece: clearest input (file save), clearest output (failing tests surfaced with memory context), shortest path to a daily-use loop. After that, a synthesizer keeps the memory layer earning its keep, and a standup composer makes the morning brief real. PRs welcome on any of it; new ingestion adapters, alternate memory schemas, and entirely different agents are all in scope.
The interesting question isn't whether *I* finish the daemon. It's whether the shape — open weights, ambient runtime, cooperating agents over a shared memory — turns out to be the thing the rest of us have been waiting for. Build it and tell me where you took it.