I use Claude Code every day. Lately Anthropic has started shipping tools that are really relevant for always-on autonomous agents, so I decided to build the glue between all of these tools and features to create personal AI assistants that are just a CC instance. My goal was simple: run multiple always-on personal AI assistants on my laptop with a single Claude subscription.
The result is claude-code-hermit, an open-source Claude Code plugin: a glue layer that turns Claude Code into a 24/7 personal assistant with a heartbeat, routines, and reflection. It learns from its own work and runs on your hardware. One Claude subscription, multiple hermits, one laptop. The compute lives in the cloud at Anthropic; the state, the credentials, and the agency live with you.
This is the story of what I built and what 30 days of running 8 of them actually looks like.
Why Claude Code was already 90% of the way there
/loop for the heartbeat. CronCreate for idle-gated cron routines. Channels for Discord and Telegram. Auto-memory. Native Tasks for plan tracking. Auto mode or bypassPermissions inside a Docker container, so the unprompted loop can run without anxiety.
The compute lives at Anthropic. I don't host a model. I don't pay per runtime hour. My laptop is the orchestrator and the policy boundary, not the compute. That's the trick. Claude Code is already doing the expensive part; the only question is whether you've wired the rest into a 24/7 loop.
So the first version of hermit was small: a hatch wizard, an OPERATOR.md template, a proposal pipeline. The rest is integration.
What hermit added
Hermit boils down to four things on top of the native stack.
A wizard that adapts to whatever folder you point it at. /hatch scans the folder, then asks four or five questions: agent identity, what it should focus on, who approves what, how it should notify you, and so on. It writes an OPERATOR.md, and every session loads that file first. Works on an existing codebase, an empty directory, or a folder I created just to host the hermit.
A self-learning loop gated by proposals. Hermit is token-usage aware; it reflects on a cron, between sessions. From repeated patterns it drafts self-improvement proposals - new routines, skills, subagents, CLAUDE.md changes, anything relevant - and surfaces them in Discord / Telegram for me to accept, defer, or dismiss.
A raw/compiled knowledge model. Borrowed from Andrej Karpathy's llm-wiki: the hermit writes unedited source material it collects (snapshots, audit logs, source dumps, etc.) to raw/, and the short, curated artifacts it distills from raw (briefings, reviews, assessments, etc.) to compiled/. At the start of every new session, recent compiled notes get loaded back into context, so the agent picks up where it left off without dragging the entire history along. The result is a small, growing wiki you can read and edit, instead of a giant prompt you can't.
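As a concrete illustration - the file names here are hypothetical, not from the repo - a hermit's knowledge directory might look like:
project/
  raw/
    2026-02-03-audit-log.txt        <- unedited, append-only source material
    2026-02-03-page-snapshot.html
  compiled/
    briefing-week-05.md             <- short distilled artifacts, reloaded at session start
    dependency-review.md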
Docker is the always-on substrate. Running a 24/7 agent on a host means solving four problems at once: defensible bypassPermissions if you are not using Auto mode, automatic recovery from crashes and host reboots, isolated credentials, and a reproducible environment. /docker-setup ships them in one wizard: kernel-enforced hardening (no-new-privileges, cap_drop: ALL, pids_limit), restart: unless-stopped so crashes and reboots self-heal, named volumes that persist OAuth across container rebuilds, and a SIGTERM trap that closes the session cleanly before exit. An opt-in /docker-security layer adds nftables LAN containment, CPU/memory bounds, and a plugin-install audit log - each toggle with honest disclosure of what it doesn't catch.
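For a sense of what that looks like in practice, here is a compose sketch built around the hardening keys the wizard sets; the service name, paths, and the pids limit value are my assumptions, not the generated file:
services:
  hermit:
    build: .
    restart: unless-stopped          # crashes and host reboots self-heal
    security_opt:
      - no-new-privileges:true       # kernel blocks privilege escalation
    cap_drop:
      - ALL                          # drop every Linux capability
    pids_limit: 256                  # bound runaway process spawning (value is a guess)
    volumes:
      - hermit-auth:/home/hermit/.claude   # named volume persists OAuth across rebuilds
      - ./:/workspace
    # the image's entrypoint traps SIGTERM and closes the session cleanly on stop
volumes:
  hermit-auth: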
Setup
- claude plugin marketplace add gtapps/claude-code-hermit adds the marketplace.
- claude plugin install claude-code-hermit@claude-code-hermit --scope project installs the CC plugin.
- /claude-code-hermit:hatch runs the setup wizard: agent identity, language, timezone, four-or-five behavior questions, channel preference, optional plugin recommendations. Generates OPERATOR.md from a fresh project scan plus your answers. Quick path is ~3 minutes.
- /claude-code-hermit:docker-setup writes the Docker scaffolding, builds the image, runs OAuth login, walks through channel pairing, and verifies the container is healthy.
If you want the always-on flow without Docker, .claude-code-hermit/bin/hermit-start boots a tmux session instead.
How the self-learning loop works
The naive way to run a "self-improving" agent is to ask it, every session, "how could you do better?" and let it write whatever comes to mind. Early in a project that might be useful, but over time it produces noise. Half the suggestions are local minima, half are inventions of patterns that haven't actually recurred, and the operator gets fatigue-buried within a week. What hermit does instead is gate every step.
/reflect (cron, idle, manual)
|
v
precheck.js: 5 phases
|
+-- EMPTY (~94%) ---> no LLM, exit
|
+-- RUN|phases
|
v
reflect (LLM): three-condition rule
|
v
reflection-judge (Sonnet)
|
+-- SUPPRESS ---> memory only
| codes: no-evidence,
| no-sessions,
| covered-by-memory
|
+-- DOWNGRADE:N ---> lower tier
|
+-- ACCEPT
|
v
proposal-triage (Haiku)
|
+-- SUPPRESS ---> memory only
| codes: weak-recurrence,
| weak-consequence,
| not-actionable,
| covered-by-memory
|
+-- DUPLICATE:PROP-NNN ---> link
|
+-- CREATE ---> PROP-NNN.md
-> Discord / Telegram
Every arrow that doesn't end in CREATE is a gate that fired before tokens were spent on the next step. Verdict strings (ACCEPT, DOWNGRADE:N, SUPPRESS - <code>, DUPLICATE:PROP-NNN, CREATE) are the literal codes the two subagents return; the 94% precheck-EMPTY rate is the fleet measurement reported later in §Cost & local-first. On the noisiest hermit it's still 87%; on the quietest, a research project that ran a full month without a single LLM-backed reflection, it's 100%.
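Those verdict strings are trivially machine-checkable, which is the point. A minimal parsing sketch in Node - the function is my illustration, not the plugin's code:
// Illustrative parser for the literal verdict strings the subagents return.
function parseVerdict(line) {
  const s = line.trim();
  if (s === "ACCEPT" || s === "CREATE") return { action: s };
  let m = s.match(/^DOWNGRADE:(\d+)$/);
  if (m) return { action: "DOWNGRADE", tier: Number(m[1]) };
  m = s.match(/^DUPLICATE:(PROP-\d+)$/);
  if (m) return { action: "DUPLICATE", proposal: m[1] };
  m = s.match(/^SUPPRESS(?:\s*[-—]\s*([a-z-]+))?$/);
  if (m) return { action: "SUPPRESS", code: m[1] ?? null };
  throw new Error(`unrecognized verdict: ${line}`);
}

parseVerdict("DOWNGRADE:2");            // { action: "DOWNGRADE", tier: 2 }
parseVerdict("SUPPRESS - no-evidence"); // { action: "SUPPRESS", code: "no-evidence" }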
The precheck
Before reflect even runs, a Node script (reflect-precheck.js) decides whether any of five phases are actually due:
- compute - has there been new session activity since the last reflect?
- resolution_check - are any accepted proposals due for a "did this fix the pattern?" verification?
- cost_spike - is today's spend more than 2× the 7-day median?
- digest - has it been 7+ days since the last juvenile-phase digest?
- newborn - is this hermit less than 3 days old (different surfacing rules apply)?
If none of those fire, the precheck writes a single line to the session log and exits — no LLM call. On a typical idle day, the precheck handles most reflect invocations for free. Only when something is actually due does the script return RUN|<phases-json>, and only then does Claude read it and proceed.
This is the load-bearing reason hermit can run 24/7 without tearing through tokens. The expensive thing, letting Claude reflect on session journals and spot patterns, only happens when the cheap thing has already established that there's something to reflect on.
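A condensed sketch of that gate - the real reflect-precheck.js tracks more state, and the state-file path and field names here are assumptions:
// Condensed sketch of the five-phase gate; state layout is assumed, not hermit's real schema.
const fs = require("fs");

function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  if (!s.length) return 0;
  return (s[Math.floor((s.length - 1) / 2)] + s[Math.ceil((s.length - 1) / 2)]) / 2;
}

function duePhases(state, now = Date.now()) {
  const day = 86_400_000;
  const phases = [];
  if (state.lastSessionAt > state.lastReflectAt) phases.push("compute");
  if (state.acceptedProposals.some(p => !p.verified && p.verifyAfter <= now))
    phases.push("resolution_check");
  if (state.todaySpend > 2 * median(state.last7DaysSpend)) phases.push("cost_spike");
  if (now - state.lastDigestAt >= 7 * day) phases.push("digest");
  if (now - state.hatchedAt < 3 * day) phases.push("newborn");
  return phases;
}

const state = JSON.parse(fs.readFileSync(".hermit/state.json", "utf8")); // hypothetical path
const phases = duePhases(state);
// EMPTY exits before any LLM call; RUN|<phases-json> hands the due phases to reflect.
process.stdout.write(phases.length ? `RUN|${JSON.stringify(phases)}` : "EMPTY");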
Two subagents stand between reflect and your inbox
When reflect does run and produces a candidate, the candidate doesn't go straight to a proposal file. Two subagents run on it first.
reflection-judge verifies the evidence. The candidate cites session reports (S-NNN-REPORT.md); the judge actually reads them and confirms the pattern is in there. If reflect hallucinated the citation, which LLMs sometimes do under context pressure, the judge returns SUPPRESS with a canonical code (no-evidence, no-sessions, covered-by-memory). This is the load-bearing check against the failure mode where the agent cites a session it didn't actually read.
proposal-triage deduplicates. It reads the existing PROP-NNN.md files and memory files and checks whether a proposal already covers this pattern, whether it conflicts with a stated operator preference, or whether one of the three conditions fails. Verdict is one of CREATE | DUPLICATE:<id> | SUPPRESS.
Survive both gates, and the candidate becomes a proposal that surfaces in Discord / Telegram. Fail either, and it's recorded as a sub-threshold observation in memory - there to be promoted if the pattern recurs, invisible to the operator until it does.
The three-condition rule
Before reflect even drafts a candidate, it applies a rule that's quoted verbatim from the skill source:
- Repeated pattern - observed in 2+ sessions (1+ in newborn phase for Tier-1 micro-proposals)
- Meaningful consequence - something tangibly goes wrong without fixing it
- Operator-actionable change - a specific concrete change you can approve
If any of the three can't be stated without hand-waving, the candidate doesn't make it past reflect. This is what stops the "you should be more efficient" generic-pattern fatigue that always-on agents drift into.
Phase-aware behavior
Hermit knows how old it is. Reflect classifies it on every run:
- newborn (< 3 days) - surfaces single-occurrence sub-threshold observations inline as Noticed: <pattern> so the operator gets visible early signal
- juvenile (3–14 days) - emits weekly digests instead of per-observation lines
- adult (≥ 14 days) - silent baseline; only graduating recurrences surface
Newborn hermits aren't useful if they're silent for two weeks. Adult hermits aren't useful if they're noisy. The phasing is what makes the same skill work at day 1 and day 30.
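The classification itself is just age arithmetic; a sketch of the rule as stated above:
// Phase thresholds from the list above; hatchedAt is an epoch-ms timestamp.
function phase(hatchedAt, now = Date.now()) {
  const days = (now - hatchedAt) / 86_400_000;
  return days < 3 ? "newborn" : days < 14 ? "juvenile" : "adult";
}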
What "accept" actually does
The operator accepts a proposal in Discord / Telegram. What happens next depends on the proposal type:
- Routine proposal - the JSON config block is parsed and appended to config.json directly. The next session picks up the new routine. No further work. (A hypothetical example of such a block follows this list.)
- Skill proposal - if /skill-creator (an Anthropic-shipped plugin) is installed, hermit queues a session task with guidance to use it; otherwise Claude writes the skill on its own.
- Everything else - writes a NEXT-TASK.md file. The next /session-start offers it as the default task, with the proposal's evidence and proposed plan attached.
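For the routine path, an accepted config block might look something like this - the field names are my guess at the shape, not hermit's documented schema:
{
  "routines": [
    {
      "id": "weekly-fitness-brief",
      "schedule": "0 7 * * 1",
      "idle_gated": true,
      "task": "Distill last week's raw/ Strava exports into compiled/fitness-brief.md"
    }
  ]
}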
There is no path where the hermit auto-modifies its own skills, agents, or hooks. Operator approval is a precondition for anything that mutates the hermit's behavior beyond a config-only routine entry. The auto-curriculum is real — Voyager-style, in that the agent proposes new capabilities based on what it observed itself doing — but the editor-in-chief role is permanent and human.
Cost & local-first
I run several of these: dev work, company ops, my Home Assistant, fitness with Strava sync, my wife's social media management, and a few others. Each lives in its own project directory with its own memory, budget, and routines. They share nothing but my Claude subscription and my laptop's CPU. Token-usage efficiency mattered to me, since I run all of this on one subscription. That's why self-learning was so important, and why hermits are token-usage aware: they brief you on their token usage, and you can ask them about it directly.
Two numbers tell the architectural story:
Cache hit ratio: ~92% across the fleet. Hermit hits Anthropic's prompt cache aggressively - memory, compiled artifacts, OPERATOR.md, the bottom of CLAUDE.md. Of every token Claude reads in a hermit session, more than nine in ten are cache reads, billed at a fraction of the input rate. My quietest hermit hits 90.4%, my noisiest 95.5%. The single largest cost-control lever is keeping the cache warm — and the wizard-driven memory layout makes that the default rather than something I have to engineer.
Precheck-EMPTY: ~94% across the fleet. This is the architectural claim from earlier, now measured. On one hermit, a quiet research project, reflect-precheck.js short-circuited 100% of reflect invocations — zero LLM calls for the entire month. On the noisiest, it still handled 87%. Across the fleet, ~94% of reflects never hit the model. The cheap script does more cost work than any model-routing decision could.
Try it yourself
Hermit is open-source.
claude plugin marketplace add gtapps/claude-code-hermit
claude plugin install claude-code-hermit@claude-code-hermit --scope project
/claude-code-hermit:hatch
/claude-code-hermit:docker-setup
Repo and full README at github.com/gtapps/claude-code-hermit.