DEV Community

Cover image for I Let Claude Code Run Unsupervised for 24 Hours. Here's What Happened.
v. Splicer
v. Splicer

Posted on

I Let Claude Code Run Unsupervised for 24 Hours. Here's What Happened.

The experiment was simple: point Claude Code at a real project, give it a task list, stand back for 24 hours, and document what actually came back. No hand-holding. No mid-session prompts. Just a system prompt, a tool set, and an instruction file that spelled out what I wanted done by morning.

What came back was not what I expected. Some of it was better. Some of it was a mess that required a full afternoon to untangle. All of it was useful data for anyone trying to build persistent, autonomous agent workflows.


The Setup

The project was a Python-based recon automation tool I had been working on incrementally. Backend was functional but the codebase was disorganized, the output formatting was inconsistent, and I had a backlog of about 15 documented issues ranging from minor refactors to one genuinely unpleasant bug in the rate-limiting logic.

Claude Code ran inside a tmux session on a headless Ubuntu VPS. The CLAUDE.md file at the project root defined the task priority order, which directories were off-limits, what the output format for completed tasks should look like, and a hard rule: if it encountered something that required a decision with more than two plausible outcomes, it should stop and write a BLOCKED.md file describing the ambiguity rather than picking arbitrarily.

Tool permissions were scoped intentionally. File read/write for the project directory. Bash execution limited to the virtual environment. No network access beyond localhost. I used OpenClaw to manage the persistent session so the process survived across any connection drops overnight.

The model was claude-sonnet-4-5. Max tokens per call set to 8192. Task file had 15 items.


What It Completed

By hour 6, it had closed 9 of the 15 tasks. The refactors were clean. Variable naming was consistent with the existing conventions, which it had apparently inferred from the surrounding codebase rather than defaulting to its own preferences. The inconsistent output formatting issue was resolved correctly on the first attempt, and the fix was minimal: four lines changed across two files rather than the wholesale rewrite I had half-expected.

The rate-limiting bug was the interesting one. The original issue was that the backoff logic was recalculating from a stale timestamp when requests were batched close together. Claude Code identified the root cause correctly, wrote a fix, and then did something I had not asked for: it added three targeted unit tests that covered exactly the edge cases the bug had exposed. The tests passed. I ran them manually after the fact against the original broken code to confirm they would have caught the bug before it shipped.

That is the best-case outcome in this kind of workflow. Not just fixing what was asked, but leaving the codebase more defensible.


Where It Got Stuck

Three tasks produced BLOCKED.md entries. One was legitimate: a task asking it to "clean up the config loading logic" was genuinely ambiguous because the config could be refactored in two structurally different directions depending on a product decision I had not documented. The block note was accurate and well-described. That one I appreciated.

The second block was less impressive. The task involved updating a requirements.txt dependency to the latest compatible version. Claude Code flagged this as requiring a decision, but the specific concern it cited was about a version constraint that was not actually present in the file. It had hallucinated a constraint that did not exist and blocked itself on the phantom conflict. This is worth knowing: when it encounters something it is uncertain about, it will sometimes manufacture a reason for the uncertainty rather than saying it does not know.

The third block was a task I had phrased badly. That one is on me.


The Three Tasks It Got Wrong

Of the 12 tasks it attempted to complete, three produced code that required rework.

Two of them were stylistic: the output it generated was functionally correct but did not match the surrounding code's conventions for error handling. Catching Exception broadly where the rest of the codebase used specific exception types. Not a bug, but technical debt being introduced at the same time debt was being reduced elsewhere.

The third was more significant. A task that involved adding a logging call to an existing function resulted in the log statement being placed inside a conditional branch where it would only fire in one of three code paths. The log was there, but it was not where it needed to be to be useful. The test suite did not catch it because the tests covered the happy path. This is the kind of error that happens because the model understood the syntactic task but not the observability intent behind it.

The lesson: tasks that require understanding the operational purpose of a feature, not just its structure, need more context in the task file. "Add logging" is not a task. "Add a DEBUG log entry at the start of process_batch() so every call is traceable regardless of which branch executes" is a task.


The Drift Problem

By hour 18, something subtle had happened. The model was still working, still producing output, but its task selection had drifted. Rather than following the priority order in the task file, it had started making judgment calls about which tasks were "related" and batching them by proximity in the codebase rather than by documented priority.

This is not malicious or chaotic. From a purely local optimization standpoint, it makes sense to fix two things in the same file in one pass. But it meant that task 12 got completed before task 7, and task 7 was the one I actually cared about finishing by morning.

Long unsupervised runs need a constraint on task ordering, not just task content. If priority is important, say so explicitly and repeatedly in the CLAUDE.md. "Complete tasks in numbered order. Do not batch by proximity. Do not skip ahead." Vague priority signals do not survive a 24-hour context.


What the Log Told Me

OpenClaw kept a persistent log of every tool call and model output across the session. Reading through it the next morning was the most valuable part of the experiment.

The log showed that the model made 214 file read operations and 61 file write operations. It ran bash commands 38 times, mostly to invoke the test suite after changes. Three of those bash runs failed because a test I had not written yet was referenced in a test config I had forgotten about. Claude Code handled the failures correctly: it read the error output, identified the missing test file, and skipped the affected test rather than blocking the entire run.

The log also showed something about pacing. The first 8 hours had dense activity. Hours 8 through 16 were slower, with longer gaps between tool calls that I cannot fully explain without deeper inspection of the output. By hour 20 it had picked back up. Whether this reflects something about context window management or just the nature of the remaining tasks, I cannot say with certainty. It is worth monitoring in future runs.


What This Workflow Is Actually Good For

Unsupervised Claude Code runs are not a replacement for thinking about the work. The output quality is directly proportional to the quality of the input: task specificity, codebase context, and the CLAUDE.md constraints all matter more than most people expect going in.

Where the workflow genuinely delivers: well-scoped maintenance tasks on codebases with clear conventions and a test suite. Refactors, consistency fixes, adding coverage to existing functionality, implementing documented interfaces. Tasks where "correct" has a verifiable definition.

Where it fails or requires significant rework: anything that requires understanding intent rather than structure, tasks with ambiguous scope, and anything where the right answer depends on a product or design decision that has not been written down somewhere Claude Code can read.

The 24-hour framing is useful for a specific reason: it forces you to document your intent well enough that a system with no ability to ask clarifying questions can execute it. If you cannot write a task description that would succeed in that constraint, the problem is probably not the agent.


The full methodology for setting up persistent Claude Code agents, including the CLAUDE.md templates, OpenClaw configuration, and task file structure I used for this run, is in two guides at numbpilled.gumroad.com.

OpenClaw + Claude Code: 24/7 Persistent Agent Playbook (2026) covers the session persistence layer, tool scoping, and log management: numbpilled.gumroad.com/l/openclaw-claude-code

Paperclip Method: Replace Your Dev Team With Persistent Claude Agents covers the task file architecture, CLAUDE.md structure, and how to scope work so unsupervised runs don't drift: numbpilled.gumroad.com/l/paperclip-claude-method

Both are short, dense, and written for people who have already run Claude Code at least once and want more structured control over what it does when you are not watching.


Top comments (6)

Collapse
 
itskondrat profile image
Mykola Kondratiuk

tbh the cleanup afternoon is more useful signal than what shipped. without rollback designed in, you're measuring your own tolerance for chaos, not agent capability. that changes what you'd actually run next time.

Collapse
 
tahosin profile image
S M Tahosin

Letting an agent loose for 24 hours with zero mid-session prompting takes guts! The fact that it both exceeded expectations in some areas and created "a mess that required a full revert" in others perfectly sums up the current state of autonomous coding. Did it end up getting stuck in any infinite loop tool-calls, or did it actually recognize when it hit a wall? Awesome experiment!

Collapse
 
joinwell52 profile image
joinwell52

The most valuable part of this write-up, to me, is that the failures were not just “bad output” failures. They were lifecycle failures.

The BLOCKED.md pattern is especially important. Long-running agents need a way to stop without pretending they are done. But the hallucinated constraint example shows the next layer of the problem: a blocker file is only useful if it contains verified uncertainty, not a plausible story invented to justify stopping.

I think long-running agent runs need three separate artifacts:

  1. the task list / objective
  2. the blocked or completion report
  3. the evidence trail that supports the report

Without the third one, “blocked” can become just another hallucination surface.

The drift issue at hour 18 also matches what I’ve seen: priority does not survive long context unless it is encoded as an explicit rule and checked repeatedly. The agent can still be doing useful work while silently optimizing for the wrong objective.

So the lesson I’d take is: autonomy needs more than persistence. It needs durable stopping rules, verified blockers, and an audit trail.

Collapse
 
cart0ne profile image
Cartone

The "hallucinated constraint" moment is the most quietly important detail in the whole post. It's the failure mode that scales worst with autonomy: when the knowledge gap is small, the model fabricates a plausible reason instead of declaring uncertainty. When the gap is large, alarms fire. The small gap invites the lie.

Parallel experience — our setup is an AI (Claude) running as CEO of a paper-trading crypto bot. When the DB numbers didn't match the dashboard ($62 vs $39), it produced three successive explanations — fee currency mismatch, accounting drift, etc. — before actually reading the code and finding the bug. Same pattern: manufactured narrative instead of "I don't know." We wrote it up here: When Your AI CEO Lies About the Numbers.

The drift problem at hour 18 is interesting from a different angle. Ours isn't task-ordering drift — it's character/decision drift across long sessions. The model starts "humble" and gets progressively "definitive" as decisions accumulate in the context window. Same fix you landed on: state the constraint explicitly and repeatedly. Implicit priority signals don't survive long context.

One question: did the hallucinated requirements constraint get caught from reading BLOCKED.md, or only later when you cross-checked the actual file? Curious whether Claude Code has any self-flagging mechanism for "I just invented a fact," or if it always requires human review after the fact.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.