I Reimplemented Anthropic Dreaming. The First Dream Was Wrong.

Anthropic announced Dreaming five days ago at Code w/ Claude.
I rebuilt it over a weekend for a single-developer crypto trading bot.
 
The first dream pass surfaced a clean time-of-day profit pattern.
The next day's full-history backtest crushed that signal —
zero variables passed Bonferroni, the walk-forward filter failed to generalize,
and 69% of the 47-day loss turned out to be concentrated in three specific days.
 
Dreaming surfaced a hypothesis. The follow-up backtest disproved it.
That is the right loop, and most "what is Dreaming" posts skip the second half.
 
This is the build, the finding, and the disproof.
 

 
What Anthropic's Dreaming actually does
 
Dreaming is a memory-consolidation feature for long-running Claude agents.
When an agent has been working in a codebase or domain for hours or days,
its context window fills with raw transcripts, tool outputs, partial conclusions, and contradictions.
 
Dreaming runs an offline pass that scans accumulated state,
extracts patterns, resolves contradictions, prunes stale references,
and writes a clean consolidated memory back.
The agent wakes up with a smaller, sharper, less-contradictory view of what it knows.
 
Anthropic described the design in their managed-agents post on 2026-05-06.
A third-party reimplementation, dream-skill by grandamenium,
formalized the four-phase loop into a skill anyone can install:
Orient → Gather Signal → Consolidate → Prune & Index.
That is the pattern I adapted.
 

 
Why a solo builder needs this
 
I run a single-developer crypto trading bot called WILD_SNIPER.
By 2026-05-10 its active-work/wild-sniper.md file had grown to 331 lines.
Three "Previous current state" blocks contradicted each other.
References to deprecated bot versions (V3.9.x forks, V4.0 paper logs)
sat next to live V3.7.1 facts.
A locked-strategy stance from 4/22 had been overturned by forensic analysis,
but the override note lived in a different section.
 
This is not a theoretical mess. On 2026-04-30 the bot entered a self-kill loop —
daily loss limit tripped (correct safety behavior),
watcher daemon restarted the bot every cycle,
bot detected the tripped limit and immediately re-shut.
Watcher → bot → watcher → bot, until midnight.
 
The bot itself worked exactly as designed.
The watcher's restart policy did not know about the tripped daily limit.
That metadata gap was the bug.
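
The fix is small once stated. A minimal sketch, assuming the bot writes its tripped-limit state to a flag file the watcher can read (path and schema here are illustrative):

```python
import json
from pathlib import Path

# Illustrative path and schema: the bot writes {"tripped": true}
# when the daily loss limit fires.
LIMIT_FLAG = Path("state/daily_limit.json")

def should_restart() -> bool:
    """Restart policy that knows about the tripped daily limit."""
    if LIMIT_FLAG.exists():
        state = json.loads(LIMIT_FLAG.read_text())
        if state.get("tripped", False):
            return False  # limit tripped: staying down is the correct behavior
    return True
```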
 
Stale memory was not the cause of that incident,
but the post-mortem made the cost obvious:
when meta-state about the bot is fragmented across 331 lines and three contradictory sections,
the operator cannot react quickly.
Memory hygiene became infrastructure work, not housekeeping.
 

 
The 4-phase pattern
 
I wrote a slash command, /dream-sniper,
that runs once a week (Sunday 09:33 KST) on the bot's accumulated state.
Source: ~/.claude/commands/dream-sniper.md.
 
Phase 1 — Orient
Inventory all input surfaces.
Count lines, note last-modified times, scan for pattern frequency.
Do not read full files yet — just measure.
 
Inputs span four categories:
current-state notes (active-work memos),
accumulated trade logs (last 4 weeks),
runtime logs (last 7 days),
and bot health state files.
The pass measures size, recency, and surface-level pattern frequency only;
actual content reading happens in Phase 2.
Output is a small JSON snapshot: per-domain line count, last-modified timestamp, and key pattern counts.
Cheap by design.
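
A minimal sketch of that snapshot builder, assuming plain-text memory files in a per-domain map (paths and pattern strings are illustrative, not the command's actual ones):

```python
import json
import os
import time

# Illustrative pattern strings; the real list lives in dream-sniper.md.
PATTERNS = ["Previous current state", "OVERRIDE", "daily loss limit"]

def orient(domains: dict[str, str]) -> str:
    """Measure each memory surface; no content interpretation yet."""
    snapshot = {}
    for name, path in domains.items():
        with open(path, encoding="utf-8") as f:
            text = f.read()
        snapshot[name] = {
            "lines": text.count("\n") + 1,
            "last_modified": time.strftime(
                "%Y-%m-%d %H:%M", time.localtime(os.path.getmtime(path))),
            "pattern_counts": {p: text.count(p) for p in PATTERNS},
        }
    return json.dumps(snapshot, indent=2)
```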
 
Phase 2 — Gather Signal
Targeted extraction.
Search categories are deliberately abstract —
the actual bot-side entry conditions, position sizing, and threshold values stay in the bot, not in the writeup.
The dream pass surfaces whether a category is showing signal; the bot keeps the what.
 
Categories scanned:
time-of-day pattern shifts,
recurring failure modes per symbol,
success-condition clusters,
user decisions logged in Recent activity,
and corrections/preference changes that should propagate forward.
 
Heavy analysis is delegated to a domain-specialist subagent
rather than pulled into the main context.
This keeps the dream pass cheap and the analysis sharp.
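
A sketch of that category-level check; the regexes are placeholders, the real conditions stay bot-side:

```python
import re

# Placeholder patterns only: the actual thresholds and entry
# conditions never leave the bot.
CATEGORIES = {
    "time_of_day_shift": r"\b([01]\d|2[0-3]):\d{2}\b",
    "symbol_failure_mode": r"STOP_OUT\s+\w+USDT",
    "correction_to_propagate": r"\b(correction|instead|prefer)\b",
}

def gather_signal(log_text: str, min_hits: int = 5) -> list[str]:
    """Return only the categories with enough hits to justify a subagent pass."""
    return [
        name for name, pattern in CATEGORIES.items()
        if len(re.findall(pattern, log_text, re.IGNORECASE)) >= min_hits
    ]
```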
 
Phase 3 — Consolidate
Integrate findings into the canonical memory.
Convert relative dates ("yesterday", "last week") to absolute.
Resolve contradictions by adopting the most recent statement.
Drop stale references.
 
The critical safeguard here is a protected-file list.
Certain memory files — autonomy rules, family PII, cross-project context,
the user's personal preferences — are explicitly excluded from any dream-pass mutation.
Without that list, a dream pass can cheerfully "consolidate" away
information that is load-bearing across other projects,
or worse, drift the user's stated rules.
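
A sketch of that guard (the file names are illustrative, not my actual list):

```python
from pathlib import Path

# Illustrative names; the real list is maintained in dream-sniper.md.
PROTECTED = {"autonomy-rules.md", "family.md", "cross-project.md", "preferences.md"}

def assert_mutable(path: str) -> None:
    """Refuse any dream-pass write to a protected memory file."""
    if Path(path).name in PROTECTED:
        raise PermissionError(f"dream pass may not mutate protected file: {path}")
```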
 
Phase 4 — Prune & Index
Hard cap on file size: wild-sniper.md stays under 200 lines.
Old "Previous current state" blocks move to archive/wild-sniper-.md.
The memory index gets a one-line summary of the new insight.
 
The archive is not deletion. Everything stays recoverable.
The active file stays sharp.
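
The cap itself is mechanical. A simplified sketch (the real pass moves whole "Previous current state" blocks, not raw line spans):

```python
MAX_LINES = 200  # hard cap on the active file

def prune(lines: list[str]) -> tuple[list[str], list[str]]:
    """Split into (active, archived) so the active file stays under the cap."""
    if len(lines) <= MAX_LINES:
        return lines, []
    overflow = len(lines) - MAX_LINES
    # Simplification: archive from the top, where the oldest history lives.
    return lines[overflow:], lines[:overflow]
```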
 

 
The first dream — what it actually found
 
First pass ran 2026-05-10 23:42 KST.
Pool: V3.7.1 sub-era A, trades from 4/7 through 4/30, N=236.
 
Headline reconfirmation:
realized R:R 0.499.
Expected value per trade -$0.039.
Cumulative PnL over 24 active trading days: -$9.25.
The math is internally consistent — this strategy is a slow leak, not noise.
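
The cross-check is quick arithmetic:

```python
n, cum_pnl = 236, -9.25
print(cum_pnl / n)           # -0.0392..., reported as -$0.039 per trade
print(round(-0.039 * n, 2))  # -9.2, within rounding of the reported -$9.25
```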
 
The new finding was a sharp time-of-day pattern in KST:
 
Late-night Asia (00-05 KST): win rate 64-87%, net positive.
Mid-morning Asia open (07-10 KST): win rate 10-43%, net dump.
The 10:00 hour alone leaked -$2.34 on N=19.
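
The bucketing behind a finding like this is simple; the trade schema below is an assumption, not the bot's actual format:

```python
from collections import defaultdict

def hourly_stats(trades: list[dict]) -> dict:
    """Per-KST-hour sample size, win rate, and net PnL.
    Assumed schema: each trade is {"hour": int, "pnl": float}."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for t in trades:
        buckets[t["hour"]].append(t["pnl"])
    return {
        h: {"n": len(p),
            "wr": sum(x > 0 for x in p) / len(p),
            "net": round(sum(p), 2)}
        for h, p in sorted(buckets.items())
    }
```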
 
Easy story to tell:
"the bot has an edge in low-volume Asia-night windows and gets shredded by the morning open."
 


 

 
The honest follow-up
 
The next day I ran a full-history backtest.
920 live paired trades, twelve bot versions, 47 days of data.
Entry-side features only —
using post-trade observables like exit type or slippage at entry time is lookahead bias.
The first pass made that mistake; the second pass fixed it.
 
Results across 69 univariate hypotheses:
 

Bonferroni-significant signals (alpha = 0.000725): 0
BH-FDR-significant signals (alpha = 0.05): 0
Raw-p < 0.05 signals: 3 (hour 10, hour 8, Friday)
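
The two corrections, sketched over the backtest's p-value vector (the vector itself is not reproduced here):

```python
def bonferroni(p_values: list[float], alpha: float = 0.05) -> list[float]:
    """Family-wise cutoff: 0.05 / 69 hypotheses gives about 0.000725."""
    cutoff = alpha / len(p_values)
    return [p for p in p_values if p < cutoff]

def bh_fdr(p_values: list[float], alpha: float = 0.05) -> list[float]:
    """Benjamini-Hochberg: keep the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    ranked = sorted(p_values)
    k = max((i for i, p in enumerate(ranked, 1) if p <= i / m * alpha), default=0)
    return ranked[:k]
```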

 
Of those three raw-p hits, the strongest (Friday, p=0.0235) turned out to be
102 of 128 Friday trades coming from one Friday — 2026-04-03.
"Fridays are bad" was structurally "April 3rd was bad."
 
The dream's 00-05 KST signal does not appear in the corrected list.
On the broader pool:
hour 0 raw-p 0.039,
hours 1-4 not even raw-significant,
hour 5 actually loses.
The pattern was real inside V3.7.1's sub-era A
but did not generalize across other bot versions.
 
The walk-forward 70/30 split (train cut at 04-20 05:35 KST) is the deciding test.
 
On the held-out 30%, the filter saves only $2.55.
The 95% bootstrap CI under conservative costs: [-$23.90, -$11.53].
Firmly negative both ways.
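
Both tests are a few lines each; the trade schema and timestamp format are assumptions:

```python
import random

CUT = "2026-04-20 05:35"  # the 70/30 walk-forward cut, KST

def walk_forward(trades: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split on the fixed timestamp. Assumed schema: {"ts": str, "pnl": float}
    with zero-padded timestamps, so string comparison orders correctly."""
    train = [t for t in trades if t["ts"] < CUT]
    test = [t for t in trades if t["ts"] >= CUT]
    return train, test

def bootstrap_ci(pnls: list[float], n_boot: int = 10_000) -> tuple[float, float]:
    """95% percentile bootstrap on total held-out PnL."""
    sums = sorted(sum(random.choices(pnls, k=len(pnls))) for _ in range(n_boot))
    return sums[int(n_boot * 0.025)], sums[int(n_boot * 0.975)]
```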
 
And the most uncomfortable finding:
the worst three days (04-03, 04-25, 04-26) account for -$13.01,
which is 69% of the entire 47-day -$18.82 loss.
Drop those three days and the bot is nearly break-even.
 
The "00-05 is good" pattern was not a signal about time-of-day.
It was an artifact of a few specific catastrophic trading days falling outside that window.
Patch the symptom and the disease keeps eating.
 

 
What Dreaming is good for, what it isn't
 
Good for:
Memory hygiene — the 331-line active-work file has a path to staying under 200 without manual archival.
Pattern detection — the time-of-day finding was real (in sub-era A) and worth surfacing.
Hypothesis generation — the dream gave the next day's backtest its question.
Decision audit trail — recent activity entries become harder to lose.
 
Not good for:
Validating the patterns it surfaces.
That is a separate step — backtest, walk-forward, multi-comparison correction.
Acting on dream output without a confirmation pass.
The system should make the hypothesis cheap and the action expensive.
Replacing domain analysis.
The dream pass is shallow by design. The real work happens in the follow-up.
 
The honest framing:
Dream surfaces "this looks like something."
The next day's rigorous backtest tells you whether it actually is something.
Dream is allowed to be wrong. The backtest is the part that has to be right.
When both pieces exist, the system gets sharper.
When only the dream exists, the system gets confidently wrong.
 

 
Cost / infrastructure
 
Runs on Claude Max subscription spare quota.
One healthcheck pass costs roughly 2k tokens.
One weekly dream costs roughly 10k.
Averaged over the week, the daily total stays under 14k tokens
(six 2k healthchecks plus about 1.4k/day of amortized dream),
below 1% of an average user's monthly limit.
 
Most subscribers use 5-15% of their plan.
The remaining 85-95% is unused quota that expires monthly.
Putting a monitoring agent on that surplus costs $0 incremental.
 
The infrastructure pattern:
one slash command per cron,
two crons per active project
(/sniper-healthcheck every 4 hours, /dream-sniper weekly Sunday).
Auto-rehydration on session start ensures cron jobs survive PC restarts without manual intervention.
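
The wiring, sketched; this assumes the Claude Code CLI's print mode (claude -p) will run a stored slash command headlessly:

```python
# Crontab (KST):
#   0 */4 * * *   -> run("/sniper-healthcheck")   every 4 hours
#   33 9 * * 0    -> run("/dream-sniper")         Sunday 09:33
import subprocess

def run(slash_command: str) -> str:
    """Invoke a stored slash command non-interactively; return its output."""
    result = subprocess.run(
        ["claude", "-p", slash_command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```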
 


 

 
Closing
 
After building this, I realized that Dreaming is not much different
from what I had already been doing for months.
 
Across multiple projects — not just the trading bot —
I have been layering automated cycles inside the context window:
log collection → pattern detection → validation → discard.
The structure was already systematized and running.
Dreaming is just a name and a formalization for that same structure.
 
That said, Anthropic shipping this as an official feature
is also a confirmation that the approach was right.
 
V3.7.1 keeps running unchanged.
No filter was added.
The next two weeks are observation.
