Does Claude Code Need Sleep? Inside the Unreleased Auto-dream Feature

灯里/iku

Greetings from the island nation of Japan.

There is something profoundly humbling about discovering that your AI coding assistant might need a nap. I opened Claude Code's /memory menu expecting the usual housekeeping options, only to find a toggle labelled "Auto-dream: off", sitting there like a dormant cat on a warm keyboard, refusing to be woken. It cannot be turned on. Anthropic, it seems, has built the bedroom but has not yet handed out the pyjamas. We have reached the stage of technological evolution where the question is no longer "Can AI think?" but rather "Can AI benefit from sleeping on it?" (personally, I find the implications for my own work-life balance rather unsettling). This article traces the thread from a stray Twitter post through source code archaeology and a UC Berkeley research paper, assembling the circumstantial case for why your CLI might soon require a bedtime story. By the end, you will either be convinced that LLM memory consolidation is the next frontier, or at least equipped to say goodnight to your terminal with a straight face. Truly.

What Is Auto-dream?

How I Found It

A post drifted across my Twitter timeline:

"just found out Claude Code has a new (unreleased?) feature called 'Auto-dream' under /memory — according to reddit, this basically runs a subagent periodically to consolidate Claude's memory files for better long-term storage"

I opened /memory in my local Claude Code. There it was.


Memory

    Auto-memory: on
    Auto-dream: off · never

  > 1. User memory          Saved in ~/.claude/CLAUDE.md
    2. Project memory        Checked in at ./CLAUDE.md
    3. Open auto-memory folder

It shows up in the UI, but you cannot turn it on.

Digging Into the Source with Claude Code

Curious, I asked Claude Code itself to investigate. We dug through the source together and found the following.

Auto-dream is controlled by a server-side feature flag (codename: tengu_onyx_plover). It is not a simple toggle in settings.json. Anthropic manages the rollout on their end.

The default values are:

enabled: false
minHours: 24  # minimum 24-hour interval
minSessions: 5  # minimum 5 sessions accumulated

The UI shows it, but the feature is not yet available to the general public. Anthropic appears to be rolling it out gradually.

What the Defaults Tell Us About the Design

These three parameters alone reveal quite a bit about the design intent.

| Parameter | Value | Meaning |
| --- | --- | --- |
| `enabled` | `false` | Server-side flag. Changing settings.json locally has no effect |
| `minHours` | `24` | At least 24 hours must pass since the last run. Once per day at most |
| `minSessions` | `5` | Will not run unless 5 sessions have accumulated |

There is no point in tidying a small amount of memory frequently. Let it accumulate, then consolidate once a day. The concept closely mirrors memory consolidation during human sleep.

Why Auto-dream Is Needed

Auto-memory, as it exists today, has a structural problem.

The Write-and-Forget Problem

Auto-memory writes what it learns during conversations to memory files. However, there is no mechanism to organise them.

  • Throwaway working notes and genuinely important learnings are stored side by side
  • Similar content gets written over and over
  • Notes about resolved issues or abandoned tech stacks linger indefinitely
  • MEMORY.md is capped at 200 lines, yet the space fills up without any curation

The more sessions you run, the worse the quality of your memory gets. I actually turned Auto-memory off on my own Claude Code for this exact reason. It kept memorising things that frankly did not need memorising.

Auto-dream Is the Missing Half

It seems natural to think Auto-memory and Auto-dream were designed as a pair from the start.

  • Auto-memory: the writing phase. Jot down notes during conversations
  • Auto-dream: the organising phase. Consolidate, deduplicate, and prune accumulated notes

Only one half shipped first, leaving us in a halfway state: taking notes but never tidying the notebook.
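To make the "organising phase" concrete, here is a deliberately naive sketch of consolidation as deduplication plus pruning. The real subagent presumably uses an LLM to judge relevance rather than string matching; the `consolidate` function and the `[resolved]` marker are purely hypothetical.

```python
# Naive sketch of the organising phase: drop blanks, exact duplicates,
# and notes flagged as resolved, then enforce the line cap.
# Illustrative only -- not Auto-dream's actual logic.
def consolidate(notes: list[str], max_lines: int = 200) -> list[str]:
    seen: set[str] = set()
    kept: list[str] = []
    for note in notes:
        key = note.strip().lower()
        if not key or key in seen:
            continue  # skip blanks and exact duplicates
        if "[resolved]" in key:
            continue  # prune notes about issues already closed
        seen.add(key)
        kept.append(note.strip())
    return kept[:max_lines]  # respect the MEMORY.md 200-line cap
```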

The Sleep-time Compute Paper

Auto-dream's design philosophy has a theoretical backing in a paper published in April 2025.

Overview

Sleep-time Compute: Beyond Inference Scaling at Test-time
Kevin Lin, Charlie Snell et al. (Letta + UC Berkeley)

arXiv:2504.13171

Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~ 5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task.


Core Idea

Conventional LLMs think only after a question arrives (test-time compute). This paper proposes thinking ahead of time by predicting queries from the context (sleep-time compute).

  1. Sleep-time: using only the context c, prompt the LLM to predict likely queries and pre-compute inferences. This produces a restructured context c'
  2. Test-time: when the actual query q arrives, use the pre-computed c' to answer quickly

Expressed formally:

$$S(c) \rightarrow c'$$
$$T_b(q, c') \rightarrow a \quad (b \ll B)$$

By doing the heavy lifting in advance, the test-time compute budget $b$ can be made far smaller than the conventional budget $B$.

Experimental Results

| Metric | Effect |
| --- | --- |
| Test-time compute | ~5x reduction at equal accuracy |
| Accuracy improvement | Up to +13% (GSM-Symbolic), +18% (AIME) |
| Cost per query (multiple queries) | 2.5x reduction via amortisation |

Query Predictability

A particularly suggestive finding: the more predictable the query, the greater the benefit of sleep-time compute.

Applied to Auto-dream, this means memory consolidation gets more precise as user work patterns accumulate. The minSessions: 5 threshold can be interpreted as ensuring a minimum amount of data for meaningful prediction.

The Authors' Background

The authorship sits at the intersection of two threads.

  • Letta (formerly MemGPT): the team behind the 2023 MemGPT paper, which proposed giving LLMs OS-like memory management
  • Charlie Snell: a UC Berkeley researcher who did pioneering work on test-time compute scaling

Memory management experts and compute scaling experts joined forces to produce research on organising memory while sleeping. Some members had previously worked on GPT-family models, and one could read this as pursuing an approach distinct from OpenAI's o1/o3 scaling trajectory within a smaller team. Knowing that Anthropic's own founding members departed from OpenAI, there is a certain wry irony to the whole affair.

Mapping the Paper to Auto-dream

Laying the paper's theory alongside Auto-dream's implementation, the correspondence is quite clean.

| Sleep-time Compute (paper) | Auto-dream (Claude Code) |
| --- | --- |
| Pre-compute by predicting user queries | Consolidate and organise past memory |
| 5x reduction in test-time compute | More efficient context loading at session start |
| Process offline (sleep-time) | Run once per day asynchronously (minHours: 24) |
| Amortise across multiple queries | Batch-process across sessions (minSessions: 5) |

That said, the paper addresses pre-inference over arbitrary contexts, whereas Auto-dream limits its scope to memory file consolidation. It is not the full application of the theory but rather a pragmatic extraction of the most immediately useful piece. I think this scoping decision is genuinely clever. You can see the pain that would come from expanding further, so they drew the line and kept it contained.

How Do You Implement "Sleep"?

The Paper's Premise

The paper defines sleep-time as "idle time when the user is not sending queries". In other words, the LLM is not the one sleeping: while the user is away, the LLM works behind the scenes. The metaphor is inverted.

Claude Code's Case

Claude Code is a CLI tool. It is not a daemon, so running background work while the user sleeps seems difficult at first glance.

But Anthropic already has the infrastructure to solve this. Scheduled execution is available in a three-tier structure.

| Method | Runs on | After restart | Machine off |
| --- | --- | --- | --- |
| /loop (in-session) | Local | Gone | No |
| Desktop scheduled tasks | Local | Persists | No |
| Cloud scheduled tasks | Anthropic cloud | Persists | Yes |

Run prompts on a schedule - Claude Code Docs

Use /loop and the cron scheduling tools to run prompts repeatedly, poll for status, or set one-time reminders within a Claude Code session.


/loop is a lightweight in-session scheduler. Desktop tasks persist locally. Cloud tasks run on Anthropic's infrastructure, so they execute even when the user's machine is off.

Which tier Auto-dream will use is unknown, but all three are already running in production. The technical barrier is essentially zero.
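As a generic illustration of the pattern behind the local tiers (and emphatically not Claude Code's actual mechanism), a daemon-style loop that fires a background task at a fixed interval is trivial to write:

```python
import time

# Generic daemon-style pattern for periodic background work, as in the
# "Desktop scheduled tasks" row above. Illustrative only -- not how
# Claude Code schedules anything.
def run_periodically(task, interval_seconds: float, iterations: int) -> None:
    """Run task() a fixed number of times, sleeping between runs."""
    for i in range(iterations):
        task()
        if i < iterations - 1:
            time.sleep(interval_seconds)
```

The interesting engineering question is not the loop itself but where it lives: in-session, as a local daemon, or on Anthropic's cloud.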

When Might It Ship?

What Is Already in Place

  • Theoretical backing (Sleep-time Compute paper, April 2025)
  • Scheduling infrastructure (Desktop schedule, CLI cron commands, Cloud scheduled tasks)
  • UI readiness (/memory already displays it)
  • Feature flag mechanism (server-side, just flip to true)

Remaining Questions

Technically, it looks ready to ship any time. What remains is likely a business decision.

  • Who bears the cost of subagent executions the user did not explicitly request?
  • How to explain that memory content is processed via the API during consolidation
  • Should it default to ON, or require explicit opt-in?

Given recent feature releases and the Team plan's approach, I would guess it will be a settings toggle. But I genuinely do not know.

Enterprise Demand

Long-running agents with long-term memory are in strong demand from the enterprise segment.

  • Context carries over to new sessions, reducing onboarding cost
  • Infrastructure operation knowledge accumulates (incident history, operational know-how)
  • Demand exists for sharing knowledge across teams, from individual memory to project-scoped memory

Anthropic announced a $100 million investment in the Claude Partner Network in March 2026, accelerating its enterprise expansion. An Auto-dream release aligns with this business strategy.

Counter-arguments

Everything discussed so far is circumstantial evidence. Here are the points that could counter this article's hypotheses.

Auto-dream May Have Nothing to Do with Sleep-time Compute

This article drew parallels between Auto-dream's design and the Sleep-time Compute paper, but there is no direct evidence that Anthropic referenced the paper in their design. Anthropic does not typically disclose such things, so the absence of confirmation is not surprising, but it is worth noting.

The idea of periodically tidying memory is hardly novel. Cron-based cleanup, defragmentation, log rotation. These are bread-and-butter patterns in infrastructure operations. You do not need an academic paper to think of applying them to LLM memory management.

Furthermore, the paper's sleep-time compute is about "pre-inferring future queries from context", whilst Auto-dream is about "organising past memory". The paper looks forward; Auto-dream looks backward. They may resemble each other on the surface whilst solving different problems entirely.

That said, both share the structure of "using compute during user idle time to improve the efficiency of the next session". Even if the implementation details differ, I believe there is a genuine connection at the design philosophy level.

Enterprise and Auto-dream May Not Connect

The article argued alignment with enterprise demand, but current Auto-memory has a constraint.

The official documentation states clearly:

Auto memory is machine-local.

Auto-memory is machine-local. It cannot be shared across team members. This is a fundamentally different design from the team-shared knowledge base that enterprises want.

CLAUDE.md does offer Project scope (shared via source control) and Managed policy (organisation-wide), and the autoMemoryDirectory setting allows changing the storage location. Pointing it at shared storage could enable pseudo-sharing.
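If one did point the storage location at shared storage, the setting might look something like this. The `autoMemoryDirectory` key name comes from the documentation mentioned above, but the exact schema and the path here are assumptions:

```json
{
  "autoMemoryDirectory": "/mnt/team-share/claude-memory"
}
```

Concurrent writes from multiple machines would still collide, which is exactly the merge problem raised below.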

However, team-shared memory is an area where the gap between "want" and "can implement" is large.

  • How do you merge when multiple people write to memory simultaneously? CLAUDE.md can be managed with git, but merging unstructured Auto-memory is messy
  • Individual memory is already cluttered from the write-and-forget problem. Mix in an entire team's notes and it becomes chaos. With Auto-dream not yet implemented even for individual memory consolidation, team sharing is premature
  • What scope of memory should be shared? Project-specific knowledge is worth sharing, but individual workflow quirks mixed in would just be noise

The natural sequence is Auto-dream (individual memory consolidation) first, team sharing second. The current design is squarely focused on individual memory, and team-shared memory will likely be designed as a separate feature.

Though, being a dream feature, it does carry a certain aspirational quality.

It Might Never Ship

Feature flags appearing in the UI does not guarantee a release. Plenty of product features have been experimented with and then quietly retired. Auto-dream could follow the same fate.

A feature for dreaming that ends up being just a dream. That too would be a form of goodnight.

Beyond this point, speculation begets speculation. It is a fun exercise, but this article will say its own goodnight here.

Summary

Auto-dream is a poetic concept (giving an LLM sleep), but its substance is grounded in computation theory.

  • A subagent automatically consolidates and organises memory files
  • It solves Auto-memory's write-and-forget problem, creating a cycle where the tool gets smarter the more you use it
  • The theoretical backdrop is the Sleep-time Compute paper's finding that "pre-computation costs are recovered through test-time savings"
  • The UI and infrastructure are in place. It is one feature flag away from release

When Auto-memory and Auto-dream begin working as a pair, Claude Code's memory management will shift from "write and forget" to "write, sleep, organise, and remember".

I think the day we say "sweet dreams" to Claude Code is not far off. If the feature ships, that is.

References

  • Sleep-time Compute: Beyond Inference Scaling at Test-time (arXiv:2504.13171)
  • MemGPT: Towards LLMs as Operating Systems (arXiv:2310.08560)
  • Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters (arXiv:2408.03314)