On May 6th, Anthropic shipped three new capabilities for Managed Agents. Two of them — Outcomes and multi-agent orchestration — are solid infrastructure upgrades. The third, Dreaming, is the one worth stopping to think about.
Dreaming is a scheduled background process that runs between sessions. The agent reviews its own past conversation transcripts, identifies recurring patterns, and writes learnings into its memory stores. No human prompt required. No explicit instruction to "remember this."
If you've been building with Claude agents, you already know how memory works: you tell the agent something, it stores it, it uses it next time. Passive. Explicit. You're the one deciding what gets remembered.
Dreaming flips that. The agent decides.
How It Actually Works
The process runs on a schedule between sessions. The agent scans past transcripts looking for signal: mistakes it repeated, approaches that worked, edge cases it missed. It then curates its own memory stores based on what it finds. The original session data stays untouched — Dreaming writes to memory, not back to history.
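The internal mechanics aren't public, but the shape of the pipeline can be sketched. Assuming each transcript carries some kind of error tagging (a hypothetical representation — the real internal format isn't documented), the pattern-extraction step reduces to counting recurring signals across sessions and proposing memory entries, without ever mutating the transcripts themselves:

```python
from collections import Counter

def propose_memory_updates(transcripts, min_occurrences=3):
    """Scan past session transcripts for recurring error tags and
    propose memory entries. Transcripts are read-only here; we only
    emit proposals, mirroring how Dreaming writes to memory without
    touching session history. The transcript shape is an assumption."""
    errors = Counter(
        tag
        for session in transcripts
        for tag in session.get("error_tags", [])
    )
    return [
        {"kind": "recurring_error", "tag": tag, "count": n}
        for tag, n in errors.items()
        if n >= min_occurrences
    ]
```

The threshold is the interesting knob: too low and memory fills with noise, too high and slow-burning patterns never surface.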
There are two autonomy modes you can configure:
- Automatic: the agent identifies patterns and writes them to memory directly
- Human review: the agent proposes memory updates, you approve before they take effect
The human review mode is the safer starting point for production systems. You get the cross-session pattern recognition without giving the agent unilateral write access to its own memory.
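Operationally, the difference between the two modes is just where proposals land. A minimal sketch — the mode names and proposal routing below are assumptions, not the real configuration surface:

```python
from enum import Enum

class DreamingMode(Enum):
    AUTOMATIC = "automatic"        # agent writes memory directly
    HUMAN_REVIEW = "human_review"  # proposals wait for approval

def route_proposals(proposals, mode, memory, review_queue):
    """In automatic mode, proposals land in memory immediately;
    in human-review mode, they queue until a human approves them.
    Both containers are plain lists standing in for real stores."""
    target = memory if mode is DreamingMode.AUTOMATIC else review_queue
    target.extend(proposals)
```

Starting in human-review mode and flipping to automatic once you trust the proposals is the low-risk migration path.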
Currently in research preview — not GA yet.
Why This Matters: The Cross-Session Blind Spot
Here's the problem Dreaming solves. Individual sessions can't see cross-session patterns. A support agent that misclassifies a certain type of ticket won't notice it's made the same error 12 times this month. Each session starts fresh. The pattern is invisible.
Dreaming surfaces exactly that kind of signal. It's the difference between an agent that resets every session and one that accumulates operational experience over time.
The practical implication: an agent that's been running for three months has three months of self-curated experience. A freshly deployed agent starts from zero. Over time, these become fundamentally different systems — not because of different prompts, but because of different histories.
Outcomes: The Signal Dreaming Needs
Dreaming needs to know what "doing well" means. That's what Outcomes provides.
You define a success rubric. A separate Claude instance — isolated from the agent's reasoning, running in its own context window — evaluates output against your criteria. If it fails, the grader identifies what needs to change, and the agent iterates until it meets the bar.
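The generate-grade-iterate loop is a general pattern, independent of the specific API. A sketch of the control flow, with `generate` and `grade` as hypothetical callables standing in for the agent and the isolated grader:

```python
def iterate_until_pass(generate, grade, max_rounds=3):
    """Generic generate-grade-revise loop. `grade` returns a
    (passed, feedback) pair; feedback from a failed round feeds
    the next attempt. Returns the best effort after max_rounds."""
    feedback = None
    for _ in range(max_rounds):
        draft = generate(feedback)
        passed, feedback = grade(draft)
        if passed:
            return draft
    return draft
```

Note that `grade` never sees how the draft was produced — only the draft itself — which is the same isolation property described above.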
Numbers from Anthropic's internal testing:
- Task success rates improved by up to 10 percentage points over standard prompting
- Structured file generation: +8.4% on .docx, +10.1% on .pptx
- Works for subjective quality — editorial voice, writing style, brand consistency
The isolation model matters here. The grader runs in a separate context window, which means it can't be influenced by the agent's own reasoning. It's evaluating output, not process.
Connect the two: Outcomes identifies failures. Dreaming remembers them. One is the exam. The other is the error notebook.
Multi-Agent Orchestration: Now in Public Beta
The third piece moved from preview to public beta. A coordinator agent decomposes tasks and delegates to up to 20 specialist subagents running in parallel. Each subagent gets its own context window. They share a common filesystem.
Key details for builders:
- Full trace visibility in Claude Console
- Coordinator can send follow-up messages mid-workflow
- Subagents retain context between exchanges
- Orchestration depth limited to one level — no sub-sub-agents
The depth limit is worth noting. If your architecture needs nested orchestration, this isn't the right fit yet.
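The depth-one fan-out itself is a simple pattern. A sketch using threads, with a dict standing in for the shared filesystem — `run_subagent` is a placeholder for whatever actually invokes a subagent, not a real API call:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_SUBAGENTS = 20  # matches the documented fan-out limit

def orchestrate(tasks, run_subagent, shared_store):
    """Depth-one fan-out: the coordinator delegates each task to a
    subagent in parallel; results land in a shared store (standing
    in for the shared filesystem). No nested orchestration."""
    if not tasks:
        return
    if len(tasks) > MAX_SUBAGENTS:
        raise ValueError("fan-out limited to 20 subagents")
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        for task, result in zip(tasks, pool.map(run_subagent, tasks)):
            shared_store[task] = result
```

If a task graph needs subagents that themselves delegate, that second level has to live in your own code, outside the platform's orchestrator.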
Real-world results from early adopters:
- Harvey (legal AI): task completion rates up approximately 6x
- Wisedocs (document verification): review speed improved 50% while maintaining quality
- Netflix: parallel batch analysis across hundreds of build logs
- Spiral by Every: Haiku coordinator + Opus writing subagents + Outcomes grader scoring against editorial principles
Webhooks and Pricing
Webhooks are in public beta. Agents push notifications to your system when tasks complete. For long-running jobs — some sessions run for hours — this is essential. You don't want to poll.
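The receiving side is just an HTTP endpoint that parses the notification body. The payload fields below (`type`, `session_id`, `status`) are assumptions for illustration — check the actual webhook documentation for the real schema:

```python
import json

def handle_webhook(raw_body: bytes):
    """Parse a task-completion notification and return what the
    caller needs to resume work. Field names are assumed, not
    taken from a real schema."""
    event = json.loads(raw_body)
    if event.get("type") != "task.completed":
        return None  # ignore event types we don't handle
    return event["session_id"], event["status"]
```

Whatever framework hosts this handler should return 2xx quickly and do real work out of band, since webhook senders typically retry on slow or failed deliveries.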
Pricing: standard Claude API token rates plus $0.08 per active session hour. Idle time is free. A 30-minute task costs 4 cents in infrastructure fees on top of tokens. Dreaming, Outcomes, and Webhooks don't add separate charges.
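The infrastructure fee is simple to model. A sanity-check helper — token costs are billed separately and deliberately not included:

```python
def session_infra_cost(active_minutes: float, rate_per_hour: float = 0.08) -> float:
    """Per-session infrastructure fee: $0.08 per active hour, idle
    time free. Pass active minutes, not wall-clock time. Token
    costs are separate and not modeled here."""
    return round(active_minutes / 60 * rate_per_hour, 4)
```

So a 30-minute active session is $0.04 of infrastructure, and a job that sits idle for hours between bursts costs only for the bursts.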
Quick Reference
| Feature | Status | What It Does |
|---|---|---|
| Dreaming | Research preview | Agents review past sessions, extract patterns, curate memory |
| Outcomes | Public beta | Automated output grading against developer-defined rubrics |
| Multi-agent orchestration | Public beta | Coordinator + up to 20 parallel subagents, shared filesystem |
| Webhooks | Public beta | Push notifications when agent tasks complete |
| Pricing | Live | $0.08/active session hour + standard token costs |
One Limitation Worth Knowing
Managed Agents runs Claude models exclusively. The orchestration, Dreaming, Outcomes grading — all Claude. If your architecture needs to route between models (cost optimization, specialized capabilities, latency requirements), that's a layer Managed Agents doesn't address.
If you're building multi-model agent systems that need persistent context across providers, EvoLink provides a unified gateway routing across Claude, DeepSeek, GPT, and others from a single API endpoint.
Author: Jessie, COO at EvoLink