Originally published on The Searchless Journal
Nineteen days. That is how long one AI model spent programming nonstop on a single MirrorCode task before Epoch AI shut it down. The task cost $2,600 in compute. The model never solved it.
This is the kind of data point that reframes how we think about AI coding capabilities. Headlines celebrate models that write boilerplate in seconds or generate working prototypes in minutes. But sustained, complex, multi-day programming remains a frontier that no model has truly conquered.
Epoch AI, the AI research organization known for rigorous evaluations, designed MirrorCode to test something different from standard coding benchmarks. Instead of asking models to solve isolated algorithmic problems or write functions from specifications, MirrorCode asks models to recreate complete programs without access to the original source code. Think of it as asking an AI to rebuild a house from a photograph, without blueprints, using only general knowledge of construction.
The benchmark reveals a performance landscape that is both impressive and humbling.
How MirrorCode Works
The premise is elegant in its difficulty. Models receive a description of what a software tool does, its public API, and documentation. They must then implement the entire thing from scratch, writing code that passes the original test suite.
This is not LeetCode. These are real software projects with real complexity. Thousands of lines of interconnected code. Edge cases. Error handling. Performance considerations. Architecture decisions. The kind of work that senior engineers spend days or weeks completing.
MirrorCode currently includes tasks ranging from small utilities (a few hundred lines) to substantial projects (16,000+ lines of code). Models must manage context across files, maintain consistency in naming conventions and design patterns, and handle dependencies between components.
The evaluation runs autonomously. Models write code, run tests, read error messages, iterate. They operate as a solo developer would, without human intervention. The only limit is time and money.
The Results: Capability at the Top, Failure at the Frontier
Claude Opus 4.7 leads the pack with a 56% overall solve rate. It rebuilt a 16,000-line data processing toolkit in 14 hours, passing all tests. That is genuinely remarkable. A single AI model, working autonomously, recreated a complex software project faster than most human developers could.
But the 56% number tells only half the story. Opus dominated the easier and medium-difficulty tasks. On the hardest tasks, the most complex multi-thousand-line projects with intricate dependencies, every model struggled. Including Opus.
The performance gradient is steep. Models that look competent on standard benchmarks fall apart when asked to maintain coherence across thousands of lines of code. Context windows fill up. Earlier design decisions get forgotten. Inconsistencies creep in. The model starts contradicting itself, using deprecated patterns it abandoned hours ago.
This is the endurance problem. AI models excel at bursts of focused work. Ask a model to write a function, and it performs brilliantly. Ask it to maintain architectural consistency across 50 files over 14 hours of continuous work, and the cracks appear.
The Cost of Persistence
The 19-day marathon reveals something important about AI reasoning under sustained load. The model in question (Epoch AI has not publicly named which model ran the longest) did not crash or error out. It kept working. It kept trying different approaches, running tests, reading failures, and attempting fixes.
But it never converged on a solution.
This is the nightmare scenario for autonomous coding agents. Not that they fail fast, but that they fail slowly, burning through compute budgets while making no real progress. The $2,600 cost of that single failed task exceeds what most companies would pay a human developer for a week of work.
The problem is not intelligence in the traditional sense. It is judgment. Knowing when an approach is not working. Recognizing when to step back and reconsider the architecture rather than patching the same broken function for the hundredth time. Human developers develop this intuition through years of experience. AI models do not have it at all.
The Snowflake benchmark of GLM-5.2 found a similar pattern. On one task, GLM fired off 411 tool calls in 24 minutes, checking row counts, distributions, null values, and column types. It failed all three attempts. Claude Opus solved the same task with 49 calls in 9 minutes. More effort did not produce better results. It produced wasted compute.
Context Degradation and the Long-Run Problem
The technical challenge underlying MirrorCode's difficulty is context degradation. Even with context windows exceeding 200,000 tokens, models lose track of information over long sessions. A design decision made in hour two gets forgotten by hour eight. The model reinvents a pattern it already established, creating inconsistency. Or worse, it actively contradicts earlier work.
This is not a token-limit problem. It is an attention problem. Transformers attend to all context equally in principle, but in practice, attention degrades over sequence length. Information from early in a long session carries less weight in generation decisions than recent context.
For agentic coding, this means models benefit from frequent context resets. Start fresh with a clear summary of decisions made so far, rather than dragging 100,000 tokens of conversation history forward. Some coding agent frameworks already implement this pattern. Others do not, and their performance suffers on long tasks.
The MirrorCode results validate this approach. Models that managed context effectively solved harder problems. Models that let context grow unbounded got lost in their own history.
What This Means for Agentic Coding Products
Companies building autonomous coding agents need to take MirrorCode's lessons seriously.
Task decomposition is essential. No current model can reliably handle a 16,000-line project in a single session. Breaking work into smaller, independently verifiable chunks dramatically improves success rates. Each chunk should be small enough that the model can maintain coherence.
Verification loops need budgets. The 19-day failure shows what happens when agents iterate without limits. Every agentic system needs a kill switch. Maximum iterations. Maximum runtime. Maximum cost. When those limits are hit, the system should stop and escalate to a human.
Progress detection prevents wasted compute. An agent that runs 400 tool calls without making progress is stuck. The system should detect this pattern and either reset, seek help, or stop. Simple heuristics work well here. If the last 50 tool calls did not change any test outcomes, the agent is probably spinning.
Architecture consistency requires explicit management. Models cannot maintain design patterns across long sessions through implicit memory. Coding agents need explicit architecture documents, style guides, and decision logs that get re-injected into context regularly.
The Gap Between Benchmark and Product
MirrorCode is a controlled benchmark. Real-world software engineering is messier. Requirements change mid-project. Codebases have technical debt. Dependencies have breaking changes. Documentation is outdated.
The 56% solve rate on MirrorCode probably represents an upper bound on what models can achieve in production coding environments. Real-world success rates will be lower, sometimes significantly so.
This does not mean autonomous coding is not useful. The 56% of tasks that Opus solved represent real value. Rebuilding a 16,000-line tool in 14 hours, even with occasional human intervention, is a massive productivity multiplier. The key is setting appropriate expectations.
Autonomous coding agents are not replacement engineers. They are extremely capable tools that handle specific categories of work well and fail in predictable ways. Teams that understand both the capabilities and the failure modes can extract enormous value. Teams that expect fully autonomous software engineering will be disappointed.
The Road Ahead
MirrorCode will get harder. Epoch AI plans to add more complex tasks, including projects that require understanding entire codebases, integrating with external systems, and optimizing for performance. The benchmark will track whether model improvements translate into genuine endurance or just better burst performance.
The 19-day marathon will eventually become a footnote. Some future model will solve that task in hours instead of weeks. But the underlying lesson will remain. Intelligence is not just about solving problems. It is about knowing when to stop, when to change course, and when to ask for help.
That is the frontier MirrorCode exposes. Not raw capability, but judgment. And judgment remains the hardest thing to build into an AI system.
Top comments (0)