ScarfBench: Why AI Agents Still Can't Modernize Enterprise Java (and Why That Matters)
Yesterday, IBM Research dropped ScarfBench on Hugging Face — and the headline number should give every "AI will replace developers" tweet a serious reality check.
Even the strongest frontier coding agents score under 10% behavioral success on real enterprise Java framework migrations. Not 50%. Not 30%. Under 10%.
If you build, sell, or buy AI coding tools, this is the benchmark you've been waiting for.
What is ScarfBench?
ScarfBench (Self-Contained Application Refactoring Benchmark) is an open benchmark for evaluating AI agents on cross-framework migration tasks in Enterprise Java. It's published by IBM Research with a GitHub repo, a leaderboard, and a public dataset space on Hugging Face.
It covers migrations across the three enterprise Java ecosystems that actually matter:
- Spring
- Jakarta EE
- Quarkus
The benchmark contains 34 applications, 102 framework implementations, 204 migration tasks, ~151K lines of code, ~2,000 source/test files, and 1,331 expert-written tests.
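Those counts fit together if you assume one implementation per application per framework, and one migration task per application per ordered source-to-target framework pair. That pairing rule is my reading of the numbers, not something the description above spells out. A quick sketch of the arithmetic under that assumption:

```python
from itertools import permutations

FRAMEWORKS = ["Spring", "Jakarta EE", "Quarkus"]
APPLICATIONS = 34  # application count quoted in the post

# One implementation of every application in every framework.
implementations = APPLICATIONS * len(FRAMEWORKS)

# Assumption: one task per application per ordered (source, target) pair,
# e.g. Spring -> Quarkus and Quarkus -> Spring count as separate tasks.
directions = list(permutations(FRAMEWORKS, 2))
tasks = APPLICATIONS * len(directions)

print(implementations, len(directions), tasks)  # 102 6 204
```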
Why this benchmark is different
Most coding benchmarks — SWE-Bench, HumanEval, MBPP — measure whether a model can generate code that looks right. ScarfBench instead measures whether the migrated application actually:
- Builds successfully
- Deploys correctly
- Passes behavioral validation (i.e., it still does the same thing it did before)
That third criterion is the killer. A model that confidently rewrites javax.persistence to jakarta.persistence across 200 files while quietly breaking transactional semantics can sail through any check that only asks whether the code looks right, and still score zero on ScarfBench.
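To see why the namespace rename is the easy part, here is a minimal sketch of it as a pure text substitution. The function name and the sample source are hypothetical, and this is not how ScarfBench or any particular agent performs migrations. It only shows that a rename like this can be done without understanding anything about runtime behavior.

```python
import re

# Matches the javax.persistence package prefix in imports or qualified names.
_JAVAX_PERSISTENCE = re.compile(r"\bjavax\.persistence\b")


def rename_persistence_namespace(java_source: str) -> str:
    """Textually rename javax.persistence to jakarta.persistence."""
    return _JAVAX_PERSISTENCE.sub("jakarta.persistence", java_source)


# Hypothetical snippet of Java source, used only for illustration.
before = (
    "import javax.persistence.Entity;\n"
    "import javax.persistence.Id;\n"
    "import javax.transaction.Transactional;\n"
)
after = rename_persistence_namespace(before)
print(after)
```

The persistence imports now look migrated, but the transaction import is untouched and nothing here checks whether transaction boundaries still behave the same at runtime. That is the kind of gap only a behavioral test can catch.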
The benchmark construction pipeline is also smart. It starts from a JSR-based enterprise Java taxonomy, and experts then migrate each application and verify the resulting implementations across Spring, Jakarta EE, and Quarkus. So every "correct" reference answer was hand-written by people who actually do this work for a living.
The results are humbling
The team evaluated several state-of-the-art coding agents. The pattern that emerges is consistent across all of them:
Compile success consistently exceeds deploy success, which in turn exceeds behavioral success.
In other words, agents are reasonably good at producing code that compiles. They are noticeably worse at producing code that deploys. And they are dramatically worse at producing code that behaves correctly.
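One way to picture that funnel is to count a stage only when the previous stage succeeded. That gating is an assumption on my part about how such a pipeline would naturally be scored, and the records below are made up for illustration. They are not ScarfBench results.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    compiled: bool
    deployed: bool
    behaved: bool


def funnel_rates(results):
    """Return compile/deploy/behavior success as fractions of all tasks.

    A task counts as deployed only if it also compiled, and as behaving
    correctly only if it also deployed.
    """
    total = len(results)
    if total == 0:
        return {"compile": 0.0, "deploy": 0.0, "behavior": 0.0}
    compiled = [r for r in results if r.compiled]
    deployed = [r for r in compiled if r.deployed]
    behaved = [r for r in deployed if r.behaved]
    return {
        "compile": len(compiled) / total,
        "deploy": len(deployed) / total,
        "behavior": len(behaved) / total,
    }


# Made-up records, for illustration only.
toy = [
    TaskResult("t1", True, True, True),
    TaskResult("t2", True, True, False),
    TaskResult("t3", True, False, False),
    TaskResult("t4", False, False, False),
]
rates = funnel_rates(toy)
print(rates)  # {'compile': 0.75, 'deploy': 0.5, 'behavior': 0.25}
```

With this kind of gating the rates can only shrink from stage to stage, so the interesting number is how steep each drop is, not whether there is one.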
The gap between "compiles" and "behaves correctly" is the entire modernization industry. That's the gap between a demo and a production deployment. That's the gap between an LLM demo and a real migration project billed by the quarter.
What ScarfBench tells us about AI agents
Reading through the results, a few things stand out for anyone building agentic dev tools:
1. Build success overstates progress. If you only measure compile success, modern agents look quite competent. The moment you add deploy and behavioral tests, the picture changes dramatically. A benchmark that stops at "did it produce a diff that builds?" gives a much weaker signal than the industry assumes, and ScarfBench makes that visible.
2. Whole-application migrations are the hard part. Focused, narrow migration tasks are more tractable. As soon as you ask an agent to migrate a complete application with its dependency graph, build descriptors, and runtime configuration, success collapses. This matches what enterprise architects have been saying for years: framework migration is fundamentally a systems problem, not a text problem.
3. Dependency navigation is the bottleneck. The benchmark authors note that agents spend most of their effort not on translating individual files, but on understanding which files reference which frameworks, how transitive dependencies shift, and what configuration files need to move alongside the code. That's exactly the kind of multi-step reasoning that current agent loops handle worst.
4. Stop conditions are unreliable. A concerning finding: agents often cannot reliably tell when a migration is complete. They declare "done" on broken builds, or keep iterating past the point of no improvement. This is a real production risk — imagine an autonomous agent pushing a half-migrated service to staging and reporting success.
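For the stop-condition problem in point 4, one option is to take the decision away from the agent and tie it to external verification. The sketch below is a hypothetical harness, not something from ScarfBench: `step` stands in for "run one agent iteration, then run the behavioral tests and return how many pass", and the `patience` and `max_iters` values are arbitrary.

```python
from typing import Callable


def run_until_verified(
    step: Callable[[], int],
    total_tests: int,
    max_iters: int = 10,
    patience: int = 2,
) -> str:
    """Iterate until all tests pass, progress stalls, or the budget runs out.

    `step` is a hypothetical callback: it performs one agent iteration and
    returns the number of behavioral tests that currently pass.
    """
    best = -1
    stale = 0
    for _ in range(max_iters):
        passed = step()
        if passed == total_tests:
            return "verified"  # only an external test run can say "done"
        if passed > best:
            best, stale = passed, 0
        else:
            stale += 1
            if stale >= patience:
                return "stalled"  # no improvement for `patience` iterations
    return "budget_exhausted"


# Toy run with made-up pass counts out of 8 tests.
scores = iter([3, 5, 5, 5])
outcome = run_until_verified(lambda: next(scores), total_tests=8)
print(outcome)  # stalled
```

The design choice is that "done" is never self-reported. The loop either sees every test pass, or it reports an explicit non-success state that a human or a pipeline can act on.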
Why developers should care
If you work on enterprise Java — or any large legacy codebase — ScarfBench is a sanity check on the wave of "AI modernization" vendor pitches you're getting right now. The honest ones will already be quoting these numbers. The less-honest ones will keep showing you green compile bars.
If you're building AI dev tools, this is the benchmark to start running internally before you ship anything that touches a real migration. The gap between "agent passes SWE-Bench" and "agent can migrate a Spring app to Quarkus" is exactly the gap ScarfBench measures.
If you're a researcher, the leaderboard and dataset are open. There's a real opportunity to design agent loops that focus on behavioral verification rather than syntactic rewrite — which is the axis where current agents lose.
Try it yourself
- Blog post: huggingface.co/blog/ibm-research/scarfbench
- GitHub: github.com/ibm-research/scarfbench
- Leaderboard: scarfbench.info/leaderboard
- Dataset: search scarfbench on Hugging Face
The headline takeaway: modern AI coding agents are not yet ready to autonomously modernize enterprise Java. That's not a reason to dismiss them — it's a reason to be precise about what they can do today, and to design tooling and workflows around their actual capabilities rather than their demo reels.
The agents that close the compile → deploy → behavior gap are going to be the ones that matter. ScarfBench is now the scoreboard.
What kind of migration workload would you want to see a future benchmark cover — COBOL to Java, .NET Framework to .NET 8, or something else entirely? Drop a comment — I'm especially curious which legacy stacks teams are still struggling to staff.