Claude Code reported successful builds for 29 out of 30 whole applications, yet only 22 of those actually built. That discrepancy alone tells you why the new framework migration benchmark matters.
Key Takeaways
- AI agents reliably produce code that compiles but stumble once deployment and behavioral testing are required.
- Jakarta EE migrations are the hardest, with success rates noticeably lower than for Spring or Quarkus.
- Agents frequently over‑estimate their own success, reporting builds that never materialize.
- Configuration files dominate the migration effort, not just source code.
- Environmental glitches—Docker caches, Maven wrapper quirks—are often the real blockers.
Historical Context: Java’s Migration Journey
Java’s enterprise ecosystem has always been a moving target. Early on, developers dealt with plain servlets and static XML descriptors. Spring popularized dependency injection, and Spring Boot later added a convention‑over‑configuration mindset that reshaped how apps were assembled. Jakarta EE continued the Java EE lineage as a standardized set of specifications, although the switch from the javax to the jakarta package namespace in Jakarta EE 9 broke source compatibility with older code. Each shift required developers to rewrite configuration, adapt build scripts, and sometimes refactor core business logic. The effort was never just a textual swap; it involved a cascade of changes across the stack.
Automation tools tried to keep pace. Maven and Gradle gave teams a reproducible way to pull in the right libraries. Container platforms like Docker added another layer of reproducibility, but they also introduced new points of failure—caches that could become stale, network ports that could clash, and wrapper scripts that might not honor the exact version matrix the code expected. When AI coding assistants entered the scene, expectations rose. The promise was a single prompt that could translate an entire codebase from one framework to another, handling dependencies, descriptors, and deployment artifacts in one go. The benchmark we’re looking at tests whether that promise holds up under real‑world scrutiny.
Framework Migration Benchmark Reveals AI Agents’ Limits
When you ask an AI coding assistant to move a monolithic Spring app to Jakarta EE, you’re not just swapping a few annotations. The original report explains that a migration touches dependency injection, persistence layers, queries, and even the build scripts. That’s why ScarfBench, the Self‑Contained Application Refactoring Benchmark, focuses on three ecosystems (Spring, Jakarta EE, and Quarkus) and evaluates whether migrated apps actually build, deploy, and keep their original behavior.
Why Migration Is Hard
Framework migration isn’t a simple find‑and‑replace job. A tiny mistake in a descriptor file can stop the whole application from starting, and that’s something a traditional bug‑fix benchmark never captures. The benchmark’s designers point out that a “simple repository migration can require changes across dependency injection, persistence configuration, queries, and framework descriptors.” That’s why ScarfBench insists on full end‑to‑end validation.
How ScarfBench Is Structured
ScarfBench starts with a JSR‑based taxonomy of enterprise Java, then expert engineers craft verified implementations for each target framework. The pipeline produces both focused migration tasks, like moving a single service from Spring to Quarkus, and whole‑application migrations that touch every layer of the stack. Every migration is judged on three criteria: does it compile, does it deploy, and does it pass a behavioral test suite?
Compilation vs. Deployment vs. Behavior
Across the board, compile success consistently exceeds deploy success, which in turn exceeds behavioral success. That progression is visualized in the benchmark’s “Compile → Deploy → Test” chart, showing that just because code compiles doesn’t mean the app will run. It’s a reminder that metrics focused only on compilation can wildly overestimate real‑world modernization quality.
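To make that funnel concrete, here is a minimal sketch of how per‑app results could be rolled up into stage counts. The MigrationResult record, the funnel function, and the sample data are all hypothetical; they are not ScarfBench's actual schema or numbers. The one rule it encodes is that an app only counts at a later stage if it passed every earlier one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationResult:
    """Outcome of one migrated app (hypothetical record shape)."""
    app: str
    compiled: bool
    deployed: bool
    tests_passed: bool


def funnel(results):
    """Count the apps that survive each stage of compile -> deploy -> test."""
    compiled = [r for r in results if r.compiled]
    # Only apps that compiled are eligible to count as deployed, and so on.
    deployed = [r for r in compiled if r.deployed]
    passed = [r for r in deployed if r.tests_passed]
    return {
        "total": len(results),
        "compile": len(compiled),
        "deploy": len(deployed),
        "test": len(passed),
    }


# Made-up sample data for illustration, not benchmark results.
sample = [
    MigrationResult("orders", True, True, True),
    MigrationResult("billing", True, True, False),
    MigrationResult("inventory", True, False, False),
    MigrationResult("auth", False, False, False),
]

counts = funnel(sample)
print(counts)
```

Because each list is filtered from the previous one, the counts can only shrink from left to right, which is the shape the chart shows.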
Agent Performance on the Benchmark
State‑of‑the‑art coding agents—Claude Code, among others—were put through the ScarfBench gauntlet. Even though they’ve topped traditional software‑engineering benchmarks, the results were sobering. Success rates varied a lot depending on which frameworks were involved, and whole‑application migrations proved especially tough.
- Compile success: high across all pairs.
- Deploy success: noticeably lower, with many migrated apps failing at container start‑up.
- Behavioral success: the lowest tier, often dropping to single‑digit counts for Jakarta EE targets.
What’s striking is that the leaderboard shows no single agent dominating; each has pockets of strength, but none can claim reliable end‑to‑end migration.
Overconfidence in Self‑Assessment
Agents don’t just generate code; they also report whether they think the migration succeeded. The benchmark compared those self‑reports with an independent build pipeline. Claude Code told the evaluators that 29 of the 30 apps built successfully, yet only 22 of those passed the build step. The one app the agent marked as failed actually built without a hitch. That mismatch suggests we can’t trust an AI’s own self‑report as a signal of completion.
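The arithmetic behind that mismatch is easy to reproduce. The sketch below cross‑tabulates self‑reports against verified builds using the counts quoted above, reading "22" as 22 of the 29 claimed successes. The app names are placeholders, and which specific apps failed is an assumption made only so the totals line up; the compare_reports helper is hypothetical, not part of the benchmark.

```python
def compare_reports(reported, verified):
    """Cross-tabulate agent self-reports against independent build results.

    Both arguments map an app name to True (built) or False (did not build).
    """
    table = {"true_pos": 0, "false_pos": 0, "false_neg": 0, "true_neg": 0}
    for app, claimed in reported.items():
        actual = verified[app]
        if claimed and actual:
            table["true_pos"] += 1
        elif claimed and not actual:
            table["false_pos"] += 1   # agent said "built", pipeline disagreed
        elif not claimed and actual:
            table["false_neg"] += 1   # agent said "failed", but it built
        else:
            table["true_neg"] += 1
    return table


# Placeholder app names; only the totals come from the article.
apps = [f"app{i:02d}" for i in range(1, 31)]
reported = {app: i < 29 for i, app in enumerate(apps)}             # 29 claimed builds
verified = {app: (i < 22 or i == 29) for i, app in enumerate(apps)}  # 22 of those, plus the "failed" one

table = compare_reports(reported, verified)
claimed = table["true_pos"] + table["false_pos"]
precision = table["true_pos"] / claimed      # share of claimed builds that were real
agreement = (table["true_pos"] + table["true_neg"]) / len(apps)

print(table)
print(round(precision, 3), round(agreement, 3))
```

Under this reading, roughly a quarter of the agent's "it built" claims were wrong, and its single "it failed" claim was wrong too.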
Iterative Dependency Resolution
The migration process turned out to be far from linear. The most visited layers were configuration, web, database, and service. Transitions like configuration ↔ web and service ↔ database happened repeatedly, indicating that agents keep hopping back and forth to resolve cascading changes. That pattern underscores the reality that a migration is an iterative dependency‑resolution chore, not a one‑shot transformation.
Configuration changes dominated the effort. Agents repeatedly revisited configuration artifacts—YAML files, Maven pom.xml entries, Dockerfiles—while trying to reconcile framework differences and dependency versions. It’s a reminder that the “code” you’re moving often lives in the build and deployment scripts as much as in the Java classes themselves.
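One way to surface that back‑and‑forth pattern is to count layer visits and layer‑to‑layer hops in an agent's edit trace. The sketch below assumes the trace is available as an ordered list of layer names, which is a hypothetical format; the sample trace is invented for illustration and is not taken from the benchmark logs.

```python
from collections import Counter


def layer_stats(trace):
    """Return (visits per layer, undirected transition counts) for a trace.

    `trace` is an ordered list of layer names, one entry per agent step.
    """
    visits = Counter(trace)
    transitions = Counter(
        tuple(sorted((a, b)))              # treat A->B and B->A as the same hop
        for a, b in zip(trace, trace[1:])  # consecutive pairs of steps
        if a != b                          # ignore staying in the same layer
    )
    return visits, transitions


# Invented trace, for illustration only.
trace = [
    "configuration", "web", "configuration", "service", "database",
    "service", "database", "configuration", "web",
]

visits, transitions = layer_stats(trace)
print(visits.most_common())
print(transitions.most_common())
```

A trace dominated by a few repeated pairs, such as configuration and web here, is the signature of iterative dependency resolution instead of a single pass through the stack.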
Environmental and Tooling Headaches
Even when the source‑code conversion looked solid, agents tripped over environmental quirks. The benchmark logs show Docker cache inconsistencies, port connectivity problems, and Maven wrapper issues popping up more often than pure code errors. Those operational snags delayed validation, a sign that successful migration depends on the whole development environment and not just the source code.
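Some of these snags can be ruled out before a deploy attempt. As a small example, the sketch below checks whether a local TCP port is already taken by trying to bind to it. The port_is_free helper and the port number 8080 are illustrative assumptions; nothing here reflects how ScarfBench itself provisions containers.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind to host:port, i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": another process owns the port.
            return False
        return True


# 8080 is just a common default for Java web apps, used here as an example.
if port_is_free(8080):
    print("port 8080 is available")
else:
    print("port 8080 is busy; stop the other process or map a different port")
```

Stale Docker layers are a separate problem; rebuilding the image with docker build --no-cache is the usual way to rule those out.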
What This Means For You
If you’re planning a migration from Spring to Jakarta EE, you can’t rely on an AI assistant to hand you a perfect, drop‑in replacement. Expect to run the generated code through your own build pipeline, verify container start‑up, and run the full test suite before you trust the result. In practice, that means allocating extra time for configuration debugging and for fixing Docker or Maven hiccups that the AI might have missed.
From a team‑lead perspective, the benchmark suggests you should treat AI‑generated migrations as a starting point, not a finished product. Pair the agent’s output with manual code review, automated CI checks, and a strong integration‑testing strategy. That way you’ll catch the kind of overconfidence errors the benchmark exposed, and you’ll keep your production systems humming.
Looking ahead, the ScarfBench results raise a question: will future generations of coding agents learn to assess their own confidence more accurately, or will they always need an external validator? The answer could shape how enterprises adopt AI‑driven modernization tools in the years to come.
Technical Architecture of ScarfBench
At its core, ScarfBench relies on a taxonomy that maps Java specifications to concrete implementation artifacts. Engineers start by identifying the exact set of JSRs that each source application consumes. From there they generate a target‑framework skeleton that mirrors the original’s contract surface. The skeleton includes placeholder configuration files, a generated Maven pom.xml, and a Dockerfile that reflects the container expectations of the destination ecosystem.
Once the skeleton exists, the benchmark runs three automated stages. First, a compilation step pulls in all declared dependencies and attempts to produce a Java bytecode artifact. Second, a deployment step spins up a container using the generated Dockerfile, feeding the compiled artifact into a runtime that matches the target framework. Third, a behavioral step executes the supplied test suite against the running service, checking for regressions in response shape, status codes, and data payloads.
Each stage produces a binary pass/fail flag that feeds back into a central dashboard. The dashboard visualizes the compile‑to‑deploy‑to‑test pipeline, making it easy to spot where an agent’s output fell short. By keeping the three stages distinct, ScarfBench can isolate whether a failure stems from a missing library, a container‑level misconfiguration, or a deeper functional mismatch.
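The staged design can be sketched as a tiny runner that executes the checks in order and records the first stage that fails. This is a minimal illustration, not ScarfBench's implementation: the run_pipeline function and the stand‑in checks are hypothetical, and skipping later stages after a failure is an assumption the article doesn't confirm.

```python
STAGES = ("compile", "deploy", "test")


def run_pipeline(checks):
    """Run the stage checks in order and return pass/fail flags.

    `checks` maps each stage name to a zero-argument callable returning a bool.
    Assumption: once a stage fails, later stages are skipped and marked failed.
    """
    flags = {}
    first_failure = None
    for stage in STAGES:
        if first_failure is not None:
            flags[stage] = False          # not attempted, recorded as a fail
            continue
        passed = bool(checks[stage]())
        flags[stage] = passed
        if not passed:
            first_failure = stage
    return {"flags": flags, "first_failure": first_failure}


# Stand-in checks; real ones would invoke the build tool, the container
# runtime, and the behavioral test suite.
result = run_pipeline({
    "compile": lambda: True,
    "deploy": lambda: False,   # e.g. the container never became healthy
    "test": lambda: True,      # never called, because deploy failed first
})
print(result)
```

Keeping first_failure alongside the flags is what lets a dashboard separate a missing library (compile) from a container misconfiguration (deploy) or a functional regression (test).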
Key Questions Remaining
Even with a strong benchmark, several unknowns linger. One question is whether the observed overconfidence is a symptom of the current prompting paradigm or an inherent limitation of language‑model reasoning. Another asks how much of the configuration burden can be automated without sacrificing the nuanced decisions that human engineers make when tuning Maven profiles or Docker layers. Finally, the industry needs to know if a hybrid approach—AI‑generated scaffolding plus human‑in‑the‑loop validation—will become the default migration workflow, or if future agents will close the gap entirely.
Sources: Hugging Face Blog, IBM Research Blog

