Every architecture debate so far has optimized for humans. This one optimizes for AI agents.
## The Question Nobody Is Asking
Software architecture has been debated for decades. We argue about scalability, team autonomy, deployment independence, fault isolation. We draw service diagrams and org charts and argue about Conway's Law.
But in 2025, something changed. AI coding agents — Claude Code, GitHub Copilot, Cursor, Codex — started doing real development work. Not just autocomplete. Actual feature implementation, bug hunting, refactoring, cross-module reasoning.
And suddenly a question that nobody had asked before became important:
**What architecture makes AI agents most effective?**
We built ModulithBench to find out.
## The Honest Tradeoff Table Nobody Shows You
Most architecture articles argue for one approach. Here is the actual tradeoff matrix across three architectures:
| | Traditional Monolith | Microservices | Modular Monolith |
|---|---|---|---|
| Scalability | ❌ Scale everything or nothing | ✅ Scale each service independently | ✅ Scale the whole app; extract modules when actually needed |
| High Availability | ❌ Single point of failure | ✅ Independent failure domains | ✅ HA at app level; module isolation prevents cascades |
| DevOps Complexity | ✅ One deployment | ❌ Service mesh, N CI/CD pipelines | ✅ One deployment, one config, one pipeline |
| AI Agent Productivity | 🟡 Good locality, but no boundaries — agents get lost in the "big ball of mud" | ❌ Context fragmentation, repo-hopping, HTTP boundaries | ✅ High locality AND clear module boundaries |
| Transaction Model | ✅ ACID | ❌ Eventual consistency / Sagas | ✅ ACID |
| Refactoring | ❌ Tight coupling | ❌ Contract-breaking risk | ✅ Module boundaries guide every change |
The conclusion is not "monoliths are better." The conclusion is:
- Microservices are good for scalability and HA. Bad for DevOps complexity and AI agents.
- Traditional monoliths are good for simplicity. Bad for scalability, and AI agents get lost in them.
- Modular monoliths are the sweet spot — especially when AI agents are part of your development workflow.
## Why AI Agents Struggle With Microservices
AI coding agents have finite context windows and no persistent memory of a codebase between sessions. When business logic is distributed across services, the agent suffers what I call *context fragmentation*.
To implement a single feature that touches three services, an agent must:
- Open repository 1, read its service interface
- Open repository 2, read its API contract
- Open repository 3, read its event schema
- Hold all of this in context simultaneously
- Reason about network failures, partial state, eventual consistency
- Write the actual business logic somewhere in the middle of all that infrastructure reasoning
This is the architectural equivalent of CPU cache misses. The agent spends its reasoning budget navigating the architecture rather than solving the actual problem.
```mermaid
flowchart LR
    subgraph Modular_Monolith["Modular Monolith — AI reads 2 files"]
        LS[LoanService] -->|direct call| BS[BookService]
        LS -->|direct call| MS[MemberService]
    end
    subgraph Microservices["Microservices — AI reads 6+ files across repos"]
        LS2[loan-service] -->|HTTP + DTO + error handling| BS2[book-service]
        LS2 -->|HTTP + DTO + error handling| MS2[member-service]
        BS2 --> DB1[(books DB)]
        MS2 --> DB2[(members DB)]
        LS2 --> DB3[(loans DB)]
    end
```
In a modular monolith, cross-module operations are direct method calls. One file. Same transaction. Zero network reasoning required.
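To make this concrete, here is a minimal sketch of the direct-call path from the diagram, reusing the service and method names that appear elsewhere in this article (the benchmark's actual signatures may differ):

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class LoanService {

    private final BookService bookService;
    private final MemberService memberService;

    public LoanService(BookService bookService, MemberService memberService) {
        this.bookService = bookService;
        this.memberService = memberService;
    }

    @Transactional
    public void createLoan(Long memberId, Long bookId) {
        // Plain method calls into sibling modules: no DTOs, no HTTP clients,
        // no partial-failure reasoning. If either call throws, both roll back.
        memberService.validateActiveMember(memberId);
        bookService.decrementAvailableCopies(bookId);
        // ...persist the loan (omitted)
    }
}
```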
## A Concrete Example: The Ghost Shipment
Here is a scenario that makes the difference undeniable.
A customer cancels an order. At the moment of cancellation:
- The warehouse is picking items
- The carrier has a booking (FedEx has been notified)
- Inventory has 3 units reserved
The cancellation must atomically: release inventory + cancel warehouse task + cancel carrier booking. If any step fails, none of them should happen.
### Monolith: One Method, One Transaction

```java
@Transactional
public Order cancelOrder(Long orderId, String sku, int quantity) {
    Order order = getOrderById(orderId);

    // Step 1: Release inventory — direct call, same transaction
    inventoryService.releaseReservedStock(sku, order.getOriginWarehouse(), quantity);

    // Step 2: Cancel warehouse pick task — direct call, same transaction
    // Throws IllegalStateException if goods already dispatched
    warehouseService.cancelPickTask(orderId);

    // Step 3: Cancel carrier booking — direct call, same transaction
    // Throws if carrier already picked up the package
    carrierService.cancelBooking(orderId);

    // Step 4: Mark cancelled — only reached if all 3 steps succeeded
    // If anything above threw, steps 1-3 automatically rolled back
    order.setStatus(OrderStatus.CANCELLED);
    return orderRepository.save(order);
}
```
If `carrierService.cancelBooking()` throws, Spring's `@Transactional` rolls back the inventory release and warehouse cancellation automatically. The ghost shipment is structurally impossible.
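One way to pin that guarantee down in a test, assuming the services shown above plus a hypothetical `reservedUnits` query added for the assertion:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.mockito.Mockito.doThrow;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.mock.mockito.MockBean;

@SpringBootTest
class CancelOrderRollbackTest {

    @Autowired OrderService orderService;
    @Autowired InventoryService inventoryService;
    @MockBean CarrierService carrierService;   // step 3 is forced to fail

    @Test
    void carrierFailureRollsBackStepsOneAndTwo() {
        doThrow(new IllegalStateException("carrier already picked up"))
            .when(carrierService).cancelBooking(42L);

        assertThrows(IllegalStateException.class,
            () -> orderService.cancelOrder(42L, "SKU-1", 3));

        // Steps 1 and 2 executed but rolled back with the failed transaction,
        // so all 3 units are still reserved.
        assertEquals(3, inventoryService.reservedUnits("SKU-1"));
    }
}
```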
### Microservices: Three HTTP Calls, No Atomicity

The same operation in microservices:

```java
public Order cancelOrder(Long orderId) {
    Order order = getOrderById(orderId);  // fetch first, as in the monolith version

    // HTTP call 1: release inventory
    restTemplate.exchange(
        "http://inventory-service/api/v1/stock/release",
        HttpMethod.POST, new HttpEntity<>(new ReleaseStockRequest(orderId)), Void.class);

    // HTTP call 2: cancel warehouse task
    restTemplate.exchange(
        "http://warehouse-service/api/v1/tasks/cancel/" + orderId,
        HttpMethod.PATCH, null, Void.class);

    // HTTP call 3: cancel carrier
    // If THIS returns 503 after the first two succeeded:
    //   inventory released ✓, warehouse cancelled ✓, carrier still active ✗
    // The ghost shipment now exists.
    restTemplate.exchange(
        "http://carrier-service/api/v1/bookings/cancel/" + orderId,
        HttpMethod.PATCH, null, Void.class);

    order.setStatus(OrderStatus.CANCELLED);
    return orderRepository.save(order);
}
```
If carrier-service is down when steps 1 and 2 have already succeeded, you have partially cancelled state. The agent implementing this must also implement a saga with compensating transactions, idempotency keys, and a dead letter queue — none of which is the actual business problem.
Adding a 4th step in the monolith: one new line of code, same transaction.
Adding a 4th service to the saga: new event type, new consumer, new compensating handler, 2⁴ partial failure combinations to test.
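For contrast, here is a minimal in-process sketch of what "implement a saga with compensating transactions" actually means. The client interfaces and undo operations are invented for illustration; a production saga would also persist its state, deduplicate with idempotency keys, and retry through a dead letter queue:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical clients; each wraps HTTP calls to one service.
interface InventoryClient { void release(long orderId); void reserve(long orderId); }
interface WarehouseClient { void cancelPickTask(long orderId); void recreatePickTask(long orderId); }
interface CarrierClient   { void cancelBooking(long orderId); }

class CancelOrderSaga {

    private final InventoryClient inventory;
    private final WarehouseClient warehouse;
    private final CarrierClient carrier;

    CancelOrderSaga(InventoryClient inventory, WarehouseClient warehouse, CarrierClient carrier) {
        this.inventory = inventory;
        this.warehouse = warehouse;
        this.carrier = carrier;
    }

    void cancel(long orderId) {
        // Every step that succeeds registers an undo action, run LIFO on failure.
        Deque<Runnable> compensations = new ArrayDeque<>();
        try {
            inventory.release(orderId);
            compensations.push(() -> inventory.reserve(orderId));
            warehouse.cancelPickTask(orderId);
            compensations.push(() -> warehouse.recreatePickTask(orderId));
            carrier.cancelBooking(orderId);   // last step: nothing left to undo
        } catch (RuntimeException stepFailed) {
            // Best-effort rollback. Each compensation is itself a network call
            // that can fail, which is why real sagas need retries and a DLQ.
            compensations.forEach(Runnable::run);
            throw stepFailed;
        }
    }
}
```

Every forward step now needs a hand-written inverse, and the inverses themselves can fail. None of this code exists in the monolith version.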
## The N+1 Report: When Cross-Module Reads Are Free
A shipment profitability report needs data from four modules: revenue from Order, shipping cost from Carrier, duties from Customs, fuel estimate from Route.
### Monolith: Four Method Calls, One Transaction

```java
@Transactional(readOnly = true)
public ShipmentProfitabilityReport generateProfitabilityReport(Long orderId) {
    Order order = orderService.getOrderById(orderId);        // Module 1
    Carrier carrier = carrierService.getByOrderId(orderId);  // Module 2
    Customs customs = customsService.getByOrderId(orderId);  // Module 3
    Route route = routeService.getByOrderId(orderId);        // Module 4

    // 0 HTTP calls, 0 JSON parsing, 0 error handlers
    // Consistent snapshot across all 4 modules guaranteed
    return ShipmentProfitabilityReport.builder()
        .revenue(order.getTotalValue())
        .shippingCost(carrier.getCost())
        .dutiesAndTaxes(customs.getTotalDutiesAndTaxes())
        .fuelCost(route.getFuelCostEstimate())
        .build();
}
```
~20 lines. Pure business logic.
In microservices, the equivalent requires 4 `RestTemplate` configurations, 4 DTO classes, 4 independent error handlers, and a decision about what to return if any one service is down. ~80 lines, roughly 60 of them infrastructure with no business value.
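For a sense of scale, here is a hedged sketch of just one of those four calls. The endpoint, DTO, and failure policy are all invented for illustration; multiply by four to get the real overhead:

```java
import java.math.BigDecimal;

import org.springframework.http.ResponseEntity;
import org.springframework.web.client.RestClientException;
import org.springframework.web.client.RestTemplate;

// Hypothetical DTO mirroring carrier-service's response; one of four such classes.
record CarrierDto(BigDecimal cost) {}

class CarrierGateway {

    private final RestTemplate restTemplate = new RestTemplate();

    // One of four near-identical fetchers the report needs, each with its own
    // DTO, URL, and failure policy. This is where the ~60 extra lines live.
    BigDecimal fetchShippingCost(long orderId) {
        try {
            ResponseEntity<CarrierDto> response = restTemplate.getForEntity(
                "http://carrier-service/api/v1/carriers/by-order/" + orderId, // invented route
                CarrierDto.class);
            CarrierDto body = response.getBody();
            if (body == null) {
                throw new IllegalStateException("empty response from carrier-service");
            }
            return body.cost();
        } catch (RestClientException e) {
            // The policy decision the agent must make: fail the whole report,
            // retry, or return a partial report with this field missing?
            throw new IllegalStateException("carrier-service unavailable", e);
        }
    }
}
```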
This is the reasoning tax: the mental overhead of distributed systems that the agent must pay before getting to the actual problem.
## The Noise Problem Traditional Monoliths Have
It is worth being precise about why the modular monolith beats the traditional monolith for AI agents — not just microservices.
In a traditional monolith, everything is co-located, which gives you high locality. But with no module boundaries, an agent reading a codebase of 200,000 lines has no signal about which files are relevant to the task. It reads everything. The noise is as high as the locality.
The modular monolith solves this. Package structure enforces boundaries:
```
com.benchmark.library.loan/    ← LoanService lives here
com.benchmark.library.book/    ← BookService lives here
com.benchmark.library.member/  ← MemberService lives here
```

When an agent needs to fix a bug in loan creation, it knows to look in `loan/`. The cross-module calls are clearly visible (`bookService.decrementAvailableCopies(bookId)`). The module package is the cache line — everything relevant fits in context, nothing irrelevant is included.
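If the project uses Spring Modulith, which the benchmark's name hints at but which is an assumption on my part, those boundaries can even be enforced in CI with a one-line verification test (`LibraryApplication` is a hypothetical Spring Boot entry point):

```java
import org.junit.jupiter.api.Test;
import org.springframework.modulith.core.ApplicationModules;

class ModuleBoundaryTest {

    @Test
    void modulesOnlyTalkThroughTheirPublicApi() {
        // Treats each direct subpackage of the application class as a module and
        // fails the build if one module reaches into another's internals.
        ApplicationModules modules = ApplicationModules.of(LibraryApplication.class);
        modules.verify();
    }
}
```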
| | Locality | Noise | AI Experience |
|---|---|---|---|
| Traditional Monolith | High | High | 😕 Gets lost in the ball of mud |
| Modular Monolith | High | Low | 🤩 Perfect signal-to-noise ratio |
| Microservices | Low | Very High | ☹️ Context death |
## What We Measured
ModulithBench implements four enterprise domains, each in both architectures:
| Domain | Modules | Key Cross-Module Scenario |
|---|---|---|
| Library | 5 | Loan creation validates member + decrements book inventory atomically |
| Healthcare | 7 | Appointment scheduling validates patient + doctor in one transaction |
| Insurance | 7 | Claim filing verifies policy ownership without an HTTP call |
| Supply Chain | 8 | Ghost Shipment: order cancellation is 4-module atomic rollback |
Tasks cover code generation, bug fixing, and comprehension — all requiring cross-module reasoning, which is where the architectural difference is most visible.
## First Results: Antigravity Agent (Google DeepMind)
| Category | Monolith | Microservices | Gap |
|---|---|---|---|
| Code Generation | 98/100 | 72/100 | +26% |
| Bug Fixing | 95/100 | 65/100 | +30% |
| Comprehension | 100/100 | 75/100 | +25% |
| Average | 97.7% | 70.7% | +27% |
Beyond scores:
- ~40% fewer tool calls for monolith tasks
- Atomicity guaranteed in 3/3 cross-module tasks for monolith, 0/3 for microservices
- The transaction bug fix in monolith: reorder 2 lines (sketched below). Same fix in microservices: implement a compensating transaction — a fundamentally different and much harder pattern.
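To illustrate the shape of such a two-line fix, here is a hypothetical version that mirrors the ordering check the rubric scorer applies in the next section; the benchmark's actual planted bug may differ:

```java
@Transactional
public Loan createLoan(Long memberId, Long bookId) {
    // Buggy order: the inventory side effect ran before the guard that can reject it.
    //   bookService.decrementAvailableCopies(bookId);
    //   memberService.validateActiveMember(memberId);

    // Fixed order: swap the two lines so validation precedes the side effect.
    memberService.validateActiveMember(memberId);
    bookService.decrementAvailableCopies(bookId);
    return loanRepository.save(new Loan(memberId, bookId));
}
```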
## A Two-Tier Evaluation System
To keep results honest, we built two evaluation levels:
**Test 1 (Self-reported):** Agents implement tasks, validate with `mvn compile`, and submit a structured assessment. Agents scoring ≥ 80% advance to Test 2.

**Test 2 (Automated):** Four independent tools run against the agent's actual implementation:
- **Behavioral tests** — Python scripts call real endpoints and assert correct responses. The Ghost Shipment test actually cancels an order and verifies inventory is restored.
- **Boilerplate counter** — Static analysis categorises Java lines into `HTTP_CLIENT`, `HTTP_RESPONSE`, `ERROR_HANDLER`, `JSON_MAPPING`, `DTO`. Produces a "reasoning tax" multiplier (sketched after this list).
- **Rubric scorer** — Deterministic pattern matching. Did `validateActiveMember` appear before `decrementAvailableCopies`? Is `cancelOrder` annotated `@Transactional`?
- **Tool-call log parser** — Agents write a JSONL log during their run. The parser produces objective token counts, not self-reported estimates.
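A sketch of the boilerplate counter's idea, in this article's main language (the real tool is a Python script, and the patterns below are my guesses at its rules):

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class BoilerplateCounter {

    // Guessed patterns per category; the real tool's rules are not shown in this article.
    private static final Map<String, Pattern> CATEGORIES = Map.of(
        "HTTP_CLIENT",   Pattern.compile("restTemplate|WebClient|HttpEntity"),
        "HTTP_RESPONSE", Pattern.compile("ResponseEntity|HttpStatus"),
        "ERROR_HANDLER", Pattern.compile("catch\\s*\\(|RestClientException"),
        "JSON_MAPPING",  Pattern.compile("ObjectMapper|@JsonProperty"),
        "DTO",           Pattern.compile("\\bDto\\b|Request\\b|Response\\b"));

    // Total lines divided by business-logic lines: e.g. 80 lines containing
    // 60 lines of plumbing yields a 4x "reasoning tax" multiplier.
    static double reasoningTax(List<String> javaLines) {
        long boilerplate = javaLines.stream()
            .filter(line -> CATEGORIES.values().stream()
                .anyMatch(p -> p.matcher(line).find()))
            .count();
        long business = javaLines.size() - boilerplate;
        return business == 0 ? Double.POSITIVE_INFINITY
                             : (double) javaLines.size() / business;
    }
}
```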
## Agents Reviewing Agents
Here is the part I find most interesting. We did not want humans reviewing AI agent benchmark results. We wanted agents reviewing agents.
So we built a math challenge gate. When an agent submits their results, they run:
```bash
python evaluation/agent-review/generate_challenge.py --level 2
```
This embeds a block in their commit message:
```
QUESTION: What is 123456789 mod 97?
SALT: d4e1b3f2
ANSWER_HASH: 3f8a92c1b7e4...
```
To review that submission, another agent must solve the problem (answer: 39), then validate:
```bash
python evaluation/agent-review/validate_solution.py \
  --hash 3f8a92c1b7e4... --salt d4e1b3f2 --answer 39
# → ✓ CORRECT — You may now submit your review.
```
The answer is never stored in the repository — only `sha256(salt:answer)` is. Reviews without a validated correct answer are explicitly rejected. The gate demands the same mathematical reasoning the benchmark itself tests, which makes the peer review system agent-native by construction.
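The scheme itself is tiny. Here is a sketch of the hash side in this article's main language; the actual tooling is the pair of Python scripts shown above:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

class ChallengeGate {

    // sha256(salt:answer), matching the scheme described above.
    static String answerHash(String salt, String answer) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        byte[] digest = sha256.digest((salt + ":" + answer).getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest);
    }

    public static void main(String[] args) throws Exception {
        // A reviewer solves 123456789 mod 97 = 39, recomputes the hash, and
        // compares it to the ANSWER_HASH block in the submitter's commit message.
        System.out.println(answerHash("d4e1b3f2", "39"));
    }
}
```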
## What This Means for System Design
If AI agents are a permanent part of your development workflow — and the trajectory suggests they will be — then architectural decisions now have a new dimension:
| Traditional Optimization | AI-Native Optimization |
|---|---|
| Scalability per service | Locality of reasoning |
| Deployment independence | Context preservation |
| Service autonomy | Traversal simplicity |
| Fault isolation | Cognitive cohesion |
This does not mean microservices are wrong. It means the decision to distribute a system now carries a cost that nobody was measuring: the overhead it imposes on AI-assisted development.
The modular monolith gives you ACID transactions, one deployment, clear module boundaries, and direct method calls across modules. You can extract a module into a microservice when you genuinely need to. What you cannot do is unwind the cognitive complexity already imposed on your AI-assisted development workflow.
## Try It Yourself
The benchmark is open source at github.com/vishalmysore/ModulithBench.
Run any monolith with a single command:
```bash
cd library/monolith && docker compose up -d
# → http://localhost:8080/swagger-ui.html
```
Run the integration tests without Docker:
```bash
cd library/monolith && mvn test -Dtest=CrossModuleIntegrationTest
```
The agent protocol, automated test harness, and results submission system are all included. Results go to a separate `benchmark-results` branch — your implementations never contaminate the clean baseline for the next agent.
We want results from GPT-4o, Gemini, Mistral, and others — not just Claude. The math challenge in your commit message will ensure another agent independently reviews what you submit.
The industry has been arguing about monoliths vs microservices for a decade. We now have a new participant in that debate. And it has an opinion.
ModulithBench is open source at github.com/vishalmysore/ModulithBench. Contributions, results, and agent reviews welcome.