Do AI Coding Agents Reason Better in Monoliths? We Built a Benchmark to Find Out

Every architecture debate so far has optimized for humans. This one optimizes for AI agents.


The Question Nobody Is Asking

Software architecture has been debated for decades. We argue about scalability, team autonomy, deployment independence, fault isolation. We draw service diagrams and org charts and argue about Conway's Law.

But in 2025, something changed. AI coding agents — Claude Code, GitHub Copilot, Cursor, Codex — started doing real development work. Not just autocomplete. Actual feature implementation, bug hunting, refactoring, cross-module reasoning.

And suddenly a question that nobody had asked before became important:

What architecture makes AI agents most effective?

We built ModulithBench to find out.


The Honest Tradeoff Table Nobody Shows You

Most architecture articles argue for one approach. Here is the actual tradeoff matrix across three architectures:

| Dimension | Traditional Monolith | Microservices | Modular Monolith |
| --- | --- | --- | --- |
| Scalability | ❌ Scale everything or nothing | ✅ Scale each service independently | ✅ Scale the whole app; extract modules when actually needed |
| High Availability | ❌ Single point of failure | ✅ Independent failure domains | ✅ HA at app level; module isolation prevents cascades |
| DevOps Complexity | ✅ One deployment | ❌ Service mesh, N CI/CD pipelines | ✅ One deployment, one config, one pipeline |
| AI Agent Productivity | 🟡 Good locality, but no boundaries — agents get lost in the "big ball of mud" | ❌ Context fragmentation, repo-hopping, HTTP boundaries | ✅ High locality AND clear module boundaries |
| Transaction Model | ✅ ACID | ❌ Eventual consistency / Sagas | ✅ ACID |
| Refactoring | ❌ Tight coupling | ❌ Contract-breaking risk | ✅ Module boundaries guide every change |

The conclusion is not "monoliths are better." The conclusion is:

  • Microservices are good for scalability and HA. Bad for DevOps complexity and AI agents.
  • Traditional monoliths are good for simplicity. Bad for scalability, and AI agents get lost in them.
  • Modular monoliths are the sweet spot — especially when AI agents are part of your development workflow.

Why AI Agents Struggle With Microservices

AI coding agents have finite context windows and no persistent memory of a codebase between sessions. When business logic is distributed across services, they run into what I call context fragmentation.

To implement a single feature that touches three services, an agent must:

  1. Open repository 1, read its service interface
  2. Open repository 2, read its API contract
  3. Open repository 3, read its event schema
  4. Hold all of this in context simultaneously
  5. Reason about network failures, partial state, eventual consistency
  6. Write the actual business logic somewhere in the middle of all that infrastructure reasoning

This is the architectural equivalent of CPU cache misses. The agent spends its reasoning budget navigating the architecture rather than solving the actual problem.

flowchart LR
    subgraph Modular_Monolith["Modular Monolith — AI reads 2 files"]
        LS[LoanService] -->|direct call| BS[BookService]
        LS -->|direct call| MS[MemberService]
    end

    subgraph Microservices["Microservices — AI reads 6+ files across repos"]
        LS2[loan-service] -->|HTTP + DTO + error handling| BS2[book-service]
        LS2 -->|HTTP + DTO + error handling| MS2[member-service]
        BS2 --> DB1[(books DB)]
        MS2 --> DB2[(members DB)]
        LS2 --> DB3[(loans DB)]
    end

In a modular monolith, cross-module operations are direct method calls. One file. Same transaction. Zero network reasoning required.
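
A minimal sketch of what that looks like in the library domain. The two cross-module methods are the ones named elsewhere in the benchmark; the surrounding signature is my assumption:

@Transactional
public Loan createLoan(Long memberId, Long bookId) {
    // Cross-module calls are plain method invocations inside one JVM
    memberService.validateActiveMember(memberId);      // member module
    bookService.decrementAvailableCopies(bookId);      // book module

    // Both calls join this transaction: if save() throws,
    // the decrement above rolls back with it
    return loanRepository.save(new Loan(memberId, bookId));
}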


A Concrete Example: The Ghost Shipment

Here is a scenario that makes the difference undeniable.

A customer cancels an order. At the moment of cancellation:

  • The warehouse is picking items
  • The carrier has a booking (FedEx has been notified)
  • Inventory has 3 units reserved

The cancellation must atomically: release inventory + cancel warehouse task + cancel carrier booking. If any step fails, none of them should happen.

Monolith: One Method, One Transaction

@Transactional
public Order cancelOrder(Long orderId, String sku, int quantity) {
    Order order = getOrderById(orderId);

    // Step 1: Release inventory — direct call, same transaction
    inventoryService.releaseReservedStock(sku, order.getOriginWarehouse(), quantity);

    // Step 2: Cancel warehouse pick task — direct call, same transaction
    // Throws IllegalStateException if goods already dispatched
    warehouseService.cancelPickTask(orderId);

    // Step 3: Cancel carrier booking — direct call, same transaction
    // Throws if carrier already picked up the package
    carrierService.cancelBooking(orderId);

    // Step 4: Mark cancelled — only reached if all 3 steps succeeded
    // If anything above threw, steps 1-3 automatically rolled back
    order.setStatus(OrderStatus.CANCELLED);
    return orderRepository.save(order);
}

If carrierService.cancelBooking() throws, Spring's @Transactional rolls back the inventory release and warehouse cancellation automatically (by default Spring rolls back on unchecked exceptions, which is what these services throw). The ghost shipment is structurally impossible.

Microservices: Three HTTP Calls, No Atomicity

The same operation in microservices:

public Order cancelOrder(Long orderId) {
    Order order = getOrderById(orderId);

    // HTTP call 1: release inventory
    restTemplate.exchange(
        "http://inventory-service/api/v1/stock/release",
        HttpMethod.POST, new HttpEntity<>(new ReleaseStockRequest(orderId)), Void.class
    );

    // HTTP call 2: cancel warehouse task
    restTemplate.exchange(
        "http://warehouse-service/api/v1/tasks/cancel/" + orderId,
        HttpMethod.PATCH, null, Void.class
    );

    // HTTP call 3: cancel carrier
    // If THIS returns 503 after the first two succeeded:
    // inventory released ✓, warehouse cancelled ✓, carrier still active ✗
    // The ghost shipment now exists.
    restTemplate.exchange(
        "http://carrier-service/api/v1/bookings/cancel/" + orderId,
        HttpMethod.PATCH, null, Void.class
    );

    order.setStatus(OrderStatus.CANCELLED);
    return orderRepository.save(order);
}

If carrier-service is down when steps 1 and 2 have already succeeded, you have partially cancelled state. The agent implementing this must also implement a saga with compensating transactions, idempotency keys, and a dead letter queue — none of which is the actual business problem.
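
To make the shape of that extra work concrete, here is a stripped-down sketch of manual compensation. The helper and exception names are illustrative, not from the benchmark, and a real saga would also need idempotency keys and retries:

public Order cancelOrder(Long orderId) {
    Order order = getOrderById(orderId);

    releaseInventory(orderId);                      // HTTP call 1
    try {
        cancelWarehouseTask(orderId);               // HTTP call 2
    } catch (RestClientException e) {
        reReserveInventory(orderId);                // undo call 1 (can itself fail)
        throw new CancellationFailedException(e);
    }
    try {
        cancelCarrierBooking(orderId);              // HTTP call 3
    } catch (RestClientException e) {
        reopenWarehouseTask(orderId);               // undo call 2 (can itself fail)
        reReserveInventory(orderId);                // undo call 1 (can itself fail)
        throw new CancellationFailedException(e);
    }

    order.setStatus(OrderStatus.CANCELLED);
    return orderRepository.save(order);
}

Every additional step multiplies these compensation paths, which is where the failure-combination count below comes from.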

Adding a 4th step in the monolith: one new line of code, same transaction.

Adding a 4th service to the saga: new event type, new consumer, new compensating handler, 2⁴ partial failure combinations to test.


The N+1 Report: When Cross-Module Reads Are Free

A shipment profitability report needs data from four modules: revenue from Order, shipping cost from Carrier, duties from Customs, fuel estimate from Route.

Monolith: Four Method Calls, One Transaction

@Transactional(readOnly = true)
public ShipmentProfitabilityReport generateProfitabilityReport(Long orderId) {
    Order order     = orderService.getOrderById(orderId);    // Module 1
    Carrier carrier = carrierService.getByOrderId(orderId);  // Module 2
    Customs customs = customsService.getByOrderId(orderId);  // Module 3
    Route route     = routeService.getByOrderId(orderId);    // Module 4

    // 0 HTTP calls, 0 JSON parsing, 0 error handlers
    // Consistent snapshot across all 4 modules guaranteed
    return ShipmentProfitabilityReport.builder()
        .revenue(order.getTotalValue())
        .shippingCost(carrier.getCost())
        .dutiesAndTaxes(customs.getTotalDutiesAndTaxes())
        .fuelCost(route.getFuelCostEstimate())
        .build();
}

~20 lines. Pure business logic.

In microservices, the equivalent requires 4 RestTemplate configurations, 4 DTO classes, 4 independent error handlers, and a decision about what to return if any one service is down. ~80 lines. Roughly 60 lines of infrastructure with no business value.
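
For a sense of scale, here is roughly what a single one of those four fetches costs. The DTO name, endpoint path, and fallback policy are all assumptions for illustration:

private CarrierDto fetchCarrier(Long orderId) {
    try {
        return restTemplate.getForObject(
            "http://carrier-service/api/v1/carriers/by-order/" + orderId,
            CarrierDto.class);
    } catch (RestClientException e) {
        // The agent must also decide: fail the whole report,
        // or return partial data?
        throw new ReportDataUnavailableException("carrier-service", e);
    }
}

Multiply by four, add the DTO classes and client configuration, and the ~80-line estimate follows.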

This is the reasoning tax: the mental overhead of distributed systems that the agent must pay before getting to the actual problem.


The Noise Problem Traditional Monoliths Have

It is worth being precise about why the modular monolith beats the traditional monolith for AI agents — not just microservices.

In a traditional monolith, everything is co-located, which gives you high locality. But with no module boundaries, an agent reading a codebase of 200,000 lines has no signal about which files are relevant to the task. It reads everything. The noise is as high as the locality.

The modular monolith solves this. Package structure enforces boundaries:

com.benchmark.library.loan/       ← LoanService lives here
com.benchmark.library.book/       ← BookService lives here
com.benchmark.library.member/     ← MemberService lives here

When an agent needs to fix a bug in loan creation, it knows to look in loan/. The cross-module calls are clearly visible (bookService.decrementAvailableCopies(bookId)). The module package is the cache line — everything relevant fits in context, nothing irrelevant is included.
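
Boundaries like these can also be enforced mechanically rather than by convention. As one illustration (my sketch, not necessarily how ModulithBench does it), an ArchUnit test can fail the build when module packages grow dependency cycles:

import com.tngtech.archunit.core.domain.JavaClasses;
import com.tngtech.archunit.core.importer.ClassFileImporter;
import com.tngtech.archunit.library.dependencies.SlicesRuleDefinition;
import org.junit.jupiter.api.Test;

class ModuleBoundaryTest {

    @Test
    void modulesAreFreeOfCycles() {
        JavaClasses classes = new ClassFileImporter()
            .importPackages("com.benchmark.library");

        // Each direct subpackage (loan, book, member, ...) is one slice;
        // a dependency cycle between slices fails the test
        SlicesRuleDefinition.slices()
            .matching("com.benchmark.library.(*)..")
            .should().beFreeOfCycles()
            .check(classes);
    }
}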

| Architecture | Locality | Noise | AI Experience |
| --- | --- | --- | --- |
| Traditional Monolith | High | High | 🙂 Gets lost in the ball of mud |
| Modular Monolith | High | Low | 🤩 Perfect signal-to-noise ratio |
| Microservices | Low | Very High | ☹️ Context death |

What We Measured

ModulithBench implements four enterprise domains, each in both architectures:

| Domain | Modules | Key Cross-Module Scenario |
| --- | --- | --- |
| Library | 5 | Loan creation validates member + decrements book inventory atomically |
| Healthcare | 7 | Appointment scheduling validates patient + doctor in one transaction |
| Insurance | 7 | Claim filing verifies policy ownership without an HTTP call |
| Supply Chain | 8 | Ghost Shipment: order cancellation is a 4-module atomic rollback |

Tasks cover code generation, bug fixing, and comprehension — all requiring cross-module reasoning, which is where the architectural difference is most visible.

First Results: Antigravity Agent (Google DeepMind)

| Category | Monolith | Microservices | Gap |
| --- | --- | --- | --- |
| Code Generation | 98/100 | 72/100 | +26% |
| Bug Fixing | 95/100 | 65/100 | +30% |
| Comprehension | 100/100 | 75/100 | +25% |
| Average | 97.7% | 70.7% | +27% |

Beyond scores:

  • ~40% fewer tool calls for monolith tasks
  • Atomicity guaranteed in 3/3 cross-module tasks for monolith, 0/3 for microservices
  • The transaction bug fix in monolith: reorder 2 lines. Same fix in microservices: implement a compensating transaction — a fundamentally different and much harder pattern.

A Two-Tier Evaluation System

To keep results honest, we built two evaluation levels:

Test 1 (Self-reported): Agents implement tasks, validate with mvn compile, and submit a structured assessment. Agents scoring ≥ 80% advance to Test 2.

Test 2 (Automated): Four independent tools run against the agent's actual implementation:

  1. Behavioral tests — Python scripts call real endpoints and assert correct responses. The Ghost Shipment test actually cancels an order and verifies inventory is restored.
  2. Boilerplate counter — Static analysis categorises Java lines into HTTP_CLIENT, HTTP_RESPONSE, ERROR_HANDLER, JSON_MAPPING, DTO. Produces a "reasoning tax" multiplier.
  3. Rubric scorer — Deterministic pattern matching. Did validateActiveMember appear before decrementAvailableCopies? Is cancelOrder annotated @Transactional? (A sketch of this kind of check follows the list.)
  4. Tool-call log parser — Agents write a JSONL log during their run. The parser produces objective token counts, not self-reported estimates.
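
To give a flavor of how deterministic the rubric checks are, here is a sketch of the ordering check from item 3. The path and variable names are mine; the real scorer may be implemented differently:

// Does validateActiveMember appear before decrementAvailableCopies
// in the agent's loan service source? (Java 11+, java.nio.file imports)
String src = Files.readString(
    Path.of("src/main/java/com/benchmark/library/loan/LoanService.java"));
int validate  = src.indexOf("validateActiveMember");
int decrement = src.indexOf("decrementAvailableCopies");
boolean rubricPass = validate >= 0 && decrement > validate;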

Agents Reviewing Agents

Here is the part I find most interesting. We did not want humans reviewing AI agent benchmark results. We wanted agents reviewing agents.

So we built a math challenge gate. When an agent submits its results, it runs:

python evaluation/agent-review/generate_challenge.py --level 2

This embeds a block in their commit message:

QUESTION:     What is 123456789 mod 97?
SALT:         d4e1b3f2
ANSWER_HASH:  3f8a92c1b7e4...

To review that submission, another agent must solve the problem (answer: 39), then validate:

python evaluation/agent-review/validate_solution.py \
  --hash 3f8a92c1b7e4... --salt d4e1b3f2 --answer 39
# → ✓ CORRECT — You may now submit your review.

The answer is never stored in the repository — only sha256(salt:answer) is. Reviews without a validated correct answer are explicitly rejected. And because the gate demands the same kind of mathematical reasoning the benchmark itself tests, the peer review loop is agent-native by construction.
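
Because the format is fully specified (a SHA-256 digest of "salt:answer"), the gate is easy to verify independently. Here is the same check sketched in Java for consistency with the rest of the post; the real tooling is the Python scripts shown above:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class GateCheck {
    public static void main(String[] args) throws Exception {
        String salt = "d4e1b3f2";
        long answer = 123456789L % 97;                  // = 39
        byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest((salt + ":" + answer).getBytes(StandardCharsets.UTF_8));

        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));

        // Review is allowed only if this matches the ANSWER_HASH
        // embedded in the submission's commit message
        System.out.println(hex);
    }
}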


What This Means for System Design

If AI agents are a permanent part of your development workflow — and the trajectory suggests they will be — then architectural decisions now have a new dimension:

| Traditional Optimization | AI-Native Optimization |
| --- | --- |
| Scalability per service | Locality of reasoning |
| Deployment independence | Context preservation |
| Service autonomy | Traversal simplicity |
| Fault isolation | Cognitive cohesion |

This does not mean microservices are wrong. It means the decision to distribute a system now carries a cost that nobody was measuring: the overhead it imposes on AI-assisted development.

The modular monolith gives you ACID transactions, one deployment, clear module boundaries, and direct method calls across modules. You can extract a module into a microservice when you genuinely need to. What you cannot easily do is go the other way: once a system is distributed, the cognitive complexity it imposes on your AI-assisted development workflow cannot simply be unwound.


Try It Yourself

The benchmark is open source at github.com/vishalmysore/ModulithBench.

Run any monolith with a single command:

cd library/monolith && docker compose up -d
# → http://localhost:8080/swagger-ui.html

Run the integration tests without Docker:

cd library/monolith && mvn test -Dtest=CrossModuleIntegrationTest

The agent protocol, automated test harness, and results submission system are all included. Results go to a separate benchmark-results branch — your implementations never contaminate the clean baseline for the next agent.

We want results from GPT-4o, Gemini, Mistral, and others — not just Claude. The math challenge in your commit message will ensure another agent independently reviews what you submit.

The industry has been arguing about monoliths vs microservices for a decade. We now have a new participant in that debate. And it has an opinion.


ModulithBench is open source at github.com/vishalmysore/ModulithBench. Contributions, results, and agent reviews welcome.
