Tetsuya Wakita
I Made an LLM Build the Same App Twice — Classic Spring Boot vs Clean Architecture

I gave Claude Code identical business requirements with two different architectural contexts. The generated code told me more about architecture's value than any textbook could.


The experiment

I set up a controlled experiment in two phases:

  • Phase 0 — From-scratch generation. Give the LLM a spec and an empty project; it builds everything from zero. This tests how architecture shapes greenfield code.
  • Phase 1 — Feature addition. Give the LLM an existing codebase (the Phase 0 output) and a new feature spec. It reads, modifies, and extends. This tests how architecture holds up under change.

Same business domain for both: a SaaS subscription billing system. The spec covers:

  • Plan management with tiered pricing (free / starter / professional / enterprise)
  • The full subscription lifecycle (trial → active → paused → canceled → expired)
  • Usage-based metering with per-plan limits
  • Proration calculations on mid-cycle plan changes
  • Invoice generation with line items and discount application
  • Payment processing with retry logic and grace periods
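The spec mentions proration on mid-cycle plan changes but this article never shows the formula, so here is a minimal sketch of what such a calculation typically looks like. Everything in it is an assumption: simple day-based proration, two-decimal rounding, and the function names are all illustrative, not taken from either repository.

```kotlin
import java.math.BigDecimal
import java.math.RoundingMode

// Illustrative day-based proration: the fraction of the billing period
// remaining, applied to a monthly price. Rounding mode is an assumption.
fun prorate(monthlyPrice: BigDecimal, remainingDays: Int, daysInPeriod: Int): BigDecimal =
    monthlyPrice
        .multiply(BigDecimal(remainingDays))
        .divide(BigDecimal(daysInPeriod), 2, RoundingMode.HALF_EVEN)

// On a mid-cycle upgrade: credit the unused part of the old plan,
// charge the prorated part of the new plan.
fun prorationDelta(
    oldPrice: BigDecimal,
    newPrice: BigDecimal,
    remainingDays: Int,
    daysInPeriod: Int,
): BigDecimal =
    prorate(newPrice, remainingDays, daysInPeriod) - prorate(oldPrice, remainingDays, daysInPeriod)

fun main() {
    // 15 of 30 days remain; upgrading from a $10 plan to a $30 plan.
    val delta = prorationDelta(BigDecimal("10.00"), BigDecimal("30.00"), 15, 30)
    check(delta == BigDecimal("10.00"))
}
```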

Same spec files. Two different architectural contexts.

  • Classic: single-module Spring Boot, layered architecture (controller → service → repository → model), JPA, JUnit 5, exception-based error handling
  • Clean: 5-module Clean Architecture, Arrow-kt Either, CQRS, jOOQ, Kotest, sealed interface error types

This article covers Phase 0. The orchestrator agent read the specs, produced a layer sketch, and spawned implementer agents to generate every file from zero. No human edits. No iterative refinement. The only difference was the .claude/ directory — the architectural context the LLM received.


What "context" means for an LLM

Most discussions about AI-generated code focus on the prompt. But for a codebase-scale task, the prompt is the least important input. What matters is the context — the accumulated knowledge the LLM has access to when making decisions.

For classic:

classic/.claude/
├── CLAUDE.md            # Architecture rules, package conventions
├── rules/               # 3 files: JPA patterns, Spring patterns, testing
├── agents/              # 5 agents: orchestrator, entity, service, api, tester
├── skills/              # TDD patterns
├── specs/               # 4 spec files (shared)
└── layer-sketch.md      # 45 files to create

For clean:

clean/.claude/
├── CLAUDE.md            # Layer rules, CQRS structure, forbidden imports
├── rules/               # 2 files: Arrow Either style, test conventions
├── agents/              # 6 agents: orchestrator, domain, app, infra, presentation, tester
├── skills/              # 5 skills: CA patterns, FP patterns, Arrow-kt, jOOQ, TDD
├── specs/               # 4 spec files (shared)
└── layer-sketch.md      # 115 files to create

But the difference isn't just the .claude/ directory. The clean project also requires infrastructure that exists before the LLM starts generating: a multi-module Gradle setup with dependency constraints between modules, custom detekt rules (ForbiddenLayerImportRule, NoThrowOutsidePresentationRule), jOOQ code generation from DDL, and Kover coverage thresholds. Classic needs none of this — a single build.gradle.kts with Spring Boot starters is enough.
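To make the module-boundary idea concrete, here is a hypothetical sketch of what "dependency constraints between modules" looks like in a multi-module Gradle build. The module names follow the article's layer names; the version numbers and dependency coordinates are assumptions, not taken from the actual repository.

```kotlin
// domain/build.gradle.kts — hypothetical sketch. The dependency block simply
// has no Spring, JPA, or jOOQ entries, so such imports cannot compile here.
dependencies {
    implementation("io.arrow-kt:arrow-core:1.2.4")            // Either types only
    testImplementation("io.kotest:kotest-runner-junit5:5.9.1")
}
```

```kotlin
// application/build.gradle.kts — may see domain, never infrastructure.
dependencies {
    implementation(project(":domain"))
    // no project(":infrastructure") here — the compiler enforces the direction
}
```

The point is that the constraint lives in the build graph, not in a comment: a use case that tries to import a jOOQ class fails to compile before any detekt rule even runs.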

The classic context teaches the LLM how to write code — JPA annotation patterns, @Transactional placement, exception hierarchy design. The clean context teaches the LLM where code cannot go — forbidden imports per layer, mandatory Either return types, whitelist-based constraints — backed by tooling that enforces those boundaries at build time.

This is the fundamental difference. Classic gives guidelines. Clean gives boundaries with teeth.


The generated code: classic

The LLM produced 33 source files in a single module. Here's what it did well and where it broke down.

What the LLM got right

Rich domain models. The Money class handles currency-safe arithmetic with operator overloading. SubscriptionStatus and InvoiceStatus implement state machine transitions with canTransitionTo() guards. These patterns came directly from the spec + the jpa-kotlin.md rule file.

// classic/model/Money.kt
data class Money(val amount: BigDecimal, val currency: Currency) {
    operator fun plus(other: Money): Money {
        require(currency == other.currency) { "Currency mismatch" }
        return Money(amount + other.amount, currency)
    }
}
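The canTransitionTo() guards mentioned above aren't shown in the article, so here is a sketch of the pattern. The transition table is inferred from the lifecycle in the spec (trial → active → paused → canceled → expired), not copied from the generated code.

```kotlin
// Sketch of the canTransitionTo() state-machine pattern. The exact set of
// allowed transitions here is an assumption based on the spec's lifecycle.
enum class SubscriptionStatus {
    TRIAL, ACTIVE, PAUSED, CANCELED, EXPIRED;

    fun canTransitionTo(target: SubscriptionStatus): Boolean = when (this) {
        TRIAL -> target == ACTIVE || target == CANCELED || target == EXPIRED
        ACTIVE -> target == PAUSED || target == CANCELED || target == EXPIRED
        PAUSED -> target == ACTIVE || target == CANCELED
        CANCELED -> target == EXPIRED
        EXPIRED -> false   // terminal state
    }
}

fun main() {
    check(SubscriptionStatus.ACTIVE.canTransitionTo(SubscriptionStatus.PAUSED))
    check(!SubscriptionStatus.EXPIRED.canTransitionTo(SubscriptionStatus.ACTIVE))
}
```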

Exception hierarchy. The LLM created a ServiceException base class with HTTP status codes and a GlobalExceptionHandler that maps each exception type to the correct response. Clean, predictable, consistent.
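A plain-Kotlin sketch of that pattern, so the example runs standalone: the real project maps these exceptions to responses via a Spring handler, which is replaced here by a simple function. Class names and messages are illustrative, not copied from the repository.

```kotlin
// Base class carrying the HTTP status alongside the message, as described above.
abstract class ServiceException(val httpStatus: Int, message: String) : RuntimeException(message)

class SubscriptionNotFoundException(id: Long) :
    ServiceException(404, "Subscription $id not found")

class InvalidDiscountCodeException(code: String) :
    ServiceException(400, "Unknown discount code: $code")

// Stand-in for the GlobalExceptionHandler's mapping logic (Spring omitted).
fun toErrorResponse(e: ServiceException): Pair<Int, String> =
    e.httpStatus to (e.message ?: "error")

fun main() {
    val (status, body) = toErrorResponse(InvalidDiscountCodeException("SUMMER10"))
    check(status == 400)
    check("SUMMER10" in body)
}
```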

Where classic broke down

The 457-line service. SubscriptionService handles creation, plan changes, pauses, resumes, cancellations, renewals, and discount management — all in one class. The LLM followed the most common Spring Boot pattern it's seen: one service per entity.

// classic/service/SubscriptionService.kt — 457 lines
@Service
class SubscriptionService(
    private val subscriptionRepository: SubscriptionRepository,
    private val planRepository: PlanRepository,
    private val invoiceRepository: InvoiceRepository,
    private val usageRecordRepository: UsageRecordRepository,
    private val discountRepository: DiscountRepository,  // 5 repositories
    private val paymentGateway: PaymentGateway,
    private val clock: Clock,
) {
    fun createSubscription(...): Subscription { /* 50 lines */ }
    fun changePlan(...): Subscription { /* 80 lines */ }
    fun pauseSubscription(...): Subscription { /* 25 lines */ }
    fun resumeSubscription(...): Subscription { /* 20 lines */ }
    fun cancelSubscription(...): Subscription { /* 35 lines */ }
    fun processRenewal(...): Subscription { /* 90 lines — 9 responsibilities */ }
    // ...
}

Nobody told the LLM to create one service with nine responsibilities. But nobody told it not to, either. The classic context defines a service/ package and lists SubscriptionService as the class to create. The LLM filled it with everything subscription-related — because that's the statistical norm in Spring Boot codebases.

Hardcoded discount resolution — in the wrong place. The spec mentions discount codes but doesn't specify a database lookup. Both implementations hardcoded the same "WELCOME20" string. The difference is where.

In classic, the hardcode lives inside SubscriptionService as a private method:

// classic/service/SubscriptionService.kt
private fun resolveDiscount(code: String, now: Instant): Discount =
    when (code) {
        "WELCOME20" -> Discount(type = PERCENTAGE, value = BigDecimal("20"), ...)
        else -> throw InvalidDiscountCodeException(code)
    }

In clean, the same hardcode lives in DiscountRepositoryImpl in the infrastructure layer, behind a DiscountCodePort interface:

// clean/infrastructure/.../DiscountRepositoryImpl.kt
override fun resolve(code: String, ...): Either<DomainError, Discount?> = either {
    when (code) {
        "WELCOME20" -> Discount.of(...)
        else -> null
    }
}

Both are stubs. But when the time comes to replace the hardcode with a database lookup, classic requires modifying SubscriptionService — the 457-line class that owns all subscription logic. Clean requires modifying only the infrastructure adapter. The use case never knows the difference, because it depends on DiscountCodePort, not the concrete implementation. The layer sketch defined this port upfront; the LLM didn't choose it — the architecture did.

Invoice generation duplicated. The changePlan() method and processRenewal() method both construct Invoice objects with line items. The logic is nearly identical. The LLM generated each method independently, and since there's no factory pattern in the context, it copy-pasted the construction logic.
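One way the duplication could have been avoided, had the context named it: a shared factory that both methods call. This is a hypothetical refactoring with simplified stand-in types, not code from either repository.

```kotlin
import java.math.BigDecimal

// Hypothetical extraction: the invoice-construction logic that changePlan()
// and processRenewal() each duplicated, pulled into one factory.
data class LineItem(val description: String, val amount: BigDecimal)

data class Invoice(
    val subscriptionId: Long,
    val lineItems: List<LineItem>,
    val total: BigDecimal,
)

object InvoiceFactory {
    fun create(subscriptionId: Long, lineItems: List<LineItem>): Invoice =
        Invoice(
            subscriptionId = subscriptionId,
            lineItems = lineItems,
            total = lineItems.fold(BigDecimal.ZERO) { acc, item -> acc + item.amount },
        )
}

fun main() {
    val invoice = InvoiceFactory.create(
        subscriptionId = 1L,
        lineItems = listOf(
            LineItem("Prorated charge: professional", BigDecimal("15.00")),
            LineItem("Credit: starter (unused)", BigDecimal("-5.00")),
        ),
    )
    check(invoice.total == BigDecimal("10.00"))
}
```

Without a pattern like this in its context, the LLM had no statistical pressure to extract it — which is exactly the article's point.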


The generated code: clean

The LLM produced 80+ source files across 5 modules. The same spec, but a fundamentally different result.

The architecture constrained the output

One use case, one file. The clean context defines 8 command use cases and 3 query use cases — each as a separate interface and implementation. The LLM couldn't create a 457-line service because the structure doesn't have a place for one.

// clean/application/command/usecase/PlanChangeUseCaseImpl.kt — 85 lines
class PlanChangeUseCaseImpl(
    private val subscriptionCommandQueryPort: SubscriptionCommandQueryPort,
    private val planQueryPort: PlanQueryPort,
    private val paymentGatewayPort: PaymentGatewayPort,
    private val prorationDomainService: ProrationDomainService,
    private val subscriptionRepository: SubscriptionRepository,
    private val invoiceRepository: InvoiceRepository,
    private val clockPort: ClockPort,
    private val transactionPort: TransactionPort,
) : PlanChangeUseCase {
    override fun execute(command: ChangePlanCommand): Either<PlanChangeError, Subscription> = either {
        // ...
    }
}

The largest use case is 120 lines. The structure made bloat physically impossible.

Either forced exhaustive error handling. Every use case returns Either<SpecificError, T>. The LLM couldn't skip error cases because the sealed interface makes the compiler enforce exhaustiveness:

sealed interface PlanChangeError : ApplicationError {
    data class InvalidInput(val field: String, val reason: String) : PlanChangeError
    data object SubscriptionNotFound : PlanChangeError
    data object NotActive : PlanChangeError
    data object SamePlan : PlanChangeError
    data object PlanNotFound : PlanChangeError
    data class CurrencyMismatch(val from: String, val to: String) : PlanChangeError
    data class PaymentFailed(val reason: String) : PlanChangeError
    data class Domain(val error: DomainError) : PlanChangeError
    data class Internal(val cause: String) : PlanChangeError
}
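What "the compiler enforces exhaustiveness" means in practice: a `when` over a sealed interface needs no `else` branch, so adding a new variant turns every unhandled match site into a compile error. A runnable sketch with a trimmed-down version of the error type (the HTTP status mapping is illustrative):

```kotlin
// Trimmed version of PlanChangeError, enough to show exhaustive matching.
sealed interface PlanChangeError {
    data object SubscriptionNotFound : PlanChangeError
    data object SamePlan : PlanChangeError
    data class PaymentFailed(val reason: String) : PlanChangeError
}

// No `else` branch: deleting or adding a variant breaks this at compile time,
// unlike a forgotten catch block, which fails silently at runtime.
fun toHttpStatus(error: PlanChangeError): Int = when (error) {
    PlanChangeError.SubscriptionNotFound -> 404
    PlanChangeError.SamePlan -> 409
    is PlanChangeError.PaymentFailed -> 402
}

fun main() {
    check(toHttpStatus(PlanChangeError.SamePlan) == 409)
    check(toHttpStatus(PlanChangeError.PaymentFailed("card declined")) == 402)
}
```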

In classic, the LLM defined 8 exception types. In clean, it defined 50+ error variants across sealed interfaces. Not because the LLM tried harder — but because the type system demanded it.

Ports prevented infrastructure leakage. The PlanQueryPort is defined in the application layer. The LLM couldn't import JPA or jOOQ in the use case because the module boundary prevents it. The adapter pattern isn't a recommendation — it's the only way to access external systems.

// Defined in application/command/port/
interface PaymentGatewayPort {
    fun charge(amount: Money, paymentMethod: PaymentMethod, customerRef: String):
        Either<PaymentError, PaymentResult>
}

// Implemented in infrastructure/command/adapter/
@Component
class PaymentGatewayAdapter : PaymentGatewayPort {
    override fun charge(...) = PaymentResult(...).right()  // stub
}
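This port structure is also what lets the clean tests run without a Spring context, as the test numbers later show. A simplified, dependency-free sketch (no Arrow, no Spring; the names mirror PaymentGatewayPort but the types are illustrative):

```kotlin
import java.math.BigDecimal

data class PaymentResult(val transactionId: String)

// The use case depends only on this interface, never on an HTTP client.
interface PaymentGatewayPort {
    fun charge(amount: BigDecimal, customerRef: String): PaymentResult
}

class ChargeUseCase(private val gateway: PaymentGatewayPort) {
    fun execute(amount: BigDecimal, customerRef: String): PaymentResult =
        gateway.charge(amount, customerRef)
}

// Hand-rolled in-memory fake — no Spring context, instant startup.
class FakePaymentGateway : PaymentGatewayPort {
    val charges = mutableListOf<Pair<BigDecimal, String>>()
    override fun charge(amount: BigDecimal, customerRef: String): PaymentResult {
        charges += amount to customerRef
        return PaymentResult("fake-txn-${charges.size}")
    }
}

fun main() {
    val fake = FakePaymentGateway()
    val result = ChargeUseCase(fake).execute(BigDecimal("29.00"), "cust-42")
    check(result.transactionId == "fake-txn-1")
    check(fake.charges.single() == (BigDecimal("29.00") to "cust-42"))
}
```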

Where clean showed overhead

115 files for the same business logic. The clean version has 2.5x more files than classic. Many are thin — a sealed interface with two variants, a port interface with one method, a DTO with five fields. But they exist, and each one had to be generated, placed in the correct module, and wired together.

Value object ceremony. Every ID type requires a smart constructor returning Either:

@JvmInline
value class SubscriptionId private constructor(val value: Long) {
    companion object {
        operator fun invoke(value: Long): Either<ValidationError, SubscriptionId> =
            if (value > 0) SubscriptionId(value).right()
            else ValidationError.InvalidId("SubscriptionId", value).left()
    }
}

There are 10 value object types. Classic uses Long directly. The type safety is real, but so is the boilerplate.
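The ceremony is visible even in a dependency-free sketch. Below, a minimal hand-rolled Either stands in for Arrow's (in the real project these are arrow.core types); the error type is simplified:

```kotlin
// Minimal stand-in for Arrow's Either, for illustration only.
sealed interface Either<out L, out R> {
    data class Left<L>(val value: L) : Either<L, Nothing>
    data class Right<R>(val value: R) : Either<Nothing, R>
}

data class InvalidId(val type: String, val value: Long)

// The smart-constructor pattern from above: the only way to obtain a
// SubscriptionId is through validation, so an invalid one cannot exist.
@JvmInline
value class SubscriptionId private constructor(val value: Long) {
    companion object {
        operator fun invoke(value: Long): Either<InvalidId, SubscriptionId> =
            if (value > 0) Either.Right(SubscriptionId(value))
            else Either.Left(InvalidId("SubscriptionId", value))
    }
}

fun main() {
    check(SubscriptionId(42) is Either.Right)
    check(SubscriptionId(-1) is Either.Left)
}
```

Multiply that by ten ID types and the boilerplate cost the article describes becomes concrete — as does the guarantee that a negative ID can never reach a use case.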


The numbers

Generation time

| Metric | Classic | Clean |
| --- | --- | --- |
| Phase 0 generation | 18m 15s | 32m 2s |

Clean took 1.75x longer to generate. The orchestrator coordinates 6 agents across 5 modules with dependency ordering — domain must finish before application, application before infrastructure and presentation. Classic's 4-agent pipeline runs in a simpler sequence with fewer handoffs. The multi-module coordination, larger layer sketch (115 files vs 45), and the additional Arrow-kt/CQRS patterns all add up.

Code volume

| Metric | Classic | Clean |
| --- | --- | --- |
| Source files | 33 | 80+ |
| Total production lines | 787 | 1,225 |
| Largest file | 457 lines (SubscriptionService) | 120 lines (ProcessRenewalUseCase) |
| Error types defined | 8 exception classes | 50+ sealed interface variants |

Test results

| Metric | Classic | Clean |
| --- | --- | --- |
| Test count | 155 | 167 |
| Full build + test (cold) | 4s | 9s |
| Full build + test (warm) | 3s | 3s |
| Test execution time | 2.2s | 1.7s |

The build time difference disappears once the Gradle daemon is warm — compilation is the bottleneck, not test execution. Clean's tests actually execute faster (1.7s vs 2.2s) despite having more tests, because domain and application tests run without Spring context startup.

Layer-by-layer test performance

Classic:

| Layer | Tests | Time | Per test |
| --- | --- | --- | --- |
| Model (pure) | 70 | 0.027s | 0.4ms |
| Service (MockK) | 67 | 0.723s | 10.8ms |
| Controller (WebMvcTest) | 18 | 1.45s | 80.6ms |

Clean:

| Module | Tests | Time | Per test |
| --- | --- | --- | --- |
| Domain (pure + Kotest) | 85 | 0.45s | 5.2ms |
| Application (Kotest + MockK) | 61 | 2.25s | 36.9ms |
| Presentation (no Spring) | 21 | 1.1s | 52.4ms |

Two things stand out. First, clean's presentation tests don't use @WebMvcTest — they instantiate the controller directly and call methods. The 1.1s is pure Kotest/MockK initialization overhead, not Spring. Second, the Kotest runner adds ~0.9s of startup per module for the first test class. After initialization, subsequent test classes run at ~15ms each.

Coverage

| Metric | Classic (JaCoCo) | Clean (Kover) |
| --- | --- | --- |
| Line coverage | 89.7% | 87.7% |
| Production lines | 787 | 1,225 |

Classic by layer:

| Layer | Coverage |
| --- | --- |
| Model | 92.9% |
| Service | 90.4% |
| Controller | 100% |

Clean by module:

| Module | Coverage |
| --- | --- |
| Domain | 74.1% |
| Application | 84.2% |
| Presentation | 82.1% |

Coverage numbers are nearly identical — but they measure different things.

Classic's 89.7% means "89.7% of lines were reached." The missing 10.3% includes exception paths that were never thrown in tests. Those paths exist as throw statements — invisible to the type system, easy to forget in tests.

Clean's 87.7% means "87.7% of typed error paths were exercised." The missing 12.3% is visible in the code — it's sealed interface variants like Internal(cause: String) that tests didn't trigger. The untested paths are named and enumerated. You can grep for them.


What the experiment actually shows

Both implementations are functionally equivalent. They handle the same subscription lifecycle — creation, plan changes, pauses, resumes, cancellations, renewals, usage metering, and payment processing. Both compile, both pass tests, and both land near 90% line coverage (89.7% vs 87.7%). There is no meaningful difference in what the code does.

The difference is in how the code is shaped.

Classic: the LLM's strengths and blind spots

Classic plays to the LLM's greatest advantage: the sheer volume of training data for conventional patterns. Spring Boot is the most common Kotlin/JVM web framework. The LLM has seen millions of @Service + @Repository + @RestController examples. The result is concise, idiomatic code that any Spring developer would recognize immediately.

The model layer is genuinely good. Money with operator overloading, SubscriptionStatus with canTransitionTo() guards, Invoice with state machine transitions — these are clean, well-structured patterns that benefit from the LLM's deep familiarity with the ecosystem.

But the same statistical bias that makes the LLM good at conventional patterns makes it prone to conventional mistakes:

  • Service bloat. SubscriptionService at 457 lines with 9 responsibilities. The LLM follows the "one service per entity" norm — the pattern it's seen most often — even when the complexity demands decomposition.
  • Code duplication. Invoice construction logic appears in both changePlan() and processRenewal(). The LLM generates each method independently. Without an explicit Factory pattern in the context, it doesn't extract shared logic.
  • Shortest-path solutions. Hardcoded discount resolution instead of a repository lookup. When the spec is underspecified, the LLM takes the minimal implementation — not the extensible one.

These aren't bugs. The code works. But they're the kind of structural problems that compound over time as features are added.

Clean: boundaries prevent problems, at a cost

Clean architecture constrains the LLM's output through structure rather than instruction. The result is code with clear dependency direction, no bloated classes, and exhaustive error handling — but significantly more of it.

What boundaries buy you:

  • No bloat possible. The layer sketch defines 8 command use cases as separate files. The LLM can't merge them into a god service because there's no single file to put it in. The largest use case is 120 lines.
  • Explicit dependencies. Every external system access goes through a Port interface defined in the application layer. The LLM can't accidentally couple a use case to jOOQ or JPA — the module boundary prevents the import.
  • Exhaustive error handling. Sealed interfaces with Either force the LLM to define and handle every error variant. Classic has 8 exception types; clean has 50+ error variants. The compiler enforces this, not the prompt.

What boundaries cost you:

  • 2.5x more code. 1,225 lines across 80+ files vs 787 lines across 33 files. Many clean files are thin wrappers — a port interface with one method, a sealed interface with two variants — but they all need to be generated, placed correctly, and wired together.
  • Longer generation time. The clean orchestrator must coordinate 6 agents across 5 modules with dependency ordering (domain → application → infrastructure | presentation). Classic's orchestrator runs 4 agents in a simpler pipeline. The multi-module coordination overhead is real.
  • Higher context investment. 24 configuration files vs 19. The additional skills (Arrow-kt patterns, CA layer rules, FP patterns) and the detailed layer sketch are necessary to produce correct output — but they represent upfront work before the first feature ships.

The core observation

The LLM doesn't "understand" architecture. It follows the path of least resistance within the constraints it's given.

With classic constraints, the path of least resistance is the statistically dominant pattern — concise, idiomatic, and immediately readable, but carrying the same structural debts (god services, code duplication) that the average Spring Boot project accumulates over time.

With clean constraints, the path of least resistance is the structurally enforced pattern — no bloat, exhaustive error handling, explicit dependencies, but at the cost of 1.75x generation time and 2.5x more files.

The LLM didn't write better or worse code in either case. It filled the container it was given.

Architecture is not a prompt. It's the shape of the container the LLM fills.

But this is Phase 0 — greenfield generation. Both codebases work. Both have high coverage. The structural differences are visible but haven't been tested under pressure. The real question is what happens when the LLM modifies existing code — when the 457-line service needs a tenth responsibility, when the duplicated invoice logic needs a third variant.


The full source is on GitHub: classic/ | clean/
