I gave Claude Code identical business requirements with two different architectural contexts. The generated code told me more about architecture's value than any textbook could.
The experiment
I set up a controlled experiment in two phases:
- Phase 0 — Full-scratch generation. Give the LLM a spec and an empty project. It builds everything from zero. This tests how architecture shapes greenfield code.
- Phase 1 — Feature addition. Give the LLM an existing codebase (the Phase 0 output) and a new feature spec. It reads, modifies, and extends. This tests how architecture holds up under change.
Same business domain for both: a SaaS subscription billing system. The spec covers plan management with tiered pricing (free / starter / professional / enterprise), the full subscription lifecycle (trial → active → paused → canceled → expired), usage-based metering with per-plan limits, proration calculations on mid-cycle plan changes, invoice generation with line items and discount application, and payment processing with retry logic and grace periods.
Same spec files. Two different architectural contexts.
- Classic: single-module Spring Boot, layered architecture (controller → service → repository → model), JPA, JUnit 5, exception-based error handling
- Clean: 5-module Clean Architecture, Arrow-kt Either, CQRS, jOOQ, Kotest, sealed interface error types
This article covers Phase 0. The orchestrator agent read the specs, produced a layer sketch, and spawned implementer agents to generate every file from zero. No human edits. No iterative refinement. The only difference was the .claude/ directory — the architectural context the LLM received.
What "context" means for an LLM
Most discussions about AI-generated code focus on the prompt. But for a codebase-scale task, the prompt is the least important input. What matters is the context — the accumulated knowledge the LLM has access to when making decisions.
For classic:
```
classic/.claude/
├── CLAUDE.md          # Architecture rules, package conventions
├── rules/             # 3 files: JPA patterns, Spring patterns, testing
├── agents/            # 5 agents: orchestrator, entity, service, api, tester
├── skills/            # TDD patterns
├── specs/             # 4 spec files (shared)
└── layer-sketch.md    # 45 files to create
```
For clean:
```
clean/.claude/
├── CLAUDE.md          # Layer rules, CQRS structure, forbidden imports
├── rules/             # 2 files: Arrow Either style, test conventions
├── agents/            # 6 agents: orchestrator, domain, app, infra, presentation, tester
├── skills/            # 5 skills: CA patterns, FP patterns, Arrow-kt, jOOQ, TDD
├── specs/             # 4 spec files (shared)
└── layer-sketch.md    # 115 files to create
```
But the difference isn't just the .claude/ directory. The clean project also requires infrastructure that exists before the LLM starts generating: a multi-module Gradle setup with dependency constraints between modules, custom detekt rules (ForbiddenLayerImportRule, NoThrowOutsidePresentationRule), jOOQ code generation from DDL, and Kover coverage thresholds. Classic needs none of this — a single build.gradle.kts with Spring Boot starters is enough.
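For illustration, the module-boundary part of that pre-existing infrastructure might look like the following in a Gradle Kotlin DSL build file. This is a hypothetical sketch — module names and the Arrow version are assumptions, not taken from the project:

```kotlin
// clean/application/build.gradle.kts — hypothetical sketch of a build-time
// layer boundary. The application module may depend on domain, and on Arrow
// for Either, but has no dependency on infrastructure or presentation, so
// any import from those modules fails to compile.
dependencies {
    implementation(project(":domain"))
    implementation("io.arrow-kt:arrow-core:1.2.4")
    // deliberately absent: project(":infrastructure"), project(":presentation"),
    // jOOQ, JPA — the boundary is the missing dependency, not a convention
}
```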
The classic context teaches the LLM how to write code — JPA annotation patterns, @Transactional placement, exception hierarchy design. The clean context teaches the LLM where code cannot go — forbidden imports per layer, mandatory Either return types, whitelist-based constraints — backed by tooling that enforces those boundaries at build time.
This is the fundamental difference. Classic gives guidelines. Clean gives boundaries with teeth.
The generated code: classic
The LLM produced 33 source files in a single module. Here's what it did well and where it broke down.
What the LLM got right
Rich domain models. The Money class handles currency-safe arithmetic with operator overloading. SubscriptionStatus and InvoiceStatus implement state machine transitions with canTransitionTo() guards. These patterns came directly from the spec + the jpa-kotlin.md rule file.
```kotlin
// classic/model/Money.kt
data class Money(val amount: BigDecimal, val currency: Currency) {
    operator fun plus(other: Money): Money {
        require(currency == other.currency) { "Currency mismatch" }
        return Money(amount + other.amount, currency)
    }
}
```
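The canTransitionTo() guard pattern can be sketched like this — a hedged, self-contained reconstruction using the lifecycle states from the spec (trial → active → paused → canceled → expired), not the generated file itself:

```kotlin
// Illustrative sketch of a status state machine with transition guards.
// The transition rules shown are plausible readings of the spec, not the
// generated code's exact rules.
enum class SubscriptionStatus {
    TRIAL, ACTIVE, PAUSED, CANCELED, EXPIRED;

    fun canTransitionTo(target: SubscriptionStatus): Boolean = when (this) {
        TRIAL -> target == ACTIVE || target == CANCELED || target == EXPIRED
        ACTIVE -> target == PAUSED || target == CANCELED || target == EXPIRED
        PAUSED -> target == ACTIVE || target == CANCELED
        CANCELED -> false // terminal state
        EXPIRED -> false  // terminal state
    }
}
```

Because the guard lives on the enum itself, every caller that mutates status can check legality in one place instead of re-deriving the rules.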
Exception hierarchy. The LLM created a ServiceException base class with HTTP status codes and a GlobalExceptionHandler that maps each exception type to the correct response. Clean, predictable, consistent.
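The shape of that hierarchy — shown here as a minimal sketch with illustrative exception names and messages, omitting the @RestControllerAdvice handler so the snippet stays self-contained:

```kotlin
// Hedged sketch of a ServiceException base carrying an HTTP status.
// A GlobalExceptionHandler (not shown) would map httpStatus to the response.
abstract class ServiceException(message: String, val httpStatus: Int) : RuntimeException(message)

class InvalidDiscountCodeException(code: String) :
    ServiceException("Unknown discount code: $code", 400)

class SubscriptionNotFoundException(id: Long) :
    ServiceException("Subscription $id not found", 404)
```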
Where classic broke down
The 457-line service. SubscriptionService handles creation, plan changes, pauses, resumes, cancellations, renewals, and discount management — all in one class. The LLM followed the most common Spring Boot pattern it's seen: one service per entity.
```kotlin
// classic/service/SubscriptionService.kt — 457 lines
@Service
class SubscriptionService(
    private val subscriptionRepository: SubscriptionRepository,
    private val planRepository: PlanRepository,
    private val invoiceRepository: InvoiceRepository,
    private val usageRecordRepository: UsageRecordRepository,
    private val discountRepository: DiscountRepository, // 5 repositories
    private val paymentGateway: PaymentGateway,
    private val clock: Clock,
) {
    fun createSubscription(...): Subscription { /* 50 lines */ }
    fun changePlan(...): Subscription { /* 80 lines */ }
    fun pauseSubscription(...): Subscription { /* 25 lines */ }
    fun resumeSubscription(...): Subscription { /* 20 lines */ }
    fun cancelSubscription(...): Subscription { /* 35 lines */ }
    fun processRenewal(...): Subscription { /* 90 lines — 9 responsibilities */ }
    // ...
}
```
Nobody told the LLM to create one service with nine responsibilities. But nobody told it not to, either. The classic context defines a service/ package and lists SubscriptionService as the class to create. The LLM filled it with everything subscription-related — because that's the statistical norm in Spring Boot codebases.
Hardcoded discount resolution — in the wrong place. The spec mentions discount codes but doesn't specify a database lookup. Both implementations hardcoded the same "WELCOME20" string. The difference is where.
In classic, the hardcode lives inside SubscriptionService as a private method:
```kotlin
// classic/service/SubscriptionService.kt
private fun resolveDiscount(code: String, now: Instant): Discount =
    when (code) {
        "WELCOME20" -> Discount(type = PERCENTAGE, value = BigDecimal("20"), ...)
        else -> throw InvalidDiscountCodeException(code)
    }
```
In clean, the same hardcode lives in DiscountRepositoryImpl in the infrastructure layer, behind a DiscountCodePort interface:
```kotlin
// clean/infrastructure/.../DiscountRepositoryImpl.kt
override fun resolve(code: String, ...): Either<DomainError, Discount?> = either {
    when (code) {
        "WELCOME20" -> Discount.of(...)
        else -> null
    }
}
```
Both are stubs. But when the time comes to replace the hardcode with a database lookup, classic requires modifying SubscriptionService — the 457-line class that owns all subscription logic. Clean requires modifying only the infrastructure adapter. The use case never knows the difference, because it depends on DiscountCodePort, not the concrete implementation. The layer sketch defined this port upfront; the LLM didn't choose it — the architecture did.
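The isolation can be seen in a reduced sketch. This is hypothetical code, simplified to a nullable return instead of Arrow's Either so it stays self-contained; the names mirror the port described above:

```kotlin
// Hypothetical sketch: the use case sees only the interface, so replacing
// the hardcoded stub with a real lookup touches exactly one adapter file.
data class Discount(val code: String, val percent: Int)

interface DiscountCodePort {
    fun resolve(code: String): Discount?
}

// Today: the hardcoded stub the LLM generated, isolated in infrastructure.
class HardcodedDiscountAdapter : DiscountCodePort {
    override fun resolve(code: String): Discount? =
        if (code == "WELCOME20") Discount(code, 20) else null
}

// Tomorrow: a database-backed adapter (lookup stands in for a jOOQ query)
// replaces the stub — no use case code changes.
class DatabaseDiscountAdapter(private val lookup: (String) -> Discount?) : DiscountCodePort {
    override fun resolve(code: String): Discount? = lookup(code)
}
```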
Invoice generation duplicated. The changePlan() method and processRenewal() method both construct Invoice objects with line items. The logic is nearly identical. The LLM generated each method independently, and since there's no factory pattern in the context, it copy-pasted the construction logic.
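The extraction the LLM never made would look something like this — a hypothetical InvoiceFactory, shown only to illustrate what an explicit Factory pattern in the context could have produced; types and names are assumptions:

```kotlin
import java.math.BigDecimal

// Hypothetical sketch — not in the generated code. Both changePlan() and
// processRenewal() could delegate invoice construction here instead of
// duplicating it.
data class LineItem(val description: String, val amount: BigDecimal)

data class Invoice(val subscriptionId: Long, val lineItems: List<LineItem>) {
    val total: BigDecimal = lineItems.fold(BigDecimal.ZERO) { acc, li -> acc + li.amount }
}

object InvoiceFactory {
    fun create(subscriptionId: Long, lineItems: List<LineItem>): Invoice {
        require(lineItems.isNotEmpty()) { "An invoice needs at least one line item" }
        return Invoice(subscriptionId, lineItems)
    }
}
```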
The generated code: clean
The LLM produced 80+ source files across 5 modules. The same spec, but a fundamentally different result.
The architecture constrained the output
One use case, one file. The clean context defines 8 command use cases and 3 query use cases — each as a separate interface and implementation. The LLM couldn't create a 457-line service because the structure doesn't have a place for one.
```kotlin
// clean/application/command/usecase/PlanChangeUseCaseImpl.kt — 85 lines
class PlanChangeUseCaseImpl(
    private val subscriptionCommandQueryPort: SubscriptionCommandQueryPort,
    private val planQueryPort: PlanQueryPort,
    private val paymentGatewayPort: PaymentGatewayPort,
    private val prorationDomainService: ProrationDomainService,
    private val subscriptionRepository: SubscriptionRepository,
    private val invoiceRepository: InvoiceRepository,
    private val clockPort: ClockPort,
    private val transactionPort: TransactionPort,
) : PlanChangeUseCase {
    override fun execute(command: ChangePlanCommand): Either<PlanChangeError, Subscription> = either {
        // ...
    }
}
```
The largest use case is 120 lines. The structure made bloat physically impossible.
Either forced exhaustive error handling. Every use case returns Either<SpecificError, T>. The LLM couldn't skip error cases because the sealed interface makes the compiler enforce exhaustiveness:
```kotlin
sealed interface PlanChangeError : ApplicationError {
    data class InvalidInput(val field: String, val reason: String) : PlanChangeError
    data object SubscriptionNotFound : PlanChangeError
    data object NotActive : PlanChangeError
    data object SamePlan : PlanChangeError
    data object PlanNotFound : PlanChangeError
    data class CurrencyMismatch(val from: String, val to: String) : PlanChangeError
    data class PaymentFailed(val reason: String) : PlanChangeError
    data class Domain(val error: DomainError) : PlanChangeError
    data class Internal(val cause: String) : PlanChangeError
}
```
In classic, the LLM defined 8 exception types. In clean, it defined 50+ error variants across sealed interfaces. Not because the LLM tried harder — but because the type system demanded it.
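What "the compiler enforces exhaustiveness" means in practice — a reduced, self-contained illustration with a smaller hypothetical error hierarchy, not the generated one:

```kotlin
// Reduced sketch of compiler-enforced exhaustiveness over a sealed error type.
sealed interface PauseError {
    data object SubscriptionNotFound : PauseError
    data object NotActive : PauseError
    data class Internal(val cause: String) : PauseError
}

// A `when` over a sealed interface needs no `else` branch — and adding a new
// variant to PauseError breaks this function at compile time until handled.
fun toHttpStatus(error: PauseError): Int = when (error) {
    PauseError.SubscriptionNotFound -> 404
    PauseError.NotActive -> 409
    is PauseError.Internal -> 500
}
```

This is the mechanism behind the 50+ variants: each one is a case the compiler refuses to let any consumer forget.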
Ports prevented infrastructure leakage. The PlanQueryPort is defined in the application layer. The LLM couldn't import JPA or jOOQ in the use case because the module boundary prevents it. The adapter pattern isn't a recommendation — it's the only way to access external systems.
```kotlin
// Defined in application/command/port/
interface PaymentGatewayPort {
    fun charge(amount: Money, paymentMethod: PaymentMethod, customerRef: String):
        Either<PaymentError, PaymentResult>
}

// Implemented in infrastructure/command/adapter/
@Component
class PaymentGatewayAdapter : PaymentGatewayPort {
    override fun charge(...) = PaymentResult(...).right() // stub
}
```
Where clean showed overhead
115 files for the same business logic. The clean version has 2.5x more files than classic. Many are thin — a sealed interface with two variants, a port interface with one method, a DTO with five fields. But they exist, and each one had to be generated, placed in the correct module, and wired together.
Value object ceremony. Every ID type requires a smart constructor returning Either:
```kotlin
@JvmInline
value class SubscriptionId private constructor(val value: Long) {
    companion object {
        operator fun invoke(value: Long): Either<ValidationError, SubscriptionId> =
            if (value > 0) SubscriptionId(value).right()
            else ValidationError.InvalidId("SubscriptionId", value).left()
    }
}
```
There are 10 value object types. Classic uses Long directly. The type safety is real, but so is the boilerplate.
The numbers
Generation time
| Metric | Classic | Clean |
|---|---|---|
| Phase 0 generation | 18m 15s | 32m 2s |
Clean took 1.75x longer to generate. The orchestrator coordinates 6 agents across 5 modules with dependency ordering — domain must finish before application, application before infrastructure and presentation. Classic's 4-agent pipeline runs in a simpler sequence with fewer handoffs. The multi-module coordination, larger layer sketch (115 files vs 45), and the additional Arrow-kt/CQRS patterns all add up.
Code volume
| Metric | Classic | Clean |
|---|---|---|
| Source files | 33 | 80+ |
| Total production lines | 787 | 1,225 |
| Largest file | 457 lines (SubscriptionService) | 120 lines (ProcessRenewalUseCase) |
| Error types defined | 8 exception classes | 50+ sealed interface variants |
Test results
| Metric | Classic | Clean |
|---|---|---|
| Test count | 155 | 167 |
| Full build + test (cold) | 4s | 9s |
| Full build + test (warm) | 3s | 3s |
| Test execution time | 2.2s | 1.7s |
The build time difference disappears once the Gradle daemon is warm — compilation is the bottleneck, not test execution. Clean's tests actually execute faster (1.7s vs 2.2s) despite having more tests, because domain and application tests run without Spring context startup.
Layer-by-layer test performance
Classic:
| Layer | Tests | Time | Per test |
|---|---|---|---|
| Model (pure) | 70 | 0.027s | 0.4ms |
| Service (MockK) | 67 | 0.723s | 10.8ms |
| Controller (WebMvcTest) | 18 | 1.45s | 80.6ms |
Clean:
| Module | Tests | Time | Per test |
|---|---|---|---|
| Domain (pure + Kotest) | 85 | 0.45s | 5.2ms |
| Application (Kotest + MockK) | 61 | 2.25s | 36.9ms |
| Presentation (no Spring) | 21 | 1.1s | 52.4ms |
Two things stand out. First, clean's presentation tests don't use @WebMvcTest — they instantiate the controller directly and call methods. The 1.1s is pure Kotest/MockK initialization overhead, not Spring. Second, the Kotest runner adds ~0.9s of startup per module for the first test class. After initialization, subsequent test classes run at ~15ms each.
Coverage
| Metric | Classic (JaCoCo) | Clean (Kover) |
|---|---|---|
| Line coverage | 89.7% | 87.7% |
| Production lines | 787 | 1,225 |
Classic by layer:
| Layer | Coverage |
|---|---|
| Model | 92.9% |
| Service | 90.4% |
| Controller | 100% |
Clean by module:
| Module | Coverage |
|---|---|
| Domain | 74.1% |
| Application | 84.2% |
| Presentation | 82.1% |
Coverage numbers are nearly identical — but they measure different things.
Classic's 89.7% means "89.7% of lines were reached." The missing 10.3% includes exception paths that were never thrown in tests. Those paths exist as throw statements — invisible to the type system, easy to forget in tests.
Clean's 87.7% means "87.7% of typed error paths were exercised." The missing 12.3% is visible in the code — it's sealed interface variants like Internal(cause: String) that tests didn't trigger. The untested paths are named and enumerated. You can grep for them.
What the experiment actually shows
Both implementations are functionally equivalent. They handle the same subscription lifecycle — creation, plan changes, pauses, resumes, cancellations, renewals, usage metering, and payment processing. Both compile, both pass tests, both have ~89% coverage. There is no meaningful difference in what the code does.
The difference is in how the code is shaped.
Classic: the LLM's strengths and blind spots
Classic plays to the LLM's greatest advantage: the sheer volume of training data for conventional patterns. Spring Boot is the most common Kotlin/JVM web framework. The LLM has seen millions of @Service + @Repository + @RestController examples. The result is concise, idiomatic code that any Spring developer would recognize immediately.
The model layer is genuinely good. Money with operator overloading, SubscriptionStatus with canTransitionTo() guards, Invoice with state machine transitions — these are clean, well-structured patterns that benefit from the LLM's deep familiarity with the ecosystem.
But the same statistical bias that makes the LLM good at conventional patterns makes it prone to conventional mistakes:
- Service bloat. SubscriptionService at 457 lines with 9 responsibilities. The LLM follows the "one service per entity" norm — the pattern it's seen most often — even when the complexity demands decomposition.
- Code duplication. Invoice construction logic appears in both changePlan() and processRenewal(). The LLM generates each method independently. Without an explicit Factory pattern in the context, it doesn't extract shared logic.
- Shortest-path solutions. Hardcoded discount resolution instead of a repository lookup. When the spec is underspecified, the LLM takes the minimal implementation — not the extensible one.
These aren't bugs. The code works. But they're the kind of structural problems that compound over time as features are added.
Clean: boundaries prevent problems, at a cost
Clean architecture constrains the LLM's output through structure rather than instruction. The result is code with clear dependency direction, no bloated classes, and exhaustive error handling — but significantly more of it.
What boundaries buy you:
- No bloat possible. The layer sketch defines 8 command use cases as separate files. The LLM can't merge them into a god service because there's no single file to put it in. The largest use case is 120 lines.
- Explicit dependencies. Every external system access goes through a Port interface defined in the application layer. The LLM can't accidentally couple a use case to jOOQ or JPA — the module boundary prevents the import.
- Exhaustive error handling. Sealed interfaces with Either force the LLM to define and handle every error variant. Classic has 8 exception types; clean has 50+ error variants. The compiler enforces this, not the prompt.
What boundaries cost you:
- 2.5x more files, 1.5x more code. 1,225 lines across 80+ files vs 787 lines across 33 files. Many clean files are thin wrappers — a port interface with one method, a sealed interface with two variants — but they all need to be generated, placed correctly, and wired together.
- Longer generation time. The clean orchestrator must coordinate 6 agents across 5 modules with dependency ordering (domain → application → infrastructure | presentation). Classic's orchestrator runs 4 agents in a simpler pipeline. The multi-module coordination overhead is real.
- Higher context investment. 24 configuration files vs 19. The additional skills (Arrow-kt patterns, CA layer rules, FP patterns) and the detailed layer sketch are necessary to produce correct output — but they represent upfront work before the first feature ships.
The core observation
The LLM doesn't "understand" architecture. It follows the path of least resistance within the constraints it's given.
With classic constraints, the path of least resistance is the statistically dominant pattern — concise, idiomatic, and immediately readable, but carrying the same structural debts (god services, code duplication) that the average Spring Boot project accumulates over time.
With clean constraints, the path of least resistance is the structurally enforced pattern — no bloat, exhaustive error handling, explicit dependencies, but at the cost of 1.75x generation time and 2.5x more files.
The LLM didn't write better or worse code in either case. It filled the container it was given.
Architecture is not a prompt. It's the shape of the container the LLM fills.
But this is Phase 0 — greenfield generation. Both codebases work. Both have high coverage. The structural differences are visible but haven't been tested under pressure. The real question is what happens when the LLM modifies existing code — when the 457-line service needs a tenth responsibility, when the duplicated invoice logic needs a third variant.