Tetsuya Wakita

I Made an LLM Add Features to Its Own Code — Classic Spring Boot vs Clean Architecture

In Part 0, both implementations worked. In Part 1, the LLM modified its own code — and the results challenged my assumptions about both architectures.


Phase 1: feature addition

In Part 0, I had Claude Code generate two complete SaaS billing systems from scratch — one in classic Spring Boot, one in Clean Architecture with Arrow-kt. Both worked. Both had ~89% coverage. The structural differences were visible but untested.

Phase 1 tests what happens next. I gave the LLM three new feature specs and pointed it at the Phase 0 codebase:

  • Add-ons — flat-rate and per-seat billing types, attached to subscriptions with proration on mid-cycle changes
  • Seat management — per-seat pricing with minimum/maximum constraints, prorated charges on seat count changes
  • Credit notes — full and partial refunds, with two application methods (refund to payment method or account credit)

Same orchestrator agent. Same process. The only difference: this time, the LLM read existing code before writing new code.

I expected Clean to scale cleanly and Classic to degrade predictably. The actual results were messier — and more interesting.


Generation time: the gap disappeared

| Phase | Classic | Clean | Ratio |
|---|---|---|---|
| Phase 0 (greenfield) | 18m 15s | 32m 2s | 1.75x |
| Phase 1 (feature add) | 28m 57s | 29m 20s | 1.01x |

This is the most surprising number in the entire experiment.

In Phase 0, Clean took 1.75x longer — the orchestrator had to coordinate 6 agents across 5 modules with strict dependency ordering, generating 115 files from a detailed layer sketch. Classic's simpler pipeline finished in half the time.

In Phase 1, they're 23 seconds apart. The existing codebase became the context. The LLM didn't need the layer sketch to figure out where files go — it could read the project structure and follow the patterns already established. The upfront cost of Clean's architecture was paid in Phase 0; Phase 1 collected the return.


Classic: the LLM learned from its own code

The most interesting finding

In Phase 0, the LLM created one SubscriptionService with 457 lines and 9 responsibilities. I expected Phase 1 to make it worse — to stuff add-on logic, seat management, and credit notes into the same class.

Instead, the LLM created three new services:

```kotlin
// classic/service/AddOnService.kt — 257 lines
@Service
class AddOnService(
    private val subscriptionRepository: SubscriptionRepository,
    private val addOnRepository: AddOnRepository,
    private val subscriptionAddOnRepository: SubscriptionAddOnRepository,
    private val invoiceRepository: InvoiceRepository,
    private val paymentGateway: PaymentGateway,
    private val clock: Clock,
)

// classic/service/SeatService.kt — 220 lines
@Service
class SeatService(
    private val subscriptionRepository: SubscriptionRepository,
    private val subscriptionAddOnRepository: SubscriptionAddOnRepository,
    private val invoiceRepository: InvoiceRepository,
    private val paymentGateway: PaymentGateway,
    private val clock: Clock,
)

// classic/service/CreditNoteService.kt — 128 lines
@Service
class CreditNoteService(...)
```

The LLM saw the existing 457-line service and chose not to add to it. It created separate services for separate domains. This is genuine adaptation — the Phase 0 code taught the LLM that this project uses one service per domain concept.

This matters. Nobody added a rule saying "create separate services." Nobody modified the .claude/ context. The LLM read the existing codebase and adapted its behavior. The implication is significant: good existing code can guide an LLM even without explicit architectural constraints. Classic's Phase 0 output, despite its flaws, was good enough to teach the LLM better decomposition in Phase 1.

What didn't improve

SubscriptionService still grew. Despite decomposition into new services, SubscriptionService went from 457 to 635 lines. The processRenewal() method absorbed per-seat pricing calculations, add-on charge aggregation, and account credit application:

```kotlin
// classic/service/SubscriptionService.kt — processRenewal() grew to include:
// - Per-seat plan charge calculation
// - Add-on charge iteration and line item generation
// - Account credit balance application
// - 9 original responsibilities + 3 new ones
```

The new services handle creation and modification of add-ons, seats, and credit notes. But renewal — the operation that touches everything — still lives in SubscriptionService. The LLM couldn't extract it because the method depends on repositories from all three new domains.

Proration logic is duplicated. The same day-ratio calculation appears in SeatService.updateSeatCount(), AddOnService.attachAddOn(), and SubscriptionService.processRenewal():

```kotlin
// This pattern appears in 3 different services:
val daysRemaining = ChronoUnit.DAYS.between(now, currentPeriodEnd)
val totalDays = ChronoUnit.DAYS.between(currentPeriodStart, currentPeriodEnd)
val prorationFactor = BigDecimal(daysRemaining).divide(BigDecimal(totalDays), 10, RoundingMode.HALF_UP)
val proratedAmount = amount.multiply(prorationFactor).setScale(2, RoundingMode.HALF_UP)
```

Each service calculates proration independently. The LLM generated each method in isolation and didn't refactor common logic afterward. (Note: as we'll see, Clean has the same problem — the LLM duplicates cross-cutting calculations regardless of architecture.)
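The duplicated day-ratio calculation is exactly the kind of logic a shared helper would capture. A minimal sketch of the extraction the LLM never performed — `ProrationCalculator` is a hypothetical name, not code from the repo:

```kotlin
import java.math.BigDecimal
import java.math.RoundingMode
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical shared helper — the single home the three copies could collapse into.
object ProrationCalculator {
    fun prorate(
        amount: BigDecimal,
        now: LocalDate,
        periodStart: LocalDate,
        periodEnd: LocalDate,
    ): BigDecimal {
        val daysRemaining = ChronoUnit.DAYS.between(now, periodEnd)
        val totalDays = ChronoUnit.DAYS.between(periodStart, periodEnd)
        // Same rounding choices as the generated code: 10-digit factor, 2-digit result.
        val factor = BigDecimal(daysRemaining)
            .divide(BigDecimal(totalDays), 10, RoundingMode.HALF_UP)
        return amount.multiply(factor).setScale(2, RoundingMode.HALF_UP)
    }
}
```

Centralizing it would also pin down the rounding policy in one place — today, a future edit to one copy could silently diverge from the other two.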

Exception explosion. Phase 0 had 8 exception types. Phase 1 added 13 more:

```kotlin
class AddOnNotFoundException(id: Long) : ServiceException(...)
class DuplicateAddOnException(addonId: Long) : ServiceException(...)
class PerSeatAddOnOnNonPerSeatPlanException() : ServiceException(...)
class SeatCountOutOfRangeException(message: String) : ServiceException(...)
class SameSeatCountException(current: Int) : ServiceException(...)
// ... 8 more
```

Each exception is a flat class extending ServiceException. There's no hierarchy grouping add-on errors separate from seat errors. When a controller catches ServiceException, it handles all 21 types through the same GlobalExceptionHandler — the HTTP status code is embedded in each exception, which works, but provides no compile-time guarantee that all cases are handled.
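A shallow intermediate layer would restore grouping without abandoning the exception style. A hedged sketch — the sealed per-domain parents and the `statusFor` mapper are hypothetical, not part of the generated code:

```kotlin
// Hypothetical grouping layer the generated code lacks — a sketch, not repo code.
open class ServiceException(message: String) : RuntimeException(message)

sealed class AddOnException(message: String) : ServiceException(message)
class AddOnNotFoundException(id: Long) : AddOnException("Add-on $id not found")
class DuplicateAddOnException(addOnId: Long) : AddOnException("Add-on $addOnId already attached")

sealed class SeatException(message: String) : ServiceException(message)
class SeatCountOutOfRangeException(message: String) : SeatException(message)

// A handler can now branch per domain instead of per concrete class:
fun statusFor(e: ServiceException): Int = when (e) {
    is AddOnNotFoundException -> 404
    is AddOnException -> 409 // every remaining add-on error
    is SeatException -> 422  // every seat error
    else -> 500              // still no compile-time exhaustiveness at the root
}
```

Note the `else` branch: because `ServiceException` itself stays open, the compiler still can't prove all 21 types are handled — grouping narrows the blast radius but doesn't close the gap sealed types close.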


Clean: use cases added like Lego blocks

The structure held

Phase 1 added exactly 4 new use cases:

```kotlin
// Each use case: single responsibility, explicit error type, Either return
class AttachAddOnUseCaseImpl(...) : AttachAddOnUseCase {
    override fun execute(command: AttachAddOnCommand): Either<AttachAddOnError, SubscriptionAddOn>
}

class DetachAddOnUseCaseImpl(...) : DetachAddOnUseCase {
    override fun execute(subscriptionId: Long, addOnId: Long): Either<DetachAddOnError, Unit>
}

class UpdateSeatCountUseCaseImpl(...) : UpdateSeatCountUseCase {
    override fun execute(command: UpdateSeatCountCommand): Either<UpdateSeatCountError, Subscription>
}

class IssueCreditNoteUseCaseImpl(...) : IssueCreditNoteUseCase {
    override fun execute(command: IssueCreditNoteCommand): Either<IssueCreditNoteError, CreditNote>
}
```

Each use case has its own sealed error type. AttachAddOnError has 12 variants. UpdateSeatCountError has 11 variants. The compiler enforces exhaustive handling at every call site.

```kotlin
sealed interface AttachAddOnError : ApplicationError {
    data class InvalidInput(val field: String, val reason: String) : AttachAddOnError
    data object SubscriptionNotFound : AttachAddOnError
    data object NotActive : AttachAddOnError
    data object AddOnNotFound : AttachAddOnError
    data object CurrencyMismatch : AttachAddOnError
    data object TierIncompatible : AttachAddOnError
    data object PerSeatOnNonPerSeatPlan : AttachAddOnError
    data object DuplicateAddOn : AttachAddOnError
    data object AddOnLimitReached : AttachAddOnError
    data class PaymentFailed(val reason: String) : AttachAddOnError
    data class Domain(val error: DomainError) : AttachAddOnError
    data class Internal(val cause: String) : AttachAddOnError
}
```
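At a call site, that enforcement looks like the sketch below — reduced to three of the twelve variants for brevity; delete any branch and the `when` expression stops compiling:

```kotlin
// Reduced sketch of the sealed-error pattern: three variants instead of twelve.
sealed interface AttachAddOnError {
    data object SubscriptionNotFound : AttachAddOnError
    data object DuplicateAddOn : AttachAddOnError
    data class PaymentFailed(val reason: String) : AttachAddOnError
}

// No `else` branch needed — and none allowed to go stale: adding a new
// variant to the sealed interface makes this `when` a compile error.
fun toHttpStatus(error: AttachAddOnError): Int = when (error) {
    AttachAddOnError.SubscriptionNotFound -> 404
    AttachAddOnError.DuplicateAddOn -> 409
    is AttachAddOnError.PaymentFailed -> 402
}
```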

New domain models (AddOn, SubscriptionAddOn, CreditNote) follow the same value object pattern. New ports (AddOnQueryPort, SubscriptionAddOnCommandQueryPort, CreditNoteCommandQueryPort) maintain the dependency direction. New infrastructure adapters implement the ports. The existing code didn't need to change its structure to accommodate new features — new features slotted into the existing structure.

Where Clean showed cracks

The backward-compatibility hack. ProcessRenewalUseCase needed to incorporate add-on charges into renewal invoices. The LLM's solution:

```kotlin
class ProcessRenewalUseCaseImpl(
    // ... existing dependencies ...
    private val subscriptionAddOnCommandQueryPort: SubscriptionAddOnCommandQueryPort? = null,
    private val addOnQueryPort: AddOnQueryPort? = null,
) : ProcessRenewalUseCase {
    override fun execute(subscriptionId: Long): Either<ProcessRenewalError, Subscription> = either {
        // ...
        val addOnCharges = subscriptionAddOnCommandQueryPort?.let { port ->
            // calculate add-on charges only if port is available
        } ?: emptyList()
        // ...
    }
}
```

Optional dependencies with ?.let null checks. This is a code smell in Clean Architecture — a use case should declare all its dependencies explicitly. The LLM chose backward compatibility over correctness, probably because the existing tests for ProcessRenewalUseCase don't provide add-on ports. Breaking those tests would mean rewriting them, which the LLM avoided.
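The conventional fix is to make the port a required constructor parameter and hand the pre-existing tests a stub. A simplified sketch — Arrow's `Either` and the real signatures are elided, and every name here is illustrative:

```kotlin
// Simplified port and use case — real return types and Either plumbing elided.
data class AddOnCharge(val description: String, val amountCents: Long)

interface SubscriptionAddOnCommandQueryPort {
    fun findChargesFor(subscriptionId: Long): List<AddOnCharge>
}

// Required, non-null dependency: the use case declares everything it needs.
class ProcessRenewalUseCaseImpl(
    private val addOnPort: SubscriptionAddOnCommandQueryPort,
) {
    fun totalAddOnCents(subscriptionId: Long): Long =
        addOnPort.findChargesFor(subscriptionId).sumOf { it.amountCents }
}

// Pre-Phase-1 tests pass a stub instead of omitting the port entirely.
object EmptyAddOnPort : SubscriptionAddOnCommandQueryPort {
    override fun findChargesFor(subscriptionId: Long): List<AddOnCharge> = emptyList()
}
```

The stub costs one object declaration per test file, which is exactly the rewrite the LLM avoided by reaching for nullable defaults.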

Proration duplication — the same problem as classic. Phase 0's PlanChangeUseCase delegates to a ProrationDomainService in the domain layer. But the Phase 1 use cases (AttachAddOnUseCase, DetachAddOnUseCase, UpdateSeatCountUseCase) all calculate proration inline — the same totalDays / daysRemaining pattern, duplicated across four use cases. The LLM didn't reuse the existing domain service. Use case isolation, which prevents god services, also prevents the LLM from seeing shared logic across use cases. Architecture constrains bloat but doesn't prevent duplication.

252-line use case. UpdateSeatCountUseCaseImpl handles both seat increases (charge proration) and seat decreases (credit issuance), including per-seat add-on proration for both directions. At 252 lines, it's the largest use case — still smaller than classic's SubscriptionService, but pushing the boundary of "one use case, one responsibility."


The numbers: Phase 0 → Phase 1

Code volume

| Metric | Classic P0 | Classic P1 | Clean P0 | Clean P1 |
|---|---|---|---|---|
| Source files (total) | 33 | 69 | 80+ | 173 |
| Production lines | 787 | 2,973 | 1,225 | 6,274 |
| Test lines | — | 5,775 | — | 4,825 |
| Largest file | 457 lines | 635 lines | 120 lines | 252 lines |

Phase 0 line counts covered production source only; Phase 1 breaks out production and test lines separately. jOOQ-generated code (in build/) is excluded from all counts.

Both codebases roughly doubled in file count. The largest file metric is not a full maintainability measure, but it's a useful proxy for responsibility concentration — how much logic the LLM packs into a single place. Classic's largest file grew 39% (457 → 635). Clean's grew 110% but from a much lower base (120 → 252, still below the size where single-file comprehension starts to get harder).

Test results

| Metric | Classic P0 | Classic P1 | Clean P0 | Clean P1 |
|---|---|---|---|---|
| Test count | 155 | 291 | 167 | 288 |
| Cold build + test | 4s | 5s | 9s | 10s |
| Warm build + test | 3s | 4s | 3s | 8s |

Test counts are almost identical (291 vs 288). The LLM generated comprehensive tests for both architectures. In this setup, Clean's warm build time rose from 3s to 8s, likely reflecting the growing overhead of a larger multi-module Gradle build.

Coverage

| Metric | Classic P0 | Classic P1 | Clean P0 | Clean P1 |
|---|---|---|---|---|
| Line coverage | 89.7% | 94.6% | 87.7% | 90.2% |
| Branch coverage | — | 83.4% | — | 64.7% |

Classic's line coverage improved from 89.7% to 94.6%. The new services (AddOnService, SeatService, CreditNoteService) are well-tested — possibly because the LLM generated them fresh with tests, rather than modifying existing code.

Clean's branch coverage at 64.7% deserves honest scrutiny. One-third of all branch paths are untested. The sealed interface error types define variants like Internal(cause: String) and Domain(error: DomainError) that no test exercises. In Part 0, I argued that clean's uncovered paths are "named and enumerable" — you can grep for them. That's true. But Phase 1 reveals the practical cost: the LLM defines exhaustive error types because the type system demands it, then doesn't write exhaustive tests because nothing demands that. The architecture pays the definition cost but doesn't automatically collect the testing benefit. The types exist as compile-time documentation, not runtime guarantees — unless someone (human or LLM) writes the tests to match.
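Closing that gap needs no new machinery — only tests that walk the variants the types already name. A sketch with a hypothetical error type (not the repo's actual variant list):

```kotlin
// Hypothetical sealed error and mapper — illustrative, not from the repo.
sealed interface CreditNoteError {
    data object InvoiceNotFound : CreditNoteError
    data object AlreadyRefunded : CreditNoteError
    data class Internal(val cause: String) : CreditNoteError
}

fun creditNoteStatus(e: CreditNoteError): Int = when (e) {
    CreditNoteError.InvoiceNotFound -> 404
    CreditNoteError.AlreadyRefunded -> 409
    is CreditNoteError.Internal -> 500
}

// One test input per variant: the hand-written list mirrors the checklist the
// compiler enforces in `when` — a test just has to walk it, turning
// compile-time documentation into exercised branches.
val allVariants: List<CreditNoteError> = listOf(
    CreditNoteError.InvoiceNotFound,
    CreditNoteError.AlreadyRefunded,
    CreditNoteError.Internal("db timeout"),
)
```

Nothing here is automatic, which is the article's point: the list has to be written by someone, and in Phase 1 nobody (human or LLM) wrote it.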


What Phase 1 reveals that Phase 0 couldn't

Patterns across both architectures

1. Existing code is the strongest context. In classic, the LLM read a 457-line god service and decided not to make it worse — it created three new services unprompted. In clean, it followed the existing use case pattern without needing the layer sketch. The .claude/ rules and agent configurations matter most in Phase 0. By Phase 1, the codebase itself teaches the LLM what to do. This also explains the generation time convergence.

2. Architecture constrains structure but not duplication. Both implementations duplicated proration logic — classic across three services, clean across four use cases. Phase 0's PlanChangeUseCase delegates to a ProrationDomainService, but Phase 1's new use cases all calculate proration inline. The LLM didn't reuse the existing domain service. It generates each unit independently and doesn't refactor across boundaries — whether those boundaries are services or use cases.

3. Type safety without test safety is incomplete. Clean's branch coverage dropped to 64.7%. The type system demanded exhaustive error definitions; the LLM complied. But nothing demanded exhaustive tests, and the LLM didn't write them. Meanwhile, classic's processRenewal() grew from 9 to 12 responsibilities (457 → 635 lines) because there's no mechanism to prevent a method from becoming the integration point for the entire domain. Clean constrains bloat structurally; classic relies on the LLM making good local decisions — which it sometimes does (new services) and sometimes doesn't (growing processRenewal). But neither architecture made the LLM write better tests.


What the data suggests — and what it doesn't

What the numbers show

Two phases of data show clear patterns:

Generation time converged. Phase 0: 1.75x gap. Phase 1: 1.01x gap. Existing code as context reduced Clean's coordination overhead and increased Classic's reading time. Whether this trend continues, stabilizes, or reverses at Phase N is unknown — we have two data points, not a curve.

Classic's largest file grew; Clean's stayed bounded. 457 → 635 lines vs 120 → 252 lines. Classic's growth came from processRenewal() absorbing cross-domain logic. Clean's growth came from a complex use case (seat + add-on proration), but each use case remains independent. The mechanism differs: Classic grows by accretion in existing files; Clean grows by adding new files.

Duplication is architecture-independent. Both implementations duplicated proration logic across multiple locations. The LLM doesn't refactor across boundaries — whether those boundaries are services (classic) or use cases (clean). This is arguably the most important finding: the LLM's weakest capability is recognizing shared logic across independently generated code units.

The LLM adapts to existing code patterns. Classic's self-decomposition into new services was unprompted. Clean's consistent use case structure was self-reinforcing. Existing code is a stronger guide than configuration files.

The trade-off table

| Dimension | Classic | Clean |
|---|---|---|
| Greenfield speed | 18m (fast) | 32m (slow) |
| Feature-add speed | 29m | 29m |
| Code readability | Immediately familiar | Requires architecture knowledge |
| Error handling | Flat exceptions (easy to add, easy to miss) | Sealed types (verbose, compiler-enforced) |
| Duplication | Across services | Across use cases (same problem) |
| Max file size | Growing (457 → 635) | Bounded (120 → 252) |
| Branch coverage | 83.4% | 64.7% |
| Upfront investment | Low | High (multi-module, tooling, rules) |

A hypothesis, not a conclusion

Two data points don't make a trend. But they suggest a hypothesis worth testing:

Architecture may function as a scaling constraint on LLM-generated code. Clean's explicit boundaries prevent the structural problems that compound over time (god services, file bloat, implicit dependencies). Classic's flexibility allows the LLM to adapt (new services), but also allows it to accumulate debt (growing integration points, flat error hierarchies).

However, architecture does not solve the LLM's core limitation — cross-boundary refactoring. Both implementations duplicated proration logic. Both missed opportunities to extract shared abstractions. Clean's sealed types went untested at the same rate as Classic's exception paths. The LLM fills each container correctly but doesn't optimize across containers.

The question for a real project isn't "which architecture should I pick?" It's a compound question:

  • How often will features be added? More features favor explicit boundaries.
  • How much do features interact? More interaction creates cross-cutting concerns that stress any architecture.
  • What is the team's tolerance for boilerplate? In this experiment, Clean carried a roughly 2.5x file-count overhead.
  • What is the LLM's role? If the LLM generates initial code and humans maintain it, Classic's familiarity wins. If the LLM generates and extends code continuously, Clean's self-documenting structure wins.

Architecture is not a prompt. It's the shape of the container the LLM fills — and the shape determines what compounds and what stays constant as the codebase grows.


The full source is on GitHub: classic/ | clean/
