
Alex Rogov

Posted on • Originally published at alexrogov.hashnode.dev

I Tested Claude, GPT-4, and Gemini on the Same Refactoring Task

I gave Claude, GPT-4o, and Gemini the exact same refactoring task — extract a 400-line god service into Clean Architecture layers. Same codebase, same prompt, same TypeScript project. The results weren't even close.

This isn't a synthetic benchmark. I took a real NestJS service from a production project, froze the state in a git branch, and ran each model through the same workflow I use every day. Here's what happened.

The Task

The patient: PaymentService — a 400-line NestJS service that handled payment processing, invoice generation, webhook handling, and retry logic. Classic god service.

The goal: refactor into Clean Architecture layers:

  • domain/payment/ — entities, value objects, error types
  • application/payment/ — use cases, port interfaces
  • infrastructure/payment/ — Stripe adapter, repository implementation

The rules I gave each model:

  • Don't change public API signatures
  • Keep all 23 existing tests passing
  • Extract interfaces for every external dependency
  • Follow the naming conventions already in the project

Here's the exact prompt I used:

Refactor src/services/payment.service.ts into Clean Architecture layers.

Target structure:
- src/domain/payment/ (entities, value objects, errors)
- src/application/payment/use-cases/ (one file per use case)
- src/application/payment/ports/ (repository + external service interfaces)
- src/infrastructure/payment/ (Stripe adapter, Prisma repository)

Rules:
- Do NOT change any public API signatures on PaymentController
- All 23 tests in payment.service.spec.ts must still pass
- Every external dependency gets an interface in ports/
- Follow existing naming: kebab-case files, PascalCase exports
- Run typecheck after each step

The Models

I tested three models in their coding-optimized setups:

  • Claude Opus 4 — via Claude Code CLI with CLAUDE.md context
  • GPT-4o — via Cursor IDE with the same project open
  • Gemini 2.5 Pro — via Gemini CLI with the same project context

Each got the same prompt, the same codebase state, and the same CLAUDE.md file describing the project structure and conventions.

Round 1: Understanding the Codebase

Before any model touched code, I asked each one to map the dependency graph of PaymentService.

Claude: Produced a clean dependency map in 30 seconds. Identified 7 direct imports, 4 circular risk points, and flagged that PaymentService.handleWebhook() was calling InvoiceService directly instead of through an event. Suggested starting the extraction there.

GPT-4o: Also mapped dependencies correctly, but missed the circular risk through InvoiceService. Listed all imports but didn't prioritize the extraction order. I had to ask a follow-up to get a sequenced plan.

Gemini: Gave the most verbose analysis — two pages of dependency listings. Technically accurate but unfocused. Included files that weren't relevant to the refactoring. Took an extra prompt to narrow down to actionable steps.

Score: Claude 9/10, GPT-4o 7/10, Gemini 6/10

The difference here was focus. Claude identified what mattered for the actual refactoring, not just what existed.

Round 2: Domain Layer Extraction

The first real coding step — extract entities, value objects, and domain errors from PaymentService.

Claude: Created payment.entity.ts, payment-amount.value-object.ts, payment.errors.ts. The entity used proper encapsulation — private constructor with a static factory method. Value object was immutable with validation in the constructor. Named everything following the project's existing conventions without being told twice.

// Claude's payment-amount.value-object.ts
export class PaymentAmount {
  private constructor(
    readonly value: number,
    readonly currency: string,
  ) {}

  static create(value: number, currency: string): PaymentAmount {
    if (value <= 0) throw new InvalidPaymentAmountError(value);
    if (!['USD', 'EUR', 'GBP'].includes(currency)) {
      throw new UnsupportedCurrencyError(currency);
    }
    return new PaymentAmount(value, currency);
  }

  equals(other: PaymentAmount): boolean {
    return this.value === other.value && this.currency === other.currency;
  }
}
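The value object above throws two domain errors the snippet doesn't define. The article never shows Claude's payment.errors.ts, so here's a minimal reconstruction of what those classes would look like (names inferred from the snippet; the exact fields are my assumption):

```typescript
// Hypothetical reconstruction of payment.errors.ts. The article doesn't
// show this file; class names come from the value object snippet above.
export class InvalidPaymentAmountError extends Error {
  constructor(readonly amount: number) {
    super(`Invalid payment amount: ${amount}`);
    this.name = 'InvalidPaymentAmountError';
  }
}

export class UnsupportedCurrencyError extends Error {
  constructor(readonly currency: string) {
    super(`Unsupported currency: ${currency}`);
    this.name = 'UnsupportedCurrencyError';
  }
}
```

Keeping these in the domain layer means callers can catch by type instead of string-matching messages, which is exactly what the infrastructure adapters rely on later.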

GPT-4o: Similar structure, but used a plain class with public constructor and validation in a separate validate() method. Workable, but less idiomatic for DDD. Also named the file paymentAmount.ts — camelCase instead of the project's kebab-case convention.
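For contrast, GPT-4o's shape looked roughly like this (a reconstruction from my description above, not its actual output; what validate() returns is my guess):

```typescript
// Reconstruction of GPT-4o's approach: public constructor, validation
// split into a separate method instead of a static factory.
export class PaymentAmount {
  constructor(
    public value: number,
    public currency: string,
  ) {}

  validate(): string[] {
    const errors: string[] = [];
    if (this.value <= 0) errors.push(`Invalid amount: ${this.value}`);
    if (!['USD', 'EUR', 'GBP'].includes(this.currency)) {
      errors.push(`Unsupported currency: ${this.currency}`);
    }
    return errors;
  }
}
```

The problem: the object can exist in an invalid state until someone remembers to call validate(). The private-constructor-plus-factory pattern makes invalid instances unrepresentable, which is why it's the more idiomatic DDD choice.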

Gemini: Created the entity correctly but went overboard — added a PaymentStatus enum, a PaymentEvent type, and a PaymentHistory value object that weren't part of the original service. Scope creep from the model itself. I had to tell it to remove the extras.

Score: Claude 9/10, GPT-4o 7/10, Gemini 5/10

Round 3: Use Cases and Ports

This is where Clean Architecture either works or becomes ceremony. Each model needed to extract the 4 core operations into separate use cases with proper port interfaces.

Claude: Created process-payment.ts, refund-payment.ts, handle-webhook.ts, and generate-invoice.ts. Each use case took dependencies through constructor injection via interfaces. The port interfaces were minimal — only the methods each use case actually needed, not a dump of every method from the original service.

// Claude's payment-gateway.port.ts
export interface PaymentGateway {
  charge(amount: PaymentAmount, customerId: string): Promise<PaymentResult>;
  refund(paymentId: string, amount?: PaymentAmount): Promise<RefundResult>;
  constructWebhookEvent(payload: string, signature: string): WebhookEvent;
}
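To show how a port like that gets consumed, here's a sketch of a use case with constructor injection. This is my illustration, not Claude's actual process-payment.ts — the simplified types and the repository interface are stand-ins for the real domain objects:

```typescript
// Illustrative sketch of a use case depending on ports. Types are
// minimal stand-ins, not the article's actual domain model.
interface PaymentResult {
  id: string;
  status: 'succeeded' | 'failed';
}

interface PaymentGateway {
  charge(amount: number, customerId: string): Promise<PaymentResult>;
}

interface PaymentRepository {
  save(result: PaymentResult): Promise<void>;
}

export class ProcessPaymentUseCase {
  // Depend on ports, not concrete adapters: the DI container (or a test)
  // decides which implementation gets injected.
  constructor(
    private readonly gateway: PaymentGateway,
    private readonly repository: PaymentRepository,
  ) {}

  async execute(amount: number, customerId: string): Promise<PaymentResult> {
    const result = await this.gateway.charge(amount, customerId);
    await this.repository.save(result);
    return result;
  }
}
```

The payoff is testability: you can exercise the use case with two in-memory stubs and never touch Stripe or Prisma.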

GPT-4o: Also extracted four use cases, but the port interfaces were too broad — PaymentGateway included methods for subscription management that PaymentService never used. It pulled them from Stripe's actual API types instead of scoping to what the code needed. The use cases worked but had unnecessary coupling.

Gemini: Extracted three use cases instead of four — combined webhook handling into the process-payment use case. When I pointed this out, it created a fourth but the separation felt forced. The ports were correct but it duplicated some type definitions that already existed in the project.

Score: Claude 9/10, GPT-4o 6/10, Gemini 6/10

Round 4: Infrastructure Adapters

The final step — implement the port interfaces with Stripe SDK calls and Prisma queries, update the controller's dependency injection, and make sure everything compiles.

Claude: Created stripe-payment.gateway.ts implementing PaymentGateway, and prisma-payment.repository.ts implementing PaymentRepository. Updated the NestJS module to wire everything through DI. All imports correct on the first pass — npm run typecheck passed immediately. The 23 tests needed minor updates (import paths changed), and Claude fixed those proactively.

GPT-4o: Infrastructure implementations were solid, but the NestJS module wiring had two errors — a missing provider and a wrong injection token. Took one correction prompt to fix. Tests needed the same import path updates but GPT-4o didn't fix them proactively — I had to ask.

Gemini: The Stripe adapter was actually the best of the three — it included proper error mapping from Stripe error codes to domain errors, which the others handled more generically. But it broke the module wiring worse than GPT-4o — three missing providers and a circular dependency it introduced by importing a use case inside the repository. Two correction rounds to get it compiling.
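That error-mapping idea is worth showing, because it's the detail that keeps Stripe types from leaking into the application layer. A hedged sketch: the error codes below are real Stripe codes, but the mapping function and the domain error names are my invention, not Gemini's output:

```typescript
// Sketch of Stripe-code-to-domain-error mapping. The adapter and error
// class names are illustrative, not from the article's codebase.
export class PaymentDeclinedError extends Error {}
export class CardExpiredError extends Error {}
export class PaymentGatewayError extends Error {}

const ERROR_MAP: Record<string, new (message: string) => Error> = {
  card_declined: PaymentDeclinedError,
  expired_card: CardExpiredError,
  incorrect_cvc: PaymentDeclinedError,
};

// Translate a Stripe error code into a domain error so callers in the
// application layer never see Stripe-specific types.
export function toDomainError(code: string, message: string): Error {
  const ErrorClass = ERROR_MAP[code] ?? PaymentGatewayError;
  return new ErrorClass(message);
}
```

A generic catch-and-rethrow (what the other two models did) compiles fine, but it forces callers to inspect raw gateway errors, which defeats the point of the port boundary.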

Score: Claude 8/10, GPT-4o 7/10, Gemini 6/10

The Final Scorecard

| Criteria | Claude Opus 4 | GPT-4o | Gemini 2.5 Pro |
| --- | --- | --- | --- |
| Codebase understanding | 9 | 7 | 6 |
| Domain extraction | 9 | 7 | 5 |
| Use cases & ports | 9 | 6 | 6 |
| Infrastructure wiring | 8 | 7 | 6 |
| Convention adherence | 9 | 6 | 7 |
| Scope discipline | 9 | 8 | 5 |
| Corrections needed | 1 | 3 | 5 |
| Total | 53/60 | 41/60 | 35/60 |

(Corrections needed is a raw count, lower is better, so it isn't included in the total.)

What Actually Mattered

The gap wasn't about raw coding ability. All three models can write TypeScript. The differences were:

Convention adherence. Claude followed kebab-case file naming, existing import patterns, and project structure without reminders. GPT-4o drifted to its defaults. Gemini was inconsistent — sometimes following conventions, sometimes not.

Scope discipline. Claude did exactly what was asked. GPT-4o added slightly too-broad interfaces. Gemini added entire features nobody asked for. In a real refactoring, scope creep from your AI is worse than scope creep from a junior developer — it happens faster and you might not catch it in review.

Proactive problem-solving. Claude flagged the circular dependency risk before it became a problem and fixed test imports without being asked. The others waited for things to break.

Context utilization. Claude read and applied the CLAUDE.md file throughout the session. The others seemed to reference it initially but drifted as the conversation progressed.
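For readers who haven't used one: a CLAUDE.md is just a markdown file the CLI reads at session start. A minimal sketch of the kind of conventions file I mean (illustrative, not my actual file):

```markdown
# Project conventions

## Structure
- src/domain/ entities, value objects, errors (no framework imports)
- src/application/ use cases and port interfaces
- src/infrastructure/ adapters implementing the ports

## Naming
- Files: kebab-case (payment-amount.value-object.ts)
- Exports: PascalCase

## Workflow
- Run `npm run typecheck` after each change
- Keep all tests in *.spec.ts passing
```

The point isn't the specific contents; it's that every convention the models were scored on above is written down somewhere they can read it.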

The Honest Caveats

This is one task, one codebase, one developer's workflow. Your results may differ based on:

  • How you prompt. I use CLAUDE.md extensively — that's Claude's home turf. GPT-4o might perform differently with Cursor's inline editing flow. Gemini might shine in a different IDE setup.
  • The type of task. Clean Architecture refactoring is opinionated. For greenfield code generation or data processing scripts, the ranking might shift.
  • Model versions. These models update frequently. What I tested today might not match next month's results.
  • Your project's patterns. If your codebase uses different conventions, the model that aligns best with your style wins.

My Takeaway

I use Claude Code as my primary tool and this test confirmed why — for TypeScript projects with Clean Architecture patterns, it consistently needs fewer correction rounds. But I'd reach for GPT-4o through Cursor for quick inline edits where I don't need full architectural awareness.

The biggest insight isn't which model "won." It's that the quality of your project setup matters more than the model you choose. A well-structured CLAUDE.md, consistent conventions, and clear architecture boundaries made all three models perform significantly better than they would in a chaotic codebase.

The AI is the engine. Your codebase is the road. Even the best engine can't perform on a road full of potholes.


Originally published on my Hashnode blog.
