yan yan

I Tested 5 AI Coding Tools on Real Work. Here Are the Results.

I gave Copilot, Cursor, Claude Code, Windsurf, and Aider the same 3 real tasks. The results were not even close.

AI coding tools are everywhere. GitHub Copilot. Cursor. Claude Code. Windsurf. Aider. Every week there is a new one, and every review says "this tool changed my life."

I don't trust those reviews. Most test on toy problems — a todo app, sorting an array, fetching from an API. That is not how real software works.

So I designed a real-world benchmark. Three tasks pulled from my actual work. Not contrived. Not simplified. The same mess you deal with every day.

Here are the results.


The Test Setup

The tasks:

  1. Legacy refactor: A 400-line Python script with no tests, no types, and a known bug. Add type hints, write tests, and fix the bug without breaking anything else.
  2. Greenfield feature: Build a real-time data pipeline with WebSocket ingestion, transformation, and PostgreSQL writes. Must handle reconnection, backpressure, and schema evolution.
  3. Debug mystery: A Node.js service randomly returns 502 errors under load. No error messages. Been open for 2 weeks. Find it.

The contestants:

| Tool           | Type              | Pricing              |
| -------------- | ----------------- | -------------------- |
| GitHub Copilot | VS Code extension | $10/mo               |
| Cursor         | AI-native IDE     | $20/mo               |
| Claude Code    | CLI agent         | API pay-per-use      |
| Windsurf       | AI IDE            | $15/mo               |
| Aider          | Open-source CLI   | Free (API cost only) |

Scoring (1-10):

  • Accuracy: Did it produce correct, working code?
  • Context awareness: Did it understand the existing codebase?
  • Autonomy: How much did I have to hand-hold?
  • Speed: Time from prompt to working solution.

Task 1: Legacy Refactor

The script processes CSV files from an IoT sensor fleet. 400 lines. Zero comments. Variable names like x, tmp, and stuff. Been in production for 2 years. Nobody wants to touch it.

The known bug: on files larger than 10MB, the script silently drops the last batch of rows.

GitHub Copilot — 6/10

Copilot added type hints quickly and correctly, and it caught several obvious bugs along the way. But it struggled with the big-picture refactor: it never engaged with why the code was structured the way it was, so its changes were accurate but superficial.

When I asked it to "refactor this into smaller functions," it gave a reasonable split but broke the data pipeline ordering. I had to manually fix the function call chain.

Best for: Tab completion and boilerplate. Not architecture.

Cursor — 7/10

Cursor did better at understanding full file context. Its inline suggestions for type hints were solid. When I selected a 100-line block and asked "refactor this," it proposed a clean extraction with proper error handling.

It missed the silent data loss bug on its own, but when I pointed at the specific region, it correctly identified the off-by-one error in the batch processing loop.
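To make that concrete, here is a hypothetical sketch of the kind of off-by-one batch bug in question. This is illustrative code, not the actual production script:

```python
def transform(batch):
    # Stand-in for the real per-batch transformation.
    return [row.strip().split(",") for row in batch]

def process_rows_buggy(rows, batch_size=1000):
    """Silently drops the trailing partial batch on large inputs."""
    results = []
    # Bug: integer division stops at the last *full* batch, so any rows after
    # the final batch boundary are never processed, and nothing errors out.
    for i in range(len(rows) // batch_size):
        results.extend(transform(rows[i * batch_size:(i + 1) * batch_size]))
    return results

def process_rows_fixed(rows, batch_size=1000):
    """Steps through the data directly, so the trailing batch is included."""
    results = []
    for start in range(0, len(rows), batch_size):
        results.extend(transform(rows[start:start + batch_size]))
    return results

rows = [f"sensor_{i},{i}" for i in range(2500)]
assert len(process_rows_buggy(rows)) == 2000   # 500 rows silently lost
assert len(process_rows_fixed(rows)) == 2500   # all rows processed
```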

Best for: Refactoring with context awareness.

Claude Code — 5/10

Claude Code was too aggressive. When I said "refactor this file," it rewrote the entire thing from scratch — new structure, new patterns, everything. The result was cleaner code, but it changed behavior in subtle ways that would break production.

To be fair, when I said "no, just add types and fix bugs," it did exactly that. But I had to catch its first attempt. That is not autonomy.

Best for: Greenfield projects where you want a fresh architecture.

Windsurf — 6/10

Solid basics. Good type inference. Decent refactor suggestions. But it asked for confirmation on every single change. After 47 confirmations, I wanted to throw my laptop out the window.

In Cascade mode, it got more autonomous but also more error-prone. It changed a dict to a defaultdict without asking, which subtly changed error behavior.
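That swap is easy to miss in review, because nothing fails loudly. A toy illustration of the difference (not the actual code):

```python
from collections import defaultdict

readings = {"sensor_a": [1, 2, 3]}

# Plain dict: an unknown key raises KeyError immediately, which is easy to catch.
try:
    readings["sensor_b"].append(4)
except KeyError:
    print("unknown sensor rejected")

# defaultdict: the same lookup silently creates an empty list instead of failing,
# so typos and unexpected keys flow through the pipeline without any error.
readings_dd = defaultdict(list, readings)
readings_dd["sensor_b"].append(4)   # no exception raised
print(dict(readings_dd))            # {'sensor_a': [1, 2, 3], 'sensor_b': [4]}
```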

Best for: People who want fine-grained control over every change.

Aider — 4/10

Aider struggled with the 400-line file. It kept losing context — making a change, then suggesting another that contradicted the first one. The refactor it proposed was correct in isolation but broke imports across the codebase.

I had to explicitly say "keep all imports unchanged" for it to produce safe output.

Best for: Small, well-defined tasks with clear boundaries.

Winner: Cursor (7/10)

Best balance of autonomy and accuracy for refactoring legacy code.


Task 2: Greenfield Feature

Build a service that:

  • Connects to a WebSocket stream (crypto exchange ticker)
  • Transforms raw events into normalized records
  • Batches writes to PostgreSQL (1000 records or 5 seconds)
  • Handles reconnection with exponential backoff
  • Implements backpressure — stop reading if the DB write queue exceeds 10,000 records
  • Supports schema evolution — new fields should not crash the pipeline

I gave each tool the exact same specification paragraph. No starter code.

GitHub Copilot — 4/10

Copilot is a tab-completion engine, not an architect. It wrote reasonable code line by line but had no sense of overall design. The WebSocket client was fine. The PostgreSQL writer was fine. But the connection between them — the part that actually matters — was a mess. No backpressure. No graceful shutdown. Thread-safety issues everywhere.

I had to design the architecture myself and use Copilot as a faster typing tool.

Verdict: Good pair programmer, bad system architect.

Cursor — 8/10

Cursor impressed me here. It asked clarifying questions before writing code:

"Should the backpressure block the WebSocket reader or drop messages?"
"Do you want schema evolution to be strict (reject unknown fields) or permissive (store them in a JSONB column)?"

Then it generated a complete, working implementation. Asyncio-based. Proper connection management. Real backpressure with asyncio.Queue(maxsize=10000). Schema evolution via a JSONB overflow column. All in ~200 lines.

I ran it against a test WebSocket server. It worked on the first try.
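I won't paste Cursor's output verbatim, but the shape of the solution looks roughly like the sketch below. It assumes the websockets and asyncpg libraries, a hypothetical ticks table, and a simplified normalize() step:

```python
import asyncio
import json

import asyncpg       # PostgreSQL driver (assumed)
import websockets    # WebSocket client library (assumed)

QUEUE: asyncio.Queue = asyncio.Queue(maxsize=10_000)  # backpressure: put() blocks when full

def normalize(event: dict) -> dict:
    """Hypothetical: map a raw exchange event onto known columns plus an overflow dict."""
    known = {"symbol", "price", "ts"}
    record = {k: event.get(k) for k in known}
    record["extra"] = {k: v for k, v in event.items() if k not in known}
    return record

async def reader(url: str) -> None:
    """Read ticker events and enqueue them, reconnecting with exponential backoff."""
    backoff = 1
    while True:
        try:
            async with websockets.connect(url) as ws:
                backoff = 1
                async for raw in ws:
                    await QUEUE.put(normalize(json.loads(raw)))  # blocks if the queue is full
        except (websockets.ConnectionClosed, OSError):
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 60)

async def writer(pool: asyncpg.Pool) -> None:
    """Flush up to 1000 records per batch, or whatever arrived within 5 seconds."""
    while True:
        batch = [await QUEUE.get()]
        deadline = asyncio.get_running_loop().time() + 5
        while len(batch) < 1000:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(QUEUE.get(), remaining))
            except asyncio.TimeoutError:
                break
        async with pool.acquire() as conn:
            # Unknown fields land in a JSONB overflow column, so new fields don't crash anything.
            await conn.executemany(
                "INSERT INTO ticks (symbol, price, ts, extra) VALUES ($1, $2, $3, $4)",
                [(r["symbol"], r["price"], r["ts"], json.dumps(r["extra"])) for r in batch],
            )
```

The key detail is the bounded queue: when the database falls behind, QUEUE.put() blocks, which pauses the WebSocket reader, so backpressure comes for free.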

Verdict: Closest thing to an "AI software engineer" I have seen.

Claude Code — 6/10

Claude Code generated beautiful code. Clean architecture. Type annotations everywhere. Comprehensive error handling. It even added a health check endpoint and structured logging that I didn't ask for.

The problem: it used asyncio.gather() without proper error propagation. When the WebSocket connection dropped, the entire process silently hung instead of crashing. I caught it during testing, but this is the kind of bug that makes it to production if you trust the output without reading it.
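For anyone who hasn't hit this class of bug: if one task exits quietly (say, a reader whose disconnect is swallowed), gather() keeps waiting on the others and the process just sits there. A minimal reproduction of the hang and one way out, assuming nothing about Claude's exact code:

```python
import asyncio

async def reader(queue: asyncio.Queue) -> None:
    # Simulates a WebSocket reader whose connection drops and whose error is
    # swallowed, so the coroutine returns instead of raising.
    try:
        raise ConnectionError("socket closed")
    except ConnectionError:
        return  # silently gives up

async def writer(queue: asyncio.Queue) -> None:
    while True:
        await queue.get()  # waits forever once the reader is gone

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()

    # Buggy pattern: gather() waits for *all* tasks, so when the reader exits
    # early the writer keeps blocking on an empty queue and nothing ever crashes.
    # await asyncio.gather(reader(queue), writer(queue))

    # Safer pattern: stop as soon as any task finishes or fails, cancel the rest,
    # and re-raise whatever the finished task produced.
    tasks = [asyncio.create_task(reader(queue)), asyncio.create_task(writer(queue))]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    for task in done:
        task.result()

asyncio.run(main())
```

The real fix in a service like this is more nuanced (you probably want the disconnect to propagate as an exception), but the point stands: gather() on its own does not give you crash-on-failure semantics.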

Verdict: Beautiful code, subtle bugs. Always review before shipping.

Windsurf — 7/10

Strong implementation. Good structure. It chose a multi-process approach instead of async, which I disagreed with, but it worked. The backpressure implementation was creative: a semaphore-based throttle instead of checking the queue size.

It missed the schema evolution requirement entirely. I had to explicitly ask for it.

Verdict: Solid but not thorough. You need to be the project manager.

Aider — 5/10

Aider produced working code after three iterations. The first attempt had no error handling. The second added error handling but broke the batch writer. The third was functional but had a subtle race condition in the backpressure logic.

Verdict: Feels like pair programming with a junior dev who reads docs fast.

Winner: Cursor (8/10)

Its ability to ask clarifying questions and design a system end-to-end is unmatched.


Task 3: Debug Mystery

A Node.js Express service in Kubernetes. Under load (>500 concurrent requests), ~3% return 502 Bad Gateway. No stack traces. No error logs. Health check works fine. Memory and CPU look normal.

This bug had been open for 2 weeks. Two senior engineers had looked at it.

GitHub Copilot — 3/10

Copilot is not a debugger. It suggested checking error handlers, adding logging, and looking at the reverse proxy config — the same things the human engineers already tried.

Verdict: Useless for debugging.

Cursor — 6/10

When I gave Cursor access to the full codebase, it noticed something: the service uses a connection pool for downstream HTTP calls, and the pool has a default timeout of 30s. But the service has middleware that sets a request timeout of 29s.

The bug: When traffic spikes, connections queue up. Some requests hit the middleware timeout before the connection pool returns a connection. The middleware catches the timeout and returns a 502 — but the error happens outside the try-catch that logs errors. No log, no trace, just a 502.

This was the actual bug. A cursed interaction between two timeouts implemented by two different people 6 months apart.

Verdict: Actually useful for debugging. Reads the whole codebase.

Claude Code — 7/10

Claude Code found the same bug as Cursor but faster. It read the middleware chain, the connection pool config, and the error handling in sequence and said:

"There is a 1-second gap between the middleware timeout (29s) and the pool timeout (30s). During this gap, requests are cancelled by the middleware but the error handler does not catch cancellation errors. Try adding process.on('unhandledRejection', ...) and check if cancellation errors are being swallowed."

It was right. The fix was a 3-line change to the error handler.

Verdict: Best debugger of the bunch. Reads code like a senior engineer.

Windsurf — 5/10

Windsurf found the timeout mismatch but didn't connect it to the missing error logging. It said "these timeouts look close, maybe that's the problem?" — but didn't explain why or how to verify it.

I had to do the actual debugging myself.

Verdict: Hints at the answer but doesn't get you all the way there.

Aider — 4/10

Aider couldn't handle this task. It doesn't have a "read the whole codebase and form a hypothesis" mode. It works on files you explicitly show it. By the time I had shown it all the relevant files, I had basically debugged it myself.

Verdict: Not built for debugging.

Winner: Claude Code (7/10)

Fastest to the correct diagnosis. Best at reasoning about system-level interactions.


Overall Scores

| Tool           | Task 1 (Refactor) | Task 2 (Greenfield) | Task 3 (Debug) | Average |
| -------------- | ----------------- | ------------------- | -------------- | ------- |
| Cursor         | 7                 | 8                   | 6              | 7.0     |
| Claude Code    | 5                 | 6                   | 7              | 6.0     |
| Windsurf       | 6                 | 7                   | 5              | 6.0     |
| GitHub Copilot | 6                 | 4                   | 3              | 4.3     |
| Aider          | 4                 | 5                   | 4              | 4.3     |

The Verdict

If you write code for a living, get Cursor. It is the only tool that consistently helps across all three task types. The $20/month is cheaper than one hour of your time.

If you do a lot of debugging, add Claude Code to your toolkit. It reasons about code differently — more like a senior engineer than a code completion engine.

GitHub Copilot is fine if you already have it, but it is not worth $10/month if you don't. It is a fancy autocomplete, not a coding assistant.

Aider is the best free option if you are comfortable on the command line and don't mind hand-holding the AI.

Windsurf is Cursor but more annoying to use. Skip it unless you really want the Cascade feature.


The One Thing Nobody Tells You

All of these tools make you faster at writing code. None of them make you faster at thinking about what to build.

If you don't know what you're doing, AI will help you build the wrong thing faster.

The developers who get the most out of these tools are not the ones who prompt the best. They are the ones who already know what good code looks like and use AI to get there faster.

AI is a force multiplier, not a replacement for judgment.


If you found this useful, follow me on Dev.to for more honest tool reviews and practical engineering advice. No affiliate links, no sponsorship, just real testing.
