This article was originally published on LucidShark Blog.
AI coding agents are exceptional at generating code. They are also, structurally, among the worst duplicators in the history of software development. Here is why that matters more than you think - and how to stop it before it compounds.
There is a number that comes up repeatedly in software engineering research: 20 to 30 percent. That is the fraction of code in a typical production codebase that is duplicated, according to studies by researchers at Carnegie Mellon, the Software Engineering Institute, and industry analyses from tools like SonarQube. In a 100,000-line codebase, you are carrying between 20,000 and 30,000 lines of redundant logic. Each one of those lines needs to be maintained, tested, and understood by every developer who reads the file.
Before AI coding assistants, duplication grew slowly. A developer copy-pasted a utility function during a deadline crunch. A new engineer did not know the helper already existed in a shared module. Over time, the numbers crept up. It was a manageable problem with discipline and the occasional refactoring sprint.
AI coding agents have changed the rate of accumulation entirely.
How AI Agents Generate Duplication by Default
When you ask Claude Code, Cursor, or a similar agent to implement a feature, it does not search your entire codebase for existing abstractions before writing. It generates code that is locally coherent, satisfies the immediate task, and returns. If you have a formatCurrency utility in src/utils/formatting.ts and you ask an agent to add a payment summary component, there is a meaningful chance it writes a new formatCurrency inline, because the context window did not include the utility file.
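The gap is easy to picture. A hypothetical sketch, using the utility named above and the near-duplicate an agent might produce when the utility file is outside its context window:

```typescript
// Existing utility in src/utils/formatting.ts (hypothetical contents):
export function formatCurrency(amount: number): string {
  return `$${amount.toFixed(2)}`;
}

// What an agent might write inline in the new payment summary component,
// never having seen the file above -- structurally identical, differently named:
export function formatPrice(value: number): string {
  return `$${value.toFixed(2)}`;
}
```

Both functions are correct in isolation. The problem is that the codebase now has two names for one behavior, and a future fix (say, locale-aware formatting) will land in only one of them.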
This is not a bug in the agent. It is a structural limitation of how large language models process context. They are excellent at generating code that is consistent within the context they have been given. They are poor at asserting global uniqueness across a codebase they have only partially seen.
The duplication patterns that emerge from AI-assisted development tend to cluster in three categories.
1. Utility Function Proliferation
Helper functions are the most common casualty. Date formatting, string sanitization, numeric rounding, object deep-cloning - these are written once by the first agent invocation that needs them, and then silently re-written by every subsequent invocation that encounters the same problem without seeing the original solution.
In a codebase where agents have been active for three months, it is common to find four or five implementations of the same date formatting logic, each slightly different, each tested separately if at all, and each with subtly different edge-case behavior. The developer who later encounters a timezone bug has no idea which of the five implementations is the canonical one.
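The "subtly different edge-case behavior" is rarely visible at a glance. A hypothetical pair of agent-written formatters illustrates it: both produce "YYYY-MM-DD", but one is UTC-based and the other uses local time, so they disagree for instants near midnight in any non-UTC timezone.

```typescript
// Session 1: an agent writes a UTC-based formatter.
function formatDate(d: Date): string {
  return d.toISOString().slice(0, 10); // always UTC
}

// Session 2: another agent, not seeing the first, writes a local-time version.
function toDateString(d: Date): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  return `${d.getFullYear()}-${pad(d.getMonth() + 1)}-${pad(d.getDate())}`;
}

// 00:30 UTC on Jan 1: for a user west of UTC, toDateString still says
// Dec 31 while formatDate says Jan 1 -- the timezone bug described above.
const d = new Date("2024-01-01T00:30:00Z");
```

Whichever one a later developer reaches for, half the codebase disagrees with them.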
2. Constant and Configuration Duplication
Magic numbers and string literals are even more insidious. An agent writes const MAX_RETRIES = 3 in a network request handler. Three prompts later, another agent writes const RETRY_LIMIT = 3 in an API client. A week after that, const maxAttempts = 3 appears in a background job processor. All three are the same business rule. When that rule changes - and it will change - the developer who updates one will not know to update the other two.
This is how silent production bugs are born. Not from dramatic failures, but from a configuration value updated in two of three places.
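The fix is mechanical once the duplication is visible: one declaration, imported everywhere. A minimal sketch, with illustrative file paths:

```typescript
// src/config/constants.ts -- single source of truth for the retry policy.
export const MAX_RETRIES = 3;

// Call sites import it instead of redeclaring their own copy:
//   import { MAX_RETRIES } from "../config/constants";
function shouldRetry(attempt: number): boolean {
  return attempt < MAX_RETRIES;
}
```

When the retry policy changes, it changes in exactly one place, and every handler, client, and job processor picks it up on the next build.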
3. Structural Logic Cloning
The most expensive category is duplicated logic blocks: input validation sequences, error-handling patterns, pagination logic, authentication guard implementations. These tend to be blocks of 10 to 50 lines that an agent re-generates from scratch each time a similar requirement appears.
Unlike a copy-pasted block, an AI-generated duplicate is rarely identical. It is semantically equivalent but syntactically distinct, which means naive string-matching deduplication tools will miss it entirely. The overlap is at the structural level: the same conditional chains, the same variable names in a different order, the same fallback patterns with different error messages.
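A hypothetical pair makes the point: a string-matching tool sees two unrelated functions, but the conditional chain, the guard order, and the fallback shape are identical.

```typescript
// Session 1: validation an agent wrote for one endpoint.
function validateUser(input: { email?: string; age?: number }): string | null {
  if (!input.email) return "email is required";
  if (input.age !== undefined && input.age < 0) return "age must be non-negative";
  return null;
}

// Session 2: the same logic re-generated elsewhere -- different identifiers,
// different messages, same structure. No line of text matches the original.
function checkPayload(data: { email?: string; age?: number }): string | null {
  if (!data.email) return "missing email field";
  if (data.age !== undefined && data.age < 0) return "invalid age";
  return null;
}
```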
The Compound Interest of Duplication
Research from McKinsey's developer productivity studies and CAST's annual software intelligence reports consistently finds that technical debt costs development teams between 20 and 40 percent of their total development capacity. Not one-time cleanup work - ongoing, per-sprint drag on every feature, every bug fix, every on-call response.
Code duplication is among the most significant contributors to that debt. A 2021 study published in the journal Empirical Software Engineering found that duplicated code regions are statistically more likely to contain bugs than non-duplicated regions - not because the logic is wrong, but because fixes applied to one copy are not propagated to others. The bug is "fixed" in one place and silently persists in three others.
AI-assisted development compresses the timeline for this accumulation dramatically. A developer working with an agent can generate code at five to ten times the rate of unassisted development. The duplication rate does not drop to match - if anything, it rises, because the agent has less contextual awareness than the developer would have had working through the codebase manually.
What took a year to accumulate in a human-written codebase now accumulates in six weeks of active AI-assisted development. The technical debt clock runs faster.
Why Your Current Tooling Misses This
Most code review processes are not configured to catch duplication at the rate AI agents produce it. The typical pull request review catches obvious copy-pastes within the changed files, but reviewers rarely search the entire codebase for prior implementations of a function that looks locally reasonable.
Linters catch style issues. Type checkers catch interface mismatches. SAST tools catch security vulnerabilities. None of them are looking for semantic duplication across files. Even dedicated duplication detection tools in CI/CD pipelines tend to run on merge, after the duplication has already landed and been built on top of.
The feedback loop that matters is the one that closes before the commit.
LucidShark's Duplication Analysis: What It Actually Does
LucidShark runs duplication analysis as one of its ten quality check categories, and it runs locally, before anything is committed. The analysis goes beyond token matching.
When an agent writes a new utility function and you run LucidShark pre-commit, the duplication engine normalizes variable names, strips whitespace, and compares structural AST patterns against the existing codebase. A function that formats a price as $X.XX will be flagged as a near-duplicate of an existing one even if every variable name is different, because the structure is identical.
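The core idea can be sketched in a few lines. This is a toy illustration of normalization-based comparison, not LucidShark's actual engine: rename identifiers to positional placeholders and strip whitespace, so two structurally identical functions reduce to the same string. (A real implementation would tokenize properly rather than regex over raw source.)

```typescript
// Toy sketch: normalize identifiers and whitespace so structural
// duplicates compare equal even when every name differs.
function normalize(src: string): string {
  const keywords = new Set(["function", "return", "const", "if", "else", "string", "number"]);
  const seen = new Map<string, string>();
  return src
    .replace(/[A-Za-z_$][A-Za-z0-9_$]*/g, (id) => {
      if (keywords.has(id)) return id; // keep language keywords as-is
      if (!seen.has(id)) seen.set(id, `_${seen.size}`); // first-seen order
      return seen.get(id)!;
    })
    .replace(/\s+/g, "");
}

const a = `function formatCurrency(amount: number): string { return "$" + amount.toFixed(2); }`;
const b = `function formatPrice(value: number): string { return "$" + value.toFixed(2); }`;
// normalize(a) and normalize(b) are identical strings: a structural match.
```

Production engines work on the AST rather than text, which also lets them score near-duplicates (the "91% similarity" below) instead of only exact structural matches.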
The output is specific enough to act on immediately:
[DUPLICATION] MEDIUM src/components/PaymentSummary.tsx:34
Near-duplicate of src/utils/formatting.ts:12 (91% similarity)
Rule: duplication/utility-function
Existing: formatCurrency(amount: number): string
New: formatPrice(value: number): string
Recommendation: Remove formatPrice and import formatCurrency
from src/utils/formatting.ts instead.
This finding surfaces before the developer commits, before the code review, and before the second implementation gets imported by three other components that make it expensive to remove.
The same analysis catches constant duplication across files:
[DUPLICATION] LOW src/services/api-client.ts:8
Constant duplication detected.
Rule: duplication/magic-constant
Value: 3 (assigned to RETRY_LIMIT)
Existing declarations with same value and similar context:
src/network/request-handler.ts:14 MAX_RETRIES = 3
src/jobs/background-processor.ts:22 maxAttempts = 3
Recommendation: Extract to src/config/constants.ts
and import from single source of truth.
That finding, acted on when the third constant is written, saves the developer from a three-location update when the retry policy changes in six months.
The Integration with Claude Code
LucidShark integrates with Claude Code via MCP (Model Context Protocol), which creates a tight feedback loop. Claude Code writes code. LucidShark scans it. Claude Code receives the findings and can address them before moving to the next task.
This is not just about catching individual duplicates. Over time, it trains the agent to prefer imports over re-implementations - not through any change to the model, but because the agent sees duplication findings in its context and learns within the session to check for existing utilities before generating new ones.
In practice, this means teams using LucidShark with Claude Code report significantly lower duplication rates in codebases that have been active for several months, compared to teams using AI agents without a local quality gate. The agent does not start writing worse code. The quality gate catches and surfaces what would otherwise silently accumulate.
A Practical Example: Six Weeks Without a Quality Gate
Consider a team that starts a new Next.js application with Claude Code handling most feature implementation. Without a duplication gate in place, a six-week snapshot of the codebase will typically show:
- Three to five implementations of date and time formatting logic, each with slightly different timezone handling
- Two or three versions of an API error handler, each with different retry behavior and logging verbosity
- Scattered magic numbers representing the same business rules: session timeouts, maximum file sizes, pagination limits
- Repeated validation logic for form inputs that should have been abstracted into a shared schema
None of these are individually catastrophic. All of them together represent a codebase where the cognitive load of making a change has grown substantially beyond what the line count implies. Every developer who works in the codebase now needs to understand which implementation is canonical, or risk working with a stale one.
With LucidShark running pre-commit, the same six-week period produces a codebase where these duplicates are caught as they are introduced and either consolidated immediately or flagged for explicit deduplication. The codebase does not grow duplication-free overnight, but the rate of accumulation drops substantially, and the debt does not compound.
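The last item on that list - repeated form validation - is the most straightforward to consolidate. One possible shared design, with illustrative names and rules:

```typescript
// A single shared rule table, instead of per-component re-implementations.
type Rule = { test: (v: string) => boolean; message: string };

const formRules: Record<string, Rule[]> = {
  email: [{ test: (v) => /\S+@\S+\.\S+/.test(v), message: "invalid email" }],
  name: [{ test: (v) => v.trim().length > 0, message: "name is required" }],
};

// Every form calls the same function; adding a rule changes one file.
function validateField(field: string, value: string): string[] {
  return (formRules[field] ?? []).filter((r) => !r.test(value)).map((r) => r.message);
}
```

Once this exists, a duplication finding on a newly generated inline validator becomes a one-line fix: delete it and import the shared function.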
Getting Started
LucidShark runs entirely on your machine - no cloud services, no SaaS subscription, no data leaving your environment. It supports JavaScript and TypeScript with full duplication analysis, along with Python, Go, Java, Rust, and several others.
Install it in one command:
curl -fsSL https://raw.githubusercontent.com/toniantunovi/lucidshark/main/install.sh | bash
Integrate it with Claude Code via the MCP server and you have a duplication gate that runs every time the agent finishes a task, before anything is committed. Visit lucidshark.com for full installation instructions and configuration docs.
AI coding agents are not going to become naturally averse to duplication. That is not how they work. But duplication that is caught pre-commit, before it is imported and depended on, is cheap to fix. Duplication that has been in production for six months, imported by twelve components, with divergent bug fixes applied to each copy, is expensive.
Run the gate. Pay the cheap cost now, not the expensive one later.