Rahul Singh

Posted on • Originally published at aicodereview.cc

13 Best Duplicate Code Checker Tools in 2026

Code duplication is the silent tax on every codebase

I have worked on codebases where fixing a single bug required changing the same logic in seven different files. Not because the architecture demanded it - because someone copy-pasted a function years ago, and then someone else copy-pasted the copy, and then the copies diverged slightly, and nobody knew which version was canonical anymore.

That is the real cost of code duplication. It is not just wasted disk space or inflated line counts. It is the compounding maintenance burden of keeping multiple copies of the same logic in sync - a task that humans are reliably terrible at.

Studies from large-scale codebases confirm this. Research on the Linux kernel found that inconsistent changes to cloned code were responsible for a meaningful percentage of bugs. A study of open-source Java projects found that cloned code was changed more frequently and contained more defects than non-cloned code. The numbers vary by study, but the pattern is consistent: duplication breeds bugs.

The good news is that detecting duplicate code is a well-understood problem with excellent tooling. This guide covers 13 tools that find duplicated code - from lightweight CLI utilities you can run in 30 seconds to full platforms that track duplication trends across your entire organization.

What is code duplication (and why should you care)?

Code duplication - also called code cloning - occurs when identical or nearly identical code fragments exist in multiple locations within a codebase. It typically happens through copy-paste programming, where a developer copies a working block of code and modifies it slightly for a new context instead of abstracting the shared logic into a reusable function.

The four types of code clones

The research community classifies code clones into four types, and this taxonomy matters because different tools detect different types:

Type 1 - Exact clones. Identical code fragments except for differences in whitespace, layout, and comments. This is the simplest form - someone copied a function and only changed the formatting.

// Clone A
function calculateTax(amount) {
  const rate = 0.08;
  return amount * rate;
}

// Clone B (Type 1 - only whitespace differs)
function calculateTax(amount) {
    const rate = 0.08;
    return amount * rate;
}

Type 2 - Renamed clones. Syntactically identical fragments with differences in identifier names, literal values, or type declarations. The structure is the same, but names and values have changed.

// Clone A
function calculateTax(amount) {
  const rate = 0.08;
  return amount * rate;
}

// Clone B (Type 2 - renamed variables and changed literal)
function computeLevy(price) {
  const percentage = 0.10;
  return price * percentage;
}

Type 3 - Near-miss clones. Fragments with further modifications - statements added, removed, or reordered. The code is recognizably similar but not structurally identical.

// Clone A
function calculateTax(amount) {
  const rate = 0.08;
  return amount * rate;
}

// Clone B (Type 3 - added validation logic)
function computeLevy(price) {
  if (price < 0) throw new Error("Invalid price");
  const percentage = 0.10;
  const result = price * percentage;
  console.log(`Levy: ${result}`);
  return result;
}

Type 4 - Semantic clones. Functionally equivalent code implemented with different syntax or algorithms. Two sorting functions that use different algorithms but produce identical output are Type 4 clones. These are the hardest to detect and most tools cannot reliably find them.
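To round out the examples above, here is a small illustrative pair of Type 4 clones. The function names `sumTo` and `triangular` are made up for this sketch, and the two are only equivalent for non-negative integer inputs:

```javascript
// Clone A - adds the numbers 1..n with an explicit loop
function sumTo(n) {
  let total = 0;
  for (let i = 1; i <= n; i++) {
    total += i;
  }
  return total;
}

// Clone B (Type 4 - same result for non-negative integers, via a closed-form formula)
function triangular(n) {
  return (n * (n + 1)) / 2;
}

console.log(sumTo(10), triangular(10));
```

The two functions share almost no tokens or syntax tree structure, which is why token matching and subtree comparison both struggle with this category.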

Why duplication matters

The DRY (Don't Repeat Yourself) principle exists for practical reasons, not aesthetic ones:

  • Bug propagation. A bug fixed in one copy often remains unfixed in the others. The more copies, the more likely a fix will be incomplete.
  • Maintenance cost. Every change to shared logic must be applied N times across N copies. This scales linearly with duplication.
  • Code review burden. Reviewers waste time reading code they have already reviewed in another file.
  • Binary size and build time. Duplicated code inflates compile times and bundle sizes unnecessarily.
  • Inconsistent behavior. When copies diverge, users encounter different behavior depending on which code path they hit.

Not all duplication is harmful. Test files, generated code, and certain boilerplate patterns are often intentionally duplicated. The goal is not zero duplication - it is zero accidental duplication of business logic.

Tool comparison table

Tool | Clone Types | Languages | Pricing | CI Integration | Open Source
--- | --- | --- | --- | --- | ---
SonarQube | 1, 2, 3 | 35+ | Free (Community) to $65K+/yr | GitHub, GitLab, Jenkins, Azure | Community Build: Yes
PMD CPD | 1, 2 | 20+ | Free | Maven, Gradle, CLI | Yes (BSD)
Simian | 1, 2 | 15+ | $299-$499/license | CLI, Ant, Maven | No
jscpd | 1, 2 | 150+ (via tokenizers) | Free | CLI, GitHub Actions | Yes (MIT)
MOSS | 1, 2, 3 | 25+ | Free (academic) | Web upload only | No
Duplo | 1, 2 | Language-agnostic | Free | CLI | Yes
CloneDR | 1, 2, 3, 4 | 20+ | Enterprise pricing | CLI | No
CodeAnt AI | 1, 2, 3 | 30+ | Free to $40/user/mo | GitHub, GitLab, Bitbucket | No
Codacy | 1, 2 | 40+ | Free to $15/user/mo | GitHub, GitLab, Bitbucket | No
DeepSource | 1, 2, 3 | 15+ | Free to $12/user/mo | GitHub, GitLab, Bitbucket | No
IntelliJ IDEA | 1, 2, 3 | JVM, Python, JS, PHP | $249-$779/yr (Ultimate) | IDE only | No
Coverity | 1, 2, 3 | 22+ | Enterprise pricing ($50K+) | Jenkins, GitHub, GitLab | No
Semgrep | 1, 2 (pattern-based) | 30+ | Free (OSS) to custom | GitHub, GitLab, CLI | OSS engine: Yes

1. SonarQube - copy-paste detection built into the quality platform

SonarQube's duplication detection is one of the most widely deployed in the industry, largely because it comes bundled with the broader quality platform that most enterprises already run. If you are using SonarQube for code quality and security, you are already getting duplication analysis for free.

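How it works. SonarQube uses a combination of token-based and AST-based analysis depending on the language. For most languages, it tokenizes code, normalizes identifiers and literals, and then finds matching token sequences above a configurable minimum size. Per SonarQube's metric definitions, the default for most languages is at least 100 successive duplicated tokens spread over at least 10 lines, while Java is measured in statements instead (at least 10 successive duplicated statements). It detects Type 1, Type 2, and some Type 3 clones.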
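To make the normalization step concrete, here is a minimal sketch of how renaming-insensitive token matching can work. This is not SonarQube's implementation. The regex tokenizer, the short keyword list, and the `normalizeTokens` name are simplifications I made up for illustration, and comments and multi-character operators are not handled.

```javascript
// Keywords keep their text; every other name is collapsed to the placeholder ID.
const KEYWORDS = new Set(["function", "const", "let", "var", "return", "if", "else", "for", "while"]);

// Matches identifiers, numbers, simple string literals, or a single punctuation character.
const TOKEN_RE = /[A-Za-z_$][\w$]*|\d+(?:\.\d+)?|"[^"]*"|'[^']*'|[^\s\w]/g;

function normalizeTokens(source) {
  const tokens = source.match(TOKEN_RE) || [];
  return tokens.map((tok) => {
    if (KEYWORDS.has(tok)) return tok;
    if (/^[A-Za-z_$]/.test(tok)) return "ID";
    if (/^\d/.test(tok)) return "NUM";
    if (/^["']/.test(tok)) return "STR";
    return tok; // punctuation and operators are kept verbatim
  });
}

const cloneA = "function calculateTax(amount) { const rate = 0.08; return amount * rate; }";
const cloneB = "function computeLevy(price) { const percentage = 0.10; return price * percentage; }";

// The two Type 2 clones from earlier normalize to the same token sequence.
console.log(normalizeTokens(cloneA).join(" ") === normalizeTokens(cloneB).join(" "));
```

Once both fragments reduce to the same sequence, finding the clone becomes a plain sequence-matching problem.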

What sets it apart. The duplication metrics feed directly into SonarQube's quality gate system. You can block merges when duplication on new code exceeds a threshold - 3% is the default. The duplication visualization shows exactly which blocks are duplicated and where, making it easy to plan refactoring.

Languages: 35+ including Java, Python, JavaScript, TypeScript, C#, C/C++, Go, PHP, Ruby, Swift, Kotlin.

Pricing: Community Build is free and open source. Developer Edition starts around $150/year for 100K LOC. Enterprise runs $65,000+/year.

Pros:

  • Duplication detection is part of a comprehensive quality platform
  • Quality gates can enforce duplication thresholds on PRs
  • Excellent visualization of duplicate blocks across the codebase
  • Deep language support with language-specific tokenization

Cons:

  • Community Build requires self-hosting a server
  • Configuration is heavier than standalone CLI tools
  • Enterprise pricing is steep for teams that only need duplication detection
  • Type 3 detection is limited compared to AST-native tools

Best for: Teams already using SonarQube that want duplication checks as part of their quality workflow.

2. PMD CPD - the copy-paste detector that just works

PMD's Copy/Paste Detector (CPD) has been the go-to standalone duplication checker for over two decades. It is free, fast, reliable, and works with every major build system. If you need a no-frills duplicate code finder that you can add to a CI pipeline in five minutes, CPD is the answer.

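How it works. CPD uses token-based detection. It lexes source files into token streams, then applies the Karp-Rabin algorithm to find matching subsequences. You set a minimum token count (the CLI expects it explicitly; the Maven plugin defaults to 100), and CPD reports all pairs of code fragments that share at least that many consecutive tokens. Renamed (Type 2) clones are matched only when identifier and literal normalization is switched on, for example with --ignore-identifiers and --ignore-literals for the languages that support them.

What sets it apart. No accounts, no server, and almost no configuration; the only prerequisite is a Java runtime. Download the PMD distribution, run pmd cpd --minimum-tokens 100 --dir src/, and you get a report. It integrates natively with Maven (mvn pmd:cpd), Gradle, and Ant.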
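The rolling-hash idea behind Karp-Rabin matching can be sketched in a few lines. This is a simplified illustration, not CPD's code. The names `tokenCode`, `windowHashes`, and `findSharedWindows` are hypothetical, tokens are produced by splitting on spaces, and matches are reported per fixed-size window rather than extended to their maximal length.

```javascript
const BASE = 257;
const MOD = 1000003; // small prime so intermediate products stay well below 2^53

// Map a token string to a small integer.
function tokenCode(token) {
  let h = 0;
  for (const ch of token) h = (h * 31 + ch.codePointAt(0)) % MOD;
  return h;
}

// Map each rolling hash to the start indexes of every window of `size` tokens.
function windowHashes(tokens, size) {
  const result = new Map();
  if (tokens.length < size) return result;
  const codes = tokens.map(tokenCode);
  let high = 1; // BASE^(size-1) mod MOD, the weight of the token leaving the window
  for (let i = 1; i < size; i++) high = (high * BASE) % MOD;
  let hash = 0;
  for (let i = 0; i < size; i++) hash = (hash * BASE + codes[i]) % MOD;
  for (let start = 0; ; start++) {
    if (!result.has(hash)) result.set(hash, []);
    result.get(hash).push(start);
    if (start + size >= tokens.length) break;
    // Slide the window: drop codes[start], append codes[start + size].
    hash = (hash - ((codes[start] * high) % MOD) + MOD) % MOD;
    hash = (hash * BASE + codes[start + size]) % MOD;
  }
  return result;
}

// Find windows shared by two token streams; every hash hit is verified token by token.
function findSharedWindows(tokensA, tokensB, size) {
  const hashesA = windowHashes(tokensA, size);
  const matches = [];
  for (const [hash, startsB] of windowHashes(tokensB, size)) {
    for (const a of hashesA.get(hash) || []) {
      for (const b of startsB) {
        const same = tokensA.slice(a, a + size).every((t, i) => t === tokensB[b + i]);
        if (same) matches.push({ a, b });
      }
    }
  }
  return matches;
}

const left = "x = a + b ; return x ;".split(" ");
const right = "log ( ) ; x = a + b ; done".split(" ");
console.log(findSharedWindows(left, right, 5));
```

The point of the rolling hash is that each window's hash is derived from the previous one in constant time, which is what lets this style of detector stay fast on large inputs. The token-by-token verification guards against hash collisions.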

Languages: Java, JavaScript, TypeScript, Python, C/C++, C#, Go, Ruby, Swift, Kotlin, Scala, PHP, Lua, MATLAB, Fortran, and more.

Pricing: Free and open source (BSD license).

Pros:

  • Extremely fast - scans millions of lines in seconds
  • No server, no account, no internet connection required
  • Native Maven, Gradle, and Ant integration
  • Well-documented with 20+ years of stability
  • Configurable minimum tokens and output formats (XML, CSV, text)

Cons:

  • Token-based only - misses Type 3 and Type 4 clones
  • No visualization or trending - just a flat report
  • No PR integration or quality gates out of the box
  • No web dashboard or historical tracking

Best for: Any team that wants a fast, free, no-nonsense duplication check in CI.

3. Simian - commercial token matcher with strict licensing

Simian (Similarity Analyser) is a commercial duplicate code detection tool that focuses on fast, accurate token-based matching. It was popular in the mid-2010s and still has users in enterprise Java and .NET shops.

How it works. Simian uses proprietary token-based matching that the vendor claims is more accurate than CPD for certain edge cases, particularly around multiline string literals and complex expressions.

Languages: Java, C#, C/C++, JavaScript, TypeScript, Ruby, Swift, Objective-C, Visual Basic, COBOL, and others.

Pricing: $299 for a single developer license, $499 for a site license. One-time purchase.

Pros:

  • Fast scanning with low memory footprint
  • Good handling of edge cases in .NET and Java code
  • One-time license fee rather than subscription

Cons:

  • No longer actively developed - last major update was several years ago
  • Token-based only - no Type 3 detection
  • No CI/CD integration beyond CLI exit codes
  • No web dashboard or PR comments
  • PMD CPD provides comparable functionality for free

Best for: Legacy projects already using Simian that do not want to migrate.

4. jscpd - the polyglot copy-paste detector

jscpd is a modern, Node.js-based duplicate code detector that supports an impressive range of languages through its tokenizer system. If you work with a polyglot codebase and want a single duplication tool, jscpd is worth considering.

How it works. jscpd tokenizes source files using language-specific tokenizers (powered by the Prism syntax highlighter's grammar definitions), then finds matching token sequences using the Rabin-Karp algorithm. It supports a configurable minimum number of tokens and lines.

What sets it apart. The breadth of language support is exceptional. Because it leverages Prism's grammars, jscpd can tokenize over 150 languages and file formats, including markup languages, configuration files, and even Dockerfiles. It also generates HTML reports with side-by-side clone views.

Languages: 150+ including all mainstream programming languages plus markup, configuration, and infrastructure-as-code files.

Pricing: Free and open source (MIT license).

Pros:

  • Broadest language support of any standalone duplication tool
  • Beautiful HTML reports with side-by-side clone visualization
  • CI-friendly exit codes and JSON output
  • Configurable thresholds per language via .jscpd.json
  • Active open-source project with regular updates

Cons:

  • Token-based only - no Type 3 or Type 4 detection
  • Node.js dependency required
  • Slower than PMD CPD on very large codebases
  • No server or dashboard for tracking trends over time

Best for: Polyglot codebases and teams that want HTML reporting without a server.

5. MOSS - academic plagiarism detection for source code

MOSS (Measure Of Software Similarity) is a web-based service from Stanford University designed to detect plagiarism in programming assignments. It is not a typical developer tool, but it excels at detecting code similarity across large sets of submissions - which makes it uniquely useful for certain scenarios.

How it works. MOSS uses document fingerprinting with the Winnowing algorithm. You submit a set of source files, and MOSS returns a ranked list of file pairs with the highest similarity, along with a web-based side-by-side view showing the matching regions.
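Winnowing itself is a published, fairly compact algorithm: hash every k-character substring, then keep only the minimum hash from each sliding window of hashes. The sketch below shows that selection step under my own assumptions. The FNV-style hash, the parameter values `k = 5` and `w = 4`, and the names `winnow` and `overlap` are illustrative choices, and MOSS's actual normalization and parameters are not reproduced here.

```javascript
// Hash a string to a 32-bit unsigned integer (FNV-1a style).
function hashString(text) {
  let h = 2166136261;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 16777619) >>> 0;
  }
  return h;
}

// Winnowing: hash every k-gram, then keep the minimum hash of each window of w hashes.
function winnow(text, k, w) {
  const cleaned = text.toLowerCase().replace(/\s+/g, "");
  const hashes = [];
  for (let i = 0; i + k <= cleaned.length; i++) {
    hashes.push(hashString(cleaned.slice(i, i + k)));
  }
  const selected = new Map(); // position -> hash, so a minimum shared by windows is stored once
  for (let start = 0; start + w <= hashes.length; start++) {
    let minPos = start;
    for (let j = start + 1; j < start + w; j++) {
      if (hashes[j] <= hashes[minPos]) minPos = j; // prefer the rightmost minimum
    }
    selected.set(minPos, hashes[minPos]);
  }
  return new Set(selected.values());
}

// Share of A's fingerprints that also occur in B (0 when A is too short to fingerprint).
function overlap(textA, textB, k = 5, w = 4) {
  const a = winnow(textA, k, w);
  const b = winnow(textB, k, w);
  if (a.size === 0) return 0;
  let shared = 0;
  for (const h of a) if (b.has(h)) shared++;
  return shared / a.size;
}

const snippet = "return amount * rate;";
console.log(overlap(snippet, "function f() { " + snippet + " }"));
```

Because only a fraction of the hashes is kept, comparing many submissions against each other stays cheap, while any sufficiently long shared substring still produces at least one shared fingerprint.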

What sets it apart. MOSS is designed to compare many files against each other simultaneously, which is different from most tools that scan a single codebase. It normalizes code to ignore variable names and whitespace, catching Type 1, 2, and some Type 3 clones.

Languages: C, C++, Java, Python, JavaScript, C#, MATLAB, Perl, Haskell, Lisp, Scheme, and others.

Pricing: Free for educational and research use.

Pros:

  • Excellent at detecting similarity across large file sets
  • Web-based visualization with highlighted matching regions
  • Handles obfuscation attempts (variable renaming, reordering)
  • Trusted and maintained by Stanford for decades

Cons:

  • Requires uploading code to Stanford's servers - not suitable for proprietary code
  • No CI/CD integration
  • Results are viewed in a web interface and submissions go through a submission script rather than a documented API, which makes automation awkward
  • Designed for academic plagiarism, not production code quality
  • Availability depends on Stanford's infrastructure

Best for: Academic settings and open-source projects where code can be uploaded to an external server.

6. Duplo - lightweight C/C++ focused detector

Duplo is a simple, lightweight duplicate code detection tool originally designed for C and C++ projects. It takes a minimalist approach - feed it a list of files, and it reports duplicate blocks.

How it works. Duplo uses a line-based matching algorithm. It normalizes lines by stripping whitespace and comments, then finds matching sequences of lines above a configurable minimum length. This is simpler than token-based approaches but surprisingly effective for exact and renamed clones.
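Line-based matching is simple enough to sketch directly. The following is my own minimal illustration of the approach described above, not Duplo's source. `normalizeLines` and `findDuplicateBlocks` are hypothetical names, and only `//` line comments are stripped, so a `//` inside a string literal would be cut off too.

```javascript
// Strip line comments and all whitespace so formatting changes do not affect matching.
function normalizeLines(source) {
  return source
    .split("\n")
    .map((line, index) => ({
      text: line.replace(/\/\/.*$/, "").replace(/\s+/g, ""),
      lineNumber: index + 1,
    }))
    .filter((line) => line.text.length > 0);
}

// Report maximal runs of at least `minLines` identical normalized lines.
function findDuplicateBlocks(sourceA, sourceB, minLines) {
  const a = normalizeLines(sourceA);
  const b = normalizeLines(sourceB);
  const blocks = [];
  for (let i = 0; i < a.length; i++) {
    for (let j = 0; j < b.length; j++) {
      // Skip positions that merely continue a run already examined.
      if (i > 0 && j > 0 && a[i - 1].text === b[j - 1].text) continue;
      let length = 0;
      while (
        i + length < a.length &&
        j + length < b.length &&
        a[i + length].text === b[j + length].text
      ) {
        length++;
      }
      if (length >= minLines) {
        blocks.push({ lineA: a[i].lineNumber, lineB: b[j].lineNumber, length });
      }
    }
  }
  return blocks;
}

const fileA = "function calculateTax(amount) {\n  const rate = 0.08;\n  return amount * rate;\n}";
const fileB = "// copied helper\nfunction calculateTax(amount) {\n    const rate = 0.08;\n    return amount * rate;\n}";
console.log(findDuplicateBlocks(fileA, fileB, 3));
// → [ { lineA: 1, lineB: 2, length: 4 } ]
```

The nested loops make this quadratic in the number of lines, which is fine for a sketch. A real tool would index lines by hash first.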

Languages: Technically language-agnostic (line-based matching works on any text), but optimized for C, C++, Java, and C#.

Pricing: Free and open source.

Pros:

  • Extremely lightweight - single binary, no dependencies
  • Fast on large codebases
  • Simple to understand and configure
  • Works on any text-based language

Cons:

  • Line-based matching is less accurate than token or AST-based detection
  • Very basic output - no visualization or HTML reports
  • Limited development activity in recent years
  • No CI integration beyond exit codes

Best for: C/C++ projects that need a quick, zero-dependency duplication check.

7. CloneDR - the research-grade AST clone detector

CloneDR is one of the few tools that uses full AST-based clone detection, developed by Semantic Designs. It parses source code into abstract syntax trees and compares subtrees to find clones - including Type 3 and even some Type 4 clones that token-based tools miss completely.

How it works. CloneDR parses source files using language-specific grammars, builds ASTs, and then compares subtrees using a parameterized matching algorithm. It can detect clones where statements have been added, removed, or reordered (Type 3), and in some cases can identify semantically equivalent code with different structure (Type 4).
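To give a feel for subtree comparison, here is a toy sketch that groups AST subtrees by a canonical form that ignores identifier names and literal values. It is a deliberately reduced illustration and not CloneDR's algorithm. The node shape, the `canonical` and `findCloneGroups` names, and the exact-match grouping are my assumptions, and exact matching like this only catches renamed clones, whereas the parameterized matching described above also tolerates added or removed statements.

```javascript
// Toy AST node constructor: { type, value, children }.
const node = (type, value = null, children = []) => ({ type, value, children });

// Canonical form of a subtree: node types and operators are kept,
// identifier names and literal values are dropped.
function canonical(tree) {
  const isLeafName = tree.type === "Identifier" || tree.type === "Literal";
  const label = isLeafName ? tree.type : `${tree.type}:${tree.value ?? ""}`;
  return `(${label}${tree.children.map(canonical).join("")})`;
}

function countNodes(tree) {
  return 1 + tree.children.reduce((sum, child) => sum + countNodes(child), 0);
}

// Group every subtree with at least `minNodes` nodes by canonical form;
// groups with two or more members are clone candidates.
function findCloneGroups(roots, minNodes) {
  const groups = new Map();
  const visit = (tree, path) => {
    if (countNodes(tree) >= minNodes) {
      const key = canonical(tree);
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(path);
    }
    tree.children.forEach((child, i) => visit(child, `${path}/${i}`));
  };
  roots.forEach((root, i) => visit(root, `root${i}`));
  return [...groups.values()].filter((paths) => paths.length > 1);
}

// amount * rate, price * percentage, and price + 1
const exprA = node("BinaryExpression", "*", [node("Identifier", "amount"), node("Identifier", "rate")]);
const exprB = node("BinaryExpression", "*", [node("Identifier", "price"), node("Identifier", "percentage")]);
const exprC = node("BinaryExpression", "+", [node("Identifier", "price"), node("Literal", 1)]);

console.log(findCloneGroups([exprA, exprB, exprC], 3));
// → [ [ 'root0', 'root1' ] ]
```

The first two expressions land in the same group despite sharing no identifier names, while the third differs in operator and operand kind and stays out.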

What sets it apart. This is the most thorough clone detection approach available in a commercial tool. CloneDR does not just find copied code - it can suggest a refactored version that extracts the common logic into a shared function with parameters for the varying parts.

Languages: Java, C#, C/C++, Python, JavaScript, COBOL, PHP, and others.

Pricing: Enterprise licensing through Semantic Designs. Contact for quotes.

Pros:

  • Detects Type 3 and some Type 4 clones
  • Suggests refactored abstractions for detected clones
  • Most thorough detection of any tool in this list
  • Handles complex transformations like loop restructuring

Cons:

  • Expensive enterprise-only licensing
  • Slower than token-based tools due to full AST parsing
  • Smaller user community - fewer resources and integrations
  • No native CI/CD pipeline integration
  • Requires language-specific grammars for each supported language

Best for: Organizations serious about eliminating deep structural duplication and willing to invest in thorough analysis.

8. CodeAnt AI - duplication detection with AI code review

CodeAnt AI bundles duplicate code detection with its broader AI-powered code review and static analysis platform. It is one of the newer entrants that treats duplication as part of a holistic code quality workflow rather than a standalone feature.

How it works. CodeAnt AI scans repositories connected through GitHub, GitLab, or Bitbucket. Its duplication engine uses a combination of token-based matching and structural analysis to detect Type 1, 2, and 3 clones. Findings appear as PR comments alongside security and quality issues.

What sets it apart. Duplication findings are contextualized with AI-generated explanations. Instead of just pointing out that two blocks match, CodeAnt AI explains why the duplication is problematic and suggests a refactored approach. It also tracks duplication metrics over time.

Languages: 30+ languages including Python, JavaScript, TypeScript, Java, Go, Ruby, PHP, and C#.

Pricing: Free Basic plan for open source. Pro at $24/user/month and Enterprise at $40/user/month.

Pros:

  • AI-powered explanations and refactoring suggestions
  • Duplication detection is part of a comprehensive code review platform
  • PR-level integration - findings appear as comments on pull requests
  • Tracks duplication metrics and trends over time
  • SAST, secrets detection, and DORA metrics included

Cons:

  • Duplication detection is not available as a standalone feature
  • Requires connecting your repository to CodeAnt AI's platform
  • Newer tool with a smaller community than SonarQube or PMD
  • Enterprise pricing adds up for large teams

Best for: Teams that want duplication detection bundled with AI code review and static analysis.

9. Codacy - duplication as part of code quality automation

Codacy includes duplication detection as one of its core code quality checks. Like SonarQube, it bundles duplication analysis with security scanning, code coverage tracking, and quality gates - but as a fully managed cloud service.

How it works. Codacy uses PMD CPD and language-specific analyzers under the hood for its duplication detection. It tokenizes code, finds matching sequences, and reports duplicates with links to both locations. Findings appear in the Codacy dashboard and as PR comments.

What sets it apart. The fully managed experience means zero infrastructure. Connect your GitHub, GitLab, or Bitbucket repository, and duplication analysis starts automatically on every push. Quality gates can enforce duplication thresholds, blocking merges when thresholds are exceeded.

Languages: 40+ languages through its analyzer ecosystem.

Pricing: Free for open source. Pro at $15/user/month.

Pros:

  • Zero-configuration cloud setup
  • Quality gates enforce duplication thresholds on PRs
  • Duplication trends visible in dashboard over time
  • Bundled with SAST, SCA, code coverage, and code quality
  • Supports GitHub, GitLab, and Bitbucket

Cons:

  • Duplication detection relies on PMD CPD - no Type 3 detection
  • Cannot run as a standalone CLI tool
  • $15/user/month adds up for larger teams
  • Less configurable than running PMD CPD directly

Best for: Teams that want a managed code quality platform with duplication checks included.

10. DeepSource - fast duplication analysis with Autofix

DeepSource includes duplication detection as part of its static analysis platform and stands out for its speed and low false positive rate. It takes a modern approach to developer experience, with clean UI and actionable findings.

How it works. DeepSource uses its own analysis engine that combines token-based and structural techniques. It detects Type 1, 2, and some Type 3 clones. When duplication is found, DeepSource can generate Autofix suggestions that extract duplicated code into shared functions.

What sets it apart. The Autofix feature for duplication is genuinely useful. Rather than just flagging that code is duplicated, DeepSource proposes a refactored version. Its sub-5% false positive rate also means you do not waste time reviewing findings that are not real issues.
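For a sense of what "extracting duplicated code into a shared function" means in practice, here is a hand-written before-and-after using the Type 2 clones from earlier. This is not actual Autofix output. The `applyRate` helper and the `Refactored` names are hypothetical and exist only so both versions can sit side by side.

```javascript
// Before: two Type 2 clones that differ only in names and the rate literal.
function calculateTax(amount) {
  const rate = 0.08;
  return amount * rate;
}

function computeLevy(price) {
  const percentage = 0.10;
  return price * percentage;
}

// After: the shared logic lives in one place and the varying literal becomes a parameter.
function applyRate(amount, rate) {
  return amount * rate;
}

const calculateTaxRefactored = (amount) => applyRate(amount, 0.08);
const computeLevyRefactored = (price) => applyRate(price, 0.10);

console.log(calculateTax(100) === calculateTaxRefactored(100));
console.log(computeLevy(100) === computeLevyRefactored(100));
```

Whatever tool proposes the change, the useful check is the same: the refactored call sites should return exactly what the original copies returned.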

Languages: Python, JavaScript, TypeScript, Java, Go, Ruby, Rust, C#, Kotlin, Swift, PHP, and others.

Pricing: Free for individual developers and open-source projects. Team plan at $12/user/month.

Pros:

  • Autofix generates refactored code for duplicated blocks
  • Very low false positive rate
  • Fast scan times - typically completes in under a minute
  • Clean, modern developer experience
  • Free tier is generous for individual use

Cons:

  • Smaller language coverage than SonarQube or Codacy
  • Type 3 detection is limited
  • No standalone CLI for duplication-only scanning
  • Enterprise features require the paid tier

Best for: Teams that value developer experience and want AI-assisted refactoring of duplicated code.

11. IntelliJ IDEA - IDE-native duplicate detection

JetBrains IntelliJ IDEA (and other JetBrains IDEs like PyCharm, WebStorm, and Rider) includes built-in duplicate code detection that runs directly in the editor. This is not a CI/CD tool - it is a developer productivity feature that surfaces duplication while you are actively coding.

How it works. IntelliJ's duplicate detection uses AST-based analysis powered by the JetBrains inspection engine. It compares structural elements rather than raw tokens, which enables Type 3 detection where code has been slightly modified. Results appear as editor highlights and in the inspection results panel.

What sets it apart. The IDE integration means you see duplication immediately as you write code, before you even commit. The refactoring tools built into IntelliJ - Extract Method, Extract Variable, Pull Members Up - work seamlessly with the duplication findings, making it trivial to fix clones on the spot.

Languages: Java, Kotlin, Scala, Groovy (IntelliJ), Python (PyCharm), JavaScript/TypeScript (WebStorm), C#/.NET (Rider), PHP (PhpStorm).

Pricing: Community Edition is free but does not include duplication detection. Ultimate starts at $249/year for individuals, $779/year for organizations.

Pros:

  • Real-time detection as you write code
  • AST-based analysis catches Type 3 clones
  • Seamless integration with IntelliJ refactoring tools
  • No external tooling or configuration needed
  • Cross-module detection within a project

Cons:

  • Requires JetBrains IDE - not available in VS Code or other editors
  • Not a CI/CD tool - cannot enforce thresholds on PRs
  • On-the-fly detection only flags duplicates involving the file you are editing; a project-wide scan has to be started manually
  • Ultimate license required for full duplication analysis
  • No historical tracking or trending

Best for: Individual developers using JetBrains IDEs who want real-time duplication awareness.

12. Coverity - deep analysis for safety-critical code

Coverity (now part of Black Duck, the former Synopsys Software Integrity Group) includes duplication detection as part of its enterprise static analysis platform. Coverity is the standard for safety-critical industries - automotive, aerospace, medical devices, and embedded systems.

How it works. Coverity performs deep interprocedural analysis that includes clone detection. Its engine builds a comprehensive model of the entire codebase, including call graphs and data flow, which enables it to detect structural clones that simpler tools miss. It focuses on clones that are likely to cause defects - duplicated code with inconsistent error handling or boundary checks.

What sets it apart. Coverity does not just find duplicated code - it finds duplicated code that is dangerous. Its defect-oriented approach means it prioritizes clones where one copy has been patched but others have not, which is exactly the pattern that causes real-world bugs.

Languages: C, C++, Java, C#, JavaScript, TypeScript, Python, Ruby, Go, Kotlin, Swift, and others.

Pricing: Enterprise-only. Typically $50,000+ per year depending on codebase size and seats.

Pros:

  • Defect-focused duplication detection - finds clones that cause bugs
  • Deep interprocedural analysis catches complex structural clones
  • Industry standard for safety-critical code (ISO 26262, DO-178C)
  • Finds inconsistently patched clones that other tools miss
  • Comprehensive reporting for compliance requirements

Cons:

  • Extremely expensive - not practical for small teams
  • Slow scan times due to deep analysis
  • Complex deployment and configuration
  • Overkill for web applications or non-critical software
  • No free tier or community edition

Best for: Enterprise teams working on safety-critical software where finding dangerous clones is more important than finding all clones.

13. Semgrep - pattern-based duplicate detection with custom rules

Semgrep takes a different approach to duplication. Rather than scanning for arbitrary matching code blocks, Semgrep lets you define patterns that match specific types of duplication in your codebase. This is not traditional clone detection - it is targeted pattern matching that catches the duplication patterns that matter most to your team.

How it works. You write Semgrep rules using a pattern syntax that matches code structure rather than exact text. For example, you can write a rule that detects when the same error handling pattern is duplicated across multiple catch blocks, or when identical validation logic appears in multiple API endpoints. Semgrep matches against the AST, so it catches renamed variables and reformatted code.

What sets it apart. The custom rule approach means you focus on the duplication that actually causes problems in your codebase. Instead of a noisy report showing every duplicated three-line block, you get targeted findings for the specific patterns you care about.

Languages: 30+ including Python, JavaScript, TypeScript, Java, Go, Ruby, C, C++, Rust, PHP, Kotlin, Swift, and more.

Pricing: Open-source CLI is free (LGPL-2.1). Team tier available for PR integration, and enterprise pricing for advanced features.

Pros:

  • Custom rules target the specific duplication patterns that matter
  • AST-based matching catches renamed and reformatted clones
  • Extremely fast - scans most codebases in seconds
  • Huge community rule library with 3,000+ pre-built rules
  • Free for commercial use and CI/CD integration

Cons:

  • Not a traditional clone detector - requires writing rules for specific patterns
  • Will not generate a comprehensive duplication report like PMD CPD
  • No built-in duplication percentage metric
  • Requires learning Semgrep's pattern syntax

Best for: Teams that know what duplication patterns to target and want precise, low-noise detection.

How to choose the right duplicate code checker

The right tool depends on what you need:

For a quick CLI scan with no setup, use PMD CPD. It is free, fast, and integrates with every build tool. Run pmd cpd --minimum-tokens 100 --dir src/ and you have a report in seconds. If you work with many languages, jscpd is the better choice for its broader tokenizer support.

For CI/CD enforcement, SonarQube, Codacy, or DeepSource give you quality gates that block PRs when duplication thresholds are exceeded. Codacy and DeepSource are fully managed, while SonarQube requires self-hosting (unless you use SonarQube Cloud, formerly SonarCloud).

For IDE-level awareness, IntelliJ IDEA's built-in duplication detection catches clones as you type. This prevents duplication before it is committed rather than catching it after the fact.

For deep structural analysis, CloneDR and Coverity detect Type 3 clones that token-based tools miss, and CloneDR can also find some Type 4 clones. These are expensive options but essential for safety-critical codebases.

For AI-assisted refactoring, CodeAnt AI and DeepSource go beyond detection by suggesting how to refactor duplicated code. This bridges the gap between finding duplication and actually fixing it.

For targeted pattern matching, Semgrep lets you define rules for the specific duplication patterns that cause problems in your codebase. This is the lowest-noise approach but requires upfront effort to write rules.

Setting up duplicate code detection in CI

Here is a practical example of adding duplication checks to a GitHub Actions pipeline using PMD CPD:

name: Duplication Check
on: [pull_request]

jobs:
  cpd:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download PMD
        run: |
          wget https://github.com/pmd/pmd/releases/download/pmd_releases%2F7.9.0/pmd-dist-7.9.0-bin.zip
          unzip pmd-dist-7.9.0-bin.zip
      - name: Run CPD
        run: |
          # PMD 7's CPD exits with a non-zero status when it finds duplications,
          # so the job fails without an extra flag.
          ./pmd-bin-7.9.0/bin/pmd cpd \
            --minimum-tokens 100 \
            --dir src/ \
            --format xml

For jscpd, the setup is even simpler:

name: Duplication Check
on: [pull_request]

jobs:
  jscpd:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npx jscpd src/ --threshold 5 --reporters console

The --threshold 5 flag fails the check when duplication exceeds 5% of total lines.
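The arithmetic behind a percentage threshold is worth spelling out, because it explains why a large codebase can absorb a lot of new duplication before the gate trips. This is a minimal sketch of the calculation only. `duplicationPercent` and `gateExitCode` are hypothetical helpers, and how jscpd itself counts duplicated and total lines is not reproduced here.

```javascript
// Duplication percentage = duplicated lines / total lines * 100.
function duplicationPercent(duplicatedLines, totalLines) {
  if (totalLines === 0) return 0;
  return (duplicatedLines / totalLines) * 100;
}

// Exit code a CI step would use: 0 passes, 1 fails the job.
function gateExitCode(duplicatedLines, totalLines, thresholdPercent) {
  return duplicationPercent(duplicatedLines, totalLines) > thresholdPercent ? 1 : 0;
}

console.log(gateExitCode(120, 4000, 5)); // 3% duplication, under the 5% threshold → 0
console.log(gateExitCode(300, 4000, 5)); // 7.5% duplication, over the threshold → 1
```

With a whole-codebase percentage, 4,000 existing lines leave room for 200 duplicated lines before a 5% gate fails, which is one reason gates scoped to new code tend to be stricter in practice.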

The bottom line

Duplicate code detection is a solved problem at the tool level. PMD CPD and jscpd are free, fast, and effective for token-based detection. SonarQube, Codacy, and DeepSource bundle duplication checks into broader quality platforms. CloneDR and Coverity provide deep structural analysis for teams that need it.

The unsolved problem is actually fixing duplication once you find it. That is where the newer tools - CodeAnt AI with its AI-powered suggestions, DeepSource with Autofix, and IntelliJ with its refactoring tools - are pushing the field forward. Detection without a path to remediation just creates a backlog of tech debt that nobody addresses.

My recommendation for most teams: start with PMD CPD or jscpd in CI to establish a baseline and prevent new duplication. If you are already using SonarQube, Codacy, or DeepSource, enable their duplication checks and set quality gates. If you are working on safety-critical software, invest in Coverity's defect-focused approach. And regardless of what you use in CI, enable IntelliJ's duplication detection in your IDE - catching clones before they are committed is always cheaper than catching them after.

Frequently Asked Questions

What is the best free duplicate code checker?

PMD CPD is the best free duplicate code checker for most teams. It supports 20+ languages, detects Type 1 and Type 2 clones, integrates with every major build tool, and runs locally without any account or server. For JavaScript and TypeScript projects, jscpd is an excellent alternative with built-in HTML reporting and CI-friendly exit codes. SonarQube Community Build also includes copy-paste detection at no cost, though it requires a self-hosted server.

What are the four types of code clones?

Type 1 clones are exact copies with only whitespace and comment differences. Type 2 clones are syntactically identical but with renamed variables, changed types, or modified literals. Type 3 clones are near-miss copies where statements have been added, removed, or reordered. Type 4 clones are semantically equivalent but syntactically different - two functions that produce the same output using completely different logic. Most tools detect Type 1 and 2 reliably. Reliable Type 3 detection generally requires AST-based or other structural analysis. Type 4 detection remains a research problem, though AI-powered tools are making progress.

How much code duplication is acceptable?

Most industry benchmarks consider less than 3-5% duplication acceptable for a healthy codebase. SonarQube's default quality gate flags code with more than 3% duplication on new code. However, context matters - some duplication is intentional and acceptable, such as test setup code or generated files. The key metric is whether duplicated code is actively maintained in multiple places, creating a risk of inconsistent changes. Zero duplication is not a realistic or even desirable target.

What is the DRY principle and why does it matter?

DRY stands for Don't Repeat Yourself, a software engineering principle stating that every piece of knowledge should have a single, authoritative representation in a system. Violating DRY by copy-pasting code creates maintenance burden - when a bug is fixed in one copy, all other copies must be found and updated. Studies show that inconsistent changes to cloned code account for a significant percentage of bugs in large codebases. DRY is not about eliminating all similar-looking code - it is about ensuring that business logic and domain knowledge are not scattered across multiple locations.

Can duplicate code checkers run in CI/CD pipelines?

Yes, most modern duplicate code checkers integrate with CI/CD pipelines. PMD CPD, jscpd, and Duplo run as CLI tools that return non-zero exit codes when duplication thresholds are exceeded, making them easy to add to any pipeline. SonarQube, Codacy, DeepSource, and CodeAnt AI provide native GitHub Actions and GitLab CI integrations that comment duplication findings directly on pull requests. The key is configuring appropriate thresholds so the check catches meaningful duplication without blocking every PR over minor similarities.

What is the difference between token-based and AST-based clone detection?

Token-based detection (used by PMD CPD, Simian, and jscpd) breaks source code into a stream of tokens and finds matching sequences. It is fast and needs only a lexer for each language, but it only reliably detects Type 1 and Type 2 clones. AST-based detection (used by CloneDR and IntelliJ IDEA, and in part by SonarQube and DeepSource) parses code into abstract syntax trees and compares subtrees, which catches Type 3 clones where code has been modified or reordered. AST-based tools are slower but more accurate for detecting near-miss duplicates that token matching misses.

