
alexey.zh


Finding Structurally Duplicate Go Functions with AST Hashing

#go

Copy-paste is one of those engineering problems that starts out harmless.

You copy one handler.
You rename User to Order.
You change a few string literals.
You adjust one validation branch.
Everything works.

Then six months later, the same bug is fixed in one copy but not the other.

That is the kind of duplication godedup is built to find.

godedup is a small Go CLI tool that finds structurally duplicate functions in Go code. It does not compare raw text. It does not care if variable names changed. It does not care if string literals changed. It looks at the shape of the code.

In other words, it tries to answer a practical question:

“Did I copy this function and slightly edit it?”


The problem with text-based duplication checks

A simple duplicate detector can compare lines of code.

That works when two functions are literally identical.

But real copy-paste rarely stays identical.

It usually becomes this:

func validateUser(userName string) error {
    if userName == "" {
        return errors.New("empty user name")
    }
    return nil
}

and this:

func validateOrder(orderName string) error {
    if orderName == "" {
        return errors.New("empty order name")
    }
    return nil
}

Textually, these are different.

Structurally, they are almost the same function.

That distinction is the whole point of godedup.


What godedup does

godedup scans Go code and reports clone groups: sets of functions that have the same or very similar structure.

Example:

godedup ./...

Default text output:

godedup: found 2 clone group(s) (1 exact, 1 near)

=== clone group 1  [EXACT 100% similarity] ===
  storage.(UserStore).validate
    pkg/storage/user.go:89  (12 stmts, 28 lines)
  storage.(OrderStore).validate
    pkg/storage/order.go:134  (12 stmts, 28 lines)

=== clone group 2  [NEAR 91% similarity] ===
  api.handleUserCreate
    pkg/api/user.go:44  (18 stmts, 45 lines)
  api.handleOrderCreate
    pkg/api/order.go:51  (19 stmts, 47 lines)

suggestion: extract shared logic into a common function
            or use generics if types differ

There is also a table output mode:

godedup --output=table ./...

Example:

GROUP  TYPE   SIM   FUNCTION                       LOCATION                  STMTS  LINES
-----  -----  ----  -----------------------------  ------------------------  -----  -----
1      EXACT  100%  storage.(UserStore).validate   pkg/storage/user.go:89    12     28
1      EXACT  100%  storage.(OrderStore).validate  pkg/storage/order.go:134  12     28
2      NEAR   91%   api.handleUserCreate           pkg/api/user.go:44        18     45
2      NEAR   91%   api.handleOrderCreate          pkg/api/order.go:51       19     47

The file.go:line location format is intentional. Many terminals, editors, and IDEs can open these locations directly.


The main purpose

godedup is not trying to be a full static analyzer.

It has a narrower goal:

Find duplicated Go functions that are structurally the same, even when superficial details changed.

That makes it useful for:

  • finding copy-pasted HTTP handlers
  • finding repeated validation methods
  • spotting repeated storage/repository methods
  • detecting repeated error-handling patterns
  • checking pull requests for newly introduced duplication
  • keeping large Go projects easier to refactor

The tool is especially useful in backend projects where many functions follow similar flows:

decode request
validate input
call service
handle error
write response

or:

start transaction
load entity
check permissions
update row
commit

Sometimes that structure is intentional. Sometimes it is accidental duplication. godedup helps you see it.


Install

go install github.com/hashmap-kz/godedup@latest

Then run:

godedup ./...

Usage

godedup [flags] [path ...]

Examples:

godedup ./...
godedup --exact ./...
godedup --min-similarity 0.90 ./pkg/...
godedup --output=table ./...
godedup --output=json ./... | jq .

Common flags:

--min-similarity float   minimum similarity threshold for near clones
--min-stmts int          minimum statements in a function to analyze
--exact                  report only exact structural clones
--no-tests               exclude test files
--output string          output format: text, table, json
--version                print version and exit

godedup exits with code 1 when duplicates are found, which makes it convenient for CI.

Example GitHub Actions step:

- name: Check duplicate Go functions
  run: godedup --exact ./...

For stricter checks that also flag near clones:

- name: Check near-duplicate Go functions
  run: godedup --min-similarity 0.90 --min-stmts 5 ./...

How it works internally

At a high level, godedup does this:

Go source files
    ↓
go/parser
    ↓
AST
    ↓
function extraction
    ↓
normalized structural hashing
    ↓
exact clone detection
    ↓
near clone detection
    ↓
report

The important part is that it works on the Go AST, not raw source text.

Go already ships with a parser in the standard library:

import (
    "go/ast"
    "go/parser"
    "go/token"
)

That means godedup can parse Go code into syntax trees and reason about code structure.
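
As a rough illustration of the parsing and function-extraction steps, here is a minimal sketch built only on the standard library. It is not godedup's actual code: the listFuncs helper, the demo.go file name, and the embedded source are all hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// src is a hypothetical input file used only for this illustration.
const src = `package demo

import "errors"

type Store struct{}

func (s *Store) Validate(name string) error {
	if name == "" {
		return errors.New("empty name")
	}
	return nil
}

func helper() {}
`

// listFuncs parses src and returns one "file:line name" entry for every
// function or method declaration that has a body.
func listFuncs(filename, src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, parser.SkipObjectResolution)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue // not a function, or a declaration without a body
		}
		pos := fset.Position(fn.Pos())
		out = append(out, fmt.Sprintf("%s:%d %s", pos.Filename, pos.Line, fn.Name.Name))
	}
	return out, nil
}

func main() {
	funcs, err := listFuncs("demo.go", src)
	if err != nil {
		panic(err)
	}
	for _, f := range funcs {
		fmt.Println(f)
	}
}
```

The token.FileSet is what turns AST positions back into file.go:line locations like the ones shown in the reports above.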


AST-based detection

Consider this function:

func validateUser(user User) error {
    if user.Name == "" {
        return errors.New("empty user name")
    }

    if user.Email == "" {
        return errors.New("empty email")
    }

    return nil
}

And this one:

func validateOrder(order Order) error {
    if order.Name == "" {
        return errors.New("empty order name")
    }

    if order.Code == "" {
        return errors.New("empty code")
    }

    return nil
}

A text comparison sees different identifiers and literals:

user        vs order
User        vs Order
Email       vs Code
"empty..."  vs "empty..."

But the AST shape is similar:

function
  if statement
    selector expression
    string comparison
    return call expression
  if statement
    selector expression
    string comparison
    return call expression
  return nil

godedup focuses on that shape.


Normalization

Before hashing, godedup normalizes away details that usually do not matter for structural duplication.

Examples:

Source detail               Normalized idea
--------------------------  -------------------
userID, orderID, productID  identifier
"users", "orders"           string literal
10, 42, 999                 numeric literal
fmt.Println, log.Println    selector call shape

But it preserves details that affect behavior and structure.

Examples:

Source detail               Preserved
--------------------------  ---------
statement order             yes
control flow                yes
operators like +, -, ==     yes
nil, true, false            yes
if, for, switch, defer, go  yes

This gives a useful middle ground.

It is not pure text comparison.
It is also not full semantic equivalence.
It is structural comparison.
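
To make the normalization idea concrete, here is a minimal sketch of a normalized structural hash. It is an assumption about how such a hash could be built, not godedup's real implementation: the shapeHash and hashFuncs helpers, the label strings, and the choice of FNV-1a are all made up for this example. Identifiers collapse to a single label, literals keep only their kind, and operators and nil/true/false are preserved.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"hash/fnv"
)

// src holds three hypothetical functions: the first two differ only in
// names and literals, the third also changes one operator.
const src = `package demo

import "errors"

func validateUser(user User) error {
	if user.Name == "" {
		return errors.New("empty user name")
	}
	return nil
}

func validateOrder(order Order) error {
	if order.Name == "" {
		return errors.New("empty order name")
	}
	return nil
}

func rejectNamed(user User) error {
	if user.Name != "" {
		return errors.New("unexpected name")
	}
	return nil
}
`

// shapeHash hashes the normalized structure of node (illustrative only).
func shapeHash(node ast.Node) uint64 {
	h := fnv.New64a()
	ast.Inspect(node, func(n ast.Node) bool {
		var label string
		switch x := n.(type) {
		case nil:
			label = ")" // Inspect passes nil after a node's children
		case *ast.Ident:
			label = "ID(" // names are normalized away
			if x.Name == "nil" || x.Name == "true" || x.Name == "false" {
				label = x.Name + "(" // these identifiers are preserved
			}
		case *ast.BasicLit:
			label = x.Kind.String() + "(" // keep the kind, drop the value
		case *ast.BinaryExpr:
			label = "Binary" + x.Op.String() + "("
		case *ast.UnaryExpr:
			label = "Unary" + x.Op.String() + "("
		case *ast.AssignStmt:
			label = "Assign" + x.Tok.String() + "("
		default:
			label = fmt.Sprintf("%T(", n) // node type only, e.g. *ast.IfStmt
		}
		h.Write([]byte(label))
		return n != nil
	})
	return h.Sum64()
}

// hashFuncs returns the shape hash of every function declared in src.
func hashFuncs(src string) (map[string]uint64, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", src, parser.SkipObjectResolution)
	if err != nil {
		return nil, err
	}
	out := make(map[string]uint64)
	for _, decl := range file.Decls {
		if fn, ok := decl.(*ast.FuncDecl); ok {
			out[fn.Name.Name] = shapeHash(fn)
		}
	}
	return out, nil
}

func main() {
	hashes, err := hashFuncs(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(hashes["validateUser"] == hashes["validateOrder"]) // true
	fmt.Println(hashes["validateUser"] == hashes["rejectNamed"])   // false
}
```

A real tool would likely handle more operator-carrying nodes (for example increments and branch statements); they are left out here to keep the sketch short.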


Exact clones

Exact clones are functions that produce the same normalized structural hash.

For example:

func hashExprList(items []ast.Expr) []uint64 {
    var out []uint64
    for _, item := range items {
        out = append(out, hashExpr(item))
    }
    return out
}

and:

func hashStmtList(items []ast.Stmt) []uint64 {
    var out []uint64
    for _, item := range items {
        out = append(out, hashStmt(item))
    }
    return out
}

These functions are not textually identical. The types and called functions differ.

But structurally they are the same:

declare slice
range over input
append transformed item
return slice

A report might look like this:

GROUP  TYPE   SIM   FUNCTION                     LOCATION                   STMTS  LINES
-----  -----  ----  ---------------------------  -------------------------  -----  -----
1      EXACT  100%  hash.(*Hasher).hashExprList  internal/hash/hash.go:288  3      7
1      EXACT  100%  hash.(*Hasher).hashStmtList  internal/hash/hash.go:296  3      7

This is the kind of duplication that can often be replaced with a generic helper.
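
Once every function has a normalized hash, finding exact clones is mostly a grouping problem. The sketch below shows one way that grouping could look. The funcInfo type, the exactGroups helper, and the hash values are hypothetical placeholders, not godedup's data model.

```go
package main

import (
	"fmt"
	"sort"
)

// funcInfo pairs a function name with its normalized structural hash.
// Both the type and the hash values below are made up for illustration.
type funcInfo struct {
	Name string
	Hash uint64
}

// exactGroups buckets functions by hash and keeps only the buckets that
// contain more than one function. Results are sorted for stable output.
func exactGroups(funcs []funcInfo) [][]string {
	byHash := make(map[uint64][]string)
	for _, f := range funcs {
		byHash[f.Hash] = append(byHash[f.Hash], f.Name)
	}
	var groups [][]string
	for _, names := range byHash {
		if len(names) > 1 {
			sort.Strings(names)
			groups = append(groups, names)
		}
	}
	sort.Slice(groups, func(i, j int) bool { return groups[i][0] < groups[j][0] })
	return groups
}

func main() {
	funcs := []funcInfo{
		{"hashExprList", 0xA1},
		{"hashStmtList", 0xA1},
		{"parseFile", 0xB2},
	}
	for i, g := range exactGroups(funcs) {
		fmt.Printf("clone group %d: %v\n", i+1, g)
	}
}
```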


Near clones

Exact clones are useful, but they are not enough.

Real copy-paste often changes one or two statements.

For example:

func handleUserCreate(w http.ResponseWriter, r *http.Request) {
    var req CreateUserRequest

    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        writeError(w, err)
        return
    }

    if err := validateUser(req); err != nil {
        writeError(w, err)
        return
    }

    user, err := users.Create(r.Context(), req)
    if err != nil {
        writeError(w, err)
        return
    }

    writeJSON(w, user)
}

And:

func handleOrderCreate(w http.ResponseWriter, r *http.Request) {
    var req CreateOrderRequest

    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        writeError(w, err)
        return
    }

    if req.DryRun {
        writeJSON(w, map[string]bool{"ok": true})
        return
    }

    if err := validateOrder(req); err != nil {
        writeError(w, err)
        return
    }

    order, err := orders.Create(r.Context(), req)
    if err != nil {
        writeError(w, err)
        return
    }

    writeJSON(w, order)
}

These are not exact clones. One has an extra DryRun branch.

But most of the statement sequence is still similar.

For this, godedup uses near-clone detection.


Statement hash sequences

Instead of comparing only one hash for the whole function, godedup can represent a function as a sequence of statement hashes.

For example:

handleUserCreate:
  [decode, decode-error-return, validate, validate-error-return, create, create-error-return, write-json]

handleOrderCreate:
  [decode, decode-error-return, dry-run, validate, validate-error-return, create, create-error-return, write-json]

Then the problem becomes sequence similarity.

How many edits are needed to transform one statement sequence into another?

That is where edit distance comes in.
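
The per-statement sequences described in this section can be sketched roughly as follows. This is a simplified stand-in, not godedup's implementation: the stmtHashes and stmtSequences helpers and the sample functions are hypothetical, and each statement is fingerprinted only by the types of its AST nodes (so, unlike the normalization described earlier, operators are not distinguished here).

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"hash/fnv"
)

// src holds two hypothetical functions; createOrder has one extra statement.
const src = `package demo

func createUser(name string) error {
	if name == "" {
		return errEmpty
	}
	id := nextID()
	return save(id, name)
}

func createOrder(code string, dryRun bool) error {
	if code == "" {
		return errEmpty
	}
	if dryRun {
		return nil
	}
	n := nextID()
	return save(n, code)
}
`

// stmtHashes returns one hash per top-level statement of fn's body.
// Only node types are hashed, so names and literal values never matter.
func stmtHashes(fn *ast.FuncDecl) []uint64 {
	var seq []uint64
	for _, stmt := range fn.Body.List {
		h := fnv.New64a()
		ast.Inspect(stmt, func(n ast.Node) bool {
			if n == nil {
				h.Write([]byte(")")) // closes the current node's children
				return false
			}
			fmt.Fprintf(h, "%T(", n)
			return true
		})
		seq = append(seq, h.Sum64())
	}
	return seq
}

// stmtSequences maps each function name in src to its statement hashes.
func stmtSequences(src string) (map[string][]uint64, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", src, parser.SkipObjectResolution)
	if err != nil {
		return nil, err
	}
	out := make(map[string][]uint64)
	for _, decl := range file.Decls {
		if fn, ok := decl.(*ast.FuncDecl); ok && fn.Body != nil {
			out[fn.Name.Name] = stmtHashes(fn)
		}
	}
	return out, nil
}

func main() {
	seqs, err := stmtSequences(src)
	if err != nil {
		panic(err)
	}
	a, b := seqs["createUser"], seqs["createOrder"]
	fmt.Println(len(a), len(b))                              // 3 4
	fmt.Println(a[0] == b[0], a[1] == b[2], a[2] == b[3]) // true true true
}
```

The three shared statements hash identically in both functions; only the inserted dryRun branch has no counterpart.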


Edit distance for near clones

For near clones, godedup compares per-statement hash sequences using edit distance.

Conceptually:

A = [decode, validate, create, respond]
B = [decode, dry-run, validate, create, respond]

The sequences are almost the same. B has one extra statement.

A single insertion turns A into B, so the edit distance is 1 and the similarity is high.

A simple similarity score can be derived from edit distance:

similarity = 1 - edits / max(len(A), len(B))

So if two functions each have 20 statements and differ by 2 edits:

similarity = 1 - 2/20 = 0.90

That becomes a 90% near clone.

This is not trying to understand program meaning. It is a practical heuristic that works well for copy-paste detection.
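
For readers who want to see the arithmetic, here is a small sketch of the similarity formula above using classic Levenshtein distance. The editDistance and similarity helpers are illustrative names, and readable string labels stand in for statement hashes; how godedup computes its distance internally is not shown in this post.

```go
package main

import "fmt"

// editDistance returns the Levenshtein distance between two sequences:
// the minimum number of insertions, deletions, and substitutions needed
// to turn a into b.
func editDistance(a, b []string) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			best := prev[j-1] // substitution, free when the items match
			if a[i-1] != b[j-1] {
				best++
			}
			if d := prev[j] + 1; d < best { // deletion from a
				best = d
			}
			if d := cur[j-1] + 1; d < best { // insertion into a
				best = d
			}
			cur[j] = best
		}
		prev = cur
	}
	return prev[len(b)]
}

// similarity applies: 1 - edits / max(len(a), len(b)).
func similarity(a, b []string) float64 {
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	if longest == 0 {
		return 1 // two empty sequences are trivially identical
	}
	return 1 - float64(editDistance(a, b))/float64(longest)
}

func main() {
	a := []string{"decode", "validate", "create", "respond"}
	b := []string{"decode", "dry-run", "validate", "create", "respond"}
	fmt.Println(editDistance(a, b))           // 1
	fmt.Printf("%.2f\n", similarity(a, b)) // 0.80
}
```

For the A and B sequences above, one edit over a maximum length of 5 gives 1 - 1/5 = 0.80.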


Why not compare raw tokens?

Token comparison is better than line comparison, but it still carries a lot of surface-level detail.

For example, these calls are token-different:

fmt.Println("created user")
log.Println("created order")

But structurally they are both selector calls with a string argument.

For copy-paste detection, that shape is often more important than the exact package or literal value.

The AST gives a cleaner signal.


What it finds well

godedup is good at finding functions that share a skeleton.

Examples:

[+] two validate methods with identical logic
[+] HTTP handlers with the same flow
[+] storage methods repeated across entities
[+] copy-pasted error handling
[+] small framework-like patterns that were duplicated manually
[+] functions where names and literals were changed after copy-paste

This is especially useful in Go because Go code often favors explicitness. That is a strength of the language, but it can also lead to repeated boilerplate.

godedup helps identify where explicit code has quietly become duplicated code.


What it intentionally does not do

godedup is intentionally simple.

It does not try to prove that two functions are semantically equivalent.

These may not match:

[-] same behavior implemented with completely different control flow
[-] duplicated types or constants
[-] duplicated code spread across many tiny helper functions
[-] functions below the minimum statement threshold
[-] generated code, unless included in scanned paths

That simplicity is intentional.

The goal is not to build a compiler-grade equivalence checker. The goal is to find practical duplication that is worth reviewing.


Why minimum statement count matters

By default, very short functions are ignored.

Why?

Because tiny functions often look structurally similar by accident.

For example:

func Name() string {
    return "user"
}

and:

func Type() string {
    return "order"
}

These are structurally identical, but reporting thousands of tiny getters would be noise.

That is why godedup has:

--min-stmts

Example:

godedup --min-stmts 5 ./...

This means:

Only analyze functions with at least 5 statements.

For large projects, increasing this value can make reports much more useful.
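
A size filter like this is simple to picture in code. The sketch below assumes that "statements" means the top-level statements of the function body; the post does not spell out exactly how godedup counts them, so treat that as an assumption. The funcsAtLeast helper and the sample source are hypothetical.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// src holds a one-statement getter and a slightly larger function.
const src = `package demo

func Name() string {
	return "user"
}

func Load(id int) (string, error) {
	row, err := find(id)
	if err != nil {
		return "", err
	}
	return row.Name, nil
}
`

// funcsAtLeast returns the names of functions whose bodies contain at
// least minStmts top-level statements (an assumed counting rule).
func funcsAtLeast(src string, minStmts int) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "demo.go", src, parser.SkipObjectResolution)
	if err != nil {
		return nil, err
	}
	var kept []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		if len(fn.Body.List) >= minStmts {
			kept = append(kept, fn.Name.Name)
		}
	}
	return kept, nil
}

func main() {
	kept, err := funcsAtLeast(src, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(kept) // [Load]
}
```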


Exact-only mode

Sometimes near clones are too noisy.

For a first CI integration, exact structural clones are easier to start with:

godedup --exact ./...

This reports only functions with identical normalized structure.

That makes it a good low-risk mode for introducing the tool into an existing repository.

Later, you can enable near-clone detection:

godedup --min-similarity 0.90 ./...

JSON output

For automation, JSON output is useful:

godedup --output=json ./...

That makes it possible to build custom checks, dashboards, CI comments, or pull request annotations.

Example:

godedup --output=json ./... | jq '.[] | select(.exact)'

A team could use JSON output to:

  • fail only on exact clones
  • ignore specific paths
  • build a duplication trend report
  • comment on pull requests
  • track duplicated hotspots over time

Table output for humans

For local use, table output is often the nicest:

godedup --output=table ./...

Example:

GROUP  TYPE   SIM   FUNCTION                       LOCATION                  STMTS  LINES
-----  -----  ----  -----------------------------  ------------------------  -----  -----
1      EXACT  100%  storage.(UserStore).validate   pkg/storage/user.go:89    12     28
1      EXACT  100%  storage.(OrderStore).validate  pkg/storage/order.go:134  12     28

The LOCATION column is deliberately formatted as:

path/to/file.go:line

This is a small detail, but it matters.

Many tools recognize that format. You can often click it in terminal output, CI logs, or IDE-integrated consoles.


Why this can be useful in code review

Duplication is easiest to fix when it is introduced.

Once duplicated code lives for months, it starts to diverge. One copy gets a bug fix. Another gets a new branch. A third copy gets logging. Eventually it becomes unclear whether the differences are intentional.

Running godedup in CI can catch this earlier.

For example:

- name: Check structural duplicates
  run: godedup --exact --no-tests ./...

That gives a team a simple rule:

Exact structural clones should be reviewed before merging.

Not every clone needs to be removed. Sometimes duplication is acceptable. But it should be visible.


What to do when godedup finds a clone

The answer is not always “extract a function”.

Sometimes extraction makes code worse.

Good options include:

Extract a helper

Useful when the duplicate logic is truly shared.

func validateRequired(name, value string) error {
    if value == "" {
        return fmt.Errorf("%s is required", name)
    }
    return nil
}

Use generics

Useful when the algorithm is identical but the types differ.

func mapList[T any, R any](items []T, fn func(T) R) []R {
    out := make([]R, 0, len(items))
    for _, item := range items {
        out = append(out, fn(item))
    }
    return out
}

Extract a handler skeleton

Useful for repeated HTTP handlers.

func handleCreate[TReq any, TResp any](
    w http.ResponseWriter,
    r *http.Request,
    validate func(TReq) error,
    create func(context.Context, TReq) (TResp, error),
) {
    var req TReq

    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        writeError(w, err)
        return
    }

    if err := validate(req); err != nil {
        writeError(w, err)
        return
    }

    resp, err := create(r.Context(), req)
    if err != nil {
        writeError(w, err)
        return
    }

    writeJSON(w, resp)
}

Leave it duplicated

Sometimes this is the right answer.

If two pieces of code are similar today but expected to evolve independently, keeping them separate can be better.

The value of godedup is not that it forces refactoring. The value is that it makes duplication visible.


Internal design trade-offs

The interesting part of a tool like this is choosing what not to do.

It does not use type checking

Type checking would provide more semantic information, but it also makes the tool heavier.

For duplication detection, syntax structure is often enough.

Using the parser and AST keeps the tool fast and simple.

It does not try to understand business meaning

Two functions may be semantically identical even if they have different structure.

Detecting that reliably is much harder.

godedup instead targets the common and practical case: copy-paste with small edits.

It normalizes aggressively

Names and literals are ignored because they often change during copy-paste.

This can produce false positives if two functions share a shape but are conceptually unrelated. That is why thresholds like --min-stmts and --min-similarity matter.

It reports clone groups, not individual warnings

A clone is only meaningful in relation to another function.

Grouping duplicate functions together makes the report easier to understand.


A simplified algorithm

The rough algorithm looks like this:

for each Go file:
    parse file into AST

    for each function declaration in the file:
        if function has fewer than min statements:
            skip it

        normalized_hash = hash(normalized AST)
        statement_hashes = hash of each normalized statement

group functions by normalized_hash
    groups with more than one function are exact clones

for each pair of remaining functions:
    compare their statement_hashes sequences
    calculate similarity from edit distance
    if similarity >= threshold:
        report as near clone

The exact clone path is fast because it is mostly hashing and grouping.

The near clone path is more expensive because it compares sequences, so thresholds are important.


Why Go is a nice language for this tool

Go is a good target for structural duplicate detection because:

  • it has a standard parser
  • the AST is available in the standard library
  • formatting is consistent thanks to gofmt
  • code is usually explicit
  • many backend projects follow repeated structural patterns

You do not need to write a custom parser. The Go toolchain already provides the foundation.

That makes it possible to build useful static analysis tools with surprisingly little infrastructure.


Example workflow

A practical workflow could be:

First, run exact-only mode:

godedup --exact ./...

Fix or ignore obvious clones.

Then raise the minimum function size:

godedup --min-stmts 8 ./...

This focuses on larger functions.

Then enable near-clone detection:

godedup --min-similarity 0.90 --min-stmts 8 ./...

Finally, add CI:

- name: Check duplicate Go functions
  run: godedup --exact --no-tests ./...

This approach avoids introducing too much noise at once.


Future ideas

There are many directions this kind of tool can grow.

Some useful ones:

  • Markdown output for GitHub job summaries
  • baseline files to fail only on new duplication
  • package-level duplication summary
  • ignore patterns for generated files
  • severity levels based on duplicated line count
  • GitHub Actions annotations
  • better near-clone difference hints
  • HTML reports for large repositories

But the core should stay simple:

Parse Go code, normalize function structure, find suspicious duplication, report it clearly.


Conclusion

godedup is a small tool for a specific kind of code smell: structurally duplicated Go functions.

It is not a replacement for code review.
It is not a semantic analyzer.
It is not a refactoring engine.

It is a flashlight.

It shows places where code looks copy-pasted, even after names and literals changed.

That is often enough to start a useful refactoring conversation.

If you work on Go services with many handlers, repositories, validators, or repeated workflows, a structural duplicate detector can be a surprisingly helpful addition to your toolbox.

go install github.com/hashmap-kz/godedup@latest
godedup ./...
