David Essien

Posted on Jun 2

Code Fingerprinting: Detecting Duplicate Submissions Without Losing Your Mind (or Your API Budget)

#api #backend #performance #security

While building LogicVisor's review system, I was faced with a question I hadn't thought about before: what would happen if users submitted duplicate algorithm solutions?

I decided to test it myself. I submitted a solution to a simple palindrome challenge. The request went through, my solution was reviewed correctly. Normal happy stuff. Then I submitted the exact same solution a second time. And the normal happy stuff happened again. But I was far from happy.

That duplicate submission meant more tokens spent, more database memory consumed, more remote procedure calls, more AI calls, and more waiting time. For a problem that had already been solved.

I shuddered at the thought of a bad actor effectively DOS-ing my site or running up my AI bill just by hammering the same solution repeatedly. Rate limiting controls the number of requests per client — but it has nothing to do with the content of those requests. A user could stay well within their rate limit and still abuse the system by submitting the same code over and over.

This wasn't going to work. I needed a way to identify duplicate submissions and return a cached response instead of processing everything from scratch. Which led me to a more interesting question: how do you actually detect identical code?

The Problem With "Identical"

Programming by its very nature allows us to solve the same problem in multiple ways. But even a single solution can be written in an infinite number of ways.

User A could create a function that takes and returns a string, and name it strawberries. User B could write the exact same function with the exact same logic and name it bananas. Different names, same implementation, same result.

For LogicVisor's purposes, these should be counted as duplicates. But how do you get a computer to see past the surface differences and recognize that two pieces of code are structurally the same?

I started thinking about it from first principles. At the end of the day, programming languages get compiled down to binary and then to electrical pulses. If two solutions solve the same problem the same way, they're ultimately firing the same set of pulses at the machine level. Variable names be damned.

So the question became: can I strip a solution down to its most fundamental representation and compare that?

Enter the Abstract Syntax Tree

More research led me to the Abstract Syntax Tree — something my OS professor had mentioned in school that I'd filed away as "probably never useful in web dev." No knowledge is wasted.

An Abstract Syntax Tree (AST) is a hierarchical, tree-shaped data structure that represents the syntactic structure of source code. It strips away superficial syntax — semicolons, curly braces, whitespace, comments — and keeps only the structural and semantic skeleton of the code. The shape of what you wrote, not the clothing it's wearing.

This was it. But I hit a wall almost immediately: I can't store and index an entire AST reliably in PostgreSQL. It's a complex nested structure, not a scalar value you can slap an index on.

Then I remembered an article about handling duplicate file uploads in Express with Multer. The trick was generating a SHA-256 hash of the file's content. A hash is a fixed-length fingerprint — the same input always produces the same output, and different inputs produce different outputs. Perfect.

The plan crystallized:

Parse the submitted code into an AST
Normalize the AST (rename variables, strip comments, etc.)
Hash the normalized output with SHA-256
Store and index that hash
On every new submission, compute the hash and check if it already exists

If the hash is in the database, return the cached review. No AI call. No wasted tokens. No waiting.

Two Parsers for the Price of One

LogicVisor supports eight languages: JavaScript, TypeScript, Python, Java, C, C++, Go, and Rust. Getting an AST for all of them required two different tools, each with their own approach to the world.

Babel Parser (for JavaScript and TypeScript)

Babel is something most JavaScript developers know as a transpiler — the thing that converts modern JS to something older browsers can run. But under the hood, Babel has a full-featured parser that produces a detailed AST of your JavaScript or TypeScript code. The @babel/parser package exposes this directly.

I'd used Babel's output for years but never touched the parser API directly. It's surprisingly clean. You hand it source code and options, and it gives you back a structured AST you can traverse and manipulate.

What made Babel the right choice for JS/TS specifically:

Native TypeScript support via a plugin — no separate toolchain
JSX support out of the box
The @babel/traverse package lets you walk the AST and modify nodes in place, which is exactly how I rename identifiers
@babel/generator converts the modified AST back to compact source code

The JS/TS canonicalization path: parse with Babel → traverse the AST and rename all non-builtin identifiers to generic names (v0, v1, v2...) → optionally sort top-level declarations → regenerate compact source → hash it.

Tree-sitter (for everything else)

Tree-sitter is a different beast. It's a parser generator — a tool for building fast, incremental, error-tolerant parsers for programming languages. You might have encountered it without knowing it: if you've used Neovim with syntax highlighting that actually understands your code structure (my first encounter with the tool), that's Tree-sitter. Same if you've used editors where renaming a variable highlights every reference correctly. It's doing real parsing, not just regex matching.

What makes Tree-sitter interesting for this use case is that it has community-maintained grammars for a huge range of languages — C, C++, Go, Rust, Java, Python, and many more. Each grammar is a separate package that teaches Tree-sitter how to parse that language. You install the grammar, set it on a parser instance, and Tree-sitter gives you a concrete syntax tree you can walk.

I hadn't really understood what Tree-sitter was before this. I'd seen it referenced in Neovim config files and assumed it was some Vim-specific thing. It's not — it's a general purpose parsing library with bindings for multiple languages including Node.js via the node-tree-sitter package.

The Tree-sitter path: parse the source → walk the resulting syntax tree with a DFS traversal → collect all identifier nodes by their byte ranges → build a name mapping (first-occurrence order) → apply replacements from the end of the file backwards (so byte indices don't shift) → normalize whitespace → hash it.

The backwards-replacement trick is worth calling out. When you're doing string replacements by index, if you replace something early in the file, every subsequent index is now wrong. By sorting replacements by start position descending and working from the end, you avoid that entire class of bug.

The Implementation

Here's how it all fits together.

The Simple Canonicalization (Pre-AST)

Before I landed on the AST approach, I had a simpler version that just handled whitespace and comments:

export function canonicalizeCode(code: string): string {
  // Remove multi-line comments
  let cleanedCode = code.replace(/\/\*[\s\S]*?\*\//g, "");

  // Remove single-line comments
  cleanedCode = cleanedCode.replace(/\/\/.*$/gm, "");

  // Normalize all whitespace to a single space
  cleanedCode = cleanedCode.replace(/\s+/g, " ").trim();

  return cleanedCode;
}

This catches the obvious cases — same code with different formatting or comments. But it falls apart the moment someone renames a variable. The AST approach handles that.

The Router

The main entry point checks the language and dispatches to the right parser:

export async function canonicalizeCodeAST(
  code: string,
  language: LanguageKey,
  options?: CanonicalizeOptions
): Promise<string> {
  const opts = { ...defaultOptions, ...(options || {}) };

  if (language === "javascript" || language === "typescript") {
    return canonicalizeJS(code, opts);
  }

  // C/C++/Go/Rust/Java/Python go through Tree-sitter
  return canonicalizeWithTreeSitter(code, language, null);
}

The Babel Path (JS/TS)

function canonicalizeJS(
  code: string,
  opts: Required<CanonicalizeOptions>
): string {
  const plugins: any[] = ["jsx"];
  if (opts) plugins.push("typescript");

  const ast = babelParser.parse(code, {
    sourceType: "module",
    plugins,
    allowReturnOutsideFunction: true,
  });

  // Rename all non-builtin identifiers to generic names
  const nameMap = new Map<string, string>();
  let counter = 0;

  traverse(ast as any, {
    Identifier(path) {
      if (!opts.normalizeVariableNames) return;

      const name = path.node.name;
      if (isBuiltinJsName(name)) return;

      if (!nameMap.has(name)) {
        nameMap.set(name, `v${counter++}`);
      }

      path.node.name = nameMap.get(name)!;
    },
  });

  // Optionally sort top-level declarations for order-independent comparison
  if (
    opts.sortTopLevel &&
    (ast as any).program &&
    Array.isArray((ast as any).program.body)
  ) {
    (ast as any).program.body.sort((a: any, b: any) => {
      const aStr = generate(a).code;
      const bStr = generate(b).code;
      return aStr.localeCompare(bStr);
    });
  }

  // Regenerate compact source from the modified AST
  return generate(ast as any, {
    comments: !opts.removeComments,
    compact: true,
  }).code;
}

The Tree-sitter Path (Everything Else)

function canonicalizeWithTreeSitter(
  code: string,
  language: LanguageKey,
  opts: Required<CanonicalizeOptions> | null
): string {
  const LangCtor = (treeSitterMap as any)[language];

  if (!LangCtor) {
    throw new Error(`Unsupported language for tree-sitter: ${language}`);
  }

  const parser = new Parser();
  parser.setLanguage(LangCtor);

  const tree = parser.parse(code);
  const root = tree.rootNode;

  const identifierNodeTypes = identifierTypeMap[language] || ["identifier"];
  const idNodes: { start: number; end: number; text: string }[] = [];
  const commentNodes: { start: number; end: number }[] = [];

  // DFS traversal to collect identifier and comment nodes
  const stack = [root];
  while (stack.length) {
    const node = stack.pop()!;
    const type = node.type;

    if (type.includes("comment") && opts?.removeComments) {
      commentNodes.push({ start: node.startIndex, end: node.endIndex });
    }

    if (identifierNodeTypes.includes(type) || /identifier|name/i.test(type)) {
      const text = code.slice(node.startIndex, node.endIndex);
      idNodes.push({ start: node.startIndex, end: node.endIndex, text });
    }

    for (let i = 0; i < node.childCount; i++) {
      stack.push(node.child(i) as SyntaxNode);
    }
  }

  // Build name mapping in first-occurrence order
  const nameMap = new Map<string, string>();
  let counter = 0;

  for (const n of idNodes) {
    const orig = n.text;
    if (isLanguageBuiltinName(orig, language)) continue;
    if (!nameMap.has(orig)) {
      nameMap.set(orig, `v${counter++}`);
    }
  }

  // Build replacement list
  const replacements: { start: number; end: number; replacement: string }[] = [];

  if (opts?.removeComments) {
    for (const c of commentNodes) {
      replacements.push({ start: c.start, end: c.end, replacement: "" });
    }
  }

  if (opts?.normalizeVariableNames) {
    for (const id of idNodes) {
      const mapped = nameMap.get(id.text);
      if (mapped) {
        replacements.push({ start: id.start, end: id.end, replacement: mapped });
      }
    }
  }

  // Apply replacements from end to start so byte indices stay valid
  replacements.sort((a, b) => b.start - a.start);

  let out = code;
  for (const r of replacements) {
    out = out.slice(0, r.start) + r.replacement + out.slice(r.end);
  }

  return out.trim();
}

Hashing the Canonical Form

Once we have the normalized source, hashing is straightforward:

export async function createCanonicalHash(canonicalCode: string): Promise<string> {
  const encoder = new TextEncoder();
  const data = encoder.encode(canonicalCode);
  const hashBuffer = await crypto.subtle.digest("SHA-256", data);
  const hashArray = Array.from(new Uint8Array(hashBuffer));
  return hashArray.map((b) => b.toString(16).padStart(2, "0")).join("");
}

SHA-256 gives us a fixed 64-character hex string regardless of how long the original code was. That's what gets stored and indexed in PostgreSQL.

Putting It All Together in the Request Handler

let canonicalCode;
try {
  canonicalCode = await canonicalizeCodeAST(solution, preferred_language);
} catch (error) {
  console.error("Canonicalization error:", error);
  return NextResponse.json(
    {
      success: false,
      error: {
        code: "BAD_USER_INPUT",
        message: "Unable to parse your code. Please check for syntax errors and try again.",
      },
    },
    { status: 400 }
  );
}

const canonicalHash = await createCanonicalHash(
  typeof canonicalCode === "string" ? canonicalCode : ""
);

// Check cache before touching the AI provider
const cachedReview = await getCachedAIReview(
  preferred_model.id + "-" + canonicalHash
);

if (cachedReview) {
  return NextResponse.json(
    { success: true, data: cachedReview },
    { status: 201 }
  );
}

// No cache hit — proceed with AI review and store the result

The model ID is included in the cache key because the same code reviewed by Gemini and Groq will produce different outputs. You don't want to serve a Gemini review to someone who selected Groq.

Does It Actually Work?

Here's the test that proved it out. Five TypeScript implementations of the same palindrome algorithm — different variable names, different formatting, different comments, and one with operand order swapped:

// Solution 1 — clean, descriptive names
function isPalindrome(x: number): boolean {
  let originalNum = x;
  let reversedNum = 0;
  while (originalNum > 0) {
    let lastDigit = originalNum % 10;
    reversedNum = (reversedNum * 10) + lastDigit;
    originalNum = Math.floor(originalNum / 10);
  }
  return x === reversedNum;
}

// Solution 2 — different variable names
function isPalindrome(n: number): boolean {
  let temp = n;
  let result = 0;
  while (temp > 0) {
    let digit = temp % 10;
    result = (result * 10) + digit;
    temp = Math.floor(temp / 10);
  }
  return n === result;
}

// Solution 3 — compressed formatting + comments
function isPalindrome(x: number): boolean{
  // Store original
  let originalNum=x;let reversedNum=0;
  while(originalNum>0){
    let lastDigit=originalNum%10; /* get digit */
    reversedNum=(reversedNum*10)+lastDigit;
    originalNum=Math.floor(originalNum/10);
  }
  return x===reversedNum;
}

// Solution 4 — addition operands swapped
function isPalindrome(x: number): boolean {
  let originalNum = x;
  let reversedNum = 0;
  while (originalNum > 0) {
    let lastDigit = originalNum % 10;
    reversedNum = lastDigit + (reversedNum * 10); // addition order changed
    originalNum = Math.floor(originalNum / 10);
  }
  return reversedNum === x; // comparison order changed
}

// Solution 5 — completely different algorithm
function isPalindrome(x: number): boolean {
  const str = x.toString();
  return str === str.split('').reverse().join('');
}

And the test harness:

export async function testDuplicateDetection() {
  const solutions = [ /* ...the five solutions above... */ ];

  console.log("Code hash comparison:");

  solutions.forEach(async (code, index) => {
    const canonicalized = await canonicalizeCodeAST(code, "typescript");
    const hash = await createCanonicalHash(canonicalized);
    console.log(`\nSolution ${index + 1}:`);
    console.log(`Hash: ${hash.substring(0, 12)}...`);
    console.log(
      `Canonical: ${
        typeof canonicalized === "string" ? canonicalized.substring(0, 100) : ""
      }...`,
    );
  });

  // Canonicalize before hashing — not after
  const hashes = await Promise.all(
    solutions.map(async (code) => {
      const canonicalized = await canonicalizeCodeAST(code, "typescript");
      return createCanonicalHash(canonicalized);
    }),
  );
  const uniqueHashes = new Set(hashes);

  console.log(`\nOriginal solutions: ${solutions.length}`);
  console.log(`Unique hashes: ${uniqueHashes.size}`);
  console.log(`Duplicates detected: ${solutions.length - uniqueHashes.size}`);
}

The output:

Code hash comparison:

Solution 1:
Hash: f7142f71e35a...
Canonical: function v0(v1:number):boolean{let v2=v1;let v3=0;while(v2>0){let v4=v2%10;v3=v3*10+v4;v2=Math.v5(v2...

Solution 2:
Hash: f7142f71e35a...
Canonical: function v0(v1:number):boolean{let v2=v1;let v3=0;while(v2>0){let v4=v2%10;v3=v3*10+v4;v2=Math.v5(v2...

Solution 3:
Hash: f7142f71e35a...
Canonical: function v0(v1:number):boolean{let v2=v1;let v3=0;while(v2>0){let v4=v2%10;v3=v3*10+v4;v2=Math.v5(v2...

Solution 4:
Hash: da8268fe24dd...
Canonical: function v0(v1:number):boolean{let v2=v1;let v3=0;while(v2>0){let v4=v2%10;v3=v4+v3*10;v2=Math.v5(v2...

Solution 5:
Hash: e3d2696010ad...
Canonical: function v0(v1:number):boolean{const v2=v1.v3();return v2===v2.v4('').v5().v6('');}...

Original solutions: 5
Unique hashes: 3
Duplicates detected: 2

Solutions 1, 2, and 3 collapse to the same hash — variable names, whitespace, and comments are all stripped away during canonicalization, leaving identical ASTs. Solution 4 gets a distinct hash because operand order (lastDigit + reversedNum * 10 vs reversedNum * 10 + lastDigit) is a syntactic difference the AST preserves — commutativity is math, not structure, and the canonicalizer doesn't evaluate expressions. Solution 5 is correctly identified as a genuinely different algorithm.
Two duplicates detected. Working as intended.

The Tradeoffs Worth Knowing

This approach isn't magic. A few things it doesn't catch:

Semantic equivalence across algorithms. Two completely different implementations that produce the same result — bubble sort vs. merge sort, for example — will have different hashes. This system detects structural duplicates, not algorithmic equivalence. That's actually the right behavior for a code review platform.

Heuristic identifier detection in Tree-sitter. The identifier node types vary between language grammars, and I'm using pattern matching (/identifier|name/i) as a fallback. It works well in practice but it's not exhaustive. If a grammar names its identifier nodes something unusual, you might miss some.

Builtin lists are incomplete. The isBuiltinJsName and isLanguageBuiltinName functions cover the most common builtins but aren't exhaustive. If a user names their variable fetch or setTimeout, those won't be normalized. Not a critical flaw, but worth knowing.

What I Took Away From This

The part I didn't expect: how much I'd enjoy it. This was one of those problems that pulled me away from the normal web dev loop of "call an API, render some data" and into something that felt closer to computer science. Parsers, ASTs, tree traversal — things I'd learned in school that felt abstract at the time suddenly had a real home in a real production system.

The other thing I took away: sometimes the right caching strategy isn't about time-to-live or cache invalidation. Sometimes it's about recognizing that two requests are asking for the same thing, even when they look different on the surface. That realization changed how I thought about a whole class of optimization problems.

If you're building anything that takes code as input — a linter, a reviewer, a plagiarism detector, an interview platform — the AST + hash pattern is worth having in your toolkit.

LogicVisor is an AI-powered tool that reviews your algorithmic code, breaks down time and space complexity, and gives you the kind of feedback you'd want before a technical interview. You can try it at logicvisor.vercel.app.

DEV Community