Aakash Gour

Posted on May 25

How to Build a Content Similarity Checker to Avoid Duplicate AI Output

#webdev #ai #javascript #tutorial

When I started running bulk content generation jobs in PostAll, I hit a problem I didn't expect.

Not hallucinations. Not rate limits. Duplicates.

Not copy-paste duplicates — subtle ones. Two blog posts about "email marketing tips" that were structurally identical: same three sections, same examples, same conclusion, different sentences. The kind of thing that looks fine until you read them back to back and realize the AI just remixed the same underlying content.

At 10 articles, you catch this manually. At 500, you don't.

Here's the exact similarity checker I built to fix it — using OpenAI embeddings and cosine similarity, with a threshold you can tune to your use case.

Why String Matching Doesn't Cut It

My first instinct was to compare outputs with something simple: calculate the Levenshtein distance between strings, convert it to a similarity score, and flag anything above a cutoff. Shipped it in an afternoon.

It failed immediately.

Two articles can share 0% of their exact words and still be meaningfully duplicate content. "Email marketing boosts your open rates" and "Sending better emails increases how many people read them" are semantically identical — string matching treats them as completely different.
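
To make that concrete, here's a minimal sketch of the string-matching approach run against those two sentences. The helper names (levenshtein, stringSimilarity) and the normalization by the longer string's length are my assumptions for illustration, not the code that originally shipped.

```javascript
// Classic dynamic-programming edit distance, keeping only two rows of the table.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Normalize to 0..1 so the score can be compared against a cutoff (1 = identical strings).
function stringSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

const first = "Email marketing boosts your open rates";
const second = "Sending better emails increases how many people read them";

// Well below 1, even though the two sentences say the same thing.
console.log(stringSimilarity(first, second).toFixed(2));
```

Character-level edit distance only sees spelling, so a paraphrase looks almost as different as an unrelated sentence.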

What you actually need is semantic similarity: are these two pieces of content saying the same thing, regardless of how the words are arranged?

That's what embeddings are for.

How Embeddings Work (The Part That Actually Matters)

An embedding is a vector — a list of numbers — that represents the meaning of a piece of text. OpenAI's text-embedding-3-small model converts any text into a 1536-dimensional vector.

Here's the useful property: texts with similar meaning produce vectors that point in similar directions in that 1536-dimensional space. Texts with different meaning point in different directions.

Cosine similarity measures the angle between two vectors. An angle of 0° (cosine similarity = 1.0) means identical direction — same meaning. An angle of 90° (cosine similarity = 0) means orthogonal — completely unrelated content.

In practice:

0.95+: Near-duplicate. Same content, slight rewording.
0.85–0.95: Very similar. Same topic and angle, different structure.
0.70–0.85: Related content. Same topic, meaningfully different angle.
Below 0.70: Different enough. Probably fine to publish both.

Your threshold depends on your use case. I'll come back to this.
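
If you want those bands in code, a small lookup is enough. This is a sketch: the function name and the label strings are made up for illustration, and the cutoffs are just the rough bands listed above.

```javascript
// Map a cosine similarity score onto the rough bands described above.
// The label strings are hypothetical; rename them to fit your pipeline.
function classifySimilarity(score) {
  if (score >= 0.95) return "near-duplicate";
  if (score >= 0.85) return "very-similar";
  if (score >= 0.7) return "related";
  return "different";
}

console.log(classifySimilarity(0.91)); // "very-similar"
```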

The Implementation

1. Set Up the Embedding Client

// similarity-checker.js
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// text-embedding-3-small: 1536 dimensions, $0.02/1M tokens
// text-embedding-3-large: 3072 dimensions, $0.13/1M tokens
// For content dedup, small is accurate enough and ~6x cheaper
const EMBEDDING_MODEL = "text-embedding-3-small";

2. Get an Embedding for a Piece of Content

async function getEmbedding(text) {
  // The embeddings model accepts at most 8191 tokens per input.
  // Taking the first 6000 chars (roughly 1,500 tokens of English) stays well
  // under that limit and captures topic, angle, and structure without the
  // conclusion boilerplate.
  const truncated = text.slice(0, 6000);

  const response = await openai.embeddings.create({
    model: EMBEDDING_MODEL,
    input: truncated,
  });

  return response.data[0].embedding; // array of 1536 floats
}

Why truncate at 6000 chars instead of sending the full article?

Two reasons. First, the model has an 8191-token limit, and 6000 characters of English is roughly 1,500 tokens (about 1,000 words), so the request always fits. Second, and more importantly, the opening of an article carries almost all of its semantic fingerprint: the topic, the angle, the key claims. The conclusion is usually boilerplate. Truncating early makes the check faster and, in my testing, more accurate for detecting duplicates.
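
One optional refinement: a plain slice can cut the last word in half. The sketch below backs up to the previous whitespace instead. Both helper names are hypothetical, and the token estimate uses the common "about 4 characters per token" rule of thumb for English, which is an approximation and not a real tokenizer.

```javascript
// Truncate to at most maxChars without splitting the final word.
function truncateForEmbedding(text, maxChars = 6000) {
  if (text.length <= maxChars) return text;
  const slice = text.slice(0, maxChars);
  // If the cut already falls on whitespace, no word was split.
  if (/\s/.test(text[maxChars])) return slice;
  const lastBreak = Math.max(slice.lastIndexOf(" "), slice.lastIndexOf("\n"));
  // No whitespace at all (one giant token): fall back to the hard cut.
  return lastBreak > 0 ? slice.slice(0, lastBreak) : slice;
}

// Rough size check only; assumes ~4 characters per token.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}
```

Whether the cleaner cut changes similarity scores in any measurable way is something you'd have to test on your own content.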

3. Calculate Cosine Similarity

function cosineSimilarity(vecA, vecB) {
  // Both vectors must be the same length (1536 for text-embedding-3-small)
  if (vecA.length !== vecB.length) {
    throw new Error(`Vector length mismatch: ${vecA.length} vs ${vecB.length}`);
  }

  let dotProduct = 0;
  let magnitudeA = 0;
  let magnitudeB = 0;

  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    magnitudeA += vecA[i] * vecA[i];
    magnitudeB += vecB[i] * vecB[i];
  }

  magnitudeA = Math.sqrt(magnitudeA);
  magnitudeB = Math.sqrt(magnitudeB);

  if (magnitudeA === 0 || magnitudeB === 0) return 0;

  return dotProduct / (magnitudeA * magnitudeB);
}

This is the full cosine similarity formula — no library needed. It runs in O(n) where n is the vector dimension (1536), so it's fast for individual comparisons.
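
A quick sanity check with tiny hand-made vectors is worth running before you trust the function on real embeddings. The snippet repeats a compact copy of the function so it runs on its own; the test vectors are arbitrary.

```javascript
function cosineSimilarity(vecA, vecB) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < vecA.length; i++) {
    dot += vecA[i] * vecB[i];
    magA += vecA[i] * vecA[i];
    magB += vecB[i] * vecB[i];
  }
  const denom = Math.sqrt(magA) * Math.sqrt(magB);
  return denom === 0 ? 0 : dot / denom;
}

console.log(cosineSimilarity([3, 4], [3, 4])); // same vector: 1
console.log(cosineSimilarity([3, 4], [6, 8])); // same direction, different length: 1
console.log(cosineSimilarity([1, 0], [0, 1])); // orthogonal: 0
console.log(cosineSimilarity([1, 0], [-1, 0])); // opposite direction: -1
```

The second case is the important one: cosine similarity ignores magnitude and only compares direction.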

4. Check One Article Against a Library

async function checkForDuplicates(newContent, existingArticles, threshold = 0.88) {
  const newEmbedding = await getEmbedding(newContent);
  const duplicates = [];

  for (const article of existingArticles) {
    // Skip re-embedding if you've cached embeddings (you should — see below)
    const existingEmbedding = article.embedding ?? await getEmbedding(article.content);

    const similarity = cosineSimilarity(newEmbedding, existingEmbedding);

    if (similarity >= threshold) {
      duplicates.push({
        articleId: article.id,
        title: article.title,
        similarity: Math.round(similarity * 100) / 100,
      });
    }
  }

  // Sort by similarity descending so the most egregious duplicate is first
  duplicates.sort((a, b) => b.similarity - a.similarity);

  return {
    isDuplicate: duplicates.length > 0,
    matches: duplicates,
    newEmbedding, // return so caller can cache it
  };
}

5. Putting It Together for a Bulk Job

async function processBulkGeneration(articles) {
  const results = [];
  const processedArticles = []; // grows as we process

  for (const article of articles) {
    const { isDuplicate, matches, newEmbedding } = await checkForDuplicates(
      article.content,
      processedArticles,
      0.88
    );

    if (isDuplicate) {
      results.push({
        ...article,
        status: "duplicate",
        duplicateOf: matches[0], // most similar match
        action: "needs_revision",
      });
    } else {
      results.push({
        ...article,
        status: "unique",
        action: "ready_to_publish",
      });

      // Only add non-duplicates to the comparison pool
      // Adding duplicates would penalize articles similar to a bad article
      processedArticles.push({
        id: article.id,
        title: article.title,
        content: article.content,
        embedding: newEmbedding, // cache it — never re-embed the same content
      });
    }

    // Rate limit: embeddings API allows ~3000 RPM on tier 1
    // For bulk jobs, 50ms between requests keeps you well under
    await new Promise((resolve) => setTimeout(resolve, 50));
  }

  return results;
}
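
If the per-request delay becomes the bottleneck, one alternative is to embed several articles per API call, since the embeddings endpoint accepts an array of strings as input. The sketch below is an assumption-heavy variant, not what the loop above does: embedInBatches is a hypothetical helper, the batch size of 100 is a guess you should size against your own token limits, and the client is passed in so you can swap in a stub for testing.

```javascript
// `client` is an OpenAI SDK instance (or anything with the same embeddings.create shape).
async function embedInBatches(client, texts, batchSize = 100) {
  const embeddings = [];
  for (let start = 0; start < texts.length; start += batchSize) {
    // Same 6000-char truncation as getEmbedding
    const batch = texts.slice(start, start + batchSize).map((t) => t.slice(0, 6000));
    const response = await client.embeddings.create({
      model: "text-embedding-3-small",
      input: batch, // an array of strings yields one embedding per string
    });
    // Each result carries the index of its input; sort before collecting to be safe.
    const ordered = [...response.data].sort((a, b) => a.index - b.index);
    for (const item of ordered) embeddings.push(item.embedding);
  }
  return embeddings; // embeddings[i] belongs to texts[i]
}
```

You would embed the whole job up front, then run the duplicate check with the embeddings already attached, so the comparison loop makes no API calls at all.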

The Part That Bit Me: Caching Embeddings

The first version of this had no caching. For a job of 200 articles:

Embedding each article: 200 API calls
Checking each article against all previous: up to 200 × 199 / 2 = 19,900 comparisons

The comparisons are free — they're just math. But I was re-embedding articles every time I ran the checker, even articles I'd already processed.

At $0.02 per million tokens, re-embedding 200 articles (~500 tokens each) costs almost nothing. But at scale, with a persistent library of 10,000+ articles to check against, you want to store embeddings in your database and retrieve them directly.
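
To put numbers on that, here's the arithmetic as a tiny helper. The function name is hypothetical; the price default is the text-embedding-3-small rate quoted earlier, and the 500 tokens per article is the same rough figure used above.

```javascript
// Cost in dollars = total tokens / 1,000,000 * price per million tokens.
function estimateEmbeddingCost(articleCount, tokensPerArticle, pricePerMillionTokens = 0.02) {
  const totalTokens = articleCount * tokensPerArticle;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

console.log(estimateEmbeddingCost(200, 500)); // 200 articles: about $0.002
console.log(estimateEmbeddingCost(10_000, 500)); // 10,000 articles: about $0.10
```

Even the large case is cheap per run. The waste comes from paying it (and waiting on 10,000 API calls) every single time the checker runs.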

// With a database — pseudocode for the storage layer
// but the retrieval pattern is what matters

async function getOrCreateEmbedding(articleId, content, db) {
  // Check if we already have this embedding stored
  const cached = await db.query(
    "SELECT embedding FROM articles WHERE id = $1 AND embedding IS NOT NULL",
    [articleId]
  );

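  if (cached.rows[0]) {
    // A TEXT column comes back as a JSON string. Drivers such as node-postgres
    // already parse JSON/JSONB columns into arrays, so only parse when needed.
    const stored = cached.rows[0].embedding;
    return typeof stored === "string" ? JSON.parse(stored) : stored;
  }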

  // Generate and store
  const embedding = await getEmbedding(content);
  await db.query(
    "UPDATE articles SET embedding = $1 WHERE id = $2",
    [JSON.stringify(embedding), articleId]
  );

  return embedding;
}

If you're on Postgres, the pgvector extension handles this natively, including fast ANN (approximate nearest neighbor) search so you're not doing a full scan of 10,000 embeddings for every new article.
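
For a rough idea of what that looks like, here's a sketch under several assumptions: a separate embedding_vec column (a hypothetical name, so it doesn't clash with the JSON column used above), a node-postgres style db.query, and a pgvector version recent enough to include HNSW indexes. Treat the SQL as a starting point to check against the pgvector docs, not a tested migration.

```javascript
// pgvector accepts vectors in a '[0.1,0.2,...]' text form.
function toVectorLiteral(embedding) {
  return `[${embedding.join(",")}]`;
}

// One-time setup statements; run them through whatever migration tool you use.
const SETUP_SQL = [
  "CREATE EXTENSION IF NOT EXISTS vector",
  "ALTER TABLE articles ADD COLUMN IF NOT EXISTS embedding_vec vector(1536)",
  "CREATE INDEX IF NOT EXISTS articles_embedding_vec_idx ON articles USING hnsw (embedding_vec vector_cosine_ops)",
];

// `<=>` is pgvector's cosine distance operator, so similarity = 1 - distance.
async function findNearest(db, embedding, limit = 5) {
  const result = await db.query(
    `SELECT id, title, 1 - (embedding_vec <=> $1::vector) AS similarity
       FROM articles
      WHERE embedding_vec IS NOT NULL
      ORDER BY embedding_vec <=> $1::vector
      LIMIT $2`,
    [toVectorLiteral(embedding), limit]
  );
  return result.rows; // the caller still applies the similarity threshold
}
```

The database returns only the closest few candidates, and you compare their similarity against your threshold in application code.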

Tuning the Threshold

I ran 50 article pairs through manual review to calibrate the threshold for PostAll's use case. Here's what I found at each level:

0.95+: True duplicates only. Same article, rewritten paragraph by paragraph. A threshold this high catches almost nothing useful: it's too strict for AI-generated content, which tends to vary more than you'd expect even when covering the same topic.

0.88–0.92: The sweet spot for "same content, different words." Two articles about "onboarding email sequences" that cover the same three tactics in the same order. Definitely shouldn't both be published.

0.80–0.88: Similar topic and angle, but meaningfully different. An article about "onboarding email sequences for SaaS" vs. "onboarding email sequences for e-commerce" might land here. Depending on your use case, this might be a flag-for-review rather than an automatic rejection.

Below 0.80: Different enough to publish regardless of topic overlap.

My recommendation: start at 0.88, manually review your first 20 flags, and adjust from there. The threshold is highly dependent on how diverse your content topics are and how much variation your generation prompts produce.
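
Once you have a batch of manually reviewed pairs, you can score candidate thresholds instead of eyeballing them. This is a sketch of one way to do it, not the exact process I followed: the function names, the { similarity, isDuplicate } record shape, and the candidate list are all assumptions.

```javascript
// reviewedPairs: [{ similarity: <number>, isDuplicate: <boolean from manual review> }, ...]
function evaluateThreshold(reviewedPairs, threshold) {
  let truePos = 0, falsePos = 0, falseNeg = 0;
  for (const { similarity, isDuplicate } of reviewedPairs) {
    const flagged = similarity >= threshold;
    if (flagged && isDuplicate) truePos++;
    else if (flagged && !isDuplicate) falsePos++;
    else if (!flagged && isDuplicate) falseNeg++;
  }
  // Precision: of everything flagged, how much was a real duplicate?
  const precision = truePos + falsePos === 0 ? 1 : truePos / (truePos + falsePos);
  // Recall: of all real duplicates, how many were flagged?
  const recall = truePos + falseNeg === 0 ? 1 : truePos / (truePos + falseNeg);
  return { threshold, precision, recall };
}

function sweepThresholds(reviewedPairs, candidates = [0.8, 0.84, 0.88, 0.92, 0.95]) {
  return candidates.map((t) => evaluateThreshold(reviewedPairs, t));
}
```

Low precision means you're rejecting good articles, so raise the threshold. Low recall means duplicates are slipping through, so lower it.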

What Can Go Wrong

Short content produces unreliable embeddings. A 50-word product description doesn't give the model enough signal. I set a minimum of 200 words before running similarity checks — anything shorter gets flagged for manual review instead.

All your content comes back as similar. This usually means your generation prompts are too rigid. If every prompt says "write in a professional, informative tone with three sections and a clear conclusion," the AI will produce structurally similar content regardless of topic. The similarity checker is revealing a prompt diversity problem, not a bug in the checker.

Embeddings drift if you switch models. text-embedding-3-small and text-embedding-ada-002 produce incompatible vector spaces, and because both output 1536-dimensional vectors, the length check in cosineSimilarity won't catch a mix-up. If you switch embedding models, you need to re-embed your entire library. Store which model generated each embedding alongside the vector.
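
One way to store the model next to the vector is to wrap each embedding in a small record and refuse to compare records from different models. The record shape and all three helper names below are hypothetical; adapt them to however you persist embeddings.

```javascript
// Wrap a raw vector with the metadata needed to tell later whether it is still usable.
function makeStoredEmbedding(vector, model) {
  return { model, dimensions: vector.length, vector };
}

// Throw before comparing vectors that live in different embedding spaces.
function assertComparable(a, b) {
  if (a.model !== b.model) {
    throw new Error(`Embedding model mismatch: ${a.model} vs ${b.model}`);
  }
}

// List the ids of library records that need re-embedding after a model switch.
function findStaleEmbeddings(records, currentModel) {
  return records.filter((r) => r.embedding.model !== currentModel).map((r) => r.id);
}
```

Call assertComparable right before cosineSimilarity, and run findStaleEmbeddings once after changing models to get your re-embedding worklist.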

The check is linear in library size. Every new article is compared against every stored embedding, so at 10,000 articles that's 10,000 cosine similarity calculations per check. That's still fast (under 100ms in my testing), but if you're checking 500 new articles against a library of 50,000, that's 25 million comparisons, so consider pgvector's HNSW index for approximate nearest neighbor search instead of the brute-force approach above.

The Complete Script (Copy-Paste Ready)

// content-similarity-checker.js
// Requires: openai npm package, OPENAI_API_KEY env var

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function getEmbedding(text) {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text.slice(0, 6000),
  });
  return response.data[0].embedding;
}

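function cosineSimilarity(vecA, vecB) {
  if (vecA.length !== vecB.length) {
    throw new Error(`Vector length mismatch: ${vecA.length} vs ${vecB.length}`);
  }
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < vecA.length; i++) {
    dot += vecA[i] * vecB[i];
    magA += vecA[i] ** 2;
    magB += vecB[i] ** 2;
  }
  const denom = Math.sqrt(magA) * Math.sqrt(magB);
  // A zero vector has no direction; treat it as unrelated instead of returning NaN
  return denom === 0 ? 0 : dot / denom;
}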

export async function checkSimilarity(newContent, library, threshold = 0.88) {
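  // The minimum is 200 words (not characters): shorter text embeds unreliably
  const wordCount = newContent.trim().split(/\s+/).filter(Boolean).length;
  if (wordCount < 200) {
    return { status: "too_short", isDuplicate: false, matches: [] };
  }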

  const embedding = await getEmbedding(newContent);
  const matches = library
    .map((item) => ({
      ...item,
      similarity: cosineSimilarity(embedding, item.embedding),
    }))
    .filter((item) => item.similarity >= threshold)
    .sort((a, b) => b.similarity - a.similarity);

  return {
    status: matches.length > 0 ? "duplicate" : "unique",
    isDuplicate: matches.length > 0,
    matches,
    embedding, // cache this
  };
}

Usage:

import { checkSimilarity } from "./content-similarity-checker.js";

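// Your existing articles with cached embeddings
const { rows } = await db.query(
  "SELECT id, title, embedding FROM articles WHERE embedding IS NOT NULL"
);

// Embeddings were stored with JSON.stringify, so turn them back into arrays
const library = rows.map((row) => ({
  ...row,
  embedding: typeof row.embedding === "string" ? JSON.parse(row.embedding) : row.embedding,
}));

const result = await checkSimilarity(newArticleContent, library);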

if (result.isDuplicate) {
  console.log(`Duplicate detected (${result.matches[0].similarity} similarity)`);
  console.log(`Most similar to: "${result.matches[0].title}"`);
} else if (result.status === "unique") {
  // Safe to publish, so cache the embedding ("too_short" results carry none)
  await db.query("UPDATE articles SET embedding = $1 WHERE id = $2", [
    JSON.stringify(result.embedding),
    newArticle.id,
  ]);
}

What I'd Add Next

The similarity checker catches what was already generated. It doesn't prevent duplicate content from being generated in the first place.

The next layer I'm building: before generating content on a topic, check the existing library for similar topics. If something with 0.88+ similarity already exists, either skip the generation or pass the existing article's key points to the prompt as a negative constraint: "Do not cover these points — they're already addressed in our existing content."
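
Since this layer isn't built yet, the following is only a sketch of how the decision could look. Everything in it is an assumption: the function names, the action labels, and especially the keyPoints field, which presumes you store a short list of key points with each article.

```javascript
function cosineSimilarity(vecA, vecB) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < vecA.length; i++) {
    dot += vecA[i] * vecB[i];
    magA += vecA[i] * vecA[i];
    magB += vecB[i] * vecB[i];
  }
  const denom = Math.sqrt(magA) * Math.sqrt(magB);
  return denom === 0 ? 0 : dot / denom;
}

// topicEmbedding: embedding of the planned topic or brief.
// library: [{ id, embedding, keyPoints? }] where keyPoints is a hypothetical field.
function planGeneration(topicEmbedding, library, threshold = 0.88) {
  let best = null;
  for (const article of library) {
    const similarity = cosineSimilarity(topicEmbedding, article.embedding);
    if (!best || similarity > best.similarity) best = { article, similarity };
  }
  if (!best || best.similarity < threshold) {
    return { action: "generate", avoidPoints: [] };
  }
  const points = best.article.keyPoints ?? [];
  return points.length > 0
    ? { action: "generate_with_constraints", similarTo: best.article.id, avoidPoints: points }
    : { action: "skip", similarTo: best.article.id, avoidPoints: [] };
}

// Turn the points to avoid into the negative constraint appended to the prompt.
function buildNegativeConstraint(avoidPoints) {
  if (avoidPoints.length === 0) return "";
  const bullets = avoidPoints.map((p) => `- ${p}`).join("\n");
  return `Do not cover these points; they're already addressed in our existing content:\n${bullets}`;
}
```

A short topic string may embed differently from a full article, so the 0.88 cutoff would likely need its own calibration here.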

That closes the loop. Instead of catching duplicates after generation, you're steering generation away from duplication upfront.

The full code is on GitHub: [link to repo]. If you're using pgvector and want the schema for storing and querying embeddings at scale, the README covers that too.

What threshold did you land on for your use case? I'm curious whether the 0.88 default holds up across different content types — drop it in the comments.

DEV Community