DEV Community

Stokry

I Built an AI That Finds Every Article Similar to Your Text — Here’s How It Works

Most search engines will show you pages that mention the same keywords.

But what if you want to find articles that express the same ideas, even if they’re written completely differently?

That’s the problem that pushed me into building something new.

A few months ago, I needed to verify whether parts of my content were being reused online. Google wasn’t helpful — keyword search missed half of the semantically similar pages. Rephrased paragraphs were completely invisible.

So I built my own engine.

Today, I’m excited to introduce Nooth, an AI-powered system that takes any text you enter and searches the web for content with similar themes, ideas, and semantic meaning.

Drop in a paragraph → get URLs, similarity percentages, and deep analytical insights.

Here’s how it works under the hood.

🚀 The Idea

Traditional search relies on keyword overlap.

If two texts don’t share the same words, a keyword-based engine treats them as unrelated.

But humans don’t think that way.

We recognize similarity through meaning — the structure of ideas, not characters.

So the goal was simple:

Build an engine that understands what your text is about, then find everything on the internet that expresses the same ideas.

That meant two pieces were crucial:

  1. A way to measure meaning

  2. A way to scan the web, extract clean text, and compare it

🧠 Step 1: Turning Text Into Meaning (Embeddings)

At the core of Nooth is a semantic model that converts text into embeddings — numerical vectors that represent meaning.

Example:

“How to speed up your Node.js API”

and

“Optimizing performance in a JavaScript backend”

These look different, but in vector space, their embeddings are neighbors.

Every time you paste text into Nooth:

  • it cleans and normalizes it

  • creates sentence-level embeddings

  • composes them into a unified “semantic fingerprint”

That fingerprint becomes the reference point for the search.
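The three steps above can be sketched roughly as follows. This is a minimal illustration, not Nooth's actual code: `toy_embed` is a hypothetical stand-in (a hashed bag-of-words) for a real semantic model, and mean pooling is only one assumed way to compose sentence vectors into a single fingerprint.

```python
import hashlib
import math
import re

DIM = 64  # toy dimensionality, chosen only for illustration


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text) if p]


def toy_embed(sentence: str) -> list[float]:
    """STAND-IN for a real embedding model (assumption, not Nooth's model).

    A hashed bag-of-words only captures word overlap, not meaning;
    swap in a real sentence-embedding model for semantic behaviour.
    """
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9']+", sentence):
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    return vec


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def fingerprint(text: str) -> list[float]:
    """Mean-pool the sentence vectors, then scale the result to unit length."""
    sentences = split_sentences(normalize(text))
    if not sentences:
        return [0.0] * DIM
    vectors = [l2_normalize(toy_embed(s)) for s in sentences]
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    return l2_normalize(mean)
```

Normalizing each sentence vector before averaging keeps one long sentence from dominating the fingerprint, which is a design choice of this sketch rather than something the engine is known to do.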

🌐 Step 2: Finding Content Across the Web

Nooth crawls and indexes content using:

  • lightweight scrapers

  • structured metadata

  • boilerplate removal

  • language detection

  • canonical URL resolution

  • and chunking for long-form content

The goal isn’t to store the entire HTML — only the meaningful part:

titles, headings, paragraphs, and contextual metadata.
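As a rough illustration of keeping only the meaningful part of a page, here is a small sketch built on Python's standard `html.parser`. The tag lists in `KEEP` and `SKIP` are assumptions made for this example; a production boilerplate remover would be considerably more involved.

```python
from html.parser import HTMLParser

KEEP = {"title", "h1", "h2", "h3", "p"}                # tags whose text is kept
SKIP = {"script", "style", "nav", "footer", "aside"}   # assumed boilerplate containers


class MeaningfulText(HTMLParser):
    """Collects (tag, text) pairs for headings and paragraphs outside boilerplate."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0   # > 0 while inside a SKIP container
        self._current = None   # KEEP tag currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in KEEP and self._skip_depth == 0:
            self._current = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == self._current:
            # Collapse whitespace left over from inline markup and line breaks.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append((tag, text))
            self._current = None

    def handle_data(self, data):
        if self._current and self._skip_depth == 0:
            self._buffer.append(data)


def extract_blocks(html: str):
    """Return the kept (tag, text) blocks in document order."""
    parser = MeaningfulText()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Text inside inline tags such as `<em>` or `<a>` is kept as part of the surrounding paragraph, while anything nested in a skipped container is dropped.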

Each chunk gets its own embedding.

That allows highly precise, paragraph-level similarity detection.
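Chunking itself can be sketched in a few lines. The function below is an assumption about how it might work, not Nooth's implementation: it groups consecutive paragraphs into chunks capped at a hypothetical `max_words` limit, and each resulting chunk would then be embedded on its own.

```python
def chunk_paragraphs(paragraphs, max_words=120):
    """Group consecutive paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is split on word boundaries.
    """
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = para.split()
        # Break oversized paragraphs into max_words-sized pieces.
        pieces = [words[i:i + max_words] for i in range(0, len(words), max_words)]
        for piece in pieces:
            # Close the current chunk if this piece would push it over the cap.
            if current and count + len(piece) > max_words:
                chunks.append(" ".join(current))
                current, count = [], 0
            current.extend(piece)
            count += len(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Smaller chunks give finer-grained matches at the cost of more embeddings to store and compare, so the cap is a trade-off to tune rather than a fixed value.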


📊 Step 3: Computing Semantic Similarity

Once the candidate pages are scraped, Nooth scores each one using:

  • cosine similarity

  • conceptual overlap scoring

  • syntactic variance weighting

  • partial-match scoring (for rewritten content)

This combination is what lets Nooth detect even heavily rephrased articles.

It doesn’t just ask:

“Do these texts share words?”

It asks:

“Do these texts express the same thing?”
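Two of the signals listed above are easy to illustrate. Cosine similarity is the standard formula; `partial_match_score` and its `threshold` of 0.8 are assumptions of this sketch, showing one plausible way to score rewritten content, and the conceptual-overlap and syntactic-variance signals are left out because the article does not describe them in enough detail.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def partial_match_score(query_vectors, candidate_vectors, threshold=0.8):
    """Fraction of query blocks whose best candidate chunk clears the threshold.

    Hypothetical scoring rule: a page that rewrites only part of the input
    still gets credit for the blocks it does match.
    """
    if not query_vectors or not candidate_vectors:
        return 0.0
    matched = 0
    for q in query_vectors:
        best = max(cosine_similarity(q, c) for c in candidate_vectors)
        if best >= threshold:
            matched += 1
    return matched / len(query_vectors)
```

Because cosine similarity ignores vector length, a short paragraph and a long one can still score as close matches if they point in the same direction in embedding space.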


🔍 Step 4: Presenting Clear Evidence

The final result looks like this (example flow):

  1. You paste a paragraph

  2. The engine breaks it into meaning blocks

  3. It fetches the semantically closest content across the web

  4. You get:

     • URL

     • domain

     • similarity percentage

     • highlighted overlapping ideas

     • reasoning on why the match is close
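A result record with those fields might look something like the sketch below. The `Match` class, its field names, and the clamping of negative scores to 0% are all hypothetical choices for illustration, not the shape of Nooth's real output.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Match:
    url: str
    domain: str
    similarity_pct: float
    overlapping_ideas: list[str]
    reasoning: str


def make_match(url, score, overlapping_ideas, reasoning):
    """Build a Match from a cosine-style score in [-1, 1]."""
    clamped = max(0.0, min(1.0, score))  # assumption: negative scores display as 0%
    return Match(
        url=url,
        domain=urlparse(url).netloc,
        similarity_pct=round(clamped * 100, 1),
        overlapping_ideas=list(overlapping_ideas),
        reasoning=reasoning,
    )


def rank(matches):
    """Order matches from most to least similar."""
    return sorted(matches, key=lambda m: m.similarity_pct, reverse=True)
```

Deriving the domain from the URL instead of storing it separately keeps the two fields from drifting apart.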

This makes it useful for:

  • discovering who’s copying your content

  • researching related topics

  • finding competitors

  • understanding what else exists on the same theme

  • generating new ideas based on existing clusters


🧭 Why I Built It

I didn’t want another Google-like keyword search.

I wanted a lens into the semantic web — the world of ideas, not strings.

Whether you’re a researcher, writer, builder, or SEO nerd, sometimes you need to find content that’s conceptually similar, not textually identical.

Nooth solves that by combining:

  • modern embeddings

  • optimized scraping

  • semantic search

  • scoring heuristics

  • and a clean UI on top of everything

It shows you what’s related at the idea level, not just the word level.


🧪 Try It Out

If you want to test it:

👉 Paste any text.

👉 See what the web holds that’s semantically connected.

👉 Explore clusters of related ideas you didn’t even know existed.

https://nooth.dev/


🔮 What’s Next

I’m currently working on:

  • deeper plagiarism detection

  • topic extraction

  • content clustering

  • semantic timelines

  • alerts when new similar content appears

  • a full API for developers

If any of this sounds interesting, I’d love feedback.

Thanks for reading — and welcome to the world of semantic discovery.
