Stokry

Posted on Dec 2, 2025

I Built an AI That Finds Every Article Similar to Your Text — Here’s How It Works

#showdev #webdev #productivity

Most search engines will show you pages that mention the same keywords.

But what if you want to find articles that express the same ideas, even if they’re written completely differently?

That’s the problem that pushed me into building something new.

A few months ago, I needed to verify whether parts of my content were being reused online. Google wasn’t helpful — keyword search missed half of the semantically similar pages. Rephrased paragraphs were completely invisible.

So I built my own engine.

Today, I’m excited to introduce Nooth, an AI-powered system that takes any text you enter and searches the web for content with similar themes, ideas, and semantic meaning.

Drop in a paragraph → get URLs, similarity percentages, and an explanation of why each match is close.

Here’s how it works under the hood.

🚀 The Idea

Traditional search relies on keyword overlap.

If two texts don’t share the same words, the engine assumes they’re unrelated.

But humans don’t think that way.

We recognize similarity through meaning — the structure of ideas, not characters.

So the goal was simple:

Build an engine that understands what your text is about, then find everything on the internet that expresses the same ideas.

That meant two pieces were crucial:

A way to measure meaning
A way to scan the web, extract clean text, and compare it

🧠 Step 1: Turning Text Into Meaning (Embeddings)

At the core of Nooth is a semantic model that converts text into embeddings — numerical vectors that represent meaning.

Example:

“How to speed up your Node.js API”

and

“Optimizing performance in a JavaScript backend”

These look different, but in vector space, their embeddings are neighbors.
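To see why keyword matching misses this pair, here is a quick word-overlap check. It uses Jaccard overlap as a rough stand-in for keyword search (an assumption for illustration, not how any real search engine ranks pages), and the function name `keyword_overlap` is made up for this sketch:

```python
import re

def keyword_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets: |A & B| / |A | B|."""
    words_a = set(re.findall(r"[a-z0-9.]+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9.]+", b.lower()))
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

score = keyword_overlap(
    "How to speed up your Node.js API",
    "Optimizing performance in a JavaScript backend",
)
print(score)  # prints 0.0: the two titles share no words at all
```

Zero shared words, yet both titles are about the same thing. That gap is what embeddings are meant to close.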

Every time you paste text into Nooth:

it cleans and normalizes it
creates sentence-level embeddings
composes them into a unified “semantic fingerprint”

That fingerprint becomes the reference point for the search.
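The three steps above can be sketched roughly like this. It is a minimal sketch under loud assumptions: `toy_embed` is a hashed bag-of-words placeholder with no semantic understanding (a real system would call an actual embedding model there), and mean pooling is just one plausible way to compose sentence vectors. The post doesn't say which model or pooling method Nooth uses.

```python
import hashlib
import math
import re

DIM = 64  # toy dimensionality, chosen only for this sketch

def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def toy_embed(sentence: str) -> list[float]:
    """Placeholder embedding (hashed bag of words). NOT semantic."""
    vec = [0.0] * DIM
    for word in re.findall(r"\w+", sentence.lower()):
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    return vec

def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def fingerprint(text: str, embed=toy_embed) -> list[float]:
    """Clean the text, embed each sentence, then mean-pool into one unit vector."""
    sentences = split_sentences(normalize(text))
    if not sentences:
        return [0.0] * DIM
    vectors = [l2_normalize(embed(s)) for s in sentences]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    return l2_normalize(mean)

fp = fingerprint("Speed up your Node.js API.  Cache hot paths!  Measure first?")
```

Passing a different `embed` function swaps in a real model without touching the rest of the pipeline.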

🌐 Step 2: Finding Content Across the Web

Nooth crawls and indexes content using:

lightweight scrapers
structured metadata
boilerplate removal
language detection
canonical URL resolution
and chunking for long-form content

The goal isn’t to store the entire HTML — only the meaningful part:

titles, headings, paragraphs, and contextual metadata.

Each chunk gets its own embedding.

That allows highly precise, paragraph-level similarity detection.
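Chunking for long-form content could look something like the sketch below. The function name `chunk_paragraphs`, the blank-line paragraph delimiter, and the 120-word default are all assumptions for illustration; the post doesn't specify how Nooth sizes its chunks.

```python
def chunk_paragraphs(text: str, max_words: int = 120) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is sliced on word boundaries.
    """
    chunks: list[str] = []
    current: list[str] = []
    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue
        if len(words) > max_words:
            # Oversized paragraph: flush what we have, then slice it up
            if current:
                chunks.append(" ".join(current))
                current = []
            for i in range(0, len(words), max_words):
                chunks.append(" ".join(words[i:i + max_words]))
            continue
        if current and len(current) + len(words) > max_words:
            # Adding this paragraph would overflow the chunk, so start a new one
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = chunk_paragraphs("a b c\n\nd e\n\nf g h i j k", max_words=4)
```

Each string in the returned list would then be embedded separately, which is what makes paragraph-level matching possible.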

📊 Step 3: Computing Semantic Similarity

Once all candidates are scraped, Nooth scores each one using a combination of:

cosine similarity
conceptual overlap scoring
syntactic variance weighting
partial-match scoring (for rewritten content)

This combination is what lets Nooth detect even heavily rephrased articles.

It doesn’t just ask:

“Do these texts share words?”

It asks:

“Do these texts express the same thing?”
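Of the scoring signals listed in this step, cosine similarity is the standard one, so here is a small sketch of it. The `partial_match` function next to it is a guess at what partial-match scoring could mean (the share of your chunks that have a close match on a page). Its name, logic, and the 0.8 threshold are assumptions, not Nooth's actual formula.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), or 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def partial_match(query_vecs, page_vecs, threshold: float = 0.8) -> float:
    """Share of query chunks whose best match on the page reaches the threshold."""
    if not query_vecs or not page_vecs:
        return 0.0
    hits = sum(
        1
        for q in query_vecs
        if max(cosine(q, p) for p in page_vecs) >= threshold
    )
    return hits / len(query_vecs)

# Two query chunks, one page chunk: only the first query chunk matches
coverage = partial_match([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
print(coverage)  # prints 0.5
```

A score like this stays meaningful when only part of a text was reused, which a single whole-document comparison can hide.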

🔍 Step 4: Presenting Clear Evidence

The final result looks like this (example flow):

You paste a paragraph
The engine breaks it into meaning blocks
It fetches the semantically closest content from across the web
You get:

- URL
- domain
- similarity percentage
- highlighted overlapping ideas
- reasoning on _why_ the match is close

This makes it useful for:

discovering who’s copying your content
researching related topics
finding competitors
understanding what else exists on the same theme
generating new ideas based on existing clusters
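The result fields from this step (URL, domain, similarity percentage) could be modeled along these lines. The `Match` class, the `make_match` helper, and the idea of clamping a cosine score and scaling it to a percentage are assumptions for illustration; the post doesn't say how Nooth derives its percentages.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class Match:
    url: str
    domain: str
    similarity_pct: float

def make_match(url: str, cosine_score: float) -> Match:
    """Build a result row; the domain is taken from the URL's network location."""
    # Cosine similarity can be negative, so clamp to [0, 1] before scaling
    clamped = max(0.0, min(1.0, cosine_score))
    return Match(
        url=url,
        domain=urlparse(url).netloc,
        similarity_pct=round(clamped * 100, 1),
    )

def rank(matches: list[Match]) -> list[Match]:
    """Sort matches from most to least similar."""
    return sorted(matches, key=lambda m: m.similarity_pct, reverse=True)

results = rank([
    make_match("https://example.com/post", 0.873),
    make_match("https://example.org/other", 0.952),
])
```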

🧭 Why I Built It

I didn’t want another Google-like keyword search.

I wanted a lens into the semantic web — the world of ideas, not strings.

Whether you’re a researcher, writer, builder, or SEO nerd, sometimes you need to find content that’s conceptually similar, not textually identical.

Nooth solves that by combining:

modern embeddings
optimized scraping
semantic search
scoring heuristics
and a clean UI on top of everything

It shows you what’s related at the level of ideas, not just words.

🧪 Try It Out

If you want to test it:

👉 Paste any text.

👉 See what the web holds that’s semantically connected.

👉 Explore clusters of related ideas you didn’t even know existed.

https://nooth.dev/

🔮 What’s Next

I’m currently working on:

deeper plagiarism detection
topic extraction
content clustering
semantic timelines
alerts when new similar content appears
a full API for developers

If any of this sounds interesting, I’d love feedback.

Thanks for reading — and welcome to the world of semantic discovery.

DEV Community