
Jeff Reese

Posted on • Originally published at purecontext.dev

Embeddings: How AI Knows Things Are Similar

AI in Practice, No Fluff — Day 8/10

I built a memory system for one of my AI projects. Every conversation and decision gets saved as a journal entry. After a few weeks, I had hundreds of entries. Finding the right one when I needed it was the problem.

Keyword search was useless. If I searched for "authentication," I would miss the entry where I wrote about "login flow" or "user credentials." The words were different. The meaning was the same. I needed something that could match on meaning, not just spelling.

That something is embeddings.

In the first series, I mentioned embeddings as part of how RAG systems prepare your data for retrieval. Today I want to unpack what embeddings actually are, how they work, and why they matter well beyond RAG.

A list of numbers that represents meaning

An embedding is a list of numbers. That is it. You send a piece of text to an embedding model, and it returns a list of numbers (called a vector) that represents the meaning of that text.

The list is long. Depending on the model, it might be 256 numbers, 1,024 numbers, or even more. Each number represents some dimension of meaning that the model learned during training. You do not get to choose what those dimensions mean, and honestly, most of them are not interpretable by humans. The model learned its own internal language for representing concepts.
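
In code, that round trip is a single API call. Here is a minimal sketch using OpenAI's Python client; the model name is one option among many, and any embedding provider works the same way:

```python
# A minimal sketch, assuming the openai package (pip install openai)
# and an OPENAI_API_KEY environment variable. Text in, vector out.
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",  # one option among many
    input="The dog sat on the porch",
)

vector = response.data[0].embedding
print(len(vector))  # 1536 for this model
print(vector[:3])   # a few opaque floats
```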

Texts with similar meanings get similar numbers.

"The dog sat on the porch" and "A canine rested on the veranda" would produce vectors that are very close to each other, even though they share zero words. "The stock market crashed" would produce a vector that is far away from both of them.

The model is not doing keyword matching or looking for shared words. It has learned, from training on massive amounts of text, that certain concepts are related and should be positioned near each other in this numerical space.

How similarity actually works

When I say two vectors are "close" or "far apart," I mean something specific. The standard way to measure this is cosine similarity.

I am not going to walk through the linear algebra. What matters is the intuition: cosine similarity measures the angle between two vectors. If two vectors point in roughly the same direction, they are similar. If they point in different directions, they are not.

The score ranges from -1 to 1. A score of 1 means identical direction (same meaning). Zero means unrelated. Negative scores mean the vectors point in opposite directions, though in practice most text embeddings land between 0 and 1.
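
If you want to see it concretely anyway, the whole computation fits in a few lines of numpy:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    # 1 = same direction, 0 = unrelated, -1 = opposite direction.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```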

When I search my memory system for "how we decided on the database architecture," the system embeds that query, compares its vector against every stored entry's vector, and returns the ones with the highest cosine similarity scores. It finds entries about "choosing SQLite over Supabase" and "why we went with local storage instead of cloud" because those entries point in a similar direction, even though the exact words are completely different.
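
Here is roughly what that loop looks like. The entries are made up for illustration, and a real system stores the vectors alongside each entry instead of re-embedding on every search:

```python
# A simplified sketch of the search loop.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.asarray(resp.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

entries = [
    "Chose SQLite over Supabase for the memory store",
    "Why we went with local storage instead of cloud",
    "Fixed the flaky CI pipeline",
]
entry_vectors = [embed(text) for text in entries]  # computed once, upfront

query_vector = embed("how we decided on the database architecture")
scores = [cosine_similarity(query_vector, v) for v in entry_vectors]

# Highest similarity first: the two storage entries outrank the CI one.
for score, text in sorted(zip(scores, entries), reverse=True):
    print(f"{score:.3f}  {text}")
```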

That is semantic search. It is the foundation for a surprising number of AI applications.

What embeddings enable

The most common way people encounter embeddings is through RAG, but embeddings are useful on their own, without a language model involved at all.

Semantic search. The example I just described. Instead of matching keywords, you match meaning. This is how many modern search engines, documentation sites, and knowledge bases find relevant results even when your query uses different terminology than the source material.

Deduplication. If you have a database of support tickets and you want to find near-duplicates, you can embed each ticket and cluster the ones with high similarity. Two tickets that describe the same bug in different words will land close together.
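
A sketch of that pattern, reusing the embed() helper from the search example above. The tickets and the 0.85 threshold are made up; you would tune the threshold on your own data:

```python
# Flag near-duplicate tickets by pairwise similarity.
import itertools
import numpy as np

tickets = [
    "App crashes when I upload a photo",
    "Uploading an image makes the app crash",
    "How do I reset my password?",
]
vectors = [embed(t) for t in tickets]  # embed() as defined earlier

for (i, a), (j, b) in itertools.combinations(enumerate(vectors), 2):
    score = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    if score > 0.85:  # tune on real data
        print(f"Possible duplicates: {tickets[i]!r} / {tickets[j]!r}")
```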

Classification and clustering. Embed a set of documents and group them by similarity. Customer feedback sorts itself into themes without you defining the categories upfront. Product reviews cluster into topics. The structure emerges from the data.
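  
A sketch of the clustering version with scikit-learn's k-means, again reusing embed(); the feedback strings are invented and picking three clusters is arbitrary:

```python
# Group feedback into themes by clustering the embedding vectors.
# Assumes scikit-learn (pip install scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

feedback = [
    "Checkout keeps timing out",
    "Payment page is so slow",
    "Love the new dark mode",
    "Dark theme looks great",
    "Please add CSV export",
]
X = np.array([embed(text) for text in feedback])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for label, text in sorted(zip(labels, feedback)):
    print(label, text)
```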

Anomaly detection. If most of your data points cluster together but one sits far away, that outlier might be worth investigating. Fraud detection, content moderation, and quality control all use this pattern.
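
One simple version of that idea: compare every point to the centroid of the whole set and flag the least similar one. The texts here are invented examples:

```python
# Flag the entry least similar to the centroid, reusing embed().
import numpy as np

texts = [
    "Refund request for a duplicate charge",
    "Customer asking where their refund is",
    "Refund issued after a billing mistake",
    "URGENT wire five thousand dollars to this account now",
]
X = np.array([embed(t) for t in texts])
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit length

centroid = X.mean(axis=0)
centroid /= np.linalg.norm(centroid)

scores = X @ centroid  # cosine similarity of each point to the centroid
print("Outlier:", texts[int(np.argmin(scores))])
```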

Recommendation. "If you liked this article, here are similar ones." Embed the articles, find the nearest neighbors to the one the user just read. This can complement the collaborative filtering you may already have in place.

The part that changes how you think about code

Here is where this gets practical for anyone who writes software or works with data.

Anywhere you have fuzzy matching logic in code, embeddings might be a better solution. I mean the kind of code where you are trying to determine if two strings are "close enough" to be considered the same thing.

Think about:

  • A customer types "NYC" and you need to match it to "New York City" in your database
  • Searching product descriptions when the user's query does not match your exact product names
  • Matching job postings to resumes when the terminology differs between industries
  • Finding related articles when titles and tags do not overlap

Traditional approaches use Levenshtein distance, regex patterns, synonym lists, or elaborate normalization pipelines. They work until they do not. Every edge case requires another rule. The rule list grows. Maintenance becomes painful.

Embeddings can often match or beat those results with far less code: embed both strings, compute cosine similarity, threshold at a score you choose. The matching is based on meaning, not character patterns. "NYC" and "New York City" are close. "I need to fix a bug in my Python code" and "there is an error in my script" are close. No lookup table required.
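
A sketch of what that replacement looks like, reusing the embed() helper from earlier. The 0.8 threshold is a starting point, not a universal constant:

```python
# Embedding-based replacement for fuzzy string matching.
import numpy as np

def is_same_thing(a: str, b: str, threshold: float = 0.8) -> bool:
    va, vb = embed(a), embed(b)
    score = float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return score >= threshold

print(is_same_thing("NYC", "New York City"))  # likely True
print(is_same_thing("NYC", "Nice, France"))   # likely False
```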

This is not hypothetical for me. I replaced keyword-based search in my own memory system with embedding-based search and the improvement was immediate. Queries that returned nothing before started finding exactly the right entries.

Embedding models are not language models

This is a distinction worth understanding. When you use ChatGPT or Claude, you are using a language model. It generates text, reasons through problems, and holds conversations.

An embedding model does one thing: it converts text into a vector. It does not generate text or have conversations. It is a different kind of model, trained specifically to produce useful numerical representations of meaning.

You can use embedding models from OpenAI, Google, Voyage AI, Cohere, and others. Some are general purpose. Some are optimized for specific domains like code, legal documents, or financial text. The choice of model matters because different models capture different nuances. A model trained heavily on code will produce better embeddings for code search than a general-purpose model.

The cost is also dramatically different from language models. Embedding a million tokens of text might cost a few cents. Generating a million tokens of text with a language model costs dollars. Embeddings are cheap to produce and cheap to store.

The practical tradeoffs

Embeddings are not magic. A few things worth knowing before you reach for them:

You need to embed everything upfront. Before you can search your data semantically, every piece of text needs to be converted to a vector and stored. For a small dataset, this is trivial. For millions of documents, it takes planning.

Embedding quality depends on the model. A model that was not trained on your domain might produce mediocre representations of your specific terminology. If you work in a specialized field, test a few models before committing.

Vectors are opaque. You cannot look at a vector and understand what it means. If the similarity score is wrong, debugging is harder than with keyword search. You cannot just add a synonym to fix it.

Context length matters. Most embedding models have a maximum input length. If you need to embed a 50-page document, you will need to chunk it into smaller pieces first. How you chunk affects quality. This is where the nuance lives in production systems.
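
To make the idea concrete, here is a naive fixed-size chunker with overlap. The sizes are arbitrary, and production systems often split on sentence or paragraph boundaries instead:

```python
# Split a long document into overlapping chunks before embedding.
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # overlap keeps context across boundaries
    return chunks
```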

Where this leads

Tomorrow: the question that ties this all together. You have a million-token context window. You have embeddings that let you search semantically. When should you load the whole book into the context, and when should you retrieve just the relevant pieces? That is the RAG decision, and it is one of the most important architectural choices in AI applications right now.
