Understanding Embeddings: The Building Block of Modern AI Applications

Embeddings turn text into numbers that capture meaning. They power search, recommendations, RAG, and clustering. Here's how to use them effectively.

What Are Embeddings?

An embedding is a vector (list of numbers) that represents the meaning of text. Similar texts have similar vectors.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["How to train a neural network", "Deep learning model training guide", "Best pizza recipes in New York"]
embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 384)
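Each text becomes a 384-dimensional vector. all-MiniLM-L6-v2 is a small model that runs locally, which makes it a good default for prototyping.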

Measuring Similarity

from sklearn.metrics.pairwise import cosine_similarity

sims = cosine_similarity(embeddings)
# texts[0] vs texts[1]: ~0.82 (both about ML)
# texts[0] vs texts[2]: ~0.12 (ML vs pizza)
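Under the hood, cosine similarity is just the dot product of two vectors scaled by their lengths. A minimal NumPy equivalent of the scikit-learn call above:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product scaled by both vector lengths -> value in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings[0], embeddings[1]))  # ~0.82, matching the matrix above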

Building a Semantic Search Engine

import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticSearch:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index(self, documents: list[str]):
        # Encode the corpus once up front; normalize_embeddings makes every
        # vector unit length, so a plain dot product equals cosine similarity
        self.documents = documents
        self.embeddings = self.model.encode(documents, normalize_embeddings=True)

    def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        q_emb = self.model.encode([query], normalize_embeddings=True)
        scores = np.dot(self.embeddings, q_emb.T).flatten()  # cosine scores
        top_idx = np.argsort(scores)[::-1][:top_k]  # highest scores first
        return [(self.documents[i], float(scores[i])) for i in top_idx]

engine = SemanticSearch()
engine.index([
    "Python asyncio tutorial for beginners",
    "How to deploy Docker containers to Kubernetes",
    "Building REST APIs with FastAPI",
    "Introduction to machine learning with scikit-learn",
])
results = engine.search("async programming in Python")
for doc, score in results:
    print(f"{score:.3f}: {doc}")
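The np.dot scan compares the query against every document, which is fine for a few thousand entries. For larger corpora, switch to an approximate nearest-neighbor index, which is what the vector databases covered below provide.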

Using OpenAI Embeddings

from openai import OpenAI
client = OpenAI()

def get_embedding(text: str, model="text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

def get_embeddings(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(input=texts, model="text-embedding-3-small")
    return [item.embedding for item in response.data]
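OpenAI's embedding models return vectors normalized to length 1, so cosine similarity is again just a dot product. A quick usage sketch (assuming OPENAI_API_KEY is set in the environment):

import numpy as np

a = np.array(get_embedding("how to train a neural network"))
b = np.array(get_embedding("deep learning model training guide"))
print(np.dot(a, b))  # cosine similarity, since the vectors are unit length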

Vector Databases for Production

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="documents", metadata={"hnsw:space": "cosine"})

collection.add(
    documents=["doc1 text", "doc2 text", "doc3 text"],
    metadatas=[{"source": "blog"}, {"source": "docs"}, {"source": "blog"}],
    ids=["doc1", "doc2", "doc3"]
)

results = collection.query(query_texts=["search query"], n_results=5, where={"source": "blog"})
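When you pass only documents, Chroma embeds them with its built-in default embedding function. If you'd rather reuse vectors from your own model (say, the SentenceTransformer from earlier, here called model), a sketch of supplying them explicitly:

emb = model.encode(["doc1 text", "doc2 text", "doc3 text"], normalize_embeddings=True)
collection.upsert(  # upsert avoids duplicate-ID errors if the docs were already added
    documents=["doc1 text", "doc2 text", "doc3 text"],
    embeddings=emb.tolist(),  # your own vectors instead of Chroma's default
    ids=["doc1", "doc2", "doc3"],
)
# Query with vectors from the same model so the spaces match
results = collection.query(query_embeddings=model.encode(["search query"]).tolist(), n_results=5)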

Clustering with Embeddings

from sklearn.cluster import KMeans

# Assumes `documents` is a list of texts and `embeddings` their encoded vectors
# (as in the search example); n_clusters must not exceed len(documents)
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Group documents by assigned cluster label
clusters = {}
for doc, label in zip(documents, labels):
    clusters.setdefault(label, []).append(doc)
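To get a quick sense of what each cluster is about, one cheap trick is to print the document closest to each centroid. A minimal sketch using the variables above:

import numpy as np

for label, center in enumerate(kmeans.cluster_centers_):
    # The document nearest the centroid is a reasonable cluster representative
    dists = np.linalg.norm(embeddings - center, axis=1)
    print(f"Cluster {label}: {documents[int(np.argmin(dists))]}")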

Key Takeaways

1. Embeddings capture semantic meaning as vectors
2. Cosine similarity measures how related two texts are
3. Use sentence-transformers for free, local embeddings
4. Vector databases handle storage and search at scale
5. Normalize embeddings for faster cosine similarity
6. Choose a model based on quality/speed/cost tradeoffs

🚀 Level up your AI workflow! Check out my AI Developer Mega Prompt Pack — 80 battle-tested prompts for developers. $9.99
