Understanding Embeddings: The Building Block of Modern AI Applications

Embeddings turn text into numbers that capture meaning. They power search, recommendations, RAG, and clustering. Here's how to use them effectively.

What Are Embeddings?

An embedding is a vector (list of numbers) that represents the meaning of text. Similar texts have similar vectors.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["How to train a neural network", "Deep learning model training guide", "Best pizza recipes in New York"]
embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 384)
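Each text becomes a 384-dimensional vector. all-MiniLM-L6-v2 is a small model that runs locally, which makes it a good default for prototyping.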

Measuring Similarity

from sklearn.metrics.pairwise import cosine_similarity

sims = cosine_similarity(embeddings)
# texts[0] vs texts[1]: ~0.82 (both about ML)
# texts[0] vs texts[2]: ~0.12 (ML vs pizza)
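Under the hood, cosine similarity is just the dot product of two vectors scaled by their lengths. A minimal NumPy equivalent of the scikit-learn call above:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product scaled by both vector lengths -> value in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings[0], embeddings[1]))  # ~0.82, matching the matrix above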

Building a Semantic Search Engine

import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticSearch:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index(self, documents: list[str]):
        # Encode the corpus once up front; normalize_embeddings makes every
        # vector unit length, so a plain dot product equals cosine similarity
        self.documents = documents
        self.embeddings = self.model.encode(documents, normalize_embeddings=True)

    def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        q_emb = self.model.encode([query], normalize_embeddings=True)
        scores = np.dot(self.embeddings, q_emb.T).flatten()  # cosine scores
        top_idx = np.argsort(scores)[::-1][:top_k]  # highest scores first
        return [(self.documents[i], float(scores[i])) for i in top_idx]

engine = SemanticSearch()
engine.index([
    "Python asyncio tutorial for beginners",
    "How to deploy Docker containers to Kubernetes",
    "Building REST APIs with FastAPI",
    "Introduction to machine learning with scikit-learn",
])
results = engine.search("async programming in Python")
for doc, score in results:
    print(f"{score:.3f}: {doc}")
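The np.dot scan compares the query against every document, which is fine for a few thousand entries. For larger corpora, switch to an approximate nearest-neighbor index, which is what the vector databases covered below provide.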

Using OpenAI Embeddings

from openai import OpenAI
client = OpenAI()

def get_embedding(text: str, model="text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

def get_embeddings(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(input=texts, model="text-embedding-3-small")
    return [item.embedding for item in response.data]
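OpenAI's embedding models return vectors normalized to length 1, so cosine similarity is again just a dot product. A quick usage sketch (assuming OPENAI_API_KEY is set in the environment):

import numpy as np

a = np.array(get_embedding("how to train a neural network"))
b = np.array(get_embedding("deep learning model training guide"))
print(np.dot(a, b))  # cosine similarity, since the vectors are unit length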

Vector Databases for Production

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="documents", metadata={"hnsw:space": "cosine"})

collection.add(
    documents=["doc1 text", "doc2 text", "doc3 text"],
    metadatas=[{"source": "blog"}, {"source": "docs"}, {"source": "blog"}],
    ids=["doc1", "doc2", "doc3"]
)

results = collection.query(query_texts=["search query"], n_results=5, where={"source": "blog"})
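When you pass only documents, Chroma embeds them with its built-in default embedding function. If you'd rather reuse vectors from your own model (say, the SentenceTransformer from earlier, here called model), a sketch of supplying them explicitly:

emb = model.encode(["doc1 text", "doc2 text", "doc3 text"], normalize_embeddings=True)
collection.upsert(  # upsert avoids duplicate-ID errors if the docs were already added
    documents=["doc1 text", "doc2 text", "doc3 text"],
    embeddings=emb.tolist(),  # your own vectors instead of Chroma's default
    ids=["doc1", "doc2", "doc3"],
)
# Query with vectors from the same model so the spaces match
results = collection.query(query_embeddings=model.encode(["search query"]).tolist(), n_results=5)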

Clustering with Embeddings

from sklearn.cluster import KMeans

# Assumes `documents` is a list of texts and `embeddings` their encoded vectors
# (as in the search example); n_clusters must not exceed len(documents)
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Group documents by assigned cluster label
clusters = {}
for doc, label in zip(documents, labels):
    clusters.setdefault(label, []).append(doc)
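To get a quick sense of what each cluster is about, one cheap trick is to print the document closest to each centroid. A minimal sketch using the variables above:

import numpy as np

for label, center in enumerate(kmeans.cluster_centers_):
    # The document nearest the centroid is a reasonable cluster representative
    dists = np.linalg.norm(embeddings - center, axis=1)
    print(f"Cluster {label}: {documents[int(np.argmin(dists))]}")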

Key Takeaways

1. Embeddings capture semantic meaning as vectors
2. Cosine similarity measures how related two texts are
3. Use sentence-transformers for free, local embeddings
4. Vector databases handle storage and search at scale
5. Normalize embeddings for faster cosine similarity
6. Choose a model based on quality/speed/cost tradeoffs

🚀 Level up your AI workflow! Check out my AI Developer Mega Prompt Pack — 80 battle-tested prompts for developers. $9.99
