Embeddings turn text into numbers that capture meaning. They power semantic search, recommendations, retrieval-augmented generation (RAG), and clustering. Here's how to use them effectively.
## What Are Embeddings?
An embedding is a vector (list of numbers) that represents the meaning of text. Similar texts have similar vectors.
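Here's a quick look using the sentence-transformers library, which runs locally and is free: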
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "How to train a neural network",
    "Deep learning model training guide",
    "Best pizza recipes in New York",
]

embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 384)
```
## Measuring Similarity
```python
from sklearn.metrics.pairwise import cosine_similarity

sims = cosine_similarity(embeddings)
# sims[0, 1] ~ 0.82  (texts[0] vs texts[1]: both about ML)
# sims[0, 2] ~ 0.12  (texts[0] vs texts[2]: ML vs pizza)
```
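Under the hood, cosine similarity is just the dot product of two vectors divided by the product of their lengths. A minimal sketch with NumPy, reusing `embeddings` from above:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # dot(a, b) / (|a| * |b|), ranges from -1 to 1
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings[0], embeddings[1]))  # high: both about ML
print(cosine(embeddings[0], embeddings[2]))  # low: ML vs pizza
```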
## Building a Semantic Search Engine
```python
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticSearch:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def index(self, documents: list[str]):
        # Embed and store the corpus once, up front
        self.documents = documents
        self.embeddings = self.model.encode(documents, normalize_embeddings=True)

    def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        q_emb = self.model.encode([query], normalize_embeddings=True)
        # Dot product of unit vectors == cosine similarity
        scores = np.dot(self.embeddings, q_emb.T).flatten()
        top_idx = np.argsort(scores)[::-1][:top_k]
        return [(self.documents[i], float(scores[i])) for i in top_idx]
```
```python
engine = SemanticSearch()
engine.index([
    "Python asyncio tutorial for beginners",
    "How to deploy Docker containers to Kubernetes",
    "Building REST APIs with FastAPI",
    "Introduction to machine learning with scikit-learn",
])

results = engine.search("async programming in Python")
for doc, score in results:
    print(f"{score:.3f}: {doc}")
```
## Using OpenAI Embeddings
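If you'd rather not host a model yourself, hosted embedding APIs work the same way. Two small helpers for the OpenAI API: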
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

def get_embeddings(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    # One API call for the whole batch: cheaper and faster than looping
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]
```
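A quick usage sketch. OpenAI embeddings come back normalized to unit length, so a plain dot product gives cosine similarity:

```python
import numpy as np

a = np.array(get_embedding("How to train a neural network"))
b = np.array(get_embedding("Deep learning model training guide"))

# Unit-length vectors: dot product == cosine similarity
print(float(np.dot(a, b)))
```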
## Vector Databases for Production
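A NumPy dot product works fine for a few thousand documents, but at scale you want a vector database that handles persistence and approximate nearest-neighbor search. Chroma is an easy place to start: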
```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"},  # use cosine distance in the HNSW index
)

collection.add(
    documents=["doc1 text", "doc2 text", "doc3 text"],
    metadatas=[{"source": "blog"}, {"source": "docs"}, {"source": "blog"}],
    ids=["doc1", "doc2", "doc3"],
)

# `where` restricts results by metadata; n_results caps the hits returned
results = collection.query(query_texts=["search query"], n_results=5, where={"source": "blog"})
```
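When you pass raw `documents` and `query_texts` like this, Chroma embeds them with its default embedding function. If you want to control the model yourself, pass precomputed vectors instead. A sketch reusing the `client` from above (the collection name and ids here are just illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["doc1 text", "doc2 text", "doc3 text"]

custom = client.get_or_create_collection(name="custom_vectors")
custom.add(
    documents=docs,
    embeddings=model.encode(docs).tolist(),  # bring your own vectors
    ids=["a", "b", "c"],
)

hits = custom.query(
    query_embeddings=model.encode(["search query"]).tolist(),
    n_results=2,
)
```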
## Clustering with Embeddings
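Embeddings also make unsupervised grouping straightforward: cluster the vectors, and documents about the same topic end up together.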
```python
from sklearn.cluster import KMeans

# Assumes `documents` is a list of strings and `embeddings` came
# from model.encode(documents), as in the earlier sections
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(embeddings)

clusters = {}  # cluster label -> list of documents
for doc, label in zip(documents, labels):
    clusters.setdefault(label, []).append(doc)
```
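To get a feel for what each cluster contains, one cheap heuristic is to print the document whose embedding sits closest to each centroid (a rough label, not a definitive one):

```python
import numpy as np

for label, centroid in enumerate(kmeans.cluster_centers_):
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    representative = documents[int(np.argmin(dists))]
    print(f"Cluster {label}: {representative}")
```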
## Key Takeaways
- Embeddings capture semantic meaning as vectors
- Cosine similarity measures how related two texts are
- Use sentence-transformers for free, local embeddings
- Vector databases handle storage and search at scale
- Normalize embeddings for faster cosine similarity
- Choose a model based on quality, speed, and cost tradeoffs