Why embeddings matter and how modern AI systems search meaning, not keywords.
For decades, software engineers have relied on keyword-based search. Whether using a simple LIKE clause in SQL or a complex Elasticsearch cluster, the underlying logic was the same: find documents that contain these exact characters. While powerful, keyword search is fundamentally "dumb." It doesn't understand that "canine" and "dog" refer to the same thing, or that "how to start a fire" and "igniting a flame" express the same intent.
Modern Generative AI systems have moved beyond this limitation using Vector Databases. These systems enable "Semantic Search"—searching by meaning rather than by characters. For any developer building GenAI applications, understanding how these databases work is no longer optional; it is a core requirement of the modern stack.
Why Keyword Search Is Limited
Keyword search relies on exact or fuzzy matching of tokens. This leads to two primary issues:
The Synonym Problem: If your user searches for "automobile repairs" but your database only contains the phrase "car maintenance," a keyword search will return zero results.
The Context Problem: Keywords don't understand context. A search for "bank" could refer to a financial institution or the side of a river. Keyword engines struggle to distinguish between the two without complex manual tagging.
Vector search solves this by translating human language into a mathematical space where similar meanings are physically close to one another.
What Embeddings Are: The Intuition
At the heart of every vector database is a concept called an "Embedding." An embedding is a way of representing data—usually text, but also images or audio—as a long list of numbers (a vector).
Think of an embedding as a set of "coordinates" in a high-dimensional map. In a 2D map, you have X and Y. In a vector space used by AI, you might have 768 or 1,536 dimensions. Each dimension represents a subtle feature of the meaning.
For example, in a simplified 3D embedding space, the dimensions might roughly correlate to:
Dimension 1: Is it an animal?
Dimension 2: Is it a tool?
Dimension 3: Is it related to technology?
The word "Hammer" would have a high value in Dimension 2 and a low value in Dimensions 1 and 3. The word "Dog" would have a high value in Dimension 1. Because "Dog" and "Puppy" would both have very similar coordinates across all dimensions, they end up sitting right next to each other in this mathematical map.
How Vector Similarity Works
Once we have these coordinates (vectors), searching becomes a geometry problem rather than a string-matching problem. Instead of asking "Does this string match?", we ask "Which vectors in my database are closest to the vector of the user's query?"
The most common way to calculate this "closeness" is through Cosine Similarity. This measures the angle between two vectors. If the angle is small (close to 0 degrees), the meanings are highly related. If the angle is large, the meanings are unrelated.
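In formula form, for two vectors A and B:

    cosine_similarity(A, B) = (A · B) / (|A| × |B|)

where A · B is the dot product and |A| is the vector's magnitude (length). As a quick worked example, take A = [1, 0] and B = [1, 1]: the dot product is 1, |A| = 1, and |B| = √2, so the similarity is 1/√2 ≈ 0.707, corresponding to a 45-degree angle. Identical directions score 1.0; unrelated (perpendicular) directions score 0.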
Simple Python Semantic Search Example
In a production environment, you would use a specialized embedding model and a dedicated vector database. However, we can simulate the entire process using basic Python and the math library to understand the mechanics.
import math

# --- 1. Simulating an Embedding Model ---
# In reality, an embedding model would generate these 1,536-dimension vectors.
# Here, we use 3-dimensional vectors for simplicity.
# Dimensions: [Is_Animal, Is_Technology, Is_Food]
knowledge_base = {
    "The golden retriever ran across the park.": [0.9, 0.1, 0.2],
    "The new smartphone features a high-res camera.": [0.1, 0.9, 0.1],
    "I ate a delicious pepperoni pizza for lunch.": [0.2, 0.1, 0.9],
    "A cat was sleeping on the sofa.": [0.8, 0.1, 0.1],
}

# --- 2. Cosine Similarity Function ---
def cosine_similarity(v1, v2):
    # The dot product measures how much the vectors point the same way.
    dot_product = sum(a * b for a, b in zip(v1, v2))
    magnitude1 = math.sqrt(sum(a**2 for a in v1))
    magnitude2 = math.sqrt(sum(a**2 for a in v2))
    # Guard against division by zero for all-zero vectors.
    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0
    return dot_product / (magnitude1 * magnitude2)

# --- 3. The Search Engine ---
def semantic_search(query_vector, documents):
    results = []
    for text, doc_vector in documents.items():
        score = cosine_similarity(query_vector, doc_vector)
        results.append((text, score))
    # Sort by highest similarity score
    return sorted(results, key=lambda x: x[1], reverse=True)

# --- 4. Execution ---
# User query: "Tell me about pets."
# Our 'embedding model' translates this query into a vector:
query_vec = [0.85, 0.05, 0.1]  # Highly related to 'Animal'

print("Searching for: 'Tell me about pets.'\n")
search_results = semantic_search(query_vec, knowledge_base)
for text, score in search_results:
    print(f"Score: {score:.4f} | {text}")
How Vector Databases Scale This Idea
The Python example above works well for four sentences, but it performs an O(N) search: it checks every single document. If you have 100 million documents, calculating cosine similarity against every one of them in real time becomes prohibitively slow.
A Vector Database (like Milvus, Pinecone, or Weaviate) solves this through Approximate Nearest Neighbor (ANN) algorithms. Instead of checking every vector, it builds an index (similar in spirit to a B-Tree or hash map, but for geometry) that lets it jump straight to the "neighborhood" of the query vector. This enables sub-second searches across billions of data points, at the cost of occasionally missing the exact nearest match, which is why the results are "approximate."
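To see what this looks like in practice, here is a minimal sketch using the open-source hnswlib library, one popular ANN implementation. The index parameters (ef_construction, M) are illustrative defaults rather than tuned values, and the vectors are random stand-ins for real embeddings.

# A minimal ANN sketch (pip install hnswlib numpy).
import numpy as np
import hnswlib

dim, num_docs = 768, 100_000
# Random stand-in embeddings; in practice these come from an embedding model.
doc_vectors = np.random.rand(num_docs, dim).astype(np.float32)

# Build an HNSW graph index that uses cosine distance.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_docs, ef_construction=200, M=16)
index.add_items(doc_vectors, np.arange(num_docs))

# Find the 3 approximate nearest neighbors without scanning all 100,000 vectors.
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=3)
print(labels, distances)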
Where Vector DBs Are Used in GenAI
Vector databases are the storage engine for Retrieval-Augmented Generation (RAG).
Step 1: You take your company's PDFs and manuals, split them into chunks, turn each chunk into a vector, and store the vectors in a Vector DB.
Step 2: When a user asks a question, you turn that question into a vector.
Step 3: You query the Vector DB for the top 3 most similar document snippets.
Step 4: You feed those snippets to your LLM to generate an accurate, grounded response.
Without the Vector DB, the LLM would have no way to "look up" specific information from your private data efficiently.
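Put together, the loop looks roughly like the sketch below. Everything in it is a placeholder: embed, vector_db, and llm are hypothetical names standing in for your embedding model, vector database client, and LLM client, and real SDKs will differ.

# Schematic RAG loop. `embed`, `vector_db`, and `llm` are hypothetical
# placeholders for your embedding model, vector DB client, and LLM client.
def answer_question(question: str) -> str:
    query_vector = embed(question)                       # Step 2: embed the query
    snippets = vector_db.search(query_vector, top_k=3)   # Step 3: retrieve top 3 snippets
    context = "\n\n".join(s.text for s in snippets)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm.generate(prompt)                          # Step 4: grounded generation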
Common Misconceptions
1. Vector Databases store the original text.
While many do allow you to store "metadata" (the original text), their primary job is to store the numerical vectors. The vector is what allows for the search; the text is just what you show the user.
2. You can just use a regular SQL database.
While some SQL databases (like PostgreSQL with pgvector) have added vector support, dedicated vector databases are often optimized with specific hardware acceleration (such as GPUs or SIMD instructions) to make high-dimensional math significantly faster.
3. More dimensions are always better.
Higher-dimensional embeddings (e.g., 3,072 vs. 768) can capture more nuance, but they also increase storage costs and latency. For many applications, smaller, highly optimized embeddings are more efficient than massive ones.
Conclusion
Vector databases have fundamentally changed the way we handle information. By moving from character matching to mathematical similarity, we have enabled AI systems to "understand" context and intent.
As a developer, your role is to manage the pipeline: ensuring high-quality data is converted into accurate embeddings and choosing the right indexing strategy for your vector database. When you master the vector space, you move beyond building simple keyword-matching apps and start building systems that can truly navigate the complexity of human language.