Why Vector Search?
Keyword search literally hunts for matching terms. That’s fine—until it isn’t:
| Query | Keyword Search Might Return | What You Actually Wanted |
|---|---|---|
| table tennis | “10 Best Dining Tables”, “Wimbledon Lawn Tennis Highlights” | Articles, rules and gear for table tennis / ping-pong |
Keyword engines struggle even more with non-text media: images, audio, video, genome sequences, etc. They simply don’t “see” pixels or sound waves.
Vector (semantic) search fixes this by turning each item—text, image, whatever—into a high-dimensional vector. Similar meaning -> nearby vectors. Your query is embedded the same way, and the engine brings back the closest neighbours.
TL;DR Vector search ➜ find things that feel the same, not just things that spell the same.
How Embeddings Happen
Document Vectorization
You start with a set of text passages (in the drawing they’re labelled “Text / Answers”).
Each passage is fed through an embedding model (a neural network that maps text to points in a high-dimensional space).
The model outputs a vector for each passage—these vectors (sometimes called word or sentence embeddings) capture the meaning of the text as coordinates in that space.
Query Vectorization & Retrieval
When a user asks a question, you send the question through the same embedding model and obtain a query vector.
You then compare that query vector to all of your stored document vectors (e.g. with cosine similarity).
The documents whose vectors lie closest to the query vector are the most semantically relevant answers, even if they don’t share the exact same keywords.
Why it matters: by operating in a continuous vector space rather than matching literal words, you can find passages that “mean the same thing” and surface them to your LLM (or directly to the user). This is the core of semantic (vector) search in Retrieval-Augmented Generation pipelines.
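To make that comparison step concrete, here is a minimal sketch using fastembed (the embedding library used later in this post) and NumPy; the sample passages, the query, and the choice of BAAI/bge-small-en here are purely illustrative:
import numpy as np
from fastembed import TextEmbedding

# Load a small embedding model (384-dimensional vectors)
model = TextEmbedding("BAAI/bge-small-en")

passages = [
    "Ping-pong paddles, rubbers and balls explained",
    "Wimbledon lawn tennis highlights",
    "10 best dining tables for small apartments",
]
query = "table tennis gear"

# fastembed returns a generator of NumPy vectors
passage_vecs = list(model.embed(passages))
query_vec = next(model.embed([query]))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Higher cosine similarity = semantically closer to the query
for text, vec in zip(passages, passage_vecs):
    print(f"{cosine(query_vec, vec):.3f}  {text}")
The ping-pong passage should rank first even though it never contains the words "table tennis"—exactly the behaviour keyword search misses.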
Give It a Go with Qdrant
Many open-source vector databases exist; we’ll use Qdrant because it’s lightweight, fast, and has a friendly Python client.
Setup
Installing Qdrant using Docker:
docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 \
-v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
qdrant/qdrant
Installing the Python client library:
python -m pip install -q "qdrant-client[fastembed]>=1.14.2"
Implementation
Stage 1: Connections and Data Prep
Import the necessary modules, connect to the vector DB, pick an embedding model that fits the need, and inspect the dataset.
# Client to connect to the vector DB.
from qdrant_client import QdrantClient

qd_client = QdrantClient("http://localhost:6333")
# Model Selection
from fastembed import TextEmbedding
models = TextEmbedding.list_supported_models()
print(f"Models:\n{models}\n\n")
# Analyse dataset and Prep the documents in the relevant format.
import requests
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()
documents = []
for course in documents_raw:
    course_name = course['course']
    if course_name != 'machine-learning-zoomcamp':
        continue
    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

print(f"Documents:\n{documents[:5]}\n\n")
Stage 2: Storage and Index Prep
Create a collection for the business problem and configure how its points (the documents to be embedded) will be stored and indexed.
from qdrant_client import QdrantClient, models
qd_client = QdrantClient("http://localhost:6333")
EMBEDDING_DIMENSIONS = 384  # BAAI/bge-small-en outputs 384-dimensional vectors
model_handle = "BAAI/bge-small-en"
# Create DB Storage
collection_name = "hw_2_collection"
qd_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONS,
        distance=models.Distance.COSINE
    )
)

# Index data
qd_client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword"  # exact match on string metadata field
)
Stage 3: Ingest data
Embed each document and upsert the resulting points into the vector DB.
points = []
for i, doc in enumerate(documents):
    q_a = doc['question'] + ' ' + doc['text']  # Concatenate question and text for embedding
    vector = models.Document(text=q_a, model=model_handle)
    point = models.PointStruct(
        id=i,
        vector=vector,
        payload=doc
    )
    points.append(point)

qd_client.upsert(
    collection_name=collection_name,
    points=points
)
Stage 4: Search capability
Provide a search function that queries the collection by similarity (cosine distance), filtered to a given course.
def vector_search(question, course="machine-learning-zoomcamp", limit=5):
    print(f"Using Vector Search with filter: {course}. Results limit: {limit}")
    q_points = qd_client.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=question,
            model=model_handle
        ),
        query_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=course)
                )
            ]
        ),
        limit=limit,
        with_payload=True
    )
    results = []
    for point in q_points.points:
        results.append(point.payload)
    return results

# Search similar items (example query; any course-related question works here).
question = "How do I run Kafka?"
res = vector_search(question, course="machine-learning-zoomcamp", limit=5)
print(res)
Stage 5: Query the LLM with the Vector DB as the RAG retriever
from openai import OpenAI

llm_client = OpenAI()
def build_prompt(q_question, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT doesn't contain the answer, output NONE
QUESTION: {question}
CONTEXT: {context}
""".strip()

    context = ""
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=q_question, context=context).strip()
    return prompt
# Query the LLM with the modified prompt
def query_llm(mod_prompt):
    response = llm_client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": mod_prompt}]
    )
    return response.choices[0].message.content
Stage 6: Check Results
def rag(query):
    search_results = vector_search(query)
    prompt = build_prompt(query, search_results)
    answer = query_llm(prompt)
    return answer

rag("How to install Kafka?")
Improving with Hybrid Search
No single search technique suits every scenario. Sometimes you need the precision of keywords (exact product codes, player stats, specific names), and other times the flexibility of semantic matching (similar games, related concepts, broader topics). A hybrid search strategy blends both:
- Sparse (keyword) embeddings for exact matches
- Dense (semantic) embeddings for meaning-based recall
- Fusion techniques (e.g. reciprocal rank fusion) or multi-stage pipelines (keyword filter → semantic re-rank, or vice versa)
Example:
- Looking up a particular player’s season statistics? A keyword search is ideal.
- Hunting for matches that felt like nail-biters? Semantic search surfaces games with similar “excitement vectors.”
Hybrid Embedding & Fusion
By storing both sparse and dense vectors in your collection and then combining their scores—either in two passes or via a fusion query—you get the best of both worlds, serving precise queries and broad, semantically rich ones with equal finesse.
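As a rough sketch of what a fused query can look like with the Qdrant Python client (assuming a separate collection, here called hybrid_collection, created with a named dense vector "dense" and a named sparse vector "sparse", e.g. built with fastembed's Qdrant/bm25 model; these names are illustrative and not part of the setup above):
from qdrant_client import QdrantClient, models

qd_client = QdrantClient("http://localhost:6333")
query = "How to install Kafka?"

results = qd_client.query_points(
    collection_name="hybrid_collection",  # assumed to hold named dense + sparse vectors
    prefetch=[
        models.Prefetch(
            query=models.Document(text=query, model="BAAI/bge-small-en"),
            using="dense",   # dense (semantic) candidates
            limit=20,
        ),
        models.Prefetch(
            query=models.Document(text=query, model="Qdrant/bm25"),
            using="sparse",  # sparse (keyword-style) candidates
            limit=20,
        ),
    ],
    # Reciprocal rank fusion merges the two candidate lists into one ranking
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=5,
    with_payload=True,
)

for p in results.points:
    print(p.score, p.payload.get("question"))
Each prefetch branch retrieves its own top candidates, and RRF combines them by rank rather than raw score, so you never have to calibrate dense and sparse scores against each other.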
References
LLM Zoomcamp: https://github.com/DataTalksClub/llm-zoomcamp/tree/main/02-vector-search