DEV Community: anup s

Comprehending Vector Search [LLM-A2]

anup s — Mon, 30 Jun 2025 13:36:03 +0000

Why Vector Search?

Keyword search literally hunts for matching terms. That’s fine—until it isn’t:

Query	Keyword Search Might Return	What You Actually Wanted
`table tennis`	“10 Best Dining Tables” “Wimbledon Lawn Tennis Highlights”	Articles, rules and gear for table tennis / ping-pong
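
To make the mismatch in the table concrete, here is a minimal sketch of a naive term-overlap scorer. The scorer, its crude plural stripping, and the third document title are assumptions made up for illustration; real keyword engines score far more carefully, but the failure mode is the same.

```python
import re

def tokens(text):
    # Lowercase, keep alphabetic runs only, and strip trailing "s"
    # as a very crude stand-in for stemming.
    return {w.rstrip("s") for w in re.findall(r"[a-z]+", text.lower())}

def keyword_score(query, doc):
    # Number of distinct query terms that also appear in the document.
    return len(tokens(query) & tokens(doc))

docs = [
    "10 Best Dining Tables",
    "Wimbledon Lawn Tennis Highlights",
    "Ping-pong rules and gear for beginners",  # hypothetical relevant article
]

query = "table tennis"
scores = {doc: keyword_score(query, doc) for doc in docs}

for doc, score in scores.items():
    print(score, doc)
```

The two irrelevant titles each share one term with the query, while the ping-pong article shares none, so a purely lexical ranker would place the most relevant document last.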

Keyword engines struggle even more with non-text media: images, audio, video, genome sequences, etc. They simply don’t “see” pixels or sound waves.

Vector (semantic) search fixes this by turning each item—text, image, whatever—into a high-dimensional vector. Similar meaning -> nearby vectors. Your query is embedded the same way, and the engine brings back the closest neighbours.

TL;DR Vector search ➜ find things that feel the same, not just things that spell the same.

How Embeddings Happen

Document Vectorization
You start with a set of text passages (in the drawing they’re labelled “Text / Answers”).
Each passage is fed through an embedding model (a neural network that maps text to points in a high-dimensional space).
The model outputs a vector for each passage—these vectors (sometimes called word or sentence embeddings) capture the meaning of the text as coordinates in that space.
Query Vectorization & Retrieval
When a user asks a question, you send the question through the same embedding model and obtain a query vector.
You then compare that query vector to all of your stored document vectors (e.g. with cosine similarity).
The documents whose vectors lie closest to the query vector are the most semantically relevant answers, even if they don’t share the exact same keywords.

Why it matters: by operating in a continuous vector space rather than matching literal words, you can find passages that “mean the same thing” and surface them to your LLM (or directly to the user). This is the core of semantic (vector) search in Retrieval-Augmented Generation pipelines.
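
The retrieval step described above can be sketched without any model at all. The 3-dimensional vectors below are hand-made placeholders, not real embeddings (a real model such as the one used later produces hundreds of dimensions); only the cosine-similarity ranking logic is the point.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings", invented for illustration only.
doc_vectors = {
    "Ping-pong rules and gear": [0.9, 0.1, 0.0],
    "10 Best Dining Tables": [0.1, 0.9, 0.1],
    "Wimbledon Lawn Tennis Highlights": [0.5, 0.0, 0.8],
}

# Pretend this is the embedding of the query "table tennis".
query_vector = [0.8, 0.2, 0.1]

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
    reverse=True,
)

for name in ranked:
    print(f"{cosine_similarity(query_vector, doc_vectors[name]):.3f}  {name}")
```

With these made-up vectors the ping-pong document ranks first even though its title shares no word with the query, which is the behaviour a vector database reproduces at scale.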

Give It a Go with Qdrant

Many open-source vector databases exist; we’ll use Qdrant because it’s lightweight, fast, and has a friendly Python client.

Setup

Installing Qdrant using Docker:

docker pull qdrant/qdrant

docker run -p 6333:6333 -p 6334:6334 \
   -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
   qdrant/qdrant

Installing the Python client libraries (the leading `!` is notebook syntax; drop it in a regular shell):

!python -m pip install -q "qdrant-client[fastembed]>=1.14.2"

Implementation

Stage 1: Connections and Data Prep

Import the modules needed to connect to the vector DB, choose an embedding model that fits your needs, and study the dataset.

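# Client to connect to the vector DB.
from qdrant_client import QdrantClient

qd_client = QdrantClient("http://localhost:6333")

# Model selection: list the embedding models fastembed supports.
from fastembed import TextEmbedding

# Named supported_models so it does not clash with qdrant_client.models later.
supported_models = TextEmbedding.list_supported_models()
print(f"Models:\n{supported_models}\n\n")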


# Analyse dataset and Prep the documents in the relevant format. 

import requests 

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']
    if course_name != 'machine-learning-zoomcamp':
        continue

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

print(f"Documents:\n{documents[:5]}\n\n")

Stage 2: Storage and Index Prep

Create a collection (one per business problem, say) to hold the points. Each point is a document together with the vector it is embedded into.

from qdrant_client import QdrantClient, models

qd_client = QdrantClient("http://localhost:6333") 

EMBEDDING_DIMENSIONS = 384
model_handle = "BAAI/bge-small-en"

# Create DB Storage
collection_name = "hw_2_collection"
qd_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONS,
        distance=models.Distance.COSINE
    )
)

# Index the "course" payload field so filtered searches stay fast
qd_client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword" # exact match on string metadata field
)

Stage 3: Ingest data

Upsert the documents into the vector DB. Each point embeds the question and answer text together and keeps the full document as its payload.

points = []

for i, doc in enumerate(documents):

    # Concatenate question and answer text so both contribute to the embedding.
    q_a = doc['question'] + ' ' + doc['text']
    vector = models.Document(text=q_a, model=model_handle)

    point = models.PointStruct(
        id=i,
        vector=vector,
        payload=doc
    )
    points.append(point)

qd_client.upsert(
    collection_name=collection_name,
    points=points
)

Stage 4: Search capability

Provide a search function that returns the documents most similar to a query (here, by cosine similarity), restricted to a given course.

def vector_search(question, course="machine-learning-zoomcamp", limit=5):
    print(f"Using Vector Search with filter: {course}. Results limit: {limit}")

    q_points = qd_client.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=question,
            model=model_handle
        ),
        query_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=course)
                )
            ]
        ),
        limit=limit,
        with_payload=True
    )


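    # Collect the stored payloads (the original documents) and return them.
    results = []
    for point in q_points.points:
        results.append(point.payload)

    return results


# Search for similar items.
question = "How to install Kafka?"
res = vector_search(question, course="machine-learning-zoomcamp", limit=5)
print(res)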

Stage 5: Query LLM with Vector DB as a RAG

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
llm_client = OpenAI()

def build_prompt(q_question, search_results):
    prompt_template = """
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT doesn't contain the answer, output NONE

QUESTION: {question} 

CONTEXT: {context}
""".strip()

    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer:  {doc['text']}\n\n"

    prompt = prompt_template.format(question=q_question, context=context).strip()
    return prompt

# Query the LLM with the modified prompt
def query_llm(mod_prompt):
    response = llm_client.chat.completions.create(
        model = 'gpt-4o-mini',
        messages = [{"role": "user", "content": mod_prompt}]
    )

    return response.choices[0].message.content

Stage 6: Check Results

def rag(query):
    search_results = vector_search(query)
    prompt = build_prompt(query, search_results)
    answer = query_llm(prompt)
    return answer


rag("How to install Kafka?")

Improving with Hybrid Search

No single search technique suits every scenario. Sometimes you need the precision of keywords (exact product codes, player stats, specific names), and other times the flexibility of semantic matching (similar games, related concepts, broader topics). A hybrid search strategy blends both:

Sparse (keyword) embeddings for exact matches
Dense (semantic) embeddings for meaning-based recall
Fusion techniques (e.g. reciprocal rank fusion) or multi-stage pipelines (keyword filter → semantic re-rank, or vice versa)

Example:

Looking up a particular player’s season statistics? A keyword search is ideal.

Hunting for matches that felt like nail-biters? Semantic search surfaces games with similar “excitement vectors.”

Hybrid Embedding & Fusion

By storing both sparse and dense vectors in your collection and then combining their scores—either in two passes or via a fusion query—you get the best of both worlds, serving precise queries and broad, semantically rich ones with equal finesse.
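
Reciprocal rank fusion, mentioned above, is simple enough to sketch in plain Python. The document IDs and the two rankings below are invented for illustration, and the constant `k=60` is a commonly used default rather than anything this article prescribes; a vector database would normally perform this step for you.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Each ranking is a list of document IDs, best first.
    # A document earns 1 / (k + rank) from every ranking it appears in.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from a sparse (keyword) and a dense (semantic) search.
sparse_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Documents that appear near the top of both lists (`doc_a`, `doc_c`) rise above those that only one retriever found, and no score normalisation between the two retrievers is needed because only ranks are used.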

References

LLM Zoomcamp: https://github.com/DataTalksClub/llm-zoomcamp/tree/main/02-vector-search

Comprehending RAGs with a keyword search [LLM A1]

anup s — Wed, 18 Jun 2025 06:03:23 +0000

Large Language Models (LLMs) have reshaped the IT industry over the past few years, and I’m no exception to the wave of adoption. They’ve become a go-to assistant for both day-to-day questions and deeper research tasks.

Whether the goal is business innovation or a personal project, building an effective GenAI workflow means understanding a handful of core capabilities and components. One of the most important is Retrieval-Augmented Generation (RAG).

🧠 What is RAG?

RAG is a technique/design pattern that improves the quality and relevance of LLM responses by combining three core capabilities:

Retrieval: Pulling relevant documents or facts from an external knowledge source (like a vector database or search engine).
Generation: Using a foundational model (e.g., from OpenAI, AWS, etc.) to generate natural language answers.
Augmentation: Enriching the model’s input prompt by injecting the retrieved content to provide context that the LLM alone wouldn't otherwise have.

⚽ Example Use Case

Imagine you’re building a sports Q&A assistant. A user asks:

"Who scored the winning goal in the last Barcelona match?"

Rather than relying solely on the LLM’s trained knowledge (which may be outdated), the RAG pipeline:

Retrieves up-to-date match stats, player performance data, and venue context from your internal index or search service.
Appends that information to the user's question.
Sends the enriched prompt to the LLM, which then generates a precise and relevant answer.
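
The three steps above can be sketched end to end with stand-ins for the real components. Everything here is a placeholder: the knowledge base holds made-up facts ("Player X" is not real data), `retrieve` is a toy word-overlap ranker, and `generate` only pretends to be an LLM. The walkthrough that follows swaps in Elasticsearch and OpenAI for these pieces.

```python
def retrieve(question, knowledge_base, limit=2):
    # Toy retriever: rank facts by how many words they share with the question.
    q_words = set(question.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda fact: len(q_words & set(fact.lower().split())),
        reverse=True,
    )
    return ranked[:limit]

def augment(question, facts):
    # Inject the retrieved facts into the prompt as context.
    context = "\n".join(facts)
    return f"QUESTION: {question}\n\nCONTEXT:\n{context}"

def generate(prompt):
    # Placeholder for a real LLM call.
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

# Hypothetical knowledge base, invented for illustration.
knowledge_base = [
    "Player X scored the winning goal in the last Barcelona match",
    "The stadium roof was renovated in the off-season",
    "Ticket prices for the next season are unchanged",
]

question = "Who scored the winning goal in the last Barcelona match?"
facts = retrieve(question, knowledge_base)
prompt = augment(question, facts)
print(prompt)
print(generate(prompt))
```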

Build a Simple GenAI Workflow with ES and OpenAI

🛠️ Setup and Code Walkthrough

1. Start Elasticsearch via Docker

docker run -it \
  --name elasticsearch \
  -p 9200:9200 \
  -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:8.17.6

2. Index Documents into Elasticsearch

from elasticsearch import Elasticsearch
from tqdm.auto import tqdm

# Connect to local Elasticsearch
es_client = Elasticsearch("http://localhost:9200")

# Define schema for indexing
index_settings = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"}
        }
    }
}

index_name = "course-questions"
es_client.indices.create(index=index_name, body=index_settings)

# Index your documents (replace `docs` with your actual dataset)
for doc in tqdm(docs):
    es_client.index(index=index_name, document=doc)

3. Define Retrieval Logic (Elasticsearch)

def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)

    result_docs = []
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

4. Build prompt for querying LLM

def build_prompt(q_question, search_results):
    prompt_template = """
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT doesn't contain the answer, output NONE


QUESTION: {question} 

CONTEXT: {context}
""".strip()

    context = ""

    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer:  {doc['text']}\n\n"

    prompt = prompt_template.format(question=q_question, context=context).strip()
    return prompt

5. Retrieve Response using LLM


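from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

# Function to query the LLM (e.g. OpenAI)
def query_llm(mod_prompt):
    response = client.chat.completions.create(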
        model='gpt-4o',
        messages=[{"role": "user", "content": mod_prompt}]
    )
    return response.choices[0].message.content

# Combine everything in a RAG-style function
def rag(query):
    results = elastic_search(query)
    prompt = build_prompt(query, results)  # build_prompt is defined in step 4
    return query_llm(prompt)

# Try it out
query = "can I still join the course?"
print(rag(query))  # Example output: "Yes, you can still join the course."