Why Vector Search?
Keyword search literally hunts for matching terms. That’s fine—until it isn’t:
| Query | Keyword Search Might Return | What You Actually Wanted |
|---|---|---|
| table tennis | “10 Best Dining Tables”, “Wimbledon Lawn Tennis Highlights” | Articles, rules and gear for table tennis / ping-pong |
Keyword engines struggle even more with non-text media: images, audio, video, genome sequences, etc. They simply don’t “see” pixels or sound waves.
Vector (semantic) search fixes this by turning each item—text, image, whatever—into a high-dimensional vector. Similar meaning -> nearby vectors. Your query is embedded the same way, and the engine brings back the closest neighbours.
TL;DR Vector search ➜ find things that feel the same, not just things that spell the same.
How Embeddings Happen
Document Vectorization
You start with a set of text passages (in the drawing they’re labelled “Text / Answers”).
Each passage is fed through an embedding model (a neural network that maps text to points in a high-dimensional space).
The model outputs a vector for each passage—these vectors (sometimes called word or sentence embeddings) capture the meaning of the text as coordinates in that space.
Query Vectorization & Retrieval
When a user asks a question, you send the question through the same embedding model and obtain a query vector.
You then compare that query vector to all of your stored document vectors (e.g. with cosine similarity).
The documents whose vectors lie closest to the query vector are the most semantically relevant answers, even if they don’t share the exact same keywords.
Why it matters: by operating in a continuous vector space rather than matching literal words, you can find passages that “mean the same thing” and surface them to your LLM (or directly to the user). This is the core of semantic (vector) search in Retrieval-Augmented Generation pipelines.
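To make that comparison step concrete, here is a minimal sketch using fastembed (the embedding library used later in this post) and NumPy; the sample passages, the query, and the choice of BAAI/bge-small-en here are purely illustrative:
import numpy as np
from fastembed import TextEmbedding

# Load a small embedding model (384-dimensional vectors)
model = TextEmbedding("BAAI/bge-small-en")

passages = [
    "Ping-pong paddles, rubbers and balls explained",
    "Wimbledon lawn tennis highlights",
    "10 best dining tables for small apartments",
]
query = "table tennis gear"

# fastembed returns a generator of NumPy vectors
passage_vecs = list(model.embed(passages))
query_vec = next(model.embed([query]))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Higher cosine similarity = semantically closer to the query
for text, vec in zip(passages, passage_vecs):
    print(f"{cosine(query_vec, vec):.3f}  {text}")
The ping-pong passage should rank first even though it never contains the words "table tennis"—exactly the behaviour keyword search misses.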
Give It a Go with Qdrant
Many open-source vector databases exist; we’ll use Qdrant because it’s lightweight, fast, and has a friendly Python client.
Setup
Installing Qdrant using Docker:
docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 \
-v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
qdrant/qdrant
Installing the Python client library:
python -m pip install -q "qdrant-client[fastembed]>=1.14.2"
Implementation
Stage 1: Connections and Data Prep
Import the necessary modules, connect to the vector DB, pick an embedding model that fits the need, and inspect the dataset.
# Client to connect to the vector DB.
from qdrant_client import QdrantClient

qd_client = QdrantClient("http://localhost:6333")
# Model Selection
from fastembed import TextEmbedding
models = TextEmbedding.list_supported_models()
print(f"Models:\n{models}\n\n")
# Analyse dataset and Prep the documents in the relevant format.
import requests
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()
documents = []
for course in documents_raw:
    course_name = course['course']
    if course_name != 'machine-learning-zoomcamp':
        continue
    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

print(f"Documents:\n{documents[:5]}\n\n")
Stage 2: Storage and Index Prep
Create a collection for the business problem and configure how its points (the documents to be embedded) will be stored and indexed.
from qdrant_client import QdrantClient, models
qd_client = QdrantClient("http://localhost:6333")
EMBEDDING_DIMENSIONS = 384  # BAAI/bge-small-en outputs 384-dimensional vectors
model_handle = "BAAI/bge-small-en"
# Create DB Storage
collection_name = "hw_2_collection"
qd_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONS,
        distance=models.Distance.COSINE
    )
)

# Index data
qd_client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword"  # exact match on string metadata field
)
Stage 3: Ingest data
Embed each document and upsert the resulting points into the vector DB.
points = []
for i, doc in enumerate(documents):
    q_a = doc['question'] + ' ' + doc['text']  # Concatenate question and text for embedding
    vector = models.Document(text=q_a, model=model_handle)
    point = models.PointStruct(
        id=i,
        vector=vector,
        payload=doc
    )
    points.append(point)

qd_client.upsert(
    collection_name=collection_name,
    points=points
)
Stage 4: Search capability
Provide a search function that queries the collection by similarity (cosine distance), filtered to a given course.
def vector_search(question, course="machine-learning-zoomcamp", limit=5):
    print(f"Using Vector Search with filter: {course}. Results limit: {limit}")
    q_points = qd_client.query_points(
        collection_name=collection_name,
        query=models.Document(
            text=question,
            model=model_handle
        ),
        query_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=course)
                )
            ]
        ),
        limit=limit,
        with_payload=True
    )
    results = []
    for point in q_points.points:
        results.append(point.payload)
    return results

# Search similar items (example query; any course-related question works here).
question = "How do I run Kafka?"
res = vector_search(question, course="machine-learning-zoomcamp", limit=5)
print(res)
Stage 5: Query the LLM with the Vector DB as the RAG retriever
from openai import OpenAI

llm_client = OpenAI()
def build_prompt(q_question, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT.
Use only the facts from the CONTEXT when answering the QUESTION.
If the CONTEXT doesn't contain the answer, output NONE
QUESTION: {question}
CONTEXT: {context}
""".strip()

    context = ""
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=q_question, context=context).strip()
    return prompt
# Query the LLM with the modified prompt
def query_llm(mod_prompt):
    response = llm_client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": mod_prompt}]
    )
    return response.choices[0].message.content
Stage 6: Check Results
def rag(query):
    search_results = vector_search(query)
    prompt = build_prompt(query, search_results)
    answer = query_llm(prompt)
    return answer

rag("How to install Kafka?")
Improving with Hybrid Search
No single search technique suits every scenario. Sometimes you need the precision of keywords (exact product codes, player stats, specific names), and other times the flexibility of semantic matching (similar games, related concepts, broader topics). A hybrid search strategy blends both:
- Sparse (keyword) embeddings for exact matches
- Dense (semantic) embeddings for meaning-based recall
- Fusion techniques (e.g. reciprocal rank fusion) or multi-stage pipelines (keyword filter → semantic re-rank, or vice versa)
Example:
- Looking up a particular player’s season statistics? A keyword search is ideal.
- Hunting for matches that felt like nail-biters? Semantic search surfaces games with similar “excitement vectors.”
Hybrid Embedding & Fusion
By storing both sparse and dense vectors in your collection and then combining their scores—either in two passes or via a fusion query—you get the best of both worlds, serving precise queries and broad, semantically rich ones with equal finesse.
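As a rough sketch of what a fused query can look like with the Qdrant Python client (assuming a separate collection, here called hybrid_collection, created with a named dense vector "dense" and a named sparse vector "sparse", e.g. built with fastembed's Qdrant/bm25 model; these names are illustrative and not part of the setup above):
from qdrant_client import QdrantClient, models

qd_client = QdrantClient("http://localhost:6333")
query = "How to install Kafka?"

results = qd_client.query_points(
    collection_name="hybrid_collection",  # assumed to hold named dense + sparse vectors
    prefetch=[
        models.Prefetch(
            query=models.Document(text=query, model="BAAI/bge-small-en"),
            using="dense",   # dense (semantic) candidates
            limit=20,
        ),
        models.Prefetch(
            query=models.Document(text=query, model="Qdrant/bm25"),
            using="sparse",  # sparse (keyword-style) candidates
            limit=20,
        ),
    ],
    # Reciprocal rank fusion merges the two candidate lists into one ranking
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=5,
    with_payload=True,
)

for p in results.points:
    print(p.score, p.payload.get("question"))
Each prefetch branch retrieves its own top candidates, and RRF combines them by rank rather than raw score, so you never have to calibrate dense and sparse scores against each other.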
References
LLM Zoomcamp: https://github.com/DataTalksClub/llm-zoomcamp/tree/main/02-vector-search