Sybil Micarandayo is a co-author of this article.
🔎 Let’s start with our first query…
Imagine you search for “Apple”, and the computer looks for the exact letters A-P-P-L-E.
But what if you actually wanted information about “fruit” or “Steve Jobs”?
This is the failure of traditional keyword search: it requires a perfect character-for-character match.
Fortunately, modern AI solves this using a vector database to interpret context, letting the computer understand that a search for “Apple” might actually be a quest for fruit, or for Steve Jobs, depending on the context of the data.
What Makes A VECTOR DATABASE Different?
A vector database is “contextual”: it doesn’t look at words, it looks at meaning. A traditional database is “literal”: it looks for an exact character match.
- If two ideas are similar, they are placed close together in vector space, regardless of the words used to describe them.
A vector database represents data as mathematical points in space, like a 3D graph that captures semantic meaning. These points are stored as vector embeddings: long lists of numbers that act as coordinates.
An embedding model acts as a translator:
It analyzes the human meaning of a query and matches it against a vast set of sources based on conceptual similarity rather than exact text.
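To make “conceptual similarity” concrete, here is a minimal sketch using tiny hand-made 3-number vectors. These values are invented purely for illustration (a real model like all-MiniLM-L6-v2 produces 384 numbers per sentence): similar meanings get vectors pointing in similar directions, which cosine similarity measures.

```python
import math

def cosine_similarity(a, b):
    # 1.0 = same direction (very similar meaning), near 0.0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented 3-dimensional "embeddings" for illustration only.
apple_fruit = [0.9, 0.1, 0.0]   # "apple" in a fruit context
banana      = [0.8, 0.2, 0.1]   # another fruit
steve_jobs  = [0.1, 0.9, 0.3]   # "Apple" the company / Steve Jobs

print(cosine_similarity(apple_fruit, banana))      # high: close in meaning
print(cosine_similarity(apple_fruit, steve_jobs))  # lower: different concept
```

Notice that the fruit vectors score much higher with each other than with the Steve Jobs vector, even though no words were compared at all, only coordinates.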
However, these embeddings need a place to be stored and managed. This is where ChromaDB comes in.
Why Use ChromaDB?
ChromaDB serves as a memory bank that stores and retrieves vector embeddings together with their associated metadata.
It is a specialized database designed specifically for the unique requirements of vector embeddings. Its adoption grew rapidly in 2023 as an open-source, “AI-native” database, meaning it was built for AI applications rather than bolting AI features onto a traditional database. It offers optimized storage and fast querying through vector indexing techniques such as approximate nearest neighbor (ANN) search, which quickly finds similar data points in large datasets. It is remarkably simple to set up and runs directly in your Python environment, ideally on version 3.12.
Now, let us try to build our own Semantic Search Engine. But don’t worry, this demonstration is beginner-friendly. First, make sure to have the following before proceeding with our demonstration.
✅ PREREQUISITE:
- Git Bash (download it if you don’t have it)
- A source code editor (e.g. VS Code, Cursor, etc.)
- A Python environment with the Python extension
LET’S BUILD IT
Now, let’s try to build our own semantic search engine through this entry-level demonstration script in VS Code!
A. Open your VS Code
- Ensure that the Python extension is installed
- Create a folder “Semantic Search”
B. Install the Dependencies
- On your terminal (Git Bash), run:
pip install chromadb sentence-transformers
This command installs:
- ChromaDB - a vector database
- Sentence Transformers - an embedding model for converting text into vectors.
This library is the translator that turns human language into a format computers can understand.
Note: the first run may download the embedding model
C. Open the Command Palette (Ctrl+Shift+P on Windows/Linux, Cmd+Shift+P on macOS)
- Search for “Python: Create Environment”
- Then click “Venv” to create a virtual environment (this ensures your project has its own isolated dependencies)
- Select a Python 3.12 interpreter
D. Create a Python file
semantic_search.py
E. Import required library
import os
import shutil
import chromadb
from chromadb.utils import embedding_functions
from chromadb.errors import InternalError
These imports load ChromaDB (the vector database) and its embedding-function helpers, plus the standard-library modules (os, shutil) we will use later for error recovery.
- Think of it as opening a mini database memory where your AI data will be stored.
F. Set up the vector database (Persistent)
DB_PATH = "./chroma_semantic_search_db"

def _create_client_and_collection(path, collection_name):
    client = chromadb.PersistentClient(path=path)
    # Sentence Transformers embedding model
    embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2"
    )
    # get_or_create_collection: reuse the existing collection if it exists
    collection = client.get_or_create_collection(
        name=collection_name,
        embedding_function=embedding_function,
    )
    return client, collection
This loads a pre-trained embedding model that understands human meaning.
- This model converts sentences into vectors so the computer can understand meaning, not just words.
Persistent means the vector database is saved to disk, not just kept in RAM. ChromaDB can run in-memory or persist to disk. For a “real” app, we use persistence.
With chromadb.PersistentClient(path="./chroma_semantic_search_db"):
- vectors + documents are written into files under chroma_semantic_search_db (e.g. chroma.sqlite3)
- If you stop the script and run it again later, the data is still there and can be queried
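Under the hood, the persistent client writes to a SQLite file (chroma.sqlite3). To see the persistence idea in isolation, here is a stdlib-only sketch using Python’s sqlite3 module. This is a simplified stand-in, not ChromaDB’s actual schema: data written by one connection survives closing and reopening the file.

```python
import os
import sqlite3
import tempfile

# A throwaway on-disk database file (stand-in for chroma.sqlite3).
db_file = os.path.join(tempfile.mkdtemp(), "demo.sqlite3")

# First "run": create a table, write a row, and close the connection.
conn = sqlite3.connect(db_file)
conn.execute("CREATE TABLE IF NOT EXISTS docs (id TEXT, text TEXT)")
conn.execute("INSERT INTO docs VALUES (?, ?)", ("doc_0", "Python is used in ML."))
conn.commit()
conn.close()

# Second "run": reopen the same file -- the data is still there.
conn = sqlite3.connect(db_file)
rows = conn.execute("SELECT id, text FROM docs").fetchall()
conn.close()
print(rows)  # the row written in the first "run"
```

ChromaDB does the same thing for you automatically: every add is written to disk, so a restarted script picks up exactly where it left off.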
Sentence Transformers: local model, no API key needed. Good for beginners.
G. Error-handling Around Chroma Client (IMPORTANT!)
def get_or_create_collection(path=DB_PATH, collection_name="documents"):
    """
    Create a ChromaDB client and get (or create) a collection with embeddings.
    The embedding function converts text into vectors that capture meaning.
    If the underlying SQLite database is corrupted (e.g. "database disk image
    is malformed"), automatically reset the on-disk Chroma DB and recreate it.
    """
    try:
        return _create_client_and_collection(path, collection_name)
    except InternalError as e:
        # Handle corrupted SQLite DB by clearing and recreating the Chroma directory.
        if "database disk image is malformed" not in str(e):
            raise
        print(
            f"Detected corrupted ChromaDB database at '{path}' "
            "(database disk image is malformed). Resetting the database directory..."
        )
        # Remove the existing persistent directory and recreate it fresh.
        if os.path.exists(path):
            shutil.rmtree(path)
        return _create_client_and_collection(path, collection_name)
This is important for safe initialization and automatic DB reset
Note: On the first run after a corruption event, the script should print a message saying it detected a corrupted database and is resetting it, then continue normally. Later runs start cleanly with a fresh vector database, so you may need to re-add (re-seed) any data you need.
H. Create a Collection
INITIAL_DOCS = [
    "Amazon Web Services is a cloud computing platform.",
    "Python is used in machine learning.",
    "Artificial Intelligence enables automation.",
]

def seed_documents(collection, initial_docs=None):
    """Add initial documents if the collection is empty. Returns count added."""
    if initial_docs is None:
        initial_docs = INITIAL_DOCS
    existing = collection.count()
    if existing > 0:
        return 0
    for i, doc in enumerate(initial_docs):
        collection.add(documents=[doc], ids=[f"doc_{i}"])
    return len(initial_docs)
- Every text you store here will be automatically converted into vectors using the embedding model.
I. Add a Document
def add_document(collection, document, doc_id, stored_docs_set):
    """
    Add one document to the collection. ChromaDB will compute its embedding
    automatically using the collection's embedding_function.
    """
    key = document.strip().lower()
    if not key:
        return False, "Cannot add an empty sentence."
    if key in stored_docs_set:
        return False, "Duplicate sentence; not added."
    collection.add(documents=[document.strip()], ids=[doc_id])
    stored_docs_set.add(key)
    return True, "Sentence added successfully."
- When you add a document to ChromaDB, the system processes your text with the embedding model and then stores the resulting vector. That’s why later you can search “by meaning”: everything in the DB is already stored as vectors.
J. Semantic Search (TOP-K)
def search_documents(collection, query, n_results=3):
    """
    Query by meaning: the query is embedded and compared to all document
    vectors. Returns the top-n_results closest matches and their distances.
    Lower distance = more similar.
    """
    query = query.strip()
    if not query:
        return None, "Please enter a non-empty search query."
    results = collection.query(
        query_texts=[query],
        n_results=min(n_results, collection.count() or 1),
        include=["documents", "distances"],
    )
    docs = results["documents"][0]
    dists = results["distances"][0]
    if not docs:
        return [], "No documents in the database yet. Add some with the 'add' command."
    return list(zip(docs, dists)), None
- This section provides an answer to a user's question based on meaning, rather than on the exact phrasing. It utilizes the same embedding model used for adding documents: the user query is converted into a vector, which is then compared against the vectors of all stored documents. The top-k results are the k documents with vectors closest to the query vector (in this case, k = 3).
K. Main Application Loop
def main():
    print("Initializing Semantic Search (Vector DB with ChromaDB)...\n")
    client, collection = get_or_create_collection()
    stored_docs = set()
    # Rebuild the in-memory duplicate set from the existing collection.
    # Note: collection.get() returns a flat list of documents (unlike
    # query(), which returns one list per query text).
    try:
        all_data = collection.get(include=["documents"])
        for d in (all_data.get("documents") or []):
            if d:
                stored_docs.add(d.strip().lower())
    except Exception:
        pass
    added = seed_documents(collection)
    if added:
        for d in INITIAL_DOCS:
            stored_docs.add(d.strip().lower())
        print(f"Seeded {added} sample documents.\n")
    # Next ID for newly added documents
    id_counter = collection.count()
    print("========================================")
    print("   SEMANTIC SEARCH (ChromaDB Demo)")
    print("========================================")
    print("Commands:")
    print("  add    -> Add a sentence to the vector database")
    print("  search -> Search by meaning (top results + scores)")
    print("  exit   -> Quit\n")
    while True:
        command = input("Enter command (add / search / exit): ").strip().lower()
        if command == "add":
            new_doc = input("Enter a sentence to store: ")
            success, msg = add_document(
                collection, new_doc, f"doc_{id_counter}", stored_docs
            )
            if success:
                id_counter += 1
            print(msg + "\n")
        elif command == "search":
            query = input("Enter your search query: ")
            results, err = search_documents(collection, query, n_results=3)
            if err:
                print(err + "\n")
                continue
            print("\nTop matching results (lower distance = more similar):")
            for rank, (doc, dist) in enumerate(results, 1):
                print(f"  {rank}. [distance: {dist:.4f}] {doc}")
            print("-----------------------------\n")
        elif command == "exit":
            print("\nExiting. Your vector database is saved at:", DB_PATH)
            break
        else:
            print("Unknown command. Use: add, search, or exit.\n")

if __name__ == "__main__":
    main()
The core of the application is a small Command Line Interface (CLI) loop that orchestrates all functions. This loop initializes the application, loads an existing data collection or creates a new one, and can optionally pre-populate it with data (seed). It then enters a continuous cycle, prompting the user to select and execute one of the available commands: add, search, or exit.
Application Flow:
a. Startup: Create a persistent client and collection (get_or_create_collection). Rebuild the in-memory set for duplicate checks. Seed sample documents if the collection is empty (seed_documents). Set the id_counter. Print the menu (add/search/exit).
b. Loop (while True): Read a command.
- add: Prompt for a sentence, call add_document(), and update id_counter.
- search: Prompt for a query, call search_documents(..., n_results=3), and print the top matching results (rank, distance, text).
- exit: Print the database save location and break.
c. After exit: Data is persisted in chroma_semantic_search_db due to using PersistentClient.
Summary: The main application loop repeatedly takes a command to add a document, search with top-k semantic search, or exit, with all data persistently stored on disk.
TEST IT YOURSELF!
🧪 Now that your semantic search program is ready, let’s test it properly by entering meaningful search queries and verifying the results.
You can run a search query, add documents (knowledge), or exit the program.
A. Run the program
In your terminal, run your Python file:
python semantic_search.py
B. Test Semantic Search Queries
Try the following queries:
Enter command (add / search / exit): search
Enter your search query: cloud services platform
Enter command (add / search / exit): search
Enter your search query: language used for AI models
C. Add your Own Document
You can add more sentences to test it further.
Enter command (add / search / exit): add
Enter a sentence to store: AWSCC is a community where you can gain hands-on cloud skills, access AWS resources, work on real projects, and get support for certification.
Now try searching:
Enter command (add / search / exit): search
Enter your search query: What is AWSCC?
And find out what's the result.
⚠️LIMITATION
However, vector databases also have drawbacks.
- They do not automatically remove duplicates.
- Every embedding is stored independently.
- Identical or very similar sentences may appear as repeated results.
In semantic retrieval systems, this is called Redundant Semantic Retrieval.
Also, please understand that this is not a Question-Answering (QA) system. A semantic search engine cannot generate direct answers to user queries; it only returns documents that are semantically similar to the query based on vector similarity.
In our demo we added a simple duplicate check so the same sentence isn’t stored twice; ChromaDB itself does not do this for you.
If we want to generate an answer or build context-aware or natural language responses, we need to combine this with a Large Language Model (LLM) within a Retrieval-Augmented Generation (RAG) pipeline.
What we built is a retrieval system, and not a generation system.
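To show how our retrieval system would slot into a RAG pipeline, here is a hedged sketch of the glue step: take the top-k documents our search returns and pack them into a prompt for an LLM. The build_rag_prompt helper is hypothetical (not part of ChromaDB or our demo), and the actual LLM call is left out, since it depends on whichever model or API you choose.

```python
def build_rag_prompt(query, retrieved_docs):
    # Hypothetical helper: combine retrieved context with the user's question.
    # An LLM would receive this prompt and generate the final answer.
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Pretend these came from search_documents(collection, query, n_results=2):
docs = [
    "Amazon Web Services is a cloud computing platform.",
    "Artificial Intelligence enables automation.",
]
prompt = build_rag_prompt("What is AWS?", docs)
print(prompt)
# Next step (not shown): send `prompt` to an LLM to generate the answer.
```

In other words, retrieval supplies the facts and the LLM supplies the sentence, which is exactly the division of labor in a RAG pipeline.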
🤔 Oooops, you might be thinking now that this could be useless.
Nuh-uh, not at all.
Even though this is not a QA System, semantic search alone can be used in many aspects, especially here in our university to elevate the services we can offer.
It can be used for:
1. University Knowledge Base Search
2. Freshie Guide Primer
3. UPMin Enrollment FAQ Retrieval System
And even
4. Document Search for our Organization
In fact, before chatbots can generate answers, they first need to retrieve information. Our system could serve as the backend retrieval layer for them.
This may look underwhelming, but it can actually be very helpful, even for students and school organizations. It’s practical, applicable, and very relevant.
We believe that’s all for now, and we hope to share more articles with you soon, perhaps about LLMs next, so we can build the QA part and ELEVATE our game even higher!

