## Introduction
In this tutorial, I'll walk you through how I built a semantic search engine using CocoIndex, an open-source Python library for creating powerful search experiences. If you've ever wanted to build a search feature that understands context and meaning (not just exact keyword matches), this post is for you!
## What is CocoIndex?
CocoIndex is a lightweight semantic search library that makes it easy to index and search through documents using vector embeddings. Unlike traditional keyword-based search, semantic search understands the meaning behind queries, allowing users to find relevant results even when they use different words.
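To make that idea concrete, here is a minimal illustration (plain Python, no CocoIndex) of what vector search boils down to: texts are mapped to vectors, and relatedness is measured by cosine similarity rather than shared keywords. The tiny three-dimensional "embeddings" below are made up for the example; real models output hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
embeddings = {
    "machine learning": [0.9, 0.8, 0.1],
    "teaching computers": [0.8, 0.9, 0.2],
    "banana bread recipe": [0.1, 0.0, 0.9],
}

query = embeddings["machine learning"]
for text, vec in embeddings.items():
    print(text, round(cosine_similarity(query, vec), 3))
```

Even though "teaching computers" shares no keywords with "machine learning", its vector points in a similar direction, so it scores far higher than the unrelated text.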
## Why I Chose CocoIndex
I needed a search solution that was:

- **Easy to integrate** - no complex setup or infrastructure required
- **Fast** - quick indexing and search performance
- **Semantic** - understanding context, not just keywords
- **Open source** - free to use and modify
CocoIndex checked all these boxes!
## Getting Started

First, install CocoIndex:

```shell
pip install cocoindex
```
## Building the Search Engine
Here's how I implemented the core functionality:
### 1. Initialize CocoIndex

The library is used as a module throughout the flow definition, so a plain import is all that's needed:

```python
import cocoindex
```
### 2. Define the Indexing Flow
```python
@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    """
    Define an example flow that embeds text into a vector database.
    """
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))

    doc_embeddings = data_scope.add_collector()

    # Process each document: split it into chunks, then embed each chunk
    with data_scope["documents"].row() as doc:
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        with doc["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))
            doc_embeddings.collect(
                filename=doc["filename"], location=chunk["location"],
                text=chunk["text"], embedding=chunk["embedding"])

    # Export the collected rows to Postgres with a vector index
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
```
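The `chunk_size` and `chunk_overlap` parameters above control how documents are split before embedding. CocoIndex's `SplitRecursively` is more sophisticated than this, but a toy sliding-window chunker (illustrative only, not the library's implementation) shows what the two numbers mean:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int):
    """Split text into windows of chunk_size characters, where each window
    starts chunk_overlap characters before the previous one ends."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 10, chunk_size=4, chunk_overlap=2)
print(len(chunks))  # 5 windows: four of length 4 plus a short tail
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which keeps its embedding meaningful.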
### 3. Perform Semantic Search
```python
from psycopg_pool import ConnectionPool

def search(pool: ConnectionPool, query: str, top_k: int = 5):
    # Resolve the table name CocoIndex created for the "doc_embeddings" export
    table_name = cocoindex.utils.get_target_storage_default_name(
        text_embedding_flow, "doc_embeddings")
    # text_to_embedding is a transform flow (defined alongside the indexing
    # flow) so the query is embedded with the same model as the documents
    query_vector = text_to_embedding.eval(query)
    with pool.connection() as conn:
        with conn.cursor() as cur:
            cur.execute(f"""
                SELECT filename, text, embedding <=> %s::vector AS distance
                FROM {table_name} ORDER BY distance LIMIT %s
            """, (query_vector, top_k))
            return [
                {"filename": row[0], "text": row[1], "score": 1.0 - row[2]}
                for row in cur.fetchall()
            ]
```
## Key Features I Implemented
### Fast Indexing
CocoIndex uses efficient vector storage, making indexing thousands of documents quick and painless.
### Semantic Understanding
The search understands that "teaching computers" relates to "machine learning" even without exact keyword matches.
### Customizable Embeddings
You can use different embedding models depending on your use case and accuracy requirements.
## Real-World Example
I built a documentation search for my project with 500+ markdown files. With CocoIndex:
- Indexing took less than 30 seconds
- Search response time averaged 50ms
- Users found relevant docs even with vague queries
## Performance Tips

- **Batch indexing** - add multiple documents at once for better performance
- **Choose the right embedding model** - balance between accuracy and speed
- **Cache frequently accessed results** - store common queries for instant responses
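For the caching tip, Python's `functools.lru_cache` is often enough when the embedding call is deterministic. The `embed_query` below is a hypothetical stand-in for the real model call, which is the expensive part of serving a search request:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def embed_query(query: str):
    """Hypothetical stand-in for the real embedding call.
    In the real system this would invoke the sentence-transformer model."""
    return tuple(float(ord(c)) for c in query)  # deterministic fake vector

embed_query("machine learning")   # computed on the first call
embed_query("machine learning")   # served from the cache
print(embed_query.cache_info())   # hits=1, misses=1
```

Note that `lru_cache` requires hashable arguments and a pure function; if the model or index changes, the cache must be cleared with `embed_query.cache_clear()`.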
## Challenges I Faced
### Challenge 1: Choosing Embedding Dimensions

Higher dimensions generally mean better accuracy but slower indexing and search. I settled on 384 dimensions (the output size of all-MiniLM-L6-v2) as a sweet spot.
### Challenge 2: Handling Large Document Collections
For collections over 10k documents, I implemented pagination and lazy loading.
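A minimal sketch of the pagination idea over an already-ranked result list (in the real project this maps to `OFFSET`/`LIMIT` in the SQL query; the names here are illustrative):

```python
def paginate(results, page: int, page_size: int = 20):
    """Return one page of an already-ranked result list (pages start at 1)."""
    start = (page - 1) * page_size
    return results[start:start + page_size]

ranked = [f"doc-{i}" for i in range(45)]
page1 = paginate(ranked, page=1, page_size=20)  # doc-0 .. doc-19
page3 = paginate(ranked, page=3, page_size=20)  # doc-40 .. doc-44, a short last page
```

Serving one page at a time keeps response payloads small, and lazy loading means embeddings for later pages are never fetched unless the user scrolls to them.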
## Results
After implementing CocoIndex:
- User satisfaction increased significantly
- Implementation took only 2 days vs weeks for alternatives
## Conclusion
CocoIndex made building a semantic search engine surprisingly simple. Whether you're building a documentation site, blog search, or product catalog, it's a fantastic tool that punches above its weight.
The library is actively maintained, well-documented, and the community is helpful. I highly recommend giving it a try for your next search implementation!
## Resources
- GitHub: CocoIndex Repository
- Documentation: Official Docs
- My Demo Project: Simple Vector Index Demo
Have you used CocoIndex or other semantic search libraries? Share your experience in the comments below!
Happy coding! 🚀