Building a Semantic Search Engine for Any Website Using React, Django & Milvus Lite

#webdev #ai #python #programming

I recently built an end-to-end semantic search application that takes any website URL and a user query, then returns the 10 most relevant HTML content chunks, all using embeddings and a vector database.

🔹 Tech Stack

Frontend: React 18 + Vite

Backend: Django 5 + Django REST Framework

NLP: BERT tokenizer + Sentence-Transformers

Vector DB: Milvus Lite (with cosine similarity fallback)

🔹 Processing Pipeline

Fetch + clean HTML (BeautifulSoup)

Extract DOM blocks (h1–h6, p, li, code, etc.)

Chunk to ≤500 tokens (stays under BERT's 512-token input limit, leaving room for special tokens)

Embed blocks

Store + search in vector DB

Rank + return top-10 results
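
To make the extract and chunk steps concrete, here is a minimal sketch, not the project's actual code. It makes several assumptions: it uses the standard library's html.parser as a dependency-free stand-in for BeautifulSoup, it counts whitespace-separated words instead of real BERT tokens, and it greedily packs consecutive blocks into one chunk until the budget is reached. The names BlockExtractor and chunk_blocks are hypothetical.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "code"}
SKIP_TAGS = {"script", "style"}


class BlockExtractor(HTMLParser):
    """Collect (tag, text) for each outermost block element."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished (tag, text) pairs
        self._open = []    # stack of currently open block tags
        self._parts = []   # text fragments of the block being read
        self._skip = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in self._open:
            # Pop back to the matching tag, tolerating unclosed children.
            while self._open.pop() != tag:
                pass
            if not self._open:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self._parts).split())
                if text:
                    self.blocks.append((tag, text))
                self._parts = []

    def handle_data(self, data):
        if self._open and not self._skip:
            self._parts.append(data)


def chunk_blocks(texts, max_tokens=500):
    """Greedily pack block texts into chunks of at most max_tokens words.

    Words stand in for tokens here; a real pipeline would count tokens
    with the embedding model's own tokenizer.
    """
    chunks, current, used = [], [], 0
    for text in texts:
        words = text.split()
        # A block longer than the budget is cut into budget-sized pieces.
        pieces = [words[i:i + max_tokens] for i in range(0, len(words), max_tokens)]
        for piece in pieces:
            if current and used + len(piece) > max_tokens:
                chunks.append(" ".join(current))
                current, used = [], 0
            current.extend(piece)
            used += len(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    page = "<h1>Docs</h1><p>Install the <b>package</b> first.</p><script>x=1</script>"
    extractor = BlockExtractor()
    extractor.feed(page)
    print(extractor.blocks)
    print(chunk_blocks([text for _, text in extractor.blocks], max_tokens=4))
```

Packing whole blocks before splitting keeps headings and their paragraphs together where the budget allows, which is one way to keep chunks readable. Word counts usually undercount BERT tokens, so a production version should measure with the real tokenizer.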

🔹 Frontend Highlights

Card-based UI

Snippet + full HTML tabs

Show more/less

Copy markup button

Optional highlight

🔹 Challenges

Preserving readable HTML while staying under 500 tokens

Milvus Lite issues on Windows → fall back to plain cosine similarity search

First-run embedding model download delays
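
One way to structure the Milvus Lite fallback is to hide both backends behind the same two methods and pick one at startup. The sketch below is an illustration under assumptions, not the repository's implementation: the class names, the open_store helper, the collection name "chunks", and the database path are all hypothetical. The Milvus path uses pymilvus's MilvusClient with a local file, and any failure to import or initialize it switches to a pure-Python cosine search.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CosineFallbackStore:
    """In-memory store: brute-force cosine search over all vectors."""

    def __init__(self):
        self._rows = []  # (vector, text) pairs

    def add(self, vector, text):
        self._rows.append((list(vector), text))

    def search(self, query, limit=10):
        scored = [(cosine(query, vec), text) for vec, text in self._rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:limit]


class MilvusLiteStore:
    """Same interface, backed by a local Milvus Lite database file."""

    def __init__(self, dimension, path, collection="chunks"):
        from pymilvus import MilvusClient  # ImportError if not installed

        self._client = MilvusClient(path)
        self._collection = collection
        if self._client.has_collection(collection_name=collection):
            self._client.drop_collection(collection_name=collection)
        self._client.create_collection(
            collection_name=collection, dimension=dimension, metric_type="COSINE"
        )
        self._next_id = 0

    def add(self, vector, text):
        row = {"id": self._next_id, "vector": list(vector), "text": text}
        self._client.insert(collection_name=self._collection, data=[row])
        self._next_id += 1

    def search(self, query, limit=10):
        hits = self._client.search(
            collection_name=self._collection,
            data=[list(query)],
            limit=limit,
            output_fields=["text"],
        )[0]
        return [(hit["distance"], hit["entity"]["text"]) for hit in hits]


def open_store(dimension, path="./search_demo.db"):
    """Prefer Milvus Lite; fall back to in-memory cosine if it is unusable."""
    try:
        return MilvusLiteStore(dimension, path)
    except Exception:  # missing package, unsupported platform, etc.
        return CosineFallbackStore()
```

Calling code only ever sees add and search, so the rest of the pipeline does not need to know which backend is active. Brute-force search is linear in the number of chunks, which is typically fine for a single page.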

🔹 Lessons Learned

DOM-block chunking improves readability

Normalized embeddings enable consistent similarity scores

Toggle-based UI improves UX
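
The point about normalized embeddings can be checked with a few lines of arithmetic. The sketch below uses toy 2-D vectors as stand-ins for real embeddings, and the helper names are hypothetical. It shows that a raw dot product grows with vector length, while the dot product of L2-normalized vectors is exactly the cosine similarity and always falls in [-1, 1]. Sentence-Transformers can do the normalization at encode time via `normalize_embeddings=True`; whether the project uses that option is an assumption.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Two vectors pointing the same way, one twice as long as the other.
short, long_ = [3.0, 4.0], [6.0, 8.0]

raw = dot(short, long_)                                  # depends on length
unit = dot(l2_normalize(short), l2_normalize(long_))     # depends on angle only

print(f"raw dot product:        {raw}")
print(f"normalized dot product: {unit:.6f}")
```

Because the normalized score depends only on direction, a threshold or ranking cutoff chosen for one page stays meaningful on another.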

🔹 What's Next?

✅ Multi-page crawling
✅ Better DOM coverage (tables, figures, captions)
✅ Server caching of embeddings
✅ Plug-in vector DB support (Pinecone, Weaviate, etc.)

🎥 I also recorded a full demo video. Happy to share or open source it soon!

LinkedIn: https://www.linkedin.com/in/thiyagu26v/
Project repository: https://github.com/thiyagu26v/website-content-django