I recently built an end-to-end semantic search application that takes a website URL and a user query, then returns the 10 most relevant HTML content chunks, all using embeddings and a vector database.
🔹 Tech Stack
Frontend: React 18 + Vite
Backend: Django 5 + Django REST Framework
NLP: BERT tokenizer + Sentence-Transformers
Vector DB: Milvus Lite (with cosine similarity fallback)
🔹 Processing Pipeline
Fetch + clean HTML (BeautifulSoup)
Extract DOM blocks (h1–h6, p, li, code, etc.)
Chunk to ≤500 tokens (BERT accepts at most 512 tokens per input, so 500 leaves headroom for special tokens)
Embed blocks
Store + search in vector DB
Rank + return top-10 results
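To make the first three pipeline steps concrete, here is a minimal sketch of block extraction and chunking. It is not the project's actual code: it swaps in the standard-library `html.parser` for BeautifulSoup so it runs without dependencies, and it counts whitespace-separated words as a stand-in for BERT tokens (real subword counts are usually higher). The names `BlockExtractor`, `split_block`, and `html_to_chunks` are hypothetical, and the sketch assumes reasonably well-formed markup with explicit closing tags.

```python
from html.parser import HTMLParser

# Tags treated as content blocks (assumption, based on the pipeline list above).
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "code"}
MAX_TOKENS = 500  # budget from the post; "tokens" here are whitespace words


class BlockExtractor(HTMLParser):
    """Collects the text of each outermost block element, in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished (tag, text) pairs
        self._stack = []   # currently open block tags
        self._buffer = []  # text pieces seen inside the outermost open block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append(tag)

    def handle_data(self, data):
        # Text outside any block tag (scripts, navigation, etc.) is ignored.
        if self._stack:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._stack and tag == self._stack[-1]:
            self._stack.pop()
            if not self._stack:
                # Outermost block closed: collapse whitespace and emit it.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.blocks.append((tag, text))
                self._buffer = []


def split_block(text, max_tokens=MAX_TOKENS):
    """Splits one block into pieces of at most max_tokens words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]


def html_to_chunks(html, max_tokens=MAX_TOKENS):
    """Returns a list of {'tag', 'text'} chunks, one or more per DOM block."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    chunks = []
    for tag, text in parser.blocks:
        for piece in split_block(text, max_tokens):
            chunks.append({"tag": tag, "text": piece})
    return chunks
```

Keeping one chunk per DOM block, and splitting only blocks that exceed the budget, is one way to keep each result readable on its own. A real implementation would measure length with the same tokenizer the embedding model uses.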
🔹 Frontend Highlights
Card-based UI
Snippet + full HTML tabs
Show more/less
Copy markup button
Optional highlight
🔹 Challenges
Preserving readable HTML while staying under 500 tokens
Milvus Lite issues on Windows → fallback to cosine similarity search
First-run embedding model download delays
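For readers curious what a cosine fallback can look like when no vector database is available, here is a minimal brute-force sketch in pure Python. It is an illustration under assumptions, not the project's implementation: the function names `cosine_similarity` and `top_k` are hypothetical, and the vectors are plain lists of floats rather than model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=10):
    """Scores every chunk against the query and returns the k best.

    Returns (chunk_index, score) pairs, highest score first.
    """
    scored = [(cosine_similarity(query_vec, vec), index)
              for index, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(index, score) for score, index in scored[:k]]
```

A linear scan like this compares the query against every chunk, which is usually acceptable for the few hundred chunks a single page produces but would not scale to a large multi-page index.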
🔹 Lessons Learned
DOM-block chunking improves readability
Normalized embeddings enable consistent similarity scores
Toggle-based UI improves UX
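The point about normalized embeddings can be shown with a small sketch: once vectors are scaled to unit length, a plain dot product equals the cosine similarity, so scores land in the same [-1, 1] range regardless of how long the original vectors were. The helper names `l2_normalize` and `dot` are hypothetical and the vectors are toy values. Sentence-Transformers can do this step itself via `encode(..., normalize_embeddings=True)`; whether the project uses that option is an assumption.

```python
import math


def l2_normalize(vec):
    """Scales a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([30.0, 40.0])  # same direction as a, ten times longer
c = l2_normalize([4.0, -3.0])   # perpendicular to a

same_direction = dot(a, b)  # close to 1.0 despite the different magnitudes
perpendicular = dot(a, c)   # close to 0.0
```

Without normalization, the raw dot product of the first two vectors would be 250, which says more about vector length than about semantic similarity.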
🔹 What's Next?
✅ Multi-page crawling
✅ Better DOM coverage (tables, figures, captions)
✅ Server caching of embeddings
✅ Plug-in vector DB support (Pinecone, Weaviate, etc.)
🎥 I also recorded a full demo video. Happy to share or open source it soon!
LinkedIn: https://www.linkedin.com/in/thiyagu26v/
Project repository: https://github.com/thiyagu26v/website-content-django
Other links:
Portfolio: https://thiyagu26v.github.io/myreactportfolio/
Linktree: https://linktr.ee/thiyagu26v
GitHub: https://github.com/thiyagu26v
Forem: https://forem.com/thiyagu26v
Medium: https://medium.com/@thiyagu26v
Instagram: https://www.instagram.com/thiyagu26v
Dev.to: https://dev.to/thiyagu26v
Stack Overflow: https://stackoverflow.com/users/31647359/thiyagarajan-varadharajan
Facebook: https://www.facebook.com/thiyagu26v