Hi everyone!
My name is Constantin, and for the past few months I’ve been building a custom Information Retrieval (IR) system from scratch: partly for fun, partly to learn, and partly because I wanted something existing tools didn’t give me, namely simplicity combined with raw speed.
I’m finally at a point where it’s usable, and now I’m looking for feedback, testers, ideas, and brutally honest opinions from other developers who care about making information retrieval better for everyone.
https://github.com/engag1ng/hirmes
https://hirmes.webflow.io/
🔍 Why I Built It
Most commercial software already ships with some kind of built-in information retrieval, and yet hardly anyone uses it. I wanted to fix that.
✨ Key Features (so far)
🔸 1. Custom Token–Postings Indexing System
Instead of using Lucene, Whoosh, or Elastic, I built my own:
- Tokenizer with optional filters
- Postings lists stored efficiently
- Fast search over token sets
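To make the token-postings idea concrete, here is a minimal in-memory sketch. It is not taken from the hirmes codebase: the names `tokenize`, `not_stopword`, and `build_index`, the regex, and the tiny stopword list are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical names throughout; this is not hirmes' actual API.
TOKEN_RE = re.compile(r"[a-z0-9]+")  # compiled once, reused for every document
STOPWORDS = frozenset({"a", "an", "the", "of", "and"})


def tokenize(text, filters=()):
    """Lowercase, split on non-alphanumerics, then apply optional filters."""
    tokens = TOKEN_RE.findall(text.lower())
    for keep in filters:
        tokens = [t for t in tokens if keep(t)]
    return tokens


def not_stopword(token):
    """Example filter: drop very common words."""
    return token not in STOPWORDS


def build_index(docs, filters=()):
    """Map each token to the set of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text, filters):
            postings[token].add(doc_id)
    return postings


docs = {
    "note-1": "The quick brown fox",
    "note-2": "A quick guide to search",
}
index = build_index(docs, filters=(not_stopword,))
```

Storing each postings list as a set keeps lookups and set operations simple, at the cost of more memory than a sorted, compressed list would need.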
🔸 2. SQLite Backend for Storage
Initially I used dbm + pickle, but I migrated to sqlite3 for:
- Better performance on large posting sets
- ACID guarantees
- Easier debugging
- More predictable persistence
The schema is simple and extensible, so you can add your own metadata or scoring fields.
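As a rough illustration of what a SQLite-backed postings store with user-assigned document IDs could look like, here is a sketch using Python's standard `sqlite3` module. The table names, columns, and the helpers `add_document` and `lookup` are hypothetical and may not match the real hirmes schema.

```python
import sqlite3

# Hypothetical schema; the real hirmes tables may be laid out differently.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        doc_id TEXT PRIMARY KEY,   -- user-assigned ID
        path   TEXT                -- example of an extra metadata column
    );
    CREATE TABLE postings (
        token  TEXT NOT NULL,
        doc_id TEXT NOT NULL REFERENCES documents(doc_id),
        PRIMARY KEY (token, doc_id)
    ) WITHOUT ROWID;
""")


def add_document(conn, doc_id, path, tokens):
    """Insert a document and its postings in one transaction."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute("INSERT INTO documents VALUES (?, ?)", (doc_id, path))
        conn.executemany(
            "INSERT OR IGNORE INTO postings VALUES (?, ?)",
            [(token, doc_id) for token in tokens],
        )


def lookup(conn, token):
    """Return the sorted document IDs whose postings contain the token."""
    rows = conn.execute(
        "SELECT doc_id FROM postings WHERE token = ? ORDER BY doc_id", (token,)
    )
    return [row[0] for row in rows]


add_document(conn, "note-1", "notes/fox.txt", ["quick", "brown", "fox"])
add_document(conn, "note-2", "notes/search.txt", ["quick", "guide", "search"])
```

The composite primary key on `(token, doc_id)` doubles as the lookup index, and wrapping each document in a single transaction is where the ACID guarantee pays off: a failed insert leaves no partial postings behind.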
🔸 3. User-Assigned Document IDs
You can directly assign your own document IDs, making it ideal for:
- personal knowledge bases
- bookmarking apps
- search inside your own dataset
- programmatically indexed corpora
No auto-generation is required unless you want it.
🔸 4. Search Engine Core Logic
The search API currently supports:
- term lookup
- multi-term queries
- boolean AND/OR
- scoring based on term intersections (more ranking choices planned)
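One plausible reading of boolean AND/OR with intersection-based scoring is sketched below. The `search` function, its `mode` parameter, and the sample postings are assumptions made for this example, not the actual hirmes search API; the score here is simply the number of query terms a document contains.

```python
# Hypothetical postings data and function names, for illustration only.
postings = {
    "quick": {"note-1", "note-2"},
    "fox": {"note-1"},
    "search": {"note-2", "note-3"},
}


def search(postings, terms, mode="AND"):
    """Return (doc_id, score) pairs, best match first.

    The score is the number of query terms the document contains.
    """
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return []
    if mode == "AND":
        matches = set.intersection(*sets)  # documents containing every term
    else:
        matches = set.union(*sets)  # documents containing any term
    scored = [(doc_id, sum(doc_id in s for s in sets)) for doc_id in matches]
    # Sort by descending score, then by ID so ties are deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


and_hits = search(postings, ["quick", "fox"], mode="AND")
or_hits = search(postings, ["quick", "search"], mode="OR")
```

With AND, every hit gets the same score, so this kind of scoring only differentiates results for OR queries; a weighting scheme such as TF-IDF or BM25 would be a natural next ranking option.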
🔸 5. Performance-Focused Tokenization
I spent quite a bit of time optimizing tokenization for speed.
🧪 What I’m Looking For
I’d love early testers who can help with:
✔️ Trying it on your own small dataset
✔️ Finding slow spots, bugs, or edge cases
✔️ Suggesting features, scoring models, or indexing ideas
✔️ Telling me if something is unclear or needs documentation
✔️ Experimenting with tokenization and weighting strategies
If you’ve ever built tools involving:
- search
- indexing
- NLP
- document retrieval
- data engineering
- Python performance tuning
…your thoughts would mean the world to me.
🙏 How You Can Help
If you want to test it or give feedback, just drop a comment here or message me.
I can provide:
- Installation instructions
- Example code
- A test dataset
- Architecture overview
Or if you prefer GitHub issues / discussions, I can open those up as well.
❤️ Thank You
I know there are a ton of IR libraries and search engines out there, so if you take the time to try a small personal project of mine, it means a lot.
I’m doing this to learn and to build something useful — and I’d love to improve it with help from you guys.