Hi everyone!
My name is Constantin, and for the past few months I've been building a custom Information Retrieval (IR) system from scratch, partly for fun, partly to learn, and partly because I wanted something that existing tools didn't give me: simplicity mixed with raw speed.
I'm finally at a point where it's usable, and now I'm looking for feedback, testers, ideas, and brutally honest opinions from other developers who care about making information retrieval better for everyone.
https://github.com/engag1ng/hirmes
https://hirmes.webflow.io/
Why I Built It
Most commercial software already ships with some kind of built-in information retrieval system, and yet hardly anyone uses it. I wanted to fix that.
Key Features (so far)
1. Custom Token-Postings Indexing System
Instead of using Lucene, Whoosh, or Elastic, I built my own:
- Tokenizer with optional filters
- Postings lists stored efficiently
- Fast search over token sets
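To make the token-postings idea concrete, here is a minimal sketch of a tokenizer with optional filters feeding an in-memory postings map. It is not taken from the hirmes codebase: the names `tokenize`, `drop_short`, and `build_index` are hypothetical, and the real project may tokenize and store postings quite differently.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"\w+")


def tokenize(text, filters=()):
    """Split text into lowercase word tokens, then apply optional filters in order."""
    tokens = TOKEN_RE.findall(text.lower())
    for keep in filters:
        tokens = [t for t in tokens if keep(t)]
    return tokens


def drop_short(token):
    """Example filter (hypothetical): discard tokens of one or two characters."""
    return len(token) > 2


def build_index(docs, filters=()):
    """Map each token to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text, filters):
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


docs = {"a": "The quick brown fox", "b": "A quick search engine"}
index = build_index(docs, filters=(drop_short,))
print(index["quick"])
```

Storing each postings list as a sorted list of IDs keeps lookups simple and makes set operations over several tokens cheap.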
2. SQLite Backend for Storage
Initially I used dbm + pickle, but I migrated to sqlite3 for:
- Better performance on large posting sets
- ACID guarantees
- Easier debugging
- More predictable persistence
The schema is simple and extensible, so you can add your own metadata or scoring fields.
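As an illustration of what a postings table in sqlite3 could look like, here is a small sketch. The table name, the column names, and the `tf` scoring field are assumptions made for this example, not the actual hirmes schema.

```python
import sqlite3

# Hypothetical schema: one row per (token, document) pair.
conn = sqlite3.connect(":memory:")  # use a file path instead for persistence
conn.execute(
    """
    CREATE TABLE postings (
        token  TEXT NOT NULL,
        doc_id TEXT NOT NULL,
        tf     INTEGER NOT NULL DEFAULT 1,  -- example scoring field
        PRIMARY KEY (token, doc_id)
    )
    """
)


def add_posting(conn, token, doc_id):
    """Insert a posting, or bump its term frequency if it already exists."""
    cur = conn.execute(
        "UPDATE postings SET tf = tf + 1 WHERE token = ? AND doc_id = ?",
        (token, doc_id),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO postings (token, doc_id) VALUES (?, ?)",
            (token, doc_id),
        )


def lookup(conn, token):
    """Return (doc_id, tf) rows for a token, ordered by document ID."""
    rows = conn.execute(
        "SELECT doc_id, tf FROM postings WHERE token = ? ORDER BY doc_id",
        (token,),
    )
    return rows.fetchall()


# The connection used as a context manager commits on success and
# rolls back on an exception, so the batch is written atomically.
with conn:
    for token, doc_id in [("quick", "a"), ("quick", "a"), ("quick", "b")]:
        add_posting(conn, token, doc_id)

print(lookup(conn, "quick"))
```

The composite primary key doubles as the lookup index for a token, and extra metadata or scoring columns can be added later with `ALTER TABLE`.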
3. User-Assigned Document IDs
You can directly assign your own document IDs, making it ideal for:
- personal knowledge bases
- bookmarking apps
- search inside your own dataset
- programmatically indexed corpora
No auto-generation required unless you want it.
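One way an API could accept caller-supplied IDs while keeping auto-generation optional is sketched below. The `DocumentStore` class and its `add` method are hypothetical names for illustration and do not reflect the actual hirmes interface.

```python
import itertools


class DocumentStore:
    """Hypothetical store that accepts user-assigned IDs, with optional auto IDs."""

    def __init__(self):
        self._docs = {}
        self._counter = itertools.count(1)

    def add(self, text, doc_id=None):
        # Fall back to a generated ID only when the caller does not supply one.
        if doc_id is None:
            doc_id = f"auto-{next(self._counter)}"
        if doc_id in self._docs:
            raise ValueError(f"duplicate document ID: {doc_id!r}")
        self._docs[doc_id] = text
        return doc_id


store = DocumentStore()
print(store.add("Notes on inverted indexes", doc_id="notes/ir-basics"))
print(store.add("An untitled scratch note"))
```

Rejecting duplicate IDs up front is one possible policy; overwriting the existing document would be an equally reasonable choice for a bookmarking app.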
4. Search Engine Core Logic
The search API currently supports:
- term lookup
- multi-term queries
- boolean AND/OR
- scoring based on term intersections (more ranking choices planned)
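The query types above can be sketched over a plain token-to-IDs mapping. This is a simplified illustration, not the project's implementation: the `search` function and its `mode` parameter are hypothetical, and the score here is simply the number of query terms a document contains.

```python
from collections import Counter


def search(index, terms, mode="AND"):
    """Return (doc_id, score) pairs, best first; score = number of matching query terms."""
    posting_sets = [set(index.get(term, ())) for term in terms]
    if not posting_sets:
        return []
    if mode == "AND":
        matches = set.intersection(*posting_sets)
    elif mode == "OR":
        matches = set.union(*posting_sets)
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    # Count how many query terms each matching document contains.
    hits = Counter(
        doc_id for postings in posting_sets for doc_id in postings if doc_id in matches
    )
    # Sort by descending score, then by document ID for a stable order.
    return sorted(hits.items(), key=lambda pair: (-pair[1], pair[0]))


index = {"quick": ["a", "b"], "fox": ["a"], "engine": ["b", "c"]}
print(search(index, ["quick", "fox"], mode="AND"))
print(search(index, ["quick", "fox"], mode="OR"))
```

With AND every hit gets the same score, so this kind of ranking only becomes informative for OR queries; weighting schemes such as TF-IDF or BM25 would be natural next steps.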
5. Performance-Focused Tokenization
I spent quite a bit of time optimizing tokenization for speed.
What I'm Looking For
I'd love early testers who can help with:
- Trying it on your own small dataset
- Finding slow spots, bugs, or edge cases
- Suggesting features, scoring models, or indexing ideas
- Telling me if something is unclear or needs documentation
- Experimenting with tokenization and weighting strategies
If you've ever built tools involving:
- search
- indexing
- NLP
- document retrieval
- data engineering
- Python performance tuning
...your thoughts would mean the world to me.
How You Can Help
If you want to test it or give feedback, just drop a comment here or message me.
I can provide:
- Installation instructions
- Example code
- A test dataset
- Architecture overview
Or if you prefer GitHub issues / discussions, I can open those up as well.
Thank You
I know there are a ton of IR libraries and search engines out there, so if you take the time to try a small personal project of mine, it means a lot.
I'm doing this to learn and to build something useful, and I'd love to improve it with help from you guys.