Constantin

Posted on Dec 8, 2025

🚀 The Future of Information Retrieval (IR) — Looking for Testers & Feedback

#productivity #python #opensource #testing

Hi everyone!
My name is Constantin, and for the past few months I've been building a custom Information Retrieval (IR) system from scratch, partly for fun, partly to learn, and partly because I wanted something that existing tools didn’t give me: simplicity mixed with raw speed.

I’m finally at a point where it’s usable, and now I’m looking for feedback, testers, ideas, and brutally honest opinions from other developers who care about making information retrieval better for everyone.

https://github.com/engag1ng/hirmes
https://hirmes.webflow.io/

🔍 Why I Built It

Most commercial software ships with some kind of built-in Information Retrieval system, and yet, in my experience, hardly anyone uses it. I wanted to fix that.

✨ Key Features (so far)

🔸 1. Custom Token–Postings Indexing System

Instead of using Lucene, Whoosh, or Elastic, I built my own:

Tokenizer with optional filters
Postings lists stored efficiently
Fast search over token sets
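
To make the idea concrete, here is a minimal sketch of a token-to-postings index in plain Python. It is not the project's actual code: the names `tokenize`, `build_index`, and `STOP_WORDS` are made up for illustration, and the "optional filters" are assumed to be simple predicates that keep or drop a token.

```python
import re
from collections import defaultdict

# Hypothetical names throughout; this is an illustration, not the hirmes API.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text, filters=()):
    """Lowercase the text, split it into word tokens, then apply optional filters."""
    tokens = TOKEN_RE.findall(text.lower())
    for keep in filters:
        tokens = [t for t in tokens if keep(t)]
    return tokens


def build_index(docs, filters=()):
    """Map each token to the set of document IDs that contain it (its postings)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text, filters):
            postings[token].add(doc_id)
    return postings


STOP_WORDS = frozenset({"the", "a", "of"})
docs = {"d1": "The speed of search", "d2": "A simple search index"}
index = build_index(docs, filters=[lambda t: t not in STOP_WORDS])
```

Looking up `index["search"]` then returns the set of documents containing that token, which is the basic operation the rest of a search engine builds on.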

🔸 2. SQLite Backend for Storage

Initially I used dbm + pickle, but I migrated to sqlite3 for:

Better performance on large posting sets
ACID guarantees
Easier debugging
More predictable persistence
The schema is simple and extensible, so you can add your own metadata or scoring fields.
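
For readers who have not stored postings in SQLite before, a schema along these lines is one way it could look. This is a sketch under my own assumptions, not the project's real schema: the table names, the `title` metadata column, and the `tf` scoring column are all hypothetical.

```python
import sqlite3

# Hypothetical schema for illustration; the real project may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    doc_id TEXT PRIMARY KEY,
    title  TEXT                          -- example metadata column
);
CREATE TABLE postings (
    token  TEXT NOT NULL,
    doc_id TEXT NOT NULL REFERENCES documents(doc_id),
    tf     INTEGER NOT NULL DEFAULT 1,   -- example scoring field
    PRIMARY KEY (token, doc_id)          -- also serves lookups by token
);
""")

# Using the connection as a context manager commits on success
# and rolls back if an exception is raised inside the block.
with conn:
    conn.execute("INSERT INTO documents VALUES (?, ?)", ("d1", "Notes"))
    conn.executemany(
        "INSERT INTO postings (token, doc_id, tf) VALUES (?, ?, ?)",
        [("search", "d1", 2), ("index", "d1", 1)],
    )

rows = conn.execute(
    "SELECT doc_id, tf FROM postings WHERE token = ?", ("search",)
).fetchall()
```

Adding another metadata or scoring field in this layout is a matter of adding a column to one of the two tables.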

🔸 3. User-Assigned Document IDs

You can directly assign your own document IDs, making it ideal for:

personal knowledge bases
bookmarking apps
search inside your own dataset
programmatically indexed corpora
No auto-generation required unless you want it.
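
As a rough illustration of what "bring your own ID, or let one be generated" can look like, here is a small sketch. The `DocStore` class and its `add` method are hypothetical names, not the project's API, and the UUID fallback is just one possible way to auto-generate an ID.

```python
import uuid


class DocStore:
    """Hypothetical in-memory store that accepts caller-chosen document IDs."""

    def __init__(self):
        self.docs = {}

    def add(self, text, doc_id=None):
        # Fall back to a random ID only when the caller does not supply one.
        if doc_id is None:
            doc_id = uuid.uuid4().hex
        if doc_id in self.docs:
            raise KeyError(f"document ID already in use: {doc_id!r}")
        self.docs[doc_id] = text
        return doc_id


store = DocStore()
chosen = store.add("Notes on inverted indexes", doc_id="bookmark-42")
generated = store.add("Another document")
```

Here `chosen` is exactly the ID that was passed in, which is what makes it easy to map search hits back to rows in your own dataset.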

🔸 4. Search Engine Core Logic

The search API currently supports:

term lookup
multi-term queries
boolean AND/OR
scoring based on term intersections (more ranking choices planned)
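
Here is a minimal sketch of how boolean AND/OR over postings sets can work. I am assuming that "scoring based on term intersections" means ranking a document by how many of the query terms it contains; the `search` function and its `mode` parameter are hypothetical, not the project's API.

```python
def search(index, terms, mode="and"):
    """Return (doc_id, score) pairs, highest score first.

    index maps token -> set of doc IDs. The score is the number of
    query terms a document matches (an assumed scoring rule).
    """
    sets = [index.get(term, set()) for term in terms]
    if not sets:
        return []
    if mode == "and":
        matches = set.intersection(*sets)  # docs containing every term
    else:
        matches = set.union(*sets)         # docs containing any term
    scores = {doc: sum(doc in s for s in sets) for doc in matches}
    # Sort by descending score, then by doc ID for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = {
    "fast": {"d1", "d2"},
    "search": {"d1", "d3"},
}
and_hits = search(index, ["fast", "search"], mode="and")
or_hits = search(index, ["fast", "search"], mode="or")
```

With AND, every hit has the same score by construction, so this rule only differentiates results for OR queries. That is one reason additional ranking options are worth having.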

🔸 5. Performance-Focused Tokenization

I spent quite a bit of time optimizing tokenization for speed.
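
If you want to check tokenizer speed yourself, a comparison like the following is one way to do it. This is a generic sketch, not the project's tokenizer: both functions are hypothetical, and the precompiled regex is just a common optimization I am using as an example.

```python
import re
import timeit

_WORD_RE = re.compile(r"[a-z0-9]+")  # compiled once at import time


def tokenize_regex(text):
    """Regex-based tokenizer: the matching loop runs in C inside re."""
    return _WORD_RE.findall(text.lower())


def tokenize_loop(text):
    """Character-by-character tokenizer, written in pure Python for comparison."""
    tokens, current = [], []
    for ch in text.lower():
        if ch.isascii() and ch.isalnum():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


sample = "Fast search over token sets, built from scratch! " * 200

# Both tokenizers should agree before their speed is compared.
same_output = tokenize_regex(sample) == tokenize_loop(sample)

regex_time = timeit.timeit(lambda: tokenize_regex(sample), number=50)
loop_time = timeit.timeit(lambda: tokenize_loop(sample), number=50)
print(f"regex: {regex_time:.4f}s  loop: {loop_time:.4f}s")
```

The actual numbers depend on your machine and Python version, so treat them as relative rather than absolute.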

🧪 What I’m Looking For

I’d love early testers who can help with:

✔️ Trying it on your own small dataset
✔️ Finding slow spots, bugs, or edge cases
✔️ Suggesting features, scoring models, or indexing ideas
✔️ Telling me if something is unclear or needs documentation
✔️ Experimenting with tokenization and weighting strategies

If you’ve ever built tools involving:

search
indexing
NLP
document retrieval
data engineering
Python performance tuning

…your thoughts would mean the world to me.

🙏 How You Can Help

If you want to test it or give feedback, just drop a comment here or message me.
I can provide:

Installation instructions
Example code
A test dataset
Architecture overview

Or if you prefer GitHub issues / discussions, I can open those up as well.

❤️ Thank You

I know there are a ton of IR libraries and search engines out there, so if you take the time to try a small personal project of mine, it means a lot.
I’m doing this to learn and to build something useful — and I’d love to improve it with help from you guys.

DEV Community