DEV Community

Constantin


🚀 The Future of Information Retrieval (IR) — Looking for Testers & Feedback

Hi everyone!
My name is Constantin, and for the past few months I've been building a custom Information Retrieval (IR) system from scratch: partly for fun, partly to learn, and partly because I wanted something that existing tools didn’t give me, namely simplicity combined with raw speed.

I’m finally at a point where it’s usable, and now I’m looking for feedback, testers, ideas, and brutally honest opinions from other developers who care about making information retrieval better for everyone.

https://github.com/engag1ng/hirmes
https://hirmes.webflow.io/

🔍 Why I Built It

Most commercial software ships with some sort of built-in Information Retrieval system, and yet, as far as I can tell, hardly anyone uses it. I wanted to fix that.

✨ Key Features (so far)

🔸 1. Custom Token–Postings Indexing System

Instead of using Lucene, Whoosh, or Elasticsearch, I built my own:

  • Tokenizer with optional filters
  • Postings lists stored efficiently
  • Fast search over token sets
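
As a rough illustration of the idea (not the actual hirmes code), a tokenizer with optional filters feeding a token-to-postings map could look like the sketch below. The names `tokenize`, `build_index`, `not_stopword`, and the tiny stopword set are all hypothetical and chosen only for this example.

```python
import re
from collections import defaultdict

# Precompiled once so the pattern is not re-parsed on every call.
TOKEN_RE = re.compile(r"\w+")

# Hypothetical stopword list, just for the example.
STOPWORDS = {"the", "a", "of"}


def not_stopword(token):
    """Example filter: keep a token only if it is not a stopword."""
    return token not in STOPWORDS


def tokenize(text, filters=()):
    """Split text into lowercase word tokens, then apply each filter in order."""
    tokens = TOKEN_RE.findall(text.lower())
    for keep in filters:
        tokens = [t for t in tokens if keep(t)]
    return tokens


def build_index(docs, filters=()):
    """Map each token to the set of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text, filters):
            postings[token].add(doc_id)
    return postings


docs = {"a": "The speed of search", "b": "Search the index"}
index = build_index(docs, filters=[not_stopword])
print(sorted(index["search"]))  # ['a', 'b']
```

Storing postings as sets keeps a single-term lookup to one dictionary access and makes it cheap to combine terms later with set operations.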

🔸 2. SQLite Backend for Storage

Initially I used dbm + pickle, but I migrated to sqlite3 for:

  • Better performance on large posting sets
  • ACID guarantees
  • Easier debugging
  • More predictable persistence

The schema is simple and extensible, so you can add your own metadata or scoring fields.
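
To make the storage idea concrete, here is a minimal sketch of how postings could be kept in SQLite using Python's standard `sqlite3` module. The table name `postings`, its columns, and the helpers `add_postings` and `lookup` are assumptions made for this example; the real hirmes schema may look different.

```python
import sqlite3

# In-memory database for the example; a file path would give persistence.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per (token, document) pair.
conn.execute(
    """
    CREATE TABLE postings (
        token  TEXT NOT NULL,
        doc_id TEXT NOT NULL,
        PRIMARY KEY (token, doc_id)
    )
    """
)


def add_postings(conn, doc_id, tokens):
    """Insert one row per distinct token; the whole batch is one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR IGNORE INTO postings (token, doc_id) VALUES (?, ?)",
            [(token, doc_id) for token in set(tokens)],
        )


def lookup(conn, token):
    """Return the IDs of all documents containing the token, sorted."""
    rows = conn.execute(
        "SELECT doc_id FROM postings WHERE token = ? ORDER BY doc_id",
        (token,),
    )
    return [row[0] for row in rows]


add_postings(conn, "doc-1", ["fast", "search", "fast"])
add_postings(conn, "doc-2", ["search"])

# Extending the schema with a custom scoring field (hypothetical column).
conn.execute("ALTER TABLE postings ADD COLUMN weight REAL DEFAULT 1.0")
```

The composite primary key doubles as an index on `token`, so lookups by term do not need a separate index in this sketch.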

🔸 3. User-Assigned Document IDs

You can directly assign your own document IDs, making it ideal for:

  • personal knowledge bases
  • bookmarking apps
  • search inside your own dataset
  • programmatically indexed corpora

No auto-generation is required unless you want it.
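
One way to picture this, as a sketch rather than the actual hirmes API: an `add` call that takes an optional `doc_id` and only generates one when none is given. The class `DocumentStore`, its method names, and the `auto-N` ID format are all hypothetical.

```python
import itertools


class DocumentStore:
    """Toy in-memory store that accepts caller-chosen document IDs."""

    def __init__(self):
        self._docs = {}
        self._counter = itertools.count(1)

    def add(self, text, doc_id=None):
        """Store text under doc_id, generating an ID only if none was given."""
        if doc_id is None:
            # Skip any generated ID that a caller has already claimed.
            doc_id = f"auto-{next(self._counter)}"
            while doc_id in self._docs:
                doc_id = f"auto-{next(self._counter)}"
        elif doc_id in self._docs:
            raise ValueError(f"document ID already in use: {doc_id!r}")
        self._docs[doc_id] = text
        return doc_id

    def get(self, doc_id):
        """Return the stored text for doc_id."""
        return self._docs[doc_id]


store = DocumentStore()
store.add("https://example.com/article", doc_id="bookmark:42")
generated = store.add("a note without an explicit ID")
```

Rejecting duplicate user-supplied IDs is a design choice in this sketch; overwriting the existing document would be an equally reasonable policy.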

🔸 4. Search Engine Core Logic

The search API currently supports:

  • term lookup
  • multi-term queries
  • boolean AND/OR
  • scoring based on term intersections (more ranking choices planned)
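
The query types listed above can be sketched over a plain token-to-set mapping. This is only an illustration under my own assumptions: the function name `search`, the `mode` parameter, and the scoring rule (count how many query terms a document contains) are hypothetical stand-ins for "scoring based on term intersections", not the project's actual implementation.

```python
def search(postings, terms, mode="and"):
    """Run a boolean AND/OR query and rank hits by number of matching terms.

    postings maps each token to the set of document IDs containing it.
    Returns a list of (doc_id, score) pairs, best score first.
    """
    term_sets = [postings.get(term, set()) for term in terms]
    if not term_sets:
        return []

    if mode == "and":
        matches = set.intersection(*term_sets)
    elif mode == "or":
        matches = set.union(*term_sets)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    # Score = how many of the query terms appear in the document.
    scores = {
        doc_id: sum(doc_id in docs for docs in term_sets)
        for doc_id in matches
    }
    # Sort by descending score, then by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


postings = {"fast": {"a", "b"}, "search": {"b", "c"}}
print(search(postings, ["fast", "search"], mode="and"))  # [('b', 2)]
print(search(postings, ["fast", "search"], mode="or"))   # [('b', 2), ('a', 1), ('c', 1)]
```

A single-term lookup is just the one-element case of either mode, so it needs no separate code path here.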

🔸 5. Performance-Focused Tokenization

I spent quite a bit of time optimizing tokenization for speed.

🧪 What I’m Looking For

I’d love early testers who can help with:

✔️ Trying it on your own small dataset
✔️ Finding slow spots, bugs, or edge cases
✔️ Suggesting features, scoring models, or indexing ideas
✔️ Telling me if something is unclear or needs documentation
✔️ Experimenting with tokenization and weighting strategies

If you’ve ever built tools involving:

  • search
  • indexing
  • NLP
  • document retrieval
  • data engineering
  • Python performance tuning

…your thoughts would mean the world to me.

🙏 How You Can Help

If you want to test it or give feedback, just drop a comment here or message me.
I can provide:

  • Installation instructions
  • Example code
  • A test dataset
  • Architecture overview

Or if you prefer GitHub issues / discussions, I can open those up as well.

❤️ Thank You

I know there are a ton of IR libraries and search engines out there, so if you take the time to try a small personal project of mine, it means a lot.
I’m doing this to learn and to build something useful — and I’d love to improve it with help from you guys.
