DEV Community

Constantin

🚀 The Future of Information Retrieval (IR) – Looking for Testers & Feedback

Hi everyone!
My name is Constantin, and for the past few months I've been building a custom Information Retrieval (IR) system from scratch, partly for fun, partly to learn, and partly because I wanted something that existing tools didn't give me: simplicity combined with raw speed.

I'm finally at a point where it's usable, and now I'm looking for feedback, testers, ideas, and brutally honest opinions from other developers who care about making information retrieval better for everyone.

https://github.com/engag1ng/hirmes
https://hirmes.webflow.io/

๐Ÿ” Why I Built It

Most commercial software ships with some sort of built-in information retrieval system, and yet hardly anyone uses it. I wanted to fix that.

✨ Key Features (so far)

🔸 1. Custom Token–Postings Indexing System

Instead of using Lucene, Whoosh, or Elastic, I built my own:

  • Tokenizer with optional filters
  • Postings lists stored efficiently
  • Fast search over token sets
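
To make the idea concrete, here is a rough sketch of a token-to-postings index in Python. This is illustrative only, not the actual hirmes API; the function names are hypothetical:

```python
from collections import defaultdict

def tokenize(text, lowercase=True):
    """Whitespace tokenizer with an optional lowercasing filter."""
    tokens = text.split()
    return [t.lower() for t in tokens] if lowercase else tokens

def build_index(docs):
    """Map each token to a sorted postings list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}

index = build_index({1: "fast simple search", 2: "simple indexing"})
print(index["simple"])  # [1, 2]
```

A real implementation adds compression, on-disk storage, and incremental updates on top of this core shape.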

🔸 2. SQLite Backend for Storage

Initially I used dbm + pickle, but I migrated to sqlite3 for:

  • Better performance on large posting sets
  • ACID guarantees
  • Easier debugging
  • More predictable persistence

The schema is simple and extensible, so you can add your own metadata or scoring fields.
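
For illustration, a minimal single-table layout along these lines could look like the sketch below. This is a hypothetical schema, not necessarily the one hirmes uses; it stores each token's postings list as JSON text:

```python
import json
import sqlite3

# Hypothetical minimal schema: one row per token, postings stored as JSON.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postings (token TEXT PRIMARY KEY, doc_ids TEXT NOT NULL)"
)

def upsert(token, doc_ids):
    # SQLite UPSERT (requires SQLite >= 3.24) keeps one row per token.
    conn.execute(
        "INSERT INTO postings VALUES (?, ?) "
        "ON CONFLICT(token) DO UPDATE SET doc_ids = excluded.doc_ids",
        (token, json.dumps(doc_ids)),
    )

def lookup(token):
    row = conn.execute(
        "SELECT doc_ids FROM postings WHERE token = ?", (token,)
    ).fetchone()
    return json.loads(row[0]) if row else []

upsert("simple", [1, 2])
print(lookup("simple"))  # [1, 2]
```

Extra metadata or scoring fields would simply become additional columns on the same table.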

🔸 3. User-Assigned Document IDs

You can directly assign your own document IDs, making it ideal for:

  • personal knowledge bases
  • bookmarking apps
  • search inside your own dataset
  • programmatically indexed corpora

No auto-generation is required unless you want it.
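
In practice that means the caller picks the ID type: an integer, a file path, a URL, whatever fits the dataset. A toy sketch (again with hypothetical names, not the hirmes API):

```python
def index_document(index, doc_id, text):
    """Add a document under a caller-supplied ID (int, path, URL, ...)."""
    for token in text.lower().split():
        postings = index.setdefault(token, [])
        if doc_id not in postings:  # avoid duplicate postings per document
            postings.append(doc_id)

index = {}
index_document(index, "notes/search.md", "custom ids for my notes")
index_document(index, "https://example.com", "custom ids for a bookmark")
print(index["custom"])  # ['notes/search.md', 'https://example.com']
```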

🔸 4. Search Engine Core Logic

The search API currently supports:

  • term lookup
  • multi-term queries
  • boolean AND/OR
  • scoring based on term intersections (more ranking choices planned)
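
Intersection-based scoring can be sketched like this: count how many distinct query terms each document matches, then either require all terms (AND) or keep any match (OR). This is a simplified illustration of the idea, not the hirmes implementation:

```python
def search(index, query_terms, mode="AND"):
    """Score each document by how many distinct query terms it contains."""
    terms = list(dict.fromkeys(query_terms))  # dedupe, preserve order
    hits = {}
    for term in terms:
        for doc_id in index.get(term, []):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    if mode == "AND":  # keep only documents matching every term
        hits = {d: s for d, s in hits.items() if s == len(terms)}
    return sorted(hits.items(), key=lambda kv: -kv[1])

index = {"fast": [1, 3], "search": [1, 2]}
print(search(index, ["fast", "search"], mode="AND"))  # [(1, 2)]
```

With `mode="OR"` the same query returns all three documents, ranked by how many terms they share with the query.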

🔸 5. Performance-Focused Tokenization

I spent quite a bit of time optimizing tokenization for speed.
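
As one generic example of the kind of micro-optimisation that pays off in Python tokenizers (not a claim about what hirmes specifically does): compiling the token pattern once, instead of rebuilding it on every call.

```python
import re

# Compiling the pattern once avoids per-call regex construction,
# one of the simplest tokenization speed-ups in Python.
TOKEN_RE = re.compile(r"[a-z0-9]+")

def fast_tokenize(text):
    return TOKEN_RE.findall(text.lower())

print(fast_tokenize("Hello, IR world! 42x"))  # ['hello', 'ir', 'world', '42x']
```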

🧪 What I'm Looking For

I'd love early testers who can help with:

โœ”๏ธ Trying it on your own small dataset
โœ”๏ธ Finding slow spots, bugs, or edge cases
โœ”๏ธ Suggesting features, scoring models, or indexing ideas
โœ”๏ธ Telling me if something is unclear or needs documentation
โœ”๏ธ Experimenting with tokenization and weighting strategies

If you've ever built tools involving:

  • search
  • indexing
  • NLP
  • document retrieval
  • data engineering
  • Python performance tuning

…your thoughts would mean the world to me.

๐Ÿ™ How You Can Help

If you want to test it or give feedback, just drop a comment here or message me.
I can provide:

  • Installation instructions
  • Example code
  • A test dataset
  • Architecture overview

Or if you prefer GitHub issues / discussions, I can open those up as well.

โค๏ธ Thank You

I know there are a ton of IR libraries and search engines out there, so if you take the time to try a small personal project of mine, it means a lot.
I'm doing this to learn and to build something useful, and I'd love to improve it with help from you all.
