I built a deterministic byte-exact retrieval engine. Here’s what I learned about correctness the hard way.
Not a search engine. Not a vector DB. Not a grep replacement. Something else.
Last year I started building something I couldn’t find anywhere else: a retrieval system that makes a hard guarantee.
Not “probably found it.” Not “semantically similar.” Not “ranked by relevance.”
Just: these exact bytes exist at these exact offsets. Every time. Same query, same result. No exceptions.
The project is called GLYPH. It’s built on a suffix array, the Burrows-Wheeler transform (BWT), and an FM-index over raw bytes. It’s experimental. It has known limitations. And building it taught me more about correctness than anything I’ve worked on before.
This is the story of what went wrong, what I fixed, and what “determin...
I built a retrieval engine that makes one hard guarantee: same bytes, same result, every time.
No ranking. No embeddings. No “probably found it.”
Just: these exact bytes exist at these exact offsets.
The bug that taught me the most: FM-index counts were wrong on the 1GB HDFS corpus. SA correct. BWT correct. C-table correct. The culprit was one missing byte: the terminal sentinel wasn’t physically appended to the corpus, only accounted for symbolically. Off by one byte. Wrong counts.
Fix: append a real 0x00 byte. Verify against a Python oracle. Formalize it as an invariant. Write a regression test.
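Here is a minimal sketch of that contract in Python. It is not GLYPH's code: the names `build_fm_index`, `fm_count`, and `oracle_count` are hypothetical, the suffix array is built naively, and it assumes the corpus itself contains no 0x00 byte. The point is the shape of the invariant: the sentinel is a real byte in the indexed text, and every count is checked against a brute-force oracle.

```python
SENTINEL = 0  # 0x00; assumption: the corpus never contains this byte


def build_fm_index(corpus: bytes):
    """Build a toy FM-index (C table + occurrence counts) over corpus + sentinel."""
    if SENTINEL in corpus:
        raise ValueError("corpus must not contain the sentinel byte 0x00")
    text = corpus + bytes([SENTINEL])  # invariant: the sentinel is physically appended
    n = len(text)
    # Naive suffix array: fine for a sketch, far too slow for a 1GB corpus.
    sa = sorted(range(n), key=lambda i: text[i:])
    # BWT: the byte preceding each sorted suffix (wraps to the sentinel for i == 0).
    bwt = bytes(text[i - 1] for i in sa)
    # C[c] = number of bytes in text strictly smaller than c.
    freq = [0] * 256
    for b in text:
        freq[b] += 1
    C, total = [0] * 256, 0
    for c in range(256):
        C[c] = total
        total += freq[c]
    # occ[c][i] = occurrences of byte c in bwt[:i].
    occ = {c: [0] * (n + 1) for c in set(text)}
    for i, b in enumerate(bwt):
        for c, row in occ.items():
            row[i + 1] = row[i] + (1 if b == c else 0)
    return C, occ, n


def fm_count(index, pattern: bytes) -> int:
    """Count occurrences of a non-empty pattern via backward search."""
    C, occ, n = index
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in occ:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


def oracle_count(corpus: bytes, pattern: bytes) -> int:
    """Brute-force oracle: count overlapping occurrences by direct comparison."""
    m = len(pattern)
    return sum(1 for i in range(len(corpus) - m + 1) if corpus[i:i + m] == pattern)


# Regression check: every substring of the corpus must agree with the oracle.
corpus = b"abracadabra"
index = build_fm_index(corpus)
for i in range(len(corpus)):
    for j in range(i + 1, len(corpus) + 1):
        pattern = corpus[i:j]
        assert fm_count(index, pattern) == oracle_count(corpus, pattern)
```

Checking every substring is only feasible on a tiny corpus. On real data the same idea would have to run on a sample of patterns.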
That shift — from “fixed a bug” to “formalized a contract” — changed how I think about correctness entirely.
Benchmark reality, honestly:
grep 1GB scan: 11.5 sec
GLYPH persistent FM: 0.0167 ms/query ← index in RAM
GLYPH verified CLI: ~19 ms/query ← subprocess + integrity check
The two GLYPH numbers describe two different systems. Most benchmarks show only the fast number. Both matter.
RAM cost: 9.4GB for a 1GB corpus. Not hiding it. A compressed suffix array is next.
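The post doesn't break down where that RAM goes, so the following is only a back-of-envelope model, not a description of GLYPH's memory layout. It assumes a hypothetical layout of corpus text plus a suffix array with 8-byte entries, and ignores the BWT and occurrence tables. The function name and parameters are made up for illustration. It shows why an uncompressed suffix array dominates the footprint, and roughly what sampling it would save.

```python
def index_ram_bytes(corpus_bytes: int, sa_entry_bytes: int = 8, sa_sample_rate: int = 1) -> int:
    """Estimate RAM for text + suffix array only (assumed layout, not GLYPH's).

    sa_sample_rate = 1 keeps every suffix array entry; a rate of k keeps 1 in k.
    """
    text_len = corpus_bytes + 1  # corpus plus the one-byte sentinel
    sa_entries = -(-text_len // sa_sample_rate)  # ceiling division
    return text_len + sa_entries * sa_entry_bytes


one_gb = 10**9
full = index_ram_bytes(one_gb)                        # every SA entry kept
sampled = index_ram_bytes(one_gb, sa_sample_rate=32)  # 1 in 32 SA entries kept
print(f"full SA:    {full / 1e9:.2f} GB")
print(f"sampled SA: {sampled / 1e9:.2f} GB")
```

Under these assumptions the full suffix array alone costs eight times the corpus size. A sampled array trades that memory for extra work when locating offsets.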
This isn’t a vector DB killer. It’s a verification layer beneath probabilistic systems — for when you need to know if a chunk was actually in the source, not just semantically similar.
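To make the "was this chunk actually in the source" check concrete, here is a small sketch using a plain suffix array with binary search. It is an illustration only: `build_suffix_array`, `locate`, and `was_in_source` are hypothetical names, not GLYPH's API, and the naive construction would not scale to a 1GB corpus. It returns the exact byte offsets, sorted, so the same query always gives the same answer.

```python
def build_suffix_array(source: bytes) -> list:
    """Naive suffix array: start offsets of all suffixes in lexicographic order."""
    return sorted(range(len(source)), key=lambda i: source[i:])


def locate(source: bytes, sa: list, chunk: bytes) -> list:
    """Return the sorted byte offsets where chunk occurs exactly in source."""
    m = len(chunk)
    if m == 0:
        return []  # design choice for this sketch: an empty chunk matches nothing
    # Lower bound: first suffix whose m-byte prefix is >= chunk.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if source[sa[mid]:sa[mid] + m] < chunk:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-byte prefix is > chunk.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if source[sa[mid]:sa[mid] + m] <= chunk:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def was_in_source(source: bytes, sa: list, chunk: bytes) -> bool:
    """Byte-exact membership: True only if chunk appears verbatim in source."""
    return bool(locate(source, sa, chunk))


source = b"error: disk full\nwarn: disk slow\nerror: disk full\n"
sa = build_suffix_array(source)
print(locate(source, sa, b"disk full"))          # exact offsets of a verbatim chunk
print(was_in_source(source, sa, b"disk  full"))  # near-miss with an extra space
```

A chunk that differs from the source by a single byte, such as the extra space in the second query, is reported as absent no matter how similar it looks.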
git clone https://github.com/yasha1971-coder/glyph-engine
cd glyph-engine
./examples/mini/build_mini.sh
# count: 2
Apache-2.0. Experimental. Critique welcome, especially on RAM economics.
#systems #retrieval #infrastructure #cpp #algorithms