<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: contour</title>
    <description>The latest articles on DEV Community by contour (@yasha1971coder).</description>
    <link>https://dev.to/yasha1971coder</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935596%2F0df6e97a-b14f-429a-9d8d-18f701448faa.jpg</url>
      <title>DEV Community: contour</title>
      <link>https://dev.to/yasha1971coder</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yasha1971coder"/>
    <language>en</language>
    <item>
      <title>Description: Deterministic byte-exact retrieval over static corpora.</title>
      <dc:creator>contour</dc:creator>
      <pubDate>Sat, 16 May 2026 23:33:31 +0000</pubDate>
      <link>https://dev.to/yasha1971coder/description-deterministic-byte-exact-retrieval-over-static-corpora-4793</link>
      <guid>https://dev.to/yasha1971coder/description-deterministic-byte-exact-retrieval-over-static-corpora-4793</guid>
      <description>&lt;h1&gt;
  
  
  I built a deterministic byte-exact retrieval engine. Here’s what I learned about correctness the hard way.
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Not a search engine. Not a vector DB. Not a grep replacement. Something else.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Last year I started building something I couldn’t find anywhere else: a retrieval system that makes a hard guarantee.&lt;/p&gt;

&lt;p&gt;Not “probably found it.” Not “semantically similar.” Not “ranked by relevance.”&lt;/p&gt;

&lt;p&gt;Just: &lt;strong&gt;these exact bytes exist at these exact offsets. Every time. Same query, same result. No exceptions.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The project is called GLYPH. It’s built on suffix array + BWT + FM-index over raw bytes. It’s experimental. It has known limitations. And building it taught me more about correctness than anything I’ve worked on before.&lt;/p&gt;

&lt;p&gt;This is the story of what went wrong, what I fixed, and what “determin...&lt;/p&gt;

&lt;h1&gt;
  
  
  I built a retrieval engine that makes one hard guarantee: same bytes, same result, every time.
&lt;/h1&gt;

&lt;p&gt;No ranking. No embeddings. No “probably found it.”&lt;/p&gt;

&lt;p&gt;Just: &lt;strong&gt;these exact bytes exist at these exact offsets.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;The bug that taught me the most: FM-index counts were wrong on the HDFS 1GB corpus. SA correct. BWT correct. C-table correct. The culprit was one missing byte: the terminal sentinel was only accounted for symbolically and was never physically appended to the corpus. Off by one byte. Wrong counts.&lt;/p&gt;

&lt;p&gt;Fix: append a real &lt;code&gt;0x00&lt;/code&gt;. Verify against a Python oracle. Formalize it as an invariant. Write a regression test.&lt;/p&gt;

&lt;p&gt;That shift — from “fixed a bug” to “formalized a contract” — changed how I think about correctness entirely.&lt;/p&gt;
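
&lt;p&gt;To make the sentinel invariant concrete, here is a minimal Python sketch. It is not GLYPH's code, and every function name in it is hypothetical. It assumes the corpus contains no &lt;code&gt;0x00&lt;/code&gt; byte, builds a naive suffix array, BWT and C-table over the corpus plus a physically appended sentinel, and checks backward-search counts against a brute-force oracle.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch only; not GLYPH's code. All names are hypothetical.
SENTINEL = b"\x00"  # assumption: the corpus itself contains no 0x00 byte


def build_fm(corpus):
    """Build a naive BWT and C-table over corpus plus a real sentinel byte."""
    text = corpus + SENTINEL  # the sentinel is physically part of the text
    n = len(text)
    # Naive suffix array: fine for a demo, far too slow for real corpora.
    sa = sorted(range(n), key=lambda i: text[i:])
    # BWT: the byte preceding each sorted suffix (index -1 wraps to the sentinel).
    bwt = bytes(text[i - 1] for i in sa)
    # C-table: c[b] = number of bytes in text strictly smaller than b.
    freq = [0] * 256
    for b in text:
        freq[b] += 1
    c, total = [0] * 256, 0
    for b in range(256):
        c[b] = total
        total += freq[b]
    return bwt, c


def rank(bwt, b, i):
    """Occurrences of byte b in bwt[:i] (linear scan; real indexes sample this)."""
    return bwt.count(b, 0, i)


def fm_count(bwt, c, pattern):
    """Backward search: number of occurrences of a non-empty pattern."""
    lo, hi = 0, len(bwt)
    for b in reversed(pattern):
        lo = c[b] + rank(bwt, b, lo)
        hi = c[b] + rank(bwt, b, hi)
        if lo == hi:  # empty interval: the pattern does not occur
            return 0
    return hi - lo


def oracle_count(corpus, pattern):
    """Brute-force oracle: count overlapping occurrences by direct comparison."""
    m = len(pattern)
    return sum(corpus[i:i + m] == pattern for i in range(len(corpus) - m + 1))


def check_invariant(corpus, patterns):
    """Regression check: sentinel is physically present and counts match the oracle."""
    bwt, c = build_fm(corpus)
    assert len(bwt) == len(corpus) + 1
    assert bwt.count(0) == 1
    for p in patterns:
        assert fm_count(bwt, c, p) == oracle_count(corpus, p), p
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;check_invariant(b"abracadabra", [b"abra", b"a", b"x"])&lt;/code&gt; exercises both assertions. The length check is the part that encodes the contract: the indexed text is always exactly one byte longer than the corpus.&lt;/p&gt;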




&lt;p&gt;&lt;strong&gt;Benchmark reality, honestly:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grep 1GB scan:          11.5 sec
GLYPH persistent FM:    0.0167 ms/query  ← index in RAM
GLYPH verified CLI:     ~19 ms/query     ← subprocess + integrity check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



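&lt;p&gt;Two different systems: a persistent index held in RAM, and a one-shot CLI that pays for a subprocess and an integrity check on every query. Most benchmarks show only the fast number. Both matter.&lt;/p&gt;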

&lt;p&gt;RAM cost: 9.4GB for a 1GB corpus, roughly 9.4x the input. Not hiding it. A compressed SA is next.&lt;/p&gt;




&lt;p&gt;This isn’t a vector DB killer. It’s a verification layer beneath probabilistic systems — for when you need to know if a chunk was actually in the source, not just semantically similar.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/yasha1971-coder/glyph-engine
cd glyph-engine
./examples/mini/build_mini.sh
&lt;span class="c"&gt;# count: 2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
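
&lt;p&gt;As a rough illustration of the verification-layer idea, the sketch below checks whether a retrieved chunk occurs byte-for-byte in a source corpus and reports the exact offsets. It is not GLYPH's API: the function names are hypothetical, and a plain &lt;code&gt;bytes.find&lt;/code&gt; scan stands in for an index lookup.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch only; a linear scan stands in for an index lookup.
def find_offsets(corpus, chunk):
    """Return every byte offset at which chunk occurs in corpus (overlaps included)."""
    offsets = []
    start = corpus.find(chunk)
    while start != -1:
        offsets.append(start)
        start = corpus.find(chunk, start + 1)
    return offsets


def verify_chunk(corpus, chunk):
    """A chunk counts as verified only if its exact bytes occur at least once."""
    offsets = find_offsets(corpus, chunk)
    return {"verified": bool(offsets), "count": len(offsets), "offsets": offsets}


corpus = b"error: disk full\nerror: disk full\n"
exact = verify_chunk(corpus, b"disk full")        # byte-exact match
paraphrase = verify_chunk(corpus, b"disk is full")  # similar meaning, different bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here &lt;code&gt;exact&lt;/code&gt; reports two occurrences at offsets 7 and 24, while &lt;code&gt;paraphrase&lt;/code&gt; is not verified. The same query over the same corpus always yields the same offsets, which is the property a probabilistic retriever cannot promise on its own.&lt;/p&gt;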



&lt;p&gt;Apache-2.0. Experimental. Critique welcome, especially on RAM economics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://glyph.rs" rel="noopener noreferrer"&gt;glyph.rs&lt;/a&gt; · &lt;a href="mailto:contact@glyph.rs"&gt;contact@glyph.rs&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;code&gt;#systems&lt;/code&gt; &lt;code&gt;#retrieval&lt;/code&gt; &lt;code&gt;#infrastructure&lt;/code&gt; &lt;code&gt;#cpp&lt;/code&gt; &lt;code&gt;#algorithms&lt;/code&gt;&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>computerscience</category>
      <category>database</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
