contour

Reviving glyph-v8: From a Forgotten Prototype to STRIDE - a Field-Aware Integer Coder

GitHub “Finish-Up-A-Thon” Challenge Submission

Executive Summary

STRIDE is a field‑aware integer coder that revives the abandoned glyph‑v8 prototype and turns it into a practical, measurable, deterministic compression primitive for binary protocols.
It profiles integer fields, builds per-field models, selects a codec for each field, and is designed to beat general compressors such as zstd on integer-heavy data. Measured benchmarks are still pending.


What I Built

STRIDE — Structured Integer Decoder/Encoder.

A field‑aware integer coder for binary protocols. Not a general compressor.
A primitive that does one thing extremely well: exploit the fact that integer fields in Protobuf, MessagePack, and Thrift are not random — they have highly skewed, predictable distributions.

zstd doesn’t know field boundaries.
STRIDE does.

Built on top of the revived glyph‑v8 prototype.
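
To make "field boundaries" concrete, here is a minimal sketch of how integer fields can be pulled out of a serialized Protobuf message using only the public wire format (tag = field number shifted left by 3, OR-ed with the wire type). The function names `read_varint` and `extract_varint_fields` are hypothetical and are not taken from the STRIDE code base. The sketch handles top-level fields only and does not descend into nested messages.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def extract_varint_fields(buf):
    """Return {field_number: [values]} for the top-level varint fields."""
    fields, pos = {}, 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:        # varint: the integers we care about
            value, pos = read_varint(buf, pos)
            fields.setdefault(field_number, []).append(value)
        elif wire_type == 1:      # fixed 64-bit, skipped
            pos += 8
        elif wire_type == 2:      # length-delimited (strings, bytes, submessages), skipped
            length, pos = read_varint(buf, pos)
            pos += length
        elif wire_type == 5:      # fixed 32-bit, skipped
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return fields


# Field 1 = 150 (varint), field 2 = "testing" (string), field 3 = 1 (varint).
message = bytes.fromhex("089601" "120774657374696e67" "1801")
integer_fields = extract_varint_fields(message)
```

A byte-oriented compressor sees those 14 bytes as one stream. A field-aware coder can instead collect every value of field 1 across many messages and model that column on its own.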


Demo

• GitHub: https://github.com/yasha1971-coder/glyph-v8
• Replit demo: https://replit.com/@yasha1971/Glyph-Search

Initial profiling on a Protobuf corpus shows that 60–70% of fields are integer-typed (timestamps, IDs, counters, enums).
Full benchmark results vs zstd will be added before June 7.


STRIDE Architecture (Why It Works)

┌──────────────────────────────────────────────┐
│ STRIDE │
│ Structured Integer Decoder / Encoder │
└──────────────────────────────────────────────┘

    ┌──────────────────────────────┐
    │ 1. Profiling Layer           │
    │------------------------------│
    │ • Parse corpus               │
    │ • Detect integer fields      │
    │ • Build per-field histograms │
    │ • Estimate entropy           │
    └──────────────────────────────┘
                 │
                 ▼
    ┌──────────────────────────────┐
    │ 2. Model Builder             │
    │------------------------------│
    │ • Choose best codec per field│
    │   (Delta, Rice, Elias, Dict) │
    │ • Produce compact model.json │
    └──────────────────────────────┘
                 │
                 ▼
    ┌──────────────────────────────┐
    │ 3. Encoder                   │
    │------------------------------│
    │ • Apply field-aware coding   │
    │ • Attach model header        │
    │ • Output compressed stream   │
    └──────────────────────────────┘
                 │
                 ▼
    ┌──────────────────────────────┐
    │ 4. Decoder                   │
    │------------------------------│
    │ • Load model                 │
    │ • Decode deterministically   │
    │ • Reconstruct original data  │
    └──────────────────────────────┘
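
As an illustration of the profiling layer, the sketch below builds a per-field histogram and estimates the empirical Shannon entropy in bits per value. It assumes records have already been decoded into plain dictionaries. The names `profile` and `field_entropy_bits` are hypothetical and may not match the actual STRIDE profiler.

```python
import math
from collections import Counter


def field_entropy_bits(values):
    """Empirical Shannon entropy of one field, in bits per value."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def profile(records):
    """Group integer values by field name, then summarize each column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, int) and not isinstance(value, bool):
                columns.setdefault(name, []).append(value)
    return {
        name: {
            "distinct": len(set(vals)),
            "entropy_bits": field_entropy_bits(vals),
        }
        for name, vals in columns.items()
    }


records = [
    {"status": 200, "user_id": 1},
    {"status": 200, "user_id": 2},
    {"status": 200, "user_id": 3},
    {"status": 404, "user_id": 4},
]
report = profile(records)
```

In this toy corpus the skewed `status` field comes out at roughly 0.81 bits per value, while the uniformly distributed `user_id` field needs 2 bits. That gap is the kind of signal a model builder can use when choosing a codec.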

Before / After — The Revival Story

┌──────────────────────────────┐ ┌────────────────────────────────┐
│ BEFORE                       │ │ AFTER                          │
├──────────────────────────────┤ ├────────────────────────────────┤
│ • glyph-v8 abandoned         │ │ • STRIDE implemented           │
│ • no docs, no roadmap        │ │ • profiling + encoding layers  │
│ • no demo                    │ │ • Replit demo + GitHub release │
│ • no architecture            │ │ • full architecture + context  │
│ • code sitting on OVH        │ │ • revived project with purpose │
└──────────────────────────────┘ └────────────────────────────────┘


Why STRIDE Matters

Binary protocols like Protobuf, Thrift, and MessagePack move billions of messages per day.
Most of these messages contain highly structured integer fields:

• timestamps
• counters
• IDs
• status codes
• enums

General compressors see them as an undifferentiated byte stream, with no notion of which bytes belong to which field.
STRIDE treats them as predictable distributions.

This is where the compression gains come from.
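
A small worked example of the timestamp case, assuming a plain delta + zigzag + varint scheme. This is an illustrative sketch, not necessarily the codec STRIDE applies. Four consecutive Unix timestamps are encoded once as raw varints and once as deltas from the previous value.

```python
def zigzag(n):
    """Map signed ints to unsigned so small magnitudes stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z):
    """Inverse of zigzag."""
    return (z >> 1) if z & 1 == 0 else -((z + 1) >> 1)


def varint(n):
    """Base-128 varint for a non-negative integer (7 payload bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_plain(values):
    return b"".join(varint(v) for v in values)


def encode_delta(values):
    out, prev = bytearray(), 0
    for v in values:
        out += varint(zigzag(v - prev))   # store the difference, not the value
        prev = v
    return bytes(out)


def decode_delta(data):
    values, prev, cur, shift = [], 0, 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += unzigzag(cur)         # rebuild the running value
            values.append(prev)
            cur, shift = 0, 0
    return values


timestamps = [1_700_000_000, 1_700_000_001, 1_700_000_003, 1_700_000_004]
plain = encode_plain(timestamps)
delta = encode_delta(timestamps)
```

Here `plain` is 20 bytes (5 bytes per timestamp) and `delta` is 8 bytes: the first value still costs 5 bytes, but each later delta fits in a single byte. The decode is exact, so nothing is lost.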


STRIDE vs zstd — Conceptual Comparison

┌──────────────────────────────┬──────────────────────────────┬──────────────────────────────┐
│ Feature                      │ zstd                         │ STRIDE                       │
├──────────────────────────────┼──────────────────────────────┼──────────────────────────────┤
│ Field awareness              │ No                           │ Yes                          │
│ Integer distribution model   │ No                           │ Per-field adaptive           │
│ Timestamp delta modeling     │ No                           │ Yes                          │
│ Status code compression      │ No                           │ Dictionary / RLE             │
│ Schema-aware                 │ No                           │ Yes                          │
│ Deterministic decode         │ Yes                          │ Yes                          │
│ Expected compression ratio   │ 3–4×                         │ 6–8× (integer-heavy data)    │
└──────────────────────────────┴──────────────────────────────┴──────────────────────────────┘


STRIDE Pipeline

  1. Load Protobuf corpus
  2. Extract integer fields
  3. Build histograms
  4. Compute entropy
  5. Select codec per field
  6. Generate model.json
  7. Encode data
  8. Decode deterministically
  9. Benchmark vs zstd
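
Steps 5 and 6 could be sketched roughly as follows: estimate the encoded size of each field under a few candidate codecs, keep the cheapest, and serialize the choice as JSON. The cost functions, the candidate set (raw varint, delta, dictionary) and the model layout are assumptions made for illustration. The real `model.json` format is not described in this post, and Rice and Elias coding are left out of the sketch for brevity.

```python
import json
import math


def varint_bits(n):
    """Bits a base-128 varint would use for a non-negative integer."""
    return 8 * max(1, math.ceil(n.bit_length() / 7))


def zigzag(n):
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def cost_varint(values):
    return sum(varint_bits(v) for v in values)


def cost_delta(values):
    prev, bits = 0, 0
    for v in values:
        bits += varint_bits(zigzag(v - prev))
        prev = v
    return bits


def cost_dict(values):
    distinct = sorted(set(values))
    index_bits = max(1, math.ceil(math.log2(len(distinct))))
    table_bits = sum(varint_bits(v) for v in distinct)   # dictionary stored once
    return table_bits + index_bits * len(values)


def select_codec(values):
    """Return (cheapest codec name, all estimated costs in bits)."""
    costs = {
        "varint": cost_varint(values),
        "delta": cost_delta(values),
        "dict": cost_dict(values),
    }
    return min(costs, key=costs.get), costs


def build_model(columns):
    return {name: {"codec": select_codec(vals)[0]} for name, vals in columns.items()}


columns = {
    "timestamp": [1_700_000_000 + i for i in range(8)],
    "status": [200, 200, 404, 200, 200, 200, 500, 200],
}
model = build_model(columns)
model_json = json.dumps(model, sort_keys=True, indent=2)
```

On this toy input the monotone `timestamp` column is cheapest under delta coding and the three-valued `status` column is cheapest under a dictionary. Because the decoder only reads the stored choice, decoding stays deterministic regardless of how the choice was made.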

Technical Highlights

• One‑pass profiling of integer fields
• Entropy estimation per field
• Adaptive codec selection (Delta, Rice, Elias, Dictionary)
• Compact model header
• Deterministic decode (no ML, no heuristics)
• Schema‑aware compression for Protobuf
• Benchmark pipeline with SHA256 verification
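
To illustrate two of the items above, here is a minimal Rice coder with a SHA256 round-trip check. It is a sketch under simplifying assumptions: bits are held in a Python string for readability rather than packed into bytes, the parameter `k` is fixed by hand, and the function names are hypothetical rather than STRIDE's own.

```python
import hashlib


def rice_encode(values, k):
    """Rice code: quotient (n >> k) in unary, remainder in k binary bits."""
    bits = []
    for n in values:
        q, r = n >> k, n & ((1 << k) - 1)
        bits.append("1" * q + "0")            # unary quotient, zero-terminated
        if k:
            bits.append(format(r, f"0{k}b"))  # fixed-width remainder
    return "".join(bits)


def rice_decode(bits, k, count):
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1                              # skip the terminating zero
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append((q << k) | r)
    return values


def digest(values):
    """SHA256 over a canonical text form, used to verify a lossless round trip."""
    return hashlib.sha256(",".join(map(str, values)).encode()).hexdigest()


original = [3, 0, 7, 12, 5, 1]
encoded = rice_encode(original, k=2)
decoded = rice_decode(encoded, k=2, count=len(original))
verified = digest(decoded) == digest(original)
```

With `k=2` these six small values take 23 bits in total, and comparing the two digests confirms that the decoded sequence matches the input exactly.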


My Experience with GitHub Copilot

Copilot Contributions

✓ Reconstructed project context

✓ Designed STRIDE architecture

✓ Implemented integer field profiler

✓ Structured benchmark pipeline

✓ Helped write documentation

✓ Assisted in preparing the submission

Copilot didn’t just autocomplete code — it helped rebuild a forgotten project into a structured system.


What’s Next

STRIDE is the third primitive in a family:

• ACEAPEX — parallel LZ77 decode, 9,903 MB/s, merged into lzbench
• GLYPH — deterministic byte‑exact retrieval, 6,888× faster than grep
• STRIDE — field‑aware integer coding for binary protocols

Roadmap:

• Add full benchmark suite (STRIDE vs zstd vs LZ4)
• Add streaming encoder
• Add MessagePack and Thrift adapters
• Add visualization of field distributions
• Publish STRIDE as a standalone Python package


Conclusion

This challenge gave me the push to revive glyph‑v8 and transform it into STRIDE — a practical, measurable, deterministic compression primitive for structured integer data.

Thanks to GitHub, MLH, and Copilot for making this revival possible.

