Unorthodox journey towards data compression

#compression #vectordatabase #algorithms #datacompression

Hello and welcome, fellow explorer!

I’m writing from the heart of India, where my background is rooted in the pharmaceutical industry and the startup scene. Although I don’t have a formal computer science education, I’ve recently found myself captivated by the world of data compression—a field where math, logic, and creativity collide in fascinating ways.

How It All Began

It started with a simple curiosity (that is a story for another day): what happens when you repeatedly concatenate the same digit in a non-power-of-two base, such as decimal, and then convert the result to binary? For example, if you take the digit "6" and write it six times, you get "666666", which I like to shorthand as "6'6." Converted to binary, this becomes 10100010110000101010, a string exactly 20 bits long!
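To make the "6'6" example concrete, here is a minimal Python sketch that builds the repeated-digit number and prints its binary form. The helper names `repdigit` and `to_binary` are illustrative only; they are not taken from any library or from the prototype itself.

```python
def repdigit(digit: int, count: int) -> int:
    """Return the decimal number made of `digit` written `count` times."""
    if not 1 <= digit <= 9 or count < 1:
        raise ValueError("digit must be 1-9 and count must be at least 1")
    return int(str(digit) * count)


def to_binary(value: int) -> str:
    """Binary expansion of a non-negative integer, without the '0b' prefix."""
    return format(value, "b")


bits = to_binary(repdigit(6, 6))   # "6'6" is 666666 in decimal
print(bits, len(bits))             # 10100010110000101010 20
```

Changing the arguments, for example `repdigit(7, 5)`, gives other repeated-digit numbers to compare.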

But the fun really begins when you start slicing this binary number into smaller sequences, both forwards and backwards. Each slice can be referenced using a notation like "6'6:slice number, position." While this produces many duplicate sequences, it also opens up a playground for exploring how information is stored and compressed.
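One possible reading of the slicing step, sketched in Python: take every overlapping window of a fixed width, once over the bit string as written and once over its reversal, then count how often each window repeats. The window width, the overlapping scheme, and the function name `slices` are assumptions for illustration; the "6'6:slice number, position" notation could be mapped onto the list index in other ways.

```python
from collections import Counter


def slices(bits: str, width: int, reverse: bool = False) -> list:
    """All overlapping windows of `width` bits, read forwards or backwards."""
    source = bits[::-1] if reverse else bits
    return [source[i:i + width] for i in range(len(source) - width + 1)]


bits = format(666666, "b")                 # "6'6" in binary, 20 bits
forward = slices(bits, 4)                  # windows read left to right
backward = slices(bits, 4, reverse=True)   # windows over the reversed string

# Count repeats across both directions; anything seen more than once is a duplicate.
counts = Counter(forward + backward)
duplicates = {chunk: n for chunk, n in counts.items() if n > 1}
print(len(forward), len(backward), len(counts), duplicates)
```

With 34 windows and only 16 possible 4-bit patterns, duplicates are guaranteed here, which matches the observation above that slicing produces many repeated sequences.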

The Challenge: Compressing the Unknown

Of course, real-world data isn’t always as neat as repeating digits. When you don’t know the structure of your input, things get tricky. That’s where the idea of a dynamic radix matrix comes in—a flexible system that adapts to the unique patterns and entropy of your data.

By experimenting with bit-level operations (like masking, XOR, OR, AND), arithmetic tweaks, and even Gray coding, I’m trying to "normalize" the entropy within this matrix. The goal? To find structures that make the data easier to compress, especially before feeding it into traditional compressors like LZ or table-based methods.
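As a rough sketch of two of the bit-level transforms mentioned above, the following Python snippet implements binary-reflected Gray coding and a fixed XOR mask. Both are reversible, so no information is lost; whether they actually make a given input easier to compress is something to measure, not assume. The function names and the mask value `0b1010` are hypothetical choices for this example.

```python
def to_gray(value: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)


def from_gray(gray: int) -> int:
    """Invert to_gray by XOR-ing together all right shifts of the code."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def xor_mask(value: int, mask: int) -> int:
    """XOR with a fixed mask; applying the same mask twice restores the input."""
    return value ^ mask


n = 666666
g = to_gray(n)
assert from_gray(g) == n                              # Gray coding round-trips
assert xor_mask(xor_mask(n, 0b1010), 0b1010) == n     # masking round-trips
print(format(n, "b"), format(g, "b"))
```

Because each transform has an exact inverse, it can sit in front of a conventional compressor as a preprocessing stage and be undone after decompression.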

My Two-Step Prototype

As a beginner (with most of my coding done with AI assistance), here’s the approach I’m taking:

Structuring Data: I convert character sequences into decimal and binary forms, then study how their entropy changes with different radix choices.
Bit-Chunk Analysis: I slice the binary data into chunks, looking for efficient structures and patterns, before passing them to a second-stage compressor.
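The first step could be sketched as follows, assuming the text is turned into one large integer through its UTF-8 bytes and then rewritten as digits in a chosen radix. The order-0 Shannon entropy of those digits gives a rough measure of how evenly the symbols are used. All function names and the sample string are illustrative assumptions, not the prototype's actual code.

```python
import math
from collections import Counter


def text_to_int(text: str) -> int:
    """Interpret the UTF-8 bytes of `text` as one big-endian integer."""
    return int.from_bytes(text.encode("utf-8"), "big")


def digits_in_radix(value: int, radix: int) -> list:
    """Digits of a non-negative integer in the given radix, most significant first."""
    if radix < 2:
        raise ValueError("radix must be at least 2")
    if value == 0:
        return [0]
    digits = []
    while value:
        value, digit = divmod(value, radix)
        digits.append(digit)
    return digits[::-1]


def shannon_entropy(symbols) -> float:
    """Order-0 Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


value = text_to_int("hello compression")
for radix in (2, 3, 7, 10, 16):
    digits = digits_in_radix(value, radix)
    h = shannon_entropy(digits)
    # len(digits) * h estimates the total bits an ideal order-0 coder would need.
    print(f"radix={radix:2d}  digits={len(digits):3d}  bits/digit={h:.3f}  total={len(digits) * h:.1f}")
```

Per-digit entropy tends to rise with the radix while the digit count falls, so the product of the two is the fairer figure to compare across radix choices.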

I’m using basic Python scripts and plotting tools to visualize entropy across different chunk sizes and patterns. Even at this early stage, the insights have been eye-opening!
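A script of that kind might look like the sketch below, which cuts a bit string into non-overlapping chunks of several widths and reports the entropy per chunk and per bit. The sample text, the chosen widths, and the name `chunk_entropy` are assumptions made for this example.

```python
import math
from collections import Counter


def chunk_entropy(bits: str, width: int) -> float:
    """Shannon entropy in bits per chunk, over non-overlapping chunks of `width` bits.

    Trailing bits that do not fill a whole chunk are ignored.
    """
    chunks = [bits[i:i + width] for i in range(0, len(bits) - width + 1, width)]
    counts = Counter(chunks)
    total = len(chunks)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


sample = "the quick brown fox jumps over the lazy dog " * 20
bits = "".join(format(byte, "08b") for byte in sample.encode("utf-8"))

for width in (1, 2, 4, 8, 16):
    h = chunk_entropy(bits, width)
    print(f"width={width:2d}  entropy/chunk={h:.3f}  entropy/bit={h / width:.3f}")
```

The resulting (width, entropy per bit) pairs can be handed to any plotting tool to see which chunk sizes expose the most structure.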

As for lessons learned and implementation: along the way I have found some interesting and beautiful mathematical structures and symmetries across different iterations. I will share my findings and observations in the next part, once I'm confident they hold up to rigorous checking.
