We have become far too comfortable with the 8-bit byte.
In modern systems, we treat the byte as the indivisible atom of information. When we "compress" data, we usually look for repeating sequences (LZ77) or statistical probabilities (Huffman/ANS). But there is a hidden layer of redundancy that these general-purpose algorithms often leave untouched: The physical overhead of the data format itself.
The "Byte Container" Tax
Think about the character '2'. In ASCII, it is stored as 0x32 (00110010).
If you are storing a long list of numeric sensor logs, every single digit carries that 0x3 prefix. A general-purpose compressor might notice the pattern, but it still treats the data as a stream of 8-bit containers. However, the essence of '2' is just the value 2. To store the value 2, you only need 2 bits (10).
General-purpose compression looks for patterns between containers. I’ve been experimenting with a technique I call Literal Casting, which looks for waste inside the container.
How Literal Casting Works
The logic is to strip away the physical form dictated by standards like ASCII or fixed-width integers and store only the minimal significant bits.
1. Dimensionality Reduction
Instead of storing the character '2', we extract the core value 2 and store it using the minimum number of bits required. To prevent the bitstream from becoming overly fragmented, I use a "Floor Bandwidth" ($B_f$). For example, if $B_f = 2$, any value requiring 1 or 2 bits is normalized to 2 bits.
2. Ambiguity and Type Profiles
The challenge with stripping prefixes is ambiguity. How do you distinguish a literal ASCII '2' from a raw binary 0x02 once both are reduced to the 2-bit value 10?
The solution is to use Type Profiles—predefined semantic environments for data chunks:
- Numeric/Hex ASCII: The decoder knows to OR the value with
0x30to restore the digit. - Sparse 16/32: Designed for machine words with high zero-prefix redundancy.
- UTF-8 Mixed: Strips fixed framing bits from multi-byte Unicode scalars.
- Base64 Bridged: Reinterprets the Base64 alphabet as a 6-bit sextet stream.
Technical Implementation
To make this work reliably, the bitstream is split into two independent parts for each chunk:
- The Pointer Stream: Encodes the bit-width of each record using a custom prefix code.
- The Data Stream: Tightly packs the actual literal values.
Because this operates at the bit level, I use a dual CRC32 approach: one for the compressed payload and one for the original source to ensure absolute parity after reconstruction. If the estimated bit-cost of "casting" exceeds the raw binary size, the system automatically falls back to a raw pass-through.
I’ve implemented this logic in a small C project called LAZ (Literal-cast Absolute Zero). It’s a zero-dependency compressor that targets these specific industrial and structured data redundancies where general-purpose tools often over-calculate.
Project Link: LAZ
Top comments (0)