Text Compression Still Wastes Bits on Structure

#compression #algorithms #c #opensource

We have become far too comfortable with the 8-bit byte.

In modern systems, we treat the byte as the indivisible atom of information. When we "compress" data, we usually look for repeated sequences (LZ77) or skewed symbol frequencies (Huffman, ANS). But there is a hidden layer of redundancy that these general-purpose algorithms often leave untouched: the physical overhead of the data format itself.

The "Byte Container" Tax

Think about the character '2'. In ASCII, it is stored as 0x32 (00110010).

If you are storing a long list of numeric sensor logs, every single digit carries the same high nibble, 0x3 (0011). A general-purpose compressor might notice the pattern, but it still treats the data as a stream of 8-bit containers. The essence of '2', however, is just the value 2, and storing the value 2 takes only 2 bits (10).

General-purpose compression looks for patterns between containers. I’ve been experimenting with a technique I call Literal Casting, which looks for waste inside the container.

How Literal Casting Works

The logic is to strip away the physical form dictated by standards like ASCII or fixed-width integers and store only the minimal significant bits.

1. Dimensionality Reduction

Instead of storing the character '2', we extract the core value 2 and store it using the minimum number of bits required. To prevent the bitstream from becoming overly fragmented, I use a "Floor Bandwidth" ($B_f$). For example, if $B_f = 2$, any value requiring 1 or 2 bits is normalized to 2 bits.
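The width rule can be sketched in a few lines of C. This is only an illustration of the idea, not the LAZ source: the names `min_bits` and `cast_width` are hypothetical, and I assume a value of 0 counts as needing 1 bit before the floor is applied.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum number of bits needed to represent v.
 * Assumption: 0 is treated as needing 1 bit. */
static unsigned min_bits(uint32_t v)
{
    unsigned n = 1;
    while (v >>= 1)
        n++;
    return n;
}

/* Width actually stored: the minimal width, raised to the
 * floor bandwidth B_f when the value is narrower than that. */
static unsigned cast_width(uint32_t v, unsigned floor_bits)
{
    assert(floor_bits >= 1 && floor_bits <= 32);
    unsigned n = min_bits(v);
    return n < floor_bits ? floor_bits : n;
}
```

With a floor of 2, `cast_width(1, 2)` and `cast_width(2, 2)` both give 2, while `cast_width(9, 2)` keeps its natural width of 4.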

2. Ambiguity and Type Profiles

The challenge with stripping prefixes is ambiguity. How do you distinguish a literal ASCII '2' from a raw binary 0x02 once both are reduced to the 2-bit value 10?

The solution is to use Type Profiles: predefined semantic environments for data chunks that tell the decoder which framing bits were stripped.

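Numeric/Hex ASCII: For the decimal digits '0' to '9', the decoder ORs the value with 0x30 to restore the character. The hex letters do not share that prefix ('A' is 0x41), so they need their own offset.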
Sparse 16/32: Designed for machine words with high zero-prefix redundancy.
UTF-8 Mixed: Strips fixed framing bits from multi-byte Unicode scalars.
Base64 Bridged: Reinterprets the Base64 alphabet as a 6-bit sextet stream.
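To make the role of a profile concrete, here is a minimal sketch of how a profile tag could resolve the '2' versus 0x02 ambiguity. The enum and function names are hypothetical, only decimal digits are handled, and the other profiles are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical profile tags; the real LAZ identifiers may differ. */
enum profile { PROFILE_NUMERIC_ASCII, PROFILE_RAW };

/* Encoder side: strip the container bits the profile implies. */
static uint8_t cast_byte(enum profile p, uint8_t c)
{
    if (p == PROFILE_NUMERIC_ASCII) {
        assert(c >= '0' && c <= '9');   /* decimal digits only in this sketch */
        return (uint8_t)(c & 0x0F);     /* drop the 0x3 high nibble */
    }
    return c;                           /* raw bytes pass through unchanged */
}

/* Decoder side: the profile says how to rebuild the original byte. */
static uint8_t restore_byte(enum profile p, uint8_t v)
{
    if (p == PROFILE_NUMERIC_ASCII)
        return (uint8_t)(v | 0x30);     /* put the ASCII digit prefix back */
    return v;
}
```

The same stored value 2 comes back as '2' under `PROFILE_NUMERIC_ASCII` and as 0x02 under `PROFILE_RAW`, which is the whole point of tagging the chunk.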

Technical Implementation

To make this work reliably, the bitstream is split into two independent parts for each chunk:

The Pointer Stream: Encodes the bit-width of each record using a custom prefix code.
The Data Stream: Tightly packs the actual literal values.
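A rough sketch of the two-stream layout, under loud assumptions: values are limited to 0..15, the pointer stream uses a fixed 2-bit field holding width minus 1 as a stand-in for the custom prefix code (which is not described here), the floor bandwidth is ignored, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Small MSB-first bit buffer; must be zeroed before writing. */
typedef struct {
    uint8_t buf[64];
    size_t  nbits;          /* number of bits written so far */
} bitbuf;

static void put_bits(bitbuf *b, uint32_t v, unsigned width)
{
    assert(width >= 1 && width <= 32);
    assert(b->nbits + width <= sizeof b->buf * 8);
    for (unsigned i = width; i-- > 0; ) {
        if ((v >> i) & 1u)
            b->buf[b->nbits / 8] |= (uint8_t)(0x80u >> (b->nbits % 8));
        b->nbits++;
    }
}

static uint32_t get_bits(const bitbuf *b, size_t *pos, unsigned width)
{
    assert(*pos + width <= b->nbits);
    uint32_t v = 0;
    while (width-- > 0) {
        unsigned bit = (b->buf[*pos / 8] >> (7 - *pos % 8)) & 1u;
        v = (v << 1) | bit;
        (*pos)++;
    }
    return v;
}

/* Pointer stream: width-1 in 2 bits (stand-in for the prefix code).
 * Data stream: each value packed in exactly that many bits. */
static void encode_records(const uint8_t *vals, size_t n,
                           bitbuf *ptr, bitbuf *data)
{
    memset(ptr, 0, sizeof *ptr);
    memset(data, 0, sizeof *data);
    for (size_t i = 0; i < n; i++) {
        assert(vals[i] <= 15);
        unsigned w = 1;
        while (w < 4 && (vals[i] >> w) != 0)
            w++;
        put_bits(ptr, w - 1, 2);
        put_bits(data, vals[i], w);
    }
}

static void decode_records(const bitbuf *ptr, const bitbuf *data,
                           uint8_t *out, size_t n)
{
    size_t pp = 0, dp = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned w = (unsigned)get_bits(ptr, &pp, 2) + 1;
        out[i] = (uint8_t)get_bits(data, &dp, w);
    }
}
```

For the values 2, 0, 9, 1, 7 the data stream takes 2 + 1 + 4 + 1 + 3 = 11 bits and the pointer stream 10 bits, against 40 bits for five raw bytes. A real prefix code would give frequent widths shorter codes than the fixed 2-bit field used here.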

Because this operates at the bit level, where a single flipped bit can shift every record after it, I use a dual CRC32 approach: one checksum over the compressed payload and one over the original source, so the decoder can verify that the reconstruction matches the input. If the estimated bit-cost of "casting" exceeds the raw binary size, the system automatically falls back to a raw pass-through.
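The integrity check and the fallback decision might look roughly like this. It is a sketch, not the LAZ code: the struct layout and function names are hypothetical, and the CRC is the standard bitwise CRC-32 (reflected polynomial 0xEDB88320), which I assume is the variant intended.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard CRC-32, computed bit by bit (no lookup table). */
static uint32_t crc32_ieee(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical per-chunk checksums: payload first, then source. */
typedef struct {
    uint32_t crc_payload;   /* over the compressed bytes */
    uint32_t crc_source;    /* over the original bytes   */
} chunk_checks;

/* Returns 1 only if both the payload and the decoded output match. */
static int verify_chunk(const chunk_checks *h,
                        const uint8_t *payload, size_t plen,
                        const uint8_t *decoded, size_t dlen)
{
    assert(h != NULL);
    return crc32_ieee(payload, plen) == h->crc_payload &&
           crc32_ieee(decoded, dlen) == h->crc_source;
}

enum chunk_mode { MODE_RAW, MODE_CAST };

/* Fall back to raw when the estimated cast cost exceeds the raw size. */
static enum chunk_mode choose_mode(size_t raw_bytes,
                                   size_t ptr_bits, size_t data_bits)
{
    return (ptr_bits + data_bits > raw_bytes * 8) ? MODE_RAW : MODE_CAST;
}
```

Checking the payload CRC first lets a decoder reject a damaged chunk before decoding it, while the source CRC catches logic errors in the decoder itself.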

I’ve implemented this logic in a small C project called LAZ (Literal-cast Absolute Zero). It’s a zero-dependency compressor that targets industrial and structured data, where much of the redundancy sits in the format itself and general-purpose tools often spend effort modeling it as if it were content.

Project Link: LAZ