
Konstantinas Mamonas

What’s Inside gzip, zstd, and Other Lossless Compressors

Compression shows up everywhere - logs, Kafka, Parquet, file systems, APIs - but most engineers use it without thinking about what's actually happening. You call .compress(), and fewer bytes come out. But what’s under the hood?

This post is a follow-up to my previous posts Compression Algorithms you probably inherited and Which Compression Saves the Most Storage $?. This one focuses on the techniques used in real-world lossless compressors. If you’re choosing a format or just want to understand your tools better, this will help.


What Compression Actually Does

Lossless compression shrinks data without sacrificing any of the original information. It achieves this through two primary steps:

  1. Pattern detection - identifying and leveraging recurring structures within the data.
  2. Encoding - representing these structures using fewer bits than the original representation.

While different compressors employ various combinations of these techniques, the fundamentals remain largely consistent. You can think of it like finding abbreviations for frequently used words to make a text shorter without losing any meaning.


Core Techniques

Run-Length Encoding (RLE)

When a sequence of identical data values occurs consecutively, RLE doesn't store each repetition. Instead, it stores the value once, along with the number of times it repeats.

Example:
AAAAABBBB becomes 5A4B

This technique shines when dealing with data containing long stretches of the same value. Imagine a simple black-and-white image where many consecutive pixels are the same color; RLE would be very effective there. It's less useful for highly variable or random data where repetitions are rare. In such cases, there are few "runs" to encode.
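
Here's a minimal sketch of the idea in Python (a real RLE codec works on bytes and has to handle digits appearing in the data, but the core loop is the same):

```python
def rle_encode(data: str) -> str:
    """Run-length encode a string: 'AAAAABBBB' -> '5A4B'."""
    if not data:
        return ""
    out = []
    run_char, run_len = data[0], 1
    for ch in data[1:]:
        if ch == run_char:
            run_len += 1
        else:
            out.append(f"{run_len}{run_char}")
            run_char, run_len = ch, 1
    out.append(f"{run_len}{run_char}")  # flush the final run
    return "".join(out)

print(rle_encode("AAAAABBBB"))  # 5A4B
```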

Dictionary Compression (LZ77, LZW)

Instead of repeatedly storing identical sequences of characters or bytes, dictionary compression methods store a single instance of the sequence and then refer back to it whenever it reappears.

  • LZ77 maintains a sliding window of recently seen data. When a repeated sequence is found, it's replaced by a pointer indicating the sequence's position and length within the window. Think of it like saying, "the next 5 characters are the same as the characters that appeared 10 positions back."
  • LZW (Lempel-Ziv-Welch), building upon LZ78, constructs a dictionary of frequently occurring patterns as it processes the data. When a pattern is encountered, it's replaced by its index in the dictionary.
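
To make the pointer idea concrete, here's a toy LZ77-style matcher in Python. It's a naive sketch: real implementations use hash chains or suffix structures instead of this O(n²) search, and they pack the tokens into bits rather than tuples.

```python
def lz77_encode(data: bytes, window: int = 4096, min_match: int = 3):
    """Toy LZ77 encoder: emit a literal byte, or a (distance, length)
    pointer back into the sliding window for repeated sequences."""
    i, tokens = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        start = max(0, i - window)
        for j in range(start, i):  # naive longest-match search
            k = 0
            # Comparing byte by byte mirrors how the decoder copies,
            # so overlapping matches (runs) work too.
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_match:
            tokens.append(("match", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

# Three literals, then "copy 9 bytes from 3 positions back".
print(lz77_encode(b"abcabcabcabc"))
```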

These techniques form the bedrock of many popular compression algorithms:

  • DEFLATE (used in gzip) combines LZ77 to identify repeated sequences and then uses Huffman coding to efficiently represent the resulting tokens (both literal characters and the length/distance pairs from LZ77).
  • zstd and Brotli both utilize LZ-style pattern matching as an initial step to reduce redundancy in the data.

Entropy Coding

Once the data has been transformed into a more predictable sequence (often the output of pattern matching), entropy coding further reduces its size by assigning shorter bit codes to more frequent symbols. This stage doesn't care about the meaning of the data, only the frequency of the symbols.

  • Huffman Coding: Assigns variable-length bit codes to symbols based on their frequency. More frequent symbols get shorter codes. While fast to decode, it's not perfectly optimal and can sometimes waste up to 1 bit per symbol.
  • Arithmetic Coding: Achieves near-optimal compression by representing the entire input sequence as a single fractional number within the range [0, 1). The length of the fraction's binary representation determines the compressed size. It's generally slower than Huffman coding due to the complex calculations involved in manipulating these ranges.
  • ANS (Asymmetric Numeral Systems): A more recent family of entropy coding methods that offer compression ratios close to arithmetic coding but with decoding speeds similar to Huffman coding. It achieves this efficiency by encoding symbols into a single large number in a more streamlined way. zstd uses an ANS variant called FSE (Finite State Entropy), which employs precomputed tables for fast processing.

The more effectively the data is preprocessed to reveal patterns, the more efficient this final entropy coding stage becomes. For example, if LZ77 has replaced many long repetitions with short pointers, the distribution of these pointers and remaining literal characters will be more skewed towards certain values, allowing Huffman or ANS to assign shorter codes to them.
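
Here's a minimal sketch of Huffman's core idea: repeatedly merge the two lowest-frequency nodes with a heap, so frequent symbols end up closer to the root and get shorter codes. A real encoder would also assign the actual bit patterns; this just reports code lengths.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text: str) -> dict:
    """Build a Huffman tree bottom-up and return each symbol's
    code length in bits: more frequent symbols get shorter codes."""
    freq = Counter(text)
    # Heap entries: (weight, tiebreak, {symbol: depth_so_far})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)  # two rarest subtrees...
        w2, _, b = heapq.heappop(heap)
        # ...merge them, pushing every contained symbol one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# 'a' (most frequent) gets a 1-bit code; 'b' and 'c' get 2 bits.
print(huffman_code_lengths("aaaaaaaabbbc"))
```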

Preprocessing Transforms

Sometimes, the inherent structure in data isn't immediately obvious to the compression encoder. Preprocessing transforms reorganize or reformat the data to make these redundancies more apparent.

  • Burrows–Wheeler Transform (BWT): Rearranges the bytes in a block of data to group identical or similar symbols together. BWT doesn't compress the data on its own, but by creating long runs of identical characters and increasing the frequency of certain symbols, it significantly improves the effectiveness of subsequent RLE and Huffman coding.
  • Move-to-Front (MTF): Maintains a list of recently encountered symbols. When a symbol appears, it's moved to the front of the list, and its position in the list (an integer) is output. This transform helps entropy coding because frequently occurring symbols tend to get small position indices, and small integers appear more often and have simpler probability distributions, making them easier to compress.

These transforms are often used in pipelines like:

raw input -> BWT -> MTF -> RLE -> Huffman

You'll find this specific combination in the bzip2 compressor.
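
A naive sketch of the first two stages of that pipeline (real BWT implementations use suffix arrays rather than materializing every rotation, but the effect is the same):

```python
def bwt(s: str) -> str:
    """Burrows-Wheeler Transform (naive): sort all rotations of the
    string and take the last column. '$' marks the end so the
    transform stays reversible."""
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def mtf(s: str) -> list:
    """Move-to-Front: emit each symbol's position in a recency list."""
    alphabet = sorted(set(s))
    out = []
    for ch in s:
        idx = alphabet.index(ch)
        out.append(idx)
        alphabet.insert(0, alphabet.pop(idx))  # move symbol to the front
    return out

transformed = bwt("banana")   # 'annb$aa' -- like letters clustered together
print(transformed)
print(mtf(transformed))       # repeats become 0s, easy for RLE and Huffman
```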

Chaining Techniques

Real-world compressors don't rely on a single technique. Instead, they strategically combine multiple stages. One stage restructures the data to expose patterns, and the next stage then efficiently encodes those patterns to achieve a smaller size.

| Compressor | Pattern matching | Transforms | Entropy coding |
| --- | --- | --- | --- |
| gzip | LZ77 | - | Huffman |
| zstd | LZ77-style | Optional (e.g., trained dictionaries) | FSE (ANS variant) + Huffman |
| Brotli | LZ77-style | MTF, context modeling | Huffman (prefix codes) |
| bzip2 | - | BWT -> MTF -> RLE | Huffman |

  • gzip is relatively straightforward: it finds repeated substrings using LZ77 and then encodes the resulting stream of literals and (length, distance) pairs with Huffman coding.
  • zstd employs fast LZ-style matching and FSE for highly efficient entropy coding, often incorporating optional preprocessing steps to further enhance compression.
  • Brotli incorporates transforms like Move-to-Front and sophisticated static/dynamic context models to predict the next symbol, enabling its prefix (Huffman) codes to achieve higher compression ratios.
  • bzip2 uniquely skips dictionary compression and instead relies on heavy preprocessing (BWT, MTF, RLE) to create highly compressible data for the final Huffman encoding stage.
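
You can compare some of these pipelines directly from Python's standard library (zstd typically needs the third-party zstandard package, so this sketch sticks to gzip, bzip2, and lzma, run on deliberately redundant input):

```python
import bz2
import gzip
import lzma

# Deliberately redundant sample: repeated log-like lines.
data = b"2024-01-01 INFO request handled in 12ms\n" * 10_000

for name, compress in [("gzip", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("lzma", lzma.compress)]:
    print(f"{name:>5}: {len(data):,} -> {len(compress(data)):,} bytes")
```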

A Note on Entropy

In information theory, entropy represents the theoretical minimum number of bits required to represent a unit of data, on average. High-entropy data is characterized by its unpredictability. If there are no discernible patterns, there's little or nothing for a compressor to exploit.

Random data, such as encrypted information or already highly compressed media, will not compress well and might even increase in size slightly due to the overhead of the compression format's metadata.
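
A quick way to see this yourself, sketched with the stdlib: compare the byte-level Shannon entropy and zlib's output size on redundant text versus random bytes.

```python
import math
import os
import zlib
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Average bits per byte: H = -sum(p * log2(p)) over byte frequencies."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

text = b"to be or not to be " * 500
noise = os.urandom(len(text))  # stands in for encrypted/compressed data

for label, data in [("text", text), ("random", noise)]:
    ratio = len(zlib.compress(data)) / len(data)
    print(f"{label}: {shannon_entropy(data):.2f} bits/byte, "
          f"compressed to {ratio:.0%} of original size")
```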


Practical Rules of Thumb

  • Sorted data often compresses better
    Try sorting logs or other datasets before compression, as this tends to create longer sequences of identical or similar values, improving pattern matching (see the sketch after this list).

  • Transforms prep the data
    Techniques like BWT, MTF, and delta encoding don't directly reduce the size of the data. Instead, they reorganize it in a way that makes it more amenable to the subsequent compression stages.

  • Compression is workload-dependent
    Text, logs, and telemetry data typically exhibit significant redundancy and compress well. In contrast, encrypted data lacks patterns, and data with very high cardinality (many unique values) offers fewer opportunities for compression.
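
As a quick illustration of the first rule, here's a minimal sketch comparing zlib's output size for the same values unsorted versus sorted:

```python
import random
import zlib

values = [random.randrange(1000) for _ in range(50_000)]
unsorted_bytes = b"".join(v.to_bytes(2, "big") for v in values)
sorted_bytes = b"".join(v.to_bytes(2, "big") for v in sorted(values))

# Sorting groups identical values into long runs the LZ stage can exploit.
print("unsorted:", len(zlib.compress(unsorted_bytes)), "bytes")
print("sorted:  ", len(zlib.compress(sorted_bytes)), "bytes")
```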


Why It Matters

Compression's impact extends far beyond saving disk space. It influences many aspects of your data pipelines, including Kafka throughput, Parquet read latency, network transfer times, and CPU utilization. Understanding how compressors work helps you make informed decisions about which algorithms and settings to use, leading to better performance and resource efficiency.


Thank you for reading.
