DEV Community

Wayne

Posted on • Originally published at wheynelau.dev

Making Compression a Habit with zstd

With zstd being added to Python 3.14, I've been using compressed files more often in my workflow. Here's what I've learned about making compression a habit.

Python Data Processing with Compression

Python 3.14 adds native zstd.open() support, which is a big step forward. Here's the comparison:

Before 3.14 (with the third-party zstandard package):

import zstandard as zstd
import json
import io

# Writing compressed JSONL with Zstandard
data = [
    {"id": 1, "name": "Alice", "score": 95},
    {"id": 2, "name": "Bob", "score": 87},
    {"id": 3, "name": "Charlie", "score": 92}
]

# Write
with open('data.jsonl.zst', 'wb') as f:
    cctx = zstd.ZstdCompressor(level=3)
    with cctx.stream_writer(f) as writer:
        for record in data:
            line = (json.dumps(record) + '\n').encode('utf-8')
            writer.write(line)

# Read
with open('data.jsonl.zst', 'rb') as f:
    dctx = zstd.ZstdDecompressor()
    with dctx.stream_reader(f) as reader:
        text_stream = io.TextIOWrapper(reader, encoding='utf-8')
        for line in text_stream:
            record = json.loads(line)
            print(record)

Python 3.14+ is much simpler:

from compression import zstd
import json

# Read and print first record
with zstd.open('data.jsonl.zst', 'rt') as f:
    for line in f:
        data = json.loads(line)
        print(data)
        break  # Remove break to read all lines

The API mirrors regular open() -- just use zstd.open() instead.

Key points:

  • Use 'wt' mode for writing text, 'rt' for reading
  • Typical compression ratio: 6-7x size reduction at zstd-3
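Writing works the same way. Here's a minimal round-trip sketch -- it assumes Python 3.14+ for the compression.zstd module, with a guard so it degrades quietly on older versions:

```python
import json

# compression.zstd is new in Python 3.14; skip gracefully on older versions
try:
    from compression import zstd
except ImportError:
    zstd = None

def write_jsonl_zst(path, records):
    """Write records as zstd-compressed JSONL using text mode ('wt')."""
    with zstd.open(path, 'wt', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')

def read_jsonl_zst(path):
    """Read back all records from a zstd-compressed JSONL file ('rt')."""
    with zstd.open(path, 'rt', encoding='utf-8') as f:
        return [json.loads(line) for line in f]

if zstd is not None:
    records = [{"id": 1, "name": "Alice", "score": 95}]
    write_jsonl_zst('roundtrip.jsonl.zst', records)
    assert read_jsonl_zst('roundtrip.jsonl.zst') == records
```

The file names and helper functions here are just for illustration; the point is that text mode handles the encoding and line splitting for you.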

Benchmarking Your Workload

Benchmark compression against your own workload to understand the speed and ratio trade-offs at each level.

For archival of logs or long-term storage, you can use higher compression levels of zstd. Archives like Pushshift Reddit typically use level 22. For most use cases, zstd-3 is a good default.
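A simple way to benchmark is to time each level against a sample of your real data. This sketch uses the stdlib zlib module so it runs on any Python version, but the same loop applies to zstd levels on 3.14+ (swap in zstd.compress):

```python
import json
import time
import zlib

# Synthetic JSONL sample; substitute a slice of your real workload
sample = '\n'.join(
    json.dumps({"id": i, "name": f"user{i}", "score": i % 100})
    for i in range(10_000)
).encode('utf-8')

for level in (1, 3, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(sample, level)
    elapsed = time.perf_counter() - start
    ratio = len(sample) / len(compressed)
    print(f"level {level}: {ratio:.1f}x ratio, {elapsed * 1000:.1f} ms")
```

Ratios and timings will differ between zlib and zstd, but the shape of the trade-off curve is what you're looking for.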

Working with Compressed Files

Zstd includes tools for viewing, searching, and processing compressed files without manual decompression.

Quick commands:

  • zstdcat data.json.zst -- view the contents
  • zstdless data.json.zst -- page through like less
  • zstdgrep "error" events.json.zst -- search inside compressed files
  • zstdgrep -c "timeout" events.json.zst -- count occurrences

You can also pipe to other tools: zstdcat events.json.zst | grep ERROR | jq '.timestamp'

Transferring files

Rsync

The -z flag compresses data in transit; on highly compressible files, rsync reports a speedup well above 1.0. Here's a test with about 66GB of JSONL files:

sent 50,322 bytes  received 12,451,737,167 bytes  19,290,143.28 bytes/sec
total size is 66,857,841,487  speedup is 5.37

A speedup of 5.37 means only about a fifth of the total data actually crossed the wire. That's a big win if you're network bound or concerned about egress costs.
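rsync's speedup is just the total size divided by the bytes actually transferred, which checks out against the log above:

```python
# Figures from the rsync summary line
sent = 50_322
received = 12_451_737_167
total_size = 66_857_841_487

# speedup = total size / bytes actually transferred
speedup = total_size / (sent + received)
print(f"speedup is {speedup:.2f}")  # matches rsync's reported 5.37
```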

S3 and Cloud Storage

AWS charges for outbound data transfer (egress). Compressing data before storage can significantly reduce these costs. With a 7.0x compression ratio, a $14,000 egress bill drops to roughly $2,000.
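As a back-of-envelope check (assuming a hypothetical $0.09/GB egress rate, which puts a $14,000 bill at roughly 155 TB transferred):

```python
def egress_cost(gb, compression_ratio=1.0, price_per_gb=0.09):
    """Estimate egress cost for data transferred after compression.

    price_per_gb is a hypothetical flat rate for illustration;
    check your provider's actual pricing tiers.
    """
    return gb / compression_ratio * price_per_gb

uncompressed_gb = 155_000  # ~155 TB
print(f"${egress_cost(uncompressed_gb):,.0f}")                         # ~$14,000
print(f"${egress_cost(uncompressed_gb, compression_ratio=7.0):,.0f}")  # ~$2,000
```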

Here's an upload comparison on a gigabit connection with a 4GB JSONL file:

# Compressed upload (zstd -k -c ... | s5cmd pipe)
real    0m8.139s
# Result in S3: 363.7MB

# Uncompressed upload (s5cmd cp)
real    0m57.547s
# Result in S3: 4.0GB

The same principle works for downloading: s5cmd cat s3://bucket/data.zst | zstd -d > data.jsonl. Compression takes longer than decompression, but the speedup is usually worth it.

Conclusion

I use zstdcat to read files and rarely need to edit them in an IDE, so this habit has cut my text storage by up to 80%. There's a balance between convenience, speed, and storage -- and this works for me. More optimized formats like Protobuf or Arrow exist, but most text processing still runs on JSON.

The full version with code examples and benchmarks is on my blog.
