DEV Community: Israel Mendes

What is the best data compression for data lakes? (with benchmark!)

Israel Mendes — Mon, 22 Jun 2026 10:27:30 +0000

Every time that I'm in a project starting a data lake or storing some information in a blob storage like S3, I start to think about a few basic things. Such as:

What would be the format of the files
The partition that is required to organize/improve performance when reading it
What are the access patterns for the data. What frequency and main ways to consume it
Naming of the files, specific patterns to follow
Versioning of the files in the storage

And much more. Data engineers can spend hours debating those topics but one thing that I see less or not even being discussed deeply online is about data compression itself. Recently I was reading Choosing Between gzip, Brotli and zStandard compression from Paul Calvano and I decided to do a similar work he did but thinking on big data patterns instead of focusing on web consumption.

To be fair, I'm aware there is plenty of benchmarks for data compression in the market (for example, with Squash), but I want to be able to consider the following details:

Control or give more details about the compute/memory (which may be not clear in specific benchmarks)
To use Spark in the process, see details of the codec and check for more advance features (like splitting large parquet files)
Evaluate different compression algoritims and levels based on a data lake purpose

Before diving into the benchmark itself, I want to spend briefly some time and introduce the data compression topics, allowing to explain the process now will make it easier to understand how to compare each algorithm/codec later.

Let's dive into it!

Part 1: Intro do data compression

What is data compression

In a nutshell:

It is the process to reduce the size of the files by an algorithm. It will identify redundancies or patterns, which can help "compact" information more efficiently.

Each compression algorithm can be different (the way that details can vary, so the programming language and methods used), but the main idea is that they identify some redundancies of the data, to reduce that part and represent it in the least volume of data/bits as possible.

Each compression accounts for the following elements:

Algorithm to be used: The main method used to compress/decompress the data. It can support different types (lossless, lossy), levels, programming languages and so forth.
Compression Type: There are, basically, two ways to make the file smaller.
1. Lossless: Rewrite the data to reduce size, then rebuild exactly as it was for consumption. Very common for text/code/executables.
2. Lossy: Rewrite the data to reduce the size, but since humans will consume it, the quality can be lower since (usually) they don't detect small differences and changes. Very common for content like photos/audio/video.
Compression Level: Some algorithms has a way to determine on how "much" you want to compress, and thus reduce size (but there is always a catch/tradeoff).
1. For lossless, this represents how much the encoder searches until it find an identical at every level. The main drawback here is CPU, where you will have more processing to conduct that activity (compressing and decompressing, which can cause the speed for each stage to vary).
2. For lossy, it means how much detail you want to keep or how much quality will eliminate because of the file size.
Splitability: Depending of the file type/format and algorithm, the compression can split the file without decompressing it first. This is useful for reading/processing large files.

When comparing algorithms, it is important to keep an eye in the following metrics:

Compression Ratio: A comparison between the original file size and the new size after a compression. For example, if a file has 100 MB and, after a compression, it went to 50 MB, the ratio would be 100 / 50 = 2x.
Compression Speed/Time: How fast is the data being written after the compression.
Decompression Speed/Time: How fast is the data being read back to the computer.

How it works

The simple situation to explain how data compression works is basically to reduce redundancy. Let's say you have a file that has the following text:

A A A A A A B B B B B B B B C C C C A A A A A A A A

As you can see, there is a lot of repetition there, right? Several As, Bs, Cs and then As. What if we could group them, to save space? For example, you can put like this:

6xA 8xB 4xC 8xA

And then, once you decompress it, you can put exactly as the original content and no loss happened (lossless) plus you saved some space.

Now, other algorithms have more complex processes to save even more space. For example, Huffman coding builds a custom alphabet where the most frequent symbol get the shortest bit-string, rare ones get longer, so forth. This way, you can save even more space! If you use, for example in a word like:

ABRACADABRA:
1. Each word is 1 byte and lead to 8 bits each, so we have 88 bits in total
2. But if you compress it in that method, you will have 23 bits in total
  1. A appears 5 times and has 1 bit each = Total of 5
  2. B appears 2 times and has 3 bits each = Total of 6
  3. R appears 2 times and has 3 bits each = Total of 6
  4. C appears 1 times and has 3 bits each = Total of 3
  5. D appears 1 times and has 3 bits each = Total of 3
3. Compression ratio: 88 / 23 = 3.82x!

Here is the breakdown of the method, in a diagram:

Why to (or not to) do it

Ideally, you want to use compression in a few scenarios, and that can bring in the following benefits:

Save space: That is the most obvious benefit. When you compress, it is a process that you can basically "compact" so you can fit more with less. When you have, for example, 1 GB of a CSV file, after compression, it can go to half of that or even more (depending on the method, how the file is structured, and so forth).
Save on cost: Since cloud providers, which are mostly the focus of this blog, charge you per use (how much you store, how much you write/read, and so forth). Having "less volume" of the same data can be a significant way to reduce your expenses in that front.
1. This can also reference about the transfer of the data. For example, data transfer and egress costs or compressing data in a streaming pipeline to "take more with less".
Improve performance: Although the compression/decompression process trades a bit of CPU for a larger saving on cost, you would assume this could affect performance, right? Not necessarily. You can have significant gains:
1. Query time can be significantly decreased since with less data (especially for a columnar format) is easier to select the parts you need. If the data compression is splittable, you can even read and process each part in parallel
2. This can be useful while computing data for specific tools, for example, when Spark tasks have compression enabled with shuffle spill, this can reduce disk I/O and speed up re-reads.

Although there are reasons to do data compression, it is important to know when NOT to do it:

Latency-critical architectures: If you need absolute performance by reading the files as fast as possible, the decompression process adds a CPU cost and some latency, which is not ideal for your requirement.
Small but significant volume of files: Compression works better when you are handling bigger files (around 100 MBs or more). But if you have, let's say, millions of small files, like 1 KB for log or JSON, compression will create a big overhead for no significant improvement in performance (assuming that you're not aggregating/joining those files, keeping them as it is for some reason).

What are the main types of compression

There are several algorithms and each one can have distinct features or functionalities. For the sake of this blog post, I will keep it brief and show the mains used for building data lakes or using with Spark in AWS (via Glue Jobs, which is relevant for the benchmark). Here are the main algorithms and codecs, using a basic definition from their official sites:

Algorithm/Codec	Natively Supported By AWS Glue	Description
GZIP	Yes	A single-file lossless compression utility built on the DEFLATE algorithm (zlib). The resulting compressed output generally carries the `.gz` suffix. Widely supported across every tool, OS, and cloud service.
Snappy	Yes	A compression/decompression library from Google that targets speed over ratio. It does not aim for maximum compression or compatibility with other libraries: the design goal is very high throughput with reasonable compression.
LZ4	Yes	Designed for extreme speed. LZ4 is a lossless compression algorithm providing compression speed greater than 500 MB/s per core, and decompression speed reaching multiple GB/s. It is the first algorithm on this list to have distinct levels of compressions, going up to 14.
LZO	Yes	A data compression library that prioritises decompression speed. The `lzo1b` variant tested here can decompress faster than it compresses. Like BZIP2, LZO files can be made splittable in Hadoop/Spark with a companion index file.
ZSTD	Yes	Zstandard, developed at Meta, is designed to replace GZIP as the general-purpose default. The levels range from 0 to 22, and speed/compression ratio can vary between each level.
Brotli	No	A modern lossless compression algorithm from Google that uses a combination of LZ77, Huffman coding, and 2nd-order context modelling. Quality levels can go from 0 to 11. Decompression is consistently fast regardless of the quality level used to compress.
DEFLATE	No	The lossless algorithm at the heart of GZIP, zlib, ZIP, and PNG, defined in RFC 1951. It combines LZ77 dictionary matching with Huffman coding: the same engine GZIP uses. A raw `.deflate` stream is essentially GZIP without the wrapper.
BZIP2	No	A block-sorting lossless compressor that typically reaches a higher compression ratio than GZIP, at the cost of considerably more CPU time. Instead of LZ77, it chains the Burrows-Wheeler Transform (BWT), Move-To-Front coding, run-length encoding, and Huffman coding.

Important! The algorithms and levels, for the benchmark, are only being considered in the ones that a Glue Job plainly support it, in which there is no requirement to install any external dependency. This eliminates deflate, brotli and bzip2, for example.

Let's see if they are actual fast or relevant to use, based on their official guidelines!

Part 2: Benchmark

To assess properly the compression algorithms, I consider the following details:

I've used a public dataset so people can reproduce the results if someone wants to. Access it in Kaggle, NBA Dataset: Box Scores and Stats (1947 - Today). The dataset contains basically:
1. One .parquet file and nine .csv files
2. In total, the dataset has around 1.86 GB uncompressed
Similar idea, I ran the transformation/compression/decompression jobs in AWS via Glue. Here is the main configuration:
1. Worker Type: G.1X
  1. 1 DPU (4 vCPUs, 16 GB memory)
  2. 94GB disk (approximately 44GB free)
2. Version: 4.0
3. Workers: 2

I don't intend to show every single Python/Shell Script/SQL code since it would make a large and unnecessary post, but you can check on my GitHub here and the links above for the main scripts that I used to support this work.

The process followed for the benchmark:

Download the NBA dataset and paste in a S3 bucket
1. Via ./datasets/download_aml.sh
Create a Glue job, run the compression and decompression
1. Via ./glue/deploy.sh --bucket my-bucket --region us-east-1
Run the Glue Job that will compress/decompress the files, across different codecs/algorithms and levels
1. With aws glue start-job-run --job-name consumption-benchmark --arguments '{"--s3_bucket":"my-bucket","--dataset":"historical-nba-data-and-player-box-scores","--engine":"glue_spark"}'
Collect metrics and send to S3
1. Already performed by each Glue Job automatically
Collect all the metrics and produce a report to show here
1. With aws s3 cp s3://<my-bucket>/reports/compression/ reports/compression/ --recursive

Results

Here is the successful run:

And here are some of the logs, it is possible to see each algorithm and level:

The raw results of the report are hosted here,

The Numbers

Best compression ratio is gzip-9 (the last digit being the number of the level of the codec during compression/decompression) with 3.5185x (although other codecs and levels are quite similar)
Best compression is lz4-7, with 250 mb/s
Best decompression is zstd-22, with 2,077 mb/s
The compression process that costed the less while running in Glue is lz4-7

Here is a list of all compression ratio per codec/algo and level. Some insights that I can drive from this:

If you use just any lower level, without thinking too much, you may end up with worst cases and bloated data like with zstd-1 or lz4-1
There are some good plain options like snappy
- But you can do better if you go above the traditional and explore different levels with zstd and gzip

Now let's compare decompression speed versus the compression ratio. The biggest one for both metrics is the ideal option for queries that will read and consume the data more than writing. As it is possible to see, the best ones will be zstd with higher levels, from 9 to above.

Not able to show all the algorithms and their compression/decompression time, but this is interesting to see that the compression time of each gzip level can take up to 6x more than any other algorithm/codec.

This is a good indication that, although gzip has a better compression ration, it may not be worth it since it would take more resources to run over with it.

Conclusion

Here are some insights I can draw from that experiment:

New Architecture: When starting a new data solution, aim to explore different compression algorithms based on your use case and design.
Data Lakes: My suggestion is to use zstd at a level 3 or higher. You will have a performance that can be 6x faster than gzip and can lead to significant performance gains when reading the data, which is great for production consumption.
Ingestion focus: If you have a case where, for example, you need to receive the streaming of logs and storage/archive, consider to use lz4 with level 7 or above, since it is more important write-heavy scenario and that can support up to 250 MB/sec.
Compression Levels: There is a moment that more levels does not indicate significant improvements of reducing the file size.
1. If you need extreme results (collect the "last drop" of performance), then explore the level of the codec.
2. If you just need a good enough result, go to a mid level of each codec and you should be fine (from 3 to 7, but it can very per codec/algorithm).
Don't only consider compression ratio: Evaluate compression/decompression time since it can lead to more CPU/memory being consumed and a higher cost (depending on your infrastructure). For example, gzip has the best compression ratio, but it can take up to 6x more of compress time compared with other algorithms.
Always check the support of each codec/algorithm for your application and platform: If you intend to run in a serverless option, like AWS Lambda or Glue, or in a specific programming language, does the current underlying infrastructure/solution already has a way to run and support it the algorithm/codec? If not, do you need to install a custom package/layer to support it? Is it worth it or better to go with a good enough option to simplify the solution?

References

To write this blog post, I used the following articles and books to guide the content:

NBA Dataset: Box Scores and Stats (1947 - Today) — Kaggle, Eoin A. Moore.
Choosing Between gzip, Brotli and zStandard Compression — Paul Calvano (2024).
A Review of Data Compression Techniques — Fitriya, L.A. & Purboyo, Tito & Prasasti, Anggunmeka. (2017). A review of data compression techniques. International Journal of Applied Engineering Research. 12. 8956-8963.
A Study of Various Data Compression Techniques — Ravi, Prakash G. and D. Ashokkumar. “A Study of Various Data Compression Techniques.” (2015).
Ameliorating Data Compression and Query Performance Through Cracked Parquet — Hansert, Patrick & Michel, Sebastian. (2022). Ameliorating data compression and query performance through cracked Parquet. 1-7. 10.1145/3530050.3532923.

My process over to be Claude Certified Architect (Foundations)

Israel Mendes — Sun, 21 Jun 2026 08:47:51 +0000

Important to mention that the certification is quite new, and I assume that the tips here can become a bit outdated in a few months. So consider that before studying for the certification and always check the Exam Guide for the latest information from Anthropic.

I decided to take advantage of the fact that the company I work for, Caylent, is an Anthropic partner and study for the Claude Certified Architect - Foundations exam. I've written about my experience with other certifications, and this one was special because it doesn't have a lot of materials out there. Because of that, I had many people asking about my process, so I'm sharing here a longer version to assist others.

Talking about about the test: it has 60 questions, and it is divided into 5 core competencies:

Important to say that those competencies are a bit distinct from the scenarios that you will face. My understanding is that:

Core competency: Basic elements that you will be evaluated on
Exam scenario: A situation that will evaluate the core competency. You will not be evaluated for every scenario. Each exam draws 4 scenarios at random from this set of 6.

The scenarios are:

Customer Support Resolution Agent
Code Generation with Claude Code
Multi-Agent Research System
Developer Productivity with Claude
Claude Code for Continuous Integration
Structured Data Extraction

Tip: The exam guide is your best friend, from the beginning to understand each core competency/exam scenario until the moment you access your results after you complete the exam.

Let's talk about my workflow in detail:

How did I study for it

First, here is a summary of my numbers while studying for the certification:

Start date: April 11
End date: April 26
1. I spent, roughly, 16 days in this study process
Total questions responded to in practice exams: 480

An important disclaimer: Keep in mind that I was able to take the exam in around 2 weeks because I work with Claude (specifically Code, Agent SDK, and with MCPs) daily. Your situation can be different, and you may need more time to prepare. Plan accordingly because, if you don't pass the exam, you will need to wait another 6 months to retake it.

My process was honestly straightforward. I will list the main resources at the end of the section, but here is how I did it before the exam:

Before the study of the certification, I built some applications in Python using the Agent SDK. I recommend having some hands-on experience there, like building something with Claude Code, developing an MCP or multi-agent product to make it easier for your learning.
I started to take some courses on Anthropic Academy and to read some blog posts related to the agents and LLMs in general
Went over the practice exams. That was the most consuming part. I would take maybe 2 hours to go over 60 questions and take notes of my mistakes, then do another 60 questions and another 2-hour session.

Tip: Once I started to notice the patterns based on the questions I faced on the mock/practice exams, I would feed the Exam Guide to Claude and ask it to generate questions on my mistakes/failures. This helped me a lot to advance on my weaknesses:

Data Analysis Prior Exam

Every time that I would go over some practice/mock exam, I would "log" my performance per competency and major score overall in a simple spreadsheet. Right in the last few days of my preparation, I used that to feed into Claude, and I asked it to generate some interesting insights that I want to share here:

Above, we have some basic numbers, which are useful overall to understand if I'm good to move forward with the exam or not. But let's dive into it. Here are my score per mock/practice exam that I took:

This section below was the most important for me: instead of spending time reviewing questions and competencies/domains that I already had some good knowledge of, I would plot and see which ones I could improve (or which ones I needed to pay more attention to because of the relevance in the exam structure). For example, the main ones that I spent time on in the last days of the exams were (in order of priority):

Prompt Engineering & Structured Output
Agentic Architecture & Orchestration
Claude Code Configuration & Workflows

Tip: Aim to track your performance when going over the questions, and try to break it down into smaller buckets (per competency, scenario, domain, tools). This can help you identify which areas to put your time and priority on to improve your overall results.

In the end, I asked Claude to generate a Diagnosis based on ways that I could improve. Each issue had a priority (based on the score and impact). It is pretty much aligned with the domain insights that I brought up in the previous paragraph.

With those numbers, I was able to:

Spend less time in my preparation by focusing on the main things that I was failing (like specific competencies or domains)
I felt more confident over time, which helped me find the right moment to take the exam
Stop doing certain tactics that were not working, like spending too much time on verbose questions that were not helping to improve my scores and knowledge overall. That is one of the insights from Diagnosis.

With those insights, let's dive into the materials that I used for my preparation.

Resources

Here is a summary of items that I used in the process:

Anthropic Academy:
- Building with the Claude API (The most important one, which covers the main topics. If you want to focus on one course, this is it)
- Introduction to subagents
- Model Context Protocol: Advanced Topics
Blog posts and other resources:
- The one-liner research agent
- How we built our multi-agent research system
Practice Exams/Questions:
- Practice Exam: Claude Certified Architect – Foundations Certification
- CertSafari
- Udemy - TutorialDojo
- ~~Udemy - Sundog Education~~ (I don't recommend that one, because of the reason below)

Tip: Avoid practice/mock exams that have verbose text and multi-layered distractors that all sound plausible. The exam is a "foundations" one, so the expectations are scenario-based questions with more expected elements of the competencies.

Results

Once I completed the exam on the Anthropic site, the next day I got the results and the score (801/1000), which I noticed was close to the practice exam projected score that Claude generated (details on the previous section of this blog post).

Worth it?

As always, I like to think it was beneficial for me to spend the time on it or not. For me, it was. I learned a lot more about how to use AI efficiently in my workflow and in production systems. I understand that the exam right now is a bit hard to get since you need to work with a partner to take the test. If you do have the chance to go over it, I recommend you study for it and take the exam!