Danica Fine

The Apache Iceberg™ Small File Problem

If you've been following Apache Iceberg™ at all, you've no doubt heard whispers about "the small file problem". So what is it? And why does it matter when building the data lakehouse of your dreams?

You've come to the right place! Let's dive in!

Small Files, Big Problem

To start, the small file problem is exactly what it sounds like. We have some dataset; in the case of Iceberg, that dataset is a bunch of data files bound together through metadata as a single Iceberg table. The issue is that the dataset is made up of many small files rather than fewer, larger ones.

Having more small files might not sound like a big deal, but it actually has quite a few implications for Iceberg and can negatively impact performance, scalability, and efficiency in a number of ways:

  • 🗄️ High Metadata Overhead: As we already know, an Iceberg table IS its metadata. Iceberg tracks every file in metadata for each table version, so more small files mean larger metadata files and, in turn, a higher cost of maintaining table snapshots.

  • 🐢 Inefficient Query Planning and Execution: When it comes time to interact with our data, query engines like Apache Spark, Trino, or Snowflake need to open and read each of those many small files, which results in higher I/O overhead, slower data scanning, and reduced parallelism.

  • 💰 Costs of Object Storage Operations: We've all experienced the frustration of unexpected cloud bills! In cloud object stores like S3 or GCS, frequent API calls for listing or retrieving many small files incur significant latency and cost.

  • 🔊 Write Amplification: If you're unfamiliar, write amplification just means that more data is written, touched, or modified than originally intended. For Iceberg, many small writes eventually generate unnecessary work for compaction and cleanup processes down the line.
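
Not sure whether your own tables are affected? Iceberg's metadata tables make it easy to check. Here's a minimal PySpark sketch, assuming a Spark session (spark) already wired up to an Iceberg catalog called local and a table db.events (both made-up names), that uses the table's files metadata table to count data files under a size threshold:

    from pyspark.sql import functions as F

    # Every Iceberg table exposes metadata tables; "files" lists each data
    # file in the current snapshot along with its file_size_in_bytes.
    files = spark.table("local.db.events.files")

    # Treat anything under 32 MB as "small" -- an arbitrary threshold for
    # illustration; pick one relative to your target file size.
    small_files = files.filter(F.col("file_size_in_bytes") < 32 * 1024 * 1024)

    small_files.agg(
        F.count("*").alias("small_file_count"),
        F.avg("file_size_in_bytes").alias("avg_size_bytes"),
    ).show()

If that count is large and keeps growing with every commit, you've got the small file problem.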

Now that you know a bit more about it, you can see how the small file problem is actually a problem. But what can we do about it? 🤷‍♀️

Taking Action

The good news is that the broader Iceberg community isn't just sitting on this issue. You just have to know what's out there and how to take advantage of it!

  • 🤖 The biggest fix is to eliminate existing small files through compaction and expiring snapshots (there's a sketch of both steps after this list). Iceberg already has compaction built into Spark through the rewriteDataFiles action. The v2 Apache Flink Sink released as part of Apache Iceberg 1.7 includes support for small-file compaction, as well!

  • ⚙️ Check your configs! You can set the target file size during writes in Iceberg with the configuration parameter write.target-file-size-bytes (shown in the second sketch after this list).

  • 🔀 Leveraging the Merge-on-Read (MoR) paradigm (also controlled by a few Iceberg configurations, shown in the same sketch) is helpful for avoiding write amplification and getting around some of the headaches of small files without needing frequent compaction.

  • 📋 Query engines are also stepping up with smarter query planning that still allows for the existence of small files but optimizes how data is accessed.
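
Here's what compaction plus snapshot expiration might look like from PySpark, using Iceberg's Spark procedures (again assuming the made-up catalog local and table db.events):

    # rewrite_data_files is the SQL procedure wrapping the rewriteDataFiles
    # action; by default it bin-packs small files into larger ones.
    spark.sql("CALL local.system.rewrite_data_files(table => 'db.events')")

    # Compaction alone doesn't delete the old small files -- older snapshots
    # still reference them. Expiring those snapshots (the cutoff timestamp
    # here is just an example) lets Iceberg clean up unreferenced files.
    spark.sql("""
        CALL local.system.expire_snapshots(
            table => 'db.events',
            older_than => TIMESTAMP '2024-01-01 00:00:00'
        )
    """)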
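
And here's a sketch of the config tweaks from the second and third bullets, set as table properties on the same hypothetical table:

    # write.target-file-size-bytes nudges writers toward ~512 MB data files,
    # and the write.*.mode properties switch row-level deletes, updates, and
    # merges over to merge-on-read.
    spark.sql("""
        ALTER TABLE local.db.events SET TBLPROPERTIES (
            'write.target-file-size-bytes' = '536870912',
            'write.delete.mode' = 'merge-on-read',
            'write.update.mode' = 'merge-on-read',
            'write.merge.mode' = 'merge-on-read'
        )
    """)

Keep in mind that MoR trades write-time amplification for read-time merge work, so you'll still want occasional compaction of the delete files it produces.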

Conclusion

That was kind of a big post on what's otherwise a... small problem 😂. But now you have a better idea of what the small file problem is, why it matters for folks building out a data lakehouse with Apache Iceberg, and what your options are for tackling it.

If you're interested in more Apache Iceberg content, like, follow, and find me across social media.
