Danica Fine

The Apache Iceberg™ Small File Problem

If you've been following Apache Iceberg™ at all, you've no doubt heard whispers about "the small file problem". So what is it? And why does it matter when building the data lakehouse of your dreams?

You've come to the right place! Let's dive in!

Small Files, Big Problem

To start, the small file problem is exactly what it sounds like. We have some dataset; in the case of Iceberg, that dataset is a bunch of data files bound together through metadata as a single Iceberg table. The issue is that the dataset is made up of many small files rather than fewer, larger ones.

Having more small files might not sound like a big deal, but it actually has quite a few implications for Iceberg and can negatively impact performance, scalability, and efficiency in a number of ways:

  • 🗄️ High Metadata Overhead: As we already know, an Iceberg table IS its metadata. Iceberg tracks every file in metadata for each table version, so more small files mean larger metadata files and, in turn, a higher cost of maintaining table snapshots.

  • 🐢 Inefficient Query Planning and Execution: When it comes time to interact with our data, query engines like Apache Spark, Trino, or Snowflake need to open and read each of those many small files, which results in higher I/O overhead, slower data scanning, and reduced parallelism.

  • 💰 Costs of Object Storage Operations: We've all experienced the frustration of unexpected cloud bills! In cloud object stores like S3 or GCS, frequent API calls for listing or retrieving many small files incur significant latency and cost.

  • 🔊 Write Amplification: If you're unfamiliar, write amplification just means that more data is written, touched, or modified than originally intended. For Iceberg, many small writes eventually generate unnecessary work for compaction and cleanup processes down the line.
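
Not sure whether your own tables are affected? Iceberg's metadata tables make it easy to check. Here's a minimal PySpark sketch, assuming a Spark session (spark) already wired up to an Iceberg catalog called local and a table db.events (both made-up names), that uses the table's files metadata table to count data files under a size threshold:

    from pyspark.sql import functions as F

    # Every Iceberg table exposes metadata tables; "files" lists each data
    # file in the current snapshot along with its file_size_in_bytes.
    files = spark.table("local.db.events.files")

    # Treat anything under 32 MB as "small" -- an arbitrary threshold for
    # illustration; pick one relative to your target file size.
    small_files = files.filter(F.col("file_size_in_bytes") < 32 * 1024 * 1024)

    small_files.agg(
        F.count("*").alias("small_file_count"),
        F.avg("file_size_in_bytes").alias("avg_size_bytes"),
    ).show()

If that count is large and keeps growing with every commit, you've got the small file problem.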

Now that you know a bit more about it, you can see how the small file problem is actually a problem. But what can we do about it? 🤷‍♀️

Taking Action

The good news is that the broader Iceberg community isn't just sitting on this issue. You just have to know what's out there and how to take advantage of it!

  • 🤖 The biggest fix is to eliminate existing small files through compaction and expiring snapshots (there's a sketch of both steps after this list). Iceberg already has compaction built into Spark through the rewriteDataFiles action. The v2 Apache Flink Sink released as part of Apache Iceberg 1.7 includes support for small-file compaction, as well!

  • ⚙️ Check your configs! You can set the target file size during writes in Iceberg with the configuration parameter write.target-file-size-bytes (shown in the second sketch after this list).

  • 🔀 Leveraging the Merge-on-Read (MoR) paradigm (also controlled by a few Iceberg configurations, shown in the same sketch) is helpful for avoiding write amplification and getting around some of the headaches of small files without needing frequent compaction.

  • 📋 Query engines are also stepping up with smarter query planning that still allows for the existence of small files but optimizes how data is accessed.
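
Here's what compaction plus snapshot expiration might look like from PySpark, using Iceberg's Spark procedures (again assuming the made-up catalog local and table db.events):

    # rewrite_data_files is the SQL procedure wrapping the rewriteDataFiles
    # action; by default it bin-packs small files into larger ones.
    spark.sql("CALL local.system.rewrite_data_files(table => 'db.events')")

    # Compaction alone doesn't delete the old small files -- older snapshots
    # still reference them. Expiring those snapshots (the cutoff timestamp
    # here is just an example) lets Iceberg clean up unreferenced files.
    spark.sql("""
        CALL local.system.expire_snapshots(
            table => 'db.events',
            older_than => TIMESTAMP '2024-01-01 00:00:00'
        )
    """)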
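
And here's a sketch of the config tweaks from the second and third bullets, set as table properties on the same hypothetical table:

    # write.target-file-size-bytes nudges writers toward ~512 MB data files,
    # and the write.*.mode properties switch row-level deletes, updates, and
    # merges over to merge-on-read.
    spark.sql("""
        ALTER TABLE local.db.events SET TBLPROPERTIES (
            'write.target-file-size-bytes' = '536870912',
            'write.delete.mode' = 'merge-on-read',
            'write.update.mode' = 'merge-on-read',
            'write.merge.mode' = 'merge-on-read'
        )
    """)

Keep in mind that MoR trades write-time amplification for read-time merge work, so you'll still want occasional compaction of the delete files it produces.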

Conclusion

That was kind of a big post on what's otherwise a... small problem 😂. But now you have a better idea of what the small file problem is, why it matters for folks building out a data lakehouse with Apache Iceberg, and what your options are for tackling it.

If you're interested in more Apache Iceberg content, like, follow, and find me across social media.
