Compaction promises a simple answer to a genuinely hard problem: controlling file count, keeping the read-time cost of delete files in check, and keeping metadata growth from turning every query plan into a deep, time-consuming walk through snapshots and manifests.
Compaction is the data layer's mechanism for solving these issues, but only if it runs under a defined set of rules: the scope of each compaction, the thresholds above which compaction runs, and how compaction runs are synchronized with snapshot expiration and manifest maintenance.
Compaction can be run manually via scripts and schedules, or automatically by a control plane.
When manual scripts are used to run compaction, they can effectively manage compaction for a small number of tables and one engine. However, as soon as there are many tables and/or multiple engines, the manual process becomes guesswork: scripts may rewrite too much, may run too infrequently, may interfere with ongoing ingestion, and will cause churn in both snapshots and manifests.
A control plane flips this model completely around.
Instead of rewriting everything all the time, a control plane continuously monitors the health and workload characteristics of tables. Then, only when necessary, a control plane spends rewrite budget on the parts of the tables that actually change performance or cost, while also managing the entire lifecycle of maintaining the table.
This article will teach you how to run compaction the way production lakes do it: how to choose a baseline strategy (bin-packing vs. sorting), how to avoid rewriting healthy partitions, how to limit the scope of each compaction so maintenance stays invisible, how to focus on hot and delete-heavy areas first, how to keep the continuous commit cadence of streaming data from turning your tables into snapshot factories, and how to synchronize compaction with the metadata cleanup needed to keep query planning stable.
For further reading on other aspects of optimizing and maintaining Iceberg, please also refer to:
Let's move on to the compaction strategies that actually work in real production lakes.
1. Add a Control Plane for 20x Faster Compaction and Optimized Table Maintenance
Adding a control plane to your lake gives you LakeOps, which comes with the most powerful and intelligent compaction engine available today and will also manage and optimize table maintenance and lake operations for you.
Snowflake-like experience for Iceberg with 10x performance (source: lakeops.dev)
Instead of being confined to fixed schedules, LakeOps operates as a control plane for Iceberg tables and knows when, how, and what to compact. It treats compaction as a continuous operational problem rather than a periodic batch job and optimizes it in real time.
It analyzes telemetry data from query engines and Iceberg catalogs and uses that data to decide when compaction is actually needed, what to compact, and how. It takes actual usage patterns into account as well.
Under the hood, LakeOps uses a dedicated Rust-based compaction engine that is designed specifically for Iceberg layouts and metadata behavior. Compaction is coordinated with snapshot expiration, manifest rewrites, orphan cleanup, and statistics maintenance so these operations reinforce each other instead of fighting.
The results are ~20x faster compaction, ~15x faster queries, and ~80% CPU/Storage cost saving.
🚢 Apache Iceberg compaction is not “background maintenance.” It’s a time-critical optimization problem that directly impacts query latency, metadata growth, and infrastructure cost.
Learn more about it here:
In addition to compaction, LakeOps gives you control with manual and autopilot modes for all maintenance operations on your tables and coordinates them with compaction. That includes expiring snapshots, manifest rewrites, orphan file cleanup, and more.
Real-time compaction and maintenance optimization with a control plane (source: lakeops.dev)
You can choose between manual mode and autopilot per table or for groups of tables to control compaction and maintenance processes.
LakeOps also lets you define policies across the lake to enforce your standards, and provides you with dashboards to see and manage all compaction and maintenance processes per table and for the entire lake.
Learn more: https://lakeops.dev
2. Use bin-pack as the baseline correction
Most Iceberg tables do not require complex layout schemes. However, they all suffer from file fragmentation.
Before reaching for sort-based layouts, clustering, or partitioning changes, take a closer look at the most obvious source of file fragmentation: the write path itself. In almost all cases, the initial performance decline comes from streaming ingestion of very small batches, micro-batch commits that happen very frequently, and backfills that arrive in unevenly sized chunks.
The result of this write pattern is many small Parquet files in each partition. None of these files are "broken," and queries still return correct answers. However, as small files accumulate, planning time increases, task-scheduling overhead grows, and the number of object store calls climbs.
This is not a layout issue; it is a file count issue.
The easiest and most reliable method to solve the file count issue is to use bin-pack compaction. Bin-pack compaction combines small data files into a smaller number of larger, properly sized files. Bin-pack compaction does not alter the existing file sort order, nor does it re-cluster data; it merely normalizes the file size and decreases the metadata overhead associated with having a high number of files.
In practice, this is usually sufficient.
Why This Approach to Compaction Really Does Improve Performance
Iceberg engines operate at the file level. The more files a table has, the more work planning requires:
More manifest entries must be read
More file footers must be inspected
More scan tasks must be scheduled
More file references to delete must be tracked
As the number of files grows, so does the planning time. Bin-pack compaction reduces the number of physical files while maintaining the existing logical layout. Therefore, there are fewer planning reads and fewer tasks to schedule, without requiring additional shuffling.
A good rule of thumb for most production tables is to target file sizes in the 128 MB to 512 MB range. The specific range depends on the engine and the workload. What matters is consistency.
Start with the Default Rewriting Method
Unless you specify otherwise, Iceberg uses the bin-pack rewriting method:
CALL prod.system.rewrite_data_files(
table => 'db.events'
);
Verify the effectiveness of the rewrite:
SELECT
count(*) AS file_count,
sum(file_size_in_bytes)/1024/1024/1024 AS total_size_gb
FROM prod.db.events.files;
You want to see fewer files and the same total size. If the total size is substantially changed, something other than file fragmentation is occurring.
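If you want a more direct signal than total size, you can also count how many files sit below your small-file threshold before and after the rewrite. A quick sketch, assuming 128 MB as the threshold:

-- Files still below a 128 MB small-file threshold; this count should drop sharply after bin-pack
SELECT count(*) AS small_files
FROM prod.db.events.files
WHERE file_size_in_bytes < 134217728;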
Specify Your Target File Size
If file fragmentation continues, specify a target file size at the table level:
ALTER TABLE prod.db.events SET TBLPROPERTIES (
'write.target-file-size-bytes'='536870912' -- 512MB
);
And then perform the rewrite with the same target file size:
CALL prod.system.rewrite_data_files(
table => 'db.events',
options => map(
'target-file-size-bytes','536870912'
)
);
Without specifying a target file size, the engine and writer will create files of varying sizes, which will require subsequent compactions to restore the target file size.
Define Thresholds to Prevent Unnecessary Rewrites
At scale, performing rewrites on healthy data wastes compute resources. Define the following thresholds to prevent unnecessary rewrites:
CALL prod.system.rewrite_data_files(
table => 'db.events',
options => map(
'min-input-files','5',
'min-file-size-bytes','134217728' -- 128MB
)
);
This will ensure that only those partitions with a minimum of five input files are rewritten. Partitions with one or two files of reasonable sizes are ignored.
3. Conditional Compaction
Another way to waste compute cycles in an Iceberg lake is to blindly compact data on a regular basis.
It usually begins innocently. A periodic rewrite job is created to "keep things tidy." For a period of time, it appears to help. Eventually, however, it begins to rewrite partitions that were previously healthy. Each rewrite generates new files, new snapshots, and updates the manifest list. Although none of the files are "broken," the system is spending compute cycles on processing data that does not improve performance.
Compaction should be used as a corrective measure, not as a routine activity.
Why Unconditional Rewrites Cause Churn
Each time you rewrite data files, Iceberg:
Creates new data files
Generates a new snapshot
Updates manifest lists
Increases metadata history
If the files being rewritten are already close to their target size, you are essentially cycling the data through the system. The added churn causes increased metadata depth and longer planning times over time.
At scale, this overhead becomes noticeable.
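You can see this churn directly in the snapshot history. A quick sketch against the snapshots metadata table (one row per commit; compaction rewrites typically show up as the 'replace' operation):

-- Frequent 'replace' commits on otherwise idle partitions are a sign of over-compaction
SELECT
operation,
count(*) AS snapshots,
min(committed_at) AS first_commit,
max(committed_at) AS last_commit
FROM prod.db.events.snapshots
GROUP BY operation;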
Your objective is not to compact frequently. It is to compact when the layout is measurably unhealthy.
Implement Gateways to Control Rewrites
Iceberg's rewrite_data_files procedure lets you set gateway conditions. The most effective one is min-input-files.
CALL prod.system.rewrite_data_files(
table => 'db.events',
options => map(
'min-input-files','3'
)
);
Using this condition, Iceberg will only rewrite file groups that contain at least three files. Partitions that already have one or two files that are of reasonable size are excluded.
This is a relatively small change to make, but in large lakes, it will significantly reduce unnecessary compaction.
You can implement additional conditions to make this gateway even tighter:
CALL prod.system.rewrite_data_files(
table => 'db.events',
options => map(
'min-input-files','5',
'min-file-size-bytes','134217728', -- 128MB
'target-file-size-bytes','536870912' -- 512MB
)
);
Compaction will now only run when the following conditions are met:
There are sufficient small files to warrant consolidation
Files are currently below a reasonable size threshold
There is a valid target to normalize to
This converts compaction from an automatic rewrite into a targeted repair operation.
Let the Table State Drive the Decision
Prior to initiating a compaction operation, review the distribution of files:
SELECT
partition.event_date,
count(*) AS file_count,
avg(file_size_in_bytes)/1024/1024 AS avg_mb
FROM prod.db.events.files
GROUP BY partition.event_date
ORDER BY file_count DESC
LIMIT 20;
If a partition has four files averaging 480 MB and your target file size is 512 MB, rewriting the partition will not materially affect either planning or scanning time.
However, if another partition has 180 files averaging 25 MB, that partition is a prime candidate for compaction.
Compaction decisions should be based on signals such as this. Not a schedule.
4. Limit Rewrite Scope Per Run
Big compaction jobs appear great on paper. In reality, they are among the easiest ways to create instability in a production lake.
Backfills, partition evolutions, or long stretches of time without maintenance can quickly turn terabytes of data into rewrite candidates. If you don't specify any boundaries, Iceberg will rewrite everything that meets its criteria. The outcome is well understood: long-running jobs, significant shuffles, large amounts of object store I/O, significant increases in snapshot sizes, and sometimes even cluster contention with users' queries.
Compaction operates at its best when it is incremental.
Why "Rewrite Everything" Is A Risk
When you rewrite a large section of a table in a single pass, you are executing a number of costly operations simultaneously:
Reading numerous data files
Shuffling and rewriting them
Creating a lot of new files
Creating a large new snapshot
Possibly rewriting manifests
Regardless of whether it succeeds, you have produced a major maintenance event. And if it fails partway through, you waste the resources it consumed and extend your maintenance window.
Operationally, smaller, more frequent corrections are safer than infrequent large-scale rewrites.
Restrict rewrite size explicitly
Iceberg's rewrite_data_files procedure exposes options that limit how much work goes into a single run.
For example, you can cap the number of file group rewrites per run:
CALL prod.system.rewrite_data_files(
table => 'db.events',
options => map(
'max-file-group-rewrites','20'
)
);
This restricts the number of rewrite groups executed in a single pass of rewrite_data_files. Instead of rewriting hundreds of partitions in one pass, you will execute a controlled sequence of batch passes.
You can also combine this with eligibility thresholds:
CALL prod.system.rewrite_data_files(
table => 'db.events',
options => map(
'min-input-files','5',
'max-file-group-rewrites','20',
'target-file-size-bytes','536870912'
)
);
Compaction now becomes predictable. Each pass of compaction corrects a finite amount of drift.
Disperse correction over cycles
There is rarely a good reason to try to "fix everything tonight" when a table has accumulated months of small files from a large number of small writes.
Instead, a more sustainable and steady state approach would be:
Compact data with limited rewrite scope.
Permit normal operation of user workloads.
Repeat on the subsequent maintenance cycle.
Within a couple of cycles, fragmentation will decrease significantly, without generating a maintenance peak.
This also produces less "snapshot shock." Instead of a large, single-pass rewrite snapshot replacing nearly half of the table, you generate a series of smaller, incremental snapshots.
Prioritize rather than rewrite randomly
Once you limit the rewrite scope, prioritizing becomes essential.
Practically speaking, you want to rewrite:
Partition groups with the greatest number of files.
Partition groups with the least average file size.
Partition groups with the largest numbers of deletions.
Partition groups with the most user queries.
You can find the worst offending partition groups using the following query:
SELECT
partition.event_date,
count(*) AS file_count,
avg(file_size_in_bytes)/1024/1024 AS avg_mb
FROM prod.db.events.files
GROUP BY partition.event_date
ORDER BY file_count DESC
LIMIT 50;
Then either target the offending partition groups directly with a where predicate, or let your orchestration layer decide which groups deliver the highest impact and rewrite those first.
For example:
CALL prod.system.rewrite_data_files(
table => 'db.events',
WHERE => 'event_date >= DATE ''2026-01-01''',
options => map(
'max-file-group-rewrites','10'
)
);
This limits both the logical scope (only recent partitions) and the physical rewrite volume.
Keep Maintenance Invisible to Users
The ultimate objective of limiting rewrite scope is not merely cluster stability. It is predictability.
When compaction is done in a manner such that each run is relatively small and bounded:
Maintenance windows are brief.
Resource spikes are under control.
Snapshots grow gradually.
Query performance improves incrementally, rather than suddenly.
In production lakes, stability is usually more important than the rate of correction. Incremental correction is generally preferred to dramatic restructuring.
5. Focus On Hot Partition Groups
In the majority of Iceberg tables, compaction impact is not uniformly distributed.
A small segment of partitions is responsible for most of the pain: they receive the most writes (so they fragment the fastest) and the most reads (so every additional file shows up as both planning and scan overhead). If you rewrite the hot partitions, you typically gain 80% of the benefit for a fraction of the rewrite volume.
The simplest approach to achieve this is to treat compaction as a rolling window problem.
"Hot" Generally Means Two Things
Hot partitions generally represent partitions that are still active:
They are still receiving new files from streaming or micro-batch systems
They are the partitions that your analysts / dashboards / downstream jobs are accessing constantly.
This results in two operational principles:
Compact relatively recent partitions regularly, since they will accumulate the most small files.
Do not compact the actively written partitions unless you know you can tolerate collisions with writers.
AWS makes the same point in its Iceberg compaction guidance: use a where predicate to exclude actively written partitions so that you do not hit data conflicts with writers, leaving only metadata conflicts that Iceberg can normally resolve.
Identify Your Rolling Window With Where
Iceberg's Spark procedure provides a where predicate for filtering which files (and hence which partitions) qualify for rewriting.
An extremely common use case is "Compact everything older than the current ingest window":
CALL prod.system.rewrite_data_files(
table => 'db.events',
WHERE => 'event_date < DATE ''2026-02-10''',
options => map(
'target-file-size-bytes', '536870912',
'min-input-files', '5'
)
);
This keeps compaction away from the partitions that are currently being mutated, while continually cleaning up yesterday's data and older.
If your table is partitioned hourly, apply the same idea at hour granularity:
CALL prod.system.rewrite_data_files(
table => 'db.events',
WHERE => 'event_hour < TIMESTAMP ''2026-02-11 12:00:00''',
options => map(
'target-file-size-bytes', '536870912',
'min-input-files', '5'
)
);
The main principle here is not the specific cut-off. It is maintaining a buffer so that compaction does not conflict with ingestion.
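If you prefer not to hard-code the cut-off, one option is to have your orchestration read the latest partition value and subtract a buffer before building the predicate. A sketch, assuming a one-day buffer that you would tune to your ingest window:

-- Latest partition minus a one-day buffer, to be used by the orchestrator as the compaction cut-off
SELECT date_sub(max(partition.event_date), 1) AS compaction_cutoff
FROM prod.db.events.files;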
Locate the Worst Partition Groups First
Even within the "hot-ish" window, not all partition groups are equally bad. You can typically find the worst offender simply by examining the file count and average size:
SELECT
partition.event_date,
count(*) AS file_count,
avg(file_size_in_bytes)/1024/1024 AS avg_mb
FROM prod.db.events.files
GROUP BY partition.event_date
ORDER BY file_count DESC
LIMIT 30;
If you plan to compact only a limited portion of the data per pass (as you probably should), this query will indicate where you will obtain the largest initial benefit.
If You Must Compact Partition Groups That Are Still Receiving Late Data
There are certain types of workloads that receive late-arriving events, updates, or merges that keep older partition groups "active." If you compact them regardless, you may periodically collide with writers.
Iceberg includes a partial progress mode that commits compaction in smaller increments rather than one large block, which makes retries cheaper when conflicts occur.
CALL prod.system.rewrite_data_files(
table => 'db.events',
WHERE => 'event_date >= DATE ''2026-02-01'' AND event_date < DATE ''2026-02-10''',
options => map(
'partial-progress.enabled', 'true',
'partial-progress.max-commits', '10',
'max-concurrent-file-group-rewrites', '10'
)
);
You are trading "one clean commit" for "multiple smaller commits that are cheaper to retry when they fail." In real production lakes with continuous write activity, that trade is typically worthwhile.
Where a Control Plane Helps
Once you have numerous tables, "hot partition groups" ceases to be something you determine through experience. You require a loop that continuously determines the "hot partition groups" based on the actual read/write usage of your system and then applies the rolling window concept to those partition groups.
That is the point at which a control plane such as LakeOps becomes useful: It is not adding a new compaction algorithm as much as it is determining where to expend your rewrite budget based on real workload telemetry and applying that determination in a consistent manner across hundreds of tables.
6. Sort or Z-Order When Scan Efficiency Is The Bottleneck
Bin-packing compaction decreases the number of files. However, it does not affect how the data is organized internally within those files.
If partitions are appropriately sized and file counts are reasonable, but selective queries scan far more data files than expected, the cause is likely clustering. The common pattern looks like this: planning times are consistent, partition pruning is working, but filtered queries read a large percentage of the files within a given partition. This is where sort-based compaction becomes relevant.
Data engines depend on file-based statistical information (such as min and max values) to determine whether a file can be excluded from processing. When data is written in a random fashion, the value ranges between files overlap greatly. Therefore, even selective predicates typically cannot exclude a large number of files. Sorting changes this. When data is ordered by a frequently used filter column, each file will generally contain a narrower value range than previously existed, thus allowing more files to be skipped based upon the predicate and reducing the number of bytes to be scanned.
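You can get a rough sense of how much file-level ranges overlap from the files metadata table. A sketch, assuming a recent Iceberg release that exposes readable_metrics and that event_time is your main filter column:

-- Per-file min/max for the filter column; heavy overlap across files means pruning has little to work with
SELECT
file_path,
readable_metrics.event_time.lower_bound AS event_time_min,
readable_metrics.event_time.upper_bound AS event_time_max
FROM prod.db.events.files
ORDER BY event_time_min
LIMIT 20;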
If there is one column that dominates your predicates (e.g., event_time within a date-partitioned table), a simple sort-based rewrite is typically sufficient:
CALL prod.system.rewrite_data_files(
table => 'db.events',
strategy => 'sort',
sort_order => 'event_time ASC NULLS LAST',
options => map(
'target-file-size-bytes','536870912',
'min-input-files','5'
)
);
This rearranges the rows of data within files so that time-based predicates can remove more files early in the process. The effect is evident not only in terms of runtime, but also in the reduction in scanned bytes and the number of splits generated.
If your workload filters on multiple columns (e.g., user_id, event_type, and occasionally device_type), Z-order is typically a superior choice:
CALL prod.system.rewrite_data_files(
table => 'db.events',
strategy => 'sort',
sort_order => 'zorder(user_id, event_type)',
options => map(
'target-file-size-bytes','536870912',
'min-input-files','5'
)
);
Z-ordering enhances locality across multiple dimensions. While Z-ordering will never perfectly optimize any individual column, it typically minimizes overall scan expansion when filter patterns vary.
Important Note: Defining a sort order at the table level does not rewrite historical data. It only affects newly-written data. All existing files will remain unmodified until a rewrite occurs.
It is typical to define a sort order and then expect improvements, only to discover that no changes occurred to the physical layout of the data.
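For reference, this is roughly what defining a table-level write order looks like in Spark SQL with the Iceberg extensions enabled. It only shapes data written after the statement; existing files stay as they are until you run the sort rewrite shown above:

-- Applies to new writes only; historical files keep their old layout until compacted
ALTER TABLE prod.db.events WRITE ORDERED BY event_time ASC NULLS LAST;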
After performing a sort-based compaction, verify it properly. Examine how many files are scanned for common predicates and compare the total bytes scanned before and after. Runtime is difficult to interpret in shared environments; file count and total bytes scanned are more reliable metrics.
Sorting is more resource-intensive than bin-packing. Sorting involves additional shuffle and CPU overhead during compaction. If you were to blindly apply sorting to all partitions, the costs of maintenance could potentially exceed the query performance improvements. In general, sorting works best when applied selectively: Target high-traffic partitions; Align the sort with real filter patterns; Apply rewrites incrementally.
When scan efficiency is the primary bottleneck rather than file count, sorting or Z-order is one of the few techniques that will reliably enhance pruning. The key is to apply sorting or Z-order in a manner that aligns with the characteristics of your workload.
7. Compact delete-heavy partitions deliberately
You can view a table, note that file sizes are healthy, and believe that compaction is being handled - yet the queries are slowing down.
A common cause is delete files.
Iceberg does not immediately rewrite data files when rows are modified or deleted. Instead, it stores position deletes or equality deletes alongside the data. At read time, the engine merges data files with their associated delete files. This makes writes efficient, but as delete files accumulate, every query incurs additional cost.
The effect is subtle. File sizes appear fine. Bin-pack has already normalized fragmentation. However, scan CPU increases, and update-heavy partitions begin to perform worse than append-only partitions.
You can usually verify this by examining the delete file distribution:
SELECT
partition.event_date,
count(*) AS delete_file_count
FROM prod.db.events.delete_files
GROUP BY partition.event_date
ORDER BY delete_file_count DESC
LIMIT 20;
If certain partitions contain a high concentration of delete files, that is an indication that reads are performing more work than necessary.
Iceberg supports delete-aware compaction. Instead of rewriting files based solely on size, you can specify a threshold for the ratio of deleted rows and have Iceberg rewrite the data files that cross it. For example:
CALL prod.system.rewrite_data_files(
table => 'db.events',
options => map(
'delete-ratio-threshold','0.3',
'remove-dangling-deletes','true',
'target-file-size-bytes','536870912'
)
);
In this case, Iceberg will rewrite data files that are severely impacted by deletes and will remove the physical representation of deleted rows as well as the associated delete files.
The practical result is that queries will no longer require the merging of as many delete files at runtime; CPU decreases; scan cost stabilizes; and planning becomes easier since there are fewer auxiliary files to track.
This matters mainly for tables subjected to upserts, CDC pipelines, and frequent merges. Append-only event tables do not typically exhibit this behavior; dimension tables and slowly changing datasets do.
Just like with any other compaction strategy, maintain focus. Use the combination of delete thresholds, partition filters, and rewrite limits to optimize the compaction strategy. It is unnecessary to rewrite the entire table simply because a handful of partitions contain a high amount of delete activity.
Healthy file size does not ensure healthy performance. If delete files comprise the majority of a partition, the compaction strategy must specifically target these delete files - otherwise, read cost will continue to increase regardless of whether the layout appears to be "correct".
8. Rewrite position delete files individually when needed
Rewriting data files does not necessarily resolve all delete-related issues.
In many update-intensive workloads, position delete files accumulate more quickly than data files are rewritten. Even if you execute a delete aware compaction strategy, you can still find yourself with a large number of position delete files attached to otherwise healthy data files.
Even after executing a compaction strategy that reduces the number of data files, the engine still has to open and apply the delete files during a read operation. Therefore, as delete files accumulate, the scan overhead will remain greater than it should be.
This is particularly prevalent in tables that receive regular upserts or merges. Tables that are subject to append-only inserts do not typically exhibit this type of behavior. However, CDC pipelines and dimension tables do.
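To confirm that position deletes specifically are the problem, you can filter the delete_files metadata table by content type. A sketch, relying on the convention that content = 1 marks position deletes and content = 2 marks equality deletes:

-- Partitions carrying the most position delete files
SELECT
partition.event_date,
count(*) AS position_delete_files
FROM prod.db.events.delete_files
WHERE content = 1
GROUP BY partition.event_date
ORDER BY position_delete_files DESC
LIMIT 20;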
Iceberg permits the explicit rewriting of position delete files as follows:
CALL prod.system.rewrite_position_delete_files('db.events');
This procedure rewrites small delete files into fewer, larger ones and drops obsolete entries where possible. The goal is not merely to reduce the total number of files, but to minimize the number of files the engine has to open during a read, which is what improves performance.
You can also limit the scope of this rewrite, just as you limit the scope of data file rewrites, by specifying which partitions to rewrite:
CALL prod.system.rewrite_position_delete_files(
table => 'db.events',
where => 'event_date >= DATE ''2026-02-01'''
);
If you have already rewritten the data files and the performance of your application has not improved, inspect the distribution of delete files. In some cases, rewriting data files will reduce fragmentation, but will leave a heavy delete layer behind. In such cases, the separate rewriting of delete files will be required.
Similar to all of the strategies described throughout this guide, keep the rewriting of delete files focused. There is little to be gained by rewriting delete files across the entire table if only a limited number of partitions are subject to upserts.
In addition, treating delete file maintenance as its own task keeps read cost predictable. Otherwise, even though the data files themselves are sized appropriately, the overhead the engine accumulates while processing delete files can slow reads over time.
9. Lower the commit frequency for streaming workloads
When writing to Iceberg from streaming or micro-batch applications, commit frequency is one of the biggest cost multipliers in the system.
Each commit generates a new snapshot and produces new metadata work. This includes updating the manifest and creating new, small data files. As you commit every few seconds, you don't simply create small files; you create a long chain of snapshots and a continuous flow of metadata churn. While nothing "breaks," the planning is slowed and maintenance must continually struggle to keep pace with the increasing overhead.
The frustrating aspect is that teams typically attempt to resolve this issue by applying more compaction, while the true solution lies upstream: stop committing as often.
The benefits of modifying the commit frequency
When you increase the interval between commits, you generally gain three tangible benefits at once.
First, you generate fewer snapshots, resulting in fewer pieces of metadata that the engine has to evaluate during planning.
Second, you generate fewer manifests / manifest updates overall.
Third, each commit contains more data, resulting in larger files (or fewer small files) and therefore reduced compaction pressure.
You're essentially lowering entropy at the source.
Spark Structured Streaming: Set a reasonable trigger interval
A common antipattern is to configure structured streaming to run "as fast as possible" or with a very short trigger. If you are writing to Iceberg tables, avoid this practice unless you have a true requirement for sub-minute freshness.
The following shows the configuration for setting a reasonable commit interval in PySpark:
(
df.writeStream
.format("iceberg")
.outputMode("append")
.option("checkpointLocation", "s3://prod-checkpoints/events/")
.trigger(processingTime="1 minute")
.toTable("prod.db.events")
)
If your service level agreement (SLA) allows it, increase the commit interval to 2–5 minutes. In most analytics lakes, this tradeoff is worthwhile: slightly increased data freshness lag in exchange for significantly decreased metadata churn and less maintenance overhead.
Flink: Commit frequency follows checkpointing
For Flink, Iceberg commits typically follow the checkpoint intervals. If you checkpoint every 30 seconds, you are essentially committing every 30 seconds. That's a lot.
A more reasonable interval would be minutes, not seconds, unless you are operating a low-latency serving pipeline.
Example:
// 5 minutes
env.enableCheckpointing(300_000);
Ultimately, the best value will depend on recovery requirements and end-to-end latency needs. However, the underlying premise is the same: do not checkpoint so frequently that you turn your Iceberg table into a snapshot factory.
A simple method to select the interval
Do not overcomplicate things. Ask yourself: what is the longest delay that your downstream consumers can tolerate for data freshness?
If the response is "near real-time", you may still be fine at 1 minute. If the response is "a few minutes", take advantage of the situation and commit every few minutes.
If the response is "we run dashboards hourly", then committing every 10 seconds is just self-imposed suffering.
Sanity check
If you observe thousands of snapshots being created daily for a single table, this is typically an indication that your commit cadence is too aggressive for an analytics lake. You can certainly use Iceberg as a means of generating data in this manner - it is designed to be correct - but you will pay for it in terms of planning overhead and ongoing maintenance.
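A quick way to check your actual cadence is to count snapshots per day from the snapshots metadata table (a sketch; each committed micro-batch is one row):

-- Commits per day; thousands per day usually means the trigger or checkpoint interval is too aggressive
SELECT
date_trunc('DAY', committed_at) AS commit_day,
count(*) AS snapshots
FROM prod.db.events.snapshots
GROUP BY date_trunc('DAY', committed_at)
ORDER BY commit_day DESC
LIMIT 14;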
Lowering the commit frequency is one of the few optimizations that decreases cost and improves stability regardless of how you tune compaction. Fixing this early pays off, because once you have dozens or hundreds of streaming-written tables, this behavior dictates your operational overhead.
10. Stop the repair loop; fix the write path
Write paths that produce too many small files or heavily skewed partitions turn compaction into a never-ending battle. As long as you keep rewriting the same problems, the lake will always drift back into an unhealthy state.
The majority of "we need more compaction" situations are actually "our write path is poorly configured."
Begin with Distribution Mode
Small files are a common result of poor data distribution during writes. A typical scenario: one writer task holds most of the data for a given partition and emits a couple of large files, while the remaining tasks emit a large number of smaller files. Worse, if the distribution is unstable between batches, you will see fragmentation regardless of how often you compact.
Iceberg allows you to configure the way data is written across multiple writers. A good baseline configuration for many workloads is hash, as it generally spreads rows out more evenly:
ALTER TABLE prod.db.events SET TBLPROPERTIES (
'write.distribution-mode' = 'hash'
);
This does not completely remove the necessity for compaction, but it helps slow down how quickly fragmentation occurs again.
Establish a Target File Size for Writers at the Table Level
When writers do not have a target file size, file sizes vary wildly across engines and jobs. Some writers produce 16 MB files, others 1 GB files, and compaction continually tries to normalize the mess.
Set a target file size for writers at the table level and keep it consistent:
ALTER TABLE prod.db.events SET TBLPROPERTIES (
'write.target-file-size-bytes' = '536870912' -- 512MB
);
Once you have established a target file size for writers at the table level, compaction will transition from "fix everything" to "fix the outliers."
Optimize the Writer and Not Just the Table
Another common reason writers produce small files is due to the number of tasks that are utilized when writing versus the amount of data in each micro-batch or partition. The easiest method to optimize this is to adjust the degree of parallelism at the point of write.
If you are experiencing hundreds of files per partition per batch, consider reducing the number of output partitions prior to writing:
(
df
.repartition(200)  # pick a number that corresponds to your cluster and batch size
.writeTo("prod.db.events")
.append()
)
You don't have to find the optimal number. All you need to do is stop creating 1000 small files because your job happened to run with 1000 tasks.
Do Not Create "Hot Partitions" by Design
Some datasets inherently skew towards certain items, such as one customer producing 70% of the events, or a specific date receiving a massive backfill. When a partitioning scheme directs a large amount of data into a single partition, you will continually be fighting it with compaction.
This is one of the few instances where adjusting the partitioning scheme to provide less skewness can greatly reduce compaction load. A common strategy is to add another dimension to the partitioning scheme (or create a derived shard key) to ensure that a single logical partition does not become a physical hotspot.
You do not need to re-design the entire table. One additional dimension may be sufficient to prevent the worst skewness.
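As a sketch of what this can look like, Iceberg's partition evolution lets you add a bucketed field on the skewed key without rewriting history. Here customer_id is a hypothetical skewed key, and the bucket count is something you would size to your own skew:

-- New data is spread across 16 buckets of the skewed key; existing partitions are left as-is
ALTER TABLE prod.db.events ADD PARTITION FIELD bucket(16, customer_id);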
11. Maintain Metadata With Compaction
You can obtain the desired file sizes, reduce the number of files, and yet still end up with a table whose performance and cost characteristics degrade over time. This is typically a metadata issue and not a physical layout issue.
Every time you run compaction, you create a new snapshot. Every snapshot adds to the table's history. Manifests accumulate. The old metadata remains until something removes it. If nothing does, the table becomes deeper and more expensive to reason about, even if the physical data files look clean.
This is the most common trap: teams focus on rewrite_data_files and neglect what happens to the snapshots and manifests afterward.
In general, compaction should be followed immediately by snapshot expiration. If you keep thousands of historical snapshots around "just in case," the engine still has to traverse that lineage during planning. Over time, this shows up as slower metadata reads and longer planning times.
Typically, a snapshot expiration would look something like this:
CALL prod.system.expire_snapshots(
table => 'db.events',
retain_last => 10
);
The right number of retained snapshots depends on your rollback and time-travel policies, but it is critical to establish a clear retention policy. Unbounded retention is rarely what you truly need.
After expiring snapshots, it is also beneficial to perform manifest consolidation:
CALL prod.system.rewrite_manifests('db.events');
Even if your file sizes are acceptable, fragmented manifests will require the planner to open and evaluate numerous small metadata files. Manifest consolidation will reduce the fan-out and stabilize the planner costs.
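If you want a signal before running it, the manifests metadata table shows how many manifests exist and how small they are. A rough indicator, not a hard rule:

-- Many tiny manifests is the metadata-level equivalent of small-file fragmentation
SELECT
count(*) AS manifest_count,
avg(length)/1024 AS avg_manifest_kb
FROM prod.db.events.manifests;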
Then there is orphan removal. Failed jobs, speculative tasks, and partial rewrites leave files in object storage that are no longer referenced by the table. Over months, that adds up to real storage cost. Removing these orphans keeps the lake's footprint predictable.
CALL prod.system.remove_orphan_files(
table => 'db.events',
older_than => TIMESTAMP '2026-02-10 00:00:00'
);
The older_than guardrail is essential. You cannot afford to be racing with active writers in a production environment. Safety is more important than aggression.
What complicates this is that these actions are not independent. How you retain snapshots determines what you can remove. How frequently the table is updated determines how often you need to rewrite manifests. Orphan removal interacts with commit timing and with streaming jobs.
Therefore, compaction is not simply a single maintenance action. It is part of a life cycle. Physical data files, snapshots, manifests, and physical storage move in tandem.
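Put together, a single maintenance cycle might look like the following sketch, reusing the thresholds and retention values from earlier sections (the exact order and numbers depend on your own policies):

-- 1. Repair fragmented partitions within a bounded scope
CALL prod.system.rewrite_data_files(
  table => 'db.events',
  options => map(
    'min-input-files','5',
    'target-file-size-bytes','536870912'
  )
);

-- 2. Drop history you no longer need for rollback or time travel
CALL prod.system.expire_snapshots(
  table => 'db.events',
  retain_last => 10
);

-- 3. Consolidate the metadata the planner has to read
CALL prod.system.rewrite_manifests('db.events');

-- 4. Reclaim storage from files no snapshot references anymore
CALL prod.system.remove_orphan_files(
  table => 'db.events',
  older_than => TIMESTAMP '2026-02-10 00:00:00'
);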
At small scales, you can run these actions manually and get away with it. At larger scales, you need to be able to enforce consistency. Tables drift in various ways at varying rates. Without coordinated metadata maintenance, you will continue to repair file layout, while the metadata layer quietly continues to grow.
Your goal is not simply to minimize the number of small files. Your goal is to maintain a table whose performance and cost characteristics remain stable over time. Compaction addresses the data layer. The three above actions address the metadata layer to prevent it from becoming the next bottleneck.
Recap and Conclusion
Compaction in Iceberg is not about scheduling rewrite_data_files. It is about maintaining the alignment between the layout, deletes, and metadata of a table with its actual usage.
Compaction plus table maintenance is now a coordination problem. A control plane that continuously assesses table health, prioritizes the partitions that most need compaction, and coordinates compaction with metadata maintenance, rather than treating these as separate jobs, gives you a head start. The alternative is manual work and scripting.
We walked through the practical aspects of this:
Use bin-pack to control the file count
Escalate to sorting only when the scan efficiency is the limiting factor
Use gates to restrict the re-writing of healthy data
Limit the scope to make the maintenance predictable
Proactively resolve delete-heavy partitions
Reduce the commit entropy in streaming jobs
Correct the write path to prevent constant repair of the same issues
Connect compaction to snapshot expiration, manifest rewrites, and orphan removal
Most importantly, we connected compaction to snapshot expiration, manifest rewrites, and orphan removal - because the physical data layout and the metadata health are interdependent.
Your goal is not to simply minimize the number of small files. Your goal is to maintain a lake that remains predictable - in terms of performance, cost, and operational overhead - as it grows.
If you are operating Iceberg in production, I would appreciate your feedback regarding what has worked (and failed) for you. Real world patterns are always more interesting than theoretical ones.
Thank you for reading 🍺