A modern lakehouse looks simple from the outside.
Data lands in object storage. Apache Iceberg gives you tables, snapshots, schema evolution, time travel, and multi-engine access. Spark writes. Trino queries. Flink streams. Snowflake or Athena may read the same data. Everyone is happy.
Then the lakehouse starts growing.
Small files pile up. Snapshots never expire. Manifest metadata gets heavier. Delete files slow down reads. Failed jobs leave orphan files behind. Query planning becomes slower. Storage cost grows in places nobody is watching. Every engine has its own behavior, its own tuning, and its own operational gaps.
This is the part that gets underestimated.
Iceberg solves the table format problem. It does not magically solve lakehouse operations.
The same pattern already happened in compute infrastructure: once systems grew large enough, manual tuning stopped scaling, and autonomous optimization platforms became useful because they continuously tuned resources instead of relying on people to chase every inefficiency.
Iceberg lakehouses need the same shift. Not more scripts. Not more periodic cleanup jobs. A real control plane.
That is the idea behind LakeOps: autonomous lakehouse management for Apache Iceberg. It sits above your lake, watches table and engine behavior, and continuously manages the operational work that keeps Iceberg fast, clean, and cost-efficient.
This article is a practical guide to what that means.
The real job of running an Iceberg lakehouse
When teams first adopt Iceberg, the focus is usually on features.
ACID transactions on object storage.
Time travel.
Schema evolution.
Partition evolution.
Hidden partitioning.
Multiple engines reading the same tables.
Those are important. They are why Iceberg became popular.
But after the first production workloads move in, the job changes. You are no longer just “using Iceberg.” You are operating a lakehouse.
That means you are responsible for table layout, file size, metadata growth, snapshot retention, stale data, query planning, engine behavior, storage waste, and workload safety.
A healthy lakehouse needs constant maintenance.
A production Iceberg table is not static. Every ingest, append, merge, delete, compaction, schema change, or streaming write changes its physical and metadata shape. Even when the logical table looks clean, the underlying table may be drifting.
That drift is the problem.
A table can still return correct results while becoming slower and more expensive every week.
Where lakehouse maintenance gets painful
The pain usually appears in a few predictable places.
Small files are the first one. Streaming jobs, CDC pipelines, frequent appends, and micro-batches create many small files. Query engines then spend too much time opening files, planning splits, and scanning inefficiently. A table that should be read in seconds starts feeling heavy.
Snapshots are the second. Iceberg creates snapshots so readers can see consistent table versions and users can time travel or roll back. That is useful, but old snapshots accumulate unless someone expires them. Over time, they keep metadata and old data references alive.
Manifests are the third. Iceberg tracks data files through metadata files. That is what makes Iceberg reliable and engine-independent, but metadata also needs maintenance. If manifests grow or fragment, query planning slows down before the engine even starts scanning data.
Orphan files are the fourth. Failed writes, aborted jobs, migrations, dropped tables, and imperfect cleanup flows can leave files in object storage that are no longer referenced by Iceberg metadata. Queries do not read them, but storage still bills for them.
Delete files are another common issue, especially with merge-on-read workloads. If they accumulate, every query may pay the cost of applying deletes at read time.
Then there is the engine layer. Spark, Trino, Flink, Athena, Snowflake, Databricks, DuckDB, and other engines do not behave the same way. They have different cost models, latency profiles, concurrency limits, and operational strengths. Managing Iceberg across engines is not the same as managing one Spark pipeline.
This is why “we have Iceberg” is not the same as “we have a managed lakehouse.”
The manual way most teams start
Most teams start with scripts.
A Spark job for compaction.
A scheduled job for snapshot expiration.
A cleanup script for orphan files.
A few dashboards.
Some alerts.
A runbook in a wiki.
A Slack channel where someone asks, “Why is this table slow again?”
There is nothing wrong with this as a starting point. It is how most platforms mature.
The problem is that static maintenance does not understand table behavior.
A daily compaction job does not know whether a table had a quiet day or a massive ingestion spike.
A weekly snapshot cleanup job does not know whether the table has long-running readers, branch retention requirements, or compliance rules.
A manifest rewrite schedule does not know which tables are suffering from planning latency.
A generic orphan cleanup script may be too conservative to reclaim meaningful storage or too aggressive to be safe.
And none of this naturally connects to cost, query performance, engine behavior, or table-level business importance.
Manual maintenance works when you have a small number of tables and a small number of workloads. At lakehouse scale, it becomes operational debt.
What managed Iceberg means
Managed Iceberg means the maintenance loop becomes part of the platform.
Not a one-off script.
Not a quarterly cleanup project.
Not a few jobs someone hopes are still running.
A managed Iceberg layer continuously observes the lakehouse, decides what needs attention, runs the right operation, and records the result.
It should manage the core lifecycle of Iceberg tables:
Compaction.
Snapshot expiration.
Manifest optimization.
Orphan file cleanup.
Delete file handling.
Statistics and metadata optimization.
Table health monitoring.
Policy enforcement.
Engine visibility.
Cost and performance tracking.
The key point is that management should be table-aware and workload-aware.
A hot BI table is not the same as a streaming staging table. A CDC table is not the same as a cold archive table. A table queried by Trino all day is not the same as a table used by a nightly Spark job. A table exposed to AI agents has different risk and cost patterns than an internal batch table.
A managed lakehouse should understand those differences.
The control plane model
A lakehouse control plane is the layer that coordinates operations across storage, Iceberg metadata, catalogs, engines, policies, and observability.
It does not replace Iceberg.
It does not replace your object storage.
It does not force all teams into one query engine.
It gives you one operating layer for the lakehouse.
LakeOps describes this as a control plane for your data lake: end-to-end optimization for tables and metadata across storage and query engines, with telemetry-driven orchestration and visibility in one place.
That distinction matters.
The goal is not to make Iceberg proprietary. The goal is to make open lakehouse operations manageable.
A good control plane should answer questions like:
Which tables are unhealthy?
Which tables are wasting the most storage?
Which tables have the worst small-file problem?
Which tables have metadata planning issues?
Which tables should be compacted now?
Which tables should not be touched because active workloads are running?
Which compaction strategy should be used?
Which snapshots can safely expire?
Which files are safe to delete?
Which engine is best for this workload?
Did the optimization actually improve cost or performance?
If you cannot answer these questions quickly, the lakehouse is being managed manually, even if it has automation scripts.
Solving the small-file problem
Small files are the most visible Iceberg maintenance issue.
They usually come from streaming ingestion, frequent appends, CDC, micro-batches, partition skew, and multi-writer workloads. The result is predictable: more files, more metadata, more object-store requests, more planning work, and slower queries.
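Before fixing anything, it helps to measure the pressure. Iceberg's files metadata table exposes per-file sizes, so a quick check is possible with plain PySpark. A minimal sketch, assuming a configured Iceberg catalog; the table name and the 32 MB threshold are placeholders:

```python
# Gauge small-file pressure from Iceberg's `files` metadata table.
# Assumes an Iceberg catalog is already configured for this Spark session;
# `my_catalog.db.events` and the 32 MB threshold are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

SMALL_FILE_BYTES = 32 * 1024 * 1024  # files under 32 MB count as "small"

stats = spark.read.table("my_catalog.db.events.files").agg(
    F.count("*").alias("total_files"),
    F.avg("file_size_in_bytes").alias("avg_file_bytes"),
    F.sum(
        (F.col("file_size_in_bytes") < SMALL_FILE_BYTES).cast("long")
    ).alias("small_files"),
)
stats.show(truncate=False)
```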
The normal fix is compaction.
But not all compaction is equal.
The simple version is bin-packing: combine many small files into fewer larger files. This is often the right first step because it quickly reduces file count and improves scan efficiency.
The more advanced version is sort-based compaction: rewrite files according to the columns that queries filter or join on most often. This can improve data skipping and reduce scanned data, but it is more workload-sensitive. Sorting everything blindly can waste compute.
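Both strategies map onto Iceberg's built-in rewrite_data_files procedure. A sketch, assuming Spark with Iceberg's SQL extensions enabled; catalog, table, columns, and option values are placeholders:

```python
# Both compaction strategies via Iceberg's Spark procedure.
# Assumes `spark` is a session with Iceberg's SQL extensions enabled.

# Bin-pack: coalesce small files toward a target size.
spark.sql("""
  CALL my_catalog.system.rewrite_data_files(
    table    => 'db.events',
    strategy => 'binpack',
    options  => map('target-file-size-bytes', '536870912',
                    'min-input-files', '5')
  )
""")

# Sort-based: rewrite ordered by the columns queries filter on most,
# to improve data skipping. Only worth it where query patterns justify it.
spark.sql("""
  CALL my_catalog.system.rewrite_data_files(
    table      => 'db.events',
    strategy   => 'sort',
    sort_order => 'event_date, customer_id'
  )
""")

# For merge-on-read tables, recent Iceberg releases also provide
# rewrite_position_delete_files to compact accumulated delete files.
```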
This is where autonomous management becomes useful.
LakeOps includes compaction for Apache Iceberg that uses table metadata and query patterns to decide which files to rewrite and how. The useful part is not only that it compacts. The useful part is that compaction becomes part of a continuous feedback loop.
A practical operating model looks like this:
Start with bin-pack compaction on tables with severe small-file pressure.
Use query-aware sort compaction only where query patterns justify it.
Avoid compacting cold tables just because a schedule says so.
Prioritize tables where compaction will reduce real query cost or latency.
Track before-and-after impact: file count, data scanned, planning time, runtime, and cost.
That is the difference between maintenance and optimization.
Managing snapshots safely
Snapshots are one of the best things about Iceberg.
They enable time travel, rollback, auditability, and consistent reads. But they also create retention work.
Every write creates a new table version. If snapshots are never expired, metadata grows and old data files stay referenced longer than needed. On busy tables, this becomes a real cost and performance issue.
The hard part is not running expire_snapshots.
The hard part is knowing the right policy.
Some tables need long time-travel windows because they support audits, debugging, or recovery. Some tables only need short retention. Some tables may have branches or tags that must be protected. Some workloads may have long-running readers. Some environments need different rules for production, staging, and development.
A managed layer should make this explicit.
LakeOps provides snapshot management as part of table optimization, so retention can be controlled through policies rather than remembered manually per table.
For platform teams, this is a major shift.
Instead of asking, “Did someone remember to clean up snapshots on this table?”, you define retention behavior once, apply it at the right scope, and let the platform enforce it continuously.
A good default might be:
Keep enough snapshots for rollback and debugging.
Retain a minimum number of recent snapshots.
Use longer retention for critical regulated tables.
Use shorter retention for temporary or staging data.
Monitor how much storage is blocked by old snapshots.
Run expiration before orphan cleanup.
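To make one of those defaults concrete, here is what it could look like with Iceberg's expire_snapshots procedure. A sketch only, assuming Spark with Iceberg's SQL extensions enabled; the 7-day window and retain_last value are illustrative, not universal:

```python
# One possible default, expressed with Iceberg's Spark procedure.
# Assumes `spark` is a session with Iceberg's SQL extensions enabled;
# window and retention values are illustrative.
from datetime import datetime, timedelta, timezone

cutoff = datetime.now(timezone.utc) - timedelta(days=7)

spark.sql(f"""
  CALL my_catalog.system.expire_snapshots(
    table       => 'db.events',
    older_than  => TIMESTAMP '{cutoff:%Y-%m-%d %H:%M:%S}',
    retain_last => 20
  )
""")
```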
The exact values depend on the organization. The important thing is that snapshot retention becomes intentional.
Keeping metadata lean with manifest optimization
Iceberg query performance is not only about data files.
It is also about planning.
Before an engine scans data, it reads Iceberg metadata to understand which files belong to the snapshot and which files can be skipped. Manifest files are part of this metadata layer. They are essential, but they can also become fragmented over time.
When manifests grow unevenly or fragment, planning time grows. Users experience this as “the query is slow,” but the engine may be spending most of its time before meaningful scanning even begins.
This is easy to miss if you only look at execution time.
A managed Iceberg system should monitor metadata health directly.
LakeOps includes manifest optimization so teams can consolidate and optimize metadata as part of the same table health loop.
The principle is simple: metadata is part of performance.
If you only compact data files but ignore manifests, you are only managing half the table.
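If you manage this by hand today, the building blocks are Iceberg's manifests metadata table and the rewrite_manifests procedure. A minimal sketch with placeholder names:

```python
# Check manifest fragmentation, then consolidate.
# Assumes `spark` is a session with Iceberg's SQL extensions enabled;
# catalog and table names are placeholders.
manifests = spark.read.table("my_catalog.db.events.manifests")
print("manifest count:", manifests.count())

# Rewrite manifests so planning reads fewer, better-organized metadata files.
spark.sql("CALL my_catalog.system.rewrite_manifests(table => 'db.events')")
```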
Cleaning orphan files without breaking things
Orphan files are a storage leak.
They sit in object storage but are not referenced by Iceberg metadata. Queries do not use them, but the cloud provider still charges for them.
They can appear after failed jobs, aborted commits, manual migrations, dropped tables, incorrect cleanup flows, or maintenance operations that leave old data behind.
The dangerous part is cleanup.
Deleting files from a data lake is easy. Deleting the right files safely is hard.
A safe orphan cleanup process must compare files in storage against Iceberg metadata, apply a conservative age threshold, avoid active write windows, and usually run after snapshot expiration. The age threshold matters because a file that looks unreferenced during an in-progress write may still be committed later.
LakeOps documents orphan file cleanup as a managed operation with metadata awareness and safety controls.
In practice, this is one of the strongest arguments for a control plane.
Nobody wants platform engineers manually reviewing millions of object-store paths. Nobody wants an unsafe script deleting files from production. And nobody wants to keep paying for dead data because cleanup feels risky.
Managed orphan cleanup should be boring, visible, and conservative.
Run a dry run.
Show candidates.
Apply retention thresholds.
Delete only when safe.
Record what was removed.
Measure storage reclaimed.
That is how cleanup becomes an operational capability instead of a dangerous maintenance task.
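With plain Iceberg, that flow maps onto the remove_orphan_files procedure. A sketch under the same placeholder names, with the timestamp purely illustrative:

```python
# Conservative orphan cleanup: dry-run first, review, then delete.
# Assumes `spark` is a session with Iceberg's SQL extensions enabled;
# names and the cutoff timestamp are placeholders.
candidates = spark.sql("""
  CALL my_catalog.system.remove_orphan_files(
    table      => 'db.events',
    older_than => TIMESTAMP '2024-01-01 00:00:00',
    dry_run    => true
  )
""")
candidates.show(truncate=False)  # lists candidate orphan file paths

# Only after reviewing the candidates: re-run with dry_run => false.
```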
Policies are what make this scale
Manual tuning does not scale across hundreds or thousands of tables.
You need policies.
Policies let you define how table maintenance should behave at different scopes: organization, catalog, namespace, table, environment, or workload class.
For example:
Production BI tables get frequent compaction and manifest optimization.
Streaming tables get aggressive small-file management.
CDC tables get delete-file-aware compaction.
Staging tables get short snapshot retention.
Archive tables get minimal compute-heavy optimization but regular storage cleanup.
Critical tables require approvals or simulations before major rewrites.
Development tables get cheaper, more aggressive cleanup.
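To make the scoping idea concrete, here is what policies can look like when expressed as data. This is a hypothetical schema invented for illustration, not a real product API:

```python
# Hypothetical policy schema, purely to make scoping concrete.
# None of these keys come from a real API.
POLICIES = [
    {
        "scope": "prod.analytics",               # namespace-wide default
        "compaction": {"strategy": "binpack", "target_mb": 512},
        "snapshots": {"retain_days": 30, "retain_last": 20},
        "orphan_cleanup": {"older_than_days": 7},
    },
    {
        "scope": "prod.analytics.revenue",       # table-level override
        "compaction": {"strategy": "sort", "sort_order": "order_date"},
        "snapshots": {"retain_days": 90, "retain_last": 50},
        "approval_required": True,               # guardrail for big rewrites
    },
]

def effective_policy(table: str) -> dict:
    """Merge matching policies, most specific scope applied last."""
    merged: dict = {}
    for p in sorted(POLICIES, key=lambda p: p["scope"].count(".")):
        if table == p["scope"] or table.startswith(p["scope"] + "."):
            merged.update({k: v for k, v in p.items() if k != "scope"})
    return merged

print(effective_policy("prod.analytics.revenue"))  # sort + 90-day retention
```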
LakeOps supports policies for maintenance automation, allowing teams to define behavior for compaction, snapshots, manifests, orphan cleanup, and governance across the lakehouse.
This is important because the platform team should not be in the business of hand-tuning every table forever.
Good policies give teams defaults, guardrails, and exceptions.
That is how Iceberg operations become manageable.
Observability turns maintenance into engineering
If maintenance runs but nobody can measure the effect, it is not really managed.
A lakehouse control plane needs observability at the table, engine, and operation level.
You need to see:
Table health.
File count.
Average file size.
Small-file pressure.
Snapshot count.
Manifest count.
Delete file pressure.
Storage waste.
Query latency.
Planning time.
Data scanned.
Engine cost.
Operation history.
Before-and-after optimization impact.
Failed or skipped operations.
Policy coverage.
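Some of these signals can be sampled directly from Iceberg metadata tables, even without a platform. A sketch, assuming a configured Spark session and placeholder names; a real control plane would collect this continuously and keep history:

```python
# Sample a few health signals from Iceberg metadata tables.
# Assumes `spark` is a session with an Iceberg catalog configured.
from pyspark.sql import functions as F

def table_health(table: str) -> dict:
    sizes = spark.read.table(f"{table}.files").agg(
        F.count("*").alias("n"),
        F.avg("file_size_in_bytes").alias("avg_bytes"),
    ).first()
    return {
        "file_count": sizes["n"],
        "avg_file_mb": (sizes["avg_bytes"] or 0) / 1024 / 1024,
        "snapshot_count": spark.read.table(f"{table}.snapshots").count(),
        "manifest_count": spark.read.table(f"{table}.manifests").count(),
    }

print(table_health("my_catalog.db.events"))
```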
LakeOps includes lakehouse observability so platform teams can see table health, engine metrics, cross-system telemetry, and maintenance history from one place.
This changes how you operate.
Instead of waiting for users to complain that dashboards are slow, you can see which tables are drifting.
Instead of guessing whether compaction helped, you can measure the before and after.
Instead of discovering storage waste in a cloud bill, you can identify stale data and orphan files directly.
Good observability turns Iceberg maintenance from reactive firefighting into normal platform engineering.
Multi-engine lakehouses need engine-aware management
The whole point of Iceberg is that many engines can work over the same tables.
That is also what makes operations harder.
Spark may be good for heavy rewrites.
Trino may be better for interactive analytics.
Athena may be useful for serverless access.
Snowflake may serve BI workloads.
Flink may write continuously.
DuckDB may support local or embedded analytics.
Each engine has a different performance model. Each workload has a different latency and cost profile.
A managed lakehouse should not pretend all engines are the same.
LakeOps supports engine management and query routing, giving teams a unified view of engine health, cost, usage, and routing behavior.
This matters because optimization is not only about the table. It is also about where and how workloads run.
Sometimes the best optimization is a better file layout.
Sometimes it is a better engine choice.
Sometimes it is avoiding an expensive engine for simple queries.
Sometimes it is routing a workload away from an unhealthy engine.
A modern lakehouse control plane should see the full system, not just the storage layer.
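To make routing concrete, here is a toy heuristic. The engine names are real products, but the preferences, health check, and logic are invented for this sketch:

```python
# Purely illustrative routing heuristic, not any product's actual logic.
def route(workload_kind: str, unhealthy: set = frozenset()) -> str:
    """Pick an engine for a workload kind, skipping unhealthy engines."""
    preference = {
        "streaming_write": ["flink"],
        "heavy_rewrite":   ["spark"],
        "interactive":     ["trino", "athena"],
        "bi_dashboard":    ["snowflake", "trino"],
    }
    for engine in preference.get(workload_kind, ["spark"]):
        if engine not in unhealthy:
            return engine
    return "spark"  # last-resort default

print(route("interactive"))                       # -> trino
print(route("interactive", unhealthy={"trino"}))  # -> athena
```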
Why continuous optimization beats scheduled maintenance
The old model is schedule-based.
Run compaction every night.
Expire snapshots every Sunday.
Clean orphan files once a month.
Rewrite manifests when someone remembers.
That is better than nothing, but it is not how real workloads behave.
A high-volume table may need attention multiple times a day.
A cold table may not need compaction for months.
A table may become hot because a new dashboard launched.
A backfill may create temporary file pressure.
A failed ingestion job may create orphan files.
A new AI agent may generate many new query patterns.
A static schedule cannot react to that.
Continuous optimization uses telemetry to decide what should happen next.
That is the core value of autonomous lakehouse management.
The lakehouse is not optimized because a cron job ran. It is optimized because the platform understands table state, workload behavior, and cost impact.
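A tiny sketch of that decision step, with invented thresholds and action names:

```python
# Purely illustrative decision step: telemetry in, next operation out.
# Thresholds and action names are invented for this sketch.
def next_action(health: dict):
    if health["small_files"] > 1000 and health["avg_file_mb"] < 32:
        return "compact_binpack"
    if health["snapshot_count"] > 500:
        return "expire_snapshots"
    if health["manifest_count"] > 200:
        return "rewrite_manifests"
    if health["orphan_candidate_gb"] > 100:
        return "orphan_cleanup_dry_run"
    return None  # healthy: keep observing

# Re-run on fresh telemetry every cycle, so a spike gets handled the same
# day instead of waiting for Sunday's cron.
```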
This is where LakeOps is useful in practice. It continuously analyzes the lakehouse, recommends or runs the right operation, and keeps optimizing as workloads change.
For a platform team, this reduces the amount of manual judgment required for routine operations.
You still set policies.
You still define guardrails.
You still decide what level of autonomy is acceptable.
But you are no longer manually chasing every unhealthy table.
A practical rollout plan
The best way to adopt managed Iceberg is not to turn everything on everywhere.
Start with visibility.
Connect the catalogs and engines. Let the platform observe table health, file layout, metadata, snapshots, and query behavior. Identify the worst tables by cost, latency, file count, metadata weight, and storage waste.
Then start with a small set of high-impact tables.
Good candidates are usually:
Large tables with many small files.
Hot BI tables with growing latency.
Streaming or CDC tables with constant writes.
Tables with high object-store request cost.
Tables with many snapshots.
Tables where users already complain about performance.
Apply conservative policies first.
Use bin-pack compaction before sort compaction.
Use dry runs for cleanup operations.
Set safe snapshot retention.
Run orphan cleanup with conservative age thresholds.
Measure everything.
Only after you see stable improvements should you expand to more tables, more aggressive compaction, sort optimization, and autonomous mode.
The point is not to give control away. The point is to move from manual table-by-table work to policy-driven operations with visibility.
What LakeOps solves, problem by problem
If you maintain Iceberg yourself, you eventually build pieces of a control plane internally.
You build table health checks.
You build compaction jobs.
You build snapshot cleanup.
You build orphan cleanup.
You build dashboards.
You build job orchestration.
You build policy conventions.
You build alerts.
You build runbooks.
You build engine-specific scripts.
You build cost reports.
Then you maintain all of that.
LakeOps packages that operating layer into one platform.
For small files, it provides autonomous compaction and layout optimization.
For slow queries, it optimizes file sizes, sort layout, manifests, and routing decisions.
For snapshot bloat, it manages retention policies.
For orphan files, it performs safe metadata-aware cleanup.
For fragmented metadata, it rewrites and optimizes manifests.
For multi-engine complexity, it gives one view of engines and can route workloads based on cost, latency, or throughput.
For operational visibility, it surfaces table health, engine metrics, events, recommendations, and optimization history.
For governance, it gives policies, auditability, and controlled automation.
For adoption risk, it works with the existing lakehouse stack instead of requiring pipeline rewrites or data movement.
That last point is important.
A control plane should reduce operational burden without becoming a migration project.
What to keep managing yourself
Autonomous management does not mean the platform team disappears.
You still own architecture.
You still own data modeling.
You still decide retention requirements.
You still define governance boundaries.
You still choose which engines belong in the platform.
You still control policies and exceptions.
You still review critical workloads.
The difference is where your time goes.
Instead of manually compacting tables, you define compaction policies.
Instead of hunting stale files, you monitor cleanup impact.
Instead of guessing why queries slowed down, you inspect table and engine telemetry.
Instead of writing one-off Spark jobs, you operate the lakehouse as a managed platform.
That is a better use of senior data platform engineering time.
The benefits beyond cost and performance
Cost and performance are the obvious wins.
Fewer small files means less scan overhead.
Cleaner metadata means faster planning.
Expired snapshots and orphan cleanup reduce storage waste.
Better layout reduces data scanned.
Better routing reduces unnecessary compute.
But there are other benefits that matter just as much.
Reliability improves because maintenance is consistent instead of ad hoc.
Governance improves because policies are explicit.
Debugging improves because every operation is visible.
Onboarding improves because new tables inherit sane defaults.
Security improves when access and actions are auditable.
Capacity planning improves because table growth and engine behavior are observable.
AI readiness improves because agents query cleaner, faster, better-governed tables.
Team focus improves because engineers stop spending so much time on repetitive maintenance.
These benefits compound.
A lakehouse that is continuously maintained becomes easier to trust.
Common mistakes when managing Iceberg manually
The first mistake is treating compaction as the whole problem. Compaction is important, but it does not replace snapshot expiration, manifest optimization, orphan cleanup, delete-file handling, or observability.
The second mistake is applying the same policy to every table. A staging table and a production revenue table should not have the same retention and optimization strategy.
The third mistake is running cleanup without a safety model. Orphan cleanup especially needs conservative thresholds and visibility.
The fourth mistake is ignoring metadata. Data files get attention because they are visible, but manifests and snapshots often explain planning latency and storage drift.
The fifth mistake is optimizing for one engine while the table is used by many engines.
The sixth mistake is not measuring impact. If you cannot show what changed after maintenance, you cannot tune the lakehouse intelligently.
The seventh mistake is waiting for incidents. Iceberg degradation is often gradual. By the time users complain, the table may have been unhealthy for weeks.
The target operating model
A modern Iceberg lakehouse should operate like this:
Tables are continuously monitored.
Health is measured at the file, metadata, snapshot, storage, and query level.
Policies define maintenance behavior.
Compaction runs when it has measurable value.
Snapshot expiration follows retention rules.
Orphan cleanup is safe and auditable.
Manifest optimization keeps planning fast.
Engine behavior is visible.
Query routing can account for cost and latency.
Optimization history is recorded.
Engineers can override, approve, or inspect operations.
The system improves continuously.
That is managed Iceberg.
Not a hosted table format.
Not a black box.
Not a replacement for engineering judgment.
A control plane that takes the repetitive, error-prone, high-volume operational work and turns it into policy-driven automation.
Final thoughts
Iceberg is a strong foundation for the modern lakehouse, but it is not the whole platform.
Once Iceberg becomes production infrastructure, the work shifts from “how do we create tables?” to “how do we keep hundreds or thousands of tables healthy while many engines and workloads use them?”
That is where many teams feel the pain.
Manual maintenance is fine at the beginning. Scripts are fine at the beginning. But as the lakehouse grows, entropy wins unless something is continuously managing the system.
Managed Iceberg is the next layer.
It means compaction, snapshots, manifests, orphan files, engines, policies, observability, and cost optimization are handled as one operating system for the lakehouse.
LakeOps is built around that idea: autonomous lakehouse management for Apache Iceberg, running on top of the stack teams already use.
For data platform engineers, the value is simple.
You keep Iceberg open.
You keep your storage.
You keep your engines.
You keep control.
But you stop managing the lakehouse one table, one script, and one incident at a time. Thanks for reading! :)