joni sar

Posted on May 29 • Originally published at lakeops.dev

Netflix's Intelligent Lakehouse Solves Iceberg Maintenance: You Can Too, Easily

#dataengineering #devops #opensource #aws

Every production Iceberg data lake eventually hits the same wall: tables that looked fast at 10 GB start crawling at 10 TB. Small files pile up from streaming ingestion, snapshots accumulate because nobody set expiration, orphaned data lingers from failed Spark jobs, and manifest lists grow until planning a simple SELECT takes longer than running it.

Netflix hit this wall years ago — and their solution shaped how the industry thinks about lakehouse architecture. At AWS re:Invent, their engineers walked through the ecosystem they assembled around Iceberg: Polaris for catalog management, Autotune for automated compaction, janitors for continuous cleanup, and Metacat for observability. The outcome was a 25% cost reduction and tables that stayed healthy without manual intervention.

But Netflix had something most teams don't: a dedicated platform organization building custom distributed services backed by CockroachDB, Kafka, and fleets of Spark clusters.

Today, a lakehouse control plane — just as good as Netflix's or better — is available for everyone to install on their Iceberg lakehouse.

The industry's favorite solution is LakeOps.

This article breaks down what actually makes a lakehouse "intelligent" — component by component — and shows how each piece maps to tooling that exists today.

With LakeOps, any team is about 10 minutes away from an intelligent lakehouse. And yes, that includes autonomous snapshot expiration, orphan file cleanup, metadata and manifest optimization, and more.

The maintenance gap nobody talks about

Apache Iceberg solved the table format problem. Schema evolution, hidden partitioning, time travel, snapshot isolation — these features are why every major engine from Snowflake to DuckDB now speaks Iceberg natively.

What Iceberg intentionally left unsolved is who runs the maintenance. The format gives you powerful primitives. Keeping those primitives performing well at scale is your responsibility.

In practice, this creates a silent degradation cycle:

Streaming writes produce small files — a pipeline appending every 5 minutes to 100 partitions creates 100 new files per commit. After a week, some partitions contain thousands of sub-megabyte Parquet files.
Snapshots grow unbounded — without explicit expiration, every commit adds a snapshot. A table with hourly writes accumulates 8,760 snapshots per year, each referencing its own manifest list.
Orphan files accumulate — aborted Spark jobs, failed compaction runs, and expired snapshots leave behind data files that no snapshot references. These files cost storage but serve nothing.
Manifests fragment — as files are added and removed, the manifest layer becomes a web of small manifest files. Query planning reads every one of them before scanning a single data file.
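The growth rates in the list above follow from simple arithmetic. Here is a minimal sketch of it, assuming (as the streaming example does) that every commit adds exactly one file to each partition it touches. The function names are illustrative only.

```python
MINUTES_PER_DAY = 24 * 60


def files_per_partition(commit_interval_minutes: int, days: int) -> int:
    """Files one partition accumulates if every commit adds one file to it."""
    commits_per_day = MINUTES_PER_DAY // commit_interval_minutes
    return commits_per_day * days


def snapshots_per_year(commits_per_day: int) -> int:
    """Snapshots retained after a year if nothing is ever expired."""
    return commits_per_day * 365


if __name__ == "__main__":
    # 5-minute appends for one week: 288 commits per day * 7 days
    print(files_per_partition(5, 7))
    # Hourly commits for a year: 24 * 365
    print(snapshots_per_year(24))
```

With 100 partitions, the one-week figure multiplies out to roughly 200,000 small files for a single table.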

The financial impact compounds from four directions: storage waste (orphans + snapshots), compute waste (scanning small files), metadata overhead (fragmented manifests), and engineering time (maintaining cron scripts that break silently).

Netflix's insight was that solving these problems one at a time, with isolated scripts, doesn't scale. You need an integrated system — a control plane that sees the full picture and acts on it continuously.

The six components of an intelligent lakehouse

Looking across Netflix's published architecture and detailed breakdowns of their Iceberg ecosystem, six capabilities separate an intelligent lakehouse from a collection of Iceberg tables with maintenance scripts taped to the side:

1. Universal catalog connectivity

Netflix built Polaris to replace the Hive Metastore with a catalog purpose-built for Iceberg — scalable, CockroachDB-backed, and supporting the Iceberg REST catalog specification for multi-engine access.

Most teams aren't replacing their catalog. They're running AWS Glue, or they adopted a REST catalog like Nessie or Lakekeeper early on, or they have tables spread across multiple catalogs in different regions.

An intelligent lakehouse connects to existing catalogs — Glue, DynamoDB, REST (Polaris, Nessie, Lakekeeper, Gravitino), S3 Tables, or custom implementations — discovers every namespace and table, and normalizes metadata into a single operational view. No catalog migration required.

LakeOps does exactly this: point it at your catalog credentials, and within minutes it inventories every table and starts collecting metadata signals.
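To make "normalizes metadata into a single operational view" concrete, here is a rough sketch of the idea. Everything in it is hypothetical: `CatalogAdapter`, `InMemoryCatalog`, and `build_inventory` are names invented for illustration and say nothing about how LakeOps is implemented. The in-memory adapter stands in for a real Glue or REST client so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class TableRecord:
    """One row of the unified inventory, regardless of source catalog."""
    catalog: str
    namespace: str
    name: str

    @property
    def qualified_name(self) -> str:
        return f"{self.catalog}.{self.namespace}.{self.name}"


class CatalogAdapter(Protocol):
    """Hypothetical interface: one adapter per catalog type (Glue, REST, ...)."""
    name: str

    def list_namespaces(self) -> Iterable[str]: ...

    def list_tables(self, namespace: str) -> Iterable[str]: ...


class InMemoryCatalog:
    """Stand-in adapter so the sketch runs without a real catalog."""

    def __init__(self, name: str, tables: dict[str, list[str]]):
        self.name = name
        self._tables = tables

    def list_namespaces(self) -> Iterable[str]:
        return self._tables.keys()

    def list_tables(self, namespace: str) -> Iterable[str]:
        return self._tables[namespace]


def build_inventory(catalogs: Iterable[CatalogAdapter]) -> list[TableRecord]:
    """Walk every catalog and namespace, returning one sorted table list."""
    inventory = [
        TableRecord(c.name, ns, t)
        for c in catalogs
        for ns in c.list_namespaces()
        for t in c.list_tables(ns)
    ]
    return sorted(inventory, key=lambda r: r.qualified_name)
```

The point of the adapter boundary is that everything downstream (health scoring, policies, maintenance) only ever sees `TableRecord`, never a catalog-specific type.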

2. Intelligent and efficient compaction that actually works

Netflix's Autotune watches for table write events through SQS and spins up Spark jobs to compact small files in the background. It's the core of their self-maintaining architecture.

The Spark-based approach works but carries significant overhead. You need provisioned compute clusters, JVM tuning, IAM roles for each cluster, and someone on-call for job failures. Spark compaction typically costs around $50 per TB processed.

A Rust-based alternative changes the economics entirely. LakeOps runs compaction with a native engine built on Apache DataFusion — no JVM, no cluster provisioning, no shuffle stages.

It reads Iceberg metadata, plans optimal merges, and writes compacted Parquet directly to your storage. Production benchmarks show roughly $5/TB — a 10x cost reduction over Spark.

On top of that, LakeOps plans compaction around actual query patterns, so files are laid out to minimize I/O. That cuts CPU and storage costs by up to 80% compared to Spark or S3 Tables.

It also coordinates all operations and events with Adaptive Maintenance to maximize results and cut time and costs. The sequence matters, and event- or trigger-driven ops are much smarter and more efficient than cron-based jobs.

Two strategies cover every workload:

Binpack — combines small files targeting optimal file sizes (~512 MB). Handles most tables well with minimal configuration.
Sort — reorders data by query-relevant columns so engines skip irrelevant row groups through predicate pushdown. Dramatic speedups for tables with clear access patterns.
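The Binpack strategy is easy to picture as a grouping problem. The sketch below uses first-fit-decreasing, a textbook bin-packing heuristic, to plan which small files to merge. It is an illustration of the idea under simplifying assumptions, not the planner LakeOps or Iceberg actually uses: real planners also account for partitions, delete files, and minimum group sizes. `plan_binpack` is a made-up name.

```python
TARGET_BYTES = 512 * 1024 * 1024  # ~512 MB target mentioned in the text


def plan_binpack(file_sizes: list[int], target: int = TARGET_BYTES) -> list[list[int]]:
    """Group files so each group's total size stays at or under the target.

    Uses first-fit-decreasing: largest files first, each placed in the
    first group that still has room.
    """
    groups: list[list[int]] = []
    totals: list[int] = []
    for size in sorted(file_sizes, reverse=True):
        for i, total in enumerate(totals):
            if total + size <= target:
                groups[i].append(size)
                totals[i] += size
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size])
            totals.append(size)
    # A group holding a single file gains nothing from a rewrite, so drop it.
    return [g for g in groups if len(g) > 1]
```

Each returned group would become one rewritten output file. Files already at or above the target end up alone in a group and are left untouched.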

Each table can be run manually first (configure → Execute → review results) and then switched to automated scheduling with a cron expression. No all-or-nothing commitment.

3. Metadata lifecycle automation

Netflix runs dedicated "janitor" services for orphan cleanup and snapshot expiration. Without them, their exabyte-scale lake would drown in stale metadata and unreferenced files.

The same operations (snapshot retention, orphan removal, manifest consolidation, and file compaction) need to run continuously on any production Iceberg lake. LakeOps provides all four as per-table operations with independent configuration:

Operation	What it does	Why it matters
Snapshot retention	Expires snapshots beyond a retention period, respecting min counts	Reclaims metadata, enables cleanup
Orphan file cleanup	Removes files unreferenced by any snapshot (with age threshold)	Recovers wasted storage
Manifest optimization	Consolidates fragmented manifests	Speeds up query planning
File compaction	Merges small files (Binpack or Sort)	Reduces scan overhead and S3 API costs

The execution order matters: expire first, then clean orphans, then compact, then consolidate manifests. Running them out of sequence wastes compute or risks removing files still in use.
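For teams running this sequence themselves, the order above can be expressed with Iceberg's built-in Spark procedures. The sketch below only builds the SQL strings, so it runs without Spark. Executing them requires a Spark session with an Iceberg catalog configured. The retention values and the `maintenance_statements` helper are assumptions for illustration, and this is the Spark route, not the LakeOps engine described earlier.

```python
from datetime import datetime, timedelta


def maintenance_statements(
    catalog: str,
    table: str,
    now: datetime,
    snapshot_retention_days: int = 7,   # assumed retention window
    retain_last: int = 10,              # assumed minimum snapshot count
    orphan_age_days: int = 3,           # assumed orphan age threshold
    target_file_size_bytes: int = 512 * 1024 * 1024,
) -> list[str]:
    """Build Iceberg Spark procedure calls in the order the text recommends."""

    def cutoff(days: int) -> str:
        return (now - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")

    prefix = f"CALL {catalog}.system"
    return [
        # 1. Expire old snapshots first.
        f"{prefix}.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff(snapshot_retention_days)}', "
        f"retain_last => {retain_last})",
        # 2. Remove unreferenced files older than the age threshold.
        f"{prefix}.remove_orphan_files(table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff(orphan_age_days)}')",
        # 3. Compact small data files.
        f"{prefix}.rewrite_data_files(table => '{table}', strategy => 'binpack', "
        f"options => map('target-file-size-bytes', '{target_file_size_bytes}'))",
        # 4. Consolidate manifests last, after the file set has settled.
        f"{prefix}.rewrite_manifests(table => '{table}')",
    ]
```

In a job, each string would be passed to `spark.sql(...)` in list order, stopping at the first failure so later steps never run against a half-maintained table.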

When you want all four automated together, Adaptive Maintenance bundles them into a single data-driven policy that reacts to table activity — the closest equivalent to Netflix's integrated approach.

4. Full-stack observability without building a pipeline

Netflix's Metacat provides unified metadata access across all datasets, backed by Kafka event streams for real-time operational visibility. Building this took years and a dedicated team.

Out-of-the-box observability should include:

Table health classification — every table scored as Healthy, Warning, or Critical based on file counts, size distributions, snapshot accumulation, and metadata fragmentation
AI-generated insights — ranked recommendations that flag small-file hotspots, excessive snapshots, and missing retention before they become incidents
Event audit trail — every maintenance operation recorded with before/after metrics, timestamps, and status — per-table or lake-wide, filterable by catalog and operation type
Dashboard — total operations, query speed gains, cost savings, and resource reduction in a single view
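A health classification like the one in the list above can be as simple as a few thresholds where the worst signal wins. The sketch below shows that shape. The thresholds and the `TableStats` fields are invented for illustration and are not LakeOps's actual scoring rules.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "Healthy"
    WARNING = "Warning"
    CRITICAL = "Critical"


@dataclass(frozen=True)
class TableStats:
    data_files: int
    avg_file_size_mb: float
    snapshots: int
    manifests: int


def _level(value: float, warn: float, crit: float, higher_is_worse: bool = True) -> int:
    """Return 0 (ok), 1 (warning), or 2 (critical) for one signal."""
    if not higher_is_worse:
        # Flip signs so "smaller is worse" reuses the same comparison.
        value, warn, crit = -value, -warn, -crit
    if value >= crit:
        return 2
    if value >= warn:
        return 1
    return 0


def classify(stats: TableStats) -> Health:
    """Score each signal against assumed thresholds; the worst one decides."""
    levels = [
        _level(stats.data_files, warn=10_000, crit=100_000),
        _level(stats.avg_file_size_mb, warn=64, crit=8, higher_is_worse=False),
        _level(stats.snapshots, warn=500, crit=5_000),
        _level(stats.manifests, warn=200, crit=2_000),
    ]
    return [Health.HEALTHY, Health.WARNING, Health.CRITICAL][max(levels)]
```

Taking the maximum rather than an average means one badly degraded dimension (say, a year of unexpired hourly snapshots) cannot be masked by healthy values elsewhere.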


The difference from building it yourself is time-to-value: connect a catalog and immediately see what's degraded, what's wasting money, and what to fix first.

5. Policy-driven governance that scales with the lake

Configuring maintenance table-by-table stops working somewhere between 50 and 100 tables. Netflix needed organization-wide rules; so does every team that's past the proof-of-concept stage.

A policy engine lets you define maintenance rules at the catalog or namespace level — snapshot retention every hour, orphan cleanup daily, compaction at 2 AM — and every table in scope inherits them automatically. New tables that appear in a governed catalog get the right configuration without anyone touching them.

Two policy categories cover the ground:

Maintenance policies — schedule and configure any operation (or all of them via Adaptive Maintenance) across a scope
Configuration policies — enforce table settings like Iceberg format version, file format, and write distribution mode

Per-table overrides always take precedence, so you set sensible defaults broadly and customize only where needed.
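The inheritance rule (catalog defaults, then namespace rules, then per-table overrides) amounts to a layered dictionary merge. This is a minimal sketch of that resolution logic. The policy keys and the `resolve_policy` function are hypothetical and only illustrate the precedence order described above.

```python
from typing import Any, Mapping

Policy = Mapping[str, Any]


def resolve_policy(
    catalog: str,
    namespace: str,
    table: str,
    catalog_policies: Mapping[str, Policy],
    namespace_policies: Mapping[tuple[str, str], Policy],
    table_overrides: Mapping[tuple[str, str, str], Policy],
) -> dict[str, Any]:
    """Merge policies from broadest to narrowest scope; later layers win."""
    effective: dict[str, Any] = {}
    effective.update(catalog_policies.get(catalog, {}))
    effective.update(namespace_policies.get((catalog, namespace), {}))
    effective.update(table_overrides.get((catalog, namespace, table), {}))
    return effective


if __name__ == "__main__":
    catalog_policies = {"glue": {"snapshot_retention_days": 7, "compaction": "binpack"}}
    namespace_policies = {("glue", "events"): {"snapshot_retention_days": 3}}
    table_overrides = {("glue", "events", "clicks"): {"compaction": "sort"}}
    print(resolve_policy("glue", "events", "clicks",
                         catalog_policies, namespace_policies, table_overrides))
```

A table that appears later under `glue.events` with no override of its own resolves to the namespace and catalog values automatically, which is the "new tables inherit everything" behavior.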

6. Multi-engine query routing

Netflix connects engines through the REST catalog endpoint, but routing decisions — which engine handles which query — remain manual architecture choices in most organizations.

An intelligent routing layer dispatches queries to the best engine based on the workload:

Cost-optimized — sends queries to the cheapest engine that meets your latency SLA
Latency-optimized — picks the fastest engine for the query shape
Throughput-optimized — distributes load for maximum concurrency

Applications connect to a single SQL endpoint (Postgres wire, MySQL wire, or Arrow Flight). When an engine goes down, failover reroutes automatically. When you add or remove engines, application code doesn't change.
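The three routing strategies and the failover behavior can be sketched as a single selection function. This is a toy model, not QueryFlux's implementation: the `Engine` fields, the strategy names, and the static cost and latency estimates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Engine:
    name: str
    cost_per_query: float        # assumed static estimate
    expected_latency_ms: float   # assumed static estimate
    active_queries: int
    healthy: bool = True


def route(engines: list[Engine], strategy: str,
          latency_sla_ms: Optional[float] = None) -> Engine:
    """Pick an engine for a query; unhealthy engines are skipped (failover)."""
    candidates = [e for e in engines if e.healthy]
    if not candidates:
        raise RuntimeError("no healthy engine available")

    if strategy == "cost":
        # Cheapest engine that still meets the latency SLA, if one is set.
        within_sla = [e for e in candidates
                      if latency_sla_ms is None or e.expected_latency_ms <= latency_sla_ms]
        if not within_sla:
            raise LookupError("no healthy engine meets the latency SLA")
        return min(within_sla, key=lambda e: e.cost_per_query)
    if strategy == "latency":
        return min(candidates, key=lambda e: e.expected_latency_ms)
    if strategy == "throughput":
        # Spread load by choosing the least busy engine.
        return min(candidates, key=lambda e: e.active_queries)
    raise ValueError(f"unknown strategy: {strategy}")
```

Because the application only calls `route` (or, in practice, connects to the proxy endpoint), adding or removing an entry in the engine list changes routing without touching application code.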

LakeOps handles this through QueryFlux, an open-source Rust SQL proxy that translates SQL dialects with sqlglot and adds ~0.35 ms of overhead.

What this adds up to

Each component is useful independently. Together, they compound:

Storage costs drop 40–55% from continuous orphan removal, bounded snapshots, and compaction
Compute costs drop up to 75% from Rust-native compaction replacing Spark clusters, sort-order optimization reducing scan volume, and routing hitting the cheapest viable engine
Query latency improves up to 12x through optimized file sizes, sorted layouts, consolidated manifests, and Puffin column statistics
Engineering hours shift from maintaining scripts and debugging overnight failures to building data products

Beyond human users: AI agent access

The next layer of intelligence is enabling AI agents to interact with lakehouse data programmatically. LakeOps provides an MCP (Model Context Protocol) interface that gives agents structured access — table discovery, SQL execution through the routing layer, column statistics without scanning, and maintenance triggers — all within configurable guardrails.

You can enforce read-only access, row limits, PII masking, cost caps, and human approval per agent. As agent usage grows, their query telemetry feeds back into compaction decisions — tables agents query most get optimized first, with sort orders aligned to the predicates agents actually use. The lake self-optimizes as AI adoption scales.
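To give a feel for what per-agent guardrails mean in practice, here is a deliberately naive sketch covering three of the controls listed above: read-only access, row limits, and human approval. It is not the LakeOps MCP interface. `Guardrails`, `check_query`, and `cap_rows` are hypothetical names, and the first-keyword check is a stand-in for real SQL parsing (for example, some dialects allow `WITH` to precede a write).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    read_only: bool = True
    max_rows: int = 1000
    requires_approval: bool = False


# Naive allow-list of statement keywords treated as reads (an assumption).
READ_KEYWORDS = {"SELECT", "WITH", "SHOW", "DESCRIBE", "EXPLAIN"}


def check_query(sql: str, rails: Guardrails, approved: bool = False) -> tuple[bool, str]:
    """Return (allowed, reason) for an agent-submitted statement."""
    stripped = sql.strip()
    if not stripped:
        return False, "empty statement"
    first = stripped.split(None, 1)[0].upper()
    if rails.read_only and first not in READ_KEYWORDS:
        return False, f"{first} blocked: agent is read-only"
    if rails.requires_approval and not approved:
        return False, "waiting for human approval"
    return True, "allowed"


def cap_rows(rows: list, rails: Guardrails) -> list:
    """Truncate a result set to the agent's row limit."""
    return rows[: rails.max_rows]
```

A production version would parse the statement properly and enforce the limits inside the query engine rather than after the fact, but the control points are the same.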

Getting started

Netflix took years to build their intelligent lakehouse with dedicated teams.

The same architecture is now accessible in about ten minutes.

Visit lakeops.dev:

Connect your catalog — Glue, DynamoDB, REST, S3 Tables, or Custom. Every table is discovered automatically.
Optimize a few tables — run compaction or snapshot expiration manually, review results, then flip to automated scheduling.
Scale with policies — define rules at the catalog or namespace level. New tables inherit everything.
Monitor — the dashboard shows real-time impact, insights flag what needs attention, events provide the audit trail.

Your data never leaves your account. No agents to install, no pipelines to change, no infrastructure to provision.

The intelligent lakehouse is no longer reserved for companies that can build Netflix-scale infrastructure. The building blocks are here. The question is whether your tables are maintained — or quietly degrading while you read this.
