Year-end 2025 reflections, looking ahead to 2026
Over the past few years, data platforms have crossed a tipping point. Rigid, centralized warehouses proved great at trustworthy BI but struggled with diverse data and elastic scale. Open data lakes delivered low-cost storage and freedom of choice, yet lacked the transactional rigor and performance guarantees analytics teams rely on. In 2025, the data lakehouse matured from an idea into an operating model: open table formats, transactional metadata, and multi-engine access over a single, governed body of data.
This guide distills what changed, why it matters, and how to put it to work. We’ll start by clarifying where warehouses shine and crack, where lakes empower and swamp, and why older directory-based table designs (think classic Hive tables) hit scaling and consistency limits. From there, we’ll show how modern table formats (Apache Iceberg, Delta Lake, Apache Hudi, and Apache Paimon) solved those limits by tracking files and snapshots instead of folders, enabling ACID transactions, time travel, and intelligent pruning at petabyte scale.
2025 also cemented a practical reference architecture. A successful lakehouse now looks less like a monolith and more like a layered system: cloud object storage for durability and cost, an open table format for transactions and evolution, ingestion that blends batch and streaming, a catalog for governance and discoverability, and a flexible consumption layer that serves SQL, BI, notebooks, and AI agents with consistent semantics.
Why now? Three forces converged this year:
1) Streaming-by-default workloads turned “daily batch” into “continuous micro-batch,” demanding exactly-once commits and small-file management.
2) AI and agentic workflows moved from proofs of concept to production, generating highly variable, ad-hoc queries that require low-latency acceleration without brittle hand-tuning.
3) Open interoperability became table stakes: organizations want one source of truth read by many engines, not many copies of truth managed by many teams.
This guide is an accessible deep dive for Data Engineers and Data Architects. You’ll get a clear mental model of the formats, their metadata structures, and the operational playbook: compaction, snapshot expiration, partition evolution, and reflection/materialization strategies for speed at scale. We’ll also survey the ingestion and streaming ecosystem (connectors, CDC, stream processors), Python-native options for lakehouse workloads (Polars, DuckDB, DataFusion, Daft, Dask), and emerging edge patterns where inference runs close to the data.
Finally, we’ll close with a curated reading list, books and long-form resources that stood out in 2025, and pragmatic guidance on choosing components in 2026 without locking yourself in. If your mandate is to deliver trustworthy, performant, and AI-ready analytics on open data, this guide is your map.
The Challenges in Modern Data Architecture
The rise of the lakehouse didn’t happen in a vacuum. It emerged as a response to the very real challenges of yesterday’s dominant architectures, data warehouses and data lakes. Understanding their strengths and weaknesses sets the stage for why the lakehouse model became inevitable.
Data Warehouses: Strength in Structure, Weakness in Flexibility
Data warehouses provided the first true enterprise-scale analytics platforms. They enforced schema-on-write, ensuring data quality and making business intelligence consistent across the organization. For years, this was invaluable: clean, curated, trusted dashboards.
But the cracks widened in the 2010s and 2020s:
- Rigid schemas: Every change to a source system meant heavy ETL work to keep the warehouse schema in sync. New data types (JSON, images, sensor streams) didn’t fit neatly into tables.
- High costs: Warehouses couple compute and storage. Scaling for more data or users often meant overpaying for resources you didn’t fully use.
- Latency in data freshness: The ETL pipelines that fed warehouses ran daily or hourly, leaving decision-makers working with stale data.
- Limited AI/ML support: Warehouses excel at structured SQL queries but aren’t designed to handle the diverse, unstructured, and large-scale data needed for machine learning.
Warehouses solved consistency but at the price of agility.
Data Lakes: Flexibility Meets the “Data Swamp”
Enter data lakes. By shifting to schema-on-read, organizations gained the freedom to store anything: logs, media, documents, semi-structured JSON, raw database dumps. Storage costs plummeted thanks to cloud object stores like S3 and ADLS, and data scientists loved having raw, unmodeled data at their fingertips.
But flexibility introduced new pain:
- Data quality issues: With no enforced schema, data lakes quickly devolved into “data swamps”, vast, uncurated collections of files that few trusted.
- Poor governance: Security, lineage, and access controls were bolted on, often inconsistently. Teams struggled to know what data existed and whether it was safe to use.
- Performance bottlenecks: Query engines like Hive, Spark, or Presto had to scan massive directories of files. Without transactional guarantees, concurrent writes could corrupt datasets or leave analysts with incomplete results.
- High operational overhead: Managing partitions, small files, and manual compactions became part of daily operations.
Lakes solved agility but at the price of trust.
The gap was clear: warehouses offered trust but no flexibility, while lakes offered flexibility but no trust. By the early 2020s, organizations wanted the best of both, structured reliability and open flexibility, laying the foundation for the modern data lakehouse.
What Is Hive and the Challenges of Hive Tables
Before the lakehouse era, Apache Hive was the workhorse that made large-scale data in Hadoop clusters queryable with SQL. Hive introduced the Hive Metastore, which stored table definitions (schemas, partitions, and locations), enabling analysts to run SQL-like queries over files sitting in HDFS or cloud storage. It was one of the first major attempts to give a data-lake-like environment a relational feel.
But Hive’s approach, tracking directories of files as tables, brought structural limitations that became bottlenecks as datasets and expectations grew.
Directory-Centric Table Management
In Hive, each table maps to a folder, and each partition to a subfolder. Query engines scan these directories at runtime to discover files. While this worked when data volumes were modest, modern cloud object stores made directory scans painfully slow. Listing millions of files before executing a query often dominated total query time.
Lack of ACID Transactions
Hive tables were essentially append-only. Without built-in transactions, concurrent writers risked corrupting tables, and readers could encounter partial data during an update. Later ACID extensions attempted to patch this with delta files and compaction, but these added complexity and overhead, and weren’t consistently supported across engines.
Painful Updates and Schema Evolution
Modifying data in Hive tables was inefficient:
- Updates and deletes required rewriting entire partitions or entire tables.
- Schema changes (like renaming a column) often broke downstream jobs or forced costly rewrites.
The result: rigid datasets that were expensive to maintain and slow to evolve with business needs.
The Small Files Problem
Hive ingestion pipelines, especially those running frequently, created floods of small files. Query performance degraded because engines had to open and read from thousands of tiny files. Without built-in small-file management, engineers had to implement periodic compaction jobs to maintain performance.
Hive was a critical stepping stone: it proved the value of SQL on big data and inspired the metadata-driven approach all lakehouse formats now follow. But its reliance on directory-based tracking and limited support for transactional, evolving workloads ultimately constrained its ability to power the next generation of data platforms.
The Innovation of Tracking Tables by Tracking Files vs. Tracking Directories
The turning point from Hive-style tables to modern lakehouse formats came with a deceptively simple idea:
stop tracking directories of files, and start tracking individual files in metadata.
Why Directories Fall Short
Directory-based tracking (as in Hive) meant that the engine had to:
- List every file in a partition directory before running a query.
- Infer table state from the file system at query time.
This created major problems in cloud storage, where operations like LIST are slow and expensive. It also made concurrency hard: two jobs writing to the same folder could overwrite each other’s files without the catalog knowing until it was too late.
File-Level Tracking
Modern table formats introduced file-level manifests: structured metadata that explicitly records every file that belongs to a table, along with statistics about its contents. Instead of scanning folders, engines read this compact metadata to know exactly which files to use.
Benefits include:
- Faster planning: Queries skip expensive directory listings, instead reading a few manifest files that describe thousands of data files.
- Atomic commits: Updates create a new manifest (or snapshot) in a single operation. Readers either see the old version or the new one, never a half-written state.
- Schema evolution: Metadata can track multiple schema versions, allowing columns to be added, renamed, or dropped without rewriting entire datasets.
- Partition independence: Partitioning is recorded in metadata, not folder structures, enabling hidden partitions and even partition evolution over time.
- Fine-grained deletes and upserts: Since every file is individually tracked, formats can support row-level operations by marking old files as deleted and adding new ones.
Snapshots and Time Travel
By treating metadata itself as a versioned object, table formats unlocked time travel: the ability to query data as it existed at any point in time. Each snapshot references a specific set of files, creating a complete, immutable view of the table at that moment.
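To make time travel concrete, here is a minimal sketch using pyiceberg; the catalog name and table identifier are placeholder assumptions, and the same idea applies to Delta’s versioned reads or Hudi’s timeline.

```python
from pyiceberg.catalog import load_catalog

# Assumes a catalog named "default" is configured (e.g., in ~/.pyiceberg.yaml)
# and a table "analytics.orders" exists; both names are placeholders.
catalog = load_catalog("default")
table = catalog.load_table("analytics.orders")

# Each history entry points at a snapshot: an immutable view of the table
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)

# Read the table as it existed at a chosen snapshot
old_snapshot_id = table.history()[0].snapshot_id
df = table.scan(snapshot_id=old_snapshot_id).to_arrow()
print(df.num_rows)
```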
This shift, from directories to files, from implicit state to explicit metadata, transformed raw data lakes into reliable, database-like systems. It’s the foundation that made the lakehouse architecture possible and paved the way for the new generation of table formats.
The New Generation of Data Lake Tables: Iceberg, Delta, Hudi, and Paimon
With file-level tracking as the breakthrough, several open-source projects emerged to redefine how data lakes operate. These table formats provide the transactional, metadata-rich foundation that transforms a raw data lake into a full-fledged lakehouse. Each project shares core principles, ACID transactions, schema evolution, and time travel, but emphasizes different strengths.
Apache Iceberg
Born at Netflix and now a top-level Apache project, Iceberg is designed for engine-agnostic interoperability. Its hierarchical metadata structure (table metadata → manifest lists → manifest files) allows scaling to billions of files while still enabling fast query planning.
Key features:
- Hidden partitioning and partition evolution.
- Broad engine support (Spark, Flink, Trino, Presto, Dremio, and more).
- Strong focus on openness through the REST catalog API.
- Rich schema evolution, including column renames and type promotions.
Iceberg has become a de facto standard for enterprises seeking an open, future-proof lakehouse.
Delta Lake
Originally created by Databricks, Delta Lake popularized the concept of a transactional log for data lakes. It uses an append-only transaction log (_delta_log) with JSON entries and Parquet checkpoints to track file state.
Key features:
- ACID transactions tightly integrated with Apache Spark.
- Time travel and schema evolution.
- Optimizations like Z-Ordering for clustering.
- Deep integration with the Databricks ecosystem, though community adoption beyond Spark is growing.
Delta remains particularly strong for teams standardized on Databricks or Spark-centric workflows.
Apache Hudi
Developed at Uber, Hudi was one of the earliest attempts to bring database-like capabilities to data lakes. It excels at incremental processing and change data capture (CDC).
Key features:
- Two storage modes: Copy-on-Write (CoW) for read-optimized workloads, and Merge-on-Read (MoR) for write-heavy, near-real-time use cases.
- Native upserts and deletes.
- Built-in indexing for record-level operations.
- Tight integrations with Spark, Flink, and Hive.
Hudi is especially attractive for pipelines that demand frequent updates and streaming ingestion.
Apache Paimon
A newer entrant, Paimon (formerly Flink Table Store) emphasizes streaming-first lakehouse design. It uses an LSM-tree style file organization to unify batch and stream processing.
Key features:
- Native CDC and incremental queries.
- Deep integration with Apache Flink.
- Snapshot isolation with continuous compaction.
- Growing ecosystem to support Spark, Hive, and beyond.
Paimon fills a niche where real-time data ingestion and analytics converge, making it compelling for event-driven architectures.
Together, these formats represent the evolution from directory-based tables to transactional, metadata-driven lakehouse systems. Each brings a unique philosophy: Iceberg for openness, Delta for Spark-native simplicity, Hudi for streaming updates, and Paimon for unified batch-stream processing. Understanding their trade-offs is critical when designing a modern data platform.
Fundamental Architecture of the Data Lakehouse
At its core, the data lakehouse is not a single product but an architectural pattern. It blends the scalability and openness of data lakes with the transactional reliability and governance of data warehouses. By 2025, a consensus emerged: a lakehouse succeeds when it clearly defines layers, each with its own role but working together as a cohesive whole.
Storage as the Foundation
The lakehouse begins with cloud object storage (e.g., Amazon S3, Azure Data Lake Storage, Google Cloud Storage). This layer offers low-cost, durable, infinitely scalable storage for all file types. Unlike warehouses, it decouples compute from storage, multiple engines can read from the same data without duplicating it.
Metadata and Table Formats
On top of storage sits a table format, the metadata layer that turns a set of files into a logical table. Formats like Iceberg, Delta, Hudi, and Paimon bring:
- ACID transactions
- Schema enforcement and evolution
- Partition pruning and statistics for efficient queries
- Time travel through snapshot-based metadata
This layer is what transforms a “data swamp” into structured, queryable datasets.
Catalog and Governance
A catalog connects metadata to the outside world. It tracks what tables exist and their locations, while enforcing access policies and governance rules. Think of it as the bridge between storage and consumption. Examples include Hive Metastore, AWS Glue, Unity Catalog, Dremio Catalog and open-source options like Nessie or Apache Polaris.
Compute and Federation
Query engines like Dremio, Trino, Spark, and Flink sit on top, accessing tables via the catalog. These engines provide federation, joining and querying data from multiple systems, and execute transformations, BI queries, or machine learning pipelines. The lakehouse architecture allows multiple engines to share the same data without conflict.
Consumption and Semantics
Finally, end users connect through BI dashboards, notebooks, or AI systems. A semantic layer often sits here, defining consistent metrics and business concepts across tools. This ensures a “single version of truth” for everyone consuming data.
This layered design, storage, table format, catalog, compute, and consumption, has become the reference architecture for the modern data lakehouse. It solves the warehouse vs. lake tradeoff by delivering flexibility, trust, and performance in one unified stack.
Metadata Structures Across Modern Table Formats
While all lakehouse table formats share the principle of tracking files rather than directories, each one implements its own metadata architecture. Understanding these differences is crucial for choosing the right format and for operating them at scale.
Apache Iceberg
- Snapshots: Every commit creates a new snapshot that references a set of files.
- Manifests: Each snapshot points to manifest lists, which then point to manifest files. These manifest files contain the actual list of data files, along with stats like min/max values for columns.
- Table Metadata File: A JSON/Avro file storing schema versions, partition specs, snapshot history, and pointers to the current snapshot.
- Strengths: Hierarchical design scales to billions of files, supports hidden partitioning, and makes time travel lightweight.
Delta Lake
- Transaction Log: All operations are recorded in an append-only _delta_log directory as JSON files.
- Checkpoints: Periodically, the log is compacted into Parquet checkpoint files for faster reads.
- Table State: Current table state is reconstructed by combining the latest checkpoint with newer JSON entries.
- Strengths: Simple, linear log model tightly integrated with Spark; efficient for workloads within the Databricks ecosystem. A short Python sketch of inspecting the log follows below.
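As a quick illustration of that log model, here is a minimal Python sketch using the deltalake (delta-rs) package; the table path is a placeholder assumption.

```python
from deltalake import DeltaTable

# Placeholder path; any Delta table in object storage or on local disk works
dt = DeltaTable("s3://bucket/warehouse/orders")

print(dt.version())    # current table version derived from _delta_log
print(dt.files()[:5])  # active data files referenced by the latest state

# Recent commit entries recorded in the transaction log
for commit in dt.history(5):
    print(commit.get("version"), commit.get("operation"))
```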
Apache Hudi
- Timeline: A series of commit, deltacommit, and compaction files in a .hoodie directory describe changes.
- Storage Modes:
  - Copy-on-Write (CoW): rewrites files on update.
  - Merge-on-Read (MoR): writes delta logs and later compacts them with base files.
- Indexing: Optional record-level indexes accelerate upserts and deletes.
- Strengths: Optimized for streaming ingestion and CDC use cases, with incremental pull queries.
Apache Paimon
- LSM-Tree Inspired: Uses log segments and compaction levels, optimized for high-frequency updates.
- Snapshots: Metadata tracks current file sets and supports branching for consistent queries.
- Changelog Streams: Natively emits row-level changes for downstream streaming consumers.
- Strengths: Built for unified batch and streaming, with strong Flink integration.
In summary:
- Iceberg emphasizes scalability and cross-engine interoperability.
- Delta Lake focuses on simplicity and Spark-native performance.
- Hudi delivers real-time upserts and incremental views.
- Paimon pioneers streaming-first design with changelogs.
These metadata designs reflect each project’s philosophy, and they form the backbone of how the modern lakehouse balances flexibility, consistency, and speed.
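For a feel of how this metadata is exposed programmatically, here is a small Iceberg-flavored sketch with pyiceberg; the catalog and table names are assumptions, and the other formats expose equivalent handles (Delta’s DeltaTable, Hudi’s timeline, Paimon’s snapshots).

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")               # assumed catalog name
table = catalog.load_table("analytics.orders")  # assumed table identifier

print(table.schema())                      # current schema, with field IDs
print(table.spec())                        # current partition spec
print(table.metadata.current_snapshot_id)  # pointer to the active snapshot

# Recent pyiceberg releases also expose metadata tables such as
# table.inspect.snapshots(), table.inspect.manifests(), and table.inspect.files()
```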
Implementing a Lakehouse: The Five Core Layers
Designing a modern lakehouse isn’t about choosing a single tool, it’s about assembling the right components across five architectural layers. Each layer has its own responsibilities, and together they create a system that is scalable, governed, and usable for analytics and AI.
A deep dive into these five layers is the core of the book "Architecting an Apache Iceberg Lakehouse".
1. Storage Layer
This is the foundation: low-cost, durable storage capable of holding structured, semi-structured, and unstructured data.
- Common choices: Amazon S3, Azure Data Lake Storage, Google Cloud Storage, or on-prem HDFS/MinIO.
- Data is stored in open formats such as Parquet, ORC, or Avro.
- Separation of storage from compute allows multiple engines to share the same data without duplication.
2. Table Format Layer
Here, the metadata format gives structure and reliability to raw files.
- Options: Apache Iceberg, Delta Lake, Apache Hudi, Apache Paimon.
- Capabilities include ACID transactions, schema evolution, partition pruning, and time travel.
- This layer transforms the lake from a “data swamp” into a transactional system of record.
3. Ingestion Layer
The ingestion layer handles data movement into the lakehouse.
- Batch ingestion: Tools like Fivetran, Airbyte, Estuary, Hevo or custom ETL jobs land data periodically.
- Streaming ingestion: Systems like Confluent, Aiven, StreamNative, RisingWave, Kafka, Redpanda, Pulsar, or Flink push events into table formats in near real-time.
- Goal: balance freshness, cost, and reliability while avoiding problems like excessive small files.
4. Catalog & Governance Layer
The catalog is the central registry of your tables, schemas, and access rules.
- Examples: Hive Metastore, AWS Glue, Unity Catalog, Dremio Catalog, open-source catalogs like Nessie or Apache Polaris.
- Responsibilities: discovery, schema validation, access control, lineage, and auditability.
- Acts as the bridge between storage and compute, ensuring data is both secure and discoverable.
5. Federation & Consumption Layer
At the top, query engines and semantic layers make data consumable.
- Query federation engines like Dremio or Trino can join lakehouse tables with other sources.
- Consumption tools include BI platforms (Tableau, Power BI, Looker), notebooks, and AI agents.
- A semantic layer ensures consistency by defining metrics and business terms across all tools.
In practice: these five layers form the blueprint of every successful lakehouse. They separate concerns (storage, metadata, movement, governance, and consumption) while enabling interoperability. The result is a unified platform that scales with data growth, adapts to new workloads, and keeps analytics both flexible and trustworthy.
Lakehouse Ingestion
Once the foundational layers are in place, the next challenge is getting data into the lakehouse efficiently and reliably. Ingestion strategies determine not only data freshness but also table health, file organization, and downstream usability.
Batch Ingestion
Batch remains the most common entry point:
- ETL/ELT Services: Tools like Fivetran and Airbyte extract data from SaaS applications, relational databases, and APIs, then land it in cloud object storage. Many now write directly into open table formats (Iceberg, Delta, Hudi) rather than dumping raw CSVs.
- Custom Jobs: Python, Spark, or dbt pipelines often transform and load data on schedules, nightly, hourly, or in micro-batches.
- Advantages: Predictable loads, simpler monitoring, and often easier cost control.
- Challenges: Data freshness is limited by schedule, and frequent batches can generate lots of small files if not managed.
Streaming Ingestion
Real-time data is no longer a luxury, it’s an expectation in 2026:
- Event Streams: Platforms like Apache Kafka, Redpanda, Aiven, Confluent, RisingWave, StreamNative, and Apache Pulsar capture streams of events (e.g., clickstream, IoT data) and push them into the lakehouse using connectors or stream processors.
- CDC Pipelines: Change data capture tools (Debezium, Estuary Flow) replicate updates from operational databases into Iceberg or Delta tables with low latency.
- Stream Processing Engines: Apache Flink and Spark Structured Streaming can apply transformations inline, then commit results directly to lakehouse tables.
Small File Management
One critical concern in ingestion is avoiding a small files problem:
- Each tiny file adds overhead to query planning.
- Solutions include writer-side batching, file-size thresholds, and downstream compaction jobs (a writer-side batching sketch follows this list).
- Modern ingestion platforms often integrate with the table format’s APIs to commit larger, optimized files.
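Here is one way to think about writer-side batching, sketched with pyarrow; the threshold, record shape, and landing path are all assumptions, and in production you would commit through your table format’s writer API rather than dropping loose Parquet files.

```python
import pyarrow as pa
import pyarrow.parquet as pq

TARGET_ROWS = 1_000_000  # tune so files land in the ~128 MB-1 GB range
buffer: list[dict] = []

def handle_event(record: dict) -> None:
    """Accumulate incoming records and flush them as one larger file."""
    buffer.append(record)
    if len(buffer) >= TARGET_ROWS:
        flush()

def flush() -> None:
    """Write the buffered records as a single Parquet file and reset."""
    global buffer
    if not buffer:
        return
    batch = pa.Table.from_pylist(buffer)
    # Placeholder destination; a real pipeline would commit via Iceberg/Delta/Hudi writers
    pq.write_table(batch, "landing/events-batch.parquet")
    buffer = []
```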
Reliability and Governance
Ingestion isn’t just about moving bytes, it’s about ensuring trustworthy pipelines:
- Idempotency: Re-runs shouldn’t create duplicates.
- Schema Drift Handling: New source columns should be gracefully added to lakehouse tables with metadata updates.
- Monitoring: Data observability platforms (Monte Carlo, Bigeye) can alert when loads fail or data volumes deviate unexpectedly.
In short, ingestion is where data quality meets data freshness. A strong strategy combines batch tools for breadth (ingesting from many SaaS and DB sources) with streaming pipelines for depth (real-time operational data), all while keeping file sizes healthy and metadata consistent.
Lakehouse Streaming
If 2025 was the year of “batch meets real-time,” then 2026 is the year of streaming-first lakehouses. Instead of treating streaming as an afterthought, the modern lakehouse expects ingestion, processing, and query serving to happen continuously. This shift is powered by both table format features (incremental commits, changelogs) and by the streaming ecosystem maturing around open lakehouse standards.
Confluent
As the commercial steward of Apache Kafka, Confluent has led in making streams and tables converge. Their Tableflow and Stream Designer products now write directly to Iceberg and Delta Lake, providing exactly-once guarantees and seamless CDC ingestion. This reduces the need for custom Flink or Spark jobs: Kafka topics become queryable lakehouse tables in real time.
Aiven
Aiven, a managed open-source data platform provider, has expanded its Kafka, Flink, and Postgres services with native Iceberg integrations. Their goal: give teams a turnkey way to capture events, run stream transformations, and land results directly into a governed lakehouse, without stitching together multiple vendors.
Redpanda
Redpanda brings Kafka-API-compatible streaming with higher throughput and lower latency, and in 2025 introduced Iceberg Topics. With this feature, every topic can materialize into an Iceberg table automatically, combining log storage with table metadata. This means developers can treat the same data as both a stream and a table, depending on the workload.
StreamNative
Built around Apache Pulsar, StreamNative pushes the lakehouse deeper into event-driven architectures. Pulsar’s tiered storage, combined with integrations for Iceberg and Delta, means historical message backlogs can be instantly queryable as tables. Their work on unifying messaging and lakehouse storage blurs the boundary between stream broker and data platform.
RisingWave
RisingWave focuses on streaming databases: continuously maintaining materialized views over streams. Its integration with Iceberg allows those real-time views to be published directly into the lakehouse, governed alongside batch data. This bridges operational analytics (e.g., monitoring metrics in near real time) with historical analytics in the same architecture.
Other Notables
- Materialize: A streaming database that outputs real-time materialized views, often targeting data lakes and warehouses as sinks.
- ksqlDB: Kafka-native SQL for defining streaming transformations, which can also materialize tables into downstream lakehouse storage.
- Apache Flink: Still the backbone of many custom streaming-to-lakehouse pipelines, powering advanced transformations before committing results to Iceberg, Hudi, or Delta.
Takeaway: Streaming is no longer bolted onto the lakehouse, it is embedded. Whether through Kafka, Redpanda, Pulsar, Flink, or streaming databases like RisingWave and Materialize, streams now flow directly into transactional tables. The result is a lakehouse where batch and real-time are not two separate worlds but a single, unified system delivering always-fresh data.
Lakehouse Catalogs: Architecture, Compatibility & When to Use Which
A lakehouse catalog is the control plane for your open tables: it tracks metadata locations and permissions, and exposes standard APIs to every engine. Below is a concise, practitioner-focused map of today’s major options and how they fit into a multi-engine, multi-cloud lakehouse.
Apache Polaris (Incubating)
What it is: An open-source, fully featured Apache Iceberg REST catalog designed for vendor-neutral, multi-engine interoperability. Backed by multiple vendors and born from a cross-industry push to standardize the Iceberg catalog layer.
Where it shines: Teams standardizing on Iceberg who want a portable, community-governed catalog that Spark, Flink, Trino, Dremio, StarRocks/Doris can all use via the Iceberg REST API.
Notable: Open governance and REST-by-default avoid lock-in and simplify multi-engine access. Polaris can also federate other catalogs, with support for other table sources coming soon.
Apache Gravitino (Incubating)
What it is: A geo-distributed, federated metadata lake that manages metadata in place across heterogeneous sources (file stores, RDBMS, streams) and exposes a unified view to engines like Spark/Trino/Flink.
Where it shines: Hybrid/multi-cloud estates with multiple catalogs and sources that need one governance and discovery layer without migrations.
Notable: “Catalog of catalogs” approach; can present Iceberg/Hive/Paimon/Hudi catalogs under one umbrella.
AWS Glue Data Catalog (+ Lake Formation)
What it is: AWS’s managed Hive-compatible catalog with first-party governance (Lake Formation) and native support for Iceberg/Delta/Hudi tables in S3, consumed by Athena, EMR, Redshift Spectrum, and Glue jobs.
Where it shines: All-in AWS lakehouses needing centralized metadata and fine-grained access control enforced across AWS analytics services.
Notable: Managed, integrated, and convenient, cloud-specific by design.
Microsoft OneLake Catalog (Fabric)
What it is: The central catalog for Microsoft Fabric’s “OneLake”, a tenant-wide, Delta-native lake with unified discovery (“Explore”) and governance (“Govern”) experiences.
Where it shines: Fabric-centric stacks that want a single catalog for Spark, SQL, Power BI, and Real-Time Analytics over Delta tables in ADLS/OneLake.
Notable: Deeply integrated SaaS experience; shortcuts/mirroring help connect external sources, but it’s Azure/Fabric-scoped.
Google BigLake (Metastore + Iceberg)
What it is: Google’s open lakehouse layer: BigLake Metastore catalogs Iceberg tables on GCS; BigQuery reads them natively while Spark/Flink and other engines use the Iceberg REST interface.
Where it shines: GCP stacks wanting warehouse-grade operations (BigQuery) over open Iceberg tables stored in customer buckets with multi-engine access.
Notable: Managed table maintenance and unified governance via Dataplex/BigLake; Iceberg-first approach.
Project Nessie
What it is: A Git-like, transactional catalog for data lakes that adds branches, tags, time-travel, and cross-table commits on top of Iceberg.
Where it shines: Teams needing dev/test isolation, reproducibility, or multi-table atomic commits in an Iceberg lakehouse.
Notable: Works with Spark, Flink, Trino, Dremio; deploy anywhere (K8s/containers). Complements standard catalogs with versioning semantics.
Unity Catalog (Open Source)
What it is: An open-sourced universal catalog for data & AI assets with multi-format (Delta, Iceberg via REST/UniForm, files) and multi-engine ambitions; compatible with Hive Metastore API and Iceberg REST.
Where it shines: Enterprises seeking broad governance (tables, files, functions, ML models) and consistent policies across engines/clouds.
Notable: Recent updates brought external-engine reads for Iceberg via REST to general availability and added a write preview, expanding interoperability.
Lakekeeper
What it is: A lightweight, Rust-based Apache Iceberg REST catalog focused on speed, security (OIDC/OPA), and simplicity; Apache-licensed.
Where it shines: Teams wanting a small, fast Iceberg catalog they can self-host, integrate with Trino/Spark, and plug into modern authz.
Notable: Ecosystem-first design; good fit for DIY open lakehouses and CI/CD-style deployments.
Quick Guide: Picking the Right Catalog
- Open, vendor-neutral Iceberg core: Polaris (add Nessie if you need Git-style branching & multi-table commits).
- Federate many sources/catalogs across regions/clouds: Gravitino.
- Deep cloud integration: Glue/Lake Formation (AWS), OneLake Catalog (Azure/Fabric), BigLake Metastore (GCP) or Dremio Catalog (Managed Polaris Service from Dremio).
- Broad data & AI governance (tables + files + models): Unity Catalog (OSS); growing multi-engine support including Iceberg REST.
Tip: For multi-engine Iceberg lakehouses, a common pattern is: Polaris as the primary REST catalog for engines, with Nessie layered in when you need branches/isolated environments. Cloud-native teams may still register those tables in their cloud catalogs for service-level features (e.g., Athena/BigQuery/Power BI), but keep the source of truth open.
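To show what "REST-by-default" looks like from client code, here is a hedged pyiceberg sketch of connecting to an Iceberg REST catalog such as Polaris; the endpoint, credential, and warehouse values are placeholders and will differ per deployment.

```python
from pyiceberg.catalog import load_catalog

# Placeholder endpoint and credentials; any Iceberg REST catalog
# (Polaris, Lakekeeper, a cloud-managed REST endpoint) follows this pattern.
catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "https://polaris.example.com/api/catalog",
        "credential": "client-id:client-secret",
        "warehouse": "analytics",
    },
)

print(catalog.list_namespaces())
table = catalog.load_table("analytics.orders")
print(table.schema())
```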
Lakehouse Optimization
Once a lakehouse is in production, the focus shifts from building to sustaining performance and efficiency at scale. Without ongoing optimization, query times creep up, storage costs balloon, and data reliability weakens. The key is to manage both physical data layout and metadata growth.
Compaction and Small File Management
Frequent batch loads and streaming pipelines often generate thousands of small Parquet or ORC files.
- Problem: Query engines spend more time opening files than scanning data.
- Solution: Table formats support compaction actions, rewriting many small files into fewer large ones (hundreds of MBs to 1GB).
- Examples (a PySpark sketch follows this list):
  - Iceberg’s rewriteDataFiles action merges small files efficiently (Dremio also has an OPTIMIZE command for Iceberg tables).
  - Delta Lake offers the OPTIMIZE command (with Z-Ordering for clustering).
  - Hudi provides asynchronous background compaction for MoR tables.
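As a rough sketch of what triggering compaction looks like, here is a PySpark example using Iceberg’s rewrite_data_files procedure; the catalog name, table identifier, and target file size are assumptions, and Delta users would reach for OPTIMIZE instead.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog named "lake"
spark = SparkSession.builder.getOrCreate()

# Rewrite many small files into fewer large ones (~512 MB target here)
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'analytics.orders',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Delta equivalent (placeholder path): spark.sql("OPTIMIZE delta.`s3://bucket/orders`")
```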
Snapshot Expiration and Metadata Cleanup
Modern formats keep snapshots for time travel, but unchecked, these create metadata bloat.
- Iceberg: expireSnapshots safely removes old snapshots and associated data files.
- Delta Lake: VACUUM cleans up unreferenced files after a retention period.
- Hudi: The timeline service supports configurable retention of commits and delta logs.
Regular cleanup keeps both storage and query planning efficient; a PySpark sketch of scheduled maintenance follows below.
Partitioning and Clustering
Good partition design reduces data scanned per query.
- Iceberg: Hidden partitions abstract complexity away from end users.
- Delta: Z-Ordering clusters data across dimensions for multidimensional pruning.
- Hudi: Can cluster records within files to optimize MoR query performance.
Partition evolution, changing partition strategy over time without breaking old data, is now supported by most formats and prevents schema rigidity.
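For Iceberg, partition evolution is a metadata-only operation expressed in SQL; the sketch below assumes the Iceberg Spark SQL extensions are enabled and uses placeholder catalog, table, and column names.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Iceberg SQL extensions enabled

# Existing files keep their old layout; only new writes follow the new partition spec
spark.sql("ALTER TABLE lake.analytics.orders ADD PARTITION FIELD days(order_ts)")
spark.sql("ALTER TABLE lake.analytics.orders DROP PARTITION FIELD bucket(16, customer_id)")
```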
Query Acceleration
Beyond storage optimization, query acceleration techniques deliver speed at scale.
- Dremio's Reflections, and materialized views in other platforms, provide always-fresh, cache-like performance boosts without manual tuning.
- Column stats and bloom filters stored in metadata allow engines to skip files entirely when filters exclude them.
- Vectorized execution and Arrow-based memory models reduce CPU costs across query engines.
Format-Specific Optimizations
- Iceberg: Manifest merging, hidden partitioning, and metadata caching.
- Delta: Frequent checkpoints and file skipping using data skipping indexes.
- Hudi: Incremental queries for consuming only new changes.
- Paimon: Continuous compaction to reconcile streaming write amplification.
Bottom line: Lakehouse optimization is not a one-off task, it’s an ongoing discipline. By managing file sizes, pruning metadata, evolving partitions, and using acceleration features, organizations keep performance predictable and costs controlled, even as data volumes and workloads scale into 2026.
The Intelligent Data Lakehouse Built for Agentic AI with Dremio
By the end of 2025, the conversation about data platforms shifted from “how do we manage data?” to “how do we make data intelligent and AI-ready?” This is where the intelligent lakehouse comes in, and where Dremio stands out as the reference implementation.
From Static Analytics to Agentic Workloads
Traditional BI queries are predictable: weekly reports, dashboards, and KPIs. AI-driven workloads are not. Agentic AI systems (large language models and autonomous agents) generate dynamic, ad-hoc queries that span datasets in unpredictable ways. This requires:
- Consistent low-latency responses.
- A platform that can optimize itself without human intervention.
- Seamless integration between structured data, semantic meaning, and AI agents.
Dremio as the Intelligent Lakehouse
Dremio is more than a query engine; it’s a self-optimizing lakehouse platform built natively on Apache Iceberg and Arrow. Key capabilities include:
- Reflections: Always-fresh materializations that accelerate queries automatically. Unlike traditional materialized views, reflections are invisible to end users: the optimizer decides when to use them, making acceleration adaptive to changing workloads.
- Semantic Layer: A unified place to define datasets, metrics, and business concepts. This ensures that whether it’s an analyst writing SQL or an AI agent generating queries, results remain consistent and governed.
- AI-Ready APIs: Through Arrow Flight and REST endpoints, Dremio streams data directly into Python, notebooks, or AI frameworks with zero-copy efficiency. This bridges the gap between analytics and machine learning pipelines.
- Open Standards: By embracing Iceberg, Polaris (for catalogs), and Arrow, Dremio ensures interoperability, your AI agents or external engines can interact with the same governed data without lock-in.
- Agentic Analytics: Together, these capabilities let agentic AI applications connect to Dremio through Dremio's MCP server, enabling agentic analytics. (A minimal Arrow Flight example follows below.)
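Here is a minimal sketch of pulling query results over Arrow Flight with pyarrow; the host, credentials, and query are placeholder assumptions (32010 is a commonly used Dremio Flight port, but check your deployment).

```python
from pyarrow import flight

# Placeholder coordinator endpoint and credentials
client = flight.FlightClient("grpc+tls://dremio.example.com:32010")
token = client.authenticate_basic_token("analyst", "password")
options = flight.FlightCallOptions(headers=[token])

query = "SELECT region, SUM(amount) AS revenue FROM sales.orders GROUP BY region"
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)

# Stream results back as Arrow record batches (zero-copy into pandas/Polars if desired)
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()
print(table.num_rows)
```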
Why This Matters for Agentic AI
AI agents thrive on autonomy and adaptability. They need a platform that:
- Handles ever-changing queries without brittle pre-optimizations.
- Keeps acceleration aligned with shifting patterns (autonomous reflections).
- Provides governed access so that AI doesn’t hallucinate unauthorized or inconsistent definitions of metrics.
- Scales seamlessly from small exploratory prompts to massive training-data extractions.
In short: Dremio delivers the intelligent lakehouse, a platform that not only stores and serves data but actively adapts to how humans and AI consume it. As agentic AI moves from hype to everyday practice in 2026, this intelligence layer will be the key to transforming raw data into reliable, actionable, and AI-ready insights.
Python for the Lakehouse
Python has become the lingua franca of modern data engineering and data science, and the lakehouse ecosystem is no exception. By 2026, a rich set of Python-first tools and frameworks have emerged that make it easier to ingest, process, analyze, and serve data directly from open table formats like Apache Iceberg, Delta, and Hudi. These tools not only enable lightweight experimentation but also power production-grade pipelines that rival traditional big data stacks.
DuckDB
Often described as the “SQLite for analytics,” DuckDB is an in-process analytical database that excels at local workloads:
- Direct Parquet & Iceberg Reads: DuckDB can query Parquet files and integrate with Iceberg catalogs, making it a natural fit for small-to-medium lakehouse use cases.
- Speed: Its vectorized execution engine makes it extremely fast for analytical queries on a single machine.
- Python Integration: Native bindings allow seamless use within notebooks or Python apps.
DuckDB has become the go-to for prototyping, ad hoc exploration, and embedding analytics directly into applications.
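A minimal sketch of what that looks like in practice, assuming placeholder paths and that the httpfs and iceberg extensions are available in your DuckDB build:

```python
import duckdb

con = duckdb.connect()

# Remote object storage access (credentials and region config omitted for brevity)
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")

# Query Parquet files directly from storage; the path is a placeholder
con.sql("SELECT count(*) FROM read_parquet('s3://bucket/landing/events/*.parquet')").show()

# Scan an Iceberg table via the iceberg extension (table location is a placeholder)
con.sql("INSTALL iceberg")
con.sql("LOAD iceberg")
con.sql("SELECT * FROM iceberg_scan('s3://bucket/warehouse/analytics/orders') LIMIT 10").show()
```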
Dask
Dask is a parallel computing framework for Python that scales workflows from laptops to clusters.
- Flexible API: Works with familiar NumPy, pandas, and scikit-learn APIs while distributing workloads.
- Lakehouse Integration: Reads and writes Parquet, and combined with Iceberg connectors, it enables scalable transformations on lakehouse data.
- Ecosystem Fit: Useful for machine learning preprocessing and large-scale data transformations where Spark might be overkill.
Dask democratizes distributed compute for teams already invested in Python.
Daft
A newer entrant, Daft positions itself as a distributed data processing engine optimized for AI and ML workloads.
- Arrow-Native: Built on Apache Arrow for fast columnar in-memory processing.
- Flexible Backends: Runs locally or on clusters, supporting both CPUs and GPUs.
- Lakehouse Ready: Reads directly from Parquet and Iceberg sources, enabling high-performance pipelines that integrate analytics and ML training.
Daft is gaining traction for teams that want a modern, Pythonic alternative to Spark for big data and AI-centric workflows.
Bauplan
Bauplan Labs brings a serverless, Python-first lakehouse approach.
- Pipeline-as-Code: Data pipelines are written in Python and executed in a serverless runtime that scales automatically.
- Version Control for Data: Bauplan integrates Iceberg tables with Git-like branching via catalogs like Nessie, making schema and data versioning first-class features.
- Developer Experience: With Arrow under the hood, Bauplan emphasizes reproducibility, modular pipelines, and minimal infrastructure overhead.
Bauplan is designed for teams that want the power of the lakehouse without the complexity of managing heavy infrastructure.
In practice:
- DuckDB is the Swiss Army knife for local analytics.
- Dask scales familiar Python workflows across clusters.
- Daft brings Arrow-native distributed compute optimized for AI.
- Bauplan simplifies pipeline execution with a serverless lakehouse model.
Together with libraries like Polars, Ibis, and SQLFrame, these tools give Python developers an end-to-end toolkit for building, maintaining, and consuming modern data lakehouses.
Graphs in the Data Lakehouse with PuppyGraph
While the data lakehouse excels at tabular and relational analytics, many real-world problems are graph-shaped: fraud rings, identity networks, supply chains, lineage tracking, and recommendation systems. Traditionally, these problems required loading data into a specialized graph database, an extra layer of ETL and storage that added cost and complexity. PuppyGraph changes this equation by bringing graph analytics directly into the lakehouse.
What is PuppyGraph?
PuppyGraph is a cloud-native graph engine designed to run on top of existing data in your lakehouse. Instead of requiring a proprietary graph database, PuppyGraph lets you query your Iceberg, Delta, Hudi, or Hive tables as a graph. It connects directly to open table formats, relational databases, and warehouses, automatically sharding and scaling queries without duplicating data. This means you can turn your existing datasets into a graph model in minutes, with no ETL.
Integration with the Lakehouse
PuppyGraph integrates seamlessly with:
- Apache Iceberg (including REST catalogs like Tabular or Polaris)
- Delta Lake and Apache Hudi
- Hive Metastore and AWS Glue
- Databases and Warehouses such as PostgreSQL, MySQL, Redshift, BigQuery, and DuckDB
Each source is treated as a catalog. PuppyGraph lets you define a graph schema across one or many catalogs, effectively federating multiple data sources into a single graph. For example, you can link customer nodes in PostgreSQL with transaction edges in Iceberg, all without moving the data.
Querying Graphs at Scale
Because PuppyGraph queries directly against Parquet-backed tables, you can run multi-hop traversals and graph algorithms over your lakehouse data. It supports popular graph query languages:
- Gremlin (Apache TinkerPop)
- openCypher
This ensures compatibility with existing graph tooling and reduces the learning curve. Performance is optimized for large, complex traversals: PuppyGraph has demonstrated 6-hop traversals over hundreds of millions of edges in under a second. Cached mode allows even faster repeated queries, often surpassing the performance of traditional graph databases.
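Because the endpoint speaks standard Gremlin, the usual TinkerPop driver works; the sketch below is illustrative only, with a hypothetical endpoint, credentials, and graph schema (Account vertices linked by uses_device edges).

```python
from gremlin_python.driver import client, serializer

# Hypothetical PuppyGraph Gremlin endpoint and credentials
gremlin = client.Client(
    "ws://puppygraph.example.com:8182/gremlin",
    "g",
    username="puppygraph",
    password="password",
    message_serializer=serializer.GraphSONSerializersV3d0(),
)

# Two-hop traversal: other accounts reachable from a flagged account via shared devices
query = (
    "g.V().has('Account', 'account_id', '42')"
    ".out('uses_device').in('uses_device').dedup().values('account_id')"
)
accounts = gremlin.submit(query).all().result()
print(accounts)
gremlin.close()
```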
Use Cases
- Fraud Detection: Traverse transaction graphs in real time to uncover hidden fraud rings.
- Cybersecurity: Model logins, access patterns, and network flows as a graph to detect threats.
- Supply Chain Optimization: Connect suppliers, shipments, and logistics into a graph for bottleneck analysis.
- Customer 360: Combine relational and behavioral data into a graph to better understand customer journeys.
- Graph + AI: PuppyGraph supports Graph RAG (Retrieval Augmented Generation), enabling LLMs and agents to query structured relationships for better context and reasoning.
Why It Matters
By plugging directly into the lakehouse, PuppyGraph removes the wall between tabular and graph analytics. Data engineers and architects can:
- Avoid data duplication and ETL pipelines into separate graph stores.
- Keep governance and security consistent via existing catalogs.
- Support SQL, BI, and graph queries side-by-side on the same data.
In essence, PuppyGraph makes the lakehouse not just the foundation for relational analytics and AI, but also a native home for graph workloads, all with the same open formats and scalable storage.
Edge Inference for the Lakehouse: Spice AI
As organizations embrace AI-driven applications, the edge has become a critical deployment target. Instead of sending all data to centralized clusters, inference can increasingly happen close to where data is generated: IoT devices, factories, mobile applications, or regional data centers. The lakehouse, traditionally viewed as a central hub, is now extending outward. Platforms like Spice AI make this possible.
Why Edge Inference Matters
- Latency: Inference needs to happen in milliseconds, not seconds. Shipping every query to the cloud adds unacceptable delays for use cases like predictive maintenance or fraud detection.
- Cost Efficiency: Processing locally reduces bandwidth and cloud compute costs, especially when dealing with high-volume sensor or event data.
- Resilience: Edge inference continues to function even with intermittent network connectivity, syncing back to the lakehouse when available.
- Privacy & Compliance: Processing data locally helps meet regulatory requirements by minimizing the movement of sensitive information.
Spice AI at the Edge
Spice AI positions itself as an operational data lakehouse tailored for real-time and AI workloads. At the edge, this means:
- Federated Querying with DataFusion: Spice uses the Rust-based DataFusion engine (part of the Arrow ecosystem) to execute high-performance queries locally. This allows lightweight nodes to join, filter, and aggregate data directly.
- Vector + Relational Search: Spice combines vector search (for embeddings) with SQL-style queries. This means an edge application can run both semantic AI lookups and structured analytics in one step.
- Lightweight Runtimes: Spice can run in containers or edge environments, consuming a small footprint while still supporting open table formats like Iceberg, Delta, and Hudi.
- Hybrid Sync: Results and inferences can be materialized locally, then synchronized back to the central lakehouse when connectivity is restored, ensuring global consistency without sacrificing local responsiveness.
Example Use Cases
- Manufacturing IoT: Edge devices monitor sensor streams, detect anomalies with on-device inference, and sync flagged events to the lakehouse for broader analysis.
- Retail: In-store applications recommend products in real time based on customer behavior while syncing aggregated insights centrally.
- Telecom/5G: Local edge inference supports real-time network optimization while global models are trained and governed in the lakehouse.
In summary: Edge inference extends the reach of the lakehouse from the cloud to the edge, enabling AI applications to be both real-time and governed.
DuckLake: Simplifying Lakehouse Metadata with SQL
While formats like Iceberg, Delta, and Hudi advanced the lakehouse by bringing ACID transactions to data lakes, they also introduced operational complexity: JSON manifests, Avro metadata files, separate catalog services, and eventual consistency challenges. DuckLake takes a fresh approach by asking a simple question: what if the entire metadata layer was just stored in a relational database?
What is DuckLake?
DuckLake is a new open table format developed by the DuckDB team. Its core idea is to move all catalog and table metadata into a SQL database, while keeping table data as Parquet files in object storage or local filesystems. This means no manifest lists, no Hive Metastore, and no extra catalog API services—just SQL tables that track schemas, snapshots, and file pointers.
Architecture
DuckLake splits the lakehouse into two layers:
- Catalog Database: Any ACID-compliant database (DuckDB, SQLite, Postgres, MySQL, or even MotherDuck) stores all metadata—schemas, table versions, statistics, and transactions.
- Data Layer: Standard Parquet files (and optional delete files) stored in directories or S3 buckets hold the actual table data.
This design yields fast commits (a single SQL transaction to update metadata), strong consistency (no reliance on eventually consistent file stores), and simpler operations (just back up or replicate the metadata DB). It also enables advanced features like multi-table transactions, time travel, and transactional schema changes without a complex stack.
Integration with DuckDB
DuckLake ships as a DuckDB extension. Once installed, users can:
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:mycatalog.ducklake' AS lakehouse;
From there, you can create tables, insert, update, delete, and query with full ACID guarantees. Multiple DuckDB instances can share the same DuckLake if the catalog is in a multi-user database like Postgres, effectively making DuckDB “multiplayer” with a shared lakehouse.
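The same flow from Python, as a hedged sketch (the extension name, ATTACH options, and file paths below reflect early DuckLake releases and are placeholders that may evolve):

```python
import duckdb

con = duckdb.connect()

# DuckLake ships as a DuckDB extension; names and paths below are placeholders
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# Metadata lives in the .ducklake catalog database; data lands as Parquet files
con.sql("ATTACH 'ducklake:mycatalog.ducklake' AS lakehouse (DATA_PATH 'lake_files/')")
con.sql("CREATE TABLE lakehouse.orders (id INTEGER, amount DOUBLE)")
con.sql("INSERT INTO lakehouse.orders VALUES (1, 19.99), (2, 5.00)")
con.sql("SELECT * FROM lakehouse.orders").show()
```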
Interoperability
DuckLake is its own format, but it’s designed to interoperate:
- Iceberg: Parquet and delete files are compatible, and DuckLake can import Iceberg metadata directly, even preserving snapshot history.
- Delta: DuckDB continues to support Delta Lake separately; data can be copied between Delta and DuckLake when needed.
- Hudi: Not natively supported yet, but Hudi’s Parquet files can be queried as plain Parquet.
This makes DuckLake a flexible companion in mixed-format environments and a potential bridge for migrating or experimenting.
Use Cases
- Local & Embedded Lakehouses: Run a mini data warehouse on your laptop with DuckDB + DuckLake, no heavy services required.
- Small Team Data Warehouses: Share a DuckLake catalog in Postgres for concurrent analytics across a team.
- Streaming & CDC: Handle high-frequency small writes efficiently without metadata file bloat.
- CI/CD Pipelines: Spin up ephemeral lakehouses in tests with time travel and rollback for validation.
Limitations & Roadmap
DuckLake is still young (v0.3 as of late 2025). At present:
- Ecosystem support is centered on DuckDB/MotherDuck.
- No built-in fine-grained governance; it relies on the underlying database’s permissions.
- No branching/merge semantics like Project Nessie, though time travel is supported.
Books on the Data Lakehouse and Open Table Formats
For data engineers and architects who want to go beyond blogs and documentation, books provide the depth and structured learning needed to master the lakehouse paradigm. Between 2023 and early 2026, O’Reilly, Manning, and Packt have released (or announced) a range of titles that cover the architecture, theory, and practice of the data lakehouse, including the major open table formats, Apache Iceberg, Delta Lake, and Apache Hudi.
O’Reilly Media
Apache Iceberg: The Definitive Guide – Data Lakehouse Functionality, Performance, and Scalability on the Data Lake
Tomer Shiran, Jason Hughes, and Alex Merced (Jun 2024)
Comprehensive deep dive into Apache Iceberg’s architecture, metadata model, features like partition evolution and time travel, and integrations across engines such as Spark, Flink, Trino, and Dremio.

Apache Polaris: The Definitive Guide – Enriching Apache Iceberg Lakehouse with a robust open-source catalog
Alex Merced and Andrew Madson (Jun 2024)
Revolutionize your understanding of modern data management with Apache Polaris (incubating), the open source catalog designed for data lakehouse industry standard Apache Iceberg. This comprehensive guide takes you on a journey through the intricacies of Apache Iceberg data lakehouses, highlighting the pivotal role of Iceberg catalogs.

Delta Lake: Up and Running – Modern Data Lakehouse Architectures with Delta Lake
Bennie Haelen and Dan Davis (Oct 2023)
Introductory and practical guide to Delta Lake, covering ACID transactions, schema enforcement, time travel, and how to build reliable data pipelines that unify batch and streaming.

Delta Lake: The Definitive Guide – Modern Data Lakehouse Architectures with Data Lakes
Denny Lee, Tristen Wentling, Scott Haines, and Prashanth Babu (Dec 2024)
Written by core Delta Lake contributors, this book explores Delta’s transaction log, medallion architecture, deletion vectors, and advanced optimization strategies for enterprise-scale workloads.

Practical Lakehouse Architecture: Designing and Implementing Modern Data Platforms at Scale
Gaurav Ashok Thalpati (Aug 2024)
A broad architectural guide to designing, implementing, and migrating to lakehouse platforms. Covers design layers, governance, catalogs, and security with a practical step-by-step framework.

Apache Hudi: The Definitive Guide – Building Robust, Open, and High-Performance Lakehouses
Shiyan Xu, Prashant Wason, Sudha Saktheeswaran, and Rebecca Bilbro (Forthcoming Dec 2025)
Focuses on Hudi’s approach to incremental processing, upserts/deletes, clustering, and indexing. Demonstrates how to run production-ready lakehouses with streaming data ingestion.
Manning Publications
- Architecting an Apache Iceberg Lakehouse by Alex Merced (MEAP, 2025 – Forthcoming 2026). A hands-on, architecture-first guide to designing scalable Iceberg-based lakehouses. Covers all five layers (storage, table formats, ingestion, catalog, consumption) with exercises and real-world design trade-offs.
Packt Publishing
Building Modern Data Applications Using Databricks Lakehouse
Will Girten (Oct 2024)
Practical guide to deploying end-to-end pipelines on Databricks Lakehouse using Delta Lake and Unity Catalog, including batch and streaming workflows, governance, and CI/CD.

Engineering Lakehouses with Open Table Formats
Dipankar Mazumdar and Vinoth Govindarajan (Dec 2025)
Covers Iceberg, Hudi, and Delta Lake together, focusing on how to choose between them, optimize tables, and build interoperable, vendor-agnostic architectures. Includes hands-on examples with Spark, Flink, and Trino.

Data Engineering with Databricks Cookbook
Pulkit Chadha (May 2024)
Recipe-based approach to building data pipelines on Databricks, with step-by-step instructions for managing Delta Lake tables, handling streaming ingestion, orchestrating workflows, and applying Unity Catalog governance.
Takeaway
Whether you’re looking for a deep dive into a specific format (Iceberg, Delta, or Hudi) or a broader perspective on lakehouse architecture, these titles form the essential reading list for data engineers and architects in 2026. They not only document the current state of the technology but also provide practical frameworks and best practices to implement reliable, scalable, and open lakehouses.
Conclusion
The journey from data warehouses to data lakes and finally to the data lakehouse reflects one constant: organizations need a platform that balances trust, flexibility, and performance. Warehouses gave us governance but lacked agility. Lakes gave us scale and freedom but sacrificed reliability. The lakehouse unites these worlds by layering open table formats, catalogs, and intelligent query engines on top of low-cost object storage.
By 2025, this model matured from a promise into a proven architecture. With formats like Apache Iceberg, Delta Lake, Hudi, and Paimon, data teams now have open standards for transactional data at scale. Streaming-first ingestion, autonomous optimization, and catalog-driven governance have become baseline requirements. Looking ahead to 2026, the lakehouse is no longer just a central repository, it extends outward to power real-time analytics, agentic AI, and even edge inference.
For data engineers and architects, the message is clear:
- Adopt open table formats to avoid lock-in and ensure interoperability.
- Embrace a layered architecture that separates storage, metadata, ingestion, catalog, and consumption.
- Optimize continuously, through compaction, snapshot expiration, and acceleration features, so performance scales with data.
- Prepare for the future where AI workloads are not occasional but constant, demanding a platform that is both intelligent and adaptive.
The lakehouse has become the backbone of modern data platforms. As you step into 2026, building on this foundation isn’t just a best practice, it’s the path to delivering data that is truly trusted, governed, and AI-ready.