Alex Merced

Posted on Feb 4

Apache Data Lakehouse Weekly: January 27 - February 2, 2026

#data #datascience #database #dataengineering

The Apache lakehouse stack entered February with momentum. Across Iceberg, Parquet, Arrow, and Polaris, contributors are tightening specifications, modernizing code, and addressing real-world challenges that affect performance, governance, and AI readiness. Here’s a technical summary of the week’s most important activity.

Apache Iceberg

Java 17 Migration and Switch Expression Debate

Iceberg is dropping support for Java 11 and moving to Java 17. One notable consequence is the adoption of switch expressions, which the compiler checks for exhaustiveness and which rule out accidental fall-through. Rewriting existing switch statements touches a lot of code, though, and that creates merge conflicts for downstream forks. The community is evaluating three migration options:

Atomic: Migrate all at once to set a clean baseline.
Release-Aligned: Defer until the end of a release cycle to reduce disruption.
Incremental: Use switch expressions only in new or modified code, which risks long-term inconsistency.

Consensus is building around the atomic or release-aligned approaches to avoid fragmentation.

REST Catalog Authorization and Spec Corrections

REST Catalog adoption is accelerating as Iceberg moves away from Hive Metastore. New work proposes adding “referenced-by” context in loadTable calls, enabling catalogs to grant access only when a table is queried via approved paths. This allows dynamic, context-aware authorization decisions.
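
To make the idea concrete, here is a minimal sketch of a context-aware check on the catalog side. Everything in it is hypothetical: the function name `authorize_load_table`, the `referenced_by` argument, and the policy tables are illustrative stand-ins, not the proposed REST spec or any catalog's actual API.

```python
from typing import Optional

# Hypothetical policy: principals that may load a table directly.
DIRECT_ACCESS = {("admin", "sales.orders")}

# Hypothetical policy: objects (for example, views) through which a
# principal is allowed to reach a table it cannot load directly.
APPROVED_PATHS = {
    ("analyst", "sales.orders"): {"sales.orders_summary_view"},
}


def authorize_load_table(
    principal: str, table: str, referenced_by: Optional[str] = None
) -> bool:
    """Decide whether a load-table request should be allowed.

    `referenced_by` names the object the query came through, if any.
    """
    if (principal, table) in DIRECT_ACCESS:
        return True
    if referenced_by is None:
        # No context supplied and no direct grant: deny.
        return False
    return referenced_by in APPROVED_PATHS.get((principal, table), set())
```

In this sketch an analyst can reach `sales.orders` only through the approved view, while a direct request for the same table is denied.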

There’s also a fix to the REST spec that formalizes support for partition statistics updates. This helps generated clients in languages like Python handle partition-level metadata correctly.

Bloom Filter Indexing and Statistics Planning

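The community is evaluating strategies to extend indexing support. One area of focus is Bloom filters, which let readers skip files that definitely do not contain a queried value. A Bloom filter can return false positives but never false negatives, so skipping is always safe, and the I/O savings tend to be largest for equality lookups on high-cardinality columns.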
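
The file-skipping idea can be sketched in a few lines. This is a toy Bloom filter built on `hashlib`, not Iceberg's or Parquet's implementation; the class, the helper names, the sizing parameters, and the file paths are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value: str):
        # Derive several bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def build_file_filters(files: dict) -> dict:
    """Build one filter per data file from that file's column values."""
    filters = {}
    for path, values in files.items():
        bloom = BloomFilter()
        for value in values:
            bloom.add(value)
        filters[path] = bloom
    return filters


def files_to_scan(filters: dict, value: str) -> list:
    """Keep only the files whose filter cannot rule the value out."""
    return [path for path, bloom in filters.items() if bloom.might_contain(value)]


# Hypothetical files and column values.
files = {
    "data/file-a.parquet": ["user-001", "user-002"],
    "data/file-b.parquet": ["user-103", "user-104"],
    "data/file-c.parquet": ["user-205", "user-206"],
}
filters = build_file_filters(files)
candidates = files_to_scan(filters, "user-104")
```

The file that actually holds `user-104` is always among the candidates. Other files are pruned unless the filter happens to report a false positive, in which case the reader scans them and finds nothing.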

A broader indexing framework is also under discussion, including support for pluggable index types and better integration with Iceberg’s metadata layers.

PyIceberg 0.11.0 Release Candidate

A new release candidate for PyIceberg brings Python support closer to parity with the Java implementation. This is especially relevant for AI and ML teams building workflows around Python-based engines. The update improves table interaction, schema handling, and metadata support.

Apache Parquet

Dictionary Bit-Width 0 Vulnerability

Parquet maintainers discussed a security flaw in which dictionary-encoded pages could be constructed with a bit width of 0, so that a very large number of values occupies almost no space on disk. The result behaves like a compression bomb: a tiny page expands into a huge allocation and can overwhelm readers during decoding.

The fix restricts bit width 0 to pages that contain no values and clarifies this constraint in the spec. It reinforces the growing view that readers must treat Parquet files as untrusted input and validate them before decoding.
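
A short sketch shows why bit width 0 is dangerous and what a reader-side check could look like. The function names are hypothetical. The check follows the rule as summarized in this post, and the upper bound of 32 bits is my reading of the Parquet encoding spec, so treat both as assumptions rather than the exact wording of the fix.

```python
def bit_packed_size_bytes(num_values: int, bit_width: int) -> int:
    """Bytes needed to bit-pack `num_values` indices at `bit_width` bits each."""
    return (num_values * bit_width + 7) // 8


def validate_index_bit_width(bit_width: int, num_values: int) -> None:
    """Reject dictionary index pages that violate the rule described above."""
    if not 0 <= bit_width <= 32:
        raise ValueError(f"bit width {bit_width} is outside the range 0..32")
    if bit_width == 0 and num_values > 0:
        # Zero bits per value means the payload does not grow with the
        # value count, so a tiny page could claim an enormous output.
        raise ValueError("bit width 0 is only allowed for pages with no values")


# With bit width 0, a billion values need no payload bytes at all.
payload_at_width_0 = bit_packed_size_bytes(1_000_000_000, 0)
payload_at_width_1 = bit_packed_size_bytes(1_000_000_000, 1)
```

At width 1 the same billion values need 125,000,000 bytes, so the input size at least bounds the output. At width 0 that bound disappears, which is why the reader has to validate the page header before allocating anything.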

Global Dictionary Compression Research

There’s renewed interest in extending dictionary encoding across files. Global dictionaries can reduce storage by over 10 percent in string-heavy datasets. These ideas align with table formats like Iceberg, which already manage metadata across many files.
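
The intuition is that values repeated across files are stored once instead of once per file. The sketch below compares the two layouts on made-up data. It counts only the bytes of the dictionary entries and ignores index widths, page headers, and compression, so the numbers show the mechanism and are not a benchmark.

```python
def dictionary_entry_bytes(values) -> int:
    """Total UTF-8 bytes of the distinct values, i.e. the dictionary entries."""
    return sum(len(v.encode("utf-8")) for v in set(values))


# Hypothetical country-code column split across three files.
files = [
    ["US", "DE", "FR", "US"],
    ["US", "DE", "JP"],
    ["FR", "US", "DE"],
]

# Per-file dictionaries: each file stores its own copy of every distinct value.
per_file_bytes = sum(dictionary_entry_bytes(f) for f in files)

# Global dictionary: distinct values are stored once for the whole table.
global_bytes = dictionary_entry_bytes(v for f in files for v in f)
```

Here the per-file layout stores 18 bytes of dictionary entries and the global layout stores 8, because `US` and `DE` appear in every file.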

Apache Arrow

Arrow 23.0.0 Release and GPU Alignment

Arrow 23.0.0 was released with support for CUDA 13, reinforcing its role in GPU-accelerated analytics. The release removes the experimental tag from the Arrow PyCapsule interface, enabling safe memory sharing between Python libraries in production.

Updates also include improved Parquet integration, more control over page size in writers, and better support for decimal statistics in Arrow scalars.

Validity Buffer Ambiguity in StructArray

A technical debate emerged around validity buffers in nested structs. The issue: when a parent struct is null, should its non-nullable child fields be treated as null? Some argue for strict enforcement. Others point to gaps in current implementations. The discussion reflects deeper tensions around spec clarity in cross-language environments.
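
One way to picture the question is to model validity as plain boolean lists. The sketch below applies the reading in which a null parent masks its children. That is only one of the positions in the discussion, and the helper names are hypothetical. It does not use the Arrow library.

```python
def effective_validity(parent_valid, child_valid):
    """A child slot counts as valid only if the parent slot is also valid."""
    if len(parent_valid) != len(child_valid):
        raise ValueError("parent and child must have the same length")
    return [p and c for p, c in zip(parent_valid, child_valid)]


def masked_slots(parent_valid, child_valid):
    """Rows where the child looks valid on its own but the parent is null."""
    return [
        i
        for i, (p, c) in enumerate(zip(parent_valid, child_valid))
        if c and not p
    ]


# A struct column with two null rows and a non-nullable child field.
parent_valid = [True, False, True, False]
child_valid = [True, True, True, True]  # non-nullable: every slot is set

effective = effective_validity(parent_valid, child_valid)
masked = masked_slots(parent_valid, child_valid)
```

Rows 1 and 3 are where the ambiguity lives: the child buffer says the slot holds a value, while the parent says the whole struct is null. An implementation that reads the child on its own sees four values, and one that applies the parent mask sees two.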

Thread Safety in JDBC Drivers

Arrow maintainers clarified that JDBC Connection objects are not thread-safe and should not be shared across threads. This reinforces the need for connection pooling and proper lifecycle management in high-concurrency services using Arrow.
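
The pattern is the same in any language: each unit of work borrows a connection, uses it exclusively, and returns it. The sketch below shows it in Python with a stand-in `FakeConnection` class. It is not the JDBC API or an Arrow driver, and every name in it is hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class FakeConnection:
    """Stand-in for a non-thread-safe connection; detects concurrent use."""

    def __init__(self, name: str):
        self.name = name
        self._busy = threading.Lock()
        self.uses = 0
        self.overlapping_uses = 0

    def execute(self, sql: str):
        if not self._busy.acquire(blocking=False):
            # Another thread is already inside this connection.
            self.overlapping_uses += 1
            return None
        try:
            self.uses += 1
            return f"{self.name}: {sql}"
        finally:
            self._busy.release()


class ConnectionPool:
    """Hands each connection to at most one borrower at a time."""

    def __init__(self, size: int):
        self.connections = [FakeConnection(f"conn-{i}") for i in range(size)]
        self._idle = queue.Queue()
        for conn in self.connections:
            self._idle.put(conn)

    @contextmanager
    def borrow(self):
        conn = self._idle.get()  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always return it, even on error


pool = ConnectionPool(size=2)


def run_query(i: int):
    with pool.borrow() as conn:
        return conn.execute(f"SELECT {i}")


with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(run_query, range(20)))
```

Eight worker threads share two connections, but no connection is ever used by two threads at once, because a thread must take it out of the queue before touching it.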

Apache Polaris

Planning for Version 1.4.0-Incubating

Polaris is in the planning phase for its next release. A new release manager has been appointed, and milestones are being defined. The team is also working to consolidate its CI workflows, replacing six independent pipelines with a unified process.

This is part of a larger effort to move Polaris from a loosely connected project to a production-grade catalog service.

Breaking Changes for Federated Connectivity

A proposed change to the ExternalCatalogFactory interface would add support for advanced HTTP client configurations, including proxy settings and timeouts. While breaking, this change improves Polaris’s ability to connect securely with external metadata sources.

Iceberg Catalog Migrator Tool

A 1.0.0 release candidate is now available for the Iceberg Catalog Migrator. This CLI tool helps users move Iceberg tables from existing catalogs to Polaris, streamlining onboarding and reducing friction for new users.

Ecosystem Trends

Across all projects, three priorities stand out.

First is the shift to REST-first metadata architecture. Iceberg and Polaris are both building richer context-aware REST interfaces that improve security and flexibility across engines.

Second is the focus on resilience. Parquet and Arrow are tackling low-level ambiguity and validation concerns to prevent silent errors and runtime failures, especially in multi-tenant or AI-driven environments.

Third is developer productivity. The move to Java 17, streamlined CI systems, and modern Python support are all about helping contributors build faster and safer.

Taken together, these trends show a maturing ecosystem. Each change—whether a spec correction or a release planning update—pushes the Apache lakehouse stack toward greater openness, safety, and performance.
