<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Whitney </title>
    <description>The latest articles on DEV Community by Whitney  (@wtrue).</description>
    <link>https://dev.to/wtrue</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1228535%2F98fa3749-c545-4d9b-bae1-e1c01ace0a9b.png</url>
      <title>DEV Community: Whitney </title>
      <link>https://dev.to/wtrue</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wtrue"/>
    <language>en</language>
    <item>
      <title>Lessons from Log4Shell: Building a CRA-Ready Log4j</title>
      <dc:creator>Whitney </dc:creator>
      <pubDate>Wed, 06 May 2026 22:01:56 +0000</pubDate>
      <link>https://dev.to/theasf/lessons-from-log4shell-building-a-cra-ready-log4j-j43</link>
      <guid>https://dev.to/theasf/lessons-from-log4shell-building-a-cra-ready-log4j-j43</guid>
      <description>&lt;p&gt;&lt;em&gt;By: Piotr P. Karwasz, VP Logging, Apache Software Foundation&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The disclosure of Log4Shell (CVE-2021-44228) on December 9, 2021, did not just expose a vulnerability: it exposed a way of building software that was no longer fit for purpose, and it helped bring the European Cyber Resilience Act into being.&lt;/p&gt;

&lt;p&gt;I recently hosted a session for the Open Regulatory Compliance community’s CRA Monday series to tell the story from the inside: what the Apache Logging team actually did in the years after Log4Shell to rebuild the project as something CRA-ready.&lt;/p&gt;

&lt;p&gt;This blog recaps and expands upon that session; you can also &lt;a href="https://www.youtube.com/watch?v=ns9RBhEsz_U" rel="noopener noreferrer"&gt;watch the recording&lt;/a&gt; or &lt;a href="https://github.com/orcwg/orcwg/tree/main/events/cra-mondays" rel="noopener noreferrer"&gt;view the slides&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Wake-Up Call for the Software Ecosystem
&lt;/h2&gt;

&lt;p&gt;Log4Shell’s impact was unprecedented in scale. Apache Log4j is embedded so deeply across the software ecosystem that the vulnerability propagated almost everywhere at once and most organizations had no idea where they were exposed. The rush to assess risk revealed a fundamental problem: few teams maintained a reliable Software Bill of Materials (SBOM), and the question “are we affected?” had no quick answer.&lt;/p&gt;

&lt;p&gt;The scramble had at least one useful side effect: it pushed many teams to finally migrate from Log4j 1, end-of-life since 2015, to Log4j 2.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons from the Log4j perspective
&lt;/h2&gt;

&lt;p&gt;Since Log4j is mostly consumed as a dependency rather than built upon, the lessons the Apache Software Foundation’s Logging Services team drew from Log4Shell were different from those of the broader ecosystem. The problems were not about visibility into our own dependencies, but about the state of the project itself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documentation was hard to navigate, with many features either undocumented or described only in terms a new contributor could not act on.&lt;/li&gt;
&lt;li&gt;The release process was antiquated, understood by only a handful of people, and run on personal hardware: a single point of failure that nobody had reason to address until a crisis made it unavoidable.&lt;/li&gt;
&lt;li&gt;Builds were slow and tests were flaky, meaning a failure late in a multi-hour process sent you back to the beginning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these were unique to Log4j. Log4Shell made them impossible to ignore, and addressing them put us on a path that anticipates much of what the CRA now asks of software maintainers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Documentation: from maintainer knowledge to public record
&lt;/h2&gt;

&lt;p&gt;Logging is not always safe. There are real security concerns: CRLF injection from unstructured logging; sensitive information leaking into debug output; and injection of Log4j formatting patterns through user-supplied strings. Before Log4Shell, much of this knowledge lived in the heads of a few maintainers: not written down, not discoverable, and not actionable for the thousands of teams depending on the library.&lt;/p&gt;
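&lt;p&gt;The CRLF-injection concern is easy to demonstrate without any logging framework at all: if user-controlled text reaches a log line unescaped, an embedded newline can forge a second entry. A minimal, stdlib-only sketch (the class and method names are illustrative, not Log4j APIs):&lt;/p&gt;

```java
// Illustrative only: shows why unescaped user input in log lines is dangerous,
// and one common mitigation (escaping CR/LF before the text reaches the log).
public class LogSanitizer {

    // Replace carriage returns and line feeds with visible escapes so an
    // attacker-supplied value cannot start a new, forged log line.
    static String sanitize(String input) {
        return input.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String attack = "alice\n2021-12-09 INFO login succeeded user=admin";
        System.out.println("login failed user=" + attack);           // prints a forged second line
        System.out.println("login failed user=" + sanitize(attack)); // stays on one line
    }
}
```

&lt;p&gt;Real deployments would handle this at the layout or appender level rather than at each call site; the point is that the expectation has to be documented somewhere users can actually find it.&lt;/p&gt;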

&lt;p&gt;We rewrote the documentation website from scratch. The goal was to turn that private knowledge base into a public record by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Covering security best practices and an explicit &lt;a href="https://logging.apache.org/security.html" rel="noopener noreferrer"&gt;security model&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Providing reference documentation generated directly from code, so it stays in sync as the library evolves&lt;/li&gt;
&lt;li&gt;Making Log4j’s versioning policy and support status explicit and visible, both required for CRA attestations&lt;/li&gt;
&lt;li&gt;Moving the issue tracker from JIRA closer to the code in GitHub Issues&lt;/li&gt;
&lt;li&gt;Mirroring some discussions on both GitHub Discussions and mailing lists&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results were measurable: more documentation pull requests, more site visits (a useful proxy for coverage and clarity), and noticeably better answers from LLMs trained on our new content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release process: from manual to reproducible
&lt;/h2&gt;

&lt;p&gt;In December 2021, Log4j’s tests ran on a Jenkins instance, binaries were built on maintainer machines, signing was manual, and builds were not reproducible. A full binary and site build literally took hours. This was not unusual for open source projects, but it created real risks around build integrity, and it was clearly not sustainable.&lt;/p&gt;

&lt;p&gt;By September 2024 we had migrated to GitHub Actions, achieved reproducible builds signed by a CI GPG key only known to ASF admins, parallelized tests, and reduced the build-and-deploy cycle to around 30 minutes.&lt;/p&gt;

&lt;p&gt;Currently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CI pipeline now automatically stages releases up to the voting phase, making Log4j the first project in the ASF to do this.&lt;/li&gt;
&lt;li&gt;We are working on integration with Apache Trusted Releases, which will bring automation to the voting and publishing steps as well.&lt;/li&gt;
&lt;li&gt;We are working on full SLSA build and source attestations, which will make us one of the first ASF projects to achieve this. This includes SLSA source level 4, requiring a non-author review for every commit: a critical guarantee for a project at the center of the most significant supply-chain incident in recent memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Machine-readable metadata: SBOMs, VEX, and beyond
&lt;/h2&gt;

&lt;p&gt;One of the most concrete CRA requirements is the expectation that software comes with machine-readable security information. We now publish CycloneDX SBOMs to Maven Central; these reference a Vulnerability Disclosure Report, a machine-readable version of our CVE list, on our website. This gives downstream users a complete, well-curated source of vulnerability information, unaffected by the data loss that public vulnerability databases sometimes introduce when converting between formats. It is also open to improvements by contributors.&lt;/p&gt;
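&lt;p&gt;For readers who have not handled one, a CycloneDX SBOM is a small, well-specified JSON (or XML) document. A minimal illustrative fragment (the coordinates and version below are examples, not the published artifact):&lt;/p&gt;

```json
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {
      "type": "library",
      "group": "org.apache.logging.log4j",
      "name": "log4j-core",
      "version": "2.24.0",
      "purl": "pkg:maven/org.apache.logging.log4j/log4j-core@2.24.0"
    }
  ]
}
```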

&lt;p&gt;The next step is Vulnerability Exploitability eXchange (VEX) statements, generated automatically through an open source toolset we are developing with OpenRefactory. The system combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AI-backed Root Cause Service that identifies vulnerable methods&lt;/li&gt;
&lt;li&gt;A Call Graph Service that maps per-component call graphs&lt;/li&gt;
&lt;li&gt;A VEX Generation Service that determines the maximum reachable path and generates enriched VEX statements, which we call VEXplanations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are currently testing this within Apache Solr and plan to extend it to Log4j and Commons. The goal is to give downstream users a machine-readable guarantee of no known exploitable vulnerabilities, assessed automatically rather than by hand.&lt;/p&gt;

&lt;p&gt;We are also planning to support Common Lifecycle Enumeration (ECMA-428), the machine-readable equivalent of our supported versions list, generated ASF-wide through Apache Trusted Releases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vulnerability handling: from ad hoc to structured
&lt;/h2&gt;

&lt;p&gt;Since Log4Shell, the Logging Services team has put in place several structural improvements to vulnerability handling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A dedicated reporting address (&lt;a href="mailto:security@logging.apache.org"&gt;security@logging.apache.org&lt;/a&gt;) separate from the general ASF security team&lt;/li&gt;
&lt;li&gt;An explicit threat model that clearly separates the security responsibilities of Log4j from those of its consumers&lt;/li&gt;
&lt;li&gt;A bug bounty program hosted on YesWeHack, funded by the Sovereign Tech Resilience program&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since July 2024, the program has received 140 reports across Log4cxx, Log4j, and Log4net, yielding 10 CVEs.&lt;/p&gt;

&lt;p&gt;At the architectural level, Log4j 3 addresses the attack surface problem directly. Log4j 2 was built for a pre-Maven world and shipped as a monolithic core with many optional dependencies bundled in. Log4j 3 modularises most of that core, so each module only pulls in what it actually needs. Smaller surface area means fewer things that can go wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Community sustainability: the hardest and unsolved problem
&lt;/h2&gt;

&lt;p&gt;All of the above is meaningless if the project burns out. And that risk is real. Log4j currently has two active maintainers doing most of the work. Log4cxx and Log4net are in a similar position. This is not a criticism of the community. It is the structural reality of most open source projects, and it is the problem that CRA compliance pressure will make worse if it is not addressed alongside the technical work.&lt;/p&gt;

&lt;p&gt;We are exploring two directions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;a href="https://www.open-source-economy.com/" rel="noopener noreferrer"&gt;Open Source Economy&lt;/a&gt; initiative: offering consulting and compliance attestations as a funded model, where fees support maintainers, infrastructure, and upstream dependencies.&lt;/li&gt;
&lt;li&gt;ECMA TC54’s &lt;a href="https://tc54.org/contributing-yaml/" rel="noopener noreferrer"&gt;CONTRIBUTING.yaml&lt;/a&gt; specification: a machine-readable format for describing a project’s maintenance status, contribution needs, and support expectations, so that users and organisations can understand what they are depending on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is also a cultural dimension. Many open source communities grew up in an era when maintainers could commit directly to the repository and build software on their own machines. Shifting to a model where maintainers review all contributions and CI builds everything is the right move for security, but it is genuinely less fun, and communities resist it. Making that transition while keeping contributors engaged is one of the real challenges of building a CRA-ready project.&lt;/p&gt;

&lt;h2&gt;
  
  
  What CRA readiness actually looks like
&lt;/h2&gt;

&lt;p&gt;The CRA introduces a set of “due diligence” requirements for manufacturers and voluntary attestations for OSS projects. What I hope this post makes clear is that meeting them is not a compliance exercise you bolt on at the end. It is the output of years of work on documentation, build integrity, machine-readable metadata, vulnerability processes, and community health.&lt;/p&gt;

&lt;p&gt;The good news is that the direction was right before the regulation arrived. Log4Shell forced hard questions that led, eventually, to a project that is genuinely more secure, more transparent, and more sustainable than it was in 2021. The CRA gives that work a formal framework and, ideally, the organisational backing to fund it. For other open source projects facing the same pressures, that is perhaps the most useful takeaway: the path to CRA readiness and the path to a healthier, more sustainable project are the same path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get involved
&lt;/h2&gt;

&lt;p&gt;The path to a CRA-ready ecosystem is not walked by one project. If any of this matters to you, here is where to start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Contribute to Log4j directly. Code, documentation, and review all count. Start at &lt;a href="https://logging.apache.org/" rel="noopener noreferrer"&gt;logging.apache.org&lt;/a&gt;, or report security issues to &lt;a href="mailto:security@logging.apache.org"&gt;security@logging.apache.org&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Help across the ASF. The &lt;a href="https://security.apache.org/contributing/" rel="noopener noreferrer"&gt;Contributing to Apache Security&lt;/a&gt; guide is the way in.&lt;/li&gt;
&lt;li&gt;Shape OSS regulatory response. The &lt;a href="https://orcwg.org/" rel="noopener noreferrer"&gt;Open Regulatory Compliance Working Group&lt;/a&gt; is where CRA implementation for open source is being worked out in public.&lt;/li&gt;
&lt;li&gt;Define what sustainable maintenance means. &lt;a href="https://tc54.org/contributing-yaml/" rel="noopener noreferrer"&gt;ECMA TC54 TG4&lt;/a&gt; is building the machine-readable standards for project health and support.&lt;/li&gt;
&lt;li&gt;Secure the supply chain. The &lt;a href="https://slsa.dev/community" rel="noopener noreferrer"&gt;SLSA Community&lt;/a&gt; is advancing the build-integrity framework Log4j will soon use.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cybersecurity</category>
      <category>java</category>
      <category>opensource</category>
      <category>security</category>
    </item>
    <item>
      <title>Apache Geode 2.0: Revival, Reinvention, and the Road Ahead</title>
      <dc:creator>Whitney </dc:creator>
      <pubDate>Tue, 03 Mar 2026 18:22:06 +0000</pubDate>
      <link>https://dev.to/theasf/apache-geode-20-revival-reinvention-and-the-road-ahead-48o5</link>
      <guid>https://dev.to/theasf/apache-geode-20-revival-reinvention-and-the-road-ahead-48o5</guid>
      <description>&lt;p&gt;Originally published at &lt;a href="https://news.apache.org/foundation/entry/apache-geode-2-0-revival-reinvention-and-the-road-ahead" rel="noopener noreferrer"&gt;https://news.apache.org/foundation/entry/apache-geode-2-0-revival-reinvention-and-the-road-ahead&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By: Jinwoo Hwang&lt;br&gt;
Lead Developer, Project Lead, and Release Manager, Apache Geode 2.0&lt;br&gt;
&lt;a href="https://JinwooHwang.com" rel="noopener noreferrer"&gt;https://JinwooHwang.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post is divided into three parts. Part I explains why Apache Geode 2.0 matters. Part II walks through how it was modernized. Part III looks ahead—what we learned, what changed, and how you can help shape what comes next.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Geode 2.0, Part I: The Revival of Apache Geode
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Legacy, purpose, and the moment a terminated project came back to life&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Apache Geode 2.0 is not just a new release—it is a statement of intent. Before diving into code, frameworks, or version numbers, it is worth understanding why this release exists at all. &lt;/p&gt;

&lt;p&gt;This story begins with a platform that once powered mission-critical systems, drifted toward obsolescence, and then found new life through conviction, persistence, and community. This first section sets the stage: the purpose behind the work, the legacy of Apache Geode, and the moment when a seemingly finished project began its comeback.&lt;/p&gt;

&lt;p&gt;I have the privilege of serving as a Committer and Release Manager for Apache Geode 2.0. This release represents one of the most ambitious modernization efforts in the project’s history. For me, it has been more than engineering work—it has been a journey shaped by purpose, responsibility, and a deep belief in the value of our shared open source legacy.&lt;/p&gt;

&lt;p&gt;When I stepped into these roles, it became clear that Apache Geode could not survive on incremental change. The Java ecosystem had moved forward—Jakarta EE, Spring, Jetty, Tomcat, and security practices had all evolved—while Geode had effectively stood still. At the same time, unpatched vulnerabilities threatened user trust. To remain relevant, Geode needed a fundamental reset: technically, architecturally, and culturally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I Took on This Project
&lt;/h2&gt;

&lt;p&gt;I do not earn additional compensation for the nights and weekends spent on Apache Geode. I am grateful to my employer for supporting my open source contributions, but this work did not replace my day job. I carried the same responsibilities while taking on this effort.&lt;/p&gt;

&lt;p&gt;The reason I stayed with it is simple: purpose. I believe in this project and the community behind it. Friedrich Nietzsche famously wrote, “He who has a why to live can bear almost any how,” an idea later echoed by Viktor Frankl in his work on meaning and resilience. That sense of why—of keeping something valuable alive—carried me through the hardest moments of this journey.&lt;/p&gt;

&lt;p&gt;With that context, it is worth stepping back and answering a foundational question.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Exactly Is Apache Geode?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nisyyp5u62k8v5l9par.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4nisyyp5u62k8v5l9par.png" alt="What is Apache Geode" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Geode is a distributed, in‑memory data management platform designed for low‑latency, scalable, and consistent data access. It is built for systems that must react in real time, handle massive data volumes, and remain operational under failure. Data is dynamically partitioned or replicated across a cluster, with built‑in fault tolerance and optional persistence to disk.&lt;/p&gt;

&lt;p&gt;As modern applications have shifted toward real-time analytics, event-driven architectures, and microservices, latency has become a central architectural constraint. Disk-backed storage systems, while durable and cost-efficient, often introduce millisecond-scale access times that are incompatible with sub-millisecond response requirements. In-memory data platforms address this gap by keeping active or frequently accessed data in RAM, significantly reducing access latency and increasing throughput. This approach is particularly important in domains such as financial services, telecommunications, e-commerce, and IoT, where responsiveness, scale, and availability directly influence user experience and operational outcomes.&lt;/p&gt;

&lt;p&gt;At its core, Geode aggregates memory, CPU, and network resources across multiple nodes into a single, coherent data fabric. Applications continue running even when individual nodes fail—no interruptions, no downtime. Geode supports multiple deployment models, including peer‑to‑peer, client/server, and multi‑site configurations, enabling it to scale from tightly coupled application clusters to geographically distributed systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  From GemFire to Geode—and Almost to Obsolescence
&lt;/h2&gt;

&lt;p&gt;Apache Geode’s lineage traces back to 2002, when GemStone Systems introduced GemFire, a commercial platform widely used in financial services for real‑time workloads. Through acquisitions—GemStone to SpringSource, then to VMware, and later to Pivotal—the technology evolved before being open sourced in 2015 and donated to The Apache Software Foundation as Apache Geode.&lt;/p&gt;

&lt;p&gt;For several years, the project thrived. But after 2019, corporate shifts and changing priorities reduced contributor engagement. By 2022, most committers were inactive. By mid‑2023, development had stopped entirely. In 2024, the PMC voted to terminate the project. Apache Geode appeared to be finished.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fauaxmrph2r7mxeva3cv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fauaxmrph2r7mxeva3cv3.png" alt="Apache Geode Contributors Over Time" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Apache Geode Contributors Over Time&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Then came 2025. I began upstreaming my internal fork. The community delivered Apache Geode 1.15.2 in September, followed by Apache Geode 2.0 in December. What looked like an ending became a comeback—a transition from long winter to spring.&lt;/p&gt;

&lt;p&gt;By the time Apache Geode’s revival began, it was clear that survival alone was not enough. To remain viable, trusted, and relevant, the platform would need far more than incremental fixes—it would need a complete modernization from the ground up.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Learn what Apache Geode is &lt;a href="https://geode.apache.org/" rel="noopener noreferrer"&gt;https://geode.apache.org/&lt;/a&gt;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>distributedsystems</category>
      <category>news</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How DiDi Scaled to Hundreds of Petabytes with Apache Ozone</title>
      <dc:creator>Whitney </dc:creator>
      <pubDate>Thu, 29 Jan 2026 23:42:28 +0000</pubDate>
      <link>https://dev.to/theasf/how-didi-scaled-to-hundreds-of-petabytes-with-apache-ozone-2bdk</link>
      <guid>https://dev.to/theasf/how-didi-scaled-to-hundreds-of-petabytes-with-apache-ozone-2bdk</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Building a cost-effective, high-performance data foundation for global mobility&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you’re operating one of the world’s largest ride-hailing and mobility platforms, every millisecond and megabyte counts. For DiDi Global, which generates over one petabyte of new data every day, scaling storage isn’t just a technical challenge—it’s a business imperative.&lt;/p&gt;

&lt;p&gt;As the company’s data footprint grew to more than 500PB annually, DiDi’s engineers found themselves battling the limits of their legacy Apache Hadoop® Distributed File System (HDFS) storage layer. The infrastructure was struggling to keep pace with the company’s explosive data growth, slowing downstream analytics and machine learning (ML) workloads that power everything from route optimization to dynamic pricing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: Scaling Without Compromise
&lt;/h2&gt;

&lt;p&gt;DiDi’s HDFS-based infrastructure had served the company well, but it was beginning to show its age under the weight of petabyte-scale workloads. The team faced several interconnected problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metadata bottlenecks:&lt;/strong&gt; File count limits in HDFS created stress on metadata services, driving up latency and throttling performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read-heavy workloads:&lt;/strong&gt; RPC congestion and HDD I/O bottlenecks introduced lag for analytics and AI pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Escalating costs:&lt;/strong&gt; Triple replication inflated storage use and operational expenses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operational risk:&lt;/strong&gt; Even routine maintenance, such as decommissioning, carried stability concerns.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These issues had tangible business impacts. Slow metadata operations increased latency for end users, inflated costs, and created risks during peak demand periods.&lt;/p&gt;

&lt;p&gt;“Metadata latency wasn’t just a technical problem—it slowed down business units that rely on real-time analytics and AI insights,” said JiangHua Zhu, Software Engineer, DiDi’s Storage Team.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Apache Ozone
&lt;/h2&gt;

&lt;p&gt;After a rigorous evaluation, DiDi selected Apache Ozone™, a next-generation distributed storage system designed for scalability and performance in large, unstructured data environments.&lt;/p&gt;

&lt;p&gt;Ozone’s modern architecture—featuring RocksDB-based metadata management, separation of Object Manager (OM) and Storage Container Manager (SCM) services, and containerized data storage—provides the foundation DiDi needed to scale with confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Benefits
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Massive scalability:&lt;/strong&gt; Ozone comfortably supports tens of billions of files, removing HDFS metadata constraints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance optimizations:&lt;/strong&gt; Features like OM Follower Read, multi-cluster routing, and NVMe caching help minimize latency and balance system load.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost efficiency through Erasure Coding:&lt;/strong&gt;&lt;br&gt;
Transitioning from 3x replication to EC 6-3 reduced storage overhead from 3.0x to roughly 1.5x—saving hundreds of petabytes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enhanced resilience:&lt;/strong&gt; Container-based data granularity improves fault tolerance and streamlines operations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
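&lt;p&gt;The erasure-coding savings follow directly from the block arithmetic: an EC 6-3 layout writes 3 parity blocks for every 6 data blocks, so 9 physical blocks hold 6 blocks of data. A quick sketch of that calculation (illustrative, not DiDi code):&lt;/p&gt;

```java
// Storage overhead: physical bytes stored per logical byte of data.
public class StorageOverhead {

    // N-way replication stores N full copies of every block.
    static double replication(int copies) {
        return copies;
    }

    // Erasure coding stores (data + parity) blocks for every `data` blocks.
    static double erasureCoding(int dataBlocks, int parityBlocks) {
        return (dataBlocks + parityBlocks) / (double) dataBlocks;
    }

    public static void main(String[] args) {
        System.out.println("3x replication overhead: " + replication(3));     // 3.0
        System.out.println("EC 6-3 overhead:         " + erasureCoding(6, 3)); // 1.5
    }
}
```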

&lt;p&gt;“Ozone gave us the flexibility to scale elastically across hundreds of petabytes without sacrificing performance,” said Wei Ming, DiDi engineer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results: Faster, Leaner, and More Reliable
&lt;/h2&gt;

&lt;p&gt;The move to Apache Ozone delivered measurable, cross-functional benefits across DiDi’s data ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency:&lt;/strong&gt; P90 GetMetaLatency improved from 90ms to 17ms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Throughput:&lt;/strong&gt; Production read throughput increased by more than 20% with OM follower reads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost savings:&lt;/strong&gt; Erasure Coding cut the storage footprint nearly in half, saving both capital and operational expenses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stability under load:&lt;/strong&gt; The platform now operates smoothly even during cluster maintenance and peak traffic periods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Developer productivity:&lt;/strong&gt; Application teams no longer need to manage small-file compaction, reducing complexity and accelerating data delivery.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Smooth Adoption Through Planning and Community Collaboration
&lt;/h2&gt;

&lt;p&gt;DiDi’s migration to Ozone was meticulous and deliberate. Engineers ensured data consistency with DistCp COMPOSITE_CRC checksums, implemented dual-write for rollback safety, and validated end-to-end compatibility with Hadoop, Apache Spark™, and S3 APIs.&lt;/p&gt;
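&lt;p&gt;For readers unfamiliar with the mechanism, the checksum step relies on Hadoop’s composite-CRC mode, which allows checksums to be compared across different filesystems. The shape of such a copy (paths and service names here are placeholders, not DiDi’s configuration) looks roughly like:&lt;/p&gt;

```shell
# Illustrative shape of a checksum-verified copy from HDFS to Ozone.
hadoop distcp \
  -Ddfs.checksum.combine.mode=COMPOSITE_CRC \
  hdfs://source-cluster/warehouse/events \
  ofs://ozone-service/volume/bucket/warehouse/events
```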

&lt;p&gt;The company also leaned heavily on the Apache Ozone open source community—which contributed bug fixes, performance enhancements, and feedback that benefit all users.&lt;/p&gt;

&lt;p&gt;“The open source community was instrumental in our success—we gained support, shared knowledge, and received bug fixes that help everyone,” said Shilun Fan, a member of DiDi’s storage leadership.&lt;/p&gt;

&lt;p&gt;DiDi engineers even became active contributors, helping resolve issues such as metadata inconsistencies and Erasure Coding container handling. The collaboration ultimately strengthened both DiDi’s deployment and the broader Ozone ecosystem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Highlights
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage savings:&lt;/strong&gt; Hundreds of petabytes saved through Erasure Coding (6-3).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read efficiency:&lt;/strong&gt; 20%+ improvement from OM follower reads and NVMe caching.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unified access:&lt;/strong&gt; Hadoop API and S3 compatibility for batch, interactive, and ML workloads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; A single Ozone cluster can handle ~5 billion files, with the potential to scale to tens of billions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;DiDi’s storage team continues to push the boundaries of performance and efficiency. Upcoming initiatives include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Integrating io_uring and SPDK to enhance I/O performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Developing AI-driven operational insights for anomaly detection and auto-remediation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Piloting tiered storage strategies for hot, warm, and cold data layers to optimize cost and performance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;“Ozone is more than a storage layer—it’s the backbone of DiDi’s data ecosystem and future AI innovation,” said Hongbing Wang, DiDi technical lead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;By embracing Apache Ozone, DiDi transformed its data storage infrastructure from a limitation into a competitive advantage. The move delivered lower costs, higher reliability, and faster access to the insights that power intelligent mobility.&lt;/p&gt;

&lt;p&gt;At petabyte scale, even incremental improvements deliver outsized impact—and with Apache Ozone, DiDi has built a storage foundation ready for the next decade of data-driven innovation.&lt;/p&gt;

&lt;p&gt;To learn more about Apache Ozone:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Apache Ozone GitHub: &lt;a href="https://github.com/apache/ozone" rel="noopener noreferrer"&gt;https://github.com/apache/ozone&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Ozone Getting Started: &lt;a href="https://ozone.apache.org/docs/edge/start/startfromdockerhub.html" rel="noopener noreferrer"&gt;https://ozone.apache.org/docs/edge/start/startfromdockerhub.html&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Ozone LinkedIn page: &lt;a href="https://www.linkedin.com/company/apache-ozone/" rel="noopener noreferrer"&gt;https://www.linkedin.com/company/apache-ozone/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Ozone X.com handle: &lt;a href="https://x.com/ApacheOzone" rel="noopener noreferrer"&gt;https://x.com/ApacheOzone&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apache Ozone Best Practices at Didi: &lt;a href="https://ozone.apache.org/assets/ApacheOzoneBestPracticesAtDidi.pdf" rel="noopener noreferrer"&gt;https://ozone.apache.org/assets/ApacheOzoneBestPracticesAtDidi.pdf&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>opensource</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
