<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lê Đình Phú</title>
    <description>The latest articles on DEV Community by Lê Đình Phú (@dinhphu2k1gif).</description>
    <link>https://dev.to/dinhphu2k1gif</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1088795%2F5b0ab8d8-7ac7-4a86-bdae-15445f871b4d.jpeg</url>
      <title>DEV Community: Lê Đình Phú</title>
      <link>https://dev.to/dinhphu2k1gif</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dinhphu2k1gif"/>
    <language>en</language>
    <item>
      <title>Oracle Golden Gate For Big Data</title>
      <dc:creator>Lê Đình Phú</dc:creator>
      <pubDate>Mon, 29 Jun 2026 14:15:56 +0000</pubDate>
      <link>https://dev.to/dinhphu2k1gif/oracle-golden-gate-for-big-data-5ba0</link>
      <guid>https://dev.to/dinhphu2k1gif/oracle-golden-gate-for-big-data-5ba0</guid>
      <description>&lt;p&gt;Not many people know that Oracle GoldenGate also has a very intuitive and user-friendly administration interface. 😁😁&lt;br&gt;
Every time I see this dashboard, I think: if you're using Debezium, you'd probably love having a UI like this. 😄&lt;br&gt;
And a common misconception is that Oracle products aren't always free. The Oracle GoldenGate version can be used for learning, experimenting, and building CDC pipelines without needing to purchase a license.&lt;/p&gt;

&lt;p&gt;I'm currently learning more about GoldenGate, and I think it's a worthwhile option to try if you work in Data Engineering, especially in CDC and real-time data pipeline problems in banking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhhladsl96jgmoh160q7p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhhladsl96jgmoh160q7p.png" alt=" " width="799" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>goldengate</category>
      <category>kafka</category>
    </item>
    <item>
      <title>How Uber Built Its Big Data System — From a Few TBs to 350 Petabytes with Sub-Hour Latency</title>
      <dc:creator>Lê Đình Phú</dc:creator>
      <pubDate>Mon, 22 Jun 2026 15:55:25 +0000</pubDate>
      <link>https://dev.to/dinhphu2k1gif/how-uber-built-its-big-data-system-from-a-few-tbs-to-350-petabytes-with-sub-hour-latency-2el2</link>
      <guid>https://dev.to/dinhphu2k1gif/how-uber-built-its-big-data-system-from-a-few-tbs-to-350-petabytes-with-sub-hour-latency-2el2</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Uber's data platform handles 6 trillion rows/day, 350 PB of storage, and 4 million analytical queries/week. They got there by rebuilding their entire architecture from scratch — not once, but &lt;strong&gt;three times&lt;/strong&gt;. This post breaks down why, how, and what every data engineer can learn from it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What is Uber's Big Data Platform?
&lt;/h2&gt;

&lt;p&gt;Uber's Big Data Platform is a distributed data infrastructure that processes over &lt;strong&gt;350 petabytes&lt;/strong&gt; of data, ingests &lt;strong&gt;6 trillion rows daily&lt;/strong&gt; from hundreds of microservices, and serves &lt;strong&gt;4 million analytical queries per week&lt;/strong&gt; with sub-30-minute latency — built on HDFS, Apache Hudi, Apache Spark, and Presto.&lt;/p&gt;

&lt;p&gt;Every ride you book on Uber leaves a digital footprint. Not just one row — but dozens: pickup location, dropoff, driver rating, ETA calculation, surge pricing events, payment processing, trip status updates. That's just one ride, for one user, in one city.&lt;/p&gt;

&lt;p&gt;Multiply that by millions of daily rides across hundreds of cities, add Uber Eats, Uber Freight, and constantly launching new features — the result is &lt;strong&gt;6 trillion rows ingested every day&lt;/strong&gt;, totaling &lt;strong&gt;350 petabytes&lt;/strong&gt; of storage.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The central question:&lt;/strong&gt; &lt;em&gt;When a company grows at Uber's breakneck speed, how do you design a data system so it doesn't get crushed by its own data?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Uber's journey is a story of continuously recognizing the limits of old architectures and bravely re-architecting from scratch — not once, but &lt;strong&gt;three times&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Uber Needed Non-Standard Big Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Uber's Unique Data Characteristics
&lt;/h3&gt;

&lt;p&gt;Most companies' data is &lt;strong&gt;append-only&lt;/strong&gt;: logs are written and never modified. But Uber's data is inherently &lt;strong&gt;mutable&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A driver rating might be updated the next day.&lt;/li&gt;
&lt;li&gt;A fare might be adjusted days later due to a dispute.&lt;/li&gt;
&lt;li&gt;A backfill might occur weeks later due to business rule changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the crux of why standard industry solutions were insufficient. While the rest of the industry was building append-only data lakes, Uber fundamentally needed something else.&lt;/p&gt;

&lt;p&gt;Furthermore, Uber serves &lt;strong&gt;three user groups with completely different needs&lt;/strong&gt; — simultaneously, on a single platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;City Ops&lt;/strong&gt; (thousands of users): Need simple, near-real-time SQL for daily operational decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Scientists/Analysts&lt;/strong&gt; (hundreds of users): Need full datasets for long-term modeling and forecasting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering Teams&lt;/strong&gt; (hundreds of users): Need data for automated pipelines such as fraud detection and driver onboarding.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Scale of the Problem (as of 2024–2025)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;19,500&lt;/strong&gt; managed Hudi datasets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6 trillion rows&lt;/strong&gt; ingested daily / &lt;strong&gt;3 million new data files&lt;/strong&gt; daily&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;350 petabytes&lt;/strong&gt; across HDFS and Google Cloud Storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;350,000 commits&lt;/strong&gt; daily / &lt;strong&gt;70,000 daily table service operations&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 million analytical queries/week&lt;/strong&gt; via Presto and Spark&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Generation 1: "Warehouse-as-Lake" and Its Costs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Initial Architecture (Pre-2014)
&lt;/h3&gt;

&lt;p&gt;Before 2014, Uber's data was scattered across various OLTP databases — MySQL, PostgreSQL. Any engineer needing data had to know exactly which database to query and write custom code to join across sources. Total volume was just a few terabytes.&lt;/p&gt;

&lt;p&gt;This worked when Uber was small. But as the business began to scale exponentially, the lack of consistency became a disaster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F49b3g4lz92r3ybz16x41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F49b3g4lz92r3ybz16x41.png" alt="Architecture Pre-2014" width="800" height="455"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Architecture Pre-2014&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Gen 1: Data Warehouse with Vertica
&lt;/h3&gt;

&lt;p&gt;The first solution was building a centralized data warehouse. Uber chose &lt;strong&gt;Vertica&lt;/strong&gt; for its columnar design — fast and scalable compared to other options at the time.&lt;/p&gt;

&lt;p&gt;The architecture was simple: multiple ad hoc ETL jobs copied data from various sources (AWS S3, OLTP databases, service logs) into Vertica. SQL was standardized as the single interface.&lt;/p&gt;

&lt;p&gt;The initial result was a massive success — for the first time, everyone had a global view. Hundreds of new users emerged. City operators started using SQL for decision-making. New ML and experimentation teams formed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fbr583b4kscp6getaxirx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fbr583b4kscp6getaxirx.png" alt="Uber BigData Platform Gen 1" width="800" height="489"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Uber: BigData Platform Gen 1&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  When Success Becomes a Burden
&lt;/h3&gt;

&lt;p&gt;However, this popularity was also the beginning of the problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema drift was the first issue.&lt;/strong&gt; Source data was primarily in JSON — a flexible format without strict schema enforcement. If a team quietly changed their data structure (e.g., renaming a column), downstream ETL pipelines would break without warning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Duplicate ingestion complicated everything.&lt;/strong&gt; Because ETL jobs were ad hoc, the same dataset could be ingested multiple times with different transformations. Multiple versions of the "truth" co-existed, and nobody knew which was correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling costs became irrational.&lt;/strong&gt; The warehouse wasn't horizontally scalable. As data grew, Uber had to delete old data to make room — an analytics system that couldn't retain history.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The warehouse became a single point of failure.&lt;/strong&gt; Everything ran on Vertica. This was no longer a data warehouse; it was a data monolith.&lt;/p&gt;




&lt;h2&gt;
  
  
  Generation 2: Hadoop Data Lake and the Lesson of Incremental Processing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Re-architecting Around the Hadoop Ecosystem
&lt;/h3&gt;

&lt;p&gt;The core decision of Gen 2 was radical: &lt;strong&gt;completely separate raw ingestion from data modeling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The design principles were simple but crucial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raw data enters Hadoop &lt;strong&gt;once&lt;/strong&gt;, with &lt;strong&gt;no transformation&lt;/strong&gt; during ingestion (EL, not ETL).&lt;/li&gt;
&lt;li&gt;All transformations occur inside Hadoop via horizontally scalable batch jobs.&lt;/li&gt;
&lt;li&gt;Only processed data tables are pushed to Vertica — for city ops requiring fast SQL.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fn0rnwzhxevbqlhasdoku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fn0rnwzhxevbqlhasdoku.png" alt="Uber BigData Platform Gen 2" width="800" height="527"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Uber: BigData Platform Gen 2&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Technology Stack and Rationale
&lt;/h3&gt;

&lt;p&gt;Uber built Gen 2 on four main technologies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Parquet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Columnar storage format enabling efficient compression and faster querying&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Presto&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serves instant ad-hoc queries (results in seconds, not hours)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Programs complex data pipelines, supporting both SQL and non-SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Apache Hive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Workhorse" specializing in processing massive batch queries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The New Limit: The Snapshot-Based Ingestion Problem
&lt;/h3&gt;

&lt;p&gt;Gen 2 brought an inherent limitation: &lt;strong&gt;snapshot-based ingestion&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Because HDFS and Parquet do not support in-place updates, every ingestion job had to recreate an entire snapshot of the dataset. The consequences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data latency reached up to 24 hours.&lt;/strong&gt; For a real-time business like Uber, this was unacceptable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Irrational waste of compute.&lt;/strong&gt; A 100TB table with only 100GB changing daily still required scanning the entire 100TB — over 99.9% of compute resources wasted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HDFS NameNode Bottleneck.&lt;/strong&gt; Too many small files from ad-hoc jobs began overloading the NameNode.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By early 2017, with over &lt;strong&gt;100 Petabytes stored&lt;/strong&gt; and &lt;strong&gt;100,000 vcores&lt;/strong&gt;, the system had hit its scalability limits.&lt;/p&gt;




&lt;h2&gt;
  
  
  Apache Hudi is Born: When There is No Solution, Build It
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4zacti4scmczwgjt83cs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F4zacti4scmczwgjt83cs.png" alt="Uber BigData Platform Gen 3" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Uber: BigData Platform Gen 3&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A Problem with No Market Solution
&lt;/h3&gt;

&lt;p&gt;Uber needed three things simultaneously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Upsert operations&lt;/strong&gt; on HDFS/Parquet — which is append-only storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental reads&lt;/strong&gt; — reading only changed data since the last checkpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-hour latency&lt;/strong&gt; instead of 24 hours.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There was no technology on the market in early 2017 that could simultaneously meet all three requirements. Uber was forced to build it themselves.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Hudi Works
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hadoop Upserts anD Incremental (Hudi)&lt;/strong&gt; is a Spark library that creates an abstraction layer over HDFS and Parquet — adding upsert and incremental capabilities.&lt;/p&gt;

&lt;p&gt;The central mechanism is &lt;strong&gt;Copy-on-Write (CoW)&lt;/strong&gt;: when a record needs an update, Hudi rewrites the entire Parquet file containing that record. Write costs increase, but reading remains extremely fast because there is no complex merge logic required.&lt;/p&gt;

&lt;p&gt;Hudi provides two ways to read data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot view&lt;/strong&gt;: Returns a holistic view of the entire table at a point in time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental view&lt;/strong&gt;: Returns only records inserted or updated since a specific checkpoint timestamp.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Data latency dropped from 24 hours to &lt;strong&gt;under 1 hour&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzikz6gk9d003998ezwxm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzikz6gk9d003998ezwxm.png" alt="Hudi Architecture" width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Standardized Data Model: Two Types of Tables
&lt;/h3&gt;

&lt;p&gt;Uber standardized raw data organization into two primary table types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Changelog History Table&lt;/strong&gt;: Stores the entire history of changelogs from upstream systems. Can be sparse — upstream systems may only send partial rows for changed columns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merged Snapshot Table&lt;/strong&gt;: Provides a unified, up-to-date view — all columns present for a given key, regardless of update history complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Marmaray: The Ingestion Platform of Gen 3
&lt;/h3&gt;

&lt;p&gt;Hudi solved the storage layer problem. But what system would feed data into Hudi?&lt;/p&gt;

&lt;p&gt;Uber built &lt;strong&gt;Marmaray&lt;/strong&gt; — a generic data ingestion platform operating on a &lt;strong&gt;mini-batch model running every 10–15 minutes&lt;/strong&gt;, consuming changelogs from &lt;strong&gt;Apache Kafka&lt;/strong&gt; and applying changes to Hadoop via Hudi.&lt;/p&gt;

&lt;p&gt;Uber established an unbreakable rule: &lt;strong&gt;absolutely no data transformation during ingestion&lt;/strong&gt;. Marmaray is EL (Extract-Load), not ETL. Raw data must remain 100% intact upon entering Hadoop. Result: &lt;strong&gt;30-minute raw data latency&lt;/strong&gt; end-to-end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evolution: From Marmaray to Apache Flink (IngestionNext)
&lt;/h3&gt;

&lt;p&gt;While Marmaray reduced latency to 30 minutes, it eventually revealed new limits. For highly time-sensitive workloads, 30 minutes is still too long. Uber launched &lt;strong&gt;IngestionNext&lt;/strong&gt; — shifting from Marmaray (Spark) to &lt;strong&gt;Apache Flink&lt;/strong&gt; for streaming-native ingestion.&lt;/p&gt;

&lt;p&gt;Result: Data latency dropped from 30 minutes to &lt;strong&gt;under 15 minutes&lt;/strong&gt;, while significantly cutting hardware resource consumption.&lt;/p&gt;




&lt;h2&gt;
  
  
  Comparison of the 3 Architectural Generations
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Gen 1 (Vertica)&lt;/th&gt;
&lt;th&gt;Gen 2 (Hadoop)&lt;/th&gt;
&lt;th&gt;Gen 3 (Hudi)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Few TBs&lt;/td&gt;
&lt;td&gt;~100 PB&lt;/td&gt;
&lt;td&gt;350 PB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Minutes&lt;/td&gt;
&lt;td&gt;24 hours&lt;/td&gt;
&lt;td&gt;&amp;lt;30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Upsert support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Horizontal scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Incremental reads&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Hudi at Trillion-Record Scale: 3 Engineering Breakthroughs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Workload Classification at True Scale (2024)
&lt;/h3&gt;

&lt;p&gt;Uber divides its 19,500 datasets into four workload classes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Append-Only (11,200 tables)&lt;/strong&gt;: Largest volume, bulk insert. &lt;em&gt;Key metric: ingestion speed.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Upsert-Heavy (4,400 tables)&lt;/strong&gt;: Mutable business states with high update frequency. &lt;em&gt;Key metric: write throughput.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Derived (1,600 tables)&lt;/strong&gt;: ETL and ML pipelines using incremental reads. &lt;em&gt;Key metric: correctness.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Realtime Flink-Native Streaming (500 tables)&lt;/strong&gt;: SLO for freshness under 15 minutes. &lt;em&gt;Key metric: latency.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Breakthrough 1: Metadata Table (MDT) — Solving the File Listing Problem
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; With tens of thousands of datasets, each with thousands of partitions containing millions of files — even a simple file listing became a bottleneck. The HDFS NameNode was completely overwhelmed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; MDT is a key-value store managed directly by Hudi using &lt;strong&gt;HFile&lt;/strong&gt; (SSTable-based) to store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File listing metadata per partition&lt;/li&gt;
&lt;li&gt;Column-level statistics (min/max per Parquet file) for query pruning&lt;/li&gt;
&lt;li&gt;Bloom filters for fast record lookups during upserts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Algorithmic complexity dropped from &lt;strong&gt;O(n) linear scans&lt;/strong&gt; to &lt;strong&gt;O(1) per lookup&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production impact:&lt;/strong&gt; Deployed across &lt;strong&gt;90%+ of datasets&lt;/strong&gt;. HDFS NameNode bottleneck completely eliminated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Breakthrough 2: Record Index (RI) — Upserts on Trillion-Row Tables
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; To perform an upsert, the system must identify exactly which Parquet file contains the record to be updated. At hundreds of billions of rows, Bloom filters and file scans are insufficient.&lt;/p&gt;

&lt;p&gt;Uber initially tried &lt;strong&gt;HBase as an external index&lt;/strong&gt; — powerful but operationally complex, creating a potential single point of failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; Record Index (RI) — an HFile-backed structure &lt;strong&gt;inside the MDT&lt;/strong&gt; — no external servers needed. Maps each record key to a file group with &lt;strong&gt;O(1)&lt;/strong&gt; complexity.&lt;/p&gt;

&lt;p&gt;Real production numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3,600 large tables&lt;/strong&gt; use RI&lt;/li&gt;
&lt;li&gt;Largest tables exceed &lt;strong&gt;300 billion rows&lt;/strong&gt;: indexes sharded into &lt;strong&gt;10,000 HFiles&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Lookup latency: &lt;strong&gt;1–2ms per record key&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Initialization of a 300B-row table: ~&lt;strong&gt;7 hours&lt;/strong&gt; using ~&lt;strong&gt;4,000 executors&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Roadmap: Scaling to &lt;strong&gt;1.2 trillion row tables&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Breakthrough 3: Multi-Data-Center Reliability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt; Mission-critical datasets must remain highly available and consistent across multiple regions. A single-region outage must not bring down the entire analytics platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Primary dataset with a replicated secondary in a different geographical region&lt;/li&gt;
&lt;li&gt;Hudi's &lt;strong&gt;commit timeline and atomic operations&lt;/strong&gt; safely propagate all writes to the secondary region&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table availability service&lt;/strong&gt; monitors regional health and automatically promotes the secondary to primary on disruption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent query routing&lt;/strong&gt;: Presto and Spark queries automatically routed to the closest/healthiest region&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Uber has also successfully migrated part of its data lake to &lt;strong&gt;Google Cloud Storage (GCS)&lt;/strong&gt; with &lt;strong&gt;zero downtime&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways for Data Engineers
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;There is no perfect first architecture.&lt;/strong&gt; Nobody chooses perfectly when scale changes exponentially. The goal is to choose something horizontally scalable enough to buy time for the next generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Understand the nature of your data first.&lt;/strong&gt; Uber's entire Gen 2→3 migration was driven by one insight: their data is &lt;em&gt;mutable&lt;/em&gt;, not append-only. The wrong assumption about data mutability led to 24-hour latency and massive compute waste.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;EL before ETL during ingestion.&lt;/strong&gt; Marmaray's "no transformation during ingestion" rule gave Uber the flexibility to re-run transformation steps without re-ingesting from source — a massive operational advantage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;When there's no solution, the best companies build their own.&lt;/strong&gt; Apache Hudi is now used by 65% of the Fortune 500 — born entirely from Uber's specific engineering problem.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.uber.com/us/en/blog/uber-big-data-platform/" rel="noopener noreferrer"&gt;Uber's Big Data Platform: 100+ Petabytes with Minute Latency&lt;/a&gt; (2018)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.uber.com/us/en/blog/apache-hudi-at-uber/" rel="noopener noreferrer"&gt;Apache Hudi at Uber: Engineering for Trillion-Record-Scale&lt;/a&gt; (2026)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.uber.com/us/en/blog/marmaray-hadoop-ingestion-open-source/" rel="noopener noreferrer"&gt;Marmaray: Uber's Generic Data Ingestion and Dispersal Framework&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.uber.com/us/en/blog/from-batch-to-streaming-accelerating-data-freshness-in-ubers-data-lake/" rel="noopener noreferrer"&gt;IngestionNext: From Batch to Streaming&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you found this useful, I write about data engineering, AI, and large-scale systems on my &lt;a href="https://dinhphuvn.substack.com/" rel="noopener noreferrer"&gt;Substack&lt;/a&gt;. Subscribe to get notified when I publish next.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>apachehudi</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Why Big Tech is Migrating from Traditional Databases to NewSQL</title>
      <dc:creator>Lê Đình Phú</dc:creator>
      <pubDate>Mon, 08 Jun 2026 15:29:41 +0000</pubDate>
      <link>https://dev.to/dinhphu2k1gif/why-big-tech-is-migrating-from-traditional-databases-to-newsql-d95</link>
      <guid>https://dev.to/dinhphu2k1gif/why-big-tech-is-migrating-from-traditional-databases-to-newsql-d95</guid>
      <description>&lt;p&gt;💡 NoSQL used to be a trend, but NewSQL is the practical solution for today's large-scale core systems.&lt;br&gt;
Share with everyone a pretty detailed analysis of why Big Tech is gradually moving away from the traditional SQL system.&lt;br&gt;
🎯 The most "money-hungry" point of this architecture is to thoroughly solve the headache of Distributed Transaction – a problem that previous generations of DB patchwork often made errors. This is especially important when building a foundation for extremely demanding areas of data integrity such as finance or banking.&lt;br&gt;
🤔 I wonder if any of you have put NewSQL lines (such as TiDB, CockroachDB or Spanner...) into production? How does the actual experience in terms of operating costs and performance compare to traditional DBs?&lt;br&gt;
Let's 👇 share your perspective&lt;br&gt;
🔗 Read more details here:&lt;a href="https://dinhphuvn.substack.com/p/why-big-tech-is-migrating-from-traditional?r=u02d8&amp;amp;utm_campaign=post-expanded-share&amp;amp;utm_medium=web&amp;amp;triedRedirect=true" rel="noopener noreferrer"&gt;https://dinhphuvn.substack.com/p/why-big-tech-is-migrating-from-traditional?r=u02d8&amp;amp;utm_campaign=post-expanded-share&amp;amp;utm_medium=web&amp;amp;triedRedirect=true&lt;/a&gt;&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>database</category>
      <category>sql</category>
    </item>
    <item>
      <title>Raft Leader Election: How Does the Leader Election Mechanism Work?</title>
      <dc:creator>Lê Đình Phú</dc:creator>
      <pubDate>Sun, 07 Jun 2026 11:05:57 +0000</pubDate>
      <link>https://dev.to/dinhphu2k1gif/raft-leader-election-how-does-the-leader-election-mechanism-work-3ckm</link>
      <guid>https://dev.to/dinhphu2k1gif/raft-leader-election-how-does-the-leader-election-mechanism-work-3ckm</guid>
      <description>&lt;p&gt;🚀 [Part 2] When the Leader goes down, how does the system save itself?&lt;/p&gt;

&lt;p&gt;In the previous post, we covered the basics of Raft through a simple coffee shop analogy. But harsh reality always hits: What if the "shift supervisor" (Leader) suddenly crashes? Does the shop just close its doors? 😱&lt;/p&gt;

&lt;p&gt;Of course not! In this article, I'll break down the Leader Election mechanism in detail — how nodes automatically detect failures, step up to run for election, and secure a new Leader in the blink of an eye (just a few hundred milliseconds).&lt;/p&gt;

&lt;p&gt;Give it a read and let me know your thoughts:&lt;br&gt;
👉 &lt;a href="https://dinhphuvn.substack.com/p/02-raft-leader-election-how-does" rel="noopener noreferrer"&gt;https://dinhphuvn.substack.com/p/02-raft-leader-election-how-does&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>algorithms</category>
    </item>
  </channel>
</rss>
