<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Petascale Labs</title>
    <description>The latest articles on DEV Community by Petascale Labs (@petascalelabs).</description>
    <link>https://dev.to/petascalelabs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3962856%2Fd98bb0fa-6966-4446-bae3-69c6a1427f64.png</url>
      <title>DEV Community: Petascale Labs</title>
      <link>https://dev.to/petascalelabs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/petascalelabs"/>
    <language>en</language>
    <item>
      <title>The Story Behind Apache Iceberg's Format-Version: v1 to v4</title>
      <dc:creator>Petascale Labs</dc:creator>
      <pubDate>Tue, 23 Jun 2026 09:58:00 +0000</pubDate>
      <link>https://dev.to/petascalelabs/the-story-behind-apache-icebergs-format-version-v1-to-v4-3m7h</link>
      <guid>https://dev.to/petascalelabs/the-story-behind-apache-icebergs-format-version-v1-to-v4-3m7h</guid>
      <description>&lt;p&gt;Most engineers meet Apache Iceberg as a one-line answer: "it's the thing that&lt;br&gt;
gives you ACID transactions and time travel on object storage." That's true, and&lt;br&gt;
it's also where most people stop. But Iceberg has a version dial baked into every&lt;br&gt;
table - a single integer called &lt;code&gt;format-version&lt;/code&gt;, currently 1 through 4 - and&lt;br&gt;
each turn of that dial is a chapter in a single, surprisingly coherent story.&lt;/p&gt;

&lt;p&gt;It's a story about pushing transactional behavior down through layers. v1 made a&lt;br&gt;
table &lt;em&gt;atomic&lt;/em&gt; on top of immutable files, but it could only append and overwrite&lt;br&gt;
whole files. v2 pushed correctness down to the &lt;em&gt;row&lt;/em&gt;. v3 pushed identity down to&lt;br&gt;
the row, so a row keeps the same id across compactions. v4 turned around and&lt;br&gt;
refactored the &lt;em&gt;metadata itself&lt;/em&gt; so it can scale and move without rewrites.&lt;/p&gt;

&lt;p&gt;To follow that arc you need the one mental model the rest of this post hangs&lt;br&gt;
on. So we start there, then walk the four versions in order, keeping the parts&lt;br&gt;
that actually matter for understanding what's on disk.&lt;/p&gt;
&lt;h2&gt;
  
  
  The primer: an Iceberg table is a tree of files, not a folder
&lt;/h2&gt;

&lt;p&gt;The instinct everyone brings from Hive is that a table &lt;em&gt;is&lt;/em&gt; a directory: point at&lt;br&gt;
a path, list the files under it, that's your table. Iceberg breaks that instinct&lt;br&gt;
on purpose. An Iceberg table is &lt;strong&gt;not&lt;/strong&gt; a directory of data files. It is a&lt;br&gt;
&lt;strong&gt;tree of immutable metadata files&lt;/strong&gt; rooted at a single pointer held in a&lt;br&gt;
catalog.&lt;/p&gt;

&lt;p&gt;Every commit &lt;em&gt;appends&lt;/em&gt; new files to that tree and atomically moves the pointer.&lt;br&gt;
Nothing is ever mutated in place. That single design choice - immutable files,&lt;br&gt;
one movable pointer - is what makes everything else possible.&lt;/p&gt;

&lt;p&gt;The tree has five layers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkjeddcvjzmn9xdip9wn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fkjeddcvjzmn9xdip9wn6.png" alt="Iceberg's five metadata layers as a top-down flow. &lt;br&gt;
  Layer 0, the Catalog (REST, Hive, Glue, or JDBC), &lt;br&gt;
  holds the current pointer for a table and points at &lt;br&gt;
  one metadata.json. Layer 1, the table metadata JSON, &lt;br&gt;
  holds schemas, partition specs, sort orders, &lt;br&gt;
  snapshots, refs, and properties; its snapshot's &lt;br&gt;
  manifest-list points to Layer 2. Layer 2, the &lt;br&gt;
  manifest list (one per snapshot), lists manifests &lt;br&gt;
  with partition summaries; each manifest_path points &lt;br&gt;
  to Layer 3. Layer 3, a manifest in Avro, indexes data&lt;br&gt;
  and delete files with per-column stats; each &lt;br&gt;
  data_file.file_path points to Layer 4. Layer 4 holds &lt;br&gt;
  the data, delete, and DV files in Parquet, ORC, Avro,&lt;br&gt;
  or Puffin &lt;br&gt;
  format." width="666" height="1156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Read it top to bottom and you have the whole format:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 0 - the catalog.&lt;/strong&gt; It holds one fact per table: "the current pointer
for &lt;code&gt;prod.db.sales&lt;/code&gt; is &lt;em&gt;this&lt;/em&gt; metadata file." The catalog is the only mutable
thing in the entire system, and it lives &lt;em&gt;outside&lt;/em&gt; the format spec (REST
catalog, Hive Metastore, Glue, JDBC, and so on).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1 - the table metadata JSON.&lt;/strong&gt; A new one is written on every commit.
It holds the schemas, partition specs, sort orders, table properties, the list
of snapshots, and named references. This is the file the catalog points at.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2 - the manifest list.&lt;/strong&gt; One per snapshot. An Avro file listing the
manifests that make up that snapshot, each row carrying a partition summary so a
scan can skip whole manifests it doesn't need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 3 - the manifests.&lt;/strong&gt; Avro files, each an index over a set of data or
delete files, with per-column statistics for each file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 4 - the data.&lt;/strong&gt; The actual Parquet/ORC/Avro files, plus delete files
and deletion-vector blobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why build a table this way? Two reasons, and they're the whole pitch:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Atomic commits without coordination.&lt;/strong&gt; Because every file below the
catalog is written once under a unique path and never mutated, readers and
writers can build files independently with no locking. The &lt;em&gt;only&lt;/em&gt; contended
operation in the entire system is the catalog pointer swap. One compare-and-set,
and the commit is either visible or it isn't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cheap reads.&lt;/strong&gt; A scan loads the manifest list (roughly 100 KB even for a
huge table), prunes manifests by partition, and only then opens any data file.
You never list a directory; you walk an index.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On disk, a table looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://warehouse/db/sales/                 &amp;lt;- table location
  metadata/
    00001-....metadata.json              &amp;lt;- Layer 1: root JSON, one per commit
    00002-....metadata.json
    00042-....metadata.json              &amp;lt;- currently pointed at by the catalog
    snap-....avro                        &amp;lt;- Layer 2: a manifest list per snapshot
    8ef6-....avro                        &amp;lt;- Layer 3: manifests
    9b21-....avro
    stats-....puffin                     &amp;lt;- optional: table stats (NDV sketches)
  data/
    year=2025/month=06/00000-0-....parquet   &amp;lt;- Layer 4: data
    year=2025/month=06/00001-0-....parquet
    dv-....puffin                            &amp;lt;- Layer 4: deletion vectors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hold that picture. Every version below is a change to one or more of these five&lt;br&gt;
layers, and the format-version integer in Layer 1 is what tells a reader which&lt;br&gt;
rules apply. A reader that supports up to version &lt;em&gt;N&lt;/em&gt; will refuse to open a table&lt;br&gt;
whose &lt;code&gt;format-version&lt;/code&gt; is higher than &lt;em&gt;N&lt;/em&gt; rather than silently misread it.&lt;/p&gt;
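&lt;p&gt;As a rough illustration of that refusal rule, here is a minimal Python sketch. The function name &lt;code&gt;check_format_version&lt;/code&gt; and the constant are made up for this post; only the &lt;code&gt;format-version&lt;/code&gt; key comes from the table metadata JSON.&lt;/p&gt;

```python
import json

# Hypothetical: the highest format-version this toy reader understands.
MAX_SUPPORTED_FORMAT_VERSION = 2


def check_format_version(metadata_json: str,
                         max_supported: int = MAX_SUPPORTED_FORMAT_VERSION) -> int:
    """Return the table's format-version, or raise if it is newer than supported."""
    metadata = json.loads(metadata_json)
    version = metadata["format-version"]
    if version > max_supported:
        # Refuse loudly instead of guessing at rules we do not know.
        raise ValueError(
            f"table is format-version {version}, reader supports up to {max_supported}"
        )
    return version
```

&lt;p&gt;A real reader does this check before it touches anything else in the file, so an unknown version never gets far enough to be misread.&lt;/p&gt;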

&lt;p&gt;Now the four chapters.&lt;/p&gt;
&lt;h2&gt;
  
  
  v1: a table that's atomic, but only at the file
&lt;/h2&gt;

&lt;p&gt;v1 is the foundation, and it nails the hard part: snapshot isolation and atomic&lt;br&gt;
commits on immutable files. A v1 table already has schemas, partition specs,&lt;br&gt;
snapshots, and time travel. If all you ever do is &lt;em&gt;append&lt;/em&gt; data and occasionally&lt;br&gt;
&lt;em&gt;overwrite whole files&lt;/em&gt;, v1 is a complete, correct table format.&lt;/p&gt;

&lt;p&gt;Its limits are exactly the things the later versions go after.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It can't delete or update a single row.&lt;/strong&gt; The smallest unit v1 can change is a&lt;br&gt;
file. Want to delete one row? Rewrite the entire file without it, then swap the&lt;br&gt;
file in a new snapshot. This is copy-on-write, and on a wide table it means&lt;br&gt;
rewriting gigabytes to remove a handful of rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Its metadata carries a few quirks that v2 had to clean up:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The schema and partition spec were stored as &lt;strong&gt;singular&lt;/strong&gt; fields - one schema,
one spec - rather than a list of historical ones. A v1 metadata file has a
&lt;code&gt;schema&lt;/code&gt; and a &lt;code&gt;partition-spec&lt;/code&gt;; both are deprecated from v2 onward in favor of
&lt;code&gt;schemas[]&lt;/code&gt; and &lt;code&gt;partition-specs[]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Partition field IDs were &lt;strong&gt;not tracked explicitly&lt;/strong&gt;. Implementations just
assigned them sequentially starting at 1000. That caused real ambiguity when
the same logical field got a different transform across specs - there was no
stable identity to tie them together.&lt;/li&gt;
&lt;li&gt;A snapshot could &lt;strong&gt;embed its manifests inline&lt;/strong&gt; as a &lt;code&gt;manifests: [path, ...]&lt;/code&gt;
list instead of pointing at a separate manifest-list file. Convenient, but it
meant the per-snapshot bookkeeping that later inheritance rules depend on had
nowhere to live.&lt;/li&gt;
&lt;li&gt;A few &lt;code&gt;data_file&lt;/code&gt; fields existed that nobody needed: &lt;code&gt;block_size_in_bytes&lt;/code&gt;,
&lt;code&gt;file_ordinal&lt;/code&gt;, &lt;code&gt;sort_columns&lt;/code&gt;. All removed in v2.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this makes v1 wrong. It makes v1 &lt;em&gt;append-shaped&lt;/em&gt;. The entire push of v2&lt;br&gt;
is to make the table &lt;em&gt;mutable at the row&lt;/em&gt;, and to add the bookkeeping that makes&lt;br&gt;
that safe.&lt;/p&gt;
&lt;h2&gt;
  
  
  v2: correctness pushed down to the row
&lt;/h2&gt;

&lt;p&gt;v2 is the version most production tables ran on for years, because it's the one&lt;br&gt;
that turns Iceberg from an append log into a real transactional table. Three big&lt;br&gt;
ideas arrive together: &lt;strong&gt;row-level deletes&lt;/strong&gt;, &lt;strong&gt;sequence numbers&lt;/strong&gt;, and &lt;strong&gt;named&lt;br&gt;
references&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Row-level deletes: merge-on-read
&lt;/h3&gt;

&lt;p&gt;Instead of rewriting a file to remove rows, v2 lets you write a small &lt;strong&gt;delete&lt;br&gt;
file&lt;/strong&gt; that says "these rows in that data file are gone." The reader applies&lt;br&gt;
deletes on the fly at scan time. This is &lt;strong&gt;merge-on-read&lt;/strong&gt;, and it comes in two&lt;br&gt;
flavors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Position deletes&lt;/strong&gt; - a list of &lt;code&gt;(file_path, position)&lt;/code&gt; tuples: "row 42 and
row 1007 of &lt;em&gt;this&lt;/em&gt; file are deleted." Precise, cheap to write, used by &lt;code&gt;DELETE&lt;/code&gt;
and &lt;code&gt;MERGE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Equality deletes&lt;/strong&gt; - a predicate on column values: "any row where
&lt;code&gt;id = 12345&lt;/code&gt; is deleted." These don't need to know &lt;em&gt;where&lt;/em&gt; the row lives, which
makes them ideal for streaming upserts where you delete-then-insert without
reading the old file first.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make this work, the format needed a way to say whether a file holds data or&lt;br&gt;
deletes, and a way to order deletes against data. Both arrived in v2.&lt;/p&gt;
&lt;h3&gt;
  
  
  The &lt;code&gt;content&lt;/code&gt; discriminator and sequence numbers
&lt;/h3&gt;

&lt;p&gt;A v2 manifest, and each file it lists, now carries a &lt;code&gt;content&lt;/code&gt; field. On the&lt;br&gt;
manifest list it's &lt;code&gt;0 = data&lt;/code&gt; or &lt;code&gt;1 = deletes&lt;/code&gt;; a single manifest holds &lt;em&gt;either&lt;/em&gt;&lt;br&gt;
data files &lt;em&gt;or&lt;/em&gt; delete files, never both, so scan planning can load all the&lt;br&gt;
delete manifests first. On the data file itself, &lt;code&gt;content&lt;/code&gt; is &lt;code&gt;0 = DATA&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;1 = POSITION_DELETES&lt;/code&gt;, &lt;code&gt;2 = EQUALITY_DELETES&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The ordering problem is subtler. If a delete file says "delete &lt;code&gt;id = 12345&lt;/code&gt;,"&lt;br&gt;
&lt;em&gt;which&lt;/em&gt; inserts of that id does it kill - the ones before it, or also the ones&lt;br&gt;
after? v2 answers this with a monotonic &lt;strong&gt;sequence number&lt;/strong&gt; assigned at commit&lt;br&gt;
time and threaded through every layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The table metadata tracks &lt;code&gt;last-sequence-number&lt;/code&gt;, bumped on each commit.&lt;/li&gt;
&lt;li&gt;Each snapshot records its &lt;code&gt;sequence-number&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The manifest list records each manifest's &lt;code&gt;sequence_number&lt;/code&gt; and a
&lt;code&gt;min_sequence_number&lt;/code&gt; (the smallest data sequence number among live files in
it).&lt;/li&gt;
&lt;li&gt;Each manifest entry carries the file's &lt;code&gt;sequence_number&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rule that falls out: an &lt;strong&gt;equality delete applies to a data file only when&lt;br&gt;
the delete's sequence number is strictly greater than the file's&lt;/strong&gt; (and they share a&lt;br&gt;
partition). A position delete applies when its sequence number is equal or greater.&lt;br&gt;
That's how "delete then insert" does the right thing - an insert committed with, or&lt;br&gt;
after, the equality delete never has a lower sequence number than it, so it survives.&lt;/p&gt;
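&lt;p&gt;The applicability rule is small enough to write down. This is a simplified Python sketch, not any engine's actual planner: &lt;code&gt;FileEntry&lt;/code&gt; and &lt;code&gt;delete_applies&lt;/code&gt; are names invented here, partitions are compared as plain tuples, and it ignores details such as a position delete also having to name the data file's path.&lt;/p&gt;

```python
from dataclasses import dataclass

# The content codes from the data_file struct.
DATA, POSITION_DELETES, EQUALITY_DELETES = 0, 1, 2


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical, trimmed-down view of one manifest entry."""
    content: int
    sequence_number: int
    partition: tuple


def delete_applies(delete: FileEntry, data: FileEntry) -> bool:
    """True if `delete` must be applied when scanning `data`."""
    if data.content != DATA or delete.content == DATA:
        return False
    if delete.partition != data.partition:
        return False
    if delete.content == EQUALITY_DELETES:
        # Strictly greater: rows written in the same commit survive.
        return delete.sequence_number > data.sequence_number
    # Position deletes also hit files from the same commit.
    return delete.sequence_number >= data.sequence_number
```

&lt;p&gt;The strict comparison on the equality branch is the whole trick: a streaming upsert can write the delete and the replacement row in one commit, and the replacement is left alone.&lt;/p&gt;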
&lt;h3&gt;
  
  
  Inheritance: why this is cheap
&lt;/h3&gt;

&lt;p&gt;Here's the piece that surprises people. A manifest entry can leave its&lt;br&gt;
&lt;code&gt;snapshot_id&lt;/code&gt;, &lt;code&gt;sequence_number&lt;/code&gt;, and &lt;code&gt;file_sequence_number&lt;/code&gt; &lt;strong&gt;null in the&lt;br&gt;
file&lt;/strong&gt;, and the reader fills them in from the manifest list. Why bother? Because&lt;br&gt;
it lets the &lt;em&gt;same manifest file&lt;/em&gt; be reused across optimistic-retry attempts. When&lt;br&gt;
a commit loses the compare-and-set race and has to retry with a new sequence&lt;br&gt;
number, only the small manifest list needs rewriting - the manifests and data&lt;br&gt;
files it points at are untouched.&lt;/p&gt;
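&lt;p&gt;Inheritance at read time can be sketched like this. The helper name &lt;code&gt;inherit&lt;/code&gt; is made up, entries and manifest-list rows are shown as plain dicts, and the sketch assumes the entry is a newly added file, which is the case where these fields are left null.&lt;/p&gt;

```python
def inherit(entry: dict, manifest: dict) -> dict:
    """Fill the null fields of an added manifest entry from its manifest-list row."""
    resolved = dict(entry)  # leave the on-disk entry untouched
    if resolved.get("snapshot_id") is None:
        resolved["snapshot_id"] = manifest["added_snapshot_id"]
    if resolved.get("sequence_number") is None:
        resolved["sequence_number"] = manifest["sequence_number"]
    if resolved.get("file_sequence_number") is None:
        resolved["file_sequence_number"] = manifest["sequence_number"]
    return resolved
```

&lt;p&gt;On a retry, the same entry is simply resolved against a manifest-list row carrying the new sequence number. The manifest bytes never change.&lt;/p&gt;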
&lt;h3&gt;
  
  
  Named references: branches and tags
&lt;/h3&gt;

&lt;p&gt;v2 adds a &lt;code&gt;refs&lt;/code&gt; map to the table metadata - named branches and tags pointing at&lt;br&gt;
snapshots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"refs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"main"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"snapshot-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8392648&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"branch"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"audit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"snapshot-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"branch"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                   &lt;/span&gt;&lt;span class="nl"&gt;"min-snapshots-to-keep"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"max-snapshot-age-ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;604800000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"prod_release"&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"snapshot-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7281001&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tag"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                   &lt;/span&gt;&lt;span class="nl"&gt;"max-ref-age-ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2592000000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;main&lt;/code&gt; always exists; if &lt;code&gt;refs&lt;/code&gt; is empty it implicitly points at&lt;br&gt;
&lt;code&gt;current-snapshot-id&lt;/code&gt;. Branches let you stage and validate writes off to the side&lt;br&gt;
(write-audit-publish); tags pin a snapshot so expiration won't garbage-collect&lt;br&gt;
it. Branches carry their own retention floor (&lt;code&gt;min-snapshots-to-keep&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;max-snapshot-age-ms&lt;/code&gt;); tags and non-main branches carry &lt;code&gt;max-ref-age-ms&lt;/code&gt;. &lt;code&gt;main&lt;/code&gt;&lt;br&gt;
never expires.&lt;/p&gt;
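&lt;p&gt;Resolving a reference is a dictionary lookup with one fallback. A small Python sketch, where &lt;code&gt;resolve_ref&lt;/code&gt; is a hypothetical helper and the metadata is the parsed Layer 1 JSON:&lt;/p&gt;

```python
from typing import Optional


def resolve_ref(metadata: dict, name: str = "main") -> Optional[int]:
    """Return the snapshot id a branch or tag points at, or None if it is unknown."""
    ref = metadata.get("refs", {}).get(name)
    if ref is not None:
        return ref["snapshot-id"]
    if name == "main":
        # No explicit entry: main falls back to current-snapshot-id.
        return metadata.get("current-snapshot-id")
    return None
```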
&lt;h3&gt;
  
  
  The v2 metadata cleanup
&lt;/h3&gt;

&lt;p&gt;v2 also formalized a lot of Layer 1. These fields became &lt;strong&gt;required&lt;/strong&gt;:&lt;br&gt;
&lt;code&gt;last-sequence-number&lt;/code&gt;, &lt;code&gt;current-schema-id&lt;/code&gt;, &lt;code&gt;schemas&lt;/code&gt;, &lt;code&gt;default-spec-id&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;partition-specs&lt;/code&gt;, &lt;code&gt;last-partition-id&lt;/code&gt;, &lt;code&gt;default-sort-order-id&lt;/code&gt;, &lt;code&gt;sort-orders&lt;/code&gt;,&lt;br&gt;
and &lt;code&gt;table-uuid&lt;/code&gt; (a stable identity generated at create time, used as a&lt;br&gt;
refresh-time integrity check). The singular &lt;code&gt;schema&lt;/code&gt; and &lt;code&gt;partition-spec&lt;/code&gt; are&lt;br&gt;
deprecated, the inline snapshot &lt;code&gt;manifests&lt;/code&gt; list is gone, and partition field IDs&lt;br&gt;
are now &lt;strong&gt;explicit and unique across all specs&lt;/strong&gt; - fixing the v1 ambiguity.&lt;/p&gt;

&lt;p&gt;One nice compatibility property: &lt;strong&gt;a v1 file reads cleanly as v2.&lt;/strong&gt; A missing&lt;br&gt;
&lt;code&gt;sequence_number&lt;/code&gt; is read as 0, and a missing &lt;code&gt;content&lt;/code&gt; is read as 0 (data). So&lt;br&gt;
upgrading is a metadata-only operation; nothing has to be rewritten on day one.&lt;/p&gt;

&lt;p&gt;By the end of v2, Iceberg is a full transactional table: insert, delete, update,&lt;br&gt;
upsert, branch, tag, time-travel. So what's left for v3?&lt;/p&gt;
&lt;h2&gt;
  
  
  v3: identity, efficient deletes, and richer data
&lt;/h2&gt;

&lt;p&gt;If v2 made the &lt;em&gt;table&lt;/em&gt; transactional, v3 makes the &lt;em&gt;row&lt;/em&gt; a first-class citizen.&lt;br&gt;
It adds three things that don't fit neatly into v2's model: a stable identity for&lt;br&gt;
every row, a far more efficient delete mechanism, and a richer type and security&lt;br&gt;
surface.&lt;/p&gt;
&lt;h3&gt;
  
  
  Row lineage: every row gets a stable id
&lt;/h3&gt;

&lt;p&gt;This is the headline feature, and it's genuinely clever because it touches three&lt;br&gt;
layers at once without storing an id per row when the row is first written. v3 mandates that every row&lt;br&gt;
has a stable &lt;code&gt;_row_id&lt;/code&gt; that survives compaction - so you can track a row across&lt;br&gt;
rewrites, build change feeds, and reason about lineage. It works by seeding,&lt;br&gt;
not storing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Set when&lt;/th&gt;
&lt;th&gt;Used for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 (table metadata)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;next-row-id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bumped per commit&lt;/td&gt;
&lt;td&gt;seeds the next snapshot's first row id&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 (snapshot, inside the table metadata)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;first-row-id&lt;/code&gt;, &lt;code&gt;added-rows&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;at commit&lt;/td&gt;
&lt;td&gt;starting &lt;code&gt;_row_id&lt;/code&gt; for the manifest list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 (manifest list)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;first_row_id&lt;/code&gt; per manifest&lt;/td&gt;
&lt;td&gt;at commit&lt;/td&gt;
&lt;td&gt;starting &lt;code&gt;_row_id&lt;/code&gt; for files in that manifest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 (manifest entry)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;data_file.first_row_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;at commit&lt;/td&gt;
&lt;td&gt;starting &lt;code&gt;_row_id&lt;/code&gt; for rows in that file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 (data file)&lt;/td&gt;
&lt;td&gt;reserved fields &lt;code&gt;_row_id&lt;/code&gt;, &lt;code&gt;_last_updated_sequence_number&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;inherited at read&lt;/td&gt;
&lt;td&gt;stable identity across compactions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The reader computes a row's id with one formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_row_id = data_file.first_row_id + row_position_in_file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No per-row storage for freshly written rows; the id is derived from where the row&lt;br&gt;
sits. Only when a row is moved to a new file, as in compaction, does the writer&lt;br&gt;
copy its existing &lt;code&gt;_row_id&lt;/code&gt; into that file so the id survives. If &lt;code&gt;first_row_id&lt;/code&gt;&lt;br&gt;
is null - say a v2-era file in a table that was upgraded to v3 - then &lt;code&gt;_row_id&lt;/code&gt;&lt;br&gt;
reads as null for those rows, which is exactly the honest answer. Equality&lt;br&gt;
deletes deliberately break lineage: an equality-delete update never reads the old&lt;br&gt;
row, so the replacement gets a fresh &lt;code&gt;_row_id&lt;/code&gt; rather than inheriting one it&lt;br&gt;
can't prove.&lt;/p&gt;
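&lt;p&gt;Both halves of the seeding scheme fit in a few lines of Python. The function names are invented for illustration, and &lt;code&gt;assign_first_row_ids&lt;/code&gt; assumes new files are packed back to back from the snapshot's starting id, each one reserving as many ids as it has records.&lt;/p&gt;

```python
from typing import List, Optional


def assign_first_row_ids(start: int, record_counts: List[int]) -> List[int]:
    """Give each newly added data file a first_row_id, in order."""
    first_ids = []
    next_id = start
    for count in record_counts:
        first_ids.append(next_id)
        next_id += count  # reserve one id per row in this file
    return first_ids


def row_id(first_row_id: Optional[int], row_position: int) -> Optional[int]:
    """Derive a row's id at read time; null seed means no lineage."""
    if first_row_id is None:
        return None
    return first_row_id + row_position
```

&lt;p&gt;Three files of 3, 5, and 2 rows seeded at 1000 would start at 1000, 1003, and 1008, and the fifth row (position 4) of the second file reads back as id 1007.&lt;/p&gt;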
&lt;h3&gt;
  
  
  Deletion vectors: position deletes, done right
&lt;/h3&gt;

&lt;p&gt;Position delete &lt;em&gt;files&lt;/em&gt; worked, but they had a scaling problem: lots of tiny&lt;br&gt;
delete files, each needing to be opened and merged. v3 replaces them with&lt;br&gt;
&lt;strong&gt;deletion vectors (DVs)&lt;/strong&gt; - a single compressed bitmap per data file, stored as&lt;br&gt;
a blob inside a Puffin file. One bitmap, one referenced data file, looked up by&lt;br&gt;
byte offset.&lt;/p&gt;

&lt;p&gt;The manifest entry for a DV reuses the position-delete &lt;code&gt;content&lt;/code&gt; code but adds&lt;br&gt;
three fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;manifest_entry {
  status = 1
  data_file {
    content              = 1                              // shares position-delete code
    file_path            = "s3://warehouse/db/sales/data/dv-....puffin"
    file_format          = "puffin"
    partition            = { country: "IN", ts_day: 2025-06-15 }  // same partition as target
    record_count         = 3                              // number of deleted positions
    referenced_data_file = "s3://.../00000-abc.parquet"   // v3 NEW: which data file this DV covers
    content_offset       = 4                              // v3 NEW: byte offset of the blob in the Puffin file
    content_size_in_bytes= 23                             // v3 NEW: blob length
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;referenced_data_file&lt;/code&gt;, &lt;code&gt;content_offset&lt;/code&gt;, and &lt;code&gt;content_size_in_bytes&lt;/code&gt; are the new&lt;br&gt;
fields that let the reader jump straight to one bitmap. Position delete files are&lt;br&gt;
&lt;strong&gt;deprecated&lt;/strong&gt; in v3: writers can't create new ones, and existing ones get merged&lt;br&gt;
into DVs over time. The result is one delete artifact per data file instead of a&lt;br&gt;
pile of small files.&lt;/p&gt;
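&lt;p&gt;The "jump straight to one bitmap" step is just a seek and a bounded read. The sketch below uses an in-memory stand-in for a Puffin file and a placeholder payload; it does not decode the bitmap, and &lt;code&gt;read_dv_blob&lt;/code&gt; is a hypothetical helper, not a library call.&lt;/p&gt;

```python
import io
from typing import BinaryIO


def read_dv_blob(puffin: BinaryIO, content_offset: int, content_size_in_bytes: int) -> bytes:
    """Read exactly one blob out of an open Puffin file."""
    puffin.seek(content_offset)
    blob = puffin.read(content_size_in_bytes)
    if len(blob) != content_size_in_bytes:
        raise EOFError("Puffin file ended before the full blob was read")
    return blob


# Stand-in file: 4 leading bytes, a 23-byte placeholder blob, then a fake footer.
fake_puffin = io.BytesIO(b"PFA1" + bytes(range(23)) + b"footer")
blob = read_dv_blob(fake_puffin, content_offset=4, content_size_in_bytes=23)
```

&lt;p&gt;The offset and size here mirror the manifest entry above. Against object storage the same two numbers turn into a single ranged GET.&lt;/p&gt;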
&lt;h3&gt;
  
  
  New types and column defaults
&lt;/h3&gt;

&lt;p&gt;v3 broadens what a column can hold. New types: &lt;code&gt;variant&lt;/code&gt;&lt;br&gt;
(semi-structured), &lt;code&gt;geometry&lt;/code&gt; and &lt;code&gt;geography&lt;/code&gt; (spatial), &lt;code&gt;unknown&lt;/code&gt; (a column whose type&lt;br&gt;
isn't known yet), and nanosecond timestamps &lt;code&gt;timestamp_ns&lt;/code&gt; / &lt;code&gt;timestamptz_ns&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It also adds &lt;strong&gt;column defaults&lt;/strong&gt;, which finally make adding a non-null column&lt;br&gt;
sane:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"initial-default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;rows&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;files&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;written&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;before&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;column&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;existed&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"write-default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"IN"&lt;/span&gt;&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;fill&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;when&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;writer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;omits&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;column&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;initial-default&lt;/code&gt; is the value that &lt;em&gt;existing&lt;/em&gt; rows get for a freshly added&lt;br&gt;
column, with no file rewrite - the reader synthesizes it. &lt;code&gt;write-default&lt;/code&gt; is what&lt;br&gt;
new writes use when the column is omitted. Together they make schema evolution a&lt;br&gt;
metadata change instead of a backfill.&lt;/p&gt;
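&lt;p&gt;In code, the two defaults are two separate fallbacks, one on the read path and one on the write path. A minimal Python sketch with invented helper names, using the &lt;code&gt;country&lt;/code&gt; field from the example above:&lt;/p&gt;

```python
COUNTRY = {"id": 5, "name": "country", "initial-default": "IN", "write-default": "IN"}


def read_value(stored_by_field_id: dict, field: dict):
    """What a reader returns for one row, given the columns the file actually has."""
    if field["id"] in stored_by_field_id:
        return stored_by_field_id[field["id"]]
    # Column absent from the file: the file predates the column.
    return field.get("initial-default")


def write_value(row: dict, field: dict):
    """What a writer stores when the incoming row may omit the column."""
    if field["name"] in row:
        return row[field["name"]]
    return field.get("write-default")
```

&lt;p&gt;Note that the read path keys on field id, not name, so a later rename of the column doesn't disturb old files.&lt;/p&gt;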

&lt;p&gt;The type-promotion rules that keep schema evolution safe also expand in v3:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;From&lt;/th&gt;
&lt;th&gt;v1 / v2 promotion&lt;/th&gt;
&lt;th&gt;v3+ promotion&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unknown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;promotable to &lt;em&gt;any&lt;/em&gt; type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;int&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;long&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;long&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;timestamp_ns&lt;/code&gt; (not the tz variants)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;float&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;double&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;double&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;decimal(P, S)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;decimal(P', S)&lt;/code&gt; with &lt;code&gt;P' &amp;gt; P&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;same&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
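&lt;p&gt;The table translates directly into a lookup. This Python sketch covers only the rows shown above; &lt;code&gt;can_promote&lt;/code&gt; is a made-up name and types are passed as their string spellings.&lt;/p&gt;

```python
import re

_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")

PROMOTIONS_V1_V2 = {"int": {"long"}, "float": {"double"}}
PROMOTIONS_V3_EXTRA = {"date": {"timestamp", "timestamp_ns"}}


def can_promote(src: str, dst: str, format_version: int) -> bool:
    """True if a column of type `src` may be evolved to `dst`."""
    if src == dst:
        return True
    s, d = _DECIMAL.fullmatch(src), _DECIMAL.fullmatch(dst)
    if s and d:
        # Same scale, strictly wider precision.
        return int(s.group(2)) == int(d.group(2)) and int(d.group(1)) > int(s.group(1))
    if dst in PROMOTIONS_V1_V2.get(src, set()):
        return True
    if format_version >= 3:
        if src == "unknown":
            return True
        return dst in PROMOTIONS_V3_EXTRA.get(src, set())
    return False
```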
&lt;h3&gt;
  
  
  Partition transforms go multi-argument
&lt;/h3&gt;

&lt;p&gt;v3 adds &lt;code&gt;source-ids&lt;/code&gt; (plural) on partition fields, so a transform can take more&lt;br&gt;
than one source column. Single-argument transforms still write the old&lt;br&gt;
&lt;code&gt;source-id&lt;/code&gt;. The full set of allowed transforms is &lt;code&gt;identity&lt;/code&gt;, &lt;code&gt;bucket[N]&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;truncate[W]&lt;/code&gt;, &lt;code&gt;year&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, &lt;code&gt;day&lt;/code&gt;, &lt;code&gt;hour&lt;/code&gt;, and &lt;code&gt;void&lt;/code&gt;. And a forward-&lt;br&gt;
compatibility rule lands: &lt;strong&gt;v3 readers must tolerate an unknown transform&lt;/strong&gt; and&lt;br&gt;
simply skip filter pushdown on it, rather than refusing to read. (In v1/v2 this was&lt;br&gt;
only a &lt;em&gt;should&lt;/em&gt;, not a must.) Writers, of course, still can't commit a transform they don't&lt;br&gt;
understand.&lt;/p&gt;
&lt;h3&gt;
  
  
  Encryption arrives
&lt;/h3&gt;

&lt;p&gt;v3 adds table-level encryption with a three-place key model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"encryption-keys"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"key-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"k1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"encrypted-key-metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"BASE64..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;KMS-wrapped&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;data&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;encryption&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;key&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"encrypted-by-id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kms-master-2025"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;logical&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;key-encryption-key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;id&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;The table JSON holds &lt;code&gt;encryption-keys[]&lt;/code&gt; - data encryption keys (DEKs) each
wrapped by a KMS-resident key-encryption-key (KEK).&lt;/li&gt;
&lt;li&gt;Each snapshot carries a &lt;code&gt;key-id&lt;/code&gt; naming which DEK protects that snapshot's
manifest-list key metadata.&lt;/li&gt;
&lt;li&gt;Each file can carry per-file &lt;code&gt;key_metadata&lt;/code&gt; (this field already existed in
v1/v2 on data files, but without a central registry).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The DEK-to-KEK chain is opaque to the format; implementations plug into AWS KMS,&lt;br&gt;
GCP KMS, Vault, and so on via the wrapped bytes.&lt;/p&gt;

&lt;p&gt;By the end of v3, a row has an identity, deletes are a single bitmap, columns can&lt;br&gt;
default and hold variant or spatial data, and the table can be encrypted. The&lt;br&gt;
&lt;em&gt;user-visible&lt;/em&gt; feature set is essentially complete. Which is why v4 looks&lt;br&gt;
different from everything before it.&lt;/p&gt;
&lt;h2&gt;
  
  
  v4: the refactor for scale and portability
&lt;/h2&gt;

&lt;p&gt;v4 introduces &lt;strong&gt;no new user-visible types and no new delete mechanisms.&lt;/strong&gt; It is a&lt;br&gt;
metadata refactor aimed at three things: performance, portability, and richer&lt;br&gt;
per-file statistics. The changes are quieter, but two of them matter a lot in&lt;br&gt;
production.&lt;/p&gt;
&lt;h3&gt;
  
  
  Relative paths: move a table without rewriting it
&lt;/h3&gt;

&lt;p&gt;This is the biggest &lt;em&gt;invisible&lt;/em&gt; change in the whole format. In v1 through v3,&lt;br&gt;
every path stored inside metadata - &lt;code&gt;file_path&lt;/code&gt;, &lt;code&gt;manifest_path&lt;/code&gt;, the manifest&lt;br&gt;
list, &lt;code&gt;metadata-file&lt;/code&gt;, statistics paths - had to be &lt;strong&gt;absolute&lt;/strong&gt;, complete with a&lt;br&gt;
URI scheme like &lt;code&gt;s3://&lt;/code&gt; or &lt;code&gt;hdfs://&lt;/code&gt;. That meant the moment you wanted to move a&lt;br&gt;
table to a different bucket, every one of those absolute paths was wrong, and you&lt;br&gt;
had to rewrite the entire metadata tree to fix them.&lt;/p&gt;

&lt;p&gt;v4 allows paths to be &lt;strong&gt;relative to the table location&lt;/strong&gt;. The resolution rule is&lt;br&gt;
simple:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Table location&lt;/th&gt;
&lt;th&gt;Stored path&lt;/th&gt;
&lt;th&gt;Resolves to&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s3://bucket/db/table&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;data/00000.parquet&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s3://bucket/db/table/data/00000.parquet&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s3://bucket/db/table&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;hdfs://wh/...&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;hdfs://wh/...&lt;/code&gt; (absolute, used as-is)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v3 and earlier&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s3://bucket/db/table&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s3://bucket/db/table/data/00000.parquet&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If a stored path has a URI scheme, it's absolute and used as-is. If it doesn't,&lt;br&gt;
the reader resolves it as &lt;code&gt;table_location + "/" + path&lt;/code&gt;. The writer rule:&lt;br&gt;
default to relative for files under the table location, use absolute for files&lt;br&gt;
outside it (say, a backfill from another bucket). Because location is now what&lt;br&gt;
ties relative paths together, &lt;code&gt;location&lt;/code&gt; in the table metadata JSON becomes&lt;br&gt;
&lt;strong&gt;optional&lt;/strong&gt; - the catalog can supply it.&lt;/p&gt;

&lt;p&gt;The operational payoff is the headline: moving a table from&lt;br&gt;
&lt;code&gt;s3://bucket-a/db/sales/&lt;/code&gt; to &lt;code&gt;s3://bucket-b/db/sales/&lt;/code&gt; needs only a catalog&lt;br&gt;
pointer update and, optionally, a new metadata.json with the new &lt;code&gt;location&lt;/code&gt;. &lt;strong&gt;No&lt;br&gt;
manifest list, no manifest, no data file gets rewritten.&lt;/strong&gt; Pre-v4 the same move&lt;br&gt;
required a full metadata rewrite.&lt;/p&gt;
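
&lt;p&gt;The resolution rule is small enough to sketch in a few lines. This is an illustration, not the reference implementation: &lt;code&gt;resolve_path&lt;/code&gt; is a hypothetical helper, the external &lt;code&gt;hdfs://&lt;/code&gt; path is made up, and stripping a trailing slash from the location is a defensive choice of this sketch rather than something the rule above states.&lt;/p&gt;

```python
from urllib.parse import urlparse


def resolve_path(table_location, stored_path):
    """Resolve a path stored in metadata against the table location.

    A stored path with a URI scheme is absolute and returned unchanged;
    anything else is treated as relative to table_location.
    """
    if urlparse(stored_path).scheme:
        return stored_path
    # Assumption: tolerate a trailing slash on the location.
    return table_location.rstrip("/") + "/" + stored_path


old_location = "s3://bucket-a/db/sales"
new_location = "s3://bucket-b/db/sales"
stored = "data/00000.parquet"  # what the manifest records in a v4 table

before_move = resolve_path(old_location, stored)
after_move = resolve_path(new_location, stored)

# Hypothetical file outside the table location: stored absolute, used as-is.
external = resolve_path(new_location, "hdfs://wh/backfill/part-0.parquet")
```

&lt;p&gt;The stored string never changes across the move; only the location handed to the resolver does, which is why no manifest needs rewriting.&lt;/p&gt;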
&lt;h3&gt;
  
  
  Typed &lt;code&gt;content_stats&lt;/code&gt;: five maps become one struct
&lt;/h3&gt;

&lt;p&gt;In v3 and earlier, per-column statistics on a data file were &lt;strong&gt;five parallel&lt;br&gt;
maps&lt;/strong&gt; keyed by field id: &lt;code&gt;value_counts&lt;/code&gt;, &lt;code&gt;null_value_counts&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;nan_value_counts&lt;/code&gt;, &lt;code&gt;lower_bounds&lt;/code&gt;, &lt;code&gt;upper_bounds&lt;/code&gt; (plus the on-disk&lt;br&gt;
&lt;code&gt;column_sizes&lt;/code&gt;). Five maps to keep in sync, all loosely typed (bounds were raw&lt;br&gt;
binary-encoded bytes).&lt;/p&gt;

&lt;p&gt;v4 replaces them with &lt;strong&gt;one typed struct&lt;/strong&gt;, &lt;code&gt;content_stats&lt;/code&gt;, whose layout is&lt;br&gt;
generated from the table schema itself. Each column reserves a block of ids and&lt;br&gt;
gets a typed sub-struct:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;146: optional struct content_stats {
  10_400: optional struct id (default null) {        // stats for table field 2 (int)
    10_401: optional int     lower_bound;
    10_402: optional int     upper_bound;
    10_403: optional boolean tight_bounds;
    10_404: optional long    value_count;
  }
  10_600: optional struct data (default null) {       // stats for table field 3 (string)
    10_601: optional string  lower_bound;
    10_602: optional string  upper_bound;
    10_603: optional boolean tight_bounds;            // v4 NEW: exact vs truncated min/max
    10_604: optional long    value_count;
    10_605: optional long    null_value_count;
    10_607: optional int     avg_value_size_in_bytes; // v4 NEW: variable-length sizing
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The id assignment is mechanical - each column reserves 200 ids - and two genuinely&lt;br&gt;
new pieces of information appear: &lt;code&gt;tight_bounds&lt;/code&gt;, a flag saying whether the&lt;br&gt;
min/max are exact or truncated (truncated bounds still prune, but you have to scan&lt;br&gt;
to confirm a match), and &lt;code&gt;avg_value_size_in_bytes&lt;/code&gt; for variable-length columns,&lt;br&gt;
which helps the planner estimate read cost. Spatial columns use typed&lt;br&gt;
&lt;code&gt;geo_lower&lt;/code&gt; / &lt;code&gt;geo_upper&lt;/code&gt; structs instead of opaque WKB bytes.&lt;/p&gt;

&lt;p&gt;The reassuring part: for the fields they share, &lt;strong&gt;v3 and v4 statistics are equivalent.&lt;/strong&gt; A missing map key in&lt;br&gt;
v3 is the same as a missing-or-null sub-struct in v4. Nothing is lost in the&lt;br&gt;
translation; the existing information carries over, typed and consolidated, alongside the two new fields.&lt;/p&gt;
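
&lt;p&gt;The id layout in the listing can be reproduced with simple arithmetic. The base formula below is inferred from the two examples shown (table field 2 maps to 10_400, field 3 to 10_600) and should be treated as an assumption rather than spec text; only the offsets that appear in the listing are included.&lt;/p&gt;

```python
# Inferred from the listing above, not quoted from the spec:
#   struct id = 10_000 + 200 * table field id
STATS_BASE = 10_000
IDS_PER_COLUMN = 200

# Offsets that appear in the listing; any others are deliberately left out.
STAT_OFFSETS = {
    "lower_bound": 1,
    "upper_bound": 2,
    "tight_bounds": 3,
    "value_count": 4,
    "null_value_count": 5,
    "avg_value_size_in_bytes": 7,
}


def stats_struct_id(field_id):
    """Id of the per-column sub-struct inside content_stats."""
    return STATS_BASE + IDS_PER_COLUMN * field_id


def stat_field_id(field_id, stat_name):
    """Id of one statistic inside a column's sub-struct."""
    return stats_struct_id(field_id) + STAT_OFFSETS[stat_name]
```

&lt;p&gt;Under that assumption, &lt;code&gt;stat_field_id(3, "avg_value_size_in_bytes")&lt;/code&gt; gives 10607, matching the listing.&lt;/p&gt;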
&lt;h3&gt;
  
  
  The file-system catalog is gone
&lt;/h3&gt;

&lt;p&gt;v1 through v3 allowed a "file-system table": sequential metadata filenames&lt;br&gt;
(&lt;code&gt;v1.metadata.json&lt;/code&gt;, &lt;code&gt;v2.metadata.json&lt;/code&gt;, ...) where a commit was an atomic file&lt;br&gt;
&lt;em&gt;rename&lt;/em&gt;. That only ever worked safely on HDFS, because object stores like S3&lt;br&gt;
don't offer atomic rename. v4 &lt;strong&gt;removes it entirely.&lt;/strong&gt; Every v4 table uses a real&lt;br&gt;
catalog (the metastore model), where a commit is a compare-and-set on the catalog&lt;br&gt;
pointer. This closes a long-standing source of silent corruption on object&lt;br&gt;
storage.&lt;/p&gt;
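
&lt;p&gt;A compare-and-set commit is easy to picture with a toy in-memory catalog. The class and the metadata file names below are hypothetical; a real catalog would do the same check-and-swap inside a database transaction or a conditional write.&lt;/p&gt;

```python
import threading


class InMemoryCatalog:
    """Toy catalog: one pointer per table, swapped only by compare-and-set."""

    def __init__(self):
        self._pointers = {}
        self._lock = threading.Lock()

    def current(self, table):
        return self._pointers.get(table)

    def compare_and_set(self, table, expected, new):
        """Swap the pointer only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False  # another writer committed first
            self._pointers[table] = new
            return True


catalog = InMemoryCatalog()
first = catalog.compare_and_set("db.sales", None, "metadata/00001-a.metadata.json")
# A second writer that still believes the table has no metadata loses the race
# and has to re-read the current pointer and retry its commit.
stale = catalog.compare_and_set("db.sales", None, "metadata/00001-b.metadata.json")
```

&lt;p&gt;The losing writer's files are simply never referenced, which is the failure mode a rename-based commit on an object store could not guarantee.&lt;/p&gt;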

&lt;p&gt;That's the whole of v4: relative paths, optional &lt;code&gt;location&lt;/code&gt;, typed&lt;br&gt;
&lt;code&gt;content_stats&lt;/code&gt; with &lt;code&gt;tight_bounds&lt;/code&gt; and average value size, and the death of the&lt;br&gt;
file-system catalog. A refactor, not a feature release - and exactly the kind of&lt;br&gt;
change a format makes once its feature surface has settled.&lt;/p&gt;
&lt;h2&gt;
  
  
  How the layers earn their keep: scan planning
&lt;/h2&gt;

&lt;p&gt;The reason all this structure exists is to make a scan cheap, so it's worth&lt;br&gt;
watching a query actually use it. Given the current snapshot, a scan does&lt;br&gt;
&lt;strong&gt;three-level pruning&lt;/strong&gt; without ever needing a separate planner index:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open the manifest list.&lt;/strong&gt; One read, roughly 100 KB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer-2 pruning.&lt;/strong&gt; Drop any manifest whose partition summary can't match the
query predicate. Whole manifests skipped without opening them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open the surviving delete manifests first&lt;/strong&gt;, then the data manifests. Delete
manifests come first so the reader knows which deletes are in play before it
decides what to emit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer-3 pruning.&lt;/strong&gt; For each data file, check its &lt;code&gt;lower_bounds&lt;/code&gt; /
&lt;code&gt;upper_bounds&lt;/code&gt;; if they rule the file out, skip it. Otherwise emit it as a scan
task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pair each scan task with the deletes that apply&lt;/strong&gt;, using the sequence-number
rules from v2:

&lt;ul&gt;
&lt;li&gt;a &lt;strong&gt;deletion vector&lt;/strong&gt; applies when its &lt;code&gt;referenced_data_file&lt;/code&gt; matches the data file and its
sequence number is greater than or equal to the data file's, in the same partition;&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;position delete&lt;/strong&gt; applies by the same rule, but only when no DV is present;&lt;/li&gt;
&lt;li&gt;an &lt;strong&gt;equality delete&lt;/strong&gt; applies when its sequence is strictly greater than the
data file's, in the same partition (or globally if unpartitioned).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
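
&lt;p&gt;The pairing rules in the last step can be sketched as a predicate. This is a simplified model under stated assumptions: the dataclasses are hypothetical, a partition of &lt;code&gt;None&lt;/code&gt; on a delete stands in for "unpartitioned, applies globally", and position deletes are matched by &lt;code&gt;referenced_data_file&lt;/code&gt; only.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataFile:
    path: str
    partition: tuple
    seq: int


@dataclass
class DeleteFile:
    kind: str                   # "dv", "position", or "equality"
    partition: Optional[tuple]  # None models a global (unpartitioned) delete
    seq: int
    referenced_data_file: Optional[str] = None


def applies(delete, data, dv_present=False):
    """Return True if `delete` must be applied when reading `data`."""
    same_partition = delete.partition == data.partition
    targets_file = delete.referenced_data_file == data.path
    if delete.kind == "dv":
        return targets_file and delete.seq >= data.seq and same_partition
    if delete.kind == "position":
        # Same rule as a DV, but a DV on the file takes precedence.
        return (not dv_present and targets_file
                and delete.seq >= data.seq and same_partition)
    if delete.kind == "equality":
        is_global = delete.partition is None
        # Strictly greater: an equality delete never hits its own commit.
        return delete.seq > data.seq and (same_partition or is_global)
    raise ValueError(f"unknown delete kind: {delete.kind}")


def deletes_for(data, deletes):
    """Select the delete files a scan task for `data` has to carry."""
    dv_present = any(d.kind == "dv" and applies(d, data) for d in deletes)
    return [d for d in deletes if applies(d, data, dv_present)]
```

&lt;p&gt;The asymmetry between greater-or-equal and strictly-greater is the part worth testing: a delete keyed to a specific file may land in the same commit as the file, while an equality delete only reaches back to older data.&lt;/p&gt;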

&lt;p&gt;Manifest list, then manifest, then file. Three levels of metadata narrow a petabyte table to&lt;br&gt;
the handful of files a query actually needs. That funnel is the entire reason the&lt;br&gt;
tree-of-files design beats a directory listing.&lt;/p&gt;
&lt;h2&gt;
  
  
  Reading across versions
&lt;/h2&gt;

&lt;p&gt;A practical note that saves real debugging time: the format is designed so older&lt;br&gt;
files read correctly under newer rules.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v1 read as v2:&lt;/strong&gt; a missing &lt;code&gt;sequence_number&lt;/code&gt; is 0; a missing &lt;code&gt;content&lt;/code&gt; is 0
(data). Upgrading v1 to v2 is metadata-only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v2 read as v3:&lt;/strong&gt; files without &lt;code&gt;first_row_id&lt;/code&gt; simply report &lt;code&gt;_row_id&lt;/code&gt; as null;
position delete files keep working but can't be created anew.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v3 read as v4:&lt;/strong&gt; a missing stats map key equals a null typed sub-struct; absolute
paths keep working unchanged alongside new relative ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the one hard rule in the other direction: a reader refuses to open a table&lt;br&gt;
whose &lt;code&gt;format-version&lt;/code&gt; is higher than the reader supports. The version integer is&lt;br&gt;
a contract, not a hint.&lt;/p&gt;
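
&lt;p&gt;Both directions fit in a few lines. In this sketch the supported version, the exception type, and the dict shapes are all hypothetical; it only illustrates the v1-read-as-v2 defaults and the refuse-if-newer rule described above.&lt;/p&gt;

```python
SUPPORTED_FORMAT_VERSION = 3  # hypothetical: highest version this reader implements


class UnsupportedFormatVersion(Exception):
    """Raised when a table is newer than the reader understands."""


def check_format_version(table_metadata):
    """Refuse to open a table whose format-version is too new."""
    version = table_metadata["format-version"]
    if version > SUPPORTED_FORMAT_VERSION:
        raise UnsupportedFormatVersion(
            f"table is v{version}, reader supports up to v{SUPPORTED_FORMAT_VERSION}"
        )
    return version


def normalize_entry(entry):
    """Apply the v1-read-as-v2 defaults to a manifest entry dict."""
    normalized = dict(entry)
    normalized.setdefault("sequence_number", 0)
    normalized.setdefault("content", 0)  # 0 means data
    return normalized
```

&lt;p&gt;Failing loudly on a newer version is deliberate: a reader that guessed at unknown metadata could silently skip deletes or misread paths.&lt;/p&gt;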
&lt;h2&gt;
  
  
  The whole format on one page
&lt;/h2&gt;

&lt;p&gt;Here is the entire metadata surface, top to bottom, with the version each piece&lt;br&gt;
arrived in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;catalog -&amp;gt; metadata.json
  |-- table level: format-version, schemas[], partition-specs[], sort-orders[],
  |               properties{}, last-*-id, last-sequence-number, encryption-keys[] (v3+)
  |-- refs{} -&amp;gt; main / branches / tags -&amp;gt; snapshot-id                            (v2+)
  |-- snapshots[] (whole history; expired entries pruned)
        |-- one snapshot:
              snapshot-id, parent-snapshot-id, sequence-number, summary{op,...},
              first-row-id + added-rows (v3+), key-id (v3+)
                |-- manifest list (Avro): [ manifest_file x N ]
                      manifest_file: path, len, spec-id, content (data|deletes),  (v2+)
                                     seq#, min_seq#, counts, partitions[],
                                     first_row_id (v3+)
                        |-- manifest (Avro): header{schema, spec, content} + [ entry x N ]
                              entry: status, snapshot_id, seq#, file_seq#,
                                     data_file {
                                       content, file_path, format, partition,
                                       record_count, size, sort_order_id,
                                       metrics maps (v1-v3)  -- OR --  content_stats struct (v4),
                                       equality_ids (v2+),
                                       referenced_data_file / content_offset / size (v3+),
                                       first_row_id (v3+), key_metadata
                                     }
                                |-- data file / delete file / Puffin DV blob
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the version-by-version cheat sheet:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;v1&lt;/th&gt;
&lt;th&gt;v2&lt;/th&gt;
&lt;th&gt;v3&lt;/th&gt;
&lt;th&gt;v4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;atomic snapshots on immutable files&lt;/td&gt;
&lt;td&gt;+ row-level deletes, sequence numbers&lt;/td&gt;
&lt;td&gt;+ stable row identity&lt;/td&gt;
&lt;td&gt;metadata refactor only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deletes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;rewrite whole file (copy-on-write)&lt;/td&gt;
&lt;td&gt;position + equality delete files&lt;/td&gt;
&lt;td&gt;deletion vectors (position deletes deprecated)&lt;/td&gt;
&lt;td&gt;unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema/spec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;singular &lt;code&gt;schema&lt;/code&gt; / &lt;code&gt;partition-spec&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;lists, explicit partition field ids&lt;/td&gt;
&lt;td&gt;column defaults, new types, &lt;code&gt;source-ids&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;refs&lt;/code&gt;: branches + tags&lt;/td&gt;
&lt;td&gt;unchanged&lt;/td&gt;
&lt;td&gt;unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Row lineage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;next-row-id&lt;/code&gt; / &lt;code&gt;first-row-id&lt;/code&gt; / &lt;code&gt;_row_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Types&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;base&lt;/td&gt;
&lt;td&gt;no new types (sort orders added)&lt;/td&gt;
&lt;td&gt;variant, geometry, geography, unknown, ns timestamps&lt;/td&gt;
&lt;td&gt;unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Encryption&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;per-file key metadata only&lt;/td&gt;
&lt;td&gt;per-file key metadata only&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;encryption-keys&lt;/code&gt;, snapshot &lt;code&gt;key-id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Statistics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;metrics maps&lt;/td&gt;
&lt;td&gt;metrics maps&lt;/td&gt;
&lt;td&gt;metrics maps&lt;/td&gt;
&lt;td&gt;typed &lt;code&gt;content_stats&lt;/code&gt; + &lt;code&gt;tight_bounds&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Paths&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;absolute only&lt;/td&gt;
&lt;td&gt;absolute only&lt;/td&gt;
&lt;td&gt;absolute only&lt;/td&gt;
&lt;td&gt;relative or absolute; &lt;code&gt;location&lt;/code&gt; optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;file-system or metastore&lt;/td&gt;
&lt;td&gt;file-system or metastore&lt;/td&gt;
&lt;td&gt;file-system or metastore&lt;/td&gt;
&lt;td&gt;metastore only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every commit appends a new metadata.json plus new manifest-list, manifest, and&lt;br&gt;
data files. The old tree stays reachable through the metadata log for rollback&lt;br&gt;
and time travel, until snapshot expiration garbage-collects it. Nothing is ever&lt;br&gt;
mutated; the whole format is an append-only tree with one movable pointer at the&lt;br&gt;
root.&lt;/p&gt;

&lt;h2&gt;
  
  
  The throughline
&lt;/h2&gt;

&lt;p&gt;Read the four versions back to back and the arc is clean. v1 made a table&lt;br&gt;
&lt;em&gt;atomic&lt;/em&gt; on immutable files. v2 pushed correctness down to the &lt;em&gt;row&lt;/em&gt; with deletes&lt;br&gt;
and sequence numbers, and added branches. v3 pushed &lt;em&gt;identity&lt;/em&gt; down to the row,&lt;br&gt;
made deletes a single bitmap, and broadened types and security. v4 turned inward&lt;br&gt;
and refactored the metadata so it can move and scale without rewrites.&lt;/p&gt;

&lt;p&gt;It's the same instinct applied at finer and finer grain: make the unit of change smaller, and make the metadata that tracks it cheaper. That's why a format that started as an append-only snapshot log can now back a streaming upsert table with row-level lineage across petabytes - without ever giving up the one property it started with, the single atomic pointer swap.&lt;/p&gt;

&lt;p&gt;If you want to build this understanding from the ground up - why data lakes broke, how snapshots and manifests really work, and the hands-on mechanics of deletes, branches, and compaction - that's the&lt;br&gt;
&lt;a href="https://petascalelabs.com/curriculum/open-table-formats/iceberg-foundations" rel="noopener noreferrer"&gt;Iceberg Foundations&lt;/a&gt; track,&lt;br&gt;
part of the broader &lt;a href="https://petascalelabs.com/curriculum/open-table-formats" rel="noopener noreferrer"&gt;open table formats&lt;/a&gt;&lt;br&gt;
curriculum. The format rewards reading it as a story, because that's how it was written.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>interview</category>
      <category>bigdata</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The Data Engineer Roadmap for 2026 (in an AI-Native World)</title>
      <dc:creator>Petascale Labs</dc:creator>
      <pubDate>Sun, 14 Jun 2026 19:03:58 +0000</pubDate>
      <link>https://dev.to/petascalelabs/the-data-engineer-roadmap-for-2026-in-an-ai-native-world-3lf4</link>
      <guid>https://dev.to/petascalelabs/the-data-engineer-roadmap-for-2026-in-an-ai-native-world-3lf4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is the narrated version of our free, interactive &lt;a href="https://petascalelabs.com/data-engineer-roadmap" rel="noopener noreferrer"&gt;Data Engineer Roadmap&lt;/a&gt;. Same areas, same order, with a focus on the one thing each layer asks of you that AI can't do for you.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every data engineer roadmap written before early 2025 made the same quiet assumption: that the hard part was &lt;em&gt;writing the code&lt;/em&gt;. &lt;strong&gt;Learn SQL. Learn Python. Wire up a pipeline in Airflow. Ship it. Congratulations, you're a data engineer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That assumption is dead. AI writes the SQL now. It writes the DAG, the PySpark job, the dbt model, the masking policy - and it writes them faster than you, at 2am, without complaining. If your roadmap is a checklist of &lt;em&gt;tools to learn so you can produce that code&lt;/em&gt;, you're training for a race that has already been run.&lt;/p&gt;

&lt;p&gt;So a 2026 roadmap has to be a different shape. Not "what do I learn so I can write a pipeline," but &lt;strong&gt;"what do I understand so I can tell whether the AI-written pipeline is right, and fix it when it isn't."&lt;/strong&gt; That is a map of &lt;em&gt;depth&lt;/em&gt;, not a list of tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one idea that makes the whole map work
&lt;/h2&gt;

&lt;p&gt;Most roadmaps draw becoming-senior as &lt;em&gt;new areas appearing&lt;/em&gt;: the junior does SQL and dbt, the senior does Kafka and Spark and Kubernetes. That is not how it works.&lt;/p&gt;

&lt;p&gt;A senior engineer works the &lt;strong&gt;same areas&lt;/strong&gt; a junior does. The difference is how far into each one they go.&lt;/p&gt;

&lt;p&gt;A junior knows Parquet is "the fast columnar format" and can partition a table. A senior reasons about row groups, page statistics, dictionary encoding, and why a scan cost what it cost. A junior writes a Spark job. A senior debugs its shuffle and its skew. Same topic, different altitude.&lt;/p&gt;

&lt;p&gt;That matters more now than ever, because &lt;strong&gt;AI raises the floor to roughly the junior line.&lt;/strong&gt; It reliably gets you the partitioned table and the working Spark job. The depth above that line is exactly the part it can't reason about for you, and exactly where your career value now lives.&lt;/p&gt;

&lt;p&gt;So as we walk the areas, watch for the pattern: &lt;strong&gt;AI does the surface; you own the depth.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Foundations and SQL
&lt;/h2&gt;

&lt;p&gt;Joins, window functions, CTEs, Python, the command line, Git, ETL vs ELT.&lt;/p&gt;

&lt;p&gt;AI writes almost all of this now. That doesn't make SQL optional, it makes it table stakes. You learn it not to produce it, but to &lt;strong&gt;catch when the generated query is quietly wrong&lt;/strong&gt;: the join that fans out and double-counts revenue, the &lt;code&gt;WHERE&lt;/code&gt; that silently drops NULLs, the window frame that is off by one row. The senior depth is reading an &lt;code&gt;EXPLAIN&lt;/code&gt; plan and knowing &lt;em&gt;why&lt;/em&gt; a query is slow. AI hands you the query; understanding it is still yours.&lt;/p&gt;
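
&lt;p&gt;The fan-out bug is easy to reproduce. The sketch below uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module with two made-up tables; the table and column names are purely illustrative.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- order 1 shipped in two parcels, so the join yields two rows for it
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# Looks reasonable, but every extra shipment repeats the order amount.
fanned_out = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o JOIN shipments s ON s.order_id = o.order_id
""").fetchone()[0]

# Revenue at the grain it actually lives at: one row per order.
correct = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

&lt;p&gt;Both queries run without error, which is the point: nothing but knowing the grain of each table tells you the first number is wrong.&lt;/p&gt;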

&lt;h2&gt;
  
  
  Data modeling and transformation
&lt;/h2&gt;

&lt;p&gt;Dimensional modeling, star and snowflake schemas, fact vs dimension tables, dbt models and tests. Then the depth: Slowly Changing Dimensions, grain, conformed dimensions, the One Big Table pattern, Data Vault.&lt;/p&gt;

&lt;p&gt;AI drafts the model. What it can't do is the judgement calls: what is the grain of this fact table, what does "one customer" mean across three source systems, which dimension is conformed across marts. The classic trap is Slowly Changing Dimensions - everyone can recite the types, almost nobody internalizes which version of a dimension their facts join to. Get it wrong and "revenue by region last quarter" reports a number that was never true.&lt;/p&gt;

&lt;p&gt;Replay a change timeline yourself in the free, in-browser &lt;a href="https://petascalelabs.com/tools/scd-playground" rel="noopener noreferrer"&gt;SCD Playground&lt;/a&gt;, then practice the area in the &lt;a href="https://petascalelabs.com/curriculum/dimensional-data-modeling" rel="noopener noreferrer"&gt;Dimensional Data Modeling track&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Orchestration and pipelines
&lt;/h2&gt;

&lt;p&gt;Airflow DAGs, scheduling, sensors, backfills, retries, idempotency. Senior: scheduler and executor internals, data-aware scheduling, lineage, freshness SLAs, being on-call.&lt;/p&gt;

&lt;p&gt;AI generates the DAG, and it is good at it. What it doesn't generate is the &lt;em&gt;understanding of failure modes&lt;/em&gt; the job actually requires, because the real work here isn't the happy path, it's the 3am page. Why did this task hang? Why did the backfill double-write? Is this retry safe, or did it just send the same email twice? Idempotency is a property you reason about, not a snippet AI sprinkles in. See the &lt;a href="https://petascalelabs.com/curriculum/orchestration-and-pipelines" rel="noopener noreferrer"&gt;Orchestration and Pipelines track&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Storage and file formats
&lt;/h2&gt;

&lt;p&gt;Parquet, row vs columnar, compression, object storage, partitioning. Senior: row groups, page statistics, predicate pushdown, encoding, the small-file problem, the internals of ORC, Avro and Arrow.&lt;/p&gt;

&lt;p&gt;This is where AI is least useful and depth pays the most, because &lt;strong&gt;why a scan costs what it costs is a property of the bytes on disk, not the query text.&lt;/strong&gt; AI reads and writes Parquet fine. It can't tell you why two files with identical rows differ tenfold in scan cost - that is row group sizing, encoding choice, and whether min/max statistics let the engine skip pages.&lt;/p&gt;

&lt;p&gt;Point the free &lt;a href="https://petascalelabs.com/tools/parquet-viewer" rel="noopener noreferrer"&gt;Parquet Viewer&lt;/a&gt; at your own files (100% in-browser, nothing is uploaded) to see the row groups and statistics yourself. Track: &lt;a href="https://petascalelabs.com/curriculum/storage-and-file-formats" rel="noopener noreferrer"&gt;Storage and File Formats&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data lakes and table formats
&lt;/h2&gt;

&lt;p&gt;Lake vs warehouse vs lakehouse, Iceberg and Delta, time travel, schema evolution. Senior: ACID and snapshot isolation internals, compaction, catalogs, the Iceberg-vs-Delta-vs-Hudi tradeoffs.&lt;/p&gt;

&lt;p&gt;AI scaffolds the table operations happily. The part that bites, and that it won't warn you about, is &lt;strong&gt;what happens when two writers commit at once.&lt;/strong&gt; Snapshot isolation, optimistic concurrency, conflict resolution, compaction fighting your ingest job: that is distributed-systems reasoning, not autocomplete. Track: &lt;a href="https://petascalelabs.com/curriculum/open-table-formats" rel="noopener noreferrer"&gt;Open Table Formats&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ingestion and streaming
&lt;/h2&gt;

&lt;p&gt;Batch ingestion, Kafka basics, producers and consumers, event time vs processing time. Senior: exactly-once semantics, consumer group rebalancing, Change Data Capture, stream processing in Flink or Kafka Streams.&lt;/p&gt;

&lt;p&gt;AI writes the producer and the consumer. Where it goes quiet is &lt;strong&gt;where data-quality bugs are actually born&lt;/strong&gt;: the gap between event time and processing time that makes your windowed aggregates wrong, the rebalance that reprocessed a batch, the "exactly-once" guarantee that was only ever at-least-once because of how you committed offsets. Track: &lt;a href="https://petascalelabs.com/curriculum/ingestion-and-transport" rel="noopener noreferrer"&gt;Ingestion and Transport&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distributed compute
&lt;/h2&gt;

&lt;p&gt;Spark DataFrames, transformations vs actions, lazy evaluation. Senior: shuffle and partitioning, broadcast joins and data skew, Catalyst and codegen, memory and fault tolerance.&lt;/p&gt;

&lt;p&gt;AI writes the transformation. It cannot tune the &lt;em&gt;execution&lt;/em&gt;. Why did this job spill to disk? Why is one task taking 40 times longer than the other 199 (hello, data skew)? Should this join broadcast or shuffle? That reasoning, about how a logical DataFrame becomes physical work across a cluster, is squarely yours. Track: &lt;a href="https://petascalelabs.com/curriculum/compute-engines" rel="noopener noreferrer"&gt;Compute Engines&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query engines and OLAP
&lt;/h2&gt;

&lt;p&gt;What OLAP is, warehouse vs query engine, ClickHouse, Trino. Senior: MergeTree and projections, federation and pushdown, execution models, cost-based optimization, real-time OLAP, &lt;code&gt;EXPLAIN&lt;/code&gt; literacy.&lt;/p&gt;

&lt;p&gt;AI writes the SQL the dashboard runs. Why that dashboard is &lt;em&gt;slow&lt;/em&gt;, and how to fix it at the engine rather than by rewriting the query, is senior work. It lives in how the engine sorts and merges data, what it can push down, and what its optimizer chose. Track: &lt;a href="https://petascalelabs.com/curriculum/query-engines-and-olap" rel="noopener noreferrer"&gt;Query Engines and OLAP&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Semantic and metrics layer
&lt;/h2&gt;

&lt;p&gt;Metrics and dashboards, the semantic layer, data-quality tests. Senior: data contracts, schema registries, metric governance, reverse ETL.&lt;/p&gt;

&lt;p&gt;AI drafts a metric definition. What it can't do is the &lt;em&gt;organizational&lt;/em&gt; work of making "revenue" mean exactly one thing across finance, sales and product. That is a human contract - negotiated, governed, enforced - and it is the layer where data finally becomes shared business language instead of seven conflicting spreadsheets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Governance, quality and cloud
&lt;/h2&gt;

&lt;p&gt;PII basics, GDPR and CCPA, cloud, CI/CD for data. Senior: masking and tokenization, row and column access control, right-to-erasure across a lakehouse, infrastructure as code, data observability at scale.&lt;/p&gt;

&lt;p&gt;AI flags the obvious PII column. What it can't design is right-to-erasure across a lakehouse with time travel and immutable snapshots - that is architecture, not autocomplete. The masking itself is full of guarantee-breaking gotchas: an unsalted hash is a lookup table, a redacted ZIP that keeps five digits still re-identifies people. Generate the DDL with the free &lt;a href="https://petascalelabs.com/tools/pii-masking-generator" rel="noopener noreferrer"&gt;PII Masking Policy Generator&lt;/a&gt;. Track: &lt;a href="https://petascalelabs.com/curriculum/pii-data-governance" rel="noopener noreferrer"&gt;PII and Data Governance&lt;/a&gt;.&lt;/p&gt;
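&lt;p&gt;The "unsalted hash is a lookup table" gotcha fits in a few lines. This is a sketch under stated assumptions: the masked field is a five-digit ZIP code, the helper names &lt;code&gt;unsalted_mask&lt;/code&gt; and &lt;code&gt;keyed_mask&lt;/code&gt; are hypothetical, and the secret key is a placeholder that would live in a secrets manager in practice.&lt;/p&gt;

```python
import hashlib
import hmac


def unsalted_mask(value):
    # Plain SHA-256: deterministic and keyless, so anyone can recompute it.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_mask(value, secret_key):
    # HMAC-SHA256: still deterministic, but only reproducible with the key.
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# The input space is tiny: every possible five-digit ZIP code.
candidates = [f"{n:05d}" for n in range(100000)]
lookup = {unsalted_mask(c): c for c in candidates}

masked = unsalted_mask("94107")
recovered = lookup[masked]  # the "mask" is undone by a dictionary lookup

# Placeholder key for illustration only.
SECRET_KEY = b"hypothetical-key-from-a-secrets-manager"
keyed = keyed_mask("94107", SECRET_KEY)
keyed_found = keyed in lookup  # the precomputed table no longer matches

print(recovered, keyed_found)
```

&lt;p&gt;A keyed hash only moves the problem: whoever holds the key can rebuild the table, so key custody and rotation become part of the masking design.&lt;/p&gt;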

&lt;h2&gt;
  
  
  So, will AI replace data engineers?
&lt;/h2&gt;

&lt;p&gt;It raises the floor and moves the value up.&lt;/p&gt;

&lt;p&gt;AI now does the old junior checklist well: queries, DAGs, glue code, boilerplate pipelines. What's left for you is the durable part - reasoning about the system. Why a scan costs what it costs. What happens when two writers commit. Why a job spilled to disk.&lt;/p&gt;

&lt;p&gt;AI doesn't replace the engineer who understands that depth; it gives them leverage. They direct the AI through the surface work and spend their judgement on the part it can't reach. The engineer who only knew the surface is the one under pressure now, because the surface is free.&lt;/p&gt;

&lt;p&gt;That is the whole premise of the map. You touch every area early - junior and senior work the same areas. What stretches out over a career is how deep you go into each, and the deep end is precisely the part AI can't shortcut for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://petascalelabs.com/data-engineer-roadmap" rel="noopener noreferrer"&gt;Open the full interactive Data Engineer Roadmap&lt;/a&gt;&lt;/strong&gt; to see every topic on a single timeline, with a "Going senior" toggle that reveals the depth in each layer. Then, if you want to practice that depth on real engines instead of slideware, that is what the &lt;a href="https://petascalelabs.com/curriculum" rel="noopener noreferrer"&gt;curriculum&lt;/a&gt; and the free &lt;a href="https://petascalelabs.com/tools" rel="noopener noreferrer"&gt;in-browser tools&lt;/a&gt; are for.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published on the &lt;a href="https://petascalelabs.com/blog/data-engineer-roadmap-2026" rel="noopener noreferrer"&gt;Petascale Labs blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>career</category>
      <category>roadmap</category>
      <category>ai</category>
    </item>
    <item>
      <title>Data Engineering Skills Gap Nobody Fills — and the Side Project I Finally Finished to Fill It</title>
      <dc:creator>Petascale Labs</dc:creator>
      <pubDate>Thu, 04 Jun 2026 17:15:01 +0000</pubDate>
      <link>https://dev.to/petascalelabs/data-engineering-skills-gap-nobody-fills-and-the-side-project-i-finally-finished-to-fill-it-d4j</link>
      <guid>https://dev.to/petascalelabs/data-engineering-skills-gap-nobody-fills-and-the-side-project-i-finally-finished-to-fill-it-d4j</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/github-2026-05-21"&gt;GitHub Finish-Up-A-Thon Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Petascale Labs&lt;/strong&gt; is a data engineering learning platform that teaches the stack &lt;strong&gt;from the bytes up&lt;/strong&gt;. Most DE curricula show you &lt;em&gt;which&lt;/em&gt; button to click. We teach you &lt;em&gt;why&lt;/em&gt; it breaks in production and how to reason about it from first principles.&lt;/p&gt;

&lt;p&gt;What makes it ours:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Strata model&lt;/strong&gt; — the data platform as layers: storage &amp;amp; file formats →
ingestion → open table formats → compute engines → orchestration → query
engines/OLAP → semantic layer. A mental map for the whole stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident-driven lessons&lt;/strong&gt; — every lesson is a real production failure and
its fix. You learn the way you actually grow at work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An Incident-Response Arcade&lt;/strong&gt; — interactive, time-pressured sims where you
diagnose and resolve infra failures (the phantom lag, shuffle spills, broken
CDC) under a budget and a cluster-health clock - &lt;a href="https://petascalelabs.com/arcade/games" rel="noopener noreferrer"&gt;https://petascalelabs.com/arcade/games&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free, client-side DE tools&lt;/strong&gt; — a Parquet Inspector, an SCD Playground, and a PII Masking Policy Generator that run entirely in your browser - &lt;a href="https://petascalelabs.com/tools" rel="noopener noreferrer"&gt;https://petascalelabs.com/tools&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;🔗 &lt;strong&gt;Live:&lt;/strong&gt; &lt;a href="https://petascalelabs.com" rel="noopener noreferrer"&gt;https://petascalelabs.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ba7v3hl1vx3qvpzy32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ba7v3hl1vx3qvpzy32.png" alt="The Platform" width="800" height="430"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwectiz064mul4j4z81u6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwectiz064mul4j4z81u6.png" alt="Simulation Arcade" width="800" height="407"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faoush5yqnyfg9czd3tru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faoush5yqnyfg9czd3tru.png" alt="Free Tools" width="800" height="433"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qu5p0xivllgod04ni1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qu5p0xivllgod04ni1r.png" alt="Arcade Access" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Things to try:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Incident-Response Arcade&lt;/strong&gt; — pick a scenario, work the terminal, and
ship a post-mortem before the cluster falls over (timer + budget +
cluster-health clock).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free DE Tools&lt;/strong&gt; (&lt;a href="https://petascalelabs.com/tools" rel="noopener noreferrer"&gt;https://petascalelabs.com/tools&lt;/a&gt;) — fast, &lt;strong&gt;100% client-side&lt;/strong&gt; utilities
for working data engineers:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parquet Inspector&lt;/strong&gt; — drop in a &lt;code&gt;.parquet&lt;/code&gt; file and read its schema, row
groups, column stats, and metadata, all in-browser (DuckDB-WASM), nothing
uploaded anywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SCD Playground&lt;/strong&gt; — a customer relocates, a tier gets upgraded, and every
historical fact is suddenly at risk of silently re-stating under today's
attributes. Replay the timeline and watch the dimension transform under each
Slowly Changing Dimension type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII Masking Policy Generator&lt;/strong&gt; — paste a sample, auto-detect the PII, and
generate ready-to-run dynamic data masking policies for &lt;strong&gt;Snowflake,
Databricks, and BigQuery&lt;/strong&gt; — while you learn what hashing, tokenization,
redaction, and generalization each actually protect.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;The Strata map&lt;/strong&gt; — browse the data platform layer by layer, from storage &amp;amp;
file formats up to the semantic layer.&lt;/li&gt;

&lt;/ul&gt;
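
&lt;p&gt;For a feel of what the SCD Playground replays, here is a rough Python sketch of the "customer relocates" case. It is not the tool's code: the row shape (&lt;code&gt;valid_from&lt;/code&gt;, &lt;code&gt;valid_to&lt;/code&gt;), the function names and the sample data are all assumptions made for illustration.&lt;/p&gt;

```python
from datetime import date


def apply_type1(dim, customer_id, new_city):
    # Type 1: overwrite in place; the old value is gone.
    return [
        {**row, "city": new_city} if row["customer_id"] == customer_id else row
        for row in dim
    ]


def apply_type2(dim, customer_id, new_city, change_date):
    # Type 2: close the current row and append a new current version.
    out = []
    for row in dim:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            out.append({**row, "valid_to": change_date})
            out.append({**row, "city": new_city,
                        "valid_from": change_date, "valid_to": None})
        else:
            out.append(row)
    return out


def city_as_of(dim, customer_id, on):
    # Point-in-time lookup: valid_from is inclusive, valid_to is exclusive.
    for row in dim:
        if (row["customer_id"] == customer_id
                and on >= row["valid_from"]
                and (row["valid_to"] is None or row["valid_to"] > on)):
            return row["city"]
    return None


# Made-up dimension with one customer who later relocates.
dim = [{"customer_id": 1, "city": "Austin",
        "valid_from": date(2024, 1, 1), "valid_to": None}]
moved = date(2025, 6, 1)
order_date = date(2024, 9, 15)  # a historical fact recorded before the move

type1 = apply_type1(dim, 1, "Denver")
type2 = apply_type2(dim, 1, "Denver", moved)

print(city_as_of(type1, 1, order_date))  # history re-stated under today's city
print(city_as_of(type2, 1, order_date))  # history preserved
```

&lt;p&gt;Under Type 1 the old order silently picks up the new city; under Type 2 the same lookup still returns the city that was true on the order date.&lt;/p&gt;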

&lt;h2&gt;
  
  
  The Comeback Story
&lt;/h2&gt;

&lt;p&gt;This started as scattered notes and a half-built course engine — an idea&lt;br&gt;
buried under "I'll finish it later." The bones existed: a lesson renderer, a few&lt;br&gt;
Strata, a rough game loop. None of it hung together.&lt;/p&gt;

&lt;p&gt;The finish-up sprint closed the gap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shipped the &lt;strong&gt;Incident-Response Arcade&lt;/strong&gt; end to end — game engine, HUD
(timer/credits/health), terminal, Slack-style alert stream, and the
post-mortem screen.&lt;/li&gt;
&lt;li&gt;Built a &lt;strong&gt;free tools hub&lt;/strong&gt; — Parquet Inspector, SCD Playground, and PII
Masking Policy Generator — all client-side, each one shippable on its own.&lt;/li&gt;
&lt;li&gt;Wired &lt;strong&gt;content authoring&lt;/strong&gt; into a real contract so new incidents and lessons
drop in as data, not code.&lt;/li&gt;
&lt;li&gt;Fixed the unglamorous-but-fatal stuff: production SSR/routing, auth, and the
rough edges that keep a side project from ever feeling "done."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It went from a folder I was embarrassed to share to something I'll put a demo&lt;br&gt;
link next to.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Experience with GitHub Copilot
&lt;/h2&gt;

&lt;p&gt;Copilot was most useful in the &lt;strong&gt;glue and grind&lt;/strong&gt; — the parts that stall a finishing sprint. Concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate velocity&lt;/strong&gt; — React component scaffolds, TypeScript interfaces
for the game state, and repetitive handlers came out fast from a comment or a
type signature, so I could spend attention on the game &lt;em&gt;design&lt;/em&gt;, not the
plumbing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In-editor pattern-matching&lt;/strong&gt; — once one phase component (e.g. the HUD) had a
shape, Copilot inferred the next ones from context, keeping the codebase
consistent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unblocking the boring last 20%&lt;/strong&gt; — Go handler stubs, JSON scaffolds for new
incident scenarios, and small refactors where momentum matters more than
novelty.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where I stayed hands-on: the architecture, the incident pedagogy, and anything&lt;br&gt;
touching correctness in production. Copilot is a force multiplier on the typing,&lt;br&gt;
not a substitute for the thinking — which is exactly the philosophy we teach.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Petascale Labs — understand the data stack from the bytes up.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>githubchallenge</category>
    </item>
  </channel>
</rss>
