<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aniket Abhishek Soni</title>
    <description>The latest articles on DEV Community by Aniket Abhishek Soni (@aniketsoni).</description>
    <link>https://dev.to/aniketsoni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3954381%2Fc17f147f-e19b-4160-be20-e2d4dd2af1dd.png</url>
      <title>DEV Community: Aniket Abhishek Soni</title>
      <link>https://dev.to/aniketsoni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aniketsoni"/>
    <language>en</language>
    <item>
      <title>Is Your Lakehouse Architecture Just a High-Priced Tax on Your Data Team?</title>
      <dc:creator>Aniket Abhishek Soni</dc:creator>
      <pubDate>Wed, 01 Jul 2026 22:48:54 +0000</pubDate>
      <link>https://dev.to/aniketsoni/is-your-lakehouse-architecture-just-a-high-priced-tax-on-your-data-team-2jp6</link>
      <guid>https://dev.to/aniketsoni/is-your-lakehouse-architecture-just-a-high-priced-tax-on-your-data-team-2jp6</guid>
      <description>&lt;p&gt;Ninety-two percent of data platform migrations I’ve audited in the last three years ended up costing more in "operational tax" than they saved in raw compute efficiency. We talk about TCO (Total Cost of Ownership) like it’s a math problem, but it’s actually a human behavior problem. The choice between BigQuery and Databricks SQL isn't about which engine can scan a petabyte faster; it’s about whether you want to spend your weekends debugging slot allocation or tuning Delta Lake vacuum intervals.&lt;/p&gt;

&lt;p&gt;I’ve spent the last six years keeping financial services and healthcare workloads upright. I’ve seen BigQuery’s &lt;code&gt;INFORMATION_SCHEMA&lt;/code&gt; save a QBR and I’ve seen Databricks’ &lt;code&gt;OPTIMIZE&lt;/code&gt; commands accidentally lock a table during a critical financial close. If you’re choosing based on a vendor slide deck, you’re already behind. Here is the field guide to not blowing your cloud budget while trying to build a "lakehouse."&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The "Slot" Trap vs. The "Warehouse" Mirage
&lt;/h2&gt;

&lt;p&gt;BigQuery’s shift to &lt;code&gt;Edition&lt;/code&gt; pricing (Standard, Enterprise, Enterprise Plus) was the industry’s way of saying "we want predictable, Databricks-style billing." But here’s the reality: if you aren't using Reservations, you aren't using BigQuery. I’ve seen teams blow $50k in a weekend because a rogue &lt;code&gt;SELECT *&lt;/code&gt; on a multi-petabyte partitioned table hit on-demand pricing.&lt;/p&gt;

&lt;p&gt;In Databricks, you’re buying "SQL Warehouses." The failure mode here is over-provisioning. If you leave a 2XL warehouse running 24/7 because your analysts "need it to be fast," you’re lighting money on fire. BigQuery is inherently multi-tenant; Databricks is isolated. If you have 50 different departments, BigQuery manages the concurrency better out of the box. If you have a few massive, complex jobs that need predictable performance, you want a dedicated Databricks SQL Warehouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1640012046731-3296bc0b7642%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyMHx8dGFuZ2xlZCUyMGZpYmVyJTIwb3B0aWMlMjBjYWJsZXN8ZW58MHwwfHx8MTc4MjkzNjMxNHww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1640012046731-3296bc0b7642%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyMHx8dGFuZ2xlZCUyMGZpYmVyJTIwb3B0aWMlMjBjYWJsZXN8ZW58MHwwfHx8MTc4MjkzNjMxNHww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Monisha Selvakumar on Unsplash" width="1080" height="720"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@monishaselv?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Monisha Selvakumar&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Partitioning Isn't Optional; It’s Your Only Defense
&lt;/h2&gt;

&lt;p&gt;In BigQuery, if you don't filter by your partition column (usually &lt;code&gt;_PARTITIONDATE&lt;/code&gt; or a timestamp), you are paying for a full table scan. Period. I’ve seen junior engineers write queries that scanned 40TB of data for a single dashboard refresh. &lt;/p&gt;

&lt;p&gt;In Databricks, the &lt;code&gt;Z-ORDER&lt;/code&gt; command is your best friend. If you aren't Z-ordering your high-cardinality columns, you’re missing the point of Delta Lake.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- BigQuery: Never skip the filter, or get fired.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`my_project.my_dataset.events`&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;_PARTITIONDATE&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATE_SUB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Databricks: Z-ORDER is the performance multiplier.&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt; 
&lt;span class="n"&gt;ZORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you ignore these, you’re paying for the vendor’s inefficiency. In BigQuery, you pay for the scan. In Databricks, you pay for the time the cluster spent scanning.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. The "Vacuum" and "Snapshot" Tax
&lt;/h2&gt;

&lt;p&gt;One of the biggest hidden costs in Databricks is storage bloat. Because Delta Lake keeps snapshots for time travel, if you don't run &lt;code&gt;VACUUM&lt;/code&gt; regularly, your storage bill will grow indefinitely. I’ve seen terabytes of "deleted" data sitting in S3/ADLS buckets that Databricks users forgot to prune.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Databricks: Pruning old snapshots to save storage costs&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt; &lt;span class="n"&gt;RETAIN&lt;/span&gt; &lt;span class="mi"&gt;168&lt;/span&gt; &lt;span class="n"&gt;HOURS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- Keep 7 days of history&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;BigQuery handles this via internal TTLs on datasets and tables. It’s "set it and forget it." If you lack the discipline to manage a &lt;code&gt;VACUUM&lt;/code&gt; schedule, Databricks will eventually bite your budget in the ass.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Concurrency is a Lie
&lt;/h2&gt;

&lt;p&gt;Marketing teams love to talk about "limitless concurrency." Both platforms handle it, but they handle it differently. BigQuery uses a distributed scheduler that tries to fit your query into the available slots. If you have 2,000 slots and you trigger 5,000 slots worth of work, BigQuery will queue your queries. That's a latency hit, but not a failure.&lt;/p&gt;

&lt;p&gt;Databricks SQL Warehouses (Serverless) have a "scaling out" threshold. When your cluster gets slammed, it spawns new clusters to handle the load. This is great until you hit your regional limit for cloud instances or your bill hits the stratosphere because you triggered five extra clusters to run a 2-second query. Monitor your &lt;code&gt;dbr_sql_warehouse_scaling_events&lt;/code&gt; like a hawk.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The "Governance" Penalty
&lt;/h2&gt;

&lt;p&gt;Healthcare data requires ironclad access control. BigQuery’s integration with IAM is native and absolute. If you are already deep in the Google Cloud ecosystem, BigQuery’s row-level security and column-level masking (via Policy Tags) are incredibly easy to implement.&lt;/p&gt;

&lt;p&gt;Databricks uses Unity Catalog. It’s powerful, but it’s a second layer of governance you have to maintain outside of your cloud provider’s IAM. If your organization is already struggling with identity management, adding Unity Catalog adds another point of failure. Don't underestimate the "cognitive load" of managing two sets of permissions.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Cold Starts and Serverless Latency
&lt;/h2&gt;

&lt;p&gt;BigQuery is always "warm." You send a request, it runs. Databricks SQL Serverless has gotten much faster, but there is still a spin-up time for those clusters if they’ve been idle. If your users are clicking around a Looker dashboard, they will notice the 3-5 second lag on the first click if your warehouse was cold.&lt;/p&gt;

&lt;p&gt;If your users are impatient (and they are), you will end up keeping warehouses running longer than you need to, just to avoid the "Why is the dashboard slow?" Slack messages. That’s a hidden cost of the Databricks architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1534006215338-cfa9ef7903a1%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyNnx8bW91bnRhaW4lMjBvZiUyMHJlY2VpcHRzfGVufDB8MHx8fDE3ODI5MzYzMTV8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1534006215338-cfa9ef7903a1%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyNnx8bW91bnRhaW4lMjBvZiUyMHJlY2VpcHRzfGVufDB8MHx8fDE3ODI5MzYzMTV8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Giancarlo Revolledo on Unsplash" width="1080" height="720"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@giancarlor_photo?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Giancarlo Revolledo&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Vendor Lock-in is a Myth; Portability is a Pipe Dream
&lt;/h2&gt;

&lt;p&gt;People choose Databricks because they want to "own" their data in Parquet/Delta format. They choose BigQuery because they want it to "just work." &lt;/p&gt;

&lt;p&gt;Here is the truth: you aren't going to migrate 500TB of data from BigQuery to Databricks because you had a bad quarter. You are locked in by your ingestion pipelines and your BI tool semantic layers. Pick the one that fits your current team’s skillset. If your team knows Spark, Databricks is the path of least resistance. If your team is SQL-first and hates infrastructure management, BigQuery is the only logical choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;BigQuery is a managed service that demands you play by its rules—partitioning, slot management, and Google-native IAM. Databricks is a platform that gives you more control but demands you manage the complexity—vacuuming, Z-ordering, and catalog governance. &lt;/p&gt;

&lt;p&gt;If you want a "lakehouse" that functions like a database, pay the BigQuery tax and embrace the simplicity. If you want a data science powerhouse that happens to run SQL, pay the Databricks tax and hire a good platform engineer to clean up your mess. &lt;/p&gt;

&lt;p&gt;Which one is keeping your CFO up at night, and what are you going to do about it tomorrow morning?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover photo by &lt;a href="https://unsplash.com/@fp4?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Gavin Allanwood&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>data</category>
      <category>engineering</category>
      <category>bigquery</category>
    </item>
    <item>
      <title>Beyond IAM Policies: How to Actually Secure Cross-Account Data Shares</title>
      <dc:creator>Aniket Abhishek Soni</dc:creator>
      <pubDate>Tue, 30 Jun 2026 16:05:08 +0000</pubDate>
      <link>https://dev.to/aniketsoni/beyond-iam-policies-how-to-actually-secure-cross-account-data-shares-453i</link>
      <guid>https://dev.to/aniketsoni/beyond-iam-policies-how-to-actually-secure-cross-account-data-shares-453i</guid>
      <description>&lt;p&gt;03:14 AM. The PagerDuty alert hit my phone like a physical blow. "Critical: Data Integrity Failure – Account B ETL Job Failing."&lt;/p&gt;

&lt;p&gt;Six months ago, our cross-account data sharing strategy was a nightmare of IAM resource-based policies. We had S3 bucket policies spanning three accounts, thousands of lines of JSON, and a collective prayer that no developer accidentally added &lt;code&gt;s3:GetObject&lt;/code&gt; to a &lt;code&gt;Principal: *&lt;/code&gt; block. It was a ticking time bomb of "Access Denied" errors and "Wait, why can the marketing team see HIPAA-regulated PII?" queries.&lt;/p&gt;

&lt;p&gt;Today, those S3 policies are gone. We use AWS Lake Formation governed tables. The difference isn't just cosmetic; it’s the difference between auditing a 4,000-line JSON file and reading a single, declarative grant statement.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we saw
&lt;/h2&gt;

&lt;p&gt;The incident started when an ETL job in our Analytics account (Account B) stopped pulling data from our Production account (Account A). The error wasn't a clean "Access Denied." It was an &lt;code&gt;AccessDeniedException&lt;/code&gt; coming from the AWS Glue Data Catalog, even though the IAM role had explicit &lt;code&gt;glue:GetTable&lt;/code&gt; and &lt;code&gt;s3:GetObject&lt;/code&gt; permissions.&lt;/p&gt;

&lt;p&gt;Our first instinct—the "false lead"—was to assume the S3 bucket policy was malformed. We spent 45 minutes diffing the JSON in &lt;code&gt;bucket-prod-data&lt;/code&gt; against our internal "Gold Standard" template. We checked the &lt;code&gt;Principal&lt;/code&gt; ARN. We checked the &lt;code&gt;Condition: StringEquals: aws:PrincipalOrgID&lt;/code&gt;. Everything looked perfect.&lt;/p&gt;

&lt;p&gt;The symptoms were deceptive. The job could list the partitions, but as soon as the Spark executor tried to read the underlying Parquet files, the job would hang for 30 seconds and then crash. We were looking at a classic "permission mismatch" between the Data Catalog layer and the Data Storage layer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1647563099304-301a49c0e09d%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxNXx8c2hhdHRlcmVkJTIwZ2xhc3N8ZW58MHwwfHx8MTc4Mjc2Mzk2M3ww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1647563099304-301a49c0e09d%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxNXx8c2hhdHRlcmVkJTIwZ2xhc3N8ZW58MHwwfHx8MTc4Mjc2Mzk2M3ww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Katelyn G on Unsplash" width="1080" height="721"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@katelyn_g?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Katelyn G&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Root cause
&lt;/h2&gt;

&lt;p&gt;The root cause was the "confused deputy" problem masked by overly permissive IAM policies. We had been relying on &lt;code&gt;s3:GetObject&lt;/code&gt; and a Glue resource policy that was essentially a catch-all. &lt;/p&gt;

&lt;p&gt;When we migrated to Lake Formation, we didn't fully sever the tie to the underlying S3 policy. We had a hybrid mess where Lake Formation was managing the metadata, but the S3 bucket policy was still trying to enforce access. Because we hadn't set up the &lt;code&gt;LF-Tag&lt;/code&gt; based policies correctly, the &lt;code&gt;Glue Service Role&lt;/code&gt; was struggling to resolve the identity of the cross-account requester.&lt;/p&gt;

&lt;p&gt;Specifically, we were violating the "Lake Formation-managed S3 locations" requirement. If you register a path in Lake Formation (e.g., &lt;code&gt;s3://data-prod-finance/&lt;/code&gt;), you &lt;em&gt;must&lt;/em&gt; ensure that the IAM role used by the remote account has the &lt;code&gt;lakeformation:GetDataAccess&lt;/code&gt; permission. We hadn't included this. We were trying to authenticate using standard IAM, while the bucket was configured to only accept requests through the Lake Formation engine. &lt;/p&gt;

&lt;p&gt;The offending configuration was this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::ACCOUNT_B:role/AnalyticsRole"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::data-prod-finance/*"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks fine to a junior engineer. But if the bucket has &lt;code&gt;s3:BlockPublicAccess&lt;/code&gt; enabled and you’ve enabled Lake Formation, that policy is essentially being bypassed or, worse, ignored in favor of the Lake Formation grant. The failure happened because we didn't have the &lt;code&gt;glue:GetTable&lt;/code&gt; and &lt;code&gt;lakeformation:GetDataAccess&lt;/code&gt; dance fully synchronized.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1614064849377-2ec42b722a94%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyN3x8c2VjdXJpdHklMjBwYWRsb2NrfGVufDB8MHx8fDE3ODI3NjM5NjR8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1614064849377-2ec42b722a94%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyN3x8c2VjdXJpdHklMjBwYWRsb2NrfGVufDB8MHx8fDE3ODI3NjM5NjR8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by FlyD on Unsplash" width="1080" height="720"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@flyd2069?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;FlyD&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;We nuked the bucket policies. Seriously. We stripped the cross-account S3 bucket policies down to only allow the Lake Formation service role to access the data. &lt;/p&gt;

&lt;p&gt;Then, we implemented a proper Lake Formation Cross-Account Grant. Instead of sharing the S3 bucket, we shared the Data Catalog table. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We registered the S3 location in Account A under Lake Formation: &lt;code&gt;aws lakeformation register-resource --resource-arn s3://data-prod-finance/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;We granted the &lt;code&gt;SELECT&lt;/code&gt; permission on the specific Glue table to the external Account B ID:
&lt;code&gt;aws lakeformation grant-permissions --principal DataLakePrincipalIdentifier=arn:aws:iam::ACCOUNT_B:root --permissions SELECT --resource '{ "Table": { "DatabaseName": "finance", "Name": "transactions" } }'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In Account B, we created a "Resource Link" pointing to the shared table in Account A.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By moving the permission logic from the S3 bucket policy into the Lake Formation &lt;code&gt;GRANT&lt;/code&gt; command, we moved from "who can touch these files?" to "who can run queries on this table?" The auditors loved it because the &lt;code&gt;Get-LFPermissions&lt;/code&gt; API response is structured, human-readable, and doesn't require a degree in IAM JSON parsing to verify.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed so it never happens again
&lt;/h2&gt;

&lt;p&gt;We stopped treating security as a post-deployment checklist.&lt;/p&gt;

&lt;p&gt;First, we implemented a strict "No Manual IAM" policy for data access. If a team needs cross-account access, they submit a PR to our Terraform repository that defines a &lt;code&gt;aws_lakeformation_permissions&lt;/code&gt; resource. If it's not in the HCL, it doesn't exist. This prevents the "drift" that caused our 3:00 AM incident.&lt;/p&gt;

&lt;p&gt;Second, we moved to an LF-Tag based access control (LF-TBAC) model. Instead of granting access to specific tables, we tag tables with &lt;code&gt;Classification: PII&lt;/code&gt; or &lt;code&gt;Environment: Prod&lt;/code&gt;. The cross-account role gets access to everything tagged &lt;code&gt;Environment: Prod&lt;/code&gt;. When a new table is added, it’s automatically shared if it carries the right tag. This eliminated the manual work of adding new tables to the grant list, which was a constant source of "Access Denied" tickets.&lt;/p&gt;

&lt;p&gt;Finally, we use AWS CloudTrail to monitor &lt;code&gt;GetDataAccess&lt;/code&gt; events. If an account attempts to query a table they don't have permission for, it fires an alarm in our SOC. Before, we were flying blind; now, we see the attempt in real-time.&lt;/p&gt;

&lt;p&gt;The real lesson? Don't fight the platform. If you’re still trying to manage cross-account data sharing using S3 bucket policies and IAM roles, you’re just building technical debt that will eventually wake you up at 3:00 AM. Stop it. Move the governance to the layer that was built to handle it. Your auditors—and your sleep schedule—will thank you.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover photo by &lt;a href="https://unsplash.com/@albertstoynov?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Albert Stoynov&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>data</category>
      <category>security</category>
      <category>governance</category>
    </item>
    <item>
      <title>Why I Stopped Caring Which Table Format You Use</title>
      <dc:creator>Aniket Abhishek Soni</dc:creator>
      <pubDate>Sat, 27 Jun 2026 23:18:18 +0000</pubDate>
      <link>https://dev.to/aniketsoni/why-i-stopped-caring-which-table-format-you-use-4mb4</link>
      <guid>https://dev.to/aniketsoni/why-i-stopped-caring-which-table-format-you-use-4mb4</guid>
      <description>&lt;p&gt;Roughly 82% of the production data pipelines I audited in 2025 were still relying on legacy Hive metastores as their primary source of truth, despite moving to "modern" cloud data lakes.&lt;/p&gt;

&lt;p&gt;That statistic matters because it explains why we’re all so obsessed with the Delta Lake vs. Apache Iceberg debate. We aren't fighting over technical superiority; we’re fighting over the fear of vendor lock-in. We keep choosing formats like they’re lifetime marriages, terrified that picking the "wrong" one will force a multi-petabyte migration in three years.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why I chose this topic:&lt;/strong&gt; After spending a decade migrating between proprietary formats and open-source standards, I’m tired of the tribalism. Interoperability is finally here, and it’s time we treated these formats as interchangeable storage layouts rather than religious identities.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most engineers use the term "table format" to describe the metadata layer, but they rarely understand the actual transition from manifest files to snapshots. We treat &lt;code&gt;delta.enableDeletionVectors&lt;/code&gt; or &lt;code&gt;iceberg.engine.hive.enabled&lt;/code&gt; as magic flags, but when the commit lock fails at 3:00 AM on a Sunday, your deep-seated belief in one format over the other won't stop the partition skew.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it actually works
&lt;/h2&gt;

&lt;p&gt;The industry has moved past the "one format to rule them all" era. The reality of 2026 is that the storage format—whether it’s the Delta &lt;code&gt;_delta_log&lt;/code&gt; or the Iceberg &lt;code&gt;metadata/&lt;/code&gt; directory—is increasingly becoming an implementation detail hidden behind abstraction layers like Databricks UniForm or the Iceberg REST catalog.&lt;/p&gt;

&lt;p&gt;Take UniForm (Universal Format). It isn’t just a marketing slide; it’s a translation layer. When you enable it on a Delta table, you’re essentially running an asynchronous background process that writes Iceberg-compatible metadata alongside your Delta logs.&lt;/p&gt;

&lt;p&gt;In code, it looks deceptively simple. You aren't rewriting your data; you're just updating the table properties:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s1"&gt;'delta.universalFormat.enabledIceberg'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this runs, the engine maintains the Delta transaction log while simultaneously generating the Iceberg &lt;code&gt;metadata.json&lt;/code&gt; and manifest files. An engine like Trino or StarRocks doesn't need to know the Delta log exists; it points to the Iceberg metadata and reads the Parquet files as if they were native Iceberg data.&lt;/p&gt;

&lt;p&gt;This is the "interoperability" endgame. You keep the high-performance write features of Delta—like deletion vectors for CDC or Z-Ordering for query acceleration—while exposing the table to the entire Iceberg-compatible ecosystem. You aren't choosing a side; you’re choosing a storage engine that speaks two languages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1600160189693-19ceef492080%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxM3x8YnJpZGdlJTIwb3ZlciUyMHJpdmVyfGVufDB8MHx8fDE3ODI1ODkxNjV8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1600160189693-19ceef492080%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxM3x8YnJpZGdlJTIwb3ZlciUyMHJpdmVyfGVufDB8MHx8fDE3ODI1ODkxNjV8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Viktor Talashuk on Unsplash" width="1080" height="715"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@viktortalashuk?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Viktor Talashuk&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The tradeoffs nobody mentions
&lt;/h2&gt;

&lt;p&gt;If this sounds like a free lunch, it isn't. The cost is "metadata bloat" and "write amplification."&lt;/p&gt;

&lt;p&gt;When you enable UniForm, you are effectively doubling the metadata overhead of your table. If your pipeline performs frequent, high-concurrency writes, the asynchronous translation process can lag. I’ve seen cases where a downstream Iceberg-native reader (like a legacy Presto cluster) missed the most recent 15 minutes of data because the background manifest generation hadn't caught up to the Delta transaction.&lt;/p&gt;

&lt;p&gt;Then there’s the failure mode of the "dual-writer" trap. If you have an application that attempts to write to the Iceberg metadata while your main pipeline is writing to the Delta log, you hit a consistency nightmare. You’ll see &lt;code&gt;ConcurrentModificationException&lt;/code&gt; errors that are incredibly difficult to debug because the logs don't clearly state which format failed the commit.&lt;/p&gt;

&lt;p&gt;Another issue: schema evolution. Delta’s schema evolution is notoriously permissive—you can add columns almost anywhere. Iceberg is stricter, enforcing partition evolution and column ID mapping. When you bridge them, you are forced to abide by the intersection of their constraints. If you try to perform a complex &lt;code&gt;ALTER TABLE&lt;/code&gt; that Iceberg doesn't support but Delta allows, the translation layer breaks. You end up with a table that is valid in Delta but "corrupted" in the eyes of your Iceberg-based tools.&lt;/p&gt;

&lt;p&gt;Finally, consider the maintenance tax. You now have two sets of vacuuming and snapshot expiration policies to manage. If you run &lt;code&gt;VACUUM&lt;/code&gt; on your Delta table but don’t properly expire the corresponding Iceberg snapshots, you end up with "orphan" metadata files that cost you cloud storage money and confuse your catalog.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1680992046615-065f58bcb4d8%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwzfHxzZXJ2ZXIlMjByb29tJTIwY2FibGVzfGVufDB8MHx8fDE3ODI1ODkxNjV8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1680992046615-065f58bcb4d8%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwzfHxzZXJ2ZXIlMjByb29tJTIwY2FibGVzfGVufDB8MHx8fDE3ODI1ODkxNjV8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Tyler on Unsplash" width="1080" height="608"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@tylergm?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Tyler&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When to reach for it (and when not to)
&lt;/h2&gt;

&lt;p&gt;Reach for a cross-format architecture if your organization is fragmented. If you have a Databricks-heavy team of data engineers but an analytical layer (BI, ad-hoc SQL) that runs on Trino, StarRocks, or Flink, UniForm is a godsend. It prevents the "data siloing" that occurs when the BI team complains they can't see the latest metrics because they aren't on the Databricks cluster.&lt;/p&gt;

&lt;p&gt;It is also the right move if you are planning a long-term migration. Instead of a "big bang" migration, you can flip the UniForm switch, test your secondary compute engine, and slowly shift workloads over months.&lt;/p&gt;

&lt;p&gt;Do not reach for it if you are a "single-stack" shop. If your entire pipeline—from raw ingestion to the final dashboard—lives inside the Databricks ecosystem, adding Iceberg translation is unnecessary complexity. You are adding a layer of risk and metadata overhead for zero business value.&lt;/p&gt;

&lt;p&gt;Don't reach for it if you have severe latency requirements for ingestion. If you need sub-second visibility into your data, the asynchronous nature of the translation layer will be your enemy. The delay between the primary commit and the secondary metadata generation is a latency floor you cannot bridge.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The "Delta vs. Iceberg" war was useful for driving innovation, but it’s over. We have entered the era of the "multi-format lakehouse."&lt;/p&gt;

&lt;p&gt;Stop treating table formats as your primary architectural decision. The real decision is how you manage your catalog and how much complexity you’re willing to trade for portability. In 2026, the best engineer in the room isn't the one who can recite the pros and cons of Parquet file layouts; it's the one who knows how to configure a translation layer so the business doesn't have to care about the underlying format at all.&lt;/p&gt;

&lt;p&gt;Keep your schema tight, watch your metadata bloat, and stop choosing sides. The infrastructure is finally catching up to the reality that we just want our data to be readable by whatever tool we pick up today.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover photo by &lt;a href="https://unsplash.com/@intricateexplorer?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Intricate Explorer&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datalake</category>
      <category>engineering</category>
      <category>architecture</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Is Snowflake-managed Iceberg the ultimate Trojan horse or your data’s salvation?</title>
      <dc:creator>Aniket Abhishek Soni</dc:creator>
      <pubDate>Thu, 25 Jun 2026 22:34:52 +0000</pubDate>
      <link>https://dev.to/aniketsoni/is-snowflake-managed-iceberg-the-ultimate-trojan-horse-or-your-datas-salvation-580l</link>
      <guid>https://dev.to/aniketsoni/is-snowflake-managed-iceberg-the-ultimate-trojan-horse-or-your-datas-salvation-580l</guid>
      <description>&lt;p&gt;Snowflake-managed Iceberg tables are the only way to escape the platform's proprietary storage format without sacrificing the performance that makes you pay the premium.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why I chose this topic:&lt;/strong&gt; I spent four years watching migration projects die in the "vendor lock-in" valley of death. I’m writing this because I’m tired of seeing engineers choose inferior tech just because they’re terrified of a data platform they actually like using.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was 3:14 AM on a Tuesday. The PagerDuty alert hit my phone with the urgency of a heart attack: &lt;code&gt;CRITICAL: External Table Latency Exceeded Threshold&lt;/code&gt;. Our internal dashboard, which pulls telemetry from an S3 bucket via Snowflake’s external tables, had effectively died. Queries that usually took 300ms were timing out at the 60-second mark. &lt;/p&gt;

&lt;p&gt;I rolled out of bed, fired up the VPN, and saw the mess. We were trying to join a massive, partitioned Parquet dataset against a local Snowflake dimension table. The metadata overhead for the external table was choking the query engine. Every time we added a new partition, the &lt;code&gt;MSCK REPAIR TABLE&lt;/code&gt; or the periodic metadata refresh would lag, and the query planner would go into a death spiral trying to reconcile the file manifest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we saw
&lt;/h2&gt;

&lt;p&gt;The symptoms were classic: &lt;code&gt;QUERY_HISTORY&lt;/code&gt; showed massive &lt;code&gt;EXTERNAL_SCAN&lt;/code&gt; times. At first, my junior engineer assumed it was an S3 throttling issue. We checked the AWS CloudWatch metrics for the bucket. Nothing. The bandwidth was fine, the API requests were well within limits, and the latency was stable. &lt;/p&gt;

&lt;p&gt;We then spent two hours chasing ghosts in the &lt;code&gt;EXPLAIN&lt;/code&gt; plan. We thought the file format (Parquet with Snappy compression) was the problem, so we tried re-partitioning the data into smaller chunks. It made it worse. The overhead of Snowflake tracking thousands of small files in a standard external table was creating a metadata bottleneck that simply couldn't scale. We were fighting the platform's inability to reconcile the "source of truth" in S3 with the "state of the world" in Snowflake.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1509410861810-031fc9265b7f%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyNnx8aWNlYmVyZyUyMGluJTIwZGFyayUyMHdhdGVyfGVufDB8MHx8fDE3ODI0MTgzNzl8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1509410861810-031fc9265b7f%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyNnx8aWNlYmVyZyUyMGluJTIwZGFyayUyMHdhdGVyfGVufDB8MHx8fDE3ODI0MTgzNzl8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Willian Justen de Vasconcellos on Unsplash" width="1080" height="809"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@willianjusten?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Willian Justen de Vasconcellos&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Root cause
&lt;/h2&gt;

&lt;p&gt;The root cause was the inherent fragility of standard External Tables. Snowflake doesn't "own" the metadata for standard External Tables; it has to infer it. When you rely on &lt;code&gt;AUTO_REFRESH = TRUE&lt;/code&gt;, you’re essentially asking Snowflake to play a high-stakes game of catch-up with S3 event notifications.&lt;/p&gt;

&lt;p&gt;We were using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;raw_events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="n"&gt;variant&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'@s3_stage/events/'&lt;/span&gt;
&lt;span class="n"&gt;FILE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The issue is that the metadata is detached from the data. If the S3 event notification fails—or if a backfill process bypasses the staging area—the manifest drifts. Snowflake’s query optimizer was forced to perform a full directory listing of the S3 prefix because it couldn’t trust its own stale manifest. That’s a 10-second metadata latency penalty on every query, regardless of the compute warehouse size.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1544819679-57b273c027a3%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyMHx8ZGF0YSUyMHBpcGVsaW5lJTIwYmx1ZXByaW50fGVufDB8MHx8fDE3ODI0MTgzODB8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1544819679-57b273c027a3%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyMHx8ZGF0YSUyMHBpcGVsaW5lJTIwYmx1ZXByaW50fGVufDB8MHx8fDE3ODI0MTgzODB8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Isaac Smith on Unsplash" width="1080" height="720"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@isaacmsmith?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Isaac Smith&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;We migrated the entire pipeline to Snowflake-managed Iceberg tables. The switch was surprisingly trivial. By using the &lt;code&gt;CATALOG_INTEGRATION&lt;/code&gt; feature, we kept the physical files in our own S3 bucket while letting Snowflake manage the Iceberg metadata (the &lt;code&gt;metadata.json&lt;/code&gt; files and snapshot pointers).&lt;/p&gt;

&lt;p&gt;The SQL change looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;ICEBERG&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;managed_events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;event_timestamp&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="n"&gt;VARIANT&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SNOWFLAKE'&lt;/span&gt;
&lt;span class="n"&gt;EXTERNAL_VOLUME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'my_s3_volume'&lt;/span&gt;
&lt;span class="n"&gt;BASE_LOCATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-bucket/iceberg/events/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suddenly, the metadata was no longer a guessing game. Because Snowflake manages the Iceberg metadata, it holds the absolute state of the table. The query optimizer doesn’t need to scan S3 buckets; it reads the latest snapshot pointer from the Iceberg metadata file. The 10-second latency vanished instantly, dropping back to sub-second responses.&lt;/p&gt;

&lt;p&gt;More importantly, because this is actual Iceberg (v2 spec), the data is perfectly readable by Spark, Trino, or any other engine that supports the Iceberg API. If Snowflake ever decides to hike their storage premiums or if we decide to move a portion of our workload to an EMR cluster, the data is already in an open, portable format. We didn't leave Snowflake; we just stopped letting them hold our data hostage in a proprietary silo.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed so it never happens again
&lt;/h2&gt;

&lt;p&gt;We implemented a strict "Iceberg-first" policy for all new analytical datasets. No more standard External Tables. Period.&lt;/p&gt;

&lt;p&gt;We also automated the validation of our metadata. We now run a daily &lt;code&gt;SYSTEM$GET_ICEBERG_TABLE_INFORMATION&lt;/code&gt; check against our critical tables to ensure the snapshot age is within acceptable bounds. If the metadata hasn't been updated in 24 hours, the job alerts the team—not because the query is slow, but because it indicates a break in the Iceberg commit log.&lt;/p&gt;

&lt;p&gt;We also decoupled our storage from our compute by moving to an &lt;code&gt;EXTERNAL_VOLUME&lt;/code&gt; architecture. This allows us to point Snowflake at an S3 bucket while maintaining full control over the lifecycle policies of the data itself. If we want to transition the data to Glacier or delete it after 90 days, we can do it directly via S3 lifecycle rules without Snowflake knowing or caring.&lt;/p&gt;

&lt;p&gt;The biggest shift was psychological. We stopped treating Snowflake as a "black box" where data goes to die. By using managed Iceberg, we treat Snowflake as a high-performance compute engine that happens to be sitting on top of an open data lake. &lt;/p&gt;

&lt;p&gt;The vendor lock-in narrative is often a crutch for bad architecture. You don't have to leave the platform to own your data. You just have to stop using the proprietary features that anchor you to the platform's specific storage implementation. Snowflake-managed Iceberg gives you the best of both worlds: the speed of a premium warehouse and the sovereignty of an open-source standard. Use it, or keep waking up at 3:00 AM to fix metadata drift. Your choice.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover photo by &lt;a href="https://unsplash.com/@nathananderson?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Nathan Anderson&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>snowflake</category>
      <category>iceberg</category>
      <category>data</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Why you should stop manual Z-ordering today</title>
      <dc:creator>Aniket Abhishek Soni</dc:creator>
      <pubDate>Thu, 25 Jun 2026 12:31:38 +0000</pubDate>
      <link>https://dev.to/aniketsoni/why-you-should-stop-manual-z-ordering-today-54hf</link>
      <guid>https://dev.to/aniketsoni/why-you-should-stop-manual-z-ordering-today-54hf</guid>
      <description>&lt;p&gt;If you are still manually running &lt;code&gt;OPTIMIZE table ZORDER BY (col)&lt;/code&gt; in your production pipelines, you are wasting engineering cycles on a legacy pattern that Delta Lake has already solved.&lt;/p&gt;

&lt;p&gt;I see teams treating Z-ordering like a permanent architectural requirement rather than a stopgap for inefficient storage layouts. Most engineers I talk to are terrified to switch to Liquid Clustering because they don't trust the black box. They prefer the illusion of control that comes with writing custom Spark jobs to re-cluster data every night.&lt;/p&gt;

&lt;p&gt;It’s time to stop babysitting your partitions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contenders
&lt;/h2&gt;

&lt;p&gt;Z-ordering is the manual transmission of the data world. You designate specific columns, you run a command, and you hope your data distribution remains somewhat balanced. It’s effective, but it’s brittle. If your query patterns shift—say, from filtering by &lt;code&gt;user_id&lt;/code&gt; to filtering by &lt;code&gt;event_timestamp&lt;/code&gt;—your Z-order becomes a liability, actively slowing down queries that don't match your original index.&lt;/p&gt;

&lt;p&gt;Liquid clustering, introduced in Databricks Runtime 13.3 LTS, is the automatic transmission. It uses multi-dimensional clustering to adapt your data layout based on the actual usage patterns detected by the Delta engine. You define the clustering columns once, and the table optimizes itself as data flows in. No more &lt;code&gt;OPTIMIZE&lt;/code&gt; jobs. No more guessing which columns matter most.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1762278804798-dd7e493db051%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyMXx8YWJzdHJhY3QlMjBuZXR3b3JrJTIwbGluZXN8ZW58MHwwfHx8MTc4MjI0NTc2OHww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1762278804798-dd7e493db051%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyMXx8YWJzdHJhY3QlMjBuZXR3b3JrJTIwbGluZXN8ZW58MHwwfHx8MTc4MjI0NTc2OHww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Logan Voss on Unsplash" width="1080" height="608"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@loganvoss?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Logan Voss&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The operational tax
&lt;/h2&gt;

&lt;p&gt;When you run Z-ordering, you are paying a massive "Ops Tax." You need a dedicated cluster to run the &lt;code&gt;OPTIMIZE&lt;/code&gt; job. You need to manage the frequency of that job. If you trigger it too often, you’re burning compute costs; trigger it too little, and your read performance degrades as the table accumulates small files and data skew.&lt;/p&gt;

&lt;p&gt;I’ve seen teams with hundreds of tables spend upwards of 20% of their total DBU budget just on maintenance tasks like &lt;code&gt;OPTIMIZE&lt;/code&gt; and &lt;code&gt;VACUUM&lt;/code&gt;. In one healthcare project, we spent three days a month just tuning the Z-order columns for a 50TB fact table because the business changed their primary reporting dimensions.&lt;/p&gt;

&lt;p&gt;Liquid clustering changes the math. Because it is incremental and integrated into the write path (via &lt;code&gt;CLUSTER BY&lt;/code&gt;), the "maintenance" is baked into the ingestion. You pay a slight overhead during the write, but you stop paying the "maintenance tax" of running huge, batch-heavy &lt;code&gt;OPTIMIZE&lt;/code&gt; jobs. If your write throughput is the bottleneck, this is a trade-off. If your compute costs are the bottleneck, this is a massive win.&lt;/p&gt;

&lt;h2&gt;
  
  
  The failure modes
&lt;/h2&gt;

&lt;p&gt;Z-ordering is prone to "job explosion." If you have a large table and your &lt;code&gt;OPTIMIZE&lt;/code&gt; job fails, you are often left with a partial state. If you aren't using Delta’s transaction logs effectively, or if your job fails mid-rewrite, you have to rollback or perform a full re-process. I’ve spent many early mornings debugging &lt;code&gt;ConcurrentModificationException&lt;/code&gt; errors triggered by an &lt;code&gt;OPTIMIZE&lt;/code&gt; job hitting a streaming ingestion window.&lt;/p&gt;

&lt;p&gt;Liquid clustering is significantly more resilient. Because it handles clustering at the write layer, it is transactional by design. You don't have to worry about the "maintenance window" because there isn't one. The failure mode shifts from "my maintenance job crashed" to "my write job is slightly slower." &lt;/p&gt;

&lt;p&gt;However, don't let the marketing fool you: Liquid Clustering isn't magic. It has a limit on the number of clustering columns—typically 4. If you have a table where you need to filter by 10 different dimensions, you’re going to hit a wall. In my experience, if you need more than 4 columns for clustering, your data model is the problem, not the engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1546710237-057a5519a3d1%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyMnx8bWVjaGFuaWNhbCUyMGdlYXJzJTIwdHVybmluZ3xlbnwwfDB8fHwxNzgyMjQ1NzY5fDA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1546710237-057a5519a3d1%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyMnx8bWVjaGFuaWNhbCUyMGdlYXJzJTIwdHVybmluZ3xlbnwwfDB8fHwxNzgyMjQ1NzY5fDA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by bert b on Unsplash" width="1080" height="719"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@bertsz?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;bert b&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring the delta
&lt;/h2&gt;

&lt;p&gt;How do you know if you should switch? Don't guess. Pull your &lt;code&gt;DESCRIBE DETAIL&lt;/code&gt; stats and look at the &lt;code&gt;numRecords&lt;/code&gt; and &lt;code&gt;sizeInBytes&lt;/code&gt;. More importantly, dig into the &lt;code&gt;delta.history&lt;/code&gt; and &lt;code&gt;delta.operationMetrics&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you are currently Z-ordering, look at the &lt;code&gt;numRemovedFiles&lt;/code&gt; vs &lt;code&gt;numAddedFiles&lt;/code&gt; in your &lt;code&gt;OPTIMIZE&lt;/code&gt; metrics. If those numbers are consistently massive, you are churning your storage layer unnecessarily. &lt;/p&gt;

&lt;p&gt;To measure the efficacy of Liquid Clustering, you need to use the &lt;code&gt;EXPLAIN&lt;/code&gt; plan. Run a query before the switch and after the switch using &lt;code&gt;EXPLAIN EXTENDED&lt;/code&gt;. Look for &lt;code&gt;DataSkipping&lt;/code&gt;. If Liquid Clustering is working, you should see fewer &lt;code&gt;PartitionFilters&lt;/code&gt; (because we’re moving away from hard partitions) and more &lt;code&gt;DataSkipping&lt;/code&gt; stats. &lt;/p&gt;

&lt;p&gt;Specifically, look at the &lt;code&gt;min&lt;/code&gt; and &lt;code&gt;max&lt;/code&gt; values in your file statistics. If Liquid Clustering is doing its job, the range of these values in individual files should be tighter. If you see wide, overlapping ranges for your high-cardinality columns, the engine isn't clustering effectively, and you might need to adjust your &lt;code&gt;CLUSTER BY&lt;/code&gt; configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd pick, and why
&lt;/h2&gt;

&lt;p&gt;If you are on Databricks Runtime 13.3 or higher, Liquid Clustering is the default choice. There is almost no scenario in a modern cloud-native stack where I would recommend starting a new project with manual Z-ordering.&lt;/p&gt;

&lt;p&gt;The only caveat is cost: Liquid Clustering performs more work during the write phase. If you are doing extreme high-frequency, low-latency streaming (e.g., sub-second ingestion into a real-time dashboard), the added overhead of clustering on the write might impact your P99 latency. In those cases, you might want to ingest into a raw, non-clustered table and then use a background process to move it to a clustered table.&lt;/p&gt;

&lt;p&gt;But for 95% of the financial services and healthcare data I’ve worked with—where we deal with massive batch loads and complex analytical queries—Liquid Clustering wins. It removes the human error element. Engineers are terrible at predicting query patterns three months in advance, and Z-ordering requires you to be a psychic. &lt;/p&gt;

&lt;p&gt;Liquid clustering acknowledges that we don't know exactly how the data will be queried in the future. It’s an admission that the storage layout should be dynamic. &lt;/p&gt;

&lt;p&gt;Stop managing your data like it's 2018. Move to Liquid Clustering, stop the manual &lt;code&gt;OPTIMIZE&lt;/code&gt; jobs, and spend your time building features instead of tuning file sizes. You’ll sleep better, and your cloud bill will thank you.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover photo by &lt;a href="https://unsplash.com/@winstonchen?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Winston Chen&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>deltalake</category>
      <category>bigdata</category>
      <category>engineering</category>
    </item>
    <item>
      <title>Stop waking up at 3 AM: Why your data pipelines must be idempotent</title>
      <dc:creator>Aniket Abhishek Soni</dc:creator>
      <pubDate>Sun, 21 Jun 2026 20:57:40 +0000</pubDate>
      <link>https://dev.to/aniketsoni/stop-waking-up-at-3-am-why-your-data-pipelines-must-be-idempotent-5afg</link>
      <guid>https://dev.to/aniketsoni/stop-waking-up-at-3-am-why-your-data-pipelines-must-be-idempotent-5afg</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why I chose this topic:&lt;/strong&gt; In my first year as a junior engineer, I pushed a non-idempotent job that double-counted $2M in revenue because I didn't wrap a delete statement in a transaction. I haven't slept through a production alert since, and I’m tired of seeing junior engineers make the same career-limiting mistake.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You ship the job. It passes CI. It runs perfectly in staging. Then, at 3:14 AM, the Airflow scheduler hiccups, a network partition hits your Redshift cluster, or your upstream API returns a 500. The job fails halfway through. You wake up to a PagerDuty alert, a half-loaded partition in S3, and a duplicate record set in your main reporting table. You have two choices: manually clean up the mess for three hours, or click "Clear" on the task and pray it doesn't make things worse.&lt;/p&gt;

&lt;p&gt;If you have to "pray" when you hit re-run, your pipeline is garbage.&lt;/p&gt;

&lt;p&gt;Idempotency—the property where an operation produces the same result regardless of how many times it’s executed—is the only thing standing between you and a mid-life crisis.&lt;/p&gt;

&lt;h2&gt;
  
  
  The path you’re choosing
&lt;/h2&gt;

&lt;p&gt;You are currently deciding between two fundamental ways to build data movement: "Append-Only" versus "Replace-by-Key."&lt;/p&gt;

&lt;p&gt;Most engineers start with "Append-Only." It’s intuitive. You pull data from a source and write it to the sink. It’s fast. It’s cheap. It’s also a ticking time bomb. The "Replace-by-Key" approach is the boring, professional way to handle data. It requires more compute cycles and a bit of SQL gymnastics, but it guarantees that if you run the job ten times, the state of your database is identical to running it once.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1764507638424-aa37a28f01cb%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxMnx8YnJva2VuJTIwY2xvY2t8ZW58MHwwfHx8MTc4MjA3MTY5NXww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1764507638424-aa37a28f01cb%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxMnx8YnJva2VuJTIwY2xvY2t8ZW58MHwwfHx8MTc4MjA3MTY5NXww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by ACatInABox on Unsplash" width="1080" height="560"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@acatinabox?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;ACatInABox&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The contenders
&lt;/h2&gt;

&lt;p&gt;In the Append-Only corner, we have the "Insert-Append" strategy. You fetch &lt;code&gt;SELECT *&lt;/code&gt; from your source and &lt;code&gt;INSERT INTO destination&lt;/code&gt; without checking for existing IDs. &lt;/p&gt;

&lt;p&gt;In the Replace-by-Key corner, we have "Delete-then-Insert." You identify the window of data you are processing, wipe it from the destination table, and insert the new payload within a single atomic transaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost of doing business
&lt;/h2&gt;

&lt;p&gt;Append-only pipelines are cheap on paper. You aren't burning CPU cycles checking for pre-existing keys or performing deletes. However, the "hidden cost" is the engineer’s salary. When your append-only job fails at 3 AM, you aren't just paying for the compute; you are paying for the two hours of your life you spend writing ad-hoc &lt;code&gt;DELETE&lt;/code&gt; scripts to deduplicate the table.&lt;/p&gt;

&lt;p&gt;With a Replace-by-Key strategy, your compute bill might tick up by 5-10%. You are running a &lt;code&gt;DELETE FROM table WHERE processing_date = '2023-10-27'&lt;/code&gt; before your &lt;code&gt;INSERT&lt;/code&gt;. In a system like Snowflake or BigQuery, this is negligible. If you are running on an old-school Postgres instance with a massive table, yes, you’ll need to partition the table by date to keep the &lt;code&gt;DELETE&lt;/code&gt; operation performant. But that’s a design choice, not a failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing the ops burden
&lt;/h2&gt;

&lt;p&gt;The operational burden of append-only is a nightmare of state management. You have to maintain an "audit" table or a complex &lt;code&gt;WHERE NOT EXISTS&lt;/code&gt; clause that slows down as the table grows. I’ve seen developers use &lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt; in Postgres, which is a start, but it fails to account for updates. What if the source data changed? Now your database record is stale, and your pipeline is "idempotent" but factually wrong.&lt;/p&gt;

&lt;p&gt;Replace-by-Key shifts the burden to the schema. You need to ensure your tables are partitioned. If you are using dbt (data build tool), this is the default behavior. A &lt;code&gt;dbt run --select my_model&lt;/code&gt; with the &lt;code&gt;--full-refresh&lt;/code&gt; flag or the default incremental strategy using &lt;code&gt;delete+insert&lt;/code&gt; handles the atomicity for you. If you aren't using a tool that abstracts this, you are effectively building a custom database engine on top of your pipeline. Don't do that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1669023414162-5bb06bbff0ec%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxM3x8bWVzc3klMjBjb2RlfGVufDB8MHx8fDE3ODIwNzE2OTZ8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1669023414162-5bb06bbff0ec%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxM3x8bWVzc3klMjBjb2RlfGVufDB8MHx8fDE3ODIwNzE2OTZ8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Growtika on Unsplash" width="1080" height="608"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@growtika?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Growtika&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure modes and recovery
&lt;/h2&gt;

&lt;p&gt;Consider the "Partial Write" failure mode. Your job hits a timeout after inserting 50% of the rows. &lt;/p&gt;

&lt;p&gt;If you used Append-Only, you now have a partial state. If you re-run it, you have 150% of the data. You now have to manually delete the rows that snuck in during the first failed attempt. It’s a guessing game of &lt;code&gt;WHERE created_at &amp;gt; ...&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;With Replace-by-Key, the failure mode is clean. Because the operation is wrapped in a &lt;code&gt;BEGIN; DELETE ...; INSERT ...; COMMIT;&lt;/code&gt; block, the failure means the database rolls back to the state before the job started. You can hit "Retry" in Airflow or Dagster a thousand times, and the end result remains exactly the same. The "atomicity" of the transaction is your insurance policy.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd pick, and why
&lt;/h2&gt;

&lt;p&gt;I choose the Replace-by-Key strategy every single time. &lt;/p&gt;

&lt;p&gt;The caveat is that you must have a deterministic way to define the "key." If your source data doesn't have a natural key or a reliable &lt;code&gt;updated_at&lt;/code&gt; timestamp, you’re in trouble. In those cases, I enforce a &lt;code&gt;processing_batch_id&lt;/code&gt; or a partition date as the key. If I can't guarantee a clean wipe and replace, I don't build the pipeline.&lt;/p&gt;

&lt;p&gt;Don't listen to the people who say "this is over-engineering." They are the same people who are currently fixing a production data quality issue on a Saturday morning. &lt;/p&gt;

&lt;p&gt;My advice: Use dbt to manage your incremental models. If you’re writing raw Python or Spark, force yourself to use the &lt;code&gt;DELETE ... WHERE partition = X&lt;/code&gt; pattern. If your database doesn't support transactions (looking at you, some early S3-based data lakes), use a "staging-to-production" swap. Write to a temporary table, then use an atomic rename to swap it with the production table.&lt;/p&gt;

&lt;p&gt;Sleep is worth more than the few extra dollars in compute costs. Make your jobs boring, make them transactional, and stop waking up in the middle of the night.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover photo by &lt;a href="https://unsplash.com/@tylergm?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Tyler&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>pipelines</category>
      <category>architecture</category>
      <category>data</category>
    </item>
    <item>
      <title>What building HIPAA-compliant lakehouses taught me about real-world encryption</title>
      <dc:creator>Aniket Abhishek Soni</dc:creator>
      <pubDate>Sat, 20 Jun 2026 17:50:41 +0000</pubDate>
      <link>https://dev.to/aniketsoni/what-building-hipaa-compliant-lakehouses-taught-me-about-real-world-encryption-7co</link>
      <guid>https://dev.to/aniketsoni/what-building-hipaa-compliant-lakehouses-taught-me-about-real-world-encryption-7co</guid>
      <description>&lt;p&gt;Eighty-two percent of data breaches in healthcare don't happen because of a sophisticated nation-state actor; they happen because a junior engineer accidentally left an S3 bucket open or pushed a cleartext JSON blob containing social security numbers to a shared staging environment.&lt;/p&gt;

&lt;p&gt;We obsess over "zero trust" and "encryption at rest," but we rarely talk about the reality of the data lifecycle. If your lakehouse isn't architected for granular, row-level access control, you aren't HIPAA compliant—you’re just waiting for a forensic audit to end your career.&lt;/p&gt;

&lt;p&gt;Most engineers treat &lt;code&gt;AES-256&lt;/code&gt; like a magic wand. They check the box for "Server-Side Encryption" on their S3 buckets and assume they’ve satisfied the Privacy Rule. They haven't. Compliance isn't about whether the disk is encrypted; it’s about who can see the decrypted contents and where that data manifests in your logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  The mechanics of the pipeline
&lt;/h3&gt;

&lt;p&gt;When building a lakehouse (think Databricks on Delta Lake or Snowflake), the "Gold" layer is where compliance goes to die. You have clean, joined, enriched data that happens to contain PHI. If you are still using simple IAM roles to govern access, you are doing it wrong.&lt;/p&gt;

&lt;p&gt;You need to implement column-level masking and row-level security (RLS) at the storage abstraction layer. In Databricks, for example, you shouldn't just be granting &lt;code&gt;SELECT&lt;/code&gt; on a table. You should be using &lt;code&gt;MASKING FUNCTIONS&lt;/code&gt; on columns containing identifiers.&lt;/p&gt;

&lt;p&gt;Here is what the actual implementation looks like in a production environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;redact_ssn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ssn&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;is_member&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'data_scientists'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'***-**-****'&lt;/span&gt;
              &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;ssn&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;silver_health_records&lt;/span&gt; 
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;ssn&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;MASKED&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;redact_ssn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ssn&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the baseline. But the real "gotcha" happens when your Spark job kicks in. When you run a &lt;code&gt;df.write.mode("overwrite")&lt;/code&gt; operation, Spark creates temporary files in your staging directory. If you aren't careful, these temporary files contain the raw, unmasked data. Even if you have masking on the table, the raw data sits in an S3 prefix that your monitoring tools or data discovery crawlers might index.&lt;/p&gt;

&lt;p&gt;To fix this, you must enforce ephemeral encryption keys for the shuffle service. In your Spark config, you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt;.&lt;span class="n"&gt;io&lt;/span&gt;.&lt;span class="n"&gt;encryption&lt;/span&gt;.&lt;span class="n"&gt;enabled&lt;/span&gt; &lt;span class="n"&gt;true&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;.&lt;span class="n"&gt;hadoop&lt;/span&gt;.&lt;span class="n"&gt;fs&lt;/span&gt;.&lt;span class="n"&gt;s3a&lt;/span&gt;.&lt;span class="n"&gt;encryption&lt;/span&gt;.&lt;span class="n"&gt;algorithm&lt;/span&gt; &lt;span class="n"&gt;SSE&lt;/span&gt;-&lt;span class="n"&gt;KMS&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;.&lt;span class="n"&gt;hadoop&lt;/span&gt;.&lt;span class="n"&gt;fs&lt;/span&gt;.&lt;span class="n"&gt;s3a&lt;/span&gt;.&lt;span class="n"&gt;server&lt;/span&gt;-&lt;span class="n"&gt;side&lt;/span&gt;-&lt;span class="n"&gt;encryption&lt;/span&gt;.&lt;span class="n"&gt;key&lt;/span&gt; &amp;lt;&lt;span class="n"&gt;your&lt;/span&gt;-&lt;span class="n"&gt;kms&lt;/span&gt;-&lt;span class="n"&gt;key&lt;/span&gt;-&lt;span class="n"&gt;id&lt;/span&gt;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without &lt;code&gt;spark.io.encryption.enabled&lt;/code&gt;, your shuffle files—those bits of data written to disk during a join or a sort—are written in plain text. If a node is decommissioned and the underlying EBS volume isn't wiped immediately, you’ve just created a HIPAA violation.&lt;/p&gt;

&lt;h3&gt;
  
  
  The tradeoffs nobody mentions
&lt;/h3&gt;

&lt;p&gt;The primary downside of a locked-down, encrypted lakehouse is "performance tax." Every time you introduce a UDF (User Defined Function) for masking or enforce RLS, you break the query optimizer.&lt;/p&gt;

&lt;p&gt;When you run a &lt;code&gt;SELECT *&lt;/code&gt;, the engine has to evaluate the masking function for every single row. If you’re doing a join across a 50TB dataset, the cost of these functions adds up. Your query latency will spike. I’ve seen teams move from a 10-minute job to a 45-minute job just by adding RLS. &lt;/p&gt;

&lt;p&gt;Then there is the issue of "key rot." If you’re using AWS KMS, you’re likely using Customer Managed Keys (CMKs). Managing the lifecycle, rotation, and—God forbid—re-encrypting the data when a key is compromised, is a nightmare. If you lose access to the KMS key, your data is effectively incinerated. There is no "I forgot my password" for an encrypted lakehouse.&lt;/p&gt;

&lt;p&gt;Also, logging becomes significantly harder. If you are masking data, your logs need to account for &lt;em&gt;why&lt;/em&gt; a user saw a masked value versus the raw value. You end up with a massive metadata overhead. You’re no longer just storing the data; you’re storing the audit trail of who requested the data, what their clearance level was, and which specific masking policy was triggered.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to reach for it (and when not to)
&lt;/h3&gt;

&lt;p&gt;Use granular masking and RLS when you are building a multi-tenant platform. If your lakehouse serves both clinical researchers and internal billing analysts, you have no choice. The billing team needs the SSN; the researcher needs the diagnosis code but shouldn't know the patient's name. In this scenario, the lakehouse is a tool for data minimization, and these features are your primary defense.&lt;/p&gt;

&lt;p&gt;Don't use it when you are running a purely internal, high-throughput analytics pipeline where the "users" are just automated microservices. If you are building a feature engineering pipeline for a machine learning model, and the model only needs the anonymized vector, do the masking &lt;em&gt;before&lt;/em&gt; the data hits the lakehouse.&lt;/p&gt;

&lt;p&gt;Why? Because if you wait until the data is in the lakehouse to mask it, you’ve already failed the principle of "least privilege." The raw, sensitive data is already sitting in your storage layer. If a developer needs to debug the raw data, they’ll have access. Move the transformation upstream. Perform PII/PHI scrubbing in your ingest layer (your Lambda or Fargate tasks) before the data ever touches the &lt;code&gt;bronze&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;The best way to pass a HIPAA audit isn't to build a fancy gatekeeper at the end of the pipeline; it's to ensure the data is effectively neutralized before it enters your environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Compliance is often treated as a bureaucratic checkbox, but in the world of high-scale data engineering, it’s a technical constraint. If you treat PHI as just "another string column," you’re setting yourself up for a catastrophic failure.&lt;/p&gt;

&lt;p&gt;Focus on the mechanics: encrypt your shuffle, use native masking functions rather than application-level logic, and always, &lt;em&gt;always&lt;/em&gt; scrub at the ingest point. The goal is to make the data useless to anyone who doesn't have the explicit, logged, and audited right to see it. If you can do that while keeping your query times under an hour, you're doing better than most of the industry.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover photo by &lt;a href="https://unsplash.com/@tylergm?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Tyler&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>data</category>
      <category>security</category>
      <category>cloud</category>
      <category>compliance</category>
    </item>
    <item>
      <title>Stop letting bad data break your production pipelines</title>
      <dc:creator>Aniket Abhishek Soni</dc:creator>
      <pubDate>Fri, 19 Jun 2026 17:10:46 +0000</pubDate>
      <link>https://dev.to/aniketsoni/stop-letting-bad-data-break-your-production-pipelines-1ob6</link>
      <guid>https://dev.to/aniketsoni/stop-letting-bad-data-break-your-production-pipelines-1ob6</guid>
      <description>&lt;p&gt;Two years ago, a bad batch of FX rates from a vendor hit our silver layer, wiped out the historical aggregates, and triggered a margin call for a client. We spent 48 hours manually re-processing files, running &lt;code&gt;DELETE&lt;/code&gt; statements on partitioned tables, and praying the S3 lifecycle policies didn't catch our backup snapshots. &lt;/p&gt;

&lt;p&gt;Today, that same corrupted file hits the pipeline, the Delta Live Tables (DLT) expectation fails, the records are shunted into a quarantine table, and the pipeline finishes with a "success" status while alerting me to the specific offending rows. No downtime, no manual data surgery, no heart palpitations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why I chose this topic:&lt;/strong&gt; In financial services, "data quality" isn't a suggestion—it's a compliance requirement. I’ve spent too many weekends cleaning up messes that should have been caught at the gate, and I want to save you from the same fate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It was 3:14 AM on a Tuesday. PagerDuty went off like a siren in my bedroom. The dashboard was bleeding red: the &lt;code&gt;daily_ledger_agg&lt;/code&gt; table, which powers our regulatory reporting to the SEC, had hit a null value in a currency code column. This shouldn't have been possible. The schema was enforced, the DDL was solid, yet somehow, a join was producing rows with &lt;code&gt;NULL&lt;/code&gt; keys, which cascaded into a &lt;code&gt;NaN&lt;/code&gt; in our risk models.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we saw
&lt;/h2&gt;

&lt;p&gt;The symptoms were deceptive. The job logs showed the Spark cluster scaling perfectly. Memory usage was nominal. The failure wasn't a crash; it was a "silent" corruption. &lt;/p&gt;

&lt;p&gt;My first instinct was to blame the upstream merge logic. I spent two hours digging into the Spark plans, convinced we had a partitioning skew that was causing a hash join to drop records. I even bumped the &lt;code&gt;spark.sql.shuffle.partitions&lt;/code&gt; from 200 to 1000, thinking it was a resource contention issue. It wasn't. The data was simply bad at the source, and our pipeline—being "resilient"—happily ingested the garbage and propagated it into the downstream analytical tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1624953587687-daf255b6b80a%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwzfHxicm9rZW4lMjBjb2RlfGVufDB8MHx8fDE3ODE3Mjc5Mzd8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1624953587687-daf255b6b80a%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwzfHxicm9rZW4lMjBjb2RlfGVufDB8MHx8fDE3ODE3Mjc5Mzd8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Artturi Jalli on Unsplash" width="1080" height="810"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@artturijalli?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Artturi Jalli&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Root cause
&lt;/h2&gt;

&lt;p&gt;The root cause was a configuration oversight in our DLT pipeline definition. We were using &lt;code&gt;EXPECT&lt;/code&gt; instead of &lt;code&gt;EXPECT_ROW_OR_DROP&lt;/code&gt; or &lt;code&gt;EXPECT_ROW_OR_FAIL&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;currency_code IS NOT NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;silver_ledger&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze_ledger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In DLT, &lt;code&gt;EXPECT&lt;/code&gt; is merely a warning. It logs the violation to the &lt;code&gt;event_log&lt;/code&gt; and keeps moving. The pipeline marked the run as "Succeeded," but our downstream dashboard was showing a 12% drift in total balance. We were essentially poisoning our own data lake, one row at a time. The code was "correct," but the &lt;em&gt;intent&lt;/em&gt; of the data contract was ignored by the execution engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1770915211878-4d5254840535%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyNnx8cXVhcmFudGluZSUyMHNpZ258ZW58MHwwfHx8MTc4MTcyNzkzOHww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1770915211878-4d5254840535%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyNnx8cXVhcmFudGluZSUyMHNpZ258ZW58MHwwfHx8MTc4MTcyNzkzOHww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Alex Oviedo on Unsplash" width="1080" height="720"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@aoviedo2099?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Alex Oviedo&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;I refactored the pipeline to enforce strict quarantine patterns. I switched to &lt;code&gt;EXPECT_ROW_OR_DROP&lt;/code&gt; for non-critical noise and &lt;code&gt;EXPECT_ROW_OR_FAIL&lt;/code&gt; for integrity-critical fields. More importantly, I implemented a secondary "quarantine" sink to capture what was dropped.&lt;/p&gt;

&lt;p&gt;Here is what the pattern looks like now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silver_ledger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect_or_drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;currency_code IS NOT NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;silver_ledger&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze_ledger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quarantine_ledger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;quarantine_ledger&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Capture only the rows that failed the silver_ledger expectations
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze_ledger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;currency_code IS NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By separating the "clean" pipeline from the "rejected" stream, we gain visibility. When a row hits the quarantine table, it gets a &lt;code&gt;_processed_at&lt;/code&gt; timestamp and an &lt;code&gt;error_reason&lt;/code&gt; column. My SRE team now has a dedicated dashboard for this table. If the &lt;code&gt;count(*)&lt;/code&gt; in &lt;code&gt;quarantine_ledger&lt;/code&gt; &amp;gt; 0, they get a Slack alert. We don't stop the business, but we know exactly which records need manual intervention from the data vendor.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we changed so it never happens again
&lt;/h2&gt;

&lt;p&gt;We stopped treating data quality as a post-hoc analysis task. We now treat it as a CI/CD contract. &lt;/p&gt;

&lt;p&gt;First, we moved all DLT expectation definitions into a central registry (a YAML-based schema config). We no longer hardcode business logic expectations in the Python files. By centralizing these, we can run a "Dry Run" test suite in GitHub Actions that validates the schema against a sample of incoming Parquet files before the pipeline code is even deployed to the Databricks workspace.&lt;/p&gt;

&lt;p&gt;Second, we implemented a "Circuit Breaker" pattern. If the number of records in the quarantine table exceeds 5% of the total record count of the batch, the pipeline is configured to fail explicitly. This handles the "silent corruption" case where a vendor sends a file that is 90% garbage. &lt;/p&gt;

&lt;p&gt;Third, we moved away from generic &lt;code&gt;expect&lt;/code&gt; calls. Every table in our silver layer must have at least three mandatory expectations: &lt;code&gt;not_null&lt;/code&gt;, &lt;code&gt;unique_id&lt;/code&gt;, and &lt;code&gt;range_check&lt;/code&gt;. If a developer forgets these, the pipeline fails the unit test suite in our CI environment. No exceptions.&lt;/p&gt;

&lt;p&gt;In a regulated environment, "good enough" is a liability. You either gate the data, or you accept that your production environment is essentially a lottery. Use your quarantine tables, make your expectations explicit, and stop letting bad data hide in your silver layer. It’s not about being pedantic; it’s about knowing your data is actually usable before the regulators come asking for a reconciliation report.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #data #engineering #dlt #pipelines&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover photo by &lt;a href="https://unsplash.com/@brechtcorbeel?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Brecht Corbeel&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>data</category>
      <category>engineering</category>
      <category>dlt</category>
      <category>pipelines</category>
    </item>
    <item>
      <title>Why your Iceberg catalog choice is costing you more than your storage</title>
      <dc:creator>Aniket Abhishek Soni</dc:creator>
      <pubDate>Tue, 16 Jun 2026 17:10:58 +0000</pubDate>
      <link>https://dev.to/aniketsoni/why-your-iceberg-catalog-choice-is-costing-you-more-than-your-storage-3hng</link>
      <guid>https://dev.to/aniketsoni/why-your-iceberg-catalog-choice-is-costing-you-more-than-your-storage-3hng</guid>
      <description>&lt;p&gt;Eighty percent of the "data engineering" work I’ve done in the last two years has been debugging catalog state inconsistencies, not actually processing data. It’s a bitter pill, but if your catalog isn't atomic, your petabytes of Parquet are just expensive digital trash.&lt;/p&gt;

&lt;p&gt;The industry has collectively decided that the Iceberg REST catalog spec is the way out. It abstracts away the hive-metastore hell of &lt;code&gt;s3://&lt;/code&gt; path listings and S3-consistency-guarantee prayer circles. But now the industry is fragmenting again, forcing us to choose between running our own catalog, leveraging a managed provider, or locking ourselves into a vendor’s ecosystem.&lt;/p&gt;

&lt;p&gt;You aren't choosing a "data strategy" here. You are choosing who gets to hold the kill switch for your entire data platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contenders
&lt;/h2&gt;

&lt;p&gt;We’re looking at three distinct paths for implementing the Iceberg REST spec:&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;Apache Polaris&lt;/strong&gt;. It’s the new kid, open-sourced by Snowflake. It’s a pure-play REST catalog server designed to be deployed on Kubernetes. It’s vendor-neutral, which is refreshing, but it requires you to own the availability of the control plane.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;Databricks Unity Catalog (UC)&lt;/strong&gt;. It’s the "Enterprise Standard" if you’re already in the Databricks orbit. It’s proprietary, opaque, and incredibly powerful, but it effectively mandates that your compute lives within their walled garden.&lt;/p&gt;

&lt;p&gt;Third, &lt;strong&gt;AWS Glue&lt;/strong&gt;. The incumbent. It’s not a native REST catalog in the traditional sense, but AWS added a REST-compatible API layer for Iceberg. It’s serverless, it’s cheap, and it’s arguably the most "set it and forget it" option, provided you don't mind the inevitable IAM role hell.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1636642824336-985e6685cd47%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyN3x8c2hhdHRlcmVkJTIwZ2xhc3N8ZW58MHwwfHx8MTc4MTU1ODAzMXww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1636642824336-985e6685cd47%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyN3x8c2hhdHRlcmVkJTIwZ2xhc3N8ZW58MHwwfHx8MTc4MTU1ODAzMXww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by 🇻🇪 Jose G. Ortega Castro 🇲🇽 on Unsplash" width="1080" height="720"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@j0rt?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;🇻🇪 Jose G. Ortega Castro 🇲🇽&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The operational tax on your SRE team
&lt;/h2&gt;

&lt;p&gt;If you choose to self-host Polaris on EKS, you are signing up for a pager. Polaris needs a backing database—usually Postgres—to store the catalog state. If your RDS instance goes down or your K8s node group hits an OOM loop, your entire data lake goes dark. &lt;/p&gt;

&lt;p&gt;I’ve seen teams try to save $500 a month by self-hosting metadata stores, only to spend three engineering days troubleshooting a split-brain scenario where the Iceberg metadata pointer drifted from the actual S3 manifest files. That’s a $10,000 incident for a $500 saving.&lt;/p&gt;

&lt;p&gt;Unity Catalog, on the other hand, is managed. You don't manage the database; you manage the metastore. The failure mode here isn't infrastructure; it’s &lt;em&gt;policy&lt;/em&gt;. I’ve spent more time debugging &lt;code&gt;PERMISSION_DENIED&lt;/code&gt; errors in Unity than I ever did debugging database uptime in self-hosted solutions. If your Identity Provider (Okta/Entra ID) sync fails, your pipeline stops. &lt;/p&gt;

&lt;p&gt;Glue is the middle ground. It’s managed, but it’s notoriously slow. If you have a massive table with 100,000+ partitions, Glue’s API latency becomes your primary bottleneck. You will see 5xx errors during high-concurrency partition pruning if you aren't careful with your &lt;code&gt;glue:GetPartition&lt;/code&gt; call volume.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost structures and hidden tax
&lt;/h2&gt;

&lt;p&gt;The cost of a catalog is never just the price of the service. It’s the integration cost.&lt;/p&gt;

&lt;p&gt;AWS Glue is cheap—basically pennies per million requests. But it forces you to keep your compute inside AWS. If you decide to spin up a Spark job on a different provider, you’re looking at cross-cloud egress fees and the nightmare of cross-account IAM cross-pollination. &lt;/p&gt;

&lt;p&gt;Unity Catalog is "free" if you use Databricks compute, but it’s a tax on your architectural freedom. Once you move your governance logic into Unity’s proprietary format, extracting it is non-trivial. You are paying for Unity with the loss of your ability to easily switch compute engines.&lt;/p&gt;

&lt;p&gt;Polaris is the only one that is truly agnostic. It’s the "Linux approach." It costs exactly what it costs to run the K8s cluster and the RDS instance. But watch out for the storage of the metadata itself. If your Polaris instance is behind a load balancer that isn't configured for long-lived keep-alives, you will see intermittent &lt;code&gt;ConnectionResetException&lt;/code&gt; errors during large metadata updates, which will leave your Iceberg tables in a corrupted &lt;code&gt;staged&lt;/code&gt; state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1600607384222-f7096c9111fe%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxMHx8YmFsYW5jZSUyMHNjYWxlfGVufDB8MHx8fDE3ODE1NTgwMzF8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1600607384222-f7096c9111fe%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxMHx8YmFsYW5jZSUyMHNjYWxlfGVufDB8MHx8fDE3ODE1NTgwMzF8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Elena Mozhvilo on Unsplash" width="1080" height="718"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@miracleday?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Elena Mozhvilo&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Failure modes: When the catalog lies to you
&lt;/h2&gt;

&lt;p&gt;The most dangerous thing a catalog can do is return a successful commit for a transaction that failed to write.&lt;/p&gt;

&lt;p&gt;In the early days of Glue, we dealt with "ghost partitions." Glue would report a partition as existing, but the underlying S3 objects had been deleted due to a failed Spark job that hadn't properly rolled back. We ended up writing a custom "Catalog-to-S3 Auditor" script—essentially a nightly job that runs &lt;code&gt;s3 ls&lt;/code&gt; against every partition entry in Glue and flags mismatches. If you choose Glue, you &lt;em&gt;will&lt;/em&gt; need this script.&lt;/p&gt;

&lt;p&gt;Unity Catalog is significantly more robust here because it handles the transaction commit in a single, atomic operation within the Databricks control plane. It is effectively impossible to have a drift between the Unity catalog and the underlying storage unless you have rogue users bypassing the catalog to write directly to S3.&lt;/p&gt;

&lt;p&gt;Polaris is still maturing. I’ve encountered bugs where the metadata update fails to propagate to the read-replicas quickly enough, leading to "time-travel" anomalies where a reader sees the state from five seconds ago despite the writer confirming the commit. It’s getting better, but I wouldn't trust it with mission-critical financial reconciliation just yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd pick, and why
&lt;/h2&gt;

&lt;p&gt;If you are a startup with a small team and you’re already all-in on AWS, use Glue. It’s boring, it’s cheap, and the IAM integration, while painful to set up, is bulletproof once you get the policies right. The latency issues are real, but you can mitigate them by being disciplined about your partition schema. Don't over-partition your data.&lt;/p&gt;

&lt;p&gt;If you are a large enterprise with a massive footprint and you already use Databricks, stop fighting the gravity. Use Unity Catalog. The governance features—specifically the lineage tracking and the fine-grained access control—are worth the lock-in. You’ll save more in headcount by not having to build your own governance layer than you’ll lose in vendor flexibility.&lt;/p&gt;

&lt;p&gt;I would stay away from self-hosting Polaris unless you have a dedicated platform team that actually enjoys managing Postgres and Kubernetes. It’s an elegant piece of technology, and it’s the future of open-source data lakehouse architecture, but it is currently a "developer-experience" product, not an "infrastructure-as-a-service" product.&lt;/p&gt;

&lt;p&gt;My personal preference? I’m waiting for a high-quality, managed SaaS version of Polaris. Once someone turns that into a reliable, "set-it-and-forget-it" managed service, the other two are going to be in trouble. But until then, pick your poison: Glue for the budget-conscious, Unity for the enterprise-locked, and Polaris for the infrastructure masochists. &lt;/p&gt;

&lt;p&gt;Whatever you do, don't write your own catalog. I’ve seen that movie, and it ends with a 3:00 AM emergency partition recovery task you’ll regret for the rest of your career.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover photo by &lt;a href="https://unsplash.com/@tylergm?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Tyler&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>iceberg</category>
      <category>data</category>
      <category>architecture</category>
      <category>storage</category>
    </item>
    <item>
      <title>Text-to-SQL is a solved problem: why you’re about to leak your PII</title>
      <dc:creator>Aniket Abhishek Soni</dc:creator>
      <pubDate>Mon, 15 Jun 2026 02:49:38 +0000</pubDate>
      <link>https://dev.to/aniketsoni/text-to-sql-is-a-solved-problem-why-youre-about-to-leak-your-pii-aj1</link>
      <guid>https://dev.to/aniketsoni/text-to-sql-is-a-solved-problem-why-youre-about-to-leak-your-pii-aj1</guid>
      <description>&lt;p&gt;The most dangerous myth in modern data engineering is that "Text-to-SQL is a solved problem." Every time I see a demo where someone asks an LLM to "sum the revenue by region" and it returns a clean JSON blob, I see a production outage waiting to happen. You aren't building a chat interface; you are building a high-velocity, non-deterministic SQL generator that has direct access to your internal PII and financial ledger.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why I chose this topic:&lt;/strong&gt; I spent three months cleaning up after a "smart agent" that hallucinated a &lt;code&gt;DROP TABLE&lt;/code&gt; command because a user asked it to "clear out the old test data." In high-stakes environments like healthcare, "oops, the LLM got confused" isn't a valid root cause analysis for a HIPAA breach.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You are currently facing a binary choice: either you wrap your data in a brittle, manual semantic layer, or you embrace a "governed lakehouse" architecture where the LLM is treated as an untrusted, low-privilege user. Most people choose the latter because they are lazy, and then they end up crying when the model joins the &lt;code&gt;users&lt;/code&gt; table to the &lt;code&gt;audit_logs&lt;/code&gt; table via a natural language prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The contenders
&lt;/h2&gt;

&lt;p&gt;You have three ways to approach this. First, the &lt;strong&gt;"LLM-as-a-DBA"&lt;/strong&gt; approach, where you feed the entire schema (Ddl) into the context window and pray for the best. Second, the &lt;strong&gt;"Semantic Proxy"&lt;/strong&gt; layer, where you use tools like LlamaIndex or LangChain to map natural language to pre-defined view objects rather than raw tables. Third, the &lt;strong&gt;"Hard-Gated SQL Sandbox"&lt;/strong&gt;, where the LLM generates SQL, but a deterministic Python validator checks the AST (Abstract Syntax Tree) before it ever touches a connection string.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1513082325166-c105b20374bb%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxNnx8YnJva2VuJTIwZ2xhc3N8ZW58MHwwfHx8MTc4MTQ2NDA2Nnww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1513082325166-c105b20374bb%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxNnx8YnJva2VuJTIwZ2xhc3N8ZW58MHwwfHx8MTc4MTQ2NDA2Nnww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by CHUTTERSNAP on Unsplash" width="1080" height="720"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@chuttersnap?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;CHUTTERSNAP&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The operational burden of hallucinations
&lt;/h2&gt;

&lt;p&gt;The "LLM-as-a-DBA" approach is the cheapest way to start but the most expensive to run. If your schema has 500+ tables—standard for any mature lakehouse—you are going to blow your context window budget immediately. Even with GPT-4o, the "schema stuffing" method fails when you have columns with similar names like &lt;code&gt;user_id&lt;/code&gt; in five different tables. &lt;/p&gt;

&lt;p&gt;I’ve seen models join &lt;code&gt;billing.payments&lt;/code&gt; with &lt;code&gt;marketing.leads&lt;/code&gt; on &lt;code&gt;id&lt;/code&gt; just because they happened to have the same name. In a production environment, this is a silent data corruption error. If you aren't using a tool that enforces table lineage at the semantic level (like dbt’s graph metadata), you are just running an expensive game of SQL Roulette.&lt;/p&gt;
&lt;h2&gt;
  
  
  Infrastructure and cost at scale
&lt;/h2&gt;

&lt;p&gt;Running a SQL agent against a Delta Lake or Iceberg table isn't just about the token cost. It’s about the compute cost. If your agent is allowed to write &lt;code&gt;SELECT *&lt;/code&gt; without limits, you will incinerate your Snowflake or Databricks budget in an afternoon. &lt;/p&gt;

&lt;p&gt;I implement a mandatory &lt;code&gt;LIMIT 1000&lt;/code&gt; hard-coded into the system prompt, but that’s not enough. You need to enforce a resource governor at the database level. For Databricks, I use a specific &lt;code&gt;SQL Warehouse&lt;/code&gt; policy that caps the DBU usage per session. If the LLM generates a query that hits a massive fact table without a partition key, the warehouse should kill the query within 30 seconds. If your agent doesn't have a "kill switch" policy, you aren't ready for production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1743796055664-3473eedab36e%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwzfHxtYWduaWZ5aW5nJTIwZ2xhc3MlMjBvbiUyMGNvZGV8ZW58MHwwfHx8MTc4MTQ2NDA2N3ww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1743796055664-3473eedab36e%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwzfHxtYWduaWZ5aW5nJTIwZ2xhc3MlMjBvbiUyMGNvZGV8ZW58MHwwfHx8MTc4MTQ2NDA2N3ww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by MJ Duford on Unsplash" width="1080" height="810"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@duforddigital?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;MJ Duford&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Failure modes and the AST validator
&lt;/h2&gt;

&lt;p&gt;The most common failure mode isn't a bad query; it’s an unauthorized one. LLMs love to "explore." If a user asks, "Who are our highest-paying customers?", the model might decide it needs to look at the &lt;code&gt;sensitive_user_pii&lt;/code&gt; table to provide a better answer. &lt;/p&gt;

&lt;p&gt;You cannot trust the model to self-regulate. You need an AST validator. I use the &lt;code&gt;sqlglot&lt;/code&gt; library in Python to parse every generated query. If the AST contains a &lt;code&gt;JOIN&lt;/code&gt; to a table not in the allowed list, or if it tries to access a &lt;code&gt;WHERE&lt;/code&gt; clause containing &lt;code&gt;email&lt;/code&gt; or &lt;code&gt;ssn&lt;/code&gt;, the validator throws a hard exception. &lt;/p&gt;

&lt;p&gt;Here is what that looks like in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlglot&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlglot&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql_str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlglot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PII_TABLE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CREDIT_CARDS&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;PermissionError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Access denied to sensitive table.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you aren't doing this, you are effectively letting users run arbitrary code on your backend.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd pick, and why
&lt;/h2&gt;

&lt;p&gt;I pick the "Hard-Gated SQL Sandbox" every time. It’s annoying to set up, it requires keeping a registry of "allowed tables," and it slows down the initial development cycle. But it’s the only way to sleep at night.&lt;/p&gt;

&lt;p&gt;Here is the stack I recommend:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Layer:&lt;/strong&gt; Use &lt;code&gt;dbt&lt;/code&gt; to define your models. The LLM shouldn't know about your raw Bronze/Silver/Gold tables; it should only see the Gold-level models you’ve explicitly exposed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validator:&lt;/strong&gt; Use &lt;code&gt;sqlglot&lt;/code&gt; to audit the AST. Block any query that doesn't explicitly mention a partition column if the table is over a certain size (like &lt;code&gt;event_date&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution:&lt;/strong&gt; Use a low-privilege service account for the LLM. If your agent is running as a &lt;code&gt;SUPERUSER&lt;/code&gt;, you’ve already lost. Use &lt;code&gt;GRANT SELECT&lt;/code&gt; only on the specific views the agent is allowed to touch.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The caveat? It makes the "smart" agent feel a bit "dumb." Users will occasionally get a "I cannot answer that" response because the agent was blocked by the validator. This is a feature, not a bug. Your users will be annoyed that they can't ask the agent for the CEO’s home address, but your CISO will be thrilled. In financial services, the "I don't know" response is the mark of a system that is actually under control. &lt;/p&gt;

&lt;p&gt;Stop trying to build an omniscient SQL wizard. Build a constrained, grumpy librarian that only gives out the books it’s allowed to touch. That’s how you ship in production.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover photo by &lt;a href="https://unsplash.com/@jouwdan?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Jordan Harrison&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sql</category>
      <category>llm</category>
      <category>security</category>
      <category>data</category>
    </item>
    <item>
      <title>Time Travel isn't a Debugging Luxury: Why Delta and Iceberg are Compliance Essentials</title>
      <dc:creator>Aniket Abhishek Soni</dc:creator>
      <pubDate>Sun, 14 Jun 2026 04:52:42 +0000</pubDate>
      <link>https://dev.to/aniketsoni/time-travel-isnt-a-debugging-luxury-why-delta-and-iceberg-are-compliance-essentials-2fi0</link>
      <guid>https://dev.to/aniketsoni/time-travel-isnt-a-debugging-luxury-why-delta-and-iceberg-are-compliance-essentials-2fi0</guid>
      <description>&lt;p&gt;The most persistent myth in data engineering is that "Time Travel" is a fancy feature for lazy developers who don't want to write better unit tests.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why I chose this topic:&lt;/strong&gt; In my last two roles—one handling credit risk models and the other HIPAA-regulated patient records—I’ve watched senior engineers lose entire weekends because they couldn't reconstruct the state of a table during a specific model drift event. Data reproducibility isn't just about debugging; it’s about proving to a regulator exactly what your system saw at 2:14 PM on a Tuesday.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you are choosing between Apache Iceberg and Delta Lake today, you aren't choosing a storage format. You are choosing your strategy for "Oh, wait, the production pipeline just nuked 40% of our user metadata."&lt;/p&gt;

&lt;h2&gt;
  
  
  The contenders
&lt;/h2&gt;

&lt;p&gt;Delta Lake is the incumbent of the Databricks ecosystem, relying on a transaction log (&lt;code&gt;_delta_log&lt;/code&gt;) to track changes. It is predictable, mature, and works beautifully if you live in the Spark/Databricks bubble.&lt;/p&gt;

&lt;p&gt;Apache Iceberg is the open-source industry darling, built to solve the "partition evolution" problem that plagues Hive-style metadata. It uses a snapshot-based manifest system. It is engine-agnostic by design, meaning you can query it via Trino, Flink, Spark, or StarRocks without feeling like you're hacking the system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1537884944318-390069bb8665%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxMHx8Y29ycnVwdGVkJTIwY29kZSUyMGRpZ2l0YWx8ZW58MHwwfHx8MTc4MTQxMTg2MXww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1537884944318-390069bb8665%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwxMHx8Y29ycnVwdGVkJTIwY29kZSUyMGRpZ2l0YWx8ZW58MHwwfHx8MTc4MTQxMTg2MXww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Pankaj Patel on Unsplash" width="1080" height="720"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@pankajpatel?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Pankaj Patel&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The operational tax
&lt;/h2&gt;

&lt;p&gt;Delta Lake's &lt;code&gt;VACUUM&lt;/code&gt; command is your best friend and your worst enemy. By default, it deletes files older than 7 days. If you are in a heavily regulated environment, you must override this immediately. Setting &lt;code&gt;spark.databricks.delta.retentionDurationCheck.enabled&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt; is the first thing I do in any new project, but then you’re on the hook for storage costs. You are essentially paying for a S3/GCS graveyard.&lt;/p&gt;

&lt;p&gt;Iceberg handles this via &lt;code&gt;expire_snapshots&lt;/code&gt;. The syntax &lt;code&gt;CALL system.expire_snapshots('db.table', TIMESTAMP '2023-10-01 00:00:00')&lt;/code&gt; is arguably cleaner, but the failure mode is more silent. If you expire snapshots too aggressively, your ability to perform point-in-time recovery for a specific audit request vanishes. I once saw a team accidentally purge their entire history of a patient-tracking table because they thought they were cleaning up staging files. Always, always set a &lt;code&gt;history.expire.min-snapshots-to-keep&lt;/code&gt; property higher than the default of 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance under pressure
&lt;/h2&gt;

&lt;p&gt;Delta Lake’s transaction log is a sequence of JSON files. As the table grows, reading the history requires replaying these logs. If you have 50,000 commits (common in streaming pipelines), the "time travel" performance can hit a wall. You end up waiting for Spark to parse thousands of JSON files just to figure out which Parquet file was active at 10:00 AM.&lt;/p&gt;

&lt;p&gt;Iceberg’s manifest files are the structural antidote. Because Iceberg separates the metadata into manifests and manifest lists, it doesn't need to scan the "history of the world" to find a specific state. It jumps to the specific snapshot ID. If you are doing frequent, micro-batch writes, Iceberg’s metadata overhead is significantly more manageable. I’ve benchmarked Iceberg against Delta on a 50TB table; Iceberg’s &lt;code&gt;AS OF&lt;/code&gt; queries were consistently 3x faster because it didn't have to deserialize a mountain of JSON logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1742976483726-3bdafe71add4%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyMHx8bGlicmFyeSUyMGFyY2hpdmUlMjBzaGVsdmVzfGVufDB8MHx8fDE3ODE0MTE4NjF8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1742976483726-3bdafe71add4%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHwyMHx8bGlicmFyeSUyMGFyY2hpdmUlMjBzaGVsdmVzfGVufDB8MHx8fDE3ODE0MTE4NjF8MA%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Allen Y on Unsplash" width="1080" height="810"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@yanahd?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Allen Y&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The failure modes
&lt;/h2&gt;

&lt;p&gt;Delta Lake is tightly coupled to Spark. If your Spark job fails during a write, the &lt;code&gt;_delta_log&lt;/code&gt; keeps the transaction atomic. You either have the data, or you don't. However, I’ve seen "orphaned" files in Delta where &lt;code&gt;VACUUM&lt;/code&gt; fails to clean up properly because of cross-region replication lag in S3. You end up paying for bytes you can't query and can't delete.&lt;/p&gt;

&lt;p&gt;Iceberg is more robust, but it requires you to be honest about your table maintenance. If you don't run &lt;code&gt;rewrite_data_files&lt;/code&gt; and &lt;code&gt;rewrite_manifests&lt;/code&gt; regularly, your "time travel" queries will slow down to a crawl. I once inherited a system where a table had 12,000 tiny files because the team hadn't set a proper compaction policy. Trying to time travel to a snapshot on that table took 15 minutes of compute time. Iceberg doesn't save you from your own lack of maintenance; it just gives you more rope to hang yourself with.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd pick, and why
&lt;/h2&gt;

&lt;p&gt;If you are locked into the Databricks ecosystem and have no intention of leaving, go with Delta Lake. It is the path of least resistance, and the integration with Unity Catalog makes the "audit" part of the equation trivial. You get lineage for free, and your managers will be happy.&lt;/p&gt;

&lt;p&gt;However, if you are building a multi-engine architecture—say, using Flink for streaming ingestion, Trino for ad-hoc SQL, and Spark for heavy lifting—Iceberg is the clear winner. &lt;/p&gt;

&lt;p&gt;My honest advice? Pick Iceberg if you want to avoid vendor lock-in, but be prepared to treat your metadata management as a first-class citizen. You will need a dedicated orchestration layer (like Airflow or Dagster) to run your &lt;code&gt;rewrite&lt;/code&gt; and &lt;code&gt;expire&lt;/code&gt; actions. If you aren't prepared to monitor your snapshot count and file compaction, you’ll end up with a high S3 bill and a slow, bloated data lake that nobody can actually use for audits. &lt;/p&gt;

&lt;p&gt;Time travel is only useful if the vehicle isn't broken. Maintain your manifests, or don't bother.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #data #engineering #iceberg #delta&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover photo by &lt;a href="https://unsplash.com/@tylergm?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Tyler&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>data</category>
      <category>engineering</category>
      <category>iceberg</category>
      <category>delta</category>
    </item>
    <item>
      <title>Is BigLake the End of Your Vendor Lock-in Delusion?</title>
      <dc:creator>Aniket Abhishek Soni</dc:creator>
      <pubDate>Sun, 14 Jun 2026 04:52:37 +0000</pubDate>
      <link>https://dev.to/aniketsoni/is-biglake-the-end-of-your-vendor-lock-in-delusion-2370</link>
      <guid>https://dev.to/aniketsoni/is-biglake-the-end-of-your-vendor-lock-in-delusion-2370</guid>
      <description>&lt;p&gt;Most data leaders sell "Open Table Formats" like they’re a magic bullet for vendor independence. They aren't. They’re a way to ensure your data stays usable when you inevitably decide that BigQuery’s &lt;code&gt;STORAGE_BILLING_MODEL&lt;/code&gt; is eating your entire cloud budget. I’ve spent the last 18 months moving petabyte-scale healthcare datasets onto BigLake and Iceberg. It works, but Google didn't build it to make leaving easy; they built it to keep the friction of moving data in GCP lower than the friction of moving it out.&lt;/p&gt;

&lt;p&gt;If you’re expecting a plug-and-play experience, stop. You are about to deal with metadata locking, Hive-metastore nightmares, and the specific joy of debugging &lt;code&gt;403 Forbidden&lt;/code&gt; errors that actually mean "your service account doesn't have the right Storage Object Admin role on the underlying GCS bucket." This guide is for the engineer who wants the architecture to be actually maintainable, not just "production-ready" on a slide deck.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Stop treating BigLake like a standard BigQuery table
&lt;/h2&gt;

&lt;p&gt;BigQuery tables are black boxes. BigLake tables are external pointers. The biggest mistake I see? Treating them as if they have the same ingestion SLAs. When you run a &lt;code&gt;MERGE&lt;/code&gt; statement on a native BQ table, Google manages the optimization. When you do it on Iceberg, you are responsible for the maintenance of the underlying Parquet files.&lt;/p&gt;

&lt;p&gt;If you don't run &lt;code&gt;CALL sys.rewrite_data_files&lt;/code&gt; and &lt;code&gt;CALL sys.rewrite_manifests&lt;/code&gt; regularly, your query performance will degrade into a slow-motion car crash. You aren't just a data engineer here; you’re an amateur database administrator managing compaction intervals. Don't automate this via a simple cron job; hook it into your Airflow DAGs as a post-load quality gate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Don't let your manifest files grow to the moon&lt;/span&gt;
&lt;span class="k"&gt;CALL&lt;/span&gt; &lt;span class="n"&gt;my_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rewrite_data_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'my_project.my_dataset.my_iceberg_table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'date_partition = "2023-10-01"'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1617889962656-19b629fb1df1%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHw0fHxpY2UlMjBjdWJlJTIwbW91bnRhaW58ZW58MHwwfHx8MTc4MTQxMTI2N3ww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1617889962656-19b629fb1df1%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3w5NzU0MjJ8MHwxfHNlYXJjaHw0fHxpY2UlMjBjdWJlJTIwbW91bnRhaW58ZW58MHwwfHx8MTc4MTQxMTI2N3ww%26ixlib%3Drb-4.1.0%26q%3D80%26w%3D1080" alt="Photo by Damien Schnorhk on Unsplash" width="1080" height="720"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by &lt;a href="https://unsplash.com/@damienschnorhk?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Damien Schnorhk&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Partitioning is your only defense against egress costs
&lt;/h2&gt;

&lt;p&gt;In the traditional BigQuery world, we got lazy with clustering. In Iceberg, if you aren't partition-pruning correctly, you are scanning your entire GCS bucket. And guess what? Google charges for data processed. If you scan 10TB to find one patient record, that’s on you. &lt;/p&gt;

&lt;p&gt;Always, and I mean always, use hidden partitioning. Don't create a &lt;code&gt;date_string&lt;/code&gt; column in your source data just to satisfy the partition requirement. Use Iceberg’s ability to derive partitions from timestamps. If you’re partitioning by day, use &lt;code&gt;days(timestamp_col)&lt;/code&gt;. If you don't, you'll be rewriting your entire metadata layer when you realize you need to change your partition strategy.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. The IAM dance is a multi-act tragedy
&lt;/h2&gt;

&lt;p&gt;BigLake requires the BigQuery Connection resource to act as an intermediary between the engine and the storage. You’ll need a Google-managed service account for that connection. The failure mode here is subtle: users will have &lt;code&gt;BigQuery Data Viewer&lt;/code&gt; permissions, but they won't have &lt;code&gt;Storage Object Viewer&lt;/code&gt; on the GCS bucket.&lt;/p&gt;

&lt;p&gt;The error message in the console will tell you "Access Denied," and it will lie to you about which permission is missing. Always check the service account tied to the connection resource first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# The CLI command that saves you 4 hours of debugging&lt;/span&gt;
gcloud projects add-iam-policy-binding &lt;span class="o"&gt;[&lt;/span&gt;PROJECT_ID] &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"serviceAccount:[CONNECTION_SERVICE_ACCOUNT]"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/storage.objectViewer"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Manifest snapshots are not backups
&lt;/h2&gt;

&lt;p&gt;A common misconception is that Iceberg's metadata history serves as a disaster recovery strategy. It doesn't. If someone accidentally deletes the underlying Parquet files from GCS, your Iceberg metadata is just a list of pointers to ghosts. &lt;/p&gt;

&lt;p&gt;You need to enable GCS Object Versioning on your buckets. Period. If you don't, and a pipeline goes rogue and deletes a partition, you have no recourse. Iceberg's &lt;code&gt;expire_snapshots&lt;/code&gt; procedure is useful for storage cleanup, but keep at least 7 days of snapshots. If you set this too aggressively, you lose the ability to perform time-travel queries, which is the only reason you’re using Iceberg in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Schema evolution is a trap
&lt;/h2&gt;

&lt;p&gt;Iceberg supports schema evolution, and BigLake respects it. That’s the theory. The reality is that if you rename a column in your Parquet files via a tool that doesn't respect the Iceberg manifest, you break the contract.&lt;/p&gt;

&lt;p&gt;Never manually modify the Parquet files outside of the Iceberg engine. If you need to fix a data type or rename a column, use &lt;code&gt;ALTER TABLE&lt;/code&gt;. If you go behind the engine's back, you'll encounter the "orphaned data file" problem where the metadata points to a column that no longer exists in the Parquet schema, leading to &lt;code&gt;NullPointerException&lt;/code&gt; or generic engine failures that don't point to the root cause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Keep the engine as the single source of truth for metadata&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_iceberg_table&lt;/span&gt; 
&lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;old_name&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;new_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. The "Hidden" cost of metadata storage
&lt;/h2&gt;

&lt;p&gt;Every time you commit a transaction in Iceberg, you create a new manifest file. If you have a high-frequency ingestion pipeline (e.g., streaming small batches every minute), your metadata layer will explode.&lt;/p&gt;

&lt;p&gt;I’ve seen metadata directories hit 50,000 files in three weeks. This slows down query planning significantly. You aren't just paying for data storage; you're paying for GCS read operations on every single metadata file during the &lt;code&gt;EXPLAIN&lt;/code&gt; phase of your query. Micro-batching is the enemy of Iceberg performance. Buffer your data in memory or use a staging area before committing to the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Avoid the "BigQuery-only" mindset
&lt;/h2&gt;

&lt;p&gt;If you chose Iceberg, it’s likely because you want to use Trino, Spark, or DuckDB on the same data. If you write your data using only BigQuery’s &lt;code&gt;INSERT&lt;/code&gt; or &lt;code&gt;MERGE&lt;/code&gt; statements, you might be creating files that are optimized for BigQuery but are absolute garbage for Trino.&lt;/p&gt;

&lt;p&gt;Check your Parquet writer settings. Ensure you are using Snappy compression and reasonable row group sizes. If you write 1GB row groups, Spark/Trino will choke on memory when trying to read them. Stick to the 128MB to 256MB range. It’s boring, it’s standard, and it keeps your compute engines from crashing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;BigLake and Iceberg are powerful tools, but they shift the burden of performance tuning from Google’s black box to your own infrastructure. You get "openness," but you pay for it in complexity. You have to be the one to manage the metadata, the IAM policies, the file compaction, and the storage lifecycles.&lt;/p&gt;

&lt;p&gt;It’s worth it if you’re tired of being held hostage by proprietary formats. It’s a disaster if you treat it like "just another table." Before you migrate that production workload, ask yourself: are you actually prepared to own the metadata layer, or are you just looking for a new way to break your pipelines?&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #gcp #bigquery #iceberg #dataengineering&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover photo by &lt;a href="https://unsplash.com/@rocinante_11?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Mick Haupt&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=articles_pipeline&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gcp</category>
      <category>bigquery</category>
      <category>iceberg</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
