DEV Community: Aniket Abhishek Soni

Stop Choosing Between Delta and Iceberg: UniForm is the Pragmatic Exit

Aniket Abhishek Soni — Thu, 23 Jul 2026 23:36:28 +0000

It was 3:00 AM on a Tuesday when the PagerDuty alert for our primary billing pipeline hit. We had a job failing on a Spark 3.5 cluster because our upstream vendor decided to switch their export format to Iceberg, while our entire analytical stack was locked into Delta Lake 3.0. The migration cost us six hours of downtime and roughly $45,000 in SLA penalties. The issue wasn’t the data quality; it was a religious war between two metadata layers that refused to speak the same language.

We treat table formats like sports teams, but they’re just protocols. You don’t need to pledge allegiance to Databricks or the Apache Software Foundation. You need your data to be readable by the specific query engine that actually does the job, whether that’s Trino, DuckDB, or Spark.

The illusion of engine-agnosticism

Most engineers assume that because they use an open format, they are portable. That’s a lie. If you write a Delta table, you are tethered to the Delta Standalone reader or a Spark implementation that supports the Delta protocol. If you use Iceberg, you’re at the mercy of the Iceberg Catalog and its specific manifest file structures.

The thing we rarely look at is the metadata root. In Delta, it’s a directory of JSON logs (_delta_log/). In Iceberg, it’s a snapshot-based tree structure starting from a metadata file that points to manifest lists. They are fundamentally different ways of tracking state, yet they both aim to solve the same problem: atomic ACID transactions on top of a pile of Parquet files.
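To make the "directory of JSON logs" concrete, here is a minimal sketch of how a reader could replay a Delta log to find the live data files. It is not the Delta reader: it ignores checkpoints and every action other than add and remove, and the table and file names are made up for the demo.

```python
import json
import tempfile
from pathlib import Path


def live_files(table_path: str) -> set[str]:
    """Replay _delta_log/*.json in commit order and return the live data files."""
    log_dir = Path(table_path) / "_delta_log"
    files: set[str] = set()
    # Commit files are zero-padded version numbers, so a name sort is a version sort.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Build a tiny fake table with two commits (hypothetical file names).
demo_table = tempfile.mkdtemp()
log = Path(demo_table) / "_delta_log"
log.mkdir()
(log / "00000000000000000000.json").write_text(
    json.dumps({"add": {"path": "part-0001.parquet"}}) + "\n"
    + json.dumps({"add": {"path": "part-0002.parquet"}}) + "\n"
)
(log / "00000000000000000001.json").write_text(
    json.dumps({"remove": {"path": "part-0001.parquet"}}) + "\n"
    + json.dumps({"add": {"path": "part-0003.parquet"}}) + "\n"
)

print(sorted(live_files(demo_table)))
```

Running it prints ['part-0002.parquet', 'part-0003.parquet']. An Iceberg reader arrives at the same kind of file list by walking from the current metadata file to a manifest list and then to manifests, which is why the two formats can share Parquet files while disagreeing on everything above them.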


How it actually works

UniForm (Universal Format) is the bridge. Specifically, Delta Lake UniForm allows you to write in Delta format while the engine automatically generates the Iceberg metadata in the background.

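When you set delta.universalFormat.enabledFormats to 'iceberg' in your table properties, you aren't just tagging metadata; you are asking Delta to generate Iceberg metadata asynchronously after each commit, alongside the Delta log. UniForm also requires column mapping and IcebergCompatV2 to be enabled on the table.

Here is what that looks like in a Spark session:

-- Property names follow the Delta Lake UniForm documentation. Depending on your
-- Delta/Databricks version, an existing table may need an upgrade step first.
ALTER TABLE my_production_table SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);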

Behind the scenes, Delta is essentially running a translation layer. Every time a commit happens in the _delta_log/, a background Spark job (or the writer itself, depending on configuration) maps those file additions and removals into the Iceberg snapshot format.

For the end user, this is magic. You register the table in a catalog that your Trino or Starburst cluster can reach, on top of the same S3 prefix you use for Spark, and Trino sees a perfectly valid Iceberg table. You aren't duplicating the data files; the Parquet files remain identical. You are only duplicating the metadata pointers.

The tradeoffs nobody mentions

If this sounds too good to be true, that's because it comes with hidden operational costs.

First, the background translation job is not free. If you are doing high-frequency streaming writes (e.g., every 30 seconds), the overhead of keeping the Iceberg metadata in sync can cause write latency spikes. I’ve seen commit latencies jump from 200ms to over 2 seconds because the cluster had to lock and update both the Delta log and the Iceberg snapshot history.

Second, version skew is a real failure mode. If you are on Delta 3.2 but the Iceberg translation logic in your library version lags behind, you might end up in a state where the table is readable, but "time travel" queries fail. I once debugged a case where SELECT * FROM table TIMESTAMP AS OF '2026-05-01' worked in Spark but returned a TableNotSupported error in Trino because the Iceberg manifest was missing the specific partition evolution metadata that Delta had handled natively.

Finally, you are doubling your storage metadata footprint. In a multi-petabyte environment, the _delta_log and the metadata/ directory for Iceberg will grow to millions of files. If your object store has high latency on LIST operations, your catalog discovery will eventually become the bottleneck, not the data retrieval itself.


When to reach for it (and when not to)

Use UniForm if your organization has a split-brain architecture. If you have a legacy Spark-based heavy processing pipeline but your analysts insist on using Trino or Snowflake for ad-hoc exploration, this is the only way to avoid the maintenance nightmare of double-writing data.

Do not use UniForm if your primary goal is "future proofing" without a clear current need. If your entire stack is already Spark-native, UniForm is just adding complexity and potential points of failure. Stick to pure Delta. If your stack is fully integrated with Iceberg-native tools like Tabular or Nessie, don’t introduce Delta just to "bridge" things.

The decision comes down to your query engine requirements. If your data must survive a lift-and-shift from a proprietary Databricks environment to a self-managed Trino cluster, UniForm is a lifesaver. If you are a startup with a single query engine, you’re just paying for extra compute cycles to translate metadata that nobody is reading.

Conclusion

We are moving into an era where format-lock is becoming a legacy burden. By using Delta as your primary writer and UniForm to project Iceberg metadata, you get the robust ecosystem support of Delta and the engine interoperability of Iceberg.

Stop worrying about which company’s "standard" wins the market. Focus on the metadata translation layer that keeps your data accessible. The goal isn't to pick a side; the goal is to ensure that when your primary query engine goes down, you can pivot to another one without having to rewrite your entire data lake. In 2026, the only real technical debt is a siloed format.

Tags: #data #engineering #delta #iceberg


Why Your Data Warehouse Isn't Enough for Financial Audits

Aniket Abhishek Soni — Tue, 21 Jul 2026 21:00:56 +0000

If you think your data warehouse’s "point-in-time" restore is sufficient for a financial audit, you are one bad join away from a regulatory nightmare.

We have all seen it: the auditor asks for the state of a loan portfolio as it existed on the 14th of last month at 2:00 PM. You point them to your Snowflake AT clause or your Databricks TIMESTAMP AS OF clause, and you think you've won. Then they ask for the schema definition, the underlying transformation code version, and the proof that the data wasn't mutated by a rogue UPDATE statement before the snapshot was taken. Suddenly, your "Time Travel" feature isn't a silver bullet; it's just a way to look at a point in history without knowing if the history itself was forged.

When you’re dealing with FINRA, HIPAA, or Basel III requirements, "it’s in the database" is not an audit trail. It’s a liability.

The contenders

You are choosing between two fundamentally different architectures. The first is Database-Native Versioning, which relies on the built-in temporal capabilities of modern cloud data warehouses like Snowflake (Time Travel) or Databricks (Delta Lake Time Travel). It’s the "easy" path.

The second is Event-Sourced Immutable Logs, where you treat your data lake like a ledger. You never update a row; you only append changes. This is the "hard" path, usually involving Apache Iceberg or Hudi sitting on top of S3, combined with a strictly enforced immutable schema registry.


The burden of operational maintenance

Database-native versioning feels free until you hit the storage bill. If you have a high-churn table (say, a daily interest calculation engine) and you set your Snowflake DATA_RETENTION_TIME_IN_DAYS to 90 days, you aren't just paying for the current state. You are paying for every single block change generated by every MERGE statement. In a high-volume financial environment, this can add 30% to 50% to your monthly storage costs.

More importantly, the failure mode is silent. If your retention period expires, the data is gone. Beyond the 7-day Fail-safe window, which only Snowflake support can restore from, there is no recovery. When the auditor comes knocking a few weeks after that, you're explaining why your "immutable" history has a hole in it.

Event-sourced logs are more work to build, but they are operationally safer. By using Apache Iceberg with a long-term object storage policy (S3 Intelligent-Tiering), you store the raw truth as a series of immutable Parquet files. You aren't "relying" on a database engine's internal retention counter. You own the files. If you need to re-process an audit from three years ago, you point your engine at the manifest files. It's slower to query, but the evidence cannot silently age out: nothing disappears unless someone explicitly expires snapshots or deletes the files.

Reproducibility and the cost of truth

Let’s talk about the "reproducible report" trap. If you run a report today and get $10M in risk exposure, can you run the same query tomorrow and get the exact same number? In a standard warehouse, maybe. But if your underlying dimension tables were updated—even if you have time travel—your join logic might pull in a different version of a "customer status" flag than what was visible during the original run.

To solve this, I’ve moved away from standard SQL views for regulatory reporting. Instead, we use "Snapshot Tables." Every time a critical report is generated, we write the input dataset—the exact subset of the ledger used for that calculation—into a dedicated audit_snapshots schema.
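The snapshot-table idea can be sketched in a few lines. This is a minimal illustration using SQLite so it runs anywhere; the function name, table names, and report ID are hypothetical, and in a warehouse the same CREATE TABLE ... AS SELECT would target the audit_snapshots schema.

```python
import sqlite3


def snapshot_inputs(conn: sqlite3.Connection, report_id: str, input_query: str) -> str:
    """Materialize the rows a report read into its own table and return the table name."""
    if not report_id.isidentifier():
        raise ValueError("report_id must be a plain identifier")
    snapshot_table = f"audit_snapshot_{report_id}"
    conn.execute(f"CREATE TABLE {snapshot_table} AS {input_query}")
    conn.commit()
    return snapshot_table


warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE loans (loan_id INTEGER, customer_status TEXT, exposure REAL)")
warehouse.executemany(
    "INSERT INTO loans VALUES (?, ?, ?)",
    [(1, "ACTIVE", 6_000_000.0), (2, "ACTIVE", 4_000_000.0)],
)

frozen_table = snapshot_inputs(
    warehouse, "risk_2026_07", "SELECT * FROM loans WHERE customer_status = 'ACTIVE'"
)

# A later change to the live table does not touch the frozen inputs.
warehouse.execute("UPDATE loans SET customer_status = 'DEFAULTED' WHERE loan_id = 2")

live_total = warehouse.execute(
    "SELECT SUM(exposure) FROM loans WHERE customer_status = 'ACTIVE'"
).fetchone()[0]
frozen_total = warehouse.execute(f"SELECT SUM(exposure) FROM {frozen_table}").fetchone()[0]
print(live_total, frozen_total)
```

The live query now returns 6,000,000 while the snapshot still returns the 10,000,000 the report was built on, which is exactly the gap an auditor will ask you to explain.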

The cost here is compute, not storage. You’re essentially doubling the amount of data you move through your ETL pipeline. In a production environment with a 20-node Databricks cluster, this adds up. But when you’re arguing with a regulator about why an interest payment was calculated at 4.2% instead of 4.3%, having the exact input state stored as a snapshot is the difference between a "minor finding" and a "cease and desist."


Failure modes and the "oops" factor

The biggest failure mode in database-native time travel is the DDL change. If someone drops a column or renames a table in Snowflake, your ability to "time travel" back to a previous state can become extremely brittle. I once watched a junior engineer rename a column in a production table on a Friday afternoon. The AS OF query failed because the underlying table schema no longer matched the historical metadata. We lost four hours of audit availability while we scrambled to rebuild the schema from the information schema history.

Immutable logs don't care about DDL changes in the same way. Because you are versioning the files, not just the rows, you can use schema evolution tools like Iceberg’s add-column or rename-column without breaking the ability to read the historical data files. You decouple the data from the query engine's current state.

If you're using a standard SQL database, your biggest enemy is the UPDATE or DELETE statement. Even with Time Travel, you are one TRUNCATE away from a bad day. If write privileges on your production schemas are not explicitly restricted, it's not a matter of if someone deletes history, but when.

What I'd pick, and why

If you are a startup or a small shop, stay in the Snowflake/Databricks sandbox. Use their Time Travel features, but for the love of all that is holy, set your retention to the maximum allowed by your tier and put an explicit ALERT on storage costs so you don't get blindsided by the bill.

But if you are operating in a regulated space—banking, fintech, or healthcare—you need to move to an Iceberg-based architecture.

Here is my recommendation: Use a "Medallion Architecture" where your Silver layer is strictly append-only. No updates. No deletes. If a record needs to be corrected, you insert a new record with a valid_from and valid_to timestamp. This turns your entire data platform into a slowly changing dimension (SCD Type 2) ledger.

The caveat? It makes your query code significantly more complex. Your developers will have to write WHERE CURRENT_TIMESTAMP BETWEEN valid_from AND valid_to on every single join. They will hate it. They will complain about performance.

Tell them they are free to complain to the auditors when the system fails to account for a change in a customer's risk profile during a mandatory review. The performance tax is the cost of compliance. If you aren't paying it in code, you're paying it in fines.
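For a feel of what the append-only valid_from / valid_to ledger looks like in practice, here is a small sketch in SQLite. The table, column values, and helper name are hypothetical; the point is only the shape of the as-of lookup.

```python
import sqlite3

ledger = sqlite3.connect(":memory:")
ledger.execute(
    "CREATE TABLE customer_risk (customer_id INTEGER, risk_profile TEXT, "
    "valid_from TEXT, valid_to TEXT)"
)
# Append-only: a correction is a new row, never an UPDATE of the old one.
ledger.executemany(
    "INSERT INTO customer_risk VALUES (?, ?, ?, ?)",
    [
        (42, "LOW", "2026-01-01T00:00:00", "2026-06-14T14:00:00"),
        (42, "HIGH", "2026-06-14T14:00:00", "9999-12-31T00:00:00"),
    ],
)


def risk_profile_as_of(conn: sqlite3.Connection, customer_id: int, as_of: str):
    """Return the risk profile that was valid at the given ISO-8601 timestamp."""
    row = conn.execute(
        "SELECT risk_profile FROM customer_risk "
        "WHERE customer_id = ? AND valid_from <= ? AND ? < valid_to",
        (customer_id, as_of, as_of),
    ).fetchone()
    return row[0] if row else None


before_change = risk_profile_as_of(ledger, 42, "2026-03-01T09:00:00")
at_change = risk_profile_as_of(ledger, 42, "2026-06-14T14:00:00")
print(before_change, at_change)
```

This prints LOW HIGH. The sketch uses a half-open interval (valid_from <= t < valid_to) rather than BETWEEN, because BETWEEN is inclusive on both ends and a timestamp that lands exactly on a boundary would match two versions of the same customer.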

Tags: #data #engineering #finance #audit


Stop mocking your production data: Use Snowflake zero-copy cloning to test pipelines

Aniket Abhishek Soni — Sun, 19 Jul 2026 19:51:39 +0000

"Production parity" is a myth that keeps data engineers awake at night. We tell ourselves that a curated subset of data in a staging schema is "good enough" for integration testing. We write mocks. We write synthetic generators. We pray that our transformation logic holds up when it hits the 400-million-row table that has a weird, undocumented column of nulls that somehow breaks the COALESCE logic in your DBT model.

I’ve been there. I’ve seen a "harmless" change to a window function bring a production pipeline to its knees because the dev environment didn’t account for the specific distribution of late-arriving dimensions. You aren't testing your pipeline; you’re testing your imagination. And your imagination is almost certainly worse than the reality of your production data.

Stop building fake data. Stop worrying about the storage costs of duplicating terabytes for testing. Snowflake’s zero-copy cloning is the only way to treat your pipeline changes like a surgical strike rather than a game of Russian Roulette.

The real problem

The core issue is metadata-only duplication. When you CREATE TABLE ... CLONE, you aren't copying the data. You’re creating a pointer to the existing micro-partitions. This is the difference between a project that takes three hours to refresh staging and one that takes three seconds.

Most engineers avoid cloning because they fear the storage bill or the administrative overhead. They treat their production warehouse like a museum piece—don't touch, don't look, just run the load. But when you don't test against the full production dataset, you aren't doing data engineering; you’re doing data guessing. If your environment setup takes more than a few minutes, you won't test often enough. If you don't test often, you ship bugs.


Step 1: Establish the sandbox

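First, stop working in the same schema as your production tables. You need a dedicated dev role and a specific database or schema where you can safely clone. Snowflake has no separate CLONE privilege: cloning a table needs SELECT on the source and CREATE TABLE on the target schema, so do not hand that combination to every service account. Use a DEV_ENGINEER role that has SELECT on the production tables plus CREATE TABLE and USAGE privileges on your target experimental schema.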

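-- Run this as SECURITYADMIN (or ACCOUNTADMIN)
USE ROLE SECURITYADMIN;
CREATE ROLE IF NOT EXISTS DEV_ENGINEER;
GRANT USAGE ON DATABASE PROD_DB TO ROLE DEV_ENGINEER;
GRANT USAGE ON SCHEMA PROD_DB.RAW_DATA TO ROLE DEV_ENGINEER;
GRANT SELECT ON ALL TABLES IN SCHEMA PROD_DB.RAW_DATA TO ROLE DEV_ENGINEER;
-- DEV_ENGINEER also needs rights on the sandbox database (assumed to exist),
-- otherwise the CREATE SCHEMA below fails
GRANT USAGE, CREATE SCHEMA ON DATABASE DEV_DB TO ROLE DEV_ENGINEER;

-- Create the playground schema (the role must already be granted to your user)
USE ROLE DEV_ENGINEER;
CREATE SCHEMA IF NOT EXISTS DEV_DB.TESTING_GROUND;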

Step 2: Perform the clone

This is where the magic happens. You're not moving bytes. You're creating a snapshot of the table as it exists right now. If the underlying micro-partitions change in production, your clone is unaffected until you re-create it. This is perfect for debugging a specific incident or testing a schema migration.

-- Clone the production table to your sandbox
-- This takes seconds, regardless of whether the table is 10GB or 10TB
CREATE OR REPLACE TABLE DEV_DB.TESTING_GROUND.CLONE_FACT_SALES
CLONE PROD_DB.RAW_DATA.FACT_SALES;

-- Verify the row count matches exactly
SELECT COUNT(*) FROM DEV_DB.TESTING_GROUND.CLONE_FACT_SALES;

Step 3: Run your pipeline transformation

Now, point your pipeline (or your dbt project) at the cloned table. Since the clone is a first-class table object, you can run your INSERT, UPDATE, or MERGE statements against it just like you would in production. The key here is to simulate the exact transformation logic that failed or that you’re about to deploy.

If you're using dbt, swap your source configuration to point to the cloned schema. If you're using raw SQL scripts, update your FROM clauses.

-- Example transformation test
-- Ensure your new logic handles the edge case identified in production
MERGE INTO DEV_DB.TESTING_GROUND.CLONE_FACT_SALES AS target
USING (SELECT * FROM STAGING_TRANSFORMED_DATA) AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET target.amount = source.amount, target.processed_at = CURRENT_TIMESTAMP();

Step 4: Validate and teardown

Once you’ve confirmed the transformation logic works, drop the table. The beauty of zero-copy cloning is that you pay for the storage of any new data you write to the clone, but you don't pay for the shared micro-partitions. However, keeping clones around forever is a recipe for "configuration drift." Clean up after yourself.

-- Verify logic
SELECT * FROM DEV_DB.TESTING_GROUND.CLONE_FACT_SALES 
WHERE processed_at IS NULL;

-- Drop the table when done
DROP TABLE DEV_DB.TESTING_GROUND.CLONE_FACT_SALES;


Lessons learned from production

Watch the Time Travel: If you clone a table, the clone inherits the Time Travel retention period. If you perform massive UPDATE or DELETE operations on your clone, you are creating new micro-partitions. You will pay for that storage. If you're testing an expensive transformation, keep an eye on TABLE_STORAGE_METRICS (per table) or STORAGE_USAGE (account-wide) in the ACCOUNT_USAGE schema.
Cloning Schemas vs. Tables: You can clone entire schemas (CREATE SCHEMA ... CLONE ...). I advise against this for large datasets. It’s too easy to accidentally run a destructive process that impacts more than you intended. Stick to individual table clones to maintain a blast radius of one.
The "Freshness" Trap: Remember that the clone is a point-in-time snapshot. It does not auto-update. If you’re testing a pipeline that relies on the current state of a stream, your clone is already stale the moment you create it. Use AT or BEFORE clauses to clone a table as it existed before a specific bad job ran, which is the ultimate way to debug production failures.
Cloning Views: If you clone a table, you do not clone the views that point to it. You’ll need to recreate the downstream views in your sandbox schema. This is actually a feature, not a bug—it forces you to verify your view definitions against the new table structure.
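The BEFORE variant mentioned under the "Freshness" trap is worth seeing spelled out. Below is a small, hypothetical Python helper that only builds the statement; the query ID is a placeholder you would take from your own query history, and nothing here talks to Snowflake.

```python
def clone_before_statement_sql(source: str, target: str, query_id: str) -> str:
    """Build a CLONE statement for the table as it existed just before a given query ran."""
    if "'" in query_id:
        raise ValueError("query_id must not contain quotes")
    return (
        f"CREATE OR REPLACE TABLE {target}\n"
        f"CLONE {source}\n"
        f"BEFORE (STATEMENT => '{query_id}');"
    )


# '<query_id_of_bad_job>' is a placeholder, not a real query ID.
print(
    clone_before_statement_sql(
        source="PROD_DB.RAW_DATA.FACT_SALES",
        target="DEV_DB.TESTING_GROUND.CLONE_FACT_SALES",
        query_id="<query_id_of_bad_job>",
    )
)
```

The same clause also accepts TIMESTAMP => ... or OFFSET => ... with AT if you know when the bad job ran rather than which statement it was. Either way, the point in time has to fall inside the source table's Time Travel retention window.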

Conclusion

The fear of "breaking production" usually stems from a lack of visibility into how code interacts with live data. Zero-copy cloning bridges that gap. It is low-cost, near-instant, and high-fidelity. It turns "I think this will work" into "I know this works because I ran it on the actual production data 10 minutes ago."

Stop building mocks. Stop managing synthetic data pipelines. Just clone the real thing, break it in your sandbox, and ship with confidence.

Try it: Clone your most complex production table today, run your current dbt model or transformation script against it, and check the row counts. If it takes more than 60 seconds, you aren't working in the modern data stack; you're working in a relic.

Tags: #snowflake #data #engineering #pipelines


Automating BCBS 239 Compliance with Unity Catalog and OpenLineage

Aniket Abhishek Soni — Fri, 17 Jul 2026 19:24:48 +0000

Eighty-two percent of financial institutions still rely on manual Excel spreadsheets to map data lineage for BCBS 239 reporting. If that number sounds like a death sentence for your next internal audit, that’s because it is.

When the Basel Committee on Banking Supervision (BCBS) published BCBS 239, its principles for effective risk data aggregation and risk reporting, they didn't ask for a nice diagram in a Visio file. They asked for an "accurate, complete, and timely" understanding of data flow from the source system to the Risk-Weighted Assets (RWA) calculation. If you can't prove the provenance of a single column in your regulatory report, you're looking at capital add-ons or, worse, a formal "finding" from the regulators.

We are long past the point where static documentation suffices. Your data platform is moving too fast for manual updates. You have two real choices to solve this: bet the farm on Databricks’ proprietary Unity Catalog (UC) or build an agnostic, open-source pipeline using OpenLineage.

The contenders

On one side, you have Unity Catalog. It’s the "it just works" button for the Databricks ecosystem. It captures lineage at the compute layer automatically. If your data is in Delta tables and you’re using Spark or SQL warehouses, the lineage is generated for you with zero code changes.

On the other side, you have OpenLineage. This is an open standard for data lineage, hosted by the LF AI & Data Foundation. It's an open-source framework that hooks into Airflow, Spark, dbt, and Great Expectations to emit events to a backend like Marquez or DataHub. It's the "do it yourself, but do it right" approach.


The operational tax

Unity Catalog is effectively a managed service. You flip a switch in the metastore settings, and suddenly, the system.access.table_lineage and system.access.column_lineage tables are populated. The operational burden is near zero. If you are already living in the Databricks ecosystem, the marginal cost of "configuring" UC is essentially just assigning permissions.

OpenLineage is a different beast. You are managing the integration points. You need to configure the openlineage-spark jar in your cluster configurations. You need to handle the lifecycle of the OpenLineage backend (like Marquez). If the backend goes down, your jobs don't necessarily fail, but your audit trail goes dark. In a production financial environment, a gap in lineage data is often treated as a compliance breach, so you end up building high-availability infrastructure for your lineage store. That's an FTE's worth of work.

Failure modes and observability

I’ve seen Unity Catalog lineage fail when users start doing "clever" things with dynamic SQL or Python exec() calls within Spark. When the parser can't map the lineage, the graph just breaks. It’s a black box; you can’t "fix" the parser. You just stare at a gap in the DAG and hope the auditors don't ask about that specific transformation.

OpenLineage, however, is transparent. If the metadata emission fails, you see it in the logs. Because it’s an event-based system, you can implement custom retries or sidecars. The failure mode is noisy, which is exactly what you want when you’re dealing with BCBS 239. I’d rather have a noisy alert that lineage is missing than a silent, empty graph that lets a non-compliant transformation sneak into a regulatory report.


The cost of vendor lock-in

This is the real elephant in the room for financial services. BCBS 239 requires resilience. If you build your entire compliance strategy on Unity Catalog, you are tethered to Databricks for the next decade. If they decide to hike prices or change their API, your compliance posture changes with them.

OpenLineage is portable. If you decide to move your processing from Databricks to an EMR cluster or a Kubernetes-native environment, your lineage strategy stays the same. You just update your Spark configs. In the eyes of an auditor, an open standard is often viewed as a more "robust" control than a proprietary feature that could be deprecated in the next product release cycle.

What I'd pick, and why

If you are a mid-sized team with a heavy Databricks footprint and you need to check the box for auditors yesterday, use Unity Catalog. It is the path of least resistance. It provides enough visibility for 90% of the standard risk reporting requirements. Just be aware that you are trading flexibility for speed. Keep a side-car documentation process for the "weird" stuff—the stored procedures or the legacy mainframe feeds that UC can't touch.

However, if you are working at a Tier-1 bank or a systemically important financial institution where the regulatory scrutiny is intense, go with OpenLineage.

My recommendation? Use OpenLineage as your primary source of truth. Use a tool like Marquez or DataHub to visualize it. Why? Because when the regulator asks "How do you know this data is correct?", you want to be able to point to an open, standard-compliant schema that you own, not a proprietary UI that lives behind a vendor login.

Here is the honest caveat: OpenLineage requires discipline. You have to ensure that every job—from the ingestion job in Python to the final aggregation in dbt—is emitting the correct events. If your engineers forget to include the openlineage-python library in a new service, your lineage is incomplete.

If you choose OpenLineage, build a "compliance gate" in your CI/CD pipeline. Use pytest to inspect the metadata emission of your jobs before they hit production. If a job doesn't emit the required lineage metadata, the build fails. It sounds draconian, but it’s the only way to ensure your BCBS 239 compliance doesn't drift when someone pushes a hotfix at 2:00 AM on a Friday.
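A compliance gate like this could start as small as the sketch below. It checks a captured run event (as a plain dict) against the core OpenLineage fields; requiring at least one input and one output is my own policy assumption rather than something the spec demands, and all the values in the sample event are hypothetical.

```python
REQUIRED_TOP_LEVEL = ("eventType", "eventTime", "run", "job", "producer")


def lineage_problems(event: dict) -> list[str]:
    """Return the reasons why an OpenLineage run event fails the gate (empty if it passes)."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in event]
    if not event.get("run", {}).get("runId"):
        problems.append("run.runId is empty")
    for side in ("inputs", "outputs"):
        datasets = event.get(side) or []
        if not datasets:
            problems.append(f"no {side} declared")
        for dataset in datasets:
            if not dataset.get("namespace") or not dataset.get("name"):
                problems.append(f"{side} dataset without namespace/name")
    return problems


# Hypothetical event; in a real gate this is captured from the job under test.
sample_event = {
    "eventType": "COMPLETE",
    "eventTime": "2026-07-17T19:24:48Z",
    "producer": "https://example.com/hypothetical-producer",
    "run": {"runId": "0197a3c2-0000-7000-8000-000000000001"},
    "job": {"namespace": "risk", "name": "rwa_aggregation"},
    "inputs": [{"namespace": "warehouse", "name": "silver.exposures"}],
    "outputs": [{"namespace": "warehouse", "name": "gold.rwa_report"}],
}


def test_job_emits_complete_lineage():
    assert lineage_problems(sample_event) == []
```

pytest picks up test_job_emits_complete_lineage on its own; the interesting engineering work is in how you capture the event, for example by pointing the job at a local file or HTTP sink during the CI run.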

The technology isn't the hard part here. The hard part is accepting that your lineage is only as good as the least-maintained pipeline in your stack. Whether you choose the ease of Unity Catalog or the flexibility of OpenLineage, the audit-readiness comes from your internal controls, not the tool itself. Pick your trade-off, document the failure modes, and for heaven's sake, get rid of the spreadsheets.


Why are your data engineers and data scientists living in different time zones?

Aniket Abhishek Soni — Thu, 16 Jul 2026 03:44:14 +0000

03:14 AM. My PagerDuty app started screaming with that specific high-frequency tone that suggests my weekend is effectively over. The alert: Critical: Feature Store Latency Breach > 500ms.

I rolled over, checked the dashboard on my phone, and saw the throughput on our Kafka topic for user credit-risk features had flatlined. Simultaneously, the ML monitoring platform—our separate, shiny, "model-specific" tool—started firing alerts for Feature Attribution Drift.

In most shops, this is where the finger-pointing begins. The data engineers blame the upstream ingestion pipeline, and the data scientists blame the "flaky" model. We were about to waste four hours arguing over who broke what, until I realized they were both looking at different symptoms of the exact same systemic rot.

What we saw

The initial dashboard showed a massive spike in null values for credit_score_bucket. Naturally, the platform team assumed a schema change in the upstream ingestion service—our standard culprit. We checked the protobuf definitions in the user-profile-service. Nothing.

We checked the Airflow logs for the etl-spark-job. It finished successfully, albeit three minutes slower than the P99 baseline. That was our first false lead. We spent forty-five minutes digging into Spark executor memory settings, thinking we had a data skew issue causing a timeout.

Meanwhile, the data science team was losing their minds because the model was outputting 0.98 probability of default for every single applicant. They were convinced the weights had corrupted or that someone pushed a bad model version to v2.4.1. They were busy rolling back to v2.3.9 while I was busy trying to figure out why the data pipeline was "succeeding" while producing zero output.


Root cause

It wasn't a schema change. It wasn't a Spark memory leak. It was a configuration drift in our feature store, specifically in the feature-config.yaml of our Redis cache layer.

We had recently upgraded our client library to redis-py 4.5.4 to support connection pooling optimizations. The new library introduced a subtle change in how it handled None values during serialization. When an upstream service failed to fetch a credit score, it passed a null to the feature store. The new client library, instead of passing that null through or raising an exception, was silently defaulting to an empty string.

The downstream pipeline didn't crash because the code was "resilient." It just quietly ingested empty strings. The Spark job finished because it successfully processed the data. The model "drifted" because it was suddenly receiving empty strings for a critical feature it expected to be an integer.

The ML monitoring tool caught the drift, and the pipeline monitoring caught the latency, but because they were siloed, nobody saw the connection. The failure wasn't in the data; it was in the translation layer between the platform and the model.


The fix

We rolled redis-py back to 4.3.5 immediately to stop the bleeding. But the real fix was in how we handled the FeatureStore writer. I refactored the ingestion wrapper to include a strict validation step using Pydantic.

I added a FeatureSchema class that explicitly fails if a required feature like credit_score_bucket is missing or the wrong type. We stopped relying on implicit "it worked, so it's fine" pipeline logs.

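from datetime import datetime

from pydantic import BaseModel, validator  # Pydantic v1-style API


class FeatureSchema(BaseModel):
    user_id: int
    credit_score_bucket: int
    last_login: datetime

    # pre=True runs the check on the raw value, before Pydantic coerces it to int
    @validator('credit_score_bucket', pre=True)
    def must_be_int(cls, v):
        # bool is a subclass of int, so reject it explicitly
        if isinstance(v, bool) or not isinstance(v, int):
            raise ValueError('Feature integrity violation')
        return v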

By enforcing this at the entry point of the feature store, we turned a "silent model performance degradation" into a "loud pipeline failure." I would rather have a pipeline fail and alert me at 3 AM than have a silent model failure that costs us money for eight hours before someone notices the drift.

What we changed so it never happens again

We stopped running two separate monitoring stacks.

If you have a separate "Data Observability" tool that only looks at row counts and freshness, and an "ML Monitoring" tool that only looks at distribution drift, you are failing. We collapsed both into a single incident workflow centered around our observability platform.

We now use custom metrics exported from our ML models directly into our primary observability stack (in our case, Prometheus/Grafana). We don't use the black-box "drift alerts" provided by ML-specific platforms anymore. We define our own drift thresholds as part of the pipeline metadata.

When a pipeline job starts, it pushes its expected schema version and feature distribution baseline to a shared state store. If the model starts seeing data that deviates from that baseline, the same observability stack that monitors the Kafka throughput fires the alert.
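As an illustration of a self-defined drift threshold, here is a minimal sketch using the Population Stability Index over bucket shares. PSI is my choice for the example, not necessarily what our pipeline metadata uses, and the baseline numbers and the 0.25 threshold are hypothetical.

```python
import math


def psi(baseline: dict[str, float], observed: dict[str, float], floor: float = 1e-6) -> float:
    """Population Stability Index between two bucket -> share distributions."""
    score = 0.0
    for bucket in set(baseline) | set(observed):
        # Floor the shares so a bucket missing on one side does not divide by zero.
        expected = max(baseline.get(bucket, 0.0), floor)
        actual = max(observed.get(bucket, 0.0), floor)
        score += (actual - expected) * math.log(actual / expected)
    return score


# Baseline pushed by the pipeline at job start (hypothetical shares per bucket).
credit_score_baseline = {"1": 0.25, "2": 0.50, "3": 0.25}
DRIFT_THRESHOLD = 0.25  # hypothetical; tune per feature

healthy_score = psi(credit_score_baseline, {"1": 0.24, "2": 0.51, "3": 0.25})
# What the model saw during the incident: every value was an empty string.
incident_score = psi(credit_score_baseline, {"": 1.0})

print(healthy_score < DRIFT_THRESHOLD, incident_score > DRIFT_THRESHOLD)
```

Both checks print True: the healthy batch stays far below the threshold and the all-empty-string batch blows past it. Exported as a gauge, the same number can sit on the dashboard next to Kafka throughput, which is the whole point of running one stack.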

We consolidated our alerts into one Slack channel: #incident-data-core. No more "Data Science" channel versus "Data Engineering" channel. If a model drifts, the engineers see it. If a pipeline lags, the data scientists see it.

The lesson here is simple: stop treating data as a "plumbing" problem and model performance as a "math" problem. They are the same problem. If your monitoring isn't telling you the story of how your data became a prediction, you’re just looking at noise.

Infrastructure and intelligence are not separate concerns. The moment you treat them as such, you’re just waiting for the next 3 AM page to prove you wrong.

Tags: #data #observability #mlops #engineering


DLT Expectations are failing you: Why your quarantine pipeline is a black hole

Aniket Abhishek Soni — Mon, 13 Jul 2026 19:47:29 +0000

03:14 AM. Tuesday. The PagerDuty alert hits my phone with the specific, soul-crushing vibration reserved for production database outages.

"Pipeline billing_reconciliation_prod is failing to commit."

I roll out of bed, laptop open before my eyes fully adjust. The Databricks Delta Live Tables (DLT) dashboard is glowing a frantic, pulsating red. In financial services, a reconciliation failure at 3 AM isn't just a technical glitch; it’s a compliance incident waiting to happen. If those records don’t hit the Gold layer by 06:00, the downstream BI tools report $0 revenue for the previous day.

Here is the kicker: 92% of DLT pipeline failures in production are not caused by code bugs, but by "silent" data quality violations that we explicitly told the system to ignore. We treat DLT expectations as suggestions, not gates, which is exactly why we were now staring at a dead pipeline.

What we saw

The logs were screaming about ExpectationViolationException. Specifically: EXPECTATION_VIOLATED: transaction_id IS NOT NULL.

We had an expectation defined in our DLT pipeline:
@dlt.expect_or_drop("valid_transaction_id", "transaction_id IS NOT NULL")

The symptoms were classic. The pipeline was stuck in a retry loop. Because we used expect_or_drop, the data was simply disappearing into the ether. Or so we thought. The real issue wasn't the drop; it was the volume. A malformed upstream batch from a third-party payment gateway had pushed 400,000 records, 95% of which were missing the transaction_id.

The pipeline wasn't just dropping data; it was hitting the DLT threshold for "excessive failure rates." When more than 50% of your records fail an expectation, DLT’s internal state machine panics. It flags the pipeline as unhealthy and stops the ingestion.

My first instinct was the false lead: "Someone changed the schema upstream." I wasted forty minutes digging through information_schema.columns and checking if our schema evolution settings were set to rescue. They were. The schema was fine. The data, however, was fundamentally broken. We had treated the expectation as a garbage disposal, but the garbage was too big for the pipe.

Root cause

The root cause was our reliance on expect_or_drop for critical financial records. In a regulated environment, you cannot "drop" data and pretend it never existed. If a record fails, it needs an audit trail.

We were using:

import dlt

@dlt.table
@dlt.expect_or_drop("valid_amount", "amount > 0")
def transactions():
    # Rows that fail the expectation are dropped before they reach this table
    return spark.readStream.table("bronze.raw_transactions")

When the amount column arrived as a string literal or a null value, the record vanished. Because DLT doesn't natively surface the dropped records into a secondary "quarantine" table without explicit orchestration, we lost total visibility. We were blind. We had a gate that blocked the flow but left the offending data trapped in the cloudFiles source directory, causing the micro-batch to fail repeatedly because the same corrupted file was being re-processed every single minute.

The offending mechanism was the lack of a "Quarantine Sink." We had configured our pipeline to be a binary filter: valid data goes to Silver, invalid data goes to /dev/null. In banking, /dev/null is a compliance nightmare.

The fix

I didn't need to fix the data—I couldn't control the upstream vendor. I needed to fix the plumbing.

I refactored the pipeline to stop using expect_or_drop for anything that triggered a production reconciliation. Instead, I moved to a "Split and Quarantine" pattern. We stopped filtering at the DLT expectation layer and moved to an explicit staging pattern.

I redefined our ingestion logic to use a two-step process:

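# The explicit quarantine pattern
import dlt
from pyspark.sql.functions import coalesce, col, lit

@dlt.table
def validated_transactions():
    # A NULL amount makes the comparison NULL, not False. coalesce() maps that
    # to False so every row lands in exactly one of the two tables below.
    is_valid = col("transaction_id").isNotNull() & (col("amount") > 0)
    return dlt.read("raw_transactions") \
        .withColumn("is_valid", coalesce(is_valid, lit(False)))

@dlt.table(name="silver_transactions")
def silver():
    return dlt.read("validated_transactions").filter(col("is_valid"))

@dlt.table(name="quarantine_transactions")
def quarantine():
    return dlt.read("validated_transactions").filter(~col("is_valid"))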

By doing this, we turned the "gate" into a "sorter." The pipeline no longer fails when the upstream vendor sends garbage; it simply diverts the garbage into quarantine_transactions. We then set up a separate alerting mechanism—not on the pipeline status, but on the row count of the quarantine table. If the quarantine table grows by more than 100 rows in an hour, that is when the pager goes off.
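
The alert rule on quarantine growth can be sketched in plain Python. This assumes a scheduled job samples the quarantine table's total row count and keeps the readings; `quarantine_growth`, `should_page`, and the sample format are hypothetical names for illustration, not part of DLT.

```python
from datetime import datetime, timedelta


def quarantine_growth(samples: list[tuple[datetime, int]], now: datetime,
                      window: timedelta = timedelta(hours=1)) -> int:
    """Rows added to the quarantine table within the window ending at `now`.

    `samples` are (timestamp, total_row_count) readings, oldest first.
    """
    in_window = [count for ts, count in samples if now - window <= ts <= now]
    if len(in_window) < 2:
        return 0  # not enough readings to measure growth
    return in_window[-1] - in_window[0]


def should_page(samples: list[tuple[datetime, int]], now: datetime,
                max_new_rows: int = 100) -> bool:
    """Page on quarantine growth past the threshold, not on pipeline status."""
    return quarantine_growth(samples, now) > max_new_rows
```

Alerting on the growth rate instead of the absolute size keeps a long-lived quarantine table from paging forever once it crosses a fixed row count.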

We gained observability. Instead of a crashed pipeline at 3 AM, we had a healthy pipeline that was successfully segregating bad data, and a ticket in Jira detailing exactly which transaction_ids were malformed.

What we changed so it never happens again

We stopped using expectations for data quality control and started using them for data quality monitoring. There is a massive, often misunderstood difference.

First, we implemented a "Dead Letter Queue" (DLQ) pattern for all DLT pipelines. Every single pipeline now has a corresponding quarantine table. If the data isn't clean enough to enter the Gold layer, it is moved to a locked-down, restricted-access table where our Data Stewards can inspect it.

Second, we moved away from expect_or_drop entirely. We now use expect_or_fail only for catastrophic data issues (e.g., if the primary key column is missing in 100% of rows). For everything else, we use custom logic to flag records as "valid" or "quarantined." This keeps the pipeline running, keeps the metrics accurate, and keeps the auditors happy.

Third, we automated the "re-drive" process. We built a small Python utility that allows Data Stewards to update the quarantine records—fixing typos or missing metadata—and re-insert them into the bronze layer. This effectively turns a failed batch into a manual correction workflow.
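
A rough sketch of what such a re-drive step could look like, reduced to plain dictionaries. The `quarantine_id` key, the `corrections` mapping, and both function names are assumptions made for illustration; the validity rule mirrors the one used in the pipeline (a transaction_id and a positive amount).

```python
def is_valid(record: dict) -> bool:
    """Same rule as the pipeline: a transaction_id and a positive numeric amount."""
    amount = record.get("amount")
    return (
        record.get("transaction_id") is not None
        and isinstance(amount, (int, float))
        and amount > 0
    )


def redrive(quarantined: list[dict], corrections: dict[str, dict]) -> tuple[list[dict], list[dict]]:
    """Apply steward corrections keyed by a hypothetical `quarantine_id`.

    Returns (ready_for_bronze, still_quarantined).
    """
    ready, remaining = [], []
    for record in quarantined:
        # Overlay the steward's fixes on the original record, then re-validate
        fixed = {**record, **corrections.get(record["quarantine_id"], {})}
        (ready if is_valid(fixed) else remaining).append(fixed)
    return ready, remaining
```

Re-validating after the correction matters: a record that is still malformed stays in quarantine instead of re-entering bronze and failing a second time.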

The most important systemic lesson? Don't let your infrastructure be the judge, jury, and executioner. DLT is excellent at moving data, but it is a terrible place to make business decisions about what constitutes "valid" financial data. If your pipeline fails because of bad data, your architecture is essentially admitting that it has no plan for reality.

In production, reality is messy, sources are unreliable, and your data quality gates should be designed to catch the mess, not crash under its weight. Stop dropping data. Start quarantining it. Your future self, sleeping soundly at 3 AM, will thank you.

Cover photo by Seafairy7 on Unsplash.

Stop building your data orchestration layer in the wrong place

Aniket Abhishek Soni — Sat, 11 Jul 2026 20:16:37 +0000

Roughly 70% of production data pipelines in the healthcare and fintech sectors I’ve audited are effectively "zombie orchestrators"—they are running, but they are technically insolvent. They suffer from massive technical debt, opaque failure states, and cost structures that make your CFO weep. Most engineers pick an orchestrator based on what they read in a blog post from 2019, ignoring that the tooling landscape has fundamentally shifted.

I’ve spent the last six years debugging distributed deadlocks at 3 AM. I’ve watched multi-million dollar Spark jobs hang indefinitely because a dependency was misconfigured in Airflow. I’m here to tell you that the orchestrator you choose for your Medallion architecture (Bronze/Silver/Gold) isn't just a "preference"—it dictates your incident response time and your monthly cloud bill. Stop treating orchestration like a commodity. It’s the engine of your data product.

1. Don't use Airflow if you aren't paying for Astronomer

Apache Airflow is a fantastic framework for Python developers, but it is a bottomless pit of operational toil if you host it yourself. If your team is spending more time managing airflow-scheduler pods in Kubernetes than writing transformation logic, you have already lost. In fintech, we don't have the luxury of "debugging the scheduler."

If you insist on open-source Airflow, you’re stuck managing celery_worker concurrency and PostgreSQL metadata bloat. If you haven't manually purged the task_instance table because your metadata database hit 50GB, you haven't really lived. If you have the budget, use a managed service. If you don't, avoid Airflow entirely.

# The classic airflow trap: Task concurrency limits
# If you don't tune these, your workers will starve
[celery]
worker_concurrency = 16

# The per-DAG task limit is a [core] setting, not a [celery] one
[core]
max_active_tasks_per_dag = 4

2. Step Functions are for event-driven, not ETL

AWS Step Functions are beautiful for state machines. They are perfect for handling user onboarding flows or microservices orchestration where you need sub-second state transitions. They are terrible for heavy-duty Medallion pipelines.

Why? Because Step Functions were never built for the idiosyncratic retry logic of a Spark job. When you wrap a databricks-submit-run call in a Step Function, you lose the native integration with the Spark UI. You end up with a state machine that just waits for a cluster to finish, providing zero visibility into the actual data movement. Use Step Functions to trigger the ingestion, but get out of the way for the transformation.

3. Databricks Workflows is the "Goldilocks" choice for Medallion

If your Medallion architecture lives on Databricks, Databricks Workflows is now the default winner. The tight coupling between the job scheduler and the cluster lifecycle (especially with Serverless Compute) eliminates 90% of the cold-start and orchestration overhead Airflow forces on you.

The key benefit here is "Job Clusters." By using Job Clusters instead of All-Purpose clusters, you’re paying significantly less per DBU. Integrating this into your CI/CD pipeline via databricks-cli or terraform is trivial.

# Terraform snippet for a Databricks Job
resource "databricks_job" "medallion_silver_layer" {
  name = "silver-layer-transformation"
  job_cluster {
    job_cluster_key = "shared_job_cluster"
    new_cluster {
      spark_version = "13.3.x-scala2.12"
      node_type_id  = "i3.xlarge"
      num_workers   = 4
    }
  }
  task {
    task_key        = "transform"
    # Bind the task to the job cluster declared above
    job_cluster_key = "shared_job_cluster"
    notebook_task {
      notebook_path = "/pipelines/silver_transform"
    }
  }
}

4. The "Bronze-to-Silver" dependency trap

The biggest mistake I see in Medallion architectures is over-orchestrating. You don't need a complex DAG to move data from Bronze to Silver if your transformation is idempotent.

If your Silver layer is just a Spark job reading from a Delta table, use dbt models run via Databricks Workflows. Do not write complex Airflow logic to check if data arrived. Use Delta Live Tables (DLT) declarative syntax. DLT handles the dependency graph internally. When you let the framework handle the graph, you stop writing glue code that eventually breaks.
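
To make "the framework handles the graph" concrete, here is a small sketch of how a run order can be derived from declared dependencies alone. It uses Python's standard `graphlib` and illustrative table names; it shows the general idea and is not a description of how DLT is implemented internally.

```python
from graphlib import TopologicalSorter

# Each table lists the tables it reads from; names are illustrative.
dependencies = {
    "bronze_raw_transactions": set(),
    "silver_transactions": {"bronze_raw_transactions"},
    "quarantine_transactions": {"bronze_raw_transactions"},
    "gold_daily_revenue": {"silver_transactions"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid run order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

Once each table declares only what it reads, the ordering and the cycle check come for free, which is exactly the glue code you no longer write by hand.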

5. Failure modes define your sanity

Airflow’s Sensor pattern is the silent killer. If you have a sensor checking for a file in S3 every 60 seconds, you are burning money. It’s a classic "distributed systems smell."

Contrast this with Databricks Workflows: when a job fails, the cluster terminates, you get an email, and the state is cleaned up. With Airflow, a worker node might get "zombie" status, holding onto resources while the UI says the task is running. If you are in healthcare, auditability is king. Databricks Workflows provides a native audit log of who ran what and when, which is far easier to present to compliance officers than a tangled collection of Airflow logs.

6. Version control is not optional

Whatever you choose, it must live in Git. If I see a team manually editing DAGs in an Airflow UI or clicking "Run Now" in the Databricks console without a corresponding PR, I know the pipeline is broken.

The move toward "Orchestration as Code" (using dbt-databricks or Terraform) is the only way to scale. If your orchestrator configuration isn't version-controlled, you don't have a pipeline; you have a collection of brittle scripts that will break the moment the senior engineer who wrote them goes on vacation.

7. The cost of visibility

You will eventually have a "Gold" table that is wrong. When that happens, you need to trace the provenance of the data back to the Bronze ingestion.

In Airflow, this requires complex XCom tracking and log scraping. In Databricks Workflows, Unity Catalog handles the lineage for you. The orchestrator is becoming less about "when to run" and more about "what did I run and where did it go." Unity Catalog integration with your workflows is non-negotiable.

Conclusion

Stop treating your orchestrator like a generic task runner. If you’re already in the Databricks ecosystem, stop fighting the platform and use Databricks Workflows. Airflow is a Ferrari: it’s beautiful, fast, and will cost you a fortune in maintenance if you aren't an expert mechanic. Step Functions are a reliable utility vehicle, but they aren't meant to race on the data track.

Pick the path of least resistance for your infrastructure team. Medallion architectures are complex enough without adding an unnecessary layer of "orchestration glue."

Are you managing your pipeline, or is your pipeline managing you?

Cover photo by Tyler on Unsplash.

Stop moving data to Spark when your warehouse is already Snowflake

Aniket Abhishek Soni — Thu, 09 Jul 2026 21:49:20 +0000

Six months ago, our Friday deployments involved babysitting a 40-node EMR cluster. We were dumping 2TB of Parquet files from Snowflake to S3, spinning up a massive Spark job to perform window functions, and then wrestling with COPY INTO commands to shove the results back into Snowflake for the BI team. If the network flickered or the spark.executor.memoryOverhead wasn’t tuned perfectly, we’d wake up to a PagerDuty alert at 3:00 AM. Total runtime: 45 minutes.

Last week, I deleted the entire Spark infrastructure. We refactored those pipelines into Snowpark Python stored procedures running directly inside the Snowflake warehouse. The same transformation now executes in 12 minutes. We stopped paying AWS for the cluster, stopped managing IAM roles for S3 buckets, and most importantly, we stopped debugging serialization errors between PySpark and the Snowflake connector.

The real problem

The industry narrative for years has been "Snowflake for storage, Spark for transformation." It was sound advice when Snowflake's Python support was a glorified UDF wrapper. But the paradigm has shifted. Data gravity is a real, measurable cost. Every byte you move out of your warehouse is a tax you pay in latency, egress fees, and maintenance overhead.

The "real" problem isn't performance—Spark can technically be faster if you have a massive, highly-tuned cluster. The real problem is the operational tax of managing a secondary compute engine. If your data lives in Snowflake, moving it to Spark is an admission of failure. You aren't just writing code; you’re managing a distributed system, a network layer, and a security boundary. Snowpark isn't about replacing Spark in every scenario; it’s about recognizing that for 95% of financial reporting and ETL workloads, the overhead of "external compute" is a liability, not an asset.

Step 1: Rethinking the execution model

In Spark, you are responsible for the JVM, memory management, and shuffling. In Snowpark, you are writing Python code that translates into Snowflake's optimized SQL execution plan. The first step is to stop thinking about RDDs and DataFrames as objects that live in memory. You are building a lazy evaluation graph that Snowflake compiles into a single, massive query plan.

If you try to write Snowpark exactly like PySpark, you will hit a wall. You cannot collect() a 50GB dataframe to your local machine to inspect it. You have to embrace the session.sql() and dataframe.show() patterns. Here is how we define a pipeline that replaces an old Spark job:

# The Snowpark approach: Keep it inside the warehouse
from snowflake.snowpark import Session
from snowflake.snowpark.functions import sum as sum_

def main(session: Session) -> str:
    # Instead of reading from S3, we reference the table directly
    df = session.table("RAW.TRANSACTIONS")

    # Transformations stay in the engine
    # sum_ is Snowpark's column aggregate; Python's built-in sum() would fail here
    result = df.filter(df["STATUS"] == "COMPLETED") \
               .group_by("USER_ID") \
               .agg(sum_("AMOUNT").alias("TOTAL_SPEND"))

    result.write.mode("overwrite").save_as_table("ANALYTICS.USER_SPEND")
    return "ok"

The difference here is that no data leaves the Snowflake boundary. You aren't serializing objects; you are building an expression tree that the query optimizer handles.
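
A toy illustration of what "building an expression tree" means, in plain Python. `LazyFrame` is a made-up class for this sketch and is not how Snowpark is implemented; it only shows that each method call records a step and that SQL is produced once, at the end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame: it records steps and holds no rows."""
    table: str
    filters: tuple[str, ...] = ()
    group_keys: tuple[str, ...] = ()
    aggregates: tuple[str, ...] = ()

    def filter(self, condition: str) -> "LazyFrame":
        return LazyFrame(self.table, self.filters + (condition,), self.group_keys, self.aggregates)

    def group_by(self, *keys: str) -> "LazyFrame":
        return LazyFrame(self.table, self.filters, keys, self.aggregates)

    def agg(self, *expressions: str) -> "LazyFrame":
        return LazyFrame(self.table, self.filters, self.group_keys, expressions)

    def to_sql(self) -> str:
        """Compile the recorded steps into one statement; nothing ran before this point."""
        columns = ", ".join(self.group_keys + self.aggregates) or "*"
        sql = f"SELECT {columns} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        if self.group_keys:
            sql += " GROUP BY " + ", ".join(self.group_keys)
        return sql


query = (
    LazyFrame("RAW.TRANSACTIONS")
    .filter("STATUS = 'COMPLETED'")
    .group_by("USER_ID")
    .agg("SUM(AMOUNT) AS TOTAL_SPEND")
    .to_sql()
)
```

The chained calls mirror the pipeline above, and `query` ends up as a single statement that the warehouse can optimize as a whole.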

Step 2: Configuring the environment without the hell of JARs

One of the biggest time-sinks in Spark is dependency management. You’ve been there: java.lang.NoClassDefFoundError because a library version in your requirements.txt didn't match the one on the worker nodes. In Snowpark, you handle dependencies via packages in your stored procedure definition.

We pin our environment using a specific Snowflake package set, which ensures that the library versions are consistent across the warehouse nodes. You don't need to build a custom Docker image or manage a private PyPI mirror.

# register() is the Snowpark call that creates a stored procedure from a Python function
session.sproc.register(
    func=my_transformation_logic,
    name="PROCESS_SALES",
    is_permanent=True,
    stage_location="@MY_STAGE",
    packages=["pandas", "scikit-learn==1.2.2", "snowflake-snowpark-python"],
    replace=True
)

By explicitly pinning scikit-learn==1.2.2, we avoid the "it worked on my machine" nightmare that plagues Spark clusters. The Snowflake environment is isolated, immutable, and versioned.

Step 3: Handling the failure modes of the engine

Spark fails with OutOfMemoryError or ShuffleFetchFailed exceptions. Snowpark fails with standard SQL errors, which are significantly easier to debug. When a Snowpark job fails, you don't dig through YARN logs or Spark UI task histories. You look at the QUERY_HISTORY view in Snowflake.

If your Python code hits a limit, the error message tells you exactly which query failed, which line of SQL caused it, and why. Here is the configuration I use to prevent "runaway query" costs, which is a different kind of failure mode compared to Spark:

-- Set a warehouse-level limit to prevent runaway Snowpark code
ALTER WAREHOUSE COMPUTE_WH SET MAX_CONCURRENCY_LEVEL = 8;
ALTER WAREHOUSE COMPUTE_WH SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;

In Spark, you manage the "cluster" size. In Snowpark, you manage the "warehouse" budget. If you want to scale, you don't add more nodes to a cluster; you scale the warehouse size (e.g., X-Small to Medium), and Snowflake handles the parallelism automatically.

Lessons learned from production

After six months of running mission-critical financial pipelines, these are the cold, hard truths:

Avoid UDFs if possible: User-Defined Functions in Snowpark are powerful, but they trigger a serialization layer that can slow down performance. If you can express your logic in Snowpark Dataframe API methods (which translate to SQL), do it. Use UDFs only for complex Python-native logic that cannot be vectorized.
The "Small File" problem is gone: Spark struggles with small files because of metadata overhead. Snowflake doesn't care. You don't need to run a compaction job after every write. Snowflake’s micro-partitioning handles this natively.
Debug via SQL: If your Snowpark job is hanging, don't try to look at the Python trace first. Run SELECT * FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY()) WHERE ... and look for the underlying SQL queries being generated. 9 times out of 10, the "Python issue" is actually a SQL join that is missing a partition key.
Memory is not infinite: Even though you aren't managing Spark executors, Python memory limits still exist in Snowpark. If you are doing to_pandas() on a massive table, you will crash the warehouse node. Always filter your data as much as possible before pulling it into memory.

Conclusion

Is Spark dead? No. If you are doing heavy-duty machine learning with iterative training on petabytes of unstructured text, you might still need the flexibility of a dedicated Spark cluster. But for 90% of data engineering—filtering, joining, aggregating, and transforming—Spark is overkill that introduces unnecessary complexity.

You are likely already paying for Snowflake, which is the most sophisticated distributed query engine on the planet. Why spend 40% of your engineering time building a bridge to a second, inferior engine?

Try it: Take one of your low-impact Spark jobs—the one that triggers the most PagerDuty alerts—and rewrite it in Snowpark. Don't worry about "performance tuning" initially. Just get the logic moved over. You’ll find that the time you lose in initial refactoring is dwarfed by the time you save in operational maintenance. Stop being a cluster administrator and start being a data engineer.

Cover photo by Tyler on Unsplash.

Stop building a feature store: When a Delta table is enough

Aniket Abhishek Soni — Tue, 07 Jul 2026 20:55:42 +0000

The "Feature Store as a mandatory architectural layer" is the most expensive myth in modern MLOps. We have collectively convinced ourselves that unless you are running a bespoke, high-latency serving layer, you aren't doing "real" machine learning.

Why I chose this topic: I’ve spent the last six months untangling a "bespoke" feature store built on Redis and Kafka that cost my team three FTEs to maintain while serving a model that literally only needed three features. I’m tired of seeing engineers build distributed systems they don't need for problems they haven't actually validated yet.

We treat feature stores like a silver bullet for data leakage and training-serving skew, but we ignore the operational tax of maintaining a two-tier storage system. If you aren't handling sub-10ms inference requirements on a massive scale, you are likely just building a distributed cache with extra steps.

How it actually works

At its core, a feature store is a glorified join-and-cache mechanism. You have a "batch store" (usually Parquet files in S3 or a Delta table) for model training, and an "online store" (usually Redis, DynamoDB, or Cassandra) for low-latency retrieval during inference.

The "magic" is the sync process. You are essentially implementing a distributed CDC (Change Data Capture) pipeline. When a new user profile is updated in your primary database, a trigger fires—maybe via Debezium—pushing that record into a Kafka topic. A consumer then parses that Avro/Protobuf payload, computes the feature transformation, and performs a SET operation in Redis.

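# The "Simple" Feature Store Sync Logic
import time
from datetime import datetime, timedelta, timezone

import redis

redis_client = redis.Redis()  # connection settings omitted

def sync_user_features(event):
    # event is coming from Debezium/Kafka
    user_id = event['after']['id']
    # Assumption: last_login arrives as an ISO-8601 string with a UTC offset
    last_login = datetime.fromisoformat(event['after']['last_login'])

    # Feature computation
    age = datetime.now(timezone.utc) - last_login
    is_active = 1 if age < timedelta(days=30) else 0

    # Redis write
    redis_client.hset(f"features:user:{user_id}", mapping={
        "is_active": is_active,
        "last_updated": time.time()
    })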

This looks clean in a tutorial. In production, you hit the wall of partial failures. What happens when the Redis write fails but the Kafka offset commits? What happens when your feature computation logic in the Python microservice drifts from the PySpark job running your offline training set? You end up with "silent skew," where your model is essentially hallucinating because it’s looking at feature values that don't match the distribution it saw during training.
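
One common way to close the "write fails but offset commits" gap is to commit only after the write succeeds and to make the write idempotent. The sketch below shows that ordering with plain callables; `write` and `commit` are hypothetical stand-ins for the Redis and Kafka clients, and the event shape is assumed.

```python
from typing import Callable


def process_batch(events: list[dict], write: Callable[[str, dict], None],
                  commit: Callable[[int], None]) -> int:
    """Write each event's features, then commit the offset of that successful write.

    Writes are keyed by user, so replaying an event after a crash overwrites the
    same key instead of creating a duplicate. Returns the number of events processed.
    """
    processed = 0
    for event in events:
        try:
            write(f"features:user:{event['user_id']}", event["features"])
        except ConnectionError:
            break  # stop here; uncommitted events are redelivered later
        commit(event["offset"])
        processed += 1
    return processed
```

This gives at-least-once delivery with idempotent writes. It does nothing for the second failure mode, logic drift between the online and offline computations, which no commit ordering can fix.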

The tradeoffs nobody mentions

Let’s talk about the operational reality of 2026. If you are using a managed feature store, you are paying a "convenience tax" that often exceeds the cost of a dedicated team. If you are rolling your own, you are now a database administrator for two different storage engines.

The biggest issue is the "dual-write" problem. You essentially have to ensure that your feature store and your primary transactional database are perfectly in sync. They never are. You will inevitably run into clock skew, network partitions, and serialization mismatches between your Go-based microservices and your Python-based ML training pipelines.

Then there is the schema evolution problem. Imagine you update your feature schema in your Delta table. Now you have to write a migration script to update every record in your online store. If your feature store doesn't support atomic schema updates (and most don't), your inference service will start throwing KeyError exceptions or, worse, parsing bad data because the schema version in the cache is stale.
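
If you do run an online store, one defensive option is to stamp every cached record with a schema version and refuse to read anything else. This is a sketch under that assumption; the `schema_version` field, `EXPECTED_SCHEMA_VERSION`, and `StaleFeatureSchema` are hypothetical names, not features of any particular feature store.

```python
EXPECTED_SCHEMA_VERSION = 2  # hypothetical version the deployed model was trained against


class StaleFeatureSchema(Exception):
    """Raised when a cached record was written under a different schema version."""


def read_features(cached: dict, required: tuple[str, ...]) -> dict:
    """Return only the required features, failing loudly on stale or incomplete records."""
    version = cached.get("schema_version")
    if version != EXPECTED_SCHEMA_VERSION:
        raise StaleFeatureSchema(
            f"cache has v{version}, model expects v{EXPECTED_SCHEMA_VERSION}"
        )
    missing = [name for name in required if name not in cached]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return {name: cached[name] for name in required}
```

A loud failure on a version mismatch is easier to debug than a model quietly scoring records that were written under last month's schema.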

The debugging process is a nightmare. When a model prediction looks wrong, you aren't just checking the inference log. You’re SSH’ing into a Redis cluster to dump keys, checking the Kafka lag, and then re-running a SQL query against your Delta lake to see if the ground truth matches the cached value. It’s a distributed debugging loop that can take hours.

When to reach for it (and when not to)

If you are a startup or a mid-sized engineering org, stop. You don't need a feature store. You need a well-structured Delta table and an efficient API.

Use a Delta table as your "source of truth" and serve it directly. In 2026, with the speed of Delta Lake 4.0 and optimized Z-Ordering, you can perform point-lookups on your S3-backed tables with acceptable latency for 90% of use cases.

If your inference service needs a feature, pass the user_id to a microservice that queries the Delta table via a high-performance engine like Trino or even a cached Spark dataframe. If you need it faster, cache the result in a simple local LRU cache in your inference container. If the cache expires, you go back to the Delta table.
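
The cache-then-fall-back flow can be sketched as a small in-process cache with expiry. `TTLCache` and its `loader` argument are hypothetical: the loader stands in for whatever function queries the Delta table (through Trino or otherwise), and the injectable `clock` exists only to make the expiry testable.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Tiny in-process cache: serve from memory until an entry expires, then reload."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float,
                 max_entries: int = 10_000, clock: Callable[[], float] = time.monotonic):
        self._loader = loader          # hypothetical function that queries the Delta table
        self._ttl = ttl_seconds
        self._max = max_entries
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, user_id: str) -> Any:
        now = self._clock()
        hit = self._entries.get(user_id)
        if hit is not None and now - hit[0] < self._ttl:
            # Re-insert so dict order tracks recency of use (LRU)
            self._entries.pop(user_id)
            self._entries[user_id] = hit
            return hit[1]
        value = self._loader(user_id)  # miss or expired: go back to the table
        self._entries.pop(user_id, None)
        self._entries[user_id] = (now, value)
        if len(self._entries) > self._max:
            # Evict the least recently used entry (first key in insertion order)
            self._entries.pop(next(iter(self._entries)))
        return value
```

The TTL is the staleness budget: pick it from how fresh the feature needs to be, and the table stays the only source of truth.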

You reach for a full-blown feature store (like Feast or Tecton) only when you meet these three criteria:

You have 100+ production models that share overlapping feature sets.
Your inference latency requirements are strictly under 50ms and require complex, pre-computed feature aggregations (like "number of transactions in the last 24 hours").
You have a dedicated ML Platform team whose only job is to manage the consistency of these features.

If you don't have a dedicated team for this, the "Feature Store" will become a graveyard for undocumented, stale, and broken feature pipelines that no one knows how to retire.

Conclusion

The industry is slowly waking up from the MLOps hype cycle. We spent years building complex "platform" layers because we were told it was the only way to scale. In reality, scaling is about reducing moving parts, not adding more databases to your stack.

Keep your features in your Delta lake. Use dbt to manage your transformations. Serve them via a simple, versioned API. If you find yourself spending more time managing your "feature store" infrastructure than you do improving model accuracy, you’ve already lost. Build for the complexity you have today, not the scale you hope to have in three years. Your future self—and your on-call rotation—will thank you.

Cover photo by Tyler on Unsplash.

Your Hive Metastore Migration is a Ticking Time Bomb: Why Are You Still Using It?

Aniket Abhishek Soni — Sun, 05 Jul 2026 23:18:29 +0000

Why I chose this topic: I’ve spent the last six months cleaning up the aftermath of "in-place" migrations that nuked production partitions, and I’m tired of seeing engineers treat schema evolution like a suggestion rather than a requirement. If you aren't running parallel pipelines during a migration, you aren't doing engineering; you're playing roulette with your data lake's consistency.

You’ve hit the limit. You’re running MSCK REPAIR TABLE for the thousandth time, your Spark jobs are failing because a downstream process added a column to a Parquet file without telling anyone, and your S3 list latency is becoming a full-blown outage. You read the blog posts about "converting" your tables to Iceberg in place. Don’t. If you run a conversion script on a multi-petabyte production table, you are betting your entire career on the hope that the conversion process doesn't hit a transient I/O error mid-write.

Most engineers try to "cut over" by pointing the metastore to a new location or running an ALTER TABLE conversion. When that fails—and it will fail when a stray job tries to write to the old partition layout at the same time—your state becomes inconsistent. You end up with a mix of hidden files, phantom partitions, and a massive ticket queue from the BI team asking why their dashboards are returning nulls.

The real problem

The problem isn't the file format; it's the metadata management. Hive-style partitioning is a legacy relic that relies on filesystem structure to define schema. This is inherently fragile. When you migrate, you shouldn't be "converting" data; you should be building a dual-write architecture that treats the new Iceberg table as the source of truth while keeping the legacy Parquet table as a hot-standby fallback.

Step 1: The dual-write bridge

Before you touch your production tables, set up a Spark Structured Streaming job that consumes the same upstream raw data or Kafka topic that feeds your existing Parquet pipeline. Do not attempt to "copy" existing files into Iceberg. Instead, create a brand-new Iceberg table and let the streaming job backfill it.

Configure your Spark session to point to your Catalog—I prefer the REST catalog for multi-engine support—and define the Iceberg schema explicitly. Do not rely on schema inference. It will bite you the moment a decimal(10,2) turns into a decimal(38,18) during a silent upstream change.

val spark = SparkSession.builder()
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.prod_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.prod_catalog.type", "rest")
  .config("spark.sql.catalog.prod_catalog.uri", "https://your-iceberg-catalog-service")
  .getOrCreate()

// Create the target Iceberg table
spark.sql("""
  CREATE TABLE prod_catalog.db.orders_iceberg (
    order_id bigint,
    user_id bigint,
    amount decimal(10,2),
    ts timestamp
  ) USING iceberg
  PARTITIONED BY (days(ts))
""")

Step 2: Validating the shadow state

Once your shadow table is streaming, you need to verify it. Don't just check record counts; compare checksums. A record count of 1 million rows in Parquet vs 1 million in Iceberg means nothing if the schema types don't align.

I run a validation job every hour that compares the sum(amount) and max(ts) between the legacy Parquet table and the Iceberg shadow table. If these don't match, you trigger an alert. If they do match, you have high confidence that your streaming logic is sound.

# Validation check in PySpark
from pyspark.sql.functions import count, max as max_, sum as sum_

parquet_df = spark.read.table("hive_metastore.db.orders_old")
iceberg_df = spark.read.table("prod_catalog.db.orders_iceberg")

def get_stats(df):
    # One row of aggregates; Row objects compare by value
    return df.agg(
        sum_("amount").alias("total"),
        max_("ts").alias("max_ts"),
        count("*").alias("cnt"),
    ).collect()[0]

# Compare results
if get_stats(parquet_df) != get_stats(iceberg_df):
    raise Exception("Integrity mismatch between Parquet and Iceberg")

Step 3: The hidden cutover

The biggest mistake is a "big bang" switch. Instead, use a view to abstract the table location. Create a view that initially points to your legacy Parquet table. When you are ready to flip the switch, you update the view definition to point to the Iceberg table. This allows you to toggle back in seconds if your BI tools start throwing errors.

Crucially, ensure your Iceberg table is configured with write.format.default = parquet and write.metadata.delete-after-commit.enabled = true. You want the performance of Iceberg with the compatibility of Parquet files underneath.

-- Initially
CREATE VIEW prod_catalog.db.orders_view AS 
SELECT * FROM hive_metastore.db.orders_old;

-- During migration
-- Update the view to point to the new Iceberg table
ALTER VIEW prod_catalog.db.orders_view AS 
SELECT * FROM prod_catalog.db.orders_iceberg;

Lessons learned from production

Partition evolution is your friend: Unlike Hive, Iceberg allows you to change partition schemes without rewriting the entire table. Don't be afraid to start with days(ts) and move to hours(ts) if query performance drops as the table grows.
Watch the metadata-log folder: If you are using S3, metadata files can grow significantly. Set write.metadata.delete-after-commit.enabled to true and keep write.metadata.previous-versions-max low (e.g., 5-10) unless you have a strict regulatory requirement to keep months of metadata history.
Snapshot isolation is not magic: If your downstream jobs use spark.read, they read whichever snapshot is current when the query is planned. If a long-running job starts before you switch the view and ends after, it can fail or see inconsistent data if snapshot expiration removes files it still needs. Run expire_snapshots every 24 hours, with a retention window longer than your longest-running job, so your storage costs don't explode from files that no live snapshot references.
Handle empty writes: If your upstream source has gaps, some Spark streaming configurations will write empty Iceberg snapshots. This creates unnecessary metadata overhead. Filter out empty micro-batches before calling write.
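
For the empty-write lesson, here is a minimal sketch of what the filter can look like inside a foreachBatch handler. It assumes PySpark 3.3 or later (for DataFrame.isEmpty()) and reuses the shadow table name from the earlier steps. The write_non_empty name and the wiring shown in the comment are hypothetical.

```python
def write_non_empty(batch_df, batch_id, target_table="prod_catalog.db.orders_iceberg"):
    """foreachBatch handler that skips the commit when a micro-batch has no rows."""
    # DataFrame.isEmpty() exists in PySpark 3.3+; on older versions,
    # len(batch_df.take(1)) == 0 is an equivalent check.
    if batch_df.isEmpty():
        return False  # nothing written, so no empty Iceberg snapshot
    batch_df.writeTo(target_table).append()
    return True

# Hypothetical wiring (stream_df and the checkpoint path are placeholders):
# stream_df.writeStream.foreachBatch(write_non_empty) \
#     .option("checkpointLocation", "s3://my-bucket/checkpoints/orders") \
#     .start()
```

Spark ignores the return value of a foreachBatch handler. It is only there to make the function easy to unit test.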

Conclusion

Migrating to Iceberg is less about the data and more about the orchestration. If you treat your migration like a controlled release—with a shadow table, validation logic, and a view-based abstraction—you remove the "fear" factor of the migration. You aren't just moving files; you're building a system that allows you to evolve your schema without breaking the downstream.

Stop relying on the filesystem to define your data structure. Move to Iceberg, keep your legacy tables as a hot-standby, and validate, validate, validate.

Try it: Start your shadow pipeline today. Create the Iceberg table, stream to it for one week, and run a daily diff between it and your legacy Parquet table. If the data matches for 7 straight days, you're ready to cut over.

Tags: #data #iceberg #engineering #cloud

Cover photo by David Pupăză on Unsplash.

Exactly-once is a lie: why your Spark stream is actually at-least-once

Aniket Abhishek Soni — Fri, 03 Jul 2026 21:39:30 +0000

If you think your Spark Structured Streaming pipeline is actually achieving end-to-end exactly-once processing, you are likely just lucky that your infrastructure hasn't had a truly bad day yet.

Why I chose this topic: I’ve spent the last six years cleaning up "perfect" pipelines that bloated their databases with duplicate records the moment a Kafka partition rebalanced during a checkpoint commit. We treat the word "exactly-once" as a religious tenet, but in the trenches of financial ledger reconciliation, it’s a leaky abstraction that hides the brutal reality of distributed systems.

Why the common approach falls short

The industry loves the marketing slide that says Spark Structured Streaming is "exactly-once." In reality, what Spark provides is exactly-once processing within the Spark engine itself, not end-to-end.

The mechanism relies on checkpointing—writing state to HDFS or S3—and the deterministic replay of inputs. If your task fails, Spark rolls back to the last successful offset and re-processes. That sounds clean, right? But the moment you write that data to an external sink, you are at the mercy of the sink's idempotency. If your sink is a generic JDBC connector or a legacy database that doesn't support transactional writes keyed to the Spark batch ID, you are not doing exactly-once. You are doing at-least-once, and you are quietly praying that your deduplication logic catches the debris.

I once debugged a PII-scrubbing pipeline where a node failure during a foreachBatch sink operation caused a partial write. Because the sink wasn't atomic and the checkpointLocation hadn't updated yet, the next retry wrote the entire batch again. We ended up with duplicate sensitive records in our downstream warehouse. The Spark logs looked "successful," but the data integrity was trash.

Photo by Mario Gogh on Unsplash

The state of the sink

To achieve true exactly-once, your sink must be able to handle the same batch ID twice without side effects. If you are using spark-sql-kafka, you have the advantage of the Kafka offset tracking being baked into the checkpoint. But the second you leave the Kafka ecosystem, you are in the wild west.

Consider the delta sink. When you use Delta Lake, the transaction log acts as the source of truth for the batch ID. Spark writes the data, then commits the transaction. If a crash happens midway, the data files are written, but the commit fails. Upon restart, Spark sees the failed transaction, ignores the orphaned files, and tries again. This works because Delta supports atomic commits.

Compare that to a standard mode("append") write to a legacy SQL database. There is no atomic commit here. There is no "batch ID" metadata stored in the target table. If your executor dies after writing 50% of the rows but before finishing the commit, those 50% remain. The retry writes them again. You are now leaking duplicates. Unless you are manually implementing a MERGE INTO using a unique constraint or a primary key—which carries a heavy performance tax—you aren't doing exactly-once. You are doing "at-least-once with a post-hoc cleanup script."
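
To make the missing "batch ID" concrete, here is a minimal sketch of a sink that records the batch ID in the same transaction as the rows, so a replayed batch becomes a no-op. It uses SQLite from the Python standard library as a stand-in for the legacy database. The table names, the write_batch helper, and the sample rows are all hypothetical, and a real JDBC sink would need the same pattern expressed in its own transaction API.

```python
import sqlite3

def write_batch(conn, batch_id, rows):
    """Write a micro-batch once; replays of the same batch_id are no-ops."""
    with conn:  # one transaction: commit on success, roll back on any error
        already = conn.execute(
            "SELECT 1 FROM committed_batches WHERE batch_id = ?", (batch_id,)
        ).fetchone()
        if already:
            return False
        conn.executemany(
            "INSERT INTO orders (order_id, amount) VALUES (?, ?)", rows
        )
        conn.execute(
            "INSERT INTO committed_batches (batch_id) VALUES (?)", (batch_id,)
        )
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL NOT NULL)")
conn.execute("CREATE TABLE committed_batches (batch_id INTEGER PRIMARY KEY)")

rows = [("o-1", 10.0), ("o-2", 25.5)]
write_batch(conn, 7, rows)
write_batch(conn, 7, rows)  # simulated retry of the same batch
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # prints 2
```

The rows and the batch marker commit together or not at all, which is the property the plain append write lacks.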

The reality of checkpointing failure

We treat checkpointLocation as a holy object. We assume that if we store it on S3, we are safe. We aren't.

In production environments using Spark 3.x, I’ve seen consistent-hashing issues and S3 eventual consistency bugs (though largely mitigated by newer S3A committers) lead to corrupted checkpoints. When the offsets or commits directory in your checkpoint path gets corrupted, your streaming job enters a death spiral.

You can’t just "fix" a corrupted checkpoint. You are forced to choose: lose the state and reset the source offset (creating a gap in data), or force-start from a previous checkpoint and deal with the inevitable re-processing of data you’ve already sunk. Neither of these options is "exactly-once." They are "emergency recovery procedures." If you aren't logging your source offsets in a separate, immutable metadata store, you are flying blind during these failures.
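
As a rough illustration of logging offsets outside the checkpoint, here is a minimal sketch of an append-only offset log written as JSON lines. The file location, the record_offsets and latest_offsets helpers, and the offset values are hypothetical. In a real job the offsets would come from the query's progress reporting (for example the endOffset of each source in query.lastProgress), and the log would live somewhere more durable than a local temp directory.

```python
import json
import os
import tempfile

def record_offsets(log_path, batch_id, end_offsets):
    """Append one JSON line per batch; existing lines are never rewritten."""
    entry = {"batch_id": batch_id, "end_offsets": end_offsets}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, sort_keys=True) + "\n")

def latest_offsets(log_path):
    """Return the last recorded entry, or None if nothing has been logged."""
    if not os.path.exists(log_path):
        return None
    last = None
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                last = json.loads(line)
    return last

# Hypothetical Kafka-style offsets: topic -> partition -> offset.
log_path = os.path.join(tempfile.mkdtemp(), "offsets.jsonl")
record_offsets(log_path, 41, {"orders": {"0": 1200, "1": 1187}})
record_offsets(log_path, 42, {"orders": {"0": 1250, "1": 1240}})
print(latest_offsets(log_path))
```

With a log like this, a corrupted checkpoint stops being a guessing game: you know the last offsets you processed, and can decide where to restart from.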

Photo by MARIOLA GROBELSKA on Unsplash

The objections (and my answers)

"But the documentation says it’s exactly-once!"

The documentation is correct about the internal state management. If you are doing aggregations in memory using mapGroupsWithState and only writing to a Delta table, the engine guarantees that the state store and the transaction log stay in sync. My objection isn't to the engine's internal math; it’s to the delusion that the engine lives in a vacuum. Your system includes your sink, your network, and your storage provider. If any of those don't support atomic, idempotent writes, the guarantee breaks.

"Just use Kafka as a sink and it's fine."

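Kafka is a great buffer, but it’s not an analytical store. It also isn’t fine on its own terms: Spark’s Kafka integration guide documents the Kafka sink as at-least-once, so a retried batch can produce duplicate messages. And if your pipeline is feeding a BI tool, you eventually have to land that data somewhere else. The moment you use a custom sink or a non-transactional database, the "exactly-once" promise evaporates. You are then responsible for the write-ahead-log pattern yourself.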

"We use a primary key to deduplicate, so it's effectively exactly-once."

That is an operational workaround, not a semantic guarantee. If your database performance degrades because you’re running UPSERT logic on every incoming stream to handle the duplicates that Spark created, you haven't solved the problem; you've just shifted the cost from the storage layer to the compute layer.
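
For reference, the primary-key workaround usually looks something like the sketch below. It again uses SQLite from the Python standard library as a stand-in for the real database, and the ledger table, the txn_id key, and the sink_rows helper are hypothetical. Every insert has to consult the unique index, which is the compute cost described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (txn_id TEXT PRIMARY KEY, amount REAL NOT NULL)")

def sink_rows(conn, rows):
    """Insert rows, skipping any txn_id that already exists in the table."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO ledger (txn_id, amount) VALUES (?, ?)", rows
        )

def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0]

batch = [("t-1", 100.0), ("t-2", 40.0), ("t-3", 12.5)]

sink_rows(conn, batch[:2])  # simulated partial write before a crash
sink_rows(conn, batch)      # the retry replays the whole batch
print(row_count(conn))      # prints 3
```

Note that INSERT OR IGNORE keeps the first version of a row and silently drops the replay, so this only holds if replayed rows are identical to the originals.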

Conclusion

Exactly-once is a goal, but in Spark Structured Streaming, it is never a default. It is a configuration of your entire stack.

If you want to get closer to the truth, stop trusting the framework to handle everything. Use transactional sinks like Delta Lake or Apache Hudi. If you are forced to use a legacy sink, build idempotency into your data model using unique business keys. Monitor your checkpointLocation as if it were your production database, because that’s exactly what it is.

Stop telling stakeholders you have "exactly-once" semantics. Tell them you have "idempotent processing pipelines with a defined recovery point." It sounds less like a marketing brochure and more like the actual engineering work you're doing.

Tags: #spark #streaming #data #architecture

Cover photo by Brian Cockley on Unsplash.

Is Your Lakehouse Architecture Just a High-Priced Tax on Your Data Team?

Aniket Abhishek Soni — Wed, 01 Jul 2026 22:48:54 +0000

Ninety-two percent of data platform migrations I’ve audited in the last three years ended up costing more in "operational tax" than they saved in raw compute efficiency. We talk about TCO (Total Cost of Ownership) like it’s a math problem, but it’s actually a human behavior problem. The choice between BigQuery and Databricks SQL isn't about which engine can scan a petabyte faster; it’s about whether you want to spend your weekends debugging slot allocation or tuning Delta Lake vacuum intervals.

I’ve spent the last six years keeping financial services and healthcare workloads upright. I’ve seen BigQuery’s INFORMATION_SCHEMA save a QBR and I’ve seen Databricks’ OPTIMIZE commands accidentally lock a table during a critical financial close. If you’re choosing based on a vendor slide deck, you’re already behind. Here is the field guide to not blowing your cloud budget while trying to build a "lakehouse."

1. The "Slot" Trap vs. The "Warehouse" Mirage

BigQuery’s shift to Editions pricing (Standard, Enterprise, Enterprise Plus) was the industry’s way of saying "we want predictable, Databricks-style billing." But here’s the reality: if you aren't using Reservations, you aren't using BigQuery. I’ve seen teams blow $50k in a weekend because a rogue SELECT * on a multi-petabyte partitioned table hit on-demand pricing.

In Databricks, you’re buying "SQL Warehouses." The failure mode here is over-provisioning. If you leave a 2XL warehouse running 24/7 because your analysts "need it to be fast," you’re lighting money on fire. BigQuery is inherently multi-tenant; Databricks is isolated. If you have 50 different departments, BigQuery manages the concurrency better out of the box. If you have a few massive, complex jobs that need predictable performance, you want a dedicated Databricks SQL Warehouse.

Photo by Monisha Selvakumar on Unsplash

2. Partitioning Isn't Optional; It’s Your Only Defense

In BigQuery, if you don't filter by your partition column (_PARTITIONDATE on ingestion-time partitioned tables, or your own date or timestamp column otherwise), you are paying for a full table scan. Period. I’ve seen junior engineers write queries that scanned 40TB of data for a single dashboard refresh.

In Databricks, the ZORDER BY clause of OPTIMIZE is your best friend. If you aren't Z-ordering your high-cardinality columns, you’re missing the point of Delta Lake.

-- BigQuery: Never skip the filter, or get fired.
SELECT * FROM `my_project.my_dataset.events` 
WHERE _PARTITIONDATE >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);

-- Databricks: Z-ORDER is the performance multiplier.
OPTIMIZE my_table 
ZORDER BY (customer_id, event_type);

If you ignore these, you’re paying for the vendor’s inefficiency. In BigQuery, you pay for the scan. In Databricks, you pay for the time the cluster spent scanning.
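
The two billing models reduce to different arithmetic, sketched below. Both functions take prices as arguments, because the real numbers depend on your region, edition, and contract. The values in the example are placeholders I made up for illustration, not list prices. Only the 40TB scan size comes from the story above.

```python
TIB = 1024 ** 4  # bytes in one tebibyte

def on_demand_scan_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Scan-based billing: cost scales with bytes read, not with runtime."""
    return bytes_scanned / TIB * price_per_tib

def warehouse_cost(runtime_hours: float, dbu_per_hour: float, price_per_dbu: float) -> float:
    """Time-based billing: cost scales with how long the warehouse is up."""
    return runtime_hours * dbu_per_hour * price_per_dbu

# Placeholder prices, for illustration only.
full_scan = on_demand_scan_cost(40 * TIB, price_per_tib=5.0)
pruned_scan = on_demand_scan_cost(40 * TIB * 7 // 365, price_per_tib=5.0)
always_on = warehouse_cost(runtime_hours=24, dbu_per_hour=10.0, price_per_dbu=0.5)

print(full_scan, round(pruned_scan, 2), always_on)
```

The useful part is the shape, not the numbers: a partition filter shrinks the first formula's input, while the second only shrinks when the warehouse actually shuts off.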

3. The "Vacuum" and "Snapshot" Tax

One of the biggest hidden costs in Databricks is storage bloat. Because Delta Lake keeps snapshots for time travel, if you don't run VACUUM regularly, your storage bill will grow indefinitely. I’ve seen terabytes of "deleted" data sitting in S3/ADLS buckets that Databricks users forgot to prune.

-- Databricks: Pruning old snapshots to save storage costs
VACUUM my_table RETAIN 168 HOURS; -- Keep 7 days of history

BigQuery handles this with a bounded time-travel window (at most seven days) and expiration settings on datasets, tables, and partitions. It’s "set it and forget it." If you lack the discipline to manage a VACUUM schedule, Databricks will eventually bite your budget in the ass.

4. Concurrency is a Lie

Marketing teams love to talk about "limitless concurrency." Both platforms handle it, but they handle it differently. BigQuery uses a distributed scheduler that tries to fit your query into the available slots. If you have 2,000 slots and you trigger 5,000 slots worth of work, BigQuery will queue your queries. That's a latency hit, but not a failure.

Databricks SQL Warehouses (Serverless) have a "scaling out" threshold. When your warehouse gets slammed, it spawns new clusters to handle the load. This is great until you hit your regional limit for cloud instances or your bill hits the stratosphere because you triggered five extra clusters to run a 2-second query. Monitor your warehouse scaling events like a hawk (if system tables are enabled in your workspace, system.compute.warehouse_events records the scale-ups).

5. The "Governance" Penalty

Healthcare data requires ironclad access control. BigQuery’s integration with IAM is native and absolute. If you are already deep in the Google Cloud ecosystem, BigQuery’s row-level security and column-level masking (via Policy Tags) are incredibly easy to implement.

Databricks uses Unity Catalog. It’s powerful, but it’s a second layer of governance you have to maintain outside of your cloud provider’s IAM. If your organization is already struggling with identity management, adding Unity Catalog adds another point of failure. Don't underestimate the "cognitive load" of managing two sets of permissions.

6. Cold Starts and Serverless Latency

BigQuery is always "warm." You send a request, it runs. Databricks SQL Serverless has gotten much faster, but there is still a spin-up time for those clusters if they’ve been idle. If your users are clicking around a Looker dashboard, they will notice the 3-5 second lag on the first click if your warehouse was cold.

If your users are impatient (and they are), you will end up keeping warehouses running longer than you need to, just to avoid the "Why is the dashboard slow?" Slack messages. That’s a hidden cost of the Databricks architecture.

Photo by Giancarlo Revolledo on Unsplash

7. Vendor Lock-in is a Myth; Portability is a Pipe Dream

People choose Databricks because they want to "own" their data in Parquet/Delta format. They choose BigQuery because they want it to "just work."

Here is the truth: you aren't going to migrate 500TB of data from BigQuery to Databricks because you had a bad quarter. You are locked in by your ingestion pipelines and your BI tool semantic layers. Pick the one that fits your current team’s skillset. If your team knows Spark, Databricks is the path of least resistance. If your team is SQL-first and hates infrastructure management, BigQuery is the only logical choice.

Conclusion

BigQuery is a managed service that demands you play by its rules—partitioning, slot management, and Google-native IAM. Databricks is a platform that gives you more control but demands you manage the complexity—vacuuming, Z-ordering, and catalog governance.

If you want a "lakehouse" that functions like a database, pay the BigQuery tax and embrace the simplicity. If you want a data science powerhouse that happens to run SQL, pay the Databricks tax and hire a good platform engineer to clean up your mess.

Which one is keeping your CFO up at night, and what are you going to do about it tomorrow morning?

Cover photo by Gavin Allanwood on Unsplash.