DEV Community

Sankalp
ACID with Delta tables

************************ ACID properties *******************************

Atomicity:

  • All or nothing.
  • A transaction must complete all of its operations successfully before it commits; otherwise the entire transaction is rolled back.

Consistency:

  • The database moves from one valid state to another: in a transfer between accounts A and B, the total money across A and B must be the same before and after the transaction.

Isolation:

  • Concurrent transactions must not interfere with each other: any parallel schedule should be equivalent to some serial schedule (serializability), so the outcome is as if the transactions ran one after another.

Durability:

  • Once a transaction commits, its changes are permanent and survive system failures.
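
The four properties above can be sketched with a minimal, plain-Python money transfer (an illustration only, not Delta-specific; the `transfer` helper and account dict are made up for this example). The transfer either applies both updates (atomicity) or restores the saved snapshot (rollback), and the total balance stays constant (consistency).

```python
# Minimal sketch of atomicity + consistency: apply both updates or roll back.

def transfer(accounts, src, dst, amount):
    """Move `amount` from src to dst; commit only if every step succeeds."""
    snapshot = dict(accounts)  # saved state, used for rollback
    try:
        if accounts[src] < amount:
            raise ValueError("insufficient funds")
        accounts[src] -= amount
        accounts[dst] += amount  # both updates succeeded -> implicit commit
    except Exception:
        accounts.clear()
        accounts.update(snapshot)  # rollback: restore pre-transaction state
        raise

accounts = {"A": 100, "B": 50}
transfer(accounts, "A", "B", 30)       # succeeds: A=70, B=80
try:
    transfer(accounts, "A", "B", 500)  # fails: rolled back, balances unchanged
except ValueError:
    pass
```

Note the consistency check: the sum of A and B is 150 before and after every transaction, whether it commits or rolls back.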

********************** Parquet Vs Delta file format *******************

Parquet:

Type : Columnar storage format.
Optimized for : Efficient read performance, especially for data analytics.

Key features : 1.) Column-wise compression --> reduces file size.
               2.) Splittable files --> parallel processing.
               3.) Works well with Hive, Spark, BigQuery, etc.

Delta:

Built on : Parquet format + a transaction log layer (_delta_log).
Optimized for : Reliable, scalable data lakes with support for ACID transactions.

Key features : 1.) ACID transactions (safe concurrent reads/writes).

               2.) Schema Enforcement : Delta Lake ensures that the data written to a table matches the table’s schema.
                   This prevents issues like inserting a string into a column that expects an integer.

               3.) Schema Evolution : When enabled, Delta can automatically adapt to changes in the schema (e.g., adding
                   new columns) during write operations. This is useful for agile data pipelines where the schema may
                   evolve over time.

               4.) Time Travel (query past versions) : Delta Lake maintains a transaction log (_delta_log) that records
                   every change to the data. You can query a table as it existed at a specific version or point in time,
                   which is useful for debugging and auditing.

                   e.g.
                   SELECT * FROM table_name VERSION AS OF 5;
                   -- or
                   SELECT * FROM table_name TIMESTAMP AS OF '2025-06-10T12:00:00';

               5.) Ideal for streaming + batch (unified workflows) :
                   > Delta Lake supports both streaming and batch reads/writes on the same table.
                   > This unification simplifies architecture: you don’t need separate pipelines or storage for
                     real-time and historical data.

                   e.g. You can ingest real-time data using Spark Structured Streaming and run batch analytics
                   on the same Delta table.
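Schema enforcement (feature 2 above) can be sketched conceptually in plain Python. This is only an illustration of the idea, not Delta Lake's actual implementation: the `schema` dict and `enforce_schema` helper are hypothetical names invented for this sketch.

```python
# Conceptual sketch of schema enforcement: reject rows whose columns or
# types don't match the declared table schema (not Delta's real API).

schema = {"id": int, "name": str, "amount": float}

def enforce_schema(row, schema):
    """Return the row if it matches the schema; raise otherwise."""
    for column, expected_type in schema.items():
        if column not in row:
            raise ValueError(f"missing column: {column}")
        if not isinstance(row[column], expected_type):
            raise TypeError(
                f"column '{column}' expects {expected_type.__name__}, "
                f"got {type(row[column]).__name__}"
            )
    return row

enforce_schema({"id": 1, "name": "a", "amount": 9.5}, schema)  # accepted
try:
    # A string where an int is expected -- the write is rejected,
    # which is exactly what Delta's schema enforcement prevents.
    enforce_schema({"id": "x", "name": "a", "amount": 9.5}, schema)
except TypeError as err:
    rejected = str(err)
```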

Conclusion:
Use Parquet when you need fast, storage-efficient analytics on append-only data.
Use Delta when you need reliability, schema control, time travel, and transactional operations on top of Parquet.

************************ SCD Types *******************************

SCD Type 0 : Fixed. No changes allowed (e.g., date of birth).

SCD Type 1 : Overwrite. Old data is overwritten; no history is kept. Simple, but loses historical data.

SCD Type 2 : Add row. A new row for every change, with versioning or effective dates. Full history preserved.

SCD Type 3 : Add column. Adds a new column to track the previous value. Limited history (usually just one change).

SCD Type 4 : History table. A separate historical table stores changes; the main table holds current data. Good for large history storage.

SCD Type 6 : Hybrid. Combination of Types 1, 2, and 3. Tracks full history while keeping current data easily accessible.
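
As a sketch, SCD Type 2 ("add row") can be illustrated with plain Python dictionaries. Field names like `effective_from`, `effective_to`, and `is_current` are common conventions but illustrative choices here, not a fixed standard, and `scd2_update` is a made-up helper for this example.

```python
# Illustrative SCD Type 2: close out the current row and append a new
# versioned row instead of overwriting, so full history is preserved.

def scd2_update(history, key, new_value, change_date):
    """Mark the current row for `key` as expired and append the new version."""
    for row in history:
        if row["key"] == key and row["is_current"]:
            row["is_current"] = False
            row["effective_to"] = change_date  # close out the old version
    history.append({
        "key": key,
        "value": new_value,
        "effective_from": change_date,
        "effective_to": None,      # open-ended: this is the current version
        "is_current": True,
    })

history = [{"key": "cust1", "value": "Delhi",
            "effective_from": "2024-01-01", "effective_to": None,
            "is_current": True}]
scd2_update(history, "cust1", "Mumbai", "2025-06-10")
# history now holds both the old (closed) and the new (current) row
```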
