Delta Lake for Dummies: ACID Transactions, Time Travel & Delta Tables
If there's one concept in this entire series that separates a data engineer who knows Databricks from one who truly gets it — it's Delta Lake.
It's the technology that makes your data lake reliable. It's what turns a folder of Parquet files into something that behaves like a proper database. And it's baked into everything you build in Databricks.
Let's break it down from scratch.
What is Delta Lake and Why It Exists
Before Delta Lake, data lakes had a serious reputation problem.
You'd store raw files in S3 or ADLS — CSVs, JSON, Parquet — and it was cheap and flexible. But the moment something went wrong, you were in trouble:
- A pipeline failed halfway through a write — now half your data is new, half is old. Which half do you trust?
- Someone ran DELETE FROM customers WHERE country = 'DE' and it deleted the wrong rows. Can you undo it?
- Two pipelines wrote to the same folder at the same time. The files are now corrupted.
- A schema changed upstream and now your downstream queries break. When did the schema change?
Traditional data lakes had no good answers to these questions. They were fast and cheap, but fragile.
Delta Lake is an open-source storage layer built on top of Parquet files that solves all of these problems by adding:
- ✅ ACID transactions
- ✅ Schema enforcement and evolution
- ✅ Time travel (version history)
- ✅ Upserts (MERGE)
- ✅ Scalable metadata management
And the best part: your data is still stored as regular Parquet files in your cloud storage. Delta doesn't lock you into a proprietary format.
What ACID Transactions Mean in Plain English
ACID is a set of properties that guarantee database operations are reliable. Here's what each letter means — without the textbook definition:
A — Atomicity: All or Nothing
Either the entire operation succeeds, or nothing changes. There's no "half-written" state.
Writing 1,000,000 rows to a Delta table:
✅ All 1,000,000 rows written successfully → committed
❌ Failure at row 600,000 → all changes rolled back, original data intact
Without atomicity (raw Parquet): you'd end up with 600,000 rows written and 400,000 missing. Your data is now broken.
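Here's what that guarantee looks like in practice. A minimal sketch, assuming a Delta table already exists at the path used throughout this article (bad_df is a hypothetical DataFrame whose append fails partway through):
path = "/mnt/processed/sales_silver/"

before = spark.read.format("delta").load(path).count()

try:
    # bad_df is hypothetical: imagine this append dies at row 600,000
    # (task error, killed cluster, anything)
    bad_df.write.format("delta").mode("append").save(path)
except Exception:
    pass

# The failed commit never made it into the transaction log,
# so the table reads exactly as it did before the attempt
after = spark.read.format("delta").load(path).count()
assert before == after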
C — Consistency: Data is Always Valid
The table is always in a valid state. Schema rules are enforced. A write that violates the schema gets rejected — it doesn't silently corrupt your data.
I — Isolation: Concurrent Operations Don't Conflict
Multiple pipelines can read from and write to the same Delta table simultaneously without corrupting each other's work.
D — Durability: Committed Data Survives Failures
Once a transaction is committed, it's permanent. A cluster crash after a successful write won't lose your data.
The Delta Transaction Log: The Secret Behind It All
Here's how Delta Lake actually implements all of this. Every Delta table has a hidden folder called _delta_log/.
/mnt/processed/sales/
├── _delta_log/
│ ├── 00000000000000000000.json ← version 0
│ ├── 00000000000000000001.json ← version 1
│ ├── 00000000000000000002.json ← version 2
│ └── ...
├── part-00000-abc123.parquet
├── part-00001-def456.parquet
└── part-00002-ghi789.parquet
Every time you write, update, or delete data, Delta records exactly what changed in a new JSON log file. These log files are the transaction log — an ordered history of every operation ever performed on the table.
When Spark reads a Delta table, it reads the transaction log first to understand the current state, then reads only the relevant Parquet files. This is what enables time travel, concurrent writes, and ACID guarantees — it's all tracked in the log.
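You can peek at the log yourself. A minimal sketch, assuming the sales_silver table built in the next section already exists (each commit file is line-delimited JSON, so Spark can read it directly):
# Each commit file holds action records: "add" (new data files), "remove",
# "metaData" (schema), and "commitInfo" (what operation ran, and when)
log = spark.read.json(
    "/mnt/processed/sales_silver/_delta_log/00000000000000000000.json"
)

# Which Parquet files did this commit add, and which operation created them?
log.select("add.path", "commitInfo.operation").show(truncate=False)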
Creating Your First Delta Table
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("customer_id", IntegerType(), True),
StructField("product", StringType(), True),
StructField("amount", DoubleType(), True),
StructField("order_date", DateType(), True),
StructField("region", StringType(), True)
])
df = spark.read.csv("/mnt/raw/sales.csv", schema=schema, header=True)
# Write as Delta — this is all it takes
df.write \
.format("delta") \
.mode("overwrite") \
.save("/mnt/processed/sales_silver/")
Or create one directly with SQL:
%sql
CREATE TABLE IF NOT EXISTS sales_silver (
order_id INT,
customer_id INT,
product STRING,
amount DOUBLE,
order_date DATE,
region STRING
)
USING DELTA
LOCATION '/mnt/processed/sales_silver/';
From this point on, every operation on this table is tracked, transactional, and recoverable.
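Want proof? A quick sanity check using standard Delta Lake APIs:
from delta.tables import DeltaTable

# True if the path contains a valid _delta_log folder
print(DeltaTable.isDeltaTable(spark, "/mnt/processed/sales_silver/"))

# Table-level metadata: format, location, file count, size, and more
spark.sql("DESCRIBE DETAIL sales_silver").show(truncate=False)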
Updating, Deleting, and Merging Data
One of the biggest limitations of plain Parquet files is that you can't easily update or delete individual rows. With Delta Lake, you can.
UPDATE
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/mnt/processed/sales_silver/")
# Update all rows where region is null
delta_table.update(
condition = "region IS NULL",
set = {"region": "'Unknown'"}
)
%sql
UPDATE sales_silver
SET region = 'Unknown'
WHERE region IS NULL;
DELETE
delta_table.delete("order_date < '2020-01-01'")
%sql
DELETE FROM sales_silver
WHERE order_date < '2020-01-01';
MERGE (UPSERT): The Most Powerful Operation
MERGE lets you insert new records and update existing ones in a single atomic operation. This is the foundation of incremental pipeline patterns.
from delta.tables import DeltaTable
# Existing table (target)
target = DeltaTable.forPath(spark, "/mnt/processed/sales_silver/")
# Incoming new/updated data (source)
source = spark.read.csv("/mnt/raw/sales_incremental.csv", header=True, inferSchema=True)
# MERGE: update if exists, insert if new
target.alias("target").merge(
source.alias("source"),
"target.order_id = source.order_id"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
Or in SQL (this assumes the incoming data is registered as a table or temporary view first, e.g. via source.createOrReplaceTempView("sales_incremental")):
%sql
MERGE INTO sales_silver AS target
USING sales_incremental AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN
UPDATE SET *
WHEN NOT MATCHED THEN
INSERT *;
This pattern — MERGE on a unique key — is how most production data pipelines handle incremental loads. Instead of reprocessing everything daily, you only process what changed.
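One practical caveat before you rely on this: MERGE requires at most one source row per matched target row. If the incremental feed can contain the same order_id more than once, deduplicate first. A minimal sketch (keeping the latest record per key is an assumption about what "correct" means for your data):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep only the most recent record per order_id before merging
w = Window.partitionBy("order_id").orderBy(F.col("order_date").desc())

deduped_source = (
    source.withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .drop("rn")
)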
Time Travel: Querying Historical Data
Every write to a Delta table creates a new version. Delta keeps the full history, which means you can query any past version of your table.
See the Table History
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/mnt/processed/sales_silver/")
delta_table.history().show(truncate=False)
%sql
DESCRIBE HISTORY sales_silver;
Output:
version | timestamp | operation | operationParameters
--------|---------------------|------------|--------------------
3 | 2024-03-15 14:23:11 | MERGE | ...
2 | 2024-03-14 09:10:05 | DELETE | ...
1 | 2024-03-13 18:45:32 | UPDATE | ...
0 | 2024-03-12 11:00:00 | WRITE | ...
Query a Previous Version
# By version number
df_v1 = spark.read.format("delta") \
.option("versionAsOf", 1) \
.load("/mnt/processed/sales_silver/")
# By timestamp
df_yesterday = spark.read.format("delta") \
.option("timestampAsOf", "2024-03-14") \
.load("/mnt/processed/sales_silver/")
%sql
-- Query table as it was at version 1
SELECT * FROM sales_silver VERSION AS OF 1;
-- Query table as it was yesterday
SELECT * FROM sales_silver TIMESTAMP AS OF '2024-03-14';
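Time travel also makes it easy to answer "what changed?". A minimal sketch that diffs the current table against version 1 (exceptAll is a standard DataFrame operation):
current = spark.read.format("delta").load("/mnt/processed/sales_silver/")

v1 = (spark.read.format("delta")
          .option("versionAsOf", 1)
          .load("/mnt/processed/sales_silver/"))

# Rows present now that were not in version 1
current.exceptAll(v1).show()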
Restore a Previous Version
Made a mistake? Roll back the entire table to a previous state:
delta_table.restoreToVersion(1)
%sql
RESTORE TABLE sales_silver TO VERSION AS OF 1;
💡 This is one of the most powerful features in Delta Lake. Accidental deletes, bad pipeline runs, schema mistakes — all recoverable in seconds.
Schema Enforcement and Evolution
Schema Enforcement (Default)
Delta Lake rejects writes that don't match the table's schema. This protects you from silent data corruption.
# If sales_silver expects DOUBLE for amount,
# and you try to write STRING — Delta rejects it
df_wrong.write \
.format("delta") \
.mode("append") \
.save("/mnt/processed/sales_silver/")
# AnalysisException: A schema mismatch detected when writing to the Delta table
This is a feature, not a bug. It catches upstream schema changes before they silently break your downstream tables.
Schema Evolution (Opt-in)
When you intentionally want to add new columns to an existing table:
df_with_new_column.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save("/mnt/processed/sales_silver/")
Delta adds the new column to the schema and backfills null for all previous rows. Existing data and queries are unaffected.
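If you'd rather evolve the schema explicitly instead of relying on mergeSchema, plain DDL works too (discount is a hypothetical new column used for illustration):
# Adds the column to the table schema; existing rows read it back as null
spark.sql("ALTER TABLE sales_silver ADD COLUMNS (discount DOUBLE)")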
Optimizing Delta Tables
Over time, Delta tables accumulate many small Parquet files (especially with frequent incremental writes). Two maintenance operations keep them healthy:
OPTIMIZE — Compact Small Files
%sql
OPTIMIZE sales_silver;
-- Optimize and co-locate similar data for faster range queries
OPTIMIZE sales_silver ZORDER BY (order_date, region);
ZORDER BY physically co-locates rows with similar values in the same files — dramatically speeding up queries that filter on those columns.
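Recent Delta Lake releases expose the same operations through the Python API. A minimal sketch (treat availability as an assumption; it depends on your Delta/Databricks runtime version):
from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "/mnt/processed/sales_silver/")

# Compact small files
dt.optimize().executeCompaction()

# Compact and co-locate rows by the given columns
dt.optimize().executeZOrderBy("order_date", "region")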
VACUUM — Remove Old Files
Old data files are kept on disk after updates and deletes so that time travel can work. VACUUM cleans up files older than a retention threshold:
%sql
-- Default: removes files older than 7 days
VACUUM sales_silver;
-- Custom retention period (anything below the 168-hour default
-- requires disabling Delta's retention safety check first)
VACUUM sales_silver RETAIN 30 HOURS;
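The same cleanup is available from Python. A minimal sketch, reusing the delta_table handle from earlier (the argument is in hours, and the safety-check config below is a real Delta setting, though the values here are illustrative):
# Remove files older than 7 days (168 hours), same as the default
delta_table.vacuum(168)

# Retaining less than 168 hours requires disabling Delta's safety check
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
delta_table.vacuum(30)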
⚠️ After VACUUM, you can no longer time travel to versions older than the retention period. Never VACUUM with 0 hours retention in production — you'll lose your history.
Wrapping Up
Here's what you've covered in this article:
- Delta Lake adds ACID transactions, history, and reliability to your data lake — without changing your storage
- ACID means: atomic writes, consistent state, isolated concurrency, durable commits
- The transaction log (_delta_log/) is the engine behind all Delta's features
- You can UPDATE, DELETE, and MERGE rows — impossible with plain Parquet
- Time travel lets you query or restore any past version of your table
- Schema enforcement protects you from upstream changes breaking downstream tables
- OPTIMIZE and VACUUM keep your tables performant and clean
In the next article, we start building the real thing: the Medallion Architecture — Bronze, Silver, and Gold layers — the design pattern we'll use to build our data warehouse.