Shrinivas Vishnupurikar

Posted on Jan 5

Schema, COPY, MERGE, and Immutability — A First-Principles Guide for Data Engineers

#dataengineering

In modern data engineering conversations, terms like schema-on-read, schema-on-write, COPY, MERGE, and immutable partitions are used very often.

When I started my career in data engineering, I often heard my senior engineers (my mentors) mention these terms while designing and architecting pipelines and data delivery platforms. Back then, they were merely combinations of words to me, but for the seniors they were design choices that could make or break project deliverables.

These terms may look isolated, but they are deeply connected, and together they shape data processing patterns and data management techniques.

You could start by understanding them in isolation, but understanding them together builds a strong mental model that applies across systems, formats, and platforms.

This article explains these ideas from first principles, then connects them to open table formats, with a strong focus on why these patterns exist.

1. What Is a “Schema” — First Principles

At its core, a schema is a contract: an agreement between you and the data about what shape the data will take.

A schema typically defines:

What fields exist
What each field represents
What type of data each field can hold
Whether fields are optional or mandatory

For this article, the most important question is not what the schema is, but when the system enforces it.

That single question gives birth to two fundamental models:

Schema-on-Write
Schema-on-Read

1.1 Schema-on-Write (Validate First, Store Later)

Definition

Schema-on-write enforces structure before data is stored in the target system. If incoming data does not match the expected structure, it is rejected.

Think of schema-on-write like a strict security checkpoint at a college entrance.

The college has a predefined rule: only students with a valid college ID are allowed inside.
The security guard checks the ID before allowing entry.
If you do not have the ID, you are stopped at the gate.

In this analogy:

The college premises represent the target data system
The college ID card represents the schema
The security check represents schema validation
Entry is allowed only if the contract is satisfied

Intuition

“Only clean, trusted data is allowed inside.”

Story Example

Consider a bank account system.

Before saving a record:

Account number must exist
Balance must be numeric
Date fields must be valid

If anything is incorrect, the record is refused. Bad data never enters the system.
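The bank account checks can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database enforces its schema: the field names, `ACCOUNT_SCHEMA`, `validate`, and `write` are all hypothetical, and the "table" is just a list in memory.

```python
from datetime import date

# Hypothetical schema for the bank-account example: field name -> required type.
ACCOUNT_SCHEMA = {"account_number": str, "balance": float, "opened_on": date}


def validate(record: dict) -> list:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    for field, expected_type in ACCOUNT_SCHEMA.items():
        if field not in record or record[field] is None:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field} must be {expected_type.__name__}")
    return errors


def write(table: list, record: dict) -> None:
    """Schema-on-write: validate first, store only if the contract holds."""
    errors = validate(record)
    if errors:
        raise ValueError("; ".join(errors))
    table.append(record)


accounts = []
write(accounts, {"account_number": "ACC-1", "balance": 250.0, "opened_on": date(2024, 1, 5)})

try:
    # The balance is a string here, so the record is refused at the gate.
    write(accounts, {"account_number": "ACC-2", "balance": "lots", "opened_on": date(2024, 1, 6)})
except ValueError as exc:
    rejected_reason = str(exc)
```

The rejected record never reaches `accounts`, which is the whole point: the cost of validation is paid at ingestion so that readers can assume every stored row honors the contract.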

Why This Pattern Exists

Data correctness is critical
Downstream systems assume reliability
Errors must be detected early

Trade-off

Lower flexibility
Slower ingestion
Schema changes require coordination

1.2 Schema-on-Read (Store First, Decide Later)

Definition

Schema-on-read stores data first and applies structure only when the data is read.

Think of schema-on-read like entering a public library or a large campus.

There is no strict identity check at the gate
Anyone can enter—students, visitors, researchers
The system does not ask who you are or what you will do upfront

Rules are applied only when you access something specific:

Certain rooms may require permission
Certain books may have usage rules
Some resources may only be available to specific people

In this analogy:

The campus or library represents the storage system
Entering freely represents storing raw data without validation
Rules applied later represent schema being enforced at read time
Meaning is decided when you access the data, not when it arrives

Story Example

A mobile app analytics system collects user events:

Different app versions send different fields
Some events are incomplete
New fields appear frequently

Rejecting such data would cause data loss. Instead, everything is stored, and meaning is applied during analysis.
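A rough sketch of the same idea in Python, assuming events arrive as JSON lines. The event fields and the `read_with_schema` helper are made up for illustration; the storage layer is simply a list of raw strings.

```python
import json

# Raw storage: every event is kept exactly as it arrived, one JSON document per line.
raw_events = [
    '{"user_id": "u1", "event": "click", "app_version": "1.0"}',
    '{"user_id": "u2", "event": "scroll", "duration_ms": "120"}',
    '{"event": "click"}',  # incomplete event, stored anyway
]


def read_with_schema(lines):
    """Apply a schema at read time: pick fields, cast types, fill defaults."""
    for line in lines:
        raw = json.loads(line)
        yield {
            "user_id": raw.get("user_id"),                  # None when the app did not send it
            "event": raw.get("event", "unknown"),
            "duration_ms": int(raw.get("duration_ms", 0)),  # cast string -> int, default 0
        }


events = list(read_with_schema(raw_events))
```

Nothing was rejected at write time, so the reader has to decide what a missing `user_id` or a string-typed `duration_ms` means. A different analysis could read the same raw lines with a different schema.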

Why This Pattern Exists

High data variability
Rapid change
Exploration and discovery use cases

Trade-off

Data quality issues surface later
Queries must handle inconsistencies

2. COPY vs MERGE — What Problem Do They Solve?

Schemas define structure. COPY and MERGE define how new data interacts with existing data.

The key question here is:

Does incoming data represent new facts or changes to existing facts?

2.1 COPY (Append-Only Thinking)

Definition
COPY means inserting incoming data as new rows without checking for existing records.

Intuition

“Oh, new data?
Cool. Send it in.

Do I already have something similar?
I don’t care.

Is this a duplicate?
Not my problem.

I’ll just add whatever comes in as a new data point and move on.”

Story Example

A daily sales report:

Yesterday’s sales never change
Today’s sales are new facts

Each day’s data is simply appended.
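As a minimal sketch, COPY-style loading is little more than an append. The `copy_into` function and the sales rows below are hypothetical and only illustrate the behavior, including what happens when a batch is loaded twice.

```python
def copy_into(table: list, batch: list) -> None:
    """COPY-style load: append every incoming row, with no lookup against existing rows."""
    table.extend(batch)


sales = []
day_1 = [{"sale_date": "2024-01-01", "order_id": 1, "amount": 40.0}]
day_2 = [{"sale_date": "2024-01-02", "order_id": 2, "amount": 15.5}]

copy_into(sales, day_1)
copy_into(sales, day_2)
copy_into(sales, day_2)  # an accidental re-run loads the same batch again

row_count = len(sales)
duplicate_orders = row_count - len({row["order_id"] for row in sales})
```

The load never reads the existing table, which is why it is fast, and also why the re-run silently leaves a duplicate of order 2 behind.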

Why This Pattern Exists

Simplicity
High performance
Historical accuracy

Trade-off

Corrections require reprocessing
Duplicates are possible

2.2 MERGE (State-Aware Thinking)

Definition

MERGE updates existing records when matches are found and inserts new records when they are not.

Intuition

“Alright, new data is coming in.

David, if you find something we already have, update it with the latest details.

Carlos, if you don’t recognize the record at all, add it as a brand-new entry.

Bottom line — keep the data up to date.”

Story Example

A customer profile system:

Email remains the same
Address and phone number change
Status evolves over time

The system must reflect the current state, not a trail of outdated versions.
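The matched/not-matched logic can be sketched with a dictionary keyed by email. This is an in-memory illustration under the assumption that email is the merge key; `merge_into` and the customer fields are hypothetical names, not any engine's MERGE implementation.

```python
def merge_into(table: dict, batch: list, key: str) -> None:
    """MERGE-style load: update rows whose key already exists, insert the rest."""
    for row in batch:
        existing = table.get(row[key])
        if existing is None:
            table[row[key]] = dict(row)   # not matched: insert as a new entry
        else:
            existing.update(row)          # matched: overwrite with the latest details


customers = {}
merge_into(customers, [{"email": "a@example.com", "city": "Mumbai", "status": "trial"}], key="email")

batch = [
    {"email": "a@example.com", "city": "Pune", "status": "active"},  # existing customer
    {"email": "b@example.com", "city": "Delhi", "status": "trial"},  # new customer
]
merge_into(customers, batch, key="email")
state_after_first_run = {k: dict(v) for k, v in customers.items()}

merge_into(customers, batch, key="email")  # re-running the same batch changes nothing
```

Unlike the append-only load, replaying the batch leaves the table in the same state, which is what idempotency means here. The price is a key lookup for every incoming row.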

Why This Pattern Exists

Data represents entities, not events
Corrections are expected
Idempotency matters

Trade-off

More complex logic
Higher compute cost

3. All Valid Combinations — Schema × Write Pattern

3.1 Schema-on-Write + COPY

Strict structure, immutable history

Story:
A financial ledger where every transaction must be valid and never changes.

Why it fits:

Data correctness is enforced
History is preserved forever

3.2 Schema-on-Write + MERGE

Strict structure, evolving state

Story:
A customer master table with well-defined fields, where customer details are updated over time.

Why it fits:

Strong data guarantees
Accurate current view

3.3 Schema-on-Read + COPY

Flexible structure, raw history

Story:
An application log store capturing all events, even malformed ones.

Why it fits:

Zero data loss
Future reprocessing is possible

3.4 Schema-on-Read + MERGE

Flexible structure, improving state

Story:
A data enrichment system where records arrive incomplete and get enriched later.

Why it fits:

Allows partial data
Data quality improves over time

4. Where Partition Immutability Enters the Picture

Modern open table formats rely on immutable data files, which are typically grouped into partitions.

Immutability means:

Once written, data files are never modified
Updates and deletes create new files
Old files are logically retired via metadata (often with a property like status = DELETED or valid_to = )

This design enables:

Safe concurrent reads and writes
Reliable versioning
Efficient rollback and recovery
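To make the mechanics concrete, here is a toy model of immutable files plus snapshot metadata. It is a sketch under simplifying assumptions, not the layout of any real table format: the `Table` class, its method names, and the file naming are all invented for illustration, and "files" are tuples held in memory.

```python
class Table:
    """Toy table: data files are written once; snapshots list which files are live."""

    def __init__(self):
        self.files = {}        # file name -> tuple of rows, never modified after write
        self.snapshots = [()]  # snapshot 0 is the empty table

    def _write_file(self, rows):
        name = f"file-{len(self.files)}.data"
        self.files[name] = tuple(rows)
        return name

    def append(self, rows):
        """Append: add one new file, keep every existing file live."""
        new_file = self._write_file(rows)
        self.snapshots.append(self.snapshots[-1] + (new_file,))

    def update(self, predicate, change):
        """Update: rewrite affected files into a new file, retire the old ones in metadata."""
        kept, rewritten_rows = [], []
        for name in self.snapshots[-1]:
            rows = self.files[name]
            if any(predicate(r) for r in rows):
                rewritten_rows.extend({**r, **change} if predicate(r) else r for r in rows)
            else:
                kept.append(name)
        if rewritten_rows:
            kept.append(self._write_file(rewritten_rows))
        self.snapshots.append(tuple(kept))

    def read(self, snapshot_id=-1):
        """Read the rows visible in a given snapshot (latest by default)."""
        return [row for name in self.snapshots[snapshot_id] for row in self.files[name]]


orders = Table()
orders.append([{"id": 1, "status": "new"}])
orders.append([{"id": 2, "status": "new"}])
orders.update(lambda r: r["id"] == 1, {"status": "shipped"})

current = {r["id"]: r["status"] for r in orders.read()}
previous = {r["id"]: r["status"] for r in orders.read(snapshot_id=2)}
```

The update produced a third file instead of touching the first one, so a reader pinned to snapshot 2 still sees the old status. Rolling back is just pointing at an earlier snapshot.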

5. Immutability in Open Table Formats

Apache Iceberg (Primary Focus)

Design philosophy:

Strong immutability
Snapshot-based metadata
Schema evolution as a first-class feature

How patterns map:

COPY → new data files added to a snapshot
MERGE → rewritten files + new snapshot
Schema-on-read and schema-on-write both supported via metadata evolution

Iceberg treats metadata as the control plane and data files as immutable assets.

Delta Lake (Secondary)

Design philosophy:

Transaction log driven
Immutable Parquet files
Strong support for MERGE

How patterns map:

COPY → append entries in the transaction log
MERGE → file rewrites tracked in the log
Schema is enforced on write by default, with schema evolution available as an opt-in

Delta emphasizes transactional reliability with strong ACID guarantees.
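The log-driven idea can be sketched as a replay of add and remove actions. This is a heavily simplified model: a real Delta log stores one JSON file per commit with additional action types and fields, and the `live_files` helper and file paths below are hypothetical.

```python
def live_files(commits, version=None):
    """Replay commits up to and including `version`; return the set of live file paths."""
    if version is None:
        version = len(commits) - 1
    files = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Each commit is a list of actions; the list index plays the role of the table version.
commits = [
    [{"add": {"path": "part-000.parquet"}}],      # version 0: append
    [{"add": {"path": "part-001.parquet"}}],      # version 1: append
    [                                             # version 2: MERGE rewrote part-000
        {"remove": {"path": "part-000.parquet"}},
        {"add": {"path": "part-002.parquet"}},
    ],
]

latest = live_files(commits)
as_of_version_1 = live_files(commits, version=1)
```

An append only adds entries, while a MERGE pairs a remove with an add in the same commit, so readers see either the old file or the new one and never a half-applied change.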

Apache Hudi (Secondary)

Design philosophy:

Designed for mutable datasets
Supports record-level updates
Multiple storage strategies

How patterns map:

COPY → insert-only tables
MERGE → copy-on-write or merge-on-read
Schema evolution supported but more operationally involved

Hudi optimizes for near-real-time updates and streaming workloads.
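The difference between copy-on-write and merge-on-read comes down to when the merge work happens. The sketch below models a base file as a dictionary of key to value and is only a conceptual illustration: `cow_upsert`, `mor_upsert`, and `mor_read` are made-up names, not Hudi APIs.

```python
def cow_upsert(base: dict, updates: dict) -> dict:
    """Copy-on-write: pay at write time by producing a new base with updates applied."""
    return {**base, **updates}


def mor_upsert(delta_log: list, updates: dict) -> list:
    """Merge-on-read: leave the base untouched and append the updates to a log."""
    return delta_log + [updates]


def mor_read(base: dict, delta_log: list) -> dict:
    """Merge-on-read: pay at read time by applying logged updates over the base."""
    merged = dict(base)
    for updates in delta_log:
        merged.update(updates)
    return merged


base = {1: "new", 2: "new"}
updates = {2: "shipped", 3: "new"}

cow_view = cow_upsert(base, updates)   # new base, ready to read directly
log = mor_upsert([], updates)          # cheap write, base unchanged
mor_view = mor_read(base, log)         # merge happens when someone reads
```

Both strategies end at the same logical view. Copy-on-write makes writes heavier and reads cheap, while merge-on-read makes writes cheap and defers the cost to readers or to a later compaction step.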

6. Closing Note

Schema enforcement, write patterns, and immutability are not independent ideas.
They are three dimensions of the same design space:

Schema answers when data is trusted
COPY vs MERGE answers how data evolves
Immutability answers how systems stay correct at scale

Once these are understood together, modern data systems stop feeling complex and start feeling inevitable.

This understanding travels with you—across tools, platforms, and technologies.
