DEV Community

Shrinivas Vishnupurikar
Schema, COPY, MERGE, and Immutability — A First-Principles Guide for Data Engineers

In modern data engineering conversations, terms like schema-on-read, schema-on-write, COPY, MERGE, and immutable partitions are used very often.

When I started my career in data engineering, I often heard my senior engineers (my mentors) mention these terms while designing and architecting pipelines and data delivery platforms. Back then, they were just combinations of words to me, but for the seniors they were design choices that could make or break project deliverables.

These terms may look isolated, but they are deeply connected, and together they shape data processing patterns and data management techniques.

You could start by understanding them in isolation, but understanding them together builds a strong mental model that applies across systems, formats, and platforms.

This article explains these ideas from first principles, then connects them to open table formats, with a strong focus on why these patterns exist.


1. What Is a “Schema” — First Principles

At its core, a schema is a contract: an agreement between you and the data about what that data will look like.

A schema typically defines:

  • What fields exist
  • What each field represents
  • What type of data each field can hold
  • Whether fields are optional or mandatory

For this article, the most important question is not what the schema is, but when the system enforces it.

That single question gives birth to two fundamental models:

  • Schema-on-Write
  • Schema-on-Read

1.1 Schema-on-Write (Validate First, Store Later)

Definition

Schema-on-write enforces structure before data is stored in the target system. If incoming data does not match the expected structure, it is rejected.

Think of schema-on-write like a strict security checkpoint at a college entrance.

  • The college has a predefined rule: only students with a valid college ID are allowed inside.
  • The security guard checks the ID before allowing entry.
  • If you do not have the ID, you are stopped at the gate.

In this analogy:

  • The college premises represent the target data system
  • The college ID card represents the schema
  • The security check represents schema validation
  • Entry is allowed only if the contract is satisfied

Intuition

“Only clean, trusted data is allowed inside.”

Story Example

Consider a bank account system.

Before saving a record:

  • Account number must exist
  • Balance must be numeric
  • Date fields must be valid

If anything is incorrect, the record is refused. Bad data never enters the system.
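The checks above can be sketched in a few lines of Python. This is only an illustration: the field names (`account_number`, `balance`, `opened_on`), the `write_account` helper, and the in-memory list standing in for the target system are all assumptions, not part of any particular database.

```python
from datetime import date
from numbers import Real

# Hypothetical schema: field name -> predicate that must hold before storage.
ACCOUNT_SCHEMA = {
    "account_number": lambda v: isinstance(v, str) and v != "",
    "balance": lambda v: isinstance(v, Real) and not isinstance(v, bool),
    "opened_on": lambda v: isinstance(v, date),
}

accounts = []  # stands in for the target system


def write_account(record):
    """Validate first, store later: reject the record if any field fails."""
    for field, is_valid in ACCOUNT_SCHEMA.items():
        if field not in record or not is_valid(record[field]):
            raise ValueError(f"schema violation on field {field!r}")
    accounts.append(record)


write_account({"account_number": "AC-1", "balance": 250.0, "opened_on": date(2024, 1, 5)})

try:
    write_account({"account_number": "AC-2", "balance": "a lot", "opened_on": date(2024, 1, 6)})
except ValueError as err:
    rejected = str(err)  # the bad record never reaches `accounts`
```

The important detail is the ordering: validation runs before `accounts.append`, so a failed check leaves the stored data untouched.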

Why This Pattern Exists

  • Data correctness is critical
  • Downstream systems assume reliability
  • Errors must be detected early

Trade-off

  • Lower flexibility
  • Slower ingestion
  • Schema changes require coordination

1.2 Schema-on-Read (Store First, Decide Later)

Definition

Schema-on-read stores data first and applies structure only when the data is read.

Think of schema-on-read like entering a public library or a large campus.

  • There is no strict identity check at the gate
  • Anyone can enter—students, visitors, researchers
  • The system does not ask who you are or what you will do upfront

Rules are applied only when you access something specific:

  • Certain rooms may require permission
  • Certain books may have usage rules
  • Some resources may only be available to specific people

In this analogy:

  • The campus or library represents the storage system
  • Entering freely represents storing raw data without validation
  • Rules applied later represent schema being enforced at read time
  • Meaning is decided when you access the data, not when it arrives

Story Example

A mobile app analytics system collects user events:

  • Different app versions send different fields
  • Some events are incomplete
  • New fields appear frequently

Rejecting such data would cause data loss. Instead, everything is stored, and meaning is applied during analysis.
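A minimal sketch of the same idea in Python, assuming events arrive as JSON lines. The event fields, the default values, and the `read_events` helper are hypothetical and chosen only to show where the schema is applied.

```python
import json

# Raw events are stored exactly as they arrive; nothing is rejected.
raw_store = [
    '{"user": "u1", "event": "click", "screen": "home"}',
    '{"user": "u2", "event": "click"}',                        # older app version: no "screen"
    '{"user": "u3", "event": "purchase", "amount": "19.99"}',  # new field, sent as a string
]


def read_events(lines):
    """Apply a schema at read time: pick fields, fill defaults, coerce types."""
    for line in lines:
        event = json.loads(line)
        yield {
            "user": event.get("user"),
            "event": event.get("event"),
            "screen": event.get("screen", "unknown"),
            "amount": float(event.get("amount", 0.0)),
        }


events = list(read_events(raw_store))
```

Because the raw lines are kept as they are, a different reader could later apply a different schema to the same data without re-ingesting anything.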

Why This Pattern Exists

  • High data variability
  • Rapid change
  • Exploration and discovery use cases

Trade-off

  • Data quality issues surface later
  • Queries must handle inconsistencies

2. COPY vs MERGE — What Problem Do They Solve?

Schemas define structure. COPY and MERGE define how new data interacts with existing data.

The key question here is:

Does incoming data represent new facts or changes to existing facts?

2.1 COPY (Append-Only Thinking)

Definition

COPY means inserting incoming data as new rows without checking for existing records.

Intuition

“Oh, new data?
Cool. Send it in.

Do I already have something similar?
I don’t care.

Is this a duplicate?
Not my problem.

I’ll just add whatever comes in as a new data point and move on.”

Story Example

A daily sales report:

  • Yesterday’s sales never change
  • Today’s sales are new facts

Each day’s data is simply appended.
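As a rough sketch, append-only loading looks like this in Python. The `sales` list and the `copy_into` helper are hypothetical stand-ins for a table and a load command; no real warehouse API is implied.

```python
sales = []  # hypothetical append-only table


def copy_into(table, batch):
    """COPY-style load: append every incoming row, no lookup against existing rows."""
    table.extend(batch)


monday = [{"day": "2024-03-04", "store": "S1", "total": 120}]
tuesday = [{"day": "2024-03-05", "store": "S1", "total": 95}]

copy_into(sales, monday)
copy_into(sales, tuesday)
copy_into(sales, tuesday)  # accidental re-run: the duplicate is accepted silently
```

The load never reads the existing rows, which is why it is cheap, and also why the re-run leaves two identical Tuesday rows in the table.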

Why This Pattern Exists

  • Simplicity
  • High performance
  • Historical accuracy

Trade-off

  • Corrections require reprocessing
  • Duplicates are possible

2.2 MERGE (State-Aware Thinking)

Definition

MERGE updates existing records when matches are found and inserts new records when they are not.

Intuition

“Alright, new data is coming in.

David, if you find something we already have, update it with the latest details.

Carlos, if you don’t recognize the record at all, add it as a brand-new entry.

Bottom line — keep the data up to date.”

Story Example

A customer profile system:

  • Email remains the same
  • Address and phone number change
  • Status evolves over time

The system must reflect the current state, not a trail of outdated versions.
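A minimal sketch of this update-or-insert behaviour, assuming the email is the matching key. The `customers` dict and the `merge_into` helper are illustrative only; real engines express the same logic in their own MERGE syntax.

```python
customers = {}  # hypothetical table keyed by email


def merge_into(table, batch, key="email"):
    """MERGE-style load: update the row when the key matches, insert otherwise."""
    for row in batch:
        if row[key] in table:
            table[row[key]].update(row)   # matched: overwrite changed fields
        else:
            table[row[key]] = dict(row)   # not matched: insert a new row


merge_into(customers, [{"email": "a@example.com", "city": "Pune", "status": "new"}])
merge_into(customers, [
    {"email": "a@example.com", "city": "Mumbai", "status": "active"},  # update
    {"email": "b@example.com", "city": "Delhi", "status": "new"},      # insert
])
```

After both loads the table holds two customers, and the first one shows only its latest city and status.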

Why This Pattern Exists

  • Data represents entities, not events
  • Corrections are expected
  • Idempotency matters

Trade-off

  • More complex logic
  • Higher compute cost
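To see why idempotency matters, consider the same batch being delivered twice, for example after a retry. The sketch below uses a hypothetical `id` key and plain Python containers as stand-ins for tables.

```python
batch = [
    {"id": 1, "status": "shipped"},
    {"id": 2, "status": "pending"},
]

# COPY-style target: a list that accepts every row.
appended = []
for _ in range(2):          # the same batch is delivered twice
    appended.extend(batch)

# MERGE-style target: a dict keyed by id, so a replay overwrites with equal values.
merged = {}
for _ in range(2):
    for row in batch:
        merged[row["id"]] = row

row_counts = (len(appended), len(merged))
```

The append-only target ends up with four rows and the keyed target with two. The price of that safety is a key lookup for every incoming row, which is where the extra logic and compute cost come from.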

3. All Valid Combinations — Schema × Write Pattern

3.1 Schema-on-Write + COPY

Strict structure, immutable history

Story:
A financial ledger where every transaction must be valid and never changes.

Why it fits:

  • Data correctness is enforced
  • History is preserved forever

3.2 Schema-on-Write + MERGE

Strict structure, evolving state

Story:
A customer master table with well-defined fields, where customer details are updated over time.

Why it fits:

  • Strong data guarantees
  • Accurate current view

3.3 Schema-on-Read + COPY

Flexible structure, raw history

Story:
An application log store capturing all events, even malformed ones.

Why it fits:

  • Zero data loss
  • Future reprocessing is possible

3.4 Schema-on-Read + MERGE

Flexible structure, improving state

Story:
A data enrichment system where records arrive incomplete and get enriched later.

Why it fits:

  • Allows partial data
  • Data quality improves over time

4. Where Partition Immutability Enters the Picture

Modern open table formats rely on immutable data files (the files that make up each partition).

Immutability means:

  • Once written, data files are never modified
  • Updates and deletes create new files
  • Old files are logically retired via metadata (often with a property like status = DELETED or valid_to = )

This design is essential for:

  • Safe concurrent reads and writes
  • Reliable versioning
  • Efficient rollback and recovery
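A toy model can make these properties concrete. In the sketch below, `files` holds write-once data files and `snapshots` is the metadata that says which files are live. The names, the `commit` and `read` helpers, and the layout are assumptions made for illustration; real table formats keep far richer metadata.

```python
files = {}       # file name -> tuple of rows; an entry is never changed once written
snapshots = []   # each snapshot is the frozenset of file names that are live


def commit(added, removed=()):
    """Write new files, then publish a new snapshot that swaps them in."""
    live = set(snapshots[-1]) if snapshots else set()
    for name, rows in added.items():
        if name in files:
            raise ValueError(f"{name} already exists; files are write-once")
        files[name] = tuple(rows)
    snapshots.append(frozenset((live - set(removed)) | set(added)))


def read(snapshot_id=-1):
    """Return all rows visible in a given snapshot (latest by default)."""
    return sorted(row for name in snapshots[snapshot_id] for row in files[name])


commit({"f1": [("k1", 10), ("k2", 20)]})                   # initial load
commit({"f2": [("k3", 30)]})                               # append: add a file
commit({"f3": [("k1", 11), ("k2", 20)]}, removed=["f1"])   # update k1: rewrite f1 as f3

latest = read()       # sees f2 and f3
previous = read(1)    # older snapshot: still sees f1 and f2
```

The update never touches `f1`. It writes `f3` and publishes a snapshot that no longer lists `f1`, so a reader on the older snapshot keeps getting consistent results, and rolling back is just a matter of pointing at an earlier snapshot.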

5. Immutability in Open Table Formats

Apache Iceberg (Primary Focus)

Design philosophy:

  • Strong immutability
  • Snapshot-based metadata
  • Schema evolution as a first-class feature

How patterns map:

  • COPY → new data files added to a snapshot
  • MERGE → rewritten files + new snapshot
  • Schema-on-read and schema-on-write both supported via metadata evolution

Iceberg treats metadata as the control plane and data files as immutable assets.

Delta Lake (Secondary)

Design philosophy:

  • Transaction log driven
  • Immutable Parquet files
  • Strong support for MERGE

How patterns map:

  • COPY → append entries in the transaction log
  • MERGE → file rewrites tracked in the log
  • Schema enforcement on write is the default behaviour; schema evolution is opt-in

Delta emphasizes transactional reliability with strong ACID guarantees.

Apache Hudi (Secondary)

Design philosophy:

  • Designed for mutable datasets
  • Supports record-level updates
  • Multiple storage strategies

How patterns map:

  • COPY → insert-only tables
  • MERGE → copy-on-write or merge-on-read
  • Schema evolution supported but more operationally involved

Hudi optimizes for near-real-time updates and streaming workloads.
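The difference between copy-on-write and merge-on-read is mostly about when the cost of an update is paid. The sketch below is a toy model of that trade-off using plain dicts; the function names are hypothetical, and it does not reflect Hudi's actual API or file layout.

```python
def upsert_copy_on_write(base_file, changes):
    """Pay at write time: produce a fully rewritten file with changes applied."""
    rewritten = dict(base_file)
    rewritten.update(changes)
    return rewritten            # readers use this file directly


def read_merge_on_read(base_file, delta_log):
    """Pay at read time: the base file stays as is; deltas are replayed per read."""
    view = dict(base_file)
    for changes in delta_log:   # deltas applied in arrival order
        view.update(changes)
    return view


base = {"k1": "a", "k2": "b"}

cow_file = upsert_copy_on_write(base, {"k2": "B", "k3": "c"})

delta_log = [{"k2": "B"}, {"k3": "c"}]   # cheap appends at write time
mor_view = read_merge_on_read(base, delta_log)
```

Both paths give readers the same rows. Copy-on-write makes writes heavier and reads simple, while merge-on-read keeps writes light and moves the merging work to readers until the deltas are compacted.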


6. Closing Note

Schema enforcement, write patterns, and immutability are not independent ideas.
They are three dimensions of the same design space:

  • Schema answers when data is trusted
  • COPY vs MERGE answers how data evolves
  • Immutability answers how systems stay correct at scale

Once these are understood together, modern data systems stop feeling complex and start feeling inevitable.

This understanding travels with you—across tools, platforms, and technologies.
