In modern data engineering conversations, terms like schema-on-read, schema-on-write, COPY, MERGE, and immutable partitions are used very often.
When I started my career in data engineering, I kept hearing my senior engineers (my mentors) mention these terms while they designed and architected pipelines and data delivery platforms. Back then, they were merely combinations of words to me; for my seniors, they were design choices that could make or break project deliverables.
They are not isolated terms. They are deeply connected, and together they give rise to data processing patterns and data management techniques.
You could start by understanding them in isolation, but understanding them together builds a strong mental model that applies across systems, formats, and platforms.
This article explains these ideas from first principles, then connects them to open table formats, with a strong focus on why these patterns exist.
1. What Is a “Schema” — First Principles
At its core, a schema is a contract: an agreement between you and the data about what shape that data takes.
A schema typically defines:
- What fields exist
- What each field represents
- What type of data each field can hold
- Whether fields are optional or mandatory
For this article, the most important question is not what the schema is, but when the system enforces it.
That single question gives birth to two fundamental models:
- Schema-on-Write
- Schema-on-Read
1.1 Schema-on-Write (Validate First, Store Later)
Definition
Schema-on-write enforces structure before data is stored in your target system. If incoming data does not match the expected structure, it is rejected.
Think of schema-on-write like a strict security checkpoint at a college entrance.
- The college has a predefined rule: only students with a valid college ID are allowed inside.
- The security guard checks the ID before allowing entry.
- If you do not have the ID, you are stopped at the gate.
In this analogy:
- The college premises represent the target data system
- The college ID card represents the schema
- The security check represents schema validation
- Entry is allowed only if the contract is satisfied
Intuition
“Only clean, trusted data is allowed inside.”
Story Example
Consider a bank account system.
Before saving a record:
- Account number must exist
- Balance must be numeric
- Date fields must be valid
If anything is incorrect, the record is refused. Bad data never enters the system.
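To make this concrete, here is a minimal Python sketch of the pattern. The field names and the reject-by-exception behaviour are illustrative assumptions, not any particular database's API: the point is that the record is validated against a declared schema before it is ever stored.

```python
from datetime import date

# The "contract": field name -> (expected type, required)
ACCOUNT_SCHEMA = {
    "account_number": (str, True),
    "balance": (float, True),
    "opened_on": (date, True),
}

storage = []  # stands in for the target system


def write_account(record: dict) -> None:
    """Validate against the schema first; reject the record if it fails."""
    for field, (expected_type, required) in ACCOUNT_SCHEMA.items():
        value = record.get(field)
        if value is None:
            if required:
                raise ValueError(f"Missing required field: {field}")
            continue
        if not isinstance(value, expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    storage.append(record)  # only clean data gets this far


write_account({"account_number": "ACC-001", "balance": 250.0, "opened_on": date(2024, 1, 5)})
# write_account({"account_number": "ACC-002", "balance": "oops"})  # stopped at the gate
```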
Why This Pattern Exists
- Data correctness is critical
- Downstream systems assume reliability
- Errors must be detected early
Trade-off
- Lower flexibility
- Slower ingestion
- Schema changes require coordination
1.2 Schema-on-Read (Store First, Decide Later)
Definition
Schema-on-read stores data first and applies structure only when the data is read.
Think of schema-on-read like entering a public library or a large campus.
- There is no strict identity check at the gate
- Anyone can enter—students, visitors, researchers
- The system does not ask who you are or what you will do upfront
Rules are applied only when you access something specific:
- Certain rooms may require permission
- Certain books may have usage rules
- Some resources may only be available to specific people
In this analogy:
- The campus or library represents the storage system
- Entering freely represents storing raw data without validation
- Rules applied later represent schema being enforced at read time
- Meaning is decided when you access the data, not when it arrives
Story Example
A mobile app analytics system collects user events:
- Different app versions send different fields
- Some events are incomplete
- New fields appear frequently
Rejecting such data would cause data loss. Instead, everything is stored, and meaning is applied during analysis.
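A minimal Python sketch of the same idea: every event is stored exactly as it arrives, and a structure is imposed only at read time. The event fields and defaults here are made-up examples.

```python
import json

raw_store = []  # stands in for cheap raw storage, e.g. a data lake path


def ingest(raw_event: str) -> None:
    """Store first: no validation, nothing is rejected."""
    raw_store.append(raw_event)


def read_events() -> list[dict]:
    """Decide the structure at read time, tolerating missing or extra fields."""
    projected = []
    for raw in raw_store:
        event = json.loads(raw)
        projected.append({
            "user_id": event.get("user_id"),            # may be None for old app versions
            "event_type": event.get("event_type", "unknown"),
            "app_version": event.get("app_version", "unknown"),
        })
    return projected


ingest('{"user_id": 1, "event_type": "click", "app_version": "2.3"}')
ingest('{"event_type": "scroll", "extra_field": "new in v2.4"}')  # incomplete, still accepted
print(read_events())
```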
Why This Pattern Exists
- High data variability
- Rapid change
- Exploration and discovery use cases
Trade-off
- Data quality issues surface later
- Queries must handle inconsistencies
2. COPY vs MERGE — What Problem Do They Solve?
Schemas define structure. COPY and MERGE define how new data interacts with existing data.
The key question here is:
Does incoming data represent new facts or changes to existing facts?
2.1 COPY (Append-Only Thinking)
Definition
COPY means inserting incoming data as new rows without checking for existing records.
Intuition
“Oh, new data? Cool. Send it in.
Do I already have something similar? I don’t care.
Is this a duplicate? Not my problem.
I’ll just add whatever comes in as a new data point and move on.”
Story Example
A daily sales report:
- Yesterday’s sales never change
- Today’s sales are new facts
Each day’s data is simply appended.
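A minimal sketch of COPY-style loading, using SQLite purely to keep the example runnable; table and column names are illustrative. Each day's rows are inserted as-is, with no lookup against what is already there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, store_id TEXT, amount REAL)")


def copy_load(rows: list[tuple]) -> None:
    """Append-only: insert everything, never check for existing records."""
    conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", rows)
    conn.commit()


copy_load([("2024-06-01", "S1", 120.0), ("2024-06-01", "S2", 80.0)])  # yesterday's facts
copy_load([("2024-06-02", "S1", 95.0)])                               # today's facts, simply appended

print(conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone())  # (3,)
```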
Why This Pattern Exists
- Simplicity
- High performance
- Historical accuracy
Trade-off
- Corrections require reprocessing
- Duplicates are possible
2.2 MERGE (State-Aware Thinking)
Definition
MERGE updates existing records when matches are found and inserts new records when they are not.
Intuition
“Alright, new data is coming in.
David, if you find something we already have, update it with the latest details.
Carlos, if you don’t recognize the record at all, add it as a brand-new entry.
Bottom line — keep the data up to date.”
Story Example
A customer profile system:
- Email remains the same
- Address and phone number change
- Status evolves over time
The system must reflect the current state, not a trail of outdated versions.
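The same flow expressed as a merge. The sketch below uses SQLite's upsert syntax only to illustrate the match/update/insert logic; in a warehouse or lakehouse this would typically be a MERGE INTO statement. Table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (email TEXT PRIMARY KEY, address TEXT, phone TEXT)")


def merge_customer(email: str, address: str, phone: str) -> None:
    """Update the record if the key matches, insert it if it does not."""
    conn.execute(
        """
        INSERT INTO customers (email, address, phone) VALUES (?, ?, ?)
        ON CONFLICT(email) DO UPDATE SET
            address = excluded.address,
            phone   = excluded.phone
        """,
        (email, address, phone),
    )
    conn.commit()


merge_customer("ana@example.com", "12 Oak St", "555-0100")    # not seen before -> insert
merge_customer("ana@example.com", "98 Pine Ave", "555-0199")  # same key -> update in place

print(conn.execute("SELECT * FROM customers").fetchall())  # one row, latest details
```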
Why This Pattern Exists
- Data represents entities, not events
- Corrections are expected
- Idempotency matters
Trade-off
- More complex logic
- Higher compute cost
3. All Valid Combinations — Schema × Write Pattern
3.1 Schema-on-Write + COPY
Strict structure, immutable history
Story:
A financial ledger where every transaction must be valid and never changes.
Why it fits:
- Data correctness is enforced
- History is preserved forever
3.2 Schema-on-Write + MERGE
Strict structure, evolving state
Story:
A customer master table with well-defined fields, where customer details are updated over time.
Why it fits:
- Strong data guarantees
- Accurate current view
3.3 Schema-on-Read + COPY
Flexible structure, raw history
Story:
An application log store capturing all events, even malformed ones.
Why it fits:
- Zero data loss
- Future reprocessing is possible
3.4 Schema-on-Read + MERGE
Flexible structure, improving state
Story:
A data enrichment system where records arrive incomplete and get enriched later.
Why it fits:
- Allows partial data
- Data quality improves over time
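A small sketch of this combination: records arrive partially filled, nothing is rejected, and each new arrival only overwrites the fields it actually carries. The record layout is illustrative.

```python
current_state: dict[str, dict] = {}  # keyed by record id


def enrich(record_id: str, partial: dict) -> None:
    """Accept partial data; merge only the fields that are actually present."""
    existing = current_state.setdefault(record_id, {})
    for field, value in partial.items():
        if value is not None:
            existing[field] = value


enrich("p-1", {"name": "Ana", "country": None})         # incomplete first arrival
enrich("p-1", {"country": "PT", "segment": "retail"})   # enrichment fills the gaps later

print(current_state)  # {'p-1': {'name': 'Ana', 'country': 'PT', 'segment': 'retail'}}
```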
4. Where Partition Immutability Enters the Picture
Modern open table formats rely on immutable data files (partitions).
Immutability means:
- Once written, data files are never modified
- Updates and deletes create new files
- Old files are logically retired via metadata (often with a property like status = DELETED or a valid_to timestamp)
This design is essential to:
- Safe concurrent reads and writes
- Reliable versioning
- Efficient rollback and recovery
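A toy Python sketch of the idea: data files are never edited in place. An "update" writes a new file, and the old one is retired in a small metadata manifest (the status = DELETED flag mirrors the property mentioned above). The file and directory names are illustrative, not any specific format's layout.

```python
import json
from pathlib import Path

table_dir = Path("toy_table")
table_dir.mkdir(exist_ok=True)
manifest: list[dict] = []  # the "control plane": which files are live


def write_file(name: str, rows: list[dict]) -> None:
    """Data files are written once and never modified afterwards."""
    (table_dir / name).write_text(json.dumps(rows))
    manifest.append({"file": name, "status": "ACTIVE"})


def rewrite_file(old_name: str, new_name: str, rows: list[dict]) -> None:
    """An update = new file + metadata change; the old bytes stay untouched."""
    write_file(new_name, rows)
    for entry in manifest:
        if entry["file"] == old_name:
            entry["status"] = "DELETED"  # logically retired, physically still there


write_file("part-000.json", [{"id": 1, "amount": 10}])
rewrite_file("part-000.json", "part-001.json", [{"id": 1, "amount": 12}])
print(manifest)
```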
5. Immutability in Open Table Formats
Apache Iceberg (Primary Focus)
Design philosophy:
- Strong immutability
- Snapshot-based metadata
- Schema evolution as a first-class feature
How patterns map:
- COPY → new data files added to a snapshot
- MERGE → rewritten files + new snapshot
- Schema-on-read and schema-on-write both supported via metadata evolution
Iceberg treats metadata as the control plane and data files as immutable assets.
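As a hedged sketch of how this looks in practice, the PySpark snippet below assumes a Spark session already configured with an Iceberg catalog named `demo`, and that the tables referenced already exist; the table and column names are made up. An append adds new data files under a fresh snapshot, while MERGE INTO rewrites the affected files and commits another snapshot.

```python
from pyspark.sql import SparkSession

# Assumes spark.sql.catalog.demo is configured as an Iceberg catalog.
spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

# COPY-style: append new data files under a new snapshot.
new_rows = spark.createDataFrame([(1, "click"), (2, "scroll")], ["user_id", "event_type"])
new_rows.writeTo("demo.db.events").append()

# MERGE-style: matched rows cause file rewrites, tracked as another snapshot.
spark.sql("""
    MERGE INTO demo.db.customers AS t
    USING demo.db.customer_updates AS s
    ON t.email = s.email
    WHEN MATCHED THEN UPDATE SET t.address = s.address, t.phone = s.phone
    WHEN NOT MATCHED THEN INSERT *
""")

# Each commit above is visible as a snapshot in the table's metadata.
spark.sql("SELECT snapshot_id, operation FROM demo.db.events.snapshots").show()
```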
Delta Lake (Secondary)
Design philosophy:
- Transaction log driven
- Immutable parquet files
- Strong support for MERGE
How patterns map:
- COPY → append entries in the transaction log
- MERGE → file rewrites tracked in the log
- Schema enforcement on write by default, with opt-in schema evolution
Delta emphasizes transactional reliability with strong ACID guarantees.
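A comparable sketch with the delta-spark Python API, hedged the same way: a Spark session with Delta configured is presumed, and the table names and path are assumptions. The merge rewrites only the files containing matched rows and records that action in the transaction log.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()
updates = spark.read.parquet("/tmp/customer_updates")  # incoming changes (path is illustrative)

# COPY-style: an append adds new parquet files plus a new entry in the _delta_log.
updates.write.format("delta").mode("append").saveAsTable("customers_history")

# MERGE-style: matched files are rewritten and the rewrite is committed to the log.
(
    DeltaTable.forName(spark, "customers")
    .alias("t")
    .merge(updates.alias("s"), "t.email = s.email")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```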
Apache Hudi (Secondary)
Design philosophy:
- Designed for mutable datasets
- Supports record-level updates
- Multiple storage strategies
How patterns map:
- COPY → insert-only tables
- MERGE → copy-on-write or merge-on-read
- Schema evolution supported but more operationally involved
Hudi optimizes for near-real-time updates and streaming workloads.
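A brief sketch of the same mapping with Hudi's Spark datasource, assuming the Hudi Spark bundle is on the classpath; the table name, key fields, and path are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-sketch").getOrCreate()
updates = spark.createDataFrame(
    [("ana@example.com", "98 Pine Ave", "2024-06-02")],
    ["email", "address", "updated_at"],
)

hudi_options = {
    "hoodie.table.name": "customer_profiles",
    "hoodie.datasource.write.recordkey.field": "email",        # record-level key
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest value wins on conflict
    "hoodie.datasource.write.operation": "upsert",             # use "insert" for COPY-style loads
}

updates.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/customer_profiles")
```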
6. Closing Note
Schema enforcement, write patterns, and immutability are not independent ideas.
They are three dimensions of the same design space:
- Schema answers when data is trusted
- COPY vs MERGE answers how data evolves
- Immutability answers how systems stay correct at scale
Once these are understood together, modern data systems stop feeling complex and start feeling inevitable.
This understanding travels with you—across tools, platforms, and technologies.