Piyush Gupta
Mastering Schema Evolution: Why Apache Avro is the King of Big Data (Part 2)

The Evolution Nightmare

In Part 1, we saw how Thrift and Protocol Buffers use numeric tags to shrink data. But in a real-world distributed system, you can’t upgrade every microservice at the same time. You will always have Old Code talking to New Code.

This creates a massive problem: Schema Evolution. If Service A adds a new field to its database, will Service B (running the old code) crash when it tries to read that data?

Forward vs. Backward Compatibility

To build a resilient system, you must understand two concepts:

  1. Backward Compatibility: New code can read data written by old code. (Essential when you update your "Readers" first).
  2. Forward Compatibility: Old code can read data written by new code. (Essential when you update your "Writers" first).

In Thrift and Protobuf, this is managed by Tags. If a reader sees a tag it doesn't recognize, it simply ignores it. But what if you want to avoid tags entirely?
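To make the tag-skipping idea concrete, here is a toy sketch in pure Python. It uses a hypothetical tag/length/value layout (not the real Protobuf or Thrift wire format): the old reader walks the bytes and simply hops over any tag it doesn't recognize.

```python
# Toy tag/length/value encoding (hypothetical, for illustration only):
# each field is [tag byte][length byte][value bytes].

def encode_field(tag: int, value: bytes) -> bytes:
    return bytes([tag, len(value)]) + value

def decode(data: bytes, known_tags: dict) -> dict:
    """Decode fields; silently skip tags the reader doesn't know."""
    result, i = {}, 0
    while i < len(data):
        tag, length = data[i], data[i + 1]
        value = data[i + 2 : i + 2 + length]
        if tag in known_tags:           # reader only decodes tags it knows
            result[known_tags[tag]] = value.decode()
        i += 2 + length                 # unknown tags are skipped, not fatal
    return result

# A new writer emits tags 1, 2 and a brand-new field with tag 3.
payload = (encode_field(1, b"piyush")
           + encode_field(2, b"25")
           + encode_field(3, b"new-field"))

# An old reader that only knows tags 1 and 2 still works.
old_reader_schema = {1: "username", 2: "age"}
print(decode(payload, old_reader_schema))  # {'username': 'piyush', 'age': '25'}
```

This is exactly why tags give you forward compatibility: the unknown field costs a few bytes on the wire, but it can never crash the old reader.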


3. Apache Avro: The "No Tag" Evolution

Apache Avro was created within the Hadoop ecosystem because Thrift wasn't a perfect fit for massive data files. Unlike Protobuf, Avro does not store tag numbers or field types in the binary data. It only stores the raw values.
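To see what "only the raw values" means on the wire, here is a small hand-rolled sketch of Avro's binary layout, following the spec's zigzag-varint and length-prefixed-string rules (for illustration only; in practice the library does this for you):

```python
# Sketch of Avro's binary layout for a record: just the raw values
# back to back -- no tags, no field names, no type bytes.

def zigzag_varint(n: int) -> bytes:
    """Avro encodes ints/longs as zigzag-mapped, base-128 varints."""
    z = (n << 1) ^ (n >> 63)            # zigzag: small negatives stay small
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)     # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data  # length varint, then raw bytes

# The record {"username": "Piyush", "age": 25}: values only, in schema order.
payload = encode_string("Piyush") + zigzag_varint(25)
print(payload)   # b'\x0cPiyush2' -- 8 bytes, zero per-field overhead
```

Without the schema, those 8 bytes are meaningless, which is precisely why Avro needs the resolution machinery described next.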

How it works: The Pairwise Resolution

How does the reader know what the data is if there are no tags?

Avro uses two schemas:

  • Writer’s Schema: The schema the application used when it sent the data.
  • Reader’s Schema: The schema the receiving application expects.

When data is read, the Avro library looks at both schemas side-by-side. If the field order changed or a field was renamed, Avro "resolves" the difference by looking at the field names.
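The resolution step can be sketched in a few lines of plain Python. This is a simplified illustration of the idea; the real Avro library also handles type promotions, aliases, and nested records:

```python
# Simplified sketch of Avro schema resolution: match fields by name,
# fill reader-side defaults, drop fields the reader doesn't know.

def resolve(record: dict, writer_fields: list, reader_fields: list) -> dict:
    writer_names = {f["name"] for f in writer_fields}
    out = {}
    for field in reader_fields:
        if field["name"] in writer_names:
            out[field["name"]] = record[field["name"]]   # name match wins
        elif "default" in field:
            out[field["name"]] = field["default"]        # new field: use default
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return out

writer = [{"name": "username"}, {"name": "age"}]
reader = [{"name": "age"},                       # reordered: fine
          {"name": "username"},
          {"name": "email", "default": None}]    # added with a default: fine

print(resolve({"username": "Piyush", "age": 25}, writer, reader))
# {'age': 25, 'username': 'Piyush', 'email': None}
```

Note the failure mode this exposes: adding a new field *without* a default breaks backward compatibility, because an old writer's data has no value to put there.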

![Diagram: Avro Reader and Writer Schema Resolution Logic]

Implementation Example (Avro JSON Schema)

Avro schemas are written in plain JSON, making them much easier to generate dynamically than the IDLs we saw in Part 1. One rule worth memorizing: a union field's default value must match the union's first branch, which is why optional fields are conventionally declared as ["null", "string"] with "default": null.

user_schema.avsc

{
  "type": "record",
  "name": "User",
  "namespace": "com.piyush.devto",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "age", "type": ["int", "null"], "default": 0},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}

Python Implementation:

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# 1. Load the schema from a file
schema = avro.schema.parse(open("user_schema.avsc", "rb").read())

# 2. Write binary data to an Avro Object Container File
with DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema) as writer:
    writer.append({
        "username": "Piyush", 
        "age": 25, 
        "email": "piyush@example.com"
    })

print("Data serialized to users.avro")

The Killer Feature: Database Integration

One reason Avro is the "gold standard" for Kafka and Big Data is its relationship with relational databases.

Because Avro schemas are JSON, you can write a script to automatically convert a SQL table into an Avro schema.

  • SQL Column → Avro Field Name
  • SQL Data Type → Avro Type
  • Nullable Column → Avro Union ["null", "type"] with "default": null

This makes Avro perfect for Change Data Capture (CDC), where you stream every single update from your Postgres or MySQL database into a Data Lake like S3 or Snowflake.
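As a sketch of that conversion, here is a minimal generator working from hypothetical column metadata with a deliberately tiny type map (real CDC tools such as Debezium read this from the database catalog and cover far more types):

```python
import json

# Hypothetical column metadata: (name, sql_type, nullable).
# A real script would query information_schema instead.
SQL_TO_AVRO = {"varchar": "string", "integer": "int", "bigint": "long",
               "boolean": "boolean", "double precision": "double"}

def table_to_avro(table: str, columns: list) -> dict:
    fields = []
    for name, sql_type, nullable in columns:
        avro_type = SQL_TO_AVRO[sql_type]
        if nullable:
            # Nullable column -> union with "null" first, so the
            # default of null is valid under Avro's rules.
            fields.append({"name": name,
                           "type": ["null", avro_type],
                           "default": None})
        else:
            fields.append({"name": name, "type": avro_type})
    return {"type": "record", "name": table, "fields": fields}

columns = [("id", "bigint", False),
           ("username", "varchar", False),
           ("email", "varchar", True)]
print(json.dumps(table_to_avro("users", columns), indent=2))
```

Run this in a migration hook and every `ALTER TABLE` can emit a fresh schema version automatically, with no hand-maintained IDL files.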


Summary: Choosing Your Weapon

Which encoding should you use for your next project?

| Feature  | JSON           | Protobuf / Thrift      | Apache Avro          |
|----------|----------------|------------------------|----------------------|
| Best for | Public APIs    | Internal microservices | Big data / pipelines |
| Speed    | Slowest        | Fast                   | Fastest (no tags)    |
| Schema   | Optional       | Required (IDL)         | Required (JSON)      |
| Logic    | Human-readable | Tag-based              | Resolution-based     |

Final Thoughts

Encoding isn't just about saving bytes; it's about defining the contract between your services.

  • Use JSON when you need ease of use.
  • Use Protobuf for gRPC and internal speed.
  • Use Avro when your data is massive and your schemas are constantly evolving.

What are you using in production? Let's discuss in the comments!

