The Evolution Nightmare
In Part 1, we saw how Thrift and Protocol Buffers use numeric tags to shrink data. But in a real-world distributed system, you can’t upgrade every microservice at the same time. You will always have Old Code talking to New Code.
This creates a massive problem: Schema Evolution. If Service A adds a new field to its database, will Service B (running the old code) crash when it tries to read that data?
Forward vs. Backward Compatibility
To build a resilient system, you must understand two concepts:
- Backward Compatibility: New code can read data written by old code. (Essential when you update your "Readers" first).
- Forward Compatibility: Old code can read data written by new code. (Essential when you update your "Writers" first).
In Thrift and Protobuf, this is managed by Tags. If a reader sees a tag it doesn't recognize, it simply ignores it. But what if you want to avoid tags entirely?
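To make "ignoring unknown tags" concrete, here is a minimal, hand-rolled sketch in Python that mimics the Protobuf wire format. The byte string is a hypothetical two-field message, and this is illustrative code, not the real protobuf library — but it shows why an old reader can safely skip a field it has never heard of: the wire type alone tells it how many bytes to jump over.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # MSB clear means last byte of the varint
            return result, pos
        shift += 7

def skip_unknown_field(buf, pos, wire_type):
    """An old reader skips a field it doesn't recognize using only the wire type."""
    if wire_type == 0:            # varint: consume bytes until MSB is clear
        _, pos = read_varint(buf, pos)
    elif wire_type == 2:          # length-delimited: read the length, jump over payload
        length, pos = read_varint(buf, pos)
        pos += length
    elif wire_type == 5:          # fixed 32-bit
        pos += 4
    elif wire_type == 1:          # fixed 64-bit
        pos += 8
    return pos

# Hypothetical message: field 1 = varint 150, field 2 = string "hi" (unknown to old code)
msg = bytes([0x08, 0x96, 0x01, 0x12, 0x02, 0x68, 0x69])
pos, known = 0, {}
while pos < len(msg):
    key, pos = read_varint(msg, pos)
    field_num, wire_type = key >> 3, key & 0x07
    if field_num == 1:            # the only field the "old" schema knows about
        known[field_num], pos = read_varint(msg, pos)
    else:                         # new field added by a newer writer: skip it
        pos = skip_unknown_field(msg, pos, wire_type)
print(known)  # {1: 150}
```

The old reader decodes the field it knows and walks cleanly past the one it doesn't — no crash, no corruption.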
Apache Avro: The "No Tag" Evolution
Apache Avro was created within the Hadoop ecosystem because Thrift wasn't a perfect fit for massive data files. Unlike Protobuf, Avro does not store tag numbers or field types in the binary data. It only stores the raw values.
How it works: The Pairwise Resolution
How does the reader know what the data is if there are no tags?
Avro uses two schemas:
- Writer’s Schema: The schema the application used when it sent the data.
- Reader’s Schema: The schema the receiving application expects.
When data is read, the Avro library compares both schemas side-by-side and matches fields by name, so a change in field order is resolved automatically. (A renamed field needs an alias declared in the schema, since name matching alone can't connect the old name to the new one.)
![Diagram: Avro Reader and Writer Schema Resolution Logic]
Implementation Example (Avro JSON Schema)
Avro schemas are written in simple JSON, making them much easier to generate dynamically than the IDLs we saw in Part 1.
`user_schema.avsc`

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.piyush.devto",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "age", "type": ["int", "null"], "default": 0},
    {"name": "email", "type": ["string", "null"]}
  ]
}
```
Python Implementation:

```python
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# 1. Load the schema from a file (parse expects a JSON string, so open in text mode)
schema = avro.schema.parse(open("user_schema.avsc", "r").read())

# 2. Write binary data to an Avro Object Container File
with DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema) as writer:
    writer.append({
        "username": "Piyush",
        "age": 25,
        "email": "piyush@example.com",
    })

print("Data serialized to users.avro")
```
The Killer Feature: Database Integration
One reason Avro is the "gold standard" for Kafka and Big Data is its relationship with relational databases.
Because Avro schemas are JSON, you can write a script to automatically convert a SQL table into an Avro schema.
- SQL Column Name → Avro Field Name
- SQL Data Type → Avro Type
- Nullable Column → Avro Union `["type", "null"]`
This makes Avro perfect for Change Data Capture (CDC), where you stream every single update from your Postgres or MySQL database into a Data Lake like S3 or Snowflake.
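A sketch of such a conversion script, assuming a hypothetical `(name, sql_type, nullable)` column listing — the type mapping is illustrative, not tied to any real database driver. It uses the `"null"`-first union convention so nullable fields can default to `None`:

```python
# Illustrative SQL-type → Avro-type mapping (extend as needed)
SQL_TO_AVRO = {
    "VARCHAR": "string",
    "TEXT": "string",
    "INTEGER": "int",
    "BIGINT": "long",
    "BOOLEAN": "boolean",
    "DOUBLE": "double",
}

def table_to_avro_schema(table_name, columns):
    """Build an Avro record schema from (name, sql_type, nullable) tuples."""
    fields = []
    for name, sql_type, nullable in columns:
        avro_type = SQL_TO_AVRO[sql_type]
        if nullable:
            # Nullable column becomes a union; "null" first so the default
            # value None is valid for the first branch
            fields.append({"name": name, "type": ["null", avro_type],
                           "default": None})
        else:
            fields.append({"name": name, "type": avro_type})
    return {"type": "record", "name": table_name, "fields": fields}

schema = table_to_avro_schema("users", [
    ("id", "BIGINT", False),
    ("email", "VARCHAR", True),
])
print(schema["fields"])
```

Run this against your table metadata on every deploy and the Avro schema evolves in lockstep with the database.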
Summary: Choosing Your Weapon
Which encoding should you use for your next project?
| Feature | JSON | Protobuf / Thrift | Apache Avro |
|---|---|---|---|
| Best For | Public APIs | Internal Microservices | Big Data / Pipelines |
| Speed | Slowest | Fast | Fastest (no tags) |
| Schema | Optional | Required (IDL) | Required (JSON) |
| Logic | Human-readable | Tag-based | Resolution-based |
Final Thoughts
Encoding isn't just about saving bytes; it's about defining the contract between your services.
- Use JSON when you need ease of use.
- Use Protobuf for gRPC and internal speed.
- Use Avro when your data is massive and your schemas are constantly evolving.
What are you using in production? Let's discuss in the comments!