Joe Zack for Coding Blocks

Posted on Sep 12, 2019 • Originally published at codingblocks.net on Sep 12, 2019

Why Avro?

#architecture #distributedsystems #cloud #kafka

Apache Avro is an open source data serialization system that lets you send information. It is frequently associated with “big data” and distributed systems because it has some distinct advantages over the competition.

The primary advantages are listed below, but read on for more information:

Messages are highly efficient
Strong schema support
- Schema versioning
- Dynamic schema support
- Different, but compatible, Reader and Writer versions allowed
- Union types
Object container files can include schema along with encoded records

Why not human readable formats?

There are many different ways of serializing and deserializing data, and they each have their own advantages. JSON is probably the most common format that developers deal with today, but XML is still king in some tech stacks. The prime benefit of these formats is that it easy for humans to read, making development and debugging much easier.

However, the same things that make formats like JSON and XML human readable (like labels for each element or attribute) are the same things that make it grossly inefficient.

Don’t believe me? Here’s a link to the Pokemon pokedex in json format. The labels (“id”, “name”, “Attack”, “Sp. Attack”) make up more data than the data does!

Yes, You can compress json and xml, but it’s expensive and your data is no longer human readable…so why not choose a better format at that point?

Pros :

Human readable / writable
Optional schema
Widely supported
Great for light tasks

Cons :

Computationally inefficient

What about Protocol Buffers/Thrift?

Formats like Protocol Buffers or Thrift are not human readable, but they are dramatically more efficient. The idea is that you can create a schema that defines the shape of the data, and you can share that schema with anybody who needs to decode your data.

This allows us to dramatically reduce the amount of space per message. The schema also id based, which offers a limited kind of schema evolution where you can add or remove (as long as they aren’t required) new fields without breaking anything.

Pros:

Highly efficient
Most major languages have 3p library support for it
Limited support for schema evolution
Required schemas can serve as documentation

Cons:

Schemas must be shared between serializers and deserializers
Can’t add or remove required fields
Schemas are not directly versioned
Complicated for light tasks

Why Avro?

I listed all the reasons above, and I’ll list them again below, but ultimately it boils down to having awesome schema support.

Having a stand-alone, versioned schema allows Avro to keep the minimum amount of information in it’s messages – making them highly compact. The schema support also allows for different readers and writers to negotiate on which schemas they support, which bakes a lot of flexibility in at the lowest level.

There are some additional niceties that go along with this too, like making it easier to support dynamic schemas rather than compiling them in with your code and having to figure out how to share the schema across projects. Sure, you could build this on top of another format…and also build that support into all of the tools you support, but why do that when you could use a popular, agreed upon standard?

Pros :

Messages are highly efficient
Strong schema support
- Schema versioning
- Dynamic schema support
- Different, but compatible, Reader and Writer versions allowed
- Union types
Object container files can include schema along with encoded records