Coding Blocks

Why Avro?

Joe Zack ・ Originally published at codingblocks.net ・ 4 min read

Apache Avro is an open source data serialization system: it encodes structured data into a compact binary form that can be stored or sent between systems. It is frequently associated with “big data” and distributed systems because it has some distinct advantages over the competition.

The primary advantages are listed below, but read on for more information:

Why not human readable formats?

There are many different ways of serializing and deserializing data, and they each have their own advantages. JSON is probably the most common format that developers deal with today, but XML is still king in some tech stacks. The prime benefit of these formats is that they are easy for humans to read, making development and debugging much easier.

However, the same things that make formats like JSON and XML human readable (like labels for each element or attribute) are what make them grossly inefficient.

Don’t believe me? Here’s a link to the Pokemon pokedex in JSON format. The labels (“id”, “name”, “Attack”, “Sp. Attack”) take up more space than the values they describe!
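To get a rough feel for that overhead, here is a minimal Python sketch that compares the characters spent on labels with those spent on values for a single record. The record is a made-up entry that reuses the labels quoted above; it is not taken from the linked file.

```python
import json

# Hypothetical record using the labels quoted above; the values are
# illustrative and not copied from the linked Pokedex file.
record = {"id": 1, "name": "Bulbasaur", "Attack": 49, "Sp. Attack": 65}

# Compact encoding: no spaces after separators.
encoded = json.dumps(record, separators=(",", ":"))

# Count the characters spent on quoted labels versus encoded values.
label_chars = sum(len(json.dumps(key)) for key in record)
value_chars = sum(len(json.dumps(value)) for value in record.values())

print(encoded)
print(f"labels: {label_chars}, values: {value_chars}, total: {len(encoded)}")
```

For this one record the labels take 30 of the 55 characters, and that cost is paid again for every record in the file.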

Yes, you can compress JSON and XML, but it’s expensive and your data is no longer human readable…so why not choose a better format at that point?

Pros:

  • Human readable / writable
  • Optional schema
  • Widely supported
  • Great for light tasks

Cons:

  • Computationally inefficient

What about Protocol Buffers/Thrift?

Formats like Protocol Buffers or Thrift are not human readable, but they are dramatically more efficient. The idea is that you can create a schema that defines the shape of the data, and you can share that schema with anybody who needs to decode your data.

This allows us to dramatically reduce the amount of space per message. The schema is also id-based, which offers a limited kind of schema evolution: you can add new fields or remove existing ones (as long as they aren’t required) without breaking anything.
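The id-based idea can be sketched with a toy tag-length-value encoding in Python. This is not the actual Protocol Buffers or Thrift wire format, and the `encode`/`decode` helpers and the field ids are hypothetical; the point is only that a reader can skip ids it does not know about.

```python
import struct

# Toy tag-length-value encoding. NOT the real Protocol Buffers or Thrift
# wire format; it only illustrates identifying fields by numeric id.

def encode(fields):
    """fields: dict mapping a numeric id (0-255) to a bytes payload (max 255 bytes)."""
    out = bytearray()
    for tag, payload in fields.items():
        out += struct.pack("BB", tag, len(payload))  # 1-byte id, 1-byte length
        out += payload
    return bytes(out)

def decode(data, schema):
    """schema: dict mapping numeric id -> field name known to this reader."""
    result = {}
    pos = 0
    while pos < len(data):
        tag, length = struct.unpack_from("BB", data, pos)
        pos += 2
        payload = data[pos:pos + length]
        pos += length
        if tag in schema:
            result[schema[tag]] = payload.decode("utf-8")
        # Unknown id: skip the payload, so old readers survive new fields.
    return result

old_schema = {1: "name"}             # reader built before field 2 existed
new_schema = {1: "name", 2: "type"}  # reader that knows the added field

new_message = encode({1: b"Bulbasaur", 2: b"Grass"})
old_message = encode({1: b"Bulbasaur"})

print(decode(new_message, old_schema))  # old reader ignores field 2
print(decode(new_message, new_schema))
print(decode(old_message, new_schema))  # new reader tolerates the missing optional field
```

Note that no field names appear in the encoded bytes, which is where the space saving comes from. It is also why both sides need the schema to make sense of a message.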

Pros:

  • Highly efficient
  • Most major languages have third-party library support for it
  • Limited support for schema evolution
  • Required schemas can serve as documentation

Cons:

  • Schemas must be shared between serializers and deserializers
  • Can’t add or remove required fields
  • Schemas are not directly versioned
  • Complicated for light tasks

Why Avro?

I listed all the reasons above, and I’ll list them again below, but ultimately it boils down to having awesome schema support.

Having a stand-alone, versioned schema allows Avro to keep the minimum amount of information in its messages, making them highly compact. The schema support also allows for different readers and writers to negotiate on which schemas they support, which bakes a lot of flexibility in at the lowest level.
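Here is a rough Python sketch of that reader/writer idea. It does not use an Avro library or the real Avro binary encoding: the `write` and `read` helpers are hypothetical, every field is treated as a string, and the schemas are simplified Avro-style records. What it does mirror is that values are written in schema order with no labels or ids, and that the reader maps the writer's fields onto its own schema by name, falling back to a default when a field is missing.

```python
# Toy illustration of Avro-style schema resolution. NOT the real Avro binary
# encoding and not an Avro library; every field is treated as a string.

def write(record, writer_schema):
    """Write values in schema order, with no field names or ids in the output."""
    out = bytearray()
    for field in writer_schema["fields"]:
        payload = record[field["name"]].encode("utf-8")
        out.append(len(payload))  # 1-byte length prefix (toy limit: 255 bytes)
        out += payload
    return bytes(out)

def read(data, writer_schema, reader_schema):
    """Decode with the writer's schema, then map onto the reader's schema by name."""
    written = {}
    pos = 0
    for field in writer_schema["fields"]:
        length = data[pos]
        pos += 1
        written[field["name"]] = data[pos:pos + length].decode("utf-8")
        pos += length

    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = written[name]
        elif "default" in field:
            result[name] = field["default"]  # field the writer did not know about
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return result

writer_schema = {"type": "record", "name": "Pokemon", "fields": [
    {"name": "name", "type": "string"},
    {"name": "attack", "type": "string"},
]}
reader_schema = {"type": "record", "name": "Pokemon", "fields": [
    {"name": "name", "type": "string"},
    {"name": "element", "type": "string", "default": "unknown"},
]}

data = write({"name": "Bulbasaur", "attack": "49"}, writer_schema)
resolved = read(data, writer_schema, reader_schema)
print(len(data), resolved)
```

The reader drops `attack`, which it does not ask for, and fills `element` from its default. It can only do that because it has the writer's schema alongside its own.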

There are some additional niceties that go along with this too, like making it easier to support dynamic schemas rather than compiling them in with your code and having to figure out how to share the schema across projects. Sure, you could build this on top of another format…and also build that support into all of the tools you support, but why do that when you could use a popular, agreed upon standard?
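Because Avro schemas are written in JSON, a program can load one at runtime instead of compiling generated classes. A minimal sketch, assuming a hypothetical `Pokemon` schema and a hypothetical `missing_fields` helper:

```python
import json

# Hypothetical Avro schema, as it might arrive from a file or another
# service at runtime. Avro schemas are plain JSON documents.
schema_text = """
{
  "type": "record",
  "name": "Pokemon",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"}
  ]
}
"""

schema = json.loads(schema_text)

def missing_fields(record, schema):
    """Return the schema's field names that the record does not provide."""
    return [f["name"] for f in schema["fields"] if f["name"] not in record]

missing = missing_fields({"id": 1}, schema)
print(missing)
```

A real project would hand the parsed schema to an Avro library rather than inspect it by hand, but the schema itself stays ordinary data that can be shared and swapped without a rebuild.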

Pros:

Cons:

  • Support is not as wide as other formats
  • Complicated for light tasks

Read a much better comparison here: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

The author of that post also wrote an awesome book, and if you're interested in this post then I bet you will love it.
