
The Power of Parquet

Learn about what Parquet is and how its columnar storage and encoding techniques can help you.

Original post written by Jon Carl, Staff Software Engineer at Auth0.

Imagine you are tasked with choosing how to store a large analytical data set that needs to be easily queryable. Billions of rows with over 75 columns. What would you come up with?

Maybe you reach for a popular cloud database technology such as Amazon Redshift, Snowflake, or Google Cloud SQL. Maybe you think Microsoft Access is all you need. Maybe you decide the tried-and-true PostgreSQL is the best fit. Or maybe you opt for any of the other hundreds of databases out there.

Someone made the decision for you: each row will be a JSON object on a line in a gzipped file. Some files will have many lines and others will have few lines. To query the data, you’ll need to use something that can scan many of these files and read all of the JSON. You still have some part in the decision by choosing what to use for querying. You settle on Amazon Athena, which is built on top of Apache Presto and lets you query the data using SQL.

You like the SQL interface, but using gzipped JSON files as the underlying storage is not exactly fast. If you could change the format, what would you choose?

At Auth0 we recently transitioned our edge logs from gzipped JSON to a format called Parquet. A little background on our data set: the network edge of Auth0 processes over 60 billion requests a month. The logs for these requests arrive from our edge networking partner as gzipped JSON files and amount to roughly 1 TB of data a day.

What is Parquet?

At its core, Apache Parquet is a file format which stores data in columns instead of rows. A Parquet file is not meant to be read completely from beginning to end. Instead, there is metadata at the end of the file that a reader can use to selectively read only the portions of the file it is interested in. The less data you have to read, the faster your reads can be.
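If you want to see that footer metadata for yourself, DuckDB (which we will come back to later in this post) has a parquet_metadata table function that dumps the row group and column chunk information a reader uses to skip around the file. A minimal sketch, assuming you already have some Parquet file on hand (edge_logs.parquet below is just a placeholder name):

-- Dump the footer metadata of a Parquet file: one row per column chunk per row group,
-- with byte offsets, sizes, min/max statistics, compression, and encodings.
-- 'edge_logs.parquet' is a placeholder; point this at any Parquet file you have.
SELECT * FROM parquet_metadata('edge_logs.parquet');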

Example

Let’s look at an example: I want to calculate the 90th percentile of the request duration column in the data set.

JSON

With JSON, the query engine has to read each file in its entirety. For each row it has to read the JSON object at least until it finds the request duration column. Or maybe it has to read the entire JSON object before picking the column out. (This isn’t a JSON parsing blog post, so we won’t investigate further.)

Parquet

With Parquet, the query engine reads the metadata in each file. From there, the engine learns where the column values are stored in the file and reads only that data. This is not only faster than reading the entire file, it’s also cheaper in Athena, which bills by bytes scanned.

You could also have Parquet store metadata in one file and reference multiple other files for column locations.
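As a concrete sketch of the percentile query above, here is roughly what it looks like in Athena’s SQL dialect; the table and column names (edge_logs, request_duration) are hypothetical stand-ins for the real schema:

-- Approximate 90th percentile of the request duration column.
-- Against Parquet, the engine only needs to scan the request_duration column chunks,
-- not every byte of every file.
SELECT approx_percentile(request_duration, 0.90) AS p90_duration
FROM edge_logs;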

Encoding

Let’s look deeper at how Parquet uses different encoding techniques to make reads efficient.

Dictionary Encoding

When a column has many of the same values, a dictionary can be built and used to refer to those values. We have a status code column in our logs which holds the status code returned to the client. Status codes are well known and fit well in a dictionary. The distinct values in the column are each assigned an index, and that index is then used to refer to the value.

[Image: dictionary encoding of the status code column]

When reading the column there is less data to read, and the smaller, more repetitive representation compresses better, so a higher level of compression is achieved. But dictionary encoding does not always make sense; a column with high cardinality, for example, would not gain much from it.
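To make the mapping concrete, here is a small conceptual sketch in DuckDB SQL (the status code values are made up): the writer builds something like the dictionary below and then stores only the indexes.

-- Conceptual sketch of dictionary encoding: each distinct value gets a small index,
-- and the column is stored as indexes into that dictionary.
WITH status_codes(status_code) AS (
  VALUES (200), (200), (404), (200), (500), (404)
),
dictionary AS (
  SELECT status_code, row_number() OVER (ORDER BY status_code) - 1 AS idx
  FROM (SELECT DISTINCT status_code FROM status_codes) AS d
)
SELECT s.status_code AS original_value, di.idx AS stored_index
FROM status_codes AS s
JOIN dictionary AS di ON s.status_code = di.status_code;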

Run Length Encoding (RLE)

Another option is Run Length Encoding (RLE), which can further reduce the size of a column. While dictionary encoding tackles distinct values, RLE tackles consecutive repeated values. When a column has consecutive repeated values, the number of repetitions is encoded, followed by the value being repeated. The same status code example works here: across hundreds or thousands of rows, the status codes have low cardinality, so consecutive repeated values are common.

[Image: run length encoding of the status code column]

As with dictionary encoding, this means less data to read and better compression. RLE is also normally combined with dictionary encoding, as the two complement each other well.
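Here is the same kind of conceptual sketch for RLE, again in DuckDB SQL with made-up values: runs of consecutive identical status codes collapse into a repetition count plus the value.

-- Conceptual sketch of run length encoding: consecutive repeats of a value
-- are stored as (run length, value) instead of one entry per row.
WITH codes(pos, status_code) AS (
  VALUES (1, 200), (2, 200), (3, 200), (4, 200), (5, 404), (6, 200), (7, 200)
)
SELECT MIN(pos) AS run_start, COUNT(*) AS run_length, status_code
FROM (
  SELECT pos, status_code,
         pos - row_number() OVER (PARTITION BY status_code ORDER BY pos) AS grp
  FROM codes
) AS runs
GROUP BY status_code, grp
ORDER BY run_start;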

Delta Encoding

The last encoding we will look at is delta encoding, which stores the deltas between consecutive values instead of their full form. We have a duration column which stores the duration of a request in milliseconds. Delta encoding would store the full value of the first item, and every value after that as the difference from the previous value.

[Image: delta encoding of the duration column]

The smaller the deltas, the more effective delta encoding is. Another good fit for delta encoding is timestamps stored as millisecond-precision 64-bit values: with multiple requests arriving every second, the deltas between consecutive values stay relatively small.
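A quick sketch of the transformation, using DuckDB’s lag() window function over some made-up millisecond timestamps: the first value is kept in full and every later value is reduced to its difference from the previous one.

-- Conceptual sketch of delta encoding: store the first value in full,
-- then only the (small) difference from the previous value.
WITH events(pos, ts_ms) AS (
  VALUES (1, 1700000000000), (2, 1700000000012), (3, 1700000000013), (4, 1700000000107)
)
SELECT pos,
       ts_ms,
       -- NULL for the first row: there is no previous value, so it is stored whole.
       ts_ms - lag(ts_ms) OVER (ORDER BY pos) AS delta_from_previous
FROM events
ORDER BY pos;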

How Can You Get Started With Parquet?

As you can see, Parquet is much more than a format which stores data in columns with metadata. The way it encodes each column, how it uses types to store data more efficiently, or how it handles repeated values are all examples of how purposefully Parquet has been built. There was a lot of thought put into Parquet, and moving to it from JSON was amazing.

Parquet sounds great, and there are some really cool features to nerd out on, but how can you start using it today? The good news is you don’t need to know all of its inner workings. You don’t need to choose encodings, get into the nitty gritty of how columns are packaged and stored, or have a beefy fleet of DB machines. We started exploring Parquet using DuckDB and grew from there. We had a data set we would run multiple queries on, and it was faster for us to convert it to Parquet once and query that than to keep querying the JSON. Getting started is as simple as starting DuckDB on your command line and converting an example dataset:

COPY(
  SELECT * FROM
  read_json('https://github.com/grounded042/airports-dataset/raw/226a54bb19535ea8c5e6175a591446e67c4ab44c/airports.json')
) TO 'airports.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);

To see the 20 countries with the most airports, you can use the following query:

SELECT COUNT(*) as count, iso_country FROM 'airports.parquet' GROUP BY iso_country ORDER BY count DESC LIMIT 20;

┌───────┬─────────────┐
│ count │ iso_country │
│ int64 │   varchar   │
├───────┼─────────────┤
│ 31567 │ US          │
│  7206 │ BR          │
│  3563 │ JP          │
│  3126 │ CA          │
│  2677 │ AU          │
│  2594 │ MX          │
│  1683 │ RU          │
│  1656 │ FR          │
│  1452 │ GB          │
│  1402 │ KR          │
│  1335 │ DE          │
│   942 │ AR          │
│   918 │ IT          │
│   731 │ CO          │
│   726 │ PH          │
│   713 │ CN          │
│   652 │ ZA          │
│   632 │ PG          │
│   625 │ IN          │
│   617 │ ID          │
├───────┴─────────────┤
│ 20 rows   2 columns │
└─────────────────────┘

Start small with your own JSON data and grow as needed. For us, that meant growing into a Lambda which converts files as they land in S3, but we wouldn’t have gotten there without first exploring things via DuckDB.
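If you are curious which of the encodings above the writer actually picked, DuckDB’s parquet_metadata function reports them per column chunk. A sketch against the airports.parquet file created earlier (the exact output column names may vary slightly between DuckDB versions):

-- Show which encodings and compression each column chunk ended up with.
SELECT path_in_schema, encodings, compression, total_compressed_size
FROM parquet_metadata('airports.parquet');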

To take things further, DuckDB has a great blog post which gives you first-hand experience with Parquet: https://duckdb.org/2021/06/25/querying-parquet.html
