Kostas Pardalis

Posted on Feb 6, 2023 • Originally published at cpard.xyz

Adding a new data type to a sql query engine (Trino)

#portfolio #howto #tooling #discuss

Introduction

In this post I want to take you through a journey that starts with Github Issue #1284 requesting support for nanosecond/microsecond precision in TIMESTAMP for Trino and ends with the merge of Github PR #3783 that added support for the parametric TIMESTAMP type to the Trino query engine.

This journey includes a number of surprises too!

For example the identification of issues on the semantics of the TIMESTAMP type on Trino as explained in a lot of detail in Github Issue #34.

Our goal is to go deep into some important dimensions of a SQL query engine, including the type system and data encoding.

But also get a taste of what it takes to engineer a complex system like a distributed SQL query engine that is being used daily by thousands of users.

This is going to be a long post, so buckle up!

The Problem

When working with time in a digital system like a computer, precision is one of the things that we deeply care about. Especially when time is important for the tasks we want to perform.

Some industries like finance are more sensitive to time
measurements than others but in any case, we want to know well the semantics of the time data types we work with.

In 2020 FINRA also known as the Financial Industry Regulatory Authority, submitted a proposal to the SEC for the change of rules relating to the granularity of timestamps in trade reports.

The proposal suggests to start tracking trades using nanosecond time granularity.

A system that supports nanosecond timestamp precision,
can work with timestamps in millisecond precision without losing any information. The opposite is not true though.

Trino can support up to picosecond timestamp precision!

Trino is a technology used heavily by the financial sector.
This is one of the reasons the Trino community started looking into updating the TIMESTAMP type in Trino.

SQL Data Types

A data type is a set of representable values and the physical representation of any of the values, is implementation-dependent.

The value of a Timestamp might be 1663802507 but how it is physically represented is left to the database engineer to decide.

Usually, Datetime types (including Timestamp) are physically implemented using 32/64-bit integers.

The SQL Specification allows for arbitrary precision of Datetime types and does that by stating the following:

...SECOND, however,
can be defined to have a that indicates the number of decimal digits maintained
following the decimal point in the seconds value, a non-negative exact numeric value.

Parametarized Data Types

If you are familiar with SQL you will easily recognize the following statement:

CREATE TABLE orders (
orderkey bigint
)

The above will create a table with a column named orderkey and its type will be bigint. There's no way we can parameterize bigint.

This is true for the majority of types but there are a couple exceptions, with timestamps being one of them.

For reasons that we will investigate later, it does make a difference when you are dealing with a timestamp in milliseconds versus picoseconds.

We'd like to allow the user to define the granularity and take advantage of that in optimizing how the data is manipulated.

In Trino the Timestamp type looks like this Timestamp(p),
where p is is the number of digits of precision for the fraction of seconds.

the following SQL statement:

CREATE TABLE orders (
orderkey bigint,
creationdate timestamp(6)
)

will create a table orders with a column named creationdate of type timestamp, with 6 digits of precision for the fraction of seconds.

Adding a new Data type

Adding a new type, or in this case updating an existing one with new semantics, is a risky task.

we are dealing with a system that is being used by thousands of users daily and any change to a type might end up with literally thousands of SQL queries that are broken.

Let's see some of the considerations that the Trino team had during the design phase and then we will elaborate more on
each one of them.

First we have performance considerations.
How can we make sure that we deliver the best possible performance when dealing with timestamps?
The trick here is to consider different representations and functions based on the precision the user has defined.

There's the major issue of backward compatibility.

How do we ensure that the Timestamp semantics of the current implementation will not break with the introduction of parameterization?

This compatibility is not just for the type itself but also for all the functions that are related to this type.

Then we have the added complexity of Trino being a federated query engine.

We need to make sure that the new Timestamp type can be correctly mapped to the equivalent types of each connector supported and do that in the most performant way possible.

Finally, we need to make sure that we handle correctly data type conversions between different types and precisions.

Each one of the above considerations is a huge fascinating topic on its own, so let's dive in!

From Logical to Physical representation

We physically implement timestamps using integers.
Obviously a 32bit integer can represent more timestamps than an 8bit one.

So how many bits do we need for picosecond precision?

The answer can be found in the design document for the variable precision datetime types.

For p=12 we need 79 bits which is a bit more than 64bits which is what the Long Java type supports. To represent higher than p=7 resolution Timestamps, we will need to come up with our own Java representation.

what Trino does is the following:

We will have one short encoding for any timestamp precision that can be represented by a Java Long Type. This is for an p<=6.

We will also introduce a long encoding for any timestamp precision p > 6 and up to the maximum we want to support, which in this case is 12.

The fractional part will be 16 or 32 depending on what precision we want to support at the end.

How does this look when implemented?

Let's see the current Trino implementation

The LongTimestamp Class implements the type we described in the previous diagram and as you can see there's one long java type that is used to hold the seconds and then a int java type (which is a 32bit type) that is used to represent the fractional part of seconds with precision up to 32 bits.

There's also an implementation for a Short Timestamp.

let's see how this looks like in Trino

We want the Short Timestamp to be represented by a 64bit integer and to do that, we explicitly associate this class with the Java long.class.

Most of the implementation has been omitted to make it easy to understand how Short Timestamp is represented, but the writeLong method is a good place to see how the type is assumed to be a long.

You might be wondering why we make our lifes so hard and we don't just simplify things by having only one Timestamp physical representation.

The answer to this is performance. Trino is one of the most performant SQL query engines in the industry and this wouldn't have happened if we didn't squish every little bit of performance we could.

When we talk about performance, there are two types we care about. One is storage performance, e.g. how we can reduce the storage we require to store information.

The other one is processing time performance, or how we can make things run faster.

Saving Space & Time

If we take a look at the LongTimestampType implementation

Trino is using a BlockBuilder to write and read data for the specific type.
What is interesting in the above code snippets is that:

The size of the type is used as information passed to the BlockBuilder
There's a special implementation of a BlockBuilder for Int96, if you remember, earlier we concluded that we will need 64 + 32 = 96 bits to represent our Long Timestamps.

If we take a look at the ShortTimestampType implementation, we'll notice that the BlockBuilder used is different.

You will notice that the API used is the same but the BlockBuilder used is different, now we use a LongArrayBlockBuilder instead of a Int96ArrayBlockBuilder.

The Block part of the Trino SPI is a very important part of the query engine as it is responsible for the efficient encoding of the data that is processed

Any decision in the Block API can affect greatly the performance of the engine.

Now you might wonder why I went through all these examples and why the Trino engineers went through the implementation of different block writers.

The reason is simple, it has to do with the space efficiency of the data types.

Remember that the ShortTimestampType is represented by a 64 long java type while a LongTimestampType by a 64 long java type plus a 32 int java type. Based on that, the memory required to store a timestam of each type is:

ShortTimestampType: 8 bytes
LongTimestampType: 16 bytes

We need twice the space for a timestamp with precision more than microseconds.
Keep in mind that if we use a LongTimestampType we will use 16 bytes of memory even if we don't have picosecond precision.

This is 2x the need in memory and that's a lot when we are working with petabytes of data!

Let's see now what happens with time complexity and if there's any performance difference for typical type operations like comparison between the two timestamp types.

To do that we will need the implementation of ShortTimestampType and LongTimestampType let's check the difference for the comparisonOperator.

The ShortTimestampType comparisonOperator involves one comparison that uses the Long.compare method.

On the other hand the LongTimestampType might potentially involve a second comparison, so in the worst case we have to compare one long and one int type.

There are substantial performance gains both in time and storage by adopting this more complex, dual type, timestamp implementation.

Decisions like this one in every part of the engine, is what makes Trino such a performant query engine.

Move Fast but Don't Break things

Trino is used by thousands of users on a daily basis so it's super important, whenever something new is introduced to e
nsure that nothing will break.

Backward compatibility is important and it's part of the design of any new feature.

To ensure backward compatibility it was decided to first tackle the language and the data types by maintaining current semantics while adding parametarization.

By current semantics we mean the precision that was supported by Trino at that time which was p=3.

Updating the types alone is not enough though, there are a number of special functions that have to be parameterized.

These functions are current_time, locatime, current_timestamp and localtimestamp where the user will be able to provide a parameter for precision while supporting the current semantics as defaults.

Together with the above mentioned functions all the connectors that accept the timestamp types in their DDL will have to be updated together with all functions and operators that operate on these types.

This is just the first step in building support for parametrized timestamp types and the purpose of this step is to introduce the appropriate changes without breaking anything to the existing users by enforcing the right defaults.

We also need to ensure that there's a way to safely cast a Timestamp with p1 precision to a Timestamp with p2 precision.

To do that, we need to implement logic for truncation and rounding.

The logic for these operations can be found in the implementation of the Timestamps class.

The above two methods are implementing logic for truncating micros to millis (the first one) and rounding micros to millis (the second one).

Time is Relative

Trino is a federated query engine, that means that it allows someone to execute a query on it that is going to be pushed down to different systems.

These systems usually do not share the same semantics. This is especially true for date/time types.

For Trino to consider a new feature for connectors, it first has to be implemented and released for Hive.

Hive is important because it is heavily used together with Trino but because it also acts as the reference implementation for any other connector.

The first step was to add support for variable-precision timestamp type for Hive connector. You might notice on this issue that the Hive metastore supports timestamps with optional nanosecond precision.

What that means is that although Trino can handle up to picosecond time resolution, when working with Hive, we can only use up to nanosecond.

This is a common pattern and different datastores will have different restrictions, for example the Postrgres connector can handle up to microsecond resolution.

That means that the person who implements support for the new Timestamp parameters for the Postgres connector will have to account for this and ensure that the right type casts happen.

After Hive has been supported a number of other connectors and client followed, e.g. updating the CLI, the JDBC and the Python clients. These are some of the most frequently used connectors and clients.

Conclusion

There are always trade-offs when you are working with temporal data, in most cases you will have to change multiple times the way you work with time while modeling and analysing your data.

The responsibility of the query engine is to provide you with all the tools you need to do that while ensuring the soundness and correctness of data processing.

Now that you have a basic understanding of how types work in Trino, I encourage you to take a deeper look into the codebase and dig deeper into how these types are serialized and then processed.

DEV Community