Writes, 3 ways: Postgres, Apache Kafka® and Apache Iceberg™
As a part of my new job at Snowflake, one of the most important pieces of technology I’ll need to advocate for going forward is Apache Iceberg™. Snowflake has made rather large investments in implementing Iceberg against its own proprietary data warehouse, and also in supporting the open source community by hiring and paying maintainers to work on Iceberg.
The only problem is: I don’t really know anything about Iceberg or how it works. 😅
I’ve always learned best by comparison: if I can compare something new to something I already know, and identify how it differs, I can understand it more easily. So, this post is an effort to understand Iceberg by way of two things I already understand: Postgres and Apache Kafka®.
Why writes?
The internet is a series of tubes.
For any data service, the things we care about most are how data gets in (writes) and how we get data out (reads). Every other operation is a combination of those two basics, and every other concern we have about data (e.g., network latency) is a concern only because we care about those two things.
That is to say: this post could’ve easily been about reads. Indeed, that might make for a good follow-up post.
Why Postgres, Apache Kafka and Apache Iceberg?
Well, as mentioned: I already know a bit about the first two. That said, Postgres makes for the perfect compare-and-contrast model against basically any other data service. It is so ubiquitous, and such a straightforward implementation of a relational database system, that it makes the perfect jumping-off point for basically everything else.
Secondly, the timeline these three systems trace is interesting: Postgres sits near the conception point of managed data services, having been open sourced in the mid-1990s from research done prior. Apache Kafka was open sourced in 2011, just as technology companies were beginning to scale from a data perspective and architectures were shifting toward distributed cluster models. Finally, Apache Iceberg was open sourced in 2017, well into the heyday of cloud computing, as blob storage for enormous amounts of data was becoming a concern. The popularity of these three systems has also endured: Postgres is still arguably the most popular database, Apache Kafka the most popular event streaming platform, and Iceberg is gaining dominance as an open table format.
There are arguments that other services, such as Redis or Hadoop, might be better suited for comparison; but frankly, I know less about those, so they make a less valuable yardstick for me to understand Iceberg against.
The anatomy of a Postgres write
So basically:
Any given write to a Postgres database starts and ends its journey with the Postgres postmaster. The postmaster process handles all incoming and outgoing connections to the database, and all communication with the other subprocesses.
From there, it hits the Write-Ahead Log (WAL) writer. No transactions are enacted directly on the data store: every transaction in a Postgres database is first recorded to the WAL.
From there, Postgres retrieves the affected database rows and columns from file storage and duplicates them in the buffer. Incidentally, this copy creates a new tuple, but we’ll touch on that later.
From here, Postgres tries the transaction against this scratch data: if the transaction fails, a failure is returned to the client via the postmaster. If the transaction succeeds, Postgres then updates the pointers for the affected rows in the database to point to the buffer data, and the buffer becomes fact.
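To make those steps concrete from the client’s point of view, here’s a minimal sketch of a single transactional write in Python using psycopg2. The connection details and the `users` table are assumptions for illustration only:

```python
# A minimal sketch of a transactional write, using psycopg2 against a
# hypothetical local database and a hypothetical `users` table.
import psycopg2

conn = psycopg2.connect("dbname=example user=postgres host=localhost")
try:
    with conn.cursor() as cur:
        # The INSERT is recorded in the WAL and tried against buffered copies
        # of the affected rows before it ever becomes "fact".
        cur.execute(
            "INSERT INTO users (name, email) VALUES (%s, %s)",
            ("Ada", "ada@example.com"),
        )
    # COMMIT is the moment the transaction becomes durable and visible.
    conn.commit()
except Exception:
    # On failure, nothing the transaction touched becomes visible.
    conn.rollback()
    raise
finally:
    conn.close()
```

Everything between the `execute` and the `commit` lives in that scratch space: if the transaction fails, the rollback means the buffered changes never become the rows the rest of the database sees.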
The way Postgres does writes – creating a copy in the buffer, trying the write, and then updating the row pointers to the buffer – is why another crucial Postgres feature, the autovacuum, exists and is such a point of discussion amongst Postgres database admins. In effect, every transaction that makes its way to the WAL generates the previously mentioned tuples, or collections of data of various types. In Postgres, a tuple usually refers to a given row (or row version), and the term comes up most often when talking about managing dead tuples via the autovacuum process.
In comparison to other data services, Postgres generates an awful lot of discarded data, because every transaction it tries either generates dead rows or generates buffer data that also needs cleanup. The autovacuum process has the potential to slow down the database in meaningful ways if the autovacuum cadence isn’t well managed and system resources aren’t well allocated. This is also why many cloud-based rewrites of Postgres feel the need to partially or completely rewrite the autovacuum.
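If you want to watch this churn happen, Postgres exposes per-table statistics about dead tuples and autovacuum runs in the standard `pg_stat_user_tables` view. A small sketch, again with psycopg2 and assumed connection details:

```python
# Inspect dead-tuple counts and the last autovacuum run per table.
# pg_stat_user_tables is a standard Postgres statistics view.
import psycopg2

conn = psycopg2.connect("dbname=example user=postgres host=localhost")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10
        """
    )
    for relname, live, dead, last_vacuum in cur.fetchall():
        print(f"{relname}: {live} live, {dead} dead, last autovacuum: {last_vacuum}")
conn.close()
```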
Finally, this also emphasizes why Postgres is so transactional: every write to the database must first pass through a set of discrete steps, and the write-ahead log ensures that every transaction is replayable. In these circumstances it becomes key that the system knows when a transaction starts, when it ends, and exactly which rows of the database it affects.
Before we move on, let’s highlight one key issue with Postgres and many “traditional” data stores, which will become relevant as soon as we talk about Kafka: Postgres by default stores data directly on disk, using the same system resources it uses for compute. In comparison to distributed or cloud computing models, which tend to separate storage and compute, this makes a vanilla Postgres deployment somewhat fragile: data is stored in one location, and if the compute components that make Postgres run take up too many resources, they can potentially take down the entire database.
The anatomy of an Apache Kafka write
Shifting gears completely: let’s look at what a “write” looks like in Apache Kafka.
I say “write” loosely, because Kafka is an event streaming platform: data written to a topic is there because it’s part of an event, not because Kafka is meant to be persistent storage. In fact, Apache Kafka as-is is not an ideal data store, since by default it discards data in topics after a set retention period. KIP-1150 aims to change this, but as of writing it is still under discussion.
With that said, let’s look at writing to a Kafka topic with an actor diagram, as we did previously:
The biggest difference between this and Postgres is that Kafka is a distributed system: when it comes to writing to a topic, this means we hit two layers of parallel processing. The first is at the partitioner, and the second is at the broker.
Any service can be an Apache Kafka producer. A producer sends data to the serializer first. Unlike Postgres, Kafka’s write path is not transactional by default: it operates primarily as a service for streaming data, and maintains an open connection with its producers. Messages are serialized by the producer and stored in serialized form in Kafka topics. The consumer of a topic is responsible for deserialization, and for knowing the original serialization format of a message.
Next, we hit our first layer of distributed computing: the partitioner uses the message key to determine which partition a message should go to. Partitioning is a complex topic in Kafka, so we’ll leave it there for now. The important part of partitioning for our purposes is that partitions do not receive the same set of messages by default. This means that message order is only preserved per partition: Kafka prioritizes the ability to process messages in a distributed fashion (faster overall) over processing them in a strict global order. This also introduces a common issue with distributed systems: concurrency problems. Any system designed around a Kafka instance needs to manage its partitioning and consumers correctly, so that the level of concurrency (or lack thereof) is what its consumers expect.
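To make the key-to-partition relationship concrete, here’s a minimal producer sketch using the kafka-python client. The broker address, topic name, and key are assumptions; the point is that messages sharing a key land on the same partition, in order:

```python
# A minimal sketch of a keyed write with kafka-python (assumed broker and topic names).
# The default partitioner hashes the key, so messages with the same key go to the
# same partition and keep their relative order; there is no global order across partitions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(3):
    # All three events share the key "user-42", so they land on one partition, in order.
    future = producer.send("user-events", key="user-42", value={"seq": i})
    metadata = future.get(timeout=10)  # RecordMetadata: the partition and offset the broker assigned
    print(f"partition={metadata.partition} offset={metadata.offset}")

producer.flush()
producer.close()
```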
From here, we hit our second layer of distribution: the brokers. A Kafka deployment typically runs more than one broker. For any given topic partition, Kafka elects a leader. The producer analyzes the message key, looks up the leader for that topic and partition, and writes the message to that broker. Leader election, like partitioning, is a complex topic, so we won’t go into it here.
For our purposes of understanding a write, though, this means a few things:
- Not all brokers in a Kafka cluster have all the data in a topic
- Brokers are designed to be added and removed (that is, to scale horizontally) based on the number of topics and the number of partitions per topic, as well as the compute resources available to the overall cluster.
- Brokers are therefore not permanent, nor is the data for a particular topic and partition guaranteed to stay on the same broker node in the cluster. Kafka does significant rebalancing on the back end, and brokers fail.
- By default, Kafka does not replicate data between broker nodes, having a default replication factor of 1. This is almost always raised to 3 for production purposes, however.
In comparison to a write in Postgres, there is no single source of truth with an Apache Kafka write, and the state of the system (exactly what data is where) is relatively hard to introspect as a result, especially in comparison to Postgres and its authoritative write-ahead log. The data is likely in the system somewhere, but in most cases it’s impossible to tell its physical location. That said, Kafka does assign an offset value to every message in a partition of a topic. This is how consumers track which messages within a topic they’ve already read (consumed), versus which are new.
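On the read side, that offset is exactly what a consumer uses to keep its place. A minimal sketch with kafka-python, using the same assumed broker and topic and a hypothetical consumer group:

```python
# A minimal sketch of offset tracking from the consumer side (assumed broker, topic, group).
# A consumer group commits the offsets it has read so it can resume where it left off.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="example-group",
    auto_offset_reset="earliest",  # start from the oldest retained message if no offset is committed
    enable_auto_commit=True,       # periodically commit this consumer's position back to Kafka
)

for message in consumer:
    # Partition plus offset uniquely identifies a message's position within the topic.
    print(f"partition={message.partition} offset={message.offset} value={message.value}")
```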
Another important thing to note with Apache Kafka is that topics are append-only. Records cannot be updated or deleted in place, and if that is a requirement, it must be handled on the consumer side.
We did a bit of hand-waving over the complexities of partitioning and broker management in Kafka in this section. This is specifically because interacting with highly distributed systems like Kafka often means that the client applications involved are in fact interacting with an API layer rather than directly with a machine. By dealing with an abstraction layer, we allow Kafka to handle optimization of reading and writing, at the cost of visibility into the system. To a certain extent, Postgres does this as well, with all Postgres connections coming through the postmaster.
Abstraction as a component of distributed systems is an important concept to keep in mind as we dive into Apache Iceberg.
The anatomy of an Apache Iceberg write
Once again into the breach!
Apache Iceberg is an abstraction layer for large data sets (of the Big Data, petabyte variety): it provides an open table format over them. That is, it allows users to interact with data distributed over many partitions and physical or virtual machines in a SQL-table-esque way.
As an Iceberg table is inherently an abstraction of an underlying blob storage layer, its write structure is reflective of that:
The client application – and the end user requesting changes to an Iceberg table – interacts with Iceberg through normal SQL commands. The write command, from the client’s side, is thus a normal INSERT. But that’s about where normalcy ends.
Once that request hits Iceberg, it is handled by the processing engine, which does a few things. First, the processing engine looks up the table’s current metadata pointer in the Iceberg catalog, which returns it to the processing engine.
Apache Iceberg, on a fundamental level, is a metadata storage and tracking mechanism. Iceberg catalogs all the files in blob storage in manifest files, and generates manifest lists of those manifests (so: metadata of metadata), which are mapped to a particular snapshot of an Iceberg table. Every time an Iceberg table receives a valid update request, the manifest files for that table are regenerated accordingly, a new snapshot metadata file is written, and the metadata pointer for the table is updated. Thus, the next time the client application accesses the table, it retrieves the current metadata pointer, which retrieves the snapshot metadata for the table, which retrieves the current manifests, which point to the files currently indexed by the table.
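You can poke at that metadata chain directly with the pyiceberg library. This is a sketch, not a canonical example: the catalog configuration and the `analytics.events` table are assumptions, and the exact catalog properties depend on your deployment:

```python
# A sketch of walking Iceberg's metadata chain with pyiceberg.
# The catalog name and the "analytics.events" table identifier are assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")  # catalog properties come from config or environment
table = catalog.load_table("analytics.events")

# The catalog hands back the current metadata, including the snapshot history.
snapshot = table.current_snapshot()
if snapshot is not None:
    print("current snapshot id:", snapshot.snapshot_id)
    print("manifest list:", snapshot.manifest_list)  # the metadata-of-metadata lives here

# Every previous snapshot is still recorded in the table metadata.
for s in table.metadata.snapshots:
    print(s.snapshot_id, s.timestamp_ms)
```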
So, the process for a write is as follows:
- Initiate the write via the client application (e.g., an INSERT)
- Pass the write to the client library’s processing engine
- Retrieve the metadata pointer for the table, and with it the table’s current snapshot metadata
- Have the processing engine plan the write against that snapshot
- Write new data files to blob storage
- Have the processing engine report the new data file locations back to the client library
- Create new manifest files and a new manifest list for the new data files
- Create a new metadata file (with a new snapshot) for the table
- Update the table’s metadata pointer to the new metadata file, and thus the new snapshot
- Report success to the client library, and thus to the client application
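Here’s what those steps look like when pyiceberg plays the role of the processing engine. A sketch, assuming the same hypothetical table as above and that its schema matches the Arrow data below: a single append writes new data files, new manifests and a manifest list, a new metadata file with a new snapshot, and finally swaps the catalog’s pointer:

```python
# A sketch of an Iceberg write via pyiceberg (table name and schema are assumptions).
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("analytics.events")

before = table.current_snapshot()

# One append: pyiceberg writes new data files to blob storage, new manifests and
# a manifest list, a new metadata file with a new snapshot, and then updates the
# table's metadata pointer in the catalog.
table.append(pa.table({"event_id": [1, 2, 3], "kind": ["click", "view", "click"]}))

after = table.current_snapshot()
print("snapshot before:", before.snapshot_id if before else None)
print("snapshot after: ", after.snapshot_id)
```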
This tangle of pointers and metadata files enables a few of Iceberg’s key design features:
- Hidden partitioning and partition layout evolution: one of Iceberg’s key tenets is to hide the partitioning of data files in blob storage from the end user. Iceberg does this to make interacting with data lakes and other unstructured data storage simpler for the end user. By removing the need to know where a specific set of files lives, and instead providing an interface that keeps track of the files, Iceberg reduces the fragility of any system an end user might build on top of it.
- Time travel: Iceberg treats snapshots as a first-class concept, and as such allows users to access past versions of a table by pointing to old snapshots. This also allows users to do diffs between snapshots (see the sketch after this list).
- Flexibility: Iceberg can work with a number of underlying data formats and underlying cloud stores. It’s relatively unopinionated about what the client system does with an Iceberg table and what the underlying system is: it’s a pure layer of abstraction.
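As a small illustration of time travel, pyiceberg lets a scan target a specific snapshot id. Again, the table identifier is the same assumption as in the earlier sketches:

```python
# A sketch of Iceberg time travel with pyiceberg: scan an older snapshot of the table.
# The table identifier is an assumption; snapshot ids come from the table's own history.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("analytics.events")

# Pick the oldest snapshot recorded in the table metadata.
oldest = min(table.metadata.snapshots, key=lambda s: s.timestamp_ms)

# Scanning with snapshot_id returns the table exactly as it was at that snapshot.
old_rows = table.scan(snapshot_id=oldest.snapshot_id).to_arrow()
current_rows = table.scan().to_arrow()

print("rows then:", old_rows.num_rows, "rows now:", current_rows.num_rows)
```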
What did we learn?
- Data is a hard problem to solve, and rigorous academic and real-world research has been put into it
- The trend is towards increasing layers of abstraction between the write layers and the storage layers
- The value of openness is high: data lives forever(-ish), and open source data services and data formats tend to win, because people want access to that data forever(-ish).


