ksqlDB (initially named KSQL) is not a new product: the first preview versions were released more than four years ago. And since the beginning, there's been a common misunderstanding: developers initially think of KSQL as a language for querying topic content. But that isn't its DNA. As a streaming database, its purpose is to offer the means to process data in real time, at scale, and in a resilient manner, with various high-level features such as joining streams of events, continuously aggregating data, and so on.
However, the fact that developers had that kind of expectation of the product is no surprise; trying to do things the way you have always done them feels natural. The challenge is the paradigm shift. As a solutions engineer, easing that change of culture and opening eyes to the differences between data at rest and data in motion is my daily job.
That being said, there's no point in opposing reacting to events and interrogating state; both are valid designs, to be used appropriately depending on the needs. I guess that's why ksqlDB evolved over time and introduced pull queries in v0.6.0. At first they were only available on materialized views, then on tables, and now they are finally available on streams.
Yes, you now have the capability to write this kind of statement:
SELECT ... from <a topic mapped as a stream> WHERE ...
As a pull query, it is executed once: the data is fetched and returned to the client, then the connection is closed.
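For contrast, a push query on the same stream keeps the connection open and continuously pushes new matching rows to the client as they arrive; in ksqlDB the distinction is made with the EMIT CHANGES clause:

```sql
-- Push query: the connection stays open and new matching rows
-- are streamed to the client as events arrive in the topic.
SELECT ... FROM <a topic mapped as a stream> WHERE ... EMIT CHANGES;
```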
Before throwing away PostgreSQL and other databases of that kind, let's take a step back and look at the impact of that kind of query. I created a small test on Confluent Cloud with a basic cluster and used the Datagen Source Connector to generate some dummy data based on the inventory model. I let it run until I had a significant amount of data on the topic, around 1 GB. After that, I mapped the topic in ksqlDB:
CREATE STREAM inventory WITH(kafka_topic='inventory', value_format='AVRO');
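As an aside, that Datagen source could also have been declared directly from ksqlDB with a CREATE SOURCE CONNECTOR statement. This is only a sketch: the property values (connector class, quickstart name, converter) are assumptions based on the connector's usual self-managed configuration, and they differ on Confluent Cloud:

```sql
-- Sketch only: property values are assumptions, adjust to your environment
CREATE SOURCE CONNECTOR inventory_gen WITH (
  'connector.class' = 'io.confluent.kafka.connect.datagen.DatagenConnector',
  'kafka.topic'     = 'inventory',
  'quickstart'      = 'inventory',
  'value.converter' = 'io.confluent.connect.avro.AvroConverter'
);
```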
Note that thanks to the Schema Registry integration, the stream is automatically created with the expected data structure:
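You can verify the inferred schema yourself with a standard DESCRIBE statement, which lists the stream's columns and their types:

```sql
-- Show the columns ksqlDB inferred from the Avro schema
DESCRIBE INVENTORY;
```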
Then I executed a useless query that returns no records at all:
select * from INVENTORY where id < -10;
As expected, the execution returned no records. What's good with Confluent Cloud is that, in addition to spinning up a cluster in a couple of seconds, it comes with out-of-the-box metrics in the user interface, and the impact of that query can be checked almost immediately in the consumption graph:
You can clearly see that the whole content of the topic was scanned, generating a lot of outgoing traffic.
Thank you, Captain Obvious, you've just demonstrated that Kafka is not a database. So in which cases are these pull queries useful? Well, imagine a topic on which you have to do forensics: spotting a set of records matching some criteria would require building an application with a consumer that fully reads the topic content and applies the filter (which is, in fact, what a pull query does). Dramatically painful for such a basic need. With pull queries on streams, the longest part is the execution 😎.
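For instance, a one-off forensics lookup becomes a single statement. The filter below reuses the `id` column from the earlier example; the range is purely illustrative:

```sql
-- One-shot forensics query: scan the whole stream once,
-- return the matching records, then close the connection.
SELECT * FROM inventory WHERE id BETWEEN 100 AND 200;
```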
All of this illustrates, with tangible facts, the caution of use briefly mentioned in the ksqlDB 0.23.1 blog post. If you want to know more about ksqlDB, head over to ksqldb.io to get started, where you can follow the quick start, read the docs, and learn more! Pro tip: I especially recommend checking out Confluent Cloud, as it will give you a complete working environment in a couple of minutes 😉.