I have talked about Kinesis before, and I’m sure you’ve been using Kinesis for longer than I have. But from what I’ve seen, not all teams or companies use every part of Kinesis. There are four parts to Kinesis:
- Kinesis Data Streams – ingest and process streaming data
- Kinesis Firehose Delivery Streams – deliver streaming data to a destination
- Kinesis Analytics – analyze streaming data with analytics applications
- Kinesis Video Streams – ingest and process media streams
Each of these four parts offers something different. Well, the last two are definitely different from the first two. But it’s the first two that I see a lot of people confusing. So I thought I’d write this post and do my part in clearing up the confusion, to the best of my ability.
So in this post, we’ll see what each of the two – Data Streams and Firehose Delivery Streams – offers, and how they differ from one another. By the end of this post, you’ll hopefully have a better idea of when to pick which Kinesis stream for your app.
Kinesis Data Streams is the part that works like a pipeline for processing data. What I mean is: an external source, or one part of your system, generates messages and puts them into a data stream. Another part of your system listens for messages on that stream, and whenever data arrives, your app or service processes it.
This works much like a queue. One or more apps enqueue messages, and one or more apps or services dequeue them. As the name suggests, data streams are mostly used in systems or services where unbounded data keeps coming in and needs to be processed in real time.
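To make the producer/consumer idea concrete, here’s a minimal sketch using boto3. The stream name, payload shape, and shard ID are hypothetical, and the consumer side is heavily simplified (a real consumer would use the KCL or enhanced fan-out and iterate over every shard):

```python
import json

STREAM_NAME = "orders-stream"  # hypothetical stream name, for illustration

def encode_event(payload: dict) -> bytes:
    """Serialize an event payload into the bytes Kinesis expects."""
    return json.dumps(payload).encode("utf-8")

def put_event(kinesis, payload: dict, partition_key: str) -> None:
    """Producer side: write one record into the data stream."""
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=encode_event(payload),
        # Records with the same partition key land on the same shard.
        PartitionKey=partition_key,
    )

def read_events(kinesis, shard_id: str = "shardId-000000000000") -> list:
    """Consumer side: poll a single shard once for new records."""
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM_NAME,
        ShardId=shard_id,
        ShardIteratorType="LATEST",
    )["ShardIterator"]
    response = kinesis.get_records(ShardIterator=iterator)
    return [json.loads(record["Data"]) for record in response["Records"]]
```

You’d pass in a client created with `boto3.client("kinesis")`; the producer and consumer would normally live in entirely separate services.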
A Kinesis Firehose delivery stream is slightly different from a data stream, in that its main purpose is to deliver data to a predefined destination. There is no stream processing happening here. The best example I can give to explain a Firehose delivery stream is simple data lake creation. I talk about this so often because I have hands-on experience with it, and it just works.
So what happens here is: you have an app or service producing data into a delivery stream. The delivery stream then delivers that data to a storage location, such as an S3 bucket. Hence the name, delivery stream.
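The producer side of this can be sketched in a few lines with boto3. The delivery stream name is hypothetical; note that the producer never says anything about S3 – the destination is part of the delivery stream’s configuration, not the code:

```python
import json

DELIVERY_STREAM = "events-to-s3"  # hypothetical delivery stream name

def make_record(payload: dict) -> dict:
    """Firehose expects each record as {'Data': bytes}; a trailing newline
    keeps the resulting S3 objects line-delimited."""
    return {"Data": (json.dumps(payload) + "\n").encode("utf-8")}

def deliver(firehose, payload: dict) -> None:
    """Producer side: hand the record to Firehose. Firehose buffers it and
    writes it to the configured destination (e.g. an S3 bucket) on its own."""
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        Record=make_record(payload),
    )
```

As before, `firehose` would be a client created with `boto3.client("firehose")`.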
Now, this doesn’t mean you can’t do any processing on the incoming data. In fact, Amazon provides some built-in processing in Firehose, though with limitations. For example, you can use the built-in record format conversion to change the format of the incoming data: if your service produces JSON records into the stream, you can configure the delivery stream to deliver the data in Parquet.
But if you need any transformation beyond that, you’ll have to provide your own transformation logic. And yes, that’s possible, but it has to be a Lambda function. Firehose buffers the incoming records and invokes the configured Lambda function with batches of them. The Lambda function can transform each record however it likes and return the transformed data, and the delivery stream will deliver that instead of the original. This comes in very handy when you’re dealing with backward-compatibility issues.
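Here’s what such a transformation Lambda looks like. The record format (base64-encoded `data`, and one result per `recordId` with a status of `Ok`, `Dropped`, or `ProcessingFailed`) is what Firehose requires; the actual transformation shown – renaming a legacy field – is just a made-up backward-compatibility example:

```python
import base64
import json

def handler(event, context):
    """Firehose data-transformation Lambda: receives a batch of
    base64-encoded records and returns one result per recordId."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Hypothetical transformation: rename a legacy field so old
        # producers stay compatible with the new schema.
        if "user_name" in payload:
            payload["username"] = payload.pop("user_name")

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```

Firehose then delivers the returned `data` of every record marked `Ok` in place of the original.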
I hope the differences are clear enough now. To reiterate: Kinesis data streams are used where an unbounded stream of data needs to be worked on in real time, and Kinesis Firehose delivery streams are used when data needs to be delivered to a storage destination, such as S3.
If it still isn’t clear, try implementing a simple POC for each of these, and you’ll quickly see the difference. Oh, and one more thing: a Firehose delivery stream only has producers; you can’t attach consumers to it. Think about that!