Exploring Amazon Kinesis: Real-Time Data Streaming and Processing with Kinesis Streams and Firehose

In December 2013, Amazon Web Services (AWS) launched Kinesis, a service designed for processing real-time streaming big data. Over the years, AWS has expanded the availability of Kinesis to multiple regions, allowing integration with custom applications for real-time data processing from various sources.

Kinesis serves as a highly reliable conduit for streaming messages between data producers and data consumers. Data producers can be any source of data, such as system logs, social network data, financial information, geospatial data, mobile app data, or IoT device telemetry. Data consumers typically include applications for data processing and storage like Apache Hadoop, Apache Storm, Amazon Simple Storage Service (S3), and Elasticsearch.

Key Concepts: Shards, Records, and Partition Keys
To work with Kinesis Streams effectively, it's important to understand some key concepts. The fundamental scaling unit in Kinesis is a shard. Each shard can handle up to 1MB or 1,000 PUTs (data writes) per second, and emit data at a rate of 2MB per second.

Shards scale linearly, meaning that adding shards to a stream increases the ingestion rate by 1MB per second and the emission rate by 2MB per second for each added shard. For example, ten shards would enable a stream to handle 10MB (10,000 PUTs) of data ingestion and 20MB of data emission per second. The number of shards is determined when creating a stream and cannot be changed through the AWS Console afterward.
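
As a concrete illustration, here is a minimal sketch of provisioning ten shards at stream-creation time using Python and boto3 (the article doesn't prescribe a tool; the stream name, region, and shard count below are assumptions):

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Ten shards ~= 10MB/s (10,000 PUTs/s) of ingestion and 20MB/s of emission.
kinesis.create_stream(StreamName="clickstream-events", ShardCount=10)

# Stream creation is asynchronous; block until the stream becomes ACTIVE.
kinesis.get_waiter("stream_exists").wait(StreamName="clickstream-events")
```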

Resharding, the process of dynamically adding or removing shards from a stream, is possible using the AWS Streams API. However, resharding is considered an advanced strategy and should be approached with a solid understanding of the subject.
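To give a flavor of what resharding looks like at the API level, the sketch below splits one shard into two via the SplitShard call in boto3. The stream name is the illustrative one used above, and a real resharding strategy would also need to account for hot shards and parent/child shard states:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = "clickstream-events"  # illustrative name

# Look up the first shard and its hash-key range.
shard = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]
start = int(shard["HashKeyRange"]["StartingHashKey"])
end = int(shard["HashKeyRange"]["EndingHashKey"])

# Split the shard at the midpoint of its hash-key range,
# producing two child shards (and increasing the stream's cost).
kinesis.split_shard(
    StreamName=stream,
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str(start + (end - start) // 2),
)
```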

When shards are added or removed, the cost of the stream adjusts accordingly. The default limit is 10 shards per region, but this is a soft limit that can be raised by contacting Amazon Support; beyond that, there is no hard limit on the number of shards or streams an account can have.

Records are the data units stored in a stream, consisting of a sequence number, a partition key, and a data blob. The data blob is the payload of a record and has a maximum size of 1MB (before Base64 encoding); larger payloads must be split into smaller chunks before being put into a Kinesis stream.

Partition keys are hashed to determine which shard a record is routed to, which is how data is distributed across the shards of a stream. Sequence numbers are unique, monotonically increasing identifiers assigned to records as they are inserted into a shard, and they are specific to individual shards.
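
A short sketch of how these pieces fit together when writing a single record with boto3 (stream name, region, and payload are illustrative): the partition key determines the shard, and the response carries the assigned sequence number.

```python
import boto3
import json

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Records sharing a partition key land on the same shard
# and keep their relative order within that shard.
response = kinesis.put_record(
    StreamName="clickstream-events",  # illustrative name
    Data=json.dumps({"user": "42", "event": "click"}).encode("utf-8"),
    PartitionKey="user-42",
)

# Kinesis returns the shard the record landed on and its sequence number.
print(response["ShardId"], response["SequenceNumber"])
```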

Amazon Kinesis Offerings: Kinesis Streams, Firehose, and Analytics
Amazon Kinesis is divided into three service offerings:

Kinesis Streams: Captures large volumes of data from data producers and streams it into custom applications for processing and analysis. Kinesis replicates streaming data across three Availability Zones for reliability and availability. Scaling the ingestion and emission rates requires manually provisioning the appropriate number of shards for the expected data volume. Data can be loaded into streams over HTTPS (the PutRecord/PutRecords API), with the Kinesis Producer Library (KPL), or with the Kinesis Agent; consumers typically read data using the Kinesis Client Library (KCL). By default, data is available in a stream for 24 hours, which can be extended to 168 hours (7 days) for an additional charge. Monitoring is provided through Amazon CloudWatch.
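
For illustration, the raw consumer side looks roughly like the sketch below. Production consumers would normally use the Kinesis Client Library, which handles checkpointing and shard discovery; the stream name here is again an assumption:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream = "clickstream-events"  # illustrative name

# Pick the first shard and start reading from the oldest retained record.
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

# Each GetRecords call returns a batch of records plus the next iterator.
batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in batch["Records"]:
    print(record["SequenceNumber"], record["Data"])
```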

Kinesis Firehose: Used for capturing and loading streaming data into other Amazon services like S3 and Redshift. Firehose can handle gigabytes of streaming data per second and supports features like data batching, encryption, and compression. Unlike Kinesis Streams, Firehose automatically scales to meet demand, eliminating the need for manual provisioning. Data can be loaded into Firehose using various methods, and it can stream data to S3 and Redshift simultaneously. Monitoring is available through Amazon CloudWatch.
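
A minimal sketch of writing to a Firehose delivery stream with boto3, assuming a delivery stream named "clickstream-to-s3" has already been configured with an S3 destination (the name and payload are illustrative):

```python
import boto3
import json

firehose = boto3.client("firehose", region_name="us-east-1")

# Firehose buffers incoming records and delivers them to the configured
# destination (e.g. an S3 bucket or Redshift table) with no shard management.
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",  # illustrative delivery stream
    Record={"Data": (json.dumps({"user": "42", "event": "click"}) + "\n").encode("utf-8")},
)
```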

Kinesis Analytics: A forthcoming product from Amazon that allows running standard SQL queries on data streams and sending the results to analytics tools for monitoring and alerting. As of now, detailed information about this service has not been released by Amazon.

Kinesis vs SQS: Key Differences
Kinesis and Amazon's Simple Queue Service (SQS) differ in their purpose and capabilities. Kinesis is designed for real-time processing of streaming big data, while SQS serves as a message queue for storing messages between distributed application components.

Benefits of Kinesis:

Routing and ordering of records based on a given key.
Multiple clients can read messages concurrently from the same stream.
Ability to replay messages up to seven days in the past.
Records can be consumed at a later time.
Note, however, that enough shard capacity must be provisioned ahead of time to meet anticipated demand.

Benefits of SQS:

Messaging semantics for tracking successful completion of work items in a queue.
Delay scheduling of messages for up to 15 minutes.
Automatic scaling to handle application demand.

One further difference is batch size: SQS limits the number of messages that can be read or written in a single request, while Kinesis supports larger batches of records per call. By understanding the differences between Kinesis Streams, Firehose, and SQS, you can choose the most suitable service for your specific use case and requirements; a minimal SQS sketch follows below.
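
To make the contrast concrete, here is a minimal boto3 sketch of the SQS work-item pattern described above (the queue name and message body are illustrative). The explicit delete is what marks an item as successfully processed, which has no direct Kinesis equivalent:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = sqs.create_queue(QueueName="work-items")["QueueUrl"]

# Send a work item, optionally delaying its visibility (up to 15 minutes).
sqs.send_message(QueueUrl=queue_url, MessageBody="resize-image-123", DelaySeconds=60)

# A worker receives the message, processes it, then deletes it to mark
# the work item as successfully completed.
messages = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1).get("Messages", [])
for msg in messages:
    # ... process msg["Body"] ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```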
