In this post we’ll explore what data streaming is and take a look at AWS data streaming services.
Data Streaming
Data streaming (also known as event streaming) is a continuous flow of data generated by thousands of sources. Sources of streaming data include application logs, clickstream data from websites and mobile devices, telemetry data from IoT devices, real-time location tracking, etc. Streaming data records are usually small, in the range of bytes to kilobytes.
With data streaming, high volumes of high-velocity data from different sources can be ingested, processed, and analyzed in real time. The data streaming service captures the data sent from the different sources and stores it for a configurable retention period, during which the data can be replayed any number of times. Multiple applications can consume the data simultaneously and independently.
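To make the idea concrete, here is a minimal, self-contained Python sketch (a toy model, not any particular AWS API) of a stream that retains records for a fixed period and lets several consumers read independently, each tracking its own position:

```python
import time

class Stream:
    """A toy in-memory stream: records are kept for a retention period
    and any number of consumers can read them independently."""

    def __init__(self, retention_seconds=3600):
        self.retention = retention_seconds
        self.records = []  # (timestamp, payload), append-only and ordered

    def put(self, payload):
        self.records.append((time.time(), payload))

    def read_from(self, offset):
        """Return records at or after `offset` that are still retained.
        Each consumer keeps its own offset, so reads are independent."""
        cutoff = time.time() - self.retention
        return [(i, p) for i, (ts, p) in enumerate(self.records)
                if i >= offset and ts >= cutoff]

stream = Stream(retention_seconds=3600)
stream.put({"event": "click", "page": "/home"})
stream.put({"event": "click", "page": "/pricing"})

# Two consumers read the same data independently; replay is just
# reading again from an earlier offset within the retention period.
for consumer_offset in (0, 1):
    print(stream.read_from(consumer_offset))
```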
There are two approaches to processing data: batch processing and stream processing. In traditional systems the data is first stored and then processed; with stream processing services the data can be consumed, enriched, and analyzed while it is in motion.
Batch Processing vs Stream Processing
A batch processing system takes a large set of data as input, runs a job to process it, and produces output data. The input is bounded: it has a known, finite size, so the batch job knows when it has finished reading the input. In reality, most data is unbounded, as it arrives gradually over time, so for batch processing the data is divided into chunks of fixed duration, e.g. one hour or one day. Batch jobs are usually scheduled to run periodically, and a job often takes a while to complete, from minutes to several days. Changes in the input are reflected in the output only hours or days later, when the job completes. Batch processing is suitable for historical data analysis, machine learning, etc.
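As a rough illustration (the file layout and field names here are made up), a batch job over one day's worth of clickstream logs might look like this:

```python
import json
from pathlib import Path

def run_daily_batch(log_dir):
    """Process a bounded input: all log files for one day.
    The job knows it is done when every file has been read."""
    counts = {}
    for log_file in Path(log_dir).glob("*.jsonl"):
        with open(log_file) as f:
            for line in f:
                event = json.loads(line)
                page = event.get("page", "unknown")
                counts[page] = counts.get(page, 0) + 1
    return counts

# Typically scheduled (e.g. via cron) to run once per day over the
# previous day's chunk of data; results appear only after the job ends.
print(run_daily_batch("logs/2021-01-01"))
```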
With stream processing the data is processed continuously in real time, even before it is stored. Like batch processing, stream processing consumes input data and produces output, but it does so on events shortly after they happen. The data can be processed on a record-by-record basis or over sliding time windows. Stream processing is intended for time-sensitive applications where data needs to be processed as soon as it arrives, with latency on the order of seconds or milliseconds. It is suitable for real-time stock trading applications, social media feeds, alerting systems, etc.
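Here is a minimal sketch of record-by-record processing with a sliding time window (pure Python, no external services; the threshold and field names are illustrative):

```python
import time
from collections import deque

class SlidingWindowCounter:
    """Counts events over the last `window_seconds`, updated per record."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.timestamps = deque()

    def on_event(self, event):
        now = time.time()
        self.timestamps.append(now)
        # Drop events that fell out of the window.
        while self.timestamps and self.timestamps[0] < now - self.window:
            self.timestamps.popleft()
        # React immediately, e.g. alert when traffic spikes.
        if len(self.timestamps) > 1000:
            print("ALERT: more than 1000 events in the last minute")

counter = SlidingWindowCounter(window_seconds=60)
for event in ({"event": "click"}, {"event": "click"}):  # stand-in for a live feed
    counter.on_event(event)
```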
Both approaches can also be combined by having a real-time layer and a batch layer. Data is first processed to extract real-time insights and then persisted to storage, from where it can be used in different batch processing scenarios.
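For example (a simplified sketch with made-up names), a single handler can feed both layers: update a real-time metric immediately and append the raw record to storage for later batch jobs:

```python
import json

live_counts = {}  # real-time layer: updated as each event arrives

def handle_event(event, archive_path="events-2021-01-01.jsonl"):
    # Real-time layer: extract an insight immediately.
    page = event.get("page", "unknown")
    live_counts[page] = live_counts.get(page, 0) + 1
    # Batch layer: persist the raw record for later batch processing.
    with open(archive_path, "a") as f:
        f.write(json.dumps(event) + "\n")

handle_event({"event": "click", "page": "/home"})
```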
AWS Streaming Services
Data streaming on AWS is enabled by the Amazon Kinesis family of services.
Amazon Kinesis Data Streams is a fully managed streaming data service that collects and stores data streams. It can continuously capture and store terabytes of data per hour from hundreds of thousands of sources. Data is stored in the order it was received for a configurable retention period and can be replayed within that period. You can build stream processing applications that consume the data from Kinesis streams with Kinesis Data Analytics, Lambda, or Spark Streaming.
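A small example using the boto3 SDK, assuming a stream named my-stream already exists (the stream name, shard ID, and payload are placeholders):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Producer: write a record; the PartitionKey determines the shard.
kinesis.put_record(
    StreamName="my-stream",
    Data=json.dumps({"event": "click", "page": "/home"}),
    PartitionKey="user-42",
)

# Consumer: get a shard iterator starting at the oldest retained
# record (TRIM_HORIZON), then read a batch of records from the shard.
shard_iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
for record in response["Records"]:
    print(json.loads(record["Data"]))
```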
Amazon Kinesis Data Firehose loads data streams into AWS data stores like S3, Redshift, Elasticsearch, etc. It delivers the streaming data to these storage destinations with zero administration.
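A minimal boto3 sketch, assuming a delivery stream named my-delivery-stream that is already configured to deliver to S3:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers records and delivers them to the configured
# destination (e.g. S3) without any consumer code on our side.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": json.dumps({"event": "click", "page": "/home"}) + "\n"},
)
```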
Amazon Kinesis Data Analytics enables real-time analytics on data streams. It lets you process data continuously and get insights in seconds or minutes.
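As a rough sketch, a SQL-based application can be created through boto3; the SQL below counts clicks per page over a one-minute tumbling window. The application and column names are placeholders, and the input stream would still have to be attached separately (e.g. with add_application_input):

```python
import boto3

analytics = boto3.client("kinesisanalytics")

# SOURCE_SQL_STREAM_001 is the default name Kinesis Data Analytics
# assigns to the first configured input stream.
application_code = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (page VARCHAR(64), page_count INTEGER);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "page", COUNT(*) AS page_count
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY "page", STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '60' SECOND);
"""

analytics.create_application(
    ApplicationName="clickstream-analytics",
    ApplicationCode=application_code,
)
```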
More details on batch and stream processing can be found in Designing Data-Intensive Applications by Martin Kleppmann. Data streaming scenarios and examples can be found in this AWS Whitepaper.