What is Amazon Kinesis Data Streams

Amazon Kinesis Data Streams (KDS) is a managed AWS service for collecting, processing, and storing real-time streaming data at scale. If your application needs to ingest continuous streams of events (logs, clicks, telemetry, sensor data, audit events) and process them with low latency, Kinesis Data Streams is one of the standard options on AWS.


Why use a streaming data store?

  • You have data arriving continuously (millions of small messages per minute).
  • You need near real-time processing (analytics, alerts, enrichment, ETL).
  • You want to decouple producers and consumers (producers write events; many different consumers can read and process them).
  • You want to replay or reprocess historical data when needed.

Kinesis Data Streams gives you a durable, ordered buffer that sits between producers and consumers so you can build streaming pipelines reliably.


Basic concepts

Stream

A stream is the top-level object. Producers put records into a stream. Consumers read records from it.
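
As an illustration, creating a stream with the AWS SDK for Python (boto3) might look like this; the stream name and shard count are placeholder values:

```python
import boto3

kinesis = boto3.client("kinesis")

# Create a stream with two shards (name and count are example values).
kinesis.create_stream(StreamName="clickstream-events", ShardCount=2)

# Creation is asynchronous; wait until the stream becomes ACTIVE.
kinesis.get_waiter("stream_exists").wait(StreamName="clickstream-events")
```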

Shard

A stream is made of one or more shards. A shard is the unit of capacity: each shard accepts up to 1 MB/s (or 1,000 records/s) of writes and serves up to 2 MB/s of reads. You scale a stream by adding or removing shards (resharding).

Record

A record is a single unit of data that producers write. It contains:

  • a data blob (binary or text),
  • a partition key,
  • a sequence number (assigned by Kinesis when the record is accepted).

Partition key

Producers supply a partition key with each record. Kinesis uses the partition key to determine which shard will store the record. Records with the same partition key go to the same shard and preserve ordering relative to each other.
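
A minimal producer sketch with boto3, assuming a hypothetical stream named clickstream-events; the response contains the shard and sequence number Kinesis assigned:

```python
import boto3

kinesis = boto3.client("kinesis")

# Records sharing a partition key (here, a user ID) land on the same
# shard, so events for one user stay ordered relative to each other.
response = kinesis.put_record(
    StreamName="clickstream-events",  # example name
    Data=b'{"event": "page_view", "page": "/home"}',
    PartitionKey="user-42",
)

print(response["ShardId"], response["SequenceNumber"])
```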

Sequence number

Each record in a shard receives a unique sequence number. Consumers use sequence numbers for reading and for checkpoints (to resume processing).

Retention

Kinesis stores data for a configurable retention period (24 hours by default, extendable up to 365 days), so consumers can re-read and replay data within that window.
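
As a sketch, extending retention to 7 days with boto3 (stream name is again a placeholder):

```python
import boto3

kinesis = boto3.client("kinesis")

# Raise retention from the 24-hour default to 7 days (168 hours).
kinesis.increase_stream_retention_period(
    StreamName="clickstream-events",
    RetentionPeriodHours=168,
)
```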


How data flows (high level)

  1. Producers write records to the stream with the PutRecord/PutRecords APIs (typically via an AWS SDK).
  2. Kinesis routes each record to a shard using the partition key.
  3. Within a shard, records are ordered and get sequence numbers.
  4. Consumers read from shards using GetRecords (polling; see the sketch after this list), Enhanced Fan-Out (records pushed to registered consumers), or managed integrations such as AWS Lambda.
  5. Consumers process the records and checkpoint progress (so they can resume from where they left off).
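
To make the flow concrete, here is a minimal polling consumer in boto3 that reads one shard from the oldest retained record; the stream name and shard ID are illustrative, and a real consumer would also checkpoint and handle resharding:

```python
import time

import boto3

kinesis = boto3.client("kinesis")

# Start at the oldest record still in the retention window (TRIM_HORIZON).
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream-events",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        # record["Data"] holds the raw bytes the producer wrote.
        print(record["SequenceNumber"], record["Data"])
    iterator = batch["NextShardIterator"]
    time.sleep(1)  # stay under the 5 GetRecords calls/s per-shard limit
```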

Consumers and consumption models

Standard (shared) consumers

Consumers poll shards with GetRecords. All standard consumers of a stream share each shard's 2 MB/s read throughput.

Enhanced Fan-Out (EFO)

With EFO, each registered consumer gets a dedicated HTTP/2 push connection and its own 2 MB/s of read throughput per shard, so records arrive with very low latency and without competing with other consumers. This is useful when several independent applications need the same stream data.
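
Registering an EFO consumer is a single call; a boto3 sketch with a placeholder stream ARN and consumer name (actual reads then go through SubscribeToShard over HTTP/2):

```python
import boto3

kinesis = boto3.client("kinesis")

# Register a dedicated-throughput consumer on the stream (ARN is an example).
consumer = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/clickstream-events",
    ConsumerName="analytics-app",
)

# Pass this ARN to SubscribeToShard to start receiving pushed records.
print(consumer["Consumer"]["ConsumerARN"])
```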

Kinesis Client Library (KCL)

KCL is a client-side library (Java, with wrappers for other languages) that handles shard leasing, checkpointing, rebalancing, and fault tolerance. Use it when you need consumer-group behavior similar to Kafka's.

AWS Lambda integration

You can configure Lambda as a consumer. Lambda polls the stream and invokes your function with batches of records. This provides a serverless way to process stream data, though you still need to tune batch size, batch window, and function concurrency.
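
A minimal Python handler sketch: Lambda delivers Kinesis payloads base64-encoded inside event["Records"], so the function decodes each one before processing (the processing here is just a print):

```python
import base64
import json

def handler(event, context):
    # Lambda invokes the function with a batch of stream records.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        data = json.loads(payload)  # assumes producers wrote JSON
        print(record["kinesis"]["sequenceNumber"], data)
```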


Scaling and resharding

  • To increase capacity, add shards. To decrease, merge shards.
  • Resharding is done by splitting a shard into two or merging two adjacent shards into one (see the sketch after this list).
  • Partition key design is important: if keys are skewed, traffic concentrates on a small set of shards and causes hot-shard problems. Prefer even hashing and avoid a single hot key.
  • For sudden spikes, consider provisioning more shards proactively or design producers/consumers to handle transient throttling.
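
For uniform scaling you usually don't split or merge shards by hand; the UpdateShardCount API does it for you. A boto3 sketch with example values:

```python
import boto3

kinesis = boto3.client("kinesis")

# Double capacity from 2 to 4 shards; Kinesis performs the underlying
# shard splits and the operation completes asynchronously.
kinesis.update_shard_count(
    StreamName="clickstream-events",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
```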

Ordering and replay

  • Ordering is preserved per partition key (i.e., within a shard). There is no global ordering across all shards.
  • Because Kinesis retains data for the configured retention period, consumers can reprocess old data by reading from an older sequence number or timestamp (see the sketch after this list). This is useful for replays, bug fixes, and reprocessing pipelines.
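
For example, a replay can request a shard iterator positioned at a past timestamp; all values here are illustrative:

```python
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")

# Re-read the shard from a specific point in time (which must still
# fall inside the stream's retention window).
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream-events",
    ShardId="shardId-000000000000",
    ShardIteratorType="AT_TIMESTAMP",
    Timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)["ShardIterator"]
```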

Reliability, security, and durability

  • Data is synchronously replicated across three Availability Zones in a Region (managed by AWS) for durability.
  • Access is controlled via IAM policies (who can put, get, or manage streams).
  • You can enable server-side encryption at rest with KMS (see the sketch after this list); data in transit is protected with TLS.
  • Give producers and consumers dedicated IAM roles with least privilege.
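
Enabling encryption at rest is one API call; a boto3 sketch where the KMS key alias is a placeholder:

```python
import boto3

kinesis = boto3.client("kinesis")

# Turn on server-side encryption with a customer-managed KMS key.
kinesis.start_stream_encryption(
    StreamName="clickstream-events",
    EncryptionType="KMS",
    KeyId="alias/my-kinesis-key",  # example alias
)
```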

Cost (overview)

Kinesis Data Streams cost is primarily driven by:

  • shard hours (how many shards, and for how long),
  • data ingestion (PUT payload units),
  • enhanced fan-out, if used (billed per consumer-shard hour plus data retrieved),
  • extended retention, if you keep data beyond the default window.

For cost-sensitive workloads, tune shard count, batch records with PutRecords (as sketched below), and consider Kinesis Data Firehose or Amazon MSK if they are a better fit.
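
Batching amortizes per-request overhead. A put_records sketch that also retries per-record failures, since PutRecords can succeed partially (stream name is an example):

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Up to 500 records are allowed per PutRecords call.
entries = [
    {
        "Data": json.dumps({"event": "click", "n": i}).encode(),
        "PartitionKey": f"user-{i % 10}",
    }
    for i in range(100)
]

response = kinesis.put_records(StreamName="clickstream-events", Records=entries)

# Entries that were throttled or rejected carry an ErrorCode; retry them.
if response["FailedRecordCount"]:
    retries = [
        entry
        for entry, result in zip(entries, response["Records"])
        if "ErrorCode" in result
    ]
    kinesis.put_records(StreamName="clickstream-events", Records=retries)
```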


Best practices (short list)

  • Design partition keys to spread load evenly across shards.
  • Use PutRecords (batch API) instead of many single PutRecord calls.
  • Monitor consumer lag (iterator age) and scale shards before lag becomes critical; see the sketch after this list.
  • Use KCL or a managed consumer pattern to handle shard rebalancing and checkpointing.
  • Keep messages small and carry only necessary data; offload large payloads to S3 if needed.
  • Use IAM, encryption, VPC endpoints (if needed), and CloudWatch for observability.

When to choose Kinesis Data Streams

Choose Kinesis Data Streams when you need:

  • ordered, durable streaming with replay capability,
  • multiple independent consumers reading the same stream,
  • tight control over shard-level capacity and scaling,
  • integration with AWS tools (Lambda, Kinesis Data Analytics, Firehose).

Conclusion

Amazon Kinesis Data Streams is a robust, AWS-managed building block for real-time streaming pipelines. It gives you ordered, durable streams, multiple consumption options, and the ability to replay data. For technical teams building event-driven or streaming analytics systems, Kinesis Data Streams is a practical and well-integrated choice on AWS, provided you design partition keys, shards, and consumers carefully.
