Data Tech Bridge

Amazon Kinesis Cheat Sheet

Below are the key pointers for the AWS Kinesis services, compiled as a cheat sheet for the AWS Certified Data Engineer - Associate exam.

Core Components of AWS Kinesis

AWS Kinesis is a platform for streaming data on AWS, making it easy to collect, process, and analyze real-time, streaming data.

| Component | Description |
| --- | --- |
| Kinesis Data Streams | Real-time data streaming service for ingesting and storing data streams |
| Kinesis Data Firehose | Loads streaming data into AWS data stores with near-zero management |
| Kinesis Data Analytics | Process and analyze streaming data in real-time with SQL or Apache Flink |
| Kinesis Video Streams | Stream video from connected devices to AWS for analytics and ML |

Mind Map: AWS Kinesis Ecosystem

AWS Kinesis
├── Kinesis Data Streams
│   ├── Producers
│   │   ├── AWS SDK
│   │   ├── Kinesis Producer Library (KPL)
│   │   └── Kinesis Agent
│   ├── Consumers
│   │   ├── AWS SDK
│   │   ├── Kinesis Client Library (KCL)
│   │   ├── Lambda
│   │   └── Kinesis Data Firehose
│   └── Shards (throughput units)
├── Kinesis Data Firehose
│   ├── Sources
│   │   ├── Direct PUT
│   │   ├── Kinesis Data Streams
│   │   └── CloudWatch Logs/Events
│   └── Destinations
│       ├── S3
│       ├── Redshift
│       ├── OpenSearch (formerly Elasticsearch)
│       └── Splunk
├── Kinesis Data Analytics
│   ├── SQL Applications
│   └── Apache Flink Applications
└── Kinesis Video Streams
    ├── Producers (cameras, etc.)
    └── Consumers (ML services, etc.)

Detailed Features and Specifications

| Feature | Kinesis Data Streams | Kinesis Data Firehose | Kinesis Data Analytics | Kinesis Video Streams |
| --- | --- | --- | --- | --- |
| Purpose | Real-time streaming data ingestion | Load streaming data into data stores | Process streaming data in real-time | Stream video data for analytics |
| Data Retention | 24 hours (default) up to 365 days | No storage (immediate delivery) | No storage (immediate processing) | Up to 7 days |
| Scaling Unit | Shards | Automatic scaling | KPUs (Kinesis Processing Units) | Automatic scaling |
| Throughput | 1 MB/sec or 1000 records/sec per shard (ingestion); 2 MB/sec per shard (consumption) | Dynamic, up to service quotas | Based on KPUs | MB/sec per stream |
| Latency | ~200 ms | At least 60 seconds | Real-time (seconds) | Real-time |
| Reprocessing | Yes (by position) | No | No | Yes (by timestamp) |
| Pricing Model | Per shard-hour + PUT payload units | Volume of data + optional data transformation | KPU-hours | GB ingested + stored |

Kinesis Data Streams Deep Dive

1. Shards: Basic throughput unit for Kinesis Data Streams. Each shard provides 1MB/sec data input and 2MB/sec data output capacity.

2. Partition Key: Determines which shard a data record goes to. Good partition key design ensures even distribution across shards.

3. Sequence Number: Unique identifier for each record within a shard, assigned when a record is ingested.

4. Data Record: The unit of data stored in Kinesis Data Streams (up to 1MB).

5. Retention Period: Data can be stored from 24 hours (default) up to 365 days.

6. Enhanced Fan-Out: Provides dedicated throughput of 2MB/sec per consumer per shard (vs. shared 2MB/sec without it).

7. Capacity Modes:

  • Provisioned: You specify the number of shards
  • On-demand: Automatically scales based on observed throughput

8. Resharding Operations:

  • Shard split: Increases stream capacity
  • Shard merge: Decreases stream capacity

9. Producer Options:

  • AWS SDK (simple, low throughput)
  • Kinesis Producer Library (KPL) (high performance, batching, retry)
  • Kinesis Agent (log file collection)

10. Consumer Options:
- AWS SDK (simple, manual)
- Kinesis Client Library (KCL) (distributed, coordinated consumption)
- Lambda (serverless processing)
- Firehose (delivery to destinations)

11. Throughput Calculation Example:
- Required ingestion: 10MB/sec
- Required shards = 10MB ÷ 1MB = 10 shards
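The calculation above generalizes: a shard is limited by both bytes and record count, so the binding constraint decides the shard count. A minimal sketch (the function name is illustrative):

```python
import math

def required_shards(ingest_mb_per_sec: float, ingest_records_per_sec: int) -> int:
    """Estimate the shard count needed for a given ingestion rate.

    Each shard accepts up to 1 MB/sec AND 1000 records/sec;
    whichever limit is hit first determines the shard count.
    """
    by_bytes = math.ceil(ingest_mb_per_sec / 1.0)           # 1 MB/sec per shard
    by_records = math.ceil(ingest_records_per_sec / 1000)   # 1000 records/sec per shard
    return max(by_bytes, by_records, 1)

print(required_shards(10, 5000))  # 10 — byte throughput is the binding constraint
```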

12. KCL vs KPL: KPL (Producer Library) handles data production with batching and retries; KCL (Consumer Library) manages distributed consumption with checkpointing.
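The KPL's batching can be approximated client-side. The sketch below groups records into PutRecords-sized batches (the API accepts at most 500 records and 5 MB per call); it only builds the request entries, without calling AWS:

```python
# Kinesis PutRecords accepts at most 500 records and 5 MB per call;
# batching client-side (as the KPL does automatically) amortizes request overhead.
MAX_RECORDS_PER_CALL = 500
MAX_BYTES_PER_CALL = 5 * 1024 * 1024

def batch_put_records(records):
    """Split (partition_key, data_bytes) pairs into PutRecords-sized batches.

    A sketch of what the KPL's record collection does for you. Each batch is a
    list of entries suitable for boto3's kinesis.put_records(Records=...).
    """
    batches, current, current_bytes = [], [], 0
    for key, data in records:
        entry = {"PartitionKey": key, "Data": data}
        size = len(key) + len(data)
        if current and (len(current) >= MAX_RECORDS_PER_CALL
                        or current_bytes + size > MAX_BYTES_PER_CALL):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(entry)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```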

Kinesis Data Firehose Details

13. Delivery Frequency:
- S3: Buffer size (1-128 MB) or buffer interval (60-900 seconds)
- Other destinations: Buffer size (1-100 MB) or buffer interval (60-900 seconds)

14. Data Transformation: Lambda can transform records before delivery.

15. Format Conversion: Convert data to Parquet or ORC before S3 delivery.

16. Compression Options: GZIP, ZIP, Snappy for S3 delivery.

17. Error Handling: Failed records go to an S3 error bucket.

18. Dynamic Partitioning: Partition data in S3 based on record content, using partition keys extracted from each record (via JQ expressions or Lambda).
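The effect of dynamic partitioning can be sketched in plain Python: extract fields from the JSON payload and build a Hive-style S3 prefix. The field names (`customer_id`, `event_date`) are hypothetical payload fields, not part of the Firehose API:

```python
import json

def s3_partition_prefix(record_bytes: bytes) -> str:
    """Derive a Hive-style S3 prefix from a JSON record's fields,
    mimicking what Firehose dynamic partitioning produces."""
    rec = json.loads(record_bytes)
    # 'customer_id' and 'event_date' are illustrative field names.
    return f"customer_id={rec['customer_id']}/date={rec['event_date']}/"

payload = b'{"customer_id": "c-42", "event_date": "2024-05-01", "amount": 9.5}'
print(s3_partition_prefix(payload))  # customer_id=c-42/date=2024-05-01/
```

Prefixes like this let Athena or Redshift Spectrum prune partitions at query time.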

19. Durable Delivery: Retries failed deliveries, and backs up undeliverable records to S3 to avoid data loss.

20. Serverless: No capacity planning needed.

Kinesis Data Analytics Features

21. SQL Applications: Process streams using SQL queries.

22. Apache Flink Applications: Use Java, Scala, or Python with Apache Flink.

23. Input Sources: Kinesis Data Streams, Firehose, or reference data from S3.

24. Processing Features:
- Windowed aggregations (tumbling, sliding, session)
- Anomaly detection
- Stream-to-stream joins

25. Output Options: Kinesis Data Streams, Firehose, or Lambda.

26. Scaling: Based on Kinesis Processing Units (KPUs).

27. Checkpointing: Automatic state persistence for fault tolerance.

28. Parallelism: Automatically parallelizes processing across multiple instances.

Kinesis Video Streams

29. Use Cases: Security cameras, CCTV, body cameras, audio feeds.

30. Integration: Works with AWS ML services like Rekognition and SageMaker.

31. Producer SDK: Available for C++, Android, and iOS.

32. Fragments: Video is divided into fragments for processing.

33. Metadata: Can attach time-indexed metadata to video streams.

Performance Optimization and Limits

34. Kinesis Data Streams Limits:
- Per shard: 1MB/sec ingestion, 2MB/sec consumption
- Maximum record size: 1MB
- API limits: writes capped at 1000 records/sec per shard; GetRecords limited to 5 transactions/sec per shard
- Maximum data retention: 365 days

35. Handling Hot Shards: Use a good partition key design to distribute data evenly.

36. Resharding Best Practices: Perform during low traffic periods; complete one operation before starting another.

37. Kinesis Data Firehose Limits:
- Maximum record size: 1MB
- Service quota: 5000 records/sec or 5MB/sec per delivery stream (can be increased)

38. Kinesis Data Analytics Limits:
- SQL applications: Up to 8 KPUs per application
- Flink applications: Based on parallelism configuration

39. Implementing Throttling:
- Use exponential backoff for retries
- Implement client-side rate limiting
- Monitor and alert on throttling metrics
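Exponential backoff with full jitter is the standard retry pattern for throttled Kinesis calls; a minimal sketch (parameter values are illustrative):

```python
import random

def backoff_delays(max_retries: int = 5, base: float = 0.1, cap: float = 5.0):
    """Yield exponentially growing, jittered delays (in seconds) for retries.

    Full jitter — a uniform draw between 0 and the exponential bound —
    spreads retries out so throttled clients don't retry in lockstep.
    """
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * (2 ** attempt)))
```

In a real producer you would `time.sleep()` for each yielded delay between successive PutRecord attempts, giving up after `max_retries`.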

40. Overcoming Rate Limits:
- Request quota increases
- Implement batching with KPL
- Add more shards (Data Streams)

Replayability and Data Recovery

41. Replay Options in Kinesis Data Streams:
- Start from specific sequence number
- Start from timestamp
- Start from TRIM_HORIZON (oldest available record)
- Start from LATEST (most recent record)
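Each replay option maps to a `ShardIteratorType` in the GetShardIterator API. A sketch that builds the kwargs for boto3's `kinesis.get_shard_iterator` (the stream and shard names are illustrative); it constructs the parameters without calling AWS:

```python
from datetime import datetime, timezone

def shard_iterator_params(stream, shard_id, position, value=None):
    """Build kwargs for boto3's kinesis.get_shard_iterator for each replay option."""
    params = {"StreamName": stream, "ShardId": shard_id,
              "ShardIteratorType": position}
    if position in ("AT_SEQUENCE_NUMBER", "AFTER_SEQUENCE_NUMBER"):
        params["StartingSequenceNumber"] = value
    elif position == "AT_TIMESTAMP":
        params["Timestamp"] = value
    return params

# Replay everything still retained in the shard:
p = shard_iterator_params("clickstream", "shardId-000000000000", "TRIM_HORIZON")

# Replay from a specific point in time:
q = shard_iterator_params("clickstream", "shardId-000000000000",
                          "AT_TIMESTAMP", datetime(2024, 5, 1, tzinfo=timezone.utc))
```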

42. Checkpointing: KCL stores consumption progress in DynamoDB for fault tolerance.

43. Reprocessing Strategies:
- Create new consumer with different application name
- Reset existing consumer's checkpoints
- Use enhanced fan-out for parallel processing

44. Data Persistence: Consider archiving to S3 via Firehose for long-term storage.

45. Disaster Recovery: Implement cross-region replication using Lambda.

Open Source Integration

46. Kinesis Client Library (KCL) vs. Apache Kafka Consumer:

| Feature | KCL | Kafka Consumer |
| --- | --- | --- |
| Language Support | Java, Node.js, Python, .NET, Ruby | Java, multiple languages via clients |
| Coordination | DynamoDB | ZooKeeper/Broker |
| Scaling | Per shard consumption | Consumer groups |
| Checkpointing | Built-in | Manual offset management |
| Fault Tolerance | Automatic worker rebalancing | Rebalance protocol |

47. Apache Flink Integration: Kinesis Data Analytics for Apache Flink provides managed Flink environment.

48. Spark Streaming Integration: Can consume from Kinesis using Spark Kinesis connector.

CloudWatch Monitoring for Kinesis

49. Key Metrics for Kinesis Data Streams:
- GetRecords.IteratorAgeMilliseconds: Age of the oldest record (high values indicate consumer lag)
- IncomingBytes/IncomingRecords: Volume of data/records being written
- ReadProvisionedThroughputExceeded/WriteProvisionedThroughputExceeded: Throttling events
- PutRecord.Success/GetRecords.Success: Success rates for operations

50. Key Metrics for Kinesis Data Firehose:
- DeliveryToS3.Success: Success rate of S3 deliveries
- IncomingBytes/IncomingRecords: Volume of data/records being received
- ThrottledRecords: Number of records throttled
- DeliveryToS3.DataFreshness: Age of the oldest record not yet delivered

51. Key Metrics for Kinesis Data Analytics:
- KPUs: Number of KPUs being used
- fullRestarts: Number of application restarts
- downtime: Application downtime
- InputProcessing.OkBytes/InputProcessing.OkRecords: Successfully processed data

52. Recommended Alarms:
- Iterator age > 30 seconds (consumer lag)
- High throttling rates (>10%)
- Delivery freshness > buffer time + 60 seconds (Firehose)
- Error rates > 1%
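The iterator-age alarm above can be expressed as kwargs for CloudWatch's `put_metric_alarm`. The alarm name, stream name, and threshold are illustrative choices, not defaults:

```python
# A sketch of a CloudWatch alarm definition for consumer lag. Pass these kwargs
# to boto3.client("cloudwatch").put_metric_alarm(**iterator_age_alarm).
iterator_age_alarm = {
    "AlarmName": "kinesis-consumer-lag",                 # hypothetical name
    "Namespace": "AWS/Kinesis",
    "MetricName": "GetRecords.IteratorAgeMilliseconds",
    "Dimensions": [{"Name": "StreamName", "Value": "clickstream"}],
    "Statistic": "Maximum",
    "Period": 60,                    # evaluate over 1-minute windows
    "EvaluationPeriods": 1,
    "Threshold": 30_000,             # 30 seconds of lag, in milliseconds
    "ComparisonOperator": "GreaterThanThreshold",
}
```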

Security and Compliance

53. Encryption:
- Server-side encryption with KMS
- HTTPS endpoints for in-transit encryption
- Client-side encryption options

54. Authentication and Authorization:
- IAM roles and policies
- Fine-grained access control with IAM

55. VPC Integration: Private access via VPC endpoints.

56. Compliance: Supports HIPAA, PCI DSS, SOC, and ISO compliance.

57. Audit Logging: All API calls logged to CloudTrail.

Cost Optimization

58. Kinesis Data Streams Cost Factors:
- Shard hours (provisioned mode)
- Data ingested (on-demand mode)
- Extended data retention (beyond 24 hours)
- Enhanced fan-out consumers

59. Kinesis Data Firehose Cost Factors:
- Data ingested
- Format conversion
- VPC delivery

60. Kinesis Data Analytics Cost Factors:
- KPU hours
- Durable application backups

61. Cost Optimization Strategies:
- Right-size shard count
- Use on-demand mode for variable workloads
- Batch records when possible
- Monitor and adjust resources based on usage patterns

Common Architectures and Patterns

62. Real-time Analytics Pipeline:
- Data Streams → Data Analytics → Firehose → S3/Redshift

63. Log Processing Pipeline:
- CloudWatch Logs → Firehose → S3 → Athena

64. IoT Data Pipeline:
- IoT Core → Data Streams → Lambda → DynamoDB

65. Click-stream Analysis:
- Web/Mobile → Kinesis Data Streams → Data Analytics → ElastiCache

66. Machine Learning Pipeline:
- Data Streams → Firehose → S3 → SageMaker

Exam Tips and Common Scenarios

67. Scenario: High-throughput Ingestion
- Solution: Use KPL with batching and appropriate shard count

68. Scenario: Consumer Lag
- Solution: Add more consumers, use enhanced fan-out, or increase processing efficiency

69. Scenario: Data Transformation Before Storage
- Solution: Use Firehose with Lambda transformation

70. Scenario: Real-time Anomaly Detection
- Solution: Kinesis Data Analytics with RANDOM_CUT_FOREST function

71. Scenario: Video Analysis
- Solution: Kinesis Video Streams with Rekognition integration

72. Scenario: Exactly-once Processing
- Solution: Use KCL with careful checkpoint management

73. Scenario: Cross-region Replication
- Solution: Consumer application that reads from one region and produces to another

74. Scenario: Handling Spiky Traffic
- Solution: On-demand capacity mode for Data Streams

75. Scenario: Long-term Analytics
- Solution: Firehose to S3 with Athena or Redshift Spectrum

76. Scenario: Stream Enrichment
- Solution: Kinesis Data Analytics with reference data from S3

Troubleshooting Common Issues

77. ProvisionedThroughputExceededException:
- Cause: Exceeding shard limits
- Solution: Add more shards, implement backoff, use KPL batching

78. Iterator Expiration:
- Cause: Not processing data within 5 minutes of retrieval
- Solution: Process faster or request less data per GetRecords call

79. Consumer Lag:
- Cause: Slow processing, insufficient consumers
- Solution: Add consumers, optimize processing, use enhanced fan-out

80. Duplicate Records:
- Cause: Producer retries, consumer restarts
- Solution: Implement idempotent processing, track processed records
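Idempotent processing can be sketched by keeping a set of processed sequence numbers; in production the seen-set would live in a durable store such as DynamoDB rather than in memory:

```python
class IdempotentProcessor:
    """Apply each record at most once by tracking its sequence number.

    Kinesis guarantees at-least-once delivery, so producer retries and
    consumer restarts can both hand you the same record twice.
    """
    def __init__(self):
        self.seen = set()      # durable store (e.g. DynamoDB) in production
        self.applied = []

    def process(self, sequence_number: str, payload: str) -> bool:
        if sequence_number in self.seen:
            return False       # duplicate from a retry or restart: skip
        self.seen.add(sequence_number)
        self.applied.append(payload)
        return True
```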

81. Data Loss:
- Cause: Exceeding retention period, not handling failures
- Solution: Increase retention, implement proper error handling

82. Uneven Shard Distribution:
- Cause: Poor partition key choice
- Solution: Use high-cardinality partition keys, avoid hot keys
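Kinesis routes a record by MD5-hashing its partition key into a 128-bit integer and finding the shard whose hash-key range contains it. A sketch of that mapping, assuming the ranges evenly split the hash space (as they do right after stream creation):

```python
import hashlib

def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index the way Kinesis does:
    MD5 the key into a 128-bit integer, then scale into [0, shard_count)."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return h * shard_count >> 128
```

Because the mapping is deterministic, a low-cardinality key (e.g. a single tenant ID) pins all traffic to one shard; high-cardinality keys spread records across all shards.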

Advanced Features and Integrations

83. Kinesis Data Streams Enhanced Fan-Out:
- Dedicated throughput of 2MB/sec per consumer per shard
- Uses HTTP/2 for push-based delivery
- Lower latency (70ms vs. 200ms+)

84. Kinesis Data Firehose Dynamic Partitioning:
- Partition data in S3 based on record content
- Creates logical folders based on partition keys
- Optimizes for query performance

85. Kinesis Data Analytics Connectors:
- Apache Kafka
- Amazon MSK
- Amazon S3
- Amazon DynamoDB

86. Kinesis Video Streams Edge Agent:
- Run on IoT devices
- Buffer and upload video when connectivity is restored
- Supports intermittent connectivity scenarios

87. AWS Lambda Integration:
- Event source mapping for Data Streams
- Transformation for Firehose
- Processing for Video Streams
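With an event source mapping, Lambda receives Kinesis records base64-encoded under `event["Records"][i]["kinesis"]["data"]`. A minimal handler sketch (assuming JSON payloads):

```python
import base64
import json

def handler(event, context):
    """Minimal Lambda handler for a Kinesis event source mapping.

    Record data arrives base64-encoded; decode it before processing.
    Returning an empty batchItemFailures list tells Lambda the whole
    batch succeeded (when partial batch responses are enabled).
    """
    results = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        results.append(payload)
    return {"batchItemFailures": [], "processed": len(results)}
```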

88. AWS Glue Integration:
- Streaming ETL jobs can use Kinesis as source
- Process and transform streaming data
- Load into data lake or data warehouse

89. Amazon SageMaker Integration:
- Real-time ML inference on streaming data
- Model training on historical stream data
- Anomaly detection on streams

90. Amazon EventBridge Integration:
- Can route events to Kinesis
- Enables serverless event-driven architectures

Comparison with Other AWS Streaming Services

| Feature | Kinesis Data Streams | MSK (Managed Kafka) | EventBridge | SQS |
| --- | --- | --- | --- | --- |
| Purpose | Real-time streaming | Streaming & messaging | Event routing | Messaging |
| Throughput | 1 MB/sec per shard | MB/sec per broker | Thousands/sec | Unlimited |
| Retention | Up to 365 days | Configurable | No retention | 14 days |
| Ordering | Per shard | Per partition | Not guaranteed | FIFO available |
| Consumers | Multiple | Multiple | Multiple | Single (unless using fan-out) |
| Replay | Yes | Yes | No | No |
| Scaling | Manual/Auto | Manual | Automatic | Automatic |
| Latency | ~200 ms | ~ms | ~seconds | ~ms |

Data Ingestion Patterns and Throughput Characteristics

91. Batch vs. Real-time Ingestion:
- Batch: Lower cost, higher latency
- Real-time: Higher cost, lower latency

92. Throughput Patterns:
- Steady-state: Predictable, consistent load
- Spiky: Unpredictable bursts of traffic
- Cyclical: Predictable peaks and valleys

93. Handling Backpressure:
- Buffer in SQS before Kinesis
- Implement client-side throttling
- Use adaptive batching
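Client-side throttling is commonly implemented as a token bucket. A sketch with an injected clock so the behavior is deterministic (the class name and parameters are illustrative):

```python
class TokenBucket:
    """Client-side rate limiter: admit at most `rate` records per second,
    with bursts up to `capacity`. The clock is injected for testability."""
    def __init__(self, rate: float, capacity: float, now=lambda: 0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.now = capacity, now
        self.last = now()

    def allow(self) -> bool:
        t = self.now()
        # Refill tokens in proportion to elapsed time, up to capacity.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A producer would call `allow()` before each PutRecord and buffer (or drop) records when it returns False, smoothing bursts before they reach the shard limits.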

94. Kinesis Agent Features:
- Pre-processing capabilities
- Automatic retry with backoff
- CloudWatch monitoring integration

95. Multi-Region Considerations:
- Independent streams per region
- Cross-region replication via Lambda
- Global producer routing strategies

Replayability of Data Ingestion Pipelines

96. Replay Strategies:
- Store raw data in S3 for full reprocessing
- Use Kinesis retention period for short-term replay
- Implement event sourcing patterns

97. Replay Scenarios:
- Bug fixes in processing logic
- Recovery from downstream failures
- Historical analysis with new algorithms

98. Implementing Replayability:
- Maintain immutable event store
- Version processing applications
- Use consumer group IDs for isolation

99. Replay Challenges:
- Handling side effects (idempotency)
- Managing processing order
- Balancing storage costs vs. replay needs

100. Replay Best Practices:
- Test replay capabilities regularly
- Document replay procedures
- Monitor replay performance and correctness

Exam-Specific Tips

101. Remember the throughput limits: 1MB/sec in, 2MB/sec out per shard.

102. Know the retention period options: 24 hours (default) to 365 days.

103. Understand the difference between KPL and KCL: Producer vs. Consumer libraries.

104. Know when to use each Kinesis service:
- Data Streams: Raw ingestion and processing
- Firehose: Easy loading to destinations
- Data Analytics: Real-time processing
- Video Streams: Video ingestion and processing

105. Understand resharding operations: Split increases capacity, merge decreases it.

106. Know the consumer options: SDK, KCL, Lambda, Firehose.

107. Remember Firehose buffer settings: Size (1-128 MB) and interval (60-900 seconds).

108. Know the encryption options: Server-side with KMS, client-side, TLS in transit.

109. Understand monitoring metrics: Iterator age, throughput exceeded, success rates.

110. Know the common error scenarios and solutions: Provisioned throughput exceeded, iterator expiration.

111. Understand the cost model: Shard-hours, data ingestion, retention, enhanced fan-out.

112. Know the integration points: Lambda, Glue, SageMaker, S3, Redshift.

113. Understand the differences between Kinesis and other services: MSK, SQS, EventBridge.

114. Know the replay options: TRIM_HORIZON, LATEST, AT_TIMESTAMP, AT_SEQUENCE_NUMBER.

115. Understand the capacity modes: Provisioned vs. On-demand.

116. Know the enhanced fan-out benefits: Dedicated throughput, lower latency.

117. Understand the Firehose delivery frequency: Buffer size or interval, whichever comes first.

118. Know the Data Analytics processing options: SQL vs. Flink.

119. Understand the Video Streams use cases: Security cameras, CCTV, ML integration.

120. Know the common architectures: Real-time analytics, log processing, IoT data pipelines.
