Designing a Scalable, Cost‑Effective Access Pattern for a High‑Throughput Time‑Series Store

You must store IoT sensor readings that arrive at a rate of 10,000 writes per second.

Each reading includes:

  • deviceId (string, partition key)
  • timestamp (ISO‑8601, sort key)
  • temperature, humidity, pressure (numeric)
  • metadata (JSON blob, optional)

Requirements:

  1. Fast point‑lookup for the latest reading of a given deviceId.
  2. Efficient range queries to retrieve all readings for a device within a time window (e.g., last 24 h).
  3. Retention policy: keep data for 30 days, then automatically expire.
  4. Cost‑optimized for the high write throughput while keeping read latency < 50 ms.

1. Table Schema & Primary Key

| Attribute | Type | Role |
|---|---|---|
| deviceId | String | Partition key |
| timestamp | String (ISO‑8601, e.g., 2025-12-04T12:34:56Z) | Sort key |
| temperature, humidity, pressure | Number | Payload |
| metadata | String (JSON) | Optional payload |
| ttl | Number (epoch seconds) | TTL attribute for expiration |

  • Why this PK?
    • Guarantees all readings for a device are stored together, enabling efficient range queries (deviceId = X AND timestamp BETWEEN …).
    • Allows a single‑item query for the latest reading by using ScanIndexForward=false and Limit=1 (see the sketch below).
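
Both access patterns map to a single Query call. A minimal boto3 sketch, assuming a table named SensorReadings (the table name is illustrative; ISO‑8601 strings in a fixed format sort correctly as the range key):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SensorReadings")  # illustrative table name


def latest_reading(device_id: str):
    """Point lookup: newest item in the device's partition."""
    resp = table.query(
        KeyConditionExpression=Key("deviceId").eq(device_id),
        ScanIndexForward=False,  # descending sort-key order (newest first)
        Limit=1,
    )
    items = resp.get("Items", [])
    return items[0] if items else None


def readings_in_window(device_id: str, start_iso: str, end_iso: str):
    """Range query: all readings for one device between two ISO-8601 timestamps."""
    items, start_key = [], None
    while True:
        kwargs = {
            "KeyConditionExpression": Key("deviceId").eq(device_id)
            & Key("timestamp").between(start_iso, end_iso)
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key  # paginate large windows
        resp = table.query(**kwargs)
        items.extend(resp.get("Items", []))
        start_key = resp.get("LastEvaluatedKey")
        if not start_key:
            return items
```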

2. Indexing Strategy

| Index | Partition Key | Sort Key | Use case |
|---|---|---|---|
| Primary table | deviceId | timestamp | Point lookup & range queries per device. |
| Global Secondary Index (GSI) DeviceLatestGSI | deviceId | timestamp | Dedicated query path for the latest reading (Limit=1, ScanIndexForward=false) that keeps this traffic off the base table. |
| Optional GSI MetricGSI | metricType (e.g., a constant such as "temperature") | timestamp | Cross‑device time‑range queries for a single metric (rare; note that a low‑cardinality partition key concentrates that metric's writes on a few GSI partitions). |

Note: The primary table already supports the latest‑reading query; the GSI is optional, adds write cost, and mainly pays off if you want a slim projection dedicated to very frequent “latest” reads so that traffic stays off the base table. In most cases the primary table with Limit=1 suffices.
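
If you do add DeviceLatestGSI, it can be declared when the table is created (or added later with UpdateTable). A hedged boto3 sketch, with an illustrative table name, on‑demand billing, and a slim INCLUDE projection:

```python
import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="SensorReadings",  # illustrative
    AttributeDefinitions=[
        {"AttributeName": "deviceId", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "deviceId", "KeyType": "HASH"},    # partition key
        {"AttributeName": "timestamp", "KeyType": "RANGE"},  # sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand; see section 3
    GlobalSecondaryIndexes=[
        {
            "IndexName": "DeviceLatestGSI",
            "KeySchema": [
                {"AttributeName": "deviceId", "KeyType": "HASH"},
                {"AttributeName": "timestamp", "KeyType": "RANGE"},
            ],
            # Project only what the "latest reading" query needs
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["temperature", "humidity", "pressure"],
            },
        }
    ],
)
```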

3. Capacity Mode & Scaling

| Mode | When to use | Configuration |
|---|---|---|
| On‑Demand | Unpredictable spikes, easy start‑up, no capacity management. | Handles 10 k writes/s automatically; pay per request. |
| Provisioned + Auto Scaling | Predictable traffic, tighter cost control. | Each write of ≤ 1 KB consumes 1 WCU, so 10 k writes/s needs roughly 10,000 WCUs; size RCUs to your read traffic (e.g., 15,000) and enable auto‑scaling with a 70 % target utilization (see the sketch below). |

Cost comparison (approx., US East 1, Dec 2025):

  • On‑Demand writes: at $1.25 per million write request units, 10 k writes/s ≈ 864 M writes/day ≈ 26 B writes/month → roughly $32 k/month.
  • Provisioned: 10,000 WCUs at ≈ $0.00065 per WCU‑hour → roughly $4.7 k/month, plus an auto‑scaling buffer. On‑Demand is simpler to operate; provisioned is considerably cheaper when traffic is stable.
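
For provisioned mode, write‑capacity auto‑scaling is configured through Application Auto Scaling rather than on the table itself. A rough sketch with illustrative limits and the 70 % target from the table above:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's write capacity as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/SensorReadings",  # illustrative table name
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=10000,
    MaxCapacity=15000,
)

# Target-tracking policy: keep consumed/provisioned WCU around 70 %
autoscaling.put_scaling_policy(
    PolicyName="SensorReadingsWriteScaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/SensorReadings",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```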

4. Mitigating Hot‑Partition Risk

  • Uniform deviceId distribution: Ensure device IDs are random (e.g., UUID or hashed).
  • If a few devices dominate traffic: Use sharding – append a shard suffix to the partition key (e.g., deviceId#shard01). Store the shard count in a small config table; the application writes to a random shard and, on reads, queries all shards and merges the results (see the sketch below). This spreads write capacity across partitions.
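
A sketch of the write‑sharding pattern, assuming a fixed shard count known to the application (hard‑coded here; the config‑table lookup and pagination are omitted):

```python
import random

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SensorReadings")  # illustrative

SHARD_COUNT = 4  # assumption; the post keeps this in a small config table


def sharded_pk(device_id: str, shard: int) -> str:
    """Partition key of the form deviceId#shard00 .. deviceId#shard03."""
    return f"{device_id}#shard{shard:02d}"


def write_reading(device_id: str, item: dict) -> None:
    # Pick a random shard so writes for a hot device spread across partitions
    shard = random.randrange(SHARD_COUNT)
    table.put_item(Item={**item, "deviceId": sharded_pk(device_id, shard)})


def readings_in_window_sharded(device_id: str, start_iso: str, end_iso: str):
    """Read side: query every shard and merge the results by timestamp."""
    items = []
    for shard in range(SHARD_COUNT):
        resp = table.query(
            KeyConditionExpression=Key("deviceId").eq(sharded_pk(device_id, shard))
            & Key("timestamp").between(start_iso, end_iso)
        )
        items.extend(resp.get("Items", []))
    return sorted(items, key=lambda i: i["timestamp"])
```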

5. Data Retention (TTL)

  • Add a numeric attribute ttl = timestampEpoch + 30 days (see the sketch after this list).
  • Enable DynamoDB TTL on this attribute; DynamoDB automatically deletes expired items (typically within 48 h of expiration).
  • No additional Lambda needed, keeping cost low.
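
A sketch of both halves: stamping ttl at write time (reading epoch + 30 days, as above) and the one‑time call that points DynamoDB TTL at that attribute. Names are illustrative.

```python
from datetime import datetime

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SensorReadings")  # illustrative

THIRTY_DAYS = 30 * 24 * 60 * 60  # seconds


def put_reading(device_id: str, iso_timestamp: str, payload: dict) -> None:
    # ttl = the reading's epoch seconds + 30 days
    epoch = int(datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00")).timestamp())
    table.put_item(
        Item={
            "deviceId": device_id,
            "timestamp": iso_timestamp,
            "ttl": epoch + THIRTY_DAYS,
            **payload,  # numeric values as int/Decimal (boto3 resource rejects float)
        }
    )


# One-time setup: enable TTL on the "ttl" attribute
boto3.client("dynamodb").update_time_to_live(
    TableName="SensorReadings",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "ttl"},
)
```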

6. Read Performance Optimizations

  • Projection: Keep only needed attributes in the GSI (e.g., temperature, humidity, pressure, timestamp). This reduces read size and cost.
  • Consistent vs. eventual reads: Use eventual consistency for most queries (cheaper, 0.5 RCU per 4 KB). For the “latest reading” where freshness is critical, use strongly consistent read (1 RCU per 4 KB).
  • BatchGetItem for fetching multiple readings across devices in a single call; note it requires the full primary key (deviceId and timestamp) for each item, so it suits known‑timestamp lookups rather than “latest per device” (see the sketch after this list).
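
A sketch of a BatchGetItem call for a handful of known device/timestamp pairs (keys and table name are illustrative; UnprocessedKeys retry is omitted). Reads are eventually consistent by default, and the projection keeps response size down:

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Hypothetical keys: BatchGetItem needs the full primary key of every item
keys = [
    {"deviceId": "dev-001", "timestamp": "2025-12-04T12:34:56Z"},
    {"deviceId": "dev-002", "timestamp": "2025-12-04T12:35:02Z"},
]

resp = dynamodb.batch_get_item(
    RequestItems={
        "SensorReadings": {  # illustrative table name
            "Keys": keys,
            # "timestamp" is a DynamoDB reserved word, hence the #ts alias
            "ProjectionExpression": "deviceId, #ts, temperature, humidity, pressure",
            "ExpressionAttributeNames": {"#ts": "timestamp"},
        }
    }
)
readings = resp["Responses"]["SensorReadings"]
```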

7. Auxiliary Services (optional)

| Service | Purpose |
|---|---|
| Amazon Kinesis Data Streams | Buffer inbound sensor data, smooth bursty writes, and feed DynamoDB via a Lambda consumer (see the sketch below). |
| AWS Lambda (TTL cleanup) | If you need deletion at exactly 30 days, a scheduled Lambda can query items whose ttl is about to expire and delete them; DynamoDB TTL is usually sufficient. |
| Amazon CloudWatch Alarms | Monitor ConsumedWriteCapacityUnits, ThrottledRequests, and SystemErrors to trigger scaling or alerts. |
| AWS Glue / Amazon Athena | Ad‑hoc analytics on historical data exported to S3 (via DynamoDB Streams → Lambda → S3). |
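
A rough sketch of the Kinesis‑to‑DynamoDB path: a Lambda handler that decodes Kinesis records (assumed to be JSON sensor readings shaped like the schema above), stamps ttl, and batch‑writes them to the table. Error handling and partial‑batch retries are omitted; the TABLE_NAME environment variable is an assumption.

```python
import base64
import json
import os
from datetime import datetime
from decimal import Decimal

import boto3

TABLE_NAME = os.environ.get("TABLE_NAME", "SensorReadings")  # illustrative
table = boto3.resource("dynamodb").Table(TABLE_NAME)
THIRTY_DAYS = 30 * 24 * 60 * 60


def handler(event, context):
    """Lambda consumer for a Kinesis Data Stream of JSON sensor readings."""
    with table.batch_writer() as batch:
        for record in event["Records"]:
            reading = json.loads(
                base64.b64decode(record["kinesis"]["data"]),
                parse_float=Decimal,  # DynamoDB does not accept Python floats
            )
            epoch = int(
                datetime.fromisoformat(
                    reading["timestamp"].replace("Z", "+00:00")
                ).timestamp()
            )
            reading["ttl"] = epoch + THIRTY_DAYS
            batch.put_item(Item=reading)
```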

8. Trade‑offs Summary

| Trade‑off | Impact |
|---|---|
| On‑Demand vs. Provisioned | On‑Demand simplifies operations but is several times more expensive at a steady 10 k writes/s; Provisioned requires capacity planning but is much cheaper with auto‑scaling when traffic is stable. |
| Sharding vs. simplicity | Sharding removes hot‑partition risk for skewed device traffic but complicates query logic (multiple shards per device to fan out and merge). |
| TTL vs. Lambda cleanup | TTL is low‑cost but eventual (deletion can lag expiry by up to ~48 h); a Lambda gives precise timing but adds compute cost. |
| GSI for latest reading | Keeps “latest reading” traffic off the base table with a slim projection, but every write also updates the GSI (extra write cost). Often unnecessary when Limit=1 on the primary table suffices. |
| Strong vs. eventual consistency | Strongly consistent reads cost twice as many RCUs; use them only where immediate freshness is required. |

With this design you achieve:

  • Fast point‑lookup (Query with deviceId + Limit=1, ScanIndexForward=false).
  • Efficient time‑range queries (Query with deviceId and timestamp BETWEEN …).
  • Automatic 30‑day expiration via DynamoDB TTL.
  • Cost‑effective high‑throughput writes using on‑demand or provisioned capacity with auto‑scaling, plus optional sharding to avoid hot partitions.
