Designing a Scalable, Cost‑Effective Access Pattern for a High‑Throughput Time‑Series Store

You must store IoT sensor readings that arrive at a rate of 10,000 writes per second.

Each reading includes:

  • deviceId (string, partition key)
  • timestamp (ISO‑8601, sort key)
  • temperature, humidity, pressure (numeric)
  • metadata (JSON blob, optional)

Requirements:

  1. Fast point‑lookup for the latest reading of a given deviceId.
  2. Efficient range queries to retrieve all readings for a device within a time window (e.g., last 24 h).
  3. Retention policy: keep data for 30 days, then automatically expire.
  4. Cost‑optimized for the high write throughput while keeping read latency < 50 ms.

1. Table Schema & Primary Key

| Attribute | Type | Role |
|---|---|---|
| deviceId | String | Partition key |
| timestamp | String (ISO‑8601, e.g., 2025-12-04T12:34:56Z) | Sort key |
| temperature, humidity, pressure | Number | Payload |
| metadata | String (JSON) | Optional payload |
| ttl | Number (epoch seconds) | TTL attribute for expiration |

  • Why this PK?
    • Guarantees all readings for a device are stored together, enabling efficient range queries (deviceId = X AND timestamp BETWEEN …).
    • Allows a single‑item query for the latest reading by using ScanIndexForward=false and Limit=1 (see the sketch below).
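
Both access patterns map to a single Query call. A minimal boto3 sketch, assuming a table named SensorReadings (the table name is illustrative; ISO‑8601 strings in a fixed format sort correctly as the range key):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SensorReadings")  # illustrative table name


def latest_reading(device_id: str):
    """Point lookup: newest item in the device's partition."""
    resp = table.query(
        KeyConditionExpression=Key("deviceId").eq(device_id),
        ScanIndexForward=False,  # descending sort-key order (newest first)
        Limit=1,
    )
    items = resp.get("Items", [])
    return items[0] if items else None


def readings_in_window(device_id: str, start_iso: str, end_iso: str):
    """Range query: all readings for one device between two ISO-8601 timestamps."""
    items, start_key = [], None
    while True:
        kwargs = {
            "KeyConditionExpression": Key("deviceId").eq(device_id)
            & Key("timestamp").between(start_iso, end_iso)
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key  # paginate large windows
        resp = table.query(**kwargs)
        items.extend(resp.get("Items", []))
        start_key = resp.get("LastEvaluatedKey")
        if not start_key:
            return items
```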

2. Indexing Strategy

| Index | Partition Key | Sort Key | Use case |
|---|---|---|---|
| Primary table | deviceId | timestamp | Point lookup & range queries per device. |
| Global Secondary Index (GSI) DeviceLatestGSI | deviceId | timestamp | Dedicated query path for the latest reading (Limit=1, ScanIndexForward=false) that keeps this traffic off the base table. |
| Optional GSI MetricGSI | metricType (e.g., a constant such as "temperature") | timestamp | Cross‑device time‑range queries for a single metric (rare; note that a low‑cardinality partition key concentrates that metric's writes on a few GSI partitions). |

Note: The primary table already supports the latest‑reading query; the GSI is optional, adds write cost, and mainly pays off if you want a slim projection dedicated to very frequent “latest” reads so that traffic stays off the base table. In most cases the primary table with Limit=1 suffices.
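
If you do add DeviceLatestGSI, it can be declared when the table is created (or added later with UpdateTable). A hedged boto3 sketch, with an illustrative table name, on‑demand billing, and a slim INCLUDE projection:

```python
import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="SensorReadings",  # illustrative
    AttributeDefinitions=[
        {"AttributeName": "deviceId", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "deviceId", "KeyType": "HASH"},    # partition key
        {"AttributeName": "timestamp", "KeyType": "RANGE"},  # sort key
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand; see section 3
    GlobalSecondaryIndexes=[
        {
            "IndexName": "DeviceLatestGSI",
            "KeySchema": [
                {"AttributeName": "deviceId", "KeyType": "HASH"},
                {"AttributeName": "timestamp", "KeyType": "RANGE"},
            ],
            # Project only what the "latest reading" query needs
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["temperature", "humidity", "pressure"],
            },
        }
    ],
)
```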

3. Capacity Mode & Scaling

| Mode | When to use | Configuration |
|---|---|---|
| On‑Demand | Unpredictable spikes, easy start‑up, no capacity management. | Handles 10 k writes/s automatically; pay per request. |
| Provisioned + Auto Scaling | Predictable traffic, tighter cost control. | Each write of ≤ 1 KB consumes 1 WCU, so 10 k writes/s needs roughly 10,000 WCUs; size RCUs to your read traffic (e.g., 15,000) and enable auto‑scaling with a 70 % target utilization (see the sketch below). |

Cost comparison (approx., US East 1, Dec 2025):

  • On‑Demand writes: at $1.25 per million write request units, 10 k writes/s ≈ 864 M writes/day ≈ 26 B writes/month → roughly $32 k/month.
  • Provisioned: 10,000 WCUs at ≈ $0.00065 per WCU‑hour → roughly $4.7 k/month, plus an auto‑scaling buffer. On‑Demand is simpler to operate; provisioned is considerably cheaper when traffic is stable.
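
For provisioned mode, write‑capacity auto‑scaling is configured through Application Auto Scaling rather than on the table itself. A rough sketch with illustrative limits and the 70 % target from the table above:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's write capacity as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/SensorReadings",  # illustrative table name
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=10000,
    MaxCapacity=15000,
)

# Target-tracking policy: keep consumed/provisioned WCU around 70 %
autoscaling.put_scaling_policy(
    PolicyName="SensorReadingsWriteScaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/SensorReadings",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```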

4. Mitigating Hot‑Partition Risk

  • Uniform deviceId distribution: Ensure device IDs are random (e.g., UUID or hashed).
  • If a few devices dominate traffic: Use sharding – append a shard suffix to the partition key (e.g., deviceId#shard01). Store the shard count in a small config table; the application writes to a random shard and, on reads, queries all shards and merges the results (see the sketch below). This spreads write capacity across partitions.
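
A sketch of the write‑sharding pattern, assuming a fixed shard count known to the application (hard‑coded here; the config‑table lookup and pagination are omitted):

```python
import random

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SensorReadings")  # illustrative

SHARD_COUNT = 4  # assumption; the post keeps this in a small config table


def sharded_pk(device_id: str, shard: int) -> str:
    """Partition key of the form deviceId#shard00 .. deviceId#shard03."""
    return f"{device_id}#shard{shard:02d}"


def write_reading(device_id: str, item: dict) -> None:
    # Pick a random shard so writes for a hot device spread across partitions
    shard = random.randrange(SHARD_COUNT)
    table.put_item(Item={**item, "deviceId": sharded_pk(device_id, shard)})


def readings_in_window_sharded(device_id: str, start_iso: str, end_iso: str):
    """Read side: query every shard and merge the results by timestamp."""
    items = []
    for shard in range(SHARD_COUNT):
        resp = table.query(
            KeyConditionExpression=Key("deviceId").eq(sharded_pk(device_id, shard))
            & Key("timestamp").between(start_iso, end_iso)
        )
        items.extend(resp.get("Items", []))
    return sorted(items, key=lambda i: i["timestamp"])
```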

5. Data Retention (TTL)

  • Add a numeric attribute ttl = timestampEpoch + 30 days (see the sketch after this list).
  • Enable DynamoDB TTL on this attribute; DynamoDB automatically deletes expired items (typically within 48 h of expiration).
  • No additional Lambda needed, keeping cost low.
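
A sketch of both halves: stamping ttl at write time (reading epoch + 30 days, as above) and the one‑time call that points DynamoDB TTL at that attribute. Names are illustrative.

```python
from datetime import datetime

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SensorReadings")  # illustrative

THIRTY_DAYS = 30 * 24 * 60 * 60  # seconds


def put_reading(device_id: str, iso_timestamp: str, payload: dict) -> None:
    # ttl = the reading's epoch seconds + 30 days
    epoch = int(datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00")).timestamp())
    table.put_item(
        Item={
            "deviceId": device_id,
            "timestamp": iso_timestamp,
            "ttl": epoch + THIRTY_DAYS,
            **payload,  # numeric values as int/Decimal (boto3 resource rejects float)
        }
    )


# One-time setup: enable TTL on the "ttl" attribute
boto3.client("dynamodb").update_time_to_live(
    TableName="SensorReadings",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "ttl"},
)
```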

6. Read Performance Optimizations

  • Projection: Keep only needed attributes in the GSI (e.g., temperature, humidity, pressure, timestamp). This reduces read size and cost.
  • Consistent vs. eventual reads: Use eventual consistency for most queries (cheaper, 0.5 RCU per 4 KB). For the “latest reading” where freshness is critical, use strongly consistent read (1 RCU per 4 KB).
  • BatchGetItem for fetching multiple readings across devices in a single call; note it requires the full primary key (deviceId and timestamp) for each item, so it suits known‑timestamp lookups rather than “latest per device” (see the sketch after this list).
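
A sketch of a BatchGetItem call for a handful of known device/timestamp pairs (keys and table name are illustrative; UnprocessedKeys retry is omitted). Reads are eventually consistent by default, and the projection keeps response size down:

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Hypothetical keys: BatchGetItem needs the full primary key of every item
keys = [
    {"deviceId": "dev-001", "timestamp": "2025-12-04T12:34:56Z"},
    {"deviceId": "dev-002", "timestamp": "2025-12-04T12:35:02Z"},
]

resp = dynamodb.batch_get_item(
    RequestItems={
        "SensorReadings": {  # illustrative table name
            "Keys": keys,
            # "timestamp" is a DynamoDB reserved word, hence the #ts alias
            "ProjectionExpression": "deviceId, #ts, temperature, humidity, pressure",
            "ExpressionAttributeNames": {"#ts": "timestamp"},
        }
    }
)
readings = resp["Responses"]["SensorReadings"]
```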

7. Auxiliary Services (optional)

| Service | Purpose |
|---|---|
| Amazon Kinesis Data Streams | Buffer inbound sensor data, smooth bursty writes, and feed DynamoDB via a Lambda consumer (see the sketch below). |
| AWS Lambda (TTL cleanup) | If you need deletion at exactly 30 days, a scheduled Lambda can query items whose ttl is about to expire and delete them; DynamoDB TTL is usually sufficient. |
| Amazon CloudWatch Alarms | Monitor ConsumedWriteCapacityUnits, ThrottledRequests, and SystemErrors to trigger scaling or alerts. |
| AWS Glue / Amazon Athena | Ad‑hoc analytics on historical data exported to S3 (via DynamoDB Streams → Lambda → S3). |
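
A rough sketch of the Kinesis‑to‑DynamoDB path: a Lambda handler that decodes Kinesis records (assumed to be JSON sensor readings shaped like the schema above), stamps ttl, and batch‑writes them to the table. Error handling and partial‑batch retries are omitted; the TABLE_NAME environment variable is an assumption.

```python
import base64
import json
import os
from datetime import datetime
from decimal import Decimal

import boto3

TABLE_NAME = os.environ.get("TABLE_NAME", "SensorReadings")  # illustrative
table = boto3.resource("dynamodb").Table(TABLE_NAME)
THIRTY_DAYS = 30 * 24 * 60 * 60


def handler(event, context):
    """Lambda consumer for a Kinesis Data Stream of JSON sensor readings."""
    with table.batch_writer() as batch:
        for record in event["Records"]:
            reading = json.loads(
                base64.b64decode(record["kinesis"]["data"]),
                parse_float=Decimal,  # DynamoDB does not accept Python floats
            )
            epoch = int(
                datetime.fromisoformat(
                    reading["timestamp"].replace("Z", "+00:00")
                ).timestamp()
            )
            reading["ttl"] = epoch + THIRTY_DAYS
            batch.put_item(Item=reading)
```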

8. Trade‑offs Summary

| Trade‑off | Impact |
|---|---|
| On‑Demand vs. Provisioned | On‑Demand simplifies operations but is several times more expensive at a steady 10 k writes/s; Provisioned requires capacity planning but is much cheaper with auto‑scaling when traffic is stable. |
| Sharding vs. simplicity | Sharding removes hot‑partition risk for skewed device traffic but complicates query logic (multiple shards per device to fan out and merge). |
| TTL vs. Lambda cleanup | TTL is low‑cost but eventual (deletion can lag expiry by up to ~48 h); a Lambda gives precise timing but adds compute cost. |
| GSI for latest reading | Keeps “latest reading” traffic off the base table with a slim projection, but every write also updates the GSI (extra write cost). Often unnecessary when Limit=1 on the primary table suffices. |
| Strong vs. eventual consistency | Strongly consistent reads cost twice as many RCUs; use them only where immediate freshness is required. |

With this design you achieve:

  • Fast point‑lookup (Query with deviceId + Limit=1, ScanIndexForward=false).
  • Efficient time‑range queries (Query with deviceId and timestamp BETWEEN …).
  • Automatic 30‑day expiration via DynamoDB TTL.
  • Cost‑effective high‑throughput writes using on‑demand or provisioned capacity with auto‑scaling, plus optional sharding to avoid hot partitions.
