Designing a Global Top K Heavy Hitters System with Kafka, Stream Processing, and Redis Sorted Sets
Table of Contents
- System Overview
- Key Components
- High Level Architecture
- Detailed Flow
- Flow Diagram
- Redis Top K Strategies
- Windowing Deep Dive
- Optimizations
- Scaling for Global Use
- Data Modeling
- Capacity Planning and Performance
- Fault Tolerance and Recovery
- Security, Privacy, and Compliance
- Example API Request and Responses
- Code Snippets (Appendix)
- References & Further Reading
System Overview
We need a globally scalable system to identify Top K “Heavy Hitters” (most‑viewed or most‑played posts/songs) from a continuous event stream. The system must:
- Efficiently aggregate and rank by views in near real time
- Support regional and global Top K queries
- Deliver high availability, low latency, and elastic scale
- Maintain predictable performance as data volume and users grow
Key idea: Move ranking work from read time (expensive ORDER BY in databases) to write time using Redis Sorted Sets. Stream processors incrementally update scores; Redis keeps the set sorted so reads are fast and predictable.
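To make the write-time idea concrete, here is a toy in-memory analogue of a Redis Sorted Set (the Leaderboard class is illustrative, not a real library): ordering is maintained on every write, so a Top K read is just a slice. Redis uses a skip list for O(log N) updates; this sketch uses bisect for brevity.

```python
import bisect

class Leaderboard:
    """Toy in-memory analogue of a Redis ZSET: the member list is kept
    sorted at write time, so a Top K read is a slice, not a sort."""

    def __init__(self):
        self.scores = {}    # member -> score
        self.sorted = []    # ascending list of (score, member)

    def zincrby(self, member, delta=1):
        """Like ZINCRBY: bump a member's score, keeping order."""
        old = self.scores.get(member)
        if old is not None:
            self.sorted.remove((old, member))  # O(N) here; O(log N) in Redis
        new = (old or 0) + delta
        self.scores[member] = new
        bisect.insort(self.sorted, (new, member))
        return new

    def top_k(self, k):
        """Like ZREVRANGE key 0 k-1 WITHSCORES."""
        return [(m, s) for s, m in reversed(self.sorted[-k:])]

lb = Leaderboard()
for post in ["a", "b", "a", "c", "a", "b"]:
    lb.zincrby(post)
print(lb.top_k(2))  # [('a', 3), ('b', 2)]
```

The read path never scans or sorts the full population, which is exactly the property that makes Redis ZSET reads fast and predictable.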
Key Components
- Event Streaming
- Apache Kafka: durable, partitioned log for view/play events.
- Stream Processing
- Apache Flink, Kafka Streams, Spark Structured Streaming, or Apache Samza: stateful aggregations and windowed counts.
- Fast Rank Index + Cache
- Redis Sorted Sets (ZSETs): millisecond Top K reads and rank queries.
- Operational Database (recommended for history/analytics)
- Apache Cassandra or MongoDB: time‑bucketed aggregates and metadata.
- API Layer
- REST/GraphQL service exposing Top K endpoints (regional and global).
- Global Aggregator
- Periodically merges regional Top K into a “global Top K” (min‑heap or Redis ZSET).
- Observability
- Metrics (Prometheus/Grafana), logs (ELK/Cloud), tracing (OpenTelemetry).
High Level Architecture
- Clients → Kafka (views/play events)
- Stream processors → Incremental aggregates → Redis (Top K per region/window)
- API → Reads Redis (falls back to DB if needed)
- Global aggregator → Merges regional Top K → Global Top K in Redis
Detailed Flow
1. Data Ingestion
- Producers: Web/mobile clients emit events { post_id, region_id, timestamp, user_id (optional) }.
- Kafka:
  - Topic partitioned by region_id for parallelism/locality.
  - Use Avro/Protobuf with Schema Registry for safe schema evolution.
2. Stream Processing
- Goal: Incrementally compute per‑region counts and maintain Top K without scanning databases.
- Optional windowing: compute “trending” (e.g., 5‑minute tumbling or sliding windows).
- Update Redis:
  - Exact: ZINCRBY per post, maintain a full leaderboard.
  - Bounded Top K: keep only the top K in a compact ZSET; ignore the long tail.
3. Database Storage
Recommended for durability, analytics, and recovery:
- Persist time‑bucketed aggregates (e.g., end‑of‑window or periodic checkpoints) to Cassandra/MongoDB.
4. Caching and Rank Index (Redis)
- Per‑region ZSET: region:<region_id>:top_k
- Per‑window ZSET: region:<region_id>:top_k:<window_id>
- Global ZSETs: global:top_k, global:top_k:<window_id>
5. Regional Aggregation
- Preferred: Write‑through from stream processors to Redis, keeping Redis as the live rank index.
- Reconcile occasionally from DB if needed (e.g., post‑incident).
6. Global Aggregation
- Collect regional Top K entries (R regions → ≤ K × R items).
- Compute global Top K with a min‑heap of size K or a Redis ZSET (trim with ZREMRANGEBYRANK).
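The merge step can be sketched in a few lines (merge_global_top_k is an illustrative name; a production aggregator would read the regional lists from each region's Redis):

```python
import heapq

def merge_global_top_k(regional_top_ks, k):
    """Merge R regional Top K lists (at most K*R candidates) into a
    global Top K, summing scores for posts seen in several regions.

    heapq.nlargest keeps a size-k min-heap internally, so the merge is
    O(K*R log k). Caveat: a post outside some region's Top K contributes
    nothing from that region, so global counts can slightly underestimate."""
    totals = {}
    for region in regional_top_ks:
        for post_id, score in region:
            totals[post_id] = totals.get(post_id, 0) + score
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

na = [("p1", 500), ("p2", 400), ("p3", 100)]
eu = [("p2", 300), ("p4", 250), ("p1", 50)]
print(merge_global_top_k([na, eu], 3))  # [('p2', 700), ('p1', 550), ('p4', 250)]
```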
7. API Layer
- GET /top-posts?region=<region>&timeframe=<window>&k=<N>
- Fast path: read from Redis; optional DB fallback for cache misses.
Flow Diagram
1) Client Event
-> Kafka (topic: views; partitioned by region_id)
2) Stream Processor
-> Aggregates (optionally windowed)
-> Redis (ZSET per region/window) for Top K
-> Cassandra/MongoDB for historical aggregates
3) API Service (GET /top-posts)
-> Redis read:
Hit: return Top K
Miss: optional DB query + repopulate Redis
4) Global Aggregator (periodic)
-> Fetch regional Top K
-> Merge to global Top K (min‑heap or Redis ZSET)
-> Write to Redis (global:top_k)
Redis Top K Strategies
A) Exact Leaderboards (Full ZSET)
- Keep every post in a ZSET per region with total_views as the score.
- Pros: Exact rank for every item; easy “neighbors by rank.”
- Cons: Memory‑heavy at very high cardinality; a single ZSET key lives on one Redis shard.
B) Bounded Top K (Recommended for Massive Scale)
- Maintain only the top K in Redis—memory and latency remain constant regardless of total population.
- Update logic:
  - If ZCARD(key) < K → ZADD
  - Else fetch the current minimum (the Kth score): ZRANGE key 0 0 WITHSCORES
    - If new_score > min_score: MULTI → ZADD key new_score post_id → ZPOPMIN key 1 → EXEC
    - Else skip (not in Top K)
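The same update rule can also run in process memory, for example to pre-filter inside the stream processor before touching Redis. A sketch using Python's heapq as the size-K min-heap (bounded_topk_update and the index dict are illustrative names):

```python
import heapq

def bounded_topk_update(heap, index, k, post_id, new_score):
    """Bounded Top K update against an in-memory size-k min-heap.

    heap  : list of [score, post_id] entries (heapq min-heap)
    index : post_id -> heap entry, so existing members update in place
    """
    if post_id in index:                  # already tracked: set new score
        index[post_id][0] = new_score
        heapq.heapify(heap)               # O(k) repair; fine for modest k
        return
    entry = [new_score, post_id]
    if len(heap) < k:                     # ZCARD < K -> just add
        heapq.heappush(heap, entry)
        index[post_id] = entry
    elif new_score > heap[0][0]:          # beats the current Kth score
        evicted = heapq.heappushpop(heap, entry)  # add + evict old minimum
        del index[evicted[1]]
        index[post_id] = entry
    # else: not in Top K, skip

heap, index = [], {}
for pid, score in [("a", 5), ("b", 3), ("c", 8), ("d", 4), ("b", 9)]:
    bounded_topk_update(heap, index, 3, pid, score)
print(sorted(index))  # ['a', 'b', 'c']
```

Note how "b" is evicted and later re-enters with a higher score, mirroring the ZADD + ZPOPMIN sequence above.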
C) Hybrid Approach
- Redis Top K for UI speed
- Cassandra/MongoDB or Redis KV for per‑post totals and analytics
- Global Top K by merging regional Top Ks
Handling Ties
- Deterministic order via:
- score = primary_score + tiny epsilon (e.g., fractional timestamp), or
- secondary tie‑break in app (e.g., latest update wins)
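One way to implement the epsilon tie-break from the first bullet (the function name and the horizon constant are assumptions for illustration):

```python
def tie_break_score(count, event_ts_ms, horizon_ms=10**13):
    """Fold a fractional timestamp into the ZSET score so equal counts
    order deterministically (here: the earlier event ranks higher).

    The added term stays in [0, 0.5], well below the gap of 1 between
    distinct counts, so it never reorders different counts. horizon_ms
    is an assumed upper bound on timestamps (~year 2286)."""
    return count + (1 - event_ts_ms / horizon_ms) / 2

earlier = tie_break_score(100, 1_700_000_000_000)
later = tie_break_score(100, 1_700_000_060_000)   # one minute later
assert earlier > later                  # tie broken by time
assert tie_break_score(101, 0) > earlier  # real count gaps still dominate
```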
Windowing Deep Dive
Event Time, Watermarks, Allowed Lateness
- Event time: Use event timestamps (not processing time) for correctness.
- Watermarks: Indicate event‑time progress and when windows can be closed.
- Allowed lateness: How long late events can still adjust a closed window.
Window Types
- Tumbling: Fixed, non‑overlapping (e.g., 5m, 1h). Simple and cache‑friendly.
- Sliding: Overlapping windows (size=5m, slide=1m). Smoother, more compute.
- Session: Gap‑based, dynamic. Good for bursty behavior.
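Tumbling and sliding window assignment reduces to simple arithmetic on the event timestamp; the window start doubles as the <window_id> used in keys like region:<r>:top_k:<window_id>. A sketch (function names are illustrative):

```python
def tumbling_window(ts_ms, size_ms):
    """Start of the single tumbling window containing the event."""
    return ts_ms - ts_ms % size_ms

def sliding_windows(ts_ms, size_ms, slide_ms):
    """Start times of every sliding window covering the event."""
    last = ts_ms - ts_ms % slide_ms            # latest window starting at/before ts
    first = max(0, last - size_ms + slide_ms)  # earliest window still covering ts
    return list(range(first, last + 1, slide_ms))

# Event at t = 10m10s with 5-minute windows:
assert tumbling_window(610_000, 300_000) == 600_000
# size=5m, slide=1m -> the event lands in 5 overlapping windows
assert sliding_windows(610_000, 300_000, 60_000) == [
    360_000, 420_000, 480_000, 540_000, 600_000
]
```

This also shows the cost difference directly: each event touches one tumbling window but size/slide sliding windows.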
Windowed Keys and TTLs
- Key patterns: region:<r>:top_k:<window_id> and global:top_k:<window_id>
- Apply TTLs to auto‑expire old windows.
- For late events within allowed lateness: update window aggregates and adjust Redis ZSETs; beyond lateness: route to DLQ/audit.
Optimizations
Rate Limiting
- Fixed/tumbling window (per minute):
  - INCR rate:<user>:<minute>; EXPIRE rate:<user>:<minute> 60
- Sliding window:
  - ZADD rate:<user> <now_ms> <now_ms>
  - ZREMRANGEBYSCORE rate:<user> 0 <now_ms-60000>
  - ZCARD rate:<user>, then compare to the limit
- Token bucket: Maintain tokens with server time; refill based on elapsed duration.
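The token-bucket bullet can be sketched as follows (the class name and explicit-clock design are assumptions; production code would read a monotonic clock and typically run the logic atomically, e.g., as a Redis Lua script):

```python
class TokenBucket:
    """Token-bucket limiter: capacity bounds bursts, rate is the steady
    refill in tokens per second, computed from elapsed server time."""

    def __init__(self, capacity, rate, now=0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)  # start full
        self.last = now                # last refill timestamp (seconds)

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=2, rate=1.0)  # burst of 2, then 1 req/sec
print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.5)])  # [True, True, False, True]
```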
Caching Keys
- Regional live: region:<r>:top_k
- Regional windowed: region:<r>:top_k:<window>
- Global live: global:top_k
- Global windowed: global:top_k:<window>
Partitioning
- Kafka: partition by region_id (and optionally a hash of post_id for balance).
- Cassandra: partition by (region_id, window_start); avoid hot partitions.
Approximate Heavy Hitters
- For extreme cardinality: Use Count‑Min Sketch + min‑heap for candidates; confirm via exact counts before Redis writes.
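A minimal Count-Min Sketch sketch, assuming blake2b with per-row salts as the hash family (the class shape is illustrative; pair it with a candidate min-heap and confirm counts exactly before writing to Redis, as described above):

```python
import hashlib

class CountMinSketch:
    """Count-Min Sketch: depth x width counters of fixed memory.
    Estimates never undercount; hash collisions can only overcount."""

    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent-ish hash per row via a per-row salt.
        for row in range(self.depth):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=bytes([row])).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The minimum over rows is the least-collided counter.
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for post, n in [("p1", 3), ("p2", 1)]:
    cms.add(post, n)
assert cms.estimate("p1") >= 3  # one-sided guarantee: never below the truth
```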
Write Minimization and Concurrency
- Cache the current Kth threshold to avoid unnecessary writes.
- Batch Redis operations; use pipelining.
- Use MULTI/EXEC for “add + evict” to avoid races.
Scaling for Global Use
- Per‑region leaderboards: one ZSET per region/window.
- Global aggregation: merge regional Top Ks (≤ K × R items) via min‑heap or Redis ZSET.
- Redis Cluster: One key lives on one shard—keep Top K keys small; avoid single global ZSETs with massive cardinality.
- Multi‑region Kafka: Use MirrorMaker or cluster linking for DR and aggregation.
- Database: Region‑specific Cassandra clusters; tune replication for availability vs. latency.
Data Modeling
Cassandra Example
- views_by_region_and_window
  - PRIMARY KEY ((region_id, window_start), post_id)
  - Columns: count
- posts
  - PRIMARY KEY (post_id)
  - Columns: content, author, created_at, metadata…
Redis Key Design
- region:<region_id>:top_k → ZSET (post_id => score)
- region:<region_id>:top_k:<window_id> → ZSET
- global:top_k and global:top_k:<window_id> → ZSET
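Centralizing these key patterns in small helpers keeps producers and the API layer from drifting apart; a trivial sketch (function names are illustrative):

```python
def region_key(region_id, window_id=None):
    """ZSET key for a regional leaderboard; windowed when window_id is given."""
    base = f"region:{region_id}:top_k"
    return f"{base}:{window_id}" if window_id is not None else base

def global_key(window_id=None):
    """ZSET key for the global leaderboard."""
    base = "global:top_k"
    return f"{base}:{window_id}" if window_id is not None else base

assert region_key("NA") == "region:NA:top_k"
assert region_key("NA", 1700000000) == "region:NA:top_k:1700000000"
assert global_key(1700000000) == "global:top_k:1700000000"
```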
Capacity Planning and Performance
- Full (exact) ZSET: Memory grows with unique posts; a single ZSET must fit one Redis shard.
- Bounded Top K: Memory bounded by K; independent of total population size—few MBs at K=10k with compact IDs.
- Throughput: ZINCRBY/ZADD are ~O(log N). With bounded Top K, N = K, so updates are effectively constant time.
- Persistence/HA: Replicas double memory; snapshots/AOF add overhead. Size clusters accordingly and monitor fragmentation/latency.
Fault Tolerance and Recovery
- Kafka: Retention for reprocessing; at‑least‑once or exactly‑once semantics depending on framework.
- Stream processors: Checkpointing and state backends (e.g., Flink RocksDB) for recovery.
- Redis: Replicas and automatic failover; reconcile from Cassandra/MongoDB after incidents if needed.
- Idempotency: Deduplicate via event IDs or rely on EOS processing to prevent over‑counting.
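The event-ID deduplication bullet can be sketched as a bounded in-memory filter (the Deduplicator class is illustrative; a real pipeline would hold this in processor state or as Redis keys with TTLs, e.g., SET event:<id> 1 NX EX 3600):

```python
from collections import OrderedDict

class Deduplicator:
    """Drop replayed events by id within a bounded memory window.
    Oldest ids are evicted once max_ids is reached, so memory stays flat."""

    def __init__(self, max_ids=100_000):
        self.seen = OrderedDict()
        self.max_ids = max_ids

    def first_time(self, event_id):
        if event_id in self.seen:
            return False                   # duplicate delivery: do not count
        self.seen[event_id] = True
        if len(self.seen) > self.max_ids:
            self.seen.popitem(last=False)  # evict oldest id
        return True

dedup = Deduplicator()
assert dedup.first_time("evt-1")
assert not dedup.first_time("evt-1")       # at-least-once replay ignored
assert dedup.first_time("evt-2")
```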
Security, Privacy, and Compliance
- Encrypt in transit (TLS) and at rest where supported.
- Place data stores in private networks (VPC) and use strict ACLs/roles.
- Apply least privilege for service accounts and Redis ACLs.
- Enable audit logs for API and data access.
- Verify any third‑party services/tools align with your organization’s security and compliance guidelines before adoption.
Example API Request and Responses
Request
GET /top-posts?region=NA&timeframe=24h&k=10
Response
{
  "region": "NA",
  "timeframe": "24h",
  "top_posts": [
    {
      "post_id": "123",
      "author": "Artist1",
      "view_count": 120000,
      "content": "Post Content 1"
    },
    {
      "post_id": "456",
      "author": "Artist2",
      "view_count": 110000,
      "content": "Post Content 2"
    }
  ]
}
Code Snippets (Appendix)
Bounded Top K Update (K = 10,000)
# Pseudocode flow using Redis commands
# Assume key = region:<r>:top_k, K = 10000, new_score and post_id are known
# (if post_id is already a member, update its score with ZINCRBY instead)
# If size < K → add
ZCARD region:<r>:top_k
# If result < 10000:
ZADD region:<r>:top_k <new_score> <post_id>
# Else get current minimum (Kth)
ZRANGE region:<r>:top_k 0 0 WITHSCORES
# If new_score > min_score:
MULTI
ZADD region:<r>:top_k <new_score> <post_id>
ZPOPMIN region:<r>:top_k 1
EXEC
Read Top K
ZREVRANGE region:<r>:top_k 0 9 WITHSCORES
Sliding‑Window Rate Limiting (per user, last 60s)
# now_ms is the current timestamp in milliseconds
ZADD rate:<user_id> <now_ms> <now_ms>
ZREMRANGEBYSCORE rate:<user_id> 0 <now_ms-60000>
ZCARD rate:<user_id> # Compare with limit
References & Further Reading
- Redis
- Sorted Sets (ZSET): https://redis.io/docs/latest/develop/data-types/sorted-sets/
- Redis Commands: https://redis.io/commands/
- Kafka and Streams
- Apache Kafka: https://kafka.apache.org/
- Kafka Streams: https://kafka.apache.org/documentation/streams/
- MirrorMaker: https://kafka.apache.org/documentation/#mirrormaker
- Stream Processing
- Apache Flink: https://flink.apache.org/
- Flink Event Time & Watermarks: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/time/
- Spark Structured Streaming: https://spark.apache.org/structured-streaming/
- Databases
- Apache Cassandra: https://cassandra.apache.org/_/index.html
- MongoDB: https://www.mongodb.com/
- Observability
- OpenTelemetry: https://opentelemetry.io/
Author links and hashtag:
- GitHub: https://github.com/ZeeshanAli-0704/SystemDesignWithZeeshanAli
- Dev.to tag: https://dev.to/t/systemdesignwithzeeshanali
- Hashtag: #SystemDesignWithZeeshanAli