
shubham pandey (Connoisseur)

Social Media Feed at Scale: A System Design Deep Dive — Question by Question

Introduction

A social media feed seems simple on the surface — show the latest tweets from people you follow. But at 300 million users, it becomes one of the most challenging distributed systems problems in software engineering. This post walks through the real complexity, challenge by challenge, including the wrong turns and how to navigate out of them.


Challenge 1: The Naive Feed Approach

Interview Question: When millions of users open their feed simultaneously — how would you naively build it and where does that break down?

Wrong Approach: For each user opening the app, fetch latest tweets from all 500 followed accounts iteratively using a for loop. Do this for all 300 million users opening the app.

Why It Fails: Let us do the math. 300 million users multiplied by 500 followed accounts equals 150 billion database lookups just to render feeds simultaneously. That is before anyone even posts a tweet. A single database melts instantly under this pressure.

Key Insight: The naive pull approach is unusable at scale. Reading the feed at request time means the database pays the price for every single app open.


Challenge 2: Flipping the Approach — Fan-out on Write

Interview Question: Instead of each user pulling tweets when they open the app, what if you did the work upfront at write time?

Navigation: After understanding that 150 billion lookups is unworkable, the natural flip is — what if the work happens when someone tweets instead of when someone reads? Push the tweet to all followers at write time so that reading the feed becomes a simple instant lookup.

Solution: When someone posts a tweet, the Feed Service pushes it to all followers' personal Redis queues immediately. When a user opens the app they read directly from their own Redis queue. Feed load becomes one single cache lookup instead of 150 billion queries.

This is called Fan-out on Write. Twitter calls the per-user queue the Home Timeline Cache stored in Redis.

Key Insight: Pre-computing the feed at write time trades write complexity for extremely fast reads. Feed load becomes O(1) instead of O(followers).
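The fan-out-on-write idea can be sketched in a few lines. This is a minimal in-memory illustration, not production code: plain dicts and deques stand in for the follow graph and the per-user Redis home-timeline queues, and the names (`FEED_CAP`, `fan_out_on_write`, and so on) are invented for this example.

```python
from collections import defaultdict, deque

FEED_CAP = 800  # keep only the newest N tweet IDs per user, like a capped Redis list

followers = defaultdict(set)        # author_id -> set of follower_ids
home_timeline = defaultdict(deque)  # user_id -> tweet IDs, newest first

def follow(follower_id, author_id):
    followers[author_id].add(follower_id)

def fan_out_on_write(author_id, tweet_id):
    """Push the new tweet ID into every follower's home timeline.

    One write per follower at post time; reading the feed later is a
    single lookup of the follower's own queue.
    """
    for follower_id in followers[author_id]:
        timeline = home_timeline[follower_id]
        timeline.appendleft(tweet_id)
        while len(timeline) > FEED_CAP:
            timeline.pop()  # evict the oldest entry, like LTRIM in Redis

def load_feed(user_id, limit=20):
    # O(1) with respect to follow count: no fan-in at read time
    return list(home_timeline[user_id])[:limit]
```

Note how all the cost has moved to `fan_out_on_write`; `load_feed` never touches the follow graph at all.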


Challenge 3: The Celebrity Problem

Interview Question: Cristiano Ronaldo has 150 million followers. He posts a tweet. Your Feed Service must now push that one tweet to 150 million Redis queues simultaneously. What happens?

Wrong Approach: Iterate through all followers and push to each queue. This is just the naive approach in reverse — 150 million write operations for one tweet. At 50 tweets per day that is 7.5 billion Redis writes for Ronaldo alone.

Navigation: The key realization is that normal users and celebrity users have fundamentally different fan-out costs. We need to treat them differently.

Solution: Hybrid approach — users above a follower threshold (e.g. 100K followers) are treated as celebrities. Normal accounts use fan-out on write as before. Celebrity accounts skip the personal queue entirely and their tweets are fetched differently at read time.

Key Insight: One size does not fit all. High follower accounts need a completely different write strategy to prevent write amplification explosion.


Challenge 4: The Pagination Cursor Problem

Interview Question: With the hybrid approach, your feed merges two sources — personal Redis queue for normal friends and a separate fetch for celebrity tweets. How do you paginate across two independent sources?

Wrong Approach: Sort on the client side. Send all tweets from both sources to the phone and let the app sort them.

Why It Fails: Sending 1500 tweets to 300 million mobile devices simultaneously destroys network bandwidth. Users on slow connections have to download everything before seeing anything.

Second Wrong Approach: Push celebrity tweets to the personal Redis queue only when the user opens the app, to keep everything in one place.

Why It Fails: If 50 million followers open the app simultaneously after Ronaldo tweets, you now have 50 million write operations triggered at app open time instead of tweet time. The explosion just moved, it did not disappear.

Navigation: The pagination cursor problem with two sources is real but it is the smaller of the two evils compared to write amplification. The right move is to solve the cursor problem rather than abandon the hybrid approach.

Solution: Store the per-user cursor for each source independently. After each scroll request return two cursors to the app — one for personal queue position and one for celebrity cache position. Next scroll request resumes from exactly where each source left off.

Key Insight: Multi-source pagination requires independent cursors per source. The complexity is worth it compared to the alternative of catastrophic write amplification.
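The two-cursor merge can be sketched as a standard merge of two pre-sorted lists. Assumptions for illustration: tweets are `(timestamp, id)` pairs, newest first, and the cursors are plain list offsets standing in for positions in the Redis queue and celebrity cache.

```python
def load_page(personal, celebrity, cursors, page_size):
    """Merge one page from two sorted sources; return the page plus the
    updated (personal_cursor, celebrity_cursor) pair for the next scroll."""
    p, c = cursors
    page = []
    while len(page) < page_size and (p < len(personal) or c < len(celebrity)):
        # Take from the personal queue when the celebrity source is
        # exhausted, or when the personal tweet is newer (or equal).
        take_personal = (
            c >= len(celebrity)
            or (p < len(personal) and personal[p][0] >= celebrity[c][0])
        )
        if take_personal:
            page.append(personal[p]); p += 1
        else:
            page.append(celebrity[c]); c += 1
    return page, (p, c)
```

The client echoes both cursors back on the next scroll request, so each source resumes exactly where it left off even though the two advance at different rates.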


Challenge 5: Shared Celebrity Cache

Interview Question: Does every one of the 50 million followers of Ronaldo actually need their own personal copy of his tweet?

Navigation: The hint here was powerful — if all 50 million users are reading the exact same tweet, storing 50 million identical copies is wasteful. What if there was one shared copy everyone reads from?

Solution: Celebrity tweets are stored once in a shared Redis cache. All followers read from the same single cache entry. 50 million users opening the app simultaneously all hit one shared cache — zero write amplification, one read source.

Final hybrid architecture:

  • Normal friends tweets pushed to personal Redis queue per user
  • Celebrity tweets stored in shared Redis cache once
  • At feed load server merges personal queue and relevant celebrity caches
  • Follower to celebrity mapping stored in cache with DynamoDB as fallback

Key Insight: Shared cache for celebrity tweets eliminates write amplification entirely. One write serves 150 million readers.


Challenge 6: Follower Mapping Storage

Interview Question: How does the server know which celebrity caches to fetch when a user opens their feed? Where do you store the mapping of which celebrities each user follows?

Wrong Approach: Query DynamoDB on every app open.

Why It Fails: DynamoDB adds unnecessary latency for data that rarely changes. You do not unfollow someone every minute.

Navigation: Follower mapping is read on every single app open, changes very infrequently, and is the same data read repeatedly. This is a perfect cache use case.

Solution: Store follower mapping in Redis cache. On cache miss fall back to DynamoDB and reload into cache. This is the cache-aside pattern — cache as the fast layer, DynamoDB as the source of truth underneath.

Key Insight: Cache-aside pattern is ideal for data that is read frequently but updated rarely. Always have a persistent fallback for cache misses.


Challenge 7: Trending Topics — The Counting Problem

Interview Question: Twitter shows trending hashtags from the last hour. At 6000 tweets per second with 3 hashtags each that is 18000 hashtag events per second. Running a COUNT query on your main database every few minutes — what is wrong with this?

Wrong Approach: Query the main tweet database every few minutes counting hashtag occurrences in the last hour.

Why It Fails: An expensive aggregation query competes directly with 6000 writes per second on the same database. The database gets crushed under simultaneous heavy reads and writes.

Second Wrong Approach: Store a counter row per hashtag in a database and increment it on every mention.

Why It Fails: 18000 counter increments per second on the same rows cause race conditions. Two requests read the same counter value simultaneously and both try to increment — one update gets lost. Adding locks solves correctness but serializes 18000 operations per second, destroying throughput.

Navigation: The hint was — do you even need a database for counting? Counting does not need to be persistent. Trending from 6 months ago is useless. What if counting lived entirely in memory with a data structure built for atomic increments and automatic sorting?

Solution: Redis Sorted Set. Each hashtag is a member, its mention count is the score. ZINCRBY atomically increments the score with no locking needed. ZREVRANGE returns top N hashtags instantly by score.

Key Insight: Redis Sorted Set replaces a database entirely for counting and ranking. Atomic score increments eliminate race conditions without any locking.
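To make the semantics concrete, here is a pure-Python mimic of the two commands the design relies on. In real Redis, ZINCRBY and ZREVRANGE are single atomic server-side commands, which is exactly why no application-level locking is needed; this stand-in only illustrates the data model.

```python
scores = {}  # hashtag -> mention count (the sorted set's member -> score)

def zincrby(tag, amount=1):
    """Mimics ZINCRBY: bump a member's score, creating it if absent."""
    scores[tag] = scores.get(tag, 0) + amount
    return scores[tag]

def zrevrange_top(n):
    """Mimics ZREVRANGE 0 n-1: top-n members by score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

With real Redis the counting service would simply issue `ZINCRBY trending 1 <hashtag>` per mention and `ZREVRANGE trending 0 9 WITHSCORES` to render the list.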


Challenge 8: The Sliding Time Window Problem

Interview Question: Trending should reflect only the last 60 minutes. If you just keep incrementing scores forever, hashtags from yesterday pollute your trending list. How do you make scores reflect only recent mentions?

Wrong Approach: Give each hashtag a 24 hour TTL in Redis.

Why It Fails: TTL deletes the entire key after 24 hours. It does not expire individual mentions within the window. A hashtag with 5 million mentions accumulated over 24 hours still dominates trending even if nobody mentioned it in the last 60 minutes.

Navigation: Instead of expiring the whole hashtag, expire individual mentions. Each mention has a timestamp. Remove mentions older than 60 minutes from the count. Redis Sorted Set supports exactly this with score as timestamp.

Solution: Two Redis Sorted Sets working together.

Per hashtag sorted set tracks the time window:

  • Member = unique tweet ID
  • Score = Unix timestamp of the mention
  • ZREMRANGEBYSCORE removes mentions older than 60 minutes
  • ZCOUNT returns exact mention count in the sliding window

Global trending sorted set tracks the ranking:

  • Member = hashtag name
  • Score = current mention count from the time window
  • ZREVRANGE returns top 10 trending hashtags instantly

Every new mention updates both structures, keeping the sliding window accurate in near real time.

Key Insight: Sliding window is achieved by using timestamp as score and ZREMRANGEBYSCORE to expire old mentions. Two sorted sets separate the concerns of time windowing and global ranking.
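The two-structure interplay can be sketched with dicts mimicking the sorted sets. Assumptions: timestamps are passed in explicitly to keep the sketch deterministic, and stale hashtags are only pruned when they are next mentioned (a real system would also sweep idle tags periodically).

```python
WINDOW = 60 * 60  # 60-minute sliding window, in seconds

mentions = {}   # hashtag -> {tweet_id: unix_timestamp}  (per-hashtag sorted set)
trending = {}   # hashtag -> mention count inside the window (global sorted set)

def record_mention(tag, tweet_id, now):
    window = mentions.setdefault(tag, {})
    window[tweet_id] = now  # member = tweet ID, score = timestamp
    # ZREMRANGEBYSCORE equivalent: drop mentions older than the window
    cutoff = now - WINDOW
    for tid in [t for t, ts in window.items() if ts < cutoff]:
        del window[tid]
    # ZCOUNT result becomes this hashtag's score in the global ranking
    trending[tag] = len(window)

def top_trending(n):
    """ZREVRANGE equivalent on the global trending set."""
    return sorted(trending, key=trending.get, reverse=True)[:n]
```

A hashtag with millions of old mentions falls off the list as soon as its window is pruned, which is exactly the behavior a whole-key TTL cannot provide.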


Challenge 9: Async Processing with Kafka

Interview Question: Updating Redis sorted sets for every hashtag at 18000 operations per second — should this happen synchronously while the user waits for their tweet to post?

Navigation: The user posts a tweet and immediately gets a response. Hashtag counting is a background concern — the user should never wait for it. This calls for asynchronous processing with a message queue in between.

Solution: Kafka sits between the tweet service and the hashtag counting service.

Flow:

  • User posts tweet
  • Tweet service saves tweet and publishes hashtag event to Kafka instantly
  • User gets immediate response — they never wait for hashtag counting
  • Hashtag consumer service reads from Kafka and updates Redis sorted sets
  • If consumer goes down Kafka holds all events — nothing is lost

Why Kafka over a database or Redis queue:

  • Stores events in order
  • Multiple consumers can read independently
  • Events survive consumer downtime
  • Handles millions of events per second effortlessly

Key Insight: Kafka decouples the tweet service from hashtag processing entirely. The tweet service never blocks and hashtag counting scales independently.


Challenge 10: Real Time Push Notifications

Interview Question: Twitter notifies you within seconds when someone likes your tweet. 300 million phones are open right now. The naive polling approach — each phone asks the server every 5 seconds for new notifications — generates 60 million requests per second, most returning empty. How do you push notifications instantly without polling?

Wrong Approach: Each phone polls the server every few seconds asking for new notifications.

Why It Fails: 300 million phones polling every 5 seconds is 60 million requests per second of pure wasted load. The vast majority return empty responses.

Navigation: Instead of phones asking the server, the server should tell the phones. This requires a persistent connection — the phone connects once and keeps that connection open so the server can push anytime.

Solution: AWS SNS or Google FCM handles persistent connections to all mobile devices at scale. No need to reinvent this — cloud providers have already solved it.

Notification flow:

  • User likes a tweet
  • Kafka event published
  • Notification service consumes from Kafka
  • Notification service calls AWS SNS or Google FCM
  • SNS or FCM pushes notification to phone instantly via persistent connection

Key Insight: Never reinvent infrastructure that cloud providers have already solved at scale. AWS SNS and Google FCM handle billions of push notifications daily — use them.


Full Architecture Summary

Feed generation — Fan-out on write to personal Redis queue per user
Celebrity tweets — Shared Redis cache read by all followers
Follower mapping — Redis cache with DynamoDB fallback
Feed merge — Server side merge of personal queue and celebrity cache
Pagination — Independent cursors per source returned to client
Trending computation — Kafka streaming to Redis Sorted Set
Time window — Per hashtag sorted set with timestamp as score
Global ranking — Single global trending sorted set updated in real time
Async processing — Kafka decouples tweet service from hashtag service
Push notifications — AWS SNS and Google FCM for instant mobile delivery


Final Thoughts

Twitter's feed looks like a simple list of posts. Underneath it is a carefully orchestrated system of pre-computed caches, hybrid architectures, sliding time windows, async pipelines, and cloud push infrastructure — all working together to make everything feel instant.

The most valuable lesson from this design is that wrong answers are not failures — they are navigation tools. Every wrong approach revealed exactly why the correct approach exists. That is how real system design thinking works.

Happy building. 🚀
