<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rachit Misra</title>
    <description>The latest articles on DEV Community by Rachit Misra (@hungryformore).</description>
    <link>https://dev.to/hungryformore</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3852556%2F81a358ca-0516-4761-a62f-e462e280b523.jpg</url>
      <title>DEV Community: Rachit Misra</title>
      <link>https://dev.to/hungryformore</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hungryformore"/>
    <language>en</language>
    <item>
      <title>Designing Instagram at Scale: A Complete System Design Deep Dive</title>
      <dc:creator>Rachit Misra</dc:creator>
      <pubDate>Fri, 03 Apr 2026 03:59:26 +0000</pubDate>
      <link>https://dev.to/hungryformore/designing-instagram-at-scale-a-complete-system-design-deep-dive-49ef</link>
      <guid>https://dev.to/hungryformore/designing-instagram-at-scale-a-complete-system-design-deep-dive-49ef</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;From a ₹800/month server to 500M daily users — every component, every trade-off, every edge case.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Why Instagram is a perfect system design problem&lt;/li&gt;
&lt;li&gt;The numbers that define the problem&lt;/li&gt;
&lt;li&gt;The scaling journey — Stage by Stage&lt;/li&gt;
&lt;li&gt;Component Deep Dive: Feed Generation&lt;/li&gt;
&lt;li&gt;Component Deep Dive: Stories &amp;amp; Expiry&lt;/li&gt;
&lt;li&gt;Component Deep Dive: Media Upload &amp;amp; CDN&lt;/li&gt;
&lt;li&gt;Component Deep Dive: Notifications&lt;/li&gt;
&lt;li&gt;Component Deep Dive: Search &amp;amp; Discovery&lt;/li&gt;
&lt;li&gt;Component Deep Dive: Likes &amp;amp; Comments&lt;/li&gt;
&lt;li&gt;Database Design — Every Decision Justified&lt;/li&gt;
&lt;li&gt;API Design — Full Contracts&lt;/li&gt;
&lt;li&gt;Edge Cases Nobody Draws on Their Diagram&lt;/li&gt;
&lt;li&gt;Key Trade-offs Summary&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Why Instagram is a Perfect System Design Problem
&lt;/h2&gt;

&lt;p&gt;Instagram sits at the intersection of every hard distributed systems problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read-heavy&lt;/strong&gt; (people scroll more than they post)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write-heavy at peaks&lt;/strong&gt; (52,000 likes per second)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Media-intensive&lt;/strong&gt; (photos, videos, reels, stories)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time&lt;/strong&gt; (stories expire, feeds update, notifications land)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Socially connected&lt;/strong&gt; (the graph makes everything harder)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Globally distributed&lt;/strong&gt; (500M users across every timezone)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s not one hard problem. It’s six hard problems running simultaneously, sharing infrastructure, with users who notice every hiccup.&lt;/p&gt;

&lt;p&gt;This is why it appears in almost every senior system design interview. And why most candidates fail it — not because they don’t know the components, but because they don’t know &lt;em&gt;why&lt;/em&gt; each component exists and what breaks without it.&lt;/p&gt;

&lt;p&gt;This article covers everything. By the end you’ll be able to design Instagram from first principles, justify every decision, and handle every curveball an interviewer can throw.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Numbers That Define the Problem
&lt;/h2&gt;

&lt;p&gt;Before writing a single box on your architecture diagram, establish the scale. This isn’t optional ceremony — it determines every design decision you make.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User scale:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2B registered users&lt;/li&gt;
&lt;li&gt;500M Daily Active Users (DAU)&lt;/li&gt;
&lt;li&gt;Peak concurrent users: ~50M&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Content scale:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100M photos/videos uploaded per day → &lt;strong&gt;~1,150 uploads/second&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;500M stories created per day&lt;/li&gt;
&lt;li&gt;4.5B likes per day → &lt;strong&gt;~52,000 likes/second&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;100M comments per day → ~1,150 comments/second&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Read scale:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each user opens the app ~7x/day&lt;/li&gt;
&lt;li&gt;3.5B feed loads/day → &lt;strong&gt;~40,000 feed requests/second&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Feed load is your most expensive operation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Storage scale:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average photo: 3MB (after compression)&lt;/li&gt;
&lt;li&gt;100M photos/day × 3MB × 3 sizes = ~900TB new storage per day&lt;/li&gt;
&lt;li&gt;Video and reels multiply this significantly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Derived constraints:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read:Write ratio ≈ 80:20 (mostly read)&lt;/li&gt;
&lt;li&gt;Feed generation is the critical path&lt;/li&gt;
&lt;li&gt;Like storage needs write-optimised infrastructure&lt;/li&gt;
&lt;li&gt;Media storage needs a CDN — serving from origin is impossible at this scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you can design. Everything flows from these numbers.&lt;/p&gt;
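&lt;p&gt;The derived rates are simple division and worth sanity-checking yourself. A minimal Python sketch of the arithmetic, using only the estimates above:&lt;/p&gt;

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def per_second(daily_total):
    """Average per-second rate for a daily event count."""
    return daily_total / SECONDS_PER_DAY

uploads_ps = per_second(100_000_000)          # 100M uploads/day
likes_ps = per_second(4_500_000_000)          # 4.5B likes/day
feed_loads_ps = per_second(500_000_000 * 7)   # 500M DAU x ~7 opens/day

# Daily media storage: 100M photos x 3MB x 3 rendered sizes, in TB
storage_tb_per_day = 100_000_000 * 3 * 3 / 1_000_000

print(round(uploads_ps))     # 1157  -> "~1,150 uploads/second"
print(round(likes_ps))       # 52083 -> "~52,000 likes/second"
print(round(feed_loads_ps))  # 40509 -> "~40,000 feed requests/second"
print(storage_tb_per_day)    # 900.0 -> "~900TB new storage per day"
```

&lt;p&gt;Peaks run well above these averages, so capacity planning multiplies them by a peak factor, but the averages are what anchor the design.&lt;/p&gt;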

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uoabdrsawgigy79p3rb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1uoabdrsawgigy79p3rb.png" alt=" " width="800" height="1006"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Scaling Journey — Stage by Stage
&lt;/h2&gt;

&lt;p&gt;The biggest mistake in system design interviews is jumping straight to the 500M DAU architecture. Real systems don’t start there. Understanding the journey is what separates a junior answer from a senior one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 1 — 1K DAU: Ship Fast
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure:&lt;/strong&gt; Single server, single PostgreSQL instance, S3 for photos.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works:&lt;/strong&gt; Everything. At 1K users, you have no scaling problems. Your only job is shipping features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What breaks first:&lt;/strong&gt; PostgreSQL connection limits. The default &lt;code&gt;max_connections&lt;/code&gt; is 100. At ~80 concurrent connections hitting the DB, you start seeing &lt;code&gt;sorry, too many clients already&lt;/code&gt; errors. Fix: &lt;strong&gt;PgBouncer&lt;/strong&gt; for connection pooling. Trade-off: one more component to operate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client → Single Server (App + Postgres + PgBouncer) → S3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stage 2 — 100K DAU: The First Real Pain
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt; Feed queries. &lt;code&gt;SELECT * FROM posts WHERE user_id IN (list_of_500_followings) ORDER BY created_at DESC LIMIT 10&lt;/code&gt; becomes a slow scan as posts accumulate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fixes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Redis&lt;/strong&gt; for pre-computed feeds (cache-aside pattern, TTL 10 min)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read replicas&lt;/strong&gt; so reads don’t compete with writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDN (CloudFront)&lt;/strong&gt; in front of S3 — stop serving media from origin&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Story expiry cron&lt;/strong&gt; — a job every 15 minutes that marks expired stories as deleted&lt;/li&gt;
&lt;/ul&gt;
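&lt;p&gt;The cache-aside feed read above fits in a few lines. This is an illustrative Python sketch: a dict stands in for Redis, and &lt;code&gt;fetch_feed_from_db&lt;/code&gt; is a hypothetical placeholder for the replica query:&lt;/p&gt;

```python
import time

FEED_TTL_SECONDS = 600  # the 10-minute TTL described above
_cache = {}             # key -> (expires_at, feed); stands in for Redis

def fetch_feed_from_db(user_id):
    # Placeholder for the read-replica query (hypothetical helper).
    return [f"post_{user_id}_{i}" for i in range(10)]

def get_feed(user_id):
    key = f"feed:{user_id}"
    entry = _cache.get(key)
    if entry and entry[0] > time.time():
        return entry[1]                                   # cache hit
    feed = fetch_feed_from_db(user_id)                    # miss: hit the DB
    _cache[key] = (time.time() + FEED_TTL_SECONDS, feed)  # repopulate
    return feed
```

&lt;p&gt;The invalidation question in the next list is exactly the weak spot of this pattern: the cache only refreshes on expiry or explicit delete.&lt;/p&gt;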

&lt;p&gt;&lt;strong&gt;New problems introduced:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache invalidation: whose feed do you invalidate when someone posts?&lt;/li&gt;
&lt;li&gt;Read replica lag: users might briefly see stale data (eventual consistency)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client → Load Balancer → App Servers → {Postgres Primary, Redis, S3+CDN}
                                     ↓
                              Postgres Read Replicas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stage 3 — 10M DAU: Real Distributed Systems
&lt;/h3&gt;

&lt;p&gt;This is the interesting stage. Three things break simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What breaks:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Monolith deployment slows down feature development — team coordination hell&lt;/li&gt;
&lt;li&gt;Like/comment write throughput saturates PostgreSQL&lt;/li&gt;
&lt;li&gt;Text search with &lt;code&gt;LIKE&lt;/code&gt; queries is unusably slow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Fixes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split into &lt;strong&gt;microservices&lt;/strong&gt; (User, Post, Feed, Comment, Notification, Search)&lt;/li&gt;
&lt;li&gt;Introduce &lt;strong&gt;Kafka&lt;/strong&gt; as the event backbone — services stop calling each other synchronously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cassandra&lt;/strong&gt; for likes and comments (write-optimised, no transactions needed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elasticsearch&lt;/strong&gt; for search, hashtags, and explore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client → API Gateway → Microservices → Kafka → Consumers
                              ↓
              {Postgres, Redis, Cassandra, Elasticsearch, S3+CDN}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stage 4 — 500M DAU: Planetary Scale
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What changes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Geo-distribution: data centres in US, EU, Asia-Pacific&lt;/li&gt;
&lt;li&gt;ML-powered feed ranking replaces chronological ordering&lt;/li&gt;
&lt;li&gt;Sharding Postgres by user_id across multiple instances&lt;/li&gt;
&lt;li&gt;Cassandra runs as a multi-region cluster&lt;/li&gt;
&lt;li&gt;Kafka handles millions of events per second with consumer groups&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  4. Component Deep Dive: Feed Generation
&lt;/h2&gt;

&lt;p&gt;Feed generation is the hardest problem in the Instagram system design. Get this wrong and every other component is irrelevant.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Question: Push vs Pull
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Fan-out on Write (Push):&lt;/strong&gt;&lt;br&gt;
When a user posts, immediately write that post to every follower’s feed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Feed reads are O(1) — just read the pre-computed list&lt;/li&gt;
&lt;li&gt;❌ Write amplification: 1 post × 10,000 followers = 10,000 writes&lt;/li&gt;
&lt;li&gt;❌ Catastrophic for celebrities (Ronaldo posting = 600M writes)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fan-out on Read (Pull):&lt;/strong&gt;&lt;br&gt;
When a user opens their feed, fetch posts from everyone they follow in real-time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ No write amplification — posts are written once&lt;/li&gt;
&lt;li&gt;❌ Read is expensive: fetch from 500 followings, merge, sort, rank&lt;/li&gt;
&lt;li&gt;❌ Slow for power users with many followings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Instagram’s Solution: Hybrid Fan-out&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regular users (&amp;lt; 1M followers): &lt;strong&gt;push model&lt;/strong&gt; — fan-out on write to their followers’ feeds&lt;/li&gt;
&lt;li&gt;Celebrity users (&amp;gt; 1M followers): &lt;strong&gt;pull model&lt;/strong&gt; — merge their latest posts at read time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On feed load:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read pre-computed feed from Redis ZSET (sorted by ML ranking score)&lt;/li&gt;
&lt;li&gt;For any celebrity accounts the user follows, fetch their latest posts&lt;/li&gt;
&lt;li&gt;Merge, re-rank, serve&lt;/li&gt;
&lt;/ol&gt;
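&lt;p&gt;The three-step read path can be sketched in plain Python, with in-memory stand-ins for the Redis ZSET and the celebrity post store, and a score sort standing in for the ML re-rank:&lt;/p&gt;

```python
# Hedged sketch of the hybrid read path; all stores are dict stand-ins.
precomputed_feed = {1: [("post_a", 0.91), ("post_b", 0.75)]}  # feed:{user_id} ZSET
celebrity_posts = {"ronaldo": [("post_c", 0.88)]}             # pulled at read time
follows_celebrities = {1: ["ronaldo"]}

def load_feed(user_id, limit=10):
    candidates = list(precomputed_feed.get(user_id, []))   # step 1: push entries
    for celeb in follows_celebrities.get(user_id, []):     # step 2: pull celebs
        candidates.extend(celebrity_posts.get(celeb, []))
    candidates.sort(key=lambda c: c[1], reverse=True)      # step 3: re-rank
    return [post_id for post_id, _ in candidates[:limit]]

print(load_feed(1))  # ['post_a', 'post_c', 'post_b']
```

&lt;p&gt;In production the first step is a &lt;code&gt;ZREVRANGE&lt;/code&gt; over &lt;code&gt;feed:{user_id}&lt;/code&gt; and the second a timeline read per celebrity, but the merge shape is the same.&lt;/p&gt;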

&lt;p&gt;&lt;strong&gt;Feed Storage in Redis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;feed:{user_id}&lt;/span&gt;
&lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ZSET (sorted set)&lt;/span&gt;
&lt;span class="na"&gt;Score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ML ranking score (not timestamp)&lt;/span&gt;
&lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;post_id&lt;/span&gt;
&lt;span class="na"&gt;TTL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10 minutes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On cache miss → fall back to Cassandra user_timeline table → re-rank → re-cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ML Ranking Signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recency (newer posts scored higher)&lt;/li&gt;
&lt;li&gt;Relationship strength (how often you interact with this account)&lt;/li&gt;
&lt;li&gt;Post engagement velocity (likes/comments in first hour)&lt;/li&gt;
&lt;li&gt;Content type preference (video vs photo history)&lt;/li&gt;
&lt;li&gt;Session context (what you’ve engaged with this session)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Feed Edge Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Offline user returning after 2 weeks:&lt;/strong&gt;&lt;br&gt;
Don’t backfill 14 days of fan-out events. Their feed cache is cold and stale. Generate fresh on first open from Cassandra. Accept that the first load is slightly slower.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User unfollows someone mid-request:&lt;/strong&gt;&lt;br&gt;
Eventual consistency means you might briefly surface one post from an unfollowed account. Don’t try to prevent this at the storage layer — the complexity isn’t worth it. Filter at the display layer if it’s a concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deleted post in cached feed:&lt;/strong&gt;&lt;br&gt;
Store &lt;code&gt;is_deleted&lt;/code&gt; flag. Check at serve time. Never serve deleted content from cache regardless of what the feed list says.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New user with zero followings (cold start):&lt;/strong&gt;&lt;br&gt;
Show explore/trending content until they follow enough accounts for a meaningful feed.&lt;/p&gt;


&lt;h2&gt;
  
  
  5. Component Deep Dive: Stories &amp;amp; Expiry
&lt;/h2&gt;

&lt;p&gt;Stories feel deceptively simple — post a photo, it disappears after 24 hours. The distributed expiry pipeline behind this is non-trivial.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Storage Architecture
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Story metadata  → PostgreSQL (story_id, user_id, expires_at, is_deleted)
Story TTL       → Redis SET (key: story:{user_id}, TTL: 24h)
Story media     → S3 (deleted async after expiry)
Story views     → Redis SET (key: viewed:{user_id}:{story_id}) + async counter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  The Expiry Pipeline
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Story uploaded → &lt;code&gt;expires_at = NOW() + 24h&lt;/code&gt; written to Postgres&lt;/li&gt;
&lt;li&gt;Redis key set with matching TTL&lt;/li&gt;
&lt;li&gt;On Redis TTL expiry (surfaced via keyspace notifications, which are best-effort) → Kafka &lt;code&gt;story.expired&lt;/code&gt; event published&lt;/li&gt;
&lt;li&gt;Kafka consumer: soft-delete in Postgres (&lt;code&gt;is_deleted = true&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Kafka consumer: issue S3 delete for the media file&lt;/li&gt;
&lt;li&gt;Kafka consumer: invalidate CDN cache for the media URL&lt;/li&gt;
&lt;/ol&gt;
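&lt;p&gt;Steps 4, 5 and 6 fold into one idempotent handler. A hedged Python sketch, with dicts standing in for the Postgres row, the S3 bucket, and the CDN client:&lt;/p&gt;

```python
# In-memory stand-ins for the real stores.
stories = {"s1": {"is_deleted": False, "media_key": "u1/s1.jpg"}}
s3_objects = {"u1/s1.jpg": b"..."}
cdn_invalidated = []

def handle_story_expired(story_id):
    story = stories.get(story_id)
    if story is None or story["is_deleted"]:
        return  # idempotent: a re-delivered Kafka event is a no-op
    story["is_deleted"] = True                 # step 4: soft-delete in Postgres
    s3_objects.pop(story["media_key"], None)   # step 5: delete media from S3
    cdn_invalidated.append(story["media_key"]) # step 6: invalidate the CDN path
```

&lt;p&gt;The idempotency guard is what lets the reconciliation cron below replay the same cleanup safely.&lt;/p&gt;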

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; What if the Kafka consumer is down when the TTL fires?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Reconciliation cron job running every 15 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;story_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;stories&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;expires_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;is_deleted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anything this finds is cleaned up. Eventual deletion — not real-time. Acceptable for stories.&lt;/p&gt;

&lt;h3&gt;
  
  
  Story Feed
&lt;/h3&gt;

&lt;p&gt;When a user opens stories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch user IDs they follow from Postgres (or Redis cache)&lt;/li&gt;
&lt;li&gt;For each, check if &lt;code&gt;story:{user_id}&lt;/code&gt; key exists in Redis&lt;/li&gt;
&lt;li&gt;Return story IDs sorted by recency&lt;/li&gt;
&lt;li&gt;Mark viewed: &lt;code&gt;SADD viewed:{viewer_id}:{story_id}&lt;/code&gt; in Redis (idempotent)&lt;/li&gt;
&lt;li&gt;Increment view count async (avoid hot write on every view)&lt;/li&gt;
&lt;/ol&gt;
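&lt;p&gt;The story-feed read above can be sketched like this (illustrative Python; dicts and a set stand in for the Redis keys and the follow graph):&lt;/p&gt;

```python
import time

active_stories = {"alice": ("story_9", time.time())}  # story:{user_id} -> (id, posted_at)
following = {"viewer_1": ["alice", "bob"]}            # bob has no active story
viewed = set()                                        # viewed:{viewer}:{story} keys

def open_stories(viewer_id):
    tray = []
    for author in following.get(viewer_id, []):
        entry = active_stories.get(author)        # step 2: key-exists check
        if entry:
            tray.append(entry)
    tray.sort(key=lambda e: e[1], reverse=True)   # step 3: newest first
    return [story_id for story_id, _ in tray]

def mark_viewed(viewer_id, story_id):
    viewed.add(f"viewed:{viewer_id}:{story_id}")  # step 4: idempotent, like SADD
```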




&lt;h2&gt;
  
  
  6. Component Deep Dive: Media Upload &amp;amp; CDN
&lt;/h2&gt;

&lt;p&gt;At 1,150 uploads per second, your servers cannot be in the media path. Every byte going through your application servers is wasted CPU and network.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pre-Signed S3 Upload Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Client → POST /v1/posts  { caption, media_type }
2. Server → generates pre-signed S3 URL (valid 15 min) + post_id
3. Server → returns { post_id, upload_url } to client
4. Client → PUT directly to S3 using upload_url
5. S3 → fires s3:ObjectCreated event
6. Lambda/consumer → publishes media.uploaded to Kafka
7. Kafka consumer → generates thumbnails, updates post status, triggers feed fan-out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your servers touch zero bytes of media. They handle only metadata.&lt;/p&gt;
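&lt;p&gt;Step 2 can be sketched as below. This is not real AWS Signature V4; a production service would call an SDK helper such as S3&amp;#39;s pre-signed URL generator. The HMAC here only illustrates the shape (key path, expiry, signature), and the domain and secret are hypothetical:&lt;/p&gt;

```python
import hashlib
import hmac
import time
import uuid

SECRET = b"server-side-signing-key"  # hypothetical signing key

def create_upload_url(user_id, media_type, valid_seconds=900):
    """Mint a post_id plus a short-lived, signed upload URL (step 2)."""
    post_id = str(uuid.uuid4())
    expires = int(time.time()) + valid_seconds  # valid for 15 minutes
    key = f"{user_id}/{post_id}/original.{media_type}"
    sig = hmac.new(SECRET, f"{key}:{expires}".encode(), hashlib.sha256).hexdigest()
    token = f"{expires}.{sig}"  # single query param keeps the sketch simple
    return {"post_id": post_id,
            "upload_url": f"https://uploads.example.com/{key}?token={token}"}
```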

&lt;h3&gt;
  
  
  Media Sizes
&lt;/h3&gt;

&lt;p&gt;Every photo is stored in three serving sizes, plus the original:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Thumbnail:&lt;/strong&gt; 150×150px — profile grids, search results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feed:&lt;/strong&gt; 720px wide — home feed display&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full:&lt;/strong&gt; 1080px wide — post detail view&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Original:&lt;/strong&gt; preserved for potential future use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stored at: &lt;code&gt;s3://ig-media-{region}/{user_id}/{post_id}/{size}.jpg&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  CDN Strategy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;CloudFront in front of all S3 buckets.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache-Control headers: &lt;code&gt;max-age=31536000&lt;/code&gt; (1 year) for immutable media&lt;/li&gt;
&lt;li&gt;Edge locations serve 95%+ of media requests — origin never gets hit&lt;/li&gt;
&lt;li&gt;On post delete: CDN invalidation API call (small window of stale serving — acceptable)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The edge case:&lt;/strong&gt; Client successfully uploads to S3 but dies before confirming to your API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; S3 event notification independently triggers the Kafka event. Your Post Service confirms the upload without waiting for client confirmation. The client can poll &lt;code&gt;GET /v1/posts/:post_id&lt;/code&gt; to check status.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Component Deep Dive: Notifications
&lt;/h2&gt;

&lt;p&gt;Notifications are a fan-out problem dressed in a UX problem’s clothing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Notification Types &amp;amp; Channels
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Channel&lt;/th&gt;
&lt;th&gt;Latency Target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Like on your post&lt;/td&gt;
&lt;td&gt;Push (FCM/APNs) + In-app&lt;/td&gt;
&lt;td&gt;&amp;lt; 5 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Comment on your post&lt;/td&gt;
&lt;td&gt;Push + In-app&lt;/td&gt;
&lt;td&gt;&amp;lt; 5 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;New follower&lt;/td&gt;
&lt;td&gt;Push + In-app&lt;/td&gt;
&lt;td&gt;&amp;lt; 10 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Story view&lt;/td&gt;
&lt;td&gt;In-app only&lt;/td&gt;
&lt;td&gt;&amp;lt; 30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mention in caption&lt;/td&gt;
&lt;td&gt;Push + In-app&lt;/td&gt;
&lt;td&gt;&amp;lt; 5 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Action → Kafka event → Notification Service consumer
       → Enrich (fetch user prefs, device tokens, do-not-disturb)
       → Route (push? in-app? email? all?)
       → Send via FCM (Android) / APNs (iOS)
       → Store in notifications DB for in-app feed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Hard Edge Cases
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Notification storm — viral post:&lt;/strong&gt;&lt;br&gt;
A post gets 10M likes. Without batching, your Notification Service receives 10M &lt;code&gt;like.created&lt;/code&gt; events and tries to push 10M individual notifications to the post author.&lt;/p&gt;

&lt;p&gt;Fix: &lt;strong&gt;Debouncing in the Notification Service.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Window: 60 seconds&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;like.created&lt;/code&gt; events for the same &lt;code&gt;post_id&lt;/code&gt; + &lt;code&gt;user_id&lt;/code&gt; (recipient) exceed threshold → batch into “X and 9,999 others liked your post”&lt;/li&gt;
&lt;li&gt;Store the count in Redis, flush as single notification at window close&lt;/li&gt;
&lt;/ul&gt;
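&lt;p&gt;The debounce reduces to a per-(recipient, post) counter flushed once per window. Illustrative Python; the dict stands in for the Redis counters described above, and the flush would be driven by the 60-second window timer:&lt;/p&gt;

```python
# Pending like events, keyed per (recipient, post); stands in for Redis.
pending = {}  # (recipient_id, post_id) -> (first_liker, count)

def on_like(recipient_id, post_id, liker):
    """Accumulate a like event instead of pushing immediately."""
    key = (recipient_id, post_id)
    first, count = pending.get(key, (liker, 0))
    pending[key] = (first, count + 1)

def flush(recipient_id, post_id):
    """Called at window close (every 60s): emit at most one notification."""
    first, count = pending.pop((recipient_id, post_id), (None, 0))
    if count == 0:
        return None
    if count == 1:
        return f"{first} liked your post"
    return f"{first} and {count - 1:,} others liked your post"
```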

&lt;p&gt;&lt;strong&gt;Dead device tokens:&lt;/strong&gt;&lt;br&gt;
User uninstalls app. FCM/APNs return &lt;code&gt;NotRegistered&lt;/code&gt; or &lt;code&gt;BadDeviceToken&lt;/code&gt; on delivery attempt.&lt;/p&gt;

&lt;p&gt;Fix: Notification Service listens for delivery failure callbacks → marks device token as invalid in DB → stops sending to that token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User preference: notifications off:&lt;/strong&gt;&lt;br&gt;
Check user notification preferences &lt;em&gt;before&lt;/em&gt; publishing to Kafka. Don’t generate events for users who have disabled that notification type. Saves downstream processing entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do Not Disturb windows:&lt;/strong&gt;&lt;br&gt;
Store user timezone + DND preferences. Notification Service checks at delivery time — if in DND window, store notification, deliver at window end.&lt;/p&gt;


&lt;h2&gt;
  
  
  8. Component Deep Dive: Search &amp;amp; Discovery
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Why Not Postgres?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM posts WHERE caption LIKE '%golden gate%'&lt;/code&gt; is a sequential scan on a table with billions of rows. At any meaningful scale, this query will time out before returning.&lt;/p&gt;

&lt;p&gt;You need an inverted index. That’s Elasticsearch.&lt;/p&gt;
&lt;h3&gt;
  
  
  Elasticsearch Index Design
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Posts Index:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"post_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"caption"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text (analyzed, english stemming)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hashtags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"geo_point"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"like_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"is_deleted"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"boolean"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Users Index:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"keyword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"follower_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"is_private"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"boolean"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"is_verified"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"boolean"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Keeping Elasticsearch in Sync
&lt;/h3&gt;

&lt;p&gt;Elasticsearch is updated &lt;strong&gt;asynchronously&lt;/strong&gt; from Postgres via Kafka:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Postgres write → Kafka (post.created / post.updated / post.deleted)
              → Elasticsearch consumer → index update
              → Lag: ~1-2 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The dual-source pattern:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Post appears in owner’s &lt;strong&gt;profile&lt;/strong&gt; immediately (read from Postgres — source of truth)&lt;/li&gt;
&lt;li&gt;Post appears in &lt;strong&gt;search results&lt;/strong&gt; after ~2 seconds (read from Elasticsearch)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two read models for two different use cases, with Postgres remaining the source of truth. This is intentional, not a bug.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trending Hashtags
&lt;/h3&gt;

&lt;p&gt;Trending is a sliding window count problem. Redis handles it elegantly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;On hashtag used: ZINCRBY trending:1h &amp;lt;tag&amp;gt; 1
                 ZINCRBY trending:24h &amp;lt;tag&amp;gt; 1
                 ZINCRBY trending:7d &amp;lt;tag&amp;gt; 1

Expire keys: trending:1h → TTL 1 hour (rolling via scheduled reset)
             trending:24h → TTL 24 hours
             trending:7d → TTL 7 days

Read trending: ZREVRANGE trending:1h 0 9 WITHSCORES
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For true sliding windows (not fixed-window resets), use a sorted set with timestamps as members and prune periodically with &lt;code&gt;ZREMRANGEBYSCORE&lt;/code&gt;.&lt;/p&gt;
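&lt;p&gt;A minimal sketch of that true sliding window, with a per-tag list of timestamps standing in for the sorted set and the pruning done on read (the &lt;code&gt;ZREMRANGEBYSCORE&lt;/code&gt; step):&lt;/p&gt;

```python
import time

WINDOW = 3600  # 1-hour window
_uses = {}     # tag -> list of use timestamps

def record_use(tag, now=None):
    """One timestamped entry per hashtag use (the ZADD step)."""
    _uses.setdefault(tag, []).append(now if now is not None else time.time())

def count_in_window(tag, now=None):
    """Prune entries older than the window, then count what remains."""
    now = now if now is not None else time.time()
    fresh = [t for t in _uses.get(tag, []) if t > now - WINDOW]
    _uses[tag] = fresh  # the ZREMRANGEBYSCORE prune
    return len(fresh)
```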

&lt;h3&gt;
  
  
  Explore / Discover
&lt;/h3&gt;

&lt;p&gt;Explore isn’t search — it’s recommendation. ML-powered, personalised, continuously reranked.&lt;/p&gt;

&lt;p&gt;Pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Candidate generation: posts with high engagement velocity in last 24h&lt;/li&gt;
&lt;li&gt;User interest modelling: what content types has this user engaged with?&lt;/li&gt;
&lt;li&gt;Collaborative filtering: what are similar users engaging with?&lt;/li&gt;
&lt;li&gt;Re-ranking: apply diversity, freshness, safety filters&lt;/li&gt;
&lt;li&gt;Serve top 50 candidates per request&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Infrastructure: Apache Spark for batch feature computation, TensorFlow Serving for real-time scoring, Redis for caching ranked candidate lists per user.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Component Deep Dive: Likes &amp;amp; Comments
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Postgres Can’t Handle Likes
&lt;/h3&gt;

&lt;p&gt;52,000 likes per second. In Postgres, each like is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;code&gt;INSERT&lt;/code&gt; into the likes table&lt;/li&gt;
&lt;li&gt;An &lt;code&gt;UPDATE&lt;/code&gt; on the post’s &lt;code&gt;like_count&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Potentially a row lock while updating the count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At 52K/second, you’ll hit write contention, lock timeouts, and deadlocks. Postgres wasn’t built for this write pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cassandra for Likes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Cassandra table design&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;likes&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;post_id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;reaction_type&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMEUUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this schema:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;post_id&lt;/code&gt; as partition key → all likes for a post on one node&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user_id&lt;/code&gt; as clustering key → single-partition point read for “has this user liked this post?”&lt;/li&gt;
&lt;li&gt;TIMEUUID for ordering without separate timestamp column&lt;/li&gt;
&lt;li&gt;INSERT is idempotent → same &lt;code&gt;(post_id, user_id)&lt;/code&gt; twice = one like (handles retries)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Like count:&lt;/strong&gt;&lt;br&gt;
Don’t store count in Cassandra (COUNTER type has consistency quirks). Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Atomic &lt;code&gt;INCR&lt;/code&gt; in Redis: &lt;code&gt;like_ct:{post_id}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Write-back to Postgres &lt;code&gt;posts.like_count&lt;/code&gt; every 30 seconds async&lt;/li&gt;
&lt;li&gt;Accept: count shown may be ~30s behind actual. Nobody notices.&lt;/li&gt;
&lt;/ul&gt;
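&lt;p&gt;The counter flow can be sketched like this (plain dicts stand in for Redis and Postgres; &lt;code&gt;flush()&lt;/code&gt; plays the 30-second async write-back job):&lt;/p&gt;

```python
class LikeCounter:
    """Redis-style atomic counters with periodic write-back to the
    relational store. In production, incr() is a Redis INCR/DECR and
    flush() is a scheduled job updating posts.like_count."""

    def __init__(self):
        self.redis = {}   # like_ct:{post_id} -> running total
        self.db = {}      # posts.like_count, lagging by one flush

    def incr(self, post_id, delta=1):
        key = f"like_ct:{post_id}"
        self.redis[key] = self.redis.get(key, 0) + delta
        return self.redis[key]

    def flush(self):
        # write-back job: copy hot counters into the durable store
        for key, val in self.redis.items():
            post_id = key.split(":", 1)[1]
            self.db[post_id] = val
```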
&lt;h3&gt;
  
  
  Comments in Cassandra
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;comments&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;post_id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;comment_id&lt;/span&gt; &lt;span class="n"&gt;TIMEUUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;like_count&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;comment_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;CLUSTERING&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;comment_id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Why TIMEUUID as clustering key:&lt;/strong&gt;&lt;br&gt;
Ordering is built into the key — no &lt;code&gt;ORDER BY&lt;/code&gt; at query time. Comments are naturally sorted chronologically. Pagination with &lt;code&gt;WHERE comment_id &amp;gt; &amp;lt;last_seen&amp;gt;&lt;/code&gt; is efficient.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;comments&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;post_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;comment_id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="c1"&gt;-- cursor&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Efficient. No full scans. Scales to millions of comments per post.&lt;/p&gt;
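&lt;p&gt;The cursor contract can be sketched in Python (a sorted list stands in for the Cassandra partition; since TIMEUUIDs sort chronologically, integer IDs model them here):&lt;/p&gt;

```python
def paginate_comments(comments, cursor=None, limit=20):
    """Cursor pagination mirroring the CQL above:
    WHERE comment_id > cursor LIMIT n over a pre-sorted partition.
    Returns (page, next_cursor); next_cursor is None on the last page."""
    if cursor is not None:
        comments = [c for c in comments if c["comment_id"] > cursor]
    page = comments[:limit]
    next_cursor = page[-1]["comment_id"] if len(page) == limit else None
    return page, next_cursor
```

&lt;p&gt;Unlike OFFSET pagination, the cursor never re-scans skipped rows, so page 1,000 costs the same as page 1.&lt;/p&gt;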




&lt;h2&gt;
  
  
  10. Database Design — Every Decision Justified {#database-design}
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsnswzg0c0n4edu1z3dwx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsnswzg0c0n4edu1z3dwx.png" alt=" " width="800" height="1040"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  PostgreSQL — The Relational Core
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;users table:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt;      &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="n"&gt;username&lt;/span&gt;     &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt;        &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;255&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;password_hash&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;bio&lt;/span&gt;          &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;profile_pic_url&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;follower_count&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;following_count&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;is_private&lt;/span&gt;   &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;is_verified&lt;/span&gt;  &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;   &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_users_username&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_users_email&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;posts table:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;post_id&lt;/span&gt;      &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt;      &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;caption&lt;/span&gt;      &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;media_urls&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
  &lt;span class="n"&gt;media_type&lt;/span&gt;   &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;media_type&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'photo'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'video'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'reel'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
  &lt;span class="n"&gt;location_lat&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;location_lng&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;like_count&lt;/span&gt;   &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;comment_count&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;is_deleted&lt;/span&gt;   &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;   &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_posts_user_created&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;follows table:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;follows&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;follower_id&lt;/span&gt;  &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;following_id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;status&lt;/span&gt;       &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'blocked'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt;   &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;follower_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;following_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_follows_following&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;follows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;following_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Redis Key Design
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key Pattern&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;TTL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;feed:{user_id}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ZSET&lt;/td&gt;
&lt;td&gt;Pre-computed ranked feed&lt;/td&gt;
&lt;td&gt;10 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;story:{user_id}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SET of story_ids&lt;/td&gt;
&lt;td&gt;Active stories&lt;/td&gt;
&lt;td&gt;24h&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;session:{token}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;STRING&lt;/td&gt;
&lt;td&gt;Auth session → user_id&lt;/td&gt;
&lt;td&gt;7 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rate:{uid}:{action}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;COUNTER&lt;/td&gt;
&lt;td&gt;Rate limit window&lt;/td&gt;
&lt;td&gt;1 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;like_ct:{post_id}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;STRING&lt;/td&gt;
&lt;td&gt;Atomic like counter&lt;/td&gt;
&lt;td&gt;No TTL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;trending:{window}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ZSET&lt;/td&gt;
&lt;td&gt;Hashtag trending scores&lt;/td&gt;
&lt;td&gt;Window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;viewed:{uid}:{story_id}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;STRING&lt;/td&gt;
&lt;td&gt;Story viewed flag&lt;/td&gt;
&lt;td&gt;24h&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Storage Selection Rationale
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Users, Posts, Follows, Stories&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Relational, consistency required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Likes, Comments, Timelines&lt;/td&gt;
&lt;td&gt;Cassandra&lt;/td&gt;
&lt;td&gt;Write-heavy, no JOINs needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feeds, Sessions, Counters&lt;/td&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;td&gt;Speed, TTL support, atomic ops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search, Hashtags, Explore&lt;/td&gt;
&lt;td&gt;Elasticsearch&lt;/td&gt;
&lt;td&gt;Inverted index, full-text, geo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Photos, Videos, Stories&lt;/td&gt;
&lt;td&gt;S3 + CDN&lt;/td&gt;
&lt;td&gt;Cheap, durable, globally distributed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  11. API Design — Full Contracts {#api-design}
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4le71rfyu8dn3b7dvupf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4le71rfyu8dn3b7dvupf.png" alt=" " width="800" height="1086"&gt;&lt;/a&gt;&lt;br&gt;
All APIs are RESTful, JWT-authenticated, cursor-paginated, and rate-limited at the API Gateway via Redis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auth APIs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /v1/auth/register
Body: { username, email, password, full_name }
Returns: { access_token, refresh_token, user }

POST /v1/auth/login
Body: { email, password }
Returns: { access_token, refresh_token }

POST /v1/auth/refresh
Header: Authorization: Bearer &amp;lt;refresh_token&amp;gt;
Returns: { access_token }

POST /v1/auth/logout
Header: Authorization: Bearer &amp;lt;access_token&amp;gt;
Action: DEL session:{token} from Redis
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  User APIs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET  /v1/users/:username
Returns: { user_id, username, bio, follower_count, following_count,
           posts_count, is_followed, is_private, is_verified }

PATCH /v1/users/me
Body: { bio?, profile_pic_url?, is_private? }

GET  /v1/users/:id/followers?cursor=&amp;lt;cursor&amp;gt;&amp;amp;limit=20
Returns: { users[], next_cursor }

POST   /v1/users/:id/follow     → idempotent, Kafka: follow.created
DELETE /v1/users/:id/follow     → Kafka: follow.removed
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Post APIs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /v1/posts
Body: { caption, media_type, location? }
Returns: { post_id, upload_url }   ← pre-signed S3 URL
Action: client uploads directly to S3, S3 event triggers Kafka

GET  /v1/posts/:post_id
Returns: { post, author, like_count, comment_count, is_liked, is_saved }

DELETE /v1/posts/:post_id
Action: is_deleted=true, Kafka: post.deleted → CDN purge

POST   /v1/posts/:post_id/like
Header: Idempotency-Key: &amp;lt;uuid&amp;gt;
Action: Redis INCR + Cassandra write + Kafka: like.created

DELETE /v1/posts/:post_id/like
Action: Redis DECR + Cassandra delete + Kafka: like.removed

GET  /v1/posts/:post_id/comments?cursor=&amp;lt;cursor&amp;gt;&amp;amp;limit=20
Returns: { comments[], next_cursor }   ← from Cassandra

POST /v1/posts/:post_id/comments
Body: { text }
Action: Cassandra write + Kafka: comment.created → notification
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Feed &amp;amp; Stories APIs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /v1/feed?cursor=&amp;lt;cursor&amp;gt;&amp;amp;limit=10
Action: Redis ZSET read → cache miss → Cassandra rebuild → re-cache
Returns: { posts[], next_cursor }

GET /v1/explore?page=1&amp;amp;limit=20
Action: Elasticsearch + ML ranking
Returns: { posts[], next_page }

POST /v1/stories
Body: { media_type }
Returns: { story_id, upload_url }
Action: Redis TTL set + Kafka: story.created

GET /v1/stories/feed
Returns: { stories[] }   ← unviewed, sorted by recency

POST /v1/stories/:id/view
Action: Redis SET viewed:{uid}:{story_id} 1 EX 86400 + async view_count INCR

DELETE /v1/stories/:id
Action: is_deleted=true + Kafka: story.deleted → S3 purge
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Search APIs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /v1/search/users?q=rachit&amp;amp;limit=10
Action: Elasticsearch users_index, fuzzy match, boost by follower_count

GET /v1/search/posts?q=sunset&amp;amp;hashtag=travel&amp;amp;lat=28.6&amp;amp;lng=77.2&amp;amp;radius=10km
Action: Elasticsearch posts_index, geo-filter + text match

GET /v1/search/trending?window=1h
Action: ZREVRANGE trending:1h 0 9 WITHSCORES
Returns: { tags: [{ tag, post_count, delta }] }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rate Limits
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Limit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;POST /posts&lt;/td&gt;
&lt;td&gt;10/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;POST /like&lt;/td&gt;
&lt;td&gt;60/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GET /feed&lt;/td&gt;
&lt;td&gt;30/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;POST /comments&lt;/td&gt;
&lt;td&gt;20/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GET /search&lt;/td&gt;
&lt;td&gt;20/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;POST /stories&lt;/td&gt;
&lt;td&gt;5/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;POST /follow&lt;/td&gt;
&lt;td&gt;30/min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
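&lt;p&gt;A minimal fixed-window sketch of the gateway limiter (a dict stands in for the &lt;code&gt;rate:{uid}:{action}&lt;/code&gt; Redis counters; in production the first hit in a window would INCR with EXPIRE 60):&lt;/p&gt;

```python
class RateLimiter:
    """Fixed-window rate limiter: bump the counter for the current
    1-minute window and reject once the per-action limit is exceeded."""

    def __init__(self, limits):
        self.limits = limits    # action -> max requests per 60s window
        self.counters = {}      # (uid, action, window) -> count

    def allow(self, uid, action, now):
        window = int(now // 60)           # fixed 1-minute buckets
        key = (uid, action, window)
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key] <= self.limits[action]
```

&lt;p&gt;Fixed windows allow a brief burst at window boundaries; the sliding-window sorted-set approach from the search section closes that gap at higher memory cost.&lt;/p&gt;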




&lt;h2&gt;
  
  
  12. Edge Cases Nobody Draws on Their Diagram {#edge-cases}
&lt;/h2&gt;

&lt;p&gt;This section is what turns a good system design into a great one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Celebrity Fan-out Storm
&lt;/h3&gt;

&lt;p&gt;Cristiano Ronaldo posts. 600M followers. Fan-out on write to all of them simultaneously would generate 600M Cassandra writes in seconds — your cluster dies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Celebrity detection at post time (follower_count &amp;gt; 1M). Skip fan-out. At feed read time, fetch celebrity’s latest posts separately and merge. The merge happens in the Feed Service, in memory, before Redis caching.&lt;/p&gt;
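&lt;p&gt;The read-time merge can be sketched like this (both inputs arrive sorted newest-first; field names are illustrative):&lt;/p&gt;

```python
import heapq

def assemble_feed(fanout_feed, celebrity_posts, limit=10):
    """Merge the precomputed fan-out feed with celebrity posts fetched
    at read time. Since both lists are already sorted newest-first by
    created_at, heapq.merge combines them without a full re-sort."""
    merged = heapq.merge(fanout_feed, celebrity_posts,
                         key=lambda p: p["created_at"], reverse=True)
    return list(merged)[:limit]
```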

&lt;h3&gt;
  
  
  The Disappearing Story
&lt;/h3&gt;

&lt;p&gt;Redis TTL fires (story expires). Kafka consumer is restarting at that exact moment. The &lt;code&gt;story.expired&lt;/code&gt; event is consumed, but the consumer crashes before committing the offset. The event replays. The delete runs twice on S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; S3 delete is idempotent (deleting a non-existent object returns 204). The Cassandra write is idempotent (same &lt;code&gt;story_id&lt;/code&gt; soft-delete runs twice = same result). Design all consumers to handle duplicate events safely.&lt;/p&gt;
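&lt;p&gt;A sketch of such a duplicate-safe consumer (dicts stand in for S3 and Cassandra; the event fields are illustrative):&lt;/p&gt;

```python
def handle_story_expired(event, s3, cassandra):
    """Idempotent consumer for story.expired: both side effects are
    safe to repeat, so a replayed event (offset not committed before
    a crash) converges to the same final state."""
    # delete media: an absent key is a no-op, like S3's 204 response
    s3.pop(event["media_key"], None)
    # soft-delete is an upsert: running it twice yields the same row
    cassandra[event["story_id"]] = {"is_deleted": True}
```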

&lt;h3&gt;
  
  
  The Double Like
&lt;/h3&gt;

&lt;p&gt;Network is flaky. User taps like. Request times out client-side. Client retries. Server receives two &lt;code&gt;POST /like&lt;/code&gt; requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; &lt;code&gt;Idempotency-Key: &amp;lt;uuid&amp;gt;&lt;/code&gt; header on every like request. Server runs &lt;code&gt;SET idempotency:{key} 1 NX EX 86400&lt;/code&gt; in Redis before processing (&lt;code&gt;SETNX&lt;/code&gt; alone can’t set a TTL). If the key already exists, return the cached response. If not, process and cache. Same key = same result, always.&lt;/p&gt;
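&lt;p&gt;The guard can be sketched like this (dicts stand in for Redis; in production the claim would be a single atomic SET-with-NX-and-expiry call):&lt;/p&gt;

```python
def handle_like(key, process, store, responses):
    """Idempotency-Key guard: claim the key before processing; if it
    was already claimed, return the cached response instead of
    running the side effect again."""
    if key in store:                 # claim failed -> duplicate request
        return responses[key]
    store[key] = 1                   # claim the key (NX semantics)
    responses[key] = process()       # process once, cache the result
    return responses[key]
```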

&lt;h3&gt;
  
  
  Comment on Deleted Post
&lt;/h3&gt;

&lt;p&gt;Post is soft-deleted. User (who has the post open on their screen) tries to comment. Request hits Comment Service before the deletion propagates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Comment Service calls Post Service to validate &lt;code&gt;is_deleted&lt;/code&gt; before writing. Or: API Gateway checks post status. Or: accept the race condition and clean up orphaned comments in a background job. Third option is usually right — the complexity of synchronous cross-service validation isn’t worth the edge case frequency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Notification Flood
&lt;/h3&gt;

&lt;p&gt;10M likes in 10 minutes on a viral reel. Without batching, the post author gets 10M push notifications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Debounce in Notification Service. Redis counter per &lt;code&gt;(recipient_id, post_id, notification_type)&lt;/code&gt; with 60-second window. At window close, fire one notification: “Priya and 9,999 others liked your reel.” Reset counter.&lt;/p&gt;
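&lt;p&gt;A sketch of that debounce window (in-memory state stands in for the Redis counters; the message wording is illustrative):&lt;/p&gt;

```python
class NotificationDebouncer:
    """Count events per (recipient, post, type) inside a 60-second
    window; at window close, emit one aggregated notification that
    names the first actor, then reset the counter."""

    def __init__(self, window=60):
        self.window = window
        self.pending = {}   # key -> (count, first_actor, window_start)

    def record(self, recipient, post_id, actor, now):
        key = (recipient, post_id, "like")
        count, first, start = self.pending.get(key, (0, actor, now))
        self.pending[key] = (count + 1, first, start)

    def flush(self, now):
        out = []
        for key, (count, first, start) in list(self.pending.items()):
            if now - start >= self.window:     # window closed
                others = count - 1
                msg = (f"{first} liked your post" if others == 0
                       else f"{first} and {others:,} others liked your post")
                out.append((key[0], msg))
                del self.pending[key]          # reset the counter
        return out
```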

&lt;h3&gt;
  
  
  Cold Start Feed
&lt;/h3&gt;

&lt;p&gt;New user. Zero followings. Feed is empty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Onboarding flow → interest selection → seed feed with high-engagement posts matching selected interests from Elasticsearch. After 5+ follows, switch to normal feed generation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Geo-Replication Lag
&lt;/h3&gt;

&lt;p&gt;User in Mumbai follows someone in New York. The follow write goes to primary (US). Mumbai’s read replica is 800ms behind. User immediately views the newly-followed account’s profile — replica says “not following.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; For follow-status checks that are user-initiated immediately after a follow action, route the read to primary (or use a read-your-own-writes cache in Redis). This is the one case where eventual consistency is genuinely confusing to users.&lt;/p&gt;
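&lt;p&gt;A sketch of that routing rule (the 5-second pin window is an illustrative choice, sized to exceed worst-case replication lag):&lt;/p&gt;

```python
class FollowReads:
    """Read-your-own-writes routing: after a user performs a follow,
    pin their follow-status reads to the primary for a short freshness
    window so a lagging replica can't show 'not following'."""

    def __init__(self, pin_seconds=5):
        self.pin = pin_seconds
        self.last_write = {}   # (follower, followee) -> write time

    def on_follow(self, follower, followee, now):
        self.last_write[(follower, followee)] = now

    def route(self, follower, followee, now):
        t = self.last_write.get((follower, followee))
        # recent write by this user -> primary; otherwise any replica
        return "primary" if t is not None and now - t < self.pin else "replica"
```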




&lt;h2&gt;
  
  
  13. Key Trade-offs Summary {#trade-offs}
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;Trade-off&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cassandra for likes&lt;/td&gt;
&lt;td&gt;Write speed vs. no ACID, no JOINs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Push feed fan-out&lt;/td&gt;
&lt;td&gt;Fast reads vs. write amplification for popular accounts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async Elasticsearch sync&lt;/td&gt;
&lt;td&gt;Search features vs. 1-2 second indexing lag&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis like counters&lt;/td&gt;
&lt;td&gt;Speed vs. 30-second write-back delay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eventual consistency on replicas&lt;/td&gt;
&lt;td&gt;Read scale vs. briefly stale data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Soft deletes everywhere&lt;/td&gt;
&lt;td&gt;Safety / auditability vs. storage overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pre-signed S3 uploads&lt;/td&gt;
&lt;td&gt;Scalable media ingestion vs. more complex client logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid fan-out&lt;/td&gt;
&lt;td&gt;Balanced throughput vs. more complex feed assembly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Instagram at 500M DAU isn’t one system. It’s eight systems — feed, stories, media, notifications, search, likes, comments, and the graph — running in parallel, sharing Kafka as the connective tissue, each independently scalable.&lt;/p&gt;

&lt;p&gt;The principles that hold across all of them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Design for the read path first&lt;/strong&gt; — the read:write ratio is roughly 80:20&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async everything that doesn’t need to be sync&lt;/strong&gt; — Kafka is your friend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name your trade-offs explicitly&lt;/strong&gt; — “we accept 2-second search lag for write simplicity”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design for idempotency everywhere&lt;/strong&gt; — networks fail, retries happen, duplicates arrive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The cache is not the source of truth&lt;/strong&gt; — always have a fallback to the DB&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s what Instagram-scale system design looks like.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in this series: Why SQL beats NoSQL for 90% of startups — the data, the nuance, and why the benchmarks lie.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the author:&lt;/strong&gt; Rachit writes about system design, backend engineering, and the real trade-offs nobody talks about. Follow for weekly deep dives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;system-design&lt;/code&gt; &lt;code&gt;backend&lt;/code&gt; &lt;code&gt;distributed-systems&lt;/code&gt; &lt;code&gt;instagram&lt;/code&gt; &lt;code&gt;software-architecture&lt;/code&gt; &lt;code&gt;database&lt;/code&gt; &lt;code&gt;kafka&lt;/code&gt; &lt;code&gt;redis&lt;/code&gt; &lt;code&gt;elasticsearch&lt;/code&gt; &lt;code&gt;cassandra&lt;/code&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>backend</category>
      <category>distributedsystems</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Your Shiny New Kafka Cluster is a Ticking Time Bomb</title>
      <dc:creator>Rachit Misra</dc:creator>
      <pubDate>Tue, 31 Mar 2026 02:10:16 +0000</pubDate>
      <link>https://dev.to/hungryformore/your-shiny-new-kafka-cluster-is-a-ticking-time-bomb-50a</link>
      <guid>https://dev.to/hungryformore/your-shiny-new-kafka-cluster-is-a-ticking-time-bomb-50a</guid>
      <description>&lt;p&gt;What nobody tells you about event-driven architecture until it’s 3 AM and your database is corrupted&lt;br&gt;
A war story about distributed systems, offset commits, and the most expensive lesson of my engineering career.&lt;/p&gt;

&lt;p&gt;Everyone said Kafka would fix our feed latency.&lt;br&gt;
They were right.&lt;br&gt;
For exactly 31 days.&lt;br&gt;
Then PagerDuty fired at 3:17 AM, consumer lag hit 2.4 million messages, and we discovered something that no architecture diagram had ever shown us — Kafka doesn’t care about your database state. It just delivers. Faithfully. Mercilessly.&lt;br&gt;
This is that story.&lt;/p&gt;

&lt;p&gt;The Setup — How We Got Here&lt;br&gt;
Q3. Product team screaming for faster feeds. Our engineering lead had just come back from a conference. Someone in the architecture meeting said the word Kafka.&lt;br&gt;
The room lit up. Senior engineers nodded. Someone opened a laptop and pulled up the Confluent documentation.&lt;br&gt;
We had a simple Redis-based event queue that was “too slow.” Our P99 latency was 800ms on feed generation. Product wanted 200ms. Kafka promised sub-100ms. The decision felt obvious.&lt;br&gt;
Three weeks later, we shipped it. Latency dropped to 80ms. The product team celebrated. Engineering celebrated. We had graphs going the right direction.&lt;br&gt;
We had no idea what we’d just invited into our system.&lt;/p&gt;

&lt;p&gt;What Kafka Actually Is&lt;br&gt;
Before we get to the 3 AM incident, let’s talk about what Kafka actually is — because most teams get this wrong before they write a single line of code.&lt;br&gt;
Kafka is not a message queue.&lt;br&gt;
This is the most dangerous misconception in distributed systems today. When engineers hear “message queue,” they think: send a message, someone receives it, it’s gone. Like an email inbox.&lt;br&gt;
Kafka is a distributed commit log.&lt;br&gt;
Every event is written to a log. The log is partitioned across brokers. Each partition is an ordered, immutable sequence of records. Consumers read from the log at their own pace, tracked by an offset — a pointer to their position in the log.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Partition 0: [event1] [event2] [event3] [event4] [event5]
                                          ↑
                                    Consumer offset
                                    (Consumer has read up to here)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The crucial difference from a traditional queue: Kafka doesn’t delete messages after consumption. Messages are retained for a configured retention period (default 7 days). Consumers can seek to any offset and replay any portion of the log.&lt;br&gt;
This is both Kafka’s superpower and its most dangerous property.&lt;br&gt;
Superpower: Replay events for new consumers, rebuild state, debug issues by replaying history.&lt;br&gt;
Danger: If your consumers are not idempotent and your system state has mutated, replay doesn’t recover your system. It corrupts it further.&lt;/p&gt;
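&lt;p&gt;The danger is easy to reproduce in a few lines. The sketch below is plain Java with illustrative names (no real Kafka API): the same three events are delivered twice, simulating a replay from an uncommitted offset, and a naive handler is compared against one that tracks processed event IDs:&lt;/p&gt;

```java
import java.util.HashSet;

// Toy model of a replay after a crash. Names are illustrative; no real
// Kafka API is used. The same three events are delivered twice, as they
// would be when a consumer dies after processing but before committing
// its offset.
public class ReplayDemo {
    static int inventory = 100;               // state behind the naive handler
    static int inventoryIdem = 100;           // state behind the idempotent handler
    static HashSet processed = new HashSet(); // event IDs already handled

    // Non-idempotent: every delivery mutates state.
    static void handleNaive(String eventId) {
        inventory -= 1;
    }

    // Idempotent: duplicate deliveries are detected and skipped.
    static void handleIdempotent(String eventId) {
        if (processed.contains(eventId)) return;
        inventoryIdem -= 1;
        processed.add(eventId);
    }

    public static void main(String[] args) {
        String[] log = {"order-1", "order-2", "order-3"};
        // First delivery, then a full replay from the last committed offset.
        for (String id : log) { handleNaive(id); handleIdempotent(id); }
        for (String id : log) { handleNaive(id); handleIdempotent(id); }

        System.out.println("naive inventory: " + inventory);          // 94 (corrupted)
        System.out.println("idempotent inventory: " + inventoryIdem); // 97 (correct)
    }
}
```

&lt;p&gt;Same log, same replay. One handler ends up with corrupted state; the other doesn’t care how many times the log is replayed.&lt;/p&gt;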

&lt;p&gt;The Architecture We Built&lt;br&gt;
Our feed generation system looked elegant on the whiteboard:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Action
    ↓
[Producer] → [Kafka Topic: user-events]
                    ↓
            [Consumer Group A] → Updates user feed cache
            [Consumer Group B] → Updates recommendation engine
            [Consumer Group C] → Updates activity timeline
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Three consumer groups. Each consuming the same events for different purposes. Decoupled. Independent. Horizontally scalable.&lt;br&gt;
The architecture review went smoothly. “What about consumer failures?” someone asked.&lt;br&gt;
“Kafka handles that,” we said. “It retains messages. We can replay.”&lt;br&gt;
We were right. And catastrophically wrong.&lt;/p&gt;

&lt;p&gt;Month 2 — The Nightmare Begins&lt;br&gt;
3:17 AM&lt;br&gt;
PagerDuty fires.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALERT: Consumer lag critical
Topic: order-events
Consumer Group: order-processor
Lag: 2,400,000 messages
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;2.4 million messages behind. Not a small spike — the order-processor consumer group had been silently falling behind for 6 hours. A memory leak in one of the consumer instances had caused it to slow down. Kubernetes hadn’t restarted it because it was still running — just slowly.&lt;br&gt;
By the time the alert fired, we had 2.4 million unprocessed order events.&lt;br&gt;
The Decision That Made Everything Worse&lt;br&gt;
“No problem,” the on-call engineer said. “We’ll just restart the consumers and let them catch up.”&lt;br&gt;
This is where the real disaster began.&lt;br&gt;
The consumers restarted. They picked up from their last committed offset. They began processing 2.4 million events.&lt;br&gt;
Thirty minutes later, our database was in a state of chaos:&lt;br&gt;
    ∙ Orders were being marked as “processing” that had already been delivered&lt;br&gt;
    ∙ Inventory counters were going negative&lt;br&gt;
    ∙ Users were being charged twice for orders they’d already received&lt;br&gt;
    ∙ Recommendation scores were being recalculated with stale data, overwriting fresh data&lt;br&gt;
We stopped the consumers. But the damage was done.&lt;br&gt;
What Actually Happened&lt;br&gt;
Let me reconstruct the failure precisely, because understanding the exact mechanism is everything.&lt;br&gt;
Step 1 — The initial processing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Event: OrderPlaced {orderId: 12345, amount: 599, userId: user456}
Consumer processes event:
  → Creates order record in DB
  → Charges payment gateway
  → Updates inventory: item_count -= 1
  → Commits offset
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This worked correctly for months.&lt;br&gt;
Step 2 — The slow consumer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Event: OrderPlaced {orderId: 67890, amount: 1299, userId: user789}
Consumer receives event
Consumer starts processing...
  → Creates order record in DB ✅
  → Charges payment gateway ✅
  → Updates inventory: item_count -= 1 ✅
Consumer crashes before committing offset ❌
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The offset for this batch was never committed. From Kafka’s perspective, these events were never successfully consumed.&lt;br&gt;
Step 3 — The restart:&lt;br&gt;
When the consumer restarted, it read from the last committed offset — before the crash. It received the same events again.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Event: OrderPlaced {orderId: 67890, amount: 1299, userId: user789}
Consumer processes event AGAIN:
  → Creates order record in DB ← DUPLICATE
  → Charges payment gateway ← DOUBLE CHARGE
  → Updates inventory: item_count -= 1 ← WRONG AGAIN
  → Commits offset ✅
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Step 4 — The 2.4 million event replay:&lt;br&gt;
Multiply this across 2.4 million events, many of which had been partially processed or fully processed but whose offsets hadn’t been committed. Some events were processed once, some twice, some partially.&lt;br&gt;
The database was now in an indeterminate state. We couldn’t tell which operations had run once and which had run twice.&lt;br&gt;
Kafka didn’t fail. It worked exactly as designed.&lt;br&gt;
We had built a perfectly reliable pipeline for delivering chaos at scale.&lt;/p&gt;

&lt;p&gt;The Root Cause Analysis&lt;br&gt;
Our postmortem identified three fundamental mistakes.&lt;br&gt;
Mistake 1: No Idempotency&lt;br&gt;
Idempotency means that running the same operation multiple times produces the same result as running it once.&lt;br&gt;
Our consumers were not idempotent. Processing the same OrderPlaced event twice created two orders, two charges, two inventory decrements.&lt;br&gt;
The fix:&lt;br&gt;
Every event handler must check: “Have I already processed this event?”&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@KafkaListener(topics = "order-events")
public void handleOrderEvent(OrderEvent event) {
    // Idempotency check FIRST
    if (eventProcessingRepository.exists(event.getEventId())) {
        log.info("Event {} already processed, skipping", event.getEventId());
        return;
    }

    try {
        // Process the event
        orderService.processOrder(event);

        // Mark as processed ATOMICALLY with the business operation
        // Use a DB transaction to ensure both happen or neither happens
        eventProcessingRepository.markProcessed(event.getEventId());

    } catch (Exception e) {
        // Don't mark as processed — allow retry
        throw e;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE processed_events (
    event_id VARCHAR(255) PRIMARY KEY,
    processed_at TIMESTAMP DEFAULT NOW(),
    consumer_group VARCHAR(100)
);

-- Index to support periodic cleanup of old rows (keeps the table bounded)
CREATE INDEX idx_processed_events_time ON processed_events(processed_at);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Every unique event ID is stored. Before processing, check if it exists. If yes — skip. If no — process and insert atomically.&lt;br&gt;
The idempotency key design matters:&lt;br&gt;
Your event IDs must be stable and unique across retries. Use a combination of business identifiers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Good: stable, business-meaningful
String eventId = "order-placed-" + orderId + "-" + userId;

// Bad: changes on retry
String eventId = UUID.randomUUID().toString();
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Mistake 2: Auto-Commit Was a Lie&lt;br&gt;
Kafka consumers have two offset commit strategies:&lt;br&gt;
Auto-commit (the default):&lt;/p&gt;

&lt;p&gt;props.put("enable.auto.commit", "true");&lt;br&gt;
props.put("auto.commit.interval.ms", "5000");&lt;/p&gt;

&lt;p&gt;Every 5 seconds, Kafka automatically commits the current offset regardless of whether your application has finished processing. If your consumer crashes between the auto-commit and finishing processing — the message is considered consumed but your application never finished handling it. Silent data loss.&lt;br&gt;
Alternatively, if your consumer crashes after processing but before the next auto-commit — the message replays on restart. Duplicate processing.&lt;br&gt;
Auto-commit gives you the worst of both worlds: potential data loss AND potential duplicates, with no control over which failure mode you experience.&lt;br&gt;
Manual commit (the correct approach):&lt;/p&gt;

&lt;p&gt;props.put("enable.auto.commit", "false");&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@KafkaListener(topics = "order-events")
public void handleOrderEvent(
        OrderEvent event,
        Acknowledgment acknowledgment) {

    try {
        // Process the event fully
        orderService.processOrder(event);

        // Only commit offset AFTER successful processing
        acknowledgment.acknowledge();

    } catch (Exception e) {
        // Don't acknowledge — message will be redelivered
        // Make sure your handler is idempotent!
        log.error("Failed to process event {}", event.getEventId(), e);
        throw e;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;With manual commit, you control exactly when an offset is committed. The offset only advances when you’ve confirmed successful processing.&lt;br&gt;
At-least-once vs exactly-once:&lt;br&gt;
Manual commit gives you at-least-once delivery — messages are never lost, but may be delivered more than once. Combined with idempotency, this is safe and practical.&lt;br&gt;
Exactly-once delivery is possible with Kafka transactions but comes with significant complexity and performance overhead. For most systems, at-least-once + idempotency is the right trade-off.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Exactly-once with Kafka transactions (complex, high overhead)
@Transactional
public void processWithExactlyOnce(OrderEvent event) {
    // Kafka transaction spans both the consumer offset commit
    // and the producer write — atomic
    kafkaTemplate.executeInTransaction(operations -&amp;gt; {
        orderService.processOrder(event);
        operations.send("order-processed", event.getOrderId());
        return null;
    });
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Mistake 3: No Replay Strategy&lt;br&gt;
When we decided to replay 2.4 million events, we hadn’t asked a fundamental question:&lt;br&gt;
Is the current state of our database compatible with replaying these events?&lt;br&gt;
The answer was no. The database had moved forward. Events that assumed “inventory = 100” were being replayed against a database where “inventory = 47.”&lt;br&gt;
The correct replay strategy:&lt;br&gt;
Before replaying events, you must answer:&lt;br&gt;
    1.  What is the current state of the system? Take a snapshot before replay begins.&lt;br&gt;
    2.  Are the events you’re replaying compatible with the current state? If an event says “deduct 1 from inventory” and inventory is already at the post-event value — replaying it will corrupt state.&lt;br&gt;
    3.  Can you replay to a shadow system first? Replay events against a read replica or a staging environment to validate the outcome before applying to production.&lt;br&gt;
    4.  Do you have a compensation mechanism? If replay causes inconsistency, can you detect and correct it?&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public class SafeReplayService {

    public void replayEvents(String topic, long fromOffset, long toOffset) {
        // Step 1: Take DB snapshot for rollback capability
        String snapshotId = snapshotService.createSnapshot();

        // Step 2: Enable replay mode (idempotency is critical here)
        replayModeFlag.set(true);

        // Step 3: Replay in small batches with validation
        for (long offset = fromOffset; offset &amp;lt; toOffset; offset += BATCH_SIZE) {
            List&amp;lt;ConsumerRecord&amp;gt; batch = fetchBatch(topic, offset, BATCH_SIZE);

            // Validate state compatibility before processing
            if (!stateCompatibilityChecker.isCompatible(batch)) {
                log.error("State incompatibility detected at offset {}", offset);
                rollbackService.rollback(snapshotId);
                throw new ReplayException("Cannot safely replay at offset " + offset);
            }

            processBatch(batch);
            validateBatchOutcome(batch);
        }

        replayModeFlag.set(false);
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The Kafka Failure Modes Nobody Talks About&lt;br&gt;
Consumer Group Rebalancing&lt;br&gt;
When a consumer joins or leaves a consumer group, Kafka triggers a rebalance — reassigning partitions across consumers.&lt;br&gt;
During rebalance, all consumers in the group stop processing. For a group of 10 consumers handling a high-throughput topic, a rebalance can pause processing for 30-60 seconds.&lt;br&gt;
Causes of unexpected rebalances:&lt;br&gt;
    ∙ Consumer takes longer than max.poll.interval.ms to process a batch&lt;br&gt;
    ∙ Consumer fails to send heartbeat within session.timeout.ms&lt;br&gt;
    ∙ Deployment rolling update adds/removes consumer instances&lt;br&gt;
Mitigation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Increase poll interval for slow processors
props.put("max.poll.interval.ms", "600000"); // 10 minutes

// Reduce batch size to ensure processing within interval
props.put("max.poll.records", "100"); // Process 100 records at a time

// Use static membership to reduce rebalances during restarts
props.put("group.instance.id", "consumer-instance-1");
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Log Compaction Surprises&lt;br&gt;
Kafka supports log compaction on certain topics — retaining only the latest message for each key. This is useful for event sourcing and change data capture.&lt;br&gt;
The surprise: if you’re consuming a compacted topic and your consumer falls behind, some of the events you missed may have been compacted away. You’ll never see intermediate states.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before compaction:
[key:user1, value:name=Alice] [key:user1, value:name=Bob] [key:user1, value:name=Carol]

After compaction:
[key:user1, value:name=Carol]
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
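&lt;p&gt;Compaction is easy to model: keep only the latest value per key. The toy sketch below (plain Java, illustrative only — not the broker’s actual cleaner logic) shows what a lagging consumer is left with after intermediate updates are compacted away:&lt;/p&gt;

```java
import java.util.LinkedHashMap;

// Toy model of log compaction: Kafka retains only the latest record
// per key, so a lagging consumer never sees intermediate updates.
public class CompactionDemo {
    // Compacts "key=value" records down to the latest value per key.
    static String[] compact(String[] records) {
        LinkedHashMap latest = new LinkedHashMap();
        for (String r : records) {
            String[] kv = r.split("=", 2);
            latest.put(kv[0], kv[1]); // later values overwrite earlier ones
        }
        String[] out = new String[latest.size()];
        int i = 0;
        for (Object k : latest.keySet()) {
            out[i++] = k + "=" + latest.get(k);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] log = {"user1=Alice", "user1=Bob", "user1=Carol", "user2=Dave"};
        // Only user1=Carol and user2=Dave survive; the Alice and Bob
        // updates are gone, exactly what a lagging consumer observes.
        for (String r : compact(log)) System.out.println(r);
    }
}
```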

&lt;p&gt;Your consumer that missed the first two events will only see “Carol.” If your system expected to process every name change — you’ve silently lost data.&lt;/p&gt;

&lt;p&gt;Partition Hot Spots&lt;br&gt;
Kafka distributes messages across partitions using a partitioning key. If your partitioning key has low cardinality — for example, partitioning order events by country in a primarily US-based app — one partition receives 90% of the traffic.&lt;br&gt;
The consumers reading that partition are overloaded. Others are idle. Horizontal scaling doesn’t help: within a consumer group, any consumers beyond the partition count simply sit idle.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Bad: low cardinality key
producer.send(new ProducerRecord&amp;lt;&amp;gt;("orders", order.getCountry(), order));

// Good: high cardinality key
producer.send(new ProducerRecord&amp;lt;&amp;gt;("orders", order.getOrderId(), order));
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
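&lt;p&gt;To see why key cardinality matters, here is a small simulation. It is illustrative only: Kafka’s default partitioner actually hashes keys with murmur2, not String.hashCode, but any deterministic hash-mod-N scheme shows the same skew:&lt;/p&gt;

```java
// Toy illustration of partition skew: a deterministic hash-mod-N scheme
// sends every record with the same key to the same partition, so a
// low-cardinality key concentrates traffic on one partition.
public class PartitionSkewDemo {
    static int partitionFor(String key, int numPartitions) {
        // Normalize a possibly negative hashCode into the range 0..numPartitions-1
        return (key.hashCode() % numPartitions + numPartitions) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 8;
        int[] byCountry = new int[numPartitions];
        int[] byOrderId = new int[numPartitions];

        for (int i = 0; i != 10_000; i++) {
            // 90% of orders share one country key: one hot partition
            String country = (i % 10 == 0) ? "CA" : "US";
            byCountry[partitionFor(country, numPartitions)]++;
            // order IDs are unique: load spreads across partitions
            byOrderId[partitionFor("order-" + i, numPartitions)]++;
        }

        System.out.println("by country:  " + java.util.Arrays.toString(byCountry));
        System.out.println("by order id: " + java.util.Arrays.toString(byOrderId));
    }
}
```

&lt;p&gt;The country-keyed histogram piles 9,000 of the 10,000 records onto a single partition; the order-ID-keyed histogram is roughly uniform.&lt;/p&gt;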

&lt;p&gt;The Poison Pill&lt;br&gt;
A single malformed message that causes your consumer to crash on every processing attempt. The consumer dies before the offset can be committed, so the offset never advances. Your consumer restarts, fetches the same message, crashes again. Infinite loop.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@KafkaListener(topics = "order-events")
public void handleOrderEvent(
        OrderEvent event,
        Acknowledgment acknowledgment) {
    try {
        orderService.processOrder(event);
        acknowledgment.acknowledge();
    } catch (PoisonPillException e) {
        // Dead letter queue for messages that can't be processed
        deadLetterProducer.send("order-events-dlq", event);
        acknowledgment.acknowledge(); // Move past the poison pill
        log.error("Moved poison pill to DLQ: {}", event.getEventId());
    } catch (RetryableException e) {
        // Don't acknowledge — allow retry
        throw e;
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Always implement a dead letter queue (DLQ) for messages that consistently fail processing. Without it, a single bad message can halt your entire consumer group.&lt;/p&gt;
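&lt;p&gt;The retry-then-park policy behind a DLQ can be captured in a few lines. This sketch uses hypothetical names (it is not a Spring Kafka API): each message gets a bounded number of attempts, and a message that keeps failing is parked so the offset can advance:&lt;/p&gt;

```java
import java.util.ArrayList;

// Sketch of a retry-then-dead-letter policy. Class and method names are
// hypothetical: a message gets a bounded number of processing attempts
// before being parked so the partition can keep moving.
public class DlqPolicy {
    static final int MAX_ATTEMPTS = 3;
    final ArrayList deadLetters = new ArrayList(); // parked messages

    interface Handler { void handle(String message) throws Exception; }

    // Returns true when the message was handled or parked,
    // i.e. when it is safe for the offset to advance.
    boolean process(String message, Handler handler) {
        for (int attempt = 1; attempt != MAX_ATTEMPTS + 1; attempt++) {
            try {
                handler.handle(message);
                return true; // success: safe to acknowledge
            } catch (Exception e) {
                // transient failure: retry until attempts are exhausted
            }
        }
        deadLetters.add(message); // poison pill: park it and move on
        return true;
    }
}
```

&lt;p&gt;A real implementation would publish the parked message to a DLQ topic instead of an in-memory list, so it can be inspected and replayed later.&lt;/p&gt;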

&lt;p&gt;The Three Questions You Must Answer Before Adding Kafka&lt;br&gt;
After our incident, we created a pre-Kafka checklist. Every team considering Kafka must answer these three questions before writing a single line of producer code.&lt;br&gt;
Question 1: Are Your Consumers Idempotent?&lt;br&gt;
Can the same event be processed twice without corrupting state?&lt;br&gt;
If you can’t answer yes with confidence — don’t add Kafka yet. Build idempotency first.&lt;br&gt;
Test this explicitly:&lt;/p&gt;

&lt;p&gt;&lt;a class="mentioned-user" href="https://dev.to/test"&gt;@test&lt;/a&gt;&lt;br&gt;
public void processingSameEventTwiceProducesSameResult() {&lt;br&gt;
    OrderEvent event = createTestOrderEvent();&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;orderConsumer.handle(event);
orderConsumer.handle(event); // Process twice

// State should be identical to processing once
Order order = orderRepository.findById(event.getOrderId());
assertEquals(1, orderRepository.countByUserId(event.getUserId()));
assertEquals(OrderStatus.PLACED, order.getStatus());
// Payment should only be captured once
assertEquals(1, paymentRepository.countByOrderId(event.getOrderId()));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;}&lt;/p&gt;

&lt;p&gt;Question 2: Is Your Offset Commit Strategy Deliberate?&lt;br&gt;
Have you explicitly chosen between at-least-once, at-most-once, and exactly-once?&lt;br&gt;
Have you disabled auto-commit and implemented manual acknowledgment?&lt;br&gt;
Have you tested what happens when your consumer crashes mid-batch?&lt;br&gt;
Question 3: Do You Have a Replay Strategy That Accounts for Current DB State?&lt;br&gt;
If you need to replay 3 days of events tomorrow, can you do it safely?&lt;br&gt;
Do you have the tooling to:&lt;br&gt;
    ∙ Check state compatibility before replay?&lt;br&gt;
    ∙ Replay to a shadow environment for validation?&lt;br&gt;
    ∙ Roll back if replay causes inconsistency?&lt;br&gt;
If you can’t answer yes to all three — you have a time bomb, not a pipeline.&lt;/p&gt;

&lt;p&gt;When Kafka IS the Right Answer&lt;br&gt;
After all of this, I want to be clear: Kafka is extraordinary for the right problems.&lt;br&gt;
Use Kafka when:&lt;br&gt;
    ∙ You need event replay. Building a new analytics service that needs to process 6 months of historical events? Kafka’s retention makes this trivial. A traditional queue can’t do this.&lt;br&gt;
    ∙ Multiple consumers need the same events. Order placed → update inventory, send email, update recommendations, charge payment. Each consumer group processes independently at their own pace.&lt;br&gt;
    ∙ You need high throughput with durability. Kafka handles millions of messages per second with persistence guarantees. Traditional queues struggle here.&lt;br&gt;
    ∙ You’re building event sourcing. Kafka’s log is a natural fit for storing the complete history of state changes.&lt;br&gt;
    ∙ You need decoupling between services. Producers don’t know about consumers. Services can be added without modifying existing code.&lt;br&gt;
Don’t use Kafka when:&lt;br&gt;
    ∙ You just need a simple task queue. Redis queues, RabbitMQ, or AWS SQS are simpler and sufficient.&lt;br&gt;
    ∙ Your team doesn’t understand distributed systems fundamentals. Kafka amplifies your architecture’s weaknesses.&lt;br&gt;
    ∙ You need simple request-response patterns. Kafka’s async nature adds latency and complexity for synchronous workflows.&lt;br&gt;
    ∙ You’re a startup with 100 users. Your feed latency problem is probably a missing database index, not a missing message broker.&lt;/p&gt;

&lt;p&gt;The Checklist We Wish We Had&lt;/p&gt;

&lt;p&gt;Pre-Kafka Production Checklist:&lt;/p&gt;

&lt;p&gt;Consumer Design&lt;br&gt;
□ Idempotency implemented and tested&lt;br&gt;
□ Auto-commit disabled&lt;br&gt;
□ Manual offset commit with acknowledgment pattern&lt;br&gt;
□ Dead letter queue configured&lt;br&gt;
□ Poison pill handling implemented&lt;br&gt;
□ Consumer lag alerting configured&lt;/p&gt;

&lt;p&gt;Operations&lt;br&gt;
□ Replay strategy documented&lt;br&gt;
□ State compatibility validation tooling built&lt;br&gt;
□ Shadow replay environment available&lt;br&gt;
□ Kafka cluster monitoring configured&lt;br&gt;
  □ Consumer lag per group/topic&lt;br&gt;
  □ Broker disk usage&lt;br&gt;
  □ Under-replicated partitions&lt;br&gt;
  □ Controller election rate&lt;/p&gt;

&lt;p&gt;Architecture&lt;br&gt;
□ Partition count matches consumer scaling requirements&lt;br&gt;
□ Partitioning key has high cardinality&lt;br&gt;
□ Retention period matches replay requirements&lt;br&gt;
□ Log compaction behavior understood for topic&lt;br&gt;
□ Rebalance frequency monitored&lt;/p&gt;

&lt;p&gt;Testing&lt;br&gt;
□ Consumer crash mid-batch tested&lt;br&gt;
□ Double-processing tested (idempotency verification)&lt;br&gt;
□ Consumer lag recovery tested&lt;br&gt;
□ Replay tested against production-like data volume&lt;/p&gt;

&lt;p&gt;What Distributed Systems Actually Teach You&lt;br&gt;
The 3 AM incident taught us something that no architecture talk had ever communicated clearly:&lt;br&gt;
Distributed systems don’t punish bad architecture immediately.&lt;br&gt;
They let you deploy. They let you celebrate the latency improvements. They let you present the graphs to product. They let you write the blog post about how you scaled.&lt;br&gt;
Then, weeks or months later, under exactly the right (wrong) conditions — a slow consumer, an unexpected traffic spike, a network partition — they collect.&lt;br&gt;
The bill is always paid at 3 AM.&lt;br&gt;
The engineers who truly understand distributed systems aren’t the ones who avoided incidents. They’re the ones who’ve been humbled by them, understood exactly why they happened, and built systems that fail gracefully instead of catastrophically.&lt;br&gt;
Kafka isn’t a silver bullet. It’s a distributed consistency nightmare dressed in a hoodie that says “low latency.”&lt;br&gt;
Respect it, or it will humble you.&lt;/p&gt;

&lt;p&gt;What’s Next&lt;br&gt;
Next post: Why SQL quietly beats NoSQL for 90% of startups — despite everything the hype machine told you. We’ll look at the actual data, the benchmark lies, and the cases where PostgreSQL outperforms MongoDB at scale.&lt;br&gt;
This one is going to make people uncomfortable.&lt;br&gt;
Follow to stay updated. 🔔&lt;/p&gt;

&lt;p&gt;If this saved you from a 3 AM incident, share it with your team. The best time to learn this lesson is before it’s your production database.&lt;/p&gt;

&lt;p&gt;Tags: Kafka, Distributed Systems, Backend Engineering, System Design, Software Architecture, Java, Event-Driven Architecture, Microservices, Interview Prep&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>softwaredevelopment</category>
      <category>career</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
