You tap ❤️ on Instagram.
A million other people do the same thing. Same post. Same second.
Nothing breaks. No lag. No error. The heart just turns red.
This post is about why that’s actually an incredibly hard engineering problem — and how it’s solved. In plain English.
The Obvious Solution That Doesn’t Work
Every developer’s first instinct:
UPDATE posts SET like_count = like_count + 1 WHERE post_id = 123;
This is clean, readable, and completely correct — for small scale.
The problem: at a million concurrent users, every single like competes for a row-level lock on that one database row.
Requests queue up. Latency spikes. The database CPU maxes out.
Eventually — it dies.
This is called the hot row problem. One row, too many writers, no way to parallelize.
So Instagram doesn’t do this. At all.
What They Actually Do — Three Core Ideas
Idea 1: The Sticky Note Board (Redis)
Instead of writing to the database on every like, Instagram writes to an in-memory store (Redis).
Redis supports an atomic INCR operation:
INCR likes:post:123
This is:
• Lock-free
• O(1) time complexity
• Capable of millions of operations per second
Every few seconds, a background worker counts up all the accumulated increments and writes them to the database in one batch write.
1,000,000 accumulated increments → 1 database write every 5 seconds
The reduction in database pressure equals the number of likes that pile up per flush window — at viral scale, thousands-fold or more. From a single design decision.
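Here is a minimal sketch of the buffer-and-flush pattern in Python. It is a toy stand-in, not Instagram's implementation: a locked dict plays the role of Redis's atomic INCR, and another dict plays the role of the durable database. The class and method names are made up for illustration.

```python
import threading

class LikeCounter:
    """Toy version of the Redis-buffer pattern: atomic in-memory
    increments, flushed to a 'database' in one batch per interval."""

    def __init__(self):
        self._lock = threading.Lock()   # stands in for Redis's atomicity
        self._pending = {}              # post_id -> increments since last flush
        self.database = {}              # stands in for the durable store

    def incr(self, post_id):
        # Equivalent of INCR likes:post:<id> -- O(1), no row lock
        with self._lock:
            self._pending[post_id] = self._pending.get(post_id, 0) + 1

    def flush(self):
        # One batch write absorbs every increment since the last flush
        with self._lock:
            pending, self._pending = self._pending, {}
        for post_id, delta in pending.items():
            self.database[post_id] = self.database.get(post_id, 0) + delta

counter = LikeCounter()
for _ in range(1_000):          # a thousand "users" tap like
    counter.incr("post:123")
counter.flush()                 # a single database write
print(counter.database["post:123"])   # 1000
```

In production a background worker would call something like flush() on a timer; here we call it by hand to show that a thousand increments become one write.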
Idea 2: The Mailbox (Kafka)
A like isn’t just a number increment. It triggers a chain of events:
• Push notification to the post owner
• Feed re-ranking for followers
• Analytics logging
• ML model signal
If all of this happened synchronously — inside your like request — the API would take seconds to respond.
So instead, the like gets dropped into a message queue (Kafka):
User taps like
→ API validates and writes to Redis
→ Publishes event to Kafka topic "like-events"
→ Returns 200 OK to client ← this happens in ~50ms
Meanwhile, asynchronously:
→ Notification service reads from Kafka → sends push
→ Feed service reads from Kafka → updates rankings
→ Analytics service reads from Kafka → logs the event
Kafka is, at its core, a distributed, append-only log. Events go in, ordered within each partition. Workers consume them at their own pace. Nothing is lost, even during traffic spikes.
The user gets instant feedback. The system catches up behind the scenes.
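The fast-path/slow-path split above can be simulated in a few lines of Python. To keep this runnable without a Kafka cluster, a queue.Queue stands in for the "like-events" topic and a plain thread stands in for the notification consumer; the function names are invented for this sketch.

```python
import queue
import threading

# Toy stand-in for a Kafka topic: the API publishes and returns
# immediately; a consumer thread processes events at its own pace.
like_events = queue.Queue()     # plays the role of topic "like-events"
notifications = []              # side effects produced downstream

def handle_like_request(user_id, post_id):
    # Fast path: validate, publish, return. No downstream work here.
    like_events.put({"user": user_id, "post": post_id})
    return 200                  # the client gets its answer immediately

def notification_worker():
    # Slow path: consume events asynchronously, like a consumer group
    while True:
        event = like_events.get()
        if event is None:       # shutdown sentinel for this demo
            break
        notifications.append(f"push: {event['user']} liked {event['post']}")

worker = threading.Thread(target=notification_worker)
worker.start()

status = handle_like_request("user_456", "post_123")
like_events.put(None)           # tell the demo worker to stop
worker.join()
print(status, notifications)
```

The point is the decoupling: handle_like_request never blocks on the notification work, which is exactly why the real API can return in ~50ms while the heavy lifting happens behind the queue.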
Idea 3: The Optimistic UI Update
Here’s the part most people don’t realize.
When you tap like — your phone doesn’t wait for the server.
The heart turns red immediately. The count goes up immediately. All of this happens locally, on your device, before any network response arrives.
fun onLikeTapped(postId: String) {
    // Instant UI update — before server responds
    post.isLiked = true
    post.likeCount += 1
    updateUI(post)

    // API call happens in background
    viewModelScope.launch {
        val result = repository.likePost(postId)
        if (result is Result.Error) {
            // Quietly roll back if it failed
            post.isLiked = false
            post.likeCount -= 1
            updateUI(post)
        }
    }
}
This is called an Optimistic UI Update.
The client optimistically assumes the server will succeed — and only corrects itself if it doesn’t.
99% of the time, the user never sees the failure path.
This single pattern is a big part of why Instagram, Twitter, and YouTube feel so instant compared to apps that wait for server confirmation before updating the UI.
The Data Structures Running This System
Here’s what made this really click for me. The DSA concepts you study for interviews are literally running these systems in production.
HashSet → Deduplication
How does Instagram prevent you from liking the same post twice?
Redis: SADD liked_users:post:123 user_456
Returns 1 → new like, proceed
Returns 0 → already liked, reject
Under the hood: a hash table. O(1) average lookup. It doesn’t matter if 10 people or 50 million people liked that post — the check is equally fast.
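The SADD semantics are easy to mirror with a plain Python set, which is also hash-table-backed. This is a toy stand-in (the class and method names are invented), but the O(1) membership check is the same idea:

```python
class LikeDeduper:
    """Toy version of SADD-based dedup: one hash set per post gives an
    O(1) membership check no matter how many users already liked it."""

    def __init__(self):
        self._liked_users = {}   # post_id -> set of user_ids

    def try_like(self, post_id, user_id):
        users = self._liked_users.setdefault(post_id, set())
        if user_id in users:     # SADD would return 0: already liked
            return False
        users.add(user_id)       # SADD would return 1: new like
        return True

deduper = LikeDeduper()
print(deduper.try_like("post:123", "user_456"))  # True  (first like counts)
print(deduper.try_like("post:123", "user_456"))  # False (duplicate rejected)
```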
Max Heap → Feed Ranking
Your Instagram feed isn’t sorted by total likes. It’s sorted by like velocity — likes per minute.
score = (likes_last_10_min * 0.6) + (recency * 0.3) + (relationship * 0.1)
MaxHeap of top K posts for your feed:
Insert: O(log K)
Extract max: O(log K)
Every feed refresh, millions of posts get scored and the top K get surfaced to you — using a heap.
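A common way to implement this (and the one sketched below) keeps a size-K min-heap rather than a max-heap over every post: the heap's minimum is the score a new post must beat, so each insert stays O(log K). The scores here are made up; the article's velocity formula would produce them in practice.

```python
import heapq

def top_k_posts(posts, k):
    """Keep a size-k min-heap of (score, post_id). Each insert is
    O(log k); the heap root is the bar a new post must clear."""
    heap = []
    for post_id, score in posts:
        if len(heap) < k:
            heapq.heappush(heap, (score, post_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, post_id))  # evict the weakest
    # Highest score first, feed-style
    return [post_id for score, post_id in sorted(heap, reverse=True)]

scored = [("a", 42.0), ("b", 7.5), ("c", 91.2), ("d", 18.3), ("e", 64.0)]
print(top_k_posts(scored, 3))  # ['c', 'e', 'a']
```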
LRU Cache → Cache Eviction
Not every post needs to stay in Redis forever.
Hot posts (just went viral) stay in cache. Cold posts (3 years old, no activity) get evicted when the cache fills up.
LRU Cache = HashMap + Doubly Linked List
get(): O(1)
put(): O(1)
Most-recently-used moves to the front. Least-recently-used gets evicted from the tail.
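In Python, collections.OrderedDict is essentially that HashMap-plus-doubly-linked-list combination, so a sketch of the eviction behavior fits in a few lines (class name and capacity are illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """HashMap + doubly linked list is exactly what OrderedDict gives
    us: O(1) get/put, least-recently-used entry evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict least recently used

cache = LRUCache(2)
cache.put("hot_post", 10_000)
cache.put("old_post", 120)
cache.get("hot_post")            # touching it makes it most recent
cache.put("new_post", 3)         # capacity exceeded: "old_post" evicted
print(cache.get("old_post"))     # None
print(cache.get("hot_post"))     # 10000
```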
Sliding Window → Rate Limiting
Instagram prevents bot abuse using rate limiting:
Rule: Max 100 likes per minute per user
On each like:
1. Remove events outside the last 60 seconds
2. If count >= 100 → reject (429)
3. Else → add current timestamp, proceed
This is the sliding window log algorithm. With a deque or ring buffer it runs in amortized O(1) per request, since each timestamp is appended and removed at most once.
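The three steps above translate almost directly into a deque-based limiter. This is a single-process sketch (a real deployment would keep the window in Redis per user); the class name and the tiny limit used in the demo are invented for illustration:

```python
from collections import deque

class SlidingWindowLimiter:
    """Sliding-window log: keep a timestamp per recent like and drop
    the ones that fall outside the window. Each timestamp is appended
    and popped at most once, so per-request cost is amortized O(1)."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self._events = deque()   # timestamps inside the current window

    def allow(self, now):
        # 1. Remove events outside the last `window` seconds
        while self._events and now - self._events[0] >= self.window:
            self._events.popleft()
        # 2. If the limit is already hit, reject (HTTP 429)
        if len(self._events) >= self.limit:
            return False
        # 3. Record this like and let it through
        self._events.append(now)
        return True

limiter = SlidingWindowLimiter(limit=3, window_seconds=60)
print([limiter.allow(t) for t in (0, 1, 2, 3)])  # [True, True, True, False]
print(limiter.allow(61))                         # True: old events expired
```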
The Three-Layer Architecture
┌─────────────────────────────────┐
│ SPEED LAYER (Redis) │
│ In-memory, O(1) ops, ~50ms │
│ What users interact with │
└────────────────┬────────────────┘
│
┌────────────────▼────────────────┐
│ BUFFER LAYER (Kafka) │
│ Absorbs spikes, decouples │
│ services, guarantees delivery │
└────────────────┬────────────────┘
│
┌────────────────▼────────────────┐
│ TRUTH LAYER (Database) │
│ Batch-updated, eventual sync │
│ Never under direct user load │
└─────────────────────────────────┘
Users touch the Speed Layer.
The Database sits in the Truth Layer.
They never meet directly.
Real-World Edge Cases
Duplicate Requests
Mobile networks are unreliable. A like request can be sent twice on timeout + retry.
Solution: Idempotency keys
POST /api/posts/123/like
Header: X-Idempotency-Key: <uuid-generated-on-client>
Server: if key seen before → return cached result, skip processing
Same result, no double-increment.
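A minimal sketch of the server side of this, assuming an in-process dict as the key store (a real service would keep seen keys in Redis with a TTL; the class and field names here are invented):

```python
class IdempotentLikeAPI:
    """Toy idempotency layer: the first request with a given key does
    the work; retries with the same key get the cached response back."""

    def __init__(self):
        self._seen = {}          # idempotency_key -> cached response
        self.like_count = 0

    def like(self, idempotency_key, post_id):
        if idempotency_key in self._seen:
            # Replay: return the stored result, skip all processing
            return self._seen[idempotency_key]
        self.like_count += 1     # the actual side effect, done once
        response = {"status": 200, "likes": self.like_count}
        self._seen[idempotency_key] = response
        return response

api = IdempotentLikeAPI()
first = api.like("uuid-abc-123", "post:123")
retry = api.like("uuid-abc-123", "post:123")   # timeout + retry, same key
print(first == retry, api.like_count)          # True 1
```

The retry returns a byte-identical response and the counter moves exactly once, which is the whole contract of an idempotency key.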
What If Redis Goes Down?
The API tier falls back to writing directly to Kafka with a flag indicating Redis was bypassed. Consumers handle the dedup and count reconciliation. Circuit breakers prevent cascading failures.
Resilience is designed in, not bolted on.
The Bigger Takeaway
The like button is a solved problem. But the pattern behind it isn’t specific to likes.
At scale, the answer is almost never “do the thing immediately.”
It’s always:
1. Do the fast, approximate version now
2. Queue the real work
3. Show the user the expected result
4. Reconcile in the background
Instagram, YouTube, Swiggy, PhonePe, Razorpay — every high-scale system is some variation of this pattern.
Understanding this one system is a legitimate unlock for thinking about distributed systems in general.
Further Reading
• Designing Data-Intensive Applications — Martin Kleppmann
• Redis documentation on atomic counters
• Kafka documentation on consumer groups
• Google SRE Book — Chapter on handling overload
If you’re studying system design for interviews, bookmark this. If something was unclear or you want me to go deeper on any section — drop a comment. Happy to expand.