Ajinkya Singh

⚡ How Kafka Stores Billions of Messages: The Storage Architecture Nobody Explains

🎯 Introduction: Beyond the Message Queue

Imagine you're running a global streaming platform processing millions of viewer events every second: play, pause, skip, like, and watch history. How do you store this tsunami of data efficiently while keeping it instantly accessible? This is where Kafka's storage architecture becomes a masterpiece of engineering.


πŸ—οΈ The Three Core Responsibilities

Every Kafka broker juggles three critical tasks simultaneously:

1. 📥 Producer Gateway

Accepts incoming streams of events from applications across your network

2. 💾 Storage Engine

Writes messages to disk durably and efficiently; this is where the magic happens

3. 📤 Consumer Server

Rapidly locates and delivers data to consumers while replicating to other brokers


📊 The Storage Hierarchy

Topic: "viewer-activity"
│
├── Partition 0 (folder: /data/viewer-activity-0/)
│   ├── 00000000000000000000.log
│   ├── 00000000000000000000.index
│   ├── 00000000000000000000.timeindex
│   ├── 00000000000000850000.log
│   ├── 00000000000000850000.index
│   ├── 00000000000000850000.timeindex
│   └── 00000000000001700000.log (active segment)
│
├── Partition 1 (folder: /data/viewer-activity-1/)
│   └── [similar segment files...]
│
└── Partition 2 (folder: /data/viewer-activity-2/)
    └── [similar segment files...]

🧩 Why Segments? The Problem with Giant Files

Imagine storing all viewer events in one massive file:

❌ A 15TB file is impossible to manage

❌ Deleting old watch history requires rewriting everything

❌ Finding "what did this user watch at 3 PM yesterday?" means scanning terabytes

❌ File corruption destroys all your data

The Segment Solution:

Kafka divides each partition's log into segments: smaller, manageable chunks (typically 1GB each).


🔄 When Does Kafka Create a New Segment?

A new segment rolls when either condition is met:

Condition 1: Size Threshold

Current segment reaches 1GB (default)
→ Close it, start fresh segment

Condition 2: Time Threshold

7 days have passed (default)
→ Roll to new segment regardless of size

📅 Real-World Timeline Example:

Monday 9:00 AM  → Users start watching shows
                → Messages written to segment 0

Monday 2:30 PM  → Segment 0 hits 1GB
                → Kafka creates segment 1
                → New events go to segment 1

Tuesday 3:00 PM → Segment 1 hits 1GB
                → Kafka creates segment 2 (now active)
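
To make the roll rule concrete, here is a tiny Python sketch of the decision the broker effectively makes for its active segment. The helper and its arguments are illustrative; the real knobs are the broker/topic settings log.segment.bytes (default 1GB) and log.roll.hours (default 168 hours).

ONE_GB = 1_073_741_824                       # log.segment.bytes default
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000      # log.roll.hours default (168h)

def should_roll(active_bytes: int, active_age_ms: int,
                max_bytes: int = ONE_GB, max_age_ms: int = SEVEN_DAYS_MS) -> bool:
    # A new segment starts when EITHER threshold is crossed
    return active_bytes >= max_bytes or active_age_ms >= max_age_ms

# A 1.2GB segment that is only 5 hours old still rolls (size threshold wins)
print(should_roll(1_200_000_000, 5 * 60 * 60 * 1000))   # True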

🗂️ Segment File Anatomy

For each segment, Kafka maintains three files:

Partition Folder: viewer-activity-0/
│
├── 00000000000000850000.log        (1GB - actual messages)
├── 00000000000000850000.index      (10MB - offset lookup map)
└── 00000000000000850000.timeindex  (10MB - timestamp lookup map)

File Naming Convention

The number 00000000000000850000 is the base offset: the first message's offset in this segment.
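
Since the name is just the base offset zero-padded to 20 digits, you can derive a segment's three file names with a couple of lines of Python (a hypothetical helper, shown only to illustrate the convention):

def segment_file_names(base_offset: int):
    # Kafka pads the base offset to 20 digits: 850000 -> "00000000000000850000"
    stem = f"{base_offset:020d}"
    return f"{stem}.log", f"{stem}.index", f"{stem}.timeindex"

print(segment_file_names(850_000))
# ('00000000000000850000.log', '00000000000000850000.index', '00000000000000850000.timeindex')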


⚡ Index Files: Kafka's Speed Secret

🎯 The Two-Step Lookup Process

When Kafka needs to find a message, it uses TWO files in sequence:

  1. First: Check the segment file name (base offset) to find the RIGHT segment
  2. Second: Use the .index file to find the EXACT location within that segment

Let's see this in action!


πŸ” 1. Offset Index (.index)

Maps message offset β†’ byte position in the log file

How It Works:

Consumer Request: "Give me messages starting from offset 850,125"

Step 1: FIND THE RIGHT SEGMENT (using base offsets)
        Available segments:
        • 00000000000000000000.log (base: 0)
        • 00000000000000850000.log (base: 850,000) ← THIS ONE!
        • 00000000000001700000.log (base: 1,700,000)

        Offset 850,125 falls between 850,000 and 1,700,000
        → Select segment: 00000000000000850000.log

Step 2: FIND EXACT POSITION (using .index file)
        Broker loads 00000000000000850000.index into memory
        Binary search finds:
        • offset 850,100 → byte 3072
        • offset 850,150 → byte 6144

        Message 850,125 must be between bytes 3072-6144

Step 3: READ THE MESSAGE
        Read from byte 3072 in 00000000000000850000.log
        Scan forward until offset 850,125 is found

Index Structure:

┌──────────┬─────────────┐
│  Offset  │  Byte Pos   │
├──────────┼─────────────┤
│  850000  │      0      │
│  850100  │   3072      │
│  850150  │   6144      │
│  850200  │   9216      │
│   ...    │    ...      │
└──────────┴─────────────┘

Result: ✨ No scanning millions of messages; instant lookup!
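
Here is a minimal Python simulation of that two-step lookup, assuming the sorted base offsets and the sparse in-memory index are already available as plain lists; it mirrors the idea, not the broker's actual implementation:

import bisect

# Step 1 data: base offsets parsed from the segment file names
base_offsets = [0, 850_000, 1_700_000]

# Step 2 data: sparse offset index of segment 850000 (offset -> byte position)
offset_index = [(850_000, 0), (850_100, 3072), (850_150, 6144), (850_200, 9216)]

def locate(target_offset: int):
    # Step 1: the right segment is the one with the largest base offset <= target
    segment = base_offsets[bisect.bisect_right(base_offsets, target_offset) - 1]
    # Step 2: binary search for the largest indexed offset <= target
    indexed = [o for o, _ in offset_index]
    byte_pos = offset_index[bisect.bisect_right(indexed, target_offset) - 1][1]
    return segment, byte_pos     # then scan the .log forward from byte_pos

print(locate(850_125))   # (850000, 3072): open ...850000.log and read from byte 3072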


📋 Visual: The Two-File Lookup System

🎯 FINDING MESSAGE BY OFFSET (850,125)

File System View:
├── 00000000000000000000.log      (offsets: 0 - 849,999)
├── 00000000000000000000.index
├── 00000000000000850000.log      (offsets: 850,000 - 1,699,999) ⭐
├── 00000000000000850000.index    ⭐ USE THIS!
├── 00000000000001700000.log      (offsets: 1,700,000+)
└── 00000000000001700000.index

Step 1: Match offset to segment name (base offset)
        850,125 is >= 850,000 and < 1,700,000
        → Open segment: 00000000000000850000

Step 2: Use that segment's .index file
        → Open: 00000000000000850000.index
        → Find byte position: 3072
        → Read from .log file at byte 3072

🔑 Key Insight: Why Two Files?

Without the base offset in the segment file name:

  • ❌ Would need to check EVERY index file
  • ❌ Open thousands of files to find the right one
  • ❌ Very slow!

With the base offset in the segment file name:

  • ✅ The filename tells you the offset range instantly
  • ✅ Only open ONE .index file
  • ✅ Lightning fast!

The Two-Step Magic:

  1. Segment filename (base offset) = Coarse filter (which file?)
  2. Index file (.index) = Fine filter (which byte position?)

⏰ 2. Time Index (.timeindex)

Maps timestamp → corresponding offset (also uses two-step lookup!)

How It Works:

Consumer Request: "Show me all viewer activity from the last hour"
                  (timestamp: 2025-11-25 13:00:00)

Step 1: FIND THE RIGHT SEGMENT (using .timeindex files)
        Check all segments' time ranges:
        • Segment 0: timestamps 10:00 - 11:59
        • Segment 850000: timestamps 12:00 - 13:59 ← THIS ONE!
        • Segment 1700000: timestamps 14:00 onwards

        Timestamp 13:00:00 falls in segment 850000

Step 2: FIND THE OFFSET (using .timeindex)
        Broker checks 00000000000000850000.timeindex
        Binary search finds:
        2025-11-25 13:00:00 → offset 850,750

Step 3: USE OFFSET INDEX (now it's a regular offset lookup!)
        Uses 00000000000000850000.index
        Finds: offset 850,750 → byte position 225,280

Step 4: READ THE MESSAGES
        Reads from byte 225,280 in .log file
        Returns all messages from that point forward

Time Index Structure:

┌──────────────────────┬──────────┐
│     Timestamp        │  Offset  │
├──────────────────────┼──────────┤
│ 2025-11-25 12:00:00  │  850000  │
│ 2025-11-25 12:30:00  │  850300  │
│ 2025-11-25 13:00:00  │  850750  │
│ 2025-11-25 13:45:00  │  851000  │
│        ...           │   ...    │
└──────────────────────┴──────────┘
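
The same simulation can be extended to the time index: first resolve the timestamp to an offset, then fall back to the regular offset lookup. The in-memory structures below are made up for illustration (the epoch-millisecond timestamps correspond to the UTC wall-clock times in the comments):

import bisect

# Sparse time index of segment 850000: (timestamp in ms, offset)
time_index = [
    (1764072000000, 850_000),   # 2025-11-25 12:00:00
    (1764073800000, 850_300),   # 2025-11-25 12:30:00
    (1764075600000, 850_750),   # 2025-11-25 13:00:00
]
# Offset index of the same segment: (offset, byte position)
offset_index = [(850_000, 0), (850_700, 220_160), (850_750, 225_280)]

def offset_for_timestamp(target_ms: int) -> int:
    # First indexed entry at or after the requested timestamp
    stamps = [t for t, _ in time_index]
    return time_index[bisect.bisect_left(stamps, target_ms)][1]

def byte_position(offset: int) -> int:
    offsets = [o for o, _ in offset_index]
    return offset_index[bisect.bisect_right(offsets, offset) - 1][1]

start = offset_for_timestamp(1764075600000)   # 13:00:00 query -> offset 850750
print(start, byte_position(start))            # 850750 225280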

🧠 The Memory Advantage

Why are index lookups so blazing fast?

Disk Storage (Slow):
├── segment.log (1GB - rarely loaded fully)
├── segment.index (10MB - loaded into RAM!)
└── segment.timeindex (10MB - loaded into RAM!)

Memory (Lightning Fast):
└── Index files cached → Microsecond lookups

Because index files are tiny (10-20MB), they fit entirely in memory:

  • ✅ No disk reads for lookups
  • ✅ Binary search through in-memory structures
  • ✅ Microseconds instead of seconds
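
To make the "index in RAM" point concrete, the sketch below memory-maps an offset index file and binary-searches it. It assumes the index is a flat array of 8-byte entries (a 4-byte offset relative to the segment's base offset followed by a 4-byte file position, big-endian); treat that layout and the example path as assumptions for illustration, not a spec:

import mmap
import struct

ENTRY = struct.Struct(">ii")   # assumed: relative offset (int32) + byte position (int32)

def lookup(index_path: str, base_offset: int, target_offset: int) -> int:
    """Byte position of the last indexed entry at or before target_offset."""
    rel_target = target_offset - base_offset
    with open(index_path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        lo, hi, best = 0, len(mm) // ENTRY.size - 1, 0
        while lo <= hi:                      # binary search over fixed-size entries
            mid = (lo + hi) // 2
            rel, pos = ENTRY.unpack_from(mm, mid * ENTRY.size)
            if rel <= rel_target:
                best, lo = pos, mid + 1
            else:
                hi = mid - 1
        return best

# lookup("/data/viewer-activity-0/00000000000000850000.index", 850_000, 850_125)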

πŸ—‘οΈ Retention Policies: Managing Disk Space

Without cleanup, your disks would eventually overflow. Kafka automatically deletes old segments based on policies.

1. Time-Based Retention

Configuration:
log.retention.hours = 168  (7 days)

Timeline:
Day 1  → Segment 0 created (Nov 18 viewer data)
Day 2  → Segment 1 created (Nov 19 viewer data)
...
Day 8  → Segment 0 is now 7 days old
       → Kafka deletes:
         - segment-0.log
         - segment-0.index
         - segment-0.timeindex

Visual:

Before Cleanup (Day 8):
[Seg0: Nov 18] [Seg1: Nov 19] [Seg2: Nov 20] ... [Seg7: Nov 25]
      ↓ 7 days old - DELETE

After Cleanup:
[Seg1: Nov 19] [Seg2: Nov 20] ... [Seg7: Nov 25]

2. Size-Based Retention

Configuration:
log.retention.bytes = 107374182400  (100GB per partition)

Current State:
Partition total size = 98GB ✓ OK

New Segment Added:
Partition total size = 103GB ✗ EXCEEDS LIMIT

Action:
→ Kafka deletes the oldest segments until the total is back under the limit
→ New total = 100GB ✓

3. Combined Policies

You can use both simultaneously; Kafka deletes when either condition is met:

log.retention.hours = 168     (7 days)
log.retention.bytes = 100GB

Segment deleted if:
✓ Age > 7 days  OR
✓ Total size > 100GB
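
Conceptually, the cleanup pass looks something like the sketch below: walk segments oldest-first and drop whole segments while either limit is violated. Everything here (the helper, using file mtime as the segment's age, skipping the newest segment) is a simplification for illustration, not Kafka's internal logic:

import os
import time

RETENTION_MS = 168 * 60 * 60 * 1000       # log.retention.hours = 168
RETENTION_BYTES = 100 * 1024 ** 3         # log.retention.bytes = 100GB

def cleanup(partition_dir: str) -> None:
    # Oldest first: the base offsets in the file names sort chronologically
    logs = sorted(f for f in os.listdir(partition_dir) if f.endswith(".log"))
    total = sum(os.path.getsize(os.path.join(partition_dir, f)) for f in logs)
    now_ms = time.time() * 1000

    for name in logs[:-1]:                 # never delete the active (newest) segment
        path = os.path.join(partition_dir, name)
        too_old = now_ms - os.path.getmtime(path) * 1000 > RETENTION_MS
        too_big = total > RETENTION_BYTES
        if not (too_old or too_big):
            break
        total -= os.path.getsize(path)
        for ext in (".log", ".index", ".timeindex"):   # drop the whole segment at once
            os.remove(path[:-len(".log")] + ext)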

💨 Why Deletion is Lightning Fast

Segment-Level Operations

Kafka never deletes individual messages. It operates on entire segment files.

❌ Inefficient (What Kafka DOESN'T do):
Read file → Skip old messages → Write remaining → Rebuild indexes
(Hours of processing)

✅ Efficient (What Kafka DOES):
Delete segment-000.log
Delete segment-000.index
Delete segment-000.timeindex
(Milliseconds)

Benefits:

  • 🚀 One filesystem operation
  • 🚀 No data rewriting
  • 🚀 No index rebuilding
  • 🚀 Extremely low CPU usage

🎬 Real-World Message Journey

Let's trace a viewer event through the entire system:

Step 1: Producer Sends Event

Producer → Kafka Broker

Message: {
  user_id: "viewer-789",
  action: "started_watching",
  show_id: "breaking-code-s1e3",
  timestamp: 2025-11-25 14:45:00
}

Destination:
Topic: viewer-activity
Partition: 2 (based on user_id hash)
Offset: 1,700,082 (next available)

Step 2: Write to Active Segment

Active Segment: 00000000000001700000.log
Position: Append at byte 245,760

Step 3: Update Indexes

Offset Index:
1,700,082 → byte 245,760

Time Index:
2025-11-25 14:45:00 → offset 1,700,082

Step 4: Acknowledge Producer

Broker → Producer:
"✓ Message written at offset 1,700,082"
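
If you want to drive Steps 1-4 from the client side, a minimal producer using the kafka-python library would look roughly like this (broker address, topic, and payload are the ones from the example; the library choice is just one option):

import json
from kafka import KafkaProducer            # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",                            # wait for the broker to confirm the write
)

event = {"user_id": "viewer-789", "action": "started_watching",
         "show_id": "breaking-code-s1e3"}

# Keying by user_id keeps all of this viewer's events in the same partition
future = producer.send("viewer-activity", key="viewer-789", value=event)
meta = future.get(timeout=10)              # blocks until the broker acknowledges
print(f"written to partition {meta.partition} at offset {meta.offset}")
producer.flush()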

Step 5: Consumer Reads Message

Consumer Request: "Give me messages from offset 1,700,082"

Kafka Process:
1. Which segment? → 00000000000001700000.log
2. Load offset index into memory
3. Find: offset 1,700,082 → byte 245,760
4. Read from disk starting at byte 245,760
5. Return message to consumer
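
And the consumer side of Step 5, again sketched with kafka-python, seeking straight to that offset (partition number and offset are taken from the walkthrough; error handling omitted):

from kafka import KafkaConsumer, TopicPartition   # pip install kafka-python

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
tp = TopicPartition("viewer-activity", 2)

consumer.assign([tp])           # manual assignment instead of a consumer group
consumer.seek(tp, 1_700_082)    # the broker resolves this via segment name + .index

for record in consumer:
    print(record.offset, record.value)
    break                       # just the first message for the demo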

🧹 Log Compaction: Latest State Only

The Use Case

Scenario: Storing user profile updates

User ID: viewer-789

Messages Over Time:
1. viewer-789 → {email: "old@email.com", plan: "basic"}
2. viewer-789 → {email: "new@email.com", plan: "basic"}
3. viewer-789 → {email: "new@email.com", plan: "premium"}

Question: Do we need messages 1 and 2?
Answer: NO! Only the latest state matters.

How Compaction Works

Configuration:
log.cleanup.policy = compact

Result:
Kafka guarantees: For each key, retain at least the last known value

Background Process:
1. Log cleaner reads old segments
2. Identifies duplicate keys
3. Keeps only latest message per key
4. Creates cleaned segments
5. Replaces old segments

💀 Tombstones: Deleting Keys

To remove a key entirely:

Send message: {key: "viewer-789", value: null}

Effect:
→ The log cleaner removes the key, including its last valid value
→ This happens after a configurable grace period (delete.retention.ms)
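
A toy, Kafka-free simulation of what the log cleaner effectively computes, including tombstone handling, to make the "latest value per key" guarantee concrete:

# Each entry is (key, value); a value of None is a tombstone
log = [
    ("viewer-789", {"email": "old@email.com", "plan": "basic"}),
    ("viewer-789", {"email": "new@email.com", "plan": "basic"}),
    ("viewer-123", {"email": "a@email.com", "plan": "basic"}),
    ("viewer-789", {"email": "new@email.com", "plan": "premium"}),
    ("viewer-123", None),                  # tombstone: delete viewer-123
]

def compact(entries):
    latest = {}
    for key, value in entries:             # later entries overwrite earlier ones
        latest[key] = value
    # After the grace period, tombstoned keys disappear entirely
    return {k: v for k, v in latest.items() if v is not None}

print(compact(log))
# {'viewer-789': {'email': 'new@email.com', 'plan': 'premium'}}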

Compaction vs. Retention Comparison

Aspect          | Retention (Delete)                | Compaction
----------------|-----------------------------------|---------------------------------
What's Kept     | Time/size window of all messages  | Latest value per key
What's Deleted  | Old segments entirely             | Old values for the same key
Use Case        | Time-series events, logs          | User profiles, state management
Configuration   | log.cleanup.policy=delete         | log.cleanup.policy=compact
History         | Limited time/size window          | No history, only latest

🚀 Performance Benefits Summary

Why Kafka is Blazing Fast

  1. Sequential Writes → Always append to end (disk-friendly)
  2. Memory-Mapped Indexes → Entire index files in RAM
  3. Zero-Copy Transfer → Direct disk → network socket
  4. Batch Processing → Multiple messages written together
  5. Segment-Level Operations → No expensive per-message work

📊 Real-World Numbers

A single Kafka broker can handle:

  • ✅ Millions of messages per second
  • ✅ Hundreds of MB/s throughput
  • ✅ Sub-millisecond latencies
  • ✅ All while maintaining durability
