Prometheus is a time series monitoring system built for metrics and alerting. For people curious about how Prometheus works under the hood, this post will guide you through how Prometheus scrapes data, accumulates it in memory, and eventually persists it to disk.
The Data Flow: From Metrics to Disk
┌──────────┐
│ Scrape │ (Pull metrics from targets)
└────┬─────┘
│
↓
┌──────────┐
│ Parse │ (Parse response, create series)
└────┬─────┘
│
↓
┌──────────────┐
│ Batch │ (Temporary, request-scoped buffer)
└────┬─────────┘
│
├──→ [WAL] (Write to durability log)
│
↓
┌─────────────────────┐
│ Head (In-Memory) │ (Mutable chunks, queryable)
│ 2-hour window │
└────┬────────────────┘
│ (after 2 hours)
↓
┌──────────────────────┐
│ Block (On Disk) │ (Immutable, compressed)
│ - metadata │
│ - index │
│ - chunks (512MB) │
└──────────────────────┘
Note: This entire cycle (scrape → parse → batch → WAL → Head) repeats every 15 seconds by default (configurable via the scrape_interval setting). This means the Head continuously accumulates samples, with 240 scrapes per hour appending to each series' chunks.
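At heart, that cycle is just a timer driving an HTTP pull. The following is a minimal, self-contained Go sketch of such a loop, not Prometheus's actual scrape manager; the target URL and hard-coded 15-second interval stand in for what service discovery and scrape_interval provide in a real setup.

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// scrapeOnce pulls the text exposition format from a target's /metrics endpoint.
func scrapeOnce(client *http.Client, target string) (string, error) {
	resp, err := client.Get(target + "/metrics")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	ticker := time.NewTicker(15 * time.Second) // mirrors the default scrape_interval
	defer ticker.Stop()

	for range ticker.C {
		metrics, err := scrapeOnce(client, "http://localhost:9100") // hypothetical target
		if err != nil {
			fmt.Println("scrape failed:", err)
			continue
		}
		fmt.Printf("scraped %d bytes\n", len(metrics))
	}
}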
Scraping & Parsing
Prometheus queries the /metrics endpoint of target systems, receiving responses in a text-based format. Here's an example:
http_requests_total{job="api", instance="server1"} 100
http_requests_total{job="api", instance="server2"} 200
http_requests_total{job="web", instance="server1"} 50
Prometheus parses this response line by line. Each unique combination of metric name + label key-value pairs becomes a series. In this example, we have one metric (http_requests_total) but three different series because the labels differ.
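As a rough illustration of series identity, here is a small Go sketch (plain Go, not Prometheus's internal labels package) that builds a canonical key from a metric name plus sorted label pairs; two samples belong to the same series exactly when their keys match.

package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds a canonical identity for a series from its metric name and
// labels: the same name + label set always maps to the same series.
func seriesKey(name string, lbls map[string]string) string {
	keys := make([]string, 0, len(lbls))
	for k := range lbls {
		keys = append(keys, k)
	}
	sort.Strings(keys) // order-independent identity

	var b strings.Builder
	b.WriteString(name)
	for _, k := range keys {
		fmt.Fprintf(&b, `,%s=%q`, k, lbls[k])
	}
	return b.String()
}

func main() {
	a := seriesKey("http_requests_total", map[string]string{"job": "api", "instance": "server1"})
	b := seriesKey("http_requests_total", map[string]string{"job": "api", "instance": "server2"})
	fmt.Println(a != b) // true: different labels, different series
}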
This series-based organization allows Prometheus to track and query metrics independently by their labels. Now that we have series, they're temporarily collected into a batch before being processed further.
The Batch
The batch is a temporary buffer that holds all the samples from a single scrape cycle. Once the scrape completes, Commit() is called on the batch, triggering two operations: first, the data is written to a Write-Ahead Log (WAL) for durability, then it's appended to the in-memory Head for querying.
Batching is more efficient than writing samples individually: all samples are written together in one operation, reducing I/O overhead. It's also atomic: either all samples from a batch succeed or none do, ensuring consistency.
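The sketch below shows this append-then-commit pattern in a simplified form. It is a stand-in for Prometheus's real appender interface (which also handles series references, exemplars, and histograms); the point it illustrates is the ordering: the WAL write happens first, and the Head is only updated if it succeeds.

package main

import "fmt"

// sample is one scraped value for one series at one timestamp.
type sample struct {
	seriesKey string
	ts        int64
	value     float64
}

// batchAppender buffers samples from a single scrape and flushes them atomically.
type batchAppender struct {
	pending []sample
	wal     func([]sample) error // write samples to the WAL
	head    func([]sample)       // apply samples to the in-memory Head
}

func (a *batchAppender) Append(key string, ts int64, v float64) {
	a.pending = append(a.pending, sample{key, ts, v})
}

// Commit writes the whole batch to the WAL first, then to the Head.
// If the WAL write fails, nothing reaches the Head.
func (a *batchAppender) Commit() error {
	if err := a.wal(a.pending); err != nil {
		return err
	}
	a.head(a.pending)
	a.pending = a.pending[:0]
	return nil
}

func main() {
	app := &batchAppender{
		wal:  func(s []sample) error { fmt.Println("WAL:", len(s), "samples"); return nil },
		head: func(s []sample) { fmt.Println("Head:", len(s), "samples") },
	}
	app.Append(`http_requests_total{job="api"}`, 1700000000000, 100)
	app.Append(`http_requests_total{job="web"}`, 1700000000000, 50)
	_ = app.Commit()
}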
The Write-Ahead Log (WAL)
Before any data is stored in memory or on disk, Prometheus writes it to a Write-Ahead Log (WAL) file. Think of the WAL as a journal: it records encoded data structures (called records) sequentially on disk. These records include:
- Series records (metric names and labels)
- Sample records (sample values and timestamps)
- Exemplar and metadata records (exemplars and metric metadata)
Why is the WAL critical?
If Prometheus crashes (power outage, Out of Memory (OOM) kill, restart), data that lives only in memory is lost. The WAL acts as a safety net: when Prometheus restarts, it replays the WAL records to rebuild the in-memory Head, recovering any samples that had not yet been written to a block. This is what provides durability: anything that reached the WAL survives a crash.
The WAL typically uses segments of ~128MB each (configurable). These files live on disk and can be replayed when needed. Once records have been written to the WAL successfully, it is safe to update the in-memory Head: even if the system crashes a second later, that data can be recovered.
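A toy version of that write-ahead discipline is sketched below: length-prefix each record, append it to a segment file, and fsync before any in-memory state is updated. Prometheus's real WAL additionally checksums and pages records and rotates segments; the file name and record format here are purely illustrative.

package main

import (
	"encoding/binary"
	"os"
)

// walWriter appends length-prefixed records to a segment file and fsyncs them.
type walWriter struct {
	f *os.File
}

func (w *walWriter) append(record []byte) error {
	var lenBuf [4]byte
	binary.BigEndian.PutUint32(lenBuf[:], uint32(len(record)))
	if _, err := w.f.Write(lenBuf[:]); err != nil {
		return err
	}
	if _, err := w.f.Write(record); err != nil {
		return err
	}
	// Only after the record is durable on disk is it safe to update the Head.
	return w.f.Sync()
}

func main() {
	f, err := os.OpenFile("segment-00000001", os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	w := &walWriter{f: f}
	_ = w.append([]byte(`sample http_requests_total{job="api"} 1700000000000 100`))
}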
The Head (In-Memory)
The Head is Prometheus's in-memory mutable storage. It contains multiple series, and each series tracks its own list of chunks: compressed, time-ordered collections of samples. New samples are constantly appended to existing series, growing the chunks until they fill up, at which point new chunks are created.
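A heavily simplified model of that structure is sketched below: a map from label set to series, where each series appends into its newest chunk and cuts a new one when it fills up. The fixed sample limit and plain slices are illustrative only; real head chunks are compressed and cut by time range as well as size.

package main

const maxSamplesPerChunk = 120 // illustrative limit, not Prometheus's exact rule

type chunk struct {
	timestamps []int64
	values     []float64
}

type memSeries struct {
	labels string // canonical label set, e.g. `http_requests_total{job="api"}`
	chunks []*chunk
}

func (s *memSeries) append(ts int64, v float64) {
	if len(s.chunks) == 0 || len(s.chunks[len(s.chunks)-1].timestamps) >= maxSamplesPerChunk {
		s.chunks = append(s.chunks, &chunk{}) // current chunk is full: cut a new one
	}
	c := s.chunks[len(s.chunks)-1]
	c.timestamps = append(c.timestamps, ts)
	c.values = append(c.values, v)
}

type head struct {
	series map[string]*memSeries
}

func (h *head) append(labels string, ts int64, v float64) {
	s, ok := h.series[labels]
	if !ok {
		s = &memSeries{labels: labels}
		h.series[labels] = s
	}
	s.append(ts, v)
}

func main() {
	h := &head{series: map[string]*memSeries{}}
	h.append(`http_requests_total{job="api"}`, 1700000000000, 100)
}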
The Head keeps data in memory for a configurable time window (default: 2 hours). This design has tradeoffs:
- Benefit: In-memory storage enables fast writes and queries. Answering questions like "What was the CPU usage 5 minutes ago?" is immediate since data is already in RAM.
- Cost: More unique series means more memory usage. This is why Prometheus can suffer from high memory consumption with high-cardinality metrics (thousands or millions of unique series due to many label combinations).
Why does cardinality matter? Each series stores its labels (metric name + key-value pairs), a unique ID, and its list of chunks. When you have many unique series, each one consumes memory for storing label strings and chunk metadata. For example, a single metric with 1,000 unique label combinations = 1,000 separate series, each with its own label data and chunk pointers. This is why high-cardinality metrics can quickly exhaust memory.
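The multiplication behind that growth is easy to underestimate. The snippet below works through a hypothetical metric with three labels; the value counts are made up, but the product is the number of distinct series the Head has to hold.

package main

import "fmt"

// Series count grows multiplicatively with label values. For a hypothetical
// metric labeled by method, status, and path, the number of distinct series is
// the product of the distinct values of each label.
func main() {
	methods, statuses, paths := 5, 10, 200 // assumed value counts, for illustration
	series := methods * statuses * paths
	fmt.Println(series) // 10000 series for a single metric name
}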
After 2 hours of data accumulation, the Head is sealed and flushed to disk as an immutable block. This move from memory to disk is critical for long-term storage and frees up memory for new incoming data.
Blocks (On Disk)
A block is a directory containing all the time-series data from a 2-hour window in a compressed, queryable format. Once sealed, blocks are immutable and stored on disk for long-term retention.
Block Structure:
- meta.json file: Contains metadata about the block (time range, number of series, etc.)
- index file: Maps metric names and labels to their corresponding series, enabling fast lookups during queries.
- chunks directory: Contains the actual time-series data. While each chunk belongs to a single series, multiple chunks from different series are packed together into segment files (512MB each by default) for efficient storage and retrieval.
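Concretely, a sealed block is a ULID-named directory on disk; a typical layout looks roughly like this (the directory name and segment numbers are illustrative):

01HV3ABCDEFGHJKMNPQRSTVWXY/
├── meta.json
├── index
└── chunks/
    ├── 000001
    └── 000002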
Why are blocks immutable?
Immutability is key to Prometheus's design. Once a block is written to disk, it never changes. This enables compaction: the process of merging multiple blocks together. For example, you might merge 30 blocks from 30 days into fewer, larger blocks, reducing disk usage and improving query performance. Immutability ensures compaction is safe: no concurrent writes can conflict with the merge operation.
This immutable-block architecture is one reason Prometheus is so efficient: old data is read-only, allowing aggressive optimization without worrying about concurrent modifications.
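As a toy illustration of why immutability makes compaction safe, the sketch below merges one series' samples from two blocks into a single time-ordered list; because neither input can change underneath it, the merge needs no coordination with writers. The sample values are invented.

package main

import "fmt"

// blockSample is one (timestamp, value) pair read from an immutable block.
type blockSample struct {
	ts    int64
	value float64
}

// mergeSorted combines two time-ordered sample lists into one.
func mergeSorted(a, b []blockSample) []blockSample {
	out := make([]blockSample, 0, len(a)+len(b))
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		if a[i].ts <= b[j].ts {
			out = append(out, a[i])
			i++
		} else {
			out = append(out, b[j])
			j++
		}
	}
	out = append(out, a[i:]...)
	return append(out, b[j:]...)
}

func main() {
	older := []blockSample{{1000, 1}, {2000, 2}}
	newer := []blockSample{{3000, 3}, {4000, 4}}
	fmt.Println(mergeSorted(older, newer)) // [{1000 1} {2000 2} {3000 3} {4000 4}]
}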
Conclusion
Now you understand the journey of data in Prometheus. Every 15 seconds, metrics are scraped, parsed into series, collected into a batch, written to the WAL for durability, appended to the in-memory Head for fast queries, and then sealed to disk as immutable blocks after 2 hours.
This architecture is elegant: it balances speed, durability, and long-term storage. The in-memory Head enables fast queries over recent data without touching disk. The WAL ensures that data already ingested is not lost on crashes. Immutable blocks on disk enable efficient compaction and long-term retention without sacrificing query performance.
The tradeoff? Memory usage scales with cardinality. Understanding this design helps you make better decisions when monitoring with Prometheus, whether it's designing your metric labels, configuring retention, or troubleshooting performance.