Aviral Srivastava

Posted on Feb 23

Thanos for Long-term Prometheus Storage

#architecture #devops #monitoring #opensource

Thanos: The Infinity Gauntlet for Your Long-Term Prometheus Storage Needs

Let's face it, running Prometheus is awesome. It's the go-to for monitoring your infrastructure, keeping an eye on your applications, and generally being the digital guardian of your system's health. But as your systems grow, so does the data your Prometheus instances churn out. Suddenly, your shiny 2TB SSD is groaning under the weight of historical metrics, and you're facing a classic DevOps dilemma: how do you keep your precious Prometheus data around for as long as you need it without breaking the bank or your sanity?

Enter Thanos. Think of it as the Infinity Gauntlet for your Prometheus storage strategy. It's not a replacement for Prometheus, but a powerful add-on that lets you consolidate, scale, and access your Prometheus data from anywhere, for extended periods. If you're tired of your Prometheus data disappearing after a few weeks, or struggling to query across multiple Prometheus instances, then buckle up, because Thanos is here to save the day.

So, What Exactly IS Thanos? (The "Why Should I Care?" Section)

At its core, Thanos is a set of components that turns your Prometheus instances into a highly available, multi-cluster setup with long-term storage and global querying of your metrics. It takes your existing Prometheus instances (or even multiple independent Prometheus setups) and elevates them. Instead of having isolated silos of data, Thanos weaves them together into a unified, queryable tapestry.

Imagine you have several microservices, each with its own Prometheus. Or perhaps you have different teams managing their Prometheus deployments independently. Without Thanos, querying metrics across all these would be a nightmare. You'd have to manually fetch data from each instance, aggregate it, and pray you didn't miss anything. Thanos solves this in two ways: it ships the data from each Prometheus instance to object storage (like S3, GCS, or Azure Blob Storage) for long-term retention, and it adds a query layer that reads across all of your Prometheus sources and that stored history at once.

It's like having a super-powered librarian who can instantly pull up any book (metric) from any shelf (Prometheus instance) in your entire library (cluster) and present it to you in one coherent summary.

Before You Grab Your Infinity Stone: Prerequisites

Before you go all-in on Thanos, there are a few things you should have in place:

Existing Prometheus Instances: Thanos works by extending your current Prometheus setup. You'll need one or more Prometheus instances already running and collecting metrics.
Object Storage: This is the backbone of Thanos's long-term storage. You'll need an object storage solution that's accessible from your Thanos components. Popular choices include:
- Amazon S3: A widely used and robust cloud object storage service.
- Google Cloud Storage (GCS): Google's equivalent to S3, offering similar functionality.
- Azure Blob Storage: Microsoft's cloud object storage solution.
- Ceph/MinIO: Self-hosted object storage solutions, great for on-premises deployments or when you want more control.
Kubernetes (Highly Recommended): While Thanos can technically be run outside of Kubernetes, it's designed to thrive within a container orchestration platform. Kubernetes simplifies deployment, scaling, and management of the various Thanos components.
Basic Understanding of Prometheus: You should be comfortable with Prometheus concepts like targets, scrape configurations, PromQL, and how it stores data.

The Superpowers of Thanos: Advantages

Why should you embark on this Thanos journey? The benefits are pretty compelling:

Unlimited (Well, Almost) Long-Term Storage: Prometheus's local storage is limited by disk space. Thanos offloads this to object storage, which is typically much cheaper and scales far beyond what local disks can offer. This means you can keep your metrics for months, years, or even longer, enabling deeper historical analysis and trend identification.
Global Querying: This is a game-changer. Thanos provides a single entry point (the Thanos Query component) to query across all your Prometheus instances. No more stitching together data from disparate sources. A single PromQL query can span your entire observability landscape.
High Availability and Resilience: Thanos is designed for HA. Its components can be deployed redundantly, ensuring that your monitoring infrastructure remains available even if individual Prometheus instances or Thanos components fail.
Reduced Prometheus Resource Footprint: By offloading long-term storage, your Prometheus instances can be configured with shorter retention periods. This shrinks their local disk usage and keeps long-range historical queries off them, making them more efficient and stable.
Data Deduplication: Thanos intelligently handles duplicate data that might be scraped by multiple Prometheus instances, ensuring cleaner and more accurate queries.
Data Compaction and Downsampling: The Thanos Compactor component compacts blocks in object storage and downsamples older data to 5-minute and 1-hour resolutions, making queries over vast historical datasets faster and more manageable.
Separation of Concerns: You can have smaller, more focused Prometheus instances for specific clusters or applications, while Thanos aggregates and manages the long-term storage and global querying. This makes managing your Prometheus fleet much simpler.
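
To make the downsampling idea concrete, here is a minimal Python sketch that collapses raw samples into fixed 5-minute windows and keeps a few aggregates per window. It only illustrates the concept and is not how Thanos implements it internally; the function name `downsample` and the sample data are made up for this example.

```python
WINDOW_MS = 5 * 60 * 1000  # 5-minute resolution, in milliseconds


def downsample(samples, window_ms=WINDOW_MS):
    """Collapse (timestamp_ms, value) samples into per-window aggregates."""
    windows = {}
    for ts, value in sorted(samples):
        start = ts - (ts % window_ms)  # align to the start of the window
        agg = windows.get(start)
        if agg is None:
            windows[start] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["count"] += 1
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    return windows


if __name__ == "__main__":
    # One sample every 15 s for 10 minutes: 40 raw samples (illustrative data).
    raw = [(i * 15_000, float(i)) for i in range(40)]
    for start, agg in downsample(raw).items():
        print(start, agg)
```

Forty raw samples become two windows here, which is why a query spanning months touches far fewer data points once older data has been downsampled.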

The Kryptonite of Thanos: Disadvantages

No solution is perfect, and Thanos is no exception. Here are some potential drawbacks to consider:

Increased Complexity: Introducing Thanos adds more moving parts to your observability stack. You'll have new components to deploy, configure, and manage. This can be a learning curve.
Operational Overhead: While Kubernetes helps, managing a Thanos deployment still requires some operational effort, including upgrades, monitoring the Thanos components themselves, and troubleshooting.
Query Latency for Historical Data: Queries over older data are served from object storage through the Store Gateway, so they can be slower than queries over recent data held on a local Prometheus instance. Downsampling and caching reduce the gap, but some extra latency is the trade-off for the scalability and cost-effectiveness of object storage.
Cost of Object Storage: While generally cheaper than high-performance disks for long-term storage, object storage still incurs costs. You'll need to monitor your usage and optimize your storage classes.
Initial Setup Time: Getting Thanos up and running for the first time can take some effort, especially if you're new to its components and configuration.
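
For a rough feel of how much object storage you might need, the back-of-the-envelope arithmetic can be sketched in Python. This is an estimate under stated assumptions, not a sizing tool: the default of 1.5 bytes per sample is an assumption based on the 1-2 bytes per sample that the Prometheus documentation cites as an average, and the calculation ignores index overhead and the extra downsampled copies Thanos may keep. The function name and the example numbers are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def estimate_storage_gib(active_series, scrape_interval_s, retention_days,
                         bytes_per_sample=1.5):
    """Estimate raw sample storage in GiB.

    bytes_per_sample is an assumed average; measure your own blocks to refine it.
    """
    samples_per_series = (SECONDS_PER_DAY / scrape_interval_s) * retention_days
    total_bytes = active_series * samples_per_series * bytes_per_sample
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical fleet: 100k active series, 15 s scrape interval, 1 year kept.
    print(f"{estimate_storage_gib(100_000, 15, 365):.1f} GiB")
```

Multiply the result by your provider's per-GiB price for the storage class you use to get a monthly figure to track.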

The Thanos Toolkit: Key Components and Features

Thanos isn't a single monolithic application. It's a suite of interconnected components that work together to achieve its magic. Here are the key players:

1. Prometheus (Your Data Sources)

As mentioned, your existing Prometheus instances are the source of truth. They continue to scrape, ingest, and store metrics for a short period (e.g., a few days or weeks).

2. Thanos Sidecar

This is where the magic begins. The Thanos Sidecar runs alongside each Prometheus instance. Its primary responsibilities are:

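Uploading Data: It uploads each completed TSDB (Time Series Database) block that Prometheus writes to disk (by default, roughly one every two hours) to your configured object storage. This happens in the background, so your Prometheus instance isn't bogged down.
Exposing Local Data: It exposes the locally stored Prometheus data to the Thanos Query component over gRPC (the Thanos StoreAPI), so recent data that has not been uploaded yet remains queryable.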
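
The upload step can be pictured with a short Python sketch. It relies on one fact about the Prometheus TSDB layout: each completed block is a directory containing a `meta.json` file with `minTime` and `maxTime` fields. Everything else here (the function names, the `already_uploaded` set, the block names in the demo) is hypothetical and only approximates what the Sidecar does.

```python
import json
import tempfile
from pathlib import Path


def completed_blocks(tsdb_path):
    """Return (block_dir_name, min_time, max_time) for each completed block."""
    blocks = []
    for entry in sorted(Path(tsdb_path).iterdir()):
        meta_file = entry / "meta.json"
        # Directories without meta.json (such as the WAL) are not finished blocks.
        if entry.is_dir() and meta_file.is_file():
            meta = json.loads(meta_file.read_text())
            blocks.append((entry.name, meta["minTime"], meta["maxTime"]))
    return blocks


def pending_uploads(tsdb_path, already_uploaded):
    """Blocks that exist locally but are not yet in object storage."""
    return [b for b in completed_blocks(tsdb_path) if b[0] not in already_uploaded]


if __name__ == "__main__":
    # Build a fake TSDB directory with one block and a WAL directory.
    with tempfile.TemporaryDirectory() as tmp:
        block = Path(tmp) / "01EXAMPLEBLOCK"  # placeholder block name
        block.mkdir()
        (block / "meta.json").write_text(json.dumps({"minTime": 0, "maxTime": 7_200_000}))
        (Path(tmp) / "wal").mkdir()
        print(pending_uploads(tmp, already_uploaded=set()))
```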

Example Configuration Snippet (Prometheus prometheus.yml plus the Sidecar arguments):

# Prometheus configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  # Thanos needs external labels that uniquely identify this Prometheus instance
  external_labels:
    cluster: my-cluster        # placeholder value
    failure_domain: zone-a     # placeholder value; differs per HA replica

scrape_configs:
  - job_name: 'my-service'
    static_configs:
      - targets: ['localhost:9090']

# Thanos Sidecar configuration (passed as arguments to the sidecar process)
# For example, when running the Sidecar as a container next to Prometheus:
# command:
#  - /bin/thanos
#  - sidecar
#  - --tsdb.path=/prometheus
#  - --objstore.config-file=/etc/thanos/objstore.yml
#  - --prometheus.url=http://localhost:9090
#  - --grpc-address=0.0.0.0:10901   # StoreAPI endpoint used by Thanos Query
#  - --http-address=0.0.0.0:10902   # metrics and health endpoints

# In your objstore.yml file (e.g., /etc/thanos/objstore.yml):
# type: S3
# config:
#   bucket: your-thanos-bucket
#   endpoint: s3.amazonaws.com
#   region: us-east-1
#   access_key: YOUR_ACCESS_KEY
#   secret_key: YOUR_SECRET_KEY

3. Thanos Store Gateway

The Store Gateway is responsible for making data stored in object storage available for querying. It acts as an interface between the query layer and your long-term storage. It can serve data directly from object storage or cache it for faster access.

Key Features:

Object Storage Access: Connects to your object storage to retrieve historical metric data.
Data Indexing and Metadata: Keeps a small local copy of each block's metadata and index header, allowing Thanos Query to efficiently locate and retrieve specific metrics without downloading whole blocks.
Caching: Can cache frequently accessed data to improve query performance.
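
Block metadata is what lets a time-range query skip most of the bucket. The Python sketch below shows the idea under the assumption that each block covers a half-open time range, start inclusive and end exclusive, as Prometheus TSDB blocks do. The `BlockMeta` class, the function name, and the block names are hypothetical, and the real Store Gateway also filters by labels and uses the index to narrow things further.

```python
from dataclasses import dataclass

HOUR_MS = 3_600_000


@dataclass(frozen=True)
class BlockMeta:
    ulid: str      # block identifier (placeholder values below)
    min_time: int  # inclusive start, ms since epoch
    max_time: int  # exclusive end, ms since epoch


def blocks_for_range(blocks, query_start, query_end):
    """Return the blocks whose time range overlaps [query_start, query_end]."""
    return [b for b in blocks
            if b.min_time <= query_end and b.max_time > query_start]


if __name__ == "__main__":
    blocks = [
        BlockMeta("block-a", 0 * HOUR_MS, 2 * HOUR_MS),
        BlockMeta("block-b", 2 * HOUR_MS, 4 * HOUR_MS),
        BlockMeta("block-c", 4 * HOUR_MS, 6 * HOUR_MS),
    ]
    # A query covering hours 3 to 5 only needs the last two blocks.
    for block in blocks_for_range(blocks, 3 * HOUR_MS, 5 * HOUR_MS):
        print(block.ulid)
```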

4. Thanos Query

This is the central piece for global querying. Thanos Query acts as the front-end for your users and for tools like Grafana. It aggregates data from multiple Thanos Store Gateways and, through their Sidecars, from your local Prometheus instances.

Key Features:

Global PromQL Engine: Executes PromQL queries across all connected data sources.
Data Aggregation: Merges results from different sources to provide a unified view.
Query Routing: Fans each query out to the data sources whose time range and labels can match it (e.g., Sidecars for recent data, Store Gateway for historical data).
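
The merging step is easiest to see with highly available Prometheus pairs, where two replicas scrape the same targets and differ only in one external label. Here is a simplified Python sketch of deduplication across such replicas. It assumes the replica label is called `replica`, and it simply lets the first replica to report a timestamp win; Thanos's real deduplication logic is more involved, so treat this as an illustration only.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that are identical except for the replica label.

    Each series is a dict: {"labels": {...}, "samples": [(ts_ms, value), ...]}.
    """
    merged = {}
    for series in series_list:
        labels = {k: v for k, v in series["labels"].items() if k != replica_label}
        key = tuple(sorted(labels.items()))
        by_ts = merged.setdefault(key, {})
        for ts, value in series["samples"]:
            by_ts.setdefault(ts, value)  # first replica to report a timestamp wins
    return [
        {"labels": dict(key), "samples": sorted(by_ts.items())}
        for key, by_ts in merged.items()
    ]


if __name__ == "__main__":
    # Replica "a" missed one scrape; replica "b" fills the gap.
    replica_a = {"labels": {"__name__": "up", "job": "api", "replica": "a"},
                 "samples": [(0, 1.0), (15_000, 1.0)]}
    replica_b = {"labels": {"__name__": "up", "job": "api", "replica": "b"},
                 "samples": [(0, 1.0), (15_000, 1.0), (30_000, 1.0)]}
    print(deduplicate([replica_a, replica_b]))
```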

Example Thanos Query Configuration (part of a Kubernetes deployment):

# ... (other Thanos Query configurations)
args:
  - query
  - --grpc-address=0.0.0.0:10901 # gRPC address for the StoreAPI
  - --http-address=0.0.0.0:10902 # HTTP address for the web UI and API
  - --query.replica-label=failure_domain # external label that distinguishes HA replicas; used for deduplication
  - --endpoint=dnssrv+_grpc._tcp.thanos-store.monitoring.svc.cluster.local # Example for discovering Store Gateways
  # Repeat --endpoint for each Sidecar or Store Gateway service (older Thanos releases use --store instead)

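5. Thanos Receiver (Optional)

The Receiver accepts metrics that Prometheus instances push to it via remote write, stores them in its own local TSDB, and uploads the resulting blocks to object storage. It is an alternative to the Sidecar, useful when Thanos Query cannot reach your Prometheus instances directly (for example, across network boundaries) or when you want a central ingestion point in complex environments.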

6. Thanos Ruler (Optional)

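Thanos Ruler evaluates Prometheus recording and alerting rules against Thanos Query instead of a single Prometheus instance, so a rule can span data from all your Prometheus instances. It sends firing alerts to Alertmanager and can be deployed redundantly for high-availability alerting.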

Putting It All Together: A Typical Thanos Architecture

A common setup for Thanos on Kubernetes looks like this:

Multiple Prometheus Instances: Each running in its own cluster or namespace, scraping their local targets.
Thanos Sidecar: Attached to each Prometheus instance, uploading data to object storage.
Thanos Store Gateway: Deployed in a central cluster or as a standalone service, connecting to object storage and providing queryable access to historical data.
Thanos Query: Deployed in a central cluster, acting as the single query endpoint for your users and Grafana. It discovers Store Gateways and also reaches each Prometheus instance's recent data through its Sidecar.
Grafana: Configured with Thanos Query as a Prometheus data source.

When you run a query in Grafana, the request goes to Thanos Query. Thanos Query fetches recent data from the Prometheus instances (through their Sidecars) and older data from the Thanos Store Gateway, then merges and deduplicates the results before returning them to Grafana, giving you a seamless experience.
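
Because Thanos Query speaks the Prometheus HTTP API, anything that can query Prometheus can query it too. The Python sketch below builds an instant-query request and includes the `dedup` parameter that Thanos Query accepts on top of the standard ones. The base URL is a hypothetical in-cluster address, and the helper names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql, dedup=True):
    """Build an instant-query URL for a Prometheus-compatible API."""
    params = urlencode({"query": promql, "dedup": "true" if dedup else "false"})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def instant_query(base_url, promql, timeout=10):
    """Run the query and return the list of result series."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)["data"]["result"]


if __name__ == "__main__":
    # Hypothetical service address; adjust to wherever Thanos Query listens.
    print(build_query_url("http://thanos-query.monitoring.svc:10902", "sum(up)"))
```

Calling `instant_query` with the same arguments would perform the request against a live Thanos Query; Grafana does the equivalent when you point a Prometheus data source at that address.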

Implementing Thanos: A High-Level Overview

The implementation details will vary based on your environment, but here's a general roadmap:

Set Up Object Storage: Create a bucket in your chosen object storage service.
Deploy Prometheus: Ensure your Prometheus instances are running and configured correctly.
Deploy Thanos Sidecar: Integrate the Thanos Sidecar with your Prometheus deployments. This usually involves running the Sidecar as a sidecar container in your Kubernetes pods or as a separate process alongside your Prometheus binary. Configure it to point to your object storage.
Deploy Thanos Store Gateway: Deploy the Thanos Store Gateway component, configuring it to access your object storage.
Deploy Thanos Query: Deploy the Thanos Query component, configuring it to discover your Thanos Store Gateways and Sidecars.
Configure Grafana: Add your Thanos Query endpoint as a Prometheus data source in Grafana.

Conclusion: Is Thanos the Right Choice for You?

If you're struggling with the limitations of Prometheus's local storage, need to query across multiple Prometheus deployments, or want to build a robust, scalable, and long-term monitoring solution, then Thanos is definitely worth serious consideration. It transforms Prometheus from a capable local monitoring tool into a powerful, distributed observability platform.

While it introduces some complexity, the benefits of unlimited storage, global querying, and improved resilience often outweigh the operational overhead for organizations that have outgrown their basic Prometheus setups. Think of it as an investment in the long-term health and observability of your systems. So, go forth, embrace the Infinity Gauntlet, and make your Prometheus data work for you, forever!
