Loki Multi-Cluster Log Aggregation Across AWS Regions

Originally published on kuryzhev.cloud

You set up Loki in every region, pointed Grafana at each datasource, and called it done — but the moment a cross-region incident hit, you had no way to correlate logs without switching datasources mid-investigation. That context-switching costs minutes during an outage, and minutes matter. Loki multi-cluster log aggregation across AWS regions solves this by centralizing ingestion while keeping agents close to the workloads that generate logs.

This post covers the internals of how Loki actually handles cross-region log federation, the three architectural mistakes I see most often, and the production topology we use today with Loki 3.0.x on EKS.

What Loki Multi-Cluster Aggregation Actually Does

The first thing to understand is that Loki is not Prometheus. There is no federation endpoint, no remote read, no hierarchical scrape chain. Loki operates on a push model — agents (Promtail 2.9.x or Grafana Alloy 1.2+) ship log streams directly to a Loki write path. When engineers first hear "multi-cluster Loki," they assume something analogous to Prometheus federation, where a central instance scrapes regional ones. That assumption gets you burned fast.
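To make the push model concrete, here is a minimal sketch of the JSON body a client sends to Loki's push endpoint (`/loki/api/v1/push`). The helper name `build_push_payload` and the sample label values are illustrative assumptions. Real agents also batch, compress, and retry, which this sketch leaves out.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build a Loki push-API JSON body containing a single stream.

    Each entry in "values" is [timestamp in nanoseconds as a string, log line].
    """
    ts = str(now_ns if now_ns is not None else time.time_ns())
    return {
        "streams": [
            {"stream": dict(labels), "values": [[ts, line] for line in lines]}
        ]
    }


# Hypothetical labels and log line, with a fixed timestamp for reproducibility
payload = build_push_payload(
    {"cluster": "prod-eu-west-1", "region": "eu-west-1", "namespace": "payments"},
    ['level=error msg="upstream timeout"'],
    now_ns=1700000000000000000,
)
body = json.dumps(payload)  # would be POSTed with Content-Type: application/json
```

Nothing in this payload is pulled by the server: the stream's label set is decided entirely by whoever builds the request.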

What actually happens in a correctly designed setup: agents in eu-west-1 and ap-southeast-1 push log streams to a single Loki deployment in a hub region (us-east-1). Every log line arrives tagged with cluster and region labels. These labels become your primary index dimensions alongside the standard Kubernetes metadata. A Querier in us-east-1 can answer a LogQL query like {cluster="prod-eu-west-1"} |= "ERROR" because the underlying chunks are in a shared S3 bucket — it doesn't matter where the log originated geographically.
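As a rough illustration of what that query looks like on the wire, the sketch below builds a request against Loki's `query_range` HTTP endpoint. The hub hostname is a hypothetical placeholder, and the `X-Scope-OrgID` header assumes multi-tenancy is enabled with one tenant per cluster. It only constructs the URL and headers and sends nothing.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_request(base_url, tenant_id, logql, start_ns, end_ns, limit=100):
    """Return (url, headers) for a Loki range query; nothing is sent."""
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    url = f"{base_url}/loki/api/v1/query_range?{params}"
    headers = {"X-Scope-OrgID": tenant_id}  # tenant header used when auth_enabled is true
    return url, headers


logql = '{cluster="prod-eu-west-1"} |= "ERROR"'
url, headers = build_query_request(
    "http://loki-query-frontend.example.internal:3100",  # hypothetical hub address
    "prod-eu-west-1",
    logql,
    1700000000000000000,
    1700003600000000000,
)

# Round-trip check: the encoded query string decodes back to the original LogQL
decoded = parse_qs(urlsplit(url).query)
```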

Understanding the component roles matters here. The Distributor receives incoming log streams, validates their labels against per-tenant limits, and routes them to Ingesters. The Ingester buffers recent logs in memory and flushes a chunk to S3 once it fills or has received no new lines for chunk_idle_period (5m in our config; more on why we reduced this from the 30m default shortly). The Querier reads from both in-memory Ingester data and flushed S3 chunks, merging results transparently. None of these components need to be co-located with the agents. The separation is what makes the hub-and-spoke topology work.

One subtle point: if you use a single S3 bucket in us-east-1, agents in eu-west-1 pay cross-region data transfer costs on every log push. We address this with VPC peering and private NLB endpoints. The logs still land in one bucket, but the network path stays off the public internet.

How People Set This Up Wrong

I've reviewed a lot of Loki deployments. The same three mistakes appear repeatedly, and each one looks fine until it doesn't.

Mistake 1: Multiple Loki stacks, multiple Grafana datasources. This is the path of least resistance and it works — right up until you need to answer "did the error in eu-west-1 correlate with the deployment in us-east-1?" LogQL has no cross-datasource join. You're manually copying timestamps between two browser tabs. I stopped recommending this topology after we spent 40 minutes during an incident reconciling two separate Explore sessions. One central Loki eliminates the problem entirely.

Mistake 2: The filesystem storage backend that quietly becomes production. Someone spins up a quick Loki instance with storage.type: filesystem for "just testing." Three months later it's ingesting 20GB/day and nobody has touched the config. When the Ingester pod reschedules to a different AZ — and it will — the chunks on the old node's local disk are gone. No error in Grafana. No alert fires. Log lines just disappear. The default chunk_idle_period: 30m means you can lose up to 30 minutes of logs silently. We reduce this to 5m and use S3 exclusively.

Mistake 3: No stream selector discipline from day one. Default Promtail configs targeting /var/log/**/*.log without excluding /var/log/containers symlinks cause duplicate ingestion immediately. But the slower-burning problem is label cardinality explosion. Teams start adding pod_name, image_tag, commit_sha as Loki labels because it feels natural coming from Prometheus. Within 48 hours on a busy cluster, the number of unique label combinations saturates the index. Loki starts returning too many outstanding requests, which is not a network error: it means a per-tenant query queue is full. The limit is max_outstanding_per_tenant on the query frontend, or max_outstanding_requests_per_tenant on the query scheduler, and the defaults differ between Loki versions. High-cardinality values belong in log content, parsed with | json or | logfmt, not in the label index.
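The cardinality problem is easiest to see as arithmetic. The sketch below computes an upper bound on the number of distinct streams as the product of unique values per label. The per-label counts are made-up numbers for illustration, and real clusters will sit below the bound because not every combination occurs.

```python
from math import prod


def max_stream_count(label_cardinalities):
    """Upper bound on distinct streams: product of unique values per label."""
    return prod(label_cardinalities.values())


# Hypothetical counts of unique values per label
disciplined = {"cluster": 3, "namespace": 20, "container": 50}
exploded = {**disciplined, "pod_name": 400, "commit_sha": 30}

print(max_stream_count(disciplined))  # 3000
print(max_stream_count(exploded))     # 36000000
```

Two extra labels multiply the bound by 12,000, which is why those values are better left in the log line.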

Watch out: setting auth_enabled: false in any setup you might scale later is a trap. All logs land in the fake tenant. Per-tenant retention overrides never apply. You cannot isolate prod logs from dev logs. Enabling auth later requires re-ingesting or manually migrating tenant assignments — neither is fun.

The Correct Approach — Centralized Loki with Regional Agents

The topology we run: one Loki deployment in microservices mode on EKS in us-east-1, Grafana Alloy agents in each spoke cluster, all writing to a shared S3 bucket. Monolithic mode is fine up to roughly 30-50GB/day ingestion; past that, memory limits on a single pod become the bottleneck and you want separate scaling for Distributors, Ingesters, and Queriers.

The Helm values below configure Loki in microservices mode. Pay attention to the comments — several of these settings have non-obvious failure modes.

# loki-values.yaml — Helm values for Loki microservices mode, hub region (us-east-1)
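# Chart: grafana/loki v6.x, Loki image: grafana/loki:3.0.0

# Chart v6 defaults to SimpleScalable; microservices mode must be selected explicitly
deploymentMode: Distributed

# Zero out the simple-scalable targets so only the per-component deployments run
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

loki: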
  auth_enabled: true  # REQUIRED for multi-tenant; false puts everything in 'fake' tenant

  commonConfig:
    replication_factor: 3  # Must match ingester replica count below

  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks-prod-us-east-1
      ruler: loki-ruler-prod-us-east-1
      admin: loki-admin-prod-us-east-1
    s3:
      region: us-east-1
      # Use IRSA — no static credentials here
      insecure: false

  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb          # TSDB index, not boltdb-shipper (deprecated)
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h

  limits_config:
    ingestion_rate_mb: 20          # Per-tenant; raise for high-volume clusters
    ingestion_burst_size_mb: 40
    max_label_names_per_series: 20 # Default 15 is too low for k8s workloads
    max_query_parallelism: 32      # Caps how many subqueries run in parallel per query
    split_queries_by_interval: 15m # Parallelizes long time-range queries; lives in limits_config, not query_frontend
    retention_period: 720h         # 30d baseline; override per tenant in overrides

  ingester:
    chunk_idle_period: 5m          # Flush sooner; default 30m risks data loss
    chunk_target_size: 3145728     # 3MB chunks; reduces S3 PUT count vs default 1.5MB
    chunk_encoding: snappy         # Faster than gzip at a slightly lower compression ratio

  compactor:
    working_directory: /var/loki/compactor
    retention_enabled: true        # Must be explicit; does NOT default to true
    delete_request_store: s3

# Component replica counts
ingester:
  replicas: 3          # Must match replication_factor above

querier:
  replicas: 3
  maxUnavailable: 1

queryFrontend:
  replicas: 2

compactor:
  replicas: 1          # CRITICAL: never run more than 1; no distributed locking

distributor:
  replicas: 2

# IRSA annotation — replace with your actual role ARN
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::123456789012:role/loki-s3-irsa-role"

The Grafana Alloy config below runs in each spoke cluster. The key discipline is injecting cluster and region labels at the agent, before the push. Loki has no server-side relabeling stage: a stream is indexed under exactly the labels it arrives with, so anything pushed without these labels cannot be selected by cluster or region later, and your queries break in confusing ways.

// alloy-config.alloy: Grafana Alloy v1.2+ agent config for spoke cluster (eu-west-1)
// Runs in each non-hub EKS cluster; ships logs to central Loki in us-east-1

// Discover all pods in the cluster
discovery.kubernetes "pods" {
  role = "pod"
}

// Relabel: keep only necessary labels, inject cluster identity
discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  // Inject cluster and region as static labels — set here, not at Loki level
  rule {
    target_label = "cluster"
    replacement  = "prod-eu-west-1"   // Hardcode per cluster; do not use env var
  }
  rule {
    target_label = "region"
    replacement  = "eu-west-1"
  }

  // Keep namespace, pod, container for filtering
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
}

// Tail pod logs through the Kubernetes API (this component does not read node log files)
loki.source.kubernetes "pod_logs" {
  targets    = discovery.relabel.pod_logs.output
  forward_to = [loki.write.central.receiver]
}

// Write to central Loki in hub region via private NLB (VPC peering)
loki.write "central" {
  endpoint {
    // Replace with private NLB DNS of hub Loki for cross-cluster routing
    url = "http://internal-loki-nlb.us-east-1.elb.amazonaws.com/loki/api/v1/push"

    tenant_id = "prod-eu-west-1"  // Maps to per-tenant retention in Loki overrides

    basic_auth {
      username = env("LOKI_USERNAME")   // Injected via k8s secret
      password = env("LOKI_PASSWORD")
    }
  }

  external_labels = {
    cluster = "prod-eu-west-1",
    region  = "eu-west-1",
  }
}

Watch out: the storage_config.aws.s3 field requires an s3://bucket-name URI format. Using an ARN format silently falls back to filesystem storage and logs no error at startup. You won't notice until an Ingester pod restarts and you check S3 to find it empty.

For IAM, use IRSA (IAM Roles for Service Accounts) with s3:GetObject, s3:PutObject, s3:DeleteObject, and s3:ListBucket. Node-level instance profiles grant S3 access to every pod on the node — that's not acceptable in a production security posture. You can read more about the overall architecture approach on kuryzhev.cloud.
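A minimal sketch of a permissions policy covering those four actions follows. It generates the JSON for one bucket; the function name is an illustrative assumption, the bucket name is taken from the Helm values above, and the IRSA trust policy (the OIDC provider condition on the role) is deliberately not shown.

```python
import json


def loki_s3_policy(bucket):
    """Least-privilege S3 policy document for a single Loki bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object-level actions apply to keys inside the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


policy = loki_s3_policy("loki-chunks-prod-us-east-1")
policy_json = json.dumps(policy, indent=2)
```

The split into two statements matters: granting s3:ListBucket on the `/*` resource does nothing, because that action is evaluated against the bucket ARN.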

Advanced Patterns — Ruler, Retention Tiers, and Query Sharding

Once the basic topology is working, three features separate a functional setup from a scalable one.

Per-tenant retention is the most impactful cost control lever. Define an overrides.yaml that sets retention_period: 720h for the prod-eu-west-1 tenant and 168h (7 days) for dev-eu-west-1. The Compactor enforces these on a 24-hour cycle. Without per-tenant overrides, every cluster's logs are kept for the same global retention_period, so short-lived dev logs occupy S3 for as long as prod logs do. One important note: the table_manager component is deprecated as of Loki 2.8. If you have both table_manager and compactor configured, retention silently does nothing. Use the Compactor exclusively, with retention_enabled: true set explicitly, because it does not default to true.
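The effect of the override can be modelled in a few lines. This is a simplified model of the fallback behaviour, not Loki's implementation. The 20GB/day volume is a hypothetical figure, and the storage estimate treats the volume as already-compressed bytes.

```python
from datetime import timedelta

GLOBAL_RETENTION = timedelta(hours=720)  # stands in for limits_config.retention_period
TENANT_RETENTION = {                     # stands in for the per-tenant overrides
    "prod-eu-west-1": timedelta(hours=720),
    "dev-eu-west-1": timedelta(hours=168),
}


def effective_retention(tenant):
    """Per-tenant override if present, otherwise the global default."""
    return TENANT_RETENTION.get(tenant, GLOBAL_RETENTION)


def steady_state_gb(tenant, gb_per_day):
    """Approximate stored volume once ingest and deletion balance out."""
    days = effective_retention(tenant) / timedelta(days=1)
    return gb_per_day * days


print(steady_state_gb("prod-eu-west-1", 20))  # 600.0
print(steady_state_gb("dev-eu-west-1", 20))   # 140.0
```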

The Ruler component enables cross-cluster alerting via LogQL recording and alerting rules. Configure ruler_storage pointing to a separate S3 prefix (not the same prefix as chunks — this causes read conflicts). A rule that fires when the error rate in eu-west-1 exceeds a threshold routes to Alertmanager in the hub region, giving you unified alert management regardless of which cluster generated the condition.

Query frontend sharding with max_query_parallelism: 32 and split_queries_by_interval: 15m is non-negotiable for multi-cluster deployments. Without it, a 24-hour LogQL query across 5 clusters serializes through a single Querier, hits the timeout, and returns a 500. With it, the Query Frontend splits the time range into 15-minute subqueries (96 of them for a 24-hour range) and fans them out across Querier replicas in parallel. The difference between a 45-second timeout and a 4-second result on a wide query comes down to these two settings.
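The splitting step can be sketched as follows. This is a simplified model: the real query frontend also aligns subqueries to interval boundaries and shards within each interval, neither of which is shown here.

```python
from math import ceil


def split_by_interval(start_s, end_s, interval_s):
    """Split [start_s, end_s) into consecutive subranges of at most interval_s seconds."""
    ranges = []
    cur = start_s
    while cur < end_s:
        nxt = min(cur + interval_s, end_s)
        ranges.append((cur, nxt))
        cur = nxt
    return ranges


# A 24-hour query split into 15-minute subqueries
subqueries = split_by_interval(0, 24 * 3600, 15 * 60)

# With at most 32 subqueries in flight, the work completes in this many waves
waves = ceil(len(subqueries) / 32)

print(len(subqueries), waves)  # 96 3
```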

Also consider enabling the index_gateway component once your index exceeds roughly 10GB. It offloads index lookups from Queriers, reducing their memory footprint by approximately 40% in our measurements. This matters when you're running 3 Querier replicas and each one is holding the full index in memory.

Performance and Cost Notes

The ingestion cost driver in this architecture is not compute — it's S3 PUT requests. Loki's default chunk_target_size is 1.5MB (1,572,864 bytes). For clusters with moderate log volume, chunks fill slowly, flush frequently, and generate a high PUT-to-data ratio. We set chunk_target_size: 3145728 (3MB) and chunk_encoding: snappy across all hub Ingesters. Snappy compresses faster than gzip with slightly lower ratio — for log data the trade-off favors throughput.
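A back-of-the-envelope estimate shows why the target size matters. The sketch assumes every chunk fills to its target before flushing, so it gives a lower bound on PUT count; chunks flushed early by chunk_idle_period will push the real number higher. The 20 GiB/day figure is hypothetical.

```python
DEFAULT_TARGET = 1_572_864  # 1.5MB, Loki's default chunk_target_size
LARGER_TARGET = 3_145_728   # 3MB, the value used in this post


def min_chunk_puts(compressed_bytes_per_day, chunk_target_size):
    """Lower bound on chunk PUTs per day if every chunk reaches its target size."""
    # Integer ceiling division avoids floating-point rounding on large byte counts
    return -(-compressed_bytes_per_day // chunk_target_size)


daily_bytes = 20 * 1024**3  # hypothetical 20 GiB/day of compressed chunk data

print(min_chunk_puts(daily_bytes, DEFAULT_TARGET))  # 13654
print(min_chunk_puts(daily_bytes, LARGER_TARGET))   # 6827
```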

Cross-region data transfer from eu-west-1 agents to us-east-1 Loki costs approximately $0.02/GB at standard AWS rates. At 100GB/day per spoke cluster, that's roughly $60/month per region before any optimization. Do not count on VPC peering or AWS PrivateLink to lower that per-GB rate: inter-region peering traffic is billed at the inter-region data transfer rate, and PrivateLink adds its own per-GB processing charge on top. What the private path buys you is the elimination of public internet exposure, which is reason enough to justify the peering setup time.
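The transfer arithmetic, as a small sketch you can rerun with your own volume and rate. The function name is illustrative, and the $0.02/GB rate is the figure quoted in this post rather than a guaranteed price.

```python
def monthly_transfer_cost(gb_per_day, usd_per_gb, days=30):
    """Monthly inter-region transfer cost for one spoke cluster."""
    return gb_per_day * days * usd_per_gb


per_spoke = monthly_transfer_cost(100, 0.02)  # 100GB/day at $0.02/GB
three_spokes = 3 * per_spoke

print(round(per_spoke, 2), round(three_spokes, 2))  # 60.0 180.0
```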

On the query side: a 1-hour LogQL filter query across 3 clusters with proper label indexing — meaning cluster, namespace, container labels set correctly — resolves in under 2 seconds in our environment. The same query using a regex match against pod_name (a high-cardinality label we index for legacy reasons) takes 45+ seconds and saturates Querier CPU. The lesson: pod_name belongs in log content extracted via | json, not in the label set. A query like {cluster="prod-eu-west-1", namespace="payments"} |= "ERROR" | json | line_format "{{.message}}" is fast. A query with pod_name=~"payments-.*" as a stream selector is slow.

One final security note: enable SSE-KMS on your S3 bucket and set BucketKeyEnabled: true. Without BucketKey, every object written generates a separate KMS API call. At scale — millions of chunks — this adds meaningful cost and introduces KMS throttling as a new failure mode for your log pipeline. The official Loki documentation covers storage configuration in detail, and the Grafana Alloy docs are the right reference for agent configuration going forward as Promtail approaches end-of-life.

Loki multi-cluster log aggregation across AWS regions is not complicated once you have the right mental model. One hub, regional agents, shared S3, tenant-based isolation. The mistakes all come from treating Loki like Prometheus or treating a temporary setup like it won't become permanent. Get the labels right from day one, run the Compactor as a single replica, and size your chunks for your actual volume.