Introduction
At small scale, logging is simple—tail a file, search with grep, done. But as your infrastructure grows to dozens of services across multiple servers, this approach breaks down. Finding a specific error across 100 containers, correlating events across services, or analyzing patterns in millions of log lines becomes impossible without the right tooling.
Modern logging solutions promise to solve these problems, but choosing the wrong one can be costly. The ELK Stack offers powerful search and analytics but requires significant operational overhead. Loki promises simplicity and cost savings but with feature tradeoffs. CloudWatch provides seamless AWS integration but can become expensive at scale.
In this comprehensive guide, we'll explore these three leading logging solutions, helping you choose the right one for your needs.
Why Logging at Scale is Hard
Volume
Small app: 1 server, 10 MB/day logs
✓ tail -f application.log
✓ grep "ERROR" application.log
At scale: 100 services, 50 GB/day logs
✗ Can't SSH to each server
✗ Can't grep 50GB of data
✗ Logs rotate, old ones compressed/deleted
✗ Distributed tracing across services needed
Velocity
1,000 requests/second × 100 services = 100,000 log events/second
Challenges:
- Ingesting 100K events/second
- Indexing for fast search
- Storing efficiently
- Querying without overwhelming system
Cardinality
High-cardinality fields (user IDs, request IDs) explode index sizes:
Low cardinality:
level: [DEBUG, INFO, WARN, ERROR] # 4 values
service: [api, worker, frontend] # 3 values
High cardinality:
request_id: unique per request # millions of values
user_id: one per user # millions of values
session_id: one per session # millions of values
Indexing high-cardinality fields = massive storage costs
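To make the blow-up concrete, here is a rough back-of-the-envelope sketch (the distinct-value counts are illustrative assumptions): the number of distinct field/label combinations, and therefore index entries or streams, is the product of each field's cardinality.
# Back-of-the-envelope: distinct label combinations multiply together.
# The value counts below are illustrative assumptions, not measurements.
low_cardinality  = {"level": 4, "service": 3, "environment": 2}
high_cardinality = {"request_id": 10_000_000, "user_id": 2_000_000}

def combinations(fields):
    total = 1
    for distinct_values in fields.values():
        total *= distinct_values
    return total

print(combinations(low_cardinality))   # 24 combinations -> tiny index
print(combinations(high_cardinality))  # 20,000,000,000,000 -> index explosion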
Retention
Retention requirements:
- Debug logs: 7 days
- Application logs: 30 days
- Audit logs: 7 years (compliance)
- Access logs: 90 days
At 50 GB/day:
- 7 days = 350 GB
- 30 days = 1.5 TB
- 90 days = 4.5 TB
- 7 years = 128 TB
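The storage figures above fall out of simple arithmetic; here is a quick sketch you can adapt to your own daily volume:
# Quick retention sizing at 50 GB/day of raw logs (matches the list above).
daily_gb = 50
for log_type, days in [("debug", 7), ("application", 30), ("access", 90), ("audit", 7 * 365)]:
    print(f"{log_type:12s} {days:4d} days -> {daily_gb * days / 1000:6.1f} TB")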
The ELK Stack (Elasticsearch, Logstash, Kibana)
Architecture
Application → Filebeat → Logstash → Elasticsearch → Kibana

Filebeat        (ship logs)
    ↓
Logstash        (parse, filter, enrich)
    ↓
Elasticsearch   (store & index)
    ↓
Kibana          (visualize & query)
Components
Elasticsearch: Distributed search and analytics engine
# Elasticsearch cluster
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: production
spec:
  version: 8.10.0
  nodeSets:
  # Master nodes
  - name: master
    count: 3
    config:
      node.roles: ["master"]
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 2Gi
              cpu: 1
            limits:
              memory: 2Gi
              cpu: 2
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
  # Data nodes
  - name: data
    count: 3
    config:
      node.roles: ["data", "ingest"]
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 8Gi
              cpu: 2
            limits:
              memory: 8Gi
              cpu: 4
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi
Logstash: Log parsing and enrichment
# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }

  # Parse nginx access logs
  if [service] == "nginx" {
    grok {
      match => {
        "message" => "%{COMBINEDAPACHELOG}"
      }
    }
    geoip {
      source => "clientip"
    }
  }

  # Add Kubernetes metadata
  if [kubernetes] {
    mutate {
      add_field => {
        "cluster"   => "%{[kubernetes][cluster]}"
        "namespace" => "%{[kubernetes][namespace]}"
        "pod"       => "%{[kubernetes][pod]}"
      }
    }
  }

  # Drop DEBUG events older than one day
  # (uses the logstash-filter-age plugin; comparing @timestamp against the
  # string "now-1d" is not valid Logstash conditional syntax)
  age {}
  if [level] == "DEBUG" and [@metadata][age] > 86400 {
    drop {}
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
Kibana: Visualization and querying
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: production
spec:
  version: 8.10.0
  count: 2
  elasticsearchRef:
    name: production
  podTemplate:
    spec:
      containers:
      - name: kibana
        resources:
          requests:
            memory: 1Gi
            cpu: 500m
          limits:
            memory: 2Gi
            cpu: 1
Strengths
Powerful Search: Full-text search with complex queries
# Kibana Query Language (KQL)
service:"api" AND level:"ERROR" AND status_code:500

# Regex on message content (regex requires Lucene query syntax; restrict the
# time range with Kibana's time picker, e.g. "Last 1 hour")
message:/database.*timeout/

# Aggregations (e.g. error counts grouped by service) aren't expressed in KQL;
# build them with Kibana visualizations or Elasticsearch aggregation queries
Rich Visualizations: Dashboards, graphs, charts
Ecosystem: Massive plugin ecosystem (alerting, ML, APM)
Mature: Battle-tested at enormous scale (Netflix, Uber, GitHub)
Weaknesses
Operational Complexity: Requires dedicated team
Operational tasks:
- Managing cluster health
- Shard allocation
- Index lifecycle management
- JVM heap tuning
- Backup and recovery
- Upgrades (complex)
- Performance tuning
Resource Intensive: High CPU and memory requirements
Typical cluster for 50GB/day:
- 3 master nodes: 2 CPU, 2GB RAM each
- 3 data nodes: 4 CPU, 8GB RAM each
- 500GB storage per data node
Total: 18 CPU, 30GB RAM, 1.5TB storage
Cost: ~$1,500-2,000/month (AWS/GCP)
Storage Costs: Indexes are storage-expensive
Log volume vs storage:
- Raw logs: 50 GB/day
- Indexed in Elasticsearch: 150 GB/day (3x)
- 30 days retention: 4.5 TB
When to Use ELK
✓ Need powerful full-text search
✓ Complex queries and aggregations
✓ Rich dashboards and visualizations
✓ Have dedicated DevOps/SRE team
✓ Budget for infrastructure
✓ Need mature ecosystem (alerts, ML, APM)
Cost Example
50 GB/day logs, 30-day retention:
Self-hosted on AWS:
- EC2 instances: $1,200/month
- EBS storage: $600/month
- Data transfer: $200/month
Total: ~$2,000/month
Managed (Elastic Cloud):
- $0.12/GB/month indexed
- 50 GB/day × 30 days × 3x index = 4.5 TB
- 4,500 GB × $0.12 = $540/month + compute
Total: ~$3,000-5,000/month
Grafana Loki
Architecture
Application → Promtail → Loki → Grafana
                          ↓
               Index: label metadata only
                          ↓
               Object storage (S3/GCS): log content
Core Concept: Labels, Not Indexes
Loki indexes metadata (labels), not log content:
ELK: Indexes EVERYTHING
"2024-01-15 10:23:45 ERROR database connection timeout"
↓
Indexes: [2024, 01, 15, 10, 23, 45, ERROR, database, connection, timeout]
Loki: Indexes ONLY labels
Labels: {service="api", level="error", cluster="prod"}
Content: "database connection timeout" (stored in S3, not indexed)
↓
Indexes: [service=api, level=error, cluster=prod]
This makes Loki dramatically cheaper:
50 GB/day logs:
- ELK indexed storage: 150 GB/day (3x)
- Loki indexed storage: 500 MB/day (0.01x)
- Loki object storage: 50 GB/day (compressed to ~5 GB)
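To see the label/content split in practice, here is a minimal sketch of pushing one log line to Loki's HTTP push API. The URL and labels are illustrative assumptions; timestamps are nanosecond strings per the push API format.
# Minimal sketch: push one log line to Loki. Only the "stream" labels are
# indexed; the log line itself is stored as-is in object storage.
import json
import time
import urllib.request

payload = {
    "streams": [{
        "stream": {"service": "api", "level": "error", "cluster": "prod"},  # indexed labels
        "values": [[str(time.time_ns()), "database connection timeout"]],   # unindexed content
    }]
}
req = urllib.request.Request(
    "http://loki:3100/loki/api/v1/push",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)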
Components
Promtail: Log collector and shipper
# promtail-config.yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Scrape Kubernetes pod logs
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Add namespace label
      - source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      # Add pod label
      - source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      # Add container label
      - source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      # Add app label
      - source_labels:
          - __meta_kubernetes_pod_label_app
        target_label: app
Loki: Log aggregation and storage
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki
data:
  loki.yaml: |
    auth_enabled: false

    server:
      http_listen_port: 3100

    ingester:
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
      chunk_idle_period: 5m
      chunk_retain_period: 30s

    schema_config:
      configs:
        - from: 2023-01-01
          store: boltdb-shipper
          object_store: s3
          schema: v11
          index:
            prefix: index_
            period: 24h

    storage_config:
      boltdb_shipper:
        active_index_directory: /loki/index
        cache_location: /loki/cache
        shared_store: s3
      aws:
        s3: s3://us-east-1/loki-logs
        s3forcepathstyle: true

    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h

    chunk_store_config:
      max_look_back_period: 720h

    table_manager:
      retention_deletes_enabled: true
      retention_period: 720h
LogQL: Loki Query Language
# Basic label filtering
{app="api", level="error"}
# Filter by log content (slower, scans content)
{app="api"} |= "database timeout"
# Regular expression filtering
{app="api"} |~ "error.*database"
# Exclude logs
{app="api"} != "debug"
# Parse JSON and filter
{app="api"} | json | status_code="500"
# Rate query: per-second rate of error lines over the last 5 minutes
rate({app="api", level="error"}[5m])
# Aggregation: Sum by service
sum by (service) (rate({level="error"}[5m]))
Strengths
Cost-Effective: 10x cheaper storage than Elasticsearch
50 GB/day logs, 30-day retention:
Loki on AWS:
- EC2 (small): $200/month
- S3 storage: 150 GB × $0.023 = $3.45/month
- Data transfer: $50/month
Total: ~$250-300/month
8x cheaper than ELK!
Simple to Operate: Minimal operational overhead
Loki cluster:
- Single binary
- No complex shard management
- No JVM tuning
- Uses object storage (S3, GCS)
- Easy scaling
Integrates with Grafana: Unified observability platform
Label-Based: Forces good logging practices
Weaknesses
Limited Search: No full-text indexing
# Slow (scans all log content):
{app="api"} |= "specific error message"
# Fast (uses index):
{app="api", level="error"}
# Very slow (Loki requires at least one label matcher, so use a broad one):
{cluster=~".+"} |= "find this anywhere"  # effectively scans everything
Fewer Features: Less mature than ELK
Label Cardinality Limits: Can't use high-cardinality labels
# Bad: High cardinality
{user_id="12345"} # Millions of unique users = millions of streams
# Good: Low cardinality
{service="api", environment="prod"} # Few unique values
# Instead, filter content:
{service="api"} | json | user_id="12345"
When to Use Loki
✓ Want low operational overhead
✓ Cost-conscious
✓ Already using Grafana
✓ Structured logging with good labels
✓ Don't need complex full-text search
✓ Small team
Cost Example
50 GB/day logs, 30-day retention:
Self-hosted on AWS:
- EC2 instance (t3.large): $60/month
- S3 storage: 1.5 TB × $0.023 = $35/month
- Data transfer: $50/month
Total: ~$150/month
Grafana Cloud (managed):
- $0.50/GB ingested
- 50 GB/day × 30 days = 1.5 TB
- 1,500 GB × $0.50 = $750/month
Total: ~$750/month
AWS CloudWatch Logs
Architecture
Application → CloudWatch Agent → CloudWatch Logs → CloudWatch Logs Insights
                                        ↓
                              (fully managed by AWS)
Components
CloudWatch Agent: Collects logs from EC2/ECS/EKS
// cloudwatch-agent-config.json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/application.log",
            "log_group_name": "/aws/ec2/application",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          }
        ]
      }
    }
  }
}
Fluent Bit for Kubernetes:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info

    [INPUT]
        Name         tail
        Path         /var/log/containers/*.log
        Parser       docker
        Tag          kube.*

    [FILTER]
        Name             kubernetes
        Match            kube.*
        Kube_URL         https://kubernetes.default.svc:443
        Kube_CA_File     /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File  /var/run/secrets/kubernetes.io/serviceaccount/token

    [OUTPUT]
        Name                cloudwatch_logs
        Match               *
        region              us-east-1
        log_group_name      /aws/eks/cluster
        log_stream_prefix   application-
        auto_create_group   true
CloudWatch Insights Query Language
# Find errors in last hour
fields @timestamp, @message
| filter @message like /ERROR/
| filter @timestamp > ago(1h)
| sort @timestamp desc
| limit 100
# Count errors by service
fields service
| filter level = "ERROR"
| stats count() by service
# P95 latency
fields @timestamp, duration
| filter service = "api"
| stats pct(duration, 95) as p95
# Parse JSON logs
fields @timestamp, @message
| parse @message /\{"level":"(?<level>.*?)","service":"(?<service>.*?)"\}/
| filter level = "ERROR"
| stats count() by service
Strengths
Fully Managed: Zero operational overhead
AWS Integration: Native integration with all AWS services
# Automatic logs from:
- Lambda functions
- ECS containers
- RDS databases
- API Gateway
- CloudFront
- VPC Flow Logs
- And 50+ more services
Simple Setup: Easy to get started
Security: IAM integration, encryption at rest
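To illustrate the "simple setup" point above, here is a minimal boto3 sketch that creates a log group and stream and pushes a single event. The group and stream names are illustrative assumptions.
# Minimal boto3 sketch: create a log group/stream and push one event.
# create_log_group raises an error if the group already exists.
import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")
logs.create_log_group(logGroupName="/myapp/api")
logs.create_log_stream(logGroupName="/myapp/api", logStreamName="instance-1")
logs.put_log_events(
    logGroupName="/myapp/api",
    logStreamName="instance-1",
    logEvents=[{"timestamp": int(time.time() * 1000), "message": "ERROR database timeout"}],
)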
Weaknesses
Expensive at Scale: Costs grow linearly
50 GB/day ingestion:
- Ingestion: 50 GB × $0.50 = $25/day = $750/month
- Storage: 1.5 TB × $0.03 = $45/month
- Insights queries: ~$200/month
Total: ~$1,000/month
At 500 GB/day: ~$10,000/month
Limited Query Capabilities: Basic compared to ELK
AWS Lock-In: Can't easily migrate to another cloud
No Built-In Log Alerting: Alerting requires wiring up metric filters plus CloudWatch Alarms (separate services)
When to Use CloudWatch
✓ AWS-only infrastructure
✓ Want zero operational overhead
✓ Small to medium log volume (<100 GB/day)
✓ Already using AWS services
✓ Need compliance/security (AWS certified)
✓ Don't want to manage infrastructure
Cost Example
50 GB/day logs, 30-day retention:
- Ingestion: $0.50/GB = $25/day = $750/month
- Storage: 1.5 TB × $0.03/GB = $45/month
- Insights queries: ~$200/month
Total: ~$1,000/month
With CloudWatch subscription filters to S3:
- Ingestion: $750/month
- S3 storage: 1.5 TB × $0.023 = $35/month
- Athena queries: ~$50/month
Total: ~$835/month
Detailed Comparison
Feature Matrix
| Feature | ELK | Loki | CloudWatch |
|---|---|---|---|
| Full-Text Search | ✅ Excellent | ⚠️ Limited | ⚠️ Basic |
| Cost (50GB/day) | $2,000/mo | $250/mo | $1,000/mo |
| Operational Complexity | ❌ High | ✅ Low | ✅ None |
| Query Performance | ✅ Fast | ⚠️ Slower | ⚠️ Variable |
| Scalability | ✅ Excellent | ✅ Good | ✅ Excellent |
| Visualization | ✅ Rich (Kibana) | ✅ Good (Grafana) | ⚠️ Basic |
| Alerting | ✅ Built-in | ✅ Via Grafana | ✅ CloudWatch Alarms |
| Multi-Cloud | ✅ Yes | ✅ Yes | ❌ AWS only |
| Learning Curve | ❌ Steep | ✅ Gentle | ✅ Easy |
| Retention Flexibility | ✅ Excellent | ✅ Good | ✅ Good |
| Log Parsing | ✅ Powerful | ⚠️ Basic | ⚠️ Basic |
Performance Comparison
Query: Find all ERROR logs in last hour
ELK:
- Query time: 0.5s
- Indexed field lookup
- Aggregations fast
Loki:
- Query time: 2-5s
- Scans log content
- Slower for content search
CloudWatch Insights:
- Query time: 5-30s
- Depends on data volume
- Can timeout on large queries
Storage Efficiency
1 TB raw logs over 30 days:
ELK:
- Indexed: 3 TB
- Cost: $300-600/month (storage only)
Loki:
- Indexed: 10 GB
- Object storage: 200 GB (compressed)
- Cost: $5-10/month (storage only)
CloudWatch:
- Storage: 1 TB
- Cost: $30/month (storage only)
- Plus ingestion: ~$500 for the 1 TB (at $0.50/GB)
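A rough calculator that ties these numbers together, using the per-unit prices quoted in this post as assumptions (actual pricing varies by region and tier):
# Rough monthly numbers using the per-unit prices quoted in this post.
# Everything here is an estimate/assumption, not a vendor quote.
daily_gb = 50
retention_days = 30

elk_storage_gb = daily_gb * retention_days * 3      # ~3x index expansion
loki_object_gb = daily_gb * retention_days * 0.1    # ~10x compression
cw_ingest_usd  = daily_gb * 30 * 0.50               # $0.50 per GB ingested
cw_storage_usd = daily_gb * retention_days * 0.03   # $0.03 per GB-month stored

print(f"ELK indexed storage : {elk_storage_gb:,.0f} GB")
print(f"Loki object storage : {loki_object_gb:,.0f} GB")
print(f"CloudWatch          : ${cw_ingest_usd + cw_storage_usd:,.0f}/month (ingest + storage)")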
Choosing the Right Solution
Decision Tree
Are you AWS-only?
├─ Yes: CloudWatch (unless >100 GB/day)
└─ No
   ├─ Need powerful full-text search?
   │  └─ Yes: ELK
   └─ No
      ├─ Have a DevOps team for operations?
      │  ├─ Yes: ELK or Loki
      │  └─ No: Loki or CloudWatch
      └─ Budget constrained?
         ├─ Yes: Loki
         └─ No: ELK
Use Case Recommendations
Startup (<10 services, <10 GB/day)
→ CloudWatch or Loki
- Simple setup
- Low cost
- Good enough functionality
Scale-up (10-50 services, 10-100 GB/day)
→ Loki
- Cost-effective
- Simple operations
- Integrates with Grafana
Enterprise (>50 services, >100 GB/day)
→ ELK
- Powerful search needed
- Complex queries
- Dedicated team available
AWS-Heavy (any size)
→ CloudWatch
- If <100 GB/day
- Managed service
- AWS integration
Hybrid Approaches
CloudWatch + S3 + Athena
Cheaper long-term storage:
# CloudWatch subscription filter -> Kinesis Firehose
resource "aws_cloudwatch_log_subscription_filter" "to_s3" {
  name            = "to-s3"
  log_group_name  = "/aws/lambda/myfunction"
  filter_pattern  = ""
  destination_arn = aws_kinesis_firehose_delivery_stream.logs.arn
  role_arn        = aws_iam_role.cwl_to_firehose.arn # IAM role letting CloudWatch Logs write to Firehose
}

# Firehose -> S3
resource "aws_kinesis_firehose_delivery_stream" "logs" {
  name        = "logs-to-s3"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose.arn
    bucket_arn = aws_s3_bucket.logs.arn

    # Compress logs
    compression_format = "GZIP"

    # Partition by date
    prefix              = "logs/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
    error_output_prefix = "errors/"
  }
}

# Query with Athena at $5/TB scanned. CloudWatch Logs Insights charges
# $0.005/GB scanned (the same per-GB rate); the savings come from cheaper S3
# storage plus compression and partitioning, which cut how much each query scans.
ELK for Hot Data + S3 for Cold
# Index Lifecycle Management
PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "s3_repository"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
Best Practices
Structured Logging
// Good: Structured JSON
{
  "timestamp": "2024-01-15T10:23:45Z",
  "level": "ERROR",
  "service": "api",
  "trace_id": "abc123",
  "message": "Database connection timeout",
  "error": {
    "type": "ConnectionTimeout",
    "database": "users-db",
    "duration_ms": 5000
  },
  "context": {
    "user_id": 12345,
    "endpoint": "/api/users"
  }
}

// Bad: Unstructured text
"2024-01-15 10:23:45 ERROR api database connection timeout after 5000ms for user 12345 on /api/users"
Use Appropriate Labels (Loki)
# Good: Low cardinality labels
{service="api", environment="prod", level="error"}
# Bad: High cardinality labels
{user_id="12345", request_id="abc123"}
# Creates millions of log streams
# Instead, use content filtering:
{service="api"} | json | user_id="12345"
Sampling High-Volume Logs
import logging
import random

logger = logging.getLogger(__name__)

def log_debug(message):
    # Sample 10% of debug logs to cut volume
    if random.random() < 0.1:
        logger.debug(message)
Set Retention Policies
Log types and retention:
- DEBUG: 3-7 days
- INFO: 30 days
- WARN/ERROR: 90 days
- AUDIT: 7 years (compliance)
- ACCESS: 90 days
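If you're on CloudWatch, retention like this can be applied per log group. Here is a small boto3 sketch; the group names are illustrative, and CloudWatch only accepts specific retention values (e.g. 2557 days for roughly 7 years).
# Sketch: applying per-log-group retention in CloudWatch with boto3.
import boto3

logs = boto3.client("logs", region_name="us-east-1")
retention = {
    "/myapp/debug": 7,
    "/myapp/application": 30,
    "/myapp/access": 90,
    "/myapp/audit": 2557,  # ~7 years; only specific values are accepted
}
for group, days in retention.items():
    logs.put_retention_policy(logGroupName=group, retentionInDays=days)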
Conclusion
Choosing a logging solution depends on your specific needs:
Choose ELK if:
- You need powerful full-text search
- You have >100 GB/day logs
- You have a dedicated platform team
- Complex queries and analytics are critical
Choose Loki if:
- You want low operational overhead
- Cost is a primary concern
- You use Grafana for monitoring
- Structured logging with good labels
Choose CloudWatch if:
- You're AWS-only
- You want zero operational overhead
- Log volume <100 GB/day
- AWS integration is valuable
Remember: Start simple. You can always migrate to a more sophisticated solution as your needs grow. The best logging solution is one your team will actually use effectively.
Need help implementing logging infrastructure? InstaDevOps provides expert consulting for observability, logging, and monitoring solutions. Contact us for a free consultation.
Need Help with Your DevOps Infrastructure?
At InstaDevOps, we specialize in helping startups and scale-ups build production-ready infrastructure without the overhead of a full-time DevOps team.
Our Services:
- 🏗️ AWS Consulting - Cloud architecture, cost optimization, and migration
- ☸️ Kubernetes Management - Production-ready clusters and orchestration
- 🚀 CI/CD Pipelines - Automated deployment pipelines that just work
- 📊 Monitoring & Observability - See what's happening in your infrastructure
Special Offer: Get a free DevOps audit - 50+ point checklist covering security, performance, and cost optimization.
📅 Book a Free 15-Min Consultation
Originally published at instadevops.com