Introduction
At small scale, logging is simple—tail a file, search with grep, done. But as your infrastructure grows to dozens of services across multiple servers, this approach breaks down. Finding a specific error across 100 containers, correlating events across services, or analyzing patterns in millions of log lines becomes impossible without the right tooling.
Modern logging solutions promise to solve these problems, but choosing the wrong one can be costly. The ELK Stack offers powerful search and analytics but requires significant operational overhead. Loki promises simplicity and cost savings but with feature tradeoffs. CloudWatch provides seamless AWS integration but can become expensive at scale.
In this comprehensive guide, we'll explore these three leading logging solutions, helping you choose the right one for your needs.
Why Logging at Scale is Hard
Volume
Small app: 1 server, 10 MB/day logs
✓ tail -f application.log
✓ grep "ERROR" application.log
At scale: 100 services, 50 GB/day logs
✗ Can't SSH to each server
✗ Can't grep 50GB of data
✗ Logs rotate, old ones compressed/deleted
✗ Distributed tracing across services needed
Velocity
1,000 requests/second × 100 services = 100,000 log events/second
Challenges:
- Ingesting 100K events/second
- Indexing for fast search
- Storing efficiently
- Querying without overwhelming system
Cardinality
High-cardinality fields (user IDs, request IDs) explode index sizes:
Low cardinality:
level: [DEBUG, INFO, WARN, ERROR] # 4 values
service: [api, worker, frontend] # 3 values
High cardinality:
request_id: unique per request # millions of values
user_id: one per user # millions of values
session_id: one per session # millions of values
Indexing high-cardinality fields = massive storage costs
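To make the blow-up concrete, here is a rough back-of-the-envelope sketch (the distinct-value counts are illustrative assumptions): the number of distinct field/label combinations, and therefore index entries or streams, is the product of each field's cardinality.
# Back-of-the-envelope: distinct label combinations multiply together.
# The value counts below are illustrative assumptions, not measurements.
low_cardinality  = {"level": 4, "service": 3, "environment": 2}
high_cardinality = {"request_id": 10_000_000, "user_id": 2_000_000}

def combinations(fields):
    total = 1
    for distinct_values in fields.values():
        total *= distinct_values
    return total

print(combinations(low_cardinality))   # 24 combinations -> tiny index
print(combinations(high_cardinality))  # 20,000,000,000,000 -> index explosion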
Retention
Retention requirements:
- Debug logs: 7 days
- Application logs: 30 days
- Audit logs: 7 years (compliance)
- Access logs: 90 days
At 50 GB/day:
- 7 days = 350 GB
- 30 days = 1.5 TB
- 90 days = 4.5 TB
- 7 years = 128 TB
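The storage figures above fall out of simple arithmetic; here is a quick sketch you can adapt to your own daily volume:
# Quick retention sizing at 50 GB/day of raw logs (matches the list above).
daily_gb = 50
for log_type, days in [("debug", 7), ("application", 30), ("access", 90), ("audit", 7 * 365)]:
    print(f"{log_type:12s} {days:4d} days -> {daily_gb * days / 1000:6.1f} TB")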
The ELK Stack (Elasticsearch, Logstash, Kibana)
Architecture
Application → Filebeat → Logstash → Elasticsearch → Kibana

Filebeat        (ship logs)
    ↓
Logstash        (parse, filter, enrich)
    ↓
Elasticsearch   (store & index)
    ↓
Kibana          (visualize & query)
Components
Elasticsearch: Distributed search and analytics engine
# Elasticsearch cluster
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: production
spec:
  version: 8.10.0
  nodeSets:
  # Master nodes
  - name: master
    count: 3
    config:
      node.roles: ["master"]
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 2Gi
              cpu: 1
            limits:
              memory: 2Gi
              cpu: 2
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
  # Data nodes
  - name: data
    count: 3
    config:
      node.roles: ["data", "ingest"]
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 8Gi
              cpu: 2
            limits:
              memory: 8Gi
              cpu: 4
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 500Gi
Logstash: Log parsing and enrichment
# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }

  # Parse nginx access logs
  if [service] == "nginx" {
    grok {
      match => {
        "message" => "%{COMBINEDAPACHELOG}"
      }
    }
    geoip {
      source => "clientip"
    }
  }

  # Add Kubernetes metadata
  if [kubernetes] {
    mutate {
      add_field => {
        "cluster"   => "%{[kubernetes][cluster]}"
        "namespace" => "%{[kubernetes][namespace]}"
        "pod"       => "%{[kubernetes][pod]}"
      }
    }
  }

  # Drop DEBUG events older than one day
  # (uses the logstash-filter-age plugin; comparing @timestamp against the
  # string "now-1d" is not valid Logstash conditional syntax)
  age {}
  if [level] == "DEBUG" and [@metadata][age] > 86400 {
    drop {}
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
Kibana: Visualization and querying
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: production
spec:
  version: 8.10.0
  count: 2
  elasticsearchRef:
    name: production
  podTemplate:
    spec:
      containers:
      - name: kibana
        resources:
          requests:
            memory: 1Gi
            cpu: 500m
          limits:
            memory: 2Gi
            cpu: 1
Strengths
Powerful Search: Full-text search with complex queries
# Kibana Query Language (KQL)
service:"api" AND level:"ERROR" AND status_code:500

# Regex on message content (regex requires Lucene query syntax; restrict the
# time range with Kibana's time picker, e.g. "Last 1 hour")
message:/database.*timeout/

# Aggregations (e.g. error counts grouped by service) aren't expressed in KQL;
# build them with Kibana visualizations or Elasticsearch aggregation queries
Rich Visualizations: Dashboards, graphs, charts
Ecosystem: Massive plugin ecosystem (alerting, ML, APM)
Mature: Battle-tested at enormous scale (Netflix, Uber, GitHub)
Weaknesses
Operational Complexity: Requires dedicated team
Operational tasks:
- Managing cluster health
- Shard allocation
- Index lifecycle management
- JVM heap tuning
- Backup and recovery
- Upgrades (complex)
- Performance tuning
Resource Intensive: High CPU and memory requirements
Typical cluster for 50GB/day:
- 3 master nodes: 2 CPU, 2GB RAM each
- 3 data nodes: 4 CPU, 8GB RAM each
- 500GB storage per data node
Total: 18 CPU, 30GB RAM, 1.5TB storage
Cost: ~$1,500-2,000/month (AWS/GCP)
Storage Costs: Indexes are storage-expensive
Log volume vs storage:
- Raw logs: 50 GB/day
- Indexed in Elasticsearch: 150 GB/day (3x)
- 30 days retention: 4.5 TB
When to Use ELK
✓ Need powerful full-text search
✓ Complex queries and aggregations
✓ Rich dashboards and visualizations
✓ Have dedicated DevOps/SRE team
✓ Budget for infrastructure
✓ Need mature ecosystem (alerts, ML, APM)
Cost Example
50 GB/day logs, 30-day retention:
Self-hosted on AWS:
- EC2 instances: $1,200/month
- EBS storage: $600/month
- Data transfer: $200/month
Total: ~$2,000/month
Managed (Elastic Cloud):
- $0.12/GB/month indexed
- 50 GB/day × 30 days × 3x index = 4.5 TB
- 4,500 GB × $0.12 = $540/month + compute
Total: ~$3,000-5,000/month
Grafana Loki
Architecture
Application → Promtail → Loki → Grafana
                          ↓
               Index: label metadata only
                          ↓
               Object storage (S3/GCS): log content
Core Concept: Labels, Not Indexes
Loki indexes metadata (labels), not log content:
ELK: Indexes EVERYTHING
"2024-01-15 10:23:45 ERROR database connection timeout"
↓
Indexes: [2024, 01, 15, 10, 23, 45, ERROR, database, connection, timeout]
Loki: Indexes ONLY labels
Labels: {service="api", level="error", cluster="prod"}
Content: "database connection timeout" (stored in S3, not indexed)
↓
Indexes: [service=api, level=error, cluster=prod]
This makes Loki dramatically cheaper:
50 GB/day logs:
- ELK indexed storage: 150 GB/day (3x)
- Loki indexed storage: 500 MB/day (0.01x)
- Loki object storage: 50 GB/day (compressed to ~5 GB)
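To see the label/content split in practice, here is a minimal sketch of pushing one log line to Loki's HTTP push API. The URL and labels are illustrative assumptions; timestamps are nanosecond strings per the push API format.
# Minimal sketch: push one log line to Loki. Only the "stream" labels are
# indexed; the log line itself is stored as-is in object storage.
import json
import time
import urllib.request

payload = {
    "streams": [{
        "stream": {"service": "api", "level": "error", "cluster": "prod"},  # indexed labels
        "values": [[str(time.time_ns()), "database connection timeout"]],   # unindexed content
    }]
}
req = urllib.request.Request(
    "http://loki:3100/loki/api/v1/push",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)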
Components
Promtail: Log collector and shipper
# promtail-config.yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Scrape Kubernetes pod logs
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Add namespace label
      - source_labels:
          - __meta_kubernetes_namespace
        target_label: namespace
      # Add pod label
      - source_labels:
          - __meta_kubernetes_pod_name
        target_label: pod
      # Add container label
      - source_labels:
          - __meta_kubernetes_pod_container_name
        target_label: container
      # Add app label
      - source_labels:
          - __meta_kubernetes_pod_label_app
        target_label: app
Loki: Log aggregation and storage
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki
data:
  loki.yaml: |
    auth_enabled: false

    server:
      http_listen_port: 3100

    ingester:
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
      chunk_idle_period: 5m
      chunk_retain_period: 30s

    schema_config:
      configs:
        - from: 2023-01-01
          store: boltdb-shipper
          object_store: s3
          schema: v11
          index:
            prefix: index_
            period: 24h

    storage_config:
      boltdb_shipper:
        active_index_directory: /loki/index
        cache_location: /loki/cache
        shared_store: s3
      aws:
        s3: s3://us-east-1/loki-logs
        s3forcepathstyle: true

    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h

    chunk_store_config:
      max_look_back_period: 720h

    table_manager:
      retention_deletes_enabled: true
      retention_period: 720h
LogQL: Loki Query Language
# Basic label filtering
{app="api", level="error"}
# Filter by log content (slower, scans content)
{app="api"} |= "database timeout"
# Regular expression filtering
{app="api"} |~ "error.*database"
# Exclude logs
{app="api"} != "debug"
# Parse JSON and filter
{app="api"} | json | status_code="500"
# Rate query: per-second rate of error lines over the last 5 minutes
rate({app="api", level="error"}[5m])
# Aggregation: Sum by service
sum by (service) (rate({level="error"}[5m]))
Strengths
Cost-Effective: 10x cheaper storage than Elasticsearch
50 GB/day logs, 30-day retention:
Loki on AWS:
- EC2 (small): $200/month
- S3 storage: 150 GB × $0.023 = $3.45/month
- Data transfer: $50/month
Total: ~$250-300/month
8x cheaper than ELK!
Simple to Operate: Minimal operational overhead
Loki cluster:
- Single binary
- No complex shard management
- No JVM tuning
- Uses object storage (S3, GCS)
- Easy scaling
Integrates with Grafana: Unified observability platform
Label-Based: Forces good logging practices
Weaknesses
Limited Search: No full-text indexing
# Slow (scans all log content):
{app="api"} |= "specific error message"
# Fast (uses index):
{app="api", level="error"}
# Very slow (Loki requires at least one label matcher, so use a broad one):
{cluster=~".+"} |= "find this anywhere"  # effectively scans everything
Fewer Features: Less mature than ELK
Label Cardinality Limits: Can't use high-cardinality labels
# Bad: High cardinality
{user_id="12345"} # Millions of unique users = millions of streams
# Good: Low cardinality
{service="api", environment="prod"} # Few unique values
# Instead, filter content:
{service="api"} | json | user_id="12345"
When to Use Loki
✓ Want low operational overhead
✓ Cost-conscious
✓ Already using Grafana
✓ Structured logging with good labels
✓ Don't need complex full-text search
✓ Small team
Cost Example
50 GB/day logs, 30-day retention:
Self-hosted on AWS:
- EC2 instance (t3.large): $60/month
- S3 storage: 1.5 TB × $0.023 = $35/month
- Data transfer: $50/month
Total: ~$150/month
Grafana Cloud (managed):
- $0.50/GB ingested
- 50 GB/day × 30 days = 1.5 TB
- 1,500 GB × $0.50 = $750/month
Total: ~$750/month
AWS CloudWatch Logs
Architecture
Application → CloudWatch Agent → CloudWatch Logs → CloudWatch Logs Insights
                                        ↓
                              (fully managed by AWS)
Components
CloudWatch Agent: Collects logs from EC2/ECS/EKS
// cloudwatch-agent-config.json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/application.log",
            "log_group_name": "/aws/ec2/application",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          }
        ]
      }
    }
  }
}
Fluent Bit for Kubernetes:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info

    [INPUT]
        Name         tail
        Path         /var/log/containers/*.log
        Parser       docker
        Tag          kube.*

    [FILTER]
        Name             kubernetes
        Match            kube.*
        Kube_URL         https://kubernetes.default.svc:443
        Kube_CA_File     /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File  /var/run/secrets/kubernetes.io/serviceaccount/token

    [OUTPUT]
        Name                cloudwatch_logs
        Match               *
        region              us-east-1
        log_group_name      /aws/eks/cluster
        log_stream_prefix   application-
        auto_create_group   true
CloudWatch Insights Query Language
# Find errors in last hour
fields @timestamp, @message
| filter @message like /ERROR/
| filter @timestamp > ago(1h)
| sort @timestamp desc
| limit 100
# Count errors by service
fields service
| filter level = "ERROR"
| stats count() by service
# P95 latency
fields @timestamp, duration
| filter service = "api"
| stats pct(duration, 95) as p95
# Parse JSON logs
fields @timestamp, @message
| parse @message /\{"level":"(?<level>.*?)","service":"(?<service>.*?)"\}/
| filter level = "ERROR"
| stats count() by service
Strengths
Fully Managed: Zero operational overhead
AWS Integration: Native integration with all AWS services
# Automatic logs from:
- Lambda functions
- ECS containers
- RDS databases
- API Gateway
- CloudFront
- VPC Flow Logs
- And 50+ more services
Simple Setup: Easy to get started
Security: IAM integration, encryption at rest
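To illustrate the "simple setup" point above, here is a minimal boto3 sketch that creates a log group and stream and pushes a single event. The group and stream names are illustrative assumptions.
# Minimal boto3 sketch: create a log group/stream and push one event.
# create_log_group raises an error if the group already exists.
import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")
logs.create_log_group(logGroupName="/myapp/api")
logs.create_log_stream(logGroupName="/myapp/api", logStreamName="instance-1")
logs.put_log_events(
    logGroupName="/myapp/api",
    logStreamName="instance-1",
    logEvents=[{"timestamp": int(time.time() * 1000), "message": "ERROR database timeout"}],
)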
Weaknesses
Expensive at Scale: Costs grow linearly
50 GB/day ingestion:
- Ingestion: 50 GB × $0.50 = $25/day = $750/month
- Storage: 1.5 TB × $0.03 = $45/month
- Insights queries: ~$200/month
Total: ~$1,000/month
At 500 GB/day: ~$10,000/month
Limited Query Capabilities: Basic compared to ELK
AWS Lock-In: Can't easily migrate to another cloud
No Built-In Log Alerting: Alerting requires wiring up metric filters plus CloudWatch Alarms (separate services)
When to Use CloudWatch
✓ AWS-only infrastructure
✓ Want zero operational overhead
✓ Small to medium log volume (<100 GB/day)
✓ Already using AWS services
✓ Need compliance/security (AWS certified)
✓ Don't want to manage infrastructure
Cost Example
50 GB/day logs, 30-day retention:
- Ingestion: $0.50/GB = $25/day = $750/month
- Storage: 1.5 TB × $0.03/GB = $45/month
- Insights queries: ~$200/month
Total: ~$1,000/month
With CloudWatch subscription filters to S3:
- Ingestion: $750/month
- S3 storage: 1.5 TB × $0.023 = $35/month
- Athena queries: ~$50/month
Total: ~$835/month
Detailed Comparison
Feature Matrix
| Feature | ELK | Loki | CloudWatch |
|---|---|---|---|
| Full-Text Search | ✅ Excellent | ⚠️ Limited | ⚠️ Basic |
| Cost (50GB/day) | $2,000/mo | $250/mo | $1,000/mo |
| Operational Complexity | ❌ High | ✅ Low | ✅ None |
| Query Performance | ✅ Fast | ⚠️ Slower | ⚠️ Variable |
| Scalability | ✅ Excellent | ✅ Good | ✅ Excellent |
| Visualization | ✅ Rich (Kibana) | ✅ Good (Grafana) | ⚠️ Basic |
| Alerting | ✅ Built-in | ✅ Via Grafana | ✅ CloudWatch Alarms |
| Multi-Cloud | ✅ Yes | ✅ Yes | ❌ AWS only |
| Learning Curve | ❌ Steep | ✅ Gentle | ✅ Easy |
| Retention Flexibility | ✅ Excellent | ✅ Good | ✅ Good |
| Log Parsing | ✅ Powerful | ⚠️ Basic | ⚠️ Basic |
Performance Comparison
Query: Find all ERROR logs in last hour
ELK:
- Query time: 0.5s
- Indexed field lookup
- Aggregations fast
Loki:
- Query time: 2-5s
- Scans log content
- Slower for content search
CloudWatch Insights:
- Query time: 5-30s
- Depends on data volume
- Can timeout on large queries
Storage Efficiency
1 TB raw logs over 30 days:
ELK:
- Indexed: 3 TB
- Cost: $300-600/month (storage only)
Loki:
- Indexed: 10 GB
- Object storage: 200 GB (compressed)
- Cost: $5-10/month (storage only)
CloudWatch:
- Storage: 1 TB
- Cost: $30/month (storage only)
- Plus ingestion: ~$500 for the 1 TB (at $0.50/GB)
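A rough calculator that ties these numbers together, using the per-unit prices quoted in this post as assumptions (actual pricing varies by region and tier):
# Rough monthly numbers using the per-unit prices quoted in this post.
# Everything here is an estimate/assumption, not a vendor quote.
daily_gb = 50
retention_days = 30

elk_storage_gb = daily_gb * retention_days * 3      # ~3x index expansion
loki_object_gb = daily_gb * retention_days * 0.1    # ~10x compression
cw_ingest_usd  = daily_gb * 30 * 0.50               # $0.50 per GB ingested
cw_storage_usd = daily_gb * retention_days * 0.03   # $0.03 per GB-month stored

print(f"ELK indexed storage : {elk_storage_gb:,.0f} GB")
print(f"Loki object storage : {loki_object_gb:,.0f} GB")
print(f"CloudWatch          : ${cw_ingest_usd + cw_storage_usd:,.0f}/month (ingest + storage)")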
Choosing the Right Solution
Decision Tree
Are you AWS-only?
├─ Yes: CloudWatch (unless >100 GB/day)
└─ No
   ├─ Need powerful full-text search?
   │  └─ Yes: ELK
   └─ No
      ├─ Have a DevOps team for operations?
      │  ├─ Yes: ELK or Loki
      │  └─ No: Loki or CloudWatch
      └─ Budget constrained?
         ├─ Yes: Loki
         └─ No: ELK
Use Case Recommendations
Startup (<10 services, <10 GB/day)
→ CloudWatch or Loki
- Simple setup
- Low cost
- Good enough functionality
Scale-up (10-50 services, 10-100 GB/day)
→ Loki
- Cost-effective
- Simple operations
- Integrates with Grafana
Enterprise (>50 services, >100 GB/day)
→ ELK
- Powerful search needed
- Complex queries
- Dedicated team available
AWS-Heavy (any size)
→ CloudWatch
- If <100 GB/day
- Managed service
- AWS integration
Hybrid Approaches
CloudWatch + S3 + Athena
Cheaper long-term storage:
# CloudWatch subscription filter -> Kinesis Firehose
resource "aws_cloudwatch_log_subscription_filter" "to_s3" {
  name            = "to-s3"
  log_group_name  = "/aws/lambda/myfunction"
  filter_pattern  = ""
  destination_arn = aws_kinesis_firehose_delivery_stream.logs.arn
  role_arn        = aws_iam_role.cwl_to_firehose.arn # IAM role letting CloudWatch Logs write to Firehose
}

# Firehose -> S3
resource "aws_kinesis_firehose_delivery_stream" "logs" {
  name        = "logs-to-s3"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose.arn
    bucket_arn = aws_s3_bucket.logs.arn

    # Compress logs
    compression_format = "GZIP"

    # Partition by date
    prefix              = "logs/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
    error_output_prefix = "errors/"
  }
}

# Query with Athena at $5/TB scanned. CloudWatch Logs Insights charges
# $0.005/GB scanned (the same per-GB rate); the savings come from cheaper S3
# storage plus compression and partitioning, which cut how much each query scans.
ELK for Hot Data + S3 for Cold
# Index Lifecycle Management
PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "s3_repository"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
Best Practices
Structured Logging
// Good: Structured JSON
{
  "timestamp": "2024-01-15T10:23:45Z",
  "level": "ERROR",
  "service": "api",
  "trace_id": "abc123",
  "message": "Database connection timeout",
  "error": {
    "type": "ConnectionTimeout",
    "database": "users-db",
    "duration_ms": 5000
  },
  "context": {
    "user_id": 12345,
    "endpoint": "/api/users"
  }
}

// Bad: Unstructured text
"2024-01-15 10:23:45 ERROR api database connection timeout after 5000ms for user 12345 on /api/users"
Use Appropriate Labels (Loki)
# Good: Low cardinality labels
{service="api", environment="prod", level="error"}
# Bad: High cardinality labels
{user_id="12345", request_id="abc123"}
# Creates millions of log streams
# Instead, use content filtering:
{service="api"} | json | user_id="12345"
Sampling High-Volume Logs
import logging
import random

logger = logging.getLogger(__name__)

def log_debug(message):
    # Sample 10% of debug logs to cut volume
    if random.random() < 0.1:
        logger.debug(message)
Set Retention Policies
Log types and retention:
- DEBUG: 3-7 days
- INFO: 30 days
- WARN/ERROR: 90 days
- AUDIT: 7 years (compliance)
- ACCESS: 90 days
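If you're on CloudWatch, retention like this can be applied per log group. Here is a small boto3 sketch; the group names are illustrative, and CloudWatch only accepts specific retention values (e.g. 2557 days for roughly 7 years).
# Sketch: applying per-log-group retention in CloudWatch with boto3.
import boto3

logs = boto3.client("logs", region_name="us-east-1")
retention = {
    "/myapp/debug": 7,
    "/myapp/application": 30,
    "/myapp/access": 90,
    "/myapp/audit": 2557,  # ~7 years; only specific values are accepted
}
for group, days in retention.items():
    logs.put_retention_policy(logGroupName=group, retentionInDays=days)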
Conclusion
Choosing a logging solution depends on your specific needs:
Choose ELK if:
- You need powerful full-text search
- You have >100 GB/day logs
- You have a dedicated platform team
- Complex queries and analytics are critical
Choose Loki if:
- You want low operational overhead
- Cost is a primary concern
- You use Grafana for monitoring
- Structured logging with good labels
Choose CloudWatch if:
- You're AWS-only
- You want zero operational overhead
- Log volume <100 GB/day
- AWS integration is valuable
Remember: Start simple. You can always migrate to a more sophisticated solution as your needs grow. The best logging solution is one your team will actually use effectively.
Need help implementing logging infrastructure? InstaDevOps provides expert consulting for observability, logging, and monitoring solutions. Contact us for a free consultation.
Need Help with Your DevOps Infrastructure?
At InstaDevOps, we specialize in helping startups and scale-ups build production-ready infrastructure without the overhead of a full-time DevOps team.
Our Services:
- 🏗️ AWS Consulting - Cloud architecture, cost optimization, and migration
- ☸️ Kubernetes Management - Production-ready clusters and orchestration
- 🚀 CI/CD Pipelines - Automated deployment pipelines that just work
- 📊 Monitoring & Observability - See what's happening in your infrastructure
Special Offer: Get a free DevOps audit - 50+ point checklist covering security, performance, and cost optimization.
📅 Book a Free 15-Min Consultation
Originally published at instadevops.com