$12,000/Month for Logs Nobody Reads
Our logging bill was $12,000/month. We were ingesting 2TB/day. When I asked the team what percentage of logs they actually looked at during incidents, the answer was embarrassing: about 5%.
We were paying to store 95% noise.
The Log Audit
First, I categorized all log sources by value:
High value (always needed during incidents):
Application errors (stack traces)
Authentication events
Business transactions
External API calls with responses
Health check failures
Medium value (sometimes useful):
Request/response logs (sampled)
Performance metrics in logs
Deployment events
Configuration changes
Low value (almost never needed):
Debug/trace level logs
Health check successes
Static asset requests
Heartbeat messages
Verbose framework logs
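The audit above is easier to enforce if the tiers live in code instead of a wiki page. A minimal sketch, with hypothetical source names standing in for whatever your pipeline actually tags:

```python
# Hypothetical sketch: map each log source to a value tier so routing
# rules downstream can keep, sample, or drop it. Source names are
# placeholders, not from any real pipeline.
VALUE_TIERS = {
    'high':   {'app_error', 'auth_event', 'business_txn',
               'external_api_call', 'health_check_failure'},
    'medium': {'request_log', 'perf_metric', 'deploy_event', 'config_change'},
    'low':    {'debug_trace', 'health_check_success', 'static_asset',
               'heartbeat', 'framework_verbose'},
}

def tier_for(source: str) -> str:
    """Return the value tier for a log source."""
    for tier, sources in VALUE_TIERS.items():
        if source in sources:
            return tier
    # Unknown sources default to medium until someone audits them
    return 'medium'
```

Defaulting unknown sources to medium (rather than dropping them) keeps new services from silently losing logs before they've been audited.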
Strategy 1: Log Levels as a Service
We made log levels dynamic. In production the default is WARN; during an incident, we flip the affected service to DEBUG:
import logging
import os

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Log level from environment variable, changeable at runtime
LOG_LEVEL = os.environ.get('LOG_LEVEL', 'WARNING')
logging.basicConfig(level=getattr(logging, LOG_LEVEL))

# Endpoint to change log level without restart
@app.post('/admin/log-level')
async def set_log_level(level: str):
    level = level.upper()
    # Reject names like "VERBOSE" that aren't real logging levels
    if not isinstance(getattr(logging, level, None), int):
        raise HTTPException(status_code=400, detail=f'unknown log level: {level}')
    logging.getLogger().setLevel(getattr(logging, level))
    return {'status': 'ok', 'level': level}
In Kubernetes (note that kubectl set env triggers a rolling restart; the endpoint above changes the level in place):
# Normal operation
kubectl set env deployment/api LOG_LEVEL=WARNING
# During incident
kubectl set env deployment/api LOG_LEVEL=DEBUG
# After incident
kubectl set env deployment/api LOG_LEVEL=WARNING
Strategy 2: Tiered Retention
retention_policy:
  hot_storage:    # Fast search, expensive
    duration: 7 days
    filter: "level >= WARN OR tag:business_event"
  warm_storage:   # Slower search, cheaper
    duration: 30 days
    filter: "level >= INFO"
  cold_storage:   # Archive only, cheapest
    duration: 365 days
    filter: "tag:audit OR tag:compliance"
  drop:           # Don't store at all
    filter: "level = DEBUG OR source:health_check"
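In code, that policy is a small routing function. This is a sketch under assumptions: each record is a dict with a 'level' plus optional 'tags' and 'source', and each record goes to a single tier (a real pipeline would often fan a record out to every tier whose filter matches):

```python
# Hypothetical sketch of the routing logic the retention policy describes.
# Record shape is an assumption: {'level': ..., 'tags': [...], 'source': ...}
LEVELS = {'DEBUG': 10, 'INFO': 20, 'WARN': 30, 'ERROR': 40}

def route(record):
    """Return 'hot', 'warm', or 'cold', or None to drop the record."""
    level = LEVELS.get(record.get('level', 'INFO'), 20)
    tags = set(record.get('tags', []))
    # Drop rule runs first: DEBUG noise and health checks never get stored
    if level <= LEVELS['DEBUG'] or record.get('source') == 'health_check':
        return None
    # Hot: warnings and above, plus anything tagged as a business event
    if level >= LEVELS['WARN'] or 'business_event' in tags:
        return 'hot'
    # Cold: audit/compliance records kept for a year
    if 'audit' in tags or 'compliance' in tags:
        return 'cold'
    # Warm: everything else at INFO and above
    return 'warm'
```

Putting the drop rule first means a DEBUG line tagged audit is still discarded; if compliance logs can arrive at DEBUG level, reorder the checks.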
Strategy 3: Structured Logging
Unstructured logs are expensive to parse. Structured logs are cheap to query:
# Bad: Unstructured
logger.info(f"User {user_id} purchased {product_id} for ${amount}")
# Parsing this requires regex, which costs compute

# Good: Structured
logger.info("purchase_completed", extra={
    'user_id': user_id,
    'product_id': product_id,
    'amount': amount,
    'currency': 'USD',
})
# Output: {"message": "purchase_completed", "user_id": "u123", ...}
# Queryable without parsing
Strategy 4: Sample Verbose Logs
import random

def should_log_request(request):
    # Always log errors
    if request.status_code >= 400:
        return True
    # Always log slow requests
    if request.duration_ms > 1000:
        return True
    # Sample 10% of successful requests
    return random.random() < 0.10
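One caveat with per-line random sampling: if a request emits several log lines, you can keep half of them and drop the rest. A common fix is to hash a request identifier instead, so sampling is all-or-nothing per request. A sketch, assuming you have some trace_id available (the name is an assumption, not from the code above):

```python
import hashlib

def should_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministic sampling: hash the trace ID so every log line from a
    sampled request is kept together, instead of a random subset."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash to a value in [0, 1)
    value = int.from_bytes(digest[:8], 'big') / 2**64
    return value < rate
```

The same trace ID always gives the same answer, so the decision can be made independently on every service a request passes through.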
The Results
Before:
Daily ingestion: 2 TB
Monthly cost: $12,000
Useful data: ~5%
After:
Daily ingestion: 400 GB
Monthly cost: $3,600
Useful data: ~70%
We cut costs by 70% AND improved signal quality. Searches are faster because there's less noise. Incidents resolve quicker because relevant logs surface immediately.
The Rule
Before adding a log statement, ask: "Will someone look at this during an incident?" If the answer is no, it's DEBUG level at most.
If you're spending too much on logs and want smarter log management, check out what we're building at Nova AI Ops.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com