We used to push every log line to our centralized log system. It was a mess. Here's why we stopped and what we do now.
The problem
Our log volume was growing 20% month-over-month. Most of it was debug-level stuff that nobody searched for. We were paying to store logs nobody read.
Worse: when we actually needed to find something, the noise made it harder. You can't grep usefully through a billion lines that are mostly heartbeats.
The rule we adopted
'Logs are for events humans or systems will query. Metrics are for counts. Traces are for request flow.'
Applying this (a sketch of the shipping filter follows the list):
- DEBUG logs: local only, never shipped
- INFO logs: shipped but aggressively sampled (1%)
- WARN logs: shipped in full
- ERROR logs: shipped in full, tagged with a request ID
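
Here's a minimal sketch of that policy as a filter on the handler that ships logs off-box, using Python's stdlib logging. The class name, the handler wiring, and the `request_id` field are illustrative assumptions, not our exact setup; only the level rules and the 1% rate come from the list above.

```python
import logging
import random

class ShippingFilter(logging.Filter):
    """Decide which records go to the centralized handler (hypothetical name)."""

    INFO_SAMPLE_RATE = 0.01  # ship 1% of INFO records

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # WARN/ERROR: shipped in full
        if record.levelno == logging.INFO:
            return random.random() < self.INFO_SAMPLE_RATE  # INFO: 1% sample
        return False  # DEBUG: local only, never shipped

# Attach the filter only to the handler that ships logs; a local
# console/file handler can still keep everything, including DEBUG.
shipped = logging.StreamHandler()  # stand-in for the real shipping handler
shipped.addFilter(ShippingFilter())

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)
logger.addHandler(shipped)

# ERROR records carry a request ID via the `extra` dict, per the rule above.
logger.error("payment failed", extra={"request_id": "req-1234"})
```

Random sampling is the simplest version; see the trap about transparency below for a deterministic alternative.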
Counts and rates moved to metrics, not logs. Request flow moved to traces, not logs.
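
To make the metrics half concrete, here's the before/after shape, using prometheus_client as one common option (the metric name and labels are made up for the example):

```python
from prometheus_client import Counter

LOGIN_ATTEMPTS = Counter(
    "login_attempts_total",       # illustrative metric name
    "Number of login attempts",
    ["outcome"],
)

def handle_login(ok: bool) -> None:
    # Before: logger.info("login attempt, ok=%s", ok)  -> one log line per event
    # After: one counter, queryable as a rate, no per-event storage cost
    LOGIN_ATTEMPTS.labels(outcome="success" if ok else "failure").inc()
```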
The results
- Log ingest cost down 70%
- Search queries 4x faster (less noise)
- We actually find things when we need to
The traps
People write INFO logs for debugging, then forget to remove them. A linter that flags high-volume log calls helped us catch this before it got to prod.
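
As a sketch of what such a linter checks, here's a toy version using Python's ast module: it flags `logger.debug`/`logger.info` calls inside loops, where volume tends to explode. A production version would live in your linter of choice; this only shows the core idea.

```python
import ast
import sys

NOISY_METHODS = {"debug", "info"}

class HighVolumeLogFinder(ast.NodeVisitor):
    def __init__(self):
        self.loop_depth = 0
        self.findings = []

    def _visit_loop(self, node):
        # Track nesting so we know when a call sits inside a loop body.
        self.loop_depth += 1
        self.generic_visit(node)
        self.loop_depth -= 1

    visit_For = visit_While = visit_AsyncFor = _visit_loop

    def visit_Call(self, node):
        func = node.func
        if (self.loop_depth > 0
                and isinstance(func, ast.Attribute)
                and func.attr in NOISY_METHODS):
            self.findings.append(node.lineno)
        self.generic_visit(node)

with open(sys.argv[1]) as f:
    tree = ast.parse(f.read())
finder = HighVolumeLogFinder()
finder.visit(tree)
for line in finder.findings:
    print(f"{sys.argv[1]}:{line}: log call inside a loop; consider a metric")
```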
Sampled logs can be confusing. 'Why did user X's request not show up?' Answer: it was sampled out. Make sure your sampling rules are transparent so engineers don't assume missing logs mean missing requests.
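
One way to make the rules transparent is deterministic sampling: key the decision off the request ID instead of a random roll, so the same request is always kept or always dropped, and anyone can reproduce why a given ID is missing. A sketch (the helper name and rate are illustrative):

```python
import hashlib

def ship_info_log(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically decide whether INFO logs for this request ship."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

# Anyone can reproduce the decision for any request ID:
print(ship_info_log("req-1234"))  # same answer every time for this ID
```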
Logs are one observability tool. Not the only one. Stop making them do everything.