At some point, logs stop helping.
Not because logging is bad.
Because the system is doing too much.
When you’re running something continuously, across multiple systems, logs turn into noise fast.
You still log everything.
You just can’t rely on it to understand what’s actually happening.
## The expectation
Early on, logging feels like the answer.
Something breaks → check logs → find the issue → fix it
Clean. Linear. Works in small systems.
## What actually happens

In production, it looks like this:

- thousands of log lines per minute
- multiple services writing at the same time
- retries creating duplicate entries
- partial failures that don’t throw clear errors
You open logs and see everything.
Which means you see nothing.
## The real problem

Logs tell you what happened.

They don’t tell you:

- what state the system is in
- what is currently broken
- what needs attention right now
And when things run continuously, that’s what you actually need.
## What we started doing instead
We still log. But we stopped treating logs as the source of truth.
### 1. Track state, not just events

Instead of just writing logs like:

- “order created”
- “order failed”

We track:

- current status of the order
- where it is in the flow
- what’s pending

So at any moment, we can answer: what’s stuck right now?
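A minimal sketch of what this can look like in Python. The `Order` shape and the status names are illustrative assumptions, not our actual system:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

# Hypothetical statuses, chosen for illustration.
class OrderStatus(Enum):
    CREATED = "created"
    PAYMENT_PENDING = "payment_pending"
    SHIPPED = "shipped"
    FAILED = "failed"

@dataclass
class Order:
    order_id: str
    status: OrderStatus
    updated_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def advance(self, new_status: OrderStatus) -> None:
        # Update the current state instead of only appending a log line.
        self.status = new_status
        self.updated_at = datetime.now(timezone.utc)

def stuck_orders(orders: list[Order]) -> list[Order]:
    # "What's stuck right now" becomes a query over state,
    # not a search through log lines.
    return [o for o in orders if o.status == OrderStatus.PAYMENT_PENDING]
```

The point isn’t the dataclass; it’s that “stuck” is answerable with a query instead of a grep.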
### 2. Surface problems, don’t search for them
Logs require you to go looking.
In real systems, you don’t have time for that.
So we build:

- alerts when something is off
- dashboards that show broken flows
- queues that show backlog
The system tells you where to look.
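As a sketch, “the system tells you where to look” can be as simple as a periodic check over queue depths. The queue names and thresholds here are made up for the example:

```python
def check_backlog(queue_depths: dict[str, int],
                  thresholds: dict[str, int]) -> list[str]:
    # Compare each queue's backlog to its threshold and return
    # an alert message for anything over the line.
    alerts = []
    for queue, depth in queue_depths.items():
        limit = thresholds.get(queue, 100)  # default limit is an assumption
        if depth > limit:
            alerts.append(f"{queue}: backlog {depth} exceeds {limit}")
    return alerts

# check_backlog({"emails": 250, "syncs": 10}, {"emails": 100})
# flags only the "emails" queue.
```

Run it on a schedule and push the result to wherever your team already looks, and nobody has to go searching.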
### 3. Group by flow, not by line
Logs are isolated lines.
But real issues happen across a sequence.
So we group things by:

- request
- entity
- workflow
Instead of reading 100 lines, you follow one story.
That’s where things start making sense again.
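One lightweight way to get this: stamp every event with a flow identifier and group on it. The `flow_id` field is an assumed convention for this sketch, not a feature of any particular logging library:

```python
from collections import defaultdict

def group_by_flow(events: list[dict]) -> dict[str, list[dict]]:
    # Assumes each event carries a "flow_id": the request, entity,
    # or workflow it belongs to. Events with the same id form one story.
    flows: defaultdict[str, list[dict]] = defaultdict(list)
    for event in events:
        flows[event["flow_id"]].append(event)
    return dict(flows)

events = [
    {"flow_id": "req-1", "msg": "order created"},
    {"flow_id": "req-2", "msg": "sync started"},
    {"flow_id": "req-1", "msg": "order failed"},
]
# group_by_flow(events)["req-1"] is the full story of one request.
```

Most structured-logging setups can carry an id like this; the grouping itself is trivial once the id exists.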
### 4. Accept that some issues won’t be obvious
Some problems don’t throw errors.
They just… stop moving.
A process gets stuck.
A sync silently fails.
Logs might show nothing critical.
So you need signals like:

- time thresholds
- missing updates
- “this should have finished by now”
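A “this should have finished by now” signal needs nothing more than a last-update timestamp and a deadline. A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_update: datetime, max_age: timedelta) -> bool:
    # Nothing threw an error; the only signal is silence.
    # If the last update is older than max_age, something stopped moving.
    return datetime.now(timezone.utc) - last_update > max_age
```

Pair this with the state tracking above and a stuck process surfaces itself, even when the logs show nothing critical.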
## What changed for me
I used to think: *if it’s logged, we can debug it.*

Now I think: *if we need logs to notice something is broken, we’re already late.*
Logs are for digging deeper.
Not for discovering the problem.
In systems that run all the time, you don’t watch everything manually.
The system needs to show you where it’s struggling.
Otherwise, you’re just scrolling and hoping you notice the right line.
This is something we run into a lot at BrainPack, where multiple systems are always moving and interacting. AI workflows depend on knowing the current state of everything, not just what happened, so observability has to go beyond logs.