Dhruvi

Why Logging Is Not Enough When You Operate Systems Continuously

At some point, logs stop helping.

Not because logging is bad.
Because the system is doing too much.

When you’re running something continuously across multiple systems, logs turn into noise fast.

You still log everything.
You just can’t rely on it to understand what’s actually happening.

The expectation

Early on, logging feels like the answer.

Something breaks → check logs → find the issue → fix it

Clean. Linear. Works in small systems.

What actually happens

In production, it looks like this:

thousands of log lines per minute
multiple services writing at the same time
retries creating duplicate entries
partial failures that don’t throw clear errors

You open logs and see everything.

Which means you see nothing.

The real problem

Logs tell you what happened.

They don’t tell you:

what state the system is in
what is currently broken
what needs attention right now

And when things run continuously, that’s what you actually need.

What we started doing instead

We still log. But we stopped treating logs as the source of truth.

1. Track state, not just events

Instead of just writing logs like:

“order created”
“order failed”

We track:

current status of the order
where it is in the flow
what’s pending

So at any moment, we can answer:

what’s stuck right now?
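
Here’s a minimal sketch of the idea, using SQLite as a stand-in state store. The order_state table, the column names, and the statuses are all invented for illustration; the point is that state gets upserted, not appended.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical state table: one row per order, holding its *current*
# status and position in the flow -- not an append-only event log.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_state (
        order_id   TEXT PRIMARY KEY,
        status     TEXT NOT NULL,   -- e.g. 'created', 'payment_pending', 'completed'
        step       TEXT NOT NULL,   -- where the order is in the flow
        updated_at TEXT NOT NULL
    )
""")

def set_state(order_id: str, status: str, step: str) -> None:
    """Upsert the order's current state; the log line becomes a side effect."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        """INSERT INTO order_state VALUES (?, ?, ?, ?)
           ON CONFLICT(order_id) DO UPDATE SET
               status = excluded.status,
               step = excluded.step,
               updated_at = excluded.updated_at""",
        (order_id, status, step, now),
    )
    print(f"log: order {order_id} -> {status} ({step})")  # we still log everything

def stuck_orders() -> list[tuple]:
    """Answer 'what is stuck right now?' with a query, not a log search."""
    return conn.execute(
        "SELECT order_id, status, step FROM order_state "
        "WHERE status NOT IN ('completed', 'cancelled')"
    ).fetchall()

set_state("ord-1", "created", "validate")
set_state("ord-1", "payment_pending", "charge")
set_state("ord-2", "completed", "done")
print(stuck_orders())  # [('ord-1', 'payment_pending', 'charge')]
```

The log lines still exist. But “what’s stuck?” is now a query, not a search.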

2. Surface problems, don’t search for them

Logs require you to go looking.

In real systems, you don’t have time for that.

So we build:

alerts when something is off
dashboards that show broken flows
queues that show backlog

The system tells you where to look.
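
A rough sketch of that kind of check, assuming a hypothetical per-flow backlog snapshot. The flow names and thresholds are made up:

```python
from datetime import timedelta

# Hypothetical backlog snapshot: flow name -> queue depth and oldest item age.
# In practice this comes from your queue or database, not a hardcoded dict.
backlog = {
    "order-sync":  {"queued": 1450, "oldest": timedelta(minutes=42)},
    "invoice-gen": {"queued": 12,   "oldest": timedelta(minutes=3)},
}

# Thresholds that define "something is off" for a flow.
MAX_QUEUED = 500
MAX_AGE = timedelta(minutes=15)

def check_flows() -> list[str]:
    """Return alerts for flows that are off, so nobody has to go looking."""
    alerts = []
    for flow, stats in backlog.items():
        if stats["queued"] > MAX_QUEUED:
            alerts.append(f"{flow}: backlog at {stats['queued']} items")
        if stats["oldest"] > MAX_AGE:
            alerts.append(f"{flow}: oldest item waiting {stats['oldest']}")
    return alerts

for alert in check_flows():
    print("ALERT:", alert)  # in production: page, Slack message, dashboard tile
```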

3. Group by flow, not by line

Logs are isolated lines.

But real issues happen across a sequence.

So we group things by:

request
entity
workflow

Instead of reading 100 lines, you follow one story.

That’s where things start making sense again.
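
One cheap way to get that grouping is a correlation ID that travels with the flow and gets stamped onto every log record. A sketch in Python, where flow_id is an arbitrary name I picked for the example:

```python
import logging
import uuid
from contextvars import ContextVar

# One correlation id per request/workflow, carried through the whole flow.
flow_id: ContextVar[str] = ContextVar("flow_id", default="-")

class FlowFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.flow_id = flow_id.get()  # stamp the id onto every record
        return True

logging.basicConfig(format="%(flow_id)s %(levelname)s %(message)s",
                    level=logging.INFO)
log = logging.getLogger("orders")
log.addFilter(FlowFilter())

def handle_order(order_id: str) -> None:
    flow_id.set(uuid.uuid4().hex[:8])  # one id for the entire flow
    log.info("order %s received", order_id)
    log.info("order %s charged", order_id)
    log.info("order %s shipped", order_id)

handle_order("ord-1")
```

Filter on that first column and you’re reading one story, not interleaved noise from every order at once.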

4. Accept that some issues won’t be obvious

Some problems don’t throw errors.

They just… stop moving.

A process gets stuck.
A sync silently fails.

Logs might show nothing critical.

So you need signals like:

time thresholds
missing updates
“this should have finished by now”
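
A minimal version of that signal is just a deadline check against the last update. The process names and deadlines below are placeholders:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical in-flight processes with their last heartbeat/update time.
# A silent failure throws no error -- it just stops updating.
now = datetime.now(timezone.utc)
in_flight = {
    "nightly-sync": now - timedelta(hours=3),
    "order-1234":   now - timedelta(minutes=2),
}

# Per-process deadline: "this should have finished (or moved) by now".
DEADLINES = {
    "nightly-sync": timedelta(hours=1),
    "order-1234":   timedelta(minutes=30),
}

def stalled() -> list[str]:
    """Flag anything whose last update is older than its deadline."""
    return [
        name for name, last_update in in_flight.items()
        if now - last_update > DEADLINES[name]
    ]

print(stalled())  # ['nightly-sync'] -- nothing errored, but it's stuck
```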

What changed for me

I used to think:

if it’s logged, we can debug it

Now I think:

if we need logs to notice something is broken, we’re already late

Logs are for digging deeper.

Not for discovering the problem.

In systems that run all the time, you don’t watch everything manually.

The system needs to show you where it’s struggling.

Otherwise, you’re just scrolling and hoping you notice the right line.

This is something we run into a lot at BrainPack, where multiple systems are always moving and interacting. AI workflows depend on knowing the current state of everything, not just what happened, so observability has to go beyond logs.
