Quentin Guenther

When Green Dashboards Lie: Up Is Not Usable

Originally published on qguenther.dev — sharing here for broader discussion.


A system can be technically available and still feel broken.

That gap shows up a lot in distributed systems. Dashboards are green. Error rates are within threshold. Core services are still responding. But a user is staring at a disabled button, a spinner that never resolves, or a confirmation state they no longer trust.

One pattern I have seen repeatedly is that backend metrics describe service health, while users experience workflow health. Those are related, but they are not the same thing.

That is why operating distributed systems with a frontend mindset is useful, especially if you do not write frontend code. I call it a frontend mindset not because it requires frontend skills, but because frontend engineers are forced to confront these questions first. The interface cannot hide behind "dependency timeout" as an explanation. It has to decide what to do next.

The core shift is simple: start with what the user can still do, what state they can trust, and how the interface behaves when dependencies degrade. Then work backward to the services, queues, retries, and caches underneath.

Split illustration: on the left, a green system dashboard shows healthy metrics like uptime, performance, and user activity; on the right, a frustrated person sits at a laptop with a loading spinner, highlighting a broken user experience despite healthy backend signals.

Green Dashboards, Broken Workflows

Most operational models start from the server inward.

Teams track latency, throughput, saturation, and error rate. They alert on dependency failures. They define service-level objectives around availability. All of that matters. None of it tells the full story of whether the system is still usable.

In large systems, the most painful failures are often partial failures:

  • A page shell renders, but the important action never becomes available.

  • Reads still succeed from cache, but the write path is degraded and users cannot tell whether a mutation completed.

  • Retries keep backend throughput acceptable, but the UI enters contradictory states.

  • A service stays within its availability target while a critical multi-step workflow effectively stops working.

From an infrastructure perspective, these can look like manageable incidents.

From a product perspective, trust is already being lost.

Users do not experience your service topology. They experience waiting, ambiguity, stale data, duplicate actions, and broken momentum.

One simple version of that pattern looks like this:

Flow showing how services can return 200 OK while the read model stays stale, the UI shows old state, and the user retries—leading to an ambiguous outcome despite green service dashboards.
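That flow can be reduced to a toy in-memory model. The sketch below is illustrative only: it assumes a hypothetical write store and a lagging read replica, not any real API from the article.

```typescript
// Toy model of the flow above: the write succeeds and the API reports
// success, but the read model has not caught up, so the state the user
// sees disagrees with the state the server confirmed.
// All names here are illustrative.

type Order = { id: string; status: "placed" };

const writeStore = new Map<string, Order>();
let readReplica = new Map<string, Order>(); // lags behind writeStore

function placeOrder(id: string): { ok: boolean } {
  writeStore.set(id, { id, status: "placed" }); // write path succeeds
  return { ok: true };                          // API returns 200 OK
}

function getOrder(id: string): Order | undefined {
  return readReplica.get(id); // reads are served from the stale replica
}

function replicate(): void {
  readReplica = new Map(writeStore); // replication catches up later
}

// The user places an order, then immediately refreshes:
const response = placeOrder("order-1");
const seenByUser = getOrder("order-1"); // stale: nothing visible yet

// Only after replication does visible state agree with the write:
replicate();
const afterCatchUp = getOrder("order-1");
```

In the gap between the success response and `replicate()`, every service involved is "up", yet the user has good reason to click again.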

What a Frontend Mindset Actually Changes

A frontend mindset is not a preference for UI over backend work. It is an operating stance.

It treats distributed systems as a series of promises made to a user:

  • Will this screen become actionable in a reasonable amount of time?

  • If an action is delayed, will the user understand whether it is pending, failed, or already applied?

  • If data is stale, is that safe, visible, and recoverable?

  • If one dependency fails, which parts of the workflow should keep working?

Instead of asking only "Is the service up?", you start asking:

  • Can the user still complete the task they came for?

  • Did we preserve confidence or create ambiguity?

  • Are we measuring availability, or are we measuring usable behavior?

That shift is easier to see side by side:

Same incident viewed two ways: the backend asks whether the service is up; the frontend asks whether the user can still complete the task.

Backend telemetry tells you whether the platform is still responding. A frontend mindset tells you whether the system is still keeping its promise to the user.

Failure Semantics Eventually Become Product Behavior

Latency becomes waiting.

Eventual consistency becomes confusion when users do not see the result of their action reflected quickly enough.

Retries become duplicate submissions unless mutation semantics and UI states are designed carefully.

Fallbacks become product decisions about what to hide, what to keep read-only, and what to clearly mark as degraded.

Consider a common pattern in distributed applications:

The page loads from cached or partially available data, which makes overall availability look strong. But the action the user actually cares about depends on a slower write path, a downstream authorization check, or a fan-out to several services. The screen appears alive, yet the workflow is fragile.

This is a distributed systems problem; it just becomes visible in the product before it shows up in backend metrics.

The retry path usually looks something like this:

Sequence where the user places an order, the API returns success, but the read model returns stale data so success and visible state disagree; the user retries, creating duplicate risk.

Retry behavior and confirmation states cannot be designed independently. The backend and UI are participating in the same failure semantics.

UI mock showing a user action, a success response, a stale status after refresh, and a second click that creates ambiguity.
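One common way to make that shared failure semantic concrete is an idempotency key: the client generates one key per logical user intent, and the server deduplicates retries against it. A minimal sketch under those assumptions; the service and names are hypothetical, not from the article:

```typescript
// Sketch of an idempotent mutation: a retry carrying the same key
// replays the original result instead of creating a duplicate effect.
// Hypothetical service; all names are illustrative.

type Result = { orderId: string; duplicate: boolean };

const processed = new Map<string, Result>(); // idempotency key -> first result
let nextOrderId = 1;

function placeOrder(idempotencyKey: string): Result {
  const prior = processed.get(idempotencyKey);
  if (prior) {
    // We already applied this user intent: replay, don't re-execute.
    return { ...prior, duplicate: true };
  }
  const result: Result = { orderId: `order-${nextOrderId++}`, duplicate: false };
  processed.set(idempotencyKey, result);
  return result;
}

// The UI generates one key per user intent (e.g. per form submission),
// not per HTTP request, so an ambiguous timeout is safe to retry:
const key = "ui-intent-123"; // e.g. crypto.randomUUID() at submit time
const first = placeOrder(key);
const retry = placeOrder(key); // same key: no second order is created
```

The design choice that matters is where the key is minted: it has to be the UI, because only the UI knows which requests belong to the same user intent.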

Operate for Usability, Not Just Availability

If you want a more accurate picture of system health, measure the boundary between architecture and product behavior.

That usually means instrumenting critical workflows, not only services.

Examples of more useful operational questions:

  • How long until a user can take the primary action on the screen?

  • How often does a workflow enter a fallback or degraded mode?

  • How often do users retry a mutation because the result was ambiguous?

These metrics are harder to collect than request-level success rates. They often require cross-layer instrumentation between frontend, backend, and observability tooling.
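The first of those questions, time until the primary action is usable, can be captured with a small workflow timer on the client. This is a hedged sketch of the idea, not a real observability SDK; event names are made up:

```typescript
// Minimal workflow timer: measures time from "screen requested" to
// "primary action became usable", rather than per-request latency.
// Illustrative only; a real setup would report this to telemetry.

class WorkflowTimer {
  private start = 0;
  private marks = new Map<string, number>();

  begin(): void {
    this.start = performance.now(); // the workflow starts here
  }

  mark(event: string): void {
    this.marks.set(event, performance.now() - this.start);
  }

  // Milliseconds until the user could actually act, if recorded.
  timeToActionable(): number | undefined {
    return this.marks.get("primary-action-enabled");
  }
}

const timer = new WorkflowTimer();
timer.begin();
// ... page shell renders, data loads, the button finally enables:
timer.mark("primary-action-enabled");
const ttaMs = timer.timeToActionable(); // ship alongside service metrics
```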

But they are closer to the truth.

A workflow backed by services at 99.9% availability can still feel unreliable if the failures cluster on a critical path where ambiguous state erodes user trust. A dashboard may call that healthy. A user will not.

That usually requires a different instrumentation shape:

Instrumentation flow: task and frontend signals plus API and service metrics feed into a shared correlation context, then into a workflow dashboard showing actionable time, fallback rate, and retry rate.

The important change is not more telemetry; it is correlating service health with workflow outcomes.
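Correlation usually comes down to one shared ID that travels with the workflow: the frontend mints it, attaches it to every request in the task, and both layers tag their telemetry with it. A sketch of that idea; the header name and event shapes are invented for illustration:

```typescript
// Shared correlation context: one workflow ID joins frontend signals
// and backend metrics so a dashboard can group them per user task.
// Header name and event shapes are illustrative.

type TelemetryEvent = {
  workflowId: string;
  layer: "frontend" | "backend";
  name: string;
};

const events: TelemetryEvent[] = [];

function emit(event: TelemetryEvent): void {
  events.push(event); // stand-in for a real telemetry pipeline
}

// Frontend side: one ID for the whole user task.
const workflowId = "wf-123"; // e.g. crypto.randomUUID() per task
emit({ workflowId, layer: "frontend", name: "checkout-started" });

// Backend side: reads the ID from a request header and tags its events.
function handleRequest(headers: Record<string, string>): void {
  const id = headers["x-workflow-id"] ?? "unknown";
  emit({ workflowId: id, layer: "backend", name: "order-write" });
}
handleRequest({ "x-workflow-id": workflowId });

// A workflow dashboard can now join both layers on one key:
const joined = events.filter((e) => e.workflowId === workflowId);
```

This is the same shape as distributed tracing context propagation, just keyed to a user task instead of a single request.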

Better Incident and Postmortem Questions

This mindset also improves operational reviews.

A lot of incident analysis stays too close to system internals:

  • Which service degraded?

  • What dependency timed out?

  • How long until error rate recovered?

Those are necessary questions, but they are incomplete.

The more useful questions are often:

  • What did the user see while this was happening?

  • Which actions became unsafe, misleading, or impossible?

  • Were retries idempotent, or did they create duplicate effects?

These questions are valuable because many real incidents are not outages. They are trust failures. The system did something technically understandable but behaviorally confusing, exactly the kind of problem a frontend mindset catches earlier.

Traditional postmortem (service/dependency/error recovery) vs workflow postmortem (user experience, failed actions, UI state); key insight that many incidents are trust failures rather than outages.

The Tradeoff Is More Coordination

There is a cost to operating this way.

Backend-centric metrics are easier to standardize. They map cleanly to services, ownership, and alerting. Workflow-level metrics are messier. They cross boundaries. They require teams to agree on what "usable" means for a given product path. They also push engineering, product, and design into closer alignment on degraded behavior.

Graceful degradation has its own cost as well. Read-only modes, stale-state indicators, idempotent mutations, clearer confirmation models, and better recovery paths all take deliberate design and implementation work.

Not every screen deserves that investment.

Internal tools and low-risk flows may not need this. The point is not exhaustive resilience everywhere; it is making that choice deliberately instead of assuming service metrics tell the full story.

2x2 matrix: axes are ease of measurement and closeness to product truth. Quadrants show service metrics, page health, task completion, and workflow health. Caption: workflow health is top-right—closest to product truth, hardest to measure, and most valuable.

The Takeaway

Distributed systems should not be operated on backend metrics alone.

"Up" is not the same as "usable."

If you want a more honest view of reliability, start with the user's ability to complete a task, understand system state, and trust the outcome. That is what a frontend mindset adds. It turns reliability from a narrow infrastructure question into a cross-layer operating model.

The next time your dashboards are green and a user tells you something is broken, resist the instinct to say "everything looks fine on our end." That instinct is the gap this mindset closes.
