Quentin Guenther

When Green Dashboards Lie: Up Is Not Usable

Originally published on qguenther.dev — sharing here for broader discussion.


A system can be technically available and still feel broken.

That gap shows up a lot in distributed systems. Dashboards are green. Error rates are within threshold. Core services are still responding. But a user is staring at a disabled button, a spinner that never resolves, or a confirmation state they no longer trust.

One pattern I have seen repeatedly is that backend metrics describe service health, while users experience workflow health. Those are related, but they are not the same thing.

That is why operating distributed systems with a frontend mindset is useful, especially if you do not write frontend code. I call it a frontend mindset not because it requires frontend skills, but because frontend engineers are forced to confront these questions first. The interface cannot hide behind "dependency timeout" as an explanation. It has to decide what to do next.

The core shift is simple: start with what the user can still do, what state they can trust, and how the interface behaves when dependencies degrade. Then work backward to the services, queues, retries, and caches underneath.

Split illustration: on the left, a green system dashboard shows healthy metrics like uptime, performance, and user activity; on the right, a frustrated person sits at a laptop with a loading spinner, highlighting a broken user experience despite healthy backend signals.

Green Dashboards, Broken Workflows

Most operational models start from the server inward.

Teams track latency, throughput, saturation, and error rate. They alert on dependency failures. They define service-level objectives around availability. All of that matters. None of it tells the full story of whether the system is still usable.

In large systems, the most painful failures are often partial failures:

  • A page shell renders, but the important action never becomes available.

  • Reads still succeed from cache, but the write path is degraded and users cannot tell whether a mutation completed.

  • Retries keep backend throughput acceptable, but the UI enters contradictory states.

  • A service stays within its availability target while a critical multi-step workflow effectively stops working.

From an infrastructure perspective, these can look like manageable incidents.

From a product perspective, trust is already being lost.

Users do not experience your service topology. They experience waiting, ambiguity, stale data, duplicate actions, and broken momentum.

One simple version of that pattern looks like this:

Flow showing how services can return 200 OK while the read model stays stale, the UI shows old state, and the user retries—leading to an ambiguous outcome despite green service dashboards.
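That flow can be reduced to a toy in-memory model. The sketch below is illustrative only: it assumes a hypothetical write store and a lagging read replica, not any real API from the article.

```typescript
// Toy model of the flow above: the write succeeds and the API reports
// success, but the read model has not caught up, so the state the user
// sees disagrees with the state the server confirmed.
// All names here are illustrative.

type Order = { id: string; status: "placed" };

const writeStore = new Map<string, Order>();
let readReplica = new Map<string, Order>(); // lags behind writeStore

function placeOrder(id: string): { ok: boolean } {
  writeStore.set(id, { id, status: "placed" }); // write path succeeds
  return { ok: true };                          // API returns 200 OK
}

function getOrder(id: string): Order | undefined {
  return readReplica.get(id); // reads are served from the stale replica
}

function replicate(): void {
  readReplica = new Map(writeStore); // replication catches up later
}

// The user places an order, then immediately refreshes:
const response = placeOrder("order-1");
const seenByUser = getOrder("order-1"); // stale: nothing visible yet

// Only after replication does visible state agree with the write:
replicate();
const afterCatchUp = getOrder("order-1");
```

In the gap between the success response and `replicate()`, every service involved is "up", yet the user has good reason to click again.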

What a Frontend Mindset Actually Changes

A frontend mindset is not a preference for UI over backend work. It is an operating stance.

It treats distributed systems as a series of promises made to a user:

  • Will this screen become actionable in a reasonable amount of time?

  • If an action is delayed, will the user understand whether it is pending, failed, or already applied?

  • If data is stale, is that safe, visible, and recoverable?

  • If one dependency fails, which parts of the workflow should keep working?

Instead of asking only "Is the service up?", you start asking:

  • Can the user still complete the task they came for?

  • Did we preserve confidence or create ambiguity?

  • Are we measuring availability, or are we measuring usable behavior?

That shift is easier to see side by side:

Same incident viewed two ways: the backend asks whether the service is up; the frontend asks whether the user can still complete the task.

Backend telemetry tells you whether the platform is still responding. A frontend mindset tells you whether the system is still keeping its promise to the user.

Failure Semantics Eventually Become Product Behavior

Latency becomes waiting.

Eventual consistency becomes confusion when users do not see the result of their action reflected quickly enough.

Retries become duplicate submissions unless mutation semantics and UI states are designed carefully.

Fallbacks become product decisions about what to hide, what to keep read-only, and what to clearly mark as degraded.

Consider a common pattern in distributed applications:

The page loads from cached or partially available data, which makes overall availability look strong. But the action the user actually cares about depends on a slower write path, a downstream authorization check, or a fan-out to several services. The screen appears alive, yet the workflow is fragile.

This is a distributed systems problem; it just becomes visible in the product before it shows up in backend metrics.

The retry path usually looks something like this:

Sequence where the user places an order, the API returns success, but the read model returns stale data so success and visible state disagree; the user retries, creating duplicate risk.

Retry behavior and confirmation states cannot be designed independently. The backend and UI are participating in the same failure semantics.

UI mock showing a user action, a success response, a stale status after refresh, and a second click that creates ambiguity.
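One common way to make that shared failure semantic concrete is an idempotency key: the client generates one key per logical user intent, and the server deduplicates retries against it. A minimal sketch under those assumptions; the service and names are hypothetical, not from the article:

```typescript
// Sketch of an idempotent mutation: a retry carrying the same key
// replays the original result instead of creating a duplicate effect.
// Hypothetical service; all names are illustrative.

type Result = { orderId: string; duplicate: boolean };

const processed = new Map<string, Result>(); // idempotency key -> first result
let nextOrderId = 1;

function placeOrder(idempotencyKey: string): Result {
  const prior = processed.get(idempotencyKey);
  if (prior) {
    // We already applied this user intent: replay, don't re-execute.
    return { ...prior, duplicate: true };
  }
  const result: Result = { orderId: `order-${nextOrderId++}`, duplicate: false };
  processed.set(idempotencyKey, result);
  return result;
}

// The UI generates one key per user intent (e.g. per form submission),
// not per HTTP request, so an ambiguous timeout is safe to retry:
const key = "ui-intent-123"; // e.g. crypto.randomUUID() at submit time
const first = placeOrder(key);
const retry = placeOrder(key); // same key: no second order is created
```

The design choice that matters is where the key is minted: it has to be the UI, because only the UI knows which requests belong to the same user intent.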

Operate for Usability, Not Just Availability

If you want a more accurate picture of system health, measure the boundary between architecture and product behavior.

That usually means instrumenting critical workflows, not only services.

Examples of more useful operational questions:

  • How long until a user can take the primary action on the screen?

  • How often does a workflow enter a fallback or degraded mode?

  • How often do users retry a mutation because the result was ambiguous?

These metrics are harder to collect than request-level success rates. They often require cross-layer instrumentation between frontend, backend, and observability tooling.
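The first of those questions, time until the primary action is usable, can be captured with a small workflow timer on the client. This is a hedged sketch of the idea, not a real observability SDK; event names are made up:

```typescript
// Minimal workflow timer: measures time from "screen requested" to
// "primary action became usable", rather than per-request latency.
// Illustrative only; a real setup would report this to telemetry.

class WorkflowTimer {
  private start = 0;
  private marks = new Map<string, number>();

  begin(): void {
    this.start = performance.now(); // the workflow starts here
  }

  mark(event: string): void {
    this.marks.set(event, performance.now() - this.start);
  }

  // Milliseconds until the user could actually act, if recorded.
  timeToActionable(): number | undefined {
    return this.marks.get("primary-action-enabled");
  }
}

const timer = new WorkflowTimer();
timer.begin();
// ... page shell renders, data loads, the button finally enables:
timer.mark("primary-action-enabled");
const ttaMs = timer.timeToActionable(); // ship alongside service metrics
```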

But they are closer to the truth.

A workflow backed by services at 99.9% availability can still feel unreliable if the failures cluster on a critical path where ambiguous state erodes user trust. A dashboard may call that healthy. A user will not.

That usually requires a different instrumentation shape:

Instrumentation flow: task and frontend signals plus API and service metrics feed into a shared correlation context, then into a workflow dashboard showing actionable time, fallback rate, and retry rate.

The important change is not more telemetry; it is correlating service health with workflow outcomes.
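Correlation usually comes down to one shared ID that travels with the workflow: the frontend mints it, attaches it to every request in the task, and both layers tag their telemetry with it. A sketch of that idea; the header name and event shapes are invented for illustration:

```typescript
// Shared correlation context: one workflow ID joins frontend signals
// and backend metrics so a dashboard can group them per user task.
// Header name and event shapes are illustrative.

type TelemetryEvent = {
  workflowId: string;
  layer: "frontend" | "backend";
  name: string;
};

const events: TelemetryEvent[] = [];

function emit(event: TelemetryEvent): void {
  events.push(event); // stand-in for a real telemetry pipeline
}

// Frontend side: one ID for the whole user task.
const workflowId = "wf-123"; // e.g. crypto.randomUUID() per task
emit({ workflowId, layer: "frontend", name: "checkout-started" });

// Backend side: reads the ID from a request header and tags its events.
function handleRequest(headers: Record<string, string>): void {
  const id = headers["x-workflow-id"] ?? "unknown";
  emit({ workflowId: id, layer: "backend", name: "order-write" });
}
handleRequest({ "x-workflow-id": workflowId });

// A workflow dashboard can now join both layers on one key:
const joined = events.filter((e) => e.workflowId === workflowId);
```

This is the same shape as distributed tracing context propagation, just keyed to a user task instead of a single request.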

Better Incident and Postmortem Questions

This mindset also improves operational reviews.

A lot of incident analysis stays too close to system internals:

  • Which service degraded?

  • What dependency timed out?

  • How long until error rate recovered?

Those are necessary questions, but they are incomplete.

The more useful questions are often:

  • What did the user see while this was happening?

  • Which actions became unsafe, misleading, or impossible?

  • Were retries idempotent, or did they create duplicate effects?

These questions are valuable because many real incidents are not outages. They are trust failures. The system did something technically understandable but behaviorally confusing, exactly the kind of problem a frontend mindset catches earlier.

Traditional postmortem (service/dependency/error recovery) vs workflow postmortem (user experience, failed actions, UI state); key insight that many incidents are trust failures rather than outages.

The Tradeoff Is More Coordination

There is a cost to operating this way.

Backend-centric metrics are easier to standardize. They map cleanly to services, ownership, and alerting. Workflow-level metrics are messier. They cross boundaries. They require teams to agree on what "usable" means for a given product path. They also push engineering, product, and design into closer alignment on degraded behavior.

Graceful degradation has its own cost as well. Read-only modes, stale-state indicators, idempotent mutations, clearer confirmation models, and better recovery paths all take deliberate design and implementation work.

Not every screen deserves that investment.

Internal tools and low-risk flows may not need this. The point is not exhaustive resilience everywhere; it is making that choice deliberately instead of assuming service metrics tell the full story.

2x2 matrix: axes are ease of measurement and closeness to product truth. Quadrants show service metrics, page health, task completion, and workflow health. Caption: workflow health is top-right—closest to product truth, hardest to measure, and most valuable.

The Takeaway

Distributed systems should not be operated on backend metrics alone.

"Up" is not the same as "usable."

If you want a more honest view of reliability, start with the user's ability to complete a task, understand system state, and trust the outcome. That is what a frontend mindset adds. It turns reliability from a narrow infrastructure question into a cross-layer operating model.

The next time your dashboards are green and a user tells you something is broken, resist the instinct to say "everything looks fine on our end." That instinct is the gap this mindset closes.
