Mikuz

Measuring What Matters: User-Centric Availability Monitoring

Your systems report healthy status—servers respond, databases execute queries, load balancers register active targets—but customers still can't buy your products. This gap between technical health indicators and actual user success reveals why availability monitoring must shift from infrastructure metrics to user outcomes.

Legacy approaches tracked server availability through ping tests and HTTP status codes, measuring whether systems responded rather than whether users achieved their goals. When payment processing degrades and exhausts connection pools, traditional dashboards may display green while checkout completion rates plummet.

Effective availability monitoring measures what matters to users:

Can they complete purchases? Log in successfully? Accomplish their intended tasks?

The practices outlined here demonstrate how to implement monitoring that detects real problems before customers notice them.


Measure User Outcomes, Not System Responses

Service Level Indicators (SLIs) should quantify whether users accomplish what they set out to do—not whether your infrastructure returns successful status codes.

Structure SLIs as Ratios

Effective SLIs follow a simple structure:

Successful User Actions / Total User Attempts

This shifts focus from system availability to business value delivered.
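As a minimal sketch, the ratio can be expressed as a single function. The convention of returning 1.0 when there is no traffic is an assumption (some teams prefer to report "no data" instead):

```python
def sli(successful_actions: int, total_attempts: int) -> float:
    """Success ratio of user actions; 1.0 when there is nothing to measure."""
    if total_attempts == 0:
        return 1.0  # no traffic: by convention, treat the SLI as met
    return successful_actions / total_attempts

# 997 completed checkouts out of 1,000 attempts -> 99.7% SLI
print(f"{sli(997, 1000):.2%}")  # 99.70%
```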

Instrument for Business Outcomes

Implementation begins with instrumenting applications to capture goal completion data.

For example, in a payment processing system, the relevant metric is not whether the API responds—but whether customers successfully complete transactions.

Using Prometheus (e.g., with the Go client library):

  • Create a counter vector tracking transaction attempts.
  • Include labels for:
    • Status (completed, failed, pending)
    • Confirmation receipt
  • Increment counters based on actual outcomes.

A transaction counts as successful only when:

  • Payment processing completes
  • AND confirmation is received

An HTTP 200 response alone does not qualify.

Distinguish Between States

Track at least three states:

  1. Completed with confirmation
  2. Failed attempts
  3. Pending without confirmation

This ensures your metrics reflect real business outcomes rather than technical artifacts.
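As a language-agnostic illustration of this labeling scheme, here is a minimal in-memory sketch. The label names and statuses are hypothetical; in production you would use a real counter vector from a Prometheus client library rather than a plain dictionary:

```python
from collections import Counter

# Stand-in for a Prometheus counter vector with
# "status" and "confirmation" labels.
transaction_attempts = Counter()

def record_transaction(status: str, confirmation_received: bool) -> None:
    """Increment the counter for one observed outcome."""
    confirmation = "received" if confirmation_received else "missing"
    transaction_attempts[(status, confirmation)] += 1

# Outcomes observed by the payment flow:
record_transaction("completed", True)   # processed AND confirmed -> success
record_transaction("failed", False)     # payment failed outright
record_transaction("pending", False)    # HTTP 200 returned, but no confirmation yet

# Only the first outcome counts toward the SLI numerator.
successes = transaction_attempts[("completed", "received")]
total = sum(transaction_attempts.values())
print(successes, total)  # 1 3
```

Keeping confirmation as a separate label (rather than folding it into status) makes it easy to spot the "pending without confirmation" state that a status-code check would miss.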

Build Meaningful Success Formulas

Examples:

  • Checkout Success Rate

    Confirmed completed payments ÷ Total payment attempts (5-minute window)

  • Authentication Reliability

    Logins completed within 2 seconds ÷ Total login attempts

  • Search Effectiveness

    Queries returning results without errors ÷ Total search requests
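The checkout formula above can be sketched as a windowed computation over an event log. The event schema here is an assumption for illustration; in practice these numerators and denominators usually come from metric queries rather than raw events:

```python
from datetime import datetime, timedelta

def checkout_success_rate(events, now, window=timedelta(minutes=5)):
    """Confirmed completed payments / total payment attempts in the window."""
    recent = [e for e in events if now - e["time"] <= window]
    if not recent:
        return 1.0  # no attempts in the window: treat the SLI as met
    confirmed = sum(1 for e in recent
                    if e["status"] == "completed" and e["confirmed"])
    return confirmed / len(recent)

now = datetime(2024, 1, 1, 12, 0)
events = [
    {"time": now - timedelta(minutes=1),  "status": "completed", "confirmed": True},
    {"time": now - timedelta(minutes=2),  "status": "failed",    "confirmed": False},
    {"time": now - timedelta(minutes=3),  "status": "completed", "confirmed": False},
    {"time": now - timedelta(minutes=20), "status": "failed",    "confirmed": False},  # outside window
]
print(f"{checkout_success_rate(events, now):.1%}")  # 33.3%
```

Note that the completed-but-unconfirmed attempt counts in the denominator but not the numerator, exactly as the formula requires.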

Align Thresholds with Business Priority

Set targets based on business impact—not arbitrary technical standards:

  • Payment flows: 99.95% (direct revenue impact)
  • Authentication: 99.9% (gates all features)
  • Search: 99.5% (important but less critical)

Avoid counting HTTP 200 responses as success if error details exist in the body. Only count confirmed business outcomes.


Rank Services by Their Business Consequences

Not all services deserve equal reliability targets.

Your monitoring strategy should reflect which failures hurt your business most.

Identify Revenue-Critical Services

For e-commerce platforms, Tier 1 services include:

  • Payment processing
  • Shopping cart
  • Checkout flow

These require the tightest error budgets (typically 99.9%+ availability).

For a 30-day window:

  • 99.9% → ~43 minutes of allowable degradation
  • 99.5% → ~3.6 hours of allowable degradation

Every minute of payment downtime equals direct revenue loss.

Define Service Tiers

Tier 1 – Revenue Blocking

  • Payments
  • Checkout
  • Order processing

Tier 2 – Experience Enhancing

  • Product search
  • Recommendations
  • Reviews

Tier 3 – Internal or Supporting Tools

  • Admin dashboards
  • Reporting systems

Assign progressively relaxed error budgets by tier.

Use Error Budgets for Tradeoffs

Error budgets clarify tradeoffs between:

  • Deployment velocity
  • Feature expansion
  • Reliability investment

If teams want faster releases, they must either:

  • Improve deployment safety, so changes consume less of the budget
  • Or accept spending more of the existing error budget on change-induced failures

Document these priorities so tradeoffs are explicit and defensible.
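One way to make the tradeoff concrete is a burn-rate check: compare the budget already consumed to what the elapsed fraction of the window allows. This is a sketch of the common pattern, not a prescribed alerting rule:

```python
def budget_burn_rate(downtime_minutes, elapsed_days, target=0.999, window_days=30):
    """>1.0 means the service is burning budget faster than the window allows."""
    budget_minutes = (1 - target) * window_days * 24 * 60
    expected_spend = budget_minutes * (elapsed_days / window_days)
    return downtime_minutes / expected_spend

# 30 minutes of degradation only 10 days into a 30-day, 99.9% window:
print(round(budget_burn_rate(30, 10), 2))  # 2.08 -> slow releases or improve safety
```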


Standardize How Teams Define and Review Reliability Targets

Teams should define SLIs based on their service’s user impact. However, organizations need a consistent framework to compare reliability across services.

Create a Standard SLO Documentation Template

Each SLO document should include:

  • SLI definition
  • Measurement window
  • Target percentage
  • Error budget
  • Business justification
  • Historical performance

This ensures comparability without dictating specific metrics.

Use Error Budgets as a Common Language

Different teams may measure:

  • Transaction completion
  • Search success
  • Login latency

But all express reliability as:

Target % → Corresponding Error Budget

This enables leadership to see which services burn reliability fastest—without deep technical interpretation.

Establish a Review Cadence

Hold monthly or quarterly SLO reviews where teams:

  • Present error budget status
  • Discuss recent incidents
  • Propose adjustments
  • Justify target changes

These sessions reveal systemic patterns:

  • Overperforming services may tolerate more velocity
  • Struggling services may need additional investment

Document the Reasoning Behind Targets

Capture why:

  • Checkout requires 99.95%
  • Search accepts 99.5%

This prevents confusion when teams change and ensures SLO evolution follows intentional decisions—not guesswork.

Consistency in process allows flexibility in metrics. Teams should refine SLIs as they learn what correlates with customer satisfaction.


Conclusion

Green dashboards do not guarantee successful customers.

Infrastructure-focused monitoring answers:

“Are servers responding?”

User-centric monitoring answers:

“Can customers accomplish their goals right now?”

Effective availability monitoring requires:

  • SLIs that measure user outcomes
  • Business-aligned reliability targets
  • Service tier prioritization
  • Standardized SLO review processes

A payment API returning HTTP 200 means nothing if transactions fail.

Modern reliability engineering recognizes that availability isn’t about uptime—it’s about user success.

Measure what matters.

Prioritize by business impact.

Standardize how reliability is defined and reviewed.

Your monitoring should answer one question clearly and continuously:

Can customers accomplish what they came here to do—right now?
