Your systems report healthy status—servers respond, databases execute queries, load balancers register active targets—but customers still can't buy your products. This gap between technical health indicators and actual user success reveals why availability monitoring must shift from infrastructure metrics to user outcomes.
Legacy approaches tracked server availability through ping tests and HTTP status codes, measuring whether systems responded rather than whether users achieved their goals. When payment processing degrades and exhausts connection pools, traditional dashboards may display green while checkout completion rates plummet.
Effective availability monitoring measures what matters to users:
Can they complete purchases? Log in successfully? Accomplish their intended tasks?
The practices outlined here demonstrate how to implement monitoring that detects real problems before customers notice them.
Measure User Outcomes, Not System Responses
Service Level Indicators (SLIs) should quantify whether users accomplish what they set out to do—not whether your infrastructure returns successful status codes.
Structure SLIs as Ratios
Effective SLIs follow a simple structure:
Successful User Actions / Total User Attempts
This shifts focus from system availability to business value delivered.
Instrument for Business Outcomes
Implementation begins with instrumenting applications to capture goal completion data.
For example, in a payment processing system, the relevant metric is not whether the API responds, but whether customers successfully complete transactions.
Using Prometheus (e.g., with the Go client library):
- Create a counter vector tracking transaction attempts.
- Include labels for:
  - Status (completed, failed, pending)
  - Confirmation receipt
- Increment counters based on actual outcomes.
A transaction counts as successful only when:
- Payment processing completes
- AND confirmation is received
An HTTP 200 response alone does not qualify.
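As a minimal sketch using the Prometheus Go client library (the metric and function names here are illustrative, not a prescribed API):

```go
package payments

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// paymentTransactions counts payment attempts by business outcome.
// The "confirmed" label records whether a confirmation was received.
var paymentTransactions = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "payment_transactions_total",
		Help: "Payment attempts labeled by business outcome.",
	},
	[]string{"status", "confirmed"}, // status: completed | failed | pending
)

// RecordTransaction increments the counter based on the actual
// outcome: "completed" with confirmed=true only when the payment
// settled AND a confirmation was received, regardless of HTTP status.
func RecordTransaction(status string, confirmed bool) {
	confirmedLabel := "false"
	if confirmed {
		confirmedLabel = "true"
	}
	paymentTransactions.WithLabelValues(status, confirmedLabel).Inc()
}
```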
Distinguish Between States
Track at least three states:
- Completed with confirmation
- Failed attempts
- Pending without confirmation
This ensures your metrics reflect real business outcomes rather than technical artifacts.
Build Meaningful Success Formulas
Examples:
Checkout Success Rate
Confirmed completed payments ÷ Total payment attempts (5-minute window)

Authentication Reliability
Logins completed within 2 seconds ÷ Total login attempts

Search Effectiveness
Queries returning results without errors ÷ Total search requests
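With instrumentation like the earlier sketch in place, the checkout formula maps directly onto a PromQL ratio (assuming the illustrative payment_transactions_total counter):

```promql
sum(rate(payment_transactions_total{status="completed", confirmed="true"}[5m]))
/
sum(rate(payment_transactions_total[5m]))
```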
Align Thresholds with Business Priority
Set targets based on business impact—not arbitrary technical standards:
- Payment flows: 99.95% (direct revenue impact)
- Authentication: 99.9% (gates all features)
- Search: 99.5% (important but less critical)
Avoid counting HTTP 200 responses as success if error details exist in the body. Only count confirmed business outcomes.
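A response handler might apply that rule like this sketch, which lives in the same payments package as the earlier RecordTransaction example (the PaymentResponse shape is an assumption; real APIs differ):

```go
package payments

import "net/http"

// PaymentResponse is an assumed response shape for illustration.
type PaymentResponse struct {
	Status         string `json:"status"`          // "completed", "failed", "pending"
	ConfirmationID string `json:"confirmation_id"` // empty until confirmed
}

// CountPaymentResult inspects the decoded body, not just the HTTP
// status, before recording a success: an HTTP 200 wrapping an error
// payload still counts as a failure.
func CountPaymentResult(httpStatus int, resp PaymentResponse) {
	confirmed := resp.Status == "completed" && resp.ConfirmationID != ""
	if httpStatus == http.StatusOK && confirmed {
		RecordTransaction("completed", true)
		return
	}
	RecordTransaction(resp.Status, false)
}
```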
Rank Services by Their Business Consequences
Not all services deserve equal reliability targets.
Your monitoring strategy should reflect which failures hurt your business most.
Identify Revenue-Critical Services
For e-commerce platforms, Tier 1 services include:
- Payment processing
- Shopping cart
- Checkout flow
These require the tightest error budgets (typically 99.9%+ availability).
For a 30-day window:
- 99.9% → ~43 minutes of allowable degradation
- 99.5% → ~3.6 hours of allowable degradation
Every minute of payment downtime equals direct revenue loss.
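The arithmetic is simple enough to encode directly; a small sketch, using the 30-day window and targets above:

```go
package main

import (
	"fmt"
	"time"
)

// errorBudget converts an availability target over a window into the
// downtime that target allows.
func errorBudget(target float64, window time.Duration) time.Duration {
	return time.Duration((1 - target) * float64(window))
}

func main() {
	window := 30 * 24 * time.Hour
	fmt.Println(errorBudget(0.999, window)) // ≈ 43m12s
	fmt.Println(errorBudget(0.995, window)) // ≈ 3h36m
}
```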
Define Service Tiers
Tier 1 – Revenue Blocking
- Payments
- Checkout
- Order processing
Tier 2 – Experience Enhancing
- Product search
- Recommendations
- Reviews
Tier 3 – Internal or Supporting Tools
- Admin dashboards
- Reporting systems
Assign progressively relaxed error budgets by tier.
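One way to make tier assignments concrete is to encode them as data, so alerting and review tooling share a single source of truth. A sketch, where Tier 1 and Tier 2 values follow the figures in this article and the Tier 3 values are assumed for illustration:

```go
package reliability

// tierTargets maps services to availability targets by tier.
// Tier 3 targets below are assumptions, not figures from this article.
var tierTargets = map[string]float64{
	// Tier 1: revenue blocking
	"payments":         99.95,
	"checkout":         99.95,
	"order-processing": 99.95,
	// Tier 2: experience enhancing
	"search":          99.5,
	"recommendations": 99.5,
	"reviews":         99.5,
	// Tier 3: internal or supporting tools (assumed targets)
	"admin-dashboard": 99.0,
	"reporting":       99.0,
}
```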
Use Error Budgets for Tradeoffs
Error budgets clarify tradeoffs between:
- Deployment velocity
- Feature expansion
- Reliability investment
If teams want faster releases, they must either:
- Improve deployment safety
- Or accept that riskier releases consume the error budget faster
Document these priorities so tradeoffs are explicit and defensible.
Standardize How Teams Define and Review Reliability Targets
Teams should define SLIs based on their service’s user impact. However, organizations need a consistent framework to compare reliability across services.
Create a Standard SLO Documentation Template
Each SLO document should include:
- SLI definition
- Measurement window
- Target percentage
- Error budget
- Business justification
- Historical performance
This ensures comparability without dictating specific metrics.
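One way to keep these documents machine-comparable is a shared record type. A sketch assuming the field set above (all names are illustrative):

```go
package slo

import "time"

// SLODocument mirrors the standard template so reliability targets
// can be compared across services without dictating specific metrics.
type SLODocument struct {
	SLIDefinition         string        // e.g. "confirmed payments / payment attempts"
	MeasurementWindow     time.Duration // e.g. 30 * 24 * time.Hour
	TargetPercent         float64       // e.g. 99.95
	ErrorBudget           time.Duration // downtime the target allows per window
	BusinessJustification string        // why this target, in business terms
	HistoricalPerformance []float64     // achieved percentages from past windows
}
```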
Use Error Budgets as a Common Language
Different teams may measure:
- Transaction completion
- Search success
- Login latency
But all express reliability as:
Target % → Corresponding Error Budget
This enables leadership to see which services burn reliability fastest—without deep technical interpretation.
Establish a Review Cadence
Hold monthly or quarterly SLO reviews where teams:
- Present error budget status
- Discuss recent incidents
- Propose adjustments
- Justify target changes
These sessions reveal systemic patterns:
- Overperforming services may tolerate more velocity
- Struggling services may need additional investment
Document the Reasoning Behind Targets
Capture why:
- Checkout requires 99.95%
- Search accepts 99.5%
This prevents confusion when teams change and ensures SLO evolution follows intentional decisions—not guesswork.
Consistency in process allows flexibility in metrics. Teams should refine SLIs as they learn what correlates with customer satisfaction.
Conclusion
Green dashboards do not guarantee successful customers.
Infrastructure-focused monitoring answers:
“Are servers responding?”
User-centric monitoring answers:
“Can customers accomplish their goals right now?”
Effective availability monitoring requires:
- SLIs that measure user outcomes
- Business-aligned reliability targets
- Service tier prioritization
- Standardized SLO review processes
A payment API returning HTTP 200 means nothing if transactions fail.
Modern reliability engineering recognizes that availability isn’t about uptime—it’s about user success.
Measure what matters.
Prioritize by business impact.
Standardize how reliability is defined and reviewed.
Your monitoring should answer one question clearly and continuously:
Can customers accomplish what they came here to do—right now?