Dylan Dumont
The RED Method: Request Rate, Errors, and Duration as Your Core SLIs

"Noise drowns out signal; focus on the three metrics that actually indicate system health."

What We're Building

We are instrumenting a Go-based HTTP handler to expose the three RED metrics (Request Rate, Errors, and Duration) required to calculate Service Level Indicators (SLIs). This scope excludes internal tracing spans and database metrics, focusing strictly on the API gateway surface to ensure consistency across a distributed backend. The goal is to replace legacy monitoring scripts with a structured metrics export that feeds directly into a Prometheus stack.

Step 1 — Instrument the Middleware

The first step is intercepting incoming requests before they reach the application logic. You need a middleware function that wraps the handler and captures the timing start point.

type RequestInfo struct {
    Start time.Time
}

func RequestMetricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        reqInfo := RequestInfo{Start: time.Now()}

        // Run the wrapped handler logic
        next.ServeHTTP(w, r)

        // Record how long the handler took (the histogram is defined in Step 4;
        // leaving duration unused would be a compile error in Go)
        duration := time.Since(reqInfo.Start)
        apiDurationHistogram.Observe(duration.Seconds())
    })
}

This separation ensures the application logic remains clean while observability concerns are handled at the infrastructure boundary.

Step 2 — Aggregate Request Counts

Counters track the total volume of requests. You should maintain separate counters for 4xx errors and 5xx errors to distinguish client failures from server failures.

var (
    totalRequests = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "api_requests_total",
        Help: "Total number of API requests.",
    })

    error4xx = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "api_errors_4xx_total",
        Help: "Client-side errors.",
    })

    error5xx = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "api_errors_5xx_total",
        Help: "Server-side errors.",
    })
)

Counters are the raw input for Request Rate: the collector samples them over time and computes requests per second, which feeds capacity planning thresholds.
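To make that concrete, here is a back-of-the-envelope sketch of how a rate falls out of a counter (the numbers are made up): Prometheus records the counter at each scrape and divides the increase by the window, which is essentially what the PromQL rate() function computes.

```go
package main

import "fmt"

// requestRate derives a per-second rate from two counter samples,
// illustrating what the collector computes server-side.
func requestRate(previous, current, windowSeconds float64) float64 {
	return (current - previous) / windowSeconds
}

func main() {
	// counter read 1200 at one scrape and 1500 fifteen seconds later
	fmt.Printf("%.0f req/s\n", requestRate(1200, 1500, 15)) // 20 req/s
}
```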

Step 3 — Classify Error Labels

Do not just count errors; label them. Use status code classes (2xx, 4xx, 5xx) as label values so you can query specific failure modes later.

func recordError(status int) {
    if status >= 500 {
        error5xx.Inc()
        return
    }
    if status >= 400 {
        // Increment a matching 4xx counter here, or use a labeled
        // CounterVec so one metric covers every status class
    }
}

This specificity allows you to distinguish between a rate-limiting issue (429) and a database crash (500) during incident response.
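A stdlib-only sketch of the classification logic (the statusClass helper is my own naming; with client_golang you would attach this value as a CounterVec label rather than keying a map):

```go
package main

import "fmt"

// statusClass maps a status code to the label value used on the error
// counter, so queries can separate client from server failures.
func statusClass(code int) string {
	switch {
	case code >= 500:
		return "5xx"
	case code >= 400:
		return "4xx"
	default:
		return "2xx"
	}
}

func main() {
	byClass := map[string]int{} // stand-in for a labeled CounterVec
	for _, code := range []int{200, 204, 429, 500, 503} {
		byClass[statusClass(code)]++
	}
	fmt.Println(byClass) // map[2xx:2 4xx:1 5xx:2]
}
```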

Step 4 — Measure Latency Histograms

Duration needs more than an average. A histogram you can query for percentiles (p50, p95, p99) is required to understand the tail latency that impacts user experience.

duration := time.Since(reqInfo.Start)
apiDurationHistogram.Observe(duration.Seconds())

Histogram buckets let you compute quantiles at query time, so a burst of fast requests cannot hide a slow tail the way a simple average would.
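To show the bucket mechanics, here is a stdlib-only sketch of how cumulative histogram buckets answer percentile queries. The bucket boundaries are illustrative, and Prometheus's histogram_quantile additionally interpolates within a bucket rather than returning its upper bound as this simplification does.

```go
package main

import (
	"fmt"
	"sort"
)

// Upper bounds (seconds) of cumulative buckets, Prometheus-style "le" buckets.
var bounds = []float64{0.05, 0.1, 0.25, 0.5, 1}

// observe increments every bucket whose upper bound covers the observation.
func observe(counts []int, seconds float64) {
	for i := sort.SearchFloat64s(bounds, seconds); i < len(counts); i++ {
		counts[i]++
	}
}

// quantileBucket returns the upper bound of the first bucket containing the
// q-th quantile (a simplification of histogram_quantile, which interpolates).
func quantileBucket(counts []int, q float64) float64 {
	target := q * float64(counts[len(counts)-1])
	for i, c := range counts {
		if float64(c) >= target {
			return bounds[i]
		}
	}
	return bounds[len(bounds)-1]
}

func main() {
	counts := make([]int, len(bounds))
	for _, d := range []float64{0.02, 0.04, 0.06, 0.08, 0.2, 0.4, 0.9} {
		observe(counts, d)
	}
	fmt.Println("p50 <=", quantileBucket(counts, 0.5))  // 0.1
	fmt.Println("p95 <=", quantileBucket(counts, 0.95)) // 1
}
```

Note how a single slow request moves p95 into the top bucket while the average would barely shift; that is the tail visibility the article argues for.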

Step 5 — Export Metrics via HTTP Endpoint

The final step is exposing these values so a collector like Prometheus can scrape them, typically every 15 seconds. The scrape handler runs alongside your application traffic, so it must not block request handling.

func startServer() {
    mux := http.NewServeMux()
    // promhttp.Handler comes from client_golang's promhttp package;
    // the older prometheus.Handler() has been removed from the library
    mux.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", mux))
}

Standard HTTP endpoints provide the necessary protocol compliance for cloud-native observability stacks.

Key Takeaways

  • Request Rate provides visibility into traffic volume and helps identify capacity saturation points in real-time.
  • Errors must be labeled by status code to allow engineers to differentiate between client and server failures.
  • Duration histograms are superior to averages because they reveal the tail latency that causes actual user complaints.
  • Instrumentation should happen at the edge, ensuring that metrics reflect the contract presented to the client, not internal implementation details.
  • SLOs derived from these RED metrics drive meaningful alerts rather than noise from every internal dependency failure.

What's Next?

Next, define Service Level Objectives (SLOs) based on the 99.9th percentile of the Duration histogram. Calculate error budgets to determine how much failure is acceptable before slowing down feature deployment. Finally, implement alerting rules that fire when the 5xx error rate exceeds your threshold for a sustained window, such as one minute.
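As a quick worked example of the error-budget arithmetic (numbers illustrative, not from this article): a 99.9% availability SLO over a 30-day window leaves a fixed amount of tolerable downtime.

```go
package main

import "fmt"

func main() {
	slo := 0.999                    // 99.9% availability target
	windowMinutes := 30 * 24 * 60.0 // 30-day rolling window
	budget := (1 - slo) * windowMinutes
	fmt.Printf("allowed downtime: %.1f minutes per 30 days\n", budget) // 43.2
}
```

Burning through those minutes faster than the window refills them is the signal to pause feature work.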

Further Reading

Part of the Architecture Patterns series.
