Mustafa ERBAY

Posted on May 27 • Originally published at mustafaerbay.com.tr

RED Metrics Design: Service-Oriented or Workflow-Oriented?

#monitoring #observability #systemdesign #performance

While developing a production ERP system, we were constantly dealing with delayed shipment reports. It took us three days to find the root cause. Similarly, there were times when, while performing service-based monitoring, we struggled to detect anomalies in critical business workflows promptly. These experiences repeatedly brought the question "Service-oriented or workflow-oriented?" to the forefront when designing RED (Rate, Errors, Duration) metrics. Both approaches have their own advantages and disadvantages, and determining which is more suitable depends on your system's complexity and priorities. Let's dive deep into this topic.

In this post, we'll explore how to design RED metrics from both a service-oriented and a workflow-oriented perspective. I'll explain which approach is more appropriate when, using concrete examples and real-world scenarios. My goal is to help you utilize these metrics more effectively and avoid potential pitfalls.

Service-Oriented RED Metrics: The Building Blocks

Service-oriented RED metrics are typically used to monitor the performance and health of individual services. Each service has its own set of metrics. This approach is quite practical for understanding and managing the fundamental building blocks of distributed systems. Focusing on a single service allows us to clearly see how much demand it's receiving (Rate), how many errors it's generating (Errors), and how long it takes to process these requests (Duration).
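To make the three signals concrete, here is a minimal sketch that derives them from a handful of request records for one service. The `Request` record, the `red_summary` helper, and the choice to count only 5xx responses as errors are assumptions made for illustration; in practice these numbers come from your metrics backend rather than from raw records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since the start of the window
    status: int       # HTTP status code
    duration: float   # seconds spent handling the request


def red_summary(requests, window_seconds):
    """Return (requests per second, error ratio, mean duration) for one window."""
    total = len(requests)
    if total == 0:
        return 0.0, 0.0, 0.0
    # Assumption: only 5xx responses count as errors here.
    errors = sum(1 for r in requests if r.status >= 500)
    rate = total / window_seconds
    error_ratio = errors / total
    mean_duration = sum(r.duration for r in requests) / total
    return rate, error_ratio, mean_duration


sample = [
    Request(0.0, 200, 0.020),
    Request(10.0, 200, 0.030),
    Request(20.0, 500, 0.100),
    Request(30.0, 200, 0.050),
]
rate, error_ratio, mean_duration = red_summary(sample, window_seconds=60)
```

A mean hides slow outliers, so real dashboards usually show percentiles for Duration as well.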

For instance, consider an authentication service. Let's assume this service handles millions of requests daily, has a certain error rate, and its average response time should be in the milliseconds. With service-oriented RED metrics, we can easily monitor these values. Tools like Prometheus collect metrics from this service so they can be queried and visualized.

# Example Prometheus query: Error rate for Authentication service
sum(rate(http_requests_total{service="auth-service", status=~"5..|4..", environment="production"}[5m]))
/
sum(rate(http_requests_total{service="auth-service", environment="production"}[5m]))

This query calculates the ratio of 4xx and 5xx responses to all requests over the last 5 minutes. Whether 4xx responses count as errors is a judgment call: they often reflect client mistakes, so many teams alert only on 5xx. Either way, the ratio gives a quick idea of the service's health. However, these metrics alone might not be sufficient. We might overlook performance issues hidden inside the service's averages or problems in more complex workflows involving multiple services. For example, even if a service responds quickly internally, if it depends on another slow service, the overall user experience can be negatively impacted.
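It helps to see what a `rate()`-style calculation does with counter samples. The sketch below is a simplification under stated assumptions: the helper names are hypothetical, samples are `(timestamp, value)` pairs, and a drop in value is treated as a counter reset. Prometheus itself also extrapolates to the window boundaries, which this sketch does not attempt.

```python
def counter_increase(samples):
    """Total increase of a counter across (timestamp, value) samples sorted by time."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again from zero.
        increase += (cur - prev) if cur >= prev else cur
    return increase


def per_second_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


def error_ratio(error_samples, total_samples):
    """Ratio of the error rate to the total request rate over the same window."""
    total_rate = per_second_rate(total_samples)
    return per_second_rate(error_samples) / total_rate if total_rate else 0.0


# Five minutes of samples, one per minute, with a restart after the second sample.
total_samples = [(0, 1000), (60, 1600), (120, 100), (180, 700), (240, 1300), (300, 1900)]
error_samples = [(0, 10), (60, 16), (120, 1), (180, 7), (240, 13), (300, 19)]
ratio = error_ratio(error_samples, total_samples)
```

Because both rates are computed over the same window, the ratio stays meaningful even across the restart.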

ℹ️ Advantages of Service-Oriented Metrics

Simplicity: Each service is monitored independently, making management easier.

Focus: Enables quick identification of service-level issues.

Cost-Effectiveness: May require less complex monitoring infrastructure.

The main advantage of this approach is that each service's performance can be evaluated independently. When there's a sudden increase in errors or a slowdown in a service, we can quickly pinpoint the source of the problem. This is critical for teams migrating from monolithic architectures to microservices or breaking down services into smaller, manageable pieces. However, the biggest disadvantage of this approach is its inability to fully reflect the interactions between services and their impact on the overall user experience.

Workflow-Oriented RED Metrics: Centering on User Experience

Workflow-oriented RED metrics track the steps users take to complete a task. This focuses on the question, "What did the user accomplish?" For example, let's consider the "add to cart" workflow on an e-commerce site. This workflow starts from the front-end user interface, connects to the product information service, interacts with the inventory service, sends an update to the cart service, and finally displays a confirmation message to the user.

Each step in this flow can represent a service or encompass multiple services. Workflow-oriented metrics measure how all these steps perform together. This allows us to understand how long it takes for a user to complete a transaction, how many errors occur during this process, and which steps contribute to increased duration or errors. This is the approach that most accurately reflects user experience.

Consider the "order creation" workflow in a production ERP. This workflow involves numerous steps such as retrieving customer information, selecting products, checking inventory, pricing, approval mechanisms, and finally, recording the order. Each of these steps might be managed by different services. Workflow-oriented metrics enable us to monitor the total time for all these steps, the overall error rate encountered during this process, and the success rate of completing an order.
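One way to capture such a workflow on the instrumentation side is to wrap the whole business operation in a single timer that also records the outcome. The following is a minimal in-process sketch; `WorkflowMetrics`, `track`, and `create_order` are hypothetical names, and a real system would export these values to a metrics backend instead of keeping them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class WorkflowMetrics:
    """Keeps started/failed counts and durations per workflow name."""

    def __init__(self):
        self.started = defaultdict(int)
        self.failed = defaultdict(int)
        self.durations = defaultdict(list)

    @contextmanager
    def track(self, workflow):
        self.started[workflow] += 1
        begin = time.perf_counter()
        try:
            yield
        except Exception:
            self.failed[workflow] += 1
            raise
        finally:
            # Record the duration for successful and failed runs alike.
            self.durations[workflow].append(time.perf_counter() - begin)

    def success_ratio(self, workflow):
        started = self.started[workflow]
        return (started - self.failed[workflow]) / started if started else 0.0


metrics = WorkflowMetrics()


def create_order(in_stock):
    with metrics.track("order_creation"):
        # Placeholders for customer lookup, inventory check, pricing, and approval.
        if not in_stock:
            raise RuntimeError("inventory check failed")
        return "order recorded"


create_order(True)
try:
    create_order(False)
except RuntimeError:
    pass
```

The label is the workflow name rather than a service name, which is what separates this from the service-oriented view.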

💡 Why Do Workflow-Oriented Metrics Matter?

Users are concerned with how quickly and smoothly their tasks are completed, rather than the performance of individual services. Therefore, workflow-oriented metrics offer a more direct path to understanding the real user experience.

The biggest challenge with this approach is defining the correct workflows and establishing the necessary infrastructure to monitor these flows end-to-end. It's critical that each step is correctly tagged and metrics are collected consistently. For instance, to measure the time it takes for a user to initiate and successfully complete a transaction, we might need distributed tracing tools. These tools connect requests across different services, allowing us to see the entire workflow as a whole.

Trade-offs: Service or Workflow?

Both approaches have their own set of advantages and disadvantages. Which one we choose largely depends on our system's architecture, business requirements, and the capabilities of our monitoring infrastructure. Often, the best solution is to combine these two approaches.

Service-oriented metrics are excellent for understanding the health of the underlying infrastructure. They are critical for understanding if a service is overloaded, if there are issues with database connections, or if there's a memory leak. For example, metrics like the size and growth of the WAL (Write-Ahead Log) for a database service (PostgreSQL) are directly service-oriented and provide crucial information about the database's health.

-- Simplified example query for PostgreSQL WAL bloat monitoring:
-- total size of the files currently in the WAL directory.
-- pg_ls_waldir() needs PostgreSQL 10+ and superuser or pg_monitor membership.
SELECT
    pg_size_pretty(sum(size)) AS total_size
FROM
    pg_ls_waldir();

Metrics like these are vital for understanding the performance of a specific service. On their own, however, they don't explain a general slowdown in a workflow that uses this database service. Perhaps another service in the workflow is sending excessive requests, straining the database.

On the other hand, workflow-oriented metrics are the best way to understand the end-user experience. A user taking a long time to complete a payment process might be due to a problem in a single service, or it could stem from a bottleneck in the interaction of multiple services.

⚠️ A Common Mistake: Relying Only on Service Metrics

Settling for only service-oriented metrics can cause you to overlook real issues experienced by users. A service appearing "healthy" doesn't mean a workflow using that service isn't slow or error-prone.

Therefore, a hybrid approach is often the most effective. While monitoring the health of core services using service-oriented metrics, it's also essential to include workflow-oriented metrics to understand the end-to-end performance and user experience of critical workflows. This allows us to quickly identify infrastructure problems and resolve general issues faced by users.

Real-World Scenarios and Applications

Based on my experience, I'd like to provide a few examples of how these two approaches can be used in different scenarios.

While working on a production ERP project, we noticed that order approval processes were taking longer than expected. Service-oriented metrics showed that related services like the approval service, inventory service, and notification service were operating normally internally. However, the total time to approve an order (workflow duration) was quite high. Through detailed investigation, we found that communication between the approval service and the inventory service was slow, particularly for complex inventory queries related to certain product groups. These queries weren't "erroneous" on their own but were slowing down the entire workflow. Here, workflow-oriented monitoring (e.g., by measuring the duration of each step with distributed tracing) helped us find the root cause of the problem.
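The kind of breakdown that exposed this can be sketched in a few lines. The span records, step names, and durations below are invented for illustration (they are not the figures from that project), and each step is assumed to be timed from the caller's side so that network and queueing time are included.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical span records exported from a tracing backend.
spans = [
    {"trace_id": "t1", "step": "approval_logic", "duration": 0.12},
    {"trace_id": "t1", "step": "approval_to_inventory_call", "duration": 2.40},
    {"trace_id": "t1", "step": "notification", "duration": 0.08},
    {"trace_id": "t2", "step": "approval_logic", "duration": 0.10},
    {"trace_id": "t2", "step": "approval_to_inventory_call", "duration": 3.10},
    {"trace_id": "t2", "step": "notification", "duration": 0.09},
]


def mean_duration_by_step(spans):
    """Group span durations by step name and average each group."""
    by_step = defaultdict(list)
    for span in spans:
        by_step[span["step"]].append(span["duration"])
    return {step: mean(values) for step, values in by_step.items()}


step_means = mean_duration_by_step(spans)
slowest_step = max(step_means, key=step_means.get)
```

None of these steps fails, so an error-rate dashboard stays green; only the per-step duration points at the slow hop.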

Another example is a situation I encountered with my own mobile application. My app had a feature to send notifications to users. This feature was managed by a backend service, and its service-oriented metrics (request count, error rate, response time) looked perfectly fine. However, users occasionally reported that notifications were delayed or never arrived. Upon investigating, I found the issue wasn't with the backend service itself but with network problems in specific regions experienced by a third-party service provider (like Firebase Cloud Messaging) used for sending notifications. In this case, workflow-oriented metrics (time from sending the notification request to its reception by the device) would have more clearly revealed the problem.

💡 The Power of a Hybrid Approach

Service-oriented metrics are great for identifying "under-the-hood" issues. Workflow-oriented metrics are essential for understanding the "surface-level" user experience issues. Using both together provides the most comprehensive visibility.

Scenarios like these clearly demonstrate the inadequacy of relying on just one approach. In a financial calculator application I developed on my own VPS, I faced a similar situation. I needed to optimize the communication between the application's main calculation module and the database. Service-oriented metrics showed the calculation module's CPU usage and memory consumption, and I could separately see how long database queries took and how many resources they consumed, but neither view on its own revealed the performance bottleneck. Two things solved the problem: performance monitoring tools specific to the database (PostgreSQL) and, from a workflow perspective, measuring the time from the start of a calculation request through the completion of the database queries to the returned result.

Technical Depth: How Do We Collect and Analyze Metrics?

Whether service-oriented or workflow-oriented, we need the right tools and techniques to effectively collect and analyze RED metrics.

Tools for Service-Oriented Metrics

One of the most popular tools for service-oriented metrics is Prometheus. Prometheus acts as a time-series database and collects metrics via HTTP scraping: it pulls them rather than having applications push them. Our applications are configured to expose specific metrics through an HTTP endpoint that Prometheus scrapes. For example, a web server (Nginx) can be configured to expose metrics like request counts, response times, and error codes for Prometheus to collect.

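# Example configuration to enable Prometheus metrics in Nginx (with ngx_http_prometheus_module)
# Directive names depend on the third-party module you build Nginx with;
# check them against that module's documentation before use.
location /metrics {
    prometheus;
    prometheus_bucket_interval 0.001; # For millisecond precision
}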

Visualization tools like Grafana are used to fetch data from Prometheus and create understandable dashboards. These dashboards allow us to monitor service-oriented RED metrics (Rate, Errors, Duration) in real-time. For instance, we can create panels showing a service's requests per second (Rate), its 5xx error rate (Errors), and its average request duration (Duration).

Tools for Workflow-Oriented Metrics

Workflow-oriented metrics are typically collected using distributed tracing tools. Projects like Jaeger, Zipkin, or OpenTelemetry provide an end-to-end view by tracing requests across different services and how they connect to each other. With these tools, we can see how much time a request spends moving from one service to another, which services are causing delays, or which steps encounter errors.

For example, consider a series of events occurring as a user browses a website: loading the homepage, clicking on a product, adding it to the cart, and making a payment. With OpenTelemetry, "spans" are created for each of these steps. These spans contain the start and end time of the request, a trace ID, and metadata related to the operation. Once this data is collected, we can calculate the total duration, average duration, and error rate of a specific workflow.
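As a rough illustration of that last step, the sketch below groups spans by trace ID and derives workflow-level Rate, Errors, and Duration. The `Span` record is a simplified stand-in, not the OpenTelemetry data model, and the sample values are made up. A workflow's duration is taken as the latest span end minus the earliest span start within the trace.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    start: float  # seconds
    end: float    # seconds
    error: bool = False


def workflow_red(spans, window_seconds):
    """Return (workflows per second, error ratio, mean end-to-end duration)."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span.trace_id].append(span)
    if not by_trace:
        return 0.0, 0.0, 0.0
    durations = []
    failed = 0
    for trace_spans in by_trace.values():
        # End-to-end time: earliest start to latest end, so nested spans are not double-counted.
        durations.append(max(s.end for s in trace_spans) - min(s.start for s in trace_spans))
        # A workflow run counts as failed if any of its spans failed.
        failed += any(s.error for s in trace_spans)
    count = len(by_trace)
    return count / window_seconds, failed / count, sum(durations) / count


spans = [
    Span("t1", "checkout", 0.0, 1.2),
    Span("t1", "inventory", 0.1, 0.4),
    Span("t1", "payment", 0.5, 1.1),
    Span("t2", "checkout", 10.0, 13.0),
    Span("t2", "inventory", 10.2, 10.5),
    Span("t2", "payment", 10.6, 12.9, error=True),
]
rate, error_ratio, mean_duration = workflow_red(spans, window_seconds=60)
```

Here one of two checkout runs fails, so the workflow error ratio is 0.5 even though only one span out of six reported an error.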

ℹ️ The Role of OpenTelemetry

OpenTelemetry offers a standardized way to collect both service-oriented and workflow-oriented metrics. By integrating your applications with OpenTelemetry, you can obtain both service-level metrics and distributed tracing data. This simplifies your monitoring infrastructure and provides a more consistent data collection process.

Using these tools, we can collect counters like http_requests_total and histograms like http_request_duration_seconds for service-oriented metrics, while also calculating metrics such as total transaction times and error rates derived from distributed tracing data for workflow-oriented metrics. For instance, to calculate the average duration of an order creation workflow, we take the end-to-end duration of each trace (the root span's duration, or the latest span end minus the earliest span start) and average it across traces. Simply summing every span would double-count nested and parallel work. This allows us to measure the health of the entire workflow, not just individual services.
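Histograms such as http_request_duration_seconds store cumulative bucket counts, and percentiles are estimated from them rather than read directly. The following sketch shows the general idea with linear interpolation inside the matching bucket, which is similar in spirit to what PromQL's histogram_quantile does. The `estimate_quantile` function and the bucket values are illustrative assumptions.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    where the last upper_bound is math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Nothing to interpolate towards in the open-ended bucket.
                return lower_bound
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# 100 requests: 50 took <= 0.1s, 90 took <= 0.5s, all took <= 1.0s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, buckets)
p50 = estimate_quantile(0.50, buckets)
```

The estimate is only as precise as the bucket boundaries, so it pays to place buckets around the latencies you actually care about.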

Which Approach for Which Case? The Decision-Making Process

Choosing the right approach depends on our system's complexity and business goals. For a simple monolithic application, service-oriented metrics might be sufficient. However, in a microservice-based, complex, and high-traffic system, workflow-oriented metrics are absolutely essential.

Situations Favoring Service-Oriented Metrics:

Simple Monolithic Applications: In applications running on a single codebase, service-oriented metrics generally provide adequate visibility.
Basic Infrastructure Monitoring: When monitoring the health of core components like databases, message queues, or cache services.
New Services in Development: To understand a service's own performance before it's fully integrated into workflows.
Cost Constraints: Situations where there are insufficient resources to set up extensive distributed tracing infrastructure.

Situations Favoring Workflow-Oriented Metrics:

Microservice Architectures: In complex systems where multiple services interact.
Critical User Workflows: When the performance of steps users take to complete a task (e.g., payment, ordering, registration) is critical.
End-to-End Performance Issues: To diagnose situations where the overall system is slow despite individual services performing well.
User Experience Focus: When the goal is to best understand and improve the experience perceived by the end-user.

🔥 Risks of the Wrong Choice

Choosing the wrong approach can add unnecessary complexity and cause you to overlook critical issues. Relying solely on service-oriented metrics may prevent you from understanding user-facing problems. Workflow-oriented metrics alone can make it harder to detect underlying infrastructure issues.

Generally, the most balanced approach is to combine both methods. While collecting service-oriented metrics for critical services, it's also essential to calculate RED metrics using distributed tracing data for the most important user workflows. This creates the most comprehensive and effective monitoring strategy, ensuring both infrastructure health and user satisfaction. For example, when a service experiences a sudden spike in errors, service-oriented metrics help you quickly find the source, while you use workflow-oriented metrics to understand which workflows are affected by this error.
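Linking the two views can be as simple as an index from each service to the workflows whose traces pass through it. The sketch below assumes trace summaries with hypothetical `workflow` and `services` fields; the names are illustrative and would normally come from span attributes.

```python
from collections import defaultdict

# Hypothetical per-trace summaries derived from distributed tracing data.
traces = [
    {"workflow": "order_creation", "services": {"customer", "inventory", "pricing"}},
    {"workflow": "order_approval", "services": {"approval", "inventory", "notification"}},
    {"workflow": "shipment_report", "services": {"reporting", "shipping"}},
]


def workflows_by_service(traces):
    """Map each service name to the set of workflows that call it."""
    index = defaultdict(set)
    for trace in traces:
        for service in trace["services"]:
            index[service].add(trace["workflow"])
    return index


index = workflows_by_service(traces)
# When the inventory service shows an error spike, list the workflows to check first.
affected = sorted(index["inventory"])
```

The same index read in the other direction answers the opposite question: which services to inspect when a workflow's duration degrades.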

Conclusion: A Pragmatic Perspective

The question of whether to design RED metrics from a service-oriented or workflow-oriented perspective doesn't have a single "best" answer. Both approaches have their strengths and weaknesses. As a pragmatic engineer, I see these two approaches not as alternatives but as complementary.

💡 In Summary: What to Do?

1. Lay a Solid Foundation: Collect RED metrics (Rate, Errors, Duration) for each critical service. This allows you to quickly identify service-level issues.

2. Center on the User: Identify your most critical workflows and use distributed tracing tools to monitor their end-to-end performance. Calculate workflow-oriented RED metrics.

3. Connect the Dots: Analyze both service metrics and distributed tracing data together to understand which workflows are affected by anomalies in service-oriented metrics and which services cause problems in workflow-oriented metrics.

4. Choose Tools Wisely: Integrate tools like Prometheus, Grafana, Jaeger, and OpenTelemetry based on your system's needs.

It's important to remember that monitoring isn't just about collecting data; it's about transforming that data into meaningful insights and using those insights to improve our systems. Whichever approach you choose, ensure your metrics are understandable, actionable, and truly reflect your system's health. As I mentioned in my previous post, "Fundamentals of Observability", the right metrics allow you to keep a pulse on your system.

In conclusion, while service-oriented metrics help us understand the building blocks, workflow-oriented metrics show how these blocks work together – the real user experience. Using these two in balance is the key to achieving both a robust infrastructure and happy users.
