DEV Community

Lakhan Samani

Implementing monitoring and alerting for distributed system - Part 6

Introduction

In Part 5, we implemented tracing in the userd and orderd services. Tracing follows a single request's journey across services, while metrics give aggregate visibility into system behavior, performance, and health. In distributed systems, metrics help with:

  • Performance Monitoring – Detect bottlenecks and optimize services.
  • Latency Tracking – Identify slow services in request chains.
  • Error Detection – Spot failing components before they cause outages.
  • Capacity Planning – Predict resource needs to avoid over/under-provisioning.
  • Alerting & Troubleshooting – Enable proactive issue resolution.

In this post, we will set up metrics and alerting for the userd and orderd services of our e-commerce system using Prometheus.


Why Prometheus?

Prometheus is a widely used choice for monitoring gRPC-based distributed systems because of its pull-based model, efficient time-series storage, and powerful query language (PromQL). Here's why it's widely used:

  1. Pull-Based Model for Scalability
  2. Efficient Storage & Retrieval
  3. gRPC Support via Middleware (go-grpc-middleware exports standard gRPC server and client metrics)
  4. Powerful Query Language (PromQL)
  5. Easy Integration with Alerting & Dashboards

What kind of metrics are we interested in?

  1. Request Metrics
    • Total Requests (grpc_server_started_total, grpc_server_handled_total): Number of gRPC requests received and completed.
    • Request Rate: Requests per second (RPS), derived from the counters above.
  2. Latency Metrics
    • Request Duration (grpc_server_handling_seconds): Time taken to handle each request. go-grpc-middleware only exports this histogram when it is enabled on the server metrics.
  3. Error Metrics
    • gRPC Status Codes (grpc_server_handled_total with the grpc_code label)
  4. Custom Business Metrics
    • Login API failure / success counter
    • Invalid login attempts counter
    • Register API failure / success counter
    • Me API failure / success counter
    • Create Order API failure / success counter
    • Fetch Order API failure / success counter

Integrating Metrics in userd

We updated userd to start an HTTP metrics server that Prometheus can scrape.

Step 1: Configure Prometheus Go Client

We use the Prometheus Go client to start a metrics HTTP server in a goroutine, so it runs in the background alongside the gRPC server.

Prometheus then scrapes this endpoint.

We use go-grpc-middleware to register the standard gRPC server metrics.


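import (
    "log"
    "net/http"

    grpcprom "github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "google.golang.org/grpc"
)

const (
    // Must match the userd target in prometheus.yml (9091); orderd uses 9092.
    metricsPort = ":9091"
)

var (
    grpcMetrics = grpcprom.NewServerMetrics()
)

func main() {
    // ...
    // Register the gRPC metrics with the default registry so that
    // promhttp.Handler() exposes them.
    prometheus.MustRegister(grpcMetrics)

    // Create a new gRPC server
    server := grpc.NewServer(
        grpc.StatsHandler(serverHandler),
        grpc.ChainUnaryInterceptor(
            grpcMetrics.UnaryServerInterceptor(),
        ),
        grpc.ChainStreamInterceptor(
            grpcMetrics.StreamServerInterceptor(),
        ),
    )

    // Start the Prometheus HTTP server in the background
    go func() {
        http.Handle("/metrics", promhttp.Handler())
        log.Println("Prometheus metrics server running on port", metricsPort)
        log.Fatal(http.ListenAndServe(metricsPort, nil))
    }()
    // ... (gRPC service registration and serving omitted)
}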

Note: orderd needs a similar change to its server setup in main.go, with its metrics server on port 9092.

Step 2: Define custom metrics for the userd API service

package service

import "github.com/prometheus/client_golang/prometheus"

const (
    serviceName = "userd"
    component   = "service"

    loginHandlerLabel    = "login_handler"
    registerHandlerLabel = "register_handler"
    meHandlerLabel       = "me_handler"

    successResultLabel         = "success"
    userNotFoundResultLabel    = "user_not_found"
    invalidPasswordResultLabel = "invalid_password"
    userExistsResultLabel      = "user_exists"
    invalidTokenResultLabel    = "invalid_token"
    missingTokenResultLabel    = "missing_token"
)

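var (
    // Counters for the userd APIs, partitioned by the "result" label

    loginMetrics = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: serviceName,
        Subsystem: component,
        Name:      loginHandlerLabel,
        Help:      "Login API calls by result.",
    }, []string{"result"})

    registerMetrics = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: serviceName,
        Subsystem: component,
        Name:      registerHandlerLabel,
        Help:      "Register API calls by result.",
    }, []string{"result"})

    meMetrics = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: serviceName,
        Subsystem: component,
        Name:      meHandlerLabel,
        Help:      "Me API calls by result.",
    }, []string{"result"})
)

func init() {
    // Counters must be registered with the default registry to show up on /metrics.
    prometheus.MustRegister(loginMetrics, registerMetrics, meMetrics)
}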
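
The Prometheus Go client joins Namespace, Subsystem, and Name with underscores to form the metric name you will query. The sketch below mirrors that rule with the standard library only (the fqName helper is a hypothetical stand-in for the client's prometheus.BuildFQName), so you can see which series names to expect from the definitions above.

```go
package main

import (
    "fmt"
    "strings"
)

// fqName joins the non-empty parts with "_", mirroring how the Prometheus
// client builds a fully qualified metric name. An empty name yields "".
func fqName(namespace, subsystem, name string) string {
    if name == "" {
        return ""
    }
    parts := make([]string, 0, 3)
    for _, p := range []string{namespace, subsystem, name} {
        if p != "" {
            parts = append(parts, p)
        }
    }
    return strings.Join(parts, "_")
}

func main() {
    for _, handler := range []string{"login_handler", "register_handler", "me_handler"} {
        // Each distinct result value becomes its own time series under this name.
        fmt.Printf("%s{result=\"success\"}\n", fqName("userd", "service", handler))
    }
}
```

So the login counter is queried as userd_service_login_handler, filtered by the result label.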

Step 3: Call metrics in service handlers

Example: Login (service/login.go)

// Get user by email
resUser, err := s.DBProvider.GetUserByEmail(ctx, email)
if err != nil {
    if errors.Is(err, gorm.ErrRecordNotFound) {
        loginMetrics.WithLabelValues(userNotFoundResultLabel).Inc()
        return nil, errors.New("user not found")
    }
    return nil, err
}

// Match password
if err := bcrypt.CompareHashAndPassword([]byte(resUser.Password), []byte(password)); err != nil {
    loginMetrics.WithLabelValues(invalidPasswordResultLabel).Inc()
    return nil, errors.New("invalid password")
}

// Generate JWT token
token, err := utils.GenerateJWT(s.JWTSecret, resUser.ID)
if err != nil {
    return nil, err
}
loginMetrics.WithLabelValues(successResultLabel).Inc()

Integrating Metrics in orderd

orderd exposes server-side metrics for its own gRPC API and also needs client-side metrics for the gRPC calls it makes to userd.

Step 1 is the same as for the userd service.

Step 2: Add gRPC Client Interceptors

A gRPC client interceptor collects gRPC metrics for the calls orderd makes to the userd service.

var (
    grpcClientMetrics = grpcprom.NewClientMetrics()
)

// In main: register the client metrics so they are exposed on /metrics
prometheus.MustRegister(grpcClientMetrics)

// Create UserServiceClient using grpc
grpcConn, err := grpc.NewClient(
    userServiceURL,
    grpc.WithTransportCredentials(insecure.NewCredentials()),
    grpc.WithStatsHandler(openTelemetryClientHandler),
    // Add gRPC Client Interceptors for Prometheus Metrics
    grpc.WithUnaryInterceptor(grpcClientMetrics.UnaryClientInterceptor()),
    grpc.WithStreamInterceptor(grpcClientMetrics.StreamClientInterceptor()),
)
if err != nil {
    log.Fatalf("failed to create userd client: %v", err)
}
userServiceClient := user.NewUserServiceClient(grpcConn)

Step 3: Define metrics for orderd services

package service

import "github.com/prometheus/client_golang/prometheus"

const (
    serviceName = "orderd"
    component   = "service"

    totalOrdersCreatedMetric = "total_orders"
    totalFetchedOrdersMetric = "total_fetched_orders"

    successResultLabel = "success"
    failedResultLabel  = "failed"
)

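var (
    ordersCreatedMetrics = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: serviceName,
        Subsystem: component,
        Name:      totalOrdersCreatedMetric,
        Help:      "Create Order API calls by result.",
    }, []string{"result"})

    // Caution: order_id is an unbounded label, so every fetched order creates
    // a new time series. Consider dropping it if the number of orders is large.
    ordersFetchedMetrics = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: serviceName,
        Subsystem: component,
        Name:      totalFetchedOrdersMetric,
        Help:      "Fetch Order API calls by result and order ID.",
    }, []string{"result", "order_id"})
)

func init() {
    // Counters must be registered with the default registry to show up on /metrics.
    prometheus.MustRegister(ordersCreatedMetrics, ordersFetchedMetrics)
}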

Step 4: Example integration of metrics in the API service

resOrder, err := s.DBProvider.GetOrderById(ctx, orderID)
if err != nil {
    ordersFetchedMetrics.WithLabelValues(failedResultLabel, orderID).Inc()
    return nil, err
}

ordersFetchedMetrics.WithLabelValues(successResultLabel, orderID).Inc()

Setting Up Prometheus

Step 1: Define Prometheus configurations

Create a file called prometheus.yml with the following configuration:

global:
  scrape_interval: 5s

scrape_configs:
  - job_name: "userd"
    static_configs:
      - targets: ["host.docker.internal:9091"]

  - job_name: "orderd"
    static_configs:
      - targets: ["host.docker.internal:9092"]

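Here we configure the targets that Prometheus scrapes for metrics. We use host.docker.internal because Prometheus runs in Docker while the services run on the host. If Prometheus runs directly on the same host as the services, you can use localhost instead. On Linux, host.docker.internal may not resolve by default; adding --add-host=host.docker.internal:host-gateway to the docker run command is one way to make it available.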

Step 2: Run prometheus using docker

docker run -d --name=prometheus -p 9090:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

Note: the -v flag mounts the prometheus.yml defined above from the current directory into the container.

Open http://localhost:9090 in your browser to visualize metrics.

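Setting up Alerts using Prometheus

Let's set up an alert that fires when more than 5 gRPC client calls fail within a minute. The same pattern works for the custom counters, for example invalid password attempts > 3.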

Step 1: Create alert_rules.yml

groups:
  - name: grpc_alerts
    rules:
      - alert: HighGRPCFailures
        expr: sum(increase(grpc_client_handled_total{grpc_code!="OK"}[1m])) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High gRPC Client Failures"
          description: "More than 5 gRPC client failures detected in the last minute."

Step 2: Create alertmanager.yml

A placeholder example that sends alerts to Slack (replace the webhook URL with your own):

global:
  resolve_timeout: 5m

route:
  receiver: "slack-notifications"

receivers:
  - name: "slack-notifications"
    slack_configs:
      - channel: "#alerts"
        send_resolved: true
        username: "prometheus"
        api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

Step 3: Update prometheus.yml

global:
  scrape_interval: 5s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["host.docker.internal:9093"]

rule_files:
  - alert_rules.yml

scrape_configs:
  - job_name: "userd"
    static_configs:
      - targets: ["host.docker.internal:9091"]

  - job_name: "orderd"
    static_configs:
      - targets: ["host.docker.internal:9092"]


Step 4: Start alertmanager & Restart prometheus

docker run -d --name=alertmanager \
  -p 9093:9093 \
  -v $(pwd)/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  prom/alertmanager

Remove the existing Prometheus container (the name is already in use), then start it again with the rules file mounted:

docker rm -f prometheus
docker run -d --name=prometheus -p 9090:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml -v $(pwd)/alert_rules.yml:/etc/prometheus/alert_rules.yml prom/prometheus

Code Links

For the complete change log, please check

Conclusion

We successfully added monitoring and alerting to our distributed app using Prometheus metrics and Alertmanager.

Next Steps:

Explore deploying the distributed app on Kubernetes (k8s).

Stay tuned for the next post! 🚀
