Vincent Davis
Building GBIM Observability: From Correlation IDs to a Populated k6 Dashboard

Summary

In this iteration, I improved the observability of GBIM on both the backend and frontend, based on the latest origin/staging branch. The focus is not just on installing monitoring tools. It is about ensuring that operational data relevant to the application's workflow actually appears in Prometheus, Grafana, Sentry, GA4, and the k6 dashboard.

Key changes include gbm_* custom Prometheus metrics, structured logs with end-to-end correlation IDs, GA4 event analytics for user activity, Prometheus alert rules, and a k6 job utilizing Prometheus remote write. This ensures the k6-prometheus dashboard is no longer empty after the job runs.

Initial Problems

The monitoring stack was already in place prior to these changes, but several crucial pieces of evidence remained weak.

  • The k6 dashboard in Grafana was empty because no k6 job was writing test results to Prometheus via remote write.
  • The registration, account activation, token reactivation, admin account verification, and admin submission status update flows lacked explicit business metrics.
  • Tracing from frontend requests to backend logs was inconsistent because the frontend did not send correlation IDs and the backend did not return them.
  • Frontend user activity events had not been standardized, so there was no evidence of user activity monitoring.

Implemented Solutions

1. Backend Metrics

The backend now includes a monitoring/metrics.py file containing Prometheus counters and histograms for essential flows.

  • gbm_auth_register_total{role,outcome}
  • gbm_auth_activation_total{outcome}
  • gbm_auth_reactivation_total{outcome}
  • gbm_auth_email_send_duration_seconds{event,outcome}
  • gbm_admin_account_verification_total{action,outcome}
  • gbm_pengajuan_admin_status_update_total{status,outcome}

These metrics are emitted directly from the views and services handling registration, activation, reactivation, admin account verification, and admin submission status updates. As a result, the dashboard shows business outcomes such as success, validation_error, token_invalid, token_expired, server_error, and service_error rather than only generic HTTP request counts.
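
As a sketch of how monitoring/metrics.py might define a few of these instruments (assuming the standard prometheus_client library; the metric names and label sets come from the list above, while the example label values such as the role are illustrative assumptions):

```python
# monitoring/metrics.py (sketch) - assumes the prometheus_client package.
from prometheus_client import Counter, Histogram

# Business counters; outcome values like "success" or "token_expired"
# are set by the views/services that emit them.
AUTH_REGISTER_TOTAL = Counter(
    "gbm_auth_register_total",
    "Registration attempts by role and outcome",
    ["role", "outcome"],
)
AUTH_ACTIVATION_TOTAL = Counter(
    "gbm_auth_activation_total",
    "Account activation attempts by outcome",
    ["outcome"],
)
EMAIL_SEND_DURATION = Histogram(
    "gbm_auth_email_send_duration_seconds",
    "Email delivery duration by event and outcome",
    ["event", "outcome"],
)

# Example emission from a registration flow (role value is illustrative):
AUTH_REGISTER_TOTAL.labels(role="guru_besar", outcome="success").inc()
with EMAIL_SEND_DURATION.labels(event="activation", outcome="success").time():
    pass  # send_activation_email(...) would run here
```

The views then only call `.labels(...).inc()` at each decision branch, which is what makes the per-outcome panels possible.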

[Placeholder SS-01] Prometheus targets django-backend and prometheus are in the UP state, indicating that backend metrics are successfully scraped by Prometheus.

[Placeholder SS-02] The GBM Business Monitoring dashboard in Grafana displays panels for registration, activation, admin verification, and admin submission status updates.

2. End-to-End Correlation IDs

The frontend now attaches an X-Correlation-ID header to requests via lib/api.ts. The backend accepts valid correlation IDs, generates a new one if the provided header is unsafe, and then returns it in the response.

The benefit is straightforward. When a user experiences an error on the frontend, the correlation ID from the response can be used to search for the relevant backend logs. This speeds up investigations since a single request can be traced across multiple layers.

[Placeholder SS-03] The browser's Network tab shows requests and responses with the X-Correlation-ID, and the backend logs display the identical corr_id for that specific request.

3. Frontend Analytics

The frontend includes a lib/analytics.ts helper to send GA4 events. This helper is a no-op when window.gtag is unavailable, and more importantly, it only activates if all the following conditions are met.

  1. NEXT_PUBLIC_GA_MEASUREMENT_ID is available.
  2. NEXT_PUBLIC_APP_ENV is set to staging or production.
  3. The runtime host is included in the analytics allowlist, for instance, gbim-staging.ppl.cs.ui.ac.id.

Thanks to this validation, analytics events are not sent from the local environment even if a developer accidentally populates the GA4 variables in .env.local.

Instrumented events include the following.

  • register_submitted, register_success, register_failed
  • activation_verified, activation_expired, activation_used, activation_invalid, activation_rate_limited
  • reactivation_requested, reactivation_success, reactivation_failed
  • admin_verification_list_viewed, admin_verification_detail_clicked, admin_verification_status_updated
  • pengajuan_admin_status_updated

[Placeholder SS-04] GA4 Realtime or DebugView displays one of the staging events, such as register_submitted, activation_invalid, or admin_verification_list_viewed.

4. Alerting

Prometheus alert rules were added for actionable conditions.

  • ActivationFailureRateHigh
  • RegisterServerError
  • AdminVerificationErrorBurst
  • PengajuanStatusUpdateServiceError
  • K6HighFailureRate
  • K6HighP95Latency

These alerts transform the monitoring setup from a passive dashboard into a proactive system. The team is immediately notified when the activation failure rate, registration errors, admin verification errors, submission service errors, or k6 metrics exceed their thresholds.
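
As an illustration, a rule like ActivationFailureRateHigh could take roughly this shape in the Prometheus rules file. The expression, window, and threshold here are assumptions for the sketch, not the deployed values:

```yaml
groups:
  - name: gbm-business
    rules:
      - alert: ActivationFailureRateHigh
        # Assumed expression: share of non-success activations over 10 minutes.
        expr: |
          sum(rate(gbm_auth_activation_total{outcome!="success"}[10m]))
            / sum(rate(gbm_auth_activation_total[10m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Activation failure rate above 20% for 5 minutes"
```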

Grafana alerting is also provisioned to a Discord contact point named GBM_MONITORING_DISCORD using the DISCORD_WEBHOOK_URL environment variable. This is critical because alerts that stop at the dashboard are insufficient as operational evidence. Active notifications must be validated directly from the team's communication channels.

[Placeholder SS-05] Grafana Alerting shows the business and k6 rules along with the GBM_MONITORING_DISCORD contact point utilizing a Discord webhook.

5. Populating the Grafana Dashboard with k6

The k6 component is set up to populate the [k6-prometheus dashboard](https://gbim-staging.ppl.cs.ui.ac.id/grafana/d/ccbb2351-2ae2-462f-ae0e-f2c893ad1028/k6-prometheus).

The prepared implementation involves several key configurations.

  • The Prometheus deployment utilizes the --web.enable-remote-write-receiver argument.
  • The k8s/job/k6-monitoring-smoke.yaml Kubernetes Job runs the grafana/k6:latest image.
  • The k6 job includes ttlSecondsAfterFinished: 600 so that Completed pods are cleaned up automatically.
  • k6 leverages the experimental-prometheus-rw output.
  • The remote write is directed to http://prometheus:9090/api/v1/write.
  • The test is tagged with testid=monitoring-smoke.
  • Local scripts are available at k6/monitoring-smoke.js and k6/activation-alert.js.
  • The pipeline waits for the backend to be ready using kubectl rollout status deployment/gurubesarmengajar and the /api/metrics readiness check before executing k6.

Important note: the k6 dashboard will only populate once the latest Prometheus manifest is deployed and the k6 Job has been executed. If the job has never run, or the Grafana time range is too narrow, the dashboard may still appear empty. For evidence purposes, use a time range that covers the job's execution, such as now-15m or now-1h.

[Placeholder SS-06] The k6-monitoring-smoke pipeline job completes, the k6 logs indicate the test is running, and the output utilizes the Prometheus remote write.

[Placeholder SS-07] The k6 Prometheus dashboard is populated with testid=monitoring-smoke, covering request rate, p95 latency, failure rate, virtual users, and checks success rate.

How to Reproduce k6 Evidence

  1. Deploy the latest Prometheus manifest that enables the remote write receiver.
  2. Run the k6 Job:

```shell
kubectl apply -f k8s/job/k6-monitoring-smoke.yaml
```

  3. Check the job status:

```shell
kubectl -n ppl-aptikom get job k6-monitoring-smoke
kubectl -n ppl-aptikom logs job/k6-monitoring-smoke
```

  4. Open the k6-prometheus dashboard, select the Prometheus data source, choose testid=monitoring-smoke, and set the time range to a period after the job execution.

You can use these queries for quick PromQL validation:

```
k6_http_reqs_total{testid="monitoring-smoke"}
k6_http_req_duration_seconds_p95{testid="monitoring-smoke"}
k6_http_req_failed_rate{testid="monitoring-smoke"}
```

If the cluster still retains older metrics without unit suffixes, the dashboard fallback also accepts k6_http_req_duration_p95.
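
For a scripted check that the k6 series exist, including the fallback name, a small helper can hit the Prometheus HTTP API. This is a hypothetical convenience script, assuming the in-cluster http://prometheus:9090 address from the job config (adjust to a port-forward address when running locally):

```python
# Hypothetical helper to confirm k6 series exist in Prometheus.
# Assumes Prometheus is reachable at base_url (in-cluster or port-forwarded).
import json
import urllib.parse
import urllib.request

P95_CANDIDATES = ["k6_http_req_duration_seconds_p95", "k6_http_req_duration_p95"]

def build_query(metric: str, testid: str = "monitoring-smoke") -> str:
    """Build the instant-query PromQL selector used on the dashboard."""
    return f'{metric}{{testid="{testid}"}}'

def has_series(base_url: str, query: str) -> bool:
    """Return True if the instant query yields at least one result."""
    url = (base_url.rstrip("/") + "/api/v1/query?"
           + urllib.parse.urlencode({"query": query}))
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    return body.get("status") == "success" and bool(body["data"]["result"])

if __name__ == "__main__":
    base = "http://prometheus:9090"  # assumption; adjust to your access path
    for metric in ["k6_http_reqs_total", "k6_http_req_failed_rate", *P95_CANDIDATES]:
        print(metric, has_series(base, build_query(metric)))
```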

Mapping to CPL 6 DA

Criterion 1 Built-in Platform Monitoring

Prometheus and Grafana are not merely installed as isolated services. They are fully integrated into the GBIM application lifecycle within Kubernetes. The backend exposes /api/metrics for Prometheus scraping and a health check endpoint for readiness validation, while the Grafana dashboard is provisioned via ConfigMap so it persists even if the Grafana pod restarts.

This implementation also embeds observability directly into the deployment pipeline. The Prometheus manifests, alert rules, dashboards, and Grafana alerting configurations are applied through CI/CD, meaning monitoring changes can be reviewed and pushed just like application code. Through this pattern, monitoring becomes a reproducible part of the platform rather than a manual configuration in the Grafana UI.

Criterion 2 Standard Tool Setup with Live Data

The tools used are industry standards. We use Prometheus for metrics, Grafana for dashboards and alerting, k6 for load testing, GA4 for frontend analytics, Sentry for error visibility, and structured logs for backend investigations. The displayed data is entirely authentic rather than static mocks, as the metrics originate from actual staging requests, user flow events, and a k6 job that genuinely writes test results to Prometheus via remote write.

The most visible improvement is on the k6 dashboard. Previously, the k6-prometheus dashboard was completely blank because no job was writing k6 metrics to Prometheus. Following these changes, the pipeline executes k6-monitoring-smoke, tags it with testid=monitoring-smoke, and the dashboard subsequently reads metrics such as k6_http_reqs_total, k6_http_req_failed_rate, and k6_http_req_duration_seconds_p95.

Criterion 3 Customization Tailored to Workflows

Customizations are tailored specifically to the GBIM workflows that matter most for operations, going far beyond basic CPU, memory, or HTTP status tracking. The backend introduces business metrics for registration, account activation, token reactivation, admin account verification, email delivery duration, and admin submission status updates. Labels such as outcome, role, action, and status enable the dashboard to distinctly categorize successes, validation failures, invalid tokens, expired tokens, not found errors, server errors, and service errors.

On the frontend, GA4 events are specifically designed to track relevant user activities, such as registration submissions, activation outcomes, reactivation requests, viewing the admin verification list, clicking account details, updating account statuses, and updating submission statuses. To prevent polluting the analytics data, the analytics wrapper is exclusively active in staging or production environments and on approved hosts.

Criterion 4 Advanced Usage

Advanced usage is demonstrated in two main areas: actionable alerts and end-to-end observability. Prometheus and Grafana alerts are established for conditions requiring follow-up actions, including spikes in activation failures, registration server errors, admin verification errors, submission update service errors, k6 failure rates, and k6 p95 latency. These alerts are routed to Discord via the GBM_MONITORING_DISCORD contact point, so the team does not have to watch dashboards constantly to detect issues.

Furthermore, correlation IDs seamlessly link frontend requests to backend logs. When a user encounters an error in the browser, the X-Correlation-ID provided in the response can be queried in the backend logs as corr_id, ensuring investigations do not stall at aggregate dashboards. The combination of metrics, alerts, load tests, analytics, and correlation IDs transforms this monitoring setup into a powerful diagnostic tool rather than just passive documentation.

Conclusion

These enhancements make GBIM observability highly concrete for CPL 6 DA requirements, as every layer now possesses its own operational evidence. The backend supplies custom Prometheus metrics for business flows, the frontend dispatches environment-restricted analytics events, client-server requests are traceable via correlation IDs, and k6 generates performance metrics that flow directly into the Grafana dashboard through remote write.

The final outcome is not merely having Grafana installed. It is a monitoring system that can answer vital operational questions: whether registrations frequently fail, whether activations are problematic, whether admin verifications generate errors, whether submission status updates are stable, and whether the backend remains responsive under k6 load. Once the backend and frontend are deployed and the k6 job completes, the screenshots inserted in each section will serve as concrete evidence that the monitoring works end to end, from the deployment pipeline to the dashboards and alerting.
