Summary
In this iteration, I improved the observability of GBIM on both the backend and frontend, based on the latest origin/staging branch. The focus is not just on installing monitoring tools. It is about ensuring that operational data relevant to the application's workflow actually appears in Prometheus, Grafana, Sentry, GA4, and the k6 dashboard.
Key changes include gbm_* custom Prometheus metrics, structured logs with end-to-end correlation IDs, GA4 event analytics for user activity, Prometheus alert rules, and a k6 job utilizing Prometheus remote write. This ensures the k6-prometheus dashboard is no longer empty after the job runs.
Initial Problems
The monitoring stack was already in place prior to these changes, but several crucial pieces of evidence remained weak.
- The k6 dashboard in Grafana was still empty because no k6 job was writing test results to Prometheus via remote write.
- The flows for registration, account activation, token reactivation, admin account verification, and admin submission status updates lacked explicit business metrics.
- Tracing from frontend requests to backend logs was inconsistent because correlation IDs were neither passed from the frontend nor returned by the backend.
- User activity events on the frontend had not been standardized to provide evidence of user activity monitoring.
Implemented Solutions
1. Backend Metrics
The backend now includes a `monitoring/metrics.py` file containing Prometheus counters and histograms for essential flows.
- `gbm_auth_register_total{role,outcome}`
- `gbm_auth_activation_total{outcome}`
- `gbm_auth_reactivation_total{outcome}`
- `gbm_auth_email_send_duration_seconds{event,outcome}`
- `gbm_admin_account_verification_total{action,outcome}`
- `gbm_pengajuan_admin_status_update_total{status,outcome}`
These metrics are emitted directly from the views and services handling registration, activation, reactivation, admin account verification, and admin submission status updates. Because of this, the dashboard displays business outcomes like success, validation_error, token_invalid, token_expired, server_error, and service_error instead of just generic HTTP requests.
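As a rough illustration of how such counters and histograms are wired up, here is a minimal sketch using the standard `prometheus_client` library. The metric names and labels come from the list above; the helper function and the example label values are assumptions, not the actual GBIM code.

```python
# Sketch of gbm_* metric definitions using prometheus_client.
# Metric names/labels are from the post; everything else is illustrative.
from prometheus_client import Counter, Histogram

# Counts registration attempts by role and business outcome
# (success, validation_error, server_error, ...).
GBM_AUTH_REGISTER_TOTAL = Counter(
    "gbm_auth_register_total",
    "Registration attempts by role and outcome",
    ["role", "outcome"],
)

# Observes how long transactional emails take to send, per event and outcome.
GBM_AUTH_EMAIL_SEND_DURATION = Histogram(
    "gbm_auth_email_send_duration_seconds",
    "Email send duration in seconds",
    ["event", "outcome"],
)

def record_registration(role: str, outcome: str) -> None:
    # Called from the registration view once the business outcome is known,
    # so the counter reflects outcomes rather than raw HTTP status codes.
    GBM_AUTH_REGISTER_TOTAL.labels(role=role, outcome=outcome).inc()
```

Incrementing the counter with an explicit `outcome` label is what lets the Grafana panels split a single endpoint into success, validation_error, token_invalid, and server_error series.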
[Placeholder SS-01] Prometheus targets `django-backend` and `prometheus` are in the `UP` state, indicating that backend metrics are successfully scraped by Prometheus.

[Placeholder SS-02] The `GBM Business Monitoring` dashboard in Grafana displays panels for registration, activation, admin verification, and admin submission status updates.
2. End-to-End Correlation IDs
The frontend now attaches an `X-Correlation-ID` header to requests via `lib/api.ts`. The backend accepts valid correlation IDs, generates a new one if the provided header is unsafe, and then returns it in the response.
The benefit is straightforward. When a user experiences an error on the frontend, the correlation ID from the response can be used to search for the relevant backend logs. This speeds up investigations since a single request can be traced across multiple layers.
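The backend side of this contract can be sketched as a small Django-style middleware. The header name `X-Correlation-ID` and the `corr_id` log field come from the post; the validation regex, length limits, and class name are assumptions for illustration.

```python
import re
import uuid

# Hypothetical validation rule: conservative charset and length so a
# client-supplied ID can be echoed and logged safely.
SAFE_CORR_ID = re.compile(r"[A-Za-z0-9._-]{8,64}")

def resolve_correlation_id(incoming):
    """Accept a safe client-supplied ID; otherwise generate a fresh one."""
    if incoming and SAFE_CORR_ID.fullmatch(incoming):
        return incoming
    return uuid.uuid4().hex

class CorrelationIdMiddleware:
    """Django-style middleware sketch: reads the request header, exposes
    the ID to structured logging, and echoes it on the response."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        corr_id = resolve_correlation_id(request.headers.get("X-Correlation-ID"))
        request.corr_id = corr_id  # picked up by the logging layer as corr_id
        response = self.get_response(request)
        response["X-Correlation-ID"] = corr_id
        return response
```

The key property is that the response always carries a usable ID: either the validated one the frontend sent, or a server-generated replacement.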
[Placeholder SS-03] The browser's Network tab shows requests and responses with the `X-Correlation-ID` header, and the backend logs display the identical `corr_id` for that specific request.
3. Frontend Analytics
The frontend includes a `lib/analytics.ts` helper to send GA4 events. This helper is a no-op when `window.gtag` is unavailable, and more importantly, it only activates if all the following conditions are met.
- `NEXT_PUBLIC_GA_MEASUREMENT_ID` is available.
- `NEXT_PUBLIC_APP_ENV` is set to `staging` or `production`.
- The runtime host is included in the analytics allowlist, for instance, `gbim-staging.ppl.cs.ui.ac.id`.
Thanks to this validation, analytics events are not sent from the local environment even if a developer accidentally populates the GA4 variables in .env.local.
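The gating logic described above is language-agnostic, so here is a compact sketch of it (in Python rather than the actual TypeScript helper). The allowlist entry and the environment names come from the post; the function and variable names are hypothetical.

```python
# Hypothetical sketch of the analytics gating predicate in lib/analytics.ts.
# Only the host allowlist entry and env names are taken from the post.
ANALYTICS_HOST_ALLOWLIST = {"gbim-staging.ppl.cs.ui.ac.id"}

def analytics_enabled(measurement_id, app_env, host):
    """All three conditions must hold, so a populated .env.local on a
    developer machine still produces zero GA4 traffic."""
    return (
        bool(measurement_id)
        and app_env in {"staging", "production"}
        and host in ANALYTICS_HOST_ALLOWLIST
    )
```

Requiring the host check in addition to the environment variables is the piece that keeps local mistakes out of the production analytics property.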
Instrumented events include the following.
- `register_submitted`, `register_success`, `register_failed`
- `activation_verified`, `activation_expired`, `activation_used`, `activation_invalid`, `activation_rate_limited`
- `reactivation_requested`, `reactivation_success`, `reactivation_failed`
- `admin_verification_list_viewed`, `admin_verification_detail_clicked`, `admin_verification_status_updated`
- `pengajuan_admin_status_updated`
[Placeholder SS-04] GA4 Realtime or DebugView displays one of the staging events, such as `register_submitted`, `activation_invalid`, or `admin_verification_list_viewed`.
4. Alerting
Prometheus alert rules were added for actionable conditions.
- `ActivationFailureRateHigh`
- `RegisterServerError`
- `AdminVerificationErrorBurst`
- `PengajuanStatusUpdateServiceError`
- `K6HighFailureRate`
- `K6HighP95Latency`
These alerts transform the monitoring setup from a passive dashboard into a proactive system. The team is immediately notified when the activation failure rate, registration errors, admin verification errors, submission service errors, or k6 metrics exceed their thresholds.
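To make the shape of these rules concrete, here is a hypothetical example of what one of them might look like in a Prometheus rule file. Only the rule name `ActivationFailureRateHigh` and the `gbm_auth_activation_total` metric come from the post; the expression, threshold, and durations are assumed values, not the actual configuration.

```yaml
groups:
  - name: gbm-business-alerts
    rules:
      - alert: ActivationFailureRateHigh
        # Assumed expression: share of non-success activations over 10 minutes.
        expr: |
          sum(rate(gbm_auth_activation_total{outcome!="success"}[10m]))
            / sum(rate(gbm_auth_activation_total[10m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Activation failure rate above 20% for 5 minutes"
```

Because the metric carries an `outcome` label, the alert can fire on business failures specifically rather than on raw HTTP error counts.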
Grafana alerting is also provisioned to a Discord contact point named `GBM_MONITORING_DISCORD` using the `DISCORD_WEBHOOK_URL` environment variable. This is critical because alerts that stop at the dashboard are insufficient as operational evidence. Active notifications must be validated directly from the team's communication channels.
[Placeholder SS-05] Grafana Alerting shows the business and k6 rules along with the `GBM_MONITORING_DISCORD` contact point utilizing a Discord webhook.
5. Populating the Grafana Dashboard with k6
The k6 component is prepared to populate the k6-prometheus dashboard at [https://gbim-staging.ppl.cs.ui.ac.id/grafana/d/ccbb2351-2ae2-462f-ae0e-f2c893ad1028/k6-prometheus](https://gbim-staging.ppl.cs.ui.ac.id/grafana/d/ccbb2351-2ae2-462f-ae0e-f2c893ad1028/k6-prometheus).
The prepared implementation involves several key configurations.
- The Prometheus deployment utilizes the `--web.enable-remote-write-receiver` argument.
- The `k8s/job/k6-monitoring-smoke.yaml` Kubernetes Job runs the `grafana/k6:latest` image.
- The k6 job includes `ttlSecondsAfterFinished: 600` so that `Completed` pods are cleaned up automatically.
- k6 leverages the `experimental-prometheus-rw` output.
- The remote write is directed to `http://prometheus:9090/api/v1/write`.
- The test is tagged with `testid=monitoring-smoke`.
- Local scripts are available at `k6/monitoring-smoke.js` and `k6/activation-alert.js`.
- The pipeline waits for the backend to be ready using `kubectl rollout status deployment/gurubesarmengajar` and the `/api/metrics` readiness check before executing k6.
Important note: the k6 dashboard will populate once the latest Prometheus manifest is deployed and the k6 Job is executed. If the job has never run or the Grafana time range is too narrow, the dashboard may still appear empty. For evidence purposes, use a time range that covers the job's execution time, such as `now-15m` or `now-1h`.
[Placeholder SS-06] The `k6-monitoring-smoke` pipeline job completes, the k6 logs indicate the test is running, and the output uses the Prometheus remote write.

[Placeholder SS-07] The `k6 Prometheus` dashboard is populated with `testid=monitoring-smoke`, covering request rate, p95 latency, failure rate, virtual users, and checks success rate.
How to Reproduce k6 Evidence
- Deploy the latest Prometheus manifest that enables the remote write receiver.
- Run the k6 Job using the following command.

  ```shell
  kubectl apply -f k8s/job/k6-monitoring-smoke.yaml
  ```

- Check the job status.

  ```shell
  kubectl -n ppl-aptikom get job k6-monitoring-smoke
  kubectl -n ppl-aptikom logs job/k6-monitoring-smoke
  ```
- Open the `k6-prometheus` dashboard, select the Prometheus data source, choose `testid=monitoring-smoke`, and then set the time range to a period after the job execution.
You can use these queries for a quick PromQL validation.
```
k6_http_reqs_total{testid="monitoring-smoke"}
k6_http_req_duration_seconds_p95{testid="monitoring-smoke"}
k6_http_req_failed_rate{testid="monitoring-smoke"}
```
If the cluster still retains older metrics without unit suffixes, the dashboard fallback also accepts `k6_http_req_duration_p95`.
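For scripted validation, the same PromQL checks could be run against the standard Prometheus HTTP API instead of the Grafana UI. The in-cluster base URL matches the remote-write target mentioned earlier; the helper function below is hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical helper for scripted PromQL validation; the in-cluster base
# URL mirrors the remote-write target from the k6 configuration above.
PROM_BASE = "http://prometheus:9090"

def instant_query_url(promql: str) -> str:
    """Build a URL for an instant query against /api/v1/query,
    URL-encoding the PromQL expression."""
    return f"{PROM_BASE}/api/v1/query?{urlencode({'query': promql})}"
```

Fetching `instant_query_url('k6_http_reqs_total{testid="monitoring-smoke"}')` from inside the cluster should return a non-empty result vector once the k6 Job has written metrics.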
Mapping to CPL 6 DA
Criterion 1 Built-in Platform Monitoring
Prometheus and Grafana are not merely installed as isolated services. They are fully integrated into the GBIM application lifecycle within Kubernetes. The backend exposes /api/metrics for Prometheus scraping and a health check endpoint for readiness validation, while the Grafana dashboard is provisioned via ConfigMap so it persists even if the Grafana pod restarts.
This implementation also embeds observability directly into the deployment pipeline. The Prometheus manifests, alert rules, dashboards, and Grafana alerting configurations are applied through CI/CD, meaning monitoring changes can be reviewed and pushed just like application code. Through this pattern, monitoring becomes a reproducible part of the platform rather than a manual configuration in the Grafana UI.
Criterion 2 Standard Tool Setup with Live Data
The utilized tools represent industry standards. We use Prometheus for metrics, Grafana for dashboards and alerting, k6 for load testing, GA4 for frontend analytics, Sentry for error visibility, and structured logs for backend investigations. The displayed data is entirely authentic rather than static mocks, as the metrics originate from actual staging requests, user flow events, and a k6 job that genuinely writes test results to the Prometheus remote write.
The most visible improvement is on the k6 dashboard. Previously, the k6-prometheus dashboard was completely blank because no job was writing k6 metrics to Prometheus. Following these changes, the pipeline executes `k6-monitoring-smoke`, tags it with `testid=monitoring-smoke`, and the dashboard subsequently reads metrics such as `k6_http_reqs_total`, `k6_http_req_failed_rate`, and `k6_http_req_duration_seconds_p95`.
Criterion 3 Customization Tailored to Workflows
Customizations are tailored specifically to the GBIM workflows that matter most for operations, going far beyond basic CPU, memory, or HTTP status tracking. The backend introduces business metrics for registration, account activation, token reactivation, admin account verification, email delivery duration, and admin submission status updates. Labels such as outcome, role, action, and status enable the dashboard to distinctly categorize successes, validation failures, invalid tokens, expired tokens, not found errors, server errors, and service errors.
On the frontend, GA4 events are specifically designed to track relevant user activities, such as registration submissions, activation outcomes, reactivation requests, viewing the admin verification list, clicking account details, updating account statuses, and updating submission statuses. To prevent polluting the analytics data, the analytics wrapper is exclusively active in staging or production environments and on approved hosts.
Criterion 4 Advanced Usage
Advanced usage is demonstrated in two main areas. These are actionable alerts and end-to-end observability. Prometheus and Grafana alerts are established for conditions requiring follow-up actions, including spikes in activation failures, registration server errors, admin verification errors, submission update service errors, k6 failure rates, and k6 p95 latency. These alerts are routed to Discord via the GBM_MONITORING_DISCORD contact point, eliminating the need for the team to constantly monitor dashboards to detect issues.
Furthermore, correlation IDs seamlessly link frontend requests to backend logs. When a user encounters an error in the browser, the X-Correlation-ID provided in the response can be queried in the backend logs as corr_id, ensuring investigations do not stall at aggregate dashboards. The combination of metrics, alerts, load tests, analytics, and correlation IDs transforms this monitoring setup into a powerful diagnostic tool rather than just passive documentation.
Conclusion
These enhancements make GBIM observability highly concrete for CPL 6 DA requirements, as every layer now possesses its own operational evidence. The backend supplies custom Prometheus metrics for business flows, the frontend dispatches environment-restricted analytics events, client-server requests are traceable via correlation IDs, and k6 generates performance metrics that flow directly into the Grafana dashboard through remote write.
The final outcome is not merely having Grafana installed. It is a monitoring system capable of answering vital operational questions. We can now determine whether registrations frequently fail, whether activations are problematic, whether admin verifications generate errors, whether submission status updates are stable, and whether the backend remains responsive under k6 testing. Once the backend and frontend are deployed and the k6 job completes, the screenshots inserted in each section will serve as evidence that the monitoring works end to end, from the deployment pipeline to the dashboards and alerting.