Part 4 of 4: From Load Test to Production Monitor — k6 Studio, Grafana Cloud, and Synthetic Monitoring
The first three parts of this series were about running tests. This one is about making them permanent.
In part 1, k6 was a command-line tool you ran against a URL. In part 2 it became a layered test suite version-controlled alongside the app it tests. In part 3 the stress test revealed something real about the app's architecture. All of that is useful as a development workflow. None of it tells you anything about what's happening in production right now.
That's what this post is about. The same scripts, pointed at a publicly reachable endpoint via ngrok, streaming results into Grafana Cloud in real time, and running on a schedule as synthetic monitors. One codebase. Three modes: local development, cloud-streamed load test, permanent availability check.
All the code is here: https://github.com/mwimpelberg28/k6-playground
Exposing the homelab with ngrok
The Online Boutique runs on a private cluster at 10.4.20.2. Grafana Cloud's synthetic monitoring probes can't reach that address, because they run from data centers in major cloud providers. To demo synthetic monitoring against a real app rather than a public URL I don't control, I needed to expose the cluster temporarily.
ngrok handles this in one command, pointed at the cluster's frontend service on the reserved free static domain:
ngrok http --url=imitation-laxative-iphone.ngrok-free.dev 10.4.20.2:80
ngrok prints the forwarding URL:
Forwarding https://imitation-laxative-iphone.ngrok-free.dev -> http://10.4.20.2:80
That URL is now publicly reachable. Any HTTP request to it gets tunneled to the cluster. ngrok's free tier now gives you one reserved static domain on ngrok-free.dev. It's stable across restarts, which is what lets me hardcode it into the committed cloud-* npm scripts and the synthetic monitor config rather than re-editing them every time the tunnel comes up. (An ephemeral tunnel gets a random URL that changes on each restart; a paid plan adds multiple custom domains.)
One honest observation: response times change through the tunnel. TTFB in the local load tests was 36ms because the test runner and the cluster are on the same LAN. Through ngrok, requests travel to ngrok's edge, get forwarded to the cluster, and travel back — a single curl through the tunnel measured TTFB around 315ms, roughly 9× the LAN figure. Under the full load run, request p95 landed at ~670ms (vs 273ms on the LAN). That's not a problem — it's actually more realistic. Local load tests measure server performance. Measuring through the tunnel captures something closer to what a remote user experiences.
Running the load test from Grafana Cloud with k6 cloud run
This is the other reason the app has to be publicly reachable. k6 cloud run doesn't execute on your laptop — it uploads the script and runs it on Grafana Cloud's load generators, in whatever regions you configure. Those runners live in Grafana's data centers, so they reach the Online Boutique exactly the way the synthetic probes do: through the ngrok tunnel, not over the LAN. As the test runs, every metric data point streams back into Grafana Cloud in real time rather than printing to the terminal at the end.
Authentication is a one-time login with an API token:
k6 cloud login --token <your-api-token>
export K6_CLOUD_PROJECT_ID=your-project-id
The token comes from your Grafana Cloud account under k6 → Settings → API Token. The project ID is visible on the same page.
Then the cloud run is the same bundle and the same config, pointed at the public URL:
k6 cloud run dist/test.main.js \
-e CONFIG_FILE=../src/config/load.config.json \
-e BASE_URL=https://imitation-laxative-iphone.ngrok-free.dev
# or: npm run cloud-load
(These commands, like the npm scripts, run from the k6-boutique/ directory — that's why dist/ and ../src/config/ resolve the way they do.)
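The -e BASE_URL flag is what lets one bundle target either the LAN or the tunnel. As a minimal sketch of that idea, assuming the script falls back to the cluster address when no override is given (the repo's actual logic may differ), a hypothetical resolveBaseUrl helper could look like this. In a k6 script the argument would be the built-in __ENV object; here it is a plain object so the snippet also runs under Node.

```javascript
// Hypothetical helper, not taken from the repo: choose the target URL
// from an env-like object (k6 exposes -e variables on the global __ENV).
function resolveBaseUrl(env, fallback = 'http://10.4.20.2:80') {
  const raw = (env && env.BASE_URL) || fallback;
  // Strip trailing slashes so `${base}/cart` never produces a double slash.
  return raw.replace(/\/+$/, '');
}

// No override: local development against the LAN address.
console.log(resolveBaseUrl({}));
// http://10.4.20.2:80

// Override, as passed with -e BASE_URL=... on a cloud run.
console.log(resolveBaseUrl({ BASE_URL: 'https://imitation-laxative-iphone.ngrok-free.dev/' }));
// https://imitation-laxative-iphone.ngrok-free.dev
```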
Every http_req_duration, every custom metric, every check result is written to Grafana Cloud as it happens.
The Grafana Cloud k6 interface gives you a run summary page automatically — no dashboard configuration required. It shows the VU ramp timeline, p95 response time over the run, error rate, and check pass rate. For a quick read it's enough. For deeper analysis — and for correlating load-test results with infrastructure metrics — you want the Grafana dashboard.
What the tunnel run actually surfaced
The cloud run finished and tripped a threshold — k6 exits non-zero when any threshold fails. The cloud UI holds the full metric breakdown; to read the complete table here I ran the same load.config.json through the same tunnel (35 max VUs, three concurrent journeys, five minutes). The interesting part is which threshold failed — every latency threshold held comfortably:
✓ http_req_duration p(95)=668ms (<3000)
✓ {journey:browser} p(95)=783ms (<2000)
✓ {journey:shopper} p(95)=604ms (<4000)
✓ {journey:currency} p(95)=609ms (<2000)
✓ group_duration{:::homepage} avg=309ms (<500)
✓ group_duration{:::browse product} avg=273ms (<400)
✓ boutique_checkout_duration p(95)=617ms (<5000)
✓ boutique_checkout_success rate=100% (>0.80)
✗ http_req_failed rate=9.51% (<0.05)
Latency was fine. Checkout succeeded 100% of the time. The single failing threshold was the error rate: http_req_failed at 9.51% — 375 failed requests out of 3,942 — clustered on the homepage and product-page fetches (the status 200 check dropped to 86%), with 107 dropped iterations alongside them.
That pattern is the lesson. The app served clean 200s on every manual request, latency stayed healthy, and yet ~1 request in 10 failed under sustained load. The cause wasn't the Online Boutique; it was the free ngrok tunnel. At ~12.7 requests/second the free tier's connection and rate limits start shedding requests, and those show up in k6 as non-200s. The bottleneck under load was the demo plumbing, not the system under test.
This is worth internalizing before you trust a number: a load test measures the entire path. When you insert a free tunnel between the generator and the app, you've added a component with its own limits, and at high enough throughput that component fails before the app does. For a real load test you'd point k6 at the cluster directly (or pay for a tunnel tier built for it); the tunnel is for reachability (exposing the app to Grafana's cloud generators and synthetic probes), not for absorbing load. The thresholds in load.config.json were calibrated against the app on the LAN, so they correctly flagged that something in the path was degrading. They just couldn't tell me it was the tunnel; the error pattern did.
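The failing threshold is easy to verify by hand from the counts reported above. The sketch below is plain JavaScript, not k6's own threshold engine; failureRate and passesRateThreshold are hypothetical names that only mirror what a "rate<0.05" threshold means.

```javascript
// Hypothetical helpers that reproduce the http_req_failed arithmetic.
function failureRate(failed, total) {
  if (total <= 0) throw new RangeError('total must be positive');
  return failed / total;
}

// Same meaning as a "rate<LIMIT" threshold: pass only while the rate is below the limit.
function passesRateThreshold(rate, limit) {
  return rate < limit;
}

// 375 failed requests out of 3,942, checked against the 5% limit.
const observedRate = failureRate(375, 3942);
console.log((observedRate * 100).toFixed(2) + '%');   // 9.51%
console.log(passesRateThreshold(observedRate, 0.05)); // false
```

The rate is nearly double the limit, so this was a clear failure rather than a borderline one.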
Building the Grafana dashboard
The value of having k6 metrics in Grafana Cloud isn't the k6 interface; it's that the same data is in the same Prometheus datasource as your infrastructure metrics. You can build panels that put them side by side.
The custom metrics from the scripts are queryable by name. The four from this suite:
# checkout success rate
boutique_checkout_success
# p95 checkout duration
histogram_quantile(0.95, rate(boutique_checkout_duration_bucket[1m]))
# cart errors over a 5-minute window
increase(boutique_cart_errors_total[5m])
# active sessions (latest gauge value)
boutique_active_sessions
The dashboard I built has eight panels, organized into five groups:
VU ramp — a time series of k6_vus showing the ramp shape. Useful for correlating degradation onset with a specific VU count. When the product page began slowing in the stress test, this panel pinned down exactly when — and at what VU count.
p95 response time by journey — overlaid lines via histogram_quantile(0.95, rate(k6_http_req_duration_bucket{journey="shopper"}[1m])) and the equivalent for journey="browser". At low load the two journeys track each other closely; as VUs climb they fan apart, and the panel shows which journey's latency is degrading rather than burying it in a single global p95.
Checkout success rate — boutique_checkout_success as a stat panel with a threshold at 80%. Green above, red below. During the load test this sits comfortably at 100%. During stress it starts to drop. This is the panel that maps to a business SLO rather than an infrastructure metric.
Cart error count — boutique_cart_errors_total as a time series. Flat during normal load. Any spikes here are worth investigating immediately regardless of what the response time panels show — a cart error is a customer who couldn't add an item, and that has a direct revenue implication.
Web Vitals — LCP, FCP, TTFB, and CLS as stat panels with their respective thresholds colored. CLS shows red at 0.117 from the browser test results. Everything else is green.
The dashboard is exportable as JSON and lives in the repo at grafana/dashboard.json. Import it into any Grafana instance connected to the same Prometheus datasource and it works.
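The per-journey p95 panels differ only in a label value, so the query strings can be generated rather than copied. This is a minimal sketch; p95Query is a hypothetical helper, not part of Grafana or k6, and it only does string templating around the PromQL shown above.

```javascript
// Hypothetical helper: build a p95 histogram_quantile query for a bucket metric.
function p95Query(bucketMetric, labels = {}, window = '1m') {
  const selector = Object.entries(labels)
    .map(([key, value]) => `${key}="${value}"`)
    .join(',');
  // Omit the braces entirely when there are no label matchers.
  const matcher = selector ? `{${selector}}` : '';
  return `histogram_quantile(0.95, rate(${bucketMetric}${matcher}[${window}]))`;
}

for (const journey of ['shopper', 'browser']) {
  console.log(p95Query('k6_http_req_duration_bucket', { journey }));
}
// histogram_quantile(0.95, rate(k6_http_req_duration_bucket{journey="shopper"}[1m]))
// histogram_quantile(0.95, rate(k6_http_req_duration_bucket{journey="browser"}[1m]))

console.log(p95Query('boutique_checkout_duration_bucket'));
// histogram_quantile(0.95, rate(boutique_checkout_duration_bucket[1m]))
```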
k6 Studio
k6 Studio is a desktop app that sits between browser recording and code. You record a session in its built-in browser, it generates a k6 script, and you can validate and replay the recording before exporting the script.
It's useful in two specific situations: onboarding someone who hasn't written k6 scripts before, and quickly generating the skeleton of a new test flow for an endpoint you haven't covered yet. For the Online Boutique I could've used it to record the checkout flow end-to-end (adding a product to cart, navigating to cart, submitting the order) and then folded the generated script into the lib/ layer to add error handling and custom metrics.
The generated script is verbose. k6 Studio captures everything the browser sends, including headers and cookies that k6 handles automatically, and includes them explicitly. Before the generated script is usable in a real suite you'll strip the redundant headers, replace hardcoded URLs with variables, and wrap the requests in groups. But having the request sequence correct from the start (the right endpoints in the right order with the right request bodies) saves meaningful time compared to reconstructing it from documentation or browser DevTools by hand.
One thing it doesn't do: k6 Studio doesn't understand your application's business logic. It records what the browser sent. It doesn't know that the cartId in the cart request needs to match the session, or that the currency selector needs to be set before the price conversion call. That logic lives in the lib/ layer and you add it manually after import.
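The cleanup pass on a recorded request (dropping redundant headers, turning a hardcoded URL into a path relative to a base) can be sketched as a plain function. Everything here is hypothetical: the request shape, the header list, and the sample /cart request are illustrative assumptions, not k6 Studio's actual export format.

```javascript
// Assumed list of headers the HTTP client or k6's cookie jar manages on its own.
// Illustrative only; adjust for your application.
const REDUNDANT_HEADERS = new Set(['cookie', 'host', 'content-length', 'connection']);

// Hypothetical recorded-request shape: { method, url, headers, body }.
function cleanRecordedRequest(req, recordedOrigin) {
  const headers = {};
  for (const [name, value] of Object.entries(req.headers || {})) {
    if (!REDUNDANT_HEADERS.has(name.toLowerCase())) headers[name] = value;
  }
  // Requests to the recorded origin become relative paths, so the caller can
  // prepend a configurable base URL; third-party URLs are left untouched.
  const path = req.url.startsWith(recordedOrigin)
    ? req.url.slice(recordedOrigin.length) || '/'
    : req.url;
  return { method: req.method, path, headers, body: req.body };
}

// Illustrative input, not a real recording.
const cleaned = cleanRecordedRequest(
  {
    method: 'POST',
    url: 'http://10.4.20.2:80/cart',
    headers: { Cookie: 'session=abc', 'Content-Type': 'application/x-www-form-urlencoded' },
    body: 'product_id=EXAMPLE&quantity=1',
  },
  'http://10.4.20.2:80'
);
console.log(cleaned.path);    // /cart
console.log(cleaned.headers); // { 'Content-Type': 'application/x-www-form-urlencoded' }
```

The business-logic parts (matching the cart to the session, setting currency first) still have to be added by hand; a pass like this only removes noise.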
Setting up synthetic monitoring
Synthetic monitoring turns a k6 script into a scheduled check that runs from Grafana's global probe network. The same script that ran as a local load test becomes a permanent canary executing on a set interval (as often as every minute), from multiple locations, alerting when it fails.
The setup lives in Grafana Cloud under Synthetic Monitoring → Scripted. You paste your script, configure the probe locations, set the execution interval, and save. The script runs against your target URL on that schedule indefinitely.
For the Online Boutique I used the smoke test script with the BASE_URL pointed at the ngrok tunnel:
// k6-boutique/src/config/smoke.config.json — top-level thresholds
{
"thresholds": {
"http_req_failed": ["rate<0.05"],
"http_req_duration": ["p(95)<2000"],
"checks": ["rate>0.90"]
// ...plus per-group group_duration thresholds, omitted here
}
}
The probe locations I selected: North Virginia (US East), London (EU West), and Tokyo (Asia Pacific). One note on interval: the docs and most tutorials assume a one-minute frequency, but Synthetic Monitoring's free tier caps you at 100,000 check executions/month, and a scripted check fanned out to three probes at a one-minute interval uses ~130,000/month on its own. To stay inside the free tier I ran the three probes at a two-minute interval (~65,000/month). Every two minutes, each probe runs the smoke test against the public URL and reports pass/fail, response time, and check results back to Grafana Cloud.
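The execution budget is simple arithmetic, and worth redoing whenever you change the probe count or interval. A minimal sketch, assuming a 30-day month and the 100,000 executions/month cap quoted above; monthlyExecutions is a hypothetical helper name.

```javascript
// Executions per month for one check fanned out to several probes.
// Assumes a 30-day month unless told otherwise.
function monthlyExecutions(probeCount, intervalMinutes, daysPerMonth = 30) {
  const runsPerProbePerDay = (24 * 60) / intervalMinutes;
  return probeCount * runsPerProbePerDay * daysPerMonth;
}

const FREE_TIER_LIMIT = 100000; // cap quoted in the text

for (const interval of [1, 2]) {
  const total = monthlyExecutions(3, interval);
  console.log(`${interval} min: ${total} (${total <= FREE_TIER_LIMIT ? 'within' : 'over'} free tier)`);
}
// 1 min: 129600 (over free tier)
// 2 min: 64800 (within free tier)
```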
For alerting you'd wire the check's pass rate to a contact point: if it drops below 95% for two consecutive probe intervals, fire to Slack. At a two-minute interval that's a ~four-minute detection window — fast enough to catch a real availability incident, slow enough to ride out a single-probe flake.
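The "two consecutive intervals" rule is what filters out a single-probe flake. The sketch below shows only that logic in plain JavaScript; it is not Grafana alerting configuration, and shouldAlert is a hypothetical name.

```javascript
// Fire only when the pass rate stays below the threshold for N intervals in a row.
function shouldAlert(passRates, threshold = 0.95, consecutive = 2) {
  let streak = 0;
  for (const rate of passRates) {
    streak = rate < threshold ? streak + 1 : 0; // any healthy interval resets the streak
    if (streak >= consecutive) return true;
  }
  return false;
}

// One bad interval between healthy ones: ridden out as a flake.
console.log(shouldAlert([1.0, 0.9, 1.0, 0.9])); // false

// Two bad intervals back to back: alert.
console.log(shouldAlert([1.0, 0.9, 0.88]));     // true
```

With a two-minute interval and consecutive = 2, the earliest an alert can fire is about four minutes after the first failing interval begins.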
Unlike the load test, synthetic monitoring runs at a low request rate — three probes, once every two minutes — so it never approaches the tunnel's rate limits. But "low rate" is not "zero failures," and that turned out to be the interesting part. Over a collection window of a few intervals, the measured per-probe numbers were:
Probe | avg http_req_duration | check pass rate | checkout success
North Virginia (US East) | 72 ms | 88% | 100%
London (EU West) | 277 ms | 95% | 100%
Tokyo (Asia Pacific) | 324 ms | 90% | 100%
Two things stand out. First, checks did not sit at a clean 100%; they ran 88–95%, because the free tunnel dropped the occasional request even at this trickle of traffic. The checkout flow itself succeeded 100% of the time on every probe; the misses were on the homepage and product fetches, the same tunnel-shedding signature the load test surfaced, just much rarer. The lesson from earlier holds at every scale: you're measuring the whole path, and the free tunnel is the weakest link in it.
Second, the latency gradient by location is real and expected, but note which probe is fastest. North Virginia comes in lowest at 72 ms because ngrok's edge and the cluster are both US-based, so that probe barely leaves the country. London and Tokyo are roughly 4× higher (277 ms and 324 ms), not because the app is slower for them, but because their requests cross an ocean to reach the US edge before they ever touch the cluster. The cluster is physically in the US; the speed of light does the rest. This is something a local load test, with the runner next to the cluster on the same LAN, can never show you.
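To put numbers on the gradient, the per-probe averages from the table can be expressed as multiples of the fastest probe. A small sketch in plain JavaScript; ratiosToFastest is a hypothetical helper and the data is copied from the measurements above.

```javascript
// Average http_req_duration per probe, taken from the table above.
const probes = [
  { name: 'North Virginia', avgMs: 72 },
  { name: 'London', avgMs: 277 },
  { name: 'Tokyo', avgMs: 324 },
];

// Express each probe's latency as a multiple of the fastest one.
function ratiosToFastest(rows) {
  const fastest = Math.min(...rows.map((row) => row.avgMs));
  return rows.map((row) => ({ name: row.name, ratio: row.avgMs / fastest }));
}

for (const { name, ratio } of ratiosToFastest(probes)) {
  console.log(`${name}: ${ratio.toFixed(1)}x`);
}
// North Virginia: 1.0x
// London: 3.8x
// Tokyo: 4.5x
```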
What the unified view actually gives you
By the end of this series, the k6 setup does three distinct things that look the same from the outside but serve different purposes.
k6 run during development catches regressions before they ship. You run the smoke test against a branch before opening a PR. If response times have jumped or a check is failing, you find out before the reviewer does.
k6 cloud run during staging runs the full load and stress scenarios from Grafana's load generators and puts the results in the same observability stack as your infrastructure metrics. When the p95 product page latency spikes at 100 VUs, you can open the same Grafana instance and look at CPU and memory on the catalog and recommendation service pods at that exact moment. The load test result and the infrastructure telemetry share a timestamp axis.
Synthetic monitoring in production tells you what users are experiencing right now, from where they are, continuously. Not a snapshot from the last test run, but a live signal.
The same script, version-controlled, reviewed, and maintained like application code, powers all three.
Closing
This series started with a 30-line script and a philosophical argument: load tests should be code, not configuration. By the end it's a layered test suite, a Grafana dashboard, a stress test that revealed something real about a microservices call graph, a CLS finding that HTTP testing would never have surfaced, and a synthetic monitor running checks from three continents.
The tooling is k6 and Grafana Cloud. The underlying idea is that performance isn't a phase before launch; it's a property of the system that you measure continuously, with the same rigor you bring to the rest of your engineering.