DEV Community: Matthew Wimpelberg

From Load Test to Production Monitor: k6 Studio, Grafana Cloud, and Synthetic Monitoring

Matthew Wimpelberg — Fri, 12 Jun 2026 14:02:02 +0000

Part 4 of 4: From Load Test to Production Monitor — k6 Studio, Grafana Cloud, and Synthetic Monitoring

The first three parts of this series were about running tests. This one is about making them permanent.

In part 1, k6 was a command-line tool you ran against a URL. In part 2 it became a layered test suite version-controlled alongside the app it tests. In part 3 the stress test revealed something real about the app's architecture. All of that is useful as a development workflow. None of it tells you anything about what's happening in production right now.

That's what this post is about. The same scripts, pointed at a publicly reachable endpoint via ngrok, streaming results into Grafana Cloud in real time, and running on a schedule as synthetic monitors. One codebase. Three modes: local development, cloud-streamed load test, permanent availability check.

All the code is here: https://github.com/mwimpelberg28/k6-playground

Exposing the homelab with ngrok

The Online Boutique runs on a private cluster at 10.4.20.2. Grafana Cloud's synthetic monitoring probes can't reach that address because they run from data centers in major cloud providers. To demo synthetic monitoring against a real app rather than a public URL I don't control, I needed to expose the cluster temporarily.

ngrok handles this in one command, pointed at the cluster's frontend service on the reserved free static domain:

ngrok http --url=imitation-laxative-iphone.ngrok-free.dev 10.4.20.2:80

ngrok prints the forwarding URL:

Forwarding  https://imitation-laxative-iphone.ngrok-free.dev -> http://10.4.20.2:80

That URL is now publicly reachable. Any HTTP request to it gets tunneled to the cluster. ngrok's free tier now gives you one reserved static domain on ngrok-free.dev. It's stable across restarts, which is what lets me hardcode it into the committed cloud-* npm scripts and the synthetic monitor config rather than re-editing them every time the tunnel comes up. (An ephemeral tunnel gets a random URL that changes on each restart; a paid plan adds multiple custom domains.)

One honest observation: response times change through the tunnel. TTFB in the local load tests was 36ms because the test runner and the cluster are on the same LAN. Through ngrok, requests travel to ngrok's edge, get forwarded to the cluster, and travel back — a single curl through the tunnel measured TTFB around 315ms, roughly 9× the LAN figure. Under the full load run, request p95 landed at ~670ms (vs 273ms on the LAN). That's not a problem — it's actually more realistic. Local load tests measure server performance. Measuring through the tunnel captures something closer to what a remote user experiences.

Running the load test from Grafana Cloud with `k6 cloud run`

This is the other reason the app has to be publicly reachable. k6 cloud run doesn't execute on your laptop — it uploads the script and runs it on Grafana Cloud's load generators, in whatever regions you configure. Those runners live in Grafana's data centers, so they reach the Online Boutique exactly the way the synthetic probes do: through the ngrok tunnel, not over the LAN. As the test runs, every metric data point streams back into Grafana Cloud in real time rather than printing to the terminal at the end.

Authentication is a one-time login with an API token:

k6 cloud login --token <your-api-token>
export K6_CLOUD_PROJECT_ID=your-project-id

The token comes from your Grafana Cloud account under k6 → Settings → API Token. The project ID is visible on the same page.

Then the cloud run is the same bundle and the same config, pointed at the public URL:

k6 cloud run dist/test.main.js \
  -e CONFIG_FILE=../src/config/load.config.json \
  -e BASE_URL=https://imitation-laxative-iphone.ngrok-free.dev
# or: npm run cloud-load

(These commands, like the npm scripts, run from the k6-boutique/ directory — that's why dist/ and ../src/config/ resolve the way they do.)
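For readers wondering how the two -e flags reach the script: k6 exposes them on the global __ENV object. The sketch below shows one way an entry point could resolve them with LAN defaults. The function name resolveTarget and the default config path are assumptions for illustration, not code from the repo; the environment is passed in as a parameter so the same function also runs under plain Node.

```javascript
// Hypothetical helper: resolve the target URL and config path from -e flags.
// In k6 you would call resolveTarget(__ENV); under Node, pass any plain object.
function resolveTarget(env) {
  // Strip trailing slashes so paths can be appended as `${baseUrl}/cart`.
  const baseUrl = (env.BASE_URL || 'http://10.4.20.2').replace(/\/+$/, '');
  // Assumed default; the repo may pick its default differently.
  const configFile = env.CONFIG_FILE || '../src/config/smoke.config.json';
  return { baseUrl, configFile };
}

// Local run: no flags, so the LAN defaults apply.
console.log(resolveTarget({}));

// Cloud run: both flags set, as in the k6 cloud run command above.
console.log(resolveTarget({
  BASE_URL: 'https://imitation-laxative-iphone.ngrok-free.dev/',
  CONFIG_FILE: '../src/config/load.config.json',
}));
```

Keeping the target in one resolved value is what lets the same bundle serve the local, cloud, and synthetic modes without edits.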

Every http_req_duration, every custom metric, every check result is written to Grafana Cloud as it happens.

The Grafana Cloud k6 interface gives you a run summary page automatically — no dashboard configuration required. It shows the VU ramp timeline, p95 response time over the run, error rate, and check pass rate. For a quick read it's enough. For deeper analysis — and for correlating load-test results with infrastructure metrics — you want the Grafana dashboard.

What the tunnel run actually surfaced

The cloud run finished and tripped a threshold — k6 exits non-zero when any threshold fails. The cloud UI holds the full metric breakdown; to read the complete table here I ran the same load.config.json through the same tunnel (35 max VUs, three concurrent journeys, five minutes). The interesting part is which threshold failed — every latency threshold held comfortably:

✓ http_req_duration                p(95)=668ms   (<3000)
  ✓ {journey:browser}              p(95)=783ms   (<2000)
  ✓ {journey:shopper}              p(95)=604ms   (<4000)
  ✓ {journey:currency}             p(95)=609ms   (<2000)
✓ group_duration{:::homepage}      avg=309ms     (<500)
✓ group_duration{:::browse product} avg=273ms    (<400)
✓ boutique_checkout_duration       p(95)=617ms   (<5000)
✓ boutique_checkout_success        rate=100%     (>0.80)
✗ http_req_failed                  rate=9.51%    (<0.05)

Latency was fine. Checkout succeeded 100% of the time. The single failing threshold was the error rate: http_req_failed at 9.51% — 375 failed requests out of 3,942 — clustered on the homepage and product-page fetches (the status 200 check dropped to 86%), with 107 dropped iterations alongside them.
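As a quick sanity check on that figure, the failure rate can be recomputed from the two counts in the summary and compared against the rate<0.05 threshold. This is a minimal sketch in plain JavaScript; the helper name failureRate is made up for the example.

```javascript
// Recompute http_req_failed from the counts reported in the run summary
// and compare it with the "rate<0.05" threshold.
function failureRate(failed, total) {
  return failed / total;
}

const rate = failureRate(375, 3942);
const threshold = 0.05;

console.log(`http_req_failed = ${(rate * 100).toFixed(2)}%`); // prints "http_req_failed = 9.51%"
console.log(`threshold passed: ${rate < threshold}`);         // prints "threshold passed: false"
```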

That pattern is the lesson. The app served clean 200s on every manual request, latency stayed healthy, and yet ~1 request in 10 failed under sustained load. The cause wasn't the Online Boutique; it was the free ngrok tunnel. At ~12.7 requests/second the free tier's connection and rate limits start shedding requests, and those show up in k6 as non-200s. The bottleneck under load was the demo plumbing, not the system under test.

This is worth internalizing before you trust a number: a load test measures the entire path. When you insert a free tunnel between the generator and the app, you've added a component with its own limits, and at high enough throughput that component fails before the app does. For a real load test you'd point k6 at the cluster directly (or pay for a tunnel tier built for it); the tunnel is for reachability, exposing the app to Grafana's cloud generators and synthetic probes, not for absorbing load. The thresholds in load.config.json were calibrated against the app on the LAN, so they correctly flagged that something in the path was degrading. They just couldn't tell me it was the tunnel; the error pattern did.

Building the Grafana dashboard

The value of having k6 metrics in Grafana Cloud isn't the k6 interface; it's that the same data is in the same Prometheus datasource as your infrastructure metrics. You can build panels that put them side by side.

The custom metrics from the scripts are queryable by name. The four from this suite:

# checkout success rate
boutique_checkout_success

# p95 checkout duration
histogram_quantile(0.95, rate(boutique_checkout_duration_bucket[1m]))

# cart errors over a 5-minute window
increase(boutique_cart_errors_total[5m])

# active sessions (latest gauge value)
boutique_active_sessions

The dashboard I built has eight panels, organized into five groups:

VU ramp — a time series of k6_vus showing the ramp shape. Useful for correlating degradation onset with a specific VU count. When the product page began slowing in the stress test, this panel pinned down exactly when — and at what VU count.

p95 response time by journey — overlaid lines via histogram_quantile(0.95, rate(k6_http_req_duration_bucket{journey="shopper"}[1m])) and the equivalent for journey="browser". At low load the two journeys track each other closely; as VUs climb they fan apart, and the panel shows which journey's latency is degrading rather than burying it in a single global p95.

Checkout success rate — boutique_checkout_success as a stat panel with a threshold at 80%. Green above, red below. During the load test this sits comfortably at 100%. During stress it starts to drop. This is the panel that maps to a business SLO rather than an infrastructure metric.

Cart error count — boutique_cart_errors_total as a time series. Flat during normal load. Any spikes here are worth investigating immediately regardless of what the response time panels show — a cart error is a customer who couldn't add an item, and that has a direct revenue implication.

Web Vitals: LCP, FCP, TTFB, and CLS as stat panels, each colored by its own threshold. CLS shows red at 0.117 from the browser test results. Everything else is green.

The dashboard is exportable as JSON and lives in the repo at grafana/dashboard.json. Import it into any Grafana instance connected to the same Prometheus datasource and it works.

k6 Studio

k6 Studio is a desktop app that sits between browser recording and code. You record a session in its built-in browser, it generates a k6 script, and you can validate and replay the recording before exporting the script.

It's useful in two specific situations: onboarding someone who hasn't written k6 scripts before, and quickly generating the skeleton of a new test flow for an endpoint you haven't covered yet. For the Online Boutique I could've used it to record the checkout flow end-to-end (adding a product to cart, navigating to cart, submitting the order) and then folded the generated script into the lib/ layer to add error handling and custom metrics.

The generated script is verbose. k6 Studio captures everything the browser sends, including headers and cookies that k6 handles automatically, and includes them explicitly. Before the generated script is usable in a real suite you'll strip the redundant headers, replace hardcoded URLs with variables, and wrap the requests in groups. But having the request sequence correct from the start (the right endpoints in the right order with the right request bodies) saves meaningful time compared to reconstructing it from documentation or browser DevTools by hand.
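The cleanup step can be pictured as a small transformation over each recorded request. The sketch below is only an illustration of the idea: the request object shape, the cleanRequest helper, the header list, and the sample values are all assumptions, not k6 Studio's export format.

```javascript
// Hypothetical stand-in for one captured call; all values are made up.
const recorded = {
  method: 'POST',
  url: 'http://10.4.20.2/cart',
  headers: {
    'Host': '10.4.20.2',
    'Cookie': 'session=abc123',
    'Content-Length': '29',
    'Content-Type': 'application/x-www-form-urlencoded',
  },
  body: 'product_id=EXAMPLE&quantity=1',
};

// Headers assumed here to be ones the HTTP client manages on its own;
// adjust the list to whatever your recording actually contains.
const REDUNDANT = new Set(['host', 'cookie', 'content-length', 'connection']);

function cleanRequest(req, recordedOrigin) {
  const headers = {};
  for (const [name, value] of Object.entries(req.headers)) {
    if (!REDUNDANT.has(name.toLowerCase())) headers[name] = value;
  }
  // Replace the hardcoded origin with a path the caller joins to BASE_URL.
  const path = req.url.startsWith(recordedOrigin)
    ? req.url.slice(recordedOrigin.length)
    : req.url;
  return { method: req.method, path, headers, body: req.body };
}

const cleaned = cleanRequest(recorded, 'http://10.4.20.2');
console.log(cleaned.path);                 // prints "/cart"
console.log(Object.keys(cleaned.headers)); // prints [ 'Content-Type' ]
```

The cleaned request would then be wrapped in a group() and sent through the shared client in lib/, which is where the error handling and custom metrics get added.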

One thing it doesn't do: k6 Studio doesn't understand your application's business logic. It records what the browser sent. It doesn't know that the cartId in the cart request needs to match the session, or that the currency selector needs to be set before the price conversion call. That logic lives in the lib/ layer and you add it manually after import.

Setting up synthetic monitoring

Synthetic monitoring turns a k6 script into a scheduled check that runs from Grafana's global probe network. The same script that ran as a local load test becomes a permanent canary executing on a set interval (as often as every minute), from multiple locations, alerting when it fails.

The setup lives in Grafana Cloud under Synthetic Monitoring → Scripted. You paste your script, configure the probe locations, set the execution interval, and save. The script runs against your target URL on that schedule indefinitely.

For the Online Boutique I used the smoke test script with the BASE_URL pointed at the ngrok tunnel:

// k6-boutique/src/config/smoke.config.json — top-level thresholds
{
  "thresholds": {
    "http_req_failed":   ["rate<0.05"],
    "http_req_duration": ["p(95)<2000"],
    "checks":            ["rate>0.90"]
    // ...plus per-group group_duration thresholds, omitted here
  }
}

The probe locations I selected: North Virginia (US East), London (EU West), and Tokyo (Asia Pacific). One note on interval: the docs and most tutorials assume a one-minute frequency, but Synthetic Monitoring's free tier caps you at 100,000 check executions/month, and a scripted check fanned out to three probes at a one-minute interval eats ~130,000/month on its own (3 probes × 43,200 minutes in a 30-day month = 129,600). To stay inside the free tier I ran the three probes at a two-minute interval (~65,000/month). Every two minutes, each probe runs the smoke test against the public URL and reports pass/fail, response time, and check results back to Grafana Cloud.
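The budget arithmetic is easy to rerun for other probe counts or intervals. A minimal sketch, assuming a 30-day month; the function name monthlyExecutions is made up for the example, and the limit is the free-tier figure quoted above.

```javascript
// Executions per month for one check fanned out to several probes.
function monthlyExecutions(probes, intervalMinutes, daysInMonth = 30) {
  const minutesInMonth = daysInMonth * 24 * 60; // 43,200 for a 30-day month
  return probes * (minutesInMonth / intervalMinutes);
}

const FREE_TIER_LIMIT = 100000; // figure quoted in the paragraph above

for (const interval of [1, 2]) {
  const n = monthlyExecutions(3, interval);
  console.log(`${interval}-minute interval: ${n} executions, within free tier: ${n <= FREE_TIER_LIMIT}`);
}
// prints "1-minute interval: 129600 executions, within free tier: false"
// prints "2-minute interval: 64800 executions, within free tier: true"
```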

For alerting you'd wire the check's pass rate to a contact point: if it drops below 95% for two consecutive probe intervals, fire to Slack. At a two-minute interval that's a ~four-minute detection window — fast enough to catch a real availability incident, slow enough to ride out a single-probe flake.
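The "two consecutive intervals" condition is what filters out single-probe flakes. In practice this lives in a Grafana alert rule rather than in code, so the following is only a plain-JavaScript model of the logic; the function name shouldAlert and the sample pass rates are made up.

```javascript
// Fire when the pass rate stays below the floor for `needed` consecutive intervals.
function shouldAlert(passRates, floor = 0.95, needed = 2) {
  let streak = 0;
  for (const rate of passRates) {
    streak = rate < floor ? streak + 1 : 0; // any healthy interval resets the streak
    if (streak >= needed) return true;
  }
  return false;
}

console.log(shouldAlert([1.0, 0.90, 1.0, 0.92])); // isolated dips: prints "false"
console.log(shouldAlert([1.0, 0.90, 0.88, 1.0])); // two bad intervals in a row: prints "true"
```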

Unlike the load test, synthetic monitoring runs at a low request rate — three probes, once every two minutes — so it never approaches the tunnel's rate limits. But "low rate" is not "zero failures," and that turned out to be the interesting part. Over a collection window of a few intervals, the measured per-probe numbers were:

Probe                       avg http_req_duration   check pass rate   checkout success
North Virginia (US East)    72 ms                   88%               100%
London (EU West)            277 ms                  95%               100%
Tokyo (Asia Pacific)        324 ms                  90%               100%

Two things stand out. First, checks did not sit at a clean 100%; they ran 88–95%, because the free tunnel dropped the occasional request even at this trickle of traffic. The checkout flow itself succeeded 100% of the time on every probe; the misses were on the homepage and product fetches, the same tunnel-shedding signature the load test surfaced, just much rarer. The lesson from earlier holds at every scale: you're measuring the whole path, and the free tunnel is the weakest link in it.

Second, the latency gradient by location is real and expected, but note which probe is fastest. North Virginia comes in lowest at 72 ms because ngrok's edge and the cluster are both US-based, so that probe barely leaves the country. London and Tokyo are roughly 4 to 4.5× higher (277 ms and 324 ms against 72 ms) not because the app is slower for them, but because their requests cross an ocean to reach the US edge before they ever touch the cluster. The cluster is physically in the US; the speed of light does the rest. This is something a local load test, with the runner next to the cluster on the same LAN, can never show you.

What the unified view actually gives you

By the end of this series, the k6 setup does three distinct things that look the same from the outside but serve different purposes.

k6 run during development catches regressions before they ship. You run the smoke test against a branch before opening a PR. If response times have jumped or a check is failing, you find out before the reviewer does.

k6 cloud run during staging runs the full load and stress scenarios from Grafana's load generators and puts the results in the same observability stack as your infrastructure metrics. When the p95 product page latency spikes at 100 VUs, you can open the same Grafana instance and look at CPU and memory on the catalog and recommendation service pods at that exact moment. The load test result and the infrastructure telemetry share a timestamp axis.

Synthetic monitoring in production tells you what users are experiencing right now, from where they are, continuously. Not a snapshot from the last test run, but a live signal.

The same script, version-controlled, reviewed, and maintained like application code, powers all three.

Closing

This series started with a 30-line script and a philosophical argument: load tests should be code, not configuration. By the end it's a layered test suite, a Grafana dashboard, a stress test that revealed something real about a microservices call graph, a CLS finding that HTTP testing would never have surfaced, and a synthetic monitor running checks from three continents.

The tooling is k6 and Grafana Cloud. The underlying idea is that performance isn't a phase before launch; it's a property of the system that you measure continuously, with the same rigor you bring to the rest of your engineering.

#k6 #Grafana #SyntheticMonitoring #LoadTesting #Observability #SRE #Kubernetes #WebVitals

Custom Metrics, Stress Testing, and Web Vitals: Going Beyond Basic Load Testing with k6

Matthew Wimpelberg — Mon, 08 Jun 2026 09:06:37 +0000

Part 3 of 4: Custom Metrics, Stress Testing, and Web Vitals — Going Beyond Basic Load Testing with k6

In part 2 I built a layered test suite against Google's Online Boutique on a homelab Kubernetes cluster. Smoke passed. The load test ran clean after fixing two bugs: a wrong assertion string on checkout and a missing await in the browser scenario. The load test summary showed p95 response times at 273ms, checkout success at 100%, and a CLS score of 0.117 nudging just over the 0.10 threshold.

That left three things unfinished. The stress test hadn't run. The CLS finding had no explanation. And the four custom metric types I'd defined in the scenarios deserved more than a passing mention.

This post runs the stress test, reads the results architecturally, explains what CLS 0.117 actually means and why HTTP testing would never have surfaced it, and walks through all four custom metric types with concrete examples of when each one is the right tool.

All the code is here: https://github.com/mwimpelberg28/k6-playground

Running the stress test

The stress config ramps VUs up in stages to a peak of 150, then ramps back down. The goal isn't "break the app"; it's to find where degradation starts and understand the shape of it.

// src/config/stress.config.json
{
  "scenarios": {
    "stress": {
      "executor": "ramping-vus",
      "exec": "stressFlow",
      "stages": [
        { "duration": "2m", "target": 50  },
        { "duration": "2m", "target": 100 },
        { "duration": "2m", "target": 150 },
        { "duration": "2m", "target": 100 },
        { "duration": "2m", "target": 0   }
      ],
      "gracefulStop": "30s"
    }
  },
  "thresholds": {
    "http_req_failed":                        ["rate<0.10"],
    "http_req_duration":                      ["p(95)<5000"],
    "group_duration{group:::homepage}":       ["avg<1000"],
    "group_duration{group:::browse product}": ["avg<2000"],
    "checks":                                 ["rate>0.70"]
  }
}

The thresholds are deliberately looser than the load test. Stress isn't about enforcing SLOs; it's about observing where and how the system degrades before it hits a hard wall. A stress test that fails immediately at tight thresholds tells you nothing useful about the degradation curve.
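To tie a timestamp in the results back to a VU count, it helps to see the ramp as a function of time. The sketch below interpolates linearly between the stage targets from stress.config.json, which is how ramping-vus moves from one target to the next. The helper name targetVUsAt, the starting value of 0 VUs, and the rounding are assumptions made for the example.

```javascript
// Stage list copied from stress.config.json, with durations in seconds.
const stages = [
  { duration: 120, target: 50 },
  { duration: 120, target: 100 },
  { duration: 120, target: 150 },
  { duration: 120, target: 100 },
  { duration: 120, target: 0 },
];

// Linear interpolation between consecutive stage targets.
// startVUs = 0 is a simplifying assumption; set it to match your config.
function targetVUsAt(stageList, tSeconds, startVUs = 0) {
  let from = startVUs;
  let elapsed = 0;
  for (const { duration, target } of stageList) {
    if (tSeconds <= elapsed + duration) {
      const progress = (tSeconds - elapsed) / duration;
      return Math.round(from + (target - from) * progress);
    }
    from = target;
    elapsed += duration;
  }
  return from; // past the end of the last stage
}

for (const minute of [1, 4, 5, 6, 8, 10]) {
  console.log(`t=${minute}m -> ${targetVUsAt(stages, minute * 60)} VUs`);
}
// prints 25, 100, 125, 150, 100 and 0 VUs for those six timestamps
```

With this mapping, "failures started around minute four" translates directly into "failures started around 100 VUs".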

npm run stress
# k6 run dist/test.main.js -e CONFIG_FILE=../src/config/stress.config.json

What the stress test showed

The homepage held through the full ramp. Browse product requests started accumulating failures around the 100 VU mark, and by 150 VUs the check pass rate for product pages had dropped noticeably while the homepage check pass rate stayed flat.

That divergence is the finding. The homepage and product page both live in the same frontend service, on the same pod. If the frontend service itself were the bottleneck, both would degrade together. They didn't.

The difference is what each endpoint does downstream. The homepage makes one call: fetch featured products from the catalog service. The product page makes three in parallel: fetch product details from the catalog service, fetch recommendations from the recommendation service, and convert the price via the currency service. Under low concurrency that fan-out is invisible. Under high concurrency, those downstream services start queuing work, and the product page's response time is gated on whichever of the three takes longest.
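The "gated on whichever takes longest" point reduces to taking a maximum over the parallel calls. A minimal sketch with made-up latencies (none of these numbers were measured); the helper names are hypothetical.

```javascript
// An endpoint that waits on parallel downstream calls is as slow as the slowest one.
function parallelLatency(downstreamMs) {
  return Math.max(...Object.values(downstreamMs));
}

// Illustrative latencies in milliseconds, not measurements.
const lowLoad  = { catalog: 20, recommendation: 25, currency: 15 };
const highLoad = { catalog: 30, recommendation: 900, currency: 40 };

// The homepage only touches the catalog; the product page touches all three.
const homepage    = (d) => parallelLatency({ catalog: d.catalog });
const productPage = (d) => parallelLatency(d);

console.log('low load :', homepage(lowLoad), productPage(lowLoad));   // prints "low load : 20 25"
console.log('high load:', homepage(highLoad), productPage(highLoad)); // prints "high load: 30 900"
```

One saturated dependency is enough to slow the product page, while the homepage is unaffected because it never calls that dependency.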

This is one of the defining properties of microservices under load. Call graph depth matters more than frontend capacity. A single downstream service that saturates its thread pool or starts garbage collecting will cause latency spikes in every upstream caller that touches it, and only those callers. The homepage, which doesn't touch the recommendation or currency service, keeps serving cleanly.

The stress test didn't break the app catastrophically. The homepage never went down. That's actually a well-behaved degradation pattern: the system is shedding load on complex, expensive paths while protecting simple ones. A poorly behaved version of this would see the frontend process itself crash, taking everything with it. What we observed instead was selective degradation by call graph complexity, which points directly at the downstream services as the constraint rather than the frontend.

Why group_duration catches this and http_req_duration doesn't

http_req_duration measures how long a single HTTP request takes. During the stress test, individual requests to the frontend completed in reasonable time even as the app was struggling. The frontend was accepting connections and dispatching work quickly. What was slow was waiting for the downstream calls to come back.

group_duration measures the wall-clock time of a named step end-to-end, including any sequential calls inside it. Every group() in the scripts layer gets a corresponding group_duration{group:::name} series automatically with no extra instrumentation required.

// threshold on a single request
"http_req_duration{journey:shopper}": ["p(95)<4000"],

// threshold on the full browse step including downstream wait time
"group_duration{group:::browse product}": ["avg<2000"]

If the only threshold had been http_req_duration, the stress test would have looked healthier than it was. The group threshold on browse product caught the degradation because it was measuring the step the user actually experiences, from initiating the product page load to receiving a complete response, including all downstream latency.

This is the shift from infrastructure metrics to user-experience metrics. The group is the unit of SLO, not the request.

The four custom metric types

k6 ships four custom metric types. Each has a specific meaning that makes it right for certain questions and wrong for others. Using the wrong one produces data that's technically correct but practically misleading.

Trend collects a distribution of values and exposes percentiles, min, max, and average. Use it when you want to know what "typical" looks like across all iterations. Checkout duration is a Trend because you want to know p95, the duration that 95% of checkout attempts stayed under, not just whether checkout ever succeeded.

import { Trend } from 'k6/metrics';

const checkoutDuration = new Trend('boutique_checkout_duration', true);

// record it — called once per checkout attempt
checkoutDuration.add(duration);

// threshold against it
"boutique_checkout_duration": ["p(95)<5000"]

The second argument to new Trend() is isTime. Pass true when the values are milliseconds and k6 will format them as time in the terminal output rather than raw numbers.

Rate measures the fraction of recorded values that were successful. Use it when you want a success or failure percentage. Checkout success is a Rate because "82% of checkouts completed" is a statement that maps to a business SLO. "134 checkouts completed" is a count that requires context to interpret, context that changes depending on how many VUs ran and for how long.

import { Rate } from 'k6/metrics';

const checkoutSuccess = new Rate('boutique_checkout_success');

// record it — true = success, false = failure
checkoutSuccess.add(ok);

// threshold against it
"boutique_checkout_success": ["rate>0.80"]

Counter accumulates a total. Use it when you want an absolute count of something. Cart errors are a Counter rather than a Rate because even a low error rate can represent a large absolute number of failures at high VU counts, and in a real business, each cart error is a customer who couldn't buy something. A Rate tells you the proportion; a Counter tells you the magnitude.

import { Counter } from 'k6/metrics';

const cartErrors = new Counter('boutique_cart_errors');

// record it
if (!cartOk) {
  cartErrors.add(1);
  return;
}

Gauge records the current value of something at the moment it's called. Unlike Trend, it doesn't accumulate a distribution; it reflects the most recent reading. Use it for point-in-time state: how many sessions are active right now, what's the current queue depth, is a feature flag on or off. In a test context it's less common than the others, but it's the right tool when you care about instantaneous state rather than aggregate behavior.

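import http from 'k6/http';
import { Gauge } from 'k6/metrics';
const activeSessions = new Gauge('boutique_active_sessions');

// cookiesForURL() returns an object keyed by cookie name, so count its keys
activeSessions.add(Object.keys(http.cookieJar().cookiesForURL(BASE_URL)).length > 0 ? 1 : 0);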

The practical summary: percentile distribution → Trend. Success/failure percentage → Rate. Running total → Counter. Point-in-time reading → Gauge.
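To make the difference tangible, here is a plain-JavaScript model of what each type retains from the samples it is given. It does not use the k6 API; the sample data is made up, and the percentile uses the simple nearest-rank method, which may differ slightly from what k6 computes.

```javascript
// Nearest-rank percentile over a list of numbers (simplified).
function percentile(values, q) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((q / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up samples for illustration.
const durationsMs     = [120, 135, 150, 160, 180, 210, 250, 300, 450, 900];
const outcomes        = [true, true, true, false, true, true, true, true, false, true];
const sessionReadings = [3, 7, 5];

const trend   = { p50: percentile(durationsMs, 50), p95: percentile(durationsMs, 95) };
const rate    = outcomes.filter(Boolean).length / outcomes.length; // fraction that succeeded
const counter = outcomes.filter((ok) => !ok).length;               // running total of failures
const gauge   = sessionReadings[sessionReadings.length - 1];       // only the latest reading survives

console.log(trend);   // prints "{ p50: 180, p95: 900 }"
console.log(rate);    // prints "0.8"
console.log(counter); // prints "2"
console.log(gauge);   // prints "5"
```

The same ten checkout attempts produce four different answers depending on the question asked, which is why picking the type up front matters.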

All four stream to Grafana Cloud as Prometheus time series when you run with k6 cloud run. The names you define in code become the basis of the series names. You query them in Grafana exactly like any other metric, for example increase(boutique_cart_errors_total[5m]) or histogram_quantile(0.95, rate(boutique_checkout_duration_bucket[1m])). Your load test data lives in the same datasource as your infrastructure metrics, with the same query language and the same alerting system.

The k6 Browser module and Web Vitals

The load test results from part 2 included a CLS score of 0.117, just over the 0.10 "good" threshold. To understand that number you need to know what the browser module is measuring and why it's different from everything else in the test suite.

The browser module runs a real Chromium instance. It's not a simulated HTTP client but an actual browser, rendering pages, executing JavaScript, loading images, painting layout. Web Vitals are measurements taken from inside that rendering process:

LCP (Largest Contentful Paint): when did the largest visible element finish rendering? Measures perceived load speed.
FCP (First Contentful Paint): when did any content first appear? Measures how quickly the page starts showing something.
TTFB (Time to First Byte): how long before the browser received the first byte of the response? Measures server and network responsiveness.
CLS (Cumulative Layout Shift): how much did the page layout move around after initial render? Measures visual stability.

None of these are measurable from the HTTP layer. An HTTP client can tell you the server responded in 80ms. It can't tell you the user saw a blank white screen for 1.2 seconds while JavaScript parsed, or that the page jumped when an image loaded late and pushed all the text down.

The entry point is separate from the HTTP tests. Browser scenarios require Chromium, which can't share a process with the HTTP engine.

// src/browser.js
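import { browser } from 'k6/browser';
import { sleep }   from 'k6';
// k6's built-in check() does not await async predicates; the jslib version does
import { check }   from 'https://jslib.k6.io/k6-utils/1.5.0/index.js';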

export const options = {
  scenarios: {
    browser_smoke: {
      executor:        'per-vu-iterations',
      options:         { browser: { type: 'chromium' } },
      vus:             1,
      iterations:      3,
      gracefulStop:    '30s',
    },
  },
  thresholds: {
    browser_web_vital_lcp:  ['p(75)<2500'],
    browser_web_vital_fcp:  ['p(75)<1800'],
    browser_web_vital_ttfb: ['p(75)<800'],
    browser_web_vital_cls:  ['p(75)<0.10'],
  },
};

export default async function () {
  const page = await browser.newPage();

  try {
    await page.goto('http://10.4.20.2/');

    await check(page, {
      'page title present': async () => (await page.title()).length > 0,
      'shows Hot Products':  async () => {
        const body = await page.content();
        return body.includes('Hot Products');
      },
    });

    sleep(1);
  } finally {
    await page.close();
  }
}

Three implementation details worth being explicit about.

async/await throughout. The browser API is Promise-based. page.title() returns a Promise, not a string. Without await, page.title().length is undefined, because a Promise object has no length property, so the comparison runs against undefined rather than the title's length and the check result no longer depends on what the page rendered. This is the bug from part 2. Every browser API call needs to be awaited, including page.goto(), page.content(), and page.title().

try/finally around page operations. If a check throws, the finally block closes the page regardless. Without it, failed iterations leak Chromium instances and the test eventually exhausts memory. This isn't optional defensive programming — it's required for browser tests to be reliable.

No group() calls in browser scenarios. There's a long-standing k6 issue with groups in browser context. Use named checks for step-level visibility instead.
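The missing-await problem from the first of these points is easy to reproduce without a browser. In the sketch below, fakeTitle is a hypothetical stand-in for page.title(), and the title string is made up; it runs under plain Node.

```javascript
// A resolved Promise standing in for page.title(); no browser needed.
function fakeTitle() {
  return Promise.resolve('Online Boutique');
}

// Missing await: .length is read from the Promise object, which has none.
const unawaitedLength = fakeTitle().length;
console.log(unawaitedLength);     // prints "undefined"
console.log(unawaitedLength > 0); // prints "false"

// With await: the comparison sees the real string.
(async () => {
  const title = await fakeTitle();
  console.log(title.length > 0);  // prints "true"
})();
```

Either way the unawaited version never looks at the page content, so the check stops measuring anything.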

What CLS 0.117 actually means

The part 2 results showed:

browser_web_vital_lcp:  avg=335ms  p(75)=380ms   ✓
browser_web_vital_fcp:  avg=255ms  p(75)=290ms   ✓
browser_web_vital_ttfb: avg=36ms   p(75)=42ms    ✓
browser_web_vital_cls:  avg=0.117  p(75)=0.121   ✗

LCP, FCP, and TTFB are well inside their thresholds. TTFB at 36ms reflects a local network with no physical distance between test runner and server, but also a frontend service responding promptly. Nothing to fix there.

CLS at 0.117 failed the threshold. CLS accumulates a score each time visible content shifts position after the initial render; specifically, each shift is scored as the fraction of the viewport affected multiplied by the distance the content moved. A score of 0 means nothing shifted. A score of 0.10 is Google's boundary between "good" and "needs improvement."
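A worked example makes the scoring less abstract. The sketch below multiplies an impact fraction by a distance fraction for each shift and sums the results, which is a simplification of how browsers group shifts into windows. The two shifts are invented numbers chosen to land on 0.117; they are not measurements from the Online Boutique.

```javascript
// Score of one layout shift = impact fraction x distance fraction.
function shiftScore(impactFraction, distanceFraction) {
  return impactFraction * distanceFraction;
}

// Google's published CLS bands: up to 0.1 good, up to 0.25 needs improvement.
function rateCls(cls) {
  if (cls <= 0.1) return 'good';
  if (cls <= 0.25) return 'needs improvement';
  return 'poor';
}

// Made-up shifts: a large block moving a little, then a smaller one.
const shifts = [
  { impact: 0.6, distance: 0.15 },
  { impact: 0.3, distance: 0.09 },
];

const cls = shifts.reduce((sum, s) => sum + shiftScore(s.impact, s.distance), 0);
console.log(cls.toFixed(3), rateCls(cls)); // prints "0.117 needs improvement"
```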

On the Online Boutique homepage, the most likely cause is the product image grid. The browser paints the page structure (nav, heading, product card containers) before the images have loaded. When the images arrive, they push the surrounding layout down. The browser records that shift and adds it to the CLS score.

The fix is to tell the browser how much space each image will occupy before it loads. Explicit width and height attributes on <img> tags, or a CSS aspect-ratio declaration on the container, lets the browser reserve the right amount of space during initial layout. The images load into that space without causing a shift.

<!-- causes layout shift — browser doesn't know the image dimensions -->
<img src="/static/img/products/sunglasses.jpg" />

<!-- no layout shift — browser reserves space before the image loads -->
<img src="/static/img/products/sunglasses.jpg" width="320" height="320" />

The Online Boutique frontend doesn't do this. The product images load asynchronously into unsized containers and shift the layout on arrival. A 0.117 CLS score won't meaningfully affect search rankings on its own, but it is a real user experience problem — content jumping while someone is trying to read it, and it's exactly the class of issue that HTTP testing never surfaces because HTTP testing doesn't render anything.

What's next

Part 4 closes the loop on the unified observability story that started in part 1. k6 run is a developer workflow. k6 cloud run streams results into Grafana Cloud in real time. k6 Studio is a visual test editor that generates scripts without writing code. And synthetic monitoring turns the same scripts you've been running locally into scheduled checks from global probe locations, the same code, running permanently, alerting when production degrades.

The shift from load test to production monitor is what makes k6 different from most testing tools. Part 4 is about making that shift concrete.

#k6 #Grafana #LoadTesting #WebVitals #PerformanceTesting #Observability #SRE #Kubernetes

Building a Real k6 Test Suite Against a Live Kubernetes App

Matthew Wimpelberg — Thu, 28 May 2026 15:29:02 +0000

Part 2 of 4: Building a Real k6 Test Suite Against a Live Kubernetes App

In part 1 I covered k6's philosophy and the anatomy of a first test. This post is where things get real — a production-grade test suite running against a live microservices app on a homelab Kubernetes cluster, including what went wrong on the first run and how I debugged it. All of the code can be found here: https://github.com/mwimpelberg28/k6-playground

The target: Online Boutique

Rather than testing against a mock or a toy API, I wanted something that resembles a real production system. Google's Online Boutique is a microservices demo app with 11 services covering a realistic e-commerce stack: frontend, cart, checkout, product catalog, currency conversion, recommendations, and more.

Deploying it took about two minutes:

kubectl create namespace boutique
kubectl apply -n boutique -f \
  https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml

My homelab runs a kubeadm cluster on Ubuntu with MetalLB for load balancing. Within 30 seconds MetalLB had assigned a real external IP and the app was serving traffic at http://10.4.20.2.

kubectl get svc -n boutique frontend-external
# NAME                TYPE           EXTERNAL-IP   PORT(S)
# frontend-external   LoadBalancer   10.4.20.2     80:xxxxx/TCP

The architecture decision that matters most

Before writing a single test I designed a layered project structure. This is the difference between a test suite and a folder of scripts.

k6-boutique/
├── src/
│   ├── config/          ← test options as JSON, selected at runtime
│   │   ├── smoke.config.json
│   │   ├── load.config.json
│   │   ├── stress.config.json
│   │   └── browser.config.json
│   ├── scenarios/       ← user journey flows: chain scripts + sleep
│   │   ├── browseFlow.js
│   │   ├── shopperFlow.js
│   │   ├── currencyFlow.js
│   │   └── stressFlow.js
│   ├── scripts/         ← individual page actions: one group() per file
│   │   ├── home.js
│   │   ├── product.js
│   │   ├── cart.js
│   │   ├── checkout.js
│   │   └── currency.js
│   ├── pages/           ← Page Object Model classes for browser tests
│   │   ├── HomePage.js
│   │   └── ProductPage.js
│   ├── lib/             ← shared HTTP client and check assertions
│   │   ├── client.js
│   │   └── checks.js
│   ├── main.js          ← single entry point for all HTTP tests
│   └── browser.js       ← entry point for browser tests
├── webpack.config.js
└── package.json

Think of it as Lego blocks. The lib/ layer knows how to talk to the app. The scripts/ layer wraps each action in a named group(). The scenarios/ layer chains those actions into user journeys. The config/ layer defines the load profile and thresholds for each test type. Nothing reaches down more than one layer.

The shared client

src/lib/client.js knows how to talk to the app — base URL, request helpers, product IDs, checkout payload. Every layer imports from it. Change the target URL once, everything picks it up.

One detail worth calling out: every request carries a name tag.

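// src/lib/client.js
import http from 'k6/http';

// BASE_URL and baseHeaders are assumed to be defined elsewhere in this file (not shown).

// Build request params; the name tag groups every URL for one action into one series.
function params(name) {
  return { headers: baseHeaders, tags: { service: 'frontend', name } };
}

export function getProduct(productId) {
  return http.get(`${BASE_URL}/product/${productId}`, params('get-product'));
}

export function addToCart(productId, quantity = 1) {
  // k6 sends an object body form-encoded, which is what the cart endpoint expects.
  return http.post(`${BASE_URL}/cart`, { product_id: productId, quantity: quantity.toString() }, params('add-to-cart'));
}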

Without an explicit name tag, k6 uses the full request URL as the name, so /product/0PUK6V6EV0 and /product/1YMWWN1N4O are tracked as separate metric series. With 10 product IDs and many VUs you hit Grafana Cloud's "too many series" limit fast. The name tag collapses all product page requests into a single get-product series regardless of the ID in the URL.
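To make the series arithmetic concrete, here is a minimal plain-JavaScript sketch (no k6 required) that counts distinct series when samples are keyed by URL versus by name tag. The request list is a made-up sample and countSeries is a hypothetical helper, not part of k6 or the repo.

```javascript
// Hypothetical sample of requests, each with its URL and the name tag it carries.
const requests = [
  { url: '/product/0PUK6V6EV0', name: 'get-product' },
  { url: '/product/1YMWWN1N4O', name: 'get-product' },
  { url: '/product/0PUK6V6EV0', name: 'get-product' },
  { url: '/cart', name: 'add-to-cart' },
];

// Count distinct series when samples are keyed by the given field.
function countSeries(samples, key) {
  return new Set(samples.map((s) => s[key])).size;
}

const byUrl = countSeries(requests, 'url');   // one series per distinct URL
const byName = countSeries(requests, 'name'); // one series per action

console.log(`keyed by URL: ${byUrl} series, keyed by name tag: ${byName} series`);
```

The gap widens with every new product ID: the URL-keyed count grows with the catalog, while the name-keyed count only grows when a new action is added.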

The shared checks

src/lib/checks.js knows what a good response looks like for each page:

// src/lib/checks.js
import { check } from 'k6';

// Assertions that define a healthy homepage response.
export function checkHome(res) {
  return check(res, {
    'status 200':     (r) => r.status === 200,
    'shows products': (r) => r.body.includes('Hot Products'),
    'response < 2s':  (r) => r.timings.duration < 2000,
  });
}

Define it once, use it everywhere. When the app changes, fix it in one place.

The scripts layer

Each file in scripts/ wraps one action in a named group() and runs the appropriate check. This is the unit of reuse — scenarios call these, not raw HTTP calls.

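// src/scripts/product.js
import { group } from 'k6';
import { getProduct } from '../lib/client.js';
import { checkProductPage } from '../lib/checks.js';

// Casual browsing: same request and check, reported under its own group name.
export function browseProduct(productId) {
  let ok;
  group('browse product', () => {
    ok = checkProductPage(getProduct(productId));
  });
  return ok;
}

// Intent-to-buy: a separate group so it can carry a separate SLA.
export function viewProduct(productId) {
  let ok;
  group('view product', () => {
    ok = checkProductPage(getProduct(productId));
  });
  return ok;
}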

Different group names matter — group_duration{group:::browse product} and group_duration{group:::view product} are separate metrics, so you can set different SLAs for casual browsing vs. intent-to-buy flows.

Config files drive everything

Rather than hardcoding load profiles in test files, each test type has a JSON config file that's passed at runtime. The single entry point reads whichever config you point it at:

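// src/main.js
// Import paths below follow the project tree shown earlier.
import { sleep } from 'k6';
import { getHome } from './lib/client.js';
import { browseFlow } from './scenarios/browseFlow.js';
import { shopperFlow } from './scenarios/shopperFlow.js';
import { currencyFlow } from './scenarios/currencyFlow.js';
import { stressFlow } from './scenarios/stressFlow.js';

const CONFIG_FILE = __ENV.CONFIG_FILE || '../src/config/smoke.config.json';
const testConfig  = JSON.parse(open(CONFIG_FILE));

// k6 spells this option insecureSkipTLSVerify; keys in the JSON config override it.
export const options = Object.assign({ insecureSkipTLSVerify: false }, testConfig);

export function setup() {
  getHome();  // warm the connection before VUs start
  sleep(2);
}

// Named exports so scenario `exec` fields in the JSON config can reference them
export { browseFlow, shopperFlow, currencyFlow, stressFlow };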

The build step (webpack) bundles everything into dist/test.main.js. The JSON config files stay outside the bundle and are opened at runtime, so you can swap them without rebuilding.

npm run build

# local run
k6 run dist/test.main.js -e CONFIG_FILE=../src/config/load.config.json

# cloud run
k6 cloud run dist/test.main.js -e CONFIG_FILE=../src/config/load.config.json
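The entry point merges a small set of defaults with whatever the JSON config provides. A minimal sketch of that merge in plain JavaScript, using hypothetical keys and values rather than the repo's real config:

```javascript
// Hypothetical defaults the entry point might supply.
const defaults = { vus: 1, duration: '10s' };

// Stand-in for JSON.parse(open(CONFIG_FILE)); the contents are made up for illustration.
const testConfig = JSON.parse('{"vus": 20, "thresholds": {"checks": ["rate>0.90"]}}');

// Object.assign copies left to right, so keys from the config file win over defaults.
const options = Object.assign({}, defaults, testConfig);

console.log(options);
```

Because later sources overwrite earlier ones, a config file only needs to list what differs from the defaults. Note that the merge is shallow: a nested object such as thresholds is replaced as a whole, not combined key by key.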

Smoke test first

The smoke config is 1 VU, 5 iterations of shopperFlow — homepage → product → add to cart → checkout. Its only job is to confirm the app is up and critical paths respond correctly. If smoke fails, nothing else runs.

// src/config/smoke.config.json
{
  "scenarios": {
    "smoke": {
      "executor": "per-vu-iterations",
      "vus": 1,
      "iterations": 5,
      "exec": "shopperFlow",
      "gracefulStop": "30s"
    }
  },
  "thresholds": {
    "http_req_failed":   ["rate<0.05"],
    "http_req_duration": ["p(95)<2000"],
    "checks":            ["rate>0.90"],

    "group_duration{group:::homepage}":     ["avg<500"],
    "group_duration{group:::view product}": ["avg<500"],
    "group_duration{group:::add to cart}":  ["avg<1000"],
    "group_duration{group:::checkout}":     ["avg<5000"]
  }
}

The group_duration thresholds are worth explaining. http_req_duration tells you how fast individual requests are. group_duration tells you how long an entire named step takes — a group might contain a single request or several. Setting an SLA on group_duration{group:::checkout} is much closer to a real business SLO than a raw request threshold, because checkout involves multiple sequential calls.

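The syntax looks unusual: group:::checkout has three colons. The first colon separates the tag name (group) from its value, and the value itself is ::checkout, because k6 prefixes every group name with :: as the path separator for nested groups. Every group you define in code gets a corresponding series in the built-in group_duration metric for free.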
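The difference between a request-level limit and a step-level limit is easy to see with numbers. This is a simplified plain-JavaScript sketch: the three durations and the 2,000ms step limit are hypothetical, and it compares individual requests against a fixed limit rather than computing a real p(95).

```javascript
// Hypothetical durations (ms) for the sequential requests inside one checkout group.
const checkoutRequests = [400, 900, 1200];

const REQUEST_LIMIT_MS = 2000; // per-request limit, in the spirit of http_req_duration
const GROUP_LIMIT_MS = 2000;   // hypothetical limit for the whole step

// Every individual request is comfortably under the per-request limit...
const allRequestsOk = checkoutRequests.every((d) => d < REQUEST_LIMIT_MS);

// ...but the step lasts at least as long as the sum of its sequential requests.
const groupDuration = checkoutRequests.reduce((sum, d) => sum + d, 0);
const groupOk = groupDuration < GROUP_LIMIT_MS;

console.log({ allRequestsOk, groupDuration, groupOk });
```

Each request looks healthy on its own, yet the step a user actually waits for takes 2,500ms. That is the gap a group_duration threshold is meant to close.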

k6 run dist/test.main.js -e CONFIG_FILE=../src/config/smoke.config.json
# or: npm run smoke

What the first run caught

First smoke run: 10% error rate, two thresholds crossed. Response times were excellent — p95 of 87ms — so this wasn't a performance problem. Something was functionally wrong.

Debugging step 1 — verify the text the check was looking for:

curl -s http://10.4.20.2/product/0PUK6V6EV0 | grep -i "add to cart"
# <button type="submit" class="cymbal-button-primary">Add To Cart</button>

Text matched exactly. So the check wasn't wrong — some requests were returning non-200 responses before the check even ran.

Debugging step 2 — check what the cart POST actually returns:

curl -v -X POST http://10.4.20.2/cart \
  -d "product_id=0PUK6V6EV0&quantity=1" \
  -H "Content-Type: application/x-www-form-urlencoded"

< HTTP/1.1 302 Found
< Location: /cart
< Set-Cookie: shop_session-id=51779754-8ac6-4ac9-bbd9-1f062a8dc1b4

The cart POST returns a 302 and sets a session cookie. With only a handful of iterations, cold-start noise before sessions were established was dominating the results. The fix: bump the iteration count, add the setup() warmup in main.js, and slightly relax thresholds — smoke should catch catastrophic failure, not enforce strict SLOs.

This is the value of testing against a real app rather than a mock — you discover actual system behaviour.

Two bugs found during the load test

Running the full suite surfaced two more issues.

Bug 1 — Checkout success was 0%. All 79 checkout attempts completed and returned 200, but none matched the expected text. One curl command revealed it:

curl -s [checkout flow with cookies] | grep -i "order\|confirm\|thank"
# Your order is complete!

The check in src/lib/checks.js was looking for "Your order is placed", but the page actually says "Your order is complete!". Fixed in one place, picked up everywhere:

export function checkCheckout(res) {
  return check(res, {
    'order placed':  (r) => r.status === 200 && r.body.includes('Your order is complete!'),
    'response < 3s': (r) => r.timings.duration < 3000,
  });
}

Bug 2 — Browser "page title present" failed all 41 iterations. In k6's browser API, page.title() returns a Promise and needs to be awaited. The fix sits in src/scenarios/browserFlow.js:

// broken
'page title present': () => page.title().length > 0,

// fixed
'page title present': async () => (await page.title()).length > 0,

Both fixes are a good reminder that checks are only as good as the assumptions baked into them. The test framework did its job — it surfaced the mismatches immediately.
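The page.title() bug can be reproduced without a browser. In this sketch the page object is a stand-in whose title() returns a Promise, and the title string is a placeholder; only the Promise behaviour matters.

```javascript
// Stand-in for the browser page object: title() returns a Promise, like page.title().
const page = { title: () => Promise.resolve('Online Boutique') };

// Broken: a Promise has no length property, so this evaluates undefined > 0.
const brokenCheck = () => page.title().length > 0;

// Fixed: await the Promise first, then measure the resolved string.
const fixedCheck = async () => (await page.title()).length > 0;

console.log('broken:', brokenCheck()); // false on every call
fixedCheck().then((ok) => console.log('fixed:', ok));
```

Nothing throws in the broken version, which is why the failure showed up as a 0% pass rate rather than as an error.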

User journeys: three concurrent scenarios

With smoke passing, it was time for the load test. Rather than hitting one endpoint in a loop, three distinct user types run simultaneously as k6 scenarios. All three are defined in load.config.json; the scenario functions live in src/scenarios/.

// src/config/load.config.json (scenarios section)
{
  "scenarios": {
    "browsers":  { "executor": "ramping-vus",          "exec": "browseFlow",   "stages": [{"duration":"1m","target":20}, {"duration":"3m","target":20}, {"duration":"1m","target":0}], "tags": {"journey":"browser"} },
    "shoppers":  { "executor": "ramping-vus",          "exec": "shopperFlow",  "stages": [{"duration":"1m","target":5},  {"duration":"3m","target":5},  {"duration":"1m","target":0}], "tags": {"journey":"shopper"} },
    "currencyUsers": { "executor": "constant-arrival-rate", "exec": "currencyFlow", "rate": 2, "timeUnit": "1s", "duration": "5m", "preAllocatedVUs": 5, "maxVUs": 10, "tags": {"journey":"currency"} }
  }
}

Browsers — casual visitors, read-only, up to 20 VUs. The scenario chains visitHome() and multiple browseProduct() calls from the scripts layer:

// src/scenarios/browseFlow.js
export function browseFlow() {
  let pagesViewed = 0;

  visitHome();
  pagesViewed++;
  sleep(randSleep(2, 5));

  const numProducts = Math.floor(Math.random() * 3) + 2;
  for (let i = 0; i < numProducts; i++) {
    browseProduct(randomProduct());
    pagesViewed++;
    sleep(randSleep(1, 4));
  }

  browseDepth.add(pagesViewed);
}

Shoppers — full checkout flow, up to 5 VUs. The checkout script returns { ok, duration } so the scenario can record custom metrics without needing access to the raw response:

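// src/scenarios/shopperFlow.js
export function shopperFlow() {
  // Pick one product for the whole journey
  // (assumed: the same randomProduct() helper browseFlow uses).
  const productId = randomProduct();

  visitHome();
  sleep(randSleep(2, 4));

  viewProduct(productId);
  sleep(randSleep(1, 3));

  // Abort the journey if the item never reached the cart.
  const cartOk = addItemToCart(productId, 1);
  if (!cartOk) { cartErrors.add(1); return; }

  viewCart();
  sleep(randSleep(2, 4));

  // Record checkout timing and outcome as custom metrics.
  const { ok, duration } = doCheckout();
  checkoutDuration.add(duration);
  checkoutSuccess.add(ok);
}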

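Currency switchers: exercises the currency microservice at a constant arrival rate of 2 iterations per second. constant-arrival-rate controls throughput rather than concurrency: k6 starts 2 iterations every second regardless of how long each one takes, adding VUs (up to maxVUs) when iterations run long. This open-model approach is closer to how production traffic behaves, because new users keep arriving whether or not the server is keeping up.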
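How many VUs an arrival-rate scenario needs follows from Little's law: concurrency is roughly arrival rate times iteration duration. A rough sketch using the rate and VU limits from load.config.json; the iteration durations are hypothetical and requiredVUs is an illustrative helper, not a k6 function.

```javascript
// VUs needed to sustain an arrival rate: rate x iteration duration (Little's law).
function requiredVUs(ratePerSecond, iterationSeconds) {
  return Math.ceil(ratePerSecond * iterationSeconds);
}

const rate = 2;            // iterations per second, from load.config.json
const preAllocatedVUs = 5; // from load.config.json
const maxVUs = 10;         // from load.config.json

// Hypothetical iteration durations in seconds.
for (const seconds of [1, 3, 6]) {
  const needed = requiredVUs(rate, seconds);
  const status = needed <= preAllocatedVUs ? 'within preAllocatedVUs'
    : needed <= maxVUs ? 'needs extra VUs up to maxVUs'
    : 'exceeds maxVUs, so some iterations would be dropped';
  console.log(`${seconds}s iterations -> ${needed} VUs (${status})`);
}
```

If the currency service slows down under load, iterations get longer and the scenario needs more VUs to hold the same rate. Once it runs out, k6 cannot start the scheduled iterations, and that shortfall is itself a useful degradation signal.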

Per-journey request thresholds

Because each scenario tag is set in the JSON config ("tags": {"journey":"browser"}), you can threshold each journey's request duration independently:

"http_req_duration{journey:browser}":  ["p(95)<2000"],
"http_req_duration{journey:shopper}":  ["p(95)<4000"],
"http_req_duration{journey:currency}": ["p(95)<2000"]

Custom metrics as business SLOs

Custom metrics are defined in the scenario files where they're used. shopperFlow.js owns the checkout metrics; browseFlow.js owns browse depth:

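// src/scenarios/shopperFlow.js
import { Trend, Rate, Counter } from 'k6/metrics';

// The second Trend argument marks the values as time, so they are reported as durations.
const checkoutDuration = new Trend('boutique_checkout_duration', true);
const checkoutSuccess  = new Rate('boutique_checkout_success');
const cartErrors       = new Counter('boutique_cart_errors');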

The thresholds in load.config.json encode real business requirements:

"boutique_checkout_duration": ["p(95)<5000"],
"boutique_checkout_success":  ["rate>0.80"]

This is the shift from infrastructure SLOs to business SLOs — codified, version-controlled, enforced automatically in CI.

Results across all four test types

After fixing both bugs and re-running the full suite:

The results tell a clear story.

Response times are strong under normal load. Smoke p95 at 89ms and load p95 at 273ms show the app handles realistic traffic comfortably on homelab hardware.

Checkout: 0% → 100% after the fix. All 80 checkout attempts placed orders successfully, with a p95 of 224ms against a 5,000ms threshold. The bug was entirely in the check assertion, not the app.

Browser Web Vitals are healthy. LCP at 335ms and FCP at 255ms are well inside Core Web Vital targets. TTFB at 36ms is excellent. CLS at 0.117 just nudges over the 0.10 target — worth monitoring but not alarming. Note: browser tests deliberately have no group() calls — there's a long-standing k6 issue with groups in the browser context.

Product page buckled first under stress. Product page failures started accumulating around 100 VUs while the homepage held all the way through the 150 VU peak — 9,907 successful checks, zero 500 errors. The product page finished with 2,037 failures total. This makes architectural sense: the product page fans out to the product catalog, recommendation, and currency services simultaneously. Under load, those downstream calls start queuing. The homepage is a simpler call graph and degrades later.

Browse depth averaged 4.0 pages per session — the random product browsing in the browse journey is working as intended, generating realistic read patterns.
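The 4.0 average is what the browseFlow code predicts. A quick check of the arithmetic in plain JavaScript, assuming Math.random() is uniform:

```javascript
// Math.floor(Math.random() * 3) + 2 yields 2, 3, or 4 with equal probability.
const possibleProductCounts = [2, 3, 4];

// Each session views the homepage once plus numProducts product pages.
const possibleDepths = possibleProductCounts.map((n) => 1 + n);

// With equally likely outcomes, the expected depth is the plain average.
const expectedDepth =
  possibleDepths.reduce((sum, d) => sum + d, 0) / possibleDepths.length;

console.log(possibleDepths, expectedDepth);
```

A measured average far from 4 would point to sessions ending early, for example because requests failed before the loop finished.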

What's next

Post 3 covers the stress test in depth: reading degradation signals, understanding the product page failure pattern architecturally, and the k6 Browser module for Web Vitals measurement. It also covers all four custom metric types and how to use them as CI-enforceable SLOs in Grafana Cloud.

#k6 #Grafana #LoadTesting #Kubernetes #Observability #SRE #PerformanceTesting

k6: The Tool, The Philosophy, and Your First Test

Matthew Wimpelberg — Tue, 26 May 2026 09:37:09 +0000

I've been going deep on k6, Grafana's open-source load and performance testing tool. This is the first in a four-part series documenting that journey, from first principles to a full test suite running against a live Kubernetes environment.

Why k6?

Most load testing tools treat tests as configuration. k6 treats them as code: JavaScript that is version-controlled, modular, and reviewable like any other engineering artifact. That's a meaningful philosophical difference. It means your performance tests live in the same repo as your application, go through the same review process, and can be maintained by the same team.

For those of us already in the Grafana ecosystem, there's another compelling reason: k6 is a Grafana Labs product. Test results stream natively into Grafana Cloud. Custom metrics you define in your scripts become queryable Prometheus time series. Your load test data lives alongside your infrastructure metrics, traces, and logs in one place with one query language and one alerting system.

That unified observability story is what made me want to understand k6 deeply, not just at the surface level.

What k6 covers

Most people think of k6 as a load testing tool. It's actually much broader:

Smoke testing: Is the app up and returning the right things?
Load testing: How does it behave under realistic traffic?
Stress testing: Where does it break?
Soak testing: Does it degrade over hours?
Browser testing: Real Chromium, Web Vitals, frontend performance
Synthetic monitoring: Scheduled availability checks from global probe locations

One tool, one scripting language, the full testing lifecycle from development through production monitoring. I'm going to begin by running a test script locally on my laptop to illustrate the most basic use case.

Your first k6 script

Once you have k6 installed, a minimal test looks like this:

import http from 'k6/http';
import { sleep, check, group } from 'k6';

export const options = {
  vus: 10,
  duration: '30s',
  thresholds: {
    http_req_failed:                    ['rate<0.01'],
    http_req_duration:                  ['p(95)<500'],
    'http_req_duration{group:::Homepage}': ['p(95)<400'], // group-scoped threshold
  },
};

export default function () {
  group('Homepage', () => {
    const res = http.get('https://quickpizza.grafana.com/');
    check(res, {
      'status 200':       (r) => r.status === 200,
      'response < 500ms': (r) => r.timings.duration < 500,
    });
  });

  sleep(1);
}

Three things are happening here:

options tells k6 how to run: 10 virtual users for 30 seconds, and three thresholds that define pass/fail. Less than 1% of requests can fail, the 95th percentile response time must stay under 500ms overall, and it must stay under 400ms for requests inside the Homepage group. If any threshold is violated, k6 exits with a non-zero code. Your CI pipeline fails. That's your SLO enforced automatically.

checks are per-request assertions. A failing check doesn't stop the test; it increments a failure counter, and it only fails the run if you put a threshold on the checks metric. At the end you see pass rates across all iterations, not just a binary pass/fail.

sleep is think time between requests. Without it, k6 hammers the server as fast as possible with unrealistic load that produces misleading results. Real users read pages. sleep(1) models that.
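To show how thresholds turn raw samples into a pass/fail decision, here is a simplified plain-JavaScript sketch. The durations are invented, and percentile is a nearest-rank approximation written for illustration; k6's own calculation may interpolate differently.

```javascript
// Nearest-rank percentile: a simplification for illustration only.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical request durations (ms) and failure flags for ten requests.
const durations = [45, 52, 61, 70, 88, 95, 110, 130, 180, 620];
const failures = durations.map(() => false);

const avg = durations.reduce((sum, d) => sum + d, 0) / durations.length;
const p95 = percentile(durations, 95);
const failRate = failures.filter(Boolean).length / failures.length;

// Mirror the two global thresholds from the script above.
const results = {
  'http_req_failed rate<0.01': failRate < 0.01,
  'http_req_duration p(95)<500': p95 < 500,
};
const passed = Object.values(results).every(Boolean);

console.log({ avg, p95, failRate, results, passed });
```

The average here is well under 500ms, but one slow request is enough to push the 95th percentile over the limit and fail the run. That is why the thresholds target percentiles rather than averages.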

Run it:

k6 run script.js

The terminal output gives you request count, duration statistics (avg, min, med, max, p(90), and p(95) by default), error rate, data sent/received, and threshold results. A clean first run looks like this (abridged):

✓ status 200
✓ response < 500ms

http_req_duration: avg=45ms p(95)=112ms
http_req_failed:   0.00%
✓ thresholds passed

What's next

My next post will cover building a real test suite with a shared library architecture, smoke testing against a live microservices app running on a homelab Kubernetes cluster, and what happens when your first run doesn't go as expected.

The target app is Google's Online Boutique. It's a realistic e-commerce microservices demo with 11 services.

#k6 #Grafana #LoadTesting #PerformanceTesting #Observability #SRE #Kubernetes


Running the load test from Grafana Cloud with `k6 cloud run`