Matthew Wimpelberg

Posted on Jun 8

Custom Metrics, Stress Testing, and Web Vitals: Going Beyond Basic Load Testing with k6

#tutorial #javascript #performance #testing

Part 3 of 4: Custom Metrics, Stress Testing, and Web Vitals — Going Beyond Basic Load Testing with k6

In part 2 I built a layered test suite against Google's Online Boutique on a homelab Kubernetes cluster. Smoke passed. The load test ran clean after fixing two bugs: a wrong assertion string on checkout and a missing await in the browser scenario. The load test summary showed p95 response times at 273ms, checkout success at 100%, and a CLS score of 0.117, just over the 0.10 threshold.

That left three things unfinished. The stress test hadn't run. The CLS finding had no explanation. And the four custom metric types I'd defined in the scenarios deserved more than a passing mention.

This post runs the stress test, reads the results architecturally, explains what CLS 0.117 actually means and why HTTP testing would never have surfaced it, and walks through all four custom metric types with concrete examples of when each one is the right tool.

All the code is here: https://github.com/mwimpelberg28/k6-playground

Running the stress test

The stress config ramps VUs up in stages to a peak of 150, then back down to zero. The goal isn't to "break the app"; it's to find where degradation starts and understand the shape of it.

// src/config/stress.config.json
{
  "scenarios": {
    "stress": {
      "executor": "ramping-vus",
      "exec": "stressFlow",
      "stages": [
        { "duration": "2m", "target": 50  },
        { "duration": "2m", "target": 100 },
        { "duration": "2m", "target": 150 },
        { "duration": "2m", "target": 100 },
        { "duration": "2m", "target": 0   }
      ],
      "gracefulStop": "30s"
    }
  },
  "thresholds": {
    "http_req_failed":                        ["rate<0.10"],
    "http_req_duration":                      ["p(95)<5000"],
    "group_duration{group:::homepage}":       ["avg<1000"],
    "group_duration{group:::browse product}": ["avg<2000"],
    "checks":                                 ["rate>0.70"]
  }
}

The thresholds are deliberately looser than the load test. Stress isn't about enforcing SLOs; it's about observing where and how the system degrades before it hits a hard wall. A stress test that fails immediately at tight thresholds tells you nothing useful about the degradation curve.
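Before launching a run it can help to sanity-check the shape of the ramp. The following is a minimal sketch in plain JavaScript, not code from the repository: parseDuration and summarizeStages are hypothetical helpers, and the parser assumes only single-unit durations such as "30s" or "2m".

```javascript
// Hypothetical helper: convert a single-unit duration string ("30s", "2m", "1h") to seconds.
function parseDuration(text) {
  const match = /^(\d+)(s|m|h)$/.exec(text);
  if (!match) throw new Error(`unsupported duration: ${text}`);
  const unitSeconds = { s: 1, m: 60, h: 3600 };
  return Number(match[1]) * unitSeconds[match[2]];
}

// Hypothetical helper: total ramp time and highest VU target for a ramping-vus stage list.
function summarizeStages(stages) {
  return stages.reduce(
    (acc, stage) => ({
      totalSeconds: acc.totalSeconds + parseDuration(stage.duration),
      peakVUs: Math.max(acc.peakVUs, stage.target),
    }),
    { totalSeconds: 0, peakVUs: 0 }
  );
}

// The stages from stress.config.json.
const stressStages = [
  { duration: '2m', target: 50 },
  { duration: '2m', target: 100 },
  { duration: '2m', target: 150 },
  { duration: '2m', target: 100 },
  { duration: '2m', target: 0 },
];

const summary = summarizeStages(stressStages);
console.log(summary); // total ramp time in seconds and the highest VU target
```

For this config the summary works out to ten minutes of ramping with a peak target of 150 VUs, plus whatever gracefulStop adds at the end.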

npm run stress
# k6 run dist/test.main.js -e CONFIG_FILE=../src/config/stress.config.json

What the stress test showed

The homepage held through the full ramp. Browse product requests started accumulating failures around the 100 VU mark, and by 150 VUs the check pass rate for product pages had dropped noticeably while the homepage check pass rate stayed flat.

That divergence is the finding. The homepage and product page both live in the same frontend service, on the same pod. If the frontend service itself were the bottleneck, both would degrade together. They didn't.

The difference is what each endpoint does downstream. The homepage makes one call: fetch featured products from the catalog service. The product page makes three in parallel: fetch product details from the catalog service, fetch recommendations from the recommendation service, and convert the price via the currency service. Under low concurrency that fan-out is invisible. Under high concurrency, those downstream services start queuing work, and the product page's response time is gated on whichever of the three takes longest.
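The "gated on the slowest call" behavior can be modeled in a few lines. This is a sketch under stated assumptions: the latency numbers are made up for illustration, not measurements from the stress run, and parallelFanOutLatency is a hypothetical helper.

```javascript
// Latency of a request that waits on several downstream calls issued in parallel:
// the caller is finished only when the slowest dependency has answered.
function parallelFanOutLatency(downstreamMs) {
  return Math.max(...downstreamMs);
}

// Illustrative (made-up) downstream latencies in milliseconds.
const lowLoad = { catalog: 20, recommendation: 25, currency: 15 };
const highLoad = { catalog: 40, recommendation: 900, currency: 60 }; // one service is queuing

// The homepage touches only the catalog; the product page touches all three.
const homepageLow = parallelFanOutLatency([lowLoad.catalog]);
const productLow = parallelFanOutLatency(Object.values(lowLoad));
const homepageHigh = parallelFanOutLatency([highLoad.catalog]);
const productHigh = parallelFanOutLatency(Object.values(highLoad));

console.log({ homepageLow, productLow, homepageHigh, productHigh });
```

In this model the two pages are nearly indistinguishable at low load, while under load the product page inherits the full delay of the one saturated dependency and the homepage barely moves.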

This is one of the defining properties of microservices under load. Call graph depth matters more than frontend capacity. A single downstream service that saturates its thread pool or starts garbage collecting will cause latency spikes in every upstream caller that touches it, and only in those callers. The homepage, which doesn't touch the recommendation or currency service, keeps serving cleanly.

The stress test didn't break the app catastrophically. The homepage never went down. That's actually a well-behaved degradation pattern: the system is shedding load on complex, expensive paths while protecting simple ones. A poorly behaved version of this would see the frontend process itself crash, taking everything with it. What we observed instead was selective degradation by call graph complexity, which points directly at the downstream services as the constraint rather than the frontend.

Why group_duration catches this and http_req_duration doesn't

http_req_duration measures how long a single HTTP request takes. During the stress test, individual requests to the frontend completed in reasonable time even as the app was struggling. The frontend was accepting connections and dispatching work quickly. What was slow was waiting for the downstream calls to come back.

group_duration measures the wall-clock time of a named step end-to-end, including any sequential calls inside it. Every group() in the scripts layer gets a corresponding group_duration{group:::name} series automatically with no extra instrumentation required.

// threshold on a single request
"http_req_duration{journey:shopper}": ["p(95)<4000"],

// threshold on the full browse step including downstream wait time
"group_duration{group:::browse product}": ["avg<2000"]

If the only threshold had been http_req_duration, the stress test would have looked healthier than it was. The group threshold on browse product caught the degradation because it was measuring the step the user actually experiences, from initiating the product page load to receiving a complete response, including all downstream latency.

This is the shift from infrastructure metrics to user-experience metrics. The group is the unit of SLO, not the request.
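A small arithmetic sketch shows how a per-request limit and a per-step limit can disagree. The durations are hypothetical, and the comparison looks at a single iteration, whereas real k6 thresholds aggregate (p95, avg) over every sample in the run.

```javascript
// Hypothetical durations (ms) of the sequential requests inside one "browse product" step.
const requestDurations = [700, 800, 750];

// Per-request view: the slowest single request, which is what a request-level limit sees.
const slowestRequest = Math.max(...requestDurations);

// Step view: wall-clock time of the whole step when the requests run one after another.
const stepDuration = requestDurations.reduce((sum, ms) => sum + ms, 0);

const REQUEST_LIMIT_MS = 4000; // mirrors the "p(95)<4000" limit on http_req_duration
const STEP_LIMIT_MS = 2000; // mirrors the "avg<2000" limit on group_duration

const requestThresholdPasses = slowestRequest < REQUEST_LIMIT_MS;
const stepThresholdPasses = stepDuration < STEP_LIMIT_MS;

console.log({ slowestRequest, stepDuration, requestThresholdPasses, stepThresholdPasses });
```

Every request is comfortably inside its limit, yet the step as a whole takes 2250ms and misses the 2000ms budget, which is the gap a group-level threshold is meant to close.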

The four custom metric types

k6 ships four custom metric types. Each has a specific meaning that makes it right for certain questions and wrong for others. Using the wrong one produces data that's technically correct but practically misleading.

Trend collects a distribution of values and exposes percentiles, min, max, and average. Use it when you want to know what "typical" looks like across all iterations. Checkout duration is a Trend because you want p95, the slowest experience that 95% of users had, not just whether checkout ever succeeded.

const checkoutDuration = new Trend('boutique_checkout_duration', true);

// record it — called once per checkout attempt
checkoutDuration.add(duration);

// threshold against it
"boutique_checkout_duration": ["p(95)<5000"]

The second argument to new Trend() is isTime. Pass true when the values are milliseconds and k6 will format them as time in the terminal output rather than raw numbers.
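To see why a percentile says more than an average, here is a minimal sketch of a nearest-rank percentile over made-up checkout durations. The percentile helper is hypothetical and is not k6's implementation; k6 may interpolate between samples, so its exact figures can differ.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Made-up checkout durations in milliseconds: nine quick checkouts and one very slow one.
const durations = [220, 240, 250, 260, 270, 280, 300, 310, 330, 4800];

const average = durations.reduce((sum, ms) => sum + ms, 0) / durations.length;
const median = percentile(durations, 50);
const p95 = percentile(durations, 95);

console.log({ average, median, p95 });
```

The median stays at 270ms, the average is pulled up to 726ms by a single outlier, and p95 lands on the 4800ms outlier itself. A Trend keeps the whole distribution, so a threshold can target whichever of these views matches the question being asked.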

Rate measures the fraction of recorded values that were successful. Use it when you want a success or failure percentage. Checkout success is a Rate because "82% of checkouts completed" is a statement that maps to a business SLO. "134 checkouts completed" is a count that requires context to interpret, context that changes depending on how many VUs ran and for how long.

const checkoutSuccess = new Rate('boutique_checkout_success');

// record it — true = success, false = failure
checkoutSuccess.add(ok);

// threshold against it
"boutique_checkout_success": ["rate>0.80"]

Counter accumulates a total. Use it when you want an absolute count of something. Cart errors are a Counter rather than a Rate because even a low error rate can represent a large absolute number of failures at high VU counts, and in a real business, each cart error is a customer who couldn't buy something. A Rate tells you the proportion; a Counter tells you the magnitude.

const cartErrors = new Counter('boutique_cart_errors');

// record it
if (!cartOk) {
  cartErrors.add(1);
  return;
}

Gauge records the current value of something at the moment it's called. Unlike Trend, it doesn't accumulate a distribution; it keeps the most recent reading. Use it for point-in-time state: how many sessions are active right now, what's the current queue depth, is a feature flag on or off. In a test context it's less common than the others, but it's the right tool when you care about instantaneous state rather than aggregate behavior.

const activeSessions = new Gauge('boutique_active_sessions');

// cookiesForURL() returns an object keyed by cookie name, so count its keys
activeSessions.add(Object.keys(http.cookieJar().cookiesForURL(BASE_URL)).length > 0 ? 1 : 0);

The practical summary: percentile distribution → Trend. Success/failure percentage → Rate. Running total → Counter. Point-in-time reading → Gauge.
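The difference between proportion, magnitude, and latest reading is easiest to see side by side. The classes below are plain-JavaScript stand-ins that model only the aggregation behavior; they are not the k6 classes, and the failure pattern in simulate is invented for illustration.

```javascript
// Stand-in for Rate: fraction of added values that were truthy.
class RateModel {
  constructor() { this.passes = 0; this.total = 0; }
  add(ok) { this.total += 1; if (ok) this.passes += 1; }
  get rate() { return this.total === 0 ? 0 : this.passes / this.total; }
}

// Stand-in for Counter: running sum of everything added.
class CounterModel {
  constructor() { this.count = 0; }
  add(n) { this.count += n; }
}

// Stand-in for Gauge: only the latest reading survives.
class GaugeModel {
  constructor() { this.value = undefined; }
  add(v) { this.value = v; }
}

// Run a fixed number of attempts where every failEvery-th attempt fails.
function simulate(attempts, failEvery) {
  const success = new RateModel();
  const errors = new CounterModel();
  const lastAttempt = new GaugeModel();
  for (let i = 1; i <= attempts; i++) {
    const ok = i % failEvery !== 0;
    success.add(ok);
    if (!ok) errors.add(1);
    lastAttempt.add(i);
  }
  return { rate: success.rate, errors: errors.count, last: lastAttempt.value };
}

const small = simulate(100, 50);
const large = simulate(10000, 50);
console.log({ small, large });
```

Both runs report the same 98% success rate, but the error count grows from 2 to 200, and the gauge only ever remembers the final attempt number.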

All four stream to Grafana Cloud as Prometheus time series when you run with k6 cloud run. The names you define in code become the series names. You query them in Grafana exactly like any other metric — rate(boutique_cart_errors[5m]), histogram_quantile(0.95, boutique_checkout_duration). Your load test data lives in the same datasource as your infrastructure metrics, with the same query language and the same alerting system.

The k6 Browser module and Web Vitals

The load test results from part 2 included a CLS score of 0.117, just over the 0.10 "good" threshold. To understand that number you need to know what the browser module is measuring and why it's different from everything else in the test suite.

The browser module runs a real Chromium instance. Not a simulated HTTP client, it's an actual browser, rendering pages, executing JavaScript, loading images, painting layout. Web Vitals are measurements taken from inside that rendering process:

LCP (Largest Contentful Paint): when did the largest visible element finish rendering? Measures perceived load speed.
FCP (First Contentful Paint): when did any content first appear? Measures how quickly the page starts showing something.
TTFB (Time to First Byte): how long before the browser received the first byte of the response? Measures server and network responsiveness.
CLS (Cumulative Layout Shift): how much did the page layout move around after initial render? Measures visual stability.

None of these are measurable from the HTTP layer. An HTTP client can tell you the server responded in 80ms. It can't tell you the user saw a blank white screen for 1.2 seconds while JavaScript parsed, or that the page jumped when an image loaded late and pushed all the text down.
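Google publishes "good" and "poor" boundaries for each of these metrics, and a reading can be bucketed against them in a few lines. In this sketch the classify helper is hypothetical, and the sample readings are the p(75) values from the part 2 run reported later in this post.

```javascript
// Google's published boundaries: [upper bound of "good", upper bound of "needs improvement"].
const WEB_VITAL_BOUNDS = {
  lcp: [2500, 4000], // milliseconds
  fcp: [1800, 3000], // milliseconds
  ttfb: [800, 1800], // milliseconds
  cls: [0.1, 0.25], // unitless score
};

// Hypothetical helper: bucket one reading into Google's three categories.
function classify(metric, value) {
  const [good, needsImprovement] = WEB_VITAL_BOUNDS[metric];
  if (value <= good) return 'good';
  if (value <= needsImprovement) return 'needs improvement';
  return 'poor';
}

// p(75) readings from the part 2 run.
const readings = { lcp: 380, fcp: 290, ttfb: 42, cls: 0.121 };

const report = Object.fromEntries(
  Object.entries(readings).map(([metric, value]) => [metric, classify(metric, value)])
);
console.log(report);
```

The "good" boundaries here are the same numbers used as k6 thresholds in the browser script, so a failed threshold corresponds to leaving the "good" bucket.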

The entry point is separate from the HTTP tests. Browser scenarios need a Chromium binary on the runner and cost far more per VU than protocol-level scenarios, so giving them their own script keeps the HTTP suite lightweight.

// src/browser.js
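import { browser } from 'k6/browser';
import { sleep }   from 'k6';
// k6's built-in check() does not await async predicates; the k6-utils version does
import { check }   from 'https://jslib.k6.io/k6-utils/1.5.0/index.js';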

export const options = {
  scenarios: {
    browser_smoke: {
      executor:        'per-vu-iterations',
      options:         { browser: { type: 'chromium' } },
      vus:             1,
      iterations:      3,
      gracefulStop:    '30s',
    },
  },
  thresholds: {
    browser_web_vital_lcp:  ['p(75)<2500'],
    browser_web_vital_fcp:  ['p(75)<1800'],
    browser_web_vital_ttfb: ['p(75)<800'],
    browser_web_vital_cls:  ['p(75)<0.10'],
  },
};

export default async function () {
  const page = await browser.newPage();

  try {
    await page.goto('http://10.4.20.2/');

    await check(page, {
      'page title present': async () => (await page.title()).length > 0,
      'shows Hot Products':  async () => {
        const body = await page.content();
        return body.includes('Hot Products');
      },
    });

    sleep(1);
  } finally {
    await page.close();
  }
}

Three implementation details worth being explicit about.

async/await throughout. The browser API is Promise-based. page.title() returns a Promise, not a string. Without await, the check inspects the Promise object instead of the title: a Promise has no length property, so page.title().length is undefined, and any truthiness test on the Promise itself passes every time. Either way the check measures nothing about the page. This is the bug from part 2. Every browser API call needs to be awaited, including page.goto(), page.content(), and page.title().

try/finally around page operations. If a check throws, the finally block closes the page regardless. Without it, failed iterations leak Chromium instances and the test eventually exhausts memory. This isn't optional defensive programming — it's required for browser tests to be reliable.

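No group() calls in browser scenarios. group() does not work reliably with async functions, a long-standing k6 limitation, so work done inside a browser step is not attributed to the group as you'd expect. Use named checks for step-level visibility instead.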
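The first detail is easy to reproduce outside k6. This sketch runs in plain JavaScript; fakeTitle is a hypothetical stand-in for an async browser call such as page.title().

```javascript
// Hypothetical stand-in for an async browser API call.
async function fakeTitle() {
  return 'Online Boutique';
}

const unawaited = fakeTitle(); // a Promise, not a string

const lengthWithoutAwait = unawaited.length; // undefined: Promises have no length property
const truthyWithoutAwait = Boolean(unawaited); // true: any object is truthy

console.log({ lengthWithoutAwait, truthyWithoutAwait });

// With await, the same call yields the real string and the comparison means something.
async function checkTitle() {
  const title = await fakeTitle();
  return title.length > 0;
}

checkTitle().then((ok) => console.log({ awaitedCheck: ok }));
```

Neither un-awaited expression says anything about the page: one compares undefined against a number, the other is true for every Promise, including one that will later reject.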

What CLS 0.117 actually means

The part 2 results showed:

browser_web_vital_lcp:  avg=335ms  p(75)=380ms   ✓
browser_web_vital_fcp:  avg=255ms  p(75)=290ms   ✓
browser_web_vital_ttfb: avg=36ms   p(75)=42ms    ✓
browser_web_vital_cls:  avg=0.117  p(75)=0.121   ✗

LCP, FCP, and TTFB are well inside their thresholds. TTFB at 36ms reflects a local network with no physical distance between test runner and server, but also a frontend service responding promptly. Nothing to fix there.

CLS at 0.117 failed the threshold. CLS accumulates a score each time visible content shifts position after the initial render. Each shift is scored as the fraction of the viewport affected multiplied by the distance the content moved, expressed as a fraction of the viewport. A score of 0 means nothing shifted. Google treats 0.10 as the upper bound of "good"; anything above it, up to 0.25, counts as "needs improvement."
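The scoring rule can be worked through with numbers. This is a simplified sketch: the shift values are hypothetical, chosen only to land near the observed score, and the real metric groups shifts into session windows and reports the largest window rather than a plain sum. For a single shift the two agree.

```javascript
// Score for one layout shift: impact fraction (share of the viewport touched by the
// shifting content) times distance fraction (how far it moved relative to the viewport).
function layoutShiftScore(impactFraction, distanceFraction) {
  return impactFraction * distanceFraction;
}

// Simplified total: the sum of the individual shift scores.
function cumulativeLayoutShift(shifts) {
  return shifts.reduce((sum, s) => sum + layoutShiftScore(s.impact, s.distance), 0);
}

// Hypothetical shift: late-loading images affect 60% of the viewport and push
// content down by 19.5% of the viewport height.
const shifts = [{ impact: 0.6, distance: 0.195 }];

const cls = cumulativeLayoutShift(shifts);
const exceedsGood = cls > 0.1;

console.log(cls.toFixed(3), exceedsGood);
```

One moderately sized shift is enough to cross the 0.10 boundary, which is why a single unsized image grid can fail the threshold on an otherwise fast page.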

On the Online Boutique homepage, the most likely cause is the product image grid. The browser paints the page structure (nav, heading, product card containers) before the images have loaded. When the images arrive, they push the surrounding layout down. The browser records that shift and adds it to the CLS score.

The fix is to tell the browser how much space each image will occupy before it loads. Explicit width and height attributes on <img> tags, or a CSS aspect-ratio declaration on the container, lets the browser reserve the right amount of space during initial layout. The images load into that space without causing a shift.

<!-- causes layout shift — browser doesn't know the image dimensions -->
<img src="/static/img/products/sunglasses.jpg" />

<!-- no layout shift — browser reserves space before the image loads -->
<img src="/static/img/products/sunglasses.jpg" width="320" height="320" />

The Online Boutique frontend doesn't do this. The product images load asynchronously into unsized containers and shift the layout on arrival. A 0.117 CLS score won't meaningfully affect search rankings on its own, but it is a real user experience problem: content jumps while someone is trying to read it. It's exactly the class of issue that HTTP testing never surfaces, because HTTP testing doesn't render anything.

What's next

Part 4 closes the loop on the unified observability story that started in part 1. k6 run is a developer workflow. k6 cloud run streams results into Grafana Cloud in real time. k6 Studio is a visual test editor that generates scripts without writing code. And synthetic monitoring turns the same scripts you've been running locally into scheduled checks from global probe locations: the same code, running permanently, alerting when production degrades.

The shift from load test to production monitor is what makes k6 different from most testing tools. Part 4 is about making that shift concrete.

#k6 #Grafana #LoadTesting #WebVitals #PerformanceTesting #Observability #SRE #Kubernetes
