📦 Code: github.com/USER/video-qoe-starter — replace before publishing.
TL;DR
We'll instrument an HLS.js player to emit the three QoE metrics that actually correlate with viewer abandonment — startup time, rebuffering ratio, and playback failure rate — ship them to a tiny Fastify endpoint, store them in SQLite, and render a dashboard. End to end in one file each: a React component, a Node endpoint, and a SQL view.
What we're building
A minimum-viable telemetry rig that answers the three questions every video team gets asked the day after launch:
- How long is it taking for a video to start playing?
- How often is the player stalling mid-playback?
- How often is playback failing outright?
We'll use HLS.js 1.6.x (the current stable line, with LL-HLS support), a Node 22 + Fastify backend, and SQLite for storage. No queue, no warehouse, no Grafana. The point is to ship a working baseline; you can swap parts later.
1. Set up the project 🛠️
mkdir video-qoe-starter && cd video-qoe-starter
npm init -y
npm install hls.js fastify better-sqlite3
npm install -D typescript @types/node @types/better-sqlite3
mkdir client server
Your package.json scripts block:
{
"scripts": {
"server": "node --experimental-strip-types server/index.ts",
"build": "tsc"
}
}
💡 Tip: Node 22 ships with --experimental-strip-types for TypeScript files, so you can run .ts files directly without ts-node for prototypes like this.
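The scripts block also calls tsc for `npm run build`, which needs a tsconfig.json. A minimal one that fits this layout might look like the following — the target/module choices are illustrative, not prescribed by the tutorial:

```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "strict": true,
    "noEmit": true,
    "jsx": "react-jsx",
    "lib": ["ES2022", "DOM"]
  },
  "include": ["client", "server"]
}
```

With noEmit, tsc acts purely as a type checker; Node's type stripping handles execution.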
2. Instrument the player 🎯
The events we need are standard: HLS.js emits MANIFEST_PARSED, MEDIA_ATTACHED, FRAG_BUFFERED, and ERROR, and the video element emits the usual HTML media events (play, waiting, playing, ended) — so this instrumentation pattern ports to any modern player with minor renaming.
// client/qoe.ts
import Hls, { Events, ErrorData } from 'hls.js';
type QoeSession = {
session_id: string;
video_id: string;
player: string;
user_agent: string;
startup_time_ms: number | null;
total_watch_time_ms: number;
total_stall_time_ms: number;
rebuffer_count: number;
error_codes: string[];
};
export function instrument(video: HTMLVideoElement, hls: Hls, videoId: string) {
const session: QoeSession = {
session_id: crypto.randomUUID(),
video_id: videoId,
player: `hls.js@${Hls.version}`,
user_agent: navigator.userAgent,
startup_time_ms: null,
total_watch_time_ms: 0,
total_stall_time_ms: 0,
rebuffer_count: 0,
error_codes: [],
};
let playRequestedAt: number | null = null;
let lastPlayingAt: number | null = null;
let stallStartedAt: number | null = null;
let firstPlayingFired = false;
video.addEventListener('play', () => {
if (playRequestedAt === null) playRequestedAt = performance.now();
});
video.addEventListener('playing', () => {
const now = performance.now();
if (!firstPlayingFired && playRequestedAt !== null) {
session.startup_time_ms = now - playRequestedAt;
firstPlayingFired = true;
}
if (stallStartedAt !== null) {
session.total_stall_time_ms += now - stallStartedAt;
stallStartedAt = null;
}
lastPlayingAt = now;
});
video.addEventListener('waiting', () => {
// 'waiting' is the rebuffer signal — buffer underrun, fetching more segments.
if (firstPlayingFired) {
session.rebuffer_count += 1;
stallStartedAt = performance.now();
}
});
video.addEventListener('pause', () => {
if (lastPlayingAt !== null) {
session.total_watch_time_ms += performance.now() - lastPlayingAt;
lastPlayingAt = null;
}
});
hls.on(Events.ERROR, (_, data: ErrorData) => {
session.error_codes.push(`${data.type}:${data.details}`);
});
const flush = () => {
if (lastPlayingAt !== null) {
session.total_watch_time_ms += performance.now() - lastPlayingAt;
lastPlayingAt = null;
}
// Wrap the payload in a Blob with an explicit JSON content type —
// sendBeacon posts bare strings as text/plain, which Fastify's default
// JSON body parser rejects with a 415.
const blob = new Blob([JSON.stringify(session)], { type: 'application/json' });
navigator.sendBeacon('/qoe', blob);
};
// Flush on tab close, route change, or page unload.
window.addEventListener('pagehide', flush);
document.addEventListener('visibilitychange', () => {
if (document.visibilityState === 'hidden') flush();
});
return { session, flush };
}
A few decisions baked in here are worth calling out:
- navigator.sendBeacon is the right transport for telemetry on unload. A plain fetch can get aborted when the tab closes (unless you pass keepalive); sendBeacon is best-effort but fire-and-forget by design.
- The first playing event is the only honest signal for "first frame painted." loadedmetadata and canplay fire too early — they fire when the player could play, not when it did.
- waiting is the rebuffer signal, but only after the first playing. Before the first playing, waiting is part of startup.
⚠️ Note: Don't trust client-side derived metrics. Ship raw event data (start, playing, waiting, error timestamps) and compute rebuffer ratio server-side. We're shipping aggregates in this tutorial for simplicity — for production, do raw events.
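To make that note concrete, here is a sketch of what the server-side derivation could look like if the client shipped raw events instead of aggregates. The RawEvent shape and event names are hypothetical, not part of the tutorial's code:

```typescript
// Hypothetical raw-event payload: the client ships timestamped events,
// and the server derives the metrics so definitions can change later.
type RawEvent = { name: 'playing' | 'waiting' | 'error'; t: number }; // t in ms

// Derive rebuffer stats from an event stream: every 'waiting' after the
// first 'playing' opens a stall; the next 'playing' closes it.
export function deriveRebuffer(events: RawEvent[]) {
  let firstPlaying = false;
  let stallStart: number | null = null;
  let stallMs = 0;
  let count = 0;
  for (const e of events) {
    if (e.name === 'playing') {
      if (stallStart !== null) {
        stallMs += e.t - stallStart;
        stallStart = null;
      }
      firstPlaying = true;
    } else if (e.name === 'waiting' && firstPlaying) {
      count += 1;
      stallStart = e.t;
    }
  }
  return { rebuffer_count: count, total_stall_time_ms: stallMs };
}
```

If you later decide a sub-250ms stall shouldn't count as a rebuffer, you change this function — not the player code on a million devices.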
3. Wire the player to the instrumentation 🔌
// client/Player.tsx
import { useEffect, useRef } from 'react';
import Hls from 'hls.js';
import { instrument } from './qoe';
export function Player({ src, videoId }: { src: string; videoId: string }) {
const videoRef = useRef<HTMLVideoElement>(null);
useEffect(() => {
if (!videoRef.current) return;
const hls = new Hls({ enableWorker: true });
hls.loadSource(src);
hls.attachMedia(videoRef.current);
const { flush } = instrument(videoRef.current, hls, videoId);
return () => {
flush();
hls.destroy();
};
}, [src, videoId]);
return <video ref={videoRef} controls playsInline width="800" />;
}
That's the whole client side.
4. Build the ingestion endpoint 📥
The simplest thing that could possibly work — Fastify, JSON in, SQLite out:
// server/index.ts
import Fastify from 'fastify';
import Database from 'better-sqlite3';
const db = new Database('qoe.db');
db.exec(`
CREATE TABLE IF NOT EXISTS sessions (
session_id TEXT PRIMARY KEY,
video_id TEXT NOT NULL,
player TEXT,
user_agent TEXT,
startup_time_ms REAL,
total_watch_time_ms REAL,
total_stall_time_ms REAL,
rebuffer_count INTEGER,
error_codes TEXT,
created_at INTEGER NOT NULL
);
`);
const insert = db.prepare(`
INSERT OR REPLACE INTO sessions
(session_id, video_id, player, user_agent, startup_time_ms,
total_watch_time_ms, total_stall_time_ms, rebuffer_count, error_codes, created_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
`);
const app = Fastify({ logger: true });
app.post('/qoe', async (req, reply) => {
const s = req.body as Record<string, unknown>;
insert.run(
s.session_id, s.video_id, s.player, s.user_agent,
s.startup_time_ms ?? null,
s.total_watch_time_ms ?? 0,
s.total_stall_time_ms ?? 0,
s.rebuffer_count ?? 0,
JSON.stringify(s.error_codes ?? []),
Date.now(),
);
reply.send({ ok: true });
});
app.get('/metrics', async () => {
return db.prepare(`
SELECT
COUNT(*) as sessions,
AVG(startup_time_ms) as avg_startup_ms,
SUM(total_stall_time_ms) * 1.0 / NULLIF(SUM(total_watch_time_ms), 0) as rebuffer_ratio,
SUM(CASE WHEN error_codes != '[]' THEN 1 ELSE 0 END) * 1.0 / COUNT(*) as failure_rate
FROM sessions
WHERE created_at > unixepoch() * 1000 - 86400000
`).get();
});
app.listen({ port: 3000, host: '0.0.0.0' });
Run it:
npm run server
Expect:
{"level":30,"time":...,"msg":"Server listening at http://0.0.0.0:3000"}
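The endpoint above casts req.body without checking it, so a malformed beacon can produce a junk row. A minimal guard you could run before insert.run — a sketch only; Fastify's built-in route-level JSON schema validation is the more idiomatic fix:

```typescript
// Minimal shape check for the beacon payload. The QoePayload type mirrors
// the QoeSession the client sends; only the fields the insert needs are
// checked strictly.
type QoePayload = {
  session_id: string;
  video_id: string;
  startup_time_ms?: number | null;
  total_watch_time_ms?: number;
  total_stall_time_ms?: number;
  rebuffer_count?: number;
  error_codes?: string[];
};

export function isQoePayload(body: unknown): body is QoePayload {
  if (typeof body !== 'object' || body === null) return false;
  const b = body as Record<string, unknown>;
  return (
    typeof b.session_id === 'string' &&
    b.session_id.length > 0 &&
    typeof b.video_id === 'string' &&
    (b.startup_time_ms == null || typeof b.startup_time_ms === 'number') &&
    (b.error_codes === undefined || Array.isArray(b.error_codes))
  );
}
```

In the route handler you would reply with a 400 when the guard fails instead of inserting.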
5. The three queries that earn their keep 📊
Once you have data flowing, three queries answer most of the questions you'll be asked.
Startup time by day (the starter uses an average; the query comment notes the percentile caveat):
SELECT
date(created_at / 1000, 'unixepoch') as day,
COUNT(*) as sessions,
-- p50 / p95 in SQLite via window functions:
-- use a CTE in production; the simple AVG below is fine for a starter dashboard
AVG(startup_time_ms) as avg_startup_ms
FROM sessions
WHERE startup_time_ms IS NOT NULL
GROUP BY day
ORDER BY day DESC
LIMIT 14;
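SQLite has no built-in percentile aggregate, so one pragmatic option is to pull the column and compute p50/p95 in Node. A sketch using the nearest-rank method — the function name and placement are mine, not part of the tutorial's server:

```typescript
// Nearest-rank percentile over a list of startup times, e.g. the rows
// from db.prepare('SELECT startup_time_ms FROM sessions ...').all().
export function percentile(values: number[], p: number): number | null {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  // Nearest-rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

For a starter dashboard this is fine; at real volume you'd push percentiles into the database.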
Rebuffer ratio over the last 24h:
SELECT
SUM(total_stall_time_ms) * 1.0 / NULLIF(SUM(total_watch_time_ms), 0) as rebuffer_ratio
FROM sessions
WHERE created_at > unixepoch() * 1000 - 86400000;
You want this under 1%. Mux's blog calls 0.5% the strong-platform target; NPAW's writeup uses the 1% / 3% bands. Past 3%, viewers are noticing. Past 5%, they're leaving.
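If you want the dashboard to surface those bands directly, a tiny classifier works — the thresholds come from the sources cited at the end; the labels are mine:

```typescript
// Map a rebuffer ratio (stall time / watch time) onto the bands the post
// cites: <0.5% strong, <1% on target, <3% degraded, beyond that trouble.
export function rebufferBand(ratio: number): string {
  if (ratio < 0.005) return 'strong';
  if (ratio < 0.01) return 'on-target';
  if (ratio < 0.03) return 'degraded';
  if (ratio < 0.05) return 'viewers noticing';
  return 'viewers leaving';
}
```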
Failure rate by player version:
SELECT
player,
COUNT(*) as sessions,
SUM(CASE WHEN error_codes != '[]' THEN 1 ELSE 0 END) * 1.0 / COUNT(*) as failure_rate
FROM sessions
GROUP BY player
ORDER BY failure_rate DESC;
This is the query that catches the one specific combination of player + browser + device where playback is silently broken for 4% of your users. There's always one.
6. What we deliberately didn't build ⏭️
A starter rig isn't a production analytics platform. Things this tutorial skips:
- Real event-level storage. We ship session aggregates. A production system ships raw events so the metric definitions can change without re-instrumenting.
- Queueing. A POST per session is fine until your traffic doubles. Then you want a queue between the endpoint and the database.
- Alerting. Querying SQLite is not an alerting strategy.
- Cross-session viewer journeys. We track session, not viewer. If your product question is "did the same viewer hit a rebuffer last week and abandon this week?", you need a viewer-stable ID.
The decision tree is the point. If you're going to outgrow this rig in a quarter, that's the trigger to either invest in real telemetry infrastructure or pick up a managed analytics SDK that gives you a richer baseline by default.
Wrapping up
Most teams ship video before they ship video telemetry. That order is backwards. The cost of being blind to your QoE numbers is paid in customer trust — the support ticket you can't diagnose, the regression you don't notice for two weeks — and it compounds.
The rig above is the lowest-effort version that gets you the three core metrics. If it's enough, great. If you outgrow it, the move is either a real internal pipeline or a managed analytics product. Both are reasonable answers; either one is better than continuing to operate without data.
What's next
- Swap SQLite for ClickHouse or Postgres + TimescaleDB once you cross ~1M sessions/week.
- Add a quality_change event listener and track ABR down-switches per session.
- Add a small heatmap of rebuffer events by playback position — surprisingly useful for finding bad keyframe placement at scene cuts.
- If you outgrow rolling your own: any of the managed analytics SDKs (Mux Data, the analytics layer in api.video, or other comparable products) drop into this same player with one extra script tag. The metric definitions converge across products — you're paying for the storage and the dashboards, not for the metric.
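For the ABR down-switch idea, the counting logic can stay pure and testable; in the browser you would feed it the rendition heights hls.js reports on level switches. The wiring in the comment is a sketch, not code from this tutorial:

```typescript
// Count ABR down-switches from a sequence of rendition heights.
export function countDownSwitches(heights: number[]): number {
  let downs = 0;
  for (let i = 1; i < heights.length; i++) {
    if (heights[i] < heights[i - 1]) downs += 1;
  }
  return downs;
}

// Wiring sketch (browser side):
// hls.on(Hls.Events.LEVEL_SWITCHED, (_evt, data) => {
//   heights.push(hls.levels[data.level].height);
// });
```

The count would join the session payload as one more integer column.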
Sources used to ground the metric definitions:
- Mux, "Quality of Experience (QoE) in Video Streaming."
- Mux, "The Four Elements of Video Performance."
- Mux, "Video Analytics Series Part 1: Rebuffering."
- NPAW, "Exploring Video QoE & QoS."
- HLS.js release notes — github.com/video-dev/hls.js/releases (1.6.x current stable, 1.7.0-alpha pre-release).