José Catalá

Posted on Apr 20

How I Built a Delay Risk Score for 628,000 Flights

#aviation #webdev #javascript #node

Every flight in the MyAirports API includes a delayRisk object. It looks like this:

{
  "flightNumber": "VY1234",
  "scheduledTime": "2026-04-21T07:15:00Z",
  "status": "scheduled",
  "insights": {
    "delayRisk": {
      "onTimePct": 61,
      "avgDelayMinutes": 28,
      "cancelledPct": 3.2,
      "sample": 89,
      "context": "VY MAD→BCN, Mon morning"
    }
  }
}

Three numbers. Looks simple. It's not.

Getting from raw scraped flight data to a reliable, per-flight delay prediction involved building a statistics pipeline I didn't know I needed — and making a series of judgment calls that turned out to matter a lot.

Here's how it works.

The core insight: route × time-of-day × day-of-week

A delay risk score is only meaningful if it compares like with like.

A Monday 07:00 Vueling flight from Madrid to Barcelona has a very different delay profile than a Friday 19:00 Vueling flight on the same route. Aggregating them together produces a number that's wrong for both.

The bucketing key I landed on is:

route (origin → dest) × airline × departure_hour_bucket × day_of_week

Where departure_hour_bucket is a 3-hour window: 00–03, 03–06, 06–09, 09–12, 12–15, 15–18, 18–21, 21–24.

This gives enough granularity to capture meaningful patterns without fragmenting the sample so much that each bucket has too few flights to be reliable.

The delay_stats table stores one row per bucket:

CREATE TABLE delay_stats (
  id           SERIAL PRIMARY KEY,
  origin_iata  CHAR(3)  NOT NULL,
  dest_iata    CHAR(3)  NOT NULL,
  airline_code VARCHAR(3) NOT NULL,
  hour_bucket  SMALLINT NOT NULL,  -- 0, 3, 6, 9, 12, 15, 18, 21
  day_of_week  SMALLINT NOT NULL,  -- 0=Sun, 6=Sat
  sample_size  INTEGER  NOT NULL,
  on_time_pct  NUMERIC(5,2),
  avg_delay_min NUMERIC(6,1),
  cancelled_pct NUMERIC(5,2),
  updated_at   TIMESTAMPTZ DEFAULT NOW(),
  UNIQUE (origin_iata, dest_iata, airline_code, hour_bucket, day_of_week)
);

What counts as "on time"

Before computing percentages, I had to define what "on time" actually means. The industry standard is within 15 minutes of scheduled departure. That's what I use.

// insights/analyzers/delay-stats.js

const ON_TIME_THRESHOLD_MINUTES = 15;

function classifyFlight(flight) {
  if (flight.status === 'cancelled') return 'cancelled';

  if (!flight.actualDepartureTime || !flight.scheduledTime) return null; // insufficient data

  const delayMinutes = (
    new Date(flight.actualDepartureTime) - new Date(flight.scheduledTime)
  ) / (1000 * 60);

  return {
    isOnTime: delayMinutes <= ON_TIME_THRESHOLD_MINUTES,
    delayMinutes: Math.max(0, delayMinutes) // negative = early, treat as 0
  };
}

A flight that departs 5 minutes early counts as on time (delay = 0). A flight that departs 16 minutes late is delayed. Cancelled flights go into their own bucket.

One thing that tripped me up early: I was computing delay as actual - scheduled and sometimes getting negative numbers because flights occasionally depart ahead of schedule. Treating those as 0-minute delays (rather than negative delays) keeps the average meaningful.

The pipeline: from raw flights to stats

The stats computation runs nightly as a batch job. It scans all completed flights from the previous 90 days and recomputes each bucket from scratch.

// insights/run-stats.js

async function computeDelayStats() {
  const cutoff = new Date(Date.now() - 90 * 24 * 60 * 60 * 1000);

  // Pull all completed flights with actual departure times
  const flights = await prisma.flight.findMany({
    where: {
      status: { in: ['departed', 'arrived', 'cancelled'] },
      scheduledTime: { gte: cutoff },
    },
    select: {
      originIata: true,
      destIata: true,
      airlineCode: true,
      scheduledTime: true,
      actualDepartureTime: true,
      status: true,
    }
  });

  // Group into buckets
  const buckets = new Map();

  for (const flight of flights) {
    const key = getBucketKey(flight);
    if (!key) continue;

    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(flight);
  }

  // Compute stats per bucket and upsert
  for (const [key, bucketFlights] of buckets) {
    const stats = computeBucketStats(bucketFlights);
    if (!stats) continue; // insufficient sample

    const { origin, dest, airline, hourBucket, dayOfWeek } = parseKey(key);

    await prisma.delayStat.upsert({
      where: { origin_dest_airline_hour_day: { 
        originIata: origin, destIata: dest, 
        airlineCode: airline, hourBucket, dayOfWeek 
      }},
      update: { ...stats, updatedAt: new Date() },
      create: { originIata: origin, destIata: dest, 
                airlineCode: airline, hourBucket, dayOfWeek, ...stats }
    });
  }
}

function getBucketKey(flight) {
  if (!flight.originIata || !flight.destIata || !flight.airlineCode) return null;

  const d = new Date(flight.scheduledTime);
  const hour = d.getUTCHours();
  const hourBucket = Math.floor(hour / 3) * 3; // round down to 0,3,6,9...
  const dayOfWeek = d.getUTCDay();

  return `${flight.originIata}:${flight.destIata}:${flight.airlineCode}:${hourBucket}:${dayOfWeek}`;
}

Computing the stats for a bucket

Once flights are grouped, computing the three metrics is straightforward:

const MIN_SAMPLE = 30; // minimum flights required for a reliable stat

function computeBucketStats(flights) {
  if (flights.length < MIN_SAMPLE) return null;

  const cancelled = flights.filter(f => f.status === 'cancelled');
  const departed = flights.filter(f => f.status !== 'cancelled' && f.actualDepartureTime);

  // cancelledPct includes all flights in the bucket denominator
  const cancelledPct = (cancelled.length / flights.length) * 100;

  if (departed.length === 0) {
    return { sampleSize: flights.length, onTimePct: null, avgDelayMin: null, cancelledPct };
  }

  const delays = departed.map(f => {
    const delayMs = new Date(f.actualDepartureTime) - new Date(f.scheduledTime);
    return Math.max(0, delayMs / (1000 * 60));
  });

  const onTimeCount = delays.filter(d => d <= ON_TIME_THRESHOLD_MINUTES).length;
  const onTimePct = (onTimeCount / departed.length) * 100;
  const avgDelayMin = delays.reduce((sum, d) => sum + d, 0) / delays.length;

  return {
    sampleSize: flights.length,
    onTimePct: Math.round(onTimePct * 100) / 100,
    avgDelayMin: Math.round(avgDelayMin * 10) / 10,
    cancelledPct: Math.round(cancelledPct * 100) / 100,
  };
}

The MIN_SAMPLE = 30 threshold is important. Below 30 flights, the percentages are too noisy to be useful — a single bad week skews everything. I return null for under-sampled buckets, and the API reflects that with a lower-confidence indicator.

The minimum sample problem

Thirty flights sounds easy, but it's not for thin routes. A regional airline operating 3 flights per week on a niche route will accumulate 30 samples in about 10 weeks — and if we're only keeping 90 days of history, we have a rolling window of roughly 36 observations at best.

For these routes I use a fallback hierarchy:

async function getDelayStats(origin, dest, airline, hourBucket, dayOfWeek) {
  // 1. Exact match: specific airline on this route/time
  let stats = await findStats({ origin, dest, airline, hourBucket, dayOfWeek });
  if (stats && stats.sampleSize >= MIN_SAMPLE) return { ...stats, confidence: 'high' };

  // 2. Any airline on this route/time
  stats = await findStats({ origin, dest, airline: null, hourBucket, dayOfWeek });
  if (stats && stats.sampleSize >= MIN_SAMPLE) return { ...stats, confidence: 'medium' };

  // 3. Any airline on this route, any time in this day bucket (morning/afternoon/evening)
  const dayPart = hourBucket < 12 ? 'morning' : hourBucket < 18 ? 'afternoon' : 'evening';
  stats = await findStats({ origin, dest, airline: null, dayPart, dayOfWeek });
  if (stats && stats.sampleSize >= MIN_SAMPLE) return { ...stats, confidence: 'low' };

  // 4. No data
  return null;
}

The confidence field surfaces in the API response so callers can decide how much weight to give the number. A high confidence onTimePct: 61 means something; a low confidence one is a rough prior at best.

Handling airline codes: the IATA/ICAO mess

Flight data comes from airport boards in different formats. Some airports use IATA airline codes (2-letter: IB, VY, FR), some use ICAO (3-letter: IBE, VLG, RYR), and some mix them depending on the flight.

The stats pipeline normalizes everything to IATA codes before bucketing. If a flight arrives with ICAO code IBE, it gets mapped to IB before the bucket key is computed:

// airline-normalizer.js
const ICAO_TO_IATA = {
  'IBE': 'IB',  // Iberia
  'VLG': 'VY',  // Vueling
  'RYR': 'FR',  // Ryanair
  'BAW': 'BA',  // British Airways
  'DLH': 'LH',  // Lufthansa
  // ... ~400 more
};

function normalizeAirlineCode(code) {
  if (!code) return null;
  if (code.length === 2) return code.toUpperCase(); // already IATA
  return ICAO_TO_IATA[code.toUpperCase()] ?? null;  // map or drop
}

Unknown airline codes get dropped (returned as null), which means those flights still feed into the route-level stats (fallback level 2) but not the airline-specific stats (level 1).

What the numbers actually look like

After running the pipeline across all accumulated data, some patterns stand out.

The morning variance is real. The 06–09 bucket has the highest avgDelayMin of any time window — roughly 1.4x the all-day average — but also the second-highest onTimePct. What's happening: most early morning flights are fine, but when they go wrong, they go badly wrong. The distribution is bimodal: short delays almost never, then a long tail of cascaded delays from overnight issues.

Fridays are not the worst day. Sunday evenings are. Flights completing weekend itineraries face the worst on-time performance across almost every route in the dataset. Monday mornings are second worst.

Low-cost carrier routes have lower onTimePct but lower avgDelayMin. Budget airlines cancel less (they have fewer connectivity options, so they push through more delays) but are more frequently slightly late. Legacy carriers cancel more but have tighter average delay windows when they do depart.

The 90-day rolling window matters more than I expected. Seasonal patterns mean that data from January is largely irrelevant for predicting August performance. I may eventually shrink the window to 60 days and weight recent flights more heavily.

Surfacing it in the API

When a flight is requested, the stats lookup runs against the delay_stats table:

// api/flights.js
async function enrichWithInsights(flight) {
  const d = new Date(flight.scheduledTime);
  const hourBucket = Math.floor(d.getUTCHours() / 3) * 3;
  const dayOfWeek = d.getUTCDay();
  const airline = normalizeAirlineCode(flight.airlineCode);

  const stats = await getDelayStats(
    flight.originIata,
    flight.destIata,
    airline,
    hourBucket,
    dayOfWeek
  );

  if (!stats) return { delayRisk: null };

  return {
    delayRisk: {
      onTimePct: stats.onTimePct !== null ? Math.round(stats.onTimePct) : null,
      avgDelayMinutes: stats.avgDelayMin !== null ? Math.round(stats.avgDelayMin) : null,
      cancelledPct: stats.cancelledPct !== null ? Math.round(stats.cancelledPct * 10) / 10 : null,
      sample: stats.sampleSize,
      confidence: stats.confidence,
      context: buildContextString(flight, dayOfWeek, hourBucket),
    }
  };
}

function buildContextString(flight, dayOfWeek, hourBucket) {
  const days = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'];
  const timeLabel = hourBucket < 12 ? 'morning' : hourBucket < 18 ? 'afternoon' : 'evening';
  return `${flight.airlineCode} ${flight.originIata}→${flight.destIata}, ${days[dayOfWeek]} ${timeLabel}`;
}

The context string makes the number interpretable at a glance — you see immediately that onTimePct: 61 applies to "VY MAD→BCN, Mon morning" rather than to Vueling in general.

What I'd do differently

Weight by recency. All flights in the 90-day window currently have equal weight. A flight from 85 days ago contributes equally to the stats as one from yesterday. Exponential decay weighting — where more recent flights count more — would make the score respond faster to airline operational changes.

Track the delay distribution, not just the average. avgDelayMinutes: 28 tells you the mean, but it hides a lot. A route with a 28-minute average could be "almost always 25–30 minutes late" or "usually on time but occasionally 4 hours late." The latter is much worse for connecting passengers. Storing the 90th percentile delay alongside the average would be far more useful.

Separate departure and arrival delay. Right now everything is departure-delay. But passengers care about arrival delay. A flight that departs 30 minutes late and lands 10 minutes late (because of a fast cruise and short taxi) is very different from one that lands 35 minutes late. Airport boards rarely surface arrival delay in real time, but when they do, it's worth tracking separately.

Per-aircraft-type stats. Narrow-body short-haul routes flown by a mix of A320s and B737s have meaningfully different delay profiles depending on the day's specific aircraft assignment. I don't have aircraft type data yet, but it's on the roadmap.

The delay risk score started as a single percentage. It's now a layered lookup across four levels of fallback, a batch pipeline that runs every night, and the most-queried field in the API. Getting it right was mostly a series of small decisions — the 15-minute threshold, the 3-hour buckets, the minimum sample of 30 — each of which turned out to matter more than I expected.

The full API, including delayRisk on every flight, is at myairports.online/developers. Free tier: 100 requests/day, no card required.

DEV Community