DEV Community

José Catalá

What 628,000 Flights Taught Me About Cancellations (And Why They're Harder to Track Than You Think)

Flight cancellations look simple from the outside: a flight was scheduled, it didn't happen, it shows "Cancelled." Done.

After processing 628,000+ flights across 1,300 airports, I can tell you it's not that simple. Cancellations are one of the messiest problems in airport data — and understanding them properly forced me to build detection infrastructure I didn't know I needed.

Here's what I learned.


The three types of cancellation (that most people treat as one)

Not all cancellations are the same, and the differences matter for both the traveller and the data pipeline.

Early cancellation: The airline cancels the flight hours or days before departure and updates the schedule. The flight board never shows it. From a scraping perspective, this flight simply doesn't appear. You can't detect it by watching the board — you'd need to compare against the original schedule, which you may not have.

Late cancellation: The flight is on the board with status "Scheduled" or "On time" and then, sometimes within 30 minutes of departure, flips to "Cancelled." This is the one passengers hate most, and it's the one your data pipeline needs to actively detect.

Ghost cancellation: The flight disappears from the board entirely without ever showing "Cancelled." The row was there in the previous scrape, it's gone now, and there's no explicit status change. This is the hardest one.

MyAirports encounters all three, and each needs a different detection strategy.
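To make the difference in available evidence concrete, here is a minimal sketch of how the three cases could be told apart from two consecutive observations. The function name `classifyCancellation` and the input shape (`inSchedule`, `previousStatus`, `currentStatus`) are assumptions for illustration, not the MyAirports implementation.

```javascript
// Hypothetical classifier: names and input shape are illustrative assumptions.
// previousStatus / currentStatus are normalized statuses from two consecutive
// scrapes, or null when the flight was not on the board in that scrape.
const FINISHED = new Set(['departed', 'arrived', 'cancelled']);

function classifyCancellation({ inSchedule, previousStatus, currentStatus }) {
  // Late: the board now explicitly says "cancelled" and did not before.
  if (currentStatus === 'cancelled' && previousStatus !== 'cancelled') return 'late';

  // Ghost: the flight was live on the board and has vanished without a status.
  const wasLive = previousStatus !== null && !FINISHED.has(previousStatus);
  if (wasLive && currentStatus === null) return 'ghost';

  // Early: the flight is in a published schedule but never reached the board.
  if (inSchedule && previousStatus === null && currentStatus === null) return 'early';

  return null; // no new cancellation evidence
}

console.log(classifyCancellation({ inSchedule: true, previousStatus: 'scheduled', currentStatus: null }));
// prints "ghost"
```

Note that the early case only works if you have a schedule to compare against, which is exactly the data the board alone doesn't give you.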


Ghost flights: the cancellation nobody tells you about

This was the most surprising thing I learned. A non-trivial number of airport websites simply remove a cancelled flight from their board rather than marking it as cancelled. It just vanishes.

If your scraper only looks at the current board state, you'll never see these cancellations. They happened, but the evidence is gone.

The fix is snapshot comparison:

// cancellation-detector.js
async function detectGhostCancellations(iata, direction) {
  const now = new Date();
  const windowMinutes = 90;

  // Flights due within the window that no scrape has seen for 15+ minutes
  const prevScrape = await db.flight.findMany({
    where: {
      airportIata: iata,
      direction,
      lastSeen: { lt: new Date(now.getTime() - 15 * 60 * 1000) }, // missing for at least 15 min
      scheduledTime: {
        gte: now,
        lt: new Date(now.getTime() + windowMinutes * 60 * 1000)
      },
      status: { not: 'arrived' }
    }
  });

  // Get what's on the board now
  const currentBoard = await scrapeAirport(iata, direction);
  const currentFlightNumbers = new Set(currentBoard.map(f => f.flightNumber));

  // Anything that was there and is now gone
  const disappeared = prevScrape.filter(f => 
    !currentFlightNumbers.has(f.flightNumber) &&
    f.status !== 'cancelled' // not already marked
  );

  return disappeared.map(f => ({
    ...f,
    status: 'cancelled',
    cancellationReason: 'ghost', // disappeared without notice
    detectedAt: now
  }));
}

The window matters. A flight that departed won't show on the board either — you only want to flag flights that disappeared before their scheduled time. The 15-minute "last seen" buffer prevents false positives from temporary scrape failures.
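The buffer covers a single failed scrape, but a board that stays empty or truncated for longer than 15 minutes would still produce false ghosts. One possible guard, sketched below, is to skip ghost detection when the current board looks implausibly small. The helper name `isBoardTrustworthy`, the idea of a "typical" board size, and the 50% threshold are all assumptions for illustration.

```javascript
// Hypothetical guard: name, inputs and the 0.5 ratio are illustrative assumptions.
// typicalCount would be something like the median board size for this airport
// and time of day; how you compute it is up to your pipeline.
function isBoardTrustworthy(currentCount, typicalCount, minRatio = 0.5) {
  if (currentCount === 0) return false;   // empty board: treat as a failed scrape
  if (typicalCount === 0) return true;    // no history to compare against
  return currentCount / typicalCount >= minRatio;
}

console.log(isBoardTrustworthy(35, 40)); // true
console.log(isBoardTrustworthy(10, 40)); // false
```

Inside detectGhostCancellations, this would sit right after the scrapeAirport call, returning an empty array when the board fails the check. Treating an empty board as a failure is conservative: a small airport can legitimately have nothing listed overnight, so the rule may need tuning per airport.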


Normalizing "Cancelled" across 30+ languages

Article 3 in this series covered status normalization in general. Cancellations specifically deserve attention because the variation is extreme.

A cancelled flight might show as:

| Language | Status strings |
| --- | --- |
| English | Cancelled, Canceled, CNCL, CNX, CXL |
| Spanish | Cancelado, Anulado, CANCELADO |
| German | Gestrichen, Annulliert, Abgesagt |
| French | Annulé, Supprimé |
| Italian | Cancellato, Soppresso |
| Portuguese | Cancelado, Suprimido |
| Russian | Отменён, Отменен |
| Arabic | ملغي, ملغاة |
| Turkish | İptal, İptal Edildi |
| Norwegian | Kansellert, Innstilt |

And then there are airline-specific codes with no natural language at all: CNX, CXL, CNCL, XLD, CX. Some airports emit only the code without any label.

The normalizer handles this in layers:

// status-normalizer.js

const CANCELLED_EXACT = new Set([
  // English
  'cancelled', 'canceled', 'cancellation', 'cncl', 'cnx', 'cxl', 'xld', 'cx',
  // Spanish
  'cancelado', 'anulado',
  // German
  'gestrichen', 'annulliert', 'abgesagt',
  // French
  'annulé', 'annule', 'supprimé', 'supprime',
  // Italian
  'cancellato', 'soppresso',
  // Portuguese
  'cancelado', 'suprimido',
  // Dutch
  'geannuleerd', 'geannuleerd vlucht',
  // Norwegian/Swedish/Danish
  'kansellert', 'innstilt', 'inställd', 'aflyst',
  // Finnish
  'peruutettu',
  // Russian
  'отменён', 'отменен', 'отменен рейс',
  // Turkish
  'iptal', 'iptal edildi',
  // Arabic (normalized, RTL stripped)
  'ملغي', 'ملغاة',
  // Chinese
  '取消', '已取消',
]);

const CANCELLED_CONTAINS = [
  'cancel', 'annul', 'gestr', 'suprim', 'peruut', 'innstil',
];

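function normalizeToCancelled(raw) {
  if (typeof raw !== 'string') return null; // missing or non-text status cell

  // NFC keeps accented letters composed; trimming happens after the strip so
  // "* Отменён *" does not keep stray spaces and still hits the exact set.
  const s = raw.normalize('NFC').toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '').replace(/\s+/g, ' ').trim();

  if (CANCELLED_EXACT.has(s)) return 'cancelled';
  if (CANCELLED_CONTAINS.some(pat => s.includes(pat))) return 'cancelled';

  return null; // not a cancellation
}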

The replace(/[^\p{L}\p{N}\s]/gu, '') strips punctuation and special characters using Unicode property escapes — critical for Arabic and Chinese status strings that sometimes arrive with RTL markers or extra punctuation.
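To see what that character class actually removes, here is the stripping step in isolation, run against a few awkward inputs. The helper name `cleanStatus` and the sample strings are mine, chosen for illustration.

```javascript
// Isolated version of the stripping step: lowercase, drop anything that is not
// a Unicode letter, number or whitespace, then trim.
function cleanStatus(raw) {
  return raw.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, '').trim();
}

const samples = [
  'CANCELLED!',            // trailing punctuation
  '\u200Fملغي\u200F',      // Arabic wrapped in right-to-left marks (U+200F)
  '【已取消】',             // Chinese inside CJK brackets
  'İptal',                 // Turkish dotted capital I
];

for (const s of samples) {
  console.log(JSON.stringify(cleanStatus(s)));
}
// prints "cancelled", "ملغي", "已取消", "iptal"
```

The Turkish case is worth a second look: lowercasing İ yields an i followed by a combining dot (U+0307), and because a combining mark is not a letter, the regex removes it and leaves plain "iptal", which matches the exact set.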


Tracking cancellations over time: the flights table

Every scraped flight goes into the flights table. When a cancellation is detected — either from an explicit status change or a ghost detection — the flight record gets updated:

// persistence.js
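async function saveFlights(iata, direction, flights) {
  for (const flight of flights) {
    const where = { flightNumber_scheduledTime_airportIata: {
      flightNumber: flight.flightNumber,
      scheduledTime: flight.scheduledTime,
      airportIata: iata,
    }};
    const now = new Date();

    // Stamp cancelledAt only the first time the cancellation is seen; a plain
    // upsert would overwrite it on every later scrape and skew the notice-period
    // and mass-cancellation calculations that depend on it.
    const existing = await prisma.flight.findUnique({ where, select: { cancelledAt: true } });
    const cancellation = flight.status === 'cancelled' && !existing?.cancelledAt
      ? { cancelledAt: now, cancellationReason: flight.cancellationReason || 'explicit' }
      : {};

    await prisma.flight.upsert({
      where,
      update: { status: flight.status, lastSeen: now, ...cancellation },
      create: { ...flight, firstSeen: now, lastSeen: now, ...cancellation },
    });
  }
}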

The cancelledAt timestamp and cancellationReason fields are what make the historical analysis possible later.


Mass cancellation detection: weather, strikes, and cascades

Individual cancellations are noise. Mass cancellations are a signal worth surfacing.

When an airport has 10+ cancellations in a short window, something structural is happening: weather, strike action, airport closure, ground stop, or an airline operational meltdown. The data pattern is distinctive:

// anomaly-detector.js
async function detectMassCancellations(iata, windowHours = 2) {
  const since = new Date(Date.now() - windowHours * 60 * 60 * 1000);

  const recentCancellations = await prisma.flight.count({
    where: {
      airportIata: iata,
      status: 'cancelled',
      cancelledAt: { gte: since }
    }
  });

  const baseline = await getBaselineCancellationRate(iata, windowHours);

  if (recentCancellations > baseline * 3) {
    return {
      type: 'mass_cancellation',
      airport: iata,
      count: recentCancellations,
      baseline,
      deviationFactor: recentCancellations / baseline,
      detectedAt: new Date()
    };
  }

  return null;
}

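async function getBaselineCancellationRate(iata, windowHours) {
  const dayOfWeek = new Date().getDay();

  const result = await prisma.$queryRaw`
    SELECT AVG(daily_count) as avg_count
    FROM (
      SELECT DATE(cancelled_at) as day, COUNT(*) as daily_count
      FROM flights
      WHERE airport_iata = ${iata}
        AND status = 'cancelled'
        AND EXTRACT(DOW FROM cancelled_at) = ${dayOfWeek}
        AND cancelled_at > NOW() - INTERVAL '30 days'
      GROUP BY DATE(cancelled_at)
    ) daily
  `;

  // AVG() is null when there is no history and is not a plain JS number
  // otherwise, so convert it explicitly. The query yields a per-day average;
  // scale it to the window so it is comparable with the windowed count.
  const avgDaily = Number(result[0]?.avg_count ?? 0);
  const expectedInWindow = avgDaily * (windowHours / 24);

  return Math.max(expectedInWindow, 1); // floor of 1 keeps the ratio finite
}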

The 3x deviation threshold is empirical: below it, normal day-to-day variation produces false positives; above it, the signal is consistently meaningful.
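The decision rule itself is easy to test in isolation. The sketch below combines the 3x factor with the "10+ cancellations" floor mentioned earlier; the helper name `isMassCancellation` and the choice to apply both conditions together are assumptions, not necessarily how the production detector is wired.

```javascript
// Hypothetical pure helper: combining the 3x factor with a minimum absolute
// count is an illustrative assumption.
function isMassCancellation(count, baseline, { factor = 3, minCount = 10 } = {}) {
  const safeBaseline = Math.max(baseline, 1); // avoid dividing by zero at quiet airports
  return {
    triggered: count >= minCount && count > safeBaseline * factor,
    deviationFactor: count / safeBaseline,
  };
}

console.log(isMassCancellation(14, 3).triggered); // true: 14 is above both 10 and 3 * 3
console.log(isMassCancellation(8, 2).triggered);  // false: 4x the baseline but under 10
```

The absolute floor matters for small airports, where a baseline near zero would otherwise let two or three cancellations look like a 3x spike.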


What the data actually shows

After accumulating 628,000+ flights, the cancellation patterns are clear:

Cancellation rate by time of day: Early morning flights (05:00–07:00) cancel at roughly 1.2x the rate of afternoon flights. This is not because airlines plan worse in the morning but because of cascades: an aircraft that arrives late the night before causes the early morning departure to cancel, and the disruption propagates forward through the day.

Cancellation rate by day of week: Mondays and Fridays have the highest cancellation rates, roughly 1.4x midweek. This matches the business travel calendar: high demand leaves less buffer for aircraft rotations.

The "cancellation cascade" pattern: When we detect a cancellation on a route (e.g., MAD→LHR at 06:30), the probability of a cancellation on the return (LHR→MAD, any time that day) jumps by roughly 40%. The likely mechanism is that the same aircraft and crew are often rostered for both legs. The cascade is real and detectable from the data alone.

Ghost cancellations by region: Ghost cancellations (flights that vanish without notice) are most common in South American and Southeast Asian airports, at roughly 3x the rate of European and North American airports. The likely cause is older CMS software that just deletes cancelled entries rather than updating their status.
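For the cascade figure, the calculation is a comparison of two conditional rates. Here is a minimal sketch, assuming one record per route and day with two booleans, and reading "jumps by roughly 40%" as a relative increase; both the record shape and that reading are assumptions.

```javascript
// Hypothetical record shape: { outboundCancelled: boolean, returnCancelled: boolean },
// one record per route per day.
function returnCancellationLift(routeDays) {
  const rate = (rows) =>
    rows.length === 0 ? 0 : rows.filter(r => r.returnCancelled).length / rows.length;

  const conditional = rate(routeDays.filter(r => r.outboundCancelled));  // outbound cancelled
  const baseline = rate(routeDays.filter(r => !r.outboundCancelled));    // outbound operated

  return {
    conditional,
    baseline,
    // Relative increase: 0.4 means the return is 40% more likely to cancel
    lift: baseline === 0 ? null : conditional / baseline - 1,
  };
}
```

With real data the two groups are very different in size, since most days have no outbound cancellation, so it's worth checking how many route-days sit behind the conditional rate before trusting the lift.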


Surfacing this in the API

The delay_stats table in the insights engine stores cancelled percentage per route per time bucket:

// insights/analyzers/delay-stats.js (relevant section)
const cancelledPct = bucket.totalFlights > 0
  ? (bucket.cancelledCount / bucket.totalFlights) * 100
  : 0;

await prisma.delayStat.upsert({
  where: { /* composite key */ },
  update: { cancelledPct, /* other fields */ },
  create: { cancelledPct, /* other fields */ },
});

This feeds directly into the delayRisk field on every flight in the API response:

{
  "flightNumber": "IB3163",
  "scheduledTime": "2026-04-18T06:30:00Z",
  "status": "scheduled",
  "insights": {
    "delayRisk": {
      "onTimePct": 68,
      "avgDelayMinutes": 22,
      "cancelledPct": 4.7,
      "sample": 143,
      "context": "IB to LHR, Fri morning"
    }
  }
}

A cancelledPct of 4.7% on a Friday morning Iberia flight to London isn't alarming in isolation — but if you're booking travel insurance or a tight connection, it's exactly the number you want.
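On the consuming side, a percentage is often easier to act on as odds. Here is a small sketch that turns the delayRisk object above into a one-line summary; the function name and the minimum sample size of 30 are assumptions, and only fields shown in the response above are used.

```javascript
// Hypothetical consumer of the response shape shown above.
// minSample = 30 is an illustrative cut-off, not an API rule.
function describeCancellationRisk(flight, minSample = 30) {
  const risk = flight.insights?.delayRisk;
  if (!risk || risk.sample < minSample) return 'not enough history';
  if (risk.cancelledPct <= 0) return `no cancellations in ${risk.sample} flights`;

  const oneIn = Math.round(100 / risk.cancelledPct);
  return `about 1 in ${oneIn} cancelled (${risk.context}, n=${risk.sample})`;
}

const example = {
  flightNumber: 'IB3163',
  insights: {
    delayRisk: { cancelledPct: 4.7, sample: 143, context: 'IB to LHR, Fri morning' },
  },
};

console.log(describeCancellationRisk(example));
// prints "about 1 in 21 cancelled (IB to LHR, Fri morning, n=143)"
```

Showing the sample size next to the odds keeps the number honest: 4.7% over 143 flights is a handful of cancellations, not a precise rate.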


The EU261 connection

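Under EU Regulation 261/2004, a cancellation notified fewer than 14 days before departure can entitle the passenger to compensation of €250–€600 depending on route distance, unless the airline can show extraordinary circumstances or offers re-routing within the regulation's time limits. The detection pipeline feeds directly into this.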

When a flight cancels in our system with cancelledAt set to less than 14 days before scheduledTime, the EU261 eligibility flag gets set:

// eu261.js
function checkEu261Eligibility(flight) {
  if (flight.status !== 'cancelled') return null;

  const noticeHours = (flight.scheduledTime - flight.cancelledAt) / (1000 * 60 * 60);
  const noticeDays = noticeHours / 24;

  if (noticeDays >= 14) return { eligible: false, reason: 'advance_notice' };

  const distanceKm = getRouteDistance(flight.originIata, flight.destIata);

  return {
    eligible: true,
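    // Distance bands are inclusive at 1,500 km and 3,500 km (Article 7)
    compensationEur: distanceKm <= 1500 ? 250 : distanceKm <= 3500 ? 400 : 600,
    noticeDays: Math.floor(noticeDays), // floor so 13.6 days never reports as 14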
    distanceKm
  };
}

The catch: cancelledAt is when we detected the cancellation, not necessarily when the airline officially notified passengers. For ghost cancellations — the ones that disappeared without notice — we mark cancellationReason: 'ghost' and flag the EU261 check as needing manual verification.
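A sketch of what that flagging step could look like is below. The function name `annotateEu261`, the field `needsManualVerification`, and the two-day margin are assumptions for illustration. The margin reflects the point above: detection can lag the airline's real notification, so the true notice period is at least as long as the computed one, and results close to the 14-day line could flip.

```javascript
// Hypothetical post-processing of the eligibility result; field names and the
// 2-day margin are illustrative assumptions.
function annotateEu261(flight, eligibility, marginDays = 2) {
  if (!eligibility || !eligibility.eligible) return eligibility;

  // Ghost cancellations have no explicit status change to anchor the timing.
  const isGhost = flight.cancellationReason === 'ghost';
  // Computed notice is a lower bound, so near 14 days the real answer may differ.
  const nearBoundary = eligibility.noticeDays >= 14 - marginDays;

  return { ...eligibility, needsManualVerification: isGhost || nearBoundary };
}

console.log(
  annotateEu261(
    { cancellationReason: 'ghost' },
    { eligible: true, compensationEur: 250, noticeDays: 0, distanceKm: 1246 }
  ).needsManualVerification
); // true
```

For a late cancellation detected hours before departure, neither condition applies and the result passes through with the flag set to false.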


What I'd do differently

Track the full status timeline. Right now, if a flight goes Scheduled → Delayed → Cancelled, I only have the final state and the timestamps. I should log every status transition. A flight that was delayed 4 times before finally cancelling tells a different story than one that cancelled cleanly.

Capture the board timestamp. Each scrape runs against live airport data, but the data is often cached on the airport side. I should record the "data as of" timestamp from the airport's API response, not just my scrape time.

Watch for reinstatements. Cancelled flights occasionally get reinstated — a replacement aircraft is found, the weather clears faster than expected. Right now, a reinstatement would just update the status from "cancelled" to "scheduled," but I don't surface this event explicitly.
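The first and third points share a fix: once every status change is logged, a reinstatement is just a transition out of "cancelled". A minimal in-memory sketch follows; the function name `recordStatus` and the event shape are assumptions, and a real version would write to a transitions table keyed by flight.

```javascript
// Hypothetical timeline logger; in practice this would persist to a table.
// Appends an entry only when the status actually changes.
function recordStatus(timeline, status, observedAt) {
  const last = timeline[timeline.length - 1];
  if (last && last.status === status) return null; // unchanged, nothing to log

  const from = last ? last.status : null;
  timeline.push({ status, observedAt });

  return { from, to: status, observedAt, reinstated: from === 'cancelled' };
}

const timeline = [];
recordStatus(timeline, 'scheduled', new Date('2026-04-18T04:00:00Z'));
recordStatus(timeline, 'delayed', new Date('2026-04-18T05:10:00Z'));
recordStatus(timeline, 'cancelled', new Date('2026-04-18T05:40:00Z'));
const event = recordStatus(timeline, 'scheduled', new Date('2026-04-18T06:05:00Z'));

console.log(event.reinstated, timeline.length); // true 4
```

Counting the "delayed" entries before a "cancelled" one would also give the "delayed 4 times before cancelling" story directly.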


Flight cancellations are a solved problem from the traveller's perspective: the airline's app tells you. But from the data perspective, they're a distributed, multilingual, inconsistent mess. It took me three months just to notice that ghost cancellations were happening.

The full cancellation API is available at myairports.online/developers. Free tier: 100 requests/day, no card required.
