José Catalá

Posted on Apr 6

628,000 Flights Later: Building a Flight Analytics Engine on Accumulated Scrape Data

#node #postgres #javascript #api

Every time MyAirports scrapes a flight board, the flights are stored in PostgreSQL. What started as a caching mechanism turned into something more interesting: a dataset of real flight behaviour that can answer questions no static API can answer.

After accumulating 628,000+ flight records across 1,000+ airports, the data supports four categories of analytics:

Airline reliability rankings per airport (on-time rate, average delay, cancellation rate)
Delay pattern breakdowns by airline, destination, day of week, and time of day
Airport busyness heatmaps (7×24 grid: average flight volume by hour and day)
Route reliability stats with per-airline breakdown (LHR→JFK: which airline is most reliable?)

Here's how the analytics layer works.

The raw data

Each Flight record in PostgreSQL has:

flightNumber, airline (IATA), type (arrival/departure)
airport (IATA), origin, destination
scheduled, estimated, actual (timestamps, local timezone)
status (scheduled/boarding/departed/arrived/delayed/cancelled/unknown)
confidence, strategy, scrapedAt

The actual timestamp and status field are what make analytics possible. When a flight's status is arrived and actual is set, we know when it actually landed. The delay is simply actual - scheduled in minutes.

Two compute strategies: incremental and nightly

Running full analytics queries over 628,000 rows on every API request would be too slow. The system uses a two-track approach.

Incremental counters — updated after each scrape:

When a new batch of flights arrives, the system immediately updates lightweight counters: per-airline flight counts at that airport, per-terminal/gate flight counts. These are cheap to maintain and power fast lookups.

Nightly full recompute — runs at 2 AM UTC:

Once per day, the system runs a full pass over all historical data to recompute the richer analytics tables. This is where the delay stats, airline scorecards, and busyness grids are built. It's more expensive, but the data doesn't need to be real-time — yesterday's delay patterns are as useful as today's.

Airline reliability rankings

The query is straightforward in concept, but requires minimum thresholds to be useful:

SELECT
  airline_iata,
  COUNT(*) as total_flights,
  COUNT(*) FILTER (WHERE delay_minutes <= 15) * 100.0 / COUNT(*) as on_time_pct,
  AVG(delay_minutes) FILTER (WHERE delay_minutes > 0) as avg_delay_minutes,
  COUNT(*) FILTER (WHERE status = 'cancelled') * 100.0 / COUNT(*) as cancelled_pct
FROM flights
WHERE airport_iata = 'LHR'
  AND actual IS NOT NULL
  AND status IN ('arrived', 'departed')
GROUP BY airline_iata
HAVING COUNT(*) >= 10
ORDER BY on_time_pct DESC;

The HAVING COUNT(*) >= 10 is the important part. Without a minimum flight threshold, airlines that had one flight and it happened to be on time appear at the top with 100% on-time performance. Ten flights is a low bar, but it filters out pure noise.

The result for LHR looks like this:

{
  "rank": 1,
  "airlineIata": "LX",
  "airlineName": "Swiss International Air Lines",
  "onTimePct": 88.2,
  "avgDelayMinutes": 7.4,
  "cancelledPct": 0.3,
  "totalFlights": 2340
}

Delay patterns: finding when and why delays happen

The delay breakdown goes a level deeper than overall stats. It slices delay data by:

Day of week — Monday departures are often worse than Wednesday ones
Time bucket — morning (06-12), afternoon (12-18), evening (18-24), night (00-06)
Destination — some routes are systematically late
Airline — some carriers at a specific airport are consistently worse than their global average

The time bucket breakdown is particularly useful. At large hub airports, delays compound through the day: a slot disruption in the morning propagates through afternoon flights. The heatmap data makes this visible.

Airport busyness heatmap

The busyness grid answers "when is this airport busy?" — useful for scheduling, understanding delay risk by time, and just general situational awareness:

SELECT
  EXTRACT(DOW FROM scheduled) as day_of_week,  -- 0=Sun, 6=Sat
  EXTRACT(HOUR FROM scheduled) as hour,
  AVG(flight_count) as avg_flights
FROM (
  SELECT
    DATE_TRUNC('hour', scheduled) as bucket,
    EXTRACT(DOW FROM scheduled) as day_of_week,
    EXTRACT(HOUR FROM scheduled) as hour,
    COUNT(*) as flight_count
  FROM flights
  WHERE airport_iata = $1
  GROUP BY 1, 2, 3
) hourly
GROUP BY day_of_week, hour
ORDER BY day_of_week, hour;

The result is a 7×24 matrix of average hourly flight volumes, suitable for heatmap rendering in any charting library.

Gate and terminal predictions

The most interesting use of accumulated data is predicting which gate or terminal a flight will use before the airport announces it. Airlines tend to use the same gates consistently at their hub airports.

After each scrape, the system updates per-airline terminal and gate counters:

// After normalizing a batch of flights
for (const flight of flights) {
  if (flight.terminal) {
    incrementCounter(`${airport}:${airline}:terminal:${flight.terminal}`);
  }
  if (flight.gate) {
    incrementCounter(`${airport}:${airline}:gate:${flight.gate}`);
  }
}

When serving a flight that doesn't have a gate assigned yet, the API enriches it with the most common gate for that airline at that airport:

if (!flight.gate && insights?.terminal?.predicted) {
  flight.insights = {
    terminal: { predicted: insights.terminal.value, confidence: insights.terminal.confidence }
  };
}

The confidence value is the fraction of historical flights that used the predicted terminal. For BA at LHR, it's nearly always Terminal 5 — confidence approaches 1.0. For a code-share flight operated by a third party, confidence might be 0.3.

The route reliability endpoint

One API endpoint that required no special query design: route-level stats. Given an origin and destination, return reliability data for each airline that operates the route:

GET /api/routes/LHR/JFK/insights

{
  "origin": "LHR",
  "dest": "JFK",
  "route": {
    "totalFlights": 1240,
    "onTimePct": 78.3,
    "avgDelayMinutes": 16.5,
    "cancelledPct": 1.8,
    "topAirline": "BA",
    "topAirlineOnTimePct": 84.2
  },
  "airlines": [
    { "airlineIata": "BA", "onTimePct": 84.2, "totalFlights": 890 },
    { "airlineIata": "VS", "onTimePct": 71.4, "totalFlights": 350 }
  ]
}

This is data that's genuinely hard to find anywhere else. The big flight data providers have it, but it's locked behind enterprise pricing. The MyAirports version is built from first-party airport data, accumulated one scrape at a time.

What the data still can't do

A few honest caveats about 628,000 flights:

It's not uniform. Airports with many active users have more scrape history than airports discovered recently. LHR has thousands of records; a newly added regional airport might have dozens.

It's biased toward requested airports. The system scrapes when someone requests data. Popular airports are over-represented relative to their actual traffic volumes.

Delay calculations require actual timestamps. Many airport APIs only return scheduled times, not actual arrival/departure times. Those flights contribute to volume counts but not delay calculations.

The data is directionally useful — the patterns it reveals are real — but it's not a controlled dataset. It's accumulated opportunistically.

All three endpoints are available in the MyAirports API:

GET /api/airports/:iata/insights/airlines
GET /api/airports/:iata/insights/delays
GET /api/airports/:iata/insights/busyness
GET /api/routes/:origin/:dest/insights

Free tier at myairports.online/developers. No credit card required.

DEV Community