José Catalá

Posted on Apr 6

Normalizing Flight Statuses Across 7 Languages: What I Learned Building a Global Airport API

#webdev #javascript #opensource #node

When you scrape flight data from airport websites across 85+ countries, you quickly discover that nobody agrees on anything. Not the data format, not the field names, not the status strings.

"Departed" is SAL, FLY, CER, AIR, DEP, DEPARTED, Departed, Отправлен, Väljunud, DESPEGADO, gestartet, or partito, depending on which airport website you're looking at.

Building MyAirports meant writing a status normalizer that handles all of these — and maps them to a single, predictable enum. Here's how it works and what surprised me along the way.

The target schema

Every flight the API returns has a status field from this set:

scheduled | boarding | departed | arrived | delayed | cancelled | unknown

Seven values. Simple, predictable, usable in any frontend without additional mapping. The hard part is getting there from the real world.
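One way to pin the schema down in code is a frozen list plus a membership check. This is only a minimal sketch: `FLIGHT_STATUSES` and `isFlightStatus` are illustrative names, not something taken from the MyAirports codebase.

```javascript
// Hypothetical names for illustration; the seven values come from the schema above.
const FLIGHT_STATUSES = Object.freeze([
  'scheduled', 'boarding', 'departed', 'arrived',
  'delayed', 'cancelled', 'unknown',
]);

// True only for one of the seven canonical values (case-sensitive).
function isFlightStatus(value) {
  return typeof value === 'string' && FLIGHT_STATUSES.includes(value);
}
```

A check like this is handy in tests for the normalizer itself, since anything it returns should pass `isFlightStatus`.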

Layer 1: Exact string matches

The first layer is the most straightforward — a lookup table of known status strings to their canonical values:

const EXACT = {
  // English
  'scheduled': 'scheduled',
  'on time': 'scheduled',
  'departed': 'departed',
  'in flight': 'departed',
  'arrived': 'arrived',
  'landed': 'arrived',
  'delayed': 'delayed',
  'cancelled': 'cancelled',
  'boarding': 'boarding',
  'gate open': 'boarding',

  // Spanish (AENA codes)
  'SAL': 'departed',  // salida
  'AIR': 'departed',  // en aire
  'FLY': 'departed',
  'CER': 'departed',  // cerrado
  'LND': 'arrived',   // aterrizó
  'FNL': 'arrived',   // final
  'IBK': 'arrived',   // in-block
  'BOR': 'boarding',  // embarcando
  'EMB': 'boarding',
  'ULL': 'boarding',
  'OPE': 'boarding',
  'EST': 'scheduled', // estimado
  'PRG': 'scheduled', // programado
  'RET': 'delayed',   // retraso
  'CNL': 'cancelled',

  // Russian
  'Отправлен': 'departed',
  'Прилетел': 'arrived',
  'Посадка': 'boarding',
  'Задержан': 'delayed',
  'Отменён': 'cancelled',

  // Estonian
  'Väljunud': 'departed',
  'Saabunud': 'arrived',
  'Pardal': 'boarding',

  // Finnish (Finavia)
  'Lähtenyt': 'departed',
  'Saapunut': 'arrived',
  'Nousussa': 'boarding',
};

This handles roughly 60-70% of real-world cases. The remainder needs more work.
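The table mixes lowercase English keys, uppercase codes, and capitalized words, so the lookup only works if keys and input agree on case. A minimal sketch of one way to do that, assuming both sides are trimmed, Unicode-normalized, and lowercased. `EXACT_SAMPLE`, `foldKey`, and `lookupExact` are hypothetical names, and the sample is a small subset of the table above.

```javascript
// A small subset of the EXACT table above, kept short for the example.
const EXACT_SAMPLE = {
  'on time': 'scheduled',
  'SAL': 'departed',
  'Отправлен': 'departed',
  'Väljunud': 'departed',
  'Saapunut': 'arrived',
};

// Hypothetical helper: trim, normalize Unicode composition, and lowercase.
function foldKey(text) {
  return text.trim().normalize('NFC').toLowerCase();
}

// Build the folded index once so each lookup is a single Map access.
const EXACT_INDEX = new Map(
  Object.entries(EXACT_SAMPLE).map(([key, status]) => [foldKey(key), status])
);

// Returns the canonical status, or undefined when there is no exact match.
function lookupExact(raw) {
  return EXACT_INDEX.get(foldKey(raw));
}
```

With this in place `lookupExact(' sal ')` and `lookupExact('VÄLJUNUD')` both resolve to `'departed'`, while a string like `'Delayed 00:45'` returns `undefined` and falls through to the next layer.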

Layer 2: Substring patterns

Some status strings embed dynamic information — actual times, gate numbers, gate closures. You can't match them exactly, but you can match them by pattern:

const SUBSTRING_PATTERNS = [
  // "Delayed 00:45"
  [/delay/i, 'delayed'],
  // "Cancelled", "Canceled", "Cancelado": anything containing "cancel"
  [/cancel/i, 'cancelled'],
  // AENA-style: "CERRADO" (closed), "PROGRAMADO" (scheduled)
  [/^cerr/i, 'departed'],
  [/^progr/i, 'scheduled'],
  // "Check-in open" → scheduled (not boarding yet)
  [/check.?in/i, 'scheduled'],
];
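The list is ordered, so applying it comes down to returning the status of the first pattern that matches. A minimal sketch, where `matchFirst` is a hypothetical helper name:

```javascript
// Same pairs as in the list above.
const SUBSTRING_PATTERNS = [
  [/delay/i, 'delayed'],
  [/cancel/i, 'cancelled'],
  [/^cerr/i, 'departed'],
  [/^progr/i, 'scheduled'],
  [/check.?in/i, 'scheduled'],
];

// Hypothetical helper: status of the first matching pattern, else undefined.
function matchFirst(patterns, text) {
  for (const [pattern, status] of patterns) {
    if (pattern.test(text)) return status;
  }
  return undefined;
}
```

Because the first match wins, a string such as "Check-in delayed" resolves to `'delayed'` rather than `'scheduled'`, so the order of the list is part of its behavior.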

Layer 3: Regex patterns for dynamic content

A handful of airports embed actual timestamps or qualifiers directly in the status string. Phoenix Sky Harbor is a good example: its status field for delayed departures looks like "Now 12:21 PM" (the actual departure time). That is a timestamp rather than a status, and it means the flight has departed.

const REGEX_PATTERNS = [
  // Phoenix: "Now 12:21 PM" → departed
  [/^now\s+\d/i, 'departed'],
  // Finavia: "Expected 14:35" → still scheduled
  [/^expected\s+\d/i, 'scheduled'],
  // Generic: a bare "14:35" is an actual time. It is mapped to arrived here,
  // although on a departures board the same string would mean departed.
  [/^\d{1,2}:\d{2}$/, 'arrived'],
];
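A bare time does not say whether the flight left or landed. One possible way around that, assuming the scraper knows which board it is reading, is to pass the board type in. The `board` parameter and the `classifyDynamic` name are hypothetical and not part of the normalizer described here.

```javascript
// Hypothetical sketch: `board` is assumed to be 'arrivals' or 'departures'.
function classifyDynamic(text, board) {
  const value = text.trim();
  if (/^now\s+\d/i.test(value)) return 'departed';        // "Now 12:21 PM"
  if (/^expected\s+\d/i.test(value)) return 'scheduled';  // "Expected 14:35"
  if (/^\d{1,2}:\d{2}$/.test(value)) {
    // A bare time is treated as the actual time for this board.
    return board === 'departures' ? 'departed' : 'arrived';
  }
  return undefined; // no dynamic pattern matched
}
```

Without the extra argument the only safe choices are a fixed guess, as in the list above, or `unknown`.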

Layer 4: Russian stem matching

Russian is morphologically rich — words inflect heavily based on grammatical case, number, and gender. "Посадка" (boarding) appears as "посадке", "посадкой", "посадку" in different sentence positions. Exact matching misses these.

The solution isn't a full NLP library, which would be overkill for five words. It's stem prefixes: the first four or five characters of a word are usually enough to identify it:

const RUSSIAN_STEMS = {
  'посад': 'boarding',   // posad- (посадка, посадке...)
  'отпра': 'departed',   // otpra- (отправлен, отправился...)
  'прил':  'arrived',    // pril- (прилетел, прилетела...)
  'задер': 'delayed',    // zader- (задержан, задержка...)
  'отмен': 'cancelled',  // otmen- (отменён, отменённый...)
};

This catches the common inflected forms. Status fields use a small vocabulary, so prefixes this short have not produced false positives.
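A minimal sketch of how the stem table might be applied, assuming the string is lowercased, split into words, and each word is tested for a stem prefix. `matchRussianStem` is a hypothetical name, and matching at the start of each word is an assumption of this sketch.

```javascript
const RUSSIAN_STEMS = {
  'посад': 'boarding',
  'отпра': 'departed',
  'прил': 'arrived',
  'задер': 'delayed',
  'отмен': 'cancelled',
};

// Hypothetical helper: split on non-letters and test each word's prefix.
function matchRussianStem(text) {
  const words = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
  for (const word of words) {
    for (const [stem, status] of Object.entries(RUSSIAN_STEMS)) {
      if (word.startsWith(stem)) return status;
    }
  }
  return undefined;
}
```

Testing the prefix per word, rather than searching anywhere in the string, keeps a four-letter stem like 'прил' from matching in the middle of a longer word.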

The object guard

One thing I didn't expect: some airport APIs return status as an object rather than a string:

{ "status": { "code": "DEP", "description": "Departed", "color": "#ff0000" } }

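The normalizer checks for this and extracts code or description before attempting any matching. Without this guard, the object reaches the string-matching code, where a call such as toLowerCase() throws a TypeError, or the value is coerced to "[object Object]" and quietly ends up as unknown.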
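A guard along these lines could look like the sketch below. `extractStatusText` is a hypothetical name, and preferring `code` over `description` is an assumption, since the order is not specified above.

```javascript
// Hypothetical helper: reduce whatever the airport returned to a plain string.
function extractStatusText(raw) {
  if (typeof raw === 'string') return raw;
  if (raw !== null && typeof raw === 'object') {
    // Prefer the short code, then fall back to the human-readable description.
    for (const field of ['code', 'description']) {
      if (typeof raw[field] === 'string' && raw[field].trim() !== '') {
        return raw[field];
      }
    }
  }
  return ''; // null, undefined, numbers, and objects without usable text
}
```

Returning an empty string for everything unusable means the later layers only ever see a string, and the empty case can be mapped to `unknown` in one place.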

AENA's 30+ status codes

AENA (the Spanish airport operator, covering all 22 Spanish airports) has the most elaborate status system I encountered. They use two- to three-character codes from their SIGER ground handling system:

FNL, LND, IBK → arrived
BOR, EMB, ULL, OPE, APE → boarding
SAL, AIR, FLY, CER, APA → departed
EST, PRG, INI, SCH, HOR → scheduled
RET, DEM, REM → delayed
CNL, DES, CAN → cancelled

Each code has a specific operational meaning (IBK = in-block, ULL = final boarding call, DEM = ground delay) that the normalizer collapses into the six canonical values other than unknown. The operational detail is lost, but that's the right tradeoff for a public API.
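Grouped by canonical status, the codes are easier to review than a flat lookup table. A sketch of that layout, inverted into a code-to-status map at load time. The grouping is copied from the list above, while the names `AENA_GROUPS` and `AENA_CODES` and the duplicate check are assumptions.

```javascript
// Groups copied from the list above; the table layout itself is an assumption.
const AENA_GROUPS = {
  arrived:   ['FNL', 'LND', 'IBK'],
  boarding:  ['BOR', 'EMB', 'ULL', 'OPE', 'APE'],
  departed:  ['SAL', 'AIR', 'FLY', 'CER', 'APA'],
  scheduled: ['EST', 'PRG', 'INI', 'SCH', 'HOR'],
  delayed:   ['RET', 'DEM', 'REM'],
  cancelled: ['CNL', 'DES', 'CAN'],
};

// Invert the groups into a code → status Map, failing fast on duplicates.
const AENA_CODES = new Map();
for (const [status, codes] of Object.entries(AENA_GROUPS)) {
  for (const code of codes) {
    if (AENA_CODES.has(code)) throw new Error(`Duplicate AENA code: ${code}`);
    AENA_CODES.set(code, status);
  }
}
```

The duplicate check guards against a code being listed under two statuses when new codes are added later.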

What "unknown" means in practice

The unknown bucket catches everything the normalizer can't confidently classify. In production, it appears for:

Empty strings or null values (some airports just don't return status for on-time flights)
Airport-specific codes that haven't been mapped yet
Status strings in languages I haven't configured (Arabic, Japanese, Chinese)

Returning unknown is far better than returning a wrong value. Frontends can show "—" for unknown statuses and users understand it means data isn't available.

Running the normalizer

The normalizer runs at the end of every scrape, after the field mapper has identified which JSON field contains the status. It's a pure function:

normalizeStatus(raw) // → 'scheduled' | 'boarding' | 'departed' | 'arrived' | 'delayed' | 'cancelled' | 'unknown'

After the object guard, matching always runs on a string. Output is always one of the seven values. No exceptions are thrown: unrecognized inputs return 'unknown'.
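Put together, the layers might compose roughly as follows. This is a condensed sketch with abbreviated tables. The order in which the layers run and the lowercase-folded keys are assumptions, not a copy of the production code.

```javascript
// Abbreviated tables; the full versions are the ones shown in the layers above.
const EXACT = new Map([
  ['on time', 'scheduled'], ['landed', 'arrived'], ['gate open', 'boarding'],
  ['sal', 'departed'], ['cnl', 'cancelled'], ['väljunud', 'departed'],
]);
const PATTERNS = [
  [/delay/i, 'delayed'],
  [/cancel/i, 'cancelled'],
  [/^now\s+\d/i, 'departed'],
  [/^expected\s+\d/i, 'scheduled'],
];
const RUSSIAN_STEMS = [
  ['посад', 'boarding'], ['отпра', 'departed'], ['прил', 'arrived'],
  ['задер', 'delayed'], ['отмен', 'cancelled'],
];

function normalizeStatus(raw) {
  // Object guard: pull a string out of { code, description } payloads.
  let text = raw;
  if (raw !== null && typeof raw === 'object') {
    text = raw.code ?? raw.description;
  }
  if (typeof text !== 'string') return 'unknown';

  const folded = text.trim().toLowerCase();
  if (folded === '') return 'unknown';

  // Layer 1: exact match on the folded string.
  const exact = EXACT.get(folded);
  if (exact) return exact;

  // Layers 2 and 3: ordered patterns, first match wins.
  for (const [pattern, status] of PATTERNS) {
    if (pattern.test(folded)) return status;
  }

  // Layer 4: Russian stems, matched at the start of each word.
  for (const word of folded.split(/[^\p{L}]+/u)) {
    for (const [stem, status] of RUSSIAN_STEMS) {
      if (word.startsWith(stem)) return status;
    }
  }

  return 'unknown';
}
```

Every path ends in a `return`, so `null`, numbers, empty strings, and unmapped languages all come back as `'unknown'` instead of throwing.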

The broader lesson

I went into this project expecting the hard part to be scraping. The really hard part turned out to be normalization — taking data from wildly inconsistent sources and making it uniformly useful.

Status normalization is one example. The same problem appears with timestamps (airports use their local timezone, not UTC, and the formats vary wildly), location fields (some put "MILAN, MALPENSA" in the IATA field), and airline codes (some use IATA codes, some use ICAO, some use full names).

The principle that worked: design a strict output schema first, then build the normalization layer to enforce it, and treat every edge case as an addition to the normalizer rather than a special case in the consumer.

The API is live and free to try at myairports.online/developers. Arrivals and departures for 1,000+ airports, statuses normalized to the seven values above.

Next in this series: how 628,000 accumulated flight records are turned into airline delay rankings, gate predictions, and airport busyness heatmaps.
