The original version of MyAirports required a developer (me) to open DevTools, find the network request carrying flight data, copy the URL, figure out the headers, and paste everything into a config file. One airport at a time.
That's fine for ten airports. It doesn't scale to a thousand.
So I built a pipeline that does it automatically: give it an IATA code, and it finds the airport's website, discovers the flight data API, detects the scraping strategy, and writes the config code. No DevTools. No manual inspection. Here's how it works.
## The problem with airport websites
Every airport runs its own website. Some are operated by independent airport authorities. Some are run by national operators — Spain's AENA runs 46 airports on the same platform; Finland's Finavia runs 20. Some outsource to the same flight display software vendor.
This creates a few patterns worth detecting:
- Operator groups: airports that share the same backend API. Find one, and you've implicitly found all of them.
- Standalone airports: unique websites with unique APIs.
- Vendor software: airports using products like AirportInfo or similar. Same HTML structure, same endpoints, different domain.
- DOM-only: no discoverable API. Flight data is rendered server-side, so you scrape HTML.
The discovery pipeline needs to detect which pattern applies and generate the right config for each.
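In code, the four outcomes can be modeled as a small tagged config shape — a sketch with hypothetical field names, not the actual MyAirports schema:

```javascript
// Hypothetical strategy shapes — field names are illustrative,
// not the real MyAirports config schema.
const STRATEGY_EXAMPLES = {
  api:      { type: 'api', endpoint: 'https://example-airport.test/api/flights', headers: {} },
  operator: { type: 'operator', group: 'AENA', template: 'aena' },
  vendor:   { type: 'vendor', product: 'AirportInfo', endpointPattern: '/fids/flights' },
  dom:      { type: 'dom', rowSelector: 'tr.flight-row', columns: {} },
};

// The pipeline's job is to classify each airport into one of these.
function isKnownStrategy(type) {
  return Object.keys(STRATEGY_EXAMPLES).includes(type);
}
```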
## Step 1: Find the right URL
The first obstacle is simple: given an IATA code like ATH (Athens), what's the official airport website URL?
This sounds trivial but isn't. Search for "Athens airport" and you get aggregators — Skyscanner, TripAdvisor, Wikipedia — before you get the official site. The flight data lives on the official site, not on aggregators.
The solution is a purpose-built search query:
```javascript
// searcher.js
async function findAirportUrl(iata, airportName, country) {
  const query = `"${airportName}" official airport website flight departures arrivals -skyscanner -tripadvisor -rome2rio`;
  const results = await serperSearch(query);
  return filterAggregators(results);
}

function filterAggregators(results) {
  const blocklist = [
    'skyscanner', 'tripadvisor', 'rome2rio', 'kayak', 'expedia',
    'flightaware', 'flightradar24', 'wikipedia', 'wikitravel',
    'airportia', 'flightsfrom', 'airports-worldwide',
    // 30+ more
  ];
  return results.filter(r =>
    !blocklist.some(blocked => r.url.includes(blocked))
  );
}
```
The blocklist has grown to 30+ domains over time — whenever a false positive slips through and breaks a discovery, the domain gets added. The negative search terms in the query help, but they're not reliable enough on their own.
For airports where Serper finds nothing useful, the pipeline falls back to in-browser DuckDuckGo search with the same filtering logic.
## Step 2: Intercept network traffic
Once the URL is found, the pipeline launches a stealth Playwright browser and loads the page. Rather than looking at the HTML, it watches the network:
```javascript
// interceptor.js
async function discoverApi(page, targetUrl) {
  const candidates = [];
  page.on('response', async (response) => {
    const url = response.url();
    const contentType = response.headers()['content-type'] || '';
    if (!contentType.includes('application/json')) return;
    try {
      const body = await response.json();
      const score = scoreResponse(body, url);
      if (score > 0) {
        candidates.push({ url, headers: response.request().headers(), score });
      }
    } catch (_) {} // body wasn't valid JSON despite the header; ignore
  });
  await page.goto(targetUrl, { waitUntil: 'networkidle' });
  return candidates.sort((a, b) => b.score - a.score);
}

function scoreResponse(body, url) {
  let score = 0;
  const str = JSON.stringify(body).toLowerCase();
  if (str.includes('flight') || str.includes('flightnumber')) score += 3;
  if (str.includes('departure') || str.includes('arrival')) score += 3;
  if (str.includes('iata') || str.includes('icao')) score += 2;
  if (str.includes('status') || str.includes('gate')) score += 2;
  if (str.includes('airline') || str.includes('carrier')) score += 2;
  if (Array.isArray(body) && body.length > 0) score += 1;
  if (body.flights || body.departures || body.arrivals) score += 3;
  // URL indicators
  if (url.includes('flight') || url.includes('fids')) score += 2;
  if (url.includes('departure') || url.includes('arrival')) score += 2;
  return score;
}
```
The highest-scoring response is the flight data API. The request headers that triggered it — including any session cookies, CSRF tokens, or API keys — get captured at the same time. One browser session is all it takes; from then on, the system calls the API directly over HTTP without a browser.
This is the core idea behind MyAirports: use a browser once to discover, then never again.
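Replaying a captured request later is then an ordinary `fetch` call. Here's a minimal sketch of building that replay from a stored candidate — the `{ url, headers }` shape matches what the interceptor above captures, while the skip-list of browser-managed headers is an assumption:

```javascript
// Turn a captured candidate into plain fetch options.
// Drops headers the HTTP client manages itself.
function buildReplayRequest(candidate) {
  const skip = new Set(['host', 'content-length', 'connection']);
  const headers = {};
  for (const [name, value] of Object.entries(candidate.headers)) {
    if (!skip.has(name.toLowerCase())) headers[name] = value;
  }
  return { url: candidate.url, options: { method: 'GET', headers } };
}

// Usage — no browser involved:
//   const { url, options } = buildReplayRequest(saved);
//   const flights = await (await fetch(url, options)).json();
```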
## Step 3: WAF detection and FlareSolverr fallback
Before the browser even hits the target page, the pipeline checks for WAF protection:
```javascript
// waf-detector.js
async function detectWaf(page) {
  const title = await page.title();
  const content = await page.content();
  if (title.includes('Just a moment') || content.includes('cf-browser-verification')) {
    return 'cloudflare';
  }
  if (content.includes('incapsula') || content.includes('visid_incap')) {
    return 'incapsula';
  }
  return null;
}
```
If Cloudflare is detected, the pipeline hands off to FlareSolverr — a Docker sidecar that runs a real browser, solves the JS challenge, and returns the cookies:
```javascript
// flaresolverr.js
async function solveCloudflare(url) {
  const response = await fetch('http://localhost:8191/v1', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      cmd: 'request.get',
      url,
      maxTimeout: 60000
    })
  });
  const data = await response.json();
  return {
    cookies: data.solution.cookies,
    userAgent: data.solution.userAgent
  };
}
```
The solved cookies get stored in a Session table and reused until they expire, so the challenge-solving cost is paid once per airport, not on every scrape.
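The reuse check can be as simple as comparing the earliest cookie expiry against the clock before each scrape. A sketch, assuming the Session table stores the FlareSolverr cookie array as-is (its cookies carry an `expires` field in unix seconds, with `-1` for session cookies):

```javascript
// A stored session is reusable while every cookie is still valid.
// `expires` is unix seconds; -1 marks a session cookie with no expiry.
function isSessionValid(session, nowSeconds = Date.now() / 1000) {
  if (!session || !session.cookies || session.cookies.length === 0) return false;
  return session.cookies.every(
    c => c.expires === -1 || c.expires > nowSeconds
  );
}
```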
## Step 4: Detect operator groups
Some airports share infrastructure. Spain's AENA airports all use the same API at aena.es/es/flight-search. Once you've discovered one AENA airport, you've got the pattern for all 46.
The group detector looks for these shared patterns:
```javascript
// group-detector.js
const KNOWN_OPERATORS = [
  { name: 'AENA', urlPattern: /aena\.es/, airports: ['MAD', 'BCN', 'VLC', /* ... */] },
  { name: 'Finavia', urlPattern: /finavia\.fi/, airports: ['HEL', 'OUL', /* ... */] },
  { name: 'ADP', urlPattern: /parisaeroport\.fr/, airports: ['CDG', 'ORY'] },
];

function detectOperatorGroup(discoveredUrl) {
  for (const op of KNOWN_OPERATORS) {
    if (op.urlPattern.test(discoveredUrl)) return op;
  }
  return null;
}
```
When a group match is found, the code generator writes configs for all airports in the group from a single template.
## Step 5: AI pattern learning for DOM-only airports
Some airports don't have a JSON API. Flight data is rendered in HTML tables or card layouts, with no network request to intercept.
The AI learner handles these:
```javascript
// ai-learn.js
async function learnPattern(url) {
  const page = await launchStealthBrowser();
  await page.goto(url, { waitUntil: 'networkidle', timeout: 18000 });
  const html = await page.content();
  const trimmed = trimHtml(html, 15000); // fits in context window
  const prompt = `
    This is an airport flight board HTML page.
    Identify CSS selectors for each flight row, flight number,
    destination/origin, scheduled time, gate, and status.
    Also list any status text values and what they mean.
    Return JSON: { rowSelector, columns: { flightNumber, destination, time, gate, status }, statusMappings }
  `;
  const result = await deepseekChat(prompt, trimmed);
  return JSON.parse(result);
}
```
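Model output isn't always valid JSON with the right shape, so it pays to validate before a config gets written. A minimal sketch — the key names come from the prompt above, and treating `gate` as optional is an assumption:

```javascript
// Check that the learned pattern has every field the scraper needs.
// `gate` is often absent from flight boards, so it's optional here.
function validateLearnedPattern(pattern) {
  const errors = [];
  if (!pattern.rowSelector) errors.push('missing rowSelector');
  const required = ['flightNumber', 'destination', 'time', 'status'];
  for (const key of required) {
    if (!pattern.columns || !pattern.columns[key]) {
      errors.push(`missing columns.${key}`);
    }
  }
  return { ok: errors.length === 0, errors };
}
```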
Successfully AI-learned airports so far:
| Airport | Row selector | Layout |
|---|---|---|
| GYD (Baku) | `a.flt_row` | Card with `.fl_col.*` columns |
| KTW (Katowice) | `div.timetable__row.flight-board__row` | Table-like rows |
| TLL (Tallinn) | `li.flights-list__item` | List card |
| DUS (Düsseldorf) | (table class detection) | Standard table |
Each of these would have taken 20–30 minutes of manual DevTools work. The AI learner does it in about 30 seconds.
## Step 6: Auto-generate config code
Once the pattern is known, the code generator writes the actual config:
```javascript
// code-generator.js
function generateAirportEntry(iata, name, url, strategy) {
  if (strategy.type === 'api') {
    return `
  ${iata}: {
    name: '${name.replace(/'/g, "\\'")}',
    url: '${url}',
    apiEndpoint: '${strategy.endpoint}',
    headers: ${JSON.stringify(strategy.headers, null, 6)},
  },`;
  }
}
```
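For ATH, the generated entry ends up looking roughly like this — the endpoint and header values below are illustrative, not the real Athens airport API:

```javascript
  ATH: {
    name: 'Athens International Airport',
    url: 'https://www.aia.gr',
    apiEndpoint: 'https://www.aia.gr/api/flights',  // illustrative
    headers: {
      'accept': 'application/json',
      'x-requested-with': 'XMLHttpRequest',
    },
  },
```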
For operator groups it generates a shared template and references:
```javascript
function generateOperatorEntry(operator, airports) {
  return `
const ${operator.name.toLowerCase()}Config = {
  apiBase: '${operator.apiBase}',
  headers: ${JSON.stringify(operator.headers, null, 2)},
};
${airports.map(a => `  ${a}: { ...${operator.name.toLowerCase()}Config, iata: '${a}' },`).join('\n')}
`;
}
```
## Running at scale: 1,300 airports
```bash
npm run auto-discover                    # process all airports
npm run auto-discover -- --country DE    # just one country
npm run auto-discover -- --skip 400      # resume from checkpoint
```
The tool reads 6,000+ entries from the OpenFlights world airports CSV, processes in batches of 10 with a 60-second timeout per airport, and writes configs directly to disk.
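The batching itself is straightforward. Here's a sketch of the skip-and-batch logic implied by the flags above — the function name is hypothetical, the batch size of 10 comes from the text:

```javascript
// Split the airport list into batches of `batchSize`, honouring a
// --skip offset so a crashed run can resume from a checkpoint.
function makeBatches(airports, { skip = 0, batchSize = 10 } = {}) {
  const remaining = airports.slice(skip);
  const batches = [];
  for (let i = 0; i < remaining.length; i += batchSize) {
    batches.push(remaining.slice(i, i + batchSize));
  }
  return batches;
}
```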
Results after one full run across 1,300 airports:
- 276 new airports discovered — APIs found, configs generated, immediately usable
- ~600 DOM-only airports — HTML scraping with multilingual status normalisation
- ~400 failures — heavy WAF, no official website, or page timeouts
- Operator groups detected: AENA (46), Finavia (20), ADP (3), and several smaller groups
Total processing time: ~4 hours on a single VPS. Compared to 20–30 minutes per airport manually, that's roughly 400 hours of DevTools work replaced by a shell command.
## What still breaks
The pipeline isn't perfect. Common failure modes:
- **Heavy bot protection:** Akamai Bot Manager and Imperva with device fingerprinting go beyond what FlareSolverr can handle. These remain manual.
- **Dynamic API tokens:** Some airports generate short-lived tokens that expire between scrapes. Discovery captures them, but they're invalid an hour later; these need re-discovery on every request, which is expensive.
- **iframe-embedded boards:** Some airports embed a third-party flight board in an iframe served from a different domain, and network interception doesn't always capture those requests correctly.
- **Non-English flight data:** A few airports publish only in their local language (Cyrillic, Arabic, CJK scripts). The status normaliser handles most of it, but field mapping still fails occasionally.
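The status normaliser is essentially a lookup table keyed on lowercased local-language status text. A sketch with a handful of example mappings — the production table covers far more languages and phrasings:

```javascript
// Map local-language status strings to a canonical vocabulary.
// Entries here are examples only; the real table is much larger.
const STATUS_MAP = {
  'departed': 'departed', 'odleciał': 'departed', 'вылетел': 'departed',
  'landed': 'landed', 'wylądował': 'landed', 'atterri': 'landed',
  'cancelled': 'cancelled', 'annulé': 'cancelled', 'anulowany': 'cancelled',
  'delayed': 'delayed', 'retrasado': 'delayed',
};

function normalizeStatus(raw) {
  const key = String(raw || '').trim().toLowerCase();
  return STATUS_MAP[key] || 'unknown';
}
```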
The auto-discovery pipeline turned a manual, one-by-one chore into a batch job. The 276 airports it found in one run would have taken months to add by hand. The failures are the interesting edge cases — and those are where the AI pattern learner earns its keep.
The API is free to try at myairports.online/developers. Free tier: 100 requests/day, no card required.