I scrape 200+ airport departure boards every few minutes. Here's every defence I've had to break through — and the ones that still slow me down.
Why Airport Websites Are a Scraper's Nightmare
Airport websites are an unusual target. They're not e-commerce sites trying to protect pricing data, and they're not news sites worried about plagiarism. Most of them just want to display departure boards to travellers. So you'd think they'd be easy to scrape.
They're not.
The problem is traffic patterns. An airport website gets hammered by bots — flight trackers, travel aggregators, airline apps, insurance tools, and journalists — all hitting the same handful of endpoints constantly. To survive, airports have layered in the same enterprise anti-bot stacks that protect banks and ticketing sites. They often don't even know exactly what's protecting them; it came bundled with their CDN contract.
Building MyAirports — a real-time flight data API covering 200+ airports — meant going through every layer. Here's what I found, roughly in order of how annoying each one is.
Layer 1: IP Rate Limiting
The most basic defence. Too many requests from one IP in a short window → 429 Too Many Requests or a silent block.
What it looks like: Your first few requests succeed. Then you start getting 429s, or worse, 200 responses with no flight data (the site is serving a decoy empty page rather than tipping you off with an error code).
What I did: Distribute requests across time. Each airport is on a polling interval calibrated to how often its data actually changes — a regional airport with 12 daily departures doesn't need to be hit every 30 seconds. Spreading the intervals out naturally reduces per-IP request rates.
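The interval calibration can be sketched as a small pure function. The numbers and names here are illustrative, not my production values — the idea is just that polling frequency scales with how often the board actually changes, clamped between a floor and a ceiling:

```javascript
// Sketch: derive a polling interval from expected traffic. A hub with
// hundreds of daily departures polls near the floor; a quiet regional
// airport polls far less often. Constants are illustrative.
function pollingIntervalMs(dailyDepartures, { floorMs = 30_000, ceilMs = 15 * 60_000 } = {}) {
  if (dailyDepartures <= 0) return ceilMs;
  // Spread the day's departures evenly, then poll a few times per expected change
  const avgGapMs = (24 * 60 * 60_000) / dailyDepartures;
  const target = avgGapMs / 4;
  return Math.min(ceilMs, Math.max(floorMs, Math.round(target)));
}
```

A 12-departure regional airport lands at the 15-minute ceiling, while a busy hub gets clamped to the 30-second floor.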
For airports where even polite rates get blocked, I route through a residential proxy pool. It's not free, but for the ~15% of airports where it's necessary, there's no other option.
The nuance: Silent 200s with empty data are worse than 429s. I added a validation step that checks response plausibility — if a departure board suddenly returns zero flights for a major hub at 09:00 on a Tuesday, that's not real data. The scraper logs it as a suspected block rather than storing empty results.
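The plausibility gate is roughly this shape (field names and thresholds are illustrative, not the production rule set):

```javascript
// Sketch of the plausibility check: an empty board for a large airport
// during operating hours is treated as a suspected block, not real data.
// Thresholds are illustrative assumptions.
function classifyResponse(flights, { dailyDepartures, hourLocal }) {
  const busyHours = hourLocal >= 6 && hourLocal <= 22;
  if (flights.length === 0 && dailyDepartures >= 50 && busyHours) {
    return 'suspected_block'; // log it, keep the previous snapshot
  }
  return 'ok'; // store the result
}
```

The key design choice: a suspected block never overwrites good data. The previous snapshot stays live until a plausible response comes back.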
Layer 2: User-Agent and Header Fingerprinting
Servers look at your request headers. A bare curl or Node fetch has obvious tells:
User-Agent: node-fetch/1.0
Accept: */*
Accept-Language: (missing)
Accept-Encoding: gzip, deflate
Real browsers send 15–20 headers in a specific order, with specific values that match the stated browser version.
What it looks like: Immediate 403, or redirect to a CAPTCHA page that your headless browser can't easily solve.
What I did: Use a real browser engine (Playwright with Chromium) rather than raw HTTP requests. This gives you the genuine browser fingerprint for free — correct header order, correct values, correct navigator properties, correct TLS handshake cipher suite order.
The cipher suite thing matters more than people realise. Node.js and browsers negotiate TLS with different cipher preference orders. Some DDoS protection vendors fingerprint the TLS handshake before the HTTP request even arrives. If your cipher order looks like OpenSSL rather than Chrome, you're flagged at the network layer before you've sent a single byte of HTTP.
// Using Playwright with a realistic browser context
import { chromium } from 'playwright';

const browser = await chromium.launch({
  headless: true,
  args: ['--no-sandbox', '--disable-blink-features=AutomationControlled']
});
const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...',
  locale: 'en-GB',
  timezoneId: 'Europe/London',
  viewport: { width: 1366, height: 768 }
});
The --disable-blink-features=AutomationControlled flag is important — without it, navigator.webdriver returns true in the page context, which is a trivial bot signal that JavaScript on the page can read.
Layer 3: Cloudflare Bot Management
This is where things get genuinely hard. Cloudflare's bot management suite runs a JavaScript challenge in the browser that collects dozens of signals:
- Canvas fingerprint (GPU-specific rendering of a test image)
- Audio context fingerprint
- Font enumeration
- WebGL renderer string
- Screen dimensions vs. window dimensions
- Mouse movement entropy (does the mouse move like a human?)
- Timing of JavaScript execution
- Whether window.chrome exists and has the right shape
- Dozens more
The challenge runs silently in the background. Pass it and you get a cf_clearance cookie that's valid for a few hours. Fail and you get a 403 or an interactive CAPTCHA.
What it looks like: Page loads fine, but all API calls to the flight data endpoint return 403. Or: you get the page, the page loads some JS, and then the JS makes a secondary XHR that gets blocked.
What I did: Playwright with the playwright-extra stealth plugin, which patches the most common automation detection vectors:
import { chromium } from 'playwright-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
chromium.use(StealthPlugin());
The stealth plugin patches things like:
- navigator.webdriver → undefined
- navigator.plugins → a realistic plugin list
- window.chrome → full Chrome object shape
- Canvas fingerprint → randomised slightly per session
- permissions.query for notifications → returns denied (real browsers do; headless ones used to throw)
This handles maybe 70% of Cloudflare deployments. The other 30% use more aggressive challenges that require the cf_clearance cookie to be renewed more frequently, or require solving an actual interactive challenge. For those, I use a CAPTCHA solving service — 2captcha or AntiCaptcha — which has human solvers available for the rare cases that get past the stealth layer.
Layer 4: Akamai Bot Manager
Akamai is different from Cloudflare. Instead of a one-time challenge, Akamai's bot manager injects a sensor script into every page that runs continuously and phones home with behavioural data. It's looking for things like:
- Are mouse movements suspiciously perfect? (Bots often move in straight lines)
- Are keystrokes happening at humanly possible intervals?
- Is the page being scrolled at all?
- Does the _abck cookie (Akamai's session cookie) contain a valid HMAC signature computed from real browser data?
The _abck cookie is the key. It's an opaque token encoding sensor data, and it's validated server-side on each request. If the cookie is missing or invalid, API requests are rejected — even if you've got a valid session.
What it looks like: You can load the page fine. But when the page makes XHR requests to /api/flights or wherever the data lives, they come back 403 with no body. The failure is invisible to users (the JS handles the retry), but your scraper just gets nothing.
What I did: Two approaches depending on the airport.
For airports where the data is available in the initial HTML (the departure board is server-rendered), I just parse the HTML directly. Akamai only blocks the XHR calls; the first page load often goes through fine and contains all the data I need.
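For the server-rendered case, you often don't even need a DOM parser: many of these pages embed the board as a JSON blob assigned to a global (the window.__FLIGHT_DATA__ name below is one example; the global varies per site). A stdlib-only sketch:

```javascript
// Sketch: pull a server-rendered data blob out of the initial HTML without
// a headless browser. The global name varies per site; nested-brace edge
// cases exist, so this is a fast path with a browser fallback, not a parser.
function extractEmbeddedFlights(html, globalName = '__FLIGHT_DATA__') {
  // Match: window.__FLIGHT_DATA__ = { ... };
  const re = new RegExp(`window\\.${globalName}\\s*=\\s*(\\{[\\s\\S]*?\\})\\s*;`);
  const match = html.match(re);
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    return null; // blob was JS, not strict JSON: fall back to a real browser
  }
}
```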
For airports where the data is loaded client-side via XHR, I use Playwright and let the full page JavaScript run. Playwright with stealth generates valid-enough sensor data that the _abck cookie gets accepted. I then extract the cookie and replay it on subsequent requests — though it expires and needs refreshing every 30–60 minutes.
// Wait for the XHR to complete rather than trying to replicate it
await page.waitForResponse(
  response => response.url().includes('/api/flights') && response.status() === 200
);
const data = await page.evaluate(() => window.__FLIGHT_DATA__);
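The cookie replay step can be sketched as a pure helper over the objects Playwright's context.cookies() returns ({ name, value, domain, expires, ... }). The refresh margin is an assumption to tune per site:

```javascript
// Sketch: pick a still-valid _abck out of a Playwright cookie array and
// format it for a raw Cookie header. Returns null when a fresh browser
// session is needed. marginSec is an illustrative refresh buffer.
function reusableAbckCookie(cookies, nowSec = Date.now() / 1000, marginSec = 300) {
  const abck = cookies.find(c => c.name === '_abck');
  if (!abck) return null;
  // expires === -1 means a session cookie in Playwright's cookie format
  const fresh = abck.expires === -1 || abck.expires - nowSec > marginSec;
  return fresh ? `_abck=${abck.value}` : null;
}
```

When it returns null, the scraper falls back to a full Playwright page load to re-mint the cookie.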
Layer 5: Dynamic Endpoint Obfuscation
Some airport sites don't have a stable API endpoint. The URL for the flight data changes with each deployment — it might be hashed, versioned, or entirely randomised. Hardcoding /api/v1/flights breaks the moment they redeploy.
What it looks like: Your scraper works for weeks, then silently breaks when the airport updates their site. The URL 404s and you don't find out until someone notices the data hasn't updated.
What I did: Intercept network requests rather than targeting a specific URL. Playwright's page.on('response', ...) captures every response the page receives. I filter for responses that look like flight data — JSON with an array of objects containing recognisable fields like flight numbers, times, or status strings.
page.on('response', async (response) => {
  const url = response.url();
  if (!url.includes('flight') && !url.includes('departure') && !url.includes('arrival')) {
    return;
  }
  try {
    const body = await response.json();
    if (isFlightData(body)) {
      await processFlights(body);
    }
  } catch {
    // Not JSON, skip
  }
});

function isFlightData(data) {
  const arr = Array.isArray(data) ? data : data.flights || data.data || data.departures;
  if (!Array.isArray(arr) || arr.length === 0) return false;
  const sample = arr[0];
  return Boolean(sample.flightNumber || sample.flight_no || sample.ident || sample.designator);
}
This is more resilient to site updates because I'm not coupled to a specific URL structure — I'm coupled to the shape of the data, which changes far less often.
Layer 6: Session Fixation and IP-Bound Cookies
This is the one that took me longest to diagnose. Some airports use session cookies that are bound to the IP address of the first request. The cookie is issued by the server and contains an HMAC of the client IP. If you grab the cookie from one IP (say, your local machine while debugging) and try to use it from another IP (your VPS), the server rejects it.
What it looks like: Everything works locally. On the VPS, you get 403s or empty data even though you're sending the right cookies. The difference is invisible in the request — you're sending a valid-looking cookie, it's just bound to the wrong IP.
What I did: I wrote about this in detail in a separate post, but the short version: I built SkipRestriction — a self-hosted Shadowsocks proxy running on the same VPS as the scraper. When I need to grab cookies for an airport with IP-bound sessions, I browse through the proxy so the cookie is issued to the VPS's IP. The scraper, also running on the VPS, then uses those cookies from the same IP.
Layer 7: JavaScript-Rendered Data with Anti-Automation Timing
Some sites render flight data using JavaScript that deliberately runs on a slight delay, or loads data in chunks, or only renders the full table after a scroll event. These are heuristics designed to detect headless browsers, which typically don't scroll and don't trigger layout-based events.
What it looks like: Playwright loads the page, you query the DOM for flight rows, and you get back 3 rows when there should be 45. The rest haven't rendered yet.
What I did: Inject realistic user behaviour before extracting data:
// Simulate a human reading the page before scraping
await page.waitForLoadState('networkidle');
await page.evaluate(() => window.scrollBy(0, 300));
await page.waitForTimeout(800 + Math.random() * 400);
await page.evaluate(() => window.scrollBy(0, 500));
await page.waitForTimeout(600 + Math.random() * 300);

// Now query for data
const flights = await page.$$eval('.flight-row', rows =>
  rows.map(row => ({
    flight: row.querySelector('.flight-number')?.textContent.trim(),
    status: row.querySelector('.status')?.textContent.trim(),
    time: row.querySelector('.scheduled-time')?.textContent.trim(),
  }))
);
The random delays matter — consistent 800ms waits are a bot signal. Human reading speed has natural variance.
The Layer I Haven't Beaten: Real CAPTCHA with Proof-of-Work
A handful of airports (3–4 in my coverage set) use a challenge mechanism that requires solving a proof-of-work puzzle — essentially, computing a hash with certain properties before the session is established. It's not a visual CAPTCHA; it's a compute challenge that's expensive for bots to complete at scale and trivially fast for a single human browser session.
I currently handle these with a hybrid approach: a real browser session is established manually once (or triggered via a webhook), the resulting cookies are stored, and the scraper uses those cookies until they expire. For airports with long cookie TTLs (24h+), this is workable. For airports that require fresh challenges every hour, it's not scalable and those airports currently have lower update frequency in MyAirports.
If you've cracked this one cleanly, I'd genuinely like to hear how.
What Works, What Doesn't, and What Changes Constantly
The honest summary: Playwright + stealth gets you through 80% of sites. IP management (residential proxies or VPS-matched cookies) handles another 10%. The remaining 10% requires per-site custom solutions, ongoing maintenance, or accepting reduced update frequency.
What changes most often: Cloudflare updates its challenge algorithms roughly every 6–8 weeks. When it does, stealth plugins need updating. I follow the puppeteer-extra-plugin-stealth repo closely and have a monitoring alert that fires when a Cloudflare-protected site's data goes stale — that's usually the signal that a new challenge version is live.
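The staleness alert itself is a one-liner. The cycle multiplier below is an assumption; the point is that the threshold scales with each airport's own polling interval rather than being a global constant:

```javascript
// Sketch of the staleness check: no fresh data within a few polling
// cycles usually means a new challenge version shipped.
function isStale(lastSuccessMs, pollIntervalMs, nowMs = Date.now(), cycles = 3) {
  return nowMs - lastSuccessMs > cycles * pollIntervalMs;
}
```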
What's surprisingly stable: The actual data extraction logic. Airport websites update their visual design sometimes, but the underlying data structure in the HTML or JSON is remarkably consistent over time. The bot protection changes; the data format rarely does.
The Scraper Stack (For Reference)
- Playwright (Chromium) for JS-heavy sites
- playwright-extra + stealth plugin for Cloudflare
- got (Node HTTP client) for simple HTML-only sites
- Residential proxy pool for IP-rate-limited airports
- Shadowsocks self-hosted proxy for IP-bound cookie issues
- 2captcha for the small number of visual CAPTCHAs
- PostgreSQL for storage, with a scrape_log table tracking per-airport success/failure rates
- Node-cron for scheduling, with dynamic intervals per airport based on data freshness requirements
The full architecture, including the self-learning caching layer, is covered in a previous post.
The flight data is live at myairports.online — if you want to use it rather than scrape airport sites yourself, the API handles all of this so you don't have to.
Have a layer I haven't covered? Drop it in the comments — I'm genuinely curious what else is out there.