Most web scrapers are dumb: they do the same work every time. Open a browser, load a page, parse the DOM, extract data, close the browser. Every single request.
Building MyAirports — a real-time flight data API covering 1,000+ airports — forced me to think about scraping differently. At scale, running a headless browser for every request is expensive and slow. The system I ended up building instead gets faster and cheaper the more it runs, because it learns.
Here's how it works.
The core insight: most airport SPAs load data from internal JSON APIs
Go to any modern airport website. Open DevTools. Click on the Network tab. Filter for XHR/Fetch requests. What you'll see almost immediately is a call to something like:
GET https://api.heathrow.com/pihub/flights/arrivals
GET https://aena.es/sites/Satellite?pagename=AENA_ConsultarVuelos&airport=MAD
GET https://cph.dk/api/FlightInformation/GetFlightInfoTable?direction=A
The webpage is a React or Angular SPA, and the actual flight data comes from an internal API. That API is often unauthenticated, or authenticated only by session cookies the browser already has. If you can capture that API call, you can make it directly — no browser needed for subsequent requests.
This is the foundation of the whole system.
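Once you've captured such a call in DevTools, replaying it is just an HTTP request with the right headers. A minimal sketch — the `buildReplayRequest` helper, its field names, and the header values are illustrative, not the actual MyAirports code:

```javascript
// Sketch: replaying a discovered internal API call directly, no browser.
// `entry` is what a DevTools capture gives you: the URL plus whatever
// headers the browser sent that the API actually checks.
function buildReplayRequest(entry) {
  return {
    url: entry.url,
    options: {
      method: 'GET',
      headers: {
        Accept: 'application/json',
        'User-Agent': 'Mozilla/5.0', // some APIs reject the default client UA
        ...entry.headers,            // anything airport-specific we captured
      },
    },
  };
}

// Usage: const { url, options } = buildReplayRequest(cachedEntry);
//        const flights = await (await fetch(url, options)).json();
```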
The 5-layer pipeline
Every flight data request flows through five layers:
Layer 1: Operator registry
For the ~500 airports whose APIs I've already reverse-engineered, the system goes directly to the source. No discovery needed. A curated registry maps IATA codes to endpoints, required headers, and a transform function for normalizing the response. This handles roughly half of all requests, takes 1-2 seconds, and costs nothing beyond the API call.
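A registry entry looks roughly like this — the field names and the transform body are a sketch of the shape, not the real MyAirports schema:

```javascript
// Sketch: one curated registry entry per airport, keyed by IATA code.
const registry = {
  LHR: {
    endpoint: 'https://api.heathrow.com/pihub/flights/arrivals',
    headers: { Accept: 'application/json' },
    // Each airport gets its own transform into the unified flight schema;
    // this body is a placeholder, real ones are airport-specific.
    transform: (raw) => (Array.isArray(raw) ? raw : raw.flights ?? []),
  },
};

function lookupOperator(iata) {
  // A miss here falls through to Layer 2 (the discovered-API cache).
  return registry[iata.toUpperCase()] ?? null;
}
```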
Layer 2: API cache (discovered endpoints)
The system maintains a database of APIs it has discovered during previous browser sessions. Before firing up a browser, it checks: have we already seen a good API for this airport? If yes, call it directly. This is the "memory" layer — it's how the system gets cheaper over time.
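The layering itself is a short chain of fallbacks. A sketch, assuming hypothetical `apiCache` and `discoverViaBrowser` helpers (the real system persists the cache in a database table):

```javascript
// Sketch: the layered endpoint lookup. Cheap layers first, the browser last.
async function resolveEndpoint(iata, { registry, apiCache, discoverViaBrowser }) {
  if (registry[iata]) return registry[iata];      // Layer 1: curated registry
  const cached = await apiCache.get(iata);        // Layer 2: learned endpoints
  if (cached) return cached;
  const found = await discoverViaBrowser(iata);   // Layer 3: expensive discovery
  if (found) await apiCache.set(iata, found);     // remember it for next time
  return found;
}
```

The second request for any airport that reached Layer 3 never pays the browser cost again — that is the "memory" in practice.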
Layer 3: Browser discovery
Only if both previous layers miss does the system launch a real browser (Playwright, stealth mode). But even here it's not scraping the DOM — it's intercepting network traffic. The browser loads the airport's flight page, and the interceptor scores every JSON response it sees:
function scoreResponse(url, body) {
  let score = 0;
  if (url.includes('flight') || url.includes('arrival') || url.includes('departure')) score += 3;
  if (Array.isArray(body) && body.length > 5) score += 2;
  // Optional chaining guards against null/non-array JSON bodies.
  if (body?.[0]?.flightNumber || body?.[0]?.flight_no) score += 3;
  if (body?.[0]?.scheduled || body?.[0]?.scheduledTime) score += 2;
  return score;
}
Any response with a score above 5 is a candidate. The best candidate is saved to the ApiCache table — the discovered endpoint, the required headers, the response shape. Future requests for that airport hit Layer 2 instead.
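Selecting the winner from a session's worth of responses is a simple max-by-score pass. A sketch — the scoring function is injected (the `scoreResponse` heuristic above fits), and the Playwright wiring (`page.on('response', ...)`) that collects `observed` is omitted:

```javascript
// Sketch: keep the best-scoring JSON response seen during a browser session.
function pickBestCandidate(observed, score, threshold = 5) {
  let best = null;
  for (const { url, body, headers } of observed) {
    const s = score(url, body);
    if (s > threshold && (best === null || s > best.score)) {
      best = { url, headers, score: s }; // this is what gets saved to ApiCache
    }
  }
  return best; // null: nothing in the session smelled like flight data
}
```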
Layer 4: Normalizer
Regardless of which layer found the data, every response goes through a normalizer that converts it to a unified flight schema. Airport APIs have wildly different shapes — nested objects, flat arrays, GraphQL responses, tables with inconsistent column names. The normalizer handles all of it.
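The core trick is tolerating many field names for the same concept. A minimal sketch — the field variants below are guesses at common shapes, not an exhaustive list:

```javascript
// Sketch: shape-tolerant normalization into a unified flight schema.
function normalizeFlight(raw) {
  return {
    flightNumber: raw.flightNumber ?? raw.flight_no ?? null,
    scheduled: raw.scheduled ?? raw.scheduledTime ?? raw.std ?? null,
    status: raw.status ?? raw.flightStatus ?? null,
  };
}

function normalizeResponse(body) {
  // Unwrap the common container shapes before mapping rows.
  const rows = Array.isArray(body) ? body : body?.flights ?? body?.data ?? [];
  return rows.map(normalizeFlight);
}
```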
Layer 5: Persistence and stale-while-revalidate cache
Results are stored in PostgreSQL and kept in an in-memory cache with a stale-while-revalidate strategy. A stale response is returned immediately while a fresh scrape runs in the background.
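The stale-while-revalidate behavior fits in a few lines. A minimal sketch with an injectable clock so it's easy to reason about; the TTL value is illustrative:

```javascript
// Minimal stale-while-revalidate cache sketch.
function createSwrCache(fetchFresh, { ttlMs = 60_000, now = Date.now } = {}) {
  const store = new Map();
  return {
    async get(key) {
      const hit = store.get(key);
      if (hit && now() - hit.at < ttlMs) return hit.value; // still fresh
      if (hit) {
        // Stale: answer immediately, refresh in the background.
        fetchFresh(key)
          .then((v) => store.set(key, { value: v, at: now() }))
          .catch(() => {}); // on failure, keep serving the stale value
        return hit.value;
      }
      const value = await fetchFresh(key); // cold miss: must wait for a scrape
      store.set(key, { value, at: now() });
      return value;
    },
  };
}
```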
The result: the system pays for its own speed
When the system first encounters a new airport, it's slow: a full browser session takes 10-15 seconds. But the next time that airport is requested, the discovered API is available in the cache — and the call takes 1-2 seconds.
Over time, the API cache fills up. More and more airports become fast. The "discovery" cost is paid exactly once per airport. After that, it's a cheap HTTP call.
There's also an eviction mechanism: if a cached API endpoint starts returning 0 flights three times in a row, it gets flagged as stale and the system re-discovers it. Airport websites do change their APIs occasionally.
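The three-strikes rule is just a per-airport streak counter. A sketch, assuming a hypothetical `apiCache.flagStale` method:

```javascript
// Sketch: evict a cached endpoint after N consecutive empty results.
function createEvictionTracker(apiCache, maxEmpty = 3) {
  const emptyStreak = new Map();
  return {
    // Call after every scrape; returns true if the endpoint was evicted.
    record(iata, flightCount) {
      if (flightCount > 0) {
        emptyStreak.delete(iata); // any real data resets the streak
        return false;
      }
      const n = (emptyStreak.get(iata) ?? 0) + 1;
      emptyStreak.set(iata, n);
      if (n >= maxEmpty) {
        apiCache.flagStale(iata); // next request re-runs browser discovery
        emptyStreak.delete(iata);
        return true;
      }
      return false;
    },
  };
}
```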
The auto-discovery tool
To seed the API cache without waiting for organic traffic, I built a batch auto-discovery CLI that processes airports from the OpenFlights CSV database:
npm run auto-discover -- --country IT
npm run auto-discover -- --min-size large
It runs the full browser discovery pipeline for each airport, then writes the discovered endpoints to the config files automatically. Processing 1,300+ airports this way found 276 new ones — airports whose APIs I hadn't previously known existed.
AI pattern learning as a last resort
For airports where network interception fails (because the flight data is rendered server-side into the HTML rather than fetched by a client-side API call), there's a DeepSeek-powered fallback: send the page's HTML to the model and ask it to identify CSS selectors for the flight table. Some of what it has found:
GYD (Baku) — a.flt_row card layout with .fl_col.* columns
KTW (Katowice) — div.timetable__row.flight-board__row with .flight-board__col--*
TLL (Tallinn) — li.flights-list__item card layout
The AI doesn't always get it right on the first try, but for airports with non-standard layouts it's significantly faster than manually inspecting the DOM.
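The fallback boils down to prompt construction plus a model call. A sketch of the prompt-building half only — the wording, the `buildSelectorPrompt` name, and the truncation budget are assumptions, and the actual DeepSeek API call is omitted:

```javascript
// Sketch: building the selector-extraction prompt for the LLM fallback.
function buildSelectorPrompt(html, iata) {
  const snippet = html.slice(0, 12_000); // stay within a context budget
  return [
    `This is the flight board HTML for airport ${iata}.`,
    'Identify CSS selectors for: the element containing one flight row, and',
    'within it: flight number, scheduled time, origin/destination, status.',
    'Answer as JSON: {"row": "...", "fields": {...}}',
    '',
    snippet,
  ].join('\n');
}
```

The returned selectors are then cached per airport, so the model is only consulted once per layout, not per request.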
What I learned
The most important architectural decision was treating discovery as a one-time investment rather than an ongoing cost. Every browser session is expensive: launching a headless browser, executing JavaScript, waiting for the page to settle. If you pay that cost once and cache the result, you amortize it across hundreds of future requests.
The second insight was scoring rather than parsing: instead of trying to understand an airport's specific API format upfront, intercept everything and look for responses that smell like flight data. The scoring heuristic is simple but effective, and it works across wildly different API shapes.
The third was building eviction in from the start. APIs change. Caching without a sensible eviction strategy means you end up with a database full of dead endpoints.
The result is a system that today handles most requests in 1-2 seconds, covers 1,000+ airports, and gets incrementally better with every new airport it encounters.
The API is free to try at myairports.online/developers.
If you found this interesting, I write about building MyAirports — a real-time flight data API. The next post covers how the system normalizes flight status strings from 7 different languages.