Every traveler knows the frantic ritual: fifteen open tabs, clearing browser cookies in a desperate attempt to outsmart "dynamic pricing," and the soul-crushing moment a fare jumps by $200 while you're entering your credit card details. From the outside, the travel industry's pricing looks like chaos. From the inside, it is a high-frequency battle of algorithms.
For developers and data analysts, the challenge isn't just seeing these prices—it's capturing them at scale. Monitoring Skyscanner and Aviasales (JetRadar) is the "Final Boss" of web scraping. These platforms aren't just websites; they are massive aggregators of aggregators, protected by sophisticated anti-bot shields and complex asynchronous data flows.
If you want to build a reliable price monitor, you have to move beyond simple requests. Here is the senior-level blueprint for architecting a resilient flight data pipeline.
Why is Flight Data the Hardest Nut to Crack?
In most e-commerce scraping, you deal with a static SKU and a price. In flight monitoring, the "product" is a multi-dimensional matrix. A single seat's price is influenced by:
- The Global Distribution System (GDS): The legacy backbone (Amadeus, Sabre) where the data originates.
- OTA Markup: Online Travel Agencies adding their own margins.
- Caching Latency: The price you see on an aggregator is often a "ghost" cached minutes or hours ago.
When you scrape Skyscanner or Aviasales, you aren't just hitting one server. You are tapping into a stream that triggers hundreds of subprocesses. This is why standard BeautifulSoup approaches fail within minutes.
How Do Aggregators Protect Their Data?
Skyscanner and Aviasales employ defensive stacks that make standard scrapers look like toys. Understanding the "why" behind your blocks is the first step to bypassing them.
1. The TLS Fingerprinting Trap
Anti-bot solutions like Akamai (used by Skyscanner) and Cloudflare don't just look at your IP. They inspect your TLS handshake. The standard Python requests library presents cipher suites that look like a bot's, while real browsers produce specific, messy handshake patterns. If your handshake doesn't match a known browser profile, you are issued a 403 Forbidden before your request even hits the application layer.
2. Behavioral Heuristics
A human doesn't search for "London to Tokyo" 500 times in 2 minutes across different dates. Aggregators track session consistency: if your "user" switches between currencies and regions at superhuman speed, the session is flagged.
3. The Shadow DOM and Dynamic Loading
The "Price" on these sites is rarely in the initial HTML source. It is fetched via XHR/Fetch calls after the page loads. Often, the values are obfuscated or hidden within localized JSON objects that require a JavaScript engine to render.
The Architectural Framework: "The Resilient Scraper"
To build a professional-grade monitor, you need a three-tier architecture. Thinking of it as a single script is the quickest way to technical debt.
Tier 1: The Proxy Mesh
Do not use "cheap" datacenter proxies. They are blacklisted by ASN ranges. For flight scraping, you need Residential Proxies with sticky sessions.
Why? You need to maintain the same IP for the duration of a search "session" (from search input to price results) to mimic human behavior.
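A minimal sketch of the sticky-session idea: many residential providers let you pin an exit IP by embedding a session ID in the proxy username. The `-session-<id>` suffix, host, and port below are hypothetical; check your provider's documentation for the real format.

```python
import uuid

def sticky_proxy(base_user, password, session_id=None):
    """Build a proxies dict for `requests`, pinning one residential exit IP.

    The `-session-<id>` username suffix is a common provider convention
    (illustrative here -- the exact format is provider-specific).
    """
    session_id = session_id or uuid.uuid4().hex[:8]
    proxy_url = f"http://{base_user}-session-{session_id}:{password}@proxy.example.com:8000"
    return {"http": proxy_url, "https": proxy_url}

# Reuse the SAME dict for every request in one search flow,
# so the search submission and the price polling share one IP:
proxies = sticky_proxy("user123", "pass", session_id="srch42")
# requests.get(search_url, proxies=proxies)
# requests.get(polling_url, proxies=proxies)
```

Rotate to a fresh `session_id` only when you start a new logical search, never mid-flow.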
Tier 2: The Browser Context Manager
Forget headless Chrome in its default state. Tools like Playwright or Puppeteer must be augmented with "stealth" plugins to mask properties like navigator.webdriver and chrome.app.
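As a sketch of that augmentation, here is an init script that hides the two most commonly checked automation tells, wired into Playwright. This covers only a fraction of what dedicated stealth plugins patch (plugins list, languages, WebGL vendor, and more), and the locale/timezone values are illustrative.

```python
def stealth_init_script():
    """JS injected before any page script runs; masks the two most
    commonly probed automation properties."""
    return """
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
    window.chrome = window.chrome || { app: {}, runtime: {} };
    """

def launch_stealth_browser():
    """Launch Chromium with the patches applied.

    Requires `pip install playwright` and `playwright install chromium`;
    imported lazily so the helper above works without the dependency.
    """
    from playwright.sync_api import sync_playwright
    p = sync_playwright().start()
    browser = p.chromium.launch(headless=True)
    # Keep locale/timezone consistent with your proxy's geo.
    context = browser.new_context(locale="en-GB", timezone_id="Europe/London")
    context.add_init_script(stealth_init_script())
    return p, browser, context
```

If the browser's reported timezone contradicts the proxy's country, that mismatch alone can flag the session.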
Tier 3: The Data Normalization Engine
Aviasales and Skyscanner return data in vastly different formats. Your engine must map these into a unified schema:
Price_final = Price_base + Taxes + Fees_estimated
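A minimal sketch of such an engine, assuming hypothetical payload shapes for both providers (the real field names differ and change over time):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NormalizedFare:
    """Unified schema every provider payload is mapped into."""
    origin: str
    destination: str
    currency: str
    price_base: float
    taxes: float
    fees_estimated: float

    @property
    def price_final(self):
        # Price_final = Price_base + Taxes + Fees_estimated
        return round(self.price_base + self.taxes + self.fees_estimated, 2)

def from_aviasales(raw):
    # Field names are illustrative, not the real payload schema.
    return NormalizedFare(raw["origin"], raw["destination"], raw["currency"],
                          raw["price"], raw["taxes"], raw.get("fees", 0.0))

def from_skyscanner(raw):
    # Ditto: a stand-in for whatever the internal API currently returns.
    leg = raw["legs"][0]
    return NormalizedFare(leg["from"], leg["to"], raw["curr"],
                          raw["base_amount"], raw["tax_amount"], raw.get("est_fees", 0.0))
```

One adapter function per source keeps schema drift contained: when a provider renames a field, you fix one mapper, not the whole pipeline.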
Step-by-Step Guide: Building Your Monitor
If you are starting today, follow this progression to avoid hitting a wall.
Step 1: Endpoint Discovery (The "API First" Rule)
Before you try to parse HTML, open the Network Tab in your DevTools. Both Aviasales and Skyscanner rely on internal APIs.
- Aviasales is generally more developer-friendly, often offering an official API for partners.
- Skyscanner is a fortress. You will often see requests to /apis/v1/prices. Attempting to hit these endpoints directly without the correct headers (Referer, Origin, and Cookies) will fail.
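A sketch of what "the correct headers" means in practice. Every value here is illustrative; the point is to mirror exactly what the browser sends, which you capture from the Network Tab rather than guess.

```python
def browser_like_headers(search_page_url, session_cookie):
    """Headers mirroring what the browser sends to an internal pricing call.

    Values are placeholders -- copy the real ones from DevTools, since a
    missing Referer or Origin alone is often enough to trigger a 403.
    """
    return {
        "Referer": search_page_url,
        "Origin": "https://www.skyscanner.net",
        "Cookie": session_cookie,
        "Accept": "application/json",
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36"),
    }

headers = browser_like_headers(
    "https://www.skyscanner.net/transport/flights/",
    "sessionid=abc123",  # captured from the browser session, not invented
)
# requests.get(internal_prices_url, headers=headers)
```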
Step 2: Handling the "Waiting" State
Flight searches are asynchronous. When you send a request, the server returns a "Session ID" and an incomplete list of flights. You must poll the results.
```python
# Conceptual polling logic: keep requesting the session's result URL
# until the server reports the search as complete.
import time
import random

progress = 0
while progress < 100:
    response = session.get(polling_url)    # session + polling_url come from the initial search request
    data = response.json()
    progress = data.get('progress', 100)   # e.g. percentage of providers that have answered
    update_dashboard(data.get('itineraries', []))
    time.sleep(random.uniform(2, 5))       # jittered delays look human; fixed intervals do not
```
Insight: Failing to simulate the polling behavior is a primary signal to anti-bots that you are a script.
Step 3: Solving the Fingerprint Challenge
Use a library like curl_cffi in Python. It allows you to impersonate the TLS/JA3 fingerprint of a real browser (like Chrome 120) even when making low-level HTTP requests. This is significantly faster than a full browser and more stealthy than requests.
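A sketch of how that looks, with the network call isolated in its own function. The impersonation target strings below match curl_cffi's naming convention, but the exact list of supported versions depends on your installed release, so treat them as assumptions to verify.

```python
import random

# Fingerprints curl_cffi can impersonate (supported list varies by release).
IMPERSONATION_TARGETS = ["chrome119", "chrome120", "safari17_0"]

def pick_target(rng=random):
    """Rotate fingerprints so a single JA3 hash doesn't dominate your traffic."""
    return rng.choice(IMPERSONATION_TARGETS)

def fetch_impersonated(url):
    """Requires `pip install curl_cffi`; imported lazily so the rest
    of the module works without the dependency installed."""
    from curl_cffi import requests as cffi_requests
    return cffi_requests.get(url, impersonate=pick_target())
```

Because curl_cffi speaks raw HTTP, you still get browser-grade TLS at a fraction of the memory cost of a headless Chromium instance.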
Step 4: Data Deduplication
Aggregators often show the same flight via different OTAs.
- Formula for Comparison: Compare flights based on a hash of (Airline + FlightNumber + DepartureTime).
- Ignore the price during the ID phase; only use it for the value comparison.
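The two rules above can be sketched as follows, assuming offers arrive as plain dicts with these illustrative keys:

```python
import hashlib

def flight_id(airline, flight_number, departure_time):
    """Identity hash: same physical flight -> same ID, whatever the OTA charges."""
    key = f"{airline}|{flight_number}|{departure_time}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

def cheapest_per_flight(offers):
    """Collapse duplicate OTA listings, keeping only the lowest price per flight."""
    best = {}
    for offer in offers:
        fid = flight_id(offer["airline"], offer["flight_number"],
                        offer["departure_time"])
        # Price plays no part in identity -- only in picking the winner.
        if fid not in best or offer["price"] < best[fid]["price"]:
            best[fid] = offer
    return best
```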
Framework for Scale: The "Scrape-Observe-Adjust" Cycle
| Component | Strategy | Senior Insight |
|---|---|---|
| Concurrency | Distributed Workers | Don't scale vertically; use Celery or Temporal to distribute tasks across different regions. |
| Error Handling | Exponential Backoff | If you hit a 429 (Too Many Requests), don't just retry. The wait time should be T = 2ⁿ × jitter. |
| Validation | Schema Enforcement | Flight data is messy. Use Pydantic to ensure the price is always a float and the currency is ISO-compliant. |
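The backoff rule from the table, sketched as "full jitter": the wait is drawn uniformly from zero up to the capped exponential ceiling, which spreads a fleet of retrying workers apart instead of letting them hammer the server in lockstep. The base and cap values are illustrative defaults.

```python
import random

def backoff_delay(attempt, base=1.0, cap=120.0, rng=random):
    """T = 2^n * jitter, capped.

    `attempt` is the zero-based retry count; returns seconds to sleep
    before the next try after a 429 (Too Many Requests).
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)
```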
Beyond the Basics: What Newbies Miss
The "Geo-Pricing" Arbitrage
Prices for the same flight differ based on the IP location. A flight from New York to Paris might be cheaper when "searched" from a Polish IP than a US IP. A senior scraping architect builds a "Geo-Switcher" into their monitor to find the absolute floor of a price.
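A skeleton of such a geo-switcher, with the scraping itself abstracted behind a `fetch_price` callable you supply (your proxy-routed request pipeline). Note that real comparisons need all quotes converted to one base currency first; the stub below assumes they already are.

```python
def find_price_floor(route, geos, fetch_price):
    """Query the same route through exit nodes in several countries
    and keep the cheapest quote.

    `fetch_price(route, geo)` is your proxy-routed scraper, expected
    to return a (price, currency) tuple in a common base currency.
    """
    quotes = {geo: fetch_price(route, geo) for geo in geos}
    best_geo = min(quotes, key=lambda g: quotes[g][0])
    return best_geo, quotes[best_geo]

# Stubbed example: the Polish exit node sees the cheaper fare.
fake_quotes = {"US": (812.0, "USD"), "PL": (745.0, "USD"), "IN": (790.0, "USD")}
geo, quote = find_price_floor("JFK-CDG", ["US", "PL", "IN"],
                              lambda route, g: fake_quotes[g])
```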
Detecting "Price Error" Fares
High-level monitors don't just look for cheap tickets; they look for anomalies. If the historical average for a route is $800 and it drops to $150, your system should trigger an immediate alert. This requires a time-series database like InfluxDB or ClickHouse to store historical price points.
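A minimal in-memory version of that alert rule, before any time-series database enters the picture. The z-score threshold and floor ratio are tunable assumptions; the ratio check keeps ordinary sales on volatile routes from firing alerts.

```python
from statistics import mean, stdev

def is_error_fare(history, current, z_threshold=3.0, floor_ratio=0.5):
    """Flag a fare sitting far below its historical distribution.

    Fires only when the price is both a statistical outlier (z-score
    vs. history) AND below `floor_ratio` of the historical mean.
    """
    if len(history) < 2:
        return False  # not enough data to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current < mu * floor_ratio
    z = (mu - current) / sigma
    return z >= z_threshold and current < mu * floor_ratio
```

In production you would pull `history` from your InfluxDB/ClickHouse store per route; the decision logic stays the same.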
Final Thoughts: The Ethics and Evolution of Scraping
Building a monitor for Skyscanner and Aviasales is a game of cat and mouse. You are operating in a space where the targets have billion-dollar incentives to keep you out.
However, the value of this data is immense. Whether you're building a travel startup, a personal alert system, or a market analysis tool, the key is respecting the infrastructure. High-frequency scraping without caching is not just "noisy"—it's inefficient. A senior engineer knows that the best scraper is the one that makes the fewest requests to get the most information.
The question isn't just "How do I parse this?" but "How do I build a system that remains invisible while providing undeniable value?"