How We Index 15,000+ eSIM Plans Across 120+ Providers

#webdev #api #datascience #travel

Running an eSIM comparison database means keeping 15,000+ plans accurate, current, and queryable across 120+ providers. Here's the operational reality of maintaining that index.

The Scale Problem

120+ providers × an average of ~125 plans each ≈ 15,000 plans, each with:

Pricing (in 2–3 currencies)
Country coverage (1–195 countries)
Data cap or unlimited
Validity period
Feature flags (hotspot, 5G, VoIP, eSIM type)
Provider metadata (reliability score, activation time)
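
As a rough illustration, one normalized plan record could be modeled like this. The class name, field names, and types are assumptions made for the sketch, not our production schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """One normalized plan record. Every field name here is illustrative."""
    provider: str
    prices: dict[str, float]        # currency code -> amount, e.g. {"USD": 12.99}
    countries: frozenset[str]       # ISO 3166-1 alpha-2 codes
    data_mb: Optional[int]          # None means unlimited data
    validity_days: int
    features: frozenset[str] = frozenset()   # e.g. {"hotspot", "5g"}

    @property
    def is_unlimited(self) -> bool:
        return self.data_mb is None


plan = Plan(
    provider="example-provider",    # hypothetical provider slug
    prices={"USD": 12.99},
    countries=frozenset({"JP"}),
    data_mb=5120,
    validity_days=30,
    features=frozenset({"hotspot"}),
)
```

Storing the data cap in megabytes with None for unlimited keeps "data cap or unlimited" in a single field that can still be compared numerically.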

Everything changes. Providers run flash sales (prices drop 30% for 48 hours). Plans are discontinued. New providers launch. Regional coverage maps update. A static snapshot is useless within days.

Data Sources: APIs vs Scraping

About 70 of our 120+ providers have APIs — some purpose-built for resellers, some general-purpose product APIs. These are the gold standard: structured data, predictable format, rate limit policies, and provider-blessed access.

Another ~35 providers either have no API or have APIs that are incomplete (missing plan details, outdated pricing). For these, we scrape their public product pages. Scrapers are fragile: a UI change can break them, so we monitor scraper health and alert on failure.

A third category: providers we've established direct data partnerships with. They push updates to us rather than waiting for us to pull. It's a small set for now, but growing.

The Normalization Challenge

Every provider structures plan data differently. Here's a sample of real variation:

Provider A: { "dataAmount": "5GB", "validDays": 30, "price": 12.99, "currency": "USD" }
Provider B: { "data_mb": 5120, "validity": "30 days", "cost_usd": "12.99" }
Provider C: { "planDetails": { "gb": 5, "days": 30 }, "pricing": { "USD": 12.99 } }
Provider D: { "data": "5 GB", "duration": 30, "retail_price": "$12.99" }

Same plan, four formats. Our normalization pipeline handles ~200 variations in how providers express data quantities, validity periods, pricing, and feature flags.
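
To make that concrete, here is a minimal sketch that maps the four sample shapes above onto one record. It is not our actual pipeline: the function names and output keys are assumptions, it treats 1 GB as 1024 MB (consistent with Provider B's 5120), and it ignores Provider A's separate currency field by assuming USD.

```python
import re

_UNIT_TO_MB = {"MB": 1, "GB": 1024}


def parse_data_mb(text: str) -> int:
    """Turn strings like '5GB' or '5 GB' into megabytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(MB|GB)\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized data amount: {text!r}")
    return int(float(match.group(1)) * _UNIT_TO_MB[match.group(2).upper()])


def parse_days(value) -> int:
    """Turn 30 or '30 days' into an integer day count."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"\s*(\d+)\s*days?\s*", value, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized validity: {value!r}")
    return int(match.group(1))


def parse_usd(value) -> float:
    """Turn 12.99, '12.99', or '$12.99' into a float."""
    return float(str(value).lstrip("$"))


def normalize(raw: dict) -> dict:
    """Detect which sample shape a payload has and map it to one schema."""
    if "dataAmount" in raw:          # Provider A
        data = parse_data_mb(raw["dataAmount"])
        days, price = parse_days(raw["validDays"]), parse_usd(raw["price"])
    elif "data_mb" in raw:           # Provider B
        data = int(raw["data_mb"])
        days, price = parse_days(raw["validity"]), parse_usd(raw["cost_usd"])
    elif "planDetails" in raw:       # Provider C
        data = int(raw["planDetails"]["gb"] * 1024)
        days = parse_days(raw["planDetails"]["days"])
        price = parse_usd(raw["pricing"]["USD"])
    elif "retail_price" in raw:      # Provider D
        data = parse_data_mb(raw["data"])
        days, price = parse_days(raw["duration"]), parse_usd(raw["retail_price"])
    else:
        raise ValueError("unknown provider format")
    return {"data_mb": data, "validity_days": days, "price_usd": price}


SAMPLES = [
    {"dataAmount": "5GB", "validDays": 30, "price": 12.99, "currency": "USD"},
    {"data_mb": 5120, "validity": "30 days", "cost_usd": "12.99"},
    {"planDetails": {"gb": 5, "days": 30}, "pricing": {"USD": 12.99}},
    {"data": "5 GB", "duration": 30, "retail_price": "$12.99"},
]
normalized = [normalize(sample) for sample in SAMPLES]
```

All four samples collapse to the same record. In practice, dispatching on the provider's identity rather than sniffing keys would be more robust; key sniffing just keeps the sketch short.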

Country coverage is the hardest to normalize. Some providers list ISO codes (JP, FR, DE). Others list country names in various languages. Others list regions ("Europe", which means different things per provider). We maintain a mapping table and manually verify edge cases.
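
A simplified sketch of that mapping-table idea follows. The tables are tiny and made up for illustration (including the provider names and what each one means by "Europe"); anything that cannot be resolved is returned separately so it can go to manual review.

```python
# Illustrative lookup tables; real ones would cover every country and language.
NAME_TO_ISO = {
    "japan": "JP", "japon": "JP",
    "france": "FR", "frankreich": "FR",
    "germany": "DE", "deutschland": "DE",
}
KNOWN_ISO = set(NAME_TO_ISO.values()) | {"GB"}

# A region name means different things per provider, so the key includes
# the provider. "provider_x" and "provider_y" are hypothetical.
REGION_TO_ISO = {
    ("provider_x", "europe"): {"FR", "DE"},
    ("provider_y", "europe"): {"FR", "DE", "GB"},
}


def resolve_coverage(provider: str, entries: list[str]) -> tuple[set[str], list[str]]:
    """Return (resolved ISO codes, entries that need manual review)."""
    resolved: set[str] = set()
    unresolved: list[str] = []
    for entry in entries:
        key = entry.strip().casefold()
        if (provider, key) in REGION_TO_ISO:
            resolved |= REGION_TO_ISO[(provider, key)]
        elif key in NAME_TO_ISO:
            resolved.add(NAME_TO_ISO[key])
        elif key.upper() in KNOWN_ISO:
            resolved.add(key.upper())
        else:
            unresolved.append(entry)
    return resolved, unresolved


codes, review = resolve_coverage("provider_x", ["JP", "Europe", "Atlantis"])
```

Here `codes` contains JP, FR, and DE, while "Atlantis" lands in `review` instead of being silently dropped.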

Refresh Strategy

Not all plans need the same refresh frequency:

REFRESH_TIERS = {
    "tier_1": {  # Top 20 providers by query volume
        "interval_hours": 6,
        "providers": ["airalo", "holafly", "saily", ...]
    },
    "tier_2": {  # Mid-tier providers  
        "interval_hours": 12,
        "providers": [...]
    },
    "tier_3": {  # Long-tail providers
        "interval_hours": 24,
        "providers": [...]
    }
}
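
A scheduler can use a config like this to decide which providers are due. The sketch below is one possible way to do it, not our actual scheduler: `providers_due` and the `last_refreshed` mapping are hypothetical, and "provider_m" and "provider_z" are placeholders standing in for the elided lists.

```python
from datetime import datetime, timedelta, timezone

# Trimmed copy of the tier config so the example runs on its own.
# "provider_m" and "provider_z" are placeholder names.
REFRESH_TIERS = {
    "tier_1": {"interval_hours": 6, "providers": ["airalo", "holafly", "saily"]},
    "tier_2": {"interval_hours": 12, "providers": ["provider_m"]},
    "tier_3": {"interval_hours": 24, "providers": ["provider_z"]},
}


def providers_due(last_refreshed: dict[str, datetime], now: datetime) -> list[str]:
    """Return providers whose tier interval has elapsed (or that were never refreshed)."""
    due = []
    for tier in REFRESH_TIERS.values():
        interval = timedelta(hours=tier["interval_hours"])
        for provider in tier["providers"]:
            last = last_refreshed.get(provider)
            if last is None or now - last >= interval:
                due.append(provider)
    return due


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_refreshed = {
    "airalo": now - timedelta(hours=7),      # past its 6-hour interval
    "holafly": now - timedelta(hours=1),     # fresh
    "saily": now - timedelta(hours=6),       # exactly at the interval
    "provider_m": now - timedelta(hours=7),  # within its 12-hour interval
    # "provider_z" has never been refreshed
}
due = providers_due(last_refreshed, now)
```

With these timestamps, `due` is airalo, saily, and provider_z.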

Additionally, our price anomaly detector runs after every refresh. If a plan's price shifts more than 15% between cycles, we flag it for immediate re-verification. This catches both genuine sales (which we want to surface quickly) and data errors (which we don't want to propagate).
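
The 15% rule is easy to sketch. This is a simplified stand-in for the detector, assuming prices are keyed by a hypothetical plan ID and compared against the previous cycle only.

```python
def flag_anomalies(
    previous: dict[str, float],
    current: dict[str, float],
    threshold: float = 0.15,
) -> list[str]:
    """Return plan IDs whose price moved by more than `threshold` between cycles."""
    flagged = []
    for plan_id, new_price in current.items():
        old_price = previous.get(plan_id)
        if old_price is None or old_price <= 0:
            continue  # new plan or unusable baseline: nothing to compare against
        relative_shift = abs(new_price - old_price) / old_price
        if relative_shift > threshold:
            flagged.append(plan_id)
    return flagged


previous = {"plan_a": 10.00, "plan_b": 20.00, "plan_c": 12.99}
current = {"plan_a": 7.00, "plan_b": 21.00, "plan_c": 12.99, "plan_d": 5.00}
flagged = flag_anomalies(previous, current)
```

Only plan_a (a 30% drop) is flagged here. The detector cannot tell a flash sale from a bad parse on its own, which is why a flag triggers re-verification rather than an automatic accept or reject.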

What Goes Wrong

Scraper drift: Provider website redesigns break scrapers. We monitor scraper success rates and alert when a scraper starts returning empty or malformed data. Average time to detect: under 2 hours. Average time to fix: 4–48 hours depending on complexity.
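
One way to monitor for this is a rolling success rate per scraper. The sketch below is an assumption about how such a monitor could look; the window size, the 80% threshold, and the required field names are all made up for illustration.

```python
from collections import deque

# Hypothetical minimum set of fields a scraped plan must contain.
REQUIRED_FIELDS = {"data_mb", "validity_days", "price_usd"}


def is_well_formed(plan: dict) -> bool:
    return REQUIRED_FIELDS.issubset(plan)


class ScraperHealth:
    """Tracks the success rate of the last `window` runs of one scraper."""

    def __init__(self, window: int = 20, min_success_rate: float = 0.8):
        self.results: deque[bool] = deque(maxlen=window)
        self.min_success_rate = min_success_rate

    def record(self, plans: list[dict]) -> None:
        # A run counts as a failure if it returns nothing or anything malformed.
        ok = bool(plans) and all(is_well_formed(p) for p in plans)
        self.results.append(ok)

    @property
    def success_rate(self) -> float:
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def should_alert(self) -> bool:
        return self.success_rate < self.min_success_rate


good_run = [{"data_mb": 5120, "validity_days": 30, "price_usd": 12.99}]
health = ScraperHealth(window=4, min_success_rate=0.75)
for run in (good_run, good_run, good_run, []):
    health.record(run)
alert_after_one_failure = health.should_alert()
health.record([{"price_usd": 12.99}])  # malformed: missing fields
alert_after_two_failures = health.should_alert()
```

Treating an empty result as a failure matters because a redesigned page often still returns HTTP 200, just with nothing the selectors can match.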

Currency conversion lag: We convert all prices to USD and EUR at query time using FX rates that come from a live feed but are cached between fetches. During periods of rapid FX movement, the cached rates can be 0.5–1% off. Acceptable for our use case, but worth monitoring.
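
The trade-off comes from caching. A minimal sketch of a time-limited rate cache, assuming a hypothetical `fetch` callable that returns rates relative to USD and an arbitrary 5-minute TTL:

```python
import time
from typing import Callable


class CachedRates:
    """Caches FX rates and refetches them once they are older than the TTL."""

    def __init__(
        self,
        fetch: Callable[[], dict[str, float]],   # hypothetical rate source
        ttl_seconds: float = 300.0,              # assumed TTL, not a real setting
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._rates: dict[str, float] | None = None
        self._fetched_at = 0.0

    def rate(self, currency: str) -> float:
        now = self._clock()
        if self._rates is None or now - self._fetched_at >= self._ttl:
            self._rates = self._fetch()
            self._fetched_at = now
        return self._rates[currency]


def convert(amount_usd: float, currency: str, rates: CachedRates) -> float:
    """Convert a USD amount at query time using the (possibly cached) rate."""
    return round(amount_usd * rates.rate(currency), 2)
```

A shorter TTL narrows the drift during volatile periods at the cost of more calls to the rate source.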

Plan discontinuation without notice: Providers sometimes stop offering plans without API signals. Our staleness detector flags plans that haven't been seen in 48+ hours for manual review.
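
The staleness check itself is small. A sketch, assuming each refresh records a last-seen timestamp per plan (the `last_seen` mapping and plan IDs are hypothetical):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=48)


def stale_plans(
    last_seen: dict[str, datetime],
    now: datetime,
    stale_after: timedelta = STALE_AFTER,
) -> list[str]:
    """Return IDs of plans not seen in any refresh for at least `stale_after`."""
    return sorted(
        plan_id for plan_id, seen in last_seen.items() if now - seen >= stale_after
    )


now = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "plan_a": now - timedelta(hours=5),
    "plan_b": now - timedelta(hours=48),
    "plan_c": now - timedelta(hours=72),
}
needs_review = stale_plans(last_seen, now)
```

A 48-hour window means even a long-tail provider on the 24-hour tier has to miss two consecutive refreshes before a plan is flagged, which filters out one-off fetch failures.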

Pricing that requires authentication: Some providers show retail prices to direct customers and wholesale prices to API partners. We show retail (what users actually pay), but making sure we fetch the right price tier requires provider-specific handling.

The Result

When a user queries eSIMDB AI, they're searching a database that was refreshed within the last 6–24 hours, normalized from 120+ disparate sources, and checked for price anomalies and stale plans. The search takes under 2 seconds.

The index is not perfect — no real-time data product is. But 6–24 hour freshness is meaningfully better than the 30–90 day staleness common in affiliate comparison sites built on static data exports.

Happy to discuss any aspect of the architecture below.

eSIMDB AI — live at esimdb.ai. Free, no sign-up.