NexusFeed
How I bypassed Akamai on the Illinois liquor license portal (and why it changed how I build scrapers)

Illinois liquor license data lives behind a Salesforce Experience Cloud portal guarded by Akamai. I needed to reach it because the rest of my API already covered California, Texas, New York, and Florida, and four states out of fifty looked like half a product. I thought it would take an afternoon. It took a week.

TL;DR

  • Akamai blocks Webshare residential, datacenter IPs, ScraperAPI proxy mode, and even my own Paris residential IP. Four for four.
  • ScraperAPI's render API (not proxy mode) is the only thing that got through — because it ships the request to a managed browser farm with real anti-bot bypass, not just a rotating IP.
  • Five credits per call. Free tier covers 200 lookups per month. Enough to ship the state.
  • Texas needed a 2Captcha pipeline and a session-cookie trick. FedEx Freight needed ultra_premium=true. Cloudflare killed my Playwright implementation for good.
  • The design decision I'm proudest of: every response carries a _verifiability block with a deterministic confidence score. No silent degradation, ever.

I am building NexusFeed, a B2B data API aimed at AI agents. Two products share the same codebase: LTL fuel surcharge rates for freight carriers, and liquor license compliance lookups for the highest-volume states. The buyers are not humans. They are language-model agents doing compliance checks, freight cost calculations, and business due diligence, and they need data that carries a known source and a known confidence score, not a scraped blob they have to trust blindly.

That framing matters for the rest of this post. Every extraction decision I describe below was made with an agent on the receiving end, not a dashboard user who can eyeball a broken row and shrug.

🧱 The Illinois wall

The Illinois Liquor Control Commission runs its license lookup at ilccportal.illinois.gov/s/license-lookup. The URL looks ordinary. The network layer is not.

Akamai sits in front of the portal and drops almost everything that is not a real browser on a real residential network. I tried four proxy strategies before I stopped counting.

  • Webshare rotating residential. Hit Akamai and got a JavaScript challenge that never resolved.
  • Direct Railway container IP. Same treatment, slightly faster rejection.
  • ScraperAPI standard proxy mode. Rotates residential IPs but does not render JavaScript, so it hit the same wall.
  • My own French residential ISP. Akamai blocked a Paris IP asking for an Illinois liquor license with all the suspicion it deserved.

Four for four. At that point I was staring at the Akamai challenge page long enough to have feelings about its font choices.

🔓 The fix

The fix was to stop thinking of ScraperAPI as a proxy and start thinking of it as a remote browser.

ScraperAPI has two modes. Proxy mode is what most people know: rotate an IP, pass the request through, five hundred requests for a dollar. Render API is different. It ships the request to a managed browser farm that already knows how to get past most CDN bot walls, waits for the page to finish rendering, and returns the full HTML. It charges five credits per call instead of one.

The comment in my extractor is blunter than I would normally write in public:

```python
# Akamai is extremely aggressive on this portal — only ScraperAPI's render API
# (which uses its own browser farm with anti-bot bypass) can load the page.
# Playwright, Webshare proxy, and direct IP all fail.
```

One HTTP GET to api.scraperapi.com with render=true and country_code=us, a ninety-second timeout, and I finally had the rendered portal page in my hands.

```python
params = {
    "api_key": api_key,
    "url": "https://ilccportal.illinois.gov/s/license-lookup",
    "render": "true",
    "country_code": "us",
}
```
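Wired into an actual request, that call looks roughly like the sketch below. The helper names are mine, not from the NexusFeed codebase, and I'm using only the standard library here to keep it self-contained; the params dict matches the one above, and the long timeout is the important part, because render calls routinely take thirty to sixty seconds.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SCRAPERAPI_ENDPOINT = "https://api.scraperapi.com/"

def build_render_url(api_key: str) -> str:
    """Build the ScraperAPI render-mode URL for the Illinois portal."""
    params = {
        "api_key": api_key,
        "url": "https://ilccportal.illinois.gov/s/license-lookup",
        "render": "true",        # browser farm, not proxy mode
        "country_code": "us",    # a US exit node, not a Paris ISP
    }
    return f"{SCRAPERAPI_ENDPOINT}?{urlencode(params)}"

def fetch_illinois_portal(api_key: str) -> str:
    # Render calls are slow; a short timeout guarantees failure.
    with urlopen(build_render_url(api_key), timeout=90) as resp:
        return resp.read().decode("utf-8")
```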

ScraperAPI's free tier is 1,000 credits per month, which at five credits per call is 200 Illinois lookups per month. Not a production tier, but enough to ship the state and find out whether anyone actually wants it before I pay for more.

🧩 The second wall: FlexCard

Getting the HTML was only half the problem.

The portal renders its license table inside a Salesforce OmniStudio FlexCard, and the Lightning DataTable lives inside a shadow DOM. More painfully, the FlexCard ignores URL search parameters. There is no ?licenseNumber=... that the server will honour, because filtering happens inside a client-side Aura component that expects user input events, not query strings.

What ScraperAPI returns is the portal's default unfiltered view: the first twenty license records the page chooses to show. I regex-parse the serialized shadow DOM, build LicenseRecord objects, and then filter those twenty records client-side to whatever the caller asked for. If the match is in that window the caller gets a hit. If it is not, they get an empty result, and the response is honest about the window it searched.
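The filtering step itself is trivial once the records exist as objects. A minimal sketch, where `LicenseRecord` and its field names are placeholders for the real model (which carries more fields):

```python
from dataclasses import dataclass

@dataclass
class LicenseRecord:
    license_number: str
    business_name: str
    status: str

def filter_window(records: list[LicenseRecord],
                  license_number: str) -> list[LicenseRecord]:
    """Filter the ~20-record default view client-side.

    An empty result means "not in the window the portal showed us",
    not "this license does not exist" -- the response says so explicitly.
    """
    return [r for r in records if r.license_number == license_number]
```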

Fixing this properly would mean buying ScraperAPI's JS Instructions tier and scripting a real click into the search box, and I will do that when Illinois starts paying for itself. Until then, honesty about the limitation lives in the confidence score, which I will get to in a minute.

🔐 Texas: CAPTCHA and the session cookie trick

Texas has a different flavour of hostile.

TABC's public inquiry runs on apps.tabc.texas.gov/publicinquiry/StatusNewLayout.aspx, an ASP.NET WebForms application from roughly 2008 with a real image CAPTCHA. You cannot just httpx.get() the CAPTCHA image URL, because the ASP.NET session cookie you get back from that request is a different session from the one attached to the form you need to submit. Two separate sessions, one CAPTCHA you can never use, one form that will never accept it.

The fix is to fetch the CAPTCHA from inside the Playwright page itself, where the cookie jar is already correct:

```javascript
async () => {
    const resp = await fetch('/publicinquiry/Captcha.aspx');
    const buf = await resp.arrayBuffer();
    const bytes = new Uint8Array(buf);
    let binary = '';
    bytes.forEach(b => binary += String.fromCharCode(b));
    return btoa(binary);
}
```

That runs inside page.evaluate(), so the browser owns both requests and the session cookie matches. I hand the base64 image to 2Captcha at roughly two tenths of a cent per solve, wait about fifteen seconds for a human on the other end to read it, type the answer back into the form, and click submit.
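The Python side of that handoff is small. A sketch of how I'd wire it, with the function name mine rather than from the real extractor; `page` is a Playwright `Page`, and the decoded bytes are what get handed to the 2Captcha solver:

```python
import base64

# The in-page fetch from above, kept as a string so page.evaluate() runs it
# inside the browser, where the cookie jar already matches the form's session.
CAPTCHA_FETCH_JS = """
async () => {
    const resp = await fetch('/publicinquiry/Captcha.aspx');
    const buf = await resp.arrayBuffer();
    const bytes = new Uint8Array(buf);
    let binary = '';
    bytes.forEach(b => binary += String.fromCharCode(b));
    return btoa(binary);
}
"""

async def grab_captcha_image(page) -> bytes:
    """Fetch the CAPTCHA through the page's own session, return raw bytes."""
    b64 = await page.evaluate(CAPTCHA_FETCH_JS)
    return base64.b64decode(b64)
```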

Texas returns results in a popup window opened by an ASP.NET postback, which Playwright catches cleanly with async with page.context.expect_page(). You open the context manager, click the submit button inside it, and the new page falls out the other side as a first-class Page object you can scrape like any other. No polling loop, no guessing at window names, just a popup the browser hands me.
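The popup capture reduces to a few lines. A sketch under stated assumptions — the submit selector is a placeholder, not TABC's real button id:

```python
async def submit_and_catch_popup(page, submit_selector="#btnSubmit"):
    """Click submit and receive the results popup as a first-class Page.

    `submit_selector` is hypothetical -- the real button id differs.
    """
    # expect_page() resolves when the context opens a new page, which is
    # exactly what the ASP.NET postback does on submit.
    async with page.context.expect_page() as popup_info:
        await page.click(submit_selector)
    results = await popup_info.value
    await results.wait_for_load_state()
    return results
```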

☁️ FedEx: the death of Playwright

FedEx Freight publishes its LTL fuel surcharge on a Cloudflare-fronted marketing page. Cloudflare blocks datacenter IPs by default and is getting steadily better at spotting headless browsers, stealth plugins or no stealth plugins.

I fought Playwright for a day, then deleted the Playwright implementation entirely and routed FedEx through ScraperAPI with five parameters:

```python
params = {
    "api_key": api_key,
    "url": _URL,
    "render": "true",
    "ultra_premium": "true",
    "country_code": "us",
}
```

ultra_premium=true is the Cloudflare-grade tier. More expensive per call than vanilla render, but the only combination that consistently returned a page with actual fuel surcharge numbers instead of a challenge screen. Every extractor records which method it used in a response field, so I never lose the audit trail when I have to debug a regression six weeks later.

📜 The _verifiability contract

This is the design decision I am proudest of, and the one I wish more scraping APIs made.

Every response from NexusFeed carries a _verifiability block as a first-class field, not a footnote:

```python
from datetime import datetime

from pydantic import BaseModel

class Verifiability(BaseModel):
    source_timestamp: datetime
    extraction_confidence: float
    raw_data_evidence_url: str
    extraction_method: ExtractionMethod
    data_freshness_ttl_seconds: int
```

extraction_method is an enum with five members: api_mirror, playwright_dom, structured_parse, scraper_api, scraper_api_fallback. An agent reading the response can see immediately whether the number came from a JSON API mirror that returned in two hundred milliseconds, or from a regex run over a rendered Salesforce shadow DOM that cost five credits and took forty seconds, and can make its own call about how much to trust the answer.
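The enum itself is tiny. A sketch, assuming the string values mirror the member names (the real definitions may differ):

```python
from enum import Enum

class ExtractionMethod(str, Enum):
    api_mirror = "api_mirror"              # fast JSON mirror, sub-second
    playwright_dom = "playwright_dom"      # real browser, DOM queries
    structured_parse = "structured_parse"  # regex/HTML parse of a fetched page
    scraper_api = "scraper_api"            # ScraperAPI render farm
    scraper_api_fallback = "scraper_api_fallback"  # last resort
```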

extraction_confidence is never hardcoded. It is always the output of one short function, called by every extractor:

```python
def compute_confidence(
    required_fields: list[str],
    found_fields: set[str],
    fallback_triggered: bool = False,
) -> float:
    if fallback_triggered:
        return 0.0
    found = len([f for f in required_fields if f in found_fields])
    return round(found / len(required_fields), 2)
```

Six required fields, four found, fallback path did not trigger? Confidence is 0.67.

Primary extractor failed and I had to run the HTML fallback? Confidence is 0.0 and the router returns a 503 instead of serving a half-answer.

Anything below 0.5 fires a warning in the operator's logs. There is no silent degradation from "perfect JSON API" to "I am guessing and not telling you" without the caller finding out.
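Worked through concretely (the function is repeated here so the snippet runs standalone; the field names are hypothetical, since the real schema differs per state):

```python
def compute_confidence(required_fields, found_fields, fallback_triggered=False):
    # Identical to the extractor helper above.
    if fallback_triggered:
        return 0.0
    found = len([f for f in required_fields if f in found_fields])
    return round(found / len(required_fields), 2)

# Hypothetical required fields for a license record.
required = ["license_number", "business_name", "status",
            "issue_date", "expiry_date", "county"]

compute_confidence(required, {"license_number", "business_name",
                              "status", "county"})            # -> 0.67
compute_confidence(required, set(), fallback_triggered=True)  # -> 0.0
```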

I wrote this contract on day one and it has saved me every single time a source site quietly changed its markup. Twice in six months an extractor silently lost a column, the confidence dropped from 1.0 to 0.67, the warning fired, and I had the fix in production before a single caller noticed. The contract forced the extractor to admit what it was missing instead of papering over it.

📦 What shipped

  • 10 LTL carriers: Old Dominion, Saia, Estes, ABF, R+L, TForce, XPO, Southeastern, Averitt, FedEx
  • 5 ABC states: California, Texas, New York, Florida, Illinois
  • 230 passing tests
  • FastAPI on Railway Hobby, Redis cache-aside, 7-day LTL TTL, 24-hour ABC TTL
  • Stripe metered billing per product (ltl_request + abc_request meters)
  • MCP server so agents install NexusFeed as a tool in Claude Desktop or Cline with one command
  • Six months, solo, from an apartment in Paris between day-job hours
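The cache-aside piece is the textbook pattern: check Redis, on a miss fetch from the extractor and write back with the product's TTL. A sketch with hypothetical key names and a generic `fetch` callable — the real keys and client wiring differ:

```python
import json

LTL_TTL = 7 * 24 * 3600   # 7-day LTL TTL
ABC_TTL = 24 * 3600       # 24-hour ABC TTL

def cached_lookup(redis_client, key, fetch, ttl):
    """Cache-aside: serve from Redis when present, else fetch and setex."""
    hit = redis_client.get(key)
    if hit is not None:
        return json.loads(hit)
    value = fetch()
    redis_client.setex(key, ttl, json.dumps(value))
    return value
```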

Try it

Docs, OpenAPI spec, and the MCP install command are at docs.nexusfeed.dev.

If you want a key, the two products are on RapidAPI:

If you build something on top of either of these and it breaks in an interesting way, I want to hear about it. The whole point of the _verifiability contract is that you should never be guessing whether my data is wrong — but if the contract itself lets you down, that is the most useful bug report I can possibly get.

Follow me here on Dev.to if you want the next writeup, which will be the Averitt hidden JSON API and the Webshare-through-Railway-EU-to-US-carrier proxy story. And if any of this was useful, a ❤️ or a 🦄 on the post genuinely helps a stealth solo founder get discovered.
