Anatomy of an API scrape: reading 251 requests like a crime scene

#security #webdev #api

Last week someone tried to copy my visa API's database. They didn't succeed — they got 0.6% of it before I cut the key — but the 251 requests they left behind are a near-perfect teaching case for what targeted API extraction actually looks like from the defender's side.

Here's the forensic walkthrough.

The target

One endpoint:

GET /api/v1/visa?from={passport}&to={destination}

It returns the visa rule for a passport→destination pair — visa type, allowed stay, conditions. The full matrix is ~39,585 pairs. That matrix is the product.

The evidence

The attacker's requests weren't spread across the map. They were a sweep, one passport at a time:

Passport	Destinations pulled	Coverage
🇦🇪 UAE (ARE)	195	~100% of that passport's matrix
🇦🇺 Australia (AUS)	53	~1/4, interrupted
🇨🇳 China (CHN)	2	test calls

249 unique pairs, near-zero duplicates. Whoever wrote this was methodical: validate that one full passport comes out cleanly, then move to the next.
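A sweep like this is easy to surface once requests are grouped by passport. What follows is a minimal sketch, not the tooling used here: the log shape (a list of passport/destination tuples for one key), the function names, and the 20% threshold are all assumptions for illustration, and the per-passport total of 195 is inferred from the UAE row above.

```python
from collections import defaultdict

# Assumption: roughly how many destinations one passport has in the matrix.
# The UAE row (195 pulled, ~100% coverage) suggests about 195.
DESTINATIONS_PER_PASSPORT = 195


def coverage_by_passport(requests, total=DESTINATIONS_PER_PASSPORT):
    """Return {passport: (unique_destinations, coverage_fraction)} for one key.

    `requests` is an iterable of (passport, destination) tuples taken from
    the request log of a single API key (hypothetical log shape).
    """
    seen = defaultdict(set)
    for passport, destination in requests:
        seen[passport].add(destination)  # a set ignores repeated pairs
    return {p: (len(d), len(d) / total) for p, d in seen.items()}


def looks_like_sweep(requests, threshold=0.2, total=DESTINATIONS_PER_PASSPORT):
    """Flag a key that covered a large share of any single passport's row."""
    report = coverage_by_passport(requests, total)
    return any(fraction >= threshold for _, fraction in report.values())
```

An ordinary user checking a handful of destinations stays far below the threshold; a key that walks a quarter of one passport's row in a few minutes does not.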

Reading the cadence

The timestamps are where a scrape gives itself away. Minute by minute:

11:56   2   ← test phase (incl. the one failure)
11:57   1
11:58   25  ┐
11:59   26  │
12:00   20  │  ~25 req/min, dead regular
  …         │  = one request every ~2.4s
12:07   21  ┘

No human reads visa rules on a 2.4-second metronome for 11 minutes. This is a loop.
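The metronome can be measured rather than eyeballed. One way, sketched below under assumptions, is to look at the gaps between consecutive requests and compute their coefficient of variation (standard deviation divided by mean): a loop with a fixed sleep gives a value near zero, while human traffic is bursty. The function names and both thresholds are hypothetical, and a scraper that adds random jitter would slip past this check.

```python
from statistics import mean, pstdev


def cadence_stats(timestamps):
    """Return (mean_gap_seconds, coefficient_of_variation), or None.

    `timestamps` are request times in seconds (e.g. Unix epoch floats)
    for a single API key.
    """
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return None  # not enough data to judge regularity
    avg = mean(gaps)
    if avg == 0:
        return 0.0, 0.0  # all requests share one timestamp
    return avg, pstdev(gaps) / avg


def looks_scripted(timestamps, min_requests=20, max_cv=0.25):
    """Heuristic: many requests with near-constant spacing suggests a loop."""
    if len(timestamps) < min_requests:
        return False
    stats = cadence_stats(timestamps)
    return stats is not None and stats[1] <= max_cv
```

Fed 50 timestamps spaced 2.4 seconds apart, `looks_scripted` returns `True`; fed the same number of requests with gaps alternating between 1 and 100 seconds, it returns `False`.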

The fingerprint

Four signals. The point isn't nationality; it's that the account details and the request parameters themselves leaked the intent:

Handle: visadb_scraper. It signed its own work.
Email: a throwaway @temp.com address. No intention of receiving anything.
Languages: en + zh, on a product with no Chinese-market surface yet.
Error signature: the very first call (CHN→THA, in Chinese, 11:56:45) failed, then everything ran clean. Classic "calibrating the script" tell.

The math

250 records is 0.6% of the matrix. At 25 req/min, a full dump would've taken ~26 hours. This wasn't a dump; it was a feasibility test. They proved a whole passport comes out easily and had started on the next one when the key was cut, nowhere near the 3,000/month free-tier ceiling.
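The arithmetic is worth checking by hand. The snippet below only reuses the numbers quoted in this post; the last line is an extra back-of-the-envelope figure (how many free-tier keys a full dump would need within one month), and it assumes the 3,000/month cap is enforced per key.

```python
import math

TOTAL_PAIRS = 39_585         # full matrix size
PULLED = 250                 # records extracted before the key was cut
RATE_PER_MIN = 25            # observed request rate
FREE_TIER_PER_MONTH = 3_000  # free-tier ceiling, assumed to apply per key

share = PULLED / TOTAL_PAIRS                       # fraction of the matrix
full_dump_hours = TOTAL_PAIRS / RATE_PER_MIN / 60  # at the observed rate
keys_needed = math.ceil(TOTAL_PAIRS / FREE_TIER_PER_MONTH)

print(f"share pulled:            {share:.2%}")
print(f"full dump at 25 req/min: {full_dump_hours:.1f} h")
print(f"free-tier keys needed:   {keys_needed}")
```

The first two lines reproduce the 0.6% and ~26 hours above. The third comes out at 14 keys, which is why free re-signup matters more than the monthly ceiling.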

What I couldn't see

I blocked the key (active=false) and the sweep stopped. But my request logs didn't capture the source IP — so I could block the key, not the person. Re-signup costs nothing.
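Closing that gap starts with writing the network source into every log line. The sketch below is framework-agnostic and hypothetical: the function name and field names are illustrative, and it assumes `remote_addr` is the peer address reported by your server or by a proxy you control.

```python
import json
import time


def access_log_line(api_key_id, path, query, remote_addr, now=None):
    """Build one JSON log line that records the network source of a request.

    Assumption: `remote_addr` comes from the server or a trusted proxy.
    Headers such as X-Forwarded-For are client-controlled unless a proxy
    you run sets them, so they are deliberately not parsed here.
    """
    record = {
        "ts": time.time() if now is None else now,
        "key_id": api_key_id,  # an identifier for the key, not the secret
        "path": path,
        "query": query,
        "remote_addr": remote_addr,
    }
    return json.dumps(record, sort_keys=True)
```

With the address in the log, a second key signing up from the same source minutes after a block becomes something you can actually see.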

That gap is the real lesson, and it's the subject of the next two posts: rate limits are a cost control, not a security control, and if your logs can't see the network layer, you can't defend at that layer.

If you're shipping a data product: assume the first person who finds it valuable will be the first person who tries to copy it. Instrument for that on day one.
