Last week someone tried to copy my visa API's database. They didn't succeed — they got 0.6% of it before I cut the key — but the 251 requests they left behind are a near-perfect teaching case for what targeted API extraction actually looks like from the defender's side.
Here's the forensic walkthrough.
The target
One endpoint:
GET /api/v1/visa?from={passport}&to={destination}
It returns the visa rule for a passport→destination pair — visa type, allowed stay, conditions. The full matrix is ~39,585 pairs. That matrix is the product.
The evidence
The attacker's requests weren't spread across the map. They were a sweep, one passport at a time:
| Passport | Destinations pulled | Coverage |
|---|---|---|
| 🇦🇪 UAE (ARE) | 195 | ~100% of that passport's matrix |
| 🇦🇺 Australia (AUS) | 53 | ~1/4, interrupted |
| 🇨🇳 China (CHN) | 2 | test calls |
249 unique pairs, near-zero duplicates. Whoever wrote this was methodical: validate that one full passport comes out cleanly, then move to the next.
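A tally like the table above can be rebuilt from request logs by grouping on the `from` parameter and counting distinct `to` values. The snippet below is a minimal sketch on toy data: the `requests` rows, the `coverage_by_passport` helper, and the row size of 4 are all illustrative assumptions, not the real log schema.

```python
from collections import defaultdict

# Hypothetical log rows: (passport, destination), taken from the
# `from` and `to` query parameters of each request.
requests = [
    ("ARE", "THA"), ("ARE", "JPN"), ("ARE", "FRA"),
    ("AUS", "THA"), ("AUS", "THA"),  # a repeated pair
    ("CHN", "THA"),
]

def coverage_by_passport(rows, destinations_per_passport):
    """Return {passport: (unique_destinations, fraction_of_row_covered)}."""
    seen = defaultdict(set)
    for passport, destination in rows:
        seen[passport].add(destination)
    return {
        passport: (len(dests), len(dests) / destinations_per_passport)
        for passport, dests in seen.items()
    }

# Toy row size of 4 destinations per passport, for illustration only.
report = coverage_by_passport(requests, destinations_per_passport=4)
for passport, (unique, fraction) in sorted(report.items()):
    print(f"{passport}: {unique} unique destinations, {fraction:.0%} of row")
```

A passport whose fraction approaches 100% while the others sit near zero is the "sweep" shape described above, as opposed to organic traffic scattered across many passports.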
Reading the cadence
The timestamps are where a scrape gives itself away. Minute by minute:
11:56 2 ← test phase (incl. the one failure)
11:57 1
11:58 25 ┐
11:59 26 │
12:00 20 │ ~25 req/min, dead regular
… │ = one request every ~2.4s
12:07 21 ┘
No human reads visa rules on a 2.4-second metronome for 11 minutes. This is a loop.
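One way to turn "dead regular" into a number is the coefficient of variation of the gaps between consecutive requests: a value near zero means metronome-like spacing. The sketch below runs on synthetic timestamps (one request every 2.4 s, with an arbitrary date); the helper names are hypothetical, and any alert threshold is a judgment call rather than a standard.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean, pstdev

def per_minute_counts(timestamps):
    """Count requests in each HH:MM bucket."""
    return Counter(ts.strftime("%H:%M") for ts in timestamps)

def interval_cv(timestamps):
    """Coefficient of variation of gaps between consecutive requests.

    Near 0 means evenly spaced requests; human browsing tends to be
    much burstier. Returns None when there are too few gaps to judge.
    """
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None
    return pstdev(gaps) / mean(gaps)

# Synthetic data: 50 requests, one every 2.4 s, starting at 11:58:00.
# The date is arbitrary and not taken from the incident.
start = datetime(2024, 1, 1, 11, 58, 0)
synthetic = [start + timedelta(seconds=2.4 * i) for i in range(50)]

counts = per_minute_counts(synthetic)
cv = interval_cv(synthetic)
print(dict(counts))
print(cv)
```

Real scraper traffic will not be perfectly even because of network jitter, so in practice the interesting signal is a low value sustained over many requests, not an exact zero.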
The fingerprint
Four signals — and the point isn't nationality, it's that the request parameters themselves leaked the intent:
- Handle: visadb_scraper. It signed its own work.
- Email: throwaway @temp.com. No intention of receiving anything.
- Languages: en+zh, on a product with no Chinese-market surface yet.
- Error signature: the very first call (CHN→THA, in Chinese, 11:56:45) failed, then everything ran clean. Classic "calibrating the script" tell.
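The four signals above could be checked mechanically at signup or on first use. This is a rough sketch, not a detection system: the `signup_signals` function, its parameters, the keyword pattern, the domain list, and the example address `someone@temp.com` are all assumptions made for illustration.

```python
import re

# Illustrative lists only; tune these for your own product.
DISPOSABLE_DOMAINS = {"temp.com"}
AUTOMATION_WORDS = re.compile(r"scrap|crawl|bot|dump", re.IGNORECASE)

def signup_signals(handle, email, languages, market_languages, first_call_failed):
    """Return a list of human-readable flags; an empty list means nothing stood out."""
    signals = []
    if AUTOMATION_WORDS.search(handle):
        signals.append("handle names automation")
    domain = email.rsplit("@", 1)[-1].lower()
    if domain in DISPOSABLE_DOMAINS:
        signals.append("throwaway email domain")
    if set(languages) - set(market_languages):
        signals.append("requests a language outside current markets")
    if first_call_failed:
        signals.append("failed first call, clean run afterwards")
    return signals

signals = signup_signals(
    handle="visadb_scraper",
    email="someone@temp.com",   # local part is made up
    languages=["en", "zh"],
    market_languages=["en"],
    first_call_failed=True,
)
print(signals)
```

None of these flags is proof on its own. They are cheap to compute, though, and several firing together on a brand-new key is a reasonable trigger for a closer look at that key's traffic.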
The math
250 records is 0.6% of the base. At 25 req/min, a full dump would've taken ~26 hours. This wasn't a dump — it was a feasibility test. They proved a whole passport comes out easily, then stopped, nowhere near the 3,000/month free-tier ceiling.
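The arithmetic behind those figures, worked through as a small script. The constants come from the numbers quoted in this post; the last line adds one derived figure and assumes each request counts once against the quota and every key gets the full free-tier allowance.

```python
TOTAL_PAIRS = 39_585        # size of the full matrix
RECORDS_PULLED = 250        # records extracted before the key was cut
RATE_PER_MIN = 25           # observed request rate
FREE_TIER_MONTHLY = 3_000   # free-tier ceiling per key

share = RECORDS_PULLED / TOTAL_PAIRS
full_dump_hours = TOTAL_PAIRS / RATE_PER_MIN / 60
# Ceiling division: free-tier keys needed to cover the matrix in one month.
keys_needed = -(-TOTAL_PAIRS // FREE_TIER_MONTHLY)

print(f"share pulled: {share:.2%}")                 # 0.63%
print(f"full dump: {full_dump_hours:.1f} hours")    # 26.4 hours
print(f"free-tier keys for a full dump: {keys_needed}")  # 14
```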
What I couldn't see
I blocked the key (active=false) and the sweep stopped. But my request logs didn't capture the source IP — so I could block the key, not the person. Re-signup costs nothing.
That gap is the real lesson, and it's the subject of the next two posts: rate limits are a cost control, not a security control, and if your logs can't see the network, you can't defend at the network layer.
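As a rough illustration of what "seeing the network" could mean at the logging level, here is a sketch of a structured log record that keeps the peer address next to the API key. The `build_log_record` function, its field names, and the key ID are hypothetical, and the IP comes from a range reserved for documentation.

```python
import json
from datetime import datetime, timezone

def build_log_record(api_key_id, path, query, remote_addr, forwarded_for=None):
    """Build one structured log entry that includes network-level fields.

    remote_addr is the peer address the server saw. forwarded_for is the
    raw X-Forwarded-For header value, if any; it is client-supplied unless
    a trusted proxy sets it, so it is stored as-is rather than trusted.
    """
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "api_key_id": api_key_id,
        "path": path,
        "query": query,
        "remote_addr": remote_addr,
        "x_forwarded_for": forwarded_for,
    }

record = build_log_record(
    api_key_id="key_123",                   # made-up identifier
    path="/api/v1/visa",
    query={"from": "ARE", "to": "THA"},
    remote_addr="203.0.113.7",              # documentation-only address
)
print(json.dumps(record))
```

With the address stored per request, a blocked key that reappears under a new signup from the same network becomes visible in the logs, which is exactly what was missing here.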
If you're shipping a data product: assume the first person who finds it valuable will be the first person who tries to copy it. Instrument for that on day one.