What I learned scraping 141 crypto cardholder agreements

#webscraping #fintech #crypto #dataengineering

On 3 February 2026, three unrelated crypto cards (CEX.IO Card, Trustee Plus, and IN1) stopped processing payments on the same day. They had no parent in common. They were not hacked. None of the front-end brands had failed. The only thing they shared was a Polish payment institution whose license had been revoked twelve days earlier by KNF, the Polish financial regulator.

That was the prompt to start a dataset. The question was simple: how many other crypto cards share an underlying issuer that almost no user has ever heard of? Answering it meant tracking down and reading the cardholder agreement for each of 141 cards.

This post is about what that data collection actually looked like — the scraper choices, the failure modes, and what surprised me about the structure of "publicly available" legal documents on the web.

The architecture, in two paragraphs

Most crypto companies don't directly issue payment cards. They rent the right to issue cards from a principal member of Visa or Mastercard. That principal, usually a small or mid-size bank or e-money institution, is the BIN sponsor. The first six digits of the card number (eight under the newer ISO/IEC 7812 numbering) identify it. The brand on the front of the card is a separate company, the program manager, which contracts with both the sponsor and the user.

Three layers, one of them visible. The sponsor is the layer the regulator can actually shut down. When the regulator does, every program manager on that sponsor's BIN goes dark on the same day. From the user's perspective there is no warning, because the user never signed up with the sponsor.
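
A minimal sketch of that dependency as data, assuming a hypothetical `CardProgram` record (the field names are mine, not the dataset's schema). Grouping by sponsor shows which brands would go dark together. The sponsor names are placeholders, and "Other Card" is made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CardProgram:
    """One card brand and the sponsor behind it (illustrative fields)."""
    brand: str    # program manager: the name on the front of the card
    sponsor: str  # BIN sponsor: the principal member the regulator oversees


def programs_by_sponsor(programs):
    """Group brand names under the sponsor they depend on."""
    grouped = defaultdict(list)
    for program in programs:
        grouped[program.sponsor].append(program.brand)
    return {sponsor: sorted(brands) for sponsor, brands in grouped.items()}


def blast_radius(programs, sponsor):
    """Brands that would stop working if `sponsor` were shut down."""
    return programs_by_sponsor(programs).get(sponsor, [])


# Sponsor names are placeholders; "Other Card" is a made-up brand.
EXAMPLE = [
    CardProgram("CEX.IO Card", "Sponsor A"),
    CardProgram("Trustee Plus", "Sponsor A"),
    CardProgram("IN1", "Sponsor A"),
    CardProgram("Other Card", "Sponsor B"),
]

print(blast_radius(EXAMPLE, "Sponsor A"))
```

The user only ever sees the `brand` field, which is why a sponsor-level failure arrives without warning.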

If that sounds familiar — Stripe acquired Bridge in 2025, Coinbase Card runs on Pathward (not Marqeta, which is just the processor), Gnosis Pay runs on Monavate — it is the same pattern at scale.

The scrape

The first plan was naive: a Playwright job that visited each card's /legal or /terms URL, extracted text, and ran a regex for the phrase "issued by [BANK NAME]". This worked for about a third of the dataset.
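
The regex step can be sketched roughly as below. This is an illustration, not the pattern the scraper actually used: it assumes the sponsor name is a run of capitalised words directly after "issued by", and it scans the whole text rather than just the opening section. `find_sponsors` is a hypothetical helper name.

```python
import re

# "issued by" followed by one to six capitalised words.
# This is an assumed pattern and will miss many real phrasings
# (for example names containing periods, or "issuer: X").
ISSUED_BY = re.compile(
    r"\b[Ii]ssued\s+by\s+([A-Z][\w&'-]*(?:\s+[A-Z][\w&'-]*){0,5})"
)


def find_sponsors(text):
    """Return candidate sponsor names in order of first appearance."""
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    seen = []
    for match in ISSUED_BY.finditer(flat):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


sample = (
    "Fees apply.\n\n"
    "The Card is issued by Example Bank Ltd pursuant to licence by Visa."
)
print(find_sponsors(sample))
```

Every hit still needs a human check: the capture stops at the first lowercase word, so it can truncate or over-extend a name.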

The other two-thirds failed in interesting ways:

1. Cardholder agreement is a PDF generated only after KYC. About a dozen cards. The static T&C is a marketing summary; the legally binding agreement is generated at application time with a Lambda. You can't fetch it.
2. Sponsor name is in an appendix, not the first paragraph. A regex that scans the first 500 words misses it. Some cardholder agreements bury "issued by ___" inside a chargeback procedures section, sometimes thousands of words in.
3. Sponsor disclosure was deleted. A handful of cards used to name their sponsor and quietly removed it after the Union54 BIN suspension in 2022. The Wayback Machine still has the old version. The current page doesn't.
4. The page is rendered client-side via a wallet SDK that won't run in headless Chrome. Two cards. Solved by switching to a real Chrome instance with the wallet extension pre-installed.
5. The card's website doesn't include a cardholder agreement at all. Around 25 cards. The agreement exists somewhere (there must be a paper trail, because Visa or Mastercard requires one), but the public-facing site doesn't link to it.
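
For the deleted-disclosure case, the Wayback Machine's availability endpoint can point at the snapshot closest to a given date. A minimal sketch, assuming the page URL and a target date are already known. The function names are mine and the example URL is a placeholder.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp):
    """Build the query URL; timestamp is YYYYMMDD (or longer)."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return f"{WAYBACK_API}?{query}"


def closest_snapshot(payload):
    """Pull the archived URL out of the API's JSON, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"]


def lookup(page_url, timestamp):
    """Network call: ask the Wayback Machine for the nearest snapshot."""
    with urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))


# Offline example with a hand-written payload in the API's response shape.
sample_payload = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20220101000000",
            "url": "http://web.archive.org/web/20220101000000/http://example.com/terms",
        }
    }
}
print(closest_snapshot(sample_payload))
```

A sponsor name recovered this way is evidence about the past, so the snapshot date belongs in the row alongside it.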

For (5), the only reliable signal is the BIN itself. If you can find a forum post or a press release with someone's card number prefix, you can look up which member that prefix is registered to and infer the sponsor. The signal is noisy, but it's better than nothing.
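
A rough sketch of that lookup, assuming you already hold a prefix-to-member table from some BIN source. The table below is entirely made up, and `sponsor_for` is a hypothetical helper. It tries the eight-digit prefix first and falls back to six.

```python
import re


def sponsor_for(card_prefix, bin_table):
    """Return the member registered for a card prefix, or None.

    card_prefix may contain spaces or dashes; only digits are used.
    bin_table maps 6- or 8-digit prefixes (strings) to member names.
    """
    digits = re.sub(r"\D", "", card_prefix)
    for length in (8, 6):  # prefer the more specific prefix
        if len(digits) >= length and digits[:length] in bin_table:
            return bin_table[digits[:length]]
    return None


# Made-up prefixes and names, for illustration only.
FAKE_BIN_TABLE = {
    "999999": "Sponsor A",
    "99999812": "Sponsor B",
}

print(sponsor_for("9999 9912 34", FAKE_BIN_TABLE))
```

A match only says which member the range is registered to, so a row sourced this way should carry a lower confidence label than one backed by a cardholder agreement.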

What ended up in the dataset

After two passes (one scraped, one manual cross-check), each card got one of four confidence labels:

HIGH (~79 cards): sponsor name verbatim from a publicly fetched T&C, on a date recorded with the row.
MEDIUM (~34 cards): sponsor named in an older snapshot, press release, or regulator filing — but the current public page doesn't repeat it.
CIRCUMSTANTIAL (~25 cards): inferred from program-manager naming or industry partnerships. Treated as an upper-bound estimate, not fact.
UNKNOWN (~3 cards): best guess, flagged for follow-up.
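
The labelling rule can be written down as a small function. This is a sketch of the rule as described above, not the dataset's actual code. The evidence tags (`current_tc`, `old_snapshot`, and so on) are names I made up for the example.

```python
from enum import Enum


class Confidence(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    CIRCUMSTANTIAL = "CIRCUMSTANTIAL"
    UNKNOWN = "UNKNOWN"


# Hypothetical evidence tags for sources that name the sponsor
# somewhere other than the current public T&C.
SECONDARY = {"old_snapshot", "press_release", "regulator_filing"}


def label(evidence):
    """Pick the strongest label the collected evidence supports."""
    evidence = set(evidence)
    if "current_tc" in evidence:   # verbatim in a publicly fetched T&C
        return Confidence.HIGH
    if evidence & SECONDARY:       # named, but not on the current page
        return Confidence.MEDIUM
    if "inferred" in evidence:     # naming patterns or partnerships
        return Confidence.CIRCUMSTANTIAL
    return Confidence.UNKNOWN      # best guess, flagged for follow-up


print(label({"old_snapshot", "inferred"}).value)
```

Checking the strongest evidence first means a row can only move up a tier when better evidence is added, never by accident.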

If you've built data products before, this part will be familiar. The interesting wrinkle is that the legal disclosure regime varies wildly by jurisdiction. US and EU cards almost always name the sponsor verbatim. APAC programs frequently do not. African and LatAm programs have actively removed the disclosure since 2022, because Union54's BIN suspension that year created a contagion risk: if the regulator suspends your sponsor for someone else's fraud, you want to keep your association with that sponsor quiet.

That asymmetry — disclosure norms diverging across regions — is itself a structural fact about the market. It is not a dataset cleanliness problem.

What the data shows once you have it

Globally, the Herfindahl-Hirschman Index (HHI) across all 141 cards is around 400 to 500, which falls in the range the US DOJ calls "unconcentrated." That number is misleading. Once you split by region and product type, which is the actual choice a user faces when picking a card, the picture inverts.

US self-custody stablecoin cards: HHI around 5,000 to 6,300 depending on how you count circumstantial attribution. Two banks (Third National in Tennessee, Lead Bank in Missouri) anchor roughly two-thirds of issuance. EU/UK self-custody is even more concentrated: a single sponsor (Monavate, owned by Exodus via Baanx since 1 May 2026) anchors most of the segment.
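
For reference, HHI is the sum of squared market shares, with shares expressed in percent, so it runs from near 0 (many tiny players) to 10,000 (a monopoly). The sketch below assumes shares are measured by number of cards per sponsor, which is my reading of the figures above rather than something stated in the methodology. The sponsor letters are placeholders, not rows from the dataset.

```python
from collections import Counter


def hhi(sponsors):
    """HHI from a list with one sponsor name per card.

    Share of each sponsor = its card count / total cards, in percent.
    """
    counts = Counter(sponsors)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one card")
    return sum((100 * n / total) ** 2 for n in counts.values())


# Two sponsors splitting a segment evenly: 50^2 + 50^2.
print(hhi(["A", "B"]))

# Six cards, two sponsors holding a third each, two holding a sixth each.
print(round(hhi(["A", "A", "B", "B", "C", "D"])))
```

Running the same function on the full list and then on each region and product slice is what produces the inversion: a long tail of sponsors globally, but only a few inside any one segment.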

If you want to look at the per-card data, the methodology, or the per-row source URLs, the dataset is at sweepbase.net/dataset and the full write-up of the concentration findings is at sweepbase.net/research/bin-sponsor-concentration-2026. Both are CC-BY for academic and journalistic use.

What I'd build differently next time

Three concrete things, for anyone trying to do this kind of dataset:

1. Don't trust a single fetch. Spot-audit on the day of publication. Of 32 cards I re-checked on 16 May 2026, only 14 were verbatim re-verifiable. The rest had been edited since the original scrape. The dataset now schedules quarterly re-fetches.
2. Track which jurisdiction's regulator can shut each sponsor down. Most public BIN datasets are jurisdiction-blind. For risk analysis, that's a critical missing column. KNF can shut down a Polish sponsor in twelve days. The FCA cannot. Knowing which is which changes the risk-weighting of each card.
3. Distinguish sponsor from processor from program manager. The single most common error in casual coverage of crypto cards, repeated in trade press for years, is calling Marqeta the "issuer" of Coinbase Card. Marqeta is the processor. The actual sponsor (Pathward) doesn't appear in 95% of articles about the card. Different roles, different regulators, different failure modes.
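
The re-check in the first point can be sketched as two tests per card: has the page text changed since the stored fetch, and does the sponsor name still appear verbatim? The helper names are mine, and the whitespace and case normalisation is an assumption about what should count as "the same text".

```python
import hashlib
import re


def normalise(text):
    """Collapse whitespace and case so cosmetic edits don't count."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def fingerprint(text):
    """Stable hash of the normalised text, stored with each fetch."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()


def recheck(sponsor, old_text, new_text):
    """Compare a fresh fetch against the stored one for a single card."""
    return {
        "changed": fingerprint(old_text) != fingerprint(new_text),
        "verbatim": normalise(sponsor) in normalise(new_text),
    }


old = "This card is issued by Example Bank Ltd."
new = "This card is issued by our banking partner."
print(recheck("Example Bank Ltd", old, new))
```

A card that comes back changed and no longer verbatim is a candidate for downgrading from HIGH to MEDIUM, with the old fetch kept as the source.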

Closing

Most "crypto card competition" coverage treats the front-of-card brands as substitutable when, behind the scenes, two different brands are often two skins on the same regulated entity. That doesn't matter — until the regulator pulls the sponsor's license, and three programs go dark on a Tuesday.

The dataset is open. Corrections welcome.

Originally published on Sweepbase Research. I run Sweepbase, an independent crypto-card comparison and research project tracking 141 active cards across regions, networks, and BIN sponsors.