DEV Community: SweepBase

What I learned scraping 141 crypto cardholder agreements

SweepBase — Wed, 20 May 2026 19:25:17 +0000

On 3 February 2026, three unrelated crypto cards (CEX.IO Card, Trustee Plus, and IN1) stopped processing payments on the same day. They had no parent in common. They were not hacked. None of the front-end brands had failed. The only thing they shared was a Polish payment institution whose license had been revoked twelve days earlier by the KNF, Poland's financial regulator.

That was the prompt to start a dataset. The question was simple: how many other crypto cards share an underlying issuer that almost no user has ever heard of? Answering it required reading roughly 141 cardholder agreements.

This post is about what that data collection actually looked like — the scraper choices, the failure modes, and what surprised me about the structure of "publicly available" legal documents on the web.

The architecture, in two paragraphs

Most crypto companies don't directly issue payment cards. They rent the right to issue cards from a principal member of Visa or Mastercard. That principal, usually a small or mid-size bank or e-money institution, is the BIN sponsor. The leading digits of the card number (the BIN: historically six digits, now often eight) identify them. The brand on the front of the card is a separate company, the program manager, which contracts with both the sponsor and the user.

Three layers, one of them visible. The sponsor is the layer the regulator can actually shut down. When the regulator does, every program manager on that sponsor's BIN goes dark on the same day. From the user's perspective there is no warning, because the user never signed up with the sponsor.

If that sounds familiar — Stripe acquired Bridge in 2025, Coinbase Card runs on Pathward (not Marqeta, which is just the processor), Gnosis Pay runs on Monavate — it is the same pattern at scale.
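
The three layers can be written down as a small data model. What follows is a minimal sketch, not the dataset's actual schema: the type and function names are hypothetical, the Coinbase Card and Gnosis Pay rows use the sponsors named above, and the shared Polish sponsor is a placeholder string because this post does not name it.

```typescript
// Hypothetical data model for the three layers; names are illustrative.
interface CardProgram {
  brand: string;      // program manager: the name on the front of the card
  sponsor: string;    // BIN sponsor: the regulated issuer behind it
  processor?: string; // optional: processes transactions, does not issue
}

const programs: CardProgram[] = [
  { brand: "CEX.IO Card", sponsor: "polish-sponsor-placeholder" },
  { brand: "Trustee Plus", sponsor: "polish-sponsor-placeholder" },
  { brand: "IN1", sponsor: "polish-sponsor-placeholder" },
  { brand: "Coinbase Card", sponsor: "Pathward", processor: "Marqeta" },
  { brand: "Gnosis Pay", sponsor: "Monavate" },
];

// Every brand that stops working if the given sponsor loses its license.
function brandsExposedTo(sponsor: string, all: CardProgram[]): string[] {
  return all.filter((p) => p.sponsor === sponsor).map((p) => p.brand);
}

const dark = brandsExposedTo("polish-sponsor-placeholder", programs);
```

Here `dark` holds the three brands from the opening paragraph. Querying for "Marqeta" returns nothing, because a processor is not a sponsor.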

The scrape

The first plan was naive: a Playwright job that visited each card's /legal or /terms URL, extracted text, and ran a regex for the phrase "issued by [BANK NAME]". This worked for about a third of the dataset.
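
The extraction step of that first plan might look roughly like the sketch below. It is an illustration rather than the scraper's real code: the regex, the function name `findSponsor`, and the optional `maxWords` cap are all assumptions, and the Playwright fetch is left out so the matching logic stands alone.

```typescript
// Matches "issued by <Name>" and captures a capitalized name, stopping at
// punctuation, an opening parenthesis, or the words "pursuant" / "under".
const ISSUED_BY =
  /\b[Ii]ssued by\s+([A-Z][\w&' -]*?)(?=\s*(?:[,.;(]|\bpursuant\b|\bunder\b|$))/;

// Returns the captured sponsor name, or null when the phrase is absent.
// maxWords (hypothetical parameter) limits the scan to the first N words.
function findSponsor(text: string, maxWords?: number): string | null {
  const scope =
    maxWords === undefined
      ? text
      : text.split(/\s+/).slice(0, maxWords).join(" ");
  const match = ISSUED_BY.exec(scope);
  return match && match[1] ? match[1].trim() : null;
}
```

For a sentence such as "This card is issued by Pathward, N.A., pursuant to a license from Visa." the function returns "Pathward". A word cap keeps the scan cheap, but it also returns null for any agreement that names the issuer later in the text.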

The other two-thirds failed in interesting ways:

1. Cardholder agreement is a PDF generated only after KYC. About a dozen cards. The static T&C is a marketing summary; the legally binding agreement is generated at application time with a Lambda. You can't fetch it.
2. Sponsor name is in an appendix, not the first paragraph. A regex that scans the first 500 words misses it. Some cardholder agreements bury "issued by ___" inside a chargeback procedures section, sometimes thousands of words in.
3. Sponsor disclosure was deleted. A handful of cards used to name their sponsor and quietly removed it after the Union54 BIN suspension in 2022. The Wayback Machine still has the old version. The current page doesn't.
4. The page is rendered client-side via a wallet SDK that won't run in headless Chrome. Two cards. Solved by switching to a real Chrome instance with the wallet extension pre-installed.
5. The card's website doesn't include a cardholder agreement at all. Around 25 cards. The agreement exists somewhere (there must be a paper trail because Visa or Mastercard requires one), but the public-facing site doesn't link to it.

For (5), the only reliable signal is the BIN itself. If you can find a forum post or a press release with someone's card number prefix, you can look up which member that prefix is registered to and infer the sponsor. The signal is noisy, but it's better than nothing.
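
A prefix lookup of that kind could be sketched as follows. The registry contents are made up for illustration (real prefix assignments come from the card networks and are not reproduced here), and `inferSponsor` is a hypothetical helper name.

```typescript
// Hypothetical registry: BIN prefix -> principal member. Both the prefixes
// and the member names are invented for this example.
const binRegistry: Record<string, string> = {
  "41234567": "Example Bank A",
  "512345": "Example EMI B",
};

// Strip spaces and dashes, then try the 8-digit prefix before the 6-digit one.
function inferSponsor(
  cardNumberPrefix: string,
  registry: Record<string, string>
): string | null {
  const digits = cardNumberPrefix.replace(/\D/g, "");
  for (const length of [8, 6]) {
    if (digits.length < length) continue;
    const hit = registry[digits.slice(0, length)];
    if (hit !== undefined) return hit;
  }
  return null;
}
```

Trying the longer prefix first matters because a single six-digit range can be split among several eight-digit assignments. A null result means the row stays unattributed; it should not be treated as a guess.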

What ended up in the dataset

After two passes (one scraped, one manual cross-check), each card got one of four confidence labels:

HIGH (~79 cards): sponsor name verbatim from a publicly fetched T&C, on a date recorded with the row.
MEDIUM (~34 cards): sponsor named in an older snapshot, press release, or regulator filing — but the current public page doesn't repeat it.
CIRCUMSTANTIAL (~25 cards): inferred from program-manager naming or industry partnerships. Treated as an upper-bound estimate, not fact.
UNKNOWN (~3 cards): best guess, flagged for follow-up.
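
The four labels amount to a "strongest evidence wins" rule. Here is a minimal sketch of that rule, assuming three boolean evidence flags per row; the `Evidence` shape and the `labelFor` name are hypothetical, not the dataset's actual columns.

```typescript
type Confidence = "HIGH" | "MEDIUM" | "CIRCUMSTANTIAL" | "UNKNOWN";

// Hypothetical evidence flags for one card row.
interface Evidence {
  namedInCurrentTerms: boolean;  // sponsor appears verbatim in the T&C as fetched
  namedInOlderSource: boolean;   // older snapshot, press release, or regulator filing
  inferredFromPartners: boolean; // program-manager naming or industry partnerships
}

// Strongest available evidence wins; anything else is flagged for follow-up.
function labelFor(e: Evidence): Confidence {
  if (e.namedInCurrentTerms) return "HIGH";
  if (e.namedInOlderSource) return "MEDIUM";
  if (e.inferredFromPartners) return "CIRCUMSTANTIAL";
  return "UNKNOWN";
}
```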

If you've built data products before, this part will be familiar. The interesting wrinkle is that the legal disclosure regime varies wildly by jurisdiction. US and EU cards almost always name the sponsor verbatim. APAC programs frequently do not. African and LatAm cards have actively removed the disclosure since 2022, because Union54's BIN suspension that year created a contagion risk: if the regulator suspends your sponsor for someone else's fraud, you do not want customers associating your card with that sponsor.

That asymmetry — disclosure norms diverging across regions — is itself a structural fact about the market. It is not a dataset cleanliness problem.

What the data shows once you have it

Globally, the Herfindahl-Hirschman Index across all 141 cards is around 400 to 500, well inside the range the US DOJ treats as "unconcentrated." That number is misleading. Once you split by region and product type, which is the actual choice a user faces when picking a card, the picture inverts.

US self-custody stablecoin cards: HHI around 5,000 to 6,300 depending on how you count circumstantial attribution. Two banks (Third National in Tennessee, Lead Bank in Missouri) anchor roughly two-thirds of issuance. EU/UK self-custody: even worse. A single sponsor (Monavate, owned by Exodus via Baanx since 1 May 2026) anchors most of the segment.
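
For readers who have not computed an HHI before: it is the sum of squared market shares, with shares expressed in percent, so a monopoly scores 10,000. The sketch below assumes every card counts equally (share = cards on a sponsor divided by cards in the segment); that weighting is an assumption of this example, and the two sample segments are made up rather than taken from the dataset.

```typescript
// HHI = sum of squared market shares, with shares in percent (max 10,000).
// Assumption: each card counts equally toward its sponsor's share.
function hhi(sponsors: string[]): number {
  const counts: Record<string, number> = {};
  for (const s of sponsors) counts[s] = (counts[s] || 0) + 1;
  let total = 0;
  for (const name of Object.keys(counts)) {
    const sharePct = (counts[name] * 100) / sponsors.length;
    total += sharePct * sharePct;
  }
  return total;
}

// Made-up segments, one array entry per card, naming that card's sponsor.
const spreadOut = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"];
const concentrated = ["A", "A", "A", "A", "A", "A", "A", "B", "C", "D"];
```

`hhi(spreadOut)` is 1,000 (ten sponsors at 10% each), while `hhi(concentrated)` is 5,200 (one sponsor at 70%, three at 10%). Pooling many segments that have different dominant sponsors pulls the global figure down, which is why the all-cards number hides the per-segment concentration.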

If you want to look at the per-card data, the methodology, or the per-row source URLs, the dataset is at sweepbase.net/dataset and the full write-up of the concentration findings is at sweepbase.net/research/bin-sponsor-concentration-2026. Both are CC-BY for academic and journalistic use.

What I'd build differently next time

Three concrete things, for anyone trying to do this kind of dataset:

Don't trust a single fetch. Spot-audit on the day of publication. Of 32 cards I re-checked on 16 May 2026, only 14 were verbatim re-verifiable. The rest had been edited since the original scrape. The dataset now schedules quarterly re-fetches.
Track which jurisdiction's regulator can shut each sponsor down. Most public BIN datasets are jurisdiction-blind. For risk analysis, that's a critical missing column. KNF can shut down a Polish sponsor in twelve days. The FCA cannot. Knowing which is which changes the risk-weighting of each card.
Distinguish sponsor from processor from program manager. The single most common error in casual coverage of crypto cards — repeated in trade press for years — is calling Marqeta the "issuer" of Coinbase Card. Marqeta is the processor. The actual sponsor (Pathward) doesn't appear in 95% of articles about the card. Different roles, different regulators, different failure modes.
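
The spot-audit in the first point can be reduced to a string check per row. This is a sketch under stated assumptions: each row stores the exact quote that named the sponsor, the page text has already been re-fetched, and the function names are hypothetical.

```typescript
// Collapse whitespace and case so cosmetic edits don't count as changes.
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

// True when the quote recorded at scrape time still appears in today's page text.
function stillVerbatim(recordedQuote: string, currentPageText: string): boolean {
  return normalize(currentPageText).indexOf(normalize(recordedQuote)) !== -1;
}

// Share of audited rows that still re-verify, e.g. 14 of 32.
function reverifiedShare(results: boolean[]): number {
  if (results.length === 0) return 0;
  return results.filter((ok) => ok).length / results.length;
}
```

With 14 passes out of 32 audited rows, `reverifiedShare` returns 0.4375. Rows that fail the check are candidates for a downgrade from HIGH to MEDIUM rather than for deletion.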

Closing

Most "crypto card competition" coverage treats the front-of-card brands as substitutable when, behind the scenes, two different brands are often two skins on the same regulated entity. That doesn't matter — until the regulator pulls the sponsor's license, and three programs go dark on a Tuesday.

The dataset is open. Corrections welcome.

Originally published on Sweepbase Research. I run Sweepbase, an independent crypto-card comparison and research project tracking 141 active cards across regions, networks, and BIN sponsors.

How I Maintained an Awesome-List of 136 Crypto Cards as a CI-Linted Dataset

SweepBase — Thu, 07 May 2026 08:29:34 +0000

Last month I open-sourced awesome-crypto-cards — a curated list of 136 crypto debit and credit cards. This post is about the boring infrastructure: why I run awesome-lint in CI, how I keep the list synced with the dataset behind sweepbase.net, and where I underestimated effort.

Why a flat README, not a database

The list lives as a single README.md. No JSON, no YAML, no static site. People who land on a GitHub awesome-list expect to scan markdown, not click into an interactive viewer.

Trade-offs I accepted: no programmatic queries, no filtering UI, no auto-generated content.

Trade-offs I avoided: an extra build step, broken links from generator bugs, and the friction of "wait, where do I edit this?"

The awesome-lint CI

Every push runs awesome-lint via GitHub Actions. It catches:

Duplicate URLs (you'd be surprised)
Links missing https://
Markdown formatting that breaks GitHub's renderer
Broken anchor references in the contents section
Categories that don't sort alphabetically

# .github/workflows/main.yml
name: Awesome Lint
on: [push, pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '22' }
      - run: npm ci
      - run: npx awesome-lint

The lint config is the strictest version (no-emoji). I keep it that way because the goal is acceptance into other awesome-list registries down the line, and they reject any list that fails their own awesome-lint pass.

Keeping it synced with the source

The dataset behind sweepbase.net is a CSV of 141 rows. Five of those are pre-launch products (waitlist, "in development" custody, "TBA" network) — the README rule is "shipping only," so the README count is 136.

The diff between CSV and README runs as a small Node script:

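// `cards` (parsed CSV rows) and `readme` (README.md contents) are loaded earlier in the script.
const csvNames = new Set(cards.map(c => c['Card Service'].trim()));

// Collect every card name the README links to under sweepbase.net/cards/.
const readmeNames = new Set();
const re = /- \[([^\]]+)\]\(https:\/\/sweepbase\.net\/cards\//g;
let m;
while ((m = re.exec(readme)) !== null) readmeNames.add(m[1].trim());

// Names present in the CSV but missing from the README.
const inCsvNotReadme = [...csvNames].filter(n => !readmeNames.has(n));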

Each time I add a card to the dataset, this tells me what's missing in the README, and I add it manually. Manual is fine because it's once a week at most.

What I underestimated

Alphabetical filter sections. Each region/custody/use-case section repeats card names. Adding one new card means editing 4-5 lists. I have a script in mind but haven't built it.
The "Related Lists" section. The other awesome-lists in the crypto/defi space are mostly stale (2-3 years since update). Including them feels honest but reduces the list's perceived freshness.
Star farming. Two-week organic plan, 23 days later, 1 star. Reality check: the list needs distribution, not just existence.
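
The unbuilt script from the first point mostly needs one operation: place a new list item at its alphabetical position in an already-sorted section. Below is a rough sketch of that step only. The function names are hypothetical, the sample items are invented, and reading and writing README.md is left out.

```typescript
// Extract the link text from a list item such as "- [Name](url) - description".
function itemName(line: string): string {
  const m = /^- \[([^\]]+)\]/.exec(line);
  return m && m[1] ? m[1].toLowerCase() : line.toLowerCase();
}

// Return a new list with newItem placed in case-insensitive alphabetical order.
// Assumes `items` is already sorted.
function insertSorted(items: string[], newItem: string): string[] {
  const key = itemName(newItem);
  let index = items.length;
  for (let i = 0; i < items.length; i++) {
    if (itemName(items[i]) > key) {
      index = i;
      break;
    }
  }
  return items.slice(0, index).concat([newItem], items.slice(index));
}
```

Calling this once per section that should include the card would turn the 4-5 manual edits into one command, with the lint run still acting as the check on ordering.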

If you're building an awesome-list, the lint+CI part is fast. The interesting work is keeping it honest as the underlying space changes.

Repo: https://github.com/mbtrilla/awesome-crypto-cards

Three months of running a Next.js aggregator on a CSV: what broke and what did not

SweepBase — Wed, 06 May 2026 12:22:11 +0000

I shipped a 141-row crypto card comparison site on a public CSV instead of a database back in February, and I want to write down what I have learned three months in. The earlier posts covered why I picked CSV (why a CSV beats a database for this) and what I would do differently on the architecture side (six lessons-learned from shipping a Next.js 15 + CSV side project). This is the operational version.

What broke

ISR cache went stale faster than I expected. Setting revalidate = 86400 on card detail pages felt safe in dev. In production, when I edited the CSV and pushed, the new content took up to 24 hours to surface on cold pages because Vercel only revalidates on traffic. I added a /api/revalidate webhook that I hit from a small script after every CSV change. That fixed the lag, but it adds a step I forget half the time.
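
The core of such a webhook is small. The sketch below keeps it framework-free so it can run on its own: `revalidate` stands in for `revalidatePath` from `next/cache`, which a real route handler would pass in. The shared-secret check and all names here are assumptions, not the site's actual code.

```typescript
interface RevalidateResult {
  status: number;
  revalidated: string[];
}

// `revalidate` is injected so the logic runs without Next.js; in a route
// handler it would be revalidatePath from next/cache.
function handleRevalidate(
  providedSecret: string | null,
  expectedSecret: string,
  paths: string[],
  revalidate: (path: string) => void
): RevalidateResult {
  // Reject callers that do not present the shared secret (an assumed safeguard).
  if (providedSecret === null || providedSecret !== expectedSecret) {
    return { status: 401, revalidated: [] };
  }
  paths.forEach((p) => revalidate(p));
  return { status: 200, revalidated: paths };
}
```

Running the call from a post-push hook or a CI step, rather than a script invoked by hand, would remove the step that is easy to forget.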

PapaParse parsing in a Server Component blew up once when a column contained a comma inside quoted text and the quoting was wrong. Zod validation caught the malformed row, but I had 20 minutes of "is the entire site broken" panic before I read my own logs. Lesson: always log the failing row before throwing.
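
"Log the failing row before throwing" might look like the following. To stay runnable without dependencies, this sketch swaps the Zod schema for a hand-written check on two fields (`service` and `fxMargin`, with the 0 to 10 bound borrowed from the schema shown in the Next.js 15 post). The function names and the header-on-line-1 assumption are hypothetical.

```typescript
interface RawRow {
  [column: string]: string;
}

// Dependency-free stand-in for the Zod check: returns a list of problems.
function problemsIn(row: RawRow): string[] {
  const problems: string[] = [];
  const service = row["service"];
  if (service === undefined || service.trim() === "") problems.push("service is empty");
  const raw = row["fxMargin"];
  const fx = raw === undefined || raw.trim() === "" ? NaN : Number(raw);
  if (isNaN(fx) || fx < 0 || fx > 10) problems.push("fxMargin must be 0..10");
  return problems;
}

// Validate every row; on failure, log the line number and raw row, then throw.
function validateAll(rows: RawRow[], log: (msg: string) => void): void {
  rows.forEach((row, i) => {
    const problems = problemsIn(row);
    if (problems.length > 0) {
      const line = i + 2; // assumes the header is line 1 of the CSV
      log("row " + line + " failed: " + problems.join("; ") + " | " + JSON.stringify(row));
      throw new Error("invalid CSV row " + line);
    }
  });
}
```

The log line carries the CSV line number and the raw values, so the first thing in the logs identifies which row to fix.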

Image proxy started rate-limiting. I serve card images via /api/image-proxy with a 7-day cache. About six weeks in, I noticed Google Drive started throttling requests from Vercel egress IPs. Cache hit rate dropped, latency went up. I now host all new card images locally as .webp and only fall back to Drive for legacy entries.

What did not break

The catalog itself. 141 rows in a CSV is below any threshold where you actually need a database. Greps are instant in CI, the file diffs cleanly in PRs, and contributors can read it without a SQL client. I have not regretted this once.

Filter functions as predicates. Every category on the site is a single function (card: Card) => boolean in one file. When I needed to add a new category (Brazil, USDC, self-custody), it was a one-line export. Reading a meta post on the editorial layer of a comparison site made me realize this was the architectural choice that made the most editorial work feel cheap.

Zod schemas as the source of truth. Card type, validation, defaults all in one place. I have refactored the card model three times now and the migration was always trivial because the schema was the contract.

What I would copy on a new project

Start with a CSV. Move to a database only when you have evidence the CSV is the bottleneck. For three months of traffic and 141 rows, mine never was.

If you want the live result, the catalog is at sweepbase.net and the comparison methodology piece is on Telegraph. There is also a follow-up note on the founder-pitch lens that complements this operational view.

What I learned shipping a Next.js 15 + CSV side project

SweepBase — Thu, 30 Apr 2026 10:48:50 +0000

I shipped a small side project this year: sweepbase.net, a comparison site for crypto debit and credit cards. 139 cards, no DB, the whole dataset is one CSV file in the repo.

Here are the things I'd actually tell another dev about it.

CSV beats a DB more often than people admit

The whole catalog is data.csv, parsed at boot, validated with Zod. Reads outnumber writes by something like 10,000 to 1, and most "writes" are me fixing a number once a month.

For that load profile, a database is theatre. CSV in a public repo gives me:

One source of truth, version controlled
Diff-able commits when I change a number
No admin UI to build
An auditable timeline anybody can inspect

When somebody asks "why did you change Crypto.com APY", I link the commit. That answer is more reassuring than any dashboard.

Zod earns its rent

Zod's schema does double duty: it validates at boot, and it generates the TypeScript type via z.infer. One source for shape, no drift between runtime and compile time.

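import { z } from 'zod';

// One schema drives both runtime validation and the compile-time type.
const CardSchema = z.object({
  service: z.string().min(1),           // card name, must be non-empty
  fxMargin: z.number().min(0).max(10),  // bounded between 0 and 10
  atmFee: z.number().min(0),            // non-negative
});
// Derive the TypeScript type from the schema so the two cannot drift.
export type Card = z.infer<typeof CardSchema>;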

If a row in the CSV is malformed, the build fails. I never ship broken data without knowing.

ISR is the right default for content sites

Next.js 15.1 App Router with export const revalidate = 3600 on every page. The data changes a few times a week. There is no reason to re-render on every request. Lighthouse stays at 100 across the catalog because the rendered HTML is essentially static, and the framework regenerates it at most once an hour, on the next request after the window expires.

I had to fight the urge to reach for SSR or client-side fetching. Neither belongs here.

React.cache() is underrated

Multiple components in a single page render call the same getCards() function. Without React.cache(), the CSV gets parsed once per call site. Wrapped in React.cache(), it parses once per request. Easy 10x latency win that I almost missed.
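
The effect is easy to see with a parse counter. The sketch below does not use React: `memoize` is a plain closure standing in for `cache` from `react`, and unlike the real thing it is not scoped to a single server request. `parseCsv` is a hypothetical stand-in for reading data.csv.

```typescript
let parseCount = 0;

// Stand-in for reading and parsing data.csv; counts how often it runs.
function parseCsv(): string[] {
  parseCount += 1;
  return ["Example Card A", "Example Card B"];
}

// Minimal memoizer for a zero-argument function. React's cache() additionally
// scopes the memo to one server request; this sketch does not.
function memoize<T>(fn: () => T): () => T {
  let done = false;
  let value: T | undefined;
  return () => {
    if (!done) {
      value = fn();
      done = true;
    }
    return value as T;
  };
}

const getCards = memoize(parseCsv);

// Three "components" asking for the same data during one render pass.
const headerCards = getCards();
const tableCards = getCards();
const footerCards = getCards();
```

After the three calls, `parseCount` is 1 and all three variables point at the same array. Without the wrapper, each call would parse again.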

Filters as predicates beats SQL for small data

37 category pages (USA, no-KYC, self-custody, travel, and so on), all rendered from the same Server Component. The category-specific logic lives in lib/filters.ts:

export const isSelfCustody = (card: Card) => card.custody === 'self';
export const isUSACompatible = (card: Card) => card.regions.includes('USA');

Adding a new category page is a 6-line PR: filter, slug, name. No migration, no index to remember.
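
One way the "filter, slug, name" triple could be wired to a shared page is a small registry. This is a sketch with assumed names (`Category`, `categories`, `cardsFor`) and a trimmed-down `Card` shape; the real type comes from the Zod schema, and the display names are invented.

```typescript
// Trimmed-down card shape for the example only.
interface Card {
  service: string;
  custody: string;
  regions: string[];
}

interface Category {
  slug: string;
  name: string;
  filter: (card: Card) => boolean;
}

const isSelfCustody = (card: Card) => card.custody === "self";
const isUSACompatible = (card: Card) => card.regions.indexOf("USA") !== -1;

// Hypothetical registry: one entry per category page.
const categories: Category[] = [
  { slug: "self-custody", name: "Self-custody cards", filter: isSelfCustody },
  { slug: "usa", name: "Cards available in the USA", filter: isUSACompatible },
];

// What a shared page component would call with the slug from the URL.
function cardsFor(slug: string, cards: Card[]): Card[] {
  let match: Category | undefined;
  for (const c of categories) if (c.slug === slug) match = c;
  return match ? cards.filter(match.filter) : [];
}
```

Under this layout a new category is one predicate plus one registry entry, which is roughly the size of PR described above.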

What I would do differently

Start the public CSV from day one. I used Notion for the first month, lost a week porting it.
Set up Sentry before shipping, not after the first ghost bug report.
Write the report-error button in week 1. Real user reports caught more bad data than my own auditing.

Where to look

Live: sweepbase.net
Dataset: /datasets/data.csv
Calculator: /calculator

If you want to see the schema or argue with one of my ratings, both are public. The CSV is the source of truth.