Ava Torres
I built 59 scrapers for US government data and most of them were a waste of time

I've spent the last few months building web scrapers for US government databases. Secretary of State business filings, SEC EDGAR, FDA clearances, OSHA inspections, court records, contractor licenses -- 59 of them total, all in Go, all deployed on Apify.

Most of them were a waste of my time. Here's what I actually learned.

The government data ecosystem is weirdly split

There are two kinds of government data sources and the experience of scraping them is completely different.

The good ones have real REST APIs with JSON responses, pagination, rate limits, and actual documentation. SEC EDGAR, NIH RePORTER, ClinicalTrials.gov, the NVD CVE database, Regulations.gov -- these are legit. You write a Go HTTP client, handle pagination, and you're done in a day. Some of them are genuinely well-designed APIs that private companies should be embarrassed by.
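For the well-behaved sources, the whole scraper really is just this loop. Here's a minimal sketch of the pagination pattern -- the `searchPage` fields and the `from` query parameter are illustrative placeholders, not the exact EDGAR schema, so map them to whatever the real response looks like:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// searchPage mirrors the shape of a typical paginated JSON response:
// a total count plus a slice of hits. Field names are illustrative.
type searchPage struct {
	Total int               `json:"total"`
	Hits  []json.RawMessage `json:"hits"`
}

// nextOffset returns the offset for the following request and whether
// another page is needed. An empty page always stops the loop.
func nextOffset(offset, got, total int) (int, bool) {
	next := offset + got
	return next, got > 0 && next < total
}

// fetchAll walks a paginated endpoint until the total is exhausted.
// baseURL and the "from" parameter are placeholders; real APIs differ.
func fetchAll(client *http.Client, baseURL string) ([]json.RawMessage, error) {
	var all []json.RawMessage
	offset := 0
	for {
		resp, err := client.Get(fmt.Sprintf("%s?from=%d", baseURL, offset))
		if err != nil {
			return nil, err
		}
		var page searchPage
		err = json.NewDecoder(resp.Body).Decode(&page)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}
		all = append(all, page.Hits...)
		next, more := nextOffset(offset, len(page.Hits), page.Total)
		if !more {
			return all, nil
		}
		offset = next
		time.Sleep(200 * time.Millisecond) // stay polite; EDGAR caps you at 10 req/s
	}
}

func main() {
	client := &http.Client{Timeout: 15 * time.Second}
	_ = client // wire fetchAll up to a real endpoint (and set a User-Agent for EDGAR)
	off, more := nextOffset(0, 100, 250)
	fmt.Println(off, more) // 100 true
}
```

Handle the offset math in a pure function like `nextOffset` and the retry/timeout behavior in the client, and the per-source code shrinks to a struct definition and a URL.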

The bad ones are ASP.NET forms from 2004 with ViewState tokens, session cookies that expire every 3 minutes, and search results that render client-side with JavaScript. Most Secretary of State business filing portals fall into this bucket. California's bizfileOnline has Incapsula WAF now. Texas Comptroller returns HTTP 413 if your query matches more than 300 results. New Jersey's portal has ViewState blobs that are literally 50KB.
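With these portals, step one is always the same: pull the hidden `__VIEWSTATE` token out of the search page so you can echo it back in the POST body (along with `__VIEWSTATEGENERATOR` and `__EVENTVALIDATION`, usually). A stdlib-only sketch -- the regex assumes the attribute order shown in the sample, and real portals vary, so treat it as a starting point rather than a robust parser:

```go
package main

import (
	"fmt"
	"regexp"
)

// viewStateRe pulls the value of the __VIEWSTATE hidden input out of a
// rendered ASP.NET form. The server rejects any search POST that
// doesn't echo this token back.
var viewStateRe = regexp.MustCompile(`id="__VIEWSTATE"[^>]*value="([^"]*)"`)

// extractViewState returns the token and whether it was found.
func extractViewState(html string) (string, bool) {
	m := viewStateRe.FindStringSubmatch(html)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	page := `<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTg2" />`
	vs, ok := extractViewState(page)
	fmt.Println(vs, ok) // dDwtMTg2 true
}
```

On a portal like New Jersey's, that captured value is the 50KB blob -- you never decode it, you just round-trip it with every request before the session cookie dies.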

If you're planning to build something with government data, figure out which bucket your source falls into before you write any code. I burned two weeks on California SOS before realizing the WAF was going to win.

The APIs nobody knows about

The most surprising thing I found: there are a bunch of free government APIs that do things people pay $50-200/month for from private companies.

SEC EDGAR's full-text search API is free, real-time, and returns structured JSON. People pay for this from third-party providers.

FINRA BrokerCheck has a public JSON API at api.brokercheck.finra.org -- no auth, no rate limiting I've hit. You can look up any financial advisor or broker-dealer and get their entire disciplinary history. Compliance teams pay for this data.
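To show how little ceremony these endpoints need, here's how I'd assemble a lookup URL. The host is the one above; the `/search/individual` path and the parameter names are assumptions from memory, so verify them against the live endpoint before depending on them:

```go
package main

import (
	"fmt"
	"net/url"
)

// brokerCheckURL builds an individual-lookup query against the public
// BrokerCheck endpoint. Path and parameter names are assumptions --
// check them against a real request in your browser's network tab.
func brokerCheckURL(name string, rows int) string {
	q := url.Values{}
	q.Set("query", name)
	q.Set("nrows", fmt.Sprint(rows))
	u := url.URL{
		Scheme:   "https",
		Host:     "api.brokercheck.finra.org",
		Path:     "/search/individual",
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(brokerCheckURL("jane doe", 10))
	// https://api.brokercheck.finra.org/search/individual?nrows=10&query=jane+doe
}
```

No auth header, no API key, no session dance -- a GET to a URL like that returns JSON.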

SAM.gov has an API for federal contract opportunities. 200K+ active listings. The API key is free -- you just register.

The IRS 990 data is available through ProPublica's API and through direct bulk downloads. Every nonprofit's revenue, expenses, officers, and compensation. Free.

I didn't know most of these existed until I started building scrapers for them. The SEO for "free government API" is dominated by listicles that list the same 5 sources. The actual useful ones are buried in .gov docs that nobody reads.

What I got wrong about the market

Building the scraper is maybe 20% of the work and I learned this the hard way. The other 80% is figuring out if anyone actually wants the data, packaging it so they can use it without reading your source code, and getting it in front of people who'll pay for it.

I built 59 actors. Some of them solve real problems -- contractor license lookups for construction companies doing compliance checks, business entity verification for KYC teams, SEC filings for financial analysts. Those ones make sense.

But I also built scrapers for things like international trade tariff rates and Census Bureau county business patterns. Technically cool. Nobody's searching for them. The API is free and the data is niche enough that anyone who needs it probably already knows where to find it.

The ones that actually get users are the ones where the underlying data is valuable and the official source is painful to use. Yellow Pages business data, state business filings, FINRA BrokerCheck -- the government (or company) has the data, but their search interface sucks, there's no bulk export, and the API either doesn't exist or requires you to read 40 pages of docs.

Go was the right call (mostly)

I wrote all of these in Go instead of Python or Node. For scrapers specifically:

Goroutines make concurrent HTTP requests trivial. My YellowPages scraper does 10+ concurrent detail page fetches and the code is dead simple compared to asyncio in Python.
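The whole pattern is a counting semaphore over a pre-sized results slice. A sketch of it, with a plain function standing in for the HTTP GET so the shape is visible:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchConcurrently runs fn over urls with at most limit requests in
// flight, preserving input order in the results. fn stands in for an
// HTTP GET + parse; in the real actor it returns a parsed listing.
func fetchConcurrently(urls []string, limit int, fn func(string) string) []string {
	results := make([]string, len(urls))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			results[i] = fn(u)       // each goroutine owns one index: no lock needed
		}(i, u)
	}
	wg.Wait()
	return results
}

func main() {
	urls := []string{"/biz/1", "/biz/2", "/biz/3"}
	got := fetchConcurrently(urls, 10, func(u string) string { return "fetched " + u })
	fmt.Println(got) // [fetched /biz/1 fetched /biz/2 fetched /biz/3]
}
```

That's the entire concurrency story -- no event loop, no `async`/`await` coloring, and the buffered channel doubles as the rate limiter.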

The compiled binary is small and starts instantly. On Apify this means lower memory usage and faster cold starts, which directly affects cost per run.

The downside: Go's HTML parsing ecosystem is rough compared to Python's BeautifulSoup or JavaScript's Cheerio. I'm using goquery which is fine but not great for complex selectors. And when you need actual browser automation (Playwright/Puppeteer), you're back to JavaScript anyway -- Go's browser libs are immature.

If I were starting over, I'd still pick Go for anything API-based or plain-HTTP. For anything that needs a browser, I'd use TypeScript.

What I'd do differently

Start with 5 scrapers, not 59. I built way too many things before validating that anyone would pay for any of them. Should have picked the 5 most promising markets, built those, priced them, and waited to see what happened before building the other 54.

Talk to potential buyers first. I had zero conversations with actual compliance teams, KYC analysts, or sales ops people. I built what I thought they'd want based on competitor analysis. That's a guess, not validation.

Don't underestimate "packaging." A scraper that returns JSON is not a product. A scraper with clear input fields, example outputs showing real data, good defaults, and documentation explaining what each field means -- that's closer to a product. I spent way too long on the scraping logic and not enough on the user experience of actually running the thing.

If you're thinking about building data tools on government APIs, the opportunity is real. Most of this data is genuinely valuable and underserved. Just don't do what I did and build 59 things before figuring out which 5 matter.


All of these are live on my Apify profile if you want to poke around. The ones I'd actually recommend: Yellow Pages scraper for business leads, California and Texas SOS for business entity lookups, and FINRA BrokerCheck for compliance data. Happy to answer questions about any of them.
