Scraping SAM.gov and TED EU: How I Built a $13 Trillion Opportunity Monitor

Vhub Systems

A consulting firm client came to me with a simple ask: "We need to know when relevant government contracts get posted before our competitors do." Simple ask, nightmare implementation.

Government procurement data is technically public. Every dollar spent, every contract awarded, every solicitation posted — it's all supposed to be accessible. The reality is that these systems were built by the lowest bidders, maintained by committee, and designed by people who've never had to actually use them programmatically. After six weeks of fighting rate limits, XML schemas from 2009, and an EU search API that randomly returns zero results, I had something working. Here's what I learned.


Why This Data Actually Matters

Global government procurement sits around $13 trillion annually. The US federal government alone spends roughly $700 billion per year on contracts. The EU institutions and member states add another few trillion on top of that. For B2B companies — especially in consulting, IT, defense, construction, and professional services — this isn't academic. Missing a solicitation because you saw it three days late means your competitor gets a head start on a 6-month proposal process.

The firms that win consistently aren't always the best. They're usually the best-prepared. And preparation starts with visibility.


SAM.gov: The API That Hates You (But Exists)

SAM.gov has an actual REST API, which already puts it ahead of most government systems. The endpoint you want is /opportunities/v2/search. You'll need to register for an API key at sam.gov/profile, which takes about 10 minutes.

The rate limit is 10 requests per minute. Not per second — per minute. Sounds manageable until you realize you're paginating through thousands of results and each page only returns 100 records.

Here's the core pagination loop I use:

import requests
import time
from datetime import datetime, timedelta

API_KEY = "your_sam_api_key"
BASE_URL = "https://api.sam.gov/opportunities/v2/search"

def fetch_opportunities(naics_code, days_back=7):
    results = []
    offset = 0
    limit = 100  # API maximum per page
    posted_from = (datetime.now() - timedelta(days=days_back)).strftime("%m/%d/%Y")

    while True:
        params = {
            "api_key": API_KEY,
            "postedFrom": posted_from,
            "postedTo": datetime.now().strftime("%m/%d/%Y"),
            "naicsCode": naics_code,
            "limit": limit,
            "offset": offset,
        }

        for attempt in range(3):
            resp = requests.get(BASE_URL, params=params, timeout=30)
            if resp.status_code == 200:
                break
            if resp.status_code == 429:
                # Rolling rate-limit window; give it time to reset
                print("Rate limited, waiting 65s...")
                time.sleep(65)
            else:
                time.sleep(10)
        else:
            # Never got a 200 in three attempts; don't parse a bad response
            raise RuntimeError(f"SAM.gov request failed with status {resp.status_code}")

        data = resp.json()
        opportunities = data.get("opportunitiesData", [])
        if not opportunities:
            break

        results.extend(opportunities)
        offset += limit
        total = data.get("totalRecords", 0)
        print(f"Fetched {len(results)}/{total}")

        if offset >= total:
            break
        time.sleep(6)  # Stay under 10 req/min

    return results

The time.sleep(6) between requests is not optional. I learned this the hard way — hit the rate limit, got 429s, and my IP got soft-blocked for about 20 minutes. The 65-second wait on a 429 is intentional: it gives the rolling window time to reset.


The Fields That Actually Matter

From each opportunity record, I extract:

  • solicitationNumber — unique identifier, critical for deduplication across runs
  • responseDeadLine — when bids are due; this is the field my clients sort by first
  • naicsCode — North American Industry Classification; this is how you filter to relevant opportunities (e.g., 541512 for computer systems design)
  • awardAmount — not always populated at solicitation stage, but gold when it is
  • placeOfPerformance — city/state/country; matters for companies with geographic constraints
  • type — presolicitation, solicitation, award notice, etc.
  • organizationHierarchy — which agency posted it

For a consulting firm tracking cybersecurity work, I filter on NAICS codes 541512, 541519, and 541690. Running weekly pulls, I was averaging about 340-380 new opportunities per week across those codes. Processing time for a full week's data: about 12 minutes with the rate limiting.
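Since solicitationNumber is the deduplication key, the cross-run filter can be as small as a set lookup. A minimal sketch — the in-memory seen_ids set stands in for whatever persistent store you actually use:

```python
def new_opportunities(fetched, seen_ids):
    """Filter a fetched batch down to records not seen in prior runs.

    seen_ids is a set of solicitation numbers from previous pulls;
    in production this would be backed by a database, not memory.
    """
    fresh = []
    for opp in fetched:
        sid = opp.get("solicitationNumber")
        if sid and sid not in seen_ids:
            seen_ids.add(sid)
            fresh.append(opp)
    return fresh
```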


TED EU: Where Good Intentions Go to Die

The European Tenders Electronic Daily (TED) covers EU procurement notices. It's the SAM.gov equivalent for Europe, and in theory it has both a search API and full XML data dumps.

In practice, it is a special kind of chaos.

The offset bug: TED's search API has inconsistent pagination behavior. When you request offset 900 on certain queries, it silently resets and returns results starting from offset 0 again. I caught this by checksumming notice IDs per page — duplicates started appearing about eight or nine pages in on queries with between 850 and 1,100 total results. I still don't know if it's query-length-dependent or result-count-dependent. My workaround was to detect repeated IDs and break early, then cross-reference against the XML bulk exports for completeness.
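The detect-repeats-and-break-early workaround is straightforward to sketch. Here fetch_page is a stand-in for whatever function calls the TED search endpoint and returns a list of notice dicts with an "id" field (the real response shape differs; this just shows the guard logic):

```python
def paginate_with_reset_guard(fetch_page, page_size=100):
    """Page through results, stopping if pagination silently wraps to 0.

    fetch_page(offset) returns a list of dicts with an "id" key.
    Stops when a page is empty or contains only IDs we've already
    seen -- the symptom of the offset reset described above.
    """
    seen = set()
    results = []
    offset = 0
    while True:
        page = fetch_page(offset)
        if not page:
            break
        new = [n for n in page if n["id"] not in seen]
        if not new:  # every ID repeated: pagination has wrapped around
            break
        seen.update(n["id"] for n in new)
        results.extend(new)
        offset += page_size
    return results
```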

The XML fallback: TED publishes daily XML exports at data.europa.eu/api/hub/search/datasets. When the API fails (and it will — I saw zero-result responses on queries I knew had 200+ matching records), these XML dumps are your safety net. The schema is TED-XML 2.0.9, which predates modern sensibilities about data design. You'll find yourself parsing things like <CONTRACTING_BODY><ADDRESS_CONTRACTING_BODY> nested four levels deep to get to a buyer name.
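The deep nesting is tedious but mechanical once you have the path. A toy fragment shaped like the structure above — real TED-XML 2.0.9 notices carry namespaces and many more levels, so treat the element names here as illustrative:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a TED notice; real exports are namespaced
# and far deeper, but the findtext-with-a-path pattern is the same.
xml_doc = """
<TED_EXPORT>
  <CONTRACTING_BODY>
    <ADDRESS_CONTRACTING_BODY>
      <OFFICIALNAME>Ministry of Example</OFFICIALNAME>
      <TOWN>Brussels</TOWN>
    </ADDRESS_CONTRACTING_BODY>
  </CONTRACTING_BODY>
</TED_EXPORT>
"""

root = ET.fromstring(xml_doc)
buyer = root.findtext("CONTRACTING_BODY/ADDRESS_CONTRACTING_BODY/OFFICIALNAME")
town = root.findtext("CONTRACTING_BODY/ADDRESS_CONTRACTING_BODY/TOWN")
print(buyer, town)  # Ministry of Example Brussels
```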

My actual hybrid approach: use the TED API for recent data (last 48 hours), fall back to XML dumps for anything older or when the API returns suspiciously low counts, and reconcile on notice number. It's inelegant. It works.

The mixed format problem: Some TED endpoints return JSON. Some return XML. Some return JSON with base64-encoded XML embedded inside it. I wish I were exaggerating. Always check Content-Type in the response headers before parsing.
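A dispatch-on-Content-Type helper covers all three cases. This is a sketch: the base64-embedded case assumes a JSON wrapper with a "content" field, which is one shape I've seen described but not guaranteed to be the only one — inspect your actual responses first:

```python
import base64
import json
import xml.etree.ElementTree as ET

def parse_body(content_type, body):
    """Parse a response body according to its Content-Type header.

    Handles plain JSON, plain XML, and the JSON-wrapping-base64-XML
    case (assumed here to live under a "content" key -- verify this
    against the endpoint you're actually hitting).
    """
    if "json" in content_type:
        doc = json.loads(body)
        if isinstance(doc, dict) and "content" in doc:
            # Base64-encoded XML smuggled inside a JSON envelope
            return ET.fromstring(base64.b64decode(doc["content"]))
        return doc
    if "xml" in content_type:
        return ET.fromstring(body)
    raise ValueError(f"Unexpected Content-Type: {content_type}")
```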


What the Final System Looks Like

The production version monitors 23 NAICS codes on the US side and equivalent CPV codes on the EU side (Common Procurement Vocabulary — another taxonomy to learn). It runs daily, deduplicates against a Postgres table keyed on solicitation/notice number, and pushes new opportunities to a Slack channel with deadline, estimated value, and a link to the full notice.
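The Slack step is the simplest piece: format each new record as an incoming-webhook payload and POST it. A sketch of the formatter — "title" and "uiLink" are my guesses at the relevant SAM.gov field names for the notice title and link, so verify them against a live response:

```python
def slack_payload(opp):
    """Format one opportunity as a Slack incoming-webhook message.

    Uses Slack's mrkdwn link syntax. "title" and "uiLink" are
    assumed field names; check them against an actual API response.
    """
    text = (
        f"*{opp.get('title', 'Untitled opportunity')}*\n"
        f"Deadline: {opp.get('responseDeadLine', 'n/a')} | "
        f"Est. value: {opp.get('awardAmount', 'n/a')}\n"
        f"<{opp.get('uiLink', '')}|Full notice>"
    )
    return {"text": text}
```

The returned dict goes straight to `requests.post(webhook_url, json=slack_payload(opp))`.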

Response time from posting to Slack notification: under 24 hours for SAM.gov (daily runs), under 48 hours for TED (mixing API and XML lag). For a market where solicitation periods are often 30-60 days, that's more than enough runway.

The consulting firm I built this for tracks roughly 1,200 active opportunities at any given time. Their win rate on contracts they identified early (>14 days before submission) versus ones they found late is significantly higher. Whether that's the early visibility or just selection bias from applying to better-fit opportunities, I can't say definitively.


The Maintenance Problem

Here's the honest part: government APIs change without notice. SAM.gov migrated from v1 to v2 of the opportunities endpoint and gave about three weeks' warning. TED is moving toward a new search API (eForms) that will eventually deprecate the current endpoints. The XML schemas get versioned inconsistently.

If you're building this for your own use and have time to maintain it, the approach above is solid. If you need something that stays current without babysitting it, I packaged this as an Apify actor for anyone who needs it without the infrastructure: https://apify.com/lanky_quantifier/government-tenders-scraper. It handles both SAM.gov and TED EU with configurable NAICS/CPV filters and outputs clean JSON.


The Bigger Picture

There's something a little absurd about the fact that $13 trillion in annual public spending requires this much work to monitor systematically. The data is public by law. The APIs exist. But between rate limits, inconsistent schemas, pagination bugs, and silent failures, meaningful access requires real engineering effort — which means smaller players without technical resources are at a structural disadvantage.

That bothers me more than the technical complexity does.


Genuine question for the comments: How do you handle government data sources in your own work? I'm particularly curious whether anyone has found a reliable pattern for TED EU's pagination issues, and whether you've found the official APIs trustworthy enough to rely on exclusively or whether you always maintain a scraping fallback. The "official API vs. scrape-the-UI" tradeoff feels different when the API is government-maintained versus commercial.
