SAM.gov is where the US federal government posts every contract opportunity over $25,000. Something like $700 billion a year flows through it. The search UI is terrible, the API takes 10 business days to get a key, and the data lives in two separate systems that nobody bothers to merge: SAM.gov for open opportunities, USASpending.gov for who actually won what.
I spent a good part of the past week building a scraper that pulls both in one call. It runs on Apify, costs $0.02 per contract, and needs no API key to start. This post walks through how it works.
## Why This Exists
The business development team at any company chasing federal contracts has two questions every Monday morning:
- What new opportunities match what we do?
- Who won the last five contracts that looked like the one we are bidding on?
SAM.gov answers question one. USASpending.gov answers question two. But they are different systems with different APIs and different data models. Nobody pulls both. So BD analysts end up with two tabs open and a spreadsheet.
The actor I built does the join in code. Here is the core flow:
```python
async def run(self):
    # Source 1: USASpending.gov (no key needed, always works)
    usaspending_opps = await self._fetch_usaspending_opportunities()
    self.opportunities.extend(usaspending_opps)

    # Source 2: SAM.gov (optional, richer data if key provided)
    sam_opps = await self._fetch_sam_opportunities()
    self.opportunities.extend(sam_opps)

    # Deduplicate, rank, filter, push
    self._deduplicate()
    self._score_relevance()
    await self._push_to_dataset()
```
The USASpending endpoint is `https://api.usaspending.gov/api/v2/search/spending_by_award/`. POST a JSON body with filters and get back awards. No auth, no rate-limit drama. The catch: it only has awarded contracts, not open solicitations.
SAM.gov has the open solicitations but wants an API key. You can request one at api.sam.gov, but issuance takes about 10 business days. So my actor treats the SAM key as optional: it works on USASpending alone if you do not have one, and uses both if you do.
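For reference, a sketch of what the USASpending request body can look like. The filter and field names below follow my reading of the public API docs; treat the exact field list as an assumption to verify against the current documentation:

```python
from datetime import date, timedelta

URL = "https://api.usaspending.gov/api/v2/search/spending_by_award/"

def build_payload(naics_codes, days_back=90, limit=50):
    """Build the POST body for a recent-awards search by NAICS code."""
    start = (date.today() - timedelta(days=days_back)).isoformat()
    return {
        "filters": {
            # A/B/C/D cover definitive contract award types
            "award_type_codes": ["A", "B", "C", "D"],
            "naics_codes": naics_codes,
            "time_period": [{"start_date": start,
                             "end_date": date.today().isoformat()}],
        },
        "fields": ["Award ID", "Recipient Name", "Award Amount",
                   "Awarding Agency", "Start Date", "Description"],
        "limit": limit,
        "page": 1,
    }

# resp = httpx.post(URL, json=build_payload(["541512"]))  # no auth header needed
```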
## The Two Things Nobody Else Does
There are maybe ten SAM.gov scrapers on Apify. I read through most of them. Here is what they all miss.
### Attachment URLs
When a contracting officer posts an RFP on SAM.gov, they attach documents: the Statement of Work, Section L instructions, Section M evaluation criteria, past performance questionnaires. These are the documents you actually need to write a proposal. They live in a field called resourceLinks in the API response.
Every other scraper I looked at ignores this field. Two lines of code:

```python
attachment_urls=item.get('resourceLinks') or [],
attachments_count=len(item.get('resourceLinks') or []),
```
Now the output includes direct download URLs for every attachment. Pipe them into wget or an archive tool and you have the full proposal document set before you even open SAM.gov.
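If you would rather stay in Python than shell out to wget, a download sketch might look like this. Assumptions: the resourceLinks URLs are directly fetchable (whether some require the API key appended is worth verifying), and the filename-derivation fallback is my own choice:

```python
import os
import urllib.request

def filename_for(url, index):
    """Derive a local filename from the last URL path segment."""
    name = url.rstrip("/").split("/")[-1]
    return name or f"attachment_{index}.bin"

def download_attachments(urls, dest_dir="attachments"):
    """Fetch every resourceLinks URL into dest_dir and return the paths."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for i, url in enumerate(urls):
        path = os.path.join(dest_dir, filename_for(url, i))
        urllib.request.urlretrieve(url, path)
        paths.append(path)
    return paths
```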
### Semantic Ranking Without LLMs
Keyword search sucks at federal contracts. "Cloud migration" matches a contract titled "Cloud Migration Services" but misses one titled "FedRAMP-Authorized Infrastructure Transition" even though they are the same thing.
I considered calling an LLM per contract for relevance scoring. Rejected it immediately. LLMs cost money per call, get rate-limited, and are slow. For 500 contracts per run that is real latency.
Instead I used TF-IDF with domain synonym expansion. The user provides a business description:
```python
business_description = "cloud migration and cybersecurity services for federal agencies"
```
The scorer tokenizes, expands synonyms (cloud -> FedRAMP, hosting, IaaS, data center), builds a TF-IDF vector from a corpus of NAICS descriptions, and computes cosine similarity against each contract's title and description.
```python
def score(self, title: str, description: str) -> float:
    contract_text = f"{title} {title} {description}"  # title weighted 2x
    contract_tokens = _expand_synonyms(_tokenize(contract_text))
    contract_tf = _compute_tf(contract_tokens)

    # Cosine similarity over TF-IDF weights
    dot_product = norm_a = norm_b = 0.0
    for word in set(self._business_tf.keys()) | set(contract_tf.keys()):
        idf = self._idf.get(word, 0.0)
        a = self._business_tf.get(word, 0.0) * idf
        b = contract_tf.get(word, 0.0) * idf
        dot_product += a * b
        norm_a += a * a
        norm_b += b * b

    if norm_a == 0 or norm_b == 0:
        return 0.0
    return round(dot_product / (math.sqrt(norm_a) * math.sqrt(norm_b)), 3)
```
No external calls. No API keys. Runs in microseconds. Works offline. On my test query with "cloud migration" as the business description, the top result was "USDA ENTERPRISE-SCALE FEDRAMP CERTIFIED CLOUD HOSTING SERVICES" at 0.98 similarity. The keyword matcher would have ranked that fourth because the exact phrase "cloud migration" does not appear.
## What Actually Breaks
Two things tripped me up during the build.
**SAM.gov rate limits are tighter than documented.** I hit 429s after about 20 requests per minute even though the docs say 60. Exponential backoff handles it, but I wasted a day tracing the issue before adding retry logic. The fix:
```python
MAX_RETRIES = 3
RETRY_DELAYS = [1, 2, 4]  # seconds, exponential backoff

async def _request_with_retry(self, method, url, **kwargs):
    for attempt in range(MAX_RETRIES):
        response = (await self.http_client.get(url, **kwargs) if method == 'GET'
                    else await self.http_client.post(url, **kwargs))
        # Retry on rate limits and server errors; return anything else immediately
        if response.status_code == 429 or response.status_code >= 500:
            if attempt < MAX_RETRIES - 1:
                await asyncio.sleep(RETRY_DELAYS[attempt])
                continue
        return response
```
**USASpending returns deeply nested agency data.** The same agency can appear as three different strings depending on how you query: "Department of Veterans Affairs", "Veterans Affairs, Department of", or just "VA". I ended up normalizing with a lookup table. Nothing elegant, just exhaustive:
```python
AGENCY_ALIASES = {
    "Department of Veterans Affairs": "VA",
    "Veterans Affairs, Department of": "VA",
    "VA": "VA",
    # ... 47 more
}
```
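Given a table like that, normalization collapses to a dict lookup. A self-contained sketch (it redefines a slice of the table so it runs on its own; the pass-through fallback for unmapped names is my own choice, not necessarily the actor's behavior):

```python
AGENCY_ALIASES = {
    "Department of Veterans Affairs": "VA",
    "Veterans Affairs, Department of": "VA",
    "VA": "VA",
    # ... plus the rest of the table
}

def normalize_agency(name: str) -> str:
    """Collapse agency name variants to one canonical code.

    Unknown agencies pass through unchanged rather than raising.
    """
    return AGENCY_ALIASES.get((name or "").strip(), name)
```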
## Webhook Output
The last piece I added was webhook support. BD teams want this data flowing into Slack or a CRM, not sitting in an Apify dataset. So at the end of each run:
```python
if self.webhook_url and self.opportunities:
    await self._send_webhook()

async def _send_webhook(self) -> None:
    try:
        payload = {
            "event": "contracts_found",
            "count": len(self.opportunities),
            "opportunities": [self._format(o) for o in self.opportunities[:50]],
            "run_at": datetime.now(timezone.utc).isoformat(),
        }
        resp = await self.http_client.post(self.webhook_url, json=payload, timeout=15)
        Actor.log.info(f"Webhook delivered ({resp.status_code})")
    except Exception as e:
        Actor.log.warning(f"Webhook failed: {e}")
```
Point it at a Zapier webhook, a Make.com scenario, an n8n workflow, or your own endpoint. Non-blocking, so a bad webhook URL does not crash the run.
## The Output
Here is what one contract record looks like:
```json
{
  "opportunity_id": "sam_abc123",
  "source": "sam.gov",
  "title": "Cloud Migration Services for Federal Agency",
  "agency": "General Services Administration",
  "naics_code": "541512",
  "set_aside_type": "Total Small Business",
  "estimated_value": 2500000.0,
  "place_state": "VA",
  "response_deadline": "2026-05-15T00:00:00",
  "contact_email": "contracting.officer@gsa.gov",
  "attachments": [
    "https://sam.gov/api/prod/opps/.../rfp.pdf",
    "https://sam.gov/api/prod/opps/.../sow.pdf"
  ],
  "attachments_count": 2,
  "relevance_score": 0.87,
  "semantic_score": 0.954,
  "change_status": "new"
}
```
Identifiers, agency, value, deadline, contacts, attachments, and both relevance scores in one object. Pipe it into whatever you are building.
## Running It
The actor is live on Apify as `digital_troubadour/government-contract-monitor`. You can try it free: Apify gives new accounts $5 in credits, enough for about 250 contracts with this actor at $0.02 each.
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("digital_troubadour/government-contract-monitor").call(
    run_input={
        "business_description": "Cloud migration for federal agencies",
        "naics_codes": ["541512"],
        "max_results": 50,
    }
)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['semantic_score']:.2f} | {item['title'][:60]}")
```
If you want to test without running against real APIs, set `"dry_run": true` in the input and you get two sample opportunities back. No charges.
## What I Would Build Next
A historical spending model would be interesting. Run the scraper daily for a year, store awards in a time series, and see which agencies are ramping up spending in which NAICS codes. You could probably spot budget allocation shifts six months before they show up in OMB reports.
Also curious if anyone has a better approach than TF-IDF for the semantic scoring. I tried sentence-transformers but the model load time killed cold-start performance on Apify's container. Open to ideas.
What do you use for monitoring federal contract opportunities? Curious what the BD teams here actually do when SAM.gov search fails them.