Fatih İlhan
I built two Apify actors that scrape U.S. Congress trading data — directly from government sources, no QuiverQuant

TL;DR

I built two Apify actors that pull every U.S. Senate and House Periodic Transaction Report (PTR) directly from official government sources, parse them into a clean JSON dataset, and expose them through Apify's API. They replace QuiverQuant's congressional-trading endpoint at roughly 1/10th the cost.

Both actors emit the same schema. Run one, the other, or chain them.


Why I built this

I had a personal trading dashboard pulling congressional trade data from QuiverQuant. $30/month, fine. But the API kept returning slightly different shapes between endpoints, threw occasional 500s during market hours, and the per-transaction granularity I needed sat behind a higher tier.

The data itself is public domain. The STOCK Act (2012) requires every member of Congress to file every trade within 45 days. The Senate publishes them at efdsearch.senate.gov. The House publishes them as a daily-updated ZIP file at disclosures-clerk.house.gov. Both free, both public, both indexed by every aggregator that resells them.

Two weekends of work later, I have my own pipeline.


Output schema

Every transaction from both actors normalizes to this shape:

{
  "id": "4d6016b44239f646476ffac6798f21ae3e32c8ed75ea6c5b50a0bbdf9e5d3296",
  "politician": "Mark Alford",
  "transaction_date": "2026-03-16",
  "filing_date": "2026-03-31",
  "ticker": "AMZN",
  "asset_name": "Amazon.com, Inc. - Common Stock",
  "asset_type": "Stock",
  "type": "sell",
  "amount_min": 1001,
  "amount_max": 15000,
  "owner": "self"
}
Field | Notes
--- | ---
id | SHA-256 of `politician\|transaction_date\|asset_name\|amount_min\|amount_max`
type | buy or sell (House Purchase / Sale (Full) / Sale (Partial) mapped)
amount_min / amount_max | integer dollar bounds parsed from PTR brackets ($1,001 - $15,000)
amount_max | null for unbounded "Over $X" disclosures
ticker | null for bonds, municipals, structured notes
owner | self / joint / spouse / child

The id lets you idempotently sync into your own DB without seeing duplicates across runs.
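For example, here's a minimal idempotent sync into SQLite. A sketch: better-sqlite3 and the table layout are my choices for illustration, and items is a page of dataset results.

import Database from 'better-sqlite3';

// Idempotent sync: the deterministic id makes re-runs no-ops
const db = new Database('trades.db');
db.exec('CREATE TABLE IF NOT EXISTS trades (id TEXT PRIMARY KEY, payload TEXT)');

const upsert = db.prepare(
  'INSERT INTO trades (id, payload) VALUES (?, ?) ON CONFLICT(id) DO NOTHING'
);
for (const t of items) upsert.run(t.id, JSON.stringify(t));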


How they work

Senate actor

The Senate's disclosure system is a Django app at efdsearch.senate.gov. Three things make it tricky:

  1. Akamai bot protection. A direct curl gets a 403, and Apify's default datacenter proxy pool gets blocked too; only residential exits make it through.
  2. Terms-acceptance gate. Every session must POST prohibition_agreement=1 to /search/home/ before any disclosure URL works. The session cookie expires fast and silently.
  3. Two-stage data flow. The DataTables search endpoint at /search/report/data/ returns filings (one row per PTR document). Each row links to a separate detail page that contains the actual transactions.

The actor handles all three:

// Pin to a single residential exit IP for the whole run
const sessionId = `senate_${Date.now()}`;
const proxyUrl = await proxyConfig.newUrl(sessionId);

Why pinned IP? Django keeps prohibition_agreement state per-IP. If Apify rotates exit IPs between requests, the second request gets redirected back to the agreement page even though our cookie jar has the right session ID.

import { CookieJar } from 'tough-cookie';

const jar = new CookieJar();

// Walk the redirect chain manually so we capture Set-Cookie at each hop
const client = axios.create({
  maxRedirects: 0,
  validateStatus: (s) => s < 400, // don't throw on the 302s we want to follow by hand
});
client.interceptors.response.use((res) => {
  const setCookie = res.headers['set-cookie']; // arrives as an array of strings
  if (setCookie) for (const c of setCookie) jar.setCookieSync(c, res.config.url);
  return res;
});

Axios's built-in redirect follower drops Set-Cookie headers from intermediate responses. Django sets the session cookie on the 302 returned by the agreement POST, so we have to walk the chain ourselves.
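Putting the two together, the gate acceptance looks roughly like this. A sketch, not the actor's verbatim code: csrfmiddlewaretoken is Django's standard CSRF form field, and scraping the token from the agreement page beforehand is assumed.

// csrfToken: scraped from the agreement form beforehand (assumed)
const res = await client.post(
  'https://efdsearch.senate.gov/search/home/',
  new URLSearchParams({ prohibition_agreement: '1', csrfmiddlewaretoken: csrfToken }),
  { headers: { Cookie: await jar.getCookieString('https://efdsearch.senate.gov/') } }
);
if (res.status === 302) {
  // The interceptor above has already captured the Set-Cookie from this hop
  const next = new URL(res.headers.location, 'https://efdsearch.senate.gov/').toString();
  await client.get(next, { headers: { Cookie: await jar.getCookieString(next) } });
}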

For each PTR detail page, the actor parses the HTML transaction table with cheerio. About 240 transactions per 30-day window across ~34 active filers. The whole run takes ~110 seconds.
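The table walk itself is unremarkable. A minimal sketch; the column indices here are illustrative guesses, not the Senate table's exact layout:

import * as cheerio from 'cheerio';

// Pull raw cell text out of the PTR detail page's transaction table.
// Column positions are assumptions for illustration.
function parseDetailPage(html: string) {
  const $ = cheerio.load(html);
  return $('table tbody tr')
    .map((_, row) => {
      const cells = $(row).find('td').map((_, td) => $(td).text().trim()).get();
      return {
        transaction_date: cells[1],
        owner: cells[2],
        ticker: cells[3] || null,
        asset_name: cells[4],
        type: cells[6],
        amount: cells[7],
      };
    })
    .get();
}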

House actor

The House serves disclosures very differently — a single ZIP file with everything:

https://disclosures-clerk.house.gov/public_disc/financial-pdfs/2026FD.zip

Inside: an XML index plus thousands of individual PTR PDFs. No login, no Akamai, no proxy needed.

The flow:

ZIP download → XML index → filter to PTRs (FilingType='P') → fetch each PDF → pdf-parse → regex

The XML index gives you politician + filing date + a DocID. The PDF lives at https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2026/<DocID>.pdf.
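The index walk is a few lines. A sketch with fast-xml-parser; FilingType and DocID come from the flow above, while the root element, Member wrapper, and name fields are assumptions about the index layout:

import { XMLParser } from 'fast-xml-parser';

// Filter the index to PTRs and build the PDF URL for each DocID.
// Element names other than FilingType/DocID are assumptions.
const index = new XMLParser().parse(xmlText);
const ptrs = index.FinancialDisclosure.Member
  .filter((m: any) => m.FilingType === 'P')
  .map((m: any) => ({
    politician: `${m.First} ${m.Last}`,
    filing_date: m.FilingDate,
    pdfUrl: `https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2026/${m.DocID}.pdf`,
  }));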

Parsing the PDFs is where it gets interesting.


The PDF parsing problem

House PTR PDFs are machine-generated, but the text-extraction order is chaotic. After running them through pdf-parse, here's what one transaction row looks like:

Amazon.com, Inc. - Common Stock
(AMZN) [ST]
S (partial)03/16/202603/16/2026$1,001 - $15,000
F     S    : New
S         O : Putnam Investments

A few things to notice:

  • The asset name is on a separate line from the ticker
  • The transaction type, two dates, and amount range all run together with no whitespace separators
  • Header text contains null bytes (F\x00\x00\x00 S\x00\x00\x00) — a PDF font-glyph hack
  • Comment blocks (Filing Status: New, Subholding Of:, Description:) bleed between rows in unpredictable ways
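Step zero, before any pattern matching, is flattening that mess into clean lines (this is the null-byte normalization mentioned again in the Lessons section):

// Normalize pdf-parse output: strip the font-glyph null bytes,
// split into lines, drop empties
const lines = rawText
  .replace(/\x00/g, '')
  .split('\n')
  .map((l) => l.trim())
  .filter(Boolean);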

I ended up with a marker-anchored parser:

// Find every (TICKER) [XX] line — these are the row anchors
const MARKER_RE = /(?:\(([A-Z][A-Z0-9.\-]{0,5})\)\s*)?\[([A-Z]{2})\]/;

// Tx-row: type + date + date + amount, all glued together
const TX_RE = /(S\s*\(partial\)|P|S|E)\s*(\d{1,2}\/\d{1,2}\/\d{4})\s*(\d{1,2}\/\d{1,2}\/\d{4})\s*\$([\d,]+)\s*-\s*\$([\d,]+)/;

For each marker, walk backward up to 5 lines to collect the asset name, hard-stopping at any of the following (see the sketch after this list):

  • Other marker (next row above)
  • Transaction row (this would be the previous row's data)
  • Comment-block keyword (Filing Status:, Description:, shares sold @)
  • Owner-code prefix (SP/DC/JT stuck to the start)
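In code, the walk is a short loop. A sketch, with STOP_RE abbreviating the comment-block keywords; the owner-code prefix strip happens afterward, shown next:

const STOP_RE = /Filing Status:|Description:|Subholding Of:|shares sold @/;

// Walk backward from a marker line, gathering asset-name fragments.
// Owner-code prefixes (SP/DC/JT) glued to the result are stripped separately.
function collectAssetName(lines: string[], markerIdx: number): string {
  const parts: string[] = [];
  for (let i = markerIdx - 1; i >= 0 && markerIdx - i <= 5; i--) {
    const line = lines[i];
    // Hard stops: another row's marker, a transaction row, or comment-block text
    if (MARKER_RE.test(line) || TX_RE.test(line) || STOP_RE.test(line)) break;
    parts.unshift(line);
  }
  return parts.join(' ').trim();
}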

The SP prefix detection has a fun gotcha: SPDR ETFs would falsely match. Fix:

// Only strip SP/DC/JT when followed by [A-Z][a-z] (real word boundary)
// "SPDR" → uppercase next, no strip. "SPApple" → strip → "Apple".
const m = name.match(/^(SP|DC|JT)([A-Z][a-z])/);
if (m) {
  ownerCode = m[1];     // SP = spouse, DC = dependent child, JT = joint
  name = name.slice(2); // drop the glued prefix
}

Plus a small post-processor for known truncations:

if (/\s-\s*Common$/i.test(assetName)) assetName += ' Stock';
else if (/Inc\.\s+New$/i.test(assetName)) assetName += ' Common Stock';

End result: ~95% of records parse cleanly. The remaining 5% are scanned PDFs (older filings — the parser logs them and moves on; OCR is Phase 2).


Architecture

Both actors share the same shape:

Fetch → Parse → Normalize → Dedup → Push to Apify Dataset
  • Normalize: maps each source's quirks (Senate's "Purchase"/"Sale (Full)" vs House's P/S (partial)) into the unified buy/sell enum, parses dollar ranges into integers, ISO-8601s the dates.
  • Dedup: hashes politician|transaction_date|asset_name|amount_min|amount_max to SHA-256. Same key in two runs = same id = no duplicate dataset entry. (See the sketch after this list.)
  • Push: Actor.pushData() straight into the default dataset. Persisted by Apify, queryable via API.
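Both the normalize map and the id fit in a few lines. A sketch using Node's built-in crypto; the TYPE_MAP keys are the source labels mentioned above:

import { createHash } from 'node:crypto';

// Source-specific labels → the unified buy/sell enum
const TYPE_MAP: Record<string, 'buy' | 'sell'> = {
  'Purchase': 'buy',
  'P': 'buy',
  'Sale (Full)': 'sell',
  'Sale (Partial)': 'sell',
  'S': 'sell',
  'S (partial)': 'sell',
};

// Deterministic id from the natural key: two runs over the same filing
// produce the same hash, so re-pushes are no-ops downstream
function txId(t: {
  politician: string;
  transaction_date: string;
  asset_name: string;
  amount_min: number;
  amount_max: number | null;
}): string {
  const key = [t.politician, t.transaction_date, t.asset_name, t.amount_min, t.amount_max].join('|');
  return createHash('sha256').update(key).digest('hex');
}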

Both actors are independent Node + TypeScript projects living in their own GitHub repos. Apify links each repo as a separate actor, builds it on push, runs on a schedule.

Schedule both to run every 6 hours. Each run takes 1-3 minutes. Combined daily cost: under $1.


Using the data

Run either actor manually or hit the API:

# Trigger Senate run
curl -X POST "https://api.apify.com/v2/acts/seralifatih~congress-trading-pipeline/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"fetchDaysBack": 30}'

# Pull dataset items
curl "https://api.apify.com/v2/datasets/<dataset-id>/items?token=YOUR_TOKEN&format=json"

Or use Apify's Node SDK to consume the dataset directly in your app:

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const { items } = await client
  .dataset('senate-dataset-id')
  .listItems({ limit: 200 });

const recentBuys = items.filter(t => t.type === 'buy');

Both datasets share the schema, so you can .concat() them and treat the result as one Congress-wide feed.
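Concretely, with the SDK client from above (dataset IDs are placeholders):

// Pull both datasets and merge into one Congress-wide feed
const [senate, house] = await Promise.all([
  client.dataset('senate-dataset-id').listItems({ limit: 1000 }),
  client.dataset('house-dataset-id').listItems({ limit: 1000 }),
]);

const congress = senate.items
  .concat(house.items)
  .sort((a, b) => String(b.transaction_date).localeCompare(String(a.transaction_date)));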


Lessons

A few things I'd tell past-me before starting:

  1. Read the docs for every layer of your HTTP stack. Axios's validateStatus: (s) => s < 400 looks reasonable. Combined with the default maxRedirects: 5, it silently swallows critical Set-Cookie headers from 302s. Took me three diagnostic runs to figure out.

  2. Government sites have weird security postures. The Senate's Akamai layer doesn't care about your User-Agent string but cares deeply about which IP block you're coming from. Apify's residential proxy pool sails through; their datacenter pool gets walled.

  3. PDF text extraction is lossy in non-obvious ways. pdf-parse happily returns "text" that's actually a stream of glyph references with embedded null bytes. Always normalize the output with .replace(/\x00/g, '') before any pattern matching.

  4. Dedup early, dedup deterministically. A SHA-256 of the natural key is dumber and faster than UUIDs + a uniqueness index. Two runs of the same data produce the same set of records.

  5. Treat the source as adversarial documentation. The Senate's terms page wording isn't just legalese — it's the literal CSRF gate. The House's PDF "Filing Status: New" boilerplate isn't human-readable bureaucracy — it's the visual marker that ends one transaction row.


Costs

For both actors combined, running every 6 hours:

Component | Daily cost
--- | ---
Senate compute (~110s/run × 4) | ~$0.30
Senate residential proxy (~3MB/run × 4) | ~$0.10
House compute (~120s/run × 4) | ~$0.32
House proxy | $0 (no proxy needed)
Total | ~$0.72/day

The QuiverQuant tier I was on ran $30/month. This is cheaper, gives me raw transactional data instead of pre-aggregated summaries, and runs on infrastructure I control.


What's next

  • OCR fallback for the ~5% of House PTRs that are scanned PDFs (older filings)
  • Ticker enrichment — resolving asset names to tickers for the bond/muni rows where the source has no ticker column
  • Merge actor that consumes both Senate + House datasets and emits a unified Congress-wide stream with cross-source dedup

If you want to use the actors, both are public on Apify Store. Pricing is ~$0.40-0.50 per 1k results — a basic 30-day pull runs about $0.10. Pay-per-result, no subscription.

Data is public domain. STOCK Act compliant. No third-party vendors. Use it however you want.
