Scraping Regulatory Filings: SEC EDGAR, FDA, and EPA Data
Government regulatory data is a goldmine for investors, researchers, and compliance teams. SEC filings reveal corporate financials, FDA submissions track drug approvals, and EPA records expose environmental violations. Here's how to scrape all three systematically with Python.
Why Scrape Regulatory Data?
These databases are public but painful to navigate manually. SEC EDGAR alone contains over 20 million filings. Automated scraping lets you monitor new filings in real time, extract structured data from unstructured documents, and build analytical pipelines.
SEC EDGAR: Corporate Filings
SEC EDGAR provides free access to all public company filings. Under EDGAR's fair-access policy you can make up to 10 requests per second, provided every request carries a User-Agent header identifying you and your organization.
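Before wiring requests into a scraper class, it helps to enforce that 10 requests/second budget client-side. A minimal sketch (the `Throttle` class is illustrative, not part of any SEC tooling):

```python
import time


class Throttle:
    """Client-side rate limiter: allow at most `rate` calls per second."""

    def __init__(self, rate=10):
        self.min_interval = 1.0 / rate
        self._last = 0.0

    def wait(self):
        # Sleep just long enough that calls are spaced >= min_interval apart
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Calling `throttle.wait()` before each `requests.get` keeps a long-running crawl under the limit regardless of how fast individual responses come back.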
```python
import requests
import time


class SECEdgarScraper:
    # EDGAR full-text search backend
    BASE_URL = "https://efts.sec.gov/LATEST"
    # The SEC requires a User-Agent identifying you and your organization
    HEADERS = {
        "User-Agent": "CompanyName admin@company.com",
        "Accept-Encoding": "gzip, deflate",
    }

    def search_filings(self, company, filing_type="10-K", start_date=None):
        params = {
            "q": company,
            "dateRange": "custom",
            "startdt": start_date or "2024-01-01",
            "forms": filing_type,
        }
        response = requests.get(
            f"{self.BASE_URL}/search-index",
            params=params,
            headers=self.HEADERS,
        )
        response.raise_for_status()
        time.sleep(0.1)  # stay under the 10 requests/second limit
        return response.json().get("hits", {}).get("hits", [])

    def get_filing_text(self, accession_number, cik):
        # Archive directory names drop the dashes from the accession
        # number; the full submission is a .txt file named after it
        accession_clean = accession_number.replace("-", "")
        url = (
            f"https://www.sec.gov/Archives/edgar/data/{cik}/"
            f"{accession_clean}/{accession_number}.txt"
        )
        response = requests.get(url, headers=self.HEADERS)
        response.raise_for_status()
        time.sleep(0.1)
        return response.text
```
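EDGAR's URL conventions trip people up: archive paths use the unpadded CIK, while the submissions JSON API wants it zero-padded to 10 digits. A small pure helper keeps the patterns in one place (the function name and return keys are illustrative, not part of the SEC API):

```python
def edgar_urls(cik, accession_number):
    """Build the standard EDGAR URLs for one filing."""
    cik = str(int(cik))                      # archive paths use the unpadded CIK
    acc = accession_number.replace("-", "")  # directory names drop the dashes
    base = f"https://www.sec.gov/Archives/edgar/data/{cik}/{acc}"
    return {
        "index": f"{base}/{accession_number}-index.htm",
        "full_submission": f"{base}/{accession_number}.txt",
        # A company's full filing history, keyed by zero-padded 10-digit CIK
        "submissions": f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json",
    }
```

For example, `edgar_urls("320193", "0000320193-23-000106")` yields the archive, full-submission, and submissions-history URLs for an Apple filing.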
FDA Drug Approval Data
The FDA openFDA API provides structured access to drug applications, adverse events, and recalls:
```python
class FDAScraper:
    BASE_URL = "https://api.fda.gov"

    def search_drug_approvals(self, drug_name, limit=100):
        params = {
            "search": f'openfda.brand_name:"{drug_name}"',
            "limit": limit,
        }
        response = requests.get(
            f"{self.BASE_URL}/drug/drugsfda.json",
            params=params,
        )
        response.raise_for_status()
        return response.json().get("results", [])

    def get_adverse_events(self, drug_name, start_date="20240101",
                           end_date="20261231"):
        # Use plain spaces around AND: requests URL-encodes them for us.
        # (A literal "+AND+" in a params dict gets double-encoded and
        # silently breaks the query.)
        params = {
            "search": (
                f'patient.drug.openfda.brand_name:"{drug_name}" '
                f"AND receivedate:[{start_date} TO {end_date}]"
            ),
            # Tally events by MedDRA reaction term instead of listing them
            "count": "patient.reaction.reactionmeddrapt.exact",
            "limit": 20,
        }
        response = requests.get(
            f"{self.BASE_URL}/drug/event.json",
            params=params,
        )
        response.raise_for_status()
        return response.json().get("results", [])
```
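The openFDA query-encoding pitfall is worth isolating: the documented examples write `+AND+`, but those plus signs are URL encoding for spaces. When you pass the search string through `requests` params, plain spaces are correct. A sketch of a builder that keeps this in one place (the helper name is my own, not part of openFDA):

```python
def build_event_search(brand_name, start_date, end_date):
    """Compose an openFDA adverse-event search expression.

    Plain spaces are used around AND and TO; requests will URL-encode
    them automatically when the string is passed as a params value.
    """
    return (
        f'patient.drug.openfda.brand_name:"{brand_name}" '
        f"AND receivedate:[{start_date} TO {end_date}]"
    )
```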
EPA Environmental Data
EPA's ECHO database tracks facility compliance and enforcement actions. Some ECHO pages render their results with JavaScript; for those, route requests through a proxy service with rendering support. ScraperAPI handles this automatically:
```python
class EPAScraper:
    ECHO_URL = "https://echodata.epa.gov/echo"
    SCRAPER_API_KEY = "YOUR_KEY"

    def search_facilities(self, state, violation_status="V"):
        # Pass the target ECHO URL through ScraperAPI, which fetches
        # (and renders, if needed) the page on our behalf
        params = {
            "api_key": self.SCRAPER_API_KEY,
            "url": (
                f"{self.ECHO_URL}/compliance_api/rest_services"
                f".get_facilities?p_st={state}&p_qiv={violation_status}"
                "&output=JSON"
            ),
        }
        response = requests.get(
            "http://api.scraperapi.com",
            params=params,
        )
        response.raise_for_status()
        return response.json()

    def get_enforcement_cases(self, facility_id):
        url = (
            f"{self.ECHO_URL}/dfr_api/rest_services"
            f".get_enforcement_summary?p_id={facility_id}&output=JSON"
        )
        response = requests.get(url)
        response.raise_for_status()
        return response.json()
```
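ECHO responses deserve defensive parsing: in my experience its REST services nest data under a top-level `"Results"` key (an assumption about the response envelope, so verify against live output), and error pages can come back as HTML rather than JSON. A hedged unwrapping sketch:

```python
import json


def parse_echo(body):
    """Defensively unwrap an ECHO REST response body.

    Returns the nested "Results" payload when present, the parsed JSON
    otherwise, and None when the body is not JSON at all (e.g. an HTML
    error page), so callers never raise mid-crawl.
    """
    try:
        payload = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        return None
    if isinstance(payload, dict):
        return payload.get("Results", payload)
    return payload
```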
Building a Unified Pipeline
Combine all three scrapers into a monitoring pipeline:
```python
import pandas as pd
from datetime import datetime, timezone


def daily_regulatory_scan():
    sec = SECEdgarScraper()
    fda = FDAScraper()
    epa = EPAScraper()

    results = {
        "sec_filings": sec.search_filings("Tesla", "8-K"),
        "fda_approvals": fda.search_drug_approvals("Ozempic"),
        "epa_violations": epa.search_facilities("CA"),
    }

    for source, data in results.items():
        df = pd.json_normalize(data)
        df["source"] = source
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        df["scan_date"] = datetime.now(timezone.utc).isoformat()
        df.to_csv(f"regulatory_{source}_{datetime.now():%Y%m%d}.csv", index=False)
        print(f"[{source}] Found {len(data)} records")

    return results
```
Scaling Your Regulatory Scraper
At scale, you'll need reliable proxy infrastructure. ThorData provides residential proxies for sites that block datacenter IPs, and ScrapeOps helps monitor your scraper health and success rates across all sources.
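Reliability at scale also means surviving transient failures: proxy hiccups, 429s, and brief outages. One common pattern is exponential backoff; a minimal sketch (the `with_retries` helper is illustrative, not from any of the services above):

```python
import time


def with_retries(fn, attempts=4, base_delay=0.5):
    """Call fn(), retrying on exception with exponential backoff.

    Delays grow as base_delay * 2**attempt; the exception from the
    final attempt is re-raised so permanent failures still surface.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping each scraper call, e.g. `with_retries(lambda: sec.search_filings("Tesla", "8-K"))`, turns one flaky response into a short pause instead of a failed run.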
Important Notes
- SEC EDGAR requires a proper User-Agent header identifying your organization
- FDA APIs have rate limits — respect them with appropriate delays
- EPA data is public but some endpoints require API keys
- Always store raw responses before parsing for reproducibility
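The last point above is easy to implement and pays off constantly: parsers change, raw data doesn't. A sketch of a raw-response archiver (the function name and file-naming scheme are my own choices):

```python
import hashlib
import pathlib
from datetime import datetime, timezone


def archive_raw(source, body, out_dir="raw"):
    """Write an untouched response body to disk before any parsing.

    Files are keyed by source, UTC timestamp, and a short content hash,
    so re-runs never overwrite earlier captures and identical payloads
    are easy to spot.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out / f"{source}_{stamp}_{digest}.txt"
    path.write_text(body, encoding="utf-8")
    return path
```

Call it with `response.text` immediately after each request; if a parser later turns out to be wrong, you can replay every archived body instead of re-scraping.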
Regulatory data scraping is a high-value skill. Whether you're building a fintech product, conducting academic research, or monitoring compliance, these scrapers give you a systematic edge.