Scraping Regulatory Filings: SEC EDGAR, FDA, and EPA Data
Government regulatory data is a goldmine for investors, researchers, and compliance teams. SEC filings reveal corporate financials, FDA submissions track drug approvals, and EPA records expose environmental violations. Here's how to scrape all three systematically with Python.
Why Scrape Regulatory Data?
These databases are public but painful to navigate manually. SEC EDGAR alone contains over 20 million filings. Automated scraping lets you monitor new filings in real time, extract structured data from unstructured documents, and build analytical pipelines.
SEC EDGAR: Corporate Filings
SEC EDGAR provides free access to all public company filings. Under EDGAR's fair-access policy you can make up to 10 requests per second, provided every request carries a User-Agent header identifying you and your organization.
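Before wiring requests into a scraper class, it helps to enforce that 10 requests/second budget client-side. A minimal sketch (the `Throttle` class is illustrative, not part of any SEC tooling):

```python
import time


class Throttle:
    """Client-side rate limiter: allow at most `rate` calls per second."""

    def __init__(self, rate=10):
        self.min_interval = 1.0 / rate
        self._last = 0.0

    def wait(self):
        # Sleep just long enough that calls are spaced >= min_interval apart
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Calling `throttle.wait()` before each `requests.get` keeps a long-running crawl under the limit regardless of how fast individual responses come back.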
```python
import requests
import time


class SECEdgarScraper:
    # EDGAR full-text search backend
    BASE_URL = "https://efts.sec.gov/LATEST"
    # The SEC requires a User-Agent identifying you and your organization
    HEADERS = {
        "User-Agent": "CompanyName admin@company.com",
        "Accept-Encoding": "gzip, deflate",
    }

    def search_filings(self, company, filing_type="10-K", start_date=None):
        params = {
            "q": company,
            "dateRange": "custom",
            "startdt": start_date or "2024-01-01",
            "forms": filing_type,
        }
        response = requests.get(
            f"{self.BASE_URL}/search-index",
            params=params,
            headers=self.HEADERS,
        )
        response.raise_for_status()
        time.sleep(0.1)  # stay under the 10 requests/second limit
        return response.json().get("hits", {}).get("hits", [])

    def get_filing_text(self, accession_number, cik):
        # Archive directory names drop the dashes from the accession
        # number; the full submission is a .txt file named after it
        accession_clean = accession_number.replace("-", "")
        url = (
            f"https://www.sec.gov/Archives/edgar/data/{cik}/"
            f"{accession_clean}/{accession_number}.txt"
        )
        response = requests.get(url, headers=self.HEADERS)
        response.raise_for_status()
        time.sleep(0.1)
        return response.text
```
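EDGAR's URL conventions trip people up: archive paths use the unpadded CIK, while the submissions JSON API wants it zero-padded to 10 digits. A small pure helper keeps the patterns in one place (the function name and return keys are illustrative, not part of the SEC API):

```python
def edgar_urls(cik, accession_number):
    """Build the standard EDGAR URLs for one filing."""
    cik = str(int(cik))                      # archive paths use the unpadded CIK
    acc = accession_number.replace("-", "")  # directory names drop the dashes
    base = f"https://www.sec.gov/Archives/edgar/data/{cik}/{acc}"
    return {
        "index": f"{base}/{accession_number}-index.htm",
        "full_submission": f"{base}/{accession_number}.txt",
        # A company's full filing history, keyed by zero-padded 10-digit CIK
        "submissions": f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json",
    }
```

For example, `edgar_urls("320193", "0000320193-23-000106")` yields the archive, full-submission, and submissions-history URLs for an Apple filing.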
FDA Drug Approval Data
The FDA openFDA API provides structured access to drug applications, adverse events, and recalls:
```python
class FDAScraper:
    BASE_URL = "https://api.fda.gov"

    def search_drug_approvals(self, drug_name, limit=100):
        params = {
            "search": f'openfda.brand_name:"{drug_name}"',
            "limit": limit,
        }
        response = requests.get(
            f"{self.BASE_URL}/drug/drugsfda.json",
            params=params,
        )
        response.raise_for_status()
        return response.json().get("results", [])

    def get_adverse_events(self, drug_name, start_date="20240101",
                           end_date="20261231"):
        # Use plain spaces around AND: requests URL-encodes them for us.
        # (A literal "+AND+" in a params dict gets double-encoded and
        # silently breaks the query.)
        params = {
            "search": (
                f'patient.drug.openfda.brand_name:"{drug_name}" '
                f"AND receivedate:[{start_date} TO {end_date}]"
            ),
            # Tally events by MedDRA reaction term instead of listing them
            "count": "patient.reaction.reactionmeddrapt.exact",
            "limit": 20,
        }
        response = requests.get(
            f"{self.BASE_URL}/drug/event.json",
            params=params,
        )
        response.raise_for_status()
        return response.json().get("results", [])
```
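The openFDA query-encoding pitfall is worth isolating: the documented examples write `+AND+`, but those plus signs are URL encoding for spaces. When you pass the search string through `requests` params, plain spaces are correct. A sketch of a builder that keeps this in one place (the helper name is my own, not part of openFDA):

```python
def build_event_search(brand_name, start_date, end_date):
    """Compose an openFDA adverse-event search expression.

    Plain spaces are used around AND and TO; requests will URL-encode
    them automatically when the string is passed as a params value.
    """
    return (
        f'patient.drug.openfda.brand_name:"{brand_name}" '
        f"AND receivedate:[{start_date} TO {end_date}]"
    )
```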
EPA Environmental Data
EPA's ECHO database tracks facility compliance and enforcement actions. Some ECHO pages render their results with JavaScript; for those, route requests through a proxy service with rendering support. ScraperAPI handles this automatically:
```python
class EPAScraper:
    ECHO_URL = "https://echodata.epa.gov/echo"
    SCRAPER_API_KEY = "YOUR_KEY"

    def search_facilities(self, state, violation_status="V"):
        # Pass the target ECHO URL through ScraperAPI, which fetches
        # (and renders, if needed) the page on our behalf
        params = {
            "api_key": self.SCRAPER_API_KEY,
            "url": (
                f"{self.ECHO_URL}/compliance_api/rest_services"
                f".get_facilities?p_st={state}&p_qiv={violation_status}"
                "&output=JSON"
            ),
        }
        response = requests.get(
            "http://api.scraperapi.com",
            params=params,
        )
        response.raise_for_status()
        return response.json()

    def get_enforcement_cases(self, facility_id):
        url = (
            f"{self.ECHO_URL}/dfr_api/rest_services"
            f".get_enforcement_summary?p_id={facility_id}&output=JSON"
        )
        response = requests.get(url)
        response.raise_for_status()
        return response.json()
```
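ECHO responses deserve defensive parsing: in my experience its REST services nest data under a top-level `"Results"` key (an assumption about the response envelope, so verify against live output), and error pages can come back as HTML rather than JSON. A hedged unwrapping sketch:

```python
import json


def parse_echo(body):
    """Defensively unwrap an ECHO REST response body.

    Returns the nested "Results" payload when present, the parsed JSON
    otherwise, and None when the body is not JSON at all (e.g. an HTML
    error page), so callers never raise mid-crawl.
    """
    try:
        payload = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        return None
    if isinstance(payload, dict):
        return payload.get("Results", payload)
    return payload
```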
Building a Unified Pipeline
Combine all three scrapers into a monitoring pipeline:
```python
import pandas as pd
from datetime import datetime, timezone


def daily_regulatory_scan():
    sec = SECEdgarScraper()
    fda = FDAScraper()
    epa = EPAScraper()

    results = {
        "sec_filings": sec.search_filings("Tesla", "8-K"),
        "fda_approvals": fda.search_drug_approvals("Ozempic"),
        "epa_violations": epa.search_facilities("CA"),
    }

    for source, data in results.items():
        df = pd.json_normalize(data)
        df["source"] = source
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        df["scan_date"] = datetime.now(timezone.utc).isoformat()
        df.to_csv(f"regulatory_{source}_{datetime.now():%Y%m%d}.csv", index=False)
        print(f"[{source}] Found {len(data)} records")

    return results
```
Scaling Your Regulatory Scraper
At scale, you'll need reliable proxy infrastructure. ThorData provides residential proxies for sites that block datacenter IPs, and ScrapeOps helps monitor your scraper health and success rates across all sources.
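Reliability at scale also means surviving transient failures: proxy hiccups, 429s, and brief outages. One common pattern is exponential backoff; a minimal sketch (the `with_retries` helper is illustrative, not from any of the services above):

```python
import time


def with_retries(fn, attempts=4, base_delay=0.5):
    """Call fn(), retrying on exception with exponential backoff.

    Delays grow as base_delay * 2**attempt; the exception from the
    final attempt is re-raised so permanent failures still surface.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping each scraper call, e.g. `with_retries(lambda: sec.search_filings("Tesla", "8-K"))`, turns one flaky response into a short pause instead of a failed run.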
Important Notes
- SEC EDGAR requires a proper User-Agent header identifying your organization
- FDA APIs have rate limits — respect them with appropriate delays
- EPA data is public but some endpoints require API keys
- Always store raw responses before parsing for reproducibility
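The last point above is easy to implement and pays off constantly: parsers change, raw data doesn't. A sketch of a raw-response archiver (the function name and file-naming scheme are my own choices):

```python
import hashlib
import pathlib
from datetime import datetime, timezone


def archive_raw(source, body, out_dir="raw"):
    """Write an untouched response body to disk before any parsing.

    Files are keyed by source, UTC timestamp, and a short content hash,
    so re-runs never overwrite earlier captures and identical payloads
    are easy to spot.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out / f"{source}_{stamp}_{digest}.txt"
    path.write_text(body, encoding="utf-8")
    return path
```

Call it with `response.text` immediately after each request; if a parser later turns out to be wrong, you can replay every archived body instead of re-scraping.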
Regulatory data scraping is a high-value skill. Whether you're building a fintech product, conducting academic research, or monitoring compliance, these scrapers give you a systematic edge.