agenthustler

Posted on Mar 27

Scraping FDA Drug Approval Databases for Pharma Intelligence

#python #tutorial #webdev #programming

The Value of FDA Approval Data

The FDA approves hundreds of drug applications annually, counting both new drugs and generics, and a major approval can trigger stock movements, competitor responses, and market shifts. Pharma companies, investors, and researchers all need this data, but the FDA's website is notoriously difficult to navigate programmatically.

Let's build a scraper that extracts structured approval data from FDA databases.

Target Databases

The FDA maintains several key databases:

Drugs@FDA — approved drug products with labels
Orange Book — patent and exclusivity data
FAERS — adverse event reports

Setup

pip install requests beautifulsoup4 pandas

For production scraping, use ScraperAPI to avoid blocks from government sites that rate-limit aggressively.

Querying Drugs@FDA via the openFDA API

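import requests
import pandas as pd
from datetime import datetime, timedelta

COLUMNS = ["brand_name", "generic_name", "manufacturer", "application_number"]

def get_recent_approvals(days=30):
    """Fetch applications with a submission status change in the last `days` days."""
    end = datetime.now()
    start = end - timedelta(days=days)
    # Use spaces around TO: requests encodes a space as "+", while a literal
    # "+" would be sent percent-encoded as %2B.
    date_range = f"[{start.strftime('%Y%m%d')} TO {end.strftime('%Y%m%d')}]"

    url = "https://api.fda.gov/drug/drugsfda.json"
    params = {
        "search": f"submissions.submission_status_date:{date_range}",
        "limit": 100
    }

    response = requests.get(url, params=params, timeout=30)
    if response.status_code == 404:
        # openFDA answers 404 when the search matches nothing
        return pd.DataFrame(columns=COLUMNS)
    response.raise_for_status()
    data = response.json()

    approvals = []
    for result in data.get("results", []):
        for product in result.get("products", []):
            approvals.append({
                "brand_name": product.get("brand_name"),
                "generic_name": result.get("openfda", {}).get("generic_name", [""])[0],
                "manufacturer": result.get("sponsor_name"),
                "application_number": result.get("application_number")
            })
    # Passing columns keeps the frame's shape stable even with no rows
    return pd.DataFrame(approvals, columns=COLUMNS)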
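
The date search above matches any application with a submission whose status changed in the window, which can include supplements as well as original approvals. Here is a minimal sketch of narrowing one result down to original approvals. It assumes each entry in `submissions` carries `submission_type`, `submission_status` (with `AP` meaning approved), and `submission_status_date` in `YYYYMMDD` form; `original_approvals` is a hypothetical helper name, and the sample record is made up.

```python
from datetime import datetime


def original_approvals(result, start, end):
    """Return approval dates of original submissions approved within [start, end].

    `result` is one entry of the drugsfda "results" list. The field names and
    the "ORIG" / "AP" codes are assumptions about the response layout.
    """
    dates = []
    for sub in result.get("submissions", []):
        if sub.get("submission_type") != "ORIG":
            continue  # skip supplements and other submission types
        if sub.get("submission_status") != "AP":
            continue  # skip anything that is not an approval
        raw = sub.get("submission_status_date")
        if not raw:
            continue
        approved = datetime.strptime(raw, "%Y%m%d")
        if start <= approved <= end:
            dates.append(approved)
    return sorted(dates)


# Made-up record for illustration only
sample = {
    "application_number": "NDA000000",
    "submissions": [
        {"submission_type": "ORIG", "submission_status": "AP",
         "submission_status_date": "20240315"},
        {"submission_type": "SUPPL", "submission_status": "AP",
         "submission_status_date": "20240320"},
    ],
}
print(original_approvals(sample, datetime(2024, 3, 1), datetime(2024, 3, 31)))
```

Filtering on the client keeps the API query simple; if the extra results become a problem, the same conditions could probably be pushed into the `search` string instead.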

Enriching with Patent Data

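The Orange Book contains patent expiration dates, which are critical for generic drug timing. openFDA does not serve Orange Book patent or exclusivity records, though; the FDA publishes those separately as downloadable Orange Book data files. What the API does offer is NDC directory metadata for an application number, which you can later join to Orange Book records:

def get_product_info(application_number):
    """Get NDC directory product metadata for an application (no patent data)."""
    url = "https://api.fda.gov/drug/ndc.json"
    params = {
        "search": f"application_number:{application_number}",
        "limit": 1
    }

    response = requests.get(url, params=params, timeout=30)
    if response.status_code != 200:
        return None  # includes the 404 openFDA returns when nothing matches

    results = response.json().get("results", [])
    if not results:
        return None

    product = results[0]
    routes = product.get("route") or [""]
    return {
        "ndc": product.get("product_ndc"),
        "route": routes[0],
        "dosage_form": product.get("dosage_form"),
        "marketing_start": product.get("marketing_start_date")
    }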
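
The NDC endpoint above returns product metadata rather than patent records. Patent expirations come from the Orange Book data files, which the FDA distributes as a separate download. Here is a minimal sketch of reading the patent file once you have it locally. It assumes the file is tilde-delimited with a header row, that the columns are named `Appl_No` and `Patent_Expire_Date_Text`, and that dates look like `Mar 5, 2030`; check all three against the file you actually download. `latest_patent_expiry` is a hypothetical helper name.

```python
import csv
import io
from datetime import datetime


def latest_patent_expiry(patent_text, appl_no):
    """Return the latest patent expiration date for an application, or None.

    Column names and the date format are assumptions about the file layout.
    """
    reader = csv.DictReader(io.StringIO(patent_text), delimiter="~")
    expiries = []
    for row in reader:
        if row.get("Appl_No") != appl_no:
            continue
        raw = (row.get("Patent_Expire_Date_Text") or "").strip()
        if raw:
            expiries.append(datetime.strptime(raw, "%b %d, %Y").date())
    return max(expiries, default=None)


# Made-up rows for illustration only; real data comes from the downloaded file
sample = (
    "Appl_Type~Appl_No~Product_No~Patent_No~Patent_Expire_Date_Text\n"
    "N~000001~001~1111111~Mar 5, 2030\n"
    "N~000001~001~2222222~Jan 15, 2033\n"
    "N~000002~001~3333333~Jul 1, 2028\n"
)
print(latest_patent_expiry(sample, "000001"))
```

Taking the latest expiry is only a rough proxy for generic entry, since exclusivity periods and patent challenges also matter. Orange Book application numbers also appear to omit the `NDA`/`ANDA` prefix that openFDA uses, so the prefix likely needs stripping before a join.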

Building an Alert System

def check_and_alert(keywords):
    """Print recent approvals whose brand or generic name contains a keyword."""
    df = get_recent_approvals(days=1)
    if df.empty:
        return  # nothing to match; an empty frame may also lack the name columns

    for keyword in keywords:
        matches = df[
            df["generic_name"].str.contains(keyword, case=False, na=False) |
            df["brand_name"].str.contains(keyword, case=False, na=False)
        ]

        if not matches.empty:
            print(f"\n🔔 New approval matching {keyword}:")
            for _, row in matches.iterrows():
                print(f"  {row['brand_name']} ({row['generic_name']})")
                print(f"  Manufacturer: {row['manufacturer']}")

# Matching runs against drug names, so name stems work better than
# therapeutic-area terms: "nib" catches many kinase inhibitors and
# "mab" catches monoclonal antibodies
check_and_alert(["nib", "mab"])

Adverse Event Mining

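FAERS data can surface potential safety signals before they become headlines. Reports are voluntary and unverified, so the counts reflect reporting volume rather than how often a reaction actually occurs: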

def get_adverse_events(drug_name, limit=50):
    """Pull adverse event reports for a specific drug."""
    url = "https://api.fda.gov/drug/event.json"
    params = {
        "search": f"patient.drug.openfda.brand_name:{drug_name}",
        "count": "patient.reaction.reactionmeddrapt.exact",
        "limit": limit
    }

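    response = requests.get(url, params=params, timeout=30)
    if response.status_code == 404:
        # openFDA answers 404 when no reports match the drug name
        return pd.DataFrame(columns=["reaction", "count"])
    response.raise_for_status()
    data = response.json()

    events = []
    for result in data.get("results", []):
        events.append({
            "reaction": result["term"],
            "count": result["count"]
        })

    # Naming the columns keeps the sort from failing on an empty result
    df = pd.DataFrame(events, columns=["reaction", "count"])
    return df.sort_values("count", ascending=False)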

# Example: check adverse events for a popular drug
df = get_adverse_events("HUMIRA")
print(df.head(10))

Scaling with Proxy Infrastructure

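The openFDA API allows 240 requests per minute, plus a daily cap: 1,000 requests per IP address without an API key, or 120,000 per key once you register for a free one. For production monitoring, get a key first. Beyond that, ThorData residential proxies can distribute requests, and ScrapeOps can monitor your scraper health.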
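
One way to stay under the 240 requests/minute limit mentioned above is to space requests on the client side. Here is a minimal sketch that enforces a minimum interval between calls. `MinIntervalThrottle` is a hypothetical helper, not part of requests or openFDA, and the `clock` and `sleep` parameters exist only so the timing can be swapped out in tests.

```python
import time


class MinIntervalThrottle:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute=240, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Sleep until the next slot is free and return the seconds waited."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Reserve the slot after the one just used
        self._next_allowed = now + delay + self.interval
        return delay


throttle = MinIntervalThrottle(per_minute=240)
for _ in range(3):
    throttle.wait()  # call this right before each requests.get(...)
```

A fixed interval is simpler than a token bucket and never bursts, which suits a polling job. It does nothing about the daily cap, so a long-running monitor would still need to count its requests.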

Use Cases

Investment signals — new approvals for public pharma companies
Competitive intelligence — track competitor pipeline approvals
Safety monitoring — early adverse event detection
Generic entry timing — patent expiry tracking

Conclusion

FDA databases are a goldmine for pharma intelligence. With Python and tools like ScraperAPI, you can build automated monitoring that would cost thousands from commercial providers.

DEV Community