DEV Community

agenthustler

Scraping FDA Drug Approval Databases for Pharma Intelligence

The Value of FDA Approval Data

The FDA approves hundreds of drugs annually, and each approval triggers stock movements, competitor responses, and market shifts. Pharma companies, investors, and researchers all need this data — but the FDA's website is notoriously difficult to navigate programmatically.

Let's build a scraper that extracts structured approval data from FDA databases.

Target Databases

The FDA maintains several key databases:

  • Drugs@FDA — approved drug products with labels
  • Orange Book — patent and exclusivity data
  • FAERS — adverse event reports

Setup

pip install requests beautifulsoup4 pandas

For production scraping, use ScraperAPI to avoid blocks from government sites that rate-limit aggressively.

Querying Drugs@FDA via the openFDA API

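import requests
import pandas as pd
from datetime import datetime, timedelta

def get_recent_approvals(days=30):
    """Fetch applications with a submission status date in the last `days` days."""
    end = datetime.now()
    start = end - timedelta(days=days)
    # Put spaces around TO: requests encodes them as "+", which openFDA expects.
    # A literal "+" would be percent-encoded and break the range query.
    date_range = f"[{start:%Y%m%d} TO {end:%Y%m%d}]"

    url = "https://api.fda.gov/drug/drugsfda.json"
    params = {
        "search": f"submissions.submission_status_date:{date_range}",
        "limit": 100
    }

    response = requests.get(url, params=params, timeout=30)
    # openFDA answers 404 with an "error" body when nothing matches,
    # so a missing "results" key simply means no approvals.
    data = response.json()

    approvals = []
    for result in data.get("results", []):
        for product in result.get("products", []):
            approvals.append({
                "brand_name": product.get("brand_name"),
                "generic_name": result.get("openfda", {}).get("generic_name", [""])[0],
                "manufacturer": result.get("sponsor_name"),
                "application_number": result.get("application_number")
            })
    # Fixed columns keep downstream filters working on an empty result
    columns = ["brand_name", "generic_name", "manufacturer", "application_number"]
    return pd.DataFrame(approvals, columns=columns)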
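A single request returns at most `limit` records, so a busy month can be cut off. openFDA accepts a `skip` parameter for paging, and the sketch below shows one way to walk through the pages. Treat it as an illustration under assumptions: `iter_results` and `fetch_drugsfda_page` are hypothetical helpers, not part of any library, and `skip` has its own upper bound that is worth checking in the openFDA docs.

```python
from typing import Callable, Dict, Iterator, List

PAGE_SIZE = 100  # openFDA caps `limit` at 1000 per request


def iter_results(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = PAGE_SIZE,
    max_records: int = 1000,
) -> Iterator[Dict]:
    """Yield records page by page until a short page or max_records is reached.

    fetch_page(skip, limit) is a hypothetical callable that returns the
    "results" list for one request, or [] when nothing matches.
    """
    skip = 0
    while skip < max_records:
        limit = min(page_size, max_records - skip)
        page = fetch_page(skip, limit)
        yield from page
        if len(page) < limit:  # a short page means there is nothing left
            break
        skip += limit


def fetch_drugsfda_page(search: str) -> Callable[[int, int], List[Dict]]:
    """Build a fetch_page callable for the drugsfda endpoint (needs requests)."""
    import requests  # imported lazily so the paging logic has no dependencies

    def fetch_page(skip: int, limit: int) -> List[Dict]:
        response = requests.get(
            "https://api.fda.gov/drug/drugsfda.json",
            params={"search": search, "limit": limit, "skip": skip},
            timeout=30,
        )
        # A "no matches" response carries an "error" key and no "results"
        return response.json().get("results", [])

    return fetch_page
```

To use it, pass the same `search` string as in `get_recent_approvals` to `fetch_drugsfda_page` and loop over `iter_results(...)`. Keeping the HTTP call behind a callable also makes the paging logic easy to test without touching the network.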

Enriching with Patent Data

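The Orange Book contains patent expiration dates, which are critical for timing generic entry. openFDA does not expose an Orange Book endpoint, though; the FDA publishes the Orange Book as downloadable data files. What openFDA does offer is the NDC directory, which adds route, dosage form, and marketing start date for an application:

def get_patent_info(application_number):
    """Get NDC directory details for an application.

    Despite the name, this returns product metadata. Patent and
    exclusivity data live in the Orange Book data files, not in openFDA.
    """
    url = "https://api.fda.gov/drug/ndc.json"
    params = {
        "search": f"application_number:{application_number}",
        "limit": 1
    }

    response = requests.get(url, params=params, timeout=30)
    if response.status_code != 200:
        return None  # openFDA returns 404 when nothing matches

    results = response.json().get("results", [])
    if not results:
        return None  # avoid an IndexError on an empty result list

    product = results[0]
    route = product.get("route") or [""]
    return {
        "ndc": product.get("product_ndc"),
        "route": route[0],
        "dosage_form": product.get("dosage_form"),
        "marketing_start": product.get("marketing_start_date")
    }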
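For the patent expiry dates themselves, the Orange Book data files are the place to look. Here's a minimal sketch of reading them, with loud assumptions: it supposes the patent file is tilde-delimited with the column names shown, that dates look like `Jan 7, 2031`, and that `Appl_No` is the numeric part of an application number such as `NDA012345`. The sample rows and `latest_patent_expiry` are made up for illustration, so verify the layout against the file you actually download.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows in the layout the patent file is assumed to use
SAMPLE_PATENT_TXT = """Appl_Type~Appl_No~Product_No~Patent_No~Patent_Expire_Date_Text
N~012345~001~1234567~Jan 7, 2031
N~012345~001~7654321~Mar 15, 2029
"""


def latest_patent_expiry(patent_text, appl_no):
    """Return the latest patent expiry date for an application, or None."""
    reader = csv.DictReader(io.StringIO(patent_text), delimiter="~")
    dates = [
        datetime.strptime(row["Patent_Expire_Date_Text"], "%b %d, %Y").date()
        for row in reader
        if row["Appl_No"] == appl_no
    ]
    return max(dates) if dates else None


print(latest_patent_expiry(SAMPLE_PATENT_TXT, "012345"))
```

The latest expiry is only a rough proxy for generic entry timing, since exclusivity periods and patent challenges can move the real date in either direction.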

Building an Alert System

def check_and_alert(keywords):
    """Monitor for approvals whose brand or generic name contains a keyword."""
    df = get_recent_approvals(days=1)
    if df.empty:
        return  # nothing new today

    for keyword in keywords:
        matches = df[
            df["generic_name"].str.contains(keyword, case=False, na=False) |
            df["brand_name"].str.contains(keyword, case=False, na=False)
        ]

        if not matches.empty:
            print(f"\n🔔 New approval matching {keyword}:")
            for _, row in matches.iterrows():
                print(f"  {row['brand_name']} ({row['generic_name']})")
                print(f"  Manufacturer: {row['manufacturer']}")

# Matching is done on drug names, so use name stems rather than therapeutic
# areas: "tinib" for kinase inhibitors, "mab" for monoclonal antibodies
check_and_alert(["tinib", "mab"])
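If the monitor runs on a schedule, the same approval can show up in more than one run. One way to avoid repeat alerts is to remember which application numbers have already been reported. The sketch below is an assumption layered on top of the alert code: `filter_unseen` and the JSON state file are hypothetical, and the application numbers in the usage note are made up.

```python
import json
from pathlib import Path


def filter_unseen(application_numbers, state_path):
    """Return application numbers not alerted on before, and record them."""
    path = Path(state_path)
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    # dict.fromkeys drops duplicates within this run while keeping order
    fresh = [n for n in dict.fromkeys(application_numbers) if n not in seen]
    path.write_text(json.dumps(sorted(seen | set(fresh))))
    return fresh
```

Calling `filter_unseen(["NDA000001", "NDA000002"], "seen.json")` twice in a row returns both numbers the first time and an empty list the second, so only the first run would print an alert.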

Adverse Event Mining

FAERS data can surface potential safety signals early, though the reports are unverified and do not establish that a drug caused a reaction:

def get_adverse_events(drug_name, limit=50):
    """Count the most frequently reported reactions for a specific drug."""
    url = "https://api.fda.gov/drug/event.json"
    params = {
        # Quote the name so multi-word brand names match as a phrase
        "search": f'patient.drug.openfda.brand_name:"{drug_name}"',
        "count": "patient.reaction.reactionmeddrapt.exact",
        "limit": limit
    }

    response = requests.get(url, params=params, timeout=30)
    data = response.json()

    # With "count", each result is a {"term": ..., "count": ...} pair
    events = [
        {"reaction": result["term"], "count": result["count"]}
        for result in data.get("results", [])
    ]

    # Fixed columns so sorting also works when nothing is returned
    df = pd.DataFrame(events, columns=["reaction", "count"])
    return df.sort_values("count", ascending=False)

# Example: check adverse events for a popular drug
df = get_adverse_events("HUMIRA")
print(df.head(10))
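Raw counts mostly show which reactions are common overall. A standard way to ask whether a reaction is reported disproportionately for one drug is the proportional reporting ratio (PRR), which the code in this post does not compute. Here's a minimal sketch: the function name and the counts in the example are hypothetical, and the four counts would have to come from separate FAERS queries.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR = [a / (a + b)] / [c / (c + d)].

    a: reports with the drug and the reaction
    b: reports with the drug and any other reaction
    c: reports with other drugs and the reaction
    d: reports with other drugs and any other reaction
    """
    if a + b == 0 or c + d == 0 or c == 0:
        return None  # ratio is undefined without a comparison rate
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts, for illustration only
prr = proportional_reporting_ratio(a=20, b=980, c=100, d=98900)
print(round(prr, 2))
```

A PRR well above 1 suggests the reaction is reported more often with this drug than with others, which is a prompt for closer review rather than proof of a safety problem.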

Scaling with Proxy Infrastructure

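openFDA limits each client to 240 requests per minute, with a daily cap on top: 1,000 requests per day per IP address without an API key, or 120,000 per day with a free key. For production monitoring, use ThorData residential proxies to distribute requests, or ScrapeOps for monitoring your scraper health.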
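Staying under the per-minute limit is mostly a matter of spacing requests out. The sketch below is one simple way to do that, assuming a single process makes all the calls; `Throttle` is a hypothetical helper, and the injectable `clock` and `sleep` arguments exist only to make it testable.

```python
import time


class Throttle:
    """Space calls evenly so no more than `per_minute` happen per minute."""

    def __init__(self, per_minute=240, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute  # 0.25 s at 240 requests/minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Create one `Throttle()` and call `throttle.wait()` right before each `requests.get(...)`. If you register for an openFDA key, pass it as the `api_key` query parameter alongside `search` and `limit`.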

Use Cases

  1. Investment signals — new approvals for public pharma companies
  2. Competitive intelligence — track competitor pipeline approvals
  3. Safety monitoring — early adverse event detection
  4. Generic entry timing — patent expiry tracking

Conclusion

FDA databases are a goldmine for pharma intelligence. With Python and tools like ScraperAPI, you can build automated monitoring that would cost thousands from commercial providers.
