How to Search ClinicalTrials.gov Programmatically (The v2 API is Actually Good Now)

#api #healthcare #data #python

ClinicalTrials.gov quietly launched a completely new API in 2024 and it's honestly a massive upgrade over the old one. If you tried the v1 API years ago and gave up because the XML responses were a nightmare, it's worth another look.

The new v2 API returns JSON, supports proper pagination with tokens, and has a clean query syntax. I spent a while building tooling around it and want to share what I learned -- the docs are decent but leave out some practical stuff.

What you can search for

The registry has over 500,000 clinical studies. You can search by:

Keyword across all fields (drug names, conditions, descriptions)
Condition (diabetes, breast cancer, PTSD, etc.)
Intervention (specific drug or device names)
Sponsor (Pfizer, NIH, Mayo Clinic, etc.)
NCT ID for direct lookup of a specific trial

You can also filter by status (recruiting, completed, terminated), phase (early phase 1 through phase 4), and study type (interventional vs. observational).

The API basics

Base URL: https://clinicaltrials.gov/api/v2/studies

A simple keyword search:

GET https://clinicaltrials.gov/api/v2/studies?query.term=ozempic&pageSize=10

Filter to only recruiting Phase 3 trials:

GET https://clinicaltrials.gov/api/v2/studies?query.term=ozempic&filter.overallStatus=RECRUITING&filter.phase=PHASE3&pageSize=10

No API key, no auth, no signup. Rate limits exist but they're generous -- I haven't hit them doing reasonable batch queries.
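If you'd rather build these URLs in code than by hand, here's a minimal Python sketch using only the standard library. The helper names (build_search_url, fetch_json) are made up for illustration, and the filter names are just the ones from the example URLs above, passed through as-is.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_search_url(term, filters=None, page_size=10):
    """Build a search URL like the examples above (hypothetical helper)."""
    params = {"query.term": term}
    # Extra filters are passed through untouched, e.g.
    # {"filter.overallStatus": "RECRUITING"}.
    params.update(filters or {})
    params["pageSize"] = page_size
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the JSON body. This makes a real network call."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling build_search_url("ozempic") gives the first URL shown above; pass the result to fetch_json when you actually want to hit the API.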

The response structure is deeply nested

This is the part that trips people up. Each study comes back with a protocolSection that contains about a dozen nested modules. The data you actually want is buried 3-4 levels deep.

Want the trial status? It's at protocolSection.statusModule.overallStatus. The sponsor? protocolSection.sponsorCollaboratorsModule.leadSponsor.name. Conditions? protocolSection.conditionsModule.conditions (an array). Locations? protocolSection.contactsLocationsModule.locations (another array of objects with facility, city, state, country).

It's not bad once you map it out, but the first time you look at a raw response it's pretty overwhelming. I ended up writing a flattener that pulls out the ~20 fields people actually care about and produces one clean row per study.
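To give a feel for what flattening looks like, here's a rough sketch covering only the paths listed above. It is not the flattener I mentioned, just the idea in miniature. The dig and flatten_study names are hypothetical, and the identificationModule.nctId path is an extra I'm assuming beyond the ones already described.

```python
def dig(obj, *keys, default=None):
    """Walk nested dicts, returning default as soon as a key is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return default if obj is None else obj


def flatten_study(study):
    """Reduce one nested study record to a flat dict (one row per study)."""
    protocol = study.get("protocolSection") or {}
    locations = dig(protocol, "contactsLocationsModule", "locations", default=[])
    conditions = dig(protocol, "conditionsModule", "conditions", default=[])
    countries = sorted({loc.get("country") for loc in locations if loc.get("country")})
    return {
        # Assumed path, not one of the fields listed in the text above.
        "nct_id": dig(protocol, "identificationModule", "nctId"),
        "status": dig(protocol, "statusModule", "overallStatus"),
        "sponsor": dig(protocol, "sponsorCollaboratorsModule", "leadSponsor", "name"),
        "conditions": "; ".join(conditions),
        "countries": "; ".join(countries),
        "location_count": len(locations),
    }
```

Joining the arrays into delimited strings keeps each study on a single row, which is what you want for CSV export; keep them as lists if you're staying in Python.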

Pagination

The v2 API uses token-based pagination. Each response includes a nextPageToken if there are more results. Pass it back as pageToken on the next request. Way better than the old offset-based approach -- you don't lose your place if new studies get added mid-query.

Max page size is 1000, and the default is 10 (which is annoyingly small -- always set pageSize explicitly).
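The token loop is short enough to sketch. In this version, fetch_page is a hypothetical callable you supply (it takes a dict of query parameters and returns the decoded JSON for one page), and I'm assuming the list of results sits under a top-level "studies" key next to nextPageToken.

```python
def iter_studies(fetch_page, params, max_pages=None):
    """Yield studies one by one, following nextPageToken between pages."""
    params = dict(params)  # copy so the caller's dict is not mutated
    pages = 0
    while True:
        page = fetch_page(params)
        # "studies" as the results key is an assumption; adjust if needed.
        yield from page.get("studies", [])
        pages += 1
        token = page.get("nextPageToken")
        if not token or (max_pages is not None and pages >= max_pages):
            break
        # Pass the token back as pageToken on the next request.
        params["pageToken"] = token
```

Keeping the HTTP call outside the generator makes it easy to add retries or throttling later, and max_pages gives you a safety cap while you're experimenting.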

What I use this for

I've been pulling clinical trials data for a few different things:

Competitive intelligence in pharma. Track what trials a specific company is running, which phases they're in, and where they're recruiting. If you're in biotech BD or investing, this is gold.

Site selection for CROs. Finding which facilities are running trials for a specific condition in a specific geography. The locations data is surprisingly complete.

Patient recruitment. Matching conditions and locations to find recruiting trials. The ClinicalTrials.gov website does this too, but the API lets you build custom alerts and filters.

Academic research. Studying trends in clinical research -- which conditions get the most trials, how enrollment numbers trend over time, sponsor distribution. The dataset is rich enough for real analysis.

The pre-built option

If you don't want to deal with the nested responses and pagination logic, I built a tool that wraps all of this:

ClinicalTrials.gov Search

It handles pagination, flattens the nested response into clean records, and exports to CSV/JSON/Excel. Search by keyword, condition, intervention, sponsor, or NCT ID. Filter by status, phase, and study type. Up to 10,000 results per run.

Pricing: $0.005 per result + $0.10 per run. 1,000 trial records cost $5.10 (1,000 x $0.005 + $0.10).

Gotchas if you're hitting the API directly

The query.term field searches everything. If you search for "cancer", you'll get trials where cancer appears in the title, conditions, interventions, description, eligibility criteria -- everywhere. Use query.cond for condition-specific searches or query.intr for intervention-specific ones.

Phase values are specific strings. Use PHASE1, PHASE2, PHASE3, PHASE4, EARLY_PHASE1, or NA. Not "Phase 1", "1", or "phase1".

Status values too: RECRUITING, COMPLETED, ACTIVE_NOT_RECRUITING, TERMINATED, WITHDRAWN, NOT_YET_RECRUITING, SUSPENDED, ENROLLING_BY_INVITATION, UNKNOWN. They're case-sensitive.
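One way to avoid sending a bad value is to check it before building the request. This sketch only knows the phase and status strings listed above; normalize_phase and check_status are hypothetical helper names, and the loose-input mapping ("Phase 3" to PHASE3) is my own convenience, not something the API does for you.

```python
import re

PHASES = {"EARLY_PHASE1", "PHASE1", "PHASE2", "PHASE3", "PHASE4", "NA"}
STATUSES = {
    "RECRUITING", "COMPLETED", "ACTIVE_NOT_RECRUITING", "TERMINATED",
    "WITHDRAWN", "NOT_YET_RECRUITING", "SUSPENDED",
    "ENROLLING_BY_INVITATION", "UNKNOWN",
}


def normalize_phase(value):
    """Map loose input such as 'Phase 3', '3' or 'phase3' to 'PHASE3'."""
    cleaned = re.sub(r"[\s_-]+", "", str(value)).upper()
    if cleaned.isdigit():
        cleaned = "PHASE" + cleaned
    if cleaned == "EARLYPHASE1":
        cleaned = "EARLY_PHASE1"
    if cleaned not in PHASES:
        raise ValueError(f"unrecognized phase: {value!r}")
    return cleaned


def check_status(value):
    """Return value unchanged if it is a listed status, else raise."""
    if value not in STATUSES:
        raise ValueError(f"unrecognized status: {value!r} (values are case-sensitive)")
    return value
```

Failing locally with a clear message is usually easier to debug than working out why a request came back empty or rejected.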

Some fields are arrays that might be empty. Conditions, interventions, locations, collaborators -- any of these can be missing, null, or empty. Don't assume they exist.

Date formats are inconsistent. Some dates come as "2024-01-15", others as "January 2024", others as "January 15, 2024". The API doesn't normalize them. Fun times.
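If you need real dates out of those strings, trying a short list of formats in order is usually enough. Here's a sketch using datetime.strptime; parse_trial_date is a hypothetical name, the "%Y-%m" (year and month only) format is an extra I'm assuming beyond the three shapes above, and month-only values are pinned to the first of the month.

```python
from datetime import datetime

# Tried in order; the first format that matches wins.
DATE_FORMATS = (
    "%Y-%m-%d",   # 2024-01-15
    "%Y-%m",      # 2024-01 (assumed extra shape)
    "%B %d, %Y",  # January 15, 2024
    "%B %Y",      # January 2024
)


def parse_trial_date(text):
    """Return a datetime.date for a known format, or None if nothing matches."""
    if not text:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning None for anything unrecognized lets you log the odd ones and extend DATE_FORMATS as you run into new shapes. Note that "%B" matches English month names only under the default locale.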

Need clinical trials data at scale? Try the ClinicalTrials.gov Search on Apify. Feedback welcome on the actor page or Apify Discord.