I had a short window this week to evaluate ClinicalTrials.gov as a data source. Here is the condensed write-up of what the data looks like, what surprised me, and the bits of infrastructure that paid off.
The source
The dataset comes from the ClinicalTrials.gov Scraper, which pulls clinical trial data straight from the official ClinicalTrials.gov API: no login, no API key, no blocking. The relevant questions for any new source are always the same: is the markup stable, is pagination sensible, and how aggressively does it rate-limit? For this one, all three answers are "good enough that you can build on it" -- which is honestly more than I can say for a lot of supposedly easy targets.
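I also spot-checked the pagination claim against the official API directly. A minimal sketch, assuming the v2 `studies` endpoint and its `pageSize`/`pageToken`/`nextPageToken` parameters -- verify against the current API docs before building on it:

```python
import requests

BASE = "https://clinicaltrials.gov/api/v2/studies"  # assumed v2 endpoint

def iter_studies(query: str, page_size: int = 100):
    """Yield raw study records, following nextPageToken until the feed runs out."""
    params = {"query.term": query, "pageSize": page_size}
    while True:
        resp = requests.get(BASE, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        yield from payload.get("studies", [])
        token = payload.get("nextPageToken")
        if not token:  # last page reached
            break
        params["pageToken"] = token

# Quick smoke test: pull a handful of records and stop.
for n, study in enumerate(iter_studies("heart failure")):
    if n >= 5:
        break
```

Token-based pagination like this is the good kind: no offset arithmetic, no risk of skipping rows when the result set shifts under you mid-crawl.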
The schema
What you get back per record:
- Identifiers and titles: `nctId`, `briefTitle`, `officialTitle`, `acronym`, `url`
- Sponsorship: `organization`, `leadSponsor`, `sponsorClass`, `collaborators`
- Study design: `studyType`, `phases`, `overallStatus`, `enrollmentCount`, `hasResults`
- Clinical content: `conditions`, `interventions`, `briefSummary`
- Eligibility: `sex`, `minimumAge`, `maximumAge`, `healthyVolunteers`
- Dates: `startDate`, `completionDate`, `primaryCompletionDate`, `firstPostedDate`, `lastUpdatePostedDate`
- Everything else: `locations`, `scrapedAt`
Nothing exotic, which is exactly what you want from a feed. Flat records, predictable keys, types you can guess from the names.
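If you want to pin those guesses down, a typed record makes them explicit. A minimal sketch in Python, trimmed to the fields I actually used; the types are my inference from the field names and the sample records in the next section, not a published schema:

```python
from typing import TypedDict

class Trial(TypedDict, total=False):
    # total=False because optional fields are omitted upstream, not set to null
    nctId: str
    briefTitle: str
    officialTitle: str
    acronym: str | None      # explicitly null in one of the sample records
    organization: str
    overallStatus: str       # e.g. "COMPLETED"
    studyType: str           # e.g. "INTERVENTIONAL"
    phases: list[str]        # e.g. ["PHASE2"]
    enrollmentCount: int
    leadSponsor: str
    conditions: list[str]
    hasResults: bool
    startDate: str           # ISO-8601 date string
    scrapedAt: str           # UTC ISO-8601 timestamp
    url: str
```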
Real rows
Two records from a sample run, trimmed to spare you the inevitable wall of text:
```json
{
  "nctId": "NCT01213784",
  "briefTitle": "Optimized Glycemic Control in Heart Failure Patients With DM2:\"Effect on Left Ventricular Function and Skeletal Muscle\"",
  "officialTitle": "Optimized Glycemic Control in Type 2 Diabetics With Heart Failure:\"Effect on Left Ventricular Function and Skeletal Muscle\"",
  "acronym": "HFDM",
  "organization": "University of Aarhus",
  "overallStatus": "COMPLETED",
  "studyType": "INTERVENTIONAL",
  "phases": ["PHASE2"],
  "enrollmentCount": 40,
  "leadSponsor": "University of Aarhus"
}
```

```json
{
  "nctId": "NCT03060538",
  "briefTitle": "A Multiple Ascending Dose Study to Evaluate Safety and Tolerability of BFKB8488A in Participants With Type 2 Diabetes Mellitus and...",
  "officialTitle": "A Phase Ib, Randomized, Blinded, Placebo-Controlled, Multiple Ascending-Dose Study to Evaluate the Safety, Tolerability, and...",
  "acronym": null,
  "organization": "Genentech, Inc.",
  "overallStatus": "COMPLETED",
  "studyType": "INTERVENTIONAL",
  "phases": ["PHASE1"],
  "enrollmentCount": 154,
  "leadSponsor": "Genentech, Inc."
}
```
Gotchas
A few things I would not have known without actually pulling data (the loader sketch after this list shows one way to handle all five):
- Optional fields disappear instead of being null. Not the end of the world, but it means every loader needs to be tolerant of missing keys.
- Long-form text fields contain control characters. Newlines, tabs, the occasional rogue carriage return. Strip them at load time unless you actively want them.
- Timestamps are UTC ISO-8601, which is great, but it does mean any local-time dashboard needs an explicit conversion.
- Some numeric fields are emitted as strings. Cast on load.
- Re-scraping with overlapping windows creates duplicates. Dedup on the natural ID.
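A minimal loader sketch that folds all five into one pass, assuming newline-delimited JSON on disk. The control-character regex and the choice of `nctId` as the dedup key are mine, not anything the feed mandates:

```python
import json
import re
from datetime import datetime

CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")  # newlines, tabs, stray carriage returns

def clean_record(raw: dict) -> dict:
    rec = dict(raw)
    # Gotcha 1: optional fields are absent rather than null -- tolerate missing keys.
    rec.setdefault("acronym", None)
    # Gotcha 2: strip control characters from long-form text fields.
    for key in ("briefTitle", "officialTitle", "briefSummary"):
        if isinstance(rec.get(key), str):
            rec[key] = CONTROL_CHARS.sub(" ", rec[key])
    # Gotcha 4: some numerics arrive as strings -- cast on load.
    if isinstance(rec.get("enrollmentCount"), str):
        rec["enrollmentCount"] = int(rec["enrollmentCount"])
    # Gotcha 3: parse UTC ISO-8601 once here; convert to local time only at display.
    if isinstance(rec.get("scrapedAt"), str):
        rec["scrapedAt"] = datetime.fromisoformat(rec["scrapedAt"].replace("Z", "+00:00"))
    return rec

def load(path: str) -> list[dict]:
    with open(path) as f:
        rows = [clean_record(json.loads(line)) for line in f]
    # Gotcha 5: overlapping scrape windows duplicate rows -- dedup on the natural ID,
    # keeping the last occurrence so the most recent scrape wins.
    return list({row["nctId"]: row for row in rows}.values())
```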
What I would build next
A few directions this dataset would support nicely:
- A daily snapshot pipeline that lands raw JSON into object storage, then materialises a curated table for dashboards.
- A change-detection layer that computes row-level diffs between consecutive scrapes -- great for surfacing new and removed records.
- A text-extraction layer over the long-form content fields, feeding into search or topic modelling.
- A small validation suite that runs after every scrape: row count above a floor, key fields present in 100% of rows, timestamps parsing cleanly. Cheap to write, and it catches schema drift in minutes instead of weeks -- see the sketch after this list.
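That validation suite really is cheap. A sketch operating on raw scraped dicts; the required-field list and the 1,000-row floor are placeholders you would tune to your own query:

```python
from datetime import datetime

REQUIRED = ("nctId", "briefTitle", "overallStatus", "url", "scrapedAt")
ROW_FLOOR = 1_000  # arbitrary placeholder -- tune to the expected size of your query

def _parses(ts) -> bool:
    try:
        datetime.fromisoformat(str(ts).replace("Z", "+00:00"))
        return True
    except ValueError:
        return False

def validate(rows: list[dict]) -> list[str]:
    """Return human-readable failures; an empty list means the scrape passed."""
    failures = []
    if len(rows) < ROW_FLOOR:
        failures.append(f"row count {len(rows)} below floor {ROW_FLOOR}")
    for field in REQUIRED:
        missing = sum(1 for r in rows if not r.get(field))
        if missing:
            failures.append(f"{field} missing in {missing}/{len(rows)} rows")
    bad_ts = sum(1 for r in rows if not _parses(r.get("scrapedAt", "")))
    if bad_ts:
        failures.append(f"scrapedAt fails to parse in {bad_ts} rows")
    return failures
```

Wire the return value into whatever alerting you already have; an empty list is a green build.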
Cost considerations
Worth thinking about before you commit. The dominant cost on a recurring feed is not the per-record extraction price -- it is the maintenance time when the upstream source changes. A solid heuristic: budget half a day per source per quarter for maintenance work, and twice that for sources with active anti-bot defences. If that maintenance budget is too steep for the value the dataset provides, the project is not a fit.
The other cost worth modelling is storage. Raw JSON partitioned by date is cheap if you compress it -- a few cents per gigabyte per month on most clouds -- but it stops being cheap if you forget about retention. Set a lifecycle policy that ages anything older than your useful replay window into a colder tier, and revisit the policy every few months.
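As an illustration, here is what that lifecycle policy looks like as a one-time boto3 call against S3. The bucket name, prefix, and day counts are all placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Age raw JSON partitions past the replay window into a colder tier,
# then expire them entirely a year out. All numbers are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-trials-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-raw-json",
                "Filter": {"Prefix": "raw/clinicaltrials/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```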
Bottom line
For an afternoon's evaluation work, this was time well spent. The dataset is structurally clean, the scraper handled rate limits without me having to think about them, and the records are rich enough to start asking real questions immediately. If the upstream source stays stable for a quarter -- which is the realistic horizon for most public sources -- the cost-benefit of integrating this feed is firmly positive.
For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/clinicaltrials-gov-scraper. It supports JSON, CSV and Excel exports and runs on a schedule.