I had a short window this week to evaluate ClinicalTrials.gov as a data source. Here is the condensed write-up of what the data looks like, what surprised me, and the bits of infrastructure that paid off.
The source
The dataset comes from the ClinicalTrials.gov Scraper, which pulls clinical trial data straight from the official ClinicalTrials.gov API: no login, no API key, no blocking. The relevant questions for any new source are always the same: is the markup stable, is pagination sensible, and how aggressively does it rate-limit? For this one, all three answers are "good enough that you can build on it" -- which is honestly more than I can say for a lot of supposedly easy targets.
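I also spot-checked the pagination claim against the official API directly. A minimal sketch, assuming the v2 `studies` endpoint and its `pageSize`/`pageToken`/`nextPageToken` parameters -- verify against the current API docs before building on it:

```python
import requests

BASE = "https://clinicaltrials.gov/api/v2/studies"  # assumed v2 endpoint

def iter_studies(query: str, page_size: int = 100):
    """Yield raw study records, following nextPageToken until the feed runs out."""
    params = {"query.term": query, "pageSize": page_size}
    while True:
        resp = requests.get(BASE, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        yield from payload.get("studies", [])
        token = payload.get("nextPageToken")
        if not token:  # last page reached
            break
        params["pageToken"] = token

# Quick smoke test: pull a handful of records and stop.
for n, study in enumerate(iter_studies("heart failure")):
    if n >= 5:
        break
```

Token-based pagination like this is the good kind: no offset arithmetic, no risk of skipping rows when the result set shifts under you mid-crawl.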
The schema
What you get back per record:
- Identifiers and titles: `nctId`, `briefTitle`, `officialTitle`, `acronym`, `url`
- Sponsorship: `organization`, `leadSponsor`, `sponsorClass`, `collaborators`
- Study design: `studyType`, `phases`, `overallStatus`, `enrollmentCount`, `hasResults`
- Clinical content: `conditions`, `interventions`, `briefSummary`
- Eligibility: `sex`, `minimumAge`, `maximumAge`, `healthyVolunteers`
- Dates: `startDate`, `completionDate`, `primaryCompletionDate`, `firstPostedDate`, `lastUpdatePostedDate`
- Everything else: `locations`, `scrapedAt`
Nothing exotic, which is exactly what you want from a feed. Flat records, predictable keys, types you can guess from the names.
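If you want to pin those guesses down, a typed record makes them explicit. A minimal sketch in Python, trimmed to the fields I actually used; the types are my inference from the field names and the sample records in the next section, not a published schema:

```python
from typing import TypedDict

class Trial(TypedDict, total=False):
    # total=False because optional fields are omitted upstream, not set to null
    nctId: str
    briefTitle: str
    officialTitle: str
    acronym: str | None      # explicitly null in one of the sample records
    organization: str
    overallStatus: str       # e.g. "COMPLETED"
    studyType: str           # e.g. "INTERVENTIONAL"
    phases: list[str]        # e.g. ["PHASE2"]
    enrollmentCount: int
    leadSponsor: str
    conditions: list[str]
    hasResults: bool
    startDate: str           # ISO-8601 date string
    scrapedAt: str           # UTC ISO-8601 timestamp
    url: str
```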
Real rows
Two records from a sample run, trimmed to spare you the inevitable wall of text:
```json
{
  "nctId": "NCT01213784",
  "briefTitle": "Optimized Glycemic Control in Heart Failure Patients With DM2:\"Effect on Left Ventricular Function and Skeletal Muscle\"",
  "officialTitle": "Optimized Glycemic Control in Type 2 Diabetics With Heart Failure:\"Effect on Left Ventricular Function and Skeletal Muscle\"",
  "acronym": "HFDM",
  "organization": "University of Aarhus",
  "overallStatus": "COMPLETED",
  "studyType": "INTERVENTIONAL",
  "phases": ["PHASE2"],
  "enrollmentCount": 40,
  "leadSponsor": "University of Aarhus"
}
```

```json
{
  "nctId": "NCT03060538",
  "briefTitle": "A Multiple Ascending Dose Study to Evaluate Safety and Tolerability of BFKB8488A in Participants With Type 2 Diabetes Mellitus and...",
  "officialTitle": "A Phase Ib, Randomized, Blinded, Placebo-Controlled, Multiple Ascending-Dose Study to Evaluate the Safety, Tolerability, and...",
  "acronym": null,
  "organization": "Genentech, Inc.",
  "overallStatus": "COMPLETED",
  "studyType": "INTERVENTIONAL",
  "phases": ["PHASE1"],
  "enrollmentCount": 154,
  "leadSponsor": "Genentech, Inc."
}
```
Gotchas
A few things I would not have known without actually pulling data (the loader sketch after this list shows one way to handle all five):
- Optional fields disappear instead of being null. Not the end of the world, but it means every loader needs to be tolerant of missing keys.
- Long-form text fields contain control characters. Newlines, tabs, the occasional rogue carriage return. Strip them at load time unless you actively want them.
- Timestamps are UTC ISO-8601, which is great, but it does mean any local-time dashboard needs an explicit conversion.
- Some numeric fields are emitted as strings. Cast on load.
- Re-scraping with overlapping windows creates duplicates. Dedup on the natural ID.
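A minimal loader sketch that folds all five into one pass, assuming newline-delimited JSON on disk. The control-character regex and the choice of `nctId` as the dedup key are mine, not anything the feed mandates:

```python
import json
import re
from datetime import datetime

CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")  # newlines, tabs, stray carriage returns

def clean_record(raw: dict) -> dict:
    rec = dict(raw)
    # Gotcha 1: optional fields are absent rather than null -- tolerate missing keys.
    rec.setdefault("acronym", None)
    # Gotcha 2: strip control characters from long-form text fields.
    for key in ("briefTitle", "officialTitle", "briefSummary"):
        if isinstance(rec.get(key), str):
            rec[key] = CONTROL_CHARS.sub(" ", rec[key])
    # Gotcha 4: some numerics arrive as strings -- cast on load.
    if isinstance(rec.get("enrollmentCount"), str):
        rec["enrollmentCount"] = int(rec["enrollmentCount"])
    # Gotcha 3: parse UTC ISO-8601 once here; convert to local time only at display.
    if isinstance(rec.get("scrapedAt"), str):
        rec["scrapedAt"] = datetime.fromisoformat(rec["scrapedAt"].replace("Z", "+00:00"))
    return rec

def load(path: str) -> list[dict]:
    with open(path) as f:
        rows = [clean_record(json.loads(line)) for line in f]
    # Gotcha 5: overlapping scrape windows duplicate rows -- dedup on the natural ID,
    # keeping the last occurrence so the most recent scrape wins.
    return list({row["nctId"]: row for row in rows}.values())
```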
What I would build next
A few directions this dataset would support nicely:
- A daily snapshot pipeline that lands raw JSON into object storage, then materialises a curated table for dashboards.
- A change-detection layer that computes row-level diffs between consecutive scrapes -- great for surfacing new and removed records.
- A text-extraction layer over the long-form content fields, feeding into search or topic modelling.
- A small validation suite that runs after every scrape: row count above a floor, key fields present in 100% of rows, timestamps parsing cleanly. Cheap to write, and it catches schema drift in minutes instead of weeks -- see the sketch after this list.
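That validation suite really is cheap. A sketch operating on raw scraped dicts; the required-field list and the 1,000-row floor are placeholders you would tune to your own query:

```python
from datetime import datetime

REQUIRED = ("nctId", "briefTitle", "overallStatus", "url", "scrapedAt")
ROW_FLOOR = 1_000  # arbitrary placeholder -- tune to the expected size of your query

def _parses(ts) -> bool:
    try:
        datetime.fromisoformat(str(ts).replace("Z", "+00:00"))
        return True
    except ValueError:
        return False

def validate(rows: list[dict]) -> list[str]:
    """Return human-readable failures; an empty list means the scrape passed."""
    failures = []
    if len(rows) < ROW_FLOOR:
        failures.append(f"row count {len(rows)} below floor {ROW_FLOOR}")
    for field in REQUIRED:
        missing = sum(1 for r in rows if not r.get(field))
        if missing:
            failures.append(f"{field} missing in {missing}/{len(rows)} rows")
    bad_ts = sum(1 for r in rows if not _parses(r.get("scrapedAt", "")))
    if bad_ts:
        failures.append(f"scrapedAt fails to parse in {bad_ts} rows")
    return failures
```

Wire the return value into whatever alerting you already have; an empty list is a green build.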
Cost considerations
Worth thinking about before you commit. The dominant cost on a recurring feed is not the per-record extraction price -- it is the maintenance time when the upstream source changes. A solid heuristic: budget half a day per source per quarter for maintenance work, and twice that for sources with active anti-bot defences. If that maintenance budget is too steep for the value the dataset provides, the project is not a fit.
The other cost worth modelling is storage. Raw JSON partitioned by date is cheap if you compress it -- a few cents per gigabyte per month on most clouds -- but it stops being cheap if you forget about retention. Set a lifecycle policy that ages anything older than your useful replay window into a colder tier, and revisit the policy every few months.
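As an illustration, here is what that lifecycle policy looks like as a one-time boto3 call against S3. The bucket name, prefix, and day counts are all placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Age raw JSON partitions past the replay window into a colder tier,
# then expire them entirely a year out. All numbers are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-trials-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-raw-json",
                "Filter": {"Prefix": "raw/clinicaltrials/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```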
Bottom line
For an afternoon's evaluation work, this was time well spent. The dataset is structurally clean, the scraper handled rate limits without me having to think about them, and the records are rich enough to start asking real questions immediately. If the upstream source stays stable for a quarter -- which is the realistic horizon for most public sources -- the cost-benefit of integrating this feed is firmly positive.
For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/clinicaltrials-gov-scraper. It supports JSON, CSV and Excel exports and runs on a schedule.