DEV Community

Akhila Chanubala

Building a Public Clinical Trial Data Quality Observatory with Python

Public clinical trial data is valuable, but it is not always analytics-ready.

An API response can be valid. A CSV can load successfully. A dashboard can render charts. But none of that proves the data is complete, consistent, or trustworthy enough for analytics.

That is the problem I wanted to explore with OpenTrialDQ and OpenTrialLens, an open-source project for validating and visualizing public ClinicalTrials.gov data.

The newest step is turning the project from a dashboard into a repeatable Clinical Trial Data Quality Observatory.

What is a data quality observatory?

A data quality observatory is a repeatable reporting layer that measures the condition of a dataset over time.

Instead of asking, “Can I display this data?”, it asks:

  • Are required fields present?
  • Are IDs unique?
  • Are dates logical?
  • Are enrollment values valid?
  • Which fields fail most often?
  • Can the results be reproduced later?

For public clinical trial data, this means generating condition-level quality snapshots across searches like diabetes, breast cancer, asthma, cardiovascular disease, and Alzheimer disease.

The basic pipeline

The first version of the observatory follows a simple flow:

  1. Pull records from the ClinicalTrials.gov API
  2. Flatten selected study fields
  3. Apply data quality rules
  4. Generate failed-record output
  5. Publish Markdown and JSON reports
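As a rough illustration of step 1, here is a minimal sketch using only the standard library. It assumes the ClinicalTrials.gov v2 `studies` endpoint with its `query.cond`, `pageSize`, and `pageToken` parameters; the helper names `build_url` and `fetch_studies` and the default limit of 50 are hypothetical and not taken from the project.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the public ClinicalTrials.gov v2 API.
BASE_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_url(condition, page_size=50, page_token=None):
    """Build a condition search URL (hypothetical helper)."""
    params = {"query.cond": condition, "pageSize": page_size}
    if page_token:
        params["pageToken"] = page_token
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_studies(condition, limit=50):
    """Collect up to `limit` raw study records, following page tokens."""
    studies, token = [], None
    page_size = min(limit, 100)
    while len(studies) < limit:
        with urlopen(build_url(condition, page_size, token), timeout=30) as resp:
            payload = json.load(resp)
        studies.extend(payload.get("studies", []))
        token = payload.get("nextPageToken")
        if not token:  # no further pages
            break
    return studies[:limit]
```

Calling `fetch_studies("asthma")` would perform live network requests, so a repeatable snapshot would probably also save the raw response next to the report.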

The goal is not to make clinical claims. The goal is to make data readiness visible before analytics.

Fields used in the snapshot

For a first useful version, I focused on fields that commonly matter for analytics:

nct_id
overall_status
start_date
completion_date
phases
sponsor_name
sponsor_class
enrollment_count
conditions
countries

These fields support basic trial status summaries, sponsor analysis, enrollment metrics, geography coverage, and quality checks.
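One way to produce those flat fields is sketched below. The nested paths (`protocolSection`, `statusModule`, and so on) reflect my understanding of the v2 study JSON and should be checked against a real response; `dig` and `flatten_study` are hypothetical helper names, and the sample record is made up.

```python
def dig(obj, *keys, default=None):
    """Walk nested dicts, returning `default` if any level is absent."""
    for key in keys:
        if not isinstance(obj, dict):
            return default
        obj = obj.get(key)
    return default if obj is None else obj


def flatten_study(study):
    """Map one nested study record to the flat snapshot fields."""
    p = study.get("protocolSection", {})
    locations = dig(p, "contactsLocationsModule", "locations", default=[])
    return {
        "nct_id": dig(p, "identificationModule", "nctId"),
        "overall_status": dig(p, "statusModule", "overallStatus"),
        "start_date": dig(p, "statusModule", "startDateStruct", "date"),
        "completion_date": dig(p, "statusModule", "completionDateStruct", "date"),
        "phases": dig(p, "designModule", "phases", default=[]),
        "sponsor_name": dig(p, "sponsorCollaboratorsModule", "leadSponsor", "name"),
        "sponsor_class": dig(p, "sponsorCollaboratorsModule", "leadSponsor", "class"),
        "enrollment_count": dig(p, "designModule", "enrollmentInfo", "count"),
        "conditions": dig(p, "conditionsModule", "conditions", default=[]),
        # Deduplicate site countries into a sorted list.
        "countries": sorted({loc["country"] for loc in locations if loc.get("country")}),
    }


# Made-up record with no design module, so phases and enrollment are missing.
sample = {
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000000"},
        "statusModule": {"overallStatus": "COMPLETED",
                         "startDateStruct": {"date": "2020-01"}},
        "contactsLocationsModule": {"locations": [
            {"country": "United States"}, {"country": "Canada"},
            {"country": "United States"}]},
    }
}
row = flatten_study(sample)
```

Missing branches come back as `None` or an empty list instead of raising, which leaves the decision about what counts as a failure to the validation rules.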

Example validation rules

The observatory applies simple, explainable rules:

nct_id must not be null
nct_id must be unique
overall_status must not be null
phases should not be missing
sponsor_name must not be null
enrollment_count should be positive when present
countries should not be missing
completion_date should not be before start_date

Each failed check is captured with context:

record_index
nct_id
field
rule
severity
reason

That failed-record output matters. A quality score by itself is not enough; users need to know what failed and why.
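A minimal sketch of how the rules and the failed-record shape could fit together is shown here. The rule names, the severity mapping ("must" as `error`, "should" as `warning`), and the choice to compare dates at the coarser of the two precisions are my assumptions, not the project's actual implementation.

```python
from collections import Counter

# Assumed severities: "must" rules are errors, "should" rules are warnings.
REQUIRED = {
    "nct_id": "error",
    "overall_status": "error",
    "sponsor_name": "error",
    "phases": "warning",
    "countries": "warning",
}


def is_missing(value):
    return value is None or value == "" or value == []


def validate(records):
    """Return one failure dict per failed check."""
    failures = []

    def fail(i, rec, field, rule, severity, reason):
        failures.append({"record_index": i, "nct_id": rec.get("nct_id"),
                         "field": field, "rule": rule,
                         "severity": severity, "reason": reason})

    id_counts = Counter(r.get("nct_id") for r in records
                        if not is_missing(r.get("nct_id")))
    for i, rec in enumerate(records):
        for field, severity in REQUIRED.items():
            if is_missing(rec.get(field)):
                fail(i, rec, field, "not_null", severity, f"{field} is missing")
        nct_id = rec.get("nct_id")
        if not is_missing(nct_id) and id_counts[nct_id] > 1:
            fail(i, rec, "nct_id", "unique", "error",
                 f"{nct_id} appears {id_counts[nct_id]} times")
        count = rec.get("enrollment_count")
        if isinstance(count, (int, float)) and count <= 0:
            fail(i, rec, "enrollment_count", "positive_when_present", "warning",
                 f"enrollment_count is {count}")
        start, end = rec.get("start_date"), rec.get("completion_date")
        if start and end:
            # Dates may be YYYY-MM or YYYY-MM-DD; compare at the coarser precision.
            n = min(len(start), len(end))
            if end[:n] < start[:n]:
                fail(i, rec, "completion_date", "not_before_start_date", "error",
                     f"{end} is before {start}")
    return failures
```

Because each failure carries its own `reason`, the same list can feed both a per-record export and the aggregate counts in a report.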

First baseline report

The first baseline snapshot analyzed 250 public ClinicalTrials.gov records across five condition searches:

  • diabetes
  • breast cancer
  • cardiovascular disease
  • asthma
  • Alzheimer disease

The result:

  • 250 records analyzed
  • 97% weighted quality score
  • 90 failed checks
  • most common issues: missing phase data, missing country coverage, and occasional enrollment issues

This illustrates a practical data engineering point: public data can be accessible and structured yet still need validation before analytics.
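The post does not spell out how the weighted score is computed, so the following is only one plausible scheme: each check carries a severity weight, and the score is the share of total weight that passed. The weights and function names are hypothetical.

```python
# Hypothetical weights: a warning costs half as much as an error.
SEVERITY_WEIGHTS = {"error": 1.0, "warning": 0.5}


def pass_rate(n_checks, n_failed):
    """Unweighted share of checks that passed."""
    return 1.0 - n_failed / n_checks if n_checks else 1.0


def weighted_quality_score(n_records, rule_severities, failures):
    """Share of severity weight that passed, assuming every rule runs once per record."""
    total = n_records * sum(SEVERITY_WEIGHTS[s] for s in rule_severities.values())
    if total == 0:
        return 1.0
    lost = sum(SEVERITY_WEIGHTS[f["severity"]] for f in failures)
    return 1.0 - lost / total


# If each of the eight listed rules ran once on each of the 250 records,
# 90 failed checks out of 2,000 would give an unweighted pass rate of about 0.955.
baseline_unweighted = pass_rate(250 * 8, 90)
```

Under that assumption the unweighted rate sits a little below the reported 97%, which is consistent with a weighting that discounts lower-severity failures, though the actual formula may differ.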

Why publish Markdown and JSON?

The observatory generates both human-readable and machine-readable outputs.

Markdown gives readers a simple report:

  • condition summaries
  • quality scores
  • failed checks
  • status mix
  • sponsor class mix
  • common quality issues

JSON gives developers structured output for follow-up analysis:

  • generated timestamp
  • condition list
  • report metadata
  • failed-rule counts
  • failed-field counts
  • enrollment totals
  • status and phase distributions

This makes the report easier to inspect, compare, and reuse.
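A small sketch of producing both outputs from the same summary follows. The JSON key names and the Markdown layout are illustrative assumptions, not the project's actual report schema.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def build_summary(conditions, records, failures):
    """Aggregate flat records and failure rows into a JSON-ready dict."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "conditions": list(conditions),
        "records_analyzed": len(records),
        "failed_checks": len(failures),
        "failed_rule_counts": dict(Counter(f["rule"] for f in failures)),
        "failed_field_counts": dict(Counter(f["field"] for f in failures)),
        "enrollment_total": sum(r.get("enrollment_count") or 0 for r in records),
        "status_distribution": dict(
            Counter(r.get("overall_status") or "UNKNOWN" for r in records)),
    }


def render_markdown(summary):
    """Render the same summary as a short human-readable report."""
    lines = [
        "# Data quality snapshot",
        "",
        f"- Generated: {summary['generated_at']}",
        f"- Conditions: {', '.join(summary['conditions'])}",
        f"- Records analyzed: {summary['records_analyzed']}",
        f"- Failed checks: {summary['failed_checks']}",
        "",
        "| Rule | Failures |",
        "| --- | --- |",
    ]
    ranked = sorted(summary["failed_rule_counts"].items(), key=lambda kv: -kv[1])
    lines += [f"| {rule} | {n} |" for rule, n in ranked]
    return "\n".join(lines) + "\n"


def write_reports(summary, stem):
    """Write <stem>.json and <stem>.md side by side."""
    with open(f"{stem}.json", "w", encoding="utf-8") as fh:
        json.dump(summary, fh, indent=2)
    with open(f"{stem}.md", "w", encoding="utf-8") as fh:
        fh.write(render_markdown(summary))
```

Rendering the Markdown from the JSON summary, instead of computing it separately, keeps the two outputs from drifting apart.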

Why this pattern is useful

This pattern is not limited to clinical trial data.

The same approach can be used for:

  • public health datasets
  • provider directories
  • research datasets
  • sample claims data
  • customer engagement datasets
  • operational reporting feeds

The core idea is reusable:

  1. flatten the source data
  2. define quality rules
  3. apply the rules consistently
  4. export failed records
  5. publish an audit summary
  6. repeat the snapshot over time

That repeatable layer is especially important for analytics and AI workflows. A model or dashboard is only as trustworthy as the data pipeline feeding it.
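To illustrate the "repeat over time" step, here is a small sketch that compares two snapshot summaries. It assumes each summary is a dict with a `failed_rule_counts` mapping, which is a hypothetical key name.

```python
def rule_count_changes(previous, current):
    """Return {rule: delta} for rules whose failure count changed between snapshots."""
    before = previous.get("failed_rule_counts", {})
    after = current.get("failed_rule_counts", {})
    changes = {}
    for rule in sorted(set(before) | set(after)):
        delta = after.get(rule, 0) - before.get(rule, 0)
        if delta != 0:
            changes[rule] = delta
    return changes
```

A positive delta means a rule fails more often than in the earlier snapshot, which is the kind of drift a one-off dashboard would not surface.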

Project links

OpenTrialLens was recently featured on HackerNoon through its Proof of Usefulness program. The observatory is the next step: moving from a dashboard demo to repeatable public data quality reporting.

Final thought

Modern data tools make it easier to move data. The harder question is whether the data is ready to use after it moves.

For clinical trial analytics, a data quality observatory gives data engineers a practical way to make completeness, consistency, and auditability visible before downstream dashboards or AI workflows begin.