Akhila Chanubala

Posted on Jun 17

Building a Public Clinical Trial Data Quality Observatory with Python

#python #dataengineering #opensource #healthtech

Public clinical trial data is valuable, but it is not always analytics-ready.

An API response can be valid. A CSV can load successfully. A dashboard can render charts. But none of that proves the data is complete, consistent, or trustworthy enough for analytics.

That is the problem I wanted to explore with OpenTrialDQ and OpenTrialLens, an open-source project for validating and visualizing public ClinicalTrials.gov data.

The newest step is turning the project from a dashboard into a repeatable Clinical Trial Data Quality Observatory.

What is a data quality observatory?

A data quality observatory is a repeatable reporting layer that measures the condition of a dataset over time.

Instead of asking, “Can I display this data?”, it asks:

Are required fields present?
Are IDs unique?
Are dates logical?
Are enrollment values valid?
Which fields fail most often?
Can the results be reproduced later?

For public clinical trial data, this means generating condition-level quality snapshots across searches like diabetes, breast cancer, asthma, cardiovascular disease, and Alzheimer disease.

The basic pipeline

The first version of the observatory follows a simple flow:

Pull records from the ClinicalTrials.gov API
Flatten selected study fields
Apply data quality rules
Generate failed-record output
Publish Markdown and JSON reports
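
The first step of the flow above can be sketched with the standard library alone. This is a minimal sketch, not the project's actual code: the endpoint and the `query.cond`, `pageSize`, and `format` parameter names are my understanding of the ClinicalTrials.gov v2 API and should be checked against the official documentation, and `build_search_url` and `fetch_studies` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint and parameter names for the ClinicalTrials.gov v2 API;
# verify them against the official API documentation before relying on them.
BASE_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_search_url(condition: str, page_size: int = 50) -> str:
    """Build the search URL for one condition query."""
    params = {"query.cond": condition, "pageSize": page_size, "format": "json"}
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"


def fetch_studies(condition: str, page_size: int = 50) -> list[dict]:
    """Fetch one page of raw study records for a condition search."""
    url = build_search_url(condition, page_size)
    with urllib.request.urlopen(url, timeout=30) as resp:
        payload = json.load(resp)
    # The response is assumed to hold its records under a "studies" key.
    return payload.get("studies", [])
```

Keeping URL construction separate from the network call makes the query easy to log and to test without hitting the API.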

The goal is not to make clinical claims. The goal is to make data readiness visible before analytics.

Fields used in the snapshot

For a first useful version, I focused on fields that commonly matter for analytics:

nct_id
overall_status
start_date
completion_date
phases
sponsor_name
sponsor_class
enrollment_count
conditions
countries

These fields support basic trial status summaries, sponsor analysis, enrollment metrics, geography coverage, and quality checks.
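
Flattening these fields could look roughly like the sketch below. The nested key names (`protocolSection`, `identificationModule`, and so on) are assumptions about the shape of the API response, the sample record is made up for illustration, and `dig` and `flatten_study` are hypothetical helpers rather than functions from the project.

```python
from typing import Any


def dig(obj: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested dicts, returning default as soon as a key is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


def flatten_study(study: dict) -> dict:
    """Project one nested study record onto the ten snapshot fields."""
    p = study.get("protocolSection", {})
    locations = dig(p, "contactsLocationsModule", "locations", default=[])
    # Deduplicate countries across trial sites and sort for stable output.
    countries = sorted({loc["country"] for loc in locations if loc.get("country")})
    return {
        "nct_id": dig(p, "identificationModule", "nctId"),
        "overall_status": dig(p, "statusModule", "overallStatus"),
        "start_date": dig(p, "statusModule", "startDateStruct", "date"),
        "completion_date": dig(p, "statusModule", "completionDateStruct", "date"),
        "phases": dig(p, "designModule", "phases", default=[]),
        "sponsor_name": dig(p, "sponsorCollaboratorsModule", "leadSponsor", "name"),
        "sponsor_class": dig(p, "sponsorCollaboratorsModule", "leadSponsor", "class"),
        "enrollment_count": dig(p, "designModule", "enrollmentInfo", "count"),
        "conditions": dig(p, "conditionsModule", "conditions", default=[]),
        "countries": countries,
    }


# Made-up sample record with no phases and no completion date.
sample = {
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000000"},
        "statusModule": {
            "overallStatus": "COMPLETED",
            "startDateStruct": {"date": "2020-01"},
        },
        "designModule": {"enrollmentInfo": {"count": 120}},
        "sponsorCollaboratorsModule": {
            "leadSponsor": {"name": "Example Sponsor", "class": "OTHER"}
        },
        "conditionsModule": {"conditions": ["Asthma"]},
        "contactsLocationsModule": {
            "locations": [
                {"country": "United States"},
                {"country": "Canada"},
                {"country": "Canada"},
            ]
        },
    }
}
row = flatten_study(sample)
```

Missing branches come back as `None` or an empty list instead of raising, so gaps in the source survive flattening and can be caught by the quality rules later.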

Example validation rules

The observatory applies simple, explainable rules:

nct_id must not be null
nct_id must be unique
overall_status must not be null
phases should not be missing
sponsor_name must not be null
enrollment_count should be positive when present
countries should not be missing
completion_date should not be before start_date

Each failed check is captured with context:

record_index
nct_id
field
rule
severity
reason

That failed-record output matters. A quality score by itself is not enough; users need to know what failed and why.
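
A minimal sketch of how the eight rules and the failed-record structure above might fit together. The rule names and the `error`/`warning` severity labels are assumptions (here "must" rules map to `error` and "should" rules to `warning`), and partial dates such as `2020-01` are assumed to mean the first day of that month.

```python
from collections import Counter
from datetime import date


def parse_date(value):
    """Parse 'YYYY-MM-DD' or 'YYYY-MM' (day assumed to be 1); None if unusable."""
    if not value:
        return None
    try:
        nums = [int(part) for part in value.split("-")]
        month = nums[1] if len(nums) > 1 else 1
        day = nums[2] if len(nums) > 2 else 1
        return date(nums[0], month, day)
    except ValueError:
        return None


def validate(rows):
    """Apply the rules to flattened rows and return one dict per failed check."""
    failures = []
    id_counts = Counter(r["nct_id"] for r in rows if r.get("nct_id"))

    def fail(index, row, field, rule, severity, reason):
        failures.append({
            "record_index": index,
            "nct_id": row.get("nct_id"),
            "field": field,
            "rule": rule,
            "severity": severity,
            "reason": reason,
        })

    for i, row in enumerate(rows):
        if not row.get("nct_id"):
            fail(i, row, "nct_id", "not_null", "error", "nct_id is missing")
        elif id_counts[row["nct_id"]] > 1:
            fail(i, row, "nct_id", "unique", "error", "nct_id appears more than once")
        for field in ("overall_status", "sponsor_name"):
            if not row.get(field):
                fail(i, row, field, "not_null", "error", f"{field} is missing")
        for field in ("phases", "countries"):
            if not row.get(field):
                fail(i, row, field, "not_missing", "warning", f"{field} is empty")
        count = row.get("enrollment_count")
        if count is not None and count <= 0:
            fail(i, row, "enrollment_count", "positive_when_present", "warning",
                 f"enrollment_count is {count}")
        start = parse_date(row.get("start_date"))
        end = parse_date(row.get("completion_date"))
        if start and end and end < start:
            fail(i, row, "completion_date", "not_before_start_date", "error",
                 "completion_date is earlier than start_date")
    return failures


# Made-up rows: the first is clean, the second breaks five rules.
rows = [
    {"nct_id": "NCT1", "overall_status": "COMPLETED", "sponsor_name": "A",
     "phases": ["PHASE2"], "countries": ["United States"], "enrollment_count": 100,
     "start_date": "2020-01", "completion_date": "2021-06-30"},
    {"nct_id": "NCT2", "overall_status": None, "sponsor_name": "B",
     "phases": [], "countries": [], "enrollment_count": 0,
     "start_date": "2022-05-01", "completion_date": "2021-01"},
]
failures = validate(rows)
```

Each failure carries the record index and the reason, so the output can be filtered by field, rule, or severity without re-running the checks.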

First baseline report

The first baseline snapshot analyzed 250 public ClinicalTrials.gov records across five condition searches:

diabetes
breast cancer
cardiovascular disease
asthma
Alzheimer disease

The result:

250 records analyzed
97% weighted quality score
90 failed checks
most common issues: missing phase data, missing country coverage, and occasional enrollment issues

This supports a practical data engineering point: public data can be accessible and well structured and still need validation before analytics.
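
The weighting behind the score is not spelled out here, so the following is only one plausible way to compute a weighted quality score, not the project's formula. It assumes each rule has a severity, each severity has a weight, and the score is one minus the weighted share of failed checks. The weights, rule names, and numbers are all hypothetical.

```python
# Hypothetical weights: the actual weighting scheme is not given in this post.
SEVERITY_WEIGHTS = {"error": 3.0, "warning": 1.0}


def weighted_quality_score(record_count, rule_severities, failures):
    """Return 1 - (weighted failed checks / weighted total checks)."""
    per_record = sum(SEVERITY_WEIGHTS[s] for s in rule_severities.values())
    total = record_count * per_record
    if total == 0:
        return 1.0
    failed = sum(SEVERITY_WEIGHTS[f["severity"]] for f in failures)
    return 1.0 - failed / total


# Toy input: 10 records, 2 rules, 2 warning-level failures.
rules = {"nct_id_not_null": "error", "phases_not_missing": "warning"}
toy_failures = [{"rule": "phases_not_missing", "severity": "warning"}] * 2
score = weighted_quality_score(10, rules, toy_failures)  # 1 - 2/40 = 0.95
```

Under a scheme like this, a missing identifier lowers the score more than a missing phase, which is one reason a weighted score can differ from the raw pass rate.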

Why publish Markdown and JSON?

The observatory generates both human-readable and machine-readable outputs.

Markdown gives readers a simple report:

condition summaries
quality scores
failed checks
status mix
sponsor class mix
common quality issues

JSON gives developers structured output for follow-up analysis:

generated timestamp
condition list
report metadata
failed-rule counts
failed-field counts
enrollment totals
status and phase distributions

This makes the report easier to inspect, compare, and reuse.
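
As a rough sketch, both outputs can be rendered from one summary dict so the Markdown and JSON never disagree. The key names (`generated_at`, `failed_rule_counts`, and so on) and the Markdown layout are illustrative assumptions, not the project's actual report schema.

```python
import json
from collections import Counter
from datetime import datetime, timezone


def build_report(conditions, record_count, failures):
    """Summarize failed checks into a JSON-serializable dict."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "conditions": list(conditions),
        "record_count": record_count,
        "failed_check_count": len(failures),
        "failed_rule_counts": dict(Counter(f["rule"] for f in failures)),
        "failed_field_counts": dict(Counter(f["field"] for f in failures)),
    }


def to_markdown(report):
    """Render the same summary as a short human-readable report."""
    lines = [
        "# Data quality snapshot",
        "",
        f"- Generated: {report['generated_at']}",
        f"- Conditions: {', '.join(report['conditions'])}",
        f"- Records analyzed: {report['record_count']}",
        f"- Failed checks: {report['failed_check_count']}",
        "",
        "| Field | Failed checks |",
        "| --- | --- |",
    ]
    by_count = sorted(report["failed_field_counts"].items(), key=lambda kv: -kv[1])
    for field, n in by_count:
        lines.append(f"| {field} | {n} |")
    return "\n".join(lines) + "\n"


# Made-up failures for illustration.
example_failures = [
    {"field": "phases", "rule": "not_missing"},
    {"field": "phases", "rule": "not_missing"},
    {"field": "countries", "rule": "not_missing"},
]
report = build_report(["asthma"], 50, example_failures)
json_text = json.dumps(report, indent=2, sort_keys=True)
markdown_text = to_markdown(report)
```

Writing `json_text` and `markdown_text` to dated files would give each snapshot a stable, diffable artifact.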

Why this pattern is useful

This pattern is not limited to clinical trial data.

The same approach can be used for:

public health datasets
provider directories
research datasets
sample claims data
customer engagement datasets
operational reporting feeds

The core idea is reusable:

flatten the source data
define quality rules
apply the rules consistently
export failed records
publish an audit summary
repeat the snapshot over time

That repeatable layer is especially important for analytics and AI workflows. A model or dashboard is only as trustworthy as the data pipeline feeding it.
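
The "repeat the snapshot over time" step pays off once two reports can be compared. A minimal sketch, assuming each snapshot exposes a mapping from rule name to failed-check count; the function name and the counts below are hypothetical.

```python
def compare_snapshots(previous, current):
    """Return the per-rule change in failed-check counts between two snapshots."""
    rules = sorted(set(previous) | set(current))
    return {rule: current.get(rule, 0) - previous.get(rule, 0) for rule in rules}


# Made-up counts from two hypothetical snapshots.
earlier = {"phases_not_missing": 40, "countries_not_missing": 30}
later = {"phases_not_missing": 35, "countries_not_missing": 30, "enrollment_positive": 2}
delta = compare_snapshots(earlier, later)
```

A positive value means a rule is failing more often than before, which is the kind of drift a one-off dashboard check would not surface.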

Project links

OpenTrialLens was recently featured on HackerNoon through its Proof of Usefulness program. The observatory is the next step: moving from a dashboard demo to repeatable public data quality reporting.

GitHub: https://github.com/akhilachanubala-alt/OpenTrialDQ
Observatory baseline report: https://github.com/akhilachanubala-alt/OpenTrialDQ/blob/main/docs/observatory/2026-06-baseline.md
Live dashboard: https://akhilachanubala-alt.github.io/OpenTrialDQ/opentriallens/
HackerNoon feature: https://hackernoon.com/opentriallens-earns-a-4646-proof-of-usefulness-score-for-improving-clinical-data-quality

Final thought

Modern data tools make it easier to move data. The harder question is whether the data is ready to use after it moves.

For clinical trial analytics, a data quality observatory gives data engineers a practical way to make completeness, consistency, and auditability visible before downstream dashboards or AI workflows begin.

DEV Community