<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Akhila Chanubala</title>
    <description>The latest articles on DEV Community by Akhila Chanubala (@akhila_chanubala).</description>
    <link>https://dev.to/akhila_chanubala</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3989697%2F72c48a7a-de33-459d-b9b5-d2c20a7d0dd4.png</url>
      <title>DEV Community: Akhila Chanubala</title>
      <link>https://dev.to/akhila_chanubala</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akhila_chanubala"/>
    <language>en</language>
    <item>
      <title>Building a Public Clinical Trial Data Quality Observatory with Python</title>
      <dc:creator>Akhila Chanubala</dc:creator>
      <pubDate>Wed, 17 Jun 2026 19:32:31 +0000</pubDate>
      <link>https://dev.to/akhila_chanubala/building-a-public-clinical-trial-data-quality-observatory-with-python-2ocm</link>
      <guid>https://dev.to/akhila_chanubala/building-a-public-clinical-trial-data-quality-observatory-with-python-2ocm</guid>
      <description>&lt;p&gt;Public clinical trial data is valuable, but it is not always analytics-ready.&lt;/p&gt;

&lt;p&gt;An API response can be valid. A CSV can load successfully. A dashboard can render charts. But none of that proves the data is complete, consistent, or trustworthy enough for analytics.&lt;/p&gt;

&lt;p&gt;That is the problem I wanted to explore with &lt;strong&gt;OpenTrialDQ&lt;/strong&gt;, an open-source project for validating public ClinicalTrials.gov data, and &lt;strong&gt;OpenTrialLens&lt;/strong&gt;, its dashboard for visualizing the results.&lt;/p&gt;

&lt;p&gt;The newest step is turning the project from a dashboard into a repeatable &lt;strong&gt;Clinical Trial Data Quality Observatory&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;What is a data quality observatory?&lt;/h2&gt;

&lt;p&gt;A data quality observatory is a repeatable reporting layer that measures the quality of a dataset over time.&lt;/p&gt;

&lt;p&gt;Instead of asking, “Can I display this data?”, it asks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are required fields present?&lt;/li&gt;
&lt;li&gt;Are IDs unique?&lt;/li&gt;
&lt;li&gt;Are dates logical?&lt;/li&gt;
&lt;li&gt;Are enrollment values valid?&lt;/li&gt;
&lt;li&gt;Which fields fail most often?&lt;/li&gt;
&lt;li&gt;Can the results be reproduced later?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For public clinical trial data, this means generating a quality snapshot for each medical condition searched, such as diabetes, breast cancer, asthma, cardiovascular disease, and Alzheimer disease.&lt;/p&gt;

&lt;h2&gt;The basic pipeline&lt;/h2&gt;

&lt;p&gt;The first version of the observatory follows a simple flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull records from the ClinicalTrials.gov API&lt;/li&gt;
&lt;li&gt;Flatten selected study fields&lt;/li&gt;
&lt;li&gt;Apply data quality rules&lt;/li&gt;
&lt;li&gt;Generate failed-record output&lt;/li&gt;
&lt;li&gt;Publish Markdown and JSON reports&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal is not to make clinical claims. The goal is to make data readiness visible before analytics.&lt;/p&gt;
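
&lt;p&gt;As a rough sketch of step 1, the snippet below builds one request URL per condition search. It assumes the ClinicalTrials.gov v2 endpoint and its &lt;code&gt;query.cond&lt;/code&gt;, &lt;code&gt;pageSize&lt;/code&gt;, and &lt;code&gt;format&lt;/code&gt; parameters. The helper name &lt;code&gt;build_search_url&lt;/code&gt; and the page size of 50 are illustrative assumptions, not details taken from the project.&lt;/p&gt;

```python
from urllib.parse import urlencode

# Assumed endpoint: the ClinicalTrials.gov v2 studies search.
BASE_URL = "https://clinicaltrials.gov/api/v2/studies"

# The five condition searches named in the article.
CONDITIONS = [
    "diabetes",
    "breast cancer",
    "cardiovascular disease",
    "asthma",
    "Alzheimer disease",
]


def build_search_url(condition, page_size=50):
    """Return a search URL for one condition (hypothetical helper)."""
    params = {
        "query.cond": condition,  # condition or disease search term
        "pageSize": page_size,    # records per page; 50 is an assumption
        "format": "json",
    }
    return BASE_URL + "?" + urlencode(params)


urls = {condition: build_search_url(condition) for condition in CONDITIONS}

for condition, url in urls.items():
    print(condition, url)
```

&lt;p&gt;Fetching and paging are left out so the sketch runs offline; in a real run each URL would be requested and the returned studies passed to the flattening step.&lt;/p&gt;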

&lt;h2&gt;Fields used in the snapshot&lt;/h2&gt;

&lt;p&gt;For a first useful version, I focused on fields that commonly matter for analytics:&lt;/p&gt;

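&lt;ul&gt;
&lt;li&gt;&lt;code&gt;nct_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;overall_status&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;start_date&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;completion_date&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;phases&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sponsor_name&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sponsor_class&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;enrollment_count&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;conditions&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;countries&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;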

&lt;p&gt;These fields support basic trial status summaries, sponsor analysis, enrollment metrics, geography coverage, and quality checks.&lt;/p&gt;
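
&lt;p&gt;One way to produce those flat fields is sketched below. It assumes each study arrives as a nested dictionary shaped like the ClinicalTrials.gov v2 JSON (a &lt;code&gt;protocolSection&lt;/code&gt; with module sub-objects). The function name &lt;code&gt;flatten_study&lt;/code&gt; and the sample study are made up for illustration and are not the project's code or data.&lt;/p&gt;

```python
def flatten_study(study):
    """Flatten one nested study dict into the snapshot fields (hypothetical helper)."""
    protocol = study.get("protocolSection", {})
    ident = protocol.get("identificationModule", {})
    status = protocol.get("statusModule", {})
    design = protocol.get("designModule", {})
    sponsor = protocol.get("sponsorCollaboratorsModule", {}).get("leadSponsor", {})
    locations = protocol.get("contactsLocationsModule", {}).get("locations", [])

    # Collapse per-site locations into a sorted list of distinct countries.
    countries = sorted({loc["country"] for loc in locations if loc.get("country")})

    return {
        "nct_id": ident.get("nctId"),
        "overall_status": status.get("overallStatus"),
        "start_date": status.get("startDateStruct", {}).get("date"),
        "completion_date": status.get("completionDateStruct", {}).get("date"),
        "phases": design.get("phases", []),
        "sponsor_name": sponsor.get("name"),
        "sponsor_class": sponsor.get("class"),
        "enrollment_count": design.get("enrollmentInfo", {}).get("count"),
        "conditions": protocol.get("conditionsModule", {}).get("conditions", []),
        "countries": countries,
    }


# Made-up study, shaped like the assumed API response.
sample_study = {
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000000"},
        "statusModule": {
            "overallStatus": "COMPLETED",
            "startDateStruct": {"date": "2020-01"},
            "completionDateStruct": {"date": "2021-06-30"},
        },
        "designModule": {"phases": ["PHASE2"], "enrollmentInfo": {"count": 120}},
        "sponsorCollaboratorsModule": {
            "leadSponsor": {"name": "Example University", "class": "OTHER"}
        },
        "conditionsModule": {"conditions": ["Asthma"]},
        "contactsLocationsModule": {
            "locations": [
                {"country": "United States"},
                {"country": "Canada"},
                {"country": "Canada"},
            ]
        },
    }
}

flat = flatten_study(sample_study)
print(flat)
```

&lt;p&gt;Using &lt;code&gt;.get()&lt;/code&gt; with defaults at every level means a study with missing modules still flattens to a full row, so the gaps surface later as failed quality checks rather than as exceptions.&lt;/p&gt;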

&lt;h2&gt;Example validation rules&lt;/h2&gt;

&lt;p&gt;The observatory applies simple, explainable rules:&lt;/p&gt;

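&lt;ul&gt;
&lt;li&gt;&lt;code&gt;nct_id&lt;/code&gt; must not be null&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nct_id&lt;/code&gt; must be unique&lt;/li&gt;
&lt;li&gt;&lt;code&gt;overall_status&lt;/code&gt; must not be null&lt;/li&gt;
&lt;li&gt;&lt;code&gt;phases&lt;/code&gt; should not be missing&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sponsor_name&lt;/code&gt; must not be null&lt;/li&gt;
&lt;li&gt;&lt;code&gt;enrollment_count&lt;/code&gt; should be positive when present&lt;/li&gt;
&lt;li&gt;&lt;code&gt;countries&lt;/code&gt; should not be missing&lt;/li&gt;
&lt;li&gt;&lt;code&gt;completion_date&lt;/code&gt; should not be before &lt;code&gt;start_date&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;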

&lt;p&gt;Each failed check is captured with context:&lt;/p&gt;

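&lt;ul&gt;
&lt;li&gt;&lt;code&gt;record_index&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nct_id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;field&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rule&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;severity&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;reason&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;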

&lt;p&gt;That failed-record output matters. A quality score by itself is not enough; users need to know what failed and why.&lt;/p&gt;
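
&lt;p&gt;The rules and the failed-record shape described above could be wired together roughly as follows. This is a minimal sketch, not the project's implementation: the rule names, the mapping of "must" rules to &lt;code&gt;error&lt;/code&gt; and "should" rules to &lt;code&gt;warning&lt;/code&gt;, the accepted date formats, and the sample records are all assumptions.&lt;/p&gt;

```python
from collections import Counter
from datetime import datetime


def parse_date(value):
    """Parse 'YYYY-MM-DD' or 'YYYY-MM'; return None if missing or unparseable."""
    for fmt in ("%Y-%m-%d", "%Y-%m"):
        try:
            return datetime.strptime(value, fmt)
        except (TypeError, ValueError):
            continue
    return None


def validate(records):
    """Apply the rules to flat records and return one dict per failed check."""
    failures = []
    id_counts = Counter(r["nct_id"] for r in records if r.get("nct_id"))

    def fail(index, record, field, rule, severity, reason):
        failures.append({
            "record_index": index,
            "nct_id": record.get("nct_id"),
            "field": field,
            "rule": rule,
            "severity": severity,
            "reason": reason,
        })

    for index, record in enumerate(records):
        nct_id = record.get("nct_id")
        if not nct_id:
            fail(index, record, "nct_id", "not_null", "error", "nct_id is missing")
        elif id_counts[nct_id] > 1:
            fail(index, record, "nct_id", "unique", "error", "nct_id is duplicated")

        # Assumed severities: "must" rules are errors, "should" rules are warnings.
        for field in ("overall_status", "sponsor_name"):
            if not record.get(field):
                fail(index, record, field, "not_null", "error", field + " is missing")
        for field in ("phases", "countries"):
            if not record.get(field):
                fail(index, record, field, "not_missing", "warning", field + " is empty")

        count = record.get("enrollment_count")
        if count is not None and not count > 0:
            fail(index, record, "enrollment_count", "positive_when_present",
                 "warning", "enrollment_count is not positive")

        start = parse_date(record.get("start_date"))
        end = parse_date(record.get("completion_date"))
        if start and end and start > end:
            fail(index, record, "completion_date", "not_before_start_date",
                 "error", "completion_date is before start_date")
    return failures


# Made-up flat records: one clean, two with deliberate problems.
clean = {
    "nct_id": "NCT00000001", "overall_status": "COMPLETED",
    "start_date": "2020-01", "completion_date": "2021-06-30",
    "phases": ["PHASE2"], "sponsor_name": "Example University",
    "sponsor_class": "OTHER", "enrollment_count": 120,
    "conditions": ["Asthma"], "countries": ["Canada"],
}
records = [
    clean,
    {**clean, "nct_id": "NCT00000002", "phases": [], "enrollment_count": 0},
    {**clean, "nct_id": "NCT00000003", "countries": [],
     "start_date": "2022-05-01", "completion_date": "2022-01"},
]

failures = validate(records)
for failure in failures:
    print(failure["record_index"], failure["field"], failure["rule"], failure["severity"])
```

&lt;p&gt;Keeping every failure as its own row, rather than only a count, is what lets a reader trace a score back to specific records and fields.&lt;/p&gt;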

&lt;h2&gt;First baseline report&lt;/h2&gt;

&lt;p&gt;The first baseline snapshot analyzed &lt;strong&gt;250 public ClinicalTrials.gov records&lt;/strong&gt; across five condition searches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;diabetes&lt;/li&gt;
&lt;li&gt;breast cancer&lt;/li&gt;
&lt;li&gt;cardiovascular disease&lt;/li&gt;
&lt;li&gt;asthma&lt;/li&gt;
&lt;li&gt;Alzheimer disease&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;250 records analyzed&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;97% weighted quality score&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;90 failed checks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;most common issues: missing phase data, missing country coverage, and occasional enrollment issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This illustrates a practical data engineering point: public data can be accessible and well structured and still need validation before it feeds analytics.&lt;/p&gt;
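
&lt;p&gt;The article does not spell out how the weighted quality score is computed. One plausible formula, shown here purely as an assumption, is one minus the weighted share of failed checks. The rule names, the weights (full weight for "must" rules, half weight for "should" rules), and the failure counts below are hypothetical.&lt;/p&gt;

```python
# Hypothetical weights: "must" rules count fully, "should" rules count half.
RULE_WEIGHTS = {
    "nct_id_not_null": 1.0,
    "nct_id_unique": 1.0,
    "overall_status_not_null": 1.0,
    "sponsor_name_not_null": 1.0,
    "completion_not_before_start": 1.0,
    "phases_not_missing": 0.5,
    "countries_not_missing": 0.5,
    "enrollment_positive_when_present": 0.5,
}


def weighted_quality_score(record_count, failures_by_rule, weights=RULE_WEIGHTS):
    """Return 1 - (weighted failed checks / weighted total checks)."""
    total_weight = record_count * sum(weights.values())
    if total_weight == 0:
        return 1.0  # nothing was checked, so nothing failed
    failed_weight = sum(
        weights[rule] * count for rule, count in failures_by_rule.items()
    )
    return 1 - failed_weight / total_weight


# Made-up failure counts for 20 records; not the project's data.
example_failures = {
    "phases_not_missing": 8,
    "countries_not_missing": 6,
    "nct_id_unique": 2,
}
score = weighted_quality_score(20, example_failures)
print(f"{score:.1%}")  # prints 93.1%
```

&lt;p&gt;Under this particular formula, equal weights on the eight rules listed earlier would give 1 - 90 / (250 × 8) = 95.5% for the baseline totals, so a 97% weighted score would be consistent with most failures landing on lower-weight rules. That reading assumes those eight rules are the full rule set.&lt;/p&gt;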

&lt;h2&gt;Why publish Markdown and JSON?&lt;/h2&gt;

&lt;p&gt;The observatory generates both human-readable and machine-readable outputs.&lt;/p&gt;

&lt;p&gt;Markdown gives readers a simple report:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;condition summaries&lt;/li&gt;
&lt;li&gt;quality scores&lt;/li&gt;
&lt;li&gt;failed checks&lt;/li&gt;
&lt;li&gt;status mix&lt;/li&gt;
&lt;li&gt;sponsor class mix&lt;/li&gt;
&lt;li&gt;common quality issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;JSON gives developers structured output for follow-up analysis:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generated timestamp&lt;/li&gt;
&lt;li&gt;condition list&lt;/li&gt;
&lt;li&gt;report metadata&lt;/li&gt;
&lt;li&gt;failed-rule counts&lt;/li&gt;
&lt;li&gt;failed-field counts&lt;/li&gt;
&lt;li&gt;enrollment totals&lt;/li&gt;
&lt;li&gt;status and phase distributions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes the report easier to inspect, compare, and reuse.&lt;/p&gt;
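
&lt;p&gt;A small sketch of producing both outputs from one summary object is shown below. The summary keys, function names, and sample failures are assumptions chosen to mirror the lists above; the project's real report schema may differ.&lt;/p&gt;

```python
import json
from collections import Counter
from datetime import datetime, timezone


def build_summary(conditions, record_count, failures):
    """Aggregate failed checks into a JSON-serializable summary (hypothetical shape)."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "conditions": list(conditions),
        "record_count": record_count,
        "failed_check_count": len(failures),
        "failed_rule_counts": dict(Counter(f["rule"] for f in failures)),
        "failed_field_counts": dict(Counter(f["field"] for f in failures)),
    }


def render_markdown(summary):
    """Render the same summary as a small Markdown report."""
    lines = [
        "# Data quality snapshot",
        "",
        "- Generated: " + summary["generated_at"],
        "- Conditions: " + ", ".join(summary["conditions"]),
        "- Records analyzed: " + str(summary["record_count"]),
        "- Failed checks: " + str(summary["failed_check_count"]),
        "",
        "| Field | Failed checks |",
        "| --- | --- |",
    ]
    for field, count in sorted(summary["failed_field_counts"].items()):
        lines.append("| " + field + " | " + str(count) + " |")
    return "\n".join(lines) + "\n"


# Made-up failed checks, only to exercise the functions.
failures = [
    {"field": "phases", "rule": "not_missing"},
    {"field": "phases", "rule": "not_missing"},
    {"field": "enrollment_count", "rule": "positive_when_present"},
]

summary = build_summary(["asthma", "diabetes"], 10, failures)
json_text = json.dumps(summary, indent=2, sort_keys=True)
markdown_text = render_markdown(summary)
print(markdown_text)
```

&lt;p&gt;Rendering both formats from the same dictionary keeps the human-readable and machine-readable reports from drifting apart.&lt;/p&gt;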

&lt;h2&gt;Why this pattern is useful&lt;/h2&gt;

&lt;p&gt;This pattern is not limited to clinical trial data.&lt;/p&gt;

&lt;p&gt;The same approach can be used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;public health datasets&lt;/li&gt;
&lt;li&gt;provider directories&lt;/li&gt;
&lt;li&gt;research datasets&lt;/li&gt;
&lt;li&gt;sample claims data&lt;/li&gt;
&lt;li&gt;customer engagement datasets&lt;/li&gt;
&lt;li&gt;operational reporting feeds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core idea is reusable:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;flatten the source data&lt;/li&gt;
&lt;li&gt;define quality rules&lt;/li&gt;
&lt;li&gt;apply the rules consistently&lt;/li&gt;
&lt;li&gt;export failed records&lt;/li&gt;
&lt;li&gt;publish an audit summary&lt;/li&gt;
&lt;li&gt;repeat the snapshot over time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That repeatable layer is especially important for analytics and AI workflows. A model or dashboard is only as trustworthy as the data pipeline feeding it.&lt;/p&gt;
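
&lt;p&gt;The last step, repeating the snapshot over time, pays off once two summaries can be compared. The sketch below assumes each snapshot is a dictionary with a &lt;code&gt;failed_field_counts&lt;/code&gt; mapping; that key, the function name, and both sample snapshots are hypothetical.&lt;/p&gt;

```python
def compare_snapshots(previous, current):
    """Return the per-field change in failed-check counts between two snapshots."""
    prev_counts = previous.get("failed_field_counts", {})
    curr_counts = current.get("failed_field_counts", {})
    fields = sorted(set(prev_counts) | set(curr_counts))
    return {
        field: curr_counts.get(field, 0) - prev_counts.get(field, 0)
        for field in fields
    }


# Two made-up snapshot summaries, not real report data.
baseline = {
    "label": "baseline",
    "failed_field_counts": {"phases": 12, "countries": 7},
}
follow_up = {
    "label": "follow-up",
    "failed_field_counts": {"phases": 9, "countries": 7, "enrollment_count": 1},
}

changes = compare_snapshots(baseline, follow_up)
for field, delta in changes.items():
    # Positive numbers mean more failed checks than in the earlier snapshot.
    print(field, format(delta, "+d"))
```

&lt;p&gt;A field that only appears in one snapshot is treated as having zero failures in the other, so newly failing fields show up as positive changes.&lt;/p&gt;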

&lt;h2&gt;Project links&lt;/h2&gt;

&lt;p&gt;OpenTrialLens was recently featured on HackerNoon through its Proof of Usefulness program. The observatory is the next step: moving from a dashboard demo to repeatable public data quality reporting.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/akhilachanubala-alt/OpenTrialDQ" rel="noopener noreferrer"&gt;https://github.com/akhilachanubala-alt/OpenTrialDQ&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Observatory baseline report: &lt;a href="https://github.com/akhilachanubala-alt/OpenTrialDQ/blob/main/docs/observatory/2026-06-baseline.md" rel="noopener noreferrer"&gt;https://github.com/akhilachanubala-alt/OpenTrialDQ/blob/main/docs/observatory/2026-06-baseline.md&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Live dashboard: &lt;a href="https://akhilachanubala-alt.github.io/OpenTrialDQ/opentriallens/" rel="noopener noreferrer"&gt;https://akhilachanubala-alt.github.io/OpenTrialDQ/opentriallens/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HackerNoon feature: &lt;a href="https://hackernoon.com/opentriallens-earns-a-4646-proof-of-usefulness-score-for-improving-clinical-data-quality" rel="noopener noreferrer"&gt;https://hackernoon.com/opentriallens-earns-a-4646-proof-of-usefulness-score-for-improving-clinical-data-quality&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Final thought&lt;/h2&gt;

&lt;p&gt;Modern data tools make it easier to move data. The harder question is whether the data is ready to use after it moves.&lt;/p&gt;

&lt;p&gt;For clinical trial analytics, a data quality observatory gives data engineers a practical way to make completeness, consistency, and auditability visible before downstream dashboards or AI workflows begin.&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>opensource</category>
      <category>healthtech</category>
    </item>
  </channel>
</rss>
