The OMOP Common Data Model (CDM) is a standard for observational health data that allows the analysis of clinical data in a consistent and reproducible way. Implementing OMOP CDM in AWS requires a robust architecture that handles everything from data ingestion to advanced AI analysis, maintaining the highest standards of security and regulatory compliance, especially HIPAA for health data.
This guide describes one possible set of components for an architecture on AWS. It is not the only valid solution; it is simply a proposal assembled from some of the many services the platform makes available.
────────────────────────────────
🗂️ What is OMOP CDM?
The OMOP Common Data Model (CDM) is a standard designed by the OHDSI community to represent observational health data in a uniform way. Its main objective is to enable the standardization of medical data where different institutions, clinical systems and databases speak the same “language,” in order to facilitate reproducible analysis, cohort comparisons and multicenter studies.
The model is based on a set of normalized tables, standardized vocabularies and modeling conventions that define how patients, diagnoses, procedures, medication, clinical measurements, visits and temporal events should be represented.
────────────────────────────────
👤 Model Structure: Patient as Central Entity
OMOP organizes information around the patient, who acts as the central entity of the model. This structure makes it possible to reconstruct each patient's clinical timeline and analyze their events over time.
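To make the patient-centric structure concrete, here is a minimal sketch of a timeline query that combines conditions and drug exposures for a single patient. It assumes a standard CDM v5 schema reachable over PostgreSQL; the table and column names follow the OMOP specification, while the connection details and person_id are placeholders.

```python
import psycopg2

# Placeholder connection details for the database hosting the CDM.
conn = psycopg2.connect(host="cdm-host", dbname="omop", user="omop_user", password="...")

TIMELINE_SQL = """
SELECT 'condition' AS domain, condition_start_date AS event_date, condition_concept_id AS concept_id
FROM condition_occurrence
WHERE person_id = %(person_id)s
UNION ALL
SELECT 'drug', drug_exposure_start_date, drug_concept_id
FROM drug_exposure
WHERE person_id = %(person_id)s
ORDER BY event_date;
"""

with conn, conn.cursor() as cur:
    cur.execute(TIMELINE_SQL, {"person_id": 12345})  # hypothetical person_id
    for domain, event_date, concept_id in cur.fetchall():
        print(event_date, domain, concept_id)
```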
────────────────────────────────
❤️ Standardized Vocabularies: the semantic heart of OMOP
One of the most important strengths of the CDM is the use of standardized vocabularies, which replace the many different ways of writing the same clinical term with numeric concept IDs. These IDs allow clinical concepts to be represented in a consistent, interoperable and computable way.
In addition, the vocabularies have:
- Hierarchies (for example, “type 2 diabetes mellitus” is a subconcept of “endocrine and metabolic diseases”),
- Semantic relationships,
- Standard and non-standard concepts.
Thanks to these hierarchies, an analyst can perform broad studies without knowing all the specific codes. For example, to analyze metabolic diseases, they can query the higher-level category and automatically include all of its subclasses (including the different types of diabetes).
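As an illustration, the query below is a sketch that relies on the standard CONCEPT_ANCESTOR vocabulary table: it counts patients whose recorded conditions fall anywhere under a chosen high-level concept. The ancestor concept_id is a parameter you would pick from the vocabulary.

```python
# Runnable with any PostgreSQL client (e.g. psycopg2) against the CDM schema.
DESCENDANT_COHORT_SQL = """
SELECT COUNT(DISTINCT co.person_id) AS n_patients
FROM condition_occurrence AS co
JOIN concept_ancestor AS ca
  ON ca.descendant_concept_id = co.condition_concept_id
WHERE ca.ancestor_concept_id = %(ancestor_concept_id)s;  -- the high-level category
"""
```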
────────────────────────────────
☁️ OMOP in AWS
The architecture of the OMOP Common Data Model can be implemented in multiple environments (on-premise, hybrid or in different cloud providers). However, AWS offers a particularly robust ecosystem to address the challenges of standardization, integration, governance and advanced clinical data analysis.
In this section, we explore how to combine AWS services to build a complete pipeline that allows ingesting, transforming, standardizing and analyzing health data under the OMOP standard, maintaining high levels of security, regulatory compliance and operational efficiency.
⚠️ This approach is not intended to be the only way to implement OMOP, but a practical and modular guide that will allow you to understand which AWS services can help you in each phase of the process.
OMOP in AWS: Services by section
(1) 📄 Data: Clinical Sources, APIs and Personal Devices
In a modern health ecosystem, data no longer comes only from a hospital’s internal systems. Today, clinical information is distributed across multiple platforms, technologies and devices, requiring architectures capable of integrating, unifying and standardizing heterogeneous sources.
(2) 🔧 Pipeline Services: Data Ingestion and Initial Processing
To build a robust pipeline that enables the standardization of clinical data toward OMOP, it is essential to define how the data is extracted, ingested and prepared before transformation.
In this stage, the main objective is to capture data from the different sources and store it in raw format in Amazon S3, always preserving traceability and the original state of the information.
Below are the key services used in this phase:
Amazon MWAA (Managed Workflows for Apache Airflow)
Amazon MWAA allows running Apache Airflow DAGs without managing the underlying infrastructure.
Amazon Kinesis
Hospitals and health devices generate more and more real-time data; for these scenarios, Amazon Kinesis offers a highly scalable streaming solution.
The combined use of:
- Kinesis Data Streams (real-time ingestion)
- Kinesis Data Firehose (automated delivery to S3)

allows capturing data streams without additional infrastructure and storing them directly in the raw bucket, ready to be processed by Airflow or other services.
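As a rough sketch, a producer (for example a device integration service) could push vital-sign readings into a stream like this. The stream name and payload are assumptions, and delivery to the raw S3 bucket would be handled by a Firehose delivery stream configured separately.

```python
import datetime
import json
import boto3

kinesis = boto3.client("kinesis")

reading = {
    "device_id": "wearable-001",  # hypothetical device identifier
    "heart_rate": 72,
    "timestamp": datetime.datetime.utcnow().isoformat(),
}

# Firehose (configured separately) can consume this stream and deliver batches to the raw bucket.
kinesis.put_record(
    StreamName="clinical-device-stream",  # placeholder stream name
    Data=json.dumps(reading).encode("utf-8"),
    PartitionKey=reading["device_id"],
)
```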
AWS Lambda
This service allows executing serverless functions without provisioning servers, which makes it ideal for small tasks and specific events within the pipeline.
In this context, it is used for:
- Lightweight pre-validation or normalization processes before sending files to S3.
- Moving or restructuring files when new data arrives.
- Automatic triggers when new objects are detected in S3 (for example, activating notifications).
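A minimal handler along these lines reacts to new objects landing in S3 and copies them into a structured raw prefix. The bucket layout, source-system prefix and validation rule are assumptions.

```python
import urllib.parse
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by S3 ObjectCreated events on a landing prefix (hypothetical layout)."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Lightweight pre-validation: ignore empty files.
        if s3.head_object(Bucket=bucket, Key=key)["ContentLength"] == 0:
            continue

        # Restructure into the raw zone, e.g. by source system.
        dest_key = f"raw/hospital_a/{key.rsplit('/', 1)[-1]}"
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": key},
            Key=dest_key,
        )
    return {"status": "ok"}
```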
(3) 🗂️ RAW Storage
Once extracted, all data will be stored initially in Amazon S3, which will act as the RAW zone of the data lake. This layer preserves the data in its original format, without transformations, to guarantee traceability, auditing and reprocessing capability.
Storage in S3 must be complemented with a set of key practices:
- IAM + S3 Bucket Policies ensure role-based access.
- Tags help automate governance and classification.
- Lake Formation adds granular control at table/column level.
- Lifecycle policies ensure retention and cost efficiency.
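For example, a lifecycle rule like the following can move older raw files to cheaper storage automatically; the bucket name, prefix and retention window are placeholders to adapt to your own retention policy.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix for the RAW zone of the data lake.
s3.put_bucket_lifecycle_configuration(
    Bucket="omop-raw-zone",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-after-90-days",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```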
(4) 📌 Orchestration
In this section we describe the key DAGs needed to coordinate the different stages of the pipeline. Orchestration is essential to ensure that extractions, transformations and loads are executed in a consistent, auditable and scalable way.
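A skeleton DAG in this spirit chains extraction, transformation and loading. It assumes Airflow 2.x (as provided by MWAA); the task bodies are placeholders, and the schedule and naming are assumptions to adapt to your pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_sources(**_):
    """Pull files from clinical sources and land them in the raw S3 bucket."""

def transform_to_omop(**_):
    """Map raw records to OMOP tables and standard concept_ids."""

def load_cdm(**_):
    """Load transformed data into the Aurora PostgreSQL CDM."""

with DAG(
    dag_id="omop_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_sources)
    transform = PythonOperator(task_id="transform", python_callable=transform_to_omop)
    load = PythonOperator(task_id="load", python_callable=load_cdm)

    extract >> transform >> load
```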
(5) 🧠 AI & Unstructured Data
To process clinical notes and other unstructured data, we need to incorporate NLP techniques that allow extracting entities, mapping clinical concepts and automatically encoding information.
For this type of processing, we can rely on the following AWS services:
Amazon SageMaker
Allows training, tuning and deploying custom NLP models, from classic models to advanced transformer-based ones. It is ideal when full control of the ML pipeline, preprocessing, fine-tuning and integration with other system components is needed.
Amazon Comprehend Medical
Managed service that extracts clinical entities, relationships and conditions directly from medical text.
Important: Comprehend Medical supports a limited set of languages, so it is necessary to review the documentation and confirm language support before integrating it into the project.
In the following article, you can find a complete implementation of a batch process using this service.
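For orientation, a single synchronous call (separate from the batch process mentioned above) looks roughly like this; the sample note is invented.

```python
import boto3

comprehend_medical = boto3.client("comprehendmedical")

note = "Patient diagnosed with type 2 diabetes mellitus, started on metformin 500 mg daily."

# Extract clinical entities (conditions, medications, dosages, ...) from free text.
response = comprehend_medical.detect_entities_v2(Text=note)
for entity in response["Entities"]:
    print(entity["Category"], entity["Type"], entity["Text"], round(entity["Score"], 2))

# For vocabulary mapping, infer_icd10_cm and infer_rx_norm return candidate codes
# that can later be translated to OMOP standard concepts.
```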
Amazon Bedrock integrated with SageMaker
Although Bedrock is a separate service, it can be integrated into ML workflows in SageMaker. Its main contribution is providing access to foundation models and generative AI capabilities, opening the door to new use cases:
- Automatic classification of clinical text.
- Concept normalization assisted by generative models.
- Semantic searches and context retrieval through vector databases (for example, to enrich mapping results or suggest probable clinical codes).
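As a small example of the first use case, the Converse API can be called from the same Python environment used by SageMaker jobs. This is a sketch only: it requires a recent boto3 version, and the model ID is an assumption to replace with whichever foundation model is enabled in your account.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

note = "Pt c/o chest pain x2 days, hx HTN, on lisinopril. ECG unremarkable."

# Placeholder model ID; use a model enabled in your AWS account and region.
response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{
        "role": "user",
        "content": [{"text": f"Rewrite this clinical note as a short, plain-language summary:\n{note}"}],
    }],
)
print(response["output"]["message"]["content"][0]["text"])
```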
(6) 🩺 OMOP CDM
All processing stages converge in the implementation of the OMOP Common Data Model (CDM), stored in a relational database optimized for analytical and mixed workloads.
Amazon Aurora PostgreSQL
The recommended engine for hosting the CDM is Amazon Aurora PostgreSQL, because it:
- Maintains full SQL compatibility and supports OHDSI ecosystem tools.
- Provides high availability, automatic replication, and fast recovery.
- Scales horizontally with read replicas, ideal for analytical and concurrent workloads.
- Integrates seamlessly with ETL/ELT pipelines across AWS services.
Depending on the use case, Aurora can be complemented with additional analytics-oriented services.
Amazon Redshift
For advanced analytics over large datasets derived from the CDM, Amazon Redshift offers a distributed, high-performance environment for complex analytical queries.
Amazon Athena
Amazon Athena enables querying raw data stored in S3 without loading it into a database. It is especially useful for:
- Quick validations before loading data into the CDM.
- Debugging and data quality checks using SQL.
- Exploring semi-structured files (CSV, JSON, Parquet).
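A quick validation query might be launched like this; the Glue database, table and results bucket are hypothetical names.

```python
import boto3

athena = boto3.client("athena")

# Count raw encounter records with a missing admission date before loading them into the CDM.
execution = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM encounters WHERE admission_date IS NULL",
    QueryExecutionContext={"Database": "raw_db"},
    ResultConfiguration={"OutputLocation": "s3://omop-athena-results/"},
)

# Poll get_query_execution and fetch get_query_results once the state is SUCCEEDED.
print("QueryExecutionId:", execution["QueryExecutionId"])
```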
Amazon ElastiCache
When the solution requires high-frequency or computationally expensive queries on the OMOP model, adding a cache layer with Redis or Memcached helps:
- Reduce latency for repeated queries.
- Store results of heavy computations (e.g., cohort definitions, vocabulary lookups).
- Improve performance for dashboards and clinical applications that require fast responses.
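A thin caching helper along these lines illustrates the pattern with the redis client; the endpoint, key layout and TTL are assumptions.

```python
import json
import redis

# Placeholder for your ElastiCache (Redis) primary endpoint.
cache = redis.Redis(host="omop-cache.example.use1.cache.amazonaws.com", port=6379)

def cached_cohort_count(cohort_id: str, compute_fn, ttl_seconds: int = 3600):
    """Return a cohort count from Redis if present; otherwise compute and cache it."""
    key = f"cohort:{cohort_id}:count"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    value = compute_fn(cohort_id)  # e.g. an expensive query against Aurora
    cache.setex(key, ttl_seconds, json.dumps(value))
    return value
```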
(7) 📊 Data Visualization
Data visualization is essential not only to consume information but also to analyze, monitor and validate each stage of the pipeline. As we process clinical data, vocabularies, transformations and AI results, we need tools that make the quality, behavior and evolution of the data evident.
Below are various options depending on the use case:
- Amazon QuickSight: It enables fast, interactive dashboards connected to Aurora, Redshift, Athena or S3. Its in-memory SPICE engine accelerates visualizations at scale while reducing load on source databases, making it ideal for data quality tracking and clinical monitoring.
- Amazon SageMaker Model Dashboard: The SageMaker Model Dashboard centralizes observability for ML workflows, displaying metrics such as precision, recall and F1-score, along with model versions, drift indicators and execution history. This makes it easier to detect degradation early and maintain reliable NLP or predictive models.
- Amazon Fargate / Amazon EKS: When fully custom dashboards are required, such as advanced visualizations, semantic comparisons or interactive analytics, Fargate and EKS provide the compute layer to run applications built with tools like Plotly, Dash, Streamlit or React-based libraries. This allows teams to create fully tailored visual experiences that go beyond what standard BI tools offer (see the sketch below).
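As a sketch of such a custom dashboard, a small Streamlit app could read quality metrics straight from the CDM and be packaged as a container for Fargate or EKS; the connection string and the chosen metric are assumptions.

```python
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine

# Placeholder connection string for the Aurora PostgreSQL instance hosting the CDM.
engine = create_engine("postgresql+psycopg2://user:password@aurora-endpoint:5432/omop")

st.title("OMOP data quality overview")

# Top 20 condition concepts by record count, a simple sanity check after each load.
df = pd.read_sql(
    "SELECT condition_concept_id, COUNT(*) AS n FROM condition_occurrence "
    "GROUP BY condition_concept_id ORDER BY n DESC LIMIT 20",
    engine,
)
st.bar_chart(df.set_index("condition_concept_id"))
```

Running `streamlit run app.py` inside the container serves the dashboard (on port 8501 by default), which can then sit behind a load balancer on Fargate or EKS.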
(8) 🧭 Data Governance
Data governance is critical when working with sensitive health information, ensuring that data remains cataloged, documented and protected throughout every stage of the pipeline. A strong governance layer enforces access policies, allowing only authorized users to interact with clinical datasets under strict regulatory requirements. It also guarantees full traceability, enabling auditing of how data is accessed, transformed and shared across environments. Finally, governance provides controlled discoverability, ensuring that curated datasets can be safely searched and consumed while maintaining consistent metadata.
AWS Lake Formation
AWS Lake Formation centralizes governance for data stored in S3, offering fine-grained permissions at the table, column or row level, enforcing traceability and integrating tightly with the Glue Data Catalog to maintain consistent metadata.
Amazon DataZone
Amazon DataZone supports the organized publication and controlled sharing of datasets across the organization, enabling teams to work within structured data domains—such as Clinical, NLP, OMOP or Research—while unifying cataloging, governance and collaboration in one environment.
(9) 🔐 Security and Networking
Security and connectivity are fundamental pillars in any health data architecture, especially to comply with regulations such as HIPAA. In AWS, there are multiple services that protect both data and infrastructure. Below we describe the main components and their role within our OMOP CDM architecture.
(10) 🎚️ Monitoring and Billing
Monitoring and cost control are essential in health data architectures, especially when processing large clinical datasets or running AI workloads where training and inference can be resource-intensive.
🔍 Monitoring
AWS CloudWatch provides centralized metrics, logs and events from all AWS services, enabling teams to track infrastructure health, Airflow DAG execution and the behavior of ETL/ELT pipelines while receiving alerts for anomalies. For deeper inspection, AWS X-Ray traces requests across distributed systems—such as containerized services on ECS/EKS or APIs that expose OMOP data—making it easier to detect bottlenecks and debug complex data flows.
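For instance, a simple alarm on pipeline Lambda errors could be defined as follows; the alarm name and the SNS topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical alarm: notify an SNS topic if any Lambda in the pipeline reports errors.
cloudwatch.put_metric_alarm(
    AlarmName="omop-pipeline-lambda-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:omop-alerts"],  # placeholder topic ARN
)
```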
🧾 Billing
To maintain financial visibility and prevent cost overruns, AWS Cost Explorer offers detailed insights into usage patterns across services, including AI and data-intensive components. Complementing this, AWS Budgets allows setting custom spending limits and automated alerts, ensuring that project costs remain predictable and aligned with operational goals.
(11) 🧱 Code & Deployment
Managing code and deploying infrastructure is essential to guarantee reproducibility, traceability and security in cloud-based health projects. This includes not only provisioning resources, but also maintaining reliable pipelines, consistent environments and well-governed ML assets.
🔧 Infrastructure as Code
Terraform allows defining the entire AWS architecture in a declarative way, ensuring that environments remain consistent and reproducible across development, staging and production. It supports provisioning core components such as S3 buckets, VPCs, databases and IAM roles while enforcing infrastructure governance.
🗂️ Versioning & CI/CD
GitHub serves as the central platform for code collaboration, offering pull requests, reviews and issue management. With GitHub Advanced Security, teams can catch vulnerabilities early through dependency scanning and code analysis.
GitHub Actions complements this by automating CI/CD pipelines that build containers, validate data quality, deploy Airflow DAGs or update infrastructure definitions, ensuring that each change is tested and safely promoted.
🏷️ Models & Containers
For containerized workloads, Amazon ECR provides a secure and scalable registry for images used in ECS, EKS or Fargate, ensuring consistency across environments. In parallel, the Amazon SageMaker Model Registry manages ML model versions, capturing lineage, approvals and metadata so that each model deployed into production remains auditable and reproducible.
(12) 🚀 AI Consumption
Once the data is standardized and loaded into the OMOP CDM, it becomes the foundation for advanced analytics, AI-driven insights and secure data consumption. This unlocks opportunities for clinical research, decision support and the development of intelligent health applications.
☁️ Data Consumption through APIs
Standardized OMOP data can be exposed through secure API layers, enabling internal and external systems to retrieve curated clinical information. Services such as Amazon API Gateway combined with AWS Lambda provide scalable, low-latency endpoints that support both real-time and batch consumption.
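A minimal proxy-integration handler in this style could return a patient's conditions from the CDM; the environment variables, table choice and response shape are assumptions, and psycopg2 would need to be packaged with the function (for example as a layer).

```python
import json
import os
import psycopg2

def lambda_handler(event, context):
    """API Gateway (proxy integration) handler returning conditions for one patient."""
    person_id = (event.get("queryStringParameters") or {}).get("person_id")

    conn = psycopg2.connect(
        host=os.environ["DB_HOST"], dbname="omop",
        user=os.environ["DB_USER"], password=os.environ["DB_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT condition_concept_id, condition_start_date "
            "FROM condition_occurrence WHERE person_id = %s LIMIT 100",
            (person_id,),
        )
        rows = [{"concept_id": r[0], "start_date": str(r[1])} for r in cur.fetchall()]

    return {"statusCode": 200, "body": json.dumps(rows)}
```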
📊 Advanced Analysis and Machine Learning
Amazon SageMaker enables training, evaluating and deploying Machine Learning models directly on top of OMOP data. This supports use cases such as predicting clinical risks, classifying patients by comorbidities or analyzing treatment response patterns, all while integrating seamlessly with the existing data pipeline.
🧩 Vector Search with Aurora and pgvector
By storing patient feature vectors in Aurora PostgreSQL using pgvector, the system can perform semantic similarity searches between patients or clinical cases. This capability enhances cohort discovery and enables personalized recommendation workflows.
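A similarity lookup could then be as simple as the sketch below, assuming a hypothetical patient_embedding table with a vector column and the pgvector extension installed; the embedding shown is a tiny placeholder.

```python
import os
import psycopg2

conn = psycopg2.connect(
    host=os.environ["DB_HOST"], dbname="omop",
    user=os.environ["DB_USER"], password=os.environ["DB_PASSWORD"],
)

# Tiny placeholder embedding; real patient vectors would have many more dimensions.
query_vector = "[0.12, -0.53, 0.08]"

with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT person_id
        FROM patient_embedding
        ORDER BY embedding <-> %s::vector   -- L2 distance operator from pgvector
        LIMIT 10;
        """,
        (query_vector,),
    )
    similar_patients = [row[0] for row in cur.fetchall()]
    print(similar_patients)
```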
🧠 Generative AI with Amazon Bedrock
Amazon Bedrock provides access to foundation models that can summarize clinical notes, extract information from unstructured text or augment concept mapping processes, expanding analytical depth through generative AI.
Researchers can query patients with similar disease profiles using pgvector, deploy readmission prediction models in SageMaker or generate automated insights from clinical notes using Bedrock-powered NLP.
📚 Conclusions
This guide presents a compact proposal for implementing OMOP CDM on AWS, showing how its services can support secure, scalable and efficient clinical data processing. The architecture is flexible and can be adapted to different project needs.
AWS provides an ecosystem that covers the entire data lifecycle, allowing integration with open-source tools and containerized workloads while maintaining control over performance and costs. This balance is especially important in health and AI-driven environments.
Building on strong governance and security practices, the proposed approach demonstrates that AWS enables compliant and reliable data workflows. With the right configuration, clinical data can be transformed into meaningful insights for research, analytics and innovation.






