SciForce

Automating Research-to-Care Data Integration via OMOP and FHIR

Client Profile

Our client, a university hospital in Germany, aimed to enhance cross-institutional data exchange through standardized, structured health data pipelines. Multiple institutions involved in observational research sought to integrate their findings, such as risk models and disease prevalence metrics, into operational clinical workflows. To support this goal, the client required a conversion pipeline from OMOP CDM (used in research analytics) to HL7 FHIR (used in clinical applications) to enable real-time data interoperability.

Challenge

Although both data models are widely used in healthcare data management and interoperability, OMOP CDM and HL7 FHIR serve distinct purposes and differ substantially. A brief comparison of the models is shown in the table below.

[Table: comparison of OMOP CDM and HL7 FHIR]

OMOP-to-FHIR conversion

An OMOP-to-FHIR conversion is known for recurring issues that can challenge even an experienced data conversion team. We faced the following:

1) Scalability & Interoperability
The client’s OMOP CDM instance comprised millions of records accumulated over many years of longitudinal care. While exact figures are proprietary, the scale was typical of major academic centers. This scale, combined with the need to integrate standardized analytics-ready data into a patient-centric FHIR model, created both technical and semantic integration challenges.

2) Data Model Differences
OMOP represents relationships using foreign keys across normalized tables (e.g., condition_occurrence, concept_relationship), while FHIR uses nested references and resource linkage (e.g., Condition.subject, Encounter.hospitalization). Mapping OMOP to FHIR required not just field-level transformation but structural remodeling.
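The structural remodeling can be illustrated with a minimal sketch: one normalized OMOP condition_occurrence row becomes a nested FHIR R4 Condition resource. The helper function and sample values are illustrative; in the real pipeline the OMOP concept ID would first be translated to its standard vocabulary code via the OMOP concept tables, while here it is passed through for brevity.

```python
def condition_occurrence_to_fhir(row: dict) -> dict:
    """Remodel one OMOP condition_occurrence row into a FHIR Condition dict."""
    return {
        "resourceType": "Condition",
        "id": f"condition-{row['condition_occurrence_id']}",
        # OMOP's person_id foreign key becomes a nested resource reference
        "subject": {"reference": f"Patient/{row['person_id']}"},
        "code": {
            "coding": [{
                # A real pipeline maps condition_concept_id to the standard
                # SNOMED code via the vocabulary; passed through here.
                "system": "http://snomed.info/sct",
                "code": str(row["condition_concept_id"]),
            }]
        },
        "onsetDateTime": row["condition_start_date"],
    }

row = {
    "condition_occurrence_id": 101,
    "person_id": 42,
    "condition_concept_id": 201826,
    "condition_start_date": "2021-03-15",
}
resource = condition_occurrence_to_fhir(row)
print(resource["subject"]["reference"])  # Patient/42
```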

3) Granularity & Data Loss
Some OMOP tables, e.g. cohort and observation_period, have no direct analog in FHIR, which mandates stricter resource boundaries and predefined value sets. Because OMOP CDM holds rich observational and derived data, such as cohort definitions, risk scores, and predictive models, that do not always fit into FHIR's predefined resource types, important research metadata can be lost or require custom FHIR extensions. At the same time, some FHIR fields (e.g., Observation.method, Medication.status) lacked corresponding data in OMOP, requiring defaulting, inference, or omission strategies.

4) Patient Privacy & Query Optimization
OMOP typically holds de-identified data for research, while FHIR serves real-time clinical care with patient-identifiable interactions via RESTful APIs. SQL-based access in OMOP allows efficient bulk queries, whereas FHIR APIs are less efficient for high-volume data retrieval. This mismatch required optimizing the extraction and loading process to avoid bottlenecks during FHIR resource creation.

Solution

Automated Mapping with Clinical Review
When converting OMOP to FHIR, we implemented several custom solutions (described below) that allowed us to build a scalable, flexible pipeline, and we leveraged OHDSI's White Rabbit and Rabbit-in-a-Hat tools for automation. The alignment between OMOP concepts and FHIR resources, profiles, and extensions was validated by our system of checks and manually reviewed by clinical domain experts.

Ensuring Consistency and Interoperability
To address mapping issues, we built a custom Extract, Transform, Load (ETL) solution. As an orientation framework, we linked OMOP and FHIR via the Biomedical Research Integrated Domain Group (BRIDG) model, a common semantic framework that facilitates interoperability between research (OMOP CDM) and operational (FHIR) models. This was especially helpful in resolving ambiguous one-to-many or many-to-one mappings.

However, the final decision was always based on clinical logic and project demands. Where a direct one-to-one OMOP-to-FHIR mapping was impossible, post-coordination was frequently the option of choice, as it allowed an exact match. This approach let us convert OMOP data to FHIR resources or FHIR extensions, capturing the distinct characteristics of both models while ensuring interoperability and real-world semantics.

Flexible Data Pipelines
We constructed custom Python-based scripts and an ETL orchestrator to build a sustainable, flexible, and scalable pipeline. It supported resumable FHIR loading, with documented versioning that allowed the pipeline to pause and resume without losing context. ETL script versions were tracked in Git, ensuring proper governance.
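A resumable pipeline needs a persisted checkpoint of the last completed unit of work. The sketch below shows one way this can look; the file name, location, and JSON structure are assumptions for illustration, not the project's actual implementation.

```python
# Minimal checkpoint sketch: record the last successfully loaded batch
# per domain so a restart resumes instead of reprocessing everything.
import json
import os
import tempfile

CHECKPOINT = os.path.join(tempfile.gettempdir(), "etl_checkpoint.json")  # hypothetical path

def load_checkpoint() -> dict:
    """Read the saved state, or return an empty state on first run."""
    if not os.path.exists(CHECKPOINT):
        return {}
    with open(CHECKPOINT) as f:
        return json.load(f)

def save_checkpoint(domain: str, batch_no: int) -> None:
    """Persist the highest completed batch number for a domain."""
    state = load_checkpoint()
    state[domain] = batch_no
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def next_batch(domain: str) -> int:
    """Batch to start from after a restart (0 if none completed yet)."""
    return load_checkpoint().get(domain, -1) + 1

save_checkpoint("condition_occurrence", 7)
print(next_batch("condition_occurrence"))  # 8
```

In practice the checkpoint write should be atomic (write to a temp file, then rename) so a crash mid-write cannot corrupt the state.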

Validation System
Our system of checks applied hash-based deduplication at the conversion stage and automated validation of FHIR structures at the load stage, including record-level and aggregate parity checks.
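Hash-based deduplication can be sketched as follows: a record's identity is a hash over its key attributes (person, concept, date, value, as listed in the features below), and repeats are dropped. Field names and sample values here are illustrative.

```python
# Deduplicate records by hashing their identifying attributes.
import hashlib

KEY_FIELDS = ("person_id", "concept_id", "date", "value")  # illustrative key set

def record_key(rec: dict) -> str:
    """Stable SHA-256 digest of the record's key attributes."""
    raw = "|".join(str(rec.get(k)) for k in KEY_FIELDS)
    return hashlib.sha256(raw.encode()).hexdigest()

def deduplicate(records):
    """Keep the first occurrence of each distinct key."""
    seen, unique = set(), []
    for rec in records:
        key = record_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = [
    {"person_id": 1, "concept_id": 3004249, "date": "2022-01-01", "value": 120},
    {"person_id": 1, "concept_id": 3004249, "date": "2022-01-01", "value": 120},
    {"person_id": 2, "concept_id": 3004249, "date": "2022-01-01", "value": 118},
]
print(len(deduplicate(rows)))  # 2
```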

Resiliency & Recovery System
At each automated step we included a fallback mechanism, e.g. fallback logging for failed extractions or unmappable codes during conversion, and audit logs per bundle/record processed during loading. We also implemented a progress-saving component that allows the pipeline to restore and resume after a breakdown.


Features:

  • Custom ETL Pipeline
SQL and Python scripts for a sustainable, flexible, and extendable pipeline, plus version-controlled ETL orchestrator scripts that guide the data flow, tracked in Git.

  • Proprietary Mapping Framework (Jackalope)
    Jackalope enables accurate post-coordinated mappings onto FHIR extensions when direct FHIR equivalents are unavailable.

  • Automated Unit Conversion
Automated unit conversion using the Unified Code for Units of Measure (UCUM) unifies units of measure during the OMOP-to-FHIR transformation.

  • Hash-Based Deduplication
    Logic to detect and remove duplicates across OMOP resources using hash-matching on key attributes such as person_id, concept ID, date, and value.

  • CI/CD with Auto-Testing
Continuous integration and testing validate FHIR schema compliance after each mapping update, supporting stable, versioned releases.

  • Patient ID pseudonymization logic
Pseudonymization logic and secure key storage ensure patient privacy without losing the ability to re-identify a patient when necessary.

  • Pipeline Resumability
    Built-in recovery mechanism ensures safe pausing and resumption of FHIR server loading without context loss.
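The pseudonymization feature above can be sketched with a keyed hash: an HMAC of the patient ID with a secret key yields a stable pseudonym, so holders of the key (or of a protected lookup table) can re-link when necessary while others cannot reverse it. The key handling and prefix format here are illustrative, not the project's actual scheme.

```python
# Keyed pseudonymization sketch: deterministic, non-reversible without the key.
import hashlib
import hmac

SECRET_KEY = b"replace-with-securely-stored-key"  # hypothetical; keep in a vault

def pseudonymize(person_id: int) -> str:
    """Derive a stable pseudonym for a patient ID via HMAC-SHA256."""
    digest = hmac.new(SECRET_KEY, str(person_id).encode(),
                      hashlib.sha256).hexdigest()
    return f"PSN-{digest[:16]}"

p1 = pseudonymize(42)
p2 = pseudonymize(42)
assert p1 == p2               # same patient always maps to the same pseudonym
assert p1 != pseudonymize(43)  # different patients get different pseudonyms
print(p1.startswith("PSN-"))  # True
```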

Development Process

1) Preparation
To assess data quality and plan the workflow, we started with profiling the source OMOP CDM tables, using the White Rabbit scanning tool. Since the client's data was already in OMOP CDM format, the initial setup proceeded smoothly. However, we faced a few local peculiarities of data structure, such as:

  • ‘null’ in important attributes;
  • non-standard usage of certain event types (e.g. drug_type_concept_id was not always relevant);
  • duplicates in source tables;
  • unmatched data between domains (e.g. no visit_occurrence_id in several records).

In more complicated cases, such as when the records contradicted the OMOP specification, we validated local corrections with the client. This pre-extract step of validation allowed us to clean and structure the data.
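A minimal profiling pass over the peculiarities listed above (nulls in important attributes, duplicates, records without a visit link) might look like the sketch below. The required-field list and sample rows are illustrative, not the client's actual schema subset.

```python
# Count basic data-quality issues in a batch of OMOP-like records.
def profile(records, required=("person_id", "condition_concept_id")):
    issues = {"nulls": 0, "duplicates": 0, "unlinked": 0}
    seen = set()
    for rec in records:
        # 'null' in important attributes
        if any(rec.get(f) is None for f in required):
            issues["nulls"] += 1
        # exact-duplicate rows
        key = tuple(sorted(rec.items()))
        if key in seen:
            issues["duplicates"] += 1
        seen.add(key)
        # missing cross-domain link to a visit
        if rec.get("visit_occurrence_id") is None:
            issues["unlinked"] += 1
    return issues

rows = [
    {"person_id": 1, "condition_concept_id": 201826, "visit_occurrence_id": 9},
    {"person_id": 1, "condition_concept_id": 201826, "visit_occurrence_id": 9},
    {"person_id": 2, "condition_concept_id": None, "visit_occurrence_id": None},
]
print(profile(rows))  # {'nulls': 1, 'duplicates': 1, 'unlinked': 1}
```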

2) Extraction
After evaluating data quality, completeness, and logical coherence, we extracted OMOP CDM data using SQL scripts optimized for batching, allowing memory-safe parallel processing. To speed up extraction, we optimized queries with index scans on key columns such as person_id and visit_occurrence_id, and parallelized extraction by domain, grouping by OMOP table type (condition_occurrence, drug_exposure, observation, etc.).
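One common way to batch such extractions is to split the key range into windows and issue one query per window, so each batch stays memory-safe and batches can run in parallel. The generator below is a sketch under that assumption; table and column names follow the OMOP CDM, but the batching strategy itself is illustrative.

```python
# Generate one SQL statement per person_id window for a given OMOP table.
def batch_queries(table: str, batch_size: int, max_person_id: int):
    """Yield range-partitioned SELECTs covering [0, max_person_id)."""
    for start in range(0, max_person_id, batch_size):
        yield (
            f"SELECT * FROM {table} "
            f"WHERE person_id >= {start} AND person_id < {start + batch_size} "
            f"ORDER BY person_id"
        )

queries = list(batch_queries("condition_occurrence", 10000, 25000))
print(len(queries))  # 3
```

Each query hits the person_id index, and a worker pool can execute the windows for different domains concurrently.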

3) Conversion
We began with an automated mapping step:

- Terminology Standardization
Using the Rabbit-in-a-Hat tool and our custom scripts, we standardized terminology via the OMOP Vocabulary and mapped OMOP tables to FHIR resources using SNOMED CT, RxNorm, and LOINC.

- Unit Normalization
To normalize units, we used our custom script to convert values from the measurement.unit_concept_id field into standardized UCUM entities for FHIR Observation.valueQuantity. The script was integrated into the transformation phase of the pipeline, worked as a callable function with parameters, and supported a centralized, easily maintainable mapping. For non-standard or absent units, the script had a fallback mechanism with logging, which helped identify and resolve such issues.
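The shape of such a normalizer can be sketched as a lookup table from OMOP unit concept IDs to a canonical UCUM code plus a conversion factor, with a logged pass-through fallback for unmapped units. The concept IDs and table contents below are a small illustrative subset, not the project's actual mapping.

```python
# Unit normalization sketch with a logged fallback for unknown units.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("unit_norm")

# unit_concept_id -> (canonical UCUM code, multiplicative factor);
# entries here are illustrative examples only.
UNIT_MAP = {
    8840: ("mg/dL", 1.0),    # already canonical
    8636: ("mg/dL", 100.0),  # g/L -> mg/dL (1 g/L = 100 mg/dL)
}

def normalize(value: float, unit_concept_id: int):
    """Return (value, UCUM code) in the canonical unit, or pass the
    value through (with a warning and no code) when the unit is unmapped."""
    entry = UNIT_MAP.get(unit_concept_id)
    if entry is None:
        log.warning("Unmapped unit concept %s, passing through", unit_concept_id)
        return value, None
    ucum, factor = entry
    return value * factor, ucum

print(normalize(1.2, 8636))  # (120.0, 'mg/dL')
```

Keeping the table in one place makes the mapping centrally maintainable, and the warning log surfaces exactly which source units still need a rule.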

- Post-Coordination for Complex Cases
To ensure semantic consistency while preserving essential details, we applied post-coordination in complicated cases where an OMOP concept had no direct FHIR equivalent, such as complication severity or complex symptoms. We used our automation framework, Jackalope, to generate post-coordinated mappings where direct alignment with FHIR was not possible.
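To make post-coordination concrete: when no single pre-coordinated code exists, a SNOMED CT compositional expression (a focus concept refined by attribute-value pairs) can be carried in the coding. The helper below is a hypothetical illustration of that idea, not Jackalope's API; the SNOMED concept IDs follow the standard terminology.

```python
# Build a coding holding a post-coordinated SNOMED CT expression.
def post_coordinated_coding(focus: str, refinements: dict) -> dict:
    """Compose 'focus : attribute = value, ...' per SNOMED compositional grammar."""
    expr = focus + " : " + ", ".join(
        f"{attr} = {val}" for attr, val in refinements.items()
    )
    return {"system": "http://snomed.info/sct", "code": expr}

coding = post_coordinated_coding(
    "64572001",                 # |Disease|
    {"246112005": "24484000"},  # |Severity| = |Severe|
)
print(coding["code"])  # 64572001 : 246112005 = 24484000
```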

4) Load

  • Based on mapping and transformation rules, we generated valid FHIR R4 resources as JSON documents.
  • To avoid the bottleneck of numerous calls, we uploaded multiple FHIR Bundle transactions in parallel via the Batch API, compressing JSON bodies with GZIP, and where possible employed $import operations.
  • The generated resources were validated using the FHIR validator and cross-checked against aggregated OMOP and FHIR statistics for consistency.
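The bundling and compression steps above can be sketched as follows. The Bundle shape follows FHIR R4; the endpoint URL is hypothetical, and the actual HTTP call is left as a comment so the snippet runs offline.

```python
# Build a FHIR R4 transaction Bundle and gzip-compress the request body.
import gzip
import json

def make_transaction_bundle(resources: list) -> dict:
    """Wrap resources in a transaction Bundle with per-entry POST requests."""
    return {
        "resourceType": "Bundle",
        "type": "transaction",
        "entry": [
            {
                "resource": res,
                "request": {"method": "POST", "url": res["resourceType"]},
            }
            for res in resources
        ],
    }

bundle = make_transaction_bundle([
    {"resourceType": "Patient", "id": "p1"},
    {"resourceType": "Condition", "id": "c1"},
])
body = gzip.compress(json.dumps(bundle).encode())

# Hypothetical upload (endpoint is illustrative):
# import requests
# requests.post("https://fhir.example.org/fhir", data=body,
#               headers={"Content-Type": "application/fhir+json",
#                        "Content-Encoding": "gzip"})

print(len(bundle["entry"]))  # 2
```

Several such compressed bundles can then be posted concurrently by a worker pool, which is what makes the load stage throughput-bound rather than latency-bound.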


Technical highlights

To meet these challenges, we created:

  • A flexible, resilient, and scalable pipeline based on custom Python scripts.
  • A version-controlled, Git-documented ETL orchestrator to guide the data flow.
  • Automated solutions for real-world semantics, such as a unit-of-measure normalization module with fallback logging and a hash-based deduplication mechanism.
  • A context-aware generator of FHIR JSON resources, e.g. handling drug_type_concept_id logic.
  • A system of checks with audit logs per bundle/record processed, automated validation of FHIR structures, and a progress-saving component that allows restoring and resuming after a breakdown.

Result

- Accelerated Data Operations with a Scalable Transformation Pipeline
The client reduced manual data handling by 60–80%, freeing engineering and clinical staff time and avoiding bottlenecks. The pipeline significantly improved overall efficiency and reliability through built-in error handling, versioning, and full traceability to support audit and compliance.

- Future-Ready Scalability and Real-Time Monitoring
Real-time monitoring empowered the client to manage system performance when needed, reducing downtime and ensuring operational continuity.

- Enhanced Clinical Relevance Through Standardization
The solution enabled consistent vocabulary mapping and code normalization across datasets, preserving clinical accuracy. This standardization supported meaningful comparisons, better decision support, and improved downstream analytics.

- Achieved Seamless Interoperability with OMOP-to-FHIR Conversion
A complex OMOP-to-FHIR data conversion was successfully implemented through the development of novel technical solutions. The project integrated engineering innovation with synchronous clinical expertise to ensure both technical accuracy and real-world semantics.
A full OMOP-to-FHIR transformation was delivered across 12 OMOP domains, with peak loading throughput of 800–1000 FHIR resources per second, all schema-validated and aligned with institutional vocabularies.

- Improved Clinical Decision-Making with Standardized Insights
Standardized data from observational research was integrated into routine workflows, enabling the client to evaluate treatment effectiveness and care delivery with greater consistency—ultimately improving patient outcomes and clinical efficiency.
