Client Profile
Our client, a U.S. health insurer collaborating with multiple hospital systems, aimed to aggregate and harmonize anonymized claims and clinical data in the PCORnet Common Data Model (CDM) to support large-scale outcomes research and operational analytics. The incoming medical and billing feeds came from heterogeneous hospital and payer systems with inconsistent schemas, variable data quality, and no unified governance. The client asked SciForce to design and implement a sustainable, cloud-native ETL/ELT pipeline on Snowflake that would:
1) Continuously integrate raw source feeds into a centralized Snowflake data platform;
2) Transform them into a PCORnet-conformant CDM with strong data quality guarantees;
3) Enable near real-time analytics for patient demand forecasting, capacity planning, and revenue cycle optimization.
Challenge
1) Choosing the optimal cloud data platform
The client was evaluating modern cloud data platforms and had a strong preference for Snowflake but wanted an evidence-based comparison with AWS-native tooling (Redshift, Glue, Lambda, S3). SciForce performed a focused R&D assessment comparing:
- Total cost of ownership (compute, storage, data egress);
- Scalability and concurrency for PCORnet-scale workloads;
- Support for ELT patterns (in-database transforms) and CI/CD;
- Security and compliance controls (HIPAA, PHI handling);
- Fit for PCORnet CDM and healthcare-specific workloads.
Based on this assessment and the client’s technology strategy, Snowflake was selected as the core analytical platform, with all heavy transformations executed in-database using Snowflake virtual warehouses.
2) Diverse data sources and quality issues
The source data arrived in different formats (HL7 FHIR, HL7 CDA, and openEHR) from several hospital systems and insurance providers and exhibited substantial heterogeneity:
- Different coding systems and formats (ICD, CPT/HCPCS, local codes);
- Inconsistent use of nulls, default values, and free text;
- Schema drift between file drops (columns added/removed/renamed);
- Duplicate and conflicting records across payers and providers.
Because source tables were not systematically validated, we had to implement extensive automated profiling, anomaly detection, and data cleaning prior to mapping into PCORnet.
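A minimal sketch of this kind of automated profiling check in Snowflake SQL, using placeholder table and column names rather than the client’s actual schema:

```sql
-- Illustrative profiling query (placeholder names): row counts, null rates,
-- and exact duplicates in a raw claims feed before PCORnet mapping.
SELECT
    COUNT(*)                                     AS row_cnt,
    COUNT_IF(member_id IS NULL)                  AS null_member_id,
    COUNT_IF(TRIM(diagnosis_code) = '')          AS empty_dx_code,
    COUNT(*) - COUNT(DISTINCT claim_id, line_no) AS duplicate_lines
FROM raw_stage.claims_feed;
```

Checks along these lines fed the anomaly detection step described above, so problems could be flagged before mapping into PCORnet.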
3) Appropriate tooling and environment setup
To ensure robustness, scalability, and maintainability of the ETL pipeline on the chosen platform, we established a dedicated Snowflake environment that included:
- Separate development, staging, and production accounts and virtual warehouses;
- Role-based access control (RBAC) aligned with least-privilege principles;
- Automated CI/CD pipelines for ETL code (SQL/JavaScript/dbt) and configuration;
- Monitoring dashboards for performance, cost, and service-level agreement (SLA) adherence.
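A minimal sketch of the least-privilege RBAC setup described above; the role, database, and schema names are illustrative assumptions:

```sql
-- Illustrative least-privilege roles (all names are placeholders).
CREATE ROLE IF NOT EXISTS ETL_DEVELOPER;
CREATE ROLE IF NOT EXISTS ANALYTICS_READER;

-- Developers build only in the development database.
GRANT USAGE ON DATABASE DEV_PCORNET TO ROLE ETL_DEVELOPER;
GRANT USAGE, CREATE TABLE, CREATE VIEW ON SCHEMA DEV_PCORNET.STAGING TO ROLE ETL_DEVELOPER;

-- Analysts get read-only access to the production CDM schema.
GRANT USAGE ON DATABASE PROD_PCORNET TO ROLE ANALYTICS_READER;
GRANT USAGE ON SCHEMA PROD_PCORNET.CDM TO ROLE ANALYTICS_READER;
GRANT SELECT ON ALL TABLES IN SCHEMA PROD_PCORNET.CDM TO ROLE ANALYTICS_READER;
```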
4) Meeting Snowflake features and constraints
Snowflake’s architecture - separate storage and compute, micro-partitioning, result caching, and multi-cluster virtual warehouses - allowed us to implement an ELT-first approach:
- Raw feeds are landed into Snowflake staging schemas with minimal pre-processing;
- All complex transformations, joins, and PCORnet mappings run inside Snowflake using virtual warehouses tuned per workload;
- Streams and Tasks orchestrate incremental loads and change data capture (CDC) natively inside Snowflake, so external schedulers are not required.
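A simplified sketch of this Streams-and-Tasks pattern; the object names and the 15-minute schedule are assumptions for illustration:

```sql
-- A stream captures changes on the staging table (placeholder names).
CREATE OR REPLACE STREAM STG.CLAIMS_STREAM ON TABLE STG.CLAIMS_RAW;

-- A task applies only the new rows downstream whenever the stream has data.
CREATE OR REPLACE TASK STG.LOAD_CLAIMS_INCREMENT
  WAREHOUSE = WH_ETL_TRANSFORM
  SCHEDULE  = '15 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('STG.CLAIMS_STREAM')
AS
  INSERT INTO CDM_WORK.CLAIMS_CLEAN (claim_id, member_id, service_date, diagnosis_code)
  SELECT claim_id, member_id, service_date, diagnosis_code
  FROM STG.CLAIMS_STREAM
  WHERE METADATA$ACTION = 'INSERT';

ALTER TASK STG.LOAD_CLAIMS_INCREMENT RESUME;
```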
Compared to a fully AWS-native stack, Snowflake places more responsibility on well-engineered SQL/JavaScript transformations and metadata management. SciForce addressed this by implementing:
- A reusable, parameterized ETL framework in SQL/JavaScript and dbt;
- Centralized data cataloging and lineage tracking integrated with Snowflake metadata;
- Idempotent, restartable pipelines to support robust recovery and reprocessing.
5) Snowflake integration and ETL design
Leveraging these platform capabilities, SciForce designed a Snowflake-centric architecture for the PCORnet ETL that emphasizes:
- A modular, parameterized SQL/JavaScript transformation framework optimized for PCORnet tables;
- Reusable mapping libraries for diagnosis/procedure/medication/encounter domains;
- Idempotent load patterns (truncate-insert, merge-upsert) with robust audit logging;
- Config-driven pipelines so that most behavior can be changed via metadata rather than code.
Complex business logic (e.g., encounter inference, episode construction, payer aggregation) was implemented as well-tested Snowflake stored procedures, while dbt models handled declarative transformations and dependency management. This approach allows controlled reuse across Snowflake-based projects while keeping the design transparent and maintainable.
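A condensed sketch of the idempotent merge-upsert pattern with audit logging, written as a Snowflake SQL stored procedure; the table and column names are illustrative and do not reproduce the actual PCORnet mappings:

```sql
-- Illustrative idempotent load: merge a staged batch and record an audit entry.
CREATE OR REPLACE PROCEDURE CDM.LOAD_ENCOUNTER_BATCH(BATCH_ID VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
  rows_merged INTEGER;
BEGIN
  MERGE INTO CDM.ENCOUNTER AS tgt
  USING (SELECT * FROM CDM_WORK.ENCOUNTER_STAGED WHERE batch_id = :BATCH_ID) AS src
    ON tgt.encounterid = src.encounterid
  WHEN MATCHED THEN UPDATE SET
    tgt.admit_date     = src.admit_date,
    tgt.discharge_date = src.discharge_date
  WHEN NOT MATCHED THEN INSERT (encounterid, patid, admit_date, discharge_date)
    VALUES (src.encounterid, src.patid, src.admit_date, src.discharge_date);

  rows_merged := SQLROWCOUNT;

  INSERT INTO OPS.ETL_AUDIT_LOG (batch_id, target_table, rows_affected, loaded_at)
  VALUES (:BATCH_ID, 'CDM.ENCOUNTER', :rows_merged, CURRENT_TIMESTAMP());

  RETURN 'Merged ' || rows_merged || ' rows for batch ' || BATCH_ID;
END;
$$;
```

Because the MERGE is keyed on the natural identifier, rerunning the same batch after a failure updates rather than duplicates rows, which is what makes the load restartable.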
6) Billing
Because Snowflake separates storage from compute and bills per-second for warehouse usage, we deliberately optimized the ETL design to:
- Use dedicated, size-appropriate virtual warehouses for staging, transformations, and analytics;
- Enable auto-suspend and auto-resume so warehouses run only during active ETL windows, minimizing idle time;
- Leverage clustering, pruning, and selective materialization to reduce the amount of data scanned in each step.
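An illustrative warehouse configuration along these lines; the warehouse names and sizes are assumptions:

```sql
-- Dedicated transformation warehouse, active only during ETL batches.
CREATE OR REPLACE WAREHOUSE WH_ETL_TRANSFORM
  WAREHOUSE_SIZE      = 'LARGE'
  AUTO_SUSPEND        = 60      -- suspend after 60 seconds of inactivity
  AUTO_RESUME         = TRUE    -- wake automatically when the next task runs
  INITIALLY_SUSPENDED = TRUE;

-- Separate multi-cluster warehouse for concurrent analytics queries.
CREATE OR REPLACE WAREHOUSE WH_ANALYTICS
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  AUTO_SUSPEND      = 120
  AUTO_RESUME       = TRUE;
```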
In Snowflake, the total cost of running ETL workloads is primarily driven by:
- Compute: warehouse runtime, measured in Snowflake credits consumed while processing data;
- Storage: the volume of raw, staged, and PCORnet-conformant data retained in the platform;
- Optional integrations: any third-party ingestion or orchestration tools used alongside Snowflake.
By keeping warehouses active only for the duration of ETL batches and minimizing scanned volumes through careful clustering and partition pruning, we achieved a measured compute cost of approximately 9–15 Snowflake credits per TB processed, with clear visibility and control over spend.
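Such visibility can be obtained from Snowflake’s account usage views; a simplified spend-monitoring query might look like this:

```sql
-- Daily credit consumption per warehouse over the last 30 days.
SELECT
    DATE_TRUNC('day', start_time) AS usage_day,
    warehouse_name,
    SUM(credits_used)             AS credits
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY usage_day, warehouse_name
ORDER BY usage_day, warehouse_name;
```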
During the initial architecture assessment, we also compared this cost model with AWS Glue’s serverless pricing. For always-on, long-running ETL pipelines over very large, continuous workloads, AWS Glue can be more economical thanks to its serverless execution model. However, for this client’s bursty, SQL-centric ELT workloads - where heavy transformations run in short, well-optimized batches directly inside Snowflake - the Snowflake-based approach proved more cost-effective overall, while also simplifying governance and performance tuning.
Solution
Scalable multi-cloud warehouse
Leveraging Snowflake’s multi-cluster virtual warehouses and micro-partitioning, we designed a scalable end-to-end ETL/ELT pipeline that harmonizes multi-terabyte datasets into the PCORnet CDM without performance degradation as volumes grow or concurrency increases.
Performance and speed
Our solution minimizes data movement delays and enables fast in-database processing and low-latency transformations that meet the client’s agreed SLAs.
Transparent & cost-effective architecture
A Snowflake-based architecture tailored to the client’s operational demands delivers predictable and transparent cloud costs, optimized through right-sized virtual warehouses, auto-suspend/auto-resume, and targeted clustering.
Robust data validation and quality assurance
Taking advantage of Snowflake’s built-in Time Travel and Fail-safe features as well as our custom scripts, the pipeline provides a fallback mechanism for failed extractions, supports point-in-time recovery, maintains detailed audit logs, and includes progress-saving and source-to-target validation components.
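A minimal sketch of the Time Travel side of this fallback, with placeholder names and an assumed one-hour offset:

```sql
-- Inspect the table as it looked before a failed load (one hour ago).
SELECT COUNT(*) FROM CDM.ENCOUNTER AT (OFFSET => -60*60);

-- Clone the pre-load state so the batch can be reprocessed safely.
CREATE OR REPLACE TABLE CDM.ENCOUNTER_RESTORE
  CLONE CDM.ENCOUNTER AT (OFFSET => -60*60);
```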
Data security & compliance
Secure data transfer and storage mechanisms, together with fine-grained role-based access control, keep sensitive data protected. The solution was delivered in full compliance with HIPAA requirements, including encryption in transit and at rest and audited access to sensitive PCORnet tables.
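As one example of access auditing, queries touching sensitive CDM tables can be reviewed through Snowflake’s ACCESS_HISTORY view (available in Enterprise Edition); a simplified sketch with placeholder names:

```sql
-- Who accessed production CDM tables in the last 7 days.
SELECT
    ah.query_start_time,
    ah.user_name,
    obj.value:objectName::STRING AS object_name
FROM SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY AS ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) AS obj
WHERE obj.value:objectName::STRING LIKE 'PROD_PCORNET.CDM.%'
  AND ah.query_start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP());
```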
Development Process
1) Stakeholder alignment and requirements gathering
First, we provided an R&D comparison between AWS and Snowflake to propose a solution best tailored to the client’s requirements and constraints. Then, after a proof-of-concept run that validated performance and cost assumptions, our team proceeded with setting up the data source inventory and access management within the Snowflake environment.
2) Data assessment and infrastructure configuration
As mentioned above, we profiled and analyzed the source data (e.g., identifying and removing duplicates and missing values), preparing it for further transformations.
To automate the process for subsequent refresh cycles, we built custom data quality and integrity checks on top of the scalable Snowflake infrastructure and tooling, including automated anomaly detection, row-count reconciliation, and schema-drift monitoring.
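A simplified sketch of the row-count reconciliation and schema-drift checks; the feed names and the OPS.EXPECTED_SCHEMA contract table are hypothetical:

```sql
-- Row-count reconciliation between a source drop and its staged copy.
WITH src AS (SELECT COUNT(*) AS cnt FROM RAW_STAGE.CLAIMS_FEED),
     tgt AS (SELECT COUNT(*) AS cnt FROM CDM_WORK.CLAIMS_CLEAN)
SELECT src.cnt AS source_rows,
       tgt.cnt AS target_rows,
       src.cnt - tgt.cnt AS difference
FROM src, tgt;

-- Schema-drift check: columns present in the feed but missing from the expected contract.
SELECT column_name, data_type
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = 'RAW_STAGE' AND table_name = 'CLAIMS_FEED'
MINUS
SELECT column_name, expected_type
FROM OPS.EXPECTED_SCHEMA
WHERE table_name = 'CLAIMS_FEED';
```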
3) ETL Development
- Snowflake-oriented architecture
The quality of the ETL process directly depends on the quality of the code. Our development team ensured an agile and efficient process, using Snowflake’s robust SQL engine for all transformations and aggregations and implementing modular, parameterized scripts for PCORnet-specific logic.
- Scalability
A few scalability considerations were worth particular attention:
- For dynamic scaling of clusters (multi-cluster warehouses) during ETL, we aligned with the client’s governance model to allow controlled auto-scaling, especially when handling large data volumes and test environments;
- Large ETL queries were broken down into smaller tasks that run in parallel (see the sketch after this list);
- Streams were configured for incremental updates instead of reprocessing the entire dataset in one go.
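A minimal sketch of such a task chain; the task, procedure, and warehouse names are placeholders:

```sql
-- Root task kicks off the nightly batch.
CREATE OR REPLACE TASK ETL.T_STAGE_CLAIMS
  WAREHOUSE = WH_ETL_TRANSFORM
  SCHEDULE  = 'USING CRON 0 2 * * * UTC'
AS CALL ETL.STAGE_CLAIMS();

-- Two mapping steps depend on the staging step and can run in parallel.
CREATE OR REPLACE TASK ETL.T_MAP_DIAGNOSIS
  WAREHOUSE = WH_ETL_TRANSFORM
  AFTER ETL.T_STAGE_CLAIMS
AS CALL ETL.MAP_DIAGNOSIS();

CREATE OR REPLACE TASK ETL.T_MAP_ENCOUNTER
  WAREHOUSE = WH_ETL_TRANSFORM
  AFTER ETL.T_STAGE_CLAIMS
AS CALL ETL.MAP_ENCOUNTER();

-- Child tasks are resumed before the root task.
ALTER TASK ETL.T_MAP_DIAGNOSIS RESUME;
ALTER TASK ETL.T_MAP_ENCOUNTER RESUME;
ALTER TASK ETL.T_STAGE_CLAIMS RESUME;
```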
- Data restructuring
We extracted the data from the different sources, then cleaned and sorted it. Mapping of the source data to the PCORnet Common Data Model was supervised by a team of medical doctors, which ensured semantic consistency and data integrity.
- Resiliency and Recovery
SciForce has a strong track record of building resilient pipelines. We leveraged Snowflake Time Travel and Fail-safe and implemented a progress-saving component, allowing the client to restore or resume processing after a failure and to proceed confidently with data alignment and analysis of operational efficiency. Clear documentation supports future reuse of the scripts.
4) Snowflake integration
Deployment into the Snowflake platform followed the architecture described above: SQL/JavaScript stored procedures and dbt models were promoted through the automated CI/CD pipeline across the development, staging, and production environments, with Snowflake Streams and Tasks orchestrating the scheduled loads.
5) Robust pipeline optimization and quality assurance
Performance benchmarking revealed potential bottlenecks when we worked with full-scale and stress-test data volumes. We therefore introduced explicit clustering on high-cardinality columns and refactored the heaviest queries, significantly improving overall throughput. As a result, the speed metrics (throughput, latency, and runtime) and reliability metrics met or exceeded the client’s requirements and allowed datasets up to 10x larger to be processed without violating SLAs.
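An illustrative example of the clustering step; the table and column choices are assumptions:

```sql
-- Define clustering keys on a large, frequently filtered CDM table.
ALTER TABLE CDM.DIAGNOSIS CLUSTER BY (patid, admit_date);

-- Inspect how well micro-partitions are organized for those keys.
SELECT SYSTEM$CLUSTERING_INFORMATION('CDM.DIAGNOSIS', '(patid, admit_date)');
```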
We also integrated automated source-to-target checks, transformation-level tests and post-load validation in the pipeline to achieve efficient iterative refinement and ensure data quality.
6) End-to-end workflow automation and maintainability
Using Snowflake Tasks and Streams, together with our configuration-driven ETL framework, the entire workflow - from raw data arrival in staging through transformation into PCORnet CDM and post-load validation - is now automated and requires minimal manual intervention. Operational dashboards and alerting allow the client’s team to monitor runtimes, failures, and Snowflake credit usage, while clear documentation and runbooks make the solution easy to operate and extend.
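For failure alerting, a simplified query over the TASK_HISTORY table function can feed those dashboards; the 24-hour window is an assumption:

```sql
-- Recent task failures for the operational alerting dashboard.
SELECT name, state, error_message, scheduled_time, completed_time
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
       SCHEDULED_TIME_RANGE_START => DATEADD('hour', -24, CURRENT_TIMESTAMP())))
WHERE state = 'FAILED'
ORDER BY scheduled_time DESC;
```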
Result
- Data Harmonization
We integrated heterogeneous patient datasets from 5 hospitals and 12 insurance providers into a single, PCORnet-conformant CDM to enable detailed cross-site analytics.
- ETL Pipeline Engineering
We designed a sustainable, parameterized ETL pipeline on the Snowflake Data Cloud, implementing automated schema validation, incremental loads, and error handling for data quality assurance. For continuous deployment and monitoring, we integrated CI/CD processes.
- Performance Metrics
Achieved a transformation runtime of ~25 minutes per TB of processed data and an error rate of 0.089%, validated across multiple transformation batches. Horizontal scalability supported datasets up to 10× larger under increased concurrency without violating SLAs.
- Architecture Optimization
Separation of compute and storage layers ensured elastic scalability and cost efficiency, while query optimization, result caching, and materialized views minimized processing time. Computational efficiency ranged between 9–15 Snowflake credits per TB, thus lowering SQL transformation costs and providing clear visibility into compute spend.
- Operational Impact
Our solution enabled predictive analytics for patient demand forecasting, capacity planning, and utilization monitoring. As a result, the client improved resource allocation and revenue cycle management through timely, data-driven insights.