Sandeep Bhatia

Healthcare Analytics Has a Cloud Cost Problem. And It's Not the One You Think.

Healthcare analytics organizations are quietly building some of the most complex cloud data architectures in existence. Petabytes of EHR data. Claims streams arriving in near-real time. Clinical trial datasets that span multiple sites, multiple countries, and multiple consent regimes. Predictive models for patient readmission risk running on top of BigQuery ML. Databricks clusters processing Epic and Cerner extracts into curated Delta tables every four hours.

And almost all of them are significantly overspending.

Not because they chose the wrong vendors. Not because they over-built their infrastructure. But because the skills required to manage cost at this level of complexity demand something the industry hasn't fully articulated yet: you need to be both a FinOps engineer and a deep query performance specialist, simultaneously, across multiple platforms that each have completely different cost models.

I want to walk through exactly what that looks like in practice — where the money goes, why it's so hard to catch, and what the engineers who are successfully tackling it are doing differently.


The Stack That Healthcare Analytics Actually Runs On

Before we get into cost traps, it's worth describing the architecture I see most often.

On the Azure side: Databricks for data engineering and ML pipeline work, Azure Synapse Analytics SQL Pools for the analytical warehouse that clinical and operations teams query directly. Jupyter notebooks scattered across the Databricks workspace — some attached to long-running clusters, some firing up their own clusters, some nobody has opened in six months but the cluster is still warm. The orchestration is usually Azure Data Factory or Databricks Workflows, and the governance layer is Unity Catalog if the team is reasonably mature.

On the GCP side — often run in parallel by the same organization — BigQuery for analytics, BigQuery ML for the predictive modeling work, and Cloud Composer (managed Airflow) to schedule everything. The BigQuery workload tends to be heavier on ad-hoc query traffic because the SQL interface is familiar to clinical analysts who don't want to touch PySpark.

What drives this dual-cloud topology? Often it's acquisition history. A health system acquires a specialty pharmacy data company that runs on Azure. The core analytics platform was built on GCP three years ago. Nobody wants to migrate anything, and honestly they shouldn't — migration risk in healthcare data environments is real. So you end up with both.

The result is a cost management problem that spans two major clouds, three or four distinct compute billing models, and a dozen different ways for a single query to silently consume thousands of dollars.


What Healthcare Companies Are Now Requiring of Their Data Platform Engineers

Something has shifted in the last eighteen months in the healthcare analytics job market. The job descriptions that used to read "strong SQL skills, experience with ETL pipelines" now require a genuinely different profile.

Here is what leading healthcare analytics companies are now asking for in their job descriptions:

  • 6+ years of hands-on data engineering experience, with significant exposure to healthcare clinical data — EHR systems, claims, clinical trials, population health datasets
  • Proficiency with Databricks for building and optimizing data pipelines, including Unity Catalog, Medallion architecture, and Delta Lake optimization
  • Strong skills in Azure Synapse and/or BigQuery with an understanding of their respective cost models, not just their query interfaces
  • Cost awareness as a first-class skill — the ability to estimate, attribute, and reduce compute spend across platforms, not just as an afterthought but as part of pipeline design
  • Experience with Jupyter Notebooks governance — understanding which notebooks are driving cluster costs, which are idle, and how to build guardrails around ad-hoc compute consumption
  • Familiarity with FinOps practices: tag-based cost allocation, chargeback to clinical vs. research vs. operations teams, and the tooling required to make that visible
  • Understanding of HIPAA-compliant approaches to cost monitoring, which means you can't just pipe query logs to an external SaaS without thinking about what's in the metadata

That last point deserves its own paragraph. Cost monitoring tools that work cleanly for a retail analytics team create compliance headaches in healthcare environments. Query text can contain patient identifiers. Execution logs can reveal which clinical cohorts are being analyzed. Any tooling that touches query metadata needs to be evaluated not just for technical capability but for data governance suitability.

The companies asking for all of this in a single engineer are essentially describing a new subspecialty that sits at the intersection of database performance engineering and FinOps. Let me show you what that looks like across the two most common platform combinations.


Use Case 1: Databricks on Azure + Synapse SQL Pools

The Architecture

A mid-sized health system I worked with had this setup: Databricks handling the heavy transformation work — ingesting Epic ADT feeds, normalizing Cerner lab result streams, building the curated patient encounter tables that downstream systems consumed. Synapse SQL Pools sat downstream, serving as the analytical warehouse for clinical operations and population health queries. Finance was getting charged a blended rate for "data platform." Nobody knew how much was Databricks versus Synapse, and nobody knew how much of either was actually necessary.

Cost Trap 1: Databricks Cluster Scaling Gone Wrong

Databricks auto-scaling is genuinely useful. It's also one of the most common sources of uncontrolled spend I've seen.

The problem is that autoscaling responds to pending tasks, not to business priority. An EHR ingestion job that runs at 3 AM and a data scientist running an exploratory cohort analysis at 2 PM can both trigger a cluster scale-out event. But they have completely different cost profiles and completely different latency requirements. The ingestion job can run on a smaller cluster over a longer window. The analyst query needs to complete before the afternoon clinical review meeting.

What I typically find when I dig into system.billing.usage on these clusters:

SELECT
    u.sku_name,
    DATE_TRUNC('HOUR', u.usage_start_time) AS hour_bucket,
    SUM(u.usage_quantity) AS dbus,
    SUM(u.usage_quantity * lp.pricing.default) AS est_list_cost
FROM system.billing.usage u
JOIN system.billing.list_prices lp           -- list prices live in a separate system table
    ON u.sku_name = lp.sku_name
   AND u.usage_start_time >= lp.price_start_time
   AND (lp.price_end_time IS NULL OR u.usage_start_time < lp.price_end_time)
WHERE u.usage_date >= DATEADD(DAY, -30, CURRENT_DATE())
GROUP BY 1, 2
ORDER BY 4 DESC

Almost universally, the top cost hours are not the scheduled pipeline windows. They are the hours immediately surrounding when data scientists opened notebooks and started exploring — and the clusters scaled out to accommodate the interactive workload, then didn't scale back in before the next scheduled job ran and kept the cluster alive.

The fix is workload isolation by job type, not just by time of day. Scheduled EHR ingestion pipelines should run on dedicated job clusters with fixed sizing, sized for the actual task. Interactive exploration should happen on separate clusters with aggressive auto-terminate configured (ten minutes of inactivity is a reasonable starting point for clinical analytics environments where the analyst might step away for a call).
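
A quick way to see where the interactive risk lives is the cluster inventory in the system tables. Here's a minimal sketch, assuming system.compute.clusters is enabled in your workspace (field names can vary slightly by Databricks release):

-- Interactive clusters with no auto-terminate, or an overly generous one
SELECT
    cluster_name,
    owned_by,
    auto_termination_minutes,
    max_autoscale_workers
FROM system.compute.clusters
WHERE cluster_source = 'UI'                          -- user-created, interactive clusters
  AND delete_time IS NULL                            -- still defined in the workspace
  AND (auto_termination_minutes IS NULL OR auto_termination_minutes > 30)
ORDER BY max_autoscale_workers DESC

Anything this returns is a candidate for a stricter cluster policy or a conversation with the cluster owner.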

Cost Trap 2: Shuffle Spill in Clinical Data Joins

The most expensive Spark anti-pattern I see in healthcare pipelines is the shuffle spill on patient-encounter joins. The reason it's endemic to healthcare specifically: encounter tables are enormous (millions of rows per patient-year for high-utilization populations), and the analytical question almost always requires joining them back to patient demographic tables, claims tables, and diagnosis code tables simultaneously.

When Spark can't fit the shuffle exchange in memory, it spills to local disk and potentially to remote storage. By the time a single claims adjudication query has spilled 15 GB to remote storage, you're looking at significant I/O cost on top of the DBU cost — and the query is also dramatically slower than it needs to be.

The signature I look for in system.query.history:

SELECT
    statement_id,
    statement_text,
    total_duration_ms,
    spilled_local_bytes,
    spilled_remote_bytes,
    read_bytes
FROM system.query.history
WHERE spilled_remote_bytes > 1073741824  -- 1 GB
  AND execution_status = 'FINISHED'
ORDER BY spilled_remote_bytes DESC
LIMIT 20

The remediation depends on the specific join. For joins where one side is a small reference table (ICD-10 code descriptions, formulary data, provider rosters), adding a /*+ BROADCAST(small_table) */ hint eliminates the shuffle entirely. For genuine large-to-large joins (patient demographics to encounter history), the answer is usually Z-ORDER on the join columns and ensuring partition pruning is firing. If the claims table is partitioned by service_date and your join query doesn't include a date filter, you're scanning the entire table every time.
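
As a concrete sketch of both remediations (the table and column names here are illustrative, not from a specific schema):

-- Reference-data join: broadcast the small side so no shuffle is needed
SELECT /*+ BROADCAST(dx) */
    e.patient_id,
    e.encounter_id,
    dx.description AS diagnosis_description
FROM clinical.patient_encounters e
JOIN reference.icd10_codes dx
    ON e.primary_diagnosis_code = dx.icd10_code
WHERE e.service_date >= DATE '2024-01-01';           -- partition filter keeps the scan bounded

-- Large-to-large join: co-locate the data on the join key instead
OPTIMIZE clinical.patient_encounters
ZORDER BY (patient_id);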

Cost Trap 3: Synapse DWU Over-Provisioning

Synapse SQL Pools bill by DWU-hour. DW300c, DW500c, DW1000c — each step up is a significant price increase, and the pricing is not linear relative to performance gains for all workload types.

The pattern I see most often: the Synapse pool was provisioned at DW1000c for a large historical load that ran during initial deployment, and nobody ever downsized it. The ongoing operational workload — population health dashboard queries, clinical operations reports, care management cohort queries — is actually quite manageable at DW300c or DW500c. The expensive compute is idle most of the time.

Query the DMVs to see what you're actually running:

SELECT
    r.request_id,
    r.command,
    r.resource_class,
    r.status,
    r.submit_time,
    r.total_elapsed_time / 1000.0 AS elapsed_sec,
    SUM(s.row_count) AS step_row_count       -- row counts come from the step-level DMV
FROM sys.dm_pdw_exec_requests r
LEFT JOIN sys.dm_pdw_request_steps s
    ON r.request_id = s.request_id
WHERE r.status = 'Completed'
  AND r.submit_time >= DATEADD(day, -7, GETDATE())
GROUP BY r.request_id, r.command, r.resource_class, r.status, r.submit_time, r.total_elapsed_time
ORDER BY r.submit_time DESC

If the vast majority of completed requests show elapsed times under 30 seconds and row counts in the thousands rather than millions, you're running an enterprise warehouse at clinical-trial-data-load scale. That's an expensive mismatch.
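
If the workload evidence supports it, downsizing is a single statement. Run it against the master database of the logical server and plan for a short period of unavailability while the pool scales (the pool name here is a placeholder):

ALTER DATABASE [clinical_dw] MODIFY (SERVICE_OBJECTIVE = 'DW300c');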

The more subtle version of this problem is the distribution key mismatch. Synapse distributes data across 60 distributions. If your patient_encounters table is hash-distributed on encounter_id but your most common analytical query joins on patient_id, every join requires a data movement operation — shuffling rows between distributions before the join can execute. When I pull sys.dm_pdw_request_steps and see routine population health queries dominated by ShuffleMoveOperation or BroadcastMoveOperation steps moving millions of rows, this is almost always the cause.

The fix requires a table rebuild with the correct distribution key. It's disruptive, but it's often the single highest-leverage Synapse optimization available.
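
The rebuild itself is a CTAS into a correctly distributed copy, followed by a swap. A sketch, with the table names standing in for your own:

CREATE TABLE dbo.patient_encounters_redistributed
WITH (
    DISTRIBUTION = HASH(patient_id),             -- match the dominant join key
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.patient_encounters;

-- Swap the tables once the copy is validated
RENAME OBJECT dbo.patient_encounters TO patient_encounters_old;
RENAME OBJECT dbo.patient_encounters_redistributed TO patient_encounters;

Schedule it in a low-traffic window, and remember that grants and statistics on the old table need to be re-created on the new one.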

The Jupyter Notebook Problem

Data scientists in healthcare analytics environments are frequently the largest uncontrolled cost source on a Databricks platform — not because they're wasteful, but because the notebook interface abstracts away cost visibility entirely.

A clinical data scientist analyzing a patient cohort for a care management program opens a notebook, attaches it to the shared analytics cluster, runs several exploratory queries against a 500 GB Delta table, gets pulled into a meeting, and leaves the notebook session open. The cluster stays alive. The interactive SQL query that scanned 180 GB of patient encounter data without partition filters runs twelve times as they refine their cohort definition. There's no indication anywhere in the notebook interface that this sequence of operations cost $340 in compute.

This is a governance problem as much as a technical one. The teams handling it well are doing three things:

  1. Cluster policies that enforce auto-terminate for all interactive clusters, no exceptions
  2. Query tagging that attributes Databricks spend to cost centers by notebook name and username, using the custom_tags field in system.billing.usage
  3. Cost visibility at the notebook level — surfacing DBU consumption per notebook session to the data scientist in near-real time, so the feedback loop is immediate rather than appearing on a monthly cloud invoice

The chargeback piece is particularly important in healthcare organizations with distinct clinical, research, and operations functions. Clinical informatics and health economics teams often have separate budgets. Without tag-based cost allocation that maps Databricks workspace spending to those cost centers, the platform team ends up absorbing costs that should be attributable to specific programs — and they lose the ability to have informed conversations about compute allocation priorities.
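
A minimal sketch of what the tag-based attribution query can look like, assuming clusters and jobs carry a cost_center tag (the tag key and the exact usage_metadata fields depend on your tagging convention and Databricks release):

SELECT
    custom_tags['cost_center'] AS cost_center,
    usage_metadata.notebook_id AS notebook_id,   -- NULL for non-notebook workloads
    SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= DATEADD(DAY, -30, CURRENT_DATE())
GROUP BY 1, 2
ORDER BY 3 DESC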


Use Case 2: BigQuery on GCP with ML Capabilities

The Architecture

On the GCP side, BigQuery handles the analytical layer, and an increasing number of healthcare organizations are moving into BigQuery ML for the predictive modeling work that used to require standing up separate ML infrastructure. Readmission risk models. Drug interaction flags. Patient population stratification for value-based care contracts. Real-world evidence analytics for specialty pharmacy programs.

BigQuery's billing model is deceptively simple on the surface — on-demand pricing at $6.25 per TB scanned — and deceptively expensive in practice. The reason: BigQuery scans are billed at column-level granularity, which means every SELECT * on a 20-column table is billing you for 20 columns worth of scan even if your WHERE clause only needs three of them.

Cost Trap 1: BQML Training Jobs Without Partition Pruning

BigQuery ML training jobs are some of the most expensive queries I've ever seen in a production environment. The pattern in healthcare: a readmission risk model trains on three years of encounter history, and the training SQL looks something like this:

-- What this query actually does to your BigQuery bill:
CREATE OR REPLACE MODEL `clinical.readmission_risk_v3`
OPTIONS (model_type='logistic_reg', input_label_cols=['readmitted_30d'])
AS
SELECT
    p.age_at_admission,
    p.primary_diagnosis_code,
    p.discharge_disposition,
    e.prior_admissions_12mo,
    e.length_of_stay_days,
    e.readmitted_30d
FROM `clinical.patient_encounters` e
JOIN `clinical.patient_demographics` p ON e.patient_id = p.patient_id
WHERE e.admission_date >= '2022-01-01'

Looks reasonable. The problem is that patient_encounters is partitioned by admission_date and clustered on patient_id, but the BQML job runs as a full table scan because the query optimizer doesn't always push partition filters through the JOIN path in training queries. On a 2 TB encounters table, that's a $12.50 charge per training run. Run it weekly as the model retrains on new data, and you're at $650/year for a single model — and this organization has fourteen predictive models.

The fix: build a partitioned, incrementally maintained feature table first, then train against that bounded dataset. This separates the scan cost from the training cost, makes the scan explicit and optimizable, and avoids the strict limitations and maintenance overhead of BigQuery materialized views with joins.

-- Build a curated, partitioned feature table
CREATE OR REPLACE TABLE `clinical.readmission_features`
PARTITION BY admission_date
CLUSTER BY patient_id AS
SELECT
  e.admission_date,
  e.patient_id,
  p.age_at_admission,
  p.primary_diagnosis_code,
  p.discharge_disposition,
  e.prior_admissions_12mo,
  e.length_of_stay_days,
  e.readmitted_30d
FROM `clinical.patient_encounters` e
JOIN `clinical.patient_demographics` p
  USING (patient_id)
WHERE e.admission_date >= DATE '2022-01-01';

-- Training now scans the curated feature table, not the raw history
CREATE OR REPLACE MODEL `clinical.readmission_risk_v3`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['readmitted_30d']
) AS
SELECT
  age_at_admission,
  primary_diagnosis_code,
  discharge_disposition,
  prior_admissions_12mo,
  length_of_stay_days,
  readmitted_30d
FROM `clinical.readmission_features`
WHERE admission_date >= DATE '2022-01-01';

This single change reduced the weekly training cost for this particular model from $12.50 to under $2.00 — because the feature table pipeline only processes the partitions that have new data since the last refresh, and the model training only scans the curated features.
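
The incremental refresh itself can be as simple as rewriting only the recent partitions. A sketch, with the 35-day window as an assumed lookback that covers late-arriving encounters:

DECLARE refresh_from DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 35 DAY);

-- Rewrite only the recent partitions of the feature table
DELETE FROM `clinical.readmission_features`
WHERE admission_date >= refresh_from;

INSERT INTO `clinical.readmission_features`
SELECT
  e.admission_date,
  e.patient_id,
  p.age_at_admission,
  p.primary_diagnosis_code,
  p.discharge_disposition,
  e.prior_admissions_12mo,
  e.length_of_stay_days,
  e.readmitted_30d
FROM `clinical.patient_encounters` e
JOIN `clinical.patient_demographics` p USING (patient_id)
WHERE e.admission_date >= refresh_from;   -- partition filter keeps the scan bounded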

Cost Trap 2: Partition Pruning Failures on Clinical Data

The most common BigQuery anti-pattern I see in healthcare environments is partition pruning failure on date-partitioned clinical tables. It's particularly insidious because the query looks correct.

The encounter table is partitioned by admission_date. The analyst writes:

SELECT
    patient_id,
    primary_diagnosis_code,
    length_of_stay_days
FROM `clinical.patient_encounters`
WHERE CAST(admission_date AS DATE) >= '2024-01-01'

The CAST() wrapper defeats partition pruning. BigQuery cannot evaluate the partition filter before scanning because the cast is applied per-row, not per-partition. The query scans the entire table. A small change — removing the redundant cast on a column that is already of type DATE — immediately activates partition elimination.

The same failure mode appears with date functions:

-- Defeats pruning:
WHERE EXTRACT(YEAR FROM admission_date) = 2024

-- Activates pruning:
WHERE admission_date BETWEEN '2024-01-01' AND '2024-12-31'

For a 2 TB encounters table queried twenty times a day by clinical analysts, the difference between pruned and unpruned queries is approximately $225/day in scan costs. That's $82,000/year from a single table, and I've seen organizations with dozens of tables in this state.

Cost Trap 3: On-Demand vs. Slot Reservations

Most healthcare analytics teams start on BigQuery's on-demand pricing because it requires no commitment. The total spend is low, the per-query cost is invisible, and the billing model seems straightforward.

As the analytics workload grows — more users, more dashboards, more BQML training jobs — on-demand pricing starts to become expensive and unpredictable. The shift from on-demand to slot reservations (BigQuery Editions) is one of the highest-leverage cost decisions a healthcare data platform team can make, but it requires knowing your actual workload pattern well enough to size the commitment correctly.

The analysis I run before making this recommendation:

SELECT
    EXTRACT(HOUR FROM creation_time) AS hour_of_day,
    COUNT(*) AS query_count,
    SUM(total_slot_ms) / 1000 / 3600 AS slot_hours,
    SUM(total_bytes_billed) / POW(1024, 4) * 6.25 AS on_demand_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND statement_type != 'SCRIPT'
GROUP BY 1
ORDER BY 1

This gives you a per-hour view of slot consumption and equivalent on-demand cost. For most healthcare analytics organizations with 24-hour query activity driven by clinical operations in the US and research teams in other time zones, the breakeven point for a 100-slot Enterprise Edition reservation typically sits around $4,000–$6,000 in monthly on-demand spend. Below that threshold, stay on on-demand. Above it, model the commitment carefully.
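
To make the breakeven concrete, compare the same 30-day window against a hypothetical 100-slot reservation. The $0.06 per slot-hour figure below is the published US pay-as-you-go Enterprise Edition rate at the time of writing; substitute your negotiated or committed rate:

SELECT
    ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS on_demand_cost_usd,
    ROUND(100 * 24 * 30 * 0.06, 2)                          AS hypothetical_100_slot_cost_usd,
    ROUND(SUM(total_slot_ms) / 1000 / 3600, 0)              AS slot_hours_actually_consumed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND statement_type != 'SCRIPT'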

One nuance that matters in healthcare: ML training jobs consume slots in bursts that look very different from steady analytical query traffic. A slot reservation sized for the analytical baseline will run fine most of the time but queue aggressively when training jobs fire. The right answer is usually a baseline reservation for predictable analytical workloads plus on-demand overflow for training bursts — not a single reservation attempting to cover both.

Cost Attribution in Healthcare Environments

The chargeback conversation in healthcare organizations is more complex than in most industries because the cost centers are more numerous and the political stakes of attribution are higher. Clinical informatics, research computing, population health management, specialty pharmacy analytics, and corporate finance analytics all have legitimate claims on the data platform, and they all have separate budget owners.

The tooling to handle this correctly requires:

  1. Label-based attribution on every BigQuery job — team, cost_center, project_code, and data_classification (the last one matters for HIPAA considerations)
  2. Automated label enforcement — labels that have to be set manually will be forgotten; they need to be injected at the pipeline level or caught by a policy that fails jobs without required labels
  3. A reporting layer that translates BigQuery billing export data into per-department spend, with drill-down to the job level for the month-end chargeback conversation (a minimal version is sketched just below)
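
As a sketch of that third requirement, assuming the detailed billing export lands in a dataset you control and that jobs carry a cost_center label (the table name pattern and label key here are placeholders):

SELECT
    invoice.month                           AS invoice_month,
    (SELECT l.value FROM UNNEST(labels) l
     WHERE l.key = 'cost_center')           AS cost_center,
    service.description                     AS service,
    ROUND(SUM(cost), 2)                     AS gross_cost_usd
FROM `billing_admin.gcp_billing_export_resource_v1_XXXXXX`
WHERE service.description LIKE 'BigQuery%'
GROUP BY 1, 2, 3
ORDER BY 1, 4 DESC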

The data_classification label deserves specific attention. Before you pipe BigQuery job metadata to any external cost reporting tool, you need to understand what's in the query text and execution metadata. For most analytical queries this is innocuous — aggregate counts, summary statistics. But for ad-hoc queries run by clinical analysts, query text can contain patient identifiers, cohort definitions referencing diagnosis codes, or other protected health information. Your cost attribution infrastructure needs to be evaluated under the same data governance framework as your clinical data infrastructure.


The Dual Skillset Problem

What makes all of this hard is that it genuinely requires two different types of expertise to do well, and they don't naturally coexist in the same person or the same team.

Query performance tuning is a deeply technical skill. Understanding why a BigQuery partition filter is failing requires knowing how the query optimizer evaluates predicates against partition metadata. Understanding why a Databricks Spark job is spilling to remote storage requires reading the physical query plan and understanding how broadcast joins compare to sort-merge joins in terms of memory pressure. These are skills that develop over years of reading execution plans and tuning real production workloads.

FinOps engineering is a different skill. It requires understanding billing models across multiple vendors, building cost attribution infrastructure, having difficult conversations with budget owners about chargeback policies, and designing governance frameworks that work in regulated industries without becoming so restrictive that they impede legitimate analytical work.

The healthcare analytics market is increasingly recognizing that data platform teams need both. The roles being posted now aren't just "senior data engineer" in the traditional sense — they're asking for engineers who can optimize a Spark shuffle operation in the morning and design a tag-based chargeback policy in the afternoon.

That combination is rare. And organizations that have it, or are building toward it, have a significant structural advantage: they fix the right thing first. They tune the queries, let utilization settle into its real baseline, and then right-size the infrastructure against that accurate signal. The teams that don't have this expertise tend to do it backwards — they see high infrastructure costs, provision more compute to alleviate the slowness, and end up with an even larger bill for an even more complicated system.

Healthcare data is too important, and the compute costs are too large, to get that order wrong.


Where to Start

If your team is managing a multi-platform healthcare analytics environment and you're not sure whether your current spend reflects your actual workload requirements, here is where I'd begin:

For Databricks on Azure:

  • Query system.query.history for spilled_remote_bytes > 1073741824 — any query spilling more than 1 GB to remote storage is a high-priority tuning target
  • Check system.billing.usage for DBU consumption by sku_name broken down by hour — are your peaks during scheduled pipeline windows or during interactive hours?
  • Audit notebook cluster attachment — how many notebooks are attached to clusters that have been idle for more than 30 minutes?

For Azure Synapse SQL Pools:

  • Check sys.dm_pdw_request_steps for ShuffleMoveOperation and BroadcastMoveOperation steps — if routine queries spend most of their elapsed time moving rows between distributions, your distribution key strategy needs revisiting
  • Measure segment elimination on your clustered columnstore tables — if routine queries are scanning most rowgroups instead of skipping them, your ordered CCI configuration needs work
  • Is your DWU tier aligned to your actual peak workload, or to the heaviest query you ever ran during initial setup?

For BigQuery on GCP:

  • Pull INFORMATION_SCHEMA.JOBS_BY_PROJECT and look for queries whose total_bytes_processed approaches the full size of the tables they touch despite a date predicate in the WHERE clause — that is usually a partition pruning failure
  • Check your BQML training jobs for full table scans that could be curated feature table references instead
  • Run the hourly slot consumption analysis above and compare against your monthly on-demand bill — the flat-rate breakeven may be closer than you think

The underlying question in all of these cases is the same: are you paying for what you actually need, or are you paying for what a poorly optimized query decided to consume?

In healthcare analytics, getting that answer right matters more than in most industries. The compute budget that's going to unnecessary BigQuery scans is compute budget that isn't going to expanding your real-world evidence analytics program, or improving your readmission prediction models, or accelerating the clinical trial data aggregation pipelines that your research teams are waiting on.

The data you're working with is too important to leave money on the table like this. And the tools to find it are, for the most part, already built into the platforms you're running.


Have you navigated multi-platform cost optimization in a healthcare analytics environment? I'd be particularly interested in hearing how your teams are handling the HIPAA considerations around query metadata in cost monitoring tools — it's one of the most underappreciated constraints in this space.

Tags: FinOps, DataEngineering, HealthcareAnalytics, Databricks, BigQuery, AzureSynapse, CloudCostOptimization, SQL, QueryTuning, HIPAA
