Let’s be real for a second: FHIR (Fast Healthcare Interoperability Resources) is absolutely fantastic for transactional data exchange. It’s the API glue that holds modern healthcare apps together.
But if you are a Data Engineer tasked with getting that data into a Data Warehouse (BigQuery, Snowflake, Redshift) for your analytics team? FHIR is a nightmare.
I learned this the hard way about three years ago. I thought, "It's just JSON, right? Modern data warehouses handle JSON natively!"
Boy, was I wrong. Here is why simple JSON extraction fails at scale, and how we actually solve the problem of flattening these deeply nested structures without losing our minds.
The Problem: It's a Graph, Not a Table
The issue isn't that FHIR is JSON. The issue is that FHIR resources are deeply nested, recursive, and polymorphic.
Take a simple Patient resource. In a SQL table, you’d expect a column for Name. In FHIR? Name is an array of objects.
- You have an "official" (legal) name.
- You have a "maiden" name.
- You have a "nickname".
Each of those objects contains a given array (first name, middle name) and a family string.
Here is a stripped-down example:
{
  "resourceType": "Patient",
  "id": "example",
  "name": [
    {
      "use": "official",
      "family": "Chalmers",
      "given": ["Peter", "James"]
    },
    {
      "use": "nickname",
      "given": ["Jim"]
    }
  ],
  "identifier": [
    {
      "system": "http://hospital.org/mrns",
      "value": "12345"
    }
  ]
}
If you just toss this into BigQuery and try to query name.family, you get null or an error because it's inside an array.
The "Explosion" Problem
To analyze this in SQL, you usually have to UNNEST or LATERAL VIEW EXPLODE these arrays.
If a Patient has 2 names, 3 identifiers, and 5 extensions, and you try to flatten this into a single wide row without being careful, you trigger a Cartesian Product.
2 * 3 * 5 = 30 rows for a single patient.
Now imagine doing this for an ExplanationOfBenefit resource with hundreds of line items and diagnosis codes. Your 1 million record dataset suddenly becomes 100 million rows of duplicate garbage.
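To make that multiplication concrete, here is a minimal pandas sketch. The record below is a dummy stand-in, not real FHIR, but exploding three array columns in sequence turns one patient into exactly 2 * 3 * 5 = 30 rows:

import pandas as pd

# One toy patient with 2 names, 3 identifiers, and 5 extensions (dummy values)
patient = {
    "id": "example",
    "name": ["official", "nickname"],
    "identifier": ["mrn", "ssn", "payer-id"],
    "extension": ["race", "ethnicity", "birthsex", "language", "citizenship"],
}

df = pd.DataFrame([patient])

# Each explode() multiplies the row count by the length of that array
exploded = df.explode("name").explode("identifier").explode("extension")
print(len(exploded))  # 30 rows for a single patient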
Strategy 1: The "Good Enough" Extraction
If your analysts only care about the Official name and the MRN, don't try to genericize the flattening. Hard code the path.
In Python/Pandas, this often looks like writing specific lambdas. It's brittle, but it works for MVPs.
import pandas as pd

# Assume 'data' is your list of dicts (one raw Patient resource per element)
df = pd.DataFrame(data)

def get_official_family_name(patient_row):
    # 'name' is a list of HumanName objects; pick the one flagged as official
    names = patient_row.get('name', [])
    for n in names:
        if n.get('use') == 'official':
            return n.get('family')
    return None

# Apply it row by row
df['official_last_name'] = df.apply(get_official_family_name, axis=1)
This is okay for small scripts, but it doesn't scale well in an ETL pipeline processing terabytes of data.
Strategy 2: Array-aware Columnar Formats (The Real Fix)
The modern approach—and what I usually recommend—is to stop trying to force FHIR into a CSV-style flat table immediately. Instead, move it to a columnar format that supports nested structures, like Parquet.
Tools like Apache Spark or Databricks are great here because they preserve the schema.
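As a minimal sketch of that step (the staging path is a placeholder; the input path reuses the one from the example further down), reading the raw newline-delimited JSON with Spark and writing it straight to Parquet keeps name, identifier, and friends as nested structs and arrays rather than stringified JSON:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fhir-to-parquet").getOrCreate()

# Spark infers a nested schema: name is array<struct<...>>, identifier is array<struct<...>>
raw = spark.read.json("s3://bucket/fhir/Patient/")
raw.printSchema()

# Parquet preserves that nested schema, so nothing is lost before you decide how to flatten
raw.write.mode("overwrite").parquet("s3://bucket/staging/Patient/")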
However, eventually, you do need to flatten it for the Tableau/PowerBI crowd. The trick is to create Satellite Tables.
Instead of one giant Patient table, you create:
- Patient_Core(id, birthdate, gender)
- Patient_Names(id, use, family, given)
- Patient_Identifiers(id, system, value)
This normalizes the data. It's basically reverse-engineering the JSON back into a relational 3rd Normal Form.
Here is a quick pseudo-code example of how you might handle this transformation logic in a data pipeline:
# Pseudo-code for a Spark transformation
from pyspark.sql.functions import col, explode

# Read raw FHIR JSON
df = spark.read.json("s3://bucket/fhir/Patient/")

# Create the Names table
df_names = df.select(
    col("id").alias("patient_id"),
    explode(col("name")).alias("name_struct")
).select(
    "patient_id",
    "name_struct.use",
    "name_struct.family",
    "name_struct.given"  # Note: given is still an array here!
)

# Now you have a clean linkable table without row explosion on the main ID
df_names.write.parquet("s3://bucket/analytics/patient_names/")
Context is King
The biggest headache isn't just the structure; it's the loss of context.
When you flatten Observation.component, you might get values of 80 and 120. But which is Systolic and which is Diastolic? In the JSON, they are sibling objects. If you flatten them blindly into two rows, you have to ensure you carry the code field with the value field.
Always flatten the object containing the value and the label together. Never flatten them independently, or you'll never know which number belongs to which label.
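Here is a sketch of that rule in Spark. The bucket path is a placeholder and the field paths assume the standard Observation shape with valueQuantity components (adjust for your data): explode component once, then pull the code and the value out of the same exploded struct so they always stay paired.

from pyspark.sql.functions import col, explode

# Each exploded row is one component: its code and its value travel together
obs = spark.read.json("s3://bucket/fhir/Observation/")

obs_components = obs.select(
    col("id").alias("observation_id"),
    explode(col("component")).alias("comp")
).select(
    "observation_id",
    col("comp.code.coding")[0]["code"].alias("component_code"),  # e.g. systolic vs diastolic LOINC code
    col("comp.valueQuantity.value").alias("component_value")     # e.g. 120 vs 80
)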
Conclusion
FHIR is here to stay, and honestly, the schema is brilliant for clinical accuracy. It just requires a shift in mindset for Data Engineering. We have to move away from "One Big Table" and get comfortable with handling arrays and structs as first-class citizens in our pipelines.
If you are struggling with complex data architectures or just want to see how others are tackling these specific ETL challenges, check out my other articles. I've documented a few more of these "late night debugging" discoveries there.
Happy coding, and may your JSON always be valid!