What building HIPAA-compliant lakehouses taught me about real-world encryption

#data #security #cloud #compliance

Eighty-two percent of data breaches in healthcare don't happen because of a sophisticated nation-state actor; they happen because a junior engineer accidentally left an S3 bucket open or pushed a cleartext JSON blob containing social security numbers to a shared staging environment.

We obsess over "zero trust" and "encryption at rest," but we rarely talk about the reality of the data lifecycle. If your lakehouse isn't architected for granular, row-level access control, you aren't HIPAA compliant—you’re just waiting for a forensic audit to end your career.

Most engineers treat AES-256 like a magic wand. They check the box for "Server-Side Encryption" on their S3 buckets and assume they’ve satisfied the Security Rule. They haven't. Compliance isn't about whether the disk is encrypted; it’s about who can see the decrypted contents and where that data shows up in your logs.

The mechanics of the pipeline

When building a lakehouse (think Databricks on Delta Lake or Snowflake), the "Gold" layer is where compliance goes to die. You have clean, joined, enriched data that happens to contain PHI. If IAM roles are your only access control, you are doing it wrong: IAM decides who can read a bucket or prefix, not which rows and columns inside a table a given analyst may see.

You need to implement column-level masking and row-level security (RLS) at the table layer, where the query engine enforces them for every reader. In Databricks, for example, you shouldn't just be granting SELECT on a table. You should be attaching column masks to the columns that contain identifiers.

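Here is roughly what that looks like in Databricks SQL:

-- Allow-list: only members of the named group see the raw value.
-- 'billing_analysts' is an illustrative group name; substitute your own.
CREATE FUNCTION redact_ssn(ssn STRING)
  RETURN CASE WHEN is_member('billing_analysts') THEN ssn
              ELSE '***-**-****' END;

-- Attach the function to the column as a column mask.
ALTER TABLE silver_health_records
  ALTER COLUMN ssn SET MASK redact_ssn;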

This is the baseline. But the real "gotcha" happens when your Spark job kicks in. When you run a df.write.mode("overwrite") operation, Spark creates temporary files in your staging directory. If you aren't careful, these temporary files contain the raw, unmasked data. Even if you have masking on the table, the raw data sits in an S3 prefix that your monitoring tools or data discovery crawlers might index.
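One way to catch this is to scan an object listing for leftover staging paths. The sketch below assumes the job uses Hadoop's default FileOutputCommitter, which stages output under a `_temporary` directory. The function name and sample keys are made up for illustration, and in a real setup the list of keys would come from whatever S3 listing tool you already use.

```python
from typing import Iterable, List

# Directory name used by Hadoop's FileOutputCommitter for staged output.
STAGING_MARKER = "_temporary"


def find_staging_leftovers(keys: Iterable[str]) -> List[str]:
    """Return the object keys that sit under a committer staging directory."""
    leftovers = []
    for key in keys:
        # Compare whole path segments so "my_temporary_table/" is not flagged.
        if STAGING_MARKER in key.split("/"):
            leftovers.append(key)
    return leftovers


if __name__ == "__main__":
    # Hypothetical listing of a gold-layer prefix.
    sample_keys = [
        "gold/patients/part-0000.parquet",
        "gold/patients/_temporary/0/_temporary/attempt_001/part-0000.parquet",
    ]
    for key in find_staging_leftovers(sample_keys):
        print("leftover staging object:", key)
```

Running a check like this after each job, and alerting on any hit, is cheaper than discovering the same files through a data discovery crawler.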

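The fix has two parts. Encrypt what Spark writes to local disk with spark.io.encryption.enabled, which uses a key generated per application for shuffle and spill files, and make sure everything the job writes to S3, staging files included, is encrypted with your KMS key. In your Spark config, you need:

spark.io.encryption.enabled true
spark.hadoop.fs.s3a.encryption.algorithm SSE-KMS
spark.hadoop.fs.s3a.encryption.key <your-kms-key-id>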

Without spark.io.encryption.enabled, your shuffle files—those bits of data written to disk during a join or a sort—are written in plain text. If a node is decommissioned and the underlying EBS volume isn't wiped immediately, you’ve just created a HIPAA violation.
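Settings like these tend to get dropped silently when someone copies a job config, so it can help to fail fast before the job runs. Below is a minimal sketch of such a preflight check. It assumes the config is available as a plain dictionary of strings; the function name is hypothetical, and the KMS key check accepts both the newer S3A property name and the older server-side-encryption one.

```python
from typing import Dict, List

# Settings that must have an exact value (compared case-insensitively).
REQUIRED_VALUES = {
    "spark.io.encryption.enabled": "true",
    "spark.hadoop.fs.s3a.encryption.algorithm": "SSE-KMS",
}

# The KMS key may be configured under either of these property names.
KMS_KEY_SETTINGS = (
    "spark.hadoop.fs.s3a.encryption.key",
    "spark.hadoop.fs.s3a.server-side-encryption.key",
)


def missing_encryption_settings(conf: Dict[str, str]) -> List[str]:
    """Return a human-readable list of problems; empty means the check passed."""
    problems = []
    for key, expected in REQUIRED_VALUES.items():
        actual = conf.get(key, "").strip()
        if actual.lower() != expected.lower():
            problems.append(f"{key} should be {expected!r}, found {actual!r}")
    # At least one of the key properties must be set to a non-empty value.
    if not any(conf.get(name, "").strip() for name in KMS_KEY_SETTINGS):
        problems.append("no KMS key configured: set " + " or ".join(KMS_KEY_SETTINGS))
    return problems


if __name__ == "__main__":
    # Hypothetical config with the shuffle encryption flag forgotten.
    conf = {
        "spark.hadoop.fs.s3a.encryption.algorithm": "SSE-KMS",
        "spark.hadoop.fs.s3a.encryption.key": "example-key-id",
    }
    for problem in missing_encryption_settings(conf):
        print("config problem:", problem)
```

In a PySpark job, the dictionary could be built from `dict(spark.sparkContext.getConf().getAll())`, with the job aborting if the returned list is not empty.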

The tradeoffs nobody mentions

The primary downside of a locked-down, encrypted lakehouse is the "performance tax." Every time you introduce a UDF (User Defined Function) for masking or enforce RLS, you add per-row work that the query optimizer cannot simply plan around.

When you run a SELECT *, the engine has to evaluate the masking function for every single row. If you’re doing a join across a 50TB dataset, the cost of these functions adds up. Your query latency will spike. I’ve seen teams move from a 10-minute job to a 45-minute job just by adding RLS.

Then there is the issue of "key rot." If you’re using AWS KMS, you’re likely using Customer Managed Keys (CMKs). Managing the lifecycle, rotation, and—God forbid—re-encrypting the data when a key is compromised, is a nightmare. If you lose access to the KMS key, your data is effectively incinerated. There is no "I forgot my password" for an encrypted lakehouse.

Also, logging becomes significantly harder. If you are masking data, your logs need to account for why a user saw a masked value versus the raw value. You end up with a massive metadata overhead. You’re no longer just storing the data; you’re storing the audit trail of who requested the data, what their clearance level was, and which specific masking policy was triggered.
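To make that overhead concrete, here is a rough sketch of what one audit record might carry. Every field name here is an assumption chosen for illustration, not a schema from any particular platform. The one deliberate design choice is that the record describes the access decision and never contains the column value itself, so the audit log does not become a second copy of the PHI.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class MaskingAuditEvent:
    """One access decision: who asked, for what, and which policy applied."""
    user: str
    groups: Tuple[str, ...]
    table: str
    column: str
    policy: str
    masked: bool
    timestamp: str


def build_audit_event(user: str, groups: Tuple[str, ...], table: str,
                      column: str, policy: str, masked: bool) -> MaskingAuditEvent:
    # Timestamps are recorded in UTC so events from different regions sort correctly.
    now = datetime.now(timezone.utc).isoformat()
    return MaskingAuditEvent(user, groups, table, column, policy, masked, now)


def to_log_line(event: MaskingAuditEvent) -> str:
    """Serialize the event as one JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


if __name__ == "__main__":
    # Hypothetical user, group, and policy names.
    event = build_audit_event(
        user="analyst@example.com",
        groups=("clinical_research",),
        table="silver_health_records",
        column="ssn",
        policy="redact_ssn",
        masked=True,
    )
    print(to_log_line(event))
```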

When to reach for it (and when not to)

Use granular masking and RLS when you are building a multi-tenant platform. If your lakehouse serves both clinical researchers and internal billing analysts, you have no choice. The billing team needs the SSN; the researcher needs the diagnosis code but shouldn't know the patient's name. In this scenario, the lakehouse is a tool for data minimization, and these features are your primary defense.

Don't use it when you are running a purely internal, high-throughput analytics pipeline where the "users" are just automated microservices. If you are building a feature engineering pipeline for a machine learning model, and the model only needs the anonymized vector, do the masking before the data hits the lakehouse.

Why? Because if you wait until the data is in the lakehouse to mask it, you’ve already failed the principle of "least privilege." The raw, sensitive data is already sitting in your storage layer. If a developer needs to debug the raw data, they’ll have access. Move the transformation upstream. Perform PII/PHI scrubbing in your ingest layer (your Lambda or Fargate tasks) before the data ever touches the bronze table.
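As a rough illustration of what scrubbing at ingest can mean, the sketch below drops direct identifiers and replaces the SSN with a keyed hash before a record is written anywhere. The field names (`patient_name`, `ssn`, `diagnosis_code`) and the function name are assumptions for the example, and the secret would come from a secrets manager rather than source code. Note that a keyed hash is pseudonymization: it keeps records joinable without storing the raw value, but whether it meets your de-identification requirement is a separate question.

```python
import hashlib
import hmac
from typing import Any, Dict

# Hypothetical field lists; adjust to your own schema.
DROP_FIELDS = {"patient_name"}   # removed entirely
TOKENIZE_FIELDS = {"ssn"}        # replaced by a keyed hash


def scrub_record(record: Dict[str, Any], secret: bytes) -> Dict[str, Any]:
    """Return a copy of the record that is safe to land in the bronze table."""
    clean: Dict[str, Any] = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        if field in TOKENIZE_FIELDS:
            if value is None:
                clean[field + "_token"] = None
            else:
                # HMAC-SHA256 gives a stable token per value without exposing it.
                digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
                clean[field + "_token"] = digest.hexdigest()
            continue
        clean[field] = value
    return clean


if __name__ == "__main__":
    raw = {"patient_name": "Jane Doe", "ssn": "123-45-6789", "diagnosis_code": "E11.9"}
    # The secret is inlined only to keep the example self-contained.
    print(scrub_record(raw, secret=b"example-secret"))
```

Because the token is deterministic for a given secret, the same patient still joins across tables downstream, which is usually all a feature pipeline needs.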

The best way to pass a HIPAA audit isn't to build a fancy gatekeeper at the end of the pipeline; it's to ensure the data is effectively neutralized before it enters your environment.

Conclusion

Compliance is often treated as a bureaucratic checkbox, but in the world of high-scale data engineering, it’s a technical constraint. If you treat PHI as just "another string column," you’re setting yourself up for a catastrophic failure.

Focus on the mechanics: encrypt your shuffle, use native masking functions rather than application-level logic, and always, always scrub at the ingest point. The goal is to make the data useless to anyone who doesn't have the explicit, logged, and audited right to see it. If you can do that while keeping your query times under an hour, you're doing better than most of the industry.

Cover photo by Tyler on Unsplash.