DEV Community

Jakkie Koekemoer
Dynamic Data Masking: Use Cases, Limitations, and What to Do Instead

Let’s imagine two everyday scenarios. A customer support agent pulls up an order record. They need the order history, but there's no reason for them to see the full credit card number. A developer is debugging a user profile issue, but shouldn't have access to the customer's actual email address.

The common solution for this is Dynamic Data Masking (DDM). Instead of the real credit card number, the database shows ****-****-****-3456. Instead of the real email, it shows j***@email.com.

You can think of DDM as a filter sitting between your database and the user querying it. The underlying data stays untouched; DDM just controls what different users get to see. For customer support teams working with live data, it works well since you don't want them to see the sensitive details. But for engineering and development teams, it tends to create more problems than it solves, because masked data can break query logic, skew test results, and make it harder to reproduce real bugs accurately. If you're deciding how to protect sensitive data in your development workflow, understanding that distinction matters a lot.

What is Dynamic Data Masking?

Dynamic Data Masking modifies data at query time based on your DB policies. The underlying data stays intact on disk. When someone queries a table, the database engine intercepts the SELECT and rewrites the output before returning results.

The diagram below shows this flow:

Dynamic Data Masking

Microsoft SQL Server implements DDM using column-level masking functions defined in the schema:

```sql
-- Define masked columns directly in schema
CREATE TABLE customers (
    id INT PRIMARY KEY,
    email VARCHAR(50) MASKED WITH (FUNCTION = 'email()'),
    credit_card VARCHAR(16) MASKED WITH (FUNCTION = 'partial(0,"XXXX-XXXX-XXXX-",4)')
);

-- Grant unmask permission for specific roles
GRANT UNMASK ON customers(email) TO support_role;
```

PostgreSQL Anonymizer uses security labels to define rules:

```sql
-- Requires the PostgreSQL Anonymizer extension installed and enabled
-- Mark columns with masking functions
SECURITY LABEL FOR anon ON COLUMN customer.email
  IS 'MASKED WITH FUNCTION anon.fake_email()';

-- Enable transparent masking for specific roles
SECURITY LABEL FOR anon ON ROLE analyst IS 'MASKED';
```

There are three common patterns for DDM masking:

  • Full masking replaces the entire value — so john.doe@email.com becomes NULL or *****.
  • Partial masking hides only the sensitive part — so 1234-5678-9012-3456 becomes xxxx-xxxx-xxxx-3456.
  • Randomization swaps the real value with fake but realistic-looking data, so "John Smith" might become "Robert Johnson", from a dictionary of fake names.

It's important to understand that your database still runs every query against the real, unmasked data. JOINs, aggregations, and WHERE filters all operate on the actual values in your tables. Masking is applied only at the very end, just before the results are returned to the user.
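You can simulate this behavior in a few lines. The sketch below is a toy model using sqlite3, not real DDM (which is enforced inside the database engine), but it captures the key mechanic: the WHERE clause runs against real values, and masking touches only the output.

```python
# Toy simulation of DDM semantics: filtering happens on real data,
# masking happens only on the result set on the way out.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Jane Doe", 100000), ("John Roe", 85000)])

def masked_query(sql, params=()):
    # The engine evaluates WHERE/JOIN/ORDER BY against the real values...
    rows = conn.execute(sql, params).fetchall()
    # ...and the mask (salary -> 0) is applied only to the returned rows.
    return [(name, 0) for name, _salary in rows]

# The filter matches Jane's real salary even though the output shows 0.
print(masked_query("SELECT name, salary FROM employees WHERE salary > 99999"))
# → [('Jane Doe', 0)]
```

This is exactly the gap inference attacks exploit: the predicate leaks information that the masked output tries to hide.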

Dynamic vs. Static Data Masking

With Dynamic Data Masking, the original data stays untouched in your database. Masking happens on the fly when data is retrieved, and what users see depends on their permissions. On the other hand, Static Data Masking works differently: it permanently transforms the data itself, creating a separate sanitized copy of the database where sensitive values are replaced with fake or masked data.

That architectural difference determines which approach fits your situation. The diagram below shows both:

Dynamic vs. Static Data Masking

Dynamic Data Masking transforms data in flight

Advantages: You're always working with current production data. There's no extra storage needed, and you maintain a single source of truth.

Disadvantages: It adds runtime CPU overhead, typically cited at 2-10% per query. Managing role-based access control can get complex. And it's fundamentally vulnerable to inference attacks, where someone deduces sensitive information by analyzing patterns in the data they can see.

Best for: Production operational access, customer support dashboards, and read-only reporting where users can't write their own arbitrary queries.

Static Data Masking transforms data at rest

Advantages: No runtime overhead once the initial transformation is done. It's immune to inference attacks since the sensitive data is replaced entirely. And it's safe to share freely with developers.

Disadvantages: Historically, creating a full masked copy of your database could take hours. The data starts going stale the moment the copy is created. And setting up the ETL pipeline adds its own layer of complexity.

Best for: Development environments, staging databases, testing, and QA. Basically any scenario where developers need write access or the ability to run ad hoc queries.

Static masking has long had a staleness problem though. If generating a masked copy of a 200GB database takes 4 hours, your developers are waiting half a day just to get test data. And by the time that copy is ready, production has already moved on. That delay is a big reason why many teams turned to DDM for development workflows, even knowing its security tradeoffs.

The Problem with DDM for Developers

Here's what Microsoft's documentation warns about DDM explicitly:

"Dynamic data masking (DDM) doesn't aim to prevent database users from connecting directly to the database and running exhaustive queries that expose pieces of the sensitive data."

Oracle's security guide says the same thing:

"Oracle Data Redaction is not intended to protect against attacks by regular and privileged database users who run ad hoc queries directly against the database."

The core issue is that WHERE clauses always run against the real, unmasked values. Only the output the user sees gets masked. That gap is exactly what makes inference attacks possible, and the diagram below shows how they work:

DDM for Developers

To see this in practice, consider a salary column masked to show zero. You run:

```sql
SELECT name, salary
FROM employees
WHERE salary > 99999 AND salary < 100001;
```

The result returns "Jane Doe" with her salary showing as 0. But you already know what that means: Jane earns $100,000. Take this further with a binary search approach, and just 20 queries are enough to pinpoint any value within a million-dollar range down to the exact dollar.
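That binary search is easy to sketch. Below, `exists_between` is a hypothetical stand-in for a range query like `SELECT 1 FROM employees WHERE name = 'Jane Doe' AND salary BETWEEN ? AND ?`, which returns a (masked) row whenever the real value falls in the range:

```python
# Sketch of the binary-search inference attack against a masked column.
SECRET_SALARY = 100000  # the value the attacker is trying to recover

def exists_between(lo, hi):
    # Stand-in for a range query; DDM evaluates this against the real value.
    return lo <= SECRET_SALARY <= hi

def infer_value(lo=0, hi=1_000_000):
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if exists_between(lo, mid):
            hi = mid          # value is in the lower half
        else:
            lo = mid + 1      # value is in the upper half
    return lo, queries

value, queries = infer_value()
print(value, queries)  # recovers 100000 in ~20 queries
```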

String columns leak through character-by-character extraction:

```sql
-- Find first character
SELECT name FROM customers WHERE email LIKE 'a%';  -- No results
SELECT name FROM customers WHERE email LIKE 'b%';  -- No results
SELECT name FROM customers WHERE email LIKE 'j%';  -- Returns "Customer 1"

-- Find second character
SELECT name FROM customers WHERE email LIKE 'ja%'; -- No results
SELECT name FROM customers WHERE email LIKE 'jo%'; -- Returns "Customer 1"
```

Recovering a 24-character email this way takes only a few dozen probes per character, on the order of a thousand queries in total, and scripts can easily automate it.
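The automation is trivial. Here `prefix_matches` is a hypothetical stand-in for the `LIKE 'prefix%'` query above, which runs against the real value and so leaks one character per confirmed probe:

```python
# Sketch of character-by-character extraction through LIKE-prefix probing.
import string

SECRET_EMAIL = "jane.doe@email.com"  # hypothetical target value
CHARSET = string.ascii_lowercase + string.digits + ".@_-"

def prefix_matches(prefix):
    # Stand-in for: SELECT 1 FROM customers WHERE email LIKE prefix || '%'
    return SECRET_EMAIL.startswith(prefix)

def extract(length):
    recovered, probes = "", 0
    for _ in range(length):
        for ch in CHARSET:
            probes += 1
            if prefix_matches(recovered + ch):
                recovered += ch   # one more character confirmed
                break
    return recovered, probes

email, probes = extract(len(SECRET_EMAIL))
print(email, probes)  # full email recovered in a few hundred probes
```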

Aggregate functions disclose data regardless of masking:

```sql
-- Both return the actual sum, even though individual salaries show as 0
SELECT SUM(salary) FROM employees WHERE department = 'Engineering';
SELECT AVG(salary) FROM employees WHERE job_title = 'Senior Engineer';
```

ORDER BY reveals relative rankings even when values display as masked. Someone with the highest salary appears first in ORDER BY salary DESC regardless of what the salary column shows.

This isn't a bug; it's a documented design decision. DDM is built to prevent accidental exposure in controlled, read-only contexts. It was never meant to protect against users who can query the database directly. And since developers write and run SQL queries against database tables, DDM simply can't protect sensitive data from them.

Beyond DDM: Copy-on-Write Branching

Copy-on-write branching is a storage technique that creates instant snapshots of your database. Instead of duplicating the entire dataset upfront, it only copies data when it's actually modified. This solves the staleness problem by making a "static" copy available instantly. Tools like Xata implement this at the database level, giving you the security benefits of static masking with the speed that makes developers actually use it.

The diagram below shows this approach:

Copy-on-Write Branching

Here's how it works: Creating a branch instantiates a logical copy that initially shares all physical data pages with its parent. No data moves, so even a 200GB database branches in about 30 seconds, and the time doesn't grow with size. When you modify data in the branch, only the changed pages consume additional storage.
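The page-sharing mechanic can be sketched in a few lines. This is a toy model of copy-on-write, not Xata's actual storage engine:

```python
# Minimal copy-on-write sketch: a branch shares its parent's pages until a
# write, at which point only the touched page gets its own copy.
class Branch:
    def __init__(self, pages):
        self.pages = pages            # page_id -> data (shared references)

    def branch(self):
        # O(1) in data size: copy the page *table*, not the pages themselves
        return Branch(dict(self.pages))

    def write(self, page_id, data):
        self.pages[page_id] = data    # new object; parent's page is untouched

main = Branch({1: "row data A", 2: "row data B"})
dev = main.branch()                   # instant, no data copied
dev.write(1, "row data A'")           # only page 1 diverges

print(main.pages[1], dev.pages[1])    # → row data A row data A'
```

Unmodified pages (like page 2 above) remain physically shared between parent and branch, which is why branch creation costs almost nothing in time or storage.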

A secure development workflow looks like this:

  1. Create a "golden" parent branch from production
  2. Apply static masking to this parent (happens once)
  3. Create child branches from the masked parent for developers or CI/CD
  4. Developers get full read/write access to realistic data
  5. Delete branches when done, and storage is reclaimed immediately

Developers get to work with actual production data patterns. Foreign keys work. Query performance matches production. Edge cases surface during testing instead of production incidents. But no unmasked sensitive data exists anywhere in the non-production environment.

The security gain: Inference attacks fail with this approach because sensitive data is physically absent. No cleartext values exist to infer through WHERE clauses, JOINs, or backup files. That's a meaningful improvement over DDM, where the database always contains unmasked data regardless of access controls.

Xata's data anonymization integrates directly into this branching workflow:

```bash
# Create anonymized branch from production
xata branch create dev-branch --from main --anonymize

# Branch ready in seconds with:
# - All PII masked deterministically (referential integrity preserved)
# - Realistic data distributions maintained
# - Full read/write access for testing
```
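Deterministic masking is what keeps referential integrity intact: the same input always maps to the same fake value, so foreign-key joins across tables still line up. Here's a generic sketch of the idea (not Xata's actual implementation; `mask_email` and `MASK_KEY` are made up for illustration):

```python
# Generic deterministic masking sketch: keyed hashing means the same real
# email always produces the same token, so joins across tables still work.
import hmac, hashlib

MASK_KEY = b"rotate-me"  # hypothetical secret, kept out of non-prod configs

def mask_email(email):
    digest = hmac.new(MASK_KEY, email.encode(), hashlib.sha256).hexdigest()[:12]
    return f"user_{digest}@example.com"

orders = [("order-1", "jane@email.com"), ("order-2", "jane@email.com")]
masked = [(order_id, mask_email(email)) for order_id, email in orders]

# Both orders still point at the same (masked) customer.
print(masked[0][1] == masked[1][1])  # → True
```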

Where this approach proves most useful

Let's say a production bug is impacting three specific customers. With DDM, handing developers access to investigate isn't really a safe option. But with copy-on-write branches, you can spin up an anonymized copy with just the relevant records in about 30 seconds. Developers can dig into real, realistic data, and once they're done, the branch is simply deleted. Production stays off-limits and no PII ever gets exposed.

Where traditional static masking used to mean waiting through 4-hour ETL jobs, modern copy-on-write branching gets you a fully masked environment in half a minute. You get the security guarantees of physically transforming the data, at a speed that developers will actually enjoy working with.

Best Practices for Implementation

Step 1: Identify your PII

Start by cataloging every column that holds sensitive data. This includes the obvious ones like names, emails, phone numbers, addresses, payment details, and health records, but also any regulatory-defined identifiers. Don't overlook derived fields either: usernames sometimes embed real names, and session logs can contain PII buried in URLs.
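As a rough first pass, a name-based scan can surface candidate columns. This is only a heuristic sketch (the pattern list and column names are illustrative), and every match should be confirmed against actual data:

```python
# Hypothetical first-pass PII scanner: flag column names that look sensitive.
import re

PII_PATTERNS = re.compile(
    r"(email|phone|ssn|dob|birth|address|card|cc|name|passport|ip_addr)",
    re.IGNORECASE)

columns = ["id", "user_email", "shipping_address", "created_at", "cc_number"]
flagged = [col for col in columns if PII_PATTERNS.search(col)]
print(flagged)  # → ['user_email', 'shipping_address', 'cc_number']
```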

For a comprehensive definition of what counts as PII, GDPR is a good reference. It covers genetic data, biometric data, location data, and online identifiers like IP addresses and cookies. The CCPA goes further, extending coverage to commercial information and internet activity.

Step 2: Define roles and access patterns

Think through which users actually need to see which data. Support agents might need the last four digits of a credit card for verification, but never the full number. Analysts need aggregated statistics, not individual customer records. Developers need data that looks and behaves realistically, but not real customer information.

A useful way to think about this: separate your read-only operational users like support staff and analysts from your query-capable users like developers and data scientists. DDM can work reasonably well for the first group. For the second group, it's not a safe option.

Step 3: Choose the right tool

The flowchart below shows how to choose between static and dynamic masking for protecting sensitive data:

Choose the right tool

Use DDM for production operational access when:

  • Users only interact with data through controlled applications
  • There's no ability to run ad hoc queries
  • Read-only access is all that's needed
  • You can audit every data access request

Use anonymized branches for development when:

  • Developers need to write and run their own queries
  • Write access is required
  • Integration testing needs referential integrity to hold up
  • You want to remove any possibility of inference attacks

You can also use both together. DDM handles protection on the production side for operational users, while copy-on-write branches give developers and staging environments a safe place to work. These two masking approaches solve two different problems.

Step 4: Set up security monitoring

Keep an eye on data access patterns so you can catch inference attacks early. Some things worth watching for:

  • Repeated similar queries where the WHERE clause value keeps incrementing
  • A high volume of queries that each return just a single row
  • Aggregate queries that target unusually small subsets of data
  • Pattern matching queries with systematic character variations

PostgreSQL's audit extensions and AWS RDS audit logs can both help you capture and analyze these patterns.

Conclusion

Dynamic Data Masking (DDM) has a clear and specific purpose: keeping sensitive data from being accidentally exposed in read-only, application-controlled production scenarios. So it is ideal for customer support dashboards, operational reporting, or any context where users interact with data through fixed interfaces and can't run their own queries.

Where it falls short is with developers. Because developers write and run queries directly, WHERE clauses, aggregations, and inference attacks can pull out masked values no matter how DDM is configured. Development environments need write access and query freedom, which means DDM is a poor fit there by design.

Real security for development workflows means giving developers safe data to work with, not just hiding unsafe data behind a filter. Storage-layer approaches that combine copy-on-write branching with static masking give you both: the data is physically transformed so it's genuinely secure, and masked environments are ready in about 30 seconds. Traditional static masking had the security part right but killed developer velocity with hours-long ETL jobs. Modern branching tools solve both sides of that tradeoff, and if your team is still leaning on DDM to protect development environments, they're worth a serious look.
