Manuel Weiss
Data Pseudonymization: When You Can't Just Delete Everything

Here's a problem I run into all the time: you need to track that "cust_47832" who made a purchase today is the same "cust_47832" who signed up last year, but you can't actually know they're Sarah Chen from Portland. Plain anonymization (stripping all identifying details so a person can never be traced back) falls short here. Sometimes you need to protect data without making it completely untraceable.

GDPR Article 4(5) defines pseudonymization as processing personal data so it can't be attributed to a specific person without additional information: typically, a separate key you keep locked away. This matters in practice: consider longitudinal medical studies tracking patient outcomes over years, fraud detection systems flagging suspicious patterns across transactions, or SaaS platforms analyzing feature usage while keeping customer identities protected.

The core distinction is this: anonymization makes re-identification impossible, while pseudonymization keeps a reversible link. That reversibility is exactly the point: pseudonymization preserves utility (you can still analyze patterns) and linkability (you can connect the same data point across time). Anonymization, on the other hand, prioritizes safety: data becomes genuinely unidentifiable and steps outside strict privacy regulations. Pseudonymized data still counts as personal data under GDPR; anonymized data doesn't. If you get this wrong, you're either over-protecting data (creating friction for your team) or under-protecting it (risking regulatory action).

Two Ways to Pseudonymize Data

Tokenization: The Vault Approach

With tokenization, you swap sensitive values for random tokens and store the mapping in a secure vault. The diagram below depicts this process:

Tokenization: The Vault Approach

Customer ID "cust_47832" (say: Sarah Chen) becomes the token "tok_9x4k2m1p" everywhere. However, only your secure vault knows this mapping.

Companies like VGS and Skyflow have built entire businesses around this, charging premium prices because token vaults are genuinely complex to operate. You need high-availability storage, strict access controls, audit logging, and key rotation procedures just to keep them running reliably.

For payment processing, that complexity is worth it. It's the industry standard. Your credit card number becomes "tok_visa4532" everywhere except at the payment gateway. For general application data, though, the operational burden often outweighs the benefits.
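To make the vault pattern concrete, here's a minimal in-memory sketch in Python. The `tok_` format and the `TokenVault` class are illustrative only, not the actual APIs of VGS, Skyflow, or any real vault product:

```python
import secrets

class TokenVault:
    """Illustrative in-memory token vault: swaps sensitive values
    for random tokens and keeps the mapping private."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse an existing token so the same value always maps
        # to the same pseudonym (preserves linkability).
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only code with vault access can reverse the mapping.
        return self._token_to_value[token]

vault = TokenVault()
token = vault.tokenize("cust_47832")
same_token = vault.tokenize("cust_47832")  # identical on repeat calls
```

A production vault layers the hard parts on top of this toy: highly available storage for the mapping, access controls and audit logs around `detokenize`, and rotation procedures.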

Encryption: When You Need Reversibility

With encryption, you transform data using a cryptographic key and can decrypt it back to the original value whenever needed. That makes encryption a natural fit when you need to pseudonymize data but still want the option to retrieve real identities. The flowchart below shows this process:

Encryption: When You Need Reversibility

PostgreSQL's pgcrypto extension supports symmetric encryption:

CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Encrypt
SELECT pgp_sym_encrypt('sarah.chen@example.com', 'secret_key');
-- Produces: \xc30d04070302...

-- Decrypt (reversible)
SELECT pgp_sym_decrypt(
    pgp_sym_encrypt('sarah.chen@example.com', 'secret_key'),
    'secret_key'
);
-- Returns: sarah.chen@example.com

The encrypted output's consistency depends on the mode you use. With a random initialization vector (a random value mixed into encryption to ensure unique outputs), the same input produces different ciphertext each run. With deterministic modes like AES-SIV, the same input always produces the same output, which is often preferable for pseudonymization since you need consistent, linkable identifiers across datasets.
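The contrast is easy to demonstrate. The Python sketch below is a toy, not real AES-SIV: the keyed HMAC stands in for a deterministic mode, and the random-prefix XOR stands in for encryption with a random IV (don't use it for actual security):

```python
import hashlib
import hmac
import os

KEY = b"secret_key"  # in production, fetch from a KMS, never hardcode

def deterministic_pseudonym(value: str) -> str:
    # Keyed hash: same input + same key -> same output every run,
    # so records stay linkable across datasets.
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def randomized_ciphertext(value: str) -> bytes:
    # Toy stand-in for random-IV encryption: the random prefix makes
    # every run's output different, even for identical input.
    iv = os.urandom(12)
    keystream = hashlib.sha256(KEY + iv).digest()
    body = bytes(b ^ k for b, k in zip(value.encode(), keystream))
    return iv + body

email = "sarah.chen@example.com"
assert deterministic_pseudonym(email) == deterministic_pseudonym(email)
assert randomized_ciphertext(email) != randomized_ciphertext(email)
```

The two asserts capture the tradeoff: deterministic output is linkable (good for pseudonyms, but it leaks equality of plaintexts), while randomized output hides even that.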

That key becomes your single point of control. Lose it and you can't decrypt. Expose it and anyone can. Tools like AWS KMS and HashiCorp Vault solve this at enterprise scale, though they do add infrastructure complexity.

GDPR Article 32 lays out three requirements for compliant pseudonymization:

  • Modifying data to prevent direct attribution
  • Keeping the reversal mechanism (keys or tokens) physically separate from the pseudonymized data
  • Applying technical and organizational measures to prevent unauthorized re-identification

Both tokenization and encryption can meet these requirements, but whether you actually achieve compliance comes down to the implementation details.

When Anonymization Doesn't Work

Tracking Users Over Time

Say you need to measure churn in your system: what percentage of users who signed up in January 2025 are still active in January 2026?

Full anonymization breaks this kind of analysis. If you strip all identifiers or randomize user IDs, you lose the ability to recognize that "anon_123" from January 2025 is the same person as "anon_456" in January 2026. Without that linkability (the ability to connect the same user across different points in time), retention metrics become impossible to calculate. You can count active users each month, but you can't track who actually stayed and who left.

Pseudonymization solves this. User "cust_47832" always maps to the same pseudonym "pseudo_9x4k2m1p" across all time periods, so you can track that this specific user remained active from January 2025 to January 2026 and measure retention accurately, without ever knowing their real identity.

The image below highlights this distinction between anonymization and pseudonymization:

Tracking Users Over Time

However, it's worth being realistic about the limits of anonymization. Research published in Nature Communications found that 99.98% of Americans can be re-identified using just 15 demographic attributes. If your "anonymized" analytics data still includes age, location, and behavior patterns, it probably isn't as anonymous as you think.

Keeping Identity Consistent Across Systems

Consider this example: your billing system charges customer "cust_47832" $99 monthly, your CRM tracks their support tickets, and your analytics warehouse measures their feature usage. All three systems need to reference the same person.

Keeping Identity Consistent Across Systems

Random anonymization breaks foreign key relationships (the links between related records across tables) across your database. Deterministic anonymization solves this using consistent hashing (a method that always produces the same output for the same input): every system transforms "cust_47832" into the same pseudonym "pseudo_9x4k2m1p" using the same algorithm and key, preserving referential integrity without exposing real identities.

Xata's deterministic transformers effectively implement this at the database layer:

-- Every database branch uses the same transformation
SELECT anon.hash('cust_47832');
-- Always returns: pseudo_9x4k2m1p
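The same idea in Python: as long as every system applies the same algorithm with the same shared key, each can compute the pseudonym independently and they all agree. The key name and customer ID below are illustrative:

```python
import hashlib
import hmac

SHARED_KEY = b"shared_pseudonymization_key"  # illustrative; distribute via a KMS

def pseudonymize(customer_id: str) -> str:
    # Consistent keyed hashing: same input + same key -> same pseudonym.
    return "pseudo_" + hmac.new(SHARED_KEY, customer_id.encode(),
                                hashlib.sha256).hexdigest()[:8]

# Billing, CRM, and analytics each transform the ID independently...
billing = pseudonymize("cust_47832")
crm = pseudonymize("cust_47832")
analytics = pseudonymize("cust_47832")

# ...and all three agree, so records still line up across systems.
assert billing == crm == analytics
```

No coordination beyond sharing the key is needed; foreign keys and cross-system joins keep working because every system derives the identical pseudonym.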

Security Investigations

Imagine this: someone just accessed 10,000 customer records within your system in 30 seconds. Is it a breach? A compromised account? Or a legitimate bulk operation? You need to figure that out quickly, and that means tracing which account did it and what they've been up to recently.

With fully anonymized data, you simply can't do that. "User anon_xyz789 accessed records" doesn't help you much. You can't identify the account, notify the user, or investigate their history. Pseudonymized data gives you a way out: authorized security personnel can reverse "pseudo_9x4k2m1p" back to "cust_47832" using the decryption key or token vault, and the investigation can actually move forward.

This isn't just good practice. The NIST Cybersecurity Framework calls for organizations to be able to identify and respond to security events effectively, and irreversible anonymization directly undermines that.

The Legal Risk of Pseudonymized Data

Here's the most important thing to understand: pseudonymized data is still personal data under GDPR. Article 4(1) defines personal data as "any information relating to an identified or identifiable natural person." If you can technically re-identify someone using separately stored keys, it still qualifies as personal data, full stop.

That means real obligations follow you:

  • Data subject rights still apply: access requests, deletion requests, and portability requirements are all in force.
  • Security requirements remain: Article 32 mandates encryption, access controls, and audit logging.
  • Cross-border transfer restrictions hold: you can't move pseudonymized EU citizen data to non-EU servers without adequate safeguards.
  • Breach notification rules persist: if pseudonymized data leaks along with the keys, you must notify authorities within 72 hours.

The Massachusetts Governor medical records incident is a good illustration of how badly this can go wrong. A hospital "anonymized" patient data by removing names, but kept ZIP code, birthdate, and sex. A researcher cross-referenced voter rolls and identified the Governor's medical records. Only six people in Cambridge shared his birthday, and only one matched his ZIP code. The hospital believed they'd anonymized the data. Legally and technically, they'd only pseudonymized it poorly.

The trap most teams fall into is treating pseudonymized data as "safe enough" to store on developer laptops, copy into test environments with weaker security, or load into analytics systems without proper access controls. It isn't. Pseudonymized data is still personal data under GDPR, which means you still need encryption at rest, strict access controls, and clear policies governing who can reverse the pseudonymization and under what circumstances.

Properly anonymized data is a different story. GDPR Recital 26 states: "The principles of data protection should not apply to anonymous information." Implement k-anonymity correctly (making each record indistinguishable from at least k-1 others) and add sufficient noise through differential privacy (a mathematical technique for adding controlled randomness to data so individuals can't be singled out), and the data falls outside GDPR's scope entirely.
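To give a flavor of what "adding controlled randomness" means, here is a minimal sketch of the Laplace mechanism for a single counting query. It's deliberately simplified: a real deployment also tracks a privacy budget across queries and clamps sensitivity carefully:

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponential variables with rate
    # 1/scale is distributed as Laplace(0, scale).
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(true_count: int, epsilon: float) -> float:
    # Laplace mechanism for a counting query: sensitivity is 1
    # (one person changes the count by at most 1), so the noise
    # scale is 1/epsilon.
    return true_count + laplace_noise(1.0 / epsilon)

# Hypothetical query: "how many users are in this cohort?"
noisy = private_count(1000, epsilon=1.0)
```

Smaller epsilon means more noise and stronger privacy; the analyst sees a value near 1000 but can never tell whether any single individual was in the cohort.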

The Developer Solution: Deterministic Masking

Developers require realistic data in staging and development to test effectively: catching edge cases, validating migrations, and debugging production-like scenarios. But using actual customer data isn't really an option under privacy regulations, and purely synthetic or randomized test data often misses the real-world edge cases you're trying to catch in the first place.

Traditional pseudonymization with token vaults doesn't really solve this either. Requiring developers to authenticate against a production vault for every test query adds enough friction that they'll find workarounds, and the most common one is copying production databases directly to their laptops. On top of that, maintaining pseudonymized data in non-production environments expands your compliance footprint since it still counts as personal data under GDPR.

Deterministic transformation threads the needle: it gives you pseudonymization's technical benefits (consistent identifiers and preserved referential integrity) while getting much closer to anonymization's safety profile for development environments.

Here's how deterministic transformation works:

Deterministic transformation

The implementation:

CREATE OR REPLACE FUNCTION deterministic_email(original TEXT)
RETURNS TEXT AS $$ 
BEGIN
    RETURN substring(encode(hmac(original::bytea, 'secret_key', 'sha256'), 'hex'), 1, 8) 
           || '@example.com';
END;
$$ LANGUAGE plpgsql;

-- sarah.chen@company.com → a7f4c9e1@example.com
-- Always transforms to same pseudonym
-- Can't determine original email

This maintains referential integrity. If orders.customer_email and customers.email both contain "sarah.chen@company.com", they both transform to "a7f4c9e1@example.com", so JOIN queries work correctly.
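You can see the join surviving the transformation in a few lines of Python. This mirrors the SQL function above (HMAC-SHA256, first 8 hex characters); the table contents are made up:

```python
import hashlib
import hmac

KEY = b"secret_key"  # illustrative; must match across all transformed tables

def deterministic_email(email: str) -> str:
    # Same construction as the SQL function above.
    digest = hmac.new(KEY, email.encode(), hashlib.sha256).hexdigest()
    return digest[:8] + "@example.com"

customers = [{"email": "sarah.chen@company.com", "plan": "pro"}]
orders = [{"customer_email": "sarah.chen@company.com", "total": 99}]

# Transform both tables with the same keyed function...
masked_customers = [{**c, "email": deterministic_email(c["email"])} for c in customers]
masked_orders = [{**o, "customer_email": deterministic_email(o["customer_email"])} for o in orders]

# ...and the join key still matches after masking.
joined = [
    (c, o)
    for c in masked_customers
    for o in masked_orders
    if c["email"] == o["customer_email"]
]
```

If the transformation were randomized instead, the two tables would get different pseudonyms for the same email and `joined` would come back empty.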

Xata implements this at the storage layer. Branching (copy-on-write database cloning) applies transformations automatically:

# Create anonymized development branch
xata branch create --from main --anonymize dev-feature-branch

# Branch has real data distributions and relationships
# But all PII is deterministically transformed

Did you know? A study found that 54% of organizations experienced breaches tied to insecure non-production environments, most of them using copies of production databases with inadequate protection. Deterministic transformation at the database layer closes that gap by ensuring developers simply can't accidentally expose customer data.

Choosing Between Anonymization and Pseudonymization

The diagram below shows when to choose each approach:

Choosing Between Anonymization and Pseudonymization

If you ask me, it really comes down to one question: do you need to link data across time or systems? If you're analyzing customer behavior over months or years, pseudonymization is the right call. If you're giving developers realistic test data without privacy risk, anonymization or deterministic masking is the safer path.

Here's a quick comparison:

| Criterion | Anonymization | Pseudonymization |
| --- | --- | --- |
| Reversibility | No, original data cannot be recovered | Yes, can be reversed with additional information (key/vault) |
| Primary Use Cases | Development environments, public datasets, testing with realistic data | Production analytics, longitudinal studies, systems requiring user tracking |
| Data Risk Level | Low, not considered personal data under GDPR when done properly | Medium, still personal data, requires security controls and access policies |
| Linkability | None, cannot connect records across datasets or time periods | High, maintains consistent identifiers for tracking and joins across systems |
| Regulatory Status | Falls outside GDPR scope if truly irreversible | Remains under GDPR/privacy regulations, qualifies as a security measure |
| Access Controls | Minimal, can be widely distributed once anonymized | Strict, requires policies on who can reverse pseudonyms and when |

Final Thoughts

Pseudonymization is the right tool for production analytics, security investigations, and keeping data consistent across multiple systems. It lets you track entities over time while protecting privacy, but it comes with a tradeoff: you're still handling personal data with full compliance obligations attached.

Anonymization works better for development environments, public datasets, and any scenario where you don't need to link back to original users. The privacy guarantees are stronger and the regulatory burden is lighter, but you give up the ability to track users or maintain consistent identifiers across systems.

The mistake I see most often is reaching for pseudonymization when anonymization would do the job just fine. If your developers don't need to trace individual customer journeys, there's no good reason to give them pseudonymized data that technically allows re-identification.

Xata's approach sits in the middle. Deterministic transformers keep pseudonymization's consistency benefits (foreign keys work, data distributions stay realistic) but implement it as a one-way transformation at the database layer. Your analytics team gets properly managed pseudonymization with formal key governance and audit logging. Your development team works with deterministic anonymization that looks like pseudonymization but can't be reversed. Both teams get the data utility they need, with security controls that actually match the risk.

If you have thoughts on how you're handling pseudonymization in your stack, or questions about implementing deterministic masking, I'd love to hear from you. Drop a comment below.
