Dror Atariah

Posted on Jul 8

Two Approaches to Data Anonymization with Pydantic

#pydantic #anonymization #bestpractices #pii

When working with Personally Identifiable Information (PII), you often need to anonymize sensitive information before processing or storing it. Let's explore two different approaches to implement this using Pydantic models.

The Problem

You have a User model and need to create an anonymized version for privacy compliance:

# EmailStr needs the optional email-validator package: pip install "pydantic[email]"
from pydantic import BaseModel, EmailStr

class User(BaseModel):
    id: int
    name: str
    email: str

Solution 1: Factory Method Approach

Create a classmethod that handles anonymization during object creation:

class AnonymizedPerson(BaseModel):
    id: int
    name: str
    email: EmailStr

    @classmethod
    def from_user(cls, user: User) -> "AnonymizedPerson":
        # Anonymize the email by replacing the local part (the text
        # before "@") with "anonymized". A keyed hash would be a better
        # choice if analytics on the data still need to be supported.
        email_parts = user.email.split("@")
        anonymized_email = f"anonymized@{email_parts[1]}"

        return cls(
            id=user.id,
            name=f"anonymized_{user.name}",
            email=anonymized_email
        )

How to use:

user = User(id=1, name="Alice Brown", email="alice@company.com")
anonymized = AnonymizedPerson.from_user(user)
assert anonymized.name == "anonymized_Alice Brown"
assert anonymized.email == "anonymized@company.com"
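
The code comment in `from_user` mentions hashing as a way to keep analytics possible. A minimal sketch of that idea follows, using only the standard library. The helper names `pseudonymize` and `pseudonymize_email`, the `SECRET_KEY` constant, and the choice to lowercase the local part are assumptions made for illustration, not part of the original models. Note that the `anonymized_` prefix used in this post is a placeholder that leaves the original value readable, while a keyed hash hides it but is strictly pseudonymization: anyone holding the key can recompute the mapping.

```python
import hashlib
import hmac

# Hypothetical secret; in practice load it from a secret store, not source code.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY, length: int = 12) -> str:
    """Return a stable keyed hash of `value` (HMAC-SHA256, truncated hex)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def pseudonymize_email(email: str, key: bytes = SECRET_KEY) -> str:
    """Replace the local part with a keyed hash and keep the domain."""
    local_part, _, domain = email.rpartition("@")
    # Assumption: local parts are treated as case-insensitive here.
    return f"{pseudonymize(local_part.lower(), key)}@{domain}"


token_a = pseudonymize_email("alice@company.com")
token_b = pseudonymize_email("Alice@company.com")

# The same person maps to the same token, so counts and joins still work.
assert token_a == token_b
assert token_a.endswith("@company.com")
assert "alice" not in token_a
```

Either helper could be called inside `from_user` in place of the fixed `"anonymized"` string. Truncating the digest keeps values short at the cost of a higher collision chance, so the `length` default is a trade-off to tune rather than a recommendation.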

Solution 2: Field Validator Approach

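Use Pydantic's @field_validator decorator with mode='after' to transform data automatically. An "after" validator runs once Pydantic's own type validation has succeeded, so it receives an already-validated value: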

from pydantic import BaseModel, EmailStr, field_validator

class AnonymizedPersonAuto(BaseModel):
    id: int
    name: str
    email: EmailStr

    @field_validator("name", mode="after")
    @classmethod
    def anonymize_name(cls, v: str) -> str:
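        """Automatically anonymize the name field after validation"""
        # The prefix check keeps the validator idempotent, so data that
        # is validated a second time is not prefixed twice.
        if not v.startswith("anonymized_"):
            return f"anonymized_{v}"
        return v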

    @field_validator("email", mode="after")
    @classmethod
    def anonymize_email(cls, v: str) -> str:
        """Automatically anonymize the email field after validation"""
        if not v.startswith("anonymized@"):
            email_parts = v.split("@")
            return f"anonymized@{email_parts[1]}"
        return v

    @classmethod
    def from_user(cls, user: User) -> "AnonymizedPersonAuto":
        """Create from User - name and email will be auto-anonymized"""
        return cls(id=user.id, name=user.name, email=user.email)

How to use:

user = User(id=1, name="Bob Wilson", email="bob@tech.org")

# All of these creation paths automatically anonymize both fields:
from_factory = AnonymizedPersonAuto.from_user(user)
direct = AnonymizedPersonAuto(id=2, name="Carol", email="carol@startup.io")
# model_validate takes a dict; use model_validate_json for a raw JSON string
from_dict = AnonymizedPersonAuto.model_validate({"id": 3, "name": "Dave", "email": "dave@corp.net"})

# All result in anonymized data:
assert from_factory.name == "anonymized_Bob Wilson"
assert direct.email == "anonymized@startup.io"
assert from_dict.name == "anonymized_Dave"
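
Field validators only run when validation runs. The sketch below uses a reduced, hypothetical model (`AnonymizedName`, with no email field, so it does not need `EmailStr`) to show two paths worth knowing about. Attribute assignment is not validated unless `validate_assignment=True` is set, and `model_construct()` skips validation by design.

```python
from pydantic import BaseModel, ConfigDict, field_validator


class AnonymizedName(BaseModel):
    # Re-run validators whenever a field is assigned after creation.
    model_config = ConfigDict(validate_assignment=True)

    id: int
    name: str

    @field_validator("name", mode="after")
    @classmethod
    def anonymize_name(cls, v: str) -> str:
        return v if v.startswith("anonymized_") else f"anonymized_{v}"


person = AnonymizedName(id=1, name="Carol")
assert person.name == "anonymized_Carol"

# With validate_assignment=True the validator also runs on assignment.
person.name = "Dave"
assert person.name == "anonymized_Dave"

# model_construct() skips validation entirely, so the raw value is kept.
raw = AnonymizedName.model_construct(id=2, name="Erin")
assert raw.name == "Erin"
```

If untrusted code can reach `model_construct()`, the validator approach alone does not guarantee anonymized instances. Treat it as a strong default rather than an absolute barrier.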

Summary: When to Use Each Approach

Factory Method Approach

Pros:

Simple and explicit
Clear control over transformation logic
Easy to understand and debug
Suitable for complex multi-field transformations

Cons:

Only works when using the factory method
Other creation paths bypass anonymization
Manual process that can be forgotten

Best for: Simple use cases where you control all object creation paths

Field Validator Approach

Pros:

Consistent across every creation path that runs validation
Self-contained transformation logic
Works with direct instantiation, JSON parsing, and factory methods
Prevents accidental non-anonymized instances, as long as validation is not skipped (for example via model_construct())

Cons:

Slightly more complex setup
Field-level transformations only
Less explicit about when transformation occurs

Best for: Production systems where data consistency and reliability are critical
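
The "field-level transformations only" limitation listed above can be softened with a model-level validator, which sees all input fields at once. The following is a minimal sketch, assuming a hypothetical `AnonymizedContact` model and an illustrative scheme that derives both the name and the email from the `id`. It is not part of the original article's models.

```python
from typing import Any

from pydantic import BaseModel, model_validator


class AnonymizedContact(BaseModel):
    id: int
    name: str
    email: str  # plain str keeps the sketch free of the email-validator dependency

    @model_validator(mode="before")
    @classmethod
    def anonymize(cls, data: Any) -> Any:
        # A "before" validator receives the raw input, usually a dict.
        if isinstance(data, dict):
            data = dict(data)  # copy so the caller's dict is not mutated
            domain = str(data.get("email", "")).rpartition("@")[2]
            alias = f"user_{data.get('id')}"
            data["name"] = alias
            data["email"] = f"{alias}@{domain}"
        return data


contact = AnonymizedContact(id=7, name="Frank", email="frank@example.com")
assert contact.name == "user_7"
assert contact.email == "user_7@example.com"
```

This keeps the "runs on every validated path" property of the validator approach while allowing one field's replacement to depend on another.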

Bonus: Making the Factory Method Approach Bulletproof

If you want to use the factory method approach but stop direct instantiation from bypassing it, you can guard the constructor so that it only succeeds when called from the factory method. The version below tracks this with a ContextVar flag rather than an attribute set on the instance, because setting attributes on a Pydantic model before BaseModel.__init__ has run can raise an AttributeError in Pydantic v2:

from contextvars import ContextVar

# True only while from_user() is constructing an instance
_in_factory: ContextVar[bool] = ContextVar("_in_factory", default=False)

class AnonymizedPersonSecure(BaseModel):
    id: int
    name: str
    email: EmailStr

    def __init__(self, **data):
        # Guarded constructor - should only be called by factory methods
        if not _in_factory.get():
            raise ValueError("Use AnonymizedPersonSecure.from_user() instead of direct instantiation")
        super().__init__(**data)

    @classmethod
    def from_user(cls, user: User) -> "AnonymizedPersonSecure":
        # Anonymize email by replacing the local part with "anonymized"
        email_parts = user.email.split("@")
        anonymized_email = f"anonymized@{email_parts[1]}"

        # Open the guard only for the duration of this call
        token = _in_factory.set(True)
        try:
            return cls(
                id=user.id,
                name=f"anonymized_{user.name}",
                email=anonymized_email
            )
        finally:
            _in_factory.reset(token)

How it works:

user = User(id=1, name="Alice Brown", email="alice@company.com")

# This works - using factory method
anonymized = AnonymizedPersonSecure.from_user(user)
assert anonymized.name == "anonymized_Alice Brown"

# This fails - direct instantiation is blocked
try:
    AnonymizedPersonSecure(id=2, name="Bob", email="bob@test.com")
except ValueError as e:
    assert str(e) == "Use AnonymizedPersonSecure.from_user() instead of direct instantiation"
else:
    raise AssertionError("direct instantiation should have been blocked")

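This gives you the control of the factory method approach while blocking direct instantiation. Note that model_construct() skips __init__ and validation by design, so it remains a deliberate escape hatch.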

Key Takeaway

Choose the factory method for simple, controlled scenarios. Choose field validators when you need a transformation that applies on every validated path through which your model might be instantiated. Both patterns have their place in modern Python applications.

Have you used Pydantic field validators for data transformation? Share your use cases in the comments!
