How to Use Pydantic to Validate External API Data in Python


External APIs are a common source of data quality problems in automation pipelines. The data they return follows a schema that was documented at one point in time and then changes, sometimes without notice. Fields become nullable. Types change. New required fields appear. Pydantic is a practical Python tool for catching these changes before they corrupt your database or produce incorrect downstream calculations.

This guide walks through setting up Pydantic validation for API responses, adding business rule checks, and handling the records that fail.

Step 1: Install Pydantic

pip install "pydantic>=2"

Pydantic version 2 is the current release. The import structure and some validator syntax differ from version 1. This guide uses version 2 syntax throughout.

Step 2: Define a Model for Your API Record Type

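Start with the fields your API actually returns. Map each field to its expected Python type. Use Optional for fields that may be null, and give them a default (such as None) if they may also be absent: in Pydantic version 2, an Optional field without a default is still required.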

from pydantic import BaseModel
from datetime import datetime
from typing import Optional

class EventRecord(BaseModel):
    event_id: str
    user_id: int
    event_type: str
    occurred_at: datetime
    value: float
    metadata: Optional[dict] = None

Pydantic validates each field on model instantiation. If user_id arrives as a string, Pydantic will attempt to coerce it to an integer. If occurred_at arrives as a malformed timestamp string, it raises a ValidationError. If a required field like event_id is missing entirely, it raises a ValidationError.
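As a quick illustration of the three behaviors just described, the following minimal sketch feeds the model a few hand-written payloads. The sample values and the error_types helper are hypothetical and exist only for this demonstration.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ValidationError


class EventRecord(BaseModel):
    event_id: str
    user_id: int
    event_type: str
    occurred_at: datetime
    value: float
    metadata: Optional[dict] = None


def error_types(payload: dict) -> list[str]:
    """Hypothetical helper: Pydantic error types for a payload ([] if it validates)."""
    try:
        EventRecord(**payload)
    except ValidationError as exc:
        return [err["type"] for err in exc.errors()]
    return []


good = {
    "event_id": "evt-1",
    "user_id": "12345",  # numeric string: coerced to int in the default (lax) mode
    "event_type": "click",
    "occurred_at": "2024-05-01T12:00:00Z",
    "value": 9.5,
}
record = EventRecord(**good)
print(type(record.user_id).__name__, record.user_id)

# A malformed timestamp fails validation instead of being coerced.
bad_timestamp = {**good, "occurred_at": "not-a-timestamp"}
print(error_types(bad_timestamp))

# A missing required field is reported with the 'missing' error type.
missing_id = {k: v for k, v in good.items() if k != "event_id"}
print(error_types(missing_id))
```

Note that the coerced user_id gives no signal that the upstream type changed, which is the motivation for the stricter configuration discussed next.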

The coercion behavior is configurable. For automation pipelines where silent type coercion can mask upstream data drift, use model_config = ConfigDict(strict=True) to raise errors on type mismatches rather than coercing:

from datetime import datetime

from pydantic import BaseModel, ConfigDict

class EventRecord(BaseModel):
    model_config = ConfigDict(strict=True)

    event_id: str
    user_id: int
    event_type: str
    occurred_at: datetime
    value: float

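In strict mode, a user_id arriving as "12345" (a string) raises an error rather than being silently converted to 12345. This is usually what you want in a validation layer -- you want to know that the upstream type changed, not have Pydantic paper over it. Strict mode applies to every field, though: when you validate an already-parsed Python dict, an ISO timestamp string for occurred_at is rejected as well, because strict mode expects a datetime instance. Validating the raw JSON text with EventRecord.model_validate_json() avoids this, since strict validation of JSON input still accepts timestamp strings.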
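Strict mode treats raw JSON text and an already-parsed dict differently, which is easy to miss. The sketch below compares the two on the same hand-written payload. The StrictEventRecord name, the error_summary helper, and the sample JSON are assumptions made for this example.

```python
import json
from datetime import datetime

from pydantic import BaseModel, ConfigDict, ValidationError


class StrictEventRecord(BaseModel):
    model_config = ConfigDict(strict=True)

    event_id: str
    user_id: int
    event_type: str
    occurred_at: datetime
    value: float


def error_summary(validate, payload):
    """Hypothetical helper: (loc, type) pairs for each error, [] if valid."""
    try:
        validate(payload)
    except ValidationError as exc:
        return [(err["loc"], err["type"]) for err in exc.errors()]
    return []


raw_json = (
    '{"event_id": "evt-1", "user_id": 12345, "event_type": "click", '
    '"occurred_at": "2024-05-01T12:00:00Z", "value": 9.5}'
)

# JSON text: the timestamp string is accepted, because JSON has no datetime type.
json_errors = error_summary(StrictEventRecord.model_validate_json, raw_json)

# Same payload after upstream drift (user_id became a string): rejected, not coerced.
drifted_json = raw_json.replace("12345", '"12345"')
drift_errors = error_summary(StrictEventRecord.model_validate_json, drifted_json)

# Already-parsed dict: the timestamp string itself is now a strict-mode violation.
dict_errors = error_summary(StrictEventRecord.model_validate, json.loads(raw_json))

print(json_errors, drift_errors, dict_errors, sep="\n")
```

If the pipeline receives parsed dicts rather than raw text, one option is to keep the model lax and mark only selected fields as strict.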

Step 3: Add Business Rule Validators

Schema validation catches type and structure errors. Business rules catch semantic errors -- values that have the right type but are logically wrong.

from pydantic import BaseModel, field_validator
from datetime import datetime

class EventRecord(BaseModel):
    event_id: str
    user_id: int
    event_type: str
    occurred_at: datetime
    value: float

    @field_validator('event_type')
    @classmethod
    def event_type_must_be_valid(cls, v):
        allowed = {'click', 'view', 'purchase', 'refund', 'cancel'}
        if v not in allowed:
            raise ValueError(f'event_type "{v}" not in allowed set: {allowed}')
        return v

    @field_validator('occurred_at')
    @classmethod
    def timestamp_not_in_future(cls, v):
        if v > datetime.now(tz=v.tzinfo):
            raise ValueError(f'occurred_at cannot be in the future: {v}')
        return v

    @field_validator('value')
    @classmethod
    def value_reasonable(cls, v):
        if v < -100000 or v > 1000000:
            raise ValueError(f'value {v} outside expected range [-100000, 1000000]')
        return v

Field validators declared this way run in Pydantic's default 'after' mode: they run only once type validation for that field has succeeded. If occurred_at fails to parse as a datetime, the type validator raises first -- the business rule validator for the same field never runs. This prevents confusing cascading error messages.
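The ordering can be checked with a cut-down model. This is a sketch under stated assumptions: TimestampedEvent and failure_types are hypothetical names, and this variant compares against the current UTC time and treats naive timestamps as UTC, which may not match your data.

```python
from datetime import datetime, timedelta, timezone

from pydantic import BaseModel, ValidationError, field_validator


class TimestampedEvent(BaseModel):
    # Hypothetical cut-down model: only the field under discussion.
    occurred_at: datetime

    @field_validator("occurred_at")
    @classmethod
    def timestamp_not_in_future(cls, v: datetime) -> datetime:
        # Assumption of this sketch: naive timestamps are interpreted as UTC.
        candidate = v if v.tzinfo is not None else v.replace(tzinfo=timezone.utc)
        if candidate > datetime.now(timezone.utc):
            raise ValueError(f"occurred_at cannot be in the future: {v}")
        return v


def failure_types(value) -> list[str]:
    """Return the error types raised for one occurred_at value ([] if valid)."""
    try:
        TimestampedEvent(occurred_at=value)
    except ValidationError as exc:
        return [err["type"] for err in exc.errors()]
    return []


# Unparseable input: a single parsing error, the business rule never runs.
malformed = failure_types("yesterday-ish")

# Parseable but in the future: the business rule runs and reports 'value_error'.
future = failure_types(datetime.now(timezone.utc) + timedelta(days=1))

# Parseable and in the past: no errors.
past = failure_types("2020-01-01T00:00:00Z")

print(malformed, future, past)
```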


Step 4: Validate a Batch of Records

Most API responses return arrays of records. Process them in a loop, separating valid from invalid:

from pydantic import ValidationError

def validate_batch(records: list[dict]) -> tuple[list[EventRecord], list[dict]]:
    valid = []
    invalid = []

    for record in records:
        try:
            valid.append(EventRecord(**record))
        except ValidationError as e:
            invalid.append({
                'raw': record,
                'errors': e.errors(),
                'error_count': e.error_count()
            })

    return valid, invalid

The e.errors() method returns a list of dicts. The keys most useful for triage are:

loc: the field path where the error occurred, as a tuple
type: the error type (e.g., 'missing', 'int_type', 'int_parsing', 'value_error')
msg: a human-readable error message
input: the value that caused the error

This structured output makes logging and debugging straightforward. You can sort errors by type to see patterns -- are you getting mostly 'missing' errors on one field, suggesting an upstream change?
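To see what the structured errors look like in practice, here is a minimal sketch that runs a small batch through a cut-down model. The batch contents are invented for illustration, and the model omits the business rule validators to keep the example short.

```python
from datetime import datetime

from pydantic import BaseModel, ValidationError


class EventRecord(BaseModel):
    # Cut-down version of the model, without the business rule validators.
    event_id: str
    user_id: int
    occurred_at: datetime


def validate_batch(records: list[dict]) -> tuple[list[EventRecord], list[dict]]:
    valid, invalid = [], []
    for record in records:
        try:
            valid.append(EventRecord(**record))
        except ValidationError as e:
            invalid.append({"raw": record, "errors": e.errors()})
    return valid, invalid


# Hypothetical payload: one good record, one missing event_id, one with a list as user_id.
batch = [
    {"event_id": "evt-1", "user_id": 1, "occurred_at": "2024-05-01T12:00:00Z"},
    {"user_id": 2, "occurred_at": "2024-05-01T12:05:00Z"},
    {"event_id": "evt-3", "user_id": [3], "occurred_at": "2024-05-01T12:10:00Z"},
]

valid, invalid = validate_batch(batch)

# Flatten to (loc, type) pairs, the two keys most useful for spotting patterns.
summary = [(err["loc"], err["type"]) for item in invalid for err in item["errors"]]
for loc, err_type in summary:
    print(loc, err_type)
```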

Step 5: Handle the Invalid Records

Never silently drop invalid records. Log them with full context so you can investigate upstream changes and recover data if needed.

import logging
import json

logger = logging.getLogger(__name__)

def process_api_response(response: dict, source: str) -> None:
    records = response.get('data', [])
    valid, invalid = validate_batch(records)

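    # Log the counts on every run, including clean ones, to keep a baseline
    log_summary = logger.warning if invalid else logger.info
    log_summary(
        'Validation summary for %s: %d/%d records invalid',
        source, len(invalid), len(records)
    )
    for item in invalid:
        logger.error(
            'Invalid record from %s: errors=%s raw=%s',
            source,
            # default=str: error context may hold exception objects json cannot serialize
            json.dumps(item['errors'], default=str),
            json.dumps(item['raw'], default=str)
        )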

    # Process only valid records (write_to_database stands in for your own persistence function)
    for record in valid:
        write_to_database(record)

The invalid list size relative to total records is a useful metric. A sudden increase in the invalid rate often indicates an upstream API change. Log both counts every run, even when the count is zero -- a baseline of zero followed by a spike is more informative than a single alert without context.

For richer signal, group failures by error type using collections.Counter from the Python standard library. The type key in each dict returned by e.errors() classifies failures as 'missing', 'int_type', 'string_type', 'value_error', and similar categories. A cluster of 'missing' errors on one field suggests that field was removed or made optional upstream. A cluster of 'int_type' or 'string_type' errors suggests a type change for that field. A spike distributed across many fields and error types suggests the API returned a different record structure entirely -- possibly a new version or a different endpoint. This grouping adds a few lines of code to the logging function and substantially improves the signal-to-noise ratio when investigating validation failures in production.
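The grouping described above might look like the following sketch. The function names summarize_failures and log_validation_summary are hypothetical, and sample_invalid is a hand-written stand-in for the invalid list produced by validate_batch, reduced to the two keys the grouping needs.

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)


def summarize_failures(invalid: list[dict]) -> Counter:
    """Count failures by (field path, error type)."""
    counts: Counter = Counter()
    for item in invalid:
        for err in item["errors"]:
            # loc is a tuple such as ('metadata', 'source'); join it into a dotted path.
            field = ".".join(str(part) for part in err["loc"])
            counts[(field, err["type"])] += 1
    return counts


def log_validation_summary(source: str, total: int, invalid: list[dict]) -> Counter:
    # Logged on every run, including clean ones, so there is a baseline to compare against.
    logger.info("Validation summary for %s: %d/%d records invalid", source, len(invalid), total)
    counts = summarize_failures(invalid)
    for (field, err_type), n in counts.most_common():
        logger.warning("%s: field=%s type=%s count=%d", source, field, err_type, n)
    return counts


# Stand-in for validate_batch output, shaped like the dicts from e.errors().
sample_invalid = [
    {"raw": {}, "errors": [{"loc": ("event_id",), "type": "missing"}]},
    {"raw": {}, "errors": [{"loc": ("event_id",), "type": "missing"}]},
    {"raw": {}, "errors": [{"loc": ("user_id",), "type": "int_type"}]},
]
counts = log_validation_summary("events-api", total=10, invalid=sample_invalid)
```

Keying the counter on both field and error type keeps a field removal (many 'missing' errors on one field) distinguishable from a type change on that same field.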

Step 6: Write Tests for Your Validators

Validation logic should be unit tested independently of the pipeline:

from datetime import datetime, timedelta, timezone

import pytest
from pydantic import ValidationError

# EventRecord is the model from Step 3; import it from wherever your pipeline defines it

def test_rejects_future_timestamp():
    future = datetime.now(tz=timezone.utc) + timedelta(hours=1)
    with pytest.raises(ValidationError) as exc_info:
        EventRecord(
            event_id='test-1',
            user_id=42,
            event_type='click',
            occurred_at=future,
            value=1.0
        )
    errors = exc_info.value.errors()
    assert any(e['loc'] == ('occurred_at',) for e in errors)

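def test_rejects_unknown_event_type():
    with pytest.raises(ValidationError) as exc_info:
        EventRecord(
            event_id='test-2',
            user_id=42,
            event_type='unknown_type',
            # Fixed past timestamp, so only the event_type rule can fail
            occurred_at='2020-01-01T00:00:00Z',
            value=1.0
        )
    errors = exc_info.value.errors()
    assert [e['loc'] for e in errors] == [('event_type',)]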

Testing the validation layer separately from the pipeline logic means you can iterate on validation rules without re-running the full pipeline, and you can add tests for edge cases you discover from real production failures.
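Edge cases found in production tend to accumulate, and a parametrized test is one way to keep them in a single table. The sketch below assumes a cut-down EventRecord with only the value rule; in a real suite you would import the full model instead. The BASE payload and the test names are hypothetical.

```python
from datetime import datetime

import pytest
from pydantic import BaseModel, ValidationError, field_validator


class EventRecord(BaseModel):
    # Cut-down model for this sketch; swap in the real EventRecord in a test suite.
    event_id: str
    user_id: int
    event_type: str
    occurred_at: datetime
    value: float

    @field_validator("value")
    @classmethod
    def value_reasonable(cls, v: float) -> float:
        if v < -100000 or v > 1000000:
            raise ValueError(f"value {v} outside expected range [-100000, 1000000]")
        return v


# A known-good payload; each test case overrides exactly one field.
BASE = {
    "event_id": "test-1",
    "user_id": 42,
    "event_type": "click",
    "occurred_at": "2020-01-01T00:00:00Z",
    "value": 1.0,
}


@pytest.mark.parametrize(
    "override, bad_field",
    [
        ({"value": 1_000_001}, "value"),  # just above the upper bound
        ({"value": -100_001}, "value"),  # just below the lower bound
        ({"user_id": "not-a-number"}, "user_id"),
        ({"occurred_at": "garbage"}, "occurred_at"),
    ],
)
def test_rejects_bad_field(override, bad_field):
    with pytest.raises(ValidationError) as exc_info:
        EventRecord(**{**BASE, **override})
    # Exactly one error, located on the field that was overridden.
    assert [e["loc"] for e in exc_info.value.errors()] == [(bad_field,)]


def test_accepts_boundary_values():
    # The range check is inclusive at both ends.
    for value in (-100000, 1000000):
        assert EventRecord(**{**BASE, "value": value}).value == value
```

Adding a new regression case is then a one-line change to the parameter list.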

Where to Go From Here

The Pydantic documentation covers the full validator API, including cross-field validators with @model_validator, nested model validation, and custom JSON serialization. PyPI lists alternative validation libraries (Cerberus, marshmallow, Voluptuous) if Pydantic's feature set is more than you need for a simpler use case.

The broader guide on how to build a data validation layer before processing in Python covers where to place the validation step in the pipeline, how to handle the validated data downstream, and tiering validation depth by field criticality.

137Foundry builds data automation pipelines where this pattern is standard infrastructure -- not an add-on. If your pipeline is growing past what a single developer can maintain reliably, that is worth a conversation.

"The teams that build reliable data automation are not more talented -- they just enforce validation at the boundary before processing starts, while everyone else validates after something breaks." - Dennis Traina, founder of 137Foundry


Schema validation with Pydantic takes an afternoon to add to most pipelines and substantially reduces the risk of silent data corruption going undetected in production. The tests take longer to write than the validators themselves, and they are worth writing.