Gabriel Henrique

Posted on Jun 20

Data Contracts in Production: Stop Trusting Your Upstream Sources

#dataengineering #python #data #mlops

Your upstream data source changed a column type last night. Your pipeline ran at 2am, ingested everything without a single error, and by the time your stakeholders opened their dashboards at 9am, the revenue numbers were wrong.

No alert fired. No test failed. The pipeline was technically healthy.

This is one of the most common and expensive failure modes in data engineering, and it happens because we build systems that trust the data they receive. Data contracts are the fix.

What a Data Contract Actually Is

A data contract is a formal agreement between a data producer and a data consumer that defines what the data looks like, what quality guarantees it carries, and who owns it.

Not documentation. Not a README. An executable specification that can be validated automatically, versioned like code, and that fails loudly when violated.

Think of it like an API contract, but for your data. A REST API fails loudly with a 400 when you send the wrong payload. A data pipeline fails silently with bad numbers. Contracts change that.

A contract typically covers: schema definition (fields, types, nullability), quality rules (completeness, uniqueness, valid value ranges), SLA metadata (freshness, update frequency), and ownership (who produces this, who consumes it).

Anatomy of a Real Data Contract

Here is what a minimal contract looks like in a datacontract.yaml file, using the open Data Contract Specification format:

dataContractSpecification: 0.9.3
id: orders-v2
info:
  title: Orders Contract
  version: 2.0.0
  owner: data-platform-team
  status: active

models:
  orders:
    description: One row per order placed on the platform
    fields:
      order_id:
        type: string
        required: true
        unique: true
      customer_id:
        type: string
        required: true
      total_amount:
        type: decimal
        required: true
        minimum: 0
      status:
        type: string
        enum: [pending, confirmed, shipped, delivered, cancelled]
      created_at:
        type: timestamp
        required: true

quality:
  type: SodaCL
  specification: |
    checks for orders:
      - row_count > 0
      - missing_count(order_id) = 0
      - duplicate_count(order_id) = 0
      - invalid_count(status) = 0:
          valid values: [pending, confirmed, shipped, delivered, cancelled]
      - freshness(created_at) < 6h

servicelevels:
  freshness:
    description: Data must not be older than 6 hours
    threshold: 6h

This file is checked into Git alongside the dbt models that produce the orders table. When the schema changes, the contract changes. When the contract breaks, the pipeline stops.
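To make "the pipeline stops" concrete, here is a minimal, standard-library-only sketch of a check driven by the field definitions above. It is an illustration under stated assumptions, not how any contract tool works internally: `ORDERS_FIELDS` is a hand-copied subset of the YAML and `check_record` is a hypothetical helper. In practice you would more likely parse the file and hand it to dedicated tooling.

```python
from decimal import Decimal, InvalidOperation

# Hand-copied subset of the contract's field definitions (assumption: a real
# setup would parse these from datacontract.yaml instead of duplicating them).
ORDERS_FIELDS = {
    "order_id": {"type": "string", "required": True},
    "total_amount": {"type": "decimal", "required": True, "minimum": 0},
    "status": {
        "type": "string",
        "enum": ["pending", "confirmed", "shipped", "delivered", "cancelled"],
    },
}


def check_record(record: dict, fields: dict = ORDERS_FIELDS) -> list[str]:
    """Return a list of human-readable violations (empty means compliant)."""
    violations = []
    for name, spec in fields.items():
        value = record.get(name)
        if value is None:
            if spec.get("required"):
                violations.append(f"{name}: required field is missing")
            continue
        if spec["type"] == "string" and not isinstance(value, str):
            violations.append(f"{name}: expected string, got {type(value).__name__}")
            continue
        if spec["type"] == "decimal":
            try:
                amount = Decimal(str(value))
                too_small = "minimum" in spec and amount < spec["minimum"]
            except InvalidOperation:
                violations.append(f"{name}: not a valid decimal")
                continue
            if too_small:
                violations.append(f"{name}: below minimum {spec['minimum']}")
        if "enum" in spec and value not in spec["enum"]:
            violations.append(f"{name}: {value!r} not in allowed values")
    return violations
```

A record whose `order_id` silently turned into an integer upstream now produces an explicit violation instead of flowing through, which is exactly the 2am scenario from the introduction.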

Three Places to Enforce Contracts

Most teams put the enforcement in one place and leave gaps everywhere else. You need all three layers.

[Producer / Source System]
        |
        v
[Ingestion Layer]  <-- enforce schema + type contracts here
        |
        v
[Transformation Layer (dbt)]  <-- enforce quality contracts here
        |
        v
[Serving Layer / Warehouse]  <-- enforce SLA and freshness here
        |
        v
[Consumer / Dashboard / LLM / API]

At ingestion you catch schema drift early, before bad data poisons your warehouse. Use Pydantic models to validate incoming records.

At transformation you use dbt tests or Soda checks to enforce business-level quality rules. A row count of zero is not a schema violation, but it is a contract violation.
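As a rough illustration of what those business-level rules evaluate, the sketch below re-implements a few of the contract's checks in plain Python over a list of rows. The check names are borrowed from the contract for readability only; `run_quality_checks` is a hypothetical helper, and in a real pipeline dbt tests or Soda would run these against the warehouse table.

```python
from collections import Counter

VALID_STATUSES = {"pending", "confirmed", "shipped", "delivered", "cancelled"}


def run_quality_checks(rows: list[dict]) -> dict[str, bool]:
    """Evaluate each check and return {check_name: passed}."""
    order_ids = [r.get("order_id") for r in rows]
    id_counts = Counter(i for i in order_ids if i is not None)
    return {
        # An empty table is schema-valid but still breaks the contract.
        "row_count > 0": len(rows) > 0,
        "missing_count(order_id) = 0": all(i is not None for i in order_ids),
        "duplicate_count(order_id) = 0": all(n == 1 for n in id_counts.values()),
        # Null statuses are skipped here; this sketch leaves them to a missing-value check.
        "invalid_count(status) = 0": all(
            r["status"] in VALID_STATUSES for r in rows if r.get("status") is not None
        ),
    }
```

Returning a name-to-result mapping rather than raising on the first failure lets the caller report every broken rule in one run.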

At serving you monitor freshness and completeness so consumers know the data they are reading meets SLA guarantees.
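A freshness check at the serving layer can be very small. The sketch below assumes you can already query the newest `created_at` value from the table as a timezone-aware datetime; `is_fresh` and `FRESHNESS_THRESHOLD` are hypothetical names, and the 6-hour value mirrors the contract above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FRESHNESS_THRESHOLD = timedelta(hours=6)  # mirrors the 6h threshold in the contract


def is_fresh(
    latest_created_at: datetime,
    now: Optional[datetime] = None,
    threshold: timedelta = FRESHNESS_THRESHOLD,
) -> bool:
    """Return True if the newest row is younger than the threshold.

    Both datetimes must be timezone-aware; mixing naive and aware values
    raises TypeError on subtraction.
    """
    now = now or datetime.now(timezone.utc)
    return now - latest_created_at < threshold
```

Passing `now` explicitly keeps the function deterministic in tests; a scheduler or monitoring job would call it without that argument and alert when it returns False.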

A Real Ingestion Contract with Pydantic

This runs at the top of every ingestion job, before writing a single row to the warehouse:

# Written against Pydantic v2; on v1, import `validator` instead of `field_validator`.
from pydantic import BaseModel, Field, field_validator
from decimal import Decimal
from datetime import datetime
from enum import Enum
from typing import Optional
import logging

logger = logging.getLogger(__name__)

class OrderStatus(str, Enum):
    pending = "pending"
    confirmed = "confirmed"
    shipped = "shipped"
    delivered = "delivered"
    cancelled = "cancelled"

class Order(BaseModel):
    order_id: str
    customer_id: str
    total_amount: Decimal = Field(ge=0)  # must be non-negative
    status: OrderStatus
    created_at: datetime
    promo_code: Optional[str] = None  # optional, but we track null rate

    @field_validator("order_id")
    @classmethod
    def order_id_not_empty(cls, v: str) -> str:
        # Reject empty or whitespace-only identifiers.
        if not v.strip():
            raise ValueError("order_id cannot be blank")
        return v

def validate_and_load(records: list[dict]) -> tuple[list[Order], list[dict]]:
    # Returns (valid_records, failed_records).
    # Failures are never dropped silently: they are logged and returned so the
    # caller can route them to a dead-letter topic.
    valid = []
    failed = []

    for record in records:
        try:
            valid.append(Order(**record))
        except Exception as e:
            # Covers validation errors as well as records that are not dicts.
            logger.error("Contract violation: %s | Record: %s", e, record)
            failed.append({"record": record, "error": str(e)})

    # Fail the pipeline if more than 1% of records are invalid.
    # An empty batch counts as a 0% failure rate, which also avoids dividing by zero.
    failure_rate = len(failed) / len(records) if records else 0.0
    if failure_rate > 0.01:
        raise RuntimeError(
            f"Contract breach: {failure_rate:.1%} of records failed validation "
            f"({len(failed)} / {len(records)})"
        )

    return valid, failed

Two decisions here are worth explaining.

First, the 1% threshold. You do not want to fail the pipeline on a single bad record, but you also do not want to silently ingest garbage. Set a threshold that reflects your tolerance and make it explicit in the code.

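Second, the dead-letter queue. Every failed record should go somewhere observable, which is why the function returns the failures instead of discarding them: the caller can route them to a dead-letter topic. If you drop a record, it is gone forever. If you keep it alongside its error, you can replay it after fixing the issue.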
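What "somewhere observable" looks like depends on your stack. As one hedged possibility, the sketch below uses a local JSON Lines file as a stand-in for a dead-letter topic; `write_dead_letters` and `read_dead_letters` are hypothetical helpers, and the input is assumed to have the `{"record": ..., "error": ...}` shape that `validate_and_load` returns.

```python
import json
from pathlib import Path


def write_dead_letters(failed: list[dict], path: Path) -> int:
    """Append failed records to a JSON Lines file; return how many were written."""
    with path.open("a", encoding="utf-8") as fh:
        for item in failed:
            # default=str keeps non-JSON types (Decimal, datetime) serializable.
            fh.write(json.dumps(item, default=str) + "\n")
    return len(failed)


def read_dead_letters(path: Path) -> list[dict]:
    """Load the raw records back so they can be re-validated after a fix."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line)["record"] for line in fh if line.strip()]
```

Replaying then amounts to feeding the output of `read_dead_letters` back through the same validation once the upstream issue or the contract has been fixed.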

Common Mistakes

Treating contracts as documentation. A YAML file that nobody checks is just noise. The contract has to run automatically, fail fast, and block bad data from propagating.

Putting all validation at one layer. Schema is not the same as quality. You can have perfectly typed data that is 90% null. Both need contracts.

Versioning contracts separately from the code. When a producer changes a column, the contract and the dbt model and the ingestion code all need to change together. Keep them in the same repo, reviewed in the same PR.

Using blocking contracts everywhere from day one. You will break things. Start with logging-only mode, measure your actual failure rates, then flip to hard-blocking after you understand the baseline.

Ignoring freshness SLAs. A technically correct dataset from 14 hours ago is a broken contract for a real-time dashboard. Freshness is a first-class quality dimension.
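The logging-only rollout mentioned above can be as simple as a mode switch around the failure-rate threshold. This is a sketch under assumptions: `Mode` and `enforce` are hypothetical names, and the 1% default matches the threshold used in the ingestion example.

```python
import logging
from enum import Enum

logger = logging.getLogger("contracts")


class Mode(Enum):
    WARN = "warn"    # log violations, let the pipeline continue
    BLOCK = "block"  # raise and stop the pipeline


def enforce(failed: int, total: int, mode: Mode, max_failure_rate: float = 0.01) -> float:
    """Apply the failure-rate threshold according to the rollout mode."""
    rate = failed / total if total else 0.0
    if rate > max_failure_rate:
        message = f"Contract breach: {rate:.1%} of records failed ({failed} / {total})"
        if mode is Mode.BLOCK:
            raise RuntimeError(message)
        logger.warning(message)
    return rate
```

Running in `Mode.WARN` for a while gives you the baseline failure rate; switching a dataset to `Mode.BLOCK` is then a one-line configuration change rather than a rewrite.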

When Contracts Are Not Worth the Investment

Not every dataset needs a formal contract. Internal scratch tables, exploratory datasets, and one-off analyses do not need this overhead.

Contracts pay off when the data crosses a team or system boundary. If another team, application, or AI system depends on your data, you need a contract. If it breaks for them, you will spend more time debugging than you saved by skipping the contract in the first place.

The ROI is clearest in two scenarios: high-value production pipelines (revenue, product metrics, ML features) and AI/LLM systems consuming structured data. An LLM receiving malformed features will not throw an exception. It will just produce worse outputs. Contracts at the feature serving layer are non-negotiable for production AI.

The Shift Happening Right Now

The industry is moving toward contracts-first development. Write the contract before you write the pipeline. Define what the output should look like, what quality guarantees it carries, and who owns it. Then build to meet that spec.

It is the same discipline that made API development more reliable. The data ecosystem is just a few years behind on this.

In 2026, with AI systems consuming data directly, a schema break is no longer just a broken dashboard. It is a broken model, a wrong recommendation, a compounding error in an automated pipeline that nobody noticed. The cost of trust without verification has gone up significantly.

If your pipelines have never failed because of an upstream schema change, consider yourself lucky. Put contracts in place before that luck runs out.

Cheers,
Gabriel Henrique Cardoso Antonio 🔗 gabrielh.dev

DEV Community