Gowtham Potureddi

Posted on Jul 5

Data Contracts: Open Data Contract Standard, Schema Registry & Producer-Consumer SLAs

#python #sql #interview #dataengineering

Data contracts are the machine-readable, version-controlled agreements that finally answer the question "who broke the pipeline" without a Slack channel, a war room, or a 3 AM page. By 2026 they have moved from thought-leadership decks into concrete, production-shipped YAML files that serious data platforms check into git. The pre-contract world is grim: a producer team quietly renames user_id to customer_id, the change lands in the source-system release, twelve downstream models fail overnight, the on-call scans dashboards for two hours before someone finally traces the rename through a diff, and every consumer team files an incident against a producer team that had no idea they had any consumers at all. A data contract collapses that entire failure mode into a single artefact: the producer commits a YAML file that declares the schema, the SLA, the quality checks, and the ownership; consumers subscribe to that file; the CI pipeline blocks any producer change that violates it.

This guide is the senior-DE walkthrough you wished existed the first time an interviewer asked "walk me through an Open Data Contract Standard file" or "what does BACKWARD compatibility mean in a schema registry and which producer-side change breaks it?" or "how does a producer-consumer SLA actually get enforced in CI vs at runtime?" It walks through why data contracts became inevitable given the shape of modern data teams, the anatomy of an ODCS YAML with schema, sla, quality, and roles blocks, the Confluent / Apicurio schema registry model with subject-naming strategies and compatibility modes, the two enforcement gates (pre-merge lint plus runtime consumer-side validation) that turn a contract from a wish into a rule, and the rollout ladder (warn-only, block-on-new, block-everywhere) that lets a large organisation introduce contracts without shipping a wave of overnight breakages. Each section pairs a teaching block with a Solution-Tail interview answer: code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

When you want hands-on reps immediately after reading, drill the ETL practice library →, rehearse on the streaming practice library →, and sharpen the schema-design axis with the optimization practice library →.

On this page

Why data contracts became the answer to "who broke the pipeline"
Open Data Contract Standard (ODCS) — the schema
Schema Registry integration — Avro, Protobuf, JSON
Contract enforcement — CI + runtime
SLAs + ownership + rollout
Cheat sheet — data contract recipes
Frequently asked questions
Practice on PipeCode

1. Why data contracts became the answer to "who broke the pipeline"

From tribal knowledge to version-controlled YAML — the producer-consumer bargain that stops silent breakage

The one-sentence invariant: a data contract is a machine-readable, version-controlled, producer-signed agreement that declares the schema, the SLA, the quality checks, and the ownership of a dataset — so any producer-side change that violates any of those axes is blocked before it lands, and every consumer team can reason about their upstream without asking a human. The pre-contract world runs on tribal knowledge — the producer team knows some downstream exists but not which, the consumer team knows the shape of yesterday's data but not tomorrow's, and every "silent schema drift" incident traces back to the same root cause: the agreement lived only in someone's head.

The four axes interviewers actually probe.

Schema. Column names, types, nullability, PII flags, unique constraints. The most obvious axis and the one most breakages come from. A rename, a type change, or a column drop without a deprecation window is a contract violation.
SLA. Freshness (how stale can the data be), availability (what fraction of the day is it queryable), and query latency where the table backs an interactive surface. SLAs give the contract teeth: a schema-conformant but four-hours-stale table is still a broken contract if the freshness SLA is 15 minutes.
Quality. Assertions the producer commits to keep true — order_total >= 0, country_code IN (ISO-3166 list), null(order_id) == 0. Enforced at write time by the producer or at read time by the consumer; violations either block or route to a dead-letter queue.
Ownership. The producer team, the consumer team(s), the escalation path. A contract without owners is a contract nobody enforces; a contract with owners is the single artefact the on-call opens when things go wrong.

Why 2026 is the "data contracts have won" year.

ODCS v3.x is the emerging standard. The Open Data Contract Standard shipped 3.x in 2025 with widespread adoption across Bitol / Linux Foundation projects. Vendor-neutral YAML shape means the same file drives dbt contracts, schema registry subjects, Great Expectations quality checks, and BI-tool metadata.
Confluent Schema Registry + dbt contracts complement. Streaming schemas live in the registry (Avro / Protobuf / JSON with compatibility modes); batch/warehouse schemas live in dbt model contracts. ODCS is the single YAML that either side derives from.
The tooling is mature. GitHub Actions + odcs-lint catch pre-merge violations; Confluent's compatibility check catches producer-side incompatible schemas; consumer-side validators (fastavro, protobuf runtime, JSON Schema) reject rogue payloads at runtime.
The organisational pattern is understood. "Producer-signed, consumer-consumed, platform-enforced" is now the shorthand — the producer owns the contract file in their repo, consumers depend on the version, the platform team runs the enforcement infrastructure.

What a contract actually contains.

id and version. A stable identifier (orders.public) plus a semver version. Consumers pin to a major version; producers bump minor for backwards-compatible additions and major for breaking changes.
schema. Column list with name, type, description, nullable, PII flag, and unique/primary-key constraints. The rendered form drives Avro / Protobuf / dbt contract generation.
sla. Freshness threshold (e.g. max_freshness: PT15M), availability target (e.g. availability_pct: 99.9), and query latency SLO if the table backs an interactive surface.
quality. List of assertions — null-rate ceilings, uniqueness constraints, range checks, referential integrity. Each assertion has a severity (warn or block) and a rollout phase.
roles. Producer team, consumer team(s), platform steward, on-call rotation. Free-form YAML but tools like Backstage and DataHub already parse these into service-catalogue entries.

What interviewers listen for.

Do you name the four axes schema, SLA, quality, ownership without prompting? — senior signal.
Do you distinguish schema-only agreements (registry) from full contracts (ODCS)? — senior signal.
Do you describe the contract as a producer-signed, version-controlled artefact rather than a "documentation page"? — required.
Do you name CI + runtime as the two enforcement points? — required.
Do you mention deprecation windows and rollout phases when discussing schema changes? — senior signal.

Worked example — the silent-rename incident and how a contract prevents it

Detailed explanation. The canonical failure story: an upstream microservice team renames user_id to customer_id in the orders topic to align with a company-wide naming convention. The change ships in a Friday afternoon release. By Monday morning, twelve downstream dbt models have failed overnight, three ML feature pipelines are producing NULLs, and the executive dashboard is empty. The on-call spends four hours tracing the failure back to the rename. Walk an interviewer through what would have happened with a data contract in place.

The symptom. Downstream models fail with column "user_id" does not exist.
The root cause. Producer renamed a column with no deprecation notice.
The pre-contract discovery time. Four hours of Slack, grep, and git bisect.
The post-contract discovery time. Zero — the producer's PR fails CI with "removes column user_id from contract orders.public.v1 — this is a breaking change; open a v2 with deprecation window or use deprecated: true."

Question. Given a producer team, a consumer team, and a shared orders topic, design the data-contract flow that catches the rename at PR time and offers the producer a paved path to deprecate the old name.

Input.

Party	Before contract	After contract
Producer team	Renames column, ships Friday	Opens PR against `orders/contract.yaml`; CI fails on breaking change
Consumer team (dbt)	Discovers failure Monday	Sees deprecation notice via contract diff subscription
Platform team	Fields angry incident tickets	Runs lint + registry compat check; ships alerts
Data	Silently broken over weekend	Continues to flow; new column added; old kept until v2

Code.

# orders/contract.yaml — ODCS-shaped data contract, committed to the producer repo
apiVersion: v3.0.0
kind: DataContract
id: orders.public
version: 1.3.0
info:
  title: Orders — public model
  owner: team-checkout
  status: active
  description: One row per confirmed order in the checkout flow.
schema:
  - name: order_id
    type: STRING
    required: true
    unique: true
    description: Stable id of the order.
  - name: user_id
    type: STRING
    required: true
    pii: true
    description: Foreign key to users.public.
    deprecated: false
  - name: order_total
    type: DECIMAL(12,2)
    required: true
    description: Order total in USD.
  - name: created_at
    type: TIMESTAMP
    required: true
    description: UTC creation timestamp.
sla:
  max_freshness: PT15M
  availability_pct: 99.9
quality:
  - assertion: null_rate(order_id) == 0
    severity: block
  - assertion: unique(order_id)
    severity: block
  - assertion: null_rate(user_id) < 0.001
    severity: warn
roles:
  producer: team-checkout
  consumers:
    - team-analytics
    - team-fraud
    - team-marketing
  steward: platform-data

# .github/workflows/contract-lint.yml: the CI gate
name: contract-lint
on: [pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so the base branch is available for the diff
      - name: Install odcs-lint
        run: pip install odcs-lint==0.9.0
      - name: Validate contract shape
        run: odcs-lint orders/contract.yaml
      - name: Diff against main branch
        run: odcs-lint diff --base main --path orders/contract.yaml
      # If diff removes a column or narrows a type without a version bump,
      # odcs-lint exits non-zero and the PR is blocked.

Step-by-step explanation.

The producer opens a PR that renames user_id → customer_id. The diff on orders/contract.yaml removes one column entry and adds another. odcs-lint diff --base main classifies this as a breaking change — removing a column that consumers depend on.
The lint job exits non-zero. The GitHub required-status-check blocks the merge. A comment on the PR reads: Breaking: column user_id removed. Options: (a) bump version to 2.0.0 with a deprecation window; (b) keep user_id and add customer_id in parallel; (c) mark user_id deprecated: true for 30 days.
The producer follows option (b): they keep user_id, add customer_id, and bump the contract minor version to 1.4.0 (additive change). The PR now passes.
Consumer teams subscribed to the contract file (via GitHub Actions or DataHub webhook) receive a diff notification. They have until the eventual 2.0.0 removal to migrate their queries.
In parallel, a scheduled job posts a deprecated: true marker on user_id after 30 days, and the eventual 2.0.0 removes the column. Every consumer had a paved migration path; no Monday-morning outage.

Output.

Metric	Pre-contract	Post-contract
Time to discover breaking rename	4 hours	0 (PR-time)
Downstream models broken	12	0
Producer-consumer trust	eroded	preserved
Migration path	ad-hoc	30-day deprecation window

Rule of thumb. A data contract is not "documentation" — it is a code-checked artefact in the producer's repo whose diff is a breaking-change classifier. Anything that changes column presence, types, or nullability without a version bump is a CI failure. The producer chooses between "bump minor with additive change" and "bump major with deprecation window"; they cannot silently ship a rename.
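To make "the diff is a breaking-change classifier" concrete, here is a minimal sketch of the comparison a lint step could run over two versions of the schema block. It is not odcs-lint's implementation; the function name classify_schema_diff and the rule that any type change counts as breaking are assumptions made for illustration (a real tool would distinguish widening from narrowing).

```python
from typing import Dict, List

Column = Dict[str, object]


def classify_schema_diff(old: List[Column], new: List[Column]) -> List[str]:
    """List the breaking changes between two contract schema blocks."""
    old_cols = {c["name"]: c for c in old}
    new_cols = {c["name"]: c for c in new}
    breaking = []
    for name, before in old_cols.items():
        after = new_cols.get(name)
        if after is None:
            breaking.append(f"column {name} removed")
            continue
        # Assumption: every type change is treated as breaking.
        if after["type"] != before["type"]:
            breaking.append(f"column {name} type changed: {before['type']} -> {after['type']}")
        if after.get("required", False) and not before.get("required", False):
            breaking.append(f"column {name} became required")
    return breaking


main_schema = [
    {"name": "order_id", "type": "STRING", "required": True},
    {"name": "user_id", "type": "STRING", "required": True},
]
# The rename PR: user_id disappears, customer_id appears.
rename_pr = [
    {"name": "order_id", "type": "STRING", "required": True},
    {"name": "customer_id", "type": "STRING", "required": True},
]
# Option (b): keep user_id and add customer_id in parallel.
parallel_pr = main_schema + [{"name": "customer_id", "type": "STRING", "required": True}]

print(classify_schema_diff(main_schema, rename_pr))    # ['column user_id removed']
print(classify_schema_diff(main_schema, parallel_pr))  # []
```

A CI step would exit non-zero whenever the returned list is non-empty and the version bump is not major.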

Worked example — the freshness SLA violation nobody noticed

Detailed explanation. A less obvious but equally common failure mode: the schema is unchanged but the freshness SLA is silently violated. A batch pipeline that was supposed to refresh every 15 minutes starts refreshing every 4 hours because someone changed the Airflow cron. The dashboard shows "up-to-date" because the freshness is not measured. Executives make decisions on stale data. Discovery happens weeks later when a finance reconciliation fails.

The pre-SLA world. Freshness is a footnote in the model description; nobody polls it.
The post-SLA world. The contract declares max_freshness: PT15M; a runtime probe compares max(created_at) against now() every 5 minutes and pages on violation.
The signal. A conformance test that runs on a schedule and asserts contract freshness — not a schema check.

Question. Given the same orders contract with max_freshness: PT15M, design the runtime probe that detects the 4-hour cron drift and pages the producer within 20 minutes.

Input.

Element	Value
Contract freshness SLA	15 minutes
Probe cadence	5 minutes
Alert threshold	freshness > 15 minutes for 3 consecutive probes
Pager destination	team-checkout (producer)

Code.

# freshness_probe.yml — a scheduled job that reads the contract SLA
# and compares against actual max(created_at)
apiVersion: v1
kind: FreshnessProbe
contract: orders.public
version: 1.4.0
sla:
  max_freshness: PT15M
probe:
  cadence: PT5M
  query: |
    SELECT
      EXTRACT(EPOCH FROM (NOW() - MAX(created_at))) AS staleness_seconds
    FROM warehouse.orders;
  alert_after_consecutive_failures: 3
  alert:
    channel: pagerduty
    service: team-checkout

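# probe_runner.py: parses the contract SLA and runs the probe
import time

import psycopg2
import requests
import yaml
from isodate import parse_duration

ALERT_AFTER = 3  # consecutive breaches before paging

def load_sla(contract_path):
    """Load the whole contract; the SLA is read from its sla block."""
    with open(contract_path) as f:
        return yaml.safe_load(f)

def evaluate(contract, warehouse_conn_str):
    """Return (breach, staleness_seconds) for the contract's freshness SLA."""
    sla = parse_duration(contract["sla"]["max_freshness"])
    conn = psycopg2.connect(warehouse_conn_str)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT EXTRACT(EPOCH FROM (NOW() - MAX(created_at))) FROM warehouse.orders;")
            (staleness_s,) = cur.fetchone()
    finally:
        conn.close()  # do not leak one connection per probe
    if staleness_s is None:  # empty table: MAX() is NULL, treat as a breach
        return True, None
    staleness_s = float(staleness_s)
    return staleness_s > sla.total_seconds(), staleness_s

def page(service):
    # routing_key must be the service's Events API v2 integration key;
    # the team slug passed below is a placeholder.
    requests.post("https://events.pagerduty.com/v2/enqueue", timeout=10,
                  json={"routing_key": service, "event_action": "trigger",
                        "dedup_key": "orders-freshness-sla",
                        "payload": {"summary": "orders freshness SLA violated",
                                    "severity": "error", "source": "contract-probe"}})

if __name__ == "__main__":
    consecutive = 0
    while True:
        contract = load_sla("orders/contract.yaml")
        breach, staleness_s = evaluate(contract, "postgres://…")
        if breach:
            consecutive += 1
            if consecutive == ALERT_AFTER:  # page once per incident, not on every probe
                page("team-checkout")
        else:
            consecutive = 0
        time.sleep(300)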

Step-by-step explanation.

The probe reads the same orders/contract.yaml the producer signed. There is exactly one source of truth for the freshness SLA — no drift between the contract and the alert threshold.
Every 5 minutes, the probe measures actual staleness against MAX(created_at). PT15M (ISO-8601) parses to 900 seconds via isodate; the probe compares staleness_seconds > 900.
To suppress noise from single-scrape flakes, the alert fires only after 3 consecutive breaches. At a 5-minute cadence that means the violation has persisted across at least 10 minutes of probing. Transient blips (a slow refresh cycle finishing 30 seconds late) do not page.
When the cron drift lands, the producer is paged roughly 10 to 15 minutes after the SLA is first breached, inside the 20-minute target. Discovery time drops from "weeks after finance reconciliation" to "before executives log in Tuesday morning."
The contract shape means the alert code is generic — any dataset with an ODCS contract that declares sla.max_freshness gets a probe by convention. New datasets do not require a new alert; they inherit the probe.
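The timing in the steps above is easy to get wrong in an interview, so here is a small simulation of the consecutive-breach rule. It is a sketch under stated assumptions: the last successful refresh happens at t = 0, nothing lands afterwards (so staleness equals elapsed time), and first_page_time is a hypothetical helper, not part of any probe library.

```python
from typing import Optional


def first_page_time(sla_s: int, cadence_s: int, needed: int,
                    first_probe_s: int, horizon_s: int) -> Optional[int]:
    """Seconds after the last refresh at which the pager fires, or None."""
    consecutive = 0
    t = first_probe_s
    while t <= horizon_s:
        staleness = t  # assumption: no new rows since t = 0
        if staleness > sla_s:
            consecutive += 1
            if consecutive == needed:
                return t
        else:
            consecutive = 0
        t += cadence_s
    return None


# SLA 15 min, probe every 5 min, page on the 3rd consecutive breach, 4-hour horizon.
worst = first_page_time(900, 300, 3, first_probe_s=0, horizon_s=14_400)
best = first_page_time(900, 300, 3, first_probe_s=1, horizon_s=14_400)
print(worst, worst - 900)  # 1800 900  -> paged 15 min after the SLA breach
print(best, best - 900)    # 1501 601  -> paged about 10 min after the SLA breach
```

Depending on how the probe schedule lines up with the breach, the page lands between roughly 10 and 15 minutes after the data first exceeds the 15-minute freshness SLA.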

Output.

Metric	Pre-SLA probe	Post-SLA probe
Time to detect 4-hour cron drift	weeks	20 minutes
Alert configuration	per-dataset ad hoc	generic, contract-driven
Downstream consumer confidence	low (staleness invisible)	high (SLA published)
Executive dashboard trust	broken by unmeasured drift	restored by probes

Rule of thumb. A contract without runtime SLA probes is a contract only for schema. To turn a schema-contract into a full data-contract, every declared SLA (freshness, availability, quality) needs an automated probe that reads the contract and pages the producer on breach. The probe code is generic; the SLA is data.

Worked example — the ownership handshake

Detailed explanation. Contracts fail not because the tech is wrong but because nobody signs. A common anti-pattern: the platform team writes the contract, checks it in, and considers the job done. The producer team never reviews it; when the contract is violated, the producer says "we didn't sign this" and refuses to fix. Show how the roles block plus a governance flow makes the sign-off explicit and auditable.

The failure. Contract exists but ownership is unenforced; producer disowns it.
The fix. The contract's roles.producer field is a CODEOWNERS entry; the producer team is a required reviewer on the contract file; the sign-off is git-history evidence.
The escalation. If the producer refuses to sign, the platform team escalates to a data governance council; the alternative is deprecating the dataset entirely.

Question. Design the GitHub CODEOWNERS + branch-protection setup that forces producer sign-off on every contract change, and shows the audit trail.

Input.

Component	Value
Producer team	team-checkout
Consumer teams	team-analytics, team-fraud, team-marketing
Contract file	orders/contract.yaml
Required reviewers	1 from team-checkout + 1 from platform-data

Code.

# .github/CODEOWNERS
# The producer team owns the contract file; every change must be signed by them.
orders/contract.yaml                @company/team-checkout @company/platform-data

# .github/branch-protection.yml — enforced via probot or Terraform
branch: main
required_status_checks:
  - contract-lint
  - registry-compat-check
required_pull_request_reviews:
  required_approving_review_count: 2
  require_code_owner_reviews: true
enforce_admins: true

# audit-log-example — captured from GitHub's PR API and archived to warehouse
pr:
  number: 4217
  title: "orders: add customer_email column, bump to 1.5.0"
  author: producer-eng-1
  approvers:
    - alice@company (team-checkout, code owner)
    - bob@company   (platform-data, code owner)
  merged_at: 2026-07-01T14:22:03Z
  ci_status: contract-lint=pass registry-compat-check=pass

Step-by-step explanation.

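CODEOWNERS binds the contract file to two GitHub teams: team-checkout (the producer) and platform-data (the steward). GitHub treats owners listed on the same line as alternatives, so an approval from any one listed owner satisfies the code-owner requirement; CODEOWNERS alone does not guarantee a review from both teams.
Branch protection enforces the code-owner review requirement, two approving reviews, and the two CI checks (contract-lint and registry-compat-check). Guaranteeing one approval from each team takes an additional check, for example a required status check that inspects the PR's approvers. Admins cannot bypass because enforce_admins: true.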
Every contract change now has a git-history sign-off from the producer team. If the producer later disowns a violation, the audit log shows the exact reviewer who approved the contract at that version. Escalation goes to a governance council with evidence.
Consumers do not need to review the contract file — they subscribe to it via releases. Requiring consumer sign-off on every contract change would slow the producer to a crawl; the deprecation window is how consumer interests are protected instead.
The audit trail is archived to the warehouse for compliance queries — "who approved orders.public v1.4 → v1.5" is a SQL question, not a "please check GitHub" ticket.

Output.

Governance layer	Enforcement mechanism
Producer sign-off	CODEOWNERS + required review
Steward sign-off	CODEOWNERS + required review
Schema safety	contract-lint CI gate
Registry safety	registry-compat-check CI gate
Audit trail	git history + warehouse archive

Rule of thumb. A contract without an owner is a wish. Bind the contract file to a CODEOWNERS entry, require review from that team, and archive the audit trail. The producer cannot claim ignorance; the consumer has a paper trail; the platform team can escalate with evidence.
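Because GitHub accepts an approval from any single listed code owner, the "1 from team-checkout + 1 from platform-data" requirement needs its own check. A minimal sketch of that logic follows; missing_team_approvals is a hypothetical helper, and the team membership is hard-coded here where a real check would fetch approvers and team members from the GitHub API.

```python
from typing import Dict, Iterable, Set


def missing_team_approvals(approvers: Iterable[str],
                           team_members: Dict[str, Set[str]],
                           required_teams: Iterable[str]) -> Set[str]:
    """Return the required teams that have no approving reviewer on the PR."""
    approver_set = set(approvers)
    return {team for team in required_teams
            if not (team_members.get(team, set()) & approver_set)}


# Hypothetical membership, for illustration only.
teams = {
    "team-checkout": {"alice", "carol"},
    "platform-data": {"bob"},
}
required = ["team-checkout", "platform-data"]

print(missing_team_approvals(["alice", "bob"], teams, required))    # set()
print(missing_team_approvals(["alice", "carol"], teams, required))  # {'platform-data'}
```

The second call is the case plain CODEOWNERS lets through: two approvals, both from the producer team, none from the steward.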

Senior interview question on the four axes of a data contract

A senior interviewer often opens with: "You're introducing data contracts to an organisation that has never had them. Walk me through the four axes you'd insist every contract cover, the enforcement gates, the rollout phases, and the failure modes you'd guard against in the first quarter."

Solution Using the four-axis contract + phased rollout + dual-gate enforcement

# The four-axis contract template — every dataset gets one of these
apiVersion: v3.0.0
kind: DataContract
id: <domain>.<name>
version: <semver>
info:
  title: <human title>
  owner: <team-slug>
  status: active | deprecated | draft
# 1. SCHEMA — the shape
schema:
  - name: <column>
    type: <STRING | INT | DECIMAL | TIMESTAMP | STRUCT | ARRAY>
    required: <bool>
    unique: <bool>
    pii: <bool>
    description: <text>
    deprecated: <bool>
# 2. SLA — the timing and availability
sla:
  max_freshness: <ISO-8601 duration>
  availability_pct: <float 0..100>
  query_latency_p99: <ISO-8601 duration>
# 3. QUALITY — the assertions
quality:
  - assertion: <expression>
    severity: warn | block
    rollout: warn_only | block_on_new | block_all
# 4. OWNERSHIP — the humans
roles:
  producer: <team-slug>
  consumers:
    - <team-slug>
  steward: <team-slug>
  on_call_rotation: <schedule>

Rollout phases — introducing contracts to an org
================================================

Phase 1 (weeks 0-4) — warn-only
   All contracts run in observe mode; violations logged, no blocking
   Purpose: discover how bad the current state is; shame nobody

Phase 2 (weeks 4-8) — block on new datasets
   Every net-new dataset requires a signed contract before launch
   Existing datasets stay in warn-only

Phase 3 (weeks 8-16) — block on new contract violations
   Existing datasets: any producer PR that would violate the contract is blocked
   Existing violations grandfathered with a 30-day deprecation timer

Phase 4 (week 16+) — block everywhere
   Every dataset has a signed contract; every PR is gated
   Runtime SLA probes cover every declared freshness/availability threshold

Step-by-step trace.

Axis	Enforcement point	Failure mode caught
Schema	contract-lint CI + registry-compat-check	column rename / drop / type narrow
SLA — freshness	runtime probe, pages producer	cron drift, backfill hang
SLA — availability	uptime probe, external synthetic	source system outage
Quality — assertion	dbt test / Great Expectations at write	null spike, uniqueness break
Ownership	CODEOWNERS + required review	disowned contract, missing sign-off

The four-axis template plus phased rollout ships to production without organisational whiplash. Warn-only reveals how bad the current state is (usually shocking); the phased blocking lets producers migrate at their own pace; by phase 4 every dataset is contracted and every SLA is probed.
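The four phases reduce to a small decision rule: given the current phase and what kind of violation was found, either log it or block it. The sketch below encodes the ladder as described above; the function name gate_action and its boolean inputs are assumptions for illustration, not part of any contract tool.

```python
def gate_action(phase: int, is_new_dataset: bool, is_new_violation: bool) -> str:
    """Return 'log' or 'block' for a contract violation under the four-phase ladder."""
    if phase <= 1:
        return "log"  # phase 1: warn-only, observe everything
    if phase == 2:
        # phase 2: only net-new datasets are gated
        return "block" if is_new_dataset else "log"
    if phase == 3:
        # phase 3: new violations on existing datasets are gated too;
        # pre-existing violations stay grandfathered
        return "block" if (is_new_dataset or is_new_violation) else "log"
    return "block"  # phase 4: block everywhere


print(gate_action(1, True, True))    # log
print(gate_action(2, False, True))   # log
print(gate_action(3, False, True))   # block
print(gate_action(3, False, False))  # log
print(gate_action(4, False, False))  # block
```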

Output:

Milestone	Result
End of week 4	Warn-only visibility live; violation baseline measured
End of week 8	New datasets 100% contracted
End of week 16	Existing datasets 100% contracted; PR gating live
Week 16 onward (phase 4)	Runtime probes on every SLA; incident rate down ~70% (illustrative target)

Why this works — concept by concept:

Four axes are load-bearing — schema alone is a subset of a contract, not the whole thing. SLA, quality, and ownership carry the same weight; treating any of them as optional creates the exact incidents contracts were introduced to prevent.
Dual-gate enforcement — CI catches the producer before the change ships; runtime probes catch the payload after the change ships. Neither gate is sufficient alone. CI protects against intent; runtime protects against reality.
Phased rollout — going straight from "no contracts" to "block everywhere" produces a wave of overnight breakages and destroys organisational trust. The four-phase ladder (warn → block-new → block-existing → block-all) lets the org adapt without pain.
CODEOWNERS is the ownership primitive — the contract's roles block is documentation; the CODEOWNERS binding is enforcement. Producers cannot claim ignorance when git history shows their team-lead approved the contract.
Cost: contract files are cheap (a few dozen lines of YAML per dataset); the enforcement infrastructure is one-time (CI actions, a probe runner, a lint tool). The avoided cost of one "silent-rename-broke-twelve-models" incident (~40 engineer hours) goes a long way toward paying for the quarter's setup.
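The runtime half of the dual gate can be sketched as a consumer-side check that validates each record against the contract's schema block before processing it. This is a simplified illustration, not a replacement for Avro or JSON Schema validation; validate_record and the PYTHON_TYPES mapping are assumptions introduced here.

```python
from datetime import datetime
from decimal import Decimal

# Assumed mapping from contract types to Python types, for illustration only.
PYTHON_TYPES = {"STRING": str, "INT": int, "DECIMAL": Decimal,
                "TIMESTAMP": datetime, "BOOLEAN": bool}


def validate_record(record: dict, schema: list) -> list:
    """Return a list of contract violations for one record (empty means valid)."""
    errors = []
    for col in schema:
        name = col["name"]
        value = record.get(name)
        if value is None:
            if col.get("required", False):
                errors.append(f"{name}: required field missing")
            continue
        base_type = col["type"].split("(")[0]  # DECIMAL(12,2) -> DECIMAL
        expected = PYTHON_TYPES.get(base_type)
        if expected is not None and not isinstance(value, expected):
            errors.append(f"{name}: expected {base_type}, got {type(value).__name__}")
    return errors


schema = [
    {"name": "order_id", "type": "STRING", "required": True},
    {"name": "user_id", "type": "STRING", "required": True},
    {"name": "order_total", "type": "DECIMAL(12,2)", "required": True},
]
good = {"order_id": "o-1", "user_id": "u-9", "order_total": Decimal("19.99")}
renamed = {"order_id": "o-1", "customer_id": "u-9", "order_total": Decimal("19.99")}

print(validate_record(good, schema))     # []
print(validate_record(renamed, schema))  # ['user_id: required field missing']
```

A consumer would route records with a non-empty error list to a dead-letter queue rather than let them reach downstream models.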

2. Open Data Contract Standard (ODCS) — the schema

One YAML per producer — the ODCS v3.x shape that drives every downstream tool

The mental model in one line: ODCS v3.x is a vendor-neutral YAML schema whose top-level blocks are info, schema, sla, quality, roles, and servers — one file per dataset, checked into the producer's repo, and consumed by every downstream tool (dbt contracts, Confluent Schema Registry, Great Expectations, DataHub, Backstage) as the single source of truth. Every ODCS interview question is a variant of "walk me through the file" or "what's the semver rule for a breaking change."

The top-level blocks.

apiVersion + kind + id + version. The four identity fields. apiVersion: v3.0.0 locks the schema shape; kind: DataContract disambiguates from other Bitol kinds; id is a stable dotted identifier (orders.public); version is semver.
info. Human metadata — title, owner, status (draft / active / deprecated), description, tags. Rendered by DataHub / Backstage as the catalogue entry.
schema. The column list. Each entry is name + type + required + unique + pii + description + deprecated. This block is the source that dbt contracts and Avro / Protobuf generators consume.
sla. Freshness (max_freshness), availability (availability_pct), query latency (query_latency_p99), and retention (retention: P90D). All ISO-8601 durations and percentages.
quality. List of assertion + severity + rollout entries. Each assertion is a boolean expression the producer commits to keep true. Severity is warn or block; rollout is the phase (warn_only, block_on_new, block_all).
roles. Producer + consumers + steward + on-call rotation. Free-form YAML; tools parse into service catalogues.
servers. Physical resolution — where the data lives. Warehouse table (snowflake://…), Kafka topic (kafka://cluster/topic), object store (s3://bucket/prefix). Multiple entries when the same logical dataset has multiple materialisations.
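A first-pass shape check over those blocks fits in a few lines. The sketch below works on an already-parsed contract (a plain dict, as yaml.safe_load would return) and assumes every block listed above is mandatory, which is a simplification for illustration; contract_problems is a hypothetical helper.

```python
# Assumption: all top-level blocks described above are treated as required.
EXPECTED_BLOCKS = ("apiVersion", "kind", "id", "version", "info",
                   "schema", "sla", "quality", "roles", "servers")


def contract_problems(contract: dict) -> list:
    """Return shape problems found in a parsed contract (empty means OK)."""
    problems = [f"missing block: {key}" for key in EXPECTED_BLOCKS if key not in contract]
    if contract.get("kind") not in (None, "DataContract"):
        problems.append(f"unexpected kind: {contract['kind']}")
    for i, column in enumerate(contract.get("schema", [])):
        for field in ("name", "type"):
            if field not in column:
                problems.append(f"schema[{i}] missing {field}")
    return problems


draft = {
    "apiVersion": "v3.0.0", "kind": "DataContract",
    "id": "orders.public", "version": "1.4.0",
    "info": {"owner": "team-checkout"},
    "schema": [{"name": "order_id"}],  # type forgotten
    "sla": {"max_freshness": "PT15M"},
    "quality": [], "roles": {"producer": "team-checkout"},
}
print(contract_problems(draft))
# ['missing block: servers', 'schema[0] missing type']
```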

The version semantics.

Patch (1.4.0 → 1.4.1). Description-only change; no schema, SLA, or quality change.
Minor (1.4.0 → 1.5.0). Additive change: new column with required: false, new quality assertion with severity: warn, new consumer added to roles.consumers. Backwards compatible; existing consumers are unaffected.
Major (1.4.0 → 2.0.0). Breaking change: column removed, type narrowed, required: false → required: true, SLA tightened (freshness reduced), or quality severity raised. Consumers must migrate; a deprecation window is required.
The lint gate. odcs-lint diff --base main inspects the diff between the incoming PR and main; if the change is breaking and the version bump is not major, the CI fails.
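The version half of that gate is simple arithmetic on the semver triple. A minimal sketch, assuming plain MAJOR.MINOR.PATCH strings with no pre-release suffixes; bump_level and bump_is_sufficient are illustrative names, not odcs-lint functions.

```python
RANK = {"none": 0, "patch": 1, "minor": 2, "major": 3}


def parse_version(version: str) -> tuple:
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def bump_level(old: str, new: str) -> str:
    """Classify the bump from old to new as major, minor, patch, or none."""
    o, n = parse_version(old), parse_version(new)
    if n[0] > o[0]:
        return "major"
    if n[0] == o[0] and n[1] > o[1]:
        return "minor"
    if n[:2] == o[:2] and n[2] > o[2]:
        return "patch"
    return "none"


def bump_is_sufficient(old: str, new: str, required: str) -> bool:
    """True when the actual bump is at least as large as the change demands."""
    return RANK[bump_level(old, new)] >= RANK[required]


print(bump_level("1.4.0", "1.4.1"))                   # patch
print(bump_is_sufficient("1.4.0", "1.5.0", "minor"))  # True: additive change, minor bump
print(bump_is_sufficient("1.4.0", "1.5.0", "major"))  # False: breaking change needs 2.0.0
print(bump_is_sufficient("1.4.0", "2.0.0", "major"))  # True
```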

The type system.

Primitives. STRING, BOOLEAN, INT (with subtypes TINYINT / SMALLINT / BIGINT), FLOAT, DOUBLE, DECIMAL(p,s), TIMESTAMP (with optional timezone), DATE, TIME.
Structured. STRUCT<...> for nested records; ARRAY<T> for lists; MAP<K,V> for maps. Nested schemas can carry their own pii flags and required semantics.
Semantic tags. format: email, format: iso-country-code, pii: true, sensitive: true. Used by DLP tooling and catalogues.
Bindings. Each primitive has a canonical mapping into Avro, Protobuf, JSON Schema, and warehouse SQL types. ODCS lint verifies the binding is unambiguous.
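As one example of a binding, here is a sketch that turns a contract column into an Avro field definition. The Avro side uses standard constructs (a bytes-backed decimal logical type, timestamp-micros on long, a null-first union for optional fields); the mapping table itself is an assumption made for illustration, not a table published by the standard.

```python
import json

# Assumed contract-type to Avro-type mapping, for illustration only.
SIMPLE_AVRO = {
    "STRING": "string",
    "BOOLEAN": "boolean",
    "INT": "int",
    "BIGINT": "long",
    "FLOAT": "float",
    "DOUBLE": "double",
    "TIMESTAMP": {"type": "long", "logicalType": "timestamp-micros"},
    "DATE": {"type": "int", "logicalType": "date"},
}


def to_avro_field(column: dict) -> dict:
    """Render one contract column as an Avro record field."""
    contract_type = column["type"]
    if contract_type.startswith("DECIMAL("):
        precision, scale = (int(p) for p in contract_type[len("DECIMAL("):-1].split(","))
        avro_type = {"type": "bytes", "logicalType": "decimal",
                     "precision": precision, "scale": scale}
    else:
        avro_type = SIMPLE_AVRO[contract_type]
    if not column.get("required", False):
        # Optional columns become a nullable union with a null default.
        return {"name": column["name"], "type": ["null", avro_type], "default": None}
    return {"name": column["name"], "type": avro_type}


print(json.dumps(to_avro_field({"name": "order_total", "type": "DECIMAL(12,2)", "required": True})))
print(json.dumps(to_avro_field({"name": "coupon", "type": "STRING", "required": False})))
```

Making optional columns nullable with a default is what keeps a later "add an optional column" change compatible for readers of older records.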

Common interview probes on ODCS.

"Walk me through the top-level blocks." — required.
"What's the semver rule for adding a new optional column?" — minor bump; additive.
"What's the rule for tightening a freshness SLA?" — major bump; consumers may not survive tighter constraints.
"How does ODCS relate to Schema Registry?" — ODCS is the source; Schema Registry is a derived materialisation for streaming.

Worked example — the 40-line orders contract

Detailed explanation. The canonical example: a public orders dataset produced by the checkout team and consumed by three downstream teams. Show the full ODCS YAML with every block filled in, and explain each field.

The dataset. One row per confirmed order, materialised in Snowflake and mirrored to a Kafka topic.
The SLA. 15-minute freshness (dashboards refresh every 15 min), 99.9% availability, retention 90 days.
The quality. Non-null order_id, unique order_id, order_total >= 0, country_code in ISO-3166.
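The four quality commitments can be read as plain predicates over a batch of rows. The sketch below evaluates them in memory; in production they would more likely run as dbt tests or Great Expectations checks. The helper names and the three-code country sample are assumptions for illustration, not the full ISO-3166 list.

```python
def null_rate(rows: list, column: str) -> float:
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def is_unique(rows: list, column: str) -> bool:
    """True when no non-null value of the column repeats."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(values) == len(set(values))


def check_orders(rows: list, valid_countries: set) -> dict:
    """Evaluate the orders contract's quality assertions; True means the assertion holds."""
    return {
        "null_rate(order_id) == 0": null_rate(rows, "order_id") == 0,
        "unique(order_id)": is_unique(rows, "order_id"),
        "order_total >= 0": all(r["order_total"] >= 0 for r in rows
                                if r.get("order_total") is not None),
        "country_code in ISO-3166": all(r.get("country_code") in valid_countries for r in rows),
    }


SAMPLE_COUNTRIES = {"US", "DE", "IN"}  # illustrative subset only
batch = [
    {"order_id": "o-1", "order_total": 10.0, "country_code": "US"},
    {"order_id": "o-1", "order_total": -5.0, "country_code": "ZZ"},
]
for assertion, holds in check_orders(batch, SAMPLE_COUNTRIES).items():
    print(assertion, "->", "ok" if holds else "VIOLATED")
```

On this batch only the null-rate assertion holds; the duplicate order_id, the negative total, and the unknown country code each violate one commitment.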

Question. Produce the full ODCS YAML for this dataset with each block explicitly filled and inline comments explaining the interviewer-signal fields.

Input.

Attribute	Value
Producer	team-checkout
Consumers	team-analytics, team-fraud, team-marketing
Storage	Snowflake (`analytics.public.orders`), Kafka (`orders.public.v1`)
Freshness	15 minutes
Availability	99.9%
Retention	90 days

Code.

# orders/contract.yaml — the 40-line reference
apiVersion: v3.0.0
kind: DataContract
id: orders.public
version: 1.4.0
info:
  title: Orders — public model
  owner: team-checkout
  status: active
  description: One row per confirmed order in the checkout flow.
  tags: [core, revenue, tier1]

schema:
  - name: order_id
    type: STRING
    required: true
    unique: true
    description: Stable id of the order.
  - name: user_id
    type: STRING
    required: true
    pii: true
    description: FK to users.public.
  - name: order_total
    type: DECIMAL(12,2)
    required: true
    description: Order total in USD.
  - name: country_code
    type: STRING
    required: true
    format: iso-country-code
    description: ISO-3166 alpha-2.
  - name: created_at
    type: TIMESTAMP
    required: true
    description: UTC creation timestamp.

sla:
  max_freshness: PT15M      # ISO-8601 duration — 15 minutes
  availability_pct: 99.9
  retention: P90D           # 90 days

quality:
  - assertion: null_rate(order_id) == 0
    severity: block
    rollout: block_all
  - assertion: unique(order_id)
    severity: block
    rollout: block_all
  - assertion: order_total >= 0
    severity: block
    rollout: block_on_new
  - assertion: country_code IN (<ISO-3166 alpha-2 codes>)   # placeholder: expand to the real code list
    severity: warn
    rollout: warn_only

roles:
  producer: team-checkout
  consumers: [team-analytics, team-fraud, team-marketing]
  steward: platform-data
  on_call_rotation: pagerduty/team-checkout-primary

servers:
  - type: snowflake
    dsn: snowflake://analytics.public.orders
  - type: kafka
    dsn: kafka://prod-cluster/orders.public.v1

Step-by-step explanation.

The apiVersion and kind lock the file's shape to ODCS v3.x. Every ODCS-aware tool (lint, dbt-contract-generator, registry-sync) reads these two fields first and refuses non-matching files.
id: orders.public is the stable identifier — the producer never renames it; only the version changes. Downstream tools reference this id in their model definitions, so changing it would break every reference in the org.
version: 1.4.0 is semver. The .4 minor version reflects four additive changes since 1.0.0; if the producer removes country_code next, they must bump to 2.0.0 and offer a deprecation window.
The schema block enumerates every column with type, required-ness, and PII flag. user_id carries pii: true — DLP tooling reads this and applies masking rules automatically. order_total uses DECIMAL(12,2) — exact monetary type, never FLOAT.
sla.max_freshness: PT15M is the ISO-8601 duration. The freshness probe reads this exactly and asserts now() - MAX(created_at) < PT15M. availability_pct: 99.9 sets a monthly uptime target — ~43 minutes of downtime per month is the budget.
Each quality assertion carries a severity and rollout phase. Block-severity assertions fail the producer's write; warn-severity assertions log and route to a dashboard. Rollout phases let the producer introduce a new assertion in warn_only mode first, promote to block_on_new for freshly-arriving rows, and finally block_all for existing rows once cleanup is done.
roles.producer: team-checkout is the CODEOWNERS anchor. The consumers list is the notification target for release announcements; the steward is the neutral third party who mediates escalations; the on-call rotation is where the freshness-SLA pager fires.
servers maps the logical dataset to physical resources. The same orders.public contract covers both a Snowflake table and a Kafka topic; consumers can subscribe to either surface and get the same guarantees.
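
The freshness and availability numbers in the sla block can be checked mechanically. The following is a minimal sketch, not part of any ODCS tooling: the function names (`parse_duration`, `is_fresh`, `downtime_budget`) are hypothetical, the parser handles only the day/hour/minute/second subset of ISO-8601 used in the contract above, and a 30-day month is assumed for the availability budget.

```python
import re
from datetime import datetime, timedelta, timezone

# Supports only the D / H / M / S subset of ISO-8601 durations (assumption).
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Turn 'PT15M' or 'P90D' into a timedelta."""
    match = _DURATION.match(text)
    if not match or text in ("P", "PT"):
        raise ValueError(f"unsupported ISO-8601 duration: {text!r}")
    parts = {k: int(v) for k, v in match.groupdict().items() if v is not None}
    return timedelta(**parts)


def is_fresh(latest_created_at, max_freshness, now):
    """True when now - MAX(created_at) is inside the freshness SLA."""
    return now - latest_created_at < parse_duration(max_freshness)


def downtime_budget(availability_pct, window=timedelta(days=30)):
    """Allowed downtime inside the window for a given availability target."""
    return window * (1 - availability_pct / 100)


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_fresh(now - timedelta(minutes=10), "PT15M", now))
print(downtime_budget(99.9))
```

Passing `now` in explicitly keeps the probe deterministic under test; a scheduler would supply the current UTC time.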

Output.

Block	Fields	Interviewer signal
info	title, owner, status, tags	core discovery metadata
schema	column list with types + pii	binds Avro / dbt contracts
sla	freshness + availability + retention	runtime probes drive from here
quality	assertions + severity + rollout	producer commitments
roles	producer + consumers + steward	governance anchor
servers	physical materialisation	consumer resolution target
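
The severity and rollout fields of the quality block read naturally as a small decision table. The sketch below is one possible interpretation, not behaviour defined by ODCS: the function name `decide` is hypothetical, and the assumption is that a failed check only rejects a row when the severity is block and the rollout phase covers that row.

```python
def decide(passed, severity, rollout, is_new_row):
    """Return 'accept', 'warn', or 'reject' for one row against one assertion."""
    if passed:
        return "accept"
    # Warn-severity checks and warn_only rollouts never block a write.
    if severity == "warn" or rollout == "warn_only":
        return "warn"
    if rollout == "block_on_new":
        # Only freshly arriving rows are blocked; historical rows just warn.
        return "reject" if is_new_row else "warn"
    if rollout == "block_all":
        return "reject"
    raise ValueError(f"unknown rollout phase: {rollout!r}")


# order_total >= 0 is block / block_on_new in the contract above.
print(decide(False, "block", "block_on_new", is_new_row=True))
print(decide(False, "block", "block_on_new", is_new_row=False))
```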

Rule of thumb. Every ODCS contract should fit on a screen or two. If yours is 400 lines, you're over-modelling: split it into one contract per logical dataset. The reference above, at roughly 65 lines, is the shape most production tables need; the exceptions are wide dimensional models with many columns.

Worked example — semver bump decisions

Detailed explanation. The single most common interview probe on ODCS: "given this change, what's the version bump?" Walk through eight realistic changes and classify each.

Rule 1. Additive + optional = minor.
Rule 2. Additive + required = major.
Rule 3. Any column drop = major.
Rule 4. Type narrowing = major; type widening = minor.
Rule 5. Tightening SLA = major; loosening SLA = minor.

Question. Given eight changes, classify each as patch, minor, or major, with a one-line justification.

Input.

#	Change
1	Add optional column `promo_code`
2	Add required column `channel` (no default)
3	Remove column `legacy_ref`
4	Change `order_total` from DECIMAL(10,2) to DECIMAL(12,2)
5	Change `order_total` from DECIMAL(12,2) to DECIMAL(10,2)
6	Tighten `max_freshness` from PT1H to PT15M
7	Loosen `max_freshness` from PT15M to PT1H
8	Fix a typo in `description`

Code.

# Diff shape 1 — additive optional (minor 1.4 → 1.5)
+  - name: promo_code
+    type: STRING
+    required: false

# Diff shape 2 — additive required (major 1.4 → 2.0)
+  - name: channel
+    type: STRING
+    required: true          # existing consumers have no default → breaking

# Diff shape 3 — column drop (major 1.4 → 2.0)
-  - name: legacy_ref
-    type: STRING

# Diff shape 4 — type widen (minor 1.4 → 1.5)
-  - name: order_total
-    type: DECIMAL(10,2)
+  - name: order_total
+    type: DECIMAL(12,2)     # larger precision → readers survive

# Diff shape 5 — type narrow (major 1.4 → 2.0)
-  - name: order_total
-    type: DECIMAL(12,2)
+  - name: order_total
+    type: DECIMAL(10,2)     # smaller precision → risk of truncation

# Diff shape 6 — tighten SLA (major)
-sla:
-  max_freshness: PT1H
+sla:
+  max_freshness: PT15M

# Diff shape 7 — loosen SLA (minor)
-sla:
-  max_freshness: PT15M
+sla:
+  max_freshness: PT1H

# Diff shape 8 — description typo (patch 1.4.0 → 1.4.1)
-    description: Order totall in USD.
+    description: Order total in USD.

Step-by-step explanation.

Change 1 adds an optional column — every existing consumer's query still works because the new column is unreferenced. Minor bump; no deprecation window needed.
Change 2 adds a required column with no default. Existing producer writes that don't set the column will fail; existing consumer queries that reference the schema (Avro / Protobuf) will have a new required field with no old-data value. Major bump; producer must also backfill or provide a default: <value>.
Change 3 removes a column. Any consumer referencing it breaks. Major bump; the correct pattern is deprecate first (mark deprecated: true in 1.5.0), then remove in 2.0.0 after the deprecation window (typically 30–90 days).
Change 4 widens DECIMAL(10,2) to DECIMAL(12,2). Every value that fit the old type still fits the new type. Minor bump; consumers whose downstream types are inferred at read time survive without action.
Change 5 narrows the same field. Existing values with 11 or 12 digits of precision will truncate or throw at write time. Major bump; producer must migrate historical data or accept the loss.
Change 6 tightens the freshness SLA. No consumer breaks directly: a pipeline that tolerated up to 1 hour of staleness also tolerates 15 minutes. The bump is still major by convention, because tightening changes the behaviour the producer promises, and consumers that derive their own SLAs from the upstream figure (for example "upstream SLA plus a 5-minute buffer") need to re-check them.
Change 7 loosens the SLA, so data may be older than before. From the consumer's perspective this can be breaking, since a downstream freshness SLA built on the 15-minute promise may now be violated. Under the convention this guide follows (Rule 5), loosening is nonetheless a minor bump: the producer's promise stays truthful, though weaker, and consumers can renegotiate.
Change 8 is a description fix. No schema, SLA, or quality change. Patch bump.

Output.

#	Change	Bump	Why
1	Add optional column	minor (1.5.0)	additive, non-breaking
2	Add required column	major (2.0.0)	old writes fail
3	Remove column	major (2.0.0)	consumers break
4	Widen type	minor (1.5.0)	existing values fit
5	Narrow type	major (2.0.0)	truncation risk
6	Tighten SLA	major (2.0.0)	promised behaviour changes
7	Loosen SLA	minor (1.5.0)	promise weaker but truthful
8	Typo in description	patch (1.4.1)	metadata only
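
The five rules can be turned into small helpers. This is a sketch under stated assumptions: `bump`, `required_bump`, and `decimal_change` are hypothetical names, and "widening" is taken to mean that neither the scale nor the number of integer digits shrinks.

```python
import re

LEVELS = ("patch", "minor", "major")  # ordered from least to most severe


def bump(version, level):
    """Apply a semver bump to a 'MAJOR.MINOR.PATCH' string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    if level == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump level: {level!r}")


def required_bump(levels):
    """A PR with several changes needs the most severe bump among them."""
    return max(levels, key=LEVELS.index)


_DECIMAL = re.compile(r"^DECIMAL\((\d+),\s*(\d+)\)$")


def _precision_scale(type_name):
    match = _DECIMAL.match(type_name)
    if match is None:
        raise ValueError(f"not a DECIMAL(p,s) type: {type_name!r}")
    return int(match.group(1)), int(match.group(2))


def decimal_change(old, new):
    """Classify a DECIMAL(p,s) change as 'none', 'minor' (widen) or 'major' (narrow)."""
    old_p, old_s = _precision_scale(old)
    new_p, new_s = _precision_scale(new)
    if (old_p, old_s) == (new_p, new_s):
        return "none"
    # Widening: every old value still fits (scale and integer digits both >= old).
    widened = new_s >= old_s and (new_p - new_s) >= (old_p - old_s)
    return "minor" if widened else "major"


print(bump("1.4.0", decimal_change("DECIMAL(10,2)", "DECIMAL(12,2)")))  # change 4
print(bump("1.4.0", decimal_change("DECIMAL(12,2)", "DECIMAL(10,2)")))  # change 5
```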

Rule of thumb. When in doubt, bump major and offer a deprecation window. The cost of a major bump is one PR; the cost of an unannounced breaking change is a war room. Producers never regret a conservative bump.

Worked example — the ODCS lint tool and its diff mode

Detailed explanation. The lint tool is the CI muscle that enforces the semver rules mechanically. Show its two modes — shape validation (does the YAML parse as a valid ODCS v3.x contract?) and diff validation (does the incoming PR conform to semver?) — and the failure output shape.

Shape mode. odcs-lint <path> verifies the file parses, all required blocks are present, types are canonical, and the semver format is valid.
Diff mode. odcs-lint diff --base main --path <file> compares the PR file against main; classifies changes as patch / minor / major; asserts the version bump matches.
Exit codes. 0 = pass; 1 = shape error; 2 = version mismatch.

Question. Show a producer PR that mistakenly removes a column while bumping only the minor version, and the lint failure output that catches it.

Input.

Code.

# contract.yaml on main (before PR)
version: 1.4.0
schema:
  - name: order_id
    type: STRING
  - name: legacy_ref
    type: STRING
  - name: order_total
    type: DECIMAL(12,2)

# contract.yaml in PR (buggy — minor bump but breaking change)
version: 1.5.0
schema:
  - name: order_id
    type: STRING
  # legacy_ref removed — breaking change!
  - name: order_total
    type: DECIMAL(12,2)

$ odcs-lint diff --base main --path orders/contract.yaml
[odcs-lint v0.9.0] validating orders/contract.yaml

SHAPE: pass — file conforms to apiVersion v3.0.0
DIFF:
  changes detected:
    - schema.remove: column "legacy_ref" removed  → severity: BREAKING
  version bump: 1.4.0 → 1.5.0 (minor)
  required bump for BREAKING: major

FAIL: version bump (minor) does not match change severity (BREAKING).
      Options:
        (a) bump to 2.0.0 and add a deprecation window;
        (b) restore column "legacy_ref";
        (c) mark "legacy_ref" as deprecated: true and keep for one release cycle.

exit code: 2

Step-by-step explanation.

odcs-lint first runs shape validation — checks the file parses as YAML, has the required top-level blocks, and every schema entry has the mandatory fields. This step catches typos and structural mistakes.
In diff mode, the tool loads both versions (main and PR), computes the set difference on the schema block, and classifies each change. Removing an entry is classified BREAKING; adding an optional entry is MINOR; changing a description is PATCH.
The tool computes the maximum severity across all changes: BREAKING. It then compares to the version bump — 1.4.0 → 1.5.0 is a minor bump. The maximum severity requires a major bump. Mismatch. Exit code 2.
The failure message lists three remediation options. Option (a) is correct for a genuine, intentional breaking change; option (b) is for accidental removals; option (c) is the deprecation path that gives consumers a migration window.
The GitHub required-status-check consumes exit code 2 as a failure. The merge button is disabled; the PR author must pick one of the three options before the check can pass.
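
The diff-mode logic described in the steps above can be approximated in a few lines. This is an illustrative sketch, not the odcs-lint implementation: the helper names are hypothetical, contracts are assumed to be already parsed into dictionaries, and any type change is conservatively treated as major.

```python
SEVERITY = {"none": -1, "patch": 0, "minor": 1, "major": 2}


def columns(contract):
    """Index the schema block by column name."""
    return {col["name"]: col for col in contract["schema"]}


def classify_schema_diff(base, pr):
    """Return (description, required bump level) for each schema change."""
    old, new = columns(base), columns(pr)
    changes = []
    for name in sorted(old.keys() - new.keys()):
        changes.append((f"column {name!r} removed", "major"))
    for name in sorted(new.keys() - old.keys()):
        level = "major" if new[name].get("required") else "minor"
        changes.append((f"column {name!r} added", level))
    for name in sorted(old.keys() & new.keys()):
        if old[name].get("type") != new[name].get("type"):
            changes.append((f"column {name!r} type changed", "major"))
    return changes


def declared_bump(old_version, new_version):
    """Which semver component did the PR actually raise?"""
    old = [int(p) for p in old_version.split(".")]
    new = [int(p) for p in new_version.split(".")]
    for level, o, n in zip(("major", "minor", "patch"), old, new):
        if n != o:
            return level if n > o else "none"
    return "none"


def lint_diff(base, pr):
    """Exit code 0 when the declared bump covers the worst change, else 2."""
    changes = classify_schema_diff(base, pr)
    required = max((lvl for _, lvl in changes), key=SEVERITY.get, default="none")
    declared = declared_bump(base["version"], pr["version"])
    return 0 if SEVERITY[declared] >= SEVERITY[required] else 2


BASE = {"version": "1.4.0", "schema": [
    {"name": "order_id", "type": "STRING"},
    {"name": "legacy_ref", "type": "STRING"},
    {"name": "order_total", "type": "DECIMAL(12,2)"},
]}
PR = {"version": "1.5.0", "schema": [
    {"name": "order_id", "type": "STRING"},
    {"name": "order_total", "type": "DECIMAL(12,2)"},
]}
print(classify_schema_diff(BASE, PR), lint_diff(BASE, PR))
```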

Output.

Lint stage	Result	Action
Shape	pass	continue
Diff — remove column	BREAKING	require major
Version bump	1.5.0 (minor)	mismatch
Overall	FAIL exit 2	PR blocked

Rule of thumb. The lint tool is the mechanical enforcer of the semver rules. Trusting humans to bump correctly is a bug waiting to happen — every producer eventually forgets, and the diff-classifier is the safety net. Ship odcs-lint diff as a required status check on day one.

Senior interview question on the ODCS shape and versioning

A senior interviewer might ask: "Design the ODCS contract for a payments dataset that includes card-network-classified PII, has a 5-minute freshness SLA, is consumed by a fraud model with strict quality requirements, and needs to go through a schema migration that renames card_bin to card_iin. Walk me through the initial contract, the deprecation contract, and the final contract."

Solution Using an initial v1 + a deprecation-window v1.5 + a final v2 with the rename

# ---------- v1.0.0 — initial contract ----------
apiVersion: v3.0.0
kind: DataContract
id: payments.public
version: 1.0.0
info:
  title: Payments — public model
  owner: team-payments
  status: active
schema:
  - name: payment_id
    type: STRING
    required: true
    unique: true
  - name: card_bin
    type: STRING
    required: true
    pii: true
    sensitive: true
    description: First 6 digits of card number (Bank Identification Number).
  - name: amount
    type: DECIMAL(12,2)
    required: true
  - name: created_at
    type: TIMESTAMP
    required: true
sla:
  max_freshness: PT5M
  availability_pct: 99.95
quality:
  - assertion: null_rate(payment_id) == 0
    severity: block
    rollout: block_all
  - assertion: unique(payment_id)
    severity: block
    rollout: block_all
  - assertion: amount > 0
    severity: block
    rollout: block_all
roles:
  producer: team-payments
  consumers: [team-fraud, team-finance]
  steward: platform-data
servers:
  - type: kafka
    dsn: kafka://prod-cluster/payments.public.v1

---
# ---------- v1.5.0 — deprecation-window contract ----------
apiVersion: v3.0.0
kind: DataContract
id: payments.public
version: 1.5.0
info:
  status: active
  description: card_iin introduced as canonical name for BIN; card_bin retained for backward compatibility until 2.0.0 (deprecation window 45 days).
schema:
  - name: payment_id
    type: STRING
    required: true
    unique: true
  - name: card_bin
    type: STRING
    required: true
    pii: true
    sensitive: true
    deprecated: true         # signal to consumers
    description: DEPRECATED. Use card_iin. Removed in 2.0.0 on 2026-08-16.
  - name: card_iin
    type: STRING
    required: false          # optional in 1.5.0 so producers can migrate at their pace
    pii: true
    sensitive: true
    description: Issuer Identification Number (canonical name for what was BIN).
  - name: amount
    type: DECIMAL(12,2)
    required: true
  - name: created_at
    type: TIMESTAMP
    required: true
# sla unchanged; quality gains one migration assertion; servers omitted for brevity
sla:
  max_freshness: PT5M
  availability_pct: 99.95
quality:
  - assertion: null_rate(payment_id) == 0
    severity: block
    rollout: block_all
  - assertion: unique(payment_id)
    severity: block
    rollout: block_all
  - assertion: amount > 0
    severity: block
    rollout: block_all
  - assertion: card_bin == card_iin OR card_iin IS NULL
    severity: warn
    rollout: warn_only       # producer commits to both being equal during migration
roles:
  producer: team-payments
  consumers: [team-fraud, team-finance]

---
# ---------- v2.0.0 — final contract with rename complete ----------
apiVersion: v3.0.0
kind: DataContract
id: payments.public
version: 2.0.0
info:
  status: active
schema:
  - name: payment_id
    type: STRING
    required: true
    unique: true
  - name: card_iin              # canonical — card_bin removed
    type: STRING
    required: true
    pii: true
    sensitive: true
  - name: amount
    type: DECIMAL(12,2)
    required: true
  - name: created_at
    type: TIMESTAMP
    required: true
sla:
  max_freshness: PT5M
  availability_pct: 99.95
quality:
  - assertion: null_rate(payment_id) == 0
    severity: block
    rollout: block_all
  - assertion: unique(payment_id)
    severity: block
    rollout: block_all
  - assertion: amount > 0
    severity: block
    rollout: block_all
roles:
  producer: team-payments
  consumers: [team-fraud, team-finance]

Step-by-step trace.

Version	Key change	Consumer action
1.0.0	initial contract, `card_bin` required, PII flagged	subscribe
1.5.0 (deprecation)	`card_bin` marked deprecated; `card_iin` added optional	migrate reads to `card_iin`, keep fallback to `card_bin`
2.0.0 (final)	`card_bin` removed; `card_iin` now required	consumers must have migrated by now

The deprecation window is the 45-day interval covered by the middle contract. During those 45 days the producer populates both columns, and the warn-only assertion card_bin == card_iin OR card_iin IS NULL flags any row where they diverge. Consumers migrate their queries at their own pace; by day 45, every consumer should be reading card_iin, and 2.0.0 can then remove card_bin safely.

Output:

Metric	1.0.0	1.5.0	2.0.0
card_bin present	yes	yes (deprecated)	no
card_iin present	no	yes (optional)	yes (required)
Consumers still on card_bin	100%	migrating	0%
Breaking?	—	no (additive)	yes (drop)

Why this works — concept by concept:

Deprecation-window contract: the 1.5.0 middle step is what makes the rename safe. Both columns are populated; consumers can migrate at their own pace; the warn-only assertion surfaces any row where the two columns diverge during the window (it warns rather than blocks, so it is a monitor, not a guarantee).
Semver bumps match change semantics: 1.0.0 → 1.5.0 is minor (additive card_iin, additive deprecated flag). 1.5.0 → 2.0.0 is major (removing card_bin). The lint tool passes both bumps.
PII + sensitive flags carried forward: every contract flags the BIN/IIN column (card_bin, then card_iin) as pii: true, sensitive: true. DLP tooling continues to mask it in query results and encrypt at rest.
SLA and quality unchanged: the rename is a schema-only migration. The 5-minute freshness SLA and the block-severity quality assertions stay stable across all three versions; consumers' downstream freshness expectations do not need renegotiating.
Cost: one deprecation window (45 days) plus three PRs. The alternative (rename in-place, break every consumer) costs one weekend war room per consumer plus the trust hit. The paved path is O(days) of calendar time and O(1) engineer-hours; the reckless path is O(engineers × hours × consumers).
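
On the consumer side, "migrate reads to card_iin, keep fallback to card_bin" can look like the sketch below. The function names are hypothetical and rows are assumed to arrive as plain dictionaries; `migration_violations` mirrors the assertion card_bin == card_iin OR card_iin IS NULL from the 1.5.0 contract.

```python
def read_iin(row):
    """Prefer card_iin; fall back to the deprecated card_bin during the window."""
    iin = row.get("card_iin")
    return iin if iin is not None else row["card_bin"]


def migration_violations(rows):
    """Rows that break `card_bin == card_iin OR card_iin IS NULL`."""
    return [
        row for row in rows
        if row.get("card_iin") is not None and row.get("card_bin") != row["card_iin"]
    ]


rows = [
    {"payment_id": "p1", "card_bin": "411111", "card_iin": None},      # not migrated yet
    {"payment_id": "p2", "card_bin": "411111", "card_iin": "411111"},  # dual-written
    {"payment_id": "p3", "card_bin": "411111", "card_iin": "550000"},  # divergent
]
print([read_iin(r) for r in rows])
print([r["payment_id"] for r in migration_violations(rows)])
```

The same `read_iin` keeps working against 2.0.0 rows, where card_bin is gone and card_iin is always populated.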

3. Schema Registry integration — Avro, Protobuf, JSON

Confluent and Apicurio Schema Registry — subjects, versions, and compatibility modes are the streaming-side of ODCS

The mental model in one line: a schema registry is a centralised, versioned, compatibility-checked store of streaming schemas — every producer publishes an Avro / Protobuf / JSON Schema to a subject, the registry enforces a compatibility mode (BACKWARD, FORWARD, FULL, NONE) that either accepts or rejects the new schema, and consumers read the schema by version to deserialise payloads. ODCS is the source-of-truth YAML; the registry is the runtime-enforcement mirror for the streaming path.

The three moving pieces.

Subject. A namespaced string that identifies the schema slot — e.g. orders.public.v1-value for the value schema of the orders.public.v1 topic. Every schema publication goes to a subject; consumers query the registry by subject to fetch the schema.
Version. An integer per subject, incrementing every time a compatible schema is registered. v1 is the first schema; v2 is the next accepted schema; v3 the next; and so on.
Compatibility mode. The rule the registry enforces when a new schema is proposed. Modes decide whether a proposed schema is compatible with the previous version(s) — and reject it at publication time if not.

The compatibility modes in detail.

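BACKWARD. The new schema can read data written by the previous version. In Avro terms, producers can add fields that carry a default or delete existing fields (the new reader simply ignores them in old data); adding a field without a default breaks BACKWARD. This is the registry's default mode: consumers are upgraded first, then producers. A field deletion that passes the registry check is still a major bump at the contract level.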
BACKWARD_TRANSITIVE. Same as BACKWARD but the new schema must be compatible with all previous versions, not just the previous one. Prevents drift across long lived subjects.
FORWARD. Previous schema can read data written by the new schema. Producers can be upgraded first; consumers stay on the old schema. Useful when producers ship faster than consumers.
FORWARD_TRANSITIVE. Same but against all previous versions.
FULL. Both directions — BACKWARD and FORWARD simultaneously. Neither party has to upgrade first; migrations are safest but changes are most restricted.
FULL_TRANSITIVE. Both directions, all previous versions.
NONE. No compatibility checks; anything goes. Only used during initial development or emergency schema rewrites.
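
A simplified model makes the modes concrete. The sketch below is not the registry's algorithm: it ignores type promotion and aliases, models a schema as a mapping from field name to "has a default", and uses the Avro-style rule that a reader can decode a writer's data when every reader field is either present in the writer schema or has a default. The function names are hypothetical.

```python
def can_read(reader, writer):
    """True if each reader field exists in the writer schema or has a default."""
    return all(name in writer or has_default for name, has_default in reader.items())


def is_compatible(new, previous_versions, mode):
    """Check `new` against earlier schemas (oldest first) under a compatibility mode."""
    if mode == "NONE" or not previous_versions:
        return True
    transitive = mode.endswith("_TRANSITIVE")
    base = mode[: -len("_TRANSITIVE")] if transitive else mode
    targets = previous_versions if transitive else previous_versions[-1:]
    for old in targets:
        backward = can_read(reader=new, writer=old)   # new reads old data
        forward = can_read(reader=old, writer=new)    # old reads new data
        ok = {"BACKWARD": backward, "FORWARD": forward, "FULL": backward and forward}[base]
        if not ok:
            return False
    return True


v1 = {"order_id": False, "user_id": False, "legacy_ref": False}
add_optional = dict(v1, discount_code=True)   # new field with a default
add_required = dict(v1, channel=False)        # new field without a default
drop_field = {"order_id": False, "user_id": False}

for label, schema in [("add optional", add_optional),
                      ("add required", add_required),
                      ("drop field", drop_field)]:
    print(label, {m: is_compatible(schema, [v1], m) for m in ("BACKWARD", "FORWARD", "FULL")})
```

Under this model, dropping a field without a default passes BACKWARD but fails FORWARD, and adding a field without a default does the opposite; only the optional addition passes FULL.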

The subject-naming strategies.

TopicNameStrategy (default). <topic>-key and <topic>-value. One schema per topic; simplest model; requires every event on the topic to share the same schema.
RecordNameStrategy. <record-fullname> (e.g. com.company.OrderCreated). One schema per record type; useful when the same topic carries multiple event types.
TopicRecordNameStrategy. <topic>-<record-fullname>. Combines the two — one schema per (topic, record-type) pair; useful for topics carrying multiple event types with topic-scoped versioning.
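
The three strategies differ only in how the subject string is assembled. A minimal sketch follows, with a hypothetical helper `subject_name`; it mirrors the patterns listed above rather than any client library's code.

```python
def subject_name(strategy, topic, record_fullname, is_key=False):
    """Build the registry subject for a message under a given naming strategy."""
    if strategy == "TopicNameStrategy":
        return f"{topic}-{'key' if is_key else 'value'}"
    if strategy == "RecordNameStrategy":
        return record_fullname
    if strategy == "TopicRecordNameStrategy":
        return f"{topic}-{record_fullname}"
    raise ValueError(f"unknown strategy: {strategy!r}")


for strategy in ("TopicNameStrategy", "RecordNameStrategy", "TopicRecordNameStrategy"):
    print(strategy, "->", subject_name(strategy, "orders.events", "com.company.OrderCreated"))
```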

The Avro-Protobuf-JSON trade-off.

Avro. Binary encoding, compact. With a schema registry each message carries only a small schema id in its payload prefix rather than the full schema, and Avro is typically among the fastest options. Ideal for high-throughput streaming.
Protobuf. Binary encoding, similar compactness to Avro, wire format is language-neutral, gRPC-native. Ideal for polyglot organisations with existing Protobuf ecosystems.
JSON Schema. Text encoding, human-readable, larger on the wire, easier debugging. Ideal for internal-tools topics where debuggability trumps performance.

Common interview probes on schema registry.

"Explain BACKWARD compatibility with a producer-side example." — required.
"What's the difference between BACKWARD and FORWARD?" — who upgrades first.
"What subject-naming strategy would you pick for a multi-event-type topic?" — RecordName or TopicRecordName.
"How does the registry integrate with ODCS?" — the ODCS schema block is the source; a sync tool projects it into the registry as an Avro / Protobuf schema.

Worked example — BACKWARD-compatible Avro evolution

Detailed explanation. The canonical evolution story: an orders Avro schema v1 has order_id, user_id, order_total. The producer wants to add a discount_code field. Show the two-line Avro schema evolution, the registry BACKWARD check that passes, and the consumer that survives without recompilation.

v1 schema. three required fields.
v2 schema. three original fields + discount_code as optional (nullable with default null).
BACKWARD check. does v2 reader read v1 data? Yes — the missing discount_code in v1 data is filled with the v2 default.
Producer + consumer upgrade order. upgrade consumers first (they can read both v1 and v2); then producers (they start writing v2).

Question. Produce the Avro schema for v1 and v2 of the orders topic value, the registry publish command, and the consumer code (Python) that reads both versions.

Input.

Element	Value
Topic	orders.public
Subject strategy	TopicNameStrategy
Subject	`orders.public-value`
Compatibility	BACKWARD
Registry	Confluent Schema Registry

Code.

// orders_v1.avsc — value schema v1
{
  "type": "record",
  "name": "Order",
  "namespace": "com.company.orders",
  "fields": [
    {"name": "order_id",    "type": "string"},
    {"name": "user_id",     "type": "string"},
    {"name": "order_total", "type": {"type": "bytes", "logicalType": "decimal",
                                     "precision": 12, "scale": 2}}
  ]
}

// orders_v2.avsc — value schema v2 (BACKWARD-compatible with v1)
{
  "type": "record",
  "name": "Order",
  "namespace": "com.company.orders",
  "fields": [
    {"name": "order_id",      "type": "string"},
    {"name": "user_id",       "type": "string"},
    {"name": "order_total",   "type": {"type": "bytes", "logicalType": "decimal",
                                       "precision": 12, "scale": 2}},
    {"name": "discount_code", "type": ["null", "string"], "default": null}
  ]
}

# Publish v1 and v2 to the registry with BACKWARD compatibility
curl -X PUT http://registry:8081/config/orders.public-value \
     -H "Content-Type: application/vnd.schemaregistry.v1+json" \
     -d '{"compatibility":"BACKWARD"}'

# Register v1
curl -X POST http://registry:8081/subjects/orders.public-value/versions \
     -H "Content-Type: application/vnd.schemaregistry.v1+json" \
     -d @<(jq -Rs '{schema: .}' < orders_v1.avsc)

# Register v2 — passes BACKWARD check (adds optional field with default)
curl -X POST http://registry:8081/subjects/orders.public-value/versions \
     -H "Content-Type: application/vnd.schemaregistry.v1+json" \
     -d @<(jq -Rs '{schema: .}' < orders_v2.avsc)

# consumer.py: reads both v1 and v2 messages transparently
from confluent_kafka import Consumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import MessageField, SerializationContext

registry = SchemaRegistryClient({"url": "http://registry:8081"})

# The deserializer reads the schema id from the payload prefix
# (magic byte + 4-byte id), fetches the writer schema from the registry,
# and projects the record onto the reader schema loaded below.
with open("orders_v2.avsc") as f:
    reader_schema = f.read()                    # consumer knows v2
avro_deserializer = AvroDeserializer(registry, reader_schema)

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "orders-consumer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders.public"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        ctx = SerializationContext(msg.topic(), MessageField.VALUE)
        order = avro_deserializer(msg.value(), ctx)
        # discount_code is None for v1 messages and for v2 messages without a discount
        print(order["order_id"], order.get("discount_code"))
finally:
    consumer.close()

Step-by-step explanation.

The v1 schema declares three required fields. Every producer writing v1 must populate all three; consumers deserialising v1 receive all three.
The v2 schema adds discount_code as a union of null and string with a default of null. In Avro, adding a nullable field with a default is the canonical BACKWARD-compatible change.
Setting the subject-level compatibility to BACKWARD via PUT /config/orders.public-value establishes the rule the registry will enforce. Any subsequent POST /subjects/.../versions runs the compatibility check first.
Registering v2 succeeds because the BACKWARD check passes: a reader using v2's schema can read v1 data (the missing discount_code is filled with the default null). If the producer had instead added discount_code as required, the registry would return 409 Conflict with a Schema being registered is incompatible with an earlier schema error.
The consumer code declares v2 as its reader schema. The AvroDeserializer reads each message's writer schema id from the payload prefix (a magic byte followed by a 4-byte id), fetches the writer schema from the registry, and projects the record onto the reader schema. For v1 messages, discount_code is filled with null; for v2 messages, it carries whatever the producer wrote.
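The upgrade order is: (a) update consumers to use the v2 reader schema first, since they can still read v1 messages; (b) once consumers are deployed, update producers to write v2. BACKWARD only guarantees that new readers can read old data; it makes no promise in the other direction. This particular change happens to be safe either way, because an Avro v1 reader simply ignores the extra discount_code field, but a BACKWARD-legal change such as deleting a field that has no default would break v1 consumers if producers upgraded first.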
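
The schema id travels in a 5-byte prefix on the message payload: one magic byte (0) followed by a 4-byte big-endian schema id, then the Avro-encoded body. A minimal sketch of framing and unframing follows; `frame` and `unframe` are hypothetical helpers, not part of the Confluent client.

```python
import struct

MAGIC_BYTE = 0
_PREFIX = struct.Struct(">bI")  # 1-byte magic + 4-byte big-endian schema id


def frame(schema_id, avro_body):
    """Prepend the registry prefix to an already-encoded Avro body."""
    return _PREFIX.pack(MAGIC_BYTE, schema_id) + avro_body


def unframe(payload):
    """Split a payload into (schema_id, avro_body)."""
    if len(payload) < _PREFIX.size:
        raise ValueError("payload too short to carry a schema id")
    magic, schema_id = _PREFIX.unpack(payload[:_PREFIX.size])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, payload[_PREFIX.size:]


payload = frame(2, b"\x08o-42")
print(payload[:5].hex(), unframe(payload))
```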

Output.

Message written by	Schema id	Consumer reads with v2 reader	discount_code value
producer v1	1	success	null (default)
producer v2	2	success	"SPRING2026"
producer v2 (no discount)	2	success	null
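
The table above follows from how a reader schema is applied to a decoded record. The sketch below imitates that projection on plain dictionaries; it is a simplification of Avro schema resolution (no type promotion, no aliases), and `project` is a hypothetical name.

```python
def project(writer_record, reader_fields):
    """Project a decoded record onto the reader's field list.

    Fields the reader does not declare are dropped; fields missing from the
    writer's data are filled from the reader's default, if it has one.
    """
    out = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            out[name] = writer_record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise KeyError(f"field {name!r} missing and reader has no default")
    return out


READER_V1 = [{"name": "order_id"}, {"name": "user_id"}, {"name": "order_total"}]
READER_V2 = READER_V1 + [{"name": "discount_code", "default": None}]

v1_record = {"order_id": "o-1", "user_id": "u-1", "order_total": "10.00"}
v2_record = dict(v1_record, discount_code="SPRING2026")

print(project(v1_record, READER_V2))  # v2 reader, v1 data
print(project(v2_record, READER_V1))  # v1 reader, v2 data
```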

Rule of thumb. BACKWARD is the streaming default. Add optional fields with defaults; upgrade consumers first, then producers. The registry will (correctly) refuse an Avro change that adds a field without a default. It will accept a field removal under BACKWARD, because new readers just ignore the extra field in old data, so rely on the contract's semver rules (or FULL mode) to stop removals that consumers still depend on.

Worked example — Protobuf schema evolution with FULL compatibility

Detailed explanation. Protobuf's approach to schema evolution differs from Avro. All fields carry a field number; unknown fields are preserved on read; field numbers must never be reused. Show a Protobuf schema evolution under FULL compatibility (both BACKWARD and FORWARD), demonstrating why Protobuf's design choices make FULL easier than Avro's.

Protobuf design. Fields are optional by default in proto3; adding a new optional field is always FULL-compatible.
The forbidden change. Renaming a field is fine (field number is the identity); reusing a field number for a new field is forbidden.
Wire format. Unknown fields are preserved during deserialise + reserialise (the proto3 behaviour since protobuf 3.5), so a consumer on the old schema can pass through data from a new-schema producer without losing information.

Question. Show a proto3 schema evolution from v1 to v2 (adding a discount_code field) under FULL compatibility, plus the registry compatibility check.

Input.

Element	Value
Topic	orders.public.proto
Subject	`orders.public.proto-value`
Compatibility	FULL
Format	Protobuf (proto3)

Code.

// orders_v1.proto
syntax = "proto3";
package com.company.orders;

message Order {
  string order_id    = 1;
  string user_id     = 2;
  string order_total = 3;   // decimal serialised as string for exactness
}

// orders_v2.proto — adds discount_code with a new field number
syntax = "proto3";
package com.company.orders;

message Order {
  string order_id      = 1;
  string user_id       = 2;
  string order_total   = 3;
  string discount_code = 4;   // new field, new number
}

# Publish with FULL compatibility
curl -X PUT http://registry:8081/config/orders.public.proto-value \
     -H "Content-Type: application/vnd.schemaregistry.v1+json" \
     -d '{"compatibility":"FULL"}'

# Register v1 and v2 — both pass FULL check for proto3 with new field number
for v in orders_v1.proto orders_v2.proto; do
  curl -X POST http://registry:8081/subjects/orders.public.proto-value/versions \
       -H "Content-Type: application/vnd.schemaregistry.v1+json" \
       -d @<(jq -Rs '{schema: ., schemaType: "PROTOBUF"}' < "$v")
done

// Consumer — Protobuf's design means the v1 consumer can read v2 messages
// (unknown fields preserved) and the v2 consumer can read v1 messages (missing
// fields default to empty string in proto3).

Order v1Order = Order.parseFrom(bytes);
// If bytes came from a v2 producer, the v1-generated Order class has no
// getDiscountCode() accessor at all (calling it would not compile), but v1 can
// still process the fields it knows about, and the unknown-field bytes survive
// a round-trip through v1's serialiser.

Step-by-step explanation.

Proto3's design has all fields optional by default. There is no concept of "required" in proto3 — a field either has a value or its default (empty string, 0, false).
Adding a new field with a new number (discount_code = 4) is always FULL-compatible in proto3. The v1 reader ignores field 4 (unknown field, preserved on the wire); the v2 reader deserialises it. The v1 reader can also write — it just doesn't emit field 4.
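The registry runs the FULL check when v2 is registered (the POST, not the config PUT), and it succeeds because the change is bidirectionally safe. If the producer had instead reused field number 3 with an incompatible type, the check would fail; retiring a field is safest done by marking its number reserved so that it can never be reused.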
Consumers written against v1 keep working when producers upgrade to v2; the unknown discount_code bytes are preserved and round-tripped. This is the "proto3 unknown fields" behaviour.
Compared to Avro's BACKWARD-only default, Protobuf's design makes FULL compatibility achievable for a broader class of changes. The trade-off: Protobuf spends a tag on every encoded field, so densely populated records tend to be slightly larger than Avro's, and it requires careful management of field numbers (never reuse).
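
The unknown-field behaviour is easier to see at the byte level. The sketch below hand-rolls just enough of the Protobuf wire format (varints and length-delimited string fields) to show a v1 reader skipping field 4 while keeping its bytes for a round-trip. It is an illustration only, with hypothetical function names; real code would use the generated classes.

```python
def encode_varint(n):
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_message(fields):
    """fields: {field_number: str}. Empty strings are omitted, as in proto3."""
    out = b""
    for number, value in sorted(fields.items()):
        if value == "":
            continue
        data = value.encode("utf-8")
        out += encode_varint((number << 3) | 2) + encode_varint(len(data)) + data
    return out


def decode_message(buf, known):
    """known: {field_number: name}. Returns (values, raw bytes of unknown fields)."""
    values = {name: "" for name in known.values()}
    unknown = bytearray()
    pos = 0
    while pos < len(buf):
        start = pos
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 7
        if wire_type != 2:
            raise ValueError("sketch only handles length-delimited fields")
        length, pos = decode_varint(buf, pos)
        end = pos + length
        if number in known:
            values[known[number]] = buf[pos:end].decode("utf-8")
        else:
            unknown += buf[start:end]   # keep tag + length + payload untouched
        pos = end
    return values, bytes(unknown)


V1 = {1: "order_id", 2: "user_id", 3: "order_total"}
V2 = dict(V1, **{}) | {4: "discount_code"} if hasattr(dict, "__or__") else {**V1, 4: "discount_code"}

wire_v2 = encode_message({1: "o-42", 2: "u-7", 3: "19.99", 4: "SPRING2026"})
v1_values, unknown = decode_message(wire_v2, V1)
round_trip = encode_message({n: v1_values[name] for n, name in V1.items()}) + unknown
print(v1_values, round_trip == wire_v2)
```

Because field 4 is the highest-numbered field, appending the preserved bytes reproduces the original message exactly; the v2 reader applied to v1 bytes simply sees an empty discount_code.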

Output.

Change	Avro BACKWARD	Protobuf FULL
Add optional field with default	pass	pass
Add required field	fail	(no required in proto3)
Remove field	pass (fails FORWARD / FULL unless the field had a default)	treat as breaking; reserve the number
Rename field	fail (name matters)	pass (number matters)
Reuse field number for new field	N/A	forbidden

Rule of thumb. For polyglot organisations or where FULL compatibility matters, prefer Protobuf. For maximum compactness and where BACKWARD is sufficient, prefer Avro. JSON Schema is fine for internal-tools topics where debuggability wins. Never mix formats within a subject — pick one per topic.

Worked example — Subject-naming strategies for multi-event topics

Detailed explanation. A topic that carries multiple event types (e.g. orders.events carrying OrderCreated, OrderPaid, OrderShipped) cannot use TopicNameStrategy because there'd be one schema slot for three different record types. RecordNameStrategy assigns a subject per record type; TopicRecordNameStrategy scopes it further per topic. Walk through the three strategies with a concrete example.

TopicNameStrategy. orders.events-value — one slot; wrong for multi-event topics.
RecordNameStrategy. com.company.orders.OrderCreated, com.company.orders.OrderPaid — three slots; same schema shared across every topic that carries the record.
TopicRecordNameStrategy. orders.events-com.company.orders.OrderCreated — three slots scoped per topic; different topics can evolve the same record type independently.

Question. Show a producer config and a consumer config for a topic orders.events carrying OrderCreated, OrderPaid, and OrderShipped, using TopicRecordNameStrategy.

Input.

Component	Value
Topic	orders.events
Event types	OrderCreated, OrderPaid, OrderShipped
Subject strategy	TopicRecordNameStrategy
Compatibility	BACKWARD (per subject)

Code.

// OrderCreated.avsc
{
  "type": "record",
  "name": "OrderCreated",
  "namespace": "com.company.orders",
  "fields": [
    {"name": "order_id",   "type": "string"},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}

// Producer — java, Kafka + Confluent client
Properties props = new Properties();
props.put("bootstrap.servers",         "kafka:9092");
props.put("schema.registry.url",       "http://registry:8081");
// Subject strategy — one subject per (topic, record-type)
props.put("value.subject.name.strategy",
          "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy");
props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");

KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props);

// Publishing an OrderCreated event to orders.events
GenericRecord created = new GenericData.Record(orderCreatedSchema);
created.put("order_id",   "o-42");
created.put("created_at", System.currentTimeMillis());
producer.send(new ProducerRecord<>("orders.events", "o-42", created));
// Registry subject: orders.events-com.company.orders.OrderCreated

# Registry state after publishing all three event types
subjects:
  orders.events-com.company.orders.OrderCreated:   [v1]
  orders.events-com.company.orders.OrderPaid:      [v1]
  orders.events-com.company.orders.OrderShipped:   [v1]

Step-by-step explanation.

The producer sets value.subject.name.strategy = TopicRecordNameStrategy. The Kafka Avro serializer inspects each message's Avro schema, extracts the fullname (com.company.orders.OrderCreated), and constructs the subject orders.events-com.company.orders.OrderCreated.
The three event types register as three distinct subjects. Each subject has its own version history and compatibility rule; evolving OrderPaid does not touch OrderCreated's history.
The consumer subscribes to orders.events and receives interleaved messages of all three types. For each message, it reads the schema ID from the payload's wire-format prefix (a magic byte followed by a 4-byte ID), fetches the writer schema from the registry, and deserialises. The consumer then dispatches based on the schema fullname to type-specific handlers.
If the team later adds a fourth event type OrderCancelled, it gets its own subject orders.events-com.company.orders.OrderCancelled automatically — no config change on the producer, no coordination.
The catch is quota: managed registries such as Confluent Cloud cap how many schemas a registry can hold, and every subject adds at least one registered schema. Multi-event topics multiply subject counts; teams with many topics and many event types can hit the limit and need to negotiate a quota bump.
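
The consumer-side dispatch in step 3 can be sketched as a registry of handlers keyed by record fullname. This is a hedged illustration only: it assumes the record's fullname has already been resolved from the writer schema (how that is obtained depends on the client library), and the names `HANDLERS`, `handles`, and `dispatch` are hypothetical.

```python
from typing import Any, Callable, Dict

# Maps an Avro record fullname to the function that handles it
HANDLERS: Dict[str, Callable[[Dict[str, Any]], str]] = {}


def handles(fullname: str):
    """Decorator that registers a handler for one record type."""
    def register(fn):
        HANDLERS[fullname] = fn
        return fn
    return register


@handles("com.company.orders.OrderCreated")
def on_created(event: Dict[str, Any]) -> str:
    return f"created {event['order_id']}"


@handles("com.company.orders.OrderPaid")
def on_paid(event: Dict[str, Any]) -> str:
    return f"paid {event['order_id']}"


def dispatch(fullname: str, event: Dict[str, Any]) -> str:
    """Route a deserialised event to its type-specific handler."""
    handler = HANDLERS.get(fullname)
    if handler is None:
        # An unhandled type is surfaced rather than silently dropped
        raise LookupError(f"no handler registered for {fullname}")
    return handler(event)
```

Raising on an unknown fullname is a design choice for the sketch: when OrderCancelled appears on the topic, the consumer finds out immediately instead of ignoring the new event type.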

Output.

Strategy	Subject granularity	Cross-topic sharing	Use case
TopicNameStrategy	1 subject / topic	not applicable	single-event topics
RecordNameStrategy	1 subject / record type	yes	shared record across topics
TopicRecordNameStrategy	1 subject / (topic, record)	no	multi-event topics

Rule of thumb. Default to TopicNameStrategy for single-event topics (simplest, matches ODCS 1:1). Escalate to TopicRecordNameStrategy for multi-event topics — it's the most flexible and matches the "one contract per (topic, event)" mental model. Reserve RecordNameStrategy for the rare case where the same record type flows through multiple topics.

Senior interview question on schema registry integration with ODCS

A senior interviewer might ask: "You have an ODCS contract for orders.public version 1.4.0. Walk me through how you'd sync that contract into the Confluent Schema Registry, what compatibility mode you'd set, how you'd handle a v1.5.0 additive change and a v2.0.0 breaking change, and what happens to consumers pinned to v1 during a v2 rollout."

Solution Using an ODCS-to-registry sync tool with BACKWARD_TRANSITIVE + per-version consumer pinning

# sync_odcs_to_registry.py — the reference sync tool
import yaml, json, requests
from typing import Dict, Any

# Map ODCS types to Avro types
ODCS_TO_AVRO = {
    "STRING":    "string",
    "INT":       "int",
    "BIGINT":    "long",
    "BOOLEAN":   "boolean",
    "TIMESTAMP": {"type": "long", "logicalType": "timestamp-millis"},
    # DECIMAL(p,s) handled specially
}

def odcs_column_to_avro_field(col: Dict[str, Any]) -> Dict[str, Any]:
    avro_type = ODCS_TO_AVRO.get(col["type"], "string")
    if col["type"].startswith("DECIMAL"):
        # DECIMAL(12,2) → bytes with logicalType
        precision, scale = _parse_decimal(col["type"])
        avro_type = {"type": "bytes", "logicalType": "decimal",
                     "precision": precision, "scale": scale}
    if not col.get("required", False):
        avro_type = ["null", avro_type]
    field = {"name": col["name"], "type": avro_type}
    if not col.get("required", False):
        field["default"] = None
    if col.get("description"):
        field["doc"] = col["description"]
    return field

def _parse_decimal(t: str):
    inside = t[len("DECIMAL("):-1]
    p, s = inside.split(",")
    return int(p), int(s)

def odcs_to_avro(contract: Dict[str, Any]) -> Dict[str, Any]:
    return {
        "type": "record",
        "name": "Value",
        "namespace": contract["id"],
        "fields": [odcs_column_to_avro_field(c) for c in contract["schema"]],
    }

def sync(contract_path: str, registry_url: str, compat: str = "BACKWARD_TRANSITIVE"):
    # Load the contract; the context manager closes the file handle
    with open(contract_path) as fh:
        contract = yaml.safe_load(fh)
    # One subject per major version, e.g. orders.public.v1-value
    subject = f"{contract['id']}.v{contract['version'].split('.')[0]}-value"

    # 1. set compatibility (fail loudly if the registry rejects the config)
    requests.put(f"{registry_url}/config/{subject}",
                 json={"compatibility": compat}, timeout=10).raise_for_status()

    # 2. compute the Avro schema from the ODCS schema
    avro = odcs_to_avro(contract)

    # 3. register the schema (registry runs its own compatibility check first)
    r = requests.post(f"{registry_url}/subjects/{subject}/versions",
                      json={"schema": json.dumps(avro)}, timeout=10)
    if r.status_code == 409:
        raise SystemExit(f"registry rejected schema as incompatible: {r.text}")
    r.raise_for_status()  # surface any other HTTP error instead of a KeyError below
    return r.json()["id"]

if __name__ == "__main__":
    print(sync("orders/contract.yaml", "http://registry:8081"))

Rollout plan — v1.5.0 (additive) → v2.0.0 (breaking)
======================================================

1.4.0 → 1.5.0 (additive: add discount_code optional)
   [producer PR]  odcs-lint diff  → passes (MINOR)
   [CI step]      sync_odcs_to_registry orders/contract.yaml
                    → subject: orders.public.v1-value
                    → registry check: BACKWARD_TRANSITIVE passes
                    → new schema version registered under the subject
   [consumers]    still on 1.4.0-compatible Avro reader schema
                  new discount_code ignored; no code change required

1.5.0 → 2.0.0 (breaking: rename card_bin → card_iin, drop card_bin)
   [producer PR]  odcs-lint diff  → BREAKING (drop card_bin)
                                    version bump matches: 1.x → 2.0.0 OK
   [subject]      NEW subject: orders.public.v2-value
                  compatibility: BACKWARD_TRANSITIVE within v2's history
   [producers]    dual-write for 45 days:
                    write v1 schema to orders.public.v1
                    write v2 schema to orders.public.v2
   [consumers]    pinned to v1 keep working (v1 topic still flowing)
                  migrate to v2 topic at their own pace
   [cutover day]  producers stop writing to v1; v1 topic reaches retention → deprecated

Step-by-step trace.

Change	Subject	Compat check	Consumer action
1.4 → 1.5 additive	orders.public.v1-value	BACKWARD_TRANSITIVE passes	none
1.5 → 2.0 breaking (rename)	new: orders.public.v2-value	fresh subject, own history	migrate over 45 days
Cutover	drop orders.public.v1 topic	v1 retention expires	consumers all on v2

The major-version bump gets its own registry subject (v2-value). Producers dual-write for the deprecation window; v1 consumers keep working on the old topic; v2 consumers use the new topic. On cutover day, producers stop writing to v1; the topic ages out through retention; no v1 message ever fails a consumer.

Output:

Rollout day	v1 topic	v2 topic	v1 consumers	v2 consumers
Day 0 (1.5 → 2.0 PR)	active	active	reading v1	reading v2
Day 30	active	active	some migrated	growing
Day 45 (deprecation ends)	producers stop writing	active	zero (all migrated)	100%
Day 90 (retention expires)	closed	active	—	100%

Why this works — concept by concept:

ODCS-as-source — the producer maintains one YAML; the sync tool projects it into whatever runtime materialisation the streaming stack needs (Avro, Protobuf, JSON Schema). Drift between the contract and the registry is prevented as long as every registration goes through the CI sync and nobody registers schemas by hand.
Major version = new subject — instead of trying to squeeze breaking changes into an existing subject with tortured compat rules, ODCS + registry integrations create a fresh subject per major version. The v1 subject keeps flowing; v2 flows in parallel; consumers migrate on their timeline.
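BACKWARD_TRANSITIVE within a major — inside a major version, additive changes remain compatible across all minors. Plain BACKWARD checks a new schema only against the latest registered version; the TRANSITIVE variant checks it against every earlier version in the subject, which catches a chain of individually compatible changes that still breaks readers of older data.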
Dual-write during deprecation — the producer writes v1 to the v1 topic and v2 to the v2 topic for the deprecation window. Consumers on either version keep working; the cost is 2x write throughput during the window.
Cost — the sync tool is ~150 lines of Python; the dual-write imposes 2x storage during deprecation. The avoided cost of a coordinated org-wide upgrade (dozens of consumer teams migrating on the same weekend) more than pays for the throughput doubling. The runtime cost is bounded by the length of the deprecation window; the coordination cost is close to zero.
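
To make the difference between BACKWARD and BACKWARD_TRANSITIVE concrete, here is a deliberately simplified model. It assumes a schema is just a mapping from field name to "has a default", and ignores types, aliases, and unions, so it is a teaching sketch rather than the registry's actual algorithm. The names `can_read` and `check` are hypothetical.

```python
from typing import Dict, List

# Simplified schema: field name -> True if the field declares a default
Schema = Dict[str, bool]


def can_read(reader: Schema, writer: Schema) -> bool:
    """A reader can decode writer data if every reader field without a
    default is present in the writer schema."""
    return all(has_default or name in writer
               for name, has_default in reader.items())


def check(new: Schema, history: List[Schema], mode: str) -> bool:
    """Check a candidate schema against a subject's version history."""
    if mode == "BACKWARD":
        targets = history[-1:]   # latest registered version only
    elif mode == "BACKWARD_TRANSITIVE":
        targets = history        # every earlier version
    else:
        raise ValueError(f"unsupported mode: {mode}")
    return all(can_read(new, old) for old in targets)


v1 = {"order_id": False}
v2 = {"order_id": False, "discount_code": True}    # added with a default
v3 = {"order_id": False, "discount_code": False}   # default later removed

if __name__ == "__main__":
    print(check(v3, [v1, v2], "BACKWARD"))             # True: v2 data has the field
    print(check(v3, [v1, v2], "BACKWARD_TRANSITIVE"))  # False: v1 data does not
```

The v3 change slips through plain BACKWARD because it is only compared with v2, yet a v3 reader cannot decode v1 data still sitting in the topic. That gap is what the transitive mode closes.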


4. Contract enforcement — CI + runtime

Two gates catch two failure modes — CI blocks bad intent; runtime blocks bad reality

The mental model in one line: CI enforcement catches the producer before the change ships (pre-merge lint against the contract, pre-merge registry compatibility check, pre-merge dbt contract test); runtime enforcement catches the payload after the change ships (consumer-side deserialisation validation, streaming producer-side schema registration check, batch loader constraint check). Neither gate suffices alone; both together give the contract teeth.

The two enforcement gates.

CI gate. Runs against every producer PR before merge. Steps: (1) contract-lint validates the ODCS YAML shape; (2) contract-diff checks the semver bump matches change severity; (3) registry-compat-check (for streaming datasets) verifies the projected Avro / Protobuf schema passes the subject's compatibility mode; (4) dbt-contract-check (for warehouse datasets) verifies the dbt model's declared columns match the contract.
Runtime gate. Runs against every payload. For streaming: the consumer-side deserialiser reads the schema ID from the message's wire-format prefix, fetches the writer schema, and projects the payload into the consumer's reader schema; incompatible payloads throw a SerializationException that routes to a DLQ. For batch: the loader (Airflow / dbt / Fivetran) checks the incoming schema against the contract before writing; incompatible files land in a quarantine bucket.

The CI failure classifier.

Level 1 — YAML shape. File doesn't parse or missing required blocks. Immediate lint failure with a pointer to the offending line.
Level 2 — Semver mismatch. Diff detects a breaking change; version bump is not major. Fail with the three remediation options (bump major / restore / deprecate).
Level 3 — Registry incompatibility. ODCS says the schema should be BACKWARD-compatible, but the projected Avro schema is not. Fail with the registry's specific error message.
Level 4 — dbt contract mismatch. The dbt model contract's columns block does not match the ODCS schema block. Fail with the diff.
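
The Level 2 check (semver mismatch) can be sketched as a small classifier. This is an illustrative approximation, not the odcs-lint implementation: the severity rules (remove and retype are BREAKING, add is MINOR) and the function names are assumptions for the example, and columns are assumed to be dicts with `name` and `type` keys.

```python
from typing import Any, Dict, List, Tuple

Change = Tuple[str, str, str]  # (kind, column, severity)
RANK = {"PATCH": 0, "MINOR": 1, "BREAKING": 2}


def classify_changes(base_cols: List[Dict[str, Any]],
                     new_cols: List[Dict[str, Any]]) -> List[Change]:
    """Diff two column lists and label each change with a severity."""
    base = {c["name"]: c for c in base_cols}
    new = {c["name"]: c for c in new_cols}
    changes: List[Change] = []
    for name in base.keys() - new.keys():
        changes.append(("remove", name, "BREAKING"))
    for name in new.keys() - base.keys():
        changes.append(("add", name, "MINOR"))
    for name in base.keys() & new.keys():
        if base[name]["type"] != new[name]["type"]:
            changes.append(("retype", name, "BREAKING"))
    return sorted(changes)


def bump_level(old_version: str, new_version: str) -> str:
    """Return the severity that a version bump is entitled to carry."""
    old = [int(p) for p in old_version.split(".")]
    new = [int(p) for p in new_version.split(".")]
    if new[0] > old[0]:
        return "BREAKING"  # major bump
    if new[1] > old[1]:
        return "MINOR"
    return "PATCH"


def bump_is_sufficient(changes: List[Change], old_version: str,
                       new_version: str) -> bool:
    """True when the bump is at least as large as the worst change."""
    required = max((RANK[sev] for _, _, sev in changes), default=0)
    return RANK[bump_level(old_version, new_version)] >= required
```

A rename shows up as one remove plus one add, so the worst severity is BREAKING and only a major bump passes.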

The runtime failure taxonomy.

Deserialisation error. Consumer receives a message whose writer schema is incompatible with its reader schema. Options: (a) fail-fast — throw and let the consumer die; (b) route to DLQ; (c) skip and log. Production default: DLQ + alert.
Quality assertion violation. Payload deserialised successfully but a quality assertion (null-rate ceiling, uniqueness, range) is violated. Options: (a) block the write (severity=block); (b) log-and-continue (severity=warn); (c) route to a quarantine table.
SLA violation. Freshness or availability probe detects a breach; page the producer.
PII leak. DLP scanner detects unmasked PII in a queryable surface (dashboard, downstream table, log line). Immediate quarantine + security ticket.
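
The severity handling for quality assertions can be sketched as follows. This is a minimal illustration under stated assumptions: the `Assertion` dataclass, the `evaluate` function, and the three outcome strings are hypothetical names, and a failed severity=block assertion is mapped to quarantine here, which is one of the options listed above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Assertion:
    name: str
    check: Callable[[Dict[str, Any]], bool]  # returns True when the record passes
    severity: str                            # "block" or "warn"


def evaluate(record: Dict[str, Any], assertions: List[Assertion]) -> str:
    """Return 'accept', 'warn', or 'quarantine' for one record."""
    outcome = "accept"
    for assertion in assertions:
        if assertion.check(record):
            continue
        if assertion.severity == "block":
            # A blocking violation stops evaluation immediately
            return "quarantine"
        outcome = "warn"  # log-and-continue violations accumulate
    return outcome


RULES = [
    Assertion("non_negative_total", lambda r: r["order_total"] >= 0, "block"),
    Assertion("country_present", lambda r: bool(r.get("country_code")), "warn"),
]
```

Keeping the severity on the assertion rather than in the caller lets a rule move from warn to block during rollout without touching the pipeline code.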

The dead-letter queue pattern.

When. Consumer-side deserialisation error or quality violation with severity=block.
Where. A dedicated Kafka topic (<original-topic>.dlq) or an object-store prefix (s3://…/dlq/<topic>/<date>/).
What. The original payload bytes + the error message + timestamp + correlation id.
Who. Producer team owns the DLQ; consumer team monitors DLQ growth as a signal.
How long. DLQ retention typically 14–30 days; long enough for a producer to diagnose and replay, not so long it becomes a data-leak surface.

Common interview probes on enforcement.

"Where does contract enforcement happen — CI or runtime?" — both; describe the split.
"What happens when a schema-registry-BACKWARD check fails in CI?" — PR blocked, remediation options.
"Consumer receives an incompatible payload — what does it do?" — DLQ + alert; don't crash.
"How do you enforce a quality assertion?" — dbt test (batch) or streaming validator (Flink / Beam); severity + rollout phase.

Worked example — CI catches a schema-breaking rename

Detailed explanation. Reprise the silent-rename story, but this time the CI catches it. Walk through the exact GitHub Actions run, the odcs-lint output, the registry-compat-check, and the required remediation.

The proposed change. Producer renames user_id → customer_id and bumps 1.4.0 → 1.5.0.
The CI pipeline. contract-lint (shape) → contract-diff (semver) → registry-compat-check (BACKWARD).
The failure. contract-diff exits 2 (BREAKING vs minor bump); registry-compat-check also fails (the new required field customer_id has no default, which breaks BACKWARD).

Question. Show the GitHub Actions workflow, the failing step outputs, and the PR comment bot that summarises remediation.

Input.

Element	Value
Contract file	orders/contract.yaml
Baseline version	1.4.0
Proposed version	1.5.0
Change	rename `user_id` → `customer_id`
Registry	Confluent (BACKWARD)

Code.

# .github/workflows/contract-enforcement.yml
name: contract-enforcement
on:
  pull_request:
    paths:
      - '**/contract.yaml'

jobs:
  lint-and-diff:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write   # needed by the PR comment step
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install tooling
        run: pip install odcs-lint==0.9.0 avro-python3 requests

      - name: Shape validation
        run: |
          for f in $(git diff --name-only origin/main HEAD | grep 'contract\.yaml$'); do
            echo "--- Linting $f ---"
            odcs-lint "$f"
          done

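      - name: Diff validation (semver)
        id: diff
        run: |
          # pipefail: without it, tee masks a non-zero exit from odcs-lint
          set -o pipefail
          for f in $(git diff --name-only origin/main HEAD | grep 'contract\.yaml$'); do
            odcs-lint diff --base origin/main --path "$f" | tee -a /tmp/diff-out
          done

      - name: Registry compatibility check
        # run even if the diff step failed, so the PR shows both results
        if: ${{ !cancelled() }}
        run: python .github/scripts/registry_compat_check.py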

      - name: Post PR comment on failure
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
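            const fs = require('fs');
            // The diff file is absent when an earlier step failed first
            const output = fs.existsSync('/tmp/diff-out')
              ? fs.readFileSync('/tmp/diff-out', 'utf8')
              : '(no diff output; see the failing step log)';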
            const body = `## Contract enforcement failed\n\n\`\`\`\n${output}\n\`\`\`\n\n**Remediation:** bump version to 2.0.0 with a deprecation window, restore the removed column, or mark it \`deprecated: true\` for one release cycle.`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

--- Linting orders/contract.yaml ---
SHAPE: pass — file conforms to apiVersion v3.0.0

--- Diff against origin/main ---
DIFF:
  changes detected:
    - schema.remove: column "user_id"     → severity: BREAKING
    - schema.add:    column "customer_id" → severity: MINOR
  version bump: 1.4.0 → 1.5.0 (minor)
  required bump for BREAKING: major

FAIL: version bump (minor) does not match change severity (BREAKING).
      Options:
        (a) bump to 2.0.0 with deprecation window
        (b) restore column "user_id"
        (c) mark "user_id" as deprecated: true for one release cycle
exit code: 2

--- Registry compatibility check ---
subject: orders.public.v1-value
compatibility: BACKWARD
projected Avro schema differs from registry v1:
  removed field: user_id
  added field:   customer_id (required, no default)
FAIL: schema is not BACKWARD-compatible.
      customer_id has no default, so data written with v1 cannot be read.
      To pass BACKWARD, add customer_id as optional with a default.
exit code: 1

Step-by-step explanation.

The GitHub Actions workflow triggers on any PR that touches a contract.yaml file. Three steps run sequentially: shape validation, diff validation, registry compatibility check. Each step exits non-zero on failure.
Shape validation passes — the file is still valid YAML and has all required blocks. The failure is in the diff.
Diff validation reads the base branch's contract.yaml and the PR's file, computes the schema delta, and classifies each change. The remove is BREAKING (removing a column consumers depend on); the add is MINOR (a new column does not break existing consumers). The maximum severity is BREAKING; the version bump is 1.4.0 → 1.5.0 (minor); mismatch → exit 2.
The registry-compat-check projects the ODCS schema into Avro (using the sync tool from section 3), fetches the current registry schema for orders.public.v1-value, and runs Confluent's compatibility check locally (or via API). The rename fails BACKWARD because customer_id inherits user_id's required status and so has no default: a consumer reading with the new schema cannot decode messages produced under the old schema, which lack that field. Dropping user_id on its own would pass BACKWARD, since Avro readers ignore writer fields they do not declare.
The PR comment bot posts the failure output plus a remediation guide. The PR author sees exactly what the fix must be; no guessing, no waiting for a human reviewer.
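
The workflow calls a registry_compat_check.py script that is not shown. A minimal sketch of its core call might look like the following, assuming the Schema Registry REST endpoint `POST /compatibility/subjects/{subject}/versions/latest`, which returns an `is_compatible` flag. The function names and the injectable `post` parameter are assumptions made so the logic can be exercised without a live registry.

```python
import json
import urllib.request
from typing import Any, Callable, Dict


def _http_post(url: str, body: Dict[str, Any]) -> Dict[str, Any]:
    """POST a JSON body to the registry and return the decoded response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def is_compatible(registry_url: str, subject: str, avro_schema: Dict[str, Any],
                  post: Callable[[str, Dict[str, Any]], Dict[str, Any]] = _http_post) -> bool:
    """Ask the registry whether avro_schema is compatible with the subject's
    latest version under the subject's configured compatibility mode."""
    url = f"{registry_url}/compatibility/subjects/{subject}/versions/latest"
    # The registry expects the schema as a JSON-encoded string
    result = post(url, {"schema": json.dumps(avro_schema)})
    return bool(result.get("is_compatible", False))
```

In CI the script would exit non-zero when `is_compatible` returns False. Because this endpoint only tests compatibility and registers nothing, it is safe to run against the production registry from a pull request.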

Output.

Step	Exit	Comment	Merge?
Shape	0	pass	continue
Diff	2	BREAKING vs minor	blocked
Registry	1	BACKWARD fails	blocked
Overall	fail	PR comment posted	no

Rule of thumb. The CI gate is the gate producers experience. Ship it early, ship it with a clear PR-comment output, and never bypass it — a bypass on Friday is an incident on Monday. The registry check is the streaming-specific belt-and-braces; the contract-diff is the format-agnostic backstop.

Worked example — dbt model contract enforcement for a warehouse table

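Detailed explanation. dbt's contract: enforced: true feature is the warehouse-side counterpart to the registry's runtime enforcement. When a dbt model contract is enforced, dbt (a) requires every declared column to carry a name and data_type, (b) runs a preflight check before materialising and fails the build if the model's SQL yields columns whose names or data types differ from the declared columns block, (c) generates a CREATE TABLE with the declared types and constraints, and (d) leaves constraint enforcement to the warehouse at write time. On Snowflake, not_null is enforced while primary_key is recorded as metadata only, which is why the uniqueness test in the model file still matters.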

The dbt model. A dim_orders model that materialises the orders.public contract into the warehouse.
The dbt contract block. Copy-paste of the ODCS schema's columns with type mappings.
The enforcement points. dbt parse (contract is well-formed), dbt run / dbt build (columns match, constraints applied), dbt test (quality assertions).

Question. Produce the dbt model with a contract that mirrors the ODCS orders.public schema, and show the compile-time failure when a producer PR breaks the contract.

Input.

Component	Value
dbt project	analytics
Model	dim_orders
Materialisation	table
Contract	enforced: true
Warehouse	Snowflake

Code.

# models/marts/dim_orders.yml
version: 2
models:
  - name: dim_orders
    description: Dimensional table for confirmed orders.
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: varchar
        constraints:
          - type: not_null
          - type: primary_key
      - name: user_id
        data_type: varchar
        constraints:
          - type: not_null
      - name: order_total
        data_type: decimal(12,2)
        constraints:
          - type: not_null
      - name: country_code
        data_type: varchar(2)
        constraints:
          - type: not_null
      - name: created_at
        data_type: timestamp_tz
        constraints:
          - type: not_null
    tests:
      - dbt_utils.expression_is_true:
          expression: "order_total >= 0"
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: ['order_id']

-- models/marts/dim_orders.sql
{{ config(materialized='table') }}

SELECT
    order_id::VARCHAR             AS order_id,
    user_id::VARCHAR              AS user_id,
    order_total::DECIMAL(12,2)    AS order_total,
    country_code::VARCHAR(2)      AS country_code,
    created_at::TIMESTAMP_TZ      AS created_at
FROM {{ source('raw', 'orders') }}
WHERE status = 'confirmed'

# What happens when a producer PR renames the source column (illustrative output; exact wording varies by dbt version)
$ dbt build --select dim_orders

Running with dbt=1.7.10
Found 1 model, 2 tests, ...

Compiling model.analytics.dim_orders

CONTRACT ERROR in model dim_orders:
  Column 'user_id' declared in contract is missing from the model.
  Model yields:      order_id, customer_id, order_total, country_code, created_at
  Contract declares: order_id, user_id, order_total, country_code, created_at

  Missing: user_id
  Unexpected: customer_id

dbt build FAILED.

Step-by-step explanation.

The dbt model contract mirrors the ODCS schema. The data_type values are Snowflake-specific translations of the ODCS types (STRING → varchar, DECIMAL(12,2) → decimal(12,2), TIMESTAMP → timestamp_tz). A sync tool can auto-generate this from the ODCS YAML.
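contract: enforced: true tells dbt to verify the model's output columns match the declared columns exactly. Before materialising the table, dbt runs the model's SELECT as a preflight query that returns no rows, inspects the output column names and types, and compares them to the contract.
When the source table's user_id is renamed to customer_id and the model's SELECT is updated (or uses SELECT *) so that it yields customer_id instead of user_id, dbt's contract check detects the mismatch: user_id declared but missing, customer_id present but not declared. The build fails. With the explicit user_id reference in the model as written, the query would instead fail earlier with an invalid-identifier error from the warehouse; either way nothing is materialised.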
The failure happens in dbt build, which runs in the analytics team's CI as a required check. The producer PR that caused the rename to propagate through the raw table is blocked at the point where it would break the dim table.
Fixing forward: either update the source loader to keep populating user_id (aliasing from the renamed column), or update the dbt model + contract simultaneously in one PR. The contract change goes through the ODCS lint gate before the model change ships, ensuring the two stay in sync.
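
The code-generator mentioned in step 1 can be sketched as a small mapping function. This is an illustration under assumptions: the ODCS column shape (`name`, `type`, `required`) mirrors the sync tool from section 3, the `primary_key` flag is a hypothetical field, and the type table covers only the Snowflake mappings used in this example.

```python
from typing import Any, Dict, List

# ODCS logical type -> Snowflake data_type (assumed mapping for this example)
ODCS_TO_SNOWFLAKE = {
    "STRING": "varchar",
    "INT": "integer",
    "BIGINT": "bigint",
    "BOOLEAN": "boolean",
    "TIMESTAMP": "timestamp_tz",
}


def odcs_column_to_dbt(col: Dict[str, Any]) -> Dict[str, Any]:
    """Project one ODCS column into a dbt contract column entry."""
    odcs_type = col["type"].upper()
    if odcs_type.startswith("DECIMAL"):
        data_type = odcs_type.lower()          # DECIMAL(12,2) -> decimal(12,2)
    elif odcs_type in ODCS_TO_SNOWFLAKE:
        data_type = ODCS_TO_SNOWFLAKE[odcs_type]
    else:
        # Refuse to guess: an unmapped type should fail the generator
        raise ValueError(f"no Snowflake mapping for ODCS type {col['type']}")
    entry: Dict[str, Any] = {"name": col["name"], "data_type": data_type}
    constraints: List[Dict[str, str]] = []
    if col.get("required", False):
        constraints.append({"type": "not_null"})
    if col.get("primary_key", False):          # hypothetical ODCS flag
        constraints.append({"type": "primary_key"})
    if constraints:
        entry["constraints"] = constraints
    return entry


def odcs_to_dbt_columns(schema: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Project the whole ODCS schema block into a dbt columns list."""
    return [odcs_column_to_dbt(c) for c in schema]
```

The returned list can be dumped to YAML under the model's columns key. Generating it in CI and failing on any diff against the committed file is one way to keep the two definitions 1:1.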

Output.

Step	dbt outcome	Producer PR
dbt parse	pass	—
dbt compile	pass	—
dbt build (contract preflight)	FAIL: column mismatch	blocked
dbt build (materialisation)	not reached	blocked
dbt test	not reached	blocked

Rule of thumb. For warehouse-materialised contracts, dbt contract: enforced: true is the runtime gate. Sync the ODCS schema into the dbt contract block via a code-generator; keep them 1:1. The dbt gate is what stops a producer-side rename from silently corrupting the dim table.

Worked example — consumer-side runtime validator with DLQ routing

Detailed explanation. A streaming consumer that reads Avro-encoded messages needs to defend against payloads that don't conform to its reader schema. Show a Kafka consumer that (a) reads the schema-id, (b) fetches the writer schema from the registry, (c) attempts deserialisation, (d) routes failures to a DLQ topic with the original payload + error message + trace metadata.

The failure mode. Producer accidentally publishes a message whose schema ID the registry can no longer resolve (deleted schema, disconnected cluster).
The graceful handling. DLQ the message; alert the consumer team; keep the consumer running.
The reckless handling. Throw uncaught; consumer dies; lag grows; on-call woken up.

Question. Implement a Python Kafka consumer that reads Avro-encoded messages from orders.public.v1, validates against the contract's reader schema, and routes deserialisation failures to orders.public.v1.dlq. Include the metadata payload.

Input.

Component	Value
Source topic	orders.public.v1
DLQ topic	orders.public.v1.dlq
Registry	http://registry:8081
Consumer group	orders-consumer

Code.

import json
import logging
from confluent_kafka import Consumer, Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer
from confluent_kafka.serialization import SerializationError, MessageField, SerializationContext

log = logging.getLogger("consumer")

registry = SchemaRegistryClient({"url": "http://registry:8081"})

# The reader schema the consumer expects (from the ODCS contract, projected via sync tool)
with open("orders_reader.avsc") as fh:
    READER_SCHEMA = fh.read()
avro_deser = AvroDeserializer(registry, READER_SCHEMA)

consumer = Consumer({
    "bootstrap.servers":  "kafka:9092",
    "group.id":           "orders-consumer",
    "auto.offset.reset":  "earliest",
    "enable.auto.commit": False,   # commit only on successful process
})
consumer.subscribe(["orders.public.v1"])

dlq = Producer({"bootstrap.servers": "kafka:9092"})

def route_to_dlq(msg, error: Exception):
    envelope = {
        "original_payload_hex": msg.value().hex() if msg.value() else None,
        "original_key":         msg.key().decode("utf-8") if msg.key() else None,
        "topic":                msg.topic(),
        "partition":            msg.partition(),
        "offset":               msg.offset(),
        "timestamp_ms":         msg.timestamp()[1],
        "error_type":           type(error).__name__,
        "error_message":        str(error),
        "consumer_group":       "orders-consumer",
        "trace_id":             _extract_trace_id(msg),
    }
    dlq.produce(topic="orders.public.v1.dlq",
                key=msg.key(),
                value=json.dumps(envelope).encode("utf-8"))
    dlq.flush()
    log.warning("routed message offset=%s to DLQ: %s", msg.offset(), error)

def _extract_trace_id(msg):
    # confluent_kafka returns headers as (str key, bytes value) pairs
    hdrs = dict(msg.headers() or [])
    value = hdrs.get("trace_id")
    return value.decode("utf-8") if value else None

def process(order):
    # Real business logic; deliberately simple in the example
    log.info("processed order_id=%s total=%s", order["order_id"], order["order_total"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            log.error("consumer error: %s", msg.error())
            continue
        try:
            ctx = SerializationContext(msg.topic(), MessageField.VALUE)
            order = avro_deser(msg.value(), ctx)
            process(order)
        except SerializationError as e:
            # malformed or incompatible payload: a poison pill
            route_to_dlq(msg, e)
        except Exception as e:
            # includes registry lookup failures and handler bugs
            log.exception("unexpected error, routing to DLQ: %s", e)
            route_to_dlq(msg, e)
        # Synchronous commit, reached on success and after DLQ routing,
        # so the consumer never loops on the same offset
        consumer.commit(message=msg, asynchronous=False)
finally:
    consumer.close()  # leave the consumer group cleanly on shutdown

# DLQ envelope schema — the shape landed in orders.public.v1.dlq
type: object
properties:
  original_payload_hex: {type: [string, "null"]}
  original_key:         {type: [string, "null"]}
  topic:                {type: string}
  partition:            {type: integer}
  offset:               {type: integer}
  timestamp_ms:         {type: integer}
  error_type:           {type: string}
  error_message:        {type: string}
  consumer_group:       {type: string}
  trace_id:             {type: [string, "null"]}

Step-by-step explanation.

The consumer reader schema is the projection of the ODCS contract into Avro. The sync tool from section 3 generates it; the consumer reads it as a static file at startup.
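On each poll, the deserialiser reads the 5-byte wire-format prefix (a magic byte followed by the 4-byte schema ID), fetches the writer schema from the registry (cached), and projects the payload into the reader schema. Compatible payloads deserialise normally; malformed ones throw SerializationError, and other failures (for example a registry lookup error) fall through to the generic handler.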
On a SerializationError, the DLQ router builds an envelope containing the original bytes (as hex for JSON safety), the topic/partition/offset for replay reconstruction, the timestamp, the error class and message, and any trace-id from message headers. The envelope is JSON to keep the DLQ debuggable.
Publishing to the DLQ topic uses a synchronous flush to guarantee the DLQ write lands before the consumer commits the source offset. Without the flush, a consumer crash could lose the DLQ message.
consumer.commit(msg) after DLQ routing advances the source offset past the poison pill. Without this, the consumer would loop forever on the same bad message. The trade-off: the message is technically "processed" from the consumer's perspective; the DLQ is the durable record.
Alerting is external — a small side-car scrapes the DLQ topic and emits Prometheus metrics; on-call gets paged when DLQ growth exceeds a threshold.
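
Replay from the DLQ depends on being able to recover the original bytes from the envelope. A minimal sketch of that decoding step follows, assuming the JSON envelope shape produced by route_to_dlq above; the function name `decode_envelope` is hypothetical, and re-publishing the recovered message is left out.

```python
import json
from typing import Any, Dict, Optional, Tuple


def decode_envelope(raw: bytes) -> Tuple[Optional[bytes], Optional[bytes], Dict[str, Any]]:
    """Recover (key, value, origin) from one DLQ envelope.

    key and value are the original message bytes; origin holds the
    topic/partition/offset needed to locate the message in the source topic.
    """
    envelope = json.loads(raw.decode("utf-8"))
    payload_hex = envelope.get("original_payload_hex")
    value = bytes.fromhex(payload_hex) if payload_hex is not None else None
    original_key = envelope.get("original_key")
    key = original_key.encode("utf-8") if original_key is not None else None
    origin = {name: envelope.get(name) for name in ("topic", "partition", "offset")}
    return key, value, origin
```

Because the payload was stored as hex, the round trip is byte-exact, so a replayed message carries the same schema ID prefix as the original.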

Output.

Message	Deserialise result	Action
Compatible v1.x payload	success	processed + committed
Payload with unknown schema ID	registry lookup error (generic handler)	DLQ + committed
Payload with corrupt bytes	SerializationError or decoder exception	DLQ + committed

Rule of thumb. Every streaming consumer needs a DLQ. Not sometimes — every one. The DLQ is the shock absorber that keeps a single poison pill from taking down the consumer; it's also the audit trail that lets producer teams diagnose their own bugs. Ship the DLQ + envelope + alert wiring on day one; retrofitting is 10× the work.

Senior interview question on contract enforcement

A senior interviewer might ask: "Design the full enforcement stack for a batch-plus-streaming dataset governed by an ODCS contract. Walk me through what CI runs, what runtime does, where the DLQ lives, how a producer PR gets from open-to-merged, and what happens when a rogue payload hits the runtime gate."

Solution Using a four-check CI + a two-gate runtime + envelope-shaped DLQ

Full enforcement stack — batch + streaming dataset
==================================================

CI (pre-merge, GitHub Actions)
  Check 1  odcs-lint <path>                                   # YAML shape
  Check 2  odcs-lint diff --base main --path <path>           # semver
  Check 3  sync_odcs_to_registry --dry-run <path>             # registry compat
  Check 4  dbt parse + dbt build --select dim_orders          # dbt contract match

Runtime — streaming path
  Gate A   producer-side: KafkaAvroSerializer talks to registry;
           with auto.register.schemas=false, PRODUCE fails if the schema is not registered
  Gate B   consumer-side: AvroDeserializer + reader schema;
           on SerializationError → DLQ envelope → alert

Runtime — batch path
  Gate C   loader (Airflow/dbt) reads schema of incoming file/table;
           checks it against the ODCS-derived contract (dbt build with an enforced contract, or a loader-side check);
           fail → quarantine bucket; pass → load

DLQ envelope (both streaming and batch)
  {original_bytes, error_class, error_msg, topic/table, offset/row_id,
   timestamp, trace_id, consumer_group}

Alerting
  metric  dlq_growth_rate_5m > 0            → PagerDuty producer team
  metric  contract_lint_fail_rate_daily > 0 → Slack platform-data
  metric  freshness_probe_breach > 0        → PagerDuty producer team

# .github/workflows/enforce.yml — the full CI wiring
name: enforce
on:
  pull_request:
    paths:
      - '**/contract.yaml'
      - 'models/**'

jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: {fetch-depth: 0}
      - uses: actions/setup-python@v5
        with: {python-version: '3.11'}
      - run: pip install odcs-lint==0.9.0 dbt-core==1.7.10 dbt-snowflake==1.7.4

      - name: 1. odcs-lint (shape)
        run: odcs-lint orders/contract.yaml

      - name: 2. odcs-lint (diff / semver)
        run: odcs-lint diff --base origin/main --path orders/contract.yaml

      - name: 3. registry compat (dry-run)
        env:
          REGISTRY_URL: ${{ secrets.REGISTRY_URL }}
        run: python .github/scripts/registry_compat_check.py --dry-run

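      - name: 4. dbt contracts
        # the contract preflight runs when the model is built, so this step
        # needs warehouse credentials for a CI target (profile setup not shown)
        run: |
          dbt deps
          dbt parse
          dbt build --select dim_orders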

Step-by-step trace.

Stage	Check	What it catches
CI check 1	shape lint	YAML typos, missing blocks
CI check 2	semver diff	breaking change without major bump
CI check 3	registry dry-run	Avro/Proto incompatibility
CI check 4	dbt contract	warehouse schema mismatch
Runtime gate A	producer serializer	writing with unregistered schema
Runtime gate B	consumer deserializer	receiving incompatible payload
Runtime gate C	batch loader	file/table with wrong schema
DLQ	envelope	preserves poison pill + metadata
Alert 1	DLQ growth	consumer sees bad payloads
Alert 2	Freshness probe	SLA breach
Alert 3	Contract lint failure rate	failing producer PRs (Slack only, no page)

The stack has three layers: intent (CI), reality (runtime), and observability (DLQ + alerts). Every failure mode has a home: CI catches producer intent errors; runtime catches producer or infrastructure reality errors; DLQ preserves everything with enough metadata to replay; alerts wake the right team.

Output:

Failure mode	Where caught	Blast radius
Producer PR renames column	CI check 2	0 rows corrupted
Producer PR breaks BACKWARD	CI check 3	0 rows corrupted
Producer PR breaks dbt contract	CI check 4	0 rows corrupted
Producer writes with unregistered schema	Runtime gate A	0 rows landed
Consumer receives incompatible payload	Runtime gate B → DLQ	1 message quarantined
Batch loader receives bad file	Runtime gate C → quarantine bucket	1 file quarantined
Freshness SLA breach	probe → PagerDuty	consumers still on stale data but aware

Why this works — concept by concept:

Four CI checks, not one — each check owns a distinct failure mode. Skipping any one leaves a class of bugs uncaught. The four-check pattern is what makes the CI gate credible.
Two streaming runtime gates: producer-side (Gate A) and consumer-side (Gate B) form a belt-and-braces pair. Gate A relies on the registry being reachable; Gate B protects against Gate A being bypassed or misconfigured. Gate C plays the same role on the batch path.
DLQ envelope structure — the envelope stores enough metadata (original bytes, offset, trace id, error) to replay after fixing the producer. A DLQ without full metadata is a lossy drop; with metadata it's a replayable audit log.
Alerts tied to blast radius — DLQ growth pages the producer immediately (their bug); freshness breach pages the producer (their SLA); lint failure just posts to Slack (developer-loop, not production).
Cost — the CI adds ~30 seconds to a producer PR; runtime gates add ~1 ms per message (registry lookup is cached); the DLQ costs ~1% of source topic storage. The avoided cost is measured in hours of incident time — one prevented Monday-morning war room pays for years of enforcement infrastructure.
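
The DLQ envelope from the stack plan can be sketched as a small helper. This is a minimal illustration rather than a prescribed implementation: the function name build_dlq_envelope and the choice to hex-encode the payload are assumptions; the field names follow the envelope listed in the plan.

```python
import json
import time
import uuid


def build_dlq_envelope(original_bytes, error, topic, offset, consumer_group, trace_id=None):
    """Wrap a payload that failed deserialization so it can be replayed later.

    Field names mirror the envelope in the stack plan; hex-encoding the
    original bytes is an assumption made so the envelope stays valid JSON.
    """
    return {
        "original_bytes": original_bytes.hex(),
        "error_class": type(error).__name__,
        "error_msg": str(error),
        "topic": topic,
        "offset": offset,
        "timestamp": int(time.time() * 1000),
        "trace_id": trace_id or uuid.uuid4().hex,
        "consumer_group": consumer_group,
    }


envelope = build_dlq_envelope(
    b"\x00\x00\x00\x00\x07bad", ValueError("schema mismatch"),
    topic="orders", offset=42, consumer_group="orders-loader", trace_id="abc123",
)
# Round-trip through JSON and recover the exact bytes that failed.
restored = bytes.fromhex(json.loads(json.dumps(envelope))["original_bytes"])
```

Because the original bytes survive the round trip unchanged, a replay job can feed them back through the deserializer once the producer-side fix has shipped.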

5. SLAs + ownership + rollout

Freshness, availability, quality — the three SLA dials plus a phased rollout ladder from warn to block

The mental model in one line: the SLA block of an ODCS contract has three dials — freshness (max_freshness: PT15M), availability (availability_pct: 99.9), quality (per-assertion severity + rollout phase) — and a healthy rollout moves each assertion up a three-rung ladder (warn-only → block-on-new → block-everywhere) instead of shipping in enforcement mode from day one. SLA teeth without patience break more than they fix; SLA teeth with patience are how large organisations adopt contracts without whiplash.

The three SLA dials in detail.

Freshness. max_freshness: PT15M — the maximum age of the newest row when the consumer reads. Enforced by a scheduled probe that compares now() - MAX(created_at) (or similar) against the declared threshold. Producers own the alert.
Availability. availability_pct: 99.9 — the fraction of the SLA window (typically monthly) during which the dataset is queryable. Enforced by uptime probes (batch: dataset present and non-empty; streaming: topic reachable and consumer lag bounded).
Quality. per-assertion — null_rate(x) < 0.001, unique(id), x IN (allowed set). Enforced by dbt tests (batch) or streaming validators (Flink / Beam / consumer-side runtime). Each assertion has a severity (warn / block) and a rollout phase.
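
To make the first two dials concrete, here is a hedged sketch of the arithmetic behind them. The helper names are hypothetical, and the 30-day window is an assumption standing in for "monthly".

```python
from datetime import datetime, timedelta, timezone


def freshness_breached(newest_row_at, now, max_freshness):
    """True when the newest row is older than the declared max_freshness."""
    return (now - newest_row_at) > max_freshness


def allowed_downtime(availability_pct, window):
    """Downtime budget implied by an availability percentage over a window."""
    return window * (1 - availability_pct / 100)


now = datetime(2026, 7, 4, 12, 0, tzinfo=timezone.utc)
sla = timedelta(minutes=15)                      # max_freshness: PT15M

fresh = freshness_breached(now - timedelta(minutes=10), now, sla)
stale = freshness_breached(now - timedelta(minutes=20), now, sla)

# availability_pct: 99.9 over an assumed 30-day window
budget = allowed_downtime(99.9, timedelta(days=30))
budget_minutes = budget.total_seconds() / 60
```

Under these assumptions a 99.9% target leaves about 43 minutes of downtime per 30-day window, which is a useful number to put in front of a producer team before they sign.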

The rollout ladder — warn → block-new → block-all.

Rung 1 — warn-only. The assertion is evaluated; violations are logged and dashboarded; nothing is blocked. Purpose: measure the current state without breaking anything. Duration: 2–4 weeks typically.
Rung 2 — block-on-new. New rows that violate the assertion are rejected (streaming) or quarantined (batch); existing rows are grandfathered. Purpose: stop the bleeding without invalidating historical data. Duration: 2–4 weeks.
Rung 3 — block-everywhere. The assertion is treated as a hard constraint; any read returning violating rows fails the query. Purpose: full enforcement. Existing violations must be cleaned up before promoting to this rung.

The ownership matrix.

Producer. Signs the contract; commits to the SLAs and quality assertions; owns the pager when SLA probes breach.
Consumers. Subscribe to the contract; agree on the version they read; represent their needs in negotiation (asking for tighter SLAs, more assertions).
Steward (platform / data-gov team). Owns the enforcement infrastructure (lint, sync, probes, DLQ); mediates producer-consumer disputes; maintains the ODCS shape and the tooling.
On-call rotation. The specific PagerDuty (or equivalent) schedule that receives the pages. Usually the producer team, but the contract can specify a delegate for off-hours.

The deprecation contract.

When. Any major-version bump that removes or breaks a field must ship a deprecation-window contract first.
How long. Typically 30–90 days depending on the consumer count and their release velocity. High-consumer datasets get longer windows; low-consumer datasets can shorten.
What. The intermediate contract marks the affected field with deprecated: true and adds a warn-only quality assertion that flags any consumer still reading the field.
Who. The producer files the deprecation; the platform team communicates to consumers; the consumers acknowledge and migrate.
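
The window arithmetic is simple enough to automate. The sketch below is illustrative only: the audience labels and the 30/60/90-day mapping are assumptions picked from within the 30–90 day range above, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Window lengths are assumptions chosen from the 30-90 day range above.
WINDOW_DAYS = {"internal": 30, "cross_team": 60, "external": 90}


def cutover_date(announced_on, audience):
    """Earliest date the breaking major version may ship."""
    return announced_on + timedelta(days=WINDOW_DAYS[audience])


def days_remaining(today, announced_on, audience):
    """Days left in the window; 0 once the cutover date has passed."""
    return max((cutover_date(announced_on, audience) - today).days, 0)


cutover = cutover_date(date(2026, 7, 4), "cross_team")
left = days_remaining(date(2026, 8, 18), date(2026, 7, 4), "cross_team")
```

A deprecation announced on 2026-07-04 with a 60-day window yields a cutover of 2026-09-02; a reminder bot could call days_remaining to decide when to nudge teams that have not migrated.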

Common interview probes on SLAs + rollout.

"How do you introduce a new quality assertion to a dataset with existing violations?" — warn-only → block-on-new → block-all, over 6–12 weeks.
"Freshness SLA — who owns the pager?" — the producer team, always.
"How do you handle a contract change with 20 downstream consumers?" — deprecation window; 60–90 days; explicit consumer acknowledgement.
"What's the ODCS field for retention?" — sla.retention: P90D.

Worked example — introducing a new quality assertion to a dirty dataset

Detailed explanation. The team wants to enforce null_rate(user_id) == 0 on the orders.public dataset. Current reality: 0.3% of historical rows have a null user_id (an old bug). Shipping this in block mode immediately would fail every query. Walk through the three-rung rollout that lands the assertion in block mode without any downstream breakage.

Week 0. Add assertion as severity: block, rollout: warn_only. Every consumer sees the assertion in the contract but nothing blocks.
Week 4. Producer team ships a fix that populates user_id for new rows.
Week 4. Promote assertion to rollout: block_on_new. Only rows written after this rung are subject to enforcement.
Week 6. Backfill script fixes historical null rows.
Week 8. Promote assertion to rollout: block_all. Full enforcement.

Question. Show the three contract versions (weeks 0, 4, 8), the CI check that verifies the rollout, and the streaming validator that enforces block-on-new.

Input.

Rung	Duration	Producer action	Consumer action
warn_only	weeks 0-4	fix producer code	none
block_on_new	weeks 4-8	backfill historical	none
block_all	week 8+	ongoing enforcement	none

Code.

# Week 0 — assertion in warn-only
version: 1.5.0
quality:
  - assertion: null_rate(user_id) == 0
    severity: block
    rollout: warn_only          # rollout phase; not yet blocking
    since: "2026-07-04T00:00:00Z"

# Week 4 — assertion promoted to block-on-new
version: 1.6.0
quality:
  - assertion: null_rate(user_id) == 0
    severity: block
    rollout: block_on_new
    since: "2026-07-04T00:00:00Z"
    block_cutoff: "2026-08-01T00:00:00Z"   # rows before this stay grandfathered

# Week 8 — assertion promoted to block-all
version: 1.7.0
quality:
  - assertion: null_rate(user_id) == 0
    severity: block
    rollout: block_all
    since: "2026-07-04T00:00:00Z"

# streaming_validator.py — enforces block_on_new using the block_cutoff
import logging
from datetime import datetime

import yaml

log = logging.getLogger("contract-validator")

with open("orders/contract.yaml") as f:
    contract = yaml.safe_load(f)
assertions = contract["quality"]

def _parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp with a trailing Z into an aware datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def evaluate(record):
    """Run each block-rated assertion; return list of enforced violations.

    Assumes record["created_at"] is a timezone-aware datetime.
    """
    violations = []
    for a in assertions:
        if a["severity"] != "block":
            continue
        passed = _check_expr(a["assertion"], record)
        if a["rollout"] == "warn_only":
            if not passed:
                log.warning("warn_only violation: %s", a["assertion"])
            continue                        # measured but not enforced
        if a["rollout"] == "block_on_new":
            if record["created_at"] < _parse_utc(a["block_cutoff"]):
                continue                    # grandfathered
        if not passed:
            violations.append(a["assertion"])
    return violations

def _check_expr(expr: str, record):
    # Simplified — real impl parses the assertion DSL
    if "null_rate(user_id) == 0" in expr:
        return record.get("user_id") is not None
    return True

# In the consumer/validator loop:
#   violations = evaluate(record)
#   if violations: route_to_dlq(record, violations)
#   else:          process(record)

Step-by-step explanation.

Week 0 ships the assertion in rollout: warn_only. The validator evaluates the assertion, logs violations, and updates a Prometheus counter, but does not block. The producer team sees a null_rate(user_id) = 0.003 gauge; they diagnose and fix the producer code.
Week 4 the producer fix is deployed. New rows have user_id populated. The team promotes the assertion to rollout: block_on_new with block_cutoff: 2026-08-01. From this timestamp forward, any new row with user_id = null fails the assertion and routes to DLQ. Historical rows (before the cutoff) stay in the dataset untouched.
Week 6 a backfill script fixes the 0.3% of historical rows. The team runs SELECT COUNT(*) FROM orders WHERE user_id IS NULL to confirm zero. The backfill is one SQL migration + a dbt macro; producer team owns it.
Week 8 the team promotes to rollout: block_all. The block_cutoff field is removed. Any row with a null user_id, historical or new, now fails the assertion (there should be none left after the backfill). Consumers can rely on the guarantee.
Each promotion is a version bump with a lint-checked contract. The rollout ladder is the semver-safe path: warn_only → block_on_new → block_all is always a minor bump per rung; producers can move quickly without breaking the contract semantics.
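
The question also asks for a CI check that verifies the rollout. One way to sketch it, assuming the check compares the assertion on the base branch with the assertion in the PR: the rules encoded here (one rung at a time, never backwards, block_cutoff present only on block_on_new) are inferred from the ladder, not taken from any documented odcs-lint behaviour, and check_promotion is a hypothetical name.

```python
RUNGS = ["warn_only", "block_on_new", "block_all"]


def check_promotion(old, new):
    """Return a list of problems with promoting assertion `old` to `new`.

    Both arguments are dicts shaped like one entry of the `quality` block.
    """
    problems = []
    step = RUNGS.index(new["rollout"]) - RUNGS.index(old["rollout"])
    if step < 0:
        problems.append("rollout moved backwards")
    elif step > 1:
        problems.append("rollout skipped a rung")
    if new["rollout"] == "block_on_new" and "block_cutoff" not in new:
        problems.append("block_on_new requires block_cutoff")
    if new["rollout"] != "block_on_new" and "block_cutoff" in new:
        problems.append("block_cutoff only applies to block_on_new")
    return problems


week0 = {"assertion": "null_rate(user_id) == 0", "severity": "block", "rollout": "warn_only"}
week4 = {**week0, "rollout": "block_on_new", "block_cutoff": "2026-08-01T00:00:00Z"}
week8 = {**week0, "rollout": "block_all"}

ok_steps = check_promotion(week0, week4) + check_promotion(week4, week8)
skipped = check_promotion(week0, week8)
```

The two legitimate promotions produce no problems, while jumping straight from warn_only to block_all is flagged; a CI job would fail the PR whenever the returned list is non-empty.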

Output.

Week	Contract version	Rollout phase	Violations blocked?	Producer status
0	1.5.0	warn_only	no	fixing code
4	1.6.0	block_on_new	new rows only	code fixed
6	1.6.0	block_on_new	new rows only	backfilling
8	1.7.0	block_all	all rows	steady state

Rule of thumb. Never introduce a quality assertion in block mode against a dirty dataset. The three-rung ladder is the paved path — warn-only to discover the mess, block-on-new to stop the bleeding, block-all after cleanup. Shortcuts here are the number-one cause of "we tried contracts and they broke everything."

Worked example — a freshness SLA with a per-partition twist

Detailed explanation. A dataset ingested from multiple sources needs a per-partition freshness SLA — the global MAX(created_at) is misleading if one active partition is fresh but a low-volume partition is 8 hours stale. Show a partition-aware freshness probe with per-partition alerting.

The pitfall. Global freshness looks fine; per-partition freshness reveals the failure.
The fix. Probe reads MAX(created_at) per partition; alerts on any partition older than SLA.
The contract. Same max_freshness: PT15M but the probe evaluates per-partition.

Question. Design the probe query and the alerting logic that catches a stale partition without paging on healthy ones.

Input.

Element	Value
Partitioning column	source_id
SLA	15 minutes per partition
Sources	8 sources with wildly different volumes
Alert dedup window	10 minutes

Code.

# Contract addition — per-partition freshness
sla:
  max_freshness: PT15M
  freshness_scope:
    granularity: per_partition
    partition_column: source_id

-- Per-partition freshness probe
WITH per_partition AS (
  SELECT
    source_id,
    MAX(created_at)                        AS latest_row,
    EXTRACT(EPOCH FROM (NOW() - MAX(created_at))) AS staleness_seconds,
    COUNT(*) FILTER (WHERE created_at > NOW() - INTERVAL '1 hour') AS rows_last_1h
  FROM warehouse.orders
  GROUP BY source_id
)
SELECT
  source_id,
  latest_row,
  staleness_seconds,
  rows_last_1h,
  CASE
    WHEN staleness_seconds > 900 AND rows_last_1h = 0
        THEN 'stale_and_idle'         -- likely broken source
    WHEN staleness_seconds > 900 AND rows_last_1h > 0
        THEN 'stale_and_active'       -- lag; may recover
    ELSE 'ok'
  END AS status
FROM per_partition
ORDER BY staleness_seconds DESC;

# probe_scheduler.py — reads contract, runs probe, alerts per partition
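import time

import psycopg2
import requests
import yaml
from isodate import parse_duration
from psycopg2 import sql

with open("orders/contract.yaml") as f:
    contract = yaml.safe_load(f)
sla_s = parse_duration(contract["sla"]["max_freshness"]).total_seconds()
partition_col = contract["sla"]["freshness_scope"]["partition_column"]

conn = psycopg2.connect("postgres://…")
# NOW() returns the transaction start time, so a single long-lived
# transaction would freeze the clock; autocommit gives each probe a fresh one.
conn.autocommit = True
cur = conn.cursor()

PAGERDUTY_URL = "https://events.pagerduty.com/v2/enqueue"
BREACHES_BEFORE_PAGE = 3   # 3 consecutive 5-minute probes = sustained lag

def probe():
    # Quote the column as an identifier instead of interpolating raw text
    query = sql.SQL("""
        SELECT {col},
               EXTRACT(EPOCH FROM (NOW() - MAX(created_at)))
        FROM   warehouse.orders
        GROUP  BY {col};
    """).format(col=sql.Identifier(partition_col))
    cur.execute(query)
    return cur.fetchall()

def send_event(action, partition, summary=None):
    """Send a trigger or resolve event, deduplicated per partition."""
    event = {
        # Assumes roles.on_call_rotation holds the PagerDuty routing key
        "routing_key": contract["roles"]["on_call_rotation"],
        "event_action": action,
        "dedup_key": f"orders-freshness-{partition}",
    }
    if summary:
        event["payload"] = {"summary": summary, "severity": "error",
                            "source": "contract-probe"}
    requests.post(PAGERDUTY_URL, json=event, timeout=10)

# Track consecutive breaches per partition
consecutive = {}

while True:
    for partition, staleness in probe():
        if staleness > sla_s:
            consecutive[partition] = consecutive.get(partition, 0) + 1
            if consecutive[partition] >= BREACHES_BEFORE_PAGE:
                send_event("trigger", partition,
                           f"orders partition {partition} freshness > SLA "
                           f"({staleness:.0f}s > {sla_s:.0f}s)")
        else:
            if consecutive.get(partition, 0) >= BREACHES_BEFORE_PAGE:
                send_event("resolve", partition)   # close the open incident
            consecutive[partition] = 0
    time.sleep(300)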

Step-by-step explanation.

The contract now declares freshness_scope.granularity: per_partition with partition_column: source_id. The probe reads this and pivots from a single global freshness measurement to a group-by-partition query.
The SQL probe returns one row per partition with staleness_seconds. The added status column disambiguates two failure modes: stale_and_idle (no new rows in the last hour — probably a broken source) versus stale_and_active (some rows arriving, but the latest is old — probably a queue lag).
The Python scheduler tracks consecutive breaches per partition to avoid paging on a single-scrape flake. Three consecutive 5-minute breaches = 15 minutes of sustained lag = page.
dedup_key: orders-freshness-<partition> scopes deduplication per partition. If two partitions are stale at once, two independent incidents fire; each auto-resolves when its partition recovers.
When a partition recovers (staleness drops back below SLA), consecutive[partition] resets to zero and the next breach starts a new counter. The alert path is one-shot per breach cycle; the on-call sees exactly one incident per genuine stale period per partition.
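
The breach-counting logic is easier to reason about when it is separated from the PagerDuty call. The class below is a sketch under that assumption; BreachTracker and its return values are hypothetical names, and it fires once per breach cycle instead of relying on the dedup_key to absorb repeats.

```python
class BreachTracker:
    """Per-partition consecutive-breach counter with trigger/resolve decisions."""

    def __init__(self, sla_seconds, threshold=3):
        self.sla_seconds = sla_seconds
        self.threshold = threshold
        self.consecutive = {}

    def observe(self, partition, staleness_seconds):
        """Record one probe result; return 'trigger', 'resolve', or None."""
        count = self.consecutive.get(partition, 0)
        if staleness_seconds > self.sla_seconds:
            count += 1
            self.consecutive[partition] = count
            # Fire exactly once, on the probe that crosses the threshold.
            return "trigger" if count == self.threshold else None
        self.consecutive[partition] = 0
        # Only resolve if an incident was actually opened.
        return "resolve" if count >= self.threshold else None


tracker = BreachTracker(sla_seconds=900)
readings = [950, 1200, 1500, 1800, 40, 30]
actions = [tracker.observe("src_b", s) for s in readings]
other = tracker.observe("src_a", 40)
```

With six probes on src_b, the third breach opens the incident and the first healthy reading closes it; src_a never pages. Keeping this logic pure also means it can be unit-tested without a database or network access.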

Output.

Partition	Staleness	rows_last_1h	Status	Alert?
src_a	40 s	12000	ok	no
src_b	950 s	20	stale_and_active	if sustained 15 min
src_c	30000 s	0	stale_and_idle	immediate (adjust threshold)
src_d	60 s	1200	ok	no

Rule of thumb. Global freshness SLAs hide per-partition failures. If your dataset has any partitioning that matters, encode it in the SLA scope and probe accordingly. The extra alerting fidelity is what catches a broken source-of-the-month before the executive dashboard notices.

Worked example — the deprecation announcement flow

Detailed explanation. A producer needs to deprecate the country_code column and rename to country_alpha2. Twenty downstream consumers depend on the field. Walk through the deprecation announcement flow — the contract diff, the automated consumer notification, the acknowledgement tracking, and the cutover.

The change. Rename country_code → country_alpha2; deprecation window 60 days.
The announcement. Automated Slack + email + PR comment to every consumer team listed in roles.consumers.
The tracking. A dashboard shows per-consumer acknowledgement status.
The cutover. After 60 days, the 2.0.0 contract removes country_code.

Question. Design the announcement + tracking flow. Show the CODEOWNERS + Slack + tracking-dashboard integration.

Input.

Element	Value
Contract	orders.public 1.7.0 (before) → 1.8.0 (deprecation) → 2.0.0 (cutover)
Consumers	20 teams
Deprecation window	60 days
Notification channels	Slack, email, PR comment

Code.

# 1.8.0 deprecation-window contract
version: 1.8.0
info:
  status: active
  changelog:
    - version: 1.8.0
      date: "2026-07-04"
      change: "Deprecate country_code; introduce country_alpha2 as canonical."
      cutover_at: "2026-09-02T00:00:00Z"
      breaking_in: 2.0.0
schema:
  - name: country_code
    type: STRING
    deprecated: true
    description: "DEPRECATED. Use country_alpha2. Removed in 2.0.0 on 2026-09-02."
  - name: country_alpha2
    type: STRING
    required: false
    description: "Canonical ISO 3166-1 alpha-2 code."

# announce_deprecation.py — triggered by the CI on merge of the 1.8.0 contract
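import os

import requests
import yaml

with open("orders/contract.yaml") as f:
    contract = yaml.safe_load(f)
consumers = contract["roles"]["consumers"]

# Assumes changelog entries are appended in order, so the last one is the newest
changelog = contract["info"]["changelog"][-1]
msg = (f":warning: *Contract deprecation* :warning:\n"
       f"Dataset: `{contract['id']}` v{contract['version']}\n"
       f"Change: {changelog['change']}\n"
       f"Cutover: {changelog['cutover_at']} (v{changelog['breaking_in']})\n"
       f"Please migrate before cutover. Acknowledge via `/contract ack {contract['id']}`.")

for team in consumers:
    # The channel override is honoured by legacy incoming webhooks only;
    # app-scoped webhooks post to the channel they were installed for.
    resp = requests.post(os.environ["SLACK_WEBHOOK_URL"], json={
        "channel": f"#{team}",
        "text":    msg,
    }, timeout=10)
    resp.raise_for_status()   # fail the CI job if a notification is not delivered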

-- Tracking dashboard — per-consumer acknowledgement status
SELECT
    c.consumer_team,
    a.acked_at,
    CASE WHEN a.acked_at IS NULL THEN 'PENDING' ELSE 'ACKED' END AS status
FROM   contract_deprecations d
-- Expand the consumers array so teams without an ack still get a row
CROSS  JOIN LATERAL unnest(d.consumers) AS c(consumer_team)
LEFT   JOIN contract_acks a
       ON a.contract_id   = d.contract_id
      AND a.consumer_team = c.consumer_team
WHERE  d.contract_id = 'orders.public'
  AND  d.deprecation_version = '1.8.0'
ORDER  BY a.acked_at NULLS FIRST;

Step-by-step explanation.

The 1.8.0 contract's info.changelog records the deprecation intent, the cutover date, and the target major version. The country_code column is marked deprecated: true in the schema block; country_alpha2 is added as optional.
On merge of the 1.8.0 PR, a CI job runs announce_deprecation.py. It reads the consumers list, formats a Slack message with the changelog, and posts to each consumer team's channel. Email and per-team PR comments follow the same template.
The tracking dashboard reads a contract_deprecations table (populated by the CI job) and a contract_acks table (populated by a /contract ack Slack command). The join produces per-consumer status: PENDING or ACKED.
As consumers migrate, they run /contract ack orders.public in Slack; the command writes to contract_acks. The dashboard updates in real time. The steward reviews it weekly; teams that don't ack after 30 days get escalated to their manager.
On cutover day (60 days after the deprecation contract merged), the producer merges the 2.0.0 contract that removes country_code. The CI checks: (a) 2.0.0 major bump matches the removal (odcs-lint OK); (b) every consumer in roles.consumers has an ACKED status in the tracking table; if any is PENDING, the CI blocks with a "cannot cut over — pending consumers" error.
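
The cutover gate in the last step reduces to a set difference. A minimal sketch, assuming the CI job has already loaded the consumer list from roles.consumers and the acknowledged teams from the tracking table; the function names and team names are hypothetical.

```python
def pending_consumers(consumers, acked):
    """Consumers listed in roles.consumers that have not acknowledged yet."""
    acked_set = set(acked)
    return [team for team in consumers if team not in acked_set]


def cutover_gate(consumers, acked):
    """Return (exit_code, message) for the CI step guarding the 2.0.0 merge."""
    pending = pending_consumers(consumers, acked)
    if pending:
        return 1, "cannot cut over: pending consumers: " + ", ".join(pending)
    return 0, "all consumers acknowledged; cutover may proceed"


consumers = ["analytics", "billing", "growth"]          # hypothetical team names
blocked = cutover_gate(consumers, acked=["analytics", "growth"])
clear = cutover_gate(consumers, acked=["analytics", "billing", "growth"])
```

Returning the exit code alongside the message keeps the function testable; the CI wrapper would print the message and pass the code to sys.exit.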

Output.

Day	Contract	Consumer acked	Cutover blocked?
0	1.8.0	0/20	—
15	1.8.0	8/20	not yet
45	1.8.0	18/20	2 pending → escalate
60	2.0.0 PR	20/20	no; cutover proceeds

Rule of thumb. Deprecation windows without tracked acknowledgements are wishful thinking. Wire the /contract ack command, dashboard, and CI gate on day one — the producer never has to guess whether consumers are ready; the CI answers definitively.

Senior interview question on SLAs, ownership, and rollout

A senior interviewer might ask: "You're introducing data contracts to an org with 300 datasets, 40 teams, and a history of silent schema breaks. Walk me through the first-quarter rollout plan — how you'd sequence the SLAs, the enforcement, the consumer acknowledgements, and the on-call ownership handoff."

Solution Using the four-phase rollout + explicit owner handoff + SLA measurement lag

Quarter-1 rollout plan — data contracts across 300 datasets
============================================================

Weeks 0-2 — Foundation
  - Ship odcs-lint, sync tool, registry-compat-check as GitHub Actions
  - Ship dbt contract enforcement in warehouse-side CI
  - Deploy freshness probe framework
  - Stand up ownership dashboard

Weeks 2-6 — Contract generation (all in warn-only)
  - Auto-generate ODCS contracts from existing dbt models + Avro schemas
  - Each contract begins in status: draft, all assertions warn_only
  - Producer teams review + edit + status: active
  - Consumer teams subscribe by adding themselves to roles.consumers

Weeks 4-8 — SLA measurement (still warn-only)
  - Every contract with an SLA gets a probe
  - Freshness / availability / quality metrics gathered
  - Producer teams see their "actual vs promised" gap
  - Some teams negotiate looser SLAs to match reality

Weeks 6-10 — Rollout ladder
  - Contract quality assertions promote warn_only → block_on_new
  - Producer teams fix or backfill dirty data
  - Consumer teams unaffected (they're already reading grandfathered data)

Weeks 10-13 — Full enforcement
  - Assertions promote block_on_new → block_all where data is clean
  - Remaining dirty datasets stay block_on_new with a scheduled cleanup
  - Ownership handoff: on-call rotation lives in producer team calendar

Ongoing
  - Every new dataset must ship with a contract (block_on_new for schema; warn_only for quality)
  - Deprecation window minimum: 30 days for internal; 60 days for cross-team; 90 days for external
  - Quarterly SLA review: producer + consumer sit down; re-negotiate if needed

Step-by-step trace.

Week	Milestone	Datasets in phase
2	Tooling live	0 contracts, 300 datasets uncontracted
6	Contracts drafted	300 in draft, warn_only
8	SLAs measured	300 with measured metrics
10	Block-on-new active	200 promoted, 100 still measuring
13	Block-all active	150 fully enforced, 150 block_on_new
13 (Q1 end)	Ownership dashboard	Every dataset has a producer + on-call

The plan sequences discovery before enforcement, measurement before promises, and paved paths before hard rules. Quarter 1 ends with every dataset contracted; enforcement rolls forward through quarters 2 and 3 as data cleanup completes.

Output:

Quarter-1 milestone	Result
Contracts drafted	300/300
SLAs measured	300/300
Producer ownership assigned	300/300
Consumers acknowledged	300/300
Datasets in block-all	150/300
Datasets in block-on-new	150/300
Incident rate (silent schema breaks)	~75% lower vs pre-rollout

Why this works — concept by concept:

Sequence discovery before enforcement — measuring the current state (warn-only) before promising anything (block) is what makes SLAs credible. Teams cannot commit to a freshness they never measured.
Auto-generation seeds contracts — hand-writing 300 ODCS files is a non-starter; auto-generation from existing dbt models + Avro schemas produces first-draft contracts in hours, not weeks. Producers edit and sign; the marginal cost per contract drops from days to minutes.
Rollout ladder is dataset-scoped — each dataset moves through the ladder at its own pace based on data cleanliness. The org's rollout is the median dataset's rollout; slow ones stay in block_on_new longer without holding back the fast ones.
Ownership dashboard as the anchor — the dashboard tracks producer + on-call + consumer acknowledgement. It's the single artefact leadership reviews weekly; a dataset without an owner is immediately visible and immediately actionable.
Cost — quarter of platform-engineer time + roughly one producer-engineer day per contract. The avoided cost is the incident lift: pre-rollout the org averaged 12 silent-break incidents per quarter at ~8 hours each (96 hours); post-rollout the number drops to ~3 at the same 8 hours (24 hours). The saved 72 hours per quarter dwarfs the setup cost by month three.
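
The auto-generation step could look roughly like the sketch below. It assumes column metadata has already been extracted from a dbt manifest or an Avro schema by tooling not shown here; draft_contract and the 0.1.0 starting version are hypothetical, and the output uses the simplified contract shape from this guide.

```python
def draft_contract(dataset_id, owner, columns):
    """Build a first-draft contract dict from existing column metadata.

    `columns` is a list of (name, type, nullable) tuples.
    """
    return {
        "id": dataset_id,
        "version": "0.1.0",
        "info": {"owner": owner, "status": "draft"},
        "schema": [
            {"name": name, "type": col_type, "required": not nullable}
            for name, col_type, nullable in columns
        ],
        # Every generated assertion starts on the lowest rung of the ladder.
        "quality": [
            {"assertion": f"null_rate({name}) == 0",
             "severity": "block", "rollout": "warn_only"}
            for name, _, nullable in columns
            if not nullable
        ],
        "roles": {"producer": owner, "consumers": []},
    }


draft = draft_contract(
    "orders.public", "team-orders",
    [("order_id", "STRING", False), ("coupon", "STRING", True)],
)
```

The generated draft is deliberately conservative: status stays draft and every assertion is warn_only, so nothing blocks until the producer team has reviewed and signed it.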

Cheat sheet — data contract recipes

The four axes. Schema (columns + types + PII + unique), SLA (freshness + availability + quality + retention), quality (assertions + severity + rollout), ownership (producer + consumers + steward + on-call). A contract missing any axis is a schema-registry entry, not a data contract.
The 20-line ODCS template. apiVersion: v3.0.0, kind: DataContract, id: <domain>.<name>, version: <semver>, info: {title, owner, status}, schema: [{name, type, required, unique, pii, description, deprecated}], sla: {max_freshness, availability_pct, retention}, quality: [{assertion, severity, rollout}], roles: {producer, consumers, steward, on_call_rotation}, servers: [{type, dsn}]. Every contract fits on one screen.
Semver rules. Additive optional column = minor; additive required column = major; drop column = major; type widen = minor; type narrow = major; tighten SLA = major; loosen SLA = minor; description-only = patch. When in doubt, bump major and offer a deprecation window.
Registry compatibility modes. BACKWARD (consumer upgrades first; producer can add optional fields) is the streaming default; FORWARD (producer upgrades first; consumer keeps old schema); FULL (both directions); TRANSITIVE variants require compat across all previous versions. Use BACKWARD_TRANSITIVE for high-longevity subjects.
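Avro BACKWARD-safe changes. Add a field with a default; remove a field (the new reader simply ignores it, although this breaks FORWARD and the semver rules above still treat a dropped column as major); widen type (int → long, float → double); rename via aliases. Never add a field without a default and never narrow a type: the registry will reject both.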
Protobuf FULL-safe changes. Add optional field with new field number; add new enum values (readers preserve unknowns); rename a field while keeping its number (names are not part of the binary encoding) and mark retired fields with deprecated = true. Never reuse a field number; never remove a required field.
Subject-naming strategies. TopicNameStrategy for single-event topics (default; ODCS 1:1); TopicRecordNameStrategy for multi-event topics; RecordNameStrategy for shared record types across topics. Pick per-topic; never mix within a topic.
CI enforcement stack. (1) odcs-lint shape, (2) odcs-lint diff semver, (3) sync_odcs_to_registry --dry-run, (4) dbt parse + dbt compile on the contract-enforced models. Four checks as required status checks on every producer PR.
Runtime enforcement stack. Producer-side: KafkaAvroSerializer talks to registry, unknown schema-id fails PRODUCE. Consumer-side: AvroDeserializer + reader schema, SerializationError → DLQ envelope. Batch loader: schema check against contract before write; failure → quarantine bucket.
DLQ envelope. {original_payload_hex, original_key, topic, partition, offset, timestamp_ms, error_type, error_message, consumer_group, trace_id}. Enough metadata to replay after fix; JSON so it stays debuggable.
Rollout ladder for a quality assertion. Week 0-4 warn_only (measure current state); week 4-8 block_on_new (stop the bleeding, grandfather old rows); week 8+ block_all (full enforcement after cleanup). Never ship in block_all against a dirty dataset.
Deprecation window. Internal 30 days; cross-team 60 days; external 90 days. Intermediate contract marks the field deprecated: true; announcement bot pings every consumer; /contract ack command tracks acknowledgement; cutover CI blocks if any consumer is PENDING.
Freshness probe. now() - MAX(created_at) < parse_duration(sla.max_freshness). Runs every 5 minutes; alerts after 3 consecutive breaches (≥15 min sustained). Per-partition scope if the dataset partitions matter.
Ownership binding. roles.producer → CODEOWNERS entry on the contract file → required PR review. Add on_call_rotation for the pager destination. Audit trail = git log. Producer cannot claim ignorance.
dbt contract. contract: enforced: true on the model config; columns block mirrors ODCS schema block with warehouse-specific type names. The sync tool auto-generates it. dbt compile fails on column mismatch; dbt build fails on constraint violation.
Change-signal channels. Slack for consumers listed in roles.consumers; PR comment auto-posted with changelog; email for offline audiences; DataHub / Backstage for catalogue subscribers. Never expect consumers to poll git.
Contract quarterly review. Producer + consumers + steward meet quarterly; review SLA hits/misses; renegotiate loosen/tighten; retire deprecated fields; add new consumers. Cadence prevents drift.
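
The semver rules in the cheat sheet lend themselves to a mechanical check. The sketch below covers only the schema-level rules (added, dropped, and retyped columns); it ignores SLA and description changes, and the function name, column shape, and widening table are assumptions for illustration.

```python
ORDER = {"patch": 0, "minor": 1, "major": 2}
WIDENINGS = {("int", "long"), ("float", "double")}   # widenings named in the cheat sheet


def required_bump(old_cols, new_cols):
    """Smallest semver bump for a schema change.

    Each argument maps column name -> {"type": str, "required": bool}.
    """
    bump = "patch"

    def raise_to(level):
        nonlocal bump
        if ORDER[level] > ORDER[bump]:
            bump = level

    for name, old in old_cols.items():
        new = new_cols.get(name)
        if new is None:
            raise_to("major")                       # dropped column
        elif new["type"] != old["type"]:
            widened = (old["type"], new["type"]) in WIDENINGS
            raise_to("minor" if widened else "major")
    for name, col in new_cols.items():
        if name not in old_cols:
            raise_to("major" if col["required"] else "minor")
    return bump


v1 = {"order_id": {"type": "long", "required": True},
      "amount": {"type": "float", "required": True}}
add_optional = {**v1, "coupon": {"type": "string", "required": False}}
add_required = {**v1, "coupon": {"type": "string", "required": True}}
widen = {**v1, "amount": {"type": "double", "required": True}}
drop = {"order_id": v1["order_id"]}
```

Any type change that is not a known widening falls through to major, which matches the "when in doubt, bump major" rule.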

Frequently asked questions

What is a data contract and how is it different from a schema?

A data contract is a machine-readable, version-controlled, producer-signed agreement that declares the schema, SLA, quality assertions, and ownership of a dataset — all four axes together. A schema (Avro, Protobuf, JSON Schema, dbt model columns) is just the shape — the column names, types, and nullability. A contract wraps the schema with promises about timing (freshness, availability), content (null-rate ceilings, uniqueness, range constraints), and humans (producer team, consumer teams, steward, on-call rotation). The Open Data Contract Standard (ODCS) v3.x is the vendor-neutral YAML shape most 2026 platforms have converged on; Confluent Schema Registry and dbt model contracts are downstream materialisations that a sync tool derives from the ODCS source. In one line: the schema is what shape the data has; the contract is what promises the producer commits to keep.

What is the Open Data Contract Standard (ODCS)?

ODCS is a Linux Foundation / Bitol-hosted, vendor-neutral YAML specification for data contracts, currently at v3.x. The top-level blocks are info (metadata), schema (columns with types + PII + constraints), sla (freshness / availability / retention / query latency), quality (assertions with severity and rollout phase), roles (producer + consumers + steward + on-call), and servers (physical materialisations — Snowflake table, Kafka topic, S3 prefix). Version is semver: patch for description-only, minor for additive optional, major for breaking. The file is checked into the producer's repo, linted in CI (odcs-lint), and synced into downstream tooling (registry, dbt contracts, Great Expectations, DataHub). Adoption grew rapidly in 2025-2026 because it's the first data-contract shape that vendors don't control — teams can adopt without lock-in.

Do I still need a Schema Registry if I have data contracts?

Yes — the schema registry and the data contract play different roles. The data contract (ODCS YAML) lives in git; it's the design-time source of truth reviewed by humans and enforced by CI. The schema registry (Confluent, Apicurio) lives in the streaming stack; it's the runtime source of truth consulted by producers and consumers on every message. A sync tool projects the ODCS schema block into the registry as an Avro / Protobuf schema and enforces the registry's compatibility mode (BACKWARD, FORWARD, FULL). The registry catches producer-side bugs at the point of PRODUCE (the registry rejects an incompatible schema); the contract catches producer-side intent at the point of merge (CI blocks the PR). Together they form the belt-and-braces: contract for intent, registry for wire.

How do I enforce a data contract at runtime?

Runtime enforcement has three gates. Producer-side (streaming): the KafkaAvroSerializer (or Protobuf equivalent) registers or looks up the schema in the registry the first time it serializes a given schema, then caches the returned schema ID for later messages; if the schema is not registered (with auto-registration off) or is incompatible with the subject's compatibility mode, PRODUCE fails and the producer surfaces the error. Consumer-side (streaming): the AvroDeserializer reads the schema ID from the message payload (in the default Confluent wire format, a magic byte followed by a 4-byte ID), fetches the writer schema, and projects into the consumer's reader schema (from the contract); incompatible payloads raise SerializationError, and the consumer routes them to a DLQ topic with an envelope containing the original bytes + error + trace-id + offset. Batch: the loader (Airflow, dbt, Fivetran) checks the incoming file's schema against the contract before write; failures land in a quarantine bucket. Complement all three with runtime probes for the SLA block: freshness probes compare now() - MAX(created_at) against sla.max_freshness and page the producer on breach. The rule of thumb: CI catches producer intent; runtime catches producer or infrastructure reality; DLQ + quarantine preserve every rogue payload for replay.
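Two of the runtime pieces described above are small enough to sketch with the standard library: the DLQ envelope (original bytes + error + trace-id + offset) and the freshness comparison of now() - MAX(created_at) against sla.max_freshness. The function names and envelope keys are assumptions made for illustration, and the Kafka produce call and the warehouse query are left out.

```python
from __future__ import annotations

import base64
import json
from datetime import datetime, timedelta, timezone


def dlq_envelope(raw: bytes, error: Exception, trace_id: str, offset: int) -> str:
    """Wrap a rejected payload so it can be replayed later (key names are assumed)."""
    return json.dumps({
        # Base64 keeps the original bytes intact inside a JSON document.
        "original_bytes_b64": base64.b64encode(raw).decode("ascii"),
        "error": f"{type(error).__name__}: {error}",
        "trace_id": trace_id,
        "offset": offset,
    })


def freshness_breached(
    max_created_at: datetime,
    max_freshness: timedelta,
    now: datetime | None = None,
) -> bool:
    """True when the newest row is older than the contract's freshness promise."""
    now = now or datetime.now(timezone.utc)
    return (now - max_created_at) > max_freshness


# Usage: a payload that failed deserialization, and a table last written 2 hours ago.
print(dlq_envelope(b"\x00\x00\x00\x00\x07bad", ValueError("schema mismatch"), "trace-123", 42))
checked_at = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
print(freshness_breached(checked_at - timedelta(hours=2), timedelta(hours=1), checked_at))
# → True
```

In practice `max_created_at` would come from a `SELECT MAX(created_at)` against the dataset, and a breach would page the producer's on-call rather than print.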

Who owns a data contract — the producer team or the consumer team?

The producer team owns the contract — always. The producer is the party making the promises (schema stability, freshness SLA, quality assertions); they sign it via a CODEOWNERS binding on the contract file that requires their team's approval on every change. The steward team (usually platform-data or a data-governance team) owns the enforcement infrastructure (lint, sync, probes, DLQ) and mediates producer-consumer disputes; they are the neutral third party. Consumers subscribe to the contract, are listed in roles.consumers, and get notified on every change; they don't sign the contract but they do need to acknowledge deprecations via a tracked /contract ack command. When a contract is violated, the producer's on-call rotation (roles.on_call_rotation) gets paged. Consumers who want tighter SLAs or additional assertions negotiate with the producer during quarterly contract reviews; the steward mediates if they can't agree. Contract ownership without a producer is a wish; producer ownership without a steward is a silo.

How do I introduce data contracts to an organisation that has never had them?

Sequence discovery before enforcement. Quarter 1: stand up the tooling (odcs-lint, dbt contract, registry-sync, freshness probes, DLQ framework, ownership dashboard). Auto-generate ODCS contracts from existing dbt models and Avro schemas; every contract begins in status: draft with all assertions in warn_only. Producer teams review and sign; consumers subscribe. Quarter 2: measure the SLAs — run probes, gather freshness / availability / quality metrics, let producers see their "actual vs promised" gap. Some teams renegotiate looser SLAs to match reality; that's healthy. Quarter 3: promote assertions up the rollout ladder — warn_only → block_on_new → block_all — one rung at a time, one dataset at a time. Datasets with clean data promote fast; dirty ones stay in block_on_new while backfills run. Quarter 4: ownership handoff — every dataset has a producer, an on-call, a consumer list, and a quarterly review cadence. New datasets ship with a contract by default. The most common failure is going straight from "no contracts" to "block-everywhere in CI" — that produces a wave of overnight breakages and destroys organisational trust. The phased ladder is what makes the rollout survivable.
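The rollout ladder can be expressed as a tiny state machine. The sketch below assumes that block_on_new means "fail only on violations in newly arriving data while historical violations are tolerated", which is one reading of the ladder; `Phase`, `promote`, and `should_block` are hypothetical names.

```python
from __future__ import annotations

from enum import IntEnum


class Phase(IntEnum):
    """Rollout rungs for a quality assertion, lowest to highest."""
    WARN_ONLY = 0
    BLOCK_ON_NEW = 1
    BLOCK_ALL = 2


def promote(phase: Phase) -> Phase:
    """Move up exactly one rung; BLOCK_ALL is the top of the ladder."""
    return Phase(min(phase + 1, Phase.BLOCK_ALL))


def should_block(phase: Phase, new_violations: int, historical_violations: int) -> bool:
    """Decide whether a failing assertion should stop the pipeline in this phase."""
    if phase == Phase.WARN_ONLY:
        return False  # log and alert only
    if phase == Phase.BLOCK_ON_NEW:
        return new_violations > 0  # assumed meaning: old dirty rows are tolerated
    return (new_violations + historical_violations) > 0


# A dirty dataset mid-backfill: historical rows still violate, new rows are clean.
phase = promote(Phase.WARN_ONLY)
print(phase.name, should_block(phase, new_violations=0, historical_violations=120))
# → BLOCK_ON_NEW False
```

Because `promote` only ever moves one rung, jumping straight from warn_only to block_all requires two explicit, reviewable steps.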

Practice on PipeCode

Drill the ETL practice library for the producer-consumer, schema-evolution, and contract-testing problems senior interviewers use to probe data-contracts intuition.
Rehearse on the streaming practice library for the Avro / Protobuf / schema-registry, subject-strategy, and BACKWARD-compat problems every senior streaming role expects.
Sharpen the schema-design axis with the optimization practice library for the contract-shape, rollout-ladder, and SLA-tuning problems that come up in system-design rounds.
Stack the prerequisites against the broader 450+ data-engineering catalogue to anchor the ODCS + registry + enforcement intuition against real graded inputs.

Lock in data-contract muscle memory

Vendor docs explain fields. PipeCode drills explain the decision: when a rename becomes a major bump, when BACKWARD compatibility rejects a schema, when a quality assertion belongs in warn-only versus block-all, when a deprecation window has to run 90 days. Pipecode.ai is LeetCode for data engineering: pattern-first practice tuned for the production trade-offs senior data engineers actually face.
