Gabriel Anhaia

Posted on May 23

Event Schema Evolution: 4 Versioning Strategies, 1 That Quietly Breaks Consumers

#kafka #architecture #eventdriven #backend

Book: Event-Driven Architecture Pocket Guide: Saga, CQRS, Outbox, and the Traps Nobody Warns You About
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

Versioned topics, the strategy every Kafka tutorial recommends, is the one that quietly breaks consumers more often than any other. Not loudly, not with an exception. Quietly, with dropped reads.

A team I talked to last month had orders.v1 and orders.v2 running side by side for six months. The migration was "done." Then their fraud analytics dashboard started missing 8% of orders. The fraud consumer was still on orders.v1. The producer had stopped writing to orders.v1 three weeks earlier. No alarm fired. No exception. The consumer was healthy. It just had nothing to read.

That's the failure mode the Confluent docs don't warn you about loudly enough.

Why "schema evolution" is a five-year problem

Most schema-evolution advice is calibrated for a release cycle: how do you ship a new field without breaking the existing consumer next Tuesday. That problem is easy. Avro forward compatibility, Protobuf field numbers, JSON Schema's additionalProperties. All four major tools handle the next-release case fine.

The real problem is the five-year case. By year three you have 14 nullable fields on the schema because nobody wanted to break compatibility. By year four you have two parallel topics and nobody remembers which one is canonical. By year five somebody asks "can we delete the legacy_promo_code field" and the answer is "we have no idea who reads it."

The four strategies below all work in year one. They diverge sharply by year three. Only one survives year five without coordinated cross-team deploys.

Strategy 1: Additive-only changes

The safe default. You only add new fields, never remove, never rename, never repurpose. Old consumers ignore unknown fields. New consumers tolerate missing ones.

In Avro:

{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "shop.orders",
  "fields": [
    { "name": "order_id", "type": "string" },
    { "name": "customer_id", "type": "string" },
    { "name": "amount_cents", "type": "long" },
    {
      "name": "promo_code",
      "type": ["null", "string"],
      "default": null
    }
  ]
}

Adding promo_code as ["null", "string"] with default: null is forward and backward compatible. Old producers don't set it; new consumers see null. New producers set it; old consumers ignore it.

Protobuf gets this almost for free with field numbers:

syntax = "proto3";
package shop.orders;

message OrderPlaced {
  string order_id = 1;
  string customer_id = 2;
  int64 amount_cents = 3;
  optional string promo_code = 4;  // added year 2
  optional string referral_source = 5;  // added year 2
  // field 6 reserved: we tried "tier" and bailed
  reserved 6;
  reserved "tier";
}

JSON Schema is the noisy one. You need additionalProperties: false for validation discipline, then explicit additions:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "additionalProperties": false,
  "required": ["order_id", "customer_id", "amount_cents"],
  "properties": {
    "order_id": { "type": "string" },
    "customer_id": { "type": "string" },
    "amount_cents": { "type": "integer" },
    "promo_code": { "type": ["string", "null"] },
    "referral_source": { "type": ["string", "null"] }
  }
}

Where it fails: year two, you've added 14 nullable fields. Half of them are sometimes-populated, half are dead. The schema is a museum. New engineers can't tell which fields are meaningful and which are vestigial. The dead fields keep getting populated by consumers who haven't read the changelog. You can't delete anything because some consumer might still depend on it.

The gotcha: developers think "additive-only" means "I can't break anything." It doesn't. You can absolutely break a consumer by changing the semantics of an existing field while keeping its type. Renaming amount from "minor units" to "major units" passes every compatibility check and silently 100× every charge. The compatibility tools only check structure.

Strategy 2: Versioned topics

The strategy every tutorial recommends. When the change is too big for additive (you renamed a field, restructured nested objects, split one event into three), publish to a new topic: orders.v1, orders.v2.

Producer code splits, consumers migrate one by one, eventually v1 is retired.

# producer year 2: dual-write during migration
def publish_order(order):
    v1_payload = to_v1_schema(order)
    v2_payload = to_v2_schema(order)
    producer.send("orders.v1", value=v1_payload)
    producer.send("orders.v2", value=v2_payload)

The plan reads cleanly on the whiteboard. The reality is what got the team I mentioned above.

The silent failure: consumers don't all migrate. Some teams forget. Some teams own consumers nobody's touched since the engineer left. Six months into the migration, somebody (reasonably) asks "can we stop writing to v1, it's costing us $4K/month in storage." Yes, say the loudest teams. They stop. The consumers nobody asked are still subscribed to orders.v1. They just stop receiving events. No exception. No KafkaException. They look healthy. Their lag is zero (because there's nothing to lag behind).

This is the failure mode that breaks fraud detection, billing reconciliation, search indexes, ML pipelines. They go stale silently. You find out when a customer asks why their refund hasn't processed.

# what nobody adds: the consumer gap monitor
def check_consumer_topic_match(consumer_group):
    subscribed = admin.describe_consumer_group(consumer_group).topics
    expected = registry.get_active_topics_for(consumer_group)
    drift = set(subscribed) - set(expected)
    if drift:
        alert(f"{consumer_group} still on {drift}, producer migrated")

If your versioned-topics setup doesn't have a gap monitor like that running daily, you have the problem. You just don't know yet.

Where it works: small teams (3-5 consumers per event), short retention windows (you'll notice fast), strong central ownership (somebody's job is the migration). At 30+ consumers across 12 teams, it doesn't.

Strategy 3: Expand-contract

Borrowed from database schema migrations. Three phases:

Expand. The producer publishes both old and new schema simultaneously, often on the same topic with a discriminator.
Migrate. Consumers switch to the new schema, one by one. Producer keeps publishing both.
Contract. Once all consumers are confirmed on the new schema, the producer stops writing the old.

message OrderEvent {
  oneof payload {
    OrderPlacedV1 v1 = 1;
    OrderPlacedV2 v2 = 2;
  }
  // every event carries both for the migration window
}

On the consumer side:

func handle(ev *OrderEvent) error {
    // new consumer reads v2 directly
    if v2 := ev.GetV2(); v2 != nil {
        return processV2(v2)
    }
    // shouldn't hit this once we've cut over
    return nil
}

This works. The trick is detecting when phase 2 is actually done. The pattern teams use: every consumer reports which schema version it's reading via a schema_version_consumed metric. Phase 3 starts when the lowest-version metric across all consumer groups crosses the threshold.

# the consumer-readiness gate
def can_contract(event_type, target_version):
    metrics = prometheus.query(
        f'min(schema_version_consumed{{event="{event_type}"}})'
    )
    if metrics.value < target_version:
        return False, f"consumer still on v{metrics.value}"
    return True, "all consumers caught up"

The "we forgot a consumer" failure: the consumer-readiness gate only sees consumers that emit the metric. A consumer that doesn't emit it is invisible. The most common cause: an old batch job, a Lambda triggered once a week, a script somebody runs manually at month-end. These don't show up in your "all green" dashboard until phase 3 happens and they break.

The mitigation is registry-based: every consumer registers the schema versions it claims to handle. The gate cross-checks the registry against the metrics. If a consumer is registered but not emitting, that's a hard block on contraction.

Where it works: medium-sized systems (10-30 consumers), strong observability culture, schema registry already wired in. Phase windows of weeks-to-months are normal; phase windows of years mean expand-contract isn't actually your strategy, you're just lying about it.

Strategy 4: Upcaster pattern

The strategy that survives five years. Borrowed from event sourcing (Axon framework documented it first, Greg Young referenced it in his event-sourcing talks).

The core idea: events are stored in their original schema, forever. When a consumer reads an event, it passes through a chain of upcasters that transform v1 → v2 → v3 → current. New consumers see only the current schema. Old events on disk never change.

You stop versioning the topic. You start versioning the read path.

package events

type Upcaster interface {
    SourceVersion() int
    Upcast(raw map[string]any) (map[string]any, error)
}

// V1 -> V2: renamed "amount" (minor units) to "amount_cents"
type OrderPlacedV1toV2 struct{}

func (u OrderPlacedV1toV2) SourceVersion() int { return 1 }

func (u OrderPlacedV1toV2) Upcast(
    raw map[string]any,
) (map[string]any, error) {
    // old field was "amount" in dollars, new is amount_cents
    if amt, ok := raw["amount"].(float64); ok {
        raw["amount_cents"] = int64(amt * 100)
        delete(raw, "amount")
    }
    raw["schema_version"] = 2
    return raw, nil
}

// V2 -> V3: split shipping_address into structured fields
type OrderPlacedV2toV3 struct{}

func (u OrderPlacedV2toV3) SourceVersion() int { return 2 }

func (u OrderPlacedV2toV3) Upcast(
    raw map[string]any,
) (map[string]any, error) {
    addr, ok := raw["shipping_address"].(string)
    if !ok {
        // never had it set: older events from year 1
        raw["shipping"] = map[string]any{
            "line1": "", "city": "", "country": "",
        }
        raw["schema_version"] = 3
        return raw, nil
    }
    parsed, err := parseAddressString(addr)
    if err != nil {
        return nil, fmt.Errorf("v2->v3 addr parse: %w", err)
    }
    raw["shipping"] = parsed
    delete(raw, "shipping_address")
    raw["schema_version"] = 3
    return raw, nil
}

The chain runner picks up at whatever version the stored event has and applies upcasters in order:

type Chain struct {
    upcasters map[int]Upcaster  // keyed by SourceVersion
    target    int               // current schema version
}

func (c *Chain) Apply(
    raw map[string]any,
) (map[string]any, error) {
    v, _ := raw["schema_version"].(int)
    if v == 0 { v = 1 }  // events before we tracked it
    for v < c.target {
        uc, ok := c.upcasters[v]
        if !ok {
            return nil, fmt.Errorf(
                "no upcaster for v%d->v%d", v, v+1,
            )
        }
        next, err := uc.Upcast(raw)
        if err != nil {
            return nil, err
        }
        raw = next
        v++
    }
    return raw, nil
}

Consumer code becomes one line of schema-version logic:

func (h *OrderHandler) Handle(raw []byte) error {
    var event map[string]any
    if err := json.Unmarshal(raw, &event); err != nil {
        return err
    }
    upgraded, err := h.chain.Apply(event)
    if err != nil {
        return err
    }
    var current OrderPlacedV3
    return decode(upgraded, &current)
}

The gotchas:

Upcaster chains compound errors. A bug in the v1→v2 upcaster poisons every v1 event that flows through five years of consumers. Test upcasters with the actual historical event payloads from your event store, not synthetic examples.
Performance. Every read now runs through N transformations. Most teams cache the upcast result: store the upcast event alongside the original, recompute only when a new upcaster is added.
Schema registry awkwardness. Confluent Schema Registry doesn't ship with first-class upcaster support. You wire it into the deserializer manually. Axon Server does. NATS JetStream doesn't.

Where it works: event-sourced systems where events are the source of truth and you'll need to replay them. Long-lived systems (5+ years). Teams that can afford the upfront cost of building the upcaster framework. For most shops that's about a sprint, then maintenance is small.

A year-by-year forecast

What each strategy looks like at year 1, 2, 3, and 5 of production.

Strategy	Year 1	Year 2	Year 3	Year 5
Additive-only	Clean, easy, everybody's happy	4-6 nullable fields, still fine	12-15 nullable fields, schema is a museum, semantic drift starts	Schema is unreadable, dead fields nobody dares delete, undocumented "actually this field means X now"
Versioned topics	One topic, no version suffix yet	`v1` and `v2` co-exist, dual-write everywhere	Some teams on v3, others still on v1, silent migration breakage	Three or four parallel topics, nobody knows canonical, fraud/billing/search consumers drifting
Expand-contract	Not needed yet	First migration succeeds in a clean window	Coordination overhead, "we forgot a consumer" incident #1	Works for teams with strong ownership; degenerates into versioned-topics for teams without
Upcaster	Upfront cost feels heavy for the gain	First upcaster pays for itself	Upcaster chain handles all migrations, consumers read current schema only	Chain has 8-10 transforms, performance cached, only survivor without coordinated deploys

The "additive-only year 5" cell is where most production systems actually live. Not because additive is the best strategy, but because nobody chose strategy. They just kept adding fields.

How to pick (and what to do if you've already picked wrong)

If you're starting fresh:

Under 10M events/day, fewer than 5 consumers, project life under 3 years → additive-only. The overhead of anything else doesn't pay back.
Heavy event-sourced workload, replay is part of the value prop, multi-year horizon → upcaster from day one. It's a sprint of upfront work that saves you a year of pain at year four.
Anything in between → expand-contract with strong consumer-readiness gates. Make the registry mandatory; consumers that don't register can't subscribe.

If you're on versioned topics and feeling the silent-drift pain:

Build the consumer-topic-match gap monitor first. Run it daily. Until you have visibility into which consumers are reading from which topic version, every other change is guessing.
Don't retire any old topic until the gap monitor reports zero subscribers for two consecutive weeks.
For the next major schema change, switch to expand-contract on the v2 topic. Don't make v3, v4, v5 topics.

If you're on additive-only and the schema has 14 fields you don't trust:

Field-usage telemetry. Every consumer logs which fields it reads. Run for 30 days.
Any field with zero reads across all consumers is a deletion candidate. Verify by removing it from the schema in a non-prod environment and watching consumer error rates.
The fields that survive: tag them with purpose: comments in the schema. Future-you will thank you.

The five-year problem is not a release-cycle problem. It's an ownership problem. Whichever strategy you pick, the part that actually matters is whether somebody on your team knows, by name, which consumers read which events.

What's the oldest event schema still in production at your shop, and how many people would notice if you broke it tomorrow? Drop a war story in the comments.

If this was useful

Schema evolution is one of the chapters in the Event-Driven Architecture Pocket Guide: Saga, CQRS, Outbox, and the Traps Nobody Warns You About. The book covers upcaster chains in more depth (including how to wire them into Confluent Schema Registry and NATS JetStream), the failure modes of expand-contract under partition rebalancing, and the patterns for catching silent consumer drift before it costs you a customer. If the version-drift war you're fighting today still feels open-ended, the book has the playbook for the next four years of it.