Gabriel Anhaia

Posted on May 24

Event Sourcing Snapshots: When, How Often, and Why Most Teams Over-Snapshot

#architecture #eventsourcing #performance #backend


Snapshot every 50 events, the docs say. Most teams snapshot every 10 because rebuilds got slow. They snapshot the wrong things. Snapshot storage doubles every quarter and nobody notices until the bill arrives.

The cadence is wrong because it's tied to write volume, not the thing that actually hurts: how long it takes to rebuild an aggregate when somebody asks for it.

What a snapshot actually saves you — replay cost math

A snapshot is a serialized aggregate state at a point in time. When you load the aggregate, instead of replaying all 8,420 events from the beginning, you load the last snapshot at event 8,000 and replay 420.

The cost you're saving is wall-clock replay time on the read path. Write it out:

T_rebuild = T_load_snapshot + (N_events_since * T_apply_per_event)

For a typical CRM contact aggregate with 5,000 lifetime events:

T_apply_per_event measured at p95: 0.3 ms (deserialize JSON, apply handler, mutate in-memory state).
No snapshot: 5000 * 0.3 ms = 1500 ms per load. That's a 1.5-second p95 on anything that loads the aggregate.
Snapshot every 50: ~50 * 0.3 ms = 15 ms. Fast.
Snapshot every 500: ~500 * 0.3 ms = 150 ms. Still inside an SLA for most APIs.
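
The arithmetic above is easy to reproduce. A minimal sketch, assuming the 0.3 ms apply cost from the example and treating snapshot load time as zero (the load_snapshot_ms parameter is hypothetical, there so you can plug in your own measurement):

```python
def rebuild_ms(events_since_snapshot: int,
               apply_ms_per_event: float = 0.3,
               load_snapshot_ms: float = 0.0) -> float:
    """T_rebuild = T_load_snapshot + N_events_since * T_apply_per_event."""
    return load_snapshot_ms + events_since_snapshot * apply_ms_per_event


# Worst case per interval: the load arrives just before the next snapshot.
for label, events in [("no snapshot", 5000), ("every 50", 50), ("every 500", 500)]:
    print(f"{label:>12}: {rebuild_ms(events):7.1f} ms")
```

Once you have a real number for snapshot load time, pass it in; a slow snapshot store narrows the gap between intervals.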

So far the every-50 advocates look right. Now add the cost they ignored:

T_snapshot_write = T_serialize + T_persist
Storage_growth   = (lifetime_events / snapshot_interval) * avg_snapshot_size

For 100,000 active aggregates snapshotting every 50 events at 2 KB per snapshot, that's 100,000 snapshot writes and ~200 MB of new snapshot blobs every time the fleet advances by 50 events per aggregate. Over a 5,000-event lifetime that is 100 snapshots per aggregate, or roughly 20 GB across the fleet. Quarter over quarter, that compounds.
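
To see how the storage formula scales, here is a rough sketch. It assumes every aggregate reaches the 5,000 lifetime events of the CRM example, that nothing is ever pruned, and that 1 KB is 1,000 bytes; all three are simplifications of mine, not measurements.

```python
def snapshot_storage_bytes(aggregates: int, lifetime_events: int,
                           snapshot_interval: int, avg_snapshot_bytes: int) -> int:
    """Storage_growth = (lifetime_events / snapshot_interval) * avg_snapshot_size, fleet-wide."""
    snapshots_per_aggregate = lifetime_events // snapshot_interval
    return aggregates * snapshots_per_aggregate * avg_snapshot_bytes


KB = 1_000  # assumption: decimal kilobytes

# One round: every aggregate advances by 50 events and writes one snapshot.
one_round = snapshot_storage_bytes(100_000, 50, 50, 2 * KB)
# Full lifetime at two different intervals.
every_50 = snapshot_storage_bytes(100_000, 5_000, 50, 2 * KB)
every_500 = snapshot_storage_bytes(100_000, 5_000, 500, 2 * KB)

print(f"one round:  {one_round / 1e6:.0f} MB")
print(f"every 50:   {every_50 / 1e9:.0f} GB over the lifetime")
print(f"every 500:  {every_500 / 1e9:.0f} GB over the lifetime")
```

Stretching the interval by 10x cuts retained snapshot bytes by the same factor, which is the trade you are making against the 15 ms versus 150 ms rebuild above.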

Worse: snapshot writes happen on the command path in most implementations. The write path pays for read-path optimization that may never get used.

Why every-N-events is the wrong cadence — replay frequency matters more than write frequency

Here's the part nobody writes down. The benefit of a snapshot only materializes when the aggregate gets loaded.

A shopping-cart aggregate during a flash sale: 200 events in 30 seconds, then abandoned forever. At every-10 you snapshotted it twenty times. Net benefit of those snapshots: zero. Nobody loads that cart again.

A customer-loyalty aggregate that accumulates 12 events a year but gets queried 4,000 times a day (every page view in the account area). That one deserves a snapshot near the latest event.

Snapshot cadence should be a function of replay_frequency * aggregate_load_cost, not write rate. Two aggregates with identical event counts can have wildly different optimal cadences depending on how often they're read.

The rule of thumb that actually works: snapshot when the replay cost since the last snapshot exceeds the snapshot write cost, weighted by replay-to-write ratio.

In code, that becomes a decision your repository can make at write-time using runtime stats.

The right cadence — measure p95 rebuild time, set a target, snapshot to hit it

Pick a target. Something like "p95 aggregate load under 50 ms." Then derive the snapshot interval from measurement, not from a number in a blog post.

The procedure:

Instrument your aggregate repository. Record events_replayed and replay_duration_ms per load. Tag with aggregate type.
Run a week. Compute p95 of replay_duration_ms per aggregate type.
For any aggregate type whose p95 exceeds the target, lower the snapshot interval. For any type whose p95 is far under the target, raise the interval (save storage).
Re-measure monthly. Aggregates that get hotter over time will drift past the target; cold ones can have their intervals stretched.

A concrete table from a real-ish loyalty system after one tuning cycle:

Aggregate type	Avg events/load	p95 (before)	Snapshot interval	p95 (after)
LoyaltyAccount	1,840	540 ms	every 50	18 ms
ShoppingCart	12	4 ms	none	4 ms
OrderAggregate	220	68 ms	every 200	36 ms
SubscriptionPlan	38	11 ms	none	11 ms

Two of the four aggregate types don't need snapshots at all. Most codebases snapshot all of them at the same interval because it's a single config value.

Snapshot retention — keep N most recent, archive older to cold

You don't need every snapshot. The current snapshot plus a few historical ones is enough.

Reasons to keep more than one:

Debugging. Load aggregate at event 10,000 (old snapshot + replay 200 events) to reproduce a bug.
Point-in-time queries. Compliance asks for the state of account X on 2025-03-15.
Snapshot corruption recovery. Latest snapshot fails the checksum (more on that below); fall back to the previous one.

A retention policy that works:

Hot tier (Postgres/Redis): latest 3 snapshots per aggregate.
Cold tier (S3 with Intelligent-Tiering): everything older, partitioned by month.
Delete cold snapshots after 7 years if the compliance window allows it.

Implementing this as a nightly job is fine. Trying to do it inline at write-time complicates the command path for no benefit.
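
The core of that nightly job is a small pure function. A sketch, assuming you can list the snapshot sequence numbers for one aggregate; plan_retention and RetentionPlan are illustrative names, and the actual move to S3 is left out.

```python
from typing import NamedTuple


class RetentionPlan(NamedTuple):
    keep_hot: list[int]      # snapshot seqs that stay in the hot store
    move_to_cold: list[int]  # snapshot seqs to archive


def plan_retention(snapshot_seqs: list[int], hot_count: int = 3) -> RetentionPlan:
    """Keep the newest hot_count snapshots hot; everything older goes cold."""
    newest_first = sorted(snapshot_seqs, reverse=True)
    return RetentionPlan(keep_hot=newest_first[:hot_count],
                         move_to_cold=newest_first[hot_count:])


plan = plan_retention([6000, 6500, 7000, 7500, 8000])
print(plan.keep_hot)      # [8000, 7500, 7000]
print(plan.move_to_cold)  # [6500, 6000]
```

Keeping the decision separate from the I/O makes the job easy to test and safe to dry-run before it deletes anything from the hot tier.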

Snapshot schema evolution — the trap when you change the aggregate shape

You changed LoyaltyAccount to add a tier field. Old snapshots don't have it. You deploy. Three minutes later, deserialization fails on every account whose latest snapshot predates the deploy.

The fix is the same pattern you already use for event schema evolution: an upcaster.

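class LoyaltyAccountSnapshotUpcaster:
    CURRENT_VERSION = 4

    def upcast(self, payload: dict, version: int) -> dict:
        """Bring a snapshot payload from `version` up to CURRENT_VERSION."""
        payload = dict(payload)  # shallow copy; leave the stored payload untouched
        if version < 2:
            # v2 added preferred_language
            payload["preferred_language"] = "en"
        if version < 3:
            # v3 renamed "points" -> "balance_points"
            payload["balance_points"] = payload.pop("points", 0)
        if version < 4:
            # v4 added tier, derived from the balance
            payload["tier"] = self._compute_tier(payload["balance_points"])
        return payload

    def _compute_tier(self, balance_points: int) -> str:
        # Placeholder thresholds for illustration; use your domain's tier rules.
        return "gold" if balance_points >= 10_000 else "standard"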

Two things matter here. First, store the schema version inside the snapshot envelope rather than deriving it from anything mutable. Second, test the upcaster against real production-shaped snapshots in CI; synthetic test data tends to miss the edge cases that 18 months of production produced.
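
One way to keep the version inside the envelope, as a sketch. The field names and the to_json / load_envelope helpers are assumptions for illustration; the upcast argument is any callable with the same (payload, version) signature as the upcaster above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class SnapshotEnvelope:
    aggregate_id: str
    seq: int             # last event sequence folded into the payload
    schema_version: int  # written once, at snapshot time
    payload: dict


def to_json(envelope: SnapshotEnvelope) -> str:
    return json.dumps(asdict(envelope))


def load_envelope(raw: str, upcast: Callable[[dict, int], dict],
                  current_version: int) -> SnapshotEnvelope:
    """Parse a stored envelope and upcast its payload if it is older than current."""
    doc = json.loads(raw)
    version = doc["schema_version"]
    if version > current_version:
        raise ValueError(f"snapshot schema v{version} is newer than code v{current_version}")
    payload = doc["payload"]
    if version < current_version:
        payload = upcast(payload, version)
    return SnapshotEnvelope(doc["aggregate_id"], doc["seq"], current_version, payload)
```

Refusing snapshots from a newer schema matters during rollbacks: old code reading a new-shape payload is the same deserialization failure in the other direction.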

When upcasting gets expensive (a v1-to-v9 chain that does 200 ms of work), trigger a synchronous re-snapshot after the upcasted aggregate is loaded. The next read will skip the chain.

A 40-line snapshot policy enforcer

Here's the decision logic, applied at every aggregate save. Python because it reads cleanly; the shape ports to any language.

from dataclasses import dataclass
from typing import Callable

@dataclass
class SnapshotPolicy:
    target_p95_ms: float           # e.g. 50.0
    apply_cost_ms: float           # measured per-event apply cost
    write_cost_ms: float           # measured snapshot write cost
    min_events_between: int = 20   # never snapshot too aggressively
    max_events_between: int = 2000 # always snapshot eventually
    read_to_write_ratio: float = 1.0  # hot aggregates > 1, cold < 1

class SnapshotEnforcer:
    def __init__(self, policy: SnapshotPolicy, last_snapshot_at: Callable[[str], int]):
        self.policy = policy
        self.last_snapshot_at = last_snapshot_at  # aggregate_id -> event seq

    def should_snapshot(self, aggregate_id: str, current_seq: int) -> bool:
        last = self.last_snapshot_at(aggregate_id)
        gap = current_seq - last

        if gap < self.policy.min_events_between:
            return False
        if gap >= self.policy.max_events_between:
            return True

        # Replay cost if a read came right now:
        projected_replay_ms = gap * self.policy.apply_cost_ms
        # Amortized snapshot write cost per future read:
        amortized_write_ms = self.policy.write_cost_ms / max(self.policy.read_to_write_ratio, 0.01)

        # Snapshot only when projected replay would breach the target
        # and the write cost is justified by read frequency.
        breaches_target = projected_replay_ms > self.policy.target_p95_ms
        write_is_worth_it = projected_replay_ms > amortized_write_ms
        return breaches_target and write_is_worth_it

Wire this into the repository's save() method. The read_to_write_ratio and apply_cost_ms come from your instrumentation. Populate them from rolling 7-day averages, per aggregate type. Now snapshot frequency adapts to actual load instead of a hardcoded if seq % 50 == 0.
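
Deriving those two inputs from rolling counters might look like the sketch below. TypeStats and policy_inputs are hypothetical names, and defining the ratio as loads divided by saves is my assumption about what "read-to-write" means in your system; the apply cost here is a mean, so swap in a p95 if you want the policy to be conservative.

```python
from dataclasses import dataclass


@dataclass
class TypeStats:
    """Rolling 7-day totals for one aggregate type."""
    loads: int            # aggregate loads
    saves: int            # aggregate saves (commands handled)
    events_replayed: int  # events replayed across all loads
    replay_ms: float      # time spent replaying across all loads


def policy_inputs(stats: TypeStats) -> tuple[float, float]:
    """Return (read_to_write_ratio, apply_cost_ms) for one aggregate type."""
    ratio = stats.loads / stats.saves if stats.saves else 0.0
    apply_cost = stats.replay_ms / stats.events_replayed if stats.events_replayed else 0.0
    return ratio, apply_cost


# Illustrative numbers only: a hot, rarely written aggregate type.
loyalty = TypeStats(loads=28_000, saves=70, events_replayed=1_400_000, replay_ms=420_000.0)
ratio, apply_cost = policy_inputs(loyalty)
print(f"read_to_write_ratio={ratio:.0f}, apply_cost_ms={apply_cost:.2f}")
```

A type with no saves in the window gets a ratio of 0.0, which the enforcer above clamps to 0.01, so it only snapshots at max_events_between.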

The gotcha — snapshots can drift from event truth if you bypass the event store

Snapshots are a cache. Like every cache, they can be wrong.

The drift scenarios that hit production:

A bad migration writes directly to the snapshot table to "fix" an aggregate, bypassing the event stream.
A bug in an event handler corrupted state for two weeks before being caught; old snapshots have the bug baked in.
Schema upcaster has a subtle defect that silently produces different state than a fresh replay would.

The aggregate's event stream is the source of truth. The snapshot must equal what a full replay would produce. If they diverge, you have a silent correctness bug, and every dashboard still looks fine.

The check is simple. Periodically (nightly is enough for most domains, hourly for ledgers), for a sample of aggregates:

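def verify_snapshot(aggregate_id: str) -> bool:
    """Return True if the latest snapshot agrees with a full replay."""
    snapshot = snapshot_store.load_latest(aggregate_id)
    events = event_store.load_all(aggregate_id)
    # Ground truth: replay every event from the start of the stream.
    fresh_state = replay(events)
    # Assumes snapshot.seq is the count of events already folded into the
    # snapshot, so events[snapshot.seq:] is the tail written after it.
    snapshot_state = replay(events[snapshot.seq:], initial=snapshot.state)
    return checksum(fresh_state) == checksum(snapshot_state)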

Pick a stable checksum. Sort dict keys before hashing. Exclude derived/transient fields. When the checksum diverges, alert, then drop the snapshot — the next load will rebuild from events.
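
A minimal sketch of such a checksum, assuming the aggregate state is a JSON-serializable dict. The field names in TRANSIENT_FIELDS are hypothetical placeholders for whatever your aggregates carry that should not affect equality.

```python
import hashlib
import json

# Hypothetical examples of derived/transient fields to ignore.
TRANSIENT_FIELDS = frozenset({"loaded_at", "cache_hits"})


def checksum(state: dict, exclude: frozenset = TRANSIENT_FIELDS) -> str:
    """SHA-256 over a canonical JSON form of the state."""
    # Only top-level fields are excluded in this sketch.
    stable = {key: value for key, value in state.items() if key not in exclude}
    # sort_keys orders keys at every nesting level; fixed separators pin whitespace.
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Watch for values that serialize differently after a round trip, such as floats, datetimes, or sets; normalize them before hashing or the check will raise false alarms.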

A 0.1% sample per aggregate type per night catches systemic drift (a bad upcaster, a buggy handler) within a week; a single hand-edited snapshot can go unsampled for much longer. For ledger-style aggregates where correctness is regulatory, sample 1% and alert on a single divergence. The compute cost is real but it's the kind of cost you want to pay before an auditor asks.
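
How much a given sample rate buys you depends on how widespread the drift is. A back-of-the-envelope sketch, assuming each sampled aggregate is an independent random draw and using a fleet of 100,000 aggregates (so 0.1% is 100 checks per night); both are simplifying assumptions.

```python
def detection_probability(drifted_fraction: float, sampled_per_night: int,
                          nights: int) -> float:
    """Chance that at least one drifted aggregate is sampled within `nights`."""
    miss_one_night = (1.0 - drifted_fraction) ** sampled_per_night
    return 1.0 - miss_one_night ** nights


# Systemic drift: 1% of snapshots affected, 100 checks per night, 7 nights.
systemic = detection_probability(0.01, 100, 7)
# Isolated drift: 1 bad snapshot in 100,000, same sampling.
isolated = detection_probability(1 / 100_000, 100, 7)
print(f"systemic: {systemic:.4f}, isolated: {isolated:.4f}")
```

Under these assumptions a week of sampling is close to certain to hit systemic drift and very unlikely to hit a one-off, which is why a single divergence is worth an alert rather than a shrug.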

What to actually do Monday

You probably have a snapshot config that says SNAPSHOT_EVERY_N_EVENTS = 10 or = 50. Replace it with three things:

Per-aggregate-type measurements of apply cost, write cost, and read frequency.
A target p95 rebuild time and a policy enforcer that consults the measurements instead of counting events.
A nightly checksum job that catches snapshot drift before your data team does.

Most teams snapshot too often, retain too much, and never verify. Flipping any one of those gets you wins. Flipping all three gets you a storage bill that grows with your business, not with calendar quarters.

What's your current snapshot interval, and when did you last verify a snapshot matched a fresh replay?

If this was useful

Snapshots are one of those topics where the textbook answer ("every N events") wires straight into a production cost problem nobody flags. The Event-Driven Architecture Pocket Guide digs into the same pattern across Saga, CQRS, and Outbox — the chapter on read models and projections covers the snapshot/replay tradeoff and the upcaster pattern in more depth, with the failure modes I keep seeing in real systems.