Archit Agarwal

Posted on May 18

Your Inventory Counter Just Went Negative. Here’s Why — and How to Fix It.

#distributedsystems #go #systemdesign #programming

Black Friday. 9:02am. Your e-commerce platform has been live for two minutes.

A limited-edition sneaker drops - 500 pairs in stock. Within seconds, the inventory counter is spinning. Add to cart. Checkout. Payment confirmed. Your servers in the US, EU, and Asia are all taking orders simultaneously to keep latency low for every customer on the planet.

By 9:04am, you've sold 620 pairs.

You have 500.

Your ops channel lights up. Support tickets flood in. You're now emailing 120 paying customers to tell them their confirmed order doesn't exist. The refunds go out. The trust doesn't come back.

What went wrong wasn't a bug in your payment code. It wasn't a race condition in the traditional sense. It was something more fundamental: your distributed counter had no floor. And when multiple servers decremented it concurrently, each one saw a local view that looked safe - and the global invariant silently broke.

This is the problem Bounded Counters were designed to solve. And the solution is one of the most elegant ideas in the CRDT literature.

What an Inventory Counter Actually Is

Before the distributed systems, let's ground this in something simple.

An inventory counter is just a number. It represents how many units of something you have available. When a customer buys one, the number goes down. When a shipment arrives, the number goes up. The rule is straightforward:

The counter must never go below zero.

In SQL, this is trivial:

CREATE TABLE inventory (
    product_id  BIGINT PRIMARY KEY,
    stock       INT DEFAULT 0 CHECK (stock >= 0)
);

The CHECK constraint enforces the floor. If a transaction tries to bring stock below zero, the database rejects it. One server. One source of truth. The constraint holds perfectly.

The same logic applies to a wallet: balance must never go below zero. To an ad budget: spend must never exceed the limit. To a rate limiter: requests must never exceed the ceiling.

These are bounded counters - counters with a numeric invariant that must never be violated.

Simple when you have one server. Deeply interesting when you have many.

The Moment Simplicity Breaks

Your platform grows. One server can't handle Black Friday traffic. You distribute.

You deploy database replicas in the US, EU, and Asia. Each region accepts writes locally to keep latency low. The system is eventually consistent - replicas sync in the background, but any given server might be working from a slightly stale view of the world.

Here's what happens to your inventory counter.

500 pairs in stock. Three servers, each seeing the same initial state: stock = 500. In the first two minutes, customers are hammering all three regions simultaneously.

The US server processes 200 orders. Locally: 500 - 200 = 300. Looks fine.

The EU server, working from the same initial state it loaded before the US orders propagated, processes 180 orders. Locally: 500 - 180 = 320. Also looks fine.

The Asia server processes 170 orders. Locally: 500 - 170 = 330. Still fine locally.

When the three servers finally sync: 500 - 200 - 180 - 170 = -50.
You've oversold by 50 pairs. Every check passed locally. The global invariant shattered silently.

Why the Solutions You're Thinking Of Don't Work Here

At this point, two approaches come to mind. Both fail at scale in ways worth understanding.

Distributed locks: Before decrementing, acquire a global lock on the inventory key across all replicas. Only one server writes at a time.

This works - and kills your latency. A lock that spans data centres separated by hundreds of milliseconds of network round-trip adds that latency to every single purchase. On Black Friday, when the whole point of distributing was to make checkout fast, a global lock turns your purchase flow into a queue. Users abandon cart. Revenue drops.

Strong consistency everywhere: Route all inventory writes through a single primary node or quorum. Every decrement gets serialised.

Also works. Also slow. Same problem as distributed locks, different mechanism. The CAP theorem is not negotiable: if you want strong consistency across geographically distributed nodes, you pay in latency and availability.

In Episode 2, PN-Counter gave us increment and decrement without coordination. But I ended that piece with a warning: PN-Counter has no floor. Nothing in its design prevents the score from going to -50,000. For Reddit karma, nobody cares. For inventory, that's a P0 incident.

We need something new.

The Insight: Stop Distributing the Counter. Distribute the Permission.

Here's the shift in thinking that makes Bounded Counters work.

With PN-Counter, we stopped trying to coordinate on every write. We accepted that replicas might temporarily disagree on the exact value. That was fine because eventual consistency was acceptable.

Bounded Counters need something stronger: a hard floor that can never be violated, even under concurrent writes, even during network partitions, even when replicas have stale views.

The key insight: you don't need every replica to know the exact current value to enforce the floor. You just need every replica to know it has permission to decrement.

Think of it like tickets.

You have 100 items in stock. The floor is 0. That means you have 100 units of "decrement permission" - 100 tickets. You distribute those tickets across your three servers before the sale starts:

US server: 50 tickets
EU server: 30 tickets
Asia server: 20 tickets

Now the rule is simple: a server can only decrement if it has at least one ticket to spend. When it decrements by N, it spends N tickets.

The US server takes 50 orders: spends 50 tickets, now has 0. Stops accepting orders locally. The EU server takes 30 orders: spends 30 tickets, now has 0. Stops accepting orders locally. The Asia server takes 20 orders: spends 20 tickets, now has 0. Stops.

Total orders: 100. Exactly the stock. Global invariant: never violated.

Nobody coordinated on a single order. Every check was purely local. The constraint held globally because the math is airtight: if total tickets = total available decrements, and each replica can only spend what it owns, the sum of decrements can never exceed the total tickets.

This is the escrow idea, applied to distributed systems. You're not distributing the counter. You're distributing the permission to decrement.

The Wallet Walk-Through

Let's trace this concretely with a wallet that must never go below zero.

Initial state: Wallet value = 100. Floor = 0. Available decrement rights = 100 − 0 = 100.

Rights distributed:

DC_A: 50 rights
DC_B: 30 rights
DC_C: 20 rights

Payment at DC_A: User pays 15 units.

DC_A checks local rights: 50 ≥ 15 ✓. Executes locally. No coordination needed.

DC_A: 35 rights remaining
DC_B: 30 rights remaining
DC_C: 20 rights remaining
Global value: 85

Concurrent payment at DC_B: User pays 25 units.

DC_B checks local rights: 30 ≥ 25 ✓. Executes locally. DC_A and DC_B never talked during either payment.

DC_A: 35 rights remaining
DC_B: 5 rights remaining
DC_C: 20 rights remaining
Global value: 60

Both payments went through. Both were validated purely locally. The global floor - wallet ≥ 0 - was never at risk, because the sum of all rights (35 + 5 + 20 = 60) exactly equals the remaining balance.

Failed payment at DC_C: User tries to pay 25 units.

DC_C checks local rights: 20 < 25 ✗. Cannot guarantee the floor locally.

Two options: fail immediately, or coordinate to borrow rights from another replica.

Rights transfer: DC_C asks DC_A for 10 rights. DC_A has 35 and can spare them.

DC_A: 25 rights  (gave 10 to DC_C)
DC_B: 5 rights
DC_C: 30 rights  (received 10 from DC_A)

DC_C re-attempts: 30 ≥ 25 ✓. Payment goes through.

DC_A: 25 rights
DC_B: 5 rights
DC_C: 5 rights
Global value: 35

At no point did the wallet go below zero. At no point did all three servers need to agree on the exact current value before processing a payment. The invariant held through local checks alone - with one brief coordination event only when a replica ran out of its allocated rights.

The Data Structure

The Bounded Counter state at each replica tracks three things:

type BoundedCounter struct {
    Min int64     // lower bound (e.g. 0)
    N   int       // number of replicas
    ID  int       // this replica's index

    R [][]int64   // R[i][j]: rights replica i transferred to j
                  // R[i][i]: rights replica i created for itself (via increments)
    U []int64     // U[i]: decrements executed at replica i
}

The rights matrix R is the core. Each replica owns the diagonal - R[i][i] is the rights it created. Off-diagonal entries R[i][j] track rights transferred from replica i to replica j. Every entry only ever increases. This is the grow-only constraint, applied to a matrix instead of a single value.

Reading the current value:

func (c *BoundedCounter) Value() int64 {
    var inc, dec int64
    for i := 0; i < c.N; i++ {
        inc += c.R[i][i]
        dec += c.U[i]
    }
    return c.Min + inc - dec
}

Sum the diagonal (total increments), subtract total decrements, add the floor.

Checking local rights:

func (c *BoundedCounter) LocalRights() int64 {
    id := c.ID
    var created, received, sent, spent int64
    created = c.R[id][id]
    for i := 0; i < c.N; i++ {
        if i == id { continue }
        received += c.R[i][id]
        sent += c.R[id][i]
    }
    spent = c.U[id]
    return created + received - sent - spent
}

Rights I created, plus rights I received, minus rights I gave away, minus rights I already spent.

Decrementing:

func (c *BoundedCounter) Dec(n int64) error {
    if c.LocalRights() < n {
        return fmt.Errorf("not enough rights")
    }
    c.U[c.ID] += n
    return nil
}

Only decrement if you have the tickets. Otherwise, reject or trigger a rights transfer.

Transferring rights:

func (c *BoundedCounter) Transfer(to int, n int64) error {
    if c.LocalRights() < n {
        return fmt.Errorf("not enough rights to transfer")
    }
    c.R[c.ID][to] += n
    return nil
}

The merge - identical pattern to every CRDT we've built:

func (c *BoundedCounter) Merge(other *BoundedCounter) error {
    for i := 0; i < c.N; i++ {
        for j := 0; j < c.N; j++ {
            if other.R[i][j] > c.R[i][j] {
                c.R[i][j] = other.R[i][j]
            }
        }
        if other.U[i] > c.U[i] {
            c.U[i] = other.U[i]
        }
    }
    return nil
}

Element-wise maximum. Every entry in R and U only ever grows. Merge is commutative, associative, idempotent. Replicas always converge.

Why the Invariant Can Never Break

Here's the mathematical guarantee, stated plainly.

Total decrement rights across all replicas = current value − floor.

Each replica can only decrement by spending rights it owns. A replica cannot own negative rights. Therefore, the total decrements across all replicas can never exceed the total rights in the system. Therefore, the global value can never drop below the floor.

This holds even when replicas have stale views of each other's state. Each replica's local check - "do I have enough rights?" - is conservative by design. If you have 20 rights and the operation needs 25, you say no. You might technically have been able to borrow rights from another replica - but you don't know that yet, and the safe answer is always no.

Coordination is infrequent and amortised. For operations that fit within local rights, zero coordination. For operations that exceed local rights, one round-trip to request a transfer. Compared to coordinating on every write, this is a dramatic improvement in practice.

The Trade-Off You Should Know

Bounded Counter is the most complex CRDT in this series so far - and it comes with trade-offs worth naming.

Rights starvation: if a replica runs out of rights and the replica it needs to borrow from is unreachable (network partition, server failure), the operation fails. The invariant is safe - you'd rather fail an operation than violate the floor - but the user experience is a rejection, not a delay. For inventory, that means "out of stock" even when stock technically exists elsewhere. Acceptable for correctness. Worth designing around.

State size scales with replica count: the R matrix is N×N. With 3 replicas, 9 cells. With 100 replicas, 10,000. For most geo-distributed systems (2-7 data centres), this is negligible. For edge computing with thousands of nodes, you'd want a different approach.

Rights rebalancing is a background concern: a replica that consistently gets more traffic than its rights allocation will hit starvation more often. Operational tooling to monitor and rebalance rights distribution is not optional - it's part of running this in production.

The Bigger Pattern

Three episodes in, the shape of CRDT design is clear.

Every CRDT we've built solves a hard distributed problem by identifying a constraint that eliminates ambiguity.

G-Counter said: make it grow-only. The maximum is always correct.

PN-Counter said: never decrement - use two grow-only counters and subtract at read time.

Bounded Counter says: never coordinate on every operation - distribute the permission to operate instead.

The constraint is always the solution. The art is finding the right one.

What's Next

The next piece steps back from counters entirely.

What if instead of tracking numbers, you need to track membership - items in a set, users in a group, tags on a document? The same concurrent update problems apply. And a surprisingly simple grow-only constraint solves the basic case cleanly.

Episode 4: G-Set - grow-only sets, where the items go in but never come out, and why that constraint enables a clean distributed implementation.

The full series:

Episode 1: G-Counter - why YouTube's view count is approximate
Episode 2: PN-Counter - Reddit karma and the two-notebook trick
Episode 3: Bounded Counter - inventory floors and distributed permission (you're here)
Episode 4: G-Set - grow-only sets
Episode 5: LWW Registry - last-write-wins
Episode 6: OR-Set - the fix for "deleted but still there"
Episode 7: The full picture - when to use CRDTs and when to walk away

Subscribe to the newsletter to get each episode as it drops.
If you're building distributed systems and want to think through invariant enforcement, CRDT design, or distributed architecture with someone who has been deep in this space - I do 1:1 sessions on Topmate.

DEV Community