<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Krishna Tej Chalamalasetty</title>
    <description>The latest articles on DEV Community by Krishna Tej Chalamalasetty (@chkrishnatej).</description>
    <link>https://dev.to/chkrishnatej</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F78797%2F396fc7c7-c2ab-412e-a5f5-1d373e603a0b.jpg</url>
      <title>DEV Community: Krishna Tej Chalamalasetty</title>
      <link>https://dev.to/chkrishnatej</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chkrishnatej"/>
    <language>en</language>
    <item>
      <title>The Boring Infrastructure That Breaks AI APIs: A Guide to Billing and Metering</title>
      <dc:creator>Krishna Tej Chalamalasetty</dc:creator>
      <pubDate>Sun, 05 Apr 2026 18:59:33 +0000</pubDate>
      <link>https://dev.to/chkrishnatej/the-boring-infrastructure-that-breaks-ai-apis-a-guide-to-billing-and-metering-4hb</link>
      <guid>https://dev.to/chkrishnatej/the-boring-infrastructure-that-breaks-ai-apis-a-guide-to-billing-and-metering-4hb</guid>
      <description>&lt;p&gt;Recently, Anthropic users ran into a frustrating pattern. Usage limits hit faster than expected. Credits appeared late. In some cases, the same request was billed twice. The forums and GitHub issues filled up fast.&lt;br&gt;
But stepping back from the frustration, have you ever thought about what it actually takes to build billing infrastructure for an AI API?&lt;/p&gt;

&lt;p&gt;It sounds simple. Count tokens, charge money. But the moment you add streaming responses, concurrent users, prepaid credits, multiple token types, and an async pipeline underneath, it becomes one of the harder problems a platform team will face. And when it breaks, it breaks visibly. Users notice billing errors faster than almost any other kind of bug.&lt;/p&gt;

&lt;p&gt;This article is about what that system looks like under the hood, where it tends to fail, and what engineers can do about it.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Anatomy of a Billing System
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9pfijl57wa5nizuakqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9pfijl57wa5nizuakqp.png" alt="Anatomy of billing system" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If I were to break down how a billing system is structured, I would anchor it around three core layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event&lt;/strong&gt; is the starting point. When a request completes, the API server emits a usage event that contains information about who made the request, which model, how many tokens, and a unique ID. That unique ID does quiet but important work. It is the foundation for both auditability and idempotency, which we will get to shortly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meter&lt;/strong&gt; is the core of billing and probably the hardest part to get right. It takes the stream of raw events and answers one question: how much has this account consumed, and does it have enough balance to continue? Doing this correctly at scale, for millions of concurrent users, across multiple token types, with partial streaming responses in flight, is where most of the interesting failures live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ledger&lt;/strong&gt; is the authoritative record of every transaction that touched a balance. It is append-only by design. You never edit a past entry. You only add new ones. The current balance is the sum of all entries for that account. This gives you two things a single mutable balance column cannot: a full audit trail for disputes, and a natural place to enforce idempotency.&lt;/p&gt;

&lt;p&gt;These three layers are not just organizational. They represent distinct consistency boundaries. The event layer needs to be fast. The meter layer needs to be correct. The ledger layer needs to be durable. Mixing them into a single system is where things start to go wrong.&lt;/p&gt;
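&lt;p&gt;As a minimal sketch, here is the shape of those three layers in Python. The names, prices, and structure are invented for illustration, not any provider's actual implementation:&lt;/p&gt;

```python
from dataclasses import dataclass

# Event layer: emitted by the API server when a request completes.
@dataclass(frozen=True)
class UsageEvent:
    event_id: str   # stable unique ID: the basis for audit and idempotency
    org_id: str
    model: str
    tokens: int

# Ledger layer: append-only; a balance is the sum of an account's entries.
ledger = [("grant_1", "org_A", 5.00)]   # (event_id, org_id, amount)

def balance(org_id):
    return sum(amount for _, org, amount in ledger if org == org_id)

# Meter layer: how much has this account consumed, and can it continue?
PRICE_PER_TOKEN = 0.00001   # hypothetical flat rate for the sketch

def meter(event):
    cost = event.tokens * PRICE_PER_TOKEN
    if cost > balance(event.org_id):
        return False   # out of balance: reject the request
    ledger.append((event.event_id, event.org_id, -cost))
    return True

meter(UsageEvent("evt_1", "org_A", "model-x", 2000))
print(round(balance("org_A"), 2))   # 4.98
```

The point of the separation is visible even here: the event carries identity, the meter makes the admission decision, and the ledger is never edited, only appended to.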


&lt;h2&gt;
  
  
  Token Counting Is Harder Than It Looks
&lt;/h2&gt;

&lt;p&gt;With the anatomy established, I want to talk about how AI labs actually meter usage. Token counting sounds mechanical. In practice, it has edge cases that are easy to underestimate.&lt;/p&gt;
&lt;h3&gt;
  
  
  Input and output tokens are not symmetric
&lt;/h3&gt;

&lt;p&gt;Token counting happens on both sides of the request. Input tokens are known before the model runs. Output tokens are not. The response is a streaming, non-deterministic outcome. The system does not know how long it will be until it finishes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf7m4nodx0flgxk84c4p.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf7m4nodx0flgxk84c4p.webp" alt="Input and output tokens asymmetry" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One way AI labs handle this is by metering the chunked stream as it is produced. As each chunk arrives, the meter tracks cumulative output tokens. As long as quota remains, the stream continues. Once quota is exhausted, the stream is cut. This means the billing decision and the response generation are happening at the same time, not in sequence.&lt;/p&gt;
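&lt;p&gt;A toy version of that loop, with invented chunk sizes and quota, might look like this:&lt;/p&gt;

```python
def stream_with_metering(chunks, quota_tokens):
    """Meter a chunked stream as it is produced; cut it when quota runs out.

    chunks is an iterable of (text, token_count) pairs; quota_tokens is the
    output-token budget remaining on the account.
    """
    used = 0
    delivered = []
    for text, tokens in chunks:
        if used + tokens > quota_tokens:
            break                  # quota exhausted: cut the stream here
        used += tokens             # the billing decision happens mid-stream,
        delivered.append(text)     # concurrently with response generation
    return delivered, used

chunks = [("Hello", 2), (" world", 2), ("!", 1), (" and more", 3)]
delivered, used = stream_with_metering(chunks, quota_tokens=5)
print(delivered, used)   # ['Hello', ' world', '!'] 5
```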
&lt;h3&gt;
  
  
  Metering is opaque to the client
&lt;/h3&gt;

&lt;p&gt;The token count is finalized on the server. The client cannot see it until the response is done. Given the non-deterministic nature of the output, users have no reliable way to predict what a request will cost before sending it. That opacity is a real trust problem, and it gets worse when users are on prepaid credits with hard limits.&lt;/p&gt;
&lt;h3&gt;
  
  
  Not all tokens are the same
&lt;/h3&gt;

&lt;p&gt;There are input tokens, output tokens, thinking tokens, and cached tokens. Each has a different price and, in some cases, a different metering path. All of them draw from the same pool: the account's available balance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgsrgbf4e9emdi0n2360.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftgsrgbf4e9emdi0n2360.png" alt="Not all tokens are same" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If a user has multiple parallel sessions running, each one is consuming from that shared pool at the same time. That is a concurrency problem on the balance. At the scale of millions of users, it is a serious one. This is where the next section begins.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Two Hardest Problems: Idempotency &amp;amp; The Credit Race
&lt;/h2&gt;

&lt;p&gt;These two problems sit at the intersection of distributed systems and financial correctness. Getting either one wrong produces charges that are duplicated or inaccurate. Both erode user trust fast.&lt;/p&gt;
&lt;h3&gt;
  
  
  Idempotency
&lt;/h3&gt;

&lt;p&gt;Idempotency means that performing the same operation multiple times produces the same result as performing it once. In billing, this is not a nice-to-have. It is load-bearing.&lt;/p&gt;

&lt;p&gt;Usage events travel through an async pipeline, typically a message queue like Kafka. These queues guarantee at-least-once delivery. That means a consumer will sometimes receive the same event more than once during retries, consumer restarts, or network hiccups.&lt;/p&gt;

&lt;p&gt;Without an idempotency guard on the ledger write, a redelivered event produces a second charge for the same request. This is the pattern behind the Anthropic dual-billing bug. Two write paths, one for API billing and one for prepaid credits, both fired for the same request with no shared dedup check between them.&lt;/p&gt;

&lt;p&gt;The fix at the ledger level is a single SQL statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;ledger&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;org_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'evt_abc123'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'org_A'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;03&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'usage_deduct'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;NOTHING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;event_id&lt;/code&gt; column has a unique constraint, so a duplicate event does nothing. The ledger's append-only structure makes this natural: you are inserting a row, not updating one, and the database enforces uniqueness for you.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;event_id&lt;/code&gt; is the stable unique key carried in every usage event, and it is one of the pillars that makes the pipeline idempotent.&lt;/p&gt;
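&lt;p&gt;The same guard can be simulated end to end. Here SQLite stands in for Postgres (it supports the same &lt;code&gt;ON CONFLICT&lt;/code&gt; upsert syntax since 3.24), and a consumer replays an at-least-once queue that delivers one event twice; the unique constraint keeps the ledger correct:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE ledger (
        event_id TEXT PRIMARY KEY,   -- the idempotency guard
        org_id   TEXT,
        amount   REAL,
        type     TEXT
    )
""")

def apply_event(event_id, org_id, amount):
    # A duplicate delivery hits the unique constraint and inserts nothing.
    db.execute(
        "INSERT INTO ledger (event_id, org_id, amount, type) "
        "VALUES (?, ?, ?, 'usage_deduct') "
        "ON CONFLICT (event_id) DO NOTHING",
        (event_id, org_id, amount),
    )

# An at-least-once queue redelivers evt_abc123 after a consumer restart.
for delivery in ["evt_abc123", "evt_def456", "evt_abc123"]:
    apply_event(delivery, "org_A", -0.03)

total = db.execute("SELECT SUM(amount) FROM ledger").fetchone()[0]
print(round(total, 2))   # -0.06, not -0.09: the replay charged nothing
```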

&lt;h3&gt;
  
  
  The Credit Race
&lt;/h3&gt;

&lt;p&gt;The credit race is a concurrency problem. Two requests arrive at the same time for an account with $0.05 remaining. Each request costs $0.03. Both are individually affordable. Together they are not.&lt;/p&gt;

&lt;p&gt;Without an atomic check-and-deduct, both requests read the balance as sufficient, both proceed, and the account ends at -$0.01. That is a silent overdraft.&lt;/p&gt;

&lt;p&gt;A plain Redis &lt;code&gt;DECRBY&lt;/code&gt;, which atomically subtracts a value from a key, does not protect against this. It has no floor. It will go negative without complaint.&lt;/p&gt;
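&lt;p&gt;The overdraft is easy to reproduce deterministically by interleaving the read and the write by hand. Amounts are in cents to avoid float noise:&lt;/p&gt;

```python
balance = {"org_A": 5}   # 5 cents remaining
COST = 3                 # each request costs 3 cents

# Non-atomic check-then-deduct: both requests read before either one writes.
check_1 = balance["org_A"] >= COST   # True: sees 5
check_2 = balance["org_A"] >= COST   # True: also sees 5

if check_1:
    balance["org_A"] -= COST         # 5 -> 2
if check_2:
    balance["org_A"] -= COST         # 2 -> -1: silent overdraft

print(balance["org_A"])   # -1
```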

&lt;p&gt;The correct approach is an atomic check-and-deduct. In Redis, this is done with a Lua script that runs as a single uninterruptible operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'GET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'DECRBY'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;-- insufficient balance, reject the request&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no gap between the read and the write. Another thread cannot slip in between them.&lt;/p&gt;

&lt;p&gt;In practice, most AI platforms likely use both: Redis for the fast atomic balance gate on the hot path, and Postgres as the durable ledger for audit and idempotency. They solve different problems. Redis buys speed. Postgres buys correctness.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure Modes Taxonomy
&lt;/h2&gt;

&lt;p&gt;Most billing failures are not random. They cluster around a small number of root causes. Recognizing the pattern matters more than patching individual bugs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Failure Class&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Real Example&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;What Actually Went Wrong&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Double charge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Anthropic API + credit billed for same request&lt;/td&gt;
&lt;td&gt;Two write paths, no shared idempotency check&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ghost block&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Credits exist, API returns 402 anyway&lt;/td&gt;
&lt;td&gt;Meter reads a stale cached balance while the ledger balance is sufficient&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Silent overdraft&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;More tokens consumed than remaining balance&lt;/td&gt;
&lt;td&gt;Non-atomic check-then-deduct under concurrency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Credit destruction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gift codes wiping each other on redemption&lt;/td&gt;
&lt;td&gt;Stripe proration logic unaware of credit stacking rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consumption spike&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3–5x faster depletion after a model update&lt;/td&gt;
&lt;td&gt;No baseline drift detection on per-account meter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Measurement loss&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Streaming request interrupted mid-response&lt;/td&gt;
&lt;td&gt;Token count finalized at stream end; partial streams need a separate accounting path&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Four of these six failures share one root cause: billing state living in more than one place without a single source of truth.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Double charge happens when two systems both think they own the write.&lt;/li&gt;
&lt;li&gt;Ghost block happens when the cache disagrees with the ledger.&lt;/li&gt;
&lt;li&gt;Credit destruction happens when Stripe's model and the internal credit model diverge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix is not patching each bug one by one. It is deciding once which system owns each piece of billing state, and making everything else read from it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Design Principles
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Idempotency is load-bearing, not defensive.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Engineers sometimes treat idempotency as a safety net added after the system works. In billing, it belongs in the design before the first line of code. Every ledger write needs a dedup key. Every event needs a stable, unique ID. At-least-once delivery will eventually betray you. The only safe response is to design for it upfront.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credits are not money. Model them accordingly.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A balance column feels sufficient until you need expiry dates, stacking rules for different grant types, priority ordering between purchased and promotional credits, and shared wallets across teams. At that point, a single number is not enough. Credits need to be a ledger of typed entries, each with an amount, a source, an expiry, and a priority. The Anthropic gift code bug, where redeeming multiple codes destroyed existing credit value, is a direct result of treating credits as a simple balance rather than a structured set of grants.&lt;/p&gt;
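&lt;p&gt;A sketch of credits as typed grants. The field names and the promotional-first ordering are illustrative assumptions, not Anthropic's actual rules:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class CreditGrant:
    grant_id: str
    amount: float      # remaining value on this grant
    source: str        # 'purchased', 'promotional', 'gift_code'
    expires_day: int   # smaller = expires sooner
    priority: int      # smaller = consumed first

def deduct(grants, cost):
    """Consume grants in priority order, then by soonest expiry.

    Each grant keeps its identity; nothing is merged or overwritten,
    which is what prevents one redemption from wiping out another.
    """
    for g in sorted(grants, key=lambda g: (g.priority, g.expires_day)):
        if cost > 0:
            take = min(g.amount, cost)
            g.amount -= take
            cost -= take
    if cost > 0:
        raise ValueError("insufficient credit")

wallet = [
    CreditGrant("g1",  5.0, "promotional", expires_day=30,  priority=0),
    CreditGrant("g2", 20.0, "purchased",   expires_day=365, priority=1),
]
deduct(wallet, 8.0)   # drains the promo grant, then dips into purchased
print([(g.grant_id, g.amount) for g in wallet])   # [('g1', 0.0), ('g2', 17.0)]
```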

&lt;p&gt;&lt;strong&gt;Measure first, charge second.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is easy to say and consistently violated under deadline pressure. The metering pipeline should finalize the token count before any charge is recorded. Charging on estimated or in-flight counts, especially for streaming, creates a class of errors that are nearly impossible to audit after the fact. The accounting path and the response path can run in parallel, but the ledger write should always wait for a confirmed count.&lt;/p&gt;
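&lt;p&gt;One way to honor that ordering, sketched under the assumption that an interrupted stream is still billed for whatever was delivered:&lt;/p&gt;

```python
def finalize_and_charge(chunks, ledger, event_id):
    """Count tokens as the stream runs, but write the ledger exactly once,
    with the finalized count, even if the stream is interrupted."""
    tokens = 0
    try:
        for chunk_tokens in chunks:
            tokens += chunk_tokens   # metering runs alongside the response
    except ConnectionError:
        pass                         # interrupted: bill the partial count
    # The ledger write waits for a confirmed count; nothing was charged
    # on an estimate while the stream was still in flight.
    ledger.append((event_id, tokens))
    return tokens

def flaky_stream():
    yield 40
    yield 25
    raise ConnectionError("client dropped mid-response")

ledger = []
finalize_and_charge(flaky_stream(), ledger, "evt_partial")
print(ledger)   # [('evt_partial', 65)]: one charge, finalized after the fact
```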




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Anthropic billing incidents were not exotic failures. They were textbook distributed systems problems that surfaced in a financial context. Duplicate writes without idempotency guards. Concurrent balance checks without atomic deduction. Credit state split across systems with no clear owner.&lt;/p&gt;

&lt;p&gt;What makes them worth studying is not that mistakes were made. Every system at scale makes mistakes. It is that these failure classes are predictable. They have names. They have known fixes. And they tend to appear in roughly the same order as a billing system grows.&lt;/p&gt;

&lt;p&gt;If you are building or reviewing a billing system, the most useful question to ask is not whether the happy path works. It is whether you have decided which system owns each piece of billing state, and whether everything else actually reads from it.&lt;/p&gt;

&lt;p&gt;Billing infrastructure rarely gets a design doc. It rarely gets a dedicated team until something goes wrong. It sits in the corner of the codebase, quietly doing its job, until one day it doesn't and suddenly it's the only thing anyone is talking about.&lt;/p&gt;

&lt;p&gt;The Anthropic incidents are a good reminder of that dynamic. The failures weren't in the model. They weren't in the API. They were in the plumbing that sits between a user's wallet and a completed request. The part nobody thought was interesting enough to design carefully.&lt;/p&gt;

&lt;p&gt;That's the thing about boring infrastructure. It doesn't announce itself when it's working. But when it breaks, it breaks in the most visible way possible on someone's credit card statement.&lt;/p&gt;

&lt;p&gt;Token counting, idempotency, credit ledgers, atomic deductions, none of this is glamorous work. But it is the work that determines whether users trust your platform. And trust, once lost over a billing error, is genuinely hard to get back.&lt;/p&gt;

&lt;p&gt;Build the boring parts like they matter. Because to your users, they matter more than almost anything else.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>distributedsystems</category>
      <category>architecture</category>
    </item>
    <item>
      <title>What is a Container? The OS-Level Truth Most Engineers Don't Know</title>
      <dc:creator>Krishna Tej Chalamalasetty</dc:creator>
      <pubDate>Sat, 04 Apr 2026 01:25:08 +0000</pubDate>
      <link>https://dev.to/chkrishnatej/what-is-a-container-the-os-level-truth-most-engineers-dont-know-3n2l</link>
      <guid>https://dev.to/chkrishnatej/what-is-a-container-the-os-level-truth-most-engineers-dont-know-3n2l</guid>
      <description>&lt;h3&gt;
  
  
  "You Keep Using That Word"
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Dispelling Container Misconceptions at the OS Level
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Before we write a single line of code, we need to kill the buzzword fog.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Container Actually Is
&lt;/h2&gt;

&lt;p&gt;The marketing definition you have heard a hundred times: &lt;strong&gt;"a container is an executable unit of software with its dependencies bundled together."&lt;/strong&gt; That is not wrong, but it tells you nothing useful about what is actually happening on the machine.&lt;/p&gt;

&lt;p&gt;Here is the OS-level truth: &lt;strong&gt;a container is a process (or a tree of processes) that the kernel runs with a restricted view of its own namespaces and a cgroup-enforced ceiling on the resources it can consume.&lt;/strong&gt; That is the entire trick. No hypervisor, no guest kernel, no virtualized hardware. Just a process with a carefully constructed set of constraints.&lt;br&gt;
Everything else in this article is evidence for that single claim.&lt;/p&gt;
&lt;h2&gt;
  
  
  Spin up a simple HTTPD container
&lt;/h2&gt;

&lt;p&gt;I used the following podman command to spin up an HTTPD container with a limited amount of resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-limited-httpd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--replace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;512m &lt;span class="se"&gt;\ &lt;/span&gt;       &lt;span class="c"&gt;# hard memory ceiling&lt;/span&gt;
  &lt;span class="nt"&gt;--memory-swap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;512m &lt;span class="se"&gt;\ &lt;/span&gt;  &lt;span class="c"&gt;# swap ceiling equal to memory = no swap allowed&lt;/span&gt;
  &lt;span class="nt"&gt;--cpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1.5 &lt;span class="se"&gt;\ &lt;/span&gt;          &lt;span class="c"&gt;# CPU quota: 1.5 cores worth of CPU time&lt;/span&gt;
  &lt;span class="nt"&gt;--cpu-shares&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;512 &lt;span class="se"&gt;\ &lt;/span&gt;    &lt;span class="c"&gt;# relative CPU weight during contention&lt;/span&gt;
  &lt;span class="nt"&gt;--pids-limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100 &lt;span class="se"&gt;\ &lt;/span&gt;    &lt;span class="c"&gt;# max 100 processes/threads in this container&lt;/span&gt;
  &lt;span class="nt"&gt;--blkio-weight&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;500 &lt;span class="se"&gt;\ &lt;/span&gt;  &lt;span class="c"&gt;# relative block I/O weight&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8081:80 &lt;span class="se"&gt;\&lt;/span&gt;
  registry.access.redhat.com/ubi8/httpd-24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This starts a container with the given resource limits. You can confirm that the container is just a process tree on the host with &lt;code&gt;pstree&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;krish-local:~ &lt;span class="c"&gt;# pstree -pT&lt;/span&gt;
systemd&lt;span class="o"&gt;(&lt;/span&gt;1&lt;span class="o"&gt;)&lt;/span&gt;─┬─NetworkManager&lt;span class="o"&gt;(&lt;/span&gt;944&lt;span class="o"&gt;)&lt;/span&gt;
           ├─agetty&lt;span class="o"&gt;(&lt;/span&gt;1011&lt;span class="o"&gt;)&lt;/span&gt;
           ├─agetty&lt;span class="o"&gt;(&lt;/span&gt;1012&lt;span class="o"&gt;)&lt;/span&gt;
           ├─auditd&lt;span class="o"&gt;(&lt;/span&gt;835&lt;span class="o"&gt;)&lt;/span&gt;
           ├─chronyd&lt;span class="o"&gt;(&lt;/span&gt;1007&lt;span class="o"&gt;)&lt;/span&gt;
           &lt;span class="c"&gt;# ConMon is container monitor. HTTPD is the container we ran&lt;/span&gt;
           ├─conmon&lt;span class="o"&gt;(&lt;/span&gt;13669&lt;span class="o"&gt;)&lt;/span&gt;───httpd&lt;span class="o"&gt;(&lt;/span&gt;13684&lt;span class="o"&gt;)&lt;/span&gt;─┬─cat&lt;span class="o"&gt;(&lt;/span&gt;13727&lt;span class="o"&gt;)&lt;/span&gt; 
           │                              ├─cat&lt;span class="o"&gt;(&lt;/span&gt;13728&lt;span class="o"&gt;)&lt;/span&gt;
           │                              ├─cat&lt;span class="o"&gt;(&lt;/span&gt;13729&lt;span class="o"&gt;)&lt;/span&gt;
           │                              ├─cat&lt;span class="o"&gt;(&lt;/span&gt;13730&lt;span class="o"&gt;)&lt;/span&gt;
           │                              ├─httpd&lt;span class="o"&gt;(&lt;/span&gt;13731&lt;span class="o"&gt;)&lt;/span&gt;
           │                              ├─httpd&lt;span class="o"&gt;(&lt;/span&gt;13734&lt;span class="o"&gt;)&lt;/span&gt;
           │                              └─httpd&lt;span class="o"&gt;(&lt;/span&gt;35659&lt;span class="o"&gt;)&lt;/span&gt;
           ├─dbus-broker-lau&lt;span class="o"&gt;(&lt;/span&gt;854&lt;span class="o"&gt;)&lt;/span&gt;───dbus-broker&lt;span class="o"&gt;(&lt;/span&gt;856&lt;span class="o"&gt;)&lt;/span&gt;
           ├─dotd&lt;span class="o"&gt;(&lt;/span&gt;959&lt;span class="o"&gt;)&lt;/span&gt;───sleep&lt;span class="o"&gt;(&lt;/span&gt;3951905&lt;span class="o"&gt;)&lt;/span&gt;
           ├─irqbalance&lt;span class="o"&gt;(&lt;/span&gt;858&lt;span class="o"&gt;)&lt;/span&gt;
           ├─rsyslogd&lt;span class="o"&gt;(&lt;/span&gt;1159&lt;span class="o"&gt;)&lt;/span&gt;
           ├─sshd&lt;span class="o"&gt;(&lt;/span&gt;1562&lt;span class="o"&gt;)&lt;/span&gt;───sshd-session&lt;span class="o"&gt;(&lt;/span&gt;3949831&lt;span class="o"&gt;)&lt;/span&gt;───sshd-session&lt;span class="o"&gt;(&lt;/span&gt;3950094&lt;span class="o"&gt;)&lt;/span&gt;───bash&lt;span class="o"&gt;(&lt;/span&gt;3950096&lt;span class="o"&gt;)&lt;/span&gt;───pstree&lt;span class="o"&gt;(&lt;/span&gt;3954963&lt;span class="o"&gt;)&lt;/span&gt;
           ├─systemd&lt;span class="o"&gt;(&lt;/span&gt;3950077&lt;span class="o"&gt;)&lt;/span&gt;───&lt;span class="o"&gt;(&lt;/span&gt;sd-pam&lt;span class="o"&gt;)(&lt;/span&gt;3950079&lt;span class="o"&gt;)&lt;/span&gt;
           ├─systemd-journal&lt;span class="o"&gt;(&lt;/span&gt;621&lt;span class="o"&gt;)&lt;/span&gt;
           ├─systemd-logind&lt;span class="o"&gt;(&lt;/span&gt;890&lt;span class="o"&gt;)&lt;/span&gt;
           ├─systemd-udevd&lt;span class="o"&gt;(&lt;/span&gt;657&lt;span class="o"&gt;)&lt;/span&gt;
           └─wpa_supplicant&lt;span class="o"&gt;(&lt;/span&gt;891&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  From &lt;code&gt;podman run&lt;/code&gt; to a Process - What Actually Happens
&lt;/h3&gt;

&lt;p&gt;When you ran &lt;code&gt;podman run&lt;/code&gt;, Podman did not start &lt;code&gt;httpd&lt;/code&gt; directly. It handed the work to an OCI runtime. On SLES 16, that runtime is &lt;code&gt;crun&lt;/code&gt; (you can verify with &lt;code&gt;podman info | grep -i runtime&lt;/code&gt;). &lt;code&gt;crun&lt;/code&gt; is the thing that actually calls &lt;code&gt;clone()&lt;/code&gt; with the right namespace flags, writes the cgroup limits into &lt;code&gt;/sys/fs/cgroup&lt;/code&gt;, sets up the root filesystem from the image layers, and then calls &lt;code&gt;execve()&lt;/code&gt; to start your process. &lt;br&gt;
&lt;code&gt;runc&lt;/code&gt;, the original OCI reference runtime, does the same job but is written in Go. &lt;code&gt;crun&lt;/code&gt; is a C rewrite that is lighter and faster, and is now the default on most modern distros.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;conmon&lt;/code&gt; in the &lt;code&gt;pstree&lt;/code&gt; output is the container monitor: a small supervisor process that holds the container's stdio open, watches the container process, and reports its exit code back to Podman. It is not part of your workload. It is bookkeeping infrastructure.&lt;/p&gt;

&lt;p&gt;So the full chain is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;podman → crun → clone() + execve() → your process
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the time &lt;code&gt;httpd&lt;/code&gt; appears in the process table, &lt;code&gt;crun&lt;/code&gt; has already exited. Its job was setup, not supervision.&lt;/p&gt;
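&lt;p&gt;The setup-then-&lt;code&gt;execve()&lt;/code&gt; pattern is easy to see in miniature. This Python sketch only forks and execs, with no namespace flags and no cgroup writes, so it is an analogy for the shape of the chain rather than a container runtime (Unix-only, since it relies on &lt;code&gt;fork&lt;/code&gt;):&lt;/p&gt;

```python
import os

pid = os.fork()
if pid == 0:
    # Child: a real runtime would unshare namespaces and write cgroup
    # limits here, then replace itself with the workload via exec.
    os.execvp("echo", ["echo", "hello from the exec'd process"])
else:
    # Parent: like crun, its involvement ends once the child is running.
    _, status = os.waitpid(pid, 0)
```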




&lt;h2&gt;
  
  
  Four Words the Industry Uses Interchangeably — and Shouldn't
&lt;/h2&gt;

&lt;p&gt;These four terms get collapsed into each other constantly. The confusion is not accidental. Vendors benefit from the blurring. But it costs engineers clarity when something breaks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;What it actually means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kernel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The core software that mediates between hardware and every process running on the machine. It owns CPU scheduling, memory management, syscall handling, and namespace/cgroup enforcement.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operating System&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The full stack: kernel plus user-space tooling (libc, shell, package manager, init system) that makes the machine usable by humans or services. RHEL is an OS. The kernel alone is not.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Process&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A running instance of a program. The kernel has allocated it a PID, some memory, and file descriptors. It exists in kernel memory; &lt;code&gt;/proc&lt;/code&gt; is a virtual filesystem that &lt;em&gt;exposes&lt;/em&gt; it, not where it lives.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Image&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Explained in full below.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What an Image Actually Is
&lt;/h3&gt;

&lt;p&gt;An image is a blueprint. A container is one running instance of it. You can start ten containers from the same image simultaneously and each one gets its own thin writable layer on top of the shared read-only filesystem layers underneath. The image itself never runs. It is inert until a runtime turns it into a process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At the OCI spec level:&lt;/strong&gt; an image is a stack of read-only filesystem layers plus a JSON config manifest. When you run a container, the OCI runtime (crun on most modern Linux systems, runc historically) unpacks those layers into a root filesystem, applies the runtime constraints, and hands a process off to the kernel.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Image = what to run.&lt;/li&gt;
&lt;li&gt;Runtime spec = how to run it.&lt;/li&gt;
&lt;li&gt;Container = the process that results.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The PID 1 Myth
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The belief:&lt;/strong&gt; there is a mini Linux inside the container, and PID 1 is its init system — the root of a private process universe.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reality:&lt;/strong&gt; PID 1 is namespace-relative. The first process spawned inside a new PID namespace gets assigned PID 1 &lt;em&gt;within that namespace&lt;/em&gt;. It is simultaneously visible on the host with a completely different PID. One process, two identities, depending on which namespace you are observing from.&lt;/p&gt;

&lt;p&gt;Let us verify this directly. The &lt;code&gt;nsenter&lt;/code&gt; command lets us step into a process's namespaces and run a command from inside them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stepping into the host's namespaces (PID 1 = systemd):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enter PID 1's PID namespace (-p) and mount namespace (-m, needed so&lt;/span&gt;
&lt;span class="c"&gt;# /proc reflects the target namespace) and run ps&lt;/span&gt;
krish-local:~ &lt;span class="c"&gt;# nsenter -t 1 -p -m ps -ef&lt;/span&gt;
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 Mar29 ?        00:00:05 /usr/lib/systemd/systemd &lt;span class="nt"&gt;--switched-root&lt;/span&gt; &lt;span class="nt"&gt;--system&lt;/span&gt; &lt;span class="nt"&gt;--deserialize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;47
root           2       0  0 Mar29 ?        00:00:00 &lt;span class="o"&gt;[&lt;/span&gt;kthreadd]
root           3       2  0 Mar29 ?        00:00:00 &lt;span class="o"&gt;[&lt;/span&gt;pool_workqueue_release]
root           4       2  0 Mar29 ?        00:00:00 &lt;span class="o"&gt;[&lt;/span&gt;kworker/R-kvfree_rcu_reclaim]
root           5       2  0 Mar29 ?        00:00:00 &lt;span class="o"&gt;[&lt;/span&gt;kworker/R-rcu_gp]
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stepping into the container's namespaces (container process host PID = 13684):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Same command, but targeting the container's PID namespace.&lt;/span&gt;
&lt;span class="c"&gt;# Inside this namespace, httpd was the first process spawned — so it gets PID 1.&lt;/span&gt;
krish-local:~ &lt;span class="c"&gt;# nsenter -t 13684 -p -m ps -ef&lt;/span&gt;
UID          PID    PPID  C STIME TTY          TIME CMD
default        1       0  0 01:47 ?        00:00:29 httpd &lt;span class="nt"&gt;-D&lt;/span&gt; FOREGROUND
default       38       1  0 01:47 ?        00:00:00 /usr/bin/coreutils &lt;span class="nt"&gt;--coreutils-prog-shebang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /usr/bin/cat
default       39       1  0 01:47 ?        00:00:00 /usr/bin/coreutils &lt;span class="nt"&gt;--coreutils-prog-shebang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /usr/bin/cat
default       40       1  0 01:47 ?        00:00:00 /usr/bin/coreutils &lt;span class="nt"&gt;--coreutils-prog-shebang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /usr/bin/cat
default       41       1  0 01:47 ?        00:00:00 /usr/bin/coreutils &lt;span class="nt"&gt;--coreutils-prog-shebang&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /usr/bin/cat
default       42       1  0 01:47 ?        00:00:29 httpd &lt;span class="nt"&gt;-D&lt;/span&gt; FOREGROUND
default       45       1  0 01:47 ?        00:00:17 httpd &lt;span class="nt"&gt;-D&lt;/span&gt; FOREGROUND
root      615091       0  0 10:11 ?        00:00:00 ps &lt;span class="nt"&gt;-ef&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;httpd&lt;/code&gt; process sitting at PID 1 inside the container namespace? On the host, it is PID 13684. Same process, different lens.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Shared Kernel - &lt;code&gt;uname&lt;/code&gt; Does Not Lie
&lt;/h3&gt;

&lt;p&gt;If there were a mini Linux inside, it would have its own kernel, its own kernel version, its own architecture identity. Run &lt;code&gt;uname&lt;/code&gt; on the host and inside the container and compare:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Host:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;krish-local:~ &lt;span class="c"&gt;# uname -s &amp;amp;&amp;amp; uname -m &amp;amp;&amp;amp; uname -o &amp;amp;&amp;amp; uname -r&lt;/span&gt;
Linux
x86_64
GNU/Linux
6.12.0-160000.9-default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Inside the container:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;krish-local:~ &lt;span class="c"&gt;# podman exec my-limited-httpd `uname -s &amp;amp;&amp;amp; uname -m &amp;amp;&amp;amp; uname -o &amp;amp;&amp;amp; uname -r`&lt;/span&gt;
Linux
x86_64
GNU/Linux
6.12.0-160000.9-default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Identical output. The container is not running a different kernel; it is sharing the host's. The UTS namespace gives it an isolated hostname, but kernel identity is not part of that isolation. There is no guest kernel to find.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resource Isolation: cgroups Are the Real Enforcement Mechanism
&lt;/h2&gt;

&lt;p&gt;Namespaces control &lt;em&gt;what a process can see&lt;/em&gt;. cgroups control &lt;em&gt;what a process can consume&lt;/em&gt;. Together they are the two pillars of container isolation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on cgroup versions:&lt;/strong&gt; The demos below use cgroup v2 paths (&lt;code&gt;/sys/fs/cgroup/&amp;lt;scope&amp;gt;/memory.max&lt;/code&gt;), which is the unified hierarchy used by SLES 16 with kernel 6.12. If you are on an older distro still running cgroup v1, your paths will look different (&lt;code&gt;/sys/fs/cgroup/memory/&amp;lt;scope&amp;gt;/memory.limit_in_bytes&lt;/code&gt;).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now compare cgroup limits for systemd (PID 1 on the host) versus the container process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;systemd - no enforced ceiling:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;krish-local:~ &lt;span class="c"&gt;# CGROUP=$(cat /proc/1/cgroup | cut -d: -f3)&lt;/span&gt;
krish-local:~ &lt;span class="c"&gt;# echo $CGROUP&lt;/span&gt;
/init.scope
krish-local:~ &lt;span class="c"&gt;# cat /sys/fs/cgroup${CGROUP}/memory.max&lt;/span&gt;
max
krish-local:~ &lt;span class="c"&gt;# cat /sys/fs/cgroup${CGROUP}/pids.max&lt;/span&gt;
max
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;max&lt;/code&gt; means unlimited. The init process is not constrained.&lt;br&gt;
&lt;strong&gt;The container process — enforced ceiling:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;krish-local:~ &lt;span class="c"&gt;# PID=13684  # container's PID on the host&lt;/span&gt;
krish-local:~ &lt;span class="c"&gt;# CGROUP=$(cat /proc/$PID/cgroup | cut -d: -f3)&lt;/span&gt;
krish-local:~ &lt;span class="c"&gt;# echo $CGROUP&lt;/span&gt;
/machine.slice/libpod-a67892f3083285e34c738fd1e75cccd7eaadbda71f5a8c60a522e73546c0d5a2.scope
krish-local:~ &lt;span class="c"&gt;# cat /sys/fs/cgroup${CGROUP}/memory.max&lt;/span&gt;
536870912
krish-local:~ &lt;span class="c"&gt;# cat /sys/fs/cgroup${CGROUP}/pids.max&lt;/span&gt;
100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;536870912&lt;/code&gt; bytes = 512 MiB. Exactly the &lt;code&gt;--memory=512m&lt;/code&gt; flag we passed at startup. The &lt;code&gt;pids.max&lt;/code&gt; of 100 matches &lt;code&gt;--pids-limit=100&lt;/code&gt;. The kernel is enforcing these budgets directly — not Podman, not any container runtime abstraction sitting above the kernel.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It Together
&lt;/h2&gt;

&lt;p&gt;A container is a process. It shares the host kernel, and &lt;code&gt;uname&lt;/code&gt; proves it. Its PID 1 is an artifact of namespace isolation, not evidence of a private OS, and &lt;code&gt;nsenter&lt;/code&gt; proves it. Its resource limits are enforced by cgroups in the kernel, not by runtime magic, and &lt;code&gt;/sys/fs/cgroup&lt;/code&gt; proves it.&lt;/p&gt;

&lt;p&gt;The image is the blueprint. &lt;code&gt;crun&lt;/code&gt;/&lt;code&gt;runc&lt;/code&gt; is the assembly line. The running container is just another entry in the host's process table, one that happens to have a restricted worldview and a constrained resource budget.&lt;/p&gt;

&lt;p&gt;That is the mental model. Everything in Part 2 builds on top of it.&lt;/p&gt;

</description>
      <category>containers</category>
      <category>linux</category>
    </item>
    <item>
      <title>Scaling ID Generation with Redis</title>
      <dc:creator>Krishna Tej Chalamalasetty</dc:creator>
      <pubDate>Thu, 26 Mar 2026 08:05:32 +0000</pubDate>
      <link>https://dev.to/chkrishnatej/scaling-id-generation-with-redis-3h4e</link>
      <guid>https://dev.to/chkrishnatej/scaling-id-generation-with-redis-3h4e</guid>
      <description>&lt;h2&gt;
  
  
  It Started with a Simple Counter
&lt;/h2&gt;

&lt;p&gt;I work on a cloud-based document management platform used by large construction and engineering firms. Every document uploaded (drawings, RFIs, approvals) gets a unique ID following a tenant-defined schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PROJ-9012-1001
  │      │     │
  │      │     └── Sequence number (auto-incremented)
  │      └──────── Document type identifier
  └──────────────── Project code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An internal microservice called &lt;code&gt;id-generator&lt;/code&gt; handled this. It worked fine for years: a closed system behind our portal, moderate traffic, no drama.&lt;/p&gt;

&lt;p&gt;Then we opened public APIs so customers could automate their workflows. And the first large customer tried to migrate 200,000 documents in a single batch.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;id-generator&lt;/code&gt; was called once per document, sequentially. Each call was a network hop. The migration ran for over six hours. The customer was not pleased.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Sequential Doesn’t Scale: The Math
&lt;/h2&gt;

&lt;p&gt;A single ID generation involves a network round-trip to the id-generator (~100ms) plus the generator’s own processing and sequence counter commit (~50ms). Call it 150ms per ID.&lt;/p&gt;

&lt;p&gt;For 200,000 documents, sequential processing means:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;200,000 × 150ms = 30,000 seconds ≈ 8.3 hours
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Just add more app instances" doesn’t help. We can spread requests across three instances behind a load balancer, but all three call the &lt;strong&gt;same id-generator&lt;/strong&gt;. The generator processes one request at a time to guarantee sequential numbering. The bottleneck isn’t the app layer, it’s the single-threaded sequence generation.&lt;/p&gt;

&lt;p&gt;Even if I made the id-generator handle &lt;em&gt;10 concurrent requests&lt;/em&gt; (with database-level locking on the counter):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;200,000 ÷ 10 concurrent × 150ms = 3,000 seconds ≈ 50 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Better, but fragile. The id-generator becomes a high-contention hotspot, and any slowdown cascades to every project uploading at the same time.&lt;/p&gt;
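&lt;p&gt;As a sanity check, both estimates can be reproduced in a few lines (a standalone sketch; the class name is illustrative, not code from the service):&lt;/p&gt;

```java
public class MigrationMath {
    public static void main(String[] args) {
        int docs = 200_000;
        int latencyMs = 150;  // ~100ms network round-trip + ~50ms generator work

        long sequentialSec = (long) docs * latencyMs / 1000;
        long concurrentSec = sequentialSec / 10;  // 10 concurrent requests

        System.out.println(sequentialSec);  // 30000 seconds, i.e. about 8.3 hours
        System.out.println(concurrentSec);  // 3000 seconds, i.e. 50 minutes
    }
}
```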

&lt;p&gt;I needed to &lt;strong&gt;decouple ID consumption from ID generation entirely&lt;/strong&gt;: serve IDs without waiting for the generator.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28qd7wxixe7rqdzc07cl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28qd7wxixe7rqdzc07cl.png" alt="The Bottleneck" width="800" height="633"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;This is a &lt;strong&gt;classic producer-consumer&lt;/strong&gt; problem. Thousands of clients across different projects and organizations fire upload requests through an API gateway. The gateway distributes traffic across multiple app instances. But all instances need IDs from the same sequence space per &lt;code&gt;(project, documentType)&lt;/code&gt;, and a single producer (the id-generator) feeds that space.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The insight:&lt;/strong&gt; if I pre-generate IDs in bulk and stash them in a shared store, the app instances become &lt;strong&gt;consumers&lt;/strong&gt; popping from a ready-made pool, while the id-generator becomes a &lt;strong&gt;background producer&lt;/strong&gt; that refills the pool asynchronously. Consumers never wait for the producer. The pool is the buffer that decouples them.&lt;br&gt;
I chose Redis as that shared store.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrnblt11vsffj0uus3vr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcrnblt11vsffj0uus3vr.png" alt="Producer Consumer Architecture" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Fix: Pre-populated ID Pools in Redis
&lt;/h2&gt;

&lt;p&gt;The idea is straightforward. Instead of generating an ID when a request arrives, generate IDs ahead of time and stash them in Redis.&lt;/p&gt;

&lt;p&gt;When a request comes in, pop one off the list. No waiting.&lt;br&gt;
Each &lt;code&gt;(project, documentType)&lt;/code&gt; combination gets its own Redis List:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Key:   id-pool:PROJ:9012
Value: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"1001"&lt;/span&gt;, &lt;span class="s2"&gt;"1002"&lt;/span&gt;, &lt;span class="s2"&gt;"1003"&lt;/span&gt;, ..., &lt;span class="s2"&gt;"2000"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;   ← 1000 pre-generated IDs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;code&gt;LPOP&lt;/code&gt; gives us an atomic, &lt;code&gt;O(1)&lt;/code&gt; retrieval. One pop, one ID, sub-millisecond.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But a pool drains. I needed a way to refill it before it runs dry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Threshold-based Replenishment
&lt;/h2&gt;

&lt;p&gt;I used a watermark pattern, borrowed from stream processing systems like Kafka and Flink. You define a threshold level on a resource, and when usage crosses that mark, you trigger an action &lt;em&gt;before&lt;/em&gt; the resource is exhausted.&lt;/p&gt;

&lt;p&gt;Think of a water tank with a sensor at the 25% mark: when water drops below it, you automatically reorder before the tank runs dry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In my case:&lt;/strong&gt; each time an ID is served, I check the pool size. When 75% of the pool has been consumed (250 or fewer remaining out of 1000), I kick off an async task that calls the &lt;code&gt;id-generator&lt;/code&gt; for a fresh batch and pushes them into the Redis list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="nf"&gt;fetchNextId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;projectCode&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;docTypeId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;poolKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"id-pool:"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;projectCode&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;":"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;docTypeId&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// Atomic pop - O(1), sub-ms&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;sequence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForList&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;leftPop&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;poolKey&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sequence&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Pool empty - synchronous fallback (discussed later)&lt;/span&gt;
        &lt;span class="n"&gt;sequence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idGenerator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;generateSingle&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectCode&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docTypeId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;checkAndReplenish&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;poolKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;projectCode&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docTypeId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;projectCode&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;docTypeId&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;sequence&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The replenishment runs on a bounded thread pool with &lt;code&gt;CallerRunsPolicy&lt;/code&gt;. If the pool and queue are saturated, the request thread itself does the refill. This applies natural backpressure instead of silently dropping work. Or so I thought.&lt;/p&gt;
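<p>The CallerRunsPolicy behavior can be seen in a self-contained sketch (a toy executor, not the service's actual idGenExecutor): with one worker thread and a one-slot queue, the third submission has nowhere to go, so the policy runs it inline on the submitting thread.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsDemo {

    // Returns true if the submitting thread ended up executing a task itself.
    static boolean callerRanATask() throws InterruptedException {
        // 1 worker, queue capacity 1: the third execute() is rejected,
        // and CallerRunsPolicy runs it on the caller instead of dropping it.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue(1),
                new ThreadPoolExecutor.CallerRunsPolicy());

        String caller = Thread.currentThread().getName();
        List ranOn = Collections.synchronizedList(new ArrayList());
        Runnable task = () -> {
            ranOn.add(Thread.currentThread().getName());
            try { Thread.sleep(100); } catch (InterruptedException e) { }
        };

        pool.execute(task);  // picked up directly by the single worker
        pool.execute(task);  // parked in the one-slot queue
        pool.execute(task);  // rejected, so it runs inline on the caller
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return ranOn.contains(caller);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(callerRanATask());  // prints "true"
    }
}
```

Under load, that submitting thread is a request-handling thread, so a saturated executor pushes refill work directly onto the request path.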

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryqj8u319nmvgphqh0g4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryqj8u319nmvgphqh0g4.png" alt="Pool Architecture — Happy Path" width="800" height="927"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This worked beautifully in testing. Sub millisecond ID retrieval. Invisible background replenishment. We shipped it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Production Outage: OOM from a Thundering Herd
&lt;/h2&gt;

&lt;p&gt;Within a week of going live with the large customer migration, the service started crashing with &lt;code&gt;java.lang.OutOfMemoryError: Java heap space&lt;/code&gt;. Repeatedly.&lt;/p&gt;

&lt;p&gt;I pulled a heap dump and started analysing it. The heap was full of hundreds of &lt;code&gt;ArrayList&amp;lt;String&amp;gt;&lt;/code&gt; instances, each holding a thousand generated IDs. They were all alive simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The root cause was a race condition in the replenishment logic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s what happened: during a bulk upload, thousands of concurrent requests hit the service. Multiple requests for the &lt;em&gt;same&lt;/em&gt; &lt;code&gt;(project, docType)&lt;/code&gt; would check the pool level at nearly the same instant, all see "below 75%", and all independently trigger &lt;code&gt;replenishAsync()&lt;/code&gt;. Multiply this across hundreds of &lt;code&gt;(project, docType)&lt;/code&gt; combinations, and you get hundreds of async tasks, each generating and holding a 1000 element list in memory.&lt;/p&gt;

&lt;p&gt;The heap couldn’t take it.&lt;/p&gt;

&lt;p&gt;Counterintuitively, the &lt;code&gt;CallerRunsPolicy&lt;/code&gt; I'd chosen for backpressure made things worse. It was supposed to prevent &lt;code&gt;RejectedExecutionException&lt;/code&gt; when the executor's queue filled up. It did, by making the request-handling threads also run generation tasks, adding even more large lists to the heap. The policy solved thread pool rejection but amplified the memory problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvsq3s2jx9zty6svwkxm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvsq3s2jx9zty6svwkxm.png" alt="The Thundering Herd" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix: Redis Distributed Locks
&lt;/h2&gt;

&lt;p&gt;The problem was clear: nothing prevented duplicate replenishment for the same pool key. I needed a mutex, but a Java &lt;code&gt;ReentrantLock&lt;/code&gt; or &lt;code&gt;synchronized&lt;/code&gt; block only protects a single JVM. Our service runs on multiple instances behind a load balancer. Instance A's lock means nothing to Instance B.&lt;/p&gt;

&lt;p&gt;I needed a distributed lock. Redis gives you one with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="n"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;PROJ&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;9012&lt;/span&gt; &lt;span class="s2"&gt;"a1b2c3-uuid"&lt;/span&gt; &lt;span class="n"&gt;NX&lt;/span&gt; &lt;span class="n"&gt;EX&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NX&lt;/code&gt; means "only set if it doesn't exist": an atomic check-and-acquire. &lt;code&gt;EX 120&lt;/code&gt; means "auto-expire after 120 seconds", which prevents deadlocks if the holder crashes.&lt;/p&gt;
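<p>The NX semantics have a familiar single-JVM analogy in ConcurrentHashMap.putIfAbsent (a sketch of the check-and-acquire behavior only: it has no TTL and is not distributed; the class name is illustrative):</p>

```java
import java.util.concurrent.ConcurrentHashMap;

public class SetNxSketch {
    public static void main(String[] args) {
        // putIfAbsent mirrors SET ... NX: only the first writer wins,
        // and the check and the set happen as one atomic step.
        ConcurrentHashMap locks = new ConcurrentHashMap();
        Object first  = locks.putIfAbsent("lock:id-pool:PROJ:9012", "uuid-A");
        Object second = locks.putIfAbsent("lock:id-pool:PROJ:9012", "uuid-B");
        System.out.println(first == null);   // true: A acquired the lock
        System.out.println(second == null);  // false: B saw it already held
    }
}
```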

&lt;h2&gt;
  
  
  The Subtle Bug: Safe Lock Release
&lt;/h2&gt;

&lt;p&gt;My first implementation used a simple &lt;code&gt;DELETE&lt;/code&gt; in the &lt;code&gt;finally&lt;/code&gt; block. This has a nasty race condition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;t&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0s    Instance A acquires lock &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;TTL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60s&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;t&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;65s   Lock auto-expires &lt;span class="o"&gt;(&lt;/span&gt;A&lt;span class="s1"&gt;'s generation was slow)
t=66s   Instance B acquires the lock
t=70s   Instance A finishes → DELETE → deletes B'&lt;/span&gt;s lock
&lt;span class="nv"&gt;t&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;71s   Instance C sees no lock → acquires → duplicate work
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; store a UUID as the lock value, and use a Lua script for atomic compare-and-delete. An instance only deletes the lock if the value still matches its own UUID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Lua?&lt;/strong&gt; Redis doesn’t offer a native "&lt;strong&gt;delete if value equals X&lt;/strong&gt;" command. A &lt;code&gt;GET&lt;/code&gt; followed by a conditional &lt;code&gt;DELETE&lt;/code&gt; in application code has a race window: another instance could acquire the lock between the &lt;code&gt;GET&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt;. Lua scripts execute atomically inside Redis, eliminating that gap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Async&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"idGenExecutor"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;replenishAsync&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;projectCode&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;docTypeId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;lockKey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"lock:id-pool:"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;projectCode&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;":"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;docTypeId&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;lockValue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;UUID&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;randomUUID&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="c1"&gt;// Atomic acquire: SET NX EX&lt;/span&gt;
    &lt;span class="nc"&gt;Boolean&lt;/span&gt; &lt;span class="n"&gt;acquired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForValue&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setIfAbsent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lockKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lockValue&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Boolean&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;FALSE&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;equals&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acquired&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Someone else is handling it&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Double-check pool level under lock - another instance&lt;/span&gt;
        &lt;span class="c1"&gt;// may have already refilled between our check and acquire&lt;/span&gt;
        &lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;currentSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForList&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;poolKey&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentSize&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="no"&gt;POOL_SIZE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Already refilled&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;newIds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idGenerator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;generateBatch&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;projectCode&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docTypeId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="no"&gt;POOL_SIZE&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;currentSize&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;intValue&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForList&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;rightPushAll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;poolKey&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;newIds&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Lua script: delete ONLY if value matches our UUID&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;lua&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
            &lt;span class="s"&gt;"if redis.call('get',KEYS[1]) == ARGV[1] then "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="s"&gt;"  return redis.call('del',KEYS[1]) "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="s"&gt;"else return 0 end"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DefaultRedisScript&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;lua&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lockKey&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;lockValue&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The double-check after acquiring the lock is the distributed equivalent of double-checked locking: between the time I decided to replenish and the time I acquired the lock, another instance may have already done the work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A note on Redisson’s RLock:&lt;/strong&gt; it offers automatic lock renewal via a watchdog thread and reentrancy, which sounds appealing. I chose raw &lt;code&gt;SET NX EX&lt;/code&gt; + Lua because my replenishment task has a predictable duration (200-500ms), so a 120-second TTL gives well over 200x headroom. If generation ever takes longer than that, something is seriously wrong with the id-generator and I want the lock to expire. Adding Redisson for a single lock pattern felt like pulling in a heavy dependency for a problem I didn't have.&lt;/p&gt;
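&lt;p&gt;To make the ownership semantics concrete, here is a minimal Python simulation of the &lt;code&gt;SET NX&lt;/code&gt; acquire and the compare-and-delete release. This is an in-memory stand-in for Redis, not the production Spring code; the TTL is omitted because real Redis handles expiry, and all names are illustrative:&lt;/p&gt;

```python
import uuid


class FakeRedis:
    """In-memory stand-in for the two Redis commands the lock needs."""

    def __init__(self):
        self.store = {}

    def set_nx(self, key, value):
        # SET key value NX: succeeds only if the key is absent.
        # (Real Redis would also take an EX ttl; omitted in this sketch.)
        if key in self.store:
            return False
        self.store[key] = value
        return True

    def compare_and_delete(self, key, value):
        # The Lua script: delete ONLY if we still own the lock
        if self.store.get(key) == value:
            del self.store[key]
            return 1
        return 0


def with_lock(redis, lock_key, work):
    token = str(uuid.uuid4())          # unique per acquisition attempt
    if not redis.set_nx(lock_key, token):
        return False                    # another instance holds the lock; skip
    try:
        work()
    finally:
        # Never a blind DEL: if the TTL expired mid-work and another
        # instance acquired the key, deleting unconditionally would
        # release THEIR lock. The token check prevents that.
        redis.compare_and_delete(lock_key, token)
    return True
```

&lt;p&gt;The token check on release is the whole point: a blind &lt;code&gt;DEL&lt;/code&gt; after TTL expiry would release a lock that now belongs to someone else.&lt;/p&gt;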

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo7bklnpy00cttn4qjm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo7bklnpy00cttn4qjm9.png" alt="Distributed Lock Flow" width="800" height="1229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this fix, OOM crashes dropped to zero. Each pool key gets exactly one concurrent replenishment, regardless of how many instances or threads are contending.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem I Didn’t See Coming: Lost IDs
&lt;/h2&gt;

&lt;p&gt;With performance solved, I hit a subtler issue during failure testing.&lt;/p&gt;

&lt;p&gt;When a request pops an ID from the pool but the downstream document persistence fails (network timeout, storage error, app exception), that ID is gone. It was consumed from the pool, never attached to a document, and on retry the client gets a different ID. This creates sequence gaps and, worse, potential duplicates if the original write eventually succeeds after a timeout.&lt;/p&gt;

&lt;p&gt;I considered several approaches: client-supplied idempotency keys (requires client cooperation, which is unreliable with external API consumers), a reservation-commit pattern (extra Redis round-trips per request), and pre-assigned batch reservations (good for migrations, overkill for single uploads).&lt;/p&gt;

&lt;h2&gt;
  
  
  My Choice: The Outbox Pattern
&lt;/h2&gt;

&lt;p&gt;I went with the Outbox pattern because it solves the problem entirely on the server side. No API contract changes. No client cooperation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The idea:&lt;/strong&gt; instead of popping an ID and then separately persisting the document, I pop the ID and write it along with the document metadata to an &lt;strong&gt;outbox table&lt;/strong&gt; in a single database transaction. If the transaction fails, neither the ID assignment nor the record exists. The assignment is atomic.&lt;/p&gt;

&lt;p&gt;A separate background processor handles the actual storage. The client gets their ID back immediately; the real persistence happens seconds later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Transactional&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="nf"&gt;assignId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;projectCode&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;docTypeId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                        &lt;span class="nc"&gt;DocumentMetadata&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;docId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idPoolService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fetchNextId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectCode&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docTypeId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Single transaction: ID + record exist together or not at all&lt;/span&gt;
    &lt;span class="n"&gt;outboxRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OutboxEntry&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;documentId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;tenantCode&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectCode&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;entityTypeId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docTypeId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OutboxStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PENDING&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;retryCount&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;docId&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// Client gets the ID immediately&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Outbox Processor: Retries, Backoff, and Dead Letters
&lt;/h2&gt;

&lt;p&gt;The processor is a &lt;code&gt;@Scheduled&lt;/code&gt; method that polls the outbox table every 2 seconds, picks up &lt;code&gt;PENDING&lt;/code&gt; entries, and tries to persist each document to the target store.&lt;/p&gt;

&lt;p&gt;On the happy path, it persists the document and marks the entry &lt;code&gt;COMMITTED&lt;/code&gt;. But when storage fails (network timeout, S3 returning a 500, disk full), things get interesting.&lt;/p&gt;

&lt;p&gt;A naive approach retries the entry on the next poll, 2 seconds later. If storage is down, I’m hammering it every 2 seconds and blocking the processor from making progress on other entries.&lt;/p&gt;

&lt;p&gt;I used exponential backoff instead. Each outbox entry has a &lt;code&gt;next_retry_at&lt;/code&gt; timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;next_retry_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;^&lt;/span&gt; &lt;span class="n"&gt;retry_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retry 1 waits 2 seconds. Retry 2 waits 4. Retry 3 waits 8. Retry 5 waits 32. Failing entries naturally “sink to the bottom” while fresh entries get processed promptly. The processor’s query becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;id_outbox&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'PENDING'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;next_retry_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a maximum number of retries (I used 5), the entry moves to &lt;code&gt;DEAD_LETTER&lt;/code&gt; status. This means the system has given up on automatic recovery: the document payload might be corrupted, the target storage bucket might not exist for this tenant, or there's a permission issue that no amount of retrying will fix.&lt;/p&gt;

&lt;p&gt;Dead-lettered entries become a to-do list for the ops team: investigate, fix the root cause, and either reprocess manually or mark as abandoned.&lt;/p&gt;

&lt;p&gt;Why not retry forever? A poisoned entry (corrupted payload, invalid schema) will never succeed. Infinite retries waste processing capacity and mask the real bug. The dead letter queue acts as a circuit breaker for individual entries.&lt;/p&gt;
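&lt;p&gt;The retry/backoff/dead-letter state machine fits in a few lines. A Python sketch of one processor pass over a single entry; the real processor is a Java &lt;code&gt;@Scheduled&lt;/code&gt; method working against the table, and the dict shape here is purely illustrative:&lt;/p&gt;

```python
import time

MAX_RETRIES = 5


def process_entry(entry, persist, now=None):
    """One processor pass over an outbox entry (a dict with
    status / retry_count / next_retry_at keys in this sketch)."""
    now = now if now is not None else time.time()
    # Skip entries that are not due yet or already resolved
    if entry["status"] != "PENDING" or entry["next_retry_at"] > now:
        return entry
    try:
        persist(entry)                       # write to the target store
        entry["status"] = "COMMITTED"
    except Exception:
        entry["retry_count"] += 1
        if entry["retry_count"] >= MAX_RETRIES:
            entry["status"] = "DEAD_LETTER"  # hand off to the ops to-do list
        else:
            # Exponential backoff: 2, 4, 8, 16 seconds between attempts
            entry["next_retry_at"] = now + 2 ** entry["retry_count"]
    return entry
```

&lt;p&gt;Failing entries push their own &lt;code&gt;next_retry_at&lt;/code&gt; further out each time, which is exactly the "sink to the bottom" behavior the processor query relies on.&lt;/p&gt;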

&lt;p&gt;The outbox table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;id_outbox&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;document_id&lt;/span&gt;     &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tenant_code&lt;/span&gt;     &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;entity_type_id&lt;/span&gt;  &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;         &lt;span class="n"&gt;JSONB&lt;/span&gt;       &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;          &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'PENDING'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retry_count&lt;/span&gt;     &lt;span class="nb"&gt;INT&lt;/span&gt;         &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;next_retry_at&lt;/span&gt;   &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;processed_at&lt;/span&gt;    &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Partial index: only scans PENDING rows ready for processing&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_outbox_pending&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;id_outbox&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_retry_at&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'PENDING'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiznj8xx5oyzvxtbzq41z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiznj8xx5oyzvxtbzq41z.png" alt="Outbox Pattern Flow" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The tradeoff is &lt;code&gt;eventual consistency&lt;/code&gt;. The document isn’t in the target store the instant the client gets the ID. There’s a 2–5 second delay while the processor runs. For bulk document uploads, this was perfectly acceptable.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Redis Goes Down
&lt;/h2&gt;

&lt;p&gt;My fallback is straightforward: if LPOP fails with a &lt;code&gt;RedisConnectionFailureException&lt;/code&gt;, fall through to synchronous &lt;code&gt;id-generator&lt;/code&gt; calls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sequence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redisTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;opsForList&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;leftPop&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;poolKey&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RedisConnectionFailureException&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sequence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idGenerator&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;generateSingle&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;projectCode&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;docTypeId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This re-introduces latency during a Redis outage (sub-ms jumps to ~200ms per ID), but the system stays available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One important detail:&lt;/strong&gt; the &lt;code&gt;id-generator's&lt;/code&gt; sequence counter must live &lt;strong&gt;independently of Redis&lt;/strong&gt;. If the generator also relies on Redis for its counter (&lt;code&gt;INCRBY&lt;/code&gt;), then a Redis outage takes down both the pool and the fallback. I backed the counter with a database sequence so the fallback path has no Redis dependency.&lt;/p&gt;

&lt;p&gt;The other concern is &lt;strong&gt;consistency after Redis recovers&lt;/strong&gt;. During the outage, the sync fallback generates IDs using the database counter. The Redis pool still holds stale pre-generated IDs from before the crash. If you naively resume popping from the pool, you could issue duplicates.&lt;/p&gt;

&lt;p&gt;My approach: on Redis reconnection, &lt;strong&gt;invalidate all pool keys and re-seed from the current database counter value&lt;/strong&gt;. Simple but safe.&lt;/p&gt;
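&lt;p&gt;The re-seed step is simple arithmetic: every fresh pool ID must sit strictly above the database counter, because the sync fallback never issued anything beyond it. A small Python sketch of that invariant; the &lt;code&gt;DOC-&lt;/code&gt; format string is a hypothetical example, not the real ID scheme:&lt;/p&gt;

```python
def reseed_pool(counter_value, pool_size, fmt="DOC-{:06d}"):
    """Build a fresh batch of IDs strictly above the DB counter and
    return the batch plus the new counter high-water mark."""
    start = counter_value + 1
    new_counter = counter_value + pool_size
    ids = [fmt.format(n) for n in range(start, new_counter + 1)]
    return ids, new_counter
```

&lt;p&gt;Because every outage-era ID is at or below the counter, the re-seeded pool cannot collide with anything the fallback handed out.&lt;/p&gt;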

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jrm3se43vz5fetbb1wh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jrm3se43vz5fetbb1wh.png" alt="Redis Recovery" width="638" height="1542"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Remember the back-of-envelope math? With the pool, a single ID retrieval drops from ~150ms (network hop to id-generator) to ~0.5ms (Redis LPOP). Across 3 app instances handling 20 concurrent requests each, the theoretical completion time for 200,000 documents is about 2 minutes. In practice, accounting for API gateway overhead, outbox commits, and occasional sync fallbacks, the observed results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39r0z2xqryizowt9ad4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39r0z2xqryizowt9ad4g.png" alt="Results" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The gap between theoretical (2 min) and observed (~1 hour) is mainly the outbox processor’s polling interval, document storage latency, and the client’s own upload pacing. The 85% improvement is on the end-to-end migration, not just ID generation, but ID generation was the bottleneck that unlocked everything else.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I’d Improve
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gap Reconciliation
&lt;/h3&gt;

&lt;p&gt;IDs still get lost. An instance can crash between popping from the pool and writing to the outbox. The outbox processor might exhaust retries and send entries to the dead letter queue. These are rare, a handful per million, but they create sequence gaps that confuse customers expecting contiguous numbering.&lt;/p&gt;

&lt;p&gt;I’d address this with a &lt;strong&gt;periodic reconciliation job&lt;/strong&gt; that compares three sources: the &lt;strong&gt;sequence counter&lt;/strong&gt; (highest ID ever generated), the &lt;strong&gt;document table&lt;/strong&gt; (which IDs have committed documents), and the &lt;strong&gt;outbox table&lt;/strong&gt; (which IDs are still pending or dead-lettered). Everything in the counter range that doesn’t appear in any of these is a gap. The job writes these to a lightweight &lt;code&gt;id_gaps&lt;/code&gt; table with the detection timestamp and inferred reason — &lt;code&gt;outbox_dlq&lt;/code&gt; for dead-lettered entries, &lt;code&gt;untracked_loss&lt;/code&gt; for IDs that never reached the outbox at all.&lt;/p&gt;
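&lt;p&gt;The reconciliation itself boils down to a set comparison. A simplified Python sketch, treating each of the three sources as a set of integer sequence numbers; the real job would query the counter, document, and outbox tables:&lt;/p&gt;

```python
def find_gaps(max_counter, committed, outbox_pending, outbox_dead):
    """Classify every sequence number up to the counter high-water mark.
    Inputs are sets of integers (a simplification of the real tables)."""
    gaps = {}
    for n in range(1, max_counter + 1):
        if n in committed or n in outbox_pending:
            continue                      # accounted for, not a gap
        if n in outbox_dead:
            gaps[n] = "outbox_dlq"        # failed after max retries
        else:
            gaps[n] = "untracked_loss"    # popped but never reached the outbox
    return gaps
```

&lt;p&gt;The resulting map is exactly what would feed the &lt;code&gt;id_gaps&lt;/code&gt; table: each missing number with an inferred reason rather than silence.&lt;/p&gt;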

&lt;p&gt;An endpoint like &lt;code&gt;GET /api/v1/ids/gaps?project=PROJ&amp;amp;type=9012&lt;/code&gt; would let customers distinguish "this ID was skipped due to infrastructure" from "this document is actually missing." Gaps become explained rather than mysterious.&lt;/p&gt;

&lt;p&gt;I’d avoid trying to reclaim and reuse lost IDs. Reuse sounds clean but risks collision with late-arriving writes from the original assignment, the kind of bug that’s nearly impossible to reproduce and debug. Better to waste a few numbers and track them than to reintroduce them into circulation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pre-populate, don’t generate on-the-fly.&lt;/strong&gt; When generation is expensive or serial, trade storage for latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Watermark replenishment, not reactive.&lt;/strong&gt; Refill at 75% consumed, not when empty. No request should ever wait on generation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distributed systems need distributed locks.&lt;/strong&gt; Java’s &lt;code&gt;ReentrantLock&lt;/code&gt; doesn't cross JVM boundaries. Redis &lt;code&gt;SET NX EX&lt;/code&gt; gives you a lightweight mutex.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lock ownership matters.&lt;/strong&gt; UUID lock values + Lua atomic compare-and-delete. Never blindly &lt;code&gt;DELETE&lt;/code&gt; a lock key.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Double-check after acquiring the lock.&lt;/strong&gt; The distributed equivalent of double-checked locking prevents redundant work.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Solve idempotency server-side.&lt;/strong&gt; The Outbox pattern gives you atomicity without burdening API clients. Exponential backoff and dead letter queues turn “retry until it works” into something manageable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don’t reuse lost IDs.&lt;/strong&gt; Track gaps, explain them, move on. Reuse introduces collision risks that are far worse than a missing number.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>java</category>
      <category>redis</category>
      <category>software</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Discussion on personal developer websites with large padding on desktops</title>
      <dc:creator>Krishna Tej Chalamalasetty</dc:creator>
      <pubDate>Mon, 03 Jun 2019 04:16:01 +0000</pubDate>
      <link>https://dev.to/chkrishnatej/discussion-on-personal-developer-websites-with-large-padding-on-desktops-1i5g</link>
      <guid>https://dev.to/chkrishnatej/discussion-on-personal-developer-websites-with-large-padding-on-desktops-1i5g</guid>
      <description>&lt;p&gt;Hi all,&lt;/p&gt;

&lt;p&gt;I was going through a lot of developers' personal sites. In most of them, I observed that when browsing the site on a desktop it has a lot of padding on both sides, so much that even though I am browsing on a desktop, I feel like I am reading it on a mobile device.&lt;/p&gt;

&lt;p&gt;I understand that the web now largely follows a mobile-first approach. But I wanted to understand whether this is a consequence of following the mobile-first approach or an intentional choice.&lt;/p&gt;

&lt;p&gt;If it is intentional, what could be the rationale behind it?&lt;/p&gt;

&lt;p&gt;Please comment below&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Go Lang Installation</title>
      <dc:creator>Krishna Tej Chalamalasetty</dc:creator>
      <pubDate>Wed, 17 Apr 2019 01:32:00 +0000</pubDate>
      <link>https://dev.to/chkrishnatej/go-lang-installation-3b32</link>
      <guid>https://dev.to/chkrishnatej/go-lang-installation-3b32</guid>
      <description>&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;




&lt;p&gt;&lt;code&gt;go&lt;/code&gt; supports a wide range of operating systems. There are two ways &lt;code&gt;go&lt;/code&gt; can be installed on the system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Binaries&lt;/li&gt;
&lt;li&gt;Building it from the source&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I am installing it on &lt;strong&gt;MacBook Pro with macOS Mojave v10.14.4&lt;/strong&gt; for this tutorial using the Mac binaries.&lt;/p&gt;

&lt;p&gt;You could get the download link &lt;a href="https://golang.org/dl/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Fdownload_page.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Fdownload_page.png" alt="golang download page" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown in the image above, select the relevant package according to the operating system. I have downloaded the one for &lt;code&gt;Apple macOS&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download file
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Fdownload_file.png" alt="golang download page" width="800" height="400"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;




&lt;p&gt;The installation is fairly straightforward. You can go with the defaults, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Finstallation_step_1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Finstallation_step_1.png" alt="golang installation step1" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
 &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Finstallation_step_2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Finstallation_step_2.png" alt="golang installation step2" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
 &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Finstallation_step_3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Finstallation_step_3.png" alt="golang installation step3" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
 &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Finstallation_success.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Finstallation_success.png" alt="golang installation success" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Verification
&lt;/h2&gt;



&lt;p&gt;To verify that &lt;code&gt;go&lt;/code&gt; has been installed successfully, you can check from the command line.&lt;/p&gt;

&lt;p&gt;Use the following commands in the terminal to verify.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; which go # Returns path where go is installed/usr/local/go/bin/go&amp;gt;&amp;gt;&amp;gt; go version # Returns the version of the go installedgo version go1.12.4 darwin/amd64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Fverify_installation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fchkrishnatej.com%2Fimages%2Fgolang%2Fverify_installation.png" alt="golang installation success verification" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the terminal output looks like the above, then &lt;code&gt;go&lt;/code&gt; has been installed successfully.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up workspace
&lt;/h2&gt;




&lt;p&gt;&lt;code&gt;Go&lt;/code&gt; follows a different approach to managing code, which requires setting up a project workspace. This means all &lt;code&gt;go&lt;/code&gt; projects should be developed and maintained inside the defined workspace.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps to setup the workspace
&lt;/h3&gt;





&lt;p&gt;Open the shell config, which is located in the &lt;code&gt;HOME&lt;/code&gt; directory.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;NOTE:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.bash_profile&lt;/code&gt; or &lt;code&gt;.bashrc&lt;/code&gt; for &lt;em&gt;BASH&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.zshrc&lt;/code&gt; for &lt;em&gt;Oh my Zsh!&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add the following variables to shell config&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; export GOPATH=$HOME/&amp;lt;go-workspace&amp;gt; # &amp;lt;go-workspace&amp;gt; is a filler. Fill it with proper path&amp;gt;&amp;gt;&amp;gt; export PATH=$PATH:$GOPATH/bin&amp;gt;&amp;gt;&amp;gt; export GOROOT=/usr/local/opt/go/libexec&amp;gt;&amp;gt;&amp;gt; export PATH=$PATH:$GOROOT/bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;$GOPATH/src&lt;/code&gt; : source code of Go projects&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$GOPATH/pkg&lt;/code&gt; : compiled objects of imported packages&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$GOPATH/bin&lt;/code&gt; : home of the compiled binaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This completes the setup of &lt;code&gt;go&lt;/code&gt;, and the system is ready for some &lt;strong&gt;code&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>go</category>
    </item>
    <item>
      <title>GAE - An opinionated post</title>
      <dc:creator>Krishna Tej Chalamalasetty</dc:creator>
      <pubDate>Wed, 22 Aug 2018 13:45:00 +0000</pubDate>
      <link>https://dev.to/chkrishnatej/gae---an-opinionated-post-27ng</link>
      <guid>https://dev.to/chkrishnatej/gae---an-opinionated-post-27ng</guid>
      <description>&lt;p&gt;I have been looking to build a web-app and host it. My options were&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To take a server and deploy it using IaaS like Azure, AWS or GCP&lt;/li&gt;
&lt;li&gt;Deploy it using PaaS like Heroku&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Honestly speaking, I didn't know of any PaaS for deploying Python apps other than Heroku. Before investing my time in understanding the service, I wanted to see whether any other alternatives existed.&lt;/p&gt;

&lt;p&gt;My idea was to understand the different choices I had and learn their pros and cons before committing (never test the depth of a river with both feet).&lt;/p&gt;

&lt;p&gt;My criteria for making the choice were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simplicity&lt;/li&gt;
&lt;li&gt;Focus more on code than deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The choice was &lt;strong&gt;Google App Engine&lt;/strong&gt;. Even though it launched around the same time as Heroku, it didn't do well in the beginning. It was notorious for being hard to use, and it did suck.&lt;/p&gt;

&lt;p&gt;But now, things have changed. It has become a mature product with an excellent CLI. It really lives up to its claim, &lt;em&gt;just focus on your code&lt;/em&gt;: Google takes responsibility for deploying it with minimal setup from the developer.&lt;/p&gt;

&lt;p&gt;It supports both Python 2.7 and 3.x, and has excellent documentation to get you started. You can build web applications and mobile backends with it.&lt;/p&gt;

&lt;p&gt;Google App Engine (GAE) is now part of Google Cloud Platform (GCP). GAE's pricing is simple and more economical than its counterparts' (cloud pricing would be a whole other post). It has free daily quotas, which are more than enough for tinkering with, building, and testing your personal applications. And of course it's scalable; I have complete faith in Google's capabilities when it comes to scalability.&lt;/p&gt;

&lt;p&gt;And there are many other services offered by GCP such as storage solutions, machine learning solutions and much more. All of them fit together beautifully.&lt;/p&gt;

&lt;p&gt;So far it has been a great fit for building my pet projects, and I haven't found any drawbacks as long as I stay inside the Google ecosystem. I believe Google will be a force to reckon with in the cloud space, thanks to its rich cloud ecosystem and very affordable pricing strategy.&lt;/p&gt;

&lt;p&gt;Please comment about your experiences with GAE below.&lt;/p&gt;

</description>
      <category>gae</category>
      <category>cloud</category>
      <category>web</category>
      <category>python</category>
    </item>
    <item>
      <title>Is it normal that devs get stuck to figure out a solution while using external libraries?</title>
      <dc:creator>Krishna Tej Chalamalasetty</dc:creator>
      <pubDate>Mon, 09 Jul 2018 05:07:41 +0000</pubDate>
      <link>https://dev.to/chkrishnatej/is-it-normal-that-devs-get-stuck-to-figure-out-a-solution-while-using-external-libraries-23b9</link>
      <guid>https://dev.to/chkrishnatej/is-it-normal-that-devs-get-stuck-to-figure-out-a-solution-while-using-external-libraries-23b9</guid>
      <description>&lt;p&gt;I had this lingering doubt from so long. &lt;/p&gt;

&lt;p&gt;Whenever a problem comes up while developing, I can design a solution, or at least a starting point for solving it. But many times I get stuck figuring out a solution using a library we already use in the project.&lt;/p&gt;

&lt;p&gt;Figuring out a solution with the library takes much more time than writing my own. The downside of my own solution is that I have to write a whole method for functionality the library already provides.&lt;/p&gt;

&lt;p&gt;I often wonder whether this means I am slow at adopting solutions from external libraries. It affects my productivity a lot, as I tend to get stuck on trivial issues for hours.&lt;/p&gt;

&lt;p&gt;Please share your advice and experiences regarding this.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>productivity</category>
      <category>advice</category>
    </item>
    <item>
      <title>Bulma Showcase Page</title>
      <dc:creator>Krishna Tej Chalamalasetty</dc:creator>
      <pubDate>Wed, 04 Jul 2018 17:33:30 +0000</pubDate>
      <link>https://dev.to/chkrishnatej/bulma-showcase-page-2hd8</link>
      <guid>https://dev.to/chkrishnatej/bulma-showcase-page-2hd8</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fha94xwam15k0ukujemg9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fha94xwam15k0ukujemg9.png" alt="Screenshot of Bulma Showcase page" width="800" height="1582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Dev showcase page with Bulma CSS
&lt;/h1&gt;

&lt;p&gt;A &lt;a href="https://bulmadev.chkrishnatej.com/" rel="noopener noreferrer"&gt;developer showcase page&lt;/a&gt; with their projects and blog.&lt;/p&gt;

&lt;p&gt;Using &lt;a href="https://bulma.io" rel="noopener noreferrer"&gt;Bulma CSS&lt;/a&gt; the showcase page has been designed with vibrant colors. The tiles and cards showcase your work and blog posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  How is it built?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The whole project is built with Bulma CSS and a small piece of JavaScript for the responsive navigation menu bar.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Small fragments of code for a few features are sourced from other developers, all of whom have been given due credit.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to use it?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Clone the repo using this link&lt;br&gt;
&lt;code&gt;https://github.com/chkrishnatej/bulma-dev-page.git&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update the names and links appropriately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Test it on your local machine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deploy it to a server.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Github Pages
&lt;/h2&gt;

&lt;p&gt;GitHub Pages lets you host your static sites for free. To set up GitHub Pages, please follow this &lt;a href="https://pages.github.com/" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can even set up a custom domain for your GitHub Pages site. Follow the guide to &lt;a href="https://help.github.com/articles/using-a-custom-domain-with-github-pages/" rel="noopener noreferrer"&gt;custom domains in GitHub Pages&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.S.:&lt;/strong&gt;&lt;br&gt;
All ears for ideas, suggestions, or changes for the page. Please leave your feedback in the comments.&lt;/p&gt;

</description>
      <category>dev</category>
      <category>showcase</category>
      <category>html</category>
    </item>
  </channel>
</rss>
