<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pavel Goldin</title>
    <description>The latest articles on DEV Community by Pavel Goldin (@wicsion).</description>
    <link>https://dev.to/wicsion</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3909402%2F1e3079f5-7b6a-4e99-96d1-df4dd699225b.jpg</url>
      <title>DEV Community: Pavel Goldin</title>
      <link>https://dev.to/wicsion</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wicsion"/>
    <language>en</language>
    <item>
      <title>I Thought I'd Just Call a Blockchain API. It Didn't Work Out That Way.</title>
      <dc:creator>Pavel Goldin</dc:creator>
      <pubDate>Sat, 02 May 2026 18:24:52 +0000</pubDate>
      <link>https://dev.to/wicsion/i-thought-id-just-call-a-blockchain-api-it-didnt-work-out-that-way-4mp</link>
      <guid>https://dev.to/wicsion/i-thought-id-just-call-a-blockchain-api-it-didnt-work-out-that-way-4mp</guid>
      <description>&lt;p&gt;I walked into a crypto project thinking I'd just pull data from a blockchain API. That's not how it went.&lt;br&gt;
I join the project. Stack: FastAPI, PostgreSQL, Redis as Celery broker, Celery workers, Docker, Web3. Startup on a hype wave, real money, architecture built on the fly. I look at the payment processing architecture and my first thought: guys, are you serious? Financial operations with real money, zero idempotency, Redis as broker with no persistence, Web3.py synchronous calls inside Celery tasks.&lt;br&gt;
The conversation was short: here's the task, fix what's there. Deadlines were burning.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Was Broken
&lt;/h2&gt;

&lt;p&gt;First month of prod. A user writes to support: credited twice, withdrew double. I open the logs, clean. Two identical events, both 200, four seconds apart. Both processed. The user got double the balance.&lt;br&gt;
Daily reconciliation with on-chain data showed a discrepancy: several accounts with a balance higher than confirmed transactions should allow. In the first month we found 23 duplicate credits across roughly 180k transactions, around a 0.013% error rate. 23 double credits in a month. Real money, not a metric.&lt;br&gt;
The first thing that surfaced: duplicates from the provider. Alchemy, Infura, and every other blockchain provider operates on at-least-once delivery. On network failure, restart, or under load the provider retries delivery. The provider says so right in the docs. That's not a bug, that's the rules of the game. The provider retries delivery, your code has to survive that without consequences. Ours didn't.&lt;br&gt;
It got worse. Two parallel withdrawal requests read the balance simultaneously, both saw enough funds, both passed validation, both deducted. Textbook race condition.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;async def withdraw(conn, user_id: int, amount: Decimal):&lt;br&gt;
    balance = await conn.fetchval(&lt;br&gt;
        "SELECT balance FROM users WHERE id = $1", user_id&lt;br&gt;
    )&lt;br&gt;
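    # race window: a concurrent request can read the same balance right here&lt;br&gt;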
    if balance &amp;gt;= amount:&lt;br&gt;
        await conn.execute(&lt;br&gt;
            "UPDATE users SET balance = balance - $1 WHERE id = $2",&lt;br&gt;
            amount, user_id&lt;br&gt;
        )&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then this. Celery with default settings acknowledges the task to the broker at the moment of receipt. A worker dies mid-processing, the event is acknowledged, the DB write never happened. No retry, no DLQ. Worker died, task acknowledged, money never arrived. The user waits and has no idea what happened.&lt;br&gt;
And a separate silent killer: amount gets serialized to JSON through the Celery broker as a float. Decimal("50.1") becomes a JSON float, meaning 50.099999999999994. At scale this accumulates into a real loss. Nobody noticed until they ran the numbers.&lt;br&gt;
Last one: calling .delay() directly from the webhook handler creates a window between the DB write and the queue insertion. If the process dies in that moment, the event hangs in pending with no automatic recovery.&lt;br&gt;
Five problems total. I started fixing them.&lt;/p&gt;
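&lt;p&gt;The float problem is easy to reproduce in isolation. A minimal sketch, my illustration rather than the project's code, of what a plain JSON round-trip (which is what Celery's json serializer does to task arguments) does to a Decimal, and the fix we landed on: the amount travels as a string end to end.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from decimal import Decimal&lt;br&gt;
import json&lt;br&gt;
amount = Decimal("50.1")&lt;br&gt;
# json.dumps() cannot serialize Decimal, so a float() sneaks in somewhere:&lt;br&gt;
wire = json.dumps({"amount": float(amount)})&lt;br&gt;
restored = Decimal(json.loads(wire)["amount"])  # rebuilt from a binary float&lt;br&gt;
print(restored == amount)  # False: 50.1000000000000014..., not 50.1&lt;br&gt;
# The fix: serialize as a string, reconstruct the Decimal on the worker side.&lt;br&gt;
wire = json.dumps({"amount": str(amount)})&lt;br&gt;
print(Decimal(json.loads(wire)["amount"]) == amount)  # True&lt;/code&gt;&lt;/p&gt;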

&lt;h2&gt;
  
  
  &lt;strong&gt;First Instinct: Redis Distributed Lock&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;SET NX EX on user_id. The pattern is described by Antirez, implemented in 20 minutes. Didn't fly.&lt;br&gt;
Here's the specific scenario that showed up in the logs. A worker acquires the lock in Redis. Starts a transaction in PostgreSQL. Between those two operations, the OOM killer takes out the process. The PostgreSQL transaction rolled back automatically, balance unchanged. The Redis lock hangs for 30 seconds until TTL. After 30 seconds the next worker acquires the lock, sees that the idempotency_key isn't recorded (nobody was around to write it, the transaction rolled back) and processes the event again. Double credit. Both workers are clean in the logs.&lt;br&gt;
The problem isn't the TTL size. The problem is the absence of cross-system atomicity between Redis and PostgreSQL. Redis doesn't work here, no atomicity with PostgreSQL. A code-level check doesn't work either, two workers will both pass the SELECT before the INSERT. The only thing that's atomic by definition: a unique constraint. With money there's no "almost right."&lt;/p&gt;
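&lt;p&gt;For reference, the discarded variant looked roughly like this. A sketch from memory, not the project's exact code; apply_credit_in_postgres stands in for the real idempotency check plus balance update that lived in PostgreSQL.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import uuid&lt;br&gt;
import redis&lt;br&gt;
r = redis.Redis.from_url("redis://localhost:6379/0", decode_responses=True)&lt;br&gt;
LOCK_TTL = 30  # the same 30-second TTL from the incident above&lt;br&gt;
def credit_with_lock(event_id: str, user_id: int, amount) -&amp;gt; None:&lt;br&gt;
    token = str(uuid.uuid4())&lt;br&gt;
    # SET NX EX: acquire only if the key is absent, auto-expire after TTL&lt;br&gt;
    if not r.set(f"lock:user:{user_id}", token, nx=True, ex=LOCK_TTL):&lt;br&gt;
        return  # someone else holds the lock, try again later&lt;br&gt;
    try:&lt;br&gt;
        # the gap: the lock lives in Redis, the idempotency check and the&lt;br&gt;
        # balance UPDATE live in PostgreSQL; nothing ties them together&lt;br&gt;
        apply_credit_in_postgres(event_id, user_id, amount)  # hypothetical helper&lt;br&gt;
    finally:&lt;br&gt;
        # best-effort release; a crash before this line parks the lock until TTL&lt;br&gt;
        if r.get(f"lock:user:{user_id}") == token:&lt;br&gt;
            r.delete(f"lock:user:{user_id}")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The schema that replaced it:&lt;/p&gt;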

&lt;p&gt;`CREATE TABLE payment_events (&lt;br&gt;
    event_id     TEXT PRIMARY KEY,&lt;br&gt;
    user_id      INTEGER NOT NULL REFERENCES users(id),&lt;br&gt;
    amount       NUMERIC(38, 18) NOT NULL,&lt;br&gt;
    event_type   TEXT NOT NULL,&lt;br&gt;
    status       TEXT NOT NULL DEFAULT 'pending',&lt;br&gt;
    retry_count  INTEGER NOT NULL DEFAULT 0,&lt;br&gt;
    created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),&lt;br&gt;
    updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),&lt;br&gt;
    CONSTRAINT valid_status CHECK (&lt;br&gt;
        status IN ('pending', 'enqueued', 'processing', 'confirmed', 'failed')&lt;br&gt;
    )&lt;br&gt;
);&lt;/p&gt;

&lt;p&gt;CREATE TABLE balance_events (&lt;br&gt;
    id              BIGSERIAL PRIMARY KEY,&lt;br&gt;
    user_id         INTEGER NOT NULL REFERENCES users(id),&lt;br&gt;
    amount          NUMERIC(38, 18) NOT NULL,&lt;br&gt;
    event_type      TEXT NOT NULL,&lt;br&gt;
    source_event_id TEXT NOT NULL,&lt;br&gt;
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),&lt;br&gt;
    CONSTRAINT uq_balance_events_source UNIQUE (source_event_id, event_type)&lt;br&gt;
);&lt;/p&gt;

&lt;p&gt;CREATE TABLE processed_events (&lt;br&gt;
    idempotency_key TEXT PRIMARY KEY,&lt;br&gt;
    outcome         TEXT NOT NULL DEFAULT 'pending',&lt;br&gt;
    created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()&lt;br&gt;
);&lt;/p&gt;

&lt;p&gt;CREATE TABLE dead_letter_queue (&lt;br&gt;
    id         BIGSERIAL PRIMARY KEY,&lt;br&gt;
    event_id   TEXT NOT NULL,&lt;br&gt;
    event_type TEXT NOT NULL,&lt;br&gt;
    user_id    INTEGER NOT NULL,&lt;br&gt;
    amount     NUMERIC(38, 18) NOT NULL,&lt;br&gt;
    error      TEXT NOT NULL,&lt;br&gt;
    attempt    INTEGER NOT NULL DEFAULT 1,&lt;br&gt;
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()&lt;br&gt;
);&lt;/p&gt;

&lt;p&gt;ALTER TABLE users&lt;br&gt;
    ADD COLUMN IF NOT EXISTS initial_balance NUMERIC(38, 18) NOT NULL DEFAULT 0,&lt;br&gt;
    ADD CONSTRAINT balance_non_negative CHECK (balance &amp;gt;= 0);&lt;/p&gt;

&lt;p&gt;CREATE INDEX idx_payment_events_pending&lt;br&gt;
    ON payment_events (updated_at, created_at)&lt;br&gt;
    WHERE status = 'pending';&lt;/p&gt;

&lt;p&gt;CREATE INDEX idx_payment_events_enqueued&lt;br&gt;
    ON payment_events (updated_at)&lt;br&gt;
    WHERE status = 'enqueued';&lt;/p&gt;

&lt;p&gt;CREATE INDEX idx_payment_events_processing&lt;br&gt;
    ON payment_events (updated_at)&lt;br&gt;
    WHERE status = 'processing';&lt;/p&gt;

&lt;p&gt;CREATE INDEX idx_balance_events_user_id&lt;br&gt;
    ON balance_events (user_id);&lt;/p&gt;

&lt;p&gt;CREATE INDEX idx_balance_events_created_at&lt;br&gt;
    ON balance_events (created_at DESC);&lt;/p&gt;

&lt;p&gt;CREATE INDEX idx_processed_events_created&lt;br&gt;
    ON processed_events (created_at);&lt;/p&gt;

&lt;p&gt;CREATE INDEX idx_processed_events_pending_stale&lt;br&gt;
    ON processed_events (created_at)&lt;br&gt;
    WHERE outcome = 'pending';&lt;/p&gt;

&lt;p&gt;CREATE INDEX idx_dlq_event_id&lt;br&gt;
    ON dead_letter_queue (event_id);&lt;/p&gt;

&lt;p&gt;CREATE INDEX idx_dlq_created_at&lt;br&gt;
    ON dead_letter_queue (created_at DESC);`&lt;/p&gt;

&lt;p&gt;Three non-obvious decisions in the schema.&lt;br&gt;
NUMERIC(38, 18), not NUMERIC(20, 8). The amount column stores in ETH, not wei. The webhook provider sends an already-converted value. If your provider returns wei, convert at the boundary: amount_eth = Decimal(wei_str) / Decimal(10**18) before passing to _validate_amount. ERC-20 tokens declare their own decimals(): USDC/USDT use 6, WBTC uses 8, DAI/WETH/MKR use 18. ETH in wei is also 10^18. NUMERIC(20, 8) handles USDC/USDT, but physically cannot store 18-decimal tokens, so we take the worst case: NUMERIC(38, 18).&lt;br&gt;
initial_balance is needed for reconciliation. During migration you populate it with the current balance: UPDATE users SET initial_balance = balance WHERE . This means balance_events starts filling from zero, and hot_path_balance_check correctly accounts only for users whose operations have all gone through balance_events. For new systems initial_balance stays at 0.&lt;br&gt;
Separate indexes for pending/enqueued/processing instead of a single status IN (...), because pollers use different access patterns. idx_payment_events_pending is a partial index on (updated_at, created_at) for ORDER BY created_at in enqueue_pending_events, otherwise the planner sorts without an index.&lt;br&gt;
retry_count in payment_events was added to prevent infinite pending -&amp;gt; enqueued cycling during a durable Redis outage. More on that in the degradation section.&lt;/p&gt;
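&lt;p&gt;If the provider hands you raw integer units rather than an already-converted value, the boundary conversion mentioned above is one line per token. A sketch, assuming you know each token's decimals(); the values below are the commonly cited ones, verify against the actual contract.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from decimal import Decimal&lt;br&gt;
# decimals() per token; verify on-chain, these are just the usual values&lt;br&gt;
TOKEN_DECIMALS = {"ETH": 18, "USDC": 6, "USDT": 6, "WBTC": 8, "DAI": 18}&lt;br&gt;
def to_display_units(raw: str, token: str) -&amp;gt; Decimal:&lt;br&gt;
    # raw arrives as a string of integer base units (wei for ETH)&lt;br&gt;
    return Decimal(raw) / (Decimal(10) ** TOKEN_DECIMALS[token])&lt;br&gt;
assert to_display_units("1500000000000000000", "ETH") == Decimal("1.5")&lt;br&gt;
assert to_display_units("25000000", "USDC") == Decimal("25")&lt;/code&gt;&lt;/p&gt;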

&lt;h2&gt;
  
  
  How It Got Fixed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Initialization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;`import os&lt;br&gt;
import uuid&lt;br&gt;
import json&lt;br&gt;
import hmac&lt;br&gt;
import random&lt;br&gt;
import hashlib&lt;br&gt;
import secrets&lt;br&gt;
import threading&lt;br&gt;
import structlog&lt;br&gt;
import psycopg2&lt;br&gt;
import psycopg2.extras&lt;br&gt;
import psycopg2.pool&lt;br&gt;
import redis as redis_lib&lt;br&gt;
from typing import Literal, Optional&lt;br&gt;
import re&lt;br&gt;
from decimal import Decimal, InvalidOperation&lt;br&gt;
from contextvars import ContextVar&lt;br&gt;
from dataclasses import dataclass, field&lt;br&gt;
from datetime import datetime, timezone, timedelta&lt;br&gt;
from celery import shared_task&lt;br&gt;
from celery.exceptions import Ignore, MaxRetriesExceededError&lt;/p&gt;

&lt;p&gt;logger = structlog.get_logger()&lt;/p&gt;

&lt;p&gt;@dataclass(frozen=True)&lt;br&gt;
class Settings:&lt;br&gt;
    DATABASE_URL:   str&lt;br&gt;
    WEBHOOK_SECRET: str&lt;br&gt;
    ETH_RPC_URL:    str&lt;br&gt;
    ALERT_EMAIL:    str&lt;br&gt;
    REDIS_URL:      str = "redis://localhost:6379/0"&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@classmethod
def from_env(cls) -&amp;gt; "Settings":
    required = ("DATABASE_URL", "WEBHOOK_SECRET", "ETH_RPC_URL", "ALERT_EMAIL")
    missing  = [k for k in required if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
    return cls(
        DATABASE_URL   = os.environ["DATABASE_URL"],
        WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"],
        ETH_RPC_URL    = os.environ["ETH_RPC_URL"],
        ALERT_EMAIL    = os.environ["ALERT_EMAIL"],
        REDIS_URL      = os.environ.get("REDIS_URL", "redis://localhost:6379/0"),
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;settings = Settings.from_env()&lt;/p&gt;

&lt;p&gt;_redis_client = redis_lib.Redis.from_url(&lt;br&gt;
    settings.REDIS_URL,&lt;br&gt;
    decode_responses=True,&lt;br&gt;
    socket_connect_timeout=2,&lt;br&gt;
    socket_timeout=2,&lt;br&gt;
    retry_on_timeout=False,&lt;br&gt;
)`&lt;/p&gt;

&lt;p&gt;send_alert: a rate-limited wrapper around the logger. In prod it's replaced with a PagerDuty/OpsGenie SDK. Identical alert_key values within the cooldown window are suppressed. If no key is provided, sends without rate-limiting, for one-off critical alerts. Never throws exceptions.&lt;br&gt;
_alert_last_sent grows with unique keys. If you generate keys per-event-id (which we do for orphan alerts), over a month that's millions of entries. So when it overflows, we first clean out stale keys, and if there's still no room after cleanup, we suppress new ones. Hacky, yes. But it hasn't fallen over in eight months.&lt;/p&gt;

&lt;p&gt;`_alert_lock = threading.Lock()&lt;br&gt;
_alert_last_sent: dict = {}&lt;br&gt;
MAX_ALERT_KEYS = 1_000&lt;/p&gt;

&lt;p&gt;def send_alert(message: str, alert_key: Optional[str] = None,&lt;br&gt;
               cooldown_seconds: int = 300) -&amp;gt; None:&lt;br&gt;
    try:&lt;br&gt;
        if alert_key is None:&lt;br&gt;
            logger.critical("ALERT", message=message)&lt;br&gt;
            return&lt;br&gt;
        with _alert_lock:&lt;br&gt;
            now = datetime.now(timezone.utc)&lt;br&gt;
            if len(_alert_last_sent) &amp;gt;= MAX_ALERT_KEYS:&lt;br&gt;
                stale_cutoff = now - timedelta(seconds=cooldown_seconds * 2)&lt;br&gt;
                stale = [k for k, v in _alert_last_sent.items() if v &amp;lt; stale_cutoff]&lt;br&gt;
                for k in stale:&lt;br&gt;
                    del _alert_last_sent[k]&lt;br&gt;
                if len(_alert_last_sent) &amp;gt;= MAX_ALERT_KEYS and alert_key not in _alert_last_sent:&lt;br&gt;
                    logger.warning("send_alert suppressed: rate limit dict full",&lt;br&gt;
                                   alert_key=alert_key)&lt;br&gt;
                    return&lt;br&gt;
            last = _alert_last_sent.get(alert_key)&lt;br&gt;
            if last and (now - last).total_seconds() &amp;lt; cooldown_seconds:&lt;br&gt;
                return&lt;br&gt;
            _alert_last_sent[alert_key] = now&lt;br&gt;
        logger.critical("ALERT", message=message, alert_key=alert_key)&lt;br&gt;
    except Exception as e:&lt;br&gt;
        logger.error("send_alert failed", error=str(e))&lt;/p&gt;

&lt;p&gt;class ImproperlyConfigured(RuntimeError):&lt;br&gt;
    pass`&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Trace ID Through the Entire Chain&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Each worker and each task gets its own ContextVar value automatically; it can't leak between threads.&lt;/p&gt;

&lt;p&gt;`_trace_id: ContextVar[str] = ContextVar('trace_id', default='')&lt;/p&gt;

&lt;p&gt;def get_trace_id() -&amp;gt; str:&lt;br&gt;
    return _trace_id.get() or 'no-trace'&lt;/p&gt;

&lt;p&gt;def set_trace_id(tid: str) -&amp;gt; None:&lt;br&gt;
    _trace_id.set(tid)&lt;/p&gt;

&lt;p&gt;def new_trace_id() -&amp;gt; str:&lt;br&gt;
    tid = str(uuid.uuid4())&lt;br&gt;
    _trace_id.set(tid)&lt;br&gt;
    return tid&lt;/p&gt;

&lt;p&gt;structlog.configure(&lt;br&gt;
    processors=[&lt;br&gt;
        structlog.processors.add_log_level,&lt;br&gt;
        lambda _, __, event_dict: {**event_dict, "trace_id": get_trace_id()},&lt;br&gt;
        structlog.processors.JSONRenderer(),&lt;br&gt;
    ]&lt;br&gt;
)`&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Idempotency Key via DB Unique Constraint&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The key is built from event_id and event_type, written to a separate table with a unique constraint in the same transaction as the balance change.&lt;br&gt;
Redis doesn't work here, no atomicity with PostgreSQL. A code-level check doesn't work either, two workers will both pass SELECT before INSERT. The only thing that's atomic by definition: a unique constraint.&lt;br&gt;
In the first version I used concatenation f"{event_id}::{event_type}". Got a collision when :: appeared in event_id. Tried a NUL separator: f"{event_id}\0{event_type}".encode(). Also a collision: _idempotency_key("a\x00b", "c") == _idempotency_key("a", "b\x00c"), both produce bytes b"a\x00b\x00c". Final version, length-prefix encoding: each field is preceded by a 4-byte length, collisions between fields are structurally impossible.&lt;/p&gt;

&lt;p&gt;`class RetryableError(Exception):&lt;br&gt;
    pass&lt;/p&gt;

&lt;p&gt;class AlreadyProcessedError(Exception):&lt;br&gt;
    pass&lt;/p&gt;

&lt;p&gt;MAX_AMOUNT = Decimal("10") ** 20&lt;/p&gt;

&lt;p&gt;_AMOUNT_RE = re.compile(r"^[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?$")&lt;/p&gt;

&lt;p&gt;def _validate_amount(amount) -&amp;gt; Decimal:&lt;br&gt;
    if isinstance(amount, float):&lt;br&gt;
        raise ValueError(&lt;br&gt;
            f"float is not allowed. Pass amount as str from JSON payload. "&lt;br&gt;
            f"Got: {amount!r}"&lt;br&gt;
        )&lt;br&gt;
    if isinstance(amount, str) and amount != amount.strip():&lt;br&gt;
        raise ValueError(&lt;br&gt;
            f"amount contains whitespace: {amount!r}. "&lt;br&gt;
            f"Pass amount without spaces."&lt;br&gt;
        )&lt;br&gt;
    if isinstance(amount, str) and not _AMOUNT_RE.fullmatch(amount):&lt;br&gt;
        raise ValueError(f"invalid amount format: {amount!r}")&lt;br&gt;
    try:&lt;br&gt;
        amount_decimal = Decimal(str(amount))&lt;br&gt;
        if not amount_decimal.is_finite():&lt;br&gt;
            raise ValueError(f"amount must be finite, got {amount_decimal}")&lt;br&gt;
        if amount_decimal &amp;lt;= 0:&lt;br&gt;
            raise ValueError(f"amount must be positive, got {amount_decimal}")&lt;br&gt;
        if amount_decimal.normalize().as_tuple().exponent &amp;lt; -18:&lt;br&gt;
            raise ValueError(f"amount precision exceeds 18 decimals: {amount_decimal}")&lt;br&gt;
        if amount_decimal &amp;gt;= MAX_AMOUNT:&lt;br&gt;
            raise ValueError(&lt;br&gt;
                f"amount exceeds NUMERIC(38,18) capacity: {amount_decimal} &amp;gt;= 10^20"&lt;br&gt;
            )&lt;br&gt;
        return amount_decimal&lt;br&gt;
    except InvalidOperation:&lt;br&gt;
        raise ValueError(f"invalid amount format: {amount!r}")&lt;/p&gt;

&lt;p&gt;def _idempotency_key(event_id: str, event_type: str) -&amp;gt; str:&lt;br&gt;
    a = event_id.encode("utf-8")&lt;br&gt;
    b = event_type.encode("utf-8")&lt;br&gt;
    payload = (&lt;br&gt;
        len(a).to_bytes(4, "big") + a +&lt;br&gt;
        len(b).to_bytes(4, "big") + b&lt;br&gt;
    )&lt;br&gt;
    return hashlib.sha256(payload).hexdigest()`&lt;/p&gt;

&lt;p&gt;Two places in the first version broke on edge cases: MAX_AMOUNT = Decimal("10")**20 - 1 rejected a valid amount, and 50.0000000000000000000 failed as exponent=-19 even though there are no significant digits there. Fixed: 10**20 without the -1, and normalize() before checking the exponent. It's worth writing tests for the edge cases of your own validators. You learn interesting things.&lt;br&gt;
Another surprise from the same category: _validate_amount("+50.1"), _validate_amount("1_000"), and _validate_amount("١٢٣") all return a valid Decimal. Python is tolerant of underscore notation, leading +, and Eastern Arabic numerals. For a financial validator this is undesirable behavior, the input should strictly be [digits].[digits]. Added regex ^[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?$ before Decimal(), rejects everything non-standard.&lt;/p&gt;
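&lt;p&gt;The tests that came out of those surprises, roughly (pytest, trimmed to the cases mentioned above; assumes _validate_amount is importable from the payments module):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import pytest&lt;br&gt;
from decimal import Decimal&lt;br&gt;
@pytest.mark.parametrize("bad", [&lt;br&gt;
    50.1,                     # float is rejected outright&lt;br&gt;
    " 50.1", "50.1 ",         # whitespace&lt;br&gt;
    "+50.1", "1_000", "١٢٣",  # leading +, underscores, Eastern Arabic digits&lt;br&gt;
    "NaN", "Infinity", "-1", "0",&lt;br&gt;
    "0.0000000000000000001",  # 19 decimals, finer than NUMERIC(38, 18)&lt;br&gt;
    "100000000000000000000",  # 10**20, over the integer-part capacity&lt;br&gt;
])&lt;br&gt;
def test_validate_amount_rejects(bad):&lt;br&gt;
    with pytest.raises(ValueError):&lt;br&gt;
        _validate_amount(bad)&lt;br&gt;
@pytest.mark.parametrize("ok, expected", [&lt;br&gt;
    ("50.1", Decimal("50.1")),&lt;br&gt;
    ("50.0000000000000000000", Decimal("50")),  # exponent check uses normalize()&lt;br&gt;
])&lt;br&gt;
def test_validate_amount_accepts(ok, expected):&lt;br&gt;
    assert _validate_amount(ok) == expected&lt;/code&gt;&lt;/p&gt;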

&lt;h2&gt;
  
  
  &lt;strong&gt;FSM of Transitions, Single Source of Truth&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Event status is a deterministic finite state machine. In the beginning there were three places with raw UPDATE payment_events SET status = .... That violated the FSM invariant.&lt;br&gt;
Transitions: pending goes to enqueued via the poller. enqueued to processing when a worker picks up the task. From processing only to confirmed or failed. There's a separate edge, enqueued directly to confirmed, needed for the replay path: when processed_events already contains outcome=success, but the worker crashed after writing that and before transitioning payment_events to confirmed. Without this edge, events would hang in enqueued forever.&lt;/p&gt;

&lt;p&gt;`VALID_TRANSITIONS: dict[str, set[str]] = {&lt;br&gt;
    "pending":    {"enqueued", "failed"},&lt;br&gt;
    "enqueued":   {"processing", "failed", "pending", "confirmed"},&lt;br&gt;
    "processing": {"confirmed", "failed"},&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;TERMINAL_STATUSES = frozenset({"confirmed", "failed"})&lt;/p&gt;

&lt;p&gt;def transition_event_status(cur, event_id, from_status, to_status):&lt;br&gt;
    if to_status not in VALID_TRANSITIONS.get(from_status, set()):&lt;br&gt;
        raise ValueError(f"invalid transition: {from_status} -&amp;gt; {to_status}")&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cur.execute(
    "UPDATE payment_events SET status = %s, updated_at = NOW() "
    "WHERE event_id = %s AND status = %s",
    (to_status, event_id, from_status),
)

if cur.rowcount == 0:
    cur.execute("SELECT status FROM payment_events WHERE event_id = %s", (event_id,))
    actual = cur.fetchone()
    actual_status = actual["status"] if actual else "not found"

    if actual_status == to_status:
        logger.info("status already set", event_id=event_id, status=to_status)
        return

    if actual_status in TERMINAL_STATUSES:
        raise AlreadyProcessedError(f"event already terminal: {actual_status}")

    raise RetryableError(
        f"concurrent status transition event_id={event_id} "
        f"expected={from_status} actual={actual_status}"
    )`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;_mark_event_failed: safely transitions to failed from any non-terminal status. It commits itself; the rule is simple: failed status must land in the DB no matter what. Everything else can wait. Call it only on a clean connection, after a rollback.&lt;br&gt;
_mark_event_failed commits itself. If you refactor, this will bite you.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;def _mark_event_failed(conn, event_id) -&amp;gt; bool:&lt;br&gt;
    try:&lt;br&gt;
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:&lt;br&gt;
            cur.execute("SET LOCAL lock_timeout = '2s'")&lt;br&gt;
            cur.execute(&lt;br&gt;
                "SELECT status FROM payment_events WHERE event_id = %s FOR UPDATE NOWAIT",&lt;br&gt;
                (event_id,)&lt;br&gt;
            )&lt;br&gt;
            row = cur.fetchone()&lt;br&gt;
            if row is None:&lt;br&gt;
                conn.rollback()&lt;br&gt;
                return False&lt;br&gt;
            current = row["status"]&lt;br&gt;
            if current in TERMINAL_STATUSES:&lt;br&gt;
                conn.rollback()&lt;br&gt;
                return False&lt;br&gt;
            try:&lt;br&gt;
                transition_event_status(cur, event_id, current, "failed")&lt;br&gt;
                conn.commit()&lt;br&gt;
                return True&lt;br&gt;
            except (ValueError, RetryableError, AlreadyProcessedError):&lt;br&gt;
                conn.rollback()&lt;br&gt;
                return False&lt;br&gt;
    except (psycopg2.errors.LockNotAvailable, psycopg2.errors.QueryCanceled):&lt;br&gt;
        try:&lt;br&gt;
            conn.rollback()&lt;br&gt;
        except Exception:&lt;br&gt;
            pass&lt;br&gt;
        return False&lt;br&gt;
    except Exception:&lt;br&gt;
        try:&lt;br&gt;
            conn.rollback()&lt;br&gt;
        except Exception:&lt;br&gt;
            pass&lt;br&gt;
        raise&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;SELECT FOR UPDATE NOWAIT + lock_timeout&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;With a plain FOR UPDATE the worker silently waits for the lock, blocking the thread. NOWAIT eliminates that.&lt;br&gt;
A migration with ALTER TABLE locks the entire table, and lock_timeout = '2s' prevents the worker from hanging for the full duration.&lt;br&gt;
On error codes: lock_timeout throws LockNotAvailable (pgcode 55P03), statement_timeout throws QueryCanceled (pgcode 57014). Confusing them leads to an uncaught exception in production. DeadlockDetected (pgcode 40P01) means PostgreSQL killed the transaction over a lock cycle; it is also transient, also retryable. PostgreSQL picks a victim and rolls back its transaction, a retry resolves the issue. All three need to be caught together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connection Pool with Validation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A connection in the pool can be dead: PostgreSQL closes idle connections via tcp_keepalives_idle or idle_in_transaction_session_timeout. Without a check the worker gets a broken TCP connection and crashes with InterfaceError at a random moment.&lt;br&gt;
_PooledConn: a wrapper around a connection that knows how to return itself to its owner, via putconn() back to the pool or via close() if the connection was created directly. putconn() is idempotent, the second call is a no-op. __getattr__ doesn't proxy dunder methods, so _PooledConn cannot be used as a context manager. All code works through conn.cursor().&lt;br&gt;
get_validated_conn does three levels of checking without I/O in the normal path: first conn.closed (an in-memory flag), then conn.status (dirty transaction from previous use), and only if status is not STATUS_READY, it runs SELECT 1.&lt;/p&gt;

&lt;p&gt;`class _PooledConn:&lt;br&gt;
    def __init__(self, conn, pool=None):&lt;br&gt;
        self._conn = conn&lt;br&gt;
        self._pool = pool&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def putconn(self, close=False):
    if self._pool is None:
        try:
            self._conn.close()
        except Exception:
            pass
        return
    pool, self._pool = self._pool, None  # idempotent: second call is a no-op
    try:
        pool.putconn(self._conn, close=close)
    except Exception:
        pass

def __getattr__(self, name):
    return getattr(self._conn, name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;def get_validated_conn(pool: psycopg2.pool.SimpleConnectionPool) -&amp;gt; "_PooledConn":&lt;br&gt;
    try:&lt;br&gt;
        conn = pool.getconn()&lt;br&gt;
    except psycopg2.pool.PoolError as e:&lt;br&gt;
        raise RetryableError(f"DB connection pool exhausted: {e}")&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if conn.closed != 0:
    try:
        pool.putconn(conn, close=True)
    except Exception:
        pass
    direct = psycopg2.connect(dsn=settings.DATABASE_URL)
    return _PooledConn(direct, pool=None)

if conn.status == psycopg2.extensions.STATUS_IN_TRANSACTION:
    try:
        conn.rollback()
        logger.warning("get_validated_conn: rolled back dirty connection")
    except Exception:
        try:
            pool.putconn(conn, close=True)
        except Exception:
            pass
        direct = psycopg2.connect(dsn=settings.DATABASE_URL)
        return _PooledConn(direct, pool=None)

if conn.status != psycopg2.extensions.STATUS_READY:
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
    except Exception:
        try:
            pool.putconn(conn, close=True)
        except Exception:
            pass
        direct = psycopg2.connect(dsn=settings.DATABASE_URL)
        return _PooledConn(direct, pool=None)

return _PooledConn(conn, pool)`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Deposit and Withdrawal&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;On INSERT INTO processed_events there are two outcomes: success, we continue (first-time path). UniqueViolation, we've seen this event before (replay path).&lt;br&gt;
On the replay path we check outcome. If success, we sync payment_events status with reality. If pending, another worker is mid-transaction, we throw RetryableError for immediate retry instead of waiting for recover_stale_enqueued_events in 3 minutes.&lt;br&gt;
retry_count resets to 0 on successful processing. Without this: an event goes through retries, gets processed successfully with retry_count=7, after 14+ days processed_events is cleaned up by TTL, and the event arrives again via reorg compensation. It would then start at 7/10 on its way to the DLQ instead of 0/10.&lt;br&gt;
One special case on replay: processed_events says everything is fine, but payment_events knows nothing about this event. That shouldn't happen. Log it, alert, don't panic.&lt;/p&gt;

&lt;p&gt;`def process_deposit_sync(conn, event_id, event_type, user_id, amount):&lt;br&gt;
    amount_decimal  = _validate_amount(amount)&lt;br&gt;
    idempotency_key = _idempotency_key(event_id, event_type)&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
    cur.execute("SET LOCAL lock_timeout = '2s'")
    cur.execute("SET LOCAL statement_timeout = '5s'")

    try:
        cur.execute(
            "INSERT INTO processed_events (idempotency_key, outcome) "
            "VALUES (%s, 'pending')",
            (idempotency_key,),
        )
    except psycopg2.errors.UniqueViolation:
        conn.rollback()
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur2:
            cur2.execute("SET LOCAL lock_timeout = '2s'")
            cur2.execute("SET LOCAL statement_timeout = '5s'")
            cur2.execute(
                "SELECT outcome FROM processed_events WHERE idempotency_key = %s",
                (idempotency_key,),
            )
            row = cur2.fetchone()
            outcome = row["outcome"] if row else "pending"

            if outcome == "success":
                cur2.execute(
                    "SELECT status FROM payment_events WHERE event_id = %s",
                    (event_id,)
                )
                r = cur2.fetchone()
                current = r["status"] if r else None
                if current is None:
                    logger.error(
                        "deposit_replay: orphan event, processed_events "
                        "exists but payment_event not found",
                        event_id=event_id,
                    )
                    send_alert(
                        f"[CRITICAL] orphan deposit event: {event_id}",
                        alert_key=f"orphan_deposit:{event_id}",
                    )
                elif current == "enqueued":
                    transition_event_status(cur2, event_id, "enqueued", "confirmed")
                elif current == "processing":
                    transition_event_status(cur2, event_id, "processing", "confirmed")
                elif current == "confirmed":
                    pass
                else:
                    conn.rollback()
                    raise RetryableError(
                        f"deposit_replay FSM violation: "
                        f"payment_event.status={current!r} with outcome=success "
                        f"for event_id={event_id}"
                    )
                conn.commit()
                return

        conn.rollback()
        raise RetryableError(
            f"deposit idempotency hit with outcome={outcome!r} for event_id={event_id}"
        )

    except (psycopg2.errors.LockNotAvailable, psycopg2.errors.QueryCanceled,
            psycopg2.errors.DeadlockDetected) as e:
        conn.rollback()
        raise RetryableError(f"timeout on deposit for user {user_id}: {e}")

    try:
        transition_event_status(cur, event_id, "enqueued", "processing")

        cur.execute("SELECT id FROM users WHERE id = %s", (user_id,))
        if cur.fetchone() is None:
            conn.rollback()
            raise ValueError(f"user {user_id} not found")

        cur.execute(
            "UPDATE users SET balance = balance + %s WHERE id = %s",
            (amount_decimal, user_id),
        )
        if cur.rowcount == 0:
            conn.rollback()
            raise ValueError(f"user {user_id} disappeared between SELECT and UPDATE")
    except (psycopg2.errors.LockNotAvailable, psycopg2.errors.QueryCanceled,
            psycopg2.errors.DeadlockDetected) as e:
        conn.rollback()
        raise RetryableError(f"lock/timeout on deposit first-time path for user {user_id}: {e}")

    try:
        cur.execute(
            "INSERT INTO balance_events "
            "(user_id, amount, event_type, source_event_id, created_at) "
            "VALUES (%s, %s, %s, %s, NOW())",
            (user_id, amount_decimal, event_type, event_id),
        )
    except psycopg2.errors.UniqueViolation:
        conn.rollback()
        raise Exception(
            f"balance_events duplicate without idempotency key violation "
            f"event_id={event_id}, investigate immediately"
        )

    cur.execute(
        "UPDATE processed_events SET outcome = 'success' WHERE idempotency_key = %s",
        (idempotency_key,),
    )
    cur.execute(
        "UPDATE payment_events SET retry_count = 0 WHERE event_id = %s",
        (event_id,),
    )
    transition_event_status(cur, event_id, "processing", "confirmed")
    conn.commit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;WithdrawalOutcome = Literal["success", "insufficient_funds"]&lt;/p&gt;

&lt;p&gt;def process_withdrawal_sync(conn, event_id, event_type, user_id, amount) -&amp;gt; WithdrawalOutcome:&lt;br&gt;
    amount_decimal  = _validate_amount(amount)&lt;br&gt;
    idempotency_key = _idempotency_key(event_id, event_type)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
    cur.execute("SET LOCAL lock_timeout = '2s'")
    cur.execute("SET LOCAL statement_timeout = '5s'")

    try:
        cur.execute(
            "INSERT INTO processed_events (idempotency_key, outcome) "
            "VALUES (%s, 'pending')",
            (idempotency_key,),
        )
    except psycopg2.errors.UniqueViolation:
        conn.rollback()
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as lc:
            lc.execute("SET LOCAL lock_timeout = '2s'")
            lc.execute("SET LOCAL statement_timeout = '5s'")
            lc.execute(
                "SELECT outcome FROM processed_events WHERE idempotency_key = %s",
                (idempotency_key,),
            )
            stored = (lc.fetchone() or {"outcome": "pending"})["outcome"]

            if stored == "success":
                lc.execute(
                    "SELECT status FROM payment_events WHERE event_id = %s",
                    (event_id,)
                )
                r = lc.fetchone()
                current = r["status"] if r else None
                if current is None:
                    logger.error(
                        "withdrawal_replay: orphan event, processed_events "
                        "exists but payment_event not found",
                        event_id=event_id,
                    )
                    send_alert(
                        f"[CRITICAL] orphan withdrawal event: {event_id}",
                        alert_key=f"orphan_withdrawal:{event_id}",
                    )
                elif current == "enqueued":
                    transition_event_status(lc, event_id, "enqueued", "confirmed")
                elif current == "processing":
                    transition_event_status(lc, event_id, "processing", "confirmed")
                elif current == "confirmed":
                    pass
                else:
                    conn.rollback()
                    raise RetryableError(
                        f"withdrawal_replay FSM violation: "
                        f"payment_event.status={current!r} with outcome=success "
                        f"for event_id={event_id}"
                    )
                conn.commit()
                return "success"
            elif stored == "insufficient_funds":
                lc.execute(
                    "SELECT status FROM payment_events WHERE event_id = %s",
                    (event_id,)
                )
                r = lc.fetchone()
                current = r["status"] if r else None
                if current == "enqueued":
                    transition_event_status(lc, event_id, "enqueued", "failed")
                elif current == "processing":
                    transition_event_status(lc, event_id, "processing", "failed")
                elif current is None:
                    logger.error(
                        "withdrawal_replay insufficient_funds: orphan event",
                        event_id=event_id,
                    )
                    send_alert(
                        f"[CRITICAL] orphan withdrawal (insufficient_funds): {event_id}",
                        alert_key=f"orphan_withdrawal_insuf:{event_id}",
                    )
                conn.commit()
                return "insufficient_funds"
            else:
                conn.rollback()
                raise RetryableError(
                    f"withdrawal idempotency hit with outcome={stored!r} "
                    f"for event_id={event_id}"
                )

    except (psycopg2.errors.LockNotAvailable, psycopg2.errors.QueryCanceled,
            psycopg2.errors.DeadlockDetected) as e:
        conn.rollback()
        raise RetryableError(f"lock/timeout on withdrawal INSERT for user {user_id}: {e}")

    try:
        cur.execute(
            "SELECT balance FROM users WHERE id = %s FOR UPDATE NOWAIT",
            (user_id,),
        )
    except (psycopg2.errors.LockNotAvailable, psycopg2.errors.QueryCanceled,
            psycopg2.errors.DeadlockDetected) as e:
        conn.rollback()
        raise RetryableError(f"lock/timeout/deadlock on user lock for user {user_id}: {e}")

    row = cur.fetchone()
    if row is None:
        conn.rollback()
        raise ValueError(f"user {user_id} not found")

    try:
        if row["balance"] &amp;lt; amount_decimal:
            cur.execute(
                "UPDATE processed_events SET outcome = 'insufficient_funds' "
                "WHERE idempotency_key = %s",
                (idempotency_key,),
            )
            transition_event_status(cur, event_id, "enqueued", "failed")
            conn.commit()
            return "insufficient_funds"

        transition_event_status(cur, event_id, "enqueued", "processing")

        cur.execute(
            "UPDATE users SET balance = balance - %s WHERE id = %s",
            (amount_decimal, user_id),
        )

        try:
            cur.execute(
                "INSERT INTO balance_events "
                "(user_id, amount, event_type, source_event_id, created_at) "
                "VALUES (%s, %s, %s, %s, NOW())",
                (user_id, -amount_decimal, event_type, event_id),
            )
        except psycopg2.errors.UniqueViolation:
            conn.rollback()
            raise Exception(
                f"balance_events duplicate without idempotency key violation "
                f"event_id={event_id}, investigate immediately"
            )

        cur.execute(
            "UPDATE processed_events SET outcome = 'success' WHERE idempotency_key = %s",
            (idempotency_key,),
        )
        cur.execute(
            "UPDATE payment_events SET retry_count = 0 WHERE event_id = %s",
            (event_id,),
        )
        transition_event_status(cur, event_id, "processing", "confirmed")
        conn.commit()
        return "success"
    except (psycopg2.errors.LockNotAvailable, psycopg2.errors.QueryCanceled,
            psycopg2.errors.DeadlockDetected) as e:
        conn.rollback()
        raise RetryableError(f"lock/timeout on withdrawal path for user {user_id}: {e}")`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Webhook: Outbox Pattern Instead of Direct .delay()&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Calling .delay() directly from the webhook handler creates a window between the DB write and the queue insertion. If the process dies in that moment, the event hangs in pending forever.&lt;br&gt;
Solution: the webhook only writes to the DB. A separate poller every 5 seconds picks up pending events, atomically changes statuses and commits, and only then puts them into the Celery queue. Commit first, then enqueue. Otherwise the worker starts before the DB knows about enqueued.&lt;br&gt;
Alchemy and Infura only know tx-hash and recipient address. The mapping from to_address to user_id is done via a separate query to the deposit_addresses table. That layer is not in this article, but without it an attacker with the HMAC key can credit money to any user_id. Worth keeping in mind.&lt;br&gt;
verify_webhook_signature accepts raw_body as bytes before JSON parsing. The signature is computed over the original bytes. secrets.compare_digest protects against timing attacks.&lt;/p&gt;

&lt;p&gt;`import asyncpg&lt;br&gt;
from fastapi import FastAPI, Request, HTTPException&lt;br&gt;
from slowapi import Limiter, _rate_limit_exceeded_handler&lt;br&gt;
from slowapi.util import get_remote_address&lt;br&gt;
from slowapi.errors import RateLimitExceeded&lt;/p&gt;

&lt;p&gt;app = FastAPI()&lt;/p&gt;

&lt;p&gt;limiter = Limiter(key_func=get_remote_address, storage_uri=settings.REDIS_URL)&lt;br&gt;
app.state.limiter = limiter&lt;br&gt;
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)&lt;/p&gt;

&lt;p&gt;ALLOWED_EVENT_TYPES = frozenset({"deposit", "airdrop", "withdrawal", "withdrawal_fee"})&lt;/p&gt;

&lt;p&gt;def verify_webhook_signature(raw_body, signature_header, signing_key):&lt;br&gt;
    if not signing_key:&lt;br&gt;
        raise ImproperlyConfigured("WEBHOOK_SECRET is not set")&lt;br&gt;
    if len(signing_key) &amp;lt; 32:&lt;br&gt;
        raise ImproperlyConfigured(&lt;br&gt;
            f"WEBHOOK_SECRET too short: {len(signing_key)} chars, minimum 32"&lt;br&gt;
        )&lt;br&gt;
    if not signature_header:&lt;br&gt;
        return False&lt;br&gt;
    mac = hmac.new(&lt;br&gt;
        key=signing_key.encode("utf-8"),&lt;br&gt;
        msg=raw_body,&lt;br&gt;
        digestmod=hashlib.sha256,&lt;br&gt;
    )&lt;br&gt;
    return secrets.compare_digest(mac.hexdigest(), signature_header)&lt;/p&gt;

&lt;p&gt;@app.post("/webhook/payments")&lt;br&gt;
@limiter.limit("300/minute")&lt;br&gt;
@limiter.limit("30/second")&lt;br&gt;
async def payment_webhook(request: Request):&lt;br&gt;
    raw_body  = await request.body()&lt;br&gt;
    signature = request.headers.get("X-Alchemy-Signature", "")&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if not verify_webhook_signature(raw_body, signature, settings.WEBHOOK_SECRET):
    raise HTTPException(status_code=401, detail="invalid signature")

trace_id = (
    request.headers.get("X-Request-ID")
    or request.headers.get("X-Alchemy-Request-ID")
    or new_trace_id()
)
set_trace_id(trace_id)

try:
    payload    = json.loads(raw_body)
    event_id   = payload["event_id"]
    event_type = payload["event_type"]
    user_id    = payload["user_id"]
    if not isinstance(payload.get("amount"), str):
        raise HTTPException(status_code=400, detail="amount must be a JSON string, not a number")
    amount_str = payload["amount"]
except (json.JSONDecodeError, KeyError) as e:
    raise HTTPException(status_code=400, detail=f"invalid payload: {e}")

if event_type not in ALLOWED_EVENT_TYPES:
    raise HTTPException(status_code=400, detail=f"unknown event_type: {event_type!r}")

try:
    _validate_amount(amount_str)
except ValueError as e:
    raise HTTPException(status_code=400, detail=f"invalid amount: {e}")

db = request.app.state.db
try:
    async with db.transaction():
        await db.fetchrow(
            "INSERT INTO payment_events (event_id, user_id, amount, event_type, status) "
            "VALUES ($1, $2, $3, $4, 'pending') ON CONFLICT (event_id) DO NOTHING",
            event_id, user_id, amount_str, event_type,
        )
except asyncpg.exceptions.ForeignKeyViolationError:
    logger.warning(
        "orphan webhook event (user not found)",
        event_id=event_id, user_id=user_id,
    )
    raise HTTPException(status_code=400, detail="user not found")

return {"status": "accepted", "trace_id": trace_id}`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
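&lt;p&gt;The handler assumes request.app.state.db already exists; that wiring isn't shown here. A minimal sketch of the missing startup hook, assuming a single asyncpg connection is enough for the webhook path (a pool works too, with an explicit acquire around the transaction):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;@app.on_event("startup")&lt;br&gt;
async def init_db():&lt;br&gt;
    # hypothetical setup: one connection kept on app.state for the webhook&lt;br&gt;
    app.state.db = await asyncpg.connect(dsn=settings.DATABASE_URL)&lt;br&gt;
@app.on_event("shutdown")&lt;br&gt;
async def close_db():&lt;br&gt;
    await app.state.db.close()&lt;/code&gt;&lt;/p&gt;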

&lt;p&gt;FastAPI runs in an async event loop. Blocking psycopg2 there would kill throughput, so we use asyncpg in the webhook. Celery workers are separate processes without an event loop, asyncpg there would only add complexity.&lt;br&gt;
enqueue_pending_events uses the same SAVEPOINT pattern as recover_stale_enqueued_events. Without it, a single event with AlreadyProcessedError from a race with the recoverer would roll back the entire batch of 100 events. They'd stay in pending and get picked up on the next tick, but this shows up in the logs as a lost tick. SAVEPOINT sp_enq per event isolates the error.&lt;/p&gt;

&lt;p&gt;`@shared_task(name="enqueue_pending_events")&lt;br&gt;
def enqueue_pending_events() -&amp;gt; dict:&lt;br&gt;
    conn = get_validated_conn(db_pool)&lt;br&gt;
    events_to_enqueue = []&lt;br&gt;
    try:&lt;br&gt;
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:&lt;br&gt;
            cur.execute("""&lt;br&gt;
                SELECT event_id, event_type, user_id, amount&lt;br&gt;
                FROM payment_events&lt;br&gt;
                WHERE status = 'pending'&lt;br&gt;
                  AND updated_at &amp;lt; NOW() - INTERVAL '5 seconds'&lt;br&gt;
                ORDER BY created_at&lt;br&gt;
                LIMIT 100&lt;br&gt;
                FOR UPDATE SKIP LOCKED&lt;br&gt;
            """)&lt;br&gt;
            events = cur.fetchall()&lt;br&gt;
            events_ok = []&lt;br&gt;
            for event in events:&lt;br&gt;
                try:&lt;br&gt;
                    cur.execute("SAVEPOINT sp_enq")&lt;br&gt;
                    transition_event_status(cur, event['event_id'], "pending", "enqueued")&lt;br&gt;
                    cur.execute("RELEASE SAVEPOINT sp_enq")&lt;br&gt;
                    events_ok.append(event)&lt;br&gt;
                except (ValueError, RetryableError, AlreadyProcessedError) as e:&lt;br&gt;
                    cur.execute("ROLLBACK TO SAVEPOINT sp_enq")&lt;br&gt;
                    cur.execute("RELEASE SAVEPOINT sp_enq")&lt;br&gt;
                    logger.warning("enqueue: skipped event",&lt;br&gt;
                                   event_id=event['event_id'], error=str(e))&lt;br&gt;
            conn.commit()&lt;br&gt;
            events_to_enqueue = list(events_ok)&lt;br&gt;
    except Exception:&lt;br&gt;
        try:&lt;br&gt;
            conn.rollback()&lt;br&gt;
        except Exception:&lt;br&gt;
            pass&lt;br&gt;
        logger.exception("enqueue_pending_events: transition failed")&lt;br&gt;
        raise&lt;br&gt;
    finally:&lt;br&gt;
        try:&lt;br&gt;
            conn.putconn()&lt;br&gt;
        except Exception:&lt;br&gt;
            pass&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;enqueued = 0
for event in events_to_enqueue:
    try:
        process_payment_event.apply_async(
            args=[event['event_id'], event['event_type'],
                  event['user_id'], str(event['amount'])],
            kwargs={"trace_id": str(uuid.uuid4())},
        )
        enqueued += 1
    except Exception:
        logger.exception("apply_async failed", event_id=event['event_id'])

logger.info("enqueue_pending_events done", enqueued=enqueued)
return {"enqueued": enqueued}`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;recover_stale_enqueued_events runs every 2 minutes and finds events stuck in enqueued. After MAX_RECOVERY_ATTEMPTS attempts it transitions to failed + DLQ. SAVEPOINT per event, an error in one doesn't roll back the whole batch.&lt;/p&gt;

&lt;p&gt;`MAX_RECOVERY_ATTEMPTS = 10&lt;/p&gt;

&lt;p&gt;@shared_task(name="recover_stale_enqueued_events")&lt;br&gt;
def recover_stale_enqueued_events() -&amp;gt; dict:&lt;br&gt;
    conn      = get_validated_conn(db_pool)&lt;br&gt;
    recovered = 0&lt;br&gt;
    dlqed     = 0&lt;br&gt;
    try:&lt;br&gt;
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:&lt;br&gt;
            cur.execute("""&lt;br&gt;
                SELECT event_id, event_type, user_id, amount, retry_count&lt;br&gt;
                FROM payment_events&lt;br&gt;
                WHERE status = 'enqueued'&lt;br&gt;
                  AND updated_at &amp;lt; NOW() - INTERVAL '3 minutes'&lt;br&gt;
                ORDER BY created_at&lt;br&gt;
                LIMIT 50&lt;br&gt;
                FOR UPDATE SKIP LOCKED&lt;br&gt;
            """)&lt;br&gt;
            stale = cur.fetchall()&lt;br&gt;
            for event in stale:&lt;br&gt;
                try:&lt;br&gt;
                    cur.execute("SAVEPOINT sp_recover")&lt;br&gt;
                    if event['retry_count'] &amp;gt;= MAX_RECOVERY_ATTEMPTS:&lt;br&gt;
                        transition_event_status(cur, event['event_id'], "enqueued", "failed")&lt;br&gt;
                        cur.execute(&lt;br&gt;
                            "INSERT INTO dead_letter_queue "&lt;br&gt;
                            "(event_id, event_type, user_id, amount, error) "&lt;br&gt;
                            "VALUES (%s, %s, %s, %s, %s)",&lt;br&gt;
                            (event['event_id'], event['event_type'], event['user_id'],&lt;br&gt;
                             str(event['amount']),&lt;br&gt;
                             f"exhausted recovery attempts ({MAX_RECOVERY_ATTEMPTS})")&lt;br&gt;
                        )&lt;br&gt;
                        dlqed += 1&lt;br&gt;
                    else:&lt;br&gt;
                        transition_event_status(cur, event['event_id'], "enqueued", "pending")&lt;br&gt;
                        cur.execute(&lt;br&gt;
                            "UPDATE payment_events SET retry_count = retry_count + 1 "&lt;br&gt;
                            "WHERE event_id = %s",&lt;br&gt;
                            (event['event_id'],)&lt;br&gt;
                        )&lt;br&gt;
                        recovered += 1&lt;br&gt;
                    cur.execute("RELEASE SAVEPOINT sp_recover")&lt;br&gt;
                except Exception as sp_exc:&lt;br&gt;
                    try:&lt;br&gt;
                        cur.execute("ROLLBACK TO SAVEPOINT sp_recover")&lt;br&gt;
                        cur.execute("RELEASE SAVEPOINT sp_recover")&lt;br&gt;
                    except Exception:&lt;br&gt;
                        pass&lt;br&gt;
                    logger.error("recover: event skipped due to error",&lt;br&gt;
                                 event_id=event['event_id'], error=str(sp_exc))&lt;br&gt;
            conn.commit()&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    if recovered or dlqed:
        logger.warning("recover_stale_enqueued_events",
                      recovered=recovered, dlqed=dlqed)
    if dlqed:
        send_alert(
            f"[WARNING] {dlqed} events exhausted recovery attempts, check DLQ",
            alert_key="recovery_exhausted",
        )
    return {"recovered": recovered, "dlqed": dlqed}
except Exception:
    try:
        conn.rollback()
    except Exception:
        pass
    logger.exception("recover_stale_enqueued_events failed")
    raise
finally:
    try:
        conn.putconn()
    except Exception:
        pass`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Celery: acks_late + reject_on_worker_lost + Connection Pool&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;acks_late=True: acknowledge to the broker after processing completes, not at receipt. By default Celery acknowledges immediately: the worker crashes mid-processing, the task is lost.&lt;br&gt;
reject_on_worker_lost=True: on SIGKILL/OOM the task is returned to the queue.&lt;br&gt;
If you create the pool before fork, all workers inherit the same file descriptors. Two processes send requests over the same socket and responses get scrambled. That's why the pool is created in worker_process_init.&lt;/p&gt;
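&lt;p&gt;The per-worker pool init is below; at the app level, the same requirements map to a handful of settings. A sketch, assuming a celery_app module; the article also sets acks_late and reject_on_worker_lost per task further down, the point is that the defaults must not silently ack on receipt. The beat intervals are the ones the pollers above rely on.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from celery import Celery&lt;br&gt;
celery_app = Celery("payments", broker=settings.REDIS_URL)&lt;br&gt;
celery_app.conf.update(&lt;br&gt;
    task_acks_late=True,              # ack after the task finishes, not on receipt&lt;br&gt;
    task_reject_on_worker_lost=True,  # SIGKILL/OOM puts the task back on the queue&lt;br&gt;
    worker_prefetch_multiplier=1,     # don't hoard tasks in a worker that may die&lt;br&gt;
    task_serializer="json",&lt;br&gt;
    accept_content=["json"],          # amounts travel as strings, never floats&lt;br&gt;
)&lt;br&gt;
celery_app.conf.beat_schedule = {&lt;br&gt;
    "enqueue-pending":  {"task": "enqueue_pending_events", "schedule": 5.0},&lt;br&gt;
    "recover-enqueued": {"task": "recover_stale_enqueued_events", "schedule": 120.0},&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;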

&lt;p&gt;`import os&lt;br&gt;
from celery.signals import worker_process_init&lt;/p&gt;

&lt;p&gt;db_pool = None&lt;br&gt;
_local_breaker = None&lt;/p&gt;

&lt;p&gt;@worker_process_init.connect&lt;br&gt;
def init_worker(**kwargs):&lt;br&gt;
    global db_pool, _local_breaker&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_local_breaker = _InProcessBreaker()

worker_pool = os.environ.get("CELERY_WORKER_POOL", "prefork")
is_threaded = worker_pool in ("gevent", "eventlet")
pool_class  = (
    psycopg2.pool.ThreadedConnectionPool if is_threaded
    else psycopg2.pool.SimpleConnectionPool
)
db_pool = pool_class(minconn=2, maxconn=10, dsn=settings.DATABASE_URL)
logger.info("worker init done", pool=pool_class.__name__, worker_pool=worker_pool)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;def _get_local_breaker() -&amp;gt; "_InProcessBreaker":&lt;br&gt;
    global _local_breaker&lt;br&gt;
    if _local_breaker is None:&lt;br&gt;
        _local_breaker = _InProcessBreaker()&lt;br&gt;
    return _local_breaker&lt;/p&gt;

&lt;p&gt;BACKOFF_BASE_SEC = 1&lt;br&gt;
BACKOFF_CAP_SEC  = 60&lt;/p&gt;

&lt;p&gt;def jittered_backoff(attempt: int) -&amp;gt; float:&lt;br&gt;
    cap = min(BACKOFF_CAP_SEC, BACKOFF_BASE_SEC * (2 ** attempt))&lt;br&gt;
    return random.uniform(0, cap)`&lt;/p&gt;
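&lt;p&gt;jittered_backoff feeds the retry countdown of a bound task, presumably along these lines (a sketch; do_work is a hypothetical stand-in for the real event processing):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;@shared_task(bind=True, max_retries=5, acks_late=True, reject_on_worker_lost=True)&lt;br&gt;
def example_retrying_task(self, event_id):&lt;br&gt;
    try:&lt;br&gt;
        do_work(event_id)  # hypothetical unit of work that may raise RetryableError&lt;br&gt;
    except RetryableError as exc:&lt;br&gt;
        delay = jittered_backoff(self.request.retries)&lt;br&gt;
        raise self.retry(exc=exc, countdown=delay)&lt;/code&gt;&lt;/p&gt;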

&lt;h2&gt;
  
  
  &lt;strong&gt;Two-Level DLQ&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the first version I used LPUSH and EXPIRE as two separate commands. A crash between them is possible; the key then remains without a TTL and lives forever. Fixed with a pipeline.&lt;br&gt;
On the DLQ schema: the table uses BIGSERIAL PRIMARY KEY, not event_id PRIMARY KEY. This allows storing multiple attempts for a single event_id. Consequence: ON CONFLICT (event_id) DO NOTHING is not valid, because event_id has no UNIQUE constraint. Each INSERT creates a new record, preserving the full attempt history.&lt;/p&gt;

&lt;p&gt;`DLQ_REDIS_KEY = "dlq:payment_events"&lt;br&gt;
DLQ_REDIS_TTL = 7 * 24 * 3600  # 7 days&lt;/p&gt;

&lt;p&gt;def save_to_dlq_sync(conn, event_id, event_type, user_id, amount, error):&lt;br&gt;
    payload = {&lt;br&gt;
        "event_id": event_id, "event_type": event_type,&lt;br&gt;
        "user_id": user_id, "amount": str(amount),&lt;br&gt;
        "error": error, "trace_id": get_trace_id(),&lt;br&gt;
    }&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db_ok = False
try:
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO dead_letter_queue "
            "(event_id, event_type, user_id, amount, error, created_at) "
            "VALUES (%s, %s, %s, %s, %s, NOW())",
            (event_id, event_type, user_id, str(amount), error)
        )
        conn.commit()
    db_ok = True
except Exception as db_exc:
    logger.error("DLQ postgres write failed", event_id=event_id, error=str(db_exc))
    try:
        conn.rollback()
    except Exception:
        pass

if db_ok:
    return

try:
    pipe = _redis_client.pipeline()
    pipe.lpush(DLQ_REDIS_KEY, json.dumps(payload))
    pipe.expire(DLQ_REDIS_KEY, DLQ_REDIS_TTL)
    pipe.execute()
    logger.warning("DLQ saved to Redis fallback", event_id=event_id)
    return
except redis_lib.RedisError as e:
    logger.error("DLQ redis write failed", event_id=event_id, error=str(e))

logger.critical(
    "DLQ_UNRECOVERABLE",
    event_id=event_id, event_type=event_type,
    user_id=user_id, amount=str(amount),
    error=error, trace_id=get_trace_id(),
    dlq_payload=payload,
)`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;drain_redis_dlq runs on a schedule after a DB incident. failed resets to 0 on each successful INSERT, so the counter tracks consecutive failures, not total ones. Alternating successes and failures never trip the abort threshold, but drained still ends up &amp;gt; 0 in that case.&lt;/p&gt;

&lt;p&gt;`DRAIN_BATCH_SIZE = 500&lt;/p&gt;

&lt;p&gt;@shared_task(name="drain_redis_dlq")&lt;br&gt;
def drain_redis_dlq() -&amp;gt; dict:&lt;br&gt;
    drained = 0&lt;br&gt;
    failed  = 0&lt;br&gt;
    conn    = get_validated_conn(db_pool)&lt;br&gt;
    try:&lt;br&gt;
        for _ in range(DRAIN_BATCH_SIZE):&lt;br&gt;
            raw = _redis_client.rpop(DLQ_REDIS_KEY)&lt;br&gt;
            if raw is None:&lt;br&gt;
                break&lt;br&gt;
            try:&lt;br&gt;
                payload = json.loads(raw)&lt;br&gt;
            except json.JSONDecodeError:&lt;br&gt;
                logger.critical(&lt;br&gt;
                    "drain_redis_dlq: malformed JSON in DLQ, item discarded",&lt;br&gt;
                    raw=raw[:200],&lt;br&gt;
                )&lt;br&gt;
                continue&lt;br&gt;
            try:&lt;br&gt;
                with conn.cursor() as cur:&lt;br&gt;
                    cur.execute(&lt;br&gt;
                        "INSERT INTO dead_letter_queue "&lt;br&gt;
                        "(event_id, event_type, user_id, amount, error, created_at) "&lt;br&gt;
                        "VALUES (%s, %s, %s, %s, %s, NOW())",&lt;br&gt;
                        (payload['event_id'], payload['event_type'],&lt;br&gt;
                         payload['user_id'], payload['amount'], payload['error'])&lt;br&gt;
                    )&lt;br&gt;
                    conn.commit()&lt;br&gt;
                drained += 1&lt;br&gt;
                failed = 0&lt;br&gt;
            except Exception as e:&lt;br&gt;
                try:&lt;br&gt;
                    conn.rollback()&lt;br&gt;
                except Exception:&lt;br&gt;
                    pass&lt;br&gt;
                pipe = _redis_client.pipeline()&lt;br&gt;
                pipe.lpush(DLQ_REDIS_KEY, raw)&lt;br&gt;
                pipe.expire(DLQ_REDIS_KEY, DLQ_REDIS_TTL)&lt;br&gt;
                pipe.execute()&lt;br&gt;
                failed += 1&lt;br&gt;
                logger.error(&lt;br&gt;
                    "drain failed, requeued to head",&lt;br&gt;
                    event_id=payload.get('event_id'), error=str(e),&lt;br&gt;
                )&lt;br&gt;
                if failed &amp;gt;= 10:&lt;br&gt;
                    logger.error("drain aborted after 10 consecutive failures")&lt;br&gt;
                    break&lt;br&gt;
    finally:&lt;br&gt;
        conn.putconn()&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logger.info("drain_redis_dlq done", drained=drained, failed=failed)
return {"drained": drained, "failed": failed}`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Celery Task: Full Version&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;`@shared_task(name="process_payment_event", bind=True, max_retries=5, acks_late=True, reject_on_worker_lost=True)&lt;br&gt;
def process_payment_event(self, event_id, event_type, user_id, amount, trace_id=""):&lt;br&gt;
    set_trace_id(trace_id or new_trace_id())&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conn      = get_validated_conn(db_pool)
committed = False
conn_ok   = True

try:
    if event_type in ("deposit", "airdrop"):
        process_deposit_sync(conn, event_id, event_type, user_id, amount)
        committed = True
    elif event_type in ("withdrawal", "withdrawal_fee"):
        outcome = process_withdrawal_sync(conn, event_id, event_type, user_id, amount)
        committed = True
        if outcome == "insufficient_funds":
            notify_user_insufficient_funds(user_id)
    else:
        logger.error("unknown event_type", event_type=event_type, event_id=event_id)
        try:
            conn.rollback()
        except Exception:
            pass
        try:
            _mark_event_failed(conn, event_id)
        except Exception as mark_exc:
            logger.error("_mark_event_failed raised",
                        event_id=event_id, error=str(mark_exc))
        save_to_dlq_sync(
            conn, event_id, event_type, user_id, amount,
            f"unknown event_type: {event_type!r}"
        )
        raise Ignore()
except AlreadyProcessedError as exc:
    logger.info("event already processed", event_id=event_id, reason=str(exc))
    raise Ignore()
except RetryableError as exc:
    delay = jittered_backoff(self.request.retries)
    logger.warning(
        "retrying", event_id=event_id,
        attempt=self.request.retries, delay=delay, reason=str(exc),
    )
    try:
        raise self.retry(exc=exc, countdown=delay)
    except MaxRetriesExceededError:
        conn.rollback()
        _mark_event_failed(conn, event_id)
        save_to_dlq_sync(conn, event_id, event_type, user_id, amount,
                         f"retries exhausted: {exc}")
        raise Ignore()
except Ignore:
    raise
except Exception as exc:
    logger.exception("unhandled error", event_id=event_id)
    try:
        conn.rollback()
    except Exception:
        pass
    try:
        _mark_event_failed(conn, event_id)
    except Exception as mark_exc:
        logger.error("_mark_event_failed raised in catch-all",
                    event_id=event_id, error=str(mark_exc))
    try:
        save_to_dlq_sync(conn, event_id, event_type, user_id, amount, str(exc))
    except Exception as dlq_exc:
        logger.critical(
            "DLQ write failed, manual recovery required",
            event_id=event_id, trace_id=get_trace_id(),
            original_error=str(exc), dlq_error=str(dlq_exc),
        )
    self.update_state(state="FAILURE", meta={"error": str(exc)})
    raise Ignore()
finally:
    if not committed:
        try:
            conn.rollback()
        except Exception as rb_exc:
            logger.error("rollback failed", event_id=event_id, error=str(rb_exc))
            conn_ok = False
    conn.putconn(close=not conn_ok)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;def notify_user_insufficient_funds(user_id: int) -&amp;gt; None:&lt;br&gt;
    pass`&lt;/p&gt;

&lt;p&gt;notify_user_insufficient_funds: a stub. In prod you need an outbox record written inside process_withdrawal_sync before the final commit. Calling it from here (after the commit) is out-of-band: a separate transaction, separate delivery guarantees.&lt;br&gt;
There's a trap hidden here that's specific to at-least-once. The provider sends one event three times. process_payment_event runs three times. The balance won't change (idempotency via the unique constraint). But notify gets called three times, so the user receives three "insufficient funds" notifications instead of one. DB idempotency doesn't automatically cover side effects outside the transaction.&lt;br&gt;
The second problem with this placement: if notify throws, the exception lands in the except Exception catch-all, which writes the event to the DLQ even though the money was deducted correctly and the business transaction committed. That's noise in the DLQ that will hide real incidents.&lt;/p&gt;
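
&lt;p&gt;A minimal sketch of the outbox variant, to make the fix concrete. The notification_outbox table, its dedup_key column, and the dispatch_notification_outbox task are hypothetical names rather than code from the project; the point is only that the intent to notify is recorded in the same transaction that rejects the withdrawal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Called from inside process_withdrawal_sync, BEFORE its final commit.
# The unique dedup_key absorbs at-least-once redelivery of the same event.
def enqueue_insufficient_funds_notice(cur, event_id: str, user_id: int) -&amp;gt; None:
    cur.execute(
        "INSERT INTO notification_outbox (dedup_key, user_id, kind, created_at) "
        "VALUES (%s, %s, 'insufficient_funds', NOW()) "
        "ON CONFLICT (dedup_key) DO NOTHING",
        (f"insufficient_funds:{event_id}", user_id),
    )

@shared_task(name="dispatch_notification_outbox")
def dispatch_notification_outbox() -&amp;gt; None:
    # Out-of-band delivery: its failures stay here and never pollute the payment DLQ.
    conn = get_validated_conn(db_pool)
    try:
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
            cur.execute(
                "SELECT id, user_id FROM notification_outbox "
                "WHERE sent_at IS NULL ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED"
            )
            for row in cur.fetchall():
                notify_user_insufficient_funds(row["user_id"])
                cur.execute(
                    "UPDATE notification_outbox SET sent_at = NOW() WHERE id = %s",
                    (row["id"],),
                )
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.putconn()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Delivery of the notification itself is still at-least-once (a crash between notify and commit re-sends that batch), but the duplicates are now bounded by the dispatcher, not by how many times the provider replays the webhook.&lt;/p&gt;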

&lt;h2&gt;
  
  
  &lt;strong&gt;Circuit Breaker for Web3 RPC&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Three states: closed, open, half-open, then back to closed after a successful probe. Redis holds the shared state between instances; the in-process breaker is the fallback when Redis is unavailable.&lt;br&gt;
_InProcessBreaker: a simple per-process counter with a lock. It opens after RPC_ERROR_THRESHOLD errors and closes automatically after RPC_COOLDOWN_SEC seconds. It exists specifically as a fallback: if Redis itself is down, the circuit breaker shouldn't stop working.&lt;br&gt;
The goal is to let exactly one probe request through while the breaker is open. SET nx=True guarantees only the first worker gets that permission; the rest see the probe key is already taken.&lt;/p&gt;

&lt;p&gt;`import web3&lt;br&gt;
from prometheus_client import Counter&lt;/p&gt;

&lt;p&gt;rpc_errors_total = Counter("web3_rpc_errors_total", "Web3 RPC failures", ["method"])&lt;/p&gt;

&lt;p&gt;_w3 = web3.Web3(web3.HTTPProvider(settings.ETH_RPC_URL))&lt;/p&gt;

&lt;p&gt;FINALIZED_CACHE_TTL   = 86_400&lt;br&gt;
PENDING_CACHE_TTL     = 30&lt;br&gt;
CACHE_PREFIX          = "eth:fin:"&lt;br&gt;
RPC_ERROR_THRESHOLD   = 5&lt;br&gt;
RPC_COOLDOWN_SEC      = 60&lt;br&gt;
RPC_ERROR_WINDOW_SEC  = 30&lt;br&gt;
HALF_OPEN_PROBE_KEY   = "circuit:web3:half_open_probe"&lt;br&gt;
HALF_OPEN_PROBE_TTL   = 10&lt;/p&gt;

&lt;p&gt;@dataclass&lt;br&gt;
class _InProcessBreaker:&lt;br&gt;
    _lock:       threading.Lock = field(default_factory=threading.Lock)&lt;br&gt;
    _errors:     int            = 0&lt;br&gt;
    _open_until: "datetime | None" = None&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def is_open(self) -&amp;gt; bool:
    with self._lock:
        if self._open_until is None:
            return False
        if datetime.now(timezone.utc) &amp;gt; self._open_until:
            self._open_until = None
            self._errors = 0
            return False
        return True

def record_error(self) -&amp;gt; None:
    with self._lock:
        self._errors += 1
        if self._errors &amp;gt;= RPC_ERROR_THRESHOLD:
            self._open_until = datetime.now(timezone.utc) + timedelta(seconds=RPC_COOLDOWN_SEC)

def record_success(self) -&amp;gt; None:
    with self._lock:
        self._errors     = 0
        self._open_until = None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;def _is_circuit_open() -&amp;gt; bool:&lt;br&gt;
    try:&lt;br&gt;
        if _redis_client.get("circuit:web3:open"):&lt;br&gt;
            is_probe = _redis_client.set(&lt;br&gt;
                HALF_OPEN_PROBE_KEY, "1", nx=True, ex=HALF_OPEN_PROBE_TTL&lt;br&gt;
            )&lt;br&gt;
            if is_probe:&lt;br&gt;
                return False&lt;br&gt;
            return True&lt;br&gt;
    except redis_lib.RedisError:&lt;br&gt;
        pass&lt;br&gt;
    return _get_local_breaker().is_open()&lt;/p&gt;

&lt;p&gt;def _record_rpc_error(method: str) -&amp;gt; None:&lt;br&gt;
    rpc_errors_total.labels(method=method).inc()&lt;br&gt;
    _get_local_breaker().record_error()&lt;br&gt;
    try:&lt;br&gt;
        pipe  = _redis_client.pipeline()&lt;br&gt;
        pipe.incr("circuit:web3:errors")&lt;br&gt;
        pipe.expire("circuit:web3:errors", RPC_ERROR_WINDOW_SEC)&lt;br&gt;
        count, _ = pipe.execute()&lt;br&gt;
        if count &amp;gt;= RPC_ERROR_THRESHOLD:&lt;br&gt;
            open_pipe = _redis_client.pipeline()&lt;br&gt;
            open_pipe.setex("circuit:web3:open", RPC_COOLDOWN_SEC, "1")&lt;br&gt;
            open_pipe.set(HALF_OPEN_PROBE_KEY, "1", ex=HALF_OPEN_PROBE_TTL)&lt;br&gt;
            open_pipe.execute()&lt;br&gt;
            logger.critical("web3 circuit breaker opened", count=count)&lt;br&gt;
            send_alert(&lt;br&gt;
                f"[CRITICAL] Web3 RPC circuit breaker OPEN, "&lt;br&gt;
                f"{count} failures in {RPC_ERROR_WINDOW_SEC}s.",&lt;br&gt;
                alert_key="web3_breaker_open",&lt;br&gt;
            )&lt;br&gt;
        else:&lt;br&gt;
            if not _redis_client.exists("circuit:web3:open"):&lt;br&gt;
                _redis_client.delete(HALF_OPEN_PROBE_KEY)&lt;br&gt;
    except redis_lib.RedisError as e:&lt;br&gt;
        logger.warning("circuit breaker state write failed", error=str(e))&lt;/p&gt;

&lt;p&gt;def _record_rpc_success() -&amp;gt; None:&lt;br&gt;
    _get_local_breaker().record_success()&lt;br&gt;
    try:&lt;br&gt;
        _redis_client.delete("circuit:web3:errors")&lt;br&gt;
        _redis_client.delete(HALF_OPEN_PROBE_KEY)&lt;br&gt;
        if _redis_client.delete("circuit:web3:open"):&lt;br&gt;
            logger.info("web3 circuit breaker closed")&lt;br&gt;
            send_alert(&lt;br&gt;
                "[INFO] Web3 RPC circuit breaker CLOSED, recovered",&lt;br&gt;
                alert_key="web3_breaker_closed",&lt;br&gt;
            )&lt;br&gt;
    except redis_lib.RedisError:&lt;br&gt;
        pass&lt;/p&gt;

&lt;p&gt;def is_transaction_finalized(tx_hash: str) -&amp;gt; bool:&lt;br&gt;
    if _is_circuit_open():&lt;br&gt;
        logger.warning("web3 circuit open, skipping", tx_hash=tx_hash)&lt;br&gt;
        return False&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cache_key = f"{CACHE_PREFIX}{tx_hash}"
try:
    cached = _redis_client.get(cache_key)
    if cached == "1":
        return True
    if cached == "0":
        return False
except redis_lib.RedisError as e:
    logger.warning("redis cache unavailable", tx_hash=tx_hash, error=str(e))

method = "get_transaction_receipt"
try:
    receipt = _w3.eth.get_transaction_receipt(tx_hash)
    if receipt is None:
        return False
    method = "get_block_finalized"
    finalized_block = _w3.eth.get_block("finalized")["number"]
    result = receipt["blockNumber"] &amp;lt;= finalized_block
    _record_rpc_success()
except Exception as e:
    logger.error("eth rpc error", tx_hash=tx_hash, error=str(e))
    _record_rpc_error(method)
    return False

try:
    ttl = FINALIZED_CACHE_TTL if result else PENDING_CACHE_TTL
    _redis_client.setex(cache_key, ttl, "1" if result else "0")
except redis_lib.RedisError as e:
    logger.warning("redis cache write failed", tx_hash=tx_hash, error=str(e))

return result`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Monitoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;hot_path_balance_check: not a full financial reconciliation, just monitoring of the hot path over the last 10 minutes. users.balance and SUM(balance_events) are read in a single JOIN query, so the data snapshot is taken atomically. REPEATABLE_READ here is defensive engineering: it guarantees a consistent snapshot at the transaction level and protects against phantom reads if the transaction ever grows to multiple statements. A full historical reconciliation needs a nightly job; that's in the backlog.&lt;br&gt;
set_isolation_level can throw if the connection is broken. Without a try/except around it, the connection could go back to the pool still in REPEATABLE_READ, and the next consumer of that connection won't expect that behavior.&lt;/p&gt;

&lt;p&gt;`from prometheus_client import Counter&lt;br&gt;
hot_path_runs = Counter("hot_path_balance_check_runs_total", "Runs of hot path balance check")&lt;/p&gt;

&lt;p&gt;@shared_task(name="hot_path_balance_check")&lt;br&gt;
def hot_path_balance_check() -&amp;gt; None:&lt;br&gt;
    conn    = get_validated_conn(db_pool)&lt;br&gt;
    conn_ok = True&lt;br&gt;
    try:&lt;br&gt;
        conn.set_isolation_level(&lt;br&gt;
            psycopg2.extensions.ISOLATION_LEVEL_REPEATABLE_READ&lt;br&gt;
        )&lt;br&gt;
        try:&lt;br&gt;
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:&lt;br&gt;
                cur.execute("SET LOCAL statement_timeout = '10s'")&lt;br&gt;
                cur.execute("""&lt;br&gt;
                    SELECT&lt;br&gt;
                        u.id,&lt;br&gt;
                        u.balance                                        AS actual_balance,&lt;br&gt;
                        u.initial_balance + COALESCE(SUM(be.amount), 0) AS calculated_balance&lt;br&gt;
                    FROM users u&lt;br&gt;
                    INNER JOIN (&lt;br&gt;
                        SELECT DISTINCT user_id FROM balance_events&lt;br&gt;
                        WHERE created_at &amp;gt; NOW() - INTERVAL '10 minutes'&lt;br&gt;
                    ) recent ON recent.user_id = u.id&lt;br&gt;
                    LEFT JOIN balance_events be ON be.user_id = u.id&lt;br&gt;
                    GROUP BY u.id, u.balance, u.initial_balance&lt;br&gt;
                    HAVING u.balance != u.initial_balance + COALESCE(SUM(be.amount), 0)&lt;br&gt;
                """)&lt;br&gt;
                mismatches = cur.fetchall()&lt;br&gt;
            conn.commit()&lt;br&gt;
        except Exception:&lt;br&gt;
            conn.rollback()&lt;br&gt;
            raise&lt;br&gt;
        finally:&lt;br&gt;
            try:&lt;br&gt;
                conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_DEFAULT)&lt;br&gt;
            except Exception:&lt;br&gt;
                conn_ok = False&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    hot_path_runs.inc()

    if mismatches:
        send_alert(
            f"[CRITICAL] balance mismatch: {[dict(m) for m in mismatches]}",
            alert_key="balance_mismatch",
        )

except Exception:
    logger.exception("hot_path_balance_check failed")
    try:
        conn.rollback()
    except Exception:
        pass
    send_alert(
        "[WARNING] hot_path_balance_check failed, check worker logs",
        alert_key="balance_check_failed",
    )
    raise
finally:
    conn.putconn(close=not conn_ok)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;@shared_task(name="alert_zombie_events")&lt;br&gt;
def alert_zombie_events() -&amp;gt; None:&lt;br&gt;
    conn = get_validated_conn(db_pool)&lt;br&gt;
    try:&lt;br&gt;
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:&lt;br&gt;
            cur.execute("""&lt;br&gt;
                SELECT event_id, user_id, status, updated_at FROM payment_events&lt;br&gt;
                WHERE (status = 'processing' AND updated_at &amp;lt; NOW() - INTERVAL '5 minutes')&lt;br&gt;
                   OR (status IN ('pending', 'enqueued') AND updated_at &amp;lt; NOW() - INTERVAL '15 minutes')&lt;br&gt;
            """)&lt;br&gt;
            zombies = cur.fetchall()&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        cur.execute("""
            SELECT COUNT(*) AS dlq_size FROM dead_letter_queue
            WHERE created_at &amp;gt; NOW() - INTERVAL '1 hour'
        """)
        recent_dlq = cur.fetchone()['dlq_size']

        cur.execute("""
            SELECT COUNT(*) AS stuck FROM processed_events
            WHERE outcome = 'pending' AND created_at &amp;lt; NOW() - INTERVAL '10 minutes'
        """)
        stuck_pending = cur.fetchone()['stuck']

    if zombies:
        send_alert(
            f"[WARNING] zombie events: {[z['event_id'] for z in zombies[:20]]}",
            alert_key="zombie_events",
        )
    if recent_dlq &amp;gt; 10:
        send_alert(
            f"[WARNING] {recent_dlq} events in DLQ last hour, investigate",
            alert_key="dlq_flood",
        )
    if stuck_pending:
        send_alert(
            f"[CRITICAL] {stuck_pending} stuck 'pending' rows in processed_events, "
            f"architectural invariant broken, investigate urgently",
            alert_key="processed_events_stuck_pending",
        )
finally:
    try:
        conn.rollback()
    except Exception:
        pass
    conn.putconn()`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Prometheus alerts:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- alert: HotPathBalanceCheckNotRunning
  expr: increase(hot_path_balance_check_runs_total[6m]) == 0
  for: 0m

- alert: Web3RPCCircuitOpen
  expr: increase(web3_rpc_errors_total[5m]) &amp;gt; 5
  for: 0m

- alert: CeleryHighRetryRate
  expr: rate(celery_tasks_total{state="retry"}[5m])
        / rate(celery_tasks_total{state="success"}[5m]) &amp;gt; 0.1

- alert: CeleryQueueDepth
  expr: celery_queue_depth &amp;gt; 500
  for: 5m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Beat schedule:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;`from celery.schedules import crontab&lt;/p&gt;

&lt;p&gt;beat_schedule = {&lt;br&gt;
    "enqueue-pending-events":        {"task": "enqueue_pending_events",        "schedule": 5.0},&lt;br&gt;
    "recover-stale-enqueued-events": {"task": "recover_stale_enqueued_events", "schedule": 120.0},&lt;br&gt;
    "cleanup-processed-events":      {"task": "cleanup_processed_events",      "schedule": crontab(hour=3, minute=0)},&lt;br&gt;
    "hot-path-balance-check":        {"task": "hot_path_balance_check",        "schedule": 60.0},&lt;br&gt;
    "alert-zombie-events":           {"task": "alert_zombie_events",           "schedule": 60.0},&lt;br&gt;
    "drain-redis-dlq":               {"task": "drain_redis_dlq",               "schedule": crontab(hour=4, minute=0)},&lt;br&gt;
}`&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Cleanup&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;processed_events cleanup only touches terminal outcomes; outcome='pending' is never deleted. This protects against silently removing events that are mid-processing. The connection is acquired and returned to the pool per batch, not held for the entire task.&lt;br&gt;
A 14-day TTL is safe thanks to UNIQUE (source_event_id, event_type) on balance_events: even if the provider replays an event after 14+ days, when processed_events has already been cleaned up, the repeat INSERT into balance_events is rejected by the unique constraint. Even when the application logic fails, the database won't let a duplicate through; it's the last line of defense.&lt;/p&gt;
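
&lt;p&gt;For completeness, the constraint that backs this claim, sketched as a one-off migration. The column names follow the ones used throughout the article; the constraint name and the exact DDL are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical migration: without this constraint, cleaning up processed_events
# after 14 days would reopen the duplicate-credit window.
BALANCE_EVENTS_UNIQUE_DDL = (
    "ALTER TABLE balance_events "
    "ADD CONSTRAINT uq_balance_events_source_event "
    "UNIQUE (source_event_id, event_type)"
)

def ensure_balance_events_unique(conn) -&amp;gt; None:
    with conn.cursor() as cur:
        cur.execute(BALANCE_EVENTS_UNIQUE_DDL)
    conn.commit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;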

&lt;p&gt;`CLEANUP_BATCH_SIZE    = 5_000&lt;br&gt;
CLEANUP_BATCH_PAUSE   = 0.1&lt;br&gt;
CLEANUP_MAX_BATCHES   = 200&lt;br&gt;
CLEANUP_TTL_DAYS      = 14&lt;br&gt;
CLEANUP_SAFE_STATUSES = ("success", "insufficient_funds")&lt;/p&gt;

&lt;p&gt;@shared_task(name="cleanup_processed_events")&lt;br&gt;
def cleanup_processed_events() -&amp;gt; dict:&lt;br&gt;
    import time&lt;br&gt;
    total_deleted = 0&lt;br&gt;
    batches = 0&lt;br&gt;
    for _ in range(CLEANUP_MAX_BATCHES):&lt;br&gt;
        conn = get_validated_conn(db_pool)&lt;br&gt;
        try:&lt;br&gt;
            with conn.cursor() as cur:&lt;br&gt;
                cur.execute("""&lt;br&gt;
                    DELETE FROM processed_events&lt;br&gt;
                    WHERE idempotency_key IN (&lt;br&gt;
                        SELECT idempotency_key FROM processed_events&lt;br&gt;
                        WHERE outcome = ANY(%s)&lt;br&gt;
                          AND created_at &amp;lt; NOW() - (%s * INTERVAL '1 day')&lt;br&gt;
                        LIMIT %s&lt;br&gt;
                        FOR UPDATE SKIP LOCKED&lt;br&gt;
                    )&lt;br&gt;
                """, (list(CLEANUP_SAFE_STATUSES), CLEANUP_TTL_DAYS, CLEANUP_BATCH_SIZE))&lt;br&gt;
                deleted = cur.rowcount&lt;br&gt;
                conn.commit()&lt;br&gt;
        except Exception:&lt;br&gt;
            try:&lt;br&gt;
                conn.rollback()&lt;br&gt;
            except Exception:&lt;br&gt;
                pass&lt;br&gt;
            raise&lt;br&gt;
        finally:&lt;br&gt;
            conn.putconn()&lt;br&gt;
        total_deleted += deleted&lt;br&gt;
        batches += 1&lt;br&gt;
        if deleted &amp;lt; CLEANUP_BATCH_SIZE:&lt;br&gt;
            break&lt;br&gt;
        time.sleep(CLEANUP_BATCH_PAUSE)&lt;br&gt;
    return {"batches": batches, "deleted": total_deleted}`&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Tests and What They Don't Cover&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You can't verify idempotency with mocks. The unique constraint has to behave exactly as in prod, which means a real database. threading doesn't work here either: the GIL gets in the way of reliably reproducing the race. So: multiprocessing.&lt;br&gt;
The balance can add up correctly while balance_events is still duplicated. Check both. That's exactly how it happens: the balance looks fine, everything appears clean, and the duplicates in balance_events only surface when an auditor shows up.&lt;/p&gt;

&lt;p&gt;`import pytest&lt;br&gt;
import psycopg2&lt;br&gt;
import psycopg2.extras&lt;br&gt;
import multiprocessing&lt;br&gt;
from decimal import Decimal&lt;/p&gt;

&lt;p&gt;TEST_DSN = "host=localhost port=5433 dbname=testdb user=testuser password=testuser"&lt;/p&gt;

&lt;p&gt;@pytest.fixture(scope="session")&lt;br&gt;
def db_conn_session():&lt;br&gt;
    conn = psycopg2.connect(TEST_DSN)&lt;br&gt;
    yield conn&lt;br&gt;
    conn.close()&lt;/p&gt;

&lt;p&gt;@pytest.fixture&lt;br&gt;
def db_conn(db_conn_session):&lt;br&gt;
    conn = db_conn_session&lt;br&gt;
    conn.rollback()&lt;br&gt;
    with conn.cursor() as cur:&lt;br&gt;
        cur.execute("""&lt;br&gt;
            TRUNCATE balance_events, processed_events, payment_events,&lt;br&gt;
                     dead_letter_queue, users&lt;br&gt;
            RESTART IDENTITY CASCADE&lt;br&gt;
        """)&lt;br&gt;
        cur.execute(&lt;br&gt;
            "INSERT INTO users (id, balance, initial_balance) VALUES (1, 100.0, 100.0)"&lt;br&gt;
        )&lt;br&gt;
    conn.commit()&lt;br&gt;
    yield conn&lt;br&gt;
    try:&lt;br&gt;
        conn.rollback()&lt;br&gt;
    except Exception:&lt;br&gt;
        pass&lt;/p&gt;

&lt;p&gt;def _deposit_worker(dsn, event_id, event_type, user_id, amount, barrier, q):&lt;br&gt;
    conn = psycopg2.connect(dsn)&lt;br&gt;
    try:&lt;br&gt;
        barrier.wait(timeout=10)&lt;br&gt;
        process_deposit_sync(conn, event_id, event_type, user_id, amount)&lt;br&gt;
        q.put(("ok", None))&lt;br&gt;
    except Exception as e:&lt;br&gt;
        q.put(("err", f"{type(e).&lt;/em&gt;&lt;em&gt;name&lt;/em&gt;_}: {e}"))&lt;br&gt;
    finally:&lt;br&gt;
        conn.close()&lt;/p&gt;

&lt;p&gt;def test_duplicate_deposits_produce_single_credit(db_conn):&lt;br&gt;
    """10 workers, ONE event_id, simulating at-least-once delivery."""&lt;br&gt;
    with db_conn.cursor() as cur:&lt;br&gt;
        cur.execute(&lt;br&gt;
            "INSERT INTO payment_events (event_id, user_id, amount, event_type, status) "&lt;br&gt;
            "VALUES ('evt_dup', 1, '50.0', 'deposit', 'enqueued')"&lt;br&gt;
        )&lt;br&gt;
    db_conn.commit()&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;N = 10&lt;br&gt;
barrier = multiprocessing.Barrier(N)&lt;br&gt;
q = multiprocessing.Queue()&lt;br&gt;
workers = [&lt;br&gt;
    multiprocessing.Process(&lt;br&gt;
        target=_deposit_worker,&lt;br&gt;
        args=(TEST_DSN, "evt_dup", "deposit", 1, "50.0", barrier, q),&lt;br&gt;
    ) for _ in range(N)&lt;br&gt;
]&lt;br&gt;
for w in workers: w.start()&lt;br&gt;
for w in workers: w.join(timeout=20)

&lt;p&gt;with db_conn.cursor() as cur:&lt;br&gt;
    cur.execute("SELECT balance FROM users WHERE id = 1")&lt;br&gt;
    balance = cur.fetchone()[0]&lt;br&gt;
    cur.execute("SELECT COUNT(&lt;em&gt;) FROM balance_events")&lt;br&gt;
    be_count = cur.fetchone()[0]&lt;br&gt;
    cur.execute("SELECT COUNT(&lt;/em&gt;) FROM processed_events")&lt;br&gt;
    pe_count = cur.fetchone()[0]&lt;/p&gt;

&lt;p&gt;assert balance == Decimal("150.0"), f"duplication! balance={balance}"&lt;br&gt;
assert be_count == 1, f"balance_events duplicated: {be_count}"&lt;br&gt;
assert pe_count == 1`&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;What Tests Don't Cover&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Throughput under real load. NOWAIT serializes access to a user's row; throughput comes from Celery retries with jittered backoff. Load testing needs a separate environment with real Celery workers.&lt;br&gt;
Worker crash mid-transaction. reject_on_worker_lost=True requires killing the Celery worker process and verifying the task is returned to the broker. That's an integration test and it lives separately.&lt;br&gt;
Redis DLQ fallback. Requires real Redis and a simulated PostgreSQL outage. Verified through chaos testing.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Backpressure and Degradation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Under high concurrency on a single user, RetryableError from NOWAIT accumulates. max_retries=5 gives 6 attempts (the initial run plus 5 retries), with a total worst-case delay of up to roughly 63 seconds. If incoming webhook throughput exceeds the capacity to drain retries, queue depth starts growing and CeleryQueueDepth fires.&lt;br&gt;
When the DLQ starts filling up: alert_zombie_events catches dlq_size &amp;gt; 10 events/hour; that's the first signal. The PostgreSQL dead_letter_queue grows without limit, theoretically up to disk size. After that it's manual work: diagnose the cause, fix it, replay from the DLQ.&lt;br&gt;
Auto-replay was deliberately not built. An event landing in the DLQ means something went wrong. Let a person figure it out before pushing it back through. I don't want the system making its own decisions about money that has already failed once.&lt;br&gt;
If the Redis broker is unavailable, apply_async simply fails. The event stays in enqueued; after 3 minutes recover_stale_enqueued_events moves it back to pending and increments retry_count. After MAX_RECOVERY_ATTEMPTS (10 attempts, roughly 30 minutes), the event goes to the DLQ plus an alert.&lt;br&gt;
acks_late and reject_on_worker_lost protect against worker crashes, not broker crashes. If the Redis master goes down, in-flight tasks are lost. appendonly yes plus appendfsync everysec is the minimum that should be in place. If you genuinely can't afford to lose data: appendfsync always, at the cost of throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;On Blockchain Reorgs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For Ethereum after the switch to PoS, finalization happens through two checkpoint epochs (roughly 12-15 minutes). The "12 blocks" heuristic from the PoW era doesn't apply anymore. For L2 (Arbitrum, Optimism) the rules are different. This section covers Ethereum L1 only.&lt;br&gt;
The current implementation doesn't handle reorgs. That's out of scope for the first release. During a reorg, the blockchain "rolls back" several blocks. A transaction may land in the new chain unchanged, or it may disappear entirely, effectively cancelled.&lt;br&gt;
The simplest approach: a compensating entry in balance_events with a negative sign, with a separate idempotency key to avoid conflicting with the original event:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;def handle_reorg_event(original_tx_hash: str, user_id: int, amount: Decimal) -&amp;gt; None:&lt;br&gt;
    reorg_event_id  = f"reorg:{original_tx_hash}"&lt;br&gt;
    idempotency_key = _idempotency_key(reorg_event_id, "reorg_compensation")&lt;br&gt;
    # then standard flow through processed_events + balance_events&lt;/code&gt;&lt;/p&gt;
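
&lt;p&gt;Fleshed out slightly, the compensation could look like the sketch below. It is not the project's code: the real version would go through the same processed_events flow and row locking as a deposit, which is omitted here for brevity, and the balance_events column names are the ones mentioned earlier in the article:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical sketch. The same UNIQUE (source_event_id, event_type) constraint
# on balance_events makes the compensation itself idempotent under redelivery.
def compensate_reorged_deposit(conn, original_tx_hash: str, user_id: int, amount: Decimal) -&amp;gt; None:
    reorg_event_id = f"reorg:{original_tx_hash}"
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO balance_events "
            "(source_event_id, event_type, user_id, amount, created_at) "
            "VALUES (%s, 'reorg_compensation', %s, %s, NOW()) "
            "ON CONFLICT (source_event_id, event_type) DO NOTHING",
            (reorg_event_id, user_id, -amount),
        )
        if cur.rowcount:  # adjust the balance only if the compensation row was actually new
            cur.execute(
                "UPDATE users SET balance = balance - %s WHERE id = %s",
                (amount, user_id),
            )
    conn.commit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;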

&lt;h2&gt;
  
  
  &lt;strong&gt;What's in the Backlog&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The most painful thing right now: notify_user_insufficient_funds is called outside the transaction. It needs the outbox pattern, a record written to an outbox table inside the same transaction as the UPDATE of users. Without that, under at-least-once delivery the user gets N notifications for one rejection, and when notify fails, the DLQ gets records for successfully committed transactions. That's noise that hides real incidents.&lt;br&gt;
Above 50 workers, direct connections hit max_connections, so PgBouncer is needed. One mine worth flagging before enabling transaction pooling mode: hot_path_balance_check uses conn.set_isolation_level(REPEATABLE_READ), a session-level setting that PgBouncer in transaction pooling doesn't preserve between transactions. SET LOCAL lock_timeout/statement_timeout are fine (they're transaction-scoped), but the isolation level will need to be rewritten as an inline BEGIN ISOLATION LEVEL REPEATABLE READ (or the equivalent SET TRANSACTION), or that specific task moved to session pooling; a sketch follows at the end of this section. It's a classic senior-level mine: works in dev, breaks only after PgBouncer is enabled in prod.&lt;br&gt;
processed_events keeps growing. Cleanup handles the near term, but partitioning by date becomes inevitable past hundreds of millions of rows.&lt;br&gt;
Further out: a Redis broker with appendfsync always, full historical reconciliation via a materialized view, full reorg handling, an admin UI for the DLQ instead of manual SQL. One trap in the code I haven't touched yet: _mark_event_failed commits on its own. When you refactor, it will bite you.&lt;/p&gt;
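
&lt;p&gt;For the isolation-level mine specifically, the transaction-scoped rewrite could look roughly like this. HOT_PATH_MISMATCH_SQL stands in for the JOIN query from hot_path_balance_check; the sketch assumes plain psycopg2 with autocommit off, so the first statement of the block opens the transaction:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical rewrite: SET TRANSACTION is scoped to the current transaction,
# so nothing session-level leaks through PgBouncer's transaction pooling.
def fetch_hot_path_mismatches(conn) -&amp;gt; list:
    try:
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
            cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
            cur.execute("SET LOCAL statement_timeout = '10s'")
            cur.execute(HOT_PATH_MISMATCH_SQL)
            mismatches = cur.fetchall()
        conn.commit()
        return mismatches
    except Exception:
        conn.rollback()
        raise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The alternative is to keep set_isolation_level and pin just this task to a session-pooled PgBouncer port; the rewrite above avoids having two pool modes to reason about.&lt;/p&gt;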

&lt;h2&gt;
  
  
  &lt;strong&gt;Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The queue draining and the balances matching are two different things. Prometheus shows the first, hot_path_balance_check shows the second. You need both; one alone is not enough.&lt;br&gt;
8 months in production. 0 duplicate credits after deploying the fix, versus 23 in the first month across roughly 180k transactions. Webhook delivery at 100% once provider retries are accounted for.&lt;br&gt;
There isn't a single decision here that I designed upfront. Each one closes a hole that had already fired: idempotency via a DB unique constraint, TOCTOU via SELECT FOR UPDATE NOWAIT, lost transactions via acks_late + outbox, the state machine via VALID_TRANSITIONS. None of these is a guarantee in isolation; together they cover each other's gaps. Eight months without an incident.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>backend</category>
      <category>web3</category>
      <category>blockchain</category>
    </item>
  </channel>
</rss>
