DEV Community: 137Foundry

10 Regex Patterns Every Backend Developer Should Have in Their Snippets Folder

137Foundry — Tue, 02 Jun 2026 10:03:45 +0000

Every backend developer ends up needing the same regex patterns over and over: email validation, URL extraction, number parsing, ID format checks, log parsing. Most rewrite the patterns each time. Most end up with subtle inconsistencies between projects. Keeping a personal snippets file of regex you have actually tested and used is one of those small productivity wins that compounds across years.

This is a curated list of ten regex patterns that show up in nearly every backend codebase, with the dialect notes and gotchas that matter. Use these as starting points; tune them to your specific inputs before shipping.

1. Pragmatic Email Validation

^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$

Accepts the vast majority of real email addresses, rejects obviously malformed input, and stays simple enough to avoid backtracking issues. The full RFC 5322 spec is a 6,000-character monster that nobody needs. The W3C HTML specification for the input type="email" element uses a very similar pragmatic pattern.

Trade-off: rejects RFC-valid quoted local parts ("foo bar"@example.com), which essentially no real users ever type. Worth it.
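As a rough sketch of how this pattern might be wired into Python (the function name `looks_like_email` is made up for illustration), `fullmatch` is used so that a trailing newline is not silently accepted, which can happen with `re.match` and `$`:

```python
import re

# Pattern copied from the section above; compiled once for reuse.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def looks_like_email(value: str) -> bool:
    """Format check only; it says nothing about deliverability."""
    # fullmatch requires the whole string to match, including any trailing newline
    return EMAIL_RE.fullmatch(value) is not None
```

This is a boundary check, not proof that the mailbox exists; a confirmation email is still the only real validation.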

2. URL Validation (HTTP/HTTPS Only)

^https?:\/\/([\w-]+\.)+[\w-]+(:\d+)?(\/[\w\-._~:\/?#[\]@!$&'()*+,;=%]*)?$

Matches HTTP and HTTPS URLs with optional port and path. The path character class includes the RFC 3986 unreserved and reserved characters that URLs are allowed to contain.

Gotcha: this validates format only, not reachability. A URL that matches this pattern can still 404 or DNS-fail. For real production validation, use the language's URL parser (URL in JavaScript, urllib.parse in Python).
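A minimal sketch of the parser-based alternative in Python, assuming you only want to accept http and https URLs that have a host (the helper name `is_http_url` is hypothetical):

```python
from urllib.parse import urlsplit

def is_http_url(value: str) -> bool:
    """Accept only http(s) URLs that have a host; parse errors count as invalid."""
    try:
        parts = urlsplit(value)
        parts.port  # accessing .port raises ValueError for a malformed or out-of-range port
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```

Like the regex, this checks structure only. It does not resolve DNS or fetch anything.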

3. UUID v4

^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$

Validates the UUID v4 format specifically. The 4 in the third group and the [89ab] in the fourth group are the version and variant bits required by the v4 spec. A pattern that accepts any dashed string of 32 hex digits would also accept v1, v3, and v5 UUIDs.

To also accept uppercase UUIDs, add the i flag or expand the character classes to [0-9a-fA-F] (and the variant class to [89abAB]).
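One way this might look in Python, using the standard uuid module to generate test values. The `is_uuid4` helper is illustrative, and `re.IGNORECASE` is the Python spelling of the i flag:

```python
import re
import uuid

UUID4_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$",
    re.IGNORECASE,  # accept uppercase hex digits as well
)

def is_uuid4(value: str) -> bool:
    return UUID4_RE.fullmatch(value) is not None

v4 = str(uuid.uuid4())
v5 = str(uuid.uuid5(uuid.NAMESPACE_DNS, "example.com"))
# is_uuid4(v4) and is_uuid4(v4.upper()) are True; is_uuid4(v5) is False
```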

4. ISO 8601 Date

^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:?\d{2})?)?$

Matches ISO 8601 dates and datetimes, including optional fractional seconds and timezone offsets. Both Z (UTC) and +05:00-style offsets are accepted.

Gotcha: this is format validation only. A pattern-valid date like 2026-13-32 is still nonsensical. After regex passes, parse with a real date library to catch semantic errors.
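A short sketch of the two-stage check in Python. To keep it small, only the calendar date is parsed; the time and offset portions are left to the regex, which is an assumption you may want to tighten. The function name is made up for the example:

```python
import re
from datetime import datetime

ISO_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:?\d{2})?)?$"
)

def is_real_iso_date(value: str) -> bool:
    """Regex checks the shape; strptime checks that the calendar date exists."""
    if ISO_RE.fullmatch(value) is None:
        return False
    try:
        # Only the YYYY-MM-DD prefix is validated semantically in this sketch.
        datetime.strptime(value[:10], "%Y-%m-%d")
    except ValueError:
        return False
    return True
```

With this, 2026-13-32 passes the regex but is rejected by the parse step.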

5. US Phone Number (Permissive)

^(\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$

Matches US phone numbers in common formats: 1234567890, 123-456-7890, (123) 456-7890, +1-123-456-7890. The optional country code and varied separators reflect what users actually type.

For international phone validation, this pattern is wrong. Use a library like libphonenumber, which handles every country's quirks. Regex alone cannot encode international phone format rules.

6. Strong Password Format

^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$

Requires at least one lowercase letter, one uppercase letter, one digit, one symbol, and 8+ characters total. Uses positive lookaheads to enforce each requirement independently.

Modern security guidance from groups like NIST and OWASP has shifted away from composition rules in favor of length and breach-list checks, but composition checks remain common in production. Apply this for systems that still require them; do not assume it represents current best practice.

7. Hex Color Code

^#?([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$

Accepts both 3-digit and 6-digit hex colors, with or without the leading #. Useful for color picker validation, CSS parsing, design tool input.

For modern CSS, this regex does not cover rgb(), rgba(), hsl(), or named colors. Hex remains the most common format for stored color values, so this pattern handles most cases.

8. Whitespace Collapse

\s+

Combined with the g flag and .trim(), this collapses runs of whitespace into single spaces and strips leading and trailing whitespace. The most-used regex in any string-normalization pipeline.

text.replace(/\s+/g, ' ').trim()

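Gotcha: what \s matches depends on the dialect. In JavaScript and in Python 3 str patterns it already covers Unicode whitespace such as non-breaking, em, and ideographic spaces. In Java (without UNICODE_CHARACTER_CLASS), Go, and PCRE without UCP mode it is ASCII-only, so use a broader character class there. In none of these does \s match the zero-width space (U+200B), which has to be stripped separately.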
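To see how much the dialect matters, here is a small Python check. Python's `re.ASCII` flag is used to imitate an ASCII-only `\s`; the sample strings are made up:

```python
import re

sample = "a\u00a0\u3000b"  # non-breaking space followed by an ideographic space

unicode_aware = re.sub(r"\s+", " ", sample)                # Python 3 default for str
ascii_only = re.sub(r"\s+", " ", sample, flags=re.ASCII)   # imitates an ASCII-only dialect
zero_width = re.sub(r"\s+", " ", "a\u200bb")               # zero-width space is not whitespace
```

Here `unicode_aware` becomes "a b", while `ascii_only` and `zero_width` come back unchanged.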

9. Extract Markdown Link

\[([^\]]+)\]\(([^)]+)\)

Captures the link text (group 1) and URL (group 2) from standard markdown links. Useful for processing markdown content programmatically, building link analyzers, or migrating content between formats.

Breaks on links with literal parentheses in the URL (anything where the URL itself contains an unescaped closing parenthesis confuses the simple character class). For full markdown parsing, a real markdown parser is the right tool. This pattern works for the standard cases.
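A quick Python illustration of both the normal case and the parenthesis failure described above (the sample text is invented):

```python
import re

LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

md = ("See [the docs](https://example.com/docs) and "
      "[wiki](https://en.wikipedia.org/wiki/Regex_(disambiguation)).")

links = LINK_RE.findall(md)
# links[0] is ("the docs", "https://example.com/docs")
# links[1] has its URL cut off at the first ")" inside the URL
```

The second tuple ends in `Regex_(disambiguation`, without the closing parenthesis, which is exactly the kind of silent truncation worth a test case.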

10. Pull Numbers From Text

-?\d+(\.\d+)?

Matches signed integers and decimals. Useful for log parsing, financial data extraction, or any context where mixed text and numbers need separating.

For numbers with thousand separators:

-?\d{1,3}(,\d{3})*(\.\d+)?

For numbers in scientific notation:

-?\d+(\.\d+)?([eE][+-]?\d+)?

Pick the variant that matches your input format. The most common bug is using the basic pattern on data that includes thousand separators and getting truncated values.
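The truncation bug is easy to reproduce. In this Python sketch the groups are written as non-capturing `(?:...)` so that `findall` returns whole matches; that is a change from the patterns as printed above, and the log line is invented:

```python
import re

BASIC = re.compile(r"-?\d+(?:\.\d+)?")
WITH_THOUSANDS = re.compile(r"-?\d{1,3}(?:,\d{3})*(?:\.\d+)?")

line = "refund of -1,234.50 issued"

basic_hits = BASIC.findall(line)               # ['-1', '234.50']
thousands_hits = WITH_THOUSANDS.findall(line)  # ['-1,234.50']
```

The reverse mismatch also bites: the thousands variant splits an unseparated number such as 1234567 into 123, 456, and 7, so pick the variant per input source rather than globally.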

Bonus: Trim and Normalize in One Pass

A useful idiom that builds on pattern 8: trim leading and trailing whitespace and collapse internal whitespace in one expression. In JavaScript:

text.trim().replace(/\s+/g, ' ')

In Python:

re.sub(r"\s+", " ", text.strip())

This shows up constantly in user-input normalization paths. Worth memorizing rather than rewriting each time.

A related pattern: stripping zero-width characters that sneak in via copy-paste from word processors and PDFs. Add [\u200B-\u200D\uFEFF] to your normalization regex to remove zero-width spaces, joiners, and BOMs. These cause real bugs in downstream systems that compare strings byte-for-byte, and they are invisible in most editors.
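A minimal Python sketch of that normalization step, assuming the characters to strip are the zero-width space, non-joiner, joiner (U+200B to U+200D) and the BOM (U+FEFF). The `normalize` name is illustrative:

```python
import re

# Zero-width space, non-joiner, joiner (U+200B-U+200D) and the BOM (U+FEFF).
ZERO_WIDTH_RE = re.compile("[\u200b-\u200d\ufeff]")

def normalize(text: str) -> str:
    text = ZERO_WIDTH_RE.sub("", text)          # drop invisible characters first
    return re.sub(r"\s+", " ", text).strip()    # then collapse and trim whitespace
```

One caution: zero-width joiners are meaningful inside some emoji sequences and in scripts such as Persian, so apply this to identifiers and keys rather than to free-form display text.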

How to Maintain These Patterns

A snippets file is only useful if you maintain it. A few practices that keep it useful over time:

One file per language or one file with language-tagged sections. JavaScript regex literals and Python raw strings have different syntax but the same patterns.
Comments on every pattern explaining what it accepts, what it rejects, and what it does not try to do.
Linked test cases somewhere accessible (a gist, a repo, a tests file in your dotfiles).
A changelog of when patterns were updated and why. The change history is valuable when you wonder why a pattern looks the way it does.

Pair the snippets file with regex testing tools like regex101.com for testing modifications before saving them back. The combination of a tested snippets library and an interactive tester turns regex from a guessing game into a quick lookup.

When These Patterns Are Not Enough

These ten cover the most common needs but not every need. Cases where you need to reach beyond simple regex:

Internationalized inputs. Unicode-aware regex is harder than ASCII regex. Use Unicode property escapes (\p{L} for any letter) where the dialect supports them.
Recursive structures. Nested HTML, balanced brackets, nested function calls. Regex cannot parse these correctly. Use a real parser.
Context-sensitive validation. "Valid if this other field equals X" requires more than regex. Use a schema validator.
Performance-sensitive paths. Compile patterns once and reuse them. In Python, re.compile(). In Java, Pattern.compile(). In JavaScript, the literal form is automatically cached but new RegExp(...) inside a loop is not.

For production data validation work at 137Foundry, these patterns are the starting point of a layered approach: cheap regex format checks at the boundary, then more expensive validation layers that handle the cases regex cannot.

For more on the validation patterns that go around these regex snippets in production systems, see the full article Regex Code Snippets: Patterns for Common Validation and Parsing Problems. The 137Foundry data integration service covers the architectural side of validation in production pipelines, and the services hub describes related integration and automation work.

How to Write Regex Patterns That Survive Real-World Input

137Foundry — Tue, 02 Jun 2026 10:02:05 +0000

A regex that works on test data is a hypothesis. A regex that works on production data is an answer. Most developers do not appreciate this distinction until they ship a parser that processes 95 percent of inputs correctly and crashes on the other 5 percent at 2 AM.

This is a step-by-step approach to writing regex patterns that hold up against real-world input, including the categories of input that test data rarely covers.

Step 1: Collect Real Input Before Writing the Pattern

The single biggest mistake in regex design is writing the pattern first and looking at real data second. The correct order is the opposite: collect a meaningful sample of actual input, look at the variations, and only then write a pattern that handles them.

For an email validator, sample input from your actual user signups (if any exist). For a date parser, sample the dates from the actual document corpus. For a CSV splitter, sample the actual CSV files you need to process.

The variations you find are almost always wider than you would have guessed. Real CSV files have inconsistent quoting. Real dates include misspellings. Real URLs include trailing whitespace, mixed case, and trailing punctuation from the surrounding text. Real names include hyphens, apostrophes, periods, and Unicode characters.

A regex written without this sample is a regex written against an imaginary input distribution. It will fail on the real distribution in proportion to how much the real distribution differs from the imagined one.

Step 2: Anchor Decisively

The most common cause of regex bugs that pass tests but fail in production is missing anchors. A pattern without ^ and $ matches substrings, which means a "valid email" regex will accept a string like "random text user@example.com more text" because an address appears somewhere inside it.

Decide explicitly whether the pattern should:

Match the entire string (^pattern$)
Match any substring (pattern with no anchors)
Match at word boundaries (\bpattern\b)
Match from the start but not require the end (^pattern without $)

Each is appropriate in different contexts. None is the right default in all cases. The single biggest source of bugs is people writing un-anchored patterns when they meant fully-anchored ones.
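The difference between these modes is easiest to see side by side. A small Python sketch, reusing the email pattern body without anchors (the sample text is invented):

```python
import re

EMAIL_BODY = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
text = "random text user@example.com more text"

unanchored = re.search(EMAIL_BODY, text)               # finds the address inside the text
anchored = re.fullmatch(EMAIL_BODY, text)              # None: the whole string is not an email
word_bounded = re.findall(rf"\b{EMAIL_BODY}\b", text)  # extraction at word boundaries
```

The unanchored search is right for extraction and wrong for validation; the fully anchored match is the reverse.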

Step 3: Escape Literally Everything Special

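Regex special characters that should match literally need escaping. The special characters in most dialects are . * + ? ^ $ ( ) [ ] { } | and \. Inside a character class, escape - (or place it first or last) when it should be a literal hyphen rather than a range. The / only needs escaping where it is the pattern delimiter, as in JavaScript regex literals.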

The dot is the most commonly missed escape. example.com in a regex matches strings like exampleXcom because the dot matches any character. Almost always, when you write a domain in a regex, you want example\.com instead.

Languages that support raw regex strings (Python's r"...", JavaScript's literal /.../, Ruby's %r{...}) make this easier because you do not have to double-escape backslashes. Languages that require regex in regular strings double-escape: "\\d+" for the same pattern that r"\d+" produces. The double-escaping is one of the easiest sources of subtle bugs.
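When the literal text comes from a variable, letting the library do the escaping is usually safer than escaping by hand. A short Python sketch of the dot problem using `re.escape`:

```python
import re

domain = "example.com"

unescaped = re.compile(domain)             # the dot matches any character
escaped = re.compile(re.escape(domain))    # the dot matches only a literal dot

# unescaped.fullmatch("exampleXcom") succeeds; escaped.fullmatch("exampleXcom") is None
```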

Step 4: Make Greedy vs Non-Greedy Choices Explicit

A regex like <.*> is greedy: it matches from the first < to the last >. A regex like <.*?> is non-greedy: it matches each <...> pair individually.

Greedy is the default in almost all dialects. Most of the time, this is the wrong default for what people actually want.

The rule of thumb: if you are extracting tagged content, you almost always want non-greedy. If you are validating a single token, the choice does not matter because anchors will make the question moot.

A useful test: when the regex matches against an input with multiple instances of the pattern, does it return one large match or several small ones? If the answer is unexpected, the greedy vs non-greedy choice is probably wrong.
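That test takes a few lines of Python. The third pattern, a negated character class, is an addition not discussed above; it gives the same result as the non-greedy version here without relying on backtracking:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.*>", html)        # one match spanning the whole string
lazy = re.findall(r"<.*?>", html)         # one match per tag
explicit = re.findall(r"<[^>]*>", html)   # same tags, stated as "anything but >"
```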

Step 5: Test Boundary Inputs Systematically

Once the pattern is written, test it against these categories of input deliberately:

Empty input. Does the pattern reject empty strings appropriately, or does it accept them when it should not?

Whitespace-only input. Spaces, tabs, newlines, and especially mixtures of them.

Input at exactly the boundary. If the pattern allows 1 to 50 characters, test 0, 1, 50, and 51.

Input one character over the boundary. Common off-by-one errors show up here.

Input with leading or trailing whitespace. Real users paste from PDFs and word processors that include invisible characters.

Input with mixed line endings. \r\n vs \n vs \r causes parsing bugs in regex that uses . (which usually does not match newlines) without thinking about it.

Input with Unicode characters. Cyrillic, Chinese, emoji, mathematical symbols. The pattern's behavior on these is usually surprising.

Input that should fail. Make a deliberately malformed version of the expected input and confirm the pattern rejects it.

A pattern that passes all these tests is much more likely to survive production than one that passes only the happy-path tests.
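These categories translate naturally into a table of cases. The sketch below uses a hypothetical username rule (1 to 50 lowercase letters, digits, or underscores) purely as something to test against:

```python
import re

# Hypothetical rule for illustration only.
USERNAME_RE = re.compile(r"[a-z0-9_]{1,50}")

CASES = [
    ("", False),          # empty input
    (" \t\n", False),     # whitespace only
    ("a", True),          # lower boundary
    ("a" * 50, True),     # upper boundary
    ("a" * 51, False),    # one over the boundary
    (" alice", False),    # leading whitespace
    ("alice\n", False),   # trailing newline
    ("алиса", False),     # Cyrillic
    ("alice!", False),    # deliberately malformed
]

def failing_cases():
    """Return the cases where the pattern disagrees with the expectation."""
    return [(s, expected) for s, expected in CASES
            if (USERNAME_RE.fullmatch(s) is not None) != expected]
```

An empty result from `failing_cases()` means the pattern agrees with every expectation in the table.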

Step 6: Use a Real Regex Tester

Manual testing in code is slow and lossy. Real regex testers like regex101.com and regexr.com show the pattern matching live against input, with explanations of each token, capture group highlighting, and step-count metrics.

The step counter is especially valuable because it reveals catastrophic backtracking before it hits production. A pattern that takes 100 steps on a 50-character input is fine. A pattern that takes 100,000 steps on a 60-character input has a backtracking problem and should be rewritten.

Most regex testers also explain what each part of the pattern matches in plain language, which is a useful sanity check when reading regex written by someone else (or by you six months ago).

Step 7: Watch for Catastrophic Backtracking

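Certain regex constructs can become exponentially slow on certain inputs. The classic case is a nested quantifier like ^(a+)+$ run against a long string of a characters that ends in something else, such as aaaaaaaaaaaaaaaaaaaaaaaab: before giving up, the engine tries every way of splitting the run of a characters between the inner and outer +.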

Defenses:

Avoid nested quantifiers when possible. (a+)+ is almost always equivalent to a+ and is much safer.
Use possessive quantifiers (*+, ++) or atomic groups in regex dialects that support them. JavaScript does not; PCRE and Python (3.11 and later) do.
Cap input length before applying regex to user-controlled input.
Test against pathological inputs deliberately, especially for regex that processes untrusted input.

For security-sensitive contexts, MDN documentation and OWASP resources both cover ReDoS (regex denial of service) and the patterns to avoid. The short version: if the input is attacker-controlled and the regex has backtracking risk, treat it as a vulnerability and rewrite.
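Two of the defenses above, the flattened quantifier and the length cap, might look like this in Python. The cap value and function name are assumptions for the sketch:

```python
import re

MAX_INPUT_LEN = 1_000  # hypothetical cap; choose one that fits the field

# ^(a+)+$ rewritten without the nested quantifier; it accepts the same strings.
SAFE = re.compile(r"a+")

def matches_safely(value: str) -> bool:
    if len(value) > MAX_INPUT_LEN:
        return False  # reject before the regex engine ever sees the input
    return SAFE.fullmatch(value) is not None
```

The flattened pattern fails fast on a run of a characters followed by b, where the nested version would explore an exponential number of splits.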

Step 8: Add Comments and Test Cases

A regex pattern with no documentation is unreadable six months later, even to the person who wrote it. Two practical mitigations:

Use the x flag where supported (Python and PCRE call this verbose mode). This lets you write regex on multiple lines with comments:

import re

pattern = re.compile(r"""
    ^                 # start
    [A-Za-z0-9._%+-]+ # local part
    @                 # at sign
    [A-Za-z0-9.-]+    # domain
    \.[A-Za-z]{2,}    # TLD
    $                 # end
""", re.VERBOSE)

The pattern is the same, but a human can read it.

Maintain test cases as living documentation. A test file with a list of valid and invalid inputs (with comments explaining each case) is more useful than any amount of inline regex documentation. It also fails loudly when someone refactors the regex incorrectly.

Step 9: Plan for the Regex to Be Wrong

Even careful regex will sometimes be wrong. The right architectural defense is to make wrongness recoverable.

In data pipelines, this means logging the inputs that fail the regex (for later analysis), not just rejecting them silently. In user-facing forms, this means providing clear error messages so the user can fix the input. In imports, this means rejecting the failing row but continuing with the rest of the file, not crashing the whole import.

A regex that is occasionally wrong but reports its wrongness clearly is much more operationally useful than one that is occasionally wrong and silently corrupts downstream data.
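For the import case, a minimal sketch might look like the following. The row shape (a dict with a date key) and the function name are assumptions made for the example:

```python
import logging
import re

log = logging.getLogger("import")
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def import_rows(rows):
    """Keep good rows, record bad ones with their line number, never abort."""
    accepted, rejected = [], []
    for line_no, row in enumerate(rows, start=1):
        if DATE_RE.fullmatch(row.get("date") or ""):
            accepted.append(row)
        else:
            # Log the offending value so the pattern can be revisited later.
            log.warning("row %d rejected: date=%r", line_no, row.get("date"))
            rejected.append((line_no, row))
    return accepted, rejected
```

The rejected list doubles as the sample of real input that Step 1 asks for when the pattern is next revised.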

Putting It Together

A regex pattern that survives production is one that:

Was written against real sample input, not imagined input
Anchors decisively
Escapes special characters consistently
Handles greedy vs non-greedy explicitly
Has been tested against boundary cases
Has been benchmarked for backtracking risk
Is documented for future readers
Operates inside a system that can recover from being wrong

The full reference on patterns we use for common validation and parsing problems is in Regex Code Snippets: Patterns for Common Validation and Parsing Problems on our site at https://137foundry.com. The structural advice above is what makes those patterns actually work in production rather than just in test files.

For production data validation work, our data integration service covers the architectural patterns that go around regex use in messy real-world data flows.

Implementing Change Data Capture for Reliable Bi-Directional Data Sync

137Foundry — Mon, 01 Jun 2026 11:32:19 +0000

Before you can sync data between two systems, you need a reliable way to know what changed. Change Data Capture (CDC) is the pattern for detecting changes as they happen rather than scanning entire tables on every sync cycle. Without CDC, the sync layer must do expensive full-table comparisons or risk missing changes entirely.

There are two main approaches: database-level CDC that reads from the write-ahead log, and application-level CDC that uses timestamps and polling.

Approach 1: Database-Level CDC (PostgreSQL WAL)

PostgreSQL logical replication allows consumers to subscribe to a stream of row-level changes directly from the database write-ahead log (WAL). Every insert, update, and delete is captured with before and after values.

Setup:

-- postgresql.conf must have wal_level = logical
-- The consumer below parses JSON, so the slot uses the wal2json output plugin
-- (it must be installed on the server). pgoutput emits a binary protocol that
-- json.loads cannot read.
SELECT pg_create_logical_replication_slot('sync_slot', 'wal2json');

Python consumer using psycopg2:

import json
import psycopg2
import psycopg2.extras

def create_repl_conn(dsn: str):
    return psycopg2.connect(
        dsn,
        connection_factory=psycopg2.extras.LogicalReplicationConnection
    )

def consume_changes(dsn: str, slot: str, tables: str):
    # tables: comma-separated, schema-qualified names,
    # e.g. "public.customers,public.orders,public.products"
    conn = create_repl_conn(dsn)
    cur = conn.cursor()
    cur.start_replication(
        slot_name=slot, decode=True,
        # wal2json format-version 2 emits one JSON object per changed row
        options={"format-version": "2", "add-tables": tables}
    )
    def handle(msg):
        payload = json.loads(msg.payload)
        action = payload.get("action")  # I=insert, U=update, D=delete
        table = payload.get("table")
        if action in ("I", "U"):
            record = {c["name"]: c["value"] for c in payload.get("columns", [])}
            # emit {action: "upsert", table: table, record: record}
        elif action == "D":
            identity = {c["name"]: c["value"] for c in payload.get("identity", [])}
            # emit {action: "delete", table: table, identity: identity}
        # other actions (such as B=begin and C=commit) are ignored here
        # acknowledge the message so the slot can release WAL up to this point
        msg.cursor.send_feedback(flush_lsn=msg.data_start)
    cur.consume_stream(handle)

WAL-based CDC captures every change including hard deletes. The main operational concern is replication slot lag: if the consumer falls behind, the slot prevents WAL segments from being recycled, potentially filling the disk. pg_replication_slots has no ready-made lag column, so compute the lag as pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) for the slot and alert at 500 MB.
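If the monitoring side only has the LSN strings (for example, scraped from pg_replication_slots by an exporter), the same arithmetic can be done in Python. An LSN such as 16/B374D848 is two hexadecimal numbers, the high and low 32 bits of a byte position. The helper names below are illustrative:

```python
ALERT_BYTES = 500 * 1024 * 1024  # the 500 MB threshold mentioned above

def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def slot_lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL the consumer has not yet confirmed."""
    return lsn_to_int(current_wal_lsn) - lsn_to_int(confirmed_flush_lsn)

def should_alert(current_wal_lsn: str, confirmed_flush_lsn: str) -> bool:
    return slot_lag_bytes(current_wal_lsn, confirmed_flush_lsn) >= ALERT_BYTES
```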

Alternatively, Debezium provides a managed CDC layer on top of PostgreSQL that handles slot management, schema evolution, and failure recovery.

Approach 2: Application-Level CDC (Timestamp Polling)

Timestamp polling queries for records where updated_at > last_processed_timestamp. This is simpler to set up and works with any database, but cannot detect hard deletes.

Schema requirement:

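-- DEFAULT NOW() only applies on INSERT; the application or a trigger must set updated_at on every UPDATE
ALTER TABLE customers ADD COLUMN updated_at TIMESTAMP DEFAULT NOW();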
ALTER TABLE customers ADD COLUMN deleted_at TIMESTAMP;  -- soft deletes
CREATE INDEX idx_customers_updated_at ON customers(updated_at);

Poller with persistent high-watermark:

from datetime import datetime, timedelta
from sqlalchemy import create_engine, text
import redis as redis_lib

class TimestampCDCPoller:
    def __init__(self, engine, table: str, redis_client):
        self.engine = engine
        self.table = table
        self.redis = redis_client
        self.wm_key = f"sync:watermark:{table}"

    def get_watermark(self) -> datetime:
        val = self.redis.get(self.wm_key)
        return (datetime.fromisoformat(val.decode()) if val
                else datetime.utcnow() - timedelta(hours=1))

    def set_watermark(self, ts: datetime):
        self.redis.set(self.wm_key, ts.isoformat())

    def poll(self):
        # 10-second overlap buffer for rows arriving slightly out of order
        since = self.get_watermark() - timedelta(seconds=10)
        with self.engine.connect() as conn:
            rows = conn.execute(
                text(f"SELECT * FROM {self.table} "
                     "WHERE updated_at > :since ORDER BY updated_at ASC"),
                {"since": since}
            ).fetchall()
        if rows:
            self.set_watermark(max(r.updated_at for r in rows))
        return [dict(r._mapping) for r in rows]

The 10-second overlap means some rows are processed twice -- which is why idempotent consumers are essential. Store the watermark in Redis (persistent across restarts) rather than in memory. An in-memory watermark is lost on consumer restart, causing either a full re-scan or missed changes.
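What "idempotent" means in practice can be sketched in a few lines. This version assumes each polled row carries id and updated_at, as in the schema above, and keeps its seen-set in memory purely for illustration; a real consumer would persist it or rely on upserts in the target system:

```python
class IdempotentApplier:
    """Skips rows that were already applied, keyed on (id, updated_at)."""

    def __init__(self):
        self.seen = set()   # in-memory for the sketch; persist this in practice
        self.applied = []   # stands in for writes to the target system

    def apply(self, row: dict) -> bool:
        key = (row["id"], row["updated_at"])
        if key in self.seen:
            return False    # duplicate from the overlap window, ignore it
        self.seen.add(key)
        self.applied.append(row)
        return True
```

A row that comes back unchanged inside the overlap window is skipped, while a genuinely newer version of the same record still goes through.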

Filtering Sync-Originated Changes

Both CDC approaches will detect changes made by your sync layer itself, causing sync loops. Tag the sync connection:

with engine.connect() as conn:
    conn.execute(text("SET application_name = 'data_sync'"))
    conn.execute(
        text("UPDATE customers SET name = :name WHERE id = :id"),
        {"name": new_name, "id": record_id}
    )
    conn.commit()

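Note that logical decoding output does not carry application_name, so a WAL consumer cannot filter on it directly; the tag mainly helps when inspecting pg_stat_activity. For WAL-based CDC, the built-in mechanism is a replication origin: have the sync session call pg_replication_origin_session_setup() so its changes are marked with that origin, then exclude that origin in the output plugin where it supports this (wal2json has a filter-origins option). For timestamp polling, add a synced_at column that the sync layer sets on every write, and skip rows where updated_at - synced_at < INTERVAL '2 seconds'.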

Choosing the Right Approach

Use WAL-based CDC when: you need capture of hard deletes, you have high change volume (over 1,000 rows per minute), or you need sub-second sync latency. Use Debezium for production systems with high reliability requirements.

Use timestamp polling when: you cannot modify database replication configuration (common in managed database services like AWS RDS), you need a simpler operational setup, or sync latency tolerance is 30 seconds or more.

For the full guide including conflict resolution and dead-letter queue setup, see How to Build a Bi-Directional Data Sync Between Business Applications. For production implementation support, the 137Foundry data integration services team has run both CDC patterns across multiple client systems.

Monitoring CDC Consumers in Production

Regardless of which CDC approach you use, the operational monitoring for the consumer follows the same pattern:

Consumer lag. The gap between the most recent change event captured and the most recent change event processed. For WAL-based CDC, this is visible as replication slot lag in pg_replication_slots. For timestamp polling, it is the difference between NOW() and the current high-watermark.

Consumer heartbeat. For consumers running as background processes, a heartbeat check verifies that the consumer is still running and processing events. A consumer that has crashed but has not been restarted is invisible without an explicit heartbeat check.

Error rate. The fraction of processed events that result in errors (schema mismatch, downstream API failures, constraint violations). A rising error rate indicates something changed in the upstream system or downstream dependencies.

DLQ depth. For conflicts and permanent failures routed to a dead-letter queue, the DLQ depth should remain low (typically under 10 entries in a healthy system). A growing DLQ means problems are accumulating without resolution.

These four metrics -- lag, heartbeat, error rate, DLQ depth -- are the minimum viable monitoring for any CDC-based sync system. Without them, the sync can fail silently for hours before anyone notices.

For the complete guide on building reliable bi-directional sync including CDC implementation and operational monitoring, see 137Foundry's data integration resources and the data integration services overview.

Why This Matters for Production Reliability

The failure modes of bi-directional sync are almost always discovered in production, not in testing. Test environments rarely replicate the exact conditions that cause clock skew conflicts -- clock synchronization on development machines is generally better than on production infrastructure. Test environments rarely replicate the specific bulk operation patterns that create consistency gaps. And test environments rarely run long enough to reveal the slow drift that accumulates when a field authority map is not updated after a schema change.

This is not an argument against testing -- it is an argument for investing in observability alongside testing. The monitoring patterns in this guide (sync lag, DLQ depth, conflict rate, record count parity) give you visibility into problems that tests will not catch before they affect users.

For teams building a bi-directional sync for the first time, the practical recommendation is: build the operational baseline (DLQ, monitoring, idempotency, loop prevention) before the first production deployment, not after the first production incident. The upfront cost is modest. The incident prevention value is significant.

For technical implementation guidance, see 137Foundry and the data integration resources. For production architecture review and implementation support, the 137Foundry services team works with teams across the integration lifecycle.

Conflict Resolution in Bi-Directional Data Sync: Strategies That Hold Up in Production

137Foundry — Mon, 01 Jun 2026 11:32:18 +0000

In one-way data pipelines, conflict resolution is not a concept -- there is one authoritative source. In bi-directional sync, both systems write to shared data, and the sync layer must decide what to do when both systems have updated the same record since the last sync cycle. Getting this wrong means silently dropping updates, silently overwriting updates, or accumulating conflicts that gradually corrupt your data.

This covers three conflict resolution strategies with Python implementation examples, and the operational setup that keeps them working.

Strategy 1: Last-Write-Wins with Clock Skew Tolerance

Last-write-wins compares updated_at timestamps and applies the more recent version. Simple in theory; fragile in practice because of clock skew.

Clock skew between servers in a distributed system can reach hundreds of milliseconds to several seconds. NTP synchronization reduces drift but does not eliminate it. A last-write-wins resolution will produce the wrong answer whenever two changes arrive within the skew window -- and the failure is silent.

The mitigation: add a tolerance window. Conflicts where the timestamp gap is smaller than the expected clock skew get routed to a dead-letter queue for review rather than being resolved automatically.

from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import Optional

CLOCK_SKEW_TOLERANCE_MS = 500

@dataclass
class SyncRecord:
    id: str
    data: dict
    updated_at: datetime
    source_system: str

def resolve_last_write_wins(
    record_a: SyncRecord,
    record_b: SyncRecord,
    tolerance_ms: int = CLOCK_SKEW_TOLERANCE_MS,
) -> Optional[SyncRecord]:
    # Returns None if conflict is ambiguous (within clock skew tolerance)
    delta_ms = abs((record_a.updated_at - record_b.updated_at).total_seconds() * 1000)
    if delta_ms <= tolerance_ms:
        return None  # route to DLQ
    return record_a if record_a.updated_at > record_b.updated_at else record_b

Test this across the full range of timestamp gaps:

def test_within_tolerance_routes_to_dlq():
    now = datetime.utcnow()
    rec_a = SyncRecord("1", {}, now, "system_a")
    rec_b = SyncRecord("1", {}, now + timedelta(milliseconds=200), "system_b")
    assert resolve_last_write_wins(rec_a, rec_b) is None

def test_outside_tolerance_resolves_correctly():
    now = datetime.utcnow()
    rec_a = SyncRecord("1", {}, now, "system_a")
    rec_b = SyncRecord("1", {}, now + timedelta(seconds=2), "system_b")
    winner = resolve_last_write_wins(rec_a, rec_b)
    assert winner.source_system == "system_b"

Strategy 2: Field-Authority Mapping

Field authority designates each field as owned by a specific system. Only the owning system's value is authoritative for that field. This eliminates conflicts entirely for fields with a clear owner.

from typing import Literal, Dict

FieldAuthority = Literal["system_a", "system_b", "shared"]

FIELD_AUTHORITY_MAP: Dict[str, FieldAuthority] = {
    "name": "system_a",       # CRM owns contact info
    "email": "system_a",
    "payment_status": "system_b",  # Billing owns payment data
    "billing_address": "system_b",
    "notes": "shared",        # Both systems can update
    "last_activity_at": "shared",
}

def merge_with_field_authority(
    local_record: dict,
    incoming_record: dict,
    source_system: str,
) -> dict:
    # Apply fields from incoming where source_system is authoritative.
    # Fields not in the map are logged and skipped.
    import logging
    result = dict(local_record)
    for field, value in incoming_record.items():
        authority = FIELD_AUTHORITY_MAP.get(field)
        if authority is None:
            logging.warning(f"Field '{field}' not in authority map -- skipping")
            continue
        if authority == source_system or authority == "shared":
            result[field] = value
    return result

The key operational concern: the FIELD_AUTHORITY_MAP must be updated whenever either system adds a new field. Store it in a configuration file rather than hard-coding it, so updates do not require a code deploy.
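A sketch of loading the map from configuration instead, assuming a JSON document with the same field-to-owner shape as FIELD_AUTHORITY_MAP. The loader name and the validation rule are illustrative:

```python
import json

ALLOWED_OWNERS = {"system_a", "system_b", "shared"}

def load_authority_map(raw_json: str) -> dict:
    """Parse the field-authority config and reject unknown owners early."""
    mapping = json.loads(raw_json)
    bad = {field: owner for field, owner in mapping.items()
           if owner not in ALLOWED_OWNERS}
    if bad:
        raise ValueError(f"unknown authority values: {bad}")
    return mapping

# Example config content; in practice this would be read from a file.
config_text = '{"name": "system_a", "payment_status": "system_b", "notes": "shared"}'
authority_map = load_authority_map(config_text)
```

Failing at load time on a typo such as "system_c" is cheaper than discovering weeks later that a field was silently skipped.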

Strategy 3: Hybrid with Dead-Letter Queue

Most production bi-directional syncs combine last-write-wins for shared fields, field authority for owned fields, and a DLQ for conflicts that cannot be resolved deterministically.

import json
import redis
from datetime import datetime

dlq_client = redis.Redis()

def route_to_dlq(conflict_event: dict, reason: str, record_id: str):
    entry = {
        "id": record_id,
        "conflict_data": conflict_event,
        "reason": reason,
        "queued_at": datetime.utcnow().isoformat(),
        "status": "pending_review",
    }
    dlq_client.rpush("sync:dlq", json.dumps(entry))

def process_conflict(record_a: SyncRecord, record_b: SyncRecord) -> Optional[SyncRecord]:
    # Try last-write-wins first
    winner = resolve_last_write_wins(record_a, record_b)
    if winner is not None:
        return winner
    # Cannot resolve -- route to DLQ
    route_to_dlq(
        {"record_a": record_a.data, "record_b": record_b.data},
        reason="Within clock skew tolerance -- manual review required",
        record_id=record_a.id,
    )
    return None

Monitoring Conflict Rates in Production

A healthy bi-directional sync has a low but non-zero conflict rate. Spikes in conflict rate indicate both systems are being written to simultaneously more than expected -- usually a sign of a bug in upstream application logic or a runaway batch process writing to both systems at the same time.

Track conflict rate (conflicts per 1,000 sync events) and DLQ depth as primary operational metrics. Alert when conflict rate increases by more than 2x week-over-week, and alert immediately when DLQ depth exceeds a threshold. Without these alerts, conflicts accumulate silently and surface only when a user reports that their update was overwritten.

For the full architecture guide including CDC setup, sync loop prevention, and operational monitoring, see How to Build a Bi-Directional Data Sync Between Business Applications. The data integration work at 137Foundry covers these patterns in production across multiple client integrations.

DLQ Review Process in Practice

A dead-letter queue that nobody reviews is worse than no DLQ at all -- it creates a false sense of safety while conflicts accumulate. The DLQ needs a human review process, and that process should be defined before the first DLQ entry arrives.

The minimum viable DLQ review process:

Alert on DLQ growth. When DLQ depth exceeds a threshold (e.g., 10 entries), alert the on-call team. Do not wait for a daily review; real conflicts should be resolved within hours.
Review interface. A simple web interface or CLI tool that shows each DLQ entry with the record ID, both conflicting versions, the conflict reason, and the timestamp. Buttons or commands to apply record A, apply record B, or dismiss (if the conflict is no longer relevant).
Audit trail. Every DLQ resolution action is logged with the timestamp, the reviewer, and the chosen resolution. This audit trail is essential for debugging if the resolution was wrong.
Pattern detection. Periodically review the DLQ for patterns: the same record appearing multiple times, the same fields always conflicting, the same source system always losing. Patterns in DLQ entries indicate structural issues in the conflict resolution rules that should be fixed, not just resolved case-by-case.

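The SQLAlchemy ORM works well for the audit trail tables. The route_to_dlq example above pushes to a plain Redis list for simplicity; a Redis sorted set scored by timestamp is a convenient alternative for the DLQ itself, because it allows querying the oldest unresolved entries first.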


Why This Matters for Production Reliability

How to Prevent the Flash of Wrong Theme When Implementing Dark Mode

137Foundry — Sun, 31 May 2026 10:39:08 +0000

The flash of wrong theme (FOWT) is the brief moment where a page loads in light mode for a user who prefers dark, or vice versa, before JavaScript applies the correct theme. It happens because the default page color scheme renders before the JavaScript that reads localStorage and applies the user's preference has a chance to run.

For many applications, FOWT is the most visible dark mode bug and the hardest to eliminate without understanding what causes it. This article explains the root cause and walks through the specific fix for client-rendered applications, server-rendered applications, and Next.js specifically.

Why the Flash Happens

JavaScript-first dark mode implementations typically follow this sequence:

1. Browser receives HTML and begins parsing
2. Browser begins downloading and evaluating scripts (deferred or at end of <body>)
3. CSS renders the page using :root default values (light mode)
4. JavaScript runs, reads localStorage.getItem('theme'), applies data-theme="dark"
5. Page re-renders in dark mode

The user sees step 3 -- the light mode render -- before step 4 applies the correct theme. On a fast device, this flash is brief but still jarring. On a slow connection or low-end device, it lasts long enough to be obviously wrong.

The Fix: Block Rendering Until the Theme Is Applied

The only reliable fix is to apply the theme before the browser makes its first paint. This means running a small inline script in the <head> of the document -- specifically, before any stylesheets have finished loading.

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <!-- This script runs before any CSS is applied -->
  <script>
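    (function() {
      var preferred;
      try {
        // localStorage can throw when storage is blocked (for example, by privacy settings)
        preferred = localStorage.getItem('theme');
      } catch (e) {
        preferred = null;
      }
      if (!preferred) {
        // No stored preference: fall back to the OS-level setting
        preferred = window.matchMedia('(prefers-color-scheme: dark)').matches ? 'dark' : 'light';
      }
      document.documentElement.setAttribute('data-theme', preferred);
    })();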
  </script>
  <link rel="stylesheet" href="/styles.css">
</head>

This is a blocking script -- it prevents rendering until it executes. Normally, blocking scripts in the <head> are a performance anti-pattern because they delay the page from rendering. Here, that blocking behavior is what we need. The script is small enough (under 200 bytes minified) that the blocking cost is negligible, and the tradeoff is correct: a brief delay on all page loads is better than a visible theme flash on every load for users who have set a preference.

The inline script must be placed before your stylesheet <link> tags. If the stylesheet loads and paints before the script runs, the flash still occurs.

The window.matchMedia API reads the prefers-color-scheme media query value from the OS, which is used as the fallback when no stored preference exists in localStorage.
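The decision the inline script makes can be pulled out into a small pure function, which makes the fallback order easy to test. This is a minimal sketch: the name resolveTheme and the allow-list of 'light' and 'dark' are assumptions added for illustration, not part of the snippet above.

```javascript
// Hypothetical helper: decide which theme to apply.
// stored      -> value read from localStorage (may be null or an unexpected string)
// prefersDark -> result of matchMedia('(prefers-color-scheme: dark)').matches
function resolveTheme(stored, prefersDark) {
  // Only trust values the application itself would have written.
  if (stored === 'light' || stored === 'dark') {
    return stored;
  }
  // No usable stored preference: fall back to the OS setting.
  return prefersDark ? 'dark' : 'light';
}
```

In the browser this would be called as resolveTheme(localStorage.getItem('theme'), window.matchMedia('(prefers-color-scheme: dark)').matches). Rejecting unexpected stored values keeps a stale or hand-edited entry from producing a data-theme value that no stylesheet rule matches.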

Preventing Theme Transition on Load

If you have added CSS transitions to smooth theme switching, those same transitions will animate the initial theme application on page load -- which looks like a flash even when the theme is technically applied before first paint. Users see the correct theme animate in from the wrong one.

The fix is to suppress transitions until after the initial load:

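/* Start with transitions disabled, including on <html> itself and on pseudo-elements */
html.no-transition,
html.no-transition *,
html.no-transition *::before,
html.no-transition *::after {
  transition: none !important;
}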

(function() {
  var stored = localStorage.getItem('theme');
  var preferred = stored
    ? stored
    : (window.matchMedia('(prefers-color-scheme: dark)').matches ? 'dark' : 'light');
  document.documentElement.setAttribute('data-theme', preferred);
  document.documentElement.classList.add('no-transition');
})();

// After the DOM is ready, wait two frames, then remove the no-transition class
document.addEventListener('DOMContentLoaded', function() {
  requestAnimationFrame(function() {
    requestAnimationFrame(function() {
      document.documentElement.classList.remove('no-transition');
    });
  });
});

The nested requestAnimationFrame calls matter: a single callback runs just before the next paint, so the class could be removed before the first frame is committed. Waiting for a second frame means at least one frame has been rendered with transitions disabled before they are switched back on.

Server-Rendered Applications: Setting the Theme Before HTML Delivery

In a server-rendered application (PHP, Django, Rails, etc.), you can set the data-theme attribute server-side if the theme preference is stored in a cookie rather than localStorage. The cookie is sent with the request, the server reads it, and the HTML is delivered with data-theme="dark" already present on the <html> element.

Set-Cookie: theme=dark; SameSite=Strict; Path=/; Max-Age=31536000

On the server side, read the cookie and render the HTML accordingly:

<!-- PHP example -->
<?php $theme = $_COOKIE['theme'] ?? 'light'; ?>
<html data-theme="<?= htmlspecialchars($theme) ?>" lang="en">

This eliminates the flash entirely because the HTML arrives from the server already configured with the correct attribute. No client-side JavaScript is required for the initial render.

The tradeoff: the client-side localStorage approach and the server-side cookie approach are mutually exclusive in their simplest implementations. If you are migrating from localStorage to cookie-based persistence, you need a migration path for existing users whose preferences are in localStorage.
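One possible migration path is a one-time client-side step that copies any existing localStorage preference into the cookie. The sketch below assumes the cookie attributes from the Set-Cookie example above; the function names and the injected storage and writeCookie parameters are hypothetical, chosen so the logic can run outside a browser.

```javascript
// Build the cookie string for a theme preference (one-year lifetime).
function buildThemeCookie(theme) {
  return 'theme=' + encodeURIComponent(theme) +
    '; SameSite=Strict; Path=/; Max-Age=31536000';
}

// Copy a legacy localStorage preference into a cookie, once.
// storage     -> object with getItem/removeItem (window.localStorage in the browser)
// writeCookie -> callback that receives the cookie string
// Returns the migrated theme, or null if there was nothing to migrate.
function migrateThemeToCookie(storage, writeCookie) {
  const stored = storage.getItem('theme');
  if (stored !== 'light' && stored !== 'dark') {
    return null; // nothing stored, or an unexpected value
  }
  writeCookie(buildThemeCookie(stored));
  storage.removeItem('theme'); // the cookie is now the single source of truth
  return stored;
}
```

In the browser, a call such as migrateThemeToCookie(localStorage, function (c) { document.cookie = c; }) on page load would move existing users over. The first request after migration still arrives without the cookie, so the inline script is worth keeping as a fallback during the transition period.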

Next.js and React Server Components

In Next.js, the standard approach is to use a combination of the <Script> component with strategy="beforeInteractive" and a server-side theme resolution via cookies.

For the client-side fallback, add a beforeInteractive script in your root layout.tsx:

import Script from 'next/script';

export default function RootLayout({ children }) {
  return (
    <html lang="en">
      <head>
        <Script id="theme-init" strategy="beforeInteractive">
          {`
            (function() {
              var stored = localStorage.getItem('theme');
              var preferred = stored
                ? stored
                : (window.matchMedia('(prefers-color-scheme: dark)').matches ? 'dark' : 'light');
              document.documentElement.setAttribute('data-theme', preferred);
            })();
          `}
        </Script>
      </head>
      <body>{children}</body>
    </html>
  );
}

The beforeInteractive strategy renders the script inline in the <head> before any React hydration occurs, which places it in the same position as the manual inline script approach described above.

For server-side resolution with cookies in Next.js App Router, read the cookie in the root layout Server Component and set the attribute on the <html> element before delivery:

import { cookies } from 'next/headers';

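export default async function RootLayout({ children }) {
  // cookies() is asynchronous as of Next.js 15; awaiting it is harmless on earlier versions
  const theme = (await cookies()).get('theme')?.value ?? 'light';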
  return (
    <html lang="en" data-theme={theme}>
      <body>{children}</body>
    </html>
  );
}

This delivers HTML with the correct theme already applied, eliminating the flash entirely for server-rendered pages.

Pure CSS: No Flash by Design

If you use only prefers-color-scheme and CSS custom property overrides without any JavaScript, there is no flash. The CSS media query applies before the page paints, and the correct theme tokens are in place from the first render.

The limitation is that you cannot offer a manual toggle -- users who want dark mode on a light-OS device, or light mode on a dark-OS device, have no option. For applications where respecting the system default is sufficient, the pure CSS approach is the simplest path with zero flash risk.

"The flash-of-wrong-theme bug is one of the first things we look for in a dark mode code review. It is almost always solvable without significant refactoring -- the solution is always some variant of 'run the theme script earlier.' Teams are often surprised that the fix is adding a blocking script, because blocking scripts are normally what you remove in performance optimization work. The blocking here is intentional and the performance cost is negligible." -- Dennis Traina, founder of 137Foundry (view services)

Verifying Your Fix Works

After implementing one of these approaches, verify the fix with browser DevTools:

1. In Chrome DevTools, open the Application tab and clear the site's localStorage.
2. Set a theme value in localStorage (localStorage.setItem('theme', 'dark')).
3. Reload the page with the DevTools Network tab open and throttling set to "Slow 3G."
4. Observe the page load. If a light-mode flash precedes the dark theme, the fix is not in the right position.

A correct implementation shows no visible theme transition on load -- the page arrives in the stored theme without any intermediate state.

The full dark mode implementation context -- CSS custom properties, the localStorage toggle, and the system preference detection -- is covered in How to Add Dark Mode to a Web Application. The FOWT fix described here slots into that broader implementation as the final piece that eliminates the last visible artifact.

For projects where dark mode is part of a larger front-end architecture scope, 137Foundry development services include front-end code review and implementation consulting. FOWT elimination is a consistent item in 137Foundry's front-end review checklist for applications with user-controlled theme switching.

Building a CSS Design Token System That Scales

137Foundry — Sun, 31 May 2026 10:39:07 +0000

A design token system is the layer between your design decisions and your code. It is the difference between a stylesheet where colors, spacing, and border radii are scattered as hard-coded values across hundreds of CSS files, and one where every design decision is named, centralized, and changeable in a single place.

CSS custom properties are the native mechanism for implementing design tokens on the web. This article covers how to structure a token system that scales from a small project to a large application, and how to extend it to support dark mode and other theming scenarios.

The Two Layers of a Token System

A mature token system has two layers: primitive tokens and semantic tokens.

Primitive tokens are raw design values with no opinion about where they are used:

:root {
  /* Color scale */
  --blue-50: #eff6ff;
  --blue-100: #dbeafe;
  --blue-500: #3b82f6;
  --blue-900: #1e3a8a;

  --gray-50: #f9fafb;
  --gray-100: #f3f4f6;
  --gray-500: #6b7280;
  --gray-900: #111827;

  /* Spacing scale */
  --space-1: 4px;
  --space-2: 8px;
  --space-4: 16px;
  --space-6: 24px;
  --space-10: 40px;
}

Semantic tokens map primitive values to named roles in the UI:

:root {
  --color-background: var(--gray-50);
  --color-surface: var(--gray-100);
  --color-text-primary: var(--gray-900);
  --color-text-secondary: var(--gray-500);
  --color-text-on-accent: var(--gray-50);
  --color-accent: var(--blue-500);
  --color-accent-hover: var(--blue-900);

  --space-component-padding: var(--space-4);
  --space-control-padding: var(--space-2) var(--space-4);
  --space-section-gap: var(--space-10);
}

Components reference semantic tokens exclusively:

.button {
  background: var(--color-accent);
  padding: var(--space-control-padding);
  color: var(--color-text-on-accent);
}

.button:hover {
  background: var(--color-accent-hover);
}

This two-layer structure is what makes the system maintainable at scale. If the brand's primary blue changes from #3b82f6 to #2563eb, you change one primitive token value. Every semantic token that references --blue-500 updates automatically. Every component that uses --color-accent updates as a consequence. No grep-and-replace across the codebase.
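The cascade described above can be modelled outside the browser. The sketch below only illustrates how var() references chain from semantic tokens to primitive ones; it is not how a browser engine resolves custom properties, and resolveToken is a hypothetical name.

```javascript
// Resolve a token to its final value by following var(--name) references.
function resolveToken(tokens, name, seen = new Set()) {
  if (seen.has(name)) {
    throw new Error('Circular token reference: ' + name);
  }
  seen.add(name);
  const value = tokens[name];
  if (value === undefined) {
    throw new Error('Unknown token: ' + name);
  }
  // A value that is exactly "var(--something)" points at another token.
  const match = /^var\((--[\w-]+)\)$/.exec(value.trim());
  return match ? resolveToken(tokens, match[1], seen) : value;
}

const tokens = {
  '--blue-500': '#3b82f6',
  '--color-accent': 'var(--blue-500)',
};

const before = resolveToken(tokens, '--color-accent');
tokens['--blue-500'] = '#2563eb'; // change the primitive only
const after = resolveToken(tokens, '--color-accent');
```

Here before holds the old blue and after holds the new one, even though the --color-accent entry itself was never edited. That is the same indirection that lets one primitive change reach every component.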

Naming Tokens by Role, Not by Appearance

The most important decision in token naming is to name by semantic role rather than visual appearance.

Bad naming:

--color-blue: #3b82f6;
--color-light-gray: #f3f4f6;
--color-dark-gray: #1a1a1a;

Good naming:

--color-accent: #3b82f6;
--color-background-secondary: #f3f4f6;
--color-text-primary: #1a1a1a;

The bad naming fails in two ways. First, the name becomes misleading as soon as the value changes: if the brand accent moves from blue to purple, --color-blue either holds a purple value or has to be renamed everywhere it is used. Second, and more critically, appearance names are meaningless in dark mode. You cannot define --color-light-gray: #0f172a in a dark theme without the name becoming actively wrong. Semantic names survive theme changes because they describe what the value does, not what it looks like.

Extending the Token System for Dark Mode

Once the semantic token layer is in place, dark mode becomes a single override block. Define the dark mode semantic tokens under a [data-theme="dark"] attribute selector that JavaScript toggles on the <html> element:

[data-theme="dark"] {
  --color-background: var(--gray-900);
  --color-surface: #1e293b;
  --color-text-primary: var(--gray-50);
  --color-text-secondary: var(--gray-500);
  --color-accent: var(--blue-100);
  --color-accent-hover: var(--blue-50);
}

Every component in the application now responds to the theme change without any component-level changes. The semantic token layer absorbs the entire theming concern.

You can also use the CSS media feature prefers-color-scheme to apply the dark token set automatically when the user's OS is in dark mode, before any JavaScript runs:

@media (prefers-color-scheme: dark) {
  :root:not([data-theme="light"]) {
    --color-background: var(--gray-900);
    --color-surface: #1e293b;
    --color-text-primary: var(--gray-50);
    --color-text-secondary: var(--gray-500);
    --color-accent: var(--blue-100);
    --color-accent-hover: var(--blue-50);
  }
}

The :not([data-theme="light"]) guard keeps the OS-level dark tokens from applying once the user has explicitly chosen light, and the [data-theme="dark"] block covers an explicit dark choice on a light-mode OS. This gives you the correct priority order: the user's explicit preference from the UI toggle takes precedence over the OS default.
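The toggle that sets the attribute can stay small. This is a minimal sketch, assuming the data-theme attribute described above and a 'theme' storage key; toggleTheme and its injected root and storage parameters are hypothetical, so the function can be exercised outside a browser.

```javascript
// Flip the theme explicitly and persist the choice.
// root        -> element carrying data-theme (document.documentElement in the browser)
// storage     -> object with setItem (window.localStorage in the browser)
// prefersDark -> OS preference, used only while no explicit theme has been set
function toggleTheme(root, storage, prefersDark) {
  const current = root.getAttribute('data-theme') || (prefersDark ? 'dark' : 'light');
  const next = current === 'dark' ? 'light' : 'dark';
  root.setAttribute('data-theme', next); // the explicit choice now outranks the OS default
  storage.setItem('theme', next);
  return next;
}
```

In the browser this would be wired to a button as toggleTheme(document.documentElement, localStorage, window.matchMedia('(prefers-color-scheme: dark)').matches). Because the first toggle starts from the OS preference, a user on a dark-mode OS gets light on the first click rather than a no-op.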

Organizing a Large Token File

For applications with hundreds of tokens, a flat file becomes hard to navigate. Two organizational strategies work well:

Layered file structure: Split tokens into files by category (tokens/colors.css, tokens/spacing.css, tokens/typography.css) and import them in a single tokens/index.css. Each file is focused and manageable.

Comment-delimited sections in a single file: For smaller projects, sections within one file work well:

:root {
  /* =====================
     COLOR - PRIMITIVES
  ===================== */
  --blue-500: #3b82f6;
  /* ... */

  /* =====================
     COLOR - SEMANTIC
  ===================== */
  --color-accent: var(--blue-500);
  /* ... */

  /* =====================
     SPACING
  ===================== */
  --space-4: 16px;
  /* ... */
}

What matters more than the organizational structure is the convention: every color value used in a component stylesheet must have a corresponding entry in the token file. Any hex value that appears in a component stylesheet without a token reference is a gap in the system.

When to Add a New Token vs. Use an Existing One

The question of when to add new tokens versus reusing existing ones is where design token systems often accumulate clutter. A useful rule: add a new semantic token only when the UI has a distinct need that does not already have a named representation.

If a component needs a color that does not fit any existing semantic token, that is a signal that a new semantic category may be needed -- not just a one-off override. If the component is the only place in the application where this distinction matters, consider whether the distinction belongs in the token layer at all or should remain as a component-level style.

Common token categories that teams consistently underestimate when starting out: interactive state tokens (hover, focus, active, disabled), status tokens (success, warning, error, info), and surface hierarchy tokens (background, surface, overlay, tooltip). Building these categories from the beginning avoids the retrofitting work that comes when the first status message or modal is added and there are no matching tokens in the system.

Tooling That Supports Token System Discipline

Token systems accumulate gaps silently without enforcement. A few tools reduce the maintenance burden:

Stylelint can flag hard-coded hex values in component files, which catches gaps at commit time rather than in a periodic audit. Its built-in color-no-hex rule is one way to do this, provided the token files themselves are excluded from the rule. A value like background: #3b82f6 in a component file is a lint error; background: var(--color-accent) is not. This is the highest-leverage automation addition to a token system.

CSS DevTools panel in Chrome and Firefox shows computed custom property values per element, which makes debugging theme application problems significantly faster than inspecting computed styles manually.

Visual regression snapshots of both light and dark mode after any token change catch unintended cascade effects before they reach production.

Validation: Does Your Token System Cover What It Should?

After implementing a token system, a quick audit catches the most common gaps:

Grep your stylesheets for hex values (#[0-9a-fA-F]{3,8}, which also covers 4- and 8-digit hex colors with alpha). Any hard-coded hex outside the token file is a missing token.
Grep for hard-coded pixel values in spacing properties (margin, padding, gap) that do not use var(--space-*). These are gaps in the spacing token layer.
Apply your dark mode toggle and look for elements that remain light. Each one indicates a hard-coded value that bypassed the token system.
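The first audit step can be scripted. The sketch below scans stylesheet text for hex colors and skips lines that define custom properties, on the assumption that those are token definitions; findHardCodedHex is a hypothetical helper, and ID selectors made only of hex letters (for example #add) will show up as false positives.

```javascript
// Report hard-coded hex colors in a stylesheet's text.
// Returns [{ line, value }] for each hex color found outside custom property definitions.
function findHardCodedHex(cssText) {
  // 8-, 6-, 4-, or 3-digit hex colors, ending at a word boundary
  const hexPattern = /#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{3,4})\b/g;
  const findings = [];
  cssText.split('\n').forEach((text, index) => {
    // Skip token definitions such as "--blue-500: #3b82f6;"
    if (/^\s*--[\w-]+\s*:/.test(text)) return;
    for (const match of text.matchAll(hexPattern)) {
      findings.push({ line: index + 1, value: match[0] });
    }
  });
  return findings;
}
```

Feeding each component stylesheet's contents to this function gives a line-numbered list of values that still need a token. Skipping definition lines means the token file can go through the same check without drowning the report in its own primitives.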

"The most common mistake we see in token systems is adding tokens for one-off component needs rather than for genuine semantic distinctions. You end up with fifty color tokens when ten would cover the full design language. Start narrow, expand only when the UI demands it, and keep the token names honest about what they represent." -- Dennis Traina, founder of 137Foundry (view services)

The 137Foundry development team applies this audit at the beginning of any front-end refactor that includes theming work. The grep for hard-coded hex values reliably surfaces a first pass of missing tokens within minutes.

For the practical implementation of dark mode built on top of a CSS custom property token system, How to Add Dark Mode to a Web Application covers the full pattern including localStorage persistence and the flash-of-wrong-theme prevention.

The browser support for CSS custom properties is comprehensive -- caniuse confirms coverage across all modern browsers. This is a technique that is safe to ship without polyfills or build-tool dependencies in any application that does not target Internet Explorer.

How to Add Fuzzy Search to a JavaScript App With Fuse.js

137Foundry — Sat, 30 May 2026 10:27:52 +0000

Fuse.js is a zero-dependency JavaScript library that adds fuzzy search to any dataset that fits in memory. It handles typos, partial matches, and relevance ranking without a server or backend service. For datasets under 10,000 small objects, it is one of the fastest ways to add high-quality search to a web application.

This guide walks through setup, configuration, and the key options that determine how well search results match user intent. For context on the broader search-as-you-type implementation including debounce and request cancellation, the 137Foundry guide on search-as-you-type covers the complete approach. 137Foundry builds custom web applications that incorporate search as a core feature.

Step 1: Install Fuse.js

Install via npm for module-based projects:

npm install fuse.js

For browser-only projects without a build step, Fuse.js is also available via CDN -- check the Fuse.js releases page for the current version and include it as a script tag pointing to the jsdelivr or unpkg CDN.

Step 2: Prepare Your Dataset

Fuse.js works best with an array of objects. The library indexes the fields you specify and searches across them. A simple product catalog might look like:

const products = [
  { id: 1, name: 'Wireless Keyboard', category: 'Electronics', description: 'Compact Bluetooth keyboard' },
  { id: 2, name: 'USB-C Hub', category: 'Electronics', description: 'Multi-port USB hub for laptops' },
  { id: 3, name: 'Desk Lamp', category: 'Home Office', description: 'LED lamp with adjustable brightness' },
  // ... more items
];

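The dataset can come from a JSON file, an API call at page load, or be embedded in the HTML. Fuse.js indexes it in memory, so later changes to the original array are not picked up automatically: rebuild the index with fuse.setCollection(updatedArray), or use fuse.add() and fuse.remove() for incremental changes.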

Step 3: Create the Fuse Instance

import Fuse from 'fuse.js';

const options = {
  keys: ['name', 'category', 'description'],
  threshold: 0.3,
  includeScore: true,
};

const fuse = new Fuse(products, options);

The keys array tells Fuse which fields to search. The threshold controls how fuzzy the matching is. A threshold of 0.0 requires exact matches; 1.0 matches everything. A value of 0.3 is a good starting point: it catches typos and minor misspellings without returning too many irrelevant results.

includeScore: true adds a relevance score to each result (0.0 is a perfect match, higher values are worse matches). This is useful for debugging and for sorting results that pass the threshold. Enable it during development even if scores are not displayed in the UI -- logging scores for real user queries is the fastest way to determine whether the threshold needs adjustment for your specific dataset.
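One lightweight way to act on that advice is to summarize the scores returned for each query before logging them. The sketch below assumes results shaped like Fuse.js output with includeScore enabled (objects with a numeric score property); summarizeScores is a hypothetical helper, not part of the Fuse.js API.

```javascript
// Summarize the score distribution for one query's results.
function summarizeScores(results) {
  const scores = results.map((r) => r.score).sort((a, b) => a - b);
  if (scores.length === 0) {
    return { count: 0, best: null, worst: null, median: null };
  }
  const mid = Math.floor(scores.length / 2);
  const median = scores.length % 2 === 0
    ? (scores[mid - 1] + scores[mid]) / 2
    : scores[mid];
  return {
    count: scores.length,
    best: scores[0],                  // lowest score = closest match
    worst: scores[scores.length - 1], // highest score that still passed the threshold
    median,
  };
}
```

If the results users actually pick tend to sit near best while a long tail clusters near worst, that suggests the threshold could be tightened for this dataset.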

Step 4: Search and Render Results

function search(query) {
  if (query.length < 2) return [];
  return fuse.search(query);
}

// Each result looks like:
// { item: { id: 1, name: 'Wireless Keyboard', ... }, score: 0.12, refIndex: 0 }

fuse.search(query) returns results sorted by score (best match first). Each result wraps the original item with item, score, and refIndex properties. Access the original object through result.item.

Step 5: Wire It to a Search Input With Debounce

Combining Fuse.js with debounce gives you instant, typo-tolerant search without any server requests:

const searchInput = document.querySelector('#search-input');
const resultsContainer = document.querySelector('#results');

function debounce(fn, delay) {
  let timer;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delay);
  };
}

const handleSearch = debounce((query) => {
  const results = search(query.trim());
  renderResults(results);
}, 200);

searchInput.addEventListener('input', (e) => {
  handleSearch(e.target.value);
});

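function renderResults(results) {
  resultsContainer.replaceChildren(); // clear previous results
  if (results.length === 0) {
    const empty = document.createElement('p');
    empty.textContent = 'No results found.';
    resultsContainer.append(empty);
    return;
  }
  for (const { item } of results) {
    const row = document.createElement('div');
    row.className = 'result';
    // textContent keeps names from being interpreted as HTML
    row.textContent = item.name;
    resultsContainer.append(row);
  }
}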

Because Fuse.js searches in memory, there are no race conditions and no AbortController needed. The debounce delay can be shorter (150-200ms) than a server-based search because the search itself is instant.

Configuring Field Weights

Not all fields are equally important. Matching a query in the product name is more relevant than matching it in the description. Fuse.js supports per-key weights:

const options = {
  keys: [
    { name: 'name', weight: 3 },
    { name: 'category', weight: 1.5 },
    { name: 'description', weight: 1 },
  ],
  threshold: 0.3,
};

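Higher weight values make matches in that field count more toward the ranking. Fuse.js scores run the other way (lower is better), so a match in a heavily weighted field pulls the result's score closer to 0. A product whose name matches the query will rank significantly above one where only the description matches.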

Getting weights right usually takes experimentation. Start with name weighted 3x description and adjust based on what kinds of queries your users actually run.

Handling Nested Objects

Fuse.js can search nested fields using dot notation:

const data = [
  { id: 1, title: 'Article A', author: { name: 'Alice', role: 'Editor' } },
  // ...
];

const options = {
  keys: ['title', 'author.name'],
};

Arrays of strings within an object are also supported:

const data = [
  { id: 1, name: 'Product', tags: ['wireless', 'bluetooth', 'compact'] },
  // ...
];

const options = {
  keys: ['name', 'tags'],
};

Fuse.js searches each element of the array and treats a match in any element as a match for that field.
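Conceptually, a dotted key is a path walked through the object, with arrays contributing each of their elements as separate values. The sketch below illustrates that idea only; it is not Fuse.js's implementation, and collectValues is a hypothetical name.

```javascript
// Collect every value reachable at a dotted path, flattening arrays along the way.
function collectValues(obj, path) {
  let current = [obj];
  for (const segment of path.split('.')) {
    const next = [];
    for (const value of current) {
      if (value == null) continue; // path does not exist on this branch
      const child = value[segment];
      if (Array.isArray(child)) {
        next.push(...child); // each array element becomes its own candidate
      } else if (child !== undefined) {
        next.push(child);
      }
    }
    current = next;
  }
  return current;
}
```

For the article example, collectValues(record, 'author.name') yields a single-element list, while collectValues(record, 'tags') yields one entry per tag. A helper like this is handy for checking what text a given key would expose before adding it to keys.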

Limiting the Result Set

By default, Fuse.js returns all results above the threshold. For a search UI, this can be too many. Limit results with the limit option:

fuse.search(query, { limit: 10 });

Or slice the results array:

const results = fuse.search(query).slice(0, 10);

Limiting to 10-20 results is appropriate for most search UIs. Showing more than that without pagination creates a list that users do not scroll through anyway.

When Fuse.js Is Not the Right Tool

Fuse.js has limits. It does not support full-text search features like stemming (treating "running" and "runs" as the same term), phrase search, or proximity search. It loads the full dataset into memory, which is fine for small datasets but impractical for large ones. Memory use scales with object count and field size: ten thousand small objects might use 3-4 MB of browser memory, while one hundred thousand records with longer text fields can push past 20 MB and cause noticeable slowdowns on lower-end mobile hardware.

For datasets over 10,000 items, consider Lunr.js (pre-built inverted index), a server-side search API, or a dedicated search service like Algolia or Typesense. For the backend implementation and the server-side patterns that apply when the dataset is too large for client-side search, the 137Foundry guide on search-as-you-type covers the server-side approach including AbortController for managing in-flight requests.

A Complete Minimal Example

import Fuse from 'fuse.js';

const data = [
  { id: 1, name: 'Wireless Keyboard', category: 'Electronics' },
  { id: 2, name: 'USB Hub', category: 'Electronics' },
  { id: 3, name: 'Monitor Stand', category: 'Furniture' },
];

const fuse = new Fuse(data, {
  keys: [{ name: 'name', weight: 2 }, { name: 'category', weight: 1 }],
  threshold: 0.3,
  includeScore: true,
});

const input = document.querySelector('#search');
const results = document.querySelector('#results');

let timer;
input.addEventListener('input', (e) => {
  clearTimeout(timer);
  timer = setTimeout(() => {
    const query = e.target.value.trim();
    if (query.length < 2) {
      results.innerHTML = '';
      return;
    }
    const hits = fuse.search(query, { limit: 5 });
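    // Build nodes with textContent so item fields are never parsed as HTML
    results.replaceChildren(...hits.map(({ item }) => {
      const row = document.createElement('div');
      row.textContent = `${item.name} (${item.category})`;
      return row;
    }));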
  }, 200);
});

This is production-ready for small datasets. Add weight tuning, empty state messaging, and ARIA attributes for the complete implementation. 137Foundry handles the full stack when search is a core requirement, from library selection through relevance tuning and ongoing optimization.

7 Free Search Libraries and Tools for JavaScript Web Apps

137Foundry — Sat, 30 May 2026 10:25:32 +0000

Building search into a web application means choosing between client-side and server-side approaches, and between writing everything yourself or using a library that handles the heavy lifting. These seven tools cover the range from simple fuzzy matching to full production search engines.

For the implementation patterns that work regardless of which tool you choose, including debounce, request cancellation, and relevance scoring, the 137Foundry guide on search-as-you-type covers the complete approach. 137Foundry integrates these libraries into production web applications where search is a core feature.

1. Fuse.js

Fuse.js is a lightweight, zero-dependency JavaScript library for fuzzy searching. It works entirely in the browser, which means no server round-trip and no backend configuration. You pass it an array of objects and a set of keys to search, and it returns ranked results with fuzzy matching that handles typos.

Fuse.js is the right choice when your dataset fits in memory (typically under 10,000 small objects), when you want instant results without server requests, and when the full dataset can be safely exposed to the client. It supports field weighting, exact match priority over fuzzy matches, and a configurable fuzzy threshold.

Use it for: site-wide content search, documentation search, product catalog search for small stores, command palette interfaces.

2. Lunr.js

Lunr.js is a full-text search library for JavaScript, designed to create search indexes that behave like Solr or Elasticsearch but run entirely in the browser. Unlike Fuse.js, Lunr pre-builds an inverted index from your dataset, which makes large-dataset search significantly faster at the cost of an upfront indexing step.

Lunr supports boosting (weighting specific fields), stop words, stemming, and wildcard searches. The index can be built server-side and serialized to JSON for faster client-side loading. It is a good step up from Fuse.js when the dataset is larger or when you need proper full-text ranking rather than fuzzy matching.

Use it for: documentation sites, offline-capable web apps, static site search, any use case where the dataset changes infrequently.

3. Typesense

Typesense is an open-source search engine you can self-host or use through Typesense Cloud. It offers sub-10ms search responses, typo tolerance, faceting, and a clean API. Unlike a general-purpose database with full-text search bolted on, Typesense is designed specifically for instant search.

Typesense has official JavaScript and Node.js clients, React and Vue integrations, and a well-documented instant search UI widget library. For teams that want search-engine performance without paying for a SaaS product, self-hosting Typesense on a small instance is a viable option.

Use it for: e-commerce search, SaaS application search, any product where search quality is a competitive feature.

4. Meilisearch

Meilisearch is another open-source, self-hostable search engine with instant search results, typo tolerance, and faceting. Its developer experience is particularly smooth: simple HTTP API, official SDKs for most languages, and a dashboard for exploring your search index. Meilisearch is designed to be easy to set up and run, which makes it a popular choice for development teams that want search-engine quality without extensive DevOps work.

Meilisearch also has a cloud-hosted offering for teams that prefer not to manage infrastructure.

Use it for: the same use cases as Typesense; the choice between the two usually comes down to API preference and deployment environment.

5. Algolia

Algolia is the industry-standard hosted search-as-a-service platform. It handles infrastructure, relevance tuning, and scale; you send it your data and query it through their API. Algolia's InstantSearch library provides pre-built UI components for search inputs, results, facets, and pagination that work with React, Vue, Angular, and vanilla JavaScript.

Algolia is not free at scale, but it has a generous free tier suitable for small projects. Its primary advantages over self-hosted options are zero infrastructure management, extremely fast global CDN-backed responses, and a mature relevance tuning UI.

Use it for: production applications where search quality and response time matter, teams without dedicated DevOps capacity, global applications that need low-latency search across regions.

6. FlexSearch

FlexSearch is a high-performance full-text search library for JavaScript that claims to be the fastest in-browser search library available. It uses a different indexing approach than Lunr.js, trading some of Lunr's feature richness for raw speed. It supports multiple scoring algorithms, field weighting, and incremental updates to the index.

FlexSearch is a strong option when you have a large client-side dataset and search response time is the primary concern. It supports both browser and Node.js environments.

Use it for: large client-side datasets where Fuse.js starts to feel slow, real-time search in rich JavaScript applications.

7. PostgreSQL Full-Text Search

PostgreSQL's built-in full-text search is not a separate library, but it is frequently overlooked as a serious search option. Using tsvector and tsquery, PostgreSQL supports ranked full-text search with stemming, phrase search, proximity search, and field weighting.

For applications already using PostgreSQL, adding full-text search indexes avoids the operational overhead of a separate search service. It scales to millions of records with proper indexing and handles most search use cases without dedicated search infrastructure.

The limitation is that PostgreSQL full-text search does not support typo tolerance out of the box (fuzzy matching requires the pg_trgm extension) and its ranking algorithm is not as sophisticated as Algolia or Typesense. For many applications, though, it is plenty.

Use it for: applications already on PostgreSQL where search needs are moderate, teams that prefer fewer infrastructure components, internal tools and admin interfaces.
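To get a feel for what trigram-based fuzzy matching does, here is a rough Python approximation of the idea behind pg_trgm. It is a sketch under assumptions: the padding rule (two spaces before each word, one after) follows the pg_trgm documentation, but the tokenization and scoring here are simplified and will not reproduce PostgreSQL's results exactly.

```python
import re

def trigrams(text):
    """Return the set of padded, lowercase trigrams for each word in text."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(similarity("postgres", "postgres"))  # 1.0
print(similarity("postgres", "postgers") > similarity("postgres", "mysql"))  # True
```

A transposed pair of letters still shares most trigrams with the correct spelling, which is why this kind of score tolerates typos that exact token matching would miss.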

Choosing Between Them

The decision usually breaks down along two dimensions: whether the data can be loaded client-side or needs to stay on the server, and whether the team can manage a separate search infrastructure.

Need	Library
Client-side, small dataset, zero config	Fuse.js
Client-side, larger dataset, full-text ranking	Lunr.js or FlexSearch
Server-side, self-hosted, free	Typesense or Meilisearch
Server-side, hosted service	Algolia
Server-side, existing PostgreSQL	pg full-text + pg_trgm

Questions to Ask Before Choosing

Before committing to a library, four questions narrow the options quickly.

Can the full dataset be safely exposed to the client? Client-side libraries (Fuse.js, Lunr.js, FlexSearch) load the dataset into the browser. If the data contains prices, user records, unpublished content, or anything that should not be visible to all end users, client-side search is not appropriate regardless of performance benefits.

How often does the dataset change? Fuse.js re-indexes on each page load, which works when data changes infrequently. If records are added, updated, or deleted constantly, a server-side index with real-time update hooks handles freshness more reliably than reloading a static JSON file.

What level of relevance quality is required? For internal tools and admin interfaces, basic keyword matching is usually sufficient. For customer-facing product search, features like typo tolerance, field weighting, and faceting make a visible difference in how often users find what they are looking for on the first query.

What is the team's operational capacity? Algolia requires no infrastructure management but costs money at scale. Typesense and Meilisearch require a server but are free and self-hosted. PostgreSQL full-text requires no additional infrastructure if you are already running PostgreSQL.

The front-end implementation patterns (debounce, AbortController, loading states) apply regardless of which backend you use. The guide on building search-as-you-type covers those patterns in detail. 137Foundry has integrated Algolia, Typesense, and PostgreSQL full-text search into client projects across different scale requirements.

Python Try-Except vs Try-Finally: When to Use Each

137Foundry — Fri, 29 May 2026 11:49:30 +0000

Both try-except and try-finally are exception handling constructs in Python, but they serve different purposes. Mixing them up produces code that works accidentally rather than by design. This guide covers when each pattern is appropriate, when to combine them, and the specific cases where the distinction matters most.

What Each Construct Does

try-except catches exceptions. When an exception occurs inside the try block, Python looks for an except clause that matches the exception type and executes it instead of propagating the exception.

try-finally guarantees cleanup. The finally block runs no matter what happens in the try block -- whether the code completes normally, raises an exception, executes a return statement, or even calls sys.exit(). Exceptions are not caught or suppressed; they continue propagating after finally runs.

# try-except: handles the exception
try:
    result = int(user_input)
except ValueError:
    result = 0  # Exception caught; execution continues here

# try-finally: guarantees cleanup regardless of outcome
connection = None
try:
    connection = db.connect()
    result = connection.query(sql)
finally:
    if connection:
        connection.close()  # Always runs, even if query raised
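The claim that finally runs on return, on an exception, and on sys.exit() can be checked with a small self-contained script. The function names and the events list are only there for the demonstration.

```python
import sys

events = []

def returns_early():
    try:
        return "value"
    finally:
        events.append("cleanup after return")

def raises():
    try:
        raise ValueError("boom")
    finally:
        events.append("cleanup after raise")

def exits():
    try:
        sys.exit(1)  # raises SystemExit under the hood
    finally:
        events.append("cleanup after sys.exit")

result = returns_early()

try:
    raises()
except ValueError:
    events.append("ValueError still propagated")

try:
    exits()
except SystemExit:
    events.append("SystemExit still propagated")

print(result)
print(events)
```

In each case the cleanup entry is recorded first, and the exception (if any) then continues to the caller, which is the difference from try-except.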

When to Use try-except

Use try-except when you have a specific plan for what to do if an exception occurs. "A specific plan" means one of:

Return a fallback value
Log the error and continue with the next item
Convert the exception to a different type (exception chaining)
Re-raise with additional context

The critical distinction is catching specific exception types, not Exception or bare except:. Broad catches hide unexpected errors and make debugging harder.

# Good: specific exception type, specific handling
try:
    user = db.get_user(user_id)
except UserNotFoundError:
    return None
except DatabaseConnectionError as err:
    raise ServiceUnavailableError("Database offline") from err

# Problematic: catches everything including programming errors
try:
    process_record(record)
except Exception:
    pass  # Silently discards AttributeErrors, NameErrors, etc.

The Python language documentation covers the full built-in exception hierarchy and which errors should be caught versus propagated.

When to Use try-finally

Use try-finally when you need to release a resource or restore state, regardless of whether the operation succeeded. The resource does not have to be a file or database connection -- it could be a lock, a temporary directory, a global flag, or any state that must be cleaned up.

import os
import tempfile

def process_with_temp_file(data):
    # mkstemp creates the file securely; tempfile.mktemp is deprecated and race-prone
    fd, tmp_path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        return transform(tmp_path)
    finally:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)

The finally block here runs whether transform returns normally or raises, including on KeyboardInterrupt. The temp file is cleaned up in every case short of the process being killed outright (for example by SIGKILL).

For resources that implement the context manager protocol -- files, database connections from most ORMs, locks -- a with statement is cleaner and equivalent:

with open(path) as f:
    content = f.read()
# File is closed here even if read() raised

The explicit try-finally pattern remains appropriate when you need conditional cleanup, when the resource object is not a context manager, or when multiple resources need to be managed with different cleanup logic.
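As one illustration of conditional cleanup, consider a lock acquired with a timeout: the release must only happen if the acquire succeeded, which a plain with statement cannot express. The helper name update_shared_state and the timeout value are assumptions for this sketch.

```python
import threading

lock = threading.Lock()

def update_shared_state(action, timeout=0.1):
    """Run action under the lock; return False if the lock could not be acquired."""
    acquired = lock.acquire(timeout=timeout)
    try:
        if not acquired:
            return False
        action()
        return True
    finally:
        # Conditional cleanup: release only what this call actually acquired
        if acquired:
            lock.release()

print(update_shared_state(lambda: None))  # True
print(lock.locked())                      # False
```

If action raises, the exception propagates to the caller, but the lock is still released on the way out.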

Combining try-except-else-finally

Python allows all four clauses together. The else clause adds a fourth behavior: code that runs only when no exception was raised.

conn = None
try:
    conn = db.connect()
    rows = conn.query(sql)
except DatabaseConnectionError as err:
    log.exception("DB connection failed")
    raise
except QueryError as err:
    log.exception("Query failed: %s", sql)
    raise
else:
    # Only runs if try completed without raising
    process_rows(rows)
    log.info("Query returned %d rows", len(rows))
finally:
    # Always runs
    if conn:
        conn.close()

The else clause is the most underused part of this pattern. Without it, success-path code goes inside try, which means a QueryError raised by process_rows would be caught by the except QueryError handler -- almost certainly not the intended behavior. Putting success-path logic in else limits the except clauses to errors that actually originate in the try block.
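The difference is easy to reproduce with stand-in functions. Everything here (QueryError, run_query, process_rows) is a hypothetical placeholder, with process_rows deliberately raising the same exception type as the query layer.

```python
class QueryError(Exception):
    pass

def run_query():
    return [1, 2, 3]

def process_rows(rows):
    # Hypothetical success-path step that happens to raise the same type
    raise QueryError("downstream lookup failed")

def everything_in_try():
    try:
        rows = run_query()
        process_rows(rows)
    except QueryError:
        return "handler blamed the original query"
    return "ok"

def success_path_in_else():
    try:
        rows = run_query()
    except QueryError:
        return "handler blamed the original query"
    else:
        process_rows(rows)  # errors here are not seen by the except clause above
    return "ok"

print(everything_in_try())  # handler blamed the original query

try:
    success_path_in_else()
except QueryError as err:
    print("propagated:", err)  # propagated: downstream lookup failed
```

In the first version the handler runs for a failure it was never written for; in the second, the same failure surfaces to the caller with its real origin intact.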

The Common Mistake: try-except as a Substitute for finally

A common mistake is using try-except to handle cleanup, under the assumption that catching exceptions covers all cases:

# Wrong: cleanup only happens if exception occurs
try:
    connection = db.connect()
    result = connection.query(sql)
except Exception:
    connection.close()  # This is cleanup, not exception handling
    raise
# connection is never closed if query succeeds

This closes the connection if an exception occurs but not if the query succeeds. The correct structure uses finally for cleanup, with except only for cases where the caller has a specific response planned:

connection = db.connect()
try:
    result = connection.query(sql)
except QueryError as err:
    log.exception("Query failed: %s", sql)
    return None
finally:
    connection.close()  # Always runs

When to Separate Concerns Into Separate try Blocks

When you need different handling for different operations, separate them into different try blocks rather than catching multiple exception types in one broad handler.

# Single try block: hard to tell which operation failed
try:
    user = db.get_user(user_id)
    profile = api.fetch_profile(user.email)
    send_notification(user, profile)
except Exception as err:
    log.error("Something failed: %s", err)

# Separated: each operation has its own handler
try:
    user = db.get_user(user_id)
except UserNotFoundError:
    return {"error": "not_found"}

try:
    profile = api.fetch_profile(user.email)
except APIError as err:
    log.warning("Profile fetch failed for %s: %s", user.email, err)
    profile = None

send_notification(user, profile)

The separated version makes the intent explicit at each point and avoids the ambiguity of which operation the exception came from.

Testing Both Paths

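The pytest documentation at docs.pytest.org covers pytest.raises for asserting that exceptions are raised. The mocker fixture used below for simulating failures comes from the separate pytest-mock plugin, which wraps the standard library's unittest.mock: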

def test_cleanup_runs_on_failure(mocker):
    mock_conn = mocker.MagicMock()
    mocker.patch("db.connect", return_value=mock_conn)
    mocker.patch.object(mock_conn, "query", side_effect=QueryError)

    with pytest.raises(QueryError):
        run_query(sql)

    mock_conn.close.assert_called_once()  # Verify finally ran

Testing that cleanup runs on failure confirms the finally block is doing its job. Without this test, a refactor that accidentally moves the close() call inside else would not be caught until production.
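For a version that runs without any plugin, the same checks can be written with the standard library's unittest.mock. This is a sketch, not the article's original test: run_query here is a hypothetical function that takes its connection factory as a parameter so nothing needs to be patched, and it covers the success path as well as the failure path.

```python
from unittest import mock

class QueryError(Exception):
    pass

def run_query(connect, sql):
    """Hypothetical function under test: cleanup lives in finally."""
    conn = connect()
    try:
        return conn.query(sql)
    finally:
        conn.close()

def test_cleanup_runs_on_failure():
    conn = mock.MagicMock()
    conn.query.side_effect = QueryError("bad sql")
    try:
        run_query(lambda: conn, "SELECT 1")
    except QueryError:
        pass
    else:
        raise AssertionError("QueryError should have propagated")
    conn.close.assert_called_once()

def test_cleanup_runs_on_success():
    conn = mock.MagicMock()
    conn.query.return_value = [("row",)]
    assert run_query(lambda: conn, "SELECT 1") == [("row",)]
    conn.close.assert_called_once()
```

pytest collects both functions as-is; they can also be called directly from a script.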

Context Managers as a Formalized try-finally

The with statement is Python's built-in way to encapsulate try-finally cleanup into a reusable object. When you open a file with with open(path) as f:, the file object's __enter__ method runs at entry and __exit__ method runs at exit, regardless of whether an exception occurred. For cleanup purposes this behaves like wrapping the block in try-finally, with one addition: __exit__ receives the exception details and can choose to suppress the exception.

You can define your own context managers using contextlib.contextmanager:

from contextlib import contextmanager

@contextmanager
def managed_connection():
    conn = db.connect()
    try:
        yield conn
    finally:
        conn.close()

with managed_connection() as conn:
    result = conn.query(sql)

The yield splits the function into the setup (before yield) and teardown (after yield) phases. The finally in the context manager function guarantees the teardown runs. This is a cleaner way to express reusable resource management than repeating try-finally blocks throughout the codebase.

The Python documentation at docs.python.org covers the context manager protocol (__enter__ and __exit__) and the contextlib module in depth. The contextlib.suppress function is also part of contextlib and provides a concise alternative to try-except: pass for cases where an exception is genuinely expected and should be silenced cleanly. Python Enhancement Proposal 343 on peps.python.org introduced the with statement and explains the design intent behind the context manager protocol -- the __enter__ / __exit__ approach was chosen specifically to guarantee cleanup even in the presence of exceptions, return statements, and generator-based control flow.
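A short example of contextlib.suppress, using a temporary file so it can be run as-is. The helper name remove_if_present is made up for this sketch.

```python
import os
import tempfile
from contextlib import suppress

def remove_if_present(path):
    # Same effect as: try: os.remove(path) / except FileNotFoundError: pass
    with suppress(FileNotFoundError):
        os.remove(path)

fd, path = tempfile.mkstemp()
os.close(fd)

remove_if_present(path)  # removes the file
remove_if_present(path)  # already gone; FileNotFoundError is silenced
print(os.path.exists(path))  # False
```

Only the listed exception type is silenced; anything else, such as PermissionError, still propagates.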

Reference

The full collection of Python exception handling patterns -- including exception chaining, contextlib.suppress, logging with log.exception(), and custom exception hierarchies -- is in the Python Error Handling article on the 137Foundry blog. The try-except-else-finally pattern is covered there with additional context on how it fits into service boundary design.

Python Custom Exception Classes: When and How to Define Your Own

137Foundry — Fri, 29 May 2026 11:49:29 +0000

The built-in Python exception hierarchy covers system errors, input/output failures, and programming mistakes. It does not cover the domain-specific failure modes of your application. A UserNotFoundError is not in the standard library. Neither is RetryableError, InsufficientFundsError, or DataCorruptionError. Those are yours to define, and defining them correctly is the difference between exception handling that communicates intent and exception handling that just prevents crashes.

This post covers the practical side: when to create custom exceptions, how to structure them, what to put in them, and the common mistakes worth avoiding.

When Built-In Exceptions Are Not Enough

Built-in exceptions work well for errors that are universal and well-understood: ValueError for bad input, KeyError for missing dictionary keys, FileNotFoundError for missing paths. They fail as a communication mechanism when:

The caller needs to distinguish between failure modes. If your service can fail with a temporary network problem or a permanent data validation error, the caller needs to handle these differently. Raising Exception("temporary") or Exception("permanent") makes that distinction fragile -- it depends on string matching, not type matching.
The failure carries structured data. A DataProcessingError that includes the record ID, the source file, and the failed field name is far more actionable than a plain string message. Custom exception classes let you attach that data as attributes.
You want to catch all failures from a subsystem without catching everything. A background job processor can catch ServiceError to handle any failure from the service layer while still letting MemoryError or KeyboardInterrupt propagate.

The Minimal Custom Exception

The minimal correct custom exception is three lines:

class UserNotFoundError(Exception):
    """Raised when a user lookup returns no result."""
    pass

That is all you need to make UserNotFoundError a distinct type that callers can catch specifically. The docstring is the class documentation, not an error message.

You do not need to define __init__ unless you need custom attributes. The default Exception.__init__ accepts any positional arguments and stores them in self.args; when you pass a single message string, it becomes self.args[0], which is what prints when the exception is displayed.
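The default behavior can be confirmed in a few lines:

```python
class UserNotFoundError(Exception):
    """Raised when a user lookup returns no result."""

err = UserNotFoundError("no user with id 42")
print(err.args)  # ('no user with id 42',)
print(str(err))  # no user with id 42

try:
    raise err
except UserNotFoundError as caught:
    print(type(caught).__name__)  # UserNotFoundError
```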

Adding Structured Data to Custom Exceptions

When the exception needs to carry context beyond a message, define __init__:

class DataProcessingError(Exception):
    def __init__(self, message, record_id=None, field=None):
        super().__init__(message)
        self.record_id = record_id
        self.field = field

Call super().__init__(message) to preserve standard exception behavior -- the message will appear in tracebacks and str(err) will return it. Custom attributes go after.

Usage:

raise DataProcessingError(
    "Invalid date format in record",
    record_id=record.id,
    field="created_at"
)

The caller can now access err.record_id and err.field to build a specific error response or retry payload, rather than parsing the message string.

Building a Hierarchy

A flat list of custom exceptions works for small codebases. At scale, a hierarchy rooted at a custom base class lets callers choose how broadly to catch:

class ServiceError(Exception):
    """Base for all service-layer failures."""
    pass

class RetryableError(ServiceError):
    """The operation failed but may succeed on retry."""
    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after

class PermanentError(ServiceError):
    """The operation will never succeed on retry."""
    pass

class NotFoundError(PermanentError):
    """The requested resource does not exist."""
    def __init__(self, resource_type, resource_id):
        super().__init__(f"{resource_type} {resource_id!r} not found")
        self.resource_type = resource_type
        self.resource_id = resource_id

A background job processor can now catch RetryableError to schedule a retry, catch PermanentError to escalate and discard, and let unexpected exceptions propagate to the outer error handler:

try:
    result = service.process(job)
except RetryableError as err:
    queue.schedule_retry(job.id, delay=err.retry_after)
except PermanentError as err:
    alerts.send(f"Permanent failure on {job.id}: {err}")
    job.mark_failed()

Exception Chaining for Cause Preservation

When you catch a low-level exception and raise a domain-level one, use raise X from Y to preserve the original cause:

try:
    raw = db.query(sql)
except DatabaseConnectionError as err:
    raise RetryableError("Database temporarily unavailable") from err

The from err clause sets __cause__ on the new exception. When the traceback prints, Python shows both the original DatabaseConnectionError and the new RetryableError. The root cause is not replaced -- it is chained. The Python Enhancement Proposals on peps.python.org -- specifically PEP 3134 -- explain the design intent behind __cause__ and __context__. The Python language documentation covers the full exception hierarchy and the chaining semantics.

Without chaining, the database error disappears from the traceback. The operator sees a RetryableError with no indication of what the underlying problem was, which makes root cause analysis much harder.
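The chained cause is also available programmatically, which is useful in logging and alerting code. A minimal runnable sketch, with both exception classes defined locally as stand-ins:

```python
class DatabaseConnectionError(Exception):
    pass

class RetryableError(Exception):
    pass

def fetch():
    try:
        raise DatabaseConnectionError("connection refused")
    except DatabaseConnectionError as err:
        raise RetryableError("Database temporarily unavailable") from err

try:
    fetch()
except RetryableError as err:
    caught = err

print(type(caught.__cause__).__name__)  # DatabaseConnectionError
print(caught.__cause__)                 # connection refused
```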

Common Mistakes

Inheriting from BaseException instead of Exception. BaseException is the root of all exceptions including SystemExit and KeyboardInterrupt. Your custom exception should inherit from Exception (or a subclass), not from BaseException. Inheriting from BaseException means your exception bypasses broad except Exception handlers that are catching for cleanup purposes.

Catching your own base class too broadly in the service. A service that catches ServiceError everywhere to suppress failures is using the exception hierarchy for suppression rather than discrimination. Catch at the appropriate level: the outer error boundary handles ServiceError broadly; internal logic handles RetryableError and PermanentError specifically.

Putting too much in the message, not enough in attributes. An error message that reads "Record 4823 failed on field 'amount' with value '$14.00'" carries useful data in an inaccessible form. The record ID, field name, and value should be separate attributes so callers can route, log, or persist them without parsing.

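Forgetting to call super().__init__(). If you define __init__ without calling super().__init__(message), your formatted message never reaches the base class. The traceback and str(err) then show only the raw positional arguments passed to the constructor (or nothing, if everything was passed by keyword). This is a common source of confusing traceback output.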
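The first mistake in this list is the easiest to demonstrate. In this sketch, run_with_boundary is a hypothetical error boundary that is meant to absorb any job failure:

```python
class BadJobError(BaseException):  # the mistake: should derive from Exception
    pass

class GoodJobError(Exception):
    pass

def run_with_boundary(exc_type):
    """Hypothetical error boundary that should handle every job failure."""
    try:
        raise exc_type("job failed")
    except Exception:
        return "handled"

print(run_with_boundary(GoodJobError))  # handled

try:
    run_with_boundary(BadJobError)
except BaseException as err:
    escaped = type(err).__name__

print(escaped)  # BadJobError
```

The BaseException subclass passes straight through the except Exception handler, exactly as SystemExit and KeyboardInterrupt are designed to.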

Testing Custom Exceptions

The pytest documentation at docs.pytest.org covers pytest.raises with the match parameter for asserting on exception messages, and accessing exc_info.value for asserting on custom attributes:

def test_not_found_error_carries_resource_info():
    with pytest.raises(NotFoundError) as exc_info:
        service.get_user(nonexistent_id)
    assert exc_info.value.resource_type == "User"
    assert exc_info.value.resource_id == nonexistent_id

def test_retryable_error_on_timeout(mocker):
    mocker.patch("service.db.query", side_effect=TimeoutError)
    with pytest.raises(RetryableError):
        service.fetch_record(record_id)

Testing that the retry logic triggers on RetryableError is only meaningful if the test also verifies that the right exception type is raised on timeout. Custom exception hierarchies make this testing pattern natural.

Controlling How Custom Exceptions Display

By default, an exception displays its class name and the message passed to __init__. Sometimes you want a different string representation -- for example, when the exception stores multiple attributes and you want the display to include all of them.

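One way to customize the display is to build the formatted message in __init__ and pass it to super().__init__() (overriding __str__ is the alternative):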

class ValidationError(Exception):
    def __init__(self, field, value, reason):
        self.field = field
        self.value = value
        self.reason = reason
        super().__init__(f"{field}: {reason} (got {value!r})")

err = ValidationError("amount", "-5", "must be positive")
str(err)  # "amount: must be positive (got '-5')"

Calling super().__init__(message) with the formatted message ensures the display is correct in tracebacks, in str(err), and in logging. This is important: if you define __init__ without calling super().__init__(), Python still records the raw constructor arguments in err.args, so the exception displays as the tuple ('amount', '-5', 'must be positive') rather than your formatted message, which is confusing in tracebacks.

The !r formatting in f-strings applies repr() to the value, which adds quotes around strings and so makes the difference between, say, the string '5' and the integer 5 visible. This is useful in exception messages where the value might be an empty string, None, or a type that looks similar to another at a glance.
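For comparison, here is a sketch of the same exception with __str__ overridden directly. It is a variation on the class above, not a replacement recommended by any particular source; the choice to pass the raw arguments to super().__init__() is an assumption made for this example.

```python
class ValidationError(Exception):
    def __init__(self, field, value, reason):
        super().__init__(field, value, reason)  # keeps err.args populated
        self.field = field
        self.value = value
        self.reason = reason

    def __str__(self):
        # Built on demand, so it reflects the current attribute values
        return f"{self.field}: {self.reason} (got {self.value!r})"

err = ValidationError("amount", "-5", "must be positive")
print(str(err))  # amount: must be positive (got '-5')
print(err.args)  # ('amount', '-5', 'must be positive')
```

With this layout err.args mirrors the constructor signature, which tends to matter if the exception is ever pickled (for example when it crosses a process boundary), since unpickling re-calls the constructor with err.args.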

Where to Go From Here

The full reference for Python exception handling patterns -- including try-except-else-finally, contextlib.suppress, logging with log.exception(), and re-raising with bare raise -- is collected in the Python Error Handling article on the 137Foundry blog. Custom exception classes are most effective when the rest of the exception handling stack is also well-structured: a hierarchy without consistent catch/raise patterns at service boundaries will not deliver its full diagnostic value.

Python Libraries for Building Message Queue Consumers

137Foundry — Thu, 28 May 2026 10:27:59 +0000

Building an event-driven data pipeline in Python requires choosing a library that fits your broker, your processing model, and your operational requirements. Here are the primary options, with a plain-language breakdown of what each is best for.

The right choice usually comes down to three factors: which broker you're already running (or can support operationally), how much task management abstraction you need, and whether your processing model is synchronous or async. Getting this wrong early means retrofitting later -- either migrating the library mid-project or working around limitations in retry, dead-letter queue, or monitoring support.

1. Celery

https://pypi.org/project/celery/

Celery is the most widely used Python distributed task queue. It runs on top of Redis or RabbitMQ and handles the full worker lifecycle: task distribution, retry logic, rate limiting, result storage, scheduling, and monitoring.

What it does well: Celery abstracts most of the message broker complexity. You define tasks as Python functions, decorate them with @app.task, and Celery handles distributing them to workers. Retry configuration, countdown timers, and rate limits are built-in. The task model is clean and familiar.

Operational overhead: Celery runs worker processes that need to be managed separately from your application. The full feature set (Celery Beat for scheduling, Flower for monitoring) adds operational components that need to be deployed and maintained.

Best for: Python applications that need reliable distributed task processing with built-in retry, rate limiting, and monitoring. The standard choice for production data pipelines that need more than a raw queue client.

2. RQ (Redis Queue)

https://pypi.org/project/rq/

RQ is a simpler Python task queue that runs exclusively on Redis. It trades Celery's feature richness for a significantly simpler setup and mental model.

What it does well: RQ is easy to understand and operate. Enqueue a function with q.enqueue(process_event, event_data). Run workers with rq worker. The dashboard (rq-dashboard) is lightweight and informative. For teams that find Celery's configuration overwhelming, RQ is often the right starting point.

Operational overhead: Lower than Celery. Fewer moving parts. The reduced feature set means less to configure and less to break.

Best for: Teams that want a simple, Redis-based task queue without the full Celery feature set. Good for smaller pipelines or as a learning entry point before scaling to Celery.

3. Dramatiq

https://pypi.org/project/dramatiq/

Dramatiq is a newer Python task processing library that runs on Redis or RabbitMQ. It's designed to be more predictable than Celery, with simpler configuration and better default retry behavior.

What it does well: Dramatiq's retry behavior is more opinionated and consistent than Celery's out of the box. The actor model is clean: define a function with @dramatiq.actor and send it messages. Error handling and middleware are composable.

Operational overhead: Similar to Celery but with simpler defaults. Fewer gotchas around serialization and worker configuration.

Best for: Teams that have been burned by Celery's configuration complexity but need more features than RQ provides. A reasonable middle ground.

4. redis-py Streams API

https://pypi.org/project/redis/

The redis-py library, Redis's official Python client, includes a full Streams API for working with Redis Streams natively. Redis Streams are a persistent, log-based message primitive that supports consumer groups, message acknowledgment, and pending entry lists.

What it does well: Direct access to Redis Streams without a task framework layer. Full control over consumer group setup, message acknowledgment, and pending entry management. Lower latency than Celery for simple consumer patterns because there's no framework overhead.

Operational overhead: You write the consumer loop yourself. Retry logic, dead-letter queue behavior, and worker management all need to be implemented in application code. This is lower-level than Celery or RQ.

Best for: Pipelines that need precise control over message delivery semantics and don't need Celery's task management features. Also useful for teams that want to understand message queue mechanics before adopting a higher-level framework.
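To make "you write the consumer loop yourself" concrete, here is a minimal sketch of one iteration of such a loop. It assumes client is a redis.Redis instance (or any object exposing the same xreadgroup and xack methods) and that the consumer group already exists, for example via client.xgroup_create(stream, group, id="0", mkstream=True). The function name consume_batch and its parameters are made up for this example, and retry and dead-letter handling are intentionally left out.

```python
import logging

log = logging.getLogger("stream-consumer")

def consume_batch(client, stream, group, consumer, handler, count=10, block_ms=1000):
    """Read up to `count` new entries for this consumer; ack the ones that succeed."""
    # ">" asks for entries never delivered to any consumer in this group
    response = client.xreadgroup(group, consumer, {stream: ">"}, count=count, block=block_ms)
    acked = []
    for _stream_name, entries in response or []:
        for entry_id, fields in entries:
            try:
                handler(fields)
            except Exception:
                # Not acknowledged: the entry stays in the pending entries list
                # so it can be inspected, claimed, or retried later.
                log.exception("Handler failed for entry %s", entry_id)
                continue
            client.xack(stream, group, entry_id)
            acked.append(entry_id)
    return acked
```

A real worker would call this in a loop and add a separate pass over the pending entries list; that second pass is where the retry and dead-letter logic mentioned above would live.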

5. aio-pika

https://pypi.org/project/aio-pika/

aio-pika is an asyncio-native Python client for RabbitMQ. Unlike Celery (which uses synchronous workers by default) or redis-py's synchronous queue commands, aio-pika is designed for async consumer code.

What it does well: Full AMQP support with asyncio integration. Suitable for pipelines where the consumer logic includes async I/O (database calls, API requests, file operations) and you want to take advantage of Python's asyncio concurrency rather than spawning multiple synchronous workers.

Operational overhead: Requires asyncio proficiency. Not a drop-in replacement for Celery or RQ -- it's a different model.

Best for: High-throughput pipelines where the processing involves I/O-bound async operations and you're building on an asyncio-based Python application. Not the default choice for CPU-bound processing.

6. pika

https://pypi.org/project/pika/

Pika is a pure-Python AMQP 0-9-1 client and the library used in RabbitMQ's own Python tutorials. It provides low-level AMQP protocol access without the framework abstractions of Celery or Dramatiq.

What it does well: Direct control over AMQP exchanges, queues, bindings, and consumer acknowledgment. Useful for complex routing scenarios that require custom exchange configuration.

Operational overhead: Highest of the options listed. You're implementing consumer logic, retry handling, and DLQ routing in application code.

Best for: Teams that need fine-grained AMQP control and have the Python expertise to build consumer infrastructure from primitives.

How to Choose

For most Python event-driven pipelines, the decision is:

New pipeline, need reliability and monitoring: Celery with Redis or RabbitMQ
Simpler requirements, want minimal configuration: RQ with Redis
Celery complexity has been a problem, need middle ground: Dramatiq
Building on RabbitMQ, want asyncio: aio-pika
Direct Redis Streams control: redis-py Streams API

A few clarifying questions that narrow the choice faster:

Is Redis already in your stack? If Redis is already running for caching, sessions, or rate limiting, using it as a message broker adds a queue use case to an existing service without introducing a new infrastructure dependency. Celery with Redis or RQ with Redis are both strong choices in this case.

Does your processing involve async I/O? If the consumer code makes async database calls or API requests, aio-pika with asyncio handles more concurrent tasks per worker than a synchronous Celery pool. For CPU-bound work, synchronous Celery workers with a process pool are the standard.

How much configuration complexity can the team absorb? Celery's feature set is extensive and its configuration surface is large -- some settings interact in non-obvious ways. RQ is simpler but more limited. Dramatiq sits between them. If the team is less experienced with distributed task queues, starting with RQ and migrating to Celery when you hit a specific limitation is a practical path.
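The async I/O point can be illustrated with the standard library alone. This sketch does not use aio-pika: the list of strings stands in for messages delivered by a broker, asyncio.sleep stands in for an async database call or HTTP request, and the names handle and consume are made up for the example.

```python
import asyncio

async def handle(message, limit):
    async with limit:              # cap the number of in-flight handlers
        await asyncio.sleep(0.01)  # stand-in for an async DB call or API request
        return message.upper()

async def consume(messages, max_in_flight=5):
    limit = asyncio.Semaphore(max_in_flight)
    # gather runs the handlers concurrently and returns results in input order
    return await asyncio.gather(*(handle(m, limit) for m in messages))

results = asyncio.run(consume(["a", "b", "c"]))
print(results)  # ['A', 'B', 'C']
```

While one handler is waiting on I/O, the others make progress on the same thread, which is why a single async worker can keep many I/O-bound messages in flight. For CPU-bound handlers this gives no benefit, because nothing ever yields control.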

Before You Commit: What to Validate

Before finalizing your library choice, run these tests in a staging environment:

Kill a worker mid-task. Send a message, wait for the worker to start processing, kill the process, and confirm the message is redelivered (not dropped) when the worker restarts. With the default task_acks_late=False, Celery acknowledges the message just before the task starts executing, so a task killed mid-run is not redelivered; with task_acks_late=True, it acknowledges after execution. Know which behavior you have before it matters in production.

Send a message that will fail every retry. Confirm it ends up in a dead-letter destination (a DLQ list, a failed queue, or a visible error state) rather than disappearing silently. The DLQ behavior for exhausted-retry messages differs significantly between libraries and broker configurations.

Send a burst of 1,000 messages. Watch queue depth and consumer lag. A consumer that processes 10 messages in a test may behave differently at 1,000 -- especially if you hit connection pool limits, rate limits on a downstream service, or memory limits in the broker.

Check what monitoring the library exposes. Flower is the standard for Celery. rq-dashboard works with RQ. aio-pika and pika require custom instrumentation or broker-level metrics via the RabbitMQ management API. Confirm you can see queue depth and failure rates in your existing monitoring stack before committing to a library that makes this hard.

These tests take a few hours. The alternative is discovering library-specific behaviors in production under load.

The context for how these libraries fit into a full event-driven pipeline architecture -- from producer setup through consumer acknowledgment to failure handling -- is covered in 137Foundry's guide "How to Build an Event-Driven Data Pipeline With Python and Message Queues". The code patterns in that guide use redis-py and Celery but the architectural decisions apply to any of the libraries listed here.

How to Set Up a Dead-Letter Queue in Celery With Redis

137Foundry — Thu, 28 May 2026 10:27:57 +0000

When a Celery task fails after all retries, where does it go? By default, Celery with a Redis broker just marks the task as failed and moves on. The failed task is logged, but it's not stored anywhere accessible -- and unless you're watching logs or have monitoring set up, you might not know it happened until something downstream breaks.

Without a dead-letter queue, you have two choices when a task fails permanently: lose the data silently, or stop processing everything until the issue is manually fixed. Neither is acceptable for a production pipeline where missing even a small percentage of events can cause data integrity problems downstream.

A dead-letter queue (DLQ) changes this: failed tasks get routed to a dedicated queue where you can inspect them, requeue them manually when the underlying issue is fixed, or log them for audit. Setting one up in Celery with Redis requires a small amount of configuration but dramatically improves your ability to operate data pipelines reliably.

Understanding Celery's Retry and Failure Flow

Before configuring a DLQ, it helps to understand Celery's default behavior:

A task is called. It executes and either succeeds (returns a value) or fails (raises an exception).
If it fails and the task requests a retry (by calling self.retry() or through autoretry_for), Celery schedules a retry after the configured countdown, up to max_retries (3 by default).
Once max_retries is exhausted, the task fails for good: Celery re-raises the original exception if one was passed to retry(), or raises MaxRetriesExceededError otherwise. The task enters FAILURE state.
The failure is logged, the result is stored in the result backend (if configured), and nothing else happens.

There is no built-in routing of failed tasks to a separate queue. You have to build that.

Option 1: Route to a DLQ Using task_failure Signal

The cleanest approach for Redis-backed Celery is to use the task_failure signal to route failed tasks to a dedicated Redis list:

from celery.signals import task_failure
import redis
import json
from datetime import datetime, timezone

r = redis.Redis(host='localhost', port=6379, db=0)
DLQ_KEY = "celery_dlq"

@task_failure.connect
def handle_task_failure(sender=None, task_id=None, exception=None,
                         args=None, kwargs=None, traceback=None, einfo=None, **kw):
    dlq_entry = {
        "task_id": task_id,
        "task_name": sender.name if sender else "unknown",
        "exception": str(exception),
        # Keep args/kwargs as a list and dict so the entry can be re-dispatched later
        "args": list(args or []),
        "kwargs": dict(kwargs or {}),
        "failed_at": datetime.now(timezone.utc).isoformat()
    }
    # default=str stringifies any value JSON cannot encode natively
    r.rpush(DLQ_KEY, json.dumps(dlq_entry, default=str))

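Import the module that defines this handler during Celery app initialization so the signal is connected in every worker. Every task that ends in the FAILURE state, whether it exhausted its retries or never retried at all, pushes an entry to the celery_dlq list in Redis.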

You can then inspect DLQ contents from a management script or a monitoring dashboard:

def inspect_dlq(limit=100):
    entries = r.lrange(DLQ_KEY, 0, limit - 1)
    for entry in entries:
        print(json.loads(entry))
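Printing raw entries gets noisy once the list grows. As a small sketch, the hypothetical helper below (summarize_dlq is not part of Celery or redis-py) counts entries per task name, which is usually the first question when triaging a DLQ. It takes raw JSON strings or bytes, so the sample data stands in for what lrange would return.

```python
import json
from collections import Counter


def summarize_dlq(raw_entries):
    """Count DLQ entries per task name from raw JSON strings or bytes."""
    counts = Counter()
    for raw in raw_entries:
        entry = json.loads(raw)
        counts[entry.get("task_name", "unknown")] += 1
    return counts


# Sample data standing in for r.lrange(DLQ_KEY, 0, -1)
sample = [
    json.dumps({"task_name": "process_event", "exception": "KeyError('id')"}),
    json.dumps({"task_name": "process_event", "exception": "KeyError('id')"}),
    json.dumps({"task_name": "send_email", "exception": "TimeoutError()"}).encode(),
]
print(summarize_dlq(sample).most_common())
# [('process_event', 2), ('send_email', 1)]
```

Against a live DLQ, the call would be summarize_dlq(r.lrange(DLQ_KEY, 0, -1)).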

Option 2: Dedicated DLQ Task Queue

An alternative is to route failed tasks to a dedicated Celery queue rather than a raw Redis list. This allows DLQ items to be requeued as Celery tasks when the underlying issue is resolved:

from celery import Celery
from celery.signals import task_failure

app = Celery('myapp', broker='redis://localhost:6379/0')

@app.task(name='dlq.failed_task')
def dead_letter_task(original_task_name: str, task_id: str,
                     exception: str, args: list, kwargs: dict):
    # Store in Redis or log -- this task is for inspection, not re-processing
    print(f"DLQ: {original_task_name} [{task_id}] failed with: {exception}")

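@task_failure.connect
def on_task_failure(sender=None, task_id=None, exception=None,
                    args=None, kwargs=None, **kw):
    # Skip failures of the DLQ task itself to avoid an endless dispatch loop
    if sender is not None and sender.name == 'dlq.failed_task':
        return
    dead_letter_task.apply_async(
        args=[sender.name if sender else "unknown", task_id,
              str(exception), list(args or []), dict(kwargs or {})],
        queue='dlq'
    )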

A dedicated worker started with celery -A myapp worker -Q dlq (assuming the app is importable as myapp) processes only DLQ tasks. This separation keeps DLQ handling isolated from normal task workers and makes it visible as a queue in any Celery monitoring tool.

Option 3: Custom Task Base Class With Built-In DLQ

For a clean approach that doesn't require signal handlers, create a custom base task:

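from celery import Celery, Task
import redis
import json
from datetime import datetime, timezone

app = Celery('myapp', broker='redis://localhost:6379/0')
r = redis.Redis(host='localhost', port=6379, db=0)

class DLQTask(Task):
    abstract = True
    max_retries = 3

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        # Called by the worker once the task has failed for good
        entry = {
            "task_id": task_id,
            "task_name": self.name,
            "exception": str(exc),
            "traceback": str(einfo),
            # Stored as a list and dict so the entry can be re-dispatched later
            "args": list(args or []),
            "kwargs": dict(kwargs or {}),
            "failed_at": datetime.now(timezone.utc).isoformat()
        }
        r.rpush("celery_dlq", json.dumps(entry, default=str))
        super().on_failure(exc, task_id, args, kwargs, einfo)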

@app.task(base=DLQTask, bind=True)
def process_event(self, event_data: dict):
    # Processing logic
    pass

Any task that inherits from DLQTask automatically routes failures to the DLQ.

Adding Alerts on DLQ Depth

A DLQ without monitoring is only marginally better than no DLQ. Add a simple depth check that runs on a schedule:

def check_dlq_depth(threshold: int = 10):
    depth = r.llen("celery_dlq")
    if depth > threshold:
        # send_alert is a placeholder for your own notifier:
        # Slack webhook, PagerDuty, email, etc.
        send_alert(f"DLQ depth is {depth} (threshold: {threshold})")

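Run this every 5 minutes with a lightweight scheduler or as part of your existing monitoring. Datadog, Prometheus, and most APM platforms can also collect Redis metrics and alert on list length, though tracking the length of a specific key typically has to be enabled explicitly in the integration or exporter configuration.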
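Depth alone can miss a handful of entries that have sat unreviewed for days. One complementary check, sketched here under assumptions, is the age of the oldest entry. It assumes entries are appended with rpush (so index 0 is the oldest), that each entry has a failed_at ISO timestamp, and that naive timestamps were written in UTC. The function name is hypothetical.

```python
import json
from datetime import datetime, timezone


def oldest_entry_age_seconds(raw_head, now=None):
    """Age in seconds of a DLQ entry, based on its failed_at field."""
    entry = json.loads(raw_head)
    failed_at = datetime.fromisoformat(entry["failed_at"])
    if failed_at.tzinfo is None:
        # Assumption: naive timestamps were written in UTC
        failed_at = failed_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - failed_at).total_seconds()


# Sample entry standing in for r.lindex("celery_dlq", 0)
head = json.dumps({"task_id": "abc", "failed_at": "2026-05-28T10:00:00"})
now = datetime(2026, 5, 28, 10, 5, tzinfo=timezone.utc)
print(oldest_entry_age_seconds(head, now))  # 300.0
```

In a real check, lindex returns None when the list is empty, so the caller needs to handle that case before parsing.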

Re-queuing DLQ Messages

Once the underlying issue (bad data, downstream service outage, schema mismatch) is resolved, re-queuing DLQ messages means inspecting the failure entries and re-dispatching them:

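def requeue_dlq_entries(limit: int = 100):
    entries = r.lrange("celery_dlq", 0, limit - 1)
    for raw in entries:
        entry = json.loads(raw)
        args, kwargs = entry.get("args") or [], entry.get("kwargs") or {}
        # Only entries that stored args/kwargs as a JSON list/object can be replayed
        if not isinstance(args, list) or not isinstance(kwargs, dict):
            continue  # leave the entry in the DLQ for manual review
        # Re-dispatch by name -- the task must be registered on the receiving workers
        app.send_task(entry["task_name"], args=args, kwargs=kwargs)
        # Remove only the entry that was just re-dispatched
        r.lrem("celery_dlq", 1, raw)

This script reads up to 100 DLQ entries, re-dispatches each replayable one as a Celery task, and removes it from the DLQ list. Entries whose arguments were stored as strings (for example with str(kwargs)) cannot be passed back to a task, so they stay in the list for manual review.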

What to Include in Each DLQ Entry

The DLQ entry is your primary debugging artifact. What you store determines how quickly you can diagnose and fix the underlying issue when you finally look at it.

At minimum, every DLQ entry should include:

The task name. Which Celery task failed. This lets you group failures by task type: if all failures are from one task name, the issue is likely in that task's code or configuration. If failures span many task types, the issue is likely in the data or a shared dependency.

The task ID. The Celery task ID lets you cross-reference DLQ entries with your application logs. If you have structured logging in your tasks that includes task_id at start and completion, you can trace a failed task from the DLQ entry back to its full execution log.

The exception string. str(exc) is usually sufficient for the message, but it does not include the exception class, so record type(exc).__name__ alongside it. The exception type tells you whether the failure is transient (network error, connection timeout) or permanent (schema validation error, missing required field).

The traceback. For permanent failures, the traceback is the fastest path to the specific line of code that failed. The einfo parameter in on_failure provides a formatted traceback via str(einfo).

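The task arguments. args and kwargs give you the inputs to the failed task. This is essential for debugging data-specific failures where only certain inputs trigger the error. str(args) and str(kwargs) are enough for reading, but store them as a JSON list and object if you plan to re-dispatch entries, because a stringified dict cannot be passed back to a task.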

The failure timestamp. A UTC timestamp taken at failure time, such as datetime.now(timezone.utc).isoformat() (datetime.utcnow() is deprecated as of Python 3.12). When you're reviewing DLQ contents hours after the failure occurred, knowing when each entry failed helps determine whether failures are ongoing or historical.

Options 1 and 3 above store the core fields directly; Option 2 passes the task name, ID, exception, and arguments to the DLQ task, so add a timestamp there if you need one. If you're building a custom DLQ handler, start with {task_name, task_id, exception, failed_at} as the minimum viable entry, and add traceback and args when you've encountered a debugging session where you wished you had them.
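The fields listed above can be assembled in one place so every handler writes the same shape. This is a minimal sketch using only the standard library; build_dlq_entry and the exception_type field are illustrative names, not part of Celery.

```python
import json
import traceback
from datetime import datetime, timezone


def build_dlq_entry(task_name, task_id, exc, args=None, kwargs=None):
    """Serialize one failed-task record as a JSON string."""
    entry = {
        "task_name": task_name,
        "task_id": task_id,
        "exception_type": type(exc).__name__,
        "exception": str(exc),
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "args": list(args or []),
        "kwargs": dict(kwargs or {}),
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
    # default=str keeps the dump from failing on values JSON cannot encode
    return json.dumps(entry, default=str)


try:
    {}["user_id"]
except KeyError as exc:
    raw = build_dlq_entry("process_event", "abc-123", exc, kwargs={"event_id": 7})

entry = json.loads(raw)
print(entry["exception_type"], entry["kwargs"])
```

The returned string can be pushed to the DLQ list as-is, and because args and kwargs stay a list and an object, the entry remains replayable.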

Summary

Setting up a DLQ for Celery with Redis requires three things:

A mechanism to capture failed tasks (signal handler, base class override, or custom queue)
Storage for failed task details (Redis list or dedicated Celery queue)
Monitoring for DLQ depth with alerting

The signal handler approach (Option 1) is the simplest to add to an existing pipeline. The custom base class approach (Option 3) is cleanest for new pipelines where every task should have DLQ behavior by default.

A fourth piece worth adding: a requeue mechanism paired with a clear DLQ review process. The requeue script above reads DLQ entries and re-dispatches them as tasks after the underlying issue is fixed. Without that step, a DLQ is a collection bin for lost work rather than a recoverable buffer. Test the full loop -- failed tasks land in DLQ, alert fires, root cause is fixed, messages are requeued -- in staging before relying on it in production.

The architectural context for why DLQs are necessary in event-driven pipelines -- and how they fit into the broader producer-consumer pattern -- is covered in the full guide from 137Foundry: "How to Build an Event-Driven Data Pipeline With Python and Message Queues". That piece covers producer setup, consumer acknowledgment, and the failure handling design decisions that this implementation guide builds on.