DEV Community

Cover image for The Bug the Happy Path Hides
Kelechi Uba
Kelechi Uba

Posted on

The Bug the Happy Path Hides

Two backend tasks from my internship, and the failure modes that only appear after the demo is over.

Most backends I build pass their own demo.

You click through the intended flow, the happy path turns green, and the feature feels finished. But the recurring lesson of this internship was that "passes the demo" and "holds up under failure" are separated by a specific class of bug: the one that only appears under a condition the demo never creates.

A crash at exactly the wrong byte. Two requests arriving inside the same critical section. A background task running longer than the timeout its author guessed. A transport failing after the database has already committed.

This is about two of those tasks. One was an individual project: a write-ahead log built from first principles. The other was a team feature: a streaming persona-refinement endpoint for an AI product. Different stacks, different failure modes, same lesson.

Task 1: A write-ahead log without a database

What it was

The task was to build an append-only event store without a database or third-party storage system.

I built it with Python 3.11 and FastAPI. Each event is serialized as compact, newline-delimited JSON and appended to one flat file, events.log. An in-memory dictionary maps every event_id to a (byte_offset, byte_length) pair. Reads use that index to jump directly to the record:

log_file.seek(offset)
raw = log_file.read(length)
Enter fullscreen mode Exit fullscreen mode

On startup, the service replays the log and rebuilds the index before accepting requests.

The core implementation is small. That was part of the challenge. Once I removed the database, I also removed the layer that normally handles durability, indexing, transaction boundaries, and recovery. Every assumption about bytes and concurrency became my responsibility.

The problem it solved

The event store needed four properties:

  1. Append-only writes.
  2. Byte-accurate random-access reads.
  3. Startup recovery by replaying the log.
  4. Consistent state when concurrent requests race.

The "no database" constraint forced me to implement those guarantees explicitly instead of delegating them to Postgres.

How I approached it

The file format is deliberately simple: one UTF-8 JSON object per write, followed by a newline.

serialized = json.dumps(event, ensure_ascii=False, separators=(",", ":"))
encoded = serialized.encode("utf-8")
log_file.write(encoded + b"\n")
log_file.flush()
Enter fullscreen mode Exit fullscreen mode

The index stores the offset of the JSON object's first byte and its encoded length, excluding the newline. That lets reads return exactly the record bytes without scanning or trimming.

A single asyncio.Lock guards the full append transaction:

  1. Read the current file size.
  2. Append the encoded event.
  3. Flush the file buffer.
  4. Add the matching index entry.

Recovery runs inside FastAPI's lifespan hook, so the service rebuilds its index before serving traffic.

That design worked. Then the edge cases started testing what "worked" actually meant.

What broke, and how I fixed it

The Unicode byte trap

Two rows comparing the same string measured in characters versus UTF-8 bytes; a seek that is correct by character lands inside a multibyte character
Characters are counted in code points; the file is addressed in bytes. A seek that is correct by character can land mid-byte.

Python's len(s) counts Unicode code points. Files are addressed in bytes.

For ASCII, those values happen to match. That is what makes the bug dangerous: it passes tests written with payloads such as "test".

  • "Γ©" is one character but two UTF-8 bytes.
  • "πŸŽ‰" is one Unicode code point but four UTF-8 bytes.

If the index stores len(serialized), then seek() moves to a byte offset but read(length) uses a character count. The read can stop inside a multibyte character, leaving invalid UTF-8 or truncated JSON.

The implementation stores len(encoded), and the test makes the reason explicit:

assert store.index[event["id"]] == (0, len(json_bytes))
assert len(json_bytes) != len(json_text)
Enter fullscreen mode Exit fullscreen mode

The payload is "ChidΓ© πŸŽ‰" on purpose. A future change back to character length cannot pass that test by accident.

Stats could observe half a write

My first /stats endpoint read the event count from memory and the byte count from disk:

bytes_written = event_store.log_path.stat().st_size
return {"total": len(event_store.index), "bytes": bytes_written}
Enter fullscreen mode Exit fullscreen mode

Both reads happened outside the append lock.

That created a small but real race. If a POST had written and flushed the bytes but had not yet updated the index, a simultaneous GET /stats could report the new file size with the old event count. Each number was individually true, but the pair described no consistent state the system had ever held.

I moved the snapshot into the store and protected it with the same lock as append:

async def stats(self) -> dict[str, int]:
    async with self._lock:
        bytes_written = self.log_path.stat().st_size if self.log_path.exists() else 0
        return {"total": len(self.index), "bytes": bytes_written}
Enter fullscreen mode Exit fullscreen mode

Now both values come from one transaction boundary.

Recovery failed on a torn tail

An append-only log of intact records ending in a torn final fragment; lookahead skips an interrupted final append but fails loudly on interior corruption
Lookahead lets recovery treat a torn final record as an interrupted append, while interior corruption still fails loudly.

The most serious bug was in startup recovery.

The first recovery loop parsed every line as JSON. It could handle a valid final record without a trailing newline, but not a process crash in the middle of a write. If the file ended with bytes such as:

{"x":3
Enter fullscreen mode Exit fullscreen mode

json.loads() raised. That exception escaped the lifespan hook, so the service could not start until someone manually repaired the log.

A crash during a write had made the write-ahead log prevent recovery.

The fix needed more than a broad try/except. A malformed final fragment can be evidence of an interrupted append. A malformed interior record means previously committed data was damaged. Silently ignoring both would hide real corruption.

Recovery therefore uses one-line lookahead:

raw_line = log_file.readline()
while raw_line:
    next_line = log_file.readline()
    is_last_line = not next_line

    try:
        event = json.loads(content.decode("utf-8"))
    except (json.JSONDecodeError, UnicodeDecodeError):
        if is_last_line:
            print("Skipped 1 incomplete trailing record in events.log",
                  file=sys.stderr, flush=True)
            break
        raise

    raw_line = next_line
Enter fullscreen mode Exit fullscreen mode

The distinction is deliberate:

  • Valid final JSON without a newline is recovered.
  • Invalid final bytes are treated as a torn append and skipped.
  • Invalid interior data fails loudly.

The test injects the exact crash shape instead of approximating it:

with log_path.open("ab") as log_file:
    log_file.write(b'{"x":3')

recovered = EventStore(log_path).recover()
assert recovered == 2
assert captured.err.strip() == (
    "Skipped 1 incomplete trailing record in events.log"
)
Enter fullscreen mode Exit fullscreen mode

The commit history made the lesson measurable. I built append, indexed reads, stats, and startup recovery in under two hours. Roughly fourteen and a half hours later, harder tests exposed the stats race and torn-tail failure.

What I took away

The write path and recovery path are one protocol implemented in two places. They must agree on byte lengths, record boundaries, and what counts as committed. If either side interprets the file differently, the system eventually reads garbage or refuses to boot.

The stats bug was the same lesson at a smaller scale: related state needs one observation boundary.

I also learned to state durability honestly. flush() moves Python's buffered bytes to the operating system, which protects against an application-process crash. It is not fsync(), so it does not promise survival after a machine crash or power loss. Calling the prototype "fully durable" would have hidden an important boundary.

Why I picked it

This task gave me both kinds of useful bug.

The Unicode trap was one I anticipated and encoded into a regression test. The stats race and torn-tail failure were bugs I shipped in the first version and found only after testing the moments between the happy-path steps.

With no database to absorb my mistakes, the failure boundaries were impossible to ignore.

Task 2: Streaming an AI refinement without duplicating the turn

What it was

Anvila is an AI persona-generation platform. I worked on its FastAPI backend, where Celery workers perform long-running generation and Redis carries live events.

My largest feature was the persona-refinement endpoint. It lets a user continue a conversation with an existing persona and watch the response arrive over Server-Sent Events. The model may return a normal streaming answer or regenerate the persona's files and skills.

The change was substantial: roughly 1,300 inserted lines in its first commit, including a new Celery task, the endpoint and SSE relay, a shared persona-application module, and 637 lines of endpoint and task tests.

The architecture sounds straightforward:

Browser <- SSE <- FastAPI relay <- Redis pub/sub <- Celery worker <- model
                                      |
                                      +-----------------------> database
Enter fullscreen mode Exit fullscreen mode

Left-to-right streaming pipeline: browser, FastAPI relay, Redis pub/sub, Celery worker, model, with a durable database branch off the worker
Browser, relay, Redis, worker, and database each have their own lifetime and failure mode.

Every arrow has an independent lifetime and a different failure mode.

The problem it solved

Generation is too slow and stateful to run inside a normal request-response cycle. The work belongs in a background worker.

But users still expect immediate feedback. They should see tokens as the model produces them, not stare at a loading spinner until the full turn is complete.

That required the API process and worker process to agree on:

  • where events are published,
  • when the listener is ready,
  • how long silence is tolerated,
  • which state is durable,
  • and whether a delivery failure is allowed to retry the task.

The hard part was not streaming a token. It was preventing timing and transport failures from losing the beginning of a stream or duplicating the durable result.

How I approached it

I extended the model adapter with an asynchronous .stream() generator. The endpoint creates a Redis pub/sub subscription, enqueues the Celery task, and relays channel messages as SSE frames. The worker processes the model response, persists conversation or persona changes, and publishes status, token, file, skill, and completion events.

One ordering rule mattered immediately: subscribe before enqueue.

await pubsub.subscribe(channel)
refine_persona.delay(...)
Enter fullscreen mode Exit fullscreen mode

If those lines are reversed, a fast worker can publish its first events before the relay is listening. Redis pub/sub does not retain them for a late subscriber.

I added a test that records both operations and asserts:

assert order[:2] == ["subscribe", "delay"]
Enter fullscreen mode Exit fullscreen mode

That protects the race directly instead of hoping the worker is always slower than the request.

What broke, and how I fixed it

The feature worked in manual testing. Review still found two production-shaped failures.

The timeout measured age, not idleness

Two timelines: a fixed wall-clock cutoff severs a healthy but slow stream, while an idle timeout resets on each event and still delivers the terminal event
A wall-clock deadline cuts a healthy-but-slow stream; an idle timeout resets on every delivered event.

My first relay used one wall-clock deadline. Once the stream reached that age, the relay stopped.

That confuses two very different states:

  • no events are arriving because the worker is dead;
  • events are arriving, but the task is legitimately slow.

A long refinement could still be delivering tokens and get disconnected simply because it crossed the original cutoff.

The fix turned the deadline into an idle timeout. Every delivered event resets it:

deadline = loop.time() + REFINE_RELAY_IDLE_TIMEOUT_SECONDS

# after each delivered event
deadline = loop.time() + REFINE_RELAY_IDLE_TIMEOUT_SECONDS
Enter fullscreen mode Exit fullscreen mode

The regression test advances a fake clock beyond the original deadline between two events. The relay must still deliver the terminal event because the first event reset its idle window.

The rule became precise: time out silence, not duration.

A Redis failure could retry durable work

The more dangerous bug crossed a consistency boundary.

The Celery task writes durable state: assistant messages, token usage, persona files, and skills. It also publishes ephemeral Redis events for the live stream.

If publish() raises and that exception escapes the task, Celery may retry the entire refinement. That is acceptable before durable work is committed. It is dangerous after commit, because the retry can duplicate an assistant turn, usage accounting, or file updates.

The two systems cannot form one atomic transaction. Redis cannot roll back the database, and the database cannot guarantee SSE delivery.

So I chose the failure hierarchy explicitly:

  1. Durable state is authoritative.
  2. Redis delivery is best-effort.
  3. A publish failure is logged but never controls task retry.
  4. Terminal events are attempted only after the matching durable state is committed.
try:
    await redis_client.publish(channel, payload)
except Exception:
    logger.exception("failed to publish refine event")
Enter fullscreen mode Exit fullscreen mode

Live token events still occur while the model is streaming, before the final turn commit. The important guarantee is not that every publish happens after a commit. It is that a transport failure cannot turn into a second execution of durable work.

That trades a degraded live stream for consistent stored state. The user may miss an event and need to reload the committed result. The system must not create the same turn twice.

Shared behavior was starting to fork

My first implementation also duplicated the logic that applies model output to a persona: validating files, matching skills, and rebuilding derived content.

That was not an immediate runtime failure, but it was a future correctness bug. Generation and refinement would eventually disagree about what a valid persona update meant.

I extracted the behavior into a shared persona_apply module used by both paths. The rule for applying model output now has one implementation and one place to harden.

What I took away

Distributed correctness often lives in ordering rather than in a large algorithm.

  • Subscribe, then enqueue.
  • Commit durable state before announcing completion.
  • Reset a timeout on activity.
  • Do not let an ephemeral transport decide whether committed work runs again.

Each fix was small. Finding the need for it required treating the API connection, Redis channel, Celery task, and database transaction as separate systems instead of one feature-shaped box.

The tests had to force those boundaries: a worker that publishes immediately, a clock that passes the original deadline while events still arrive, and a Redis client that raises at the worst possible moment.

Why I picked it

It is the same lesson as the write-ahead log at a larger scale.

In the file-backed store, correctness depended on what happened between writing bytes and updating the index, or between a complete record and a torn tail. In Anvila, it depended on what happened between subscribing and enqueueing, or between committing the database transaction and publishing the completion event.

The happy path compresses each pair into one mental step. Production separates them again.

The throughline

The demo is the weakest test I have.

It exercises the sequence I designed under the timing I expected, with every dependency available. It does not crash between two instructions. It does not schedule the unlucky request. It does not slow a healthy worker down or fail Redis after Postgres commits.

The work that made both tasks defensible was the same:

  1. Find the boundary the happy path hides.
  2. Decide which side of that boundary is authoritative.
  3. Define the failure behavior before writing the fix.
  4. Write a test that forces the failure on purpose.

The bugs that taught me the most were not in the line I was looking at. They were in the space between two lines I had treated as one operation.

Top comments (0)