I thought this async Rust service would be simple
I wanted to build a small async service in Rust.
Accept events, process them, retry on failure. Nothing fancy.
It looked like a weekend project.
It turned into a lesson in how quickly “simple” systems stop being simple once you care about correctness.
The full project is available here: https://github.com/yourname/eventful
The naive version
The initial design looked something like this:
HTTP → queue → worker pool
- Handler receives an event
- Push it into a channel
- Workers pull from the channel and process
That works fine — until you actually try to make it correct.
As soon as you introduce retries, idempotency, and failure handling, things start to break in ways that aren’t obvious at first.
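For concreteness, here is roughly what that naive shape looks like, sketched with std threads and channels (the real service is async; the `Event` type and function names here are illustrative):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// A hypothetical event type; the real service's fields will differ.
#[derive(Debug)]
struct Event {
    id: u64,
    payload: String,
}

// The naive pipeline: the handler pushes into a channel,
// workers pull from it and process. Returns total events processed.
fn run_naive_pipeline(events: Vec<Event>, workers: usize) -> usize {
    let (tx, rx) = mpsc::channel::<Event>();
    // Share the single receiver between workers.
    let rx = Arc::new(Mutex::new(rx));

    let mut handles = Vec::new();
    for _ in 0..workers {
        let rx = Arc::clone(&rx);
        handles.push(thread::spawn(move || {
            let mut processed: usize = 0;
            loop {
                // The lock guard is dropped at the end of the match.
                let event = match rx.lock().unwrap().recv() {
                    Ok(e) => e,
                    Err(_) => break, // sender dropped: no more work
                };
                let _ = event; // process(event) would go here
                processed += 1;
            }
            processed
        }));
    }

    for e in events {
        tx.send(e).unwrap(); // "queued" — or so we assume
    }
    drop(tx); // close the channel so workers exit

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```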
Problem 1: Idempotency isn’t just “don’t insert twice”
I wanted ingestion to be idempotent by event_id.
At first, that just meant:
- If the ID exists, return the existing record
- Otherwise insert it
But that leaves a hole.
What if the same ID comes in with a different payload?
That’s not a duplicate — that’s a conflict.
The fix was to store a hash of the payload and reject mismatches:
- Same ID + same payload → OK (deduped)
- Same ID + different payload → 409 conflict
Small change, but it forced me to treat idempotency as a real constraint instead of a convenience.
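A minimal sketch of that check, assuming an in-memory store keyed by event ID (the `Store` and `Ingest` names are illustrative, not from the real service):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// The three possible outcomes of ingesting an event.
#[derive(Debug, PartialEq)]
enum Ingest {
    Inserted,
    Deduped,  // same ID + same payload → OK
    Conflict, // same ID + different payload → 409
}

struct Store {
    // event_id → hash of the payload first seen for that ID
    seen: HashMap<String, u64>,
}

impl Store {
    fn new() -> Self {
        Store { seen: HashMap::new() }
    }

    fn ingest(&mut self, event_id: &str, payload: &str) -> Ingest {
        let mut h = DefaultHasher::new();
        payload.hash(&mut h);
        let payload_hash = h.finish();

        match self.seen.get(event_id) {
            None => {
                self.seen.insert(event_id.to_string(), payload_hash);
                Ingest::Inserted
            }
            // Same ID, same payload: a true duplicate.
            Some(&existing) if existing == payload_hash => Ingest::Deduped,
            // Same ID, different payload: a conflict, not a duplicate.
            Some(_) => Ingest::Conflict,
        }
    }
}
```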
Problem 2: You can lose work even if you “queued” it
Originally, I assumed:
If I push an event into the queue, it will eventually be processed.
That’s not actually true.
Two things break this:
- The queue is full (`try_send` fails)
- The queue is broken (receiver dropped)
In both cases, the event exists in the system, but it never reaches a worker.
The fix was separating “exists” from “scheduled.”
Each record tracks:
- `status` (Received, Processing, etc.)
- `queued` (whether we think it’s scheduled)
If enqueue fails, the record still exists, but it isn’t reliably scheduled anymore.
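The same failure modes show up on a bounded std channel, whose `try_send` fails in exactly those two ways (`Full` and `Disconnected`); the `Record` type below is illustrative:

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

// Minimal record: "exists" (the struct itself) vs "scheduled" (the flag).
struct Record {
    event_id: u64,
    queued: bool,
}

// Enqueue, but record the truth: if try_send fails, the event still
// exists — it just isn't reliably scheduled anymore.
fn enqueue(tx: &SyncSender<u64>, record: &mut Record) {
    match tx.try_send(record.event_id) {
        Ok(()) => record.queued = true,
        // Full: channel at capacity. Disconnected: receiver dropped.
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
            record.queued = false; // a sweeper can pick this up later
        }
    }
}
```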
Which leads to the next problem.
Problem 3: You need a sweeper (even if it feels wrong)
I didn’t initially want a background task scanning state. It felt like a workaround.
But without it, there are too many ways for events to get stuck:
- enqueue fails
- worker crashes mid-processing
- retry timing gets missed
So I added a sweeper.
It runs periodically and looks for:
- events ready to retry
- events marked queued but not processed for too long
Then it re-enqueues them.
It’s not elegant, but it’s robust. It gives you eventual correctness without requiring every code path to be perfect.
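A single sweep pass might look something like this (a sketch: the field names and the exact retry/stuck conditions are my assumptions):

```rust
use std::time::{Duration, Instant};

#[derive(PartialEq)]
enum Status {
    Received,
    Processing,
    Completed,
}

struct Record {
    status: Status,
    queued: bool,
    next_retry_at: Option<Instant>, // set when a retry is scheduled
    queued_at: Option<Instant>,     // when we last enqueued it
}

// One sweep pass: return the indices of records that should be
// re-enqueued. `stuck_after` is how long a queued event may sit
// unprocessed before we consider it stuck.
fn sweep(records: &[Record], now: Instant, stuck_after: Duration) -> Vec<usize> {
    records
        .iter()
        .enumerate()
        .filter_map(|(i, r)| {
            // Ready to retry: its retry deadline has passed.
            let retry_due = r.next_retry_at.map_or(false, |t| now >= t);
            // Stuck: marked queued but not processed for too long.
            let stuck = r.queued
                && r.status != Status::Completed
                && r.queued_at
                    .map_or(false, |t| now.duration_since(t) >= stuck_after);
            (retry_due || stuck).then_some(i)
        })
        .collect()
}
```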
Problem 4: “Queue depth” is not one number
At first I tracked queue depth as a single value.
That turned out to be misleading.
There are at least three different things happening:
- Channel depth — how many items are currently in the queue
- Backlog — how many events are marked `queued == true`
- Inflight — how many workers are actively processing
These are not the same.
For example:
- Channel depth can be 0 while backlog is high
- Inflight can be maxed out while the queue stays empty
So I split them into separate metrics:
- `queue_channel_depth`
- `backlog_queued`
- `processing_inflight`
Once I did that, the system became much easier to reason about.
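One way to wire the three metrics, assuming the convention that an event counts toward the backlog until it actually completes (the hook names are mine):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Three separate gauges — one number can't represent all of them.
struct QueueMetrics {
    queue_channel_depth: AtomicI64, // items sitting in the channel
    backlog_queued: AtomicI64,      // records with queued == true
    processing_inflight: AtomicI64, // workers currently processing
}

impl QueueMetrics {
    const fn new() -> Self {
        QueueMetrics {
            queue_channel_depth: AtomicI64::new(0),
            backlog_queued: AtomicI64::new(0),
            processing_inflight: AtomicI64::new(0),
        }
    }

    fn on_enqueued(&self) {
        self.queue_channel_depth.fetch_add(1, Ordering::Relaxed);
        self.backlog_queued.fetch_add(1, Ordering::Relaxed);
    }

    // A worker pulled the item and started on it: channel depth drops,
    // but the backlog doesn't — the event isn't done yet. This is how
    // channel depth can be 0 while backlog is high.
    fn on_dispatched(&self) {
        self.queue_channel_depth.fetch_sub(1, Ordering::Relaxed);
        self.processing_inflight.fetch_add(1, Ordering::Relaxed);
    }

    fn on_completed(&self) {
        self.processing_inflight.fetch_sub(1, Ordering::Relaxed);
        self.backlog_queued.fetch_sub(1, Ordering::Relaxed);
    }
}
```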
Problem 5: Concurrency needs to be bounded explicitly
The simplest approach is to spawn a task per event.
That works until it doesn’t.
I ended up using a Semaphore to limit concurrency:
- Each task acquires a permit
- The permit is held for the duration of processing
- Max concurrency is fixed
Instead of a fixed worker pool, this lets me:
- keep the code simple
- avoid idle workers
- still enforce limits
It also makes shutdown behavior much easier to control.
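In async code this would presumably be `tokio::sync::Semaphore`; to make the permit mechanics visible, here is the same idea as a blocking sketch built from std primitives:

```rust
use std::sync::{Condvar, Mutex};

// A counting semaphore: `max` permits, each held for the duration
// of one unit of processing.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(max: usize) -> Self {
        Semaphore { permits: Mutex::new(max), cv: Condvar::new() }
    }

    // Block until a permit is free, then take it.
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }

    // Take a permit only if one is free right now.
    fn try_acquire(&self) -> bool {
        let mut p = self.permits.lock().unwrap();
        if *p == 0 {
            return false; // max concurrency reached
        }
        *p -= 1;
        true
    }

    // Return the permit when processing finishes.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}
```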
Problem 6: Graceful shutdown is where things get messy
Stopping a system like this is harder than starting it.
You need to:
- Stop accepting new work
- Stop dispatching new tasks
- Let in-flight work finish (within reason)
- Not hang forever
What I ended up with:
- a `watch` channel for shutdown signaling
- a dispatch loop that exits on signal
- a `JoinSet` tracking worker tasks
- a timeout for draining
- forced abort after timeout
So shutdown looks like:
- Signal shutdown
- Stop pulling from the queue
- Wait up to N milliseconds for workers
- Abort anything still running
It’s not perfect, but it’s predictable.
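A blocking sketch of that sequence using std threads (the async version would use a `watch` channel, a `JoinSet`, and a drain timeout; std threads have neither join-with-timeout nor abort, which the comments call out):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Run `workers` dispatch loops, then walk the shutdown sequence:
// signal → stop pulling → drain. Returns total events "processed".
fn run_and_shutdown(workers: usize) -> usize {
    let shutdown = Arc::new(AtomicBool::new(false));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let shutdown = Arc::clone(&shutdown);
            thread::spawn(move || {
                let mut processed: usize = 0;
                loop {
                    processed += 1; // stand-in for pulling + processing
                    if shutdown.load(Ordering::Relaxed) {
                        break; // the dispatch loop exits on the signal
                    }
                    thread::sleep(Duration::from_millis(1));
                }
                processed
            })
        })
        .collect();

    // 1. Let some work happen, then signal shutdown.
    thread::sleep(Duration::from_millis(10));
    shutdown.store(true, Ordering::Relaxed);

    // 2. Drain: join every worker. A std join blocks forever; tokio's
    //    JoinSet + tokio::time::timeout add the deadline and the abort.
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```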
Problem 7: Metrics will lie to you if you’re not careful
I added metrics early, but they were wrong at first.
The issue was trying to track counts by incrementing and decrementing in multiple places.
That’s easy to get wrong in a concurrent system.
What ended up working:
- Counters → only ever increment
- State counts → only update on real state transitions
For example, `queued_count` only changes when:
- `queued` flips false → true
- `queued` flips true → false
Anything else introduces drift.
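A sketch of a transition-only update (the helper name is mine):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

struct Record {
    queued: bool,
}

// The gauge moves only on a real state transition; setting the flag
// to the value it already has is a no-op, so the count can't drift.
fn set_queued(record: &mut Record, value: bool, queued_count: &AtomicI64) {
    if record.queued == value {
        return; // no transition → no metric change
    }
    record.queued = value;
    if value {
        queued_count.fetch_add(1, Ordering::Relaxed); // false → true
    } else {
        queued_count.fetch_sub(1, Ordering::Relaxed); // true → false
    }
}
```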
The resulting model
The final system looks like this:
HTTP → Ingest → Store → Channel → Dispatcher → Workers
                           ↑
                        Sweeper
With a state machine:
Received → Processing → Completed
↘ FailedRetry → Failed
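That state machine can be sketched as an enum with an explicit transition check; the `FailedRetry → Processing` edge (a retry going back through the worker) is my assumption:

```rust
#[derive(Debug, Clone, Copy)]
enum Status {
    Received,
    Processing,
    Completed,
    FailedRetry,
    Failed,
}

// Only the edges drawn in the diagram (plus the assumed retry edge)
// are legal transitions.
fn can_transition(from: Status, to: Status) -> bool {
    use Status::*;
    matches!(
        (from, to),
        (Received, Processing)
            | (Processing, Completed)
            | (Processing, FailedRetry)
            | (FailedRetry, Processing) // assumed: retry re-enters Processing
            | (FailedRetry, Failed)
    )
}
```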
And metrics that reflect:
- ingress
- deduplication
- processing success/failure
- backlog
- queue state
- concurrency
- latency
What I took away from this
A few things that stood out:
- “Simple async system” is usually not simple once you care about correctness
- State machines make concurrency problems easier to reason about
- Backpressure is multi-dimensional, not a single number
- A sweeper is often the simplest way to guarantee eventual progress
- Shutdown needs to be designed, not added later
- Observability changes how you design the system
What I didn’t do (on purpose)
This is an in-memory system.
I didn’t add:
- persistence
- distributed processing
- external queues
Those would be the next steps, but the goal here was to get the core behavior right first.
Closing
This ended up being more about edge cases than features.
Most of the code is just making sure the system behaves correctly when things don’t go as planned — which is most of the time in real systems.
That was the interesting part.
And honestly, the part I didn’t expect going in.
Code
If you want to see the full implementation: https://github.com/yourname/eventful