Why I treat API timeouts as "unknown", not failures

#rust #distributedsystems #sre #backend

Every payment gateway I've ever worked on had the same hidden bug.

A provider API times out. The code calls it a "failure", so it retries. But the original request actually succeeded; the provider just took too long to respond. Now you've double-charged the customer.

I built Azums, an open‑source payment gateway in Rust, specifically to stop this pattern.

The fix: make ambiguity explicit

Instead of pending → success/fail, I designed a state machine with five states:

pending (request sent, waiting)
succeeded (confirmed success)
failed (confirmed failure)
retryable (temporary error, safe to retry)
unknown (timeout or ambiguous response; needs investigation)
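Here is a minimal Rust sketch of those five states. The `TxState` enum and its transition rules are hypothetical and not taken from the Azums codebase. The key assumption is that `unknown` can only be resolved to `succeeded` or `failed` once the real outcome is known, and is never sent back to `pending` automatically.

```rust
/// Hypothetical sketch of the five states. Names and transition rules
/// are illustrative assumptions, not the Azums implementation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TxState {
    Pending,
    Succeeded,
    Failed,
    Retryable,
    Unknown,
}

impl TxState {
    /// Only a confirmed temporary error permits an automatic retry.
    fn can_auto_retry(self) -> bool {
        matches!(self, TxState::Retryable)
    }

    /// Confirmed outcomes never change once reached.
    fn is_terminal(self) -> bool {
        matches!(self, TxState::Succeeded | TxState::Failed)
    }

    /// Assumed transition rules:
    /// - Pending can move to any other state.
    /// - Retryable goes back to Pending when a retry is sent.
    /// - Unknown resolves only to Succeeded or Failed, after investigation.
    fn can_transition_to(self, next: TxState) -> bool {
        use TxState::*;
        match (self, next) {
            (Pending, Succeeded | Failed | Retryable | Unknown) => true,
            (Retryable, Pending) => true,
            (Unknown, Succeeded | Failed) => true,
            _ => false,
        }
    }
}

fn main() {
    // A timed-out request may not be re-sent automatically...
    assert!(!TxState::Unknown.can_auto_retry());
    assert!(!TxState::Unknown.can_transition_to(TxState::Pending));
    // ...but it can be resolved once the real outcome is known.
    assert!(TxState::Unknown.can_transition_to(TxState::Succeeded));
    assert!(TxState::Succeeded.is_terminal());
    println!("state machine rules hold");
}
```

Encoding the rules as a single `match` keeps every allowed transition in one place, so an accidental `unknown → pending` edge shows up in review rather than in production.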

When a timeout happens, the system doesn't guess. It marks the transaction as unknown and stops. No blind retries. No double charges.
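The decision point is the mapping from a provider call's outcome to a state. The sketch below uses a hypothetical `CallOutcome` type and assumes that only a request that provably never left the client (for example, a refused connection) is safe to retry. A real gateway would need its own per-provider rules for which errors count as retryable. Timeouts and unparseable responses both map to `Unknown`, and the retry loop stops there.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TxState { Pending, Succeeded, Failed, Retryable, Unknown }

/// Hypothetical outcomes of a single provider call. A real client would
/// derive these from status codes, error types and response bodies.
#[derive(Debug, Clone, Copy)]
enum CallOutcome {
    Confirmed, // provider confirmed the charge
    Declined,  // provider confirmed the rejection
    NotSent,   // assumption: e.g. connection refused, request never left us
    TimedOut,  // we stopped waiting; the provider may still have acted
    Garbled,   // a response arrived but could not be parsed
}

/// Map an outcome to a state. Anything we cannot prove is Unknown.
fn classify(outcome: CallOutcome) -> TxState {
    match outcome {
        CallOutcome::Confirmed => TxState::Succeeded,
        CallOutcome::Declined => TxState::Failed,
        CallOutcome::NotSent => TxState::Retryable,
        CallOutcome::TimedOut | CallOutcome::Garbled => TxState::Unknown,
    }
}

/// Retry only while the state is Retryable; stop at anything else,
/// including Unknown. Returns the final state and the attempt count.
fn run_with_retries<F>(mut call: F, max_attempts: u32) -> (TxState, u32)
where
    F: FnMut() -> CallOutcome,
{
    let mut attempts = 0;
    let mut state = TxState::Pending;
    while attempts < max_attempts {
        attempts += 1;
        state = classify(call());
        if state != TxState::Retryable {
            break; // Succeeded, Failed or Unknown: no more automatic retries
        }
    }
    (state, attempts)
}

fn main() {
    // First attempt never reaches the provider; the second one times out.
    let mut responses = vec![CallOutcome::NotSent, CallOutcome::TimedOut].into_iter();
    let (state, attempts) =
        run_with_retries(|| responses.next().unwrap_or(CallOutcome::Confirmed), 5);
    println!("{:?} after {} attempts", state, attempts); // Unknown after 2 attempts
    assert_eq!(classify(CallOutcome::Garbled), TxState::Unknown);
}
```

With one refused connection followed by a timeout, the loop retries once and then stops in `Unknown` after two attempts, instead of sending a third request that could charge the customer again.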

Why this matters beyond payments

This same pattern applies anywhere you talk to unreliable external systems:

Blockchain RPCs that time out after the transaction was submitted
AI agent API calls that hang but may have executed
Message queues that lose acknowledgements

Treating ambiguity as a real state is the difference between a system that guesses and one you can trust.

My full implementation is on GitHub: BlockForge-Dev/Azums

What's your worst "timeout caused a disaster" story? Let me know in the comments.

DEV Community
