Introduction
Most job schedulers do retries… but they don’t explain retries.
ReTraced is a transparent and extensible distributed job scheduler built to make retry behavior, failure handling, and job lifecycle transitions explicit and observable.
Unlike many schedulers that hide retries behind config flags and internal engines, ReTraced treats retries as first-class data — visible, auditable, and configurable per job.
ReTraced is not designed to hide complexity.
It’s designed to expose it clearly.
Why ReTraced Exists
Modern schedulers are powerful, but they often:
- Hide retry decisions inside internal engines
- Expose retry counts without retry intent
- Make failure analysis opaque and indirect
ReTraced was built to answer questions like:
- Why did this job retry at this moment?
- Was the retry automatic or manually triggered?
- Is this failure temporary or permanent?
- When and why did retries stop?
These questions matter when building reliable, debuggable distributed systems.
Core Philosophy
✅ Explicit Over Implicit
In ReTraced:
- retry attempts are stored as structured, queryable data
- failures are classified (temporary vs permanent)
- the dead-letter queue (DLQ) is first-class (not an afterthought)
This makes execution behavior predictable, inspectable, and explainable.
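As an illustration of the "explicit over implicit" idea, here's a minimal sketch of failure classification in Python. The names (`FailureKind`, `classify`) are hypothetical, not ReTraced's actual API; the point is that the temporary-vs-permanent decision is ordinary, inspectable code rather than hidden engine behavior.

```python
from enum import Enum


class FailureKind(Enum):
    TEMPORARY = "temporary"  # a retry may succeed: timeouts, connection resets, 5xx
    PERMANENT = "permanent"  # a retry will not help: bad payloads, validation errors


def classify(exc: Exception) -> FailureKind:
    # Hypothetical mapping; real classification would key off the exception
    # types or error codes specific to your jobs.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return FailureKind.TEMPORARY
    return FailureKind.PERMANENT
```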
✅ Practical Before Perfect
ReTraced favors clarity and control over hidden guarantees:
- at-least-once delivery semantics
- Redis-backed state for speed + simplicity
- minimal coordination logic
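For context, here's a common Redis pattern for at-least-once job claiming (a sketch of the general technique, not necessarily ReTraced's internals): the worker atomically moves a job from the pending list to a processing list and only removes it after handling, so a crashed worker leaves the job recoverable.

```python
import redis

r = redis.Redis()
QUEUE, PROCESSING = "jobs:pending", "jobs:processing"


def work_one(handler) -> None:
    # Atomically claim one job; if this worker dies mid-handling, the job
    # stays in jobs:processing and a reaper can re-queue it later.
    payload = r.rpoplpush(QUEUE, PROCESSING)
    if payload is None:
        return  # queue is empty
    try:
        handler(payload)
    except Exception:
        r.lpush(QUEUE, payload)  # naive requeue; a real scheduler applies its retry strategy here
    finally:
        r.lrem(PROCESSING, 1, payload)  # acknowledge: drop the claimed copy
```

Because delivery is at-least-once, handlers should be idempotent: a job can legitimately run more than once.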
Performance Snapshot (Local)
ReTraced prioritizes correctness + visibility while still being fast.
Benchmark (local, Redis-backed):
- 10,000 jobs in ~2.4s with 1 worker
- 10,000 jobs in ~2.1s with 5 workers
This shows low scheduler overhead and good worker scalability.
Benchmarks are indicative, not a production SLA.
What Makes ReTraced Different
🔁 Retry as Data
Every job keeps a retry history:
- timestamp + error
- trigger: AUTO or MANUAL
- retry result and final outcome
This enables real audit trails, DLQ forensics, and safer replays.
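Concretely, one entry in that history might look like this (illustrative field names, not ReTraced's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Trigger(Enum):
    AUTO = "auto"      # scheduled by the retry strategy
    MANUAL = "manual"  # re-enqueued by an operator


@dataclass
class RetryEntry:
    attempt: int       # 1-based attempt number
    at: datetime       # when the attempt ran
    error: str | None  # error captured from this attempt, if any
    trigger: Trigger   # AUTO or MANUAL
    outcome: str       # e.g. "retry scheduled", "succeeded", "dead-lettered"
```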
🧠 Per-Job Retry Strategies
Each job can define its own retry behavior:
- fixed delay
- linear backoff
- exponential backoff (with/without jitter)
- multi-phase retries → DLQ
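The delay math behind these strategies is small enough to sketch directly (parameter names and defaults here are illustrative):

```python
import random


def fixed(attempt: int, base: float = 5.0) -> float:
    return base                            # 5s, 5s, 5s, ...


def linear(attempt: int, base: float = 5.0) -> float:
    return base * attempt                  # 5s, 10s, 15s, ...


def exponential(attempt: int, base: float = 5.0, jitter: bool = True) -> float:
    delay = base * (2 ** (attempt - 1))    # 5s, 10s, 20s, 40s, ...
    if jitter:
        delay *= random.uniform(0.5, 1.5)  # spread retries to avoid thundering herds
    return delay
```

A multi-phase setup just chains these: for example, a few fast fixed-delay attempts, then exponential backoff, and finally a move to the DLQ once the retry budget is spent.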
🧾 First-Class DLQ
When a job goes dead, it’s not “lost”.
ReTraced preserves the full execution story:
- retry history
- failure context
- poison-job identification
- manual retries tracked clearly
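As a rough picture (hypothetical fields again), a dead-lettered job could carry everything needed for forensics and safe replay:

```python
from dataclasses import dataclass


@dataclass
class DeadLetter:
    job_id: str
    payload: dict
    retry_history: list      # every RetryEntry, in order
    last_error: str          # failure context from the final attempt
    poison: bool = False     # flagged when the job repeatedly crashes or stalls workers
    manual_replays: int = 0  # operator-triggered retries, tracked separately from AUTO ones
```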
A Bug ReTraced Helped Me Catch (Real Example)
While stress testing, I found a backoff timing bug:
I expected exponential delays to grow like:
5s → 10s → 20s → 40s (total ~75s)
But actual retry timestamps showed delays plateauing near ~6s (total ~28s).
This bug was only visible because ReTraced stores retry attempts as real data — not hidden scheduler state.
That’s the whole point: make failures debuggable by design.
Current Status + What’s Next
ReTraced just hit v1.0.0, meaning the core retry-as-data model, DLQ handling, and per-job strategies are stable and usable.
ReTraced is usable today for experimentation and internal tools, and I’m actively improving it toward a production-ready self-hostable system.
It’s not trying to replace mature schedulers — it complements them by making retry intent and failure behavior visible.
Personal Note
I’m a second-year student, deeply interested in distributed systems, and I’m building ReTraced to learn real reliability engineering.
My goal is to make it production-ready so that devs can actually use it.
If you have experience with schedulers, retries, DLQ design, or Redis-based coordination — I’d love your feedback, suggestions, and PRs 🙌
Thanks for reading 🚀