Introduction
Most job schedulers do retries… but they don’t explain retries.
ReTraced is a transparent and extensible distributed job scheduler built to make retry behavior, failure handling, and job lifecycle transitions explicit and observable.
Unlike many schedulers that hide retries behind config flags and internal engines, ReTraced treats retries as first-class data — visible, auditable, and configurable per job.
ReTraced is not designed to hide complexity.
It’s designed to expose it clearly.
Why ReTraced Exists
Modern schedulers are powerful, but they often:
- Hide retry decisions inside internal engines
- Expose retry counts without retry intent
- Make failure analysis opaque and indirect
ReTraced was built to answer questions like:
- Why did this job retry at this moment?
- Was the retry automatic or manually triggered?
- Is this failure temporary or permanent?
- When and why did retries stop?
These questions matter when building reliable, debuggable distributed systems.
Core Philosophy
✅ Explicit Over Implicit
In ReTraced:
- retry attempts are stored as structured, queryable data
- failures are classified (temporary vs permanent)
- the dead-letter queue (DLQ) is first-class (not an afterthought)
This makes execution behavior predictable, inspectable, and explainable.
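As an illustration of the "explicit over implicit" idea, here's a minimal sketch of failure classification in Python. The names (`FailureKind`, `classify`) are hypothetical, not ReTraced's actual API; the point is that the temporary-vs-permanent decision is ordinary, inspectable code rather than hidden engine behavior.

```python
from enum import Enum


class FailureKind(Enum):
    TEMPORARY = "temporary"  # a retry may succeed: timeouts, connection resets, 5xx
    PERMANENT = "permanent"  # a retry will not help: bad payloads, validation errors


def classify(exc: Exception) -> FailureKind:
    # Hypothetical mapping; real classification would key off the exception
    # types or error codes specific to your jobs.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return FailureKind.TEMPORARY
    return FailureKind.PERMANENT
```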
✅ Practical Before Perfect
ReTraced favors clarity and control over hidden guarantees:
- at-least-once delivery semantics
- Redis-backed state for speed + simplicity
- minimal coordination logic
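For context, here's a common Redis pattern for at-least-once job claiming (a sketch of the general technique, not necessarily ReTraced's internals): the worker atomically moves a job from the pending list to a processing list and only removes it after handling, so a crashed worker leaves the job recoverable.

```python
import redis

r = redis.Redis()
QUEUE, PROCESSING = "jobs:pending", "jobs:processing"


def work_one(handler) -> None:
    # Atomically claim one job; if this worker dies mid-handling, the job
    # stays in jobs:processing and a reaper can re-queue it later.
    payload = r.rpoplpush(QUEUE, PROCESSING)
    if payload is None:
        return  # queue is empty
    try:
        handler(payload)
    except Exception:
        r.lpush(QUEUE, payload)  # naive requeue; a real scheduler applies its retry strategy here
    finally:
        r.lrem(PROCESSING, 1, payload)  # acknowledge: drop the claimed copy
```

Because delivery is at-least-once, handlers should be idempotent: a job can legitimately run more than once.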
Performance Snapshot (Local)
ReTraced prioritizes correctness + visibility while still being fast.
Benchmark (local, Redis-backed):
- 10,000 jobs in ~2.4s with 1 worker
- 10,000 jobs in ~2.1s with 5 workers
This shows low scheduler overhead and good worker scalability.
Benchmarks are indicative, not a production SLA.
What Makes ReTraced Different
🔁 Retry as Data
Every job keeps a retry history:
- timestamp + error
- trigger: AUTO or MANUAL
- retry result and final outcome
This enables real audit trails, DLQ forensics, and safer replays.
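Concretely, one entry in that history might look like this (illustrative field names, not ReTraced's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Trigger(Enum):
    AUTO = "auto"      # scheduled by the retry strategy
    MANUAL = "manual"  # re-enqueued by an operator


@dataclass
class RetryEntry:
    attempt: int       # 1-based attempt number
    at: datetime       # when the attempt ran
    error: str | None  # error captured from this attempt, if any
    trigger: Trigger   # AUTO or MANUAL
    outcome: str       # e.g. "retry scheduled", "succeeded", "dead-lettered"
```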
🧠 Per-Job Retry Strategies
Each job can define its own retry behavior:
- fixed delay
- linear backoff
- exponential backoff (with/without jitter)
- multi-phase retries → DLQ
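The delay math behind these strategies is small enough to sketch directly (parameter names and defaults here are illustrative):

```python
import random


def fixed(attempt: int, base: float = 5.0) -> float:
    return base                            # 5s, 5s, 5s, ...


def linear(attempt: int, base: float = 5.0) -> float:
    return base * attempt                  # 5s, 10s, 15s, ...


def exponential(attempt: int, base: float = 5.0, jitter: bool = True) -> float:
    delay = base * (2 ** (attempt - 1))    # 5s, 10s, 20s, 40s, ...
    if jitter:
        delay *= random.uniform(0.5, 1.5)  # spread retries to avoid thundering herds
    return delay
```

A multi-phase setup just chains these: for example, a few fast fixed-delay attempts, then exponential backoff, and finally a move to the DLQ once the retry budget is spent.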
🧾 First-Class DLQ
When a job goes dead, it’s not “lost”.
ReTraced preserves the full execution story:
- retry history
- failure context
- poison-job identification
- manual retries tracked clearly
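As a rough picture (hypothetical fields again), a dead-lettered job could carry everything needed for forensics and safe replay:

```python
from dataclasses import dataclass


@dataclass
class DeadLetter:
    job_id: str
    payload: dict
    retry_history: list      # every RetryEntry, in order
    last_error: str          # failure context from the final attempt
    poison: bool = False     # flagged when the job repeatedly crashes or stalls workers
    manual_replays: int = 0  # operator-triggered retries, tracked separately from AUTO ones
```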
A Bug ReTraced Helped Me Catch (Real Example)
While stress testing, I found a backoff timing bug:
I expected exponential delays to grow like:
5s → 10s → 20s → 40s (total ~75s)
But actual retry timestamps showed delays plateauing near ~6s (total ~28s).
This bug was only visible because ReTraced stores retry attempts as real data — not hidden scheduler state.
That’s the whole point: make failures debuggable by design.
Current Status + What’s Next
ReTraced just hit v1.0.0, meaning the core retry-as-data model, DLQ handling, and per-job strategies are stable and usable.
ReTraced is usable today for experimentation and internal tools, and I’m actively improving it toward a production-ready self-hostable system.
It’s not trying to replace mature schedulers — it complements them by making retry intent and failure behavior visible.
Personal Note
I’m a second-year student, deeply interested in distributed systems, and I’m building ReTraced to learn real reliability engineering.
My goal is to make it production-ready so that devs can actually use it.
If you have experience with schedulers, retries, DLQ design, or Redis-based coordination — I’d love your feedback, suggestions, and PRs 🙌
Thanks for reading 🚀