DEV Community

Joud Awad


56/60 Days System Design Questions

Your background job ran for 4 minutes and nobody knows if it finished.

That's not a job queue problem. That's a missing design problem.

Long-running jobs break every assumption you built for synchronous APIs. Your load balancer times out after 30s. Your mobile client doesn't know whether to retry. Your retry logic re-runs a job that already half-completed.

Here's the real scenario:

You're processing a video upload. The job takes 2–8 minutes. Millions of users.

What do you expose to the client?

A) Polling endpoint — client hits /jobs/:id/status every 5s until done
B) Webhook — job fires a POST to client's callback URL on completion
C) SSE / WebSocket — server pushes progress updates in real time
D) Synchronous wait — keep the HTTP connection open until the job finishes

One scales to millions without coupling your infrastructure to client uptime.

The others have hard production failure modes most teams don't discover until 3 AM.

The deeper problem isn't transport — it's these 4 things nobody gets right the first time:

→ Idempotency. Every job must be safe to re-run. If your retry logic can double-charge, double-send, or double-process — you don't have retries, you have bugs waiting.

→ Progress granularity. "0% → 100%" is useless for a 6-minute job. You need intermediate states: queued, processing, transcoding, uploading, complete. Clients need something to show users.

→ Timeout vs failure. A job that stops responding isn't the same as a job that failed. Dead workers, OOM kills, spot instance evictions — your queue needs a heartbeat or a visibility timeout, not just a try/catch.

→ Deduplication. The client will retry. Your queue will redeliver. You need a dedup key scoped to the original request — not the job run.
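As a rough sketch of how three of these points can fit together (dedup keys, named stages, and heartbeats), here is a tiny in-memory job store. Everything in it is hypothetical: the `JobStore` class, the `stall_after_s` threshold, and the `job-N` ID format are illustration only, and a real system would keep this state in a database or lean on the queue's own visibility timeout.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    TRANSCODING = "transcoding"
    UPLOADING = "uploading"
    COMPLETE = "complete"
    FAILED = "failed"


@dataclass
class Job:
    job_id: str
    stage: Stage = Stage.QUEUED
    last_heartbeat: float = 0.0


class JobStore:
    """Hypothetical in-memory store, for illustration only."""

    def __init__(self, stall_after_s=60.0):
        self.stall_after_s = stall_after_s
        self._jobs = {}          # job_id -> Job
        self._by_dedup_key = {}  # dedup key -> job_id

    def submit(self, dedup_key, now):
        # The dedup key identifies the original request, not a job run,
        # so a client retry maps back to the job that already exists.
        existing_id = self._by_dedup_key.get(dedup_key)
        if existing_id is not None:
            return self._jobs[existing_id]
        job = Job(job_id=f"job-{len(self._jobs) + 1}", last_heartbeat=now)
        self._jobs[job.job_id] = job
        self._by_dedup_key[dedup_key] = job.job_id
        return job

    def heartbeat(self, job_id, stage, now):
        # Workers report their current stage and prove they are still alive.
        job = self._jobs[job_id]
        job.stage = stage
        job.last_heartbeat = now

    def is_stalled(self, job_id, now):
        # A silent worker is not a failed job: flag it so it can be
        # re-queued instead of waiting for an exception that never arrives.
        job = self._jobs[job_id]
        if job.stage in (Stage.QUEUED, Stage.COMPLETE, Stage.FAILED):
            return False
        return now - job.last_heartbeat > self.stall_after_s
```

Timestamps are passed in as `now` rather than read from the clock, which keeps the sketch deterministic. Re-queuing a stalled job is only safe if the job itself is idempotent, which is why the first point comes first.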

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.

#30DaysOfSystemDesign #SystemDesign #BackendEngineering #DistributedSystems

Top comments (4)

Joud Awad

A) Polling — Correct for most cases
Client-controlled, stateless, scales independently. The server doesn't care if the client disconnects, retries, or crashes: the job runs and the status endpoint just answers queries. Polling every 5s for a job that takes 2–8 minutes is trivially cheap per client. The key rules: keep job IDs stable and make job submission idempotent, set a TTL on job records so you don't accumulate state forever, and prefer capped exponential backoff over a fixed interval once aggregate polling load matters. Video platforms such as LinkedIn and YouTube expose upload-processing status through their APIs for clients to query.
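A minimal client-side sketch of polling with capped exponential backoff might look like the following. The `fetch_status` callable stands in for a GET on `/jobs/:id/status`. The delay values, the overall deadline, and the set of terminal states are assumptions for illustration, not part of any particular API.

```python
import time

# Assumed terminal states; the real set depends on your job model.
TERMINAL = {"complete", "failed"}


def poll_until_done(fetch_status, *, initial_delay_s=1.0, max_delay_s=30.0,
                    deadline_s=900.0, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal state.

    The delay doubles after every non-terminal response and is capped at
    max_delay_s. Raises TimeoutError once the total wait would exceed
    deadline_s.
    """
    delay = initial_delay_s
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if waited + delay > deadline_s:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay_s)
```

The `sleep` parameter is injectable so the loop can be exercised without actually waiting. With millions of clients you would likely also add random jitter to each delay so that polls do not synchronize.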

Joud Awad

B) Webhook — Right idea, wrong default
Webhooks are great for server-to-server flows where the receiver has a stable HTTPS endpoint. They fall apart for mobile clients (no public URL), for receivers behind NAT or a firewall, and when the receiver is down at delivery time. You'd need a retry queue, delivery guarantees, and signature verification just to make it reliable. Webhooks work well as a supplementary delivery mechanism for platform integrations, not as the primary client notification path for end users.
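For the signature-verification piece, one common approach is an HMAC over the raw request body using a secret shared with the receiver. The sketch below assumes HMAC-SHA256 with a hex-encoded signature. The function names are hypothetical, and how the signature travels (for example, in a request header) is left to the integration.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Sender side: compute a hex HMAC-SHA256 signature over the raw body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    expected = sign_payload(secret, body)
    # compare_digest avoids leaking, via timing, how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"),
                               received_signature.encode("utf-8"))
```

The receiver should verify against the exact bytes it received, before any JSON parsing or re-serialization, because re-encoding can change the payload and break the match.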

Joud Awad

C) SSE / WebSocket — Expensive for this use case
Real-time push is fantastic for chat, live dashboards, and collaborative editing. For a job that takes 2–8 minutes and is triggered once? You're holding an open connection, burning a file descriptor, and adding connection-management complexity for maybe 3 meaningful state changes. SSE makes sense when you genuinely need sub-second updates or continuous streaming. For async job progress, polling is cheaper and simpler.

Joud Awad

D) Synchronous wait — Production antipattern
Keeping the HTTP connection open for 2–8 minutes runs straight into your load balancer's idle timeout (most time out at 30–60s), ties up a server thread or process, and gives the client no recovery path if the connection drops mid-job. This is how systems get "stuck": jobs complete but the client never hears about it because the connection dropped at minute 3. Never block on long-running work. Always return immediately with a job ID.
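A framework-free sketch of "return immediately with a job ID" could look like this. The handler enqueues the work and answers with HTTP 202 Accepted plus a pointer to the status endpoint. The names `handle_upload`, `work_queue`, and `job_status` are hypothetical, and the in-process queue stands in for a real message broker.

```python
import queue
import uuid

work_queue = queue.Queue()  # stand-in for a real job queue
job_status = {}             # stand-in for a job table: job_id -> state


def handle_upload(video_ref):
    """Accept the request, enqueue the work, and return without waiting."""
    job_id = str(uuid.uuid4())
    job_status[job_id] = "queued"
    work_queue.put((job_id, video_ref))
    # 202 Accepted: the request was received, but processing is not done.
    headers = {"Location": f"/jobs/{job_id}/status"}
    body = {"job_id": job_id, "status": "queued"}
    return 202, headers, body
```

The response carries everything the client needs to recover after a dropped connection: a job ID it can store and a status URL it can query later.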