Gabriel Anhaia

Posted on May 23

Design a Video Upload Pipeline: Chunked, Resumable, Fan-Out Transcode

#systemdesign #interview #video #distributedsystems


You said "upload to S3, encode in the background." The interviewer asks what happens when a user's phone drops connection at 87%. You said "they retry." They didn't hire you.

Most "design YouTube" prep collapses into the read path: CDN, HLS, recommendation. The write path is the harder problem and it's where interview signal lives. A real video upload pipeline is five subsystems pretending to be one endpoint. Get any of them wrong and you ship a feature that works for the first 1 GB clip on Wi-Fi, then silently corrupts the rest.

This post walks the write path end-to-end. Multipart resumable upload. Per-chunk validation. Transcode fanout with a priority queue. Thumbnail extraction. CDN warm-up. Then the five failure modes interviewers actually ask about, with mitigations you can defend.

Why "upload to S3" is the wrong answer

PUT /upload returns 200 when the bytes arrive. That's fine for a 2 MB profile picture. It is not fine for a 4 GB 4K clip from a phone on flaky LTE in a basement. The single-PUT design fails on three axes:

Connection loss. TCP drops at 87%, you restart from 0%.
Memory. Your API gateway buffers the whole body before forwarding. 4 GB times 200 concurrent uploads is a node OOM.
Validation latency. You can't checksum what you haven't received. By the time you reject a corrupt upload, the client has already burned 30 minutes of data.

The correct mental model: an upload is a stateful session that produces N independent immutable chunks, plus a finalization step that assembles them. The HTTP request that "starts" the upload is just session metadata. The bytes go straight from the client to object storage. Your application never sees them.

S3 supports this natively through multipart upload. So do GCS and R2. You're not reinventing this. You're orchestrating it.

Subsystem 1: Multipart resumable upload

The session has four states: INITIATED, IN_PROGRESS, COMPLETED, ABORTED. Each chunk has its own state: PENDING, UPLOADING, UPLOADED, VERIFIED, plus FAILED when validation rejects it. The client drives both. Your API just records what happened.
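
The chunk states are easier to defend on a whiteboard with the legal transitions written down. A minimal sketch: the state names come from this post (FAILED shows up in the failure-mode section), but the transition table itself is an assumption, not something S3 or any library enforces.

```python
from enum import Enum


class ChunkState(str, Enum):
    PENDING = "PENDING"
    UPLOADING = "UPLOADING"
    UPLOADED = "UPLOADED"
    VERIFIED = "VERIFIED"
    FAILED = "FAILED"


# Assumed edges. The post names the states but not the transitions.
CHUNK_TRANSITIONS = {
    ChunkState.PENDING: {ChunkState.UPLOADING},
    ChunkState.UPLOADING: {ChunkState.UPLOADED, ChunkState.PENDING},  # drop -> back to PENDING
    ChunkState.UPLOADED: {ChunkState.VERIFIED, ChunkState.FAILED},
    ChunkState.FAILED: {ChunkState.UPLOADING},  # client re-uploads the chunk
    ChunkState.VERIFIED: set(),  # terminal
}


def advance_chunk(current: ChunkState, target: ChunkState) -> ChunkState:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in CHUNK_TRANSITIONS[current]:
        raise ValueError(f"illegal chunk transition {current.value} -> {target.value}")
    return target
```

Rejecting illegal transitions at the API keeps a buggy client from marking a chunk VERIFIED that the validator never saw.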

Here's the initiation flow in Python with boto3:

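import json
from uuid import uuid4

import boto3
from redis import Redis

s3 = boto3.client("s3")
redis = Redis()  # assumes a reachable Redis; connection details omitted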

def initiate_upload(filename: str, size_bytes: int, user_id: str):
    # 50 MB chunks. Small enough for mobile, big enough that
    # the per-chunk overhead doesn't dominate. S3 minimum is 5 MB
    # (except the last part). Max is 5 GB per part, 10000 parts total.
    chunk_size = 50 * 1024 * 1024
    chunk_count = (size_bytes + chunk_size - 1) // chunk_size

    if chunk_count > 10000:
        raise ValueError("file too large; bump chunk_size")

    key = f"raw/{user_id}/{uuid4()}/{filename}"
    response = s3.create_multipart_upload(
        Bucket="video-ingest",
        Key=key,
        ContentType="video/mp4",
        # server-side encryption: required, never optional
        ServerSideEncryption="AES256",
    )

    session = {
        "session_id": str(uuid4()),
        "s3_upload_id": response["UploadId"],
        "s3_key": key,
        "chunk_size": chunk_size,
        "chunk_count": chunk_count,
        "user_id": user_id,
        "state": "INITIATED",
        "chunks": [
            {"index": i, "state": "PENDING", "etag": None, "client_xxhash": None}
            for i in range(chunk_count)
        ],
    }
    redis.setex(
        f"upload:{session['session_id']}",
        86400,  # 24h TTL. Abandoned uploads die naturally.
        json.dumps(session),
    )
    return session
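
The initiation code raises when the part count passes 10,000 and tells you to bump chunk_size. One way to do that bump automatically, as a sketch: the doubling strategy and the function name are assumptions, the limits are the S3 ones quoted in the comment above.

```python
MIB = 1024 * 1024
S3_MAX_PARTS = 10_000
S3_MIN_PART = 5 * MIB
S3_MAX_PART = 5 * 1024 * MIB  # 5 GB per part


def pick_chunk_size(size_bytes: int, preferred: int = 50 * MIB) -> tuple[int, int]:
    """Return (chunk_size, chunk_count) that fits S3's multipart limits."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    chunk_size = max(preferred, S3_MIN_PART)
    # Double the chunk size until the part count fits under the cap.
    while -(-size_bytes // chunk_size) > S3_MAX_PARTS:
        chunk_size *= 2
    if chunk_size > S3_MAX_PART:
        raise ValueError("file too large for a single multipart upload")
    return chunk_size, -(-size_bytes // chunk_size)  # ceiling division


# A 4 GiB clip stays on 50 MiB chunks; a 1 TiB file gets bumped to 200 MiB.
print(pick_chunk_size(4 * 1024 * MIB))
print(pick_chunk_size(1024 * 1024 * MIB))
```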

Session state lives in Redis, not Postgres. The chunk-state mutations are hot (every chunk PUT triggers one), and you don't need durability. If Redis loses the session, the client treats it as expired and starts over. The S3 side needs its own cleanup: S3 keeps the parts of an incomplete multipart upload, and bills for them, until the upload is completed or aborted. Add a bucket lifecycle rule (AbortIncompleteMultipartUpload) set to 7 days so orphaned uploads don't bill you forever.
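
For the S3-side cleanup, the lifecycle rule is a small piece of bucket configuration. A sketch of what it could look like: the rule ID and the raw/ prefix filter are assumptions, and the boto3 call is left as a comment because it needs real credentials.

```python
def abort_incomplete_uploads_rule(prefix: str = "raw/", days: int = 7) -> dict:
    """Build a lifecycle configuration that aborts stale multipart uploads."""
    return {
        "Rules": [
            {
                "ID": "abort-incomplete-multipart",  # illustrative name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


# Applying it (not executed here):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="video-ingest",
#       LifecycleConfiguration=abort_incomplete_uploads_rule(),
#   )
```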

For each chunk, the client asks your API for a presigned URL and uploads directly to S3:

def get_chunk_url(session_id: str, chunk_index: int):
    session = load_session(session_id)
    if session["chunks"][chunk_index]["state"] in ("UPLOADED", "VERIFIED"):
        return {"already_done": True}

    url = s3.generate_presigned_url(
        "upload_part",
        Params={
            "Bucket": "video-ingest",
            "Key": session["s3_key"],
            "UploadId": session["s3_upload_id"],
            "PartNumber": chunk_index + 1,  # S3 is 1-indexed, yes really
        },
        ExpiresIn=3600,
    )
    return {"url": url, "headers": {"x-amz-content-sha256": "UNSIGNED-PAYLOAD"}}

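After S3 acknowledges a chunk, the client POSTs the part's ETag, along with its own xxhash of the chunk, back to your API. You record both in the session and mark the chunk UPLOADED. The validator in Subsystem 2 promotes it to VERIFIED. When every chunk is verified, the client calls finalize: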
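
The ETag-reporting step has no code in this post, so here is a minimal sketch of what the handler could do to the session dict. The function name and the client_xxhash field are assumptions; persistence (load_session / save_session) is left to the caller.

```python
def record_chunk_etag(session: dict, chunk_index: int, etag: str, client_xxhash: str) -> dict:
    """Record what the client reported for one chunk and mark it UPLOADED."""
    if not 0 <= chunk_index < len(session["chunks"]):
        raise IndexError(f"chunk {chunk_index} is out of range")
    chunk = session["chunks"][chunk_index]
    if chunk["state"] == "VERIFIED":
        return session  # duplicate report for a finished chunk: ignore it
    chunk.update(etag=etag, client_xxhash=client_xxhash, state="UPLOADED")
    if session["state"] == "INITIATED":
        session["state"] = "IN_PROGRESS"
    return session
```

Making the handler idempotent matters here: mobile clients retry the report when the response gets lost, and a second report must not knock a VERIFIED chunk back to UPLOADED.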

def finalize(session_id: str):
    session = load_session(session_id)
    if any(c["state"] != "VERIFIED" for c in session["chunks"]):
        raise ValueError("not all chunks verified")

    parts = [
        {"PartNumber": c["index"] + 1, "ETag": c["etag"]}
        for c in session["chunks"]
    ]
    s3.complete_multipart_upload(
        Bucket="video-ingest",
        Key=session["s3_key"],
        UploadId=session["s3_upload_id"],
        MultipartUpload={"Parts": parts},
    )
    session["state"] = "COMPLETED"
    save_session(session)
    publish_to_transcode_queue(session["s3_key"], session["user_id"])

The resumability win: if the client drops at chunk 47 of 80, on reconnect it asks the API for the session, sees chunks 0-46 are UPLOADED, and resumes from 47. No bytes get re-uploaded. The S3 multipart upload sits open server-side waiting for the rest.
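
On the client, resume boils down to one filter over the session the API returns. A sketch, assuming the session shape used in the initiation code:

```python
def chunks_to_upload(session: dict) -> list[int]:
    """Indexes the client still has to send, in order."""
    done = {"UPLOADED", "VERIFIED"}
    return [c["index"] for c in session["chunks"] if c["state"] not in done]


# Chunks 0-46 landed before the drop; 47-79 are still pending.
session = {
    "chunks": [
        {"index": i, "state": "UPLOADED" if i < 47 else "PENDING"}
        for i in range(80)
    ]
}
remaining = chunks_to_upload(session)
print(remaining[0], len(remaining))
```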

Subsystem 2: Chunk validation and assembly

This is where the "design YouTube" candidates lose the room. You don't trust client-reported checksums. The client could be on a router that's silently corrupting frames, or it could be a malicious upload trying to inject content past the validator.

For each chunk, two checks run:

Integrity checksum. Does the byte stream the client sent match what arrived?
Content scan. Is the chunk a plausible piece of the file it claims to be?

For the checksum, the trade is xxhash vs sha256:

xxhash (XXH3) hits ~30 GB/s on a single core. Non-cryptographic. Catches random bit-flips, TCP corruption, disk errors. Will not catch a deliberate adversarial collision.
sha256 hits ~500 MB/s with hardware acceleration, ~150 MB/s without. Cryptographic. Catches everything xxhash catches plus deliberate tampering.

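Use xxhash for per-chunk integrity, computed by the client and verified by a small validator (a Lambda works) that runs when the client reports the chunk. One S3 caveat: there is no per-part event notification (the closest event is s3:ObjectCreated:CompleteMultipartUpload), and parts of an in-progress multipart upload can't be read back. To verify before finalize, either stage each chunk as its own object and assemble with UploadPartCopy, or send one of S3's built-in part checksums (CRC32C, SHA-256) with UploadPart and let S3 reject mismatches on arrival. Use sha256 for the final assembled file, computed once and stored as the canonical content hash for dedupe and tamper detection. You don't need cryptographic strength on every chunk because the final hash covers the whole object. You do need fast per-chunk so the validator doesn't backlog under burst.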
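
On the client side, both hashes can come out of a single read pass over the file. The sketch below uses only the standard library, so zlib.crc32 stands in for XXH3 as the fast per-chunk checksum; swap in the xxhash package for the real thing. The function name and the read_size default are assumptions.

```python
import hashlib
import io
import zlib
from typing import BinaryIO


def hash_stream(stream: BinaryIO, chunk_size: int, read_size: int = 1024 * 1024):
    """Return (per_chunk_checksums, whole_file_sha256_hex) in one pass."""
    whole = hashlib.sha256()
    per_chunk = []
    while True:
        remaining = chunk_size
        crc = 0
        got_any = False
        while remaining > 0:
            piece = stream.read(min(read_size, remaining))
            if not piece:
                break  # end of file
            got_any = True
            crc = zlib.crc32(piece, crc)  # stand-in for the per-chunk xxhash
            whole.update(piece)
            remaining -= len(piece)
        if not got_any:
            break
        per_chunk.append(f"{crc:08x}")
        if remaining > 0:
            break  # short final chunk, nothing left to read
    return per_chunk, whole.hexdigest()


data = b"a" * 10 + b"b" * 5
checksums, digest = hash_stream(io.BytesIO(data), chunk_size=10, read_size=4)
print(len(checksums), digest[:12])
```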

import xxhash

def verify_chunk(session_id: str, chunk_index: int):
    session = load_session(session_id)
    chunk = session["chunks"][chunk_index]
    client_xxh = chunk["client_xxhash"]

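    # GetObject with PartNumber only works once the multipart upload
    # is completed. If chunks are staged as standalone objects, read
    # the staged object's key here instead.
    s3_obj = s3.get_object(
        Bucket="video-ingest",
        Key=session["s3_key"],
        PartNumber=chunk_index + 1,
    )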

    hasher = xxhash.xxh3_64()
    for piece in s3_obj["Body"].iter_chunks(8 * 1024 * 1024):
        hasher.update(piece)

    if hasher.hexdigest() != client_xxh:
        # don't retry blindly. Most "checksum mismatch" cases
        # are real client-side bugs, not transient noise.
        mark_chunk_failed(session_id, chunk_index, "checksum_mismatch")
        return

    chunk["state"] = "VERIFIED"
    save_session(session)

Content scanning is the second pass. Run ClamAV or a managed equivalent against the assembled object before it hits the transcode queue. Some shops also run a frame-level sanity check (does the first chunk start with a valid container header?) to reject obvious "I uploaded a .exe renamed to .mp4" attempts before transcoding burns CPU on them.
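
The container-header check is cheap enough to sketch. This assumes MP4-family uploads (ISO base media files), which normally open with an ftyp box: a 4-byte big-endian size followed by the 4-byte type. It is a heuristic, not a validator; some older QuickTime files put a different box first, and the function name is an assumption.

```python
def looks_like_mp4(head: bytes) -> bool:
    """Heuristic: does the first chunk start with an ISO BMFF 'ftyp' box?"""
    if len(head) < 12:
        return False
    box_size = int.from_bytes(head[:4], "big")
    # A box is at least 8 bytes (size + type); ftyp also carries a major brand.
    return head[4:8] == b"ftyp" and box_size >= 8


mp4_head = (24).to_bytes(4, "big") + b"ftypisom" + bytes(12)
exe_head = b"MZ\x90\x00" + bytes(20)  # Windows executables start with "MZ"
print(looks_like_mp4(mp4_head), looks_like_mp4(exe_head))
```

Run it against the first few bytes of chunk 0 as soon as that chunk verifies, so the rejection happens long before finalize.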

Subsystem 3: Transcode fanout with priority queue

A single source video becomes 12-18 outputs: HLS at 144p/240p/360p/480p/720p/1080p/4K, DASH variants, audio-only, low-bandwidth previews, plus codec splits (H.264 for legacy, HEVC for newer Apple, AV1 for bandwidth wins). Each output is an independent transcode job. They're embarrassingly parallel.

The interview-tier insight: you don't transcode them in arbitrary order. The user watches the first 30 seconds at the lowest viewable resolution within a few seconds of finishing the upload. Everything else can wait minutes. So the priority queue looks like this:

PRIORITY_LOW_RES_FIRST_30S = 0    # ship-blocking
PRIORITY_LOW_RES_FULL = 1         # whole video at 360p
PRIORITY_AUDIO_ONLY = 1           # cheap, parallel
PRIORITY_MID_RES_FULL = 2         # 720p
PRIORITY_HIGH_RES_FULL = 3        # 1080p+
PRIORITY_PREMIUM = 4              # 4K, HEVC, AV1

def fan_out_transcode(s3_key: str, user_id: str, duration_s: int):
    publish_priority(
        queue="transcode-low",
        priority=PRIORITY_LOW_RES_FIRST_30S,
        payload={
            "src": s3_key,
            "rendition": "360p_h264",
            "segment": (0, min(30, duration_s)),
        },
    )
    publish_priority(
        queue="transcode-low",
        priority=PRIORITY_LOW_RES_FULL,
        payload={"src": s3_key, "rendition": "360p_h264"},
    )
    # 720p is mid priority; 1080p H.264 is high priority on the same queue.
    for rendition, priority in [
        ("720p_h264", PRIORITY_MID_RES_FULL),
        ("1080p_h264", PRIORITY_HIGH_RES_FULL),
    ]:
        publish_priority(
            queue="transcode-mid",
            priority=priority,
            payload={"src": s3_key, "rendition": rendition},
        )
    # 4K, HEVC, and AV1 are the premium tier.
    for rendition in ["1080p_hevc", "2160p_h264", "1080p_av1"]:
        publish_priority(
            queue="transcode-high",
            priority=PRIORITY_PREMIUM,
            payload={"src": s3_key, "rendition": rendition},
        )

Three separate queues, not one. Why: a flood of 4K jobs from one creator should never starve the 360p first-30s queue that an entire region of viewers is waiting on. Per-queue worker pools size differently. Low-res is CPU-light and you can run hundreds. AV1 is brutal, so you cap at a handful of GPU nodes.
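
Within a single queue, the priority numbers only matter if the consumer honors them. An in-memory sketch with heapq shows the ordering; a real deployment would lean on the broker's priority support, and the class name here is an assumption.

```python
import heapq
import itertools


class PriorityJobQueue:
    """Lowest priority number pops first; ties pop in publish order."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = itertools.count()  # tie-breaker so payload dicts are never compared

    def publish(self, priority: int, payload: dict) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), payload))

    def pop(self) -> dict:
        _, _, payload = heapq.heappop(self._heap)
        return payload

    def __len__(self) -> int:
        return len(self._heap)


q = PriorityJobQueue()
q.publish(4, {"rendition": "2160p_h264"})
q.publish(0, {"rendition": "360p_h264", "segment": (0, 30)})
q.publish(1, {"rendition": "360p_h264"})
q.publish(1, {"rendition": "audio"})
order = [q.pop() for _ in range(len(q))]
print([job["rendition"] for job in order])
```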

The FFmpeg invocation for a single rendition is unsurprising:

ffmpeg -i "$SRC" \
  -c:v libx264 -preset veryfast -crf 23 \
  -c:a aac -b:a 128k \
  -hls_time 6 -hls_playlist_type vod \
  -hls_segment_filename "out/seg_%04d.ts" \
  "out/index.m3u8"

What's not obvious: each worker pulls the source from S3 to local NVMe, transcodes, uploads outputs, then deletes the local copy. If the worker crashes mid-transcode, the message goes back to the queue with a visibility-timeout retry. The output bucket is keyed by content hash plus rendition, so retries are idempotent. A partial upload from a crashed worker gets overwritten cleanly.
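
The idempotency claim rests on the output key being a pure function of content hash and rendition: no timestamps, no worker IDs, no attempt counters. A sketch, following the output/<content_hash>/<rendition>/ layout used in the failure-mode section; the function name and the sample hash are illustrative.

```python
def output_key(content_hash: str, rendition: str, filename: str) -> str:
    """Deterministic key: a retried job writes to exactly the same place."""
    return f"output/{content_hash}/{rendition}/{filename}"


first_attempt = output_key("9f2c", "360p_h264", "seg_0000.ts")
retry = output_key("9f2c", "360p_h264", "seg_0000.ts")
print(first_attempt == retry)
```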

Subsystem 4: Thumbnail and preview generation

Common interview mistake: putting thumbnails after transcode. They're independent. Run them in parallel, on a separate worker pool, against the same source.

Three outputs:

Hero thumbnail: single frame extracted around the 10% timecode (avoids opening black frames and end-card frames).
Sprite sheet: a grid of frames at 1 fps for scrubber preview, packed into a single JPEG with a WebVTT cue file.
Animated preview: 3-second WebP loop, used in feed cards on hover.

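import subprocess

def generate_thumbnails(s3_key: str, duration_s: int):
    # download_source is a hypothetical helper: it copies the source
    # object to local disk and returns the path.
    local_src = download_source(s3_key)

    hero_ts = int(duration_s * 0.1)
    subprocess.run([
        "ffmpeg", "-ss", str(hero_ts), "-i", local_src,
        "-frames:v", "1", "-q:v", "2", "out/hero.jpg",
    ], check=True)

    # fps=1 with a 10x6 tile and -frames:v 1 produces one sheet that
    # covers the first 60 seconds. Longer videos need a lower sampling
    # rate or several sheets.
    subprocess.run([
        "ffmpeg", "-i", local_src,
        "-vf", "fps=1,scale=160:90,tile=10x6",
        "-frames:v", "1", "out/sprite.jpg",
    ], check=True)
    write_vtt_cues(duration_s, "out/sprite.vtt")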

    preview_start = max(0, hero_ts - 1)
    subprocess.run([
        "ffmpeg", "-ss", str(preview_start), "-t", "3",
        "-i", local_src, "-vcodec", "libwebp",
        "-loop", "0", "-q:v", "60",
        "out/preview.webp",
    ], check=True)

Why this matters for the interview: if thumbnails block on transcode completing, a 20-minute 4K upload can't render a feed card for 20 minutes. Decoupling thumbnail extraction means the card is ready in seconds. Same source, different pool, different SLA.
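
The thumbnail code calls write_vtt_cues without showing it. A sketch of what the cue generation could look like for a single 10x6 sheet of 160x90 tiles sampled at 1 fps. The #xywh media-fragment convention is what most players read for scrubber previews; the function names and defaults are assumptions.

```python
def format_ts(seconds: int) -> str:
    """Format whole seconds as a WebVTT timestamp."""
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}.000"


def sprite_vtt(duration_s: int, sheet_name: str = "sprite.jpg",
               cols: int = 10, rows: int = 6,
               tile_w: int = 160, tile_h: int = 90) -> str:
    """One cue per second, each pointing at a tile in the sprite sheet."""
    frames = min(duration_s, cols * rows)  # a single sheet holds cols*rows frames
    lines = ["WEBVTT", ""]
    for i in range(frames):
        x = (i % cols) * tile_w   # tiles fill left to right
        y = (i // cols) * tile_h  # then top to bottom
        lines.append(f"{format_ts(i)} --> {format_ts(i + 1)}")
        lines.append(f"{sheet_name}#xywh={x},{y},{tile_w},{tile_h}")
        lines.append("")
    return "\n".join(lines)


vtt_lines = sprite_vtt(12).split("\n")
print(vtt_lines[2])
print(vtt_lines[3])
```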

Subsystem 5: CDN warm-up

The naive read path assumes "upload to S3, CloudFront caches on first request." That's true. It's also why your first viewer in São Paulo waits 4 seconds for the first segment to fetch from us-east-1.

CDN warm-up is a write-path subsystem. As soon as the low-res-first-30s rendition lands in the output bucket, a warm-up worker pushes the manifest and first segments to edge POPs:

Uploader's geography first. They're the most likely first viewer (they'll watch their own upload).
Top-N popular POPs second. Your global top edges by historical request volume.
Lazy elsewhere. Don't warm POPs nobody will hit; you're paying egress for cache fills that never serve a request.

import requests

def warm_cdn(s3_key: str, manifest_url: str, uploader_geo: str):
    pop_list = [
        nearest_pop_for(uploader_geo),
        *TOP_POPS_BY_TRAFFIC[:5],
    ]
    for pop in pop_list:
        # CloudFront doesn't have a true "push". You fire prefetch
        # requests from a worker in the target region. The result
        # populates the regional edge cache.
        # The prefetch proxy hostname is illustrative, not a real endpoint.
        proxies = {"https": f"https://prefetch.{pop}.internal"}
        requests.get(manifest_url, headers={"X-Force-Cache": "1"},
                     proxies=proxies, timeout=10)
        for segment in first_segments(manifest_url, count=5):
            # Route segments through the same POP; otherwise they
            # warm whichever edge is nearest to this worker.
            requests.get(segment, proxies=proxies, timeout=10)

This is its own subsystem with its own SLA. "Low-res ready" doesn't mean "ready to watch." It means "ready to start warming." The viewer-facing SLA is "first segment cached in user's POP," which is 5-30 seconds after low-res finishes.
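
The ordering rule (uploader's POP first, then the top POPs by traffic) has one wrinkle: the uploader's POP is often already in the top list, and warming it twice wastes requests. A sketch of the target selection; the POP codes are made up for illustration.

```python
def warmup_targets(uploader_pop: str, top_pops: list[str], top_n: int = 5) -> list[str]:
    """Uploader's POP first, then the top-N by traffic, without duplicates."""
    ordered = [uploader_pop, *top_pops[:top_n]]
    return list(dict.fromkeys(ordered))  # dedupe, keep first occurrence


targets = warmup_targets("gru", ["iad", "fra", "gru", "nrt"], top_n=3)
print(targets)
```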

The 5 failure modes interviewers ask about

Failure 1: Client drops mid-upload

Symptom: TCP connection dies at chunk 47 of 80. Client reconnects 4 minutes later from a different IP.

Mitigation: Session state in Redis with a 24h TTL keyed by session_id, not by IP or auth token. On reconnect, the client re-fetches session state, sees which chunks are UPLOADED, resumes from the first non-uploaded chunk. S3 multipart upload has its own server-side TTL via bucket lifecycle (7 days) so the in-flight multipart isn't garbage collected too aggressively.

The gotcha: Don't bind the session to IP. Mobile clients change IP constantly. Bind it to a session token the client stores. Re-auth on resume so a leaked token doesn't let an attacker resume someone else's upload.
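
One way to bind the session to a client-held token and still re-check identity on resume, as a sketch: an HMAC over session_id and user_id, verified against the freshly authenticated user. The secret handling and function names are assumptions, not a full auth design.

```python
import hashlib
import hmac


def sign_resume_token(secret: bytes, session_id: str, user_id: str) -> str:
    """Token the client stores and presents on resume."""
    msg = f"{session_id}:{user_id}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def can_resume(secret: bytes, session: dict, presented_token: str,
               authenticated_user_id: str) -> bool:
    """Require both a valid token and a re-authenticated owner."""
    if session["user_id"] != authenticated_user_id:
        return False  # a leaked token alone is not enough
    expected = sign_resume_token(secret, session["session_id"], session["user_id"])
    return hmac.compare_digest(expected, presented_token)


secret = b"demo-secret"  # illustrative only; load from a secret store in practice
session = {"session_id": "s-1", "user_id": "u-42"}
token = sign_resume_token(secret, "s-1", "u-42")
print(can_resume(secret, session, token, "u-42"))
```

Nothing in the check looks at the client's IP, which is the point.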

Failure 2: Chunk corrupt on arrival

Symptom: client_xxhash doesn't match the chunk you read back from S3.

Mitigation: Mark the specific chunk FAILED with reason checksum_mismatch. The client re-uploads that chunk under the same part number, which overwrites the bad part; it requests a fresh presigned URL if the old one has expired. Don't auto-retry from the server side. Most checksum failures are real client-side faults (a flaky Wi-Fi extender, dying RAM in the phone), and infinite retry just rotates the same broken bytes through your validator.

The gotcha: Three consecutive checksum failures on the same chunk index from the same session should ABORT the session and surface a user-visible error. Otherwise you'll see uploads that grind forever and never finish.
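
The three-strikes rule is a counter on the chunk. A sketch, assuming a failures field on each chunk (not part of the session shape shown earlier). Because a chunk stops being validated once it is VERIFIED, every failure counted on a chunk is consecutive by construction.

```python
MAX_CHECKSUM_FAILURES = 3


def register_checksum_failure(session: dict, chunk_index: int) -> str:
    """Count a checksum mismatch; abort the session on the third strike."""
    chunk = session["chunks"][chunk_index]
    chunk["failures"] = chunk.get("failures", 0) + 1
    chunk["state"] = "FAILED"
    if chunk["failures"] >= MAX_CHECKSUM_FAILURES:
        session["state"] = "ABORTED"  # surface a user-visible error from here
    return session["state"]


session = {"state": "IN_PROGRESS", "chunks": [{"index": 0, "state": "UPLOADED"}]}
states = [register_checksum_failure(session, 0) for _ in range(3)]
print(states)
```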

Failure 3: Transcode worker crashes mid-job

Symptom: FFmpeg OOMs or the node terminates. Output is partial.

Mitigation: Each transcode job is a message in SQS with a visibility timeout matched to worst-case transcode duration (10 min for low-res, 60 min for 4K AV1); on RabbitMQ, the equivalent is manual acknowledgement, with the broker redelivering when the consumer dies. On timeout, the message goes back to the queue. Output keys are deterministic (output/<content_hash>/<rendition>/) so the retry overwrites each partial object in place. Workers stream output back to S3 in multipart chunks (never as a single PUT) so a partial transcode doesn't fail the upload step too.

The gotcha: If you don't isolate per-codec queues, a flood of AV1 jobs from one creator can starve the 360p-first-30s queue. Separate queues, separate worker pools, separate SLAs.

Failure 4: S3/storage throttle

Symptom: PUT requests start returning 503 SlowDown. Or S3 starts throttling reads from a hot key during transcode fanout.

Mitigation: Two halves. Write-side: spread keys across many prefixes. S3 scales request rates per key prefix, roughly 3,500 writes and 5,500 reads per second each, and splits hot prefixes automatically, but that split takes time. A sudden burst on one shared prefix such as raw/2026-05-23/... still draws 503s, while the user_id and uuid4() segments in the key spread load across many prefixes. Read-side: when fanning out transcode jobs against the same source object, stagger worker startup so 18 workers don't simultaneously hit the same S3 prefix. A 100ms jitter is enough.

The gotcha: Bucket-wide 503s are rare but real during multi-tenant spikes. Have a retry policy with exponential backoff and a dead-letter queue. Don't let transcode workers retry-storm S3. They'll just deepen the throttle.
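
A retry policy that doesn't retry-storm looks roughly like this: capped exponential backoff with full jitter, and a bounded number of attempts before the job goes to the dead-letter queue. This is a sketch; the Throttled exception and the defaults are assumptions, and a real worker would map the storage client's 503 SlowDown error onto it.

```python
import random
import time


class Throttled(Exception):
    """Stand-in for a 503 SlowDown from object storage."""


def run_with_backoff(op, max_attempts: int = 5, base_s: float = 0.5,
                     cap_s: float = 30.0, sleep=time.sleep, rng=random.random):
    """Return (True, result) on success, (False, None) when attempts run out."""
    for attempt in range(max_attempts):
        try:
            return True, op()
        except Throttled:
            if attempt == max_attempts - 1:
                break  # out of attempts: caller routes the job to the DLQ
            ceiling = min(cap_s, base_s * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter: anywhere in [0, ceiling)
    return False, None
```

The jitter is what keeps a fleet of throttled workers from retrying in lockstep and deepening the throttle.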

Failure 5: CDN warm-up lag

Symptom: Low-res rendition is ready, the playback URL works, but the first viewer in Sydney waits 6 seconds for the first segment because the closest edge POP hasn't cached it yet.

Mitigation: Separate the "ready" signal. transcode_complete means the file exists in origin. playback_ready means the closest 5 POPs have warmed manifests and first segments. The viewer-facing "your video is ready" notification fires on playback_ready, not transcode_complete. For premium-tier creators, warm more POPs upfront. For free tier, warm only uploader-geo + top-3 and let the cold cache fill on demand.
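
The two signals can be computed separately, which is what keeps them from collapsing back into one flag. A sketch: the function name and POP codes are illustrative, and how warmed POPs get reported back is left out.

```python
def ready_signals(transcode_done: bool, required_pops: set[str],
                  warmed_pops: set[str]) -> dict:
    """transcode_complete: file exists at origin. playback_ready: edges are warm."""
    return {
        "transcode_complete": transcode_done,
        "playback_ready": transcode_done and required_pops <= warmed_pops,
    }


required = {"syd", "iad", "fra"}
print(ready_signals(True, required, {"syd", "iad"}))
print(ready_signals(True, required, {"syd", "iad", "fra", "nrt"}))
```

The notification worker subscribes to playback_ready only; nothing user-facing should key off transcode_complete.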

The gotcha: Some CDNs expose prefetch or edge-compute hooks that get closer to a real push-warm (Cloudflare Workers and Fastly Compute are the usual candidates; check what your provider actually offers). CloudFront doesn't have a first-class one. You fire HTTP GETs from a worker running in the target region and rely on the regional edge cache populating from those. Build the abstraction so swapping the CDN doesn't break the warm-up subsystem.

The 90-second answer

When the interviewer says "design a video upload pipeline," this is the shape:

Upload is a stateful session, not a PUT. Session state in Redis, bytes in S3 multipart. Per-chunk xxhash for fast integrity, sha256 on the final assembled object for dedupe. Transcode fans out into 12-18 renditions across three priority queues. Low-res-first-30s gets shipped before anything else; premium codecs go to a small GPU pool. Thumbnails run in parallel on their own workers, not after transcode. CDN warm-up is its own subsystem with its own SLA, pushing manifests to uploader-geo first and top global POPs second. Five failure modes: client drop (Redis session resume), chunk corrupt (per-chunk checksum reject), transcode crash (visibility timeout + idempotent output keys), storage throttle (key cardinality + jittered fanout), warm-up lag (separate playback_ready signal from transcode_complete). Failure boundaries match component boundaries; each subsystem fails on its own without taking the others down.

That's the answer. Notice what it doesn't talk about: the read path, recommendation, the player. Those are different interviews.

The write path is harder than the read path. The signal an interviewer is looking for is whether you can decompose a "single upload endpoint" into five independent subsystems with five independent failure modes, and reason about the trade in each. If you can do that for video, you can do it for any large-object ingest pipeline: medical imaging, satellite telemetry, build artifacts. The shape generalizes.

What's the failure mode you've actually hit in production on a write path like this? Drop it in the comments. I'd take a real "we got bit by S3 prefix throttling" war story over another whiteboard sketch any day.

If this was useful

The System Design Pocket Guide: Interviews walks 15 real interview designs (video pipelines, ride-sharing, news feeds, real-time multiplayer) at the same decomposition depth. Each design names the failure modes interviewers probe for and the mitigations that survive follow-up questions. If decomposing a write path into five subsystems made sense to you, the chapters on ingest pipelines and fanout systems will too.