On the surface, this sounds trivial: just store userId, videoId, and a timestamp.
But not when:
- Millions of users press play at the same time
- People switch from TV to phone in seconds
- Writes happen every few seconds
- Resume must feel instant
1. Clarifying the Problem
We are designing a Playback Resume System that allows users to resume watching from where they left off across devices.
We are not designing:
- The video streaming pipeline itself
- Real-time co-watch (two users watching in sync)
- Multi-region replication, global failover, or cross-region consistency trade-offs
- Perfect real-time synchronisation across devices (1–2 second eventual consistency is acceptable)
This service would live within an existing microservices architecture, so I won’t deep-dive into service discovery, deployment, etc., and will focus purely on the playback state.
2. Functional Requirements (User Centric)
- User should be able to resume a video from the last watched position.
- User should be able to switch devices and continue seamlessly.
- User should have an independent watch history per profile.
- The system should update the playback position periodically while watching.
- Latest progress should win if multiple devices update.
3. Non-Functional Requirements
- Resume reads <150ms
- Writes <500ms
- High availability
- Scalable to millions of concurrent users
- Eventual consistency across devices is acceptable (1–2 sec lag)
CAP Theorem Consideration
During a network partition, we choose Availability over Strong Consistency (Partition tolerance itself is non-negotiable in a distributed system).
Why?
Because if one replica is slightly behind, the user resumes a second or two earlier than expected, which is acceptable. Downtime is not.
So:
- High availability > Strong consistency
- Eventual consistency + last write wins is good enough
For playback, perfection isn't needed. Responsiveness is.
4. Data Model
Instead of user_id, we use:
(account_id, profile_id, video_id)
Because in one household:
Account 123
├── Profile A → V1 → 1200s
└── Profile B → V1 → 300s
Each profile tracks progress independently.
We also store:
position
updated_at
device_id
updated_at enables conflict resolution.
Note: Using updated_at for last write wins assumes reasonably synchronised clocks. In production, this is typically handled using server-generated timestamps or monotonic counters. I’m keeping the conflict resolution logic simple here to focus on system behaviour rather than clock management.
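Put together, a single record might look like this sketch. The field names come from the sections above; the class itself is illustrative, since the real schema would live in a key-value store rather than application code:

```python
from dataclasses import dataclass

# One playback record per (account, profile, video) combination.
@dataclass
class PlaybackState:
    account_id: str
    profile_id: str
    video_id: str
    position: int        # seconds into the video
    updated_at: float    # server-generated epoch timestamp, used for last-write-wins
    device_id: str       # which device sent the last update
```

Note that `updated_at` is the only field needed for conflict resolution; `device_id` is kept mainly for debugging and UX messaging.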
5. Scale Estimation
Assume:
- 10M daily users
- 3M actively watching
- Update every 10 seconds
- 30 min session → ~180 updates
That’s:
~540M writes/day
~6K writes/sec
This is not a small system. Logical reads are similar in magnitude, but database reads are significantly reduced via caching.
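The arithmetic behind those numbers is worth sanity-checking:

```python
# Back-of-envelope check of the write estimate above.
active_watchers = 3_000_000
session_seconds = 30 * 60           # a 30-minute session
update_interval = 10                # one update every 10 seconds

updates_per_session = session_seconds // update_interval  # 180 updates
writes_per_day = active_watchers * updates_per_session    # 540M writes/day
writes_per_sec = writes_per_day / 86_400                  # averaged over the day

print(f"{writes_per_day:,} writes/day, ~{writes_per_sec:,.0f} writes/sec")
```

Averaging over a full day gives roughly 6K writes/sec; actual peak traffic (everyone watching at 8 pm) would be several times higher.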
Smarter Write Strategy
In reality, we don’t blindly write every 10 seconds.
We optimise by writing only when:
- Position delta > 15–30 seconds
- OR user pauses
- OR the app goes to the background
- OR periodic checkpoint (e.g., every 60 seconds)
This reduces:
- Write amplification
- Cache churn
- Queue pressure
- Database cost
That 540M/day number can realistically drop 3–5x with smarter checkpointing.
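The checkpointing rules above can be sketched as a small client-side policy function. The thresholds are the illustrative values from the text, not tuned production numbers:

```python
DELTA_THRESHOLD = 30       # seconds of playback progress since last write
CHECKPOINT_INTERVAL = 60   # periodic safety checkpoint, seconds

def should_write(position, last_written_pos, now, last_write_time,
                 paused=False, backgrounded=False):
    """Return True only when a checkpoint is worth persisting."""
    if paused or backgrounded:
        return True                                    # user-visible state change
    if abs(position - last_written_pos) > DELTA_THRESHOLD:
        return True                                    # enough progress since last write
    if now - last_write_time >= CHECKPOINT_INTERVAL:
        return True                                    # periodic checkpoint
    return False
```

Small position deltas during steady playback never reach the backend, which is where most of the 3–5x write reduction comes from.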
6. API Design
Update Playback
POST /playback/update
Body:
- account_id
- profile_id
- video_id
- position
- device_id
Resume Playback
GET /playback/resume?account_id={account_id}&profile_id={profile_id}&video_id={video_id}
7. Start Simple: DB-Only
We could store everything in DynamoDB/Cassandra.
Primary key:
(account_id#profile_id, video_id)
Pros:
- Simple
- Durable
- Easy to scale horizontally
Cons:
- Every resume hits DB
- Higher latency at scale
- Costly under heavy read traffic
Good for MVP. But not ideal for massive scale.
8. Hybrid Architecture
Because a resume is latency sensitive and read-heavy, we introduce caching.
High Level Design
How It Works
Write Flow
- Client sends update.
- Service performs a conditional write to the DB (if updated_at is newer).
- Redis cache is updated.
- Event is optionally published for analytics.
Conditional Writes (Idempotency)
To avoid stale overwrites, we use conditional writes:
Update only if incoming.updated_at > existing.updated_at
This ensures:
- Last write wins
- Safe retries
- No duplicate corruption
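An in-memory sketch of that rule is below. With DynamoDB, the same check maps to a ConditionExpression such as `attribute_not_exists(updated_at) OR updated_at < :incoming`; the dict-based store here is only a stand-in:

```python
# (account_id#profile_id, video_id) -> {"position": ..., "updated_at": ...}
store = {}

def conditional_write(key, position, updated_at):
    """Apply the update only if it is newer than what is already stored."""
    existing = store.get(key)
    if existing is not None and existing["updated_at"] >= updated_at:
        return False  # stale update, or a retry we already applied: drop it
    store[key] = {"position": position, "updated_at": updated_at}
    return True
```

Because an equal timestamp is also rejected, replaying the same request on a retry is a harmless no-op.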
Redis Crash Safety
Rather than writing to Redis first and persisting to the DB later:
We persist to the DB first (durable), then update Redis.
If Redis crashes, the DB remains the source of truth. We prefer durability over shaving a few extra milliseconds off write latency.
Read Flow
- Check Redis.
- If hit → instant resume.
- If miss → fetch from DB → repopulate cache.
Most reads should never touch the database.
Brief stale reads may occur due to replication lag, which is acceptable under our 1–2 second tolerance.
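This is the classic cache-aside pattern. A sketch, with plain dicts standing in for Redis and the database:

```python
cache, db = {}, {}  # stand-ins for Redis and DynamoDB/Cassandra

def resume_position(key):
    """Check the cache first; on a miss, fall back to the DB and repopulate."""
    if key in cache:
        return cache[key]       # hit: instant resume
    state = db.get(key)         # miss: read the durable source of truth
    if state is not None:
        cache[key] = state      # repopulate so the next read is a hit
    return state
```

Since the write flow also updates Redis after each DB write, most sessions resume entirely from cache.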
Failure Handling & Retries
In production, both Redis and the database may occasionally time out or throttle under load.
To protect latency SLOs:
- Reads fall back to DB if Redis times out.
- Writes use bounded retries with exponential backoff.
- Timeouts are enforced at the service layer to avoid request pile-ups.
If a write ultimately fails, we prefer dropping that checkpoint rather than blocking playback. The next update will reconcile the state thanks to our last-write-wins logic.
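Bounded retries with backoff can be sketched in a few lines. `write_fn` is any callable that raises on failure; the attempt count and delays are illustrative:

```python
import time

def write_with_retries(write_fn, max_attempts=3, base_delay=0.05):
    """Retry a failing write a bounded number of times, then give up."""
    for attempt in range(max_attempts):
        try:
            return write_fn()
        except Exception:
            if attempt == max_attempts - 1:
                return None  # give up: drop this checkpoint, don't block playback
            time.sleep(base_delay * (2 ** attempt))  # 50ms, 100ms, 200ms, ...
```

Returning `None` instead of raising reflects the trade-off above: a lost checkpoint is cheaper than a stalled player.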
9. Multi-Device Conflict Handling
If TV and phone both send updates:
- Compare updated_at
- Latest wins
We accept slight inconsistencies because availability matters more.
That’s our CAP trade-off in action.
10. Other Production Considerations
Storage Lifecycle (TTL)
Playback entries shouldn’t live forever. We can expire inactive entries after X days (e.g., 180 days) using TTL policies.
This prevents:
- Unbounded storage growth
- Cold data occupying hot partitions
Hot Partition Prevention
If we partitioned incorrectly (e.g., by video_id), a trending show at 8 pm could create hot shards.
Using:
(account_id#profile_id, video_id)
Ensures even distribution and avoids the hot partition problem.
Proper database capacity planning or auto scaling is required to handle peak write bursts and avoid write throttling under load.
11. UX Guardrails & Data Freshness
Resuming should feel intuitive, not surprising.
To prevent confusing jumps in playback:
- If a resume position differs by only a few seconds, the client may ignore minor regressions.
- We may cap backward jumps beyond a safety threshold (e.g., don’t resume 5 minutes earlier unless requested).
- Clients can display “Resume from 11:11?” to give users control when conflicts occur.
This keeps the system technically simple (last-write-wins) while protecting the user experience from edge-case inconsistencies.
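Those guardrails amount to a small client-side decision function. The thresholds below are illustrative, and the "ask the user" path is simplified to keeping the local position:

```python
MINOR_REGRESSION = 5       # seconds: ignore tiny backward jumps
MAX_BACKWARD_JUMP = 300    # seconds: don't silently rewind more than 5 minutes

def choose_resume_position(local_pos, server_pos):
    """Pick the position to resume from, avoiding confusing jumps."""
    if server_pos >= local_pos:
        return server_pos                        # server is ahead: trust it
    if local_pos - server_pos <= MINOR_REGRESSION:
        return local_pos                         # ignore a minor regression
    if local_pos - server_pos > MAX_BACKWARD_JUMP:
        return local_pos                         # huge rewind: keep local (prompt the user)
    return server_pos                            # moderate rewind: accept server state
```

The server stays simple (last-write-wins); the client absorbs the edge cases the user would actually notice.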
Final Thoughts
This problem looks like a key-value store.
It’s not.
It touches:
- Distributed systems
- Caching strategy
- Conflict resolution
- UX latency expectations
- CAP trade-offs
- Data modelling for real households


