Scaling a video streaming app to 1 million users is a big, awesome problem. This post lays out a compact, practical Kubernetes architecture for it: the design choices, components, scaling patterns, and rough capacity guidelines. Assumptions are called out up front so you can pick the path that matches your traffic pattern.
Assumptions (pick the one that matches you)
We have to assume something, because “1 million users” could mean many things:
Option A - 1M total registered users, low concurrency: ~10k concurrent at peak.
Option B - 1M active users, moderate concurrency: ~100k concurrent at peak.
Option C - 1M concurrent users (extreme): the design becomes CDN-first, with multiple large clusters and heavy multi-region infrastructure.
Let's focus on Option B (100k concurrent) as a realistic high-scale target, with notes for A and C along the way.
High-level architecture (summary)
- Clients (web, mobile, smart TV)
- Global CDN (edge caching for video segments) - primary delivery
- Edge/Ingress: Global LB → Regional edge (CloudLB / Global Accelerator) → Ingress (Envoy/NGINX/ALB)
- Kubernetes Clusters (regional, multi-AZ):
  - Gateway / API (stateless REST/gRPC microservices)
  - Auth / Session / User Service
  - Video Metadata / Catalog (read-optimized DB + caching)
  - Transcoding / Packaging (batch/stream processing; GPU for live)
  - Streaming Origin / Manifest generator (creates HLS/DASH manifests)
  - Storage access layer (signed URLs to object store)
- Object Storage (S3/GCS) - stores master files, HLS/DASH segments, thumbnails
- Cache Layer (Redis / Memcached) - session tokens, manifest cache, rate-limiting counters
- Message Bus (Kafka / Pulsar) - eventing, workflows for transcoding / analytics
- Datastores: OLTP (CloudSQL / Cockroach / Postgres) for metadata, plus NoSQL (Cassandra) for high-write global data if required
- Monitoring / Observability (Prometheus + Grafana + Loki + Jaeger)
- CI/CD / GitOps (ArgoCD / Flux + Helm)
- Security / WAF / Rate limiting / DDoS protection
Delivery flow (requests for playback)
- Client requests playback → CDN edge checks cache.
- If cached segment exists → served from CDN (fastest path).
- If cache miss → CDN requests origin (Ingress → origin service in K8s).
- Origin returns the manifest / segment, or a signed URL into S3 (prefer signed URLs so the player fetches directly from S3).
- Player downloads segments from CDN / S3.
- Analytics events (play, buffer, quality) are sent asynchronously to message bus.
Key Kubernetes components & patterns
Ingress:
- Use an L7 Ingress controller (Envoy/Contour/NGINX/ALB) with global LB in front. Use TLS (ACME) and HTTP/2 for APIs.
Autoscaling:
HPA for stateless services (CPU/RPS/custom metrics, e.g., requests/sec, queue length).
VPA for workload right-sizing (optional for non-critical or preproduction).
Cluster Autoscaler for nodes (scale based on pending pods).
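As a concrete sketch, a minimal `autoscaling/v2` HPA for the stateless API tier might look like this (the Deployment name, replica bounds, and CPU target are all illustrative, not prescriptive):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 10          # keep headroom for traffic spikes
  maxReplicas: 300
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

In production you would typically scale on RPS or latency via custom/external metrics rather than CPU alone.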
Transcoding:
Batch jobs (K8s Jobs/CronJobs) for VOD transcoding.
GPU node pools and Kubernetes Device Plugins for live transcoding/real-time.
Use KEDA to scale transcoder workloads based on queue length (Kafka/SQS).
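A KEDA `ScaledObject` driven by Kafka consumer lag could look like the sketch below (Deployment name, broker address, topic, and thresholds are assumptions for illustration):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vod-transcoder
spec:
  scaleTargetRef:
    name: vod-transcoder      # Deployment running the transcode workers
  minReplicaCount: 0          # scale to zero when the queue is empty
  maxReplicaCount: 100
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.media.svc:9092
        consumerGroup: transcoder
        topic: transcode-jobs
        lagThreshold: "10"    # target lag per replica
```

Scale-to-zero is the main win here: VOD transcoding is bursty, so you pay for workers only while the job queue is non-empty.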
Persistent Storage:
Object store (S3/GCS) for video assets; use PV/PVC for metadata DB if needed.
Avoid keeping heavy video blobs on local PVs.
Streaming origin:
Stateless manifest generators and origin servers that return signed URLs or stream segments.
Edge + CDN: Always front video with CDN - it’s mandatory at this scale.
Service Mesh (optional):
- Istio/Linkerd for observability, traffic shaping, and resilience, if the team can manage the complexity.
Multi-region:
Maintain replicas of origin/metadata and use geo-DNS or global LB for locality.
Rough capacity & cluster sizing (Option B ≈ 100k concurrent)
These are rough numbers; tune them with load tests.
CDN: plan for hundreds of Gbps of total delivery at peak; offload 90–99% of it to the CDN so the origin only sees tens of Gbps.
K8s origin & API tier:
Ballpark: 100–300 pods across regions, each ~0.5–1 vCPU and 1–2 GiB of memory.
Ingress / API layer: use dedicated nodepool; 10–30 nodes (n1-standard-8 or similar).
Transcoding:
For VOD: ephemeral workers - spawn jobs as needed; use large node pools for burst.
For live: GPU nodes; the count depends on the encoding profile (one GPU can serve X concurrent streams). Example: 8–32 GPU nodes.
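The GPU fleet estimate is just arithmetic over the encoding profile. A sketch, assuming each GPU handles 4 simultaneous ABR ladders and nodes carry 4 GPUs (both numbers are assumptions you must replace with your own benchmarks):

```python
import math

def gpu_nodes_needed(live_channels: int, streams_per_gpu: int, gpus_per_node: int) -> int:
    """Nodes required to transcode all live channels concurrently."""
    gpus = math.ceil(live_channels / streams_per_gpu)
    return math.ceil(gpus / gpus_per_node)

# Assumed profile: 300 concurrent live channels, 4 ladders per GPU, 4 GPUs per node.
print(gpu_nodes_needed(300, 4, 4))  # 19 nodes
```

That lands inside the 8–32 node example range above; re-run the math whenever the ladder or codec changes.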
Cache / Redis: Clustered Redis (HA) with enough memory to hold hot manifests and sessions.
DB:
Metadata DB (Postgres/Cockroach) in HA with read replicas.
Analytics DB separated (Clickhouse / BigQuery).
Nodes:
Per region: multiple node pools - small for frontend, medium for workers, large/GPU for transcoding.
Example region sizing: 50–150 nodes depending on workload mix.
Network & egress: plan for high outgoing bandwidth; use cloud egress plans and CDN for cost control.
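The bandwidth plan is average bitrate × concurrency, discounted by the CDN hit rate. A quick worked example for Option B, assuming a ~5 Mbps average delivered bitrate and 95% CDN offload (both assumptions to validate against your catalog and player telemetry):

```python
def origin_egress_gbps(concurrent: int, avg_bitrate_mbps: float, cdn_hit_rate: float) -> float:
    """Origin-facing bandwidth after CDN offload, in Gbps."""
    total_gbps = concurrent * avg_bitrate_mbps / 1000
    return total_gbps * (1 - cdn_hit_rate)

# Option B: 100k concurrent streams at ~5 Mbps average.
total = 100_000 * 5 / 1000                          # 500.0 Gbps total delivery
print(total)
print(round(origin_egress_gbps(100_000, 5, 0.95), 1))  # 25.0 Gbps hits the origin
```

This is why the CDN hit rate is the single most important cost and capacity lever: each percentage point of offload at this scale is ~5 Gbps off the origin.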
Resilience & availability
- Multi-AZ per region, multi-region failover for critical services.
- Cross-region replication for metadata (or active-passive with fast failover).
- Use etcd backup snapshots and automated restore tests for control plane.
- Use health probes, circuit breakers, retries on network calls.
Security & costs
- Offload public traffic to CDN to reduce egress and node count.
- Use signed URLs so the origin is not overwhelmed.
- Add WAF and DDoS protection (Cloud provider + CDN).
- Use fine-grained RBAC, network policies, and secrets manager (K8s Secrets + KMS).
Observability & SLOs
1. Instrument player and backend: latency, buffering ratio, startup time, bitrate changes.
2. Collect metrics: CDN hit/miss rate, origin QPS, transcoder queue length, error rates.
3. Define SLOs: e.g., 99.5% play success, <2 s startup at the 95th percentile.
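An SLO is easier to operate against once it is translated into an error budget. A minimal sketch, using the 99.5% play-success target from above over a 30-day window:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad minutes' (failed plays / downtime) within a rolling window."""
    return (1 - slo) * window_days * 24 * 60

# 99.5% play success over 30 days leaves ~216 minutes of budget.
print(round(error_budget_minutes(0.995), 1))  # 216.0
```

When the budget burns faster than the window elapses, freeze risky rollouts and spend the time on reliability work instead.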
CI/CD & Operational Playbooks
1. GitOps (ArgoCD) for manifests; Helm for templating.
2. Canary/blue-green rollouts for topology changes.
3. Chaos tests for resilience (chaos monkey, pod/node kills).
4. Load tests (Gatling, Locust) with realistic CDN bypass to measure origin load.
Quick variants by assumption
1. Option A (10k concurrent) - single region, 10–30 nodes, small transcode fleet; CDN still mandatory.
2. Option C (1M concurrent) - multi-region, several large clusters, massive CDN footprint, significant investment in edge infra, likely multi-cloud or many POPs.
Actionable next steps (if you want to move forward)
1. Pick a concurrency target (concurrent streams).
2. Run a capacity plan focusing on average bitrate × concurrency to estimate bandwidth and CDN cost.
3. Prototype: small K8s cluster + CDN + a simple origin that serves signed S3 URLs.
4. Load test (simulate player behavior with realistic ABR patterns).
5. Tune HPA/KEDA autoscaling and cluster autoscaler settings.
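For the load-test step, simulating ABR behavior can start from a simple throughput-based rendition picker; real players (HLS.js, ExoPlayer) are far more sophisticated, and the ladder and safety factor below are assumptions:

```python
def pick_rendition(throughput_mbps: float, ladder_mbps: list[float], safety: float = 0.8) -> float:
    """Pick the highest-bitrate rendition the measured throughput can sustain."""
    affordable = [b for b in ladder_mbps if b <= throughput_mbps * safety]
    return max(affordable) if affordable else min(ladder_mbps)

ladder = [0.6, 1.5, 3.0, 5.0, 8.0]  # hypothetical HLS ladder, Mbps
print(pick_rendition(4.8, ladder))   # 3.0 (4.8 * 0.8 = 3.84 -> highest rung <= 3.84)
print(pick_rendition(0.3, ladder))   # 0.6 (falls back to the lowest rendition)
```

Driving simulated players through this kind of logic gives you a realistic mix of manifest and segment requests at different bitrates, which is what actually stresses the origin and cache tiers.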
Thanks for reading! Please let me know your thoughts in the comments.