Hi everyone,
I'm currently designing the architecture for a live streaming and private meeting platform and would appreciate some guidance on scaling recording infrastructure using LiveKit Egress.
I haven't implemented autoscaling yet and want to design the system correctly before moving forward.
Current Stack
Frontend: Angular
Backend: .NET
Media Server: Self-hosted LiveKit
Infrastructure: AWS (EC2 + containerized services)
Coordination: Redis
OS: Ubuntu (on EC2)
Recording Use Cases
The platform supports two types of sessions:
- Private Meetings: recording uses RoomComposite Egress to capture the entire meeting.
- Livestream Classes: recording uses Participant Egress to record only the instructor stream.
Recording is optional and triggered by the instructor, so demand can vary significantly. For example, several instructors could start recording sessions simultaneously.
Problem I'm Trying to Solve
Egress workers process the recording jobs, so I need a design that can absorb bursts of recording requests without failures.
My concern is the case where many recordings start at the same time. Without enough egress capacity, this could lead to:
- Recording requests failing
- Egress workers becoming overloaded
- Timeouts during recording initialization
What I'm Trying to Achieve
Ideally the system should:
- Automatically scale egress workers when recording demand increases
- Scale down when idle to reduce infrastructure cost
- Handle bursts where many recordings start simultaneously
- Support both RoomComposite and Participant egress jobs efficiently
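To make these goals concrete, here is a minimal sketch of the scale-up / scale-down behavior I have in mind: scale up immediately when demand rises, and scale down only after demand has stayed low for a cooldown window. Everything here (`ScalePolicy`, `decide`, the limits and the 300-second cooldown) is a hypothetical illustration, not a LiveKit or AWS API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalePolicy:
    """Hypothetical policy: fast scale-up, delayed scale-down."""
    min_workers: int = 1
    max_workers: int = 20
    cooldown_s: float = 300.0            # assumed value, tune per workload
    _low_since: Optional[float] = None   # when demand first dropped below current

    def decide(self, current: int, desired: int, now: float) -> int:
        # Clamp the requested size to the allowed range.
        desired = max(self.min_workers, min(self.max_workers, desired))
        if desired >= current:
            # Demand is flat or rising: act immediately, reset the timer.
            self._low_since = None
            return desired
        if self._low_since is None:
            # First time we see lower demand: start the cooldown timer.
            self._low_since = now
            return current
        if now - self._low_since >= self.cooldown_s:
            # Demand stayed low for the whole window: scale down.
            self._low_since = None
            return desired
        return current
```

One thing this sketch does not cover: scale-in must avoid terminating a worker that still has an active recording, so whatever removes instances would need to drain them first.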
Questions
For developers running LiveKit in production:
What is the recommended strategy to scale LiveKit egress workers?
Should autoscaling be based on:
- CPU / memory usage
- Number of active recordings
- Number of pending egress jobs
- Pipelines per worker
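To show what I mean by scaling on job counts rather than raw CPU, here is a rough sketch that turns active and pending jobs into a worker count, weighting each job type by an assumed CPU cost. The per-job costs, the 80% utilization target, and the function name `desired_workers` are all my own assumptions for illustration; I have not measured real costs yet.

```python
import math
from typing import Mapping

# Assumed CPU cost per job type, in cores. Placeholder numbers, not measurements.
COST_CORES = {"room_composite": 4.0, "participant": 2.0}


def desired_workers(
    active: Mapping[str, int],
    pending: Mapping[str, int],
    cores_per_worker: float,
    burst_reserve_cores: float = 0.0,
    target_utilization: float = 0.8,
) -> int:
    """Estimate the worker count from weighted job demand.

    This is an aggregate estimate and therefore a lower bound: a single job
    cannot be split across workers, so real packing may need more workers.
    """
    if cores_per_worker <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("cores_per_worker and target_utilization must be positive")
    # Total cores needed by running jobs plus jobs waiting to start.
    demand = sum(
        COST_CORES[kind] * count
        for jobs in (active, pending)
        for kind, count in jobs.items()
    )
    usable = cores_per_worker * target_utilization
    return math.ceil((demand + burst_reserve_cores) / usable)


# 3 + 1 room composites and 4 participant jobs on 8-core workers,
# keeping 4 cores free for a sudden start: (16 + 8 + 4) / 6.4 -> 5 workers
print(desired_workers({"room_composite": 3, "participant": 4},
                      {"room_composite": 1}, cores_per_worker=8,
                      burst_reserve_cores=4))
```

Is something along these lines reasonable, or do people rely on a capacity signal reported by the egress workers themselves?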
Has anyone implemented autoscaling for egress workers successfully on AWS (ECS / EC2 / Kubernetes)?
When LiveKit server load increases (many rooms), how do you typically scale LiveKit media servers alongside egress workers?
Context
I'm still in the architecture design stage, so any suggestions, reference architectures, or lessons learned from production deployments would be extremely helpful.
Thanks in advance!