Every time we mention rendering videos in the browser, we get raised eyebrows and concerns about performance. For good reason: for a long time, browsers offered no hardware-accelerated path for this and had to decode and encode frames on the CPU.
This bottleneck pushed companies willing to implement a video editing solution to create two separate systems: a playback mechanism that runs in the browser and lets users visually trim/split videos, apply effects, add text clips, and so on, and a backend mechanism that takes the same structure created on the web and recreates it on a server to render it at high speed.
Such an approach, although it renders faster, has disadvantages such as:
- All the clips should be uploaded to a server for the backend to be able to create the composition.
- All the filters/effects/transitions should also be created through a custom rendering solution.
- The backend should download the clips to create the composition.
- Rendering several videos simultaneously can be very computationally intensive, requiring autoscaling solutions.
- The generated video should be stored on the server for the interface to download the rendered output.
- Users typically upload videos larger than 10MB, so you have to consider network connectivity.
- You cannot build a product that works well on slow networks or offline.
- You have to ensure the composition created on the backend matches the one the user created.
- Working with text is challenging and requires consideration of font families and styles.
- Server providers charge for traffic as well, so you have to take that into account.
Browser Rendering Steps
Generally, rendering the video in the browser involves several steps:
- Capture the frame.
- Encode the frame into a video stream.
- Repeat until all the frames are captured.
- Mux the video and audio streams into a container.
The most time-consuming steps in this flow are capturing and encoding the frames.
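In pseudo-TypeScript, the loop looks roughly like this, with hypothetical captureFrame/encodeFrame/muxStreams helpers standing in for the steps discussed below:

```typescript
// Hypothetical helpers standing in for the concrete steps discussed below.
declare function captureFrame(timestampUs: number): Promise<VideoFrame>;
declare function encodeFrame(frame: VideoFrame, timestampUs: number): Promise<void>;
declare function muxStreams(): Promise<Blob>;

// One pass over the timeline: capture a frame, hand it to the encoder, repeat,
// then mux the encoded video and audio streams into a container at the end.
async function renderComposition(totalFrames: number, fps: number): Promise<Blob> {
  for (let i = 0; i < totalFrames; i++) {
    const timestampUs = Math.round((i / fps) * 1_000_000);
    const frame = await captureFrame(timestampUs);
    await encodeFrame(frame, timestampUs);
  }
  return muxStreams();
}
```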
A simple implementation of capturing video frames could be to seek an HTMLVideoElement, render the texture onto a canvas, and take a screenshot of the whole composition. The problem is that every seek makes the browser locate the nearest I-frame and decode forward from it to paint the result, even if you only advance the current time by 1/30 of a second (one frame at 30 FPS), which is a headache for the browser. It's also worth mentioning that switching tabs while rendering will pause the video playback and leave missing sections in the final video, because browsers throttle background tabs to preserve performance.
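As a rough sketch of that naive approach (assuming a 30 FPS timeline and a single video element rather than a full composition):

```typescript
// Naive capture loop over an HTMLVideoElement: seek, wait for the seek to land,
// draw the frame onto a canvas, and grab the pixels. Every seek forces the
// browser to decode from the nearest I-frame, which is what makes this slow.
async function captureBySeeking(video: HTMLVideoElement, fps = 30): Promise<ImageBitmap[]> {
  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext('2d')!;

  const frames: ImageBitmap[] = [];
  const frameDuration = 1 / fps;

  for (let t = 0; t < video.duration; t += frameDuration) {
    // Register the listener before changing currentTime to avoid missing the event.
    const seeked = new Promise<void>((resolve) =>
      video.addEventListener('seeked', () => resolve(), { once: true })
    );
    video.currentTime = t;
    await seeked;

    ctx.drawImage(video, 0, 0);
    // Collected here only for illustration; in a real pipeline you would hand
    // each frame straight to the encoder instead of holding them in memory.
    frames.push(await createImageBitmap(canvas));
  }
  return frames;
}
```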
Regarding encoding, the common solution would be to use the WASM build of FFmpeg. However, you cannot start an encoding process and simply append new frames to the stream, because the WASM build does not support that at the moment. You would have to temporarily store all the frames in memory (not a good idea), or store a chunk of them, encode that chunk into a video segment, and repeat the process.
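A hedged sketch of that chunked workaround, using the createFFmpeg-style ffmpeg.wasm API (names differ in newer releases): buffer a batch of frames, encode it into a segment, free the memory, and repeat. The resulting segments still have to be concatenated afterwards.

```typescript
import { createFFmpeg } from '@ffmpeg/ffmpeg';

const ffmpeg = createFFmpeg({ log: false });

// Encode one batch of PNG-encoded frames into a small MP4 segment.
async function encodeChunk(frames: Uint8Array[], chunkIndex: number, fps = 30): Promise<Uint8Array> {
  if (!ffmpeg.isLoaded()) await ffmpeg.load();

  // Write each frame of this batch to ffmpeg.wasm's in-memory filesystem.
  frames.forEach((png, i) => {
    ffmpeg.FS('writeFile', `frame${String(i).padStart(4, '0')}.png`, png);
  });

  const output = `chunk${chunkIndex}.mp4`;
  await ffmpeg.run(
    '-framerate', String(fps),
    '-i', 'frame%04d.png',
    '-c:v', 'libx264',
    '-pix_fmt', 'yuv420p',
    output
  );

  const data = ffmpeg.FS('readFile', output);

  // Clean up so memory doesn't grow across batches.
  frames.forEach((_, i) => ffmpeg.FS('unlink', `frame${String(i).padStart(4, '0')}.png`));
  ffmpeg.FS('unlink', output);

  return data;
}
```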
There is a great video from Christopher Chedeau explaining more about the pipeline that goes into encoding a video and the limitations of the browser:
"Video Editing in the Browser" by Christoher Chedeau at #RemixConf 2023 💿
WebCodecs API
The good news is that browsers have started adding support for hardware-accelerated video/audio/image encoders and decoders that solve the issues mentioned above - this set of APIs is known as the WebCodecs API.
For the capturing part, the seeking can now be replaced with decoding the frame using the VideoDecoder API. This is much faster than seeking because the browser doesn’t have to look for I-Frames, B-Frames, and P-Frames to reconstruct the image. Instead, it can just retrieve what has changed and recreate the texture from that.
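As a rough illustration, here is what the decode path can look like. The codec string, the dimensions, and the demuxing that produces the EncodedVideoChunks are assumptions here; in practice they come from the container via something like mp4box.js, and AVC streams usually also need the avcC description passed to configure().

```typescript
// Minimal VideoDecoder sketch: feed demuxed, encoded chunks in and receive
// decoded VideoFrames in the output callback.
function createDecoder(onFrame: (frame: VideoFrame) => void): VideoDecoder {
  const decoder = new VideoDecoder({
    output: (frame) => {
      onFrame(frame);        // use the frame synchronously (e.g. draw it), or copy it
      frame.close();         // release the frame as soon as you are done with it
    },
    error: (e) => console.error('decode error', e),
  });

  decoder.configure({
    codec: 'avc1.64001f',    // example H.264 codec string, taken from the file's metadata
    codedWidth: 1920,
    codedHeight: 1080,
  });

  return decoder;
}

// Usage: push demuxed samples, then flush to drain the pipeline.
// decoder.decode(new EncodedVideoChunk({ type: 'key', timestamp: 0, data }));
// await decoder.flush();
```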
Encoding can be done through the VideoEncoder API, which enables appending the video frames directly into the stream without having to store the frames temporarily.
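A minimal sketch of that encode path, with example codec/resolution/bitrate values; the output callback is where each EncodedVideoChunk would be handed to your muxer.

```typescript
// Minimal VideoEncoder sketch: frames are appended as they are produced, and
// the output callback receives encoded chunks to pass to a muxer.
function createEncoder(
  onChunk: (chunk: EncodedVideoChunk, metadata?: EncodedVideoChunkMetadata) => void
): VideoEncoder {
  const encoder = new VideoEncoder({
    output: onChunk,
    error: (e) => console.error('encode error', e),
  });

  encoder.configure({
    codec: 'avc1.42001f',   // example H.264 baseline codec string
    width: 1920,
    height: 1080,
    bitrate: 8_000_000,     // 8 Mbps
    framerate: 30,
  });

  return encoder;
}

// Usage for each captured frame:
// encoder.encode(frame, { keyFrame: frameIndex % 60 === 0 });
// frame.close();
// ...and once all frames are in: await encoder.flush();
```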
WebCodecs Considerations
There are things to be aware of when deciding to implement a rendering mechanism with WebCodecs. Here are some hiccups we faced during the process:
- There are no built-in APIs for muxing and demuxing. You have to either use FFmpeg or an alternative library.
- You need additional tooling to identify the codec string when initializing the decoder. It should follow the patterns defined in the WebCodecs codec registry.
- Sometimes, the decoder tells you if it supports a specific configuration only after decoding a few packets, so you have to handle it gracefully.
- WebCodecs is not supported in all browsers yet (e.g., Firefox).
- The logs are not always explicit and require you to have additional tooling for debugging.
- When encoding videos, you have to make sure that the devices you are using support the specific resolution. For instance, a 7-year-old Android phone might not be able to encode a video larger than Full HD, and you have to account for that. Also, the encoder doesn't inform you of this, so probing configurations up front helps (see the sketch after this list).
- Canvas works with RGB/RGBA color format, but in many cases, the video frames coming from VideoDecoder will be in YUV, requiring you to handle the conversion.
- You can’t get the media/stream info from the video through WebCodecs and have to use ffmpeg/mp4box instead.
- You have to identify and handle rotated videos yourself.
- There is a limited number of codecs supported.
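One way to soften the resolution and codec issues above is to probe configurations before rendering starts. Below is a minimal sketch assuming an H.264 target; VideoEncoder.isConfigSupported (and its VideoDecoder counterpart) is a first line of defence, though, as noted above, it won't catch every device-specific failure.

```typescript
// Probe a few candidate resolutions and return the largest one the current
// device claims to support for the given codec/bitrate/framerate.
async function pickSupportedResolution(): Promise<{ width: number; height: number } | null> {
  const candidates = [
    { width: 3840, height: 2160 },
    { width: 1920, height: 1080 },
    { width: 1280, height: 720 },
  ];

  for (const { width, height } of candidates) {
    const { supported } = await VideoEncoder.isConfigSupported({
      codec: 'avc1.42001f', // example codec string
      width,
      height,
      bitrate: 8_000_000,
      framerate: 30,
    });
    if (supported) return { width, height };
  }
  return null; // nothing usable on this device
}
```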
Conclusion
It is possible to render videos in the browser and do it in a performant manner that, in many cases, replaces the need for a server. But the process of developing such a system is overly complicated and involves a lot of hacking and working around the browsers' limitations.
Over the last year, our team has been working on creating a Video Editing SDK meant to take care of all the issues and limitations mentioned above, while giving you full access to build the flow and interface you want. Check it out at rendley.com.