Further Scaling the Choirless Render Pipeline

#python #serverless #ibmcloud #mqtt

I have previously done a bit of investigation into scaling the Choirless rendering backend:

Stress Testing Choirless Render Pipeline (Matt Hamilton, Jul 25 '20)

Choirless is a side project I've been working on for Call for Code, that allows people to remotely sing together as a virtual choir.

The big question was how well the serverless architecture I'd built on IBM Cloud Functions (Apache OpenWhisk) would scale. I mean, in theory it should be infinite, right? But the reality is a function has a maximum allowed runtime of 10 minutes. We have set ourselves a limit of allowing maximum 10-minute-long songs to be recorded. So that means we need to be able to do each part of the processing in faster-than-realtime.

Most of the heavy work is done using ffmpeg to do the actual video and audio mixing. But how many audio and video streams can we mix in total?

As it turns out, once we went above about 100 videos in parallel, ffmpeg started to have trouble. I'm not sure if it was ffmpeg or the underlying limits imposed by Linux or the OpenWhisk environment, but either way it started to randomly just hang forever -- until killed after 10 minutes.

So I re-architected it to break the videos up into parts for the composition. The main expensive part is compositing all the video streams together. But similarly we hit the same thread limits with the audio.

We are using a JavaScript library written by my colleague, Sean, called boxjam to arrange the videos into a grid. Boxjam always arranges the rows in line, so you can always split a video horizontally cleanly.

So, I have broken the composition process up into tasks that run in parallel. Each task is responsible for compositing together one row of video or audio. So if there are 5 rows, then we spawn off a total of 10 OpenWhisk actions, one per row for each of audio and video. We then have a final composition stage in which we stack all the video rows on top of each other, mix the audio together and add it back to the final video.
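As a rough sketch of that fan-out, the snippet below turns a list of rows into one audio task and one video task per row. The `ChildTask` type, the `plan_children` helper and the part names are hypothetical and only illustrate the counting; the real pipeline gets its rows from boxjam and invokes OpenWhisk actions rather than building a list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChildTask:
    kind: str        # "audio" or "video"
    row: int         # index of the row this child composites
    part_ids: tuple  # the parts that sit in that row


def plan_children(rows: List[List[str]]) -> List[ChildTask]:
    """Return one audio task and one video task for every row."""
    tasks = []
    for row_index, part_ids in enumerate(rows):
        for kind in ("audio", "video"):
            tasks.append(ChildTask(kind, row_index, tuple(part_ids)))
    return tasks


# Five rows of three (made-up) parts each -> 10 child tasks.
rows = [[f"part-{r}-{c}" for c in range(3)] for r in range(5)]
tasks = plan_children(rows)
print(len(tasks))  # 10
```

Each task could then be handed to its own action invocation, so no single ffmpeg process ever has to open more than one row's worth of inputs.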

The intermediate row videos are streamed up to IBM Cloud Object Storage (COS) for access by the final compositor. That in turn uploads the final complete rendering back to COS.
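To make the final stage a little more concrete, here is a minimal sketch of an ffmpeg filter graph that stacks row videos vertically and mixes the row audio, using ffmpeg's `vstack` and `amix` filters. The input ordering (all row videos first, then all row audio files) and the `final_filter_graph` helper are assumptions for illustration, not necessarily how the Choirless compositor builds its command.

```python
def final_filter_graph(n_rows: int) -> str:
    """Build a -filter_complex string for n_rows video and n_rows audio inputs.

    Assumes inputs 0..n_rows-1 are the row videos (top to bottom) and
    inputs n_rows..2*n_rows-1 are the matching row audio files.
    """
    if n_rows < 2:
        raise ValueError("vstack needs at least two inputs")
    video_in = "".join(f"[{i}:v]" for i in range(n_rows))
    audio_in = "".join(f"[{i}:a]" for i in range(n_rows, 2 * n_rows))
    return (f"{video_in}vstack=inputs={n_rows}[v];"
            f"{audio_in}amix=inputs={n_rows}[a]")


graph = final_filter_graph(2)
print(graph)
# [0:v][1:v]vstack=inputs=2[v];[2:a][3:a]amix=inputs=2[a]
```

The resulting string would be passed to ffmpeg via `-filter_complex`, with `-map "[v]"` and `-map "[a]"` selecting the stacked video and mixed audio for the output. Note that `vstack` requires every row to have the same width.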

To track and visualise the progress of the render, I needed a way for each stage to report its progress. I decided to use MQTT as a protocol to do that, as it is lightweight, available in both Python and JavaScript and we can use either publicly available brokers, or host our own. I'll detail the actual code for publishing and subscribing to the MQTT messages in another post soon, but just to give you an example of how easy it is to publish a message to MQTT:

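import json
import uuid

import paho.mqtt.publish as publish

# choir_id, song_id and stage are assumed to be defined earlier
msg = {'choir_id': choir_id,
       'song_id': song_id,
       'stage': stage,
       'status_id': str(uuid.uuid4())  # fresh UUID for each message
}
publish.single(
      f'choirless/{choir_id}/{song_id}/renderer/{stage}',
      json.dumps(msg),
      hostname='mqtt.eclipse.org',
      port=1883
)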

I then wrote a Python notebook in IBM Watson Studio to subscribe to these messages during a run and create a chart with Altair. I'd not come across Altair before, but it allowed me to generate slightly nicer looking charts than plain matplotlib.
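On the subscribing side, the topic layout used when publishing can be turned back into fields. The sketch below is not the notebook code; `parse_render_message` and the sample IDs are hypothetical, and it only uses the standard library. The topic filter uses the standard MQTT wildcards, where `+` matches one level and `#` matches everything below.

```python
import json

# Filter a subscriber could use to receive every renderer status message.
TOPIC_FILTER = "choirless/+/+/renderer/#"


def parse_render_message(topic: str, payload: str) -> dict:
    """Split a choirless/<choir>/<song>/renderer/<stage> topic into fields."""
    parts = topic.split("/")
    if len(parts) < 5 or parts[0] != "choirless" or parts[3] != "renderer":
        raise ValueError(f"unexpected topic: {topic}")
    msg = json.loads(payload)
    return {
        "choir_id": parts[1],
        "song_id": parts[2],
        "stage": "/".join(parts[4:]),
        "status_id": msg.get("status_id"),
    }


record = parse_render_message(
    "choirless/choir-1/song-9/renderer/final",   # made-up IDs
    json.dumps({"status_id": "abc123"}),
)
print(record["stage"])  # final
```

A list of such records, each stamped with its arrival time, is the kind of table a charting library like Altair can plot as a timeline.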

So how does it do? Below are some plots and associated screenshots of the final output.

Yellow Submarine

This was our 'alpha' performance for Choirless just before we submitted our entry to Call for Code. It consists of 16 parts from random people who joined in. It is 1 minute 52 seconds long.

Looking at the chart you can see the initial renderer-compositor-main action at the start. That is responsible for working out how many child actions to fire off and for allocating each of them a row to work on. You can see there are then four audio processing children, and four video processing children. The audio is much faster to mix together for each row than the video. This is expected as compositing video requires processing far more data. Once the children are all finished, you see renderer-final start and run for about 42 seconds. The entire process is complete in about 91 seconds.

Dreams

This is my colleague, Glynn, playing "Dreams" by Fleetwood Mac. He is playing all six parts himself. This was the longest video in our test, at just over four minutes.

You can see two children each for video and audio as there are two rows in the video. The entire thing takes about 3.5 minutes to render, and the longest single function (renderer-compositor-child-video-103) takes just over two minutes. So again we are running at about twice the speed of realtime, well within our desired requirements.

Load test bagpipes

This was the big one. I'd previously tested rendering around 200 'clones' of my colleague Sean playing the bagpipes. That completed in just shy of ten minutes in the previous test, and was only rendering to 720p. Each of these videos is rendered to 1080p (1920x1080 rather than 1280x720), so 2.25x the number of pixels.

In this test, I created 308 clones of Sean to perform. The piece is only 1 minute 51 seconds long, but this was a test to prove I'd solved the problem of ffmpeg choking on too many threads.

You can see that renderer-compositor-main has spawned 15 child processes for each of audio and video. Again, one per row of video. Each row has many more videos though: in this case up to 21 videos per row. What is interesting is the variation in video processing times. Each of the children apart from the last row (renderer-compositor-child-video-1000) has the same number of videos to process, yet the processing time varies from 73 to 131 seconds. I guess this could be luck of the draw on which physical host the Cloud Function gets dispatched to. Some may have more load on them than others. If we wanted faster processing time at the expense of cost, we could fire off multiple children for each row and the final render would start whenever the quickest ones return.
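The "fire off multiple children per row" idea can be sketched with threads standing in for cloud function invocations. Everything here is illustrative: `first_to_finish` and `fake_render` are hypothetical names, and the sleeps simulate one copy landing on a busier host than the other.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def first_to_finish(attempts):
    """Run every callable in attempts concurrently; return the first result."""
    pool = ThreadPoolExecutor(max_workers=len(attempts))
    try:
        futures = [pool.submit(attempt) for attempt in attempts]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)  # do not block on the slower copies


def fake_render(row, seconds):
    """Stand-in for a row render that takes a given number of seconds."""
    time.sleep(seconds)
    return f"row {row} rendered in {seconds}s"


# Two copies of the same row; one happens to run on a slower host.
result = first_to_finish([
    lambda: fake_render(3, 0.5),
    lambda: fake_render(3, 0.05),
])
print(result)  # row 3 rendered in 0.05s
```

In this sketch the slower copy still runs to completion in the background, which mirrors the cost side of the trade-off: every duplicate is billed even though only the quickest result is used.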

The longest single action in this render is 2 minutes 11 seconds, so we are slightly slower than realtime here, with the actual video being 1 minute 51 seconds. Many of the children finished well before then, though, so I think we should be OK in general, even up to 10 minute videos.
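The realtime comparison is just a ratio of render time to song length. A small sketch using the bagpipes figures quoted above (the `realtime_factor` helper is a hypothetical name):

```python
def realtime_factor(render_seconds: float, song_seconds: float) -> float:
    """Render time divided by song length; below 1.0 is faster than realtime."""
    return render_seconds / song_seconds


song = 1 * 60 + 51                      # 1 minute 51 seconds
slowest = realtime_factor(131, song)    # slowest video child
fastest = realtime_factor(73, song)     # fastest video child
print(round(slowest, 2), round(fastest, 2))  # 1.18 0.66
```

Whether these ratios carry over to a full 10 minute song depends on render time scaling linearly with song length, which is an assumption and not something these tests measured.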

Conclusion

The scalability tests and refactoring of the pipeline have worked really well. I'm very pleased with the overall performance. One aspect that I spent quite a bit of time looking at is the trade-off between video/audio codec encode/decode speed and file size. For example, the fastest codec is huffyuv, but it produced 8GB files. So although the codec was faster, we spent more time loading from Cloud Object Storage.

I experimented with using mpeg2 with only keyframes, and also h264. In the end I settled on using mpeg2 for the intermediate storage as that seemed to provide the best trade-off in speed/quality/size.
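For anyone wanting to repeat the comparison, the candidates map onto ffmpeg encoder options roughly as below. The exact settings used in Choirless are not given in this post, so the flag values, the `intermediate_codec_args` helper and the file names are all assumptions; `-g 1` asks for a GOP size of one, i.e. every frame is a keyframe.

```python
def intermediate_codec_args(codec: str) -> list:
    """Return illustrative ffmpeg video-encoder arguments for a codec choice."""
    presets = {
        # Lossless and quick to encode, but very large files.
        "huffyuv": ["-c:v", "huffyuv"],
        # MPEG-2 with a GOP size of 1, so every frame is a keyframe.
        "mpeg2-intra": ["-c:v", "mpeg2video", "-g", "1", "-q:v", "2"],
        # H.264 via libx264 with its fastest preset.
        "h264": ["-c:v", "libx264", "-preset", "ultrafast"],
    }
    return presets[codec]


cmd = ["ffmpeg", "-i", "row-in.mkv",
       *intermediate_codec_args("mpeg2-intra"), "row-out.mkv"]
print(" ".join(cmd))
# ffmpeg -i row-in.mkv -c:v mpeg2video -g 1 -q:v 2 row-out.mkv
```

Timing each variant on the same row, and noting the size of the file it produces, gives the encode-speed versus transfer-time comparison described above.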

Two other things we could look at in the future if need be:

Further breaking the rows down.

We could split each row in half and process each half before combining to a full row.

Live streaming the output from the children to the final compositor.

At the moment we wait for all children to be complete before starting the final composition, but in theory we could start before they have finished. ffmpeg supports streaming to/from TCP connections, so we could have the final process be the action that fires off the children and hands them each the address of a TCP connection to stream to. This would mean the final render would be complete at about the time the last child is done. However, this would mean that each child would have to stay running until the slowest has finished, which would increase the overall costs, as cloud functions are billed by running time.

The final output video of 308 Seans Bagpiping: