Nilay Jayswal for 100ms Inc.

Posted on Jan 5, 2022 • Edited on Jan 24, 2022 • Originally published at 100ms.live

Virtual Background in Android with WebRTC

Virtual backgrounds have become a standard feature of video conferencing. They let us replace our real background with an image or a video, including custom images that we upload ourselves.

In this blog, we are going to implement a virtual background in Android with WebRTC, using ML Kit selfie segmentation.


This feature works best with uniform lighting conditions in the background, and it requires a high-performance Android device for a smooth user experience.

By the end of this blog, you will have a working virtual background feature.

Dependencies

Add the dependencies for the ML Kit Android libraries to the module's app-level gradle file, which is usually app/build.gradle:


dependencies {
    implementation 'com.google.mlkit:segmentation-selfie:16.0.0-beta3'
}

Add the dependency for libyuv:

dependencies {
    implementation 'io.github.zncmn.libyuv:core:0.0.7'
}

libyuv is an open-source project that includes YUV scaling and conversion functionality.

Common WebRTC terms you should know

VideoFrame: Contains the buffer of the frame captured by the camera device, in I420 format.
VideoSink: Used to send a frame back to the WebRTC native source.
VideoSource: Reads the camera device, produces VideoFrames, and delivers them to VideoSinks.
VideoProcessor: An interface provided by WebRTC for updating the VideoFrames produced by a VideoSource.
MediaStream: A WebRTC API that supports streaming audio and video data. It consists of zero or more MediaStreamTrack objects, each representing an audio or video track.

Approaches we thought of

Passing the WebRTC MediaStream through the ML Kit selfie segmentation model and getting an updated stream back. Unfortunately, we don't have a replaceTrack method in WebRTC on Android.
Updating the stream coming from the source camera and then passing it to WebRTC. We had some success with this, but ran into issues when using the updated stream in WebRTC.
Creating another virtual video source from the camera source and using it as the input to the ML Kit API. Sending the updated stream back to WebRTC again gave us issues.
Using the Android CameraX APIs to read frames, but again WebRTC doesn't support it.

After trying all these approaches without getting suitable results, we figured out that for our use case we need to do the processing on the VideoFrame itself.

Getting the VideoFrame from WebRTC

The most challenging part was getting hold of the VideoFrame from WebRTC for every frame so that we could process it. After a lot of research, we found that we can use the setVideoProcessor API available on VideoSource. It has a few callbacks:

// Gives us the VideoFrame going into WebRTC, for every frame
fun onFrameCaptured(inputVideoFrame: VideoFrame?)

// Gives us the sink that we will use to send the updated VideoFrame back to WebRTC
fun setSink(sink: VideoSink?)

This is how we can set a VideoProcessor on the VideoSource (source in the snippet below is the VideoSource):

source.setVideoProcessor(object : VideoProcessor {
    // Sink supplied by WebRTC; processed frames are delivered to it
    private var sink: VideoSink? = null

    override fun onCapturerStarted(success: Boolean) {}

    override fun onCapturerStopped() {}

    override fun onFrameCaptured(inputVideoFrame: VideoFrame?) {
        // Do processing with inputVideoFrame here
    }

    override fun setSink(sink: VideoSink?) {
        // Keep the sink so the updated VideoFrame can be sent back to WebRTC
        this.sink = sink
    }
})

Once a VideoProcessor is set on the VideoSource, we need to call the VideoSink's onFrame callback for every frame. Otherwise, we will get a black screen on our device.

// Here frame is the updated VideoFrame we get after ML processing on the input VideoFrame
sink.onFrame(frame)
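One way to make sure the sink receives something for every captured frame is to fall back to the unprocessed frame whenever processing fails. The sketch below illustrates that idea only: Frame, FrameSink and ForwardingProcessor are hypothetical stand-ins for org.webrtc.VideoFrame, org.webrtc.VideoSink and a VideoProcessor implementation, so that the example runs without the WebRTC library.

```kotlin
// Hypothetical stand-ins for org.webrtc.VideoFrame and org.webrtc.VideoSink.
class Frame(val label: String)

fun interface FrameSink {
    fun onFrame(frame: Frame)
}

// Forwards exactly one frame to the sink per captured frame.
class ForwardingProcessor(private val process: (Frame) -> Frame?) {
    private var sink: FrameSink? = null

    fun setSink(sink: FrameSink?) {
        this.sink = sink
    }

    fun onFrameCaptured(input: Frame?) {
        if (input == null) return
        // If processing throws or returns nothing, fall back to the input frame
        val processed = try {
            process(input)
        } catch (e: Exception) {
            null
        }
        sink?.onFrame(processed ?: input)
    }
}

fun main() {
    val delivered = mutableListOf<String>()
    val processor = ForwardingProcessor { frame ->
        if (frame.label == "bad") null else Frame(frame.label + "+bg")
    }
    processor.setSink(FrameSink { frame -> delivered += frame.label })
    processor.onFrameCaptured(Frame("f1"))
    processor.onFrameCaptured(Frame("bad"))
    println(delivered) // [f1+bg, bad]
}
```

With this shape, a slow or failing segmentation step degrades to showing the real background instead of a black screen.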

Converting VideoFrame to supported ML model Input Type

To perform segmentation on an image, ML Kit needs an InputImage object, which can be created from a Bitmap, a ByteBuffer, a media.Image, a byte array, or a file on the device.

Here, we convert inputVideoFrame into a bitmap using the libyuv library.

YuvFrame: Copies the Y, U and V planes from the VideoFrame buffer into a byte array, which we then convert to an ARGB_8888 Bitmap.

yuvFrame = YuvFrame(
    inputVideoFrame,
    YuvFrame.PROCESSING_NONE,
    inputVideoFrame.timestampNs
)
inputFrameBitmap = yuvFrame.bitmap

Next, we create an InputImage from inputFrameBitmap:

val mlImage = InputImage.fromBitmap(inputFrameBitmap, 0)

Initialise mlkit model

We create an instance of ML Kit's Segmenter, configured for selfie segmentation.
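The article does not show the initialisation code, so the following is only a minimal sketch using ML Kit's public selfie segmentation API. It needs the segmentation-selfie dependency and an Android project to compile. The choice of STREAM_MODE is an assumption; it is the mode ML Kit provides for live video, but the original setup may have used different options.

```kotlin
import com.google.mlkit.vision.segmentation.Segmentation
import com.google.mlkit.vision.segmentation.Segmenter
import com.google.mlkit.vision.segmentation.selfie.SelfieSegmenterOptions

// Assumption: STREAM_MODE, which is intended for consecutive video frames.
// SINGLE_IMAGE_MODE is the alternative for unrelated still images.
val options: SelfieSegmenterOptions = SelfieSegmenterOptions.Builder()
    .setDetectorMode(SelfieSegmenterOptions.STREAM_MODE)
    .build()

// The segmenter used in the processing step
val segmenter: Segmenter = Segmentation.getClient(options)
```

Creating the segmenter once and reusing it for every frame avoids repeating the setup cost per frame.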

Process the mlImage

segmenter.process(mlImage)
    .addOnSuccessListener { segmentationMask ->
        val mask = segmentationMask.buffer
        val maskWidth = segmentationMask.width
        val maskHeight = segmentationMask.height
        mask.rewind()
        val arr: IntArray = maskColorsFromByteBuffer(mask, maskWidth, maskHeight)
        val segmentedBitmap = Bitmap.createBitmap(
            arr, maskWidth, maskHeight, Bitmap.Config.ARGB_8888
        )
        // segmentedBitmap is the person segmented from the background
    }
    .addOnFailureListener { exception ->
        HMSLogger.e("App", "${exception.message}")
    }
    .addOnCompleteListener {
    }
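The maskColorsFromByteBuffer helper is not shown in this post. A minimal sketch of what it could look like follows, assuming the ML Kit mask layout of one 32-bit float per pixel holding the confidence that the pixel belongs to the person. The threshold parameter and the choice of opaque white for the person are assumptions made for illustration.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Assumed implementation: reads one float (foreground confidence, 0.0..1.0)
// per pixel, row by row, and turns it into an ARGB color.
fun maskColorsFromByteBuffer(
    mask: ByteBuffer,
    width: Int,
    height: Int,
    threshold: Float = 0.5f // hypothetical cut-off between person and background
): IntArray {
    val colors = IntArray(width * height)
    for (i in colors.indices) {
        val confidence = mask.getFloat()
        // Opaque white where a person is detected, fully transparent elsewhere
        colors[i] = if (confidence >= threshold) 0xFFFFFFFF.toInt() else 0x00000000
    }
    return colors
}

fun main() {
    // A fake 2x2 mask standing in for segmentationMask.buffer
    val buffer = ByteBuffer.allocate(4 * 4).order(ByteOrder.nativeOrder())
    floatArrayOf(0.9f, 0.1f, 0.5f, 0.0f).forEach { buffer.putFloat(it) }
    buffer.rewind()

    val colors = maskColorsFromByteBuffer(buffer, 2, 2)
    println(colors.map { it ushr 24 }) // alpha per pixel: [255, 0, 255, 0]
}
```

A hard threshold gives crisp edges. Using the confidence itself as the alpha value would give softer edges at the cost of some halo around the person.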

Draw the segmented background on the canvas

We use a PorterDuff mode to draw the segmented output together with the background image supplied by the user onto a Canvas (using the Canvas APIs).
From the canvas we then get outputBitmap, which we use to create an updated VideoFrame.
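The post does not say which PorterDuff mode was used, so the sketch below only illustrates the arithmetic that this kind of compositing performs per pixel: the camera pixel is kept where the mask is opaque and the background pixel where it is transparent. blendPixel and composite are hypothetical helpers working on plain ARGB IntArrays. On Android, the Canvas and PorterDuff APIs do this work natively and far more efficiently.

```kotlin
// Blend one camera pixel over one background pixel, weighted by maskAlpha (0..255).
fun blendPixel(person: Int, background: Int, maskAlpha: Int): Int {
    fun channel(shift: Int): Int {
        val p = (person shr shift) and 0xFF
        val b = (background shr shift) and 0xFF
        // Weighted average of the two channels, rounded to the nearest integer
        return (p * maskAlpha + b * (255 - maskAlpha) + 127) / 255
    }
    // The output is always fully opaque
    return (0xFF shl 24) or (channel(16) shl 16) or (channel(8) shl 8) or channel(0)
}

// Composite a whole frame; the alpha channel of each mask pixel is the weight.
fun composite(frame: IntArray, background: IntArray, mask: IntArray): IntArray {
    require(frame.size == background.size && frame.size == mask.size) {
        "frame, background and mask must have the same number of pixels"
    }
    return IntArray(frame.size) { i ->
        blendPixel(frame[i], background[i], mask[i] ushr 24)
    }
}

fun main() {
    val red = 0xFFFF0000.toInt()   // camera frame pixels
    val blue = 0xFF0000FF.toInt()  // virtual background pixels
    val mask = intArrayOf(0xFFFFFFFF.toInt(), 0x00000000) // person, then background

    val out = composite(intArrayOf(red, red), intArrayOf(blue, blue), mask)
    println(out.map { Integer.toHexString(it) }) // [ffff0000, ff0000ff]
}
```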

Create new VideoFrame from outputBitmap

surfTextureHelper?.handler?.post {
    GLES20.glTexParameteri(
        GLES20.GL_TEXTURE_2D,
        GLES20.GL_TEXTURE_MIN_FILTER,
        GLES20.GL_NEAREST
    )
    GLES20.glTexParameteri(
        GLES20.GL_TEXTURE_2D,
        GLES20.GL_TEXTURE_MAG_FILTER,
        GLES20.GL_NEAREST
    )
    // Upload outputBitmap into the currently bound texture
    GLUtils.texImage2D(GLES20.GL_TEXTURE_2D, 0, outputBitmap, 0)

    val i420Buf = yuvConverter.convert(inputBuffer)
    // 180 is the frame rotation in degrees that we are using
    val outputVideoFrame = VideoFrame(i420Buf, 180, frameTs)
}

Send VideoFrame back to WebRTC

This replaces the input video feed with the composited frame, so both the local preview and the remote peers see the supplied background.

sink.onFrame(outputVideoFrame)

Time taken

The whole pipeline takes 40-50 ms per frame on average at 360p resolution, as measured on a OnePlus 6.

Optimizations

Most of the processing time is taken by the conversion from the input VideoFrame to a YuvFrame. Since the real-time view doesn't change much from one frame to the next, there is no point in doing this conversion on every frame. The previously converted YuvFrame can simply be reused for processing, which improves both performance and the user experience.
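A minimal sketch of this reuse idea is shown below. ReusingConverter is a hypothetical helper, and the interval of 3 is an arbitrary value chosen for illustration. The expensive conversion runs only on every Nth call, and the cached result is returned in between.

```kotlin
// Runs `convert` only on every `interval`-th call and reuses the cached result otherwise.
class ReusingConverter<I, O : Any>(
    private val interval: Int,
    private val convert: (I) -> O
) {
    init {
        require(interval >= 1) { "interval must be at least 1" }
    }

    private var callCount = 0
    private var cached: O? = null

    fun get(input: I): O {
        val current = cached
        // Convert on the first call and then once per interval
        val result = if (current == null || callCount == 0) convert(input) else current
        cached = result
        callCount = (callCount + 1) % interval
        return result
    }
}

fun main() {
    var conversions = 0
    val converter = ReusingConverter<Int, String>(interval = 3) { frameIndex ->
        conversions++
        "converted-$frameIndex"
    }

    val outputs = (0 until 6).map { converter.get(it) }
    println(outputs)     // [converted-0, converted-0, converted-0, converted-3, converted-3, converted-3]
    println(conversions) // 2
}
```

The trade-off is latency: with a larger interval, the output lags further behind fast movement, so the value should be tuned on the target devices.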

Oldest comments (4)

Yogesh Singh 100ms Inc. • Jan 5 '22

awesome! 😍

neilyoung • Feb 17 '22

This looks nice and logical. However, it doesn't work, if you are capturing from a SurfaceView (screen sharing). Whatever I do, the YUV plane contains just 16 in Y and -128 in U and V. Sounds strange, but is what it is.

Ashish Kumar verma • Mar 7 '22

We have not tried capturing from surfaceView, will definitely check it @neilyoung .
Thanks for the input

pairat • Oct 14 '23

I tried using it for about 10 minutes and ran into memory problems causing the app to crash. I tried deleting the code until only convert yuv frame was left, but I ran into a problem. When I first looked in profiler, the native memory kept increasing. Until at one point it decreased and then increased to other instead. Then the app froze until it shut down.