DEV Community


Python WebRTC basics with aiortc

whitphx profile image Yuichiro Tachibana (Tsuchiya) ・12 min read

Background

Python is a very popular programming language, and one of the reasons is its Machine Learning, Image Processing, and Computer Vision ecosystems.
The developers and researchers in these fields often implement some models with Python and launch real-time demos with OpenCV, consuming a video via cv2.VideoCapture() and displaying a preview via cv2.imshow().

However, there is a limitation with this method - the demo is only accessible from localhost because cv2.VideoCapture(0) accesses a locally connected device and cv2.imshow() shows a video on a local screen.
(Technically, using these remotely is possible, for example with X forwarding or by transmitting the local video as an RTSP stream, but those approaches are neither simple nor easy.)

So we want to make web-based apps so that users can easily try the CV models remotely, with real-time video input from their local webcams or smartphones.

To do this, we can utilize WebRTC (Web Real-Time Communication).
WebRTC enables web servers and clients, including web browsers, to send and receive video, audio, and arbitrary data streams over the network with low latency.
It is now supported by major browsers like Chrome, Firefox, and Safari, and its specs are open and standardized. Browser-based real-time video chat apps like Google Meet are common examples of WebRTC usage.

In this article, I will explain the basics of WebRTC, how to use it with Python, and how to write code sending and receiving video streams between web browsers and Python servers.

This article also supplements this article, which explains how to build a Python library for the Streamlit framework that enables transmitting video streams between clients and servers via WebRTC.

WebRTC basics and aiortc

We want to learn about WebRTC and implement a WebRTC system with Python (and JavaScript for the frontend).

Fortunately, we can use aiortc, a great open-source WebRTC library for Python. We will start by running the sample code and then learn about WebRTC based on it.

To start, clone aiortc repository to your environment:

$ git clone https://github.com/aiortc/aiortc.git
$ cd aiortc/

In this article, we also check out the 1.0.0 tag for consistency. This may not be necessary for your use with future releases.

$ git checkout 1.0.0

Then, run the sample in examples/server directory.

$ cd examples/server/
$ python server.py
======== Running on http://0.0.0.0:8080 ========
(Press CTRL+C to quit)

Access http://localhost:8080/ in your web browser as instructed. (This example uses Google Chrome.) The page will look like this:

Next, check the "Use video" box and click the "Start" button. If you are asked for permission to access the devices, permit it. Then, the screen will show a video stream sourced from your webcam.

aiortc video example

Here, the video stream is captured from the webcam in the frontend JS process, sent to the server-side Python process, and sent back to the frontend to show the preview. To verify this, reload the page, set the second select box in the "Use video" checkbox row from "No transform" to another option, and click "Start" again. For example, with "Edge detection", the result will be this:

aiortc edge extraction example

This image transformation is implemented as server-side Python code. For example, here’s the edge detection code.

# perform edge detection
img = frame.to_ndarray(format="bgr24")
img = cv2.cvtColor(cv2.Canny(img, 100, 200), cv2.COLOR_GRAY2BGR)

# rebuild a VideoFrame, preserving timing information
new_frame = VideoFrame.from_ndarray(img, format="bgr24")
new_frame.pts = frame.pts
new_frame.time_base = frame.time_base
return new_frame

Since the video stream is sent to and processed on a server-side Python process, you can implement and inject arbitrary image processing/transforming code into it in the Python world. (If you're interested, you can try this by customizing VideoTransformTrack.recv().)
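As a sketch of the kind of transform you could drop into VideoTransformTrack.recv(), here is a minimal per-frame function. To keep it dependency-light it uses plain numpy instead of OpenCV and simply inverts the colors of a bgr24 frame; the function name invert is my own, not part of the sample.

```python
import numpy as np

def invert(img: np.ndarray) -> np.ndarray:
    """Invert a bgr24 frame of shape (height, width, 3)."""
    return 255 - img

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # a black test frame
print(invert(frame)[0, 0])  # -> [255 255 255]
```

In the real recv(), the numpy array would come from frame.to_ndarray(format="bgr24") and the result would be re-wrapped with VideoFrame.from_ndarray(), just as in the edge detection snippet above.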

With this example, we can see the huge potential to achieve what we want - sending video streams from a web browser, processing the streams in a Python process, and sending them back to the frontend again to preview.

Explanation of the example and lessons about WebRTC basics

To understand what happens under the hood, let's investigate the code and learn about the basics of WebRTC. We will see the example code step by step in this section.

Start by setting up a connection

With the example above, when you access http://localhost:8080/, the web browser loads client.js and index.html. This is the server-side code that returns these static files. The main part of this example is written in client.js. Let's look at it (and index.html when necessary as a reference).

When the "Start" button is clicked, the start() function is called, which is defined here.

In start(), after disabling the "Start" button, it calls createPeerConnection(). This function creates a new RTCPeerConnection object and attaches some event listeners to it.

RTCPeerConnection represents a WebRTC connection from peer to peer. It's one of the core objects in this WebRTC code. The 'track' event listener is important here. It’s called when a new "track" is established where a video or audio stream is flowing from the server-side (another peer of the connection) to the frontend (your end of the connection). This will be explained later.

After setting up an RTCPeerConnection object, pc, in createPeerConnection(), return to start() and jump to this part of it.

Connecting a video stream from the local webcam

The constraints object is set up to be passed to navigator.mediaDevices.getUserMedia(), which requests users to allow access to the local media devices like webcams or microphones. Here, we assume constraints.video is set to true, as "Use video" is checked. Then, navigator.mediaDevices.getUserMedia(constraints) requests your permission to access the webcam and returns a promise which is fulfilled with a stream object if successful.

When the stream object is obtained, the "tracks" of the stream are added to the connection, pc, with pc.addTrack() like this. This means connecting a video stream ("track") from the local webcam to the WebRTC connection.

In the procedure above, you’re setting up the connection object associated with a local video stream. Next, we call the negotiate() function. This function is responsible for signaling, one of the most important parts of WebRTC.

What is signaling and what does it do?

In WebRTC, media and data streams are transmitted via a peer-to-peer connection. To establish the WebRTC connection, the peers have to complete a signaling process first.

Signaling is the exchange of the metadata of each peer, called session description. It includes information such as available media codecs, the IP address of the peer, available ports, etc. The peers will establish the connection between them based on the data obtained from signaling.

During this exchange, this metadata is expressed as text in the SDP (Session Description Protocol) format.
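To make the format concrete, here is a tiny, hand-written excerpt of what a session description looks like in SDP; the values are illustrative, not taken from a real negotiation. Each line is a `<type>=<value>` pair, which can be split with the standard library alone:

```python
# A tiny illustrative excerpt of a session description in SDP format.
sdp = """v=0
o=- 3696964668 3696964668 IN IP4 0.0.0.0
s=-
t=0 0
m=video 9 UDP/TLS/RTP/SAVPF 96
a=rtpmap:96 VP8/90000
"""

# Each SDP line is "<type>=<value>"; collect the pairs, then pick out
# the media ("m=") lines, which describe the offered media streams.
fields = [line.split("=", 1) for line in sdp.strip().splitlines()]
media = [value for key, value in fields if key == "m"]
print(media)  # -> ['video 9 UDP/TLS/RTP/SAVPF 96']
```

In a real offer, the "m=" and "a=" lines list the codecs and options the peer supports, which is exactly the metadata exchanged during signaling.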

Signaling

In the signaling phase, one peer first generates a message containing its metadata, called the offer, and sends it to the other peer. When the other peer receives and reads the offer, it generates an answer containing its own metadata and sends it back to the offerer.

Here, the discovery method of message destinations and the medium to send the messages depend on the application.

In chat tools like Google Meet, the central server mediates the message transportation. In streamlit-webrtc, the JS frontend sends an offer to the Python server and vice versa. The application developer, not the WebRTC standard, determines the transportation method of this exchange.
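The HTTP-based transport this example uses can be sketched in a few dozen lines. This is a dependency-free stand-in using only the standard library (the aiortc sample uses aiohttp instead); the "/offer" path matches the sample, but the answer built here is a hard-coded placeholder, not a real SDP answer.

```python
# A minimal sketch of HTTP-based signaling using only the stdlib.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class SignalingHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        offer = json.loads(self.rfile.read(length))  # the peer's offer
        # A real server would now run setRemoteDescription(offer),
        # createAnswer(), and setLocalDescription() on its connection.
        answer = {"type": "answer", "sdp": "v=0 ..."}
        body = json.dumps(answer).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass

def post_offer(url, offer):
    """Send the offer as a JSON payload and return the parsed answer."""
    req = Request(url, data=json.dumps(offer).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())

server = HTTPServer(("127.0.0.1", 0), SignalingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/offer"
print(post_offer(url, {"type": "offer", "sdp": "v=0 ..."})["type"])  # -> answer
server.shutdown()
```

The point is only the shape of the exchange: the offer goes out as a JSON request body, and the answer comes back as a JSON response body, exactly as the sample's fetch() call does.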

What the negotiate() function does

Let's take a look at the negotiate() function in the sample. This is the JavaScript frontend and is the "offerer" side in the explanation above. First in the function, pc.createOffer() is called to generate the offer. You can generate the offer simply with this method; it's provided as part of the browser's WebRTC API.

Then the generated offer is set to the pc object with pc.setLocalDescription() in the next line. If you are interested in the contents of offer, try to see it by putting console.log(offer) into the code 😉.

Note: pc.createOffer() is an async method that does not return the generated offer object directly, but returns a Promise object to be fulfilled with the offer object. The sample code is written in so-called Promise-style. If you are not familiar with it, I recommend learning about it so you can follow along with the code.

The offer, which pc.createOffer() created here, only contains metadata about media - such as available codecs - but not network information.

The SDP offer includes information about any MediaStreamTracks already attached to the WebRTC session, codec, and options supported by the browser, and any candidates already gathered by the ICE agent, ...

MDN Web Docs, Mozilla.org

Gathering network connectivity information

Network connectivity information is called an ICE candidate. An ICE candidate contains information about the methods the peer can use to make a connection, such as its IP address and available ports. (This is the simple case where the peers connect directly. In other cases, for example when an intermediate TURN server is used, the ICE candidate content will vary.)
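For a feel of what a candidate actually carries, here is one ICE candidate in its text form, with made-up values. This is a "host" candidate: a direct IP address and port on the local network, whose fields we can pull apart with a simple split:

```python
# One ICE candidate in its text form (values are illustrative).
line = "candidate:1 1 udp 2130706431 192.168.1.7 52431 typ host"

def parse_candidate(line):
    parts = line.split()
    return {
        "foundation": parts[0].split(":", 1)[1],
        "component": int(parts[1]),
        "protocol": parts[2],
        "priority": int(parts[3]),
        "ip": parts[4],
        "port": int(parts[5]),
        "type": parts[7],  # parts[6] is the literal word "typ"
    }

print(parse_candidate(line)["ip"])  # -> 192.168.1.7
```

The "typ host" suffix marks a direct candidate; relayed candidates gathered through a TURN server would carry "typ relay" and the relay's address instead.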

Gathering ICE candidates runs asynchronously and the code here looks a bit complicated as it contains a mixture of a Promise and callbacks.

Now, all information necessary for the offer is gathered and stored in pc.localDescription (see here). It's time to send the offer to the other peer, which is in this case a server-side Python process. In this example we want to establish a connection from the frontend JS process to the server-side Python process.

This part of the code is for sending the offer, and it's a simple fetch() function call. As I explained above, the sending method is not standardized by WebRTC itself. In this case, a simple HTTP request to the Python server is chosen as a design decision. The offer is transmitted as a JSON payload of the HTTP request.

Note: For now, you can ignore this part of the code between the reference to pc.localDescription and the fetch() call.

What’s happening server-side

Let's see the server-side code invoked by this HTTP request, offer() coroutine in server.py. From here, we’ll review this coroutine to understand the process running on the other peer in the signaling phase, from receiving the "offer" to responding with the "answer".

First, the received JSON payload is parsed to obtain the offer sent from the frontend here.

The following process in this coroutine looks very similar to the frontend one, like a mirror image, since the offer and answer mechanism is largely symmetric. First, the RTCPeerConnection object is created here. This is equivalent to this code in the frontend. Then the necessary event listeners are attached. (They'll be explained later.)

In the next line, the received "offer" is passed to the pc object with pc.setRemoteDescription(). Note that offer here contains a session description generated on the frontend, which is the remote peer as seen from this side. (The equivalent code has not appeared in the frontend yet, but it will later 🙂.)

After that, pc.createAnswer() is called to generate an "answer". The generated answer is set to the pc object with pc.setLocalDescription in the next line. pc.createAnswer() here is equivalent to pc.createOffer() in the frontend code. An important point here is the server-side pc object now owns session descriptions generated by both peers. (Remember that both pc.setRemoteDescription and pc.setLocalDescription have been called.)

Finally, the answer is sent back to the frontend as a JSON payload of an HTTP response, same as the request with the offer.

Returning to the frontend to see the rest of negotiate()

After the HTTP response returns, its JSON payload is parsed and the answer is passed to the frontend pc object with pc.setRemoteDescription. It's the same as the server-side process we've seen above. At this point, the frontend pc object also owns the session descriptions of both peers.

Now signaling has finished! As a result, both peers have both session descriptions. That is the goal of signaling.
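The end state described above can be modeled as a toy exchange: after offer/answer, each peer holds BOTH session descriptions. The dicts here merely stand in for RTCPeerConnection's localDescription and remoteDescription; the API names in the comments are the real ones from the flow we just walked through.

```python
# A toy model of what signaling achieves for each peer's connection.
def run_signaling(offerer, answerer):
    offer = {"type": "offer", "sdp": "v=0 ..."}    # pc.createOffer()
    offerer["local"] = offer                       # pc.setLocalDescription()
    answerer["remote"] = offer                     # pc.setRemoteDescription()
    answer = {"type": "answer", "sdp": "v=0 ..."}  # pc.createAnswer()
    answerer["local"] = answer                     # pc.setLocalDescription()
    offerer["remote"] = answer                     # pc.setRemoteDescription()

browser, server = {}, {}
run_signaling(browser, server)
print(sorted(browser) == sorted(server) == ["local", "remote"])  # -> True
```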

NOTE: This section gave you a brief introduction of WebRTC signaling and connection. If interested, you can read more details in this article from Mozilla's MDN Web Docs.

What happens after signaling is achieved

After signaling is complete, the browser's WebRTC API and server-side aiortc library automatically start trying to establish a connection. They try to find a workable network path based on the gathered ICE candidates included in the session descriptions.

Signaling and connection

This tutorial skips the details of this step. If you're interested, I recommend looking up keywords such as "NAT traversal", "STUN", "TURN", "ICE", and "(UDP) hole punching".

Video stream transmission

Video transmission starts after a WebRTC connection is established.

At first, since a video stream (a track) is already attached to the connection on the frontend (remember pc.addTrack() above), the video stream from the frontend to the Python server starts to transmit.

The server-side responds to the newly-added track by triggering a track event. The event listener is called with the added track as an argument [1]. In this listener function, there are two blocks handling audio and video respectively, and we will focus on the video one.

Here a VideoTransformTrack object is created with the added track object as a constructor argument, and it is added to the server-side pc object with pc.addTrack(). Just like the frontend equivalent, this track is transmitted from peer to peer – in this case, from server to frontend.

[1] Strictly speaking, this track event is fired before the connection is established, but for now I explain it this way for simplicity.

VideoTransformTrack is defined in the top part of server.py. As its name implies, this class takes a track as an input, applies some image transformation to the video from the track, and outputs the result frames as another track.

This is the place where computer vision logic is implemented 😎. Note that its base class, MediaStreamTrack, is not part of the standard WebRTC API but a utility class that aiortc provides.
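The wrapping pattern VideoTransformTrack uses can be shown without aiortc at all. Below is a dependency-free analogue: a wrapper "track" whose async recv() pulls a frame from an input track, transforms it, and returns the result. The class names here are my own; aiortc's MediaStreamTrack works the same way, just with av.VideoFrame objects instead of plain values.

```python
import asyncio

class ListTrack:
    """A stand-in source track that yields pre-recorded 'frames'."""
    def __init__(self, frames):
        self._frames = iter(frames)
    async def recv(self):
        return next(self._frames)

class TransformTrack:
    """Wraps another track and applies `transform` to each frame."""
    def __init__(self, track, transform):
        self.track = track          # like VideoTransformTrack's self.track
        self.transform = transform
    async def recv(self):
        frame = await self.track.recv()  # pull a frame from the input track
        return self.transform(frame)     # transform it and pass it on

async def main():
    source = ListTrack([1, 2, 3])
    track = TransformTrack(source, lambda x: x * 10)
    return [await track.recv() for _ in range(3)]

print(asyncio.run(main()))  # -> [10, 20, 30]
```

This is the same design choice aiortc makes: the transform is itself a track, so it can be handed straight to pc.addTrack() and composed with other tracks.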

A closer look at the recv() method

Let's take a glance at its recv() method, where OpenCV filters transform each frame. It's easy to understand what happens there if you are familiar with computer vision code in Python and OpenCV.

One unique point of recv() is that it gets an input frame from the input track via its recv() method (note that self.track is passed as a constructor argument and represents the media stream from the frontend) and returns a new frame, an instance of the av.VideoFrame class (see here), which wraps a regular numpy array.

Let's return to the process following pc.addTrack() in the track event listener.

Making the video stream visible on the web page

After pc.addTrack() finishes, a video stream from the server to the frontend is established, and in turn, a track event is triggered on the frontend to call its event listener. Remember that we already attached the event listener above (once again, here is the link😀).

Within this event listener, the added track, which represents a media stream from the server to the frontend, is attached to an HTML <video> element as its source so that the <video> element displays the video stream! This is why you can see the video stream on the web page.

Conclusion

  • aiortc is a WebRTC library for Python.
  • WebRTC has a preparation phase called "Signaling", during which the peers exchange data called "offers" and "answers" in order to gather necessary information to establish the connection.
  • Developers can choose an arbitrary method for signaling, such as an HTTP request/response mechanism.
  • After Signaling, the WebRTC connection is established and starts transmitting media streams.
