DEV Community


Python WebRTC basics with aiortc

whitphx profile image Yuichiro Tachibana (Tsuchiya) ・12 min read

Background

Python is a very popular programming language, and one of the reasons is its Machine Learning, Image Processing, and Computer Vision ecosystems.
The developers and researchers in these fields often implement some models with Python and launch real-time demos with OpenCV, consuming a video via cv2.VideoCapture() and displaying a preview via cv2.imshow().

However, there is a limitation with this method - the demo is only accessible from localhost because cv2.VideoCapture(0) accesses a locally connected device and cv2.imshow() shows a video on a local screen.
(Technically, using these remotely is possible, for example with X forwarding or by transmitting the local video as an RTSP stream, but those approaches are neither simple nor easy.)

So we want to make web-based apps so that users can easily try the CV models remotely, with real-time video input from their local webcams or smartphones.

To do this, we can utilize WebRTC (Web Real-Time Communication).
WebRTC enables web servers and clients, including web browsers, to send and receive video, audio, and arbitrary data streams over the network with low latency.
It is now supported by major browsers like Chrome, Firefox, and Safari, and its specs are open and standardized. Browser-based real-time video chat apps like Google Meet are common examples of WebRTC usage.

In this article, I will explain the basics of WebRTC, how to use it with Python, and how to write code sending and receiving video streams between web browsers and Python servers.

This article also supplements this article, which explains how to build a Python library for the Streamlit framework that enables transmitting video streams between clients and servers via WebRTC.

WebRTC basics and aiortc

We want to learn about WebRTC and implement a WebRTC system with Python (and JavaScript for the frontend).

Fortunately, we can use aiortc, a great open-source WebRTC library for Python. We will start by running the sample code and then learn about WebRTC based on it.

To start, clone aiortc repository to your environment:

$ git clone https://github.com/aiortc/aiortc.git
$ cd aiortc/

In this article, we also check out the 1.0.0 tag for consistency. This may not be necessary for your use with future releases.

$ git checkout 1.0.0

Then, run the sample in examples/server directory.

$ cd examples/server/
$ python server.py
======== Running on http://0.0.0.0:8080 ========
(Press CTRL+C to quit)

Access http://localhost:8080/ in your web browser as instructed. (This example uses Google Chrome.) The page will look like this:

Next, check the "Use video" box and click the "Start" button. If you are asked for permission to access the devices, permit it. Then, the screen will show a video stream sourced from your webcam.

aiortc video example

Here, the video stream is captured from the webcam in the frontend JS process, sent to the server-side Python process, and sent back to the frontend to show the preview. To verify this, reload the page, set the second select box in the "Use video" checkbox row from "No transform" to another option, and click "Start" again. For example, with "Edge detection", the result will be this:

aiortc edge extraction example

This image transformation is implemented as server-side Python code. For example, here’s the edge detection code.

# perform edge detection
img = frame.to_ndarray(format="bgr24")
img = cv2.cvtColor(cv2.Canny(img, 100, 200), cv2.COLOR_GRAY2BGR)

# rebuild a VideoFrame, preserving timing information
new_frame = VideoFrame.from_ndarray(img, format="bgr24")
new_frame.pts = frame.pts
new_frame.time_base = frame.time_base
return new_frame

Since the video stream is sent to and processed on a server-side Python process, you can implement and inject arbitrary image processing/transforming code into it in the Python world. (If you're interested, you can try this by customizing VideoTransformTrack.recv().)
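As a sketch of the kind of transform you could drop into VideoTransformTrack.recv(), here is a minimal per-frame function. To keep it dependency-light it uses plain numpy instead of OpenCV and simply inverts the colors of a bgr24 frame; the function name invert is my own, not part of the sample.

```python
import numpy as np

def invert(img: np.ndarray) -> np.ndarray:
    """Invert a bgr24 frame of shape (height, width, 3)."""
    return 255 - img

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # a black test frame
print(invert(frame)[0, 0])  # -> [255 255 255]
```

In the real recv(), the numpy array would come from frame.to_ndarray(format="bgr24") and the result would be re-wrapped with VideoFrame.from_ndarray(), just as in the edge detection snippet above.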

With this example, we can see the huge potential to achieve what we want - sending video streams from a web browser, processing the streams in a Python process, and sending them back to the frontend again to preview.

Explanation of the example and lessons about WebRTC basics

To understand what happens under the hood, let's investigate the code and learn about the basics of WebRTC. We will see the example code step by step in this section.

Start by setting up a connection

With the example above, when you access http://localhost:8080/, the web browser loads client.js and index.html. This is the server-side code that returns these static files. The main part of this example is written in client.js. Let's look at it (and index.html when necessary as a reference).

When the "Start" button is clicked, the start() function is called, which is defined here.

In start(), after disabling the "Start" button, it calls createPeerConnection(). This function creates a new RTCPeerConnection object and attaches some event listeners to it.

RTCPeerConnection represents a WebRTC connection from peer to peer. It's one of the core objects in this WebRTC code. The 'track' event listener is important here. It’s called when a new "track" is established where a video or audio stream is flowing from the server-side (another peer of the connection) to the frontend (your end of the connection). This will be explained later.

After setting up an RTCPeerConnection object, pc, in createPeerConnection(), return to start() and jump to this part of it.

Connecting a video stream from the local webcam

The constraints object is set up to be passed to navigator.mediaDevices.getUserMedia(), which requests users to allow access to the local media devices like webcams or microphones. Here, we assume constraints.video is set to true, as "Use video" is checked. Then, navigator.mediaDevices.getUserMedia(constraints) requests your permission to access the webcam and returns a promise which is fulfilled with a stream object if successful.

When the stream object is obtained, the "tracks" of the stream are added to the connection, pc, with pc.addTrack() like this. This means connecting a video stream ("track") from the local webcam to the WebRTC connection.

In the procedure above, you’re setting up the connection object associated with a local video stream. Next, we call the negotiate() function. This function is responsible for signaling, one of the most important parts of WebRTC.

What is signaling and what does it do?

In WebRTC, media and data streams are transmitted via a peer-to-peer connection. To establish the WebRTC connection, the peers have to complete a signaling process first.

Signaling is the exchange of the metadata of each peer, called session description. It includes information such as available media codecs, the IP address of the peer, available ports, etc. The peers will establish the connection between them based on the data obtained from signaling.

During this exchange, this metadata is expressed as text in the SDP (Session Description Protocol) format.
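To make the format concrete, here is a tiny, hand-written excerpt of what a session description looks like in SDP; the values are illustrative, not taken from a real negotiation. Each line is a `<type>=<value>` pair, which can be split with the standard library alone:

```python
# A tiny illustrative excerpt of a session description in SDP format.
sdp = """v=0
o=- 3696964668 3696964668 IN IP4 0.0.0.0
s=-
t=0 0
m=video 9 UDP/TLS/RTP/SAVPF 96
a=rtpmap:96 VP8/90000
"""

# Each SDP line is "<type>=<value>"; collect the pairs, then pick out
# the media ("m=") lines, which describe the offered media streams.
fields = [line.split("=", 1) for line in sdp.strip().splitlines()]
media = [value for key, value in fields if key == "m"]
print(media)  # -> ['video 9 UDP/TLS/RTP/SAVPF 96']
```

In a real offer, the "m=" and "a=" lines list the codecs and options the peer supports, which is exactly the metadata exchanged during signaling.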

Signaling

In the signaling phase, one peer first generates a message containing its metadata, called the offer, and sends it to the other peer. When the other peer receives and reads the offer, it generates an answer containing its own metadata and sends it back to the offerer.

Here, the discovery method of message destinations and the medium to send the messages depend on the application.

In chat tools like Google Meet, the central server mediates the message transportation. In streamlit-webrtc, the JS frontend sends an offer to the Python server and vice versa. The application developer, not the WebRTC standard, determines the transportation method of this exchange.
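The HTTP-based transport this example uses can be sketched in a few dozen lines. This is a dependency-free stand-in using only the standard library (the aiortc sample uses aiohttp instead); the "/offer" path matches the sample, but the answer built here is a hard-coded placeholder, not a real SDP answer.

```python
# A minimal sketch of HTTP-based signaling using only the stdlib.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class SignalingHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        offer = json.loads(self.rfile.read(length))  # the peer's offer
        # A real server would now run setRemoteDescription(offer),
        # createAnswer(), and setLocalDescription() on its connection.
        answer = {"type": "answer", "sdp": "v=0 ..."}
        body = json.dumps(answer).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass

def post_offer(url, offer):
    """Send the offer as a JSON payload and return the parsed answer."""
    req = Request(url, data=json.dumps(offer).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())

server = HTTPServer(("127.0.0.1", 0), SignalingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/offer"
print(post_offer(url, {"type": "offer", "sdp": "v=0 ..."})["type"])  # -> answer
server.shutdown()
```

The point is only the shape of the exchange: the offer goes out as a JSON request body, and the answer comes back as a JSON response body, exactly as the sample's fetch() call does.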

What the negotiate() function does

Let's take a look at the negotiate() function in the sample. This is the JavaScript frontend and is the "offerer" side in the explanation above. First in the function, pc.createOffer() is called to generate the offer. You can generate the offer simply with this method; it's provided as part of the browser's WebRTC API.

Then the generated offer is set to the pc object with pc.setLocalDescription() in the next line. If you are interested in the contents of offer, try to see it by putting console.log(offer) into the code 😉.

Note: pc.createOffer() is an async method that does not return the generated offer object directly, but returns a Promise object to be fulfilled with the offer object. The sample code is written in so-called Promise-style. If you are not familiar with it, I recommend learning about it so you can follow along with the code.

The offer, which pc.createOffer() created here, only contains metadata about media - such as available codecs - but not network information.

The SDP offer includes information about any MediaStreamTracks already attached to the WebRTC session, codec, and options supported by the browser, and any candidates already gathered by the ICE agent, ...

MDN Web Docs, Mozilla.org

Gathering network connectivity information

Network connectivity information is called an ICE candidate. An ICE candidate contains information about the methods the peer can use to make a connection, such as its IP address and available ports. (This is the simple case where the peers connect directly. In other cases, for example when an intermediate TURN server is used, the ICE candidate content will vary.)
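For a feel of what a candidate actually carries, here is one ICE candidate in its text form, with made-up values. This is a "host" candidate: a direct IP address and port on the local network, whose fields we can pull apart with a simple split:

```python
# One ICE candidate in its text form (values are illustrative).
line = "candidate:1 1 udp 2130706431 192.168.1.7 52431 typ host"

def parse_candidate(line):
    parts = line.split()
    return {
        "foundation": parts[0].split(":", 1)[1],
        "component": int(parts[1]),
        "protocol": parts[2],
        "priority": int(parts[3]),
        "ip": parts[4],
        "port": int(parts[5]),
        "type": parts[7],  # parts[6] is the literal word "typ"
    }

print(parse_candidate(line)["ip"])  # -> 192.168.1.7
```

The "typ host" suffix marks a direct candidate; relayed candidates gathered through a TURN server would carry "typ relay" and the relay's address instead.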

Gathering ICE candidates runs asynchronously and the code here looks a bit complicated as it contains a mixture of a Promise and callbacks.

Now, all information necessary for the offer is gathered and stored in pc.localDescription (see here). It's time to send the offer to the other peer, which is in this case a server-side Python process. In this example we want to establish a connection from the frontend JS process to the server-side Python process.

This part of the code is for sending the offer, and it's a simple fetch() function call. As I explained above, the sending method is not standardized by WebRTC itself. In this case, a simple HTTP request to the Python server is chosen as a design decision. The offer is transmitted as a JSON payload of the HTTP request.

Note: For now, you can ignore this part of the code between the reference to pc.localDescription and the fetch() call.

What’s happening server-side

Let's see the server-side code invoked by this HTTP request, offer() coroutine in server.py. From here, we’ll review this coroutine to understand the process running on the other peer in the signaling phase, from receiving the "offer" to responding with the "answer".

First, the received JSON payload is parsed to obtain the offer sent from the frontend here.

The following process in this coroutine looks very similar to the frontend one, like a mirror image, since the offer and answer mechanism is largely symmetric. First, the RTCPeerConnection object is created here. This is equivalent to this code in the frontend. Then the necessary event listeners are attached. (They'll be explained later.)

In the next line, the received "offer" is passed to the pc object with pc.setRemoteDescription(). Note that offer here contains a session description generated on the frontend, which is the remote peer as seen from this side. (The equivalent code has not appeared in the frontend yet, but it will later 🙂.)

After that, pc.createAnswer() is called to generate an "answer". The generated answer is set to the pc object with pc.setLocalDescription in the next line. pc.createAnswer() here is equivalent to pc.createOffer() in the frontend code. An important point here is the server-side pc object now owns session descriptions generated by both peers. (Remember that both pc.setRemoteDescription and pc.setLocalDescription have been called.)

Finally, the answer is sent back to the frontend as a JSON payload of an HTTP response, same as the request with the offer.

Returning to the frontend to see the rest of negotiate()

After the HTTP response returns, its JSON payload is parsed and the answer is passed to the frontend pc object with pc.setRemoteDescription. It's the same as the server-side process we've seen above. At this point, the frontend pc object also owns the session descriptions of both peers.

Now signaling has finished! As a result, both peers have both session descriptions. That is the goal of signaling.
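The end state described above can be modeled as a toy exchange: after offer/answer, each peer holds BOTH session descriptions. The dicts here merely stand in for RTCPeerConnection's localDescription and remoteDescription; the API names in the comments are the real ones from the flow we just walked through.

```python
# A toy model of what signaling achieves for each peer's connection.
def run_signaling(offerer, answerer):
    offer = {"type": "offer", "sdp": "v=0 ..."}    # pc.createOffer()
    offerer["local"] = offer                       # pc.setLocalDescription()
    answerer["remote"] = offer                     # pc.setRemoteDescription()
    answer = {"type": "answer", "sdp": "v=0 ..."}  # pc.createAnswer()
    answerer["local"] = answer                     # pc.setLocalDescription()
    offerer["remote"] = answer                     # pc.setRemoteDescription()

browser, server = {}, {}
run_signaling(browser, server)
print(sorted(browser) == sorted(server) == ["local", "remote"])  # -> True
```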

NOTE: This section gave you a brief introduction of WebRTC signaling and connection. If interested, you can read more details in this article from Mozilla's MDN Web Docs.

What happens after signaling is achieved

After signaling is complete, the browser's WebRTC API and server-side aiortc library automatically start trying to establish a connection. They try to find a workable network path based on the gathered ICE candidates included in the session descriptions.

Signaling and connection

This tutorial skips the details of this step. If you're interested, I recommend looking up keywords such as "NAT traversal", "STUN", "TURN", "ICE", and "(UDP) hole punching".

Video stream transmission

Video transmission starts after a WebRTC connection is established.

At first, since a video stream (a track) is already attached to the connection on the frontend (remember pc.addTrack() above), the video stream from the frontend to the Python server starts to transmit.

The server-side responds to the newly-added track by triggering a track event. The event listener is called with the added track as an argument [1]. In this listener function, there are two blocks handling audio and video respectively, and we will focus on the video one.

Here a VideoTransformTrack object is created with the added track object as a constructor argument, and it is added to the server-side pc object with pc.addTrack(). Just like the frontend equivalent, this track is transmitted from peer to peer – in this case, from server to frontend.

[1] Strictly speaking, this track event is fired before the connection is established, but for now I explain it this way for simplicity.

VideoTransformTrack is defined in the top part of server.py. As its name implies, this class takes a track as an input, applies some image transformation to the video from the track, and outputs the result frames as another track.

This is the place where computer vision logic is implemented 😎. Note that its base class, MediaStreamTrack, is not part of the standard WebRTC API but a utility class that aiortc provides.
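The wrapping pattern VideoTransformTrack uses can be shown without aiortc at all. Below is a dependency-free analogue: a wrapper "track" whose async recv() pulls a frame from an input track, transforms it, and returns the result. The class names here are my own; aiortc's MediaStreamTrack works the same way, just with av.VideoFrame objects instead of plain values.

```python
import asyncio

class ListTrack:
    """A stand-in source track that yields pre-recorded 'frames'."""
    def __init__(self, frames):
        self._frames = iter(frames)
    async def recv(self):
        return next(self._frames)

class TransformTrack:
    """Wraps another track and applies `transform` to each frame."""
    def __init__(self, track, transform):
        self.track = track          # like VideoTransformTrack's self.track
        self.transform = transform
    async def recv(self):
        frame = await self.track.recv()  # pull a frame from the input track
        return self.transform(frame)     # transform it and pass it on

async def main():
    source = ListTrack([1, 2, 3])
    track = TransformTrack(source, lambda x: x * 10)
    return [await track.recv() for _ in range(3)]

print(asyncio.run(main()))  # -> [10, 20, 30]
```

This is the same design choice aiortc makes: the transform is itself a track, so it can be handed straight to pc.addTrack() and composed with other tracks.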

A closer look at the recv() method

Let's take a glance at its recv() method, where OpenCV filters transform each frame. It's easy to understand what happens there if you are familiar with computer vision code in Python and OpenCV.

One unique point of recv() is that it gets an input frame from the input track via its recv() method (note that self.track is passed as a constructor argument and represents the media stream from the frontend) and returns a new frame, an instance of the av.VideoFrame class (see here), which wraps a regular numpy array.

Let's return to the process following pc.addTrack() in the track event listener.

Making the video stream visible on the web page

After pc.addTrack() finishes, a video stream from the server to the frontend is established, and in turn, a track event is triggered on the frontend to call its event listener. Remember that we already attached the event listener above (once again, here is the link😀).

Within this event listener, the added track, which represents a media stream from the server to the frontend, is attached to an HTML <video> element as its source so that the <video> element displays the video stream! This is why you can see the video stream on the web page.

Conclusion

  • aiortc is a WebRTC library for Python.
  • WebRTC has a preparation phase called "Signaling", during which the peers exchange data called "offers" and "answers" in order to gather necessary information to establish the connection.
  • Developers can choose an arbitrary method for signaling, such as an HTTP request/response mechanism.
  • After Signaling, the WebRTC connection is established and starts transmitting media streams.
