Python is a very popular programming language, and one of the reasons is its Machine Learning, Image Processing, and Computer Vision ecosystems.
The developers and researchers in these fields often implement models in Python and launch real-time demos with OpenCV, consuming a video stream via
cv2.VideoCapture() and displaying a preview via cv2.imshow().
However, this method has a limitation - the demo is only accessible from localhost because
cv2.VideoCapture(0) accesses a locally connected device and
cv2.imshow() shows a video on a local screen.
(Technically, using these methods remotely is possible, for example with X forwarding or by transmitting the local video as an RTSP stream, but these approaches are neither simple nor easy.)
So, we want to build web-based apps instead, so that users can easily try the CV models remotely, with real-time video input from their local webcams or smartphones.
To do this, we can utilize WebRTC (Web Real-Time Communication).
WebRTC enables web servers and clients, including web browsers, to send and receive video, audio, and arbitrary data streams over the network with low latency.
It is now supported by major browsers like Chrome, Firefox, and Safari, and its specs are open and standardized. Browser-based real-time video chat apps like Google Meet are common examples of WebRTC usage.
In this article, I will explain the basics of WebRTC, how to use it with Python, and how to write code sending and receiving video streams between web browsers and Python servers.
This article also supplements this article, which explains how to build a Python library for the Streamlit framework that enables transmitting video streams between clients and servers via WebRTC.
Fortunately, we can use
aiortc, a great open-source WebRTC library for Python. We will start by running the sample code and then learn about WebRTC based on it.
To start, clone
aiortc repository to your environment:
```
$ git clone https://github.com/aiortc/aiortc.git
$ cd aiortc/
```
In this article, we also check out the
1.0.0 tag for consistency. This may not be necessary when you work with future releases.
```
$ git checkout 1.0.0
```
Then, run the sample in
$ cd examples/server/ $ python server.py ======== Running on http://0.0.0.0:8080 ======== (Press CTRL+C to quit)
Access http://localhost:8080/ in your web browser as instructed. (This example uses Google Chrome.) The page will look like this:
Next, check the "Use video" box and click the "Start" button. If you are asked for permission to access the devices, permit it. Then, the screen will show a video stream sourced from your webcam.
Here, the video stream is captured from a webcam in the frontend JS process, sent to the server-side Python process, and sent back to the frontend to show the preview. To verify it after reloading the page, set the second select box in the "Use video" checkbox row from "No transform" to another option, and click "Start" again. For example, with "Edge detection", the result will be this:
This image transformation is implemented as server-side Python code. For example, here’s the edge detection code.
```python
# perform edge detection
img = frame.to_ndarray(format="bgr24")
img = cv2.cvtColor(cv2.Canny(img, 100, 200), cv2.COLOR_GRAY2BGR)

# rebuild a VideoFrame, preserving timing information
new_frame = VideoFrame.from_ndarray(img, format="bgr24")
new_frame.pts = frame.pts
new_frame.time_base = frame.time_base
return new_frame
```
Since the video stream is sent to and processed in a server-side Python process, you can implement and inject arbitrary image processing/transformation code into it in the Python world. (If you are interested, you can try it by customizing the transform code in server.py.)
With this example, we can see the huge potential to achieve what we want - sending video streams from a web browser, processing the streams in a Python process, and sending them back to the frontend again to preview.
To understand what happens under the hood, let's investigate the code and learn about the basics of WebRTC. We will see the example code step by step in this section.
With the example above, when you access http://localhost:8080/, the web browser loads
index.html. This is the server-side code that returns these static files. The main part of this example is written in
client.js. Let's look at it (and
index.html when necessary as a reference).
RTCPeerConnection represents a WebRTC connection between peers and is one of the core objects in this WebRTC code. The 'track' event listener is important here. It's called when a new "track" is established, through which a video or audio stream flows from the server side (the other peer of the connection) to the frontend (your end of the connection). This will be explained later.
After setting up the RTCPeerConnection object in
createPeerConnection(), we return to
start() and jump to this part of it.
The constraints object is set up to be passed to
navigator.mediaDevices.getUserMedia(), which requests users to allow access to the local media devices like webcams or microphones. Here, we assume
constraints.video is set to true, as "Use video" is checked. Then,
navigator.mediaDevices.getUserMedia(constraints) requests your permission to access the webcam and returns a promise which is fulfilled with a stream object if successful.
When the stream object is obtained, the "tracks" of the stream are added to the connection with
pc.addTrack() like this. This means connecting a video stream ("track") from the local webcam to the WebRTC connection.
In the procedure above, you’re setting up the connection object associated with a local video stream. Next, we call the
negotiate() function. This function is responsible for signaling, one of the most important parts of WebRTC.
In WebRTC, media and data streams are transmitted via a peer-to-peer connection. To establish the WebRTC connection, the peers have to complete a signaling process first.
Signaling is the exchange of the metadata of each peer, called session description. It includes information such as available media codecs, the IP address of the peer, available ports, etc. The peers will establish the connection between them based on the data obtained from signaling.
During this exchange, this metadata is expressed as text in the SDP (Session Description Protocol) format.
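For illustration, an SDP session description looks roughly like the abridged, made-up fragment below (real descriptions are much longer; all values here are placeholders):

```
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
s=-
t=0 0
m=video 9 UDP/TLS/RTP/SAVPF 96
c=IN IP4 0.0.0.0
a=rtpmap:96 VP8/90000
a=sendrecv
```

Each `m=` line describes a media section (here, video) and the `a=` attributes describe details such as the codec (VP8) and the direction of the stream.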
In the signaling phase, one client first generates a message containing the metadata, which is called the offer, and sends it to the other peer. As the other peer receives the offer and reads it, WebRTC generates an answer containing its metadata, and sends it back to the offerer.
Here, how a peer discovers the destination for its messages and the medium used to send them depend on the application.
In chat tools like Google Meet, the central server mediates the message transportation. In
streamlit-webrtc, the JS frontend sends an offer to the Python server and vice versa. The application developer, not the WebRTC standard, determines the transportation method of this exchange.
Let's take a look into the negotiate() function. First,
pc.createOffer() is called in order to generate the offer. You can generate the offer simply with this method, which is provided as a part of the browser's WebRTC API.
Then the generated offer is set to the
pc object with
pc.setLocalDescription() in the next line. If you are interested in the contents of
offer, try to see it by putting
console.log(offer) into the code 😉.
pc.createOffer() is an async method that does not return the generated offer object directly, but returns a Promise object to be fulfilled with the offer object. The sample code is written in so-called Promise-style. If you are not familiar with it, I recommend learning about it so you can follow along with the code.
The offer, which
pc.createOffer() created here, only contains metadata about media - such as available codecs - but not network information.
> The SDP offer includes information about any MediaStreamTracks already attached to the WebRTC session, codec, and options supported by the browser, and any candidates already gathered by the ICE agent, ...
Network connectivity information is called an ICE candidate. An ICE candidate contains information about the methods the peer can use to make a connection, such as its IP address and available ports. (This is for a simple case where the peers connect directly. In other cases, for example when the peers use an intermediate server called a TURN server, the content of the ICE candidates will vary.)
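An ICE candidate is usually expressed as an SDP "candidate" attribute. Below is a made-up example in the standard candidate format and a tiny, hypothetical parser for its main fields, just to show what information a candidate carries:

```python
# An illustrative ICE candidate string (all values are made up).
# Format: candidate:<foundation> <component> <transport> <priority> <ip> <port> typ <type>
sample = "candidate:1 1 udp 2130706431 192.0.2.10 54400 typ host"


def parse_candidate(line: str) -> dict:
    """Extract the main fields of an SDP candidate attribute (hypothetical helper)."""
    parts = line.split()
    return {
        "foundation": parts[0].split(":", 1)[1],
        "component": int(parts[1]),
        "transport": parts[2],   # usually "udp"
        "priority": int(parts[3]),
        "ip": parts[4],          # how the peer can be reached...
        "port": int(parts[5]),   # ...and on which port
        "type": parts[7],        # "host", "srflx" (via STUN), "relay" (via TURN), ...
    }
```

The `typ` field is what distinguishes a direct (`host`) candidate from ones discovered via STUN or relayed through a TURN server.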
Gathering ICE candidates runs asynchronously and the code here looks a bit complicated as it contains a mixture of a Promise and callbacks.
Now, all information necessary for the offer is gathered and stored in
pc.localDescription (see here). It's time to send the offer to the other peer, which in this case is the server-side Python process, since in this example we want to establish a connection from the frontend JS process to the server-side Python process.
This part of the code is for sending the offer, and it's a simple
fetch() function call. As I explained above, the sending method is not standardized by WebRTC itself. In this case, a simple HTTP request to the Python server is chosen as a design decision. The offer is transmitted as a JSON payload of the HTTP request.
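As a sketch of this design decision, the signaling messages can be carried as plain JSON objects. The "sdp" and "type" field names below follow the shape used by the aiortc server example; the helper functions themselves are hypothetical:

```python
import json


def encode_description(sdp: str, type_: str) -> str:
    """Serialize a session description into a JSON signaling payload."""
    return json.dumps({"sdp": sdp, "type": type_})


def decode_description(payload: str) -> dict:
    """Parse a JSON signaling payload back into a description dict."""
    msg = json.loads(payload)
    assert msg["type"] in ("offer", "answer")
    return msg
```

Any transport that can deliver such a payload (HTTP, WebSocket, even copy-and-paste) would work for signaling; this example simply chose an HTTP POST.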
Note: For now, you can ignore the part of the code between the reference to
pc.localDescription and the fetch() call.
Let's see the server-side code invoked by this HTTP request: the
offer() coroutine in
server.py. From here, we'll review this coroutine to understand the process running on the other peer during the signaling phase, from receiving the "offer" to responding with the "answer".
First, the received JSON payload is parsed to obtain the offer sent from the frontend here.
The process in this coroutine looks very similar to the frontend one, like a mirror image, since the offer and answer mechanism is roughly symmetric. First, the
RTCPeerConnection object is created here, which is equivalent to this code in the frontend. Then the necessary event listeners are attached. (They'll be explained later.)
In the next line, the received "offer" is passed to the
pc object with
pc.setRemoteDescription(). Note that offer here contains the session description generated on the frontend, which is the remote peer as seen from this side. (The equivalent code has not appeared in the frontend yet, but it will later 🙂.)
pc.createAnswer() is called to generate an "answer". The generated answer is set to the
pc object with
pc.setLocalDescription in the next line.
pc.createAnswer() here is equivalent to
pc.createOffer() in the frontend code. An important point here is the server-side
pc object now owns the session descriptions generated by both peers. (Remember that both
pc.setRemoteDescription() and pc.setLocalDescription() have been called on it.)
Finally, the answer is sent back to the frontend as a JSON payload of an HTTP response, same as the request with the offer.
After the HTTP response returns, its JSON payload is parsed and the answer is passed to the frontend
pc object with
pc.setRemoteDescription. It's the same as the server-side process we've seen above. At this point, the frontend
pc object also owns the session descriptions of both peers.
Now signaling has finished! As a result, both peers have both session descriptions. That is the goal of signaling.
NOTE: This section gave you a brief introduction to WebRTC signaling and connection establishment. If you are interested, you can read more details in this article from Mozilla's MDN Web Docs.
After signaling is complete, the browser's WebRTC API and server-side
aiortc library automatically start trying to establish a connection. They try to find a workable network path based on the gathered ICE candidates included in the session descriptions.
This tutorial will skip the details of this step. If you’re interested, I recommend referencing documents related to keywords such as "NAT Traversal", "STUN", "TURN", "ICE", and "(UDP) hole punching", including the following.
- Introduction to WebRTC protocols (Mozilla)
- WebRTC NAT Traversal Methods: A Case for Embedded TURN (Frozen Mountain)
Video transmission starts after a WebRTC connection is established.
At first, since a video stream (a track) is already attached to the connection on the frontend (remember
pc.addTrack() above), the video stream from the frontend to the Python server starts to transmit.
The server side responds to the newly added track by firing a track event. The event listener is called with the added track as an argument. In this listener function, there are two blocks handling audio and video respectively; we will focus on the video one.
A VideoTransformTrack object is created with the added track object as a constructor argument, and it is added to the server-side
pc object with
pc.addTrack(). Just like the frontend equivalent, this track is transmitted from peer to peer – in this case, from server to frontend.
 Strictly speaking, this
track event is fired before the connection is established, but for now I explain it this way for simplicity.
VideoTransformTrack is defined in the top part of
server.py. As its name implies, this class takes a track as an input, applies some image transformation to the video from the track, and outputs the result frames as another track.
This is the place where computer vision logic is implemented 😎. Note that its base class,
MediaStreamTrack, is not a part of the standard WebRTC API, but a utility class provided by aiortc.
Let's take a glance at its
recv() method, where OpenCV filters transform each frame. It's easy to understand what happens there if you are familiar with computer vision code in Python and OpenCV.
One unique point of
recv() is that it gets an input frame from the input track via its
recv() method (note that
self.track is passed as a constructor argument and represents a media stream from the frontend) and returns a new frame, which is an instance of the
av.VideoFrame class (see here). A regular NumPy array is wrapped into an av.VideoFrame object with VideoFrame.from_ndarray().
Let's return to the process following
pc.addTrack() in the track event listener. When
pc.addTrack() finishes, a video stream from the server to the frontend is established, and in turn, a track event is triggered on the frontend to call its event listener. Remember that we already attached the event listener above (once again, here is the link 😀).
Within this event listener, the added track, which represents a media stream from the server to the frontend, is attached to an HTML
<video> element as its source so that the
<video> element displays the video stream! This is why you can see the video stream on the web page.
- aiortc is a WebRTC library for Python.
- WebRTC has a preparation phase called "Signaling", during which the peers exchange data called "offers" and "answers" in order to gather necessary information to establish the connection.
- Developers can choose an arbitrary method for Signaling, such as an HTTP request/response mechanism.
- After Signaling, the WebRTC connection is established and starts transmitting media streams.