Ali Sherief

Diary of youtube-dl internals, part 4

The series continues! This week I elect to go through the omnipotent YoutubeIE extractor class again, and I will explain more about the inner workings of some of its methods.

The player cache

This seems to be a dictionary that maps a player to a function which, given an encrypted, base64-looking signature s, returns the decrypted signature. Encrypted signatures come from the "manifests" of YouTube videos, that is, the metadata for each "piece" of the video (modern streaming sites use DASH to split the video into chunks and serve one chunk at a time). Sometimes that manifest has an encrypted signature.

The player cache is stored in the member self._player_cache and is initialized to an empty dictionary. Its entries have the form _player_cache[player_id] = func_decrypt_sig, where player_id is a tuple of the player URL and a signature cache ID (both described below), not the video ID.
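To make the caching idea concrete, here is a minimal sketch of the "build once, then reuse" pattern. It is not youtube-dl's code: get_sig_func, build_demo and the "demo-key" key are names I made up for illustration, and the reversing lambda is only a stand-in for a real signature function.

```python
from typing import Callable, Dict, Hashable

SigFunc = Callable[[str], str]


def get_sig_func(cache: Dict[Hashable, SigFunc],
                 key: Hashable,
                 build: Callable[[], SigFunc]) -> SigFunc:
    """Return the cached function for `key`, building it only on a miss."""
    if key not in cache:
        cache[key] = build()  # the expensive step happens once per key
    return cache[key]


player_cache: Dict[Hashable, SigFunc] = {}  # starts empty, like _player_cache
calls = []  # records how many times the builder actually ran


def build_demo() -> SigFunc:
    """Hypothetical builder; a real one would extract the function."""
    calls.append(1)
    return lambda s: s[::-1]  # stand-in transformation, not the real one


f1 = get_sig_func(player_cache, "demo-key", build_demo)
f2 = get_sig_func(player_cache, "demo-key", build_demo)  # served from cache
print(f1 is f2, len(calls))
```

The point of the dictionary is simply that the second lookup with the same key never triggers the builder again.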

To see how this works in action, we need an example YouTube video with DASH chunks that have encrypted signatures. Fortunately, I found one: Where's The Truck - Dave Dudley - Charles Douglas (music video). It seems that encrypted signatures are mostly used in music videos like the one above.

_decrypt_signature()

This method, whose signature is _decrypt_signature(self, s, video_id, player_url, age_gate=False), is responsible for populating the _player_cache dictionary. Let's open a PyCharm debugger and simulate (via the -s command line option) the download of the above video.

[Screenshot: method variables of _decrypt_signature()]

There are a lot of interesting variables dumped in the screenshot I want to go over.

First we have player_url. At first glance you might think this is the URL of the video, but it's not: it is the URL of the JavaScript file that renders the YouTube player. When passed to _decrypt_signature(), it is originally /s/player/408be03a/player_ias.vflset/en_US/base.js. Then https://youtube.com is prepended to it.
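As a rough sketch of that prepending step, the standard library's urljoin does the job. The helper name absolute_player_url is my own, and the base URL is just the one mentioned above; I'm not claiming this is exactly how the extractor does it.

```python
from urllib.parse import urljoin


def absolute_player_url(player_url: str,
                        base: str = "https://youtube.com") -> str:
    """Turn a site-relative player path into an absolute URL.

    urljoin leaves URLs that are already absolute untouched.
    """
    return urljoin(base, player_url)


relative = "/s/player/408be03a/player_ias.vflset/en_US/base.js"
print(absolute_player_url(relative))
# https://youtube.com/s/player/408be03a/player_ias.vflset/en_US/base.js
```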

I am not sure what "408be03a" in the URL symbolizes; I thought this was the version of the script deployed, but so far, the other numbers I tried are giving 404 Not Found errors.

Here is the full value of the encrypted signature variable in the debug view.

'mAq0QJ8wRQIhALrcE92kDAxVsGOHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65jrCIhs7aSnW60u7Azakk14h6QUZw_nHlQ===='

This is a Python str holding what looks like base64 data. Here is what it looks like when decoded:

$ echo mAq0QJ8wRQIhALrcE92kDAxVsGOHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65jrCIhs7aSnW60u7Azakk14h6QUZw_nHlQ==== | base64 -i -d >/tmp/file.txt
base64: invalid input  # Ignore this
$ od -x </tmp/file.txt
0000000 0a98 40b4 309f 0245 0021 dcba dd13 0ca4
0000020 550c 63b0 5987 d7c6 1f5b 8122 566c 5679
0000040 c7ad 5c14 8ed9 7b2d 8045 1888 a897 d250
0000060 5ef2 8eb9 88b0 ce86 4ada ba75 eed2 cdc0
0000100 24a9 88d7 417a 7046 799c 0050
0000113

The decrypting function, func, generates another base64-looking string, likely the raw signature data. The value of func is obtained from self._extract_signature_function().

>>> print(s)
mAq0QJ8wRQIhALrcE92kDAxVsGOHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65jrCIhs7aSnW60u7Azakk14h6QUZw_nHlQ====
>>> print(func)
# <generic object output that translates to `lambda s: ''.join(s[i] for i in cache_spec)`>
>>> print(func(s))  # func() is the extracted signature function
AOq0QJ8wRQIhALrcE92kDAxVsGmHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65jrCIhs7aSnW60u7Azakk14h6QUZw_nHlQ==

And this is its decoded hex value:

$ echo AOq0QJ8wRQIhALrcE92kDAxVsGmHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65jrCIhs7aSnW60u7Azakk14h6QUZw_nHlQ== | base64 -i -d >/tmp/file.txt
base64: invalid input
$ od -x </tmp/file.txt
0000000 ea00 40b4 309f 0245 0021 dcba dd13 0ca4
0000020 550c 69b0 5987 d7c6 1f5b 8122 566c 5679
0000040 c7ad 5c14 8ed9 7b2d 8045 1888 a897 d250
0000060 5ef2 8eb9 88b0 ce86 4ada ba75 eed2 cdc0
0000100 24a9 88d7 417a 7046 799c 0050
0000113

These outputs are not identical. See below for why not.


Now, when I ran the program again with the same video, the signatures changed right in front of me. This gives me the impression that these signatures handle DRM-related things. I'm not too sure yet, and I'd have to explore deeper to see exactly what it does. But here is a screenshot of the variables created in self._extract_signature_function() anyway, with the new signatures.

[Screenshot: method variables of _extract_signature_function()]

The first thing we can see is that it constructs a function ID from the player_type, player_id (which is not the video ID), and something called a signature cache ID. Here in this example, it made a function ID of 'js_408be03a_106' by putting together:

player type - js ==================+
                                   |
ID of player_ias page - 408be03a ==+==> function ID - js_408be03a_106
                                   |
signature cache ID - 106 ==========+

player_type="js" represents the HTML5 player. There is also player_type="swf" for the phased-out Flash player.
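The way the three parts are put together can be sketched in a couple of lines. The helper name make_func_id is mine, and I'm assuming the parts are simply joined with underscores, which is what the resulting ID suggests.

```python
def make_func_id(player_type: str, player_id: str, sig_cache_id: str) -> str:
    """Join the three parts with underscores (assumed from the ID's shape)."""
    return "%s_%s_%s" % (player_type, player_id, sig_cache_id)


func_id = make_func_id("js", "408be03a", "106")
print(func_id)  # js_408be03a_106
```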

Alright, so then what happens? Using the function ID, it fetches a scrambled list of indices from its cache and stores it in the variable cache_spec. cache_spec is then used to rearrange the characters of the signature: character i of the result is s[cache_spec[i]]. In our example, this is what our list of indices looks like:

>>> print(cache_spec)
[1, 26, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 0, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103]

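It can be seen that the resulting signature is 104 characters long, and that positions 0, 1, and 26 (the 1st, 2nd and 27th characters) are the only ones scrambled. So it rotates those three characters: the 2nd character is now the 1st, the 27th is now the 2nd, and the 1st is now the 27th. All other characters keep their positions, except that the encrypted signature is 106 characters long, so its last two characters (the extra ==) are dropped.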
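We can replay this outside the debugger. The snippet below rebuilds the same cache_spec list, applies the lambda from the debugger output to the encrypted signature, and compares the result with the decrypted value printed earlier. The variable names expected and moved are mine; the strings are copied from the session above.

```python
# Encrypted signature and the decrypted value, copied from the debug session.
s = ("mAq0QJ8wRQIhALrcE92kDAxVsGOHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65j"
     "rCIhs7aSnW60u7Azakk14h6QUZw_nHlQ====")
expected = ("AOq0QJ8wRQIhALrcE92kDAxVsGmHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65j"
            "rCIhs7aSnW60u7Azakk14h6QUZw_nHlQ==")

# Same list as printed above: only positions 0, 1 and 26 differ from identity.
cache_spec = [1, 26] + list(range(2, 26)) + [0] + list(range(27, 104))

# The function the debugger showed: pick characters of s in cache_spec order.
func = lambda sig: "".join(sig[i] for i in cache_spec)

result = func(s)
moved = [i for i, j in enumerate(cache_spec) if i != j]

print(result == expected)   # True
print(moved)                # [0, 1, 26]
print(len(s), len(result))  # 106 104
```

Because cache_spec only has 104 entries, the last two characters of the 106-character input never get picked.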

At this point there are many unanswered questions here. Where does the signature cache ID come from, and how is it fetched from the cache? What does the ID of the player_ias page really represent? And if this function was supposed to decrypt signatures, then how come it only scrambled a few bytes? Shouldn't it have performed full AES256 decryption instead?

I don't have answers for these questions yet so all I can say is, I have to keep digging. For now though, you can get your mind off of _player_cache and see some of the simpler YoutubeIE methods below.

Printing methods

YoutubeIE has a fairly large number of methods that print diagnostics to the screen. Do you recognize any of them? 🤔

    def report_video_info_webpage_download(self, video_id):
        """Report attempt to download video info webpage."""
        self.to_screen('%s: Downloading video info webpage' % video_id)

    def report_information_extraction(self, video_id):
        """Report attempt to extract video information."""
        self.to_screen('%s: Extracting video information' % video_id)

    def report_unavailable_format(self, video_id, format):
        """Report extracted video URL."""
        self.to_screen('%s: Format %s not available' % (video_id, format))

    def report_rtmp_download(self):
        """Indicate the download will use the RTMP protocol."""
        self.to_screen('RTMP download detected')

That's it for now, folks. Thanks for reading, even if I kind of left you at a cliffhanger regarding the player cache. Hopefully I can get more info about that.

Of course, if you see any errors in this post, you know what to do - drop a note down in the comments so I can correct them.
