The series continues! This week I elect to go through the omnipotent YoutubeIE extractor class again, and I will explain more about the inner workings of some of its methods.
The player cache seems to be a dictionary of video IDs and functions that return their signatures, given an encrypted base64 signature s. Encrypted signatures come from "manifests" in Youtube videos, also known as metadata for each "piece" of the video (modern streaming sites use DASH to split the video into chunks and serve one chunk at a time). Sometimes, that manifest file has an encrypted signature.
The player cache is stored in the member self._player_cache and it's initialized to an empty dictionary. It has keys and values of the form _player_cache[videoID] = func_decrypt_sig.
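The caching pattern described here can be sketched in a few lines. This is only an illustration, not youtube-dl's actual code: the class name PlayerCacheSketch, the extractions counter, the "some-key" value and the string-reversing placeholder are all made up, and the key stands for whatever identifier the cache is indexed by.

```python
class PlayerCacheSketch:
    """Hypothetical stand-in for the caching behaviour described above."""

    def __init__(self):
        # Starts out as an empty dictionary, like self._player_cache.
        self._player_cache = {}
        # Counts how often the expensive extraction runs (illustration only).
        self.extractions = 0

    def _extract_signature_function(self, key):
        # Placeholder: the real method derives the function from the player.
        self.extractions += 1
        return lambda s: s[::-1]  # made-up transform, NOT the real algorithm

    def _decrypt_signature(self, s, key):
        # Build the function once per key, then reuse the cached one.
        if key not in self._player_cache:
            self._player_cache[key] = self._extract_signature_function(key)
        return self._player_cache[key](s)


ie = PlayerCacheSketch()
first = ie._decrypt_signature("abc", "some-key")
second = ie._decrypt_signature("xyz", "some-key")
print(first, second, ie.extractions)
```

The second call reuses the cached function, so the extraction counter stays at 1 even though two signatures were processed.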
To see how this works in action, we need an example Youtube video with DASH chunks that have encrypted signatures. Fortunately, I found one: Where's The Truck - Dave Dudley - Charles Douglas (music video). It seems that encrypted signatures are mostly used in music videos like the one above.
The _decrypt_signature() method, which is called as self._decrypt_signature(s, video_id, player_url, age_gate=False), is responsible for constructing the _player_cache dictionary. Let's open a PyCharm debugger simulating (via the -s command line option) the download of the above video.
There are a lot of interesting variables dumped in the screenshot I want to go over.
First we have player_url, the player URL passed to _decrypt_signature(). It is originally a relative path, and https://youtube.com is prepended to it.
I am not sure what "408be03a" in the URL symbolizes; I thought this was the version of the script deployed, but so far, the other numbers I tried are giving 404 Not Found errors.
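Prepending the host to a relative player URL can be sketched with the standard library. This is a minimal sketch under assumptions: the helper name absolutize_player_url is hypothetical, and the example path is invented for illustration (only the "408be03a" part appears in this post).

```python
import re
from urllib.parse import urljoin


def absolutize_player_url(player_url, base="https://youtube.com"):
    """Return player_url unchanged if it is already absolute, else join it onto base."""
    if re.match(r"https?://", player_url):
        return player_url
    return urljoin(base, player_url)


# Invented path for illustration; it is not the real player URL.
example = absolutize_player_url("/hypothetical/408be03a/player.js")
print(example)
```

An already absolute URL passes through untouched, so the helper is safe to call on either form.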
Here is the full value of the encrypted signature variable in the debug view.
This is a Python base64 str, which looks like this when decoded:
    $ echo mAq0QJ8wRQIhALrcE92kDAxVsGOHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65jrCIhs7aSnW60u7Azakk14h6QUZw_nHlQ==== | base64 -i -d >/tmp/file.txt
    base64: invalid input   # Ignore this
    $ od -x </tmp/file.txt
    0000000 0a98 40b4 309f 0245 0021 dcba dd13 0ca4
    0000020 550c 63b0 5987 d7c6 1f5b 8122 566c 5679
    0000040 c7ad 5c14 8ed9 7b2d 8045 1888 a897 d250
    0000060 5ef2 8eb9 88b0 ce86 4ada ba75 eed2 cdc0
    0000100 24a9 88d7 417a 7046 799c 0050
    0000113
The decrypting function corresponding to the video generates another base64 string, likely the raw signature data. The value of func is obtained from self._extract_signature_function().
    >>> print(s)
    mAq0QJ8wRQIhALrcE92kDAxVsGOHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65jrCIhs7aSnW60u7Azakk14h6QUZw_nHlQ====
    >>> print(func)
    # <generic object output that translates to `lambda s: ''.join(s[i] for i in cache_spec)`>
    >>> print(func(s))  # func() is the extracted signature function
    AOq0QJ8wRQIhALrcE92kDAxVsGmHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65jrCIhs7aSnW60u7Azakk14h6QUZw_nHlQ==
And this is its decoded hex value:
    $ echo AOq0QJ8wRQIhALrcE92kDAxVsGmHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65jrCIhs7aSnW60u7Azakk14h6QUZw_nHlQ== | base64 -i -d >/tmp/file.txt
    base64: invalid input
    $ od -x </tmp/file.txt
    0000000 ea00 40b4 309f 0245 0021 dcba dd13 0ca4
    0000020 550c 69b0 5987 d7c6 1f5b 8122 566c 5679
    0000040 c7ad 5c14 8ed9 7b2d 8045 1888 a897 d250
    0000060 5ef2 8eb9 88b0 ce86 4ada ba75 eed2 cdc0
    0000100 24a9 88d7 417a 7046 799c 0050
    0000113
These outputs are not identical. See below for why not.
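A quick way to see exactly where the two strings differ is to compare them character by character. The snippet below is just a checking aid using the two values printed in this post; the variable names encrypted, decrypted and differing are my own.

```python
# The two signature strings printed in the debugger session.
encrypted = "mAq0QJ8wRQIhALrcE92kDAxVsGOHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65jrCIhs7aSnW60u7Azakk14h6QUZw_nHlQ===="
decrypted = "AOq0QJ8wRQIhALrcE92kDAxVsGmHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65jrCIhs7aSnW60u7Azakk14h6QUZw_nHlQ=="

# zip() stops at the shorter string, so only the common prefix is compared.
differing = [i for i, (a, b) in enumerate(zip(encrypted, decrypted)) if a != b]

print(len(encrypted), len(decrypted))
print(differing)
```

Only a handful of positions near the start differ, which is consistent with the hex dumps: the first word and one word in the second row change, and everything else is the same.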
Now, when I ran the program again with the same video, the signatures changed right in front of me. This gives me the impression that these signatures handle DRM-related things. I'm not too sure yet, and I'd have to explore deeper to see exactly what it does. But here is a screenshot of the variables created in self._extract_signature_function() anyway, with the new signatures.
The first thing we can see is that it constructs a function ID from the player_id (which is not the video ID), and something called a signature cache ID. Here in this example, it made a function ID of 'js_408be03a_106' by putting together:
    player type - js ==================+
                                       |
    ID of player_ias page - 408be03a ==+==> function ID - js_408be03a_106
                                       |
    signature cache ID - 106 ==========+
player_type="js" represents the HTML5 player. There is also player_type="swf" for the phased-out Flash player.
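Putting the three parts together amounts to simple string formatting. A minimal sketch, assuming the parts are joined with underscores as the example value suggests; make_func_id is a hypothetical helper name, not a youtube-dl function.

```python
def make_func_id(player_type, player_id, signature_cache_id):
    """Join the three parts with underscores (hypothetical helper)."""
    return "%s_%s_%s" % (player_type, player_id, signature_cache_id)


# Values taken from the debugging session in this post.
func_id = make_func_id("js", "408be03a", 106)
print(func_id)  # js_408be03a_106
```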
Alright, so then what happens? From the function ID, it fetches a scrambled list of indices from its cache and stores it in the variable cache_spec. cache_spec will be used to scramble the bytes in the signature according to the position of the indices. In our example, this is what our list of indices looks like:
    >>> print(cache_spec)
    [1, 26, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 0, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103]
It can be seen that the resulting signature is 104 bytes long (the two extra trailing = characters of the 106-character input are dropped), and that positions 0, 1, and 26 (1st, 2nd and 27th bytes) are the only indices scrambled. Each output position i takes the input byte at index cache_spec[i], so the three bytes rotate: the 2nd byte is now the 1st, the 27th is now the 2nd, and the 1st is now the 27th. All other bytes are unchanged.
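The reordering can be checked against the values printed earlier in this post. The sketch below rebuilds the same index list programmatically and applies the lambda shown in the debugger output; the function name apply_cache_spec and the variable result are my own.

```python
# Encrypted signature from the debugger session in this post.
s = "mAq0QJ8wRQIhALrcE92kDAxVsGOHWcbXWx8igWxWeV_atxxRc2Y4te0WAiBiXqFDS8l65jrCIhs7aSnW60u7Azakk14h6QUZw_nHlQ===="

# Same list as printed above: [1, 26, 2, 3, ..., 25, 0, 27, ..., 103].
cache_spec = [1, 26] + list(range(2, 26)) + [0] + list(range(27, 104))


def apply_cache_spec(sig, spec):
    """Output position i takes the input character at index spec[i]."""
    return "".join(sig[i] for i in spec)


result = apply_cache_spec(s, cache_spec)
print(result)

# The three moved characters: 2nd -> 1st, 27th -> 2nd, 1st -> 27th.
print(result[0] == s[1], result[1] == s[26], result[26] == s[0])
```

Because cache_spec only lists indices 0 through 103, anything past the 104th character of the input never makes it into the output.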
At this point there are many unanswered questions. Where does the signature cache ID come from, and how is it fetched from the cache? What does the ID of the player_ias page really represent? And if this function was supposed to decrypt signatures, then how come it only scrambled a few bytes? Shouldn't it have performed full AES256 decryption instead?
I don't have answers for these questions yet so all I can say is, I have to keep digging. For now though, you can get your mind off of _player_cache and see some of the simpler YoutubeIE methods below.
YoutubeIE has a fairly large number of methods that print diagnostics to the screen. Do you recognize any of them? 🤔
    def report_video_info_webpage_download(self, video_id):
        """Report attempt to download video info webpage."""
        self.to_screen('%s: Downloading video info webpage' % video_id)

    def report_information_extraction(self, video_id):
        """Report attempt to extract video information."""
        self.to_screen('%s: Extracting video information' % video_id)

    def report_unavailable_format(self, video_id, format):
        """Report extracted video URL."""
        self.to_screen('%s: Format %s not available' % (video_id, format))

    def report_rtmp_download(self):
        """Indicate the download will use the RTMP protocol."""
        self.to_screen('RTMP download detected')
That's it for now folks. Thanks for reading, even if I kind of left you at a cliffhanger regarding the player cache. Hopefully I can get more info about that.
Of course, if you see any errors in this post, you know what to do - drop a note down in the comments so I can correct them.