Disclaimer
This project is for practice purposes only. The full source code will not be shared, to avoid abuse.
Introduction
As an experiment, I want to build a Netflix clone based on automated torrent downloading.
Requirements
To build such a system, I need a web platform able to stream videos across multiple devices, as Netflix does. Fortunately, Plex already does this job in a pretty awesome way. Therefore, I just have to build software able to crawl torrent websites in order to find and download the films and series I want to watch.
The idea is to have a web interface that asks me for the film I want to watch. If I already have it on my hard drive, it opens Plex. If not, it triggers an automated torrent download and moves the film or series into my Plex media folder.
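The "do I already have it?" check could look something like the sketch below. It is only an illustration under assumptions of mine: the function name find_local_film, the list of video extensions, and the rule "every word of the query appears in the file name" are all hypothetical, not taken from the actual project.

```python
from pathlib import Path
from typing import Optional

# Assumed set of video extensions; adjust to whatever the media folder contains.
VIDEO_EXTENSIONS = {".mkv", ".mp4", ".avi"}


def find_local_film(media_dir: Path, query: str) -> Optional[Path]:
    """Return the first video file whose name contains every word of the query."""
    wanted = query.lower().split()
    if not wanted:
        return None
    # Walk the media folder recursively, in a stable order.
    for path in sorted(media_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS:
            name = path.stem.lower()
            if all(word in name for word in wanted):
                return path
    return None
```

If the function returns a path, the interface can redirect to Plex; if it returns None, the download step is triggered instead.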
Technical stack
For this project, I will use Python. As I haven't really worked with it yet, it will be a great introduction to the language.
Steps
Step 1 : Getting the web page
Running a search on the website produces a URL with a predictable structure. For instance, when searching for the Avengers film, the URL looks like this: https://xxxxxxx/search/avengers/1/99/200
To scrape the page, I am using the BeautifulSoup4 Python module together with the requests module.
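import requests
from bs4 import BeautifulSoup
from urllib.parse import quote

# Parser is the project's own base class (not shown here).
class XXXParser(Parser):
    def __init__(self):
        super().__init__()
        self.base_url = "https://xxx"

    def __build_url(self, film_name: str) -> str:
        # Percent-encode the query so spaces and special characters stay valid in the URL path
        return self.base_url + "/search/" + quote(film_name) + "/1/99/200"

    def __get_page_content(self, film_name: str) -> BeautifulSoup:
        response = requests.get(self.__build_url(film_name), timeout=10)
        # Fail early on HTTP errors instead of parsing an error page
        response.raise_for_status()
        return BeautifulSoup(response.text, "html.parser")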
The website I am scraping is structured this way:
<tr>
  <td class="vertTh">
    <center>
      <a href="https://.../browse/200" title="More from this category">Video</a><br>
      (<a href="https://...y/browse/207" title="More from this category">HD - Movies</a>)
    </center>
  </td>
  <td>
    <div class="detName">
      <a href="https://.../torrent/34281763/Avengers.Endgame.2019.1080p.BRRip.x264-MP4"
         class="detLink">Avengers.Endgame.2019.1080p.BRRip.x264-MP4</a>
    </div>
    <a href="magnet:?...">
      <img src="https://.../static/img/icon-magnet.gif" alt="Magnet link" width="12" height="12">
    </a>
    <a href="https://.../user/..."><img src="https://.../static/img/trusted.png" alt="Trusted" title="Trusted" style="width:11px;" width="11" height="11" border="0"></a>
  </td>
  <td align="right">1803</td>
  <td align="right">383</td>
</tr>
After receiving the web page as a BeautifulSoup result object, I can start filtering the HTML tags to retrieve the information I am looking for :
- Title
- Download URL
- Number of seeders
- Trusted uploader
- Video quality
# Excerpt from a method of XXXParser; requires "import re" at module level.
page_content = self.__get_page_content(film_name)
rows = page_content.select("tr")
for row in rows:
    if row.select_one(".vertTh") is None:
        # This is not a table row containing a film
        continue
    link = row.select_one(".detLink")
    title = link.text
    film_url = link.attrs.get("href")
    # Use the <td> cells: row.contents also holds whitespace text nodes
    cells = row.find_all("td")
    seeders = int(cells[-2].text)
    leechers = int(cells[-1].text)
    # True when the "Trusted" badge image is present in the row
    trusted = row.select_one("img[alt=Trusted]") is not None
    # Match 3- or 4-digit qualities such as 720p or 1080p
    quality = re.search(r"\d{3,4}p", title)
Step 2 : Filtering results
One problem with torrents is their unintelligible names. Many of them look like Avengers.Infinity.War.2018.1080p.10bit.BluRay.8CH.x265.HEVC-PSA, which makes filtering the data harder.
So, I need to identify which parts to remove in order to clean up the titles.
- Replace dots with spaces
- Remove the quality using a regex ( \d{3,4}p )
- Remove the tags "DVDrip", "HDrip", etc. using a regex ( \w{2,3}rip )
- Remove keywords repeated across titles: blueray, bluray, HEVC, AAC, ACC, PSA, MP4...
- Remove encoding tags with a regex ( (x|h)\d+ )
- Remove the useless "The" at the beginning of titles
I now have more natural results :
avengers
avengers endgame
avengers endgame (2019)
avengers infinity war
avengers infinity war 2018 english
avengers age of ultron (2015)
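The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the function name clean_title, the NOISE_KEYWORDS set (built only from the keywords named above), the lowercasing, and the choice to also split on hyphens are my assumptions.

```python
import re

# Assumed keyword list, limited to the examples named in the text.
NOISE_KEYWORDS = {"blueray", "bluray", "hevc", "aac", "acc", "psa", "mp4"}


def clean_title(raw: str) -> str:
    """Turn a raw torrent name into a more natural, lowercase title."""
    title = raw.lower().replace(".", " ")           # dots become spaces
    title = re.sub(r"\b\d{3,4}p\b", " ", title)     # quality: 720p, 1080p
    title = re.sub(r"\b\w{2,3}rip\b", " ", title)   # dvdrip, hdrip, brrip
    title = re.sub(r"\b[xh]\d+\b", " ", title)      # encoding tags: x264, h265
    # Hyphens often glue tags together ("HEVC-PSA"), so split on them too.
    words = [w for w in re.split(r"[\s\-]+", title) if w and w not in NOISE_KEYWORDS]
    if words and words[0] == "the":                 # leading "The"
        words = words[1:]
    return " ".join(words)


print(clean_title("Avengers.Endgame.2019.1080p.BRRip.x264-MP4"))  # avengers endgame 2019
```

Tags that are not covered by the rules or the keyword list (for example "10bit" or "8CH") survive this sketch, so the keyword list would need to grow as new patterns show up.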
Step 3 : Sorting results
I don't want to spend time sifting through the results myself to find the best one; I want it to be automated. That's why I need to give each result a score based on several criteria.
By default, every result has a score of 0.
Levenshtein distance
This one is the most important of all scoring methods.
The Levenshtein distance counts the number of single-character edits (insertions, deletions, substitutions) needed to go from a string A to a string B. In my case, I want the Levenshtein distance between my query and a title to be as low as possible. Thanks to the title cleaning done in the previous step, film titles already look pretty natural.
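For reference, here is a small pure-Python implementation of the Levenshtein distance using the classic row-by-row dynamic programming approach. The article does not say which implementation or library the project uses, so treat this as one possible version.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("avengers", "avengers endgame"))  # 8
```

A query of "avengers" is 8 edits away from "avengers endgame" (the 8 characters of " endgame" have to be added) and 0 edits away from "avengers", so the shorter title would rank better on this criterion.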
Seeders
As I want my film to download as fast as possible, I favor the results with the most seeders. To keep the seeder count from dominating the score, I use the square root function, whose output grows more slowly as its input increases.
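A sketch of the seeder bonus, assuming the square root is applied directly to the seeder count with no extra scaling factor (the article does not say whether one is used). The function name seeders_score is hypothetical.

```python
import math


def seeders_score(seeders: int) -> float:
    """Score bonus that grows with the seeder count, but more and more slowly."""
    # Guard against negative values coming from a badly parsed page.
    return math.sqrt(max(seeders, 0))


print(seeders_score(100))  # 10.0
print(seeders_score(400))  # 20.0
```

With this shape, four times as many seeders only doubles the bonus, so a very popular torrent cannot completely drown out the other criteria.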
Language
As a French speaker, I prefer watching movies in French. If the title contains the keyword "french", its score is increased by 1. However, if it only contains the keyword "fr", its score is increased by 0.5, because I am less sure it is a language tag.
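The language bonus might be written like this. One assumption of mine: keywords are matched as whole words, so that "fr" does not fire on titles such as "frozen". The function name language_score is hypothetical.

```python
import re


def language_score(title: str) -> float:
    """Return 1.0 for a "french" tag, 0.5 for a bare "fr" tag, 0.0 otherwise."""
    # Split the title into lowercase alphabetic words.
    words = re.findall(r"[a-z]+", title.lower())
    if "french" in words:
        return 1.0
    if "fr" in words:
        return 0.5
    return 0.0
```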
Quality
Quality is also an important criterion. If the title contains a quality greater than or equal to 1080p, the film's score increases by 1 point. If the quality is lower, the bonus shrinks accordingly (720p => 0.5, 480p => 0.25...)
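Here is one way to express the quality bonus, using only the three values given above. What happens below 480p, or when the title has no quality tag at all, is not specified in the text; returning 0 in both cases is purely my assumption, as is the function name quality_score.

```python
import re


def quality_score(title: str) -> float:
    """Map the quality tag found in a title to a score bonus."""
    match = re.search(r"(\d{3,4})p", title.lower())
    if match is None:
        return 0.0  # assumption: no quality tag means no bonus
    height = int(match.group(1))
    if height >= 1080:
        return 1.0
    if height >= 720:
        return 0.5
    if height >= 480:
        return 0.25
    return 0.0  # assumption: tiers below 480p are not described in the text
```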
Trusted uploader
The website I am scraping can reward users with a "Trusted" tag. This tag assures me of good quality and accurate content. A film uploaded by a trusted uploader has its score increased by 1.
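As an illustration, the trusted bonus and the final ranking could be sketched as below. The Result class and the function names are hypothetical; only the "score starts at 0" and "+1 for a trusted uploader" rules come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    title: str
    trusted: bool
    score: float = 0.0  # every result starts with a score of 0


def apply_trusted_bonus(result: Result) -> Result:
    """Add 1 to the score when the uploader carries the "Trusted" tag."""
    if result.trusted:
        result.score += 1.0
    return result


def rank(results: List[Result]) -> List[Result]:
    """Best score first."""
    return sorted(results, key=lambda r: r.score, reverse=True)
```

The other bonuses would be added to the same score field before calling rank, so the first element of the returned list is the torrent to download.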
Step 4 : Automating the download
To be continued...
Thanks for reading, and remember to stay awesome!