We can get the top 50 websites by traffic from these two site traffic monitoring services:
- Alexa: https://www.alexa.com/
- SimilarWeb: https://www.similarweb.com/
Their top-site lists are published here:
- Alexa: https://www.alexa.com/topsites
- SimilarWeb: https://www.similarweb.com/top-websites/
With the Python requests and BeautifulSoup modules, we can automatically list the top 50 websites from these two monitoring services.
First, we create a dict that stores the URL of each service and the selector to use with BeautifulSoup's select:
webRankSites = {
    "Alexa": {
        "url": "https://www.alexa.com/topsites/",
        "selector": "div.DescriptionCell"
    },
    "SimilarWeb": {
        "url": "https://www.similarweb.com/top-websites/",
        "selector": "td.topRankingGrid-cell.topWebsitesGrid-cellWebsite.showInMobile"
    }
}
How do we define the selector? We need to inspect the page content of each service and find where the site list lives:
Alexa:
As the developer tools show, each website name is in a div element with the class DescriptionCell, so the selector is "div.DescriptionCell".
SimilarWeb:
Each website name is in a td element with three classes: topRankingGrid-cell, topWebsitesGrid-cellWebsite, and showInMobile. The selector is "td.topRankingGrid-cell.topWebsitesGrid-cellWebsite.showInMobile".
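To see how these two selectors behave, here is a minimal sketch that runs them against a small hand-written HTML fragment. The markup is made up for illustration: it only imitates the class names seen in the developer tools and is not the real page source of either service.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the class names described above.
sample_html = """
<div class="tr site-listing">
  <div class="td">1</div>
  <div class="td DescriptionCell"><p><a href="/siteinfo/google.com">Google.com</a></p></div>
</div>
<table>
  <tr>
    <td class="topRankingGrid-cell topWebsitesGrid-cellWebsite showInMobile">
      <a href="/website/google.com">google.com</a>
    </td>
    <td class="topRankingGrid-cell showInMobile">Search Engines</td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "div.DescriptionCell" matches any <div> whose class list contains DescriptionCell.
alexa_like = [el.text.strip() for el in soup.select("div.DescriptionCell")]

# Chained classes mean the element must carry all three classes at once,
# so the second <td> (which lacks topWebsitesGrid-cellWebsite) is skipped.
similarweb_like = [
    el.text.strip()
    for el in soup.select("td.topRankingGrid-cell.topWebsitesGrid-cellWebsite.showInMobile")
]

print(alexa_like)
print(similarweb_like)
```

The order of the chained classes in the selector does not matter; what matters is that the element has all of them.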
Second, we fetch each URL with requests.get and use the BeautifulSoup selector to extract the website list (myheaders is needed for the SimilarWeb service, since a request without a user-agent gets a 403 response status code):
import requests
from bs4 import BeautifulSoup

myheaders = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0(HTTP_USER_AGENT)"}
for site in webRankSites:
    print("site: " + site)
    # Download the ranking page and parse it
    resp = requests.get(webRankSites[site]["url"], headers=myheaders)
    soup = BeautifulSoup(resp.text, 'html.parser')
    items = soup.select(webRankSites[site]["selector"])
    # Print a numbered list of the matched site names
    for i, item in enumerate(items, start=1):
        print(str(i) + ". " + item.text.strip())
Then we can get the result:
site: Alexa
1. Google.com
2. Youtube.com
3. Tmall.com
...
site: SimilarWeb
1. google.com
2. youtube.com
3. facebook.com
...
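The loop above prints whatever comes back, even when the request is rejected. As a rough sketch (not the code behind the demo), the fetching and the parsing can be split into two small functions so that an HTTP error such as the 403 mentioned earlier is raised instead of silently producing an empty list. The function names extract_sites and fetch_top_sites, the limit parameter, and the 10-second timeout are my own choices for illustration.

```python
import requests
from bs4 import BeautifulSoup


def extract_sites(html, selector, limit=50):
    """Return the stripped text of the first `limit` elements matching `selector`."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.text.strip() for el in soup.select(selector)][:limit]


def fetch_top_sites(url, selector, headers=None, limit=50):
    """Download `url` and extract the site names; raise on HTTP errors."""
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx status such as 403
    return extract_sites(resp.text, selector, limit)
```

Inside the loop, fetch_top_sites(webRankSites[site]["url"], webRankSites[site]["selector"], headers=myheaders) would then either return the list or fail loudly. If it returns an empty list with a successful status, the selector probably no longer matches what the server sent.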
Wow, the result can be fetched automatically, and it looks great. Want to try it? Check out this demo:
https://repl.it/@timhuangt/GlobalTopSite
And enjoy it! Happy coding!!
Top comments (2)
SimilarWeb now responds to the request from this code without the ranking list, so the SimilarWeb list no longer appears. I've checked resp.text, and I have no idea how to work around this. Any ideas?
Someone suggested adding a header with
"authority": "similarweb.com"
which helps to get the result from similarweb.com. (I have already updated the code: repl.it/@timhuangt/GlobalTopSite )
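For reference, the suggestion above could look roughly like the sketch below. It only prepares the request and inspects its headers without sending anything, so it shows where the extra header goes, not whether SimilarWeb still accepts it. The user-agent value and the URL are the ones from the post; preparing the request without sending it is just an illustrative way to look at the headers.

```python
import requests

myheaders = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0(HTTP_USER_AGENT)",
    # Extra header suggested in the comment above.
    "authority": "similarweb.com",
}

# Build the request without sending it, to inspect the headers we set.
prepared = requests.Request(
    "GET", "https://www.similarweb.com/top-websites/", headers=myheaders
).prepare()

print(prepared.headers["authority"])
```

In the original loop, nothing else needs to change: the same myheaders dict is passed to requests.get as before.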