The web has many different types of content: images, video, text, audio and more. You can use Python to download data from the web.
The program below downloads image from search engines Google and Baidu.
Why these? Because they are large image archives.
It can be called like this with the keyword:
Code to scrape images:
#!/usr/bin/python3 #-*- coding:utf-8 -*- import re import requests from bs4 import BeautifulSoup from urllib.parse import urlparse import os def download_baidu(keyword): url = 'https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word='+word+'&ct=201326592&v=flip' result = requests.get(url) html = result.text pic_url = re.findall('"objURL":"(.*?)",',html,re.S) i = 0 for each in pic_url: print(pic_url) try: pic= requests.get(each, timeout=10) except requests.exceptions.ConnectionError: print ('exception') continue string = 'pictures'+keyword+'_'+str(i) + '.jpg' fp = open(string,'wb') fp.write(pic.content) fp.close() i += 1 def download_google(word): url = 'https://www.google.com/search?q=' + word + '&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982' page = requests.get(url).text soup = BeautifulSoup(page, 'html.parser') for raw_img in soup.find_all('img'): link = raw_img.get('src') os.system("wget " + link) if __name__ == '__main__': word = input("Input key word: ") download_baidu(word) #download_google(word)
This downloads the images into the same directory. The downside of this implementation is that it does not use threading. Threading speeds up the download process.
If you want to download lots of images, you should use threading. Speed with threading, ok maybe not this fast.
Either-way it's interesting to compare search results for Google and Baidu.
Top comments (0)