The web hosts many kinds of content: images, video, text, audio and more, and Python makes it straightforward to download that data. The program below downloads images from the search engines Google and Baidu, chosen because both index enormous image archives.
The script prompts for a keyword and then scrapes the matching images:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import os
import re

import requests
from bs4 import BeautifulSoup


def download_baidu(keyword):
    # Baidu's flip-search page embeds image links as "objURL" fields.
    url = ('https://image.baidu.com/search/flip?tn=baiduimage'
           '&ie=utf-8&word=' + keyword + '&ct=201326592&v=flip')
    html = requests.get(url).text
    pic_urls = re.findall('"objURL":"(.*?)",', html, re.S)
    for i, each in enumerate(pic_urls):
        print(each)
        try:
            pic = requests.get(each, timeout=10)
        except requests.exceptions.RequestException:
            print('download failed:', each)
            continue
        # Save each image to the current directory as pictures<keyword>_<i>.jpg
        filename = 'pictures' + keyword + '_' + str(i) + '.jpg'
        with open(filename, 'wb') as fp:
            fp.write(pic.content)


def download_google(word):
    url = ('https://www.google.com/search?q=' + word +
           '&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X'
           '&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982')
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    for raw_img in soup.find_all('img'):
        link = raw_img.get('src')
        # Skip missing or inline (data:) sources before handing the URL to wget.
        if link and link.startswith('http'):
            os.system('wget ' + link)


if __name__ == '__main__':
    word = input('Input key word: ')
    download_baidu(word)
    # download_google(word)
This downloads the images into the current directory. The main downside of this implementation is that it fetches one image at a time; threading would speed up the download process considerably, so if you want to download a large number of images, a threaded version is worth the extra effort.
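As a rough sketch of that idea, the download loop can be handed to a thread pool so several images are fetched concurrently. The helper names below (`fetch`, `download_all`) are illustrative, not part of the script above:

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch(url, timeout=10):
    """Download one URL; return its bytes, or None on a network error."""
    try:
        return requests.get(url, timeout=timeout).content
    except requests.exceptions.RequestException:
        return None


def download_all(urls, workers=8, fetch=fetch):
    """Fetch many URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

The list of `objURL` matches from the Baidu scraper could be passed to `download_all`, and the returned byte strings written out with the same filename scheme as before.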
Either way, it is interesting to compare the search results that Google and Baidu return for the same keyword.