Dmitriy Zub ☀️

Originally published at serpapi.com

Scrape Google Books in Python

What will be scraped

image

Prerequisites

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which also makes them useful for extracting data from matching tags and attributes.

If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping. It covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

Separate virtual environment

In short, a virtual environment is an independent set of installed libraries, possibly with a different Python version, that can coexist with other environments on the same system, which prevents library or Python version conflicts.

If you haven't worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry to get familiar.

📌Note: this is not a strict requirement for this blog post.

Install libraries:

pip install requests parsel

Reduce the chance of being blocked

There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web-scraping, which covers eleven methods to bypass blocks from most websites.


Full Code

from parsel import Selector
import requests, json, re

params = {
    "q": "richard branson",
    "tbm": "bks",
    "gl": "us",
    "hl": "en"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
selector = Selector(text=html.text)

books_results = []

# https://regex101.com/r/mapBs4/1
book_thumbnails = re.findall(r"s=\\'data:image/jpg;base64,(.*?)\\'", str(selector.css("script").getall()), re.DOTALL)

for book_thumbnail, book_result in zip(book_thumbnails, selector.css(".Yr5TG")):
    title = book_result.css(".DKV0Md::text").get()
    link = book_result.css(".bHexk a::attr(href)").get()
    displayed_link = book_result.css(".tjvcx::text").get()
    snippet = book_result.css(".cmlJmd span::text").get()
    author = book_result.css(".fl span::text").get()
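    author_href = book_result.css(".N96wpd .fl::attr(href)").get()
    # the href already starts with "/search", so prepend only the domain; keep None when there is no author link
    author_link = f"https://www.google.com{author_href}" if author_href is not None else None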
    date_published = book_result.css(".fl+ span::text").get()
    preview_link = book_result.css(".R1n8Q a.yKioRe:nth-child(1)::attr(href)").get()
    more_editions_link = book_result.css(".R1n8Q a.yKioRe:nth-child(2)::attr(href)").get()

    books_results.append({
        "title": title,
        "link": link,
        "displayed_link": displayed_link,
        "snippet": snippet,
        "author": author,
        "author_link": author_link,
        "date_published": date_published,
        "preview_link": preview_link,
        "more_editions_link": f"https://www.google.com{more_editions_link}" if more_editions_link is not None else None,
        "thumbnail": bytes(bytes(book_thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
    })


print(json.dumps(books_results, indent=2))

Import libraries:

from parsel import Selector
import requests, json, re
  • parsel is a library to extract and remove data from HTML and XML using XPath and CSS selectors. It's similar to beautifulsoup4 except it supports full XPath and has its own CSS pseudo-elements support, for example ::text or ::attr(<attribute_name>).

Create search query parameters and request headers:

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "richard branson",  # search query
    "tbm": "bks",            # book results
    "gl": "us",              # country to search from
    "hl": "en"               # language
}

# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
  • The User-Agent header makes the request look like it comes from a real browser, so the website is less likely to treat it as a bot or script. It's the most basic way to avoid being blocked by a website.
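
If you're curious what the `params` dict turns into, the sketch below builds roughly the same query string with the standard library only. It's an illustration of the URL encoding, not what `requests` literally runs under the hood.

```python
from urllib.parse import urlencode

params = {
    "q": "richard branson",  # search query
    "tbm": "bks",            # book results
    "gl": "us",              # country to search from
    "hl": "en",              # language
}

# urlencode() joins the pairs with "&" and encodes the space as "+".
url = f"https://www.google.com/search?{urlencode(params)}"
print(url)
```

Pasting the printed URL into a browser is a quick way to check that the parameters return the page you expect before scraping it.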

Pass the query parameters and request headers to the request, then create a Selector object:

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
selector = Selector(text=html.text)

Create a temporary list to store the data:

books_results = []

Match thumbnail data using a regular expression:

# https://regex101.com/r/mapBs4/1
book_thumbnails = re.findall(r"s=\\'data:image/jpg;base64,(.*?)\\'", str(selector.css("script").getall()), re.DOTALL)

We parse the thumbnails from the <script> tags because the src attribute of each <img> holds only a 1x1 placeholder instead of the thumbnail.

  • re.findall() returns a list of all matches.
  • selector.css("script") returns a SelectorList of all <script> tags found, and getall() returns their extracted contents as a list of strings.
  • re.DOTALL makes . match any character, including a newline. Without this flag, . matches any character except a newline.
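
Here's a small sketch of what re.DOTALL changes, using a made-up two-line string rather than real page data:

```python
import re

# Made-up text with a newline inside the part we want to capture.
text = "start<<line one\nline two>>end"

# Without the flag, "." stops at the newline, so nothing matches.
without_flag = re.findall(r"<<(.*?)>>", text)

# With re.DOTALL, "." also matches the newline.
with_flag = re.findall(r"<<(.*?)>>", text, re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['line one\nline two']
```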

Iterate over matched thumbnails and CSS container with all the needed data and extract it:

for book_thumbnail, book_result in zip(book_thumbnails, selector.css(".Yr5TG")):
    title = book_result.css(".DKV0Md::text").get()
    link = book_result.css(".bHexk a::attr(href)").get()
    displayed_link = book_result.css(".tjvcx::text").get()
    snippet = book_result.css(".cmlJmd span::text").get()
    author = book_result.css(".fl span::text").get()
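    author_href = book_result.css(".N96wpd .fl::attr(href)").get()
    # the href already starts with "/search", so prepend only the domain; keep None when there is no author link
    author_link = f"https://www.google.com{author_href}" if author_href is not None else None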
    date_published = book_result.css(".fl+ span::text").get()
    preview_link = book_result.css(".R1n8Q a.yKioRe:nth-child(1)::attr(href)").get()
    more_editions_link = book_result.css(".R1n8Q a.yKioRe:nth-child(2)::attr(href)").get()
  • zip() iterates over several iterables in parallel and yields a tuple with one item from each. It stops at the shortest iterable.
  • css(".Yr5TG") is like calling soup.select(".Yr5TG") with bs4, which will return a list of matches.
  • css(".DKV0Md::text"): the ::text pseudo-element (a parsel extension, not standard CSS) selects the element's text, and get() returns the first match as a string, or None if nothing matched. Without get() you'll get a <class 'SelectorList'> instance instead of the data itself.
  • ::attr(href) is a similar parsel pseudo-element that grabs the value of an attribute.
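
One thing to keep in mind with zip(): if the number of matched thumbnails differs from the number of result containers, the extra items are dropped silently. The sketch below uses placeholder strings to show this, with itertools.zip_longest as one possible alternative. This is my suggestion, not part of the original script.

```python
from itertools import zip_longest

# Placeholder values standing in for matched thumbnails and result containers.
thumbnails = ["thumb_a", "thumb_b"]
results = ["book 1", "book 2", "book 3"]

# zip() stops at the shortest iterable, so "book 3" is dropped.
pairs = list(zip(thumbnails, results))

# zip_longest() keeps every result and fills the gap with None.
padded = list(zip_longest(thumbnails, results, fillvalue=None))

print(pairs)
print(padded)
```

Note that zip_longest() only preserves the count. It can't tell which book is the one without a thumbnail, so the pairing may still be shifted.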

Append the data to the temporary list as a dict:

books_results.append({
    "title": title,
    "link": link,
    "displayed_link": displayed_link,
    "snippet": snippet,
    "author": author,
    "author_link": author_link,
    "date_published": date_published,
    "preview_link": preview_link,
    # prepend "https://www.google.com" only when a URL is present; otherwise keep None instead of "https://www.google.comNone"
    "more_editions_link": f"https://www.google.com{more_editions_link}" if more_editions_link is not None else None,
    "thumbnail": bytes(bytes(book_thumbnail, "ascii").decode("unicode-escape"), "ascii").decode("unicode-escape")
})
  • bytes().decode("unicode-escape") decodes escape sequences such as \x3d. We have to do it twice, most likely because the data is escaped twice: once in the JavaScript source, and again when str() renders the list of <script> contents, which doubles every backslash.
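
The double decoding can be reproduced without a network request. The sketch below uses a made-up <script> body and assumes that "=" is written there as the JavaScript escape \x3d. Treat that as an assumption about the page, not a guarantee.

```python
import re

# Hypothetical <script> body. Assumption: "=" appears as the JavaScript escape \x3d.
script_body = r"""var s='data:image/jpg;base64,QUJD\x3d';var ii=["dimg_1"];"""

# str() on a list uses repr(), which doubles the backslash and escapes the single quotes.
as_text = str([script_body])

# Same pattern as in the script above.
match = re.findall(r"s=\\'data:image/jpg;base64,(.*?)\\'", as_text, re.DOTALL)[0]

# First pass undoes the doubling added by repr(), second pass decodes \x3d itself.
decoded_once = bytes(match, "ascii").decode("unicode-escape")
decoded_twice = bytes(decoded_once, "ascii").decode("unicode-escape")

print(match)          # QUJD\\x3d
print(decoded_once)   # QUJD\x3d
print(decoded_twice)  # QUJD=
```

If that assumption holds, joining the <script> contents into a single string instead of calling str() on the list would probably need only one decoding pass, but the regular expression would have to change as well.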

Print the data:

print(json.dumps(books_results, indent=2))

Part of the JSON output:

[
  {
    "title": "The Virgin Way: How to Listen, Learn, Laugh and Lead",
    "link": "https://books.google.com/books?id=Jkp1AgAAQBAJ&printsec=frontcover&dq=richard+branson&hl=en&newbks=1&newbks_redir=1&sa=X&ved=2ahUKEwin3IrX-_n1AhXclmoFHbMHDfIQ6AF6BAgIEAI",
    "displayed_link": "books.google.com",
    "snippet": "This is not a conventional book on leadership. There are no rules \u2013 but rather the secrets of leadership that he has learned along the way from his days at Virgin Records, to his recent work with The Elders.",
    "author": "Sir Richard Branson",
    "author_link": "https://www.google.com/search?gl=us&hl=en&tbm=bks&tbm=bks&q=inauthor:%22Sir+Richard+Branson%22&sa=X&ved=2ahUKEwin3IrX-_n1AhXclmoFHbMHDfIQ9Ah6BAgIEAU",
    "date_published": "2014",
    "preview_link": "https://books.google.com/books?id=Jkp1AgAAQBAJ&printsec=frontcover&dq=richard+branson&hl=en&newbks=1&newbks_redir=1&sa=X&ved=2ahUKEwin3IrX-_n1AhXclmoFHbMHDfIQuwV6BAgIEAc",
    "more_editions_link": "https://www.google.com/books/edition/The_Virgin_Way/Jkp1AgAAQBAJ?hl=en&gl=us&kptab=editions&sa=X&ved=2ahUKEwin3IrX-_n1AhXclmoFHbMHDfIQmBZ6BAgIEAg",
    "thumbnail": ""
  }, ... other results
]
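
The \u2013 in the snippet is not an error: json.dumps() escapes non-ASCII characters by default. A minimal sketch with a shortened snippet shows how ensure_ascii=False keeps them readable:

```python
import json

# Shortened snippet containing an en dash (U+2013).
record = {"snippet": "There are no rules \u2013 but rather the secrets of leadership"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record, indent=2)

# ensure_ascii=False writes the characters as they are.
readable = json.dumps(record, indent=2, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings load back to the same data with json.loads(), so this only affects how the output looks.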

Outro

If you have anything to share, any questions, suggestions, or something that isn't working correctly, reach out via Twitter at @dimitryzub, or @serp_api.

Yours,
Dmitriy, and the rest of SerpApi Team.


