How to Scrape Google Search Results With Python - Best Google Search Scraper?

What is Google SERP?

Whenever web scraping Google search results is discussed, you will most likely come across the abbreviation "SERP". SERP stands for Search Engine Results Page. It's the page you get after entering a query in the search bar.

In the past, Google returned a list of links for your query. Today, it looks completely different - SERPs include a variety of features and elements that make your search experience fast and convenient.
Typically, the page consists of:

  • Organic Search Results
  • Paid Search Results
  • Featured Snippets
  • Knowledge Graph
  • Other Elements: maps, images, or news stories that appear depending on the query.

Is It Legal to Scrape Google Search Results?

Before scraping Google search results, it’s essential to understand the legal implications. Google’s Terms of Service prohibit scraping their search results, as stated in their policies:

"You shall not scrape, crawl, or use any automated means to access the Services for any purpose."

Violating these terms could result in IP bans or even legal action from Google. However, the legality of scraping depends on the jurisdiction, the data you're scraping, and how you're using it.

Alternatives to Scraping Google:

  • Google Custom Search API: Google offers an official API to retrieve search results, providing a legal and structured way to access data without violating their policies (a minimal sketch follows this list).
  • Other Search APIs: If you're not set on using Google, other search engines and services provide APIs for accessing search results, such as Bing and Scrapeless.
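
For example, Google's Custom Search JSON API can be called with a plain HTTP request. Here is a minimal sketch; the API key and search engine ID (cx) are placeholders that you would create in the Google Cloud console and the Programmable Search Engine panel.

import requests

# Placeholders: create these in the Google Cloud console and the
# Programmable Search Engine control panel
API_KEY = "YOUR_API_KEY"
CX = "YOUR_SEARCH_ENGINE_ID"

params = {
    "key": API_KEY,
    "cx": CX,
    "q": "how to scrape google search results with python",
}
response = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
response.raise_for_status()

# Each item carries the title, link, and snippet of one result
for item in response.json().get("items", []):
    print(item["title"], item["link"])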

4 Main Difficulties to Scrape Google SERP

Scraping Google SERPs presents a number of challenges, which is why it's considered difficult. These include:

  1. Bot Detection: Google employs several techniques to detect and block bots, including:
    • CAPTCHA
    • IP Blocking
    • Rate Limiting
  2. Dynamic Content: Google search results are often dynamically generated using JavaScript, which can complicate scraping. Content may load after the initial page load, requiring tools like Selenium to render the page fully.
  3. HTML Structure Changes: Google frequently changes the layout and structure of its search results, so scrapers must be updated quickly to keep from breaking.
  4. Complex Data: The SERP includes a variety of complex elements like ads, images, videos, and rich snippets, making it challenging to extract meaningful data consistently.

Despite these challenges, scraping Google search results is still possible with the right techniques and tools.

How to Scrape Google Search Results Using Python?

Let’s break down the process of scraping Google search results with Python into the following steps:

Step 1: Send Requests to Google

Before you begin scraping, you'll need to send a request to Google's search page. Since Google blocks most requests from bots, it’s essential to simulate a real user by setting a proper User-Agent header.

import requests
from fake_useragent import UserAgent

# Generate a random user-agent to look like a real browser
ua = UserAgent()
headers = {'User-Agent': ua.random}

# Google search query; passing it via `params` lets requests
# URL-encode the spaces and special characters for us
query = "How to scrape Google search results with Python"
url = "https://www.google.com/search"

# Send the GET request
response = requests.get(url, params={'q': query}, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    print(response.text)
else:
    print(f"Failed to retrieve the page (status {response.status_code})")

Step 2: Parse the HTML Content

Once you have the HTML content of the Google SERP, you can use BeautifulSoup to extract the data you need.

from bs4 import BeautifulSoup

# Parse the page content
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the search result containers.
# Note: Google's class names (like 'BVG0Nb' here) are obfuscated and
# change frequently, so verify the current class name in your browser's
# developer tools before running this.
search_results = soup.find_all('div', class_='BVG0Nb')

for result in search_results:
    title = result.text
    link_tag = result.find('a')
    if link_tag and link_tag.get('href'):
        print(f"Title: {title}")
        print(f"Link: {link_tag['href']}\n")

Step 3: Handling JavaScript (Using Selenium)

Selenium is a great tool for handling pages that rely on JavaScript to render content. It automates a browser and simulates user interaction, making it ideal for scraping dynamically generated content.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Set up Selenium WebDriver (Selenium 4 requires a Service object)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Tell Selenium to wait up to 5 seconds for elements to appear
driver.implicitly_wait(5)

# Open Google and perform the search
driver.get("https://www.google.com/")
search_box = driver.find_element(By.NAME, 'q')
search_box.send_keys("How to scrape Google search results with Python")
search_box.submit()

# Get search results (again, verify the current class name first)
search_results = driver.find_elements(By.CLASS_NAME, 'BVG0Nb')

for result in search_results:
    title = result.text
    link = result.find_element(By.TAG_NAME, 'a').get_attribute('href')
    print(f"Title: {title}")
    print(f"Link: {link}\n")

driver.quit()

Step 4: Avoid Detection

To minimize the chances of being detected and blocked by Google, you should do the following (a combined sketch appears after this list):

  • Rotate User Agents: Use different user agents to simulate requests from various browsers.
  • Add Delays: Introduce random delays between requests to mimic human-like browsing behavior.
  • Use Proxies: Rotate IP addresses to distribute your requests and avoid detection.
  • Respect Robots.txt: Always check Google’s robots.txt file and follow ethical scraping practices.
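
Here is a minimal sketch combining the first three points. The proxy addresses are placeholders for your own proxy provider, and fake_useragent is assumed to be installed:

import random
import time

import requests
from fake_useragent import UserAgent

ua = UserAgent()

# Placeholder proxies: replace with addresses from your own provider
proxies_pool = [
    {'http': 'http://proxy1.example.com:8080', 'https': 'http://proxy1.example.com:8080'},
    {'http': 'http://proxy2.example.com:8080', 'https': 'http://proxy2.example.com:8080'},
]

for query in ["coffee", "tea", "python web scraping"]:
    # Rotate the user agent and the proxy on every request
    headers = {'User-Agent': ua.random}
    proxy = random.choice(proxies_pool)

    response = requests.get(
        "https://www.google.com/search",
        params={'q': query},
        headers=headers,
        proxies=proxy,
    )
    print(query, response.status_code)

    # Random delay between requests to mimic human browsing
    time.sleep(random.uniform(2, 6))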

The Best Google Search Scraping API - Scrapeless

While scraping Google directly is possible, it is tedious and error-prone, and it often ends in being blocked. This is where Scrapeless comes in. With a built-in CAPTCHA solver, IP rotation, intelligent proxies, and a web unlocker, Scrapeless is an API designed specifically to help users scrape search results without getting blocked.

Why Choose Scrapeless?

  • Legality: Scrapeless provides a legal and compliant way to access search results.
  • Reliability: The API uses sophisticated techniques to avoid detection, ensuring uninterrupted data collection.
  • Ease of Use: Scrapeless offers a simple API that integrates easily with Python, making it ideal for developers who need quick access to search result data.
  • Customizable: You can tailor the results to your needs, such as specifying the type of content (e.g., organic listings, ads, etc.).

How to use the Scrapeless Google Search scraper API:

To keep the data targeted and specific, this article demonstrates with a Google Search crawl.

Frustrated with web blocking and Google Search scraping?
Join our community and get an effective solution with a free trial!

Step 1. Log into the Scrapeless Dashboard and go to "Google Search API".

Google Search API

Step 2. Configure the keywords, region, language, proxy, and other settings you need on the left. Once everything looks correct, click "Start Scraping".

  • q: defines the query you want to search for.
  • gl: defines the country to use for the Google search.
  • hl: defines the language to use for the Google search.

Start Scraping

Step 3. Get the crawling results and export them.

Get the crawling results

Just need sample code to integrate into your project? We’ve got you covered! Or you can visit our API documentation for any language you need.

  • Python:
import http.client
import json

conn = http.client.HTTPSConnection("api.scrapeless.com")

# Describe the scraping task: which scraper to run and its input
payload = json.dumps({
    "actor": "scraper.google.search",
    "input": {
        "q": "coffee",
        "hl": "en",
        "gl": "us"
    }
})
headers = {
    'Content-Type': 'application/json'
}

conn.request("POST", "/api/v1/scraper/request", payload, headers)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
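
The response body is JSON, so you can decode it straight into Python objects. The exact schema depends on the Scrapeless response, so inspect the keys before relying on specific fields; this follow-up only assumes the body is valid JSON:

# Continuing from the snippet above
result = json.loads(data.decode("utf-8"))
print(result.keys())
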
  • Golang:
package main

import (
    "fmt"
    "io"
    "net/http"
    "strings"
)

func main() {
    url := "https://api.scrapeless.com/api/v1/scraper/request"

    // Describe the scraping task: which scraper to run and its input
    payload := strings.NewReader(`{
    "actor": "scraper.google.search",
    "input": {
        "q": "coffee",
        "hl": "en",
        "gl": "us"
    }
}`)

    req, err := http.NewRequest("POST", url, payload)
    if err != nil {
        fmt.Println(err)
        return
    }
    req.Header.Add("Content-Type", "application/json")

    res, err := http.DefaultClient.Do(req)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer res.Body.Close()

    // io.ReadAll replaces the deprecated ioutil.ReadAll
    body, err := io.ReadAll(res.Body)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Println(string(body))
}

Ending Words

Scraping Google search results can be tricky, but with the right tools and techniques it's definitely achievable. Just remember: it's not all about writing the code; it's about knowing how to avoid detection, respect legal boundaries, and find alternatives when necessary.

Scrapeless Scraping API could just be your best friend in the world of scraping Google search results!
