How to Scrape Google Search Console Backlinks

#python #programming #gsc

If you’re a webmaster or SEO specialist, you most likely need to run backlink audits regularly, and sometimes you have to find toxic backlinks and disavow them. However, exporting and correlating all of the backlink data from Google Search Console by hand is very hard.

If the websites you work with are large, getting this data out of GSC takes so much clicking and exporting that it is simply not doable.

This is where Python, Beautiful Soup and Pandas come in: they let you scrape GSC and pull the data you need automatically.

First things first:

Install the following packages using pip: bs4, requests, pandas. The re and csv modules ship with Python's standard library, so they do not need to be installed.

Emulate a user session

To scrape Google Search Console for backlink information, we need to emulate a normal user. To do this, open GSC in your browser of choice, go to the Links section and select the Top linking sites section. Once there, inspect the page by right-clicking and choosing Inspect.

In the dev tools, go to the Network tab and select the first URL that appears with the type document. It should be a request to a URL of the following form: https://search.google.com/search-console/links?resource_id=sc-domain%3A{YourDomainName}

Click on the URL and look for the Request Headers section in the Headers tab.

In order to emulate a normal session, we will need to add to our Python requests the request information that we see in the request header.

A few notes on this process:

You will notice that your request header also contains cookie information. For the requests library, this information is stored in a dictionary called cookies. The rest of the information is stored in a dictionary named headers.

In effect, we take the information from the request header and turn it into two dictionaries, as in the code below.

* Replace [your-info] with your actual data.

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import csv

headers = {
    "authority": "search.google.com",
    "method":"GET",
    "path":'"[your-info]"',
    "scheme":"https",  
    "accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-encoding":"gzip, deflate",  # "br" is left out: requests can only decode Brotli if the brotli package is installed
    "accept-language":"en-GB,en-US;q=0.9,en;q=0.8,ro;q=0.7",
    "cache-control":"no-cache",
    "pragma":"no-cache",
    "sec-fetch-site":"same-origin",
    "sec-fetch-user":"?1",
    "upgrade-insecure-requests":"1",
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "x-client-data":"[your-info]",
    "sec-ch-ua":'" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    "sec-ch-ua-mobile": "?0",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate"
}
cookies = {
    "__Secure-1PAPISID":"[your-info]",
    "__Secure-1PSID":"[your-info]",
    "__Secure-3PAPISID":"[your-info]",
    "__Secure-3PSID":"[your-info]",
    "__Secure-3PSIDCC":"[your-info]",
    "1P_JAR":"[your-info]",
    "NID":"[your-info]",
    "APISID":"[your-info]",
    "CONSENT":"[your-info]",
    "HSID":"[your-info]",
    "SAPISID":"[your-info]",
    "SID":"[your-info]",
    "SIDCC":"[your-info]",
    "SSID":"[your-info]",
    "_ga":"[your-info]",
    "OTZ":"[your-info]",
    "OGPC":"[your-info]"
}

The information in your own request header might differ from the above. Don’t worry about the differences, as long as you can create the two dictionaries.

Once this is done, execute the cell with the headers and cookies. It’s then time to start on the first part of the actual script: collecting a list of referring domains that link back to your website.
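Typing every cookie by hand is error-prone, so one option is to copy the whole value of the cookie request header from the dev tools and split it programmatically. The following is a minimal sketch; parse_cookie_header is a hypothetical helper name, and it assumes the header value uses the usual "name=value; name=value" layout.

```python
def parse_cookie_header(raw_cookie: str) -> dict:
    """Turn a copied 'cookie' request-header value into a dict for requests."""
    parsed = {}
    for pair in raw_cookie.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only: cookie values may contain '=' themselves
        name, value = pair.split("=", 1)
        parsed[name.strip()] = value.strip()
    return parsed


# Placeholder string; paste the real header value from your own dev tools
raw_header = "SID=[your-info]; HSID=[your-info]; SSID=[your-info]"
example_cookies = parse_cookie_header(raw_header)
```

The resulting dictionary can be passed to requests in place of the hand-written cookies dictionary.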

* Replace [your-domain] with your actual domain.

genericURL = "https://search.google.com/search-console/links/drilldown?resource_id=[your-domain]&type=EXTERNAL&target=&domain="
req = requests.get(genericURL, headers=headers, cookies=cookies)
soup = BeautifulSoup(req.content, 'html.parser')

The above URL is in effect the URL of the Top linking sites section, so make sure you update it to match your own property.
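Because the resource ID and later the target values contain characters such as ":" and "/", it can help to let the standard library encode the query string instead of pasting values into the URL by hand. This is only a sketch: build_drilldown_url is a hypothetical helper, "sc-domain:example.com" is a placeholder property, and the parameter names are taken from the URL shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.google.com/search-console/links/drilldown"


def build_drilldown_url(resource_id: str, target: str = "", domain: str = "") -> str:
    """Build a drilldown URL with every query value percent-encoded."""
    params = {
        "resource_id": resource_id,
        "type": "EXTERNAL",
        "target": target,
        "domain": domain,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Placeholder property; use your own resource ID here
example_url = build_drilldown_url("sc-domain:example.com")
```

The same helper can be reused for the deeper levels by filling in target and domain.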

You can test that you are bypassing the login by running the following code:

g_data = soup.find_all("div", {"class": "OOHai"})
for example in g_data:
    print(example)
    break

The output of the above code should be a div with the class “OOHai”. If a div like that is printed, the session is working and you can continue.
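If nothing is printed, a likely cause is that the cookies have expired. A quick way to check is to look at where the request ended up. The sketch below assumes that an unauthenticated request is redirected to the sign-in host accounts.google.com; was_redirected_to_login is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Assumption: Google sends signed-out sessions to this host
LOGIN_HOST = "accounts.google.com"


def was_redirected_to_login(final_url: str) -> bool:
    """Return True when the final URL of a response points at the sign-in host."""
    return urlparse(final_url).hostname == LOGIN_HOST


# With the request made earlier, the final URL after redirects is req.url:
# was_redirected_to_login(req.url)
```

If the check returns True, copy fresh cookie values from the browser and run the request again.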

Create a List of Referring Domains

The next step in this process will be to leverage Python and Pandas to return a list with all of the referring domains that point at your domain.

g_data = soup.find_all("div", {"class": "OOHai"})

# Collect the text of each matching div; divs where the pattern finds nothing are skipped
finalList = []
for externalDomain in g_data:
    out = re.search(r'<div class="OOHai">(.*?(?=<))', str(externalDomain))
    if out:
        finalList.append(out.group(1))

# DataFrame.append was removed in pandas 2.0, so the frame is built in one step
dfdomains = pd.DataFrame(finalList, columns=["External Domains"])

domainsList = dfdomains["External Domains"].to_list()

The above code builds a Pandas DataFrame of the external domains. The domains are identified by going through the HTML and finding all of the divs in the “OOHai” class. The text of each matching div is added to the dfdomains DataFrame as the name of an external domain.
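To see what the regular expression captures, it can be tried on a single div in isolation. The markup below is a made-up sample, not real GSC output, and first_text is a hypothetical helper; the pattern itself is the one used above.

```python
import re

# Lazily capture everything after the opening tag, up to the next "<"
PATTERN = re.compile(r'<div class="OOHai">(.*?(?=<))')


def first_text(div_html: str):
    """Return the text that directly follows the opening OOHai div, or None."""
    match = PATTERN.search(div_html)
    return match.group(1) if match else None


# Hypothetical markup: a domain name followed by a nested element
sample = '<div class="OOHai">example.com<span>icon</span></div>'
captured = first_text(sample)
```

The lookahead stops the capture at the first nested tag, so only the leading text of the div is kept.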

Extract Backlink Information for Each Domain

Next, we will extract the backlink information for every domain: the Top sites linking to this page and, for each of them, the Top linking pages (effectively the third level in GSC, of which only the first value is kept).

def extractBacklinks():
    for domain in domainsList:
        # Second level: the sites linking to this target
        url = f"https://search.google.com/search-console/links/drilldown?resource_id=[your-domain]&type=EXTERNAL&target={domain}&domain="

        request = requests.get(url, headers=headers, cookies=cookies)
        soup = BeautifulSoup(request.content, 'html.parser')

        # Host part of the target, used as the CSV file name (works with or without a scheme)
        domain_stripped = domain.split('://')[-1].split('/')[0]

        print("---------")
        print("Domain: " + domain)
        print("---------")

        for row in soup.find_all("div", {"class": "OOHai"}):
            stripped_output = row.text.strip()

            # Third level: the pages on that site that link to the target
            url_secondary = f"https://search.google.com/search-console/links/drilldown?resource_id=[your-domain]&type=EXTERNAL&target={domain}&domain={stripped_output}"
            request_secondary = requests.get(url_secondary, headers=headers, cookies=cookies)
            soup_secondary = BeautifulSoup(request_secondary.content, 'html.parser')

            # Keep only the first linking page; fall back to an empty string if none is found
            first_row = soup_secondary.find("div", {"class": "OOHai"})
            stripped_output_last = first_row.text.strip() if first_row else ""

            # newline='' stops the csv module from writing blank lines on Windows
            with open(f"{domain_stripped}.csv", 'a', newline='') as file:
                writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                writer.writerow([domain, stripped_output, stripped_output_last])

extractBacklinks()
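The function above names each CSV file after the host of the target. If the targets in your list come in mixed forms (with or without a scheme, with query strings), the standard library's URL parser is one way to derive that name more defensively. This is a sketch only; csv_name_for is a hypothetical helper and "unknown" is an arbitrary fallback.

```python
from urllib.parse import urlparse


def csv_name_for(target: str) -> str:
    """Derive a CSV file name from the host part of a target URL or bare domain."""
    # urlparse only recognises the host when the value contains "//",
    # so bare domains are prefixed before parsing
    parsed = urlparse(target if "://" in target else f"//{target}")
    host = parsed.hostname or "unknown"
    return f"{host}.csv"


# Placeholder targets for illustration
name_with_scheme = csv_name_for("https://www.example.com/page?x=1")
name_bare = csv_name_for("example.com/blog")
```

Keeping the name derivation in one place also makes it easier to change the naming scheme later, for example to write everything into a single file.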