Mikhail Zub for SerpApi

Posted on Sep 19, 2022

Web scraping Google Trends Realtime search with Nodejs

#webscraping #node #google

Intro

Currently, we don't have an API that supports extracting data from the Google Trends Realtime Search page.

This blog post shows how you can do it yourself with the DIY solution below while we work on releasing a proper API.

The solution is suitable for personal use: it doesn't include the Legal US Shield that we offer on our paid Production and above plans, and it has limitations, such as the need to bypass blocks like CAPTCHA.

You can check our public roadmap to track the progress for this API:

🗺️ [New API] Google Trends Realtime Search Trends

What will be scraped

Full code

If you don't need an explanation, have a look at the full code example in the online IDE:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const baseURL = `https://trends.google.com`;
const countryCode = "US";
const category = "all";
/* other supported categories:
b - business,
e - entertainment,
m - health,
t - sci/tech,
s - sports,
h - top stories
*/
async function fillTrendsDataFromPage(page) {
  while (true) {
    const isNextPage = await page.$(".feed-load-more-button");
    if (!isNextPage) break;
    await page.click(".feed-load-more-button");
    await page.waitForTimeout(2000);
  }
  const dataFromPage = await page.evaluate((baseURL) => {
    return Array.from(document.querySelectorAll(".feed-item")).map((el) => ({
      index: el.querySelector(".index")?.textContent.trim(),
      title: Array.from(el.querySelectorAll(".title a"))
        .map((el) => el.getAttribute("title"))
        .join(" • "),
      titleLinks: Array.from(el.querySelectorAll(".title a")).map((el) => ({
        [el.getAttribute("title")]: `${baseURL}${el.getAttribute("href")}`,
      })),
      subtitle: el.querySelector(".summary-text a")?.textContent.trim(),
      subtitleLink: el.querySelector(".summary-text a")?.getAttribute("href"),
      source: el.querySelector(".source-and-time span:first-child")?.textContent.trim(),
      published: el.querySelector(".source-and-time span:last-child")?.textContent.trim(),
      thumbnail: `https:${el.querySelector(".feed-item-image-wrapper img")?.getAttribute("src")}`,
    }));
  }, baseURL);
  return dataFromPage;
}

async function getGoogleTrendsRealtimeResults() {
  const browser = await puppeteer.launch({
    headless: false,
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();
  await page.setViewport({ width: 1200, height: 700 });

  const URL = `${baseURL}/trends/trendingsearches/realtime?geo=${countryCode}&category=${category}&hl=en`;

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);

  await page.waitForSelector(".feed-item");

  const realtimeResults = await fillTrendsDataFromPage(page);

  await browser.close();

  return realtimeResults;
}

getGoogleTrendsRealtimeResults().then((result) => console.dir(result, { depth: null }));

Preparation

First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (or Chrome, or Firefox; here we work only with Chromium, which is used by default) over the DevTools Protocol in headless or non-headless mode.

To do this, open the command line in the project directory and run npm init -y, and then npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth.

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

📌Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth to make it harder for websites to detect that you are using headless Chromium or a web driver. You can check this on the Chrome headless tests website. The screenshot below shows the difference.

Process

The SelectorGadget Chrome extension was used to grab CSS selectors by clicking on the desired element in the browser. If you have any trouble understanding this, we have a dedicated Web Scraping with CSS Selectors blog post at SerpApi.

The GIF below illustrates the approach of selecting different parts of the results.

Code explanation

Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to prevent websites from detecting that you are using a web driver:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

Next, we tell puppeteer to use StealthPlugin and define the Google Trends base URL, the country code (check the full list of supported Google Trends Locations) and the category:

puppeteer.use(StealthPlugin());

const baseURL = `https://trends.google.com`;
const countryCode = "US";
const category = "all";

Besides all, the following categories are available:

b - business,
e - entertainment,
m - health,
t - sci/tech,
s - sports,
h - top stories.
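
As a minimal sketch of how these codes end up in the request URL, the snippet below validates the category before building the address. The helper name buildRealtimeUrl, the BASE_URL constant and the validation step are illustrative assumptions and not part of the original code; the path and query parameters are the ones used in the full code above.

```javascript
const BASE_URL = "https://trends.google.com";

// Category codes listed in this post ("all" is the value used in the full code).
const CATEGORIES = new Map([
  ["all", "all categories"],
  ["b", "business"],
  ["e", "entertainment"],
  ["m", "health"],
  ["t", "sci/tech"],
  ["s", "sports"],
  ["h", "top stories"],
]);

// Hypothetical helper: builds the Realtime Search URL or throws on an unknown code.
function buildRealtimeUrl(countryCode, category = "all") {
  if (!CATEGORIES.has(category)) {
    throw new Error(`Unknown category "${category}"`);
  }
  const url = new URL("/trends/trendingsearches/realtime", BASE_URL);
  url.searchParams.set("geo", countryCode);
  url.searchParams.set("category", category);
  url.searchParams.set("hl", "en");
  return url.toString();
}

console.log(buildRealtimeUrl("US", "b"));
// → https://trends.google.com/trends/trendingsearches/realtime?geo=US&category=b&hl=en
```

Failing early on a mistyped code is a design choice for this sketch; the original code simply interpolates the values into a template string.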

Next, write a function to load all data and get information from the page:

async function fillTrendsDataFromPage(page) {
  ...
}

In this function, we first need to load more data for as long as it is available. To do this, we use a while loop in which we check whether the "Load More" button is present on the page (page.$() method), click it, wait 2 seconds (waitForTimeout method), and repeat until the button is no longer on the page:

while (true) {
  const isNextPage = await page.$(".feed-load-more-button");
  if (!isNextPage) break;
  await page.click(".feed-load-more-button");
  await page.waitForTimeout(2000);
}
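
The loop above runs until the button disappears. As a hedged variation, the sketch below adds an upper bound on the number of clicks and uses a plain Promise-based delay; to my knowledge, newer Puppeteer releases have deprecated and later removed page.waitForTimeout(), so a setTimeout wrapper is a version-independent stand-in. The names sleep and clickUntilGone and the maxClicks and delayMs options are assumptions for illustration.

```javascript
// Promise-based delay: a plain-JavaScript stand-in for page.waitForTimeout().
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: clicks `selector` until it is gone or maxClicks is reached.
// `page` only needs the $() and click() methods already used in this post.
async function clickUntilGone(page, selector, { maxClicks = 50, delayMs = 2000 } = {}) {
  let clicks = 0;
  while (clicks < maxClicks) {
    const button = await page.$(selector);
    if (!button) break; // no "Load More" button left on the page
    await page.click(selector);
    clicks += 1;
    await sleep(delayMs); // give the newly loaded items time to render
  }
  return clicks; // number of clicks actually performed
}
```

Inside fillTrendsDataFromPage it could be called as await clickUntilGone(page, ".feed-load-more-button"). The cap keeps the scraper from spinning forever if the button never goes away.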

Next, we get information from the page context (using the evaluate() method) and save it in the returned array. First, we get all the trend results available on the page (querySelectorAll() method) and build a new array from the resulting NodeList (Array.from()):

return Array.from(document.querySelectorAll(".feed-item")).map((el) => ({

Next, we assign the necessary data to each object's keys. We do this with the textContent property and the trim() method, which get the raw text and remove white space from both ends of the string. To get links, we use the getAttribute() method to read the "href" and "src" HTML attributes. To make the title string look the way it does on the page, we collect the title links into an array and use the join() method to unite its elements into a string with the • separator:

    index: el.querySelector(".index")?.textContent.trim(),
    title: Array.from(el.querySelectorAll(".title a"))
      .map((el) => el.getAttribute("title"))
      .join(" • "),
    titleLinks: Array.from(el.querySelectorAll(".title a")).map((el) => ({
      [el.getAttribute("title")]: `${baseURL}${el.getAttribute("href")}`,
    })),
    subtitle: el.querySelector(".summary-text a")?.textContent.trim(),
    subtitleLink: el.querySelector(".summary-text a")?.getAttribute("href"),
    source: el.querySelector(".source-and-time span:first-child")?.textContent.trim(),
    published: el.querySelector(".source-and-time span:last-child")?.textContent.trim(),
    thumbnail: `https:${el.querySelector(".feed-item-image-wrapper img")?.getAttribute("src")}`,
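
One detail worth noting: when a feed item has no image, the optional chaining returns undefined and the template literal for thumbnail produces the string "https:undefined". A minimal sketch of a guard is shown below; toAbsoluteUrl is a hypothetical helper name, and the assumption that the src values are protocol-relative (starting with //) is inferred from the https: prefix in the code above.

```javascript
// Hypothetical helper, not part of the original code.
function toAbsoluteUrl(src) {
  if (!src) return null; // image element or src attribute is missing
  if (src.startsWith("//")) return `https:${src}`; // protocol-relative URL
  return src; // already has a scheme, leave it untouched
}

console.log(toAbsoluteUrl("//t0.gstatic.com/images?q=tbn:abc"));
// → https://t0.gstatic.com/images?q=tbn:abc
console.log(toAbsoluteUrl(undefined));
// → null
```

Because the page.evaluate() callback runs in the browser context, such a helper would have to be defined inside the callback, or applied in Node.js to the raw src value returned from the page.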

Next, write a function to control the browser and get the information:

async function getGoogleTrendsRealtimeResults() {
  ...
}

In this function, we first define browser using the puppeteer.launch({options}) method with the options headless: false and args: ["--no-sandbox", "--disable-setuid-sandbox"].

headless: false means the browser is launched in non-headless mode, with a visible window, and the args array is used to allow the browser process to launch in the online IDE. Then we open a new page:

const browser = await puppeteer.launch({
  headless: false,
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});

const page = await browser.newPage();

Next, we define the full request URL, change the default navigation timeout (30 sec) to 60000 ms (1 min) for slow internet connections with the .setDefaultNavigationTimeout() method, go to the URL with the .goto() method, and use the .waitForSelector() method to wait until the selector has loaded:

const URL = `${baseURL}/trends/trendingsearches/realtime?geo=${countryCode}&category=${category}&hl=en`;

await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);

await page.waitForSelector(".feed-item");

And finally, we save trends data from the page in the realtimeResults constant, close the browser and return the received data:

const realtimeResults = await fillTrendsDataFromPage(page);

await browser.close();

return realtimeResults;
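
If navigation or a selector wait throws, browser.close() in the code above is never reached and the Chromium process may keep running. One possible way to guard against that is a try/finally wrapper, sketched below. The withBrowser name and its two callback parameters are assumptions for illustration; the only thing it expects from the browser object is the close() method used in this post.

```javascript
// Hypothetical wrapper: launches a browser, runs `task`, and always closes the browser.
// `launchBrowser` is any function that returns (a promise of) an object with close().
async function withBrowser(launchBrowser, task) {
  const browser = await launchBrowser();
  try {
    return await task(browser);
  } finally {
    await browser.close(); // runs on success and on error alike
  }
}

// Possible usage with the code from this post (commented out so the sketch runs on its own):
// const results = await withBrowser(
//   () => puppeteer.launch({ headless: false, args: ["--no-sandbox", "--disable-setuid-sandbox"] }),
//   async (browser) => {
//     const page = await browser.newPage();
//     await page.goto(URL);
//     await page.waitForSelector(".feed-item");
//     return fillTrendsDataFromPage(page);
//   }
// );
```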

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Output

[
   {
      "index":"1",
      "title":"Explore Financial Conduct Authority • Explore Finance • Explore Robo-advisor • Explore Financial services • Explore Debt management plan • Explore Consumer • Explore Investment • Explore Debtor • Explore Financial adviser",
      "titleLinks":[
         {
            "Explore Financial Conduct Authority":"https://trends.google.com/trends/explore?q=/m/0cc7rp_&date=now+7-d&geo=US"
         },
         {
            "Explore Finance":"https://trends.google.com/trends/explore?q=/m/02_7t&date=now+7-d&geo=US"
         },
         {
            "Explore Robo-advisor":"https://trends.google.com/trends/explore?q=/m/010vqqqk&date=now+7-d&geo=US"
         },
         {
            "Explore Financial services":"https://trends.google.com/trends/explore?q=/m/02h400t&date=now+7-d&geo=US"
         },
         {
            "Explore Debt management plan":"https://trends.google.com/trends/explore?q=/m/0crs3y&date=now+7-d&geo=US"
         },
         {
            "Explore Consumer":"https://trends.google.com/trends/explore?q=/m/025_b&date=now+7-d&geo=US"
         },
         {
            "Explore Investment":"https://trends.google.com/trends/explore?q=/m/0g_fl&date=now+7-d&geo=US"
         },
         {
            "Explore Debtor":"https://trends.google.com/trends/explore?q=/m/03rd6r&date=now+7-d&geo=US"
         },
         {
            "Explore Financial adviser":"https://trends.google.com/trends/explore?q=/m/08p4gp&date=now+7-d&geo=US"
         }
      ],
      "subtitle":"Robo advice shines for borrowers: study",
      "subtitleLink":"https://www.investmentexecutive.com/news/research-and-markets/robo-advice-shines-for-borrowers/",
      "source":"Investment Executive",
      "published":"13 hours ago",
      "thumbnail":"https://t0.gstatic.com/images?q=tbn:ANd9GcSmxvumRvWQdIKhvir_gth7zv6N3zSIsoG1WsbnCB84b2rWqidrbyIVY04xbem0jYwTQ5Yd4osF1ns"
   },
   ... and other results
]
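
In the output above, each titleLinks entry is a single-key object, and every title in this sample starts with the word "Explore". If a flatter shape is easier to work with, a small post-processing step could look like the sketch below. The flattenTitleLinks name and the topic and link keys are assumptions, and stripping the "Explore" prefix relies only on what this sample output shows.

```javascript
// Hypothetical post-processing helper for the titleLinks array shown above.
function flattenTitleLinks(titleLinks) {
  return titleLinks.flatMap((entry) =>
    Object.entries(entry).map(([title, link]) => ({
      topic: title.replace(/^Explore\s+/, ""), // drop the leading "Explore " if present
      link,
    }))
  );
}

// Two entries copied from the sample output.
const sample = [
  { "Explore Finance": "https://trends.google.com/trends/explore?q=/m/02_7t&date=now+7-d&geo=US" },
  { "Explore Robo-advisor": "https://trends.google.com/trends/explore?q=/m/010vqqqk&date=now+7-d&geo=US" },
];

console.log(flattenTitleLinks(sample).map((item) => item.topic));
// → [ 'Finance', 'Robo-advisor' ]
```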