Mikhail Zub for SerpApi

Posted on Oct 24, 2022

Web scraping Google Play Children (Kids) with Nodejs

#webscraping #node #serpapi

What will be scraped

Full code

If you don't need an explanation, have a look at the full code example in the online IDE

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

puppeteer.use(StealthPlugin());

const searchParams = {
  hl: "en", // Parameter defines the language to use for the Google search
  gl: "us", // parameter defines the country to use for the Google search
  device: "phone", // parameter defines the search device. Options: phone, tablet, tv, chromebook
  age: null, // parameter defines age subcategory. Options: null (0-12 years), AGE_RANGE1 (0-5 years), AGE_RANGE2 (6-8 years), AGE_RANGE3 (9-12 years)
};

const URL = searchParams.age
  ? `https://play.google.com/store/apps/category/FAMILY?age=${searchParams.age}&hl=${searchParams.hl}&gl=${searchParams.gl}&device=${searchParams.device}`
  : `https://play.google.com/store/apps/category/FAMILY?hl=${searchParams.hl}&gl=${searchParams.gl}&device=${searchParams.device}`;

async function scrollPage(page, scrollContainer) {
  let lastHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
  while (true) {
    await page.evaluate(`window.scrollTo(0, document.querySelector("${scrollContainer}").scrollHeight)`);
    await page.waitForTimeout(4000);
    let newHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
    if (newHeight === lastHeight) {
      break;
    }
    lastHeight = newHeight;
  }
}

async function getKidsAppsFromPage(page) {
  const apps = await page.evaluate(() => {
    const mainPageInfo = Array.from(document.querySelectorAll("section .oVnAB")).reduce((result, block) => {
      const categoryTitle = block.textContent.trim();
      const apps = Array.from(block.parentElement.querySelectorAll(".ULeU3b")).map((app) => {
        const link = `https://play.google.com${app.querySelector(".Si6A0c")?.getAttribute("href")}`;
        const appId = link.slice(link.indexOf("?id=") + 4);
        if (app.querySelector(".sT93pb.DdYX5.OnEJge")) {
          return {
            title: app.querySelector(".sT93pb.DdYX5.OnEJge")?.textContent.trim(),
            appCategory: app.querySelector(".sT93pb.w2kbF:not(.ePXqnb)")?.textContent.trim(),
            link,
            rating: parseFloat(app.querySelector(".ubGTjb:last-child > div")?.getAttribute("aria-label")?.slice(6, 9)) || "No rating",
            iconThumbnail: app.querySelector(".j2FCNc img")?.getAttribute("srcset")?.slice(0, -3),
            appThumbnail: app.querySelector(".Vc0mnc img")?.getAttribute("src") || app.querySelector(".Shbxxd img")?.getAttribute("src"),
            video: app.querySelector(".aCy7Gf button")?.getAttribute("data-video-url") || "No video preview",
            appId,
          };
        } else {
          return {
            title: app.querySelector(".Epkrse")?.textContent.trim(),
            link,
            rating: parseFloat(app.querySelector(".vlGucd > div:first-child")?.getAttribute("aria-label")?.slice(6, 9)) || "No rating",
            thumbnail: app.querySelector(".TjRVLb img")?.getAttribute("srcset"),
            appId,
          };
        }
      });
      return {
        ...result,
        [categoryTitle]: apps,
      };
    }, {});

    return mainPageInfo;
  });
  return apps;
}

async function getMainPageInfo() {
  const browser = await puppeteer.launch({
    headless: true, // if you want to see what the browser is doing, you need to change this option to "false"
    args: ["--no-sandbox", "--disable-setuid-sandbox"],
  });

  const page = await browser.newPage();

  await page.setDefaultNavigationTimeout(60000);
  await page.goto(URL);

  await page.waitForSelector(".oVnAB");

  await scrollPage(page, ".T4LgNb");

  const apps = await getKidsAppsFromPage(page);

  await browser.close();

  return apps;
}

getMainPageInfo().then((result) => console.dir(result, { depth: null }));

Preparation

First, we need to create a Node.js* project and add the npm packages puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth to control Chromium (or Chrome, or Firefox, but here we work only with Chromium, which is used by default) over the DevTools Protocol in headless or non-headless mode.

To do this, in the directory with our project, open the command line and enter npm init -y, and then npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth.

*If you don't have Node.js installed, you can download it from nodejs.org and follow the installation documentation.

📌Note: you can also use puppeteer without any extensions, but I strongly recommend using it with puppeteer-extra and puppeteer-extra-plugin-stealth to make it harder for websites to detect that you are using headless Chromium or a web driver. You can check it on the Chrome headless tests website. The screenshot below shows the difference.

Process

First of all, we need to scroll through all the app listings until no more listings load, which is the difficult part described below.

The next step is to extract data from HTML elements after scrolling is finished. Getting the right CSS selectors is fairly easy with the SelectorGadget Chrome extension, which lets us grab CSS selectors by clicking on the desired element in the browser. However, it does not always work perfectly, especially when the website relies heavily on JavaScript.

We have a dedicated Web Scraping with CSS Selectors blog post at SerpApi if you want to know a little bit more about them.

The Gif below illustrates the approach of selecting different parts of the results using SelectorGadget.

Code explanation

Declare puppeteer from the puppeteer-extra library to control the Chromium browser, and StealthPlugin from the puppeteer-extra-plugin-stealth library to prevent websites from detecting that you are using a web driver:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

Next, we tell puppeteer to use StealthPlugin, write the necessary request parameters, and build the search URL using the ternary operator (the URL differs depending on whether the age is specified):

puppeteer.use(StealthPlugin());

const searchParams = {
  hl: "en", // Parameter defines the language to use for the Google search
  gl: "us", // parameter defines the country to use for the Google search
  device: "phone", // parameter defines the search device. Options: phone, tablet, tv, chromebook
  age: null, // parameter defines age subcategory. Options: null (0-12 years), AGE_RANGE1 (0-5 years), AGE_RANGE2 (6-8 years), AGE_RANGE3 (9-12 years)
};

const URL = searchParams.age
  ? `https://play.google.com/store/apps/category/FAMILY?age=${searchParams.age}&hl=${searchParams.hl}&gl=${searchParams.gl}&device=${searchParams.device}`
  : `https://play.google.com/store/apps/category/FAMILY?hl=${searchParams.hl}&gl=${searchParams.gl}&device=${searchParams.device}`;
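
As an alternative to the ternary with two template strings, the query string could be assembled with URLSearchParams, which is built into Node.js. The sketch below is only an illustration; buildKidsUrl is a hypothetical helper name that does not appear in the original code:

```javascript
// Hypothetical helper: builds the Kids (FAMILY) category URL from a searchParams-like object.
function buildKidsUrl({ hl, gl, device, age }) {
  const query = new URLSearchParams();
  // "age" is added only when it is set, so null falls back to the default 0-12 range
  if (age) query.set("age", age);
  query.set("hl", hl);
  query.set("gl", gl);
  query.set("device", device);
  return `https://play.google.com/store/apps/category/FAMILY?${query.toString()}`;
}

console.log(buildKidsUrl({ hl: "en", gl: "us", device: "phone", age: null }));
console.log(buildKidsUrl({ hl: "en", gl: "us", device: "phone", age: "AGE_RANGE1" }));
```

With this approach the parameter values are also URL-encoded automatically, which may help if a value ever contains special characters.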

If the age parameter is set to null, we use the default age subcategory (0-12 years), and with the parameters above the URL will look like this:

"https://play.google.com/store/apps/category/FAMILY?hl=en&gl=us&device=phone";

Otherwise, the URL will look like this:

"https://play.google.com/store/apps/category/FAMILY?age=AGE_RANGE1&hl=en&gl=us&device=phone";

The GIF below illustrates how the URL changes:

Next, we write a function to scroll the page to load all the apps:

async function scrollPage(page, scrollContainer) {
  ...
}

In this function, we first get the scrollContainer height (using the evaluate() method). Then we use a while loop in which we scroll down the scrollContainer, wait 4 seconds (using the waitForTimeout method), and get the new scrollContainer height.

Next, we check whether newHeight is equal to lastHeight; if so, we stop the loop. Otherwise, we assign the newHeight value to the lastHeight variable and repeat until the page has been scrolled down to the end:

let lastHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
while (true) {
  await page.evaluate(`window.scrollTo(0, document.querySelector("${scrollContainer}").scrollHeight)`);
  await page.waitForTimeout(4000);
  let newHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
  if (newHeight === lastHeight) {
    break;
  }
  lastHeight = newHeight;
}
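
The loop above runs until the height stops changing, so it could keep going for a long time on a page that never settles. A possible variant with a safety cap is sketched below. The names scrollUntilStable, getHeight, scrollToBottom, wait and maxRounds are hypothetical, and the page is simulated so the logic can be tried without a browser:

```javascript
// Hypothetical variant of the scroll loop with a safety cap.
// getHeight, scrollToBottom and wait are injected so the logic runs without Puppeteer.
async function scrollUntilStable({ getHeight, scrollToBottom, wait, maxRounds = 30 }) {
  let lastHeight = await getHeight();
  for (let round = 1; round <= maxRounds; round++) {
    await scrollToBottom();
    await wait();
    const newHeight = await getHeight();
    // Same height twice in a row means nothing new was loaded
    if (newHeight === lastHeight) return { rounds: round, finished: true };
    lastHeight = newHeight;
  }
  // The cap was reached before the height stopped changing
  return { rounds: maxRounds, finished: false };
}

// Simulated page: the height grows twice, then stays the same
const heights = [1000, 2000, 3000, 3000];
let reads = 0;
const fakePage = {
  getHeight: async () => heights[Math.min(reads++, heights.length - 1)],
  scrollToBottom: async () => {},
  wait: async () => {},
};

scrollUntilStable(fakePage).then((result) => console.log(result));
```

With Puppeteer, getHeight and scrollToBottom could wrap the same page.evaluate() calls used in scrollPage, and wait could be () => new Promise((resolve) => setTimeout(resolve, 4000)).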

Next, we write a function to get apps data from the page:

async function getKidsAppsFromPage(page) {
  ...
}

In this function, we get information from the page context and save it in the returned object. First, we need to get all HTML elements matching the "section .oVnAB" selector (querySelectorAll() method).

Then we use the reduce() method (it lets us build an object with the results) to iterate over the array built with the Array.from() method:

const apps = await page.evaluate(() => {
  const mainPageInfo = Array.from(document.querySelectorAll("section .oVnAB")).reduce((result, block) => {
      ...
    }, {});

    return mainPageInfo;
});
return apps;
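
To make the reduce() pattern easier to follow, here is a small standalone sketch that runs on made-up data instead of DOM elements. The blocks array and its title and apps fields are assumptions for illustration only:

```javascript
// Toy data standing in for the category blocks found on the page (made up for illustration)
const blocks = [
  { title: "New & updated", apps: ["PBS KIDS Video"] },
  { title: "Encourage kindness", apps: ["Breathe, Think, Do with Sesame"] },
];

// Each step copies the previous result and adds one key named after the category
const grouped = blocks.reduce((result, block) => {
  return { ...result, [block.title]: block.apps };
}, {});

console.log(grouped);
```

The empty object passed as the second argument of reduce() is the starting value of result, so the first category is added to an empty object.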

And finally, we extract the data for each app inside the reduce() callback.

On each iteration step we return the previous step's result (using spread syntax) and add a new category named after the categoryTitle constant.

We need to use two different result templates because there are two different app layouts on the page:

const categoryTitle = block.textContent.trim();
const apps = Array.from(block.parentElement.querySelectorAll(".ULeU3b")).map((app) => {
  const link = `https://play.google.com${app.querySelector(".Si6A0c")?.getAttribute("href")}`;
  const appId = link.slice(link.indexOf("?id=") + 4);
  // if one layout appears
  if (app.querySelector(".sT93pb.DdYX5.OnEJge")) {
    return {
      title: app.querySelector(".sT93pb.DdYX5.OnEJge")?.textContent.trim(),
      appCategory: app.querySelector(".sT93pb.w2kbF:not(.ePXqnb)")?.textContent.trim(),
      link,
      rating: parseFloat(app.querySelector(".ubGTjb:last-child > div")?.getAttribute("aria-label")?.slice(6, 9)) || "No rating",
      iconThumbnail: app.querySelector(".j2FCNc img")?.getAttribute("srcset")?.slice(0, -3),
      appThumbnail: app.querySelector(".Vc0mnc img")?.getAttribute("src") || app.querySelector(".Shbxxd img")?.getAttribute("src"),
      video: app.querySelector(".aCy7Gf button")?.getAttribute("data-video-url") || "No video preview",
      appId,
    };
  // else extracting second layout
  } else {
    return {
      title: app.querySelector(".Epkrse")?.textContent.trim(),
      link,
      rating: parseFloat(app.querySelector(".vlGucd > div:first-child")?.getAttribute("aria-label")?.slice(6, 9)) || "No rating",
      thumbnail: app.querySelector(".TjRVLb img")?.getAttribute("srcset"),
      appId,
    };
  }
});
return {
  ...result,
  [categoryTitle]: apps,
};
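
The rating and appId above are cut out of strings by fixed positions (slice(6, 9) and indexOf("?id=")). A slightly more tolerant way to do the same is sketched below. parseRating and extractAppId are hypothetical helpers, and the aria-label format "Rated 4.4 stars out of five stars" is an assumption based on the slice offsets used in the code:

```javascript
// Hypothetical helpers, not part of the original scraper.
// Assumption: the aria-label looks like "Rated 4.4 stars out of five stars".
function parseRating(ariaLabel) {
  // Take the first number in the label instead of relying on fixed character positions
  const match = ariaLabel?.match(/\d+(\.\d+)?/);
  return match ? parseFloat(match[0]) : "No rating";
}

function extractAppId(link) {
  try {
    // Read the "id" query parameter wherever it appears in the query string
    return new URL(link).searchParams.get("id");
  } catch {
    return null; // not a valid absolute URL
  }
}

console.log(parseRating("Rated 4.4 stars out of five stars"));
console.log(parseRating(null));
console.log(extractAppId("https://play.google.com/store/apps/details?id=org.pbskids.video"));
```

Inside page.evaluate() these helpers would have to be defined in the page context as well, because functions declared in Node.js are not visible to the browser.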

Next, write a function to control the browser, and get information:

async function getMainPageInfo() {
  ...
}

In this function, we first define browser using the puppeteer.launch({options}) method with the current options, such as headless: true and args: ["--no-sandbox", "--disable-setuid-sandbox"].

These options mean that we use headless mode and pass an array of arguments that allow the browser process to launch in the online IDE. Then we open a new page:

const browser = await puppeteer.launch({
  headless: true, // if you want to see what the browser is doing, you need to change this option to "false"
  args: ["--no-sandbox", "--disable-setuid-sandbox"],
});

const page = await browser.newPage();

Next, we change the default navigation timeout (30 sec) to 60000 ms (1 min) for slow internet connections with the .setDefaultNavigationTimeout() method, go to the URL with the .goto() method, and use the .waitForSelector() method to wait until the selector is loaded:

await page.setDefaultNavigationTimeout(60000);
await page.goto(URL);
await page.waitForSelector(".oVnAB");

And finally, we wait until the page has been scrolled, save the apps data from the page in the apps constant, close the browser, and return the received data:

await scrollPage(page, ".T4LgNb");

const apps = await getKidsAppsFromPage(page);

await browser.close();

return apps;

Now we can launch our parser:

$ node YOUR_FILE_NAME # YOUR_FILE_NAME is the name of your .js file

Output

{
   "New & updated":[
      {
         "title":"PBS KIDS Video",
         "appCategory":"Education",
         "link":"https://play.google.com/store/apps/details?id=org.pbskids.video",
         "rating":4.4,
         "iconThumbnail":"https://play-lh.googleusercontent.com/Fel1apzw2D5Qy1xZ9HYQ3LPEJqZB5OxdhkorYLrQ7fTUIdGU8uIY_qiN9ZvaRs9eItQ=s128-rw",
         "appThumbnail":"https://play-lh.googleusercontent.com/9MSE2M5sGVy73d75bBemSfZQicBp1cOkjjG-c3tvW5vOVrpOaXdAyjmnbVcBCMWSaLk=w416-h235-rw",
         "video":"No video preview",
         "appId":"org.pbskids.video"
      },
      ... and other results
   ],
   "Encourage kindness":[
      {
         "title":"Breathe, Think, Do with Sesame",
         "link":"https://play.google.com/store/apps/details?id=air.com.sesameworkshop.ResilienceThinkBreathDo",
         "rating":4,
         "thumbnail":"https://play-lh.googleusercontent.com/-UbCkW4xbM661t4mndTi7owhXY0GYBCRQn4Pxl7_1tXgCCvqKsJwUKE-O61NO0CuJA=s512-rw 2x",
         "appId":"air.com.sesameworkshop.ResilienceThinkBreathDo"
      },
      ... and other results
   ],
   ... and other categories
}

Using Google Play Apps Store API from SerpApi

This section compares the DIY solution with our solution.

The biggest difference is that you don't need to create the parser from scratch and maintain it.

There's also a chance that Google might block the request at some point. We handle that on our backend, so there's no need to figure out how to do it yourself or which CAPTCHA solver or proxy provider to use.

First, we need to install google-search-results-nodejs:

npm i google-search-results-nodejs

Here's the full code example, if you don't need an explanation:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); //your API key from serpapi.com

const params = {
  engine: "google_play", // search engine
  gl: "us", // parameter defines the country to use for the Google search
  hl: "en", // parameter defines the language to use for the Google search
  store: "apps", // parameter defines the type of Google Play store
  store_device: "phone", // parameter defines the search device. Options: phone, tablet, tv, chromebook, watch, car
  apps_category: "FAMILY", // parameter defines the apps and games store category. In this case we use "FAMILY" to scrape Google Play Children apps
};

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};

const getResults = async () => {
  const json = await getJson();
  const appsResults = json.organic_results.reduce((result, category) => {
    const { title: categoryTitle, items } = category;
    const apps = items.map((app) => {
      const { title, link, rating, category, video = "No video preview", thumbnail, product_id } = app;
      if (category) {
        return {
          title,
          link,
          rating,
          category,
          video,
          thumbnail,
          appId: product_id,
        };
      } else {
        return {
          title,
          link,
          rating,
          thumbnail,
          appId: product_id,
        };
      }
    });
    return {
      ...result,
      [categoryTitle]: apps,
    };
  }, {});
  return appsResults;
};

getResults().then((result) => console.dir(result, { depth: null }));

Code explanation

First, we need to declare SerpApi from the google-search-results-nodejs library and define a new search instance with your API key from SerpApi:

const SerpApi = require("google-search-results-nodejs");
const search = new SerpApi.GoogleSearch(process.env.API_KEY); // your API key from serpapi.com

Next, we write the necessary parameters for making a request:

const params = {
  engine: "google_play", // search engine
  gl: "us", // parameter defines the country to use for the Google search
  hl: "en", // parameter defines the language to use for the Google search
  store: "apps", // parameter defines the type of Google Play store
  store_device: "phone", // parameter defines the search device. Options: phone, tablet, tv, chromebook, watch, car
  apps_category: "FAMILY", // parameter defines the apps and games store category. In this case we use "FAMILY" to scrape Google Play Children apps
};

Next, we wrap the search method from the SerpApi library in a promise to further work with the search results:

const getJson = () => {
  return new Promise((resolve) => {
    search.json(params, resolve);
  });
};
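
The promise above only resolves, so if the callback is never called the await would wait forever. One possible safeguard is a timeout, sketched below. withTimeout and fakeSearchJson are hypothetical names; fakeSearchJson only imitates the search.json(params, callback) call shape so that the example runs without an API key:

```javascript
// Hypothetical wrapper: turns a callback-style call into a promise that rejects after a timeout.
function withTimeout(callbackFn, ms) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`No response after ${ms} ms`)), ms);
    callbackFn((data) => {
      clearTimeout(timer); // the response arrived in time, so cancel the timeout
      resolve(data);
    });
  });
}

// Stand-in for search.json(params, callback); it is not the real library call
const fakeSearchJson = (params, callback) => setTimeout(() => callback({ organic_results: [] }), 10);

withTimeout((done) => fakeSearchJson({ engine: "google_play" }, done), 1000)
  .then((json) => console.log(json))
  .catch((error) => console.error(error.message));
```

With the real library, the first argument could be (done) => search.json(params, done).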

And finally, we declare the getResults function, which gets the data and returns it:

const getResults = async () => {
  ...
};

In this function, we first get the json with results, then iterate the organic_results array in the received json. To do this we use the reduce() method (it lets us build an object with the results). On each iteration step we return the previous step's result (using spread syntax) and add a new category named after the categoryTitle constant:

  const json = await getJson();
  const appsResults = json.organic_results.reduce((result, category) => {
    ...
    return {
      ...result,
      [categoryTitle]: apps,
    };
  }, {});
  return appsResults;

Next, we destructure the category element, rename title to the categoryTitle constant, and iterate the items array to get all apps from this category. To do this we destructure each app element, set the default value "No video preview" for video, and return these constants:

We need to use two different result templates because there are two different app layouts on the page:

const apps = items.map((app) => {
  const { title, link, rating, category, video = "No video preview", thumbnail, product_id } = app;
  // if one layout appears
  if (category) {
    return {
      title,
      link,
      rating,
      category,
      video,
      thumbnail,
      appId: product_id,
    };
    // else extracting second layout
  } else {
    return {
      title,
      link,
      rating,
      thumbnail,
      appId: product_id,
    };
  }
});
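
The two templates differ only in the category and video fields, so they could also be expressed as one template with a conditional spread. The sketch below is just an illustration of that idea; shapeApp is a hypothetical helper, and the sample objects are trimmed from the output shown in this post:

```javascript
// Hypothetical helper: one template that covers both layouts
function shapeApp(app) {
  const { title, link, rating, category, video = "No video preview", thumbnail, product_id } = app;
  return {
    title,
    link,
    rating,
    // category and video are added only for the layout that exposes a category
    ...(category ? { category, video } : {}),
    thumbnail,
    appId: product_id,
  };
}

const withCategory = shapeApp({
  title: "PBS KIDS Video",
  link: "https://play.google.com/store/apps/details?id=org.pbskids.video",
  rating: 4.4,
  category: "Education",
  thumbnail: "https://play-lh.googleusercontent.com/Fel1apzw2D5Qy1xZ9HYQ3LPEJqZB5OxdhkorYLrQ7fTUIdGU8uIY_qiN9ZvaRs9eItQ=s64-rw",
  product_id: "org.pbskids.video",
});

const withoutCategory = shapeApp({
  title: "Violet - My Little Pet",
  link: "https://play.google.com/store/apps/details?id=ro.Funbrite.VioletMyLittlePet",
  rating: 4.7,
  thumbnail: "https://play-lh.googleusercontent.com/lnv-uzrGlkY3Ke_UofPyq77k4RDjatyIOrCnTGoBSWtIF6sluX-eys3MH8Z43kZZ6g=s256-rw",
  product_id: "ro.Funbrite.VioletMyLittlePet",
});

console.dir({ withCategory, withoutCategory }, { depth: null });
```

Whether this is clearer than the explicit if/else is a matter of taste; the explicit version makes the two layouts more visible at a glance.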

After that, we run the getResults function and print all the received information to the console with the console.dir method, which accepts an options object to change the default output settings:

getResults().then((result) => console.dir(result, { depth: null }));

Output

{
   "New & updated":[
      {
         "title":"PBS KIDS Video",
         "link":"https://play.google.com/store/apps/details?id=org.pbskids.video",
         "rating":4.4,
         "category":"Education",
         "video":"No video preview",
         "thumbnail":"https://play-lh.googleusercontent.com/Fel1apzw2D5Qy1xZ9HYQ3LPEJqZB5OxdhkorYLrQ7fTUIdGU8uIY_qiN9ZvaRs9eItQ=s64-rw",
         "appId":"org.pbskids.video"
      },
      ... and other results
   ],
   "Enriching games":[
      {
         "title":"Violet - My Little Pet",
         "link":"https://play.google.com/store/apps/details?id=ro.Funbrite.VioletMyLittlePet",
         "rating":4.7,
         "thumbnail":"https://play-lh.googleusercontent.com/lnv-uzrGlkY3Ke_UofPyq77k4RDjatyIOrCnTGoBSWtIF6sluX-eys3MH8Z43kZZ6g=s256-rw",
         "appId":"ro.Funbrite.VioletMyLittlePet"
      },
      ... and other results
   ],
   ... and other categories
}

Links

If you want to see some projects made with SerpApi, write me a message.

Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
