Zygimantas Sniurevicius for Product Hackers

Using Puppeteer to scrape answers in Stackoverflow

What is Puppeteer

Puppeteer is a Node.js library that lets us control a Chrome browser programmatically. It is one of the most used tools for web scraping because it makes automating browser actions easy.


What are we doing

Today we'll learn how to set up Puppeteer to scrape Google's top results when searching for a problem on Stack Overflow. Let's see how it will work:

  • First we run the script with the question
node index "how to exit vim"
  • Next, we search Google for the top results from Stack Overflow.

  • Collect all the links that match more than half of the words in our question.

[
  {
    keywordMatch: 4,
    url: 'https://stackoverflow.com/questions/31595411/how-to-clear-the-screen-after-exit-vim/51330580'
  }
]
  • Create a folder for the question asked.

  • Visit each URL and look for the answer.

  • Take a screenshot of the answer if there is one.

  • Save it in the folder we created earlier.


Repository

I'm not going to cover every detail of the code in this blog post. Things like how to create folders with Node.js, how to loop through the array of URLs, and how to accept arguments in the script are all in my GitHub repository.

You can find the full code here
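
For readers who want a starting point, here is a rough sketch of two of those omitted pieces: reading the question from the command line and creating a folder for it. This is not the repository's code. parseQuestion and ensureFolder are hypothetical helper names, and the folder naming scheme is an assumption.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: read the question passed as `node index "how to exit vim"`.
// argv[0] is the node binary and argv[1] the script, so the question is argv[2].
function parseQuestion(argv) {
  const question = (argv[2] || "").trim();
  if (!question) {
    throw new Error('Usage: node index "your question"');
  }
  return {
    question,
    // Assumed naming scheme: lowercase words joined by dashes
    folderName: question.toLowerCase().split(/\s+/).join("-"),
  };
}

// Hypothetical helper: create the folder for the question if it is missing.
function ensureFolder(baseDir, folderName) {
  const dir = path.join(baseDir, folderName);
  fs.mkdirSync(dir, { recursive: true }); // no error if it already exists
  return dir;
}
```

In a script this could be called as ensureFolder(process.cwd(), parseQuestion(process.argv).folderName).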


Explaining the code

Now that we've seen the steps in the previous section, it's time to build it ourselves.

Let's begin by initializing Puppeteer inside an async function.

A headless browser is a web browser without a user interface. The snippet below sets headless: false, so a visible browser window opens while the script runs.

It's recommended to use a try/catch block because it's difficult to control errors that happen while the browser is running.


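const puppeteer = require("puppeteer");

(async () => {
  let browser;
  try {
    // headless: false opens a visible browser window, which helps while debugging
    browser = await puppeteer.launch({
      headless: false,
    });

    const page = await browser.newPage();

    // ...the navigation and scraping steps from the next sections go here
  } catch (error) {
    console.log("Error " + error.toString());
  } finally {
    // Close the browser even if something above threw
    if (browser) await browser.close();
  }
})();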


To get all the results from a specific website, we append +site:stackoverflow.com to the search query in the URL.

page.goto accepts two parameters: a string for the URL and an object for the options. In our case, we tell it to wait until the page is completely loaded before moving on.

const googleUrl = `https://www.google.com/search?q=how%20to%20exit%20vim+site%3Astackoverflow.com`;

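// The second argument is an options object; waitUntil lists the events to wait for
await page.goto(googleUrl, {
  waitUntil: ["load", "domcontentloaded", "networkidle0"],
});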
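
Hard-coding the encoded question works for a demo. As a sketch of how the same URL could be built from any question, the snippet below uses encodeURIComponent for the escaping. buildGoogleUrl is a hypothetical helper, not part of the original script, and it encodes the space before site: as %20 instead of +, which should be read the same way in a query string.

```javascript
// Hypothetical helper: build a Google search URL restricted to one site.
function buildGoogleUrl(question, site = "stackoverflow.com") {
  // encodeURIComponent escapes spaces as %20 and the colon as %3A
  const query = encodeURIComponent(`${question} site:${site}`);
  return `https://www.google.com/search?q=${query}`;
}

const googleSearchUrl = buildGoogleUrl("how to exit vim");
console.log(googleSearchUrl);
```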


Getting the URLs

After navigating to the Google search page, it's time to collect all the href links that start with https://stackoverflow.com/questions.

Inside the page.evaluate method we can access the DOM through the document object, which means we can use selectors to find the information we need with document.querySelector or document.querySelectorAll.

Remember that document.querySelectorAll doesn't return an Array but a NodeList, which is why we convert it to an Array before filtering.

Then we map over the elements and return the URLs.


const queryUrl = "how%20to%20exit%20vim";

// The callback runs inside the page, so queryUrl has to be passed in as an argument
const validUrls = await page.evaluate((queryUrl) => {
  // querySelectorAll returns a NodeList, so convert it to an Array first
  const hrefElementsList = Array.from(
    document.querySelectorAll(
      `div[data-async-context='query:${queryUrl}%20site%3Astackoverflow.com'] a[href]`
    )
  );

  // Keep only the links that point to Stack Overflow questions
  const filterElementsList = hrefElementsList.filter((elem) =>
    elem
      .getAttribute("href")
      .startsWith("https://stackoverflow.com/questions")
  );

  const stackOverflowLinks = filterElementsList.map((elem) =>
    elem.getAttribute("href")
  );

  return stackOverflowLinks;
}, queryUrl);
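
The filtering itself is plain string logic, so it can be tried in Node.js without a browser. Here is a small sketch, assuming the href values have already been read from the page. pickQuestionLinks is a hypothetical name, and the de-duplication is an addition that the snippet above does not perform.

```javascript
// Hypothetical helper: keep only Stack Overflow question links, without duplicates.
// Accepts any iterable or array-like of href strings, much like Array.from does.
function pickQuestionLinks(hrefs) {
  const prefix = "https://stackoverflow.com/questions";
  const links = Array.from(hrefs).filter(
    (href) => typeof href === "string" && href.startsWith(prefix)
  );
  return [...new Set(links)]; // a Set keeps the first-seen order
}

// An array-like object stands in for a NodeList here
const sample = {
  0: "https://stackoverflow.com/questions/11828270/how-do-i-exit-vim",
  1: "https://www.google.com/preferences",
  2: "https://stackoverflow.com/questions/11828270/how-do-i-exit-vim",
  length: 3,
};
console.log(pickQuestionLinks(sample));
```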

Matching the URL

With our verified URLs in a variable called validUrls, it's time to check whether any of them roughly match what we are looking for.

We split the question into an Array and loop over each word. If the word appears in the Stack Overflow URL, we increment our wordCounter variable. Once we are done, we check whether more than half of the words match the URL.


const queryWordArray = ["how", "to", "exit", "vim"];
const keywordLikeability = [];

validUrls.forEach((url) => {
  let wordCounter = 0;

  // Count how many words of the question appear somewhere in the URL
  queryWordArray.forEach((word) => {
    if (url.includes(word)) {
      wordCounter = wordCounter + 1;
    }
  });

  // Keep the URL only if more than half of the words matched
  if (queryWordArray.length / 2 < wordCounter) {
    keywordLikeability.push({
      keywordMatch: wordCounter,
      url: url,
    });
  }
});
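
The same matching can be written as a reusable function, which makes it easy to try with sample data. This is a sketch, not the author's code: rankUrlsByKeywords is a hypothetical name, and the sort that puts the best match first is an addition. Note that the match is by substring, so a short word like "to" also counts when it appears inside a longer word.

```javascript
// Hypothetical helper mirroring the loop above: count how many query words
// appear in each URL and keep the URLs that match more than half of them.
function rankUrlsByKeywords(urls, queryWordArray) {
  return urls
    .map((url) => ({
      keywordMatch: queryWordArray.filter((word) => url.includes(word)).length,
      url,
    }))
    .filter((entry) => entry.keywordMatch > queryWordArray.length / 2)
    // Addition not in the original: best match first
    .sort((a, b) => b.keywordMatch - a.keywordMatch);
}

const ranked = rankUrlsByKeywords(
  [
    "https://stackoverflow.com/questions/11828270/how-do-i-exit-vim",
    "https://stackoverflow.com/questions/31595411/how-to-clear-the-screen-after-exit-vim/51330580",
    "https://stackoverflow.com/questions/1/what-is-python",
  ],
  ["how", "to", "exit", "vim"]
);
console.log(ranked);
```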


Capturing the answer

Finally, we need a function that visits the Stack Overflow page and checks whether there is an accepted answer. If there is one, it takes a screenshot of the element and saves it.

We start by going to the Stack Overflow URL and closing the popup, because otherwise it will appear in our screenshot and we don't want that.

To find the popup's close button we use an XPath selector, which is like a weird cousin of our beloved CSS selector, but for XML/HTML.

With the popup gone, it's time to see if we even have an answer. If we do, we take a screenshot and save it.

await acceptedAnswer.screenshot({
 path: `./howtoexitvim.png`,
 clip: { x: 0, y: 0, width: 1024, height: 800 },
});


Take care when using the screenshot method, because its results are not always consistent. For a smoother experience, try to get the DOM element's size and location first.
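
One way to work from the element's size and location is to read its bounding box and pass it to page.screenshot as the clip region. The sketch below is an illustration under assumptions, not the author's code: clipFromBox and screenshotElement are hypothetical helpers, and it assumes the page has not been scrolled, so the viewport-relative box lines up with the clip coordinates.

```javascript
// Hypothetical helper: turn a bounding box into a clip rectangle with padding,
// without letting the rectangle start at negative coordinates.
function clipFromBox(box, padding = 0) {
  const x = Math.max(0, box.x - padding);
  const y = Math.max(0, box.y - padding);
  return {
    x,
    y,
    width: box.width + (box.x - x) + padding,
    height: box.height + (box.y - y) + padding,
  };
}

// Sketch: screenshot the region of the page occupied by an element.
// `page` is a Puppeteer Page and `element` an ElementHandle.
// Assumes the page is not scrolled (see the note above).
async function screenshotElement(page, element, path, padding = 0) {
  const box = await element.boundingBox(); // null when the element is not visible
  if (!box) return false;
  await page.screenshot({ path, clip: clipFromBox(box, padding) });
  return true;
}
```

The return value tells the caller whether a screenshot was taken, so a hidden element can be skipped instead of raising an error.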


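const getAnswerFromQuestion = async (website, page) => {
  console.log("Website", website);
  // waitUntil must be passed inside an options object
  await page.goto(website, {
    waitUntil: ["load", "domcontentloaded", "networkidle0"],
  });

  // Close the popup if present. Note: page.$x exists in older Puppeteer
  // releases; newer ones expect page.$$ with an "xpath/" prefixed selector.
  const popUp = (await page.$x("//button[@title='Dismiss']"))[0];
  if (popUp) await popUp.click();

  const acceptedAnswer = await page.$(".accepted-answer");

  // Nothing to capture when the question has no accepted answer
  if (!acceptedAnswer) return;

  await acceptedAnswer.screenshot({
    path: `./howtoexitvim.png`,
  });
};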



Call the function created in the previous section with the parameters and we are done!


await getAnswerFromQuestion(keywordLikeability[0].url, page);
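
The call above handles only the first match. As a hedged sketch of how every matched URL could be visited in turn, the helpers below give each screenshot its own file name. fileNameForUrl and captureAll are hypothetical names, and the capture callback is an assumption: it stands for any function that takes the URL and a file name and saves the screenshot.

```javascript
// Hypothetical helper: derive a file name from the question id in the URL,
// e.g. https://stackoverflow.com/questions/31595411/... -> 31595411.png
function fileNameForUrl(url) {
  const parts = new URL(url).pathname.split("/").filter(Boolean);
  const id = parts[0] === "questions" && parts[1] ? parts[1] : "answer";
  return `${id}.png`;
}

// Visit the matches one at a time, since they would share a single page.
// `capture` is assumed to receive the URL and the target file name.
async function captureAll(matches, capture) {
  const saved = [];
  for (const { url } of matches) {
    try {
      const fileName = fileNameForUrl(url);
      await capture(url, fileName);
      saved.push(fileName);
    } catch (error) {
      // One failing page should not stop the remaining ones
      console.log("Error " + error.toString());
    }
  }
  return saved;
}
```

The loop is sequential on purpose: navigating one shared page to several URLs at the same time would make the visits interfere with each other.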


Here is the final result: we can finally exit Vim!


Final remarks

I hope you learned something today. Please check out the repository I set up, which has all the code. Thanks for reading, and stay awesome ❤️
