Practical Puppeteer: How to evaluate XPath expressions

#puppeteer #javascript #xpath #webscraping

Today I will share how to evaluate XPath expressions in Puppeteer using the $x API, and we will also use the waitForXPath API.

Before I learned Puppeteer, I mostly used XPath in PHP through its DOMXPath class, and I found it very useful for selecting elements. I feel more comfortable writing XPath expressions than CSS selectors. It's just my personal opinion, sorry :)

For those who don't know XPath, here is the definition according to Wikipedia:

XPath (XML Path Language) is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. XPath was defined by the World Wide Web Consortium (W3C).

In Puppeteer there are two APIs related to XPath. One is waitForXPath, which works like waitForSelector: it waits for an element matching our XPath expression to appear in the page. The second is the $x method, which evaluates an XPath expression. $x returns an array of ElementHandle, and I will show you a sample later.
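
As a rough sketch of how the two calls fit together, the helper below waits for an XPath match and then collects the text of every matching node. The name textsByXPath and the 5000 ms default timeout are my own assumptions for illustration, and it assumes a Puppeteer version that still ships page.waitForXPath and page.$x.

```javascript
// Hypothetical helper (not part of Puppeteer): wait for an XPath match,
// then return the text content of every node that matches.
async function textsByXPath(page, xpath, timeoutMs = 5000) {
  // waitForXPath rejects if nothing matches before the timeout expires.
  await page.waitForXPath(xpath, { timeout: timeoutMs });

  // $x resolves to an array of ElementHandle (empty when nothing matches).
  const handles = await page.$x(xpath);

  const texts = [];
  for (const handle of handles) {
    // page.evaluate runs the function in the page, with the handle
    // resolved to the real DOM element.
    texts.push(await page.evaluate(el => el.textContent, handle));
  }
  return texts;
}
```

Because the page object is passed in as a parameter, the helper can be called with the page created in the full script further down.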

Enough of the boring stuff. Let's start with a scenario. There is a website called Lamudi in Indonesia, https://www.lamudi.co.id/newdevelopments/, and I want to scrape a value from it based on the selector shown below.

Our target is this element. I want to get the value 160.



<span class="CountTitle-number">160</span>

Usually we would use a CSS selector like document.querySelector('span[class="CountTitle-number"]'), but as an alternative we will now use an XPath expression like this: //span[@class="CountTitle-number"].

In the Developer tools console we can test this expression easily. Try typing this in the Developer tools of your browser.
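
One thing to keep in mind: @class="CountTitle-number" compares the whole attribute value, exactly like the CSS attribute selector, so it stops matching if the element ever gets a second class. A common XPath 1.0 idiom for matching a single class token is sketched below. The helper name xpathForClass is hypothetical, and it assumes the class name contains no quotes or spaces.

```javascript
// Hypothetical helper: build an XPath expression that matches one class
// token, similar to what the CSS selector "span.CountTitle-number" does.
function xpathForClass(tagName, className) {
  // Pad the normalized class list with spaces so that " name " only
  // matches a complete token, never a fragment of a longer class name.
  return `//${tagName}[contains(concat(' ', normalize-space(@class), ' '), ' ${className} ')]`;
}

console.log(xpathForClass('span', 'CountTitle-number'));
// //span[contains(concat(' ', normalize-space(@class), ' '), ' CountTitle-number ')]
```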



$x('//span[@class="CountTitle-number"]');

The console prints an array containing the matching span element.
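
Note that $x in the Developer tools is a console utility, so it is not available to scripts running inside the page. A minimal sketch of the same lookup with the standard document.evaluate DOM API could look like this. The name xpathAll is hypothetical, and the optional doc parameter is only there so the function can be exercised outside a browser.

```javascript
// Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in the DOM spec.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Hypothetical helper: return all nodes matching an XPath expression,
// in document order, as a plain array.
function xpathAll(expression, doc = document) {
  const result = doc.evaluate(expression, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}

// In a browser:
// xpathAll('//span[@class="CountTitle-number"]').map(node => node.textContent);
```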

OK nice, the XPath expression finds our element. Now let's create a script that uses Puppeteer to get this element's text content.

Preparation



npm i puppeteer
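
A plain npm i puppeteer installs the latest release. As far as I know, page.$x and page.waitForXPath were removed in Puppeteer v22, where an XPath expression is instead passed to page.$$ and page.waitForSelector with an xpath/ prefix. If that applies to your version, a small compatibility layer like the sketch below can stand in for the two calls used in this article. Both function names are hypothetical, not Puppeteer APIs.

```javascript
// Hypothetical compatibility helpers (not part of Puppeteer).

// Evaluate an XPath expression and resolve to an array of element handles.
async function queryXPath(page, xpath) {
  if (typeof page.$x === 'function') {
    return page.$x(xpath);             // releases that still ship $x
  }
  return page.$$(`xpath/${xpath}`);    // releases using the xpath/ selector prefix
}

// Wait until an element matching the XPath expression appears in the page.
async function waitForXPathCompat(page, xpath, options = {}) {
  if (typeof page.waitForXPath === 'function') {
    return page.waitForXPath(xpath, options);
  }
  return page.waitForSelector(`xpath/${xpath}`, options);
}
```

With these in place, page.$x(expr) becomes queryXPath(page, expr) and page.waitForXPath(expr) becomes waitForXPathCompat(page, expr).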

The code

The code is self-explanatory, and I hope you can adjust, expand or improvise on it for your specific needs later.

File puppeteer_xpath.js



const puppeteer = require('puppeteer');

(async () => {
    // set some options (set headless to false so we can see
    // this automated browsing experience)
    const launchOptions = { headless: false, args: ['--start-maximized'] };

    const browser = await puppeteer.launch(launchOptions);
    const page = await browser.newPage();

    // set viewport and user agent (just in case for nice viewing)
    await page.setViewport({width: 1366, height: 768});
    await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36');

    // go to the target web
    await page.goto('https://www.lamudi.co.id/newdevelopments/');

    // wait for the element matched by the XPath expression to appear in the page
    await page.waitForXPath("(//span[@class='CountTitle-number'])[1]");

    // evaluate the XPath expression of the target element (it returns an array of ElementHandle)
    let elHandle = await page.$x("(//span[@class='CountTitle-number'])[1]");

    // get the textContent of the first matched element (use page.evaluate)
    const lamudiNewPropertyCount = await page.evaluate(el => el.textContent, elHandle[0]);

    console.log('Total Property Number is:', lamudiNewPropertyCount);

    // close the browser
    await browser.close();
})();
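
If waitForXPath times out in the script above, the error is thrown before browser.close() is reached and the browser stays open. One way to guard against that is a try/finally wrapper, sketched here under my own naming: withPage is hypothetical, and launch is whatever function you use to start the browser, for example () => puppeteer.launch(launchOptions).

```javascript
// Hypothetical wrapper: open a browser and a page, run the task,
// and always close the browser afterwards, even when the task throws.
async function withPage(launch, task) {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    return await task(page);
  } finally {
    // Runs on success and on failure, so no browser is left behind.
    await browser.close();
  }
}
```

The body of the script (goto, waitForXPath, $x, evaluate) would then go inside the task function, and errors still propagate to the caller after the browser is closed.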

Run it



node puppeteer_xpath.js

If everything is OK, it will display a result like the one below.



Total Property Number is: 160
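
Note that textContent gives back a string. If you need the count as a number, a small sketch like the following could do. parseCount is a hypothetical helper, and it assumes the value is a non-negative integer, possibly with surrounding whitespace or thousands separators.

```javascript
// Hypothetical helper: turn scraped text such as " 160 " into a number.
// Returns null when the text contains no digits at all.
function parseCount(text) {
  // Drop everything that is not a digit (whitespace, separators, labels).
  const digits = String(text).replace(/\D/g, '');
  return digits === '' ? null : Number.parseInt(digits, 10);
}

console.log(parseCount(' 160 ')); // 160
console.log(parseCount('1.160')); // 1160
console.log(parseCount('n/a'));   // null
```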

Conclusion

I think Puppeteer's support for XPath is very useful for data scraping, since sometimes it's hard to write a CSS selector for a specific use case.

Thank you and I hope you enjoyed it. See you again in the next Practical Puppeteer series.

Source code of this sample is available on GitHub https://github.com/sonyarianto/xpath-on-puppeteer.git

Reference

https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#pagexexpression
https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#pagewaitforxpathxpath-options
https://pptr.dev
https://www.lamudi.co.id/newdevelopments/
https://en.wikipedia.org/wiki/XPath
Cover Photo by eberhard grossgasteiger from Pexels https://www.pexels.com/photo/countryside-daylight-grass-hd-wallpaper-568236/

Top comments (8)

Tommy • Feb 26 '23

You actually don't need this line:

let elHandle = await page.$x("(//span[@class='CountTitle-number'])[1]");

More concise:

const element = await page.waitForXPath("(//span[@class='CountTitle-number'])[1]");
const lamudiNewPropertyCount = await page.evaluate(el => el.textContent, element);

Sony AK • Apr 8 '23

Thanks @tohodo nice, noted

Raphael Schweikert • Jul 26 '21 • Edited

Thanks for this. I love XPath for these kinds of use-cases.
Yes, CSS selectors can be simpler and well-understood but they are also restricted on purpose to have good run-time characteristics to not bog down the browser for dynamic updates.
So there are lots of things you can do with XPath that are simply not possible with selectors (like finding text nodes or using axes to select up the tree instead of down).