Today I will share how to evaluate XPath expressions in Puppeteer using the $x API, and we will also use the waitForXPath API.
Before I learned Puppeteer, I mostly used XPath in PHP through its DOMXPath class, and I found it very useful for selecting elements. I feel more comfortable writing XPath expressions than CSS selectors. It's just my personal opinion, sorry :)
For those who don't know XPath, here is the definition according to Wikipedia:
XPath (XML Path Language) is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. XPath was defined by the World Wide Web Consortium (W3C).
In Puppeteer there are two APIs related to XPath. One is waitForXPath, which works like waitForSelector: it waits for an element to appear, but based on an XPath expression. The second is the $x method, which evaluates an XPath expression. $x returns an array of ElementHandle (an empty array when nothing matches), and I will show you a sample later.
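One caveat about these two methods: to my knowledge, page.$x and page.waitForXPath were deprecated and then removed in Puppeteer v22, where an XPath expression is passed to the regular selector methods with an "xpath/" prefix instead. The sketch below is a hypothetical helper (queryXPath is my own name, not a Puppeteer API) that uses page.$x when it exists and falls back to page.$$ otherwise. Please check the API docs of the Puppeteer version you have installed before relying on it.

```javascript
// Hypothetical helper, not part of Puppeteer: evaluate an XPath expression
// with whichever API the installed Puppeteer version offers.
async function queryXPath(page, expression) {
  if (typeof page.$x === 'function') {
    // Older releases: dedicated XPath method, resolves to ElementHandle[]
    return page.$x(expression);
  }
  // Assumption: newer releases accept XPath through the "xpath/" selector prefix
  return page.$$(`xpath/${expression}`);
}

// Usage inside an async function, with `page` coming from browser.newPage():
// const handles = await queryXPath(page, '//span[@class="CountTitle-number"]');
// if (handles.length === 0) console.log('no match');
```

Because the helper only receives the page object, the rest of a script does not need to know which Puppeteer version is running.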
Enough of the boring things. Let's start with a scenario. There is a website called Lamudi in Indonesia, https://www.lamudi.co.id/newdevelopments/, and I want to scrape a value from it based on the selector shown below.
Our target is this selector. I want to get the 160 value.
<span class="CountTitle-number">160</span>
Usually we can use a CSS selector like document.querySelector('span[class="CountTitle-number"]'), but as an alternative we will now use an XPath expression like this: //span[@class="CountTitle-number"].
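One detail worth knowing: @class="CountTitle-number" compares the whole class attribute, so it stops matching if the element ever carries a second class, while a CSS class selector such as span.CountTitle-number matches a single class token. A common XPath idiom for token matching pads the attribute with spaces. The sketch below builds such an expression; xpathForClass is a hypothetical helper name, and it assumes the class name contains no quote characters.

```javascript
// Hypothetical helper: build an XPath expression that matches elements
// carrying `className` as one of possibly several class tokens.
// Assumption: className contains no quote characters.
function xpathForClass(tag, className) {
  return `//${tag}[contains(concat(" ", normalize-space(@class), " "), " ${className} ")]`;
}

// Print the expression for the target element of this article
console.log(xpathForClass('span', 'CountTitle-number'));
```

For the page in this article the simpler exact match is enough, so this is only a fallback for markup where the class list may grow.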
In the Developer Tools console we can get this element easily. Try typing this in the Developer Tools console of your browser.
$x('//span[@class="CountTitle-number"]');
The result looks like the image below.
OK nice, the XPath expression finds the element we want (in the console, $x returns the matching DOM elements). Now let's create a script that uses Puppeteer to get this element's text content.
Preparation
npm i puppeteer
The code
The code is self-explanatory, and I hope you can adjust, expand or adapt it to your specific needs later.
File puppeteer_xpath.js
const puppeteer = require('puppeteer');

(async () => {
  // set some options (headless is false so we can watch
  // the automated browsing)
  let launchOptions = { headless: false, args: ['--start-maximized'] };

  const browser = await puppeteer.launch(launchOptions);
  const page = await browser.newPage();

  // set viewport and user agent (just for nicer viewing)
  await page.setViewport({width: 1366, height: 768});
  await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36');

  // go to the target website
  await page.goto('https://www.lamudi.co.id/newdevelopments/');

  // wait for the element defined by the XPath expression to appear in the page
  await page.waitForXPath("(//span[@class='CountTitle-number'])[1]");

  // evaluate the XPath expression of the target element (it returns an array of ElementHandle)
  let elHandle = await page.$x("(//span[@class='CountTitle-number'])[1]");

  // get the textContent of the first matched element (using page.evaluate)
  let lamudiNewPropertyCount = await page.evaluate(el => el.textContent, elHandle[0]);

  console.log('Total Property Number is:', lamudiNewPropertyCount);

  // close the browser
  await browser.close();
})();
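A possible variation on the script above, for Puppeteer versions that provide waitForXPath: the method itself resolves to the ElementHandle of the matched element, so the separate $x call is optional. It also rejects with a timeout error when nothing matches in time, which gives a way to branch on whether the element exists. The sketch below wraps both ideas. The name textByXPath and the 5000 ms default are hypothetical choices of mine, and detecting the timeout through err.name is an assumption about how Puppeteer names its error class.

```javascript
// Hypothetical helper: return the text content of the first element
// matching `xpath`, or null if it does not appear within `timeout` ms.
async function textByXPath(page, xpath, timeout = 5000) {
  let handle;
  try {
    // waitForXPath resolves to the ElementHandle once the element exists
    handle = await page.waitForXPath(xpath, { timeout });
  } catch (err) {
    // Assumption: Puppeteer's timeout error carries the name "TimeoutError"
    if (err.name === 'TimeoutError') return null;
    throw err; // anything else is a real failure, so pass it on
  }
  return page.evaluate(el => el.textContent, handle);
}

// Usage inside the async function of puppeteer_xpath.js:
// const count = await textByXPath(page, "(//span[@class='CountTitle-number'])[1]");
// if (count === null) { /* element is missing: do this */ }
// else { /* element exists: do that */ }
```

Returning null instead of throwing keeps the "exists or not" decision in plain if/else code at the call site.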
Run it
node puppeteer_xpath.js
If everything is OK, it will display a result like the one below.
Total Property Number is: 160
Conclusion
I think Puppeteer's support for XPath is very useful for data scraping, since it is sometimes hard to write a CSS selector for a specific use case.
Thank you and I hope you enjoyed it. See you again in the next Practical Puppeteer series.
The source code of this sample is available on GitHub: https://github.com/sonyarianto/xpath-on-puppeteer.git
Reference
- https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#pagexexpression
- https://github.com/puppeteer/puppeteer/blob/master/docs/api.md#pagewaitforxpathxpath-options
- https://pptr.dev
- https://www.lamudi.co.id/newdevelopments/
- https://en.wikipedia.org/wiki/XPath
- Cover Photo by eberhard grossgasteiger from Pexels https://www.pexels.com/photo/countryside-daylight-grass-hd-wallpaper-568236/
Top comments (8)
You actually don't need this line:
More concise:
Thanks @tohodo nice, noted
Thanks for this. I love XPath for these kinds of use-cases.
Yes, CSS selectors can be simpler and well-understood but they are also restricted on purpose to have good run-time characteristics to not bog down the browser for dynamic updates.
So there are lots of things you can do with XPath that are simply not possible with selectors (like finding text nodes or using axes to select up the tree instead of down).
totally agree with this, XPath to the rescue and full flexibility :)
thanks, this was helpful
you are welcome :)
Thank you so much, good sir!
I struggled for days to find information this well arranged and simply put
Thank you sir.
But if the XPath does not exist, is it possible to handle that? Can you help me?
Like
if it doesn't exist, do this
If it exists, do that
Thank you.