DEV Community 👩‍💻👨‍💻

Dendi Handian

Scraping a Web Page in Browser using XPath and JavaScript

As programmers, we should look for ways to automate our daily tasks whenever possible. For instance, when you need to gather a lot of data from a web page, you could do some simple web scraping rather than copying the text one item at a time.

The Case

I will demonstrate how to scrape the YouTube playlist of PyCon ID 2020 talks on this page: https://www.youtube.com/playlist?list=PLIv0V1YCmEi3A6H6mdsoxh4RDpzvnJpMq. The result will be a list of video titles.

[Image: PyCon ID 2020 playlist]

The XPath

XPath is a query language for selecting nodes/elements in an XML or HTML document. You can learn more about it from other resources such as W3Schools: https://www.w3schools.com/xml/xpath_intro.asp. A simple query for getting the nodes that contain the video titles is:

//a[@class="yt-simple-endpoint style-scope ytd-playlist-video-renderer"]

The XPath above matches the class attribute as an exact string, so it may stop working if the page structure changes in the future.

You can also try this yourself in the Chrome/Edge developer tools: open the Elements tab and press Ctrl + F to search using XPath. The search reports 39 matches, which seems to be right.

[Image: XPath search for the playlist]
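
Because the query above compares the whole class attribute as one string, it stops matching as soon as a class is added or reordered. A looser variant, sketched below, tests each class token separately with the standard XPath 1.0 functions contains(), concat() and normalize-space(). The helper name xpathWithClasses is hypothetical, and the choice of these two class tokens is an assumption. The looser query may match a different number of nodes on the live page, so the count should be checked again.

```javascript
// Hypothetical helper: builds an XPath that matches <tag> elements carrying
// ALL of the given class tokens, in any order and alongside other classes.
function xpathWithClasses(tag, classNames) {
  const predicates = classNames
    .map((name) => `contains(concat(" ", normalize-space(@class), " "), " ${name} ")`)
    .join(' and ');
  return `//${tag}[${predicates}]`;
}

// Assumption: these two tokens are enough to identify the title links.
const looseQuery = xpathWithClasses('a', ['yt-simple-endpoint', 'ytd-playlist-video-renderer']);
console.log(looseQuery);
```

The printed string can be pasted into the same Elements tab search to compare its match count with the exact-match query.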

The XPath Utility Function in the Browser Console

After finding the right XPath for the elements, open the Console tab in the browser developer tools to start typing some JavaScript. The console provides a built-in XPath utility function, $x(). It is a developer tools console helper rather than part of the JavaScript language, so it is only available in the console. We can pass the XPath string to the function and check the length of the result:

$x('//a[@class="yt-simple-endpoint style-scope ytd-playlist-video-renderer"]').length

[Image: XPath result count]
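
$x() exists only in the developer tools console. For code that runs elsewhere, for example in a userscript, roughly the same result could be obtained with the standard document.evaluate() API. The following is a minimal sketch: the helper name xpathAll is hypothetical, and the document is passed in explicitly so the function does not depend on the console.

```javascript
// Hypothetical helper: returns all nodes matching an XPath expression as an array,
// similar in spirit to the console-only $x().
function xpathAll(expression, doc) {
  // Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
  const ORDERED_NODE_SNAPSHOT_TYPE = 7;
  const snapshot = doc.evaluate(expression, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  const nodes = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    nodes.push(snapshot.snapshotItem(i));
  }
  return nodes;
}

// Only runs where a DOM is available (for example, in a browser page).
if (typeof document !== 'undefined' && typeof document.evaluate === 'function') {
  const query = '//a[@class="yt-simple-endpoint style-scope ytd-playlist-video-renderer"]';
  console.log(xpathAll(query, document).length);
}
```

A snapshot result is used here because it stays valid even if the page changes while the nodes are being read.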

If the output length matches the number of items we want to scrape, then the query works. Now we just need to get the list of titles and print it to the console:

$x('//a[@class="yt-simple-endpoint style-scope ytd-playlist-video-renderer"]').map(function(el){return el.text.trim()}).join("\n")

[Image: XPath result list]

The output in the console may look odd because of the \n characters. But when you copy the string contents and paste them into an editor like Visual Studio Code, you will get a clean result:

[Image: result strings]
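
As a possible shortcut for the copy step, the titles could be sent straight to the clipboard. This is only a sketch: titlesToText is a hypothetical helper, and it assumes the console exposes a copy() utility next to $x(), as the Chrome and Firefox developer tools do.

```javascript
// Hypothetical helper: turns a list of anchor-like objects into one title per line.
function titlesToText(anchors) {
  return anchors
    .map((el) => (el.textContent || '').trim()) // textContent is the standard DOM text property
    .filter((title) => title.length > 0)        // skip anchors without any text
    .join('\n');
}

// $x and copy are console utilities, so this part only runs in the developer tools console.
if (typeof $x === 'function' && typeof copy === 'function') {
  copy(titlesToText($x('//a[@class="yt-simple-endpoint style-scope ytd-playlist-video-renderer"]')));
}
```

After running it in the console, pasting into the editor should give the same one-title-per-line text without the visible \n.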

Hope this will be useful for you.

Top comments (4)

GrahamTheDev

You should check out DOM parser - developer.mozilla.org/en-US/docs/W...

It will then let you use normal CSS selectors, querySelectorAll, etc. to grab info, which 99% of the time will be far easier and more robust, as the XPath is far more likely to change on a document not using IDs.

Dendi Handian

Nice info, I will try it.

I usually work with Scrapy to scrape the web, so XPath is just a habit.

GrahamTheDev

I've never used Scrapy, but I can certainly understand the use of XPath, as I used it for ages. I find DOM parser much easier, but probably because I am used to CSS.

I enjoyed the article!

Prabhu

Learnt something new now
