Web Scraping in PHP using Goutte

Sayo Paul ・3 min read

Today I will be talking about something very common: web scraping. Depending on your needs or a client's needs, situations may arise where you need to extract data from a webpage.

What is Web Scraping?

According to WebHarvy, web scraping (also termed screen scraping, web data extraction, web harvesting, etc.) is a technique employed to extract large amounts of data from websites. In its simplest form, web scraping is getting the contents of a webpage via a script.

Alright, let's move on to web scraping in PHP. Recently, I needed to scrape a site for a client in PHP, so I looked for articles on the subject and found that there were few, and most of them were pretty outdated.

However, in my research, I came across Goutte, a (wonderful) screen scraping and web crawling library for PHP. At its core, Goutte is a wrapper around three of Symfony's components (God bless Fabien 🙌): BrowserKit, CssSelector and DomCrawler. It is important to understand what each of these components does, as it helps us understand just how powerful Goutte is.

BrowserKit
Simply put, the BrowserKit component simulates the behaviour of a real browser: it makes requests, follows links and keeps track of history. It is the foundational element of Goutte.
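
To get a feel for what "simulating a browser" means, here is a minimal sketch built directly on BrowserKit. It assumes that symfony/browser-kit, symfony/dom-crawler and symfony/css-selector are installed via Composer. The CannedBrowser class, its pages and the example.test URLs are all hypothetical: they serve canned HTML so that the example runs without touching the network, whereas a real client such as Goutte's would send actual HTTP requests.

```php
<?php
// Assumption: composer require symfony/browser-kit symfony/dom-crawler symfony/css-selector
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\AbstractBrowser;
use Symfony\Component\BrowserKit\Response;

// Hypothetical browser that answers from an in-memory list of pages.
final class CannedBrowser extends AbstractBrowser
{
    // path => HTML body (illustrative data only)
    private $pages = [
        '/'      => '<html><body><a href="/about">About</a></body></html>',
        '/about' => '<html><body><h1>About us</h1></body></html>',
    ];

    // BrowserKit calls this for every request; we return a canned response.
    protected function doRequest($request): Response
    {
        $path = parse_url($request->getUri(), PHP_URL_PATH) ?: '/';

        if (!isset($this->pages[$path])) {
            return new Response('Not found', 404);
        }

        return new Response($this->pages[$path], 200, ['Content-Type' => 'text/html']);
    }
}

$browser = new CannedBrowser();

// "Visit" the home page, then click the link the way a user would.
$browser->request('GET', 'http://example.test/');
$crawler = $browser->clickLink('About');

$heading = $crawler->filter('h1')->text();
$status  = $browser->getResponse()->getStatusCode();

echo $heading, PHP_EOL;
echo $status, PHP_EOL;
```

The point of the sketch is that request() and clickLink() both hand back a crawler for the page that was "loaded", which is what makes the other two components useful.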

DomCrawler
The DomCrawler component eases the navigation of the DOM (Document Object Model). The DomCrawler allows us to navigate the DOM like this:

    $crawler = $crawler->filter('body > p');

We can also traverse the nodes of the DOM using some of the methods that it provides. For example, if we want to get the first paragraph in the body of the page, we could do this:

    $crawler->filter('body > p')->eq(0);

The eq() method is zero-indexed and takes a number specifying the position of the element we want to access.
There are other methods such as siblings(), first() (an alias of eq(0); under the hood it just calls eq(0)), last(), etc.
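
Here is a small sketch of those traversal methods in action. It assumes symfony/dom-crawler and symfony/css-selector are installed via Composer, and it runs against a made-up HTML snippet rather than a real page, so the markup and variable names are purely illustrative.

```php
<?php
// Assumption: composer require symfony/dom-crawler symfony/css-selector
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup used only for this example.
$html = <<<'HTML'
<html>
  <body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <p>Third paragraph</p>
  </body>
</html>
HTML;

$crawler    = new Crawler($html);
$paragraphs = $crawler->filter('body > p');

$first     = $paragraphs->eq(0)->text();    // position 0 is the first match
$alsoFirst = $paragraphs->first()->text();  // same node as eq(0)
$last      = $paragraphs->last()->text();   // the final match

// siblings() returns the other element nodes that share the same parent.
$siblingCount = $paragraphs->eq(1)->siblings()->count();

echo $first, PHP_EOL;
echo $last, PHP_EOL;
echo $siblingCount, PHP_EOL;
```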

CssSelector
The CssSelector is a wonderful component that allows us to select elements via their CSS selectors. It does this by converting the CSS selectors to their XPath equivalents. So, for example, if we wanted to select elements with the class "fire", we could do this:

    $crawler->filter('.fire');

The CssSelector component is so amazing that it even supports attribute selectors such as:

    $crawler->filter('div[style*="max-height:175px; overflow: hidden;"]');

The above means that we are looking for a div element whose inline style attribute contains the text "max-height:175px; overflow: hidden;". The *= operator matches a substring, so the attribute may hold other declarations as well.
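
If you are curious about the conversion itself, the sketch below prints the XPath that the CssSelector component generates for ".fire", then runs the attribute selector against a made-up snippet. It assumes symfony/css-selector and symfony/dom-crawler are installed via Composer; the HTML is hypothetical and exists only to show one div that matches and one that does not.

```php
<?php
// Assumption: composer require symfony/dom-crawler symfony/css-selector
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\CssSelector\CssSelectorConverter;
use Symfony\Component\DomCrawler\Crawler;

// Convert a CSS selector into the XPath expression used under the hood.
$converter = new CssSelectorConverter();
$xpath     = $converter->toXPath('.fire');
echo $xpath, PHP_EOL;

// Hypothetical markup: only the first div carries the style fragment we want.
$html = <<<'HTML'
<html>
  <body>
    <div class="fire" style="color: red; max-height:175px; overflow: hidden;">Collapsed box</div>
    <div style="max-height:200px;">Other box</div>
  </body>
</html>
HTML;

$crawler = new Crawler($html);

// *= matches when the attribute value contains the given text anywhere.
$matches = $crawler->filter('div[style*="max-height:175px; overflow: hidden;"]');

echo $matches->count(), PHP_EOL;
echo $matches->text(), PHP_EOL;
```

Note that the first div still matches even though its style attribute starts with an extra "color: red;" declaration.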

For more information, please read the docs of DomCrawler, CssSelector and Goutte.

Alright, now that we have a bit of an idea about the three major components, it is time to bring everything together and actually scrape something. As you may have realised by now, when it comes to scraping, there is no single prescribed way to do it. You are free to explore and try out many ways to get your data. The only limit is your creativity. There have been times (a lot of times, actually) when I have had to combine CssSelector and DomCrawler in order to get what I want.
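
As a rough idea of what combining the two can look like, here is a sketch that uses a CSS selector to find links and the crawler's each() method to turn them into an array. It assumes symfony/dom-crawler and symfony/css-selector are installed via Composer. The markup, the "headlines" class and the URLs are invented for the example and are not taken from any real site.

```php
<?php
// Assumption: composer require symfony/dom-crawler symfony/css-selector
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical listing page used only for illustration.
$html = <<<'HTML'
<html>
  <body>
    <ul class="headlines">
      <li><a href="/news/1">Story one</a></li>
      <li><a href="/news/2">Story two</a></li>
    </ul>
    <a href="/contact">Contact</a>
  </body>
</html>
HTML;

$crawler = new Crawler($html);

// CssSelector narrows things down; each() walks the matched nodes one by one.
$stories = $crawler
    ->filter('ul.headlines > li > a')
    ->each(function (Crawler $link): array {
        return [
            'title' => $link->text(),
            'url'   => $link->attr('href'),
        ];
    });

foreach ($stories as $story) {
    echo $story['title'], ' => ', $story['url'], PHP_EOL;
}
```

The "Contact" link is left out because it sits outside the list that the selector targets.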

In the next post, we are going to put everything we have learnt so far into play by scraping the website of the Punch.

Discussion

robencom

I also had to use Goutte for web scraping lately, and it is doing the job just fine.

Though the documentation needs to be more elaborate.

Sayo Paul Author

I agree with you 💯. The documentation isn't very detailed and could be better. I had to test and browse multiple sources to discover some of its features.

robencom

It's great to find another developer who is using Goutte!

I managed to do my task with Goutte so far, but in case I face any difficulties, I'd like to ask for your help if that's ok with you.

Happy New Year!

Sayo Paul Author

Definitely, it is okay.

Happy New Year to you too.

Thank you 😊

Tuan Ha

Very useful, thanks

Rakesh Kumar

Your post looks good and explains Goutte nicely. But I have personally used "PHP Simple HTML DOM Parser" and the traditional file_get_contents with regex. I have also explained how you can create the scripting. You can check the link below.
postnidea.com/php-data-scraping-te...

Sayo Paul Author

That would be much easier, but it's limited to just extracting data and saving it in a format. No automation, etc. 😁 The majority of the time, my scraping needs require scripting.