DEV Community

Sayo Paul

Web Scraping in PHP using Goutte

Today I will be talking about something very common: web scraping. Depending on your needs or a client's needs, situations may arise where you need to extract data from a webpage.

What is Web Scraping?

According to WebHarvy, web scraping (also termed screen scraping, web data extraction, web harvesting, etc.) is a technique employed to extract large amounts of data from websites. In its simplest form, web scraping is getting the contents of a webpage via a script.

Alright, let's move on to web scraping in PHP. Recently, I needed to scrape a site for a client in PHP, so I looked for articles on the subject and found that there were few, and most of them were pretty outdated.

However, in my research, I came across Goutte, a (wonderful) screen scraping and web crawling library for PHP. At its core, Goutte is a wrapper around three of Symfony's components (God bless Fabien 🙌): BrowserKit, CssSelector and DomCrawler. It is important to understand what each of these components does, as it helps us see just how powerful Goutte is.

BrowserKit
Simply put, the BrowserKit component simulates the behaviour of a real browser. It is the foundational element of Goutte.
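
To make that a bit more concrete, here is a minimal sketch of a "browser" request. It assumes Goutte 4 installed through Composer, where Goutte\Client builds on Symfony's HttpBrowser and accepts an HTTP client in its constructor. The page content and the mock transport are my own stand-ins so the snippet runs without a network connection; they are not part of the original example.

```php
<?php
// Assumption: Goutte 4 installed with `composer require fabpot/goutte`.
require 'vendor/autoload.php';

use Goutte\Client;
use Symfony\Component\HttpClient\MockHttpClient;
use Symfony\Component\HttpClient\Response\MockResponse;

// Canned page standing in for a real site (hypothetical content).
$html = '<html><head><title>Demo page</title></head>'
      . '<body><p>Hello from the fake server</p></body></html>';

// The mock transport answers the request with the canned response,
// so nothing leaves your machine. Use `new Client()` to hit a real site.
$transport = new MockHttpClient(
    new MockResponse($html, ['response_headers' => ['content-type' => 'text/html']])
);

$client = new Client($transport);

// request() works like typing a URL into a browser: it sends the request
// and returns a DomCrawler instance wrapping the response body.
$crawler = $client->request('GET', 'https://example.com/');

$title = $crawler->filter('title')->text();
echo $title, PHP_EOL;
```

With a real site you would drop the mock transport and keep the rest unchanged.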

DomCrawler
The DomCrawler component eases navigation of the DOM (Document Object Model). The DomCrawler allows us to navigate the DOM like this:

    $crawler = $crawler->filter('body > p');

We can also traverse the nodes of the DOM using some of the methods that it provides. For example, to get the first paragraph in the body of the page we could do this:

    $crawler->filter('body > p')->eq(0);

The eq() method is zero-indexed and takes a number specifying the position of the element we want to access.
There are other methods such as siblings(), first() [an alias of eq(0); under the hood it just calls eq(0)], last(), etc.
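
Here is a small sketch that puts these traversal methods side by side. It assumes symfony/dom-crawler and symfony/css-selector are installed through Composer (both come along with Goutte), and the HTML is a made-up fragment used only for illustration.

```php
<?php
// Assumption: symfony/dom-crawler and symfony/css-selector are installed.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup, not taken from any real site.
$html = <<<'HTML'
<html>
  <body>
    <h1>News</h1>
    <p>First</p>
    <p>Second</p>
    <p>Third</p>
  </body>
</html>
HTML;

$crawler = new Crawler($html);
$paragraphs = $crawler->filter('body > p');

$count = $paragraphs->count();          // number of matched <p> nodes
$first = $paragraphs->eq(0)->text();    // zero-indexed access
$alias = $paragraphs->first()->text();  // same node as eq(0)
$last  = $paragraphs->last()->text();   // last matched node

// siblings() returns the other element children of the same parent,
// here the <h1> and the two remaining <p> elements.
$siblingCount = $paragraphs->eq(1)->siblings()->count();

printf(
    "%d paragraphs, first=%s, last=%s, siblings of the second=%d\n",
    $count,
    $first,
    $last,
    $siblingCount
);
```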

CssSelector
The CssSelector is a wonderful component that allows us to select elements via their CSS selectors. It does this by converting the CSS selectors to their XPath equivalents. So, for example, if we wanted to select an element with a class called "fire", we could do this:

    $crawler->filter('.fire');
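
If you are curious about the XPath that a selector turns into, the component can be used on its own. The sketch below assumes symfony/css-selector is installed through Composer and simply prints the translated expressions; the variable names are mine.

```php
<?php
// Assumption: symfony/css-selector is installed via Composer.
require 'vendor/autoload.php';

use Symfony\Component\CssSelector\CssSelectorConverter;

$converter = new CssSelectorConverter();

// Translate CSS selectors into XPath, the same conversion that
// Crawler::filter() relies on before querying the document.
$classXPath = $converter->toXPath('.fire');
$childXPath = $converter->toXPath('body > p');

echo $classXPath, PHP_EOL;
echo $childXPath, PHP_EOL;
```

Printing the result is a handy way to debug a selector that does not match what you expect.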

The CssSelector component is so amazing that it even supports attribute selectors such as:

    $crawler->filter('div[style*="max-height:175px; overflow: hidden;"]');

The above means that we are looking for a div element whose inline style attribute contains the text "max-height:175px; overflow: hidden;" (the *= operator is a substring match, so other declarations may surround it).
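
A quick way to convince yourself of how that selector behaves is to run it against a few divs. This is only a sketch on a made-up HTML fragment, assuming symfony/dom-crawler and symfony/css-selector are installed through Composer.

```php
<?php
// Assumption: symfony/dom-crawler and symfony/css-selector are installed.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup: only the first div carries both declarations
// in exactly the form the selector looks for.
$html = <<<'HTML'
<html><body>
  <div style="color: red; max-height:175px; overflow: hidden;">Clipped box</div>
  <div style="max-height:175px;">Only max-height</div>
  <div class="fire">No inline style</div>
</body></html>
HTML;

$crawler = new Crawler($html);

// [style*="..."] matches when the attribute value contains the given text.
$matches = $crawler->filter('div[style*="max-height:175px; overflow: hidden;"]');

$matchCount = $matches->count();
$matchText  = $matches->text();

echo $matchCount, ' match: ', $matchText, PHP_EOL;
```

Keep in mind that the match is on the literal text, so a page that writes the same style with different spacing would not be selected.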

For more information, please read the docs of DomCrawler, CssSelector and Goutte.

Alright, now that we have a bit of an idea about the three major components, it is time to bring everything together and actually scrape something. As you may have realised by now, when it comes to scraping, there is no single prescribed way to do it. You are free to explore and try out many ways to get your data. The only limit is your creativity. There are times when I have had to combine the CssSelector and DomCrawler in order to get what I want [actually, a lot of times].
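
As one possible illustration of combining the two, the sketch below selects blocks with a CSS selector and then uses DomCrawler's each(), text() and attr() to pull structured data out of every block. The markup, class names and array keys are all hypothetical; it assumes symfony/dom-crawler and symfony/css-selector are installed through Composer.

```php
<?php
// Assumption: symfony/dom-crawler and symfony/css-selector are installed.
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical listing page with two posts and one advert.
$html = <<<'HTML'
<html><body>
  <div class="post"><h2><a href="/a">Alpha</a></h2><span class="date">2019-01-01</span></div>
  <div class="post"><h2><a href="/b">Beta</a></h2><span class="date">2019-01-02</span></div>
  <div class="ad"><a href="/buy">Sponsored</a></div>
</body></html>
HTML;

$crawler = new Crawler($html);

// The CSS selector narrows the document down to the post blocks;
// each() then hands every block to the callback as its own Crawler.
$posts = $crawler->filter('div.post')->each(function (Crawler $post) {
    $link = $post->filter('h2 > a');

    return [
        'title' => $link->text(),
        'url'   => $link->attr('href'),
        'date'  => $post->filter('.date')->text(),
    ];
});

print_r($posts);
```

The advert is skipped because it never matches div.post, which is usually simpler than filtering it out afterwards.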

In the next post, we are going to put everything we have learnt so far into practice by scraping the website of the Punch.

Top comments (7)

robencom

I also had to use Goutte for web scraping lately, it is doing the job just fine.

Though the documentation needs to be more elaborate..

Sayo Paul

I agree with you 💯. The documentation isn't very detailed and can be better. I had to test and browse multiple sources to discover some of its features.

robencom

It's great to find another developer who is using Goutte!

I managed to do my task with Goutte so far, but in case I face any difficulties, I'd like to ask for your help if that's ok with you.

Happy New year!

Sayo Paul

Definitely, it is okay.

Happy new year to you too.

Thank you 😊

Tuan Ha

Very useful, thanks

Rakesh Kumar

Your post looks good and explains Goutte nicely. But I have personally used "PHP Simple HTML DOM Parser" and traditional file_get_contents with regex. I have also explained how you can write such scripts. You can check the link below.
postnidea.com/php-data-scraping-te...

Sayo Paul

Would be much easier, but it's limited to just extracting data and saving in a format. No automation etc. The majority of the time, my scraping needs require scripting.