DEV Community: Sayo Paul

Web Scraping in PHP using Goutte - part 2

Sayo Paul — Fri, 28 Dec 2018 18:05:17 +0000

Web Scraping in PHP using Goutte II

In the last article, we were introduced to web scraping and looked at Goutte, a wonderful PHP web scraping library. In this article, we will put that knowledge into practice by scraping the website of the Punch. To be more specific, we will scrape the Punch's news page (https://punchng.com/topics/news) to get the latest headlines 😎.

Let's get right into it 💪!

NB: This is for testing purposes only. I do not in any way intend to reproduce the material obtained from the Punch, and I do not advise you to do so, as that would be copyright infringement.

First things first, we set up Composer autoloading, import the Goutte namespace and we instantiate a new Goutte Client:

    // Load Composer's autoloader so the Goutte classes can be found
    require "vendor/autoload.php";

    use Goutte\Client;

    $client = new Client();

The next step is to send a request via the $client object. The request() method returns a Crawler instance, and it is this instance that we use to apply our filters.

    $crawler = $client->request('GET', 'https://punchng.com/topics/news');

The Punch news page is made up of article boxes. Each article has its own box and a heading (the headline) with the class "seg-title". We want to select all the headlines (.seg-title) on the page and then go through them one by one. We do it like this:

    $crawler->filter('.seg-title')->each(function ($node) {
        // $node wraps one matched headline
    });

Notice the each() method? It allows us to iterate over the current selection (the node list) when it contains more than one node. As mentioned above, we are selecting every headline (.seg-title), so we have more than one node and we want to iterate through them. Under the hood, each() accepts a closure, loops through the current node list and passes one node to the closure on each iteration, which is what gives us access to the current node ($node) inside the closure. In the DomCrawler source, it looks roughly like this:

    public function each(\Closure $closure)
    {
        $data = array();
        foreach ($this->nodes as $i => $node) {
            $data[] = $closure($this->createSubCrawler($node), $i);
        }

        return $data;
    }
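Because each() collects whatever the closure returns, it can also be used to build an array rather than doing the work inside the loop. Here is a minimal sketch of that idea, assuming Goutte (and with it Symfony's DomCrawler and CssSelector) is installed via Composer. The inline HTML is made up for illustration and only mimics the .seg-title headings; it is not the Punch's actual markup.

```php
<?php
require "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

// Made-up markup standing in for the real page (an assumption for this sketch)
$html = <<<HTML
<html><body>
  <h2 class="seg-title">First headline</h2>
  <h2 class="seg-title">Second headline</h2>
  <h2 class="other">Not a headline</h2>
</body></html>
HTML;

$crawler = new Crawler($html);

// each() hands the closure a sub-crawler plus its index and returns the collected results.
// text() gives the textual content of the node.
$headlines = $crawler->filter('.seg-title')->each(function (Crawler $node, $i) {
    return ($i + 1) . '. ' . trim($node->text());
});

echo implode(PHP_EOL, $headlines) . PHP_EOL;
```

Returning values this way keeps the scraping step separate from whatever we do with the data afterwards (printing, saving to a database and so on).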

Alright, the next thing we want to do is extract the text from the current node.

    $crawler->filter('.seg-title')->each(function ($node) {
        $headline = $node->text();
        // Print each headline on its own line
        echo $headline . PHP_EOL;
    });

We get the textual content of the node by calling the text() method. The next thing we do is print out the headline, and there we have it! Whenever we run this script, we get the latest 10 news headlines on the Punch printed out to us. Like I said in the previous article, when it comes to scraping, almost anything is possible (even logging in and filling forms). The limit is your mind 😊. I honestly wish we could go deeper but sadly that's all for now 😅.

For more information, please read the docs of DomCrawler, CssSelector and Goutte.

Do you have any web scraping needs? You can hire me to help you out.

Web Scraping in PHP using Goutte

Sayo Paul — Fri, 28 Dec 2018 18:04:32 +0000

Today I will be talking about something very common: web scraping. Depending on your needs or a client's needs, situations may arise where you need to extract data from a webpage.

What is Web Scraping?

According to WebHarvy, Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting etc.) is a technique employed to extract large amounts of data from websites. In its simplest form, web scraping is getting the contents of a webpage via a script. Alright, let's move on to web scraping in PHP. Recently, I needed to scrape a site for a client in PHP, so I looked for articles on the subject and found that there were few of them and that most were pretty outdated.

However, in my research, I came across Goutte, a (wonderful) screen scraping and web crawling library for PHP. At its core, Goutte is a wrapper around three of Symfony's components (God bless Fabien 🙌): BrowserKit, CssSelector and DomCrawler. It is important to understand what each of these components does, as that helps us see just how powerful Goutte is.

BrowserKit:
Simply put, the BrowserKit component simulates the behaviour of a real browser. It is the foundational element of Goutte.

DomCrawler:
The DomCrawler component eases the navigation of the DOM (Document Object Model). It allows us to navigate the DOM like this:

    $crawler = $crawler->filter('body > p');

We can also traverse the nodes of the DOM using some of the methods that it provides. For example, if we want to get the first paragraph in the body of the page, we could do this:

    $crawler->filter('body > p')->eq(0);

The eq() method is zero-indexed and takes a number specifying the position of the element we want to access.
There are other methods such as siblings(), first() [an alias of eq(0); under the hood it just calls eq(0)], last() etc.
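To make these traversal methods a bit more concrete, here is a small sketch that runs them against a made-up three-paragraph document. It assumes Goutte's dependencies are installed via Composer; the HTML and variable names are only for illustration.

```php
<?php
require "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

// Made-up document for illustration
$html = '<html><body><p>one</p><p>two</p><p>three</p></body></html>';
$paragraphs = (new Crawler($html))->filter('body > p');

$first = $paragraphs->eq(0)->text();   // zero-indexed: position 0 is the first match
$same  = $paragraphs->first()->text(); // first() selects the same node as eq(0)
$last  = $paragraphs->last()->text();  // the final match in the selection

// siblings() returns the other elements that share the node's parent
$others = $paragraphs->eq(1)->siblings()->each(function (Crawler $node) {
    return $node->text();
});

echo "$first, $same, $last" . PHP_EOL;
echo implode(', ', $others) . PHP_EOL;
```

Each of these methods returns a new Crawler, so they can be chained with filter() and with each other.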

CssSelector:
The CssSelector is a wonderful component that allows us to select elements via CSS selectors. It does this by converting the CSS selectors to their XPath equivalents. For example, say we wanted to select an element with a class called "fire"; we could do this:

    $crawler->filter('.fire');
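If you are curious about the XPath that a selector turns into, the CssSelector component can be used on its own. The sketch below assumes Goutte's dependencies are installed via Composer and uses made-up markup; it prints the generated expression and checks that filtering by CSS and filtering by that XPath find the same number of elements.

```php
<?php
require "vendor/autoload.php";

use Symfony\Component\CssSelector\CssSelectorConverter;
use Symfony\Component\DomCrawler\Crawler;

// Made-up markup: two elements carry the "fire" class, "fireplace" must not match
$html = '<html><body>'
      . '<p class="fire">a</p>'
      . '<p class="water fire">b</p>'
      . '<p class="fireplace">c</p>'
      . '</body></html>';
$crawler = new Crawler($html);

// Translate the CSS selector into an XPath expression
$xpath = (new CssSelectorConverter())->toXPath('.fire');
echo $xpath . PHP_EOL;

// Count the matches found through each route
$viaCss   = $crawler->filter('.fire')->count();
$viaXPath = $crawler->filterXPath($xpath)->count();
echo "$viaCss via CSS, $viaXPath via XPath" . PHP_EOL;
```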

The CssSelector component is so amazing that it even supports attribute selectors such as:

    $crawler->filter('div[style*="max-height:175px; overflow: hidden;"]');

The above means that we are looking for a div element whose inline style attribute contains the text "max-height:175px; overflow: hidden;" (the *= operator matches a substring of the attribute value, not the whole value).
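The difference between a substring match (*=) and an exact match (=) is easy to see on a small example. This is only a sketch, assuming Goutte's dependencies are installed via Composer; the markup and the $needle variable are made up for illustration.

```php
<?php
require "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

// Made-up markup: one div has exactly the style we want, one merely contains it
$html = '<html><body>'
      . '<div style="max-height:175px; overflow: hidden;">exact</div>'
      . '<div style="color: red; max-height:175px; overflow: hidden; margin: 0;">partial</div>'
      . '<div style="color: red;">none</div>'
      . '</body></html>';
$crawler = new Crawler($html);

$needle = 'max-height:175px; overflow: hidden;';

// *= matches when the attribute value contains the text anywhere
$contains = $crawler->filter('div[style*="' . $needle . '"]')->count();

// = matches only when the whole attribute value is identical
$exact = $crawler->filter('div[style="' . $needle . '"]')->count();

echo "contains: $contains, exact: $exact" . PHP_EOL;
```

Matching on inline styles is sensitive to spacing, so on a real page a class or id is usually a sturdier hook when one is available.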

For more information, please read the docs of DomCrawler, CssSelector and Goutte.

Alright, now that we have a bit of an idea about the three major components, it is time to bring everything together and actually scrape something. As you may have realised by now, when it comes to scraping, there is no laid-down way to do it. You are free to explore and try out many ways to get your data. The only limit is your creativity. There are times when I have had to combine the CssSelector and DomCrawler in order to get what I want [actually, a lot of times].

In the next post, we are going to put everything we have learnt so far into play by scraping the website of the Punch.

Hi, I'm Sayo Paul

Sayo Paul — Mon, 13 Mar 2017 01:10:03 +0000

I have been coding for 1 year +.

You can find me on GitHub as sayopaul

I live in Port Harcourt, Nigeria.

I am a student at FUNAAB

I mostly program in these languages: PHP.

I am currently learning more about Laravel.

Nice to meet you.