<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sayo Paul</title>
    <description>The latest articles on DEV Community by Sayo Paul (@sayopaul).</description>
    <link>https://dev.to/sayopaul</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F10562%2F21235901.jpeg</url>
      <title>DEV Community: Sayo Paul</title>
      <link>https://dev.to/sayopaul</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sayopaul"/>
    <language>en</language>
    <item>
      <title>Web Scraping in PHP using Goutte - part 2</title>
      <dc:creator>Sayo Paul</dc:creator>
      <pubDate>Fri, 28 Dec 2018 18:05:17 +0000</pubDate>
      <link>https://dev.to/sayopaul/web-scraping-in-php-using-goutte---part-2-5e66</link>
      <guid>https://dev.to/sayopaul/web-scraping-in-php-using-goutte---part-2-5e66</guid>
      <description>

&lt;h1&gt;
  
  
  Web Scraping in PHP using Goutte II
&lt;/h1&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/sayopaul/web-scraping-in-php-using-goutte--4p6h"&gt;last article&lt;/a&gt;, we were introduced to web scraping and looked at &lt;a href="https://github.com/FriendsOfPHP/Goutte"&gt;Goutte&lt;/a&gt;, a wonderful PHP web scraping library. In this article, we will put that knowledge into practice by scraping the website of the &lt;a href="https://punchng.com"&gt;Punch&lt;/a&gt;. To be more specific, we will scrape the latest news headlines from &lt;a href="https://punchng.com/topics/news"&gt;https://punchng.com/topics/news&lt;/a&gt; 😎.&lt;/p&gt;

&lt;p&gt;Let's get right into it 💪 !&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NB: This is for testing purposes only. I do not in any way intend to reproduce the material gotten from the Punch, and I do not advise you to do so, as that would be copyright infringement&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;First things first, we set up Composer autoloading, import the Goutte namespace and we instantiate a new Goutte Client:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;    &lt;span class="k"&gt;require&lt;/span&gt; &lt;span class="s2"&gt;"vendor/autoload.php"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nx"&gt;Goutte\Client&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nv"&gt;$client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The next step is to send a request via the &lt;code&gt;$client&lt;/code&gt; object. Its &lt;code&gt;request()&lt;/code&gt; method returns a crawler instance, and it is this instance that we apply our filters to.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;     &lt;span class="nv"&gt;$crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;$client&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'GET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"https://punchng.com/topics/news"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The Punch news page is made up of article boxes. Each article has its own box and a heading (the headline) with the class "seg-title". We want to select all the headlines (&lt;code&gt;.seg-title&lt;/code&gt;) on the page and then take each of them one by one. We do it with this:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;     &lt;span class="nv"&gt;$crawler&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'.seg-title'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;each&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$node&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;


     &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Notice the method &lt;code&gt;each()&lt;/code&gt;? The &lt;code&gt;each()&lt;/code&gt; method allows us to iterate over the current selection (node list). As we mentioned above, we are selecting all the headlines (&lt;code&gt;.seg-title&lt;/code&gt;), so we have more than one node and we want to iterate through them. Under the hood, &lt;code&gt;each()&lt;/code&gt; accepts a closure (an anonymous function), loops through the current node list and, on each iteration, passes the closure a crawler wrapping the current node along with its index. That is what lets us access the current node (&lt;code&gt;$node&lt;/code&gt;) in the closure. Here is how it is implemented in DomCrawler:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;     &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;each&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;\Closure&lt;/span&gt; &lt;span class="nv"&gt;$closure&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nv"&gt;$data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;array&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
          &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;nodes&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="nv"&gt;$data&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;$closure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$this&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;createSubCrawler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$node&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nv"&gt;$i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;

          &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nv"&gt;$data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Alright, the next thing we want to do is extract the text from the current node .&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;     &lt;span class="nv"&gt;$crawler&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'.seg-title'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;each&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$node&lt;/span&gt;&lt;span class="p"&gt;){&lt;/span&gt;
         &lt;span class="nv"&gt;$headline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;$node&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
         &lt;span class="k"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$headline&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
     &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We get the textual content of the node by calling the &lt;code&gt;text()&lt;/code&gt; method. The next thing we do is print out the headline, and there we have it! Whenever we run this script, we should get the latest 10 news headlines on the Punch printed out to us. Like I said in the previous article, when it comes to scraping, almost anything is possible (even logging in and filling forms). The limit is your mind 😊. I honestly wish we could go deeper but sadly that's all for now 😅.&lt;/p&gt;
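
&lt;p&gt;Because &lt;code&gt;each()&lt;/code&gt; returns an array of whatever the closure returns (see &lt;code&gt;return $data;&lt;/code&gt; in its source above), we could also collect the headlines into an array instead of echoing them. Below is a rough sketch of that idea. To keep it runnable without hitting the network, it builds a DomCrawler &lt;code&gt;Crawler&lt;/code&gt; from a made-up HTML string; the markup and headline text are my own placeholders, not the real Punch page.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;    &amp;lt;?php
    require "vendor/autoload.php";

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical markup standing in for the news page.
    $html = '&amp;lt;h2 class="seg-title"&amp;gt; First headline &amp;lt;/h2&amp;gt;'
          . '&amp;lt;h2 class="seg-title"&amp;gt; Second headline &amp;lt;/h2&amp;gt;';

    $crawler = new Crawler($html);

    // each() returns an array made of the closure's return values.
    $headlines = $crawler-&amp;gt;filter('.seg-title')-&amp;gt;each(function ($node) {
        return trim($node-&amp;gt;text());
    });

    // Print the collected headlines as a numbered list.
    foreach ($headlines as $position =&amp;gt; $headline) {
        echo ($position + 1) . ". " . $headline . PHP_EOL;
    }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;With the real page, the same closure should work on the &lt;code&gt;$crawler&lt;/code&gt; returned by &lt;code&gt;$client-&amp;gt;request()&lt;/code&gt;, and the resulting array could then be saved or processed further.&lt;/p&gt;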

&lt;p&gt;For more information, please read the docs of &lt;a href="https://symfony.com/doc/current/components/dom_crawler.html"&gt;DomCrawler&lt;/a&gt;, &lt;a href="https://symfony.com/doc/current/components/css_selector.html"&gt;CssSelector&lt;/a&gt; and &lt;a href="https://github.com/FriendsOfPHP/Goutte"&gt;Goutte&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Do you have any web scraping needs? You can hire me to help you out &lt;a href="https://www.fiverr.com/sayopaul/scrape-webpage-for-you-and-save-it-in-any-format-in-php"&gt;here&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>webscraping</category>
      <category>goutte</category>
      <category>phpscraping</category>
    </item>
    <item>
      <title>Web Scraping in PHP using Goutte </title>
      <dc:creator>Sayo Paul</dc:creator>
      <pubDate>Fri, 28 Dec 2018 18:04:32 +0000</pubDate>
      <link>https://dev.to/sayopaul/web-scraping-in-php-using-goutte--4p6h</link>
      <guid>https://dev.to/sayopaul/web-scraping-in-php-using-goutte--4p6h</guid>
      <description>

&lt;h1&gt;
  
  
  Web Scraping in PHP using Goutte
&lt;/h1&gt;

&lt;p&gt;Today I will be talking about something very common: web scraping. Depending on your needs or a client's needs, situations may arise where you need to extract data from a webpage.&lt;/p&gt;

&lt;p&gt;What is Web Scraping?&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://www.webharvy.com"&gt;WebHarvy&lt;/a&gt;, Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting etc.) is a technique employed to extract large amounts of data from websites. In its simplest form, web scraping is getting the contents of a webpage via a script. Alright, let's move on to web scraping in PHP. Recently, I needed to scrape a site for a client in PHP, so I looked for articles on the subject and found that there were few, and most of them were pretty outdated.&lt;/p&gt;

&lt;p&gt;However, in my research, I came across &lt;a href="https://github.com/FriendsOfPHP/Goutte"&gt;Goutte&lt;/a&gt;, a (wonderful) screen scraping and web crawling library for PHP. At its core, Goutte is a wrapper around three of Symfony's components (God bless Fabien 🙌): BrowserKit, CssSelector and DomCrawler. It is important for us to understand what each of these components does, as it helps us to understand just how powerful Goutte is.&lt;/p&gt;

&lt;p&gt;BrowserKit:&lt;br&gt;
 Simply put, the BrowserKit component simulates the behaviour of a real browser. It is the foundational element of Goutte.&lt;/p&gt;

&lt;p&gt;DomCrawler:&lt;br&gt;
 The DomCrawler component eases the navigation of the DOM (Document Object Model). The DomCrawler allows us to navigate the DOM like this:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;    &lt;span class="nv"&gt;$crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;$crawler&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'body &amp;gt; p'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We can also traverse through nodes on the DOM using some of the methods that it provides . For example, if we want to get the first paragraph in the body of the page we could do this:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;    &lt;span class="nv"&gt;$crawler&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'body &amp;gt; p'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;eq()&lt;/code&gt; method is zero-indexed and takes a number specifying the position of the element we want to access.&lt;br&gt;
There are other methods such as &lt;code&gt;siblings()&lt;/code&gt;, &lt;code&gt;first()&lt;/code&gt; [an alias of &lt;code&gt;eq(0)&lt;/code&gt;; under the hood it just calls &lt;code&gt;eq(0)&lt;/code&gt;], &lt;code&gt;last()&lt;/code&gt; etc.&lt;/p&gt;
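
&lt;p&gt;To see these traversal methods side by side, here is a small sketch. It assumes the DomCrawler and CssSelector packages that Goutte pulls in via Composer, and it runs against a made-up three-paragraph HTML string of my own rather than a real page.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;    &amp;lt;?php
    require "vendor/autoload.php";

    use Symfony\Component\DomCrawler\Crawler;

    // Made-up markup, just for illustration.
    $crawler = new Crawler('&amp;lt;body&amp;gt;&amp;lt;p&amp;gt;one&amp;lt;/p&amp;gt;&amp;lt;p&amp;gt;two&amp;lt;/p&amp;gt;&amp;lt;p&amp;gt;three&amp;lt;/p&amp;gt;&amp;lt;/body&amp;gt;');
    $paragraphs = $crawler-&amp;gt;filter('body &amp;gt; p');

    echo $paragraphs-&amp;gt;first()-&amp;gt;text() . PHP_EOL; // same node as eq(0)
    echo $paragraphs-&amp;gt;eq(1)-&amp;gt;text() . PHP_EOL;   // second paragraph (zero-indexed)
    echo $paragraphs-&amp;gt;last()-&amp;gt;text() . PHP_EOL;  // final paragraph

    // siblings() selects the other elements that share the node's parent.
    echo $paragraphs-&amp;gt;eq(1)-&amp;gt;siblings()-&amp;gt;count() . PHP_EOL;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
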

&lt;p&gt;CssSelector:&lt;br&gt;
The CssSelector is a wonderful component that allows us to select elements via CSS selectors. It does this by converting the CSS selectors to their XPath equivalents. So, for example, say we wanted to select an element with a class called "fire"; we could do this:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;    &lt;span class="nv"&gt;$crawler&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'.fire'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
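
&lt;p&gt;If you are curious about the CSS-to-XPath conversion mentioned above, the CssSelector component can be used on its own to print the XPath it generates. This is only a sketch, assuming a CssSelector version that ships the &lt;code&gt;CssSelectorConverter&lt;/code&gt; class; the exact XPath string may vary between versions, so no output is shown here.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;    &amp;lt;?php
    require "vendor/autoload.php";

    use Symfony\Component\CssSelector\CssSelectorConverter;

    $converter = new CssSelectorConverter();

    // Print the XPath expression generated for each CSS selector.
    foreach (['.fire', 'body &amp;gt; p'] as $selector) {
        echo $selector . ' =&amp;gt; ' . $converter-&amp;gt;toXPath($selector) . PHP_EOL;
    }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
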



&lt;p&gt;The CssSelector component is so amazing that it even supports attribute selectors such as:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;    &lt;span class="nv"&gt;$crawler&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="na"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'div[style*="max-height:175px; overflow: hidden;"]'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The above means that we are looking for a div element whose inline style attribute contains the text &lt;code&gt;max-height:175px; overflow: hidden;&lt;/code&gt; (the &lt;code&gt;*=&lt;/code&gt; operator matches on a substring of the attribute value).&lt;/p&gt;
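
&lt;p&gt;As a quick check of how that attribute selector behaves, here is a sketch run against two made-up div elements (the markup is my own assumption). Only the first div carries the target text somewhere inside its style attribute, so it should be the only one selected.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight php"&gt;&lt;code&gt;    &amp;lt;?php
    require "vendor/autoload.php";

    use Symfony\Component\DomCrawler\Crawler;

    // Hypothetical markup: the first div has extra declarations around the target text.
    $html = '&amp;lt;div style="color:red; max-height:175px; overflow: hidden;"&amp;gt;clipped&amp;lt;/div&amp;gt;'
          . '&amp;lt;div style="color:red;"&amp;gt;plain&amp;lt;/div&amp;gt;';

    $crawler = new Crawler($html);

    // *= matches when the attribute value contains the given text anywhere.
    $matches = $crawler-&amp;gt;filter('div[style*="max-height:175px; overflow: hidden;"]');

    echo $matches-&amp;gt;count() . PHP_EOL;
    echo $matches-&amp;gt;text() . PHP_EOL;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
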

&lt;p&gt;For more information, please read the docs of &lt;a href="https://symfony.com/doc/current/components/dom_crawler.html"&gt;DomCrawler&lt;/a&gt;, &lt;a href="https://symfony.com/doc/current/components/css_selector.html"&gt;CssSelector&lt;/a&gt; and &lt;a href="https://github.com/FriendsOfPHP/Goutte"&gt;Goutte&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Alright, now that we have a bit of an idea about the three major components, it is time for us to bring everything together and actually scrape something. As you may have realised by now, when it comes to scraping, there is no laid-down way to do it. You are free to explore and try out many ways to get your data. The only limit you have is your creativity. There are times when I have had to combine the CssSelector and DomCrawler in order to get what I want [actually, a lot of times].&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/sayopaul/web-scraping-in-php-using-goutte---part-2-5e66"&gt;next post&lt;/a&gt; we are going to put everything that we have learnt so far into play by scraping the website of the &lt;a href="https://punchng.com"&gt;Punch&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>webscraping</category>
      <category>goutte</category>
      <category>phpscraping</category>
    </item>
    <item>
      <title>Hi, I'm Sayo Paul</title>
      <dc:creator>Sayo Paul</dc:creator>
      <pubDate>Mon, 13 Mar 2017 01:10:03 +0000</pubDate>
      <link>https://dev.to/sayopaul/hi-im-sayo-paul</link>
      <guid>https://dev.to/sayopaul/hi-im-sayo-paul</guid>
      <description>&lt;p&gt;I have been coding for 1 year +.&lt;/p&gt;

&lt;p&gt;You can find me on GitHub as &lt;a href="https://github.com/sayopaul" rel="noopener noreferrer"&gt;sayopaul&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I live in Port Harcourt, Nigeria.&lt;/p&gt;

&lt;p&gt;I am a student at FUNAAB&lt;/p&gt;

&lt;p&gt;I mostly program in these languages: PHP.&lt;/p&gt;

&lt;p&gt;I am currently learning more about Laravel.&lt;/p&gt;

&lt;p&gt;Nice to meet you.&lt;/p&gt;

</description>
      <category>introduction</category>
    </item>
  </channel>
</rss>
