PHP Web Scraping (2 Part Series)
Today I'll be talking about something very common: web scraping. Depending on your needs or a client's needs, situations may arise where you need to extract data from a webpage.
What is Web Scraping?
According to WebHarvy, Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting etc.) is a technique employed to extract large amounts of data from websites. In its simplest form, web scraping is getting the contents of a webpage via a script.
Alright, let's move on to web scraping in PHP. Recently, I needed to scrape a site for a client in PHP, so I looked for articles about web scraping in PHP and found that there were few, and most of them were pretty outdated.
However, in my research, I came across Goutte, a (wonderful) screen scraping and web crawling library for PHP. At its core, Goutte is a wrapper around three of Symfony's components (God bless Fabien 🙌): BrowserKit, CssSelector and DomCrawler. It is important to understand what each of these components does, because that shows just how powerful Goutte is.
Simply put, the BrowserKit component simulates the behaviour of a real browser. It is the foundational element of Goutte.
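To make the "simulates a browser" idea a bit more concrete, here is a minimal sketch that uses BrowserKit directly rather than through Goutte. It assumes the symfony/browser-kit, symfony/http-client, symfony/dom-crawler and symfony/css-selector packages are installed via Composer. The HTML and the example.test URL are made up, and a mock HTTP client stands in for a real site so nothing touches the network.

```php
<?php
// Assumption: Composer autoloader with symfony/browser-kit, symfony/http-client,
// symfony/dom-crawler and symfony/css-selector installed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\MockHttpClient;
use Symfony\Component\HttpClient\Response\MockResponse;

// Hypothetical page content; a canned response replaces a real HTTP call.
$html = '<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>';
$httpClient = new MockHttpClient(new MockResponse($html, [
    'response_headers' => ['Content-Type' => 'text/html; charset=UTF-8'],
]));

// The browser "visits" the URL and hands back a DomCrawler instance for the response.
$browser = new HttpBrowser($httpClient);
$crawler = $browser->request('GET', 'https://example.test/');

// Read something from the page to show the crawler is populated.
$title = $crawler->filter('title')->text();
echo $title, PHP_EOL;
```

Swapping the mock client for a real one would send an actual request; the rest of the code would stay the same.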
The DomCrawler component eases navigation of the DOM (Document Object Model). It allows us to navigate the DOM like this:
$crawler = $crawler->filter('body > p');
We can also traverse the nodes of the DOM using some of the methods it provides. For example, if we want the first paragraph in the body of the page, we could do this:
$crawler->filter('body > p')->eq(0);
The eq() method is zero-indexed; it takes a number specifying the position of the element we want to access.
There are other methods such as first() [an alias of eq(0); under the hood it just calls eq(0)], last() etc.
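Here is a small sketch of those traversal methods side by side. It assumes symfony/dom-crawler and symfony/css-selector are installed via Composer, and the three-paragraph HTML snippet is invented purely for illustration.

```php
<?php
// Assumption: Composer autoloader with symfony/dom-crawler and symfony/css-selector installed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup with three paragraphs directly under <body>.
$html = '<html><body><p>one</p><p>two</p><p>three</p></body></html>';
$crawler = new Crawler($html);

// Select every <p> that is a direct child of <body>.
$paragraphs = $crawler->filter('body > p');

$firstViaEq    = $paragraphs->eq(0)->text();    // position 0, i.e. the first match
$firstViaFirst = $paragraphs->first()->text();  // same node as eq(0)
$lastText      = $paragraphs->last()->text();   // the final match
$count         = $paragraphs->count();          // how many nodes matched

echo $firstViaEq, ' ', $firstViaFirst, ' ', $lastText, ' ', $count, PHP_EOL;
// prints "one one three 3"
```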
The CssSelector is a wonderful component that allows us to select elements via their CSS selectors. It does this by converting the CSS selectors to their XPath equivalents. So, for example, say we wanted to select an element with a class called "fire", we could do this:
$crawler->filter('.fire');
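If you are curious about the CSS-to-XPath conversion, it can be observed directly with the component's converter class. The sketch below assumes symfony/css-selector and symfony/dom-crawler are installed via Composer; the HTML is made up, and the exact XPath string printed may differ between versions, so treat the output as something to inspect rather than rely on.

```php
<?php
// Assumption: Composer autoloader with symfony/css-selector and symfony/dom-crawler installed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\CssSelector\CssSelectorConverter;
use Symfony\Component\DomCrawler\Crawler;

// Convert a CSS class selector into its XPath equivalent.
$converter = new CssSelectorConverter();
$xpath = $converter->toXPath('.fire');
echo $xpath, PHP_EOL;

// Hypothetical markup: one element carries the "fire" class, one does not.
$html = '<html><body><div class="fire hot">a</div><div class="water">b</div></body></html>';
$crawler = new Crawler($html);

// Filtering by CSS and filtering by the converted XPath should match the same nodes.
$viaCss   = $crawler->filter('.fire')->count();
$viaXPath = $crawler->filterXPath($xpath)->count();
echo $viaCss, ' ', $viaXPath, PHP_EOL;
```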
The CssSelector component is so amazing that it even supports attribute selectors such as:
$crawler->filter('div[style*="max-height:175px; overflow: hidden;"]');
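The above means that we are looking for a div element whose inline style attribute contains the text max-height:175px; overflow: hidden; (the *= operator matches a substring of the attribute value). A div with style="max-height:175px; overflow: hidden;" would match.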
Alright, now that we have a bit of an idea about the three major components, it is time for us to bring everything together and actually scrape something. As you may have realised by now, when it comes to scraping, there is no single prescribed way to do it. You are free to explore and try out many ways to get your data. The only limit you have is your creativity. There are times when I have had to combine the CssSelector and DomCrawler in order to get what I want (actually, a lot of times).
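As one small illustration of combining the CssSelector and DomCrawler, the sketch below runs the style-substring selector from earlier against an invented snippet and then traverses the matches. It assumes symfony/dom-crawler and symfony/css-selector are installed via Composer; the markup and variable names are hypothetical.

```php
<?php
// Assumption: Composer autoloader with symfony/dom-crawler and symfony/css-selector installed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup: only the first div's style contains the substring we search for.
$html = '<html><body>'
    . '<div style="color:red; max-height:175px; overflow: hidden;">clipped</div>'
    . '<div style="max-height:300px;">tall</div>'
    . '</body></html>';
$crawler = new Crawler($html);

// CssSelector part: *= matches when the attribute value contains the given substring.
$matches = $crawler->filter('div[style*="max-height:175px; overflow: hidden;"]');

// DomCrawler part: count the matches and read the text of the first one.
$matchCount = $matches->count();
$matchText  = $matches->first()->text();

echo $matchCount, ' ', $matchText, PHP_EOL;
// prints "1 clipped"
```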