Oxylabs for Oxylabs

Posted on Oct 31, 2022 • Edited on Jun 8, 2023

Web Scraping With PHP | Ultimate Tutorial

#webscraping #php #tutorial #beginners

You can use various scripting languages to do web scraping, and PHP is certainly one to try! It’s a general-purpose language and one of the most popular options for web development. For example, WordPress, the most common content management system for creating websites, is built using PHP.

PHP offers various building blocks required to build a web scraper, although it can quickly become an increasingly complicated task. Conveniently, many open-source libraries can make web scraping with PHP more accessible.

This post will guide you through the step-by-step process of writing various PHP web scraping routines you can employ to extract public data from static and dynamic web pages.

Let’s get started!

Can PHP be used for web scraping?

In short, yes, it certainly can, and the rest of the article will detail precisely how the web page scraping processes should look. However, asking whether it's a good choice as a language for web scraping is an entirely different question, as numerous programming language alternatives exist.

PHP has been around since the 1990s and has now reached major version 8. Its age is an advantage: the language is relatively easy to use and has decades of solved problems and errors under its belt. However, simplicity comes at a cost as well. When it comes to complex, dynamic websites, PHP is outperformed by Python and JavaScript, although if you only need data scraped from simple pages, PHP is a good choice.

Installing prerequisites

To begin, make sure that you have both PHP and Composer installed.

If you’re using Windows, visit this link to download PHP. You can also use the Chocolatey package manager.

Using Chocolatey, run the following command from the command line or PowerShell:

choco install php

If you’re using macOS, the chances are that you already have PHP bundled with the operating system. Otherwise, you can use a package manager such as Homebrew to install PHP. Open the terminal and enter the following:

brew install php

Once PHP is installed, verify that the version is 7.1 or newer. Open the terminal and enter the following to verify the version:

php --version
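The same check can be made from inside a script, which is handy when the scraper runs on a machine you didn't set up yourself. The following is a minimal sketch; the helper name isSupportedVersion is hypothetical and not part of PHP, while version_compare and PHP_VERSION are built in:

```php
<?php

// Hypothetical helper: true when $version is at least $minimum.
function isSupportedVersion(string $version, string $minimum = '7.1.0'): bool
{
    return version_compare($version, $minimum, '>=');
}

// PHP_VERSION holds the version of the interpreter running this script.
if (!isSupportedVersion(PHP_VERSION)) {
    fwrite(STDERR, 'PHP 7.1 or newer is required, found ' . PHP_VERSION . PHP_EOL);
    exit(1);
}

echo 'PHP ' . PHP_VERSION . ' is recent enough.' . PHP_EOL;
```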

Next, install Composer. Composer is a dependency manager for PHP. It’ll help to install and manage the required packages.

To install Composer, visit this link. Here you’ll find the downloads and instructions.

If you’re using a package manager, the installation is easier. On macOS, run the following command to install Composer:

brew install composer

On Windows, you can use Chocolatey:

choco install composer

To verify the installation, run the following command:

composer --version

The next step is to install the required libraries.

Making an HTTP GET request

The first step of PHP web scraping is to load the page.

In this tutorial, we’ll be using books.toscrape.com. The website is a dummy book store for practicing web scraping.

When viewing a website in a browser, the browser sends an HTTP GET request to the web server as the first step. To send the HTTP GET request using PHP, the built-in function file_get_contents can be used.

This function can take a file path or a URL and return the contents as a string.

Create a new file and save it as native.php. Open this file in a code editor such as Visual Studio Code. Enter the following lines of code to load the HTML page and print the HTML in the terminal:

<?php

$html = file_get_contents('https://books.toscrape.com/');

echo $html;

Execute this code from the terminal as follows:

php native.php

Upon executing this command, the entire HTML of the page will be printed.
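Note that file_get_contents returns false when the request fails, so a slightly more defensive variant may be worth considering. The sketch below is one possible approach, not the only one: the function name fetchHtml, the timeout, and the user agent string are all assumptions made for illustration.

```php
<?php

// Hypothetical helper: load a URL (or a local file path) and fail loudly.
function fetchHtml(string $source, int $timeoutSeconds = 10): string
{
    // The "http" options are also used for https:// URLs.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'php-scraper-tutorial/1.0', // assumed value
        ],
    ]);

    // "@" silences the warning; the failure is reported by the exception below.
    $html = @file_get_contents($source, false, $context);

    if ($html === false) {
        throw new RuntimeException('Could not load: ' . $source);
    }

    return $html;
}
```

With this in place, the body of native.php could be reduced to echo fetchHtml('https://books.toscrape.com/'); and a network failure would stop the script with a clear message instead of printing an empty string.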

As of now, it’s difficult to locate and extract specific information within the HTML.

This is where various open-source, third-party libraries come into play.

Web scraping in PHP with Goutte

A wide selection of libraries is available for web scraping with PHP. In this tutorial, Goutte will be used as it’s accessible, well-documented, and continuously updated. It’s always a good idea to try the most popular solution. Usually, supporting content and preexisting advice are plentiful.

Goutte can handle most static websites. For dynamic sites, let’s use Symfony Panther.

Goutte, pronounced goot, is a wrapper around Symfony’s components, such as BrowserKit, CssSelector, DomCrawler, and HttpClient.

Symfony is a set of reusable PHP components. The components used by Goutte can be used directly. However, Goutte makes it easier to write the code.

To install Goutte, create a directory where you intend to keep the source code. Navigate to the directory and enter these commands:

composer init --no-interaction --require="php >=7.1"

composer require fabpot/goutte

composer update

The first command will create the composer.json file. The second command will add the entry for Goutte as well as download and install the required files. It’ll also create the composer.lock file.

The composer update command will ensure that all the files of the dependencies are up to date.

Sending HTTP requests with Goutte

The most important class for PHP web scraping using Goutte is the Client that acts like a browser. The first step is to create an object of this class:

$client = new Client();

This object can then be used to send a request. The method to send the request is conveniently called request. It takes two parameters — the HTTP method and the target URL, and returns an instance of the DOM crawler object:

$crawler = $client->request('GET', 'https://books.toscrape.com');

This will send the GET request to the HTML page. To print the entire HTML of the page, we can call the html() method.

Putting together everything we’ve built so far, this is what the code file looks like:

<?php

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();

$crawler = $client->request('GET', 'https://books.toscrape.com');

echo $crawler->html();

Save this new PHP file as books.php and run it from the terminal. This will print the entire HTML:

php books.php

Next, we need a way to locate specific elements from the page.

Locating HTML elements via CSS Selectors

Goutte uses the Symfony component CssSelector. It facilitates the use of CSS Selectors in locating HTML elements.

The CSS Selector can be supplied to the filter method. For example, to print the title of the page, enter the following line to the books.php file that we’re working with:

echo $crawler->filter('title')->text();

Note that title is the CSS Selector that selects the <title> node from the HTML.

Keep in mind that in this particular case, text() returns the text contained in the HTML element. In the earlier example, we used html() to return the entire HTML of the selected element.

If you prefer to work with XPath, use the filterXPath() method instead. The following line of code produces the same output:

echo $crawler->filterXPath('//title')->text();

Now, let’s move on to extracting the book titles and prices.

Extracting the elements

Open https://books.toscrape.com in Chrome, right-click on a book and select Inspect. Before we write the web scraping code, we need to analyze the HTML of our page first.

The books are located in the <article> tags

Upon examining the HTML of the target web page, we can see that each book is contained in an article tag, which has a product_pod class. Here, the CSS Selector would be .product_pod.

In each article tag, the complete book title is located in the thumbnail image as an alt attribute value. The CSS Selector for the book title would be .image_container img.

Finally, the CSS Selector for the book price would be .price_color.

To get all the titles and prices from this page, first, we need to locate the container and then run the each loop.

In this loop, an anonymous function will extract and print the title along with the price as follows:

function scrapePage($url, $client){

$crawler = $client->request('GET', $url);

$crawler->filter('.product_pod')->each(function ($node) {

$title = $node->filter('.image_container img')->attr('alt');

$price = $node->filter('.price_color')->text();

echo $title . "-" . $price . PHP_EOL;

});

}

The web data extraction logic is now isolated in a function. The same function can be reused for any page of this website that shares the same layout.

Handling pagination

At this point, your PHP web scraper is performing data extraction from only a single URL. In real-life web scraping scenarios, multiple pages would be involved.

In this particular site, the pagination is controlled by a Next link (button). The CSS Selector for the Next link is .next > a.

In the function scrapePage that we’ve created earlier, add the following lines:

try {

$next_page = $crawler->filter('.next > a')->attr('href');

} catch (InvalidArgumentException $e) { // Next page not found; the variable is required before PHP 8.0

return null;

}

return "https://books.toscrape.com/catalogue/" . $next_page;

This code uses the CSS Selector to locate the Next button and to extract the value of the href attribute, returning the relative URL of the subsequent page. On the last page, this line of code will raise the InvalidArgumentException.

If the next page is found, this function will return its URL. Otherwise, it will return null.

From now on, you’ll be initiating each scraping cycle with a different URL. This will make the conversion from a relative URL to an absolute one easier.
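To see why the starting URL matters, the conversion can be sketched as a small helper. This is only an illustration under simple assumptions: the name resolveUrl is hypothetical, and the helper handles plain relative links such as page-2.html but not links that start with "/" or contain "../".

```php
<?php

// Hypothetical helper: turn a simple relative href into an absolute URL.
function resolveUrl(string $baseUrl, string $href): string
{
    // An href that already has a scheme (https://...) is returned unchanged.
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;
    }

    // Keep everything up to and including the last "/" of the current page URL.
    $directory = substr($baseUrl, 0, strrpos($baseUrl, '/') + 1);

    return $directory . $href;
}

echo resolveUrl('https://books.toscrape.com/catalogue/page-1.html', 'page-2.html') . PHP_EOL;
// prints https://books.toscrape.com/catalogue/page-2.html
```

Because every catalogue page lives in the same directory, prefixing the fixed catalogue URL gives the same result here, which is what the tutorial code does.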

Lastly, you can use a while loop to call the scrapePage function:

$client = new Client();

$nextUrl = "https://books.toscrape.com/catalogue/page-1.html";

while ($nextUrl) {

$nextUrl = scrapePage($nextUrl, $client);

}

The web scraping code is almost complete.

Writing data to a CSV file

The final step of the PHP web scraping process is to export the data to storage. PHP’s built-in fputcsv function can be used to export the data to a CSV file.

First, open the CSV file in write or append mode and store the file handle in a variable.

Next, send the variable to the scrapePage function. Then, call the fputcsv function for each book to write the title and price in one row.

Lastly, after the while loop, close the file by calling fclose.

The final code file will be as follows:

function scrapePage($url, $client, $file)

{

$crawler = $client->request('GET', $url);

$crawler->filter('.product_pod')->each(function ($node) use ($file) {

$title = $node->filter('.image_container img')->attr('alt');

$price = $node->filter('.price_color')->text();

fputcsv($file, [$title, $price]);

});

try {

$next_page = $crawler->filter('.next > a')->attr('href');

} catch (InvalidArgumentException $e) { // Next page not found; the variable is required before PHP 8.0

return null;

}

return "https://books.toscrape.com/catalogue/" . $next_page;

}

$client = new Client();

$file = fopen("books.csv", "a");

$nextUrl = "https://books.toscrape.com/catalogue/page-1.html";

while ($nextUrl) {

echo "<h2>" . $nextUrl . "</h2>" . PHP_EOL;

$nextUrl = scrapePage($nextUrl, $client, $file);

}

fclose($file);

Run this file from the terminal:

php books.php

This will create a books.csv file with 1,000 rows of data.
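Keep in mind that append mode ("a") adds to an existing file, so running the script twice would store every book twice. If that isn't what you want, write mode ("w") starts from an empty file on each run. The sketch below shows this along with a header row; the sample rows and the file location are made up for illustration, and no scraping is involved.

```php
<?php

// Hypothetical sample rows standing in for scraped data.
$rows = [
    ['Example Book One', '£10.00'],
    ['Example Book Two', '£12.50'],
];

$path = sys_get_temp_dir() . '/books_example.csv'; // assumed location

// "w" truncates the file, so repeated runs don't pile up duplicate rows.
$file = fopen($path, 'w');

// Separator, enclosure, and escape are spelled out; they match PHP's long-standing defaults.
fputcsv($file, ['title', 'price'], ',', '"', '\\'); // header row
foreach ($rows as $row) {
    fputcsv($file, $row, ',', '"', '\\');
}

fclose($file);

// Read the file back as plain lines to check what was written.
$lines = file($path, FILE_IGNORE_NEW_LINES);
echo count($lines) . ' lines written to ' . $path . PHP_EOL;
```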

Web scraping with Guzzle, XML, and XPath

Guzzle is a PHP HTTP client: it sends HTTP requests to web pages and returns the responses, which makes it suitable for scraping data. Note that before working with a web page this way, you need to understand two more concepts: XML and XPath.

XML stands for eXtensible Markup Language. It’s used to create files that store structured data. These files can then be transmitted, and the data reconstructed on the receiving end.

Reading XML files is a separate problem, and this is where XPath comes into the picture.

XPath stands for XML Path Language and is used for navigating and selecting XML nodes.

HTML files are very similar to XML files. In some cases, you might need a parser that smooths over the minor differences and makes the HTML at least somewhat compliant with XML standards. Some parsers can read even poorly formatted markup.

In any case, the parsers will then make necessary HTML modifications so that you can work with XPath to query and navigate the HTML.
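To illustrate the idea without installing anything, the sketch below uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than the libraries covered in this tutorial. The HTML string is a made-up, deliberately sloppy example; the parser repairs it so that XPath queries can run against it.

```php
<?php

// Deliberately sloppy HTML: unclosed <p> tags and no <html> or <body> wrapper.
$html = '<title>Example page</title><p>First paragraph<p>Second paragraph';

$document = new DOMDocument();

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

// After parsing, the paragraphs are ordinary nodes that XPath can select.
$paragraphs = $xpath->query('//p');
foreach ($paragraphs as $paragraph) {
    echo trim($paragraph->textContent) . PHP_EOL;
}

$title = $xpath->query('//title')->item(0)->textContent;
echo $title . PHP_EOL;
```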

Setting up a Guzzle Project

To install Guzzle, create a directory where you intend to keep the source code. Navigate to the directory and enter these commands:

composer init --no-interaction --require="php >=7.1"

composer require guzzlehttp/guzzle

In addition to Guzzle, let’s also use a library for parsing HTML code. There are many PHP libraries available, such as Simple HTML DOM Parser and Symfony DomCrawler. In this tutorial, Symfony DomCrawler is chosen. Its syntax is very similar to Goutte’s, so you’ll be able to apply what you already know in this section. Another point in favor of DomCrawler over Simple HTML DOM Parser is that it copes very well with invalid HTML code. So, let’s get going.

Install DomCrawler using the following command:

composer require symfony/dom-crawler

These commands will download all the necessary files. The next step is to create a new file and save it as scraper.php.

Sending HTTP requests with Guzzle

Similar to Goutte, the most important class of Guzzle is Client. Open the scraper.php file you’ve just created and enter the following lines of PHP code:

<?php

require 'vendor/autoload.php';

use GuzzleHttp\Client;

use Symfony\Component\DomCrawler\Crawler;

Now we’re ready to create an object of the Client class:

$client = new Client();

You can then use the client object to send a request. The method to send the request is conveniently called request. It takes two parameters — the HTTP method and the target URL, and returns a response:

$response = $client->request('GET', 'https://books.toscrape.com');

From this response, we can extract the web page's HTML as follows:

$html = $response->getBody()->getContents();

echo $html;

Note that in this example, the response contains HTML code. If you’re working with a web page that returns JSON, you can save the JSON to a file and stop the script. The next section will be applicable only if the response contains HTML or XML data.
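For the JSON case, a rough sketch could look like the following. The JSON body, its field names, and the output file name are invented for illustration; with Guzzle, the string would come from $response->getBody()->getContents() instead.

```php
<?php

// Hypothetical response body standing in for what an API might return.
$body = '{"books": [{"title": "Example Book", "price": 10.5}]}';

// Passing true decodes JSON objects into associative arrays.
$data = json_decode($body, true);

if (json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
}

foreach ($data['books'] as $book) {
    echo $book['title'] . ' - ' . $book['price'] . PHP_EOL;
}

// Keep the raw response on disk for later processing (assumed location).
file_put_contents(sys_get_temp_dir() . '/response.json', $body);
```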

Continuing, the DomCrawler will be used to extract specific elements from this web page.

Locating HTML elements via XPath

Make sure the Crawler class is imported at the top of the file:

use Symfony\Component\DomCrawler\Crawler;

We can create an instance of the crawler class as follows:

$crawler = new Crawler($html);

Now we can use the filterXPath method to extract any XML node. For example, the following line prints only the title of the page:

echo $crawler->filterXPath('//title')->text();

A quick note about XML nodes: in XML, everything is a node. An element is a node, an attribute is a node, and text is also a node. The filterXPath method returns the matching nodes wrapped in a crawler object. So, to extract the text from an element, even if you use the text() function in XPath, you still have to call the text() method to get the text as a string.

In other words, both the following lines of code will return the same value:

echo $crawler->filterXPath('//title')->text();

echo $crawler->filterXPath('//title/text()')->text();

Now, let's move on to extracting the book titles and prices.

Extracting the elements

Before writing web scraping code, let’s start with analyzing the HTML of our page.

Open the web page https://books.toscrape.com in Chrome, right-click on a book and select Inspect.

The books are located in <article> elements with the class attribute set to product_pod. The XPath to select these nodes will be as follows:

//*[@class="product_pod"]

In each article tag, the complete book title is located in the thumbnail image as an alt attribute value. The XPath for book title and book price would be as follows:

//*[@class="image_container"]/a/img/@alt

//*[@class="price_color"]/text()
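These expressions use the * wildcard to match an element of any name that carries the given class. They can be tried out offline before touching the real page. The sketch below does so with PHP's bundled DOMDocument and DOMXPath classes instead of DomCrawler, against a hand-written fragment that only imitates the structure of one book entry; the title and price in it are invented.

```php
<?php

// Hypothetical fragment imitating one book entry; the real page has more markup.
$html = <<<'HTML'
<article class="product_pod">
  <div class="image_container">
    <a href="#"><img src="cover.jpg" alt="Example Book Title"></a>
  </div>
  <p class="price_color">&pound;10.00</p>
</article>
HTML;

$document = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);
$results = [];

foreach ($xpath->query('//*[@class="product_pod"]') as $book) {
    // The leading "." keeps each query inside the current book node.
    $title = $xpath->query('.//*[@class="image_container"]/a/img/@alt', $book)->item(0)->nodeValue;
    $price = $xpath->query('.//*[@class="price_color"]/text()', $book)->item(0)->nodeValue;

    $results[] = [$title, trim($price)];
    echo $title . ' - ' . trim($price) . PHP_EOL;
}
```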

To get all of the titles and prices from this page, you first need to locate the container and then use a loop to get to each of the elements containing the data you need.

In this loop, an anonymous function will extract and print the title along with the price, as shown in the following PHP code snippet:

$crawler->filterXPath('//*[@class="product_pod"]')->each(function ($node) {

$title = $node->filterXPath('.//*[@class="image_container"]/a/img/@alt')->text();

$price = $node->filterXPath('.//*[@class="price_color"]/text()')->text();

echo $title . "-" . $price . PHP_EOL;

});

This was a simple demonstration of how you can scrape data from a static page using Guzzle together with the DomCrawler parser. Note that this method won’t work with a dynamic website. Such websites rely on JavaScript code, which DomCrawler cannot execute. In cases like this, you’ll need to use Symfony Panther.

The next step after extracting data is to save it.

Saving extracted data to a file

To store the extracted data, you can change the script to use PHP’s built-in fputcsv function and create a CSV file.

Update the PHP code snippet as follows:

$file = fopen("books.csv", "a");

$crawler->filterXPath('//*[@class="product_pod"]')->each(function ($node) use ($file) {

$title = $node->filterXPath('.//*[@class="image_container"]/a/img/@alt')->text();

$price = $node->filterXPath('.//*[@class="price_color"]/text()')->text();

fputcsv($file, [$title, $price]);

});

fclose($file);

This code snippet, when run, will save all the data to the books.csv file.

Web scraping with Symfony Panther

Dynamic websites use JavaScript to render the contents. For such websites, Goutte wouldn’t be a suitable option.

For these websites, the solution is to employ a browser to render the page. It can be done using another component from Symfony – Panther. Panther is a standalone PHP library for web scraping using real browsers.

In this section, let’s scrape quotes and authors from quotes.toscrape.com. It’s a dummy website for learning the basics of scraping dynamic web pages.

Installing Panther and its dependencies

To install Panther, open the terminal, navigate to the directory where you’ll be storing your source code, and run the following commands:

composer init --no-interaction --require="php >=7.1"

composer require symfony/panther

composer update

These commands will create a new composer.json file and install Symfony/Panther.

The other two dependencies are a browser and a driver. The common browser choices are Chrome and Firefox. The chances are that you already have one of these browsers installed.

The driver for your browser can be downloaded using any of the package managers.

On Windows, run:

choco install chromedriver

On macOS, run:

brew install chromedriver

Sending HTTP requests with Panther

Panther uses the Client class to expose the get() method. This method can be used to load URLs, or in other words, to send HTTP requests.

The first step is to create the Chrome Client. Create a new PHP file and enter the following lines of code:

<?php

require 'vendor/autoload.php';

use \Symfony\Component\Panther\Client;

$client = Client::createChromeClient();

The $client object can then be used to load the web page:

$client->get('https://quotes.toscrape.com/js/');

This line will load the page in a headless Chrome browser.

Locating HTML elements via CSS Selectors

To locate the elements, you first need a reference to the crawler object. A reliable way to get one is to wait for a specific element on the page using the waitFor() method. It takes the CSS Selector as a parameter:

$crawler = $client->waitFor('.quote');

The code line waits for the element with this selector to become available and then returns an instance of the crawler.

The rest of the code is similar to Goutte’s as both use the same CssSelector component of Symfony.

The container HTML element of a quote

First, the CSS Selector is supplied to the filter method to get all of the quote elements. Then, an anonymous function is applied to each quote to extract the author and the text:

$crawler->filter('.quote')->each(function ($node) {

$author = $node->filter('.author')->text();

$quote = $node->filter('.text')->text();

echo $author . " - " . $quote . PHP_EOL;

});

Handling pagination

To scrape data from all of the subsequent pages of this website, you can simply click the Next button. For clicking the links, the clickLink() method can be used. This method works directly with the link text.

On the last page, the link won’t be present, and calling this method will throw an exception. This can be handled by using a try-catch block:

while (true) {

$crawler = $client->waitFor('.quote');

…

try {

$client->clickLink('Next');

} catch (Exception $e) { // Next link not found

break;

}

}

Writing data to a CSV file

Writing the data to CSV is straightforward when using PHP’s fputcsv() function. Open the CSV file before the while loop, write every row using the fputcsv() function, and close the file after the loop.

Here’s the final code:

$file = fopen("quotes.csv", "a");

while (true) {

$crawler = $client->waitFor('.quote');

$crawler->filter('.quote')->each(function ($node) use ($file) {

$author = $node->filter('.author')->text();

$quote = $node->filter('.text')->text();

fputcsv($file, [$author, $quote]);

});

try {

$client->clickLink('Next');

} catch (Exception $e) { // Next link not found

break;

}

}

fclose($file);

Once you execute the web scraper contained in this PHP script, you’ll have a quotes.csv file with all the quotes and authors ready for further analysis.

Click here and check out a repository on GitHub to find the complete code used in this article.

Conclusion

You shouldn’t run into major hiccups when using Goutte for most static web pages, as this popular library offers sufficient functionality and extensive documentation. However, if the typical HTML extraction methods aren’t up to the task when dynamic elements come into play, then Symfony Panther is the right way to deal with more complicated loads.

If you’re working with a site developed using Laravel, CodeIgniter, or just plain PHP, writing the web scraping part directly in PHP can be very useful, for example, when creating your own WordPress plugin. As PHP is also a scripting language, you can write web scraping code even when it’s not meant to be deployed to a website.