There are plenty of articles on web scraping with PHP; hopefully you'll find something useful in this one.
We'll be using the spatie/crawler package, which provides great features for writing crawlers without going absolutely crazy!
Please keep in mind that there is no general “the best way” — each approach has its use-case depending on what you need, how you like to do things, and what you want to achieve.
Note: Before you scrape a website, do read their Terms of Service to make sure they are OK with being scraped.
Use Case
Build our own crawler project to fetch the content of any website.
Set Up & Installation
Install the packages via Composer:
composer require "guzzlehttp/psr7:^1.8.3"
composer require spatie/crawler
Notice that guzzlehttp/psr7 is installed explicitly because the spatie package uses Guzzle promises under the hood to crawl multiple URLs concurrently.
Coding Time!
Before we start, we need to create a class that extends the \Spatie\Crawler\CrawlObservers\CrawlObserver
abstract class. In our CustomCrawlerObserver class we can hook into the crawling steps and process the HTTP
responses.
<?php

namespace App\Observers;

use DOMDocument;
use Spatie\Crawler\CrawlObservers\CrawlObserver;
use Psr\Http\Message\UriInterface;
use Psr\Http\Message\ResponseInterface;
use GuzzleHttp\Exception\RequestException;
use Illuminate\Support\Facades\Log;

class CustomCrawlerObserver extends CrawlObserver
{
    private $content;

    public function __construct()
    {
        $this->content = null;
    }

    /**
     * Called when the crawler will crawl the url.
     *
     * @param \Psr\Http\Message\UriInterface $url
     */
    public function willCrawl(UriInterface $url): void
    {
        Log::info('willCrawl', ['url' => $url]);
    }

    /**
     * Called when the crawler has crawled the given url successfully.
     *
     * @param \Psr\Http\Message\UriInterface $url
     * @param \Psr\Http\Message\ResponseInterface $response
     * @param \Psr\Http\Message\UriInterface|null $foundOnUrl
     */
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null
    ): void {
        $doc = new DOMDocument();
        @$doc->loadHTML((string) $response->getBody());
        //# save HTML
        $content = $doc->saveHTML();
        //# convert encoding
        $content1 = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
        //# strip all javascript
        $content2 = preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', '', $content1);
        //# strip all style
        $content3 = preg_replace('/<style\b[^>]*>(.*?)<\/style>/is', '', $content2);
        //# strip tags (add a space before each tag so words don't run together)
        $content4 = str_replace('<', ' <', $content3);
        $content5 = strip_tags($content4);
        //# replace non-breaking space entities with regular spaces
        $content6 = str_replace('&nbsp;', ' ', $content5);
        //# collapse white space and line breaks
        $content7 = preg_replace('/\s+/S', ' ', $content6);
        //# html entity decode - ö was shown as &ouml;
        $html = html_entity_decode($content7);
        //# append
        $this->content .= $html;
    }

    /**
     * Called when the crawler had a problem crawling the given url.
     *
     * @param \Psr\Http\Message\UriInterface $url
     * @param \GuzzleHttp\Exception\RequestException $requestException
     * @param \Psr\Http\Message\UriInterface|null $foundOnUrl
     */
    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null
    ): void {
        Log::error('crawlFailed', ['url' => $url, 'error' => $requestException->getMessage()]);
    }

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling(): void
    {
        Log::info('finishedCrawling');
        //# store $this->content in DB
        //# Add logic here
    }
}
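The finishedCrawling() hook is a natural place to persist the collected text. Here's a minimal sketch of what the stub above could look like, assuming a hypothetical crawled_pages table with a body column (the table and schema are not part of the package and would need their own migration):

use Illuminate\Support\Facades\DB;

public function finishedCrawling(): void
{
    Log::info('finishedCrawling');

    //# hypothetical table/column - adjust to your own schema
    DB::table('crawled_pages')->insert([
        'body'       => $this->content,
        'created_at' => now(),
    ]);
}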
Now we can configure the crawler itself and start it:
<?php

namespace App\Http\Controllers;

use Illuminate\Http\Request;
use App\Observers\CustomCrawlerObserver;
use Spatie\Crawler\CrawlProfiles\CrawlInternalUrls;
use Spatie\Crawler\Crawler;
use App\Http\Controllers\Controller;
use GuzzleHttp\RequestOptions;

class CustomCrawlerController extends Controller
{
    public function __construct() {}

    /**
     * Crawl the website content.
     * @return true
     */
    public function fetchContent()
    {
        //# initiate crawler
        Crawler::create([RequestOptions::ALLOW_REDIRECTS => true, RequestOptions::TIMEOUT => 30])
            ->acceptNofollowLinks()
            ->ignoreRobots()
            // ->setParseableMimeTypes(['text/html', 'text/plain'])
            ->setCrawlObserver(new CustomCrawlerObserver())
            ->setCrawlProfile(new CrawlInternalUrls('https://www.lipsum.com'))
            ->setMaximumResponseSize(1024 * 1024 * 2) // 2 MB maximum
            ->setTotalCrawlLimit(100) // limit defines the maximal count of URLs to crawl
            // ->setConcurrency(1) // all urls will be crawled one by one
            ->setDelayBetweenRequests(100)
            ->startCrawling('https://www.lipsum.com');

        return true;
    }
}
Note that we pass the CustomCrawlerObserver
class created earlier to setCrawlObserver(), and the website to crawl to startCrawling().
We also restrict the crawl to internal links by passing a CrawlInternalUrls profile to setCrawlProfile()
; the package ships with other profiles such as CrawlAllUrls and CrawlSubdomains, and you can write your own, as sketched below.
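If the built-in profiles don't fit, you can extend the package's Spatie\Crawler\CrawlProfiles\CrawlProfile class and implement shouldCrawl(). Here's a minimal sketch (the class name and the /blog path restriction are made up for illustration; str_starts_with() requires PHP 8):

<?php

namespace App\Profiles;

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class BlogOnlyProfile extends CrawlProfile
{
    //# only crawl URLs on www.lipsum.com whose path starts with /blog
    public function shouldCrawl(UriInterface $url): bool
    {
        return $url->getHost() === 'www.lipsum.com'
            && str_starts_with($url->getPath(), '/blog');
    }
}

Pass an instance of it to setCrawlProfile() instead of CrawlInternalUrls.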
Do you see setDelayBetweenRequests(100)
? It makes the crawler pause for 100 milliseconds between requests.
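To actually trigger the crawl, you could point a route at the controller action, for example in routes/web.php (the /crawl URI is just a placeholder; for anything long-running, a queued job or an Artisan command would be a better fit than a web request):

use App\Http\Controllers\CustomCrawlerController;
use Illuminate\Support\Facades\Route;

//# visiting /crawl kicks off the crawl
Route::get('/crawl', [CustomCrawlerController::class, 'fetchContent']);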
That’s it
Quick and easy! spatie/crawler provides options to set the maximum crawl depth, limit the response size, add a delay between requests, cap the number of crawled URLs, restrict the content types to parse, and more, which simplifies the process of web scraping. While this was an introductory article, you can build on this knowledge and create complex web scrapers that crawl thousands of pages. Make sure you check out the package documentation for more.
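For example, capping how deep the crawler follows links is a single extra call in the chain shown earlier (a sketch; the depth of 2 is arbitrary):

Crawler::create()
    ->setMaximumDepth(2) // don't follow links more than two levels below the start URL
    ->setCrawlObserver(new CustomCrawlerObserver())
    ->startCrawling('https://www.lipsum.com');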
Source Code!
Here’s the full source code with what we’ve done so far.