
Anlisha Maharjan

Posted on • Originally published at anlisha.com.np on


Web Scraping with Laravel and spatie/crawler

This is probably one of many articles on web scraping with PHP, but hopefully you’ll find something useful here.

We’ll be using the spatie/crawler package, which provides great features for writing crawlers without going absolutely crazy!

Please keep in mind that there is no general “the best way” — each approach has its use-case depending on what you need, how you like to do things, and what you want to achieve.

Note: Before you scrape a website, do read their Terms of Service to make sure they are OK with being scraped.

Use Case

Build our own crawler project to fetch the content of any website.

Set Up & Installation

Install package via Composer:

composer require guzzlehttp/psr7 ^1.8.3
composer require spatie/crawler

Notice that guzzlehttp/psr7 is installed because the spatie package uses Guzzle promises under the hood to crawl multiple URLs concurrently.

Coding Time!

Before we start, we need to create a class that extends the \Spatie\Crawler\CrawlObservers\CrawlObserver abstract class. In our CustomCrawlerObserver class we can hook into the crawling steps and manipulate the HTTP responses.

<?php

namespace App\Observers;

use DOMDocument;
use GuzzleHttp\Exception\RequestException;
use Illuminate\Support\Facades\Log;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

class CustomCrawlerObserver extends CrawlObserver
{
    private $content;

    public function __construct()
    {
        $this->content = null;
    }

    /**
     * Called when the crawler will crawl the url.
     *
     * @param \Psr\Http\Message\UriInterface $url
     */
    public function willCrawl(UriInterface $url): void
    {
        Log::info('willCrawl', ['url' => $url]);
    }

    /**
     * Called when the crawler has crawled the given url successfully.
     *
     * @param \Psr\Http\Message\UriInterface $url
     * @param \Psr\Http\Message\ResponseInterface $response
     * @param \Psr\Http\Message\UriInterface|null $foundOnUrl
     */
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null
    ): void {
        $doc = new DOMDocument();
        @$doc->loadHTML((string) $response->getBody());

        //# save HTML
        $content = $doc->saveHTML();
        //# convert encoding
        $content = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
        //# strip all javascript
        $content = preg_replace('/<script\b[^>]*>(.*?)<\/script>/is', '', $content);
        //# strip all style
        $content = preg_replace('/<style\b[^>]*>(.*?)<\/style>/is', '', $content);
        //# strip tags
        $content = str_replace('<', ' <', $content);
        $content = strip_tags($content);
        //# replace non-breaking space entities with regular spaces
        $content = str_replace('&nbsp;', ' ', $content);
        //# collapse white spaces and line breaks
        $content = preg_replace('/\s+/', ' ', $content);
        //# html entity decode - ö was shown as &ouml;
        $html = html_entity_decode($content);
        //# append
        $this->content .= $html;
    }

    /**
     * Called when the crawler had a problem crawling the given url.
     *
     * @param \Psr\Http\Message\UriInterface $url
     * @param \GuzzleHttp\Exception\RequestException $requestException
     * @param \Psr\Http\Message\UriInterface|null $foundOnUrl
     */
    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null
    ): void {
        Log::error('crawlFailed', ['url' => $url, 'error' => $requestException->getMessage()]);
    }

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling(): void
    {
        Log::info('finishedCrawling');
        //# store $this->content in DB
        //# Add logic here
    }
}
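
The finishedCrawling() method above only logs a message and leaves persistence as a stub. A minimal sketch of filling it in, assuming a hypothetical CrawledContent Eloquent model with a body column (not part of the original project; adjust to your own schema), could look like this:

    //# Hypothetical: inside CustomCrawlerObserver, replacing the stub above.
    //# Assumes a CrawledContent model backed by a table with a `body` column.
    public function finishedCrawling(): void
    {
        Log::info('finishedCrawling');

        //# store the accumulated text in the DB
        \App\Models\CrawledContent::create([
            'body' => $this->content,
        ]);
    }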

Now we can prepare the crawler itself using the syntax below and start it:

<?php

namespace App\Http\Controllers;

use App\Observers\CustomCrawlerObserver;
use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlInternalUrls;

class CustomCrawlerController extends Controller
{
    /**
     * Crawl the website content.
     *
     * @return bool
     */
    public function fetchContent()
    {
        //# initiate crawler
        Crawler::create([RequestOptions::ALLOW_REDIRECTS => true, RequestOptions::TIMEOUT => 30])
            ->acceptNofollowLinks()
            ->ignoreRobots()
            // ->setParseableMimeTypes(['text/html', 'text/plain'])
            ->setCrawlObserver(new CustomCrawlerObserver())
            ->setCrawlProfile(new CrawlInternalUrls('https://www.lipsum.com'))
            ->setMaximumResponseSize(1024 * 1024 * 2) // 2 MB maximum
            ->setTotalCrawlLimit(100) // maximal count of URLs to crawl
            // ->setConcurrency(1) // all urls will be crawled one by one
            ->setDelayBetweenRequests(100) // pause 100 ms between requests
            ->startCrawling('https://www.lipsum.com');

        return true;
    }
}

Note that we pass the CustomCrawlerObserver class created earlier to setCrawlObserver(), and hand the website to crawl to startCrawling().

We also define that we only want to follow internal links by passing a CrawlInternalUrls profile to setCrawlProfile(). Check out the other crawl profile options here; a minimal sketch of a custom profile is shown below.
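
If internal links aren’t the right filter for your case, you can write your own crawl profile by extending the CrawlProfile abstract class and implementing shouldCrawl(). Here’s a minimal sketch that only crawls pages under a /blog path prefix; the class name and prefix are made up for illustration:

<?php

namespace App\Profiles;

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

class BlogOnlyCrawlProfile extends CrawlProfile
{
    /**
     * Only crawl URLs on www.lipsum.com whose path starts with /blog.
     */
    public function shouldCrawl(UriInterface $url): bool
    {
        return $url->getHost() === 'www.lipsum.com'
            && str_starts_with($url->getPath(), '/blog');
    }
}

You would then pass it to the crawler with ->setCrawlProfile(new BlogOnlyCrawlProfile()).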

Do you see setDelayBetweenRequests(100)? It makes the crawler pause 100 milliseconds between every request.
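
To try this out quickly, you could wire the controller method to a route; the /crawl URI below is just an example. In a real project, a queued job or Artisan command is a better fit, since crawling up to 100 pages can easily outlast a normal HTTP request:

// routes/web.php
use App\Http\Controllers\CustomCrawlerController;
use Illuminate\Support\Facades\Route;

Route::get('/crawl', [CustomCrawlerController::class, 'fetchContent']);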

That’s it

Quick and easy! spatie/crawler provides options to set the maximum crawl depth, cap the response size, add a delay between requests, limit the number of crawled URLs, restrict the content types to parse, and more, all of which simplify the process of web scraping; a short sketch of a few of these options follows below. While this was an introductory article, you can build on this knowledge and create complex web scrapers that crawl thousands of pages. So make sure you check out the package documentation for more.
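
For instance, the crawl-depth option mentioned above is chained onto the same builder as the others. A short sketch (separate from the controller above; the user agent string and limits are arbitrary example values):

Crawler::create()
    ->setUserAgent('my-crawler-bot/1.0') // identify your crawler to the target site
    ->setMaximumDepth(2) // don't follow links more than two levels deep
    ->setCurrentCrawlLimit(10) // crawl at most 10 URLs in this run
    ->setCrawlObserver(new CustomCrawlerObserver())
    ->startCrawling('https://www.lipsum.com');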

Source Code!

Here’s the full source code with what we’ve done so far.

