Most PHP cURL scraping tutorials show you two functions, a screenshot of some output, and call it a day.
This guide covers what actually happens on real websites: proper request setup, HTML parsing without regex, error handling, and pagination.
Full guide with MySQL storage, cookies, rate limiting, and avoiding blocks: phpspiderblog.com
Making a Proper cURL Request
The bare minimum most tutorials show:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://books.toscrape.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
This works on cooperative websites. On anything with basic bot detection you'll get a 403 or empty response — and the script won't tell you why.
Here's a proper starting point:
function scrape_url($url) {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 5,
        CURLOPT_TIMEOUT => 30,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_ENCODING => '',
        CURLOPT_HTTPHEADER => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
            'Connection: keep-alive',
        ],
    ]);

    $response = curl_exec($ch);
    $error = curl_error($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($error) {
        echo "cURL error: " . $error;
        return false;
    }

    if ($httpCode !== 200) {
        echo "HTTP error: " . $httpCode;
        return false;
    }

    return $response;
}

$html = scrape_url("https://books.toscrape.com/");
if ($html) {
    echo "Page fetched. Length: " . strlen($html) . " bytes";
}
Output:
Page fetched. Length: 51274 bytes
Key options that matter:
- CURLOPT_FOLLOWLOCATION — follows redirects. Without it, HTTP-to-HTTPS redirects return an empty response.
- CURLOPT_TIMEOUT — without it, a slow server blocks your script indefinitely.
- User-Agent — without it, requests identify themselves as cURL, and most sites block that.
- CURLINFO_HTTP_CODE — curl_exec() returns false only on network failure, not on a 403 or 404. Always check the status code separately (see the debugging sketch below).
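When a request fails and the status code alone doesn't explain why, curl_getinfo() can also tell you where the request actually ended up. A minimal debugging sketch, using only standard cURL constants against the same demo site:

$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => "https://books.toscrape.com/",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
]);
curl_exec($ch);

// Useful when a scrape silently returns the wrong page or an empty body.
echo "Status:    " . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo "Final URL: " . curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) . PHP_EOL;
echo "Redirects: " . curl_getinfo($ch, CURLINFO_REDIRECT_COUNT) . PHP_EOL;
echo "Time:      " . curl_getinfo($ch, CURLINFO_TOTAL_TIME) . "s" . PHP_EOL;
curl_close($ch);

The effective URL and redirect count are often the fastest way to spot that a "200 OK" was actually a redirect to a login or consent page.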
Parsing HTML With DOMDocument (Not Regex)
Most tutorials use regex on HTML:
preg_match_all('/<h3><a[^>]*title="([^"]*)"/', $response, $matches);
This breaks the moment a site changes one attribute or reformats whitespace. Use DOMDocument instead:
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$titles = $xpath->query('//article[contains(@class,"product_pod")]//h3/a');
foreach ($titles as $title) {
    echo $title->getAttribute('title') . PHP_EOL;
}
Output:
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Extracting multiple fields at once:
$books = $xpath->query('//article[contains(@class,"product_pod")]');

foreach ($books as $book) {
    $titleNode = $xpath->query('.//h3/a', $book)->item(0);
    $priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);

    $title = $titleNode ? $titleNode->getAttribute('title') : 'N/A';
    $price = $priceNode ? trim($priceNode->textContent) : 'N/A';

    echo $title . " | " . $price . PHP_EOL;
}
Output:
A Light in the Attic | £51.77
Tipping the Velvet | £53.74
Soumission | £50.10
libxml_use_internal_errors(true) suppresses warnings from malformed HTML — real websites are full of unclosed tags and encoding issues. Without it, DOMDocument prints a parser warning for every bad tag and your output becomes unreadable.
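The same four lines of libxml setup appear before every parse, so it's worth wrapping them once. A minimal sketch; parse_html() is a name invented for this article, and the XML prolog prefix is a common workaround for loadHTML() assuming ISO-8859-1 when a page doesn't declare its charset:

// Hypothetical helper: wraps the libxml boilerplate and returns a
// ready-to-query DOMXPath, or null when there is nothing to parse.
function parse_html($html) {
    if (!$html) {
        return null;
    }
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The prolog forces UTF-8; without it, loadHTML() treats
    // undeclared input as ISO-8859-1 and mangles multibyte characters.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    return new DOMXPath($dom);
}

$xpath = parse_html($html);
if ($xpath) {
    echo $xpath->query('//article[contains(@class,"product_pod")]')->length . " products found" . PHP_EOL;
}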
Handling Errors and Retries
A scraper that crashes on the first failed request is useless. Build retry logic from the start:
function scrape_with_retry($url, $maxRetries = 3, $delay = 2) {
    $attempt = 0;

    while ($attempt < $maxRetries) {
        $attempt++;

        $ch = curl_init();
        curl_setopt_array($ch, [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT => 30,
            CURLOPT_HTTPHEADER => [
                'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            ],
        ]);

        $response = curl_exec($ch);
        $curlError = curl_error($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($curlError) {
            echo "Attempt $attempt failed: $curlError" . PHP_EOL;
            sleep($delay);
            continue;
        }

        if ($httpCode === 429) {
            echo "Rate limited. Waiting 10 seconds..." . PHP_EOL;
            sleep(10);
            continue;
        }

        if ($httpCode >= 500) {
            echo "Server error $httpCode. Retrying..." . PHP_EOL;
            sleep($delay);
            continue;
        }

        if ($httpCode === 403 || $httpCode === 404) {
            echo "Failed permanently (HTTP $httpCode)" . PHP_EOL;
            return false;
        }

        if ($httpCode === 200) {
            return $response;
        }
    }

    return false;
}
Not every error deserves a retry:
- 429 — the site is telling you to slow down. Wait longer, don't retry immediately (see the backoff sketch after this list).
- 5xx — server-side, usually temporary. Retry after a pause.
- 403 — you're blocked. Retrying the same request won't help.
- 404 — page doesn't exist. Don't retry at all.
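The fixed $delay above works, but a common refinement is exponential backoff with jitter: each retry waits roughly twice as long as the last, plus a random offset so parallel scrapers don't all retry at the same instant. A minimal sketch; backoff_delay() is a name made up for this article, not a built-in:

// Hypothetical helper: waits roughly 2s, 4s, 8s, ... per attempt,
// capped at $cap seconds, with up to 1s of random jitter added.
function backoff_delay($attempt, $base = 2, $cap = 60) {
    $delay = min($cap, $base * (2 ** ($attempt - 1)));
    return $delay + random_int(0, 1000) / 1000;
}

// Drop-in replacement for the fixed sleep($delay) calls above.
// usleep() takes microseconds, hence the multiplication.
usleep((int)(backoff_delay($attempt) * 1000000));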
Scraping Paginated Content
Following the next button is more reliable than guessing URL patterns:
$url = "https://books.toscrape.com/";
$allBooks = [];
$page = 1;
while ($url) {
$html = scrape_with_retry($url, 3, 2);
if (!$html) break;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$books = $xpath->query('//article[contains(@class,"product_pod")]');
foreach ($books as $book) {
$titleNode = $xpath->query('.//h3/a', $book)->item(0);
$priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
$allBooks[] = [
'title' => $titleNode ? $titleNode->getAttribute('title') : 'N/A',
'price' => $priceNode ? trim($priceNode->textContent) : 'N/A',
];
}
echo "Page $page — " . $books->length . " books scraped." . PHP_EOL;
$nextNode = $xpath->query('//li[contains(@class,"next")]/a')->item(0);
if ($nextNode) {
$url = "https://books.toscrape.com/catalogue/" . $nextNode->getAttribute('href');
$page++;
sleep(1);
} else {
echo "Last page reached." . PHP_EOL;
$url = null;
}
}
echo "Total: " . count($allBooks) . " books." . PHP_EOL;
Output:
Page 1 — 20 books scraped.
Page 2 — 20 books scraped.
...
Last page reached.
Total: 1000 books.
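The $allBooks array disappears when the script exits. Proper database storage is covered in the full guide; as a quick stand-in, here is a sketch that dumps the rows to CSV with PHP's built-in fputcsv() (books.csv is an arbitrary filename):

// Write the scraped rows to a local CSV file.
$fp = fopen('books.csv', 'w');
fputcsv($fp, ['title', 'price']); // header row
foreach ($allBooks as $book) {
    fputcsv($fp, [$book['title'], $book['price']]);
}
fclose($fp);
echo "Wrote " . count($allBooks) . " rows to books.csv" . PHP_EOL;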
The full guide covers MySQL storage, session handling with cookies, rate limiting, rotating user agents, and the most common mistakes that silently break scrapers.