Most PHP cURL scraping tutorials show you two functions, a screenshot of some output, and call it a day.
This guide covers what actually happens on real websites: proper request setup, HTML parsing without regex, error handling, and pagination.
Full guide with MySQL storage, cookies, rate limiting, and avoiding blocks: phpspiderblog.com
Making a Proper cURL Request
The bare minimum most tutorials show:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://books.toscrape.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
This works on cooperative websites. On anything with basic bot detection you'll get a 403 or empty response — and the script won't tell you why.
Here's a proper starting point:
function scrape_url($url) {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 5,
        CURLOPT_TIMEOUT => 30,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_ENCODING => '',
        CURLOPT_HTTPHEADER => [
            'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: en-US,en;q=0.5',
            'Connection: keep-alive',
        ],
    ]);

    $response = curl_exec($ch);
    $error = curl_error($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($error) {
        echo "cURL error: " . $error;
        return false;
    }

    if ($httpCode !== 200) {
        echo "HTTP error: " . $httpCode;
        return false;
    }

    return $response;
}

$html = scrape_url("https://books.toscrape.com/");
if ($html) {
    echo "Page fetched. Length: " . strlen($html) . " bytes";
}
Output:
Page fetched. Length: 51274 bytes
Key options that matter:
- CURLOPT_FOLLOWLOCATION — follows redirects. Without it, HTTP-to-HTTPS redirects return an empty response.
- CURLOPT_TIMEOUT — without it, a slow server blocks your script indefinitely.
- User-Agent — without it, requests identify themselves as cURL, and most sites block that.
- CURLINFO_HTTP_CODE — curl_exec() returns false only on network failure, not on a 403 or 404. Always check the status code separately (see the debugging sketch below).
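When a request fails and the status code alone doesn't explain why, curl_getinfo() can also tell you where the request actually ended up. A minimal debugging sketch, using only standard cURL constants against the same demo site:

$ch = curl_init();
curl_setopt_array($ch, [
    CURLOPT_URL => "https://books.toscrape.com/",
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
]);
curl_exec($ch);

// Useful when a scrape silently returns the wrong page or an empty body.
echo "Status:    " . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo "Final URL: " . curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) . PHP_EOL;
echo "Redirects: " . curl_getinfo($ch, CURLINFO_REDIRECT_COUNT) . PHP_EOL;
echo "Time:      " . curl_getinfo($ch, CURLINFO_TOTAL_TIME) . "s" . PHP_EOL;
curl_close($ch);

The effective URL and redirect count are often the fastest way to spot that a "200 OK" was actually a redirect to a login or consent page.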
Parsing HTML With DOMDocument (Not Regex)
Most tutorials use regex on HTML:
preg_match_all('/<h3><a[^>]*title="([^"]*)"/', $response, $matches);
This breaks the moment a site changes one attribute or reformats whitespace. Use DOMDocument instead:
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$titles = $xpath->query('//article[contains(@class,"product_pod")]//h3/a');
foreach ($titles as $title) {
    echo $title->getAttribute('title') . PHP_EOL;
}
Output:
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Extracting multiple fields at once:
$books = $xpath->query('//article[contains(@class,"product_pod")]');

foreach ($books as $book) {
    $titleNode = $xpath->query('.//h3/a', $book)->item(0);
    $priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);

    $title = $titleNode ? $titleNode->getAttribute('title') : 'N/A';
    $price = $priceNode ? trim($priceNode->textContent) : 'N/A';

    echo $title . " | " . $price . PHP_EOL;
}
Output:
A Light in the Attic | £51.77
Tipping the Velvet | £53.74
Soumission | £50.10
libxml_use_internal_errors(true) suppresses warnings from malformed HTML — real websites are full of unclosed tags and encoding issues. Without it, DOMDocument prints a parser warning for every bad tag and your output becomes unreadable.
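The same four lines of libxml setup appear before every parse, so it's worth wrapping them once. A minimal sketch; parse_html() is a name invented for this article, and the XML prolog prefix is a common workaround for loadHTML() assuming ISO-8859-1 when a page doesn't declare its charset:

// Hypothetical helper: wraps the libxml boilerplate and returns a
// ready-to-query DOMXPath, or null when there is nothing to parse.
function parse_html($html) {
    if (!$html) {
        return null;
    }
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The prolog forces UTF-8; without it, loadHTML() treats
    // undeclared input as ISO-8859-1 and mangles multibyte characters.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    return new DOMXPath($dom);
}

$xpath = parse_html($html);
if ($xpath) {
    echo $xpath->query('//article[contains(@class,"product_pod")]')->length . " products found" . PHP_EOL;
}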
Handling Errors and Retries
A scraper that crashes on the first failed request is useless. Build retry logic from the start:
function scrape_with_retry($url, $maxRetries = 3, $delay = 2) {
    $attempt = 0;

    while ($attempt < $maxRetries) {
        $attempt++;

        $ch = curl_init();
        curl_setopt_array($ch, [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT => 30,
            CURLOPT_HTTPHEADER => [
                'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
            ],
        ]);

        $response = curl_exec($ch);
        $curlError = curl_error($ch);
        $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($curlError) {
            echo "Attempt $attempt failed: $curlError" . PHP_EOL;
            sleep($delay);
            continue;
        }

        if ($httpCode === 429) {
            echo "Rate limited. Waiting 10 seconds..." . PHP_EOL;
            sleep(10);
            continue;
        }

        if ($httpCode >= 500) {
            echo "Server error $httpCode. Retrying..." . PHP_EOL;
            sleep($delay);
            continue;
        }

        if ($httpCode === 403 || $httpCode === 404) {
            echo "Failed permanently (HTTP $httpCode)" . PHP_EOL;
            return false;
        }

        if ($httpCode === 200) {
            return $response;
        }
    }

    return false;
}
Not every error deserves a retry:
- 429 — the site is telling you to slow down. Wait longer, don't retry immediately (see the backoff sketch after this list).
- 5xx — server-side, usually temporary. Retry after a pause.
- 403 — you're blocked. Retrying the same request won't help.
- 404 — page doesn't exist. Don't retry at all.
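The fixed $delay above works, but a common refinement is exponential backoff with jitter: each retry waits roughly twice as long as the last, plus a random offset so parallel scrapers don't all retry at the same instant. A minimal sketch; backoff_delay() is a name made up for this article, not a built-in:

// Hypothetical helper: waits roughly 2s, 4s, 8s, ... per attempt,
// capped at $cap seconds, with up to 1s of random jitter added.
function backoff_delay($attempt, $base = 2, $cap = 60) {
    $delay = min($cap, $base * (2 ** ($attempt - 1)));
    return $delay + random_int(0, 1000) / 1000;
}

// Drop-in replacement for the fixed sleep($delay) calls above.
// usleep() takes microseconds, hence the multiplication.
usleep((int)(backoff_delay($attempt) * 1000000));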
Scraping Paginated Content
Following the next button is more reliable than guessing URL patterns:
$url = "https://books.toscrape.com/";
$allBooks = [];
$page = 1;
while ($url) {
$html = scrape_with_retry($url, 3, 2);
if (!$html) break;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$books = $xpath->query('//article[contains(@class,"product_pod")]');
foreach ($books as $book) {
$titleNode = $xpath->query('.//h3/a', $book)->item(0);
$priceNode = $xpath->query('.//*[contains(@class,"price_color")]', $book)->item(0);
$allBooks[] = [
'title' => $titleNode ? $titleNode->getAttribute('title') : 'N/A',
'price' => $priceNode ? trim($priceNode->textContent) : 'N/A',
];
}
echo "Page $page — " . $books->length . " books scraped." . PHP_EOL;
$nextNode = $xpath->query('//li[contains(@class,"next")]/a')->item(0);
if ($nextNode) {
$url = "https://books.toscrape.com/catalogue/" . $nextNode->getAttribute('href');
$page++;
sleep(1);
} else {
echo "Last page reached." . PHP_EOL;
$url = null;
}
}
echo "Total: " . count($allBooks) . " books." . PHP_EOL;
Output:
Page 1 — 20 books scraped.
Page 2 — 20 books scraped.
...
Last page reached.
Total: 1000 books.
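The $allBooks array disappears when the script exits. Proper database storage is covered in the full guide; as a quick stand-in, here is a sketch that dumps the rows to CSV with PHP's built-in fputcsv() (books.csv is an arbitrary filename):

// Write the scraped rows to a local CSV file.
$fp = fopen('books.csv', 'w');
fputcsv($fp, ['title', 'price']); // header row
foreach ($allBooks as $book) {
    fputcsv($fp, [$book['title'], $book['price']]);
}
fclose($fp);
echo "Wrote " . count($allBooks) . " rows to books.csv" . PHP_EOL;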
The full guide covers MySQL storage, session handling with cookies, rate limiting, rotating user agents, and the most common mistakes that silently break scrapers.