DEV Community

Ganesh Bagaria

Posted on • Edited on • Originally published at Medium

How to Scrape a Website Using PHP

Hey guys, today I will show you how to scrape a website using PHP. To do so, you need to include the simple_html_dom.php file (the PHP Simple HTML DOM Parser library) in your PHP file. This file provides predefined functions for parsing a site's HTML and searching through its tags. Keep in mind that scraping a website without the site's permission can be considered illegal.

*This post is for educational purposes only.

First, choose the website and the data on it that you want to scrape. Here I am using the AndroidHeadlines.com site as the example. From it we are going to scrape the latest headlines.

AndroidHeadlines Site Screenshot

Step 1 : First, open and close the PHP tags –

<?php
?>

Step 2 : Next, place the simple_html_dom.php file in the same folder as your PHP file and include it in your code –

<?php
require_once("simple_html_dom.php");
?>

Step 3 : Now create a variable to hold the result of the file_get_html() function, which builds a Document Object Model (DOM) for the URL passed inside its parentheses –

<?php
require_once("simple_html_dom.php");
$html = file_get_html("https://www.androidheadlines.com/");
?>
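
Downloading a page can fail, for example when the site is unreachable, so it is worth checking the result before going any further. The snippet below is only a rough sketch of that idea. It uses PHP's built-in file_get_contents() rather than simple_html_dom, and the helper name fetch_page, the timeout, and the user-agent string are my own assumptions for illustration.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): fetch a page as a string,
// returning null instead of false when the download fails.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    // Options under "http" only apply to http:// and https:// URLs.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeoutSeconds,
            'user_agent' => 'ExampleScraper/1.0', // assumed value
        ],
    ]);

    // @ silences the warning so the caller can handle the failure itself.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}

// Usage (makes a real network request, so it is left commented out):
// $page = fetch_page("https://www.androidheadlines.com/");
// if ($page === null) {
//     echo "Could not download the page.";
// }
```

Returning null on failure gives the calling code one simple condition to test before it starts looking for tags.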

Step 4 : Now, using the $html variable, we can search the site's tags. Let's find the tag that contains all the latest posts. To search inside the $html variable we use the find() method; the [0] index at the end selects the first matching element –

AndroidHeadlines Screenshot with inspect elements

<?php
require_once("simple_html_dom.php");
$html = file_get_html("https://www.androidheadlines.com/");
$headlines = $html->find("div[class=container]")[0];
?>

Step 5 : We only want the title of each headline, and there are multiple headlines, so we need an array to store them –

<?php
require_once("simple_html_dom.php");
$html = file_get_html("https://www.androidheadlines.com/");
$headlines = $html->find("div[class=container]")[0];
$titles = array();
?>

Step 6 : Now we find the tag that contains the headline title. As you can see, the span tag contains the title. So select it, and don't put an index at the end, so that find() returns every match as an array. We can save the result directly to our titles array –

AndroidHeadlines Screenshot

<?php
require_once("simple_html_dom.php");
$html = file_get_html("https://www.androidheadlines.com/");
$headlines = $html->find("div[class=container]")[0];
$titles = array();
$titles = $headlines->find("span[class=featured-title]");
?>
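
To try the container-then-spans lookup from Steps 4 to 6 without depending on the live site, here is a small sketch that runs the same idea against an inline HTML string. Note the assumptions: it uses PHP's built-in DOMDocument and DOMXPath classes rather than simple_html_dom, the sample markup is made up (only the class names container and featured-title come from the steps above), and extract_titles is a hypothetical helper name.

```php
<?php
// Hypothetical helper: return the text of every span.featured-title found
// inside the first div.container of the given HTML string.
function extract_titles(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parse warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Like find("div[class=container]")[0]: take the first matching div.
    // This XPath matches the class attribute exactly.
    $container = $xpath->query('//div[@class="container"]')->item(0);
    if ($container === null) {
        return []; // nothing matched, so there are no titles
    }

    // Like find("span[class=featured-title]") without an index: all matches.
    $titles = [];
    foreach ($xpath->query('.//span[@class="featured-title"]', $container) as $span) {
        $titles[] = trim($span->textContent);
    }
    return $titles;
}

// Made-up markup standing in for the real page.
$sample = '<div class="container">'
        . '<span class="featured-title">First headline</span>'
        . '<span class="featured-title">Second headline</span>'
        . '</div>';

print_r(extract_titles($sample));
```

Testing against a fixed string like this makes it easier to tell a wrong selector apart from a page that simply changed or failed to load.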

Step 7 : Now, to print the titles array, use foreach or any other loop –

<?php
require_once("simple_html_dom.php");
$html = file_get_html("https://www.androidheadlines.com/");
$headlines = $html->find("div[class=container]")[0];
$titles = array();
$titles = $headlines->find("span[class=featured-title]");
foreach ($titles as $title) {
    echo $title . "<br>";
}
?>
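
Echoing a simple_html_dom element prints its full HTML, span tag included. If you have plain title strings and want to build the output yourself, a sketch like the following escapes each one before printing. format_titles is a hypothetical helper, and the sample titles are invented.

```php
<?php
// Hypothetical helper: turn a list of plain-text titles into HTML lines,
// escaping special characters so a title cannot inject markup.
function format_titles(array $titles): string
{
    $lines = [];
    foreach ($titles as $title) {
        $lines[] = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
    }
    return implode("<br>\n", $lines);
}

// Invented sample data standing in for scraped titles.
echo format_titles(['Phones & tablets', 'Is 5 < 6?']);
```

With simple_html_dom, the plain text of an element is available through its plaintext property, so $title->plaintext could feed a helper like this one.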

Step 8 : Finally, you'll get the scraped data as output, like this –

Scraped Array Result

I hope you now know how to scrape data from a website. Feel free to ask any questions.

Don't forget to visit my blog, ganofins.com, for more posts like this.

Top comments (2)

Crawlbase

This blog provides a clear and concise guide on scraping websites using PHP, with a helpful step-by-step approach. It's a great resource for those looking to learn web scraping techniques. If you need further assistance, don't hesitate to ask. Also, check out Crawlbase for more advanced scraping solutions.

Lars Moelleken

Hi, I am maintaining a modern fork of "simple_html_dom" for PHP, maybe you will like it. :)

-> github.com/voku/simple_html_dom
