Nguyen Ngoc Minh Khoi

CRAWLING COVID-19 DATA FROM THE WEBSITE WORLDOMETER.INFO WITH THE PHP LIBRARY: SIMPLE_HTML_DOM.PHP

1. Introduction to the PHP library simple_html_dom.php
“simple_html_dom.php” is an HTML DOM parser library written in PHP that lets you manipulate HTML very easily. The library supports invalid HTML, finds tags on an HTML page with CSS selectors just like jQuery, and manipulates the DOM like JavaScript. It can extract contents from HTML in a single line, and most web developers will find it easy to learn.
That makes it an excellent tool for crawling websites. Without an HTML DOM parser like “simple_html_dom”, the crawling has to be done with the function “file_get_contents()”, which only returns the whole HTML page as a string and offers no way to find tags in that string.
*Simple_html_dom library website*
Simple_html_dom finds and manipulates HTML DOM elements in a way similar to JavaScript and jQuery
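To make the contrast concrete, here is a minimal sketch of what tag extraction looks like on a raw string, using only PHP built-ins. The HTML snippet, the class name “counter”, and the numbers are made up for illustration, and the regular expression only works for this exact markup. A DOM parser removes the need for such fragile patterns.

```php
<?php
// Hypothetical markup, standing in for the string that
// file_get_contents($url) would return.
$html = '<div class="counter"><span>1,234</span></div>'
      . '<div class="counter"><span>56</span></div>';

// Without a parser, every tag has to be located with a hand-written pattern.
preg_match_all(
    '/<div class="counter"><span>(.*?)<\/span><\/div>/',
    $html,
    $matches
);

// $matches[1] holds the captured text of each span, in document order.
$values = $matches[1];
print_r($values);
```

Any change in attributes, whitespace, or nesting breaks the pattern, which is why the rest of this post relies on CSS selectors instead.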
2. Introduction to the website worldometer.info and its Covid-19 data
Worldometer.info is a reference website that provides counters and real-time statistics on diverse topics. The website was founded in 2014, but it became well known in 2020 thanks to its timely Covid-19 information, which is updated daily.
Worldometer.info displays the Covid-19 statistics of a nation
Worldometer.info displays the Covid-19 statistics from over 200 nations and territories
On worldometer.info, the Covid-19 statistics from over 200 nations and territories are updated daily and displayed in tables and charts. All the data on Worldometer is public and can be crawled without limit.

3. Use simple_html_dom.php to crawl data from *worldometer.info*
Using simple_html_dom.php, we can easily crawl the Covid-19 data: cases, deaths, and recovered. Before starting, you must download the source code of the library from the library's website and save it in any folder. You can then load the library with “require_once”.
Next, we look at the HTML source of the page. Each of the Covid-19 numbers (cases, deaths, recovered) is wrapped in a SPAN tag inside a DIV tag with the class “maincounter-number”. With CSS selectors, we can address them as “.maincounter-number span”.
Now let's do it with simple_html_dom.php. First, we load the whole HTML page into a DOM object.

require_once 'simple_html_dom.php'; // adjust the path to where you saved the library
$html_raw = file_get_html($url);
// file_get_html() is provided by simple_html_dom.php

simple_html_dom.php can easily extract the contents of elements with a given class via CSS selectors.

// find() returns the matching elements in document order
$counters = $html_raw->find(".maincounter-number span");
// => The first DOM element found contains the number of Covid cases
$this->data_ncov->cases = $counters[0]->innertext();
// => The second DOM element found contains the number of Covid deaths
$this->data_ncov->death = $counters[1]->innertext();
// => The third DOM element found contains the number of recovered
$this->data_ncov->recovered = $counters[2]->innertext();
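The counters come back as display strings, typically with thousands separators and surrounding whitespace. If you need numbers rather than text, a small helper can normalize them. This is only a sketch: the function name counter_to_int and the sample string are hypothetical, and it assumes the counter contains nothing but digits, commas, and whitespace.

```php
<?php
// Hypothetical helper: turn a display string such as " 1,234,567 "
// into an integer by dropping every character that is not a digit.
function counter_to_int(string $text): int
{
    $digits = preg_replace('/\D/', '', $text);
    // An empty result (for example "N/A") is treated as 0 here.
    return $digits === '' ? 0 : (int) $digits;
}

$cases = counter_to_int(" 1,234,567 ");
var_dump($cases);
```

Treating an empty result as 0 is a design choice for the sketch. Returning null instead may be safer if you need to tell "no data" apart from a real zero.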

4. Crawling more statistics from Worldometer.info
With the simple_html_dom library, we can now crawl some basic Covid-19 statistics for a particular country. But Worldometer has much more public information that we can crawl, namely the day-by-day Covid statistics of each country.
Looking at the page source, we see that there is a dataset behind each chart drawn on the page. Take, for example, the day-by-day statistics of total cases.
Behind each chart drawn on Worldometer …
… there is a dataset of dates
… and a dataset of values
From the two images above, it is easy to see that the chart is built with the Highcharts JavaScript library. The dataset of dates we need is in the object chart->….->categories (red field), and the dataset of values is in the object chart->….->data (blue field). You can crawl and process these datasets with a plain PHP script and return them in a PHP array.

// Assumption: $html_entities holds the page source as a plain string.
// Find the dataset by the string that opens the Highcharts object.
$index_chart_begin = strpos(
    $html_entities,
    "Highcharts.chart('coronavirus-cases-linear'"
);

// Get the array of days (for the horizontal axis).
$index_xAxis_begin = strpos($html_entities, "xAxis", $index_chart_begin);
$xAxis_index_bracket_open = strpos($html_entities, '[', $index_xAxis_begin);
$xAxis_index_bracket_close = strpos($html_entities, ']', $xAxis_index_bracket_open);
$string_xAxis = substr(
    $html_entities,
    $xAxis_index_bracket_open + 1,
    $xAxis_index_bracket_close - $xAxis_index_bracket_open - 1
);
$string_xAxis = str_replace('"', '', $string_xAxis);
// Each date itself contains ", "; replace it with "/" so that
// the remaining commas only separate entries.
$string_xAxis = str_replace(', ', '/', $string_xAxis);
$array_xAxis = explode(',', $string_xAxis);

// Get the array of values.
$index_data_begin = strpos($html_entities, "data", $index_chart_begin);
$data_index_bracket_open = strpos($html_entities, '[', $index_data_begin);
$data_index_bracket_close = strpos($html_entities, ']', $data_index_bracket_open);
$string_data = substr(
    $html_entities,
    $data_index_bracket_open + 1,
    $data_index_bracket_close - $data_index_bracket_open - 1
);
$array_data = explode(',', $string_data);

// Then collect them in an array of pairs (day => value).
// min() guards against the two arrays having different lengths.
$return_array = array();
$count = min(count($array_xAxis), count($array_data));
for ($i = 0; $i < $count; $i++) {
    array_push($return_array, array($array_xAxis[$i] => $array_data[$i]));
}
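The same extraction can be sketched more compactly with a regular expression and json_decode, which also copes with the commas inside the dates. This is an alternative under stated assumptions, not the method used in the demo project: the helper name extract_chart_array and the page excerpt are hypothetical, and it assumes both arrays are flat, valid JSON literals (double-quoted strings, numbers, null).

```php
<?php
// Hypothetical excerpt of the page's inline script; real pages are much longer.
$page = <<<'HTML'
Highcharts.chart('coronavirus-cases-linear', {
    xAxis: { categories: ["Feb 15, 2020","Feb 16, 2020","Feb 17, 2020"] },
    series: [{ name: 'Cases', data: [15,15,25] }]
});
HTML;

// Hypothetical helper: return the first flat [...] literal that follows
// "<key>:" after the start of the given chart, decoded as JSON.
// Returns null if the chart or the key is not found.
function extract_chart_array(string $page, string $chartId, string $key): ?array
{
    $start = strpos($page, "Highcharts.chart('" . $chartId . "'");
    if ($start === false) {
        return null;
    }
    $pattern = '/\b' . preg_quote($key, '/') . '\s*:\s*(\[[^\]]*\])/';
    if (!preg_match($pattern, $page, $m, 0, $start)) {
        return null;
    }
    $decoded = json_decode($m[1], true);
    return is_array($decoded) ? $decoded : null;
}

$dates  = extract_chart_array($page, 'coronavirus-cases-linear', 'categories');
$values = extract_chart_array($page, 'coronavirus-cases-linear', 'data');

// Pair each date with its value only when both arrays were found and line up.
$series = [];
if ($dates !== null && $values !== null && count($dates) === count($values)) {
    $series = array_combine($dates, $values);
}
print_r($series);
```

Because json_decode parses the quoted strings itself, the dates keep their original form and the values arrive as numbers rather than strings.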


Finally, the array can be converted to a JSON string using the json_encode function. You can do the same to crawl the day-by-day statistics of daily new cases, total deaths, daily new deaths, and recovered cases.
I have built a demo project; you can watch the demo video on my YouTube channel or download the full source code for reference. Besides the Covid case statistics, this project also crawls the Covid vaccination data by country from the website ourworldindata.com.
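As a small illustration of that last step, the sketch below encodes an array in the shape produced by the loop above. The sample values are made up, and the JSON_UNESCAPED_SLASHES flag is optional: it only keeps the "/" in the dates readable.

```php
<?php
// Sample data in the shape built by the loop: a list of single-pair
// arrays (day => value). The values here are invented for illustration.
$return_array = [
    ['Feb 15/2020' => '15'],
    ['Feb 16/2020' => '25'],
];

// Without JSON_UNESCAPED_SLASHES, json_encode() would emit "\/" for "/".
$json = json_encode($return_array, JSON_UNESCAPED_SLASHES);
echo $json, PHP_EOL;
// prints [{"Feb 15/2020":"15"},{"Feb 16/2020":"25"}]
```

The values stay strings because explode() returns strings. Cast them to integers first if the consumer expects numbers.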
