DEV Community

Florian Engelhardt

Processing One Billion Rows in PHP!

You may have heard of "The One Billion Row Challenge" (1brc); in case you haven't, go check out Gunnar Morling's 1brc repo.

I got sucked in because two of my colleagues have entered the competition and are on the leaderboard.

PHP is not known for its speed, but as I am working on the PHP profiler, I thought I'd give it a shot and see how fast it can get.

A first naive approach

I cloned the repository and created the billion row dataset in measurements.txt. After that I started building my first naive implementation of something that could solve the challenge:



<?php

$stations = [];

$fp = fopen('measurements.txt', 'r');

while ($data = fgetcsv($fp, null, ';')) {
    if (!isset($stations[$data[0]])) {
        $stations[$data[0]] = [
            $data[1],
            $data[1],
            $data[1],
            1
        ];
    } else {
        $stations[$data[0]][3] ++;
        $stations[$data[0]][2] += $data[1];
        if ($data[1] < $stations[$data[0]][0]) {
            $stations[$data[0]][0] = $data[1];
        }
        if ($data[1] > $stations[$data[0]][1]) {
            $stations[$data[0]][1] = $data[1];
        }
    }
}

ksort($stations);

echo '{';
foreach($stations as $k=>&$station) {
    $station[2] = $station[2]/$station[3];
    echo $k, '=', $station[0], '/', $station[2], '/', $station[1], ', ';
}
echo '}';



There is nothing wild in here: just open the file and use fgetcsv() to read the data. If the station is not found yet, create it; otherwise increment the counter, add to the temperature sum, and check whether the current temperature is lower than min or higher than max, updating them accordingly.

Once I have everything together, I use ksort() to bring the $stations array in order, then echo out the list, calculating the average temperature (sum / count) while doing so.

Running this simple code on my laptop takes 25 minutes 🤯

Time to optimise and have a look at the profiler:

Timeline visualisation using the Datadog PHP profiler showing that the code is 100% CPU bound

The timeline visualisation helps me see that this is clearly CPU bound: file compilation at the beginning of the script is negligible, and there are no garbage collection events.

Flame graph view showing that we are spending 46% of CPU time in the `fgetcsv()` function.

The flame graph view is helpful as well in showing that I am spending 46% of CPU time in fgetcsv().

fgets() instead of fgetcsv()

The first optimisation was to use fgets() to get a line and split on the ; character manually instead of relying on fgetcsv(). This is because fgetcsv() does a whole lot more than what I need.



// ...
while ($data = fgets($fp, 999)) {
    $pos = strpos($data, ';');
    $city = substr($data, 0, $pos);
    $temp = substr($data, $pos+1, -1);
// ...



Additionally I refactored $data[0] to $city and $data[1] to $temp everywhere.

Running the script again with just this one change already brought the runtime down to 19m 49s. In absolute numbers, this is still a lot but also: it's 21% down!

Flame graph view now showing `fgets()`, `substr()` and `strpos()` combined using nearly the same amount of CPU as `fgetcsv()` before, but still being faster

The flame graph reflects the change, switching to showing the CPU time by line also reveals what is going on in the root frame:

Flame graph showing the CPU time by line, indicating that we are spending 38% in lines 18 and 23

I am spending ~38% CPU time in lines 18 and 23, which are:



18 | $stations[$city][3] ++;
   | // ...
23 | if ($temp > $stations[$city][1]) {



Line 18 is the first access to the $stations array in the loop; apart from that it is only an increment, and line 23 is a comparison. Nothing that seems expensive at first glance, but let's do a few more optimisations and you'll see what is taking time here.

Using a reference where possible



$station = &$stations[$city];
$station[3] ++;
$station[2] += $temp;
// instead of
$stations[$city][3] ++;
$stations[$city][2] += $temp;



This should save PHP from searching for the key in the $stations array on every array access; think of it as a cache for accessing the "current" station in the array.
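As a tiny self-contained sketch (the station name and values here are made up), writes through the reference land in the original array, so the hash lookup for the key happens only once:

```php
<?php

// Hypothetical station entry: [min, max, sum, count]
$stations = ['Hamburg' => [12.0, 12.0, 12.0, 1]];
$city = 'Hamburg';

$station = &$stations[$city]; // look up the key once
$station[3]++;                // count
$station[2] += 5.0;           // sum

assert($stations[$city][3] === 2);
assert($stations[$city][2] === 17.0);

unset($station); // break the reference before reusing the variable
```

Note the unset(): leaving the reference dangling and reassigning the variable later would write into the array.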

And it actually helps, running this only takes 17m 48s, another 10% down!

Only one comparison

While looking at the code, I stumbled over this piece:



if ($temp < $station[0]) {
    $station[0] = $temp;
}
if ($temp > $station[1]) {
    $station[1] = $temp;
}



In case the temperature is lower than min, it cannot be higher than max anymore, so I made this an elseif, possibly sparing some CPU cycles.
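In isolation, the merged check looks like this (a minimal sketch with made-up values):

```php
<?php

$station = [10.0, 20.0]; // [min, max] for one station
$temp = 25.5;

if ($temp < $station[0]) {
    $station[0] = $temp;      // new minimum
} elseif ($temp > $station[1]) {
    $station[1] = $temp;      // a value below min can never be a new maximum
}

assert($station === [10.0, 25.5]);
```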

BTW: I don't know anything about the order of temperatures in measurements.txt, but depending on that order it could make a difference which of the two I check first.

The new version takes 17m 30s, which is another ~2%. Better than just jitter, but not by a lot.

Adding type casts

PHP is known as a dynamic language, and that is something I valued a lot when I first got started writing software: one less problem to care about. On the other hand, knowing the types helps the engine make better decisions when running your code.



$temp = (float)substr($data, $pos+1, -1);



Guess what? This simple cast makes the script run in just 13m 32s which is a whopping performance increase of 21%!

Flame graph showing the CPU time by line, indicating that we are spending 15% in 23



18 | $station = &$stations[$city];
   | // ...
23 | } elseif ($temp > $station[1]) {



Line 18 still shows up with 11% of CPU time spent, which is the access to the array (finding the key in the hash map, the underlying data structure used for associative arrays in PHP).

Line 23's CPU time dropped from ~32% to ~15%. This is due to PHP not doing type juggling anymore. Before the type cast, $temp, $station[0] and $station[1] were strings, so PHP had to convert them to float on every single comparison.
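A quick illustration (station name and values made up): without the cast the parsed value stays a string, and every numeric comparison first juggles it to a float:

```php
<?php

$raw  = substr("Hamburg;21.3\n", 8, -1); // "21.3", still a string
$cast = (float)$raw;                     // a real float zval

assert($raw === "21.3");
assert($cast === 21.3);
assert($raw < 25.0); // works, but converts "21.3" to 21.3 on every comparison
```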

What about JIT?

OPcache in PHP is disabled by default in CLI and needs the opcache.enable_cli setting set to on. JIT (as part of OPcache) is enabled by default, but effectively disabled because the buffer size is set to 0, so I set opcache.jit_buffer_size to something; I just went with 10M. After applying these changes, I re-ran the script with JIT and saw it finish in:

7m 19s 🚀

which is 45.9% less time spent!!
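For reference, both settings can also be passed as -d flags for a one-off CLI run (the script filename here is a placeholder):

```shell
# One-off run with OPcache and JIT enabled on the CLI; 1brc.php is a placeholder
php -d opcache.enable_cli=1 \
    -d opcache.jit_buffer_size=10M \
    1brc.php
```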

What more?

I already brought the runtime down from 25 minutes in the beginning to just ~7 minutes. One thing I found absolutely astonishing is that fgets() allocates ~56 GiB per minute of RAM for reading a 13 GB file. Something seemed off, so I checked the implementation of fgets(), and it looks like I can spare a lot of these allocations by just omitting the len argument to fgets():



while ($data = fgets($fp)) {
// instead of
while ($data = fgets($fp, 999)) {



Comparing the profile before and after the change gives this:

Profile comparison view before and after omitting the len argument to `fgets()`, revealing that PHP is only allocating 6 GiB per minute instead of 56 GiB per minute.

You might think that this gives a lot of performance improvement, but it is just ~1%. This is because those are small allocations which the ZendMM can handle in bins and those are blazingly fast.

Can we make it even faster?

Yes, we can! So far my approach was single threaded, which is the nature of most PHP software, but PHP does support threads in user land through the parallel extension.

Reading the data in PHP is a bottleneck, as the profiler clearly shows. Switching from fgetcsv() to fgets() with a manual split helps, but it still takes a lot of time, so let's use threads to parallelise reading and processing the data and afterwards combine the intermediate results from the worker threads.



<?php

$file = 'measurements.txt';

$threads_cnt = 16;

/**
 * Get the chunks that each thread needs to process with start and end position.
 * These positions are aligned to \n chars because we use `fgets()` to read
 * which itself reads till a \n character.
 *
 * @return array<int, array{0: int, 1: int}>
 */
function get_file_chunks(string $file, int $cpu_count): array {
    $size = filesize($file);

    if ($cpu_count == 1) {
        $chunk_size = $size;
    } else {
        $chunk_size = (int) ($size / $cpu_count);
    }

    $fp = fopen($file, 'rb');

    $chunks = [];
    $chunk_start = 0;
    while ($chunk_start < $size) {
        $chunk_end = min($size, $chunk_start + $chunk_size);

        if ($chunk_end < $size) {
            fseek($fp, $chunk_end);
            fgets($fp); // moves fp to next \n char
            $chunk_end = ftell($fp);
        }

        $chunks[] = [
            $chunk_start,
            $chunk_end
        ];

        $chunk_start = $chunk_end;
    }

    fclose($fp);
    return $chunks;
}

/**
 * This function will open the file passed in `$file` and read and process the
 * data from `$chunk_start` to `$chunk_end`.
 *
 * The returned array has the name of the city as the key and an array as the
 * value, containing the min temp in key 0, the max temp in key 1, the sum of
 * all temperatures in key 2 and count of temperatures in key 3.
 *
 * @return array<string, array{0: float, 1: float, 2: float, 3: int}>
 */ 
$process_chunk = function (string $file, int $chunk_start, int $chunk_end): array {
    $stations = [];
    $fp = fopen($file, 'rb');
    fseek($fp, $chunk_start);
    while ($data = fgets($fp)) {
        $chunk_start += strlen($data);
        if ($chunk_start > $chunk_end) {
            break;
        }
        $pos2 = strpos($data, ';');
        $city = substr($data, 0, $pos2);
        $temp = (float)substr($data, $pos2+1, -1);
        if (isset($stations[$city])) {
            $station = &$stations[$city];
            $station[3] ++;
            $station[2] += $temp;
            if ($temp < $station[0]) {
                $station[0] = $temp;
            } elseif ($temp > $station[1]) {
                $station[1] = $temp;
            }
        } else {
            $stations[$city] = [
                $temp,
                $temp,
                $temp,
                1
            ];
        }
    }
    return $stations;
};

$chunks = get_file_chunks($file, $threads_cnt);

$futures = [];

for ($i = 0; $i < $threads_cnt; $i++) {
    $runtime = new \parallel\Runtime();
    $futures[$i] = $runtime->run(
        $process_chunk,
        [
            $file,
            $chunks[$i][0],
            $chunks[$i][1]
        ]
    );
}

$results = [];

for ($i = 0; $i < $threads_cnt; $i++) {
    // `value()` blocks until a result is available, so the main thread waits
    // for the thread to finish
    $chunk_result = $futures[$i]->value();
    foreach ($chunk_result as $city => $measurement) {
        if (isset($results[$city])) {
            $result = &$results[$city];
            $result[2] += $measurement[2];
            $result[3] += $measurement[3];
            if ($measurement[0] < $result[0]) {
                $result[0] = $measurement[0];
            }
            if ($measurement[1] > $result[1]) {
                $result[1] = $measurement[1];
            }
        } else {
            $results[$city] = $measurement;
        }
    }
}

ksort($results);

echo '{', PHP_EOL;
foreach($results as $k=>&$station) {
    echo "\t", $k, '=', $station[0], '/', ($station[2]/$station[3]), '/', $station[1], ',', PHP_EOL;
}
echo '}', PHP_EOL;



This code does a few things: first I scan the file and split it into chunks that are \n aligned (since I am using fgets() later). When the chunks are ready, I start $threads_cnt worker threads that all open the same file, seek to their assigned chunk start, and read and process the data till the chunk end, returning an intermediate result that afterwards gets combined, sorted and printed out in the main thread.

This multithreaded approach finishes in just:

1m 35s 🚀

Is this the end?

Nope, certainly not. There are at least two more things to this solution:

  1. I am running this code on macOS on Apple Silicon hardware, which crashes when using the JIT in a ZTS build of PHP, so the 1m 35s result is without JIT; it might be even faster if I could use it
  2. I realised that I was running on a PHP version that was compiled with CFLAGS="-g -O0 ..." due to my needs in my day-to-day work 🤦

I should have checked this in the beginning, so I re-compiled PHP 8.3 using CFLAGS="-Os ..." and my final number (with 16 threads) is:

🚀 27.7 seconds 🚀

This number is by no means comparable to what you can see on the leaderboard of the original challenge, because I ran this code on totally different hardware.

This is a timeline view with 10 threads:

Timeline visualisation showing the main thread at the bottom and the 10 worker threads

The thread at the bottom is the main thread, waiting for the results from the worker threads. Once the workers have returned their intermediate results, the main thread can be seen combining and sorting everything. We can also clearly see that the main thread is by no means the bottleneck. In case you want to optimise this even further, concentrate on the worker threads.

What did I learn on the way?

Each abstraction layer simply trades usability/integration for CPU cycles or memory. fgetcsv() is super easy to use and hides a lot of stuff, but this comes at a cost. Even fgets() hides some stuff from us but makes it super convenient to read the data.

Adding types to your code helps the language either optimise execution or stop type juggling, which is something you do not see but still pay for with CPU cycles.

JIT is awesome, especially when dealing with CPU bound problems!

It is definitely not the nature of most PHP software, but thanks to parallelisation (using ext-parallel) we could push the numbers down significantly.

The END?

I hope you had as much fun reading this article as I had writing it. In case you want to further optimise the code, feel free to grab it and leave a comment on how far you got.

[Edited to add the following on March 18, 2024]

There is MOAR!! performance to gain

After this blog post was released, some ideas on how to make it faster arose in the comments; three suggestions came from @xjakub

Remove the isset()

We might not need the isset() check: we can "just" create the reference to the station, and it will be NULL when the station does not exist. This spares one array access when the city already exists (the majority of cases).



# before
if (isset($stations[$city])) {
    $station = &$stations[$city];
// ..

# after
$station = &$stations[$city];
if ($station !== NULL) {
// ..



This shaves off around 2.5% of wall time!

Don't check on the fgets() return value

The main loop that reads the file currently looks like this:



while ($data = fgets($fp)) {
    $chunk_start += strlen($data);
    if ($chunk_start > $chunk_end) {
        break;
    }
// ..



The added check for $chunk_start > $chunk_end came with the move to parallelisation, as every thread only works on a part of the file. Now @xjakub mentioned that there is no need to check the fgets() return value anymore, as it will always return a string as long as we are between $chunk_start and $chunk_end, meaning we can make that comparison the loop condition and read unchecked.



while ($chunk_start < $chunk_end) {
    $data = fgets($fp);
    $chunk_start += strlen($data);
// ..



This change removes a branch from the loop and results in another ~2.7% drop in wall time!

fgets() vs. stream_get_line()

The actual reading and storing into $city and $temp looks as follows:



$data = fgets($fp);
$chunk_start += strlen($data);
$pos2 = strpos($data, ';');
$city = substr($data, 0, $pos2);
$temp = (float)substr($data, $pos2+1, -1);



I had never heard of stream_get_line(), which behaves nearly identically to fgets() except that it lets you specify the end-of-line delimiter!



$city = stream_get_line($fp, 99, ';');
$temp = stream_get_line($fp, 99, "\n");
$chunk_start += strlen($city) + strlen($temp) + 2;
$temp = (float)$temp;



This change shaved off another ~15% of wall time!

Why is that? The implementations of fgets() and stream_get_line() are really close; both use the PHP stream layer. The major difference is that we no longer need to split the string returned by fgets() into substrings using substr(). The additional strlen() call is negligible, as variables in PHP are zvals under the hood, which store the length of the string.
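As a minimal self-contained sketch (using an in-memory stream and made-up data), the two-call parsing yields the same ($city, $temp) pair without creating an intermediate line string:

```php
<?php

// Simulate one line of measurements.txt on an in-memory stream
$fp = fopen('php://memory', 'r+b');
fwrite($fp, "Hamburg;12.3\n");
rewind($fp);

$city = stream_get_line($fp, 99, ';');   // reads up to and consumes the ';'
$temp = stream_get_line($fp, 99, "\n");  // reads up to and consumes the "\n"
$consumed = strlen($city) + strlen($temp) + 2; // +2 for the two delimiters

assert($city === 'Hamburg');
assert((float)$temp === 12.3);
assert($consumed === 13); // length of "Hamburg;12.3\n"

fclose($fp);
```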

So where are we with the wall time for this PHP script?

@codemunky showed up in the comments and ran the benchmark on the same AX161 from Hetzner that the Java folks used to run their implementations on.

The final result (so far): 12.76 seconds 🎉

The END again?

I don't know, maybe there's still something to optimise here, but after managing to spend ~83% of the wall time in the stream_get_line() function, it looks like we've achieved what the PHP stream layer allows.

PHP profiler showing 83% of wall time spent in the `stream_get_line()` function

Either we find another function that bypasses the PHP stream layer and gives more direct access to the filesystem or we try to optimise the layer itself.

Happy hacking!

Top comments (70)

Mykezero

I love the strategy: write the simple code first, refine performance, and then introduce multi-threading. This also emphasizes the point that it's sometimes necessary to know your compiler and the inner workings underneath a framework or library to determine how to gain that performance boost.

A C# example of this would be knowing that if a request is IO bound, and the function you called uses the CreateIoCompletionPort Win32 function, then it will use an I/O completion port: the CPU releases the thread and the completion port signals when the work is finished, as opposed to tying up the thread until then.

There's a lot more going on underneath the hood, but even with just a little knowledge of the inner workings, we can save resources and improve performance.

Gunabalan.S

I tried a standard PHP script execution on a WAMP environment with 8 GB RAM,
and it safely processed approximately 10 million records.

<?php
/**
 * The test.csv file contains a limited data set for verifying the accuracy of calculations.
 * 
 * karaikal;1
 * karaikal;2
 * karaikal;2
 * karaikal;3
 * Mumbai;1
 * Mumbai;2
 * Mumbai;3
 * Mumbai;4
 * Mumbai;4
 * 
 * then go with actual dataset : weather_stations.csv
 */
$seconds = time(); //time stamp

//load file
//$fileHandle = fopen('./weather_stations.csv', 'r');
$fileHandle = fopen('./test.csv', 'r');

$data = [];

$index = 0;
while (($line = fgets($fileHandle)) !== false) {
    $values = explode(";", $line);
    $key = trim($values[0]);
    $value = trim($values[1]);
    $data[$key]['m1'] = isset($data[$key]['m1']) ? ($data[$key]['m1'] . ',' . $value) : $value;
    $data[$key]['m'][$value] = ($data[$key]['m'][$value] ?? 0) + 1;
    $index++;
}

krsort($data);

foreach ($data as $key => $val) {
    $dataPerCity = explode(',', $val['m1']);
    sort($dataPerCity); // the median requires sorted values
    $count = count($dataPerCity);
    $sum = array_sum($dataPerCity);
    $middle = floor(($count - 1) / 2);

    // Mean (Average)
    $mean = $sum / $count;

    // Median
    $median = ($count % 2 == 0) ? ($dataPerCity[$middle] + $dataPerCity[$middle + 1]) / 2 : $dataPerCity[$middle];

    // Mode
    $mode = array_search(max($val['m']), $val['m']);

    // Output results
    echo "$key / Mean: $mean Median: $median Mode: $mode\n";
}

fclose($fileHandle);

$seconds2 = time();
$elapsedTime = $seconds2 - $seconds;
echo "Elapsed time in seconds: $elapsedTime\n";
echo "Records processed : $index\n";


Moshe Dolev

First, it's really beautiful how you improved the code step by step and eventually got a 60x boost - from half an hour to half a minute. WOW !!
Second, you note that your final best result is very different from what is seen on the leaderboard because of very different hardware. It would be interesting to run this test on similar hardware; that way we would get an indication of whether and by how much PHP is slower than Java.

codemunky

I got you. I have the exact same AX161 from Hetzner. The final version of his code executes in 15.4 seconds (with JIT & parallel).

Florian Engelhardt

May I ask if you could re-run the script with the changes that @xjakub suggested? I created a file with those changes at github.com/realFlowControl/1brc
I am very curious about the time you get on that machine

codemunky

With pleasure.

Ran it five times as my server is a bit busier today: 13.4/12.8/12.9/12.6/12.5

And re-ran the old version five times, for completeness: 16.3/16.0/15.9/15.9/15.9

codemunky

Interestingly, it doesn't seem to scale past 24 threads. Between 16 and 24 it gets faster the more threads I chuck at it; over 24 it starts to slow. nice/ionice make no noticeable difference

Florian Engelhardt

Whoop whoop, that's awesome

Jakub a. Ritchie

I tried doing this in PHP too, and the end result was very similar to yours but with some changes which should amount to ~5% better performance (which just shows how hard it is to further optimise it):

  • Instead of isset($stations[$city]), you can check if $station = &$stations[$city]; assigns null! This way you only need to access the hashmap once, and then change $station as necessary!
  • You are checking both fgets and $chunk_start > $chunk_end to decide whether to end the loop, but it is possible to skip the first check. Although I'm not sure if this will bring a noticeable difference.

As for the ZTS+Mac issue, you can try to either run the code in a Docker container (php:zts plus pecl install parallel was able to run your solution), or use NTS+multiprocessing by spawning separate processes with either amphp/parallel or just proc_open! (I used the latter in my case)

Edit: I also see that I skip a char at strpos($line, ';', 1);. I don't think it has any performance impact though 😄

Florian Engelhardt

Removing the isset() shaved off 2.5% of wall time; that is an awesome idea and was easy to check. Making $chunk_start < $chunk_end the condition of the while loop and moving the fgets() into the loop body shaved off another 2.7% of wall time on my machine.

I'll update the blog post once I am back from DutchPHP. That's awesome! Thank you!

Jakub a. Ritchie

Oh, I'm glad both changes helped! Compared to the fread ones they were much simpler to implement, but I just realized we can simply use stream_get_line to fetch the station name, which is faster and simpler than the other approaches!

Florian Engelhardt

Replied on Twitter already, but for visibility: switching to stream_get_line() shaved off another 15%
This is awesome!

Jakub a. Ritchie

I just tried reading the file with fread instead of fgets and I got a 15% performance improvement! The approach makes the code uglier, but it is noticeably faster at least

Florian Engelhardt

I updated the blog post with your suggestions, thank you!

CodeWithCaen

Incredible! So many cool insights learned from this!

Renoir

Good stuff!

Type casting and a huge change, neat!!

Have you checked how generators and async would fare?

Also, about the sorting: the end result is much faster indeed, but you haven't kept sorting. It isn't exactly the same output, is it?

In any case; thanks for sharing and inciting people to experiment with performance tracing!

Токмаков Александр

My God. Thank you for the article. Finally I can optimize my code!

Risal Hidayat

What a write-up, I learned a lot from your post. Huge thanks

Valentin Iljaž

Bravo. I really appreciated the effort. PHP deserves some love and caring; it can pay back tenfold :-)

Eckehard

Is there a test file to download? As far as I understand, the source data can be created from a script, but for a first test a simple file would be easier.

Florian Engelhardt

Hey there,
I created a repo with the source code for PHP at github.com/realFlowControl/1brc
This also has a PHP script to generate the measurements.txt:

php createMeasurements.php 10000 # to create a file with 10k entries

Hope this helps!

Eckehard

Hi Florian,

Works great! The timing is reported in ms, but it seems to be seconds.

It took about 160s to write a 132 MB file on my local drive, but the time to read the file will greatly depend on the file system, caching and the reading strategy. Depending on your system, reading will be much faster the second time you open the file.

Notepad++ opens the file without any delay, as it only reads the part that is displayed. The standard editor takes about one minute to do the same, as it reads the whole file at once. We could likewise read the file in chunks; as file operations are handled in the background by the drive, they usually do not put much strain on the processor.

So if we do, we get a nice contest of file operations, but as you don't have much control over the details in most high-level languages, what does the contest really measure?

Some comments may only be visible to logged-in visitors. Sign in to view all comments.