Gianluca Fabrizi

Posted on Jan 31 • Originally published at gianlucafabrizi.dev

1 Billion Rows Challenge in PHP

#php #1brc #performance

You've probably heard of the 1BRC (1 Billion Rows Challenge).

In a nutshell, it's a challenge (originally aimed at Java programmers) to write code that can solve a data calculation and aggregation problem on a file with 1 billion rows in the fastest way possible.

While we can all agree that it is not indicative of the quality of the language or the typical usage scenarios, it is an activity that leads to a better understanding of the programming language being used and to the discovery of more powerful techniques.

The challenge was very successful and many have tried to transfer the concept to other programming languages.

Today, we will try to tackle the challenge in the language that, at first glance, seems to have the least to say: PHP.

Let's start by reading the task specifications:

The file with the data (measurements.txt) contains one billion rows. Each row represents the temperature recorded in a weather station and is composed of two fields: the station name and the detected temperature, separated by the ; character.

Example:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2

The task is to read all the lines and calculate the minimum, average and maximum temperatures for each weather station, rounded to one decimal place.

Finally, you have to display this data on the screen in alphabetical order in the format <station name>:<minimum temperature>/<average temperature>/<maximum temperature>.

Example:

{Abha=-23.0/18.0/59.2, Abidjan=-16.2/26.0/67.3, Abéché=-10.0/29.4/69.0, Accra=-10.1/26.4/66.4, Addis Ababa=-23.7/16.0/67.0, Adelaide=-27.8/17.3/58.5, ...}

The winning script took 1,535 seconds to finish; the top 5 finished under 2 seconds.

Getting to this point in PHP seems daunting, but let's see how far we can go!

DISCLAIMER: I will not be using advanced debugging and profiling tools to determine where to optimize during testing, that's not the purpose of this article.

Initial Specifications

We need to find a way to measure the performance of a PHP script. This is where the first problems come in:

How can we measure the execution time?
How much data should be passed to the script to ensure reliable results?
Where should we run the script? Should we do it locally or elsewhere (a dedicated server, a VM, etc...)?

Measuring the execution time

All Linux installations should have the command time; by placing it before a command, the operating system can return the execution time of the command passed.
This would seem to be the ideal solution, but there is a problem: it is not very precise, especially if we make it profile only one execution of the PHP script.
A better approach is to use the command perf and pass it the option -r, followed by the number of times you want to run the command to profile.
Example: perf -r 10 my_command

How much data should be passed to the script

Running the PHP script on a billion rows could take a long time; we can use a small set of 1 million rows to start doing some tests. Then we can gradually increase the number of rows until we reach 1 billion.

Where to run the script

Here, just like before, we can take a step-by-step approach. First we can run the benchmark locally on our computer, to see how the different versions compare.
There's one important thing to remember: close all non-essential programs and make sure that the tests are always run under the same conditions.
The version of PHP locally installed is 8.3.14.

Then, when we have an optimized enough PHP script, we can move on to a dedicated server or a virtual machine.
We will need a dedicated server for a short time; the best option would be to use a VM or a cloud instance.

Writing the code

First attempt

Let's start by writing a solution in PHP in the simplest way we can think of:

<?php

$fp = fopen('data/measurements.txt', 'r');

$stations = [];
while (($line = fgets($fp, 4096)) !== false) {
    [$station, $temp] = explode(';', $line);

    if (!isset($stations[$station])) {
        $stations[$station] = [];
    }

    $stations[$station][] = $temp;
}
fclose($fp);

$results = [];
foreach ($stations as $key => $value) {
    $results[$key] = [];
    $results[$key][0] = min($value);
    $results[$key][1] = max($value);
    $results[$key][2] = array_sum($value);
    $results[$key][3] = count($value);
}

ksort($results);

echo '{', PHP_EOL;
foreach ($results as $name => &$temps) {
    echo "\t", $name, '=', $temps[0], '/', number_format($temps[2]/$temps[3], 1), '/', $temps[1], ',', PHP_EOL;
}
echo '}', PHP_EOL;

We open the file data/measurements.txt for reading, populate an array with this data, and finally calculate the minimum, maximum and average temperature for each weather station.
We sort the results alphabetically and print them on the screen.

We have already created the file measurements.txt with 1 million lines using the semi-official tool create_measurements.py:

python3 create_measurements.py 1_000_000

We can then launch the first PHP script using the command:

perf stat -o results/calculate_1.log -r 10 -d php src/calculate_1.php

The -o option specifies where to save the log with the execution results.
To have reliable data we run the script 10 times (with the -r option).

It took 1.066 seconds on the laptop I use for writing (Intel N5100 CPU with 8GB of RAM).
We don't have enough information to understand whether it is a lot or a little.

Second attempt: `trim()`

From the screen output we see some strange line breaks that shouldn't be there. Let's try to apply a trim() on the temperatures:

$stations[$station][] = trim($temp);

We rerun the script and it takes 1.048 seconds. The improvement is minimal, but the screen output no longer shows those extra line breaks.

Third attempt: `(float)`

The “solution” of using trim() does not seem to be the best; a quick analysis of the code shows that PHP stores the temperatures in the array as strings and not as floats. Let's try casting to float instead:

$stations[$station][] = (float) $temp;

The execution time is 0.62 seconds, a big improvement over the previous attempt!

Perhaps we can also increase the number of rows from 1 million to 10 million to better appreciate the timing variations.
Let's try running the same script on 10 million rows:

PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 20480 bytes) in src/calculate_3.php on line 13

PHP's current memory limit is 128MB; the script tries to create an array with all the values of the file measurements.txt, exhausting the 128MB.
Now we could raise that limit, or rewrite the code that takes care of aggregating and processing the data.
Let's proceed with the rewriting of the code avoiding putting the entire input file in memory.

Fourth attempt: `while`

Let's try to read the data and process it in a single while loop:

$results = [];
while (($line = fgets($fp, 4096)) !== false) {
    [$station, $temp] = explode(';', $line);
    $temp = (float) $temp;

    if (!isset($results[$station])) {
        $results[$station] = [$temp, $temp, $temp, 1];
    } else {
        $results[$station][0] = $results[$station][0] < $temp ? $results[$station][0] : $temp;
        $results[$station][1] = $results[$station][1] > $temp ? $results[$station][1] : $temp;
        $results[$station][2] += $temp;
        $results[$station][3]++;
    }
}

This way we don't saturate the memory with the contents of the measurements.txt file.

Execution time on 10 million rows: 6.14 seconds
This time doesn't tell us anything about a possible improvement/deterioration of performance; let's try with the 1 million row dataset.

Execution time on 1 million rows: 0.76 seconds
The time got worse (as expected); we traded the lower RAM usage for a higher execution time.

Fifth attempt: `min()/max()`

This time we try to use the native PHP functions min() and max() to save the minimum and maximum value of each station:

$results[$station][0] = min($results[$station][0], $temp);
$results[$station][1] = max($results[$station][1], $temp);

Execution time on 10 million rows: 6.01 seconds

It's a small improvement, but an improvement nonetheless.

Sixth attempt: `if`

Perhaps we can simplify those two lines of code even further, using two simple if statements:

if ($temp < $results[$station][0]) {
    $results[$station][0] = $temp;
}

if ($temp > $results[$station][1]) {
    $results[$station][1] = $temp;
}

Execution time on 10 million rows: 4.90 seconds

Here we are, another important step in the right direction.

Seventh attempt: `!isset()`

The situation inside the while loop is this:

if (!isset($results[$station])) {
    // ...
} else {
    // ...
}

We are used to thinking of that !isset() as if it were a single statement; in reality, however, there are two: there is the not (!) and the isset().
Let's try to invert the two branches of the if:

if (isset($results[$station])) {
    if ($temp < $results[$station][0]) {
        $results[$station][0] = $temp;
    }

    if ($temp > $results[$station][1]) {
        $results[$station][1] = $temp;
    }

    $results[$station][2] += $temp;
    $results[$station][3]++;
} else {
    $results[$station] = [$temp, $temp, $temp, 1];
}

Execution time on 10 million rows: 4.90 seconds

It's pretty much the same time as the previous attempt...

Eighth attempt: pointer `&`

Another attempt we can make inside the if is to use a pointer to the array element that contains the weather station instead of calling $results[$station][0] every time:

$resultStation = &$results[$station];

Execution time on 10 million rows: 4.65 seconds

There is an improvement in timing, we keep this change.

Ninth attempt: `fgets()`

Let's focus for a moment on the while that takes care of looping the entire measurements.txt file:

while (($line = fgets($fp, 4096)) !== false) {

Why is there a 4096 as the second parameter of the fgets() function?
I saw it in the example code on the PHP documentation page, so I just mindlessly copied it into the script. However, I later checked the documentation and found that the second parameter "limits" lines that are too long.
Some comments on the PHP documentation page suggest there may be a performance issue when passing this second parameter (which is optional by the way).
Let's try removing it and see what happens:

while (($line = fgets($fp)) !== false) {

Execution time on 10 million rows: 4.39 seconds

...this confirms the fact that reading the documentation carefully is always useful

Tenth attempt: `strtok()`

The only point in the code that raises some doubts is the explode() used to split the weather station's name from the temperature.

Let's try replacing it with strtok():

$station = strtok($line, ';');
$temp = (float) strtok(';');

Execution time on 10 million rows: 3.82 seconds

With this latest version of the script, it seems that we have reached the end of the optimizations, there are no further points to optimize.

Test on dedicated hardware

Now that we have our most performant script, we can go ahead and run it on a dedicated machine.

Test on EC2 with 1 billion rows

Let's try to run this script on hardware similar to the one used in the official Java challenge:

CPU: AMD EPYC 7502P 32 cores / 64 threads @ 2.5 GHz
Memory: 128 GB ECC DDR4 RAM
2x SAMSUNG MZQL2960HCJR-00A07, 1 TB, Software RAID-1

The EC2 instance that most closely resembles it (and is cheaper) is the m6a.8xlarge with its 32 vCPUs and 128GB of memory. For the hard disk I opted for a 200GB io1 volume (to reach 10,000 IOPS).

I tried to launch the last script on the EC2 instance; this time I ran the script only 2 times.
The result was this:

 Performance counter stats for 'php src/calculate_10.php' (2 runs):

         261924.05 msec task-clock                       #    1.007 CPUs utilized            ( +-  0.72% )
               813      context-switches                 #    3.127 /sec                     ( +- 28.17% )
                 2      cpu-migrations                   #    0.008 /sec                   
              1486      page-faults                      #    5.715 /sec                   
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               
   <not supported>      L1-dcache-loads                                             
   <not supported>      L1-dcache-load-misses                                       
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

            260.06 +- 1.89 seconds time elapsed  ( +-  0.73% )

260.06 seconds, or 4 minutes and 20 seconds... a truly disastrous result.

In a previous attempt (the fourth) we had to change the way the data was aggregated, due to a PHP out-of-memory error; we had seen a significant performance degradation due to this change.
Since the EC2 instance has a lot of RAM available, let's try to resume the script from the third attempt by applying the changes made from the fifth attempt onwards and raising the memory_limit in the php.ini file.

The results of this test:

 Performance counter stats for 'php src/calculate_in_memory.php' (2 runs):

         259240.23 msec task-clock                       #    0.992 CPUs utilized            ( +-  0.81% )
              1348      context-switches                 #    5.158 /sec                     ( +- 14.73% )
                 5      cpu-migrations                   #    0.019 /sec                     ( +- 10.00% )
            120810      page-faults                      #  462.253 /sec                     ( +-  0.26% )
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               
   <not supported>      L1-dcache-loads                                             
   <not supported>      L1-dcache-load-misses                                       
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

            261.38 +- 2.10 seconds time elapsed  ( +-  0.80% )

261.38 seconds, one second slower than the previous version.
Probably the difference is irrelevant on such high-performance hardware.

Multi-thread PHP

The only thing left to try is to write a PHP script that takes advantage of the interpreter's multithread capabilities.
This task turned out to be quite complex for two reasons:

The PHP interpreter must be compiled with the ZTS (Zend Thread Safe) option to launch parallel executions; few Linux distributions provide an interpreter with this feature turned on;
The PHP script needs to be completely rewritten to take advantage of thread concurrency;

For the first point, the only possibility is to compile PHP from sources with the ZTS option active, and then install the PECL parallel module.
On Debian, this is possible using the commands:

# Install build tools and libraries needed
sudo apt-get -y install build-essential autoconf libtool bison re2c pkg-config git libxml2-dev libssl-dev

# Clone and build a stripped-down version of PHP with ZTS support
git clone https://github.com/php/php-src.git --branch=PHP-8.4.3 --depth=1
cd php-src/
./buildconf
./configure --prefix=/opt/php8.4-zts --with-config-file-path=/opt/php8.4-zts/etc/php --disable-all --disable-ipv6 --disable-cgi --disable-phpdbg --enable-zts --enable-xml --with-libxml --with-pear --with-openssl
make -j32
./sapi/cli/php -v
sudo make install

# Install `parallel` module from PECL
sudo /opt/php8.4-zts/bin/pecl channel-update pecl.php.net
sudo /opt/php8.4-zts/bin/pecl install parallel
sudo mkdir -p /opt/php8.4-zts/etc/php/conf.d
echo 'extension=parallel.so' | sudo tee -a /opt/php8.4-zts/etc/php/php.ini
echo 'memory_limit=-1' | sudo tee -a /opt/php8.4-zts/etc/php/php.ini

# Verify module installation
/opt/php8.4-zts/bin/php -i | grep parallel

As you can see, we downloaded and compiled the 8.4.3 version of PHP.

The ZTS version of PHP can be run with the command /opt/php8.4-zts/bin/php.

For the second point (rewriting the PHP script for multithreading), you can take inspiration from the PHP documentation and solutions to the 1BRC challenge available on the Internet.
The one I took heavily from is this one.
The overhead from multithread management mostly comes from having to cycle the measurements.txt file first to split it into chunks that match the number of cores on the machine where the script is running. Each thread will process one of these chunks that, combined, will lead to the final result.

The source code is fully available in the GitHub repository.

The results on the EC2 instance are:

 Performance counter stats for '/opt/php8.4-zts/bin/php src/zts/zts_calculate_1.php 32' (10 runs):

         521063.71 msec task-clock                       #   30.537 CPUs utilized            ( +-  0.17% )
              4416      context-switches                 #    8.521 /sec                     ( +- 12.71% )
                25      cpu-migrations                   #    0.048 /sec                     ( +- 27.73% )
             41060      page-faults                      #   79.228 /sec                     ( +-  0.21% )
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               
   <not supported>      L1-dcache-loads                                             
   <not supported>      L1-dcache-load-misses                                       
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

           17.0636 +- 0.0181 seconds time elapsed  ( +-  0.11% )

17.0636 seconds... an impressive result considering the previous 260 seconds!

Conclusions

For now, we'll pause here; we're left with a time of 17.06 seconds for the execution on 1 billion rows.

In the next article, we'll see another way to face this challenge with PHP.
I'll leave you with the summary of the test results on the EC2 instance. The first 9 scripts were run on 1 and 10 million rows dataset. Script 10, the one with all the data in RAM (calculate_in_memory.php), and the ZTS script were run on a 1 billion rows dataset.

NOTE: The 1M and 1M lines tests were run with PHP 8.4.2; the 1B lines tests were run with PHP 8.4.3.

The full code is available on Github:
https://github.com/gfabrizi/1brc-php

Hands-on debugging session: instrument, monitor, and fix

Join Lazar for a hands-on session where you’ll build it, break it, debug it, and fix it. You’ll set up Sentry, track errors, use Session Replay and Tracing, and leverage some good ol’ AI to find and fix issues fast.

RSVP here →

Top comments (22)

david duymelinck • Jan 31

Somebody did it a while ago;

Processing One Billion Rows in PHP!

Florian Engelhardt ・ Mar 8 '24

#php #programming

I looked at the multithread code and beside from different naming and different placements, the code looks almost the same.

In the post you take a different route, but both of you ended up with similar multithread code. Is this because php doesn't provide another way? I never explored multithreading in php. i just use another language that has threading build in from the start.

Gianluca Fabrizi • Jan 31

Yes, I know, I was late to the party 😅

Well, I must say that this was my first attempt with multi-threading too.
Like I said in the post, I took heavily inspiration from this code:
github.com/realFlowControl/1brc/bl...
it's Florian's repository 😬 ( @realflowcontrol )

I don't see many ways of doing things other than splitting the file into chunks by reading the whole file, processing each chunk and then aggregating the results.
Maybe there are better ways to take advantage of multi-threading in php

david duymelinck • Jan 31

I skipped that line, my mistake.

I know there are fibers and the Swoole library has coroutines.

Andrey Rusev • Feb 1

I will share some suggestions here (instead of under my other comment), 'cos the two of you seem to know more about PHP than me :)

I think you might do without aligning the chunks on EOL boundaries. It's not going to be a massive improvement - only does a few syscalls per chunk, but ... well, haha, this one is a matter of principle with me - get to the actual work as quickly as possible. Your threads are not digesting the chunks during that initial alignment.

I think what you can do instead is something like:

split the file in equally sized chunks (more on this later), have them overlap by one byte.
The thread that gets to read the first chunk starts at 0, processes lines until landing on the end exactly (if it happens to be an EOL) or crosses the end until the first EOL after the end of the chunk. That is - processes the last line even if it crosses the end boundary.
A thread that gets a subsequent chunk - ignores the first line (empty or not)... processes until, again, either ends exactly on the chunk end boundary or the first EOL after that.
Obviously, the thread that got the last chunk doesn't cross the EOF :)

Andrey Rusev • Feb 1

Second suggestion is to try and see if you can adjust the read buffers. You know - typically reading a file results in filling a buffer in memory and then supplying subsequent reads with bytes from that buffer.

Tweaking that buffer might give you better performance - like reducing the number of syscalls to the operating system, but also filling a larger buffer will take a bit longer (time during which you're not digesting data. :)

So, you can fiddle with it to try and find an optimum of some sort... (seems like PHP has a function to set the buffer size? I'm not sure?)

Ah, and also - on linux you should have a way to get some stats out of your runs - like number of syscalls and context switches...

Andrey Rusev • Feb 1 • Edited

And my third and perhaps most important suggestion is ... basically - can we ditch the fgets()? (And yeah, maybe strtok() and some others too... :)

This will get into PHP internals, and I'm not sure how these work (not a PHP developer), but... let's see.

I'll try to illustrate what I think happens behind the scenes in the comments:

 // each loop begins with:
$line = fgets($fp); // this loops over a number of chars to find an EOL
// then update the current possition:
$pos += strlen($line); // possibly also scans the string to find a zero char (not sure about this - maybe php already keeps an integer len of each string
// then the loop looks for the delimiter:
$station = strtok($line, ';'); // scan the same chars (until ';') again
// ...

See where this is going? Each character of the file is checked more than once. These checks are pretty fast, but you've also get massive amounts of chars... And on top of that - my suspicion (I'm not sure with this one) is that fgets(), strtok() return a new string copying from the original file buffer. If this is true - it's a lot of work on top.

IMHO, ideally you should aim at handling each char only once. Should I continue with what I think can be done in that respect or should I leave it here, so that we don't spoil your next post? :)

david duymelinck • Feb 1

If you read Florian's post, the last update mentions the use of the stream_get_line function.

$city = stream_get_line($fp, 99, ';');
$temp = stream_get_line($fp, 99, "\n");
$temp = (float)$temp;

As you see this function takes the streamed line and allows you to read the stream up to a certain character in that line.

Andrey Rusev • Feb 1 • Edited

Haha, didn't look at it really :) I saw your comment, but didn't bother to look, I knew already that Gianluca is using someone else's work for inspiration...

Anyway, looking for the delimiter first and then for EOL is definitely in the right direction!

Do you know though - with stream_get_line will the output ($city for example) be a copy of the stream, or will it be something more like - copy on modify?

You know what I'm talking about, right? If I get a string out of a call to stream_get_line() - that string should behave in exactly the same way as any other string, including - I should be able to modify it without (secretly) affecting something else.

Or in other words - I should be able to add chars to the end of $city and this should not affect $temp. So there will definitely be a copy of some sort, question is - will it happen inside stream_get_line() or will it happen when I add the extra chars to $city?

david duymelinck • Feb 1

From the stream_get_line documentation;

Returns a string of up to length bytes read from the file pointed to by stream, or false on failure.

So you can do something like $city = str_replace(';', '', stream_get_line($fp, 99, ';')); without changing the $temp variable code.

Gianluca Fabrizi • Feb 1

Before writing this article, I read Florian's code (the multi-threading part) but not his post; I didn't want to end up with the same code and the same conclusions, it would be pointless.

I tried stream_get_line(), but to read the whole line (as a replacement for fgets()). The code was a bit slower, so I dropped that idea.
I didn't think of using the third parameter to avoid reading the same line 3 times...

I'm not 100% sure when the actual copying happens; looking at the PHP C source code, it should happen when you call stream_get_line() (it allocates a new string, then copy chars inside it).
So if you call stream_get_line() twice, PHP does 2 copy(s) before you even access the resulting strings.

Thanks for the ideas, I will look deeper into stream_get_line and multi-threading code.

Andrey Rusev • Feb 1

...it allocates a new string, then copy chars inside it...

Hmmm... this is kinda cool and not at the same time :)

I mean - certainly give it a try (with the third parameter to avoid scanning twice), but - doing many, many tiny allocations is typically not good...

There seems to be another function - stream_get_contents() that might be used to limit the number of allocations by getting larger strings... but it sort of depends on if we can find 'a cheap' way of getting substrings. Ideally - without copying. We will need $city as a lookup into the results array (unless we change that too - 'cos I think we might be able to save on lookups too - more on this later).

At the same time - it seems like the PHP string is essentially backed up by a buffer of bytes and kinda works as an array over that buffer (using the []). This is cool, 'cos it's a lot quicker to access subparts of it... but it sort of depends on what other functions we can use on strings... And will they allocate / copy too, or can we 'hack' around that...

I'm thinking about something like:

get a larger buffer out of the file (with stream_get_contents) and of an 'optimal' size (balancing allocation sizes and number of allocations)
Within that buffer (string, really) there should be a function that can cheaply find the first occurrence of a char - strpos or something should be simply checking chars - no allocations, etc.
Here's the tricky part - once we've found it... How do we keep avoiding allocations and copying. I'm pretty sure things like substr() will allocate new strings too. I'm not sure if string references can be obtained at a particular index of a string... Say - an expression like this &str[3] - what type is it? What functions will take this as a 'string-like' argument?

Andrey Rusev • Feb 1

Just to give you an idea of what I mean by we can shave off some array lookups - so you can think about it too - I think something like this:

$resultsForStation = array($sameOrUpdatedMin, $sameOrUpdatedMax, $sumTempSoFar, $newCnt];

... will do fewer array lookups than something like:

$resultsForStation[0] = $sameOrUpdatedMin;
$resultsForStation[1] = $sameOrUpdatedMax;
// ... etc

It also depends on allocations, true - but I'm thinking - along those lines there should be ways to avoid []s.

Andrey Rusev • Feb 1

And an event wilder idea - once we've got the minimum, maximum, etc - instead of keeping them in an array... Can we capture them in a stack frame instead? Like with a callable or a generator?

I kinda suspect PHP stack frames might be more optimized for quick access than PHP arrays...