DEV Community


1 Billion Rows Challenge in PHP

Gianluca Fabrizi on January 31, 2025

You've probably heard of the 1BRC (1 Billion Rows Challenge). In a nutshell, it's a challenge (originally aimed at Java programmers) to write code ...

david duymelinck

Somebody did it a while ago:

I looked at the multithread code and, aside from different naming and placement, the code looks almost the same.

In the post you take a different route, but both of you ended up with similar multithread code. Is this because PHP doesn't provide another way? I never explored multithreading in PHP; I just use another language that has threading built in from the start.


Gianluca Fabrizi

Yes, I know, I was late to the party 😅

Well, I must say that this was my first attempt with multi-threading too.
As I said in the post, I took heavy inspiration from this code:
github.com/realFlowControl/1brc/bl...
it's Florian's repository 😬 ( @realflowcontrol )

I don't see many ways of doing it other than reading the whole file, splitting it into chunks, processing each chunk, and then aggregating the results.
Maybe there are better ways to take advantage of multi-threading in PHP.
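Roughly, the shape I mean is something like this (just a sketch with `pcntl_fork()`, since PHP has no built-in threads and forked processes plus a pipe per worker is the usual CLI substitute; the per-chunk work here is stubbed out with a fake partial result):

```php
<?php
// Sketch: fork N workers, each sends its partial stats back over a socket
// pair; the parent merges them. The child's payload is a stub, not real
// per-chunk processing.
$workers = 4;
$pipes = [];
for ($i = 0; $i < $workers; $i++) {
    $pair = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
    $pid = pcntl_fork();
    if ($pid === 0) {
        // Child: this is where chunk $i would really be processed.
        fclose($pair[0]);
        fwrite($pair[1], serialize(['Rome' => [12.5, 1]])); // [sum, count] stub
        fclose($pair[1]);
        exit(0);
    }
    // Parent: keep the read end of this worker's pipe.
    fclose($pair[1]);
    $pipes[$pid] = $pair[0];
}

$merged = [];
foreach ($pipes as $pid => $pipe) {
    $partial = unserialize(stream_get_contents($pipe));
    fclose($pipe);
    pcntl_waitpid($pid, $status);
    foreach ($partial as $station => [$sum, $count]) {
        $merged[$station][0] = ($merged[$station][0] ?? 0.0) + $sum;
        $merged[$station][1] = ($merged[$station][1] ?? 0) + $count;
    }
}
// $merged['Rome'] is now [50.0, 4]: four workers' partials aggregated.
```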


david duymelinck

I skipped that line, my mistake.

I know there are fibers and the Swoole library has coroutines.
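For reference, a minimal sketch of what a Fiber does (assuming PHP 8.1+): it suspends and resumes cooperatively on a single thread, so by itself it helps with interleaving I/O, not with parallelizing CPU-bound parsing.

```php
<?php
// A Fiber suspends itself and hands control back; the caller resumes it
// later with a value. Everything runs on one thread.
$fiber = new Fiber(function (): string {
    $value = Fiber::suspend('paused');   // hand control back to the caller
    return "resumed with: $value";
});

$first = $fiber->start();       // runs until the suspend; yields 'paused'
$fiber->resume('hello');        // continues the fiber to its return
$result = $fiber->getReturn();  // 'resumed with: hello'
```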


Andrey Rusev

I will share some suggestions here (instead of under my other comment), 'cos the two of you seem to know more about PHP than me :)

I think you might do without aligning the chunks on EOL boundaries. It's not going to be a massive improvement - it only costs a few syscalls per chunk - but... well, haha, this one is a matter of principle with me: get to the actual work as quickly as possible. Your threads are not digesting the chunks during that initial alignment.

I think what you can do instead is something like:

  1. Split the file into equally sized chunks (more on this later) and have them overlap by one byte.
  2. The thread that reads the first chunk starts at 0 and processes lines until it either lands exactly on the chunk end (if that happens to be an EOL) or crosses it and stops at the first EOL after the end of the chunk. That is - it processes the last line even if it crosses the end boundary.
  3. A thread that gets a subsequent chunk ignores the first line (empty or not)... then processes until, again, it either ends exactly on the chunk end boundary or at the first EOL after that.
  4. Obviously, the thread that got the last chunk doesn't cross the EOF :)
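The steps above could be sketched like this (names are mine, not from the post): every worker except the first seeks one byte before its chunk start and throws away one line, and every worker finishes the line that crosses its chunk end, so no up-front alignment pass is needed.

```php
<?php
// Each worker owns the lines that *start* inside [$start, $end).
// Non-first workers use the one-byte overlap to discard the line the
// previous worker already finished (even if that "line" is just the EOL
// sitting exactly on the boundary).
function processChunk(string $path, int $start, int $end, bool $isFirst): array
{
    $fp = fopen($path, 'rb');
    if ($isFirst) {
        fseek($fp, $start);
    } else {
        fseek($fp, $start - 1);   // the one-byte overlap
        fgets($fp);               // discard the previous worker's line
    }
    $stats = [];
    while (ftell($fp) < $end && ($line = fgets($fp)) !== false) {
        [$station, $temp] = explode(';', rtrim($line, "\n"));
        $stats[$station][] = (float) $temp;   // stand-in for real min/max/sum
    }
    fclose($fp);
    return $stats;
}
```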

Andrey Rusev

Second suggestion is to try and see if you can adjust the read buffers. You know - typically reading a file results in filling a buffer in memory and then supplying subsequent reads with bytes from that buffer.

Tweaking that buffer might give you better performance - like reducing the number of syscalls to the operating system - but filling a larger buffer also takes a bit longer (time during which you're not digesting data :)

So, you can fiddle with it to try and find an optimum of some sort... (seems like PHP has a function to set the buffer size? I'm not sure?)
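PHP does seem to have knobs for this: `stream_set_read_buffer()` sizes the userland read-ahead buffer, and `stream_set_chunk_size()` sizes the reads PHP issues to the OS. A sketch (the 1 MiB sizes are guesses to tune, not recommendations, and I open this script itself as a stand-in for the data file):

```php
<?php
// Open this script as a stand-in for measurements.txt.
$fp = fopen(__FILE__, 'rb');

// Size of PHP's read-ahead buffer for this stream (0 on success):
$rb = stream_set_read_buffer($fp, 1 << 20);

// Size of the chunks PHP requests from the OS per underlying read
// (returns the previous chunk size, typically 8192):
$prevChunk = stream_set_chunk_size($fp, 1 << 20);
```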

Ah, and also - on linux you should have a way to get some stats out of your runs - like number of syscalls and context switches...


Andrey Rusev

And my third and perhaps most important suggestion is ... basically - can we ditch the fgets()? (And yeah, maybe strtok() and some others too... :)

This will get into PHP internals, and I'm not sure how these work (not a PHP developer), but... let's see.

I'll try to illustrate what I think happens behind the scenes in the comments:

// each loop begins with:
$line = fgets($fp); // this loops over a number of chars to find an EOL
// then update the current position:
$pos += strlen($line); // possibly also scans the string for a terminating zero char (not sure - maybe PHP already keeps an integer length for each string)
// then the loop looks for the delimiter:
$station = strtok($line, ';'); // scans the same chars (up to ';') again
// ...

See where this is going? Each character of the file is checked more than once. These checks are pretty fast, but you've also got massive amounts of chars... And on top of that, my suspicion (I'm not sure about this one) is that fgets() and strtok() return a new string copied out of the original file buffer. If that's true, it's a lot of extra work on top.
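One possible direction (my sketch, not the author's code): fread() big blocks and parse with offsets, so fgets() never materializes a per-line string. strpos() does still re-visit bytes, but it runs as C code inside the engine, which in PHP usually beats touching each char once in userland.

```php
<?php
// Parse ';'-delimited lines out of large fread() blocks using offsets.
// An incomplete line at the end of a block is carried into the next one.
function parseBlocks(string $path): array
{
    $fp = fopen($path, 'rb');
    $stats = [];
    $carry = '';                               // tail of the previous block
    while (($block = fread($fp, 1 << 20)) !== false && $block !== '') {
        $block = $carry . $block;
        $pos = 0;
        while (($nl = strpos($block, "\n", $pos)) !== false) {
            $sep = strpos($block, ';', $pos);  // delimiter offset
            $station = substr($block, $pos, $sep - $pos);
            $temp = (float) substr($block, $sep + 1, $nl - $sep - 1);
            $stats[$station][] = $temp;        // stand-in for real min/max/sum
            $pos = $nl + 1;
        }
        $carry = substr($block, $pos);         // keep the incomplete tail
    }
    fclose($fp);
    return $stats;
}
```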

IMHO, ideally you should aim at handling each char only once. Should I continue with what I think can be done in that respect or should I leave it here, so that we don't spoil your next post? :)


Andrey Rusev

Ha, cool! :)

I'm not a PHP dev, but I'm pretty sure you can shave off more! :)


Andrey Rusev

Actually - can I suggest some stuff? I'm curious, but I don't want to spoil your next post... Pretty sure you already have ideas...


Gianluca Fabrizi

Yes, for the next post I've already tested something... different.
And yes, if you have any ideas, feel free to suggest them!


Armando Ota

And then there was the Java challenge ;)
morling.dev/blog/1brc-results-are-in/


Maksim Zaikov

Very interesting tests! Thank you!


Gianluca Fabrizi

Thanks! ❤️


Victor Villarreal

Really appreciate that you shared this story.

I know this is not a code optimization, but... maybe the file could be read from a RAM-backed partition (like tmpfs) instead of the hard disk.