DEV Community

1 Billion Rows Challenge in PHP

Gianluca Fabrizi on January 31, 2025

You've probably heard of the 1BRC (1 Billion Rows Challenge). In a nutshell, it's a challenge (originally aimed at Java programmers) to write code ...

Read full post

david duymelinck • Jan 31

Somebody did it a while ago;

Processing One Billion Rows in PHP!

Florian Engelhardt ・ Mar 8 '24

#php #programming

I looked at the multithread code and beside from different naming and different placements, the code looks almost the same.

In the post you take a different route, but both of you ended up with similar multithread code. Is this because php doesn't provide another way? I never explored multithreading in php. i just use another language that has threading build in from the start.

Gianluca Fabrizi • Jan 31

Yes, I know, I was late to the party 😅

Well, I must say that this was my first attempt with multi-threading too.
Like I said in the post, I took heavily inspiration from this code:
github.com/realFlowControl/1brc/bl...
it's Florian's repository 😬 ( @realflowcontrol )

I don't see many ways of doing things other than splitting the file into chunks by reading the whole file, processing each chunk and then aggregating the results.
Maybe there are better ways to take advantage of multi-threading in php

david duymelinck • Jan 31

I skipped that line, my mistake.

I know there are fibers and the Swoole library has coroutines.

Andrey Rusev • Feb 1

I will share some suggestions here (instead of under my other comment), 'cos the two of you seem to know more about PHP than me :)

I think you might do without aligning the chunks on EOL boundaries. It's not going to be a massive improvement - only does a few syscalls per chunk, but ... well, haha, this one is a matter of principle with me - get to the actual work as quickly as possible. Your threads are not digesting the chunks during that initial alignment.

I think what you can do instead is something like:

split the file in equally sized chunks (more on this later), have them overlap by one byte.
The thread that gets to read the first chunk starts at 0, processes lines until landing on the end exactly (if it happens to be an EOL) or crosses the end until the first EOL after the end of the chunk. That is - processes the last line even if it crosses the end boundary.
A thread that gets a subsequent chunk - ignores the first line (empty or not)... processes until, again, either ends exactly on the chunk end boundary or the first EOL after that.
Obviously, the thread that got the last chunk doesn't cross the EOF :)

Andrey Rusev • Feb 1

Second suggestion is to try and see if you can adjust the read buffers. You know - typically reading a file results in filling a buffer in memory and then supplying subsequent reads with bytes from that buffer.

Tweaking that buffer might give you better performance - like reducing the number of syscalls to the operating system, but also filling a larger buffer will take a bit longer (time during which you're not digesting data. :)

So, you can fiddle with it to try and find an optimum of some sort... (seems like PHP has a function to set the buffer size? I'm not sure?)

Ah, and also - on linux you should have a way to get some stats out of your runs - like number of syscalls and context switches...

Andrey Rusev • Feb 1 • Edited

And my third and perhaps most important suggestion is ... basically - can we ditch the fgets()? (And yeah, maybe strtok() and some others too... :)

This will get into PHP internals, and I'm not sure how these work (not a PHP developer), but... let's see.

I'll try to illustrate what I think happens behind the scenes in the comments:

 // each loop begins with:
$line = fgets($fp); // this loops over a number of chars to find an EOL
// then update the current possition:
$pos += strlen($line); // possibly also scans the string to find a zero char (not sure about this - maybe php already keeps an integer len of each string
// then the loop looks for the delimiter:
$station = strtok($line, ';'); // scan the same chars (until ';') again
// ...

See where this is going? Each character of the file is checked more than once. These checks are pretty fast, but you've also get massive amounts of chars... And on top of that - my suspicion (I'm not sure with this one) is that fgets(), strtok() return a new string copying from the original file buffer. If this is true - it's a lot of work on top.

IMHO, ideally you should aim at handling each char only once. Should I continue with what I think can be done in that respect or should I leave it here, so that we don't spoil your next post? :)

david duymelinck • Feb 1

If you read Florian's post, the last update mentions the use of the stream_get_line function.

$city = stream_get_line($fp, 99, ';');
$temp = stream_get_line($fp, 99, "\n");
$temp = (float)$temp;

As you see this function takes the streamed line and allows you to read the stream up to a certain character in that line.

Andrey Rusev • Feb 1 • Edited

Haha, didn't look at it really :) I saw your comment, but didn't bother to look, I knew already that Gianluca is using someone else's work for inspiration...

Anyway, looking for the delimiter first and then for EOL is definitely in the right direction!

Do you know though - with stream_get_line will the output ($city for example) be a copy of the stream, or will it be something more like - copy on modify?

You know what I'm talking about, right? If I get a string out of a call to stream_get_line() - that string should behave in exactly the same way as any other string, including - I should be able to modify it without (secretly) affecting something else.

Or in other words - I should be able to add chars to the end of $city and this should not affect $temp. So there will definitely be a copy of some sort, question is - will it happen inside stream_get_line() or will it happen when I add the extra chars to $city?

david duymelinck • Feb 1

From the stream_get_line documentation;

Returns a string of up to length bytes read from the file pointed to by stream, or false on failure.

So you can do something like $city = str_replace(';', '', stream_get_line($fp, 99, ';')); without changing the $temp variable code.

Gianluca Fabrizi • Feb 1

Before writing this article, I read Florian's code (the multi-threading part) but not his post; I didn't want to end up with the same code and the same conclusions, it would be pointless.

I tried stream_get_line(), but to read the whole line (as a replacement for fgets()). The code was a bit slower, so I dropped that idea.
I didn't think of using the third parameter to avoid reading the same line 3 times...

I'm not 100% sure when the actual copying happens; looking at the PHP C source code, it should happen when you call stream_get_line() (it allocates a new string, then copy chars inside it).
So if you call stream_get_line() twice, PHP does 2 copy(s) before you even access the resulting strings.

Thanks for the ideas, I will look deeper into stream_get_line and multi-threading code.

Andrey Rusev • Feb 1

...it allocates a new string, then copy chars inside it...

Hmmm... this is kinda cool and not at the same time :)

I mean - certainly give it a try (with the third parameter to avoid scanning twice), but - doing many, many tiny allocations is typically not good...

There seems to be another function - stream_get_contents() that might be used to limit the number of allocations by getting larger strings... but it sort of depends on if we can find 'a cheap' way of getting substrings. Ideally - without copying. We will need $city as a lookup into the results array (unless we change that too - 'cos I think we might be able to save on lookups too - more on this later).

At the same time - it seems like the PHP string is essentially backed up by a buffer of bytes and kinda works as an array over that buffer (using the []). This is cool, 'cos it's a lot quicker to access subparts of it... but it sort of depends on what other functions we can use on strings... And will they allocate / copy too, or can we 'hack' around that...

I'm thinking about something like:

get a larger buffer out of the file (with stream_get_contents) and of an 'optimal' size (balancing allocation sizes and number of allocations)
Within that buffer (string, really) there should be a function that can cheaply find the first occurrence of a char - strpos or something should be simply checking chars - no allocations, etc.
Here's the tricky part - once we've found it... How do we keep avoiding allocations and copying. I'm pretty sure things like substr() will allocate new strings too. I'm not sure if string references can be obtained at a particular index of a string... Say - an expression like this &str[3] - what type is it? What functions will take this as a 'string-like' argument?

Andrey Rusev • Feb 1

Just to give you an idea of what I mean by we can shave off some array lookups - so you can think about it too - I think something like this:

$resultsForStation = array($sameOrUpdatedMin, $sameOrUpdatedMax, $sumTempSoFar, $newCnt];

... will do fewer array lookups than something like:

$resultsForStation[0] = $sameOrUpdatedMin;
$resultsForStation[1] = $sameOrUpdatedMax;
// ... etc

It also depends on allocations, true - but I'm thinking - along those lines there should be ways to avoid []s.

Andrey Rusev • Feb 1

And an event wilder idea - once we've got the minimum, maximum, etc - instead of keeping them in an array... Can we capture them in a stack frame instead? Like with a callable or a generator?

I kinda suspect PHP stack frames might be more optimized for quick access than PHP arrays...