DEV Community

Discussion on: Processing One Billion Rows in PHP!

Jakub a. Ritchie • Edited

I tried doing this in PHP too, and the end result was very similar to yours but with some changes which should amount to ~5% better performance (which just shows how hard it is to further optimise it):

  • Instead of isset($stations[$city]), you can bind a reference with $station = &$stations[$city]; and check whether $station is null! This way you only need to access the hashmap once, and can then change $station as necessary!
  • You are checking both fgets and $chunk_start > $chunk_end to decide whether to end the loop, but it is possible to skip the first check, although I'm not sure this will make a noticeable difference.
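
To make the first point concrete, here is a minimal sketch of the reference trick. The sample lines and the [min, max, sum, count] layout are assumptions for illustration, not the code from the blog post:

```php
<?php
// Hypothetical sample input in the challenge's "city;temperature" format.
$lines = ["Hamburg;12.5", "Oslo;-3.5", "Hamburg;8.5"];

$stations = [];
foreach ($lines as $line) {
    $pos  = strpos($line, ';');
    $city = substr($line, 0, $pos);
    $temp = (float) substr($line, $pos + 1);

    // Binding a reference creates the key with a null value if it is missing,
    // so a single hash lookup serves both the existence check and the update.
    $station = &$stations[$city];
    if ($station === null) {
        $station = [$temp, $temp, $temp, 1]; // assumed layout: min, max, sum, count
    } else {
        if ($temp < $station[0]) { $station[0] = $temp; }
        if ($temp > $station[1]) { $station[1] = $temp; }
        $station[2] += $temp;
        $station[3]++;
    }
}
unset($station); // drop the dangling reference once the loop is done
```

Rebinding $station with =& on the next iteration only changes what the variable points to, so the previous station's entry is not overwritten.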

As for the ZTS+Mac issue, you can try to either run the code in a Docker container (php:zts plus pecl install parallel was able to run your solution), or use NTS+multiprocessing by spawning separate processes with either amphp/parallel or just proc_open! (I used the latter in my case)
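
A rough sketch of the proc_open variant, assuming the file has already been split at line boundaries. The sample file, the byte ranges, and the row-counting worker are placeholders for illustration; a real worker would aggregate per-station values and print them for the parent to merge:

```php
<?php
// Hypothetical sample file; byte offsets 23 and 45 are line boundaries in it.
$path = tempnam(sys_get_temp_dir(), 'rows');
file_put_contents($path, "Hamburg;12.5\nOslo;-3.5\nHamburg;8.5\nLima;20.0\n");

// Placeholder worker: counts the rows inside its byte range.
$workerCode = <<<'PHP'
[, $path, $start, $end] = $argv;
$fp = fopen($path, 'rb');
fseek($fp, (int) $start);
echo substr_count(fread($fp, (int) $end - (int) $start), "\n");
PHP;

// Start every worker first so they run concurrently.
$workers = [];
foreach ([[0, 23], [23, 45]] as [$start, $end]) {
    $cmd  = [PHP_BINARY, '-r', $workerCode, '--', $path, (string) $start, (string) $end];
    $proc = proc_open($cmd, [1 => ['pipe', 'w']], $pipes);
    $workers[] = [$proc, $pipes[1]];
}

// Then collect each worker's stdout and merge the partial results.
$total = 0;
foreach ($workers as [$proc, $stdout]) {
    $total += (int) stream_get_contents($stdout);
    fclose($stdout);
    proc_close($proc);
}
unlink($path);
```

Passing the command as an array needs PHP 7.4 or newer and avoids shell quoting issues.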

Edit: I also see that I skip a char at strpos($line, ';', 1);. I don't think it has any performance impact though 😄

Jakub a. Ritchie

I just tried reading the file with fread instead of fgets, and I got a 15% performance improvement! The approach makes the code uglier, but at least it is noticeably faster.
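
The exact fread code is not shown in this thread, so the following is only a sketch of how buffered reading might look: read fixed-size blocks, parse every complete line in the buffer, and carry the trailing partial line over to the next read. The sample file and the tiny 16-byte buffer are assumptions chosen so that lines straddle reads, and the file is assumed to end with a newline:

```php
<?php
// Hypothetical sample file in the challenge's "city;temperature" format.
$path = tempnam(sys_get_temp_dir(), 'rows');
file_put_contents($path, "Hamburg;12.5\nOslo;-3.5\nHamburg;8.5\nLima;20.0\n");

$stations = [];
$carry = '';
$fp = fopen($path, 'rb');
while (!feof($fp)) {
    // A real run would use a much larger buffer; 16 bytes is for demonstration.
    $buffer = $carry . fread($fp, 16);
    $last = strrpos($buffer, "\n");
    if ($last === false) {
        $carry = $buffer; // no complete line yet, keep accumulating
        continue;
    }
    $carry = substr($buffer, $last + 1); // partial line after the last newline

    $offset = 0;
    while ($offset < $last) {
        $sep  = strpos($buffer, ';', $offset);
        $eol  = strpos($buffer, "\n", $sep);
        $city = substr($buffer, $offset, $sep - $offset);
        $temp = (float) substr($buffer, $sep + 1, $eol - $sep - 1);
        $stations[$city][] = $temp; // stand-in for the real aggregation
        $offset = $eol + 1;
    }
}
fclose($fp);
unlink($path);
```

The extra bookkeeping for the carried-over fragment is where the ugliness comes from.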

Florian Engelhardt

Removing the isset() shaved 2.5% off the wall time; that is an awesome idea and was easy to check. Making $chunk_start < $chunk_end the condition of the while loop and moving the fgets() into the loop body shaved off another 2.7% of wall time on my machine.

I'll update the blog post once I am back from DutchPHP. That's awesome! Thank you!
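
For readers following along, the loop change looks roughly like this. The sample file and the chunk boundaries are assumptions for illustration, and the sketch relies on $chunk_end falling on a line boundary inside the file, since fgets() is no longer checked for false:

```php
<?php
// Hypothetical sample file; byte 35 is the end of the third line.
$path = tempnam(sys_get_temp_dir(), 'rows');
file_put_contents($path, "Hamburg;12.5\nOslo;-3.5\nHamburg;8.5\nLima;20.0\n");

$chunk_start = 0;
$chunk_end   = 35;
$seen = [];

$fp = fopen($path, 'rb');
fseek($fp, $chunk_start);
// Only the position comparison decides when to stop.
while ($chunk_start < $chunk_end) {
    $line = fgets($fp);
    $chunk_start += strlen($line);
    $seen[] = rtrim($line, "\n"); // stand-in for the real per-line work
}
fclose($fp);
unlink($path);
```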

Jakub a. Ritchie

Oh, I'm glad both changes helped! Compared to the fread ones they were much simpler to implement, but I just realized we can simply use stream_get_line to fetch the station name, which is faster and simpler than the other approaches!
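
A minimal sketch of the stream_get_line idea, assuming a small sample file and a maximum field length of 128 bytes (both made up for illustration). The function returns the data without the delimiter, so the strpos/substr splitting disappears:

```php
<?php
// Hypothetical sample data in the challenge's "city;temperature" format.
$data = "Hamburg;12.5\nOslo;-3.5\nHamburg;8.5\nLima;20.0\n";
$path = tempnam(sys_get_temp_dir(), 'rows');
file_put_contents($path, $data);

$chunk_start = 0;
$chunk_end   = strlen($data); // here the chunk is the whole file
$stations = [];

$fp = fopen($path, 'rb');
while ($chunk_start < $chunk_end) {
    $city = stream_get_line($fp, 128, ';');  // reads up to, and discards, ';'
    $temp = stream_get_line($fp, 128, "\n"); // reads up to, and discards, the newline
    // +2 accounts for the two delimiters that were consumed but not returned.
    $chunk_start += strlen($city) + strlen($temp) + 2;
    $stations[$city][] = (float) $temp; // stand-in for the real aggregation
}
fclose($fp);
unlink($path);
```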

Florian Engelhardt

Replied on Twitter already, but for visibility: switching to stream_get_line() shaved off another 15%.
This is awesome!

Florian Engelhardt

I updated the blog post with your suggestions, thank you!