I tried doing this in PHP too, and the end result was very similar to yours but with some changes which should amount to ~5% better performance (which just shows how hard it is to further optimise it):
Instead of isset($stations[$city]), you can take a reference with $station = &$stations[$city]; and check whether $station is null! This way you only need to access the hashmap once, and can then change $station as necessary!
You are checking both fgets and $chunk_start > $chunk_end to decide whether to end the loop, but it is possible to skip the first check, although I'm not sure if this will bring a noticeable difference.
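A minimal sketch of the two suggestions above, not the blog post's actual code. The sample data, the temp file and the [min, max, sum, count] layout are assumptions for illustration only; $chunk_start and $chunk_end stand in for the byte offsets of one worker's chunk.

```php
<?php
// Hypothetical sample input; the real challenge file is far larger.
$path = tempnam(sys_get_temp_dir(), 'brc');
file_put_contents($path, "Hamburg;12.0\nOslo;-3.5\nHamburg;8.0\n");

$fp = fopen($path, 'r');
$chunk_start = 0;
$chunk_end = filesize($path);
$stations = [];

// Only the position check guards the loop; fgets() moves into the body.
while ($chunk_start < $chunk_end) {
    $line = fgets($fp);
    $chunk_start += strlen($line);

    $pos = strpos($line, ';');
    $city = substr($line, 0, $pos);
    $temp = (float) substr($line, $pos + 1);

    // Taking a reference creates the key with a null value if it is missing,
    // so one hash lookup covers both the "new" and the "already seen" case.
    $station = &$stations[$city];
    if ($station === null) {
        $station = [$temp, $temp, $temp, 1]; // min, max, sum, count (assumed layout)
    } else {
        if ($temp < $station[0]) $station[0] = $temp;
        if ($temp > $station[1]) $station[1] = $temp;
        $station[2] += $temp;
        $station[3]++;
    }
}
unset($station); // break the last reference once the loop is done
fclose($fp);
unlink($path);
```

Note that this relies on the chunk ending exactly on a line boundary, otherwise fgets() could return false before the position check stops the loop.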
As for the ZTS+Mac issue, you can try to either run the code in a Docker container (php:zts plus pecl install parallel was able to run your solution), or use NTS+multiprocessing by spawning separate processes with either amphp/parallel or just proc_open! (I used the latter in my case)
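For the proc_open route, a rough sketch of how it could look, assuming PHP 7.4+ (array form of the command) and the CLI binary. The inline worker and the byte ranges are hypothetical placeholders; a real worker would aggregate its chunk of the file and print the partial result.

```php
<?php
// Hypothetical worker: each child only reports the byte range it was given.
$worker = 'echo json_encode(["start" => (int) $argv[1], "end" => (int) $argv[2]]);';

$ranges = [[0, 100], [100, 200]]; // hypothetical chunk boundaries

// Start all children first so they run concurrently.
$procs = [];
foreach ($ranges as [$start, $end]) {
    $cmd = [PHP_BINARY, '-r', $worker, '--', (string) $start, (string) $end];
    $proc = proc_open($cmd, [1 => ['pipe', 'w']], $pipes);
    $procs[] = [$proc, $pipes[1]];
}

// Then collect each child's stdout and wait for it to exit.
$results = [];
foreach ($procs as [$proc, $stdout]) {
    $results[] = json_decode(stream_get_contents($stdout), true);
    fclose($stdout);
    proc_close($proc);
}
```

The parent would then merge the partial results, which works on NTS builds because no threads are involved.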
Edit: I also see that I skip a char at strpos($line, ';', 1);. I don't think it has any performance impact though 😄
Removing the isset() shaved 2.5% off the wall time. That is an awesome idea and was easy to check. Making $chunk_start < $chunk_end the expression of the while loop and moving the fgets() into the loop body shaved off another 2.7% of wall time on my machine.
I'll update the blog post once I am back from DutchPHP. That's awesome! Thank you!
Oh, I'm glad both changes helped! Compared to the fread ones they were much simpler to implement, but I just realized we can simply use stream_get_line to fetch the station name, which is faster and simpler than the other approaches!
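A small sketch of the stream_get_line idea, with hypothetical sample data. The 99-byte limit is an assumed upper bound on the field length, not something stated in this thread.

```php
<?php
// Hypothetical sample input written to a temp file.
$path = tempnam(sys_get_temp_dir(), 'brc');
file_put_contents($path, "Hamburg;12.0\nOslo;-3.5\n");

$fp = fopen($path, 'r');
$rows = [];
// Read up to the ';' for the station name, then up to the newline for the
// value. The delimiter itself is consumed but not returned, so no strpos()
// or substr() calls are needed.
while (($city = stream_get_line($fp, 99, ';')) !== false) {
    $temp = stream_get_line($fp, 99, "\n");
    $rows[] = [$city, (float) $temp];
}
fclose($fp);
unlink($path);
```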
I just tried reading the file with fread instead of fgets and got a 15% performance improvement! The approach makes the code uglier, but at least it is noticeably faster.
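The fread code itself is not shown in this thread, so the following is only a guess at the general shape: read fixed-size blocks and carry any incomplete trailing line over to the next read. The buffer size is deliberately tiny here to exercise the carry-over; a real run would use a much larger one.

```php
<?php
// Hypothetical sample input written to a temp file.
$path = tempnam(sys_get_temp_dir(), 'brc');
file_put_contents($path, "Hamburg;12.0\nOslo;-3.5\nHamburg;8.0\n");

$fp = fopen($path, 'r');
$buffer_size = 16; // assumption: tiny on purpose, use far more in practice
$carry = '';       // incomplete line left over from the previous block
$lines = [];
while (!feof($fp)) {
    $chunk = fread($fp, $buffer_size);
    if ($chunk === false || $chunk === '') {
        break;
    }
    $data = $carry . $chunk;
    $last_nl = strrpos($data, "\n");
    if ($last_nl === false) {
        $carry = $data; // no complete line yet, keep accumulating
        continue;
    }
    // Everything after the last newline belongs to the next block.
    $carry = (string) substr($data, $last_nl + 1);
    foreach (explode("\n", substr($data, 0, $last_nl)) as $line) {
        $lines[] = $line;
    }
}
if ($carry !== '') {
    $lines[] = $carry; // final line without a trailing newline
}
fclose($fp);
unlink($path);
```

This is where the extra ugliness comes from: fgets handles line boundaries for you, while fread leaves the splitting to your own code.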
I updated the blog post with your suggestions, thank you!
Replied on Twitter already, but for visibility: switching to stream_get_line() shaved off another 15%. This is awesome!