Which Http Client is Faster for Web Scraping

By Insolita

Inspired by the article Fast Web Scraping With ReactPHP, I decided to run a benchmark to check how much faster it is than some other popular libraries: Guzzle, which can also send async requests via multicurl, and Amphp, another non-blocking PHP framework that includes an http-client.

I didn't want a synthetic benchmark, so I chose a more practical task for my test: load different real urls from a predefined list (some of them may be broken), scrape their titles, and save them into a file.
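The per-page part of that task (pull the title out of a response body and append it to a file) can be sketched in plain PHP. This is only an illustration and not the code from the benchmark repository: the function names `extractTitle` and `saveTitle`, the tab-separated output format, and the `[no title]` placeholder are assumptions made for the example.

```php
<?php
declare(strict_types=1);

/**
 * Return the text of the first <title> element, or null if there is none.
 * A regex is enough for a sketch; a real scraper may prefer a DOM parser.
 */
function extractTitle(string $html): ?string
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $matches) !== 1) {
        return null;
    }

    // Decode entities such as &amp; and collapse whitespace runs to one space.
    $title = html_entity_decode($matches[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $title = trim((string) preg_replace('/\s+/', ' ', $title));

    return $title === '' ? null : $title;
}

/**
 * Append one "url<TAB>title" line to the output file (hypothetical format).
 */
function saveTitle(string $file, string $url, ?string $title): void
{
    $line = $url . "\t" . ($title ?? '[no title]') . PHP_EOL;
    // LOCK_EX keeps lines intact if several workers write to the same file.
    file_put_contents($file, $line, FILE_APPEND | LOCK_EX);
}
```

Keeping the parsing and saving in small functions like these means each http client in the comparison only has to differ in how it fetches the body, which is the part being measured.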

During development I ran into certain difficulties that prevented me from making my test clients completely equivalent. Each client has its own specific features, especially amphp, and I don't have much experience with async libraries such as reactphp and amphp.

You can see the repository with the test code and benchmark results here:
https://github.com/Insolita/php-async-benchmarks
All tests were written with php7.4. Each check was run 10 times, and I publish the min, max, and average execution time. I should note that the concrete numbers are not so important, because they depend on internet speed, server config, etc., and you may get different results. Only their relative differences matter.
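To show what "run 10 times and report min, max, and average" means in practice, here is a minimal timing harness in plain PHP. It is a sketch under my own assumptions, not the measuring code from the repository: the function name `benchmark` and the shape of the returned array are hypothetical, and it relies only on `hrtime()`, which is available since PHP 7.3.

```php
<?php
declare(strict_types=1);

/**
 * Run $task $runs times and report min, max and average wall-clock seconds.
 *
 * @return array{min: float, max: float, avg: float}
 */
function benchmark(callable $task, int $runs = 10): array
{
    if ($runs < 1) {
        throw new InvalidArgumentException('runs must be at least 1');
    }

    $timings = [];
    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true);                      // monotonic clock, nanoseconds
        $task();
        $timings[] = (hrtime(true) - $start) / 1e9; // convert to seconds
    }

    return [
        'min' => min($timings),
        'max' => max($timings),
        'avg' => array_sum($timings) / count($timings),
    ];
}
```

Passing a closure that performs one complete scrape of the url list with a given client gives three numbers per client. Since network conditions change between runs, the gap between min and max is a rough indicator of how noisy the measurement was.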

The first results really surprised me. ReactPhp was 2 times slower than Guzzle. I rechecked it again and again, but the numbers stayed the same. However, as the number of requests increases, its performance gets better and better. Amphp, on the other hand, gets slower and slower, and I even excluded it from the last measurement. (This comes from its specifics: in the documentation I found only one way to make concurrent requests, https://amphp.org/http-client/concurrent. There is probably a better way, or additional libraries that queue promises more smartly (like clue/reactphp-mq), but I have not found them.)

In summary, ReactPhp can be a good choice when you need to fetch many thousands of urls, especially when you run it as a separate worker that receives tasks via socket/Redis or an http api. Amphp can be good when you need to fetch a small batch of urls (5, 10, 50) asap. It can also become better with additional wrappers. Guzzle is awesome.

UPD: Just see the power of the Open Source community in action! One of the maintainers of Amphp, Niklas Keller, found and fixed a bug thanks to my benchmark. And now, thanks to the help of Dmitry Balabka and Niklas Keller, the performance of the Amphp http-client has improved significantly!

UPD2: An outsider becomes a winner! Thanks to the improvements, Amphp metrics look good on small batches as well as big ones! I'm intrigued: will the reactphp team offer improvements to increase their http-client speed?

UPD3: Yes! The ReactPHP team accepted the challenge and has also started to work on a fix!

Discussion

Why didn't you compare with swoole? It's the fastest async framework in php (based on benchmarks).

 

I haven't had time for this yet, but I'm going to do it after the new react release and the next round of comparison.
Or, if you want to help me with it, you can submit a PR.