Discussion on: What are your programming goals for 2017?

tbodt

My company does a lot of web scraping; it's basically the entire business. Originally we were using Selenium and PhantomJS, but we started running into scaling issues. So now a scraping grid consists of 32 servers, each with 8 cores and each costing hundreds of dollars a month. The servers mostly sit at around 30% CPU usage. We have like 300k in free servers from various hosting companies, so improving efficiency isn't too high a priority, but something will have to be done eventually.

The obvious alternative to Selenium is to just make HTTP requests, but we have to crawl a lot of really crappy sites that use JavaScript for no apparent reason, and we want to be able to add a new site without spending a lot of time figuring out how to spoof its form submissions. So we're just making our own browser. It uses V8 to run JavaScript, and I had to write a Python C++ extension to make that work.
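To illustrate why plain HTTP requests fall short on those sites, here is a minimal sketch. It is not our actual scraper: `LinkCollector`, `extract_links`, and `SAMPLE_PAGE` are made-up names, and the page is a hypothetical example. It parses the raw markup the way a simple HTTP-based scraper would, and the link that the script creates never shows up because nothing runs the JavaScript.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the links a scraper sees without executing any JavaScript."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: one link is in the markup, the other is only
# created at runtime by the script, so a static parser never sees it.
SAMPLE_PAGE = """
<html><body>
<a href="/static-item">Static item</a>
<div id="list"></div>
<script>
  var a = document.createElement("a");
  a.href = "/dynamic-item";
  document.getElementById("list").appendChild(a);
</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_PAGE))  # prints ['/static-item']
```

Getting at `/dynamic-item` requires something that actually executes the script against a DOM, which is the part a browser (or a JavaScript engine wired up to one) provides.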

Admittedly it's not the most useful thing I could be doing. But it's hella fun.

Ben Halpern Author

Well, whether or not this specific activity is "useful", I'm sure you'll get a hell of a lot out of reading the whole HTML standard!