Discussion on: What are your programming goals for 2017?

I want to finish reading the HTML standard. I'm working on my own browser optimized for web scraping, so knowing a lot about how a browser works is important.

By the way, the HTML standard is an excellent read. It's about a thousand pages of clear, precise specification covering every single detail of what a browser does. Incredibly interesting and informative.

Ben Halpern • Dec 29 '16

Fascinating. The HTML standard does look really interesting at a glance. I've been reading the CommonMark markdown spec myself lately. How did you first get interested in the browser project?

tbodt • Dec 29 '16

My company does a lot of web scraping; it's basically the entire business. Originally we were using Selenium and PhantomJS, but we started running into scaling issues. So now a scraping grid consists of 32 servers, each with 8 cores and each costing hundreds of dollars a month. The servers are mostly at like 30% CPU usage. We have like 300k in free servers from various hosting companies, so improving efficiency isn't too high a priority, but something will have to be done eventually.

The obvious alternative to Selenium is to just make HTTP requests, but we have to crawl a lot of really crappy sites that use JavaScript for no apparent reason, and we want to be able to add a new site without spending a lot of time figuring out how to spoof its form submissions. So we're just making our own browser. It uses V8 to run JavaScript, which meant I had to write a C++ extension for Python.

Admittedly it's not the most useful thing I could be doing. But it's hella fun.