Discussion on: What are your programming goals for 2017?

tbodt

I want to finish reading the HTML standard. I'm working on my own browser optimized for web scraping, so knowing a lot about how a browser works is important.

By the way, the HTML standard is an excellent read. About a thousand pages of clear, detailed specifications of every single detail of what a browser does. Incredibly interesting and informative.

Erika Wiedemann

To clarify, the standard found here? html.spec.whatwg.org/multipage/

It's good to know it's a solid and clear read. I'll have to make some time to work through it.

tbodt

That's the one.

Ben Halpern Author

Fascinating. The HTML standard does look really interesting as I take a glance. I've been reading the CommonMark markdown spec myself lately. How did you first get interested in the browser project?

tbodt

My company does a lot of web scraping; it's basically the entire business. Originally we were using Selenium and PhantomJS, but we started running into scaling issues. A scraping grid now consists of 32 servers, each with 8 cores and each costing hundreds of dollars a month, and the servers mostly sit at around 30% CPU usage. We have something like 300k in free servers from various hosting companies, so improving efficiency isn't too high a priority, but something will have to be done eventually.
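To put those grid numbers in perspective, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above (32 servers, 8 cores each, roughly 30% CPU usage); the variable and function names are made up for illustration, and the per-server cost is left out because no exact figure is given.

```python
# Figures taken from the comment above; the 30% utilization is approximate.
SERVERS = 32
CORES_PER_SERVER = 8
CPU_UTILIZATION = 0.30


def grid_capacity(servers, cores_per_server, utilization):
    """Return (total cores, cores effectively busy, cores effectively idle)."""
    total = servers * cores_per_server
    busy = total * utilization
    idle = total - busy
    return total, busy, idle


total_cores, busy_cores, idle_cores = grid_capacity(
    SERVERS, CORES_PER_SERVER, CPU_UTILIZATION
)
print(f"total cores: {total_cores}")
print(f"effectively busy: {busy_cores:.1f}")
print(f"effectively idle: {idle_cores:.1f}")
```

In other words, roughly 70% of the grid's CPU capacity is sitting unused under these assumptions.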

The obvious alternative to Selenium is to just make HTTP requests, but we have to crawl a lot of really crappy sites that use JavaScript for no apparent reason, and we want to be able to add a new site without spending a lot of time figuring out how to spoof its form submissions. So we're just making our own browser. It uses V8 to run JavaScript, and to do that I had to write a Python C++ extension.
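A minimal sketch of why plain HTTP requests fall short on such sites, using only the Python standard library. This is an illustration, not the scraper described above: `LinkCollector`, `extract_links`, `fetch`, and both sample pages are hypothetical names and data made up for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return the links a script-unaware scraper would see in `html`."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url):
    """Plain HTTP GET with no JavaScript execution (not called in this sketch)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Hypothetical page whose link is in the markup itself.
STATIC_PAGE = '<html><body><a href="/items/1">Item 1</a></body></html>'

# Hypothetical page that builds the same link with JavaScript at load time.
SCRIPTED_PAGE = """<html><body><div id="list"></div>
<script>
var a = document.createElement("a");
a.href = "/items/1";
a.textContent = "Item 1";
document.getElementById("list").appendChild(a);
</script></body></html>"""

print(extract_links(STATIC_PAGE))
print(extract_links(SCRIPTED_PAGE))
```

The static page yields its link, while the scripted page yields nothing, because the link only exists after the script runs. Recovering it requires either reverse-engineering each site by hand or actually executing the JavaScript, which is the gap a browser with an embedded engine such as V8 fills.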

Admittedly it's not the most useful thing I could be doing. But it's hella fun.

Ben Halpern Author

Well, whether or not this specific activity is "useful", I'm sure you'll get a hell of a lot out of reading the whole HTML standard!