I want to finish reading the HTML standard. I'm working on my own browser optimized for web scraping, so knowing a lot about how a browser works is important.
By the way, the HTML standard is an excellent read. About a thousand pages of clear, detailed specifications of every single detail of what a browser does. Incredibly interesting and informative.
Fascinating. The HTML standard does look really interesting as I take a glance. I've been reading the CommonMark markdown spec myself lately. How did you first get interested in the browser project?
My company does a lot of web scraping; it's basically the entire business. Originally we were using Selenium and PhantomJS, but we started running into scaling issues. A scraping grid now consists of 32 servers, each with 8 cores and each costing hundreds of dollars a month. The servers are mostly at like 30% CPU usage. We have like 300k in free servers from various hosting companies, so improving efficiency isn't too high a priority, but something will have to be done eventually.
The obvious alternative to Selenium is to just make HTTP requests, but we have to crawl a lot of really crappy sites that use JavaScript for no apparent reason, and we want to be able to add a new site without spending a lot of time figuring out how to form spoof. So we're just making our own browser. It uses V8 to run JavaScript, and I had to write a Python C++ extension to make that work.
Admittedly it's not the most useful thing I could be doing. But it's hella fun.
Well, whether or not this specific activity is "useful", I'm sure you'll get a hell of a lot out of reading the whole HTML standard!
To clarify, the standard found here? html.spec.whatwg.org/multipage/
Good to know it's a solid & clear read. I'll have to make some time to work through it.
That's the one.