For a recent DjangoCon Europe talk, I wrote a Python web spider to catalogue the HTTP headers of the top 10,000,000 domains (based on Open PageRank data).
At the time of writing, it has spidered all 10 million domains at least once.
Spider Stats
- 7,187,532 have completed "successfully"; that is, they returned an HTTP response with any HTTP status code, including error codes in the 4xx and 5xx range.
- 6,280,590 have an HTTP status of 200
- 766,584 have an HTTP status in the 4xx range
- 137,100 have an HTTP status in the 5xx range
- 368 have an HTTP status which is >= 600 (the highest was 999, see screenshot below)
- 2,812,468 domains have failed because of DNS errors, timeouts, and so on. I will continue to retry these domains over the coming days (maximum of five attempts per domain).
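To make the buckets above concrete, here is a minimal Python sketch of how status codes could be grouped and how the five-attempt retry rule might be expressed. It is not the spider's actual code: the function names are hypothetical, and the retry check assumes that the `completed` and `attempts` fields from the document schema further down are what decide whether a domain is tried again.

```python
from collections import Counter

MAX_ATTEMPTS = 5  # "max five attempts", as stated in the list above


def status_bucket(status):
    """Map an HTTP status code to the buckets used in the stats list."""
    if status == 200:
        return "200"
    if 400 <= status < 500:
        return "4xx"
    if 500 <= status < 600:
        return "5xx"
    if status >= 600:
        return ">=600"
    return "other"  # 1xx, 3xx, and 2xx codes other than 200


def summarise(statuses):
    """Count how many status codes fall into each bucket."""
    return Counter(status_bucket(s) for s in statuses)


def should_retry(doc):
    """Assumption: a domain is retried while it is not completed
    and has used fewer than MAX_ATTEMPTS attempts."""
    return not doc["completed"] and doc["attempts"] < MAX_ATTEMPTS


print(summarise([200, 200, 404, 503, 999]))
print(should_retry({"completed": False, "attempts": 2}))
```

The "other" bucket is there because the four status counts in the list do not add up to the full 7,187,532, so some responses must carry codes outside those ranges.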
The dataset is stored in a single MongoDB Atlas collection, and it is currently a little over 8 GB in size.
I'm going to make it public and free to access for all! 😃
Document Schema
Each document contains the domain, Open PageRank, date/time crawled, and all the HTTP headers received.
[
  {
    _id: "5f31ee822ff3764aa9c446d4",
    rank: 610,
    domain: "dev.to",
    pageRank: { $numberDecimal: "6.70" },
    processing: false,
    completed: true,
    attempts: 1,
    last_updated: "2020-09-15T03:29:00.447Z",
    headers: {
      "Content-Length": "71618",
      Server: "Cowboy",
      "X-Frame-Options": "SAMEORIGIN",
      "X-Xss-Protection": "1; mode=block",
      "X-Content-Type-Options": "nosniff",
      "X-Download-Options": "noopen",
      "X-Permitted-Cross-Domain-Policies": "none",
      "Referrer-Policy": "strict-origin-when-cross-origin",
      "Cache-Control": "public, no-cache",
      "X-Accel-Expires": "600",
      "Content-Type": "text/html; charset=utf-8",
      "Content-Encoding": "gzip",
      Etag: 'W/"9e7cc41631c8a0ba2a886cdb2b844b40"',
      "Content-Security-Policy": "",
      "X-Request-Id": "bf2b33f2-d4e2-4b5d-a3b0-15717705278d",
      "X-Runtime": "0.150673",
      Via: "1.1 vegur",
      "Access-Control-Allow-Origin": "*",
      "Accept-Ranges": "bytes",
      Date: "Tue, 15 Sep 2020 03:29:00 GMT",
      Age: "327",
      "X-Served-By": "cache-den19625-DEN, cache-jax20947-JAX",
      "X-Cache": "HIT, MISS",
      "X-Cache-Hits": "1, 0",
      "X-Timer": "S1600140540.196933,VS0,VE155",
      Vary: "Accept-Encoding, X-Loggedin",
    },
    request_url: "https://dev.to",
    response_url: "https://dev.to",
    status: 200,
  },
]
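As an example of what you might do with a document of this shape, the Python sketch below checks which of a handful of security-related headers are absent or empty. The helper name and the list of headers are my own choices for illustration; the sample is a trimmed copy of the dev.to document above.

```python
# Headers to look for. This list is an illustrative choice, not part of the dataset.
SECURITY_HEADERS = (
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Frame-Options",
    "X-Content-Type-Options",
    "Referrer-Policy",
)


def missing_security_headers(doc, wanted=SECURITY_HEADERS):
    """Return the headers from `wanted` that are absent or empty in doc["headers"]."""
    # HTTP header names are case-insensitive, so compare lowercased names.
    present = {name.lower(): value for name, value in doc.get("headers", {}).items()}
    return [name for name in wanted if not present.get(name.lower(), "").strip()]


# Trimmed copy of the sample document shown above.
sample = {
    "domain": "dev.to",
    "status": 200,
    "headers": {
        "X-Frame-Options": "SAMEORIGIN",
        "X-Content-Type-Options": "nosniff",
        "Referrer-Policy": "strict-origin-when-cross-origin",
        "Content-Security-Policy": "",
    },
}

print(missing_security_headers(sample))
# prints ['Strict-Transport-Security', 'Content-Security-Policy']
```

The empty `Content-Security-Policy` value is reported as missing because the helper treats a blank header the same as an absent one.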
It's a fun and fascinating dataset, with a lot to discover. So I'm very excited to open it up to the world.
Just like our Johns Hopkins University COVID-19 open dataset, I want to make this super easy to access, no matter if you're using Node, Python, Java, or even Excel!
But before I make it publicly available, I would like to offer access to a limited number of people.
Request Access
I want to see how you query the data. What indexes do you need? How can I structure it to make the data easy and efficient to work with?
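To give a feel for the kind of queries and indexes I have in mind, here is a hedged Python sketch. The filters are plain MongoDB query documents built from the field names in the schema above. The index list is purely hypothetical (no indexes have been decided on), and `collection` stands for any object with PyMongo-style `create_index` and `count_documents` methods.

```python
def status_range_filter(low, high):
    """MongoDB filter for documents whose status is in the range [low, high)."""
    return {"status": {"$gte": low, "$lt": high}}


def header_equals_filter(header, value):
    """Filter on one header, using dot notation into the `headers` sub-document."""
    return {f"headers.{header}": value}


# Hypothetical index candidates as (field, direction) pairs; 1 means ascending.
CANDIDATE_INDEXES = [
    [("status", 1)],
    [("domain", 1)],
    [("headers.Server", 1), ("status", 1)],
]


def ensure_indexes(collection):
    """Create each candidate index on a PyMongo-style collection."""
    for keys in CANDIDATE_INDEXES:
        collection.create_index(keys)


def count_server_errors(collection):
    """Count documents with a 5xx status."""
    return collection.count_documents(status_range_filter(500, 600))


print(status_range_filter(400, 500))
print(header_equals_filter("Server", "Cowboy"))
```

Which of these indexes are worth keeping depends entirely on the queries people actually run, which is what I am hoping to learn from early access.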
If you would like early access to the dataset, please follow and then DM me on Twitter or email me at aaron.bassett@mongodb.com
Top comments (1)
How did you crawl that many domains? Can you make the Python script open source?