Aaron Bassett
HTTP Headers from ~10 Million domains - An Open Dataset

For a recent DjangoCon Europe talk I wrote a Python web spider to catalogue the HTTP headers of the top 10,000,000 domains (ranked by Open PageRank data).

At the time of writing, it has spidered all 10 million domains at least once.

Spider Stats

  • 7,187,532 have completed "successfully"; that is, they returned an HTTP response containing any HTTP status code, including error codes in the 4xx and 5xx ranges.
  • 6,280,590 have an HTTP status of 200
  • 766,584 have an HTTP status in the 4xx range
  • 137,100 have an HTTP status in the 5xx range
  • 368 have an HTTP status >= 600 (the highest was 999, see screenshot below)
  • 2,812,468 domains have failed (DNS errors, timeouts, and so on). I will continue to retry these domains over the coming days (maximum of five attempts)

Screenshot of axcis.co.uk's HTTP response with a status line of "HTTP/1.1 999 No Hacking"
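The buckets above can be sketched as a small classifier. This is a minimal sketch of my own; the function name and the "other" bucket for statuses the post doesn't break out are assumptions, not part of the spider's code.

```python
def status_bucket(status):
    """Classify an HTTP status code into the ranges used in the stats above."""
    if status == 200:
        return "200"
    if 400 <= status < 500:
        return "4xx"
    if 500 <= status < 600:
        return "5xx"
    if status >= 600:
        return ">=600"  # non-standard codes, such as the 999 seen above
    return "other"  # 1xx, non-200 2xx, and 3xx responses

print(status_bucket(999))  # → >=600
```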

The dataset is in a single MongoDB Atlas collection, and it is currently a little over 8GB in size.

Screenshot of collection size in Atlas

I'm going to make it public and free to access for all! 😃

Document Schema

Each document contains the domain, Open PageRank, date/time crawled, and all the HTTP headers received.

[
    {
        _id: "5f31ee822ff3764aa9c446d4",
        rank: 610,
        domain: "dev.to",
        pageRank: { $numberDecimal: "6.70" },
        processing: false,
        completed: true,
        attempts: 1,
        last_updated: "2020-09-15T03:29:00.447Z",
        headers: {
            "Content-Length": "71618",
            Server: "Cowboy",
            "X-Frame-Options": "SAMEORIGIN",
            "X-Xss-Protection": "1; mode=block",
            "X-Content-Type-Options": "nosniff",
            "X-Download-Options": "noopen",
            "X-Permitted-Cross-Domain-Policies": "none",
            "Referrer-Policy": "strict-origin-when-cross-origin",
            "Cache-Control": "public, no-cache",
            "X-Accel-Expires": "600",
            "Content-Type": "text/html; charset=utf-8",
            "Content-Encoding": "gzip",
            Etag: 'W/"9e7cc41631c8a0ba2a886cdb2b844b40"',
            "Content-Security-Policy": "",
            "X-Request-Id": "bf2b33f2-d4e2-4b5d-a3b0-15717705278d",
            "X-Runtime": "0.150673",
            Via: "1.1 vegur",
            "Access-Control-Allow-Origin": "*",
            "Accept-Ranges": "bytes",
            Date: "Tue, 15 Sep 2020 03:29:00 GMT",
            Age: "327",
            "X-Served-By": "cache-den19625-DEN, cache-jax20947-JAX",
            "X-Cache": "HIT, MISS",
            "X-Cache-Hits": "1, 0",
            "X-Timer": "S1600140540.196933,VS0,VE155",
            Vary: "Accept-Encoding, X-Loggedin",
        },
        request_url: "https://dev.to",
        response_url: "https://dev.to",
        status: 200,
    },
]
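To give a flavour of how the schema above could be queried, here is one aggregation I'd expect people to run: counting the most common `Server` headers among successful responses. The pipeline field names follow the sample document; the database, collection, and connection details are placeholders of mine, not from the post.

```python
# Count the top 10 "Server" header values across 200 responses.
pipeline = [
    {"$match": {"status": 200}},
    {"$group": {"_id": "$headers.Server", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]

# With pymongo this would be executed roughly like so (placeholder names):
# from pymongo import MongoClient
# client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
# for doc in client["crawl"]["domains"].aggregate(pipeline):
#     print(doc["_id"], doc["count"])
```

Note that header names are stored with whatever casing the server sent, so depending on how the crawler normalised them you may need to account for variants like `Server` versus `server`.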

It's a fun and fascinating dataset, with a lot to discover. So I'm very excited to open it up to the world.

Just like our Johns Hopkins University COVID-19 open dataset, I want to make this super easy to access, whether you're using Node, Python, Java, or even Excel!

But before I make it publicly available, I would like to offer access to a limited number of people.

Request Access

I want to see how you query the data: what indexes do you need? How can I structure it to make the data easy and efficient to work with?
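As a starting point for that conversation, here are some index specifications that might suit the access patterns the schema suggests. The field names come from the sample document, but the choices are my guesses, not indexes the author has committed to creating.

```python
# Candidate indexes, expressed as (field, direction) specs in pymongo's
# format: 1 for ascending, -1 for descending.
candidate_indexes = [
    [("status", 1)],          # filter by status code or range
    [("pageRank", -1)],       # top-N domains by Open PageRank
    [("headers.Server", 1)],  # group/filter by server software
    [("domain", 1)],          # point lookups by domain name
]

# With pymongo, each would be created via collection.create_index(spec).
```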

If you would like early access to the dataset, please follow and then DM me on Twitter, or email me at aaron.bassett@mongodb.com

Top comments (1)

Grocker

How did you crawl that many domains? Can you make the Python script open source?