You may be wondering, "How do I retrieve data from a resource that has long been closed or some of the data from it has been deleted?" I had the same question a long time ago, that's when I fully discovered the capabilities of Web Archives and created a web archive extraction tool based on Common Crawl.
Recently I had to solve a similar problem, but Common Crawl seemed to be barely breathing due to overload... So I decided to reinvent the wheel and improve my tool by extending it with a Wayback Machine archives source. While developing it, I accidentally got certified by the CIA. Now my altruistic self wants to show everyone how to get such a cool thing on their virtual shelf!
For those who are interested in the "behind the scenes" details, I also described what the two aforementioned services have in common and how we can use them through the API. At the end, we will get the certificate using the GoGetCraw tool.
Web Archive services
First, it is worth mentioning the general points of services such as Common Crawl and Wayback Machine... Their archives are stored in WARC/ARC files, also called CDX files (probably from Capture/Crawl inDeX). Then, with a tool like pywb, we can serve these indexes and examine them with various filters.
For example, in the screenshot below, you can see the result of the request http://<server_addr>/cdx?url=https://twitter.com/internetarchive/
to the CDX server:
As you can see, the result will be a number of rows related to the queried URL (https://twitter.com/internetarchive/
). This data contains the status at the time of the request (status code
) and the data collection time (timestamp
), the file type (mimetype
) and other interesting parameters. A more detailed description of the CDX server and the parameters used in the queries can be found here.
In addition to the Wayback Machine and Common Crawl services described below, there are many others. Unfortunately, their archives are less extensive and are usually archives of individual country websites or dedicated to a topic (e.g. art). You can find some of them here.
Let's exctract some files
I believe some of you have already used web archives through pretty interfaces. Now I will show you how to do it a fun way, using the API of Wayback Machine. For example, we want to get all the JPEG files on cia.gov
and its subdomains and then download the file we are interested in.
To perform our task with this resource, we construct the following query:
https://web.archive.org/cdx/search/cdx?url=*.cia.gov/*&output=json&limit=10&filter=mimetype:image/jpeg&collapse=urlkey
Where:
-
/cdx/search/cdx
- the endpoint of the CDX server, - `url=.cia.gov/` - our target domain,
-
filter=mimetype:image/jpeg
- filtering by MIME type of JPEG file, -
output=json
* - demand result in JSON format,limit=10
- limit to 10 results,collapse=urlkey
- get unique URLs (without this, there are many duplicates).
As a result, we get the 10 images found in the archive. In addition to the URLs, the response contains the MIME type of the files (useful if you do not use filtering), as well as the status code when accessing the object at the time of archive creation:
[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["gov,cia)/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg", "20150324125120", "https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg", "image/jpeg", "200", "OJRFXPWOPZQGPRIZZQZOTRSZKAVLQLKZ", "3845"],
["gov,cia)/++theme++contextual.agencytheme/images/aerial_cropped.jpg", "20160804222651", "https://www.cia.gov/++theme++contextual.agencytheme/images/Aerial_Cropped.jpg", "image/jpeg", "200", "3WII7DZKLXM4KSQ5UTEKO5EL7H5VTB35", "196685"],
["gov,cia)/++theme++contextual.agencytheme/images/background-launch.jpg", "20140121032437", "https://www.cia.gov/++theme++contextual.agencytheme/images/background-launch.jpg", "image/jpeg", "200", "3C4G73473VYPOWDNA4VJUV4Q7EC3IXN4", "44501"],
["gov,cia)/++theme++contextual.agencytheme/images/background-video-panel.jpg", "20150629034524", "https://www.cia.gov/++theme++contextual.agencytheme/images/background-video-panel.jpg", "image/jpeg", "200", "CQCUYUN5VTVJVN4LGKUZ3BHWSIXPSCKC", "71813"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-1.jpg", "20130801151047", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-1.jpg", "image/jpeg", "200", "GPSEAEE23C53TRGHLMBXHWQYNB3EGBCZ", "14858"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-2.jpg", "20130801150245", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-2.jpg", "image/jpeg", "200", "L6P2MNAAMZUMHUEHJFGXWEUQCHHMK2HP", "15136"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-3.jpg", "20130801151656", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-3.jpg", "image/jpeg", "200", "ODNXI3HZETXVVSEJ5I2KTI7KXKNT5WSV", "19717"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/an-4.jpg", "20130801150219", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/an-4.jpg", "image/jpeg", "200", "X7N2EIYUDAYWMX7464LNTHBVMTEMZUVN", "20757"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/banner-benefits-background.jpg", "20150510022313", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/banner-benefits-background.jpg", "image/jpeg", "200", "VZJE5XSAQWBD6QF6742BH2N3HOTSCZ4A", "12534"],
["gov,cia)/++theme++contextual.agencytheme/images/bannerheads/chi-diversity.jpg", "20130801150532", "https://www.cia.gov/++theme++contextual.agencytheme/images/bannerheads/CHI-diversity.jpg", "image/jpeg", "200", "WJQOQPYJTPL2Y2KZBVJ44MVDMI7TZ7VL", "6458"]]
Next, to access the archived file, we take one of the results above and use the following query:
https://web.archive.org/web/20150324125120id_/https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg
Where:
-
/web
- file-server endpoint, -
20230502061729id_
* - timestamp obtained during the previous query + id_, -
https://www.cia.gov/++theme++contextual.agencytheme/images/aerial-analysis-btn.jpg
* - file URL received during the previous request.
GoGetCrawl the CIA certificate
It is worth saying that at the moment there are already a few tools for interacting with archives. But with a cursory glance I haven't found any that, apart from URL mining, could download the file and easily integrate into other Go projects!
With this in mind, I updated my outdated solution by doing some refactoring and adding Wayback Machine
as a second source of archive data.
Let's get the paper
Finally, we will get a certificate using gogetcrawl. You can use it in several ways described here. You won't have to compile or install anything, and so for convenience:
- You can download the latest release. And use the binary as follows:
gogetcrawl file *.cia.gov/* --dir ./ --ext pdf
After waiting a little bit we should get the cherished file, if you are bored or there is no file, you can see what happens by adding the -v
flag.
- You can as well use Docker:
docker run uranusq/gogetcrawl url *.cia.gov/* --ext pdf
As a result of this command, we should see the URL with the PDF file. You can learn more about the commands and possible arguments by using the -h
flag.
- Installation option for Go-phers:
go install github.com/karust/gogetcrawl@latest
Congrats
We received our certificate 🫢. For those who don't have it yet, it looks like this:
Not fully understanding what it is and why it is there, I immediately rushed to share this opportunity with you.
Open-source
For those who use Go and want to become an archivarius, there is an option to apply GoGetCrawl in your project:
go get github.com/karust/gogetcrawl
For example, a minimal program that will show us all the pages of example.com
and its subdomains with the status of 200 looks like this:
package main
import (
"fmt"
"github.com/karust/gogetcrawl/common"
"github.com/karust/gogetcrawl/wayback"
)
func main() {
// Get only 10 status:200 pages
config := common.RequestConfig{
URL: "*.example.com/*",
Filters: []string{"statuscode:200"},
Limit: 10,
}
// Set request timout and retries
wb, _ := wayback.New(15, 2)
// Use config to obtain all CDX server responses
results, _ := wb.GetPages(config)
for _, r := range results {
fmt.Println(r.Urlkey, r.Original, r.MimeType)
}
}
On the project page, you can find more examples, including file extraction and CommonCrawl usage.
PS
I hope no one was click baited, and everyone enjoyed the new certificate in their collection :) Perhaps there are other web archive services other than those described in the article that can be used similarly? Let me know.
Top comments (0)