MKaandorp

Posted on Mar 12, 2021

This website contains pictures of your vacation next year, audio of your last words, and a description of the end of humankind

#javascript #hddofbabel #libraryofbabel #datascience

I created a website, called the HDD of Babel, which contains pictures of your vacation next year, an audio recording of your last words, and a description of the end of humankind. It contains all data that ever was, is, or ever will be (as long as it's below the size limit). You can submit some data, a file or some text, and it will return the webpage on which your data can be found. This page is not created when you submit the data; nothing is added to a database. Since the launch of this website, your data has existed on this page. Alternatively, you can go to a random page and explore the secrets this library holds yourself.

Does this website really contain all data? In some ways, yes. In other, more sensible ways, not really.

Context

In 1941, Argentinian author and librarian Jorge Luis Borges published a short story called “The Library of Babel”. This story features an immense library containing all possible 410-page books. Although most books will not make any sense, the library must also contain every piece of literature that has ever been written, and that will ever be written. The library contains all truths of the universe, as well as every lie. This leads to some cult-like behaviour, with characters searching through the library for answers, truths and prophecies.

Although building Borges’ library in a physical form would be challenging, the digital age makes a virtual version possible. Jonathan Basile took up the challenge, and created a website which contains all possible pages of 3200 characters. A fascinating project, which has spawned communities, which, not unlike Borges' characters, try to find meaning in the virtual pages.

As a next evolution for this concept of a universal library, I have created a version which does not contain only text, but all possible data in general. This means that this website also contains every possible image, video, piece of audio, website, etc. There is a page on this website which contains a picture of your yet-to-be-born grandchildren, as well as one with a fragment of Charli XCX's upcoming hit single. There is even a page which contains the homepage of the website itself. However, there are also billions and billions of pages which contain total gibberish.

Questions raised

Can these libraries help us uncover the ancient mysteries of the universe? Can these libraries tell us anything about our future, by showing us pictures of life in the year 3021, by letting us listen to the sounds of the last of humankind? And, closer to home, do these libraries render our current copyright system completely obsolete, by containing every piece of media that will ever be created?

How it works

Although a universal library sounds like a magical place, and, like the characters from Borges' book and the Reddit communities, one might be tempted to search through it to find the hidden truths of the universe, a closer look reveals its simplicity and meaninglessness.

In both the Library of Babel and the HDD of Babel, each piece of data has a specific location. In the Library of Babel, as in the story, this location is the hexagon name, wall number, shelf number, and book name. For the HDD of Babel, this location is the URL. By tying each piece of data to a specific location, it’s easy to share the found data, and to find it again after leaving the page. However, this also makes it possible to use the location of a page as a seed for generating its contents.

The homepage of the HDD of Babel contains a button to submit a file, which returns a URL to the page which contains this file. When you submit a file, it generates its URL. It does this in the following way:

First, we take the data URL of the uploaded file, which looks like data:[<mediatype>][;base64],<data>.

Although the actual data in the data URL is already base64 encoded, we encode this complete data URL in base64 as well. This way it looks a bit more random, and we preserve some of the magic. To make it more difficult for the user to recognize patterns in the URL, this base64 string is then reversed. These last two steps serve to make it more difficult to see the link between the location of the page and its contents.
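The encoding steps described above can be sketched in a few lines. This is only an illustration under assumptions: the function name encodeLocation and the sample data URL are made up, and the built-in btoa (available in browsers and recent Node versions) stands in for whatever the site actually uses. Base64 output can also contain characters such as +, / and =, and how the real site handles those inside a URL is not covered by this sketch.

```javascript
// Hypothetical helper, not the site's actual code.
// Turns a data URL into the string used as the page location.
function encodeLocation(dataUrl) {
  // Step 1: base64-encode the complete data URL (a second layer of base64).
  const outer = btoa(dataUrl);
  // Step 2: reverse the string so patterns are harder to spot.
  return outer.split("").reverse().join("");
}

// Made-up sample: a plain-text file whose contents are "hello".
const sampleDataUrl = "data:text/plain;base64,aGVsbG8=";
const pagePath = encodeLocation(sampleDataUrl);
console.log(pagePath);
```

Nothing is stored anywhere in this process: the returned string is simply the submitted data in a different notation.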

When the user visits this generated URL, it returns a 404, as this page does not really exist. A custom 404 page takes the URL, and performs the above steps in reverse to decode the data. If the script recognizes a MIME type in the decoded data, it tries to display the data on the page. If it does not recognize the type of data, it only allows the user to download the data as a file.
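A rough sketch of that decoding step, assuming the location is the last segment of the URL. All names here (decodeLocation, detectMimeType, planRendering) are hypothetical and the MIME check is deliberately simplistic; the site's real 404 script may work differently.

```javascript
// Hypothetical helper: recover the data URL from the page location.
function decodeLocation(segment) {
  const base64 = segment.split("").reverse().join(""); // undo the reversal
  return atob(base64); // undo the outer base64 layer
}

// Hypothetical helper: return the MIME type of a data URL,
// or null when no type can be recognized.
function detectMimeType(dataUrl) {
  const match = /^data:([a-z]+\/[a-z0-9.+-]+)/i.exec(dataUrl);
  return match ? match[1].toLowerCase() : null;
}

// Decide what the 404 page should do with a decoded location:
// display the data when a MIME type is found, otherwise offer a download.
function planRendering(segment) {
  const dataUrl = decodeLocation(segment);
  const mimeType = detectMimeType(dataUrl);
  return mimeType
    ? { action: "display", mimeType, dataUrl }
    : { action: "download", mimeType: null, dataUrl };
}
```

Treating the missing MIME type as an ordinary result (null) rather than an error keeps the download path reachable for pages that decode to unrecognizable data.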

This means that if you send someone a link to a page on the HDD of Babel, you’re not sending them a link to the data, you’re sending them the encoded data (in the URL) and a link to a decoder (found on the webpage).

The Library of Babel website works in a similar way. The text-to-be-found entered on the homepage is encoded into a location (hexagon name, wall number, shelf number, and book name). Upon visiting this location, the location is decoded into text. This means the location is the data.

The realization that the contents of the page are just a decoded version of the location of the page, and the library only serves as a decoder, makes these universal libraries a bit less magical. The illusion only works because it’s not easy to see the link between location and page contents. A similar, although simpler system can be found in Caesar's Cipher, in which each letter of a text is replaced by a letter found a fixed number of positions down the alphabet. For instance, if we use the number one, the text “ABC” becomes “BCD”, and the text “HELLO” becomes “IFMMP”. However, to say Caesar’s Cipher contains every piece of literature ever written, although in some way true, sounds absurd.
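For comparison, here is a minimal Caesar's Cipher for uppercase letters. The function name caesarShift is just an illustrative choice.

```javascript
// Shift every uppercase letter A-Z by `shift` positions, wrapping around
// the alphabet. Other characters are left untouched.
function caesarShift(text, shift) {
  const offset = ((shift % 26) + 26) % 26; // normalize negative shifts
  return text.replace(/[A-Z]/g, (ch) =>
    String.fromCharCode(((ch.charCodeAt(0) - 65 + offset) % 26) + 65)
  );
}

console.log(caesarShift("ABC", 1));    // BCD
console.log(caesarShift("HELLO", 1));  // IFMMP
console.log(caesarShift("IFMMP", -1)); // HELLO
```

Like the libraries, this function "contains" every text only in the sense that it will decode whatever encoded text you hand it.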

An even simpler example would be a library in which the encoded version of the data is equal to the decoded version of the data, in other words, in which the location of the page is exactly the same as the contents. One could think of a website which displays the last part of its URL. For instance, when browsing to universallibrary.com/the-answer-to-life-the-universe-and-everything, one would see a page containing the text “the-answer-to-life-the-universe-and-everything”. It will be no surprise that if one replaces this last part of the URL with the first chapter of the first Game of Thrones novel, it will display a page containing the first chapter of the first Game of Thrones novel.
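Such an identity library fits in a single function. The sketch below uses the standard URL class; pageContents is a made-up name, and the domain is the example from the paragraph above.

```javascript
// Hypothetical "library" whose page contents are simply the last part of the URL.
function pageContents(url) {
  const lastSegment = new URL(url).pathname.split("/").pop();
  return decodeURIComponent(lastSegment);
}

console.log(
  pageContents("https://universallibrary.com/the-answer-to-life-the-universe-and-everything")
);
```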

Both Basile’s website and the HDD of Babel contain functionality to visit a random page, which would be similar to picking a random book from Borges’ library. A random string is generated, and encoded into a location. Upon visiting this location, it is again decoded into the generated random string. One could skip the encoding and decoding steps, and still find the same result. This reduces the process to a practice already perfected by an infinite number of monkeys. By generating random data, one could find all the truths of the universe, but also everything that’s not true. All these possibilities already exist; describing or instantiating them does not provide any new information. A randomly generated answer to any question will have a high probability of being wrong.
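The redundancy of the round trip is easy to demonstrate. In this sketch, randomPageData, encode and decode are hypothetical names, Math.random stands in for whatever random source the sites use, and the base64-and-reverse scheme is the one described earlier for the HDD of Babel.

```javascript
// Generate `length` random bytes, represented as a Latin-1 string.
function randomPageData(length) {
  let data = "";
  for (let i = 0; i < length; i++) {
    data += String.fromCharCode(Math.floor(Math.random() * 256));
  }
  return data;
}

// Same scheme as described above: base64, then reverse (and back again).
const encode = (data) => btoa(data).split("").reverse().join("");
const decode = (location) => atob(location.split("").reverse().join(""));

const generated = randomPageData(32);
const shown = decode(encode(generated));
console.log(shown === generated); // the "random page" is just the random data
```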

Conclusion

Now that we know how these libraries work, we can answer the questions raised earlier.
These libraries do not provide any new information, nor do they provide any new insights. We can either find information we already have, by visiting the linked location, or we can generate meaningless random information.
The websites do not “contain” all possible information; they just provide a simple decoding algorithm for any encoded information, and are therefore not much different from any simple cipher.
It also does not make our copyright system obsolete, because a piece of media would have to exist, either created by the artist or generated randomly, before it could be found in the library.

Practical Usage

The fact that the library does not really contain the last Game of Thrones book does not mean it serves no purpose whatsoever. For one thing, it helps us think about the nature of information, and its practical applications.
Also, it could serve as an easy way to “host” some kind of content, for example the terms and conditions for your mobile app, your resume, or your favorite meme. This works best when paired with some URL shortener service.
Thirdly, although we know it does not make any sense, it can still be fun to try to find some meaning in (seemingly) random patterns, or connections between unrelated systems. If we can find faces in toast, fortune in cookies, and our futures in the stars, we can certainly find something of value in these universal libraries.

The source code for the HDD of Babel can be found here.

Top comments (3)

Gerald Gehrke • Mar 12 '21

Sorry, but I think your 404.js is broken at line 39. If no MimeType is detected, the JS just errors out. The rest of the code is unexecuted. Was unable to successfully download any data from "random" pages.

MKaandorp • Mar 12 '21

Good catch, thanks for letting me know. Should be fixed now.

ajmkaandorp • Mar 23 '21

Wow, cool article! This.