Zander Bailey
WikiGolf: An Automated Traversal of Wikipedia

WikiGolf is a program designed to play Wikipedia Golf using an efficient process to navigate Wikipedia. But let’s back up for a moment. You might be asking, what is Wikipedia Golf? Wikipedia Golf is a game where you choose a Wikipedia page to start on, and another Wikipedia page to end up at. You then try to reach the end page using only the links embedded in each page you visit, in as few clicks as possible. WikiGolf is a program I designed to perform an automated version of this process, with the intention of setting a "Par".

I started by exploring the Wikipedia API and its various wrapper libraries to establish a way of performing the necessary navigation operations using only interaction with the API. At first I did this by making manual calls to the API, requesting the links using the links prop:

import requests

URL = "https://en.wikipedia.org/w/api.php"
TITLE = "Golf"  # example starting page
S = requests.Session()

PARAMS = {
    'action': "query",
    'titles': TITLE,
    'prop': "links",
    'pllimit': "max",
    'format': "json",
}

R = S.get(url=URL, params=PARAMS)
data = R.json()

# The page ID isn't known in advance, so grab the first (only) page entry
links = data['query']['pages'][list(data['query']['pages'])[0]]['links']
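One wrinkle worth noting with the manual approach: with pllimit set to max, the API returns at most 500 links per request, and pages with more links require follow-up requests carrying the plcontinue token from the response's continue block. Here is a minimal sketch of that continuation loop; the fetcher is injected as a callable so the logic can be tested without network access (in practice it would just be a wrapper around S.get(...).json()):

```python
def fetch_all_links(title, get_json):
    """Collect every link on a page, following MediaWiki 'continue' tokens.

    get_json is a callable that takes a params dict and returns the parsed
    JSON response (in practice: lambda p: S.get(url=URL, params=p).json()).
    """
    params = {
        'action': "query",
        'titles': title,
        'prop': "links",
        'pllimit': "max",
        'format': "json",
    }
    links = []
    while True:
        data = get_json(params)
        for page in data['query']['pages'].values():
            links.extend(l['title'] for l in page.get('links', []))
        if 'continue' not in data:
            break
        params.update(data['continue'])  # carries the plcontinue token forward
    return links
```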

but I soon discovered that the wikipedia library has methods to simplify this process greatly:

page = wikipedia.page(TITLE)
links = page.links

Now that my bot had a way to get all links from a page, it just needed to make decisions intelligently. I experimented with several types of models, including topic models, but an obstacle with most of them was the need for complete data to perform an analysis, and it seemed inefficient to use the entirety of Wikipedia (about 6 million articles counting only English) as a dataset. I had played around with Word2Vec previously, since it is well suited to comparing the similarity of words, but I was uncertain how to build a model covering every word the bot would encounter.

The solution I came up with was to build a new model at each page, using the current page and the target page as input, which ensures that the vocabulary contains the target subject as well as all the links on the current page. With this I was able to compare the similarity between every link on the current page and the target subject. After some testing I established a distance threshold of 0.02: the bot only follows a link whose distance to the target is less than or equal to 0.02, and otherwise it chooses a random link.

Now that it could examine the links a little more closely, I took another look at establishing whether we’ve reached the target. First, we do the basic test, which is to compare the target title with the current title. If these match, obviously we have reached the page. Otherwise we can use our model again to compare the current title with the target title and see how similar they are; if the current page is similar enough, we can take note of it for later reference. Because even a smart version of this search could go on for a while, I decided to build in a feature to time out the search if it takes too many links, at which point it returns the closest match found so far, if any.
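To make the selection and termination logic concrete, here is a simplified sketch of the link-picking step and the bounded search loop. The distance function is a stand-in passed as a callable (in WikiGolf it would query the freshly trained Word2Vec model); the 0.02 threshold comes from the description above, while the function names and max_hops default are illustrative assumptions:

```python
import random

THRESHOLD = 0.02  # maximum distance at which a link counts as "close"

def choose_next_link(links, target, distance):
    """Pick the link closest to the target, or a random one if none is close.

    distance is a callable returning a distance between two titles; in
    WikiGolf this would come from the per-page Word2Vec model.
    """
    best = min(links, key=lambda link: distance(link, target))
    if distance(best, target) <= THRESHOLD:
        return best
    return random.choice(links)

def search(start, target, get_links, distance, max_hops=50):
    """Follow links until the target is reached or max_hops is exhausted.

    Returns (path, end): end is the target if reached, otherwise the
    closest page seen before the search timed out.
    """
    current = start
    path = [start]
    closest, closest_dist = start, distance(start, target)
    for _ in range(max_hops):
        if current == target:              # basic test: titles match
            return path, current
        d = distance(current, target)      # model-based closeness check
        if d < closest_dist:
            closest, closest_dist = current, d
        current = choose_next_link(get_links(current), target, distance)
        path.append(current)
    return path, closest                   # timed out: best match so far
```

Injecting get_links and distance keeps the traversal logic separate from the API calls and the model, so the whole loop can be exercised on a tiny in-memory link graph.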
