Wikipedia holds an enormous collection of knowledge and can be a useful reference tool. For developers, Wikipedia offers an API that lets programs access its store of information. This API is very open and friendly, with very few limitations. An account is typically needed if you intend to modify anything, but if all you want to do is read information, there is no need to sign up for credentials. The API is the MediaWiki Action API (MediaWiki is the software that runs Wikipedia), and there are many libraries that streamline the interface in different languages. Before getting to any of those, however, it is important to understand how the API works and what we can do with it.
For example, let’s say we want to extract some information from the Wikipedia page for Jurassic Park, the novel. After creating a session, we can query the database using various parameters. For this example I will be using Python to demonstrate the required parameters, but regardless of the language, the names and purposes of the parameters remain the same. To begin with, we will use three parameters: action, page, and format. The action parameter is the action to be performed, and it also determines the type of response. Commonly the action will be query or parse. A query can be used to search for entries in a category, while parse is commonly used to return the contents of a page.
The page parameter is the title of the page as it appears on Wikipedia. If you have spent time on Wikipedia, you might have noticed that some page titles are very specific. For instance, Jurassic Park is a novel, a movie, and a series. So the page titled Jurassic Park refers to the overall series, Jurassic Park (novel) refers to the novel, and Jurassic Park (film) refers to the movie. We want the novel, so we will have to specify that. Finally, the format parameter sets the format of the response object, which is typically easiest to work with as JSON. So let’s try a query using our parameters:
PARAMS = {
    'action': "parse",
    'page': "Jurassic Park (Novel)",
    'format': "json",
}
However, this returns an error:
'code': 'missingtitle',
'info': "The page you specified doesn't exist."
This means there is no page with this title. So what happened? If you look up the Wikipedia page for the Jurassic Park novel, you will see it is actually titled Jurassic Park (novel); in other words, ‘novel’ is spelled in lower case. This is important to note because page titles are case-sensitive (apart from the first letter), so Wikipedia can be very particular about its title formatting. So let’s run the query again with the proper formatting:
PARAMS = {
    'action': "parse",
    'page': "Jurassic Park (novel)",
    'format': "json",
}
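For reference, here is one way these parameters might actually be sent. This is only a sketch under a few assumptions: the endpoint is the English Wikipedia one, session is assumed to behave like a requests.Session (anything with a get method whose result has a json method will do), and the names fetch_parsed_page and WikipediaAPIError are made up for this example.

```python
API_URL = "https://en.wikipedia.org/w/api.php"  # English Wikipedia endpoint


class WikipediaAPIError(Exception):
    """Hypothetical exception for responses that carry an 'error' object."""


def fetch_parsed_page(session, title):
    """Request the parsed form of `title` and return the decoded JSON.

    `session` is assumed to behave like a requests.Session: it needs a
    get(url, params=...) method whose result has a .json() method.
    """
    params = {
        "action": "parse",
        "page": title,  # must match the on-wiki title, e.g. "(novel)"
        "format": "json",
    }
    response = session.get(API_URL, params=params)
    data = response.json()
    # Failures such as 'missingtitle' arrive inside the JSON body,
    # so check for an 'error' key before using the result.
    if "error" in data:
        err = data["error"]
        raise WikipediaAPIError(f"{err.get('code')}: {err.get('info')}")
    return data
```

With the requests library installed, calling fetch_parsed_page(requests.Session(), "Jurassic Park (novel)") should return the decoded response, while the misspelled title would raise the error instead of passing silently. Setting a descriptive User-Agent header on the session is also a good idea, since Wikimedia asks API clients to identify themselves.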
Now we get a proper response object, which we can decode using response.json(). To access the page data, we then use the key parse. Finally, we have the page data! This part contains many details about the page, including a field called text, which contains the entire body of the page in HTML (in the default JSON layout, the HTML string sits under the key * inside text). There are many ways to use this: you can parse the HTML into plain text, or simply navigate the HTML to find the appropriate sections.
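As a rough illustration of the plain-text route, the sketch below pulls the HTML out of a decoded response and strips the tags using only Python's standard library. The helper names page_html and html_to_text are invented for this example, and the approach is deliberately crude: it keeps all visible text, including things like reference markers, so a real project would probably want a proper HTML library.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <style> and <script>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def page_html(data):
    """Return the HTML body from a decoded action=parse response."""
    text = data["parse"]["text"]
    # In the default JSON layout the HTML is stored under the "*" key;
    # accept a plain string as well in case a different layout is used.
    return text["*"] if isinstance(text, dict) else text


def html_to_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Given the decoded response from the query above (called data here), html_to_text(page_html(data)) would give one long string of the article's visible text.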