
Zander Bailey


Wikipedia API Part 2: Categories

Another way the Wikipedia API can search is by using categories. Before I explain how to search using categories, it is important to understand what categories are and how they work. Wikipedia pages are organized so that you can find related topics and similar pages, and one of the ways to navigate to related pages is through categories. If you look at the bottom of a page, there will be a box labeled ‘Categories’. This box contains a list of links to the categories that the subject falls into. For example, if you pull up the Wikipedia page for the character Batman, it will have categories like ‘DC Comics superheroes’, ‘Characters created by Bob Kane’, ‘American superheroes’, and quite a few more. If you click one of these, it will take you to a list of all pages that have been classified as belonging to that category. Some categories have hundreds or even thousands of entries. For instance, the category ‘DC Comics superheroes’ contains 768 entries. When viewed in a browser, it displays 200 links per page. This is important to note because navigating categories using the API works in a similar fashion: results come back one page at a time.

First, let’s look at a basic search for a Wikipedia category:


PARAMS = {
    'action': "query",
    'list': 'categorymembers',
    'cmtitle': 'Category:DC Comics superheroes',
    'cmlimit': '10',
    'format': "json",
}

It is important to note that when searching for a category, action should be set to ‘query’ so that it will return the right response type. Setting list to ‘categorymembers’ indicates that we want a list of the members of the category. cmtitle is similar to page on a normal search: it contains the name of the category, but the name must be preceded by ‘Category:’, just as it would appear in a browser. cmlimit is the maximum number of entries to be returned. It is currently set to 10, so any search will return no more than 10 entries per page. Now let’s say we want to get a list of all entries in a category. The maximum number per page is 500, or you can pass ‘max’ to request the highest limit allowed. But the category ‘DC Comics superheroes’ has 768 entries, so how do we get the rest? To answer this, let’s get the first 500 entries and then look at the response object. It is returned as a JSON object, so let’s look at the keys:

['batchcomplete', 'continue', 'limits', 'query']
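To make this concrete, here is a minimal sketch of how the request might be sent and the response decoded. It uses only Python's standard library rather than any particular HTTP package. The endpoint URL (English Wikipedia), the helper names build_params and fetch_json, and the User-Agent string are my own assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: we are querying English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_params(category, limit="10"):
    """Build the query parameters for listing the members of a category."""
    return {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,  # the name must carry the Category: prefix
        "cmlimit": str(limit),              # a number up to 500, or "max"
        "format": "json",
    }


def fetch_json(params, url=API_URL):
    """Send a GET request and decode the JSON body (hypothetical helper)."""
    full_url = url + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; replace it with something that identifies your script.
    request = urllib.request.Request(
        full_url, headers={"User-Agent": "category-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling fetch_json(build_params('DC Comics superheroes', 'max')) and printing list(data.keys()) on the result should show keys like the ones above, though the exact set can vary with the request.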

The field ‘continue’ is what interests us here. If we examine the contents of ‘continue’, we find two fields, one of which, ‘cmcontinue’, holds a long sequence of letters and numbers. This value tells the API where the next batch of entries in the list begins. So how do we use it to continue? We repeat the same request with an added cmcontinue parameter set to the contents of the ‘cmcontinue’ field from ‘continue’. You can do this as many times as needed to traverse the entire category, however large it may be. When a response no longer includes ‘continue’, you have reached the end of the list.
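The continuation steps can be sketched as a loop. This is only one way to write it, under a few assumptions of mine: the endpoint is English Wikipedia, and fetch_json and all_category_members are hypothetical helper names. The fetch argument exists so the loop can be tried against canned responses without touching the network.

```python
import json
import urllib.parse
import urllib.request

# Assumption: we are querying English Wikipedia.
API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send a GET request and decode the JSON body (hypothetical helper)."""
    full_url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent; replace it with something that identifies your script.
    request = urllib.request.Request(
        full_url, headers={"User-Agent": "category-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def all_category_members(category, fetch=fetch_json):
    """Collect every member of a category by following 'cmcontinue'."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": "max",
        "format": "json",
    }
    members = []
    while True:
        data = fetch(params)
        members.extend(data["query"]["categorymembers"])
        if "continue" not in data:
            break  # no 'continue' key means this was the last batch
        # Ask for the next batch, starting where this one stopped.
        params["cmcontinue"] = data["continue"]["cmcontinue"]
    return members
```

For a category with 768 entries and a limit of 500 per request, a loop like this would be expected to make two requests.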

Navigating categories can be useful for collecting data, because you can get a list of pages relating to a certain topic. Each item in the list returned also includes a ‘pageid’, which is useful because the titles given in the list are not always the exact titles of the pages, and you can use the ‘pageid’ to perform a more exact search.
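As a rough illustration of using ‘pageid’, the sketch below builds follow-up queries that look pages up by ID through the pageids parameter instead of by title. The helper name pageid_queries and the choice of prop are my own assumptions. The batch size of 50 reflects the usual per-request cap on IDs for ordinary clients, so treat it as a default to adjust.

```python
def pageid_queries(members, prop="info", batch_size=50):
    """Yield query parameters that look pages up by ID, one batch at a time.

    `members` is a list of dicts shaped like the entries under
    query -> categorymembers, each carrying a 'pageid'.
    """
    ids = [str(member["pageid"]) for member in members]
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        yield {
            "action": "query",
            "pageids": "|".join(batch),  # several IDs are separated by '|'
            "prop": prop,                # hypothetical choice; pick the data you need
            "format": "json",
        }
```

Each dictionary yielded here can be sent the same way as PARAMS above.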
