DEV Community: José Tobias

Processing the Wikipedia dump

José Tobias — Tue, 08 Mar 2022 03:59:28 +0000

Okay, we've already understood the Wikipedia dump format and that's great! But judging by how much information we have inside of it, how can we process and index it in a more manageable way than a single XML file? This is exactly what we're going to do here: processing the bz2 archive. Yeah, the archive itself - more on it soon. So, for me, there are usually 3 steps in this whole "processing" phase:

reading the data efficiently
formatting the data as needed
saving the data efficiently

The "efficiently" thing on steps 1 and 3 is there because we have to be pragmatic in those phases and really think them through. When reading we need to be cautious with our time, and when saving we need to be cautious with our space. Step 2, formatting, is a bit of an odd one for me... sometimes there are requirements that we need to meet, so we don't have much control over it - but when reading and saving, we do.

All the code in here is Scala 3, but it's fairly easy to translate it to Java, almost 1 to 1: I barely use functional operations in the code, and the pattern matching can easily be replaced by if-else blocks (or a plain switch, since the code only matches on strings), while recent Java versions have pattern matching of their own.

Let's go over those steps to begin with.

Reading

So, I've said we're not supposed to "unzip" the bz2 archive before processing it, and there are multiple reasons for it, but the main ones are space and time - oh, the "efficiently" thing. You might remember the Wikipedia dump link in the first post, and you've probably seen that there are two types of bz2 archives for pages: the first one is "enwiki-20220220-pages-articles-multistream.xml.bz2" and the second one is "enwiki-20220220-pages-articles.xml.bz2". So, what is this "multistream" about? You can find a pretty shallow explanation here on this Wikipedia page (who the hell is going to use dd to manipulate a 20GB file that should be processed? Come on man...), but I'll try to explain it a bit better.

A "multistream" bz2 file can be thought of as the concatenation of multiple bz2 files, built so that anything inside the archive is easy to reach once its stream position is known. Explaining it a bit better: every file, per se, becomes a stream, and those streams can be told apart by counting the size of each one in bytes - after compression. Therefore, on the Wikipedia dump page, right under the enwiki-20220220-pages-articles-multistream.xml.bz2 file, which is the dump archive itself, we have the enwiki-20220220-pages-articles-multistream-index.txt.bz2 index file. Each of its lines has the form offset:page id:page title, where the offset is the byte position at which the stream holding that page starts, so the size of a stream is just the distance between two consecutive distinct offsets.

For example, imagine that we have 3 files made into a bz2 archive, and we know the first stream has 100 bytes, the second stream has 300 bytes and the last stream has 180 bytes. If we want to read something from this archive without extracting it, and we know it is inside the 3rd file, we can jump 100+300 = 400 bytes into the archive and read only the third stream, which for sure contains the needed data. I KNOW, I KNOW, YEAH YEAH, I KNOW there are some bytes that come before the data itself forming a header, but we're trying to make sense of those things here, okay? We're just trying to make sense of things, but if you want to go that deep into it, GO.

In the case of Wikipedia, each stream contains 100 pages, no matter how many bytes (the last one may hold fewer), and the first stream of the file is dedicated to the siteinfo part of the dump alone - remember our siteinfo object.
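
To make the jumping part concrete, here is a minimal sketch using only the JDK. The names streamOffsets, readStreamBytes, IndexEntry and parseIndexLine are made up for this example, and the index line layout "offset:pageId:title" is stated as an assumption about the index file, so check it against your own copy before relying on it.

```scala
import java.io.{ByteArrayInputStream, InputStream, RandomAccessFile}
import java.nio.file.Path

// Start offset of every stream, given the compressed size of each one.
// For the sizes 100, 300 and 180 this yields 0, 100 and 400.
def streamOffsets(sizes: Seq[Long]): Seq[Long] =
  sizes.scanLeft(0L)(_ + _).init

// Read only the bytes of one stream, without touching the rest of the archive.
def readStreamBytes(archive: Path, offset: Long, length: Int): InputStream =
  val file = RandomAccessFile(archive.toFile, "r")
  try
    val buffer = new Array[Byte](length)
    file.seek(offset)
    file.readFully(buffer)
    ByteArrayInputStream(buffer)
  finally file.close()

// One line of the index file, ASSUMED to look like "offset:pageId:title".
// A title may itself contain ':' so the split is limited to three parts.
final case class IndexEntry(offset: Long, pageId: Long, title: String)

def parseIndexLine(line: String): Option[IndexEntry] =
  line.split(":", 3) match
    case Array(offset, pageId, title) =>
      for
        o <- offset.toLongOption
        p <- pageId.toLongOption
      yield IndexEntry(o, p, title)
    case _ => None
```

The InputStream returned by readStreamBytes holds exactly one compressed stream, which is the kind of input a bz2 decompressor can consume on its own.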

The image below should shed some light on how those Wikipedia dumps are built, and why the bz2 archive and the index file work the way they do - this is just how I see it:

Hopefully the archive structure is clear now. Let's begin to code it.

Reading it

If you're inside this Java world, I believe you know the Apache Software Foundation and their amazing projects. One of those amazing projects is Apache Commons Compress, which we're going to use to easily access and process the data inside of this bz2 archive - yeah, inside of it, no decompression beforehand is needed, which is amazing.

The class BZip2CompressorInputStream is where all the magic happens; its constructor takes an InputStream. So, all we have to do is something like the method below:



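  import java.io.InputStream
  import scala.collection.mutable.ListBuffer
  import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
  import org.slf4j.LoggerFactory // assuming SLF4J is the logging API in use

  object StreamLoader:
    val log = LoggerFactory.getLogger(this.getClass)

    // Decompresses one bz2 stream fully into memory; returns null on failure.
    def getStreamBuffer(inputStream: InputStream): ListBuffer[Byte] =
      // Only one thread decompresses at a time; the others wait here.
      synchronized {
        try ListBuffer.from(BZip2CompressorInputStream(inputStream).readAllBytes)
        catch
          case e: Exception =>
            log.debug(e.getMessage)
            null
      }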

The synchronized block there might be an indication of what we're going to do, right? Yes, we're going to call this single method from multiple threads! That's why we create it inside an object and not a class, so it's static and the concurrency model takes place - one thread at a time gets to read, while all the others wait to get a piece of data to process. If you're big into computer science theory like me, you've probably seen this before in a computer architecture or distributed programming class... yes, this goes into Flynn's Taxonomy as Single Instruction stream, Multiple Data streams [SIMD].
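
The "same operation over many blocks of data" idea can be sketched with a plain JDK thread pool. This is not the code from the repository: processAll and its process parameter are hypothetical names, and the blocks here are just byte arrays standing in for streams.

```scala
import java.util.concurrent.{Callable, Executors}

// Apply the same operation to every block of data using a fixed pool of threads.
// `process` stands for whatever is done with one stream (hypothetical).
def processAll[A](blocks: Seq[Array[Byte]], threads: Int)(process: Array[Byte] => A): Seq[A] =
  val pool = Executors.newFixedThreadPool(threads)
  try
    // Submit one task per block; the pool decides which thread runs it.
    val futures = blocks.map { block =>
      pool.submit(new Callable[A] { def call(): A = process(block) })
    }
    // Results come back in the order of the input blocks.
    futures.map(_.get())
  finally pool.shutdown()
```

Calling processAll(blocks, 4)(block => block.length) would compute the size of every block on up to 4 threads. In a real run the function passed in would decompress and parse one stream.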

Making it easier: we have a single operation to apply over data that might be divided beforehand. The diagram below is how I usually see this kind of operation:

We can abstract those processors as threads and the block of data as a stream, and voilà. People will surely say that I'm stretching the concept of processors and instructions here, and YES I AM. But this hasn't failed me yet - especially when using a VM language and not coding on bare metal. But be amazed: even on bare metal with C, MPI and OpenMP, on a 4-machine cluster, it hasn't failed me yet!

Something I haven't told you, but you should have seen by now in the XML dump, is that it's actually formatted like this:



  <mediawiki>
    <siteinfo>...</siteinfo>
    <page>...</page>
    <page>...</page>
    ...
    <page>...</page>
  </mediawiki>

So, as I've said: the first stream will give you the siteinfo, BUT it comes with the opening <mediawiki> tag and no closing one, like this:



  <mediawiki>
  <siteinfo> ... </siteinfo>

And so the following streams are going to give you 100 pages at a time, BUT the last one is going to end with a </mediawiki> tag like this:



  <page>...</page>
  <page>...</page>
  ...
  <page>...</page>
  </mediawiki>

And this thing right there will give you problems when processing it with any kind of XML Stream Reader (we're going to talk about it in just a minute, hold on). Also, any XML Stream Reader will demand that all those tags are "embraced" by a single root element, so they can't come loose like this; they must be children of one enclosing document tag. Therefore, what you'll need is a method that puts this stream of pages INSIDE another document, such as:



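  // Wraps one decompressed stream in <document>...</document> so it has a single root.
  def getDocumentStream(inputStream: InputStream): InputStream =
    val content = getStreamBuffer(inputStream)

    // getStreamBuffer returns null when the stream could not be read
    if content == null then return null

    BufferedInputStream(
      ByteArrayInputStream(
        (content
          .prependAll("<document>".getBytes)
          .appendAll("</document>".getBytes))
          .toArray
      )
    )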

NOW you have an entity that can be processed by any XML Stream Reader you get from a provider, because your streams (on a good day, not the last one as we've seen) will come as:



  <document>
    <page>...</page>
    <page>...</page>
    ...
    <page>...</page>
  </document>

So now we have a starting tag <document> and an ending tag </document>, as we should have in a well-formed XML document.

So, those are all the ins and outs I can think of when reading those streams in a proper way so they can be processed. The processing is rather interesting too, let's jump right into it.
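
As a variation on the wrapping idea, the same result can be had without copying the bytes by chaining streams with java.io.SequenceInputStream. This is a different technique from the ListBuffer approach above, shown only as a sketch; wrapInDocument and countPages are names I made up for it.

```scala
import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
import java.nio.charset.StandardCharsets.UTF_8
import javax.xml.stream.XMLInputFactory

// Wrap a decompressed fragment in a synthetic root element without copying it.
def wrapInDocument(fragment: InputStream): InputStream =
  val open = ByteArrayInputStream("<document>".getBytes(UTF_8))
  val close = ByteArrayInputStream("</document>".getBytes(UTF_8))
  SequenceInputStream(SequenceInputStream(open, fragment), close)

// Count the <page> start tags seen by a StAX event reader.
def countPages(document: InputStream): Int =
  val reader = XMLInputFactory.newInstance().createXMLEventReader(document)
  var pages = 0
  while reader.hasNext do
    val event = reader.nextEvent()
    if event.isStartElement && event.asStartElement.getName.getLocalPart == "page" then
      pages += 1
  reader.close()
  pages
```

A fragment made of sibling <page> elements becomes a single well-formed document once wrapped, so countPages(wrapInDocument(fragment)) can walk it to the end.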

Writing it

Well, as I've said before: I'm writing it to a PostgreSQL database, but you can write it to literally anything you want as long as you have the drivers or the means to do so. Let's remember how the data is structured inside the database:

In the code I've used this idea of a SinkSender, which is just something to "flush away the data" from the program into wherever it should go. To do so, in Scala, I created a trait, which is comparable to an interface in Java, as displayed below:



  trait SinkSender extends Closeable:
    def sendSiteinfo(siteinfo: HashMap[String, String]): Unit
    def sendNamespace(namespace: HashMap[String, String]): Unit
    def sendPage(page: HashMap[String, String]): Unit
    def sendRevision(revision: HashMap[String, String]): Unit
    def sendContributor(contributor: HashMap[String, String]): Unit
    def dispatch: Unit
    def close: Unit

And as we can see, I have a send method for each entity that will go into the database, but nothing actually reaches the database until the dispatch method is called. Since we're processing 100 pages at a time with each stream, I've found it very practical to use batch operations for those 100 pages: the send methods only call addBatch on a statement of the connection, and when dispatch is called, executeBatch is called for each of them - in a very specific order, as you might remember from the usage of foreign key constraints. This controlled point of synchronization also allows a document to be built and sent into an Apache Kafka topic, for example - but I'll not go deeper here since it might be confusing enough already.
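
To show the "buffer on send, flush on dispatch" idea without a database, here is a toy in-memory sink. It is not the repository's implementation: BufferingSink and its flush callback are hypothetical, and the order page -> contributor -> revision is an assumption made for the sketch (parents before the rows that reference them). In a JDBC version, each send would map to addBatch and dispatch to executeBatch.

```scala
import scala.collection.mutable.{HashMap, ListBuffer}

// A toy sink: send* only buffers rows, dispatch flushes them in a fixed order.
// The order page -> contributor -> revision is an ASSUMPTION of this sketch.
class BufferingSink(flush: (String, HashMap[String, String]) => Unit):
  private val pages = ListBuffer[HashMap[String, String]]()
  private val contributors = ListBuffer[HashMap[String, String]]()
  private val revisions = ListBuffer[HashMap[String, String]]()

  def sendPage(page: HashMap[String, String]): Unit = pages += page
  def sendContributor(contributor: HashMap[String, String]): Unit = contributors += contributor
  def sendRevision(revision: HashMap[String, String]): Unit = revisions += revision

  // One synchronization point per stream: parents first, then their children.
  def dispatch(): Unit =
    pages.foreach(flush("page", _))
    contributors.foreach(flush("contributor", _))
    revisions.foreach(flush("revision", _))
    pages.clear()
    contributors.clear()
    revisions.clear()
```

The point of the design is that the caller can send entities in whatever order the XML happens to produce them, and the ordering problem is solved in one place.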

But for now, all you have to understand is this SinkSender structure, and build your own from there. Check the repository and see how I've done it all and also a bit more to build the whole database structure I need from the code itself.

Processing it

Now, I'll show you an easy way to process it without going too much into the details. Here, I'll probably not show you all the code, but will show you how to do the code. If you want to see all the code, you can see it directly on the repository.

When processing the data, I chose to use Java's built-in XML Stream Reader. I don't have time to go deeper into the XML Stream Reader thing here, but here is a good enough tutorial about it.

Obviously, when processing a 20GB+ bzip2 archive, you'll not try to load it all into memory, right? RIGHT? Okay. So the best choice seems to be streamed operations, like we've already done when reading the file, and then processing the data step by step so we can extract the information we need without searching for it - because the whole document has a well-defined schema. This is exactly what the XML Stream Reader gives us: a way to process the stream step by step, going from tag to tag, until we reach the end of each block and get ready for the next one.
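
Going "from tag to tag" looks roughly like the sketch below, which uses the JDK's javax.xml.stream API on a small in-memory string. readSiteinfo is a name made up for this example and the set of wanted fields is my own choice; the real processor works on the wrapped bz2 streams instead.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets.UTF_8
import javax.xml.stream.XMLInputFactory
import scala.collection.mutable.HashMap

// Collect the simple text fields of <siteinfo> while walking the events one by one.
def readSiteinfo(xml: String): HashMap[String, String] =
  val reader = XMLInputFactory.newInstance()
    .createXMLEventReader(ByteArrayInputStream(xml.getBytes(UTF_8)))
  val fields = HashMap[String, String]()
  val wanted = Set("sitename", "dbname", "base", "generator", "case")

  while reader.hasNext do
    val event = reader.nextEvent()
    if event.isStartElement then
      val name = event.asStartElement.getName.getLocalPart
      // getElementText consumes everything up to the matching end tag
      if wanted(name) then fields.put(name, reader.getElementText)
  reader.close()
  fields
```

Nothing is searched for here: the reader only moves forward, and each start tag decides what happens next.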

I've chosen to do so using handle methods to handle each kind of objects I have on this XML dump. I don't think it's very practical to display whole chunks of code here, since it might become extensive and tiring to read, so I'll display diagrams and explain them while trying to make sense of the code I've already written and that should guide you in the path of writing your own or just using mine.

Below is a diagram of how the XML Stream Reader and the handle thing should work for the siteinfo object, exactly as it is:

I know it might seem confusing, but this is just a way to imagine the function expressed as a flow chart, and most of the time it helps me get my head around some details. Let's follow the flow chart step by step, with some code snippets:

create a new XML event reader, those are available globally inside this class



  class XMLStreamHeaderProcessor(
      sinkSender: SinkSender,
      xmlif: XMLInputFactory,
      inputStream: InputStream
  ):
    val log = LoggerFactory.getLogger(this.getClass)
    val xmler = xmlif.createXMLEventReader(inputStream)
    ...

call the "nextEvent" method until you reach a desired tag, in this case, the siteinfo XML tag



  def run: String =
      var dbname: String = null

      while xmler.hasNext do
        val xmlne = xmler.nextEvent

        if xmlne.isStartElement then
          val xmlse = xmlne.asStartElement

          xmlse.getName.getLocalPart match
            case "siteinfo" =>
              ...

call a handle function for this tag to extract the desired information, returning information to the caller function so you can mix needed data like connecting foreign keys



  val siteinfo = handleSiteinfo

once you're done collecting information and mixing them with any needed information from the handle methods and their returned info, send it



  def handleSiteinfo: HashMap[String, String] =
    val siteinfo = HashMap[String, String]()
    var namespaces = ListBuffer[HashMap[String, String]]()

    while xmler.hasNext do
      val xmlne = xmler.nextEvent

      if xmlne.isStartElement then
        ...
      else if xmlne.isEndElement then
        val xmlee = xmlne.asEndElement

        xmlee.getName.getLocalPart match
          case "siteinfo" =>
            sinkSender.sendSiteinfo(siteinfo)

            namespaces
              .tapEach(ns => ns.put("dbname", siteinfo.get("dbname").orNull))
              .foreach(sinkSender.sendNamespace(_))

            return siteinfo
          case _ =>
            ...

There are a lot of optimizations in the code. For example, no dispatch is called when sending the siteinfo or the namespaces: all the pages inserted later need both of them to already be in the database so the foreign keys connect properly, so it's much more practical to run those operations instantly. In those cases, the send calls are actually SEND calls, no batch needed.

Another example of those optimizations are those early returns that should instantly cut the function and return to the caller. Might not be much, but since we're dealing with ~20M pages, every instruction counts - and it would honestly be hard to not do it.

All the careful treatment of null values is needed to extract valid information and also the invalid ones. When dealing with such data, usually coming from websites or raw user input (maybe not here since this data should be carefully managed by the Wikimedia friends), I've learned that we should never trust data and even with some errors we should let the code go ahead so we have something to analyze later - and beyond all of it, sometimes null is a valid entry.

Once again, if you're reading this just to get a glimpse of the idea and to write your own processor, go ahead! But in case you just want to use something that someone has already started, take a look at my GitHub repository containing this code.

Understanding the Wikipedia dump

José Tobias — Fri, 04 Mar 2022 23:14:50 +0000

As a part of my work on SearchOnMath, I'm always trying to find better ways to retrieve and process data, making sure it's in good shape for our powerful mathematical search engine. Wikipedia has always been a problem in this workflow, since the pages are written in a markup language called Wikitext, which is not easy to understand or apply operations to.

Here I will briefly describe how the data is structured inside the dump - which is a humongous XML file - and try to make sense of such structure while not going full cliché and only displaying the XML schema. In the end I'll give an example of how I visualize the document as a relational model, and in the following posts I'll describe how I wrote a processor for this XML to fill the given relational model - parallelizing bz2 streams, hopefully. Let's go and understand this data first.

First steps

To store the clean data we need, we first have to somehow ingest it from the source, and the source is the Wikimedia database backup dump. Those backup dumps come in the form of big XML files compressed into big bz2 multistream archives. The provided XML schema is a good way of understanding the whole specification, but sometimes looking at the data is very important to understand what we have and what we need from this immense amount of information. To make things easier, I'll try to refer to such XML documents in an object-oriented manner, and I swear it'll be much easier than referring to all the different types of "complex elements" that might be inside of this XML document.

I would say this XML file structure is made by only two main object types: the siteinfo object and the page object, each one having multiple associated objects, fields and attributes which we will discover one by one - and focus on the most important ones.

The `siteinfo` object

This structure's main part comprises information about the source of the dump, such as the link to the main page of the wiki and the sitename. Below is a real example:



  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.38.0-wmf.22</generator>
    <case>first-letter</case>
    <namespaces>
      ...
    </namespaces>
  </siteinfo>

We see that the siteinfo object is composed of:

dbname: an indicator of this Wikipedia instance, it might work as an id
sitename: the name of the Wikipedia, well formatted, that can be displayed
base: the link to the base page of this Wikipedia, the main one
generator: information about the MediaWiki software used by this instance of the Wikipedia when such dump was generated
case: how letter case is handled in page titles; 'first-letter' means only the first letter of a title is case-insensitive (it is always capitalized)

We can see that the first field is the sitename, which refers to the name of the site where this dump came from - I'll not dive deeper to explain the whole usage of the MediaWiki software to build what is now the Wikipedia website here, so I'll let you discover it all if you want to. Then the dbname comes, and I like to think of this dbname as a discriminator to be used inside of the Wikipedia dumps. For example, we might have "enwiki" for the English variation, "dewiki" for the German variation, and so on. The base field indicates the base page to facilitate access, and the case is a formatting thing that should not be useful for us right now.

The namespaces object is a child of the siteinfo, and it allows us to search for information in the right place. When looking at the explanation for each namespace, we might understand that if we only need to analyse articles, we might only retrieve from this dump the pages associated with namespace 0, but if we want to analyse what users are discussing about changes in a certain page, we might look for pages associated with namespace 1. The namespaces object is formatted as:



  <namespaces>
    <namespace key="-2" case="first-letter">Media</namespace>
    <namespace key="-1" case="first-letter">Special</namespace>
    <namespace key="0" case="first-letter" />
    <namespace key="1" case="first-letter">Talk</namespace>
    <namespace key="2" case="first-letter">User</namespace>
    ...
    <namespace key="2300" case="first-letter">Gadget</namespace>
    <namespace key="2301" case="first-letter">Gadget talk</namespace>
    <namespace key="2302" case="case-sensitive">Gadget definition</namespace>
    <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
  </namespaces>

Each namespace has a few important attributes:

key: is a namespace identifier used to make the search and link of the namespaces easier
case: how letter case is handled for page titles in this namespace ('first-letter' or 'case-sensitive')

Exactly as we'd expect from the namespaces explanation link we've seen before, right? The content of each namespace element is just a displayable, friendly name.

And this is all we have on this siteinfo element that should be important for us to understand. This is a very important piece of information to have, and we'll see it soon. Let's check the page object's structure.
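
Reading those namespace elements into a lookup table can be sketched with the JDK's StAX API as below. readNamespaces is a made-up name, and the sketch assumes every namespace element carries a numeric key attribute, as in the example above.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets.UTF_8
import javax.xml.namespace.QName
import javax.xml.stream.XMLInputFactory

// Map each namespace key to its display name.
// The main namespace (key 0) is an empty element, so its name is "".
def readNamespaces(xml: String): Map[Int, String] =
  val reader = XMLInputFactory.newInstance()
    .createXMLEventReader(ByteArrayInputStream(xml.getBytes(UTF_8)))
  var result = Map.empty[Int, String]
  while reader.hasNext do
    val event = reader.nextEvent()
    if event.isStartElement && event.asStartElement.getName.getLocalPart == "namespace" then
      // Assumes the key attribute is present and numeric.
      val key = event.asStartElement.getAttributeByName(QName("key")).getValue.toInt
      result = result.updated(key, reader.getElementText)
  reader.close()
  result
```

With such a map at hand, keeping only articles is a matter of checking that a page's ns value is 0.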

The `page` object

Now things get a bit more exciting. This page object is where all the information of a single page is stored inside this big XML file, and since we have millions of pages inside the Wikipedia, we have millions of such objects inside the XML dump. Each page object is given as:



  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      ...
      <contributor>
        ...
      </contributor>
    </revision>
  </page>

We see that the page object is composed of:

id: a unique identifier for this page inside of this Wikipedia instance
title: the title of the page, well formatted, that might be displayed
ns: this is the namespace of the page
redirect: when this page is just a redirect page, this redirect field appears with the title attribute containing the name of the page it should redirect to

The redirect thing is pretty common inside Wikipedia; hundreds of thousands of pages are only redirect pages. For example, try to access the AccessibleComputing page and see how it redirects you to the Computer accessibility page. Also, the text of such a page is only a directive indicating it is a redirect page and nothing else - more on that soon.

Then, for the child objects that compose a page, we have the revision object which is very important. The revision fields are as follows:



  <revision>
    <id>1002250816</id>
    <parentid>854851586</parentid>
    <timestamp>2021-01-23T15:15:01Z</timestamp>
    <contributor>
      ...
    </contributor>
    <comment>shel</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text bytes="111" xml:space="preserve">#REDIRECT [[Computer accessibility]]

{{rcat shell|
{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}
}}</text>
    <sha1>kmysdltgexdwkv2xsml3j44jb56dxvn</sha1>
  </revision>

So the revision object is composed of:

id: a unique identifier of this revision inside of this Wikipedia instance
parentid: the id of this revision's parent, that is, the id of the revision that comes before this one for this same page
timestamp: the time at which this revision was generated
comment: some comment added by the contributor who sent this revision
model: the model in which this text is formatted
format: seems like something that could be used in a Content-Type header internally
text: the text of this revision, our gold mine
sha1: hash of the text generated by the SHA-1 algorithm
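
One detail on the sha1 field: the value in the example has 31 characters and uses letters beyond f, so it is not the usual 40-character hex digest. Assuming it is the SHA-1 digest written in base 36 and left-padded with zeros to 31 characters (my reading of the format, worth checking against a few revisions of your own dump), it could be recomputed with a sketch like this one, where base36Sha1 is a hypothetical helper name:

```scala
import java.math.BigInteger
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest

// SHA-1 of the UTF-8 text, rendered in base 36 and left-padded to 31 characters.
// The base-36 encoding is an ASSUMPTION about how the dump stores the hash.
def base36Sha1(text: String): String =
  val digest = MessageDigest.getInstance("SHA-1").digest(text.getBytes(UTF_8))
  val base36 = BigInteger(1, digest).toString(36)
  "0" * (31 - base36.length) + base36
```

Comparing the result with the sha1 element is a cheap way to notice that a text was mangled somewhere between reading and saving.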

See what I've said about the redirect page's text? Nothing else but a few lines always starting with "#REDIRECT", followed by some strange template things or whatever, and that's it - from a processing point of view, completely disposable.
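
Since those texts are disposable, it helps to spot them early. Below is a minimal sketch that detects a redirect from the revision text alone; redirectTarget and the regular expression are my own and only cover the plain "#REDIRECT [[Target]]" shape shown above.

```scala
// Matches "#REDIRECT [[Target]]", ignoring case and leading whitespace.
// Anything after '|' or '#' inside the link is not part of the target title.
val RedirectPattern = """(?i)^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)""".r

// Some(target title) when the text is a redirect directive, None otherwise.
def redirectTarget(text: String): Option[String] =
  RedirectPattern.findFirstMatchIn(text).map(_.group(1).trim)
```

When the page object already carries a redirect element, that element is the simpler signal; the text check is only a fallback.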

At this point we should know that Wikipedia is built by contributors, and all the changes made by a contributor to a certain page, or the very first addition of text to one of them, gives life to a new so-called revision.

The contributor child object, which is the object that identifies the user who made this change to the page with the given revision, is given as below:



  <contributor>
    <username>Elli</username>
    <id>20842734</id>
  </contributor>

And the explanation of the contributor fields is:

id: a unique identifier of this contributor inside of this Wikipedia instance
username: a displayable/more user friendly username

And that's it. Just like this, it's the whole document specification we need to know and even some information that in fact we don't need to know to process this data.

A relational model example

When I see all those objects with fields related to each other, I like to imagine it as relational data and build a relational structure in my head. So I go around and think... "if this page has this id, and this revision is related to this page with this other id, and this contributor wrote this revision with this other id... how can I quickly navigate such structure by connecting those ids?" - and all of a sudden you have your whole relational model done.

Below is how my relational structure ended up:

You might be finding it very strange that every single table has a composite primary key... well, let me explain. All those ids are only unique for each available Wikipedia around there, so the "enwiki" has its pool of ids, and then the "dewiki" also has its own pool of ids. So, to be able to use this single database and have all the data available for a multi-language type of search engine - for example - I might use the siteinfo dbname field as a discriminator - which works perfectly.
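
The composite key idea can be shown in a few lines of plain Scala. WikiKey, Page and indexPages are illustrative names only, not the schema from the diagram, and the dewiki page in the usage note is invented.

```scala
// A key that is unique across several Wikipedias: the dump's dbname plus the local id.
final case class WikiKey(dbname: String, id: Long)

final case class Page(key: WikiKey, title: String, ns: Int)

// Index pages by their composite key, mirroring a composite primary key in a table.
def indexPages(pages: Seq[Page]): Map[WikiKey, Page] =
  pages.map(p => p.key -> p).toMap
```

Two pages with id 10, one from "enwiki" and one from "dewiki", end up as two separate entries, which is exactly what the dbname discriminator buys us.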

It's minimal, it doesn't have extra tables to connect one-to-many relationships in the case of contributors because I don't need it to, and it actually solves all of my problems when processing this data.

Processing the XML dump

I'm going to start my next post from this relational structure diagram, explaining how you can build a fairly fast (parallel per-stream, at least on paper) processor for this XML dump (in Scala, or Java - preferably), and the ins and outs of reading an XML multistream file using Apache Commons Compress.