<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: José Tobias</title>
    <description>The latest articles on DEV Community by José Tobias (@tobiasjc).</description>
    <link>https://dev.to/tobiasjc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F822371%2Fff29db70-5acd-4fdd-805d-994316073af4.jpeg</url>
      <title>DEV Community: José Tobias</title>
      <link>https://dev.to/tobiasjc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tobiasjc"/>
    <language>en</language>
    <item>
      <title>Processing the Wikipedia dump</title>
      <dc:creator>José Tobias</dc:creator>
      <pubDate>Tue, 08 Mar 2022 03:59:28 +0000</pubDate>
      <link>https://dev.to/tobiasjc/processing-the-wikipedia-dump-pf8</link>
      <guid>https://dev.to/tobiasjc/processing-the-wikipedia-dump-pf8</guid>
      <description>&lt;p&gt;Okay, we've already understood the &lt;a href="https://dev.to/tobiasjc/understanding-the-wikipedia-dump-11f1"&gt;Wikipedia dump format&lt;/a&gt; and that's great! But judging by how much information it holds, how can we process and index it in a more manageable way than a single XML file? This is exactly what we're going to do here: processing the bz2 archive. Yeah, the archive itself - more on it soon. So, for me, there are usually 3 steps in this whole "processing" phase:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;reading the data efficiently&lt;/li&gt;
&lt;li&gt;formatting the data as needed&lt;/li&gt;
&lt;li&gt;saving the data efficiently&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The "efficiently" on steps 1 and 3 is there because those are the phases where we have to be pragmatic and really think things through. When reading we need to be cautious with our time, and when saving we need to be cautious with our space. Step 2, formatting, is a bit of an odd one for me... sometimes there are requirements that we need to match, so we don't have much control over it - but when reading and saving, we do.&lt;/p&gt;

&lt;p&gt;All the code in here will be provided as &lt;a href="https://docs.scala-lang.org/scala3/new-in-scala3.html" rel="noopener noreferrer"&gt;Scala3 code&lt;/a&gt;, but it's fairly easy to translate it all to Java, almost 1 to 1 - since I barely use functional operations in the code, and pattern matching can easily be replaced by &lt;a href="https://docs.oracle.com/javase/tutorial/java/nutsandbolts/if.html" rel="noopener noreferrer"&gt;if-else blocks in Java 11 or less&lt;/a&gt;, and it's literally there &lt;a href="https://docs.oracle.com/en/java/javase/17/language/pattern-matching.html#GUID-A59EF0C7-4CB7-4555-986D-0FD804555C25__GUID-12EDE418-B728-49E4-A579-92AFE560253B" rel="noopener noreferrer"&gt;on Java 17 and up&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's go over those steps to begin with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading
&lt;/h2&gt;

&lt;p&gt;So, I've said we're not supposed to "unzip" the bz2 archive before processing it, and there are multiple reasons for it, but the main ones are space and time - oh, the "efficiently" thing. You might remember the &lt;a href="https://dumps.wikimedia.org/enwiki/20220220/" rel="noopener noreferrer"&gt;Wikipedia dump link&lt;/a&gt; in the first post, and you'll probably see that there are two types of bz2 archives for pages: the first one is the "enwiki-20220220-pages-articles-multistream.xml.bz2" and the second one is the "enwiki-20220220-pages-articles.xml.bz2". So, what is this "multistream" about? You can find a pretty &lt;a href="https://en.wikipedia.org/wiki/Wikipedia:Database_download#Should_I_get_multistream?" rel="noopener noreferrer"&gt;shallow explanation here on this Wikipedia page&lt;/a&gt; (who the hell is going to use dd to manipulate a 20GB file that should be processed? Come on man...), but I'll try to explain it a bit better.&lt;/p&gt;

&lt;p&gt;A "multistream" bz2 file can be thought of as the concatenation of multiple compressed files, built so that it's easy to find anything inside the archive once its stream position is known. Explaining it a bit better: every file, per se, becomes a stream, and those streams can be told apart by counting the size of each one in bytes - after being compressed. Therefore, on the Wikipedia dump page, right under the &lt;code&gt;enwiki-20220220-pages-articles-multistream.xml.bz2&lt;/code&gt; file, which is the dump archive itself, we have an &lt;code&gt;enwiki-20220220-pages-articles-multistream-index.txt.bz2&lt;/code&gt; index file that tells us, for each page, the byte offset at which its stream starts inside that file. For example, imagine that we have 3 files made into a bz2 archive, and we know the first stream has &lt;code&gt;100 bytes&lt;/code&gt;, the second stream has &lt;code&gt;300 bytes&lt;/code&gt; and the last stream has &lt;code&gt;180 bytes&lt;/code&gt;. If we want to read something from this archive without extracting it, and we know it is inside the 3rd file, we can go straight to the third stream by jumping &lt;code&gt;100+300 = 400&lt;/code&gt; bytes into the file and decompress only that stream, which is guaranteed to hold the needed data. I KNOW, I KNOW, YEAH YEAH, I KNOW there are some bytes that come into the compression before the data itself forming a header, but we're trying to make sense of those things here, okay? We're just trying to make sense of things, &lt;a href="https://sourceware.org/bzip2/manual/manual.html#memory-management" rel="noopener noreferrer"&gt;but if you want to go that deep into it, GO&lt;/a&gt;.
In the case of Wikipedia, each stream contains 100 pages, no matter how many bytes, and the first stream of the file is dedicated to the &lt;code&gt;siteinfo&lt;/code&gt; part of the dump alone - remembering &lt;a href="https://dev.to/tobiasjc/understanding-the-wikipedia-dump-11f1#the-raw-siteinfo-endraw-object"&gt;our &lt;code&gt;siteinfo&lt;/code&gt; object&lt;/a&gt;.&lt;/p&gt;
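
&lt;p&gt;As a minimal sketch of that arithmetic (the sizes are the made-up numbers from the example above, and &lt;code&gt;streamOffsets&lt;/code&gt; is a hypothetical helper, not something from the repository), the starting offset of each stream is just the running sum of the sizes that come before it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

  // Hypothetical helper: turns compressed stream sizes into starting offsets.
  def streamOffsets(sizes: Seq[Long]): Seq[Long] =
    // scanLeft yields 0, s1, s1+s2, ...; the last value is the total size, so drop it
    sizes.scanLeft(0L)(_ + _).init

  @main def offsetsDemo(): Unit =
    val sizes = Seq(100L, 300L, 180L)
    // the third stream starts 100 + 300 = 400 bytes into the archive
    println(streamOffsets(sizes)) // List(0, 100, 400)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the real dump you wouldn't need to compute this yourself, since the index file already lists the offsets; the sketch is only there to make the "jump 400 bytes" step concrete.&lt;/p&gt;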

&lt;p&gt;The image below should shed some light on how those Wikipedia dumps are built, and why the bz2 archive and the index file work the way they do - this is just how I see it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fpsgicmlib229fm0f1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fpsgicmlib229fm0f1w.png" alt="Wikipedia streams division"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hopefully the archive structure is clear now. Let's begin to code it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reading it
&lt;/h3&gt;

&lt;p&gt;If you're inside this Java World, I believe you know the &lt;a href="https://apache.org/" rel="noopener noreferrer"&gt;Apache Software Foundation&lt;/a&gt; and their amazing projects. One of those amazing projects is &lt;a href="https://commons.apache.org/proper/commons-compress/" rel="noopener noreferrer"&gt;Apache Commons Compress&lt;/a&gt;, which we're going to use so we can easily access and process data inside of this bz2 archive - yeah, inside of it, no decompression beforehand is needed, which is amazing.&lt;/p&gt;

&lt;p&gt;The class &lt;a href="https://commons.apache.org/proper/commons-compress/apidocs/org/apache/commons/compress/compressors/bzip2/BZip2CompressorInputStream.html" rel="noopener noreferrer"&gt;&lt;code&gt;BZip2CompressorInputStream&lt;/code&gt;&lt;/a&gt; is where all the magic happens; it takes an &lt;code&gt;InputStream&lt;/code&gt; as its constructor argument. So, all we have to do is something like the method below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

  &lt;span class="k"&gt;object&lt;/span&gt; &lt;span class="nc"&gt;StreamLoader&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
    &lt;span class="kt"&gt;val&lt;/span&gt; &lt;span class="kt"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;LoggerFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getLogger&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getClass&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;getStreamBuffer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputStream&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;InputStream&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;ListBuffer&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt;
      &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;lb&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ListBuffer&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Byte&lt;/span&gt;&lt;span class="o"&gt;]()&lt;/span&gt;

      &lt;span class="n"&gt;synchronized&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;lb&lt;/span&gt; &lt;span class="o"&gt;++=&lt;/span&gt; &lt;span class="nc"&gt;BZip2CompressorInputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputStream&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;readAllBytes&lt;/span&gt;
        &lt;span class="k"&gt;catch&lt;/span&gt;
          &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt;
            &lt;span class="nv"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;debug&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;e&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getMessage&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="n"&gt;lb&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
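
&lt;p&gt;To see the "jump straight to one stream" idea working end to end, here is a small self-contained sketch. It assumes Apache Commons Compress is on the classpath, builds a tiny two-stream archive in memory instead of touching the real dump, and uses helper names (&lt;code&gt;bz2&lt;/code&gt;, &lt;code&gt;readStreamAt&lt;/code&gt;) that are made up for this illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

  import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
  import java.nio.charset.StandardCharsets.UTF_8
  import org.apache.commons.compress.compressors.bzip2.{
    BZip2CompressorInputStream,
    BZip2CompressorOutputStream
  }

  // Compresses one string into a standalone bz2 stream (helper for this demo only).
  def bz2(text: String): Array[Byte] =
    val out = new ByteArrayOutputStream()
    val bz = new BZip2CompressorOutputStream(out)
    bz.write(text.getBytes(UTF_8))
    bz.close() // finishes the stream and flushes it into `out`
    out.toByteArray

  // Decompresses only the stream that starts at `offset` inside `archive`.
  def readStreamAt(archive: Array[Byte], offset: Int): String =
    val in = new ByteArrayInputStream(archive, offset, archive.length - offset)
    // with this constructor only the first stream found is decompressed
    val bz = new BZip2CompressorInputStream(in)
    try new String(bz.readAllBytes(), UTF_8)
    finally bz.close()

  @main def multistreamDemo(): Unit =
    val first = bz2("siteinfo goes here")
    val second = bz2("one hundred pages go here")
    val archive = first ++ second // streams simply sit back to back
    println(readStreamAt(archive, first.length)) // one hundred pages go here


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The single-argument constructor stops at the end of the first stream it meets, which is exactly what we want when each stream is handed to a different worker.&lt;/p&gt;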

&lt;p&gt;The &lt;code&gt;synchronized&lt;/code&gt; block in &lt;code&gt;getStreamBuffer&lt;/code&gt; might be an indication of what we're going to do, right? Yes, we're going to call this single method from multiple threads! That's why we create it inside an &lt;code&gt;object&lt;/code&gt; and not a class: there is a single shared instance, much like a static in Java, and the concurrent model takes place - one thread has access to read, while all the others wait to get a piece of data to process. If you're big into computer science theory like me, you've probably seen this in a computer architecture or distributed programming class... yes, this goes into the &lt;a href="https://en.wikipedia.org/wiki/Flynn%27s_taxonomy#Single_instruction_stream,_multiple_data_streams_(SIMD)" rel="noopener noreferrer"&gt;Flynn's Taxonomy on Single Instruction stream, Multiple Data streams [SIMD]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Making it easier: we have a single operation to apply over data that might be divided beforehand. The diagram below is how I usually see this kind of operation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4g0h5556a3s4m6z7iln2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4g0h5556a3s4m6z7iln2.png" alt="The idea behind streams and parallel processing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can abstract those &lt;code&gt;processors&lt;/code&gt; as &lt;code&gt;threads&lt;/code&gt; and the &lt;code&gt;block of data&lt;/code&gt; as a &lt;code&gt;stream&lt;/code&gt;, and voilà. People will surely say that I'm stretching the concept of processors and instructions here, and YES I AM. But this hasn't failed me yet - especially when using a VM language and not coding on bare metal. And believe it or not: &lt;a href="https://github.com/tobiasjc/mpi-omp-counting-stars" rel="noopener noreferrer"&gt;even on bare metal with C, MPI and OpenMP, with a 4-machine cluster&lt;/a&gt;, it hasn't failed me yet!&lt;/p&gt;
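
&lt;p&gt;A rough sketch of that abstraction, using only the JDK's executor API. Everything here is illustrative: &lt;code&gt;processBlock&lt;/code&gt; is a stand-in for whatever you do with one decompressed stream, and the helper names are assumptions rather than code from the repository:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

  import java.util.concurrent.{Callable, ExecutorService, Executors, Future}

  // Stand-in for the real work on one decompressed block (hypothetical).
  def processBlock(block: String): Int = block.split(",").length

  // Submits the same operation for one block; it runs on whichever pool thread is free.
  def submitBlock(pool: ExecutorService, block: String): Future[Int] =
    pool.submit(new Callable[Int] { def call(): Int = processBlock(block) })

  // Processes every block on a fixed pool of `threads` workers and sums the results.
  def processAll(blocks: List[String], threads: Int): Int =
    val pool = Executors.newFixedThreadPool(threads)
    try blocks.map(submitBlock(pool, _)).map(_.get()).sum
    finally pool.shutdown()

  @main def workersDemo(): Unit =
    println(processAll(List("a,b,c", "d,e", "f"), 2)) // 6


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One operation, many blocks: the pool size plays the role of the number of processors in the diagram.&lt;/p&gt;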

&lt;p&gt;Something I haven't told you, but you should have seen by now in the XML dump, is that it's actually formatted like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

  &lt;span class="nt"&gt;&amp;lt;mediawiki&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;siteinfo&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/siteinfo&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;page&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/page&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;page&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/page&amp;gt;&lt;/span&gt;
    ...
    &lt;span class="nt"&gt;&amp;lt;page&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/page&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/mediawiki&amp;gt;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So, as I've said: the first stream will give you the &lt;code&gt;siteinfo&lt;/code&gt;, BUT it comes with this opening &lt;code&gt;&amp;lt;mediawiki&amp;gt;&lt;/code&gt; tag, like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

  &lt;span class="nt"&gt;&amp;lt;mediawiki&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;siteinfo&amp;gt;&lt;/span&gt; ... &lt;span class="nt"&gt;&amp;lt;/siteinfo&amp;gt;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And so the following streams are going to give you 100 pages at a time, BUT the last one is going to end with a &lt;code&gt;&amp;lt;/mediawiki&amp;gt;&lt;/code&gt; tag like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

  &lt;span class="nt"&gt;&amp;lt;page&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/page&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;page&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/page&amp;gt;&lt;/span&gt;
  ...
  &lt;span class="nt"&gt;&amp;lt;page&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/page&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/mediawiki&amp;gt;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And this thing right there will give you problems when processing it with any kind of &lt;a href="https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/XMLStreamReader.html" rel="noopener noreferrer"&gt;XML Stream Reader&lt;/a&gt; (we're going to talk about it in just a minute, hold on). Also, any XML Stream Reader will demand that all those tags are "embraced" by a single root element, so they can't stay loose like this; they have to sit inside one enclosing document tag. Therefore, what you'll need is a method that puts this stream of pages INSIDE another document, such as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;getDocumentStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputStream&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;InputStream&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;InputStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;content&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getStreamBuffer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputStream&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;

    &lt;span class="nc"&gt;BufferedInputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
      &lt;span class="nc"&gt;ByteArrayInputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
          &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;prependAll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;document&amp;gt;"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getBytes&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
          &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;appendAll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&amp;lt;/document&amp;gt;"&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getBytes&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
          &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toArray&lt;/span&gt;
      &lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;NOW you have something that any XML Stream Reader you get from a provider can process. Because your streams (on a good day, not the last one as we've seen) will come as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

  &lt;span class="nt"&gt;&amp;lt;document&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;page&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/page&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;page&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/page&amp;gt;&lt;/span&gt;
    ...
    &lt;span class="nt"&gt;&amp;lt;page&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/page&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/document&amp;gt;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So now we have a starting tag &lt;code&gt;&amp;lt;document&amp;gt;&lt;/code&gt; and an ending tag &lt;code&gt;&amp;lt;/document&amp;gt;&lt;/code&gt;, as we should have in a well-formed XML document.&lt;/p&gt;
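
&lt;p&gt;If you want to convince yourself that the wrapper is what makes the fragment parseable, the sketch below may help. It is a variation on the idea, not the repository's code: it assumes you'd rather chain streams with the JDK's &lt;code&gt;SequenceInputStream&lt;/code&gt; than copy bytes around, and &lt;code&gt;wrapInDocument&lt;/code&gt; and &lt;code&gt;countPages&lt;/code&gt; are hypothetical names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

  import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
  import java.nio.charset.StandardCharsets.UTF_8
  import javax.xml.stream.XMLInputFactory

  // Wraps a fragment in a synthetic root element without copying the payload.
  def wrapInDocument(fragment: InputStream): InputStream =
    val prefix = new ByteArrayInputStream("&amp;lt;document&amp;gt;".getBytes(UTF_8))
    val suffix = new ByteArrayInputStream("&amp;lt;/document&amp;gt;".getBytes(UTF_8))
    new SequenceInputStream(new SequenceInputStream(prefix, fragment), suffix)

  // Counts page start tags, just to show that the wrapped stream parses.
  def countPages(in: InputStream): Int =
    val reader = XMLInputFactory.newInstance().createXMLEventReader(in)
    var pages = 0
    while reader.hasNext() do
      val event = reader.nextEvent()
      if event.isStartElement() then
        if event.asStartElement().getName().getLocalPart() == "page" then pages += 1
    pages

  @main def wrapDemo(): Unit =
    val fragment = "&amp;lt;page&amp;gt;&amp;lt;title&amp;gt;A&amp;lt;/title&amp;gt;&amp;lt;/page&amp;gt;&amp;lt;page&amp;gt;&amp;lt;title&amp;gt;B&amp;lt;/title&amp;gt;&amp;lt;/page&amp;gt;"
    val wrapped = wrapInDocument(new ByteArrayInputStream(fragment.getBytes(UTF_8)))
    println(countPages(wrapped)) // 2


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Feed the same fragment to the reader without the wrapper and it will complain once it hits the second root-level &lt;code&gt;page&lt;/code&gt; element.&lt;/p&gt;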

&lt;p&gt;So, those are all the ins and outs I can think of when reading those streams in a proper way so they can be processed. The processing is rather interesting too, let's jump right into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing it
&lt;/h2&gt;

&lt;p&gt;Well, as I've said before: I'm writing it to a &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL database&lt;/a&gt;, but you can write it literally to anything you want as long as you have the drivers or the means to do so. Let's remember how the data is structured inside the database:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmqh5zbz43l32y7rkd9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmqh5zbz43l32y7rkd9v.png" alt="Example of relational diagram given the XML dump"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the code I've used this idea of a &lt;code&gt;SinkSender&lt;/code&gt;, which is just something to "flush away the data" from the program into whatever place it should go. To do so, in Scala, I created a &lt;code&gt;trait&lt;/code&gt;, which is comparable to an &lt;code&gt;interface&lt;/code&gt; in Java, as displayed below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

  &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="nc"&gt;SinkSender&lt;/span&gt; &lt;span class="k"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Closeable&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
    &lt;span class="kt"&gt;def&lt;/span&gt; &lt;span class="kt"&gt;sendSiteinfo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;siteinfo:&lt;/span&gt; &lt;span class="kt"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;, &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Unit&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sendNamespace&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;, &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Unit&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sendPage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;, &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Unit&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sendRevision&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revision&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;, &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Unit&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sendContributor&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;contributor&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;, &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;])&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Unit&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;dispatch&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Unit&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Unit&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And as we can see, I have a &lt;code&gt;send&lt;/code&gt; type of method for each entity that should end up in the database, but nothing actually reaches the database until the &lt;code&gt;dispatch&lt;/code&gt; method is called. Since we're processing 100 pages at a time with each stream, I've found it very practical to just use &lt;a href="https://docs.oracle.com/cd/E11882_01/java.112/e16548/oraperf.htm#JJDBC28769" rel="noopener noreferrer"&gt;batch operations&lt;/a&gt; for those 100 pages, so the &lt;code&gt;send&lt;/code&gt; methods only call &lt;code&gt;addBatch&lt;/code&gt; on a statement inside of a connection, and when the &lt;code&gt;dispatch&lt;/code&gt; method is called, &lt;code&gt;executeBatch&lt;/code&gt; runs for each of them - in a very specific order, as you might remember from the usage of &lt;a href="https://www.postgresqltutorial.com/postgresql-foreign-key/" rel="noopener noreferrer"&gt;foreign key constraints&lt;/a&gt;. This controlled point of synchronization also allows for a document to be built and sent into an &lt;a href="https://developer.confluent.io/learn-kafka/apache-kafka/topics/" rel="noopener noreferrer"&gt;Apache Kafka topic&lt;/a&gt;, for example - but I won't go deeper here since it might be confusing enough already.&lt;/p&gt;
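
&lt;p&gt;To make the queue-then-flush pattern concrete without needing a database, here is a deliberately simplified, in-memory sketch. It is not the repository's implementation: &lt;code&gt;BufferedSink&lt;/code&gt; is a hypothetical class, its &lt;code&gt;send&lt;/code&gt; methods stand in for &lt;code&gt;addBatch&lt;/code&gt;, its &lt;code&gt;dispatch&lt;/code&gt; stands in for &lt;code&gt;executeBatch&lt;/code&gt;, and the "pages before revisions" order is an assumption about which table references which:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

  import scala.collection.mutable.ListBuffer

  // In-memory stand-in for a batching sink (hypothetical, for illustration only).
  class BufferedSink:
    private val pages = ListBuffer[String]()
    private val revisions = ListBuffer[String]()
    val written = ListBuffer[String]() // plays the role of the database

    // "send" only queues the row, like addBatch on a prepared statement
    def sendPage(title: String): Unit = pages += title
    def sendRevision(id: String): Unit = revisions += id

    // "dispatch" flushes every queue, parents first so foreign keys would resolve
    def dispatch(): Unit =
      written ++= pages.map("page:" + _)
      written ++= revisions.map("revision:" + _)
      pages.clear()
      revisions.clear()

  @main def sinkDemo(): Unit =
    val sink = BufferedSink()
    sink.sendRevision("r1") // queued before its page on purpose
    sink.sendPage("Scala")
    sink.dispatch()
    println(sink.written.mkString(", ")) // page:Scala, revision:r1


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The order in which rows arrive doesn't matter; only the order in which &lt;code&gt;dispatch&lt;/code&gt; flushes the queues does.&lt;/p&gt;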

&lt;p&gt;But for now, all you have to understand is this &lt;code&gt;SinkSender&lt;/code&gt; structure, and build your own from there. Check &lt;a href="https://github.com/tobiasjc/barewiki-dumper" rel="noopener noreferrer"&gt;the repository&lt;/a&gt; and see how I've done it all and also a bit more to build the whole database structure I need from the code itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Processing it
&lt;/h2&gt;

&lt;p&gt;Now, I'll show you an easy way to process it without going too much into the details. Here, I probably won't show you all the code, but I will show you how to write it. If you want to see all the code, &lt;a href="https://github.com/tobiasjc/barewiki-dumper" rel="noopener noreferrer"&gt;you can see it directly on the repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When processing the data, I chose to use the Java built-in &lt;a href="https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/XMLStreamReader.html" rel="noopener noreferrer"&gt;XML Stream Reader&lt;/a&gt;. I don't have time to go deeper into the XML Stream Reader thing here, but &lt;a href="https://docs.oracle.com/javase/tutorial/jaxp/stax/index.html" rel="noopener noreferrer"&gt;here is a good enough tutorial&lt;/a&gt; about it.&lt;/p&gt;

&lt;p&gt;Obviously, when processing a 20GB+ bzip2 archive, you won't try to load it all into memory, right? RIGHT? Okay. So using streamed operations, like we've already done when reading the file, and then processing the data step by step, so we can extract the information we need without searching for it - because the whole document follows a well-defined schema - seems like the best choice. This is exactly what this XML Stream Reader gives us: a way to process the stream step by step, going from tag to tag, until we reach the end of each block and get ready for the next one.&lt;/p&gt;

&lt;p&gt;I've chosen to do so using &lt;code&gt;handle&lt;/code&gt; methods to handle each kind of object I have in this XML dump. I don't think it's very practical to display whole chunks of code here, since it might become extensive and tiring to read, so I'll display diagrams and explain them while making sense of the code I've already written, which should guide you in writing your own or just using mine.&lt;/p&gt;

&lt;p&gt;Below is a diagram of how the &lt;code&gt;XML Stream Reader&lt;/code&gt; and the &lt;code&gt;handle&lt;/code&gt; thing should work for the &lt;code&gt;siteinfo&lt;/code&gt; object, exactly as it is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lx6aa0eudwgt9765un7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3lx6aa0eudwgt9765un7.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I know it might seem confusing, but this is just a way to imagine the function expressed as a flow chart, and most of the time it helps me get my head around some details. Let's follow the flow chart step by step, with some code snippets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;create a new XML event reader; it is available to every method inside this class&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

  &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;XMLStreamHeaderProcessor&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;sinkSender&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;SinkSender&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;xmlif&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;XMLInputFactory&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
      &lt;span class="n"&gt;inputStream&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;InputStream&lt;/span&gt;
  &lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt;
    &lt;span class="kt"&gt;val&lt;/span&gt; &lt;span class="kt"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;LoggerFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getLogger&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getClass&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;xmler&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;xmlif&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;createXMLEventReader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputStream&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;call the "nextEvent" method until you reach a desired tag, in this case, the &lt;code&gt;siteinfo&lt;/code&gt; XML tag&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
      &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dbname&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;

      &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nv"&gt;xmler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;hasNext&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
        &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;xmlne&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;xmler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nextEvent&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nv"&gt;xmlne&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;isStartElement&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt;
          &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;xmlse&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;xmlne&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;asStartElement&lt;/span&gt;

          &lt;span class="nv"&gt;xmlse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getLocalPart&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"siteinfo"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt;
              &lt;span class="o"&gt;...&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;call a handler function for this tag to extract the desired information and return it to the caller, so the caller can combine it with other data, for example to connect foreign keys&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

  &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;siteinfo&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;handleSiteinfo&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;once the handler has collected its information and combined it with whatever the other handlers returned, send it&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handleSiteinfo&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;, &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt;
    &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;siteinfo&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;, &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;]()&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;namespaces&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ListBuffer&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;, &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;]]()&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nv"&gt;xmler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;hasNext&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
      &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;xmlne&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;xmler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nextEvent&lt;/span&gt;

      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nv"&gt;xmlne&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;isStartElement&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;
      &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nv"&gt;xmlne&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;isEndElement&lt;/span&gt; &lt;span class="n"&gt;then&lt;/span&gt;
        &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;xmlee&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;xmlne&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;asEndElement&lt;/span&gt;

        &lt;span class="nv"&gt;xmlee&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getLocalPart&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt;
          &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"siteinfo"&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt;
            &lt;span class="nv"&gt;sinkSender&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sendSiteinfo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;siteinfo&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;namespaces&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;tapEach&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;ns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dbname"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;siteinfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"dbname"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;orNull&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
              &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;foreach&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;sinkSender&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sendNamespace&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;_&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;

            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;siteinfo&lt;/span&gt;
          &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="k"&gt;_&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt;
            &lt;span class="o"&gt;...&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;There are a lot of optimizations in the code. For example, no &lt;code&gt;dispatch&lt;/code&gt; is called when sending the &lt;code&gt;siteinfo&lt;/code&gt; or the &lt;code&gt;namespace&lt;/code&gt; rows. It isn't needed, and it is much more practical to run those operations immediately: every page inserted later needs both of them to already be in the database so that the foreign keys connect properly. So, in those cases, the &lt;code&gt;send&lt;/code&gt; calls are actual SEND calls, with no operation batch involved.&lt;/p&gt;
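
&lt;p&gt;To make the difference between an immediate send and a batched one concrete, here is a minimal sketch. It is not the code from the repository: &lt;code&gt;SketchSender&lt;/code&gt;, &lt;code&gt;sendNow&lt;/code&gt;, &lt;code&gt;enqueue&lt;/code&gt;, &lt;code&gt;flush&lt;/code&gt; and the &lt;code&gt;store&lt;/code&gt; callback are all hypothetical names, and &lt;code&gt;store&lt;/code&gt; simply stands in for a real database write.&lt;/p&gt;

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical sink: `store` stands in for a real database write.
final class SketchSender(batchSize: Int, store: Seq[String] => Unit):
  private val pending = ListBuffer[String]()

  // Immediate write: for rows that later rows depend on (siteinfo, namespaces).
  def sendNow(row: String): Unit = store(Seq(row))

  // Queued write: rows wait here until the batch is full.
  def enqueue(row: String): Unit =
    pending += row
    if pending.size >= batchSize then flush()

  // Writes whatever is queued in a single call and empties the queue.
  def flush(): Unit =
    if pending.nonEmpty then
      store(pending.toList)
      pending.clear()
```

&lt;p&gt;With something like this, the parent rows would go through &lt;code&gt;sendNow&lt;/code&gt; and the millions of pages through &lt;code&gt;enqueue&lt;/code&gt;, with one last &lt;code&gt;flush&lt;/code&gt; at the end of the run.&lt;/p&gt;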

&lt;p&gt;Another example of those optimizations is the early returns, which cut the function short and hand control back to the caller right away. It might not be much, but since we're dealing with ~20M pages, every instruction counts - and honestly it would be hard not to do it.&lt;/p&gt;

&lt;p&gt;All the careful treatment of &lt;code&gt;null&lt;/code&gt; values is needed so we can extract both the valid information and the invalid. When dealing with such data, usually coming from websites or raw user input (maybe not here, since this data should be carefully managed by the Wikimedia friends), I've learned that we should never trust data, and that even with some errors we should let the code go ahead so we have something to analyze later - and beyond all of that, sometimes &lt;code&gt;null&lt;/code&gt; is a valid entry.&lt;/p&gt;
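
&lt;p&gt;One possible way to keep that tolerance for missing values explicit is to wrap raw strings in &lt;code&gt;Option&lt;/code&gt; and only fall back to &lt;code&gt;null&lt;/code&gt; at the very edge, when the row is built. This is just a sketch: &lt;code&gt;clean&lt;/code&gt; and &lt;code&gt;toRow&lt;/code&gt; are hypothetical helpers, not functions from the repository.&lt;/p&gt;

```scala
// Hypothetical helper: turns null or blank raw text into None.
def clean(raw: String): Option[String] =
  Option(raw).map(_.trim).filter(_.nonEmpty)

// A missing or blank field ends up as null in the row instead of aborting the run,
// so the gap is still there to analyze later.
def toRow(fields: Map[String, String], key: String): (String, String) =
  key -> fields.get(key).flatMap(clean).orNull
```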

&lt;p&gt;Once again, if you're reading this just to get a glimpse of the idea and to write your own processor, go ahead! But in case you just want to use something that someone has already started, take a look at &lt;a href="https://github.com/tobiasjc/barewiki-dumper" rel="noopener noreferrer"&gt;my GitHub repository containing this code&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>scala</category>
      <category>database</category>
      <category>wikipedia</category>
      <category>data</category>
    </item>
    <item>
      <title>Understanding the Wikipedia dump</title>
      <dc:creator>José Tobias</dc:creator>
      <pubDate>Fri, 04 Mar 2022 23:14:50 +0000</pubDate>
      <link>https://dev.to/tobiasjc/understanding-the-wikipedia-dump-11f1</link>
      <guid>https://dev.to/tobiasjc/understanding-the-wikipedia-dump-11f1</guid>
      <description>&lt;p&gt;As a part of my work on &lt;a href="https://searchonmath.com" rel="noopener noreferrer"&gt;SearchOnMath&lt;/a&gt;, I'm always trying to find better ways to retrieve and process data, making sure it's in good shape for our powerful mathematical search engine. Wikipedia has always been a problem in this workflow, since the pages are written in a markup language called &lt;a href="https://en.wikipedia.org/wiki/Help:Wikitext" rel="noopener noreferrer"&gt;Wikitext&lt;/a&gt;, which is not easy to understand or apply operations to.&lt;/p&gt;

&lt;p&gt;Here I will briefly describe how the data is structured inside the dump - which is a humongous XML file - and try to make sense of that structure without going full cliché and only displaying &lt;a href="https://www.mediawiki.org/xml/export-0.10.xsd" rel="noopener noreferrer"&gt;the XML schema&lt;/a&gt;. In the end I'll give an example of how I visualize the document as a relational model, and in the following posts I'll describe how I wrote a processor for this XML to fill the given relational model - parallelizing bz2 streams, hopefully. Let's go and understand this data first.&lt;/p&gt;

&lt;h2&gt;
  
  
  First steps
&lt;/h2&gt;

&lt;p&gt;To store the clean data we need, we first have to ingest it from the source, and the source is the &lt;a href="https://dumps.wikimedia.org/backup-index.html" rel="noopener noreferrer"&gt;Wikimedia database backup dump&lt;/a&gt;. Those backup dumps come in the form of big XML files compressed into big &lt;a href="https://sourceware.org/bzip2/" rel="noopener noreferrer"&gt;bz2&lt;/a&gt; &lt;a href="https://en.wikipedia.org/wiki/Bzip2#File_format" rel="noopener noreferrer"&gt;multistream&lt;/a&gt; archives. The provided &lt;a href="https://www.mediawiki.org/xml/export-0.10.xsd" rel="noopener noreferrer"&gt;XML schema&lt;/a&gt; is a good way of understanding the whole specification, but looking at the data itself is just as important to understand what we have and what we need from this immense amount of information. To make things easier, I'll try to refer to such XML documents in an &lt;a href="https://en.wikipedia.org/wiki/Object-oriented_programming" rel="noopener noreferrer"&gt;object-oriented&lt;/a&gt; manner, and I swear it'll be much easier than referring to all the different types of "complex elements" that might be inside of this XML document.&lt;/p&gt;

&lt;p&gt;I would say this XML file structure is made up of only two main object types: the &lt;code&gt;siteinfo&lt;/code&gt; object and the &lt;code&gt;page&lt;/code&gt; object, each one having multiple associated objects, fields and attributes which we will discover one by one, focusing on the most important ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;code&gt;siteinfo&lt;/code&gt; object
&lt;/h2&gt;

&lt;p&gt;This structure's main part comprises information about the source of the dump, such as the link to the main page of the wiki and the sitename. Below is a real example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

  &lt;span class="nt"&gt;&amp;lt;siteinfo&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;sitename&amp;gt;&lt;/span&gt;Wikipedia&lt;span class="nt"&gt;&amp;lt;/sitename&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;dbname&amp;gt;&lt;/span&gt;enwiki&lt;span class="nt"&gt;&amp;lt;/dbname&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;base&amp;gt;&lt;/span&gt;https://en.wikipedia.org/wiki/Main_Page&lt;span class="nt"&gt;&amp;lt;/base&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;generator&amp;gt;&lt;/span&gt;MediaWiki 1.38.0-wmf.22&lt;span class="nt"&gt;&amp;lt;/generator&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;case&amp;gt;&lt;/span&gt;first-letter&lt;span class="nt"&gt;&amp;lt;/case&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;namespaces&amp;gt;&lt;/span&gt;
      ...
    &lt;span class="nt"&gt;&amp;lt;/namespaces&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/siteinfo&amp;gt;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We see that the &lt;code&gt;siteinfo&lt;/code&gt; object is composed of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dbname&lt;/code&gt;: an indicator of this Wikipedia instance, it might work as an id&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sitename&lt;/code&gt;: the name of the Wikipedia, well formatted, that can be displayed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;base&lt;/code&gt;: the link to the base page of this Wikipedia, the  main one&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;generator&lt;/code&gt;: information about the &lt;a href="https://www.mediawiki.org/wiki/Download" rel="noopener noreferrer"&gt;MediaWiki software&lt;/a&gt; used by this instance of the Wikipedia when such dump was generated&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;case&lt;/code&gt;: how the wiki treats letter case in page titles; &lt;code&gt;first-letter&lt;/code&gt; means only the first letter of a title is case-insensitive (MediaWiki capitalizes it)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can see that the first field is the &lt;code&gt;sitename&lt;/code&gt;, which refers to the name of the site where this dump came from - I'll not dive deeper to explain the whole &lt;a href="https://www.mediawiki.org/wiki/MediaWiki" rel="noopener noreferrer"&gt;usage of the MediaWiki software to build what is now the Wikipedia website&lt;/a&gt; here, so I'll let you discover it all if you want to. Then the &lt;code&gt;dbname&lt;/code&gt; comes, and I like to think of this &lt;code&gt;dbname&lt;/code&gt; as a discriminator to be used inside of the Wikipedia dumps. For example, we might have the "&lt;em&gt;enwiki&lt;/em&gt;" for the English variation, the "&lt;em&gt;dewiki&lt;/em&gt;" for the German variation, and so on. The &lt;code&gt;base&lt;/code&gt; field indicates the base page to make access easier, and &lt;code&gt;case&lt;/code&gt; is a title-formatting detail that should not be useful for us right now.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;namespaces&lt;/code&gt; object is a child of &lt;code&gt;siteinfo&lt;/code&gt; which allows us to search for information in the right place. Looking at the &lt;a href="https://en.wikipedia.org/wiki/Wikipedia:Namespace" rel="noopener noreferrer"&gt;explanation for each namespace&lt;/a&gt;, we can see that if we only need to analyse articles, we only have to retrieve from this dump the pages associated with namespace &lt;code&gt;0&lt;/code&gt;, but if we want to analyse what users are discussing about changes to a certain page, we should look for pages associated with namespace &lt;code&gt;1&lt;/code&gt;. The &lt;code&gt;namespaces&lt;/code&gt; object is formatted as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

  &lt;span class="nt"&gt;&amp;lt;namespaces&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;namespace&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"-2"&lt;/span&gt; &lt;span class="na"&gt;case=&lt;/span&gt;&lt;span class="s"&gt;"first-letter"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Media&lt;span class="nt"&gt;&amp;lt;/namespace&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;namespace&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"-1"&lt;/span&gt; &lt;span class="na"&gt;case=&lt;/span&gt;&lt;span class="s"&gt;"first-letter"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Special&lt;span class="nt"&gt;&amp;lt;/namespace&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;namespace&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"0"&lt;/span&gt; &lt;span class="na"&gt;case=&lt;/span&gt;&lt;span class="s"&gt;"first-letter"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;namespace&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"1"&lt;/span&gt; &lt;span class="na"&gt;case=&lt;/span&gt;&lt;span class="s"&gt;"first-letter"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Talk&lt;span class="nt"&gt;&amp;lt;/namespace&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;namespace&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"2"&lt;/span&gt; &lt;span class="na"&gt;case=&lt;/span&gt;&lt;span class="s"&gt;"first-letter"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;User&lt;span class="nt"&gt;&amp;lt;/namespace&amp;gt;&lt;/span&gt;
    ...
    &lt;span class="nt"&gt;&amp;lt;namespace&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"2300"&lt;/span&gt; &lt;span class="na"&gt;case=&lt;/span&gt;&lt;span class="s"&gt;"first-letter"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Gadget&lt;span class="nt"&gt;&amp;lt;/namespace&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;namespace&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"2301"&lt;/span&gt; &lt;span class="na"&gt;case=&lt;/span&gt;&lt;span class="s"&gt;"first-letter"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Gadget talk&lt;span class="nt"&gt;&amp;lt;/namespace&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;namespace&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"2302"&lt;/span&gt; &lt;span class="na"&gt;case=&lt;/span&gt;&lt;span class="s"&gt;"case-sensitive"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Gadget definition&lt;span class="nt"&gt;&amp;lt;/namespace&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;namespace&lt;/span&gt; &lt;span class="na"&gt;key=&lt;/span&gt;&lt;span class="s"&gt;"2303"&lt;/span&gt; &lt;span class="na"&gt;case=&lt;/span&gt;&lt;span class="s"&gt;"case-sensitive"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Gadget definition talk&lt;span class="nt"&gt;&amp;lt;/namespace&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/namespaces&amp;gt;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each namespace has a few important attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;key&lt;/code&gt;: is a namespace identifier used to make the search and link of the namespaces easier&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;case&lt;/code&gt;: the title-casing rule for pages in this namespace, either &lt;code&gt;first-letter&lt;/code&gt; or &lt;code&gt;case-sensitive&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Exactly what we'd expect from the &lt;a href="https://en.wikipedia.org/wiki/Wikipedia:Namespace" rel="noopener noreferrer"&gt;namespaces explanation link&lt;/a&gt; we've seen before, right? The content of each &lt;code&gt;namespace&lt;/code&gt; element is just a friendly, displayable name.&lt;/p&gt;
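
&lt;p&gt;As a rough illustration of how those attributes could be read, here is a small sketch using the StAX event reader that ships with the JDK. The &lt;code&gt;Namespace&lt;/code&gt; case class and the &lt;code&gt;readNamespaces&lt;/code&gt; function are hypothetical names of mine, and the sketch assumes every &lt;code&gt;namespace&lt;/code&gt; element carries both the &lt;code&gt;key&lt;/code&gt; and the &lt;code&gt;case&lt;/code&gt; attribute, as in the sample above.&lt;/p&gt;

```scala
import java.io.StringReader
import javax.xml.namespace.QName
import javax.xml.stream.XMLInputFactory
import scala.collection.mutable.ListBuffer

// Hypothetical row type for one namespace entry.
final case class Namespace(key: Int, titleCase: String, name: String)

// Walks the events and collects every namespace element it finds.
// Assumes the key and case attributes are always present (no null checks here).
def readNamespaces(xml: String): List[Namespace] =
  val reader = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml))
  val found = ListBuffer[Namespace]()
  while reader.hasNext do
    val event = reader.nextEvent()
    if event.isStartElement then
      val start = event.asStartElement
      if start.getName.getLocalPart == "namespace" then
        val key = start.getAttributeByName(new QName("key")).getValue.toInt
        val titleCase = start.getAttributeByName(new QName("case")).getValue
        // getElementText reads up to the matching end tag; it is empty for namespace 0.
        found += Namespace(key, titleCase, reader.getElementText)
  found.toList
```

&lt;p&gt;Note that the main namespace (key &lt;code&gt;0&lt;/code&gt;) is an empty element, so its name comes back as an empty string rather than as a missing value.&lt;/p&gt;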

&lt;p&gt;And this is all we have on this &lt;code&gt;siteinfo&lt;/code&gt; element that should be important for us to understand. This is a very important piece of information to have, as we'll see soon. Let's check the &lt;code&gt;page&lt;/code&gt; object's structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;code&gt;page&lt;/code&gt; object
&lt;/h2&gt;

&lt;p&gt;Now things get a bit more exciting. This &lt;code&gt;page&lt;/code&gt; object is where all the information of a single page is stored inside this big XML file, and since we have millions of pages inside the Wikipedia, we have millions of such objects inside the XML dump. Each &lt;code&gt;page&lt;/code&gt; object is given as: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

  &lt;span class="nt"&gt;&amp;lt;page&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;AccessibleComputing&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;ns&amp;gt;&lt;/span&gt;0&lt;span class="nt"&gt;&amp;lt;/ns&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;id&amp;gt;&lt;/span&gt;10&lt;span class="nt"&gt;&amp;lt;/id&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;redirect&lt;/span&gt; &lt;span class="na"&gt;title=&lt;/span&gt;&lt;span class="s"&gt;"Computer accessibility"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;revision&amp;gt;&lt;/span&gt;
      ...
      &lt;span class="nt"&gt;&amp;lt;contributor&amp;gt;&lt;/span&gt;
        ...
      &lt;span class="nt"&gt;&amp;lt;/contributor&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/revision&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/page&amp;gt;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We see that the &lt;code&gt;page&lt;/code&gt; object is composed of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;id&lt;/code&gt;: a unique identifier for this page inside of this Wikipedia instance&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;title&lt;/code&gt;: the title of the page, well formatted, that might be displayed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ns&lt;/code&gt;: this is the namespace of the page&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;redirect&lt;/code&gt;: when this &lt;a href="https://en.wikipedia.org/wiki/Wikipedia:Redirect" rel="noopener noreferrer"&gt;page is just a redirect page&lt;/a&gt;, this &lt;code&gt;redirect&lt;/code&gt; field appears with the &lt;code&gt;title&lt;/code&gt; attribute containing the name of the page it should redirect to&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Redirects are quite common inside Wikipedia: hundreds of thousands of pages are only redirect pages. For example, try to access the &lt;a href="https://en.wikipedia.org/wiki/AccessibleComputing" rel="noopener noreferrer"&gt;AccessibleComputing&lt;/a&gt; page and see how it redirects you to the &lt;a href="https://en.wikipedia.org/wiki/Computer_accessibility" rel="noopener noreferrer"&gt;Computer accessibility&lt;/a&gt; page. Also, the text of such a page is only a directive indicating it is a redirect page and nothing else - more on that soon.&lt;/p&gt;
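
&lt;p&gt;If all you want is real articles, the &lt;code&gt;ns&lt;/code&gt; and &lt;code&gt;redirect&lt;/code&gt; fields are enough to filter on. A minimal sketch, assuming a hypothetical in-memory &lt;code&gt;Page&lt;/code&gt; shape where &lt;code&gt;redirect&lt;/code&gt; holds the target title when the element is present (neither &lt;code&gt;Page&lt;/code&gt; nor &lt;code&gt;articles&lt;/code&gt; comes from the dump format or from any library):&lt;/p&gt;

```scala
// Hypothetical in-memory shape of a page; redirect holds the target title if present.
final case class Page(id: Long, ns: Int, title: String, redirect: Option[String])

// Keeps real articles: main namespace (0) and not a redirect.
def articles(pages: Seq[Page]): Seq[Page] =
  pages.filter(_.ns == 0).filter(_.redirect.isEmpty)
```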

&lt;p&gt;Then, for the child objects that compose a page, we have the &lt;code&gt;revision&lt;/code&gt; object which is very important. The &lt;code&gt;revision&lt;/code&gt; fields are as follows:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

  &lt;span class="nt"&gt;&amp;lt;revision&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;id&amp;gt;&lt;/span&gt;1002250816&lt;span class="nt"&gt;&amp;lt;/id&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;parentid&amp;gt;&lt;/span&gt;854851586&lt;span class="nt"&gt;&amp;lt;/parentid&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;timestamp&amp;gt;&lt;/span&gt;2021-01-23T15:15:01Z&lt;span class="nt"&gt;&amp;lt;/timestamp&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;contributor&amp;gt;&lt;/span&gt;
      ...
    &lt;span class="nt"&gt;&amp;lt;/contributor&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;comment&amp;gt;&lt;/span&gt;shel&lt;span class="nt"&gt;&amp;lt;/comment&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;model&amp;gt;&lt;/span&gt;wikitext&lt;span class="nt"&gt;&amp;lt;/model&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;format&amp;gt;&lt;/span&gt;text/x-wiki&lt;span class="nt"&gt;&amp;lt;/format&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;text&lt;/span&gt; &lt;span class="na"&gt;bytes=&lt;/span&gt;&lt;span class="s"&gt;"111"&lt;/span&gt; &lt;span class="na"&gt;xml:space=&lt;/span&gt;&lt;span class="s"&gt;"preserve"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;#REDIRECT [[Computer accessibility]]

{{rcat shell|
{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}
}}&lt;span class="nt"&gt;&amp;lt;/text&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;sha1&amp;gt;&lt;/span&gt;kmysdltgexdwkv2xsml3j44jb56dxvn&lt;span class="nt"&gt;&amp;lt;/sha1&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/revision&amp;gt;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So the &lt;code&gt;revision&lt;/code&gt; object is composed of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;id&lt;/code&gt;: a unique identifier of this revision inside of this Wikipedia instance&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;parentid&lt;/code&gt;: the id of this revision's parent, that is, the id of the revision that comes before this one for this same page&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;timestamp&lt;/code&gt;: the time at which this revision was generated&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;comment&lt;/code&gt;: some comment added by the contributor who sent this revision&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;model&lt;/code&gt;: the model in which this text is formatted&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;format&lt;/code&gt;: seems like something that could be used in a &lt;code&gt;Content-Type&lt;/code&gt; header internally&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;text&lt;/code&gt;: the text of this revision, our gold mine&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sha1&lt;/code&gt;: hash of the text generated by the &lt;a href="https://en.wikipedia.org/wiki/SHA-1" rel="noopener noreferrer"&gt;SHA-1 algorithm&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See what I said about the redirect page's text? Nothing but a few lines, always starting with "#REDIRECT", followed by some &lt;a href="https://www.mediawiki.org/wiki/Help:Templates" rel="noopener noreferrer"&gt;strange template things&lt;/a&gt; or whatever, and that's it - from a processing point of view, completely disposable.&lt;/p&gt;
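
&lt;p&gt;A side note on the &lt;code&gt;sha1&lt;/code&gt; field: the value in the sample is 31 characters long rather than the usual 40 hexadecimal ones, because, as far as I know, MediaWiki writes the SHA-1 digest in base 36. The sketch below shows one way to reproduce that encoding, assuming the hash is taken over the UTF-8 bytes of the revision text; &lt;code&gt;sha1Base36&lt;/code&gt; is a hypothetical helper name, and I have not checked its output against real dump entries.&lt;/p&gt;

```scala
import java.math.BigInteger
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// SHA-1 of the UTF-8 text, written in base 36 and left-padded with zeros to 31 characters.
def sha1Base36(text: String): String =
  val digest = MessageDigest.getInstance("SHA-1").digest(text.getBytes(StandardCharsets.UTF_8))
  // Signum 1 makes BigInteger treat the 20 digest bytes as an unsigned number.
  val encoded = new BigInteger(1, digest).toString(36)
  "0" * (31 - encoded.length) + encoded
```

&lt;p&gt;Something like this could be used to check that a revision's text survived processing intact, by comparing the result with the value stored in the dump.&lt;/p&gt;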

&lt;p&gt;At this point we should know that &lt;a href="https://en.wikipedia.org/wiki/Wikipedia:About" rel="noopener noreferrer"&gt;Wikipedia is built by contributors&lt;/a&gt;, and all the changes made by a contributor to a certain page, or the very first addition of text to one of them, gives life to a new so-called &lt;code&gt;revision&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;contributor&lt;/code&gt; child object, which is the object that identifies the user who made this change to the page with the given revision, is given as below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;

  &lt;span class="nt"&gt;&amp;lt;contributor&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;username&amp;gt;&lt;/span&gt;Elli&lt;span class="nt"&gt;&amp;lt;/username&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;id&amp;gt;&lt;/span&gt;20842734&lt;span class="nt"&gt;&amp;lt;/id&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/contributor&amp;gt;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And the &lt;code&gt;contributor&lt;/code&gt; fields are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;id&lt;/code&gt;: a unique identifier of this contributor inside of this Wikipedia instance&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;username&lt;/code&gt;: a displayable/more user friendly username&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that's it. That's the whole document specification we need to know, plus some information that we don't actually need in order to process this data.&lt;/p&gt;

&lt;h2&gt;
  
  
  A relational model example
&lt;/h2&gt;

&lt;p&gt;When I see all those objects with fields related to each other, I like to imagine it as relational data and build a relational structure in my head. So I go around and think... "if this page has this id, and this revision is related to this page with this other id, and this contributor wrote this revision with this other id... how can I quickly navigate such structure by connecting those ids?" - and all of a sudden you have your whole relational model done.&lt;/p&gt;

&lt;p&gt;Below is the relational structure I ended up with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmqh5zbz43l32y7rkd9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmqh5zbz43l32y7rkd9v.png" alt="Example of relational diagram given the XML dump"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might find it very strange that every single table has a composite primary key... well, let me explain. All those ids are only unique within each Wikipedia out there, so the &lt;a href="https://en.wikipedia.org/wiki/Main_Page" rel="noopener noreferrer"&gt;"&lt;em&gt;enwiki&lt;/em&gt;"&lt;/a&gt; has its pool of ids, and the &lt;a href="https://de.wikipedia.org/wiki/Wikipedia:Hauptseite" rel="noopener noreferrer"&gt;"&lt;em&gt;dewiki&lt;/em&gt;"&lt;/a&gt; has its own pool of ids. So, to be able to use a single database and have all the data available for a multi-language type of search engine - for example - I use the &lt;code&gt;siteinfo&lt;/code&gt; &lt;strong&gt;dbname&lt;/strong&gt; field as a discriminator - which works perfectly.&lt;/p&gt;
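
&lt;p&gt;The same idea can be sketched in code: if every key carries the &lt;code&gt;dbname&lt;/code&gt; next to the id, two wikis can reuse the same numeric id without colliding. The names below (&lt;code&gt;PageKey&lt;/code&gt;, &lt;code&gt;PageRow&lt;/code&gt;, &lt;code&gt;indexByKey&lt;/code&gt;) are hypothetical and only mirror the composite key, not the actual table definitions.&lt;/p&gt;

```scala
// Hypothetical composite key: an id is only meaningful together with its wiki.
final case class PageKey(dbname: String, id: Long)
final case class PageRow(key: PageKey, ns: Int, title: String)

// Indexes rows by (dbname, id), so equal ids from different wikis stay separate.
def indexByKey(rows: Seq[PageRow]): Map[PageKey, PageRow] =
  rows.map(row => row.key -> row).toMap
```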

&lt;p&gt;It's minimal: it doesn't have extra tables to connect one-to-many relationships in the case of contributors because I don't need them, and it actually solves all of my problems when processing this data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Processing the XML dump
&lt;/h2&gt;

&lt;p&gt;I'm going to start my next post from this relational structure diagram, explaining how you can build a fairly fast (parallel per-stream, at least on paper) processor for this XML dump (in Scala, or Java - preferably), and the ins and outs of reading an XML multistream file using &lt;a href="https://commons.apache.org/proper/commons-compress/" rel="noopener noreferrer"&gt;Apache Commons Compress&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>data</category>
      <category>xml</category>
      <category>database</category>
      <category>wikipedia</category>
    </item>
  </channel>
</rss>
