DEV Community

Wincent Balin

Build law text corpus

In this part of the series, I describe how to create a corpus of German law texts from https://www.gesetze-im-internet.de.

Previously in the series

In the previous parts of this series, we downloaded 6518 German laws in XML format, stored in ZIP files.

Conversion to plain text

Converting XML documents to plain text can be accomplished with many tools and technologies, but after carefully considering a couple of edge cases, I decided to use an XSLT stylesheet.

After studying the DTD file referenced in the XML files, as well as the XML files themselves, the following points had to be addressed (the paths use XPath notation):

  1. The XML files have the root element /dokumente
  2. The laws are either very short and consist of a single paragraph, or rather long with a table of contents
  3. In the first case of item 2, the law name is in metadaten/enbez and metadaten/titel (if the first path is present) or in metadaten/enbez only; in the second case, the title is in norm/metadaten/langue
  4. The text body is always in textdaten
  5. Paragraphs are in P tags and end with a new line
  6. Definition lists are in DL tags and are rendered similarly to paragraphs, but without a new line after the last entry
  7. A line break in the text is a BR tag, but it is not rendered inside a table or a list entry
  8. Tables of contents (TOC tags) are excluded, as they only repeat paragraph titles and are thus useless for language model training; they are also unusable in plain text, since there are no page numbers
  9. Titles (Title tags) are rendered with a new line appended
  10. Tables (table tags) are rendered with each row (row tags) ending with a new line and every cell (entry tags) except the last one in a row followed by a tab character
  11. The end marker of a law text is 25 empty lines
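
The actual conversion in this article is done by an XSLT stylesheet. Purely as an illustration of the rules above, here is a minimal Python sketch using only the standard library. It is not the stylesheet and covers only a subset of the rules (4, 5, 7 for table cells only, 8, 9, 10 and 11). The function names `render` and `law_to_text` are made up for this example, and the element names are taken from the list above.

```python
import xml.etree.ElementTree as ET

END_MARKER = "\n" * 25  # rule 11: 25 empty lines close each law text


def render(elem, in_cell=False):
    """Recursively render one element following a subset of the rules."""
    tag = elem.tag
    if tag == "TOC":    # rule 8: tables of contents are skipped entirely
        return ""
    if tag == "BR":     # rule 7: no line break inside a table cell
        return "" if in_cell else "\n"
    if tag == "table":  # rule 10: only the rows of a table are rendered
        return "".join(render(row) for row in elem.iter("row"))
    if tag == "row":    # rule 10: tab between cells, new line after the row
        cells = [render(cell, in_cell=True) for cell in elem.findall("entry")]
        return "\t".join(cells) + "\n"
    # Generic case: own text, then each child followed by its tail text.
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child, in_cell))
        parts.append(child.tail or "")
    text = "".join(parts)
    if tag in ("P", "Title") and not in_cell:  # rules 5 and 9
        text += "\n"
    return text


def law_to_text(xml_string):
    """Render every textdaten element (rule 4) and append the end marker."""
    root = ET.fromstring(xml_string)
    return "".join(render(body) for body in root.iter("textdaten")) + END_MARKER
```

Definition lists (rule 6) and the title lookup (rule 3) are left out on purpose, so the sketch should be read as a way to check one's understanding of the rules, not as a replacement for the stylesheet.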

The result is a short XSLT stylesheet of about 100 lines (giitotext.xsl in the command below).

Run it on Windows with msxsl.exe as the XSLT processor like this:

msxsl BJNR001270871.xml giitotext.xsl > BJNR001270871.txt
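
With several thousand laws, running the command by hand is impractical. The following is a minimal sketch of a batch run in Python, assuming that msxsl is on the PATH and that all extracted XML files sit in one directory. The helper names `msxsl_command` and `convert_all` are hypothetical, and the call mirrors the command line above, including the redirection of standard output to a text file.

```python
import subprocess
from pathlib import Path


def msxsl_command(xml_path, stylesheet="giitotext.xsl"):
    """Build the argument list for one msxsl call: source, then stylesheet."""
    return ["msxsl", str(xml_path), str(stylesheet)]


def convert_all(xml_dir, stylesheet="giitotext.xsl"):
    """Convert every *.xml file in xml_dir to a *.txt file next to it."""
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        txt_path = xml_path.with_suffix(".txt")
        # Redirect stdout into the text file, like `>` on the command line.
        with open(txt_path, "wb") as out:
            subprocess.run(msxsl_command(xml_path, stylesheet),
                           stdout=out, check=True)
```

Calling `convert_all` on the directory with the extracted XML files would then produce one text file per law; `check=True` stops the run at the first file that msxsl fails to process.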

Concatenating the text files creates a law text corpus.
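
The concatenation can be done with any tool. As a minimal sketch in Python, assuming the text files are encoded as UTF-8 (this depends on the output settings of the stylesheet) and using the hypothetical function name `build_corpus`:

```python
from pathlib import Path


def build_corpus(text_dir, corpus_path, encoding="utf-8"):
    """Append every *.txt file in text_dir to one corpus file; return the count."""
    corpus_path = Path(corpus_path)
    count = 0
    with open(corpus_path, "w", encoding=encoding) as corpus:
        for txt_path in sorted(Path(text_dir).glob("*.txt")):
            # Skip the corpus itself if it lives in the same directory.
            if txt_path.resolve() == corpus_path.resolve():
                continue
            corpus.write(txt_path.read_text(encoding=encoding))
            count += 1
    return count
```

Because each law text already ends with the 25-line end marker, no extra separator is written between the files.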

Next step

In the next part of the series, we will see how to train a language model on the text corpus we just created.