In this part of the series, I describe how to create a corpus of German law texts from https://www.gesetze-im-internet.de.
Previously in this series
In the previous parts of this series, we downloaded 6518 German laws in XML format, stored in ZIP files.
Conversion to plain text
Converting XML documents to plain text can be accomplished with many tools and technologies, but after carefully considering a couple of edge cases, I decided to use an XSLT stylesheet.
After studying the DTD file referenced in the XML files, as well as the XML files themselves, I found that the following points had to be addressed (the paths are given in XPath notation):
- The XML files have the root element `/dokumente`.
- The laws are either incredibly short and consist of a single paragraph, or rather long with a table of contents.
- In the first case of the previous item, the law name is in `metadaten/enbez` and `metadaten/titel` (if the first path is present) or in `metadaten/enbez` only; in the second case, the title is in `norm/metadaten/langue`.
- The text body is always in `textdaten`.
- The paragraphs are in `P` tags and end with a new line.
- The definition lists are in `DL` tags and are rendered like paragraphs, but without a new line after the last entry.
- A line break in the text is marked by a `BR` tag, but it is not rendered inside a table or a list entry.
- Tables of contents (`TOC` tags) are excluded: they only repeat paragraph titles and are thus useless for language model training; they are also unusable in plain text, as there are no known page numbers.
- Titles (`Title` tags) are rendered with an appended new line.
- Tables (`table` tags) are rendered with rows (`row` tags) ending with a new line and with every cell (`entry` tags) except the last one in a row followed by a tab character.
- The end marker of a law text is 25 empty lines.
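To make a few of these rules concrete, here is a minimal Python sketch using only the standard library. It is not the XSLT stylesheet used in this article, just an illustration of the same idea in another language. The function names `render` and `law_to_text` are made up for this example, the law name handling and the special `DL` rule are left out, and treating everything inside `DL` as a list entry is my simplifying assumption.

```python
import xml.etree.ElementTree as ET

END_MARKER = "\n" * 25  # 25 empty lines close each law text


def _text(s):
    """Drop missing or whitespace-only text nodes (indentation between tags)."""
    return s if s and s.strip() else ""


def render(elem, suppress_br=False):
    """Recursively render one element as plain text (hypothetical helper)."""
    tag = elem.tag
    if tag == "TOC":
        return ""  # tables of contents are skipped entirely
    if tag == "BR":
        # no line break inside table cells or list entries
        return "" if suppress_br else "\n"
    if tag == "row":
        # cells are joined by tabs, the row ends with a new line
        cells = [render(cell, True) for cell in elem.findall("entry")]
        return "\t".join(cells) + "\n"
    # assumption: everything inside DL counts as a list entry
    suppress_br = suppress_br or tag in ("entry", "DL")
    parts = [_text(elem.text)]
    for child in elem:
        parts.append(render(child, suppress_br))
        parts.append(_text(child.tail))
    text = "".join(parts)
    # paragraphs and titles get a new line appended
    return text + "\n" if tag in ("P", "Title") else text


def law_to_text(xml_data):
    """Render every textdaten body of one law file and append the end marker."""
    root = ET.fromstring(xml_data)  # the root element is expected to be dokumente
    return "".join(render(body) for body in root.iter("textdaten")) + END_MARKER
```

Passing the raw bytes of a file, for example `law_to_text(Path("BJNR001270871.xml").read_bytes())`, lets the XML parser pick up the encoding declared in the file.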
These rules lead to a short XSLT stylesheet of about 100 lines, giitotext.xsl.
Run it on Windows using msxsl.exe as the XSLT processor like this:
msxsl BJNR001270871.xml giitotext.xsl > BJNR001270871.txt
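With several thousand laws, running the command by hand is not practical. The following is a small Python sketch for converting a whole directory. It assumes that the XML files have already been extracted from the ZIP files into one directory and that `msxsl` is on the `PATH`; the function names `msxsl_command` and `convert_all` are hypothetical.

```python
import subprocess
from pathlib import Path


def msxsl_command(xml_path, stylesheet="giitotext.xsl"):
    """Build the same command line as above: msxsl <source> <stylesheet>."""
    return ["msxsl", str(xml_path), str(stylesheet)]


def convert_all(xml_dir, stylesheet="giitotext.xsl"):
    """Convert every *.xml file in xml_dir to a *.txt file next to it."""
    converted = []
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        txt_path = xml_path.with_suffix(".txt")
        # redirect the processor's standard output into the text file,
        # like the "> file.txt" redirection in the shell
        with txt_path.open("wb") as out:
            subprocess.run(msxsl_command(xml_path, stylesheet), stdout=out, check=True)
        converted.append(txt_path)
    return converted
```

Because of `check=True`, the loop stops with an exception as soon as the processor reports an error for one file.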
Concatenating the text files creates a law text corpus.
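The concatenation could be sketched in Python as follows. The function name `build_corpus` is made up for this example, and UTF-8 is only an assumption about the encoding of the text files; adjust it to whatever your XSLT processor actually writes.

```python
from pathlib import Path


def build_corpus(text_dir, corpus_path, encoding="utf-8"):
    """Concatenate all *.txt files in text_dir into one corpus file.

    Returns the number of files that were appended.
    """
    corpus_path = Path(corpus_path)
    # sort for a reproducible order and skip the corpus file itself
    files = sorted(
        p for p in Path(text_dir).glob("*.txt")
        if p.resolve() != corpus_path.resolve()
    )
    with corpus_path.open("w", encoding=encoding) as out:
        for path in files:
            out.write(path.read_text(encoding=encoding))
    return len(files)
```

Since every law text already ends with the 25 empty lines described above, no extra separator is written between the files.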
Next step
In the next part of the series, we will see how to train a language model with the text corpus we just created.