DEV Community

Akmal Chaudhri for SingleStore

Posted on

Quick tip: The Plays of William Shakespeare, XML and SingleStoreDB

Abstract

In this short article, we'll see how to store the plays of William Shakespeare in XML format in SingleStoreDB. We'll then connect from a client application and run some XPath expressions.

Introduction

In this article, we'll take 37 plays by William Shakespeare in an XML format and store them in SingleStoreDB after some data cleanup. Using Python, SQLAlchemy and an XML library, we'll connect to SingleStoreDB and run several example XPath expressions over the data.

Create a Deepnote account

We'll create a free account on the Deepnote website. Once logged in, we'll create a new Deepnote project to give us a new notebook.

Create a SingleStoreDB Cloud account

A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use XML Demo Group as our Workspace Group Name and xml-demo as our Workspace Name. We'll make a note of our password and host name.

Create a Database and Table

In our SingleStore Cloud account, let's use the SQL Editor to create a new database and table:

CREATE DATABASE IF NOT EXISTS xml_db;

USE xml_db;

CREATE TABLE plays (
    id INT PRIMARY KEY NOT NULL,
    play LONGTEXT NOT NULL
);
Enter fullscreen mode Exit fullscreen mode

Data Download, Formatting and Loading

Data Download

We can obtain all the plays of William Shakespeare in XML format in a single zip file courtesy of Jon Bosak of Sun Microsystems. After unpacking the zip file, there is important information on copyright and distribution in an htm file. All the xml files are well-formed.

Data Formatting

In each xml file, we'll remove all hidden control characters and the meta information at the beginning of a file. The following bash script will achieve this and write the combined output into a single file called all_plays.csv:

#!/bin/bash

# Set the input directory
input_dir="/path/to/input_dir"

# Set the output directory
output_dir="/path/to/output_dir"

# Check if input_dir and output_dir are the same
if [ "$input_dir" -ef "$output_dir" ]; then
  echo "Error: input_dir and output_dir cannot be the same directory"
  exit 1
fi

# Initialise the loop counter
counter=1

# Loop over all files in the input directory
for file in "$input_dir"/*.xml
do
  # Check if the file is a regular file
  if [[ -f "$file" ]]; then
    # Extract the file name without the extension
    filename=$(basename "$file" .xml)

    # Remove control and newline characters from the file using tr command
    tr -dc '[:print:]' < "$file" > "$output_dir/$filename.csv.tmp"

    # Remove XML declaration and DOCTYPE declaration from the file using sed command
    sed -e '1s/<?xml version="1.0"?>//' -e '1s/<!DOCTYPE PLAY SYSTEM "play.dtd">//'< "$output_dir/$filename.csv.tmp" > "$output_dir/$filename.csv"

    # Add the loop counter at the beginning of the output file
    sed -i "1s/^/$counter|/" "$output_dir/$filename.csv"

    # Remove the temporary file
    rm "$output_dir/$filename.csv.tmp"

    # Increment the loop counter
    counter=$((counter+1))
  fi
done

# Concatenate all output files into a single file
for file in "$output_dir"/*.csv
do
  if [[ -f "$file" ]] && [[ "$file" != "$output_dir/all_plays.csv" ]]; then
    cat "$file" >> "$output_dir/all_plays.csv"
    echo >> "$output_dir/all_plays.csv"
    rm "$file"
  fi
done
Enter fullscreen mode Exit fullscreen mode

We'll replace /path/to/ with the actual path.

Data Loading

From output_dir, we'll launch a MySQL CLI client:

mysql -u admin -h <host> -P 3306 -p<password> --local-infile
Enter fullscreen mode Exit fullscreen mode

We'll replace the <host> and <password> with the values from our SingleStoreDB Cloud account.

We'll load the data, as follows:

USE xml_db;

LOAD DATA LOCAL INFILE 'all_plays.csv'
INTO TABLE plays
COLUMNS TERMINATED BY '|';
Enter fullscreen mode Exit fullscreen mode

This will create 37 rows in the database.

Deepnote notebook

First, we'll install the package:

!pip install sqlalchemy-singlestoredb
Enter fullscreen mode Exit fullscreen mode

Deepnote will prompt us to add this to the requirements.txt file.

Next, we'll create a connection:

import sqlalchemy as s2

eng = s2.create_engine(
   "singlestoredb://admin:<password>@<host>:3306/xml_db"
)
Enter fullscreen mode Exit fullscreen mode

We'll replace the <password> and <host> with the values from our SingleStoreDB Cloud account.

We'll now get one of the plays:

with eng.connect() as conn:
    query = s2.text("SELECT play FROM plays WHERE play LIKE '%Romeo and Juliet%'")
    res = conn.execute(query)
    row = res.fetchone()
    play = row.play
Enter fullscreen mode Exit fullscreen mode

We'll use the lxml library and get the root element of the document:

from lxml import etree

root = etree.fromstring(play)
Enter fullscreen mode Exit fullscreen mode

Next, let's get the title of the play:

xpath_q1 = "/PLAY/TITLE"
play_title = root.xpath(xpath_q1)[0]
print(play_title.text)
Enter fullscreen mode Exit fullscreen mode

The output should be, as follows:

The Tragedy of Romeo and Juliet
Enter fullscreen mode Exit fullscreen mode

Now, let's get the personae:

xpath_q2 = "/PLAY/PERSONAE/PERSONA"
personae = root.xpath(xpath_q2)
for persona in personae:
    print(persona.text)
Enter fullscreen mode Exit fullscreen mode

The result should be, as follows:

ESCALUS, prince of Verona. 
PARIS, a young nobleman, kinsman to the prince.
An old man, cousin to Capulet. 
ROMEO, son to Montague.
MERCUTIO, kinsman to the prince, and friend to Romeo.
BENVOLIO, nephew to Montague, and friend to Romeo.
TYBALT, nephew to Lady Capulet.
BALTHASAR, servant to Romeo.
PETER, servant to Juliet's nurse.
ABRAHAM, servant to Montague.
An Apothecary. 
Three Musicians.
Page to Paris; another Page; an officer.
LADY MONTAGUE, wife to Montague.
LADY CAPULET, wife to Capulet.
JULIET, daughter to Capulet.
Nurse to Juliet. 
Citizens of Verona; several Men and Women, relations to both houses; Maskers, Guards, Watchmen, and Attendants.
Enter fullscreen mode Exit fullscreen mode

Let's find the number of times that Juliet is the speaker:

xpath_q3 = "count(//SPEECH[SPEAKER='JULIET'])"
juliet_count = root.xpath(xpath_q3)
print(juliet_count)
Enter fullscreen mode Exit fullscreen mode

The result should be, as follows:

118.0
Enter fullscreen mode Exit fullscreen mode

We could also try the response to a question on Stackoverflow to find the number of times that Juliet speaks immediately after Romeo:

xpath_q4 = "count(//SPEECH[SPEAKER='JULIET' and preceding-sibling::SPEECH[1][SPEAKER='ROMEO']])"
romeo_and_juliet_count = root.xpath(xpath_q4)
print(romeo_and_juliet_count)
Enter fullscreen mode Exit fullscreen mode

The result should be, as follows:

37.0
Enter fullscreen mode Exit fullscreen mode

The lxml library provides many more capabilities, such as streaming large documents, for example.

Summary

In this short article, we have seen one method of storing XML data in SingleStoreDB. We have also seen how to successfully connect to our database using SQLAlchemy and how to use XPath expressions to query our data.

Top comments (0)