Note: this article is also available in Portuguese 🌎.
A landmark of Gothic literature, Dracula, written by Bram Stoker in 1897, still stirs the emotions of readers around the world. Today, to introduce some of Spark's concepts and features, we will build a short notebook that finds the most common words in this classic book 🧛🏼‍♂️.
To do this, we will write a notebook in Google Colab, a cloud service built by Google to support machine learning and artificial intelligence research.
This notebook is also available in my GitHub 😉.
The novel was obtained from Project Gutenberg, a digital library that collects public-domain books from around the world.
Before getting started
Before we start, we need to install the PySpark library.
PySpark is the official Python API for Apache Spark. We will use it for our data analysis 🎲.
So, create a new code cell in Colab and add the following line:
!pip install pyspark
Step one: running Apache Spark
After the installation is complete, we need to run Apache Spark. To do this, create a new code cell and add the following code block:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
    .appName("The top most common words in Dracula, by Bram Stoker")
    .getOrCreate()
)
Step two: downloading and reading
In this step, we will download the novel from Project Gutenberg and then load it using PySpark.
We will use the wget tool, passing it the book's URL and saving the file in the local directory as Dracula - Bram Stoker.txt.
Again, create a new code cell in Colab and add the following line:
!wget https://www.gutenberg.org/cache/epub/345/pg345.txt -O "Dracula - Bram Stoker.txt"
Step three: stopwords downloading
In this section, we will download a list of English stopwords. Stopwords typically include prepositions, particles, interjections, conjunctions, adverbs, pronouns, introductory words, the digits 0 to 9, other frequently used function words, symbols, and punctuation. More recently, such lists have been extended with character sequences that are common on the Internet, such as www, com, and http.
This list was obtained from CountWordsFree, a website that provides stopword lists for many languages.
Let's get to work! Create a new code cell in Colab and add the following line:
!wget https://countwordsfree.com/stopwords/english/txt -O "stop_words_english.txt"
After that, let’s load the book using Spark. Create a new code cell and add the following code block:
book = spark.read.text("Dracula - Bram Stoker.txt")
Let’s load the stopwords as well. They will be stored as a list in the stopwords variable.
with open("stop_words_english.txt", "r") as f:
    text = f.read()
stopwords = text.splitlines()
len(stopwords), stopwords[:15]
Output
(851,
['able',
'about',
'above',
'abroad',
'according',
'accordingly',
'across',
'actually',
'adj',
'after',
'afterwards',
'again',
'against',
'ago',
'ahead'])
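In plain Python, a list has to be scanned element by element on every membership test, so a set is a common choice when the same lookup is repeated many times. The sketch below is only an illustration and is not part of the original notebook: `load_stopwords` is a hypothetical helper, and the in-memory sample stands in for the downloaded file.

```python
import io


def load_stopwords(handle):
    """Read one stopword per line, skipping blank lines and trimming whitespace."""
    return {line.strip().lower() for line in handle if line.strip()}


# Stand-in for open("stop_words_english.txt"); the real file is much longer.
sample_file = io.StringIO("able\nabout\n\nAbove \n")
stopword_set = load_stopwords(sample_file)

print(sorted(stopword_set))
```

Trimming and lowercasing here is a defensive assumption, in case the file contains stray spaces or capital letters.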
Step four: extracting words
Once the book is loaded, we need to extract the words into a DataFrame column.
To do this, we apply the split function to each line, splitting on the single space between words. The result is a list of words for each line.
from pyspark.sql.functions import split
lines = book.select(split(book.value, " ").alias("line"))
lines.show(5)
Output
+--------------------+
| line|
+--------------------+
|[The, Project, Gu...|
| []|
|[This, eBook, is,...|
|[most, other, par...|
|[whatsoever., You...|
+--------------------+
only showing top 5 rows
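To see why some rows look empty in the output above, it can help to try the same kind of split on a few lines in plain Python. This is only a sketch outside Spark: `sample_lines` is made-up data, and Python's `str.split(" ")` is used here as a stand-in that behaves like the single-space pattern for these simple cases.

```python
sample_lines = ["The Project Gutenberg eBook", "", "two  spaces"]

# Split each line on exactly one space, as the notebook does with " ".
tokens_per_line = [line.split(" ") for line in sample_lines]

for tokens in tokens_per_line:
    print(tokens)
```

An empty line yields a list holding one empty string, and two consecutive spaces yield an empty string between the words. These empty tokens are what the later cleaning steps have to deal with.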
Step five: exploding list words
Now, let’s turn each list of words into rows of a single DataFrame column, one word per row, using the explode function.
from pyspark.sql.functions import explode, col
words = lines.select(explode(col("line")).alias("word"))
words.show(15)
Output
+---------+
| word|
+---------+
| The|
| Project|
|Gutenberg|
| eBook|
| of|
| Dracula,|
| by|
| Bram|
| Stoker|
| |
| This|
| eBook|
| is|
| for|
| the|
+---------+
only showing top 15 rows
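Conceptually, explode turns every element of a list into its own row. A rough plain-Python analogy, assuming a small in-memory list of lists named `token_lists` (made up for illustration), is flattening:

```python
from itertools import chain

# One inner list per line of text, as produced by a split step.
token_lists = [["The", "Project"], [""], ["This", "eBook"]]

# Flatten into a single sequence with one word per entry.
flat_words = list(chain.from_iterable(token_lists))

print(flat_words)
```

The empty string from the blank line survives the flattening, which matches the blank row visible in the table above.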
Step six: words to lowercase
This is a simple step. We don't want the same word to be counted separately because of capital letters, so we convert every word to lowercase using the lower function.
from pyspark.sql.functions import lower
words_lower = words.select(lower(col("word")).alias("word_lower"))
words_lower.show()
Output
+----------+
|word_lower|
+----------+
| the|
| project|
| gutenberg|
| ebook|
| of|
| dracula,|
| by|
| bram|
| stoker|
| |
| this|
| ebook|
| is|
| for|
| the|
| use|
| of|
| anyone|
| anywhere|
| in|
+----------+
only showing top 20 rows
Step seven: removing punctuations
So that the same word is not counted separately because of punctuation attached to it, we need to remove the punctuation.
We'll do this using the regexp_extract function, which extracts the part of a string that matches a regex. Here, the pattern [a-z]+ keeps the first run of lowercase letters in each word.
from pyspark.sql.functions import regexp_extract
words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word")
)
words_clean.show()
Output
+---------+
| word|
+---------+
| the|
| project|
|gutenberg|
| ebook|
| of|
| dracula|
| by|
| bram|
| stoker|
| |
| this|
| ebook|
| is|
| for|
| the|
| use|
| of|
| anyone|
| anywhere|
| in|
+---------+
only showing top 20 rows
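Because the pattern keeps only the first run of lowercase letters, it does a little more than strip trailing punctuation. The plain-Python sketch below mimics that behavior with `re.search`. `first_letter_run` is a hypothetical helper and the sample tokens are made up; returning an empty string when nothing matches mirrors what regexp_extract does in that case.

```python
import re


def first_letter_run(token):
    """Return the first run of lowercase ASCII letters, or "" if there is none."""
    match = re.search(r"[a-z]+", token)
    return match.group(0) if match else ""


samples = ["dracula,", "don't", "--", "well-known", "1897"]
cleaned = [first_letter_run(token) for token in samples]

for token, result in zip(samples, cleaned):
    print(repr(token), "->", repr(result))
```

Contractions and hyphenated words are cut at the first non-letter, and tokens without letters become empty strings. That is an acceptable trade-off for a quick word count, but worth keeping in mind when reading the results.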
Step eight: removing null values
However, as you can see, there are still empty values in the column. Strictly speaking these are empty strings rather than nulls.
We need to remove them so that these empty values are not analyzed.
words_nonull = words_clean.filter(col("word") != "")
words_nonull.show()
Output
+---------+
| word|
+---------+
| the|
| project|
|gutenberg|
| ebook|
| of|
| dracula|
| by|
| bram|
| stoker|
| this|
| ebook|
| is|
| for|
| the|
| use|
| of|
| anyone|
| anywhere|
| in|
| the|
+---------+
only showing top 20 rows
Step nine: removing stopwords
We are almost there! The last step is to remove the stopwords so that, again, these words are not analyzed.
words_without_stopwords = words_nonull.filter(
    ~words_nonull.word.isin(stopwords)
)
words_count_before_removing = words_nonull.count()
words_count_after_removing = words_without_stopwords.count()
words_count_before_removing, words_count_after_removing
Output
(163399, 50222)
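The two counts above show how much of the text consists of stopwords. The sketch below first applies the same kind of filter to a tiny made-up sample in plain Python, then works out the share removed from the counts reported in the output. The names `sample_words` and `sample_stopwords` are assumptions for illustration only.

```python
# Tiny in-memory stand-ins for the word column and the stopword list.
sample_stopwords = {"the", "of", "by"}
sample_words = ["the", "project", "gutenberg", "ebook", "of", "dracula", "by", "bram", "stoker"]

# Keep only the words that are not stopwords.
kept = [word for word in sample_words if word not in sample_stopwords]
print(kept)

# Counts reported by the notebook before and after removing stopwords.
before, after = 163399, 50222
removed = before - after
share_removed = removed / before
print(f"{removed} words removed ({share_removed:.1%})")
```

With the reported counts, roughly 69% of the words in the book were stopwords.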
Step ten: analyzing the most common words in Dracula, finally!
Finally, our data is completely clean, so we can now analyze the most common words in the book.
First, we group the rows by word, and then we use an aggregate function to count them.
words_count = (words_without_stopwords.groupby("word")
    .count()
    .orderBy("count", ascending=False)
)
Then we show the 20 most common words. This number can be changed through the rank variable.
rank = 20
words_count.show(rank)
Output
+--------+-----+
| word|count|
+--------+-----+
| time| 381|
| helsing| 323|
| van| 322|
| lucy| 297|
| good| 256|
| man| 255|
| mina| 240|
| dear| 224|
| night| 224|
| hand| 209|
| room| 207|
| face| 206|
|jonathan| 206|
| count| 197|
| door| 197|
| sleep| 192|
| poor| 191|
| eyes| 188|
| work| 188|
| dr| 187|
+--------+-----+
only showing top 20 rows
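For a small in-memory list, the same group, count, and sort idea can be sketched in plain Python with `collections.Counter`. This is an analogy rather than the notebook's method, and `sample_words` is made-up data.

```python
from collections import Counter

sample_words = ["night", "time", "lucy", "time", "night", "time"]

# Count occurrences of each word, then take the most frequent ones.
counts = Counter(sample_words)
rank = 2
top = counts.most_common(rank)

for word, count in top:
    print(f"{word:>8} {count:>5}")
```

When sorting by count alone, words with equal counts, such as dear and night at 224 in the table above, have no guaranteed relative order.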
Conclusion
That’s all for now, folks! In this article, we analyzed the most common words in Dracula, written by Bram Stoker. To do this, we cleaned the words by removing punctuation, converting uppercase letters to lowercase, and removing stopwords.
I hope you enjoyed it. Keep those stakes sharp, watch out for the shadows that walk at night, and see you next time 🧛🏼‍♂️🍷.
Bibliography
RIOUX, Jonathan. Data Analysis with Python and PySpark.
STOKER, Bram. Dracula.