On 28th May, a post on r/dataisbeautiful inspired me to learn how to make word clouds myself. Being a huge Harry Potter fan, the data I was going to use was obvious. Using the books seemed too simple, so I decided to scrape 250 stories from Fanfiction.net and make a word cloud from that data. I posted my first attempt on r/dataisbeautiful and, based on the feedback I received, decided to write this blog post.
I used a simple Python + BeautifulSoup combination to scrape the stories from Fanfiction.net. I sorted the stories by favorite count and filtered them to those with more than 100k words. (Link to the URL). I scraped the first 10 pages (each page has 25 stories), resulting in 250 stories. It took me a total of 10 hours (7 on one day and 3 on the next) to scrape all the stories.
Taking hints from the original post, I used nltk to tokenize the stories and removed the common words listed in the nltk English stopwords corpus. This was my first attempt at doing anything like this, and the process initially took 3-4 minutes per story. After some optimization, I was able to reduce that to 1-2 minutes per story. I talked to a friend about the problem, and he suggested I try multiprocessing. After adding multiprocessing, I had the idea of distributing the load over two machines (my laptop and a Raspberry Pi 4B). I copied the script and 25% of the stories over to the Pi and started the job.
Additional tip: screen is a good utility for running long jobs over SSH, because the job keeps running even if the connection drops.
The processing took me an hour. I didn't want to repeat it if I needed to remove more words later, so I decided to store the word frequency data in JSON files. (This turned out to be really helpful later.)
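The tokenize, filter, and count step described above can be sketched roughly as follows. This is not the original script: it swaps in a regular-expression tokenizer and a tiny hand-written stopword set in place of nltk so that it runs with the standard library alone, and the names `count_words` and `count_all` are made up for illustration.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Tiny stand-in for the nltk English stopwords corpus (an assumption for this sketch).
# A frozenset gives constant-time membership checks.
STOPWORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "was", "he", "she", "it", "his", "her"}
)

# Stand-in for nltk tokenization: runs of lowercase letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def count_words(text):
    """Lowercase the text, split it into words, and count the non-stopwords."""
    words = WORD_RE.findall(text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def count_all(stories, workers=4):
    """Count each story in a separate worker process; returns one Counter per story."""
    with Pool(processes=workers) as pool:
        return pool.map(count_words, stories)
```

When run as a script, `count_all` should be called from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn fresh worker processes. The per-story counters can then be merged with `sum(per_story, Counter())`.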
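Caching the counts could look something like the sketch below. The function names and the one-file-per-story layout are assumptions for illustration, not a description of the actual files.

```python
import json
from collections import Counter
from pathlib import Path


def save_counts(counts, path):
    """Write a {word: count} mapping to a JSON file."""
    Path(path).write_text(json.dumps(counts, ensure_ascii=False), encoding="utf-8")


def load_counts(path):
    """Read a JSON file written by save_counts back into a Counter."""
    return Counter(json.loads(Path(path).read_text(encoding="utf-8")))
```

Because a `Counter` is a `dict` subclass, it serialises to a plain JSON object, and reloading it only takes a moment compared with re-tokenizing the stories.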
I took a look at the wordcloud Python package and adapted the code from its examples to generate the word cloud.
To make the mask image, I downloaded some images from the Internet and used Inkscape to fix them.
After posting the first attempt on Reddit and Twitter, I received a lot of feedback. The most common items were questions about why Daphne is so frequent and why Ron is so infrequent (I will answer both later), suggestions to remove more words so that the cloud focuses on Harry Potter-related words, and requests for other visualisations, especially ones comparing the books and fanfiction.
In my first attempt, I used the nltk English stopwords corpus, which is just 179 words. I searched for a bigger list and ended up using a customised version of the 20,000 most common words list from the google-10000-english repository. What were the customisations? I had to take some words (like magic, magical, wand, wards, vampire, etc.) and some names (Harry, Ron, Fred, Arthur, etc.) out of the 20k list so that they would not be removed from my analysis. Having stored the results of the processing from my first attempt in JSON files saved me from spending another hour on processing. I just removed the unwanted keys from each data file.
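The customisation and the key removal can be sketched as two small functions. The function names and the sample words in the usage note are illustrative assumptions, not the actual lists.

```python
def build_stoplist(common_words, keep):
    """Return the common-words list as a set, minus the words that must survive."""
    keep_lower = {w.lower() for w in keep}
    return {w.lower() for w in common_words} - keep_lower


def prune_counts(counts, stoplist):
    """Drop every key of a {word: count} mapping that appears in the stoplist."""
    return {word: n for word, n in counts.items() if word.lower() not in stoplist}
```

For example, with `common_words = ["said", "looked", "magic", "harry"]` and `keep = ["magic", "harry"]`, the resulting stoplist contains only `said` and `looked`, so pruning leaves the Harry Potter-specific words in place.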
I also downloaded text versions of the 7 books from somewhere on the Internet, sanitised them a bit, and applied the same process I had used on the fanfiction stories to generate their data. Using that data, I was able to compare how often some words occur in fanfiction vs canon. Since I had the data and the code, I decided to make the corresponding word clouds for the books as well.
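Since 250 long stories contain far more text than 7 books, raw counts are not directly comparable. One way to compare them, sketched below, is to normalise each count to a rate per 10,000 counted words. The normalisation and the function names are assumptions for illustration; the totals here are taken after stopword removal.

```python
def per_10k(counts, word):
    """Occurrences of `word` per 10,000 counted words (0.0 for an empty mapping)."""
    total = sum(counts.values())
    return 10_000 * counts.get(word, 0) / total if total else 0.0


def compare(fanfic, canon, words):
    """Map each word to its (fanfiction rate, canon rate) pair."""
    return {w: (per_10k(fanfic, w), per_10k(canon, w)) for w in words}
```

Calling `compare(fanfic_counts, canon_counts, ["daphne", "ron"])` then gives one pair of rates per name, which can be read side by side or plotted.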
Hermione’s name was called. Trembling, she left the chamber with Anthony Goldstein, Gregory Goyle, and Daphne Greengrass. Students who had already been tested did not return afterward, so Harry and Ron had no idea how Hermione had done.
-> Harry Potter and the Order of the Phoenix
Daphne Greengrass is almost a non-entity in canon, and a blank slate for fanfiction writers. In canon and most fanfiction, she is the sister of Astoria Greengrass (another near non-entity), who becomes the wife of Draco Malfoy. In fanfiction, she is usually a Slytherin because of her ambition and cunning, not because she is a Pureblood supremacist. Her family is depicted as Light or Grey, and supports "Lord Potter". She is a popular pairing in Independent Harry stories.
Her being a blank slate character-wise is a boon for writers who want to write an OC without explicitly saying so.
Ron is almost the opposite of Daphne. JKR wrote Ron so well that many fanfiction writers are unable to write a good Ron. In canon, Ron is flawed but also very funny, brave and loyal to his friends. In fanfiction, especially where Harry is very different from canon (Independent, Super-Powered, Lord Potter, etc.), Harry usually ignores Ron (if the story diverges before Hogwarts), or the author does a lot of Ron bashing to justify Harry breaking off their friendship.
Chamber of Secrets
Prisoner of Azkaban
Goblet of Fire
I tried to use an image of the Triwizard Trophy. Words like Cedric, Beauxbatons, Crouch and Durmstrang start appearing.
Order of the Phoenix
I tried to use an image of a phoenix. Umbridge is very popular in this book.
Half Blood Prince
I used an image of the Half Blood Prince for this book. Apart from the usual, Slughorn is the most common word in this book.
Deathly Hallows
I used an image of the Deathly Hallows for this. You will see that "wand" becomes much more frequent because of the "Elder Wand". Hallows, Cloak and Wandmaker appear. Also, Griphook is back.
I am planning to scrape AO3 in the future to do some more analysis. I might also create word clouds from other popular books.