My book of choice to test my UNIX skills was Mary Shelley’s Frankenstein. I had heard about it but hadn’t read it. I downloaded it from the Project Gutenberg site as Plain Text UTF-8. Checking the file type gives me this:
Ligin% file frankenstein.txt
frankenstein.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
The file contains a few paragraphs of Project Gutenberg license text & other material at the beginning & end; I remove that, & using another program called dos2unix, I convert the file into ASCII text.
Ligin% dos2unix frankenstein.txt
dos2unix: converting file frankenstein.txt to Unix format...
Ligin% file frankenstein.txt
frankenstein.txt: ASCII text
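Stripping the license text can itself be done from the shell. A minimal sketch, assuming the edition uses the usual `*** START OF` / `*** END OF` markers (the exact wording varies between Gutenberg releases, so check your copy first); a tiny inline file stands in for the real book here:

```shell
# Build a tiny stand-in file with Gutenberg-style markers.
printf '%s\n' 'license text' \
              '*** START OF THE PROJECT GUTENBERG EBOOK ***' \
              'the story' \
              '*** END OF THE PROJECT GUTENBERG EBOOK ***' \
              'more license' > sample.txt

# Keep only the lines between the markers, then drop the marker lines themselves.
sed -n '/^\*\*\* START OF/,/^\*\*\* END OF/p' sample.txt | sed '1d;$d'
```

With the real file, the same two `sed` passes over frankenstein.txt leave just the body of the novel.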
Starting with the word count (wc) command gives me the following:
Ligin% wc frankenstein.txt
    7243   74952  421503 frankenstein.txt
The book contains 7243 lines, 74952 words, and 421503 characters.
The number of words gives us a fair idea about the type of book. A typical novel has 40,000 words or more, a novella has 17,500 to 39,999 words, a novelette has 7,500 to 17,499 words, & a short story has under 7,500 words. Since this book has more than 70,000 words, we can assume it to be a novel.
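Those thresholds are easy to encode in the shell. A small sketch — the count is hard-wired so the snippet stands alone; with the real file you'd use `words=$(wc -w < frankenstein.txt)` instead:

```shell
words=74952   # from the wc output above; replace with: words=$(wc -w < frankenstein.txt)
if   [ "$words" -ge 40000 ]; then class="novel"
elif [ "$words" -ge 17500 ]; then class="novella"
elif [ "$words" -ge 7500  ]; then class="novelette"
else                              class="short story"
fi
echo "$words words -> $class"
```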
For consistency in further analysis, I convert all the characters to lower-case and split the text into one word per line:
Ligin% tr 'A-Z' 'a-z' < frankenstein.txt | tr -sc 'a-z' '\n'
Now, exploring the contents of the book further by applying the sort & uniq programs, we can find the most frequently used words and the number of times each appears.
Ligin% tr 'A-Z' 'a-z' < frankenstein.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort
It seems “the” is the most used word, appearing 4195 times, but that does not give any insights into the book. Anyway, I saved the word list into a file named frank.words, without sorting.
Ligin% tr 'A-Z' 'a-z' < frankenstein.txt | tr -sc 'a-z' '\n' > frank.words
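One way around the flood of “the”, “and” & friends is a stop-word filter. This isn't a step from the post — a sketch with a toy stop list of my own, and an inline sample standing in for frank.words:

```shell
# Toy stop list and sample word file; with the real data you'd filter frank.words.
printf '%s\n' the and of to a i my me > stop.words
printf '%s\n' the monster the the my creature monster > sample.words

# -v invert match, -x whole-line match, -F fixed strings, -f read patterns from a file
grep -vxFf stop.words sample.words | sort | uniq -c | sort -rn
```

Only the content words survive, counted and ranked.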
Now, using grep, I can find the most used words of different lengths.
Ligin% grep -w "[a-z]\{12\}" frank.words | sort | uniq -c | sort -nr | more
As I changed the length value, I found some interesting words & their frequencies.
2850 i
1391 a
1776 my
 867 me
 608 he
 136 myself
 134 father
  71 friend
  45 horror
  39 months
  36 geneva
  34 spirit
  59 clerval
  55 justine
  54 friends
  51 cottage
  76 feelings
  44 creature
  38 thoughts
  27 murderer
  92 elizabeth
  65 miserable
  37 mountains
  21 vengeance
  34 discovered
  32 sensations
  39 countenance
  28 endeavoured
  27 frankenstein
  18 conversation
  14 wretchedness
  14 tranquillity
  14 circumstances
   7 disappointment
   6 notwithstanding
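The “change the values” step can be wrapped in a loop that prints the most frequent word at each length. A sketch over an inline sample — with the real data, point it at frank.words instead:

```shell
# Sample stand-in for frank.words (one lowercase word per line).
printf '%s\n' my my father father father horror creature creature > sample.words

for len in 2 6 8; do
  # -x anchors the pattern to the whole line, so only words of exactly $len letters match
  grep -x "[a-z]\{$len\}" sample.words | sort | uniq -c | sort -rn | head -1
done
```

Each pass prints the winning word for that length with its count.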
The longest single word has 16 letters (fun fact ;) ).
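Finding the longest word is a one-liner with awk. A sketch on a toy sample — run it against frank.words for the real answer:

```shell
# Sample stand-in for frank.words.
printf '%s\n' the monster wretchedness notwithstanding > sample.words

# Each line is one word; track the longest one seen so far.
awk '{ if (length($0) > max) { max = length($0); word = $0 } }
     END { print max, word }' sample.words
```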
By arranging each pair of adjacent words into a sequence, we can analyze a different kind of frequency distribution among the words; such a pair is called a bigram.
Now I create a new file, frank.nextwords, which holds the same words shifted up by one line relative to frank.words,
Ligin% tail -n +2 frank.words > frank.nextwords
and by pasting & sorting these two files side by side, we can build a bigram file, frank.bigram.
Ligin% paste frank.words frank.nextwords | sort | uniq -c > frank.bigram
Now using sort on the new bigram file,
Ligin% sort -nr frank.bigram | more
Now some interesting things start to show up.
228 i was
219 i had
100 as i
 90 my father
 49 my heart
 46 my eyes
 41 the cottage
 36 i thought
 35 my mind
 31 my friend
 31 my dear
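The same paste trick extends to trigrams: shift the word list twice and paste three columns. This isn't in the original post — a sketch with an inline sample in place of frank.words:

```shell
# Sample stand-in for frank.words.
printf '%s\n' i was so happy i was so tired > sample.words

tail -n +2 sample.words > sample.next    # shifted up by one line
tail -n +3 sample.words > sample.next2   # shifted up by two lines

# Each output line is a tab-separated trigram with its count.
paste sample.words sample.next sample.next2 | sort | uniq -c | sort -rn
```

Here “i was so” tops the list, since it occurs twice in the sample.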
For now, I think this is what the book says:
It’s told from a first-person point of view (maybe Frankenstein’s), more like a diary. The narrator is very close to his/her father. Some other characters are Elizabeth, Justine, and Clerval. The narrator is close to his friends too. There is a cottage involved, probably where the character lives. Someone is accused of a murder and there is some sort of horror spread around, which also makes it reasonable to say that this could be a mystery novel.
To know how close I got, I need to sit down & read it. Since it involves horror & murder, I’d better not read it before bed ;)
Sayonara