
Ligin Vellakkad


Skimming through a Book

My book of choice to test my UNIX skills was Mary Shelley's Frankenstein. I had heard about it but hadn't read it. I downloaded it from the Project Gutenberg site as Plain Text UTF-8. Checking the file type gives me this:

Ligin% file frankenstein.txt
frankenstein.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators

The file contains a few paragraphs of the Project Gutenberg licence and other boilerplate at the beginning and end; I remove that, and then, using a program called dos2unix, I convert the file to plain ASCII text with Unix line endings.

Ligin% dos2unix frankenstein.txt
dos2unix: converting file frankenstein.txt to Unix format...
Ligin% file frankenstein.txt
frankenstein.txt: ASCII text
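The licence stripping can be done by hand in an editor, or scripted. A minimal sed sketch, assuming the standard Project Gutenberg marker lines (the exact wording varies by edition, so check your copy first):

```shell
# Keep only the text between the Project Gutenberg START and END
# marker lines, then drop the marker lines themselves.
# The marker wording here is an assumption; verify it in your file.
sed -n '/\*\*\* START OF/,/\*\*\* END OF/p' frankenstein.txt |
  sed '1d;$d' > frankenstein.body.txt
```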

Starting with the word count (wc) command gives me the following:

Ligin% wc frankenstein.txt
7243 74952 421503 frankenstein.txt

The book contains 7243 lines, 74952 words, and 421503 characters.

The word count gives us a fair idea of the type of book. A novel typically has 40,000 words or more, a novella 17,500 to 39,999, a novelette 7,500 to 17,499, and a short story fewer than 7,500. Since this book has more than 70,000 words, we can assume it is a novel.
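That classification is easy to script; a minimal sketch using the thresholds quoted above:

```shell
# Classify the book by its word count (thresholds as quoted above)
words=$(wc -w < frankenstein.txt)
if   [ "$words" -ge 40000 ]; then echo "novel"
elif [ "$words" -ge 17500 ]; then echo "novella"
elif [ "$words" -ge 7500  ]; then echo "novelette"
else echo "short story"
fi
```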

For consistency in further analysis, I convert all the characters to lower-case and put each word on its own line:

Ligin% tr 'A-Z' 'a-z' < frankenstein.txt | tr -sc 'a-z' '\n'

Exploring the contents further, by applying the sort and uniq programs we can find the most frequently used words and how many times each appears (the final sort -nr puts the highest counts first):

Ligin% tr 'A-Z' 'a-z' < frankenstein.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -nr

It seems “the” is the most used word, appearing 4195 times, but that does not give much insight into the book. Anyway, I saved the word list into a file named frank.words, without sorting.

Ligin% tr 'A-Z' 'a-z' < frankenstein.txt | tr -sc 'a-z' '\n' > frank.words
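One way past uninformative words like “the” is a small stop-word filter before counting. A sketch with a hypothetical hand-made list (the list itself is my assumption; extend it as needed):

```shell
# Hypothetical stop-word list; extend as needed
printf 'the\nand\nof\nto\na\nin\ni\n' > stopwords.txt
# -v inverts the match, -x matches whole lines, -F treats patterns literally
grep -vxFf stopwords.txt frank.words | sort | uniq -c | sort -nr | head
```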

Now, using grep, I can find the most used words of a given length (here, exactly 12 letters):

Ligin% grep -w "[a-z]\{12\}" frank.words | sort | uniq -c | sort -nr | more

As I changed the length value, I found some interesting words and their frequencies.

2850 i
1391 a
1776 my
867 me
608 he
136 myself
134 father
71 friend
45 horror
39 months
36 geneva
34 spirit
59 clerval
55 justine
54 friends
51 cottage
76 feelings
44 creature
38 thoughts
27 murderer
92 elizabeth
65 miserable
37 mountains
21 vengeance
34 discovered
32 sensations
39 countenance
28 endeavoured
27 frankenstein
18 conversation
14 wretchedness
14 tranquillity
14 circumstances
7 disappointment
6 notwithstanding

The longest single word has 16 letters (fun fact ;)).
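The longest word can also be found directly, e.g. by printing each word with its length and sorting numerically (a sketch):

```shell
# Print "length word" for every line, longest first
awk '{ print length($0), $0 }' frank.words | sort -nr | head -1
```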

By pairing each word with the word that follows it, we can analyze a different kind of frequency distribution; such a pair is called a bigram.

I create a new file, frank.nextwords, which holds the same word list shifted up by one line relative to frank.words (on GNU systems the equivalent is tail -n +2),

Ligin% tail +2 frank.words > frank.nextwords

and by pasting the two files side by side and counting the pairs, we can form a bigram file, frank.bigram.

Ligin% paste frank.words frank.nextwords| sort | uniq -c > frank.bigram
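The paste trick is easy to check on a toy input (a sketch; the toy filenames are mine):

```shell
# Toy example: six words produce five overlapping pairs
# (plus one dangling last line), and "to be" shows up twice
printf 'to\nbe\nor\nnot\nto\nbe\n' > toy.words
tail -n +2 toy.words > toy.nextwords
paste toy.words toy.nextwords | sort | uniq -c | sort -nr
```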

Now using sort on the new bigram file,

Ligin% sort -nr frank.bigram | more

Now some interesting things start to show up.

228 i was
219 i had
100 as i
90 my father
49 my heart
46 my eyes
41 the cottage
36 i thought
35 my mind
31 my friend
31 my dear

For now, I think this is what the book says:

It’s told from a first-person point of view (maybe Frankenstein’s), more like a diary. The narrator is very close to his/her father. Other characters include Elizabeth, Justine, and Clerval. The narrator is close to his friends too. There is a cottage involved, probably where the character lives. Someone is accused of a murder, and there is some sort of horror spread around, which also makes it reasonable to say that this could be a mystery novel.

To know how close I got, I need to sit down and read it. Since it involves horror and murder, I’d better not read it before bed ;)

Sayonara
