<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ligin Vellakkad</title>
    <description>The latest articles on DEV Community by Ligin Vellakkad (@liginv).</description>
    <link>https://dev.to/liginv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F49351%2F9e12576a-f682-4c16-a183-cf6662206f6d.png</url>
      <title>DEV Community: Ligin Vellakkad</title>
      <link>https://dev.to/liginv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/liginv"/>
    <language>en</language>
    <item>
      <title>Skimming through a Book</title>
      <dc:creator>Ligin Vellakkad</dc:creator>
      <pubDate>Thu, 14 Dec 2017 05:05:32 +0000</pubDate>
      <link>https://dev.to/liginv/skimming-through-a-book-a5o</link>
      <guid>https://dev.to/liginv/skimming-through-a-book-a5o</guid>
      <description>&lt;p&gt;My book of choice for testing my UNIX skills was Mary Shelley’s Frankenstein. I had heard of it but hadn’t read it. I downloaded it from the &lt;a href="https://www.gutenberg.org/ebooks/84"&gt;Project Gutenberg&lt;/a&gt; site as Plain Text UTF-8. Checking the file type gives me this:&lt;/p&gt;

&lt;pre&gt;
Ligin% file frankenstein.txt
frankenstein.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
&lt;/pre&gt;

&lt;p&gt;The file contains a few paragraphs of Project Gutenberg license text &amp;amp; other material at the beginning &amp;amp; end. I removed that by hand and then ran another program called &lt;a href="https://www.lifewire.com/dos2unix-linux-command-4091910"&gt;&lt;strong&gt;dos2unix&lt;/strong&gt;&lt;/a&gt;, which converts the CRLF line endings to Unix ones. After that, the file no longer shows a BOM or CRLF terminators, just ASCII text.&lt;/p&gt;

&lt;pre&gt;
Ligin% dos2unix frankenstein.txt
dos2unix: converting file frankenstein.txt to Unix format...
Ligin% file frankenstein.txt
frankenstein.txt: ASCII text
&lt;/pre&gt;




&lt;p&gt;Starting with the word count (&lt;strong&gt;wc&lt;/strong&gt;) command gives me the following:&lt;/p&gt;

&lt;pre&gt;
Ligin% wc frankenstein.txt
7243 74952 421503 frankenstein.txt
&lt;/pre&gt;

&lt;p&gt;The book contains 7243 &lt;strong&gt;lines&lt;/strong&gt;, 74952 &lt;strong&gt;words&lt;/strong&gt;, and 421503 &lt;strong&gt;characters&lt;/strong&gt; (wc counts bytes by default, but in an ASCII file every character is one byte).&lt;/p&gt;

&lt;p&gt;The number of words gives us a fair idea about the type of book. A typical &lt;strong&gt;Novel&lt;/strong&gt; has 40,000 words or more, a &lt;strong&gt;Novella&lt;/strong&gt; has 17,500 to 39,999 words, a &lt;strong&gt;Novelette&lt;/strong&gt; has 7,500 to 17,499 words &amp;amp; a &lt;strong&gt;Short Story&lt;/strong&gt; has fewer than 7,500 words. Since this book has more than 70,000 words, we can assume it is a novel.&lt;/p&gt;

&lt;p&gt;To keep the rest of the analysis consistent, I convert all the characters to lower case and put one word on each line.&lt;/p&gt;

&lt;pre&gt;Ligin% tr 'A-Z' 'a-z' &amp;lt; frankenstein.txt | tr -sc 'a-z' '\n'&lt;/pre&gt;
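
&lt;p&gt;To see what the two tr calls do on something smaller than a whole novel, here is a minimal sketch on a made-up sentence. The sample text and the variable names are hypothetical, not taken from the book.&lt;/p&gt;

```shell
# Hypothetical sample text; any short string works.
sample='The Monster saw the monster.'

# First tr: fold upper case to lower case.
# Second tr: -c takes the complement of a-z (everything that is not a letter),
# -s squeezes each run of replacements into a single newline.
words=$(printf '%s\n' "$sample" | tr 'A-Z' 'a-z' | tr -sc 'a-z' '\n')

# One lower-case word per line.
printf '%s\n' "$words"
```

&lt;p&gt;Both spellings of "monster" end up identical, which is the point of lower-casing before counting.&lt;/p&gt;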

&lt;p&gt;Now, digging further into the contents of the book, the &lt;strong&gt;sort&lt;/strong&gt; &amp;amp; &lt;strong&gt;uniq&lt;/strong&gt; programs let us find the most frequently used words and the number of times each one appears.&lt;/p&gt;

&lt;pre&gt;
Ligin% tr 'A-Z' 'a-z' &amp;lt; frankenstein.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -nr | more
&lt;/pre&gt;
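
&lt;p&gt;The counting step can be tried on a toy sentence first. This is only a sketch with hypothetical sample text; the final sort uses -n (numeric) and -r (reverse) so that the largest count comes first.&lt;/p&gt;

```shell
# Hypothetical sample text, already lower case.
sample='the cat and the dog and the bird'

# sort groups identical words together, uniq -c collapses each group
# into "count word", and sort -nr ranks the groups by count, largest first.
counts=$(printf '%s\n' "$sample" | tr -sc 'a-z' '\n' | sort | uniq -c | sort -nr)

printf '%s\n' "$counts"

# The word on the first line is the most frequent one.
top=$(printf '%s\n' "$counts" | head -n 1 | awk '{ print $2 }')
printf 'most frequent: %s\n' "$top"
```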

&lt;p&gt;It seems &lt;strong&gt;“the”&lt;/strong&gt; is the most used word, appearing 4195 times, but that does not give any insight into the book. Anyway, I saved the word list into a file named frank.words, without sorting.&lt;/p&gt;

&lt;pre&gt;
Ligin% tr 'A-Z' 'a-z' &amp;lt; frankenstein.txt | tr -sc 'a-z' '\n' &amp;gt; frank.words
&lt;/pre&gt;

&lt;p&gt;Now, using &lt;strong&gt;grep&lt;/strong&gt;, I can find the most used words of a given length, here 12 letters.&lt;/p&gt;

&lt;pre&gt;
Ligin% grep -w "[a-z]\{12\}" frank.words | sort | uniq -c | sort -nr | more
&lt;/pre&gt;
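
&lt;p&gt;As a small sketch of the length filter, the following uses a hypothetical word list (one word per line, like frank.words) and keeps only the six-letter words before counting them. The list and the variable names are made up for illustration.&lt;/p&gt;

```shell
# Hypothetical word list, one word per line.
words=$(printf '%s\n' i my me father friend horror elizabeth father)

# \{6\} asks for exactly six letters and -w makes the match a whole word,
# so the nine-letter "elizabeth" is not picked up through a six-letter piece.
six=$(printf '%s\n' "$words" | grep -w '[a-z]\{6\}' | sort | uniq -c | sort -nr)

printf '%s\n' "$six"
```

&lt;p&gt;Changing the 6 to another number gives the ranking for that word length.&lt;/p&gt;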

&lt;p&gt;As I changed the length, I found some interesting words &amp;amp; their frequencies.&lt;/p&gt;

&lt;pre&gt;
2850 i
1391 a
1776 my
867 me
608 he
136 myself
134 father
71 friend
45 horror
39 months
36 geneva
34 spirit
59 clerval
55 justine
54 friends
51 cottage
76 feelings
44 creature
38 thoughts
27 murderer
92 elizabeth
65 miserable
37 mountains
21 vengeance
34 discovered
32 sensations
39 countenance
28 endeavoured
27 frankenstein
18 conversation
14 wretchedness
14 tranquillity
14 circumstances
7 disappointment
6 notwithstanding
&lt;/pre&gt;

&lt;p&gt;The longest single word has 16 letters (fun fact ;) ).&lt;/p&gt;

&lt;p&gt;By pairing each word with the word that follows it, we can analyze a different kind of frequency distribution. Such a pair of adjacent words is called a &lt;a href="https://en.wikipedia.org/wiki/Bigram"&gt;bigram&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First I create a new file, frank.nextwords, which holds the same words as frank.words but shifted up by one line,&lt;/p&gt;

&lt;pre&gt;
Ligin% tail -n +2 frank.words &amp;gt; frank.nextwords
&lt;/pre&gt;

&lt;p&gt;and then, by pasting these two files side by side &amp;amp; counting the resulting pairs, we can form a bigram file, frank.bigram.&lt;/p&gt;

&lt;pre&gt;
Ligin% paste frank.words frank.nextwords| sort | uniq -c &amp;gt; frank.bigram
&lt;/pre&gt;
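
&lt;p&gt;The same pairing can be sketched without the intermediate files. This is a swapped-in variant, not the paste approach above: awk remembers the previous word and prints it next to the current one. The sample sentence and the variable names are hypothetical.&lt;/p&gt;

```shell
# Hypothetical sample text, already lower case.
sample='i was there and i was cold'

# awk keeps the previous word in prev and, from the second line on,
# prints "previous current"; sort, uniq -c and sort -nr then rank the pairs.
bigrams=$(printf '%s\n' "$sample" | tr -sc 'a-z' '\n' |
  awk 'NR != 1 { print prev, $0 } { prev = $0 }' |
  sort | uniq -c | sort -nr)

printf '%s\n' "$bigrams"
```

&lt;p&gt;Unlike paste, this variant does not emit a final pair made of the last word and an empty field.&lt;/p&gt;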

&lt;p&gt;Now using sort on the new bigram file,&lt;/p&gt;

&lt;pre&gt;
Ligin% sort -nr frank.bigram | more
&lt;/pre&gt;

&lt;p&gt;Now some interesting things start to show up.&lt;/p&gt;

&lt;pre&gt;
228 i was
219 i had
100 as i
90 my father
49 my heart
46 my eyes
41 the cottage
36 i thought
35 my mind
31 my friend
31 my dear
&lt;/pre&gt;

&lt;p&gt;For now, I think this is what the book says:&lt;/p&gt;

&lt;p&gt;It’s told from a first-person point of view (maybe Frankenstein’s), more like a diary. The person is very close to his/her father. Other characters include Elizabeth, Justine, and Clerval. The person is close to his/her friends too. There is a cottage involved, probably where the character lives. Someone is accused of a murder and there is some sort of horror spreading around, which makes it reasonable to guess that this could be a mystery novel.&lt;/p&gt;

&lt;p&gt;To know how close I got, I need to sit down &amp;amp; read it. Since it involves horror &amp;amp; murder, I’d better not read it before bed ;)&lt;/p&gt;

&lt;p&gt;Sayonara&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>linux</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
