<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arthur Costa</title>
    <description>The latest articles on DEV Community by Arthur Costa (@atdayan).</description>
    <link>https://dev.to/atdayan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F850763%2F7c329f4b-d626-47d0-9b84-11768d475d6b.jpeg</url>
      <title>DEV Community: Arthur Costa</title>
      <link>https://dev.to/atdayan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/atdayan"/>
    <language>en</language>
    <item>
      <title>Using basic awk, sed and grep to manipulate real data</title>
      <dc:creator>Arthur Costa</dc:creator>
      <pubDate>Wed, 27 Apr 2022 05:15:40 +0000</pubDate>
      <link>https://dev.to/atdayan/using-basic-awk-sed-and-grep-for-manipulate-real-data-1jlc</link>
      <guid>https://dev.to/atdayan/using-basic-awk-sed-and-grep-for-manipulate-real-data-1jlc</guid>
      <description>&lt;p&gt;We all know about the power of the command line utilities, but this toolset is rarely put to work. In this short note we are going to manipulate a CSV file to extract some information from it.&lt;/p&gt;

&lt;p&gt;In order for the reader to fully understand the content of this tutorial, minimal knowledge of the command line is required. The file containing all the data we will use is named &lt;code&gt;all_month.csv&lt;/code&gt;. You can put the file in whichever directory you want.&lt;/p&gt;

&lt;p&gt;The data we will analyze is a simple comma-separated values (CSV) text file taken from the United States Geological Survey (USGS). It represents the seismic activity captured by sensors all over the country and other territories in the 30 days before this note was written (March to April 2022). You can download the dataset for this time span, or shorter ones, &lt;a href="https://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At first contact, it looks like this&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mlbtOrZP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://earthquake.usgs.gov/earthquakes/feed/v1.0/images/screenshot_csv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mlbtOrZP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://earthquake.usgs.gov/earthquakes/feed/v1.0/images/screenshot_csv.jpg" alt="so many entries!" width="542" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's see how many entries this file has. We can do this using the &lt;code&gt;wc&lt;/code&gt; command, passing the additional option &lt;code&gt;-l&lt;/code&gt; to count the lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wc -l all_month.csv

9271 all_month.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The file contains 9271 lines.&lt;br&gt;
Now, let's take a look at the fields, using the &lt;code&gt;head -N&lt;/code&gt; command, which prints out the first &lt;code&gt;N&lt;/code&gt; lines (10 by default).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;head -1 all_month.csv
time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,net,id,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this tutorial, we are going to use just four of them: &lt;code&gt;time&lt;/code&gt;, &lt;code&gt;depth&lt;/code&gt;, &lt;code&gt;mag&lt;/code&gt; and &lt;code&gt;place&lt;/code&gt;. The output will be of the form: 2022-04-02 12:23:45, 10.1, 1.1, "4km SE of Honolulu - Hawaii".&lt;br&gt;
So, how do we split them? For this, we have &lt;code&gt;cut&lt;/code&gt;. The syntax is &lt;code&gt;cut -d&amp;lt;delim&amp;gt; -f&amp;lt;fields&amp;gt; &amp;lt;file&amp;gt;&lt;/code&gt;, where &lt;code&gt;-d&lt;/code&gt; sets the delimiter and &lt;code&gt;-f&lt;/code&gt; selects the fields by number. First, let's cut the first three columns we care about.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cut -d, -f1,4,5 all_month.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the delimiter is the comma, and the fields are numbered in the order they appear. For example, if we wanted all fields from 2 to 5 we could use &lt;code&gt;-f2-5&lt;/code&gt; (in the main example, &lt;code&gt;-f1,4-5&lt;/code&gt; would return the same result).&lt;/p&gt;
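&lt;p&gt;As a quick sketch of the &lt;code&gt;-f&lt;/code&gt; syntax (on a made-up line, not the real dataset), compare selecting individual fields with selecting a range:&lt;/p&gt;

```shell
# A hypothetical CSV line, just to illustrate the -f syntax
line='time,latitude,longitude,depth,mag'

# Individual fields by number
echo "$line" | cut -d, -f1,4,5
# prints time,depth,mag

# A contiguous range of fields
echo "$line" | cut -d, -f2-5
# prints latitude,longitude,depth,mag
```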

&lt;p&gt;Ok, everything is looking good; let's add the &lt;code&gt;place&lt;/code&gt; field by extending the command to &lt;code&gt;cut -d, -f1,4,5,14 all_month.csv&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cut -d, -f1,4,5,14 all_month.csv
…
2022-03-19T15:17:42.550Z,1.74,1.2,"8km NW of The Geysers
2022-03-19T15:13:52.070Z,3.59,0.6,"24 km SSW of Kanosh
2022-03-19T15:12:31.400Z,1.86,0.36,"7km WNW of Cobb
2022-03-19T15:10:07.880Z,26.4,0.48,"29 km N of Ivanof Bay
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Well... It is a little different than expected; for instance, the final part of the &lt;code&gt;place&lt;/code&gt; field was left out. But why? As you can see, we told the &lt;code&gt;cut&lt;/code&gt; command to split every "cell" of our data file whenever it encounters a comma. This is why the &lt;code&gt;place&lt;/code&gt; field was cut short: it contains a comma inside the text.&lt;/p&gt;

&lt;p&gt;A solution for this is to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Extract the text of the &lt;code&gt;place&lt;/code&gt; field and put it in a temporary file;&lt;/li&gt;
&lt;li&gt;Extract the other fields and also put them in a temporary file;&lt;/li&gt;
&lt;li&gt;Join the two files line by line.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is worth knowing that we can redirect the output of a command in three main ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The output to a file, using the &lt;code&gt;&amp;gt;&lt;/code&gt; operator;&lt;/li&gt;
&lt;li&gt;The error output to a file, using the &lt;code&gt;2&amp;gt;&lt;/code&gt; operator;&lt;/li&gt;
&lt;li&gt;The output to the input of another command, using the &lt;code&gt;|&lt;/code&gt; (pipe) operator.&lt;/li&gt;
&lt;/ul&gt;
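&lt;p&gt;A minimal sketch of the three forms, using throwaway file names (&lt;code&gt;out.txt&lt;/code&gt; and &lt;code&gt;err.txt&lt;/code&gt; are just placeholders):&lt;/p&gt;

```shell
# 1. Standard output to a file
echo "hello" > out.txt

# 2. Standard error to a file (ls complains about a missing path)
ls /no/such/path 2> err.txt || true

# 3. Standard output piped into another command's input
echo "b,a" | tr ',' '\n'
# prints b and a on separate lines

# Clean up the throwaway files
rm -f out.txt err.txt
```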

&lt;p&gt;Back to our case, how can we extract only the part of the line that is between double quotes? The answer is &lt;code&gt;grep&lt;/code&gt;. The &lt;code&gt;grep&lt;/code&gt; command searches for a pattern of characters in the given input, using the power of regular expressions (&lt;a href="https://en.wikipedia.org/wiki/Regular_expression"&gt;regex&lt;/a&gt;). We won't explain them in this article.&lt;/p&gt;

&lt;p&gt;The command to find all the &lt;code&gt;place&lt;/code&gt; entries is&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grep -o '"[^"]\+"' all_month.csv
…
"8km NW of The Geysers, CA"
"24 km SSW of Kanosh, Utah"
"7km WNW of Cobb, CA"
"29 km N of Ivanof Bay, Alaska"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can remove the commas with the &lt;code&gt;sed&lt;/code&gt; command. It also works with regex, but in this case it will be simpler.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sed "s/,/ -/g"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here &lt;code&gt;s/&lt;/code&gt; means substitute, &lt;code&gt;,&lt;/code&gt; is the sequence we want replaced, and &lt;code&gt; -&lt;/code&gt; is the replacement. The &lt;code&gt;/g&lt;/code&gt; flag applies the substitution to every match on the line, not just the first.&lt;/p&gt;
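&lt;p&gt;To see each part in action on a throwaway line, with and without the &lt;code&gt;/g&lt;/code&gt; flag:&lt;/p&gt;

```shell
# Without /g, only the first comma on the line is replaced
echo 'a,b,c' | sed 's/,/ -/'
# prints a -b,c

# With /g, every comma is replaced
echo 'a,b,c' | sed 's/,/ -/g'
# prints a -b -c
```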

&lt;p&gt;We can join these two commands with the pipe operator, which will redirect the output of the &lt;code&gt;grep&lt;/code&gt; command to &lt;code&gt;sed&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grep -o '"[^"]\+"' all_month.csv | sed "s/,/ -/g"
…
"8km NW of The Geysers - CA"
"24 km SSW of Kanosh - Utah"
"7km WNW of Cobb - CA"
"29 km N of Ivanof Bay - Alaska"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now redirect this text to a temporary file (I will call it &lt;code&gt;places_without_comma&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;grep -o '"[^"]\+"' all_month.csv | sed "s/,/ -/g" &amp;gt; places_without_comma
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, if you run &lt;code&gt;ls&lt;/code&gt; in your directory, you will see a new file that contains the output above.&lt;/p&gt;

&lt;p&gt;The second step is similar, but with &lt;code&gt;cut&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tail +2 all_month.csv | cut -d, -f1,4,5 &amp;gt; other_fields
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;tail&lt;/code&gt; command is the opposite of &lt;code&gt;head&lt;/code&gt;: it prints the last &lt;code&gt;N&lt;/code&gt; lines. The &lt;code&gt;+2&lt;/code&gt; option (&lt;code&gt;-n +2&lt;/code&gt; in its portable form) prints all the lines starting from the second, which gets rid of the header.&lt;/p&gt;
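&lt;p&gt;A quick sketch with &lt;code&gt;seq&lt;/code&gt; as a stand-in for our file:&lt;/p&gt;

```shell
# seq 5 prints the numbers 1 through 5, one per line
seq 5 | head -n 2      # first two lines: 1, 2
seq 5 | tail -n 2      # last two lines: 4, 5
seq 5 | tail -n +2     # everything from line 2 on: 2, 3, 4, 5
```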

&lt;p&gt;To join these two files, we will use a command called &lt;code&gt;paste&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;paste -d, other_fields places_without_comma &amp;gt; all_month_formatted.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you might suspect, the &lt;code&gt;-d,&lt;/code&gt; specifies the delimiter.&lt;/p&gt;
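&lt;p&gt;On two tiny throwaway files (the names here are just placeholders), &lt;code&gt;paste&lt;/code&gt; works like this:&lt;/p&gt;

```shell
# Two hypothetical line-aligned files
printf 'a\nb\n' > left.txt
printf '1\n2\n' > right.txt

# Join them line by line, separated by a comma
paste -d, left.txt right.txt
# prints:
# a,1
# b,2

# Clean up
rm -f left.txt right.txt
```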

&lt;p&gt;The &lt;code&gt;all_month_formatted.csv&lt;/code&gt; file looks like this&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2022-03-19T15:17:42.550Z,1.74,1.2,"8km NW of The Geysers - CA"
2022-03-19T15:13:52.070Z,3.59,0.6,"24 km SSW of Kanosh - Utah"
2022-03-19T15:12:31.400Z,1.86,0.36,"7km WNW of Cobb - CA"
2022-03-19T15:10:07.880Z,26.4,0.48,"29 km N of Ivanof Bay - Alaska"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Extracting information with awk
&lt;/h3&gt;

&lt;p&gt;What if we want the extreme values of the depth and magnitude of the earthquakes? We can find them using &lt;code&gt;awk&lt;/code&gt;. Awk is by itself a fully capable programming language, optimized for dealing with text. A basic example is &lt;code&gt;awk -F, '{print $1}' all_month_formatted.csv&lt;/code&gt;, which will print the first column of our file, acting like &lt;code&gt;cut -d, -f1&lt;/code&gt;. By default its delimiter is whitespace, and the same is true for many of the other tools used here.&lt;/p&gt;
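&lt;p&gt;A small self-contained example of that column-printing behaviour (the input line is made up):&lt;/p&gt;

```shell
# -F, sets the field separator to a comma; $1 is the first field
printf 'a,b,c\n' | awk -F, '{print $1}'
# prints a

# Fields can be reordered and joined freely
printf 'a,b,c\n' | awk -F, '{print $3 "-" $1}'
# prints c-a
```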

&lt;p&gt;Jumping ahead, an awk script to filter the greatest and smallest values of &lt;code&gt;depth&lt;/code&gt; and &lt;code&gt;mag&lt;/code&gt; can be:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;depth_mag_script.awk&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env -S awk -f

BEGIN {FS="," ; maxDepth=0 ; maxMag=0}
{
    if (NR==1) {   # When it reads the first entry
        minDepth=$2;
        minMag=$3;
    }

    if ($2 &amp;gt; maxDepth)
        maxDepth = $2;
    else if ($2 &amp;lt; minDepth)
        minDepth = $2;

    if ($3 &amp;gt; maxMag)
        maxMag = $3;
    else if($3 &amp;lt; minMag)
        minMag = $3;
}
END {
    print "The deepest earthquake was " maxDepth " and the shallowest was " minDepth;
    print "The biggest earthquake was " maxMag " and the lowest was " minMag 
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here &lt;code&gt;FS&lt;/code&gt; is the "field separator", &lt;code&gt;NR&lt;/code&gt; is the "number of records" (the current line number), and &lt;code&gt;$2&lt;/code&gt; and &lt;code&gt;$3&lt;/code&gt; refer to the fields by number. The &lt;code&gt;BEGIN&lt;/code&gt; and &lt;code&gt;END&lt;/code&gt; blocks are self-explanatory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;./depth_mag_script.awk all_month_formatted.csv

The deepest earthquake was 623.78 and the shallowest was -3.5699999332428
The biggest earthquake was 7 and the lowest was -1.49
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The negative magnitude is due to the logarithmic properties of the measurement scale.&lt;/p&gt;

&lt;p&gt;Finally, we can format the date to &lt;code&gt;yyyy-mm-dd hh:MM:ss&lt;/code&gt; and add spaces after the commas with &lt;code&gt;sed&lt;/code&gt;, this time passing the &lt;code&gt;-i&lt;/code&gt; option to operate directly on the file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sed -i "s/T/ / ; s/\.[0-9]*\S// ; s/,/, /g" all_month_formatted.csv
…
2022-03-19 15:17:42, 1.74, 1.2, "8km NW of The Geysers - CA"
2022-03-19 15:13:52, 3.59, 0.6, "24 km SSW of Kanosh - Utah"
2022-03-19 15:12:31, 1.86, 0.36, "7km WNW of Cobb - CA"
2022-03-19 15:10:07, 26.4, 0.48, "29 km N of Ivanof Bay - Alaska"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://pandas.pydata.org/"&gt;pandas&lt;/a&gt; library is well known in the Python and data science world, and for good reason: with it, you could do all of this work easily. Here is an example of a Python script that gives the same results in a more pleasant way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/usr/bin/env python3

import pandas as pd

df = pd.read_csv('all_month.csv', usecols=[0, 3, 4, 13])
print(df.head(10))

max_depth = df['depth'].nlargest(1)
print(max_depth)

print('-'*10)

min_depth = df['depth'].nsmallest(1)
print(min_depth)

print('='*20)

max_mag = df['mag'].nlargest(1)
print(max_mag)

print('-'*10)

min_mag = df['mag'].nsmallest(1)
print(min_mag)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this and much more, it's worth giving the pandas library a try, if you haven't already or didn't know about it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finally...
&lt;/h3&gt;

&lt;p&gt;As this is my first post here, it certainly contains errors and could be improved. Feedback is always welcome!&lt;/p&gt;

</description>
      <category>commandline</category>
      <category>datascience</category>
      <category>pandas</category>
      <category>awk</category>
    </item>
  </channel>
</rss>
