<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vegard Stikbakke</title>
    <description>The latest articles on DEV Community by Vegard Stikbakke (@vegarsti).</description>
    <link>https://dev.to/vegarsti</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F136692%2Fe73f2cba-d352-496c-86e0-30ad4c60f0a1.png</url>
      <title>DEV Community: Vegard Stikbakke</title>
      <link>https://dev.to/vegarsti</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vegarsti"/>
    <language>en</language>
    <item>
      <title>Problem solving with Unix commands</title>
      <dc:creator>Vegard Stikbakke</dc:creator>
      <pubDate>Fri, 15 Feb 2019 20:40:59 +0000</pubDate>
      <link>https://dev.to/vegarsti/problem-solving-with-unix-commands-4j8l</link>
      <guid>https://dev.to/vegarsti/problem-solving-with-unix-commands-4j8l</guid>
      <description>&lt;p&gt;(Originally published &lt;a href="http://vegardstikbakke.com/unix/"&gt;here&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;I am starting to realize that the Unix command-line toolbox can fix absolutely any problem related to text wrangling. Let me tell you about a problem I had, and how I used some Unix command-line utilities to solve it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;I'm working on research for my master's thesis. Like many statisticians, I run a lot of simulations. I first simulate some data according to some numerical seed (to ensure reproducibility), and then use an algorithm to estimate something based on that data. For each simulation run, I create some files, typically like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;dataset-directory/0001_data.csv
dataset-directory/0001_A.csv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Sometimes a run fails. This doesn't really matter in this case: for any failed simulation, I can just do another one. But I do need to keep track of which runs have failed. For the &lt;code&gt;0001&lt;/code&gt; data, I had a successful run with algorithm &lt;code&gt;A&lt;/code&gt;, so I want to use the &lt;code&gt;0001&lt;/code&gt; data on algorithm &lt;code&gt;B&lt;/code&gt; as well.&lt;/p&gt;

&lt;p&gt;After running algorithm &lt;code&gt;A&lt;/code&gt; on a lot of data, I end up with a large list of files like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;dataset-directory/0001_data.csv
dataset-directory/0001_A.csv
dataset-directory/0002_data.csv
dataset-directory/0002_A.csv
dataset-directory/0003_data.csv
dataset-directory/0003_A.csv
dataset-directory/0004_data.csv
dataset-directory/0005_data.csv
dataset-directory/0005_A.csv
dataset-directory/0006_data.csv
dataset-directory/0006_A.csv
dataset-directory/0007_data.csv
dataset-directory/0007_A.csv
dataset-directory/0008_data.csv
dataset-directory/0009_data.csv
dataset-directory/0009_A.csv
...
dataset-directory/0499_A.csv
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The astute observer will note that the files for algorithm &lt;code&gt;A&lt;/code&gt; on data &lt;code&gt;0004&lt;/code&gt; and &lt;code&gt;0008&lt;/code&gt; are missing. &lt;strong&gt;How can I get a list of all the numbers for which &lt;code&gt;A&lt;/code&gt; didn't succeed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I certainly could go over them manually, but that would be error-prone and incredibly boring. It's much better to write a program to do it!&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution
&lt;/h2&gt;

&lt;p&gt;To state the obvious: those that didn't succeed are the numbers from &lt;code&gt;0001&lt;/code&gt; to &lt;code&gt;0500&lt;/code&gt; except those that succeeded. And one handy command to get a list of numbers is &lt;code&gt;seq&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;seq &lt;/span&gt;10
1
2
3
4
5
6
7
8
9
10
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;(If only one number is given, it is implied that the sequence starts with &lt;code&gt;1&lt;/code&gt;. &lt;code&gt;seq 2 10&lt;/code&gt; would do what you think it would, as well.)&lt;/p&gt;

&lt;p&gt;Now, if we can get a list of all the successful runs, we should be able to get what we want by cross-checking the list of successful runs with a &lt;code&gt;seq&lt;/code&gt; command which prints all possible numbers!&lt;/p&gt;

&lt;p&gt;Most command-line utilities do one pretty specific thing. For example, with &lt;code&gt;cut&lt;/code&gt; you can get the characters at specific positions on each line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cat &lt;/span&gt;text
Lorem ipsum
dolor sit amet
&lt;span class="nv"&gt;$&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;cat &lt;/span&gt;text | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; 2-5
orem
olor
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Notice here the use of the so-called pipe operator &lt;code&gt;|&lt;/code&gt;. Like I said, most utilities do one specific thing, and do that thing well. The neat thing is that they can be combined. With a pipe, the output from the command to the left of the pipe is directed to the command to the right. Note that these commands treat the input as a stream of lines, which is often really handy.&lt;/p&gt;
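
&lt;p&gt;A minimal sketch of chaining two commands with pipes, using made-up input rather than the files from this post. The variable name &lt;code&gt;first&lt;/code&gt; is only there for illustration.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# Made-up input: three lines, deliberately out of order.
# sort orders the lines, then head keeps only the first one.
first=$(printf 'pear\napple\nfig\n' | sort | head -n 1)
echo "$first"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Each command only sees the lines printed by the command before it, so steps can be added or removed one at a time.&lt;/p&gt;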

&lt;p&gt;We can get a list of the successful file names by piping the list of files into &lt;code&gt;grep&lt;/code&gt;, a command that filters lines using regular expressions. Since all file names start with exactly 4 digits, we can match them with the regular expression &lt;code&gt;\d\d\d\d&lt;/code&gt;, which matches 4 digits in a row, followed by the file ending for the &lt;code&gt;A&lt;/code&gt; algorithm. To get the list of files with one line for each file, we can simply use &lt;code&gt;ls&lt;/code&gt;. (Although &lt;code&gt;ls&lt;/code&gt; doesn't give each file its own line when run on its own in a terminal, it does when its output is piped.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;ls &lt;/span&gt;dataset-directory | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'\d\d\d\d_A.csv'&lt;/span&gt;
0009_A.csv
0001_A.csv
0002_A.csv
0005_A.csv
0007_A.csv
0003_A.csv
0006_A.csv
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
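
&lt;p&gt;Not every &lt;code&gt;grep&lt;/code&gt; understands &lt;code&gt;\d&lt;/code&gt; without extra flags, so here is a sketch of a more portable spelling of the same pattern. The file names are made up and fed in with &lt;code&gt;printf&lt;/code&gt; instead of &lt;code&gt;ls&lt;/code&gt;, and the variable names are only for illustration.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# Made-up file names, one per line, standing in for the output of ls.
names=$(printf '0001_data.csv\n0001_A.csv\n0002_data.csv\n0003_A.csv\n')
# [0-9] is the portable way to write "one digit", \. matches a literal dot,
# and ^ and $ anchor the pattern to the whole line.
matches=$(echo "$names" | grep '^[0-9][0-9][0-9][0-9]_A\.csv$')
echo "$matches"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;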



&lt;p&gt;For some reason, these show up in a scrambled order after using &lt;code&gt;grep&lt;/code&gt;. We can use &lt;code&gt;sort&lt;/code&gt; to fix that. And we are only interested in the numbers, so we can use &lt;code&gt;cut -c 1-4&lt;/code&gt; to extract the number parts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;ls &lt;/span&gt;dataset-directory | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'\d\d\d\d_A.csv'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; 1-4
0001
0002
0003
0005
0006
0007
0009
...
0499
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;These numbers aren't exactly the same as the numbers from the &lt;code&gt;seq&lt;/code&gt; command, since these are zero-padded. Therefore we write a quick Python script to parse them as integers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# parse.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;sys&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now, piping into this script will give us the numbers that we want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;ls &lt;/span&gt;dataset-directory | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'\d\d\d\d_A.csv'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; 1-4 | python3 parse.py
1
2
3
5
6
7
9
...
499
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
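
&lt;p&gt;One possible alternative to the Python script, as a sketch: instead of stripping the zeros from the file numbers, make &lt;code&gt;seq&lt;/code&gt; print zero-padded numbers so that the two lists match as text. This assumes a &lt;code&gt;seq&lt;/code&gt; that supports the &lt;code&gt;-f&lt;/code&gt; format flag, and the variable name &lt;code&gt;padded&lt;/code&gt; is only for illustration.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# %04g pads each number with leading zeros to a width of four digits.
padded=$(seq -f '%04g' 3)
echo "$padded"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;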



&lt;p&gt;We're getting there! Now we have to figure out how to cross-check these lists of numbers. Luckily, there is a command called &lt;code&gt;comm&lt;/code&gt;, which finds the lines that two inputs have in &lt;u&gt;comm&lt;/u&gt;on. It expects two files, so to feed it the output of a sequence of commands such as the one above, we wrap the commands in &lt;code&gt;&amp;lt;(...)&lt;/code&gt; (process substitution, available in shells such as bash and zsh), which makes their output readable like a file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;comm&lt;/span&gt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;ls &lt;/span&gt;dataset-directory | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'\d\d\d\d_A.csv'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; 1-4 | python3 parse.py&lt;span class="o"&gt;)&lt;/span&gt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;500&lt;span class="o"&gt;)&lt;/span&gt;
        1
        2
        3
    4
        5
        6
        7
    8
        9
    10
...
    500
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This output is a bit disorienting. If we read the manual of &lt;code&gt;comm&lt;/code&gt; (by doing &lt;code&gt;man comm&lt;/code&gt;), we see that &lt;code&gt;comm&lt;/code&gt; "produces three text columns as output: lines only in file1; lines only in file2; and lines in both files." Column 1 is empty, since no number appears only in the file list, and we suppress it by calling &lt;code&gt;comm&lt;/code&gt; with the flag &lt;code&gt;-1&lt;/code&gt;. And since we are not interested in the numbers which are in both streams, we suppress column 3 with the &lt;code&gt;-3&lt;/code&gt; flag as well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;comm&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt; &lt;span class="nt"&gt;-3&lt;/span&gt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;ls &lt;/span&gt;dataset-directory | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'\d\d\d\d_A.csv'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; 1-4 | python3 parse.py&lt;span class="o"&gt;)&lt;/span&gt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;seq &lt;/span&gt;500&lt;span class="o"&gt;)&lt;/span&gt;
4
8
...
500
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
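
&lt;p&gt;Since &lt;code&gt;comm&lt;/code&gt; expects both inputs to be sorted as text (where &lt;code&gt;10&lt;/code&gt; sorts before &lt;code&gt;9&lt;/code&gt;), a variant that compares zero-padded numbers may be more robust. The following is a sketch under assumptions, not the pipeline used above: it builds a scratch directory with made-up files, and the names &lt;code&gt;dir&lt;/code&gt;, &lt;code&gt;missing&lt;/code&gt; and &lt;code&gt;succeeded.txt&lt;/code&gt; are hypothetical. It also assumes a &lt;code&gt;seq&lt;/code&gt; that supports &lt;code&gt;-f&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# Scratch directory with made-up files: runs 0002 and 0004 have no _A.csv file.
dir=$(mktemp -d)
for n in 0001 0002 0003 0004 0005; do touch "$dir/${n}_data.csv"; done
for n in 0001 0003 0005; do touch "$dir/${n}_A.csv"; done

# Successful run numbers, still zero-padded and sorted as text.
# sort -o writes the sorted result to a file.
ls "$dir" | grep '^[0-9][0-9][0-9][0-9]_A\.csv$' | cut -c 1-4 | sort -o "$dir/succeeded.txt"

# comm -13 keeps only the lines unique to its second input,
# which here is the full zero-padded list read from standard input (-).
missing=$(seq -f '%04g' 5 | comm -13 "$dir/succeeded.txt" -)
echo "$missing"

# Clean up the scratch directory.
rm -r "$dir"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With zero-padded numbers on both sides, text order and numeric order agree, and no conversion to integers is needed.&lt;/p&gt;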



&lt;p&gt;And we're done!&lt;/p&gt;

&lt;p&gt;Update: This post generated &lt;a href="https://news.ycombinator.com/item?id=19160659"&gt;some interesting discussion on Hacker News&lt;/a&gt;. There are many ways to solve this problem, and the way I did it is probably not the best. Be sure to check it out for tips on how to improve.&lt;/p&gt;

</description>
      <category>unix</category>
    </item>
  </channel>
</rss>
