Sundeep

Posted on Sep 30, 2020 • Originally published at learnbyexample.github.io

Using Ruby as a cli tool for text processing

Why use Ruby for one-liners?

I assume you are already familiar with use cases where the command line is more productive than a GUI. See also this series of articles titled Unix as IDE.

A shell like bash provides built-in commands and scripting features that make it easier to solve and automate various tasks. External *nix commands like grep, sed, awk, sort, find, parallel, etc. can be combined to work with each other. Depending on your familiarity with those tools, you can either use ruby as a single replacement or use it to complement them for specific use cases.

Here are some one-liners (options will be explained later):

ruby -e 'puts readlines.uniq' *.txt — retain only one copy if lines are duplicated from the given list of input file(s)
ruby -e 'puts readlines.uniq {|s| s.split[1]}' *.txt — retain only first copy of duplicate lines using second field as duplicate criteria
ruby -rcommonregex -ne 'puts CommonRegex.get_links($_)' *.md — extract only the URLs, using a third-party CommonRegexRuby library
stackoverflow: merge duplicate key values while preserving order — a recent Q&A that I answered with a simpler ruby solution compared to awk
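To see what the first two one-liners do without needing any input files, here is a minimal sketch as a plain Ruby script. The method names and the sample lines are made up for illustration; only `Array#uniq` (with and without a block) comes from the one-liners above.

```ruby
# Hypothetical helper: same idea as `puts readlines.uniq`
def dedupe(lines)
  lines.uniq
end

# Hypothetical helper: same idea as `puts readlines.uniq {|s| s.split[1]}`
# index 1 means the second whitespace-separated field
def dedupe_by_field(lines, index)
  lines.uniq { |s| s.split[index] }
end

# Made-up sample data standing in for the contents of *.txt
sample = ["apple 10\n", "mango 20\n", "apple 10\n", "grape 20\n"]

puts dedupe(sample)             # whole-line duplicates removed
puts dedupe_by_field(sample, 1) # only the first line per second-field value is kept
```

With a block, `uniq` compares the block's return value instead of the whole line, which is why the second field can act as the duplicate criteria.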

The main advantages of ruby over tools like grep, sed and awk are its feature-rich regular expression engine, standard library and third-party libraries. If you don't already know the syntax and idioms for sed and awk, learning the command line options for ruby would be the easier option. The main disadvantage is that ruby is likely to be slower than those tools.

Command line options

Option	Description
`-0[octal]`	specify record separator (`\0`, if no argument)
`-a`	autosplit mode with `-n` or `-p` (splits `$_` into `$F`)
`-c`	check syntax only
`-Cdirectory`	cd to directory before executing your script
`-d`	set debugging flags (set `$DEBUG` to true)
`-e 'command'`	one line of script. Several `-e`'s allowed. Omit [programfile]
`-Eex[:in]`	specify the default external and internal character encodings
`-Fpattern`	`split()` pattern for autosplit (`-a`)
`-i[extension]`	edit `ARGV` files in place (make backup if extension supplied)
`-Idirectory`	specify `$LOAD_PATH` directory (may be used more than once)
`-l`	enable line ending processing
`-n`	assume `'while gets(); ... end'` loop around your script
`-p`	assume loop like `-n` but print line also like `sed`
`-rlibrary`	require the library before executing your script
`-s`	enable some switch parsing for switches after script name
`-S`	look for the script using PATH environment variable
`-v`	print the version number, then turn on verbose mode
`-w`	turn warnings on for your script
`-W[level=2|:category]`	set warning level; 0=silence, 1=medium, 2=verbose
`-x[directory]`	strip off text before #!ruby line and perhaps cd to directory
`--jit`	enable JIT with default options (experimental)
`--jit-[option]`	enable JIT with an option (experimental)
`-h`	show this message, `--help` for more info

Executing Ruby code

If you want to execute a ruby program file, one way is to pass the filename as argument to the ruby command.

$ echo 'puts "Hello Ruby"' > hello.rb
$ ruby hello.rb
Hello Ruby

For short programs, you can also directly pass the code as an argument to the -e option.

$ ruby -e 'puts "Hello Ruby"'
Hello Ruby

$ # multiple statements can be issued separated by ;
$ ruby -e 'x=25; y=12; puts x**y'
59604644775390625
$ # or use -e option multiple times
$ ruby -e 'x=25' -e 'y=12' -e 'puts x**y'
59604644775390625

Filtering

ruby one-liners can be used for filtering lines matched by a regexp, similar to grep, sed and awk. And similar to many command line utilities, ruby can accept input from both stdin and file arguments.

$ # sample stdin data
$ printf 'gate\napple\nwhat\nkite\n'
gate
apple
what
kite

$ # print all lines containing 'at'
$ # same as: grep 'at' and sed -n '/at/p' and awk '/at/'
$ printf 'gate\napple\nwhat\nkite\n' | ruby -ne 'print if /at/'
gate
what

$ # print all lines NOT containing 'e'
$ # same as: grep -v 'e' and sed -n '/e/!p' and awk '!/e/'
$ printf 'gate\napple\nwhat\nkite\n' | ruby -ne 'print if !/e/'
what

By default, grep, sed and awk will automatically loop over input content line by line (with \n as the line distinguishing character). The -n or -p option will enable this feature for ruby. As seen before, the -e option accepts code as command line argument. Many shortcuts are available to reduce the amount of typing needed.

In the above examples, a regular expression (defined by the pattern between a pair of forward slashes) has been used to filter the input. When the input string isn't specified in a conditional context (for example: if), the test is performed against global variable $_, which has the contents of the input line (the correct term would be input record). To summarize, in a conditional context:

/regexp/ is a shortcut for $_ =~ /regexp/
!/regexp/ is a shortcut for $_ !~ /regexp/

$_ is also the default argument for the print method, which is why print is generally preferred over puts in one-liners. More such defaults that apply to the print method will be discussed later.

See ruby-doc: Pre-defined global variables for documentation on $_, $&, etc.
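As a rough illustration of what the -n loop and the regexp shortcuts amount to, the filtering one-liners can be written out as an ordinary method. This is only a sketch: `filter_lines` and its parameters are hypothetical names, and the explicit `line` variable plays the role that $_ plays in the one-liners.

```ruby
require 'stringio'

# Hypothetical helper: roughly what `ruby -ne 'print if /regexp/'` does,
# with the implicit $_ replaced by an explicit `line` variable.
# invert: true corresponds to `print if !/regexp/`
def filter_lines(input, regexp, out = $stdout, invert: false)
  input.each_line do |line|
    matched = line.match?(regexp)      # stands in for: $_ =~ /regexp/
    out.print line if matched != invert
  end
end

data = "gate\napple\nwhat\nkite\n"
filter_lines(StringIO.new(data), /at/)               # lines containing 'at'
filter_lines(StringIO.new(data), /e/, invert: true)  # lines NOT containing 'e'
```

Note that `print` is used rather than `puts`, since each line still carries its trailing newline.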

Here's an example with file input instead of stdin.

$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14

$ # same as: grep -oE '[0-9]+$' table.txt
$ ruby -ne 'puts $& if /\d+$/' table.txt
42
7
14

Substitution

Use the sub and gsub methods for search and replace. By default, these methods operate on $_ when the input string isn't provided. For these examples, the -p option is used instead of -n, so that the value of $_ is automatically printed after processing each input line.

$ # for each input line, change only first ':' to '-'
$ # same as: sed 's/:/-/' and awk '{sub(/:/, "-")} 1'
$ printf '1:2:3:4\na:b:c:d\n' | ruby -pe 'sub(/:/, "-")'
1-2:3:4
a-b:c:d

$ # for each input line, change all ':' to '-'
$ # same as: sed 's/:/-/g' and awk '{gsub(/:/, "-")} 1'
$ printf '1:2:3:4\na:b:c:d\n' | ruby -pe 'gsub(/:/, "-")'
1-2-3-4
a-b-c-d

You might wonder how $_ is modified without the use of ! methods. The reason is that these methods are part of Kernel (see ruby-doc: Kernel for details) and are available only when -n or -p options are used.

sub(/regexp/, repl) is a shortcut for $_.sub(/regexp/, repl) and $_ will be updated if substitution succeeds
gsub(/regexp/, repl) is a shortcut for $_.gsub(/regexp/, repl) and $_ gets updated if substitution succeeds
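Outside of -n and -p, the same substitutions are done with the String methods directly. The sketch below (the method names are made up for illustration) contrasts the non-mutating `sub` and `gsub` with the in-place `sub!`, which is the behavior the Kernel shortcuts mimic for $_.

```ruby
# Hypothetical helpers wrapping the String methods used above
def replace_first(line)
  line.sub(/:/, "-")    # returns a new string, `line` itself is untouched
end

def replace_all(line)
  line.gsub(/:/, "-")   # returns a new string with every ':' changed
end

line = "1:2:3:4"
puts replace_first(line)   # only the first ':' is changed
puts replace_all(line)     # all ':' are changed
puts line                  # still the original string

# The ! variants modify the receiver in place and
# return nil when nothing was substituted
line.sub!(/:/, "-")
puts line
p line.sub!(/xyz/, "-")
```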

Field processing

Consider the sample input file shown below with fields separated by a single space character.

$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14

Here are some examples that work on specific fields rather than the entire line. The -a option causes the input line to be split on whitespace, and the fields can be accessed using the $F global variable. Leading and trailing whitespace is suppressed and won't result in empty fields.

$ # print the second field of each input line
$ # same as: awk '{print $2}' table.txt
$ ruby -ane 'puts $F[1]' table.txt
bread
cake
banana

$ # print lines only if last field is a negative number
$ # same as: awk '$NF<0' table.txt
$ ruby -ane 'print if $F[-1].to_f < 0' table.txt
blue cake mug shirt -7

$ # change 'b' to 'B' only for the first field
$ # same as: awk '{gsub(/b/, "B", $1)} 1' table.txt
$ ruby -ane '$F[0].gsub!(/b/, "B"); puts $F * " "' table.txt
Brown bread mat hair 42
Blue cake mug shirt -7
yellow banana window shoes 3.14
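The field examples can be reproduced in a plain script by calling `split` yourself. In this sketch, `split_fields` is a hypothetical helper and the `fields` array stands in for $F; the extra spaces around the sample line are added only to show that they don't produce empty fields.

```ruby
# Hypothetical helper: what -a does for each input line, with the
# result stored in a local array instead of $F
def split_fields(line)
  line.split   # no argument: split on whitespace, ignoring leading/trailing
end

fields = split_fields("  blue cake mug shirt -7  \n")
p fields                     # five fields, no empty strings

puts fields[1]               # second field
puts fields[-1].to_f < 0     # is the last field a negative number?

fields[0].gsub!(/b/, "B")    # modify only the first field, in place
puts fields * " "            # Array#* with a string argument acts like join
```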

BEGIN and END

You can use a BEGIN{} block when you need to execute something before input is read and an END{} block to execute something after all of the input has been processed.

$ # same as: awk 'BEGIN{print "---"} 1; END{print "%%%"}'
$ # note the use of ; after BEGIN block
$ seq 4 | ruby -pe 'BEGIN{puts "---"}; END{puts "%%%"}'
---
1
2
3
4
%%%

ENV hash

When it comes to automation and scripting, you'd often need to construct commands that can accept input from the user, a file, the output of a shell command, etc. The examples here assume bash as the shell being used. To access environment variables of the shell, you can index the special hash-like object ENV with the name of the environment variable as a string key.

$ # existing environment variable
$ # output shown here is for my machine, would differ for you
$ ruby -e 'puts ENV["HOME"]'
/home/learnbyexample
$ ruby -e 'puts ENV["SHELL"]'
/bin/bash

$ # defined along with ruby command
$ # note that the variable is placed before the shell command
$ word='hello' ruby -e 'puts ENV["word"]'
hello
$ # the input characters are preserved as is
$ ip='hi\nbye' ruby -e 'puts ENV["ip"]'
hi\nbye
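Here is a small sketch of how ENV lookups behave inside a script. The helper name and the variable names `word` and `no_such_variable_xyz` are placeholders chosen for illustration; assigning to ENV here merely stands in for defining the variable on the command line as shown above.

```ruby
# Hypothetical helper: look up an environment variable,
# falling back to a default when it is not set
def env_or_default(name, default)
  ENV.fetch(name, default)
end

ENV["word"] = "hello"               # stand-in for: word='hello' ruby -e '...'
ENV.delete("no_such_variable_xyz")  # make sure the placeholder is really unset

puts ENV["word"]                    # value is always a String
p ENV["no_such_variable_xyz"]       # plain [] lookup gives nil for unset names
puts env_or_default("no_such_variable_xyz", "fallback")
```

Using `fetch` with a default avoids passing nil along by accident when a variable is unset.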

Here's another example, where a regexp is passed as the content of an environment variable.

$ cat word_anchors.txt
sub par
spar
apparent effort
two spare computers
cart part tart mart

$ # assume 'r' is a shell variable that has to be passed to the ruby command
$ r='\Bpar\B'
$ rgx="$r" ruby -ne 'print if /#{ENV["rgx"]}/' word_anchors.txt
apparent effort
two spare computers
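The interpolation `/#{ENV["rgx"]}/` builds a regexp from a string at run time, which can also be written with `Regexp.new`. The following sketch uses the lines of word_anchors.txt as an inline array so that it runs without the file; `grep_with_env` is a hypothetical helper, and setting ENV inside the script stands in for passing the variable from the shell.

```ruby
# Hypothetical helper: keep the lines matching the regexp whose
# source text is stored in the environment variable `var`
def grep_with_env(lines, var)
  rgx = Regexp.new(ENV.fetch(var))   # same effect as /#{ENV[var]}/
  lines.select { |line| line.match?(rgx) }
end

# Contents of word_anchors.txt, inlined for the sketch
lines = ["sub par", "spar", "apparent effort",
         "two spare computers", "cart part tart mart"]

ENV["rgx"] = '\Bpar\B'   # stand-in for: rgx="$r" ruby ...
puts grep_with_env(lines, "rgx")
```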

As an example, see my repo ch: command help for a practical shell script, where commands are constructed dynamically.

Executing external commands

You can call external commands using the system Kernel method. See ruby-doc: system for documentation.

$ ruby -e 'system("echo Hello World")'
Hello World

$ ruby -e 'system("wc -w <word_anchors.txt")'
12

$ ruby -e 'system("seq -s, 10 > out.txt")'
$ cat out.txt
1,2,3,4,5,6,7,8,9,10

The return value of system or the global variable $? can be used to act upon the exit status of the command issued.

$ ruby -e 'es=system("ls word_anchors.txt"); puts es'
word_anchors.txt
true
$ ruby -e 'system("ls word_anchors.txt"); puts $?'
word_anchors.txt
pid 6087 exit 0

$ ruby -e 'system("ls xyz.txt"); puts $?'
ls: cannot access 'xyz.txt': No such file or directory
pid 6164 exit 2
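For acting on the result in code rather than printing it, $? holds a Process::Status object with methods such as `exitstatus` and `success?`. A minimal sketch follows, assuming the standard `true` and `false` utilities are available on the system; `run_and_report` is a made-up helper name.

```ruby
# Hypothetical helper: run a command and return both the value
# given by system and the numeric exit status taken from $?
def run_and_report(*command)
  ok = system(*command)     # true on exit status 0, false on non-zero
  [ok, $?.exitstatus]
end

ok, status = run_and_report("true")
puts "succeeded with status #{status}" if ok

ok, status = run_and_report("false")
puts "failed with status #{status}" unless ok
```

Branching on the return value of system is usually enough; $? is useful when the specific exit code matters.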

To save the result of an external command, use backticks or %x.

$ ruby -e 'words = `wc -w <word_anchors.txt`; puts words'
12

$ ruby -e 'nums = %x/seq 3/; print nums'
1
2
3
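The captured output is an ordinary String, trailing newline included, so it can be processed further with the usual methods. Here is a short sketch that reuses the `seq 3` command from above (assumed to be available, as in the earlier examples); the helper name is made up.

```ruby
# Hypothetical helper: run `seq` and return its raw output as a String
def capture_seq(count)
  `seq #{Integer(count)}`   # Integer() guards what gets interpolated
end

output = capture_seq(3)
p output                          # raw text, newlines included

nums = output.split.map(&:to_i)   # convert to an array of integers
p nums
puts nums.sum
```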

See also stackoverflow: difference between exec, system and %x() or backticks

Summary

This post introduced some of the common options for ruby cli usage, along with typical cli text processing examples. While special-purpose cli tools like grep, sed and awk are usually faster, ruby has a much more extensive standard library and ecosystem. And if you are comfortable with ruby but not familiar with those cli tools, you do not have to learn a lot to get started.

Ruby one-liners cookbook

If you liked this post and would like to learn more, check out my ebook using the links below. These are free to download until this Sunday (4-Oct-2020).

You can also get the ebooks as part of Ruby text processing bundle using these links:
