DEV Community

Cover image for Awk: A beginners guide for humans
John McBride
John McBride

Posted on • Originally published at johncodes.com

Awk: A beginners guide for humans

Earlier this week, I had a file of names, each delimited by a newline:

john
jack
jill
Enter fullscreen mode Exit fullscreen mode

But really, I needed this file to be in the form:

{
    "full_name": "name"
},
Enter fullscreen mode Exit fullscreen mode

This file wasn't absolutely huge, but it was big enough that editing it manually would have been annoying. I thought to myself, "instead of editing this file manually or generating it correctly, how can I spend the maximum amount of time using a bespoke tool to get it in the right format? A neovim macro? Sed? Write some python? Why not awk!"

In the end, here's the awk command I used:

awk '{print "{\n    \"full_name\": \"" $0 "\"\n},"}' names.txt
Enter fullscreen mode Exit fullscreen mode

This printed each line surrounded by the appropriate curly braces and whitespace.


Let's break down how I did this and build the command one bit at a time:

  1. Awk is a Linux command line utility just like any other. But, similar to something like like python or lua, it's a special program interpreter that is especially good at scanning and processing inputs with small (or big) one liner programs you give it.
awk '<an-awk-program>' some-input-file
Enter fullscreen mode Exit fullscreen mode
  1. Let's start simple and just print the names from the file directly to stdout:
awk '{print $0}' names.txt
Enter fullscreen mode Exit fullscreen mode
john
jack
jill
Enter fullscreen mode Exit fullscreen mode

within the '', we provide awk with a small program it will execute. This is basically the "hello world" of awk: it just takes each line and prints it out just like it is, unedited, in the file.

But what is $0? Awk has the concept of "columns" in a file: these are typically space delimited. So a file like:

1 2 3
4 5 6 
Enter fullscreen mode Exit fullscreen mode

has 3 columns and 2 rows.

The $0 variable is a special one and represents the entire row of arguments. Then, each $N is the N-th (where 1 is the first column) argument in that row.

So, if we only wanted the 1st column in the above file with 3 columns, we could run the following awk program:

awk '{print $1}' numbers.txt
Enter fullscreen mode Exit fullscreen mode
1
4
Enter fullscreen mode Exit fullscreen mode

If we only wanted the 2nd and 3rd columns, we could run:

awk '{print $2 " " $3}' numbers.txt
Enter fullscreen mode Exit fullscreen mode
2 3
5 6
Enter fullscreen mode Exit fullscreen mode

(notice the blank " " we provide as a string to force some whitespace formatting so the columns are closer to what exists in the original file.

  1. Next, lets add in some additional text to print out:
awk '{print "{\"full_name\": \"" $0 "\"},"}' names.txt
Enter fullscreen mode Exit fullscreen mode

First thing you'll notice is a confusing array of " - the first " denotes the beginning of a string output for awk to print. The subsequent \" are literal escaped quotes which we want to appear in the output. We eventually end the first string with a standalone " to then print the line with the $0 variable and then we enter a string again to add the trailing bracket } and comma ,.

When run, this outputs:

{"full_name": "john"},
{"full_name": "jack"},
{"full_name": "jill"},
Enter fullscreen mode Exit fullscreen mode
  1. Now we're getting somewhere! Let's finish this off by adding the additional white spacing:
awk '{print "{\n    \"full_name\": \"" $0 "\"\n},"}' names.txt
Enter fullscreen mode Exit fullscreen mode
{
    "full_name": "john"
},
{
    "full_name": "jack"
},
{
    "full_name": "jill"
},
Enter fullscreen mode Exit fullscreen mode

The added whitespace within the strings (by including the literal escaped newlines \n) are printed to give the correct, desired output!

  1. Bonus: what if we wanted to remove the trailing comma? What if we wanted to wrap this all in [...] to be closer to valid json? Yeah, yeah, I know, jq exists, but by the power of our lord and savior awk, all things possible!!

To remove the trailing comma, we can use a sliding window technique:

awk 'NR > 1 {print prev ","} {prev = "{\n    \"full_name\": \"" $0 "\"\n}"} END {print prev}' names.txt
Enter fullscreen mode Exit fullscreen mode

This introduces abit more complexity.

First, we add the NR concept: NR is the "number of records". This can be really useful for checking progress, doing different things based on number of records processed, etc.

So, after the first record, we print the comma. We also always store the "previous" chunk in a prev variable: this is the N + 1 sliding window. Nothing actually happens when the first record is processed, it's line output is simply stored in the prev variable to be printed on the next iteration. This way, we're always one behind the current record and when we reach the very end (using the END keyword), we can print the previous chunk without the trailing comma!

To wrap it up the entire output in a square bracket and give it the correct spacing, we can use this awk program:

BEGIN {
    # Print the opening bracket for the JSON array
    print "["
}

NR > 1 {
    # after the first line, print the previously stored chunk
    print prev ","
}

{
    # Store the current line in a JSON object format
    prev = "    {\n        \"full_name\": \"" $0 "\"\n    }"
}

END {
    # Print the last line stored in prev and close the JSON array
    print prev "\n]"
}
Enter fullscreen mode Exit fullscreen mode

We can run this awk program via a file instead of doing all of that on the command line directly. This greatly helps with readability, maintainability, etc.

awk -f format_names.awk names.txt
Enter fullscreen mode Exit fullscreen mode
[
    {
        "full_name": "john"
    },
    {
        "full_name": "jack"
    },
    {
        "full_name": "jill"
    }
]
Enter fullscreen mode Exit fullscreen mode

Just like the previous awk program, we are printing each segment and then at the end, leaving off the trailing comma. But this time, at the beginning of the program, using BEGIN and END, we print an opening and closing bracket.


Happy awk-ing and good luck!

Top comments (1)

Collapse
 
kaamkiya profile image
Kaamkiya

Thanks for this! I've been trying to figure out what awk was for a while now.