DEV Community

Tom Deneire ⚡
Tom Deneire ⚡

Posted on • Originally published at tomdeneire.Medium

Get started with AWK

Photo by [Natalia Yakovleva](https://unsplash.com/@foxfox?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&utm_medium=referral)
Photo by Natalia Yakovleva on Unsplash

What is AWK?

AWK is a true GNU/Linux classic, designed by Alfred Aho, Peter Weinberger, and Brian Kernighan — three of the biggest names in UNIX development and computer science in general. It is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks.

Of course this tutorial is only a brief overview of the basics of AWK. There are several more elaborated tutorials to be found online, as well as the original manual The AWK Programming Language written by the designers…

Workflow and structure

Starting out, we need to understand the workflow and program structure of AWK.

An AWK program follows this outline:

  1. execute commands from BEGIN block

  2. read a line from input stream

  3. execute commands on a line or rather ‘record’

  4. repeat if end of file has not been reached

  5. execute commands from END block

Not all steps are necessary. For instance, a valid AWK instruction can be:

echo -e 'one\ntwo\nthree' | awk '/o/ {print}'

one
two
Enter fullscreen mode Exit fullscreen mode

This only uses the middle or “body” block.

AWK commands take the form of pattern-action statements with the syntax pattern { action }.

If there is a match for the pattern, the action is performed. If there is no pattern, the action is performed for every input line, for instance:

echo -e 'one\ntwo\nthree' | awk '{print}'

one
two
three
Enter fullscreen mode Exit fullscreen mode

Together with the BEGIN and END block, this becomes:

echo -e 'one\ntwo\nthree' | awk 'BEGIN{print "STARTING\n"} {print} END{print "READY"}'

STARTING
one
two
three
READY
Enter fullscreen mode Exit fullscreen mode

By default, AWK separates input into ‘records’ using newlines, but this can be changed (see below).

Running an AWK program

There are several ways to run an AWK program.

We can use AWK as a command-line application, as in the examples above, but we can also provide AWK commands in a script file and use the -f flag:

cat input.txt | awk -f script.awk
Enter fullscreen mode Exit fullscreen mode

We can use input that is piped to awk, like in the example above, or we can specify a source file (or several) for the text input, e.g.

awk '{print}' input.txt
Enter fullscreen mode Exit fullscreen mode

Built-in variables

FS, $0, $1, …

As a text processing utility, AWK is designed to split lines into fields. To do this, it uses a built-in variable FS which defaults to space but which you can change with the -F flag, like so:

echo 'a b c' | awk '{print $1}'

a

echo 'a,b,c' | awk -F "," '{print $1}'

a
Enter fullscreen mode Exit fullscreen mode

(Similarly, built-in variable RS, record separator, allows to change the default record separator, which defaults to newline).

AWK has many such variables, not only the above-mentioned $n (nth field) or $0 (the entire record), but also, for instance:

NF

This represents the number of fields in the current record, which can be used as a pattern to match the input to, for instance:

echo -e 'a b c\nd e' | awk 'NF > 2'

a b c
Enter fullscreen mode Exit fullscreen mode

NR

This represents the index number of the ‘record’ (in this case, line) from the input:

echo -e 'a b c\nd e' | awk '{print NR}'

1
2
Enter fullscreen mode Exit fullscreen mode

IGNORECASE

Some variables are specific for certain versions of AWK, such as GAWK (GNU AWK), which is the default awk on many Linux systems. One of those is IGNORECASE which can be used when matching records to certain patterns (note the / / syntax for patterns, see below):

echo -e 'a b c' | awk '{IGNORECASE = 1} /A/'

a b c
Enter fullscreen mode Exit fullscreen mode

Operators

This overview is too short to discuss AWK operators in full, but let’s look at some examples.

Arithmetic

There are only two types of data in awk: strings of characters and numbers. Awk converts strings into numbers using a conversion format (which can be changed using built-in variable CONVFMT):

echo -e 'a 1 1' | awk '{print ($2 + $3)}'

2
Enter fullscreen mode Exit fullscreen mode

Increment and decrement

echo -e '1 1 1' | awk '{print (++$1)}'

2
Enter fullscreen mode Exit fullscreen mode

Relational

echo -e '1\n0\n7\n0\n3\n0\n4\n2\n3' | awk '$0 != 0'

1
7
3
4
2
3
Enter fullscreen mode Exit fullscreen mode

Assignment

awk 'BEGIN { name = "Tom"; printf "My name is %s", name }'

My name is Tom
Enter fullscreen mode Exit fullscreen mode

By the way, notice how this is an AWK instruction that does not have a body, only a BEGIN block. Hence it does not need an input to function either… It also uses formatted printing printf instead of the regular print.

Patterns and regular expressions

AWK’s bread and butter is pattern matching. Moreover, it is very powerful and efficient in handling regular expressions.

A basic example of this is:

echo -e 'one\ntwo\nthree' | awk '/one/ {print}'

one
Enter fullscreen mode Exit fullscreen mode

which we can also write using a relational operator for the pattern:

echo -e 'one\ntwo\nthree' | awk '$0 == "one" {print}'
Enter fullscreen mode Exit fullscreen mode

AWK uses the POSIX regular expression syntax (as do other UNIX utilities, like sed):

Source: [https://webannex.blogspot.com/2013/04/regular-expressions-cheat-sheet.html](https://webannex.blogspot.com/2013/04/regular-expressions-cheat-sheet.html)

Patterns can be combined with parentheses and the logical operators &&, || and !, which stand for AND, OR, and NOT.

echo -e 'one\ntwo\nthree' | awk '(/one/) || (/two/) {print}'
one
two
Enter fullscreen mode Exit fullscreen mode

Patterns are AWK’s true strength, even the keywords BEGIN and END mentioned above are, in fact, just patterns…

Computing with AWK

Besides text, AWK also handles numerical data well, and is a formidable tool for performing calculations, generating reports, and so on.

For computations it is often handy to make use of user-created variables, which you do not need to define, like so:

echo -e '1\n1\n1' | awk '{total+=$0} END {print total}'

3
Enter fullscreen mode Exit fullscreen mode

Functions

Built-in functions

We have already seen that awk provides built-in variables. Similarly, there are built-in functions for computing useful values. There are several kinds of functions such as functions for arithmetic, time, bit manipulation and also text functions. One of these is length, which counts the number of characters in a string:

echo -e 'one\ntwo\nthree' | awk '{print length($0)}'

3
3
5
Enter fullscreen mode Exit fullscreen mode

User-defined functions

You can also define your own functions with the following syntax:

function function_name(arg1, arg2, ...){function body}
Enter fullscreen mode Exit fullscreen mode

Returning value with return is optional.

A basic example is a function that returns the minimum number:

echo -e '1 2\n4 3\n5 7' | awk 'function min(num1,num2){if (num1 < num2) return num1 else return num2}{print min($1,$2)}'
1
3
5
Enter fullscreen mode Exit fullscreen mode

Of course, such an example would be more legible in an AWK script than a oneliner.

Control-flow

The previous example already showed that AWK is also capable of if-else statements for making decisions:

if (condition)
action-1
else
action-2
Enter fullscreen mode Exit fullscreen mode

It also has the ability to create for loops:

for (initialization; condition; increment/decrement)
action
Enter fullscreen mode Exit fullscreen mode

(Both are modeled on the syntax of the C programming language)

awk 'BEGIN { for (i = 1; i <= 5; ++i) print i }'

1
2
3
4
5
Enter fullscreen mode Exit fullscreen mode

Arrays

Awk provides arrays for storing groups of related values. These are very powerful as they can even be multidimensional.

A simple example:

echo -e 'Tom 1\nHarry 2\nTom 3\nHarry 4\nJimmy 5' | awk '{total[$1]+=$2} END {print total["Tom"]}'

5
Enter fullscreen mode Exit fullscreen mode

AWK as a scripting language

A more elaborate example shows how AWK can be a valid alternative for Python or other scripting languages. (Also keep in mind that in some programming environments (servers, for instance) you might not have access to Python or your favourite scripting language, or limited possibilities to upload source code files!)

Let’s count all .json, .xml and .csv files in my Dropbox folder.

Let’s first gather all files recursively into a source file:

find /home/tdeneire/Dropbox/. -not -type d > files.txt
wc -l files.txt 

132899 files.txt
Enter fullscreen mode Exit fullscreen mode

This amounts to some 130K files.

In Python we could do something like this:

total = {}

with open("/home/tdeneire/Dropbox/files.txt", "r") as reader:
    for line in reader:
        line = line.strip()
        extension = line.split(".")[-1]
        total.setdefault(extension, 0)
        total[extension] += 1

for key in ["json", "xml", "csv"]:
    print(key, total[key])
Enter fullscreen mode Exit fullscreen mode

In AWK:

awk -F "." '{extension=$(NF); total[extension]+=1} END {wanted["json"]; wanted["csv"]; wanted["xml"]; for (ext in wanted){print ext,total[ext]}}' /home/tdeneire/Dropbox/files.txt
Enter fullscreen mode Exit fullscreen mode

In this particular example, AWK is much slower than Python (about 2.5 times slower), but if we play to AWK’s strengths, i.e. pattern matching, for instance, searching for ten consecutive digits, matters are quickly reversed.

In Python

import re

with open("/home/tdeneire/Dropbox/files.txt", "r") as reader:
    for line in reader:
        line = line.strip()
        if re.match('.*[0-9]{10}', line):
            print(line)
Enter fullscreen mode Exit fullscreen mode

With AWK:

awk -F "." '/.*[0-9]{10}/' /home/tdeneire/Dropbox/files.txt
Enter fullscreen mode Exit fullscreen mode

Difference:

time py3 test.py

real 0m1.296s
user 0m0.693s
sys  0m0.143s

time awk -F "." '/.*[0-9]{10}/' /home/tdeneire/Dropbox/files.txt

real 0m0.659s
user 0m0.105s
sys  0m0.113s
Enter fullscreen mode Exit fullscreen mode

This just goes to show the importance of picking the right tool for the job for scripting.

But in some cases that tool might just be AWK, even if it’s over 40 years old!

Sources


Hi! 👋 I’m Tom. I’m a software engineer, a technical writer and IT burnout coach. If you want to get in touch, check out https://tomdeneire.github.io

Top comments (0)