Raunak Ramakrishnan

Posted on Jun 1, 2018

Awk - A useful little language

#programming #bash #shell #linux

Awk is a small but capable programming language used for processing text. It was developed by Aho, Weinberger, and Kernighan at Bell Labs.

Julia Evans made an awesome intro to awk:

Awk scans an input file as a sequence of lines and splits each line into fields. The field separator is whitespace by default, but you can change it to any character.

An awk program is a sequence of pattern-action pairs, i.e., for each line, awk checks whether the line matches the pattern and, if it does, performs the associated action on that line. Awk can be used interactively or to run saved programs.

Here is what Awk does written in Python-like pseudocode:

initialize()  # Runs the BEGIN block, e.g. to set up variables
for line in input_lines:  # Awk divides the file / input into a list of lines
    for condition, action in conditions:  # A program is a list of condition-action pairs
        if condition(line):  # Match the line against the condition
            action(line)  # Perform the action on a match
finalize()  # Runs the END block after the last line
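The same loop can be seen in a real awk program. This is a minimal sketch, assuming a made-up three-line input piped in with printf; the variable names out and count are chosen only for illustration.

```shell
# Sample input is made up for illustration.
out=$(printf 'apple 3\nbanana 12\ncherry 7\n' | awk '
BEGIN   { print "start" }          # runs once, before any input is read
$2 > 5  { print $1, "is large" }   # action runs only on lines where the pattern is true
        { count++ }                # no pattern: the action runs on every line
END     { print count, "lines" }   # runs once, after the last line
')
echo "$out"
```

Each input line is tested against every pattern in order, so a single line can trigger more than one action.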

Here are some small snippets of Awk:

1. Hello World!

You can run awk programs inline or through a file:

awk 'BEGIN{ print "Hello, World!"}'

Alternatively, you can save this to a file hello.awk:

BEGIN{ print "Hello, World!"}

Then run it as awk -f hello.awk

2. Reading a CSV and printing a specific column

Let's now do something useful! Download this csv which is 2010 census data by zip code in Los Angeles city.

Read the first 3 lines from csv: head -3 2010_Census_Populations_by_Zip_Code.csv

Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4

We will print just the total population column using awk -F, '{print $2}' 2010_Census_Populations_by_Zip_Code.csv

The -F, sets the field separator to comma as we need to split by commas for getting fields in a CSV file. $n allows you to use the value in the nth column.
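To try field splitting without downloading the file, here is a small sketch on inline data. The first row is taken from the sample above; the second row and the colon-separated line are made up for illustration.

```shell
# Two rows in the same comma-separated layout as the census file
# (the second row is invented for this example).
rows='90001,57110,26.6
90002,51223,25.5'

# -F, splits on commas; $1 is the zip code and $2 the total population.
out=$(printf '%s\n' "$rows" | awk -F, '{ print $1 " has " $2 " people" }')
echo "$out"

# -F accepts other separators too, for example a colon.
first=$(printf 'a:b:c\n' | awk -F: '{ print $2 }')
echo "$first"
```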

3. Computing some statistics

Awk allows the use of variables and functions. Let's see how to use them by computing the total population in the entire city.

# total.awk
{s += $2}
END {print "Total population:", s}

Variables do not need to be declared: an uninitialized variable is treated as 0 in a numeric context. Here, we use a variable s to hold the total.

Running this script as awk -F, -f total.awk 2010_Census_Populations_by_Zip_Code.csv, we get output: Total population: 10603988
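The default value of variables can be checked on a tiny input. This is a sketch with made-up numbers; u is a hypothetical variable that is never assigned, used only to show what an unset variable looks like.

```shell
out=$(printf '10\n20\n30\n' | awk '
{ s += $1 }     # s is never declared; it starts out as 0
END {
    print "sum:", s
    print "unset:", u + 0, "[" u "]"   # u is 0 as a number and empty as a string
}
')
echo "$out"
```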

Special variables and built-in functions

Awk uses some special variables and functions to make your programs more compact:

NF : Number of fields in a line
NR : Number of lines read so far, i.e. the current line number (in an END block, the total number of lines)
$0 : The entire input line
length : gives number of characters in a string
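A quick way to see these in action is to print them for every line. The two input lines below are made up; $NF is not in the list above, but since NF holds the number of fields, $NF refers to the last field.

```shell
# For each line print: line number, field count, line length, last field.
out=$(printf 'one two three\nfour five\n' | awk '{ print NR, NF, length($0), $NF }')
echo "$out"
```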

Now, we will compute the average household size which is total population divided by total households. The columns of interest are $2 and $6.
We also want the average population per zip code. Our script:

# stats.awk
{ s += $2; h += $6;}
END {print "Total population:", s, "\nTotal households:", h, "\nAverage household size:", s/h, "\nAverage population per zip code:", s/NR}

NR gives us the total number of lines. But we do not want to count the header line. We can use the tail command to skip the 1st line: tail -n +2. Running tail -n +2 2010_Census_Populations_by_Zip_Code.csv | awk -F, -f stats.awk gives us:

Total population: 10603988
Total households: 3497698
Average household size: 3.0317
Average population per zip code: 33241.3
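As an alternative to tail, the header can also be skipped inside awk by using NR > 1 as the pattern. This is a sketch of that variation, not the script used above: it runs on just the three sample lines shown earlier, and it keeps its own row counter n (a name picked for illustration) because NR would still include the header.

```shell
data='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

out=$(printf '%s\n' "$data" | awk -F, '
NR > 1 { s += $2; h += $6; n++ }   # NR > 1 skips the header line
END {
    print "Total population:", s
    print "Total households:", h
    print "Average population per zip code:", s / n
}
')
echo "$out"
```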

4. Pattern matching

We have done some useful things with awk so far, but we have ignored its biggest strength: pattern matching. We can match based on field values, regexes, and line numbers.

Print every 2nd line : NR%2 == 0 {print $0}. Here $0 stands for the entire line.
Print all zip codes with population > 100,000 : $2 > 100000 {print $1}
Print all zip codes with population > 10,000 and average household size > 4 : $2 > 10000 && $7 > 4 { print $1}. We can combine conditions using && and || which stand for logical and and or respectively.
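Here is a small sketch of all three kinds of pattern, run on the three sample lines shown earlier rather than the full file. The regex /^913/ and the variable names are chosen for illustration.

```shell
data='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

# Field values combined with &&; NR > 1 keeps the header out of the comparison.
big=$(printf '%s\n' "$data" | awk -F, 'NR > 1 && $2 > 10000 && $7 > 4 { print $1 }')
echo "$big"

# Regex match on a single field: zip codes starting with 913.
re=$(printf '%s\n' "$data" | awk -F, '$1 ~ /^913/ { print $1 }')
echo "$re"

# Line number as the pattern: every 2nd line.
even=$(printf '%s\n' "$data" | awk 'NR % 2 == 0 { print $0 }')
echo "$even"
```

The NR > 1 guard matters when the header is still present: its fields are text, so awk compares them as strings, and the header line may then match a numeric-looking condition.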

Top comments (15)

leob • Jun 2 '18

Nice intro/overview! Awk is great, I've used it a lot ... I remember I often used it to read/parse log files and then generate SQL using Awk, in order to perform database changes. It's very flexible for this kind of extract/transform/output work.

Raunak Ramakrishnan • Jun 3 '18

Some Awk one-liners replace my use of other Unix tools like cut, grep, join. Some people use perl as a better awk but I prefer the simplicity of Awk.

leob • Jun 3 '18

Awk is extremely flexible and easy to use. Can indeed replace cut/grep/join if you want to. And perl, never got into it, too complicated and cryptic, I prefer the C/javascript-like syntax of Awk.

Richard C • Jun 5 '18

Very helpful! Thanks!

Matt Wang • Dec 27 '19

The picture is enough for me, thanks the author!

Ishani Gupta • Jun 4 '18

Interesting! Just wondering why a whole new language for a feature? (Not to sound critical).

Are there any specialised optimizations specific to file i/o and parsing the file at a lower level? If so, it would be great to have it also as a wrapper for other languages. Any benchmarking?

🙂🙂

Raunak Ramakrishnan • Jun 5 '18

Hi Ishani,

awk is a very old language (1977), predating scripting languages like perl and python. As part of Unix philosophy, it is used in combination with other Unix tools. It is simpler and faster to write than say a python script. Most awk uses are simple one-liners to extract particular columns.

It is indeed very fast as all it does is: For each line: