DEV Community

Raunak Ramakrishnan

Awk - A useful little language

Awk is a small but capable programming language for processing text. It was developed by Alfred Aho, Peter Weinberger, and Brian Kernighan at Bell Labs.

Julia Evans made an awesome intro to awk:
AWK comic

Awk scans the input as a sequence of lines and splits each line into fields. The field separator is whitespace by default, but you can set it to any character.
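As a small illustration of field splitting, the sketch below feeds awk one line split on whitespace and one split on a colon. The sample strings and the shell variable names are made up for this example.

```shell
# Default splitting: runs of spaces or tabs separate fields.
whitespace_field=$(printf 'alpha   beta\tgamma\n' | awk '{ print $2 }')
echo "$whitespace_field"   # prints: beta

# Custom separator: -F: splits on colons instead of whitespace.
colon_field=$(printf 'alpha:beta:gamma\n' | awk -F: '{ print $2 }')
echo "$colon_field"        # prints: beta
```

In both cases $2 is the second field; only the rule for finding field boundaries changes.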

An awk program is a sequence of pattern-action pairs, i.e., for each line, awk checks whether the line matches the pattern and, if it does, performs the associated action on that line. Awk can be used interactively or to run saved programs.

Here is what Awk does written in Python-like pseudocode:

initialize()  # Run the BEGIN block, which typically initializes variables
for line in input_lines:  # Awk divides the file / input into a list of lines
    for condition, action in conditions:  # A program is a list of condition-action pairs
        if condition(line):  # Match the line against the condition
            action(line)  # Perform the action on the matching line
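The same loop can be seen in a real awk program. This is a minimal sketch on made-up input (a fruit name and a count per line): one BEGIN block, one pattern-action pair, and one END block, which runs after the last line and is not part of the pseudocode above.

```shell
report=$(printf 'apple 3\nbanana 12\ncherry 7\n' | awk '
BEGIN  { print "start" }              # runs once, before any input is read
$2 > 5 { print $1; n++ }              # runs for each line whose 2nd field exceeds 5
END    { print n, "matching lines" }  # runs once, after all input is read
')
echo "$report"
```

With this input the program prints start, then banana and cherry, then 2 matching lines.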

Here are some small snippets of Awk:

1. Hello World!

You can run awk programs inline or through a file:

awk 'BEGIN{ print "Hello, World!"}'

Alternatively, you can save this to a file hello.awk:

BEGIN{ print "Hello, World!"}

Then run it as awk -f hello.awk

2. Reading a CSV and printing a specific column

Let's now do something useful! Download this CSV, which contains 2010 census data by zip code for the city of Los Angeles.

Read the first 3 lines of the CSV: head -n 3 2010_Census_Populations_by_Zip_Code.csv

Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4

We will print just the Total Population column using awk -F, '{print $2}' 2010_Census_Populations_by_Zip_Code.csv

The -F, option sets the field separator to a comma, since we need to split on commas to get the fields of a CSV file. $n refers to the value in the nth field (column) of the current line.
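To try this without downloading the file, the sketch below inlines the three lines shown by head into a shell variable (the names sample, population_column and zip_and_population are only for this example) and pipes them through awk.

```shell
# First three lines of the CSV, inlined so the example is self-contained.
sample='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

# Column 2 of every line, header included.
population_column=$(printf '%s\n' "$sample" | awk -F, '{ print $2 }')
echo "$population_column"

# Several fields can be printed at once; the comma in print inserts a space.
zip_and_population=$(printf '%s\n' "$sample" | awk -F, '{ print $1, $2 }')
echo "$zip_and_population"
```

Note that -F, splits on every comma, so it only suits simple CSV files like this one, where no field contains a quoted comma.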

3. Computing some statistics

Awk allows the use of variables and functions. Let's see how to use them by computing the total population in the entire city.

# total.awk
{s += $2}
END {print "Total population:", s}

Variables are initialized to 0 by default, so s needs no explicit initialization. Here, we use s to hold the running total.

Running this script as awk -F, -f total.awk 2010_Census_Populations_by_Zip_Code.csv, we get output: Total population: 10603988
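The same program can be checked by hand on a tiny input. This sketch runs it inline on the three sample lines shown earlier (the shell variable names are only for this example). The header line adds nothing to the sum, because the text "Total Population" counts as 0 in arithmetic.

```shell
# First three lines of the CSV, inlined so the example is self-contained.
sample='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

# Same logic as total.awk, written inline: 0 (header) + 1 + 57110.
total=$(printf '%s\n' "$sample" | awk -F, '{ s += $2 } END { print "Total population:", s }')
echo "$total"   # prints: Total population: 57111
```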

Special variables and built-in functions

Awk uses some special variables and functions to make your programs more compact:

  • NF : Number of fields in a line
  • NR : Number of the current line (inside an END block, the total number of lines read)
  • $0 : The entire input line
  • length : gives number of characters in a string
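A quick way to see these in action is to print them for every line. The sketch below uses two made-up input lines.

```shell
# For each line: its number, its field count, its length, and the line itself.
summary=$(printf 'one two three\nfour five\n' | awk '{ print NR, NF, length($0), $0 }')
echo "$summary"
# prints:
# 1 3 13 one two three
# 2 2 9 four five
```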

Now, we will compute the average household size which is total population divided by total households. The columns of interest are $2 and $6.
We also want the average population per zip code. Our script:

# stats.awk
{ s += $2; h += $6;}
END {print "Total population:", s, "\nTotal households:", h, "\nAverage household size:", s/h, "\nAverage population per zip code:", s/NR}

In the END block, NR holds the total number of lines read, but we do not want to count the header line. We can skip the first line with the tail command, as tail -n +2. Running tail -n +2 2010_Census_Populations_by_Zip_Code.csv | awk -F, -f stats.awk gives us:

Total population: 10603988
Total households: 3497698
Average household size: 3.0317
Average population per zip code: 33241.3
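As an alternative to tail, the header can be skipped inside awk itself with an NR > 1 pattern. The sketch below is one possible variant, not the script used above. It runs on the three sample lines from earlier, and it counts data rows in a separate variable n (a name chosen for this example) so that the header is not included in the average.

```shell
# First three lines of the CSV, inlined so the example is self-contained.
sample='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

stats=$(printf '%s\n' "$sample" | awk -F, '
NR > 1 { s += $2; h += $6; n++ }   # skip the header and count the data rows
END {
    print "Total population:", s
    print "Total households:", h
    print "Average household size:", s/h
    print "Average population per zip code:", s/n
}')
echo "$stats"
```

On an input with no data rows, h and n stay at 0 and the divisions fail, so a real script would want to check for that before dividing.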

4. Pattern matching

We have done some useful things with awk so far, but we have ignored its biggest strength: pattern matching. We can match on field values, regexes, and line numbers.

  • Print every 2nd line : NR%2 == 0 {print $0}. Here $0 stands for the entire line.
  • Print all zip codes with population > 100,000 : $2 > 100000 {print $1}
  • Print all zip codes with population > 10,000 and average household size > 4 : $2 > 10000 && $7 > 4 { print $1}. We can combine conditions using && and || which stand for logical and and or respectively.
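Here is a sketch of these patterns run on the three sample lines from earlier, plus a regex match, which the list above does not show. One caveat: the header's fields are text, so awk compares them as strings and a comparison like $2 > 10000 can let the header through. The NR > 1 guard is added here to avoid that. The shell variable names are only for this example.

```shell
# First three lines of the CSV, inlined so the example is self-contained.
sample='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

# Every 2nd line of the input (the header is line 1).
every_second=$(printf '%s\n' "$sample" | awk 'NR % 2 == 0 { print $0 }')
echo "$every_second"       # prints: 91371,1,73.5,0,1,1,1

# Field comparisons combined with &&, skipping the header via NR > 1.
large_households=$(printf '%s\n' "$sample" | awk -F, 'NR > 1 && $2 > 10000 && $7 > 4 { print $1 }')
echo "$large_households"   # prints: 90001

# Regex match on one field: zip codes starting with 900.
zips_900=$(printf '%s\n' "$sample" | awk -F, '$1 ~ /^900/ { print $1 }')
echo "$zips_900"           # prints: 90001
```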

Further reading

There is a lot more to Awk. Here are some references:

  • The best resource for learning Awk is The AWK Programming Language, written by the same trio who created the language. This book goes above and beyond a typical programming language tutorial and teaches you how to use your Awk superpowers to build versatile systems like a relational database, a parser, an interpreter, etc.

  • The GNU Awk manual, Gawk: Effective AWK Programming, is a thorough reference.

Discussion (15)

leob

Nice intro/overview! Awk is great, I've used it a lot ... I remember I often used it to read/parse log files and then generate SQL using Awk, in order to perform database changes. It's very flexible for this kind of extract/transform/output work.

Raunak Ramakrishnan Author

Some Awk one-liners replace my use of other Unix tools like cut, grep, join. Some people use perl as a better awk but I prefer the simplicity of Awk.

leob

Awk is extremely flexible and easy to use. Can indeed replace cut/grep/join if you want to. And perl, never got into it, too complicated and cryptic, I prefer the C/javascript-like syntax of Awk.

Palash Bauri 👻

Awesome Article! Great Hands on tutorial ❤

Richard vK

Best awk intro/summary I have read, thank you.

Thomas H Jones II

Output formatting becomes much better when you ditch print in favor of printf.

JasonSooter

That was a fun introduction. Glad to add that as a tool in my toolbelt

Samuel A.

Nice intro, good pointers. Thanks!

Matt Wang

The picture is enough for me, thanks to the author!

Ishani Gupta

Interesting! Just wondering why a whole new language for a feature? (Not to sound critical).

Are there any specialised optimizations specific to file I/O and parsing the file at a lower level? If so, it would be great to have it as a wrapper for other languages too. Any benchmarking?

🙂🙂

Raunak Ramakrishnan Author

Hi Ishani,

awk is a very old language (1977), predating scripting languages like perl and python. In keeping with the Unix philosophy, it is used in combination with other Unix tools. It is simpler and faster to write than, say, a python script. Most awk uses are simple one-liners to extract particular columns.

It is indeed very fast, as all it does for each line is:

  • Check if the line matches a pattern
  • Perform the associated action on the line

There was this famous article which showed that clever use of command-line tools can be several times faster than some big-data tools.

Richard C

Very helpful! Thanks!

Steve Ziegler

LOVE AWK!!!!!!!!!!!

lukaszkuczynski

Wow, pandas in Bash ! Does it support cluster computing? ;)

Raunak Ramakrishnan Author

It may replace your cluster :P
See this article on how a bash script is 235x faster than a Hadoop cluster.