Awk - A useful little language

Raunak Ramakrishnan ・3 min read

Awk is a small but capable programming language used for processing text. It was developed by Aho, Weinberger, and Kernighan at Bell Labs.

Julia Evans made an awesome intro to awk:
AWK comic

Awk reads its input as a sequence of lines and splits each line into fields. The field separator is whitespace by default, but you can change it to any character (or a regular expression).
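As a small sketch of field splitting (the input lines here are made up for illustration), the first command relies on the default whitespace separator and the second switches to a colon with -F:

```shell
# Default splitting: runs of spaces and tabs separate fields.
default_split=$(printf 'alpha   beta\tgamma\n' | awk '{ print $2 }')
echo "$default_split"

# Custom separator: -F: splits on colons instead (made-up sample line).
custom_split=$(printf 'root:x:0:0\n' | awk -F: '{ print $1, $3 }')
echo "$custom_split"
```

The first command prints the second field, beta, and the second prints the first and third colon-separated fields.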

An awk program is a sequence of pattern-action pairs, i.e., for each line, awk checks whether the line matches the pattern and, if it does, performs the associated action on that line. Awk can be used interactively or to run saved programs.

Here is what Awk does written in Python-like pseudocode:

initialize()  # Runs the BEGIN block and initializes variables
for line in input_lines:  # Awk divides the file / input into a list of lines
    for condition, action in conditions:  # A program is a list of condition-action pairs
        if condition(line):  # Match line against condition
            action(line)  # Perform action on the matching line
finalize()  # Runs the END block after the last line
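The same flow can be sketched in real awk. The three-line input below is invented for the example; the program has one BEGIN action, one pattern-action pair, and one END action:

```shell
out=$(printf 'apple\nbanana\navocado\n' | awk '
BEGIN { print "start" }            # runs once, before any input is read
/^a/  { print "matched:", $0 }     # runs for each line that starts with "a"
END   { print "lines read:", NR }  # runs once, after the last line
')
echo "$out"
```

Only the lines beginning with "a" trigger the middle action, while BEGIN and END each run exactly once.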

Here are some small snippets of Awk:

1. Hello World!

You can run awk programs inline or through a file:

awk 'BEGIN{ print "Hello, World!"}'

Alternatively, you can save this to a file hello.awk:

BEGIN{ print "Hello, World!"}

Then run it as awk -f hello.awk

2. Reading a CSV and printing a specific column

Let's now do something useful! Download this CSV, which contains 2010 census data by zip code in Los Angeles city.

Read the first 3 lines of the CSV: head -n 3 2010_Census_Populations_by_Zip_Code.csv

Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4

We will print just the Total Population column using awk -F, '{print $2}' 2010_Census_Populations_by_Zip_Code.csv

The -F, sets the field separator to comma as we need to split by commas for getting fields in a CSV file. $n allows you to use the value in the nth column.
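To try this without downloading the file, here is a minimal sketch that feeds the three lines shown above into awk through a shell variable (the variable name sample is just an illustration) and prints two columns at once:

```shell
# The three lines from the head output above, kept in a shell variable.
sample='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

# Print the zip code ($1) and total population ($2) of every line.
cols=$(printf '%s\n' "$sample" | awk -F, '{ print $1, $2 }')
echo "$cols"
```

The comma between $1 and $2 in the print statement makes awk join the two values with a space in the output.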

3. Computing some statistics

Awk allows the use of variables and functions. Let's see how to use them by computing the total population in the entire city.

# total.awk
{s += $2}
END {print "Total population:", s}

Variables are initialized to 0 by default, so s needs no explicit setup. Here, we use s to hold the running total.
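A quick way to see this default is the sketch below, which uses made-up numbers and a hypothetical variable name unset_var. An unset variable behaves as 0 in arithmetic and as an empty string when concatenated:

```shell
# sum and count are never declared; they start out as 0.
# unset_var is never assigned at all: it is 0 in arithmetic, "" in a string.
out=$(printf '3\n4\n5\n' | awk '
{ sum += $1; count++ }
END { print sum, count, unset_var + 0, "[" unset_var "]" }
')
echo "$out"
```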

Running this script as awk -F, -f total.awk 2010_Census_Populations_by_Zip_Code.csv, we get output: Total population: 10603988

Special variables and built-in functions

Awk uses some special variables and functions to make your programs more compact:

  • NF : Number of fields in a line
  • NR : Number of lines read so far, i.e. the current line number
  • $0 : The entire input line
  • length : gives number of characters in a string
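These can be seen together in a short sketch. The two input lines are invented for the example; for each one, awk prints the line number, the field count, the line length, and the line itself:

```shell
out=$(printf 'one two three\nfour five\n' | awk '{ print NR, NF, length($0), $0 }')
echo "$out"
```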

Now, we will compute the average household size which is total population divided by total households. The columns of interest are $2 and $6.
We also want the average population per zip code. Our script:

# stats.awk
{ s += $2; h += $6;}
END {print "Total population:", s, "\nTotal households:", h, "\nAverage household size:", s/h, "\nAverage population per zip code:", s/NR}

In the END block, NR holds the total number of lines read. We do not want to count the header line, so we skip it with the tail command: tail -n +2 starts output at the 2nd line. Running tail -n +2 2010_Census_Populations_by_Zip_Code.csv | awk -F, -f stats.awk gives us:

Total population: 10603988
Total households: 3497698
Average household size: 3.0317
Average population per zip code: 33241.3
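An alternative to tail, sketched here as one possible approach rather than the method used above, is to skip the header inside awk with an NR > 1 pattern and count the data lines in a separate variable (n is a name chosen for this example). The input is just the three sample lines shown earlier, so the totals are tiny:

```shell
sample='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

# NR > 1 skips the header; n counts only the data lines.
out=$(printf '%s\n' "$sample" | awk -F, '
NR > 1 { s += $2; h += $6; n++ }
END { print s, h, n }
')
echo "$out"
```

With n available, s/n gives the average population per zip code without depending on how many header lines the file has.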

4. Pattern matching

We have done some useful things with awk so far, but we have ignored its biggest strength: pattern matching. We can match based on field values, regexes, and line numbers.

  • Print every 2nd line : NR%2 == 0 {print $0}. Here $0 stands for the entire line.
  • Print all zip codes with population > 100,000 : $2 > 100000 {print $1}
  • Print all zip codes with population > 10,000 and average household size > 4 : $2 > 10000 && $7 > 4 { print $1}. We can combine conditions using && and ||, which stand for logical AND and OR respectively.
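The sketch below runs a regex pattern and a combined condition against the three sample lines shown earlier. The NR > 1 guard is an addition for this example: the header fields are text, so awk compares them as strings rather than numbers, and the header may match a numeric-looking condition unexpectedly.

```shell
sample='Zip Code,Total Population,Median Age,Total Males,Total Females,Total Households,Average Household Size
91371,1,73.5,0,1,1,1
90001,57110,26.6,28468,28642,12971,4.4'

# Regex match on a field: zip codes that start with 913.
regex_hit=$(printf '%s\n' "$sample" | awk -F, '$1 ~ /^913/ { print $1 }')
echo "$regex_hit"

# Combined conditions, skipping the header line with NR > 1.
big_households=$(printf '%s\n' "$sample" | awk -F, 'NR > 1 && $2 > 10000 && $7 > 4 { print $1 }')
echo "$big_households"
```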

Further reading

There is a lot more to Awk. Here are some references:

  • The best resource for learning Awk is The AWK Programming Language, written by the same trio. This book goes above and beyond a typical programming language tutorial and teaches you how to use your Awk superpowers to build versatile systems like a relational database, a parser, an interpreter, etc.

  • The GNU Awk manual, Gawk: Effective AWK Programming, is a thorough reference.

Posted by:

Raunak Ramakrishnan

@rrampage

Passionate about databases, distributed systems and functional programming.

Discussion


Nice intro/overview! Awk is great, I've used it a lot ... I remember I often used it to read/parse log files and then generate SQL using Awk, in order to perform database changes. It's very flexible for this kind of extract/transform/output work.

 

Some Awk one-liners replace my use of other Unix tools like cut, grep, join. Some people use perl as a better awk but I prefer the simplicity of Awk.

 

Awk is extremely flexible and easy to use. Can indeed replace cut/grep/join if you want to. And perl, never got into it, too complicated and cryptic, I prefer the C/javascript-like syntax of Awk.

 

Awesome Article! Great Hands on tutorial ❤

 

Best awk intro/summary I have read, thank you.

 

Output formatting becomes much better when you ditch print in favor of printf.

 

That was a fun introduction. Glad to add that as a tool in my toolbelt

 

Nice intro, good pointers. Thanks!

 

The picture is enough for me, thanks the author!

 

Interesting! Just wondering why a whole new language for a feature? (Not to sound critical).

Are there any specialised optimizations specific to file I/O and parsing the file at a lower level? If so, it would be great to have it also as a wrapper for other languages. Any benchmarking?

🙂🙂

 

Hi Ishani,

awk is a very old language (1977), predating scripting languages like perl and python. As part of Unix philosophy, it is used in combination with other Unix tools. It is simpler and faster to write than say a python script. Most awk uses are simple one-liners to extract particular columns.

It is indeed very fast as all it does is: For each line:

  • Check if line matches a pattern
  • Perform the associated action on the line

There was this famous article which showed that clever use of command-line tools can be several times faster than some big-data tools.

 

Very helpful! Thanks!

 

LOVE AWK!!!!!!!!!!!

 

Wow, pandas in Bash ! Does it support cluster computing? ;)

 

It may replace your cluster :P
See this article on how a bash script is 235x faster than a Hadoop cluster.