Discussion on: Which programming language/environment is more “powerful” than people realize?

AWK

It can handle just about any task on delimited data tables with millions of rows in a few lines of code, and it runs as fast as an equivalent Go or C program.

In Python you need to write many lines and often reach for third-party libraries to normalize columns with regex, drop rows, etc. In AWK it's just a one-liner, or you can put it in an *.awk file as well.
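
To make that concrete, here is a minimal sketch of such a one-liner. The sample rows and the id|name|status column layout are made up for this illustration and are not from a real dataset:

```shell
# Hypothetical sample data: id|name|status (made up for this sketch)
data='1|  Alice |active
2|Bob|inactive
3| Carol|active'

# Keep only rows whose 3rd column is "active", trim spaces around the name
# with a regex, and print id and name separated by "|".
out=$(printf '%s\n' "$data" |
  awk -F'|' -v OFS='|' '$3 == "active" { gsub(/^ +| +$/, "", $2); print $1, $2 }')

printf '%s\n' "$out"
```

The pattern before the braces drops rows, and gsub does the regex cleanup, so filtering and normalizing fit in one short program.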

AWK is a secret gem for dealing with big data.

Update:

Here is my AWK script, which normalizes a delimited CSV table millions of lines long using only AWK, as fast as a Go counterpart program:

#!/usr/bin/awk -f

# This block runs only once, before the first line is read
# Set the separators here; a CSV header could also be printed from this block
BEGIN {
    FS="|"; # input field separator
    OFS="|"; # output field separator
}

# This block runs at every line
{
    # Assign a named variable to each column
    line = $0; # variable `$0` stores the entire line
    url = $1;
    title = $2;
    body = $3;
    tags = $4;

    if (line ~ /^$/) next; # if line is blank, then skip it
    if (NF != 4) next; # if column count is not equal to 4, then skip the line

    # Skip any line where tags column contains the word "cars"
    if (index(tags, "cars") != 0) { next; }

    # Normalize the url column with regex by only keeping the article id
    # Example input: <a href="https://example.com/article/foo123456">Hello</a>
    gsub(/.*example\.com\/article\/|[\042].*/, "", url); # outputs: foo123456

    # Skip lines that have non-alphanumeric characters in the url column (like <>#&@)
    # Skip lines that have an empty url column (after gsub normalization)
    # Skip lines where url starts with foo or bar
    if (url ~ /[^[:alnum:]]/ ||
        length(url) == 0 ||
        url ~ /^(foo|bar)/) {
        next;
    }

    # Replace multiple ; with one (needed for errorless CSV import in Postgres)
    gsub(/[\073]+/, ";", tags);

    # Print the line with OFS, aka: profit! :)
    print url, title, body, tags;
}
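
The comment on the BEGIN block mentions printing a CSV header, which the script itself does not do. A minimal sketch of how that could look, assuming hypothetical header names and two made-up sample rows in the same url|title|body|tags layout:

```shell
# Hypothetical two-row input in the url|title|body|tags layout
rows='abc123|Title A|Body A|news;tech
xyz789|Title B|Body B|sports'

out=$(printf '%s\n' "$rows" | awk '
BEGIN {
    FS = "|"; OFS = "|"
    # Runs once before any input is read: emit the header row
    print "url", "title", "body", "tags"
}
NF == 4 { print $1, $2, $3, $4 }   # pass through rows with exactly 4 columns
')

printf '%s\n' "$out"
```

Because of the `#!/usr/bin/awk -f` line, the full script can be saved to a file (say normalize.awk, a hypothetical name), made executable with chmod +x, and run as ./normalize.awk input.csv > output.csv.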

Josh Cheek • Jun 4 '18

I'm going to put Ruby under this one, as well. Ruby inherited a ton of Perlisms that make it competitive with Perl, awk, and sed for these types of use cases, but they're really underused. E.g. the following are all relevant here: the flags n, p, e, i, l, a, s, 0, c, F; the 2-letter globals (ruby -e 'puts global_variables.grep /\$.$/'); BEGIN { ... } and END { ... }; ARGF; flip flops (which most people don't even know exist); regex literals in conditionals; and the private methods that are added to main when the -n and -p flags are set. IDK, probably other stuff, too, that's all off the top of my head.

Donald Merand • Jun 3 '18

100% agree, I love awk and use it all the time! It's also great combined with the other UNIX utilities such as cat, sort, uniq, etc.
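
As a rough sketch of that kind of pipeline, here is a tag count over a few made-up rows (the sample data and its url|title|body|tags layout are assumptions for illustration only):

```shell
# Hypothetical sample rows in url|title|body|tags layout
rows='a1|T1|B1|tech
a2|T2|B2|news
a3|T3|B3|tech
a4|T4|B4|tech'

# awk extracts the tags column, sort groups equal values together,
# uniq -c counts each group, sort -rn orders by count (highest first),
# and the final awk tidies each line into "tag count".
out=$(printf '%s\n' "$rows" |
  awk -F'|' '{ print $4 }' |
  sort |
  uniq -c |
  sort -rn |
  awk '{ print $2, $1 }')

printf '%s\n' "$out"
```

Each tool does one small job, and AWK sits at both ends: once to pick the column and once to reshape the counts.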

Have you ever read Ryan Tomayko's AWK-ward Ruby? I didn't realize that Ruby had inherited so much from AWK, but it makes me happy as a Ruby user.