DEV Community

loading...

Super simple and fast delimited CSV data normalization with AWK

István Lantos
・2 min read

There was a discussion here on dev.to which programming language or tool more powerful than people realize?

I shared my AWK snippet, where I normalize a 8GB sized delimited CSV data source for database import, only with AWK, as fast as a Go counterpart program. This task completes in 7 minutes on my home machine, same as my Go code:

#!/usr/bin/awk -f

# This block only runs once at the start, before the first line
# Use this to print CSV header on the top
BEGIN {
    FS="|"; # input field separator
    OFS="|"; # output field separator
}

# This block runs at every line
{
    # We will order a new named variable to every column
    line = $0; # variable `$0` stores the entire line
    url = $1;
    title = $2;
    body = $3
    tags = $4;

    if (line ~ /^$/) next; # if line is blank, then skip it
    if (NF != 4) next; # if column count is not equal to 4, then skip the line

    # Skip any line where tags column contains the word "cars"
    if (index(tags, "cars") != 0) { next; }

    # Normalize the url column with regex by only keeping the article id
    # Example input: <a href="https://example.com/article/foo123456">Hello</a>
    gsub(/.*example\.com\/article\/|[\042].*/, "", url); # outputs: foo123456

    # Skip lines that has non-alphanumeric characters in url column (like <>#&@)
    # Skip lines that has empty url column (after gsub normalization)
    # Skip lines where url starts with foo or bar
    if (url !~ /[[:alnum:]]/ ||
        length(url) == 0 ||
        url ~ /^foo|^bar/) {
        next;
    }

    # Replace multiple ; with one (needed for errorless CSV import in Postgres)
    gsub(/[\073]+/, ";", tags);

    # Print the line with OFS, aka: profit! :)
    print url, title, body, tags;
}

Enter fullscreen mode Exit fullscreen mode

Then you should run this with:

$ zcat my-csv-file.zip | ./normalize.awk | sort --field-separator="|" --key=1,1 > normalized-sorted.csv

Pro tip: combine this with tail command when updating DB later on. ;)

Maybe you find this snippet useful and saves you countless time to develop similar solution in Python or Go. You're welcome!

If you have anything to add or modify, then shoot me!

Discussion (0)