Iterative Development as Discovery

#programming #softwaredevelopment

In a class on shell scripting, my professor asked whether pipes were better than temporary files. I don't know enough to generalize, but pipes often suit the kind of evolving thinking where you're not entirely sure what the answer is because you're still formulating and reformulating the question.

Say, for example, I notice several failed SSH attempts on my server. Where are they coming from? To print the IP addresses, I might run:

awk '/Failed password for/ {print $11}' /var/log/auth.log

But imagine that returns a long list of IP addresses, some of which are repeats because I haven't installed fail2ban. So I pipe it to sort to grab just the unique values.

awk '/Failed password for/ {print $11}' /var/log/auth.log | sort -u

That gives me a list without repeats. But then perhaps I wonder where these IPs are located, so I do a whois lookup on each.

awk '/Failed password for/ {print $11}' /var/log/auth.log | sort -u \
    | while read address; do whois "$address" \
    | grep "Country: "; done

That's fine, I guess. But now I want to see the attempts per country. So I drop the sort -u, and do a whois lookup on every IP address, leaving in all the repeats.

awk '/Failed password for/ {print $11}' /var/log/auth.log \
    | while read address; do whois "$address" \
    | grep -m 1 -i "Country: "; done

(The grep arguments say “match the first instance only, case insensitive.” Some whois records use a lowercase ‘c’ in ‘Country’.) But the output isn't very helpful.

Country:       US 
Country:       US 
country:       JP 
country:       JP 
country:       JP
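To see what the two grep flags do without querying a real whois server, here is a small sketch. The fake_whois function and its record are made up for illustration; real whois records vary by registry.

```shell
# Made-up whois-style record (hypothetical data, not real registry output).
fake_whois() {
    printf '%s\n' \
        'inetnum:        203.0.113.0 - 203.0.113.255' \
        'country:        JP' \
        'person:         Example Admin' \
        'country:        JP'
}

# -i lets "Country: " match the lowercase "country: " as well;
# -m 1 makes grep stop after the first matching line.
fake_whois | grep -m 1 -i "Country: "
```

Without -m 1, both country lines would be printed, which would inflate any later count.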

Ah, I know. I'll replace the word ‘country’ with the actual IP address inside the while loop, then pipe the result to awk to swap the positions of the country code and the IP address, perhaps even adding an ASCII arrow between them.

awk '/Failed password for/ {print $11}' /var/log/auth.log \
 | while read address; do whois "$address" \
     | grep -m 1 -i "Country: " \
     | sed "s/[Cc]ountry:/$address/"; done \
 | awk '{print $2 " --> "$1}'

Example output (with random, made-up IPs):

JP --> 219.47.220.30 
JP --> 219.47.220.30 
NL --> 139.47.220.30 
US --> 149.47.31.192
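The sed and awk steps can be tried on their own with a single made-up line. In this sketch, swap_fields is a hypothetical helper name and the addresses come from ranges reserved for documentation.

```shell
# Hypothetical helper: $1 is the IP address to substitute for the label.
swap_fields() {
    # sed replaces "Country:" or "country:" with the address;
    # awk then prints the second field (country code) before the first.
    sed "s/[Cc]ountry:/$1/" | awk '{print $2 " --> " $1}'
}

# One made-up whois line standing in for the grep output.
printf 'country:        JP\n' | swap_fields '203.0.113.7'
# prints: JP --> 203.0.113.7
```

Because awk splits on runs of whitespace by default, the padding after the label does not matter.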

And this is almost what I want to know, except that I want to count how many times each IP address appears and then rank them by number of attempts, with the higher numbers first. Since uniq -c only counts adjacent duplicate lines, I sort before counting, and I use sort -rn so the counts are compared as numbers rather than as text. And I don't like that ASCII arrow, after all.

awk '/Failed password for/ {print $11}' /var/log/auth.log \
 | while read address; do whois "$address" \
     | grep -m 1 -i "Country: " \
     | sed "s/[Cc]ountry:/$address/"; done \
 | awk '{print $2 " " $1}' \
 | sort \
 | uniq -c \
 | sort -rn

Example output (with random, made-up IPs):

5 JP 219.47.220.30 
4 KR 59.150.234.132 
1 US 99.62.218.217 
1 US 9.41.14.160
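One detail worth checking when counting this way: uniq -c only collapses adjacent duplicate lines, so its input generally needs to be sorted first, and sort -rn compares the counts numerically. A small sketch, with made-up lines standing in for the loop's output and count_attempts as a hypothetical helper name:

```shell
# Hypothetical helper: group identical lines, count them, rank by count.
count_attempts() {
    sort | uniq -c | sort -rn
}

# Made-up "country IP" lines, deliberately interleaved so that
# identical lines are not adjacent in the input.
printf '%s\n' \
    'JP 219.47.220.30' \
    'US 99.62.218.217' \
    'JP 219.47.220.30' \
    'KR 59.150.234.132' \
    'JP 219.47.220.30' \
    | count_attempts
```

The JP line comes out first with a count of 3. Dropping the first sort would report it as three separate lines with a count of 1 each, because the repeats are not next to each other.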

And at this point, I've found both the question and the answer. Ranked by number of attempts, what are the IP addresses and countries from which failed login attempts have come? Since I've answered my question, I'd write the output to a file for later analysis and processing, if any.

Pipes separate the processing of data from its storage in a way analogous to how HTML and CSS separate the structure and presentation of content. In this analogy, temporary files are the inline CSS of data processing.
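As a rough illustration of that separation, the same step can be written both ways. The sample addresses and function names below are made up for the sketch.

```shell
# Hypothetical sample data: one made-up IP address per line.
sample_ips() {
    printf '%s\n' 10.0.0.1 10.0.0.2 10.0.0.1
}

# Temporary-file version: the intermediate result is stored on disk,
# read back, and then has to be cleaned up.
unique_with_tempfile() {
    tmp=$(mktemp) || return 1
    sample_ips > "$tmp"
    sort -u "$tmp"
    rm -f "$tmp"
}

# Pipe version: the data flows from one command to the next
# without being stored anywhere along the way.
unique_with_pipe() {
    sample_ips | sort -u
}

unique_with_tempfile
unique_with_pipe
```

Both functions print the same two addresses; only the pipe version leaves storage out of the picture until you decide where, or whether, to save the result.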

But pipes are also useful when you're still waiting for the emergence of the question which your code answers.

This article was originally published on Medium.