Did you enjoy my introduction to Regular Expressions (RegEx)? Let’s use some to do some actual work.
Here’s a classic problem
You run a web service, and you’re experiencing poor performance for all users. For reasons I won’t go into, you suspect a single user might be causing problems. The good news is that every request that comes in to your server is logged. The bad news is that your service is popular, and you’ve had more than 1,500 requests in the last hour.
Can you spot the IP address that’s most frequently found in these logs?
The hunt is on! Let’s find the most frequently occurring IP address that appears in the logs.
want to see the command we’re gonna build? It’s
heroku logs -n 1500 | grep -o '[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*' | sort | uniq -c
I promise by the end of this article all of that will make sense.
First we need to fill our logs
In this example we’ll be working with a web service hosted on Heroku. While Heroku gives you super-useful production services for handling web logs (like setting alerts, filters, and so on) via add-ons, we're going to bypass all of that and do some CLI hacking instead.
throughout this tutorial we’ll be using a command line interface (CLI) to view our Heroku logs and run our RegEx filtering. I’ll generally refer to this interface as ‘the CLI or ‘bash’ after the shell version I use. Unfamiliar with the CLI? I wrote a little something about it :)
For this to be fun at all we want to really fill up our Heroku logs. I wrote a demo app to do just that, so all you need to do is:
- Create a Heroku account
- Install the Heroku CLI
- Run
git clone https://github.com/tobyfee/log-filler.git
- Go into the directory you just created with
cd log-filler
- Create a new Heroku app
heroku apps:create logfiller
(you must change the name ‘logfiller’ to whatever you want) - Send the app code up to Heroku and run it once with
git push heroku master
You can test that everything worked by running heroku logs
Woo-hoo, logs! Those white strings of numbers and dots are the IP addresses we’re after
important tip: I recommend grabbing a few dozen lines and pasting it into your favorite code editor. As we write our regex it will be nice to have something to refer back to without having to re-run this command.
If you scroll up a bit you’ll notice that we only get about 100 lines of logs with this command. That’s the Heroku default but with heroku logs -n [number]
you can get more information.
Now that we’re looking at the logs let’s pause to cover two fundamental points:
When you write a RegEx, what’s in your source material matters
Fundamentally RegEx is about finding things within text. When we evaluate whether a RegEx is doing the right thing we need to consider the source material. If the logs we were searching had no numbers other than IP numbers, a RegEx that matched only numbers (perhaps grabbing anything between two sets of numbers would work fine). Check the logs above and you’ll see there are lots of numbers in the log that aren’t the IP addresses.
There are many ways to find the same thing
Sometimes two RegEx patterns do exactly the same thing [abc]
and [a-c]
are identical, and the only reason to choose between them is which is more readable.
Other patterns work similarly but have small functional difference. The pattern /ab{2]a/
and /ab*a/
will both match ‘abba’; but only the second pattern will also match ‘abbbbbbbba’
From what you’ve already learned you might write a pattern to find email as follows:
[a-z0-9]*@.*\.[a-z]{3]
To write this out in English:
-
[a-z0-9]*
Any letter or digit any number of times -
@
An @ sign -
.*
Any character any number of times -
\.
A dot -
[a-z]{3}
Any letter three times
Great! That should cover… most emails right? Then you might realize the bit before the @ can include a dot, so you might add a dot to the first part of the pattern [a-z0-9\.]*
. Soon more exceptions might occur, until you find that your RegEx email pattern is thousands of characters long.
Okay, back to the task at hand.
Let’s find some IPs
We’ve established that heroku logs -n 1500
will get us a bunch of recent log lines (1500 to be exact), how do we use a RegEx on that output? Time to use the unix pipe |
I’m not going to go too deeply into the explanation of how the pipe works, for the moment I’ll just define it as “take the output of the command send it (like I don’t know directing it through a pipe) as the input of the next command. The pipe on many keyboards is the same key as \
We want to send the output of heroku logs
to grep, a tool intended to filter things with a RegEx pattern. heroku logs -n 1500 | grep
will let grep filter the logs. If you run that command, grep will complain because you haven’t given it a pattern. How about we give it a very simple pattern? Look back at our logs. Notice how the name of the file requested shows up at the end? Let’s filter the logs for to find those that requested ‘err.png’ or some variant.
heroku logs -n 1500 | grep 'err'
Here we’re using the simplest kind of RegEx pattern: a literal that looks for three exact characters.
You should get back a bunch of log lines containing ‘err’ somewhere in them:
2019-07-28T12:27:44.077022+00:00 app[web.1]: 10.0.0.153 - - "GET /cgi-bin/mailgraph.cgi/mailgraph_3_err.png HTTP/1.1" 200 5554
2019-07-28T12:27:44.077057+00:00 app[web.1]: 10.0.0.153 - - "GET /cgi-bin/mailgraph.cgi/mailgraph_0_err.png HTTP/1.1" 200 6324
2019-07-28T12:27:44.077063+00:00 app[web.1]: 10.0.0.153 - - "GET /cgi-bin/mailgraph.cgi/mailgraph_2_err.png HTTP/1.1" 200 7001
2019-07-28T12:27:44.077067+00:00 app[web.1]: 10.0.0.153 - - "GET /cgi-bin/mailgraph.cgi/mailgraph_1_err.png HTTP/1.1" 200 6949
2019-07-28T12:27:44.077078+00:00 app[web.1]: 10.0.0.153 - - "GET /cgi-bin/mailgraph.cgi/mailgraph_3_err.png HTTP/1.1" 200 5554
Okay now we’re filtering! There is just one problem as we try to get only the IP addresses. Right now when grep found ‘err’ in the log line, it gives us the whole line. We want only the part of the line that the RegEx actually matched. We can do this by adding an option to grep, -o
which I always remember as ‘only the matched part.’
heroku logs -n 1500 | grep -o 'err'
returns some phenomenally boring output, but that is the behavior we want when we find IPs.
Wait, what does an IP address look like?
This is something that can trip us up quite badly when we’re writing RegExs to verify user input, as in my email example. In this case we can use a simplistic definition of ipv4 which is “four groups of numbers, separated by dots”. Let’s write our RegEx!
It might be helpful to save the command you’re entering into a separate text file. Pressing the up arrow on the keyboard will load your last bash command. These things get tricky to edit and it’s nice to have a log of your previous versions.
Requesting 1500 log lines would take a while for each command so I’ll only request the last 50 for the next few versions. First let’s filter for numbers:
heroku logs -n 50 | grep -o '[0-9]'
And we get back (just showing the last few lines)
2
0
0
0
0
1
0
Huh, why doesn’t it give us all the digits in each line? While there are some grep modes that might get us close to that, our pattern only asked for a single digit. Let’s say ‘any number of consecutive digits’
heroku logs -n 50 | grep -o '[0-9]*'
2019
2019
2019
2019
2019
What’s happening here? Again we could give grep some configuration which might change this behavior but RegEx will want to match the first thing it finds. Since the year value precedes everything else in each log line, that’s all we’re getting.
There are a few ways to proceed from here, we could write a whole thing to say ‘don’t match dates’ but that’s unnecessary if we go a step further and say ‘any number of digits followed by a dot’
heroku logs -n 50 | grep -o '[0-9]*\.'
(since a dot in a regex pattern means ‘any character’ we need to escape it to mean ‘an actual real dot’ so we use \.
)
The last few lines we get back are:
.
64.
242.
88.
1.
1.
1.
Wait, what? Why are we getting some bare dots, didn’t we say ‘any number of numbers followed by a dot? That’s the trouble with *
. Turns out that ‘any number of this character’ can be zero. There are a bunch of ways to fix this, but a simple way is to just say ‘a number, then any number of numbers’
heroku logs -n 50 | grep -o '[0-9][0-9]*\.'
20.
64.
242.
88.
1.
1.
1.
Oh boy howdy, that’s looking good!
Spread our wings
Since we defined our ipv4 addresses as 'four groups of digits separated by dots' we can extend our pattern to
heroku logs -n 50 | grep -o '[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*'
okay okay wait, we can’t possibly have to type
[0-9]
all those times right?
In my previous tutorial I skipped character classes, but to make our pattern much tidier we should meet one of them: \d
which I remember as ‘any digit’. We can use it to replace all those [0-9]
s
So our pattern becomes:
heroku logs -n 50 | grep -o '\d\d*\.\d\d*\.\d\d*\.\d\d*'
Sure enough we get nice looking ip addresses back
10.0.0.153
10.0.0.153
10.0.0.153
10.0.0.153
10.0.0.153
10.0.0.153
10.0.0.153
10.0.0.153
10.0.0.153
64.242.88.10
64.242.88.10
Time for a reality check
Looking at our sample logs, it seems like we are matching the IP address that appears in each line. But how do we know we aren’t missing some lines? Let’s do a reality check by piping our output to another command wc
. wc counts words in our output, but wc -l
counts lines:
heroku logs -n 50 | grep -o '\d\d*\.\d\d*\.\d\d*\.\d\d*' | wc -l
That gets back the number of lines we matches with our RegEx: 49
Oh no, oh gosh, _what happened to that missing line?!?! _Try running:
heroku logs -n 500 | grep -o '\d\d*\.\d\d*\.\d\d*\.\d\d*' | wc -l
And you’ll see we have 499 matches, so we seem to just be missing one line.
If you take a look at the last few lines with heroku logs -n 5
you’ll see what’s really happening. The very last line is:
‘2019-07-28T18:00:20.143268+00:00 heroku[web.1]: Process exited with status 0’
Which doesn’t have an IP address :)
okay, great but I’m tired of typing
\d
all those times
\d\d*
feels pretty awkward on its own, surely there’s a way to say “at least one digit and as many as you’d like after that”? There is! +
! Now the pattern is just'\d+\.\d+\.\d+\.\d+'
there is one tiny issue with this,
+
is an ‘extended’ part of regex not available in grep by default, along with -o, we need to add the -e option to make it available, so our final command isheroku logs -n 50 | grep -oe '\d+\.\d+\.\d+\.\d+' | wc -l
Turning data into knowledge
Okay, the end is in sight, let’s request the last 1500 log lines, use grep and our RegEx to pare each line down to _just _the IP address, sort
the addresses, and then use uniq
to count how many addresses appear in the logs.
I’m not going to dive too deep into
sort
anduniq
. This article is already super duper long
heroku logs -n 1500 | grep -oe '\d+\.\d+\.\d+\.\d+' | sort | uniq -c
And our output?
bash
336 10.0.0.153
1 12.22.207.235
27 128.227.88.79
7 142.27.64.35
49 145.253.208.9
4 194.151.73.43
7 195.11.231.210
1 195.230.181.122
12 195.246.13.119
4 200.160.249.68
1 200.222.33.33
13 203.147.138.233
20 207.195.59.160
4 208.247.148.12
4 20964.242.88.10
1 212.21.228.26
28 212.92.37.62
2 213.181.81.4
24 213.54.168.132
1 216.139.185.45
2 219.95.17.51
1 4.37.97.186
28 61.165.64.6
6 61.9.4.61
879 64.242.88.10
1 64.246.94.141
1 64.246.94.152
2 66.213.206.2
3 67.131.107.5
4 80.58.14.235
21 80.58.33.42
1 80.58.35.111
The number to the left of each line is showing how many of that line is showing up.
Oh-KAY, now we’ve got IP addresses counted, and the largest culprit is pretty clear. For tidiness’ sake let’s sort by that count by adding a second sort command at the end
heroku logs -n 1500 | grep -oe '\d+\.\d+\.\d+\.\d+' | sort | uniq -c | sort
And the last few lines show
bash
28 212.92.37.62
28 61.165.64.6
49 145.253.208.9
336 10.0.0.153
878 64.242.88.10
64.242.88.10, you vile rogue. You send more than half the requests to our server. Now we have you in our sights.
Final cleanup
'\d+\.\d+\.\d+\.\d+'
is well and good, but can we make it any shorter? After all the bit \d+\.
is repeated three times… sure enough we can compact that a bit further:
'(\d+\.){3}\d+'
Wow, now our pattern to find IP addresses is really compact, and maybe that’s even a bit more readable! I’d emphasize that this change has made no functional difference to the regex, and if you have a pattern
Well this was a big adventure and I still have a lot more I’d like to say! If you liked this piece let me know and I’ll continue the series.
Top comments (3)
quick question, which
grep
allows you to use\d
? as far as I know, only GNU grep with-P
option (PCRE) will allow it, otherwise you have to use[0-9]
or the POSIX character set[:digit:]
inside a character classalso, ERE is enabled using
-E
not-e
, from GNU grep's man page:Hi, i did similar project using python github.com/PetengDedet/Ubuntu-Auth...
this is a lot neater than my thing, I'll have to try to run it some time! Thanks for sharing!