Edouard Le Juge for Monisnap

Posted on Jan 31, 2020 • Edited on Feb 2, 2020

🔎🔍 AWS Cloudwatch - Top 5 things you need to know

#aws #cloudwatch #tutorial #monitoring

If you can rely on your logging, monitoring and alerts, you'll be free to code fearlessly. How does that sound?

At Monisnap our entire backend is built using AWS features like API Gateway, Lambda, SQS, SNS... It made sense for us to use Cloudwatch as our monitoring system. In this article I will share with you a few tricks I use within Cloudwatch. Nothing fancy at all, just the basics, enough to feel the power of Cloudwatch and find what you are looking for.

📖 Search through your logs in cloudwatch

Before we dig into the log search itself, I would like to advise all of you to always log JSON objects instead of plain text. You can grep for text in strings, but you lose the power of the more specific filtering, grouping and sorting that will take your logging game to the next level.
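To make the JSON advice concrete, here is a minimal sketch using only Python's standard library. The field names `logType` and `requestId` and the `context` extra attribute are assumptions chosen for illustration, not something Cloudwatch requires.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "logType": record.levelname,  # assumed field name, pick your own
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} is merged into the object.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes one JSON object per line to stdout."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    # "abc-123" is a made-up request id used only for this example.
    log.error("payment failed", extra={"context": {"requestId": "abc-123"}})
```

Each attribute of the logged object can then be targeted individually instead of grepping through a free-form string.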

📃 Search a specific log group

If you go to a specific log group, either a log group that you created yourself or from a specific lambda, click on the 'Search Log group' button. You will now be able to search in all the log streams within this log group.

Let's now dive into how to filter those logs.

✅ Find specific value, or filter out a specific value

Looking for ERROR lines? You could just type ERROR in the search bar. But as a reflex I always use double quotes, "ERROR", in case the value I am looking for contains a '-' or any other character that could be interpreted as an operator.

    "ERROR"

What would the '-' do to my search? It excludes the term that follows it. If you don't want any of the 'INFO' logs, you could simply do this:

    -"INFO"

✅✅ Find specific values (AND)

You now want all the ERRORs from the requestId 'c8ae6b2c-1750-4f5d-bfda-14403cb978a0' but without the ones from a 'NotImportantException'. In that case you can just put the search terms one after the other:

"ERROR" "c8ae6b2c-1750-4f5d-bfda-14403cb978a0" -"NotImportantException"

✅🔘 Find one of the values (OR)

If you want all the logs with ERROR or WARN, you can use the question mark as an OR operator. In our case the search would look like this:

?"ERROR" ?"WARN"

Note: I haven't found a way to filter using both AND and OR together. If you know how, please leave your trick in the comments section below and I will add it to my article so the entire community can benefit from your wisdom :)
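If you build these search strings from code, a small helper keeps the quoting consistent. This is only a sketch: `all_of` and `any_of` are hypothetical names, and the two styles are kept in separate functions since mixing them is not covered here. The resulting string could be pasted in the search bar or, assuming you use boto3, passed as the `filterPattern` argument of `filter_log_events`.

```python
from typing import Iterable


def _quote(term: str) -> str:
    """Wrap a term in double quotes so characters like '-' are taken literally."""
    if '"' in term:
        # Escaping rules for embedded quotes are out of scope for this sketch.
        raise ValueError("terms containing double quotes are not supported here")
    return f'"{term}"'


def all_of(include: Iterable[str], exclude: Iterable[str] = ()) -> str:
    """Pattern matching events that contain every include term and no exclude term."""
    parts = [_quote(term) for term in include]
    parts += ["-" + _quote(term) for term in exclude]
    return " ".join(parts)


def any_of(terms: Iterable[str]) -> str:
    """Pattern matching events that contain at least one of the terms."""
    return " ".join("?" + _quote(term) for term in terms)


if __name__ == "__main__":
    print(all_of(["ERROR", "c8ae6b2c-1750-4f5d-bfda-14403cb978a0"], ["NotImportantException"]))
    print(any_of(["ERROR", "WARN"]))
```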

⎨⎬ Search on JSON attributes

If, as a smart logger, you are using JSON objects and not plain text, you can now apply those filters on specific attributes of your object using the {$.} notation. {} means that you will be searching on the object itself and $. is the root of your logged object. The OR operator will now be '||' and, you guessed it, the AND will be '&&'.

Example: If you have the following logs:

    A {"line": "first", "type": 1}
    B {"line": "second", "type": 2}
    C {"line": "third", "type": 3}

Let's see some examples:

{$.line="first" || $.line="second"}
-> Returns lines A and B.

{$.type>2}
-> Returns line C.

{$.line="first" && $.type!=3}
-> Returns line A.
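To double-check what each pattern should return, the same three conditions can be written as plain Python predicates and run locally. This does not call Cloudwatch at all; it only mirrors the logic, and it assumes that "type" is logged as a number so that the numeric comparisons make sense.

```python
import json

# Sample events mirroring lines A, B and C ("type" assumed to be numeric).
RAW_LOGS = {
    "A": '{"line": "first", "type": 1}',
    "B": '{"line": "second", "type": 2}',
    "C": '{"line": "third", "type": 3}',
}
EVENTS = {label: json.loads(raw) for label, raw in RAW_LOGS.items()}


def matching(predicate):
    """Return the labels of the events for which predicate(event) is true."""
    return [label for label, event in EVENTS.items() if predicate(event)]


# {$.line="first" || $.line="second"}
first_or_second = matching(lambda e: e["line"] == "first" or e["line"] == "second")

# {$.type>2}
type_above_two = matching(lambda e: e["type"] > 2)

# {$.line="first" && $.type!=3}
first_and_not_three = matching(lambda e: e["line"] == "first" and e["type"] != 3)

print(first_or_second, type_above_two, first_and_not_three)
```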

I will let you play with this and query more complex objects, with a list maybe? Something along these lines:

{$.type.subtype[0].value=1}
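When a nested selector returns nothing, it helps to check locally which value it points at. The sketch below walks a selector through a parsed JSON object; `resolve` is a hypothetical helper that only understands `.key` and `[index]` steps, which is a simplification of what Cloudwatch accepts.

```python
import re
from typing import Any

# One step of a selector: either ".key" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve(selector: str, event: Any) -> Any:
    """Follow a selector such as '$.type.subtype[0].value' through parsed JSON.

    Returns None when a key or index along the way is missing.
    """
    if not selector.startswith("$"):
        raise ValueError("selector must start with '$'")
    current = event
    pos = 1
    while pos < len(selector):
        step = _STEP.match(selector, pos)
        if step is None:
            raise ValueError(f"cannot parse selector at position {pos}: {selector!r}")
        key, index = step.group(1), step.group(2)
        if key is not None:
            if not isinstance(current, dict) or key not in current:
                return None
            current = current[key]
        else:
            i = int(index)
            if not isinstance(current, list) or i >= len(current):
                return None
            current = current[i]
        pos = step.end()
    return current


if __name__ == "__main__":
    # Made-up event shaped to fit the selector from the article.
    event = {"type": {"subtype": [{"value": 1}, {"value": 2}]}}
    print(resolve("$.type.subtype[0].value", event))
```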

🕵️‍♀️ Insights

Now that you are able to find logs in a specific log group, let's explore the power of the Insights feature that AWS offers. Insights lets you search across several log groups in a query-like fashion. Yes, you will now be able to sort, group and more.

First of all, go to your Cloudwatch UI and click on the Insights menu on the left of the screen. You will end up with an empty log group bar and a basic query.

Add the log group(s) you want to query. A specific lambda is totally fine for now. Once you have selected the log group(s), you will see on the right side of your screen all the commands and fields that you can play with.
In this article I will only go through the basic ones. We just want to get a taste of the possibilities here. For now, we will only look into the commands fields, filter, sort and limit. I invite you to look into stats and parse though, they are very powerful features!

All the commands that you add to the query are applied one after the other. It's a sequence of piped operations: each command works on the output of the previous one.

📝 Fields

It kind of speaks for itself: the list of fields you want in the output of the pipe operation. If you want to see all the fields you have access to, run the basic query that Insights gives you and you will see a lot more in the Discovered fields on the right side of the screen.

fields @timestamp, @message, @requestId

🛃 Filter

Pretty obvious as well: this filters the logs as needed. If you want all the logs for a given requestId:

fields @timestamp, @message
    | filter @requestId='47dbd193-c8ad-47b5-8668-6045668ce472'

Among those logs, are only the lines with our own logType set to WARN or ERROR interesting to you? You can use the 'like' operator to apply a new filter on the result of the previous pipe operation. The 'like' operator accepts regex syntax.

fields @timestamp, @message
    | filter @requestId='47dbd193-c8ad-47b5-8668-6045668ce472' 
    | filter logType like /(ERROR|WARN)/
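One thing worth checking with a regex filter is what else it lets through. The sketch below tries the same expression with Python's `re` module, assuming that 'like' behaves like an unanchored search (as `re.search` does). The regex dialect used by Insights may differ in details, and the logType values are made up for the example.

```python
import re

# Same expression as in the query above.
LOG_TYPE = re.compile(r"(ERROR|WARN)")
# Anchored variant that only accepts the exact values.
LOG_TYPE_EXACT = re.compile(r"^(ERROR|WARN)$")


def keep(values, pattern):
    """Return the values in which the pattern is found."""
    return [value for value in values if pattern.search(value)]


# Hypothetical logType values.
values = ["INFO", "WARN", "WARNING", "ERROR"]

loose = keep(values, LOG_TYPE)
strict = keep(values, LOG_TYPE_EXACT)
print(loose)
print(strict)
```

Under that assumption, the unanchored expression also keeps "WARNING" since it contains "WARN", while the anchored one does not.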

🔻 Sort

If you want to see the latest ERRORs that happened on this (or these) log group(s), you can add a sort operation to the mix and use it on the @timestamp field. Both asc and desc sort orders are allowed.

fields @timestamp, @message
    | filter logType like /ERROR/
    | sort @timestamp desc

⇥ Limit

Too many results? Cap the number of rows returned with the limit operator and only show the latest 10:

fields @timestamp, @message
    | filter logType like /ERROR/
    | sort @timestamp desc
    | limit 10
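Since a query is just a chain of piped commands, it is easy to assemble one from code and submit it. The sketch below is one possible way to do it: `build_query` and `run_query` are hypothetical helpers, `run_query` assumes boto3 is installed and AWS credentials are configured, and the log group name is yours to provide.

```python
import time
from typing import Optional, Sequence


def build_query(
    fields: Sequence[str],
    filters: Sequence[str] = (),
    sort: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Chain the fields, filter, sort and limit commands with pipes."""
    stages = ["fields " + ", ".join(fields)]
    stages += [f"filter {condition}" for condition in filters]
    if sort:
        stages.append(f"sort {sort}")
    if limit is not None:
        stages.append(f"limit {limit}")
    return "\n    | ".join(stages)


def run_query(log_group: str, query: str, minutes: int = 60):
    """Submit the query over the last `minutes` and wait for the results."""
    import boto3  # imported lazily so build_query works without boto3

    client = boto3.client("logs")
    now = int(time.time())
    started = client.start_query(
        logGroupName=log_group,
        startTime=now - minutes * 60,
        endTime=now,
        queryString=query,
    )
    while True:
        response = client.get_query_results(queryId=started["queryId"])
        if response["status"] not in ("Scheduled", "Running"):
            return response["results"]
        time.sleep(1)


if __name__ == "__main__":
    print(build_query(["@timestamp", "@message"], ["logType like /ERROR/"], "@timestamp desc", 10))
```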

👨‍🚒 Dashboards

Cloudwatch also gives you the ability to gather in one place all the monitoring you feel you should be keeping an eye on. Those dashboards can contain not only results from the log queries we saw in the section above, but also aggregated values of operations on your API Gateway, SQS queues, S3 files and much more.

In our team we have several types of dashboards:

👨‍⚕️ Regular monitoring

For health monitoring purposes we have a few general dashboards showing 2XX, 4XX, 5XX and other HTTP codes we want to monitor, plus a tail of the latest errors: anything that gives us a feel for what is going on in our system and shows us right away that something is odd. Nothing fancy, just enough to notice and diagnose where the blood is coming from when alerts fire.

👩‍🔧 Specific features monitoring

For any impactful feature we try our best to add graphs or tailing logs that can monitor the impact of the feature by tracking a specific KPI. It can totally be a specific log nomenclature that we can track for a certain amount of time.

Example:

On a recurring error from a partner we added the specific log "error 0234 occurred"
We created a query that populates a graph based on it
We confirmed based on this that the impact of this specific error was pretty bad
We implemented the ability to recover from that issue
We added in our code a new log saying "recovered from error 0234"
We created a new query to monitor this implementation
We are now able to monitor how many requests recovered from this specific error and how many did not
We close the bug if the fix stopped our problem
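The KPI in this example boils down to comparing the counts of the two log markers. Here is a rough local sketch of that computation; `recovery_stats` is a hypothetical helper and the sample messages are invented, in practice the counts would come from your Cloudwatch queries.

```python
OCCURRED = "error 0234 occurred"
RECOVERED = "recovered from error 0234"


def recovery_stats(messages):
    """Count both markers and return (occurred, recovered, recovery rate)."""
    occurred = sum(OCCURRED in message for message in messages)
    recovered = sum(RECOVERED in message for message in messages)
    rate = recovered / occurred if occurred else 0.0
    return occurred, recovered, rate


if __name__ == "__main__":
    # Invented sample messages, for illustration only.
    sample = [
        "error 0234 occurred",
        "recovered from error 0234",
        "error 0234 occurred",
        "unrelated message",
    ]
    print(recovery_stats(sample))
```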

🚨 Alarms

Last but not least in our article is the ability to set alerts using the very helpful Alarms feature Cloudwatch gives us. You can set alerts on any of your AWS resources: API Gateway, SQS, Lambdas and many more.

⏰ Create an alarm

To add an alert, simply go to the Alarms entry in the left side menu and click on the red 'create alarm' button.

📈Choose your metric

You will then be asked to select the metric(s) on which you want your alarm to be set. A metric is usually a combination of a resource (API Gateway endpoint, Lambda function, SQS queue...) and an operation on it (sum of executions, sum of errors, number of deletions, average duration...)

You also have to choose the period over which the metric is evaluated: per second, per minute, per 5 minutes...

📛Threshold type

Two types of threshold can then be applied: a static value being reached, or an anomaly being detected. We have only used static thresholds so far and they are quite straightforward: you set the value above or below which you want to receive an alert. The anomaly type applies an algorithm to how the metric evolves over time. If the value evolves in a way the algorithm finds abnormal, it will trigger your notification. Feel free to share your thoughts and experience of this feature in the comments if you have been playing with it. It would be much appreciated!

☎ Action

Here you can select how you want to be notified. We usually use an SNS topic which sends an email to our developers' email address. But it is up to you to set it up as you see fit.

🤓 Review and confirm

You simply have to set the name and description of your alarm. Click confirm and you are good to go!
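The same four steps (metric, threshold, action, name) can also be scripted. The sketch below builds the parameters for a static-threshold alarm on a Lambda's error count and, assuming boto3 and AWS credentials are available, sends them with `put_metric_alarm`. The helper names, the function name, the topic ARN and the default values are all placeholders for illustration.

```python
def lambda_error_alarm(
    function_name: str,
    sns_topic_arn: str,
    threshold: int = 1,
    period_seconds: int = 300,
) -> dict:
    """Build put_metric_alarm parameters for a static threshold on Lambda errors."""
    return {
        # Review and confirm: name and description.
        "AlarmName": f"{function_name}-errors",
        "AlarmDescription": f"{function_name}: at least {threshold} error(s) per {period_seconds}s",
        # Metric: a resource plus an operation on it, evaluated per period.
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        # Threshold type: static.
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        # Action: notify an SNS topic.
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params: dict) -> None:
    """Create or update the alarm (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)


if __name__ == "__main__":
    # Placeholder function name and topic ARN.
    print(lambda_error_alarm("checkout", "arn:aws:sns:eu-west-1:123456789012:dev-alerts"))
```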

Conclusion

We only scratched the surface of what Cloudwatch can offer in this article, but I think we covered all the parts you need to know for basic monitoring and debugging.
I hope it helped you get an overview of what AWS gives you in the monitoring world. We have been using it for 6 months at Monisnap and haven't been blocked by anything serious so far. Hope you'll be using it soon, it's a great toolbox to play with!

Top comments (1)

Jonathan BROSSARD • Jan 31 '20 • Edited

@monisnapedouard , I believe that you can perform a query with both AND and OR like that :

{ $.user.lastname = "malou" || $.user.lastname = "goodenough" && $.user.enabled = "1"}
