Logging is important in order to get insight into a system's behavior. Logs are useful not only for debugging a system in a dev environment, but also in production if enabled in a controlled manner.
We would like logging to be reliable without adding too much overhead to the main computation.
When it comes to serverless, it is not trivial to build a robust logging solution that is reliable as well as fast.
Here, we discuss some common problems we have when we integrate an AWS serverless application with an external logging service.
- Lambdas can be executed more times on failure. This means that we can have duplicate log messages.
- Lambdas are "frozen" when a value is returned and the event loop is empty. This may cause a loss of log data.
- Lambdas are terminated with failure by timeouts. If an external logging service is too slow, we might experience timeouts, but we do not want to fail the whole computation.
- Lambdas fail and might be executed again when an unhandled exception occurs. We would like to catch unhandled exceptions in log commands.
Let's go deeper into technical details. In the rest of this post I will present some toy examples to explain the main concepts. You can have a look at this github repository if you want to try and play around on your own.
AWS Lambda Retries
In some cases, depending on the event source, AWS Lambda might retry the execution of a Lambda function that failed.
Since there is no roll-back mechanism, side effects are not undone on failure. This problem affects logging, as well. Fortunately, the solution is simple.
The general rule is add awsRequestId
to log data (CloudWatch does it for you, but here we want to rely on an external service). A Lambda context has an awsRequestId
property unique per invocation. In other words, if a Lambda fails, it is executed again with the same awsRequestId
.
Adding awsRequestId
to log data allows us to understand if a log message is a duplicate.
Some related suggested reads
- You need to use structured logging with AWS Lambda
- Capture and forward correlation IDs through different Lambda event sources
Frozen execution context
Ideally, We do not want to block the main computation because we are waiting for the logging service's responses. Creating a user is more important than logging the fact itself.
In addition, if we do not wait, we actually run each log command in parallel, making the execution faster.
Basically, what we can do is to send HTTP requests to our favorite logging service and do not wait for a promise's resolution.
This could work in traditional NodeJS, but it is doomed to fail in serverless.
You can try it yourself with the code in the repo. If the code is run in nodejs, you should see something similar to this:
$ node delayed-logging/run.js
[2020-03-02T10:06:27.912Z] STARTED LAMBDA
[2020-03-02T10:06:27.914Z] before log
[2020-03-02T10:06:27.920Z] after log
[2020-03-02T10:06:27.920Z] before log
[2020-03-02T10:06:27.920Z] after log
[2020-03-02T10:06:27.921Z] END LAMBDA
node delayed-logging/run.js 0,08s user 0,01s system 1% cpu 5,113 total
The business logic of the Lambda function is executed immediately, but the computation ends after 5 seconds, when the HTTP server returns its responses. Under the hood, nodejs waits for pending requests.
Log messages are actually printed in server's logs as you can verify easily.
β― node server.js
Listening on port 3000!
[2020-03-02T10:06:32.939Z] LOGGED: {"message":"Create user"}
[2020-03-02T10:06:32.952Z] LOGGED: {"message":"Return"}
Let's see what happens in AWS.
# invoke and forget log promises
$ curl https://XXX.execute-api.YOUR-REGION.amazonaws.com/dev/functions/1
In the following picture, we show logs for the lambda as well as for the logger. As you can see, the logger did not print out any message. More precisely, the logger lambda didn't start at all!
We lost log data.
Basically, the problem is bound to the execution semantics of AWS Lambdas.
The execution context is the runtime environment of a Lambda function and includes external HTTP endpoints, timeout handlers and other background processes.
When a Lambda function returns a value and the event loop is empty, the execution context is frozen for future reuse for some time.
Freezing the execution context is useful in many cases and leads to better performances because the initialization phase of external dependencies can be skipped. For example, we can take advantage of this feature to avoid recreating database connections at every invocation (from here).
If we have pending HTTP requests to an external service, like in the case of logging, when the Lambda returns, the event loop is empty. The HTTP request callbacks have not been pushed into the event loop yet. However, the execution context is not blank: HTTP request processes are still pending on background.
So, the Lambda is frozen. If it is invoked again, then the old execution context is resumed and the HTTP request processes are executed in the second call. Maybe... Unfortunately, we cannot be sure about that! In fact, if a lambda is idle for long, it is killed and the execution context deleted.
In practice, we might lose log data or see them in following invocations, which is quite confusing and unpredictable. Thus, the general recommendation is:
You should make sure any background processes or callbacks in your code are complete before the code exits. source
We can verify it. If we invoke our lambda several times very quickly, then we can see some logs, but others are still missing (but I am not sure if it depends on some concurrency policy of the API Gateway, any idea?).
$ repeat 5 { curl https://XXX.execute-api.YOUR-REGION.amazonaws.com/dev/functions/1 }
Instead, if we add an await
statement before fetch
requests (as AWS documentation recommends), then all the logs pass through (remember to decrease the value for the environment variable LOGGER_TIMEOUT
for the logger lambda in the AWS console, otherwise you get a timeout).
# invoke with blocking logs
$ curl https://XXX.execute-api.YOUR-REGION.amazonaws.com/dev/functions/2
However, the main computation is blocked until the logger returns something, which is exactly what we wanted to avoid.
The solution is to wait for logging promises before the end of the Lambda function. In other words, we need to collect promises and wait for them together (i.e. Promise.all
) before returning a value.
# invoke with promise collector and await at the end
$ curl https://XXX.execute-api.YOUR-REGION.amazonaws.com/dev/functions/3
This solution is also more efficient. If you run fun2
and fun3
with a slow logger (LOGGER_TIMEOUT
equals 5 seconds, for example), fun3
returns a value while fun2
is timed out. If LOGGER_TIMEOUT
is low, you can measure latency using xRay or artillery more accurately. Not surprisingly, fun3
is faster than fun2
.
If you want to understand this topic better, I found the following blog posts very useful. I picked up the information presented in this section from there. I recommend them also if you want to try your knowledge of the event loop!
- Avoiding the pitfalls of async Node.js functions in AWS Lambda
- AWS Lambda and the Node.js Event Loop
Timeouts
If you have tried the code above, you might have experienced one or more timeouts. Indeed, lambdas are interrupted after some time.
Our logging service is quite slow (by design), so the main Lambda function might time out if it waits for the logger's responses at the end of its body.
We do not want to cause a timeout because the logging service is too slow. In this case, we would like to interrupt logging preemptively if we are running out of time.
You may object that logging services are fast. But if you want to build a resilient distributed system, you should not trust other parties. You do not have control on an external logging service. If it becomes unavailable for any reason, your serverless application will start behaving in an unpredictable way.
If a timeout occurs, a Lambda might be executed again. So it is more desirable to lose some log data than rerunning the computation.
However, interrupting logging does not mean that log commands do not have any effect. As far as we know, the external service might have received an HTTP request. If a promise is pending on our side, it means only that we have not received an HTTP response or that the HTTP callback has not been executed.
Unfortunately, Promises
cannot be canceled in NodeJs. The correct solution is to use the native Promise.race
as in the following snippet:
await Promise.race([
Promise.all(promises).then(() => {
if (timeoutHandle) clearTimeout(timeoutHandle);
}),
new Promise( (_resolve, reject) => {
timeoutHandle = setTimeout(() => {
reject();
}, LOGGER_TIMEOUT);
})
]);
In this way, we are sure that Promise.all
is removed from the execution context when setTimeout
callback is executed (credits).
Unhandled exceptions
Lambdas can fail for the following reasons
- Timeouts
- Unhandled exception, e.g.
TypeError
orReferenceError
. - Run out of Memory
I have not seen much of (3) and we discussed (1) previously. Here, I want to talk about "Unhandled exceptions".
Firstly, it is a good practice to wrap a Lambda in a try-catch
block. In this way, we can handle uncaught exceptions, gracefully.
However, we do not want to fail the whole computation because of a TypeError
in a log command.
We should try-catch
individual log commands.
Bonus: I have a playground repo for testing unhandled exceptions' behavior in AWS.
Final remarks
Building a logging solution in serverless is not easy. We need to accomplish at least these steps.
- Add context information to log message (i.e.
awsRequestId
) - "Flush" pending log promises with
Promise.all
before returning a value - Preemptively interrupt logging if we are running out of time.
- Wrap log commands in
try-catch
A lot of work for a lazy programmer like me.
Fortunately, it is a rote job that can be encoded in a library without rewriting the basics every time.
In a next post, I will show you an elegant functional solution to this problem.
Top comments (3)
Cool article Marco! This covers some important and interesting nuances about logging in serverless, which is definitely not easy! And as you say, a really important aspect.
I thought you might find this package, and specifically this module interesting if you haven't already seen it:
github.com/getndazn/dazn-lambda-po...
Hey @Lou, thanks for your comment and the link. I was aware of that package and I was going to mention it. As far as I know, its implementation is based on the first links above (or the other way around :P).
The
log-timeout
middleware is useful to get warnings when a lambda times out if I understand correctly. In fact, native AWS timeout messages are quite hard to spot.In this post, I'm saying that sometimes it is better sacrifice "logging" in order to avoid timeouts in the first place.
Awesome β also congrats on your first DEV article! I'm the mod for serverless, so let me know if you have any ideas or thoughts! :)