DEV Community: Scalyr

Java Exceptions and How to Log Them Securely

Erik Dietrich — Tue, 12 May 2020 15:31:38 +0000

As a security consultant, I perform assessments across a wide variety of applications.

Throughout the applications I've tested, I've found it's common for them to suffer from some form of inadequate exception handling and logging.

Logging and monitoring are often-overlooked areas, and due to increased threats against web applications, they've been added to the OWASP Top 10 as the new number ten issue, under the name "Insufficient Logging and Monitoring."

So what’s the problem here? Well, let’s take a look.

Logs? Who Needs Logs?

To start off, why do we even use logging? What’s the point?

Not only is proper logging useful for debugging applications, but it also has serious implications for compliance and many benefits for forensics and incident response.

How do you know if someone is running a vulnerability scanner against your application?

Or is attempting a brute force authentication attack to try and access user accounts? All of this is good to know, but there are other subtle things as well.

The majority of successful attacks start with an attacker who probes the application and looks for weak points.

The more an attacker can probe the application, the higher the chance that the attacker will find and successfully exploit the application.

Attackers rely on going unnoticed, and since a breach takes an average of 191 days to detect, the logs are often the only way that anyone can see what happened.

Not having this information makes it extremely difficult to assess who did what when and to what extent access was gained.

Create and Follow a Logging Strategy

It's very rare that I see an application that has an actual logging strategy. Most of the time, we implement logging as an afterthought.

I guess that can be a strategy, but can we do better? I think we can.

When you add logging into the application, it's a good idea to have an overall consistent strategy. Use the same logging framework across all of the applications wherever possible.

This makes it easy to share configurations, such as message formats, and to adopt consistent logging patterns.

The criteria for when a message is a warning or an error, and which logging levels to use, also need to be documented.

When logging anything, the message format should always contain at a minimum the timestamp, current thread identifier, caller identity, and source code information.

All modern logging frameworks support this type of information out of the box.
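
As a rough illustration of that minimum message format, here's a sketch that uses only the JDK's built-in java.util.logging rather than any particular framework. The class name ConsistentLineFormatter is made up for this example, and passing the caller identity as a message parameter is just one assumed convention; your framework's own pattern configuration would normally do this job.

```java
import java.time.Instant;
import java.util.logging.Formatter;
import java.util.logging.Level;
import java.util.logging.LogRecord;

// Hypothetical formatter (not part of any framework): one line per event with
// timestamp, thread id, level, source class and method, and the message.
final class ConsistentLineFormatter extends Formatter {
    @Override
    public String format(LogRecord record) {
        return String.format("%s [thread-%d] %-7s %s.%s - %s%n",
                Instant.ofEpochMilli(record.getMillis()), // UTC timestamp in ISO-8601 form
                record.getThreadID(),                     // current thread identifier
                record.getLevel().getName(),              // logging level
                record.getSourceClassName(),              // source code information
                record.getSourceMethodName(),
                formatMessage(record));                   // message with {0}-style parameters filled in
    }
}
```

With a formatter like this installed on a handler, a call such as logger.log(Level.WARNING, "caller={0} login failed", "alice") would carry all four pieces of information on a single line.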

Having all of this as part of your developer documentation would be a great way to create and maintain a consistent logging strategy across all of your business's applications.

Log the Complete Stack Trace

In many of the secure code reviews I’ve done, a mistake I commonly see is not logging the entire stack trace for an exception.

Take this hypothetical example, representative of the exact pattern I've seen many times in code reviews:

public Customer findCustomerByName(String customerName) {
  try {
    Customer c = customerService.findByName(customerName);
    return c;
  } catch (Exception ex) {
    LOG.error("Exception looking up customer by name: " + ex.getMessage());
  }
  return null;
}

Now, there are a few things wrong with this example, but let’s just focus on handling the SQLException. Let’s say that in production you look at the logs and see this:

2018-03-02 09:29:47.287 ERROR 5166 --- [nio-8090-exec-1] com.scalyr.controllers.DemoController    : org.hibernate.exception.SQLGrammarException: error executing work

That doesn’t tell you a whole lot. What caused the SQLGrammarException?

The logger classes all have an overload that takes a Throwable object and will handle constructing and writing out the stack trace.

By changing the code slightly, we can get a clearer picture of what's going on:

public Customer findCustomerByName(String customerName) {
  try {
    Customer c = customerService.findByName(customerName);
    return c;
  } catch (Exception ex) {
    LOG.error("Exception looking up customer by name: " + ex.getMessage(), ex);
  }
  return null;
}

This change logs the full stack trace, which clearly shows some nefarious activity here (or fat fingers...).

2018-03-02 09:33:11.341 ERROR 5188 --- [nio-8090-exec-1] com.scalyr.controllers.DemoController    : org.hibernate.exception.SQLGrammarException: error executing work

org.hibernate.exception.SQLGrammarException: error executing work
    at org.hibernate.exception.internal.SQLExceptionTypeDelegate.convert(SQLExceptionTypeDelegate.java:63) ~[hibernate-core-5.0.12.Final.jar:5.0.12.Final]
    at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:42) ~[hibernate-core-5.0.12.Final.jar:5.0.12.Final]
    at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:109) ~[hibernate-core-5.0.12.Final.jar:5.0.12.Final]
    ... omitted
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) [tomcat-embed-core-8.5.15.jar:8.5.15]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_144]
Caused by: java.sql.SQLSyntaxErrorException: malformed string: 'Acme''
    at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source) ~[hsqldb-2.4.0.jar:2.4.0]
    at org.hsqldb.jdbc.JDBCUtil.sqlException(Unknown Source) ~[hsqldb-2.4.0.jar:2.4.0]
    at org.hsqldb.jdbc.JDBCStatement.fetchResult(Unknown Source) ~[hsqldb-2.4.0.jar:2.4.0]
    at org.hsqldb.jdbc.JDBCStatement.executeQuery(Unknown Source) ~[hsqldb-2.4.0.jar:2.4.0]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_144]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_144]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_144]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_144]
    ... 105 common frames omitted
Caused by: org.hsqldb.HsqlException: malformed string: 'Acme''
    at org.hsqldb.error.Error.error(Unknown Source) ~[hsqldb-2.4.0.jar:2.4.0]
    at org.hsqldb.error.Error.error(Unknown Source) ~[hsqldb-2.4.0.jar:2.4.0]
    at org.hsqldb.ParserBase.read(Unknown Source) ~[hsqldb-2.4.0.jar:2.4.0]
    at org.hsqldb.ParserDQL.XreadPredicateRightPart(Unknown Source) ~[hsqldb-2.4.0.jar:2.4.0]
    at org.hsqldb.ParserDQL.XreadBooleanPrimaryOrNull(Unknown Source) ~[hsqldb-2.4.0.jar:2.4.0]
    ... 122 common frames omitted

Now if we saw this in the logs, we could pretty immediately see what the issue is. Someone has attempted to look up a customer with the name of Acme' and it broke our SQL statement.

This exception is a clear indicator of a SQL injection and could be easily missed if someone analyzes the logs and only sees the original message.

They might not think much of it and move on to other issues, not catching a serious flaw.
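
To make the difference concrete, here's a small JDK-only sketch. It isn't tied to any logging framework: fullTrace simply renders what a logger typically writes when it's handed the Throwable itself, and wrappedFailure is a hypothetical stand-in for the wrapped exception in the log output above.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.sql.SQLSyntaxErrorException;

final class StackTraceDemo {
    // Renders the full trace, including every "Caused by:" section,
    // which is roughly what a logger writes when given the Throwable.
    static String fullTrace(Throwable t) {
        StringWriter out = new StringWriter();
        t.printStackTrace(new PrintWriter(out, true));
        return out.toString();
    }

    // Hypothetical failure: a generic wrapper whose message hides the real cause.
    static RuntimeException wrappedFailure() {
        Exception cause = new SQLSyntaxErrorException("malformed string: 'Acme''");
        return new RuntimeException("error executing work", cause);
    }
}
```

Calling getMessage() on the wrapper yields only "error executing work," while the full trace contains the "Caused by:" line with the malformed input. That second piece is what you need to spot the injection attempt.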

Log All Java Exceptions

The “swallowing” of exceptions is another all-too-common issue I see.

An exception is thrown somewhere in the application and the developer has a catch block intending to handle the exception, but for some reason forgets to come back to it or decides that it isn't important.

The following example illustrates this problem:

public Customer findCustomerByName(String customerName) {
  try {
    Customer c = customerService.findByName(customerName);
    return c;
  } catch (Exception ex) {
    // todo: Log using the new logging strategy..
  }
  return null;
}

This practice is all too common from my experience and definitely warrants being called out.

Not logging the exception, not rethrowing it, or just not handling it at all results in no indication in the logs that anything went wrong with the application.

There’s never a reason not to at least log an exception.

Swallowing exceptions like this results in hiding any problems with the underlying query or another abstraction, which may go unnoticed and may be the result of issues in business logic or a security flaw.
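
One possible alternative to the empty catch block is to log the exception and then rethrow it wrapped in an application-level exception. The sketch below uses java.util.logging so it runs on a plain JDK; CustomerService and CustomerLookupException are hypothetical names introduced for illustration, and the customer is reduced to a String to keep the example short.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class CustomerLookup {
    private static final Logger LOG = Logger.getLogger(CustomerLookup.class.getName());

    // Hypothetical stand-in for the customerService used in the article.
    interface CustomerService {
        String findByName(String name) throws Exception;
    }

    // Hypothetical application-level exception that keeps the original cause.
    static final class CustomerLookupException extends RuntimeException {
        CustomerLookupException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static String findCustomerByName(CustomerService service, String customerName) {
        try {
            return service.findByName(customerName);
        } catch (Exception ex) {
            // Log with the Throwable so the full stack trace is recorded...
            LOG.log(Level.SEVERE, "Exception looking up customer by name", ex);
            // ...and rethrow so the caller can't mistake failure for "not found".
            throw new CustomerLookupException("Customer lookup failed", ex);
        }
    }
}
```

Whether you rethrow, return a fallback, or let the exception propagate untouched is a design decision for your codebase. The point is only that the failure ends up in the logs.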

Don’t Return Exceptions to the User

When performing a security assessment of any kind, every piece of information you can learn about the application or its environment is potentially useful.

A seemingly innocuous error message may be just what a consultant (or an attacker) needs.

They could find the one exploit that may work against your system or greatly reduce the payloads needed to test for a SQL injection if an error message reveals something about the database system in use.

It’s also a common practice to simply return an exception message to the user through some kind of error handling.

I come across this a lot when testing authentication systems, as in the following screenshot:

The code that handles this might be doing something like this:

User findByUsername(String userName) throws UserNameNotFoundException {
  EntityManager em = entityManagerFactory.createEntityManager();
  return em.createQuery("from User where userName = :userName", User.class)
    .setParameter("userName", userName)
    .getSingleResult();
}

Later on, the exception is thrown and caught. The developer uses the exception message to construct an error that's passed along to the user. This results in the user being able to see the raw exception message.

public String login(Model model, String username, String password) {
  try {
    // attempt to login user
    userService.login(username, password);
  } catch (Exception ex) {
    model.addAttribute("error", ex.getMessage());
  }
  return "login";
}

Not only is this bad practice as far as exception handling goes, but it also opens up the application to user account enumeration: an attacker can tell from the error message whether a given username exists.

Depending on the type of application you're working on, this could be a risk in itself.

Never return the contents of an exception object to the user. Catch the exception, log it, and return a generic response.

You never know what information the exception message may contain as code evolves, and the message itself may change in the future.
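
Here's a minimal sketch of that catch, log, and return-a-generic-response pattern, written against the plain JDK rather than any web framework. The Authenticator interface, the attemptLogin method, and the idea of handing the user a random reference ID are all assumptions made for this example.

```java
import java.util.Optional;
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

final class LoginErrorHandling {
    private static final Logger LOG = Logger.getLogger(LoginErrorHandling.class.getName());

    // The same text for every failure, so it reveals nothing about which usernames exist.
    static final String GENERIC_ERROR = "Invalid username or password.";

    // Hypothetical stand-in for userService.login(...) from the article.
    interface Authenticator {
        void login(String username, String password) throws Exception;
    }

    // Returns an empty Optional on success, or a user-safe error message on failure.
    static Optional<String> attemptLogin(Authenticator auth, String username, String password) {
        try {
            auth.login(username, password);
            return Optional.empty();
        } catch (Exception ex) {
            // The details go to the log, keyed by an ID that support staff can search for.
            String errorId = UUID.randomUUID().toString();
            LOG.log(Level.WARNING, "Login failed, errorId=" + errorId, ex);
            return Optional.of(GENERIC_ERROR + " (reference: " + errorId + ")");
        }
    }
}
```

The user sees the same message no matter what went wrong, while the log keeps the full exception for whoever investigates.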

Don’t Log Sensitive Information

I mentioned that logs can be useful for not only debugging but also for compliance, audit, and forensics. Because logs have many uses and we have a tendency to just “log everything,” they can be an incredible source of information.

If logs contain usernames, passwords, session tokens, or other sensitive information, it really reduces the work for an attacker.

Logs will reveal the inner workings and failures of an application, all of which an attacker can use to attack the application further.

Due to this, we need to view and treat logs as sensitive and keep them secure. We probably already know not to log the following information:

Credit card numbers
Social security numbers
Passwords

But the following types of information shouldn't be written to logs either:

Session identifiers
Authorization tokens
Personal names
Telephone numbers
Information the user has opted out of (e.g., do not track)

There’s another issue: some jurisdictions don’t allow certain information to be tracked, and doing so violates the law.

Knowing the compliance requirements of the application and the data it processes is extremely important.
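
One way to reduce the risk of sensitive values slipping into log messages is to scrub them before they're written. The following is only a sketch: LogRedactor, the key names it looks for, and the crude "13 to 16 digits" rule for card-like numbers are assumptions for illustration, and pattern matching like this will never catch everything.

```java
import java.util.regex.Pattern;

final class LogRedactor {
    // Assumed key names; adjust to whatever your application actually logs.
    private static final Pattern SECRET_PAIR = Pattern.compile(
            "(?i)\\b(password|token|sessionid|ssn)=([^\\s,;&]+)");

    // Rough approximation of a payment card number: a run of 13 to 16 digits.
    private static final Pattern CARD_LIKE = Pattern.compile("\\b\\d{13,16}\\b");

    // Replaces the value of known secret keys and any card-like digit run.
    static String redact(String message) {
        String out = SECRET_PAIR.matcher(message).replaceAll("$1=[REDACTED]");
        return CARD_LIKE.matcher(out).replaceAll("[REDACTED-NUMBER]");
    }
}
```

Redaction is a safety net. The better habit is still not to pass sensitive values to the logger in the first place.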

Don’t Be in the Dark

While logging isn't a complex task, there's a lot of subtlety and balance in getting it right. Too little information won't be very valuable. Too much information can be overwhelming if it isn't relevant or isn't handled properly.

Application logging isn't optional. Without adequate logs, you're truly in the dark.

This post was written by Casey Dunham. Casey, who recently launched his own security business, is known for his unique approaches to all areas of application security, stemming from his 10+ year career as a professional software developer. His strengths include secure SDLC development consulting; threat modeling; developer training; and auditing web, mobile, and desktop applications for security flaws.

Get Started Quickly With Java Logging

Carlos Schults — Tue, 05 May 2020 15:26:53 +0000

You've already seen how to get started with C# logging as quickly as possible. But what if you're more of a Java guy or gal? Well, then we've got your back, too: today's post will get you up to speed with logging using C#'s older cousin, Java.

As in the previous post in this series, we'll not only provide a quick guide but also go into more detail about logging, diving particularly into the what and why of logging.

So, let's get you started with Java logging as quickly as possible. When you're finished reading, you'll know how to approach logging in your Java codebase.

The Simplest Possible Java Logging

For this simple demo, I'm going to use the free community version of IntelliJ IDEA. I'm also assuming that you have the Java JDK installed on your machine.

First, open IntelliJ IDEA and click on "Create New Project":

On the next screen, select "Java" as the type of project, on the left panel. You'll also need to point to the JDK's path, in case you haven't yet.

Third step: mark the "Create project from template" option and then select "Command Line App":

I've called my project "JavaLoggingDemo," as you can see from the image below.

After doing the steps above, you should see the Main class that was automatically created for you. It should look like this:

package com.company;

public class Main {

    public static void main(String[] args) {
    // write your code here
    }
}

Let's get to work. Replace the "// write your code here" comment with the following line:

System.out.println("Hello there!");

Now you have an application that displays a friendly greeting.

Next, let's run our app. Go to Run > Run 'Main' or use the shortcut Shift + F10 (Windows and Linux) or Control + R on Mac.

If everything went well, you should see something like this:

Nice. But let's say the requirements for our little app changed, as they always seem to do. Now we need our app not only to display a message but also to log it somewhere. How to go about that?

There are a lot of sophisticated options we could use, but let's try and do the simplest thing we can. Edit the code in your main function so it looks like this:

public static void main(String[] args) throws IOException {
    String message = "Hello there!";
    System.out.println(message);
    Files.write(Paths.get("log.txt"), message.getBytes());
}

Notice the changes. We've added a new variable (message) to store the greeting's text. Then there's the printing. And finally, there's a new line, which is meant to save the text to a file.

We also had to add a throws declaration to our method since the last line can throw an IOException. Additionally, we've added some new import declarations. The whole code now looks like this:

package com.company;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Main {

    public static void main(String[] args) throws IOException {
        String message = "Hello there!";
        System.out.println(message);
        Files.write(Paths.get("log.txt"), message.getBytes());
    }
}

When you run your application again, it should behave the same as before. But the last line added will create a log file for you. It'll be located in your project folder:

That's pretty much it. You've just written your first Java logger! And with that first, small step, you're ready to try larger ones.

What Is Application Logging?

First things first. Let's briefly define what application logging is.

According to Wikipedia,

In computing, a log file is a file that records either events that occur in an operating system or other software runs, or messages between different users of a communication software. Logging is the act of keeping a log.

A little bit vague, right? I tend to prefer the definition we gave in our C# logging post:

Application logging involves recording information about your application’s runtime behavior to a more persistent medium.

With the what out of the way, let's get to the why of logging.

What’s the Motivation for Logging?

I think the key part of the logging definition we've presented in the previous section was the word persistent. Why would we need to record information about the behavior of our app in a persistent medium?

Well, for the same reason we record anything to a persistent medium: so we can get back to it later. The question then becomes "Why would you want to do that with events that happened in an application execution?"

It all boils down to the nature of software itself.

A piece of software in production is a very complicated thing, more often than not running in a totally non-controlled environment.

How do you know it's going to work? And when things go wrong, how can you know what exactly went off the rails?

You can always cross your fingers and hope, but as they say, hope is not a strategy. Logging is, though!

In a nutshell, you use logging so that you can after-the-fact debug your application. By reading the log entries, you can understand the events that happened with your application, or even retrace the steps taken by a user, in order to diagnose and fix issues.

Evolving Our Approach: What Should We Capture? And To Where?

Being able to do an after-the-fact investigation on your app sounds like a great benefit, and indeed it is. Our humble logger isn't able to provide us this benefit, though, since all it can do is write some text to a file.

Our job now should be to evolve our logger, turning it into a richer source of information for its future readers.

So how do we do that?

The key point to understand is the nature of a log entry.

Some of this is familiar ground, but I'm going to recap here for completeness' sake. Think of a log entry as an event. Something that is of interest to your application happened in a given instant in time. So the value of your log entry derives from the data you capture about the event.

The following list contains some examples of things that you'll probably want to capture.

A timestamp. Exactly when did the event take place? (You'll want to record times in UTC and using a standard format such as ISO-8601, in case you're logging to a file.)
Context that makes it clear to a reader what the entry's about. Just recording “Hello, user!” might prove confusing weeks or months later. A better message might say, “Recorded user greeting of ‘Hello, user!’”
Tags and/or categories for each entry, to enable later search and classification.
Log levels, such as “error,” “warning,” or “information,” that allow even more filtering and context.
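
Put together, the items in this list could be sketched as a single formatted line. LogEntry and its Level enum below are names invented for this example; a real logging framework provides its own equivalents.

```java
import java.time.Instant;

final class LogEntry {
    // Assumed set of levels, mirroring the ones named in the list above.
    enum Level { INFORMATION, WARNING, ERROR }

    // Builds one entry: UTC timestamp (ISO-8601), level, category tag, and a contextual message.
    static String format(Instant when, Level level, String category, String message) {
        return when + " " + level + " [" + category + "] " + message;
    }
}
```

For instance, formatting an INFORMATION entry in the "greeting" category with Instant.now() gives you a line that's easy to sort by time and filter by level or category later.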

After deciding what things you want to capture, the next step is defining where you want to log to. Even though I've used a file in our first example, there's nothing stopping you from logging to other media, such as a database (relational or not), the Windows Event Viewer, or even the console.

Enter the Java Logging Framework

Time to get back to our demo. Here's the thing: it was designed to give you a quick start on the Java logging world. But now you're equipped to look at it in a more critical way, so let's do just that.

As it turns out, our humble logger is inadequate in several important ways.

For one thing, the code always overwrites the log file. A log strategy that only allows you to see the last event isn't terribly useful, is it?
And since we're talking about overwriting the file, it probably wouldn't hurt to think about file permissions, how to deal with concurrent access, and that sort of thing.
Finally, it sounds like an awful lot of work having to write boilerplate code in each logging call to get log levels, timestamps, and all that jazz I mentioned in the previous section.
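
Just to show how much there is to think about, here's roughly what fixing only the first of those problems by hand might look like. This is a sketch, not a recommendation: AppendingFileLog is a made-up class, and the synchronized keyword only protects against interleaved writes from threads inside the same JVM.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Instant;

final class AppendingFileLog {
    // Appends one timestamped line; creates the file if it doesn't exist yet.
    static synchronized void append(Path logFile, String message) throws IOException {
        String line = Instant.now() + " " + message + System.lineSeparator();
        Files.write(logFile, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

Even this small version still says nothing about log levels, file rotation, permissions, or multiple processes writing to the same file.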

I could go on and on, but you've got the picture: coming up with a sound Java logging approach requires a lot of thought. I have good news, though. As it turns out, people have already solved these problems. Not only that, they provide their solutions for you, ready to use and often for free, in the form of a logging framework.

A logging framework is a package you install in your application that provides easy and configurable logging for you. With a little bit of configuration, you can make one-line calls as simple as the one in our toy example. But this time, they'll have consistent and nicely formatted log entries, and they'll have answers to all those questions from the previous section.

Installing a Logging Framework

The first thing we're going to do is go back to our codebase and delete that line that writes to the file. Then we can proceed to install a logging framework called log4j. This is only one of the options that are out there, and I do encourage you to try alternatives later. But since this is a quick guide, it makes sense to go with log4j since it's a widely used framework. One caution: the 2.10.0 release used in this walkthrough predates the fix for the Log4Shell vulnerability (CVE-2021-44228), so pick a current release for anything beyond a local demo.

In the Project tool window, right-click on the "JavaLoggingDemo" module and then on "Add Framework Support...":

Now, select "Maven" and click OK:

After you've done this, the IDE will create a pom.xml file for you and open it for editing:

Now, copy and paste the following XML text into your file:

<dependencies>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-api</artifactId>
        <version>2.10.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>2.10.0</version>
    </dependency>
</dependencies>

The whole file should look like this by now:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>groupId</groupId>
    <artifactId>JavaLoggingDemo</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-api</artifactId>
            <version>2.10.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.10.0</version>
        </dependency>
    </dependencies>     
</project>

After adding the dependencies to the file, you should see a small pop-up in the bottom-right corner of your screen. Click on "Import Changes" to finish the installation.

Configuring the Logger

Now it's time to configure log4j.

What I'm going to show you is one way of doing this; it's not the only one, and it's probably not even the best one, whatever "best" may mean. But it's definitely shorter than a lot of the tutorials you see out there.

Pretty much all you have to do is to create a configuration file and then write some lines of code. Don't believe me? Well, let me just show you, then.

First, go to the "Project" tool window on IntelliJ. Expand the folders and locate the "resources" folder, under JavaLoggingDemo > src > main, just like the image below:

Now, right-click on the resources folder, and then select New > File:

When you're prompted for a name, enter "log4j2.xml." After the file is created, paste the following text into it:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="INFO">
    <Appenders>
        <File name="FileAppender" fileName="proper.log" immediateFlush="false" append="true">
            <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
        </File>
    </Appenders>
    <Loggers>
        <Root level="ALL">
            <AppenderRef ref="FileAppender"/>
        </Root>
    </Loggers>
</Configuration>

Editing the Code

First, at the top of Main.java, add the following two import declarations:

import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.LogManager;

Then, add the following line to the top of the Main class, which declares a private field to hold the instance of the logger:

private static final Logger logger = LogManager.getLogger(Main.class);

Finally, add the following line of code to where you earlier had the call to Files.write(Paths.get("log.txt"), message.getBytes()):

logger.info(message);

The whole class should now look like this:

package com.company;

import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.LogManager;

public class Main {

    private static final Logger logger = LogManager.getLogger(Main.class);

    public static void main(String[] args) {
        String message = "Hello there!";
        System.out.println(message);
        logger.info(message);
    }
}

And we're done! Now all you have to do is run the application.

Checking the Results

Now navigate back to the application folder and notice a new file there, called "proper.log." If you open it using a text editor, you should see the following:

2019-09-25 15:39:29.739 [main] INFO  com.company.Main - Hello there!

Let's break this line into its components.

2019-09-25 15:39:29.739

First, we have the timestamp, in the ISO-8601-compliant format.

[main]

This refers to the name of the thread from which the event originated.

INFO

Here we have the logging level.

com.company.Main

Then, the name of the class.

Hello there!

Last, but not least, the message itself.

The Power of a Java Logging Framework

I think you'll agree that our logger just got a lot more useful with the update we've made.

Just by creating a config file and writing a few lines of code, we were able to configure a realistic example of a logger. This isn't all there is to it, of course, but it's already enough to be useful in a real application. What's more, it gives you a pretty good idea of the power a logging framework can put into your hands.

The XML configuration file is one of the places this power really manifests itself. It offers you a lot of options to configure each entry that you log, such as the name of the log file or if the logger should append to it or overwrite it, just to name a few.

A lot of the flexibility log4j provides is due to something called an appender. The appender is the component that effectively writes the message to some medium, and there are many of them available. This means it's possible to direct your logs somewhere else entirely, just by adding a new appender to the XML file, without even touching your application code.
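
The same separation can be illustrated with the JDK's built-in java.util.logging, where a Handler plays a role similar to a log4j appender. To be clear, this is an analogy and not log4j code; MemoryListHandler and HandlerDemo are hypothetical names, and the in-memory list simply stands in for "some other medium."

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class HandlerDemo {
    // Hypothetical destination: keeps messages in memory instead of a file.
    static final class MemoryListHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            if (isLoggable(record)) {
                messages.add(record.getLevel() + ": " + record.getMessage());
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    // Application code: it only talks to the logger and knows nothing about destinations.
    static void greet(Logger logger) {
        logger.info("Hello there!");
    }
}
```

Attaching a MemoryListHandler with logger.addHandler(...) redirects the output without changing a single line of greet, which is the same benefit you get from adding an appender to the XML file.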

Separation of concerns at its finest, if you ask me.

Happy Learning!

What we've seen today is just the tip of the iceberg. Make no mistake: you now have a lot of learning ahead of you. But you have the advantage of already having a live, functional, and realistic setup to work with. Start by playing with and tweaking it.

Here are some tips on what you can do:

Learn how to configure different loggers, one for each log level.
Research and try out the different appenders available.
Play with the options available for the "layout" entry.
Learn about the other ways you can configure log4j.
You have lots of options for logging frameworks; try some of them, once you're confident enough with log4j.

And to learn more about Java logging and logging strategies in general, you probably won't find a better place than Scalyr's blog. Scalyr offers a log aggregation tool, which means that once you have lots of log files and data, they’ll help you organize, search, and make sense of all these data. So stay tuned for more!

Learning something from zero can be an overwhelming task. Fortunately for you, you don't have to do that. We've given you the fundamentals upon which you can build a solid and long-lasting body of knowledge.

Now it's up to you.

PS: If you're interested in further reading on the topic of logging in other languages and platforms, take a look at some of our available guides:

Creating an Audit Trail for Your Business

Erik Dietrich — Tue, 28 Apr 2020 15:50:35 +0000

No matter what you do, there will be aspects of your job that you absolutely love. And then you'll have the things that you tolerate out of necessity. I'm guessing that, for almost everyone reading, "audit trail" sounds like something that fits squarely into the "tolerate" bucket.

Even if you don't know what it is, it probably sounds equal parts intimidating and boring. The closest word association you'll likely have with "audit" is that it's what the IRS does to you when it simultaneously takes a fine-toothed comb to your life and demands more money from you. And looking to avoid angering the IRS is probably not what you dreamed of on career day as a child.

But building and maintaining an audit trail for your business doesn't have to be onerous. Far from it.

What Is an Audit Trail, Anyway?

I've thrown the word around a few times, but let's get a little more precise to set the stage for the post. What is an audit trail?

To get a good working definition of "audit trail," consider the definition of "audit."

An official examination and verification of accounts and records, especially of financial accounts.

It has official overtones to it, and it involves taking a detailed look at relevant records. So when you commission an audit, you ask someone to come in, on the record, and take a detailed look at what you're doing.

An audit trail, then, is what you do to facilitate this activity. You make sure to dutifully document and capture anything that an auditor might need. What's the reasoning for this? Generally speaking, you do this to demonstrate that you operate with a high degree of transparency and that your activities are all ethical, responsible, and legal.

Take the aforementioned case of the IRS mandating an audit for you. This will tend to go much better for you if you've made sure to create an audit trail: saving receipts, noting business expenses, keeping careful track of all income, etc.

Examples of Audit Trails in Business

What are some other examples that are perhaps a little less adversarial than dealing with the IRS? Let's take a look at the sorts of audit trails a business might find handy and why.

A detailed security log of which users have access to what information in a database. This sort of thing can be crucial for protecting sensitive data like health information.
Records of all financial transactions a business makes. This includes everything from paying employees to operating expenses to accounts receivable. When you know where all money comes from and goes, you can prevent abuse and fraud.
Tracking all customer communication. This can help a great deal with dispute resolution and also with keeping customers happy by providing relevant information about their past communications.
Operational transactions. Imagine a company like Uber, for instance. The ability to audit pickup and dropoff times, as well as routes, helps with pricing and staffing considerations.

I could go on, but you get the idea. An audit trail can indemnify you against potential legal actions and accusations. But it can also serve to give you important intelligence about your business. It might not be the most exciting idea in the world on the surface, but it's incredibly important.

Do You Really Need an Audit Trail? Who Does This Matter To?

Some of these examples might seem to apply to larger or more mature operations. Does a small business really need something like this? What kinds of businesses are the best candidates?

Well, obviously larger companies with larger risk profiles have the strongest need in this department. This holds especially true for highly regulated industries. If you work with sensitive health information or offer a product with safety implications, you'll have the greatest need. Governmental agencies will check up on your compliance, so it behooves you to be able to demonstrate it quickly and easily.

But this doesn't mean that smaller companies can't benefit. In the first place, today's smaller companies are tomorrow's larger ones. But beyond that, this information can help you with both business intelligence and the prevention of problems. No matter what the size of the business, is it ever reasonable to be unable to account for the money you make or the people that log into your system?

Audit trails let you perform your own business health checkups.

How to Create Audit Trails for Your Business

I've already defined what an audit trail is. But let's now look at what it involves. What are the prerequisites for a meaningful audit trail?

It has to be a complete history of what you're interested in monitoring. If it's your accounting, you can't leave transactions out here and there.
It has to be sequential so that you can recreate a chronological play-by-play.
You need to make sure it's both consumable and searchable. If you can't read it, then it's useless.
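To make those three prerequisites a little more concrete, here is a minimal Python sketch of an in-memory audit trail. The class name `AuditTrail`, its fields, and its methods are all hypothetical names chosen for illustration; a real audit trail would live in durable storage rather than a Python list.

```python
import json
from datetime import datetime, timezone


class AuditTrail:
    """Append-only, ordered, searchable list of audit events (in memory only)."""

    def __init__(self):
        self._events = []
        self._next_seq = 1

    def record(self, actor, action, target, when=None):
        # Sequential: every event gets the next sequence number and a timestamp.
        event = {
            "seq": self._next_seq,
            "time": (when or datetime.now(timezone.utc)).isoformat(),
            "actor": actor,
            "action": action,
            "target": target,
        }
        self._events.append(event)
        self._next_seq += 1
        return event

    def is_complete(self):
        # Complete (in this sketch): sequence numbers run 1..n with no gaps.
        expected = list(range(1, len(self._events) + 1))
        return [e["seq"] for e in self._events] == expected

    def search(self, **criteria):
        # Searchable: return the events whose fields match every criterion.
        return [e for e in self._events
                if all(e.get(k) == v for k, v in criteria.items())]

    def to_json_lines(self):
        # Consumable: one JSON object per line, readable by people and tools.
        return "\n".join(json.dumps(e, sort_keys=True) for e in self._events)


trail = AuditTrail()
trail.record("alice", "login", "crm")
trail.record("bob", "update", "invoice-17")
trail.record("alice", "export", "customer-list")
```

Calling `trail.search(actor="alice")` then returns the first and third events, in the order they happened.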

With those prerequisites in mind, what's the best way to create an audit trail for your organization? Well that part, at least, is simple. You do it through software logging.

In this day and age, the world (and most of your operations) runs on some form of software or another. You're probably not using carbon paper and binders to keep track of your finances and customer orders. Instead, you use accounting software and CRM systems. And all of those systems produce log files, as do the pieces of server and operating system software running beneath them.

You might also have your own software. And, if you do, you're probably logging from it as well. You can help your own audit trail by implementing good logging practices.

So in the broadest terms, you want to make sure that you're capturing all of this information in your log files. Gather them all up and look through them, making sure you're capturing what you need. If you're not getting everything, work with vendors or the responsible people in house to capture the additional information.

Log Aggregation for a Sophisticated Audit Trail

What I just described probably sounds like a lot of work. And that's because it will be a lot of work. Going out and finding all sorts of different log files, rounding up the people with relevant expertise, and making sure they have what you need...it's a daunting task.

In the end, it'll be worth the effort. But if you'd rather forgo some of that effort, you can take advantage of modern automation around log management. Here are a few of the relevant features that a log aggregation tool offers:

It can put all of your logs into a single place and weave them together into chronological order.
It can extract the most salient bits of information and allow you to tag them so that you can examine different facets of your business.
It'll give you really fast search capabilities, as well as charts and graphs to help you visualize the log contents.
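The first of those features, weaving logs together chronologically, can be sketched in a few lines of Python. This assumes each source is already sorted by time and has been reduced to `(timestamp, source, message)` tuples; the sample entries and the `merge_logs` name are made up for illustration.

```python
import heapq
from datetime import datetime


def merge_logs(*sources):
    """Merge already-sorted (timestamp, source, message) streams by timestamp."""
    return list(heapq.merge(*sources, key=lambda entry: entry[0]))


# Hypothetical entries from two different systems.
app_log = [
    (datetime(2020, 4, 1, 9, 0, 5), "app", "order 17 created"),
    (datetime(2020, 4, 1, 9, 0, 9), "app", "order 17 paid"),
]
db_log = [
    (datetime(2020, 4, 1, 9, 0, 6), "db", "INSERT orders"),
    (datetime(2020, 4, 1, 9, 0, 8), "db", "UPDATE orders"),
]

timeline = merge_logs(app_log, db_log)
for when, source, message in timeline:
    print(when.isoformat(), source, message)
```

A real aggregation tool does far more than this, but the core idea is the same: many ordered streams in, one ordered timeline out.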

Tooling exists to do these things, which figures to save you a lot of effort.

Way too many organizations first think about creating an audit trail when someone comes along to audit them. Then it's a painful and high-pressure experience. But if you start before the pressure's on, you'll have a much different experience building your audit trail. The combination of today's tooling and getting a jump on it early can give you both peace of mind and a huge competitive advantage.

Containers: Benefits and Making a Business Case

Lou (🚀 Open Up The Cloud ☁️) — Tue, 21 Apr 2020 16:32:02 +0000

Containers are hot stuff right now.

So it’s natural that you’ve found your way here wondering what the business case and benefits of containers could be. I’m guessing you are assessing whether containers would make sense for your company? If I’m right, you’re in the right place.

By the end of this article you’ll know what containers are, what their benefits and drawbacks are and you’ll have some decision making criteria to assist you in your business case.

We've got quite a bit of ground to cover, so let's get to it.

Firstly, What Is a Container?

Before we get into pros and cons, we need to have a basic understanding of what a container is. We need to understand how containers operate and what areas of our company could be positively affected by the implementation of containers.

Docker, currently the most popular container software on the market, defines a container as:

A standardized unit of software.

Sounds pretty vague to me. Anyone working in software will have heard phrases similar to this, such as component, module, or app.

Aren't these all standardized units of software? What makes a container unique?

According to Docker:

Containers isolate software from its environment and ensure that it works uniformly despite differences for instance between development and staging.

Bingo.

This is the real value of containers: isolation. Containers are pretty much isolated processes running on a host machine. I say pretty much since there are some instances when a container isn't truly isolated. But we'll get into that a bit later on.

Why is isolation important? Because isolated software runs with the same behavior no matter where you put it.

This is useful when:

Moving the software between environments, e.g., for testing purposes.
Moving the software to a different running location, e.g., from on-site servers to the cloud (or even between cloud providers).
Scaling the software to run compute power concurrently (horizontal scaling), e.g., to achieve high availability or performance requirements.

Simply put: containers allow software engineers to create small, isolated pieces of code that can run on any machine, anywhere, in a consistent fashion.

Don't worry if this is a little too abstract, because we are going to break it down.

But first, to ensure a balanced discussion, let's look at the alternatives to containers.

Container Alternatives

If you’ve been in tech for a while, my previous statement about creating isolated software could have you raising your eyebrows. Because indeed there are other methods for isolation and portability that don’t require containers.

In order to see the benefits containers offer, we should review some of the alternatives.

Let's take a quick look at these options.

Alternative 1: Manual Configuration

Manual configuration is essentially the antithesis to containers. Rather than having your code run in the same way on all machines, you become susceptible to the most fickle errors of them all: human ones.

Manual configuration is simply using an individual, a human, to manage servers or applications. Servers that are manually configured (in what is often now considered a fairly old-school way) are maintained by an operations team or a systems administrator.

Even though it may be old school, we must consider manual configuration as an option because often, especially for small companies, it can be the most pragmatic solution.

As we'll soon see, containers can help companies looking to achieve scale. But without the right conditions, adding more complexity to a new project might bring more risk than it makes sense to take on.

Alternative 2: Virtual Machines (VM)

Next up, virtual machines.

Image: Containers vs. VMs (source)

For a long time, the tech industry standardized on the idea of virtual machines as a way to create the aforementioned isolation and combat the inconsistencies of manual configuration.

Simply put: virtual machines are small, isolated machines that run within another machine (called the host). I think it's easiest to think of virtual machines as a computer within a computer.

For a long time, I couldn't discern the difference between VMs and containers. The penny finally dropped when I understood containers aren't magic, nor are they just boxes, as they're often drawn.

Instead, they're a protected process running on a machine.

And because containers are processes, they can share the system resources of their host, unlike virtual machines that require an entire operating system within each virtual machine you want to run. This creates a trade-off between the near-perfect isolation of a virtual machine and the speed and lightweight benefits of a container.

Alternative 3: Serverless

Image: A Serverless architecture (source)

Lastly, the newest entrant in the getting-code-onto-a-server market, serverless.

Serverless tackles the difficulties of matching application and infrastructure in a different way: by removing infrastructure from the equation completely.

But how does this work? Surely there must still be a server?

In serverless, there are still servers that run our code, but this responsibility is passed to a cloud provider.

Instead of putting applications into the cloud as containers or virtual machines, we send small functions of code. These are then run individually based on demand. Serverless can be great for scaling and achieving lower costs, but it also comes with added complexity.

Serverless and containers are comparable, but they’re not direct competitors. By design serverless ties you to a cloud provider. While some tools such as Serverless Framework help you minimize the lock-in, you might be losing out on a lot of the benefits.

Companies who adopt serverless often benefit by going fully cloud-native and embracing all of a cloud provider's features without fear of lock-in. So, if cloud agnosticism is a goal, containers could be a better choice than serverless.

Benefits of Containers

Hopefully by now you have a good grasp of what a container is and what the alternatives are: manual configuration, VMs, and serverless. With this base of knowledge, let's move on to looking in more detail at the specific benefits of containers.

Benefit 1: Deployment Flexibility

Containers don't have to just work for applications such as websites. Since they are simply processes, it's easy to scale them and use them for different purposes.

Containers also work for other ad-hoc and operational tasks, such as:

Database schema updates.
Performance testing.

For instance, recently I launched a performance testing suite on a fleet of containers. Each container had a performance profile, and I could easily scale it up and down to mimic load for testing purposes.

These containers could just as easily be run on a local machine as they could on AWS, GCP, or any other cloud provider.

Benefit 2: Cloud Agnosticism (Portability)

Image: Container portability (source)

When it comes to architecture decisions, there will always be a stakeholder who will ask...

"What if we want to move cloud providers? How hard will that be?"

With containers, the answer is: not too hard.

Containers themselves can run on open source software, such as Linux. They're shipped with a run file, meaning that you'll find their packaging instructions contained within the software, not the cloud provider.

So if you do choose to move your application to a different set of servers, you can do so with relative ease.

By contrast, a solution such as serverless is, by its very nature, highly coupled to the cloud provider. This is worth considering if portability is a concern for you.

Benefit 3: Fine-Grained Architectural Implementation

Image: Breaking down a monolithic application (source)

Often a container-based solution for an application goes hand in hand with a microservices-type architecture. In microservices, a large application is broken down into smaller parts that are typically worked on by separate teams and deployed in isolation.

When separated in such a way, you can scale them horizontally, which simply means that rather than making the machine bigger, you run two of them side by side. This can then give you a fine level of granularity when performance tuning an application, allowing you to scale up and down only where needed.

Microservice-type architecture brings additional advantages, as teams can work independently, drastically increasing velocity and speed of implementing change (at least in theory).

On the other hand, microservice architectures are difficult to get right and require expertise and team communication. Split them in the wrong place and you have a very complex, distributed application.

But with containers in place, you have the option of whether to run one giant monolithic-type architecture or break your application down into pieces. You can even put it back together again if you split up your application in the wrong way.

Benefit 4: Easier Employment of Talented Engineers

A pragmatic—albeit slightly dry—reason to choose containers: resourcing.

Containers are hot technology in the market right now. Most technology companies struggle to find good developers to work on their products and services.

So it makes sense to ensure that the technologies we run are in demand in the market and that we can find good staff to work on our software.

Drawbacks of Containers

Okay, so now we've talked a lot about the merits of containers.

But to make sure we get a complete picture, let's take a look at all the ways containers can let us down, or worse: actually hurt our performance and ultimately, our company.

Drawback 1: Additional Complexity

Image: Orchestration tool options (source)

Containers don't come alone. Unfortunately, oftentimes running containers means additional overhead in setting up systems to facilitate them. We will need somewhere to store our container images and likely, we'll want what is called a container orchestration tool.

On a practical level, container orchestration:

Chooses how many containers to run at any given time.
Tells our containers on what hosts they need to run.
Starts, restarts, or destroys containers.
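Those three responsibilities can be sketched as a single "reconcile" step that compares what you want with what is running. This is only an illustrative Python sketch, not how Swarm or Kubernetes is actually implemented; the `reconcile` function, the action tuples, and the host names are all hypothetical, and it assumes at least one host is available.

```python
def reconcile(desired_replicas, running, hosts):
    """Return the actions needed to move from the running state to the desired one.

    running: dict mapping container id -> host name
    hosts:   non-empty list of host names available for placement
    """
    actions = []
    load = {host: 0 for host in hosts}
    healthy = {}
    for cid, host in running.items():
        if host in load:
            load[host] += 1
            healthy[cid] = host
        else:
            # The host is gone, so the container on it is treated as lost.
            actions.append(("destroy", cid, host))

    # Too few containers: start new ones on the least-loaded host.
    for _ in range(desired_replicas - len(healthy)):
        target = min(hosts, key=lambda h: (load[h], h))
        load[target] += 1
        actions.append(("start", None, target))

    # Too many containers: destroy the surplus, highest id first.
    surplus = max(len(healthy) - desired_replicas, 0)
    for cid in sorted(healthy, reverse=True)[:surplus]:
        actions.append(("destroy", cid, healthy[cid]))
    return actions


# One container running, three wanted, two hosts to choose from.
print(reconcile(3, {"c1": "host-a"}, ["host-a", "host-b"]))
```

An orchestrator runs a loop like this continuously, which is exactly the kind of machinery you have to understand and operate once you adopt one.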

This additional tooling brings with it added complexity. If you were operating in a monolithic architecture, scaling might be a simple case of making the host machine really big.

Not only this, but if you go ahead with containers, you'll need to start considering the other tooling you'll want to run. This can be both complex and time consuming.

Just to paint a picture of the current crowded orchestration market: a couple of the main technologies available are Swarm, by Docker, and Kubernetes, an orchestration tool built by Google.

In terms of cloud adoption, Kubernetes naturally comes as a managed service on Google Cloud Platform (GCP), and it is also available on Microsoft's Azure, while AWS offers its own container orchestration solution called Elastic Container Service (ECS).

Most of these orchestration services are fairly new, and the moment you use a service such as ECS on AWS, you begin to couple yourself to that specific cloud provider and lose some of the original portability benefits of containers.

Drawback 2: The Learning Curve

Implementing new technology is rarely a breeze, despite what the 101 conference talks and the "Hello World" YouTube videos make us think. New technology means training your teams to use it, which means time away from their usual jobs, which may be difficult in your current situation. You should consider the cost that will go into learning something new.

The Big Question: To Container, or Not to Container?

I'm glad you've made it this far, as this is where things get really interesting. Now we're going to take everything we've learned about containers and consider whether or not they will work for your company.

In order to do this, here are the questions you should consider:

Do you have the risk tolerance for additional complexity? If you are at the start of a project, it may be worth opting for a simpler monolithic architecture. With this in mind, you could also build, package, and deploy your project in a single container, deferring the complexity of microservices and distributed systems into the future.

What is your team's level of enthusiasm for containers? If you're thinking of doing a big switch to container technology, keep in mind that the switch will require plenty of enthusiasm from your team. It would be worth your while to check in with your teams before adopting new technologies.

Do you have the ability to train your teams on containers? If you don't give your teams adequate opportunities to learn new technologies, building in time for experimentation and failure, then your journey to implementing containers could be fraught with difficulty. So consider the amount of time you have available to dedicate to this venture.

Do you have an immediate need for high scale? Containers could be a good fit for you if you'll need high levels of scale in the near future. If your application is set to serve millions of requests and you'll want to scale independent parts of your application, containers might be the right choice.

Conclusion

And that's a wrap on our whirlwind tour of containers, their benefits, and their drawbacks.

Hopefully you've had a chance to reflect on whether or not they might work for you.

If you've got an immediate need for scale, the ability to take on additional complexity, and an enthusiastic team, then maybe it's time to get going with containers. If you're missing one of these areas, maybe you want to hold off and investigate containers a little more before making your decision.

Containers aren't a silver bullet, but they could be the solution to unlocking a big performance boost for your company.

Now, armed with your newfound knowledge about the benefits and risks of containers, you should be able to make an informed decision about whether containers are right for your company. And remember: no setup works for every situation. So keep an open mind, explore, experiment, and stay curious.

What Goes Into Log Analysis?

Erik Dietrich — Tue, 14 Apr 2020 14:58:01 +0000

I've talked here before about log management in some detail. And I've talked about log analysis in high-level terms when making the case for its ROI. But I haven't gone into a ton of detail about log analysis. Let's do that today.

At the surface level, this might seem a little indulgent. What's so hard? You take a log file and you analyze it, right?

Well, sure, but what does that mean, exactly? Do you, as a human, SSH into some server, open a gigantic server log file, and start thumbing through it like a newspaper? If I had to guess, I'd say probably not. It's going to be some interleaving of tooling, human intelligence, and heuristics. So let's get a little more specific about what that looks like, exactly.

Log Analysis, In the Broadest Terms

In the rest of this post, I'll explain some of the most important elements of log analysis. But, before I do that, I want to give you a very broad working definition.

Log analysis is the process of turning your log files into data and then making intelligent decisions based on that data.

It sounds simple in principle. But it's pretty involved in practice. Your production operations generate all sorts of logs: server logs, OS logs, application logs, etc. You need to take these things, gather them up, treat them as data, and make sense of them somehow. And it doesn't help matters any that log files have some of the most unstructured and noisy data imaginable in them.

So log analysis takes you from "unstructured and noisy" to "ready to make good decisions." Let's see how that happens.

Collection and Aggregation

As I just mentioned, your production systems are going to produce all sorts of different logs. Your applications themselves produce them. So, too, do some of the things your applications use directly, such as databases. And then, of course, you have server logs and operating system logs. Maybe you need information from your mail server or other, more peripheral places. The point is, you've got a lot of sources of log data.

So, you need to collect these different logs somehow. And then you need to aggregate them, meaning you gather the collection together into a whole.

By doing this, you can start to regard your production operations not as a hodgepodge collection of unrelated systems but as a more deliberate whole.

Parsing and Semantic Interpretation

Let's say you've gathered up all of your log files and kind of smashed them together as your aggregation strategy. That might leave you with some, shall we say, variety.

111.222.333.123 HOME - [03/Mar/2017:02:44:19 -0800] "GET /some/subsite.htm HTTP/1.0" 200 198 "http://someexternalsite.com/somepage" "Mozilla/4.01 (Macintosh; I; PPC)"

2015-12-10 04:53:32,558 [10] ERROR WebApp [(null)] - Something happened!

6/15/16,8:23:25 PM,DNS,Information,None,2,N/A,ZETA,The DNS Server has started.

As you can see, parsing these three very different styles of log entry would prove interesting. There seems to be a timestamp, albeit in different formats, and then a couple of the messages have kind of a general message payload. But beyond that, what do you do?

That's where the ideas of parsing and semantic interpretation come in. When you set up aggregation of the logs, you also specify different parsing algorithms, and you assign significance to the information that results. With some effort and intelligence, you can start weaving this into a chronological ordering of events that serve as parts of a whole.
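As a rough illustration, here is a Python sketch that parses the three sample entries above into a common shape and sorts them chronologically. The function names and the output fields are my own invention. It also makes a loud assumption: entries without a UTC offset are treated as UTC, which you would need to verify for your own systems.

```python
import re
from datetime import datetime, timezone

# Note: %b (month name) and %p (AM/PM) depend on the locale; English is assumed.


def to_utc(moment):
    # Assumption: logs that carry no UTC offset were written in UTC.
    if moment.tzinfo is None:
        return moment.replace(tzinfo=timezone.utc)
    return moment.astimezone(timezone.utc)


def parse_web(line):
    # The timestamp sits between the square brackets; the request is the first quoted field.
    stamp = line[line.index("[") + 1:line.index("]")]
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return {"time": to_utc(when), "source": "web", "message": line.split('"')[1]}


APP_PATTERN = re.compile(r"^(\S+ \S+) \[\d+\] (\w+) (\S+) \[.*?\] - (.*)$")


def parse_app(line):
    stamp, level, logger, message = APP_PATTERN.match(line).groups()
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    return {"time": to_utc(when), "source": logger, "message": message}


def parse_event(line):
    # Comma-separated: date, time, source, level, ..., message (the last field).
    fields = line.split(",", 8)
    when = datetime.strptime(fields[0] + "," + fields[1], "%m/%d/%y,%I:%M:%S %p")
    return {"time": to_utc(when), "source": fields[2], "message": fields[8]}


samples = [
    (parse_web, '111.222.333.123 HOME - [03/Mar/2017:02:44:19 -0800] '
                '"GET /some/subsite.htm HTTP/1.0" 200 198 '
                '"http://someexternalsite.com/somepage" "Mozilla/4.01 (Macintosh; I; PPC)"'),
    (parse_app, "2015-12-10 04:53:32,558 [10] ERROR WebApp [(null)] - Something happened!"),
    (parse_event, "6/15/16,8:23:25 PM,DNS,Information,None,2,N/A,ZETA,The DNS Server has started."),
]

timeline = sorted((parser(line) for parser, line in samples), key=lambda e: e["time"])
for entry in timeline:
    print(entry["time"].isoformat(), entry["source"], entry["message"])
```

Once every entry has the same shape, the three formats stop mattering and you can treat them as one ordered stream of events.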

Data Cleaning and Indexing

You're going to need to do more with the data than just extract it and assign it semantic meaning, though. You'll have missing entries where you need default values. You're going to need to apply certain rules and transformations to it. And you're probably going to need to filter some of the data out, frankly. Not every last byte captured by every last logging entity in your ecosystem is actually valuable to you.

In short, you're going to need to "clean" the data a little.

Once you've done that, you're in good shape, storage-wise. But you're also going to want to do what databases do: index the data. This means storing it in such a way to optimize information retrieval.
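Here is a small Python sketch of both steps, assuming entries have already been parsed into dictionaries with `level` and `message` keys. The cleaning rules and the function names are illustrative only; the index is a simple inverted index, which is one common way to make text searchable, not necessarily what any particular tool uses.

```python
from collections import defaultdict


def clean(entries, default_level="INFO"):
    """Drop empty messages, fill in a default level, and normalize case."""
    cleaned = []
    for entry in entries:
        message = (entry.get("message") or "").strip()
        if not message:
            continue  # filter out entries that carry no information
        cleaned.append({
            "level": (entry.get("level") or default_level).upper(),
            "message": message,
        })
    return cleaned


def build_index(entries):
    """Map each lowercase word to the positions of the entries containing it."""
    index = defaultdict(set)
    for position, entry in enumerate(entries):
        for word in entry["message"].lower().split():
            index[word.strip(".,!?:;")].add(position)
    return index


def search(index, *words):
    """Return the positions of entries that contain every given word."""
    sets = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*sets)) if sets else []


raw = [
    {"level": "error", "message": "Payment failed for order 17"},
    {"message": "   "},
    {"level": None, "message": "Order 17 created."},
    {"level": "warn", "message": "Payment retry scheduled"},
]
entries = clean(raw)
index = build_index(entries)
print(search(index, "payment", "order"))
```

The lookup never scans the messages themselves; it only intersects small sets of positions, which is why indexing pays off as the volume grows.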

High-Powered Search

The reason you need to index as part of your storage and cleaning process is pretty straightforward. Any good log analysis paradigm is going to be predicated upon search. And not just any search: really good search.

This makes sense when you think about it. Logs collect tons and tons of data about what your systems are doing in production. To make use of that data, you're going to need to search it, and the scale alone means that search has to be fast and sophisticated. We're not talking about looking up the customer address in an MS Access table with 100 customer records.

Visualization Capabilities

Once you have log files aggregated, parsed, stored, and indexed, you're in good shape. But the story doesn't end there. What happens with the information is just as important for analysis.

First of all, you definitely want good visualization capabilities. This includes relatively obvious things, like seeing graphs of traffic or dashboards warning you about spikes in errors. But it can also mean some relatively unusual or interesting visualization scenarios.

Part of log analysis means having the capability for deep understanding of the data, and visualization is critical for that.

Analytics Capability

You've stored and visualized your data, but now you also want to be able to slice and dice it to get a deeper understanding of it. You're going to need analytics capability for your log analysis.

To get a little more specific, analytics involves automated assistance interpreting your data and discovering patterns in it. Analytics is a discipline unto itself, but it can include concerns such as the following:

Statistical modeling and assessing the significance of relationships.
Predictive modeling.
Pattern recognition.
Machine learning.

To zoom back out, you want to gather the data, have the ability to search it, and be able to visualize it. But then you also want automated assistance with combing through it, looking for trends, patterns, and generally interesting insights.
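To give a feel for the simplest kind of automated assistance, here is a Python sketch that flags spikes in a series of per-minute error counts using a basic z-score. The counts, the threshold, and the `find_spikes` name are made up for illustration, and real analytics tooling uses far more robust methods than this.

```python
from statistics import mean, pstdev


def find_spikes(counts, threshold=2.0):
    """Return the indexes whose count sits more than `threshold`
    standard deviations above the mean of the series."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return []  # a perfectly flat series has no spikes
    return [i for i, count in enumerate(counts) if (count - mu) / sigma > threshold]


# Hypothetical errors per minute; minute 7 is clearly unusual.
errors_per_minute = [4, 5, 3, 4, 6, 5, 4, 40, 5, 4]
print(find_spikes(errors_per_minute))
```

The point is not the statistics but the division of labor: the tooling combs through the numbers, and a person decides what the flagged minute means.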

Human Intelligence

Everything I've mentioned so far should be automated in your operation. Of course, the automation will require setup and intervention as you go. But you shouldn't be doing this stuff yourself manually. In fact, you shouldn't even write your own tools for this because good ones already exist.

But none of this is complete without human intervention, so I'll close by mentioning that. Log analysis requires excellent tooling with sophisticated capabilities. But it also requires a team of smart people around it that know how to set it up, monitor it, and act on the insights that it provides.

Your systems generate an awful lot of data about what they're doing, via many log files. Log analysis is critical to gathering, finding, visualizing, understanding, and acting on that information. It can even mean the difference in keeping an edge on your competition.

What Is Serverless Architecture and When Should You Use It?

Samuel James — Tue, 07 Apr 2020 14:30:15 +0000

Cloud computing is constantly evolving, from bare-metal to container technologies. The latest trend in this process is the serverless (Function as a Service, or FaaS) computing model. According to Techbeacon, serverless has an annual growth rate of 75%, making it the fastest growing cloud service model. So, serverless architecture isn't a mere buzzword. More companies than ever are adopting it.

If you're not sure what serverless architecture is, you've come to the right place. Read on to learn what serverless architecture is, when it makes sense, and when it doesn't.

What Is Serverless Architecture?

The term serverless architecture can be confusing. It doesn't mean application designs that allow for running applications in some magical space where servers are nonexistent.

Serverless architectures are application designs that make use of third-party services (Backend as a Service, or BaaS). They may use custom code run in managed, ephemeral containers on a FaaS platform.

Servers still run your application. But a third-party company takes care of the grunt work of provisioning, managing, and scaling servers. In serverless architecture, you manage and provision nothing.

Serverless architecture often incorporates two components: Function as a Service and Backend as a Service.

FaaS is a computing service that allows you to run self-contained code snippets called functions in the cloud. Your functions remain dormant until events trigger them. Functions are self-contained, small, short-lived, and single-purpose. They die after execution.
BaaS is a cloud computing service that completely abstracts backend logic, which takes place on faraway servers. It allows developers to focus on front-end code and integrate with back-end logic that someone else has implemented. BaaS could be authentication, storage services, geolocation services, user management, and so on.
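To show how small a function in the FaaS sense can be, here is a minimal Python sketch. It uses the `handler(event, context)` shape that AWS Lambda's Python runtime calls; the event field `name` and the response layout are assumptions made for this example.

```python
import json


def handler(event, context):
    """Single-purpose function: greet the caller named in the event.

    It keeps no state between invocations, so the platform is free to
    start and stop copies of it whenever events arrive.
    """
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Hello, " + name + "!"}),
    }


# Locally you can call it like any other function; in the cloud an event triggers it.
print(handler({"name": "Ada"}, None))
```

Everything around this function, such as provisioning, routing the event, and scaling, is the provider's job.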

In serverless architecture, you focus on writing code only. You deploy when you’re ready, without caring about what runs it or how it runs.

Is serverless the same as Platform as a Service?

Serverless and Platform as a Service (PaaS): What's the Difference?

Many people mistake Platform as a Service (PaaS) for serverless architecture. Although they are similar in many ways, they aren't the same.

PaaS providers offer software and hardware infrastructure as a platform to users—what many people call a solution stack. That means you can run custom applications using the provider's platform. Serverless, on the other hand, provides an environment where you can write and run custom code (functions) without managing, provisioning, or scaling infrastructure.

What do PaaS and serverless architecture have in common? You don't manage infrastructure in either one. The platform provider takes care of that. However, differences exist in how you compose your application and how you scale it.

Composition

In the PaaS model, you write your application using a framework or language that your platform provider supports, and you deploy your business logic as a single unit. In the serverless model, your business logic is broken down into self-contained units that each perform a business function.

Scaling

Scaling isn't automatic for PaaS applications. Instead, you have to configure your app and add resources to handle more requests. In serverless architecture, on the other hand, your app scales automatically as the workload increases.

Lifetime Availability

Like traditional applications, PaaS applications have to be available at all times to continue to serve requests. In serverless, by contrast, functions are short-lived—they die after execution.

Now you understand that PaaS isn't serverless, though the two systems are similar in some ways. But why would you want to go serverless?

Why Use Serverless Architecture?

As you ponder this question, consider these three key attributes of serverless architecture.

It's scalable and highly available. Scaling traditional applications requires you to understand your traffic pattern. You estimate how much of each resource you need, and then you provision accordingly. Users troop in from all geographical regions to use modern applications. A traditional application could be overwhelmed by a spike, probably on Black Friday. In serverless, your application is highly available, and it scales automatically as your users grow and usage increases.
It costs less. One of the reasons serverless architecture is gaining popularity among startups is its pricing model. The cost of running servers 24/7 and paying for idle time is no longer an issue in serverless. You pay for usage only. Functions run for an allocated time and die afterward. The provider charges based on the number of executions and the size of memory your workload uses. This helps you optimize costs.
The time to market is faster. Operational tasks such as server provisioning, maintenance, and monitoring infrastructure are off your shoulders. You can focus solely on your business logic (code), experiment with ideas, and hit production on time.
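The pay-per-use pricing described above can be turned into a back-of-the-envelope estimate. The Python sketch below assumes a provider that bills per request plus per gigabyte-second of memory time; the prices in the example are placeholders, not any vendor's actual rates, so plug in the numbers from your provider's pricing page.

```python
def monthly_cost(invocations, avg_duration_ms, memory_mb,
                 price_per_million_requests, price_per_gb_second):
    """Estimate a monthly bill from execution count, duration, and memory size."""
    # Memory time: seconds of execution multiplied by the memory assigned, in GB.
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    request_cost = invocations / 1_000_000 * price_per_million_requests
    return request_cost + gb_seconds * price_per_gb_second


# Placeholder prices, chosen only to make the arithmetic easy to follow.
estimate = monthly_cost(
    invocations=2_000_000,
    avg_duration_ms=500,
    memory_mb=512,
    price_per_million_requests=0.20,
    price_per_gb_second=0.00001,
)
print(round(estimate, 2))
```

Notice that zero invocations means a bill of zero, which is the whole appeal for workloads that sit idle most of the time.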

With all the promises of serverless, is it perfect for all use cases? No. In some cases, serverless architecture makes sense, but in others, it might not. We'll discuss this point in detail later on in this post.

Let's explore some common use cases of serverless architecture.

Serverless Architecture Use Cases

In a traditional application, your application runs actively on servers. Since your servers are on 24/7, you also pay for idle time. In the serverless world, though, there are no servers to pay for. You only pay per trigger. In other words, you pay for what you use.
This advantage makes serverless architecture a good option for business cases that don't have to be always on. In this way, you save money from not paying for idle time.

Examples of serverless architecture use cases are:

High-Traffic Websites

If you're still serving your static websites from an EC2 instance, you may be missing out on a lot. With serverless, you can host your static website in an S3 bucket and serve your assets through a fast, global content delivery network. Not only is it cheaper and faster, but it is also highly available and scalable.

Multimedia Processing Applications

If your business deals with images and videos, then serverless architecture might work well for you. You can use a scalable storage service such as AWS S3 to store your data. Each successful upload can fire an event that triggers a Lambda function, which processes your file asynchronously. Your users can continue to enjoy your app while a highly available and scalable back-end service processes the upload in a non-blocking way.
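As a sketch of what the function on the receiving end of that upload event might look like, here is a small Python handler. It reads the bucket name and object key from the `Records` list that S3 event notifications carry; `process_upload` is a hypothetical placeholder for your own processing logic, and the sample event is trimmed down to just the fields used here.

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Extract (bucket, key) pairs from an S3 upload notification."""
    results = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in the event, so decode them first.
        results.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return results


def process_upload(bucket, key):
    # Hypothetical placeholder: resize the image, transcode the video, etc.
    return "processed " + bucket + "/" + key


def handler(event, context):
    return [process_upload(bucket, key) for bucket, key in uploaded_objects(event)]


# A trimmed-down sample event for local experimentation.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "uploads"}, "object": {"key": "photos/my+cat.jpg"}}}
    ]
}
print(handler(sample_event, None))
```

Because the handler only reacts to the event, the upload itself finishes without waiting for any of this work.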

Mobile Backends

An API gateway gives you an entry point to your business functions. These functions can be exposed as a REST API that your mobile app consumes. Serverless services such as AWS AppSync allow you to securely access, manipulate, and combine data from multiple sources in real time.

Internet of Things (IoT)

IoT devices generate a lot of data from their environments through sensors. Organizations often struggle to process this overwhelming data coming from these connected devices in a scalable way. Using a serverless back end like AWS IoT Core, you can scale to billions of devices and trillions of messages.

Big Data Applications

Before cloud computing, the insights big data provided were available only to big enterprises because organizations needed costly infrastructure to make sense of that data. Setting up and maintaining infrastructure for big data isn't easy. With serverless computing, your app can now take advantage of several services, including Amazon S3, Amazon Athena, Amazon Kinesis, AWS Glue, and AWS Lambda to build scalable data pipelines.

Earlier, I mentioned that serverless architecture isn't a silver bullet. There are cases where serverless might not be a good fit.

When Going Serverless Might Not Make Sense

With all the promises of serverless architecture, it has its drawbacks and limitations too. It's important to keep these in mind when considering serverless architecture and plan accordingly.

There's a limit to how long a function can run. This makes serverless architecture unsuitable for tasks that run for hours.
What if you require a fast response with a consistent latency of less than 50 milliseconds? Serverless architecture suffers from the problem of a cold start. In a case like this, you might need to reconsider your options.
You may be afraid of vendor lock-in. Serverless architecture could tie you to a single vendor. Migrating serverless applications from one vendor to another requires a lot of manual effort and major changes in your code.
Serverless vendors limit how much memory you can assign to functions. If you require a complex compute with high memory requirements, then going serverless might not be a good use case.
Serverless vendors enforce hard limits on deployment size. This could vary from one vendor to another. For example, AWS allows a code deployment size up to 250MB. For large code deployment, serverless architecture might not be suitable.
Serverless is still relatively young. For this reason, observability is still a problem. Having 360-degree insights into your functions can be difficult.

Conclusion

Before moving to serverless architecture, you must first evaluate your use cases. It's critical to understand what your business requirements are, whether serverless architecture makes sense for your project, and which limitations you'd hit down the road. When you have this information, you can better prepare.

In this post, we've seen the advantages of serverless architecture. We also discussed some of the drawbacks of serverless architecture and some typical serverless use cases. To build on this knowledge, I encourage you to read these sources:

Choosing Among Log Management Tools

Erik Dietrich — Tue, 31 Mar 2020 14:55:35 +0000

When you google log management tools, an interesting thing happens. At the time of this writing, you see no fewer than 4 paid ads, followed by a series of posts. These include, and this is not a joke, a post that lists the top 47. As a software developer and tools consumer, this drives me insane. It probably does the same for you.

An author named Barry Schwartz coined a term (along with an eponymous book) for this frustration. He called it "the paradox of choice," and it describes how, while we like to have some choice and autonomy, too much paralyzes us. To understand this in simple terms, imagine selecting music for a dinner party. If offered two albums from which to choose, you'd make a pretty quick choice. If offered hundreds, you might thumb through them for a long time, trying to consider the likely tastes of all of your guests. And you might actually just give up eventually, and opt for only conversation with no background music at all.

The Paradox of Choice Among Log Management Tools

Back in the DevOps world, you face a similar plight when trying to pick among log management tools. You understand that you need a better way to aggregate and mine your logs than "by hand, using Sublime Text," so you start to do some research. And then, about two searches in, you find yourself staring at a post entitled, "The Top 47 Log Management Tools." And, if you're anything like me, you rub your temples and say to yourself, "ugh, never mind, I'll figure this out tomorrow."

That, of course, lines up with Schwartz's findings about human behavior. Beyond having a few options, each additional option presented to a group of people causes fewer people to participate. The higher the number of log management tools in those posts, the fewer people will actually pick any of them at all.

Luckily, there's a path back to joy. And it's not even terribly complicated. You just need to dramatically narrow the field.

So today, I'm not going to add to the pile of "pros/cons/features" posts out there comparing dozens of tools. Instead, I'll speak to heuristics you can employ to help you choose among log management tools. I'm going to help you narrow the field from a paralyzing number of choices that make you unhappy to a manageable number that empowers you.

Look to Those You Trust

Bar none, the most effective way to narrow a field involves relying on people and sources that you know and trust. I'm not talking about ratings sites à la Yelp, either. I'm talking specifically about colleagues and industry authorities that you follow and trust.

Ask them for their recommendations. What do they use and why? Do they like it? Would they recommend it? And, in terms of who you follow, do they have favorite tools? Does someone you admire work for one of the log management tools companies? Do you like their participation in the community?

Depending on the size of your network and reading sources, you'll get a list of varying sizes. Take this list, and set it aside for later cross-referencing.

"Wait," you're probably saying. Shouldn't this be the first way of filtering out the noise? You might think that, but the issue is that this list will be based entirely on the recommenders' needs and not on yours. Instead, set this list aside and go back to the wider field of potential options.

Narrow First with Pricing Clarity and Buckets

First things first. It might be a little gauche, but let's be frank. Cost matters, and it matters a lot. But I would advise you not to get overly concerned with the specifics of price. Instead, I'd slice things broadly into three buckets. (Actually four, but think of one as the null bucket -- I'll explain momentarily).

Reason about price by looking at tools as free, priced for small business, or priced for the enterprise. If you do not want to pay, you have the easiest way to narrow the field. Simply sort through the universe of options discounting any without a free or freemium option. If you reasonably think that you'll have budget for this, but not a lot, look for modestly priced tools (up to a few hundred dollars per month). If you work for a large enterprise, you know who you are. Assume that you'll want the feature-rich, higher-end options with lots of support. And, "market price" options where they just say to call about pricing fall into the enterprise bucket.

That leaves only the cryptic null case. What I'm talking about here is byzantine pricing schemes designed to confuse you. You know what I'm talking about. It happens when you stare at a pricing page for 10 minutes and, with all the rules, caveats, discount codes, and whatnot, you still can't figure out what it actually costs. Pricing should be honest and straightforward -- if you find yourself confused, cross it off your list and move on.

Disqualify Technical Mismatches

At this point, you've probably culled the field of log management tools down to roughly one-third of its original size on the basis of your appetite for spending. It's time now to slice it further by disqualifying obvious technological mismatches.

This can include the obvious, such as a tool that only installs its agent on Windows servers when you run Linux. But look too for features that you absolutely need. Is it only worth your budget if the tool offers a nice dashboard? Well, then make sure the tool has that dashboard.

I would caution against getting too restrictive about features, though. It's one thing to look at platform compatibility and a few essentials. It's another altogether to have a giant laundry list of "critical features" -- you can wind up eliminating all your options.

Optimize for Ease of Use

Hopefully, by this point you've narrowed the field considerably. That's important, because this last piece of research is a little more involved. You wouldn't want to do it for dozens of different prospective tools.

Set about now filtering tools based on their ease of use. You can figure this out by doing some research on their sites (or anywhere that you can find guides/demos of the products). Look for the install guide. Is it quick and easy, or is it involved, demanding tons of prerequisites and coordination? Next, look to see if they demo an install anywhere, like with a video. If that looks straightforward, you're in good shape.

Of course, you can also evaluate this by actually trying it yourself. That's a little more time consuming, but it speaks to the point of this line of research. Namely, once you've narrowed the field enough, you really just need to try using the tools to see if they work for you. Reading about APIs, libraries, platforms, and tools is one thing. Getting your hands dirty is another, and only that is going to really tell you whether it's a fit.

So if you've sliced your list down considerably and left only the easy to install options, you'll be in a position to try a few out. And, better yet, you're in a position to pivot from one to another if you find in the early going one isn't a fit. You can get going without worrying that you've over-committed.

Decide by Revisiting Your Whitelist

Now it's time to dust off that initial whitelist of recommendations. It's at this point that you've filtered your options down to the most likely candidates and are evaluating them in a meaningful way (via trial). To go back to the paradox of choice, you have now narrowed the field enough that the options empower you rather than paralyzing you.

Social proof, at this point, becomes powerful. Use your recommendations list as a potential deciding factor. Do you have three viable options, but only one of them comes recommended by a bunch of people you know and respect? That's a strong case for the recommended option, not only because of others' experience, but also because you'll have a support network for questions. Of course, your own experience with trying it is also powerful, so weigh those two factors together and decide.

There are so many tools in this space because the functionality is important and powerful. And having so many log management tools really is a wonderful position for consumers. But it's only wonderful if you know how to narrow the field to make your decision manageable.

How to Merge Log Files

Eric Goebelbecker — Tue, 24 Mar 2020 14:20:15 +0000

You have log files from two or more applications, and you need to see them together. Viewing the data together in proper sequence will make it easier to correlate events, and listing them side-by-side in windows or tabs isn’t cutting it.

You need to merge log files by timestamps.

But just merging them by timestamp isn’t the only thing you need. Many log files have entries with more than one line, and not all of those lines have timestamps on them.

Merge Log Files by Timestamp

Let’s take a look at the simple case. We have two files from Linux's syslog daemon. One is the messages file and the other is the crontab log.

Here are four lines from the messages file:

Sep 4 00:00:08 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 120000ms.
Sep 4 00:02:08 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 124910ms.
Sep 4 00:04:13 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 109850ms.
Sep 4 00:06:03 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 112380ms.

And here are five lines from cron:

Sep 4 00:01:01 ip-10-97-55-50 CROND[18843]: (root) CMD (run-parts /etc/cron.hourly)
Sep 4 00:01:01 ip-10-97-55-50 run-parts(/etc/cron.hourly)[18843]: starting 0anacron
Sep 4 00:01:01 ip-10-97-55-50 anacron[18853]: Anacron started on 2018-09-04
Sep 4 00:01:01 ip-10-97-55-50 anacron[18853]: Jobs will be executed sequentially
Sep 4 00:01:01 ip-10-97-55-50 anacron[18853]: Normal exit (0 jobs run)

When we’re only dealing with nine lines of logs, it’s easy to see where the merge belongs. The five lines in the cron log belong between the first and second lines of the messages log.

But with a bigger dataset, we need a tool that can merge these two files on the date and the time. The good news is that Linux has a tool for this already.

Merge Log Files With Sort

The sort command can, as its name implies, sort input. We can stream both log files into sort and give it a hint on how to sort the two logs.

Let’s give it a try.

cat messages.log cron.log | sort --key=2,3 > merge.log

This creates a new file named merge.log. Here’s what it looks like:

Sep 4 00:00:08 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 120000ms.
Sep 4 00:01:01 ip-10-97-55-50 CROND[18843]: (root) CMD (run-parts /etc/cron.hourly)
Sep 4 00:01:01 ip-10-97-55-50 anacron[18853]: Anacron started on 2018-09-04
Sep 4 00:01:01 ip-10-97-55-50 anacron[18853]: Jobs will be executed sequentially
Sep 4 00:01:01 ip-10-97-55-50 anacron[18853]: Normal exit (0 jobs run)
Sep 4 00:01:01 ip-10-97-55-50 run-parts(/etc/cron.hourly)[18843]: starting 0anacron
Sep 4 00:02:08 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 124910ms.
Sep 4 00:04:13 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 109850ms.
Sep 4 00:06:03 ip-10-97-55-50 dhclient[2588]: XMT: Solicit on eth0, interval 112380ms.

It worked!

Let’s dissect that command.

cat messages.log cron.log |

Cat concatenates files. We used it to send both logs to standard output. In this case, it sent messages.log first and then cron.log.

The pipe | is what it sounds like. It’s a pipe between two programs. It sends the contents of the two files to the next part of the command. As we’ll see below, sort can also take a filename on the command line. Here, we use a pipe to send both files to it on standard input.

sort --key=2,3 > merge.log

Sort receives the contents of two files and sorts them. Its output goes to the > redirect operator, which creates the new file.

The most important part of this command is --key=2,3. We used this to tell sort to sort its input using fields two and three of each line. Note that sort starts counting fields at one instead of zero.

So sort was able to merge the two files using the day of the month and the timestamp.

This is our easy case. These log files both had single line entries, and our dataset was for less than thirty days. So we don't have to worry about sorting by months.
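If your logs do cross a month boundary, a plain text sort on the day field will put "Oct 1" ahead of "Sep 30". Here's a minimal sketch of one way around that in Python, parsing the month name instead of comparing it as text. The function names are hypothetical, syslog timestamps carry no year so the year argument is an assumption you supply, and %b expects English month abbreviations under the default locale.

```python
from datetime import datetime


def syslog_key(line, year=2018):
    # The first three whitespace-separated fields are month, day, and time.
    month, day, clock = line.split(None, 3)[:3]
    stamp = f"{year} {month} {day} {clock}"
    return datetime.strptime(stamp, "%Y %b %d %H:%M:%S")


def merge_syslog(*groups, year=2018):
    # Flatten every group of lines, then sort on the parsed timestamp.
    # sorted() is stable, so lines with equal timestamps keep their input order.
    lines = [line for group in groups for line in group]
    return sorted(lines, key=lambda line: syslog_key(line, year))
```

This version holds everything in memory, so it suits small files better than large ones.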

Let’s look at something that’s a little more complicated.

Merge Log Files With Multiline Entries

Here are a couple of Java application logs that we would like to merge.

Here’s the first:

2018-09-06 15:20:40,980 [INFO] Heimdall main:26 [main] 

Fix Engine is starting.


2018-09-06 15:20:45,639 [ERROR] AcceptorFactory createSessionSettings:92 [main] 

Session settings: [default]
SocketAcceptPort=7000
ConnectionType=acceptor
ValidateUserDefinedFields=N
ValidateLengthAndChecksum=N
ValidateFieldsOutOfOrder=N


2018-09-06 15:20:50,645 [ERROR] AcceptorFactory getSessionSettings:123 [main]

Second Session settings: [default]
SocketAcceptPort=7000
ConnectionType=acceptor
ValidateUserDefinedFields=N
ValidateLengthAndChecksum=N
ValidateFieldsOutOfOrder=N


2018-09-06 15:21:45,653 [INFO] ThreadedSocketAcceptor startSessionTimer:291 [main] SessionTimer started
2018-09-06 15:21:47,711 [INFO] NetworkingOptions logOption:119 [main] Socket option: SocketTcpNoDelay=true
2018-09-06 15:21:59,919 [INFO] SendMessageToSolace addSession:51 [QF/J Session dispatcher: FIX.4.2:FOOU->TEST02] Adding session: FIX.4.2:FOOU->TEST02
2018-09-06 15:22:59,920 [INFO] MessageClient openTopic:422 [QF/J Session dispatcher: FIX.4.2:FOOU->TEST02]
Opening FOO/DEV/AMER/FixEngine/Admin/*/TEST02
2018-09-06 15:23:59,937 [ERROR] ConsumerNodeStatusHandler setStateUp:186 [QF/J Session dispatcher: FIX.4.2:FOOU->TEST02] Setting State up: TEST02
2018-09-06 15:24:03,962 [INFO] MessageClient openTopic:422 [stateHeartbeat]

Opening FOO/DEV/AMER/State/Admin/Events
2018-09-06 15:25:00,536 [INFO] incoming messageReceived:146 [NioProcessor-2] FIX.4.2:FOOU->TEST02: 8=FIX.4.29=6235=149=TEST0256=FOOU34=252=20180906-15:21:00.528112=TEST10=198

This log has a lot of whitespace and entries that span multiple lines.

Here’s the other:

2018-09-06 15:20:43:031 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-10-5] Adding session: TEST02 at 1536243961031
2018-09-06 15:20:46:031 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-13-4] Adding session: TEST02 at 1536243961031
2018-09-06 15:23:15:032 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-7-5] Adding session: TEST02 at 1536243961032
2018-09-06 15:24:35:257 [INFO] com.foobar.atr.rest.controller.StatusController getSessionStatus():67 [http-nio-8010-exec-4] Received request a fix session, senderCompId:RBSG2
2018-09-06 15:27:30:691 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-10-5] Adding session: PLOP02 at 1536244050691

This log is more uniform, with entries that only span a single line.

When we merge these two files, we want the multiline log message to remain together. So, sorting line by line with sort won’t work on its own. We need a tool that's capable of associating the lines without timestamps with the last line that has one.

Unfortunately, no command line tool does this. We’re going to have to write some code.

A Merging Algorithm

Here’s an algorithm for merging log files that have multiline entries.

First, we need to preprocess the log files.

Scan the log file line by line until we reach the end.
If a line has a timestamp, save it and print the last saved line to a new file.
If a line has no timestamp, append it to the saved line, after replacing the new line with a special character.
Continue with step #1.

We could do this in memory, but what happens when we’re dealing with huge log files? We’ll save the preprocessed log entries to disk so that this tool will work on huge log files.

After we perform this on both files, we have a new one that is full of single line entries. We’ll use the sort command to sort it for us, rather than reinventing the wheel. Then, we’ll replace the special characters with new lines, and we have a merged log file.

And we’re done!

Let's do it.

Merge Log Files With Python

We’ll use Python. It’s available on most systems, and it’s easy to write a cross-platform tool that manipulates text files. I wrote the code for this article with version 2.7.14. You can find the entire script here on GitHub.

First, we need to process our input files.

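import argparse
import os
import re
from subprocess import check_output

parser = argparse.ArgumentParser(description="Process input and output file names")
parser.add_argument("-f", "--files", help="list of input files", required=True, nargs='+')
parser.add_argument("-o", "--output", help="output file", required=True, type=argparse.FileType('w'))
args = parser.parse_args()

# Matches lines whose first character is not a digit or a dash,
# i.e. continuation lines that do not start with a timestamp.
line_regex = re.compile(r"^[^0-90-90-90-9\-0-90-9\-0-90-9]")

with open("tmp.log", "w") as out_file:
    for filename in args.files:
        lastline = ""
        with open(filename, "r") as in_file:
            for line in in_file:
                if line_regex.search(line):
                    # Continuation line: join it to the saved entry with SOH.
                    lastline = lastline.rstrip('\n')
                    lastline += '\1'
                    lastline += line
                else:
                    # New entry: flush the previous one and start saving this one.
                    out_file.write(lastline)
                    lastline = line
        # Flush the final entry of this file so it isn't dropped.
        out_file.write(lastline)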

We'll start by processing command line arguments. This script accepts two:

-f is a space-separated list of input files
-o is the name of the file to write the output to

Argparse gives us a list from the arguments passed to -f and opens the output file for us, as we’ll see below.

Python Regular Expressions

Then we'll create a regular expression. Let’s take a close look at it since this is what you’ll need to change if your logs are formatted differently.

Here’s the whole expression:

^[^0-90-90-90-9\-0-90-9\-0-90-9]

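The expression starts with a caret ^. This anchors the match to the beginning of a line.

But then we have this: [^ ] with some characters in the middle. Square brackets define a character class, and a caret just inside the opening bracket negates it, so the class matches any single character that is not listed.

So the expression means "the first character of the line is not one of these."

The characters we're excluding are inside the brackets.

0-90-90-90-9\-0-90-9\-0-90-9

Each 0-9 is a range covering the numerals. Each \- is a dash. The layout mirrors the NNNN-NN-NN date we see at the beginning of each log entry, but because everything sits inside a single character class, the repeats are redundant: the class is equivalent to [^0-9\-], and only the first character of the line is tested.

So in English, the expression means “if the line does not begin with a digit or a dash,” which for these logs is the same as “if the line does not begin with a date.”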

If you need to process logs with a different format, you'll need to change this. There's a guide to Python regular expressions here.
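As one possible variation, here's a sketch of a stricter pattern that checks for a complete date rather than just the first character. This is an alternative, not the expression the script uses: it relies on a negative lookahead, and the names continuation and is_continuation are made up for the example.

```python
import re

# Negative lookahead: matches at the start of a line only when the line
# does NOT begin with a full NNNN-NN-NN date followed by a space.
continuation = re.compile(r"^(?!\d{4}-\d{2}-\d{2} )")


def is_continuation(line):
    """Return True when the line belongs to the previous log entry."""
    return continuation.search(line) is not None
```

With this version, a continuation line that happens to start with a digit, such as a raw FIX message, still gets attached to the entry above it.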

Sorting the Results

Now, we'll start the real work.

Open a temporary file.
Open the first log file.
Join lines with no timestamp to their predecessors, as described above.
Repeat this for each file passed on the command line.

For the third step, we'll chop the newline '\n' from the end of the last line we saved. Then we'll add an SOH ('\1') character and concatenate the lines. (I could've done this in one line, but I spelled it out to make it clear.)

We're replacing newlines '\n' with the SOH character instead of NULLs ('\0') because nulls would confuse Python's string processing libraries and we'd lose data.

Finally, the result of this code is a file named tmp.log that contains the log files preprocessed to be one line per entry.

Let’s finish the job.

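# Sort the one-line-per-entry file on its first two fields (date and time).
# universal_newlines=True returns text rather than bytes on Python 3.
sorted_logs = check_output(["/usr/bin/sort", "--key=1,2", "tmp.log"], universal_newlines=True)

os.remove("tmp.log")

# Restore the newlines that were replaced with SOH during preprocessing.
lines = sorted_logs.split('\n')
for line in lines:
    newline = line.replace('\1', '\n')
    args.output.write(newline + "\n")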

The check_output function executes an external command and captures the output.

So we'll use it to run sort on our temporary file and return the results to us as a string. Then, we'll remove the temporary file.

We wouldn’t want to capture the result in memory with a large file, but to keep this post short, I cheated. An alternative is to send the output of sort to a file with the -o option and then open that file and remove the special characters.

Next, we'll split the output on the new lines into an array. Then we'll process that array and undo the special characters. We'll write each line to the file opened for us by argparse.
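For larger files, the file-based alternative mentioned above could look something like this. It's a sketch only: sort_and_restore and the sorted.tmp scratch file are hypothetical names, it assumes a sort binary on the PATH, and setting LC_ALL=C is my assumption to keep the comparison locale-independent.

```python
import os
import subprocess


def sort_and_restore(tmp_path, out_path, sorted_path="sorted.tmp"):
    # Assumption: force byte-order comparison so results don't vary by locale.
    env = dict(os.environ, LC_ALL="C")
    # Let sort write its result to a file instead of capturing it in memory.
    subprocess.check_call(
        ["sort", "--key=1,2", "-o", sorted_path, tmp_path], env=env)
    # Stream the sorted file one entry at a time, restoring the newlines
    # that were replaced with SOH during preprocessing.
    with open(sorted_path, "r") as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.replace("\1", "\n"))
    os.remove(sorted_path)
```

Only one log entry is held in memory at a time, so the size of the merged file stops being a concern.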

We’re done!

Let's run this script on two files:

./mergelogs.py -f foo.log bar.log -o output.log

And the merged file begins like this.

2018-09-06 15:20:40,980 [INFO] Heimdall main:26 [main] 

Fix Engine is starting.


2018-09-06 15:20:43:031 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-10-5] Adding session: TEST02 at 1536243961031
2018-09-06 15:20:45,639 [ERROR] AcceptorFactory createSessionSettings:92 [main] 

Session settings: [default]
SocketAcceptPort=7000
ConnectionType=acceptor
ValidateUserDefinedFields=N
ValidateLengthAndChecksum=N
ValidateFieldsOutOfOrder=N


2018-09-06 15:20:46:031 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-13-4] Adding session: TEST02 at 1536243961031
2018-09-06 15:20:50,645 [ERROR] AcceptorFactory getSessionSettings:123 [main]

Second Session settings: [default]
SocketAcceptPort=7000
ConnectionType=acceptor
ValidateUserDefinedFields=N
ValidateLengthAndChecksum=N
ValidateFieldsOutOfOrder=N


2018-09-06 15:21:45,653 [INFO] ThreadedSocketAcceptor startSessionTimer:291 [main] SessionTimer started
2018-09-06 15:21:47,711 [INFO] NetworkingOptions logOption:119 [main] Socket option: SocketTcpNoDelay=true
2018-09-06 15:21:59,919 [INFO] SendMessageToSolace addSession:51 [QF/J Session dispatcher: FIX.4.2:FOOU->TEST02] Adding session: FIX.4.2:FOOU->TEST02
2018-09-06 15:22:59,920 [INFO] MessageClient openTopic:422 [QF/J Session dispatcher: FIX.4.2:FOOU->TEST02]

Opening FOO/DEV/AMER/FixEngine/Admin/*/TEST02
2018-09-06 15:23:15:032 [INFO] com.foobar.atr.rest.controller.SessionStatusCache addSessionStatus():28 [lettuce-nioEventLoop-7-5] Adding session: TEST02 at 1536243961032
2018-09-06 15:23:59,937 [ERROR] ConsumerNodeStatusHandler setStateUp:186 [QF/J Session dispatcher: FIX.4.2:FOOU->TEST02] Setting State up: TEST02
2018-09-06 15:24:03,962 [INFO] MessageClient openTopic:422 [stateHeartbeat]

Opening FOO/DEV/AMER/State/Admin/Events
2018-09-06 15:24:35:257 [INFO] com.foobar.atr.rest.controller.StatusController getSessionStatus():67 [http-nio-8010-exec-4] Received request a fix session, senderCompId:RBSG2

Log Files, Merged

In this tutorial, we covered how to merge log files, looking at a straightforward case and then a more complicated situation. The code for this is available on GitHub, and you're free to download and modify it for your individual needs.

Five Reasons You Need Log Monitoring

Erik Dietrich — Tue, 17 Mar 2020 15:58:20 +0000

You probably regard application logging the way you think of buying auto insurance. You sigh, do it, and hope you never need it. And aren't you kind of required to do it anyway, or something? Not exactly the scintillating stuff that makes you jump out of bed in the morning.

It feels this way because of how we've historically used log files. You dutifully instrument database calls and controller route handlers with information about what's going on. Maybe you do this by hand, or maybe you use a mature existing tool. Or maybe you even use something fancy, like aspect-oriented programming (AOP). Whatever your decision, you probably make it early and then further information becomes rote and obligatory. You forget about it.

At least, you forget about it until, weeks, months, or years later, something happens. Something in production blows up. Hopefully, it's something innocuous and easily fixed, like your log file getting too big. But more likely some critical and maddeningly intractable production issue has cropped up. And there you sit, scrolling through screens filled with "called WriteEntry() at 2017-04-31 13:54:12," hoping to pluck the needle of your issue from that haystack.

This represents the iconic use of the log file, dating back decades. And yet it's an utterly missed opportunity. Your log file can be so much more than just an afterthought and a hail mary for addressing production defects. You just need the right tooling.

Log Monitoring To the Rescue

I've talked in the past about one form of upgrade from this logging paradigm: log aggregation. A log aggregation tool brings your log files into one central place, parses them, and allows you to search them rapidly. But you can do even more than that, making use of log monitoring via dashboards.

Log monitoring is what it sounds like. Or at least, it's what it sounds like if you envision automated tooling and not some poor sap sitting and staring at the tail of a log file for eight hours straight. Log monitoring provides you with intelligent means to process your log files as they come in and glean insight from them.

This may sound strange if you're familiar only with the "production behavior archaeology" use case of the log file. But you really do have many valid reasons to want to process your logs intelligently in real time. I'll take you through some of those today.

Know About Issues Before Users Call Support

I'll start with the use case that requires the smallest mental leap from the default archaeology one. The default case is inherently reactive. You go scrambling for the log file in response to users reporting an error. But log monitoring lets you turn that around -- turn it into something proactive.

To understand what I mean, let's consider the simplest sort of scenario. Let's say that you don't even have an application log and you're just monitoring your application's web server logs. On a normal day, this involves a lot of 200 responses, with a smattering of 300s, and the occasional 400 when a user makes a mistake. A normal day does not involve 500 responses, indicating server errors.

From years of blogging, e-commerce, and entrepreneurial activities, I've learned that users can be funny and strange. You might assume that their response to a 500 code would be to file a bug report. But in reality, they probably just shrug and leave your site. You might be throwing 500 codes for hours or even days before anyone bothers to tell you. But if you have a log monitoring strategy in place, you'll see these errors right away. You might even correct the problem before any of your users have a chance to get upset.
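To make that concrete, here's a minimal sketch of the kind of check a monitoring tool runs for you. It assumes access log lines in Common Log Format, where the status code follows the quoted request. The function names and the threshold are illustrative only.

```python
import re
from collections import Counter

# Assumes Common Log Format: ... "GET /path HTTP/1.1" 200 1234
STATUS = re.compile(r'" (\d{3}) ')


def status_classes(lines):
    """Count responses by class: 2xx, 3xx, 4xx, 5xx."""
    counts = Counter()
    for line in lines:
        match = STATUS.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


def server_errors_detected(lines, threshold=1):
    # On a normal day we expect no 5xx responses at all.
    return status_classes(lines)["5xx"] >= threshold
```

A real setup would run a check like this continuously over incoming lines and send an alert rather than return a boolean.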

Detect Suspicious Activity Before It Blows Up On You

Avoiding upsetting users with bugs and outages is critical to your application's success. But it's not the only thing that matters. You have other perils to avoid as well.

If you put it on the internet, it's only a matter of time before someone tries to hack it, compromise it, or otherwise take advantage of it. This ranges from the annoying, like link spam in comments, to the malicious, like calculated attempts to compromise financial data. Sooner or later, someone's going to try it.

Log monitoring can help here as well. Take the same simple example of logging the web server's activity. But now imagine your monitoring logic is keeping an eye out for something slightly different as well: a huge spike in 401 and 403 codes, indicating unauthorized and forbidden requests. This represents a crude, brute force example, but hopefully, you get the idea. Log monitoring shows you that someone is trying to break in before they actually succeed. This gives you the ability to engage in prevention instead of damage control.
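Here's a rough sketch of that idea, again assuming Common Log Format with the client address as the first field. The suspicious_clients name and the limit of 20 are arbitrary choices for illustration, not recommended values.

```python
import re
from collections import Counter

# Assumes Common Log Format: client address first, status after the request.
LINE = re.compile(r'^(\S+) .*?" (\d{3}) ')


def suspicious_clients(lines, limit=20):
    """Return clients with at least `limit` 401/403 responses."""
    denied = Counter()
    for line in lines:
        match = LINE.search(line)
        if match and match.group(2) in ("401", "403"):
            denied[match.group(1)] += 1
    return {client: n for client, n in denied.items() if n >= limit}
```

Counting per client keeps a single user mistyping a password from looking like an attack.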

Regulatory Compliance, Needed Now or Later

Let's switch gears a bit now. Instead of considering the early intelligence log monitoring gives you, consider what the act itself gives you. The act of log monitoring itself is vital for regulated industries.

If you work in such an industry, you probably know it, at least implicitly. Without going into too much detail, suffice it to say that regulated industries are ones that must comply with governmental regulations. Manufacturers have to comply with safety regulations, while financial institutions have plenty of regulations on both handling your money and acting ethically.

Where does log monitoring fit? Well, consider HIPAA, for example, which ensures medical data privacy. HIPAA regulations not only require extensive logging but also monitoring of log files to look for certain discrepancies. If you're in a regulated industry or plan to break into one, adopting log monitoring now may save you headaches later.

Baseline System Performance For Future Early Detection

You don't need to be subject to any regulations to benefit from this next point. Log monitoring can help you track your application's production performance in unique ways.

I can't offer a simple case of HTTP response codes for this one. Realizing this benefit will depend on what you put into your log files and how you put it there. But let's imagine a different, relatively simple scenario. Say that you have a site that logs both inbound requests and outbound responses, easily allowing you to tie those together. In other words, given a user or session ID, you can reconstruct the request's cycle time from the log.

Using log monitoring, you can establish a known baseline of those cycle times. In other words, you can look at your logs for a period of time where you know the system to be behaving well. Then, you can take the average time and record it. Finally, you can implement a log monitoring scheme where you receive notification if the average response time, say, doubles. Slowdown can sometimes cost an e-commerce site just as dearly as outages, and log monitoring can help you detect it, even when there's not an error to be found.
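The baseline comparison can be sketched in a few lines. This assumes the log has already been parsed into (request ID, kind, timestamp) tuples, with timestamps in milliseconds. That tuple shape, the function names, and the doubling factor are all assumptions for the example.

```python
from statistics import mean


def cycle_times(events):
    """Pair each request with its response and return the elapsed times.

    events: iterable of (request_id, kind, timestamp_ms) tuples,
    where kind is "request" or "response".
    """
    started, durations = {}, []
    for request_id, kind, timestamp_ms in events:
        if kind == "request":
            started[request_id] = timestamp_ms
        elif kind == "response" and request_id in started:
            durations.append(timestamp_ms - started.pop(request_id))
    return durations


def is_slow(recent, baseline_mean, factor=2.0):
    # Flag a slowdown when the recent average reaches factor x the baseline.
    return bool(recent) and mean(recent) >= factor * baseline_mean
```

You'd compute the baseline once from a period of known-good behavior, then compare each new window of cycle times against it.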

Look For Patterns To Help Your Business

Finally, let's speak to something a little higher up on Maslow's Hierarchy of Needs for a business. So far, we've looked at getting out in front of errors, hacks, and regulatory agencies before they hurt you. But you can use log monitoring to gather intelligence and improve your business as well.

Your log file contains data, and you can use that data for strategic and tactical decisions. Let's say you set up an alert that informs you about spikes in traffic to a page on your site. Maybe you see that alert for a page and realize you could really help yourself by adding a promo or discount coupon there for another offering. Or, maybe you find that you're getting weekly alerts about a drop in traffic on Sunday mornings. You then use this information to work on a strategy to boost Sunday morning traffic.

The idea of using your logs as a form of business intelligence should tell you just how far we've come from the days of log files as an inefficient tool for hunting down production defects. Gather your logs in one place, turn them into data, and then monitor them. This will give you a leg up on your competition, and it will make life easier for everyone involved in application development and support.

What Is Whitebox Monitoring? Everything You Need to Know

Mark Henke — Tue, 10 Mar 2020 16:11:12 +0000

When I first heard about whitebox monitoring, I thought I knew what it was just by name. But like many software terms, it's easy to mistake one thing for another. It's also easy to jump to conclusions too quickly and think you know everything about a term. Whitebox monitoring is a very valuable tool in our DevOps toolbox. Because of this, it's prudent for us to ensure we fully understand what it is and how to use it.

We're going to cover quite a bit about whitebox monitoring: why it's valuable, how it differs from blackbox monitoring, and how such monitoring is implemented in a system.

Blackbox vs. Whitebox Monitoring

The term whitebox usually shows up alongside its counterpart, blackbox, in many contexts. One famous example is blackbox vs. whitebox testing. I'll quickly cover the difference between blackbox and whitebox monitoring.

Blackbox Monitoring

The name blackbox evokes a feeling of mystery. Black is opaque; we cannot see through it. In the same way, we can't see into the system we're monitoring. Take a series of houses on a street, for example. From the outside, we can know a few things. We can see if anyone is home by whether the lights are on or how many cars are in the driveway. If we read the gas meter, we know how much gas they use per month. It's the same for the water meter and water usage.

But we don't know what happens inside. We have no idea what sort of interesting celebrations, arguments, or hobbies could be happening in these houses. It's the same for monitoring. I can monitor the traffic and responses of the software. I can measure the CPU and memory utilization. But in blackbox monitoring, I have no idea what's happening inside these requests. I don't know what database calls it's making or what sort of fancy rules it applies to the data.

Whitebox Monitoring

Whitebox monitoring is, of course, the opposite of blackbox. The idea of a whitebox evokes a feeling of transparency, though maybe clearbox would be a better name. Imagine a household with a swear jar, a calendar of events, or a weekly menu. I can record and watch these things change, as long as I have a window to see inside. Whitebox monitoring is the same. It's the act of putting windows into our code and showing the outside world, usually our developers, what's happening inside.

What Do I Get out of Whitebox Monitoring?

If you know exactly what's happening inside your application, you can easily answer questions that may pop up from time to time. Back to our household, what if we wanted to know how often they eat fish? Well, we just need to peek through the window and record their menu. We keep doing this every week, and we can get an average for how much fish they eat per month.

With software, having such things gives us some key benefits in two categories: operations and product insights.

Operations

Operational whitebox monitoring helps us keep our system alive and healthy. It also helps us keep the risk low when trying new architectures and techniques. It does this by giving us quick feedback once we've released something. For example, if we see traffic is slowing down for one of our endpoints, we need a way to drill down and see what internal method is causing the bottleneck. Monitoring database queries is a fantastic example of a whitebox monitor. It gives us insight into a key part of our system that is often the cause of slowdowns. Operational whitebox monitoring helps us build a thing right.

Product Insights

While operational monitoring helps us build a thing right, product insight monitors help us build the right thing. They help us answer the question, are we making customers happy? One popular version of this monitoring is shopping cart abandonment. It's common in e-commerce to see how often customers put things inside a shopping cart, but never place an order. By looking at what they put in, or where they clicked next, we can improve our systems to encourage customers to place more orders.

How Do I Wire up Whitebox Monitoring?

Armed with an understanding of what whitebox monitoring is, the natural next question is how to implement it. The answer depends heavily on the languages and frameworks you're using, but I'll describe some important characteristics here.

Monitoring Categories

There are two main categories of whitebox monitoring that affect how we implement them: event-centric monitors and time-series data.

Event-Centric Monitors

Event-centric monitoring is where we report out interesting events from our system. For our household, this could be tallying on a chalkboard every time a family has an argument. We may record what room the argument was in, how many decibels the speaking reached, and how long it lasted. We don't know up front how the data will be used—we just know it's interesting. With software, this could be things like what web requests occurred, whether they were successful, how many database calls they made, and how long they took.

Event-centric data is great when we need to answer questions that we don't know we have ahead of time. Oftentimes production incidents catch us by surprise, and we need answers right now! Event-centric data can give us these answers.
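
As a rough illustration, an event for a single web request might look like the sketch below. The field names and the build_event helper are my own assumptions for the example, not a standard schema. The point is that each event is a flat record you can search later.

```python
import json
import time

def build_event(path, status, db_calls, duration_ms):
    """Describe one web request as a flat, searchable record."""
    return {
        "type": "web_request",
        "timestamp": time.time(),
        "path": path,
        "status": status,
        "success": status < 400,
        "db_calls": db_calls,
        "duration_ms": duration_ms,
    }

# Hypothetical request: a failed checkout that made three database calls.
event = build_event("/checkout", 500, db_calls=3, duration_ms=812.5)
print(json.dumps(event))
```

Because we don't know the questions up front, the sketch records more than any one dashboard needs and leaves the filtering to whoever searches the events later.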

Time-Series Data

Time-series data is information that is aggregated over a time period. For our households, it might be the number of dishes used per week or pounds of food consumed per day. In software, this can be things like the number of requests per minute or average latency per hour. The key to time-series data is that it's aggregated inside the application before being sent through our window. This makes it ready to be shown in graphs so that we can see trends over time.
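
Here's a small sketch of aggregating inside the application, using requests per minute as the example. The RequestsPerMinute class is hypothetical. A real metrics library would also handle threads and clock details that this leaves out.

```python
from collections import defaultdict

class RequestsPerMinute:
    """Counts requests in one-minute buckets inside the application."""

    def __init__(self):
        self._buckets = defaultdict(int)

    def record(self, timestamp):
        # Truncate the Unix timestamp to the start of its minute.
        self._buckets[int(timestamp // 60) * 60] += 1

    def flush(self):
        """Return (minute_start, count) points sorted by time, then reset."""
        points = sorted(self._buckets.items())
        self._buckets.clear()
        return points

counter = RequestsPerMinute()
for ts in (0, 10, 59, 60, 125):
    counter.record(ts)
points = counter.flush()
print(points)  # [(0, 3), (60, 1), (120, 1)]
```

Only the flushed points leave the application, which is what makes them cheap to ship and easy to graph.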

Reporting

The cost of whitebox monitoring is that it must be built into the app as a first-class concept. I need the application itself to tell me what's going on inside it. In a household, this would be equivalent to the family telling me what they ate that night or giving me a key to their house. This usually consists of a reporter component that blasts out the data to somewhere else. It also involves some sort of instrumentation that plugs into our workflow life cycle. Many frameworks give us extensions to plug in such instrumentation. Sometimes we have to do it ourselves via things like aspect-oriented programming.
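
In Python, a decorator can play a role similar to an aspect. The sketch below wraps a function so that every call is timed and handed to a reporter. The names instrumented and load_order are made up for the example, and the list stands in for a real reporter component.

```python
import functools
import time

def instrumented(report):
    """Wrap a function so each call is timed and handed to `report`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            ok = False
            try:
                result = func(*args, **kwargs)
                ok = True
                return result
            finally:
                # Runs whether the call returned or raised.
                report({
                    "name": func.__name__,
                    "success": ok,
                    "duration_ms": (time.perf_counter() - start) * 1000,
                })
        return wrapper
    return decorator

reports = []  # stand-in for a real reporter

@instrumented(reports.append)
def load_order(order_id):
    return {"id": order_id}

load_order(42)
print(reports[0]["name"], reports[0]["success"])
```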

Key Traits

However we report, a few key traits matter. Most reporters, especially event-centric ones, should be asynchronous, because we don't want the time it takes to report the data to interfere with our responsiveness to customers. We should also ensure that a reporter's failures are isolated. We don't want a failed log event to block someone trying to make a payment!
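
Both traits can be sketched with a queue and a background thread. This is one possible shape, assuming a send callable that delivers an event to your monitoring backend. AsyncReporter and flaky_send are hypothetical names, not part of any particular product.

```python
import queue
import threading

class AsyncReporter:
    """Hands events to a background thread so callers never wait on delivery."""

    def __init__(self, send, max_pending=1000):
        self._send = send
        self._queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0
        self.failed = 0
        threading.Thread(target=self._run, daemon=True).start()

    def report(self, event):
        try:
            self._queue.put_nowait(event)  # never block the request path
        except queue.Full:
            self.dropped += 1  # shed monitoring data rather than slow customers

    def _run(self):
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                self.failed += 1  # isolate the failure; keep the worker alive
            finally:
                self._queue.task_done()

    def drain(self):
        """Block until queued events are processed (for tests or shutdown)."""
        self._queue.join()

delivered = []

def flaky_send(event):
    # Hypothetical backend that rejects some events.
    if event.get("bad"):
        raise RuntimeError("monitoring backend unavailable")
    delivered.append(event)

reporter = AsyncReporter(flaky_send)
reporter.report({"bad": True})
reporter.report({"type": "payment", "ok": True})
reporter.drain()
print(len(delivered), reporter.failed)
```

The failed delivery is counted and swallowed inside the worker, so the payment event behind it still goes out and the caller never sees the error.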

Storage

Once we have a reporter, we need something to report to. Sometimes we report to multiple places. We should have a persistent store somewhere that we can send this data to. As a note, this place should exist outside our production system. It'd be quite a pain to troubleshoot when production is down if your monitoring goes down with it!

For the record, Scalyr offers both storage and reporters as part of its product. Other offerings may focus solely on one or the other. Spring Cloud Sleuth, for example, only handles reporting and instrumentation. It leaves the choice of where to store the data up to you.

Viewing Monitors

Now that we have the data stored, we likely want to see it and see it frequently. We have a few options for that.

Dashboard

The almighty dashboard is the most popular way for visualizing monitors. There's nothing quite like the feeling of confidence a trending graph gives. Dashboards excel at giving you a pulse of the system at a glance. Healthy ones let you see patterns and detect abnormalities. They should be uncluttered, focusing only on a few things. The flip side to this is that they should easily let you drill down at runtime into more specific data. This will make it easier to diagnose problems quickly.

Search

Search gives you the ability to answer questions you didn't know in advance that you'd need to ask. Searchable storage works very well with event-centric monitoring. It gives you the full flexibility to ask very specific questions about your system. No number of dashboards can tell you everything about a bug, but searching often can.

Alerts

With alerts, the system tells you when something is going on, as opposed to the other way around. These are great for letting you relax instead of spending time every day looking through graphs or logs. They let the system do the hard work of detecting abnormalities. On the other hand, they can easily get out of hand. I have seen badly managed alerts ping people every five minutes for four hours straight while the system is down.
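
One common way to tame a noisy alert is a cooldown: once an alert has fired, repeats of the same alert are suppressed for a while. The sketch below is an illustration of that idea, not a feature of any specific tool. The ThrottledAlert class and the one-hour window are assumptions.

```python
class ThrottledAlert:
    """Suppress repeats of the same alert inside a cooldown window."""

    def __init__(self, notify, cooldown_seconds=3600):
        self._notify = notify
        self._cooldown = cooldown_seconds
        self._last_sent = {}

    def fire(self, key, message, now):
        last = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown:
            return False  # still inside the cooldown; stay quiet
        self._last_sent[key] = now
        self._notify(message)
        return True

sent = []
alert = ThrottledAlert(sent.append, cooldown_seconds=3600)

# The check fails every five minutes for four hours while the system is down.
for t in range(0, 4 * 3600, 300):
    alert.fire("site-down", "site is down", now=t)
print(len(sent))  # 4
```

Without the cooldown, the same four-hour outage would have produced 48 notifications.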

What to Monitor

Once we get whitebox monitoring up and running, we want to leverage it effectively. I'll be using a couple of ideas from Google's Site Reliability Engineering book to touch on this.

Service Level Indicators

Service level indicators are measurements of what's happening in your system today. They're useful for getting a pulse on how your system works. You may find some surprising results. What you choose to measure is usually based on what service level objectives you have.

Service Level Objectives

This is one of the keys to all monitoring. Monitoring is by no means a purely technical endeavor. A team should have objectives on how to make its customers happy. If your system deals with payments, it could be "we want payments 20% more likely to succeed without a snag." In this case, let's measure the rate of successful vs. failed payment transactions. Service level objectives are something on which you should collaborate closely with all of your stakeholders. You are all responsible for these, and they'll help guide what monitors you plug in.
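
For the payments example, the measurement boils down to a success rate compared against a target. The sketch below reads "20% more likely to succeed" as a relative improvement over a baseline rate. That reading, the function names, and the sample numbers are all assumptions for illustration.

```python
def success_rate(outcomes):
    """Fraction of payment attempts that succeeded (True means success)."""
    if not outcomes:
        raise ValueError("no payment attempts recorded")
    return sum(outcomes) / len(outcomes)

def meets_objective(baseline_rate, current_rate, improvement=0.20):
    """True when the current rate beats the baseline by `improvement`, relatively."""
    return current_rate >= baseline_rate * (1 + improvement)

# Hypothetical data: 70 of 100 payments succeeded before, 90 of 100 now.
baseline = success_rate([True] * 70 + [False] * 30)
current = success_rate([True] * 90 + [False] * 10)
print(meets_objective(baseline, current))
```

Whether the objective should be relative or absolute is exactly the kind of detail to settle with your stakeholders before you wire up the monitor.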

Conclusion

In a household, there are many interesting things that happen inside. We have to explicitly create systems that let us measure those things, like swear jars. In the same way, software systems are full of valuable details that let us peer into what's happening. Building whitebox monitoring into our app will let us turn these details into valuable insights. There are many tools, like Scalyr, that help us implement this quickly, leaving us time to figure out what service level objectives we want to measure.

Introduction to Continuous Integration Tools

Erik Dietrich — Tue, 03 Mar 2020 13:52:12 +0000

In a sense, you could call continuous integration the lifeblood of modern software development. So it stands to reason that you'd want to avail yourself of continuous integration tools. But before we get to the continuous integration tools themselves, let me explain why I just made the claim that I did. How do I justify calling continuous integration the lifeblood of software development? Well, the practice of continuous integration has given rise to modern standards for how we collaborate around and deploy software.

What Is Continuous Integration?

If you don't know what continuous integration is, don't worry. You're not alone. For plenty of people, it's a vague industry buzzword. Even some people who think they know what it means have the definition a little muddled. The reason? Well, the way the industry defines it is a little muddled.

The industry mixes up the definitions of continuous integration itself, and the continuous integration tools that enable it. To understand this, imagine if you asked someone to define software testing, and they responded with, "oh, that's Selenium." "No," you'd say, "that's a tool that helps you with software testing -- it isn't software testing itself." So it goes with continuous integration.

Continuous integration is conceptually simple. It's a practice. Specifically, it's the practice of a development team syncing its code quite regularly. Several times per day, at least.

Merge Parties: The Continuous Integration Origin Story

To understand why this matters, let me explain something about the bad old days. Twenty years ago, teams used rudimentary source control tools, if any at all. Concepts like branching and merging didn't really exist in any meaningful sense.

So, here's how software development would go. First, everyone would start with the same basic codebase. Then, management would assign features to developers. From there, each developer would go back to his or her desk, code for months, and then declare their features done. And, finally, at the end, you'd have a merge party.

What's a merge party? It's what happens when a bunch of software developers all slam months' worth of changes into the codebase at the same time. You're probably wondering why anyone would call this a party when it sounds awful. Well, it got this name because it was so awful and time-consuming that teams made it into an event. They'd stay into the evenings, ordering pizza and soda, and work on this for days at a time. The "party" was partially ironic and partially a weird form of team building.

But whatever you called it, and party favors or not, it was really, really inefficient. So some forward-thinking teams started to address the problem.

"What if," they wondered, "instead of doing this all at once in the end, with a big bang, we did it much sooner?" This line of thinking led to an important conclusion. If you integrated changes all of the time, it added a little friction each day, but it saved monumental pain at the end. And it made the whole process more efficient.

The Simplest of Continuous Integration Tools

So far, this really just sounds like a policy. "Let's all make sure to deliver our changes multiple times per day, and also pull down everyone else's changes." How does this translate into tooling?

To understand that, let's do a thought exercise. We'll build the simplest imaginable continuous integration setup.

As you write your code, you commit your changes regularly. To pull other people's changes, you write a little shell script that updates/pulls changes from the team's source control. You just kick it off in the background while you work, ensuring that you have the latest code every hour or so.

That's it. You now have the world's simplest continuous integration tool.
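
The thought exercise above talks about a shell script. Here's the same idea sketched in Python, assuming Git as the team's source control tool and an hourly interval. The function names are made up, and the pull and sleep parameters exist only so you can try the loop without touching a real repository.

```python
import subprocess
import time

def pull_latest(repo_dir, run=subprocess.run):
    """Run `git pull --ff-only` in repo_dir; return True on success."""
    result = run(["git", "pull", "--ff-only"], cwd=repo_dir)
    return result.returncode == 0

def sync_loop(repo_dir, interval_seconds=3600, rounds=None,
              pull=pull_latest, sleep=time.sleep):
    """Pull every interval; rounds=None means keep going until interrupted."""
    done = 0
    while rounds is None or done < rounds:
        if not pull(repo_dir):
            print("pull failed; sort it out by hand before it piles up")
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_seconds)
    return done
```

Calling sync_loop(".") from inside your working copy would keep pulling once an hour until you stop it. The --ff-only flag makes the pull fail instead of creating a merge commit when your local history has diverged, which is a choice I made for the sketch rather than something the practice requires.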

The Evolution of Continuous Integration Tools

Of course, full-featured continuous integration tools do a lot more than this. I just laid that example out to help you understand the core principles of the practice of continuous integration and then automating it.

Actual, modern continuous integration tools add something important on top of the simple concept of frequent integration. They layer on the concept of the build. Continuous integration as a practice just keeps everyone's code in sync. Continuous integration tooling keeps the code in sync, but also builds that in-sync code, packing it into something that you can deploy.

In this fashion, the continuous integration practice has evolved a lot with the tooling that supports it. Developers deliver code frequently, and, when they do, their continuous integration tool builds the software and executes other activities as well, such as running automated tests and static analysis.

Continuous integration tools thus ensure that the software stays in sync and also that the software remains in a perpetually deployable state (theoretically, anyway). They do this by responding to compilation or test failures with the concept of a failed build. If a code commit triggers a build failure, the team leaps on it quickly, getting the build back to a passing state (green).
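
Stripped to its core, the failed build concept is "run the steps in order and go red at the first one that fails." The sketch below shows that logic with placeholder steps. A real CI tool would call your compiler and test runner where this uses tiny Python commands, and run_build is a hypothetical name.

```python
import subprocess
import sys

def run_build(steps):
    """Run named steps in order; stop at the first failure."""
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            return ("red", name)
    return ("green", None)

# Placeholder steps; the "test" step is rigged to fail with exit code 1.
steps = [
    ("compile", [sys.executable, "-c", "print('compiling')"]),
    ("test", [sys.executable, "-c", "import sys; sys.exit(1)"]),
    ("package", [sys.executable, "-c", "print('packaging')"]),
]
print(run_build(steps))  # ('red', 'test')
```

Note that the packaging step never runs. A red build stops the pipeline so that nothing undeployable gets packaged.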

Let's now take a look at some popular continuous integration tools. Many such tools layer on other features as well, but this concept lies at the core of all of them.

Jenkins

Jenkins is arguably the most well-known tool out there. It is an open source tool, and it bills itself as an "automation server." Make no mistake, though. Continuous integration lies at the absolute heart of its value proposition, notwithstanding other capabilities that it offers.

Jenkins is a Java-based tool, and it will run on Windows, Mac, or Linux (and other *nix) operating systems. But you can use it for all sorts of tech stacks -- not just Java. It's easy to install, has a nice web interface, and has an extremely rich plugin ecosystem. This means that you can use it to automate all sorts of tasks that you might need as part of your build.

Oh, and it's free.

Travis CI

If you're wondering, the CI in Travis CI stands for continuous integration. So that should give you some insight into the tool's bread and butter.

Travis CI has been around for a long time and it offers a hosted solution as well as something you can install onsite if you don't want them hosting your code. It caters to the enterprise and it sports some nice integrations, such as with Slack and HipChat.

You can use Travis for free if you have an open source project, and it's trivial to get started if you store your code on GitHub. If you want to use it for your private repo, you can have 100 builds for free, but then you will have to pay.

TeamCity

TeamCity comes from popular dev tools company JetBrains, and it bills itself as a "hassle-free CI and CD server" (CD standing for continuous delivery). Since it comes from JetBrains, it is both a mature product and one for which you can expect support.

Like Jenkins, TeamCity is a Java-based tool, but it is well known for offering excellent support for .NET projects as well. This combination of tech stacks tends to make it popular in enterprise environments.

TeamCity is quite customizable and extensible, and it comes in free and paid versions. You can use it for free for relatively lightweight build scenarios, but after that, it has a paid licensing model.

Team Foundation Server

Microsoft has a suite of tools called Team Foundation Server (TFS). TFS has evolved over the years, and it actually provides a lot more than just a continuous integration server. It also includes source control, deployment technologies, issue tracking, and a number of other tools besides. But it does offer CI.

Team Foundation Server really shines in a heavily integrated Microsoft environment. So if you have a shop that uses Microsoft technologies exclusively, this may be the choice for you (though TFS does now support other tech stacks as well).

TFS is a paid tool, though its licensing is somewhat complicated. You can buy it outright, or you can take advantage of licensing models that include other Microsoft offerings, such as MSDN subscriptions.

Getting in the Continuous Integration Game

I've honestly just begun to scratch the surface of tools available to you. This is a space with a lot of different tools competing for your business, which should say something to you about its importance and prevalence.

But if you're new to continuous integration and continuous integration tools, I'll leave you with this advice. The set of tools can be overwhelming, but it's more important to pick one and to get started than it is to pick the perfect one. As I said at the beginning of the post, continuous integration is the modern lifeblood of our industry. So just make sure you get started and realize the efficiency and professionalism that you'll get from doing it.

CI/CD Tools: How to Choose Yours Wisely

Carlos M. — Tue, 25 Feb 2020 14:56:58 +0000

Continuous integration (CI) and continuous deployment (CD) tools allow teams to merge, build, test, and deploy code branches automatically. Implementing them along with conventions like "commit frequently" forces developers to test their code when it's combined with other people's work. Results include shorter development cycles and better visibility of code evolution among different teams.

Once you commit to using CI/CD in your software development cycle, you're immediately faced with a plethora of options: Travis, Jenkins, GitLab, CodeShip, TeamCity, and CircleCI, among others. Their names are catchy, but they hardly describe what the tools do. So here's a roadmap for choosing the right tool for your needs.

What Platforms and Integrations Should It Support?

Whether you're part of an enterprise or a startup, you'll first need to figure out all your platform requirements. Think about your operating systems and their versions, programming languages, access to third-party APIs, libraries, frameworks, and testing suites; collect all these data and check that each tool is able to support them all. You don't want painful troubleshooting sessions or loss of commercial provider support because of a version mismatch.

There are some shortcuts that I can provide:

BuildBot is a popular choice among Pythonistas. It's also written in Python, and it's very flexible. FOSS projects like Mozilla Firefox and MariaDB currently use BuildBot for their multi-platform builds and testing.
People using JIRA or Bitbucket will find that Bamboo and Bitbucket Pipelines fit their ecosystem. But beware of price increases as your number of environments and need for parallelization grow.
Jenkins' impressive list of plugins offers compatibility with every major developer tool in the market. Java shops will probably be comfortable deploying Jenkins because it's written in that language.
If you're building .NET projects to run on a Microsoft stack, you'll find two natural choices: Visual Studio Team Services (VSTS) and TeamCity. Both include Docker support.

Where Will It Run and Who Will Maintain It?

Some enterprises have regulations to meet, so they can't put their codebases in cloud services. Startups lack the manpower and time to run their own tools. Teams in both groups might be wary of security breaches caused by a wrong decision when using publicly available services. Whatever your position, here's a set of tips to navigate the cloud vs. on-premises conundrum:

Cloud services ease the workload for our teams—but always at the expense of fees above certain usage.
Extra charges from CI/CD cloud services come in the form of higher concurrency, build events, build time, and/or user count.
Some cloud services have a zero-cost plan for small teams. You could test the waters using that in a greenfield project.
On-premises options like GitLab, Travis-CI, and TeamCity have hosted offerings that let you test before committing to run your own copy.
On-premises tools require resources ready to orchestrate your builds and tests. That means more complex tasks to run, troubleshoot, and maintain.
Teams should be aware of legal and business requirements when uploading code and data to cloud services.

Who Will Use It and How Often?

It's only a matter of time before the results of using CI/CD become visible to management. At that point, your organization will be increasingly dependent on CI/CD. User adoption and build frequency will increase. Take a look at these items before deciding your team's headcount and licensing requirements:

Provide some managers read access to your CI/CD. Giving them visibility improves the chances of them supporting these efforts.
Choose a tool that integrates with your organization's favorite authorization method. Nobody wants one more password.
Get a list of desired environments and take note of how often your teams would hypothetically push code to them.
Recalculate your usage when you add a new team or user. Don't let your bills get out of control.

What Are Our Estimated Costs?

Depending on the nature of your company and team, you'll prefer either an operating expense or a capital expenditure. I'm an advocate of pay-as-you-go services, personally. They allow you to keep costs down when you're starting, and they remove complexity. But as soon as your yearly SaaS bills reach what you'd pay an engineer, you might want to get some of those toolsets under your control or migrate to other providers with lower costs. Here are some gotchas when estimating costs:

TeamCity and Bamboo require extra payment when you need more than one remote agent, so each additional environment costs more.
VSTS is free for teams of five people or fewer. After that, you pay per additional user each month.
GitLab only supports multiple Active Directory sources in its enterprise edition. Jenkins, on the other hand, supports them through a plugin.
Some SaaS providers charge by build time. If this is the case with your chosen option, configure your pipelines to use prebuilt images with resolved dependencies.
Jenkins, GoCD, BuildBot, and GitLab-CE are released under OSI-approved open source licenses, while Bamboo, TeamCity, Travis-CI Enterprise, and GitLab-EE are commercial products. Some of them charge based on the number of users, remote agents, and features.
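
If you want to put rough numbers on these gotchas, a back-of-the-envelope estimate like the one below can help. Every price in it is made up for illustration, and the yearly_saas_cost function is hypothetical. Substitute the free tier, per-user fee, and per-minute fee from your vendor's actual pricing page.

```python
def yearly_saas_cost(users, build_minutes_per_month, *, free_users,
                     price_per_extra_user, price_per_build_minute):
    """Estimate a yearly bill from per-user and per-build-minute charges."""
    extra_users = max(0, users - free_users)
    monthly = (extra_users * price_per_extra_user
               + build_minutes_per_month * price_per_build_minute)
    return 12 * monthly

# All figures below are invented placeholders, not real vendor prices.
cost = yearly_saas_cost(
    users=25,
    build_minutes_per_month=40_000,
    free_users=5,
    price_per_extra_user=6,
    price_per_build_minute=0.01,
)
print(cost)
```

Rerunning the estimate each time you add a team or an environment is an easy way to notice when the bill is approaching the cost of running the tool yourself.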

Availability of Documentation, Training, and Support

It might not be evident now, but you'll end up deploying to production with this new tool. So you'll want to set up the same kind of expectations you currently have for all your other tooling in production because, at some point, you will probably:

need training for more than your vanguard CI/CD team.
troubleshoot problems that affect your availability to internal and external customers.
extend your implementation to more environments, additional third-party integrations, and new business needs.
require emergency support beyond what a user community may offer.
provide onboarding documentation to new team members.

Metrics and Vendor Lock-In

I decided to leave these two strategic factors as final arguments—not for their lack of importance, but because they're only important once you've successfully adopted a CI/CD tool. While it's great to implement a delivery pipeline, it's hard to see improvements over time if you're not measuring and comparing to a baseline.

Here's a set of questions that will help you know what's relevant to measure in your line of business:

What's your business focus? Is it maintaining a reliable platform? Is it increasing your user base? Or is it adding software features ASAP?
Does your team deploy software to production and let it fail as part of debugging?
How many bad builds are slipping through and failing in production?
What is your mean time to recover from a failure in production?
How does your delivery pipeline design affect a push to production during an outage incident?
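
Two of the questions above translate directly into numbers you can track: mean time to recover and the share of deploys that fail in production. Here's a minimal sketch, assuming you keep a list of incident start and end times. The function names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

def mean_time_to_recover(incidents):
    """Average of (recovered - failed) across incidents, as a timedelta."""
    if not incidents:
        raise ValueError("no incidents recorded")
    total = sum((recovered - failed for failed, recovered in incidents), timedelta())
    return total / len(incidents)

def escaped_failure_rate(deploys, production_failures):
    """Share of production deploys that later failed in production."""
    return production_failures / deploys if deploys else 0.0

# Hypothetical incident log: (failure detected, service restored).
incidents = [
    (datetime(2020, 2, 3, 9, 0), datetime(2020, 2, 3, 9, 30)),
    (datetime(2020, 2, 17, 14, 0), datetime(2020, 2, 17, 15, 30)),
]
print(mean_time_to_recover(incidents))  # 1:00:00
print(escaped_failure_rate(40, 2))      # 0.05
```

Record these before you adopt a tool, and you have the baseline to compare against later.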

Now, regarding the vendor lock-in factor, the CI/CD ecosystem is blooming. Odds are that you'll probably change tools when your requirements evolve. Or perhaps your SaaS provider goes out of business. Don't forget to consider how much effort it would take your team to migrate your delivery pipeline to a different tool or provider in case a change is necessary. Here's some food for thought:

Can you export workflows and configurations?
Do exports come in formats that might be automatically transformed as inputs for other tools?
How tightly integrated is your CI/CD tool to specific computing resources?

Take the Next Step

If your current software development cycle is made of stages where people step on each other's toes while trying to survive the minefield created by libraries, APIs, and interservice dependencies, perhaps it's time to choose a CI/CD tool and start optimizing your software delivery pipeline.

Have you experienced this process already? Share your tips with us in the comments section!