The Surprising Complexities of Building Audit Logs

Audit logs are a feature that every enterprise customer wants in all of their products. Customers need to know who changed which settings and at what time. They need to know when someone creates a user account in their company’s instance of the product, who accessed what data, and more. If something goes wrong, they need to track down what happened and who is responsible. They not only want these logs for their own internal usage, but they also need these logs for compliance.

New Audit Logs at Split

At Split, we just launched a major expansion of our audit logs, which now record and display all administrative activity. We also added a webhook that lets you subscribe to some or all of these admin audit logs. While we had already been recording many of these logs in the backend, we took the opportunity to standardize and simplify our entire system.

Like most products, we initially got requests for audit logs for one or two small things. Every minute of software development time is precious, so like most companies, we built something quickly without much detailed design. Over time, customers requested more and more audit log types, and what was a small, easy system quickly became large, complicated, and convoluted. As we took this opportunity to simplify our system, we learned more than a few things and realized how critical it is to get logging right. These are the things that we found most important to remember for audit logs.

Audit Logs Are Forever

One of the tricky things about audit logs, similar to APIs, is that they last forever. Once a user creates a log, most companies are going to want that log for years, often seven or more if they care about compliance. You absolutely cannot delete that log, lose that log, or stop showing it because it is in an old format. If a record disappears from view for any reason, even if you still have the data, you risk upsetting your customers.

While you can change your log format, doing so requires either a database migration (to update your old logs) or conversion code that translates old data into the new format. Neither option is ideal. Because you often have a considerable volume of logs, running a migration is going to be slow and challenging. If you instead go with conversion code, you add unneeded complexity and indirection to your codebase. If you change formats more than once, this code will quickly become unmanageable. Your best option is to build in a way that avoids changing your log format.
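To make the cost of conversion code concrete, here is a minimal sketch of what it tends to look like. The field names, the version numbers, and both format changes are invented for illustration; they are not Split's actual log formats. Every format change adds another upgrader that has to live in the codebase for as long as the old records do.

```python
from typing import Any, Callable

Record = dict[str, Any]

def _v1_to_v2(record: Record) -> Record:
    # Assumed change: v2 renamed the "user" field to "actor".
    upgraded = {k: v for k, v in record.items() if k != "user"}
    upgraded["actor"] = record["user"]
    upgraded["version"] = 2
    return upgraded

def _v2_to_v3(record: Record) -> Record:
    # Assumed change: v3 split "target" ("split:checkout") into type and id.
    upgraded = {k: v for k, v in record.items() if k != "target"}
    target_type, _, target_id = record["target"].partition(":")
    upgraded["target_type"] = target_type
    upgraded["target_id"] = target_id
    upgraded["version"] = 3
    return upgraded

# One entry per historical format; this table only ever grows.
UPGRADERS: dict[int, Callable[[Record], Record]] = {1: _v1_to_v2, 2: _v2_to_v3}
CURRENT_VERSION = 3

def upgrade(record: Record) -> Record:
    """Walk a stored record forward, one version at a time, to the current format."""
    while record["version"] < CURRENT_VERSION:
        record = UPGRADERS[record["version"]](record)
    return record

old = {"version": 1, "user": "alice", "action": "split.update", "target": "split:checkout"}
print(upgrade(old))
```

Two format changes already mean two functions that can never be deleted, which is the indirection described above.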

Additionally, while it is at least possible to restructure data – changing the format or what fields are displayed – it is nearly impossible to change a log type. For example, if we had a log that was something like enterprise.create_api_key and wanted to change that to api_key.create, any customers previously looking for that event as a change to an enterprise object would no longer find it. Even with the best communication of the change to customers, some of them will have trouble. Additionally, if anyone has built on your API, getting them to migrate will be a lengthy and time-consuming process.

Design For Generic

As I mentioned, it is common for products to start with a few logs that customers are requesting, then add a few more and a few more. Before our most recent change, Split had logs for splits, segments, and metrics. We were also logging many more events without exposing them. Even if you have logs for everything now, this is only the start. As you add new features, you will be continuously adding even more logs. Because things are often done piecemeal like this, logs are not typically thought about holistically. Without thinking about everything at once, it is easy to end up with a lot of different event types and lots of unique data structures. This abundance of event types and lack of consistency is problematic for both your developers and your customers. For developers, a plethora of inconsistent logs results in complex branching code to support it. For customers, take the earlier example of someone creating an API key: a lack of consistency will make it hard for that user to figure out where to find the log. Is it a change to an enterprise object, a change to an api_key object, or a change to a user object?

Unique data structures make it extremely difficult to migrate your data to a new system or to rework your UI should you ever need to. You will need to account for all of the special cases that exist with anything new that you build. This increases bugs and the complexity involved in any migration. Additionally, it is more difficult to store, search, and filter consistently. Coming up with an understandable API design will be nearly impossible. While it can be challenging to map all of your log types into a generic data structure, it’s even harder to do this after everyone has gotten used to each one as a unique and special log.

It might still be tempting to find a generic way to support logs that are all unique. However, even if you find a way to store and process logs with a variety of data structures, you will still need to find a way to display all of those logs consistently and understandably in the UI. Additionally, if you build an API for your logs, you will need to find a way to make your logs consistent for that API response. No developer is going to want to use an API that has a different response type for every single log type. Even if you convinced them to use it anyway, it would be almost impossible to build it in a way that can also handle future log types.

The other key reason to design as generically as possible is related to the fact that your logs last forever. Even if you have data structures that are flexible enough to cover all of your current use cases, you may still run into trouble when you add more log types. When this happens, you will need to either migrate your data or repurpose a data structure in a way it was never intended to be used. Both of these have obvious problems, so design as generically as you can.
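As an illustration of what "generic" can mean here, the sketch below uses one record shape for every log type. The field names and the "object.action" naming convention are assumptions made for this example, not Split's actual schema. The point is that a single rendering function can handle log types that did not exist when it was written.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass(frozen=True)
class AuditEvent:
    event_type: str      # assumed convention: "<object>.<action>"
    actor_id: str        # who did it
    target_type: str     # what kind of object was affected
    target_id: str       # which object was affected
    occurred_at: datetime
    # Free-form details: the only part that varies between log types.
    changes: dict[str, Any] = field(default_factory=dict)

def describe(event: AuditEvent) -> str:
    """Render any event the same way, with no per-type branching."""
    _, _, action = event.event_type.rpartition(".")
    return (f"{event.occurred_at:%Y-%m-%d %H:%M} "
            f"{event.actor_id} {action} {event.target_type} {event.target_id}")

when = datetime(2020, 3, 1, 12, 30, tzinfo=timezone.utc)
events = [
    AuditEvent("api_key.create", "user-42", "api_key", "key-7", when),
    AuditEvent("split.update", "user-42", "split", "checkout", when,
               changes={"rollout_percentage": {"from": 10, "to": 50}}),
]
for event in events:
    print(describe(event))
```

Anything specific to one log type goes into the changes field, so storage, search, the UI, and API responses can all rely on the same handful of top-level fields.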

Design for API

While everyone likes to talk about building APIs first, in practice, I often see APIs get lower priority. For many features in many companies, the API use case is harder to prioritize because it has less demand. As much as I love the idea of clean, consistent, publicly-available APIs, I understand why they might not be the first thing built. That said, for audit logs, even if you don’t get the request for APIs immediately, that request is coming. There are different formats that this may take – an events stream, such as the webhooks we built as a part of this release, or an on-demand search API. Ultimately, it is likely that customers will want both of these. They want to take actions programmatically based on what happens in your product. They also want to be able to run audits using an API that searches their events or to be able to export large chunks of the logs for compliance. Even if these are not part of your initial audit log requirements, they are coming, so design for both of these API types upfront.
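For the event-stream side, a webhook that supports subscribing to some or all logs needs some way to match events against subscriptions. The sketch below is one possible approach and assumes dotted event type names and glob-style patterns; the Subscription shape and the URLs are hypothetical and do not describe how Split's webhook is configured.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

@dataclass(frozen=True)
class Subscription:
    url: str                   # hypothetical receiver endpoint
    patterns: tuple[str, ...]  # glob patterns over event types; "*" means all

def subscribers_for(event_type: str, subscriptions: list[Subscription]) -> list[str]:
    """Return the URLs whose patterns match this event type."""
    return [
        sub.url
        for sub in subscriptions
        if any(fnmatchcase(event_type, pattern) for pattern in sub.patterns)
    ]

subscriptions = [
    Subscription("https://example.com/all", ("*",)),
    Subscription("https://example.com/keys", ("api_key.*",)),
    Subscription("https://example.com/users", ("user.create", "user.delete")),
]
print(subscribers_for("api_key.create", subscriptions))
```

This kind of matching only stays simple if event types follow one consistent naming scheme, which is another reason to settle the naming before customers start building on it.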

Similar to logs living forever, APIs live forever. When you combine the two, audit log APIs are pretty much immortal. Because there is a strong use case for using audit log APIs, they will get used. The more customers that start using them, the harder it will be to change these APIs. Ever. I watched a company try to get rid of their v1 API for over six years. All of this is to say that taking an extra couple of weeks designing your APIs now can save you a lot of headaches later. Also, since these APIs will get used, their design will either please or frustrate your customers.

Focus On Audit Log Use Cases

In many companies, the audit log use case is slightly different from other product use cases. It is easy to see that audit logs have pages of results, like many other parts of the product, and to try to build them the same way. It is also easy to reuse frontend components, backend components, and even infrastructure. In software development, it is usually preferable to reuse as much as possible. It seems like a bad idea to introduce a new database or new infrastructure for what looks like a small feature that adds a couple of log types. However, there are a few key differentiators between audit logs and most of the other use cases at Split.

The first is volume. I will have far more logs than any other object. Assuming that I log (at a minimum) object creation, update, and deletion, that guarantees I have at least one log per object even without anything getting updated or deleted. In some cases, we also want access logs – who viewed those items? You can see how we are almost guaranteed to have many more logs than any other object type in our system.

The second key difference is the access pattern. If I search for a file in a folder, I may want to sort by created date, file name, or date updated. Likewise, if I view all of my feature flags, I want to be able to sort on several different attributes – name, created time, when the rollout last changed, how much traffic it is getting, and more. Meanwhile, for logs, I rarely ever want to sort by anything other than date (with the newest first). Similarly, unless I am downloading all logs associated with some particular audit trail or event, it is unlikely that I even care about more logs than the first few pages of results. If I do want older logs, I want to be able to filter down to a particular time window before I look at any results. In the rare cases where I do want more than a couple of pages of results, I will want those results programmatically from an API, not through the UI. It is impractical to go through that much data manually. For almost any other case where sorting may come into play, filtering alone can also solve the use case.
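The access pattern described above (filter to a time window, newest first, page forward) maps naturally onto cursor-based paging. The following is a minimal in-memory sketch under assumed field names ("id" and "occurred_at"); a real system would push the same filter and ordering into the database query rather than sorting in application code.

```python
from datetime import datetime
from typing import Any, Optional

Cursor = tuple[datetime, int]

def fetch_page(
    logs: list[dict[str, Any]],
    start: datetime,
    end: datetime,
    page_size: int,
    cursor: Optional[Cursor] = None,
) -> tuple[list[dict[str, Any]], Optional[Cursor]]:
    """Return one page of logs in [start, end), newest first, plus the next cursor."""
    # Filter to the time window before looking at any results.
    in_window = [log for log in logs if start <= log["occurred_at"] < end]
    # The only supported order: newest first, with id as a tie-breaker.
    in_window.sort(key=lambda log: (log["occurred_at"], log["id"]), reverse=True)
    if cursor is not None:
        # Resume strictly after the last item of the previous page.
        in_window = [log for log in in_window if (log["occurred_at"], log["id"]) < cursor]
    page = in_window[:page_size]
    has_more = len(in_window) > page_size
    next_cursor = (page[-1]["occurred_at"], page[-1]["id"]) if has_more else None
    return page, next_cursor

logs = [{"id": day, "occurred_at": datetime(2020, 1, day)} for day in range(1, 6)]
page, cursor = fetch_page(logs, datetime(2020, 1, 2), datetime(2020, 1, 6), page_size=2)
print([log["id"] for log in page], cursor)
```

Because the cursor encodes a position in a single fixed ordering, each page costs about the same no matter how deep the caller goes, which is much harder to guarantee when sorting on arbitrary fields is allowed.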

Even within log types, there can be different use cases, data volume, and access patterns. For example, at Split we have general and admin logs. General logs are related to what a user did to splits, segments, and metrics. These are things like changing the rollout percentage on a split or adding a userId to a segment. Admin logs, meanwhile, are associated with anything that happens within the admin panel. These are things like creating a user, turning on a 2FA requirement, or configuring an integration. As you can imagine, the general logs have a significantly higher volume. They are also more likely to be used for things like an events engine both for internal and external applications. For these logs, any given use case typically only cares about super recent logs and only a very filtered subset of log types (for example, only split changes). By comparison, the admin logs happen less frequently and are accessed much less often. They are primarily accessed for debugging or for audit purposes.

These differences may not matter at lower volumes, and interfaces that allow for lots of flexibility are useful, but when you reach higher volumes, some of these differences can matter a lot. One thing to consider is how you are storing the data: if the logs are living for a long time, that can be a lot of data, which can get costly. Would it make sense to only have the last month or two of logs in fast-access storage? Many companies may require you to keep 7 or 10 years of logs, but do they need instant access to those? Or is it good enough if you can send the logs to them in an hour? Does it have to be through the UI, or can older logs be accessed in some other way? If you have something like a webhook, what guarantees do you need around that log getting delivered to the webhook? Is there an SLA on when that delivery needs to happen relative to the time of the event? While it is nice to be able to sort results on basically any field, sorting, particularly in conjunction with paging, can become extremely difficult on large data sets. Can you get away with only sorting by date?
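If the answer to the storage question is "recent logs fast, older logs slower", the routing rule itself can be very small. The sketch below assumes a 60-day fast-access window and two tier names; both are placeholders to be replaced by whatever your customers' access requirements turn out to be.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy: roughly "the last month or two" stays in fast-access storage.
HOT_WINDOW = timedelta(days=60)

def storage_tier(occurred_at: datetime, now: datetime) -> str:
    """Pick a storage tier by age. Nothing is deleted; old logs only move."""
    age = now - occurred_at
    return "hot" if age <= HOT_WINDOW else "archive"

now = datetime(2020, 6, 1, tzinfo=timezone.utc)
print(storage_tier(datetime(2020, 5, 20, tzinfo=timezone.utc), now))
print(storage_tier(datetime(2015, 5, 20, tzinfo=timezone.utc), now))
```

The harder part is everything around this rule: how archived logs are retrieved, how long retrieval may take, and whether the UI or only an export path needs to reach them.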

Take the Time to Get Logging Right

I am not writing this to scare you out of building audit logs. They are an important, powerful, and often needed feature. However, they are deceptively challenging, so it is worth taking a little extra time up front to make sure you have a design that can last for many years to come.

To see how our Admin Audit Logs turned out, visit our documentation. For more content on feature flags, experimentation, and testing in production, check us out on Twitter or LinkedIn, and subscribe to our YouTube channel!