For an enterprise to succeed, it’s no longer enough to launch a product that’s “good enough.” Businesses today must provide high-quality digital experiences that are not only performant and highly available, but also private and secure.
But how do you achieve all of this? One way is for DevSecOps teams to adopt an observability practice that uses logs (and other means) to collect wide swaths of data across your user experience and threat environments. By logging and analyzing both security and observability data, you can better detect and remediate a host of problems, such as performance issues, vulnerabilities, and security breaches, resulting in a higher-quality experience.
In this article, we’ll focus on the ins and outs of wide-reaching data collection, and how logs in particular can help you. We’ll look at the difference between observability data and security data, and how best to collect all this data. Then we’ll explore how to use this data to improve your app and finally see how to implement a centralized, one-stop shop for your data collection.
What is Observability Data? What is Security Data?
Observability is a property of systems that defines what an operator can tell about the state of a system, both currently and historically. Following from that, observability data is the representation of that system state: think percentage of errors, memory usage, and flow of traffic. Observability data can be raw data recorded from various sources or down-sampled, aggregated, and processed data.
How is that different from security data?
Security data is the subset of your observability data that is used to discover security threats. If this seems fuzzy, that’s because it is!
For example, if an application exhibits slow response times, is it security data? It depends. Maybe the application is simply misconfigured for the current load, or maybe an attacker has compromised the system and is running a workload that has slowed everything down.
Even data that isn’t typically considered security data can turn out to be. For example, if a new CVE is disclosed for a particular library, you’ll need to know which services depend on that version of the library. And sometimes clear-cut cases of security data (for example, the identity of a caller to some API endpoint) also end up being useful for tracking non-security bugs.
In the end, we can say that a piece of observability data might be security data…and it might not. It depends on the specifics of the use case, which often isn’t known until after the fact.
The Three Pillars of Observability Data
When we look at observability data, we often break it down into three pillars: logs, metrics, and traces. Let’s look at those and how they might apply to both observability and security data.
Logs capture time-stamped events and provide a timeline of what happened to the system. For example, your application may emit a log message if it fails to connect to a database, and may write who performed a certain action to an audit log. Any non-trivial system will have many different logs. Recording and organizing these logs in a centralized logging system where they can be sliced and diced is a primary facet of observability data collection.
Log analytics data is critical for monitoring, troubleshooting, and investigating reliability and security issues, and for getting down to the root cause of why an issue occurred. Log data is often the most detailed information available about a company’s systems, so it makes sense to put that data to work, pulling log files across the organization for end-to-end visibility and faster troubleshooting.
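To make the two log examples above concrete, here is a minimal sketch using only Python’s standard `logging` module. It emits one JSON object per event so that a centralized system can slice and dice the fields later. The logger name `checkout` and the `actor` and `action` fields are illustrative assumptions, not part of any particular platform’s schema.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (one event per line)."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional audit fields supplied via `extra=`; the names are hypothetical.
        for field in ("actor", "action"):
            if hasattr(record, field):
                event[field] = getattr(record, field)
        return json.dumps(event)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON events to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# An in-memory stream stands in for a log file or stdout.
stream = io.StringIO()
app_log = build_logger("checkout", stream)

# An application event: the service could not reach its database.
app_log.error("failed to connect to database")
# An audit event: who performed which action.
app_log.info("order deleted", extra={"actor": "alice", "action": "order.delete"})

events = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Structured events like these are easier to query than free-form text, which matters once logs from many services land in one place.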
While logs generally contain textual information, metrics are numerical time series — for example, the number of pending pods in your Kubernetes cluster over time. Metrics are typically recorded periodically and can help identify trends and anomalies. Metrics are much more economical than logs. They don’t require parsing and transformation, and they cost less to transmit and store. Metrics are the vital signs of your system. Alerting is primarily built on top of metrics.
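As a rough illustration of what “numerical time series” means, the sketch below models a metric as a name, a set of tags, and a list of timestamped values. The `TimeSeries` class, the `pending_pods` name, and the 15-second sampling interval are assumptions made for the example, not the data model of any specific tool.

```python
from dataclasses import dataclass, field


@dataclass
class TimeSeries:
    """A metric: a name, identifying tags, and (unix_timestamp, value) points."""

    name: str
    tags: dict[str, str]
    points: list[tuple[float, float]] = field(default_factory=list)

    def record(self, timestamp: float, value: float) -> None:
        self.points.append((timestamp, value))

    def latest(self) -> float:
        return self.points[-1][1]

    def average(self) -> float:
        return sum(value for _, value in self.points) / len(self.points)


# Pending pods sampled every 15 seconds (values are made up).
pending = TimeSeries("pending_pods", {"cluster": "prod"})
for i, value in enumerate([0, 1, 4, 3]):
    pending.record(1_700_000_000 + 15 * i, value)
```

Because each point is just a timestamp and a number, there is nothing to parse, which is a large part of why metrics are cheaper to transmit and store than logs.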
Distributed tracing associates an identifier with each request. A trace of a request is a collection of spans and references. You can think of a trace as a directed acyclic graph that represents the request’s traversal through the components of a distributed system. Each span records the time the request spent in a given component, and references are the edges of the graph that connect one span to the following spans. Traces are a very important tool for modern systems built on microservices, serverless functions, and queues. They provide a “chain of custody” to the system operators that is crucial for analyzing both performance and security issues.
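The span-and-reference structure can be sketched in a few lines. This simplified model assumes each span has at most one parent reference (a tree, which is a special case of the graph described above) and that child spans do not overlap in time. The component names and timings are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Time a request spent in one component, with a reference to its parent span."""

    span_id: str
    component: str
    start_ms: int
    end_ms: int
    parent_id: Optional[str] = None

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(spans: list[Span], span: Span) -> int:
    """Time spent in the span itself, excluding time spent in its direct children."""
    child_time = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - child_time


# One request flowing through four hypothetical components.
trace = [
    Span("a", "gateway", 0, 120),
    Span("b", "checkout", 10, 110, parent_id="a"),
    Span("c", "database", 20, 80, parent_id="b"),
    Span("d", "queue", 85, 100, parent_id="b"),
]

by_component = {s.component: self_time_ms(trace, s) for s in trace}
```

Breaking a request down this way shows where the time actually went, which is the starting point for both performance and security investigations.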
So how do these pillars cross with security data?
Nearly all security information comes from logs. While metrics and traces can help you understand when something is wrong (for example, the CPU is high), you most likely won’t know what is actually happening (in this case, the CPU is high because of a crypto-miner) until you examine the logs.
How to Collect Data
Now that we understand observability and security data, why they are important, and where they come from, let’s look at some more practical aspects: how to collect data and how to use it effectively.
Let’s look at each pillar, some best practices, and a few tools that can help. These steps apply to all your data — observability and security.
You probably have many logs created by your OS, Kubernetes, and your cloud provider. Your applications and third-party software also generate log files. However, standard logs bring several challenges:
Accessing a wide variety of log files across many hosts is inefficient and not user-friendly.
Log files will eventually fill the disk (unless you’re using cloud persistent storage).
If you’re rotating log files to address the disk space issue, then you lose historical data.
Your servers may crash from time to time.
Disks can get corrupted.
To address these issues, a best practice is to use a log collector agent on every host that ships the logs to a centralized observability platform. The centralized log management system will store the logs in durable storage, provide a query interface, provide aggregations and other transforms, and will retain the log backups as needed for compliance and finding historical trends.
To alleviate some of the pain, it’s a good idea to use industry standards and tooling like OpenTelemetry (https://opentelemetry.io). For data collection specific to logs, open-source tools like Logstash and Fluentd are also popular.
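To give a feel for what a log collector agent does, here is a deliberately tiny sketch: it reads whatever was appended to a file since the last run and hands the lines to a `send` callable in batches. The function `ship_new_lines` and the `send` callback are hypothetical. Real collectors also handle rotation, partially written lines, retries, and backpressure, none of which is covered here.

```python
import os
import tempfile
from typing import Callable


def ship_new_lines(
    path: str,
    offset: int,
    send: Callable[[list[str]], None],
    batch_size: int = 100,
) -> int:
    """Send lines appended to `path` since byte `offset`; return the new offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        batch: list[str] = []
        for raw in f:
            batch.append(raw.decode("utf-8", errors="replace").rstrip("\r\n"))
            if len(batch) >= batch_size:
                send(batch)
                batch = []
        if batch:
            send(batch)
        return f.tell()


# Demo: a list stands in for the centralized platform.
received: list[list[str]] = []
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "app.log")
    with open(log_path, "wb") as f:
        f.write(b"line 1\nline 2\nline 3\n")
    offset = ship_new_lines(log_path, 0, received.append, batch_size=2)

    # The application keeps writing; the next run ships only the new line.
    with open(log_path, "ab") as f:
        f.write(b"line 4\n")
    offset = ship_new_lines(log_path, offset, received.append, batch_size=2)
```

Tracking the offset is what lets the agent resume where it left off instead of re-sending the whole file.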
Metrics are somewhat easier to handle than logs since every metric, from any source, can be represented uniformly as a time series with a set of tags associated with it. Similar to the log collector, a metrics collector can run on every host and periodically read relevant data, such as CPU and memory usage of every process or the number of requests in flight, and send it to the observability platform.
Application-level metrics can be collected in two ways: the application may expose a standard interface that the metrics agent can scrape or, alternatively, the application can be instrumented (typically using a library) to send its metrics directly to the observability platform. Data points are captured at fixed intervals and sent to an observability platform for analysis.
For metrics, Prometheus is one of the most widely adopted open-source tools. It is designed to scrape endpoints that expose application metrics. Grafana works very well to visualize your metrics.
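For the scrape model, the application only needs to return its current values as text when asked. The sketch below renders metrics in the Prometheus text exposition format by hand so the shape is visible. The `Metric` class, the metric names, and the labels are assumptions for the example, and in a real service you would more likely use a client library than format this yourself.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    kind: str  # "counter" or "gauge"
    help_text: str
    labels: dict[str, str]
    value: float


def _escape(value: str) -> str:
    """Escape backslashes, quotes, and newlines inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(metrics: list[Metric]) -> str:
    """Render metrics as text that a scraper could read from an HTTP endpoint."""
    lines = []
    for m in metrics:
        lines.append(f"# HELP {m.name} {m.help_text}")
        lines.append(f"# TYPE {m.name} {m.kind}")
        label_str = ",".join(
            f'{key}="{_escape(val)}"' for key, val in sorted(m.labels.items())
        )
        series = f"{m.name}{{{label_str}}}" if label_str else m.name
        lines.append(f"{series} {m.value}")
    return "\n".join(lines) + "\n"


text = render(
    [
        Metric("http_requests_in_flight", "gauge",
               "Requests currently being served.", {"service": "checkout"}, 3),
        Metric("login_failures_total", "counter",
               "Failed login attempts.", {}, 17),
    ]
)
```

The application would typically serve this text at a path such as `/metrics`, and the metrics agent would fetch it at a fixed interval.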
Traces are collected by instrumenting applications and services to intercept requests and attach an id that follows the request when making calls to other services. The traces are sent to the observability platform for analysis.
For distributed tracing, Jaeger (originally developed by Uber) leads the pack. Other projects include Zipkin and SigNoz.
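To show how an id can follow a request across services, here is a small sketch of header propagation. The article doesn’t prescribe a header format, so this example assumes the W3C Trace Context `traceparent` header (version, trace id, parent span id, flags). The helper names `start_trace` and `outgoing_headers` are hypothetical, and in practice a tracing library would do this for you.

```python
import secrets


def start_trace() -> dict[str, str]:
    """Create headers for a brand-new trace (used at the edge of the system)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return {"traceparent": f"00-{trace_id}-{span_id}-01"}


def outgoing_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Keep the caller's trace id, but mint a new span id for the downstream call."""
    header = incoming.get("traceparent")
    if header is None:
        return start_trace()
    version, trace_id, _parent_span_id, flags = header.split("-")
    return {"traceparent": f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"}


# A request enters the system, then the service calls another service.
incoming = start_trace()
outgoing = outgoing_headers(incoming)

in_parts = incoming["traceparent"].split("-")
out_parts = outgoing["traceparent"].split("-")
```

Since the trace id stays the same on every hop, the observability platform can stitch the individual spans back into one trace.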
How is data used for improving application security?
How do you use all this data to specifically improve your application security? Let’s consider the role that these pieces of observability data might play in the different aspects of incident detection and response.
We begin with incident detection, which hinges on capturing metrics to detect abnormal activity. Monitored by an observability platform, these metrics will trigger an alert to the DevSecOps team.
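As one way to picture “detect abnormal activity,” the sketch below flags a metric value that sits far from its recent average. The z-score rule and the threshold of 3 are assumptions chosen for simplicity. Observability platforms offer far richer detection than this, and the sample values are made up.

```python
from statistics import mean, pstdev


def is_anomalous(history: list[float], current: float, threshold: float = 3.0) -> bool:
    """Return True if `current` is more than `threshold` std devs from the mean."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold


# Failed logins per minute over the last few minutes (illustrative values).
recent = [4, 6, 5, 7, 5, 6, 4, 5]

normal_minute = is_anomalous(recent, 6)
spike_minute = is_anomalous(recent, 42)
```

A `True` result here says only that something is unusual. Whether it is an attack or a misconfiguration is what the logs have to answer.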
At this point, the team only knows that something abnormal is happening, but it doesn’t know if this is a security-related incident or not. To identify the incident, the inspection of logs becomes critical. If the incident is security-related, the team can use logs to determine how the malicious actors are accessing the system and block further access by revoking access for compromised users or services.
Next, the team can use the combination of logs, metrics, and traces to assess breach impact, determining which parts of the system have been impacted and whether data was exfiltrated. By knowing the breadth of impact, the team can reduce the blast radius by making those components unavailable until they can be fixed.
Finally, the team can use logs and metrics, specifically when performing a post-mortem, to help in root cause analysis for all incidents, security or otherwise. The resulting actions may include tightening access control, patching vulnerable systems, or fine-tuning the observability strategy further.
What does this look like in practice?
The elite level of observability comes when you integrate all the various data sources and technologies into a cohesive platform that allows you to gain insights into your system.
You can build this yourself using many of the open source projects already mentioned. While this offers high customization, it requires a lot of effort to build, maintain, and operate, especially if you want a system that is robust, scalable, and always available. You also need to observe your observability platform.
Another option is to choose a managed platform to handle the observability concerns. This allows you to focus on your core competencies.
An example from Black Friday
Let’s look at an example around Black Friday. With many sites — especially in retail — experiencing an uptick in user activity, how might leveraging observability data and a platform help mitigate security incidents? I found this timely post about Ulta Beauty using metrics like invalid password attempts and login attempts per IP address and per country, all possible indicators of fraudulent activity.
Ulta pumps these metrics into an observability platform from Sumo Logic, which provides a Brute Force Attack dashboard to aid in incident detection. From there, an incident response team can begin analyzing logs to determine and address the root cause of the security incident, just as we walked through above.
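To illustrate the kind of signal described here, the following sketch counts failed logins per IP address from structured log events and flags addresses above a limit. It is not how Sumo Logic or Ulta implement their dashboard. The event field names (`event`, `ip`), the `login_failed` value, and the limit of five are all assumptions for the example.

```python
from collections import Counter


def suspicious_ips(events: list[dict], max_failures: int = 5) -> dict[str, int]:
    """Return IPs whose failed-login count exceeds `max_failures`."""
    failures = Counter(
        e["ip"] for e in events if e.get("event") == "login_failed" and "ip" in e
    )
    return {ip: count for ip, count in failures.items() if count > max_failures}


# Illustrative events: one address hammering the login endpoint, one normal user.
# The addresses come from ranges reserved for documentation.
events = (
    [{"event": "login_failed", "ip": "203.0.113.9"} for _ in range(8)]
    + [{"event": "login_failed", "ip": "198.51.100.4"}]
    + [{"event": "login_ok", "ip": "198.51.100.4"}]
)

flagged = suspicious_ips(events)
```

The same counting could be grouped by country or by account instead of by IP, depending on which pattern the team wants to watch.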
The platform from Sumo Logic ingests, aggregates, enriches, and retains all of the observability data. Sumo Logic lets you decide if you want to install data collectors locally or have hosted collectors running on AWS. It also integrates with observability staples like Prometheus and OpenTelemetry, which can ease migration efforts and alleviate concerns about vendor lock-in.
Collecting observability and security data in wide swaths is key to a high-quality digital experience for your users. To succeed, you’ll need to rely on all three pillars of observability: logs, metrics, and distributed tracing. You’ll need to use this data to detect, isolate, and mitigate attacks.
In the next post in the series, we will take a deeper look at monitoring, which is the next logical step. Once you have collected all this data, what do you do with it?
Have a really great day!