- Teams of 1: Use minimal images, add only the software you need, and add a cronjob to auto-update via the package manager - or, in the case of containers, pin your Dockerfile to the `latest` tag and redeploy as often as possible.
- Teams of 10: Automate your build process to create a golden image or container base image with best security practices.
- Teams of 100: Automate and monitor as much as possible. Try to keep your developers excited about patching, and start getting strict about not letting anything but your approved images go into production. Make the security team responsible for the update and patching strategy.
- Teams of 1000: Dedicated team for building, updating, and pentesting base images. Demand full E2E automation. Monitor in real time and define RRDs with consequences.
Lately I've spent some time thinking about vulnerability management, hereafter 'vulnmgmt'. A major blue-team responsibility, it covers keeping packages, kernels, and OSes up to date with patches. Generally, if you deploy a VM with the latest distribution of all software, there will be a public CVE for something on it within a few weeks, leaving you vulnerable. While this sounds like a huge problem, I don't believe vulnmgmt should be anywhere near the top of the priority list for a team just starting to improve its security posture - a box can carry as many 10.0 CVEs as you like if it's airgapped, sat in a closet somewhere collecting dust. Like all of security, this is a game of risk trade-offs - I would rather spend time and energy on strong network segmentation and good appsec than on vulnmgmt. Inevitably, though, it does become a priority - especially because it's an easy business sell to auditors, customers, etc.
This is a huge industry, with numerous products and solutions offered by most major vendors in the Cyber space - which of course means there's a lot of bullsh*t. I'm a big believer in build-not-buy as a general approach, although managers and senior engineers seem keen to tell me this will change as I get older/more senior. In short, I think Cyber is stuck in 2000s-era product development: chasing catch-all solutions that promise a silver bullet rather than keeping products and feature sets in line with the Unix philosophy of 'do one thing and do it well' and promoting interoperability. We should try to kill the idea that spending €100,000/yr on a product means we have good security.
For a brief primer on vulnmgmt in an engineering-led organisation: we have several types of compute resource to secure - likely bare-metal or virtual machines, and containers. For each of those we have two states, pre- and post-deploy. Some of these resources may have very short lifetimes, e.g. EC2 instances in an autoscaling group, while some might be long-running, e.g. a database instance for some back-office legacy app. N.B. Most cloud-native organisations will have a reasonable amount of serverless code as well, which I won't touch on here.
Bare-metal and virtual instances will be deployed from an image - either a generic OS distribution or a 'Golden Image' AMI/snapshot (take a generic image, use something like Packer or Puppet to run some setup steps, and pickle it into a reusable artifact). In this state, the possible vulnerability sources are:
- From the generic base image - more likely if it is out of date
- From any packages or modifications made to the base during initialization
Containers are conceptually similar at this stage, except the base image isn't a single artifact but a stack of layers comprising the container image we're extending. Many applications extend 'minimal' images (see alpine, ubuntu-minimal, debian-buster, etc.) which focus on small image size, but it's entirely possible that by the time we reach application images we have 10+ layers, each of which is a new opportunity for packages pinned to a specific, vulnerable version.
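Some of this is mechanically checkable. As a minimal sketch - not a real tool, and deliberately ignoring registry-port edge cases - a few lines of Python can flag `FROM` lines that float on `latest` or omit a tag entirely:

```python
import re

def lint_dockerfile(text: str) -> list:
    """Flag FROM lines that float on :latest or omit a tag entirely."""
    warnings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        m = re.match(r"\s*FROM\s+(\S+)", line, re.IGNORECASE)
        if not m:
            continue
        image = m.group(1)
        if image.lower() == "scratch":
            continue
        if "@sha256:" in image:
            continue  # a digest pin is the strictest option
        if ":" not in image or image.endswith(":latest"):
            warnings.append(f"line {lineno}: unpinned base image '{image}'")
    return warnings

sample = "FROM ubuntu:latest\nRUN apt-get update && apt-get install -y curl\n"
print(lint_dockerfile(sample))  # → ["line 1: unpinned base image 'ubuntu:latest'"]
```

Run over every Dockerfile in CI, this catches the floating-tag problem before an image is ever built; multi-stage builds get each `FROM` checked individually.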
At this stage we should be focusing on a few things:
- We should not use public / non-hardened base images.
  - They're unlikely to ship with defaults applicable to our use-case.
  - It's cheap to maintain a clone of a public image, and doing so ensures we start in a clean, healthy state. The further along in the process we apply controls, the more likely they are to fail - catch problems early.
- We should publish our own base images as frequently as possible; they should come pre-updated and upgraded, running the latest OS version and package versions.
- These images should be pre-enrolled into whatever monitoring/SIEM programs we're running, reducing workload for the end-user of them.
- We should use static scanners during this process, and prevent the publishing of images which contain fixable vulnerabilities. Here is an awesome description of OVO's approach.
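The "block publishing on fixable vulnerabilities" rule is easy to enforce once the scanner emits JSON. A sketch of the gating logic, assuming the rough shape of Trivy's JSON report (`Results` → `Vulnerabilities`, with `Severity` and `FixedVersion` fields) - verify against your scanner's actual output:

```python
def has_fixable_vulns(report: dict, severities=("HIGH", "CRITICAL")) -> bool:
    """True if any finding at the given severities has an upstream fix available."""
    for result in report.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in severities and vuln.get("FixedVersion"):
                return True
    return False

# e.g. a report produced by: trivy image -f json -o report.json our-base:2021-06-01
report = {"Results": [{"Vulnerabilities": [
    {"VulnerabilityID": "CVE-2021-0000", "Severity": "HIGH", "FixedVersion": "1.2.4"},
]}]}
print(has_fixable_vulns(report))  # → True
```

In CI, a `True` here fails the build step. Trivy can also do this directly with its `--exit-code` and `--ignore-unfixed` flags, but owning the glue lets you add exceptions and reporting on your own terms.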
Luckily, there's a multitude of tools at our disposal:
- Ansible, Puppet, Chef - build-as-code providing strong repeatability and consistency.
- Hashicorp Packer, Vagrant, AWS CodeBuild - create Golden Images or deploy during CI and publish snapshots.
- cloud-init - the gold standard of consistent Unix initialization.
- Vuls - agentless scanner for Unix systems, checking package versions against NVD. People will get tired of me talking about this project but it's such a great concept.
- osquery/osquery - query your VM like it's a SQL db
- wazuh/wazuh - I haven't used this personally, but I've heard good things.
- quay/clair, aquasec/trivy - static analyzers for container images.
- jsitech/JShielder, CISOfy/Lynis, lateralblast/lunar - automated Linux hardening/compliance checkers.
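Under the hood, agentless scanners like Vuls largely boil down to comparing installed package versions against the 'fixed in' versions published in vulnerability feeds. A deliberately naive sketch of that comparison (real tools must honour dpkg/rpm epochs and revision suffixes, which this ignores):

```python
def parse_version(v: str) -> tuple:
    """Naive dotted-version parse; drops non-numeric segments like '-r1'."""
    return tuple(int(part) for part in v.split(".") if part.isdigit())

def is_vulnerable(installed: str, fixed_in: str) -> bool:
    """True if the installed version predates the version that fixes the CVE."""
    return parse_version(installed) < parse_version(fixed_in)

print(is_vulnerable("1.2.3", "1.2.4"))   # → True
print(is_vulnerable("1.2.10", "1.2.9"))  # → False (tuple compare beats string compare)
```

The second example is exactly why this can't be a string comparison - `"1.2.10" < "1.2.9"` lexicographically, which would produce a false positive.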
My ideal scenario looks something like this: a hardened base image rebuilt on a daily/weekly basis using Packer. When it gets published to staging, we use a Lambda to spin up an instance from it and run whatever scans we want against it, using Vuls or Lynis. If those checks pass we continue the build and publish the image to production; if not, we report the results and remediate. We should also validate that the instance connected successfully to our SIEM, and maybe attempt a portscan or try to drop a shell to verify it's catching low-hanging fruit.
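The glue that turns those individual checks into a publish/don't-publish decision is tiny. A sketch - the check names here are invented for illustration, and each would be populated by the corresponding scan step in the pipeline:

```python
def gate_image(checks: dict) -> tuple:
    """Aggregate post-build checks into a publish decision plus the failures."""
    failures = [name for name, passed in checks.items() if not passed]
    return (len(failures) == 0, failures)

publish, failures = gate_image({
    "vuls_scan": True,       # no fixable CVEs reported
    "lynis_audit": True,     # hardening score above threshold
    "siem_enrolled": False,  # instance never phoned home to the SIEM
    "portscan_clean": True,  # nothing unexpected listening
})
print(publish, failures)  # → False ['siem_enrolled']
```

A non-empty failures list fails the CI step and becomes the remediation report; anything less explicit and a half-configured image slips through because one check was silently skipped.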
This is where things get more complex, because our assets are now in the wild, becoming more outdated and unpatched by the day. The longer we stay in this state the further we deviate from our nice, safe, clean starting point - so a lot of effort should go into reducing the expected lifetime of any single asset before redeploy. I would preach harder for repeatable infrastructure than for perfect monitoring and patching of live assets, but unfortunately that's just not a reality in a lot of contexts. Some guy will always keep a pet server that you can't upgrade because 'it might break something and it's mission-critical'.
For VMs we will be relying on some solution to continuously monitor drift from the initial state and report back so that we can keep track of it. Previously I've used Vuls for this purpose, but if you have something like the AWS SSM agent installed on the instance then it's possible to run whichever tools fit best. This can be a minefield, as you'll have to either 1) enable inbound access to the machine, increasing risk, or 2) upload results from the box to some shared location. I'd prefer #2, as it's less complex from a network-ACL standpoint - though there can be complications with that too.
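For option #2, a deterministic layout for the shared location does most of the work. A sketch - the bucket name is hypothetical, and the actual upload would run on-box under an instance role allowed to `PutObject` to this prefix and nothing else, so a compromised host can write reports but not read anyone else's:

```python
import json
import socket
import time

def results_key(prefix: str = "vulnscan") -> str:
    """One object per host per day keeps the shared bucket browsable."""
    return f"{prefix}/{time.strftime('%Y-%m-%d')}/{socket.gethostname()}.json"

def serialize_results(findings: list) -> bytes:
    report = {
        "host": socket.gethostname(),
        "generated_at": int(time.time()),
        "findings": findings,
    }
    return json.dumps(report).encode("utf-8")

# Hypothetical upload step, e.g. with boto3:
#   boto3.client("s3").put_object(Bucket="vulnmgmt-results",
#                                 Key=results_key(),
#                                 Body=serialize_results(findings))
```

Whatever consumes the bucket - a Lambda, a cron job, your SIEM - gets a predictable path per host per day, which also makes "host X stopped reporting" a trivial alert.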
Containers at runtime are slightly harder: if you've got a reasonably hardened base image then inbound shell access is likely forbidden, and if it isn't, you're unlikely to have access to the tools needed for runtime heuristic tests. If the containers run within something like Istio it's easier to extract logs and metrics, so it would be a good idea to integrate these into whatever alerting engine we're using. Importantly, though, the results of a static container image analysis will differ between build time and some point in the future while it's running, so we can periodically re-run the checks we ran at build time to establish whether the container has become vulnerable to any new CVEs since we launched it. If so, we probably want to rotate in a patched version.
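That build-time vs. now comparison is just a set difference over the scanner's findings for the same image digest. A sketch, with placeholder CVE IDs:

```python
def new_cves(build_scan: set, current_scan: set) -> set:
    """CVE IDs present in the latest scan but absent at build time."""
    return current_scan - build_scan

# what we accepted at build vs. a periodic re-scan of the same image digest
at_build = {"CVE-2021-1111"}
now = {"CVE-2021-1111", "CVE-2021-2222"}
delta = new_cves(at_build, now)
if delta:
    print(f"rotate in a patched image: {sorted(delta)}")  # flags CVE-2021-2222
```

Keeping the build-time findings alongside the image in the registry means the periodic job needs no shell access to the running container at all - it re-scans the image, diffs, and triggers a redeploy.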
The strongest approach, to me, seems to be putting effort into monitoring the container's runtime heuristics and killing it once it deviates.
At some point some company might release an "ML-backed container RASP solution" which actually knocks it out of the park, and when that happens I'll welcome it. Until then, rather than submitting to extortionate subscription fees and vendor lock-in, we can achieve great security posture for VMs and containers using open-source tools and a little engineering. As a result we'll be a lot more confident in our claims and have developed a deeper understanding of our environments, allowing us to deploy tools which are genuinely extensible and well-suited for our use-cases.