Patrick Smyth for Chainguard

Posted on Nov 12, 2024

Deep Dive 🤿: Where Does Grype Data Come From?

#devops #opensource #security #docker

Grype is a vulnerability scanner for container images and filesystems. It's developed by Anchore and written in Golang. When you point Grype at a container image, it will scan the files and folders on that image, compare what it finds to a database of CVEs (known vulnerabilities), and spit out a report telling you what CVEs have been detected.

We like Grype at Chainguard because it's open source, customizable, and reliable enough to integrate into our CI and CVE remediation workflows. (You can read more about why we like Grype in this post.)

In this article, we'll answer a question that comes up frequently: where does Grype's vulnerability data come from? In the process, we'll take a look at Grype's open data pipeline and do some light analysis of the vulnerability data that Grype uses to scan containers.

How Grype Works

If you haven't used Grype before, here's a brief overview of how it works.

You point Grype at a container image (or filesystem).
Grype downloads a fresh copy of its vulnerability.db database if its cached copy is out of date, then scans the image for specific packages, files, configurations, and so on, building a manifest in the form of a Software Bill of Materials (SBOM) itemizing the software contained in the image. (Under the hood, Grype uses a sister tool, Syft, for this step.)
Grype then compares the specific versions of each package against the vulnerability data in its database.
Finally, a list of CVEs detected in the image is returned to the user.
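
The comparison step can be illustrated with a toy sketch. This is not Grype's implementation: the package list, the vulnerability records, the placeholder IDs, and the naive dotted-version comparison are all hypothetical stand-ins for the SBOM produced by Syft and the entries in vulnerability.db.

```python
from typing import NamedTuple


class Package(NamedTuple):
    name: str
    version: str


class VulnRecord(NamedTuple):
    vuln_id: str
    package: str
    fixed_in: str  # first version that contains the fix


def parse_version(version: str) -> tuple[int, ...]:
    """Naive parser for purely numeric dotted versions such as '1.4.2'."""
    return tuple(int(part) for part in version.split("."))


def match(packages: list[Package], records: list[VulnRecord]) -> list[tuple[str, str]]:
    """Return (package name, vulnerability ID) pairs for packages older than the fix."""
    findings = []
    for pkg in packages:
        for rec in records:
            if rec.package == pkg.name and parse_version(pkg.version) < parse_version(rec.fixed_in):
                findings.append((pkg.name, rec.vuln_id))
    return findings


# Hypothetical SBOM and vulnerability data, for illustration only.
sbom = [Package("examplelib", "1.4.2"), Package("othertool", "2.0.0")]
vulns = [
    VulnRecord("CVE-0000-0001", "examplelib", "1.5.0"),
    VulnRecord("CVE-0000-0002", "othertool", "1.9.0"),
]
print(match(sbom, vulns))
```

Real-world version schemes (epochs, release suffixes, distro-specific ordering) are far messier than this, which is one reason a dedicated scanner is worth using.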

For a comprehensive overview of Grype's functionality, check out Using Grype to Scan Container Images for Vulnerabilities on Chainguard Academy.

Grype's Data Sources

Grype relies on a set of upstream providers for its vulnerability data. As of November 2024, the list of providers includes:

Note that the above links are to endpoints where data is provided. Chainguard is one of the upstream providers, and updating scanners like Grype on the fixed status of packages in our upstream OS, Wolfi, is a key element in maintaining the low-to-no CVE status of Chainguard Images.

Grype's vulnerability.db gets rebuilt daily from data sourced from these upstream providers. To build this database, Grype uses two open source tools, vunnel and grype-db. The vunnel tool downloads, standardizes, and stores vulnerability data from the above upstream providers. Basically, it accesses the various provider endpoints and stores a local vulnerability database and metadata for each provider. The grype-db utility collates this vulnerability data, building a much smaller vulnerability.db usable by Grype.

Building the Grype Database with `vunnel` and `grype-db`

In this section, we'll try out the vunnel and grype-db utilities, building a local vulnerability cache and database.

Since a built-daily vulnerability.db file gets downloaded every time you run a Grype scan, why would you want to build Grype's vulnerability.db manually? Building manually is useful if:

You want to use a subset of upstream sources
You'd like to integrate other sources to create a custom vulnerability.db
You require older Grype schemas
You'd like to contribute to Grype
You want to understand more about Grype's upstream providers and data structure

`vunnel`

The grype-db utility uses vunnel under the hood, but let's first try out vunnel explicitly to see how it works. You'll need Python 3 installed for this section, and I'll assume it's accessible on your system using the python command. (vunnel is written in Python.)

First, let's create a project folder:

mkdir -p ~/vulnerability-data && cd $_

Next, create a virtual environment and activate it:

python -m venv venv && source venv/bin/activate

Now install vunnel to the activated virtual environment:

pip install vunnel

Once vunnel is installed, we can use the vunnel list command to show the current list of providers:

vunnel list

alpine
amazon
chainguard
...
sles
ubuntu
wolfi

You can download a local cache of provider data for any of these providers with the following (using chainguard as an example provider):

vunnel run chainguard

This creates a data folder where all provider data used as input and the standardized output database are contained:

data
└── chainguard
    ├── checksums
    ├── input
    │   └── secdb
    │       └── security.json
    ├── metadata.json
    └── results
        └── results.db
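
If you'd like to peek inside the resulting results.db without assuming anything about its schema, a small helper like the following may be useful. It is a minimal sketch using only Python's standard library; the function name is my own, and it simply reports whatever tables it finds in a SQLite file.

```python
import os
import sqlite3


def table_row_counts(db_path: str) -> dict[str, int]:
    """Map each table name in a SQLite file to its row count."""
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of tables and indexes.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

From the project folder, you could call it as table_row_counts("data/chainguard/results/results.db").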

`grype-db`

The grype-db utility can pull provider data with vunnel under the hood, and can also collect and package up this data into the vulnerability.db file used by Grype.

In the following, we'll use grype-db to download all provider data using vunnel under the hood, then build the vulnerability.db file. The process of downloading all provider data can take some time and uses about 8 GB of disk space.

First, download the grype-db script to the project folder we created previously:

curl -sSfL https://raw.githubusercontent.com/anchore/grype-db/main/install.sh | sh -s -- -b .

If you'd like to build from all available data, you'll need a GitHub token capable of authenticating as a user. This is because GitHub rate limits API access for non-authenticated users. You can follow these instructions provided by GitHub, but in short, head to this token settings page on GitHub. Remember to safeguard your token as you would a password. I recommend creating a scoped and short-lived (e.g., 7 days) token.

Once you have your token, create a configuration file for grype-db in our ~/vulnerability-data project folder. First, set your generated GitHub token to an environment variable:

GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Next, generate the config in our ~/vulnerability-data project folder:

cat << EOF > ~/vulnerability-data/.grype-db.yaml
provider:
  vunnel:
    executor: local
    generate-configs: true
    env:
      GITHUB_TOKEN: $GITHUB_TOKEN
EOF

Now we have the grype-db script and the .grype-db.yaml configuration file in our project folder. Let's run a command that will pull all provider data, create a database file, and package it up for inclusion in a CI or other workflow. (For this step, you'll need vunnel available, so the virtual environment created in the previous section on vunnel will need to still be activated, and you should run this command from the project folder where we downloaded the grype-db script.)

./grype-db -g

Downloading and processing all provider data can take a long time, possibly hours, so go watch Master of the Flying Guillotine or get some work done, I guess. ☕

Once the process completes, a build folder will have been created in our ~/vulnerability-data/ project folder:

build
├── listing.json
├── metadata.json
├── provider-metadata.json
├── vulnerability.db
└── vulnerability-db_v5_2024-11-05T19:09:08Z_1730848219.tar.gz

The vulnerability.db database should be the same as the database built daily for use by Grype.
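
Before shipping the packaged archive to a CI workflow, it can be handy to check what it actually contains. The following is a small sketch using Python's standard tarfile module; the function name is my own, and I'm not making any claims here about which files grype-db places in the archive.

```python
import tarfile


def archive_members(archive_path: str) -> list[tuple[str, int]]:
    """Return (name, size in bytes) for each regular file in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]
```

Pass it the path to the .tar.gz file in your build folder to get a list of file names and sizes.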

Grype's `vulnerability.db`

I plan on following up this post with an analysis of the data in Grype's vulnerability.db database, but here are some quick notes on the structure of this SQLite3 file:

When Grype runs, it checks when its database was last updated. If it's been longer than a day (Grype rebuilds the database daily), a new vulnerability.db is downloaded to a cache. On Linux, it's stored in ~/.cache/grype/db/5/vulnerability.db, where the numbered folder (5) corresponds to the current Grype schema version number. On macOS, it's stored in ~/Library/Caches/grype/db by default.
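
To make the freshness check concrete, here is a rough sketch of "is the cached file older than a day?" logic. It uses the file's modification time as a proxy, which is an assumption on my part; Grype itself may rely on metadata shipped with the database rather than the file timestamp. The function and constant names are hypothetical.

```python
import os
import time
from typing import Optional

ONE_DAY_SECONDS = 24 * 60 * 60


def is_stale(
    db_path: str,
    max_age_seconds: float = ONE_DAY_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """Return True if the file is missing or older than max_age_seconds."""
    if not os.path.exists(db_path):
        return True
    current = time.time() if now is None else now
    # Assumption: modification time approximates when the database was last updated.
    return current - os.path.getmtime(db_path) > max_age_seconds
```

On Linux, you could try it with is_stale(os.path.expanduser("~/.cache/grype/db/5/vulnerability.db")).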

The vulnerability.db database has five tables, but only two have significant data. The vulnerability_metadata table stores information on CVEs as they apply on a per-platform basis. The entities in the vulnerability table represent vulnerabilities as they apply to specific package versions.

The platforms with the most vulnerability metadata entries are Ubuntu, NVD (NIST's National Vulnerability Database), and SUSE.

While these results are somewhat interesting when considering where Grype data comes from, the number here reflects many factors, mainly the date the provider started recording vulnerabilities and, for platform-specific providers, the attack surface of the platform. Other details such as duplication of distros lower the signal here and would require more analysis to parse out.
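
A count like this can be approximated with a few lines of Python. This is a sketch, not necessarily how the figures above were produced: it assumes the vulnerability_metadata table has a namespace column holding colon-separated values such as ubuntu:distro:ubuntu:20.04, and it treats the first segment as the platform. The function name is my own.

```python
import sqlite3
from collections import Counter


def entries_per_platform(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Count vulnerability_metadata rows by the first segment of the namespace."""
    counts: Counter = Counter()
    # Assumption: the table has a `namespace` column with colon-separated values.
    for (namespace,) in conn.execute("SELECT namespace FROM vulnerability_metadata"):
        counts[namespace.split(":")[0]] += 1
    return counts.most_common()
```

Grouping on the first segment folds every release of a distro into a single bucket, which is exactly the kind of duplication that makes raw counts hard to interpret.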

We can also check the number of vulnerability metadata entries by year:

This chart mainly shows a movement toward maturity in the ecosystem, leading to some stability after 2016-2017. (The early twenty-teens were a time of rapid development in cloud technologies in particular.)
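
One way to approximate a per-year breakdown is to read the year out of each CVE identifier. The sketch below assumes the vulnerability_metadata table has an id column containing identifiers of the form CVE-YYYY-NNNN, and it skips anything that doesn't match. Note that the year in a CVE ID reflects when the ID was assigned, so this may not match the definition behind the chart discussed above.

```python
import re
import sqlite3
from collections import Counter

CVE_YEAR = re.compile(r"^CVE-(\d{4})-\d+$")


def entries_per_year(conn: sqlite3.Connection) -> dict[int, int]:
    """Count vulnerability_metadata rows by the year embedded in the CVE ID."""
    counts: Counter = Counter()
    # Assumption: the table has an `id` column holding the vulnerability identifier.
    for (vuln_id,) in conn.execute("SELECT id FROM vulnerability_metadata"):
        found = CVE_YEAR.match(vuln_id)
        if found:  # non-CVE identifiers are ignored
            counts[int(found.group(1))] += 1
    return dict(sorted(counts.items()))
```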

Digging into the data in Grype's vulnerability.db can also help to answer much more specific questions about how CVEs affect different platforms. For example, imagine we host a mail server and we wish to know the fixed status of CVE-2024-37383, a known-exploited vulnerability which allows cross-site scripting in RoundCube, a webmail client. We can narrow the data in the vulnerability table to answer this question:

fix_state	namespace
fixed	nvd:cpe
fixed	debian:distro:debian:11
fixed	debian:distro:debian:12
fixed	debian:distro:debian:13
fixed	debian:distro:debian:unstable
not-fixed	ubuntu:distro:ubuntu:20.04
not-fixed	ubuntu:distro:ubuntu:22.04
fixed	ubuntu:distro:ubuntu:23.10
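
A lookup along these lines can be sketched with Python's built-in sqlite3 module. The fix_state and namespace columns appear in the output above; the id column used for filtering is an assumption about the schema, and the function name is my own. The original query may also have filtered on other fields, such as package name.

```python
import sqlite3


def fix_states(conn: sqlite3.Connection, vuln_id: str) -> list[tuple[str, str]]:
    """Return distinct (fix_state, namespace) pairs for one vulnerability ID."""
    query = """
        SELECT DISTINCT fix_state, namespace
        FROM vulnerability
        WHERE id = ?  -- assumption: `id` holds the CVE identifier
        ORDER BY namespace
    """
    return conn.execute(query, (vuln_id,)).fetchall()
```

With a connection opened via sqlite3.connect("build/vulnerability.db"), calling fix_states(conn, "CVE-2024-37383") would return one row per namespace that tracks the CVE.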

In a follow-up post, I'll show you how to load Grype's vulnerability.db database into Pandas to get a better sense of Grype's data schema and how it can be used to answer specific questions in platform security and broader CVE trends.

Conclusion

One of the most remarkable aspects of the Grype image scanner is the openness of its data pipeline. This is great for transparency and makes Grype's vulnerability.db into a more flexible and useful tool. If you've found this post useful, let us know and maybe you'll see more security deep dives like this. 🛡️🤿 You can follow Chainguard on dev.to or LinkedIn or sign up for our newsletter to keep in touch.

Resources

Top comments (3)

Alex Goodman • Dec 10 '24

Hey! I'm one of the developers on grype -- nice write up!

Wanted to give a heads up that we're working on the next DB schema now, which is heavily inspired by OSV (very different from today's DB layout). It should be released within a couple months. A community member contributed the grype db search command in recent months to make fetching raw vulnerability info from the DB more casual; we're going to be enhancing this command even further as a part of the DB schema work. If you wanted to build the new (unfinalized) v6 schema you can do so with the latest grype-db release (grype-db build -v -s 6).