Patrick Smyth for Chainguard

Deep Dive 🤿: Where Does Grype Data Come From?

Grype is a vulnerability scanner for container images and filesystems. It's developed by Anchore and written in Go. When you point Grype at a container image, it will scan the files and folders in that image, compare what it finds to a database of CVEs (known vulnerabilities), and spit out a report telling you which CVEs have been detected.

Grype logo, it's like a creepy monster guy

We like Grype at Chainguard because it's open source, customizable, and reliable enough to integrate into our CI and CVE remediation workflows. (You can read more about why we like Grype in this post.)

Expertly photoshopped I Like Ike pin edited to read I Like Grype


I know, my photoshop skills scare even me sometimes.

In this article, we'll answer a question that comes up frequently: where does Grype's vulnerability data come from? In the process, we'll take a look at Grype's open data pipeline and do some light analysis of the vulnerability data that Grype uses to scan containers.

How Grype Works

If you haven't used Grype before, here's a brief overview of how it works.

  • You point Grype at a container image (or filesystem).
  • Grype makes sure it has an up-to-date copy of its vulnerability.db database (downloading a fresh one if needed), then scans the image for specific packages, files, configurations, and so on, building a manifest in the form of a Software Bill of Materials (SBOM) itemizing the software contained in the image. (Under the hood, Grype uses a sister tool, Syft, for this step.)
  • Grype then compares the specific versions of each package against the vulnerability data in its database.
  • Finally, a list of CVEs detected in the image is returned to the user.
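
The comparison step in the list above can be illustrated with a toy sketch. This is not Grype's matching logic or data model: the `Package` and `VulnRecord` types, the package names, and the CVE IDs are all invented for the example, and the version parser only understands plain dotted numbers, whereas real distro version schemes are more involved.

```python
# Toy illustration only: this is not Grype's matching logic or data model.
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    name: str
    version: str


@dataclass(frozen=True)
class VulnRecord:
    cve: str
    package: str
    fixed_in: str  # first version that contains the fix


def parse_version(version: str) -> tuple[int, ...]:
    """Simplified: handles dotted numeric versions like '1.6.6' only."""
    return tuple(int(part) for part in version.split("."))


def match(packages: list[Package], records: list[VulnRecord]) -> list[tuple[str, str, str]]:
    """Return (cve, package, installed_version) for every vulnerable package."""
    findings = []
    for pkg in packages:
        for rec in records:
            same_package = rec.package == pkg.name
            if same_package and parse_version(pkg.version) < parse_version(rec.fixed_in):
                findings.append((rec.cve, pkg.name, pkg.version))
    return findings


# Hypothetical SBOM contents and vulnerability records, invented for this example.
sbom = [Package("examplelib", "1.2.0"), Package("othertool", "3.4.1")]
db = [
    VulnRecord("CVE-0000-0001", "examplelib", "1.2.5"),
    VulnRecord("CVE-0000-0002", "othertool", "3.4.0"),
]

for cve, name, version in match(sbom, db):
    print(f"{cve}: {name} {version}")
```

Only `examplelib` is reported here, because the installed `othertool` version is already newer than the version that fixed its made-up CVE.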

For a comprehensive overview of Grype's functionality, check out Using Grype to Scan Container Images for Vulnerabilities on Chainguard Academy.

Grype's Data Sources

Grype relies on a set of upstream providers for its vulnerability data. As of November 2024, the list of providers includes:

Note that the above links are to endpoints where data is provided. (Chainguard is one of the upstream providers, and updating scanners like Grype on the fixed status of packages in our upstream OS, Wolfi, is a key element in maintaining the low-to-no CVE status of Chainguard Images.)

Grype's vulnerability.db gets rebuilt daily from data sourced from these upstream providers. To build this database, Grype uses two open source tools, vunnel and grype-db. The vunnel tool downloads, standardizes, and stores vulnerability data from the above upstream providers. Basically, it accesses the various provider endpoints and stores a local vulnerability database and metadata for each provider. The grype-db utility collates this vulnerability data, building a much smaller vulnerability.db usable by Grype.

Building the Grype Database with vunnel and grype-db

In this section, we'll try out the vunnel and grype-db utilities, building a local vulnerability cache and database.

Since Grype automatically downloads a vulnerability.db file that is rebuilt daily, why would you want to build Grype's vulnerability.db manually? Building manually is useful if:

  • You want to use a subset of upstream sources
  • You'd like to integrate other sources to create a custom vulnerability.db
  • You require older Grype schemas
  • You'd like to contribute to Grype
  • You want to understand more about Grype's upstream providers and data structure

vunnel

The grype-db utility uses vunnel under the hood, but let's first try out vunnel explicitly to see how it works. You'll need Python 3 installed for this section, and I'll assume it's accessible on your system using the python command. (vunnel is written in Python.)

First, let's create a project folder:

mkdir -p ~/vulnerability-data && cd $_

Next, create a virtual environment and activate it:

python -m venv venv && source venv/bin/activate

Now install vunnel to the activated virtual environment:

pip install vunnel

Once vunnel is installed, we can use the vunnel list command to show the current list of providers:

vunnel list

alpine
amazon
chainguard
...
sles
ubuntu
wolfi

You can download a local cache of provider data for any of these providers with the following (using chainguard as an example provider):

vunnel run chainguard

This creates a data folder where all provider data used as input and the standardized output database are contained:

data
└── chainguard
    ├── checksums
    ├── input
    │   └── secdb
    │       └── security.json
    ├── metadata.json
    └── results
        └── results.db
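
If you'd like to peek inside this folder without assuming anything about its schema, a small Python sketch like the following can help. It assumes that metadata.json holds a JSON object and that results.db is a SQLite file (which the .db extension suggests but which you should verify); the `summarize_provider` helper is a name made up for this example, not part of vunnel.

```python
import json
import sqlite3
from pathlib import Path


def summarize_provider(provider_dir: Path) -> dict:
    """Report metadata keys and SQLite table names for one provider folder."""
    provider_dir = Path(provider_dir)
    # Assumption: metadata.json contains a JSON object (key/value pairs).
    metadata = json.loads((provider_dir / "metadata.json").read_text())
    results_db = provider_dir / "results" / "results.db"
    # Open read-only so that inspecting the cache can never modify it.
    conn = sqlite3.connect(f"{results_db.resolve().as_uri()}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return {"metadata_keys": sorted(metadata), "tables": [name for (name,) in rows]}


if __name__ == "__main__":
    folder = Path("data/chainguard")
    if folder.is_dir():
        print(summarize_provider(folder))
```

Run from the project folder, this prints whatever keys and tables your local cache actually contains, so nothing here depends on a particular vunnel version.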

grype-db

The grype-db utility can pull provider data with vunnel under the hood, and can also collect and package up this data into the vulnerability.db file used by Grype.

In the following, we'll use grype-db to download all provider data using vunnel under the hood, then build the vulnerability.db file. The process of downloading all provider data can take some time and uses about 8 GB of disk space.

First, use the grype-db install script to download the grype-db binary to the project folder we created previously:

curl -sSfL https://raw.githubusercontent.com/anchore/grype-db/main/install.sh | sh -s -- -b .

If you'd like to build from all available data, you'll need a GitHub token capable of authenticating as a user. This is because GitHub rate limits API access for non-authenticated users. You can follow these instructions provided by GitHub; in short, head to this token settings page on GitHub and generate a token. Remember to safeguard your token as you would a password. I recommend creating a scoped and short-lived (e.g., 7 days) token.

Once you have your token, create a configuration file for grype-db in our ~/vulnerability-data project folder. First, set your generated GitHub token to an environment variable:

GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Next, generate the config in our ~/vulnerability-data project folder:

cat << EOF > ~/vulnerability-data/.grype-db.yaml
provider:
  vunnel:
    executor: local
    generate-configs: true
    env:
      GITHUB_TOKEN: $GITHUB_TOKEN
EOF
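
One thing worth checking before moving on: if the GITHUB_TOKEN variable was empty when the heredoc ran, the shell will have written an empty value into the config. The sketch below is a quick sanity check for that. It scans the file line by line rather than parsing YAML (to avoid a third-party dependency), and the `token_is_set` helper is a name invented for this example.

```python
from pathlib import Path


def token_is_set(config_path: Path) -> bool:
    """True if the config has a GITHUB_TOKEN line with a non-empty value."""
    for line in Path(config_path).read_text().splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "GITHUB_TOKEN":
            return bool(value.strip())
    return False


if __name__ == "__main__":
    config = Path.home() / "vulnerability-data" / ".grype-db.yaml"
    if config.is_file():
        # Report only whether a value is present; never print the token itself.
        print("GITHUB_TOKEN set in config:", token_is_set(config))
```

Since the token ends up in this file in plain text, treat the file with the same care as the token itself.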

Now we have the grype-db binary and the .grype-db.yaml configuration file in our project folder. Let's run a command that will pull all provider data, create a database file, and package it up for inclusion in a CI or other workflow. (For this step, you'll need vunnel available, so the virtual environment created in the previous section on vunnel needs to still be active, and you should run this command from the project folder where we downloaded grype-db.)

./grype-db -g

Downloading and processing all provider data can take a long time, possibly hours, so go watch Master of the Flying Guillotine or get some work done, I guess. ☕

Once the process completes, a build folder will have been created in our ~/vulnerability-data/ project folder:

build
├── listing.json
├── metadata.json
├── provider-metadata.json
├── vulnerability.db
└── vulnerability-db_v5_2024-11-05T19:09:08Z_1730848219.tar.gz

The vulnerability.db database should be the same as the database built daily for use by Grype.
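
To see for yourself what ended up in the packaged archive, you can list its members with Python's standard tarfile module. This is a minimal sketch; `archive_members` is a helper name made up for this example, and the glob pattern simply follows the archive filename shown above.

```python
import tarfile
from pathlib import Path


def archive_members(archive_path: Path) -> list[str]:
    """Return the sorted member names of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(member.name for member in archive.getmembers())


if __name__ == "__main__":
    # Run from the project folder; prints nothing if no archive has been built yet.
    for archive in sorted(Path("build").glob("vulnerability-db_*.tar.gz")):
        print(archive.name, archive_members(archive))
```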

Grype's vulnerability.db

I plan on following up this post with an analysis of the data in Grype's vulnerability.db database, but here are some quick notes on the structure of this SQLite3 file:

When Grype runs, it checks when its database was last updated. If it's been longer than a day (Grype rebuilds the database daily), a new vulnerability.db is downloaded to a cache. On Linux, it's stored in ~/.cache/grype/db/5/vulnerability.db, where the numbered folder (5) corresponds to the current Grype schema version number. On macOS, it's stored in ~/Library/Caches/grype/db by default.
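
Here's a rough sketch for locating the cached database and checking how old it is. The default paths are the ones mentioned above; using the file's modification time as a stand-in for "last updated" is my assumption and not necessarily how Grype itself decides, and all helper names are invented for this example.

```python
import sys
import time
from pathlib import Path
from typing import Optional

ONE_DAY_SECONDS = 24 * 60 * 60


def default_cache_dir() -> Path:
    """Default cache locations as described in this article."""
    if sys.platform == "darwin":
        return Path.home() / "Library" / "Caches" / "grype" / "db"
    return Path.home() / ".cache" / "grype" / "db"


def find_cached_db(cache_dir: Path) -> Optional[Path]:
    """Return the most recently modified vulnerability.db under cache_dir, if any."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return None
    candidates = list(cache_dir.rglob("vulnerability.db"))
    return max(candidates, key=lambda p: p.stat().st_mtime) if candidates else None


def is_older_than_a_day(db_path: Path, now: Optional[float] = None) -> bool:
    """Rough proxy for staleness based on the file's modification time."""
    now = time.time() if now is None else now
    return now - Path(db_path).stat().st_mtime > ONE_DAY_SECONDS


if __name__ == "__main__":
    cached = find_cached_db(default_cache_dir())
    if cached is None:
        print("No cached vulnerability.db found")
    else:
        print(cached, "older than a day:", is_older_than_a_day(cached))
```

Searching recursively avoids hard-coding the schema version folder, which will change as Grype's schema evolves.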

The vulnerability.db database has five tables, but only two have significant data. The vulnerability_metadata table stores information on CVEs as they apply on a per-platform basis. The entities in the vulnerability table represent vulnerabilities as they apply to specific package versions.
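
You can check the table list and row counts yourself with Python's built-in sqlite3 module. The sketch below makes no assumptions about table names: it reads them from SQLite's own catalog. `table_row_counts` is a helper name made up for this example.

```python
import sqlite3
from pathlib import Path


def table_row_counts(db_path: Path) -> dict[str, int]:
    """Map each table in a SQLite file to its number of rows."""
    # Open read-only so the database cannot be modified by accident.
    conn = sqlite3.connect(f"{Path(db_path).resolve().as_uri()}?mode=ro", uri=True)
    try:
        names = [
            name
            for (name,) in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names come from the database's own catalog, so quoting them is enough here.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


if __name__ == "__main__":
    db = Path("build/vulnerability.db")
    if db.is_file():
        for table, rows in table_row_counts(db).items():
            print(f"{table}: {rows}")
```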

The platforms with the most vulnerability metadata entries are Ubuntu, NVD (NIST's National Vulnerability Database), and SUSE.

Chart showing the upstream providers, Ubuntu is the biggest, Chainguard is pretty new

While these results are somewhat interesting when considering where Grype data comes from, the number here reflects many factors, mainly the date the provider started recording vulnerabilities and, for platform-specific providers, the attack surface of the platform. Other details such as duplication of distros lower the signal here and would require more analysis to parse out.
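
A per-provider tally along these lines can be sketched as follows. It assumes the vulnerability_metadata table has a namespace column whose values start with the provider name followed by a colon (as in ubuntu:distro:ubuntu:20.04 or nvd:cpe); check this against your own copy of the database. Rolling all namespaces up to their first segment is my simplification, and `metadata_entries_by_provider` is a name invented for this example.

```python
import sqlite3
from collections import Counter
from pathlib import Path


def metadata_entries_by_provider(db_path: Path) -> list[tuple[str, int]]:
    """Count vulnerability_metadata rows per provider, largest first."""
    conn = sqlite3.connect(f"{Path(db_path).resolve().as_uri()}?mode=ro", uri=True)
    try:
        # Assumption: vulnerability_metadata has a `namespace` column.
        rows = conn.execute(
            "SELECT namespace, COUNT(*) FROM vulnerability_metadata "
            "WHERE namespace IS NOT NULL GROUP BY namespace"
        ).fetchall()
    finally:
        conn.close()
    totals: Counter = Counter()
    for namespace, count in rows:
        # Roll per-release namespaces (e.g. one per Ubuntu version) up to the provider.
        totals[namespace.split(":")[0]] += count
    return totals.most_common()


if __name__ == "__main__":
    db = Path("build/vulnerability.db")
    if db.is_file():
        for provider, entries in metadata_entries_by_provider(db):
            print(f"{provider}: {entries}")
```

Note that summing across releases counts the same CVE once per release it appears in, which is one form of the distro duplication mentioned above.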

We can also check the number of vulnerability metadata entries by year:

Chart showing number of metadata entries ramping up from the late 90s and stabilizing around 2016-2017

This chart mainly shows a movement toward maturity in the ecosystem, leading to some stability after 2016-2017. (The early twenty-teens were a time of rapid development in cloud technologies in particular.)
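
One way to approximate a by-year breakdown is to read the year out of each CVE identifier. This is a sketch under assumptions: it presumes vulnerability_metadata has an id column holding identifiers such as CVE-2024-37383, it uses the year embedded in the ID (which may not be how the chart above was produced), and it skips identifiers that don't start with CVE-. The function name is invented for this example.

```python
import re
import sqlite3
from collections import Counter
from pathlib import Path

CVE_YEAR = re.compile(r"^CVE-(\d{4})-")


def metadata_entries_by_cve_year(db_path: Path) -> dict[int, int]:
    """Count vulnerability_metadata rows by the year embedded in the CVE ID."""
    counts: Counter = Counter()
    conn = sqlite3.connect(f"{Path(db_path).resolve().as_uri()}?mode=ro", uri=True)
    try:
        # Assumption: vulnerability_metadata has an `id` column with the CVE identifier.
        for (vuln_id,) in conn.execute("SELECT id FROM vulnerability_metadata"):
            found = CVE_YEAR.match(vuln_id or "")
            if found:
                counts[int(found.group(1))] += 1
    finally:
        conn.close()
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    db = Path("build/vulnerability.db")
    if db.is_file():
        for year, entries in metadata_entries_by_cve_year(db).items():
            print(year, entries)
```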

Digging into the data in Grype's vulnerability.db can also help to answer much more specific questions about how CVEs affect different platforms. For example, imagine we host a mailserver and we wish to know the fixed status of CVE-2024-37383, a known-exploited vulnerability which allows cross-site scripting in Roundcube, a webmail client. We can narrow the data in the vulnerability table to answer this question:

fix_state  namespace
fixed      nvd:cpe
fixed      debian:distro:debian:11
fixed      debian:distro:debian:12
fixed      debian:distro:debian:13
fixed      debian:distro:debian:unstable
not-fixed  ubuntu:distro:ubuntu:20.04
not-fixed  ubuntu:distro:ubuntu:22.04
fixed      ubuntu:distro:ubuntu:23.10
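
A lookup like this one can be sketched with a single parameterized query. The fix_state and namespace column names come from the table above; that the CVE identifier lives in a column named id is my assumption, so adjust it if your schema differs. `fix_states` is a helper name made up for this example.

```python
import sqlite3
from pathlib import Path


def fix_states(db_path: Path, cve_id: str) -> list[tuple[str, str]]:
    """Return distinct (fix_state, namespace) pairs recorded for one CVE."""
    conn = sqlite3.connect(f"{Path(db_path).resolve().as_uri()}?mode=ro", uri=True)
    try:
        # Assumption: the vulnerability table stores the CVE identifier in `id`.
        return conn.execute(
            "SELECT DISTINCT fix_state, namespace FROM vulnerability "
            "WHERE id = ? ORDER BY namespace",
            (cve_id,),
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    db = Path("build/vulnerability.db")
    if db.is_file():
        for state, namespace in fix_states(db, "CVE-2024-37383"):
            print(f"{state:<10} {namespace}")
```

Passing the CVE ID as a query parameter rather than formatting it into the SQL string keeps the lookup safe to reuse with arbitrary input.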

In a follow-up post, I'll show you how to load Grype's vulnerability.db database into Pandas to get a better sense of Grype's data schema and how it can be used to answer specific questions in platform security and broader CVE trends.

Conclusion

One of the most remarkable aspects of the Grype image scanner is the openness of its data pipeline. This is great for transparency and makes Grype's vulnerability.db into a more flexible and useful tool. If you've found this post useful, let us know and maybe you'll see more security deep dives like this. 🛡️🤿 You can follow Chainguard on dev.to or LinkedIn or sign up for our newsletter to keep in touch.

Top comments (1)

Erika Heidi

Nice! Thank you for doing this deep dive, it's very interesting to see what goes underneath the Grype ecosystem.