Jacob for AWS Community Builders

AWS Glue vulnerabilities in default packages

Securing AWS Glue: A Guide to Identifying and Fixing Python Package Vulnerabilities

Introduction

Did you know that the default Python packages in AWS Glue contain a number of known vulnerabilities? While instances, containers, and Lambda functions are often scanned by tools like Amazon Inspector, Trivy, and Snyk, data pipelines are frequently overlooked. Whether by accident or design, many data pipelines, often laden with Python code, interact with external systems and APIs to ingest data. Securing these pipelines is therefore just as important as securing any other part of your infrastructure.

In this post, I’ll walk you through how to enhance the security of your AWS Glue data pipelines. The first issue I encountered is that these pipelines often combine system and runtime dependencies with application code. AWS Glue and Apache Airflow both provide Python environments with pre-installed packages, along with the option to add custom ones.

AWS Glue

For this post, I'll focus specifically on the AWS Glue environment.

AWS Glue allows you to create three types of jobs:

  1. Glue ETL (PySpark), documented under "Glue ETL Python Libraries"
  2. Python Shell, documented under "Python Shell Jobs in AWS Glue"
  3. Ray (not supported)

A Glue ETL job spins up an on-demand Spark environment, while a Python Shell job is more akin to a Lambda function: it doesn't have Lambda's 15-minute time limit, but it does have limited capacity.

Exporting System Requirements

While browsing the Glue documentation, I came across tables listing the pre-installed Python packages. I wrote a small program to parse these tables and export them to a requirements.txt file.

For a Python Shell job using Python 3.9, this is the output:

awscli==1.23.5
botocore==1.23.5

For Python Shell jobs, there's also an option to set library-set to analytics, which provides a set of commonly used packages, including the useful AWS SDK for pandas (awswrangler). Note, however, that the included version is fairly outdated:

avro==1.11.0
awscli==1.23.5
awswrangler==2.15.1
botocore==1.24.21
boto3==1.21.21
elasticsearch==8.2.0
numpy==1.22.3
pandas==1.4.2
psycopg2==2.9.3
pyathena==2.5.3
PyMySQL==1.0.2
pyodbc==4.0.32
pyorc==0.6.0
redshift-connector==2.0.907
requests==2.27.1
scikit-learn==1.0.2
scipy==1.8.0
SQLAlchemy==1.4.36
s3fs==2022.3.0

Now we have the system dependencies in a workable format.
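
The table-to-requirements step described above could be sketched roughly as follows. This is not the program I used, just a minimal illustration with the standard library. It assumes the package name and version sit in the first two cells of each table row; `SAMPLE_HTML` is a made-up stand-in for the documentation page, and real pages may need extra handling (for example, one column per Python version).

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of <td> cells, grouped per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being parsed
        self._cell = None   # text fragments of the cell being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_requirements(html):
    """Turn (package, version) table rows into requirements.txt lines."""
    parser = TableRows()
    parser.feed(html)
    # Header rows use <th>, so they produce empty rows and are skipped here.
    pins = [f"{row[0]}=={row[1]}" for row in parser.rows
            if len(row) >= 2 and row[0] and row[1]]
    return "\n".join(pins)


# Hypothetical excerpt; the real documentation markup may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Package</th><th>Version</th></tr>
  <tr><td>awscli</td><td>1.23.5</td></tr>
  <tr><td><code>botocore</code></td><td>1.23.5</td></tr>
</table>
"""

print(table_to_requirements(SAMPLE_HTML))
```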

Run-Time Dependencies

AWS Glue also allows you to install additional packages at runtime using pip. You can extend or override the pre-installed Python packages as needed.

For more details, check the official AWS Glue Programming Python Libraries documentation.
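
As a rough illustration of the runtime option: Glue jobs accept a `--additional-python-modules` job parameter containing a comma-separated list of pip requirements. The helper below is a hypothetical sketch that only builds that parameter from a mapping of pins; the function name and the example pins are my own, and whether a given pin is compatible with the rest of the runtime still has to be tested per job.

```python
def additional_modules_argument(pins):
    """Build the job-parameter dict for a mapping of package -> version."""
    modules = ",".join(
        f"{name}=={version}" for name, version in sorted(pins.items())
    )
    return {"--additional-python-modules": modules}


# Example pins only; check them against your own job before using them.
arguments = additional_modules_argument({"urllib3": "1.26.19", "idna": "3.7"})
print(arguments)
```

The resulting dict can be merged into the job's default arguments with whatever tooling you already use to define Glue jobs.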

Glue Inspector

With the above information, I created a tool called Glue Inspector. It downloads the AWS system dependencies, caches them locally, and then retrieves runtime dependencies. These are merged into a list and exported as a CycloneDX Software Bill of Materials (SBOM) in JSON format.

To use it:

  1. Set your AWS credentials in the environment.
  2. Run the following command to inspect a Glue job:
glue-inspector inspect mygluejob --output mygluejob-sbom.json

You can then use the resulting SBOM to manage the software supply chain with tools like DependencyTrack, or scan for vulnerabilities using tools like Trivy:

trivy sbom mygluejob-sbom.json --scanners vuln,license --list-all-pkgs -d --format cyclonedx --output mygluejob-sbom-trivy.json

I’ve just released version 0.2.0 of Glue Inspector.
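
To make the merge-and-export idea concrete, here is a minimal sketch of how system and runtime dependencies could be combined into a CycloneDX-style document. It is not Glue Inspector's actual code: the function names, the choice of spec version 1.5, and the sample packages are assumptions for illustration, and a real SBOM would carry more metadata.

```python
import json


def merge_dependencies(system, runtime):
    """Merge two package -> version mappings; runtime pins override system ones."""
    merged = {name.lower(): version for name, version in system.items()}
    merged.update({name.lower(): version for name, version in runtime.items()})
    return merged


def to_cyclonedx(packages):
    """Build a minimal CycloneDX-style dict with one library component per package."""
    components = [
        {
            "type": "library",
            "name": name,
            "version": version,
            # Package URLs for PyPI use lowercase names with "-" instead of "_".
            "purl": f"pkg:pypi/{name.replace('_', '-')}@{version}",
        }
        for name, version in sorted(packages.items())
    ]
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",  # assumed version for this sketch
        "version": 1,
        "components": components,
    }


# Illustrative inputs: two pre-installed packages and one runtime override.
system_packages = {"urllib3": "1.25.10", "idna": "2.10"}
runtime_packages = {"urllib3": "1.26.19"}

merged = merge_dependencies(system_packages, runtime_packages)
bom = to_cyclonedx(merged)
print(json.dumps(bom, indent=2))
```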

AWS Vulnerabilities in Glue

While working on this tool, I was surprised by the number of critical and high-severity vulnerabilities present in the default packages. I filed a report with AWS Security, and after weeks of waiting, I was told that the runtime is isolated and therefore not considered an AWS system issue. However, users are encouraged to update their packages as needed.

I believe more awareness is needed in this area.

Glue Runtime Vulnerabilities

Here’s an overview of vulnerabilities in the Glue runtimes:

Runtime                    Critical  High  Medium  Low
glueetl-2.0                5         12    12      1
glueetl-3.0                4         16    20      2
glueetl-4.0                4         14    18      2
glueetl-5.0                0         6     11      3
pythonshell-3.6            1         1     6       0
pythonshell-3.9            0         0     0       0
pythonshell-3.9-analytics  1         1     3       0
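
Per-severity counts like these can be derived from a CycloneDX report. The sketch below is one possible way to do it, not necessarily how the table above was produced. It assumes the report has a top-level `vulnerabilities` list whose entries carry `ratings` with a lowercase `severity`, as in the CycloneDX schema, and it counts each finding once under its highest rating. A scanner may pick the severity differently, so the numbers can deviate. The `sample_report` dict is a hand-written stand-in for a real file.

```python
from collections import Counter

SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def count_severities(bom):
    """Count vulnerabilities per severity, using each finding's highest rating."""
    counts = Counter()
    for vuln in bom.get("vulnerabilities", []):
        severities = {rating.get("severity") for rating in vuln.get("ratings", [])}
        for level in SEVERITY_ORDER:
            if level in severities:
                counts[level] += 1
                break  # count every finding only once
    return {level: counts[level] for level in SEVERITY_ORDER}


# Hand-written sample in the assumed shape; load a real report with json.load().
sample_report = {
    "vulnerabilities": [
        {"id": "CVE-2024-6345", "ratings": [{"severity": "high"}]},
        {"id": "CVE-2024-3651",
         "ratings": [{"severity": "medium"}, {"severity": "low"}]},
    ]
}

print(count_severities(sample_report))
```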

Vulnerabilities in AWS Glue 5.0 GlueETL

Here are the findings for the newly released Glue ETL 5.0 runtime. It has no critical vulnerabilities, but 6 high-severity, 11 medium-severity, and 3 low-severity ones:

Package       Severity  Id                   Installed Version  Fixed Version   Title
Pygments      MEDIUM    CVE-2022-40896       2.7.4              2.15.0          pygments: ReDoS in pygments
aiohttp       MEDIUM    CVE-2024-42367       3.10.1             3.10.2          aiohttp: python-aiohttp: Compressed files as symlinks are not protected from path traversal
aiohttp       MEDIUM    CVE-2024-52304       3.10.1             3.10.11         aiohttp: aiohttp vulnerable to request smuggling due to incorrect parsing of chunk extensions
cryptography  HIGH      CVE-2023-0286        36.0.1             39.0.1          openssl: X.400 address type confusion in X.509 GeneralName
cryptography  HIGH      CVE-2023-50782       36.0.1             42.0.0          python-cryptography: Bleichenbacher timing oracle attack against RSA decryption - incomplete fix for CVE-2020-25659
cryptography  MEDIUM    CVE-2023-23931       36.0.1             39.0.1          python-cryptography: memory corruption via immutable objects
cryptography  MEDIUM    CVE-2023-49083       36.0.1             41.0.6          python-cryptography: NULL-dereference when loading PKCS7 certificates
cryptography  MEDIUM    CVE-2024-0727        36.0.1             42.0.2          openssl: denial of service via null dereference
cryptography  LOW       GHSA-5cpq-8wj7-hf2v  36.0.1             41.0.0          Vulnerable OpenSSL included in cryptography wheels
cryptography  LOW       GHSA-jm77-qphf-c4w8  36.0.1             41.0.3          pyca/cryptography's wheels include vulnerable OpenSSL
cryptography  LOW       GHSA-v8gr-m533-ghj9  36.0.1             41.0.4          Vulnerable OpenSSL included in cryptography wheels
idna          MEDIUM    CVE-2024-3651        2.10               3.7             python-idna: potential DoS via resource consumption via specially crafted inputs to idna.encode()
pip           MEDIUM    CVE-2023-5752        21.3.1             23.3            pip: Mercurial configuration injectable in repo revision when installing via pip
pip           MEDIUM    CVE-2023-5752        22.3.1             23.3            pip: Mercurial configuration injectable in repo revision when installing via pip
setuptools    HIGH      CVE-2022-40897       59.6.0             65.5.1          pypa-setuptools: Regular Expression Denial of Service (ReDoS) in package_index.py
setuptools    HIGH      CVE-2024-6345        59.6.0             70.0.0          pypa/setuptools: Remote code execution via download functions in the package_index module in pypa/setuptools
urllib3       HIGH      CVE-2021-33503       1.25.10            1.26.5          python-urllib3: ReDoS in the parsing of authority part of URL
urllib3       HIGH      CVE-2023-43804       1.25.10            2.0.6, 1.26.17  python-urllib3: Cookie request header isn't stripped during cross-origin redirects
urllib3       MEDIUM    CVE-2023-45803       1.25.10            2.0.7, 1.26.18  urllib3: Request body not stripped after redirect from 303 status changes request method to GET
urllib3       MEDIUM    CVE-2024-37891       1.25.10            1.26.19, 2.2.2  urllib3: proxy-authorization request header is not stripped during cross-origin redirects
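
When a package has several findings, a useful question is which single version clears all of them with the smallest jump. The sketch below is one way to work that out from the "Fixed Version" column. It is my own illustration, not part of any tool mentioned here: for each finding it prefers a fix on the installed major version line and then takes the highest of those choices. It only understands purely numeric versions such as 1.26.19.

```python
def parse_version(text):
    """Parse a purely numeric version such as '1.26.19' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def minimum_safe_version(installed, fixed_versions):
    """Return the lowest version that satisfies every finding's fix.

    Each entry of fixed_versions is one finding's 'Fixed Version' cell,
    which may list several release lines separated by commas.
    """
    current = parse_version(installed)
    target = current
    for fixed in fixed_versions:
        candidates = [parse_version(v) for v in fixed.split(",")]
        # Prefer a fix on the installed major line to limit breaking changes.
        same_major = [c for c in candidates if c[0] == current[0]]
        chosen = min(same_major or candidates)
        target = max(target, chosen)
    return ".".join(str(part) for part in target)


# Values copied from the urllib3 and cryptography rows above.
urllib3_target = minimum_safe_version(
    "1.25.10", ["1.26.5", "2.0.6, 1.26.17", "2.0.7, 1.26.18", "1.26.19, 2.2.2"]
)
cryptography_target = minimum_safe_version(
    "36.0.1",
    ["39.0.1", "42.0.0", "39.0.1", "41.0.6", "42.0.2", "41.0.0", "41.0.3", "41.0.4"],
)
print(urllib3_target, cryptography_target)
```

Staying on the 1.26 line for urllib3 avoids the jump to 2.x, which matters because other pre-installed packages may pin urllib3 below version 2.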

Mitigating Vulnerabilities

If your Glue jobs access external resources, be sure to update the required packages using the runtime installation option. However, this could lead to a "dependency hell" situation, so use your favorite tools or something like pur to help update the requirements.

Here’s an overview of some key packages that are outdated:

Updated aiobotocore: 2.13.1 -> 2.16.1
Updated aiohappyeyeballs: 2.3.5 -> 2.4.4
Updated aiohttp: 3.10.1 -> 3.11.11
Updated aioitertools: 0.11.0 -> 0.12.0
Updated aiosignal: 1.3.1 -> 1.3.2
Updated async-timeout: 4.0.3 -> 5.0.1
Updated attrs: 24.2.0 -> 24.3.0
Updated awscrt: 0.19.19 -> 0.23.6
Updated boto3: 1.34.131 -> 1.35.92
Updated botocore: 1.34.131 -> 1.35.92
Updated certifi: 2024.7.4 -> 2024.12.14
Updated cffi: 1.14.5 -> 1.17.1
Updated charset-normalizer: 3.3.2 -> 3.4.1
Updated colorama: 0.4.4 -> 0.4.6
Updated contourpy: 1.2.1 -> 1.3.1
Updated cryptography: 36.0.1 -> 44.0.0
Updated distlib: 0.3.1 -> 0.3.9
Updated distro: 1.5.0 -> 1.9.0
Updated docutils: 0.16 -> 0.21.2
Updated filelock: 3.0.12 -> 3.16.1
Updated fonttools: 4.53.1 -> 4.55.3
Updated frozenlist: 1.4.1 -> 1.5.0
Updated fsspec: 2024.6.1 -> 2024.12.0
Updated idna: 2.10 -> 3.10
Updated importlib_resources: 6.4.0 -> 6.5.2
Updated jmespath: 0.10.0 -> 1.0.1
Updated kiwisolver: 1.4.5 -> 1.4.8
Updated libcomps: 0.1.20 -> 0.1.21.post1
Updated matplotlib: 3.9.0 -> 3.10.0
Updated multidict: 6.0.5 -> 6.1.0
Updated numpy: 1.26.4 -> 2.2.1
Updated packaging: 24.1 -> 24.2
Updated pandas: 2.2.2 -> 2.2.3
Updated pillow: 10.4.0 -> 11.1.0
Updated pip: 21.3.1 -> 24.3.1
Updated pip: 22.3.1 -> 24.3.1
Updated plotly: 5.23.0 -> 5.24.1
Updated prompt-toolkit: 3.0.24 -> 3.0.48
Updated pyarrow: 17.0.0 -> 18.1.0
Updated pycparser: 2.20 -> 2.22
Updated Pygments: 2.7.4 -> 2.19.0
Updated pyparsing: 3.1.2 -> 3.2.1
Updated pytz: 2024.1 -> 2024.2
Updated requests: 2.32.2 -> 2.32.3
Updated ruamel.yaml: 0.16.6 -> 0.18.9
Updated ruamel.yaml.clib: 0.1.2 -> 0.2.12
Updated s3fs: 2024.6.1 -> 2024.12.0
Updated s3transfer: 0.10.2 -> 0.10.4
Updated setuptools: 59.6.0 -> 75.7.0
Updated six: 1.16.0 -> 1.17.0
Updated tzdata: 2024.1 -> 2024.2
Updated urllib3: 1.25.10 -> 2.3.0
Updated virtualenv: 20.4.0 -> 20.28.1
Updated wcwidth: 0.2.5 -> 0.2.13
Updated wrapt: 1.16.0 -> 1.17.0
Updated yarl: 1.9.4 -> 1.18.3
Updated zipp: 3.19.2 -> 3.21.0

Luckily, Glue 5.0 now supports the use of a requirements.txt file uploaded to S3, which is then parsed by pip. See "Add custom Python modules" in the AWS Glue documentation.

This opens up the possibility of using local checks and tools like GitHub Dependabot to monitor your dependencies for vulnerabilities.
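
As a sketch of the requirements.txt route: to my understanding of the Glue 5.0 documentation, the file's S3 location goes into `--additional-python-modules` while `--python-modules-installer-option` is set to `-r`. Please verify this against the current AWS documentation before relying on it. The helper name and the bucket path below are made up for illustration.

```python
def requirements_file_arguments(s3_uri):
    """Build job parameters that point Glue at a requirements.txt stored in S3."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    return {
        # Assumed parameter combination; check the AWS Glue docs for your version.
        "--python-modules-installer-option": "-r",
        "--additional-python-modules": s3_uri,
    }


# Hypothetical bucket and key.
job_arguments = requirements_file_arguments(
    "s3://my-example-bucket/glue/requirements.txt"
)
print(job_arguments)
```

Keeping the same requirements.txt in your repository is what lets tools such as Dependabot watch it.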

Conclusion

  1. Data pipelines are applications and need to be treated with the same level of scrutiny as any other software. Managing their lifecycle is critical for security.

  2. Be aware of vulnerabilities in default runtimes, whether using AWS Glue, Apache Airflow, or other similar tools.

  3. Use Glue Inspector to scan your Glue jobs and generate an SBOM for better software supply chain management. SBOMs are becoming an industry standard, with requirements coming from regulations such as DORA (the EU Digital Operational Resilience Act) and from U.S. government standards for critical infrastructure.
