Background of the Problem
I spent quite some time figuring out how to install Python packages in AWS Glue inside a VPC without internet access, and I managed to get it working after some tinkering. As a reminder, AWS introduced support for installing Python packages via the --additional-python-modules option. While this is a lifesaver for those who started working with Glue 1.0, it only works if your Glue Job can connect to the internet.
Given the emphasis on security, a number of customers choose to restrict egress traffic from their VPC to the public internet, so they need another way to manage the packages used by their data pipelines.
This article focuses on that challenge. It is a step-by-step guide on how to set up your Glue Job to connect to a PyPI mirror via AWS CodeArtifact, allowing you to install packages in a Private Subnet. For this tutorial, it is recommended to have a working knowledge of the basics on AWS (e.g. Networking, Services). But I'll try my best to explain each part.
Let's get started!
Solution Overview
Fig. 1. Architecture for the AWS CodeArtifact and AWS Glue Integration
At the core of the solution is AWS CodeArtifact, which lets you securely store, publish, and share packages (in this case, PyPI packages) across your private network without connecting directly to the public PyPI repository. This is made possible by VPC Endpoints through PrivateLink connections.
You do need to create endpoints for S3 and CodeArtifact for this to work; otherwise, you'll get Connection timed out errors.
Here are some resources to help you out with that:
Gateway endpoints for Amazon S3
Create VPC endpoints for CodeArtifact - if via console, kindly follow the same steps as with the S3 Endpoint.
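To make the endpoint requirements more concrete, here is a minimal Python sketch that builds the three requests involved: one gateway endpoint for S3 and two interface endpoints for CodeArtifact (api and repositories). It assumes you would send them with a boto3 EC2 client; the function names and every ID are hypothetical placeholders, so treat it as an illustration rather than a drop-in script.

```python
def endpoint_requests(region, vpc_id, route_table_ids, subnet_ids, security_group_ids):
    """Return keyword arguments for ec2.create_vpc_endpoint, one dict per endpoint."""
    # S3 uses a gateway endpoint, which is attached to route tables.
    requests = [{
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }]
    # CodeArtifact needs two interface (PrivateLink) endpoints.
    for suffix in ("api", "repositories"):
        requests.append({
            "VpcEndpointType": "Interface",
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.codeartifact.{suffix}",
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        })
    return requests


def create_endpoints(ec2_client, requests):
    """Create each endpoint and return the new endpoint IDs."""
    return [
        ec2_client.create_vpc_endpoint(**request)["VpcEndpoint"]["VpcEndpointId"]
        for request in requests
    ]
```

With boto3 installed, this could be called as create_endpoints(boto3.client("ec2"), endpoint_requests(...)). Private DNS is left at the service default here, so verify that the repository hostname resolves inside the VPC, and make sure the security group on the interface endpoints allows HTTPS (port 443) from the Glue subnet.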
What you will need
An AWS account, of course
Note: Test this on your dev environment first
- AWS Glue
- AWS CodeArtifact
- Docker
- AWS Access Keys (with permissions on AWS CodeArtifact)
I won't go over these tools one by one, as I believe ChatGPT can give you their definitions and uses better than I can.
The Solution
In this section, I'll go over the step-by-step solution for each process.
Let's start by setting up our CodeArtifact Repository.
Setting up AWS CodeArtifact
Create a CodeArtifact Repository
Fill in the details:
- Repository Name
- Repository Details (Optional)
- Public upstream repositories - I chose PyPI
Select the domain
Specify your domain name
You should have the following repositories after creation:
<your-repo>
pypi-store
Now that's done, you can inspect the created repositories. The pypi-store repository was created automatically. The <your-repo> repository is the one that we're interested in, since this will contain our Python packages.
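The console steps above could also be scripted. Here is a minimal sketch, assuming a boto3 CodeArtifact client that you create yourself with boto3.client("codeartifact"); the function name is made up for illustration, and the calls are issued in the order the repositories depend on each other.

```python
def create_pypi_mirror(codeartifact_client, domain, repository, store="pypi-store"):
    """Create a domain, a store repo connected to public PyPI, and a repo upstreamed to it."""
    codeartifact_client.create_domain(domain=domain)
    # The store repository holds the external connection to the public PyPI index.
    codeartifact_client.create_repository(domain=domain, repository=store)
    codeartifact_client.associate_external_connection(
        domain=domain, repository=store, externalConnection="public:pypi"
    )
    # Your own repository pulls from the store repository as its upstream.
    response = codeartifact_client.create_repository(
        domain=domain,
        repository=repository,
        upstreams=[{"repositoryName": store}],
    )
    return response["repository"]["name"]
```

Running it against an existing domain or repository will raise an error, so this sketch only fits a fresh setup.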
With that, let's proceed with configuring your local environment.
Setting up your local environment
Step 1: Install Docker
Install here:
https://docs.docker.com/get-docker/
Step 2: Pull the Amazon Linux 2 Image
$ docker pull amazonlinux:2
Note that the latest tag now points to Amazon Linux 2023, so pin the 2 tag to get Amazon Linux 2.
Step 3: Run the container
Run the container and open an interactive shell inside it using -it:
$ docker run -it --rm -v /path/on/host:/path/in/container amazonlinux:2 /bin/bash
Some notes:
- -v /path/on/host:/path/in/container: This is the volume mount option. It mounts a directory from your host (/path/on/host) into the container (/path/in/container). Any changes made in the mounted directory inside the container will be reflected on the host directory and vice versa.
- --rm: This tells Docker to automatically remove the container when it exits. This means that once you're done with the bash session and exit, the container will be cleaned up, and no container filesystem will be left on your host system. Feel free to remove this option if you do not want your container to behave like that.
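Step 4: Install Python 3.10
A fresh amazonlinux container ships without a compiler, wget, or tar, so install the build tools first. You are already root inside the container, so sudo is not needed. Python 3.10 also needs OpenSSL 1.1.1 or newer headers to build its ssl module, which pip relies on for HTTPS.
$ yum install -y gcc make wget tar gzip zlib-devel bzip2-devel libffi-devel
$ wget https://www.python.org/ftp/python/3.10.0/Python-3.10.0.tgz
$ tar -xf Python-3.10.0.tgz
$ cd Python-3.10.0
$ ./configure --enable-optimizations
$ make altinstall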
Note that AWS Glue 4.0 runs Python 3.10. For other Glue versions, kindly refer to the documentation.
Step 5: Install AWS CLI
Using pip
$ pip install awscli
Step 6: Configure AWS Credentials
Refer to this for creating your access keys:
https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
After getting the values for the access keys, configure your AWS CLI:
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json
Step 7: Connect to Repository
Go back to the AWS Console and click on your created repository.
Click View connection instructions
Copy and run the command in Step 3 of the Connection instructions:
$ aws codeartifact login \
--tool pip \
--repository <your-repo-name> \
--domain <your-domain-name> \
--domain-owner <your-account-id> \
--region <your-region>
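Once successfully logged in, kindly note that pip no longer talks to the public PyPI directly. Every pip install command now goes through this repository: CodeArtifact fetches the package from its PyPI upstream and keeps a copy in the repository, in addition to installing it in the Python environment on the Docker container.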
Step 8: Install Python Packages
Install your packages!
Now that the repository is ready, we can install from AWS Glue using the PyPI mirror that we created!
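Before moving on to Glue, it can help to confirm that the packages actually landed in the repository. The following is a minimal sketch, assuming a boto3 CodeArtifact client is passed in; cached_versions is a hypothetical helper name, and it simply pages through the list_package_versions results.

```python
def cached_versions(codeartifact_client, domain, repository, package):
    """Return the versions of a PyPI package that the repository currently holds."""
    versions, token = [], None
    while True:
        kwargs = {
            "domain": domain,
            "repository": repository,
            "format": "pypi",
            "package": package,
        }
        if token:
            # Continue from where the previous page stopped.
            kwargs["nextToken"] = token
        page = codeartifact_client.list_package_versions(**kwargs)
        versions.extend(item["version"] for item in page.get("versions", []))
        token = page.get("nextToken")
        if not token:
            return versions
```

If the package was never installed through the repository, the underlying call raises an error instead of returning an empty list, which is itself a useful signal that Step 8 did not go through CodeArtifact.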
AWS CodeArtifact and AWS Glue Integration
This section discusses how you can point the installation of Python packages in AWS Glue to AWS CodeArtifact.
Step 1: Get the Authorization Token
We need to generate an authorization token from AWS CodeArtifact. This is done using this command:
$ aws codeartifact get-authorization-token \
--domain my_domain \
--domain-owner 111122223333 \
--query authorizationToken \
--output text
Note that the maximum duration of this token is 12 hours. And yes, you do need to generate a new one every day if you are planning to run your jobs daily.
Store the token in a .txt file.
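Since the token has to be regenerated regularly, you may want to fetch it from code rather than by hand. Here is a minimal sketch of that idea, assuming a boto3 CodeArtifact client is passed in; fetch_token and needs_refresh are hypothetical helper names, and the 30-minute safety margin is an arbitrary choice of mine.

```python
from datetime import datetime, timedelta, timezone

MAX_TOKEN_SECONDS = 12 * 60 * 60  # 43200 seconds, the 12-hour maximum


def fetch_token(codeartifact_client, domain, domain_owner, duration_seconds=MAX_TOKEN_SECONDS):
    """Request a token and return it together with its expiration time."""
    response = codeartifact_client.get_authorization_token(
        domain=domain, domainOwner=domain_owner, durationSeconds=duration_seconds
    )
    return response["authorizationToken"], response["expiration"]


def needs_refresh(expiration, now=None, margin=timedelta(minutes=30)):
    """Return True when the token expires within the safety margin."""
    now = now or datetime.now(timezone.utc)
    return expiration - now <= margin
```

A scheduled script could call fetch_token shortly before each job run and skip the call while needs_refresh still returns False.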
Step 2: Configure Job Details in Glue Job
Navigate to your Glue Job
I'm assuming you have already configured the Data Connections. If not, kindly configure them before proceeding to this step. The idea is that the Glue Job will run inside the Private Subnet of the VPC.
Under Job Parameters, add the following key-value pairs:
Parameter 1
Key - "--additional-python-modules" // without double quotes
Value - "<your-python-package>==<version>"
Parameter 2
Key - "--python-modules-installer-option"
Value - "--no-cache-dir --verbose --index-url https://aws:<CODEARTIFACT-AUTH-TOKEN>@<DOMAIN-NAME>-<ACCOUNT-ID>.d.codeartifact.<REGION-NAME>.amazonaws.com/pypi/pypi-store/simple/"
Change the following values:
- CODEARTIFACT-AUTH-TOKEN - refer to Step 1
- DOMAIN-NAME
- ACCOUNT-ID
- REGION-NAME
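Filling in that long value by hand is error-prone, so here is a small Python sketch that assembles it from the four placeholders. The function name installer_option is hypothetical; the URL layout and the pip flags are copied from Parameter 2 above.

```python
def installer_option(token, domain, account_id, region, repository="pypi-store"):
    """Build the value for the --python-modules-installer-option job parameter."""
    index_url = (
        f"https://aws:{token}@{domain}-{account_id}"
        f".d.codeartifact.{region}.amazonaws.com/pypi/{repository}/simple/"
    )
    return f"--no-cache-dir --verbose --index-url {index_url}"
```

For example, installer_option(token, "my_domain", "111122223333", "us-west-2") produces a string you can paste into the Value field as-is.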
Step 3: Run your Glue Job
After configuring all of that, run your Glue Job and check the CloudWatch Logs to confirm if it's being installed correctly. You should see some text there that says:
Looking in indexes: https://aws:****@test-mirror-1234561234.d.codeartifact.ap-southeast-1.amazonaws.com/pypi/pypi-store/simple/
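If you would rather check this programmatically, the log line is easy to test. Below is a minimal sketch using only the standard library; uses_codeartifact is a hypothetical helper, and it assumes you have already pulled the relevant line out of CloudWatch Logs yourself.

```python
from urllib.parse import urlsplit

PREFIX = "Looking in indexes:"


def uses_codeartifact(log_line):
    """Return True when every index pip reports is a CodeArtifact endpoint."""
    _, found, rest = log_line.partition(PREFIX)
    if not found:
        return False
    # pip separates multiple indexes with commas.
    urls = [url.strip() for url in rest.split(",")]
    hosts = [urlsplit(url).hostname or "" for url in urls]
    return all(
        ".d.codeartifact." in host and host.endswith(".amazonaws.com")
        for host in hosts
    )
```

A False result for a line that mentions pypi.org would mean the job is still trying to reach the public index.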
Kindly make sure that the IAM role that you are using for the Glue Job has permission to write to CloudWatch Logs; engineers often forget this. Also tick Enable logs in CloudWatch on the Glue Job.
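Because the token expires, one option is to pass both parameters when starting the run instead of storing them in the job definition. Here is a minimal sketch, assuming a boto3 Glue client is passed in; run_job_with_packages is a hypothetical name, and the option string is whatever value you would otherwise paste into Parameter 2.

```python
def run_job_with_packages(glue_client, job_name, modules, installer_option):
    """Start a Glue job run with the package parameters supplied at run time."""
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments={
            # Multiple packages are passed as one comma-separated string.
            "--additional-python-modules": ",".join(modules),
            "--python-modules-installer-option": installer_option,
        },
    )
    return response["JobRunId"]
```

Arguments passed this way apply only to that run. Keep in mind that the token is still visible in the run's arguments, so restrict who can view job runs.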
Wrap up
That's it! In this article, we demonstrated how we can leverage CodeArtifact to manage Python packages and modules for AWS Glue jobs that run inside a Private Subnet that has no internet access.
Do let me know if you have any questions on this, happy to answer any queries you might have.
Happy Coding, builders!
This blog is authored solely by me and reflects my personal opinions, not those of my employer. All references to products, including names, logos, and trademarks, belong to their respective owners and are used for identification purposes only.