AWS Glue is a managed service for data integration and ETL (Extract, Transform, Load) workloads, making it easier to prepare and transform data for analytics. If you’re looking to automate the creation of Glue jobs using Infrastructure as Code (IaC), AWS CDK (Cloud Development Kit) is a great choice. In this post, we’ll walk through the process of defining and deploying an AWS Glue job using AWS CDK. We will create a job that can connect to a cross-account RDS cluster and execute an ETL script.
Note: This article does not cover AWS CLI or CDK package setup.
Step 1: Define the VPC stack for Glue Job
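import { Construct } from 'constructs';
import {
  GatewayVpcEndpointAwsService,
  InterfaceVpcEndpointAwsService,
  IpAddresses,
  Peer,
  Port,
  SecurityGroup,
  SubnetType,
  Vpc,
} from 'aws-cdk-lib/aws-ec2';
// DeploymentStack and DeploymentStackProps are not part of aws-cdk-lib;
// import them from the package that defines them in your project.

export class GlueVpcStack extends DeploymentStack {
  public readonly vpc: Vpc;
  public readonly vpcDefaultSecurityGroupId: string;

  constructor(scope: Construct, id: string, props: DeploymentStackProps) {
    super(scope, id, props);

    // VPC with one public and one private subnet group; the private subnets reach the internet through a single NAT gateway
    const vpc = new Vpc(this, 'VPCForGlue', {
      ipAddresses: IpAddresses.cidr(Vpc.DEFAULT_CIDR_RANGE),
      subnetConfiguration: [
        {
          cidrMask: 24,
          name: 'Public',
          subnetType: SubnetType.PUBLIC,
        },
        {
          cidrMask: 24,
          name: 'Private',
          subnetType: SubnetType.PRIVATE_WITH_EGRESS,
        },
      ],
      natGateways: 1,
    });

    // Endpoints so the job can reach S3 and Secrets Manager from inside the VPC
    vpc.addGatewayEndpoint('S3GatewayEndpoint', {
      service: GatewayVpcEndpointAwsService.S3,
    });
    vpc.addInterfaceEndpoint('SecretsManagerEndpoint', {
      service: InterfaceVpcEndpointAwsService.SECRETS_MANAGER,
    });

    // Import the VPC's default security group as mutable so rules can be added to it
    const vpcDefaultSecurityGroup = SecurityGroup.fromSecurityGroupId(
      this,
      'SecurityGroup',
      vpc.vpcDefaultSecurityGroup,
      {
        allowAllOutbound: false,
        mutable: true,
      },
    );

    this.vpc = vpc;
    this.vpcDefaultSecurityGroupId = vpc.vpcDefaultSecurityGroup;

    // Outbound: allow all traffic. Inbound: self-referencing rule for all traffic.
    vpcDefaultSecurityGroup.addEgressRule(Peer.anyIpv4(), Port.allTraffic());
    vpcDefaultSecurityGroup.addIngressRule(Peer.securityGroupId(this.vpcDefaultSecurityGroupId), Port.allTraffic());
  }
}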
Note: We must update the default security group of the VPC to include a self-referencing inbound rule and an outbound rule that allows all traffic on all ports. Later, we attach this security group to an AWS Glue connection so that the network interfaces AWS Glue creates can communicate with each other within a private subnet.
Step 2: Define stack for Glue Job and Glue connection
import { Construct } from 'constructs';
import { CfnConnection, CfnJob } from 'aws-cdk-lib/aws-glue';
import { ManagedPolicy, Role, ServicePrincipal } from 'aws-cdk-lib/aws-iam';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { BucketDeployment, Source } from 'aws-cdk-lib/aws-s3-deployment';
// DeploymentStack and DeploymentStackProps are not part of aws-cdk-lib;
// import them, and GlueVpcStack from Step 1, from wherever they are defined in your project.

export class InfraStack extends DeploymentStack {
  constructor(scope: Construct, id: string, props: DeploymentStackProps, stageName: string) {
    super(scope, id, props);

    const vpcStack = new GlueVpcStack(this, id + '-VPC', props);

    //Creating an IAM role to let AWS Glue access required service
    const glueRole = new Role(this, id + '-GlueJobsRole', {
      roleName: 'GlueJobsRole-' + stageName,
      assumedBy: new ServicePrincipal('glue.amazonaws.com'),
    });
    glueRole.addManagedPolicy(ManagedPolicy.fromAwsManagedPolicyName('AmazonS3FullAccess'));
    glueRole.addManagedPolicy(ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSGlueServiceRole'));
    glueRole.addManagedPolicy(ManagedPolicy.fromAwsManagedPolicyName('SecretsManagerReadWrite'));

    //Creating an AWS Glue connection
    const glueConnection = new CfnConnection(this, 'GlueConnection', {
      catalogId: this.account,
      connectionInput: {
        connectionType: 'JDBC',
        connectionProperties: {
          JDBC_CONNECTION_URL: "jdbcUrl", //placeholder: replace with the JDBC URL of the RDS cluster
          SECRET_ID: "secretId", //secret that has DB credentials, we need to create this manually if
          JDBC_ENFORCE_SSL: true,
        },
        physicalConnectionRequirements: {
          securityGroupIdList: [vpcStack.vpcDefaultSecurityGroupId],
          subnetId: vpcStack.vpc.privateSubnets[0].subnetId,
          availabilityZone: vpcStack.vpc.privateSubnets[0].availabilityZone,
        },
        name: 'GlueConnection',
        description: 'GlueConnection',
      },
    });

    //Creating a bucket to keep scripts run in job
    const testBucket = this.createBucket(id.toLowerCase() + '-testgluejobscripts', id + '-testgluejobscripts');
    new BucketDeployment(this, 'DeployTestScripts', {
      sources: [Source.asset('test_glue_job_scripts')], //This folder should be present under root of CDK package
      destinationBucket: testBucket,
    });

    const job = new CfnJob(this, 'TestGlueJob', {
      name: 'TestGlueJob',
      role: glueRole.roleArn,
      command: {
        name: 'pythonshell',
        pythonVersion: '3.9',
        //script_name.py is a placeholder for a script inside the test_glue_job_scripts folder
        scriptLocation: `s3://${testBucket.bucketName}/script_name.py`,
      },
      glueVersion: '4.0',
      executionProperty: {
        maxConcurrentRuns: 1,
      },
      connections: {
        connections: ['GlueConnection'], //This should match with the name set inside new CfnConnection() constructor
      },
    });
  }

  private createBucket(name: string, id: string) {
    return new Bucket(this, id, {
      enforceSSL: true,
      bucketName: name,
    });
  }
}
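The connection in Step 2 uses the literal string "jdbcUrl" as a stand-in for the real JDBC URL. As a minimal sketch of how that value could be assembled, the helper below builds a URL of the form `jdbc:<engine>://<host>:<port>/<database>`. The function name `buildJdbcUrl`, the `JdbcTarget` type, the choice of PostgreSQL and MySQL as engines, and the example endpoint are all assumptions made for illustration; they are not part of the original stack.

```typescript
// Hypothetical helper (not part of the original stack): builds the string
// that would replace the "jdbcUrl" placeholder in JDBC_CONNECTION_URL.
type JdbcEngine = 'postgresql' | 'mysql';

interface JdbcTarget {
  engine: JdbcEngine;
  host: string; // endpoint of the RDS cluster in the other account
  port: number;
  database: string;
}

function buildJdbcUrl(target: JdbcTarget): string {
  const { engine, host, port, database } = target;
  if (!host || !database) {
    throw new Error('host and database are required');
  }
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid port: ${port}`);
  }
  return `jdbc:${engine}://${host}:${port}/${database}`;
}

// Example with made-up values; substitute your own cluster endpoint.
const exampleUrl = buildJdbcUrl({
  engine: 'postgresql',
  host: 'my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com',
  port: 5432,
  database: 'analytics',
});
console.log(exampleUrl);
```

The returned string can then be passed as the `JDBC_CONNECTION_URL` value instead of the placeholder.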
Step 3: Allow Amazon RDS to accept network traffic from AWS Glue
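For this, we update the security group attached to the Amazon RDS cluster and allowlist the Elastic IP address attached to the NAT gateway of the AWS Glue VPC. Traffic from the Glue job leaves its VPC through that NAT gateway, so the RDS side sees the NAT gateway's Elastic IP as the source address. This approach assumes the cluster is reachable over a public endpoint.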
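As a rough sketch of what that inbound rule looks like, the helper below turns a NAT gateway Elastic IP into a single-host `/32` CIDR and pairs it with the database port. The names `buildRdsIngressRule` and `IngressRule`, the example IP address, and the port are hypothetical and only illustrate the shape of the rule.

```typescript
// Hypothetical helper: describes the inbound rule to add to the RDS
// security group for the Glue VPC's NAT gateway Elastic IP.
interface IngressRule {
  cidr: string; // single-host CIDR for the NAT gateway's Elastic IP
  port: number; // database port, e.g. 5432 for PostgreSQL
  description: string;
}

function buildRdsIngressRule(natElasticIp: string, dbPort: number): IngressRule {
  const ip = natElasticIp.trim();
  const octets = ip.split('.');
  const isIpv4 =
    octets.length === 4 &&
    octets.every((o) => /^\d{1,3}$/.test(o) && Number(o) <= 255);
  if (!isIpv4) {
    throw new Error(`not an IPv4 address: ${natElasticIp}`);
  }
  if (!Number.isInteger(dbPort) || dbPort < 1 || dbPort > 65535) {
    throw new Error(`invalid port: ${dbPort}`);
  }
  return {
    cidr: `${ip}/32`, // /32 restricts the rule to exactly one address
    port: dbPort,
    description: 'Allow AWS Glue traffic via NAT gateway',
  };
}

// Example with a documentation-range IP; use the real Elastic IP instead.
const rule = buildRdsIngressRule('203.0.113.10', 5432);
console.log(rule);
```

If the RDS security group is also managed with CDK in the RDS account, the result could be applied with something like `rdsSecurityGroup.addIngressRule(Peer.ipv4(rule.cidr), Port.tcp(rule.port), rule.description)`, where `rdsSecurityGroup` is an assumed reference to that security group. The same rule can be added by hand in the console.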
Step 4: Deploy the CDK Stack
cdk deploy
Step 5: Verify the Glue Job
Once the deployment is complete, navigate to the AWS Glue console to verify that the job has been created. The job should appear with the specified configuration and script. You can then run the job from the console and check that it produces the expected outcome.
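If you prefer to check a run programmatically rather than in the console, the run status reported by Glue can be reduced to a simple outcome. The sketch below assumes the run state names of the Glue API's `JobRunState` field as I recall them; `classifyRun` and `RunOutcome` are hypothetical names, and how you fetch the state (CLI, SDK, or console) is left open.

```typescript
// Run states as recalled from the Glue API's JobRunState field (assumption).
type JobRunState =
  | 'STARTING'
  | 'WAITING'
  | 'RUNNING'
  | 'STOPPING'
  | 'STOPPED'
  | 'SUCCEEDED'
  | 'FAILED'
  | 'TIMEOUT'
  | 'ERROR'
  | 'EXPIRED';

type RunOutcome = 'succeeded' | 'failed' | 'in-progress';

// Hypothetical helper: maps a job run state to a coarse outcome.
function classifyRun(state: JobRunState): RunOutcome {
  switch (state) {
    case 'SUCCEEDED':
      return 'succeeded';
    case 'FAILED':
    case 'TIMEOUT':
    case 'ERROR':
    case 'STOPPED':
    case 'EXPIRED':
      return 'failed';
    default:
      // STARTING, WAITING, RUNNING, STOPPING: the run has not finished yet
      return 'in-progress';
  }
}

console.log(classifyRun('RUNNING'));
console.log(classifyRun('SUCCEEDED'));
```

A run that ends in the failed group usually points to the connection setup from Steps 2 and 3, so the security group rules and the secret are good places to look first.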