Practical Way to Use AWS Glue with Postgresql

andre aliaman

AWS Glue is a serverless data integration service provided by Amazon as part of Amazon Web Services. It lets you discover, prepare, and move data between data stores, and it automatically provisions and manages the compute resources your jobs require.

As a popular ETL service, Glue offers numerous options for connecting to various databases, including PostgreSQL, a widely used RDBMS.

Glue provides several ways to set up ETL (Extract, Transform, Load) processes, as shown below:

Creating AWS Glue Job

With its visual setup, performing ETL tasks becomes much easier.

Visual Setup Detail

You only need a few clicks to create an ETL job that helps transform data from an S3 input to a PostgreSQL output.

Visual Setup Detail2

However, this setup has several restrictions: you are limited to the transformations and connection options the visual editor exposes, so more complex jobs may not fit this workflow.

If you are looking for more flexibility in configuration, you can consider using a script setup.

With a script setup, you can connect to your data source or output directly from the script. To do this, switch from the visual setup to the script page as shown below:

Script Page

For the code, you can use simple scripts like the following:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize Glue context and job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data from S3
s3_path = 's3://your-S3-REPO/'
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [s3_path]},
    format="csv",  # Adjust format as necessary
    format_options={"withHeader": True, "separator": ","}
)

datasource.printSchema()

# Map the columns and cast the string fields to their target types
transformed = ApplyMapping.apply(
    frame = datasource, 
    mappings = [
        ("id", "string", "id", "int"),
        ("name", "string", "name", "string"),
        ("age", "string", "age", "int")
    ]
)

transformed.printSchema()

# Write data to PostgreSQL
glueContext.write_dynamic_frame.from_options(
    frame = transformed,
    connection_type = "postgresql",
    connection_options = {
        "url": "jdbc:postgresql://your-PostgresqlDB-Endpoint",
        "dbtable": "your_table",
        "user": "your-Posgresql-User",
        "password": "your-Posgresql-Password"
    }
)

# Commit the job
job.commit()
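
Note that hardcoding database credentials in the job script is fine for a quick test, but for real jobs you will want to pull them from somewhere safer. Below is a minimal sketch using AWS Secrets Manager; the secret name your-postgres-secret, its JSON keys, and the region are assumptions for illustration:

import json
import boto3

# Fetch database credentials from AWS Secrets Manager
# (secret name, key names, and region are assumptions for this example)
secrets_client = boto3.client("secretsmanager", region_name="us-east-1")
secret_value = secrets_client.get_secret_value(SecretId="your-postgres-secret")
credentials = json.loads(secret_value["SecretString"])

connection_options = {
    "url": f"jdbc:postgresql://{credentials['host']}:5432/{credentials['dbname']}",
    "dbtable": "your_table",
    "user": credentials["username"],
    "password": credentials["password"],
}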

And for the input, you can use a CSV format file like this:

id,name,age
1,John Doe,30
2,Jane Smith,15
3,Bob Yellow,20
4,Roshan Brown,18
5,Bam Black,55
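
If the file is not in S3 yet, you can upload it with a couple of lines of boto3; the local filename, bucket name, and key below are placeholders:

import boto3

# Upload the sample CSV to the bucket the Glue job reads from
# (local filename, bucket name, and key are placeholders)
s3 = boto3.client("s3")
s3.upload_file("people.csv", "your-S3-REPO", "people.csv")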

After that, you can start the job and wait until it finishes. If it succeeds, you will see a status like this:

AWS Glue Job Status
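
If you prefer to start and monitor the job from code instead of the console, a minimal sketch with boto3 could look like this; the job name your-glue-job is a placeholder:

import boto3

# Start the Glue job and check the state of the run
# (the job name is a placeholder for this example)
glue = boto3.client("glue")
run = glue.start_job_run(JobName="your-glue-job")
status = glue.get_job_run(JobName="your-glue-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])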

Once the run succeeds, you can check the loaded rows in your PostgreSQL database.
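
To verify the result from a script rather than a GUI client, a quick check with psycopg2 might look like this; the connection details are the same placeholders used in the job:

import psycopg2

# Connect and read back the rows the Glue job wrote
# (host, database, and credentials are placeholders)
conn = psycopg2.connect(
    host="your-PostgresqlDB-Endpoint",
    dbname="your-database",
    user="your-Postgresql-User",
    password="your-Postgresql-Password",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT id, name, age FROM your_table ORDER BY id;")
    for row in cur.fetchall():
        print(row)
conn.close()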

That's it for this article. Leave a comment below with your thoughts! Thanks.
