Practical Way to Use AWS Glue with Postgresql

andre aliaman

AWS Glue is a serverless data integration service provided by Amazon as part of Amazon Web Services. It lets you discover, prepare, and move data between data stores, and it automatically provisions and manages the compute resources your jobs require.

As a popular ETL service, Glue offers numerous options for connecting to various databases, including PostgreSQL, a widely used RDBMS.

Glue provides several ways to set up ETL (Extract, Transform, Load) processes, as shown below:

Creating AWS Glue Job

With its visual setup, performing ETL tasks becomes much easier.

Visual Setup Detail

You only need a few clicks to create an ETL job that helps transform data from an S3 input to a PostgreSQL output.

Visual Setup Detail2

However, this setup has several restrictions: you are limited to the transformations and connection options the visual editor exposes, so more complex jobs may not fit this workflow.

If you are looking for more flexibility in configuration, you can consider using a script setup.

With a script setup, you can connect to your data source or output directly from the script. To do this, switch from the visual setup to the script page as shown below:

Script Page

For the code, you can use simple scripts like the following:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize Glue context and job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read data from S3
s3_path = 's3://your-S3-REPO/'
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [s3_path]},
    format="csv",  # Adjust format as necessary
    format_options={"withHeader": True, "separator": ","}
)

datasource.printSchema()

# Map the columns and cast the string fields to their target types
transformed = ApplyMapping.apply(
    frame = datasource, 
    mappings = [
        ("id", "string", "id", "int"),
        ("name", "string", "name", "string"),
        ("age", "string", "age", "int")
    ]
)

transformed.printSchema()

# Write data to PostgreSQL
glueContext.write_dynamic_frame.from_options(
    frame = transformed,
    connection_type = "postgresql",
    connection_options = {
        "url": "jdbc:postgresql://your-PostgresqlDB-Endpoint",
        "dbtable": "your_table",
        "user": "your-Posgresql-User",
        "password": "your-Posgresql-Password"
    }
)

# Commit the job
job.commit()
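
Note that hardcoding database credentials in the job script is fine for a quick test, but for real jobs you will want to pull them from somewhere safer. Below is a minimal sketch using AWS Secrets Manager; the secret name your-postgres-secret, its JSON keys, and the region are assumptions for illustration:

import json
import boto3

# Fetch database credentials from AWS Secrets Manager
# (secret name, key names, and region are assumptions for this example)
secrets_client = boto3.client("secretsmanager", region_name="us-east-1")
secret_value = secrets_client.get_secret_value(SecretId="your-postgres-secret")
credentials = json.loads(secret_value["SecretString"])

connection_options = {
    "url": f"jdbc:postgresql://{credentials['host']}:5432/{credentials['dbname']}",
    "dbtable": "your_table",
    "user": credentials["username"],
    "password": credentials["password"],
}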

And for the input, you can use a CSV format file like this:

id,name,age
1,John Doe,30
2,Jane Smith,15
3,Bob Yellow,20
4,Roshan Brown,18
5,Bam Black,55
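
If the file is not in S3 yet, you can upload it with a couple of lines of boto3; the local filename, bucket name, and key below are placeholders:

import boto3

# Upload the sample CSV to the bucket the Glue job reads from
# (local filename, bucket name, and key are placeholders)
s3 = boto3.client("s3")
s3.upload_file("people.csv", "your-S3-REPO", "people.csv")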

After that, you can start the job and wait until it finishes. If it succeeds, you will see a status like this:

AWS Glue Job Status
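
If you prefer to start and monitor the job from code instead of the console, a minimal sketch with boto3 could look like this; the job name your-glue-job is a placeholder:

import boto3

# Start the Glue job and check the state of the run
# (the job name is a placeholder for this example)
glue = boto3.client("glue")
run = glue.start_job_run(JobName="your-glue-job")
status = glue.get_job_run(JobName="your-glue-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])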

Once the run succeeds, you can check the loaded rows in your PostgreSQL database.
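
To verify the result from a script rather than a GUI client, a quick check with psycopg2 might look like this; the connection details are the same placeholders used in the job:

import psycopg2

# Connect and read back the rows the Glue job wrote
# (host, database, and credentials are placeholders)
conn = psycopg2.connect(
    host="your-PostgresqlDB-Endpoint",
    dbname="your-database",
    user="your-Postgresql-User",
    password="your-Postgresql-Password",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT id, name, age FROM your_table ORDER BY id;")
    for row in cur.fetchall():
        print(row)
conn.close()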

That's it for this article. Leave a comment below with your thoughts! Thanks.
