Almost every application requires some kind of config or parameters to start in the expected state, and AWS Glue applications are no different.
Our code is supposed to run in three different environments (accounts): DEV, TEST, and PROD. Several configuration values were required, e.g. the log level, an SNS topic (for status updates), and a few more.
The documentation mentions special parameters; however, they are not all the arguments you can expect to get. We will explore this later in this section.
During my work on the project, there was only one set of arguments, DefaultArguments, which could be overwritten prior to a job start. By the time of writing this article, there are two sets, DefaultArguments and NonOverridableArguments, where the latter has been added recently.
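As a sketch, the two sets live side by side in the job definition passed to the CreateJob API. The job name, role ARN, and S3 paths below are placeholders, and the actual API call is left commented out:

```python
# Illustrative job definition showing both argument sets;
# names, ARNs, and S3 paths are placeholders.
job_definition = {
    "Name": "my-pyshell-job",
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
    "Command": {
        "Name": "pythonshell",
        "ScriptLocation": "s3://my-bucket/my_pyshell_job.py",
    },
    # can be overridden per run (e.g. to bump the log level)
    "DefaultArguments": {"--log_level": "WARN"},
    # fixed for every run, regardless of what the caller passes
    "NonOverridableArguments": {"--environment": "PROD"},
}

# import boto3
# boto3.client("glue").create_job(**job_definition)
```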
Some of these arguments were supplied as SSM Parameters while others were submitted as DefaultArguments. This can be very useful in case the job fails and we'd like to run it again with a different log level, e.g. the default WARN vs a non-default one.
To add or change an argument of the job prior to its run, you can either use the console:
Security configuration, script libraries, and job parameters -> Job parameters
Or, when using the CLI/API, add your argument into the DefaultArguments section of the job definition.
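For a per-run override via the API, the sketch below shows the idea; the job name is illustrative and the boto3 call is commented out so the snippet stays self-contained:

```python
# Arguments passed to StartJobRun win over DefaultArguments for that run only;
# the job name is a placeholder.
run_request = {
    "JobName": "my-pyshell-job",
    "Arguments": {"--log_level": "DEBUG"},  # overrides the default value for this run
}

# import boto3
# boto3.client("glue").start_job_run(**run_request)
```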
When I started on my journey, the function getResolvedOptions was not available for Python Shell jobs (it got implemented later), and I also planned to create a config object which holds the necessary configuration for the job.
There is a difference between the implementation of awsglue present in PySpark jobs and the one present in Python Shell jobs.
The code used inside Python Shell jobs is this.
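The original listing is not reproduced here, but a simplified approximation of the function (my sketch, not the verbatim awsglue source) behaves like this:

```python
import argparse

def getResolvedOptions(args, options):
    # simplified approximation: every requested option becomes a required
    # --argument, which is why all DefaultArguments end up mandatory
    parser = argparse.ArgumentParser()
    for option in options:
        parser.add_argument("--" + option, required=True)
    parsed, _unknown = parser.parse_known_args(args[1:])
    return vars(parsed)
```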
The main problem of this function is that it makes all DefaultArguments required. This is rather clumsy considering that it also requires you to use -- (a double dash) in front of your argument, a prefix generally used for optional arguments.
It is possible to make arguments optional by wrapping this function, as suggested in this StackOverflow answer. However, this is rather a workaround which may break if the AWS team decides to fix the behaviour.
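The idea behind that workaround can be sketched like this (my paraphrase, with a local stand-in for getResolvedOptions so the snippet runs outside Glue too):

```python
import argparse

try:
    from awsglue.utils import getResolvedOptions  # available inside Glue
except ImportError:
    # local stand-in mimicking the "everything is required" behaviour
    def getResolvedOptions(args, options):
        parser = argparse.ArgumentParser()
        for option in options:
            parser.add_argument("--" + option, required=True)
        parsed, _unknown = parser.parse_known_args(args[1:])
        return vars(parsed)

def get_resolved_options_optional(argv, options):
    # only ask for options that were actually passed on the command line,
    # so the rest become effectively optional
    present = [opt for opt in options if "--" + opt in argv]
    return getResolvedOptions(argv, present)
```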
Also, when specifying DefaultArguments via the console, it feels more natural not to include --, as the UI does not mention it at all.
My first few jobs were only using PySpark, and I discovered that there are some additional arguments present in sys.argv which are used in the examples inside the developer guide but not described. To get a description of these arguments, one should visit the AWS Glue API docs page, which is a bit hidden because there is only one direct link pointing there from the developer guide.
Here are the arguments present in sys.argv for a PySpark job (Glue 1.0):
[ 'script_2020-06-24-07-06-36.py', '--JOB_NAME', 'my-pyspark-job', '--JOB_ID', 'j_dfbe1590b8a1429eb16a4a7883c0a99f1a47470d8d32531619babc5e283dffa7', '--JOB_RUN_ID', 'jr_59e400f5f1e77c8d600de86c2c86cefab9e66d8d64d3ae937169d766d3edce52', '--job-bookmark-option', 'job-bookmark-disable', '--TempDir', 's3://aws-glue-temporary-<accountID>-us-east-1/admin' ]
The arguments JOB_NAME, JOB_ID, and JOB_RUN_ID can be used for self-reference from inside the job without hard coding the JOB_NAME in your code.
This could be a very useful feature for self-configuration or some sort of state management. For example, you could use the boto3 client to access the job's connections and use them inside your code without specifying the connection name directly. Or, if your job has been triggered from a workflow, it would be possible to refer to the current workflow and its properties.
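For instance, a job could pull its own name out of sys.argv and then query the Glue API about itself. The helper below is hypothetical, the sample argv values are shortened, and the boto3 call is commented out so the sketch stays self-contained:

```python
def job_arg(argv, name):
    # read a '--NAME value' pair that Glue injects into sys.argv
    flag = "--" + name
    return argv[argv.index(flag) + 1] if flag in argv else None

# sample argv as seen inside a PySpark job (values shortened)
argv = ["script.py", "--JOB_NAME", "my-pyspark-job", "--JOB_RUN_ID", "jr_59e4"]
job_name = job_arg(argv, "JOB_NAME")

# With the job name in hand, the job can look itself up without hard coding:
# import boto3
# job = boto3.client("glue").get_job(JobName=job_name)
# connections = job["Job"]["Connections"]["Connections"]
```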
sys.argv of Python Shell jobs
[ '/tmp/glue-python-scripts-7pbpva1h/my_pyshell_job.py', '--job-bookmark-option', 'job-bookmark-disable', '--scriptLocation', 's3://aws-glue-scripts-133919474178-us-east-1/my_pyshell_job.py', '--job-language', 'python' ]
Above we can see the set of arguments available in a Python Shell job. They are a bit different from what we've got in the PySpark job, but the major problem is that the arguments JOB_NAME, JOB_ID, and JOB_RUN_ID are not available. This creates a very inconsistent developer experience and prevents self-reference from inside the job, which diminishes the potential of these parameters.
As I already mentioned, AWS Glue job logs are sent to AWS CloudWatch Logs.
There are two log groups for each job.
/aws-glue/python-jobs/output, which contains the stdout, and /aws-glue/python-jobs/error, which contains the stderr. Inside the log groups you can find the log stream of your job, named with the job run ID.
When the job is started, there are already two links helping you navigate to the particular log.
Even though the links are present, the log streams are not created until the job starts.
When using logging in your jobs, you may want to avoid logging to stderr or redirect it to stdout, because the error log stream is only created when the job finishes with failure.
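One way to follow that advice is to send everything through a stdout handler. This is a sketch; the logger name and format string are arbitrary choices of mine:

```python
import logging
import sys

def get_logger(level="INFO"):
    # route all records to stdout so they land in the .../output log stream;
    # nothing goes to stderr unless the job itself fails
    logger = logging.getLogger("glue-job")
    logger.setLevel(level)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```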
Glue 1.0 PySpark job logs are very verbose and contain a lot of "clutter" unrelated to your code; it comes from Spark's underlying services. This issue has been addressed in Glue 2.0, where the exposure to logs of unrelated services is minimal and you can comfortably focus on your own logs. Good job, AWS Team!
Python Shell jobs do not suffer from this, and you can expect to get exactly what you log.
And that's it about the config and logging. In the next episode, we are going to look into packaging and deployment.
The code for the examples in this article can be found in my GitHub repository aws-glue-monorepo-style