<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jan Gazda</title>
    <description>The latest articles on DEV Community by Jan Gazda (@1oglop1).</description>
    <link>https://dev.to/1oglop1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F428534%2F25d5d137-ec6f-443f-838e-d6a8a4079ac6.jpeg</url>
      <title>DEV Community: Jan Gazda</title>
      <link>https://dev.to/1oglop1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/1oglop1"/>
    <language>en</language>
    <item>
      <title>Remote/Virtual pair programming</title>
      <dc:creator>Jan Gazda</dc:creator>
      <pubDate>Tue, 18 May 2021 05:10:03 +0000</pubDate>
      <link>https://dev.to/1oglop1/virtual-pair-programming-bj5</link>
      <guid>https://dev.to/1oglop1/virtual-pair-programming-bj5</guid>
      <description>&lt;p&gt;Whenever you are learning or working together on a common codebase with your friends and colleagues you often need to discuss "things" and code review on GitHub is not enough.&lt;/p&gt;

&lt;p&gt;Fortunately, other people had the same ideas and thought how nice it would be if you could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;See the code on your computer (not via sharing the screen)&lt;/li&gt;
&lt;li&gt;Edit the code together&lt;/li&gt;
&lt;li&gt;Follow the cursor of one another while navigating the text&lt;/li&gt;
&lt;li&gt;Interact with an integrated terminal on the other side&lt;/li&gt;
&lt;li&gt;Speak with each other without using another app&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In May 2017 Microsoft announced &lt;a href="https://visualstudio.microsoft.com/services/live-share/"&gt;Visual Studio Live Share&lt;/a&gt;, which changed how you can interact with your co-coder.&lt;/p&gt;

&lt;p&gt;Live Share is available for VSCode as well as for Visual Studio.&lt;/p&gt;

&lt;p&gt;As a long-time PyCharm user I longed for the same feature and often opened VSCode just so I could share the code and explain my thought process. &lt;br&gt;
After a long time, JetBrains finally delivered a similar feature with &lt;a href="https://www.jetbrains.com/code-with-me/"&gt;Code with me&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Comparison&lt;/h2&gt;

&lt;h3&gt;Live Share&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uses Microsoft servers hosted on Azure (requires public internet access)&lt;/li&gt;
&lt;li&gt;To initiate a Live Share session you must sign in with a Microsoft or GitHub account.&lt;/li&gt;
&lt;li&gt;Share the code via VSCode, Visual Studio or a Web browser&lt;/li&gt;
&lt;li&gt;Main Features:

&lt;ul&gt;
&lt;li&gt;Co-editing - lets multiple participants edit the code&lt;/li&gt;
&lt;li&gt;Following and focusing - follow the cursor of others&lt;/li&gt;
&lt;li&gt;Share terminals - see/use the terminal of the session organiser&lt;/li&gt;
&lt;li&gt;Share server/Share port - expose a local server (eg. &lt;a href="https://localhost:8000"&gt;https://localhost:8000&lt;/a&gt;) to the participants so they can access your app.&lt;/li&gt;
&lt;li&gt;Unlimited session length&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read &lt;a href="https://docs.microsoft.com/en-us/visualstudio/liveshare/overview/features#features"&gt;more detailed documentation about the Live Share features&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Code With Me&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uses JetBrains servers, but also offers an on-premises solution for closed networks.&lt;/li&gt;
&lt;li&gt;Does not require any account to share the session&lt;/li&gt;
&lt;li&gt;The free session is limited to 3 participants and 30 minutes&lt;/li&gt;
&lt;li&gt;Not all IntelliJ IDEs are supported yet&lt;/li&gt;
&lt;li&gt;Requires the Code With Me client app (included in the IDE or standalone)&lt;/li&gt;
&lt;li&gt;Main features:

&lt;ul&gt;
&lt;li&gt;Audio and Video Calls - hear and see each other&lt;/li&gt;
&lt;li&gt;Simultaneous editing - lets multiple participants edit the code&lt;/li&gt;
&lt;li&gt;Following - follow the cursor of others &lt;/li&gt;
&lt;li&gt;Terminal access - see/use the terminal of the session organiser&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read &lt;a href="https://www.jetbrains.com/code-with-me/"&gt;more detailed documentation about the Code With Me&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;I used Live Share for quite a long time, it helped me a lot, and I cannot imagine working in a team without it!&lt;br&gt;
It's quite easy to use, however, VSCode sometimes required me to sign in before sharing each session, which I find a bit annoying, mainly because I want to sign in with a browser other than my default one. &lt;br&gt;
I also noticed that it sometimes takes quite a long time to connect, or the session does not start correctly, so you have to re-create it.&lt;br&gt;
Also, switching between two IDEs "just to share" the code wasn't ideal.&lt;br&gt;
Performance-wise I haven't noticed any difference, and working in the remote or local editor worked as expected.&lt;/p&gt;

&lt;p&gt;When Code With Me became publicly available I switched to it and was very pleasantly surprised by how smooth the integration with PyCharm is. I haven't tried it on a large codebase yet, so I cannot comment on performance.&lt;br&gt;
The 30-minute limit on free sessions could be a limiting factor, but also motivation to keep the meetings short!&lt;/p&gt;

&lt;p&gt;As you can see, both solutions offer quite similar sets of features with a few differences, so you may still need to switch between IDEs in some cases, but competition in this sector will lead to better tools and support.&lt;/p&gt;

&lt;p&gt;Let me know in the comment section which solution you use, why, and what features you'd like to have. And leave a like if you liked the article!&lt;/p&gt;

&lt;p&gt;Update 09-11-2021: Code With Me can now also do port forwarding and screen sharing.&lt;/p&gt;

</description>
      <category>codenewbie</category>
      <category>remote</category>
      <category>vscode</category>
      <category>codereview</category>
    </item>
    <item>
      <title>AWS Glue first experience - part 5 - Glue Workflow, monitoring and rants</title>
      <dc:creator>Jan Gazda</dc:creator>
      <pubDate>Tue, 01 Sep 2020 08:23:46 +0000</pubDate>
      <link>https://dev.to/1oglop1/aws-glue-first-experience-part-5-glue-workflow-monitoring-and-rants-51jo</link>
      <guid>https://dev.to/1oglop1/aws-glue-first-experience-part-5-glue-workflow-monitoring-and-rants-51jo</guid>
      <description>&lt;p&gt;In this episode, we are going to look at AWS Glue Workflow, mention time-consuming tasks during development and wrap up.&lt;/p&gt;

&lt;h2&gt;Challenge number 7: the Workflow&lt;/h2&gt;

&lt;p&gt;To define ETL pipelines AWS Glue offers a feature called Workflow, where you can orchestrate your Crawlers and Jobs into a flow using predefined triggers. Workflow provides a visual representation of your ETL pipeline and some level of monitoring.&lt;/p&gt;

&lt;p&gt;This, in theory, is a nice thing to have for your ETL pipeline, however I discovered a lot of problems that prevented me from using its promised potential.&lt;/p&gt;

&lt;h3&gt;Definition&lt;/h3&gt;

&lt;p&gt;Defining the workflow via the AWS Console is quite simple.&lt;/p&gt;

&lt;p&gt;The user interface resembles the AWS CloudFormation template Designer, however with a very limited set of features.&lt;/p&gt;

&lt;p&gt;Adding your jobs and triggers to the workflow graph feels quite bulky because the graph screen does not scale well in the browser (at least on a 13" MacBook).&lt;/p&gt;

&lt;p&gt;Due to the number of required parameters, each click brings up a modal window where you need to fill in or select something. Navigation is mouse-only, with no keyboard shortcuts, which requires a lot of clicking around, especially if you just want to remove a few nodes. It is not possible to select multiple components or delete a previous chain.&lt;/p&gt;

&lt;p&gt;A delete action on a trigger is immediately performed on the actual resource, which can take quite some time. The workflow graph is automatically saved after each change, which can be dangerous if one is not careful when editing a workflow that is triggered by cron.&lt;/p&gt;

&lt;p&gt;Also, undo and redo buttons are not present, so the only way to recover a modified workflow is from a previous run (if there was any).&lt;/p&gt;

&lt;p&gt;Defining the workflow with IaC, either Terraform or AWS CloudFormation, is much more difficult, because the workflow is defined as a set of trigger resources bound to the workflow resource, where each trigger contains a reference to a Job or Crawler resource, so the final flow is not obvious at all. The best way to define the workflow is therefore to use the user interface, export it using &lt;code&gt;aws glue get-workflow --name&lt;/code&gt; or the equivalent API function, and then adapt the graph JSON into your IaC code.&lt;/p&gt;
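&lt;p&gt;As a rough sketch of that export step: the workflow name and node payload below are made up, and the response dict is a trimmed illustration of the documented &lt;code&gt;get_workflow&lt;/code&gt; output shape; in real code it would come from boto3's &lt;code&gt;glue.get_workflow(Name=..., IncludeGraph=True)&lt;/code&gt;. The helper just turns the graph JSON into a readable flow before you adapt it to IaC.&lt;/p&gt;

```python
# Illustrative, trimmed response; a real one comes from
#   aws glue get-workflow --name my-workflow --include-graph
# or boto3's glue.get_workflow(Name="my-workflow", IncludeGraph=True).
response = {
    "Workflow": {
        "Name": "my-workflow",
        "Graph": {
            "Nodes": [
                {"Type": "TRIGGER", "Name": "start-cron", "UniqueId": "n1"},
                {"Type": "JOB", "Name": "raw_to_refined", "UniqueId": "n2"},
            ],
            "Edges": [{"SourceId": "n1", "DestinationId": "n2"}],
        },
    }
}

def describe_flow(workflow_response):
    """Resolve edge ids to node names so the flow is readable."""
    graph = workflow_response["Workflow"]["Graph"]
    names = {n["UniqueId"]: f'{n["Type"]}:{n["Name"]}' for n in graph["Nodes"]}
    return [
        f'{names[e["SourceId"]]} -> {names[e["DestinationId"]]}'
        for e in graph["Edges"]
    ]

print(describe_flow(response))  # prints ['TRIGGER:start-cron -> JOB:raw_to_refined']
```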

&lt;h3&gt;State management&lt;/h3&gt;

&lt;p&gt;Just like jobs have &lt;code&gt;DefaultArguments&lt;/code&gt;, the workflow has run properties which can be accessed or modified prior to or during the workflow run. This provides a good way of sharing state or specific config.&lt;/p&gt;

&lt;p&gt;Workflow properties are submitted to your job via system arguments, e.g. &lt;code&gt;'--WORKFLOW_NAME', 'my-workflow', '--WORKFLOW_RUN_ID', 'wr_1e927676cc826831704712dc067b46480042ca7aead5caa6bea4fc587311e29d'&lt;/code&gt;. Yet again, this was inconsistent between PySpark and Python Shell jobs, but it seems it no longer is.&lt;/p&gt;
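&lt;p&gt;A minimal sketch of picking those two system arguments out of &lt;code&gt;sys.argv&lt;/code&gt; (the helper name is mine, not part of any Glue library; once you have the values, the run properties can be fetched with boto3's &lt;code&gt;glue.get_workflow_run_properties(Name=..., RunId=...)&lt;/code&gt;):&lt;/p&gt;

```python
def workflow_context(argv):
    """Extract the workflow name and run id that Glue appends to sys.argv."""
    wanted = ("--WORKFLOW_NAME", "--WORKFLOW_RUN_ID")
    out = {}
    for i, token in enumerate(argv):
        if token in wanted:
            out[token.lstrip("-")] = argv[i + 1]
    return out

# Inside a job this would be workflow_context(sys.argv); the values then feed
#   glue.get_workflow_run_properties(Name=ctx["WORKFLOW_NAME"],
#                                    RunId=ctx["WORKFLOW_RUN_ID"])
ctx = workflow_context(
    ["script.py", "--WORKFLOW_NAME", "my-workflow", "--WORKFLOW_RUN_ID", "wr_1e9276"]
)
print(ctx["WORKFLOW_NAME"])  # prints my-workflow
```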

&lt;h3&gt;Execution &amp;amp; Monitoring&lt;/h3&gt;

&lt;p&gt;There are several ways to start the workflow, but the first node has to be an &lt;code&gt;on-demand&lt;/code&gt; or &lt;code&gt;schedule&lt;/code&gt; (cron) trigger. If you select &lt;code&gt;event&lt;/code&gt; and add your job as the first node, the workflow won't start.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgist.githubusercontent.com%2F1oglop1%2Ff5ee29229f3b9b0a10d354c2f59aa7ad%2Fraw%2Ff7bb5e1ec723c92c7b34928aeae7b5e59fc2f366%2Fworkflow-not-start.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgist.githubusercontent.com%2F1oglop1%2Ff5ee29229f3b9b0a10d354c2f59aa7ad%2Fraw%2Ff7bb5e1ec723c92c7b34928aeae7b5e59fc2f366%2Fworkflow-not-start.png" alt="AWS Glue Workflow Console view"&gt;&lt;/a&gt;&lt;br&gt;AWS Glue Workflow Console view
  &lt;/p&gt;

&lt;p&gt;When the workflow is started (manually or by the trigger) it still takes some time to kick off the first job (I experienced times between 1 and 4 minutes). There is also a delay of around 1 minute between jobs during the workflow run.&lt;/p&gt;

&lt;p&gt;From my observation, the workflow appears to be a pull-based system instead of the expected event-based system.&lt;/p&gt;

&lt;p&gt;This also means that any events you'd like to know about, e.g. workflow started, job started, job finished, have to be implemented inside the code of the job itself.&lt;/p&gt;

&lt;p&gt;This brings me to another point: the workflow is not able to tell you whether it finished successfully or not. The only states you can get at the moment are &lt;code&gt;'RUNNING'|'COMPLETED'|'STOPPING'|'STOPPED'&lt;/code&gt;. To find out which job or crawler failed via the AWS Console, you have to navigate to &lt;code&gt;Workflows&lt;/code&gt;, select the workflow, open the &lt;code&gt;History&lt;/code&gt; tab, select the &lt;code&gt;workflow run&lt;/code&gt; you want to investigate and click &lt;code&gt;View run details&lt;/code&gt;. After that, you can see which job failed.&lt;/p&gt;

&lt;p&gt;In case your workflow has been running for some time and accumulated a lot of &lt;code&gt;workflow runs&lt;/code&gt;, rendering the &lt;code&gt;History&lt;/code&gt; tab can take a noticeable amount of time.&lt;/p&gt;

&lt;p&gt;To investigate the workflow run via the CLI/API using the &lt;code&gt;get_workflow_run&lt;/code&gt; function (with the graph included), you have to iterate through two arrays, &lt;code&gt;JobDetails.JobRuns&lt;/code&gt; and &lt;code&gt;CrawlerDetails.Crawls&lt;/code&gt;, to find out what failed.&lt;/p&gt;
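&lt;p&gt;A rough sketch of that iteration: the node list below is a trimmed, made-up example following the documented &lt;code&gt;get_workflow_run&lt;/code&gt; graph shape; real code would read it from boto3's &lt;code&gt;glue.get_workflow_run(Name=..., RunId=..., IncludeGraph=True)&lt;/code&gt;, and the set of failure states checked here is an assumption, not an exhaustive list.&lt;/p&gt;

```python
# Illustrative, trimmed node payloads from a workflow run graph.
run_graph_nodes = [
    {"Type": "JOB", "Name": "raw_to_refined",
     "JobDetails": {"JobRuns": [{"Id": "jr_1", "JobRunState": "FAILED"}]}},
    {"Type": "CRAWLER", "Name": "refined-crawler",
     "CrawlerDetails": {"Crawls": [{"State": "COMPLETED"}]}},
]

def failed_nodes(nodes):
    """Collect workflow nodes whose job runs or crawls did not succeed."""
    failed = []
    for node in nodes:
        runs = node.get("JobDetails", {}).get("JobRuns", [])
        crawls = node.get("CrawlerDetails", {}).get("Crawls", [])
        states = [r.get("JobRunState") for r in runs] + [c.get("State") for c in crawls]
        if any(s in ("FAILED", "ERROR", "TIMEOUT") for s in states):
            failed.append(node["Name"])
    return failed

print(failed_nodes(run_graph_nodes))  # prints ['raw_to_refined']
```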

&lt;h2&gt;Challenge number 8: Time&lt;/h2&gt;

&lt;p&gt;As a developer, your most valuable asset is your time and the ability to iterate as fast and as often as possible. With AWS Glue you should prepare yourself for quite long breaks.&lt;/p&gt;

&lt;p&gt;This is a rough summary of how long certain operations take.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Start a PySpark Glue 1.0 job - it takes a minimum of 10 minutes to start the cluster and run your job, and then a minute or two to run your code. This can be really frustrating if you've made a typo which wasn't caught by the linter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start a PySpark Glue 2.0 job - fortunately the startup time has been significantly reduced; it took roughly 4-6 minutes (often much less) to run the code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start a Python Shell Glue 1.0 job - this takes roughly 2 minutes to start the execution of your code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Preparing dependencies - if you decide not to use the online editor, you have to perform a lot of operations to get your code up to the cloud, so this was the first task to automate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Checking logs in AWS CloudWatch Logs - there is another ~1-2 minute delay before the logs are populated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A workflow run takes longer than expected due to the polling mechanism.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;This was my first experience with the AWS Glue ETL service.&lt;br&gt;
In general, it was a good experience; as a fully managed service it simplifies a lot, especially when it comes to creating and maintaining a cluster.&lt;/p&gt;

&lt;p&gt;The service is being actively developed and things may change without you knowing. It is nice in general but still has a long way to go to become more user- and developer-friendly.&lt;/p&gt;

&lt;p&gt;Here is the list of my grumbles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The inability to update C based libraries (NumPy/Pandas) in PySpark jobs may pose a problem.&lt;/li&gt;
&lt;li&gt;Packaging, defining, and supplying dependencies for the jobs is clumsy.&lt;/li&gt;
&lt;li&gt;The startup time of PySpark Glue 1.0 is really long.&lt;/li&gt;
&lt;li&gt;The inconsistency in API and parameters prevents better implementations.&lt;/li&gt;
&lt;li&gt;Unnecessary workaround needed to implement optional arguments.&lt;/li&gt;
&lt;li&gt;Open-source code may not be up to date.&lt;/li&gt;
&lt;li&gt;Workflow interface and definition could have been more intuitive.&lt;/li&gt;
&lt;li&gt;Event-driven or extended functionality of the workflow must be developed outside the Glue service, e.g. using AWS Lambda and Step Functions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In future versions of AWS Glue, I'd like to see a little more focus on the development experience,&lt;br&gt;
especially addressing the inconsistent API between jobs, packaging/deployment, and &lt;a href="https://github.com/awslabs/aws-glue-libs" rel="noopener noreferrer"&gt;AWS Glue lib&lt;/a&gt; maintenance.&lt;/p&gt;

&lt;p&gt;The code for the examples in this article can be found in my GitHub repository &lt;a href="https://github.com/1oglop1/aws-glue-monorepo-style" rel="noopener noreferrer"&gt;aws-glue-monorepo-style&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>python</category>
      <category>serverless</category>
    </item>
    <item>
      <title>AWS Glue first experience - part 4 - Deployment &amp; packaging</title>
      <dc:creator>Jan Gazda</dc:creator>
      <pubDate>Sat, 29 Aug 2020 18:13:00 +0000</pubDate>
      <link>https://dev.to/1oglop1/aws-glue-first-experience-part-4-deployment-packaging-5708</link>
      <guid>https://dev.to/1oglop1/aws-glue-first-experience-part-4-deployment-packaging-5708</guid>
      <description>&lt;p&gt;In this episode, we are going to explore how can we reuse our code and how to deploy AWS Glue application which consists of more than one file.&lt;br&gt;
I expected the workflow to be very similar to AWS Lambda which is already well known and optimised for python but due to involvement of Spark this is not entirely true for AWS Glue.&lt;/p&gt;
&lt;h2&gt;Challenge number 5: Stay DRY&lt;/h2&gt;

&lt;p&gt;Because the initialisation process for all data sources and jobs was similar, I decided it would be a good idea not to repeat myself and to create a library with a set of functions which parse arguments, get configuration data, and simplify the PySpark or Pandas interface.&lt;/p&gt;

&lt;p&gt;Each job type requires a different kind of dependency: PySpark takes &lt;code&gt;.py&lt;/code&gt; and &lt;code&gt;.zip&lt;/code&gt; files, Python Shell takes &lt;code&gt;.egg&lt;/code&gt; or &lt;code&gt;.whl&lt;/code&gt; files. On top of that, all our code is held in a monorepo.&lt;/p&gt;

&lt;p&gt;I decided to create a simple python package with &lt;code&gt;setuptools&lt;/code&gt; and follow the &lt;a href="https://blog.ionelmc.ro/2014/05/25/python-packaging/"&gt;src structure&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This gives me enough flexibility to produce the needed formats and also to reference the library from inside &lt;code&gt;requirements.txt&lt;/code&gt;.&lt;/p&gt;
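&lt;p&gt;For illustration, a minimal &lt;code&gt;setup.py&lt;/code&gt; sketch for such an src-structured package; the project name and version are placeholders, not the real library:&lt;/p&gt;

```python
# Minimal src-layout packaging sketch; name and version are placeholders.
from setuptools import find_packages, setup

setup(
    name="glue_shared",
    version="0.1.0",
    package_dir={"": "src"},              # code lives under src/glue_shared
    packages=find_packages(where="src"),
)
```

With this in place, &lt;code&gt;python setup.py bdist_wheel&lt;/code&gt; (with the &lt;code&gt;wheel&lt;/code&gt; package installed) produces the &lt;code&gt;.whl&lt;/code&gt; needed by Python Shell jobs, while the same sources can be zipped for PySpark jobs.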


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h2&gt;Challenge number 6: Deployment &amp;amp; packaging&lt;/h2&gt;

&lt;p&gt;Okay, so now that I have all the necessary components covered, let's put them together and deploy with &lt;a href="https://www.terraform.io"&gt;Terraform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For each data source, we have defined two transitions &lt;code&gt;raw to refined&lt;/code&gt; and &lt;code&gt;refined to curated&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;AWS Glue requires one &lt;code&gt;.py&lt;/code&gt; file as an entry point; the rest of the files must be plain &lt;code&gt;.py&lt;/code&gt; or contained inside a &lt;code&gt;.zip&lt;/code&gt; or &lt;code&gt;.whl&lt;/code&gt;, and each job should be able to have a different set of requirements.&lt;/p&gt;

&lt;p&gt;Another AWS Glue requirement is that the entry point script file and the dependencies have to be uploaded to S3.&lt;/p&gt;

&lt;p&gt;Anything uploaded to S3 then also has to be listed in &lt;a href="https://www.terraform.io"&gt;Terraform&lt;/a&gt; as a &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html"&gt;Special parameter&lt;/a&gt; &lt;code&gt;--extra-py-files&lt;/code&gt; in the form of a comma-separated list of S3 URLs, e.g. &lt;code&gt;s3://bucket/dep1.zip, s3://bucket/deb2.zip&lt;/code&gt; or &lt;code&gt;s3://bucket/dep1.whl, s3://bucket/deb2.whl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Since this list can be very dynamic, it's best to keep it as short as possible. As you can see, a number of operations are required from the developer, and developing more than one job takes significant effort. Therefore I decided to use the following structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/
├── glue/
│   ├── data_sources/
│   │   └── ds1/
│   │       ├── raw_to_refined/
│   │       │   ├── Makefile
│   │       │   ├── config.py
│   │       │   ├── raw_to_refined.py
│   │       │   └── requirements.txt
│   │       └── refined_to_curated/
│   │           ├── Makefile
│   │           ├── config.py
│   │           ├── another_dependency.py
│   │           ├── refined_to_curated.py
│   │           └── requirements.txt
│   └── shared/
│       └── glue_shared_lib/
│           ├── Makefile
│           ├── setup.py
│           ├── src
│           │   └── glue_shared/__init__.py
│           └── tests/
└── terraform/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Let's describe the structure above.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;/glue/&lt;/code&gt; holds all the python code&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;/glue/data_sources/&lt;/code&gt; holds the code of jobs for each data source&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;/glue/data_sources/ds1/&lt;/code&gt; - is the directory of one particular data source, containing its transformations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;/glue/data_sources/ds1/raw_to_refined&lt;/code&gt; and &lt;code&gt;/glue/data_sources/ds1/refined_to_curated&lt;/code&gt;&lt;br&gt;
are the transformations whose content is then deployed as a particular AWS Glue Job&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;/glue/shared/&lt;/code&gt; - contains shared items among the glue (jobs, files, etc...)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;/glue/shared/glue_shared_lib&lt;/code&gt; - is the library used by the jobs, contains configuration interface and other useful functions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;/terraform/&lt;/code&gt; holds all the resources required to deploy our Glue Jobs: IAM roles, Lambda functions, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we understand the structure, we can look closer at a particular job.&lt;/p&gt;
&lt;h3&gt;Glue Job structure&lt;/h3&gt;

&lt;p&gt;This is a standard blueprint which fits my purpose of developing and deploying several AWS Glue jobs.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ds1/
└── raw_to_refined/
   ├── Makefile
   ├── config.py
   ├── raw_to_refined.py
   └── requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In this case, we are looking at a transformation job from raw zone to refined zone.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Makefile&lt;/code&gt; - contains several make targets whose names are common across all jobs (&lt;code&gt;clean, package, test, upload-job, upload, deploy&lt;/code&gt;); the implementation of each target is job-specific.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;clean&lt;/code&gt; - Cleans up the local temporary files.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;package&lt;/code&gt; - For PySpark job creates a &lt;code&gt;.zip&lt;/code&gt; file with dependencies. For Python shell job it runs &lt;code&gt;pip&lt;/code&gt; and downloads all the wheel files.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;upload-job&lt;/code&gt; - uploads the entry point script to S3, useful for quick updates during development when you are not changing any dependent files.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;upload&lt;/code&gt; - uploads all related files (&lt;code&gt;.zip&lt;/code&gt;, &lt;code&gt;.whl&lt;/code&gt;) and the entry point &lt;code&gt;.py&lt;/code&gt; file to S3.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deploy&lt;/code&gt; - performs &lt;code&gt;clean&lt;/code&gt;, &lt;code&gt;package&lt;/code&gt; and &lt;code&gt;upload&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;config.py&lt;/code&gt; - is responsible for creating a configuration object. This is an extra &lt;code&gt;.py&lt;/code&gt; file which is later packaged and used as a dependency. For the sake of saving time I used a Python dictionary, but with the growing complexity of the job I'd recommend spending time on a better approach.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;raw_to_refined.py&lt;/code&gt; - this is the main entry point file executed by AWS Glue. You can use this file to execute the code in the dependencies or to implement the transformation logic directly. The name of this file is purposely the same as its parent directory, which will be explained later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;requirements.txt&lt;/code&gt; - is a standard &lt;a href="https://pip.pypa.io/en/latest/user_guide/#requirements-files"&gt;requirements file&lt;/a&gt;. It's a very simple way of managing your dependencies.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup gives me enough flexibility as a developer to run and update jobs in the cloud from within my local environment as well as using CI/CD. Another benefit is that if you have PySpark with Glue running locally, you can use that as well!&lt;/p&gt;
&lt;h3&gt;Terraform part&lt;/h3&gt;

&lt;p&gt;This is an example of deploying a PySpark job via Terraform; a Python Shell job follows the same process with a slight difference (mentioned later).&lt;/p&gt;

&lt;p&gt;To create or update the job via Terraform, we need to supply several parameters to the Glue API which the Terraform resource requires, plus the parameters our job expects.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;We need to provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Command.ScriptLocation&lt;/code&gt; - represented as &lt;code&gt;ds1_raw_to_refined_job_script_location&lt;/code&gt; - this is our entrypoint script&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;DefaultArguments&lt;/code&gt; - represented as the map &lt;code&gt;ds1_raw_to_refined_job_default_arguments&lt;/code&gt; - this holds the main configuration&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key &lt;code&gt;--extra-py-files&lt;/code&gt; in the map &lt;code&gt;ds1_raw_to_refined_job_default_arguments&lt;/code&gt; is a comma-separated string of S3 URLs pointing to our dependencies, e.g. &lt;code&gt;s3://bucket/dep1.zip,s3://bucket/deb2.zip&lt;/code&gt;&lt;/p&gt;
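&lt;p&gt;Building that string is easy to automate; a small sketch (the helper is mine, not part of any Glue tooling, and the bucket/key names are made up):&lt;/p&gt;

```python
def extra_py_files(bucket, keys):
    """Join dependency object keys into the comma-separated value Glue expects."""
    return ",".join(f"s3://{bucket}/{key}" for key in keys)

value = extra_py_files("bucket", ["dep1.zip", "deb2.zip"])
print(value)  # prints s3://bucket/dep1.zip,s3://bucket/deb2.zip
```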

&lt;p&gt;All extra dependencies fit in one &lt;code&gt;.zip&lt;/code&gt; file, and once you get the shape of these parameters there is no need to change it.&lt;/p&gt;

&lt;p&gt;This brings a potential problem of human oversight, especially with Python Shell jobs, where dependencies are wheels and a wheel name by default contains a version number, e.g. &lt;code&gt;numpy-1.19.0-cp36-cp36m-manylinux2010_x86_64.whl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Any change in &lt;code&gt;requirements.txt&lt;/code&gt; or the job arguments then also requires a change in the Terraform resource, which is maintained in a different directory.&lt;/p&gt;

&lt;p&gt;I haven't solved this problem during the project, but it could potentially be avoided by maintaining the list of dependencies in a file in the S3 bucket, generated during &lt;code&gt;make&lt;/code&gt;, which Terraform would then just download. However, this theoretical solution can lead to chicken-and-egg problems, and I wish AWS Glue had a better option to maintain dependencies and the job config. Just allowing an S3 prefix instead of the full URL would be a good start.&lt;/p&gt;

&lt;p&gt;The code for the examples in this article can be found in my GitHub repository &lt;a href="https://github.com/1oglop1/aws-glue-monorepo-style"&gt;aws-glue-monorepo-style&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>python</category>
      <category>datascience</category>
      <category>serverless</category>
    </item>
    <item>
      <title>AWS Glue first experience - part 3 - Arguments &amp; Logging</title>
      <dc:creator>Jan Gazda</dc:creator>
      <pubDate>Fri, 28 Aug 2020 13:28:13 +0000</pubDate>
      <link>https://dev.to/1oglop1/aws-glue-first-experience-part-3-arguments-logging-120</link>
      <guid>https://dev.to/1oglop1/aws-glue-first-experience-part-3-arguments-logging-120</guid>
      <description>&lt;h2&gt;
  
  
  Challenge number 3: Arguments &amp;amp; Config
&lt;/h2&gt;

&lt;p&gt;Almost every application requires some kind of config or parameters to start in the expected state; AWS Glue applications are no different.&lt;/p&gt;

&lt;p&gt;Our code is supposed to run in 3 different environments (accounts): DEV, TEST and PROD, and several configuration values were required, e.g. log level, SNS topic (for status updates) and a few more.&lt;/p&gt;

&lt;p&gt;The documentation mentions &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html" rel="noopener noreferrer"&gt;special parameters&lt;/a&gt;, however these are not all the arguments you can expect to receive. We will explore this later in this section.&lt;/p&gt;

&lt;p&gt;During my work on the project, there was only one set of &lt;code&gt;DefaultArguments&lt;/code&gt;, which could be overridden prior to job start. At the time of writing this article, there are two sets, &lt;code&gt;DefaultArguments&lt;/code&gt; and &lt;code&gt;NonOverridableArguments&lt;/code&gt;, where the latter has been added recently.&lt;/p&gt;

&lt;p&gt;Some of these arguments were supplied as SSM Parameters while others were submitted as &lt;code&gt;DefaultArguments&lt;/code&gt;. This can be very useful in case the job fails and we'd like to run it again with a different log level, e.g. the default &lt;code&gt;WARN&lt;/code&gt; vs a non-default &lt;code&gt;DEBUG&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To add or change an argument of the job prior to its run, you can either use the console&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Security configuration, script libraries, and job parameters -&amp;gt; Job parameters&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgist.githubusercontent.com%2F1oglop1%2Ff5ee29229f3b9b0a10d354c2f59aa7ad%2Fraw%2Fc4f3ea6f063db2903d522612f3e053f29dcab4e7%2Fjob-parameters.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgist.githubusercontent.com%2F1oglop1%2Ff5ee29229f3b9b0a10d354c2f59aa7ad%2Fraw%2Fc4f3ea6f063db2903d522612f3e053f29dcab4e7%2Fjob-parameters.png" alt="AWS Glue Job parameters"&gt;&lt;/a&gt;&lt;br&gt;AWS Glue Job parameters
  &lt;/p&gt;

&lt;p&gt;Or, when using the CLI/API, add your argument to the &lt;code&gt;DefaultArguments&lt;/code&gt; section.&lt;/p&gt;
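&lt;p&gt;Conceptually, the arguments passed at start time are merged over the job's &lt;code&gt;DefaultArguments&lt;/code&gt;. A tiny sketch of that resolution; the real call is boto3's &lt;code&gt;glue.start_job_run(JobName=..., Arguments={...})&lt;/code&gt;, shown only as a comment because it needs live AWS credentials, and the merge helper is mine:&lt;/p&gt;

```python
def effective_arguments(default_arguments, run_arguments):
    """Run-time Arguments override DefaultArguments with the same key."""
    merged = dict(default_arguments)
    merged.update(run_arguments)
    return merged

# In real code the override is passed to the job run:
#   glue.start_job_run(JobName="my-job", Arguments={"--log_level": "DEBUG"})
merged = effective_arguments({"--log_level": "WARN"}, {"--log_level": "DEBUG"})
print(merged["--log_level"])  # prints DEBUG
```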

&lt;p&gt;Then, inside the code of your job, you can use the built-in &lt;code&gt;argparse&lt;/code&gt; module or the &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html" rel="noopener noreferrer"&gt;getResolvedOptions&lt;/a&gt; function (&lt;code&gt;awsglue.utils.getResolvedOptions&lt;/code&gt;) provided by &lt;a href="https://github.com/awslabs/aws-glue-libs" rel="noopener noreferrer"&gt;aws-glue-lib&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When I started my journey, the function &lt;code&gt;getResolvedOptions&lt;/code&gt; was not available for Python Shell jobs, and I also planned to create a config object holding the necessary configuration for the job. It got implemented later.&lt;/p&gt;

&lt;p&gt;There is a difference between the implementation of &lt;code&gt;getResolvedOptions&lt;/code&gt; in the &lt;code&gt;awsglue&lt;/code&gt; present in PySpark jobs and the one present in Python Shell jobs.&lt;/p&gt;

&lt;p&gt;The code of the &lt;code&gt;awsglue&lt;/code&gt; used in PySpark jobs can be found on GitHub in the &lt;a href="https://github.com/awslabs/aws-glue-libs" rel="noopener noreferrer"&gt;aws-glue-lib&lt;/a&gt; repository. The main difference is that the PySpark version handles some cases of &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html" rel="noopener noreferrer"&gt;reserved arguments&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The code used inside Python Shell jobs is this.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
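A rough pure-Python sketch of what that implementation does (an illustration only, not the actual `awsglue` source): every requested name becomes a required `--` option.

```python
import argparse

def get_resolved_options_sketch(argv, option_names):
    """Approximation of the Python Shell behaviour: every requested
    option becomes a *required* '--' flag."""
    parser = argparse.ArgumentParser()
    for name in option_names:
        parser.add_argument(f"--{name}", required=True)
    # unknown extra arguments (e.g. Glue's own flags) are ignored
    parsed, _unknown = parser.parse_known_args(argv[1:])
    return vars(parsed)

args = get_resolved_options_sketch(
    ["job.py", "--JOB_NAME", "my-job", "--stage", "dev"],
    ["JOB_NAME", "stage"],
)
```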


&lt;p&gt;The main problem with this function is that it makes all &lt;code&gt;DefaultArguments&lt;/code&gt; &lt;strong&gt;required&lt;/strong&gt;. This is rather clumsy considering that it also requires you to prefix your arguments with &lt;code&gt;--&lt;/code&gt; (double dash), a convention generally used for optional arguments.&lt;/p&gt;

&lt;p&gt;It is possible to re-implement optional arguments by wrapping this function, as suggested in &lt;a href="https://stackoverflow.com/a/58083506" rel="noopener noreferrer"&gt;this StackOverflow answer&lt;/a&gt;. However, this is a workaround which may break if the AWS team decides to fix the behaviour.&lt;/p&gt;
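The idea of that workaround, sketched in plain Python (the function name and the inlined parsing are mine; a real job would forward the names actually found on `sys.argv` to `getResolvedOptions`):

```python
def resolve_with_optionals(argv, required, optional_defaults):
    """Resolve required args plus only those optional args present on argv."""
    present = [name for name in optional_defaults if f"--{name}" in argv]
    resolved = dict(optional_defaults)  # start from the defaults
    for name in list(required) + present:
        # in a real job: getResolvedOptions(argv, required + present)
        i = argv.index(f"--{name}")
        resolved[name] = argv[i + 1]
    return resolved

args = resolve_with_optionals(
    ["job.py", "--JOB_NAME", "my-job", "--batch-size", "10"],
    required=["JOB_NAME"],
    optional_defaults={"batch-size": "5", "dry-run": "false"},
)
```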

&lt;p&gt;Also, when specifying &lt;code&gt;DefaultArguments&lt;/code&gt; via the console, it feels more natural not to include &lt;code&gt;--&lt;/code&gt;, as the UI does not mention it at all.&lt;/p&gt;

&lt;h3&gt;
  
  
  Missing arguments in sys.argv
&lt;/h3&gt;

&lt;p&gt;My first few jobs used only PySpark, and I discovered that some additional arguments present in &lt;code&gt;sys.argv&lt;/code&gt; are used in the examples in the developer guide but never described. To get a description of these arguments, one should visit the &lt;a href="https://docs.aws.amazon.com/glue/latest/webapi/WebAPI_Welcome.html" rel="noopener noreferrer"&gt;AWS Glue API docs&lt;/a&gt;, a page that is easy to miss because only one direct link points there from the developer guide.&lt;/p&gt;

&lt;p&gt;Here are the arguments present in &lt;code&gt;sys.argv&lt;/code&gt; for a PySpark job (Glue 1.0).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

[
  'script_2020-06-24-07-06-36.py',
  '--JOB_NAME', 'my-pyspark-job',
  '--JOB_ID', 'j_dfbe1590b8a1429eb16a4a7883c0a99f1a47470d8d32531619babc5e283dffa7',
  '--JOB_RUN_ID', 'jr_59e400f5f1e77c8d600de86c2c86cefab9e66d8d64d3ae937169d766d3edce52',
  '--job-bookmark-option', 'job-bookmark-disable',
  '--TempDir', 's3://aws-glue-temporary-&amp;lt;accountID&amp;gt;-us-east-1/admin'
]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Parameters &lt;code&gt;JOB_NAME&lt;/code&gt;, &lt;code&gt;JOB_ID&lt;/code&gt;, &lt;code&gt;JOB_RUN_ID&lt;/code&gt; can be used for self-reference from inside the job without hard coding the &lt;code&gt;JOB_NAME&lt;/code&gt; in your code.&lt;/p&gt;

&lt;p&gt;This could be a very useful feature for self-configuration or some sort of state management. For example, you could use a &lt;code&gt;boto3&lt;/code&gt; client to look up the job's connections and use them inside your code without hard-coding the connection name. Or, if your job was triggered from a workflow, you could refer to the current workflow and its properties.&lt;/p&gt;
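A sketch of such a connection lookup (the `get_job` call and its response shape come from the Glue API; the client is passed in as a parameter here so the helper can be exercised without AWS credentials):

```python
def job_connections(glue_client, job_name):
    """Return the connection names attached to the given Glue job."""
    job = glue_client.get_job(JobName=job_name)["Job"]
    return job.get("Connections", {}).get("Connections", [])

# In a real PySpark job this would be:
#   job_connections(boto3.client("glue"), args["JOB_NAME"])
```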

&lt;p&gt;Let's explore the &lt;code&gt;sys.argv&lt;/code&gt; of a Python Shell job.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

[
  '/tmp/glue-python-scripts-7pbpva1h/my_pyshell_job.py',
  '--job-bookmark-option', 'job-bookmark-disable',
  '--scriptLocation', 's3://aws-glue-scripts-133919474178-us-east-1/my_pyshell_job.py',
  '--job-language', 'python'
]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Above we can see the set of arguments available in a Python Shell job.&lt;/p&gt;

&lt;p&gt;The arguments are a bit different from what we got in the PySpark job, but the major problem is that &lt;code&gt;JOB_NAME&lt;/code&gt;, &lt;code&gt;JOB_ID&lt;/code&gt; and &lt;code&gt;JOB_RUN_ID&lt;/code&gt; are not available.&lt;/p&gt;

&lt;p&gt;This creates a very inconsistent developer experience and &lt;strong&gt;prevents&lt;/strong&gt; self-reference from inside the job, which diminishes the potential of these parameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge number 4: Logging
&lt;/h2&gt;

&lt;p&gt;As I already mentioned, AWS Glue job logs are sent to AWS CloudWatch Logs.&lt;/p&gt;

&lt;p&gt;There are two log groups for each job: &lt;code&gt;/aws-glue/python-jobs/output&lt;/code&gt;, which contains &lt;code&gt;stdout&lt;/code&gt;, and &lt;code&gt;/aws-glue/python-jobs/error&lt;/code&gt; for &lt;code&gt;stderr&lt;/code&gt;. Inside these log groups you can find your job's log stream, named after the &lt;code&gt;JOB_RUN_ID&lt;/code&gt;, e.g. &lt;code&gt;/aws-glue/python-jobs/output/jr_3c9c24f19d1d2d5f9114061b13d4e5c97881577c26bfc45b99089f2e1abe13cc&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When the job is started, two links are already present to help you navigate to the particular log.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgist.githubusercontent.com%2F1oglop1%2Ff5ee29229f3b9b0a10d354c2f59aa7ad%2Fraw%2Fc4f3ea6f063db2903d522612f3e053f29dcab4e7%2Flog-links.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgist.githubusercontent.com%2F1oglop1%2Ff5ee29229f3b9b0a10d354c2f59aa7ad%2Fraw%2Fc4f3ea6f063db2903d522612f3e053f29dcab4e7%2Flog-links.png" alt="the picture of job with links to logs"&gt;&lt;/a&gt;&lt;br&gt;Showcase of aws console.
  &lt;/p&gt;

&lt;p&gt;Even though the links are present, the log streams are not created until the job starts.&lt;/p&gt;

&lt;p&gt;When using logging in your jobs, you may want to avoid logging to &lt;code&gt;stderr&lt;/code&gt;, or redirect it to &lt;code&gt;stdout&lt;/code&gt;, because the &lt;code&gt;error&lt;/code&gt; log stream is &lt;strong&gt;only&lt;/strong&gt; created when the job fails.&lt;/p&gt;
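A minimal way to route Python's `logging` output to `stdout`, so records land in the `output` stream that always exists (the logger name is a placeholder):

```python
import logging
import sys

logger = logging.getLogger("my_glue_job")  # hypothetical logger name
# StreamHandler defaults to stderr, so point it at stdout explicitly
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("job started")  # goes to the 'output' log stream
```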

&lt;p&gt;Glue 1.0 PySpark job logs are very verbose and contain a lot of "clutter" unrelated to your code, coming from the underlying Spark services. This issue has been addressed in Glue 2.0, where the exposure to the logs of unrelated services is minimal and you can comfortably focus on your own logs. Good job, AWS team!&lt;/p&gt;

&lt;p&gt;Python Shell jobs do not suffer from this condition and you can expect to get exactly what you log.&lt;/p&gt;

&lt;p&gt;And that's it about the config and logging. In the next episode, we are going to look into packaging and deployment.&lt;/p&gt;

&lt;p&gt;The code for the examples in this article can be found in my GitHub repository &lt;a href="https://github.com/1oglop1/aws-glue-monorepo-style" rel="noopener noreferrer"&gt;aws-glue-monorepo-style&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>python</category>
      <category>datascience</category>
      <category>serverless</category>
    </item>
    <item>
      <title>AWS Glue first experience - part 2 - Dependencies and guts</title>
      <dc:creator>Jan Gazda</dc:creator>
      <pubDate>Tue, 25 Aug 2020 21:25:02 +0000</pubDate>
      <link>https://dev.to/1oglop1/aws-glue-first-experience-part-2-dependencies-and-guts-29l</link>
      <guid>https://dev.to/1oglop1/aws-glue-first-experience-part-2-dependencies-and-guts-29l</guid>
      <description>&lt;h2&gt;
  
  
  Challenge number 2: Dependencies
&lt;/h2&gt;

&lt;p&gt;In the previous episode, we learned that it's rather simple to put the code in one Python file and run it. However, this is often not the best solution, and it makes more sense to split the code into separate modules or use code from existing libraries. I was about to call this part external dependencies, but you will soon find out why I left the &lt;code&gt;external&lt;/code&gt; out.&lt;/p&gt;

&lt;p&gt;There are multiple documentation pages mentioning how to work with dependencies and differences between &lt;code&gt;PySpark&lt;/code&gt; and &lt;code&gt;Python Shell&lt;/code&gt; jobs. This may not be really obvious when you browse the docs for the first time.&lt;/p&gt;

&lt;p&gt;For PySpark jobs, there is this page: &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html&lt;/a&gt;&lt;br&gt;
For Python Shell jobs, there is another page: &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#python-shell-supported-library" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#python-shell-supported-library&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inconsistencies&lt;/strong&gt;: each page lives in a completely different section and is therefore easily missed.&lt;/p&gt;
&lt;h3&gt;
  
  
  Providing dependencies for PySpark Jobs
&lt;/h3&gt;

&lt;p&gt;English isn't my native language, and the documentation confused me here with an explanation instead of a simple example.&lt;/p&gt;

&lt;p&gt;At the moment, Glue supports only pure Python libraries, which means we are not able to use C-based libraries (&lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;numpy&lt;/code&gt;) or extensions from other languages.&lt;/p&gt;

&lt;p&gt;The dependencies can be supplied in two forms.&lt;/p&gt;

&lt;p&gt;a. A single &lt;code&gt;.py&lt;/code&gt; file&lt;br&gt;
b. A &lt;code&gt;.zip&lt;/code&gt; archive containing Python packages&lt;/p&gt;

&lt;p&gt;Packages inside the &lt;code&gt;.zip&lt;/code&gt; archive need to be at the root of the archive.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

$ zipinfo -1 dependencies.zip
pkg2/
pkg2/__init__.py
pkg1/
pkg1/__init__.py


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
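One possible way to build such an archive with the standard library, assuming your packages live under a local `deps/` directory (a sketch, not an official tool):

```python
import pathlib
import zipfile

def build_dependencies_zip(src_root, out_path):
    """Zip every file under src_root, keeping paths relative to it,
    so pkg1/ and pkg2/ end up at the root of the archive."""
    src = pathlib.Path(src_root)
    with zipfile.ZipFile(out_path, "w") as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(src))

# build_dependencies_zip("deps", "dependencies.zip")
```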
&lt;p&gt;Each dependency has to be uploaded to an S3 bucket and then supplied to the particular job as the "Special parameter" &lt;code&gt;--extra-py-files&lt;/code&gt;, in the format of comma-separated S3 URLs.&lt;/p&gt;

&lt;p&gt;You have to specify each file separately, e.g.&lt;br&gt;
&lt;code&gt;s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip&lt;/code&gt;, which becomes annoyingly clumsy if your job has more than one dependency.&lt;/p&gt;
&lt;h4&gt;
  
  
  Behind the scenes
&lt;/h4&gt;

&lt;p&gt;Your &lt;code&gt;.py&lt;/code&gt; or &lt;code&gt;.zip&lt;/code&gt; file is copied into a &lt;code&gt;/tmp&lt;/code&gt; directory accessible during the job runtime, injected into the &lt;code&gt;PYTHONPATH&lt;/code&gt;, and also passed via the &lt;code&gt;--py-files&lt;/code&gt; argument to &lt;code&gt;spark-submit&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Note:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We were given access to the Glue 2.0 preview, which had broken the &lt;code&gt;spark-submit&lt;/code&gt; argument, and Spark could not see the code provided inside the zip file. Fortunately, it is possible to supply the &lt;code&gt;py-files&lt;/code&gt; to the &lt;code&gt;SparkSession&lt;/code&gt; during the job runtime. This was later fixed by the AWS team, and we were notified about the fix when our jobs started to fail due to the workaround above. It's worth mentioning that Glue 2.0 runs Python 3.7.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Providing dependencies for Python Shell Jobs
&lt;/h3&gt;

&lt;p&gt;Beware:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The documentation of Python Shell jobs is really tricky and sometimes confusing, mainly because the examples are provided without enough context and some code examples are written in legacy Python while others are in Python 3.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To provide external dependencies, the documentation advises using &lt;code&gt;.egg&lt;/code&gt; or &lt;code&gt;.whl&lt;/code&gt; files: upload them to S3 and list them in the &lt;code&gt;--extra-py-files&lt;/code&gt; argument.&lt;/p&gt;

&lt;p&gt;The runtime environment includes several pre-installed packages.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

Boto3
collections
CSV
gzip
multiprocessing
NumPy
pandas (required to be installed via the python setuptools configuration, setup.py)
pickle
PyGreSQL
re
SciPy
sklearn
sklearn.feature_extraction
sklearn.preprocessing
xml.etree.ElementTree
zipfile


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Due to my past experience with AWS Lambda, I suspected that versions of pre-installed packages may be outdated.&lt;/p&gt;

&lt;p&gt;So I decided to create a simple job &lt;code&gt;my-pyshell-job&lt;/code&gt; and edit the code using the online editor, just to get versions of specific libraries. The code is very simple:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
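The gist boils down to printing each library's `__version__`; a sketch along those lines (the exact original code may differ):

```python
import importlib

def report_versions(names):
    """Return 'name version' lines for the given importable modules."""
    lines = []
    for name in names:
        try:
            module = importlib.import_module(name)
            lines.append(f"{name} {getattr(module, '__version__', 'n/a')}")
        except ImportError:
            lines.append(f"{name} not installed")
    return lines

for line in report_versions(["boto3", "numpy", "pandas"]):
    print(line)
```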


&lt;p&gt;The result of the code above&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

boto3 1.9.203
numpy 1.16.2
pandas 0.24.2


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;confirmed that my assumption was correct: the versions are a bit dated, so you won't get the latest stable releases.&lt;/p&gt;

&lt;p&gt;This left me no other choice than to install them myself.&lt;/p&gt;

&lt;p&gt;This is the point where the documentation lacks an explanation, and I was left with a trial-and-error approach.&lt;/p&gt;

&lt;h4&gt;
  
  
  The installation
&lt;/h4&gt;

&lt;p&gt;Knowing that I needed to provide &lt;code&gt;.whl&lt;/code&gt; files, I ran the command &lt;code&gt;pip wheel pandas&lt;/code&gt; on my macOS machine and started to experiment.&lt;/p&gt;

&lt;p&gt;This downloaded multiple wheel files:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

numpy-1.19.0-cp36-cp36m-macosx_10_9_x86_64.whl
python_dateutil-2.8.1-py2.py3-none-any.whl
six-1.15.0-py2.py3-none-any.whl
pandas-1.0.5-cp36-cp36m-macosx_10_9_x86_64.whl
pytz-2020.1-py2.py3-none-any.whl


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I took my sample code from above, called it &lt;code&gt;my_pyshell_job.py&lt;/code&gt;, and uploaded it together with&lt;br&gt;
the &lt;code&gt;.whl&lt;/code&gt; files to an S3 bucket.&lt;/p&gt;

&lt;p&gt;Then I updated the job with an AWS CLI command.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

  aws glue update-job --job-name my-pyshell-job --job-update '{
    "Role": "MyGlueRole",
    "Command": {"Name": "pythonshell", "ScriptLocation": "s3://bucket/my_pyshell_job.py"},
    "DefaultArguments": {
      "--extra-py-files": "s3://bucket/dependencies/library/numpy-1.19.0-cp36-cp36m-macosx_10_9_x86_64.whl,s3://bucket/dependencies/library/python_dateutil-2.8.1-py2.py3-none-any.whl,s3://bucket/dependencies/library/six-1.15.0-py2.py3-none-any.whl,s3://bucket/dependencies/library/pandas-1.0.5-cp36-cp36m-macosx_10_9_x86_64.whl,s3://bucket/dependencies/library/pytz-2020.1-py2.py3-none-any.whl"
    }
  }'


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And started the job via console.&lt;/p&gt;

&lt;p&gt;The job finished with an &lt;code&gt;ImportError&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;AWS Glue uses AWS CloudWatch Logs; there are two log groups for each job: &lt;code&gt;/aws-glue/python-jobs/output&lt;/code&gt;, which contains &lt;code&gt;stdout&lt;/code&gt;, and &lt;code&gt;/aws-glue/python-jobs/error&lt;/code&gt; for &lt;code&gt;stderr&lt;/code&gt;.&lt;br&gt;
Inside these log groups you can find your job's log stream, named after the &lt;code&gt;JOB_RUN_ID&lt;/code&gt;, e.g. &lt;code&gt;/aws-glue/python-jobs/output/jr_3c9c24f19d1d2d5f9114061b13d4e5c97881577c26bfc45b99089f2e1abe13cc&lt;/code&gt;. I also found out that the &lt;code&gt;error&lt;/code&gt; log is only created if the job finishes with an error, so if you plan to log errors that won't fail the job, log them to &lt;code&gt;stdout&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;output&lt;/code&gt; log contains the &lt;code&gt;pip&lt;/code&gt; logs, which reported that all libraries were successfully installed. The &lt;code&gt;error&lt;/code&gt; log, however, contains the whole stack trace of this error.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

  ModuleNotFoundError: No module named 'numpy.core._multiarray_umath'

  ImportError:

  IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

  Importing the numpy C-extensions failed. This error can happen for
  many reasons, often due to issues with your setup or how NumPy was
  installed.
```

The error message tells us that this is most likely caused by an OS mismatch.

So I ran another simple job to give me more details about the platform.

{% gist https://gist.github.com/1oglop1/f5ee29229f3b9b0a10d354c2f59aa7ad file=print_platform.py %}
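The gist presumably amounts to something like this, judging by the output format below:

```python
import platform

# platform.platform() produces strings like
# 'Linux-4.14.123-86.109.amzn1.x86_64-x86_64-with-debian-10.2'
print(platform.platform())
```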

I tried this with both PySpark and Python Shell jobs and the results were a bit surprising.

Python Shell jobs run on Debian: `Linux-4.14.123-86.109.amzn1.x86_64-x86_64-with-debian-10.2`,
while PySpark jobs run on Amazon Linux `Linux-4.14.133-88.112.amzn1.x86_64-x86_64-with-glibc2.3.4`, likely an `amazoncorretto` image.

The next step was clear: I needed a `wheel` with `numpy` built on Debian Linux. Since the plan was to deploy using CI/CD, I decided to use the [official python docker image][docker_python] `python:3.6-slim-buster`.

I ran `pip wheel pandas` again to get the correct packages and removed those I already had (only the platform-specific wheels needed replacing).

```
numpy-1.19.0-cp36-cp36m-manylinux2010_x86_64.whl
pandas-1.0.5-cp36-cp36m-manylinux1_x86_64.whl
```

I uploaded the `wheels` to S3, then updated and ran the job.

The job has PASSED!
And this is how the result looks:

```
boto3 1.9.203
numpy 1.19.0
pandas 1.0.5
```

But wait, there is more in the log.

```
Processing ./glue-python-libs-psoetpzo/numpy-1.19.0-cp36-cp36m-manylinux2010_x86_64.whl
Installing collected packages: numpy
Successfully installed numpy-1.19.0
Processing ./glue-python-libs-psoetpzo/pandas-1.0.5-cp36-cp36m-manylinux1_x86_64.whl
Collecting pytz&amp;gt;=2017.2
Downloading pytz-2020.1-py2.py3-none-any.whl (510 kB)
Collecting numpy&amp;gt;=1.13.3
Downloading numpy-1.19.0-cp36-cp36m-manylinux2010_x86_64.whl (14.6 MB)
Collecting python-dateutil&amp;gt;=2.6.1
Downloading python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting six&amp;gt;=1.5
Downloading six-1.15.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: pytz, numpy, six, python-dateutil, pandas
Successfully installed numpy-1.19.0 pandas-1.0.5 python-dateutil-2.8.1 pytz-2020.1 six-1.15.0
```

If you look closely, you can notice that `numpy` has been installed twice, the question is why?

The answer lies inside another file called `/tmp/runscript.py`. This file is responsible for orchestrating the installation and running your code, as well as providing the stack trace in case of an exception.

##### /tmp/runscript.py

Out of pure curiosity, I decided to print out the content of this file to find out why `numpy` has been installed twice.

The content of this file was enlightening but also shocking.

After the first 12 lines of imports, there was a comment on line 14

`##TODO: add basic unittest`

Okay, this is a bit scary, because a comment like this suggests that the developer did not have enough time to write tests while rushing to production, and the reviewer let it slip.

If your code does not contain enough tests, you should keep it internal or loudly warn your customers about the potential consequences.

On further examination, I noticed that this script is also used for PySpark jobs.

A little further into the file, there is a function `download_and_install`. The comment above the function describes its purpose: `# Download extra py files and add them to the python path` (might as well have used a docstring, right?).

The function takes each file supplied in `--extra-py-files` argument and installs it separately with `pip`.

This tells me a couple of things:

1. Dependencies mentioned in `setup.py` are automatically downloaded from PyPI; since `pandas` depends on `numpy`, it is downloaded as well, hence the second installation.

2. Using a private package repository that also hosts the dependencies won't work flawlessly.

3. The order of the supplied dependencies matters.
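A sketch of the behaviour described above (the `run` hook is my addition so the loop can be shown without actually invoking pip; the real `runscript.py` differs):

```python
import subprocess
import sys

def install_extra_py_files(local_paths, run=subprocess.check_call):
    """Install each supplied file with its own pip invocation, in order.
    Missing dependencies are resolved from PyPI on every call."""
    commands = []
    for path in local_paths:
        cmd = [sys.executable, "-m", "pip", "install", path]
        commands.append(cmd)
        run(cmd)
    return commands
```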


### AWS Glue examples and libs

Speaking of dependencies, AWS Glue provides its core functionality via a library called [awsglue][aws-glue-lib]. The code is located on GitHub. Even though the code is public, the repository maintainers do not seem to be interested in community ideas and pull requests, because many pull requests sit without any kind of response from the AWS Glue team. Judging from the reactions on the [unmaintained issue][unmaintained], I am not the only one who feels this way.

**Friendly warning**: if you are going to attempt anything with this repository on your local machine, I strongly suggest doing it in an isolated environment, using either a `docker` container or a VM. The code modifies some local directories and the `PYTHONPATH`.

During my work on this project, the repository contained two versions, one in each branch (`glue-0.9` and `glue-1.0`), plus a readme file, which was a bit confusing because it described the process for both versions.

The "installation" process is also somewhat painful if the only thing you want is code completion in your IDE, because the library does not provide any mechanism for a simple installation. The only way to run this code is to zip the `awsglue` directory (`zip -r PyGlue.zip awsglue`) and add it to your `PYTHONPATH` (see `bin/setup.sh` for an example). As you may have noticed, the repository contains `bash` commands, and Windows is not currently supported (there is [an open pull request][pr_open]).

In case you are visiting the repository to look for the examples of tests you will be very disappointed because there aren't any tests at all!

#### Coding practices

Since the repository does not contain any kind of automation for linting and testing there are a lot of lint errors and potentially dangerous lines.

One example is using a mutable value as a default argument:

{% gist https://gist.github.com/1oglop1/f5ee29229f3b9b0a10d354c2f59aa7ad file=glue_immutable.py %}
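For illustration, the classic shape of that pitfall (not the actual `awsglue` code):

```python
def append_transform(value, accumulator=[]):  # the [] is created once, at def time
    accumulator.append(value)
    return accumulator

first = append_transform("a")
second = append_transform("b")
# 'second' is ["a", "b"]: both calls shared the same default list
```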

With the default pylint configuration, the score is `Your code has been rated at 2.94/10`. A plain-text pylint score is not always ideal when presenting the errors to a wider audience. With [codeac.io][codeac] we can see the report in numbers and also track issue counts over time.

I forked the repository and ran the report for you.
Please keep in mind that I left all configuration on default, and therefore the results might not be 100% accurate, but they should give an idea of the current state.

&amp;lt;figure&amp;gt;
  &amp;lt;img src="https://gist.githubusercontent.com/1oglop1/f5ee29229f3b9b0a10d354c2f59aa7ad/raw/0653a962d1912d28e4fda6eb2a3926c0f9774e9a/codeac.png" 
  alt="codeac io for aws-glue-libs" style="width:50%"&amp;gt;
  &amp;lt;figcaption&amp;gt;Results of codeac.io for &amp;lt;a href="https://github.com/awslabs/aws-glue-libs"&amp;gt;aws-glue-libs&amp;lt;/a&amp;gt;&amp;lt;/figcaption&amp;gt;
&amp;lt;/figure&amp;gt;


Apart from dangerous default values, you can find missing docstrings, `#TODO` comments, overly long lines, unused imports and many others.

Before the latest update on May 5, 2020, there were several import errors preventing the code from even running or being tested. This update patched a few holes; however, there is still a mix of imports present, waiting to cause more problems. If you are curious about how Python imports work, I suggest reading [this article from realpython][article_rp].

AWS maintains another repository called [aws-glue-samples][aws-glue-samples]. The story of this repository is slightly better; it appears to be actively maintained. [I reported an issue][my_issue] about the bad coding practice of using `import *` and received a response two days later, which I consider rather quick in the open-source world.

Although I did not find the examples particularly useful for myself, the code quality is a bit better than in the aforementioned `aws-glue-libs`. The examples usually contain a lot of comments, which can be really useful if you are
new to AWS Glue and to data science in particular.

As you can see, a trivial example of executing one script file without dependencies can quickly escalate and catch you off guard. In the next episode, we are going to explore how to parametrise and configure the Glue application.

The code for the examples in this article can be found in my GitHub repository [aws-glue-monorepo-style]


[aws-glue-monorepo-style]: https://github.com/1oglop1/aws-glue-monorepo-style

[docker_python]: https://hub.docker.com/_/python
[aws-glue-lib]: https://github.com/awslabs/aws-glue-libs
[unmaintained]: https://github.com/awslabs/aws-glue-libs/issues/51
[pr_open]: https://github.com/awslabs/aws-glue-libs/pull/54
[codeac]: https://codeac.io
[article_rp]: https://realpython.com/absolute-vs-relative-python-imports/
[aws-glue-samples]: https://github.com/aws-samples/aws-glue-samples/
[my_issue]: https://github.com/aws-samples/aws-glue-samples/issues/62
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>python</category>
      <category>serverless</category>
    </item>
    <item>
      <title>AWS Glue first experience - part 1 - How to run your code?</title>
      <dc:creator>Jan Gazda</dc:creator>
      <pubDate>Tue, 25 Aug 2020 21:24:45 +0000</pubDate>
      <link>https://dev.to/1oglop1/aws-glue-first-experience-part-1-how-to-run-your-code-3pe3</link>
      <guid>https://dev.to/1oglop1/aws-glue-first-experience-part-1-how-to-run-your-code-3pe3</guid>
      <description>&lt;h1&gt;
  
  
  My first project with AWS Glue
&lt;/h1&gt;

&lt;p&gt;I received an assignment to help build a data lake with a data pipeline using AWS Glue.&lt;br&gt;
It was my first exposure to this service, and there were many challenges along the way worth sharing.&lt;/p&gt;

&lt;p&gt;AWS Glue is a serverless, fully managed extract, transform,&lt;br&gt;
and load (ETL) service to prepare and load data for analytics. It supports data stored in Amazon Aurora and all other Amazon RDS engines, Amazon Redshift, and Amazon S3, as well as common database engines and databases running on Amazon EC2. It helps you construct a Data Catalog using Crawlers and pre-built Classifiers, then suggests schemas and transformations and generates the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The goal of the project
&lt;/h2&gt;

&lt;p&gt;The data lake followed a typical 3-zone architecture: Raw, Refined and Curated.&lt;/p&gt;

&lt;p&gt;Data for the raw and refined zones is stored in an S3 bucket, while curated data is written to a PostgreSQL database running on AWS Aurora.&lt;/p&gt;

&lt;p&gt;The idea was to process and transform data incoming from 4 different data sources. All data sources dropped data into a raw zone S3 bucket, where it was picked up by individual Glue jobs.&lt;/p&gt;

&lt;p&gt;The Glue jobs themselves were orchestrated using the Glue Workflows feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  First steps
&lt;/h2&gt;

&lt;p&gt;Since we had identified our data sources and ETL zones, it was time to write 8 Glue jobs: four per transition, 4x &lt;code&gt;raw to refined&lt;/code&gt; and 4x &lt;code&gt;refined to curated&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Glue supports multiple runtimes: Apache Spark with Scala, Apache Spark with Python (PySpark), and Python Shell, a pure Python interpreter (3.6 at the time of writing).&lt;/p&gt;

&lt;p&gt;We will stick with Python and use PySpark and Python Shell.&lt;/p&gt;

&lt;p&gt;AWS Glue provides an extension (a soft wrapper) around &lt;code&gt;pyspark.sql.context&lt;/code&gt; and adds Glue-specific features such as &lt;code&gt;DynamicFrame&lt;/code&gt;. To provide maximum portability, I'm going to avoid using AWS Glue specific features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge number 1: How to run your code?
&lt;/h2&gt;

&lt;p&gt;As a python developer, I'm used to splitting the code into modules and packages to prevent large files.&lt;/p&gt;

&lt;p&gt;The Glue documentation is pretty straightforward in pointing you to use the&lt;br&gt;
pre-generated code or write it using the online editor built into the AWS Glue Console.&lt;/p&gt;

&lt;p&gt;Apart from that, there is a short page about &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/console-custom-created.html"&gt;providing your scripts&lt;/a&gt;.&lt;br&gt;
Most of the documentation shows the examples within the console and our code was expected to be deployed via&lt;br&gt;
&lt;a href="https://www.terraform.io"&gt;Terraform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Glue Job code requires a script file to be stored in an S3 bucket.&lt;/p&gt;

&lt;p&gt;Then you have to point your Terraform resource &lt;code&gt;aws_glue_job&lt;/code&gt; to the &lt;code&gt;script_location&lt;/code&gt;,&lt;br&gt;
which contains an S3 URL to your file, e.g. &lt;code&gt;s3://code-bucket/glue_job.py&lt;/code&gt;.&lt;/p&gt;
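A minimal sketch of that resource (the job name and IAM role are placeholders):

```hcl
resource "aws_glue_job" "example" {
  name     = "my-glue-job"              # hypothetical job name
  role_arn = aws_iam_role.glue_job.arn  # hypothetical IAM role

  command {
    name            = "glueetl"         # "pythonshell" for Python Shell jobs
    python_version  = "3"
    script_location = "s3://code-bucket/glue_job.py"
  }
}
```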

&lt;p&gt;We can now write the code in 1 file, which is enough for small ETL scripts based purely on Spark. In the next episode, we are going to explore how to deal with dependencies.&lt;/p&gt;

&lt;p&gt;The code for the examples in this article can be found in my GitHub repository &lt;a href="https://github.com/1oglop1/aws-glue-monorepo-style"&gt;aws-glue-monorepo-style&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>python</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
