Running AWS Glue locally on a MacBook-M1

by Udaybhaskar Sarma Seetamraju
ToSarma@gmail.com
Dec 31 2023

Highest-level Context

If you are into “Shift-Left” (whether re: Testing, Security, or Replicating-problems-on-developer-laptop, etc ..), then this article is for you.

The very first time you switch to an M1-chipset based MacBook (from an Intel-chip based MacBook), productivity is significantly impacted when doing development/testing/troubleshooting “locally” on your laptop. Supporting those switching from Windows is out of scope for this article.

Towards enabling up to 5x developer-productivity by allowing developers to robustly SIMULATE the Cloud-environment on a laptop — I have the following series of articles re: M1-chipset based MacBooks:

  1. Running AWS CodeBuild locally on MacBook-M1.
    • Running Containers based on older Ubuntu 20.04 (released in the year 2020) as well as on the newer Ubuntu 22.04 (released in the year 2022)
    • Running Containers based on arm64-based Linux
  2. (This) Running AWS Glue locally on MacBook-M1. Various scenarios covered like: you do Not have “aws credentials” on your Laptop (forcing you to mock all the AWS API calls like S3 GET, Glue-Catalog queries, etc..)
  3. New Security-related Best-Practices when creating arm64/aarch64 Docker-Images on a MacBook-M1.

Quick Summary

The aim is a very simple single command, based on bash-shell scripts, that executes your python-code as a Glue Job locally on your MacBook-M1.

In addition, I have a section on how to significantly raise your productivity, in debugging/developing your python-code, even if your company denies your AWS CLI credentials.

To state the obvious, everything here is 100% Python + Bash-Scripts.

Note: You should aim to have your software work on arm64 containers, which are typically the cheapest compute on cloud. More below.

Problem Statements

  1. Using Git for capturing ALL code-changes while simultaneously copying-n-pasting into AWS Glue-Studio (for testing/troubleshooting) is painful, error prone and frustrating.
  2. As time progresses, your Glue Job will become complicated and require more than one Python-script.
     Worse, you will have one or more folder-hierarchies, all of which contain PY files that you need to import!
  3. Many developers prefer to develop/test/troubleshoot their code as “plain python”, and Not as a Glue-Job.
     There is No good reason to prevent such developers from doing just that, while ensuring that the code will work without any issues inside AWS Glue running locally on MacBook-M1, and eventually inside Glue on AWS.
    1. When running as “plain python”, there should be No runtime dependencies (like “import awsglue”).
    2. When running as “plain python”, there should be No spark-dependency.
    3. When running as “plain python”, all inputs/files should be on local laptop’s filesystem. All output should be written to local filesystem only.
    4. When running as “plain python”, all information from Glue-Catalog should be available OFFline (as a Python Dict object)
  4. How to proactively ensure the Glue Job will work on all chip-architectures - without having to scramble later? How to explicitly utilize all x86_64/amd64/arm64/aarch64 architectures locally on laptop?
  5. If the Enterprise does Not allow Laptops to have AWS-Credentials (in the ~/.aws/credentials file), how can I still EFFICIENTLY test/debug the python-code locally on my laptop, even though it needs access to the Glue Catalog and/or S3 buckets?

Get started!

export BUILDPLATFORM="linux/aarch64"
Use the value above ONLY when running on a MacBook-M1 laptop, if you'd like to take advantage of the native-performance boost !!

Based on your needs on AWS, choose one of these 2:
export BUILDPLATFORM="linux/amd64"
export BUILDPLATFORM="linux/arm64"

export DOCKER_DEFAULT_PLATFORM="${BUILDPLATFORM}"
export TARGETPLATFORM="${DOCKER_DEFAULT_PLATFORM}"

WORK_AREA=~
cd ${WORK_AREA}
git clone https://gitlab.com/tosarma/macbook-m1.git
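If you are unsure which value gives native performance on a particular machine, the choice can be derived from the CPU architecture that Python reports. This is only a minimal sketch: the function name and the mapping table are my own illustration (they are not part of the macbook-m1 repository), and the mapping simply follows the values used in the export commands above.

```python
import platform

# Hypothetical mapping from the machine name reported by the OS to the
# BUILDPLATFORM values used in this article.
_MACHINE_TO_PLATFORM = {
    "arm64": "linux/aarch64",    # reported by macOS on Apple Silicon (M1)
    "aarch64": "linux/aarch64",  # reported by Linux on 64-bit ARM
    "x86_64": "linux/amd64",     # reported by Intel-based MacBooks and Linux
    "amd64": "linux/amd64",      # reported by some other operating systems
}


def native_build_platform(machine=None):
    """Return the BUILDPLATFORM value matching the given (or current) CPU."""
    machine = (machine or platform.machine()).lower()
    try:
        return _MACHINE_TO_PLATFORM[machine]
    except KeyError:
        raise ValueError(f"Unrecognized CPU architecture: {machine!r}") from None
```

Calling native_build_platform() with no argument inspects the current machine. Remember that the native value is only about speed on the laptop; what you eventually deploy on AWS is a separate decision.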

To try out a sample ..

cd macbook-m1
cd AWS-Glue/src
${WORK_AREA}/macbook-m1/AWS-Glue/bin/run-glue-job-LOCALLY.sh  sample-glue-job.py

No Bash? Want Python instead?

Just replace the “.sh” with “.py” — in the script name “run-glue-job-LOCALLY” (as shown above).
And, of course, you must insert “python3” at the very beginning of the command line (this advice is platform-independent).

WARNING: Without the benefit of the “docker cli”, you get ZERO visibility into the progress of docker-activity. This is due to the use of Docker’s un-friendly Python APIs, because of which the python-code WILL HANG for a long time!

To repeat, “run-glue-job-LOCALLY.py” will hang with NO output, for roughly 2-to-5 minutes (depending on how much CPU and MEMORY you have allocated to the Docker-Desktop, as well as speed of your internet connection).
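For readers who write their own Python wrapper and want some visibility during a long image pull, the Docker SDK's low-level client can stream pull events instead of blocking silently. The sketch below is an assumption-laden illustration and NOT how “run-glue-job-LOCALLY.py” works: both function names are hypothetical, and the repository/tag to pull are left for you to supply.

```python
def format_progress_event(event):
    """Turn one decoded pull event (a dict) into a single readable line."""
    layer = event.get("id", "")
    status = event.get("status", "")
    detail = event.get("progressDetail") or {}
    current, total = detail.get("current"), detail.get("total")
    if current is not None and total:
        # Byte counts are present: show a rough percentage for this layer.
        return f"{layer} {status} {100 * current // total}%".strip()
    return f"{layer} {status}".strip()


def pull_with_progress(repository, tag):
    """Pull an image via the Docker SDK, printing each progress event."""
    import docker  # deferred: needs "pip3 install docker" and a running Docker

    api = docker.from_env().api  # the low-level APIClient
    # stream=True yields events as they arrive; decode=True turns them into dicts.
    for event in api.pull(repository, tag=tag, stream=True, decode=True):
        print(format_progress_event(event), flush=True)
```

The formatter is kept separate from the pull so that it can be exercised without Docker being installed.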

Important - Note these:

  1. I only tested using Python3.11; No other Python version tested.
  2. PRE-REQUISITES:
    • pip3 install docker

Ready to use it for your own Glue-Script?

  1. First, read the full details within the macbook-m1/AWS-Glue/README.md file.
  2. Copy the *.py files in the macbook-m1/AWS-Glue/src/common subfolder into --> YOUR project’s TOPMOST-folder.
    • ATTENTION: the files ./src/common/*.py must exist in your project, after you are done copying.
  3. Make sure to edit your PY file, to look like the example file provided (sample-glue-job.py)
  4. From your project root, run:
cd <your-project's-root-folder>

${WORK_AREA}/macbook-m1/AWS-Glue/bin/run-glue-job-LOCALLY.sh \
          path/to/your/file.py

If you are having problems importing the new files under ./src/common, then try adding this command below and then retry the above command.

export PYTHONPATH="your-project's-root-folder"

Does your Python code-base consist of multiple files across multiple folder-hierarchies?
See the section below titled “Advanced User - Complex folder-hierarchies?”.

Want to change the CLI-arguments?

Three simple steps:

  1. Edit the file macbook-m1/AWS-Glue/src/common/cli_utils.py
  2. Look inside “process_all_argparse_cli_args()” and make changes in that function.
  3. Look inside “process_std_glue_cli_args()” and make changes inside that function.
    • Note: make sure the changes you make in steps 2 & 3 above match each other.

Example:-

  1. You would like to support a new cli-arg as:
    • --JOB_NAME 123_ABC
  2. Insert a new line at (say) line # 125 for JOB_NAME.
    • This will ensure your code will get the value 123_ABC when running INSIDE AWS-Glue !!
  3. Insert a new line at (say) line # 78 for --JOB_NAME
    • This will allow you to run your python-code as a PLAIN python-command and read this CLI-arg.
    • See more re: this in a following section titled “Running as a plain python command”.
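To see why the same argument has to be declared in two places, here is a minimal sketch of the general dual-parsing idea. It is NOT the code inside cli_utils.py: the function names and the ARG_NAMES list are hypothetical, with the names taken from the sample command in this article plus JOB_NAME. The only real Glue API used is awsglue.utils.getResolvedOptions, which can be imported only inside a Glue runtime.

```python
import argparse

# Hypothetical list: the 6 sample arguments from this article, plus JOB_NAME.
ARG_NAMES = ["ENV", "commonDatabaseName", "glueCatalogEntryName",
             "rawBucket", "processedBucket", "finalBucket", "JOB_NAME"]


def parse_plain_python_args(argv):
    """Plain-python path: parse '--NAME value' pairs with argparse."""
    parser = argparse.ArgumentParser()
    for name in ARG_NAMES:
        parser.add_argument(f"--{name}", required=True)
    return vars(parser.parse_args(argv))


def parse_glue_args(argv):
    """Glue path: let Glue's own helper resolve the same names."""
    from awsglue.utils import getResolvedOptions  # absent in plain Python
    return getResolvedOptions(argv, ARG_NAMES)


def parse_args(argv):
    """Use the Glue helper when it is importable, else fall back to argparse."""
    try:
        return parse_glue_args(argv)
    except ImportError:
        return parse_plain_python_args(argv[1:])  # skip the script name
```

Both paths return a dict keyed by argument name, so the rest of the job can read args["JOB_NAME"] without caring how it was started. You would call parse_args(sys.argv) at the top of the job.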

Never edit the file:
macbook-m1/AWS-Glue/src/common/glue_utils.py

The files “common.py” and “names.py” in that same folder can be edited. Feel free to play around with them.

Tips, Issues & Errors

See Appendix sections, for tips on configuring Docker-DESKTOP.

Question: Want to automatically cleanup/delete the Docker-containers - after they exit?
Answer: INSERT the cli-arg “--cleanup” BEFORE the python-filename when invoking the “run-glue-job-LOCALLY.sh” script (for example: run-glue-job-LOCALLY.sh --cleanup sample-glue-job.py).

Advanced User - Complex folder-hierarchies?

Is your Python code-base consisting of multiple files across multiple folder-hierarchies?
Are you aware that Glue requires you to ZIP up all those OTHER python-files into a single Zip-file?
FYI only - this requirement is driven by Spark!

That script “run-glue-job-LOCALLY.sh” will automatically do that for you --> that is, it will automatically look UNDER the current-working-directory, and find all **/*.py files and put them in a temporary ZIP-file.
The script will then automatically pass it on to Glue-inside-Docker (running on your laptop).

If you need to import PY files in parent/ancestor levels, I recommend that you add a “symlink” (Linux command “ln -s”) to those files, and put that symlink in your current-working-folder.
Git will preserve these “symlinks” as exactly that. It will NOT convert them into regular files. So, feel better already!
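The zipping step described above can be pictured with a few lines of Python. This is a sketch of the idea only, under my own assumptions: it is not the code in “run-glue-job-LOCALLY.sh”, and the function name is hypothetical.

```python
import zipfile
from pathlib import Path


def zip_python_sources(root, zip_path):
    """Put every *.py file found under `root` into one ZIP file.

    Paths inside the archive are kept relative to `root`, so that
    package-style imports still resolve after the archive is unpacked.
    """
    root = Path(root)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for py_file in sorted(root.rglob("*.py")):
            archive.write(py_file, arcname=py_file.relative_to(root).as_posix())
    return zip_path
```

For example, zip_python_sources(".", "/tmp/my-glue-deps.zip") would collect everything under the current-working-directory, which is the same starting point the script is described as using.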

No AWS Credentials on your laptop?

For security-reasons, many companies are denying developers the AWS-credentials for AWS-CLI use.

That means you have a showstopper -> re: locally testing/debugging your python-code, for scenarios like:

  1. COPY INPUT-files from S3-buckets --> into the “current-working directory”.
  2. COPY OUTPUT-files from the “current-working directory” --> into S3-buckets.
  3. Lookup Glue Catalog.
  4. .. etc ..

To workaround this restriction ..
You need to write code that detects whether it's running on a MacBook-M1 -versus- actually running inside AWS-Cloud.
In other words, you need to “Short-Circuit” all the code that interacts with AWS-APIs (Glue-Catalog, S3, ..) and mock the expected response from those AWS-APIs.

If you use my script “run-glue-job-LOCALLY.sh”, it automatically sets an environment-variable called “running_on_LAPTOP” when running your python-code inside a Docker-Glue container on your laptop!!

How to “short-circuit”:

import os

if os.environ.get("running_on_LAPTOP"):
    print("!!!!!!!!!!!!!!!!! running on laptop !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
    # Assume the S3-get is already done and the file is available in the current-directory.
    # Assume the Glue-Catalog-Query is already done and
    #     the "JSON-Response" is available in the current-directory as a JSON-file.
    ...
else:
    # Running inside AWS-Cloud: make the real AWS-API calls (S3, Glue-Catalog, ..) here.
    ...

To state the obvious, on AWS-Cloud, AWS-GLUE does NOT support environment-variables.
So, this environment-variable called “running_on_LAPTOP” will be UN-defined when running inside AWS-Cloud.
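As a slightly fuller illustration of the short-circuit, here is a sketch for the Glue-Catalog lookup case. Everything except the “running_on_LAPTOP” variable and the boto3 get_table call is an assumption of mine: the function names, the mock_dir parameter and the "<database>.<table>.json" file-naming convention are hypothetical, and you would have to save the mocked JSON response yourself beforehand.

```python
import json
import os
from pathlib import Path


def running_on_laptop():
    """True when the laptop-only environment-variable is set."""
    return bool(os.environ.get("running_on_LAPTOP"))


def get_table_definition(database, table, mock_dir="."):
    """Return a Glue-Catalog table definition as a Python dict."""
    if running_on_laptop():
        # Short-circuit: read a previously saved response from the local
        # filesystem (hypothetical naming: "<database>.<table>.json").
        mock_file = Path(mock_dir) / f"{database}.{table}.json"
        return json.loads(mock_file.read_text())
    # Inside AWS-Cloud: make the real Glue-Catalog call.
    import boto3  # deferred so the laptop path needs neither boto3 nor credentials
    return boto3.client("glue").get_table(DatabaseName=database, Name=table)
```

Keeping the check in one small function means the rest of the job never has to look at the environment-variable directly.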

Running as a plain python command

EXAMPLE:
I’m going to use the same “macbook-m1/AWS-Glue/src/sample-glue-job.py” file, to show how to run as PLAIN Python-program.

FYI: My python-code in that sample-glue-job.py expects the following 6 CLI-arguments (with their values).

If you do NOT like this list of CLI-args, see the section above titled “Want to change the CLI-arguments?”.

python3 sample-glue-job.py --ENV sandbox \
     --commonDatabaseName MyCOMMONDATABASENAME \
     --glueCatalogEntryName MyDICT \
     --rawBucket MyRAWBUCKET \
     --processedBucket MyPROCESSEDBUCKET  \
     --finalBucket MyFINALBUCKET

If you want the ability to run both as a plain python-script as well as run it inside AWS-glue, you MUST replicate the structure and code within this “macbook-m1/AWS-Glue/src/sample-glue-job.py”.

APPENDIX

Docker-Desktop settings for aarch64-chipset

See screenshot below.
Turn ON the setting titled “Use containerd for pulling and storing images”!
Note: for other scenarios, you may have to turn it OFF.
I can’t explain these conflicting instructions.
As of 2023, this is a Docker-on-MacBook issue, resolvable only by Docker + Apple Corp.

[Screenshot: Docker-Desktop settings]

Running out of Disk-space or Memory?

Screenshot below shows the recommended “high” settings.
After building images, you can reduce:

  • “CPU” can be lowered to “2”.
  • “Memory” can be lowered to “4GB”.

FYI only - On a MacBook-M1, many emulated amd64 containers (like Neo4j v4.x) will frequently fail, unless you provide Docker with a minimum of 5 CPUs and 8GB of RAM!

[Screenshot: Docker-Desktop resource settings (CPU, Memory)]

End of Article.
