Stephen Whitmore

Posted on Nov 11, 2022

Python Powered Pipelines? Preposterous!

#devops #python #angular #productivity

No, not preposterous. Powerful.

DevOps is all about rapid delivery. Using Python to make your pipelines "smart" can help you achieve that goal.

Let's say you have a suite of Angular libraries your team created to use for their applications. A robust pipeline for this project will include the following:

- validation jobs that make sure
  - there's no "fit" or "fdescribe" to narrow down unit tests
  - there's no "dist" or ".angular" folders present
  - eslint passes
- tests
  - unit tests
  - integration tests
- security scans
  - npm audit
  - SAST scans (SonarQube, Fortify, etc)
- publication
  - snapshots
  - release candidates
  - releases

At an enterprise level, you could be dealing with upwards of 20 libraries. You could have a job for each library. Speaking from experience, having jobs for each will quickly turn your pipeline into a big nasty pile of spaghetti. Nobody wants to deal with a big nasty pile of spaghetti.

Worse still, the time it'll take for your pipelines to run will slow things down to an agonizing crawl and have your team pulling their hair out in frustration every time they push a change up.

Or you can have an easy-to-read, easy-to-maintain Python script that figures out which libraries actually changed, then runs the jobs for only those libraries.

Let's use a mock scenario to illustrate my point. For the sake of brevity we'll just have our pipeline:

Run eslint
Run unit tests
Publish snapshots off of our feature branches

Here's a project that demonstrates this using both GitHub Actions and GitLab CI/CD. I'm including both because I think most people use GitHub Actions but I know many companies also use GitLab for their enterprise applications (plus I'm way more familiar with GitLab CI/CD).

The major points I'll be going over will be:

- Setting up a runner image
- Setting up your CI file
  - GitHub Actions
  - GitLab CI/CD
- Handling secrets
  - Creating an auth token for npmjs.org
  - Creating Actions secrets in GitHub
  - Creating masked environment variables in GitLab
- Pythonizing your pipeline

Setting up a runner image

There are a boatload of Docker images to choose from for your pipeline runner image. Personally, I like having total control over my pipelines and prefer to use my own image. I like Alpine because it's tiny compared to the more popular Ubuntu. Tiny is good because it loads faster and has a smaller attack surface.

Based on the listed requirements above, our image will need to support Python, the packages our pipeline script will be using, nodejs, and a browser for our unit tests.

Here's a good example of an image that will serve our needs:

FROM alpine:latest

RUN apk add --no-cache --update python3-dev gcc libc-dev libffi-dev git && \
    ln -sf /usr/share/zoneinfo/America/Chicago /etc/localtime && \
    ln -sf python3 /usr/bin/python && \
    ln -sf pip3 /usr/bin/pip

COPY ./scripts/requirements.txt /tmp

RUN python -m ensurepip && \
    python -m pip install --no-cache --upgrade -r /tmp/requirements.txt

RUN apk add --update --repository http://dl-cdn.alpinelinux.org/alpine/v3.16/main nodejs=16.17.1-r0 npm && \
    apk add --no-cache chromium --repository http://dl-cdn.alpinelinux.org/alpine/v3.16/community

ENV CHROME_BIN=/usr/bin/chromium-browser CHROME_PATH=/usr/lib/chromium

Super duper. Assuming you know your way around Docker, we can publish our image to whatever registry we use. If you're not familiar with Docker, don't fret. Their documentation is outstanding and well worth spending time reading through.
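Before publishing, it can be worth confirming the image really has everything the pipeline needs. Here's a minimal sketch of a smoke test you could run inside the container. The tool names in REQUIRED_TOOLS are my assumption based on the Dockerfile above (including the chromium-browser name from CHROME_BIN), and missing_tools() is a hypothetical helper, not part of the demo project.

```python
"""Smoke test for the runner image: checks that required executables are on PATH."""
import shutil

# Assumed tool names; adjust to match what your image actually installs.
REQUIRED_TOOLS = ["python", "pip", "git", "node", "npm", "chromium-browser"]


def missing_tools(tools, which=shutil.which):
    """Returns the tools that cannot be found on PATH."""
    # shutil.which returns None when an executable is not found
    return [tool for tool in tools if which(tool) is None]
```

Calling missing_tools(REQUIRED_TOOLS) in the container should return an empty list. Anything it does return is something the Dockerfile still needs to install.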

Setting up your CI file

GitHub Actions

As far as I can tell, GitHub doesn't support using a custom pipeline runner image directly. You'll need to run the job on a natively supported runner (ubuntu-latest below) and set your image as the job's "container". Kinda icky but it will still serve you.

Disclaimer: I'm still learning my way around GitHub Actions so the below yml file is far from perfect. It works but I'd love to hear about ways it could be improved/optimized.

.github/workflows/ci.yml

name: CI
on: push

jobs:
  CI:
    runs-on: ubuntu-latest
    container: stevewhitmore/nodejs-python

    steps:
      - uses: actions/checkout@v3

      - name: Cache node modules
        id: cache-npm
        uses: actions/cache@v3
        env:
          cache-name: cache-node-modules
        with:
          path: ~/.npm
          key: ${{ runner.os }}-build-${{ env.cache-name }}-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-build-${{ env.cache-name }}-
            ${{ runner.os }}-build-
            ${{ runner.os }}-

      - name: Allow me to run my script
        run: git config --global safe.directory '*'

      - name: pylint
        run: |
          python -m pylint --version
          PYTHONPATH=${PYTHONPATH}:$(dirname %d) python -m pylint scripts/ci.py

      - name: eslint
        run: python scripts/ci.py eslint

      - name: Unit Tests
        run: python scripts/ci.py unit_tests

      - name: Publish Snapshots
        run: |
          echo "//registry.npmjs.org/:_authToken=${{ secrets.NPM_TOKEN }}" > .npmrc
          python scripts/ci.py publish_snapshots

GitLab CI/CD

.gitlab-ci.yml

image: stevewhitmore/nodejs-python

stages:
  - validation
  - test
  - snapshot

cache:
  key: ${CI_COMMIT_REF_SLUG}
  paths:
    - node_modules/
    - .npm/

pylint:
  stage: validation
  script:
    - python -m pylint --version
    - PYTHONPATH=${PYTHONPATH}:$(dirname %d) python -m pylint scripts/ci.py
  except:
    - tags

eslint:
  stage: validation
  cache:
    key: ${CI_COMMIT_REF_SLUG}
  script:
    - python scripts/ci.py eslint
  except:
    - tags

unit_tests:
  stage: test
  cache:
    key: ${CI_COMMIT_REF_SLUG}
  script:
    - python scripts/ci.py unit_tests
  except:
    - tags

npm_publish_snapshot:
  stage: snapshot
  cache:
    key: ${CI_COMMIT_REF_SLUG}
  script:
    - echo "//registry.npmjs.org/:_authToken=${NPM_TOKEN}" > .npmrc
    - python scripts/ci.py publish_snapshots
  except:
    - main
    - tags

Handling secrets

NPM needs to know where your packages (libraries) will be registered and there needs to be some kind of authentication. It gets this information from the .npmrc file. Assuming you're publishing to npmjs.org, your pipeline's .npmrc file will look pretty similar to the one our CI files write with echo right before publishing.

Note: This .npmrc file is created and only exists during the lifecycle of the pipeline job. DON'T use the .npmrc file from your personal workstation!

The NPM_TOKEN is an environment variable we'll pass in from the project's settings. Be sure to mask this variable or you'll be inviting the world to publish npm packages on your behalf. GitHub does this automatically but GitLab requires an additional step.
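If you'd rather not build the .npmrc with echo in every CI file, the script itself could write it. This is only a sketch: write_npmrc() is a hypothetical helper that isn't in the demo project, and it assumes the token arrives in the NPM_TOKEN environment variable as described above.

```python
import os
from pathlib import Path


def write_npmrc(directory=".", registry="registry.npmjs.org", env=os.environ):
    """Writes a throwaway .npmrc using the token from NPM_TOKEN."""
    token = env.get("NPM_TOKEN")
    if not token:
        # Fail early instead of publishing with a broken auth line
        raise RuntimeError("NPM_TOKEN is not set; refusing to write .npmrc")

    npmrc = Path(directory) / ".npmrc"
    npmrc.write_text(f"//{registry}/:_authToken={token}\n", encoding="UTF-8")
    return npmrc
```

Failing loudly on a missing token makes a misconfigured secret show up as a clear error rather than a confusing npm authentication failure later in the job.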

Creating an auth token for npmjs.org

Sign into npmjs.org and click on your username on the far right. Select "Access Tokens"

Click "Generate New Token" on the far right

Name your token and select the "Automation" option. This is the ideal option because it will bypass two-factor authentication (which you absolutely should have set up).

Copy the token to a file so you don't lose it.

No, you can't use this token. It has been deleted 😉

Creating Actions secrets in GitHub

Go to the project Settings > Secrets > Actions. Click "New repository secret"

Fill out the "Name" and "Secret" keys with NPM_TOKEN and whatever your auth token is, then click "Add secret"

Creating masked environment variables in GitLab

Go to the project Settings > CI/CD
Expand the "Variables" section and click "Add Variable"
Add your auth token to your project as NPM_TOKEN and make sure the "Masked" box is checked.

Pythonizing your pipeline

Now for the fun part. Let the Pythonization commence!

Create a folder named "scripts" at the root of your project. In that folder, create a file ci.py.

Each Python script call in our CI file passes in an argument. That argument is the name of a function in the script file, and passing it triggers that function.

For example, the unit_tests job has the following line:

python scripts/ci.py unit_tests

We're passing unit_tests to the ci.py file.

scripts/ci.py

import sys

# ...

def unit_tests():
    """Runs unit tests on libraries with changes"""
    npm_command("test")

locals()[sys.argv[1]]()
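Looking the function up through locals() is short and sweet, but it will happily call anything defined in the file and throws a raw KeyError on a typo. One possible alternative is an explicit table of allowed commands. This is a sketch, not what the demo project does: COMMANDS and main() are names I made up, and the two job functions are stand-ins for the real ones.

```python
import sys


def eslint():
    """Placeholder for the real eslint job."""
    return "eslint"


def unit_tests():
    """Placeholder for the real unit test job."""
    return "unit_tests"


# Only the names listed here can be invoked from the command line.
COMMANDS = {"eslint": eslint, "unit_tests": unit_tests}


def main(argv):
    """Looks up and runs the job named by the first argument."""
    if len(argv) != 2 or argv[1] not in COMMANDS:
        raise SystemExit(f"usage: ci.py [{'|'.join(COMMANDS)}]")
    return COMMANDS[argv[1]]()

# At the bottom of ci.py you would call: main(sys.argv)
```

An unknown job name now exits with a usage message instead of a stack trace, which is friendlier when someone mistypes a job in the CI file.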

Let's assume npm_command() handles whatever npm command you pass in (shocking, I know). It would look something like this:

import subprocess

def npm_command(command):
    """Runs npm commands depending on input"""
    npm_install()
    diffs = get_diffs()

    # Run the matching npm script (e.g. "test-my-lib") for each changed library
    for library in diffs:
        subprocess.check_call(f"npm run {command}-{library}", shell=True)
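One thing to keep in mind: the library names come from folder names in the repository, and with shell=True they end up interpolated into a shell command. A possible hardening is to pass the command as a list and skip the shell entirely. The sketch below is an assumption-laden variation, not the demo project's code: build_npm_run(), run_npm(), and the SAFE_NAME rule are all hypothetical.

```python
import re
import subprocess

# Assumed naming rule for library folders; loosen it if yours differ.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def build_npm_run(command, library):
    """Builds the argument list for `npm run <command>-<library>`."""
    if not SAFE_NAME.fullmatch(library):
        raise ValueError(f"Unexpected library name: {library!r}")
    return ["npm", "run", f"{command}-{library}"]


def run_npm(command, library):
    """Runs the npm script without going through a shell."""
    subprocess.check_call(build_npm_run(command, library))
```

For example, build_npm_run("test", "my-lib") gives ["npm", "run", "test-my-lib"], while a name containing spaces or semicolons is rejected before anything runs.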

That get_diffs() function uses the GitPython package to compare the changes on your branch with the default origin branch (main). It finds all the git diffs, plucks out the library name, and returns a set of library names to avoid duplication.

import os

from git import Repo  # provided by the GitPython package

def get_diffs():
    """Gets the git diffs to determine which libraries to run operations on"""
    path = os.getcwd()
    repo = Repo(path)
    repo.remotes.origin.fetch()
    # Equivalent to: git diff origin/main --name-only
    diffs = str(repo.git.diff('origin/main', name_only=True)).splitlines()

    updated_libraries = []

    for diff in diffs:
        # Changed files look like projects/<library>/...
        if diff.startswith("projects/"):
            path_parts = diff.split("/")
            updated_libraries.append(path_parts[1])

    return set(updated_libraries)
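The path-parsing half of get_diffs() doesn't need git at all, so it can be split into a pure function that's easy to unit test. Here's a sketch of that idea. libraries_from_paths() is a hypothetical helper of mine, and it's a little stricter than the original: it only counts paths shaped like projects/<library>/<file>, so a file sitting directly under projects/ isn't mistaken for a library.

```python
def libraries_from_paths(paths, root="projects"):
    """Returns the set of library names found under `root` in the given paths."""
    libraries = set()
    for path in paths:
        parts = path.split("/")
        # Require root/<library>/<something> so files directly under root are skipped
        if len(parts) >= 3 and parts[0] == root:
            libraries.add(parts[1])
    return libraries
```

With that in place, get_diffs() could shrink to fetching, running the diff, and returning libraries_from_paths(diffs).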

Great, but what about something a little more complex, like publishing snapshots? NPM doesn't allow for duplicate version numbers, so how would we handle that?

Let's take another look at that npm_publish_snapshot job.

python scripts/ci.py publish_snapshots

So it'll call the publish_snapshots() function in our script.

def publish_snapshots():
    """Publishes npm snapshots on libraries with changes"""
    npm_command("publish snapshots")

Not super helpful so far. Let's take another look at that npm_command() function.

def npm_command(command):
    """Runs npm commands depending on input"""
    npm_install()
    diffs = get_diffs()

    for library in diffs:
        if command == "publish snapshots":
            handle_snapshot_publication(library)
        else:
            subprocess.check_call(f"npm run {command}-{library}", shell=True)

Now there's an if/else block in our loop. The handle_snapshot_publication() function should append a unique snapshot version to the changed library, build, then publish it.

import json
import re
import subprocess
from datetime import datetime

def handle_snapshot_publication(library):
    """Updates version with snapshot, builds, and publishes snapshot"""
    package_json_path = f"./projects/{library}/package.json"
    with open(package_json_path, "r", encoding="UTF-8") as package_json:
        contents = json.load(package_json)

    version = contents["version"]
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    is_snapshot_version = re.match(r"\s*([\d.]+)-SNAPSHOT-([\d-]+)", version)

    # Strip an existing snapshot suffix so reruns don't stack them
    if is_snapshot_version:
        version = version.split("-")[0]

    contents["version"] = f"{version}-SNAPSHOT-{timestamp}"
    with open(package_json_path, "w", encoding="UTF-8") as package_json:
        package_json.write(json.dumps(contents, indent=2))

    # Build first; the publish step uploads the built output in ./dist
    subprocess.check_call(f"npm run build-{library}", shell=True)
    subprocess.check_call(f"npm publish --access=public ./dist/{library}", shell=True)

The above function reads the changed library's package.json file, parses out the version number, and replaces it with the version number plus a timestamp. So a version 1.2.3 becomes version 1.2.3-SNAPSHOT-{year/month/day-hour/minute/second} (e.g. 1.2.3-SNAPSHOT-20221110-075530). It also has a check, is_snapshot_version, in case you're rerunning a job. This keeps funky versions from being generated, like 1.2.3-SNAPSHOT-{timestamp}-SNAPSHOT-{timestamp}.

Let's clean that up a little bit so each function has a single purpose.

def append_snapshot_version(library):
    """Appends "-SNAPSHOT-" plus timestamp (down to the second) to library version"""
    package_json_path = f"./projects/{library}/package.json"
    with open(package_json_path, "r", encoding="UTF-8") as package_json:
        contents = json.load(package_json)

    version = contents["version"]
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    is_snapshot_version = re.match(r"\s*([\d.]+)-SNAPSHOT-([\d-]+)", version)

    if is_snapshot_version:
        version = version.split("-")[0]

    contents["version"] = f"{version}-SNAPSHOT-{timestamp}"
    with open(package_json_path, "w", encoding="UTF-8") as package_json:
        package_json.write(json.dumps(contents, indent=2))

def handle_snapshot_publication(library):
    """Updates version with snapshot, builds, and publishes snapshot"""
    append_snapshot_version(library)
    subprocess.check_call(f"npm run build-{library}", shell=True)
    subprocess.check_call(f"npm publish --access=public ./dist/{library}", shell=True)
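You could take the cleanup one step further and pull the version math out of the file handling, which makes it testable without touching a package.json. This is a sketch under my own assumptions: snapshot_version() and SNAPSHOT_SUFFIX are hypothetical names, and unlike the code above it strips only a trailing -SNAPSHOT-{timestamp} suffix, so a prerelease tag such as -rc.1 would survive.

```python
import re
from datetime import datetime

# Matches a trailing "-SNAPSHOT-<digits and dashes>" suffix
SNAPSHOT_SUFFIX = re.compile(r"-SNAPSHOT-[\d-]+$")


def snapshot_version(version, now=None):
    """Returns `version` with a fresh -SNAPSHOT-<timestamp> suffix."""
    now = now or datetime.now()
    # Drop any previous snapshot suffix so reruns don't stack them
    base = SNAPSHOT_SUFFIX.sub("", version.strip())
    return f"{base}-SNAPSHOT-{now:%Y%m%d-%H%M%S}"
```

Passing the time in as a parameter is a deliberate choice: a test can hand in a fixed datetime and compare against an exact string.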

By now you should be seeing the pattern. You can see from the job outputs that the pipeline is only running the jobs on the changed libraries:

[Screenshot: GitHub job output]

[Screenshot: GitLab job output]

Enjoy a less painful pipeline with the power of Python! 🦸‍♂️

Top comments (3)

Lucy Linder • Nov 12 '22 • Edited

As much as I agree this is awesome, I don't see exactly where python shines here. I mean, you could do the same with nodejs, hence avoiding setting up yet another language framework (one less requirement). Linting with a js linter would also be easier in this case.

Let's be clear, I love python and would also choose it above nodejs, but would you care to explain a bit more your arguments in its favor?

PS great article and good job!

Stephen Whitmore • Nov 12 '22

You know what, you make a very good point. I opted to go with Python because my team uses a common pipeline that serves many projects written in different languages and different frameworks. I just followed that pattern for this particular scenario and I'm a bit embarrassed to admit that this had not even occurred to me!

Stephen Whitmore • Nov 12 '22

I'd be interested in hearing more about these tools. I agree less code is always better. My team went this route because we had pretty custom needs and we felt using our own scripts would be the most straightforward route.