<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Edgar Roman</title>
    <description>The latest articles on DEV Community by Edgar Roman (@edgarroman).</description>
    <link>https://dev.to/edgarroman</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F918075%2F626bfa4b-13fe-4cc9-85af-58042f1c2bbd.png</url>
      <title>DEV Community: Edgar Roman</title>
      <link>https://dev.to/edgarroman</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/edgarroman"/>
    <language>en</language>
    <item>
      <title>Workflow for transcribing audio in AWS</title>
      <dc:creator>Edgar Roman</dc:creator>
      <pubDate>Mon, 27 Feb 2023 21:16:55 +0000</pubDate>
      <link>https://dev.to/edgarroman/workflow-for-transcribing-audio-in-aws-2h1a</link>
      <guid>https://dev.to/edgarroman/workflow-for-transcribing-audio-in-aws-2h1a</guid>
      <description>&lt;p&gt;My sister had an interesting challenge for me recently. She had some audio that needed to be automatically run through &lt;a href="https://aws.amazon.com/transcribe/" rel="noopener noreferrer"&gt;AWS Transcribe&lt;/a&gt; from files uploaded to S3.&lt;/p&gt;

&lt;p&gt;Initially I thought AWS Step Functions would be a great fit, but I opted for a simpler solution using direct events. A workflow tool is normally more flexible for complex solutions, and hard-coded event wiring tends to make a solution brittle when future changes come, but often simple is better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Here's a high-level overview of the flow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpr7gj8308sx0tmv1cxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpr7gj8308sx0tmv1cxi.png" alt="Architecture Diagram" width="476" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can describe the flow thus:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User uploads audio file (in mp3 format) to the &lt;code&gt;uploads/&lt;/code&gt; folder in the bucket&lt;/li&gt;
&lt;li&gt;Event Bridge detects the created object and triggers the &lt;code&gt;Start Transcribe Job&lt;/code&gt; lambda function&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Start Transcribe Job&lt;/code&gt; lambda initiates the AWS Transcribe Job and exits&lt;/li&gt;
&lt;li&gt;AWS Transcribe completes the conversion from audio to text and writes the file back to the S3 bucket in the &lt;code&gt;output/&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;Event Bridge detects a state change in the AWS Transcribe job and triggers the &lt;code&gt;Cleanup&lt;/code&gt; lambda function&lt;/li&gt;
&lt;/ol&gt;
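&lt;p&gt;As a sketch of how step 2 is wired, the EventBridge rule can match &lt;code&gt;Object Created&lt;/code&gt; events scoped to the &lt;code&gt;uploads/&lt;/code&gt; prefix (the bucket name below is the example used in this post; substitute your own):&lt;/p&gt;

```json
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": { "name": ["transcribe-workflow-experiment"] },
    "object": { "key": [{ "prefix": "uploads/" }] }
  }
}
```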

&lt;h2&gt;
  
  
  Guide
&lt;/h2&gt;

&lt;p&gt;Let's get into how to build this system using the AWS console. Once you are familiar with the important steps, you can automate the process using a tool like &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the bucket
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Load up the AWS console for S3: &lt;a href="https://s3.console.aws.amazon.com/s3/buckets" rel="noopener noreferrer"&gt;https://s3.console.aws.amazon.com/s3/buckets&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Click on the "Create Bucket" button&lt;/li&gt;
&lt;li&gt;Enter the bucket name and you can accept all default options. In this example I put in:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; &lt;code&gt;transcribe-workflow-experiment&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Once created, if you click on the bucket you should see a button for "Create Folder". Use this button to create:

&lt;ul&gt;
&lt;li&gt;A folder named &lt;code&gt;uploads&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A folder named &lt;code&gt;output&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Also, on the Properties page of the S3 bucket, be sure to enable Amazon EventBridge notifications, or you won't get any events!&lt;/li&gt;
&lt;/ol&gt;
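&lt;p&gt;If you prefer to script these bucket steps instead of clicking through the console, here's a minimal boto3 sketch (&lt;code&gt;setup_transcribe_bucket&lt;/code&gt; is a hypothetical helper name, not part of the console flow):&lt;/p&gt;

```python
def setup_transcribe_bucket(s3, bucket_name):
    """Replicate the console steps above with boto3 (hypothetical helper)."""
    # 1. Create the bucket, accepting the default options
    #    (outside us-east-1 you also need CreateBucketConfiguration)
    s3.create_bucket(Bucket=bucket_name)
    # 2. "Folders" in S3 are just zero-byte objects whose keys end in "/"
    s3.put_object(Bucket=bucket_name, Key="uploads/")
    s3.put_object(Bucket=bucket_name, Key="output/")
    # 3. Enable EventBridge notifications; without this, no object events fire
    s3.put_bucket_notification_configuration(
        Bucket=bucket_name,
        NotificationConfiguration={"EventBridgeConfiguration": {}},
    )


if __name__ == "__main__":
    import boto3

    setup_transcribe_bucket(boto3.client("s3"), "transcribe-workflow-experiment")
```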

&lt;h3&gt;
  
  
  Create your "Start Transcribe Job" Lambda
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Load up the lambda part of the AWS Console: &lt;a href="https://us-east-1.console.aws.amazon.com/lambda/home" rel="noopener noreferrer"&gt;https://us-east-1.console.aws.amazon.com/lambda/home&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Click on "Create Function"&lt;/li&gt;
&lt;li&gt;Use the following settings:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; transcribe-workflow-start-transcribe&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RunTime:&lt;/strong&gt; Python 3.9&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture:&lt;/strong&gt; x86_64&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;And accept all other default parameters by clicking 'Create Function'&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Enhance the Lambda IAM role
&lt;/h3&gt;

&lt;p&gt;When you create a new Lambda function, AWS will by default automatically create an IAM role associated with the function. This IAM role has the bare minimum permissions to execute the Lambda function and almost nothing else. We will need to add permissions to this role in order to complete the overall architecture.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Once the Lambda function is created, click to see the details.&lt;/li&gt;
&lt;li&gt; Click on 'Configuration'&lt;/li&gt;
&lt;li&gt; Then click on the 'Role Name' and it should open a new tab&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You should now see the IAM permissions for the role associated with your Lambda function. We need to add the ability to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Access the S3 bucket we created previously&lt;/li&gt;
&lt;li&gt;  Start AWS Transcribe jobs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;First, we add S3 access. Click on the 'Add Permissions' dropdown and select 'Create Inline Policy'. Click on the 'JSON' tab, input the following policy, and click 'Review Policy'. Make sure to replace &lt;code&gt;transcribe-workflow-experiment&lt;/code&gt; in the JSON with your bucket name.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ListObjectsInBucket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::transcribe-workflow-experiment"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AllObjectActions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3:*Object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::transcribe-workflow-experiment/*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;On the next screen, give the policy a descriptive name, such as "allow-bucket-access-for-transcribe-workflow".&lt;/li&gt;
&lt;li&gt;Now we add permission to start AWS Transcribe jobs. Repeat the same steps with a new policy and this JSON:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AllowTranscribeJobs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transcribe:StartTranscriptionJob"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Add the Lambda code
&lt;/h3&gt;

&lt;p&gt;Next, we populate the Python code that runs in the Lambda function. The code below is pretty straightforward, but there are a couple of things to note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The Lambda function does not actually interact directly with S3. So why did we add the IAM policy for S3 access? Because, by default, the AWS Transcribe job uses the same IAM role as the Lambda function that started it.&lt;/li&gt;
&lt;li&gt;  The function instructs AWS Transcribe to output the results back to the same bucket, but to a different folder. See the parameters &lt;code&gt;OutputBucketName&lt;/code&gt; and &lt;code&gt;OutputKey&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  The output file will be named after the &lt;code&gt;TranscriptionJobName&lt;/code&gt;. So we add a little utility function to 'slugify' the filename. To wit: "Audio File #3.mp3" becomes "audio-file-3-mp3".&lt;/li&gt;
&lt;li&gt;  Additional Transcribe options can be added as you see fit. E.g. Leveraging a custom dictionary.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;transcribe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcribe-workflow-start-job started...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# The job name is important because the output file will be named after
&lt;/span&gt;        &lt;span class="c1"&gt;# the original audio file name, but 'slugified'.  This means it contains
&lt;/span&gt;        &lt;span class="c1"&gt;# only lower case characters, numbers, and hyphens
&lt;/span&gt;        &lt;span class="n"&gt;job_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;slugify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcribe-workflow-start-job: received the following file &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;s3_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="c1"&gt;# Kick off the transcription job
&lt;/span&gt;        &lt;span class="n"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_transcription_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;TranscriptionJobName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Media&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MediaFileUri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;s3_uri&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;LanguageCode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;OutputBucketName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;OutputKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Lambda: Start Transcribe Complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="c1"&gt;# Small slugify code taken from:
# https://gist.github.com/gergelypolonkai/1866fd363f75f4da5f86103952e387f6
# Converts "Audio File #3.mp3" to "audio-file-3-mp3"
#
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unicodedata&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt;

&lt;span class="n"&gt;_punctuation_re&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[\t !&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#$%&amp;amp;\'()*\-/&amp;lt;=&amp;gt;?@\[\\\]^_`{|},.]+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;slugify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Generate an ASCII-only slug.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_punctuation_re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
        &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NFKD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ascii&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;delim&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
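&lt;p&gt;To make the handler's inputs concrete, here's a trimmed sketch of the EventBridge &lt;code&gt;Object Created&lt;/code&gt; event it receives (illustrative values; fields outside &lt;code&gt;detail&lt;/code&gt; omitted) and the S3 URI the code derives from it:&lt;/p&gt;

```python
# A trimmed EventBridge "Object Created" event (illustrative values only)
sample_event = {
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "transcribe-workflow-experiment"},
        "object": {"key": "uploads/Sample Audio.mp3"},
    },
}

# The same lookups the handler performs
bucket = sample_event["detail"]["bucket"]["name"]
key = sample_event["detail"]["object"]["key"]
s3_uri = f"s3://{bucket}/{key}"  # passed to Transcribe as Media.MediaFileUri

print(s3_uri)  # s3://transcribe-workflow-experiment/uploads/Sample Audio.mp3
```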



&lt;h3&gt;
  
  
  Create the Cleanup Lambda function
&lt;/h3&gt;

&lt;p&gt;Once the transcribe job is complete, we want this function to clean up a bit. Tasks needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Move the original audio file to a &lt;code&gt;completed&lt;/code&gt; folder or &lt;code&gt;failed&lt;/code&gt; folder depending on the job outcome&lt;/li&gt;
&lt;li&gt;  (Optional) Delete the transcription job. Even though AWS Transcribe automatically deletes jobs after 90 days, it could reduce clutter in the AWS console&lt;/li&gt;
&lt;li&gt;  (Optional) Contact a user with a notification that a job has completed&lt;/li&gt;
&lt;/ul&gt;
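&lt;p&gt;Since S3 has no rename operation, the "move" in the first task is a copy followed by a delete. A minimal sketch (&lt;code&gt;destination_key&lt;/code&gt; and &lt;code&gt;move_object&lt;/code&gt; are hypothetical helper names):&lt;/p&gt;

```python
def destination_key(key, job_status):
    """Map an uploaded object's key to its post-job home (hypothetical helper).
    e.g. "uploads/Sample Audio.mp3" -> "completed/Sample Audio.mp3"
    """
    folder = "completed/" if job_status == "COMPLETED" else "failed/"
    # Keep only the file name; drop the "uploads/" prefix
    return folder + key.split("/")[-1]


def move_object(s3, bucket, key, job_status):
    """Copy the object to its new key, then delete the original."""
    new_key = destination_key(key, job_status)
    s3.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": key},
    )
    s3.delete_object(Bucket=bucket, Key=key)
```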

&lt;p&gt;For this blog post, we will only manage the original audio file. We primarily wanted to highlight the capability of a post-processing Lambda function.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load up the Lambda part of the AWS Console: &lt;a href="https://us-east-1.console.aws.amazon.com/lambda/home" rel="noopener noreferrer"&gt;https://us-east-1.console.aws.amazon.com/lambda/home&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Click on "Create Function"&lt;/li&gt;
&lt;li&gt;Use the following settings:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; transcribe-workflow-transcribe-completed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RunTime:&lt;/strong&gt; Python 3.9&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture:&lt;/strong&gt; x86_64&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;And accept all other default parameters by clicking 'Create Function'&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once complete, insert the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;transcribe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcribe-workflow-cleanup started...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;job_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TranscriptionJobName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transcribe job: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_transcription_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TranscriptionJobName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;job_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;404&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Job not found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;# Now extract the name of the original file
&lt;/span&gt;        &lt;span class="c1"&gt;# in the form:
&lt;/span&gt;        &lt;span class="c1"&gt;# s3://transcribe-workflow-experiment/uploads/Sample Audio.mp3
&lt;/span&gt;        &lt;span class="n"&gt;original_audio_s3_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TranscriptionJob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Media&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MediaFileUri&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# Use [5:] to remove the 's3://' and partition returns the string split by the first '/'
&lt;/span&gt;        &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_audio_s3_uri&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:].&lt;/span&gt;&lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;media_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Media file &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;media_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; is at: key &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; in bucket &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Job status will either be "COMPLETED" or "FAILED"
&lt;/span&gt;        &lt;span class="n"&gt;job_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TranscriptionJobStatus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;new_folder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;job_status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAILED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;new_folder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="c1"&gt;# Tell AWS to 'move' the file in S3 which really means making a copy and deleting the original
&lt;/span&gt;        &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;CopySource&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;new_folder&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;media_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Lambda: Transcribe Cleanup Complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
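&lt;p&gt;Before moving on, it's easy to sanity-check the URI parsing above outside of Lambda. Here's a minimal standalone sketch of the same logic (the helper name &lt;code&gt;parse_s3_uri&lt;/code&gt; is mine, not part of the Lambda):&lt;/p&gt;

```python
# Standalone sketch of the URI parsing performed in the cleanup Lambda above.
# [5:] strips the leading "s3://"; partition() splits on the first "/".
def parse_s3_uri(uri):
    bucket, _, key = uri[5:].partition("/")
    _, _, media_file = key.partition("/")
    return bucket, key, media_file

bucket, key, media_file = parse_s3_uri(
    "s3://transcribe-workflow-experiment/uploads/Sample Audio.mp3"
)
print(bucket)      # transcribe-workflow-experiment
print(key)         # uploads/Sample Audio.mp3
print(media_file)  # Sample Audio.mp3
```

&lt;p&gt;Note this assumes keys exactly one folder deep, as in this workflow; a deeper key like 'uploads/a/b.mp3' would leave 'a/b.mp3' as the media file name.&lt;/p&gt;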



&lt;h3&gt;
  
  
  Enhance the IAM role for the 2nd Lambda function
&lt;/h3&gt;

&lt;p&gt;We need to grant additional permissions to the Lambda's IAM role to complete this architecture, specifically the ability to access the S3 bucket we created previously.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Once the Lambda function is created, click to see the details.&lt;/li&gt;
&lt;li&gt; Click on 'Configuration'&lt;/li&gt;
&lt;li&gt; Then click on the 'Role Name'; it should open in a new tab&lt;/li&gt;
&lt;li&gt; You should now see the IAM permissions for the role associated with your Lambda function.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Click on the 'Add Permissions' dropdown and select 'Create Inline Policy'. Click on the 'JSON' tab, input the following policy, and click 'Review Policy'. Make sure the bucket name you created replaces &lt;code&gt;transcribe-workflow-experiment&lt;/code&gt; in the JSON.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ListObjectsInBucket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::transcribe-workflow-experiment"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AllObjectActions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3:*Object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::transcribe-workflow-experiment/*"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;On the next screen, give the policy a descriptive name like "allow-bucket-access-for-transcribe-workflow-cleanup".&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Now we add permission to query AWS Transcribe jobs. Repeat the same steps with a new policy and this JSON:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"QueryTranscriptionJobs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transcribe:GetTranscriptionJob"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Wire up events with EventBridge
&lt;/h3&gt;

&lt;p&gt;Now we need to connect all of these elements together, which we do with AWS EventBridge. EventBridge is a universal event bus that can tap into events throughout the AWS ecosystem. For example, the creation of an object in S3 can directly trigger a Lambda function, but only EventBridge can detect the AWS Transcribe job status change. So, to be consistent, we will use EventBridge for both the S3 object creation and the AWS Transcribe job status change.&lt;/p&gt;
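&lt;p&gt;Both rules deliver events with the same basic envelope, so each Lambda can key off the &lt;code&gt;source&lt;/code&gt; and &lt;code&gt;detail-type&lt;/code&gt; fields. Here's a rough sketch of the two events, trimmed to the fields this workflow relies on (real events carry many more fields, and the job name shown is a made-up example):&lt;/p&gt;

```python
# Trimmed sketches of the two EventBridge events used in this workflow.
s3_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "transcribe-workflow-experiment"},
        "object": {"key": "uploads/Sample Audio.mp3"},
    },
}

transcribe_event = {
    "source": "aws.transcribe",
    "detail-type": "Transcribe Job State Change",
    "detail": {
        "TranscriptionJobName": "example-job",  # made-up name
        "TranscriptionJobStatus": "COMPLETED",  # or "FAILED"
    },
}

def describe(event):
    # Tiny helper showing how a handler could tell the two apart.
    return f'{event["source"]}: {event["detail-type"]}'

print(describe(s3_event))          # aws.s3: Object Created
print(describe(transcribe_event))  # aws.transcribe: Transcribe Job State Change
```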

&lt;h4&gt;
  
  
  EventBridge setup for new S3 object
&lt;/h4&gt;

&lt;p&gt;In this setup, we will configure EventBridge to process all 'Object Created' events in the 'uploads/' folder of the bucket 'transcribe-workflow-experiment'. More information about EventBridge patterns can be found in the &lt;a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-event-patterns.html" rel="noopener noreferrer"&gt;AWS Docs&lt;/a&gt;. Objects created in other folders of the S3 bucket will not be processed by this rule.&lt;/p&gt;
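&lt;p&gt;The &lt;code&gt;prefix&lt;/code&gt; filter used here is a plain string-prefix test on the object key. A quick sketch of the behavior (this mimics EventBridge's matching; it is not the service's actual code):&lt;/p&gt;

```python
# Mimics the {"prefix": "uploads"} content filter applied to the S3 object key.
def rule_matches(key, prefix="uploads"):
    return key.startswith(prefix)

print(rule_matches("uploads/Sample Audio.mp3"))    # True
print(rule_matches("completed/Sample Audio.mp3"))  # False
```

&lt;p&gt;A nice side effect: when the cleanup Lambda later moves files into 'completed/' or 'failed/', those copies do not match the prefix, so they cannot re-trigger the workflow.&lt;/p&gt;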

&lt;ol&gt;
&lt;li&gt;Navigate to the EventBridge page in the AWS console: &lt;a href="https://us-east-1.console.aws.amazon.com/events/home" rel="noopener noreferrer"&gt;https://us-east-1.console.aws.amazon.com/events/home&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;From the left menu, select 'Rules'&lt;/li&gt;
&lt;li&gt;Click on the 'Create Rule' button&lt;/li&gt;
&lt;li&gt;Use the following parameters and click 'Next'

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; transcribe-workflow-new-object-rule&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Notifies Lambda there is a new audio file in S3 in the '/uploads' folder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event Bus:&lt;/strong&gt; default&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On the 'Build Event Pattern' page:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Event source:&lt;/strong&gt; AWS events or EventBridge partner events&lt;/li&gt;
&lt;li&gt;Scroll past the Sample Event section&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creation method:&lt;/strong&gt; Select 'Custom Pattern (JSON Editor)'&lt;/li&gt;
&lt;li&gt;Insert the following, making sure to update the bucket name to yours, then click 'Next'
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"aws.s3"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"detail-type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Object Created"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"detail"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"transcribe-workflow-experiment"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="nl"&gt;"prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uploads"&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On the 'Select target(s)' screen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Target 1:&lt;/strong&gt; Select AWS Service&lt;/li&gt;
&lt;li&gt;In the dropdown, use target type 'Lambda Function'&lt;/li&gt;
&lt;li&gt;Select the first Lambda function that starts the transcribe job&lt;/li&gt;
&lt;li&gt;Select 'Next'&lt;/li&gt;
&lt;li&gt;You may skip tags and click 'Next' and then complete the process&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  EventBridge setup for AWS Transcribe Job Complete
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;From the left menu, select 'Rules'&lt;/li&gt;
&lt;li&gt;Click on the 'Create Rule' button&lt;/li&gt;
&lt;li&gt;Use the following parameters and click 'Next'

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name:&lt;/strong&gt; transcribe-workflow-job-complete-rule&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Notifies Lambda the transcribe job is complete&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event Bus:&lt;/strong&gt; default&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On the 'Build Event Pattern' page:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Event source:&lt;/strong&gt; AWS events or EventBridge partner events&lt;/li&gt;
&lt;li&gt;Scroll past the Sample Event section&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creation method:&lt;/strong&gt; Select 'Custom Pattern (JSON Editor)'&lt;/li&gt;
&lt;li&gt;Insert the following:
&lt;pre class="highlight json"&gt;&lt;code&gt;{
    "source": ["aws.transcribe"],
    "detail-type": ["Transcribe Job State Change"],
    "detail": {
        "TranscriptionJobStatus": ["COMPLETED", "FAILED"]
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;Click 'Next'&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;On the 'Select target(s)' screen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Target 1:&lt;/strong&gt; Select AWS Service&lt;/li&gt;
&lt;li&gt;In the dropdown, use target type 'Lambda Function'&lt;/li&gt;
&lt;li&gt;Select the second Lambda function that cleans up the transcribe job&lt;/li&gt;
&lt;li&gt;Select 'Next'&lt;/li&gt;
&lt;li&gt;You may skip tags and click 'Next' and then complete the process&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Test it out
&lt;/h3&gt;

&lt;p&gt;You can try this out in the AWS Console by uploading files into the 'uploads/' folder of the bucket. Try uploading an audio file and checking the output. Next, try uploading a non-audio file, like a jpg, and watch it get swept into the 'failed' folder.&lt;/p&gt;
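&lt;p&gt;If you want to predict where a file will land before checking the console, the folder routing in the cleanup Lambda boils down to this one-liner (the helper name is mine):&lt;/p&gt;

```python
# Same folder-routing rule the cleanup Lambda applies to finished jobs.
def destination_folder(job_status):
    return "failed/" if job_status == "FAILED" else "completed/"

print(destination_folder("COMPLETED"))  # completed/
print(destination_folder("FAILED"))     # failed/
```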

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;As you can see, this is essentially a two-step workflow. More tasks can be added to the 2nd Lambda function if you wish. To extend it further, you can introduce Step Functions for a more flexible workflow. Most of the code will stay the same, with slight tweaks to the event formats.&lt;/p&gt;

&lt;p&gt;One difficulty you will have with Step Functions is pausing the flow while AWS Transcribe processes the job. Step Functions generates a task token that is needed to resume the workflow, and AWS Transcribe will not store that token for you. So you'd need a mechanism to either store the token in something like DynamoDB or do some deep inspection of the workflow execution to extract the resume token.&lt;/p&gt;
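&lt;p&gt;One workable pattern for that first option: stash the task token keyed by the transcription job name when the job starts, then look it up when the job-complete event arrives. A minimal in-memory sketch of the idea (in practice the store would be a DynamoDB table, and all names here are hypothetical):&lt;/p&gt;

```python
# In-memory stand-in for a DynamoDB table mapping job name to task token.
token_store = {}

def on_job_started(job_name, task_token):
    # Called from the state machine task that kicks off transcription.
    token_store[job_name] = task_token

def on_job_finished(job_name):
    # Called from the EventBridge-triggered Lambda; real code would pass the
    # returned token to Step Functions' SendTaskSuccess to resume the workflow.
    return token_store.pop(job_name, None)
```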

&lt;p&gt;Regardless, I hope this helps someone working on a similar effort.&lt;/p&gt;

&lt;p&gt;Cover Photo by &lt;a href="https://unsplash.com/@mgmaasen?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Michael Maasen&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/ZY7fuakuXJ0?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>Opinionated Docker development workflow for Node.js projects - Part 2</title>
      <dc:creator>Edgar Roman</dc:creator>
      <pubDate>Wed, 31 Aug 2022 20:15:00 +0000</pubDate>
      <link>https://dev.to/edgarroman/opinionated-docker-development-workflow-for-nodejs-projects-part-2-21ao</link>
      <guid>https://dev.to/edgarroman/opinionated-docker-development-workflow-for-nodejs-projects-part-2-21ao</guid>
      <description>&lt;p&gt;In a &lt;a href="https://dev.to/edgarroman/why-use-docker-in-2022-1mjb/"&gt;recent post&lt;/a&gt;, I described &lt;strong&gt;why&lt;/strong&gt; you'd want to use Docker to develop server applications. Then I described how to use an &lt;a href="https://dev.to/edgarroman/opinionated-docker-development-workflow-for-nodejs-projects-part-1-26c0"&gt;opinionated workflow&lt;/a&gt;. In this post, I'll describe how the workflow operates under the covers.&lt;/p&gt;

&lt;p&gt;If you haven't read &lt;a href="https://dev.to/edgarroman/opinionated-docker-development-workflow-for-nodejs-projects-part-1-26c0"&gt;part 1&lt;/a&gt;, I strongly suggest you check it out now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Directory Structure and Files
&lt;/h2&gt;

&lt;p&gt;Remember the directory structure? We established a clear layout that isolates all your application-specific code in a single sub-folder while the top-level directory holds all the workflow files.&lt;/p&gt;

&lt;p&gt;It looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;└── Main_Project_Directory/
    ├── server-code/
    │   ├── server.js
    │   ├── package.json
    │   └── ... (All your other source code files)
    ├── .gitignore
    ├── .dockerignore
    ├── Dockerfile
    ├── docker-compose.yml
    └── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Dockerfile
&lt;/h2&gt;

&lt;p&gt;Here's the complete &lt;code&gt;Dockerfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Base node images can be found here: https://hub.docker.com/_/node?tab=description&amp;amp;amp%3Bpage=1&amp;amp;amp%3Bname=alpine&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; NODE_IMAGE=node:16.17-alpine&lt;/span&gt;

&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="c"&gt;# Base Image&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# All these commands are common to both development and production builds&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;$NODE_IMAGE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;base&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; NPM_VERSION=npm@8.18.0&lt;/span&gt;

&lt;span class="c"&gt;# While root is the default user to run as, why not be explicit?&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; root&lt;/span&gt;

&lt;span class="c"&gt;# Run tini as the init process and it will clean zombie processes as needed&lt;/span&gt;
&lt;span class="c"&gt;# Generally you can achieve this same effect by adding `--init` in your `docker RUN` command&lt;/span&gt;
&lt;span class="c"&gt;# And Nodejs servers tend not to spawn processes, so this is belt and suspenders&lt;/span&gt;
&lt;span class="c"&gt;# More info: https://github.com/krallin/tini&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apk add &lt;span class="nt"&gt;--no-cache&lt;/span&gt; tini
&lt;span class="c"&gt;# Tini is now available at /sbin/tini&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/sbin/tini", "--"]&lt;/span&gt;

&lt;span class="c"&gt;# Upgrade some global packages&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="nv"&gt;$NPM_VERSION&lt;/span&gt;

&lt;span class="c"&gt;# Specific to your framework&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# Some frameworks force a global install tool such as aws-amplify or firebase.  Run those commands here&lt;/span&gt;
&lt;span class="c"&gt;# RUN npm install -g firebase&lt;/span&gt;

&lt;span class="c"&gt;# Create space for our code to live&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /home/node/app &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; node:node /home/node/app
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /home/node/app&lt;/span&gt;

&lt;span class="c"&gt;# Switch to the `node` user instead of running as `root` for improved security&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; node&lt;/span&gt;

&lt;span class="c"&gt;# Expose the port to listen on here.  Express uses 8080 by default so we'll set that here.&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PORT=8080&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; $PORT&lt;/span&gt;

&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="c"&gt;# Development build&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# These commands are unique to the development builds&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;base&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;development&lt;/span&gt;

&lt;span class="c"&gt;# Copy the package.json file over and run `npm install`&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; server-code/package*.json ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# Now copy rest of the code.  We separate these copies so that Docker can cache the node_modules directory&lt;/span&gt;
&lt;span class="c"&gt;# So only when you add/remove/update package.json file will Docker rebuild the node_modules dir.&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; server-code ./&lt;/span&gt;

&lt;span class="c"&gt;# Finally, if the container is run in headless, non-interactive mode, start up node&lt;/span&gt;
&lt;span class="c"&gt;# This can be overridden by the user running the Docker CLI by specifying a different endpoint&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["npx", "nodemon","server.js"]&lt;/span&gt;

&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="c"&gt;# Production build&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# These commands are unique to the production builds&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;base&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;

&lt;span class="c"&gt;# Indicate to all processes in the container that this is a production build&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; NODE_ENV=production&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; NODE_ENV=${NODE_ENV}&lt;/span&gt;

&lt;span class="c"&gt;# Now copy all source code&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --chown=node:node server-code ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm cache clean &lt;span class="nt"&gt;--force&lt;/span&gt;

&lt;span class="c"&gt;# Finally, if the container is run in headless, non-interactive mode, start up node&lt;/span&gt;
&lt;span class="c"&gt;# This can be overridden by the user running the Docker CLI by specifying a different endpoint&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["node","server.js"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dockerfile: Version Management and Multi-Stage
&lt;/h3&gt;

&lt;p&gt;Let's start at the top of the file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Base node images can be found here: https://hub.docker.com/_/node?tab=description&amp;amp;amp%3Bpage=1&amp;amp;amp%3Bname=alpine&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; NODE_IMAGE=node:16.17-alpine&lt;/span&gt;

&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="c"&gt;# Base Image&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# All these commands are common to both development and production builds&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;$NODE_IMAGE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;base&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; NPM_VERSION=npm@8.18.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, we know that versions of Node.js and &lt;code&gt;npm&lt;/code&gt; change over time. These are critical dependencies, so we put them up front at the top so developers can update them when starting a new project.&lt;/p&gt;

&lt;p&gt;Also note that this is a &lt;a href="https://docs.docker.com/develop/develop-images/multistage-build/"&gt;multi-stage Dockerfile&lt;/a&gt;. This accommodates a single file for both development and production builds. So we have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A common &lt;code&gt;base&lt;/code&gt; stage&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;development&lt;/code&gt; stage&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;production&lt;/code&gt; stage&lt;/li&gt;
&lt;/ol&gt;
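&lt;p&gt;Which of these stages you actually get is chosen at build time via the &lt;code&gt;target&lt;/code&gt; option. As a sketch of how the &lt;code&gt;docker-compose.yml&lt;/code&gt; from part 1 might select the development stage (the service name, port, and volume paths here are illustrative, not the exact part 1 file):&lt;/p&gt;

```yaml
# Hypothetical docker-compose.yml fragment selecting the development stage.
services:
  server:
    build:
      context: .
      target: development   # swap to "production" for a production image
    ports:
      - "8080:8080"         # matches the PORT set in the base stage
    volumes:
      - ./server-code:/home/node/app   # live-edit code inside the container
```

&lt;p&gt;A plain CLI build works the same way with &lt;code&gt;docker build --target development .&lt;/code&gt;&lt;/p&gt;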

&lt;h3&gt;
  
  
  Dockerfile: Base stage
&lt;/h3&gt;

&lt;p&gt;These are the commands in the Dockerfile for the &lt;code&gt;base&lt;/code&gt; stage.&lt;/p&gt;

&lt;p&gt;We explicitly set the user to be &lt;code&gt;root&lt;/code&gt; for the next several instructions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# While root is the default user to run as, why not be explicit?&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; root&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're adding a handler to capture termination signals from the system and shut down gracefully. While not critical for Node.js containers, it's a good practice in case we want to use this Dockerfile for other languages such as Python. More details can be found at &lt;a href="https://github.com/krallin/tini"&gt;https://github.com/krallin/tini&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run tini as the init process and it will clean zombie processes as needed&lt;/span&gt;
&lt;span class="c"&gt;# Generally you can achieve this same effect by adding `--init` in your `docker RUN` command&lt;/span&gt;
&lt;span class="c"&gt;# And Nodejs servers tend not to spawn processes, so this is belt and suspenders&lt;/span&gt;
&lt;span class="c"&gt;# More info: https://github.com/krallin/tini&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apk add &lt;span class="nt"&gt;--no-cache&lt;/span&gt; tini
&lt;span class="c"&gt;# Tini is now available at /sbin/tini&lt;/span&gt;
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["/sbin/tini", "--"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We install the specified version of &lt;code&gt;npm&lt;/code&gt; here. This is also the opportunity to install other packages that must be global, such as &lt;code&gt;aws-amplify&lt;/code&gt; or &lt;code&gt;firebase&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
While I hope the trend of requiring global packages goes away, we can easily support it using Docker. Be sure to pin a specific version of your global package to prevent nasty surprises later!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Upgrade some global packages&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; &lt;span class="nv"&gt;$NPM_VERSION&lt;/span&gt;

&lt;span class="c"&gt;# Specific to your framework&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# Some frameworks force a global install tool such as aws-amplify or firebase.  Run those commands here&lt;/span&gt;
&lt;span class="c"&gt;# RUN npm install -g firebase&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create an arbitrary location for the container to store your code. You can call this anything you'd like, but we're following the traditional Linux approach here for consistency. Note that we also have to change the ownership of this directory to the proper user (called &lt;code&gt;node&lt;/code&gt;) so that the user can properly create/read/write files in this directory.&lt;/p&gt;

&lt;p&gt;So your application will live in the directory &lt;code&gt;/home/node/app&lt;/code&gt;. Finally, we switch subsequent build instructions to be executed by the &lt;code&gt;node&lt;/code&gt; user. This follows the &lt;a href="https://github.com/nodejs/docker-node/blob/main/docs/BestPractices.md#non-root-user"&gt;recommended best practices&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create space for our code to live&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /home/node/app &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; node:node /home/node/app
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /home/node/app&lt;/span&gt;

&lt;span class="c"&gt;# Switch to the `node` user instead of running as `root` for improved security&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; node&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expose the port on which the application will listen. This is configurable, but must be changed in several files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Expose the port to listen on here.  Express uses 8080 by default so we'll set that here.&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PORT=8080&lt;/span&gt;
&lt;span class="k"&gt;EXPOSE&lt;/span&gt;&lt;span class="s"&gt; $PORT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Dockerfile: Development stage
&lt;/h3&gt;

&lt;p&gt;These are the commands in the Dockerfile for the &lt;code&gt;development&lt;/code&gt; stage.  When doing a build, you must specify the stage using the flag: &lt;code&gt;--target=development&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;First we copy over the &lt;code&gt;package.json&lt;/code&gt; file from the &lt;code&gt;server-code&lt;/code&gt; directory into the working directory &lt;code&gt;/home/node/app&lt;/code&gt; (set above by &lt;code&gt;WORKDIR&lt;/code&gt;). We also include any &lt;code&gt;package-lock.json&lt;/code&gt; files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="c"&gt;# Development build&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# These commands are unique to the development builds&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;base&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;development&lt;/span&gt;

&lt;span class="c"&gt;# Copy the package.json file over and run `npm install`&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; server-code/package*.json ./&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we run &lt;code&gt;npm install&lt;/code&gt; to establish all the application dependencies including development dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we copy the rest of the code. As the comment states, we separate this step for build caching: if the package requirements haven't changed, Docker reuses the cached &lt;code&gt;npm install&lt;/code&gt; layer, which can take a long time to rebuild. More often we're changing just the application source code, so subsequent builds are very fast.&lt;/p&gt;

&lt;p&gt;And finally, we run &lt;code&gt;npx nodemon server.js&lt;/code&gt; to start the application using &lt;code&gt;nodemon&lt;/code&gt; to reload the code when it changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Now copy rest of the code.  We separate these copies so that Docker can cache the node_modules directory&lt;/span&gt;
&lt;span class="c"&gt;# So only when you add/remove/update package.json file will Docker rebuild the node_modules dir.&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; server-code ./&lt;/span&gt;

&lt;span class="c"&gt;# Finally, if the container is run in headless, non-interactive mode, start up node&lt;/span&gt;
&lt;span class="c"&gt;# This can be overridden by the user running the Docker CLI by specifying a different endpoint&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["npx", "nodemon","server.js"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
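&lt;p&gt;One caveat worth hedging: on some host platforms, file-change events don't always propagate across a Docker bind mount, and &lt;code&gt;nodemon&lt;/code&gt; offers a polling fallback for that case. A hypothetical &lt;code&gt;server-code/nodemon.json&lt;/code&gt; (not part of the post's repo) might look like:&lt;/p&gt;

```json
{
  "watch": ["."],
  "ext": "js,json",
  "legacyWatch": true
}
```

&lt;p&gt;The equivalent one-off flag is &lt;code&gt;npx nodemon -L server.js&lt;/code&gt;.&lt;/p&gt;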



&lt;h3&gt;
  
  
  Dockerfile: Production stage
&lt;/h3&gt;

&lt;p&gt;These are the commands in the Dockerfile for the &lt;code&gt;production&lt;/code&gt; stage. When doing a build, you must specify the stage using the flag: &lt;code&gt;--target=production&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="c"&gt;# Production build&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;# These commands are unique to the production builds&lt;/span&gt;
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;base&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set some environment variables so all downstream scripts can discover this is a production build and act accordingly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Indicate to all processes in the container that this is a production build&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; NODE_ENV=production&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; NODE_ENV=${NODE_ENV}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy over all the source code at once. We are going to do a full &lt;code&gt;npm install&lt;/code&gt;, so there's no need to break the copy into multiple steps like we did in the development stage. After the install, we tell &lt;code&gt;npm&lt;/code&gt; to clear its cache to keep the image size down.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Now copy all source code&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --chown=node:node server-code ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm cache clean &lt;span class="nt"&gt;--force&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we kick off the application by telling &lt;code&gt;node&lt;/code&gt; to run the application &lt;code&gt;server.js&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Finally, if the container is run in headless, non-interactive mode, start up node&lt;/span&gt;
&lt;span class="c"&gt;# This can be overridden by the user running the Docker CLI by specifying a different endpoint&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["node","server.js"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Development Mode without Docker Compose
&lt;/h2&gt;

&lt;p&gt;We'll show how to use this Dockerfile using just the command line interface (CLI). In the next section, we'll simplify this with Docker Compose.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;If this is the first time you are running the container, or if you have changed &lt;em&gt;any&lt;/em&gt; package dependencies, then run:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; mynodeapp:DEV &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;development
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;This will build the image using the development stage instructions from the Dockerfile. Note that the image needs a name, so we're using &lt;code&gt;mynodeapp&lt;/code&gt; with the tag &lt;code&gt;DEV&lt;/code&gt;. Using &lt;code&gt;DEV&lt;/code&gt; helps avoid an accidental production deployment since it's not a semantic version.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To run the container, type the following command:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-ti&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;/server-code:/home/node/app"&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; /home/node/app/node_modules mynodeapp:DEV
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Docker will run your container in the terminal window, and any console messages will appear as they are printed.&lt;/p&gt;

&lt;p&gt;Changes to the source code should trigger a reload of Node and will be reflected in the console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If you make any changes to dependent packages, then you'll have to run the &lt;code&gt;docker build&lt;/code&gt; command as shown above. Run the build any time you add, update, or remove a package.&lt;/li&gt;
&lt;li&gt;  We assume that node will be running on port 8080. If this is not the case for your project, feel free to change it, but make sure to change it everywhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This workflow is made possible by some clever Docker commands. We'll expand on the command above here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;docker run&lt;/code&gt;: This is the primary Docker command to take a container image and run it&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;-ti&lt;/code&gt;: This instructs Docker to run this container interactively so you can see the output console&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;--rm&lt;/code&gt;: After you exit the container instance by pressing &lt;code&gt;Ctrl-c&lt;/code&gt; this flag instructs Docker to clean up after the container&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;-p 8080:8080&lt;/code&gt;: Ensure port 8080 on the container is mapped to port 8080 on your local machine so you can use &lt;code&gt;http://localhost:8080&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;-v "$(pwd)/server-code:/home/node/app"&lt;/code&gt;: This maps the directory &lt;code&gt;server-code&lt;/code&gt; (along with your source code) into the container directory &lt;code&gt;/home/node/app&lt;/code&gt;. So your source code and everything in the &lt;code&gt;server-code&lt;/code&gt; directory is available in the container.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;-v /home/node/app/node_modules&lt;/code&gt;: This is a special command that excludes the &lt;code&gt;node_modules&lt;/code&gt; directory on your local machine and instead keeps the container's &lt;code&gt;node_modules&lt;/code&gt; directory that was created during the build phase. This is important because the &lt;code&gt;node_modules&lt;/code&gt; on your local machine may be full of packages that are specific to the local machine operating system. And since we want the packages for the container, this flag makes that one directory take priority.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;mynodeapp:DEV&lt;/code&gt;: This is whatever you want to call your container image. We tag this image with &lt;code&gt;DEV&lt;/code&gt; to make sure you don't accidentally deploy this version.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Development Mode simplified with Docker Compose
&lt;/h2&gt;

&lt;p&gt;The commands to enable this workflow are long, complex, and difficult to remember. To simplify things, we have a few options. One common option is to create shell scripts for the commands above. This works, but is operating system specific (e.g. Windows will need a different solution than most shells).&lt;/p&gt;

&lt;p&gt;We can use Docker Compose to simplify our workflow. Docker Compose is a tool that allows you to put all the commands into a file that does the work for you. It also allows you to spin up multiple containers at the same time. This is useful if you have to spin up both a web server and a database server for local development. That scenario is outside the scope of this post.&lt;/p&gt;

&lt;p&gt;Note that the Docker Compose we are going to show here is for development flow only.&lt;/p&gt;

&lt;p&gt;First, create the &lt;code&gt;docker-compose.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;services:
  app:
    build:
      context: .
      target: development
      args:
        - NODE_ENV=development
    environment:
        - NODE_ENV=development
    ports:
      - "8080:8080"
    volumes:
      - ./server-code:/home/node/app
      - /home/node/app/node_modules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we've put the same command line arguments shown earlier into the file itself, so that makes it easy to build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And &lt;strong&gt;really&lt;/strong&gt; easy to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And when you're finished:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;And that's about it. Don't forget the &lt;code&gt;.dockerignore&lt;/code&gt; and, optionally, the &lt;code&gt;.gitignore&lt;/code&gt;; you can customize those as you see fit.&lt;/p&gt;
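&lt;p&gt;For reference, a typical minimal &lt;code&gt;.dockerignore&lt;/code&gt; for this directory layout might look like the following (a sketch, not the repo's exact file; adjust to your project):&lt;/p&gt;

```
# Keep host-specific installs out of the build context;
# npm install runs inside the image instead
server-code/node_modules

# Version control and local noise
.git
*.log
```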

</description>
      <category>javascript</category>
      <category>docker</category>
      <category>node</category>
      <category>containerapps</category>
    </item>
    <item>
      <title>Opinionated Docker development workflow for Node.js projects - Part 1</title>
      <dc:creator>Edgar Roman</dc:creator>
      <pubDate>Wed, 31 Aug 2022 19:24:59 +0000</pubDate>
      <link>https://dev.to/edgarroman/opinionated-docker-development-workflow-for-nodejs-projects-part-1-26c0</link>
      <guid>https://dev.to/edgarroman/opinionated-docker-development-workflow-for-nodejs-projects-part-1-26c0</guid>
      <description>&lt;p&gt;In a &lt;a href="https://dev.to/edgarroman/why-use-docker-in-2022-1mjb/"&gt;recent post&lt;/a&gt;, I described &lt;strong&gt;why&lt;/strong&gt; you'd want to use Docker to develop server applications. In this post, I'll describe &lt;strong&gt;how&lt;/strong&gt; to develop a Node.js application with Docker.&lt;/p&gt;

&lt;h1&gt;
  
  
  Overview
&lt;/h1&gt;

&lt;p&gt;The goals we'd like to accomplish in this blog post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  We are focusing on the Node.js environment to write a server-side application (e.g. Express, Sails, or others)&lt;/li&gt;
&lt;li&gt;  Allow local development without Docker (optional)&lt;/li&gt;
&lt;li&gt;  Allow local development with Docker with hot refresh when code changes&lt;/li&gt;
&lt;li&gt;  Provide instructions to build images for testing and production&lt;/li&gt;
&lt;li&gt;  Isolate container scripts from source code so one folder structure can be used for many projects&lt;/li&gt;
&lt;li&gt;  Be straightforward, but explain all the steps so modifications and updates can be made&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post is divided into two parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How to use the workflow (this post!)&lt;/li&gt;
&lt;li&gt;Dive into the details of how the Dockerfiles work&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We'll start with how to use the workflow and readers can continue on to the working details if interested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Docker Terminology
&lt;/h2&gt;

&lt;p&gt;For those who are new to Docker, I use some terms in this post and wanted to quickly define them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dockerfile:&lt;/strong&gt; A file that describes a set of instructions to Docker Desktop to build an image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image:&lt;/strong&gt; The artifact produced when Docker Desktop executes the instructions in a Dockerfile&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container:&lt;/strong&gt; A running instance of your image that can execute code&lt;/p&gt;

&lt;p&gt;That should be enough to get you rolling - let's get to the workflow!&lt;/p&gt;
&lt;h1&gt;
  
  
  How to use the workflow
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Directory Structure and Files
&lt;/h2&gt;

&lt;p&gt;We establish a clear directory structure that isolates all your application specific code into a single sub-folder and the top level directory holds all the workflow files.&lt;/p&gt;

&lt;p&gt;It looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;└── Main_Project_Directory/
    ├── server-code/
    │   ├── server.js
    │   ├── package.json
    │   └── ... (All your other source code files)
    ├── .gitignore
    ├── .dockerignore
    ├── Dockerfile
    ├── docker-compose.yml
    └── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;This diagram was generated on &lt;a href="https://tree.nathanfriend.io/"&gt;https://tree.nathanfriend.io/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Notes&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;server-code&lt;/code&gt; directory name is arbitrary.&lt;br&gt;
You may rename it, but be sure to update all the references in the Dockerfiles and Docker commands shown in this blog post. The purpose of this subdirectory is to isolate your server code from all the workflow stuff.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;Dockerfile&lt;/code&gt;, &lt;code&gt;docker-compose.yml&lt;/code&gt;, and &lt;code&gt;.dockerignore&lt;/code&gt; files will be taken from &lt;a href="https://github.com/edgarroman/docker-setup-node-container"&gt;this repo&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Your application must start with the file &lt;code&gt;server.js&lt;/code&gt; because the container will run &lt;code&gt;node server.js&lt;/code&gt; when launching. If you want to rename this, then you will have to update the files to refer to your own start file.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;You'll need to install &lt;a href="https://www.docker.com/products/docker-desktop/"&gt;Docker Desktop&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Clone this repo: &lt;a href="https://github.com/edgarroman/docker-setup-node-container"&gt;https://github.com/edgarroman/docker-setup-node-container&lt;/a&gt; or just take the Docker related files and build a directory structure as shown above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Updates
&lt;/h3&gt;

&lt;p&gt;As time goes on, you'll want to modify / upgrade the versions of Node.js and &lt;code&gt;npm&lt;/code&gt;. You can find the versions at the top of the &lt;code&gt;Dockerfile&lt;/code&gt;. At the time of this writing the lines look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Base node images can be found here: https://hub.docker.com/_/node?tab=description&amp;amp;amp%3Bpage=1&amp;amp;amp%3Bname=alpine&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; NODE_IMAGE=node:16.17-alpine&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the version of Node.js, head to &lt;a href="https://hub.docker.com/_/node?tab=description&amp;amp;page=1&amp;amp;name=alpine"&gt;the official node docker hub&lt;/a&gt; and pick your base docker image. I recommend you stick with &lt;strong&gt;alpine&lt;/strong&gt; unless you have additional needs. Replace the 2nd line in the &lt;code&gt;Dockerfile&lt;/code&gt; with your desired tag.&lt;/p&gt;

&lt;p&gt;For the npm version, see line 11. Update this as you see fit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; NPM_VERSION=npm@8.18.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All other versions of packages and whatnot are up to your preferences inside your app.&lt;/p&gt;

&lt;h1&gt;
  
  
  Workflow Guide
&lt;/h1&gt;

&lt;p&gt;We'll explore a number of workflows for developing and testing your code in this post:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Local development without containers&lt;/li&gt;
&lt;li&gt;Local development with containers (Preferred)&lt;/li&gt;
&lt;li&gt;Production Build and Local Testing with containers&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Local development without containers
&lt;/h2&gt;

&lt;p&gt;This optional workflow does not use Docker at all.&lt;/p&gt;

&lt;p&gt;Using this workflow allows you to develop your code locally on your system with the least number of abstractions and complications. But it also means you have to install the correct version of Node.js and npm locally.&lt;/p&gt;

&lt;p&gt;Your local system will be directly running Node and directly loading your code. This workflow requires the least amount of processing power from your machine and will provide the most responsive development environment. When you make changes to your code, they will be reflected as quickly as possible (using &lt;code&gt;nodemon&lt;/code&gt; to hot reload your code when changes are detected).&lt;/p&gt;

&lt;p&gt;The downside to this approach is that most likely your local machine is not running the operating system that your final container will be running. If you're running Windows, MacOS, or even some flavors of Linux, the packages used locally may not be identical to those ultimately used in production.&lt;/p&gt;

&lt;p&gt;The differences these packages have between platforms could inject subtle bugs and errors that would be confounding and difficult to debug. While many straightforward Javascript packages may be identical between platforms, there may be differences when your code needs to interact with the host machine's operating system.&lt;/p&gt;

&lt;p&gt;With the pitfalls noted above, why should you take this approach? I would only recommend this approach if you are working in an environment where running Docker Desktop puts too much stress on your machine.&lt;/p&gt;

&lt;p&gt;In general, I suggest using the next workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local development with containers (Preferred)
&lt;/h2&gt;

&lt;p&gt;This workflow allows you to develop by running your code in a container environment. This container environment matches exactly what you will be deploying to production. And you don't need to install anything on your local machine aside from Docker Desktop.&lt;/p&gt;

&lt;p&gt;In addition, if you are working with a team, then you can be assured that regardless of operating system they are running, the code will behave the same across all hosts.&lt;/p&gt;

&lt;p&gt;A key benefit of this workflow is that you can edit your source code and any updates will be reflected in the container. We are still using &lt;code&gt;nodemon&lt;/code&gt; to detect source code changes and reload Node. This greatly eases development by allowing developers to see changes much faster than having to rebuild the image on every change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps to get up and running
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; Start Docker Desktop on your local machine&lt;/li&gt;
&lt;li&gt; Navigate to the main project directory (not in &lt;code&gt;server-code&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If this is the first time you are running this workflow, or if you have changed &lt;em&gt;any&lt;/em&gt; package dependencies, then run:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose build
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;This step will run &lt;code&gt;npm install&lt;/code&gt; in your image and lock in whatever you list in &lt;code&gt;package.json&lt;/code&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Now run the following command to create a container (running instance of your image)&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You'll be able to see your project running at &lt;a href="http://localhost:8080/"&gt;http://localhost:8080/&lt;/a&gt;. And you'll be able to see any logs printed out to the console.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Press Control-C to exit the console and stop the container. (Equivalent to &lt;code&gt;docker compose stop&lt;/code&gt; if you're familiar with Docker commands)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;At this point your container is stopped, but Docker has it ready to start up again just in case. If you're finished developing or you need to make package changes, type the following to have Docker Desktop do a complete cleanup. It will remove the container, but keep your image around in case you want to start it again.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose down
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If you make any changes to dependent packages, then you'll have to run the &lt;code&gt;docker compose build&lt;/code&gt; command as shown above. &lt;strong&gt;Do this anytime you add, update, or remove a package&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  We assume that node will be running on port 8080. If this is not the case for your project, feel free to change it, but make sure to change it everywhere, especially &lt;code&gt;Dockerfile&lt;/code&gt; and &lt;code&gt;docker-compose.yml&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  You may find an empty &lt;code&gt;node_modules&lt;/code&gt; under the &lt;code&gt;server-code&lt;/code&gt; directory. That's ok. When you do a build, the &lt;code&gt;node_modules&lt;/code&gt; directory is created inside your Docker image, but not pulled from your local machine. So an empty &lt;code&gt;node_modules&lt;/code&gt; is normal.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Production Build and Local Testing with containers
&lt;/h2&gt;

&lt;p&gt;This workflow allows you to test your container by running it locally but with production settings. It's an exact match of what you would deploy in production, but it allows you to view the console output to help remove any bugs or errors.&lt;/p&gt;

&lt;p&gt;For this workflow, there is no live reloading of source code. So if you make a change to the source code, you'll have to run the build step for every change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps to get up and running
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Start Docker Desktop on your local machine&lt;/li&gt;
&lt;li&gt;Navigate to the main project directory (not in &lt;code&gt;server-code&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To build the image:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;production &lt;span class="nt"&gt;-t&lt;/span&gt; mynodeapp:1.00
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;The &lt;code&gt;--target=production&lt;/code&gt; is a very important flag here. It indicates to the build process to strip out any development dependencies and extraneous files.&lt;/p&gt;

&lt;p&gt;A note on the name of your image: I've picked &lt;code&gt;mynodeapp&lt;/code&gt; as the name and &lt;code&gt;1.00&lt;/code&gt; as the version. I suggest you give your application a name that is meaningful to you and follow &lt;a href="https://semver.org/"&gt;semantic versioning&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To run an instance of your production image locally, run the following command:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-ti&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 mynodeapp:1.00
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Note that while you're running the production version of your application, most non-trivial apps will also need connectivity to other services such as a database. Providing that connectivity is left as an exercise for the reader.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
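&lt;p&gt;To see why &lt;code&gt;--target=production&lt;/code&gt; works, here is a rough sketch of what a multi-stage &lt;code&gt;Dockerfile&lt;/code&gt; supporting this workflow might look like; the stage names and commands are illustrative, not the exact file from this project:&lt;/p&gt;

```dockerfile
# Base stage: shared setup for every build target
FROM node:18-alpine AS base
WORKDIR /app
COPY package*.json ./

# Development stage: includes devDependencies, used for live-reload work
FROM base AS development
RUN npm install
COPY . .
CMD ["npm", "run", "dev"]

# Production stage: runtime dependencies only, nothing extraneous
FROM base AS production
RUN npm ci --omit=dev
COPY . .
EXPOSE 8080
CMD ["node", "index.js"]
```

&lt;p&gt;Building with &lt;code&gt;--target=production&lt;/code&gt; stops at the production stage, so devDependencies never make it into the shipped image.&lt;/p&gt;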

&lt;h3&gt;
  
  
  Production Deployment
&lt;/h3&gt;

&lt;p&gt;Deploying your production image is outside the scope of this blog post, especially since it varies widely based on your Docker hosting environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is the end of the first part of this post, where we explained how to use this workflow. Part two will dive into the details of how the Dockerfiles were created and how they enable the workflow.&lt;/p&gt;

&lt;p&gt;Here's a link to part 2: &lt;a href="https://dev.to/edgarroman/opinionated-docker-development-workflow-for-nodejs-projects-part-2-21ao"&gt;https://dev.to/edgarroman/opinionated-docker-development-workflow-for-nodejs-projects-part-2-21ao&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that you'll probably want to read part 2 when you decide to make modifications to the workflow illustrated here.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>docker</category>
      <category>node</category>
      <category>containerapps</category>
    </item>
    <item>
      <title>Why use Docker in 2022?</title>
      <dc:creator>Edgar Roman</dc:creator>
      <pubDate>Wed, 31 Aug 2022 19:00:03 +0000</pubDate>
      <link>https://dev.to/edgarroman/why-use-docker-in-2022-1mjb</link>
      <guid>https://dev.to/edgarroman/why-use-docker-in-2022-1mjb</guid>
      <description>&lt;p&gt;When Docker first became popular a few years ago, I generally dismissed it as a fad: Something that the hipsters were adopting just so they could be cool.&lt;br&gt;&lt;br&gt;
I mean, why add this extra layer on top of everything and add complexity to your development workflow? And any benefit you may glean would be minimal.&lt;/p&gt;

&lt;p&gt;I didn't dive right into containers, but now things have changed. Over the past few years, cloud providers have rolled out some fantastic offerings that make me want to use Docker. Here are the reasons why using Docker containers in your developer workflow makes a lot of sense in 2022.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stable development environments
&lt;/h2&gt;

&lt;p&gt;Gone are the days when you had a single development environment that all your projects leveraged. If you start a handful of projects every year, then keeping those projects up to date can take a huge amount of time. In recent history, &lt;a href="https://en.wikipedia.org/wiki/History_of_Python#Table_of_versions"&gt;Python&lt;/a&gt; has released a new version every year. Using &lt;a href="https://en.wikipedia.org/wiki/Node.js#Releases"&gt;Node.js&lt;/a&gt; means you face a new version every six months!&lt;/p&gt;

&lt;p&gt;What I've experienced is that if I take a break from one of my projects and come back to it a year later, I'm faced with numerous hours of just upgrading the environment to get the darn thing to work!&lt;/p&gt;

&lt;p&gt;While keeping up with the latest security patches is important, I don't always need the latest and greatest features in the language. Sometimes I just want it to work so I can move on with whatever features I'm trying to implement. With containers, you can put a project on hold for a little bit and when you return, all the dependencies will still be there and, unless you depend on an external service, things should just work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Isolated development environments
&lt;/h2&gt;

&lt;p&gt;Another issue I have run into is having multiple versions of the language installed on my system. Python 2? Python 3? Node 12? Node 14? Many projects use a wide spectrum of these tools. Nothing drives me crazier than when someone comes up with a new shiny tool and asks me to install it globally!&lt;/p&gt;

&lt;p&gt;People have solutions for multiple environments on a single system. Just to name a few: &lt;a href="https://docs.python.org/3/library/venv.html"&gt;virtual environments&lt;/a&gt; for Python; &lt;a href="https://github.com/nvm-sh/nvm"&gt;nvm&lt;/a&gt; and &lt;a href="https://volta.sh/"&gt;volta&lt;/a&gt; for Node.js.&lt;/p&gt;

&lt;p&gt;Using Docker means that you can create a container that is dedicated to the environment and language version of your choice. Each of your projects can have different versions of languages and none of them will conflict with your host machine!&lt;/p&gt;
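&lt;p&gt;As a concrete illustration, two projects on the same machine can pin entirely different runtimes simply by starting from different base images (the versions below are arbitrary examples):&lt;/p&gt;

```dockerfile
# project-a/Dockerfile: an older Node.js project, frozen at the version it was built with
FROM node:12-alpine

# project-b/Dockerfile: a separate project tracking a newer runtime
FROM node:18-alpine
```

&lt;p&gt;Neither version needs to be installed on the host, and neither project can break the other.&lt;/p&gt;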

&lt;h2&gt;
  
  
  Easily repeatable development environments
&lt;/h2&gt;

&lt;p&gt;If you're working with multiple folks or just working with more than one computer, then you may appreciate that containers make sharing easy. All the instructions for building your container are checked into your source code repository.&lt;/p&gt;

&lt;p&gt;If a teammate or friend wants to collaborate with you on the project, just point them to the source code and they can have an exact replica of your container within minutes. No more fussing around with Windows vs Mac vs Linux desktops. They all can participate.&lt;/p&gt;

&lt;p&gt;Another scenario could be that you'd like to share your project in a demonstrable stage, but not share the source code. You can hand over the built container image and your friend can run an instance locally to check out your project. While this does not secure your source code, it does make sharing your project more palatable to folks who maybe are not developers.&lt;/p&gt;
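&lt;p&gt;One hypothetical way to hand over a built image without a registry is to export it to a tarball (the image name and tag here are illustrative):&lt;/p&gt;

```shell
# On your machine: save the built image to a single file
docker save mynodeapp:1.00 -o mynodeapp.tar

# On their machine: load the image and run an instance
docker load -i mynodeapp.tar
docker run -ti --rm -p 8080:8080 mynodeapp:1.00
```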

&lt;h2&gt;
  
  
  Easier Continuous Integration
&lt;/h2&gt;

&lt;p&gt;Since all the instructions on how to build your project are in the source code, building the container image is very easy. You can build it on your personal computer, or you can leverage one of the many cloud-based Continuous Integration (CI) offerings and have them build your container anytime you'd like. This is most helpful when working on a team, but it's important to point out that containers make this step much easier!&lt;/p&gt;
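&lt;p&gt;As a sketch, a minimal GitHub Actions workflow that builds the image on every push might look like this (the file path, job name, and tag are assumptions, not part of any particular project):&lt;/p&gt;

```yaml
# .github/workflows/build.yml
name: build-image
on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Check out the repo, then build straight from its Dockerfile
      - uses: actions/checkout@v4
      - run: docker build . --target=production -t mynodeapp:${{ github.sha }}
```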

&lt;h2&gt;
  
  
  Cloud based hosting
&lt;/h2&gt;

&lt;p&gt;One of the biggest reasons to use containers these days is the plethora of online services that will host them. The biggest drawback in the past was that even if you took the time and effort to make your project run in a container, you couldn't run it! At least you couldn't run it easily. &lt;a href="https://cloud.google.com/blog/products/containers-kubernetes/from-google-to-the-world-the-kubernetes-origin-story"&gt;Kubernetes&lt;/a&gt; fought for developer mindshare with Mesos and Docker Swarm. Ultimately Kubernetes became very popular, and now all major and many minor cloud providers offer container hosting services.&lt;/p&gt;

&lt;p&gt;Based on your hosting needs, you can run a hobby project in the cloud for free (using various Free Tier offerings) or you can spin up a robust and geographically redundant flotilla of containers for your production project. The number of options is almost overwhelming. In fact, Corey Quinn of &lt;a href="https://www.lastweekinaws.com/"&gt;Last Week in AWS&lt;/a&gt; published a blog post on &lt;a href="https://www.lastweekinaws.com/blog/the-17-ways-to-run-containers-on-aws/"&gt;17 ways to run containers on AWS&lt;/a&gt; - and then wrote &lt;a href="https://www.lastweekinaws.com/blog/17-more-ways-to-run-containers-on-aws/"&gt;another blog post on 17 more ways&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Serverless Ready
&lt;/h2&gt;

&lt;p&gt;Finally, using containers for your deployment allows you to fully embrace the Serverless movement. Allow me to explain. In any commercial hosting relationship, you, as a project owner, delegate some responsibilities to your hosting provider. If you rent physical rack space in a data center, they provide a roof, power, and air conditioning. By using containers, you delegate a lot more to your provider: essentially you are just renting CPU cycles (although how you are billed varies from provider to provider).&lt;/p&gt;

&lt;p&gt;As a project owner, you are free of the responsibility of server management (mostly). If the load on your project goes up, then you just rent more CPU cycles. Conversely, if the load goes down, you rent fewer CPU cycles. This is the beauty of Serverless: the line of responsibility allows you to focus more on your project's delivered value and less on the boring tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why &lt;strong&gt;not&lt;/strong&gt; run Docker containers?
&lt;/h2&gt;

&lt;p&gt;Ok, so we've discussed the advantages of using containers. What about the downsides?&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker Licensing
&lt;/h3&gt;

&lt;p&gt;According to the &lt;a href="https://www.docker.com/pricing/"&gt;Docker Pricing page&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Docker Desktop can be used for free as part of a Docker Personal subscription for: small companies (fewer than 250 employees AND less than $10 million in annual revenue), personal use, education, and non-commercial open source projects.&lt;/p&gt;

&lt;p&gt;Docker Desktop requires a per user paid Pro, Team or Business subscription for professional use in larger companies with subscriptions available for as little as $5 per month.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a nominal cost for most developers, but something worth considering.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker Desktop overhead
&lt;/h3&gt;

&lt;p&gt;Running Docker Desktop in the background means your computer will use more CPU, memory, and energy. In my experience, though, most modern computers handle the load easily.&lt;/p&gt;

&lt;h3&gt;
  
  
  Only runs Linux
&lt;/h3&gt;

&lt;p&gt;Your Docker container can only run Linux. That's because Docker takes advantage of &lt;a href="https://github.com/opencontainers/runc"&gt;runC&lt;/a&gt;, which uses the Linux kernel. You can't run Windows &lt;strong&gt;inside&lt;/strong&gt; of Docker, although you can of course run Docker on a Windows host machine. If you are running a .NET Framework project, you won't be able to use Docker. On the other hand, if you're running .NET Core, then you're in luck!&lt;/p&gt;

&lt;p&gt;While you may think running only Linux would be a limitation, there are &lt;a href="https://en.wikipedia.org/wiki/Linux_distribution"&gt;literally hundreds of Linux distributions&lt;/a&gt; from which to choose.&lt;/p&gt;

&lt;h3&gt;
  
  
  Complex workflows
&lt;/h3&gt;

&lt;p&gt;Most developers are taught to develop locally on their machines. While this may be slowly shifting to online development, there is a lot of existing momentum. Exacerbating the problem is the myriad of configuration options Docker exposes. It's extremely flexible and thus quite complex to set up properly. Facing the task of setting up Docker from scratch is quite intimidating, and it could cause some developers to have a bad experience and resist leveraging a fantastic tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with containers
&lt;/h2&gt;

&lt;p&gt;In the next posts, I would like to propose a flexible but opinionated developer workflow for the Python and Node.js environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/edgarroman/opinionated-docker-development-workflow-for-nodejs-projects-part-1-26c0"&gt;Opinionated Docker development workflow for Node.js projects&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cover Photo by &lt;a href="https://unsplash.com/@ventiviews?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Venti Views&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/containers?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>node</category>
      <category>python</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
