Paul DeCarlo for Microsoft Azure


Supercharge your containerized IoT workloads with GPU Acceleration on Nvidia Jetson devices

In this article, we will walk through the steps for creating GPU accelerated containers for use in IoT Solutions on Nvidia Jetson family devices. This will enable you to perform enhanced processing of Artificial Intelligence and Machine Learning workloads by exposing access to on-board GPU hardware for use in containerized processes. When coupled with an IoT focused container orchestrator like Azure IoT Edge, we can deploy these accelerated workloads as modules that can be configured and deployed directly from the cloud and then shipped down to our devices. This provides the ability to create a full end-to-end GPU accelerated IoT solution that can be updated securely and remotely as needed.

Getting Started

Nvidia produces a number of devices suited for IoT solutions in its Jetson line. These include the beefy 512-Core Jetson AGX Xavier, the mid-range 256-Core Jetson TX2, and the entry-level $99 128-Core Jetson Nano.

To follow along with this article, you will need one of the devices listed above.

Note: We will specifically employ the Jetson Nano device in this article. The steps provided are technically adaptable across the full family of Jetson devices; however, the specifics for each platform are too complex to cover in a single article. For this reason, we will explain the overall process in enough detail to point you in the right direction if you are using a different device.

If you are interested in how to construct similar examples of Dockerfiles for other Nvidia platforms, I highly suggest taking a look at the jetson-containers repo published by Ian Davis. This repository contains a wealth of relevant information and was highly leveraged during the creation of this content.

High Level Overview

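To make use of GPU drivers on a device, we need to:

  • Install the Board Support Package (BSP) drivers for the device
  • Install the CUDA Toolkit
  • Give the running process access to the GPU devices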

The process for creating a container with access to host GPU resources may appear straightforward. It is technically no different than the steps that would be followed if you wanted to run an accelerated workload on the host itself. However, in practice there are a few issues which make this process a bit more difficult.

  • For example, which drivers are relevant to my device? (Is it a TX2, Xavier, or Nano device?)
  • Which version of the CUDA Toolkit should I install? (These are also provided per platform.)
  • How do I ensure that my container has access to the appropriate devices? (Thankfully, most OpenCL / CUDA applications will tell you what they need in error messages.)
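As a rough starting point for the first question, a Jetson host typically exposes its model string at /proc/device-tree/model. The sketch below maps that string to a family name. The jetson_family helper and the substrings it matches are assumptions made for illustration, and the exact model strings may differ between L4T releases.

```shell
#!/bin/sh
# Hypothetical helper: map a device-tree model string to a Jetson family name.
# The substrings matched here are assumptions; check what your board reports.
jetson_family() {
    case "$1" in
        *Nano*)   echo "nano" ;;
        *Xavier*) echo "xavier" ;;
        *TX2*)    echo "tx2" ;;
        *)        echo "unknown" ;;
    esac
}

# Location of the model string on the host (assumed; absent on non-ARM machines).
MODEL_FILE=/proc/device-tree/model
if [ -r "$MODEL_FILE" ]; then
    # The file is NUL-terminated, so strip the NUL before matching.
    model=$(tr -d '\0' < "$MODEL_FILE")
    echo "Detected: $model ($(jetson_family "$model"))"
fi
```

The family name can then be used to pick the matching driver package and CUDA Toolkit installer for a given board.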

This process is further complicated by the fact that the BSP and CUDA Toolkit are not openly available for download. They must be retrieved and hosted elsewhere if they are to be used in a Dockerfile.

If you clicked on the links above for obtaining these packages, there will be some issues present. The link provided for the BSP does not extract properly as-is on aarch64, so it is not possible to use in an aarch64 container without modification. That's okay, we can re-pack it and host it elsewhere to get around this issue.

Second, you may notice that there are no links provided for an aarch64 compatible installer for the CUDA Toolkit. These can be obtained by downloading and installing the Nvidia SDK Manager to an X64 / Debian compatible host. From there, you can obtain the appropriate links by monitoring the terminal output of the SDK manager.

Still following? The good news is, if you are using a Jetson Nano device, I will provide pre-built, publicly hosted Docker images in this article that you can use. These may run on other devices but will not be optimized due to differences in the driver package present.

Some of you may be wondering, why don't we just mount the host directories which contain the relevant drivers and SDK? For example:

docker run -it --rm \
 -v /usr/local/cuda-10.0:/usr/local/cuda-10.0 \
 -v /usr/lib/aarch64-linux-gnu/tegra:/usr/lib/aarch64-linux-gnu/tegra \
 --device=/dev/nvmap \
 --device=/dev/nvhost-ctrl \
 --device=/dev/nvhost-ctrl-gpu \

The problem with this approach is that our code is now deeply coupled to a configuration that may or may not be available on the host. In production, it would be preferable to provide a full container which contains all of the relevant dependencies, so that we can rest assured that so long as the hardware is present, we can successfully run our containerized application. In this manner, we can ship BSP and CUDA updates in our container without any need to make changes to the host OS. Pretty cool, huh?

Building the containers

In order to build a GPU accelerated container, we need to think about how best to approach the solution so that changes can be made later on. Docker affords us some niceties out of the box through its layered approach to filesystem changes. A layer is created each time a RUN instruction in a Dockerfile is executed. Layers let us compose successive filesystem changes on top of previous ones, and each can be cached by the Docker build system. Thoughtful design of our Dockerfiles should therefore produce images that avoid writing excessive data into any layer, giving us a light-weight base from which to create additional containers.

We also need to think about how to organize our base images. If we create a base which builds off of the previous in a proper fashion, it should allow us to add changes to our larger solution without the need for rebuilding things from scratch. We want to avoid that situation as much as possible and take advantage of pre-built bases where possible.

When we look at the individual steps to enable GPU acceleration, a design pattern is implied. We can create a base image for each step, i.e. a container which contains only the drivers, another which contains the CUDA Toolkit installed on top of the drivers, and another which compiles an application (OpenCV) against the CUDA Toolkit in the previous base etc. This will allow us to easily create new projects using pre-existing base containers, without the need for rebuilding common layers.

Following this approach, we will define three base images described below:

  • jetson-nano-l4t : Contains installation of Jetson L4T Driver Package
  • jetson-nano-l4t-cuda : Installs CUDA Toolkit 10 on top of L4T Drivers
  • jetson-nano-l4t-cuda-opencv : Compiles OpenCV against CUDA Toolkit 10
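Because each image builds on the previous one, the build order matters. The following is a minimal sketch of a script that builds the three images in sequence. It assumes a hypothetical layout in which each Dockerfile lives in a directory named after its image, and REGISTRY_USER is a placeholder namespace rather than anything defined in this article.

```shell
#!/bin/sh
# Build the three base images in dependency order.
# Assumptions: each image's Dockerfile lives in a directory named after the
# image, and REGISTRY_USER is a placeholder for your own registry namespace.
REGISTRY_USER="${REGISTRY_USER:-example}"
IMAGES="jetson-nano-l4t jetson-nano-l4t-cuda jetson-nano-l4t-cuda-opencv"

build_chain() {
    for image in $IMAGES; do
        # With DRY_RUN=1 the command is printed instead of executed.
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "docker build -t $REGISTRY_USER/$image ./$image"
        else
            docker build -t "$REGISTRY_USER/$image" "./$image" || return 1
        fi
    done
}
```

Running `DRY_RUN=1 build_chain` prints the three build commands without invoking Docker. If you build under your own namespace, the FROM line in each downstream Dockerfile needs to point at the tag produced by the previous step.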

Our initial image, jetson-nano-l4t, will be based on balenalib/jetson-tx2-ubuntu:bionic. We could also base it on another stock Ubuntu image like arm64v8/ubuntu. This base will serve as the initial root filesystem.

We will now look at how to construct these base images individually with some notes on how they operate under the hood.


FROM balenalib/jetson-tx2-ubuntu:bionic

ARG DRIVER_PACK=Jetson-210_Linux_R32.1.0_aarch64.tbz2

RUN apt-get update && apt-get install -y --no-install-recommends \
    bzip2 \
    ca-certificates \
    curl \
    lbzip2 \
    sudo \
    && \
    curl -sSL $URL -o ${DRIVER_PACK} && \
    echo "9138c7dd844eb290a20b31446b757e1781080f63 *./${DRIVER_PACK}" | sha1sum -c --strict - && \
    tar -xpj --overwrite -f ./${DRIVER_PACK} && \
    sed -i '/.*tar -I lbzip2 -xpmf ${LDK_NV_TEGRA_DIR}\/config\.tbz2.*/c\tar -I lbzip2 -xpm --overwrite -f ${LDK_NV_TEGRA_DIR}\/config.tbz2' ./Linux_for_Tegra/ && \
    ./Linux_for_Tegra/ -r / && \
    rm -rf ./Linux_for_Tegra && \
    rm ./${DRIVER_PACK} \
    && \
    apt-get purge --autoremove -y bzip2 curl lbzip2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

ENV LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/tegra:/usr/lib/aarch64-linux-gnu/tegra-egl:${LD_LIBRARY_PATH}

RUN ln -s /usr/lib/aarch64-linux-gnu/tegra/ /usr/lib/aarch64-linux-gnu/tegra/ && \
    ln -s /usr/lib/aarch64-linux-gnu/tegra/ /usr/lib/aarch64-linux-gnu/tegra/ && \
    ln -sf /usr/lib/aarch64-linux-gnu/tegra/ /usr/lib/aarch64-linux-gnu/ && \
    ln -s /usr/lib/aarch64-linux-gnu/ /usr/lib/aarch64-linux-gnu/ && \
    ln -sf /usr/lib/aarch64-linux-gnu/tegra-egl/ /usr/lib/aarch64-linux-gnu/

This Dockerfile pulls down a re-packaged archive of the Jetson Nano BSP and verifies its checksum before extracting it. A sed operation is required to direct the installer to overwrite files that already exist on the root filesystem. Finally, we add the newly installed modules to the LD_LIBRARY_PATH to allow them to be dynamically linked to by other applications and symlink relevant shared objects to common path names.
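The checksum step in that Dockerfile can be tried on its own before committing to a long image build. Below is a small sketch of the same `sha1sum -c --strict` check wrapped in a function. The verify_sha1 name is made up for this example, and it assumes GNU coreutils, as found in the Ubuntu base image.

```shell
#!/bin/sh
# verify_sha1 FILE EXPECTED_HASH
# Returns 0 only when FILE's SHA-1 digest matches EXPECTED_HASH.
verify_sha1() {
    file=$1
    expected=$2
    # sha1sum -c reads "HASH *FILENAME" lines; the '*' marks binary mode.
    # --strict makes a malformed checksum line count as a failure.
    echo "$expected *$file" | sha1sum -c --strict - >/dev/null 2>&1
}
```

For example, `verify_sha1 ./archive.tbz2 <expected-hash> || echo "checksum mismatch"` would flag a corrupted or altered download before it is unpacked onto the root filesystem.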


FROM toolboc/jetson-nano-l4t

#INSTALL CUDA Toolkit for L4T
ARG CUDA_TOOLKIT_PKG="cuda-repo-l4t-10-0-local-10.0.166_1.0-1_arm64.deb"

RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    curl -sL ${URL} -o ${CUDA_TOOLKIT_PKG} && \
    echo "5e3eedc3707305f9022d41754d6becde ${CUDA_TOOLKIT_PKG}" | md5sum -c - && \
    dpkg --force-all -i ${CUDA_TOOLKIT_PKG} && \
    rm ${CUDA_TOOLKIT_PKG} && \
    apt-key add /var/cuda-repo-*-local*/*.pub && \
    apt-get update && \
    apt-get install -y --allow-downgrades cuda-toolkit-10-0 libgomp1 libfreeimage-dev libopenmpi-dev openmpi-bin && \
    dpkg --purge cuda-repo-l4t-10-0-local-10.0.166  && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

ENV CUDA_HOME=/usr/local/cuda
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
ENV PATH=$PATH:$CUDA_HOME/bin

This Dockerfile installs CUDA Toolkit 10.0.166_1 for arm64 along with necessary dependencies. We set the LD_LIBRARY_PATH to allow for dynamic linking of the installed modules and make the CUDA Toolkit binaries (nvcc etc.) accessible to future applications by adding the CUDA_HOME/bin directory to $PATH.
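One way to confirm these settings from inside a running container is a quick environment check. This sketch is illustrative only: list_contains and check_cuda_env are hypothetical helper names, and the paths checked are the ones set in the Dockerfile above.

```shell
#!/bin/sh
# list_contains COLON_LIST DIR
# Returns 0 when DIR is one of the entries in a colon-separated list.
list_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report whether the CUDA library directory and nvcc are reachable.
check_cuda_env() {
    if list_contains "$LD_LIBRARY_PATH" /usr/local/cuda/lib64; then
        echo "LD_LIBRARY_PATH: ok"
    else
        echo "LD_LIBRARY_PATH: missing /usr/local/cuda/lib64"
    fi
    if command -v nvcc >/dev/null 2>&1; then
        echo "nvcc: found"
    else
        echo "nvcc: not on PATH"
    fi
}
```

Calling `check_cuda_env` in the container prints one line per check, which makes it easy to spot a missing ENV entry after changing the Dockerfile.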


FROM toolboc/jetson-nano-l4t-cuda

#Required for libjasper-dev
RUN echo "deb xenial-security main restricted" | sudo tee -a /etc/apt/sources.list

#INSTALL OPENCV dependencies
RUN apt update && apt purge -y '*libopencv*' && apt install -y build-essential cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev \
    libgstreamer1.0-dev libgstreamer-plugins-base1.0-dev \
    python2.7-dev python3.6-dev python-dev python-numpy python3-numpy \
    libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev libdc1394-22-dev \
    libv4l-dev v4l-utils qv4l2 v4l2ucp \
    curl unzip && \
    rm -rf /var/lib/apt/lists/*

#GET OPENCV sources
WORKDIR /usr/local/src
RUN curl -L -o && \
    curl -L -o && \
    unzip && \
    unzip && \
    rm -rf opencv*.zip

RUN cd opencv-4.1.0/ && mkdir release && cd release/ && \
    make -j3 && \
    make install && \
    rm -rf /usr/local/src/opencv-4.1.0/release

In this Dockerfile, we update our sources.list to allow us to install libjasper-dev, as it is not available in the bionic repos. We then obtain the 4.1.0 release of OpenCV and compile it with support for CUDA, python2, and python3. We are able to compile with CUDA support because the image is based on toolboc/jetson-nano-l4t-cuda.

Using GPU accelerated Containers

To use these images, the container must be created with the appropriate host GPU devices exposed to it. This can be done from the command line with:

docker run \
    --device=/dev/nvhost-ctrl \
    --device=/dev/nvhost-ctrl-gpu \
    --device=/dev/nvhost-prof-gpu \
    --device=/dev/nvmap \
    --device=/dev/nvhost-gpu \
    --device=/dev/nvhost-as-gpu \
    --device=/dev/nvhost-vic \
    --device=/dev/tegra_dc_ctrl \

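Or in an IoT Edge Module by modifying the HostConfig section of deployment.template.json as follows:

                "HostConfig": {
                  "Devices": [
                    { "PathOnHost": "/dev/nvhost-ctrl", "PathInContainer": "/dev/nvhost-ctrl", "CgroupPermissions": "rwm" },
                    { "PathOnHost": "/dev/nvhost-ctrl-gpu", "PathInContainer": "/dev/nvhost-ctrl-gpu", "CgroupPermissions": "rwm" },
                    { "PathOnHost": "/dev/nvhost-prof-gpu", "PathInContainer": "/dev/nvhost-prof-gpu", "CgroupPermissions": "rwm" },
                    { "PathOnHost": "/dev/nvmap", "PathInContainer": "/dev/nvmap", "CgroupPermissions": "rwm" },
                    { "PathOnHost": "/dev/nvhost-gpu", "PathInContainer": "/dev/nvhost-gpu", "CgroupPermissions": "rwm" },
                    { "PathOnHost": "/dev/nvhost-as-gpu", "PathInContainer": "/dev/nvhost-as-gpu", "CgroupPermissions": "rwm" },
                    { "PathOnHost": "/dev/nvhost-vic", "PathInContainer": "/dev/nvhost-vic", "CgroupPermissions": "rwm" },
                    { "PathOnHost": "/dev/tegra_dc_ctrl", "PathInContainer": "/dev/tegra_dc_ctrl", "CgroupPermissions": "rwm" }
                  ]
                }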

Let's verify this on an Nvidia Jetson Nano device by running:

docker run --rm -it \
    --device=/dev/nvhost-ctrl \
    --device=/dev/nvhost-ctrl-gpu \
    --device=/dev/nvhost-prof-gpu \
    --device=/dev/nvmap \
    --device=/dev/nvhost-gpu \
    --device=/dev/nvhost-as-gpu \
    --device=/dev/nvhost-vic \
    --device=/dev/tegra_dc_ctrl \
    toolboc/jetson-nano-l4t-cuda \
    /bin/bash

This will drop you into an interactive bash session with the jetson-nano-l4t-cuda base image.
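The command-line flags and the HostConfig device list both come from the same set of device nodes, so it can be convenient to generate them from one list. The sketch below is one possible way to do that. The device_flags and devices_json helpers are hypothetical, and skipping nodes that are missing on the host is a design choice made for this example, not something the images require.

```shell
#!/bin/sh
# Device nodes taken from the docker run example in this article.
GPU_DEVICES="/dev/nvhost-ctrl /dev/nvhost-ctrl-gpu /dev/nvhost-prof-gpu /dev/nvmap /dev/nvhost-gpu /dev/nvhost-as-gpu /dev/nvhost-vic /dev/tegra_dc_ctrl"

# device_flags DEV...
# Print a --device flag for each listed node that exists on this host.
# Missing nodes are reported on stderr and skipped.
device_flags() {
    flags=""
    for dev in "$@"; do
        if [ -e "$dev" ]; then
            flags="$flags --device=$dev"
        else
            echo "skipping missing device: $dev" >&2
        fi
    done
    # Trim the leading space before printing.
    echo "${flags# }"
}

# devices_json DEV...
# Print comma-separated entries for a HostConfig "Devices" array, mapping
# each node to the same path in the container with rwm permissions.
devices_json() {
    sep=""
    for dev in "$@"; do
        printf '%s{"PathOnHost":"%s","PathInContainer":"%s","CgroupPermissions":"rwm"}' \
            "$sep" "$dev" "$dev"
        sep=","
    done
    printf '\n'
}
```

On a Jetson host, `docker run --rm -it $(device_flags $GPU_DEVICES) toolboc/jetson-nano-l4t-cuda /bin/bash` would start the same session as above, while `devices_json $GPU_DEVICES` prints entries that can be pasted between the square brackets of a Devices array.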

We will build the deviceQuery sample included in the CUDA Toolkit to verify that our GPU is accessible from the container. To do this, run the following commands inside the interactive session:

cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery

You should receive output similar to the following:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3957 MBytes (4148756480 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

If you get a similar message, congratulations! You now have access to the GPU from a container!
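If you want to automate this check, for instance as a smoke test after rebuilding an image, the final line of the output is enough to go on. Here is a minimal sketch, where device_query_passed is a hypothetical helper name:

```shell
#!/bin/sh
# device_query_passed: read deviceQuery output on stdin and return 0
# only if it contains the "Result = PASS" line shown above.
device_query_passed() {
    grep -q 'Result = PASS'
}
```

Inside the container, `./deviceQuery | device_query_passed && echo "GPU visible"` prints the message only when the sample reports success.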


We have demonstrated how to allow the GPU present on Nvidia Jetson devices to be made available to containerized processes. This opens up a vast array of possibilities for enhancing IoT solutions distributed as containers. We have also demonstrated how we can use these base containers to compile additional applications with GPU support. Using these techniques, it should be possible to convert any host-compatible GPU accelerated workload to run in a container, making this a viable path for development of GPU accelerated IoT Edge workloads.

If you would like to check out additional examples of Dockerfiles for Nvidia Jetson platforms including the Nano, TX, and Xavier, you can check out the jetson-containers repo by Ian Davis, which contains a variety of example configurations for additional software packages like cuDNN, TensorFlow, and PyTorch.

If you are interested in learning more about developing IoT Edge workloads for Nvidia Jetson devices, you can check out these additional articles on

Until next time, Happy Hacking!
