Sergio Andres Usma

Jetson Containers Quickstart on NVIDIA Jetson AGX Orin 64GB

Abstract

This document describes how to run NVIDIA Jetson‑optimized AI containers from the dustynv/jetson-containers project on an NVIDIA Jetson AGX Orin 64GB Developer Kit with Ubuntu 22.04.5 LTS and JetPack 6.2.2 (L4T 36.5.0), focusing on LLMs, speech, vision, and development tools. It consolidates the original Jetson Containers Quickstart PDF into an operational tutorial with copy‑paste docker run commands and n8n integration pointers tailored to a system where n8n itself runs in Docker on port 5678. The tutorial targets engineers who want to run multiple local AI services on the same Jetson and orchestrate them via OpenAI‑compatible APIs without relying on external cloud providers.


1. Target Hardware and Software Environment

Your system matches the reference environment of the Jetson Containers Quickstart Guide: Jetson AGX Orin 64GB, Ubuntu 22.04.5 aarch64, JetPack 6.2.2 (L4T 36.5.0), CUDA 12.6, cuDNN 9.3.0, and TensorRT 10.3.0.30. This platform has 64 GB unified memory and is validated to run all 51 containers in the guide, including 70B‑parameter LLMs in GPU‑accelerated runtimes.

Before launching AI containers, ensure:

  1. Docker is installed and configured with the NVIDIA runtime (JetPack 6.x already ships nvidia-container-runtime; you mainly need to install Docker itself).
  2. GPU works inside Docker:
docker run --runtime nvidia --rm \
  dustynv/cuda:12.8-samples-r36.4.0-cu128-24.04 \
  /usr/local/cuda/extras/demo_suite/deviceQuery
  3. (Optional) Create directories for persistent data:
mkdir -p ~/.ollama \
         ~/.cache/huggingface \
         ~/sd-models \
         ~/comfyui-models \
         ~/comfyui-output \
         ~/ml-workspace \
         ~/notebooks \
         ~/aim-data \
         ~/ha-config

Use these directories as bind‑mounts so models and configuration survive container recreation.


2. LLM Inference Engines (OpenAI-Compatible)

2.1 Ollama — General Purpose LLM Runtime

Ollama is a user‑friendly way to run LLaMA, Mistral, Qwen, Gemma, Phi, and DeepSeek models with an OpenAI‑compatible REST API on your Jetson. The guide notes that AGX Orin 64GB can run 70B models comfortably using this runtime.

Start Ollama:

docker run --runtime nvidia -it -d \
  --name ollama \
  --network host \
  -v ~/.ollama:/root/.ollama \
  dustynv/ollama:r36.4.0

Pull a model and chat:

# Pull a model
curl http://localhost:11434/api/pull \
  -d '{"name": "llama3.2:3b"}'

# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role":"user","content":"Hello from n8n!"}]
  }'

n8n configuration:

  • Credential: OpenAI API credential.
  • API Key: any string (e.g. ollama).
  • Base URL: http://<jetson-ip>:11434/v1.
  • Model: llama3.2:3b or any model pulled into Ollama.
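
Outside n8n, the same endpoint can be exercised from a few lines of Python. A minimal sketch using only the standard library; `build_chat_payload` and `chat` are illustrative helper names, and the URL and model name match the commands above:

```python
import json
import urllib.request

def build_chat_payload(model, prompt):
    """Build an OpenAI-style chat-completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(base_url, model, prompt, timeout=120):
    """POST to <base_url>/chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires the Ollama container above to be running:
# print(chat("http://localhost:11434/v1", "llama3.2:3b", "Hello from n8n!"))
```

Because the API is OpenAI-compatible, the same helper works unchanged against the llama.cpp, vLLM, and SGLang servers in the following sections by swapping the base URL.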

2.2 llama.cpp — GGUF, Quantized LLM Server

llama.cpp excels at running quantized GGUF models with low latency and memory usage. The quickstart provides an OpenAI‑compatible server configuration suitable for AGX Orin.

Start llama.cpp server:

docker run --runtime nvidia -it -d \
  --name llama-server \
  --network host \
  -v /models:/models \
  dustynv/llama_cpp:r36.4.0 \
  llama-server \
    --model /models/llama-3.1-8b-q4.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    --n-gpu-layers 999 \
    --ctx-size 8192
  • The server exposes OpenAI‑style endpoints on http://<jetson-ip>:8080/v1.

n8n configuration:

  • Node: OpenAI Chat Model.
  • Base URL: http://<jetson-ip>:8080/v1.
  • Model: choose the name according to your server configuration; llama.cpp maps the loaded GGUF file to a logical model ID.

2.3 vLLM — High Throughput LLM Serving

vLLM uses PagedAttention to reach significantly higher throughput than naive Hugging Face inference, which is useful for multi‑user services.

Start vLLM:

docker run --runtime nvidia -it -d \
  --name vllm \
  --network host \
  --shm-size=8g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  dustynv/vllm:r36.4.0 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --host 0.0.0.0 \
    --port 8000
  • Exposes OpenAI‑compatible endpoints at http://<jetson-ip>:8000/v1.

n8n configuration:

  • Node: OpenAI Chat Model.
  • Base URL: http://<jetson-ip>:8000/v1.
  • Enable streaming mode if you want streamed responses.
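
With streaming enabled, the server returns Server-Sent Events: each line is `data: <json chunk>` and the stream ends with `data: [DONE]`. A small parser sketch, assuming the standard OpenAI-style chunk format:

```python
import json

def parse_sse_chunks(lines):
    """Concatenate the text deltas from an OpenAI-style SSE stream."""
    text = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)
```

In n8n the streaming toggle handles this for you; the parser is only needed when consuming the stream from your own scripts.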

2.4 SGLang — Structured Output and JSON

SGLang is designed for structured outputs and JSON‑constrained decoding using RadixAttention.

Start SGLang:

docker run --runtime nvidia -it -d \
  --name sglang \
  --network host \
  --shm-size=8g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  dustynv/sglang:r36.4.0 \
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.2-3B-Instruct \
    --host 0.0.0.0 \
    --port 30000

n8n usage pattern:

  • Use HTTP Request node pointing to http://<jetson-ip>:30000/v1/chat/completions and include response_format: {"type":"json_object"} in the body when you need strict JSON.
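
The request body described above can be sketched in Python, together with a validation step that fails fast if the reply is not actually JSON. The helper names are illustrative:

```python
import json

def build_json_request(model, prompt):
    """Chat-completion body asking the server for a strict JSON object reply."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_object"},
    }

def parse_reply(reply_text):
    """Raise if the model's reply is not valid JSON before using it downstream."""
    return json.loads(reply_text)
```

Validating the reply before handing it to later workflow steps keeps a malformed completion from propagating silently through an n8n flow.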

2.5 MLC and nanoLLM — Orin‑Optimized and Multimodal

MLC LLM compiles models targeting Jetson’s GPU architecture for fast token generation.

Start MLC LLM:

docker run --runtime nvidia -it -d \
  --name mlc \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  dustynv/mlc:r36.4.0
  • According to the quickstart, MLC frequently achieves the fastest token rates on AGX Orin among the tested engines.

nanoLLM provides higher‑level multimodal pipelines with vision‑language and voice capabilities.

Start nanoLLM with VILA:

docker run --runtime nvidia -it -d \
  --name nano-llm \
  --network host \
  --shm-size=8g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  dustynv/nano_llm:r36.4.0 \
  python3 -m nano_llm.serve \
    --model Efficient-Large-Model/VILA1.5-3b \
    --host 0.0.0.0 \
    --port 9000

Multimodal example:

curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "VILA1.5-3b",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "http://example.com/img.jpg"}},
        {"type": "text", "text": "What is in this image?"}
      ]
    }]
  }'

n8n:

  • Node: OpenAI Chat Model.
  • Base URL: http://<jetson-ip>:9000/v1.
  • Use messages with image_url and text parts when building prompts.

3. Speech and Audio Containers

3.1 faster-whisper — STT Server

faster‑whisper is a fast speech‑to‑text server offering OpenAI‑compatible endpoints on Jetson.

Start faster‑whisper:

docker run --runtime nvidia -it -d \
  --name faster-whisper \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  dustynv/faster-whisper:r36.4.0 \
  python3 -m faster_whisper.server \
    --host 0.0.0.0 \
    --port 8000
  • Exposes /v1/audio/transcriptions and works with OpenAI Chat Model or HTTP Request nodes.

n8n pattern:

  • HTTP Request, method POST, URL http://<jetson-ip>:8000/v1/audio/transcriptions.
  • Body: form‑data with file (binary audio) and model (e.g. "whisper-1").
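
The form-data upload that n8n builds for you can also be encoded by hand with the standard library. A sketch; `encode_multipart` is a hypothetical helper, and the field names (`file`, `model`) follow the transcription request described above:

```python
import uuid

def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode plain fields plus one binary file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f'--{boundary}\r\nContent-Disposition: form-data; '
             f'name="{name}"\r\n\r\n{value}\r\n').encode()
        )
    parts.append(
        (f'--{boundary}\r\nContent-Disposition: form-data; '
         f'name="{file_field}"; filename="{filename}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

POST the returned body to http://<jetson-ip>:8000/v1/audio/transcriptions with the returned string as the Content-Type header.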

3.2 kokoro-tts — Lightweight Local TTS

kokoro‑tts offers an OpenAI‑compatible /v1/audio/speech endpoint with multiple voices.

Start kokoro‑tts:

docker run --runtime nvidia -it -d \
  --name kokoro-tts \
  --network host \
  dustynv/kokoro-tts:r36.4.0

Generate MP3:

curl http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "input": "Hello from your Jetson!",
    "voice": "af_bella",
    "response_format": "mp3"
  }' \
  --output speech.mp3

n8n:

  • HTTP Request, Response Format = File, then return or store the binary audio.

3.3 speaches — Unified Speech In/Out

speaches exposes both STT and TTS endpoints compatible with OpenAI’s audio APIs.

Start speaches:

docker run --runtime nvidia -it -d \
  --name speaches \
  --network host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  dustynv/speaches:r36.4.0
  • Ports and endpoints are listed in the API quick reference (port 8000, OpenAI‑compatible).

A complete on‑device voice pipeline can be built as: Webhook (audio) → faster‑whisper STT → LLM (Ollama or vLLM) → kokoro‑tts or speaches TTS → Webhook response.


4. Vision, Diffusion, and VLM Containers

4.1 Stable Diffusion WebUI — Text‑to‑Image UI + API

The Stable Diffusion WebUI container gives you a full browser interface and REST API for image generation.

Start Stable Diffusion WebUI:

docker run --runtime nvidia -it -d \
  --name sd-webui \
  --network host \
  -v ~/sd-models:/workspace/stable-diffusion-webui/models \
  dustynv/stable-diffusion-webui:r36.4.0 \
  python3 launch.py --api --listen --port 7860
  • Web UI: http://<jetson-ip>:7860.

API txt2img example:

curl http://localhost:7860/sdapi/v1/txt2img \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "mountain landscape",
    "steps": 20,
    "width": 512,
    "height": 512
  }'

n8n:

  • HTTP Request → parse JSON → Move Binary Data to convert base64 images[0] to binary → send to Telegram, save file, etc.
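
The base64 decoding that Move Binary Data performs takes one standard-library call in Python. A sketch, assuming the `images` array of base64 strings that /sdapi/v1/txt2img returns:

```python
import base64
import json

def decode_txt2img_images(response_text):
    """Decode the base64 'images' list from /sdapi/v1/txt2img into raw PNG bytes."""
    return [base64.b64decode(s) for s in json.loads(response_text)["images"]]

# Writing the first image to disk:
# with open("out.png", "wb") as f:
#     f.write(decode_txt2img_images(response_text)[0])
```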

4.2 ComfyUI — Graph‑Based Diffusion Workflows

ComfyUI is a node‑based interface with an HTTP API.

Start ComfyUI:

docker run --runtime nvidia -it -d \
  --name comfyui \
  --network host \
  -v ~/comfyui-models:/root/ComfyUI/models \
  -v ~/comfyui-output:/root/ComfyUI/output \
  dustynv/comfyui:r36.4.0

API flow:

  1. POST /prompt → get prompt_id.
  2. GET /history/{prompt_id} repeatedly until outputs appear.
  3. GET /view?filename=<filename>&type=output to download the image.

Use a sequence of HTTP Request nodes in n8n to implement the polling and retrieval.
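
The polling step reduces to a small loop. A sketch in which `get_history` stands in for whatever HTTP client (or n8n node) fetches /history/{prompt_id}, so the logic stays independent of the transport:

```python
import time

def wait_for_outputs(get_history, prompt_id, poll_seconds=1.0, max_tries=60):
    """Poll the history endpoint until the workflow reports outputs.

    `get_history` is any callable returning the parsed history JSON
    for a prompt id, e.g. a wrapper around GET /history/{prompt_id}.
    """
    for _ in range(max_tries):
        entry = get_history(prompt_id).get(prompt_id, {})
        if entry.get("outputs"):
            return entry["outputs"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"prompt {prompt_id} produced no outputs")
```

The returned outputs dictionary contains the filenames to pass to the /view download step.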


4.3 VILA and Related VLMs

The VILA container provides an efficient vision‑language model with an OpenAI‑compatible API.

Start VILA:

docker run --runtime nvidia -it -d \
  --name vila \
  --network host \
  --shm-size=8g \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  dustynv/vila:r36.4.0
  • According to the quick reference, VILA uses port 8000 and integrates via OpenAI Chat Model node.

In n8n, send messages that include an image_url object and text, similar to the nanoLLM example.


5. Development, Experiment Tracking, and Smart Home

5.1 L4T-ML, PyTorch, and JupyterLab

L4T‑ML is an all‑in‑one ML environment that bundles PyTorch, TensorFlow, scikit‑learn, and JupyterLab optimized for JetPack 6.x.

Start L4T‑ML JupyterLab:

docker run --runtime nvidia -it -d \
  --name l4t-ml \
  --network host \
  --shm-size=8g \
  -v ~/ml-workspace:/workspace \
  dustynv/l4t-ml:r36.4.0 \
  jupyter lab --ip=0.0.0.0 --allow-root --no-browser
  • Access via http://<jetson-ip>:8888 in your browser.

Alternatively, the standalone dustynv/jupyterlab:r36.4.0 container provides just JupyterLab:

docker run --runtime nvidia -it -d \
  --name jupyterlab \
  --network host \
  -v ~/notebooks:/notebooks \
  dustynv/jupyterlab:r36.4.0 \
  jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token=''

PyTorch‑focused images (dustynv/pytorch, dustynv/l4t-pytorch) can be run via jetson-containers run ... as described in the build docs and are fully compatible with JetPack 6.2.2.


5.2 AIM Experiment Tracker

AIM is a lightweight REST‑accessible experiment tracker container.

Start AIM:

docker run --runtime nvidia -it -d \
  --name aim \
  --network host \
  -v ~/aim-data:/aim/data \
  dustynv/aim:r36.4.0 \
  aim up --host 0.0.0.0 --port 43800
  • Web UI and API at http://<jetson-ip>:43800.
  • n8n can poll api/runs and api/metrics using HTTP Request nodes to monitor training.

5.3 Home Assistant Core on Jetson

Home Assistant Core can run as a container for local smart‑home control.

Start Home Assistant:

docker run -it -d \
  --name homeassistant \
  --network host \
  -v ~/ha-config:/config \
  dustynv/homeassistant-core:r36.4.0
  • Access UI at http://<jetson-ip>:8123 and create a Long‑Lived Access Token under your profile.

n8n integration:

  • HTTP Request node with URL like http://<jetson-ip>:8123/api/states or /api/services/....
  • Authentication: Bearer token using the Long‑Lived Access Token.
  • Build flows like "sensor state change → LLM decision → Home Assistant service call" as outlined in the quickstart.
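
The same authenticated calls can be made from Python's standard library. A sketch; `ha_request` is an illustrative helper, and `light.living_room` is a hypothetical entity ID:

```python
import json
import urllib.request

def ha_request(base_url, token, path, payload=None):
    """Build an authenticated request for the Home Assistant REST API."""
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        f"{base_url}{path}",
        data=data,  # None -> GET, body present -> POST
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

# Example service call (hypothetical entity):
# req = ha_request("http://<jetson-ip>:8123", token,
#                  "/api/services/light/turn_on",
#                  {"entity_id": "light.living_room"})
# urllib.request.urlopen(req)
```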

6. n8n Integration Patterns and Networking Notes

The quickstart highlights that your n8n instance runs in Docker on port 5678 and must reach Jetson services via the Jetson's LAN IP, not localhost, because container networking isolates localhost inside the n8n container. For OpenAI‑compatible services, configure the OpenAI Chat Model node with the Base URL pointing to http://<jetson-ip>:<port>/v1, while for other services use HTTP Request nodes and explicit paths.

OpenAI‑compatible containers and ports (from API quick reference):

Container        Port    Base URL example
ollama           11434   http://<jetson-ip>:11434/v1
llama_cpp        8080    http://<jetson-ip>:8080/v1
vLLM             8000    http://<jetson-ip>:8000/v1
sglang           30000   http://<jetson-ip>:30000/v1
mlc              8080    http://<jetson-ip>:8080/v1
nano_llm         9000    http://<jetson-ip>:9000/v1
speaches         8000    http://<jetson-ip>:8000/v1
faster-whisper   8000    http://<jetson-ip>:8000/v1 (plus audio paths)
kokoro-tts       8880    http://<jetson-ip>:8880/v1
VILA             8000    http://<jetson-ip>:8000/v1

Example: full on‑device voice assistant pipeline in n8n:

Webhook (POST /voice-input) → receives audio
  ↓
HTTP Request → POST /v1/audio/transcriptions (faster-whisper or speaches)
  Body: form-data (file: binary audio, model: "whisper-1")
  ↓
OpenAI Chat Model → local LLM (Base URL = Ollama or vLLM)
  ↓
HTTP Request → POST /v1/audio/speech (kokoro-tts or speaches)
  Body: {"model":"kokoro","input":"{{$json.text}}","voice":"af_bella"}
  ↓
Webhook Response → returns audio binary

This pattern uses only local containers on Jetson and keeps all data on‑device.
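
The diagram above reduces to three function calls. A sketch in which each stage is injected as a callable, so any of the STT, LLM, and TTS containers (or n8n HTTP Request nodes) can fill the slots:

```python
def voice_pipeline(transcribe, chat, synthesize, audio_in):
    """On-device STT -> LLM -> TTS pipeline with each stage injected."""
    text = transcribe(audio_in)   # POST /v1/audio/transcriptions
    reply = chat(text)            # POST /v1/chat/completions
    return synthesize(reply)      # POST /v1/audio/speech, returns audio bytes
```

Keeping each stage behind a plain callable makes it easy to benchmark alternatives (e.g. Ollama vs vLLM for the chat stage) without touching the pipeline itself.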


7. Practical Recommendations and Next Steps

The quickstart confirms that all 51 dustynv/jetson-containers images tagged r36.4.0 are compatible with JetPack 6.x and have been tested on Jetson AGX Orin 64GB with CUDA 12.6. For production use on your board, the guide suggests mounting persistent caches, using --shm-size=8g for transformer‑based containers, benchmarking vLLM vs MLC vs llama.cpp on your target models, and eventually switching from --network host to explicit port mappings on isolated Docker networks.
