Abstract
This document describes how to run NVIDIA Jetson‑optimized AI containers from the dustynv/jetson-containers project on an NVIDIA Jetson AGX Orin 64GB Developer Kit with Ubuntu 22.04.5 LTS and JetPack 6.2.2 (L4T 36.5.0), focusing on LLMs, speech, vision, and development tools. It consolidates the original Jetson Containers Quickstart PDF into an operational tutorial with copy‑paste docker run commands and n8n integration pointers tailored to a system where n8n itself runs in Docker on port 5678. The tutorial targets engineers who want to run multiple local AI services on the same Jetson and orchestrate them via OpenAI‑compatible APIs without relying on external cloud providers.
1. Target Hardware and Software Environment
Your system matches the reference environment of the Jetson Containers Quickstart Guide: Jetson AGX Orin 64GB, Ubuntu 22.04.5 aarch64, JetPack 6.2.2 (L4T 36.5.0), CUDA 12.6, cuDNN 9.3.0, and TensorRT 10.3.0.30. This platform has 64 GB unified memory and is validated to run all 51 containers in the guide, including 70B‑parameter LLMs in GPU‑accelerated runtimes.
Before launching AI containers, ensure:
- Docker is installed and configured with the NVIDIA runtime (JetPack 6.x already provides nvidia-container-runtime; you mainly need to add Docker itself).
- GPU works inside Docker:
docker run --runtime nvidia --rm \
dustynv/cuda:12.8-samples-r36.4.0-cu128-24.04 \
/usr/local/cuda/extras/demo_suite/deviceQuery
- (Optional) Create directories for persistent data:
mkdir -p ~/.ollama \
~/.cache/huggingface \
~/sd-models \
~/comfyui-models \
~/comfyui-output \
~/ml-workspace \
~/notebooks \
~/aim-data \
~/ha-config
Use these directories as bind‑mounts so models and configuration survive container recreation.
2. LLM Inference Engines (OpenAI-Compatible)
2.1 Ollama — General Purpose LLM Runtime
Ollama is a user‑friendly way to run LLaMA, Mistral, Qwen, Gemma, Phi, and DeepSeek models with an OpenAI‑compatible REST API on your Jetson. The guide notes that AGX Orin 64GB can run 70B models comfortably using this runtime.
Start Ollama:
docker run --runtime nvidia -it -d \
--name ollama \
--network host \
-v ~/.ollama:/root/.ollama \
dustynv/ollama:r36.4.0
Pull a model and chat:
# Pull a model
curl http://localhost:11434/api/pull \
-d '{"name": "llama3.2:3b"}'
# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:3b",
"messages": [{"role":"user","content":"Hello from n8n!"}]
}'
n8n configuration:
- Credential: OpenAI API credential.
- API Key: any string (e.g. ollama).
- Base URL: http://<jetson-ip>:11434/v1.
- Model: llama3.2:3b or any model pulled into Ollama.
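Outside n8n, the same Ollama endpoint can be exercised from Python. A minimal stdlib-only sketch (the IP address in the comment is a placeholder for your Jetson; Ollama accepts any API key value):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> bytes:
    """Serialize an OpenAI-style chat completion body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST to an OpenAI-compatible /chat/completions endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer ollama"},  # Ollama ignores the key value
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Against a running Ollama instance (IP is an example):
# chat("http://192.168.1.50:11434/v1", "llama3.2:3b", "Hello from n8n!")
```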
2.2 llama.cpp — GGUF, Quantized LLM Server
llama.cpp excels at running quantized GGUF models with low latency and memory usage. The quickstart provides an OpenAI‑compatible server configuration suitable for AGX Orin.
Start llama.cpp server:
docker run --runtime nvidia -it -d \
--name llama-server \
--network host \
-v /models:/models \
dustynv/llama_cpp:r36.4.0 \
llama-server \
--model /models/llama-3.1-8b-q4.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 999 \
--ctx-size 8192
- The server exposes OpenAI‑style endpoints on http://<jetson-ip>:8080/v1.
n8n configuration:
- Node: OpenAI Chat Model.
- Base URL: http://<jetson-ip>:8080/v1.
- Model: choose the name according to your server configuration; llama.cpp will map the loaded GGUF file to a logical model ID.
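If you are unsure which logical model ID the server advertises, the standard GET /v1/models endpoint lists it. A small stdlib sketch (the Jetson IP is a placeholder):

```python
import json
import urllib.request

def model_ids(models_response: dict) -> list:
    """Extract model IDs from an OpenAI-style GET /v1/models response."""
    return [m["id"] for m in models_response.get("data", [])]

def list_models(base_url: str) -> list:
    """Ask the server which model IDs it actually serves."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))

# Against a running llama.cpp server (IP is an example):
# list_models("http://192.168.1.50:8080/v1")
```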
2.3 vLLM — High Throughput LLM Serving
vLLM uses PagedAttention to reach significantly higher throughput than naive Hugging Face inference, which is useful for multi‑user services.
Start vLLM:
docker run --runtime nvidia -it -d \
--name vllm \
--network host \
--shm-size=8g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/vllm:r36.4.0 \
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-3B-Instruct \
--host 0.0.0.0 \
--port 8000
- Exposes OpenAI‑compatible endpoints at http://<jetson-ip>:8000/v1.
n8n configuration:
- Node: OpenAI Chat Model.
- Base URL: http://<jetson-ip>:8000/v1.
- Enable streaming mode if you want streamed responses.
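With streaming enabled, OpenAI-compatible servers such as vLLM return Server-Sent Events, one `data: {...}` line per token delta. A stdlib sketch of consuming that stream (IP and model are placeholders):

```python
import json
import urllib.request

def parse_sse_chunk(line: bytes) -> str:
    """Extract the content delta from one 'data: ...' line of an SSE stream."""
    line = line.strip()
    if not line.startswith(b"data: ") or line == b"data: [DONE]":
        return ""
    delta = json.loads(line[len(b"data: "):])["choices"][0].get("delta", {})
    return delta.get("content", "") or ""

def stream_chat(base_url: str, model: str, prompt: str):
    """Yield tokens from an OpenAI-compatible streaming chat completion."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(f"{base_url}/chat/completions", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            token = parse_sse_chunk(line)
            if token:
                yield token

# for tok in stream_chat("http://192.168.1.50:8000/v1",
#                        "meta-llama/Llama-3.2-3B-Instruct", "Hi"):
#     print(tok, end="", flush=True)
```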
2.4 SGLang — Structured Output and JSON
SGLang is designed for structured outputs and JSON‑constrained decoding using RadixAttention.
Start SGLang:
docker run --runtime nvidia -it -d \
--name sglang \
--network host \
--shm-size=8g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/sglang:r36.4.0 \
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-3B-Instruct \
--host 0.0.0.0 \
--port 30000
n8n usage pattern:
- Use an HTTP Request node pointing to http://<jetson-ip>:30000/v1/chat/completions and include response_format: {"type":"json_object"} in the body when you need strict JSON.
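To see what the JSON-constrained call looks like outside n8n, here is a stdlib sketch that builds the request body and parses the model's JSON reply (server address and model name are placeholders):

```python
import json
import urllib.request

def build_json_request(model: str, prompt: str) -> bytes:
    """Chat request body that asks the server for a JSON-only response."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_object"},
    }).encode()

def extract_json(completion: dict) -> dict:
    """Parse the model's JSON reply out of a chat completion response."""
    return json.loads(completion["choices"][0]["message"]["content"])

# Against a running SGLang server (IP is an example):
# req = urllib.request.Request(
#     "http://192.168.1.50:30000/v1/chat/completions",
#     data=build_json_request("meta-llama/Llama-3.2-3B-Instruct",
#                             "List three colors as JSON."),
#     headers={"Content-Type": "application/json"})
# extract_json(json.load(urllib.request.urlopen(req)))
```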
2.5 MLC and nanoLLM — Orin‑Optimized and Multimodal
MLC LLM compiles models targeting Jetson’s GPU architecture for fast token generation.
Start MLC LLM:
docker run --runtime nvidia -it -d \
--name mlc \
--network host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/mlc:r36.4.0
- According to the quickstart, MLC frequently achieves the fastest token rates on AGX Orin among the tested engines.
nanoLLM provides higher‑level multimodal pipelines with vision‑language and voice capabilities.
Start nanoLLM with VILA:
docker run --runtime nvidia -it -d \
--name nano-llm \
--network host \
--shm-size=8g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/nano_llm:r36.4.0 \
python3 -m nano_llm.serve \
--model Efficient-Large-Model/VILA1.5-3b \
--host 0.0.0.0 \
--port 9000
Multimodal example:
curl http://localhost:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "VILA1.5-3b",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "http://example.com/img.jpg"}},
{"type": "text", "text": "What is in this image?"}
]
}]
}'
n8n:
- Node: OpenAI Chat Model.
- Base URL: http://<jetson-ip>:9000/v1.
- Use messages with image_url and text parts when building prompts.
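When the image is not publicly reachable, many OpenAI-compatible VLM servers also accept images embedded as base64 data URLs instead of an http URL; whether your nanoLLM build accepts them is worth verifying. A small sketch of building such a multimodal message:

```python
import base64
from pathlib import Path

def image_message(image_path: str, question: str) -> dict:
    """Build an OpenAI-style multimodal user message, embedding a local image
    as a base64 data URL so no public image host is needed."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": question},
        ],
    }

# messages = [image_message("photo.jpg", "What is in this image?")]
```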
3. Speech and Audio Containers
3.1 faster-whisper — STT Server
faster‑whisper is a fast speech‑to‑text server offering OpenAI‑compatible endpoints on Jetson.
Start faster‑whisper:
docker run --runtime nvidia -it -d \
--name faster-whisper \
--network host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/faster-whisper:r36.4.0 \
python3 -m faster_whisper.server \
--host 0.0.0.0 \
--port 8000
- Exposes /v1/audio/transcriptions and works with OpenAI Chat Model or HTTP Request nodes.
n8n pattern:
- HTTP Request, method POST, URL http://<jetson-ip>:8000/v1/audio/transcriptions.
- Body: form-data with file (binary audio) and model (e.g. "whisper-1").
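The same form-data request can be built by hand. A stdlib sketch of the multipart body for /v1/audio/transcriptions (in practice the requests library or n8n's form-data mode does this for you):

```python
import uuid

def multipart_body(file_bytes: bytes, filename: str, model: str):
    """Hand-build a multipart/form-data body with a 'file' part and a 'model'
    part; returns (body, content_type_header_value)."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + file_bytes + (
        f"\r\n--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'
        f"--{boundary}--\r\n"
    ).encode()
    return body, f"multipart/form-data; boundary={boundary}"

# body, ctype = multipart_body(open("clip.wav", "rb").read(), "clip.wav", "whisper-1")
# POST body to http://<jetson-ip>:8000/v1/audio/transcriptions
# with the Content-Type header set to ctype.
```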
3.2 kokoro-tts — Lightweight Local TTS
kokoro‑tts offers an OpenAI‑compatible /v1/audio/speech endpoint with multiple voices.
Start kokoro‑tts:
docker run --runtime nvidia -it -d \
--name kokoro-tts \
--network host \
dustynv/kokoro-tts:r36.4.0
Generate MP3:
curl http://localhost:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "kokoro",
"input": "Hello from your Jetson!",
"voice": "af_bella",
"response_format": "mp3"
}' \
--output speech.mp3
n8n:
- HTTP Request, Response Format = File, then return or store the binary audio.
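The same call from Python, stdlib only (helper names and the IP are illustrative; the payload mirrors the curl example above):

```python
import json
import urllib.request

def build_speech_request(text: str, voice: str = "af_bella") -> bytes:
    """Body for the OpenAI-style /v1/audio/speech endpoint."""
    return json.dumps({"model": "kokoro", "input": text,
                       "voice": voice, "response_format": "mp3"}).encode()

def synthesize(base: str, text: str, out_path: str) -> str:
    """POST text to kokoro-tts and write the returned MP3 bytes to disk."""
    req = urllib.request.Request(f"{base}/v1/audio/speech",
                                 data=build_speech_request(text),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path

# synthesize("http://192.168.1.50:8880", "Hello from your Jetson!", "speech.mp3")
```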
3.3 speaches — Unified Speech In/Out
speaches exposes both STT and TTS endpoints compatible with OpenAI’s audio APIs.
Start speaches:
docker run --runtime nvidia -it -d \
--name speaches \
--network host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/speaches:r36.4.0
- Ports and endpoints are listed in the API quick reference (port 8000, OpenAI‑compatible).
A complete on‑device voice pipeline can be built as: Webhook (audio) → faster‑whisper STT → LLM (Ollama or vLLM) → kokoro‑tts or speaches TTS → Webhook response.
4. Vision, Diffusion, and VLM Containers
4.1 Stable Diffusion WebUI — Text‑to‑Image UI + API
The Stable Diffusion WebUI container gives you a full browser interface and REST API for image generation.
Start Stable Diffusion WebUI:
docker run --runtime nvidia -it -d \
--name sd-webui \
--network host \
-v ~/sd-models:/workspace/stable-diffusion-webui/models \
dustynv/stable-diffusion-webui:r36.4.0 \
python3 launch.py --api --listen --port 7860
- Web UI: http://<jetson-ip>:7860.
API txt2img example:
curl http://localhost:7860/sdapi/v1/txt2img \
-H "Content-Type: application/json" \
-d '{
"prompt": "mountain landscape",
"steps": 20,
"width": 512,
"height": 512
}'
n8n:
- HTTP Request → parse JSON → Move Binary Data to convert the base64 images[0] field to binary → send to Telegram, save to a file, etc.
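The Move Binary Data step corresponds to decoding the base64 string in images[0]; a stdlib sketch of the same conversion (field names follow the txt2img response shown above):

```python
import base64
import json

def save_first_image(response_text: str, out_path: str) -> str:
    """Decode images[0] (a base64-encoded PNG) from a /sdapi/v1/txt2img
    response and write it to disk."""
    images = json.loads(response_text)["images"]
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(images[0]))
    return out_path
```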
4.2 ComfyUI — Graph‑Based Diffusion Workflows
ComfyUI is a node‑based interface with an HTTP API.
Start ComfyUI:
docker run --runtime nvidia -it -d \
--name comfyui \
--network host \
-v ~/comfyui-models:/root/ComfyUI/models \
-v ~/comfyui-output:/root/ComfyUI/output \
dustynv/comfyui:r36.4.0
API flow:
- POST /prompt → get prompt_id.
- GET /history/{prompt_id} repeatedly until outputs appear.
- GET /view?filename=<filename>&type=output to download the image (the filename comes from the history response).
Use a sequence of HTTP Request nodes in n8n to implement the polling and retrieval.
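The poll-and-fetch loop can be sketched in Python (stdlib only; function names are illustrative and the workflow dict is whatever graph you exported from ComfyUI):

```python
import json
import time
import urllib.request

def queue_prompt(base: str, workflow: dict) -> str:
    """POST a workflow graph to /prompt and return its prompt_id."""
    req = urllib.request.Request(f"{base}/prompt",
                                 data=json.dumps({"prompt": workflow}).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["prompt_id"]

def finished_outputs(history: dict, prompt_id: str):
    """Return the outputs dict once the prompt appears in /history, else None."""
    entry = history.get(prompt_id)
    return entry.get("outputs") if entry else None

def wait_for_outputs(base: str, prompt_id: str, poll_s: float = 2.0):
    """Poll /history/{prompt_id} until the outputs are available."""
    while True:
        with urllib.request.urlopen(f"{base}/history/{prompt_id}") as resp:
            outputs = finished_outputs(json.load(resp), prompt_id)
        if outputs:
            return outputs
        time.sleep(poll_s)

# pid = queue_prompt("http://192.168.1.50:8188", my_workflow)  # IP is an example
# outputs = wait_for_outputs("http://192.168.1.50:8188", pid)
```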
4.3 VILA and Related VLMs
The VILA container provides an efficient vision‑language model with an OpenAI‑compatible API.
Start VILA:
docker run --runtime nvidia -it -d \
--name vila \
--network host \
--shm-size=8g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/vila:r36.4.0
- According to the quick reference, VILA uses port 8000 and integrates via OpenAI Chat Model node.
In n8n, send messages that include an image_url object and text, similar to the nanoLLM example.
5. Development, Experiment Tracking, and Smart Home
5.1 L4T-ML, PyTorch, and JupyterLab
L4T‑ML is an all‑in‑one ML environment that bundles PyTorch, TensorFlow, scikit‑learn, and JupyterLab optimized for JetPack 6.x.
Start L4T‑ML JupyterLab:
docker run --runtime nvidia -it -d \
--name l4t-ml \
--network host \
--shm-size=8g \
-v ~/ml-workspace:/workspace \
dustynv/l4t-ml:r36.4.0 \
jupyter lab --ip=0.0.0.0 --allow-root --no-browser
- Access via http://<jetson-ip>:8888 in your browser.
Alternatively, the standalone dustynv/jupyterlab:r36.4.0 container provides just JupyterLab:
docker run --runtime nvidia -it -d \
--name jupyterlab \
--network host \
-v ~/notebooks:/notebooks \
dustynv/jupyterlab:r36.4.0 \
jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token=''
PyTorch‑focused images (dustynv/pytorch, dustynv/l4t-pytorch) can be run via jetson-containers run ... as described in the build docs and are fully compatible with JetPack 6.2.2.
5.2 AIM Experiment Tracker
AIM is a lightweight REST‑accessible experiment tracker container.
Start AIM:
docker run --runtime nvidia -it -d \
--name aim \
--network host \
-v ~/aim-data:/aim/data \
dustynv/aim:r36.4.0 \
aim up --host 0.0.0.0 --port 43800
- Web UI and API at http://<jetson-ip>:43800.
- n8n can poll api/runs and api/metrics using HTTP Request nodes to monitor training.
5.3 Home Assistant Core on Jetson
Home Assistant Core can run as a container for local smart‑home control.
Start Home Assistant:
docker run -it -d \
--name homeassistant \
--network host \
-v ~/ha-config:/config \
dustynv/homeassistant-core:r36.4.0
- Access the UI at http://<jetson-ip>:8123 and create a Long‑Lived Access Token under your profile.
n8n integration:
- HTTP Request node with URL like http://<jetson-ip>:8123/api/states or /api/services/....
- Authentication: Bearer token using the Long‑Lived Access Token.
- Build flows like "sensor state change → LLM decision → Home Assistant service call" as outlined in the quickstart.
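Such a service call looks like this from Python (stdlib only; the entity ID and IP are examples):

```python
import json
import urllib.request

def ha_request(base: str, token: str, path: str, payload=None) -> urllib.request.Request:
    """Build an authenticated Home Assistant REST request
    (GET when no payload is given, POST otherwise)."""
    return urllib.request.Request(
        f"{base}{path}",
        data=json.dumps(payload).encode() if payload else None,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )

# Turn on a light (entity_id is an example):
# req = ha_request("http://192.168.1.50:8123", TOKEN,
#                  "/api/services/light/turn_on", {"entity_id": "light.kitchen"})
# urllib.request.urlopen(req)
```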
6. n8n Integration Patterns and Networking Notes
The quickstart highlights that your n8n instance runs in Docker on port 5678 and must reach Jetson services via the Jetson's LAN IP, not localhost, because container networking isolates localhost inside the n8n container. For OpenAI‑compatible services, configure the OpenAI Chat Model node with the Base URL pointing to http://<jetson-ip>:<port>/v1, while for other services use HTTP Request nodes and explicit paths.
OpenAI‑compatible containers and ports (from API quick reference):
| Container | Port | Base URL example |
|---|---|---|
| ollama | 11434 | http://<jetson-ip>:11434/v1 |
| llama_cpp | 8080 | http://<jetson-ip>:8080/v1 |
| vLLM | 8000 | http://<jetson-ip>:8000/v1 |
| sglang | 30000 | http://<jetson-ip>:30000/v1 |
| mlc | 8080 | http://<jetson-ip>:8080/v1 |
| nano_llm | 9000 | http://<jetson-ip>:9000/v1 |
| speaches | 8000 | http://<jetson-ip>:8000/v1 |
| faster-whisper | 8000 | http://<jetson-ip>:8000/v1 or audio paths |
| kokoro-tts | 8880 | http://<jetson-ip>:8880/v1 |
| VILA | 8000 | http://<jetson-ip>:8000/v1 |
Example: full on‑device voice assistant pipeline in n8n:
Webhook (POST /voice-input) → receives audio
↓
HTTP Request → POST /v1/audio/transcriptions (faster-whisper or speaches)
Body: form-data (file: binary audio, model: "whisper-1")
↓
OpenAI Chat Model → local LLM (Base URL = Ollama or vLLM)
↓
HTTP Request → POST /v1/audio/speech (kokoro-tts or speaches)
Body: {"model":"kokoro","input":"{{$json.text}}","voice":"af_bella"}
↓
Webhook Response → returns audio binary
This pattern uses only local containers on Jetson and keeps all data on‑device.
7. Practical Recommendations and Next Steps
The quickstart confirms that all 51 dustynv/jetson-containers images tagged r36.4.0 are compatible with JetPack 6.x and have been tested on Jetson AGX Orin 64GB with CUDA 12.6. For production use on your board, the guide suggests mounting persistent caches, using --shm-size=8g for transformer‑based containers, benchmarking vLLM vs MLC vs llama.cpp on your target models, and eventually switching from --network host to explicit port mappings on isolated Docker networks.