Abstract
This document describes how to run NVIDIA Jetson‑optimized AI containers from the dustynv/jetson-containers project on an NVIDIA Jetson AGX Orin 64GB Developer Kit with Ubuntu 22.04.5 LTS and JetPack 6.2.2 (L4T 36.5.0), focusing on LLMs, speech, vision, and development tools. It consolidates the original Jetson Containers Quickstart PDF into an operational tutorial with copy‑paste docker run commands and n8n integration pointers tailored to a system where n8n itself runs in Docker on port 5678. The tutorial targets engineers who want to run multiple local AI services on the same Jetson and orchestrate them via OpenAI‑compatible APIs without relying on external cloud providers.
1. Target Hardware and Software Environment
Your system matches the reference environment of the Jetson Containers Quickstart Guide: Jetson AGX Orin 64GB, Ubuntu 22.04.5 aarch64, JetPack 6.2.2 (L4T 36.5.0), CUDA 12.6, cuDNN 9.3.0, and TensorRT 10.3.0.30. This platform has 64 GB unified memory and is validated to run all 51 containers in the guide, including 70B‑parameter LLMs in GPU‑accelerated runtimes.
Before launching AI containers, ensure:
- Docker is installed and configured with the NVIDIA runtime (JetPack 6.x already provides nvidia-container-runtime; you mainly need to add Docker itself).
- GPU works inside Docker:
docker run --runtime nvidia --rm \
dustynv/cuda:12.8-samples-r36.4.0-cu128-24.04 \
/usr/local/cuda/extras/demo_suite/deviceQuery
- (Optional) Create directories for persistent data:
mkdir -p ~/.ollama \
~/.cache/huggingface \
~/sd-models \
~/comfyui-models \
~/comfyui-output \
~/ml-workspace \
~/notebooks \
~/aim-data \
~/ha-config
Use these directories as bind‑mounts so models and configuration survive container recreation.
2. LLM Inference Engines (OpenAI-Compatible)
2.1 Ollama — General Purpose LLM Runtime
Ollama is a user‑friendly way to run LLaMA, Mistral, Qwen, Gemma, Phi, and DeepSeek models with an OpenAI‑compatible REST API on your Jetson. The guide notes that AGX Orin 64GB can run 70B models comfortably using this runtime.
Start Ollama:
docker run --runtime nvidia -it -d \
--name ollama \
--network host \
-v ~/.ollama:/root/.ollama \
dustynv/ollama:r36.4.0
Pull a model and chat:
# Pull a model
curl http://localhost:11434/api/pull \
-d '{"name": "llama3.2:3b"}'
# Chat completion (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:3b",
"messages": [{"role":"user","content":"Hello from n8n!"}]
}'
n8n configuration:
- Credential: OpenAI API credential.
- API Key: any string (e.g. ollama).
- Base URL: http://<jetson-ip>:11434/v1.
- Model: llama3.2:3b or any model pulled into Ollama.
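Outside n8n, the same Ollama endpoint can be exercised from Python. A minimal stdlib-only sketch (the IP address in the comment is a placeholder for your Jetson; Ollama accepts any API key value):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> bytes:
    """Serialize an OpenAI-style chat completion body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

def chat(base_url: str, model: str, prompt: str) -> str:
    """POST to an OpenAI-compatible /chat/completions endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer ollama"},  # Ollama ignores the key value
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Against a running Ollama instance (IP is an example):
# chat("http://192.168.1.50:11434/v1", "llama3.2:3b", "Hello from n8n!")
```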
2.2 llama.cpp — GGUF, Quantized LLM Server
llama.cpp excels at running quantized GGUF models with low latency and memory usage. The quickstart provides an OpenAI‑compatible server configuration suitable for AGX Orin.
Start llama.cpp server:
docker run --runtime nvidia -it -d \
--name llama-server \
--network host \
-v /models:/models \
dustynv/llama_cpp:r36.4.0 \
llama-server \
--model /models/llama-3.1-8b-q4.gguf \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 999 \
--ctx-size 8192
- The server exposes OpenAI‑style endpoints on http://<jetson-ip>:8080/v1.
n8n configuration:
- Node: OpenAI Chat Model.
- Base URL: http://<jetson-ip>:8080/v1.
- Model: choose the name according to your server configuration; llama.cpp will map the loaded GGUF file to a logical model ID.
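If you are unsure which logical model ID the server advertises, the standard GET /v1/models endpoint lists it. A small stdlib sketch (the Jetson IP is a placeholder):

```python
import json
import urllib.request

def model_ids(models_response: dict) -> list:
    """Extract model IDs from an OpenAI-style GET /v1/models response."""
    return [m["id"] for m in models_response.get("data", [])]

def list_models(base_url: str) -> list:
    """Ask the server which model IDs it actually serves."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))

# Against a running llama.cpp server (IP is an example):
# list_models("http://192.168.1.50:8080/v1")
```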
2.3 vLLM — High Throughput LLM Serving
vLLM uses PagedAttention to reach significantly higher throughput than naive Hugging Face inference, which is useful for multi‑user services.
Start vLLM:
docker run --runtime nvidia -it -d \
--name vllm \
--network host \
--shm-size=8g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/vllm:r36.4.0 \
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-3B-Instruct \
--host 0.0.0.0 \
--port 8000
- Exposes OpenAI‑compatible endpoints at http://<jetson-ip>:8000/v1.
n8n configuration:
- Node: OpenAI Chat Model.
- Base URL: http://<jetson-ip>:8000/v1.
- Enable streaming mode if you want streamed responses.
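With streaming enabled, OpenAI-compatible servers such as vLLM return Server-Sent Events, one `data: {...}` line per token delta. A stdlib sketch of consuming that stream (IP and model are placeholders):

```python
import json
import urllib.request

def parse_sse_chunk(line: bytes) -> str:
    """Extract the content delta from one 'data: ...' line of an SSE stream."""
    line = line.strip()
    if not line.startswith(b"data: ") or line == b"data: [DONE]":
        return ""
    delta = json.loads(line[len(b"data: "):])["choices"][0].get("delta", {})
    return delta.get("content", "") or ""

def stream_chat(base_url: str, model: str, prompt: str):
    """Yield tokens from an OpenAI-compatible streaming chat completion."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(f"{base_url}/chat/completions", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            token = parse_sse_chunk(line)
            if token:
                yield token

# for tok in stream_chat("http://192.168.1.50:8000/v1",
#                        "meta-llama/Llama-3.2-3B-Instruct", "Hi"):
#     print(tok, end="", flush=True)
```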
2.4 SGLang — Structured Output and JSON
SGLang is designed for structured outputs and JSON‑constrained decoding using RadixAttention.
Start SGLang:
docker run --runtime nvidia -it -d \
--name sglang \
--network host \
--shm-size=8g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/sglang:r36.4.0 \
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-3B-Instruct \
--host 0.0.0.0 \
--port 30000
n8n usage pattern:
- Use an HTTP Request node pointing to http://<jetson-ip>:30000/v1/chat/completions and include response_format: {"type":"json_object"} in the body when you need strict JSON.
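To see what the JSON-constrained call looks like outside n8n, here is a stdlib sketch that builds the request body and parses the model's JSON reply (server address and model name are placeholders):

```python
import json
import urllib.request

def build_json_request(model: str, prompt: str) -> bytes:
    """Chat request body that asks the server for a JSON-only response."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_object"},
    }).encode()

def extract_json(completion: dict) -> dict:
    """Parse the model's JSON reply out of a chat completion response."""
    return json.loads(completion["choices"][0]["message"]["content"])

# Against a running SGLang server (IP is an example):
# req = urllib.request.Request(
#     "http://192.168.1.50:30000/v1/chat/completions",
#     data=build_json_request("meta-llama/Llama-3.2-3B-Instruct",
#                             "List three colors as JSON."),
#     headers={"Content-Type": "application/json"})
# extract_json(json.load(urllib.request.urlopen(req)))
```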
2.5 MLC and nanoLLM — Orin‑Optimized and Multimodal
MLC LLM compiles models targeting Jetson’s GPU architecture for fast token generation.
Start MLC LLM:
docker run --runtime nvidia -it -d \
--name mlc \
--network host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/mlc:r36.4.0
- According to the quickstart, MLC frequently achieves the fastest token rates on AGX Orin among the tested engines.
nanoLLM provides higher‑level multimodal pipelines with vision‑language and voice capabilities.
Start nanoLLM with VILA:
docker run --runtime nvidia -it -d \
--name nano-llm \
--network host \
--shm-size=8g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/nano_llm:r36.4.0 \
python3 -m nano_llm.serve \
--model Efficient-Large-Model/VILA1.5-3b \
--host 0.0.0.0 \
--port 9000
Multimodal example:
curl http://localhost:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "VILA1.5-3b",
"messages": [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "http://example.com/img.jpg"}},
{"type": "text", "text": "What is in this image?"}
]
}]
}'
n8n:
- Node: OpenAI Chat Model.
- Base URL: http://<jetson-ip>:9000/v1.
- Use messages with image_url and text parts when building prompts.
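When the image is not publicly reachable, many OpenAI-compatible VLM servers also accept images embedded as base64 data URLs instead of an http URL; whether your nanoLLM build accepts them is worth verifying. A small sketch of building such a multimodal message:

```python
import base64
from pathlib import Path

def image_message(image_path: str, question: str) -> dict:
    """Build an OpenAI-style multimodal user message, embedding a local image
    as a base64 data URL so no public image host is needed."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": question},
        ],
    }

# messages = [image_message("photo.jpg", "What is in this image?")]
```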
3. Speech and Audio Containers
3.1 faster-whisper — STT Server
faster‑whisper is a fast speech‑to‑text server offering OpenAI‑compatible endpoints on Jetson.
Start faster‑whisper:
docker run --runtime nvidia -it -d \
--name faster-whisper \
--network host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/faster-whisper:r36.4.0 \
python3 -m faster_whisper.server \
--host 0.0.0.0 \
--port 8000
- Exposes /v1/audio/transcriptions and works with OpenAI Chat Model or HTTP Request nodes.
n8n pattern:
- HTTP Request, method POST, URL http://<jetson-ip>:8000/v1/audio/transcriptions.
- Body: form-data with file (binary audio) and model (e.g. "whisper-1").
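The same form-data request can be built by hand. A stdlib sketch of the multipart body for /v1/audio/transcriptions (in practice the requests library or n8n's form-data mode does this for you):

```python
import uuid

def multipart_body(file_bytes: bytes, filename: str, model: str):
    """Hand-build a multipart/form-data body with a 'file' part and a 'model'
    part; returns (body, content_type_header_value)."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode() + file_bytes + (
        f"\r\n--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'
        f"--{boundary}--\r\n"
    ).encode()
    return body, f"multipart/form-data; boundary={boundary}"

# body, ctype = multipart_body(open("clip.wav", "rb").read(), "clip.wav", "whisper-1")
# POST body to http://<jetson-ip>:8000/v1/audio/transcriptions
# with the Content-Type header set to ctype.
```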
3.2 kokoro-tts — Lightweight Local TTS
kokoro‑tts offers an OpenAI‑compatible /v1/audio/speech endpoint with multiple voices.
Start kokoro‑tts:
docker run --runtime nvidia -it -d \
--name kokoro-tts \
--network host \
dustynv/kokoro-tts:r36.4.0
Generate MP3:
curl http://localhost:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "kokoro",
"input": "Hello from your Jetson!",
"voice": "af_bella",
"response_format": "mp3"
}' \
--output speech.mp3
n8n:
- HTTP Request, Response Format = File, then return or store the binary audio.
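The same call from Python, stdlib only (helper names and the IP are illustrative; the payload mirrors the curl example above):

```python
import json
import urllib.request

def build_speech_request(text: str, voice: str = "af_bella") -> bytes:
    """Body for the OpenAI-style /v1/audio/speech endpoint."""
    return json.dumps({"model": "kokoro", "input": text,
                       "voice": voice, "response_format": "mp3"}).encode()

def synthesize(base: str, text: str, out_path: str) -> str:
    """POST text to kokoro-tts and write the returned MP3 bytes to disk."""
    req = urllib.request.Request(f"{base}/v1/audio/speech",
                                 data=build_speech_request(text),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path

# synthesize("http://192.168.1.50:8880", "Hello from your Jetson!", "speech.mp3")
```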
3.3 speaches — Unified Speech In/Out
speaches exposes both STT and TTS endpoints compatible with OpenAI’s audio APIs.
Start speaches:
docker run --runtime nvidia -it -d \
--name speaches \
--network host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/speaches:r36.4.0
- Ports and endpoints are listed in the API quick reference (port 8000, OpenAI‑compatible).
A complete on‑device voice pipeline can be built as: Webhook (audio) → faster‑whisper STT → LLM (Ollama or vLLM) → kokoro‑tts or speaches TTS → Webhook response.
4. Vision, Diffusion, and VLM Containers
4.1 Stable Diffusion WebUI — Text‑to‑Image UI + API
The Stable Diffusion WebUI container gives you a full browser interface and REST API for image generation.
Start Stable Diffusion WebUI:
docker run --runtime nvidia -it -d \
--name sd-webui \
--network host \
-v ~/sd-models:/workspace/stable-diffusion-webui/models \
dustynv/stable-diffusion-webui:r36.4.0 \
python3 launch.py --api --listen --port 7860
- Web UI: http://<jetson-ip>:7860.
API txt2img example:
curl http://localhost:7860/sdapi/v1/txt2img \
-H "Content-Type: application/json" \
-d '{
"prompt": "mountain landscape",
"steps": 20,
"width": 512,
"height": 512
}'
n8n:
- HTTP Request → parse JSON → Move Binary Data to convert the base64 images[0] field to binary → send to Telegram, save to a file, etc.
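The Move Binary Data step corresponds to decoding the base64 string in images[0]; a stdlib sketch of the same conversion (field names follow the txt2img response shown above):

```python
import base64
import json

def save_first_image(response_text: str, out_path: str) -> str:
    """Decode images[0] (a base64-encoded PNG) from a /sdapi/v1/txt2img
    response and write it to disk."""
    images = json.loads(response_text)["images"]
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(images[0]))
    return out_path
```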
4.2 ComfyUI — Graph‑Based Diffusion Workflows
ComfyUI is a node‑based interface with an HTTP API.
Start ComfyUI:
docker run --runtime nvidia -it -d \
--name comfyui \
--network host \
-v ~/comfyui-models:/root/ComfyUI/models \
-v ~/comfyui-output:/root/ComfyUI/output \
dustynv/comfyui:r36.4.0
API flow:
- POST /prompt → get prompt_id.
- GET /history/{prompt_id} repeatedly until outputs appear.
- GET /view?filename=<filename>&type=output to download the image (the filename comes from the history response).
Use a sequence of HTTP Request nodes in n8n to implement the polling and retrieval.
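The poll-and-fetch loop can be sketched in Python (stdlib only; function names are illustrative and the workflow dict is whatever graph you exported from ComfyUI):

```python
import json
import time
import urllib.request

def queue_prompt(base: str, workflow: dict) -> str:
    """POST a workflow graph to /prompt and return its prompt_id."""
    req = urllib.request.Request(f"{base}/prompt",
                                 data=json.dumps({"prompt": workflow}).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["prompt_id"]

def finished_outputs(history: dict, prompt_id: str):
    """Return the outputs dict once the prompt appears in /history, else None."""
    entry = history.get(prompt_id)
    return entry.get("outputs") if entry else None

def wait_for_outputs(base: str, prompt_id: str, poll_s: float = 2.0):
    """Poll /history/{prompt_id} until the outputs are available."""
    while True:
        with urllib.request.urlopen(f"{base}/history/{prompt_id}") as resp:
            outputs = finished_outputs(json.load(resp), prompt_id)
        if outputs:
            return outputs
        time.sleep(poll_s)

# pid = queue_prompt("http://192.168.1.50:8188", my_workflow)  # IP is an example
# outputs = wait_for_outputs("http://192.168.1.50:8188", pid)
```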
4.3 VILA and Related VLMs
The VILA container provides an efficient vision‑language model with an OpenAI‑compatible API.
Start VILA:
docker run --runtime nvidia -it -d \
--name vila \
--network host \
--shm-size=8g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
dustynv/vila:r36.4.0
- According to the quick reference, VILA uses port 8000 and integrates via OpenAI Chat Model node.
In n8n, send messages that include an image_url object and text, similar to the nanoLLM example.
5. Development, Experiment Tracking, and Smart Home
5.1 L4T-ML, PyTorch, and JupyterLab
L4T‑ML is an all‑in‑one ML environment that bundles PyTorch, TensorFlow, scikit‑learn, and JupyterLab optimized for JetPack 6.x.
Start L4T‑ML JupyterLab:
docker run --runtime nvidia -it -d \
--name l4t-ml \
--network host \
--shm-size=8g \
-v ~/ml-workspace:/workspace \
dustynv/l4t-ml:r36.4.0 \
jupyter lab --ip=0.0.0.0 --allow-root --no-browser
- Access via http://<jetson-ip>:8888 in your browser.
Alternatively, the standalone dustynv/jupyterlab:r36.4.0 container provides just JupyterLab:
docker run --runtime nvidia -it -d \
--name jupyterlab \
--network host \
-v ~/notebooks:/notebooks \
dustynv/jupyterlab:r36.4.0 \
jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token=''
PyTorch‑focused images (dustynv/pytorch, dustynv/l4t-pytorch) can be run via jetson-containers run ... as described in the build docs and are fully compatible with JetPack 6.2.2.
5.2 AIM Experiment Tracker
AIM is a lightweight REST‑accessible experiment tracker container.
Start AIM:
docker run --runtime nvidia -it -d \
--name aim \
--network host \
-v ~/aim-data:/aim/data \
dustynv/aim:r36.4.0 \
aim up --host 0.0.0.0 --port 43800
- Web UI and API at http://<jetson-ip>:43800.
- n8n can poll api/runs and api/metrics using HTTP Request nodes to monitor training.
5.3 Home Assistant Core on Jetson
Home Assistant Core can run as a container for local smart‑home control.
Start Home Assistant:
docker run -it -d \
--name homeassistant \
--network host \
-v ~/ha-config:/config \
dustynv/homeassistant-core:r36.4.0
- Access the UI at http://<jetson-ip>:8123 and create a Long‑Lived Access Token under your profile.
n8n integration:
- HTTP Request node with URL like http://<jetson-ip>:8123/api/states or /api/services/....
- Authentication: Bearer token using the Long‑Lived Access Token.
- Build flows like "sensor state change → LLM decision → Home Assistant service call" as outlined in the quickstart.
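Such a service call looks like this from Python (stdlib only; the entity ID and IP are examples):

```python
import json
import urllib.request

def ha_request(base: str, token: str, path: str, payload=None) -> urllib.request.Request:
    """Build an authenticated Home Assistant REST request
    (GET when no payload is given, POST otherwise)."""
    return urllib.request.Request(
        f"{base}{path}",
        data=json.dumps(payload).encode() if payload else None,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )

# Turn on a light (entity_id is an example):
# req = ha_request("http://192.168.1.50:8123", TOKEN,
#                  "/api/services/light/turn_on", {"entity_id": "light.kitchen"})
# urllib.request.urlopen(req)
```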
6. n8n Integration Patterns and Networking Notes
The quickstart highlights that your n8n instance runs in Docker on port 5678 and must reach Jetson services via the Jetson's LAN IP, not localhost, because container networking isolates localhost inside the n8n container. For OpenAI‑compatible services, configure the OpenAI Chat Model node with the Base URL pointing to http://<jetson-ip>:<port>/v1, while for other services use HTTP Request nodes and explicit paths.
OpenAI‑compatible containers and ports (from API quick reference):
| Container | Port | Base URL example |
|---|---|---|
| ollama | 11434 | http://<jetson-ip>:11434/v1 |
| llama_cpp | 8080 | http://<jetson-ip>:8080/v1 |
| vLLM | 8000 | http://<jetson-ip>:8000/v1 |
| sglang | 30000 | http://<jetson-ip>:30000/v1 |
| mlc | 8080 | http://<jetson-ip>:8080/v1 |
| nano_llm | 9000 | http://<jetson-ip>:9000/v1 |
| speaches | 8000 | http://<jetson-ip>:8000/v1 |
| faster-whisper | 8000 | http://<jetson-ip>:8000/v1 or audio paths |
| kokoro-tts | 8880 | http://<jetson-ip>:8880/v1 |
| VILA | 8000 | http://<jetson-ip>:8000/v1 |
Example: full on‑device voice assistant pipeline in n8n:
Webhook (POST /voice-input) → receives audio
↓
HTTP Request → POST /v1/audio/transcriptions (faster-whisper or speaches)
Body: form-data (file: binary audio, model: "whisper-1")
↓
OpenAI Chat Model → local LLM (Base URL = Ollama or vLLM)
↓
HTTP Request → POST /v1/audio/speech (kokoro-tts or speaches)
Body: {"model":"kokoro","input":"{{$json.text}}","voice":"af_bella"}
↓
Webhook Response → returns audio binary
This pattern uses only local containers on Jetson and keeps all data on‑device.
7. Practical Recommendations and Next Steps
The quickstart confirms that all 51 dustynv/jetson-containers images tagged r36.4.0 are compatible with JetPack 6.x and have been tested on Jetson AGX Orin 64GB with CUDA 12.6. For production use on your board, the guide suggests mounting persistent caches, using --shm-size=8g for transformer‑based containers, benchmarking vLLM vs MLC vs llama.cpp on your target models, and eventually switching from --network host to explicit port mappings on isolated Docker networks.