Abstract
This report documents an exploratory attempt to install and run Unsloth (including Unsloth Studio) on an NVIDIA Jetson AGX Orin 64 GB using a Docker-based workflow with dustynv/l4t-ml:r36.4.0 as the base image.
The process successfully validated GPU-accelerated PyTorch and Unsloth’s core Python package on Jetson, but exposed substantial friction and incompatibilities in getting Unsloth Studio’s full stack (Studio backend, frontend, Triton/TorchInductor/TorchAo dependencies, and custom virtual environment) to run reliably on this ARM-based edge platform.
The goal of this write-up is to provide a precise technical account so that other practitioners (and the Unsloth team) can (a) reproduce or avoid the same pitfalls, and (b) better assess the current suitability of Unsloth Studio for Jetson-class devices.
1. Hardware and Software Environment
The experiments were conducted on the following platform:
- Device: NVIDIA Jetson AGX Orin Developer Kit (64 GB)
- OS: Ubuntu 22.04.5 LTS, aarch64
- JetPack / L4T: JetPack 6.2.2, L4T 36.5.0
- CUDA: 12.6 (nvcc 12.6.68)
- cuDNN: 9.3.0
- TensorRT: 10.3.0
- Docker: Engine with the NVIDIA Container Runtime enabled (`--runtime=nvidia`)
- Base ML image: `dustynv/l4t-ml:r36.4.0` (from Jetson Containers), which provides:
  - PyTorch compiled for Jetson (aarch64) with CUDA and TensorRT integration
  - JupyterLab and common ML tooling
Host-side persistent storage for this project was centralized under:
```text
~/unsloth/
  build/    # Dockerfile and build context
  work/     # notebooks, datasets, outputs
  cache/    # general cache inside the container
  hf/       # Hugging Face cache
  jupyter/  # Jupyter config
  ssh/      # SSH keys/config (optional)
```
This layout was bind-mounted into the container to ensure persistence across container rebuilds.
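Before the first `docker run`, it is worth verifying that every host directory in this layout exists, because Docker creates missing bind-mount sources as root-owned directories. The helper below is a hypothetical convenience script, not part of the original workflow; the directory names mirror the `~/unsloth` layout above.

```python
from pathlib import Path

def ensure_layout(root: Path,
                  subdirs=("build", "work", "cache", "hf", "jupyter", "ssh")) -> list:
    """Create any missing subdirectories under root; return the names created."""
    created = []
    for name in subdirs:
        d = root / name
        if not d.exists():
            d.mkdir(parents=True)
            created.append(name)
    return created
```

Running it once against `~/unsloth` makes the container rebuilds idempotent with respect to the persistent storage.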
2. Docker Image Construction
2.1 Base Dockerfile
The starting point was a custom image layered on top of dustynv/l4t-ml:r36.4.0:
```dockerfile
FROM dustynv/l4t-ml:r36.4.0

ENV DEBIAN_FRONTEND=noninteractive \
    PIP_NO_CACHE_DIR=1 \
    PYTHONUNBUFFERED=1 \
    SHELL=/bin/bash \
    JUPYTER_PORT=8888 \
    STUDIO_PORT=8000 \
    WORKSPACE=/workspace \
    HF_HOME=/workspace/.cache/huggingface \
    TRANSFORMERS_CACHE=/workspace/.cache/huggingface \
    HUGGINGFACE_HUB_CACHE=/workspace/.cache/huggingface

USER root

RUN apt-get update && apt-get install -y --no-install-recommends \
        curl git wget ca-certificates build-essential pkg-config \
        python3-pip python3-dev python3-venv \
        openssh-server sudo nano htop tmux \
        libopenblas-dev libssl-dev libffi-dev \
    && rm -rf /var/lib/apt/lists/*

RUN mkdir -p /var/run/sshd /workspace/work /workspace/.cache/huggingface /root/.jupyter

RUN python3 -m pip install --upgrade pip setuptools wheel

# Remove Jetson-specific custom pip indexes to avoid transient outages
RUN python3 -m pip config unset global.index-url || true && \
    python3 -m pip config unset global.extra-index-url || true

# Generic Python dependencies via PyPI
RUN PIP_INDEX_URL=https://pypi.org/simple python3 -m pip install \
        fastapi "uvicorn[standard]" gradio \
        accelerate transformers peft trl datasets sentencepiece protobuf safetensors \
        huggingface_hub

# Install Unsloth (core + zoo) from GitHub/PyPI; note that unsloth-zoo lives
# in its own repository, not in unslothai/unsloth
RUN PIP_INDEX_URL=https://pypi.org/simple python3 -m pip install \
        "unsloth @ git+https://github.com/unslothai/unsloth.git" \
        "unsloth_zoo @ git+https://github.com/unslothai/unsloth-zoo.git" || true

# Optionally attempt bitsandbytes (may be fragile on Jetson)
RUN PIP_INDEX_URL=https://pypi.org/simple python3 -m pip install bitsandbytes || true

WORKDIR /workspace
EXPOSE 8000 8888 22
CMD ["/bin/bash"]
```
Key design choices:
- Reuse NVIDIA’s `l4t-ml` stack instead of installing PyTorch/TensorRT manually, since it is tuned for Jetson.
- Explicitly unset custom Jetson pip indexes before installing Unsloth, to avoid failures due to unavailable Jetson-specific mirrors while installing generic packages (e.g. `fastapi`).
- Install Unsloth via GitHub (or PyPI) rather than using the x86-oriented Docker image `unsloth/unsloth`.
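The index-fallback idea behind these design choices can be sketched as a small helper that retries a failed install against the public PyPI index. This is an illustrative sketch, not code from the experiment; the `runner` parameter is injected only so the helper can be exercised without actually installing anything.

```python
import subprocess
import sys

def pip_install(package: str, fallback_index: str = "https://pypi.org/simple",
                runner=subprocess.run) -> bool:
    """Try pip with whatever index is configured; on failure, retry via PyPI."""
    base = [sys.executable, "-m", "pip", "install", package]
    if runner(base).returncode == 0:
        return True
    # Retry, forcing the public index in case a custom mirror is unavailable.
    return runner(base + ["--index-url", fallback_index]).returncode == 0
```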
The image was built with:
```bash
cd ~/unsloth/build
sudo docker build --no-cache -t local/unsloth-studio:jetson-l4tml-r36.4.0 .
```
3. Container Runtime and GPU Validation
A persistent container was created with host networking and bind mounts:
```bash
sudo docker run -d \
  --name unsloth-studio \
  --restart unless-stopped \
  --runtime nvidia \
  --network host \
  --shm-size=16g \
  -e HF_HOME=/workspace/.cache/huggingface \
  -e TRANSFORMERS_CACHE=/workspace/.cache/huggingface \
  -e HUGGINGFACE_HUB_CACHE=/workspace/.cache/huggingface \
  -v ~/unsloth/work:/workspace/work \
  -v ~/unsloth/cache:/workspace/.cache \
  -v ~/unsloth/hf:/root/.cache/huggingface \
  -v ~/unsloth/jupyter:/root/.jupyter \
  -v ~/unsloth/ssh:/root/.ssh \
  local/unsloth-studio:jetson-l4tml-r36.4.0 \
  tail -f /dev/null
```
Inside the container, GPU support was verified with:
```bash
python3 -c "import torch;
print(torch.__version__);
print(torch.cuda.is_available());
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no cuda')"
```
This confirmed:
- `torch` version 2.6.0 (from the `l4t-ml` stack),
- CUDA available,
- device name reported as “Orin”.
Thus, the base ML environment inside the container was correctly accelerated on Jetson.
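To make later failure reports easier to correlate with the exact stack, the validated environment can be snapshotted with stdlib tooling. This is a hedged convenience sketch (the package names are examples); `importlib.metadata` reads installed distribution metadata without importing the packages themselves.

```python
from importlib import metadata

def snapshot(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None  # not installed in this environment
    return versions

print(snapshot(["torch", "transformers", "unsloth"]))
```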
4. Installing Unsloth Core
Within the container:
```bash
python3 -c "import unsloth; print('unsloth ok')"
unsloth --help
```
The CLI output showed the main Unsloth commands:
- `train`, `inference`, `export`, `list-checkpoints`
- `studio` (subcommand group)
However, importing Unsloth triggered a warning stacktrace related to Triton, TorchInductor, and TorchAo:
- `ImportError: cannot import name 'AttrsDescriptor' from triton.compiler.compiler`
- Errors inside `torch._inductor.runtime.hints` and `torchao.quantization`
This indicates that parts of the current Unsloth stack assume a Triton/TorchInductor/TorchAo configuration aligned with x86_64 desktop/server builds of PyTorch, which is not trivially compatible with the Jetson-specific PyTorch build shipped in `l4t-ml`.
Despite these warnings, the CLI remained usable for basic commands, and GPU acceleration for standard PyTorch operations was intact.
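Given that the warnings come from optional compiler stacks, a defensive pattern is to probe for those modules before enabling dependent features. The sketch below uses only the stdlib; the module names are taken from the warnings above. Note that locating a module does not guarantee its import succeeds (the `AttrsDescriptor` failure happened inside Triton's own import), so first use should still be wrapped in `try`/`except`.

```python
import importlib.util

def stack_available(module_name: str) -> bool:
    """True if the module can be located, without fully importing it."""
    return importlib.util.find_spec(module_name) is not None

# Names taken from the import warnings observed on Jetson
optional_stacks = {name: stack_available(name) for name in ("triton", "torchao")}
```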
5. Attempting to Enable Unsloth Studio
5.1 CLI-Level Status
The `unsloth studio` subcommand was present:

```bash
unsloth studio --help
```

showed options such as `--host`, `--port`, and `--frontend`, plus the subcommands `stop`, `update`, and `reset-password`.

Attempting to start Studio directly:

```bash
unsloth studio --host 0.0.0.0 --port 8000
```

returned:

```text
Studio not set up. Run install.sh first.
```
This implies that Studio expects an auxiliary installation step that sets up its environment (frontend, backend, and venv).
5.2 Running unsloth studio setup
Unsloth documentation describes a developer mode where Studio is installed via `uv` and a dedicated virtual environment.
Following this pattern, the command:

```bash
unsloth studio setup
```

produced:

- successful installation of `nvm`, Node LTS, and `bun`,
- a successful build of the frontend (“frontend built”),
- but then:

```text
python venv not found at /root/.unsloth/studio/unsloth_studio
Run install.sh first to create the environment:
curl -fsSL https://unsloth.ai/install.sh | sh
```
Thus, the CLI expects a virtual environment under /root/.unsloth/studio/unsloth_studio that appears to be normally created by the official install.sh script.
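The CLI's “is Studio set up?” test appears to reduce to the presence of this venv. The sketch below reconstructs that check from the error message; the real CLI logic may differ, and `pyvenv.cfg` is used here as the standard marker that a directory is a venv.

```python
from pathlib import Path

def studio_venv_ready(home: Path) -> bool:
    """Check for the venv layout the CLI error message points at."""
    venv_dir = home / ".unsloth" / "studio" / "unsloth_studio"
    return (venv_dir / "pyvenv.cfg").is_file()
```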
5.3 Manual Creation of the Studio Virtual Environment
Rather than relying on install.sh (which is tuned for other platforms and may interfere with the Jetson-specific PyTorch/Triton stack), a manual venv was created:
```bash
cd /root/.unsloth/studio
uv venv unsloth_studio --python 3.10
source /root/.unsloth/studio/unsloth_studio/bin/activate
uv pip install --index-url https://pypi.org/simple unsloth
```
This installed Unsloth (and a complete stack of dependencies) into the venv, including:
- `torch`, `torchao`, `triton`
- `transformers`, `accelerate`, `peft`, `trl`
- `bitsandbytes`
- `unsloth`, `unsloth-zoo`
Within the venv, `unsloth studio -H 0.0.0.0 -p 8000` still failed due to missing backend dependencies (`structlog`), which were then installed.
However, repeated attempts to start Studio continued to reveal issues:
- `ModuleNotFoundError: No module named 'structlog'` (due to pip confusion between global and venv environments)
- Friction in adding pip to the venv (pip not present, or not found via `python -m pip`)
- A recurring tension between the `uv`-managed environment and the classical `pip` expectations coming from Studio’s backend modules.
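The global-vs-venv confusion above can be diagnosed from inside the interpreter itself. A minimal sketch using only the stdlib:

```python
import sys
import sysconfig

def in_virtualenv() -> bool:
    """A venv sets sys.prefix different from sys.base_prefix."""
    return sys.prefix != sys.base_prefix

def install_target() -> str:
    """Directory where `python -m pip install` would place pure-Python packages."""
    return sysconfig.get_paths()["purelib"]
```

Running both checks with the Studio venv's interpreter quickly shows whether `pip` commands are actually landing there or in the container's global site-packages.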
Ultimately, even after installing the necessary Python packages, the CLI still treated Studio as “not set up” and insisted on running the global install.sh script.
6. Failure Modes and Root Causes
The main failure modes observed were:
- **Triton / TorchInductor / TorchAo incompatibilities**
  - Errors when importing Unsloth related to `AttrsDescriptor` in Triton and TorchInductor.
  - These components are not officially supported or tuned for the Jetson-specific PyTorch build, causing runtime import and registration issues.
- **Studio’s tight coupling to its own venv and installer**
  - Studio expects a very particular environment layout under `~/.unsloth/studio/unsloth_studio` created by `install.sh`.
  - Deviating from the installer (e.g., a manual or `uv`-only installation) leads to missing venv markers, which the CLI interprets as “Studio not set up.”
- **Tooling friction on Jetson (`uv` + venv + pip)**
  - The combination of `uv`-managed environments with a system Python and a Docker base image that already has a global `pip` led to situations where:
    - the venv had no `pip` initially,
    - `python -m ensurepip` installed pip globally rather than into the venv,
    - the actual `pip` used to install backend dependencies was the global one, leaving the venv incomplete.
- **Mismatch with the Jetson Containers philosophy**
  - Jetson Containers and `l4t-ml` are built around NVIDIA’s optimized PyTorch/TensorRT stacks, while Unsloth Studio’s modern pipeline assumes desktop/server-class Triton and TorchInductor configurations.
  - This mismatch is non-trivial to reconcile in a maintainable way.
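For the third failure mode, the key detail is which interpreter runs `ensurepip`: invoking it through the venv's own `python` binary puts pip inside the venv rather than the system site-packages. A hedged sketch of that workaround (the `runner` parameter exists only for testability):

```python
import subprocess

def bootstrap_pip(venv_python: str, runner=subprocess.run) -> bool:
    """Run ensurepip with the venv's own interpreter so pip lands in that
    venv, not in the global site-packages of the base image."""
    return runner([venv_python, "-m", "ensurepip", "--upgrade"]).returncode == 0
```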
7. Practical Outcomes
Despite the failure to get Unsloth Studio fully operational, the following outcomes were achieved:
- A validated GPU-accelerated Unsloth core environment on Jetson:
  - `unsloth` CLI installed and usable.
  - PyTorch 2.6.0 with CUDA on Orin working correctly.
- A reusable Docker-based ML devbox (`local/unsloth-studio:jetson-l4tml-r36.4.0`) with:
  - a clear persistent directory layout (`~/unsloth`),
  - host networking and shared volumes suitable for integration with other Jetson Containers (e.g., llama.cpp, vLLM, NanoLLM, llama-factory).
- Empirical evidence that, as of this experiment, Unsloth Studio is not yet a drop-in web UI solution for Jetson AGX Orin, due to:
  - Triton/TorchInductor/TorchAo assumptions, and
  - strong coupling to the `install.sh`-managed environment.
8. Recommendations for Jetson Practitioners
For current Jetson AGX Orin users:
- **Use Unsloth core selectively**
  - Unsloth’s Python API and CLI can still be valuable for fine-tuning/export workflows that do not rely heavily on Triton/TorchInductor-specific optimizations.
  - Prefer the Jetson-optimized PyTorch from `l4t-ml` and be cautious with features that depend on TorchInductor/Triton.
- **Rely on Jetson Containers for serving and fine-tuning**
  - For serving and fine-tuning large models on Jetson, the containers in the Jetson Containers ecosystem (llama.cpp, vLLM, MLC, TensorRT-LLM, NanoLLM, llama-factory) are significantly more mature and better integrated with JetPack and L4T.
- **Treat Unsloth Studio on Jetson as experimental**
  - Until there is first-class ARM/Jetson support (or a documented variant of `install.sh` and Studio’s backend explicitly targeting Jetson), Studio should be considered an experimental integration on this hardware.
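The first of these recommendations can be made concrete with a small guard that keeps `torch.compile` off unless explicitly requested. The `TORCH_COMPILE_DISABLE` environment variable is honored by recent PyTorch releases, but treat the exact name as an assumption to verify against the installed version:

```python
import os

# Ask Dynamo/Inductor to stay disabled; must be set before torch is imported.
os.environ.setdefault("TORCH_COMPILE_DISABLE", "1")

def maybe_compile(fn, enabled: bool = False):
    """Return fn wrapped by torch.compile only when explicitly enabled;
    otherwise return the eager function untouched (torch is never imported)."""
    if not enabled:
        return fn
    import torch
    return torch.compile(fn)
```

Wrapping hot functions through a gate like this keeps the eager path working on Jetson while leaving compilation available on platforms where Triton is healthy.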
9. Suggestions for the Unsloth Team
Based on this experience, the following changes would materially improve the viability of Unsloth Studio on Jetson and similar edge platforms:
- **Documented “headless / no-Triton” mode**
  - A configuration profile that can disable or bypass TorchInductor/Triton/TorchAo, relying purely on standard PyTorch kernels when running on unsupported architectures such as Jetson.
- **Explicit ARM/Jetson support statement and checks**
  - Clear statements in the documentation regarding ARM/aarch64 support status, with runtime checks that either:
    - enable a safe, reduced feature set, or
    - fail fast with a clear, actionable message.
- **Studio installation mode for preexisting Python stacks**
  - A variant of `install.sh` or `studio setup` that:
    - can attach to an existing PyTorch environment (e.g., Jetson’s `l4t-ml`), and
    - creates only the additional Studio-specific venv/backend/frontend without attempting to reconfigure PyTorch or Triton.
- **Minimal dependency profile for the Studio backend**
  - A smaller “core backend” dependency set for Studio that avoids complex quantization stacks and heavy compiler integrations when running in constrained or embedded environments.
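The runtime check proposed in the second suggestion could look roughly like the following: detect the architecture once and either degrade or fail fast. This is an illustrative sketch of the proposed behavior, not existing Unsloth code:

```python
import platform

def check_platform(strict: bool = False) -> str:
    """Return 'full' on x86_64, otherwise 'reduced' (or raise when strict)."""
    machine = platform.machine()
    if machine in ("x86_64", "AMD64"):
        return "full"
    if strict:
        raise RuntimeError(
            f"Unsupported architecture {machine!r}: run with the reduced "
            "feature set, or install on x86_64 for the full stack."
        )
    return "reduced"  # safe subset for aarch64/Jetson and other platforms
```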
10. Conclusion
The experiment demonstrates that:
- Installing Unsloth core on Jetson AGX Orin via a Dockerized `l4t-ml` base image is feasible, and the resulting environment is usable for GPU-accelerated LLM workflows.
- However, enabling Unsloth Studio (the full web UI for training and serving) on Jetson currently encounters significant hurdles due to the interaction between Triton/TorchInductor, TorchAo, `uv`-managed venvs, and the assumptions baked into `install.sh`.
From a practical standpoint, Jetson users are better served today by combining Unsloth core (where useful) with the existing Jetson Containers ecosystem, while treating Unsloth Studio as an experimental component on this hardware.
From a community and engineering perspective, this experiment highlights concrete areas where incremental changes and documentation from the Unsloth team could unlock a powerful edge deployment story on Jetson-class devices.