DEV Community

Hemkesh

Why cudaMalloc fails on NVIDIA Jetson Orin Nano Super — and the one flag that fixes it

If you've tried running a GGUF model via llama.cpp on a Jetson Orin Nano Super and hit this error:

NvMapMemAllocInternalTagged: error 12
cudaMalloc failed: out of memory

...while tegrastats shows the GPU idle and several GB of RAM free — this post is for you.

The hardware context

The Jetson Orin Nano Super (8 GB) uses unified memory. There is no separate VRAM pool — the CPU and GPU share one physical 8 GB block. This is how all Jetson Orin-series SoCs work, and it's what makes them cost-effective for edge AI.

Why stock llama.cpp fails

When llama.cpp allocates GPU tensor buffers, it calls cudaMalloc. On a discrete GPU, cudaMalloc carves out memory from a dedicated VRAM pool. On the Orin Nano Super there is no such pool: the request goes through the NvMap interface to the shared DRAM, and in my case NvMap rejected it with error 12 (ENOMEM) even though the unified pool had plenty of space.

The result: the model fails to load with a misleading out-of-memory error despite free RAM.
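To confirm that the error really is misleading, it can help to check how much memory the kernel reports as available just before launching. The following is a minimal sketch, assuming a standard Linux /proc/meminfo; the function name mem_available_mib is my own and not part of any Jetson tool.

```shell
#!/bin/sh
# Hypothetical helper: print MemAvailable in MiB.
# Reads meminfo-formatted text from the file given as $1,
# defaulting to /proc/meminfo (values there are in kB).
mem_available_mib() (
    src="${1:-/proc/meminfo}"
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "$src"
)
```

Calling mem_available_mib with no arguments on the Jetson prints one number. If that number is comfortably larger than the model file and cudaMalloc still fails, you are likely looking at the same problem described here.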

The fix

Set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 as an environment variable when you launch the server. This is a runtime flag, not a build option — in llama.cpp's CUDA backend it switches GPU allocations from cudaMalloc to cudaMallocManaged, which works with the unified pool.
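The switch described above can be mimicked in a few lines of shell. This is only an illustration of the selection logic, not llama.cpp code: the function name expected_allocator is hypothetical, and it assumes the backend checks only whether the variable is set, not what value it holds. Verify that against the source of your own checkout.

```shell
#!/bin/sh
# Hypothetical illustration: report which CUDA allocation call the
# backend is expected to use for the current environment.
# Assumption: the variable only needs to be set; its value is not inspected.
expected_allocator() (
    if [ -n "${GGML_CUDA_ENABLE_UNIFIED_MEMORY+x}" ]; then
        echo "cudaMallocManaged"
    else
        echo "cudaMalloc"
    fi
)
```

Under that assumption, even GGML_CUDA_ENABLE_UNIFIED_MEMORY=0 would select cudaMallocManaged, so the safe way to turn the behavior off is to unset the variable rather than set it to 0.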

First, build llama.cpp with CUDA for the Orin's compute capability (sm_87):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES=87
cmake --build build --config Release -j$(nproc)

Then enable unified memory at launch by setting the environment variable:

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-server \
    -m LFM2-VL-1.6B-Q4_0.gguf \
    --mmproj mmproj-LFM2-VL-1.6B-Q8_0.gguf \
    --n-gpu-layers 999

That's it. The model loads and runs fully GPU-accelerated.
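To check from a script that the server actually came up, one option is to poll its health endpoint. A rough sketch, assuming llama-server is listening on its default 127.0.0.1:8080 and exposes /health; the function name wait_for_health and its default attempt count and delay are my own choices.

```shell
#!/bin/sh
# Hypothetical helper: poll a health URL until it answers with a
# successful HTTP status, or give up after a number of attempts.
# Arguments: URL, max attempts, delay between attempts in seconds.
wait_for_health() (
    url="${1:-http://127.0.0.1:8080/health}"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -f: treat HTTP error statuses (e.g. while the model is still
        # loading) as failure; -s: silent; --max-time: bound each try.
        if curl -fs --max-time 5 -o /dev/null "$url"; then
            echo "server ready after $i attempt(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "server not ready after $attempts attempt(s)" >&2
    return 1
)
```

Run wait_for_health in a second terminal after starting the server. If it times out, look in the server log for the NvMap error above before changing anything else.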

What I was building

I ran into this while building SENTINEL — a self-hosted AI surveillance system for my B.Tech minor project. It runs YOLOv8n + TensorRT for person detection, DeepFace for face recognition, and LFM2-VL 1.6B for scene understanding — all on the Jetson, zero cloud dependency.

Full project and complete BUILD.md:
https://github.com/hemkesh2021-dotcom/Sentinel_Surveillance

Hope this saves someone a few hours of debugging.