Why cudaMalloc fails on NVIDIA Jetson Orin Nano Super — and the one flag that fixes it
If you've tried running a GGUF model via llama.cpp on a Jetson Orin Nano Super and hit this error:
NvMapMemAllocInternalTagged: error 12
cudaMalloc failed: out of memory
...while tegrastats shows the GPU idle and several GB of RAM free — this post is for you.
The hardware context
The Jetson Orin Nano Super (8 GB) uses unified memory. There is no separate VRAM pool — the CPU and GPU share one physical 8 GB block. This is how all Jetson Orin-series SoCs work, and it's what makes them cost-effective for edge AI.
Why stock llama.cpp fails
When llama.cpp allocates GPU tensor buffers, it calls cudaMalloc. On a discrete GPU, cudaMalloc carves memory out of a dedicated VRAM pool. The Orin Nano Super has no such pool, so the request goes through the NvMap interface, which rejects it with error 12 (ENOMEM) even though the unified pool has plenty of space.
The result: the model fails to load with a misleading out-of-memory error despite free RAM.
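To cross-check the "free RAM" half of that claim without tegrastats, you can read MemAvailable from /proc/meminfo. Since the CPU and GPU share one pool on this board, that figure is a rough indication of what is left for both. This is a minimal sketch, assuming a Linux kernel that exposes MemAvailable; the function name mem_available_mib is hypothetical:

```shell
# Print the kernel's MemAvailable estimate in MiB (Linux only).
# /proc/meminfo reports the value in kB, so divide by 1024.
mem_available_mib() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' /proc/meminfo
}

echo "Available memory: $(mem_available_mib) MiB"
```

If this reports a few thousand MiB while the model still fails to load, the error is probably not a genuine shortage of memory.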
The fix
Set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 as an environment variable when you launch the server. This is a runtime flag, not a build option — in llama.cpp's CUDA backend it switches GPU allocations from cudaMalloc to cudaMallocManaged, which works with the unified pool.
First, build llama.cpp with CUDA for the Orin's compute capability (sm_87):
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES=87
cmake --build build --config Release -j$(nproc)
Then enable unified memory at launch by setting the environment variable:
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-server \
    -m LFM2-VL-1.6B-Q4_0.gguf \
    --mmproj mmproj-LFM2-VL-1.6B-Q8_0.gguf \
    --n-gpu-layers 999
That's it. The model loads and runs fully GPU-accelerated.
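One thing to watch: the inline VAR=1 command form used in the launch command sets the variable for that single process only. If you later start the server from a script or a second command without the prefix, the variable is gone. The sketch below illustrates the difference from export, using sh -c as a stand-in for llama-server. The helper name show_flag is hypothetical.

```shell
# Stand-in for llama-server: prints the value a child process would see.
show_flag() {
    sh -c 'echo "${GGML_CUDA_ENABLE_UNIFIED_MEMORY:-unset}"'
}

# Start from a clean state.
unset GGML_CUDA_ENABLE_UNIFIED_MEMORY

# Inline assignment: visible to this one command only.
inline_result=$(GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 sh -c 'echo "${GGML_CUDA_ENABLE_UNIFIED_MEMORY:-unset}"')
after_inline=$(show_flag)

# export: visible to every command started later from this shell.
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
after_export=$(show_flag)

echo "inline: $inline_result, next command: $after_inline, after export: $after_export"
```

For a long-running setup, exporting the variable in the launch script (or setting it in whatever service definition starts the server) avoids silently falling back to plain cudaMalloc.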
What I was building
I ran into this while building SENTINEL — a self-hosted AI surveillance system for my B.Tech minor project. It runs YOLOv8n + TensorRT for person detection, DeepFace for face recognition, and LFM2-VL 1.6B for scene understanding — all on the Jetson, zero cloud dependency.
Full project and complete BUILD.md:
https://github.com/hemkesh2021-dotcom/Sentinel_Surveillance
Hope this saves someone a few hours of debugging.