Run NVIDIA NIM on Your Own GPU — Same API, Different Endpoint

#nvidia #ai #python #tutorial

For Parts 1 through 3 we've been calling NIM through NVIDIA's hosted API Catalog at build.nvidia.com. That's the right starting point. It is also not the only place NIM runs.

NIM ships as a Docker container that exposes the same OpenAI-compatible HTTP API on a local port. Pull the image, run it on a box with an NVIDIA GPU, and the only thing that changes in the Python client is the base_url. The ask() function from Part 1, the retriever from Part 2, and the guardrails from Part 3 all keep working against the new endpoint, unchanged.

This post walks through the swap and the reasons you might want it.

I'm B Torkian, NVIDIA Developer Champion at USC. Same series, same code, just moving where inference happens.

Why bother running NIM locally

The hosted API Catalog is the right default. Don't switch until at least one of these matters:

Data locality. The data you're sending the model has to stay on a machine you control. (Common at universities, hospitals, regulated industries.) USC has a research GPU cluster — for projects where the source documents can't leave that environment, the model has to come to the data, not the other way around.
Predictable latency. Network round-trip + queue time + first-token latency adds up. A locally hosted model gives you a tighter, more predictable budget.
A real understanding of what's in the box. The hosted API hides a lot of useful detail. Running the container yourself surfaces the model files, the inference server, the GPU memory layout, and what knobs you actually have.
Cost at scale. Past a certain volume, running the model on hardware you already own becomes cheaper than per-token billing.

None of those matter for a 30-minute workshop. All of them might matter for the project the workshop is teaching you to build.

What you need

An NVIDIA GPU with enough VRAM for the model you want to run. For meta/llama-3.1-8b-instruct (the model we've been using), expect roughly 16 GB of VRAM. Heavier models want more.
Linux (native or WSL2). NIM containers expect the NVIDIA Container Toolkit, which means the --runtime=nvidia Docker flag works.
Docker with the NVIDIA Container Toolkit installed. Test with docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi — it should print your GPU.
An NGC API key. The key you already have from build.nvidia.com works for pulling NIM images; if not, generate one at ngc.nvidia.com.

If you don't have a GPU box on hand, the rest of the workshop still teaches you something useful — the API shape is identical, so when you do get one, the Python client code does not change.

Step 1 — Log in to NVIDIA's container registry

export NGC_API_KEY="nvapi-...your-key..."
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

The literal username $oauthtoken is correct — that's NGC's convention for API-key logins. Don't substitute anything for it.

Step 2 — Pull and run the NIM container

docker run -it --rm \
  --name llama-3.1-8b-instruct \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$HOME/.cache/nim:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

A few notes:

First run is slow. The image is large and the model weights download on first launch. The -v cache mount means subsequent runs are fast.
Use the exact image tag from the model's Deploy tab on build.nvidia.com. The example above uses :latest, but pinning a specific version is safer for reproducibility.
The container listens on port 8000. That's what -p 8000:8000 exposes to your host.

When the container finishes loading it will log something like Application startup complete. Uvicorn running on http://0.0.0.0:8000. That's your signal that the OpenAI-compatible endpoint is live.

Step 3 — Verify the endpoint with curl

curl http://localhost:8000/v1/models

You should see a JSON response listing the loaded model. If curl hangs or returns connection-refused, the container hasn't finished loading yet — give it another minute and try again.

Step 4 — Point the Python client at localhost

This is the entire Python change.

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:8000/v1',          # ← was 'https://integrate.api.nvidia.com/v1'
    api_key='not-needed-for-local-dev',           # local NIM doesn't validate the key
)

MODEL = 'meta/llama-3.1-8b-instruct'              # same model name as the hosted endpoint

def ask(system_prompt, user_message):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user',   'content': user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content

print(ask(
    system_prompt='You are a concise USC campus assistant.',
    user_message='What does NVIDIA NIM stand for?',
))

Two lines changed — base_url and api_key. The ask() function is the same one we've been using since Part 1. The campus assistant, the embedding retriever, and the guardrail layers from Parts 2 and 3 all run against this client without any further changes.

The repo's part4_local_nim.py reads NIM_BASE_URL from your environment so the same script runs against the hosted endpoint by default and against local NIM when you set the env var. That makes it easy to A/B the two.

Step 5 — Same code, two endpoints (the test that matters)

# Hosted run (what we've done in Parts 1-3)
python3 part4_local_nim.py

# Local NIM run — point the same script at the container
NIM_BASE_URL=http://localhost:8000/v1 python3 part4_local_nim.py

Both should produce the same shape of output — the same ask() call, the same model name, just inference happening in a different place. That's the whole point of an OpenAI-compatible API surface — the application code stops caring where the model lives.

When to use which

Situation	Use
Workshop, prototype, demo, course project	Hosted (`integrate.api.nvidia.com`)
Sensitive data that can't leave a controlled environment	Local NIM on cluster GPU
Latency-critical inner loop, large concurrent load	Local NIM on a sized-up node
First-time student, no GPU on hand	Hosted (don't even mention local until they ask)
Production with a known traffic profile	Either, depending on cost crossover

There is no "winner" here. The hosted API and self-hosted NIM are the same product with different deployment footprints. The thing worth internalizing — and what this post is really about — is that your Python code does not have to care.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab for the hosted version: Open part4_local_nim.ipynb
Local Python: part4_local_nim.py in the repo. Defaults to the hosted endpoint; set NIM_BASE_URL=http://localhost:8000/v1 to point at a local NIM container.

MIT licensed. I run this at USC against both endpoints — fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.