NVIDIA quietly built one of the most impressive AI APIs out there — and most developers don't know it exists.
NVIDIA NIM (NVIDIA Inference Microservices) gives you OpenAI-compatible access to 136 models through a single endpoint. We're talking Llama 405B, Kimi K2, Mistral Large 3 675B, Qwen3-Coder 480B. All behind the same interface you already know.
Here's what I found after testing them all.
## Setup (60 seconds)

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-YOUR_KEY_HERE",
)
```
That's it. Get your key at build.nvidia.com. Free tier included.
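To confirm the key works, fire off one short completion. A minimal sketch; the model id is one from the catalog below, and any other chat model will do:

```python
# Smoke test: one short completion proves the key and base URL are right.
response = client.chat.completions.create(
    model="meta/llama-3.1-405b-instruct",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=20,
)
print(response.choices[0].message.content)
```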
## The 136 Models — What's Actually in There
```python
import requests

api_key = "nvapi-YOUR_KEY_HERE"  # same key as above
headers = {"Authorization": f"Bearer {api_key}"}
response = requests.get("https://integrate.api.nvidia.com/v1/models", headers=headers)
models = response.json()["data"]
print(f"Total: {len(models)} models")
```
The catalog spans 20+ organizations:
| Org | Notable Models |
|---|---|
| Meta | Llama 3.1 405B, Llama 4 Maverick 17B |
| Mistral | Mistral Large 3 675B, Magistral Small, Codestral |
| Moonshot | Kimi K2, Kimi K2 Thinking |
| Qwen | Qwen3-Coder 480B, Qwen3.5 397B |
| DeepSeek | DeepSeek v3.2, v4 Pro, v4 Flash |
| NVIDIA | Nemotron Ultra 253B, Nemotron Super 49B |
| ByteDance | Seed-OSS 36B |
| OpenAI | GPT-OSS 120B (yes, really) |
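Every id in the catalog is namespaced as org/model-name, so you can bucket the list by provider in a couple of lines. A small sketch building on the models list fetched above:

```python
from collections import Counter

# The prefix before the first "/" identifies the provider.
orgs = Counter(m["id"].split("/")[0] for m in models)
for org, count in orgs.most_common():
    print(f"{org}: {count} models")
```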
## What Actually Works (I Tested Them All)

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-YOUR_KEY",
)

working_models = [
    "meta/llama-3.1-405b-instruct",
    "moonshotai/kimi-k2-instruct",
    "qwen/qwen3-coder-480b-a35b-instruct",
    "qwen/qwen3.5-397b-a17b",
    "mistralai/mistral-large-3-675b-instruct-2512",
    "mistralai/magistral-small-2506",
    "nvidia/llama-3.3-nemotron-super-49b-v1",
    "bytedance/seed-oss-36b-instruct",
]

for model in working_models:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Explain transformers in one sentence"}],
        max_tokens=100,
    )
    print(f"\n{model}:")
    print(response.choices[0].message.content)
```
Results from my run:
- meta/llama-3.1-405b-instruct: ✅ Fast, coherent
- moonshotai/kimi-k2-instruct: ✅ Excellent reasoning
- qwen/qwen3-coder-480b-a35b-instruct: ✅ Best for code tasks
- mistralai/mistral-large-3-675b-instruct-2512: ✅ Strong instruction following
- nvidia/llama-3.3-nemotron-super-49b-v1: ✅ NVIDIA-tuned, solid
- deepseek-ai/deepseek-v4-pro: ❌ Timeout (high demand)
- moonshotai/kimi-k2-thinking: ❌ Timeout (high demand)
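To reproduce this kind of availability check, wrap each call in a try/except with a short per-request timeout so one congested model doesn't stall the whole run. A sketch building on the loop above; the 15-second timeout is an arbitrary choice:

```python
def probe(model: str, timeout: float = 15.0) -> str:
    """Return a one-line availability status for a single model."""
    try:
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=5,
            timeout=timeout,  # per-request timeout supported by the OpenAI SDK
        )
        return f"{model}: ✅"
    except Exception as e:
        return f"{model}: ❌ ({type(e).__name__})"

for model in working_models + ["deepseek-ai/deepseek-v4-pro"]:
    print(probe(model))
```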
## Streaming Support
All working models support streaming — critical for production UX:
```python
stream = client.chat.completions.create(
    model="moonshotai/kimi-k2-instruct",
    messages=[{"role": "user", "content": "Write a Python async web scraper"}],
    max_tokens=500,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
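If your app is async (FastAPI, aiohttp), the same pattern works with the SDK's AsyncOpenAI client. A sketch assuming the same base URL and key:

```python
import asyncio

from openai import AsyncOpenAI

aclient = AsyncOpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-YOUR_KEY",
)

async def main() -> None:
    # Same call shape as the sync client; the result is an async iterator.
    stream = await aclient.chat.completions.create(
        model="moonshotai/kimi-k2-instruct",
        messages=[{"role": "user", "content": "Write a Python async web scraper"}],
        max_tokens=500,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(main())
```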
## Multi-Model Router Pattern
The real power: build a router that falls back across models based on availability and task type.
```python
from typing import Optional

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-YOUR_KEY",
)

ROUTING_TABLE = {
    "code": [
        "qwen/qwen3-coder-480b-a35b-instruct",
        "meta/llama-3.1-405b-instruct",
        "mistralai/mistral-large-3-675b-instruct-2512",
    ],
    "reasoning": [
        "moonshotai/kimi-k2-instruct",
        "meta/llama-3.1-405b-instruct",
        "nvidia/llama-3.3-nemotron-super-49b-v1",
    ],
    "general": [
        "mistralai/mistral-large-3-675b-instruct-2512",
        "meta/llama-3.1-405b-instruct",
        "bytedance/seed-oss-36b-instruct",
    ],
}

def smart_complete(prompt: str, task_type: str = "general", max_tokens: int = 500) -> Optional[str]:
    """Try each model for the task type in order; return the first success."""
    models = ROUTING_TABLE.get(task_type, ROUTING_TABLE["general"])
    for model in models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=max_tokens,
                timeout=20,
            )
            return response.choices[0].message.content
        except Exception:
            continue  # fall back to the next model
    return None

# Usage
result = smart_complete(
    "Implement a binary search tree in Python",
    task_type="code",
)
print(result)
```
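In production you usually want to know which model actually served each request. Here's a small variant of smart_complete that returns the model id alongside the text (a sketch; the function name is mine):

```python
from typing import Optional, Tuple

def smart_complete_traced(
    prompt: str, task_type: str = "general", max_tokens: int = 500
) -> Optional[Tuple[str, str]]:
    """Like smart_complete, but reports which fallback model answered."""
    for model in ROUTING_TABLE.get(task_type, ROUTING_TABLE["general"]):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=max_tokens,
                timeout=20,
            )
            return model, response.choices[0].message.content
        except Exception:
            continue  # model unavailable; try the next one
    return None

traced = smart_complete_traced("Summarize the CAP theorem", task_type="reasoning")
if traced:
    served_by, text = traced
    print(f"[served by {served_by}]\n{text}")
```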
## What Makes This Interesting
1. One API key, 20+ providers. No juggling Anthropic, OpenAI, Mistral keys separately.
2. OpenAI SDK compatible. Zero migration cost from existing code.
3. Specialty models included. BGE-M3 for embeddings, NemoRetriever for parsing, CLIP for vision, not just chat models (see the embeddings sketch after this list).
4. Free tier is generous. Enough for development and light production usage.
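For example, embeddings go through the same client via the standard embeddings endpoint. A minimal sketch: the model id baai/bge-m3 and the input_type hint are assumptions based on NVIDIA's catalog conventions, so verify the exact id on build.nvidia.com:

```python
# Hypothetical embedding call; confirm the model id in the catalog first.
emb = client.embeddings.create(
    model="baai/bge-m3",
    input=["What is NVIDIA NIM?"],
    extra_body={"input_type": "query"},  # some NIM embedders expect this hint
)
print(len(emb.data[0].embedding))  # vector dimensionality
```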
## Limitations
- Some flagship models (DeepSeek v4 Pro, Kimi K2 Thinking) time out under high demand
- Service keys have different scopes than personal keys — test both
- No fine-tuning support (inference only)
## Bottom Line
If you're building LLM-powered apps and not using NVIDIA NIM, you're either paying more than you need to or missing access to models that aren't available anywhere else. The multi-model fallback pattern alone is worth the 60-second setup.
Get your key: build.nvidia.com