From Chatbot to Agent — Tool Calling with NVIDIA NIM

#nvidia #ai #python #tutorial

In Parts 1 through 4 we built a useful tool: a USC campus assistant that knows when to retrieve, when to refuse, and which endpoint to call. It is still a chatbot. The model writes a string; we print it. Everything interesting happened inside one model call.

This post turns it into an agent. By agent I mean something specific and small — the model can choose a tool from a list, your Python code runs that tool, and the result goes back into the conversation. That's it. No LangGraph, no AutoGen, no LangChain. Two functions, one loop, and a NIM call with tools=....

You'll watch the model decide for itself whether to consult the clock, search the USC knowledge base, or just answer directly. Once you see the loop, the framework abstractions on top of it are easier to read because you already know what they hide.

I'm B Torkian, NVIDIA Developer Champion at USC. Part 5 of the series.

What you're adding

User question
  → NIM call (with tools schema)
  → model returns either a final answer OR a tool_calls list
  → if tool_calls: run each one, append the result, NIM call again
  → repeat until model returns an answer (or hit the loop limit)

The chat call shape from Part 1 carries forward. The retriever from Part 2 becomes a tool. From Part 3's two guardrail layers, the scoped-prompt-and-fallback layer moves into the agent's system prompt — the grounding-check layer is set aside for now, because tool results replace retrieved context here. And the agent only gets to use tools we expose.

What "agent" actually means here

Most marketing pages use agent to mean "anything with a memory or a loop." For this post the definition is narrower and worth pinning down up front:

You describe a small number of Python functions to the model via a JSON schema (the tools parameter).
The model returns either a normal message OR a tool_calls field with the name and arguments of the function it wants to run.
Your code runs that function and appends the result to the message list as a tool role.
You make another NIM call. The model sees the tool result and either calls another tool or writes the final answer.

That's the entire pattern. Real production agents add planning, retries, sub-agents, and observability. The center is still these four steps.
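To make the four steps concrete, here is a rough sketch of what the message list might look like after one tool round trip. The call ID, the returned time string, and the roles helper are invented for illustration; the field names follow the OpenAI-compatible chat format that the loop later in this post relies on.

```python
import json

# Hypothetical snapshot of a conversation after one tool round trip.
# The call ID and the returned time string are invented for illustration.
messages = [
    {"role": "system", "content": "You are a USC campus assistant."},
    {"role": "user", "content": "What time is it in Los Angeles?"},
    # Step 2: the model asks for a tool instead of answering.
    {
        "role": "assistant",
        "tool_calls": [
            {
                "id": "call_001",
                "type": "function",
                "function": {
                    "name": "get_current_time",
                    # Arguments arrive as a JSON string, not a dict.
                    "arguments": json.dumps({"timezone": "America/Los_Angeles"}),
                },
            }
        ],
    },
    # Step 3: your code ran the function and reports back under the same ID.
    {
        "role": "tool",
        "tool_call_id": "call_001",
        "content": "Monday, January 06, 2025 at 09:15 AM PST",
    },
]

def roles(history):
    """Return the sequence of roles, handy for eyeballing a conversation."""
    return [m["role"] for m in history]

print(roles(messages))
```

Step 4 is simply sending this whole list back: the model reads the tool message and either asks for another tool or writes the answer.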

Step 1 — Carry forward the setup, and switch the model

You need everything from Parts 1, 2, and 3 — the client, MODEL, ask, knowledge_base, embed_texts, and retrieve_context. A compact prerequisite cell is in the Colab notebook for this workshop. The standalone script part5_agent.py in the repo defines everything from scratch so you can run it without any prior cell.

One change worth flagging up front. Parts 1-4 used meta/llama-3.1-8b-instruct, which is fast, cheap, and fine for chat and RAG. For Part 5 we switch to NVIDIA's own nvidia/llama-3.3-nemotron-super-49b-v1.5, a model NVIDIA tuned specifically for reasoning and tool use. The reason is reliability: in my testing, the 8B model called the right tool inconsistently across reruns (on some runs it refused instead), while Nemotron made the same tool choice on every run. It's a bigger, reasoning-tuned model, so each call takes longer, which is a fair trade once a model has to reliably choose between tools instead of just answering. Both run on the same hosted endpoint; only the MODEL string changes.

MODEL = "nvidia/llama-3.3-nemotron-super-49b-v1.5"   # was 'meta/llama-3.1-8b-instruct' in Parts 1-4

One Nemotron-specific detail worth knowing, because it will bite you otherwise. Nemotron is a reasoning model: by default it thinks out loud before answering, which eats your token budget and can leave the actual answer empty on harder turns. The fix is one short directive. Put /no_think at the top of the system prompt and it switches to direct-answer mode, which is exactly what you want for fast, predictable tool calling. Every system prompt from here on starts with it. (Reasoning mode is great for genuinely hard problems; you just do not want it for a snappy campus assistant.)
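If you build several system prompts, a tiny helper keeps the directive from being forgotten or doubled. This is only a sketch: the helper name direct_answer_prompt is mine, not part of the workshop code, and it assumes the directive should sit on the first line followed by a blank line.

```python
NO_THINK = "/no_think"

def direct_answer_prompt(body: str) -> str:
    """Prefix a system prompt with /no_think unless it is already there."""
    stripped = body.lstrip()
    if stripped.startswith(NO_THINK):
        return stripped
    return f"{NO_THINK}\n\n{stripped}"

print(direct_answer_prompt("You are a USC campus assistant."))
```

Calling it twice on the same text is harmless, which is handy when prompts get assembled from several pieces.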

Step 2 — Define two tiny tools

import json
from datetime import datetime
from zoneinfo import ZoneInfo

def get_current_time(timezone: str = "America/Los_Angeles") -> str:
    try:
        zone = ZoneInfo(timezone)
    except Exception:
        zone = ZoneInfo("UTC")
    return datetime.now(zone).strftime("%A, %B %d, %Y at %I:%M %p %Z")

def search_campus_info(query: str) -> str:
    # Reuse the retriever from Part 2 — the agent gets semantic search for free.
    return retrieve_context(query, k=3)

Two functions. Plain Python. They don't know anything about the model — the model has no idea they exist yet. That's fixed in the next step.
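Because the tools are plain Python, you can exercise them before any model is involved. The sketch below swaps in a stand-in retriever so it runs on its own: fake_retrieve_context and the three placeholder entries are hypothetical, and the keyword-overlap scoring is a deliberate simplification of the embedding search from Part 2.

```python
# Placeholder knowledge base; the entries are dummies, not real USC facts.
FAKE_KB = [
    "AI Club: placeholder meeting schedule entry",
    "GPU lab: placeholder access policy entry",
    "Peer tutoring: placeholder sign-up entry",
]

def fake_retrieve_context(query: str, k: int = 3) -> str:
    """Rank entries by shared lowercase words (stand-in for semantic search)."""
    words = set(query.lower().split())

    def overlap(doc: str) -> int:
        return len(words & set(doc.lower().replace(":", "").split()))

    ranked = sorted(FAKE_KB, key=overlap, reverse=True)
    return "\n".join(ranked[:k])

# Bind the stand-in under the name the tool expects.
retrieve_context = fake_retrieve_context

def search_campus_info(query: str) -> str:
    # Same body as the tool above.
    return retrieve_context(query, k=3)

print(search_campus_info("AI Club meeting"))
```

If a tool misbehaves here, it will misbehave inside the agent too, and this is the cheaper place to find out.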

Step 3 — Describe the tools to the model in JSON schema

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_time",
            "description": "Get the current time in an IANA time zone.",
            "parameters": {
                "type": "object",
                "properties": {
                    "timezone": {
                        "type": "string",
                        "description": "IANA time zone, e.g. America/Los_Angeles or UTC.",
                    },
                },
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_campus_info",
            "description": "Search the USC campus assistant knowledge base for information about USC clubs (including AI Club), labs (GPU lab, robotics lab), workshops, faculty office hours, peer tutoring, and the NVIDIA Developer Program at USC. Always call this for any USC-related question — do not answer from your own knowledge.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The USC campus question or search phrase.",
                    },
                },
                "required": ["query"],
            },
        },
    },
]

available_tools = {
    "get_current_time": get_current_time,
    "search_campus_info": search_campus_info,
}

The schema is what the model sees. The names, descriptions, and parameter docs are how it decides which to call. Take these descriptions seriously — vague tool descriptions produce a confused agent.

The available_tools dict is the dispatch table on the Python side. Always pair the two — the schema describes intent, the dict provides execution.
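One way to keep the pair honest is a startup check that compares the schema against the function signatures. This is a sketch under my own assumptions: check_tools is a hypothetical helper, the stub functions stand in for the real tools, and it only covers simple keyword parameters.

```python
import inspect

def check_tools(tools, available_tools):
    """Return a list of mismatches between the schema and the dispatch table."""
    problems = []
    described = set()
    for entry in tools:
        spec = entry["function"]
        name = spec["name"]
        described.add(name)
        if name not in available_tools:
            problems.append(f"{name}: in schema but missing from dispatch table")
            continue
        params = inspect.signature(available_tools[name]).parameters
        schema_params = spec.get("parameters", {})
        declared = schema_params.get("properties", {})
        required = set(schema_params.get("required", []))
        # Every argument the model may send must be accepted by the function.
        for arg in declared:
            if arg not in params:
                problems.append(f"{name}: schema argument {arg!r} not accepted by function")
        # Every function parameter without a default must be marked required.
        for pname, param in params.items():
            if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
                continue
            if param.default is param.empty and pname not in required:
                problems.append(f"{name}: parameter {pname!r} has no default but is not required")
    for name in available_tools:
        if name not in described:
            problems.append(f"{name}: in dispatch table but never described to the model")
    return problems

# Stubs with the same signatures as the real tools, so this runs on its own.
def get_current_time(timezone: str = "America/Los_Angeles") -> str:
    return "stub"

def search_campus_info(query: str) -> str:
    return "stub"

demo_tools = [
    {"type": "function", "function": {
        "name": "get_current_time",
        "parameters": {"type": "object", "properties": {"timezone": {"type": "string"}}},
    }},
    {"type": "function", "function": {
        "name": "search_campus_info",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    }},
]
demo_dispatch = {"get_current_time": get_current_time,
                 "search_campus_info": search_campus_info}

print(check_tools(demo_tools, demo_dispatch))
```

An empty list means the two sides agree. Running a check like this once at import time catches a renamed argument before the model ever trips over it.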

Step 4 — The agent loop

def ask_agent(question: str) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "/no_think\n\n"
                "You are a USC campus assistant with two tools: "
                "get_current_time and search_campus_info. "
                "When the user asks something a tool can answer, call the tool, "
                "then write the final answer based on the tool's result. "
                "Do not call the same tool twice for the same question. "
                "If after using the tools you still cannot find the answer, "
                "reply exactly: I don't have that information — check with the USC AI Club."
            ),
        },
        {"role": "user", "content": question},
    ]

    for _ in range(3):                                # hard cap on model round trips
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=tools,
            tool_choice="auto",
            temperature=0.2,
            max_tokens=400,
        )
        message = response.choices[0].message
        messages.append(message.model_dump(exclude_none=True))

        if not message.tool_calls:                    # model finished — return its text
            return message.content or "I could not generate an answer. Please try again."

        for tool_call in message.tool_calls:
            name = tool_call.function.name
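            try:
                arguments = json.loads(tool_call.function.arguments or "{}")
            except json.JSONDecodeError:
                arguments = None                      # model sent malformed JSON

            if name not in available_tools:
                result = f"Tool {name} is not available."
            elif not isinstance(arguments, dict):
                result = f"Arguments for {name} were not a valid JSON object."
            else:
                try:
                    result = available_tools[name](**arguments)
                except Exception as exc:              # report the failure instead of crashing the loop
                    result = f"Tool {name} failed: {exc}"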

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "name": name,
                "content": str(result),
            })

    return "I hit the tool loop limit."

Four things worth slowing down for:

tools=... and tool_choice="auto": this is how the model knows it has tools available and that it can pick. "auto" means use a tool if useful, otherwise answer directly.
messages.append(message.model_dump(...)): the model's tool-call request itself becomes part of the conversation. Skip this and the next NIM call has no idea why you're showing it a tool result.
The tool role: when you send the function's return value back, it has to be a message with role="tool" plus the matching tool_call_id. Get that ID wrong and the result can no longer be matched to the call that asked for it; depending on the endpoint, the request may be rejected outright.
The loop cap (3 iterations): agents that don't have a hard stop will sometimes spiral. Each iteration is one model call, and one model call can request several tools. Keep the cap visible and small for workshops; widen it as you understand the model's behavior.
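The ID pairing is easy to verify mechanically. Here is a small sketch that walks a message list and reports tool calls with no result and results with no call; audit_tool_ids and the sample history are hypothetical, and it assumes messages are plain dicts like the ones ask_agent builds.

```python
def audit_tool_ids(messages):
    """Compare requested tool-call IDs against the IDs on tool results."""
    requested, answered = [], []
    for message in messages:
        if message.get("role") == "assistant":
            requested += [call["id"] for call in message.get("tool_calls") or []]
        elif message.get("role") == "tool":
            answered.append(message.get("tool_call_id"))
    return {
        "unanswered": [i for i in requested if i not in answered],
        "orphaned": [i for i in answered if i not in requested],
    }

# Invented history with one deliberate mistake: the result carries the wrong ID.
history = [
    {"role": "user", "content": "What time is it?"},
    {"role": "assistant", "tool_calls": [
        {"id": "call_001", "type": "function",
         "function": {"name": "get_current_time", "arguments": "{}"}},
    ]},
    {"role": "tool", "tool_call_id": "call_002", "content": "placeholder time"},
]

print(audit_tool_ids(history))
```

Both lists should be empty right before every follow-up model call. A non-empty one usually means a copy-paste slip in the append step.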

Step 5 — Run it

for question in [
    "What time is it in Los Angeles?",            # → uses get_current_time
    "When does the USC AI Club meet?",            # → uses search_campus_info
    "Can I get the wifi password?",               # → searches, finds nothing, refuses
]:
    print(f"Q: {question}")
    print(f"A: {ask_agent(question)}\n")

What you should see:

The clock question makes the model call get_current_time and answer from the returned string.
The AI Club question makes it call search_campus_info, read the retrieved chunks, and answer from them.
The wifi question makes it call search_campus_info, see that none of the chunks mention passwords, and fall back to the refusal line — the scoped-prompt guardrail from Part 3, delivered through a different control flow.

On some runs the model will call both tools (for a question like "What time is it and when does the club meet?"). The loop handles that without changes: each iteration runs every requested tool call, appends the results, and asks the model again.
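You can watch that control flow without spending a single API call by replacing the model with a scripted function. Everything below is a stand-in: run_agent_loop mirrors the shape of ask_agent but takes the model as a plain callable, and scripted_model, stubborn_model, and the placeholder tool outputs are invented for the demo.

```python
import json

def run_agent_loop(call_model, available_tools, question, max_rounds=3):
    """Same control flow as ask_agent, with the model passed in as a callable."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_rounds):
        message = call_model(messages)
        messages.append(message)
        tool_calls = message.get("tool_calls")
        if not tool_calls:                      # model finished
            return message.get("content") or "", messages
        for call in tool_calls:                 # run every requested tool
            name = call["function"]["name"]
            arguments = json.loads(call["function"]["arguments"] or "{}")
            result = available_tools[name](**arguments)
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": str(result)})
    return "I hit the tool loop limit.", messages

# Stand-in tools with fixed outputs so the demo is deterministic.
fake_tools = {
    "get_current_time": lambda timezone="America/Los_Angeles": "TIME_PLACEHOLDER",
    "search_campus_info": lambda query: "KB_PLACEHOLDER",
}

def _call(call_id, name, arguments):
    return {"id": call_id, "type": "function",
            "function": {"name": name, "arguments": json.dumps(arguments)}}

def scripted_model(messages):
    """Pretend model: requests both tools in one turn, then answers."""
    results = [m["content"] for m in messages if m["role"] == "tool"]
    if not results:
        return {"role": "assistant", "tool_calls": [
            _call("call_a", "get_current_time", {}),
            _call("call_b", "search_campus_info", {"query": "AI Club meeting"}),
        ]}
    return {"role": "assistant", "content": " | ".join(results)}

def stubborn_model(messages):
    """Pretend model that never stops asking for a tool."""
    return {"role": "assistant",
            "tool_calls": [_call(f"call_{len(messages)}", "get_current_time", {})]}

answer, history = run_agent_loop(
    scripted_model, fake_tools, "What time is it and when does the club meet?")
print(answer)
print([m["role"] for m in history])
print(run_agent_loop(stubborn_model, fake_tools, "loop forever")[0])
```

The first run takes two rounds even though two tools fired, because both requests arrived in a single assistant message. The stubborn model shows the cap doing its job.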

Step 6 — What you actually built

The full assistant is now agent-shaped:

Part 1 gave it a brain (the chat call).
Part 2 gave it memory of facts (retrieval).
Part 3 gave it judgment (guardrails).
Part 4 gave it portability (hosted or local).
Part 5 gave it hands (tool calling).

You still own the behavior — the model only gets to call functions you expose, with arguments it has to declare, inside a loop you control. Real systems extend each piece, but the spine is what you just built. The most common follow-ups are:

More tools (calendar, ticketing, web search, code execution sandboxes).
Structured outputs so the final answer is JSON, not prose.
A planner that decomposes a question into sub-questions before any tool fires.
Observability — log every tool call, every argument, every return value. Production agents live or die on this.
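Of these, observability is the cheapest to start on. A minimal sketch, assuming tools are always called with keyword arguments the way the loop calls them: logged_tool is a hypothetical decorator name, the logger name agent.tools is arbitrary, and echo is a throwaway stand-in for a real tool.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("agent.tools")

def logged_tool(func):
    """Wrap a tool so every call logs its name, arguments, duration, and result."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        args_text = json.dumps(kwargs, default=str)
        start = time.perf_counter()
        try:
            result = func(**kwargs)
        except Exception:
            logger.exception("tool=%s args=%s failed", func.__name__, args_text)
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Truncate long results so retrieved chunks do not flood the log.
        logger.info("tool=%s args=%s ms=%.1f result=%r",
                    func.__name__, args_text, elapsed_ms, str(result)[:200])
        return result
    return wrapper

@logged_tool
def echo(text: str) -> str:
    # Throwaway tool used only to demonstrate the wrapper.
    return text.upper()

logging.basicConfig(level=logging.INFO)
print(echo(text="hi"))
```

To apply it to the real dispatch table, something like available_tools = {name: logged_tool(fn) for name, fn in available_tools.items()} should be enough, and the agent loop itself does not need to change.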

If you take one thing from the whole series, take this: an LLM is a normal Python function with a weird interior. Everything you've built — retrieval, guardrails, deployment, tool calling — is normal software wrapped around that function. Frameworks save typing; they don't change the model.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open part5_agent.ipynb
Local Python: part5_agent.py in the repo (python3 part5_agent.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base and the tools for your school, your club, your project, and run it wherever you are.