DEV Community: Torkian

See What Your Agent Did — Tracing and Observability with NVIDIA NIM

Torkian — Wed, 15 Jul 2026 16:48:09 +0000

Somewhere around your twentieth conversation with the agent we've built, it will do something strange. Call a tool twice for no reason. Refuse a question it answered yesterday. Take nine seconds on something that usually takes two. And you'll ask the only question that matters in production: why did it do that?

Rerunning won't tell you. Models aren't deterministic; the moment is gone. The debugging print statements from Part 6 scroll away with the terminal. What you need is a trace: a durable record of what actually happened — every tool call with its arguments and result, every model call with its latency, whether the JSON answer needed repair, and what the agent finally said.

This post adds exactly that, in plain Python — nothing outside the standard library. Every turn appends one JSON line to a file. That one-line-per-turn shape is deliberate: the next post (evals) will load a line and get everything an assertion needs — the input, the tool path, and Part 9's validated six-key answer — with no regrouping. Observability isn't a product you install; it's a file you write.

I'm B Torkian, NVIDIA Developer Champion at USC. Part 10 of the series.

What you're adding

Workshop 9:  turn happens → answer returned → details lost forever
Workshop 10: turn happens → answer returned → one JSON line survives:
             {user_message, steps: [model calls, tool calls, validation], final, latency}

The Workshop 9 loop is unchanged. A small JsonlTracer hooks its five natural seams: turn start, each model call, each tool call, the validation outcome, and the final answer.

Step 1 — Decide what one turn's record holds

{
  "schema_version": "ws10.turn.v1",
  "trace_id": "a3f9c2e81b04",
  "turn_id": 2,
  "timestamp": "2026-07-06T18:22:31+00:00",
  "model": "nvidia/llama-3.3-nemotron-super-49b-v1.5",
  "mode": "chat",
  "user_message": "How many days until that?",
  "steps": [
    {"type": "model_call", "step": 1, "latency_ms": 842,
     "usage": {"prompt_tokens": 1204, "completion_tokens": 31, "total_tokens": 1235},
     "tool_calls": [{"id": "call_abc", "name": "days_until_weekday", "arguments": "{\"weekday\": \"Thursday\"}"}]},
    {"type": "tool_call", "step": 1, "tool_call_id": "call_abc", "name": "days_until_weekday",
     "arguments": {"weekday": "Thursday"}, "result": "The next Thursday is in 3 day(s)...", "latency_ms": 0},
    {"type": "model_call", "step": 2, "latency_ms": 1130, "usage": {"...": "..."}, "tool_calls": []},
    {"type": "validation", "parse_ok": true, "errors": [], "repair_attempted": false, "repair_succeeded": false}
  ],
  "final": {"status": "answered", "answer": "The USC AI Club meets in 3 days...", "category": "campus_event",
            "items": [{"name": "USC AI Club", "day": "Thursday", "days_until": 3}], "missing": [], "sources": ["..."]},
  "total_latency_ms": 1990
}

Everything a "why" question needs is in one line: the model asked for days_until_weekday (so memory resolved "that" correctly), the tool ran in under a millisecond, the second model call produced clean JSON on the first try, and the whole turn took two seconds — most of it the second model call.

A schema_version field costs nothing today and saves you when the shape evolves — future tooling can tell old lines from new.
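As a rough sketch of what "future tooling can tell old lines from new" might look like, here is a loader that keeps only records whose schema_version it recognizes. The names load_turns and KNOWN_VERSIONS are made up for this illustration and are not in the repo.

```python
import json
from pathlib import Path

KNOWN_VERSIONS = {"ws10.turn.v1"}   # the only version this post defines


def load_turns(path):
    """Return (turns, skipped): records with a known schema_version, plus a
    count of lines that were unparseable or written by another version."""
    turns, skipped = [], 0
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue                          # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1                      # e.g. a half-written last line
            continue
        if isinstance(record, dict) and record.get("schema_version") in KNOWN_VERSIONS:
            turns.append(record)
        else:
            skipped += 1
    return turns, skipped
```

When the record shape changes, the new version string goes into KNOWN_VERSIONS (or gets its own branch), and old lines stay readable instead of crashing the analysis.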

Step 2 — The tracer (plain file I/O)

import json
import time
import uuid
from datetime import datetime, timezone
from pathlib import Path

# TRACE_PATH and MODEL are module-level constants defined elsewhere in the workshop file.

class JsonlTracer:
    SCHEMA_VERSION = "ws10.turn.v1"

    def __init__(self, path: Path = TRACE_PATH):
        self.path = Path(path)
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.trace_id = uuid.uuid4().hex[:12]     # one id per session
        self.turn_id = 0
        self._turn = None

    def begin_turn(self, user_message: str, mode: str) -> None:
        if self._turn is not None:            # last turn crashed before end_turn, so flush it
            self.abort_turn("turn abandoned without end_turn")
        self.turn_id += 1
        # timezone.utc needs no tz database, so this stays stdlib-only on every platform
        self._turn = {"schema_version": self.SCHEMA_VERSION, "trace_id": self.trace_id,
                      "turn_id": self.turn_id, "timestamp": datetime.now(timezone.utc).isoformat(),
                      "model": MODEL, "mode": mode, "user_message": user_message,
                      "steps": [], "final": None, "total_latency_ms": None}
        self._start = time.perf_counter()

    # record_model_call / record_tool_call / record_validation append step
    # events; see the repo for the three small methods.

    def end_turn(self, final: dict) -> None:
        self._turn["final"] = final
        self._turn["total_latency_ms"] = int((time.perf_counter() - self._start) * 1000)
        with self.path.open("a", encoding="utf-8") as f:   # one JSON line per turn
            f.write(json.dumps(self._turn) + "\n")
        self._turn = None
No logging framework, no decorators, no globals. The session owns a tracer; the tracer owns a file.
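The tracer snippet leaves the three record_* methods to the repo. As a guess at their shape, inferred only from the "steps" array in Step 1, they could be as small as the sketch below. StepRecorder is a hypothetical stand-in class, and the repo's real methods may differ in detail.

```python
class StepRecorder:
    """Hypothetical stand-in for the three record_* methods. It collects
    step events in a list shaped like the "steps" array from Step 1."""

    def __init__(self):
        self.steps = []

    def record_model_call(self, step, latency_ms, usage, tool_calls):
        self.steps.append({
            "type": "model_call", "step": step, "latency_ms": latency_ms,
            "usage": usage,                   # None when the API reported nothing
            "tool_calls": [{"id": tc["id"], "name": tc["name"], "arguments": tc["arguments"]}
                           for tc in (tool_calls or [])],
        })

    def record_tool_call(self, step, tool_call_id, name, arguments, result, latency_ms):
        self.steps.append({
            "type": "tool_call", "step": step, "tool_call_id": tool_call_id,
            "name": name, "arguments": arguments, "result": result,
            "latency_ms": latency_ms,
        })

    def record_validation(self, parse_ok, errors, repair_attempted, repair_succeeded):
        self.steps.append({
            "type": "validation", "parse_ok": parse_ok, "errors": list(errors),
            "repair_attempted": repair_attempted, "repair_succeeded": repair_succeeded,
        })


rec = StepRecorder()
rec.record_model_call(1, 842, None, [{"id": "call_abc", "name": "days_until_weekday",
                                      "arguments": '{"weekday": "Thursday"}'}])
rec.record_tool_call(1, "call_abc", "days_until_weekday", {"weekday": "Thursday"},
                     "The next Thursday is in 3 day(s)...", 0)
rec.record_validation(True, [], False, False)
print([e["type"] for e in rec.steps])   # ['model_call', 'tool_call', 'validation']
```

Each method only appends a plain dict, which is why the whole turn can be serialized with a single json.dumps at the end.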

("Why not OpenTelemetry?" Same answer as Part 9's "why not Pydantic": OpenTelemetry — and LLM-observability platforms built on it — is what production teams use for exactly this, with spans, exporters, and dashboards. We hand-roll the tracer so you can see what those tools record and why. Once this JSONL file makes sense to you, an OTel span is just this record with a standard schema and somewhere nicer to live.)

Step 3 — Hook the loop at its seams

The session gains one line in __init__ — self.tracer = JsonlTracer(trace_path) — and the loop gains a timing wrapper at each seam:

def _run_turn(self, user_message: str, stream: bool) -> dict:
    self.messages.append({"role": "user", "content": user_message})
    self.tracer.begin_turn(user_message, mode="stream" if stream else "chat")

    for step in range(1, MAX_STEPS + 1):
        t0 = time.perf_counter()
        text, tool_calls, assistant_msg, usage = self._complete(stream)
        self.tracer.record_model_call(step, int((time.perf_counter() - t0) * 1000), usage, tool_calls)

        if tool_calls:
            self.messages.append(assistant_msg)
            for tc in tool_calls:
                # ...parse arguments as in Part 9...
                t1 = time.perf_counter()
                result = run_tool(tc["name"], arguments)
                self.tracer.record_tool_call(step, tc["id"], tc["name"], arguments,
                                             str(result), int((time.perf_counter() - t1) * 1000))
                # ...append the role="tool" result as in Part 9...
            continue

        data, vmeta = self._finalize_json(text)       # now also returns what happened
        self.tracer.record_validation(vmeta["parse_ok"], vmeta["errors"],
                                      vmeta["repair_attempted"], vmeta["repair_succeeded"])
        self.messages.append({"role": "assistant", "content": json.dumps(data)})
        self._trim()
        self.tracer.end_turn(data)
        return data

(One detail the abridged snippet hides: in the real file, _run_turn wraps this loop in a try/except that calls tracer.abort_turn(...) before re-raising — so a crash mid-turn still writes the partial trace. A crash is exactly the moment you'll want the record.)
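Here is a minimal sketch of that crash path, using a cut-down tracer so the example runs on its own. MiniTracer, run_turn, and the "error" field are assumptions made for this illustration; the repo's abort_turn may record the failure differently.

```python
import json
import tempfile
import time
from pathlib import Path


class MiniTracer:
    """Cut-down tracer: just enough state to show the abort path."""

    def __init__(self, path):
        self.path = Path(path)
        self._turn = None

    def begin_turn(self, user_message):
        self._turn = {"user_message": user_message, "steps": [], "final": None, "error": None}
        self._start = time.perf_counter()

    def _flush(self):
        self._turn["total_latency_ms"] = int((time.perf_counter() - self._start) * 1000)
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(self._turn) + "\n")
        self._turn = None

    def end_turn(self, final):
        self._turn["final"] = final
        self._flush()

    def abort_turn(self, reason):
        if self._turn is None:            # nothing in flight, nothing to write
            return
        self._turn["error"] = reason      # "error" is an assumed field name
        self._flush()


def run_turn(tracer, user_message, work):
    """Run work() inside a traced turn; a crash still leaves a line on disk."""
    tracer.begin_turn(user_message)
    try:
        final = work()
    except Exception as exc:
        tracer.abort_turn(f"{type(exc).__name__}: {exc}")
        raise                             # the caller still sees the failure
    tracer.end_turn(final)
    return final


def exploding_work():
    raise RuntimeError("tool backend unreachable")


trace_file = Path(tempfile.mkdtemp()) / "demo.jsonl"
tracer = MiniTracer(trace_file)
try:
    run_turn(tracer, "When does the club meet?", exploding_work)
except RuntimeError:
    pass

record = json.loads(trace_file.read_text(encoding="utf-8").splitlines()[0])
print(record["error"])    # RuntimeError: tool backend unreachable
```

Re-raising after the flush matters: the trace is a record of the failure, not a way to swallow it.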

Two small refactors make this clean, and they're worth naming honestly:

_complete now also returns usage: best-effort token accounting. Non-streaming NIM responses usually include response.usage (prompt, completion, total tokens); streamed responses usually don't. (Some OpenAI-compatible endpoints accept stream_options={"include_usage": True} to report usage on the final chunk. Support varies, which is exactly why we log null rather than guess.) Never log zeros you didn't measure: a zero looks like a measurement, a null tells the truth.
_finalize_json now returns (data, validation_meta) — so the tracer can record whether parsing failed and whether the repair call ran, without the parsing code knowing traces exist.
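The "null, not zero" rule for usage can be sketched as a small helper. extract_usage is a name invented for this example, and the SimpleNamespace objects below only imitate the shape of a chat-completion response (a usage attribute with three token counts); they are not real client objects.

```python
from types import SimpleNamespace


def extract_usage(response):
    """Return token counts as a dict, or None when the response carries none."""
    usage = getattr(response, "usage", None)
    if usage is None:
        return None                       # unknown stays unknown: never a fake 0
    return {
        "prompt_tokens": getattr(usage, "prompt_tokens", None),
        "completion_tokens": getattr(usage, "completion_tokens", None),
        "total_tokens": getattr(usage, "total_tokens", None),
    }


# Stand-ins for a non-streaming response and a streamed turn without usage.
with_usage = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=1204, completion_tokens=31, total_tokens=1235))
without_usage = SimpleNamespace(usage=None)

print(extract_usage(with_usage))
print(extract_usage(without_usage))       # None
```

json.dumps turns that None into null in the trace line, so an analysis script can tell "not reported" apart from "zero tokens".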

Step 4 — Run it, then ask the file your questions

session = ChatSession(verbose=True)
for q in ["When does the USC AI Club meet?",
          "How many days until that?",
          "Which is sooner, that meeting or the AI/ML office hours?",
          "What is the campus wifi password?"]:
    session.chat(q)

Four turns, four lines in traces/campus_assistant.jsonl. Now the payoff — a dozen-line analysis instead of guesswork (the repo version adds an empty-file guard):

def analyze_traces(path=TRACE_PATH):
    turns = [json.loads(line) for line in Path(path).read_text().splitlines()]
    slowest = max(turns, key=lambda t: t["total_latency_ms"])
    tool_counts, repairs = {}, 0
    for t in turns:
        for e in t["steps"]:
            if e["type"] == "tool_call":
                tool_counts[e["name"]] = tool_counts.get(e["name"], 0) + 1
            if e["type"] == "validation" and e["repair_attempted"]:
                repairs += 1
    print(f"turns:       {len(turns)}")
    print(f"slowest:     turn {slowest['turn_id']} ({slowest['total_latency_ms']} ms) — {slowest['user_message']!r}")
    print(f"tool calls:  {tool_counts}")
    print(f"repair rate: {repairs}/{len(turns)}")

Which turn was slowest, and was it the model or a tool? Did the comparison question really call days_until_weekday twice? How often does the JSON need repair? The trace answers all of it without rerunning the agent — that's the entire point.
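The "model or a tool?" question can be answered per turn by summing latency_ms by step type. A small sketch, assuming the record shape from Step 1 and a turn that completed (so total_latency_ms is a number); latency_breakdown is a hypothetical helper, not part of the repo.

```python
def latency_breakdown(turn):
    """Split one turn's latency into model time, tool time, and the remainder."""
    model_ms = sum(e["latency_ms"] for e in turn["steps"] if e["type"] == "model_call")
    tool_ms = sum(e["latency_ms"] for e in turn["steps"] if e["type"] == "tool_call")
    other_ms = turn["total_latency_ms"] - model_ms - tool_ms   # time not recorded as a step
    return {"model_ms": model_ms, "tool_ms": tool_ms, "other_ms": other_ms}


# The Step 1 example record, trimmed to the fields this function reads.
turn = {
    "steps": [
        {"type": "model_call", "step": 1, "latency_ms": 842},
        {"type": "tool_call", "step": 1, "latency_ms": 0},
        {"type": "model_call", "step": 2, "latency_ms": 1130},
        {"type": "validation", "parse_ok": True},
    ],
    "total_latency_ms": 1990,
}
print(latency_breakdown(turn))   # {'model_ms': 1972, 'tool_ms': 0, 'other_ms': 18}
```

For the example turn, nearly all of the two seconds is model time, which matches the reading of the record in Step 1.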

Sidebar — the second layer: server metrics on self-hosted NIM

Everything above is app-side observability, and it works identically against the hosted API Catalog and a local NIM container. If you self-host NIM (Part 4), you get a second layer for free: the container exposes Prometheus metrics — GPU utilization, time-to-first-token, requests in flight — on its HTTP port:

# Local NIM container only. The hosted endpoint does NOT expose this.
curl -s http://localhost:8000/v1/metrics | head -20

Mind the path: metrics live under /v1 alongside the inference routes — :8000/v1/metrics, not :8000/metrics. App traces tell you what your agent did; server metrics tell you what the model server did. Production runs both. Docs: https://docs.nvidia.com/nim/large-language-models/latest/reference/logging-and-observability.html

Step 5 — The rule that saves you later: traces hold user data

Look at what we're logging: the user's message, tool results, the final answer. In this demo that's club schedules. In a real deployment it's names, emails, student IDs — whatever people type at your assistant. So three habits, from day one:

Never log secrets. No API keys, no request headers, no environment. (Notice the tracer never touches client or os.environ.)
Never log the full messages array. It re-accumulates the whole conversation every turn — one leak away from a disaster and redundant anyway: the per-turn records already reconstruct it.
Keep trace files out of git. traces/ is in .gitignore in the repo. Trace files are data, not code.
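If your deployment does see personal data, one possible extra step is to mask obvious identifiers before a record is written. The sketch below only handles email addresses with a deliberately simple regex. redact, redact_turn, and the "[email]" placeholder are assumptions for this illustration, not something the repo does, and a real deployment would need a more careful policy.

```python
import re

# Deliberately simple pattern: good enough for a demo, not a full address grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text):
    """Mask email addresses in a string before it goes into a trace."""
    return EMAIL_RE.sub("[email]", text)


def redact_turn(turn):
    """Return a copy of a turn record with emails masked in its free-text fields."""
    clean = dict(turn)
    clean["user_message"] = redact(turn["user_message"])
    clean["steps"] = [
        {**e, "result": redact(e["result"])} if e.get("type") == "tool_call" else e
        for e in turn["steps"]
    ]
    final = turn.get("final")
    if isinstance(final, dict) and isinstance(final.get("answer"), str):
        clean["final"] = {**final, "answer": redact(final["answer"])}
    return clean


turn = {"user_message": "My email is tommy@usc.edu, when does the club meet?",
        "steps": [{"type": "tool_call", "result": "Contact: aiclub@usc.edu"}],
        "final": {"answer": "Thursday at 5 PM. Questions go to aiclub@usc.edu."}}
print(redact_turn(turn)["user_message"])   # My email is [email], when does the club meet?
```

The function returns a copy, so the live conversation keeps the original text and only the trace line is masked.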

Step 6 — What you actually built

Workshops 1–9 built an agent that retrieves, refuses, plans, remembers, streams, and returns validated JSON.
Workshop 10 made it observable: every turn leaves a one-line record that answers "what happened?" after the fact.

And it set up the next chapter perfectly. With traces on disk and a fixed answer contract, we can finally test the agent like software: evals — replay the questions, assert on status, category, and missing, and catch regressions before students do. That's Part 11.

The agent is still a while loop around a model call. Now it's a while loop that keeps receipts.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open part10_traces.ipynb
Local Python: part10_traces.py in the repo (python3 part10_traces.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base and the tools for your school, your club, your project.

The full series

Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Part 3: Add Guardrails So Your AI App Doesn't Lie
Part 4: Run NVIDIA NIM on Your Own GPU
Part 5: From Chatbot to Agent — Tool Calling with NVIDIA NIM
Part 6: From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM
Part 7: Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM
Part 8: Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM
Part 9: Make Your Agent Return Data, Not Prose — Structured Outputs with NVIDIA NIM
Part 10 (this post): See What Your Agent Did — Tracing and Observability with NVIDIA NIM

A consolidated long-form version of the whole series is on Medium for anyone who'd rather read it in one sitting.

Make Your Agent Return Data, Not Prose — Structured Outputs with NVIDIA NIM

Torkian — Tue, 07 Jul 2026 15:50:12 +0000

For eight parts the agent has ended every turn the same way: it prints a sentence and we read it. That's fine for a demo and a dead end for a product. The moment you want to build on the agent — render it in a UI, score it in a test, return it from an API — prose fights you. You're reduced to scraping text with regexes and hoping the wording doesn't change.

This post fixes that with one capability: the final answer becomes a validated JSON object with a fixed shape. Same agent, same tools, same memory — but now it returns {"status": "...", "answer": "...", "items": [...], ...} instead of a paragraph. That contract is the hinge the rest of this series turns on: the tracing chapter logs it, the evaluation chapter asserts on its fields, the deployment chapter returns it as the HTTP response body.

There's an honest catch, and it's the real lesson of this post. The tidy answer would be "use response_format with a JSON schema and let the API enforce it." On the hosted API Catalog with an open model, that's not dependable — strict schema mode is a server-side feature that open-model endpoints implement inconsistently, and it interacts badly with tool calling. So we don't lean on it. We ask for JSON in the prompt and then do the grown-up thing: parse it, validate it ourselves, and repair it once if it's wrong. No framework.

(If you self-host NIM — Part 4 — NVIDIA does document server-side structured generation: NIM 1.x accepts extra_body={"nvext": {"guided_json": schema}} for constrained decoding, and the newest releases move to OpenAI-compatible response_format JSON mode. Either way, NVIDIA's own docs tell you to validate the response client-side — which is exactly the ladder this post builds. Docs: https://docs.nvidia.com/nim/large-language-models/latest/structured-generation.html)

I'm B Torkian, NVIDIA Developer Champion at USC. Part 9 of the series.

What you're adding

Workshop 8:  final answer -> a string you print
Workshop 9:  final answer -> JSON text -> parse -> validate -> (repair once) -> a dict you can use

The Workshop 8 behavior is unchanged — tools, multi-turn memory, trim-by-turns, the streaming accumulator. The one behavioral change is at the final-answer boundary: instead of returning message.content, we turn it into a validated dict. (We also do a small bit of housekeeping — Workshop 8 had two near-identical loops, one in chat() and one in stream(); we factor them into a single shared loop so the JSON finalization lives in exactly one place. More on that in Step 4. One honest trade: live token-by-token display is dropped for the final answer — half-formed JSON is useless to show — while the wire-level streaming and tool-call fragment reassembly from Part 8 stay.)

Step 1 — Decide the contract

Before any code, decide what the agent must return. For the campus assistant, six keys cover every case:

{
  "status": "answered",
  "answer": "The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204.",
  "category": "campus_event",
  "items": [
    {"name": "USC AI Club meeting", "day": "Thursday", "time": "5 PM", "location": "engineering building, room 204"}
  ],
  "missing": [],
  "sources": ["The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204."]
}

status and category are enums (a fixed set of allowed values) so downstream code can branch on them. items is the machine-readable payload. missing names anything the user asked for that wasn't found. sources is the grounding: the exact knowledge-base lines used. That's where the spirit of the guardrails chapter (Part 3) lives on.

Step 2 — Parse, validate, repair (the part that matters)

We treat the model's output as untrusted. Three small functions, plain Python:

STATUSES = {"answered", "not_found", "needs_clarification"}
CATEGORIES = {"campus_event", "campus_hours", "campus_resource", "comparison", "refusal"}
REQUIRED_KEYS = ("status", "answer", "category", "items", "missing", "sources")

def parse_json_object(text: str) -> dict:
    # Models wrap JSON in prose or fenced code blocks. Take the {...} span.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found")
    return json.loads(text[start:end + 1])

def validate_answer(data) -> list:
    if not isinstance(data, dict):
        return ["response is not a JSON object"]
    errors = []
    for key in REQUIRED_KEYS:
        if key not in data:
            errors.append(f"missing required key: {key}")
    if data.get("status") not in STATUSES:
        errors.append(f"status must be one of {sorted(STATUSES)}")
    if data.get("category") not in CATEGORIES:
        errors.append(f"category must be one of {sorted(CATEGORIES)}")
    if "answer" in data and not isinstance(data["answer"], str):
        errors.append("answer must be a string")
    for key in ("items", "missing", "sources"):
        if key in data and not isinstance(data[key], list):
            errors.append(f"{key} must be a list")
    if isinstance(data.get("items"), list) and not all(isinstance(it, dict) for it in data["items"]):
        errors.append("each entry in items must be an object")
    return errors

If validation fails, make exactly one repair attempt — hand the broken output back to the model at temperature=0 with the list of problems and ask for a clean object:

def repair_answer_json(raw_text: str, errors: list) -> dict | None:
    # (prompt abridged — the repo version also restates the required keys and enums)
    try:
        fix = client.chat.completions.create(
            model=MODEL, temperature=0, max_tokens=800,
            messages=[
                {"role": "system", "content": "/no_think\n\nYou fix malformed JSON. Return ONLY a valid JSON object."},
                {"role": "user", "content": f"Problems: {errors}. Preserve all facts. Original:\n{raw_text}\n\nReturn corrected JSON only."},
            ],
        )
        data = parse_json_object(fix.choices[0].message.content or "")
    except Exception:          # API error or unparseable — fall back deterministically
        return None
    return data if not validate_answer(data) else None

Notice what goes into that repair prompt: the specific validation failures ("status must be one of …", "missing required key: category"), not a vague "return valid JSON." Telling the model exactly which field failed, and why, is most of the reason a single retry usually lands.

And if even the repair fails, return a deterministic error object so callers never crash on a surprise:

def format_error(missing: str = "valid_json") -> dict:
    return {"status": "needs_clarification",
            "answer": "I couldn't produce a valid structured response. Please ask again.",
            "category": "refusal", "items": [], "missing": [missing], "sources": []}

Parse → validate → repair once → deterministic fallback. That four-step ladder is what makes structured output safe in production without a schema-enforcement framework. (The Part 8 step-limit fallback becomes structured too — in the repo, format_error takes an optional cause-specific answer, so even the give-up path honors the contract.)
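To see the ladder work without calling a model, you can pass the repair step in as a plain function. This is a sketch under stated assumptions: finalize, missing_keys, and fallback are illustrative names, the validator is simplified to a key check, and the stub repair always fails so the fallback rung is exercised.

```python
import json

REQUIRED_KEYS = ("status", "answer", "category", "items", "missing", "sources")


def parse_json_object(text):
    # Take the outermost {...} span, ignoring any prose around it.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found")
    return json.loads(text[start:end + 1])


def missing_keys(data):
    """Simplified validator: only checks that the six keys exist."""
    if not isinstance(data, dict):
        return ["response is not a JSON object"]
    return [f"missing required key: {k}" for k in REQUIRED_KEYS if k not in data]


def fallback():
    return {"status": "needs_clarification",
            "answer": "I couldn't produce a valid structured response. Please ask again.",
            "category": "refusal", "items": [], "missing": ["valid_json"], "sources": []}


def finalize(raw_text, repair):
    """Parse, validate, repair once, then fall back deterministically.
    repair is any callable (raw_text, errors) -> dict or None."""
    try:
        data = parse_json_object(raw_text)
        errors = missing_keys(data)
    except ValueError:                    # json.JSONDecodeError is a ValueError
        data, errors = None, ["response was not valid JSON"]
    if not errors:
        return data
    repaired = repair(raw_text, errors)   # exactly one attempt
    if repaired is not None and not missing_keys(repaired):
        return repaired
    return fallback()


def no_repair(raw_text, errors):
    return None                           # stub: pretend the repair call failed


wrapped = ('Sure, here it is: {"status": "answered", "answer": "Thursday at 5 PM", '
           '"category": "campus_event", "items": [], "missing": [], "sources": []} Hope that helps!')
print(finalize(wrapped, no_repair)["status"])               # answered
print(finalize("I think Thursday?", no_repair)["status"])   # needs_clarification
```

Injecting the repair callable is only a testing convenience here; in the post's code the repair step is the model call shown above.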

"Why not Pydantic?" Fair question — in production Python, Pydantic is the standard way to do exactly this: define the contract as a class, call model_validate_json, and get typed objects plus precise error messages to feed the repair prompt. We hand-rolled validate_answer for the same reason this series hand-rolled retrieval and the agent loop: so you can see what the tool automates. It's ~30 lines, and now Pydantic will never be magic to you. Swapping it in is a genuinely good exercise — class Answer(BaseModel) with Literal types for the enums, and the ladder's parse + validate steps collapse into one Answer.model_validate_json(raw) inside a try. The repair-once and deterministic-fallback steps stay: no validator, Pydantic included, can fix output that never arrived.

Step 3 — Tell the model the format

The tool guidance from Workshops 7–8 is unchanged. We append a FINAL ANSWER FORMAT block to the system prompt that pins the contract:

SYSTEM_PROMPT = (
    "/no_think\n\n"
    "...all the Workshop 7-8 tool + memory guidance...\n\n"
    "FINAL ANSWER FORMAT. When you are done using tools, your final reply MUST be a "
    "single JSON object and NOTHING else — no prose, no code fences. Use exactly these "
    'keys: status (answered|not_found|needs_clarification), answer (string), category '
    "(campus_event|campus_hours|campus_resource|comparison|refusal), items (list), "
    "missing (list), sources (list)."
)

Tool-calling turns are unaffected — the model still emits tool_calls while it gathers facts. Only the final reply has to be JSON.

Step 4 — One shared loop, JSON at the finish line

ChatSession keeps Workshop 8's control flow exactly. The housekeeping mentioned above: Workshop 8 had the agent loop written out twice — once in chat(), once in stream(). We pull the single model call into _complete(stream) (which returns the text, the reassembled tool calls, and the assistant message, for both streaming and non-streaming) and the loop itself into one shared _run_turn(user_message, stream). Now chat() and stream() are one-line delegators:

def chat(self, user_message: str) -> dict:
    return self._run_turn(user_message, stream=False)

def stream(self, user_message: str) -> dict:
    return self._run_turn(user_message, stream=True)

_run_turn is the Workshop 8 loop — run tools until none remain. The only new step is the final branch: instead of returning the text, finalize it into a validated dict with _finalize_json.

def _finalize_json(self, raw_text: str) -> dict:
    try:
        data = parse_json_object(raw_text)
        errors = validate_answer(data)
    except (ValueError, json.JSONDecodeError):
        data, errors = None, ["response was not valid JSON"]
    if errors:
        repaired = repair_answer_json(raw_text, errors)
        data = repaired if repaired is not None else format_error()
    return data

# the final branch of _run_turn, once the model stops calling tools:
#     data = self._finalize_json(text)
#     self.messages.append({"role": "assistant", "content": json.dumps(data)})
#     self._trim()
#     return data

Note we store the canonical json.dumps(data) in history, not the model's raw text — so the next turn's memory is clean, validated JSON too. Both chat() and stream() return a dict now.

Step 5 — Run it

session = ChatSession(verbose=True)
for q in [
    "When does the USC AI Club meet?",                          # answered, campus_event
    "How many days until that?",                                # memory + tool
    "Which is sooner, that meeting or the AI/ML office hours?",  # comparison
    "What is the campus wifi password?",                        # not_found / refusal
]:
    result = session.chat(q)
    print(json.dumps(result, indent=2))

You get back four well-formed objects. The wifi question is the satisfying one — instead of a refusal sentence, you get a typed refusal your code can act on:

{"status": "not_found", "answer": "I don't have that information — check with the USC AI Club.",
 "category": "refusal", "items": [], "missing": ["USC campus wifi password"], "sources": []}

Memory and multi-step reasoning are untouched — "how many days until that?" still resolves "that", and the comparison still calls the tool once per day.

Step 6 — What you actually built

Workshop 1 gave it a brain. 2 memory of facts. 3 judgment. 4 portability. 5 hands. 6 a plan. 7 memory of the conversation. 8 a real-time voice.
Workshop 9 gave it a contract — output other software can consume.

That last one is the quiet turning point of the series. Everything before made the agent smarter; this makes it integratable. And it sets up the back half of this series:

Next — Traces: log each turn's tools, latency, and this final object as JSONL, so you can see what the agent did after the fact.
Then — Evals: with traces on disk and a fixed contract, you can finally test the agent like software — replay the questions and assert status, category, and that missing fires on the wifi question.
Then — Durable sessions and Deploy: persist histories and serve this exact JSON body behind an HTTP API.

The agent is still the same while loop. We just taught its last word to be data.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open part9_structured_output.ipynb
Local Python: part9_structured_output.py in the repo (python3 part9_structured_output.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base, the tools, and the contract for your school, your club, your project.


Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM

Torkian — Sat, 27 Jun 2026 18:59:23 +0000

The assistant we've built over seven parts is capable — it retrieves, refuses, plans, chains tools, and remembers a conversation. It also has one glaring UX flaw: you ask a question, it goes silent for a few seconds, and then a whole paragraph appears at once. For a one-line answer that's invisible. For anything longer, it feels broken.

Every chat product you've used solves this the same way: streaming. The text types itself out token by token, so you see progress immediately. This post adds exactly that to our agent, and the payoff is huge for how "alive" it feels — for a change that's mostly one flag.

Mostly. The flag (stream=True) is the easy 20%. The other 80% is what the stream hands back: not one tidy message, but a sequence of small chunks. Plain text is easy to reassemble. Tool calls are not — they arrive split into fragments across many chunks, and you have to stitch them back together before you can run anything. That reassembly is the real lesson of Workshop 8.

I'm B Torkian, NVIDIA Developer Champion at USC. Part 8 of the series.

What you're adding

Workshop 7:  create(...)            -> one message     -> print it all at once
Workshop 8:  create(..., stream=True) -> many chunks   -> print each token as it lands

The agent loop does not change. You stream a turn, reassemble whatever came back (text or tool-call fragments), then do exactly what Workshop 7 did: run the tools and loop, or stop because the answer is done. Streaming is a layer inside the turn, not a new control flow.

Step 1 — Streaming at its simplest (no tools)

Add stream=True and the return value stops being a message — it becomes an iterator of chunks, each carrying a small delta. For plain text, the only field that matters is delta.content:

resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "system", "content": "/no_think"},
              {"role": "user", "content": "In two sentences, what is GPU acceleration?"}],
    stream=True,
)

for chunk in resp:
    if not chunk.choices:               # a trailing usage-only chunk has none
        continue
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()

end="" and flush=True are what make it stream to the terminal instead of buffering. That's the whole trick for text. Run it and the answer types itself out.

Step 2 — The catch: tool calls arrive in fragments

Here's what surprises people. When the model decides to call a tool, the call does not arrive in one piece. The function name shows up in one chunk; the arguments JSON dribbles in across several more. Each fragment is tagged with an index so you know which call it belongs to — because the model can request more than one in a single turn.

So you keep a dictionary keyed by that index. For each fragment, you set the id and name when they appear, and you concatenate the arguments string as the pieces arrive:

text_parts = []
tool_fragments = {}     # index -> {"id", "name", "arguments"}

for chunk in stream_resp:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if delta.content:                              # visible answer text
        print(delta.content, end="", flush=True)
        text_parts.append(delta.content)

    for tc in (delta.tool_calls or []):            # a fragment of a tool call
        slot = tool_fragments.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
        if tc.id:
            slot["id"] = tc.id
        if tc.function and tc.function.name:
            slot["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            slot["arguments"] += tc.function.arguments   # JSON arrives in pieces

When the stream ends, each bucket holds one complete tool call, ready to parse and run. That's the only genuinely new idea in this workshop. Everything around it is the Workshop 7 loop.
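You can watch the reassembly happen without a network call by feeding the same accumulation logic a hand-built stream. In this sketch the chunks are SimpleNamespace objects that only mimic the attributes the loop reads (choices, delta, tool_calls, index, id, function); the call ids and the interleaved ordering are contrived for illustration, and real streams will split the fragments differently.

```python
import json
from types import SimpleNamespace as NS


def fragment(index, call_id=None, name=None, arguments=None):
    """Build one fake streamed chunk carrying a single tool-call fragment."""
    tc = NS(index=index, id=call_id, function=NS(name=name, arguments=arguments))
    return NS(choices=[NS(delta=NS(content=None, tool_calls=[tc]))])


def reassemble(chunks):
    """Accumulate tool-call fragments by index, then return the calls in order."""
    tool_fragments = {}
    for chunk in chunks:
        if not chunk.choices:             # e.g. a trailing usage-only chunk
            continue
        delta = chunk.choices[0].delta
        for tc in (delta.tool_calls or []):
            slot = tool_fragments.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
            if tc.id:
                slot["id"] = tc.id
            if tc.function and tc.function.name:
                slot["name"] = tc.function.name
            if tc.function and tc.function.arguments:
                slot["arguments"] += tc.function.arguments
    return [tool_fragments[i] for i in sorted(tool_fragments)]


# Two calls requested in one turn, with their argument pieces interleaved.
chunks = [
    fragment(0, call_id="call_a", name="days_until_weekday"),
    fragment(1, call_id="call_b", name="days_until_weekday"),
    fragment(0, arguments='{"week'),
    fragment(1, arguments='{"weekday": '),
    fragment(0, arguments='day": "Thursday"}'),
    fragment(1, arguments='"Tuesday"}'),
    NS(choices=[]),
]

calls = reassemble(chunks)
for call in calls:
    print(call["id"], call["name"], json.loads(call["arguments"]))
```

Neither arguments string is valid JSON until its last piece arrives, which is why parsing waits until the stream has ended.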

Step 3 — Fold it into ChatSession

stream() lives alongside Workshop 7's chat() on the same session — same persistent self.messages, same _trim() (trim-by-turns), same memory. The only difference is that the turn is streamed and the assistant message is rebuilt from the accumulated pieces.

def stream(self, user_message: str) -> str:
    self.messages.append({"role": "user", "content": user_message})

    for step in range(1, MAX_STEPS + 1):
        stream_resp = client.chat.completions.create(
            model=MODEL, messages=self.messages, tools=tools,
            tool_choice="auto", temperature=0.2, max_tokens=400, stream=True,
        )

        text_parts, tool_fragments, header_printed = [], {}, False
        for chunk in stream_resp:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta
            if delta.content:
                if not header_printed:
                    print("Assistant: ", end="", flush=True); header_printed = True
                print(delta.content, end="", flush=True)
                text_parts.append(delta.content)
            for tc in (delta.tool_calls or []):
                slot = tool_fragments.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
                if tc.id: slot["id"] = tc.id
                if tc.function and tc.function.name: slot["name"] = tc.function.name
                if tc.function and tc.function.arguments: slot["arguments"] += tc.function.arguments
        if header_printed:
            print()

        text = "".join(text_parts)
        tool_calls = [tool_fragments[i] for i in sorted(tool_fragments)]

        # Rebuild the assistant message from the streamed pieces and store it.
        assistant_msg = {"role": "assistant"}
        if tool_calls:
            assistant_msg["tool_calls"] = [
                {"id": tc["id"], "type": "function",
                 "function": {"name": tc["name"], "arguments": tc["arguments"]}}
                for tc in tool_calls
            ]
            if text:
                assistant_msg["content"] = text
        else:
            assistant_msg["content"] = text
        self.messages.append(assistant_msg)

        if not tool_calls:                  # final answer already streamed
            self._trim()
            return text or "I could not generate an answer. Please try again."

        for tc in tool_calls:               # run tools, then loop and stream again
            try:
                arguments = json.loads(tc["arguments"] or "{}")
            except json.JSONDecodeError:
                arguments = {}
            result = run_tool(tc["name"], arguments)   # the Part 7 dispatch, factored out:
            # def run_tool(name, arguments):
            #     if name not in available_tools: return f"Tool '{name}' is not available."
            #     try: return available_tools[name](**arguments)
            #     except Exception as exc: return f"Tool '{name}' failed: {exc}"
            self.messages.append({"role": "tool", "tool_call_id": tc["id"],
                                  "name": tc["name"], "content": str(result)})

    # (abridged: the same MAX_STEPS fallback as chat() closes the loop)

Put the chat() version next to this and the structure is identical — the streaming version just builds the assistant message by hand from fragments instead of getting it whole. That isomorphism is the point: streaming is a data-accumulation layer, not a new agent.

Step 4 — Feel the difference

print("── Without streaming (answer arrives all at once) ──")
session = ChatSession(verbose=True)
print(f"Assistant: {session.chat('What are the USC GPU lab hours?')}")

print("\n── With streaming ──")
for q in [
    "When does the USC AI Club meet?",                          # tool call, then streams
    "How many days until that?",                                # memory + tool, then streams
    "Which is sooner, that meeting or the AI/ML office hours?",  # multi-step, then streams
]:
    print(f"\nYou: {q}")
    session.stream(q)

The non-streaming call pauses, then dumps the answer. The streaming calls show the tool step, then the answer types itself out — and memory still works ("that" resolves to Thursday) and so does multi-step comparison. You only changed how the answer is delivered, not how the agent thinks.

Step 5 — The trap worth naming

There's a tempting "simpler" design: do a normal non-streaming call first to check whether the model wants a tool, and only if it doesn't, call again with stream=True to stream the answer. Don't. On the final turn that means you generate the whole answer once (blocking), then generate it again to stream it. Your first visible token now arrives later than if you hadn't streamed at all — the exact opposite of the goal — and you pay for the answer twice.

Streaming the same call that decides on tools is what gives you low time-to-first-token. That's why we accumulate fragments instead of peeking first. It's a few more lines, and it's the difference between streaming that helps and streaming that's theater.
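The cost of peeking first is easy to put into numbers. The sketch below is back-of-the-envelope arithmetic, not a benchmark: the function name and the two timings are illustrative assumptions.

```python
def first_token_latency(ttft: float, full_answer: float, peek_first: bool) -> float:
    """Seconds until the user sees the first token of the final answer.

    ttft: time for a streamed call to emit its first token.
    full_answer: time for a blocking call to generate the complete answer.
    """
    if peek_first:
        # The blocking call finishes the whole answer, then a second call streams it.
        return full_answer + ttft
    # One streamed call both decides on tools and starts the answer.
    return ttft

TTFT, FULL = 0.4, 3.0   # illustrative numbers, not measurements

same_call = first_token_latency(TTFT, FULL, peek_first=False)
peek = first_token_latency(TTFT, FULL, peek_first=True)
blocking = FULL          # no streaming at all: text appears when generation ends

print(f"stream the same call: {same_call:.1f}s")
print(f"peek, then stream:    {peek:.1f}s")
print(f"no streaming:         {blocking:.1f}s")
```

With these assumed timings, peeking first is slower to the first visible token than not streaming at all, which is the trap described above.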

Step 6 — What you actually built

Workshop 1 gave it a brain.
Workshop 2 gave it memory of facts (retrieval).
Workshop 3 gave it judgment (guardrails).
Workshop 4 gave it portability.
Workshop 5 gave it hands (one tool).
Workshop 6 gave it a plan (chained tools).
Workshop 7 gave it memory of the conversation.
Workshop 8 gave it a voice that arrives in real time.

The agent is the same while loop it's been since Part 5. Streaming, like memory and tools before it, is normal software wrapped around the model call — you read the response differently, and the experience transforms. Production systems push this further (streaming over WebSockets to a browser, rendering partial markdown, cancel-mid-stream), but every one of them is doing what you just did: consuming chunks and reassembling them.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open part8_streaming_agent.ipynb
Local Python: part8_streaming_agent.py in the repo (python3 part8_streaming_agent.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base and the tools for your school, your club, your project.

The full series

Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Part 3: Add Guardrails So Your AI App Doesn't Lie
Part 4: Run NVIDIA NIM on Your Own GPU
Part 5: From Chatbot to Agent — Tool Calling with NVIDIA NIM
Part 6: From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM
Part 7: Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM
Part 8 (this post): Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM
Part 9: Make Your Agent Return Data, Not Prose — Structured Outputs with NVIDIA NIM
Part 10: See What Your Agent Did — Tracing and Observability with NVIDIA NIM

A consolidated long-form version of the whole series is on Medium for anyone who'd rather read it in one sitting.

Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM

Torkian — Mon, 22 Jun 2026 23:49:22 +0000

The agent we built in Part 6 is sharp — it plans, chains tools, and answers genuinely hard questions. It also has the memory of a goldfish. Ask it "when does the AI Club meet?", get a good answer, then ask "how many days until that?" — and it has no idea what "that" is. Every question starts from a blank slate.

That's the gap between a query tool and an assistant. A real assistant holds a conversation. It remembers what you just asked, resolves "that" and "those two" and "the second one" against what's already been said, and doesn't make you repeat yourself.

The fix is smaller than you'd think. In Part 6 the messages list lived inside the agent function and got thrown away after each question. In this post we lift that list out of the function and into a session object so it survives from one turn to the next. That's most of the work. The interesting part — the part that bites people — is what happens when the conversation gets long enough that you have to start forgetting old turns without breaking the tool-call bookkeeping.

I'm B Torkian, NVIDIA Developer Champion at USC. Part 7 of the series.

What you're adding

Turn 1: user asks → agent runs the tool loop → answer        ┐
Turn 2: user asks → agent runs the tool loop → answer        │  all sharing
Turn 3: ...                                                  ┘  ONE messages list

The list is never cleared between turns, so each turn sees everything before it.
When it gets too long, drop the OLDEST WHOLE TURN — never half of one.

The chat call from Part 1, the retriever from Part 2, the guardrail from Part 3, and the three tools from Part 6 all carry forward unchanged. The only new idea is persistence: keep the message history alive across calls.

Why "just keep the messages list" has a trap in it

Persisting the history is one line of intent — keep appending to the same list instead of starting a new one. But conversations grow without bound, and eventually you have to trim old turns or you'll blow past the context window and pay for tokens you don't need.

Here's the trap. With tool calling, the API enforces a pairing rule: every role="tool" message must match a tool_calls entry in an earlier assistant message, by ID. So if you naively trim "the oldest 4 messages" and one of them was the assistant message that requested a tool — but you keep the tool result that came right after — you've created an orphan. The tool result now references a tool_call_id that no longer exists in the history, and NVIDIA NIM (like any OpenAI-compatible endpoint) rejects the request with a validation error.

The fix is to think in turns, not messages. A turn is everything from one user message up to the next: the user's question, every assistant/tool exchange in between, and the final answer. You add and remove whole turns. Concretely, that means trim only at a user-message boundary — then you can never split a tool call from its result.
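The pairing rule can be checked locally before a request ever leaves your machine. The following is a minimal sketch: `find_orphans` is a hypothetical helper written for this example, and the history is hand-built to mirror the first turn of the conversation.

```python
def find_orphans(messages):
    """Return tool_call_ids of tool messages with no matching earlier tool_calls entry."""
    seen, orphans = set(), []
    for m in messages:
        if m.get("role") == "assistant":
            for tc in m.get("tool_calls", []):
                seen.add(tc["id"])
        elif m.get("role") == "tool" and m["tool_call_id"] not in seen:
            orphans.append(m["tool_call_id"])
    return orphans

history = [
    {"role": "system", "content": "You are a campus assistant."},
    {"role": "user", "content": "When does the USC AI Club meet?"},
    {"role": "assistant", "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "search_campus_info", "arguments": '{"query": "AI Club"}'}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "every Thursday at 5 PM"},
    {"role": "assistant", "content": "Thursdays at 5 PM."},
    {"role": "user", "content": "How many days until that?"},
]

# Naive trim: drop the two oldest non-system messages. That removes the assistant
# message that requested call_1 but keeps the tool result that answers it.
naive = [history[0]] + history[3:]

# Turn-based trim: cut at the second user message, so the whole first turn goes.
by_turn = [history[0]] + history[5:]

print(find_orphans(history))   # []
print(find_orphans(naive))     # ['call_1']
print(find_orphans(by_turn))   # []
```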

Step 1 — Carry the setup forward

You need the client, MODEL, the knowledge_base + retrieve_context from Part 2, and the three tools from Part 6 (search_campus_info, get_current_time, days_until_weekday). The Colab notebook has a compact prerequisite cell; the standalone part7_memory_agent.py defines everything from scratch.

Same nvidia/llama-3.3-nemotron-super-49b-v1.5 on the same hosted endpoint. Low temperature matters even more here than in Part 6 — more on that at the end.

MODEL = "nvidia/llama-3.3-nemotron-super-49b-v1.5"
LOCAL_TZ = "America/Los_Angeles"

Step 2 — A session that remembers

In Part 6 the loop owned a local messages = [...]. Here we move that list onto an object. That's the whole conceptual jump: state that used to vanish when the function returned now lives on self and persists between calls.

class ChatSession:
    def __init__(self, max_turns: int = 8, verbose: bool = True):
        self.system = {"role": "system", "content": SYSTEM_PROMPT}
        self.messages = [self.system]      # <- persists across .chat() calls
        self.max_turns = max_turns
        self.verbose = verbose

    def reset(self):
        self.messages = [self.system]      # forget everything

    def _trim(self):
        # Keep system + the last `max_turns` turns. Cut ONLY at a user-message
        # boundary, so a tool result is never orphaned from its tool call.
        user_indices = [i for i, m in enumerate(self.messages) if m.get("role") == "user"]
        if len(user_indices) <= self.max_turns:
            return
        cut = user_indices[-self.max_turns]            # first index to keep
        dropped = len(user_indices) - self.max_turns
        self.messages = [self.system] + self.messages[cut:]
        if self.verbose:
            print(f"  (memory: dropped {dropped} old turn(s), keeping last {self.max_turns})")

A class beats a closure here for one reason: the memory is visible. You can print(session.messages) and see exactly what the model remembers, and session.reset() is an obvious way to clear it. Hidden state in a closure teaches the wrong mental model.
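The index arithmetic in _trim() is easier to trust once you have run it on a fake history. This sketch pulls the same logic into a standalone function; `trim_by_turns` and the `turn` builder are names invented for the example, and it assumes `max_turns` is at least 1 and that the first message is the system prompt.

```python
def trim_by_turns(messages, max_turns):
    """Keep the system message plus the last `max_turns` turns (max_turns >= 1)."""
    user_indices = [i for i, m in enumerate(messages) if m.get("role") == "user"]
    if len(user_indices) <= max_turns:
        return messages
    cut = user_indices[-max_turns]          # index of the first user message to keep
    return [messages[0]] + messages[cut:]

def turn(question, tool_id, answer):
    """Build one turn: user question, tool request, tool result, final answer."""
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "tool_calls": [{"id": tool_id, "type": "function",
            "function": {"name": "search_campus_info", "arguments": "{}"}}]},
        {"role": "tool", "tool_call_id": tool_id, "content": "tool result"},
        {"role": "assistant", "content": answer},
    ]

messages = [{"role": "system", "content": "system prompt"}]
messages += turn("q1", "call_1", "a1") + turn("q2", "call_2", "a2") + turn("q3", "call_3", "a3")

kept = trim_by_turns(messages, max_turns=2)
print(len(messages), "->", len(kept))        # 13 -> 9
print([m["role"] for m in kept[:3]])         # ['system', 'user', 'assistant']
print(kept[1]["content"])                    # q2
```

Because the cut always lands on a user message, every tool result that survives still has the assistant message that requested it.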

Step 3 — The turn loop, now against the full history

chat() is the Part 6 tool loop with two differences: it appends to self.messages (the persistent list) instead of a local one, and it calls _trim() before returning so memory stays bounded.

def chat(self, user_message: str) -> str:
    self.messages.append({"role": "user", "content": user_message})

    for step in range(1, MAX_STEPS + 1):
        response = client.chat.completions.create(
            model=MODEL, messages=self.messages, tools=tools,
            tool_choice="auto", temperature=0.2, max_tokens=400,
        )
        message = response.choices[0].message
        self.messages.append(message.model_dump(exclude_none=True))

        if not message.tool_calls:        # final answer for this turn
            self._trim()
            return message.content or "I could not generate an answer. Please try again."

        for tool_call in message.tool_calls:
            name = tool_call.function.name
            try:
                arguments = json.loads(tool_call.function.arguments or "{}")
            except json.JSONDecodeError:
                arguments = {}
            if name not in available_tools:
                result = f"Tool '{name}' is not available."
            else:
                try:
                    result = available_tools[name](**arguments)
                except Exception as exc:
                    result = f"Tool '{name}' failed: {exc}"
            if self.verbose:
                print(f"  step {step} · acting  -> {name}({json.dumps(arguments)})")
                print(f"  step {step} · observe <- {result}")
            self.messages.append({"role": "tool", "tool_call_id": tool_call.id,
                                  "name": name, "content": str(result)})

    self._trim()
    return "I reached the step limit before finishing — try asking a narrower question."

The system prompt does real work in multi-turn mode — it gains three lines over Part 6's prompt, and each earns its keep:

When a question refers back to something already discussed — words like 'that',
'those', 'then', 'it', or 'the second one' — resolve the reference from the
conversation so far before doing anything else.

Before calling a tool, check whether the conversation ALREADY contains the
fact you need — do not re-search for something you found a turn ago.

To compare how soon two days are, call days_until_weekday for EACH day and
compare the numbers it returns — never estimate the number of days yourself.

The first makes back-references resolve. The second matters because, without it, the model will sometimes call search_campus_info again for something it retrieved two turns ago.

The third exists because, without it, the model cheerfully does the date arithmetic in its head on the "which is sooner?" turn, and gets it wrong. Pushing the comparison back through the tool is the same lesson as Part 6: don't let the model guess when a function can calculate exactly.

Step 4 — Have a conversation

session = ChatSession(verbose=True)
for user_message in [
    "When does the USC AI Club meet?",              # search -> "Thursday"
    "How many days until that?",                    # "that" = Thursday (from memory)
    "And when are the AI/ML faculty office hours?", # search -> "Tuesday"
    "Which of those two is sooner?",                # compares BOTH remembered facts
]:
    print(f"\nYou:       {user_message}")
    print(f"Assistant: {session.chat(user_message)}")

Watch the two turns that can't stand alone:

"How many days until that?" — the word that has no referent in the sentence itself. The model reads Turn 1 from history, resolves it to Thursday, and calls days_until_weekday("Thursday"). Strip the history and this question is meaningless.
"Which of those two is sooner?" — the model has to hold two facts it retrieved on different turns (AI Club = Thursday, office hours = Tuesday) and compare them. That's only possible because both are still in memory.

Step 5 — Prove memory is the thing doing the work

session.reset()
print("You:       How many days until that?")
print(f"Assistant: {session.chat('How many days until that?')}")

Same question, empty history. With nothing behind it, "that" has no referent, so the agent has nothing to resolve and falls back. The only variable that changed was whether the conversation was there — which is exactly the point.

Step 6 — What you actually built, and what's still missing

The assistant now has continuity:

Workshop 1 gave it a brain.
Workshop 2 gave it memory of facts (retrieval).
Workshop 3 gave it judgment.
Workshop 4 gave it portability.
Workshop 5 gave it hands (one tool).
Workshop 6 gave it a plan (chained tools).
Workshop 7 gave it memory of the conversation.

Three things to keep in mind as you take it further:

The history window is a real limit, not a formality. When a fact scrolls out of the kept turns, the model can't refer to it, and it will sometimes confabulate what was said rather than admit it forgot. Try setting max_turns=2 and asking a follow-up about turn 1 to see this for yourself. That failure is exactly why production systems summarize old turns or store memory in a database instead of a list.
Trim by turns, never by messages. The orphaned-tool_call_id error is the most common way a beginner's multi-turn agent breaks. Cutting at user boundaries is the simplest safe rule.
Keep the temperature low. At higher temperatures the model varies its tool path between turns, so a follow-up may take a different route than the question it's following up on. temperature=0.2 keeps the conversation coherent.

Everything past here — summarization, a vector store for long-term memory, per-user sessions, streaming the replies — is normal software wrapped around the same loop. The agent is still a while loop over a model call. Now it just has a list that remembers.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open part7_memory_agent.ipynb
Local Python: part7_memory_agent.py in the repo (python3 part7_memory_agent.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base and the tools for your school, your club, your project.

The full series

Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Part 3: Add Guardrails So Your AI App Doesn't Lie
Part 4: Run NVIDIA NIM on Your Own GPU
Part 5: From Chatbot to Agent — Tool Calling with NVIDIA NIM
Part 6: From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM
Part 7 (this post): Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM
Part 8: Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM
Part 9: Make Your Agent Return Data, Not Prose — Structured Outputs with NVIDIA NIM
Part 10: See What Your Agent Did — Tracing and Observability with NVIDIA NIM

A consolidated long-form version of the whole series is on Medium for anyone who'd rather read it in one sitting.

From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM

Torkian — Sun, 21 Jun 2026 23:08:53 +0000

In Part 5 we gave the model a list of tools and let it pick one. Ask the time, it calls the clock. Ask about the AI Club, it calls the retriever. That's already an agent — but a shallow one. Every question got answered in a single tool call.

Real questions aren't like that. "How many days until the next AI Club meeting?" has no single tool that answers it. The model has to search the knowledge base to learn the club meets on Thursday, then do date math on "Thursday" to count the days. Two tools, in order, where the second one can't run until the first one comes back.

That's the jump this post makes: from picking a tool to running a plan. The pattern has a name — ReAct, for Reason + Act — and it's the loop underneath almost every agent framework you'll meet later. We build it in plain Python on the same hosted NIM endpoint, and we print the trace so you can watch the agent work through its steps.

I'm B Torkian, NVIDIA Developer Champion at USC. Part 6 of the series.

What you're adding

User question
  → NIM call (with tools schema)
  → model calls a tool       (Act)
  → your code runs it, returns the result   (Observe)
  → NIM call again — model reads the result and decides:
        another tool?  →  loop
        done?          →  final answer       (Reason)
  → repeat until answered or you hit the step cap

Part 5 had this exact loop — but the demo questions only ever went around it once. Part 6 changes two things so it goes around multiple times on purpose:

A third tool that depends on another tool's output, so a single call can't finish the job.
A visible trace, so the multi-step reasoning shows up as control flow you can read.

The chat call from Part 1, the retriever from Part 2, and the refusal fallback from Parts 1 and 3 all carry forward unchanged.

What "multi-step" actually means here

A one-shot tool call looks like this:

Q: When does the AI Club meet?
model → search_campus_info("AI Club meeting") → "every Thursday at 5 PM" → A: Thursdays at 5 PM.

A multi-step plan looks like this:

Q: How many days until the next AI Club meeting?
model → search_campus_info("AI Club meeting day") → "every Thursday"
model reads that, then → days_until_weekday("Thursday") → "in 5 days, on June 18"
model reads that → A: The next meeting is this Thursday, June 18 — 5 days away.

Nothing in the framework changed. The same loop runs twice instead of once, because the model decided — after seeing the first result — that it needed a second tool. The intelligence is in the model choosing the sequence; your job is to give it good tools and a loop that doesn't fall over.
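To make "the same loop runs twice" concrete without an API key, here is a toy version in which a scripted function stands in for the model. Everything in it is an assumption made for illustration: `scripted_model`, the canned tool outputs, and the simplified message shapes are not the NIM API.

```python
def search_campus_info(query):
    return "The USC AI Club meets every Thursday at 5 PM."   # canned stand-in

def days_until_weekday(weekday):
    return f"The next {weekday} is in 5 day(s)."              # canned stand-in

TOOLS = {"search_campus_info": search_campus_info,
         "days_until_weekday": days_until_weekday}

def scripted_model(messages):
    """Stand-in for the model call: picks the next move from how many results it has seen."""
    observed = sum(1 for m in messages if m["role"] == "tool")
    if observed == 0:
        return {"tool": "search_campus_info", "args": {"query": "AI Club meeting day"}}
    if observed == 1:
        return {"tool": "days_until_weekday", "args": {"weekday": "Thursday"}}
    return {"answer": "The next meeting is Thursday, 5 days away."}

def run(question, max_steps=5):
    messages = [{"role": "user", "content": question}]
    trace = []
    for _ in range(max_steps):
        move = scripted_model(messages)
        if "answer" in move:                       # done: no more tools requested
            return move["answer"], trace
        result = TOOLS[move["tool"]](**move["args"])
        trace.append(move["tool"])
        messages.append({"role": "tool", "content": result})
    return "step limit reached", trace

answer, trace = run("How many days until the next AI Club meeting?")
print(trace)
print(answer)
```

The loop body never changes between the first and second pass. Only the message list grows, and that is what lets the next decision differ.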

Step 1 — Carry the setup forward

You need the client, MODEL, the knowledge_base, and retrieve_context from Parts 1, 2, and 5. The Colab notebook has a compact prerequisite cell; the standalone part6_react_agent.py defines everything from scratch so it runs on its own.

We stay on nvidia/llama-3.3-nemotron-super-49b-v1.5 — the same NVIDIA model we switched to in Part 5. It matters even more here: choosing one tool is forgiving, but sequencing tools (search first, calculate second) is where a weaker model loses the plot. Same hosted endpoint; only the model string is different from Parts 1–4.

MODEL = "nvidia/llama-3.3-nemotron-super-49b-v1.5"
LOCAL_TZ = "America/Los_Angeles"   # so "today" is consistent across the tools

Step 2 — Three tools, one of which forces chaining

The clock and the retriever you already know. The new one is days_until_weekday — and it's deliberately useless on its own. It needs a weekday as input, and the only way to learn the right weekday is to search the knowledge base first.

from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def get_current_time(timezone: str = LOCAL_TZ) -> str:
    try:
        zone = ZoneInfo(timezone)
    except Exception:
        zone = ZoneInfo("UTC")
    return datetime.now(zone).strftime("%A, %B %d, %Y at %I:%M %p %Z")

def search_campus_info(query: str) -> str:
    return retrieve_context(query, k=3)   # the Part 2 retriever, reused

def days_until_weekday(weekday: str) -> str:
    target = weekday.strip().capitalize()
    if target not in WEEKDAYS:
        return f"'{weekday}' is not a valid weekday."
    today = datetime.now(ZoneInfo(LOCAL_TZ))
    delta = (WEEKDAYS.index(target) - today.weekday()) % 7
    date_str = (today + timedelta(days=delta)).strftime("%B %d, %Y")
    if delta == 0:
        return f"Today is {target} ({date_str}) — that is 0 days away."
    return f"The next {target} is in {delta} day(s), on {date_str}."

That days_until_weekday dependency on search_campus_info is the whole lesson. It's what turns "call a tool" into "make a plan."
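The date math inside days_until_weekday comes down to one modulo expression. Here is a small sketch of that arithmetic with the date pinned so the result is reproducible; `days_until` is a simplified variant written for this example, taking `today` as a parameter instead of reading the clock.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def days_until(weekday: str, today: date) -> int:
    """Days from `today` to the next `weekday`; 0 if today is that weekday."""
    # date.weekday() uses Monday=0 .. Sunday=6, the same order as WEEKDAYS,
    # and % 7 wraps a negative difference around to the following week.
    return (WEEKDAYS.index(weekday) - today.weekday()) % 7

saturday = date(2026, 6, 13)                       # a Saturday, weekday() == 5

delta = days_until("Thursday", saturday)
print(delta)                                       # (3 - 5) % 7 = 5
print(saturday + timedelta(days=delta))            # 2026-06-18
print(days_until("Saturday", saturday))            # 0, same day
print(days_until("Monday", saturday))              # (0 - 5) % 7 = 2
```

Passing `today` in, rather than calling `datetime.now()` inside the function, is also what makes the calculation testable.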

Step 3 — Describe the tools, and hint at the order

The schema is what the model reads to decide what to call. For a multi-step agent, the descriptions should hint at sequence, not just purpose. Notice the last line of days_until_weekday:

tools = [
    {"type": "function", "function": {
        "name": "search_campus_info",
        "description": "Search the USC campus knowledge base for facts about "
                       "clubs, labs, workshops, office hours, tutoring, and the "
                       "NVIDIA Developer Program. Use this to find WHEN or WHERE "
                       "something happens. Always call this for any USC fact.",
        "parameters": {"type": "object",
            "properties": {"query": {"type": "string",
                "description": "The USC campus question or search phrase."}},
            "required": ["query"]},
    }},
    {"type": "function", "function": {
        "name": "get_current_time",
        "description": "Get the current date, day of week, and time. Use this when "
                       "the answer depends on what day or time it is right now.",
        "parameters": {"type": "object",
            "properties": {"timezone": {"type": "string",
                "description": "IANA time zone, e.g. America/Los_Angeles."}}},
    }},
    {"type": "function", "function": {
        "name": "days_until_weekday",
        "description": "Calculate how many days from today until the next given "
                       "weekday. Use this AFTER you know which day an event happens. "
                       "You usually have to call search_campus_info first.",
        "parameters": {"type": "object",
            "properties": {"weekday": {"type": "string",
                "description": "A weekday name, e.g. Monday, Thursday."}},
            "required": ["weekday"]},
    }},
]

available_tools = {
    "search_campus_info": search_campus_info,
    "get_current_time": get_current_time,
    "days_until_weekday": days_until_weekday,
}

"You usually have to call search_campus_info first" is prompt engineering aimed at the model's planner. Vague tool docs produce an agent that calls things in the wrong order or skips a step.

Step 4 — The ReAct loop, with the trace turned on

Same skeleton as Part 5, with three things worth slowing down for: a bigger step budget, a printed trace, and tool execution wrapped so a bad call can't crash the loop.

SYSTEM_PROMPT = (
    "/no_think\n\n"   # Nemotron: reasoning-off mode, see Part 5
    "You are a USC campus assistant that solves questions step by step using tools. "
    "Work in a loop: think about what you still need, call ONE tool to get it, read "
    "the result, then decide whether you can answer or need another tool. Many "
    "questions need more than one tool — to find how many days until an event, first "
    "search for the day it happens, then call days_until_weekday with that day. "
    "Base your final answer strictly on tool results. If the tools cannot answer, "
    "reply exactly: I don't have that information — check with the USC AI Club."
)

MAX_STEPS = 5   # multi-step questions need more room than Part 5's cap of 3

def run_agent(question: str, verbose: bool = True) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

    for step in range(1, MAX_STEPS + 1):
        response = client.chat.completions.create(
            model=MODEL, messages=messages, tools=tools,
            tool_choice="auto", temperature=0.2, max_tokens=400,
        )
        message = response.choices[0].message
        messages.append(message.model_dump(exclude_none=True))

        if not message.tool_calls:            # model is done → final answer
            return message.content or "I could not generate an answer. Please try again."

        for tool_call in message.tool_calls:  # run every tool it asked for
            name = tool_call.function.name
            try:
                arguments = json.loads(tool_call.function.arguments or "{}")
            except json.JSONDecodeError:
                arguments = {}

            if name not in available_tools:
                result = f"Tool '{name}' is not available."
            else:
                try:
                    result = available_tools[name](**arguments)
                except Exception as exc:       # a bad call must not kill the agent
                    result = f"Tool '{name}' failed: {exc}"

            if verbose:
                print(f"  step {step} · acting  -> {name}({json.dumps(arguments)})")
                print(f"  step {step} · observe <- {result}")

            messages.append({"role": "tool", "tool_call_id": tool_call.id,
                             "name": name, "content": str(result)})

    return "I reached the step limit before finishing — try asking a narrower question."

What changed from Part 5, and why:

MAX_STEPS = 5 — a one-shot loop can stop at 3. A planner needs room to search, calculate, and sometimes correct itself. Keep the cap small and visible; an agent with no hard stop will occasionally spiral.
The trace — printing acting -> and observe <- each iteration is the single most useful debugging habit for agents. When an agent misbehaves, it's almost always because it called the wrong tool or read the result wrong, and the trace shows you exactly which.
try/except around the tool call — the model writes the arguments, which means the model can write bad arguments. Catch it and hand the error back as a tool result; the agent will usually recover on the next step instead of crashing your program.
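The error handling described above can be exercised in isolation. The following is a sketch, not the repo's code: this `run_tool` takes the raw argument string and the tool table as parameters, and the stub tool exists only to give it something to call.

```python
import json

def run_tool(name, raw_arguments, available_tools):
    """Run one model-requested tool and always return a string, never raise."""
    try:
        arguments = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError:
        arguments = {}                      # malformed JSON: fall back to no arguments
    if name not in available_tools:
        return f"Tool '{name}' is not available."
    try:
        return str(available_tools[name](**arguments))
    except Exception as exc:                # wrong or missing arguments, or a bug in the tool
        return f"Tool '{name}' failed: {exc}"

def days_until_weekday(weekday: str) -> str:    # stub standing in for the real tool
    return f"next {weekday}"

tools = {"days_until_weekday": days_until_weekday}

ok = run_tool("days_until_weekday", '{"weekday": "Thursday"}', tools)
wrong_name = run_tool("days_until_weekday", '{"day": "Thursday"}', tools)   # bad argument name
truncated = run_tool("days_until_weekday", '{"weekday": ', tools)           # truncated JSON
unknown = run_tool("book_room", "{}", tools)                                # tool not exposed

for outcome in (ok, wrong_name, truncated, unknown):
    print(outcome)
```

Every failure comes back as an ordinary string, so it can be appended as a tool result and the model gets a chance to correct itself on the next step.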

Step 5 — Run it and read the trace

for question in [
    "How many days until the next USC AI Club meeting?",  # search -> days_until_weekday
    "Is the USC GPU lab open right now?",                 # clock + search, then reason
    "When does the USC AI Club meet?",                    # one tool is enough
    "What is the campus wifi password?",                  # nothing to find — refuse
]:
    print(f"Q: {question}")
    print(f"A: {run_agent(question, verbose=True)}\n")

What you should see in the trace:

Days until the meeting — two steps: search_campus_info returns "every Thursday," then days_until_weekday("Thursday") returns the count. The model only answers after the second observation.
Is the lab open right now — the model pulls the current day and hour from get_current_time, the posted hours (Mon–Fri, 10 AM–6 PM) from search_campus_info, then reasons about whether now is inside that window.
When does the club meet — one search, done. A good agent doesn't pad its plan with tools it doesn't need.
Wifi password — it searches, finds nothing, and falls back to the refusal line. The Part 3 refusal pattern still holds, now inside a multi-step loop.

Model behavior isn't perfectly deterministic — some runs take a slightly different path. That's worth seeing too: the trace lets you watch the variance instead of guessing about it.

Step 6 — What you actually built

The assistant can now reason across steps:

Workshop 1 gave it a brain (the chat call).
Workshop 2 gave it memory of facts (retrieval).
Workshop 3 gave it judgment (guardrails).
Workshop 4 gave it portability (hosted or local).
Workshop 5 gave it hands (one tool call).
Workshop 6 gave it a plan (chaining tools in a loop).

This is the architecture under LangGraph, CrewAI, AutoGen, and the rest. They add state machines, retries, sub-agents, and dashboards — but the center is the loop you just wrote: call the model with tools, run what it asks for, feed the result back, repeat. Common next steps:

More tools — a calendar, a ticketing API, a web search, a code sandbox.
A real planner that writes the full step list before any tool fires, instead of deciding one step at a time.
Memory across turns so the agent remembers what it already looked up.
Observability — that acting/observe trace, but logged and searchable. Production agents live or die on it.
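As a taste of what "logged and searchable" could look like, here is a minimal sketch that writes each acting/observe event as a JSON line and filters them back out. The function names, the field names, and the file location are assumptions chosen for this example, not a format the series defines.

```python
import json
import os
import tempfile
import time

def log_event(path, **event):
    """Append one trace event to the file as a single JSON line."""
    event["ts"] = time.time()
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

def search_events(path, **filters):
    """Return the logged events whose fields equal every given filter."""
    with open(path, encoding="utf-8") as f:
        events = [json.loads(line) for line in f if line.strip()]
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]

trace_path = os.path.join(tempfile.mkdtemp(), "trace.jsonl")

log_event(trace_path, step=1, kind="acting", tool="search_campus_info",
          arguments={"query": "AI Club meeting day"})
log_event(trace_path, step=1, kind="observe", tool="search_campus_info",
          result="every Thursday")
log_event(trace_path, step=2, kind="acting", tool="days_until_weekday",
          arguments={"weekday": "Thursday"})

acting = search_events(trace_path, kind="acting")
print([e["tool"] for e in acting])
print(len(search_events(trace_path, step=1)))
```

Unlike the printed trace, these lines survive after the terminal scrolls, and a filter like `kind="acting"` answers "which tools did it call?" without rereading the whole run.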

If you take one thing from the whole series: an LLM is a normal Python function with a weird interior, and an agent is a while loop around it. You own the loop. The model just fills in the blanks.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open part6_react_agent.ipynb
Local Python: part6_react_agent.py in the repo (python3 part6_react_agent.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base and the tools for your school, your club, your project.

The full series

Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Part 3: Add Guardrails So Your AI App Doesn't Lie
Part 4: Run NVIDIA NIM on Your Own GPU
Part 5: From Chatbot to Agent — Tool Calling with NVIDIA NIM
Part 6 (this post): From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM
Part 7: Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM
Part 8: Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM
Part 9: Make Your Agent Return Data, Not Prose — Structured Outputs with NVIDIA NIM
Part 10: See What Your Agent Did — Tracing and Observability with NVIDIA NIM

A consolidated long-form version of the whole series is on Medium for anyone who'd rather read it in one sitting.

From Chatbot to Agent — Tool Calling with NVIDIA NIM

Torkian — Tue, 26 May 2026 00:22:48 +0000

In Parts 1 through 4 we built a useful tool: a USC campus assistant that knows when to retrieve, when to refuse, and which endpoint to call. It is still a chatbot. The model writes a string; we print it. Everything interesting happened inside one model call.

This post turns it into an agent. By agent I mean something specific and small — the model can choose a tool from a list, your Python code runs that tool, and the result goes back into the conversation. That's it. No LangGraph, no AutoGen, no LangChain. Two functions, one loop, and a NIM call with tools=....

You'll watch the model decide for itself whether to consult the clock, search the USC knowledge base, or just answer directly. Once you see the loop, the framework abstractions on top of it are easier to read because you already know what they hide.

I'm B Torkian, NVIDIA Developer Champion at USC. Part 5 of the series.

What you're adding

User question
  → NIM call (with tools schema)
  → model returns either a final answer OR a tool_calls list
  → if tool_calls: run each one, append the result, NIM call again
  → repeat until model returns an answer (or hit the loop limit)

The chat call shape from Part 1 carries forward. The retriever from Part 2 becomes a tool. From Part 3's two guardrail layers, the scoped-prompt-and-fallback layer moves into the agent's system prompt — the grounding-check layer is set aside for now, because tool results replace retrieved context here. And the agent only gets to use tools we expose.

What "agent" actually means here

Most marketing pages use agent to mean "anything with a memory or a loop." For this post the definition is narrower and worth pinning down up front:

You describe a small number of Python functions to the model via a JSON schema (the tools parameter).
The model returns either a normal message OR a tool_calls field with the name and arguments of the function it wants to run.
Your code runs that function and appends the result to the message list as a message with role "tool", tagged with the matching tool_call_id.
You make another NIM call. The model sees the tool result and either calls another tool or writes the final answer.

That's the entire pattern. Real production agents add planning, retries, sub-agents, and observability. The center is still these four steps.
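To make those four steps concrete, here is a rough sketch of what the message list looks like after one tool round-trip. Every value in it is hypothetical (the call ID, the query, and the tool output are made up for illustration, not captured from a real NIM response); only the overall shape follows the OpenAI-compatible chat format this series uses.

```python
import json

# Hypothetical values throughout: the ID, the arguments, and the tool output
# are invented to show the shape, not taken from a real response.
messages = [
    {"role": "system", "content": "You are a USC campus assistant."},
    {"role": "user", "content": "When does the USC AI Club meet?"},
    # Step 2: the model asks for a tool instead of answering.
    {
        "role": "assistant",
        "tool_calls": [{
            "id": "call_001",
            "type": "function",
            "function": {
                "name": "search_campus_info",
                # Arguments arrive as a JSON string, not a dict.
                "arguments": json.dumps({"query": "AI Club meeting time"}),
            },
        }],
    },
    # Step 3: your code runs the function and reports back under the same ID.
    {
        "role": "tool",
        "tool_call_id": "call_001",
        "content": "- The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204.",
    },
]

# Step 4 would send this whole list back to the model for the next call.
roles = [m["role"] for m in messages]
print(roles)
```

The thing to look for is the pairing: the tool message carries the same ID the assistant message issued, which is how the model matches a result to its request.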

Step 1 — Carry forward the setup, and switch the model

You need everything from Parts 1, 2, and 3 — the client, MODEL, ask, knowledge_base, embed_texts, and retrieve_context. A compact prerequisite cell is in the Colab notebook for this workshop. The standalone script part5_agent.py in the repo defines everything from scratch so you can run it without any prior cell.

One change worth flagging up front. Parts 1-4 used meta/llama-3.1-8b-instruct, which is fast, cheap, and fine for chat and RAG. For Part 5 we switch to NVIDIA's own nvidia/llama-3.3-nemotron-super-49b-v1.5, a model NVIDIA tuned specifically for reasoning and tool use. The reason: tool calling is noticeably more reliable on it. I tested both; the 8B model called the right tool inconsistently across reruns (on some runs it refused instead), while Nemotron behaved the same way every time in my tests. It's a bigger, reasoning-tuned model, so each call takes longer, which is a fair trade once a model has to reliably choose between tools instead of just answering. Both run on the same hosted endpoint; only the MODEL string changes.

MODEL = "nvidia/llama-3.3-nemotron-super-49b-v1.5"   # was 'meta/llama-3.1-8b-instruct' in Parts 1-4

One Nemotron-specific detail worth knowing, because it will bite you otherwise. Nemotron is a reasoning model: by default it thinks out loud before answering, which eats your token budget and can leave the actual answer empty on harder turns. The fix is one short directive. Put /no_think at the top of the system prompt and it switches to direct-answer mode, which is exactly what you want for fast, predictable tool calling. Every system prompt from here on starts with it. (Reasoning mode is great for genuinely hard problems; you just do not want it for a snappy campus assistant.)
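If you would rather not rely on remembering the directive, a tiny helper can prepend it for you. This is a sketch, not part of the workshop code: the name with_no_think is hypothetical, and it assumes only what the paragraph above states, namely that /no_think belongs at the top of the system prompt.

```python
NO_THINK = "/no_think"  # the directive described above


def with_no_think(system_prompt: str) -> str:
    """Return the prompt with /no_think as its first line, without duplicating it."""
    stripped = system_prompt.lstrip()
    if stripped.startswith(NO_THINK):
        return stripped  # already present, leave it alone
    return f"{NO_THINK}\n\n{stripped}"


print(with_no_think("You are a USC campus assistant."))
```

Calling it twice on the same prompt is harmless, so it is safe to apply at the point where the messages list is built.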

Step 2 — Define two tiny tools

import json
from datetime import datetime
from zoneinfo import ZoneInfo

def get_current_time(timezone: str = "America/Los_Angeles") -> str:
    try:
        zone = ZoneInfo(timezone)
    except Exception:
        zone = ZoneInfo("UTC")
    return datetime.now(zone).strftime("%A, %B %d, %Y at %I:%M %p %Z")

def search_campus_info(query: str) -> str:
    # Reuse the retriever from Part 2 — the agent gets semantic search for free.
    return retrieve_context(query, k=3)

Two functions. Plain Python. They don't know anything about the model — the model has no idea they exist yet. That's fixed in the next step.

Step 3 — Describe the tools to the model in JSON schema

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_time",
            "description": "Get the current time in an IANA time zone.",
            "parameters": {
                "type": "object",
                "properties": {
                    "timezone": {
                        "type": "string",
                        "description": "IANA time zone, e.g. America/Los_Angeles or UTC.",
                    },
                },
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_campus_info",
            "description": "Search the USC campus assistant knowledge base for information about USC clubs (including AI Club), labs (GPU lab, robotics lab), workshops, faculty office hours, peer tutoring, and the NVIDIA Developer Program at USC. Always call this for any USC-related question — do not answer from your own knowledge.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The USC campus question or search phrase.",
                    },
                },
                "required": ["query"],
            },
        },
    },
]

available_tools = {
    "get_current_time": get_current_time,
    "search_campus_info": search_campus_info,
}

The schema is what the model sees. The names, descriptions, and parameter docs are how it decides which to call. Take these descriptions seriously — vague tool descriptions produce a confused agent.

The available_tools dict is the dispatch table on the Python side. Always pair the two — the schema describes intent, the dict provides execution.
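Because the schema and the dispatch table are maintained by hand, they can drift apart. One way to catch that early is a small consistency check run at startup. The sketch below is an assumption of mine rather than something from the repo: check_tools_in_sync is a hypothetical helper, the two tool functions are stubs standing in for the real ones, and the check only understands plain named parameters (no *args or **kwargs).

```python
import inspect


def check_tools_in_sync(tools, available_tools):
    """Return a list of mismatches between the JSON schema and the dispatch dict."""
    problems = []
    schema_names = {t["function"]["name"] for t in tools}
    for name in sorted(schema_names - set(available_tools)):
        problems.append(f"{name}: in schema but has no Python function")
    for name in sorted(set(available_tools) - schema_names):
        problems.append(f"{name}: has a Python function but no schema")

    for tool in tools:
        spec = tool["function"]
        name = spec["name"]
        if name not in available_tools:
            continue
        params = spec.get("parameters", {})
        declared = set(params.get("properties", {}))
        required = set(params.get("required", []))
        sig = inspect.signature(available_tools[name])
        accepted = set(sig.parameters)
        # Parameters without a default must be supplied by the model.
        needs_value = {
            p.name for p in sig.parameters.values()
            if p.default is inspect.Parameter.empty
        }
        for extra in sorted(declared - accepted):
            problems.append(f"{name}: schema declares '{extra}' but the function does not accept it")
        for missing in sorted(needs_value - required):
            problems.append(f"{name}: '{missing}' has no default but is not marked required")
    return problems


# Stubs with the same signatures as the two workshop tools.
def get_current_time(timezone: str = "America/Los_Angeles") -> str:
    return "stub"


def search_campus_info(query: str) -> str:
    return "stub"


tools = [
    {"type": "function", "function": {
        "name": "get_current_time",
        "parameters": {"type": "object", "properties": {"timezone": {"type": "string"}}},
    }},
    {"type": "function", "function": {
        "name": "search_campus_info",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    }},
]
available_tools = {
    "get_current_time": get_current_time,
    "search_campus_info": search_campus_info,
}

print(check_tools_in_sync(tools, available_tools))
```

An empty list means the two sides agree; anything else is worth fixing before the model ever sees the schema.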

Step 4 — The agent loop

def ask_agent(question: str) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "/no_think\n\n"
                "You are a USC campus assistant with two tools: "
                "get_current_time and search_campus_info. "
                "When the user asks something a tool can answer, call the tool, "
                "then write the final answer based on the tool's result. "
                "Do not call the same tool twice for the same question. "
                "If after using the tools you still cannot find the answer, "
                "reply exactly: I don't have that information — check with the USC AI Club."
            ),
        },
        {"role": "user", "content": question},
    ]

    for _ in range(3):                                # hard cap on loop iterations (model calls)
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=tools,
            tool_choice="auto",
            temperature=0.2,
            max_tokens=400,
        )
        message = response.choices[0].message
        messages.append(message.model_dump(exclude_none=True))

        if not message.tool_calls:                    # model finished — return its text
            return message.content or "I could not generate an answer. Please try again."

        for tool_call in message.tool_calls:
            name = tool_call.function.name
            arguments = json.loads(tool_call.function.arguments or "{}")

            if name not in available_tools:
                result = f"Tool {name} is not available."
            else:
                result = available_tools[name](**arguments)

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "name": name,
                "content": str(result),
            })

    return "I hit the tool loop limit."

Four things worth slowing down for:

tools=... and tool_choice="auto" — this is how the model knows it has tools available and that it can pick. "auto" means use a tool if useful, otherwise answer directly.
messages.append(message.model_dump(...)) — the model's tool-call request itself becomes part of the conversation. Skip this and the next NIM call has no idea why you're showing it a tool result.
The tool role — when you send the function's return value back, it has to be a message with role="tool" plus the matching tool_call_id. Get that ID wrong and the model treats the result as orphan text.
The loop cap (3 iterations) — agents that don't have a hard stop will sometimes spiral. Keep the cap visible and small for workshops; widen it as you understand the model's behavior.
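One gap in the loop as written: json.loads raises if the model emits malformed arguments, and the tool call raises if the arguments do not match the function's signature. A possible hardening, sketched here under my own assumptions, is to route every call through a helper that always returns a string, so a bad call becomes a tool result the model can read and react to instead of a crash. The name run_tool_call and the echo tool are hypothetical.

```python
import json


def run_tool_call(name, raw_arguments, registry):
    """Run one tool call and always return a string the model can read."""
    if name not in registry:
        return f"Tool {name} is not available."
    try:
        arguments = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError as exc:
        return f"Could not parse arguments for {name}: {exc.msg}"
    if not isinstance(arguments, dict):
        return f"Arguments for {name} must be a JSON object."
    try:
        return str(registry[name](**arguments))
    except TypeError as exc:
        # Typically an unexpected or missing keyword argument.
        return f"Bad arguments for {name}: {exc}"
    except Exception as exc:
        # Anything the tool itself raised; report it rather than crash the loop.
        return f"Tool {name} failed: {type(exc).__name__}: {exc}"


# A made-up tool, just to exercise the helper.
registry = {"echo": lambda text: text.upper()}

print(run_tool_call("echo", '{"text": "hi"}', registry))
print(run_tool_call("echo", '{"text": ', registry))
print(run_tool_call("echo", '{"wrong": 1}', registry))
print(run_tool_call("nope", "{}", registry))
```

Inside the loop, the body of the inner for would shrink to one call to this helper followed by the messages.append of the tool message.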

Step 5 — Run it

for question in [
    "What time is it in Los Angeles?",            # → uses get_current_time
    "When does the USC AI Club meet?",            # → uses search_campus_info
    "Can I get the wifi password?",               # → searches, finds nothing, refuses
]:
    print(f"Q: {question}")
    print(f"A: {ask_agent(question)}\n")

What you should see:

The clock question makes the model call get_current_time and answer from the returned string.
The AI Club question makes it call search_campus_info, read the retrieved chunks, and answer from them.
The wifi question makes it call search_campus_info, see that none of the chunks mention passwords, and fall back to the refusal line — the scoped-prompt guardrail from Part 3, delivered through a different control flow.

For a question that needs both tools (e.g. "what time is it and when does the club meet?"), the model may request both in one response or across successive iterations. The loop handles either case without changes: each iteration appends all the tool results and re-asks.

Step 6 — What you actually built

The full assistant is now agent-shaped:

Workshop 1 gave it a brain (the chat call).
Workshop 2 gave it memory of facts (retrieval).
Workshop 3 gave it judgment (guardrails).
Workshop 4 gave it portability (hosted or local).
Workshop 5 gave it hands (tool calling).

You still own the behavior — the model only gets to call functions you expose, with arguments it has to declare, inside a loop you control. Real systems extend each piece, but the spine is what you just built. The most common follow-ups are:

More tools (calendar, ticketing, web search, code execution sandboxes).
Structured outputs so the final answer is JSON, not prose.
A planner that decomposes a question into sub-questions before any tool fires.
Observability — log every tool call, every argument, every return value. Production agents live or die on this.

If you take one thing from the whole series, take this: an LLM is a normal Python function with a weird interior. Everything you've built — retrieval, guardrails, deployment, tool calling — is normal software wrapped around that function. Frameworks save typing; they don't change the model.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open part5_agent.ipynb
Local Python: part5_agent.py in the repo (python3 part5_agent.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base and the tools for your school, your club, your project, and run it wherever you are.

The full series

Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Part 3: Add Guardrails So Your AI App Doesn't Lie
Part 4: Run NVIDIA NIM on Your Own GPU
Part 5 (this post): From Chatbot to Agent — Tool Calling with NVIDIA NIM
Part 6: From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM
Part 7: Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM
Part 8: Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM
Part 9: Make Your Agent Return Data, Not Prose — Structured Outputs with NVIDIA NIM
Part 10: See What Your Agent Did — Tracing and Observability with NVIDIA NIM

A consolidated long-form version of the whole series is on Medium for anyone who'd rather read it in one sitting.

Run NVIDIA NIM on Your Own GPU — Same API, Different Endpoint

Torkian — Mon, 25 May 2026 03:08:42 +0000

For Parts 1 through 3 we've been calling NIM through NVIDIA's hosted API Catalog at build.nvidia.com. That's the right starting point. It is also not the only place NIM runs.

NIM ships as a Docker container that exposes the same OpenAI-compatible HTTP API on a local port. Pull the image, run it on a box with an NVIDIA GPU, and the only thing that changes in the Python client is the base_url. The ask() function from Part 1, the retriever from Part 2, and the guardrails from Part 3 all keep working against the new endpoint, unchanged.

This post walks through the swap and the reasons you might want it.

I'm B Torkian, NVIDIA Developer Champion at USC. Same series, same code, just moving where inference happens.

Why bother running NIM locally

The hosted API Catalog is the right default. Don't switch until at least one of these matters:

Data locality. The data you're sending the model has to stay on a machine you control. (Common at universities, hospitals, regulated industries.) USC has a research GPU cluster — for projects where the source documents can't leave that environment, the model has to come to the data, not the other way around.
Predictable latency. Network round-trip, queue time, and first-token latency add up. A locally hosted model gives you a tighter, more predictable budget.
A real understanding of what's in the box. The hosted API hides a lot of useful detail. Running the container yourself surfaces the model files, the inference server, the GPU memory layout, and what knobs you actually have.
Cost at scale. Past a certain volume, running the model on hardware you already own becomes cheaper than per-token billing.

None of those matter for a 30-minute workshop. All of them might matter for the project the workshop is teaching you to build.

What you need

An NVIDIA GPU with enough VRAM for the model you want to run. For meta/llama-3.1-8b-instruct (the model we've been using), expect roughly 16 GB of VRAM. Heavier models want more.
Linux (native or WSL2). NIM containers expect the NVIDIA Container Toolkit, which means the --runtime=nvidia Docker flag works.
Docker with the NVIDIA Container Toolkit installed. Test with docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi — it should print your GPU.
An NGC API key. The key you already have from build.nvidia.com works for pulling NIM images; if not, generate one at ngc.nvidia.com.

If you don't have a GPU box on hand, the rest of the workshop still teaches you something useful — the API shape is identical, so when you do get one, the Python client code does not change.

Step 1 — Log in to NVIDIA's container registry

export NGC_API_KEY="nvapi-...your-key..."
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

The literal username $oauthtoken is correct — that's NGC's convention for API-key logins. Don't substitute anything for it.

Step 2 — Pull and run the NIM container

docker run -it --rm \
  --name llama-3.1-8b-instruct \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v "$HOME/.cache/nim:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

A few notes:

First run is slow. The image is large and the model weights download on first launch. The -v cache mount means subsequent runs are fast.
Use the exact image tag from the model's Deploy tab on build.nvidia.com. The example above uses :latest, but pinning a specific version is safer for reproducibility.
The container listens on port 8000. That's what -p 8000:8000 exposes to your host.

When the container finishes loading it will log something like Application startup complete. Uvicorn running on http://0.0.0.0:8000. That's your signal that the OpenAI-compatible endpoint is live.

Step 3 — Verify the endpoint with curl

curl http://localhost:8000/v1/models

You should see a JSON response listing the loaded model. If curl hangs or returns connection-refused, the container hasn't finished loading yet — give it another minute and try again.
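If you want the same readiness check from Python rather than retrying curl by hand, a small polling loop works. This is a sketch using only the standard library; the function names are hypothetical, and it assumes the endpoint returns the usual OpenAI-style listing with a "data" array of objects that each carry an "id".

```python
import json
import time
import urllib.request


def fetch_model_ids(base_url, timeout=5.0):
    """GET {base_url}/models and return the model IDs it lists."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumed OpenAI-style shape: {"data": [{"id": "..."}, ...]}
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url, attempts=30, delay=10.0, fetch=fetch_model_ids):
    """Poll until the endpoint lists at least one model; return the IDs, or [] on timeout."""
    for attempt in range(attempts):
        try:
            ids = fetch(base_url)
            if ids:
                return ids
        except (OSError, ValueError):
            # OSError covers connection refused and timeouts (URLError is a subclass);
            # ValueError covers a response body that is not valid JSON yet.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return []


# Example call once the container is starting:
# print(wait_until_ready("http://localhost:8000/v1"))
```

The fetch parameter exists only so the polling logic can be exercised without a running container.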

Step 4 — Point the Python client at localhost

This is the entire Python change.

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:8000/v1',          # ← was 'https://integrate.api.nvidia.com/v1'
    api_key='not-needed-for-local-dev',           # local NIM doesn't validate the key
)

MODEL = 'meta/llama-3.1-8b-instruct'              # same model name as the hosted endpoint

def ask(system_prompt, user_message):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user',   'content': user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content

print(ask(
    system_prompt='You are a concise USC campus assistant.',
    user_message='What does NVIDIA NIM stand for?',
))

Two lines changed — base_url and api_key. The ask() function is the same one we've been using since Part 1. The campus assistant, the embedding retriever, and the guardrail layers from Parts 2 and 3 all run against this client without any further changes.

The repo's part4_local_nim.py reads NIM_BASE_URL from your environment so the same script runs against the hosted endpoint by default and against local NIM when you set the env var. That makes it easy to A/B the two.
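The repo's script may do this differently, but the idea can be sketched in a few lines. The helper name resolve_endpoint is hypothetical; the NIM_BASE_URL and NVIDIA_API_KEY variable names and the two URLs come from this series. The assumption baked in is that a key is mandatory for the hosted endpoint and optional for a local container.

```python
import os

HOSTED_BASE_URL = "https://integrate.api.nvidia.com/v1"


def resolve_endpoint(env=None):
    """Pick base_url and api_key: hosted by default, local when NIM_BASE_URL is set."""
    env = os.environ if env is None else env
    base_url = env.get("NIM_BASE_URL", "").strip() or HOSTED_BASE_URL
    if base_url == HOSTED_BASE_URL:
        api_key = env.get("NVIDIA_API_KEY", "")
        if not api_key:
            raise RuntimeError("Set NVIDIA_API_KEY to use the hosted endpoint.")
    else:
        # A local NIM container does not validate the key, so any placeholder works.
        api_key = env.get("NVIDIA_API_KEY") or "not-needed-for-local-dev"
    return {"base_url": base_url, "api_key": api_key}


print(resolve_endpoint({"NIM_BASE_URL": "http://localhost:8000/v1"}))
```

The returned dict can be unpacked straight into the client constructor, as in OpenAI(**resolve_endpoint()).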

Step 5 — Same code, two endpoints (the test that matters)

# Hosted run (what we've done in Parts 1-3)
python3 part4_local_nim.py

# Local NIM run — point the same script at the container
NIM_BASE_URL=http://localhost:8000/v1 python3 part4_local_nim.py

Both should produce the same shape of output — the same ask() call, the same model name, just inference happening in a different place. That's the whole point of an OpenAI-compatible API surface — the application code stops caring where the model lives.

When to use which

Situation	Use
Workshop, prototype, demo, course project	Hosted (integrate.api.nvidia.com)
Sensitive data that can't leave a controlled environment	Local NIM on cluster GPU
Latency-critical inner loop, large concurrent load	Local NIM on a sized-up node
First-time student, no GPU on hand	Hosted (don't even mention local until they ask)
Production with a known traffic profile	Either, depending on cost crossover

There is no "winner" here. The hosted API and self-hosted NIM are the same product with different deployment footprints. The thing worth internalizing — and what this post is really about — is that your Python code does not have to care.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab for the hosted version: Open part4_local_nim.ipynb
Local Python: part4_local_nim.py in the repo. Defaults to the hosted endpoint; set NIM_BASE_URL=http://localhost:8000/v1 to point at a local NIM container.

MIT licensed. I run this at USC against both endpoints — fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.

The full series

Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Part 3: Add Guardrails So Your AI App Doesn't Lie
Part 4 (this post): Run NVIDIA NIM on Your Own GPU
Part 5: From Chatbot to Agent — Tool Calling with NVIDIA NIM
Part 6: From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM
Part 7: Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM
Part 8: Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM
Part 9: Make Your Agent Return Data, Not Prose — Structured Outputs with NVIDIA NIM
Part 10: See What Your Agent Did — Tracing and Observability with NVIDIA NIM

Follow this series on dev.to (the series widget at the top of each post lists every published part in order).

Add Guardrails So Your AI App Doesn't Lie — A Two-Layer Approach with NVIDIA NIM

Torkian — Sun, 24 May 2026 00:01:22 +0000

In Part 1 we got a USC campus assistant talking. In Part 2 we taught it to retrieve only the relevant context. Both posts ended with the same observation — when someone asked for the wifi password, the assistant refused. That refusal worked because we told it to. It would have just as happily made something up if we'd phrased the prompt differently.

This post is about hardening that refusal so it's not luck. Two guardrail layers, both small enough to read in one sitting, neither requiring a framework. First, tighten the prompt so the assistant knows what it's allowed to talk about. Second, add a second LLM call that re-reads the answer and the context and decides whether to ship the answer or refuse.

I'm B Torkian, NVIDIA Developer Champion at USC. This is the layer where a demo becomes something I'd actually let students use.

What you're adding

User question
  → retrieve top-k context (from Part 2)
  → scoped prompt: model answers OR returns the exact fallback line
  → grounding check: a second NIM call asks "is the answer supported by the context?"
  → ship the answer, or replace it with the fallback line

The chat call and the embedding setup carry over from Parts 1 and 2. Everything new in this post is fewer than 40 lines.

Why guardrails are not optional

The retrieval step from Part 2 narrowed what the model sees. It does nothing to stop the model from being clever with the data it has, or from drifting into topics outside the assistant's job.

Two real failure modes I've seen in student demos:

Out-of-scope creep. Someone asks "can you write my breakup text?" The model is happy to oblige. The retriever pulled three USC chunks (cosine just returns something), the prompt didn't forbid relationship advice, so the model wrote the text.
Confident-sounding hallucinations. The retrieved chunk says "Monday to Friday, 10 AM to 6 PM." The user asks about Saturday hours. The model decides the friendly answer is "Saturday hours are 11 AM to 4 PM" — a fabrication that sounds like a reasonable inference.

The first failure is solved by prompt scope. The second is what the grounding check is for.

Step 1 — Setup (self-contained)

If you already have Workshops 1 + 2 running in the same Colab session, skip this cell. If you're starting fresh, paste this in — it bundles the client, the embedding model, the USC knowledge base, and the retriever from Parts 1 and 2 so the rest of this post stands on its own.

%pip install -q openai numpy

import os, getpass
from openai import OpenAI
import numpy as np

if not os.getenv("NVIDIA_API_KEY"):
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("Paste your NVIDIA API key (starts with nvapi-): ")

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

MODEL = "meta/llama-3.1-8b-instruct"
EMBED_MODEL = "nvidia/nv-embedqa-e5-v5"

def ask(system_prompt, user_message):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user",   "content": user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content

knowledge_base = [
    {"title": "USC AI Club meeting",
     "text": "The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204."},
    {"title": "USC GPU lab hours",
     "text": "The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM."},
    {"title": "NVIDIA Developer Program",
     "text": "USC students can join the NVIDIA Developer Program for free."},
    {"title": "Next USC workshop",
     "text": "The next USC AI Club workshop will cover Retrieval Augmented Generation (RAG)."},
    {"title": "USC AI/ML office hours",
     "text": "Office hours for the USC AI/ML faculty are Tuesdays 2-4 PM."},
    {"title": "USC robotics lab",
     "text": "The USC robotics lab requires safety training before students can use the soldering station."},
    {"title": "USC tutoring",
     "text": "Peer tutoring for introductory Python at USC is available Wednesdays from 1 PM to 3 PM."},
]

def embed_texts(texts, input_type="passage"):
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=texts,
        extra_body={"input_type": input_type},
    )
    return [np.array(item.embedding, dtype=np.float32) for item in response.data]

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)

def retrieve_context(question, k=3):
    q_emb = embed_texts([question], input_type="query")[0]
    scored = [(cosine_similarity(q_emb, item["embedding"]), item) for item in knowledge_base]
    scored.sort(key=lambda p: p[0], reverse=True)
    return "\n".join(f"- {item['text']}" for _, item in scored[:k])

for item, emb in zip(knowledge_base, embed_texts([i["text"] for i in knowledge_base], "passage")):
    item["embedding"] = emb

print(f"Ready. Embedded {len(knowledge_base)} chunks.")

That cell defines everything Workshops 1 and 2 produced. The Part 3 code below builds on ask, retrieve_context, and the embedded knowledge_base.

Step 2 — Layer 1: prompt scope with a fixed fallback line

FALLBACK = "I don't have that information — check with the USC AI Club."

SCOPED_SYSTEM_PROMPT_TEMPLATE = """You are a USC campus assistant for AI Club,
GPU lab, NVIDIA program, workshop, office hour, robotics lab, and tutoring
questions only.

Rules:
- Answer ONLY using the CONTEXT below.
- If the user asks about anything outside this scope (e.g. weather, jokes,
  personal advice, code generation, general world knowledge), reply with
  exactly: "{fallback}"
- If the answer is not present in the context, reply with exactly: "{fallback}"
- Do not invent names, dates, room numbers, links, passwords, schedules,
  policies, or instructions that are not in the context.

CONTEXT:
{context}
"""

Three things are doing work in this prompt:

A finite topic list. The assistant has a job description. "Anything outside this scope" gives the model a clear opt-out — it doesn't have to guess what's in-bounds.
One exact fallback string. Same wording, every time. This matters in Step 4: when the grounding check fails, ask_guarded() returns this same string, so downstream code only has to recognize one shape.
An explicit don't-invent list. Models are pliable. Spelling out the dangerous categories (room numbers, passwords, policies) lowers hallucination noticeably with no extra calls.

This layer alone catches most off-topic and most "the context didn't mention it" cases.

Step 3 — Layer 2: a grounding check on every answer

The scoped prompt is a request — the model can still ignore it. Layer 2 is a separate, narrower NIM call whose only job is to look at the context and the answer and decide whether the answer is supported.

def answer_is_grounded(question: str, context: str, answer: str) -> bool:
    verdict = ask(
        system_prompt=(
            "You are a strict grounding verifier. Read the CONTEXT and the "
            "ANSWER. Respond with only 'yes' or 'no'. Say 'yes' if every "
            "factual claim in the ANSWER is directly supported by the CONTEXT. "
            "Say 'no' otherwise — including if the ANSWER adds information not "
            "in the CONTEXT, even if that information sounds plausible."
        ),
        user_message=(
            f"CONTEXT:\n{context}\n\n"
            f"QUESTION:\n{question}\n\n"
            f"ANSWER:\n{answer}\n\n"
            "Is every factual claim in the ANSWER supported by the CONTEXT?"
        ),
    )
    return verdict.strip().lower().startswith("yes")

Three things to notice:

It's just another ask() call — same client, same hosted NIM model, no new infrastructure. Layer 2 costs one extra call per question.
Yes/no only. Constraining the response shape makes the parsing reliable. We check only the start of the string: anything that doesn't begin with "yes" — a "no", an "it depends", a hedged paragraph — counts as a fail. (Note the flip side: a verdict that begins with "yes" passes even if it hedges afterward. Stricter parsing is a good exercise.)
It can be wrong too. The verifier is itself an LLM. For workshop-grade safety this is fine; for production you'd add deterministic checks (regex for room numbers, exact string match for fallback) on top.
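Two of the ideas in that list can be sketched without any model call: stricter verdict parsing, and a crude deterministic check for numbers the context never mentioned. Both helpers below are hypothetical and deliberately simple. The number check in particular is an assumption about what "deterministic checks" might look like; it compares digit strings literally, so "5 PM" and "5:00 PM" would not match.

```python
import re


def strict_verdict(verdict: str) -> bool:
    """Pass only if the verifier said 'yes' and nothing else (case and end punctuation aside)."""
    return verdict.strip().strip(".!\"'").lower() == "yes"


def unsupported_numbers(context: str, answer: str) -> set:
    """Return numbers (or h:mm times) that appear in the answer but nowhere in the context."""
    def find(text):
        return set(re.findall(r"\d+(?::\d+)?", text))
    return find(answer) - find(context)


context = "The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM."
invented = "Saturday hours are 11 AM to 4 PM."
faithful = "The lab is open from 10 AM to 6 PM on weekdays."

print(strict_verdict("Yes."))
print(strict_verdict("Yes, although the Saturday hours are inferred."))
print(sorted(unsupported_numbers(context, invented)))
print(sorted(unsupported_numbers(context, faithful)))
```

A non-empty set from unsupported_numbers is a cheap signal to fall back before the LLM verifier is even consulted.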

Step 4 — Wire both layers into `ask_guarded()`

def ask_guarded(question: str) -> str:
    context = retrieve_context(question)              # from Part 2
    system_prompt = SCOPED_SYSTEM_PROMPT_TEMPLATE.format(
        fallback=FALLBACK, context=context,
    )
    answer = ask(system_prompt, question)             # Layer 1
    if not answer_is_grounded(question, context, answer):
        return FALLBACK                               # Layer 2 override
    return answer

for question in [
    "When does the USC AI Club meet?",        # in scope, in context
    "Can you write my breakup text?",         # OUT of scope
    "What is the wifi password?",             # in scope, NOT in context
    "What are the USC GPU lab Saturday hours?",   # invites a hallucination
]:
    print(f"Q: {question}")
    print(f"A: {ask_guarded(question)}\n")

Read the output carefully.

The AI Club question returns a real answer from the context. Both layers pass.
The breakup-text question hits Layer 1 — the scope rule catches it.
The wifi question also hits Layer 1 — nothing in the context mentions passwords, the scoped prompt forbids inventing them.
The Saturday-hours question is the one that earns its keep. The context says "Monday to Friday." A friendlier model would guess "closed on Saturday." Layer 2 reads that answer, sees "Saturday" is not in the context, and returns the fallback instead.

Step 5 — What you actually built

You took the retriever from Part 2 and put it inside two cheap, inspectable guardrails. The whole thing is still one Python file, still one hosted NIM endpoint, still no vector database. The mental model is:

Retrieval decides what the model sees.
Scoped prompt decides what the model is allowed to write.
Grounding check decides whether what the model wrote ships.

Real production systems extend each of these — deterministic rule checks, structured output, confidence thresholds, dedicated safety models, human review queues. The shape stays the same. Every additional layer is a yes/no gate between the user's question and the final response.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab for Part 3: Open part3_guardrails.ipynb
Local Python: part3_guardrails.py in the repo (python3 part3_guardrails.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.

The full series

Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Part 3 (this post): Add Guardrails So Your AI App Doesn't Lie
Part 4: Run NVIDIA NIM on Your Own GPU
Part 5: From Chatbot to Agent — Tool Calling with NVIDIA NIM
Part 6: From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM
Part 7: Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM
Part 8: Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM
Part 9: Make Your Agent Return Data, Not Prose — Structured Outputs with NVIDIA NIM
Part 10: See What Your Agent Did — Tracing and Observability with NVIDIA NIM

Follow this series on dev.to (the series widget at the top of each post lists every published part in order).

From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM

Torkian — Sat, 23 May 2026 00:33:15 +0000

In Part 1, we built a USC campus assistant by pasting a five-line knowledge base directly into the prompt. That works when "the data" fits in your head. It stops being cute the moment the campus handbook, club docs, and workshop notes all want a seat at the same prompt window.

The fix is retrieval — store the chunks once, and at query time pull only the few that look relevant. That's what RAG (Retrieval-Augmented Generation) actually means once you strip away the marketing.

This post takes the assistant from Part 1 and bolts on a real retriever, using NVIDIA's hosted embedding model. No vector database, no LangChain, no abstraction layer. A Python list and NumPy are enough to understand what's actually happening. Once you've seen the moving parts, swapping in pgvector or Pinecone later is a fifteen-minute job.

I'm B Torkian, NVIDIA Developer Champion at USC. Same workshop series, same campus, one more capability added.

What you're adding

User question → embed query → compare to stored chunks → pick top-k → send only those to the LLM → answer

The model call itself barely changes. The work is in the three middle stages of that pipeline: turn text into vectors, compare vectors, return the closest chunks.

Why the manual approach from Part 1 breaks

In Part 1, the entire knowledge base sat inside the prompt:

campus_info = """
The USC AI Club meets every Thursday at 5 PM...
The USC GPU computing lab is open Monday to Friday...
...
"""

Five lines is fine. But every model has a context window, and every token costs money and latency. You don't want to paste the entire USC student handbook into every question — most of it is irrelevant to "when does the AI Club meet?"

Retrieval is the answer to "which 3 paragraphs out of 3000 are actually about this question?" You compute that before calling the LLM, then send only the winners.

What an embedding actually is

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Two texts that mean similar things land near each other in vector space. Two texts that mean different things land far apart.
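
To make "near" and "far" concrete, here is a toy sketch with hand-written three-number vectors. The vectors are invented for illustration; real embeddings come from the model and have far more dimensions. It uses only the standard library so it runs anywhere.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical vectors, made up for this example.
club_meeting = [0.9, 0.1, 0.0]
club_schedule = [0.8, 0.2, 0.1]
wifi_password = [0.0, 0.1, 0.9]

similar = cosine(club_meeting, club_schedule)
different = cosine(club_meeting, wifi_password)
print(f"similar pair:   {similar:.2f}")
print(f"different pair: {different:.2f}")
```

The two club vectors point in almost the same direction, so their score lands close to 1. The wifi vector points somewhere else entirely, so its score lands close to 0.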

NVIDIA's nv-embedqa-e5-v5 is an embedding model tuned specifically for question-answer retrieval. It has a quirk worth knowing about up front — it treats queries and passages differently. You tell it which one you're embedding via an input_type parameter. Getting this wrong is the most common beginner mistake — it still runs, but retrieval quality drops noticeably.

input_type='passage' → use for the documents you store
input_type='query' → use for the user's question at search time

That's it. Same model, two modes.
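
One cheap way to avoid a mistyped mode is to build the request arguments in one place. The helper below is a sketch: `embedding_request` is a hypothetical name, and it only knows the two modes this post uses. It does not call the API; it returns the keyword arguments you would pass to `client.embeddings.create()`.

```python
EMBED_MODEL = 'nvidia/nv-embedqa-e5-v5'
ALLOWED_INPUT_TYPES = {'passage', 'query'}  # the two modes used in this post

def embedding_request(texts, input_type):
    """Build kwargs for client.embeddings.create(), refusing a mistyped mode."""
    if input_type not in ALLOWED_INPUT_TYPES:
        raise ValueError(
            f"input_type must be one of {sorted(ALLOWED_INPUT_TYPES)}, got {input_type!r}"
        )
    return {
        'model': EMBED_MODEL,
        'input': list(texts),
        # Provider-specific argument, sent through extra_body.
        'extra_body': {'input_type': input_type},
    }

store_kwargs = embedding_request(['The USC AI Club meets every Thursday at 5 PM.'], 'passage')
search_kwargs = embedding_request(['When does the AI Club meet?'], 'query')
print(store_kwargs['extra_body'], search_kwargs['extra_body'])
```

This catches typos such as `'document'`. It cannot catch the real beginner mistake of passing `'passage'` where `'query'` was meant, so the habit still matters: passages when storing, queries when searching.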

Step 1: Set up the client and `ask()` from Part 1

If you're continuing from Part 1, you already have these defined and can skip this cell. If you're starting fresh, paste this in first — everything later builds on it.

%pip install -q openai numpy

import os, getpass
from openai import OpenAI

if not os.getenv('NVIDIA_API_KEY'):
    os.environ['NVIDIA_API_KEY'] = getpass.getpass('Paste your NVIDIA API key (starts with nvapi-): ')

client = OpenAI(
    base_url='https://integrate.api.nvidia.com/v1',
    api_key=os.environ['NVIDIA_API_KEY'],
)

MODEL = 'meta/llama-3.1-8b-instruct'

def ask(system_prompt, user_message):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user',   'content': user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content

client calls NVIDIA's API Catalog. ask() is the same chat-completion shape from Part 1. The retriever we're about to build slots in next to these, not instead of them.

Step 2: Build a small knowledge base and embed it as passages

import numpy as np

EMBED_MODEL = 'nvidia/nv-embedqa-e5-v5'

knowledge_base = [
    {'title': 'USC AI Club meeting',
     'text': 'The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204.'},
    {'title': 'USC GPU lab hours',
     'text': 'The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM.'},
    {'title': 'NVIDIA Developer Program',
     'text': 'USC students can join the NVIDIA Developer Program for free.'},
    {'title': 'Next USC workshop',
     'text': 'The next USC AI Club workshop will cover Retrieval Augmented Generation (RAG).'},
    {'title': 'USC AI/ML office hours',
     'text': 'Office hours for the USC AI/ML faculty are Tuesdays 2-4 PM.'},
    {'title': 'USC robotics lab',
     'text': 'The USC robotics lab requires safety training before students can use the soldering station.'},
    {'title': 'USC tutoring',
     'text': 'Peer tutoring for introductory Python at USC is available Wednesdays from 1 PM to 3 PM.'},
]

def embed_texts(texts, input_type='passage'):
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=texts,
        extra_body={'input_type': input_type},
    )
    return [np.array(item.embedding, dtype=np.float32) for item in response.data]

# Embed every chunk once, as a passage. Store the vector alongside the text.
embeddings = embed_texts([item['text'] for item in knowledge_base], input_type='passage')
for item, embedding in zip(knowledge_base, embeddings):
    item['embedding'] = embedding

print(f'Embedded {len(knowledge_base)} chunks. Vector dim:', embeddings[0].shape[0])

Two things to notice:

The OpenAI Python client doesn't have a native field for NVIDIA's input_type, so we pass it through extra_body. That's the right way to send provider-specific arguments without forking the client.
We're storing the embeddings in plain Python dicts. For seven chunks this is fine. For seven thousand, you'd reach for a vector database (and the only thing that changes is where the vectors live; the cosine math is identical).

Step 3: Retrieve the top-k chunks for a question

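def cosine_similarity(a, b):
    # Cosine similarity = dot product divided by the product of the vector lengths.
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    if denominator == 0:
        # A zero-length vector has no direction; treat it as unrelated.
        return 0.0
    return float(np.dot(a, b) / denominator)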

def retrieve_context(question, k=3):
    question_embedding = embed_texts([question], input_type='query')[0]

    scored = []
    for item in knowledge_base:
        score = cosine_similarity(question_embedding, item['embedding'])
        scored.append((score, item))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_items = [item for score, item in scored[:k]]

    return '\n'.join(f"- {item['text']}" for item in top_items)

Three things are happening here:

The question is embedded as a query, not a passage. This is the part beginners trip over. Same model, different mode.
Cosine similarity scores how close the question vector is to each stored chunk vector. Numbers near 1.0 mean very similar; numbers near 0 mean unrelated.
Top-k picks the highest-scoring chunks. Three is a reasonable default for a tiny knowledge base; tune it for yours.

There is no magic in step 3. A vector database would do the same comparison but use indexing tricks to do it fast at scale.
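
As a sketch of the same brute-force comparison written a different way, `heapq.nlargest` keeps only the best k scores instead of sorting the whole list. The toy titles and vectors are invented, and this is still an exhaustive scan, not an index: it is the thing a vector database is built to avoid at scale.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, items, k=3):
    """Return the k (score, title) pairs with the highest cosine similarity."""
    scored = ((cosine(query_vec, item['embedding']), item['title']) for item in items)
    return heapq.nlargest(k, scored)

# Hypothetical three-dimensional "embeddings" for illustration.
toy_items = [
    {'title': 'club meeting', 'embedding': [0.9, 0.1, 0.0]},
    {'title': 'lab hours', 'embedding': [0.1, 0.9, 0.1]},
    {'title': 'tutoring', 'embedding': [0.2, 0.3, 0.9]},
]

results = top_k([1.0, 0.0, 0.0], toy_items, k=2)
for score, title in results:
    print(f"{score:.2f}  {title}")
```

For seven chunks the difference from `sort` is invisible. The point is only that "retrieve" is a loop over scores, whichever way you write it.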

Step 4: Plug retrieval into the same `ask()` from Part 1

def ask_with_retrieval(question):
    context = retrieve_context(question)

    system_prompt = f"""You are a USC campus assistant. Answer ONLY using the
context below. If the answer is not in the context, say
"I don't have that information — check with the USC AI Club."

CONTEXT:
{context}
"""

    return ask(system_prompt, question)

for question in [
    'Where does the USC AI Club meet?',
    'When can I get Python tutoring at USC?',
    'What is the wifi password?',
]:
    print(f'Q: {question}')
    print(f'Context:\n{retrieve_context(question)}')
    print(f'A: {ask_with_retrieval(question)}\n')

Run it. Three things to read carefully:

The first question retrieves the AI Club chunk and answers from it. Good.
The second retrieves the tutoring chunk and answers from it. The stored text says "peer tutoring for introductory Python" — not the exact phrase "Python tutoring" — and the embedding model matches them on meaning. (A keyword search would also have found this one; the semantic win gets bigger as your data grows and the wording diverges from the question.)
The wifi question retrieves three chunks anyway (top-k always returns k items), but none of them contain a password. The assistant falls back to the refusal line because the ONLY using the context rule forces it to. That's the guardrail from Part 1 doing its job — and it's exactly the bridge into Part 3.
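
Because top-k always returns k items, one optional refinement is a minimum-similarity cutoff so that clearly unrelated chunks never reach the prompt. This is a sketch and not part of the post's code: `filter_by_score` and the 0.3 threshold are assumptions, the scores are invented, and a usable threshold has to be tuned against real scores from your embedding model.

```python
def filter_by_score(scored, k=3, min_score=0.3):
    """Keep at most k (score, text) pairs whose score clears min_score."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [(score, text) for score, text in ranked[:k] if score >= min_score]

# Invented scores for illustration only.
wifi_scores = [(0.12, 'GPU lab hours'), (0.09, 'AI Club meeting'), (0.05, 'Tutoring')]
club_scores = [(0.81, 'AI Club meeting'), (0.22, 'Next workshop'), (0.10, 'GPU lab hours')]

wifi_kept = filter_by_score(wifi_scores)
club_kept = filter_by_score(club_scores)
print(wifi_kept)
print(club_kept)
```

When nothing clears the bar, the context is empty and the "answer ONLY from the context" rule still produces the refusal line. The cutoff does not replace the prompt rule; it just stops handing the model irrelevant text.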

Step 5: What you actually did

You replaced the hand-picked campus_info string from Part 1 with a real retrieval step. The model call is identical, and the system prompt follows the same guardrail pattern — answer only from the provided context, otherwise fall back. The only structural change is that {context} now comes from a function instead of a hardcoded constant.

That swap is the entire mental model behind RAG. Real production systems add chunking strategies, hybrid search, re-ranking, and a vector database — but the spine stays the same: embed once, embed query, compare, pass top-k to the LLM.

In your own work, the seven-line knowledge_base becomes hundreds of paragraphs scraped from PDFs, lecture notes, club Slack archives, Notion pages, or a wiki. The retriever code doesn't change. The dict-with-vector storage gets replaced by something like pgvector, Qdrant, or Pinecone the moment you outgrow a Python list.
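
Getting from a long document to knowledge_base entries needs some kind of chunking. Here is one minimal sketch, assuming the text is already plain text with blank lines between paragraphs. `chunk_paragraphs` and the `max_chars` limit are hypothetical choices for this example, not a recommendation.

```python
def chunk_paragraphs(text, max_chars=500):
    """Split on blank lines, then pack paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks, current = [], ''
    for paragraph in paragraphs:
        candidate = f'{current}\n\n{paragraph}' if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

notes = "Club meets Thursday.\n\nLab is open weekdays.\n\nTutoring runs Wednesdays."
knowledge_base = [
    {'title': f'notes chunk {i}', 'text': chunk}
    for i, chunk in enumerate(chunk_paragraphs(notes, max_chars=45), start=1)
]
for item in knowledge_base:
    print(item)
```

The resulting dicts have the same title and text keys as the hand-written list in Step 2, so the embedding loop can run over them unchanged.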

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab for Part 2: Open part2_rag.ipynb
Local Python: part2_rag.py in the repo (python3 part2_rag.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.

The full series

Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 2 (this post): From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Part 3: Add Guardrails So Your AI App Doesn't Lie
Part 4: Run NVIDIA NIM on Your Own GPU
Part 5: From Chatbot to Agent — Tool Calling with NVIDIA NIM
Part 6: From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM
Part 7: Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM
Part 8: Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM
Part 9: Make Your Agent Return Data, Not Prose — Structured Outputs with NVIDIA NIM
Part 10: See What Your Agent Did — Tracing and Observability with NVIDIA NIM

Follow this series on dev.to (the series widget at the top of each post lists every published part in order).

Build Your First AI App with NVIDIA NIM in 30 Minutes

Torkian — Thu, 21 May 2026 22:43:28 +0000

Most students I've taught at USC have used ChatGPT. Far fewer have called a model from code.

That is the gap this post is meant to close. In 30 minutes, you'll call an NVIDIA-hosted language model from Python, pass it a small knowledge base, and make it answer only from that data. No GPU setup, no CUDA detour, no pretending a notebook is production. The goal is simple — write a normal Python program that talks to an LLM and gets useful text back.

I'm B Torkian, NVIDIA Developer Champion at USC, and I use this as a starter workshop for university and community groups. I've run a version of it with about 40 USC students. What usually surprises people is how ordinary the app feels. Most of it is normal software; one function call in the middle just happens to be weirdly powerful.

Everything runs in Google Colab because, for a room full of mixed laptops (I have made peace with this), boring setup wins.

This is Part 1 of a series that goes from one API call all the way to a streaming, tool-using agent that returns structured data. Each post stands on its own, so start here and move forward as far as you want to go.

What you're building

User question → Python app → NVIDIA NIM API → LLM response → App output

A small USC campus assistant. It will call an NVIDIA-hosted Llama model, use the data you provide, and refuse when the answer isn't there.

That refusal part matters. Demos can guess. Useful apps need to know when to say "I don't know."

What NVIDIA NIM is

NIM stands for NVIDIA Inference Microservices. For this post, treat it as hosted model inference from NVIDIA with a clean API in front.

There are two common ways to use it:

Hosted through NVIDIA's API Catalog at build.nvidia.com. That's what we're using here; check the current catalog terms before you teach it, because credits and available models can change.
Self-hosted on your own GPU later, with the same API shape. (That's Part 4 of this series.)

Whoever decided NVIDIA's API should mimic OpenAI's saved everyone a week of onboarding. You use the client most people have already seen, point it at a different endpoint, and move on.
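
To see what "OpenAI-compatible" means on the wire, here is a sketch that builds the HTTP request by hand without sending it. The `/chat/completions` path is the usual OpenAI-style route under the base URL used later in this post; treat that, the helper name `build_chat_request`, and the placeholder key as assumptions, and check the API Catalog page for your model.

```python
import json
import urllib.request

def build_chat_request(api_key, model, system_prompt, user_message):
    """Build (but do not send) a chat-completions request in the OpenAI style."""
    payload = {
        'model': model,
        'messages': [
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': user_message},
        ],
        'temperature': 0.3,
        'max_tokens': 400,
    }
    return urllib.request.Request(
        'https://integrate.api.nvidia.com/v1/chat/completions',
        data=json.dumps(payload).encode('utf-8'),
        headers={
            'Authorization': f'Bearer {api_key}',
            'Content-Type': 'application/json',
        },
        method='POST',
    )

request = build_chat_request(
    'nvapi-PLACEHOLDER',  # not a real key
    'meta/llama-3.1-8b-instruct',
    'You are a helpful, concise assistant.',
    'Say hello.',
)
print(request.get_method(), request.full_url)
print(json.loads(request.data)['messages'][1])
```

The openai client assembles a request of this shape for you. Pointing it at a different endpoint changes the URL and the key, and little else.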

Prerequisites (5 minutes)

A free NVIDIA Developer account — developer.nvidia.com
An API key from build.nvidia.com → pick any model → Get API Key. It starts with nvapi-.
A Google account for Colab.

The first time I taught this, I forgot to say the key starts with nvapi-, and half the room pasted the wrong thing (usually not their fault). Check that before you debug anything else.
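
A two-line check can catch the wrong paste before the first API call. This is a sketch: `looks_like_nvidia_key` is a hypothetical helper, and the only rule it knows is the nvapi- prefix mentioned above. Passing it does not prove the key is valid.

```python
def looks_like_nvidia_key(key):
    """Cheap sanity check: starts with 'nvapi-' and has something after the prefix."""
    key = key.strip()  # pasted keys often carry stray whitespace or a newline
    return key.startswith('nvapi-') and len(key) > len('nvapi-')

for candidate in ['nvapi-abc123', '  nvapi-abc123\n', 'sk-abc123', 'nvapi-', '']:
    print(repr(candidate), looks_like_nvidia_key(candidate))
```

In a workshop, running this right after the getpass prompt turns a confusing authentication error into an immediate "that is not the right thing to paste."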

Step 1: Open Colab and install the client

NVIDIA's API Catalog is OpenAI-compatible, so we'll use the standard openai Python client and point it at NVIDIA's endpoint.

%pip install -q openai

import os, getpass
from openai import OpenAI

os.environ['NVIDIA_API_KEY'] = getpass.getpass('Paste your NVIDIA API key: ')

client = OpenAI(
    base_url='https://integrate.api.nvidia.com/v1',
    api_key=os.environ['NVIDIA_API_KEY'],
)

MODEL = 'meta/llama-3.1-8b-instruct'

Notice two things:

base_url points at NVIDIA's hosted inference endpoint.
MODEL is just a model name from the API Catalog. Swap it later if you want; the call shape does not change.

Step 2: Make your first model call

def ask(system_prompt: str, user_message: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user',   'content': user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content

print(ask(
    system_prompt='You are a helpful, concise assistant.',
    user_message='Explain GPU acceleration to a first-year CS student in 5 sentences.',
))

Run it.

That ask() function is the basic shape of a lot of AI apps — instructions in, user input in, model response out. Real systems add plumbing, but this is the core.

Step 3: Use the system prompt to steer the model

Now keep the model and change its job description:

print(ask(
    system_prompt='You are a sarcastic but accurate professor. Keep it under 5 sentences.',
    user_message='Explain GPU acceleration to a first-year CS student.',
))

The output changes because the system prompt changes the model's job. A little precision buys you a lot here; vague prompts make debugging miserable.

Treat prompts like tiny specs — include constraints, output shape, and what to do when a question goes off-track. Then test with slightly annoying questions, because users will absolutely ask those.

Step 4: Build the USC campus assistant

An LLM doesn't know the USC schedule. It may still sound confident, which is exactly the problem.

So put the USC campus information directly into the prompt:

campus_info = """
The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204.
The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM.
USC students can join the NVIDIA Developer Program for free to access tools and learning resources.
The next USC AI Club workshop will cover Retrieval Augmented Generation (RAG).
Office hours for the USC AI/ML faculty are Tuesdays 2-4 PM.
"""

assistant_system_prompt = f"""You are a USC campus assistant. Answer ONLY using the
information in CAMPUS INFO below. If the answer is not in there, say
"I don't have that information — check with the USC AI Club."

CAMPUS INFO:
{campus_info}
"""

for question in [
    'When does the USC AI Club meet?',
    'Is the USC GPU lab open on Saturday?',
    'What is the wifi password?',
]:
    print(f'Q: {question}')
    print(f'A: {ask(assistant_system_prompt, question)}\n')

Run it and read the outputs before moving on. The USC AI Club answer should come straight from the text. For Saturday, the model often refuses with the fallback line instead of inferring "closed." That is the behavior I want for this exercise: "Monday to Friday" gives a human enough to reason about Saturday, but the exact Saturday answer is not stated in the provided data.

The wifi question should also get the fallback line, because there is nothing in campus_info about passwords. If your model says "I don't have that information — check with the USC AI Club," do not treat that as a failure. It stayed inside the box we gave it, which is the whole point.
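
If you want to check refusals without reading every answer by eye, a small detector helps. This is a sketch with canned answers standing in for real model output. `is_fallback` is a hypothetical helper that matches on the first half of the fallback line, so it tolerates small variations in how the model finishes the sentence.

```python
FALLBACK_MARKER = "i don't have that information"

def is_fallback(answer):
    """True when the answer contains the refusal phrase, ignoring case and curly apostrophes."""
    normalized = answer.replace('\u2019', "'").lower()
    return FALLBACK_MARKER in normalized

# (question, canned answer, should the assistant refuse?)
cases = [
    ('When does the USC AI Club meet?', 'Every Thursday at 5 PM in room 204.', False),
    ('What is the wifi password?', "I don't have that information. Check with the USC AI Club.", True),
]

statuses = []
for question, answer, should_refuse in cases:
    status = 'ok' if is_fallback(answer) == should_refuse else 'UNEXPECTED'
    statuses.append(status)
    print(f'{status}: {question}')
```

Replace the canned answers with ask(assistant_system_prompt, question) to run the same check against the live model.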

Last USC cohort, one student replaced the campus info with their D&D campaign notes and ended up with the most fun bug-hunting session of the day. The pattern works for silly data and useful data, which is why it sticks.

Step 5: What you actually did

You just built manual RAG — you picked the context by hand, inserted it into the prompt, and asked the model to answer from that context. In a production-ish version, the hand-picked campus_info string becomes whatever your retrieval system finds.

In a real app, the context probably comes from PDFs, docs, tickets, lecture notes, or a wiki. You retrieve a few relevant chunks at query time, usually with embeddings and a vector database, then pass only those along.

The model call barely changes — campus_info becomes the output of retrieval. Most of the engineering work lives in that swap.

That swap is exactly what Part 2 of this series is about.
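
Here is the swap in miniature, as a sketch. The prompt builder takes a function that supplies context, so a hand-picked string and a retriever are interchangeable. `build_prompt`, `manual_context`, and the keyword-matching `keyword_context` are hypothetical stand-ins; the keyword matcher is a toy, not the embedding retriever Part 2 builds.

```python
CAMPUS_LINES = [
    'The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204.',
    'The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM.',
]
STOPWORDS = {'the', 'is', 'usc', 'to', 'at', 'in', 'a', 'when', 'does', 'what'}

def words(text):
    """Lowercased words with basic punctuation stripped, minus stopwords."""
    return {w.strip('.,?').lower() for w in text.split()} - STOPWORDS

def manual_context(question):
    """Part 1 style: ignore the question and return everything."""
    return '\n'.join(CAMPUS_LINES)

def keyword_context(question):
    """Toy retriever: keep only lines that share a word with the question."""
    return '\n'.join(line for line in CAMPUS_LINES if words(line) & words(question))

def build_prompt(get_context, question):
    """Same prompt either way; only the source of the context changes."""
    return (
        'You are a USC campus assistant. Answer ONLY using the context below.\n\n'
        f'CONTEXT:\n{get_context(question)}\n'
    )

question = 'When is the GPU lab open?'
manual_prompt = build_prompt(manual_context, question)
retrieved_prompt = build_prompt(keyword_context, question)
print(retrieved_prompt)
```

Both prompts go to the same ask() call. The retrieved one simply carries less text the model has to ignore.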

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open the notebook
Local Python: app.py in the repo (python3 app.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, change campus_info to your school, your club, your project, and run it wherever you are.

The full series

This is Part 1. The rest of the series builds the same USC campus assistant up one capability at a time.

Part 1 (this post): Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Part 3: Add Guardrails So Your AI App Doesn't Lie
Part 4: Run NVIDIA NIM on Your Own GPU
Part 5: From Chatbot to Agent — Tool Calling with NVIDIA NIM
Part 6: From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM
Part 7: Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM
Part 8: Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM
Part 9: Make Your Agent Return Data, Not Prose — Structured Outputs with NVIDIA NIM
Part 10: See What Your Agent Did — Tracing and Observability with NVIDIA NIM

Follow this series on dev.to (the series widget at the top of each post lists every published part in order).
