Torkian

Posted on Jun 27 • Edited on Jul 16

Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM

#nvidia #ai #python #tutorial

The assistant we've built over seven parts is capable: it retrieves, refuses, plans, chains tools, and remembers a conversation. It also has one glaring UX flaw: you ask a question, it goes silent for a few seconds, and then a whole paragraph appears at once. For a one-line answer, the pause goes unnoticed. For anything longer, it feels broken.

Every chat product you've used solves this the same way: streaming. The text types itself out token by token, so you see progress immediately. This post adds exactly that to our agent. It makes a big difference to how "alive" the agent feels, for a change that's mostly one flag.

Mostly. The flag (stream=True) is the easy 20%. The other 80% is what the stream hands back: not one tidy message, but a sequence of small chunks. Plain text is easy to reassemble. Tool calls are not — they arrive split into fragments across many chunks, and you have to stitch them back together before you can run anything. That reassembly is the real lesson of Workshop 8.

I'm B Torkian, NVIDIA Developer Champion at USC. Part 8 of the series.

What you're adding

Workshop 7:  create(...)               -> one message  -> print it all at once
Workshop 8:  create(..., stream=True)  -> many chunks  -> print each token as it lands

The agent loop does not change. You stream a turn, reassemble whatever came back (text or tool-call fragments), then do exactly what Workshop 7 did: run the tools and loop, or stop because the answer is done. Streaming is a layer inside the turn, not a new control flow.

Step 1 — Streaming at its simplest (no tools)

Add stream=True and the return value stops being a message — it becomes an iterator of chunks, each carrying a small delta. For plain text, the only field that matters is delta.content:

resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "system", "content": "/no_think"},
              {"role": "user", "content": "In two sentences, what is GPU acceleration?"}],
    stream=True,
)

for chunk in resp:
    if not chunk.choices:               # a trailing usage-only chunk has none
        continue
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()

end="" keeps print from adding a newline after every token, and flush=True pushes each token to the terminal immediately instead of letting it sit in the output buffer. That's the whole trick for text. Run it and the answer types itself out.
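If you want to try the loop without an API key, here is a minimal sketch that feeds it fake chunks. The fake_text_stream helper and the SimpleNamespace objects are illustrative stand-ins, assumed to have the same shape as real chunks (choices, delta.content, and a trailing chunk with no choices). They are not part of any client library, and the token count is made up.

```python
from types import SimpleNamespace as NS


def fake_text_stream(pieces):
    """Yield chunk-shaped objects (hypothetical stand-ins for real chunks)."""
    # Assumed: a first chunk whose delta carries no text yet (content is None).
    yield NS(choices=[NS(delta=NS(content=None))], usage=None)
    for piece in pieces:
        yield NS(choices=[NS(delta=NS(content=piece))], usage=None)
    # Trailing chunk with no choices, like the usage-only chunk noted above.
    yield NS(choices=[], usage=NS(total_tokens=len(pieces)))


def print_stream(chunks):
    """Print text deltas as they arrive; return the full text and any usage."""
    parts, usage = [], None
    for chunk in chunks:
        if getattr(chunk, "usage", None):
            usage = chunk.usage
        if not chunk.choices:          # nothing to print in this chunk
            continue
        piece = chunk.choices[0].delta.content or ""
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts), usage


text, usage = print_stream(fake_text_stream(["GPUs ", "run ", "many ", "threads."]))
```

Keeping the pieces in a list and joining them at the end gives you the full answer to store in the conversation history, which is what Step 3 needs.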

Step 2 — The catch: tool calls arrive in fragments

Here's what surprises people. When the model decides to call a tool, the call does not arrive in one piece. The function name shows up in one chunk; the arguments JSON dribbles in across several more. Each fragment is tagged with an index so you know which call it belongs to — because the model can request more than one in a single turn.

So you keep a dictionary keyed by that index. For each fragment, you set the id and name when they appear, and you concatenate the arguments string as the pieces arrive:

text_parts = []
tool_fragments = {}     # index -> {"id", "name", "arguments"}

for chunk in stream_resp:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    if delta.content:                              # visible answer text
        print(delta.content, end="", flush=True)
        text_parts.append(delta.content)

    for tc in (delta.tool_calls or []):            # a fragment of a tool call
        slot = tool_fragments.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
        if tc.id:
            slot["id"] = tc.id
        if tc.function and tc.function.name:
            slot["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            slot["arguments"] += tc.function.arguments   # JSON arrives in pieces

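When the stream ends, each entry in tool_fragments holds one complete tool call, ready to parse and run. Parse the arguments only at this point, since a partial arguments string is not valid JSON. That's the only genuinely new idea in this workshop. Everything around it is the Workshop 7 loop.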
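To see the reassembly work on concrete data, here is a small sketch that runs the same accumulation logic over a hand-built stream. The frag helper, the call ids, and the tool names (get_club_meeting, days_until) are hypothetical, and the fake stream interleaves the two calls on purpose to show why the index matters. Real streams may not interleave like this.

```python
import json
from types import SimpleNamespace as NS


def frag(index, call_id=None, name=None, arguments=None):
    """Build one fake chunk carrying a single tool-call fragment (hypothetical)."""
    tc = NS(index=index, id=call_id, function=NS(name=name, arguments=arguments))
    return NS(choices=[NS(delta=NS(content=None, tool_calls=[tc]))])


fake_stream = [
    frag(0, call_id="call_a", name="get_club_meeting", arguments=""),
    frag(0, arguments='{"club": "US'),
    frag(1, call_id="call_b", name="days_until", arguments=""),
    frag(0, arguments='C AI Club"}'),
    frag(1, arguments='{"day": '),
    frag(1, arguments='"Thursday"}'),
    NS(choices=[]),                      # trailing chunk with no choices
]


def assemble_tool_calls(chunks):
    """Stitch tool-call fragments back together, keyed by their index."""
    fragments = {}
    for chunk in chunks:
        if not chunk.choices:
            continue
        for tc in (chunk.choices[0].delta.tool_calls or []):
            slot = fragments.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
            if tc.id:
                slot["id"] = tc.id
            if tc.function and tc.function.name:
                slot["name"] = tc.function.name
            if tc.function and tc.function.arguments:
                slot["arguments"] += tc.function.arguments
    return [fragments[i] for i in sorted(fragments)]


calls = assemble_tool_calls(fake_stream)
for call in calls:
    # Safe to parse now: the stream has ended, so each string is complete.
    print(call["name"], json.loads(call["arguments"]))
```

Halfway through the stream, call 0 holds only '{"club": "US', which json.loads would reject. That is why parsing waits until the loop finishes.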

Step 3 — Fold it into `ChatSession`

stream() lives alongside Workshop 7's chat() on the same session — same persistent self.messages, same _trim() (trim-by-turns), same memory. The only difference is that the turn is streamed and the assistant message is rebuilt from the accumulated pieces.

def stream(self, user_message: str) -> str:
    self.messages.append({"role": "user", "content": user_message})

    for step in range(1, MAX_STEPS + 1):
        stream_resp = client.chat.completions.create(
            model=MODEL, messages=self.messages, tools=tools,
            tool_choice="auto", temperature=0.2, max_tokens=400, stream=True,
        )

        text_parts, tool_fragments, header_printed = [], {}, False
        for chunk in stream_resp:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta
            if delta.content:
                if not header_printed:
                    print("Assistant: ", end="", flush=True)
                    header_printed = True
                print(delta.content, end="", flush=True)
                text_parts.append(delta.content)
            for tc in (delta.tool_calls or []):
                slot = tool_fragments.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
                if tc.id:
                    slot["id"] = tc.id
                if tc.function and tc.function.name:
                    slot["name"] = tc.function.name
                if tc.function and tc.function.arguments:
                    slot["arguments"] += tc.function.arguments
        if header_printed:
            print()

        text = "".join(text_parts)
        tool_calls = [tool_fragments[i] for i in sorted(tool_fragments)]

        # Rebuild the assistant message from the streamed pieces and store it.
        assistant_msg = {"role": "assistant"}
        if tool_calls:
            assistant_msg["tool_calls"] = [
                {"id": tc["id"], "type": "function",
                 "function": {"name": tc["name"], "arguments": tc["arguments"]}}
                for tc in tool_calls
            ]
            if text:
                assistant_msg["content"] = text
        else:
            assistant_msg["content"] = text
        self.messages.append(assistant_msg)

        if not tool_calls:                  # final answer already streamed
            self._trim()
            return text or "I could not generate an answer. Please try again."

        for tc in tool_calls:               # run tools, then loop and stream again
            try:
                arguments = json.loads(tc["arguments"] or "{}")
            except json.JSONDecodeError:
                arguments = {}
            result = run_tool(tc["name"], arguments)   # the Part 7 dispatch, factored out:
            # def run_tool(name, arguments):
            #     if name not in available_tools: return f"Tool '{name}' is not available."
            #     try: return available_tools[name](**arguments)
            #     except Exception as exc: return f"Tool '{name}' failed: {exc}"
            self.messages.append({"role": "tool", "tool_call_id": tc["id"],
                                  "name": tc["name"], "content": str(result)})

    # (abridged: the same MAX_STEPS fallback as chat() closes the loop)

Put the chat() version next to this and the structure is identical — the streaming version just builds the assistant message by hand from fragments instead of getting it whole. That isomorphism is the point: streaming is a data-accumulation layer, not a new agent.
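The tool-running half of the loop can be pulled out and tested on its own. The sketch below makes the commented-out run_tool above runnable and wraps it in a hypothetical tool_message helper. The get_lab_hours tool and its return value are invented for the example. It also tries one variation on the code above: instead of falling back to empty arguments when the JSON is broken (for instance, if max_tokens cut the stream off mid-call), it reports the problem back to the model as the tool result. Treat that as one possible design, not the workshop's.

```python
import json


def get_lab_hours(building: str) -> str:
    """Hypothetical tool; the real tools come from earlier parts of the series."""
    return f"{building}: 9am-5pm"


available_tools = {"get_lab_hours": get_lab_hours}


def run_tool(name, arguments):
    """Dispatch one tool call, returning an error string instead of raising."""
    if name not in available_tools:
        return f"Tool '{name}' is not available."
    try:
        return available_tools[name](**arguments)
    except Exception as exc:
        return f"Tool '{name}' failed: {exc}"


def tool_message(tc):
    """Turn one reassembled tool call into a role='tool' message."""
    try:
        arguments = json.loads(tc["arguments"] or "{}")
    except json.JSONDecodeError as exc:
        # Variation: tell the model its JSON was bad so it can retry.
        result = f"Tool '{tc['name']}' received invalid JSON arguments: {exc.msg}"
    else:
        result = run_tool(tc["name"], arguments)
    return {"role": "tool", "tool_call_id": tc["id"],
            "name": tc["name"], "content": str(result)}


ok = tool_message({"id": "call_1", "name": "get_lab_hours",
                   "arguments": '{"building": "Lab A"}'})
truncated = tool_message({"id": "call_2", "name": "get_lab_hours",
                          "arguments": '{"building": "La'})
print(ok["content"])
print(truncated["content"])
```

Either way, every tool call gets exactly one tool message with a matching tool_call_id, so the next streamed turn sees a complete history.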

Step 4 — Feel the difference

print("── Without streaming (answer arrives all at once) ──")
session = ChatSession(verbose=True)
print(f"Assistant: {session.chat('What are the USC GPU lab hours?')}")

print("\n── With streaming ──")
for q in [
    "When does the USC AI Club meet?",                          # tool call, then streams
    "How many days until that?",                                # memory + tool, then streams
    "Which is sooner, that meeting or the AI/ML office hours?",  # multi-step, then streams
]:
    print(f"\nYou: {q}")
    session.stream(q)

The non-streaming call pauses, then dumps the answer. The streaming calls show the tool step, then the answer types itself out — and memory still works ("that" resolves to Thursday) and so does multi-step comparison. You only changed how the answer is delivered, not how the agent thinks.

Step 5 — The trap worth naming

There's a tempting "simpler" design: do a normal non-streaming call first to check whether the model wants a tool, and only if it doesn't, call again with stream=True to stream the answer. Don't. On the final turn that means you generate the whole answer once (blocking), then generate it again to stream it. Your first visible token now arrives later than if you hadn't streamed at all — the exact opposite of the goal — and you pay for the answer twice.

Streaming the same call that decides on tools is what gives you low time-to-first-token. That's why we accumulate fragments instead of peeking first. It's a few more lines, and it's the difference between streaming that helps and streaming that's theater.
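The cost of peeking first is easy to put numbers on with a toy model. The sketch below uses a simulated clock and a fake generator that charges a fixed cost per token. The 50 ms per token and the 100-token answer are made-up figures, and the model ignores network and prompt-processing time, so read the output as an illustration of the shape of the problem, not a benchmark.

```python
def generate(n_tokens, clock, per_token_ms=50):
    """Fake model: each token costs per_token_ms of simulated time."""
    for i in range(n_tokens):
        clock["ms"] += per_token_ms
        clock["generated"] += 1
        yield f"tok{i} "


def first_token_time(stream, clock):
    """Consume a stream and return the simulated time of the first token."""
    first_seen = None
    for _ in stream:
        if first_seen is None:
            first_seen = clock["ms"]
    return first_seen


def stream_once(n_tokens):
    """Stream the one call that also decides on tools."""
    clock = {"ms": 0, "generated": 0}
    ttft = first_token_time(generate(n_tokens, clock), clock)
    return ttft, clock["generated"]


def peek_then_stream(n_tokens):
    """Blocking call first, then a second streamed call for the same answer."""
    clock = {"ms": 0, "generated": 0}
    list(generate(n_tokens, clock))      # blocking: the user sees nothing yet
    ttft = first_token_time(generate(n_tokens, clock), clock)
    return ttft, clock["generated"]


print(stream_once(100))        # (time to first token in ms, tokens generated)
print(peek_then_stream(100))
```

Under these assumptions the peek-first design shows its first token only after the full blocking answer has been generated, and it generates twice as many tokens overall.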

Step 6 — What you actually built

Workshop 1 gave it a brain.
Workshop 2 gave it memory of facts (retrieval).
Workshop 3 gave it judgment (guardrails).
Workshop 4 gave it portability.
Workshop 5 gave it hands (one tool).
Workshop 6 gave it a plan (chained tools).
Workshop 7 gave it memory of the conversation.
Workshop 8 gave it a voice that arrives in real time.

The agent is the same while loop it's been since Part 5. Streaming, like memory and tools before it, is normal software wrapped around the model call — you read the response differently, and the experience transforms. Production systems push this further (streaming over WebSockets to a browser, rendering partial markdown, cancel-mid-stream), but every one of them is doing what you just did: consuming chunks and reassembling them.
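Of those extensions, cancel-mid-stream is the smallest step from what you just built, because it is still the same consume-and-accumulate loop with one extra check. Here is a rough sketch under the assumption that the stream is a plain Python generator. The fake_stream generator, the consume function, and the on_piece callback are all hypothetical, and a real client's stream object may need to be closed in a different way.

```python
import threading


def fake_stream(pieces, log):
    """Fake token stream that records when it gets closed or finishes."""
    try:
        for piece in pieces:
            yield piece
    finally:
        log.append("closed")


def consume(chunks, cancel, on_piece):
    """Accumulate pieces until the stream ends or cancel is set."""
    parts = []
    for piece in chunks:
        if cancel.is_set():
            chunks.close()               # stop the generator early
            break
        parts.append(piece)
        on_piece(piece)
    return "".join(parts)


cancel = threading.Event()
seen, log = [], []


def on_piece(piece):
    """Pretend the user hits stop after the second piece arrives."""
    seen.append(piece)
    if len(seen) == 2:
        cancel.set()


partial = consume(fake_stream(["a", "b", "c", "d"], log), cancel, on_piece)
print(partial)
```

The partial text is still returned, so the caller can decide whether to keep it in the conversation history or discard it.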

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open part8_streaming_agent.ipynb
Local Python: part8_streaming_agent.py in the repo (python3 part8_streaming_agent.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base and the tools for your school, your club, your project.

The full series

Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
Part 3: Add Guardrails So Your AI App Doesn't Lie
Part 4: Run NVIDIA NIM on Your Own GPU
Part 5: From Chatbot to Agent — Tool Calling with NVIDIA NIM
Part 6: From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM
Part 7: Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM
Part 8 (this post): Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM
Part 9: Make Your Agent Return Data, Not Prose — Structured Outputs with NVIDIA NIM
Part 10: See What Your Agent Did — Tracing and Observability with NVIDIA NIM

A consolidated long-form version of the whole series is on Medium for anyone who'd rather read it in one sitting.
