The assistant we've built over seven parts is capable — it retrieves, refuses, plans, chains tools, and remembers a conversation. It also has one glaring UX flaw: you ask a question, it goes silent for a few seconds, and then a whole paragraph appears at once. For a one-line answer that's invisible. For anything longer, it feels broken.
Every chat product you've used solves this the same way: streaming. The text types itself out token by token, so you see progress immediately. This post adds exactly that to our agent, and the payoff is huge for how "alive" it feels — for a change that's mostly one flag.
Mostly. The flag (stream=True) is the easy 20%. The other 80% is what the stream hands back: not one tidy message, but a sequence of small chunks. Plain text is easy to reassemble. Tool calls are not — they arrive split into fragments across many chunks, and you have to stitch them back together before you can run anything. That reassembly is the real lesson of Workshop 8.
I'm B Torkian, NVIDIA Developer Champion at USC. Part 8 of the series.
What you're adding
Workshop 7: create(...) -> one message -> print it all at once
Workshop 8: create(..., stream=True) -> many chunks -> print each token as it lands
The agent loop does not change. You stream a turn, reassemble whatever came back (text or tool-call fragments), then do exactly what Workshop 7 did: run the tools and loop, or stop because the answer is done. Streaming is a layer inside the turn, not a new control flow.
Step 1 — Streaming at its simplest (no tools)
Add stream=True and the return value stops being a message — it becomes an iterator of chunks, each carrying a small delta. For plain text, the only field that matters is delta.content:
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "In two sentences, what is GPU acceleration?"}],
    stream=True,
)
for chunk in resp:
    if not chunk.choices:  # a trailing usage-only chunk has none
        continue
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
end="" stops print from adding a newline after every chunk, and flush=True pushes each piece to the terminal immediately instead of leaving it in the output buffer. That's the whole trick for text. Run it and the answer types itself out.
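If you want to try the loop without an API key, here is a minimal sketch that feeds it hand-made chunks. The fake_chunk helper and the sample text are made up for this demo; they only mimic the two attributes the loop reads (choices and delta.content). The sketch also keeps the pieces as they arrive, so you still have the full answer once the stream ends.

```python
from types import SimpleNamespace


def fake_chunk(content=None, has_choice=True):
    """Stand-in shaped like a streamed chunk (hypothetical helper for this demo)."""
    if not has_choice:
        return SimpleNamespace(choices=[])
    delta = SimpleNamespace(content=content, tool_calls=None)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def stream_text(chunks):
    """Print each text delta as it arrives and return the full answer."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # e.g. a trailing usage-only chunk
            continue
        piece = chunk.choices[0].delta.content or ""  # content can be None
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


# Made-up chunks: three text pieces, one empty delta, one chunk with no choices.
demo = [
    fake_chunk("GPU "),
    fake_chunk("acceleration "),
    fake_chunk(None),
    fake_chunk("offloads work."),
    fake_chunk(has_choice=False),
]
answer = stream_text(demo)
```

Swapping demo for the real resp iterator should give the same behavior, assuming the chunks carry the same two attributes.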
Step 2 — The catch: tool calls arrive in fragments
Here's what surprises people. When the model decides to call a tool, the call does not arrive in one piece. The function name shows up in one chunk; the arguments JSON dribbles in across several more. Each fragment is tagged with an index so you know which call it belongs to — because the model can request more than one in a single turn.
So you keep a dictionary keyed by that index. For each fragment, you set the id and name when they appear, and you concatenate the arguments string as the pieces arrive:
text_parts = []
tool_fragments = {}  # index -> {"id", "name", "arguments"}
for chunk in stream_resp:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:  # visible answer text
        print(delta.content, end="", flush=True)
        text_parts.append(delta.content)
    for tc in (delta.tool_calls or []):  # a fragment of a tool call
        slot = tool_fragments.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
        if tc.id:
            slot["id"] = tc.id
        if tc.function and tc.function.name:
            slot["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            slot["arguments"] += tc.function.arguments  # JSON arrives in pieces
When the stream ends, each slot in tool_fragments holds one complete tool call, ready to parse and run. That's the only genuinely new idea in this workshop. Everything around it is the Workshop 7 loop.
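To watch the stitching happen without calling a model, here is a small sketch that runs the same accumulation over hand-made fragments. Everything in the sample data is invented for illustration: the frag helper, the tool names lookup_event and days_until, the ids, and the places where the JSON is split. Real streams may cut the arguments at different points.

```python
import json
from types import SimpleNamespace


def frag(index, call_id=None, name=None, arguments=None):
    """Stand-in for one streamed tool-call fragment (hypothetical demo helper)."""
    return SimpleNamespace(
        index=index,
        id=call_id,
        function=SimpleNamespace(name=name, arguments=arguments),
    )


def reassemble(fragments):
    """Merge fragments into complete calls, ordered by their index."""
    slots = {}
    for tc in fragments:
        slot = slots.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
        if tc.id:
            slot["id"] = tc.id
        if tc.function and tc.function.name:
            slot["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            slot["arguments"] += tc.function.arguments  # concatenate JSON text
    return [slots[i] for i in sorted(slots)]


# Two made-up calls in one turn; each one's arguments arrive in two pieces.
fragments = [
    frag(0, call_id="call_a", name="lookup_event", arguments=""),
    frag(0, arguments='{"na'),
    frag(0, arguments='me": "AI Club"}'),
    frag(1, call_id="call_b", name="days_until", arguments=""),
    frag(1, arguments='{"day": '),
    frag(1, arguments='"Thursday"}'),
]

calls = reassemble(fragments)
# Only parse once the stream is finished; a partial string is not valid JSON.
parsed = [(c["name"], json.loads(c["arguments"])) for c in calls]
print(parsed)
```

Note that json.loads only runs after the last fragment. Parsing any earlier would fail, because a half-delivered arguments string such as '{"na' is not valid JSON.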
Step 3 — Fold it into ChatSession
stream() lives alongside Workshop 7's chat() on the same session — same persistent self.messages, same _trim() (trim-by-turns), same memory. The only difference is that the turn is streamed and the assistant message is rebuilt from the accumulated pieces.
def stream(self, user_message: str) -> str:
    self.messages.append({"role": "user", "content": user_message})
    for step in range(1, MAX_STEPS + 1):
        stream_resp = client.chat.completions.create(
            model=MODEL, messages=self.messages, tools=tools,
            tool_choice="auto", temperature=0.2, max_tokens=400, stream=True,
        )
        text_parts, tool_fragments, header_printed = [], {}, False
        for chunk in stream_resp:
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta
            if delta.content:
                if not header_printed:  # print the label once, before the first token
                    print("Assistant: ", end="", flush=True)
                    header_printed = True
                print(delta.content, end="", flush=True)
                text_parts.append(delta.content)
            for tc in (delta.tool_calls or []):
                slot = tool_fragments.setdefault(tc.index, {"id": "", "name": "", "arguments": ""})
                if tc.id:
                    slot["id"] = tc.id
                if tc.function and tc.function.name:
                    slot["name"] = tc.function.name
                if tc.function and tc.function.arguments:
                    slot["arguments"] += tc.function.arguments
        if header_printed:
            print()
        text = "".join(text_parts)
        tool_calls = [tool_fragments[i] for i in sorted(tool_fragments)]
        # Rebuild the assistant message from the streamed pieces and store it.
        assistant_msg = {"role": "assistant", "content": text}
        if tool_calls:
            assistant_msg["tool_calls"] = [
                {"id": tc["id"], "type": "function",
                 "function": {"name": tc["name"], "arguments": tc["arguments"]}}
                for tc in tool_calls
            ]
        self.messages.append(assistant_msg)
        if not tool_calls:  # final answer already streamed
            self._trim()
            return text
        for tc in tool_calls:  # run tools, then loop and stream again
            try:
                arguments = json.loads(tc["arguments"] or "{}")
            except json.JSONDecodeError:  # malformed arguments: fall back to none
                arguments = {}
            result = run_tool(tc["name"], arguments)
            self.messages.append({"role": "tool", "tool_call_id": tc["id"],
                                  "name": tc["name"], "content": str(result)})
    # (abridged: the same MAX_STEPS fallback as chat() closes the loop)
Put the chat() version next to this and the structure is identical — the streaming version just builds the assistant message by hand from fragments instead of getting it whole. That isomorphism is the point: streaming is a data-accumulation layer, not a new agent.
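To make "by hand" concrete, here is a small sketch of the two message shapes the rebuild step produces: one for a tool turn, one for a final answer. The build_assistant_message helper, the slot contents, and the tool name lookup_event are made up for illustration; the dict layout mirrors what stream() appends to self.messages.

```python
import json


def build_assistant_message(text, tool_calls):
    """Rebuild an assistant message dict from accumulated text and tool-call slots."""
    msg = {"role": "assistant", "content": text}
    if tool_calls:
        msg["tool_calls"] = [
            {"id": tc["id"], "type": "function",
             "function": {"name": tc["name"], "arguments": tc["arguments"]}}
            for tc in tool_calls
        ]
    return msg


# Made-up slot, shaped like one entry of tool_fragments after the stream ends.
slots = [{"id": "call_a", "name": "lookup_event", "arguments": '{"name": "AI Club"}'}]

tool_turn = build_assistant_message("", slots)
final_turn = build_assistant_message("The club meets on Thursday.", [])

print(json.dumps(tool_turn, indent=2))
print(json.dumps(final_turn, indent=2))
```

One detail worth noticing: in the stored message, arguments stays the raw JSON string exactly as it was streamed. You parse it with json.loads only when you run the tool.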
Step 4 — Feel the difference
print("── Without streaming (answer arrives all at once) ──")
session = ChatSession(verbose=True)
print(f"Assistant: {session.chat('What are the USC GPU lab hours?')}")
print("\n── With streaming ──")
for q in [
    "When does the USC AI Club meet?",  # tool call, then streams
    "How many days until that?",  # memory + tool, then streams
    "Which is sooner, that meeting or the AI/ML office hours?",  # multi-step, then streams
]:
    print(f"\nYou: {q}")
    session.stream(q)
The non-streaming call pauses, then dumps the answer. The streaming calls show the tool step, then the answer types itself out — and memory still works ("that" resolves to Thursday) and so does multi-step comparison. You only changed how the answer is delivered, not how the agent thinks.
Step 5 — The trap worth naming
There's a tempting "simpler" design: do a normal non-streaming call first to check whether the model wants a tool, and only if it doesn't, call again with stream=True to stream the answer. Don't. On the final turn that means you generate the whole answer once (blocking), then generate it again to stream it. Your first visible token now arrives later than if you hadn't streamed at all — the exact opposite of the goal — and you pay for the answer twice.
Streaming the same call that decides on tools is what gives you low time-to-first-token. That's why we accumulate fragments instead of peeking first. It's a few more lines, and it's the difference between streaming that helps and streaming that's theater.
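A back-of-the-envelope sketch makes the cost of peeking first visible. The latency numbers below are invented, and the model is deliberately simple: a fixed delay before the first token, then a constant time per token. Real latencies vary, so treat the output as an illustration rather than a measurement.

```python
def full_generation_s(first_token_s, per_token_s, n_tokens):
    """Seconds to generate a whole answer under the simplified latency model."""
    return first_token_s + per_token_s * (n_tokens - 1)


def time_to_first_token(first_token_s, per_token_s, n_tokens, design):
    """Seconds until the user sees the first token of the final answer."""
    total = full_generation_s(first_token_s, per_token_s, n_tokens)
    if design == "no_streaming":
        return total  # nothing is visible until the whole answer is done
    if design == "peek_then_stream":
        return total + first_token_s  # blocking call first, then a second call
    if design == "stream_once":
        return first_token_s
    raise ValueError(f"unknown design: {design}")


def tokens_generated(n_tokens, design):
    """Answer tokens generated (and paid for) on the final turn."""
    return 2 * n_tokens if design == "peek_then_stream" else n_tokens


# Invented numbers: 0.5 s to first token, 20 ms per token, a 200-token answer.
for design in ("no_streaming", "peek_then_stream", "stream_once"):
    ttft = time_to_first_token(0.5, 0.02, 200, design)
    print(f"{design:>17}: first token after {ttft:.2f}s, "
          f"{tokens_generated(200, design)} tokens generated")
```

Under these assumptions, peeking first is the slowest of the three to show anything, and it generates the answer twice.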
Step 6 — What you actually built
- Workshop 1 gave it a brain.
- Workshop 2 gave it memory of facts (retrieval).
- Workshop 3 gave it judgment (guardrails).
- Workshop 4 gave it portability.
- Workshop 5 gave it hands (one tool).
- Workshop 6 gave it a plan (chained tools).
- Workshop 7 gave it memory of the conversation.
- Workshop 8 gave it a voice that arrives in real time.
The agent is the same while loop it's been since Part 5. Streaming, like memory and tools before it, is normal software wrapped around the model call — you read the response differently, and the experience transforms. Production systems push this further (streaming over WebSockets to a browser, rendering partial markdown, cancel-mid-stream), but every one of them is doing what you just did: consuming chunks and reassembling them.
Get the code
Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open part8_streaming_agent.ipynb
Local Python: part8_streaming_agent.py in the repo (python3 part8_streaming_agent.py after pip install -r requirements.txt).
MIT licensed. I run this at USC — fork it, swap the knowledge base and the tools for your school, your club, your project.
The full series
- Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
- Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
- Part 3: Add Guardrails So Your AI App Doesn't Lie
- Part 4: Run NVIDIA NIM on Your Own GPU
- Part 5: From Chatbot to Agent — Tool Calling with NVIDIA NIM
- Part 6: From One Tool to a Plan — Multi-Step Agents with NVIDIA NIM
- Part 7: Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM
- Part 8 (this post): Make Your Agent Feel Real-Time — Streaming with NVIDIA NIM
A consolidated long-form version of the whole series is on Medium for anyone who'd rather read it in one sitting.