Giving Your Agent a Memory — Multi-Turn Conversations with NVIDIA NIM

#nvidia #ai #python #tutorial

The agent we built in Part 6 is sharp — it plans, chains tools, and answers genuinely hard questions. It also has the memory of a goldfish. Ask it "when does the AI Club meet?", get a good answer, then ask "how many days until that?" — and it has no idea what "that" is. Every question starts from a blank slate.

That's the gap between a query tool and an assistant. A real assistant holds a conversation. It remembers what you just asked, resolves "that" and "those two" and "the second one" against what's already been said, and doesn't make you repeat yourself.

The fix is smaller than you'd think. In Part 6 the messages list lived inside the agent function and got thrown away after each question. In this post we lift that list out of the function and into a session object so it survives from one turn to the next. That's most of the work. The interesting part — the part that bites people — is what happens when the conversation gets long enough that you have to start forgetting old turns without breaking the tool-call bookkeeping.

I'm B Torkian, NVIDIA Developer Champion at USC. Part 7 of the series.

What you're adding

Turn 1: user asks → agent runs the tool loop → answer        ┐
Turn 2: user asks → agent runs the tool loop → answer        │  all sharing
Turn 3: ...                                                  ┘  ONE messages list

The list is never cleared between turns, so each turn sees everything before it.
When it gets too long, drop the OLDEST WHOLE TURN — never half of one.

The chat call from Part 1, the retriever from Part 2, the guardrail from Part 3, and the three tools from Part 6 all carry forward unchanged. The only new idea is persistence: keep the message history alive across calls.

Why "just keep the messages list" has a trap in it

Persisting the history is one line of intent — keep appending to the same list instead of starting a new one. But conversations grow without bound, and eventually you have to trim old turns or you'll blow past the context window and pay for tokens you don't need.

Here's the trap. With tool calling, the API enforces a pairing rule: every role="tool" message must match a tool_calls entry in an earlier assistant message, by ID. So if you naively trim "the oldest 4 messages" and one of them was the assistant message that requested a tool — but you keep the tool result that came right after — you've created an orphan. The tool result now references a tool_call_id that no longer exists in the history, and NVIDIA NIM (like any OpenAI-compatible endpoint) rejects the request with a validation error.

The fix is to think in turns, not messages. A turn is everything from one user message up to the next: the user's question, every assistant/tool exchange in between, and the final answer. You add and remove whole turns. Concretely, that means trim only at a user-message boundary — then you can never split a tool call from its result.
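To make the pairing rule concrete, here is a minimal sketch of a checker that scans a history for orphaned tool results. The helper name find_orphaned_tool_results, the sample history, the call_1 ID, and the query argument are all made up for illustration; the only thing taken from the text above is the rule that each tool message needs an earlier assistant tool_calls entry with the same ID.

```python
def find_orphaned_tool_results(messages):
    """Return the tool_call_ids of tool messages that have no earlier
    assistant tool_calls entry with the same ID."""
    seen_ids = set()
    orphans = []
    for message in messages:
        if message.get("role") == "assistant":
            for call in message.get("tool_calls") or []:
                seen_ids.add(call["id"])
        elif message.get("role") == "tool":
            if message.get("tool_call_id") not in seen_ids:
                orphans.append(message.get("tool_call_id"))
    return orphans


# A made-up two-turn history; IDs, arguments, and contents are illustrative only.
history = [
    {"role": "system", "content": "You are a campus assistant."},
    {"role": "user", "content": "When does the AI Club meet?"},
    {"role": "assistant", "content": "", "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "search_campus_info",
                      "arguments": '{"query": "AI Club"}'}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "AI Club meets Thursday."},
    {"role": "assistant", "content": "The AI Club meets on Thursday."},
    {"role": "user", "content": "How many days until that?"},
    {"role": "assistant", "content": "Let me check."},
]

# Naive trim: drop the two oldest non-system messages
# (the user question and the assistant message that requested the tool).
naive = [history[0]] + history[3:]

# Turn-based trim: cut at the second user message instead.
by_turn = [history[0]] + history[5:]

naive_orphans = find_orphaned_tool_results(naive)
turn_orphans = find_orphaned_tool_results(by_turn)
print("naive trim orphans:", naive_orphans)
print("turn trim orphans: ", turn_orphans)
```

Running a check like this before each request is one cheap way to catch a bad trim locally instead of discovering it as a rejected API call.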

Step 1 — Carry the setup forward

You need the client, MODEL, the knowledge_base + retrieve_context from Part 2, and the three tools from Part 6 (search_campus_info, get_current_time, days_until_weekday). The Colab notebook has a compact prerequisite cell; the standalone part7_memory_agent.py defines everything from scratch.

Same meta/llama-3.3-70b-instruct on the same hosted endpoint. Low temperature matters even more here than in Part 6 — more on that at the end.

MODEL = "meta/llama-3.3-70b-instruct"
LOCAL_TZ = "America/Los_Angeles"

Step 2 — A session that remembers

In Part 6 the loop owned a local messages = [...]. Here we move that list onto an object. That's the whole conceptual jump: state that used to vanish when the function returned now lives on self and persists between calls.

class ChatSession:
    def __init__(self, max_turns: int = 8, verbose: bool = True):
        self.system = {"role": "system", "content": SYSTEM_PROMPT}
        self.messages = [self.system]      # <- persists across .chat() calls
        self.max_turns = max_turns
        self.verbose = verbose

    def reset(self):
        self.messages = [self.system]      # forget everything

    def _trim(self):
        # Keep system + the last `max_turns` turns. Cut ONLY at a user-message
        # boundary, so a tool result is never orphaned from its tool call.
        user_indices = [i for i, m in enumerate(self.messages) if m.get("role") == "user"]
        if len(user_indices) <= self.max_turns:
            return
        cut = user_indices[-self.max_turns]            # first index to keep
        dropped = len(user_indices) - self.max_turns
        self.messages = [self.system] + self.messages[cut:]
        if self.verbose:
            print(f"  (memory: dropped {dropped} old turn(s), keeping last {self.max_turns})")

A class beats a closure here for one reason: the memory is visible. You can print(session.messages) and see exactly what the model remembers, and session.reset() is an obvious way to clear it. Hidden state in a closure teaches the wrong mental model.
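If the index arithmetic in _trim feels opaque, the same idea can be sketched as standalone functions that group messages into turns first. This is an alternative illustration, not the code from the repo: split_into_turns and keep_last_turns are hypothetical names, the sample history is invented, and unlike _trim this version keeps every system message it finds rather than only self.system.

```python
def split_into_turns(messages):
    """Group non-system messages into turns; each turn starts at a user message."""
    turns = []
    for message in messages:
        if message.get("role") == "system":
            continue
        if message.get("role") == "user" or not turns:
            turns.append([])
        turns[-1].append(message)
    return turns


def keep_last_turns(messages, max_turns):
    """Return the system messages plus the last `max_turns` whole turns."""
    system = [m for m in messages if m.get("role") == "system"]
    turns = split_into_turns(messages)
    kept = turns[-max_turns:] if max_turns > 0 else []
    return system + [m for turn in kept for m in turn]


# Invented three-turn history; the middle turn contains a tool exchange.
history = [
    {"role": "system", "content": "You are a campus assistant."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi! What can I look up?"},
    {"role": "user", "content": "When does the AI Club meet?"},
    {"role": "assistant", "content": "", "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "search_campus_info", "arguments": "{}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "AI Club meets Thursday."},
    {"role": "assistant", "content": "The AI Club meets on Thursday."},
    {"role": "user", "content": "Thanks"},
    {"role": "assistant", "content": "You're welcome."},
]

turn_lengths = [len(turn) for turn in split_into_turns(history)]
trimmed = keep_last_turns(history, 2)
print(turn_lengths)
print([m["role"] for m in trimmed])
```

Because whole turns are the unit being kept or dropped, the tool request and its result in the middle turn always travel together.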

Step 3 — The turn loop, now against the full history

chat() is the Part 6 tool loop with two differences: it appends to self.messages (the persistent list) instead of a local one, and it calls _trim() before returning so memory stays bounded.

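# A method of ChatSession, shown dedented; indent it one level inside the class.
# Needs `import json` at the top of the module, plus client, tools,
# available_tools, and MAX_STEPS carried over from the earlier parts.
def chat(self, user_message: str) -> str: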
    self.messages.append({"role": "user", "content": user_message})

    for step in range(1, MAX_STEPS + 1):
        response = client.chat.completions.create(
            model=MODEL, messages=self.messages, tools=tools,
            tool_choice="auto", temperature=0.2, max_tokens=400,
        )
        message = response.choices[0].message
        self.messages.append(message.model_dump(exclude_none=True))

        if not message.tool_calls:        # final answer for this turn
            self._trim()
            return message.content

        for tool_call in message.tool_calls:
            name = tool_call.function.name
            try:
                arguments = json.loads(tool_call.function.arguments or "{}")
            except json.JSONDecodeError:
                arguments = {}
            if name not in available_tools:
                result = f"Tool '{name}' is not available."
            else:
                try:
                    result = available_tools[name](**arguments)
                except Exception as exc:
                    result = f"Tool '{name}' failed: {exc}"
            if self.verbose:
                print(f"  step {step} · acting  -> {name}({json.dumps(arguments)})")
                print(f"  step {step} · observe <- {result}")
            self.messages.append({"role": "tool", "tool_call_id": tool_call.id,
                                  "name": name, "content": str(result)})

    self._trim()
    return "I reached the step limit before finishing — try asking a narrower question."

The system prompt does real work in multi-turn mode — it gains three lines over Part 6's prompt, and each earns its keep:

When a question refers back to something already discussed — words like 'that',
'those', 'then', 'it', or 'the second one' — resolve the reference from the
conversation so far before doing anything else.

Before calling a tool, check whether the conversation ALREADY contains the
fact you need — do not re-search for something you found a turn ago.

To compare how soon two days are, call days_until_weekday for EACH day and
compare the numbers it returns — never estimate the number of days yourself.

The first makes back-references resolve. The second matters because, without it, the 70B model will sometimes call search_campus_info again for something it retrieved two turns ago.

The third exists because, without it, the model cheerfully does the date arithmetic in its head on the "which is sooner?" turn and gets it wrong. Pushing the comparison back through the tool is the same lesson as Part 6: don't let the model guess when a function can calculate exactly.
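To see why the tool beats mental arithmetic, here is a rough stand-in for days_until_weekday. The real tool comes from Part 6 and is not reproduced in this post, so treat this version as an assumption: the today parameter is added here only to make the example reproducible, and returning 0 when the target is today is a guess at the behavior, as is ignoring LOCAL_TZ.

```python
from datetime import date

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def days_until_weekday(weekday, today=None):
    """Days from `today` until the next occurrence of `weekday` (0 if it is today).

    Hypothetical stand-in for the Part 6 tool; the real signature may differ.
    """
    today = today or date.today()
    target = WEEKDAYS.index(weekday.strip().lower())  # Monday == 0, like date.weekday()
    return (target - today.weekday()) % 7


# Fixed reference date (a Monday) so the result does not depend on when you run it.
today = date(2025, 1, 6)

# Call the function once per day and compare the returned numbers,
# which is what the third prompt line asks the model to do.
counts = {day: days_until_weekday(day, today) for day in ("Thursday", "Tuesday")}
sooner = min(counts, key=counts.get)
print(counts, "->", sooner)
```

The modulo handles the wrap past Sunday, which is the kind of step a model estimating in its head is liable to skip.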

Step 4 — Have a conversation

session = ChatSession(verbose=True)
for user_message in [
    "When does the USC AI Club meet?",              # search -> "Thursday"
    "How many days until that?",                    # "that" = Thursday (from memory)
    "And when are the AI/ML faculty office hours?", # search -> "Tuesday"
    "Which of those two is sooner?",                # compares BOTH remembered facts
]:
    print(f"\nYou:       {user_message}")
    print(f"Assistant: {session.chat(user_message)}")

Watch the two turns that can't stand alone:

"How many days until that?" — the word that has no referent in the sentence itself. The model reads Turn 1 from history, resolves it to Thursday, and calls days_until_weekday("Thursday"). Strip the history and this question is meaningless.
"Which of those two is sooner?" — the model has to hold two facts it retrieved on different turns (AI Club = Thursday, office hours = Tuesday) and compare them. That's only possible because both are still in memory.

Step 5 — Prove memory is the thing doing the work

session.reset()
print("You:       How many days until that?")
print(f"Assistant: {session.chat('How many days until that?')}")

Same question, empty history. With nothing behind it, "that" has no referent, so the agent has nothing to resolve and falls back. The only variable that changed was whether the conversation was there — which is exactly the point.

Step 6 — What you actually built, and what's still missing

The assistant now has continuity:

Workshop 1 gave it a brain.
Workshop 2 gave it memory of facts (retrieval).
Workshop 3 gave it judgment.
Workshop 4 gave it portability.
Workshop 5 gave it hands (one tool).
Workshop 6 gave it a plan (chained tools).
Workshop 7 gave it memory of the conversation.

Three things to keep in mind as you take it further:

The history window is a real limit, not a formality. When a fact scrolls out of the kept turns, the model can't refer to it, and the 70B model will sometimes confabulate what was said rather than admit it forgot. Try setting max_turns=2 and asking a follow-up about turn 1; you may see it invent an answer. That failure is exactly why production systems summarize old turns or store memory in a database instead of a list.
Trim by turns, never by messages. The orphaned-tool_call_id error is the most common way a beginner's multi-turn agent breaks. Cutting at user boundaries is the simplest safe rule.
Keep the temperature low. At higher temperatures the model varies its tool path between turns, so a follow-up may take a different route than the question it's following up on. temperature=0.2 keeps the conversation coherent.
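As one possible direction for the first point, the sketch below folds dropped turns into a single summary message instead of discarding them. None of this is in the workshop code: trim_with_summary and naive_summary are hypothetical names, the summarizer is a placeholder where a model call would likely go, and it assumes the endpoint accepts a second system message after the first, which is worth verifying for your model.

```python
def naive_summary(dropped):
    """Placeholder summarizer: joins the text of the dropped messages.
    In practice this would likely be a model call."""
    lines = [f"{m['role']}: {m['content']}" for m in dropped
             if m.get("role") in ("user", "assistant", "system") and m.get("content")]
    return " | ".join(lines)


def trim_with_summary(messages, max_turns, summarize):
    """Keep the last `max_turns` turns (max_turns >= 1) and replace everything
    older with one summary message. Cuts only at a user-message boundary."""
    system, body = messages[0], messages[1:]
    user_indices = [i for i, m in enumerate(body) if m.get("role") == "user"]
    if len(user_indices) <= max_turns:
        return list(messages)
    cut = user_indices[-max_turns]          # first index in `body` to keep
    dropped, kept = body[:cut], body[cut:]
    note = {"role": "system",
            "content": "Summary of earlier conversation: " + summarize(dropped)}
    return [system, note] + kept


# Invented three-turn history for illustration.
history = [
    {"role": "system", "content": "You are a campus assistant."},
    {"role": "user", "content": "When does the AI Club meet?"},
    {"role": "assistant", "content": "The AI Club meets on Thursday."},
    {"role": "user", "content": "How many days until that?"},
    {"role": "assistant", "content": "Three days."},
    {"role": "user", "content": "And the office hours?"},
    {"role": "assistant", "content": "Office hours are on Tuesday."},
]

compact = trim_with_summary(history, 2, naive_summary)
print(len(compact))
print(compact[1]["content"])
```

The cut still lands on a user boundary, so the tool-call pairing rule holds; the only change from plain trimming is that the dropped turns leave a trace the model can still refer to.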

Everything past here — summarization, a vector store for long-term memory, per-user sessions, streaming the replies — is normal software wrapped around the same loop. The agent is still a while loop over a model call. Now it just has a list that remembers.

Get the code

Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab: Open part7_memory_agent.ipynb
Local Python: part7_memory_agent.py in the repo (python3 part7_memory_agent.py after pip install -r requirements.txt).

MIT licensed. I run this at USC — fork it, swap the knowledge base and the tools for your school, your club, your project.