In Part 1 we got a USC campus assistant talking. In Part 2 we taught it to retrieve only the relevant context. Both posts ended with the same observation — when someone asked for the wifi password, the assistant refused. That refusal worked because we told it to. It would have just as happily made something up if we'd phrased the prompt differently.
This post is about hardening that refusal so it's not luck. Two guardrail layers, both small enough to read in one sitting, neither requiring a framework. First, tighten the prompt so the assistant knows what it's allowed to talk about. Second, add a second LLM call that re-reads the answer and the context and decides whether to ship the answer or refuse.
I'm B Torkian, NVIDIA Developer Champion at USC. This is the layer where a demo becomes something I'd actually let students use.
What you're adding
User question
→ retrieve top-k context (from Part 2)
→ scoped prompt: model answers OR returns the exact fallback line
→ grounding check: a second NIM call asks "is the answer supported by the context?"
→ ship the answer, or replace it with the fallback line
The chat call and the embedding setup carry over from Parts 1 and 2. Everything new in this post is fewer than 40 lines.
Why guardrails are not optional
The retrieval step from Part 2 narrowed what the model sees. It does nothing to stop the model from being clever with the data it has, or from drifting into topics outside the assistant's job.
Two real failure modes I've seen in student demos:
- Out-of-scope creep. Someone asks "can you write my breakup text?" The model is happy to oblige. The retriever still pulled three USC chunks (top-k cosine search always returns the k nearest chunks, however irrelevant), the prompt didn't forbid relationship advice, so the model wrote the text.
- Confident-sounding hallucinations. The retrieved chunk says "Monday to Friday, 10 AM to 6 PM." The user asks about Saturday hours. The model decides the friendly answer is "Saturday hours are 11 AM to 4 PM" — a fabrication that sounds like a reasonable inference.
The first failure is solved by prompt scope. The second is what the grounding check is for.
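To see why retrieval alone cannot flag an off-topic question, here is a minimal offline sketch. The 2-D vectors are invented stand-ins for real embeddings, and `top_k` with its `min_score` parameter is a hypothetical helper, not the Part 2 retriever. It shows that plain top-k hands back k chunks even when every similarity score is low.

```python
import numpy as np

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(np.dot(a, b) / denom)

def top_k(query_vec, chunks, k=3, min_score=None):
    """Return (score, title) pairs for the k best chunks.

    min_score is a hypothetical extra knob, not part of the Part 2 retriever.
    """
    scored = sorted(
        ((cosine_similarity(query_vec, vec), title) for title, vec in chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    if min_score is not None:
        scored = [pair for pair in scored if pair[0] >= min_score]
    return scored

# Toy 2-D "embeddings", invented for illustration only.
chunks = [
    ("AI Club meeting", np.array([1.0, 0.0])),
    ("GPU lab hours", np.array([0.9, 0.1])),
    ("Tutoring", np.array([0.8, 0.2])),
]
off_topic_query = np.array([0.0, 1.0])  # points away from every chunk

plain = top_k(off_topic_query, chunks)                    # always k results
filtered = top_k(off_topic_query, chunks, min_score=0.5)  # weak matches dropped
print(len(plain), len(filtered))
```

A score floor like this would need tuning against real embedding scores, so treat it as an optional extra. The two layers in this post do not depend on it.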
Step 1 — Setup (self-contained)
If you already have Workshops 1 + 2 running in the same Colab session, skip this cell. If you're starting fresh, paste this in — it bundles the client, the embedding model, the USC knowledge base, and the retriever from Parts 1 and 2 so the rest of this post stands on its own.
%pip install -q openai numpy

import os, getpass
from openai import OpenAI
import numpy as np

if not os.getenv("NVIDIA_API_KEY"):
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("Paste your NVIDIA API key (starts with nvapi-): ")

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

MODEL = "meta/llama-3.1-8b-instruct"
EMBED_MODEL = "nvidia/nv-embedqa-e5-v5"

def ask(system_prompt, user_message):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content

knowledge_base = [
    {"title": "USC AI Club meeting",
     "text": "The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204."},
    {"title": "USC GPU lab hours",
     "text": "The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM."},
    {"title": "NVIDIA Developer Program",
     "text": "USC students can join the NVIDIA Developer Program for free."},
    {"title": "Next USC workshop",
     "text": "The next USC AI Club workshop will cover Retrieval Augmented Generation (RAG)."},
    {"title": "USC AI/ML office hours",
     "text": "Office hours for the USC AI/ML faculty are Tuesdays 2-4 PM."},
    {"title": "USC robotics lab",
     "text": "The USC robotics lab requires safety training before students can use the soldering station."},
    {"title": "USC tutoring",
     "text": "Peer tutoring for introductory Python at USC is available Wednesdays from 1 PM to 3 PM."},
]

def embed_texts(texts, input_type="passage"):
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=texts,
        extra_body={"input_type": input_type},
    )
    return [np.array(item.embedding, dtype=np.float32) for item in response.data]

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)

def retrieve_context(question, k=3):
    q_emb = embed_texts([question], input_type="query")[0]
    scored = [(cosine_similarity(q_emb, item["embedding"]), item) for item in knowledge_base]
    scored.sort(key=lambda p: p[0], reverse=True)
    return "\n".join(f"- {item['text']}" for _, item in scored[:k])

for item, emb in zip(knowledge_base, embed_texts([i["text"] for i in knowledge_base], "passage")):
    item["embedding"] = emb

print(f"Ready. Embedded {len(knowledge_base)} chunks.")
That cell defines everything Workshops 1 and 2 produced. The Part 3 code below builds on ask, retrieve_context, and the embedded knowledge_base.
Step 2 — Layer 1: prompt scope with a fixed fallback line
FALLBACK = "I don't have that information — check with the USC AI Club."
SCOPED_SYSTEM_PROMPT_TEMPLATE = """You are a USC campus assistant for AI Club,
GPU lab, NVIDIA program, workshop, office hour, robotics lab, and tutoring
questions only.
Rules:
- Answer ONLY using the CONTEXT below.
- If the user asks about anything outside this scope (e.g. weather, jokes,
personal advice, code generation, general world knowledge), reply with
exactly: "{fallback}"
- If the answer is not present in the context, reply with exactly: "{fallback}"
- Do not invent names, dates, room numbers, links, passwords, schedules,
policies, or instructions that are not in the context.
CONTEXT:
{context}
"""
Three things are doing work in this prompt:
- A finite topic list. The assistant has a job description. "Anything outside this scope" gives the model a clear opt-out — it doesn't have to guess what's in-bounds.
- One exact fallback string. Same wording, every time. This matters in Step 4: ask_guarded() returns the same string when the grounding check overrides an answer, so downstream code only has to recognize one shape.
- An explicit don't-invent list. Models are pliable. Spelling out the dangerous categories (room numbers, passwords, policies) lowers hallucination noticeably with no extra calls.
This layer alone catches most off-topic and most "the context didn't mention it" cases.
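A small offline sketch of the two mechanics this layer relies on: filling the template's placeholders, and recognizing the fallback line when the model wraps it in quotes or extra whitespace. `DEMO_FALLBACK`, `DEMO_TEMPLATE`, `render_prompt`, and `is_fallback` are hypothetical stand-ins for illustration, not names from the workshop repo.

```python
import re

# Stand-in values for illustration; the post's real FALLBACK and template
# are the ones defined in Step 2.
DEMO_FALLBACK = "I don't have that information."
DEMO_TEMPLATE = (
    "Answer ONLY using the CONTEXT below.\n"
    'If the answer is not present, reply with exactly: "{fallback}"\n'
    "CONTEXT:\n{context}\n"
)

def render_prompt(template: str, fallback: str, context: str) -> str:
    """Fill the two placeholders; braces inside the values are left untouched."""
    return template.format(fallback=fallback, context=context)

def is_fallback(answer: str, fallback: str) -> bool:
    """True if the answer is the fallback line, ignoring case, outer quotes, and whitespace."""
    def normalize(text: str) -> str:
        text = text.strip().strip("\"'").strip()
        return re.sub(r"\s+", " ", text).lower()
    return normalize(answer) == normalize(fallback)

prompt = render_prompt(DEMO_TEMPLATE, DEMO_FALLBACK, "- Room {204} is in the engineering building.")
quoted_reply = is_fallback('  "I don\'t have that information."\n', DEMO_FALLBACK)
real_reply = is_fallback("The club meets Thursday at 5 PM.", DEMO_FALLBACK)
print(prompt)
print(quoted_reply, real_reply)
```

Normalizing before comparing is a design choice: models sometimes echo the quotation marks from the rule, and a strict `==` would miss that.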
Step 3 — Layer 2: a grounding check on every answer
The scoped prompt is a request — the model can still ignore it. Layer 2 is a separate, narrower NIM call whose only job is to look at the context and the answer and decide whether the answer is supported.
def answer_is_grounded(question: str, context: str, answer: str) -> bool:
    verdict = ask(
        system_prompt=(
            "You are a strict grounding verifier. Read the CONTEXT and the "
            "ANSWER. Respond with only 'yes' or 'no'. Say 'yes' if every "
            "factual claim in the ANSWER is directly supported by the CONTEXT. "
            "Say 'no' otherwise — including if the ANSWER adds information not "
            "in the CONTEXT, even if that information sounds plausible."
        ),
        user_message=(
            f"CONTEXT:\n{context}\n\n"
            f"QUESTION:\n{question}\n\n"
            f"ANSWER:\n{answer}\n\n"
            "Is every factual claim in the ANSWER supported by the CONTEXT?"
        ),
    )
    return verdict.strip().lower().startswith("yes")
Three things to notice:
- It's just another ask() call: same client, same hosted NIM model, no new infrastructure. Layer 2 costs one extra call per question.
- Yes/no only. Constraining the response shape makes the parsing reliable. Note that checking only the start of the string is lenient: it accepts "Yes." but it also lets a waffling "yes, but..." through as a pass. If you want waffling treated as a fail, compare the cleaned verdict against "yes" exactly.
- It can be wrong too. The verifier is itself an LLM. For workshop-grade safety this is fine; for production you'd add deterministic checks (regex for room numbers, exact string match for fallback) on top.
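How the verdict is parsed decides what counts as a pass. Below is a minimal sketch of a stricter parser, assuming you want anything other than a bare "yes" to fail closed. `parse_verdict` is a hypothetical helper, not part of the post's code. The loop prints its result next to the `startswith("yes")` check for comparison.

```python
import re

def parse_verdict(raw: str) -> bool:
    """Return True only for a bare 'yes' (any case, optional trailing '.' or '!').

    Anything else, including 'yes, but...', an empty string, or 'no',
    is treated as not grounded, so the caller falls back.
    """
    cleaned = raw.strip().lower()
    return re.fullmatch(r"yes[.!]?", cleaned) is not None

samples = ["yes", "Yes.", " YES \n", "yes, but the room is missing", "no", ""]
for raw in samples:
    strict = parse_verdict(raw)
    lenient = raw.strip().lower().startswith("yes")
    print(f"{raw!r:35} strict={strict} startswith={lenient}")
```

The two checks disagree only on the hedged reply: the prefix check passes it, the full match rejects it.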
Step 4 — Wire both layers into ask_guarded()
def ask_guarded(question: str) -> str:
    context = retrieve_context(question)                    # from Part 2
    system_prompt = SCOPED_SYSTEM_PROMPT_TEMPLATE.format(
        fallback=FALLBACK, context=context,
    )
    answer = ask(system_prompt, question)                   # Layer 1
    if not answer_is_grounded(question, context, answer):
        return FALLBACK                                     # Layer 2 override
    return answer

for question in [
    "When does the USC AI Club meet?",              # in scope, in context
    "Can you write my breakup text?",               # OUT of scope
    "What is the wifi password?",                   # in scope, NOT in context
    "What are the USC GPU lab Saturday hours?",     # invites a hallucination
]:
    print(f"Q: {question}")
    print(f"A: {ask_guarded(question)}\n")
Read the output carefully.
- The AI Club question returns a real answer from the context. Both layers pass.
- The breakup-text question hits Layer 1 — the scope rule catches it.
- The wifi question also hits Layer 1: nothing in the context mentions passwords, and the scoped prompt forbids inventing them.
- The Saturday-hours question is the one that earns its keep. The context says "Monday to Friday." A friendlier model would guess "closed on Saturday." Layer 2 reads that answer, sees "Saturday" is not in the context, and returns the fallback instead.
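The control flow can be exercised without an API key by swapping the NIM calls for stubs. This is a sketch under stated assumptions: `make_guarded` and the `fake_*` functions are hypothetical, the canned answers are invented, and the early return when Layer 1 has already produced the fallback is an addition that saves the second call (the post's ask_guarded() always calls the verifier).

```python
from typing import Callable

def make_guarded(
    retrieve: Callable[[str], str],
    generate: Callable[[str, str], str],
    is_grounded: Callable[[str, str, str], bool],
    fallback: str,
) -> Callable[[str], str]:
    """Build an ask_guarded-style function from injected pieces (hypothetical helper)."""
    def guarded(question: str) -> str:
        context = retrieve(question)
        answer = generate(context, question)
        if answer.strip() == fallback:
            return fallback  # Layer 1 already refused; skip the verifier call
        return answer if is_grounded(question, context, answer) else fallback
    return guarded

# Stubs standing in for the real NIM calls, so this runs offline.
DEMO_FALLBACK = "I don't have that information."
CONTEXT = "- The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM."
verifier_calls = []

def fake_retrieve(question):
    return CONTEXT

def fake_generate(context, question):
    if "Saturday" in question:
        return "Saturday hours are 11 AM to 4 PM."  # canned hallucination
    if "wifi" in question:
        return DEMO_FALLBACK                         # Layer 1 refusal
    return "The lab is open Monday to Friday from 10 AM to 6 PM."

def fake_is_grounded(question, context, answer):
    verifier_calls.append(question)
    return "Saturday" not in answer  # crude stand-in for the LLM verifier

ask_guarded_offline = make_guarded(fake_retrieve, fake_generate, fake_is_grounded, DEMO_FALLBACK)
questions = [
    "When is the GPU lab open?",
    "What is the wifi password?",
    "What are the Saturday hours?",
]
results = {q: ask_guarded_offline(q) for q in questions}
for q, a in results.items():
    print(f"Q: {q}\nA: {a}\n")
```

Passing the three steps in as arguments keeps the gate logic testable on its own, separate from whatever the hosted model happens to return on a given run.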
Step 5 — What you actually built
You took the retriever from Part 2 and put it inside two cheap, inspectable guardrails. The whole thing is still one Python file, still one hosted NIM endpoint, still no vector database. The mental model is:
- Retrieval decides what the model sees.
- Scoped prompt decides what the model is allowed to write.
- Grounding check decides whether what the model wrote ships.
Real production systems extend each of these — deterministic rule checks, structured output, confidence thresholds, dedicated safety models, human review queues. The shape stays the same. Every additional layer is a yes/no gate between the user's question and the final response.
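As one example of a deterministic rule check, here is a rough sketch that flags clock times and numbers in an answer that never appear in the retrieved context. The `SPECIFIC` pattern and `unsupported_specifics` are hypothetical, and the check is deliberately crude: it ignores weekday names and can flag a legitimate paraphrase, so it would sit alongside the LLM verifier rather than replace it.

```python
import re

# Matches clock times like "5 PM" or "10:30 am", and standalone numbers like "204".
SPECIFIC = re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b|\b\d+\b", re.IGNORECASE)

def unsupported_specifics(answer: str, context: str) -> list[str]:
    """List times and numbers that appear in the answer but not in the context."""
    def normalize(token: str) -> str:
        return re.sub(r"\s+", "", token).lower()
    known = {normalize(m) for m in SPECIFIC.findall(context)}
    return [m for m in SPECIFIC.findall(answer) if normalize(m) not in known]

context = "The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM."
invented = unsupported_specifics("Saturday hours are 11 AM to 4 PM.", context)
supported = unsupported_specifics("The lab is open from 10 AM to 6 PM.", context)
print(invented)   # times the context never mentions
print(supported)  # empty when every time is backed by the context
```

A non-empty list would be one more reason to return the fallback line, which keeps this check in the same yes/no gate shape as the other layers.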
Get the code
Repo: github.com/torkian/nvidia-nim-workshop
One-click Colab for Part 3: Open part3_guardrails.ipynb
Local Python: part3_guardrails.py in the repo (python3 part3_guardrails.py after pip install -r requirements.txt).
MIT licensed. I run this at USC — fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.
Previously / next in this series
- Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes
- Part 2: From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM
- Part 4 (next): Run NIM on Your Own GPU — same OpenAI-compatible API, different endpoint. Useful when you want data locality, predictable latency, or a self-hosted dev loop.