Why ChatGPT Refuses Harmful Requests — Build Guardrails in Python
Lesson 7 of 9 — A Tour of Agents
The entire AI agent stack in 60 lines of Python.

Your agent will do anything you ask. Delete a database. Leak a password. Call a tool it shouldn't. Right now there's nothing stopping it — no filter, no boundary, no policy. That's a problem.
ChatGPT won't help you build a bomb. Here's how that refusal actually works under the hood — and it's simpler than you'd expect.
The concept: two gates
The fix is two checkpoints — an input gate and an output gate.
The input gate runs before the agent sees the message. It scans the user's request against a list of rules. If any rule fails, the message never reaches the LLM. Blocked.
The output gate runs after the agent responds but before the user sees the response. It scans the output for things that shouldn't leak — passwords, API keys, internal data. If a rule triggers, the response gets redacted.
Two gates. One before. One after. That's the entire architecture.
The code
Each gate is a list of lambda functions. Every lambda takes the message text and returns True (safe) or False (violation). Adding a new rule is one more line:
```python
input_rules = [
    lambda msg: "delete" not in msg.lower(),
    lambda msg: "drop" not in msg.lower(),
]
output_rules = [
    lambda msg: "password" not in msg.lower(),
]
```
Before the agent runs, loop through input_rules. If any returns False, block the request. After the agent responds, loop through output_rules. If any returns False, redact the output.
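That loop can be sketched end to end. Everything here beyond the two rule lists is illustrative: `fake_agent` is a stub standing in for the real LLM call, and the `[BLOCKED]`/`[REDACTED]` markers are just placeholder strings.

```python
input_rules = [
    lambda msg: "delete" not in msg.lower(),
    lambda msg: "drop" not in msg.lower(),
]
output_rules = [
    lambda msg: "password" not in msg.lower(),
]

def fake_agent(msg):
    # Stub for the LLM: just echoes the request back.
    return f"Sure! Here is how to handle: {msg}"

def run_with_guardrails(msg):
    # Input gate: every rule must pass before the agent sees the message.
    if not all(rule(msg) for rule in input_rules):
        return "[BLOCKED] request violates input policy"
    response = fake_agent(msg)
    # Output gate: every rule must pass before the user sees the response.
    if not all(rule(response) for rule in output_rules):
        return "[REDACTED]"
    return response

print(run_with_guardrails("drop all tables"))            # input gate fires
print(run_with_guardrails("what's the admin password?")) # output gate fires
print(run_with_guardrails("summarize my notes"))         # passes both gates
```

`all()` short-circuits, so checking a gate is one line: the first rule that returns False stops the message.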
Watch it work
Input gate
A user sends "drop all tables." The input gate checks each rule. The second lambda catches "drop" — returns False. The request is blocked before the LLM ever sees it.
Output gate
A user asks "what's the admin password?" The agent responds with the password. But the output gate catches "password" in the response — redacted before the user sees it.
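Replacing the whole response is the bluntest form of redaction. A gentler variant scrubs only the offending span and lets the rest through. This is a sketch under assumptions: the patterns and the `[REDACTED]` marker are illustrative, and real redactors use far more careful detectors.

```python
import re

# Illustrative patterns: scrub from "password" to end of line, and
# anything shaped like an sk-prefixed API key.
SENSITIVE = [r"password\b.*", r"sk-[A-Za-z0-9]+"]

def redact(text):
    for pattern in SENSITIVE:
        text = re.sub(pattern, "[REDACTED]", text, flags=re.IGNORECASE)
    return text

print(redact("The admin password is hunter2"))  # -> The admin [REDACTED]
```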
Framework parallel
This is exactly what production guardrail systems do:
- Guardrails AI — validators that check inputs and outputs against rules, with on-fail actions (block, redact, retry)
- NVIDIA NeMo Guardrails — programmable rails that intercept messages before and after the LLM
- OpenAI Moderation API — a classifier endpoint that flags harmful content before it reaches the model
The pattern is always the same: gate before, gate after, list of rules.
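The on-fail idea from those libraries can be sketched in plain Python by pairing each rule with an action. The names here (`Rule`, the `on_fail` strings) are illustrative, not any library's actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    check: Callable[[str], bool]  # True = safe
    on_fail: str                  # "block", "redact", or "retry"

rules = [
    Rule(check=lambda m: "drop" not in m.lower(), on_fail="block"),
    Rule(check=lambda m: "password" not in m.lower(), on_fail="redact"),
]

def apply(rules, msg):
    # First failing rule decides the action; the caller carries it out.
    for rule in rules:
        if not rule.check(msg):
            return rule.on_fail
    return "pass"

print(apply(rules, "drop all tables"))  # -> block
```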
Try it
Run this code yourself at tinyagents.dev. Add your own rules — block SQL injection patterns, redact email addresses, flag profanity — and watch your agent develop boundaries.
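As a starting point for that exercise, here are two possible extra rules using regex. The patterns are deliberately simple sketches, nowhere near production-grade detectors.

```python
import re

extra_input_rules = [
    # Crude SQL-injection check: block the classic tautology pattern.
    lambda msg: re.search(r"('|\")\s*or\s+1\s*=\s*1", msg.lower()) is None,
]
extra_output_rules = [
    # Flag responses containing anything shaped like an email address.
    lambda msg: re.search(r"\b\S+@\S+\.\S+\b", msg) is None,
]
```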
Next up: Lesson 8 — your agent plans its own work. Self-scheduling with a task queue and a budget.
This is Lesson 7 of A Tour of Agents — a free interactive course that builds an AI agent from scratch. No frameworks. No abstractions. Just the code.