How to Fine-Tune GPT-4o-mini on Your Own Guardrail Failures (50 Lines of Python)
Every time your LLM gets corrected by a guardrail, a training example is born and immediately thrown away. This tutorial shows you how to catch those examples and use them to make your model better — automatically, with no manual labeling.
By the end, you'll have a working pipeline that:
- Validates LLM outputs against natural language requirements
- Retries failures with structured feedback
- Captures every (rejected → corrected) pair to disk
- Exports those pairs in OpenAI fine-tuning format
- Uploads to OpenAI for fine-tuning
Total code: ~50 lines. Total manual labeling: zero.
Prerequisites
pip install "semantix-ai[all]" openai
You'll need an OpenAI API key for the LLM calls and fine-tuning upload. The validation itself runs locally — no API cost.
Step 1: Define What "Correct" Means
Semantix uses Intent classes. The docstring is the requirement. That's it.
from semantix import Intent
class ProfessionalDecline(Intent):
    """The text must politely decline an invitation without
    being rude, dismissive, or aggressive."""

class ConstructiveFeedback(Intent):
    """The text must provide encouraging, constructive feedback
    that acknowledges effort and suggests specific improvements."""
These aren't prompts. They're contracts. The validator checks every output against them.
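Nothing magic is happening here: the requirement is just the class docstring, which plain Python can read back at runtime. A minimal illustration (the stub `Intent` base below is a stand-in for demonstration, not the library's actual class):

```python
import inspect

class Intent:
    """Stand-in base class for illustration only."""

class ProfessionalDecline(Intent):
    """The text must politely decline an invitation without
    being rude, dismissive, or aggressive."""

# A validator can read the requirement straight off the class:
requirement = inspect.getdoc(ProfessionalDecline)
print(requirement.startswith("The text must politely decline"))  # -> True
```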
Step 2: Wire Up Validation + Collection
from typing import Optional
from openai import OpenAI
from semantix import validate_intent
from semantix.training import TrainingCollector
client = OpenAI()
collector = TrainingCollector("training_data.jsonl")
@validate_intent(retries=2, collector=collector)
def decline_invite(event: str, semantix_feedback: Optional[str] = None) -> ProfessionalDecline:
    messages = [{"role": "user", "content": f"Decline this invitation: {event}"}]
    if semantix_feedback:
        messages.append({"role": "user", "content": semantix_feedback})
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    ).choices[0].message.content
Here's what happens when you call decline_invite("the company retreat"):
- GPT-4o-mini generates a response
- Semantix validates it against the docstring using a local NLI model (~15ms)
- If it fails: structured feedback is injected via semantix_feedback and the function retries
- If the retry passes: the (rejected, accepted) pair is appended to training_data.jsonl
- If it passes first try: nothing is collected (no correction happened)
The semantix_feedback parameter is optional: declare it and the decorator fills it automatically on retries; omit it and retries still work, the model just doesn't get the structured hint.
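This opt-in behavior doesn't require anything exotic: a decorator can inspect the wrapped function's signature before deciding whether to pass the keyword. A sketch of the mechanism (not the library's actual implementation):

```python
import functools
import inspect

def inject_feedback(fn):
    # Decide once, at decoration time, whether fn accepts the hint.
    accepts = "semantix_feedback" in inspect.signature(fn).parameters

    @functools.wraps(fn)
    def wrapper(*args, feedback=None, **kwargs):
        if accepts and feedback is not None:
            kwargs["semantix_feedback"] = feedback
        return fn(*args, **kwargs)
    return wrapper

@inject_feedback
def with_hint(event, semantix_feedback=None):
    return (event, semantix_feedback)

@inject_feedback
def without_hint(event):
    return (event, None)

print(with_hint("gala", feedback="be more polite"))     # hint forwarded
print(without_hint("gala", feedback="be more polite"))  # hint silently dropped
```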
Step 3: Generate Traffic
In production, this happens organically. For this tutorial, simulate it:
events = [
    "a birthday party for someone you don't like",
    "a mandatory corporate retreat",
    "a wedding where you're the best man",
    "a networking event at a bar",
    "a charity gala you can't afford",
    "a baby shower for a coworker you barely know",
    "a holiday dinner with your in-laws",
    "a surprise party that isn't a surprise",
]

for event in events:
    try:
        result = decline_invite(event)
        print(f"OK: {event[:40]}... -> {str(result)[:60]}")
    except Exception as e:
        print(f"FAIL: {event[:40]}... -> {e}")
After running this, check what was captured:
stats = collector.stats()
print(f"Correction pairs collected: {stats['total_pairs']}")
print(f"Intents: {stats['intents']}")
Every pair represents a case where the model got it wrong, got feedback, and got it right. These are the hardest examples — exactly the ones worth training on.
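If you want to inspect the raw file yourself, the same numbers can be recomputed from the JSONL directly. This sketch assumes each line is a JSON object with an `intent` field; the actual record schema belongs to the library, so treat the field name as an assumption:

```python
import json
from collections import Counter

def stats_from_jsonl(lines):
    """Recompute collector-style stats from raw JSONL lines.
    NOTE: the "intent" field name is an assumed record schema."""
    intents = Counter()
    for line in lines:
        if line.strip():
            intents[json.loads(line)["intent"]] += 1
    return {"total_pairs": sum(intents.values()), "intents": dict(intents)}
```

Usage: `stats_from_jsonl(open("training_data.jsonl"))`.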
Step 4: Export to Fine-Tuning Format
from semantix.training.exporters import export_openai
export_openai("training_data.jsonl", "finetune.jsonl")
Each correction pair becomes a chat completion training example:
{
  "messages": [
    {"role": "system", "content": "You must satisfy the following requirement:\n\nThe text must politely decline an invitation without being rude, dismissive, or aggressive."},
    {"role": "user", "content": "Generate a response that satisfies the above requirement."},
    {"role": "assistant", "content": "Thank you for the invitation, but I won't be able to attend..."}
  ]
}
Only the accepted output is used as the training target. The rejected output served its purpose — it triggered the correction.
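The export is a simple mapping: requirement in, accepted output out, in the chat format shown above. A sketch of the per-pair transformation (the real exporter's internals may differ):

```python
def pair_to_example(requirement: str, accepted_output: str) -> dict:
    # Map one correction pair to the chat-format training example shown above.
    # Only the accepted output becomes the assistant target; the rejected
    # output is deliberately dropped.
    return {
        "messages": [
            {"role": "system",
             "content": f"You must satisfy the following requirement:\n\n{requirement}"},
            {"role": "user",
             "content": "Generate a response that satisfies the above requirement."},
            {"role": "assistant", "content": accepted_output},
        ]
    }
```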
Step 5: Upload and Fine-Tune
from openai import OpenAI
client = OpenAI()
# Upload the file
file = client.files.create(
    file=open("finetune.jsonl", "rb"),
    purpose="fine-tune",
)

# Start fine-tuning
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
)

print(f"Fine-tuning job: {job.id}")
print(f"Status: {job.status}")
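Jobs run asynchronously, so in a script you'll typically poll until a terminal state. A generic polling helper, kept API-free so it's easy to test (the terminal-state names follow the OpenAI job lifecycle):

```python
import time

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_status, poll_seconds=30, sleep=time.sleep):
    """Call fetch_status() repeatedly until it returns a terminal state."""
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
```

With a real client this would be `wait_for_job(lambda: client.fine_tuning.jobs.retrieve(job.id).status)`.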
Wait for the job to complete (usually 10-30 minutes for small datasets). Then swap your model ID:
# Before: gpt-4o-mini
# After: ft:gpt-4o-mini-2024-07-18:your-org::job-id
@validate_intent(retries=2, collector=collector)
def decline_invite(event: str, semantix_feedback: Optional[str] = None) -> ProfessionalDecline:
    messages = [{"role": "user", "content": f"Decline this invitation: {event}"}]
    if semantix_feedback:
        messages.append({"role": "user", "content": semantix_feedback})
    return client.chat.completions.create(
        model="ft:gpt-4o-mini-2024-07-18:your-org::job-id",  # <-- fine-tuned
        messages=messages,
    ).choices[0].message.content
The fine-tuned model runs through semantix again. It fails less. But when it does fail, those new correction pairs are captured too. Fine-tune again. Fails even less.
The Flywheel
Week 1: gpt-4o-mini → 15% failure rate → 200 correction pairs
Week 2: fine-tuned-v1 → 5% failure rate → 70 correction pairs
Week 3: fine-tuned-v2 → 2% failure rate → 25 correction pairs
Week 4: fine-tuned-v3 → <1% failure rate
These numbers are illustrative, but the pattern is real: each round of fine-tuning reduces the failure rate, which reduces the number of corrections, which means each subsequent training set is smaller but harder — exactly what you want.
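Under a constant-traffic assumption, those pair counts are roughly traffic times failure rate. A toy model (the weekly call volume is made up, chosen to roughly match the numbers above, and it assumes every failure is eventually corrected into a pair):

```python
weekly_calls = 1333  # assumed traffic; not a real measurement

for version, failure_rate in [("gpt-4o-mini", 0.15),
                              ("fine-tuned-v1", 0.05),
                              ("fine-tuned-v2", 0.02)]:
    # Each failure that gets corrected yields one training pair.
    pairs = round(weekly_calls * failure_rate)
    print(f"{version}: ~{pairs} correction pairs")
```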
No human labeled a single example. The guardrail did the labeling.
Try It Without an API Key
Don't have an OpenAI key? Run the full loop locally:
git clone https://github.com/labrat-akhona/semantix-ai.git
cd semantix-ai
pip install -e .
python examples/flywheel_demo.py
The demo uses a simple keyword judge instead of NLI, but the pipeline is identical: validate, fail, correct, capture, export.
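A keyword judge is about as simple as validators get: fail the output if any banned phrase appears. A hypothetical version (the demo's actual rules may differ):

```python
# Hypothetical banned-phrase list; the demo's real word list may differ.
RUDE_PHRASES = {"stupid", "waste of time", "ridiculous", "no way"}

def keyword_judge(text: str) -> bool:
    """Pass iff none of the banned phrases appear (case-insensitive)."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in RUDE_PHRASES)
```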
What's Actually Happening Under the Hood
The @validate_intent decorator does four things:
- Calls your function and gets the raw string output
- Evaluates the string against the Intent's docstring using an NLI model (locally, ~15ms)
- On failure: builds a structured Markdown feedback report, injects it via semantix_feedback, and retries
- On success after failure: calls collector.record() with the rejected output, accepted output, scores, and feedback
The NLI model (cross-encoder/nli-MiniLM2-L6-H768) computes an entailment probability — how likely is it that the output satisfies the requirement? If the probability is below the threshold (default 0.5), validation fails.
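NLI cross-encoders emit one logit per class, so the validation decision is just a softmax plus a threshold. A sketch of that final step (model inference omitted; the [contradiction, entailment, neutral] label order is an assumption about this particular checkpoint):

```python
import math

def entailment_prob(logits):
    # Softmax over the three NLI class logits.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return exps[1] / total  # assumed label order: [contradiction, entailment, neutral]

def passes(logits, threshold=0.5):
    # Validation passes iff the entailment probability clears the threshold.
    return entailment_prob(logits) >= threshold
```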
No LLM is used for validation. No API calls. No tokens burned on checking.
When to Use This
This pattern works best when:
- Your LLM has a specific behavioral requirement (tone, style, compliance, safety)
- You're already retrying failures (so correction pairs exist)
- You want domain-specific fine-tuning without paying for human annotation
- Your failure rate is high enough to generate meaningful training data (>5%)
It works less well when:
- Your requirements are purely structural (use Pydantic)
- Your model never fails (you don't need a guardrail)
- Your outputs are too short or uniform to benefit from fine-tuning
The Full Script
Here's the complete pipeline in one file:
from typing import Optional
from openai import OpenAI
from semantix import Intent, validate_intent
from semantix.training import TrainingCollector
from semantix.training.exporters import export_openai
# 1. Define the requirement
class ProfessionalDecline(Intent):
    """The text must politely decline an invitation without
    being rude, dismissive, or aggressive."""

# 2. Set up collection
client = OpenAI()
collector = TrainingCollector("training_data.jsonl")

# 3. Wrap your LLM call
@validate_intent(retries=2, collector=collector)
def decline_invite(event: str, semantix_feedback: Optional[str] = None) -> ProfessionalDecline:
    messages = [{"role": "user", "content": f"Decline this invitation: {event}"}]
    if semantix_feedback:
        messages.append({"role": "user", "content": semantix_feedback})
    return client.chat.completions.create(
        model="gpt-4o-mini", messages=messages,
    ).choices[0].message.content

# 4. Generate traffic
for event in ["a party", "a retreat", "a wedding", "a gala"]:
    try:
        decline_invite(event)
    except Exception:
        pass

# 5. Export and fine-tune
export_openai("training_data.jsonl", "finetune.jsonl")
print(f"Collected {collector.stats()['total_pairs']} training pairs")
print("Ready for: openai api fine_tuning.jobs.create -t finetune.jsonl")
That's it. Your guardrail is now your training pipeline.
semantix-ai — pip install 'semantix-ai[all]'
PyPI | GitHub | Previous article: Your AI Guardrail Is a Dead End
Built by Akhona Eland in South Africa. 166 tests. Zero labeling. Your failures are now your curriculum.