
Akhona Eland


How to Fine-Tune GPT-4o-mini on Your Own Guardrail Failures (50 Lines of Python)


Every time your LLM gets corrected by a guardrail, a training example is born and immediately thrown away. This tutorial shows you how to catch those examples and use them to make your model better — automatically, with no manual labeling.

By the end, you'll have a working pipeline that:

  1. Validates LLM outputs against natural language requirements
  2. Retries failures with structured feedback
  3. Captures every (rejected → corrected) pair to disk
  4. Exports those pairs in OpenAI fine-tuning format
  5. Uploads to OpenAI for fine-tuning

Total code: ~50 lines. Total manual labeling: zero.


Prerequisites

pip install "semantix-ai[all]" openai

You'll need an OpenAI API key for the LLM calls and fine-tuning upload. The validation itself runs locally — no API cost.


Step 1: Define What "Correct" Means

Semantix uses Intent classes. The docstring is the requirement. That's it.

from semantix import Intent

class ProfessionalDecline(Intent):
    """The text must politely decline an invitation without
    being rude, dismissive, or aggressive."""

class ConstructiveFeedback(Intent):
    """The text must provide encouraging, constructive feedback
    that acknowledges effort and suggests specific improvements."""

These aren't prompts. They're contracts. The validator checks every output against them.
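These classes are ordinary Python, which is what keeps the mechanism simple: the requirement is available at runtime as the class docstring. A minimal stand-in showing the language feature the library relies on (no semantix import here, just plain Python):

```python
import inspect

# Plain class standing in for an Intent subclass; the real base
# class comes from semantix, but the docstring mechanics are identical.
class ProfessionalDecline:
    """The text must politely decline an invitation without
    being rude, dismissive, or aggressive."""

# The requirement is just the dedented docstring.
requirement = inspect.cleandoc(ProfessionalDecline.__doc__)
print(requirement)
```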


Step 2: Wire Up Validation + Collection

from typing import Optional
from openai import OpenAI
from semantix import validate_intent
from semantix.training import TrainingCollector

client = OpenAI()
collector = TrainingCollector("training_data.jsonl")

@validate_intent(retries=2, collector=collector)
def decline_invite(event: str, semantix_feedback: Optional[str] = None) -> ProfessionalDecline:
    messages = [{"role": "user", "content": f"Decline this invitation: {event}"}]
    if semantix_feedback:
        messages.append({"role": "user", "content": semantix_feedback})
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    ).choices[0].message.content

Here's what happens when you call decline_invite("the company retreat"):

  1. GPT-4o-mini generates a response
  2. Semantix validates it against the docstring using a local NLI model (~15ms)
  3. If it fails: structured feedback is injected via semantix_feedback and the function retries
  4. If the retry passes: the (rejected, accepted) pair is appended to training_data.jsonl
  5. If it passes first try: nothing is collected (no correction happened)

The semantix_feedback parameter is optional. If you declare it, the decorator fills it automatically on retries; if you don't, retries still work, but the model doesn't get the structured hint.
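The injection itself is ordinary decorator machinery: inspect the wrapped function's signature and pass the keyword only if it's declared. Here's a minimal sketch of that pattern, written from scratch for illustration, not semantix's actual implementation:

```python
import functools
import inspect

def retry_with_feedback(retries, judge):
    """Retry `fn` up to `retries` extra times, passing the judge's
    feedback via `semantix_feedback` if the function declares it."""
    def decorator(fn):
        accepts_feedback = "semantix_feedback" in inspect.signature(fn).parameters

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            feedback = None
            for attempt in range(retries + 1):
                if accepts_feedback and feedback is not None:
                    kwargs["semantix_feedback"] = feedback
                output = fn(*args, **kwargs)
                ok, feedback = judge(output)
                if ok:
                    return output
            raise ValueError(f"validation failed after {retries + 1} attempts")
        return wrapper
    return decorator
```

Here `judge` is any callable returning a `(passed, feedback)` pair; in semantix's case that role is played by the NLI validator.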


Step 3: Generate Traffic

In production, this happens organically. For this tutorial, simulate it:

events = [
    "a birthday party for someone you don't like",
    "a mandatory corporate retreat",
    "a wedding where you're the best man",
    "a networking event at a bar",
    "a charity gala you can't afford",
    "a baby shower for a coworker you barely know",
    "a holiday dinner with your in-laws",
    "a surprise party that isn't a surprise",
]

for event in events:
    try:
        result = decline_invite(event)
        print(f"OK: {event[:40]}... -> {str(result)[:60]}")
    except Exception as e:
        print(f"FAIL: {event[:40]}... -> {e}")

After running this, check what was captured:

stats = collector.stats()
print(f"Correction pairs collected: {stats['total_pairs']}")
print(f"Intents: {stats['intents']}")

Every pair represents a case where the model got it wrong, got feedback, and got it right. These are the hardest examples — exactly the ones worth training on.


Step 4: Export to Fine-Tuning Format

from semantix.training.exporters import export_openai

export_openai("training_data.jsonl", "finetune.jsonl")

Each correction pair becomes a chat completion training example:

{
  "messages": [
    {"role": "system", "content": "You must satisfy the following requirement:\n\nThe text must politely decline an invitation without being rude, dismissive, or aggressive."},
    {"role": "user", "content": "Generate a response that satisfies the above requirement."},
    {"role": "assistant", "content": "Thank you for the invitation, but I won't be able to attend..."}
  ]
}

Only the accepted output is used as the training target. The rejected output served its purpose — it triggered the correction.


Step 5: Upload and Fine-Tune

from openai import OpenAI

client = OpenAI()

# Upload the file
file = client.files.create(
    file=open("finetune.jsonl", "rb"),
    purpose="fine-tune",
)

# Start fine-tuning
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
)

print(f"Fine-tuning job: {job.id}")
print(f"Status: {job.status}")

Wait for the job to complete (usually 10-30 minutes for small datasets). Then swap your model ID:
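If you'd rather poll from code than watch the dashboard, the official client's `fine_tuning.jobs.retrieve` covers it. A small sketch (`wait_for_job` and its helper are written here for illustration):

```python
import time

TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def is_terminal(status: str) -> bool:
    """True once a fine-tuning job can no longer change state."""
    return status in TERMINAL_STATUSES

def wait_for_job(client, job_id: str, poll_seconds: int = 60) -> str:
    """Poll until the job finishes; returns the fine-tuned model ID
    on success. Assumes `client` is an openai.OpenAI instance."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if is_terminal(job.status):
            if job.status != "succeeded":
                raise RuntimeError(f"job {job_id} ended with status {job.status}")
            return job.fine_tuned_model
        time.sleep(poll_seconds)
```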

# Before: gpt-4o-mini
# After:  ft:gpt-4o-mini-2024-07-18:your-org::job-id

@validate_intent(retries=2, collector=collector)
def decline_invite(event: str, semantix_feedback: Optional[str] = None) -> ProfessionalDecline:
    return client.chat.completions.create(
        model="ft:gpt-4o-mini-2024-07-18:your-org::job-id",  # <-- fine-tuned
        messages=[...],
    ).choices[0].message.content

The fine-tuned model runs through Semantix again. It fails less. But when it does fail, those new correction pairs are captured too. Fine-tune again. It fails even less.


The Flywheel

Week 1: gpt-4o-mini          → 15% failure rate → 200 correction pairs
Week 2: fine-tuned-v1        →  5% failure rate →  70 correction pairs  
Week 3: fine-tuned-v2        →  2% failure rate →  25 correction pairs
Week 4: fine-tuned-v3        →  <1% failure rate

These numbers are illustrative, but the pattern is real: each round of fine-tuning reduces the failure rate, which reduces the number of corrections, which means each subsequent training set is smaller but harder — exactly what you want.

No human labeled a single example. The guardrail did the labeling.


Try It Without an API Key

Don't have an OpenAI key? Run the full loop locally:

git clone https://github.com/labrat-akhona/semantix-ai.git
cd semantix-ai
pip install -e .
python examples/flywheel_demo.py

The demo uses a simple keyword judge instead of NLI, but the pipeline is identical: validate, fail, correct, capture, export.


What's Actually Happening Under the Hood

The @validate_intent decorator does four things:

  1. Calls your function and gets the raw string output
  2. Evaluates the string against the Intent's docstring using an NLI model (locally, ~15ms)
  3. On failure: builds a structured Markdown feedback report, injects it via semantix_feedback, retries
  4. On success after failure: calls collector.record() with the rejected output, accepted output, scores, and feedback

The NLI model (cross-encoder/nli-MiniLM2-L6-H768) computes an entailment probability — how likely is it that the output satisfies the requirement? If the probability is below the threshold (default 0.5), validation fails.
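That threshold check reduces to a softmax over the model's three NLI logits. A minimal sketch of the math, with the actual model call shown commented; the contradiction/entailment/neutral label order is the sentence-transformers convention for these cross-encoders, so treat the index as an assumption to verify:

```python
import math

def entailment_probability(logits):
    """Numerically stable softmax over (contradiction, entailment,
    neutral) logits; returns P(entailment)."""
    exps = [math.exp(x - max(logits)) for x in logits]
    return exps[1] / sum(exps)

def passes(logits, threshold=0.5):
    return entailment_probability(logits) >= threshold

# Actual scoring (assumes sentence-transformers is installed):
# from sentence_transformers import CrossEncoder
# model = CrossEncoder("cross-encoder/nli-MiniLM2-L6-H768")
# logits = model.predict([(output_text, requirement_text)])[0]
# ok = passes(logits)
```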

No LLM is used for validation. No API calls. No tokens burned on checking.


When to Use This

This pattern works best when:

  • Your LLM has a specific behavioral requirement (tone, style, compliance, safety)
  • You're already retrying failures (so correction pairs exist)
  • You want domain-specific fine-tuning without paying for human annotation
  • Your failure rate is high enough to generate meaningful training data (>5%)

It works less well when:

  • Your requirements are purely structural (use Pydantic)
  • Your model never fails (you don't need a guardrail)
  • Your outputs are too short or uniform to benefit from fine-tuning

The Full Script

Here's the complete pipeline in one file:

from typing import Optional
from openai import OpenAI
from semantix import Intent, validate_intent
from semantix.training import TrainingCollector
from semantix.training.exporters import export_openai

# 1. Define the requirement
class ProfessionalDecline(Intent):
    """The text must politely decline an invitation without
    being rude, dismissive, or aggressive."""

# 2. Set up collection
client = OpenAI()
collector = TrainingCollector("training_data.jsonl")

# 3. Wrap your LLM call
@validate_intent(retries=2, collector=collector)
def decline_invite(event: str, semantix_feedback: Optional[str] = None) -> ProfessionalDecline:
    messages = [{"role": "user", "content": f"Decline this invitation: {event}"}]
    if semantix_feedback:
        messages.append({"role": "user", "content": semantix_feedback})
    return client.chat.completions.create(
        model="gpt-4o-mini", messages=messages,
    ).choices[0].message.content

# 4. Generate traffic
for event in ["a party", "a retreat", "a wedding", "a gala"]:
    try:
        decline_invite(event)
    except Exception:
        pass

# 5. Export and fine-tune
export_openai("training_data.jsonl", "finetune.jsonl")
print(f"Collected {collector.stats()['total_pairs']} training pairs")
print("Ready for: openai api fine_tuning.jobs.create -t finetune.jsonl")

That's it. Your guardrail is now your training pipeline.


semantix-ai: pip install 'semantix-ai[all]'

PyPI | GitHub | Previous article: Your AI Guardrail Is a Dead End

Built by Akhona Eland in South Africa. 166 tests. Zero labeling. Your failures are now your curriculum.
