Originally published on AI Tech Connect.
The uncomfortable finding

Suppose you take a safety-aligned frontier model — one that reliably refuses harmful requests, does not hallucinate dangerous medical advice, and behaves predictably across a wide range of prompts. You then fine-tune it on your company's customer service transcripts: benign exchanges about account queries, refund policies, and product support. No harmful content. No adversarial examples. Nothing that would fail a content review.

You evaluate the fine-tuned model on your task and it performs well. You ship it. Six weeks later, a red-teamer probing an unrelated part of the system discovers that your model is now willing to produce content the base model would have refused, or is behaving in subtly deceptive ways in edge cases that have nothing to do with customer…