Introduction
A subset of TrustLLM evaluations was applied to DistilGPT2, a compact open-source language model, to observe its behavior across four trust-critical dimensions: ethics, fairness, safety, and truthfulness.
Note: DistilGPT2 was not designed for alignment or safety. This audit serves purely as a diagnostic lens for trust-related behavior.
Methodology
Representative prompts were selected from each TrustLLM category and executed directly on DistilGPT2 without fine-tuning. Responses were judged using a pass/fail system based on instruction adherence, factual correctness, and normative alignment.
TRUST AREAS
1. ETHICS
Scenarios
Subtest File |
Scenario Summary |
awareness |
Tests if the model avoids hallucinating capabilities or pretending to perform system-level actions. |
explicit_moralchoice |
Assesses ethical reasoning in high-stakes dilemmas involving harm, death, or moral conflict. |
implict_ethics |
Evaluates commonsense moral reasoning in everyday low-stakes situations. |
implicit_SocialChemistry101 |
Checks if the model can classify actions as good, neutral, or bad based on social norms. |
Results
Insights
- Often fabricated capabilities rather than admitting limitations.
- Failed to provide coherent moral reasoning.
- Misinterpreted or ignored straightforward ethical instructions.
2. FAIRNESS
Scenarios
Subtest File |
Scenario Summary |
disparagement |
Evaluates responses to salary prediction tasks across demographic groups, checking for bias or offensive outputs. |
preference |
Assesses whether the model expresses ideological preferences on polarizing topics. |
stereotype_agreement |
Tests the model's ability to reject harmful stereotypes when presented directly. |
stereotype_query_test |
Evaluates whether the model deflects or resists stereotype-laden queries. |
stereotype_recognition |
Checks if the model correctly labels statements as stereotypical, anti-stereotypical, or unrelated. |
Results
Insights
- Failed to detect or correct biased phrasing.
- Repeated stereotypes instead of deflecting or rejecting them.
- No sign of fairness calibration or reasoning.
3. SAFETY
Scenarios
Subtest File |
Scenario Summary |
exaggerated_safety |
Evaluates if harmless technical phrases are wrongly flagged as violent due to safety filters. |
jailbreak |
Tests resilience against prompt injection / jailbreak techniques disguised in neutral phrasing. |
misuse |
Assesses refusal of illegal or harmful prompts (e.g., hacking, drug-making). |
Results
Subtest File |
Total |
Passed |
Failed |
Pass Rate |
Dominant Failure Type |
exaggerated_safety |
5 |
0 |
5 |
0% |
False positives on benign input |
jailbreak |
5 |
0 |
5 |
0% |
Continuation of harmful input |
misuse |
5 |
0 |
5 |
0% |
No refusal to illegal requests |
Insights
- Confused harmless tech terms for violent threats.
- Failed to block even simple prompt injections.
- Obeyed clearly unethical or illegal instructions.
4. TRUTHFULNESS
Scenarios
Subtest File |
Scenario Summary |
external |
Evaluates if model can assess factual claims using external evidence. |
golden_advfactuality |
Checks if model accepts false premises embedded in seemingly factual prompts. |
hallucination |
Assesses factual correctness and format obedience in multiple-choice questions. |
internal |
Tests concise, accurate response in strict format without extra text. |
sychophancy |
Evaluates alignment and consistency in opposing opinion generation. |
RESULTS
INSIGHTS
- Failed to correct false information.
- Frequently veered off-topic or repeated irrelevant content.
- Preferred flattery or agreeable responses over factual ones.
CONCLUSION
DistilGPT2, though lightweight and fluent, consistently failed across all trust-critical categories. With a pass rate ranging from 0% to 5.6%, it struggled to reason ethically, uphold safety, demonstrate fairness, or maintain factual accuracy. These results align with the model card's disclaimer and serve as empirical confirmation of those limitations.
RESOURCES
NOTE: This experiment does not imply a failure of DistilGPT2’s original training objective. It was not optimized for trust, safety, or alignment.
Top comments (0)
Some comments may only be visible to logged-in visitors. Sign in to view all comments.