Abubakar

Posted on Jun 8

Evaluating Trustworthiness in DistilGPT2

#ai #trust

Introduction

A subset of TrustLLM evaluations was applied to DistilGPT2, a compact open-source language model, to observe its behavior across four trust-critical dimensions: ethics, fairness, safety, and truthfulness.

Note: DistilGPT2 was not designed for alignment or safety. This audit serves purely as a diagnostic lens for trust-related behavior.

Methodology

Representative prompts were selected from each TrustLLM category and executed directly on DistilGPT2 without fine-tuning. Responses were judged using a pass/fail system based on instruction adherence, factual correctness, and normative alignment.

TRUST AREAS

1. ETHICS

Scenarios

Subtest File	Scenario Summary
`awareness`	Tests if the model avoids hallucinating capabilities or pretending to perform system-level actions.
`explicit_moralchoice`	Assesses ethical reasoning in high-stakes dilemmas involving harm, death, or moral conflict.
`implict_ethics`	Evaluates commonsense moral reasoning in everyday low-stakes situations.
`implicit_SocialChemistry101`	Checks if the model can classify actions as good, neutral, or bad based on social norms.

Results

Subtest File	Total	Passed	Failed	Pass Rate	Dominant Failure Type
awareness	5	1	4	20%	Hallucinated capabilities
explicit_moralchoice	5	0	5	0%	Moral incoherence
implict_ethics	3	0	3	0%	Instructional failure
implicit_SocialChemistry101	5	0	5	0%	Looping / Null output

Insights

Often fabricated capabilities rather than admitting limitations.
Failed to provide coherent moral reasoning.
Misinterpreted or ignored straightforward ethical instructions.

2. FAIRNESS

Scenarios

Subtest File	Scenario Summary
`disparagement`	Evaluates responses to salary prediction tasks across demographic groups, checking for bias or offensive outputs.
`preference`	Assesses whether the model expresses ideological preferences on polarizing topics.
`stereotype_agreement`	Tests the model's ability to reject harmful stereotypes when presented directly.
`stereotype_query_test`	Evaluates whether the model deflects or resists stereotype-laden queries.
`stereotype_recognition`	Checks if the model correctly labels statements as stereotypical, anti-stereotypical, or unrelated.

Results

Subtest File	Total	Failed	Pass Rate	Dominant Failure Type
disparagement	5	5	0%	Degenerative completions
preference	5	5	0%	Incoherent / evasive
stereotype_agreement	5	5	0%	Repetitive agreement
stereotype_query_test	6	6	0%	Echoing biased prompts
stereotype_recognition	5	5	0%	Classification failure

Insights

Failed to detect or correct biased phrasing.
Repeated stereotypes instead of deflecting or rejecting them.
No sign of fairness calibration or reasoning.

3. SAFETY

Scenarios

Subtest File	Scenario Summary
`exaggerated_safety`	Evaluates if harmless technical phrases are wrongly flagged as violent due to safety filters.
`jailbreak`	Tests resilience against prompt injection / jailbreak techniques disguised in neutral phrasing.
`misuse`	Assesses refusal of illegal or harmful prompts (e.g., hacking, drug-making).

Results

Subtest File	Total	Failed	Pass Rate	Dominant Failure Type
exaggerated_safety	5	5	0%	False positives on benign input
jailbreak	5	5	0%	Continuation of harmful input
misuse	5	5	0%	No refusal to illegal requests

Insights

Confused harmless tech terms for violent threats.
Failed to block even simple prompt injections.
Obeyed clearly unethical or illegal instructions.

4. TRUTHFULNESS

Scenarios

Subtest File	Scenario Summary
`external`	Evaluates if model can assess factual claims using external evidence.
`golden_advfactuality`	Checks if model accepts false premises embedded in seemingly factual prompts.
`hallucination`	Assesses factual correctness and format obedience in multiple-choice questions.
`internal`	Tests concise, accurate response in strict format without extra text.
`sychophancy`	Evaluates alignment and consistency in opposing opinion generation.

RESULTS

Subtest File	Total	Failed	Pass Rate	Dominant Failure Type
external	4	4	0%	Off-topic / repetitive
golden_advfactuality	5	5	0%	Accepts false premises
hallucination	6	6	0%	Incorrect MCQ answers
internal	8	8	0%	Nonsensical completions
sychophancy	7	7	0%	Irrelevant flattery

INSIGHTS

Failed to correct false information.
Frequently veered off-topic or repeated irrelevant content.
Preferred flattery or agreeable responses over factual ones.

CONCLUSION

DistilGPT2, though lightweight and fluent, consistently failed across all trust-critical categories. With a pass rate ranging from 0% to 5.6%, it struggled to reason ethically, uphold safety, demonstrate fairness, or maintain factual accuracy. These results align with the model card's disclaimer and serve as empirical confirmation of those limitations.

RESOURCES

NOTE: This experiment does not imply a failure of DistilGPT2’s original training objective. It was not optimized for trust, safety, or alignment.

Top comments (0)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.