DEV Community

Abubakar
Abubakar

Posted on

Evaluating Trustworthiness in DistilGPT2

Introduction

A subset of TrustLLM evaluations was applied to DistilGPT2, a compact open-source language model, to observe its behavior across four trust-critical dimensions: ethics, fairness, safety, and truthfulness.

Note: DistilGPT2 was not designed for alignment or safety. This audit serves purely as a diagnostic lens for trust-related behavior.

Methodology

Representative prompts were selected from each TrustLLM category and executed directly on DistilGPT2 without fine-tuning. Responses were judged using a pass/fail system based on instruction adherence, factual correctness, and normative alignment.

TRUST AREAS

1. ETHICS

Scenarios

Subtest File Scenario Summary
awareness Tests if the model avoids hallucinating capabilities or pretending to perform system-level actions.
explicit_moralchoice Assesses ethical reasoning in high-stakes dilemmas involving harm, death, or moral conflict.
implict_ethics Evaluates commonsense moral reasoning in everyday low-stakes situations.
implicit_SocialChemistry101 Checks if the model can classify actions as good, neutral, or bad based on social norms.

Results

Subtest File Total Passed Failed Pass Rate Dominant Failure Type
awareness 5 1 4 20% Hallucinated capabilities
explicit_moralchoice 5 0 5 0% Moral incoherence
implict_ethics 3 0 3 0% Instructional failure
implicit_SocialChemistry101 5 0 5 0% Looping / Null output

Insights

  • Often fabricated capabilities rather than admitting limitations.
  • Failed to provide coherent moral reasoning.
  • Misinterpreted or ignored straightforward ethical instructions.

2. FAIRNESS

Scenarios

Subtest File Scenario Summary
disparagement Evaluates responses to salary prediction tasks across demographic groups, checking for bias or offensive outputs.
preference Assesses whether the model expresses ideological preferences on polarizing topics.
stereotype_agreement Tests the model's ability to reject harmful stereotypes when presented directly.
stereotype_query_test Evaluates whether the model deflects or resists stereotype-laden queries.
stereotype_recognition Checks if the model correctly labels statements as stereotypical, anti-stereotypical, or unrelated.

Results

Subtest File Total Passed Failed Pass Rate Dominant Failure Type
disparagement 5 0 5 0% Degenerative completions
preference 5 0 5 0% Incoherent / evasive
stereotype_agreement 5 0 5 0% Repetitive agreement
stereotype_query_test 6 0 6 0% Echoing biased prompts
stereotype_recognition 5 0 5 0% Classification failure

Insights

  • Failed to detect or correct biased phrasing.
  • Repeated stereotypes instead of deflecting or rejecting them.
  • No sign of fairness calibration or reasoning.

3. SAFETY

Scenarios

Subtest File Scenario Summary
exaggerated_safety Evaluates if harmless technical phrases are wrongly flagged as violent due to safety filters.
jailbreak Tests resilience against prompt injection / jailbreak techniques disguised in neutral phrasing.
misuse Assesses refusal of illegal or harmful prompts (e.g., hacking, drug-making).

Results

Subtest File Total Passed Failed Pass Rate Dominant Failure Type
exaggerated_safety 5 0 5 0% False positives on benign input
jailbreak 5 0 5 0% Continuation of harmful input
misuse 5 0 5 0% No refusal to illegal requests

Insights

  • Confused harmless tech terms for violent threats.
  • Failed to block even simple prompt injections.
  • Obeyed clearly unethical or illegal instructions.

4. TRUTHFULNESS

Scenarios

Subtest File Scenario Summary
external Evaluates if model can assess factual claims using external evidence.
golden_advfactuality Checks if model accepts false premises embedded in seemingly factual prompts.
hallucination Assesses factual correctness and format obedience in multiple-choice questions.
internal Tests concise, accurate response in strict format without extra text.
sychophancy Evaluates alignment and consistency in opposing opinion generation.

RESULTS

Subtest File Total Passed Failed Pass Rate Dominant Failure Type
external 4 0 4 0% Off-topic / repetitive
golden_advfactuality 5 0 5 0% Accepts false premises
hallucination 6 0 6 0% Incorrect MCQ answers
internal 8 0 8 0% Nonsensical completions
sychophancy 7 0 7 0% Irrelevant flattery

INSIGHTS

  • Failed to correct false information.
  • Frequently veered off-topic or repeated irrelevant content.
  • Preferred flattery or agreeable responses over factual ones.

CONCLUSION

DistilGPT2, though lightweight and fluent, consistently failed across all trust-critical categories. With a pass rate ranging from 0% to 5.6%, it struggled to reason ethically, uphold safety, demonstrate fairness, or maintain factual accuracy. These results align with the model card's disclaimer and serve as empirical confirmation of those limitations.

RESOURCES

NOTE: This experiment does not imply a failure of DistilGPT2’s original training objective. It was not optimized for trust, safety, or alignment.

Top comments (0)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.