Qwen 2.5 7B Expresses Near-Constant Confidence Whether It Is Right or Wrong, Study Finds


A June 2026 arXiv preprint from University of Minnesota researchers tested Qwen 2.5 7B on structured clinical prediction data and found its verbalized confidence scores are essentially uninformative -- clustering between 0.856 and 0.937 no matter how well or badly the model performs. Combining SHAP-

A large language model that says it is 85% confident and one that says it is 94% confident should behave very differently. A preprint published on arXiv in June 2026 by researchers at the University of Minnesota shows that for Qwen 2.5 7B on structured clinical data, those two numbers are functionally identical -- and neither predicts whether the model is actually correct.

The paper, authored by Akshat Dasula, Prasanna Desikan, and Jaideep Srivastava, tests the model on a clinical prediction task and records its verbalized confidence across conditions where accuracy ranges from 49% to 75.3%. The confidence scores barely move: the full range reported is 0.856 to 0.937. In calibration terms, the model is badly miscalibrated -- its stated certainty tracks prompt format, not prediction quality.

The finding is not a quirk of Qwen. A Nature-published study on gastroenterology clinical reasoning questions, covering 48 LLMs including GPT-4o and Claude 3.5 Sonnet, found the same pattern: models maintained high confidence regardless of question difficulty or whether they were correct. The Minnesota paper adds a sharper diagnostic: it uses cross-model attribution divergence to pinpoint where Qwen's reasoning departs from a well-calibrated baseline.

The wrong answer, delivered confidently

The paper's most striking result is an inverse difficulty effect. On the subset of cases where XGBoost -- a classical gradient-boosted tree model -- reaches 99% accuracy, Qwen 2.5 7B's accuracy falls to 64.8%. On the subset where XGBoost itself is only about 73% accurate, Qwen matches it at 73.8% versus 73.1%. Because Qwen's stated confidence stays high throughout, it is most confidently wrong precisely on the cases that a traditional ML model finds easy.

The authors attribute this to a 'cold start' problem: LLMs have rich prior knowledge encoded from pretraining on natural language, but structured tabular data -- rows of clinical variables with no textual context -- sits outside that prior. The model lacks the feature-space intuitions that XGBoost builds from training data, so its confidence signals reflect linguistic patterns in the prompt rather than actual predictive evidence.

This is a meaningful risk in clinical settings. If a physician or decision-support system treats verbalized confidence as a reliability signal, high-confidence wrong answers on easy cases are more dangerous than acknowledged uncertainty.

Two fixes, better together

The researchers tested four conditions: an unmodified baseline, few-shot examples only, SHAP attribution injection only, and the two combined.

Alone, neither few-shot examples nor SHAP injection dramatically changes accuracy. Combined, they are super-additive: the Attribution Disagreement Score (ADS) -- a measure of how differently Qwen and XGBoost weight the same features -- drops from 1.54 to 0.38, and accuracy rises from 49% to 75.3%. No retraining or fine-tuning is required.
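The article does not reproduce the paper's formula for ADS, so the following is only a rough illustration of what "how differently two models weight the same features" can look like in code. It assumes one plausible form, the L1 distance between normalized absolute attribution profiles, which is not confirmed to be the authors' definition. The function names, the feature names, and the toy weights are all hypothetical.

```python
def normalize(attributions):
    """Scale absolute attribution values so they sum to 1."""
    total = sum(abs(v) for v in attributions.values())
    if total == 0:
        raise ValueError("attributions are all zero")
    return {name: abs(v) / total for name, v in attributions.items()}


def attribution_disagreement(llm_attr, ref_attr):
    """ASSUMED score (not the paper's formula): L1 distance between
    two normalized feature-importance profiles. 0 means identical
    profiles; 2 means no shared importance at all."""
    p = normalize(llm_attr)
    q = normalize(ref_attr)
    features = set(p) | set(q)
    # A feature missing from one profile counts as zero importance there.
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in features)


# Made-up importance profiles for three hypothetical clinical features.
llm_weights = {"age": 0.6, "glucose": 0.1, "blood_pressure": 0.3}
xgb_weights = {"age": 0.1, "glucose": 0.7, "blood_pressure": 0.2}
toy_score = attribution_disagreement(llm_weights, xgb_weights)
```

Under this assumed definition, a falling score means the LLM's feature weighting is moving toward the tree model's, which is the direction of change the paper reports for the combined intervention.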

The calibration result is arguably the most deployment-relevant. A cross-model calibrator that uses attribution divergence between Qwen and XGBoost to estimate when the LLM is reliable reduces expected calibration error (ECE) from 0.254 to 0.080. To put that in context, a 2026 EACL benchmarking study found that LLMs with 70B+ parameters typically achieve ECE around 0.10; the calibrator brings a 7B model to roughly that standard on clinical tabular data, without accessing model weights or requiring repeated inference.
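For readers unfamiliar with the metric, ECE groups predictions into bins by stated confidence, then averages the gap between mean confidence and observed accuracy in each bin, weighted by bin size. A minimal sketch in plain Python follows. The function name, the equal-width ten-bin default, and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of
    (share of samples in bin) * |bin accuracy - bin mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Made-up data: a model that always states 0.9 but is right half the time.
toy_conf = [0.9, 0.9, 0.9, 0.9]
toy_correct = [True, False, True, False]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
```

A model whose confidence sits in a narrow band while its accuracy swings widely, as reported for Qwen here, will tend to score poorly on this metric because the per-bin gap stays large.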

Key facts

Qwen 2.5 7B confidence range on the test set: 0.856-0.937 (a 0.081-point spread across wildly different accuracy levels)
Baseline accuracy: 49%; accuracy with combined SHAP + few-shot: 75.3%
Attribution Disagreement Score: 1.54 (baseline) vs. 0.38 (combined intervention)
Expected calibration error: 0.254 (baseline) vs. 0.080 (with cross-model calibrator)
XGBoost accuracy on the same task: ~99% on easy subsets, ~73% on uncertain ones

Why this matters beyond one model

Qwen 2.5 7B is a capable open-weight model widely used in research and production deployments where the cost of GPT-4o is prohibitive. Although the paper tests only one model, its findings are potentially relevant to any LLM asked to reason over tabular or structured data without domain-specific fine-tuning -- a common scenario in healthcare, finance, and logistics.

The practical takeaway is not that LLMs should be abandoned for structured prediction. It is that verbalized confidence, on its own, should not be trusted as a reliability signal for structured data tasks. The paper's cross-model calibrator offers a lightweight alternative: pair the LLM with a classical ML model, compare their feature attributions, and use divergence as a proxy for when to defer to the ML model or flag for human review.
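The deferral idea in that last sentence can be sketched as a simple routing rule. This is an illustration only: it assumes a per-case divergence score has already been computed by some means, and the thresholds, the Decision type, and the route function are hypothetical choices rather than anything specified in the paper.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    prediction: int
    source: str  # "llm", "ml_model", or "human_review"


def route(llm_pred, ml_pred, divergence,
          defer_threshold=0.5, review_threshold=1.0):
    """Pick whose answer to use based on attribution divergence.
    Both thresholds are ASSUMED values that would need tuning on
    held-out data; they are not taken from the paper."""
    if divergence >= review_threshold:
        # The models disagree strongly about which features matter:
        # attach the classical model's answer as provisional and escalate.
        return Decision(ml_pred, "human_review")
    if divergence >= defer_threshold:
        # Moderate disagreement: defer to the classical model.
        return Decision(ml_pred, "ml_model")
    # Low divergence: the LLM's reasoning resembles the baseline's.
    return Decision(llm_pred, "llm")


low = route(llm_pred=1, ml_pred=0, divergence=0.2)
mid = route(llm_pred=1, ml_pred=0, divergence=0.7)
high = route(llm_pred=1, ml_pred=0, divergence=1.5)
```

The design choice worth noting is that the LLM's own verbalized confidence never enters the rule; only the externally computed divergence does.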

This aligns with a broader research consensus. Multiple 2026 studies confirm that XGBoost and LightGBM outperform LLMs on clinical tabular classification, while LLMs retain advantages on unstructured text and reasoning tasks. Hybrid pipelines that delegate structured prediction to classical models and use LLMs for explanation and synthesis are emerging as the practical middle ground.

What to watch

Whether the SHAP + few-shot combination generalises beyond the single clinical dataset used here is the open question. The preprint tests one model on one task; replication on larger models (Llama 3 70B, Qwen 2.5 72B) and on financial or operational tabular data would determine whether the cross-model calibrator is a broadly applicable tool or a result specific to this experimental setup. The arXiv preprint (2606.19509) was submitted in June 2026 and has not yet been peer-reviewed, so clinical deployment decisions should wait for independent replication.

Source: arxiv_ai

Originally published on gentic.news