41 Open-Source Large Language Models Benchmark Testing Report

Project Overview

This is a large-scale evaluation of open-source large language models: 41 open-source LLMs were run through 19 benchmark tasks using the lm-evaluation-harness library. All evaluations were completed locally on a personal computer, giving a picture of how the models perform across a range of tasks.
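For orientation, here is a minimal sketch of how a single model can be run through a few of these tasks with lm-evaluation-harness. The model name, task subset, and settings are illustrative only, and argument details can vary between library versions.

```python
# Illustrative example: evaluating one model on a few of the benchmark tasks
# with lm-evaluation-harness. Model and settings are placeholders; the exact
# configuration used for this report is not documented here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",
    tasks=["gsm8k", "hellaswag", "mmlu"],          # a small subset of the benchmarks
    batch_size="auto",
    device="cuda:0",
)

# Per-task metrics (e.g. exact_match, acc_norm) are returned under "results".
for task, metrics in results["results"].items():
    print(task, metrics)
```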

Evaluation Framework

Test Categories

The benchmark tests are divided into three main categories (a task-grouping sketch follows the list):

1. Reasoning & Math

  • Tasks: gsm8k, bbh, arc_challenge, anli_r1/r2/r3, gpqa_main_zeroshot
  • Evaluation Metrics: Exact match, strict match, normalized accuracy, etc.

2. Commonsense & Natural Language Inference (NLI)

  • Tasks: hellaswag, piqa, winogrande, boolq, openbookqa, sciq, qnli
  • Evaluation Metrics: Normalized accuracy, accuracy, etc.

3. Knowledge & Reading Comprehension

  • Tasks: mmlu, nq_open, drop, truthfulqa_mc1/mc2, triviaqa
  • Evaluation Metrics: Accuracy, exact match, F1 score, etc.
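For later aggregation, the three categories can be captured as a simple task-to-group mapping. The mapping below merely mirrors the lists above; it is not something provided by lm-evaluation-harness itself.

```python
# Task groups used in this report, keyed by category name.
TASK_GROUPS = {
    "Reasoning & Math": [
        "gsm8k", "bbh", "arc_challenge",
        "anli_r1", "anli_r2", "anli_r3",
        "gpqa_main_zeroshot",
    ],
    "Commonsense & NLI": [
        "hellaswag", "piqa", "winogrande", "boolq",
        "openbookqa", "sciq", "qnli",
    ],
    "Knowledge & Reading Comprehension": [
        "mmlu", "nq_open", "drop",
        "truthfulqa_mc1", "truthfulqa_mc2", "triviaqa",
    ],
}
```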

Key Metrics Explanation

Model Naming Convention

  • Format: Company_ModelName
  • Quantized models marked with: (8bit)

Time Metrics

  • Total Time: Wall-clock runtime needed to complete all benchmark tests
  • GPU Util Time: Equivalent GPU time on an RTX 5090 at 100% utilization (one way to read this metric is sketched below)
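The report does not spell out how GPU Util Time is derived, so the following is only an assumed reading: wall-clock runtime scaled by average GPU utilization.

```python
# Assumption: "GPU Util Time" is interpreted here as wall-clock runtime scaled
# by average GPU utilization, i.e. the time an RTX 5090 would need at 100% load.
def gpu_util_time(total_hours: float, avg_utilization: float) -> float:
    return total_hours * avg_utilization

# With the top-ranked model's figures (15h 45m total, 14h 8m GPU util time),
# this reading implies roughly 90% average utilization.
print(gpu_util_time(15.75, 0.897))  # ≈ 14.13 hours, i.e. about 14h 8m
```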

Scoring System

  • Mean Score: Arithmetic mean of the scores across all benchmark tasks
  • Score Range: 0 to 1; higher scores indicate better performance
  • Ranking: Determined by each model's mean score across tasks (a minimal computation sketch follows this list)
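As a rough sketch, the mean score and ranking can be reproduced from per-task scores like this. The scores below are invented placeholders, not the report's actual numbers; in practice they come from the harness output.

```python
# Minimal sketch: average each model's per-task scores and rank by the mean.
# The per-task scores here are placeholders, not the report's real numbers.
from statistics import mean

per_task_scores = {
    "google_gemma-3-12b-it": {"gsm8k": 0.62, "hellaswag": 0.78, "mmlu": 0.60},
    "Qwen_Qwen3-8B":         {"gsm8k": 0.60, "hellaswag": 0.75, "mmlu": 0.58},
}

mean_scores = {model: mean(scores.values()) for model, scores in per_task_scores.items()}
ranking = sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True)

for rank, (model, score) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {score:.4f}")

# Category scores work the same way: restrict the tasks to one entry of
# TASK_GROUPS (defined earlier) before averaging.
```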

Test Results Leaderboard

Overall Ranking (Top 10)

| Rank | Model Name | Total Time | GPU Util Time | Mean Score |
|------|------------|------------|---------------|------------|
| 1 | google_gemma-3-12b-it | 15h 45m | 14h 8m | 0.6038 |
| 2 | Qwen_Qwen3-14B (8bit) | 29h 45m | 17h 29m | 0.5961 |
| 3 | openchat_openchat-3.6-8b-20240522 | 7h 51m | 6h 59m | 0.5871 |
| 4 | Qwen_Qwen3-8B | 15h 31m | 13h 44m | 0.5859 |
| 5 | Qwen_Qwen2.5-7B-Instruct | 9h 36m | 8h 33m | 0.5788 |
| 6 | Qwen_Qwen2.5-14B-Instruct (8bit) | 52h 44m | 29h 32m | 0.5775 |
| 7 | 01-ai_Yi-1.5-9B | 11h 43m | 10h 26m | 0.5676 |
| 8 | Qwen_Qwen2.5-7B-Instruct-1M | 11h 17m | 10h 10m | 0.5672 |
| 9 | meta-llama_Llama-3.1-8B-Instruct | 12h 19m | 10h 52m | 0.5653 |
| 10 | 01-ai_Yi-1.5-9B-Chat | 13h 54m | 12h 15m | 0.5621 |

Category Ranking Highlights

Reasoning & Math Performance Ranking (Top 5)

  1. google_gemma-3-12b-it (0.6266)
  2. Qwen_Qwen3-8B (0.6214)
  3. Qwen_Qwen3-14B (8bit) (0.586)
  4. Qwen_Qwen3-4B (0.5712)
  5. Qwen_Qwen2.5-7B-Instruct (0.5541)

Commonsense & NLI Ranking (Top 5)

  1. Qwen_Qwen2.5-14B-Instruct (8bit) (0.7941)
  2. Qwen_Qwen3-14B (8bit) (0.7807)
  3. google_gemma-3-12b-it (0.7737)
  4. Qwen_Qwen2.5-7B-Instruct (0.773)
  5. openchat_openchat-3.6-8b-20240522 (0.7726)

Knowledge & Reading Comprehension Ranking (Top 5)

  1. 01-ai_Yi-1.5-9B (0.4369)
  2. openchat_openchat-3.6-8b-20240522 (0.4136)
  3. meta-llama_Llama-3.1-8B-Instruct (0.4127)
  4. 01-ai_Yi-1.5-6B (0.4063)
  5. mistralai_Mistral-7B-Instruct-v0.3 (0.4045)

Key Findings

Performance Analysis

  • Google Gemma-3-12B-IT tops the overall ranking, particularly excelling in reasoning and math tasks
  • Qwen series models show strong performance across all categories, especially in commonsense reasoning
  • Yi series models excel in knowledge and reading comprehension tasks
  • Quantized models (8bit) maintain good performance while significantly reducing memory and compute requirements (see the loading sketch after this list)
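The report does not document its exact quantization setup, so the following is only one plausible way to load an 8-bit model for evaluation, using the bitsandbytes route in Hugging Face Transformers; the quantized model can then be passed to the harness in the usual way.

```python
# Hypothetical sketch: loading a model with 8-bit bitsandbytes quantization.
# Treat this as one plausible configuration, not the author's exact method.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-14B"  # one of the models evaluated in 8-bit
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place/offload layers automatically to fit the weights
)
```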

Efficiency Analysis

  • Smaller models can compete with larger models on certain tasks
  • GPU utilization time correlates positively with model size and complexity
  • Some mid-sized models offer a better performance-to-cost trade-off

Project Resource Consumption

  • Total Machine Runtime: 18 days 8 hours
  • Equivalent GPU Time: 14 days 23 hours (RTX 5090 at 100% utilization)
  • Environmental Impact: Carbon neutralized through active use of public transportation 😊

Project Value

This comprehensive evaluation provides the open-source LLM community with:

  1. Objective performance comparison benchmarks
  2. Efficiency analysis of different scale models
  3. Task-specific model selection guidance
  4. Empirical data on quantization technique effectiveness

The complete data, scripts, and logs of this project have been open-sourced, providing valuable reference resources for researchers and developers.


Data Source: Hugging Face Spaces Leaderboard
Article Source: CurateClick
