41 Open-Source Large Language Models Benchmark Testing Report

Project Overview

This is a large-scale evaluation of open-source large language models: 41 open-source LLMs were run through 19 benchmark tasks using the lm-evaluation-harness library. All evaluations were completed locally on a personal computer, giving a picture of how the models perform across a range of tasks.
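For orientation, here is a minimal sketch of how a single model can be run through a few of these tasks with lm-evaluation-harness. The model name, task subset, and settings are illustrative only, and argument details can vary between library versions.

```python
# Illustrative example: evaluating one model on a few of the benchmark tasks
# with lm-evaluation-harness. Model and settings are placeholders; the exact
# configuration used for this report is not documented here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",
    tasks=["gsm8k", "hellaswag", "mmlu"],          # a small subset of the benchmarks
    batch_size="auto",
    device="cuda:0",
)

# Per-task metrics (e.g. exact_match, acc_norm) are returned under "results".
for task, metrics in results["results"].items():
    print(task, metrics)
```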

Evaluation Framework

Test Categories

The benchmark tests are divided into three main categories (a task-grouping sketch follows the list):

1. Reasoning & Math

  • Tasks: gsm8k, bbh, arc_challenge, anli_r1/r2/r3, gpqa_main_zeroshot
  • Evaluation Metrics: Exact match, strict match, normalized accuracy, etc.

2. Commonsense & Natural Language Inference (NLI)

  • Tasks: hellaswag, piqa, winogrande, boolq, openbookqa, sciq, qnli
  • Evaluation Metrics: Normalized accuracy, accuracy, etc.

3. Knowledge & Reading Comprehension

  • Tasks: mmlu, nq_open, drop, truthfulqa_mc1/mc2, triviaqa
  • Evaluation Metrics: Accuracy, exact match, F1 score, etc.
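For later aggregation, the three categories can be captured as a simple task-to-group mapping. The mapping below merely mirrors the lists above; it is not something provided by lm-evaluation-harness itself.

```python
# Task groups used in this report, keyed by category name.
TASK_GROUPS = {
    "Reasoning & Math": [
        "gsm8k", "bbh", "arc_challenge",
        "anli_r1", "anli_r2", "anli_r3",
        "gpqa_main_zeroshot",
    ],
    "Commonsense & NLI": [
        "hellaswag", "piqa", "winogrande", "boolq",
        "openbookqa", "sciq", "qnli",
    ],
    "Knowledge & Reading Comprehension": [
        "mmlu", "nq_open", "drop",
        "truthfulqa_mc1", "truthfulqa_mc2", "triviaqa",
    ],
}
```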

Key Metrics Explanation

Model Naming Convention

  • Format: Company_ModelName
  • Quantized models marked with: (8bit)

Time Metrics

  • Total Time: Wall-clock runtime needed to complete all benchmark tests
  • GPU Util Time: Equivalent GPU time on an RTX 5090 at 100% utilization (one way to read this metric is sketched below)
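The report does not spell out how GPU Util Time is derived, so the following is only an assumed reading: wall-clock runtime scaled by average GPU utilization.

```python
# Assumption: "GPU Util Time" is interpreted here as wall-clock runtime scaled
# by average GPU utilization, i.e. the time an RTX 5090 would need at 100% load.
def gpu_util_time(total_hours: float, avg_utilization: float) -> float:
    return total_hours * avg_utilization

# With the top-ranked model's figures (15h 45m total, 14h 8m GPU util time),
# this reading implies roughly 90% average utilization.
print(gpu_util_time(15.75, 0.897))  # ≈ 14.13 hours, i.e. about 14h 8m
```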

Scoring System

  • Mean Score: Arithmetic mean of the scores across all benchmark tasks
  • Score Range: 0 to 1; higher scores indicate better performance
  • Ranking: Determined by each model's mean score across tasks (a minimal computation sketch follows this list)
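As a rough sketch, the mean score and ranking can be reproduced from per-task scores like this. The scores below are invented placeholders, not the report's actual numbers; in practice they come from the harness output.

```python
# Minimal sketch: average each model's per-task scores and rank by the mean.
# The per-task scores here are placeholders, not the report's real numbers.
from statistics import mean

per_task_scores = {
    "google_gemma-3-12b-it": {"gsm8k": 0.62, "hellaswag": 0.78, "mmlu": 0.60},
    "Qwen_Qwen3-8B":         {"gsm8k": 0.60, "hellaswag": 0.75, "mmlu": 0.58},
}

mean_scores = {model: mean(scores.values()) for model, scores in per_task_scores.items()}
ranking = sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True)

for rank, (model, score) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {score:.4f}")

# Category scores work the same way: restrict the tasks to one entry of
# TASK_GROUPS (defined earlier) before averaging.
```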

Test Results Leaderboard

Overall Ranking (Top 10)

| Rank | Model Name | Total Time | GPU Util Time | Mean Score |
|------|------------|------------|---------------|------------|
| 1 | google_gemma-3-12b-it | 15h 45m | 14h 8m | 0.6038 |
| 2 | Qwen_Qwen3-14B (8bit) | 29h 45m | 17h 29m | 0.5961 |
| 3 | openchat_openchat-3.6-8b-20240522 | 7h 51m | 6h 59m | 0.5871 |
| 4 | Qwen_Qwen3-8B | 15h 31m | 13h 44m | 0.5859 |
| 5 | Qwen_Qwen2.5-7B-Instruct | 9h 36m | 8h 33m | 0.5788 |
| 6 | Qwen_Qwen2.5-14B-Instruct (8bit) | 52h 44m | 29h 32m | 0.5775 |
| 7 | 01-ai_Yi-1.5-9B | 11h 43m | 10h 26m | 0.5676 |
| 8 | Qwen_Qwen2.5-7B-Instruct-1M | 11h 17m | 10h 10m | 0.5672 |
| 9 | meta-llama_Llama-3.1-8B-Instruct | 12h 19m | 10h 52m | 0.5653 |
| 10 | 01-ai_Yi-1.5-9B-Chat | 13h 54m | 12h 15m | 0.5621 |

Category Ranking Highlights

Reasoning & Math Performance Ranking (Top 5)

  1. google_gemma-3-12b-it (0.6266)
  2. Qwen_Qwen3-8B (0.6214)
  3. Qwen_Qwen3-14B (8bit) (0.586)
  4. Qwen_Qwen3-4B (0.5712)
  5. Qwen_Qwen2.5-7B-Instruct (0.5541)

Commonsense & NLI Ranking (Top 5)

  1. Qwen_Qwen2.5-14B-Instruct (8bit) (0.7941)
  2. Qwen_Qwen3-14B (8bit) (0.7807)
  3. google_gemma-3-12b-it (0.7737)
  4. Qwen_Qwen2.5-7B-Instruct (0.773)
  5. openchat_openchat-3.6-8b-20240522 (0.7726)

Knowledge & Reading Comprehension Ranking (Top 5)

  1. 01-ai_Yi-1.5-9B (0.4369)
  2. openchat_openchat-3.6-8b-20240522 (0.4136)
  3. meta-llama_Llama-3.1-8B-Instruct (0.4127)
  4. 01-ai_Yi-1.5-6B (0.4063)
  5. mistralai_Mistral-7B-Instruct-v0.3 (0.4045)

Key Findings

Performance Analysis

  • Google Gemma-3-12B-IT tops the overall ranking, particularly excelling in reasoning and math tasks
  • Qwen series models show strong performance across all categories, especially in commonsense reasoning
  • Yi series models excel in knowledge and reading comprehension tasks
  • Quantized models (8bit) maintain good performance while significantly reducing memory and compute requirements (see the loading sketch after this list)
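The report does not document its exact quantization setup, so the following is only one plausible way to load an 8-bit model for evaluation, using the bitsandbytes route in Hugging Face Transformers; the quantized model can then be passed to the harness in the usual way.

```python
# Hypothetical sketch: loading a model with 8-bit bitsandbytes quantization.
# Treat this as one plausible configuration, not the author's exact method.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-14B"  # one of the models evaluated in 8-bit
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place/offload layers automatically to fit the weights
)
```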

Efficiency Analysis

  • Smaller models can compete with larger models on certain tasks
  • GPU utilization time correlates positively with model size and complexity
  • Some mid-sized models offer a better performance-to-cost trade-off

Project Resource Consumption

  • Total Machine Runtime: 18 days 8 hours
  • Equivalent GPU Time: 14 days 23 hours (RTX 5090 at 100% utilization)
  • Environmental Impact: Carbon neutralized through active use of public transportation 😊

Project Value

This comprehensive evaluation provides the open-source LLM community with:

  1. Objective performance comparison benchmarks
  2. Efficiency analysis of different scale models
  3. Task-specific model selection guidance
  4. Empirical data on quantization technique effectiveness

The complete data, scripts, and logs of this project have been open-sourced, providing valuable reference resources for researchers and developers.


Data Source: Hugging Face Spaces Leaderboard
Article Source: CurateClick
