(This article is an English translation of a post originally published in Japanese on my blog. You can read the original Japanese version here).
My fanless M4 MacBook Air hit 2.78 TFLOPS in a matrix multiplication benchmark using Apple's MLX framework.
Matrix multiplication (GEMM) isn't just a basic math problem; it's the beating heart of modern Machine Learning and Large Language Models (LLMs). By measuring how fast a machine can multiply huge matrices, we are essentially measuring its raw capability to run AI locally. Let's see what the M4 chip can do.
< Test Environment >
- Machine: M4 MacBook Air
- Memory: 16GB
- Python: 3.10.11
- Framework: MLX v0.28.0
1. Measuring Execution Time
To measure the execution time, I used a simple matrix multiplication operation.
1.1 The Benchmark Script
I've published the measurement script on GitHub Gist. Feel free to download it and test it on your own Apple Silicon Mac.
Note: I ran the benchmark last summer. I have recently verified that the script still runs on Python 3.12.12 and MLX v0.31.1 without any modifications.
Save the script as matmul_benchmark.py and run the following commands in your terminal:
% python3 -m venv venv
% . venv/bin/activate
% pip install mlx
% caffeinate python matmul_benchmark.py
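For reference, a minimal benchmark along the same lines might look like the sketch below. This is not the published gist, just an illustration of the approach: time an n x n bfloat16 matrix multiplication in MLX and convert the result to TFLOPS. Note that MLX evaluates lazily, so `mx.eval` is needed to force the computation inside the timed region.

```python
import time


def tflops(n, seconds):
    """TFLOPS achieved by one n x n matmul, counting 2*n^3 operations."""
    return (2 * n ** 3) / seconds / 1e12


def benchmark_matmul(n, runs=5):
    """Average wall-clock time of an n x n bfloat16 matmul in MLX."""
    import mlx.core as mx  # requires an Apple Silicon Mac; imported lazily

    a = mx.random.normal((n, n)).astype(mx.bfloat16)
    b = mx.random.normal((n, n)).astype(mx.bfloat16)
    mx.eval(a, b)  # materialize the inputs before timing starts

    times = []
    for _ in range(runs):
        start = time.perf_counter()
        mx.eval(a @ b)  # mx.eval forces the lazy computation to complete
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)


# Example usage (on Apple Silicon):
#   avg = benchmark_matmul(20000)
#   print(f"{tflops(20000, avg):.2f} TFLOPS")
```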
1.2 Results
I measured the execution time for various matrix sizes. The average time below is based on 5 consecutive runs for each size. All computations were done using the bfloat16 data type.
| Matrix Size | Avg Time (s) | Total Ops | Real TFLOPS | Run 1 (s) | Run 2 (s) | Run 3 (s) | Run 4 (s) | Run 5 (s) |
|---|---|---|---|---|---|---|---|---|
| 10000x10000 | 0.740 | 2 Trillion | 2.70 | 0.738 | 0.738 | 0.742 | 0.740 | 0.743 |
| 20000x20000 | 5.754 | 16 Trillion | 2.78 | 5.667 | 5.698 | 5.750 | 5.810 | 5.847 |
| 30000x30000 | 21.624 | 54 Trillion | 2.50 | 20.341 | 21.076 | 22.252 | 21.897 | 22.555 |
| 40000x40000 | 75.872 | 128 Trillion | 1.69 | 62.925 | 77.502 | 72.921 | 85.509 | 80.505 |
1.3 Observations
- Peak Performance: The highest performance recorded was 2.78 TFLOPS at a matrix size of 20000x20000.
- Thermal Behavior: For the 20000x20000 and 30000x30000 sizes, execution times creep up slightly with each run. This hints that the fanless chassis might be starting to throttle to manage heat.
- The Impact of Swap Memory: At 40000x40000, the execution time spiked, and the variance between runs became larger. This is because the 16GB of unified memory ran out, forcing the system to frequently swap to the slower SSD. I confirmed this by watching the swap space usage shoot up in Activity Monitor during the run.
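To get a rough sense of why 40000x40000 spills into swap, consider that the two input matrices and the result alone occupy about 9.6 GB in bfloat16. This back-of-the-envelope estimate ignores OS and framework overhead, which only makes things tighter on a 16GB machine:

```python
def matmul_footprint_gb(n, bytes_per_elem=2):
    """Approximate memory for A, B, and the result C, each n x n bfloat16 (2 bytes/element)."""
    return 3 * n * n * bytes_per_elem / 1e9


for n in (10000, 20000, 30000, 40000):
    print(f"{n}x{n}: ~{matmul_footprint_gb(n):.1f} GB")
```

At 30000x30000 the three matrices need about 5.4 GB, which still fits comfortably; at 40000x40000 they need roughly 9.6 GB, leaving little headroom for everything else.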
2. Calculating Total Operations and TFLOPS
TFLOPS (Tera Floating-point Operations Per Second) is a performance metric indicating how many trillions of floating-point calculations a processor can execute in one second.
We can calculate this using the total number of operations and the execution time.
2.1 Total Operations
First, let's calculate the total operations. A matrix multiplication for matrices $A$ and $B$ of size $n \times n$ can be expressed as:

$$C = AB$$

A single element in matrix $C$ is calculated as:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

This requires $n$ multiplications and $n - 1$ additions, totaling $2n - 1$ operations. In our experiment, since $n$ is large, we can approximate this to $2n$ operations per element.

Since the matrix $C$ has $n^2$ elements, the total number of operations is:

$$\text{Total Ops} = 2n \times n^2 = 2n^3$$

For example, when $n = 20000$:

$$2 \times 20000^3 = 1.6 \times 10^{13} = 16 \text{ trillion operations}$$
2.2 Calculating the TFLOPS Performance
Finally, we use the total operations and the execution time to calculate the TFLOPS using this formula:

$$\text{TFLOPS} = \frac{\text{Total Ops}}{\text{Time (s)} \times 10^{12}}$$

For $n = 20000$:

$$\frac{1.6 \times 10^{13}}{5.754 \times 10^{12}} \approx 2.78 \text{ TFLOPS}$$
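As a quick sanity check, a couple of lines of Python reproduce the "Real TFLOPS" column from the results table above using the same formula:

```python
def tflops(total_ops, seconds):
    """Convert an operation count and wall-clock time to TFLOPS."""
    return total_ops / seconds / 1e12


# Values taken from the results table above
print(f"{tflops(2e12, 0.740):.2f}")    # 10000x10000 -> 2.70
print(f"{tflops(1.6e13, 5.754):.2f}")  # 20000x20000 -> 2.78
print(f"{tflops(1.28e14, 75.872):.2f}")  # 40000x40000 -> 1.69
```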
Can you feel the sheer scale of these calculations? It's pretty mind-blowing what a fanless laptop can achieve locally these days.