DEV Community

lwgena for TinyAlg

2.78 TFLOPS on a Fanless MacBook Air? Benchmarking Apple's M4 with MLX

(This article is an English translation of a post originally published in Japanese on my blog. You can read the original Japanese version here).

My fanless M4 MacBook Air hit 2.78 TFLOPS in a matrix multiplication benchmark using Apple's MLX framework.

Matrix multiplication (GEMM) isn't just a basic math problem; it's the beating heart of modern Machine Learning and Large Language Models (LLMs). By measuring how fast a machine can multiply huge matrices, we are essentially measuring its raw capability to run AI locally. Let's see what the M4 chip can do.

< Test Environment >

  • Machine: M4 MacBook Air
  • Memory: 16GB
  • Python: 3.10.11
  • Framework: MLX v0.28.0

1. Measuring Execution Time

To measure the execution time, I used a simple matrix multiplication operation.

1.1 The Benchmark Script

I've published the measurement script on GitHub Gist. Feel free to download it and test it on your own Apple Silicon Mac.

Note: I ran the benchmark last summer. I have recently verified that the script still runs on Python 3.12.12 and MLX v0.31.1 without any modifications.

Save the script as matmul_benchmark.py and run the following commands in your terminal:

```shell
% python3 -m venv venv
% . venv/bin/activate
% pip install mlx
% caffeinate python matmul_benchmark.py
```
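The Gist contains the full script. For reference, here is a minimal sketch of the same idea — my own reconstruction, not the published script. It times repeated matrix multiplications, forcing evaluation with `mx.eval` (MLX computes lazily, so without it you would only time graph construction), and falls back to NumPy so the harness itself can be exercised on non-Apple hardware:

```python
import time

try:
    import mlx.core as mx  # MLX is Apple-Silicon only

    def make(n):
        # bfloat16 matrices, as in the benchmark
        return mx.random.normal((n, n)).astype(mx.bfloat16)

    def matmul(a, b):
        c = a @ b
        mx.eval(c)  # force the lazy computation to actually run
        return c
except ImportError:
    import numpy as np  # portable stand-in so the harness runs anywhere

    def make(n):
        return np.random.rand(n, n).astype(np.float32)

    def matmul(a, b):
        return a @ b

def bench(n, runs=5):
    """Return (average seconds, TFLOPS) over `runs` timed matmuls."""
    a, b = make(n), make(n)
    matmul(a, b)  # warm-up; also materializes the lazy inputs under MLX
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        matmul(a, b)
        times.append(time.perf_counter() - t0)
    avg = sum(times) / len(times)
    tflops = 2 * n**3 / avg / 1e12  # ~2N^3 operations per multiply
    return avg, tflops

if __name__ == "__main__":
    for n in (10000, 20000):
        avg, tflops = bench(n)
        print(f"N={n}: {avg:.3f} s avg, {tflops:.2f} TFLOPS")
```

The `caffeinate` wrapper in the commands above matters for the larger sizes: it keeps the Mac from idling to sleep mid-benchmark.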

1.2 Results

I measured the execution time for various matrix sizes. The average time below is based on 5 consecutive runs for each size. All computations were done using the bfloat16 data type.

| Matrix Size | Avg Time (s) | Total Ops | Real TFLOPS | Run 1 (s) | Run 2 (s) | Run 3 (s) | Run 4 (s) | Run 5 (s) |
|---|---|---|---|---|---|---|---|---|
| 10000×10000 | 0.740 | 2 Trillion | 2.70 | 0.738 | 0.738 | 0.742 | 0.740 | 0.743 |
| 20000×20000 | 5.754 | 16 Trillion | 2.78 | 5.667 | 5.698 | 5.750 | 5.810 | 5.847 |
| 30000×30000 | 21.624 | 54 Trillion | 2.50 | 20.341 | 21.076 | 22.252 | 21.897 | 22.555 |
| 40000×40000 | 75.872 | 128 Trillion | 1.69 | 62.925 | 77.502 | 72.921 | 85.509 | 80.505 |

1.3 Observations

  • Peak Performance: The highest performance recorded was 2.78 TFLOPS at a matrix size of 20000x20000.
  • Thermal Behavior: For the 20000×20000 and 30000×30000 sizes, execution times creep up slightly with each run, hinting that the fanless chassis is beginning to throttle under sustained load.
  • The Impact of Swap Memory: At 40000x40000, the execution time spiked, and the variance between runs became larger. This is because the 16GB of unified memory ran out, forcing the system to frequently swap to the slower SSD. I confirmed this by watching the swap space usage shoot up in Activity Monitor during the run.
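As a sanity check on the swap explanation, we can estimate the working set: each N×N bfloat16 matrix takes 2 bytes per element, and A, B, and the result C must all be resident at once.

```python
def matrix_gib(n, bytes_per_elem=2):
    """Size of one n-by-n matrix in GiB (bfloat16 = 2 bytes per element)."""
    return n * n * bytes_per_elem / 2**30

for n in (10000, 20000, 30000, 40000):
    print(f"N={n}: {3 * matrix_gib(n):.1f} GiB for A, B and C")
```

For N=40000 this comes to roughly 9 GiB for the three matrices alone. That is nominally under 16 GB, but once macOS, the Python process, and MLX's working buffers are added on top, the machine runs into memory pressure — consistent with the swap activity observed in Activity Monitor.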

2. Calculating Total Operations and TFLOPS

TFLOPS (Tera Floating-point Operations Per Second) is a performance metric indicating how many trillions of floating-point calculations a processor can execute in one second.

We can calculate this using the total number of operations and the execution time.

2.1 Total Operations

First, let's calculate the total operations. A matrix multiplication $C = A \cdot B$ for matrices $A$ and $B$ of size $N \times N$ can be expressed as:

$$
\begin{pmatrix} c_{11} & \dots & c_{1N} \\ \vdots & \ddots & \vdots \\ c_{N1} & \dots & c_{NN} \end{pmatrix} =
\begin{pmatrix} a_{11} & \dots & a_{1N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \dots & a_{NN} \end{pmatrix} \cdot
\begin{pmatrix} b_{11} & \dots & b_{1N} \\ \vdots & \ddots & \vdots \\ b_{N1} & \dots & b_{NN} \end{pmatrix}
$$

A single element $c_{ij}$ in matrix $C$ is calculated as:

$$
c_{ij} = \sum_{k=1}^{N} a_{ik} b_{kj} = a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{iN}b_{Nj}
$$

This requires $N$ multiplications and $N-1$ additions, totaling $2N-1$ operations. In our experiment, since $N$ is large, we can approximate this to $2N$ operations per element.

Since the matrix $C$ has $N^2$ elements, the total number of operations is:

$$
\text{Total Operations} \approx (2N) \times N^2 = 2N^3
$$

For example, when $N = 20000$:

$$
\text{Total Operations} \approx 2 \times 20000^3 = 16 \times 10^{12} \ \text{(16 trillion operations)}
$$

2.2 Calculating the TFLOPS Performance

Finally, we use the total operations and the execution time to calculate the TFLOPS using this formula:

$$
\text{TFLOPS} = \frac{\text{Total Operations}}{\text{Execution Time [s]} \times 10^{12}}
$$

For $N = 20000$:

$$
\text{TFLOPS} = \frac{16 \times 10^{12}}{5.754 \times 10^{12}} \approx 2.78
$$
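The same arithmetic reproduces the "Real TFLOPS" column of the results table from the measured average times:

```python
# Measured average times from the results table, keyed by matrix size N.
avg_times = {10000: 0.740, 20000: 5.754, 30000: 21.624, 40000: 75.872}

for n, avg in avg_times.items():
    ops = 2 * n**3  # ~2N^3 operations for an N x N matmul
    print(f"N={n}: {ops / 1e12:.0f} trillion ops, {ops / avg / 1e12:.2f} TFLOPS")
```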


Did you get a feel for the sheer scale of these computations? It's pretty mind-blowing what a fanless laptop can achieve locally these days.
