DEV Community

lwgena for TinyAlg

2.78 TFLOPS on a Fanless MacBook Air? Benchmarking Apple's M4 with MLX

(This article is an English translation of a post originally published in Japanese on my blog. You can read the original Japanese version here).

My fanless M4 MacBook Air hit 2.78 TFLOPS in a matrix multiplication benchmark using Apple's MLX framework.

Matrix multiplication (GEMM) isn't just a basic math problem; it's the beating heart of modern Machine Learning and Large Language Models (LLMs). By measuring how fast a machine can multiply huge matrices, we are essentially measuring its raw capability to run AI locally. Let's see what the M4 chip can do.

< Test Environment >

  • Machine: M4 MacBook Air
  • Memory: 16GB
  • Python: 3.10.11
  • Framework: MLX v0.28.0

1. Measuring Execution Time

To measure the execution time, I used a simple matrix multiplication operation.

1.1 The Benchmark Script

I've published the measurement script on GitHub Gist. Feel free to download it and test it on your own Apple Silicon Mac.

Note: I ran the benchmark last summer. I have recently verified that the script still runs on Python 3.12.12 and MLX v0.31.1 without any modifications.

Save the script as matmul_benchmark.py and run the following commands in your terminal:

```shell
% python3 -m venv venv
% . venv/bin/activate
% pip install mlx
% caffeinate python matmul_benchmark.py
```
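The Gist contains the full script. For reference, here is a minimal sketch of the same idea — my own reconstruction, not the published script. It times repeated matrix multiplications, forcing evaluation with `mx.eval` (MLX computes lazily, so without it you would only time graph construction), and falls back to NumPy so the harness itself can be exercised on non-Apple hardware:

```python
import time

try:
    import mlx.core as mx  # MLX is Apple-Silicon only

    def make(n):
        # bfloat16 matrices, as in the benchmark
        return mx.random.normal((n, n)).astype(mx.bfloat16)

    def matmul(a, b):
        c = a @ b
        mx.eval(c)  # force the lazy computation to actually run
        return c
except ImportError:
    import numpy as np  # portable stand-in so the harness runs anywhere

    def make(n):
        return np.random.rand(n, n).astype(np.float32)

    def matmul(a, b):
        return a @ b

def bench(n, runs=5):
    """Return (average seconds, TFLOPS) over `runs` timed matmuls."""
    a, b = make(n), make(n)
    matmul(a, b)  # warm-up; also materializes the lazy inputs under MLX
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        matmul(a, b)
        times.append(time.perf_counter() - t0)
    avg = sum(times) / len(times)
    tflops = 2 * n**3 / avg / 1e12  # ~2N^3 operations per multiply
    return avg, tflops

if __name__ == "__main__":
    for n in (10000, 20000):
        avg, tflops = bench(n)
        print(f"N={n}: {avg:.3f} s avg, {tflops:.2f} TFLOPS")
```

The `caffeinate` wrapper in the commands above matters for the larger sizes: it keeps the Mac from idling to sleep mid-benchmark.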

1.2 Results

I measured the execution time for various matrix sizes. The average time below is based on 5 consecutive runs for each size. All computations were done using the bfloat16 data type.

| Matrix Size | Avg Time (s) | Total Ops | Real TFLOPS | Run 1 (s) | Run 2 (s) | Run 3 (s) | Run 4 (s) | Run 5 (s) |
|---|---|---|---|---|---|---|---|---|
| 10000×10000 | 0.740 | 2 Trillion | 2.70 | 0.738 | 0.738 | 0.742 | 0.740 | 0.743 |
| 20000×20000 | 5.754 | 16 Trillion | 2.78 | 5.667 | 5.698 | 5.750 | 5.810 | 5.847 |
| 30000×30000 | 21.624 | 54 Trillion | 2.50 | 20.341 | 21.076 | 22.252 | 21.897 | 22.555 |
| 40000×40000 | 75.872 | 128 Trillion | 1.69 | 62.925 | 77.502 | 72.921 | 85.509 | 80.505 |

1.3 Observations

  • Peak Performance: The highest performance recorded was 2.78 TFLOPS at a matrix size of 20000x20000.
  • Thermal Behavior: For the 20000×20000 and 30000×30000 sizes, execution times creep up slightly with each run, hinting that the fanless chassis is beginning to throttle under sustained load.
  • The Impact of Swap Memory: At 40000x40000, the execution time spiked, and the variance between runs became larger. This is because the 16GB of unified memory ran out, forcing the system to frequently swap to the slower SSD. I confirmed this by watching the swap space usage shoot up in Activity Monitor during the run.
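As a sanity check on the swap explanation, we can estimate the working set: each N×N bfloat16 matrix takes 2 bytes per element, and A, B, and the result C must all be resident at once.

```python
def matrix_gib(n, bytes_per_elem=2):
    """Size of one n-by-n matrix in GiB (bfloat16 = 2 bytes per element)."""
    return n * n * bytes_per_elem / 2**30

for n in (10000, 20000, 30000, 40000):
    print(f"N={n}: {3 * matrix_gib(n):.1f} GiB for A, B and C")
```

For N=40000 this comes to roughly 9 GiB for the three matrices alone. That is nominally under 16 GB, but once macOS, the Python process, and MLX's working buffers are added on top, the machine runs into memory pressure — consistent with the swap activity observed in Activity Monitor.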

2. Calculating Total Operations and TFLOPS

TFLOPS (Tera Floating-point Operations Per Second) is a performance metric indicating how many trillions of floating-point calculations a processor can execute in one second.

We can calculate this using the total number of operations and the execution time.

2.1 Total Operations

First, let's calculate the total operations. A matrix multiplication $C = A \cdot B$ for matrices $A$ and $B$ of size $N \times N$ can be expressed as:

$$
\begin{pmatrix} c_{11} & \dots & c_{1N} \\ \vdots & \ddots & \vdots \\ c_{N1} & \dots & c_{NN} \end{pmatrix} =
\begin{pmatrix} a_{11} & \dots & a_{1N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \dots & a_{NN} \end{pmatrix} \cdot
\begin{pmatrix} b_{11} & \dots & b_{1N} \\ \vdots & \ddots & \vdots \\ b_{N1} & \dots & b_{NN} \end{pmatrix}
$$

A single element $c_{ij}$ in matrix $C$ is calculated as:

$$
c_{ij} = \sum_{k=1}^{N} a_{ik} b_{kj} = a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{iN}b_{Nj}
$$

This requires $N$ multiplications and $N-1$ additions, totaling $2N-1$ operations. In our experiment, since $N$ is large, we can approximate this to $2N$ operations per element.

Since the matrix $C$ has $N^2$ elements, the total number of operations is:

$$
\text{Total Operations} \approx (2N) \times N^2 = 2N^3
$$

For example, when $N = 20000$:

$$
\text{Total Operations} \approx 2 \times 20000^3 = 16 \times 10^{12} \ \text{(16 trillion operations)}
$$

2.2 Calculating the TFLOPS Performance

Finally, we use the total operations and the execution time to calculate the TFLOPS using this formula:

$$
\text{TFLOPS} = \frac{\text{Total Operations}}{\text{Execution Time [s]} \times 10^{12}}
$$

For $N = 20000$:

$$
\text{TFLOPS} = \frac{16 \times 10^{12}}{5.754 \times 10^{12}} \approx 2.78
$$
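The same arithmetic reproduces the "Real TFLOPS" column of the results table from the measured average times:

```python
# Measured average times from the results table, keyed by matrix size N.
avg_times = {10000: 0.740, 20000: 5.754, 30000: 21.624, 40000: 75.872}

for n, avg in avg_times.items():
    ops = 2 * n**3  # ~2N^3 operations for an N x N matmul
    print(f"N={n}: {ops / 1e12:.0f} trillion ops, {ops / avg / 1e12:.2f} TFLOPS")
```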


Did you get a feel for the sheer scale of these computations? It's pretty mind-blowing what a fanless laptop can achieve locally these days.
