Kazuya

AWS re:Invent 2025 - Supercharge ML and Inference on Apple Silicon with EC2 Mac (CMP346)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Supercharge ML and Inference on Apple Silicon with EC2 Mac (CMP346)

In this video, Sébastien Stormacq and Eliran Efron demonstrate machine learning inference on Apple Silicon using Amazon EC2 Mac instances. They explain how Apple's unified memory architecture eliminates CPU-GPU memory transfer bottlenecks, making M-series chips power-efficient for ML workloads. The session provides hands-on coding examples using MLX, Apple's open-source array framework optimized for Apple Silicon, showing its similarity to PyTorch and NumPy APIs. Key topics include lazy computation, function transformers, MLX.NN for building neural networks, and MLX-LM for running large language models. They demonstrate quantization techniques to reduce model size and memory usage, running Llama models with different bit precisions. The presentation also covers PyTorch MPS backend as an alternative for existing PyTorch users wanting to leverage Apple's Neural Engine.


Note: This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Thumbnail 40

Introduction: Machine Learning Inference on Apple Silicon with Amazon EC2 Mac

This session is about machine learning inference on different types of hardware. We are going to show you how to deploy a machine learning workload on Apple Silicon using Amazon EC2 Mac. You know that sometimes GPUs are not easy to access. They are extremely expensive, and you might have Mac Minis lying around your desk with unused cycles during your pipeline. Why not use these Mac Minis for something else, something they are very capable of? It's about inference, machine learning inference, or even training. My name is Sébastien Stormacq. I work in Developer Relations at AWS, and my partner in crime today is taking pictures in the back of the room: Eliran. Eliran is going to do the hard work. This is a code talk, so we are going to spend time showing you code with Jupyter notebooks, a lot of Python, and going down into the mathematical details and the coding details of running large language models on Apple Silicon.

Thumbnail 80

As you know, getting access to a GPU is extremely complicated and might be quite expensive. Most of our customers are telling us that they need a large number of GPUs to run their machine learning workloads, and getting access to GPUs is not the answer to the problem. It's only one part of the problem, because once you have the GPU, you also need a lot of storage. You need extremely fast, high bandwidth network to move the data around the different nodes inside your cluster. Building this infrastructure, even if it is in the cloud, costs you time and money. You are in a super competitive market. Every minute that you spend trying to create infrastructure, manage infrastructure, or update infrastructure is a minute that you cannot spend on what matters for your customers: the application or the end result.

Thumbnail 110

Thumbnail 130

GPUs are super powerful, but they come with one bottleneck as well: they have siloed memory. The CPU has its own memory and the GPU has its own memory. They are totally separated. When you have a training or inference task that needs to move data between the CPU and the GPU, all the data goes through a bus, which might add extra latency. To solve that problem, Apple came up with a radically different approach. Inside a Mac Mini today, or even inside your phone, inside the M chip, the Apple Silicon chip, we have a system on a chip that combines the CPU, the GPU, and the Neural Engine on the same die, plus one large bank of unified memory. It means the CPU and the GPU can share the same memory; there is no more memory transfer between the two.

Thumbnail 160

Thumbnail 210

Amazon EC2 Mac Instances and the MLX Framework

Of course, the GPU cannot use the entire memory. You need to leave a bit of room for the operating system to work. So typically you can go up to 70 percent of the memory on these systems. They are not as fast as the raw discrete performance you can get from an NVIDIA chip today, but they are very power efficient. This is what you have in your pocket if you have a recent iPhone, and this is what you have on your desk if you have a Mac. The good news is that you can get Macs in the cloud as well. Around 2020 we launched Amazon EC2 Mac. EC2 Mac instances are Mac Minis in a special enclosure inside our data center. They are connected through the Thunderbolt port to an AWS Nitro card. The Nitro card is the system that allows us to connect to the rest of the AWS network and provide security as well.

Thumbnail 270

It's everything that you know and love about Amazon EC2 for the last 20 years applied to a Mac. It's a real Mac for you. It's a dedicated host. There are no virtual machines. You have access to the raw hardware, but at the same time it can access your VPC. It has security groups and IAM policies. It boots from an external volume. Everything you know and love from EC2 is available, but for macOS. We have different types of Amazon EC2 Mac. I'm not going through the whole list, but we start with the Intel ones, which are out of scope for this talk. Obviously we need Apple Silicon for this talk, but we have M1, M2, M2 Pro, M1 Ultra, M4, and M4 Pro that we launched recently. Yesterday during Matt Garman's keynote we announced M3 Ultra and another M4 variant. These are coming; they are either in preview or pre-announced for next year. Look at the number of cores you have there.

Thumbnail 330

On M4 Pro, you have 14 CPU cores and a 16-core Neural Engine. On M1 Ultra, you have a 32-core Neural Engine. That's a lot of processing power available to run your large language model on Apple Silicon. We need a framework, and we are going to use MLX. MLX is an open source array framework that is purpose-built for Apple Silicon. It's a very flexible tool that can be used for basic numerical computation all the way to running the largest models directly on your Apple device.

Thumbnail 370

Thumbnail 380

If you want to generate text, images, video, or audio with a large language model on your Mac, MLX is the framework that allows you to do that. You can also use it to train, fine-tune, or otherwise customize your large language model. MLX is designed to run on Apple Silicon, but it has a very similar API. If you're familiar with PyTorch or NumPy, you will find that the API from MLX is very similar. It's very easy to port code between PyTorch, NumPy, and JAX to MLX.

Thumbnail 400

Thumbnail 410

Thumbnail 420

MLX integrates with other tools. Maybe you are using LM Studio on your Mac, so you can have MLX installed alongside LM Studio and use large language models through LM Studio on your Mac. Of course it has a Python binding and Python API, but it also has a Swift API. Swift is the open source programming language created by Apple a bit more than 10 years ago, and it even has C++ and C APIs. Look at the similarity of the code. On the left side you have a very short example of code written for MLX, and on the right side you have the same code for PyTorch. We are calling the linear function on the object there. The main difference is at the end we call the ReLU function on the nn object directly in the MLX version, and we call it on the layer in the PyTorch version. You see very similar code between MLX and PyTorch.

Thumbnail 470

Thumbnail 490

Getting Started with MLX: Arrays, Lazy Evaluation, and Basic Operations

That was my introduction. I promise you that we are going to have a lot of code today. It's a code session, so we are going to dive into the code, and Eliran will show you how to actually use it. Thank you very much. Hey everyone, nice meeting you all. My name is Eliran Efron. I'm a solutions architect from Tel Aviv, Israel. I've been with AWS for the past 5 years, very excited about everything data related, which also led me to the world of neural nets and the resources and needs they bring with them.

In our example today, I want to talk about MLX and a bit about the things that Apple did within MLX in order to allow it to be optimized for the different hardware devices that they have. For example, if you're running something on an M1 or an M4 or something else like an A-series chip, it will try to use the best cores that it can for that specific task, because it knows the resources available within that device. I want to talk about a few important things in MLX. One of the most important parts of MLX is that it has lazy computation and function transformers within it.

Thumbnail 590

Let's start going through our first Jupyter notebook. As you can see, I'm running a localhost. This is an SSH pipe to an M1 Ultra Mac on AWS. That M1 Ultra is running a Jupyter Lab, and we're connected to it through a pipe, and we're running it here. Just a basic example of a comparison between MLX and NumPy. Afterwards we'll start discussing other things that Apple added like the neural network class and the MLX-LM extension that they built. Let's get started. We'll just import MLX core and NumPy for a second. We'll also import time just to be able to check things out. If we look at how we're creating arrays in both array frameworks, as you can see we have a NumPy example on top and then we have an MLX example on the bottom.

Thumbnail 630

Thumbnail 650

Thumbnail 660

Thumbnail 670

They are pretty much the same. Most of the differences come from things like the random function, which looks a bit different. Here we're using uniform from the random class and we're providing it a shape. We can also provide it a data type: we can pass a dtype and say we want some MLX float type. For the moment we'll use it as is, and you'll see that we also get that data type out. As I said earlier, we can choose a different data type, say float16. We pass mx.float16 in, and you can see that we've chosen the data type for everything, and we can continue forward.

One important thing about MLX is that we can decide on which device we want each calculation to actually happen. As I said before, MLX has lazy computation, like we know from some other frameworks. NumPy, by contrast, is an eager evaluator, so there's quite a difference there. But at the basic level of things, when we're doing something in MLX, like a basic operation, nothing should happen until the point that it requires either an eval, a type conversion (which again requires an evaluation), or something of that sort.

Thumbnail 730

Thumbnail 740

Here, let's just take the same calculation. We have two arrays and we want to add one to another, and we can actually decide on which stream we would like to run this on. So there's like a GPU calc and a CPU calc that I did here. We'll see the results will be the same because we're expecting it. But you can see that we can just decide on which to run. Now we'll get to the lazy evaluation in a bit. I want to show you a bit of basic operations differences between them.

As you can see, it pretty much looks the same, very similar. Some differences come from things like APIs. For example, here you have a NumPy function that doesn't have a direct MLX equivalent; we have MLX's math functions, and again we can use the Python equivalent in order to do that, so we don't necessarily need it. We can see again that we are receiving the same values and so on.

Thumbnail 770

Now let's get back to the lazy evaluation. When MLX is actually running things, again, if we're not forcing an eval or a type conversion, it will not evaluate the actual value. What happens in the background is that a compute graph is built for the set of things that we need to run on the device in order to produce that calculation. One thing to remember here is that sometimes we would also want to compile. We can collapse multiple nodes of the compute graph into a single compute-graph node, because, for example, if we have a function that we're using all the time and it will run exactly the same when it gets an input and provides an output, then we can compile it. We'll then have fewer nodes in the compute graph, which also reduces the switching between operations, so it can actually benefit even more.

Thumbnail 870

If you look at this example, we have again two arrays and we can combine them, and once we do that, nothing will happen. We'll have that compute graph behind the scenes, and once we force an eval, then it will actually do the evaluation. Again, the same will happen if we print it out: it's a type conversion to a string, so we will need to know the value of it. Same goes with the list, because we're trying to convert it to a list. So if we run this, we can see exactly that: they just get evaluated once we want them.

Performance Optimization and Function Transformers in MLX

Now, we discussed a bit about the performance for Mac. So why is it that important to use the right tools for our hardware? At the end of the day, the CPU bus and the world of working with different memory areas is very complex in most use cases. Sometimes we have customers doing things like computing something on a GPU, but then batching that on a CPU. So you have an impedance mismatch in the middle, and a ton of memory bandwidth in and out of the GPU just to run that specific operation.

In that case, this could have probably run faster on a single device if they had moved everything to the GPU. That's a great thing, but we're even seeing it today in things like vLLM taking a mixture of experts and trying to provide the best way to do so on multiple GPUs in a large model. In that case, you'll have RDMA in the middle, and sometimes that RDMA library uses the CPU. So we're getting into the same bottlenecks all the time, and it really matters what we use and when.

Thumbnail 950

Thumbnail 970

Thumbnail 980

If we're taking a bit of a performance comparison on a larger scale, in this case we have again a random NumPy array and a random MLX array. We'll do a matrix multiplication on both and see how much time it takes. You'll be able to see that sometimes NumPy is faster on smaller stuff, and that's because we're using CPU on both for comparison. Sometimes I can decide I want to do it on a GPU again, not as fast, but it looks similar. Sometimes the MLX output will be faster, but once we get into larger sized arrays, we will be able to actually feel the benefit of it.

Thumbnail 990

Thumbnail 1010

Thumbnail 1020

Thumbnail 1030

So as you can see, once we're doing it on MLX, then it's much faster. But here we did it on a GPU, so let's do it on a CPU. We'll see that again MLX will be faster, and that's probably because of the compute graph and some optimization that it can do alongside it. That's a small comparison on a 6000 by 6000 dimension matrix multiplication.

Another cool thing about MLX is function transformers. The idea of function transformations is actually taking a function as an input and returning a new function as an output. MLX has two types of function transformations: automatic differentiation and graph optimization. Like I've mentioned before, we can compile multiple actions into a single compute-graph node. Automatic differentiation gives us functions like mx.grad, which returns the gradient of a function, as I will showcase in a second. We can pretty much automatically compute the gradient of any function with it. We can also take a second derivative, as I'll show in a second. The second type of function transformation covers operations that optimize the compute graph, like we've discussed with the compile function.

Thumbnail 1120

If we look at automatic differentiation, which is more interesting in this case, we have a very simple function here with a simple input, and we want the gradient of that function. So we can say, alright, let's have a compute function here, and we are sending a value to it. Then we can take that function and do mx.grad on it, and it will give us back a function which is the gradient function of our function. We can just run it and see the expected gradient. Again, we can do the same thing with a second gradient. In that case, we'll take sine, mx.sin, a very simple function, and we'll take the gradient and then the gradient of the gradient. So we have the second derivative of sine, and we can just run our array through it. Here we passed a single-element array, and we get the second derivative of sine as our output.

Thumbnail 1160

Thumbnail 1190

Thumbnail 1200

Now, as I've mentioned, we can decide which type of object, which type of device we want to run each calculation. So each calculation that we do, we can decide which device we want to run it on. So for example, that sin. We can take that sin and again take that input of it and we can use the stream in order to decide on which device we would like it to run on. So in that case you can see that we ran it on a CPU. So I have a CPU and a GPU, but let's run it on the GPU. We'll be able to see the same outputs, but it just ran on the GPU. So that's a small intro to MLX. As for some basic things on top of that, we have an array framework.

Thumbnail 1240

Building Neural Networks with MLX.NN

Now that we have an array framework, we want to have real tools to build linear layers and construct actual neural networks. Apple has built MLX.NN, which provides the neural network class containing many of the features we expect from frameworks like PyTorch. This class includes the tools we need to build neural networks on Mac.

Thumbnail 1250

Let's import everything and look at some basic neural network examples. If we want a linear layer of a specific size, we use nn.Linear. Since we imported mlx.nn as nn, we don't need the full prefix. For example, we can see the shape and bias of the linear layer. If we want a convolutional layer, we can do that as well by deciding how many input and output channels the layer will have and setting the kernel size. We can easily output it and see the structure.

Thumbnail 1280

Thumbnail 1300

On top of that, we have built-in normalization layers. For example, batch norm will be part of it, so we can get a layer using that. Moving forward, something very related to the world of large language models is activation functions that we see throughout the layers. In that case, we have activations pre-built, and those activations are optimized to run on Mac using the underlying Metal derivatives to build them as optimized as possible.

Thumbnail 1320

Thumbnail 1330

Thumbnail 1340

Thumbnail 1350

Instead of building something like a sigmoid or GELU yourself, you can just use it from the library. If we run this, you'll be able to see we have our layers, we're able to watch the shapes of those layers and find the biases for them. We can see the activation functions and so on. The last one is specifically the activation that we've added at the end.

If we compare it to how we would build it with PyTorch, there are differences between the two. In PyTorch, we generally have the call function and the way we're actually calling the activation function. In MLX, we're using the class and sending the object to it, whereas in PyTorch and others, we usually just use the object and it will have an activation function on top of that.

Thumbnail 1400

Now we want to do multiple layers and build real networks. We have two examples of complex models here. Today they're not that complex, but more than a linear layer. The first is a multi-layer perceptron. In that case, we can see that we have our layers. When we initialize it, we're just creating our objects and adding our layers. We can create a neural layer, an activation function, and another dropout. The way we call the layers is we go for each layer, pass the data through it, and then later when we try to use it, we do MLP and send it to it. It will just go through the call.

Thumbnail 1500

If we look at a simple convolutional network, we have the convolutional layers. We're using nn.Conv2d to receive the actual layers from it to build the layer. We can see that those layers are expecting to be one after the other because the output channel of one will be the input channel of the other. We also build a classifier. In that classifier, we can take sequential as an object from our neural network library and say we have a sequential object running through all of these layers with our first linear layer, an activation function, a dropout, and another linear layer. When we look at our call function, we can run through our various layers.

Thumbnail 1510

Thumbnail 1520

We can return our object. So if we're looking at those, you can see in the MLP we'll have those layers, and in the simple CNN we'll have our convolutional layers and our classifier, and we can do a forward pass on those and see what happens.

Thumbnail 1530

Thumbnail 1560

Thumbnail 1580

Thumbnail 1590

Thumbnail 1600

Alongside it, MLX also has optimizers. Sometimes we would want to use something like RMSprop or other types of optimizers. Instead of building them on your own, you can just use what's built into MLX's optimizers. If you look up here, the import was import mlx.optimizers as optim. So again, as you can see, we can go through the optimizers and actually use them instead of building them on our own. We'll have those different optimizers from mlx.optimizers and some examples of running through them.

Thumbnail 1620

Thumbnail 1630

Thumbnail 1640

So if we look at a complete training example with MLX, then a complete training will look something like this. We'll have a classification, and we'll generate some synthetic data. That's data classified for various types of use cases, and then we'll create a dataset from it. We'll take a training dataset and an evaluation dataset, and we'll convert them to MLX arrays. So we'll have our train and validation arrays, and then we'll be able to take our model and also use optimizers and build our loss function and accuracy function and actually train it.

Thumbnail 1650

Thumbnail 1660

Thumbnail 1670

So if we're looking at the training loop, what we'll do, we'll have multiple epochs and we'll generally just pass through the loss and grad function and train our model. All right. So that's pretty much neural networks on MLX. You can do it in one of many ways. There are plenty of implementations within it. It's just a matter of what you're looking to build and how to build it.

Thumbnail 1690

MLX-LM: Loading and Running Large Language Models

So, moving on to MLX-LM. Apple understood that in order to actually build large language models, there are things that need to be wrapped around the neural networks, things that need to be added on. So Apple released another GitHub repo called MLX-LM, and MLX-LM gives you that wrapper for the large-model part. If we look at it, we have a few interesting things. We still need mlx.core, but from MLX-LM we also have load and generate, in order to load large models and generate text with them.

There will also be implementations there on how to actually inference various types of models, because different types of models have different types of infrastructure and layers and ways of being built and ways of being actually inferenced. The inference engine will need to know some of those in order to actually be able to run through all of the layers that we are expecting in order to get our output. So that's that. Later on, we'll talk about prompt cache a bit. It's sort of a KV cache for prompts, but yet again, it's not a real KV cache like we're expecting on other places where we're saying KV cache. It's just a way to cache the K and Vs of the prompts that we're heavily using. So that's that. And it's also built from within MLX-LM.

Thumbnail 1800

And we also have a convert function. We'll talk about the convert function in a bit, and it's very important because quantization is a very important tool for a lot of inference techniques. We'll talk about it in a bit. So yet again, we're bringing on some prerequisites, and we'll continue down.

Thumbnail 1810

Let's continue down and look at the model candidates. We're taking a few different models here. At the base, it's a single model, but with various sizes. We can see that we have a 4-bit quantized model, an 8-bit quantized model, a BF16 version as well, and the full size of that model. It's quite a small model with 3 billion parameters, which is still big in terms of neural networks. In this case, we're creating a model and tokenizer that we'll use shortly.

Thumbnail 1860

We'll choose a single model from our examples. In this case, I want to take the BF16 one. We'll be able to change it down the road as well. What we'll do is use the load function with the model name, and we will get a model object and a tokenizer. This model name comes from an integration between MLX and Hugging Face. So if that model is on Hugging Face, you can just use that with that name. That's why we also have the MLX community, which is part of the model name. We'll see down a bit when we quantize the model, we can select the name again and upload to Hugging Face.

Thumbnail 1900

In this use case, we'll just load a model and see how much time it takes. I already have the model weights on this specific Mac that we're using, so we didn't download anything. It was quite fast, and we can see that we have a low load time here, and we received our model and tokenizer. We ran an encode and decode through that tokenizer and received the actual post-tokenization output. As you can see, the tokenizer added the beginning-of-text token because that's part of that prompt.
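A sketch of that load-and-generate flow (the exact Hugging Face repo id below is illustrative; any MLX-converted model from the mlx-community org should work, and the first call downloads several gigabytes of weights):

```python
from mlx_lm import load, generate

# The model name resolves through Hugging Face; mlx-community hosts
# pre-converted MLX weights. Repo id is an illustrative example.
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-bf16")

# Round-trip through the tokenizer: note the BOS token added on encode.
tokens = tokenizer.encode("Hello from EC2 Mac")
print(tokens)
print(tokenizer.decode(tokens))

text = generate(
    model, tokenizer,
    prompt="Why is unified memory useful for inference?",
    max_tokens=100,
    verbose=True,
)
```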

Thumbnail 1940

Thumbnail 1960

Thumbnail 1990

Using the model object from within MLX-LM, we can also look at the model parameters. So in case we're fine-tuning a lower layer or doing various things and we want to understand if a parameter changed or something, we can use that to scroll through it. We can also see the model layers. That's very interesting because sometimes later on when we quantize, you'll see that maybe we don't want to quantize the entire model. Maybe we want to try quantizing a specific layer. Sometimes we're even fine-tuning just a specific layer. It really depends on what we're trying to do and how we're trying to modify a specific model for our use case. Being able to understand what is where and which type of layers we have really helps us understand what we're working with.

Thumbnail 2000

Thumbnail 2020

Thumbnail 2030

Now I want to move on to an interactive chat with Llama. In this case, we have Llama 3B, and we're taking the BF16 model. As you can see, we have a list of prompts, and this code will generally show us how to actually run and use the object. What we'll do is take the first prompt, inference it, take the next prompt, inference it, take the next prompt, and inference it. We'll measure the memory usage, using mx.get_active_memory to see how much memory we're actually using. We will also use our timer to calculate how much time it took to generate.
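A sketch of that measurement loop (the repo id is illustrative; note that on older MLX versions the memory counter lives at mx.metal.get_active_memory instead):

```python
import time
import mlx.core as mx
from mlx_lm import load, generate

# Illustrative repo id; downloads weights on first use.
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-bf16")

prompts = ["Introduce yourself.", "What is unified memory?"]

for prompt in prompts:
    start = time.perf_counter()
    generate(model, tokenizer, prompt=prompt, max_tokens=128)
    elapsed = time.perf_counter() - start
    # get_active_memory reports bytes currently held by MLX buffers
    used_gb = mx.get_active_memory() / 1024**3
    print(f"{elapsed:.1f}s, {used_gb:.1f} GB active")
```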

Thumbnail 2050

Thumbnail 2060

Thumbnail 2080

Thumbnail 2100

Looking at that for a second, let's let it start talking. It introduces itself, and while that's not that important, we can see we're getting pretty nice tokens per second, and it was quite fast. It used only 6 gigabytes, and continuing down the road, we're seeing pretty much the same. Now, what we can do is choose a larger model and use that one. Let's try to use our largest model here. I've brought Llama 3 with 70 billion parameters, but it's 4-bit quantized. Let's take that. It'll take a second to load. We'll see the tokenizer providing us the output from here.

Thumbnail 2130

The M1 Ultra contains 128 gigabytes of RAM. It's important to know that you can use only 75% of that RAM for a specific device. For example, the GPU will be able to get 75% of that 128 gigabytes. So you can actually load quite large models on top of it.

Thumbnail 2160

Let me see why this doesn't want to work. Everything breaks sometimes. Let me just look back to our full-size model here. I think maybe the kernel stuck. Let me try it out again.

Thumbnail 2170

Thumbnail 2180

Thumbnail 2190

Thumbnail 2200

So yet again we're taking the larger one, it's not the 70B. I'm not sure why it didn't work, but it's the 3B, and it's the full size of it. So let's try and run our interactive chat through it and we should be able to see a bit of a difference in the output. Not the output values, but the outputs in terms of tokens per second, which are quite similar.

Model Quantization Techniques and the Convert Function

Now let's talk about quantization. For those who don't know, quantization is the idea of using smaller sizes of data types in order to reduce the amount of calculation and the amount of memory that we actually need. There are many ways to do quantization. The best way to do quantization is to take the training input and actually quantize that, remove the extra bits in the data types, and then do real training on it. Do full-on training in order to actually calculate the weights between layers and understand what the actual activations should be.

Thumbnail 2300

Another thing we can do is take the model parameters as we have them and just reduce those. We can do quantization that way and just use the lower size. Although there will probably be a bit of a difference in accuracy from what we would have expected, it's still good enough for most use cases and a good enough implementation of quantization.

Generally, what this code will do is look at the specific model that it received. It checks that the sizes of it are what we're expecting, and this is a very nice table that I had a hard time making sure was printed out properly. The interesting thing to remember is that we have different quantization formats and we have various types of data type sizes. They will change things. They will change the quality, they'll change the speed, and the use case to use them will probably be a bit different.

Thumbnail 2370

Sometimes we would want a slower speed but higher accuracy. And sometimes we can decide on changing some of it: maybe we can make a specific layer less accurate but faster, and it will still provide us the results that we would like. It's really important to look for those kinds of things. In this case again, we can see that we have our large model, which weighs about 6 gigabytes. We can see that at smaller sizes it would probably be better suited for research, development, and so on, or for running on an edge device.

Thumbnail 2380

I want to talk about running on edge devices in a bit, because there is a really cool example of what can be done. So if we continue down, we talked about MLX-LM's convert function. MLX-LM provides us a convert function that actually wraps a convert function within MLX. We already have quantization logic within MLX itself, within MLX core. But for large language models, there are a few extra pieces that we need.

Thumbnail 2440

Thumbnail 2450

In that case, we have our convert function. As you can see, we pass the Hugging Face path of the model we want, tell it to quantize, decide a new path for our converted MLX model, and set the number of bits we want — we're deciding the data size of our quantized model. If I run that, it executes right now, and inside the directory I've created for my quantized models we can see our model. This is our quantized model here.
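A hedged sketch of what that call looks like — the model repo and output path below are illustrative, not the ones used in the session. It requires `mlx-lm` installed on Apple Silicon and downloads the weights from Hugging Face on first run, so it is a configuration example rather than something to run as-is:

```python
# Convert a Hugging Face model to MLX format and quantize it to 8 bits.
# Paths are illustrative assumptions.
from mlx_lm import convert

convert(
    hf_path="mlx-community/Llama-3.2-1B-Instruct",   # source model on the Hub
    mlx_path="./quantized_models/llama-1b-8bit",     # where to write the result
    quantize=True,
    q_bits=8,         # target weight precision
    q_group_size=64,  # weights share one scale per group of 64
)
```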

Thumbnail 2470

Thumbnail 2480

If we open up the config JSON — I'll open it as text — we can see we have a quantization and a quantization config, and we actually quantized our model to 8 bits. We can do more complex things, as I mentioned. Instead of quantizing every layer to 8 bits, we can pass a quant predicate: a function that receives each layer — its name or the layer itself — and decides how we would like to quantize it.

Thumbnail 2530

Thumbnail 2550

In this case, we can see that we built a per-layer quant predicate function. We took the LM head and said, all right, let's quantize the LM head to 6 bits, and let's quantize all the others to 4 bits. If we do that, it again takes only a few seconds, because there's no post-training here or anything — it's just a matter of reducing the sizes. If we go back, we can see another directory here, our variable quantized model. If we open its config JSON, we can see that the quantization section now has an object for each layer with that layer's group size and bit width. In that case, our LM head remained at 6 bits and the rest of the layers are quantized to 4 bits.
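A per-layer policy like the one just described can be sketched as a small function. The predicate shape follows MLX-LM's `quant_predicate` hook for `convert` (a callable that receives the layer path and returns quantization settings); the exact names and the commented-out call are assumptions for illustration:

```python
# Per-layer quantization policy: keep lm_head at 6 bits, quantize
# everything else to 4 bits. The dict values mirror MLX-LM's
# per-layer quantization config (bits + group_size).

def per_layer_predicate(path, module, config):
    """Return quantization settings for the layer at `path`."""
    if "lm_head" in path:
        return {"bits": 6, "group_size": 64}
    return {"bits": 4, "group_size": 64}

print(per_layer_predicate("lm_head", None, {}))
print(per_layer_predicate("model.layers.0.self_attn.q_proj", None, {}))

# How it would be wired up (requires mlx-lm on Apple Silicon):
# from mlx_lm import convert
# convert("mlx-community/Llama-3.2-1B-Instruct",
#         mlx_path="./quantized_models/llama-variable",
#         quantize=True, quant_predicate=per_layer_predicate)
```

This matches what shows up in the resulting config JSON: one quantization object per layer, each carrying its own bit width and group size.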

Thumbnail 2600

PyTorch Backend Support, Future Opportunities, and Key Takeaways

Now, just switching over to a browser here, let's talk about PyTorch for a second. All the optimizations we've just seen are available to us because we're using MLX. MLX will always get the latest from Apple, since Apple is the one maintaining it. So when there's something new — a new core or a new trick of some sort — we'll probably get an MLX update that lets us use it. If we're using PyTorch and we want PyTorch optimized on Mac, the way to do it is to use the MPS backend. Apple released a backend for PyTorch that allows exactly that.

It's quite similar to what we just released on Trainium on our side of things. On Trainium 3, right after the launch, we also released native support for PyTorch on Trainium and Inferentia chips. It's pretty much the same idea: building out the lower-level implementation behind a specific framework so it can use the various types of devices — at the lowest level, the Metal libraries from Apple, and someone needs to build that implementation. If you still want to use PyTorch rather than moving to MLX, this backend is the best thing you can get on Macs: a very similar PyTorch experience. The downside is that if anything changes, we're reliant on Apple releasing a new backend, or someone else fixing the existing one.
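In practice, using the MPS backend mostly comes down to device selection. Here is a minimal sketch: the fallback logic is a plain function so the same script runs on and off a Mac, and the commented lines show the real PyTorch calls (`torch.backends.mps.is_available()`), which require PyTorch installed on Apple Silicon:

```python
# Device selection for the PyTorch MPS backend: prefer Metal on
# Apple Silicon, then CUDA, then plain CPU.

def pick_device(mps_available: bool, cuda_available: bool = False) -> str:
    """Choose the best available PyTorch device string."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"

print(pick_device(True))           # on an EC2 Mac with Apple Silicon -> mps
print(pick_device(False, False))   # anywhere else -> cpu

# With PyTorch installed, the same idea looks like:
# import torch
# device = pick_device(torch.backends.mps.is_available(),
#                      torch.cuda.is_available())
# x = torch.randn(1024, 1024, device=device)
# y = x @ x.T  # the matmul runs on the GPU via Metal when device == "mps"
```

Because the device is just a string, existing PyTorch code usually only needs its `.to(device)` calls updated to benefit from the Metal backend.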

Thumbnail 2710

Thumbnail 2720

We have pretty much the same thing for TensorFlow as well. In TensorFlow, it's a pluggable device: you define that device and add it. Again, Apple provides all of the examples of how that looks, and that's pretty much it.

Thumbnail 2730

Thumbnail 2760

Now going forward, there are a few cool things that Apple also included in MLX. Apple included the fast library. Thinking about everything we're capable of doing, if there are preset operations that we know happen a lot, we can optimize them — and Apple did that for us for a few of them. An example would be RMS norm, or anything else you're seeing here, but RMS norm is something that happens a lot in large language models and in models as a whole. Instead of us building our own RMS norm and causing the array framework to make multiple calls to the SoC — do this now, then this, then this — Apple pre-compiled the best implementation it could create for us to use. So we can use the fast library from MLX core, and we have those operations there.
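To make the fused operation concrete, here is RMS norm written out as a pure-Python reference so the math is explicit — y_i = x_i / sqrt(mean(x²) + eps) · w_i — next to the single-kernel MLX equivalent (commented out, since it requires MLX on Apple Silicon; the input values are illustrative):

```python
# Reference RMS norm over a 1-D list of floats. MLX's fast library
# fuses this whole computation into one Metal kernel instead of the
# several separate array operations a naive implementation would issue.
import math

def rms_norm_ref(x, weight, eps=1e-5):
    """y_i = x_i / sqrt(mean(x^2) + eps) * weight_i"""
    ms = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(ms + eps)
    return [v * inv * w for v, w in zip(x, weight)]

x = [1.0, 2.0, 3.0, 4.0]
print([round(v, 4) for v in rms_norm_ref(x, [1.0] * 4)])

# The fused MLX call (requires mlx on Apple Silicon):
# import mlx.core as mx
# y = mx.fast.rms_norm(mx.array(x), mx.ones(4), 1e-5)
```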

Thumbnail 2790

Thumbnail 2800

Again, we can also see those for our transformations, as we've discussed, and so on. The last thing I wanted to talk about is the opportunity here. Because we're seeing these SoCs coming to the market — people have iPads and so on — there are more and more things we'll probably want to use the same cores for. Apple also recently released FastVLM. That's one example from Apple of how to build a model that's very optimized for this hardware, and they're actually using PyTorch as part of it as well.

Thumbnail 2840

But the idea here is that at the end of the day, you're capable of getting more from that specific device. If we look at various types of workloads, sometimes we have workloads that move through multiple different layers or multiple different neural nets, which need to stay hot in memory so we don't have to reload them, and we don't necessarily need many cores to run inference through them. A lot of use cases like that can use this type of hardware.

Thumbnail 2880

Thumbnail 2890

Thumbnail 2900

Thumbnail 2910

And down the road I really hope to see more and more. We're seeing AMD now also building Ryzen AI chips; it's just a matter of them building support similar to MLX in order to give us more and more tools to use them. Continuing forward, I really believe in the direction of SoCs that provide more than a single device. I personally really hope to see more and more, and I'm really looking forward to seeing what our customers will be able to do with these types of interesting cores. That's pretty much it.

Thumbnail 2920

Thumbnail 2930

So if we have to summarize what we learned: first, you really have a lot of computing power on your desk when you have a Mac mini with an Apple Silicon chip. These chips contain the CPU, the Neural Engine, and other types of engines, with unified memory that makes them extremely power-efficient and fast at processing your machine learning workloads. Second, you have the MLX framework, which allows you to create, train, and customize your models, and MLX-LM, which allows you to run existing models from Hugging Face directly — you can download a model and run it directly on your Mac.

Thumbnail 2970

And third, if you are still in the PyTorch world, which makes a lot of sense, you can also run PyTorch directly on Apple Silicon with the backend that Eliran just showed you, so your existing code can take advantage of the hardware. That was dense. We still have a couple of minutes, so we might open the floor for questions. Thank you so much for your interest and all the questions. Thank you, Eliran, for doing the hard work there. Thank you very much, guys. We're really looking forward to seeing what you'll build. Have a good one.


; This article is entirely auto-generated using Amazon Bedrock.
