Jimin Lee

On-Device LLM

Note: This article was originally written in 2024. Even though I’ve updated parts of it, some parts may feel a bit dated by today’s standards. However, most of the key ideas about LLMs remain just as relevant today.

I'd like to talk about On-Device LLMs today. The term "On-Device LLM" has been getting a lot of buzz recently, often mentioned alongside the broader topic of On-Device AI.

For instance, Samsung has integrated an On-Device LLM into its Galaxy AI, starting with the Galaxy S24, and other manufacturers are following. Apple recently announced that its upcoming Apple Intelligence will feature a 3B-class On-Device LLM.

So, what exactly is an On-Device LLM, and what does it take to build one? Let's dive in.

What is an On-Device LLM?

Let's start with the basics: what does "On-Device LLM" even mean?

As the name suggests, it's an LLM that runs on a local device. "On-Device" is the counterpart to the cloud: instead of relying on a powerful cloud server, the model runs directly on the hardware you own—your smartphone, laptop, PC, or even a refrigerator, TV, or car. As long as it's not the cloud, it's "on-device."

For the sake of this article, we'll focus on smartphones, as they're the most common use case for On-Device LLMs.

Next, what does it mean for an LLM to "run" on a device? In this context, "running" refers to inference. While there are some attempts to train LLMs on-device, for now, it's primarily limited to inference.

When people talk about On-Device LLMs, they usually mean one of two things:

  1. A lightweight LLM (often called an sLM or sLLM) that's small enough to perform inference on a device.
  2. The act of running an LLM on a device itself. Even a lightweight LLM requires some serious technical know-how to get it running smoothly on-device.

On-Device Constraints: Memory

So, what makes an on-device environment so different from the cloud that it requires its own category? Let's assume we're talking about a smartphone, though most of these points apply to any on-device setup.

First, there's the memory constraint, which includes both memory size and memory speed.

Today's smartphones typically have 8GB to 16GB of RAM, with some Chinese brands offering models with 24GB. The Galaxy S24 comes in 8GB and 12GB versions, while the iPhone 15 Pro has 8GB.

How much memory does a 7B LLM require? If you store the model parameters in fp32 (32-bit floating point), you need 28GB. With fp16 (16-bit floating point), you need 14GB. Even a 12GB phone is nowhere near enough, especially when you consider the memory already being used by the operating system and other apps.

What about a smaller 3B LLM? That still requires 12GB for fp32 and 6GB for fp16, which is not a practical amount of memory on a phone.
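
To make the arithmetic concrete, here's a tiny sketch of the calculation (weights only; activations and the KV cache add more on top):

```python
# Back-of-the-envelope check of the numbers above:
# memory (bytes) = parameter count * bytes per parameter (weights only).

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for num_params, label in [(7e9, "7B"), (3e9, "3B")]:
    for dtype in ("fp32", "fp16"):
        print(f"{label} @ {dtype}: {weight_memory_gb(num_params, dtype):.0f} GB")
# 7B @ fp32: 28 GB, 7B @ fp16: 14 GB, 3B @ fp32: 12 GB, 3B @ fp16: 6 GB
```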

Memory speed is another major issue. While some AI workloads are compute-bound, modern LLMs built on the Transformer architecture are notoriously memory-bound. This means their performance is limited more by how fast they can read weights from memory than by the raw speed of the CPU or GPU.
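
A rough back-of-the-envelope illustrates why: during decoding, generating each token requires streaming (roughly) all of the model's weights from memory, so token throughput is capped by memory bandwidth divided by model size. The bandwidth figures below are illustrative assumptions, not measured specs.

```python
# Rough ceiling on decode speed for a memory-bound LLM: each generated token
# streams (approximately) all weights from memory once, so
#   tokens/sec <= memory_bandwidth / model_size.
# The bandwidth figures are illustrative assumptions, not device specs.

def max_tokens_per_sec(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    return bandwidth_gb_per_s / model_size_gb

model_size_gb = 6.0        # 3B parameters in fp16 (see above)
phone_bandwidth = 60.0     # assumed LPDDR5-class phone memory, GB/s
server_bandwidth = 2000.0  # assumed HBM-class datacenter GPU memory, GB/s

print(f"phone ceiling:  ~{max_tokens_per_sec(model_size_gb, phone_bandwidth):.0f} tokens/s")
print(f"server ceiling: ~{max_tokens_per_sec(model_size_gb, server_bandwidth):.0f} tokens/s")
```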

To solve this, cloud GPUs like the A100 and H100 use incredibly fast high-bandwidth memory (HBM). But smartphones can't do that for two main reasons. First, it would make the device prohibitively expensive. Who would buy a smartphone that costs $15,000 just to run an On-Device LLM? Second, these high-end components consume a lot of power, which would quickly drain a smartphone's battery.

On-Device Constraints: Storage

Storage poses similar challenges. The iPhone 15 Pro offers storage options from 128GB to 1TB. The apps we use every day are typically a few dozen to a few hundred megabytes, with some games reaching a few gigabytes.

But as we saw, a 3B LLM can take up 6GB to 12GB of storage. That's a significant chunk of space on a phone where people are constantly complaining about not having enough room for photos.

Storage speed is another hurdle. To run an LLM, the model must first be loaded from storage into memory. Since storage is much slower than RAM, this loading process can be a bottleneck.

This problem gets even worse when combined with memory size constraints. Cloud servers can take a while to load a model initially, but that's fine since it only happens once when the server starts. On a smartphone, the OS routinely shuts down apps it deems unnecessary to free up memory, and a massive LLM is a prime target. Once the model is unloaded, it needs to be loaded again from storage the next time you use it.

Alternatively, you could prevent the OS from killing the LLM app, but then a huge chunk of precious memory is permanently occupied, leaving less for other applications.

On-Device Constraints: Processing Speed

Processing speed is a massive problem. No matter how much smartphone CPUs and GPUs have improved, they can't compete with the A100s and H100s used in the cloud. Imagine you somehow managed to get an On-Device ChatGPT running on your phone, but it took 10 minutes to respond to a question. Would you ever use it? The answer is a clear no.

On-Device Constraints: Battery and Heat

In addition to the constraints above, battery life and heat are significant issues. For mobile devices, power efficiency is crucial. Driving hardware hard enough to run an LLM draws a lot of electricity, which a small battery can't sustain.

Heat is a related problem. Even if you could solve the battery issue, running a powerful LLM would generate a tremendous amount of heat. Smartphones have poor heat dissipation, which can lead to thermal throttling—slowing down the hardware to prevent overheating—and even cause low-temperature burns to the user's hand.

Overcoming the Limitations

After reading all this, you might be thinking, "Is an On-Device LLM even possible?" The answer is yes. Samsung, Apple, and others are already building them into their flagship products.

Let's explore how they're overcoming these challenges and making On-Device LLMs a reality.

Model Lightening

The first step is to reduce the model's size. An LLM is simply too big to run on a smartphone. How do we shrink it?

The most obvious way is to use a smaller model. A 7B model is better than a 70B, and a 70B is better than a 100B. Smaller is better. As mentioned, Apple is using a 3B LLM for Apple Intelligence. While that's a small model by cloud standards, it's still a significant size for a device.

Once you have a reasonably small model, you can use various proven techniques to make it even lighter. These include Pruning to remove unnecessary weights and Distillation to train a smaller model using a larger one. While both are important, the most critical technique for on-device environments is Quantization.

Like pruning and distillation, quantization isn't exclusive to on-device tech, but it's used much more aggressively here.

(Quantization is a complex topic that deserves its own article, so we'll keep it brief for now.)

Quantization, in simple terms, reduces the size of the data used to store the model's parameters. The fp32 data type, commonly used in the cloud, requires 4 bytes per number. For a 3B LLM, that's 3B * 4 = 12GB of memory. Using fp16 (2 bytes) reduces this to 3B * 2 = 6GB.

What if we use even smaller data types? We can use int8 (8-bit integers, 1 byte) or int4 (4-bit integers, 0.5 bytes). While this reduces accuracy, it drastically shrinks the model size. Quantizing a 3B LLM to int8 requires just 3GB, and int4 requires a mere 1.5GB. Compared to the 12GB needed for fp32, this is a monumental space saving.
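
To make that concrete, here's a toy sketch of symmetric int8 quantization of a single weight matrix (illustrative only; production int4/int8 schemes use per-group scales and smarter calibration):

```python
import numpy as np

# Toy symmetric int8 quantization of one weight matrix (illustrative only;
# real int4/int8 schemes add per-group scales, clipping, calibration, etc.).

def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0                    # map max |w| to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)          # one fp32 weight matrix
q, scale = quantize_int8(w)

print(w.nbytes / 2**20, "MiB in fp32")                       # ~64 MiB
print(q.nbytes / 2**20, "MiB in int8")                       # ~16 MiB (4x smaller)
print("mean reconstruction error:", np.abs(w - dequantize(q, scale)).mean())
```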

This works under the assumption that we can maintain model quality, and fortunately, advancements in technology have made it possible to use int4 quantization with minimal performance degradation. Qualcomm, for instance, has confidently promoted INT4 support on its NPUs.

Unlike pruning or distillation, quantization requires specific hardware support. Even if you have an int4 quantized model, if the hardware only supports fp16, the values will have to be converted to fp16 when loaded into memory. This negates the memory savings you get from quantization. We'll discuss this more later, but NPUs, which are commonly used on-device, support a more limited range of data types and operations than GPUs, which must be considered during quantization.

Acceleration

Let's say we've used int4 quantization to shrink a 12GB 3B model down to 1.5GB. That's a huge reduction, but unfortunately, it's still too large to run effectively on a smartphone's CPU.

We need hardware acceleration for inference.

The first option that comes to mind is the familiar GPU. With the amazing graphics in today's mobile games, it's clear that smartphone GPUs are powerful. Using the GPU would be much faster than the CPU, but there are two main problems with this approach.

First, the GPU is not an exclusive resource. Multiple apps like games and graphics editors need GPU acceleration. If an LLM is hogging the GPU, other apps won't run properly.

Second, GPUs are power-hungry. Running an LLM on the GPU would drain the battery rapidly, similar to playing a graphically intensive game nonstop.

This is where the NPU comes in.

(Like quantization, NPUs are a huge topic that we'll save for another article.)

Just as a GPU is specialized for graphics acceleration, an NPU (Neural Processing Unit) is specialized for running deep learning tasks. It accelerates common deep learning operations by building them directly into the chip. While NPUs initially focused on computer vision tasks, they have evolved to include features for NLP and, more recently, for efficiently running the Transformer architecture.

Using an NPU for LLM inference offers several benefits:

  1. Dedicated Resource: NPUs are designed specifically for deep learning, so there's less competition with other apps.
  2. Power Efficiency: NPUs are built for low-power operations, consuming less battery than a CPU or GPU.
  3. Speed: Hardware acceleration leads to faster inference.

However, NPUs also have drawbacks. They are less versatile than GPUs. Each chipset manufacturer (Qualcomm, Apple, Samsung, etc.) designs their NPU differently, so a model might use an operation that a specific NPU doesn't support. This means that a generic model created with frameworks like PyTorch or TensorFlow needs to be converted and optimized for each NPU. If a function isn't supported, you might have to modify the model or find a workaround in the software, which can be a real hassle.

This conversion process isn't just an inconvenience; it can lead to performance degradation if the NPU's supported operations aren't a perfect match. In the worst-case scenario, your model might not run on the NPU at all.
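
As one concrete (and hedged) illustration of what this conversion step can look like, here's a sketch of exporting a generic TensorFlow model with full-integer quantization via TensorFlow Lite. Vendor NPU toolchains (Qualcomm, Apple, Samsung, and so on) differ, but the overall workflow (convert, restrict the op set, calibrate) is similar. The model path and input shape here are hypothetical.

```python
import tensorflow as tf

# Hedged sketch: exporting a generic float model for an on-device runtime
# (TensorFlow Lite here) with full-integer post-training quantization.
# Vendor NPU toolchains differ, but the shape of the workflow is similar.
# "saved_model_dir" and the input shape are hypothetical.

saved_model_dir = "my_model/"

def representative_data_gen():
    # Small calibration set so the converter can choose quantization ranges.
    for _ in range(100):
        yield [tf.random.normal([1, 256], dtype=tf.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Require integer-only kernels: conversion fails if an op has no int8 kernel,
# which is exactly the "unsupported operation" problem described above.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_model = converter.convert()
with open("my_model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```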

Due to these constraints, chipset manufacturers often provide guidelines. For example, Qualcomm suggests using the CPU for real-time tasks that require a small model and the NPU for longer processing or larger models.

Hardware isn't the only solution for acceleration. Techniques developed for cloud LLMs, such as KV Cache, GQA, and (Self) Speculative Decoding, are also being adopted for on-device use.
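
To give a flavor of one of these, here's a toy sketch of a KV cache for single-head attention (not any particular library's API): instead of reprocessing the whole prefix at every decode step, each new token appends its key and value to a cache and attends over everything stored so far.

```python
import numpy as np

# Toy single-head attention decode step with a KV cache: keys/values for past
# tokens are stored and reused, so each new token only computes its own K/V
# instead of reprocessing the entire prefix.

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    def __init__(self, d_model):
        self.K = np.zeros((0, d_model))
        self.V = np.zeros((0, d_model))

    def decode_step(self, x, Wq, Wk, Wv):
        q, k, v = x @ Wq, x @ Wk, x @ Wv           # only the new token
        self.K = np.vstack([self.K, k])             # append to the cache
        self.V = np.vstack([self.V, v])
        scores = softmax(q @ self.K.T / np.sqrt(q.shape[-1]))
        return scores @ self.V                      # attend over all cached tokens

d = 64
Wq, Wk, Wv = (np.random.randn(d, d) * 0.02 for _ in range(3))
cache = KVCache(d)
for token_embedding in np.random.randn(5, d):       # 5 decode steps
    out = cache.decode_step(token_embedding[None, :], Wq, Wk, Wv)
print(cache.K.shape)  # (5, 64): one cached K/V row per generated token
```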

Quality Assurance

So now we have a small model that runs fast, but what if the output quality is bad? The small size of on-device models makes this a serious concern. They're often less responsive to prompts and instructions.

This is where fine-tuning becomes critical. For On-Device LLMs, fine-tuning isn't just an option—it's a necessity for ensuring good performance. But this introduces another problem.

Imagine you fine-tuned a 3B model to create a summarization feature. Now you want to add a resume writing feature. You'll need to fine-tune the model again, creating a separate, specialized version. What if you also want an email drafting feature? You'll need yet another model. Soon, you'll have multiple 3B models, each consuming a lot of storage.

This is where LoRA (Low-Rank Adaptation) is perfect. LoRA lets you keep a single shared base model and swap in small, task-specific LoRA weights, saving significant storage space. Implementing this on-device is not trivial, but some companies have already done it; Apple Intelligence, for example, ships task-specific adapters on top of a shared on-device base model.
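
Here's a minimal sketch of the idea in PyTorch (illustrative only, not any vendor's production implementation): the base weight stays frozen and shared, and each new task only needs its own small A and B matrices.

```python
import torch
import torch.nn as nn

# Minimal LoRA-style linear layer (illustrative sketch). The frozen base
# weight is shared across all tasks; each task only ships a low-rank update
# B @ A, so "switching features" means swapping two small matrices instead
# of storing another full 3B model.

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # base stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

base = nn.Linear(2048, 2048)
layer = LoRALinear(base, rank=8)

full_params = base.weight.numel()                    # ~4.2M for this one layer
lora_params = layer.A.numel() + layer.B.numel()      # 2 * 8 * 2048 = 32,768
print(f"LoRA adds {lora_params / full_params:.2%} of the base layer's weights")
```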

The Cascade of Performance Loss

Let's recap the entire process:

  1. Choose a reasonably sized LLM as your starting point.
  2. Fine-tune it for a specific task.
  3. Shrink the model using techniques like pruning, distillation, and quantization.
  4. Convert and optimize the model for your target NPU.

The problem is that each of these steps can introduce performance loss.

  • Using LoRA for fine-tuning might result in slightly lower performance than full fine-tuning.
  • Quantization can cause a drop in quality.
  • The conversion process for the NPU can also lead to performance degradation.

After all these steps, the final performance can be much lower than you initially expected. The real challenge lies in the painstaking process of analyzing and fixing where the performance dropped in the pipeline.

So, Why On-Device LLMs?

Despite these difficulties, why are so many people and companies so interested in On-Device LLMs?

From a user's perspective, privacy is a major benefit. As AI features become more integrated into our lives, we share more personal data. With On-Device AI, you have the assurance that your data never leaves your device.

For companies, there are multiple advantages:

  • Reduced Cloud Costs: GPUs are expensive, and running a cloud LLM comes with significant operational costs, including electricity.
  • Legal Risk Mitigation: Storing and processing user data in the cloud has its benefits, but it also carries significant risks, especially with privacy laws like GDPR.
  • New Product Differentiation: Samsung and Apple are using "Galaxy AI" and "Apple Intelligence" as powerful selling points for their new phones. This trend is also gaining traction in PCs with the "AI PC" movement. It's also a win for semiconductor companies, since running these LLMs demands more RAM, more storage, and more powerful processors.

Today, LLM companies seem to be split into two camps: those like OpenAI and Anthropic, who focus on huge models, and those who target On-Device, starting with smaller models. Startups, in particular, are focusing on smaller models because they can't compete with the massive scale of the larger players.

Conclusion

In the end, building an On-Device LLM isn’t just about compressing a big model into a small chip. It’s about mastering the art of trade-offs—balancing memory, speed, quality, and user experience. The companies that win this race won’t just shrink models; they’ll redefine what’s possible on the devices we carry every day.
