KAMAL KISHOR

Posted on Oct 11, 2024 • Edited on Apr 17

Running Llama 3.2 on Android: A Step-by-Step Guide Using Ollama

#webdev #javascript #openai #python

Meta’s Llama 3.2, unveiled at their Developer Conference, has redefined on-device AI with its multimodal capabilities and optimized models for mobile hardware from Qualcomm and MediaTek. With four variants—multimodal models (11B and 90B parameters) and text-only models (1B and 3B parameters)—Llama 3.2 brings powerful AI to Android devices, enabling private, efficient, and responsive applications. The lightweight 1B and 3B models are particularly suited for mobile, excelling in text generation and multilingual tasks, while the larger models shine in image understanding and chart reasoning.

In this blog, we’ll walk you through the updated process of running Llama 3.2 on an Android device using Termux and Ollama. Recent advancements have simplified the setup, making it easier than ever to deploy these models locally. We’ll focus on the practical, hands-on steps to get you started, address performance considerations, and highlight why local AI is a game-changer.

Why Run Llama 3.2 Locally?

Running AI models like Llama 3.2 on your Android device offers two key advantages:

Instant Processing: All computations happen on-device, eliminating cloud latency.
Enhanced Privacy: Your data stays on your device, ensuring confidentiality.

With Termux providing a Linux environment and Ollama streamlining model management, you can harness Llama 3.2’s capabilities without relying on cloud infrastructure.

Prerequisites

Before diving in, ensure you have:

An Android device with at least 2GB RAM (4GB+ recommended for the 3B model).
A stable internet connection for downloading Termux, Ollama, and the Llama 3.2 model.
About 2–3GB of free storage for the 1B or 3B models and dependencies.

Step-by-Step Guide to Running Llama 3.2 on Android

Step 1: Install Termux

Termux is a terminal emulator that creates a Linux environment on Android without requiring root access.

Download Termux:
- Visit the Termux GitHub Releases page and download the latest APK (e.g., termux-app_v0.119.0+ or newer).
- Avoid the Google Play Store version, as it may be outdated.
Install the APK:
- Open the downloaded file and follow the prompts to install Termux.
Launch Termux:
- Open the app to access the terminal.

Step 2: Set Up Termux Environment

Configure Termux to ensure it’s ready for Ollama and Llama 3.2.

Grant Storage Access:

   termux-setup-storage

This allows Termux to access your device’s storage for file management.

Update Packages:

   pkg update
   pkg upgrade

Enter Y when prompted to update Termux and its packages to the latest versions.

Step 3: Install Ollama

Ollama, a platform for running large language models, is now available directly in the Termux repository, simplifying the process compared to older methods that required manual compilation.

Install Ollama:

   pkg install ollama

This command installs the pre-built Ollama binary, saving you from cloning and building the repository manually.

Step 4: Start the Ollama Server

Run the Ollama server to manage model requests.

Start the Server:

   ollama serve &

The & runs the server in the background, freeing up the terminal for further commands.

Step 5: Run Llama 3.2

Choose a Llama 3.2 model based on your device’s capabilities and run it.

Select a Model:
- 1B Model (llama3.2:1b, ~1.3GB): Ideal for low-spec devices with 2GB+ RAM.
- 3B Model (llama3.2:3b, ~2.0GB): Suitable for devices with 4GB+ RAM for better performance.
- The 11B and 90B models are impractical for most Android devices due to their size (7.9GB and 55GB) and resource demands.
Run the Model: For the 3B model:

   ollama run llama3.2:3b

For the 1B model (if performance is slow):

   ollama run llama3.2:1b

The model will download the first time you run it. Once downloaded, you can interact with it directly in the terminal.

Optional Verbose Mode: Add --verbose for detailed logs, e.g., ollama run llama3.2:3b --verbose.

Step 6: Interact with Llama 3.2

After the model loads, you’ll see a prompt where you can input queries. For example:

Input: “Summarize a 500-word article into 50 words.”
Output: Llama 3.2 will generate a concise summary based on your input.

Try tasks like text generation, answering questions, or multilingual translation. The 1B and 3B models are optimized for such text-based tasks.

Step 7: Optimize Performance

Performance depends on your device’s hardware. Here are tips to ensure smooth operation:

Close Background Apps: Free up RAM by closing heavy apps.
Use the 1B Model: If the 3B model lags (e.g., on older devices), switch to the 1B model for faster responses.
Tested Devices: The 1B model runs smoothly on mid-range devices (e.g., 2GB RAM), while the 3B model performs well on high-end devices like the Samsung S21 Ultra with 4GB+ RAM.

Step 8: Optional Cleanup

After experimenting, clean up to save space or streamline access.

Remove Unnecessary Files:

   chmod -R 700 ~/go
   rm -r ~/go

This clears out any residual Go-related files.

Move Ollama Binary:

   cp ollama/ollama /data/data/com.termux/files/usr/bin/

This makes ollama accessible globally, so you can run it from any directory.

Troubleshooting Common Issues

Slow Performance: Switch to the 1B model or close other apps. Devices with less than 4GB RAM may struggle with the 3B model.
Installation Errors: Ensure pkg update and pkg upgrade are run first. If pkg install ollama fails, check your internet connection or reinstall Termux.
Model Download Stuck: Verify storage availability and retry. Ollama models are large (1.3–2.0GB for 1B/3B).
Verbose Logs: Use --verbose to diagnose issues during model execution.

For device-specific issues (e.g., slow responses on Pixel devices post-updates), check community forums like Reddit or the Termux GitHub.

Why This Matters

Llama 3.2’s ability to run on Android devices marks a significant step toward democratizing AI. By leveraging Termux and Ollama, developers and enthusiasts can:

Build privacy-first apps that process data locally.
Create offline AI tools for tasks like summarization, translation, or chatbots.
Experiment with AI on consumer-grade hardware, reducing reliance on cloud services.

The 1B and 3B models are particularly exciting for mobile, enabling use cases like on-device personal assistants or educational tools in low-connectivity areas.

Conclusion

Running Llama 3.2 on Android with Termux and Ollama is now more accessible than ever, thanks to the simplified pkg install ollama method. Whether you’re a developer exploring on-device AI or an enthusiast curious about local LLMs, this guide equips you to get started. With models optimized for mobile and a streamlined setup, Llama 3.2 opens the door to fast, secure, and private AI applications.

For more details, visit the Ollama website or explore community discussions on Termux GitHub. Share your experiences or questions in the comments below—happy AI tinkering!

Top comments (30)

Sam Rahimi • Feb 21

Awesome! Worked out of the box for me (Vivo v30 Lite, Android 14, and the latest Termux APK, specifically: github.com/termux/termux-app/relea...)

Note that this will almost certainly NOT work if you download Termux from Google Play Store - while it's fine for casual use it is NOT the same as the open source distro that you download from GitHub.

KAMAL KISHOR • Mar 7

Glad to hear it worked out of the box for you on your Vivo V30 Lite with Android 14! 🎉 Thanks for sharing your setup details.

And yes, you're absolutely right—downloading Termux from the Google Play Store can cause issues since it's outdated and lacks important updates. The GitHub version is the way to go for the latest features and compatibility.

Honoré SOKE • Oct 18 '24

Great article. Thank

manutown • Jan 30

Está corriendo suavemente el de 3 b
Muchas gracias

KAMAL KISHOR • Jan 31

¡Me alegra saber que el modelo de 3B está funcionando sin problemas! 😊 Si tienes alguna otra pregunta o necesitas más ayuda, no dudes en preguntar. Aquí estoy para ayudarte. ¡Feliz experimentación con Ollama y los modelos de IA! 🚀

manutown • Jan 31

Y lo he hecho funcionar con un celular Android de gama baja el poco c65 de Xiaomi
no es de gama alta como los Samsung s23 en el cual funcionan muy rápido gracias

KAMAL KISHOR • Feb 8

Parece que lograste hacerlo funcionar en un Xiaomi Poco C65, ¡eso es genial! A pesar de ser un dispositivo de gama baja, demuestra que la optimización y la eficiencia del software pueden marcar la diferencia. 🚀 ¿Notaste algún problema de rendimiento o corre bien en general?

manutown • Feb 10

Ya lo instale en un galaxy s24 y ejecuta sus análisis igual en los mismos tiempos si utilizo el el 1b responde más rápido pero tiende a mentir y hacer circular la respuesta

Ervan Kurniawan • Feb 17 • Edited

Successfully run deepseek-coder for the first time locally even too slow! (I'll change the model later that run faster on my Samsung A51 device 😅)
Image description

b9Joker108 • Oct 26 '24

I got this error 🤕 on my Samsung Galaxy Tab S9 Ultra:

❯ go build .
# github.com/ollama/ollama/llama
ggml-quants.c:4023:88: error: always_inline function 'vmmlaq_s32' requires target feature 'i8mm', but would be inlined into function 'ggml_vec_dot_q4_0_q8_0' that is compiled without support for 'i8mm'
ggml-quants.c:4023:76: error: always_inline function 'vmmlaq_s32' requires target feature 'i8mm', but would be inlined into function 'ggml_vec_dot_q4_0_q8_0' that is compiled without support for 'i8mm'
ggml-quants.c:4023:64: error: always_inline function 'vmmlaq_s32' requires target feature 'i8mm', but would be inlined into function 'ggml_vec_dot_q4_0_q8_0' that is compiled without support for 'i8mm'
ggml-quants.c:4023:52: error: always_inline function 'vmmlaq_s32' requires target feature 'i8mm', but would be inlined into function 'ggml_vec_dot_q4_0_q8_0' that is compiled without support for 'i8mm'
# github.com/ollama/ollama/discover
gpu_info_cudart.c:61:13: warning: comparison of different enumeration types ('cudartReturn_t' (aka 'enum cudartReturn_enum') and 'enum cudaError_enum') [-Wenum-compare]

lanbase • Oct 27 '24 • Edited

Hi, I have got the same error on Honor Magic6 Pro (snapdragon 8 gen3).

~/ollama $ go build .
# github.com/ollama/ollama/discover
gpu_info_cudart.c:61:13: warning: comparison of different enumeration types ('cudartReturn_t' (aka 'enum cudartReturn_enum') and 'enum cudaError_enum') [-Wenum-compare]
# github.com/ollama/ollama/llama
ggml-quants.c:4023:88: error: always_inline function 'vmmlaq_s32' requires target feature 'i8mm', but would be inlined into function 'ggml_vec_dot_q4_0_q8_0' that is compiled without support for 'i8mm'
ggml-quants.c:4023:76: error: always_inline function 'vmmlaq_s32' requires target feature 'i8mm', but would be inlined into function 'ggml_vec_dot_q4_0_q8_0' that is compiled without support for 'i8mm'
ggml-quants.c:4023:64: error: always_inline function 'vmmlaq_s32' requires target feature 'i8mm', but would be inlined into function 'ggml_vec_dot_q4_0_q8_0' that is compiled without support for 'i8mm'
ggml-quants.c:4023:52: error: always_inline function 'vmmlaq_s32' requires target feature 'i8mm', but would be inlined into function 'ggml_vec_dot_q4_0_q8_0' that is compiled without support for 'i8mm'
~/ollama $

update :

I have found a workaround here:

github.com/ollama/ollama/issues/7292

cheers.

Johan • Oct 27 '24 • Edited

I had the same, but found a workaround here

Basically, you modify llama.go#L37-L38 to remove -D__ARM_FEATURE_MATMUL_INT8

walid f • Mar 21

All this tutorial is actually outdated, you can directly install ollama by using:
pkg install ollama
on both playstore and non playstore version of termux (the playstore version gets update slower than the other one, and btw for people on pixel 7 and more use only Ollama via CLI if you try to use any Webui or gui for ollama you tokens will go from 10t/s to 0.2/s (no jokes its really 1token every 5seconds, that due to the 5 march update on pixel phone that i dont what it does exactly but im makes the text response goes really slow.

acauema • Apr 17

yeah, very very true! thanks for the update. i will repeat the command below with proper formatting for better ux:

pkg install ollama

and to remove all the unnecessary files from this tutorial, i just deleted the folder, which i installed on home folder.

rm -rf ~/ollama

i actually kept go and other packages as i will probably use them eventually... but to remove the packages as well, go ahead with it.

pkg remove git cmake goland

Nigel Burton • Nov 9 '24

Qualcomm's spec sheets for Snapdragon 8 gen 3 suggest it can use GPU and a DSP to speed up LLM inference.

Do you, or any other readers know whether Ollama is taking advantage of the hardware?

if not are there any open source projects that are utilizing the full capabilities of the Gen3?

Thanks for the very useful article.

Learn AI • Jan 22

Based on the search results, here’s a detailed response to your questions regarding the utilization of Snapdragon 8 Gen 3 hardware (GPU and DSP) for LLM inference, particularly with Ollama and other open-source projects:

1. Is Ollama Taking Advantage of Snapdragon 8 Gen 3 Hardware?

As of the latest information, Ollama does not currently fully utilize the GPU and DSP capabilities of the Snapdragon 8 Gen 3 for LLM inference. While Ollama supports running models like Llama 3.2 on Android devices using Termux, its primary focus has been on CPU-based inference. There are discussions and efforts to integrate GPU and NPU support, but these are still in progress and not yet fully realized.

For example:

A user reported that the NPU on a Snapdragon X Elite device (which shares similar architecture with Snapdragon 8 Gen 3) was not being utilized when running Ollama.
Developers have mentioned that GPU support (via OpenCL) is being worked on, but NPU support will require more effort and has no estimated timeline.

2. Open-Source Projects Utilizing Snapdragon 8 Gen 3 Hardware

Several open-source projects and frameworks are actively leveraging the full capabilities of the Snapdragon 8 Gen 3, including its GPU, DSP, and NPU for AI and LLM tasks:

a. Qualcomm AI Hub Models

The Qualcomm AI Hub Models project provides optimized machine learning models for Snapdragon devices, including the Snapdragon 8 Gen 3. These models are designed to take advantage of the hardware's CPU, GPU, and NPU for tasks like image classification, object detection, and LLM inference.
The project supports various runtimes, including Qualcomm AI Engine Direct, TensorFlow Lite, and ONNX, enabling efficient deployment on Snapdragon hardware.

b. MiniCPM-Llama3-V 2.5

This open-source multimodal model is optimized for deployment on Snapdragon 8 Gen 3 devices. It uses 4-bit quantization and integrates with Qualcomm’s QNN framework to unlock NPU acceleration, achieving significant speed-ups in image encoding and language decoding.
The model demonstrates how open-source projects can leverage Snapdragon hardware for efficient on-device AI applications.

c. Llama.cpp

Llama.cpp is a popular open-source project for running LLMs locally. While it primarily focuses on CPU inference, there are ongoing efforts to add GPU and NPU support for Snapdragon devices. For example, developers are working on an OpenCL-based backend for Adreno GPUs, which could extend to Snapdragon 8 Gen 3.
Some users have reported successful performance benchmarks on Snapdragon X Elite devices, indicating potential for future optimizations.

d. Qualcomm AI Engine Direct

This framework allows developers to compile and optimize models for Snapdragon hardware, including the GPU and NPU. It is used in projects like EdgeStableDiffusion, which demonstrates how large models like Stable Diffusion can be efficiently run on Snapdragon 8 Gen 2 and Gen 3 devices.

3. Future Prospects

Ollama: While Ollama does not yet fully utilize Snapdragon 8 Gen 3 hardware, the development community is actively working on GPU and NPU support. This could significantly improve performance for on-device LLM inference in the future.
Open-Source Ecosystem: Projects like Qualcomm AI Hub Models, MiniCPM-Llama3-V 2.5, and Llama.cpp are leading the way in leveraging Snapdragon hardware. These efforts highlight the potential for open-source tools to fully utilize the capabilities of modern mobile chipsets.

Conclusion

Currently, Ollama does not fully utilize the GPU and DSP capabilities of the Snapdragon 8 Gen 3, but there are promising open-source projects like Qualcomm AI Hub Models, MiniCPM-Llama3-V 2.5, and Llama.cpp that are making significant strides in this area. As development continues, we can expect more tools to take full advantage of Snapdragon hardware for efficient on-device AI and LLM inference.

For further details, you can explore the referenced projects and discussions in the search results.