DEV Community

Nithin I Bhandari
Nithin I Bhandari

Posted on

How to run LLAMA 2 on your local computer

Introduction

LLAMA 2 is a large language model that can generate text, translate languages, and answer your questions in an informative way. In this blog post, I will show you how to run LLAMA 2 on your local computer.

Prerequisite:

  1. Install anaconda
  2. Install Python 11

Steps

Step 1:

1.1: Visit to huggingface.co
Model Link: https://huggingface.co/meta-llama/Llama-2-7b-hf
1.2: Create an account on HuggingFace
1.3: Request for llama model access
It may take a day to get access.
1.4: Go to below link and request llama access
Link: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
1.5: As llama 2 is private repo, login by huggingface and generate a token.
Link: https://huggingface.co/settings/tokens

pip install huggingface_hub
Enter fullscreen mode Exit fullscreen mode
huggingface-cli login
Enter fullscreen mode Exit fullscreen mode

Step 2: Create a conda environment and activate conda environment

conda create -n py_3_11_lamma2_run python=3.11 -y
Enter fullscreen mode Exit fullscreen mode
conda activate py_3_11_lamma2_run
Enter fullscreen mode Exit fullscreen mode

Step 3: Install library

pip install transformers torch accelerate
Enter fullscreen mode Exit fullscreen mode

Step 4: Create a file "touch run.py"

import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

timeStart = time.time()

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf"
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)

print("Load model time: ", -timeStart + time.time())

while(True):
    input_str = input('Enter: ')
    input_token_length = input('Enter length: ')

    if(input_str == 'exit'):
        break

    timeStart = time.time()

    inputs = tokenizer.encode(
        input_str,
        return_tensors="pt"
    )

    outputs = model.generate(
        inputs,
        max_new_tokens=int(input_token_length),
    )

    output_str = tokenizer.decode(outputs[0])

    print(output_str)

    print("Time taken: ", -timeStart + time.time())
Enter fullscreen mode Exit fullscreen mode

Step 5: Run python file

python run.py
Enter fullscreen mode Exit fullscreen mode

Performance:

I am using a CPU with 20 GB of RAM (4 GB + 16 GB).
It took 51 seconds to load the model and 227 seconds to generate a response for 250 tokens.
If you use a GPU, it will take significantly less time.
On Google Colab, i got 16 second for a response.

Llama 2 load time on CPU

Congratulations! You have successfully run llama on local machine.

Top comments (5)

Collapse
 
umashankar_nedunchezhian_ profile image
Umashankar Nedunchezhian

Hi Nitin, Thanks for sharing this when I follow the above steps I am not getting any output after I give the Input String and token length. The code just hangs. I tried this in my MacBook Pro(M2) and also in AMD powered machine with 48 GB RAM with 6 Core processors and Nvidia GPU.

Can you please advice

Collapse
 
nithinibhandari1999 profile image
Nithin I Bhandari

Please try to give less input_token_length that is 1 (1 token).
And check does it is produce an output.
If the above step produce output, then try with 10 token.
And check does it is produce an output.

Also try to see Task manager, does there are any fluctuation in RAM and SSD usage.

Please try these and share does it worked or not.

Collapse
 
nithinibhandari1999 profile image
Nithin I Bhandari

Can you please share the screenshot

Collapse
 
nithinibhandari1999 profile image
Nithin I Bhandari

Try to also change
torch_dtype=torch.bfloat16,
to
torch_dtype=torch.float16,

Collapse
 
carriefischer profile image
Carrie Fischer

The article "How to Run Llama 2 on Your Local Computer" by Nithin I. offers a clear and concise guide, simplifying the process for beginners. The step-by-step instructions are incredibly helpful and easy to follow. For further information on tech-related topics like this, visit How to Run Llama 2 Locally