Shannon Lal
Running multiple LLMs on a single GPU

In recent weeks, I have been working on projects that utilize GPUs, and I have been exploring ways to optimize their usage. To gain insights into GPU utilization, I started by analyzing the memory consumption and usage patterns using the nvidia-smi tool. This provided me with a detailed breakdown of the GPU memory and usage for each application.
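For example, queries along the following lines give both an overall and a per-process view of memory and utilization (these are standard nvidia-smi flags, shown as a reference rather than the exact commands I ran):

nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv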
One of the areas I have been focusing on is deploying our own LLMs. I noticed that when working with smaller LLMs, such as those with 7B parameters, on an A100 GPU, they were only consuming about 8GB of memory and utilizing around 20% of the GPU during inference. This observation led me to investigate the possibility of running multiple LLM processes in parallel on a single GPU to optimize resource utilization.
To achieve this, I used Python's multiprocessing module with the spawn start method to launch several worker processes concurrently (spawn is needed here because a forked child process cannot safely reuse an already-initialized CUDA context). Each worker loads its own copy of the model and pulls prompts from a shared queue, so multiple LLM inference tasks run in parallel on a single GPU. The following code demonstrates the approach I used.

import multiprocessing
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_MODELS = 3

def load_model(model_name: str, device: str):
    # Load the model in 8-bit (via bitsandbytes) so a 7B model only
    # takes roughly 8 GB of GPU memory
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        return_dict=True,
        load_in_8bit=True,
        device_map={"": device},
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer

def inference(model, tokenizer, prompt: str):
    # Tokenize the prompt, generate on the model's device, and decode the output
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=1.0)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


def process_task(task_queue, result_queue):
    # Each worker process loads its own copy of the model onto the same GPU
    model, tokenizer = load_model("tiiuae/falcon-7b-instruct", device="cuda:0")
    while True:
        task = task_queue.get()
        if task is None:  # Sentinel value: shut the worker down
            break
        prompt = task
        start = time.time()
        summary = inference(model, tokenizer, prompt)
        print(f"Completed inference in {time.time() - start:.1f} seconds")
        result_queue.put(summary)

def main():
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    prompt = ""  # The prompt you want to execute

    processes = []
    for _ in range(MAX_MODELS):
        process = multiprocessing.Process(target=process_task, args=(task_queue, result_queue))
        process.start()
        processes.append(process)

    start = time.time()

    # Queue three prompts for each of the models
    for _ in range(MAX_MODELS * 3):
        task_queue.put(prompt)

    results = []
    for _ in range(MAX_MODELS * 3):
        result = result_queue.get()
        results.append(result)
    end = time.time()
    print(f"Processed {len(results)} prompts in {end - start:.1f} seconds")

    # Tell the workers to exit and wait for them to finish
    for _ in range(MAX_MODELS):
        task_queue.put(None)
    for process in processes:
        process.join()

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    main()


The following table summarizes the tests I ran.

| GPU | # of LLMs | GPU Memory | GPU Usage | Average Inference Time |
| --- | --- | --- | --- | --- |
| A100 with 40GB | 1 | 8 GB | 20% | 12.8 seconds |
| A100 with 40GB | 2 | 16 GB | 95% | 16.0 seconds |
| A100 with 40GB | 3 | 32 GB | 100% | 23.2 seconds |

Running multiple LLM instances on a single GPU can significantly reduce costs and increase availability by making better use of the available resources. The trade-off is per-request latency: average inference time rose from 12.8 seconds with one model to 23.2 seconds with three. Because three requests are served concurrently, however, overall throughput still improves. If you have other ways of optimizing GPU usage, or questions about how this works, feel free to reach out.
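As a rough sanity check on that throughput claim, here is the math implied by the table above (a small sketch using the measured numbers, where requests per second = concurrent models / average inference time):

# Throughput implied by the measurements above
configs = [(1, 12.8), (2, 16.0), (3, 23.2)]  # (number of LLMs, avg inference time in seconds)
for num_models, avg_latency in configs:
    print(f"{num_models} model(s): {num_models / avg_latency:.3f} requests/second")

# Output:
# 1 model(s): 0.078 requests/second
# 2 model(s): 0.125 requests/second
# 3 model(s): 0.129 requests/second

So three concurrent models serve roughly 65% more requests per second than a single model, even though each individual request takes longer.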

Thanks
