Mateusz Charytoniuk

Originally published at resonance.distantmagic.com

How to Serve LLM Completions in Production

Preparations

To start, you need to compile llama.cpp. You can follow their README for instructions.

The server is compiled alongside other targets by default.
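
To build it from source, it is usually enough to clone the repository and run make (a minimal sketch; the exact build steps and target names depend on the llama.cpp version you check out, and newer versions use CMake):

$ git clone https://github.com/ggerganov/llama.cpp.git
$ cd llama.cpp
$ make

After the build finishes, the server binary should be available as ./server in the repository root.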

Once you have the server binary built, we can continue. We will use the PHP Resonance framework.

Troubleshooting

Obtaining Open-Source LLM

I recommend starting with either Llama 2 or Mistral. You need to download the pretrained weights and convert them into the GGUF format before they can be used with llama.cpp.
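
As an example, converting a downloaded Llama 2 checkpoint could look roughly like this (a sketch only; the conversion script name and its arguments vary between llama.cpp versions, so check the repository you built):

$ python3 convert.py ~/llama-2-7b-chat/

This should write an f16 GGUF file (such as ggml-model-f16.gguf) into the model directory, which is the input for the quantization step shown below.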

Starting Server Without a GPU

llama.cpp supports CPU-only setups, so you don't have to do any additional configuration. Generation will be slow, but you will still get tokens.
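
For reference, a CPU-only start is simply the server command without any GPU offloading (a sketch that reuses the model path from the full command shown later in this tutorial):

$ ./server \
    --model ~/llama-2-7b-chat/ggml-model-q4_0.gguf \
    --ctx-size 2048 \
    --port 8081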

Running With Low VRAM

You can try quantization if you don't have enough VRAM on your GPU to run a specific model. Quantization lowers the memory the model needs at the cost of some response quality. llama.cpp has a utility to quantize models:

$ ./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0

10GB of VRAM is enough to run most quantized models.
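
If a quantized model still does not fit into your VRAM, you can offload only some of its layers to the GPU and keep the rest in RAM (a sketch; --n-gpu-layers is the same flag used in the full command below, and nvidia-smi is just one way to check free memory on NVIDIA cards):

$ nvidia-smi
$ ./server \
    --model ~/llama-2-7b-chat/ggml-model-q4_0.gguf \
    --n-gpu-layers 20 \
    --ctx-size 2048 \
    --port 8081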

Starting llama.cpp Server

While writing this tutorial, I started the server with the following command:

$ ./server \
    --model ~/llama-2-7b-chat/ggml-model-q4_0.gguf \
    --n-gpu-layers 200000 \
    --ctx-size 2048 \
    --parallel 8 \
    --cont-batching \
    --mlock \
    --port 8081

The --cont-batching parameter is essential because it enables continuous batching, an optimization technique that allows the server to process requests in parallel.

Without it, even with multiple parallel slots, the server could answer only one request at a time. With continuous batching enabled, the server can respond to multiple completion requests in parallel.
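
You can observe the difference by sending a few completion requests at once (a quick sketch using curl against llama.cpp's /completion HTTP endpoint; adjust the prompt and n_predict values to your needs):

$ for i in 1 2 3; do
    curl --silent http://127.0.0.1:8081/completion \
        --header "Content-Type: application/json" \
        --data '{"prompt": "Hello,", "n_predict": 32}' &
  done; wait

With --parallel 8 and --cont-batching enabled, these requests are processed concurrently instead of queueing behind each other.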

Configuring Resonance

All you need to do is add a configuration section that specifies the llama.cpp server location:

[llamacpp]
host = 127.0.0.1
port = 8081

Testing

Resonance has built-in commands that connect to llama.cpp and issue requests.

You can send a sample prompt through llamacpp:completion:

$ php ./bin/resonance.php llamacpp:completion "How to write a 'Hello, world' in PHP?"
To write a "Hello, world" in PHP, you can use the following code:

<?php
  echo "Hello, world!";
?>

This will produce a simple "Hello, world!" message when executed.

Programmatic Use

In your class, use dependency injection to obtain an instance of LlamaCppClient:

<?php

namespace App;

use Distantmagic\Resonance\Attribute\Singleton;
use Distantmagic\Resonance\LlamaCppClient;
use Distantmagic\Resonance\LlamaCppCompletionRequest;

#[Singleton]
class LlamaCppGenerate 
{
    public function __construct(protected LlamaCppClient $llamaCppClient) 
    {
    }

    public function doSomething(): void
    {
        $request = new LlamaCppCompletionRequest('How to make a cat happy?');

        $completion = $this->llamaCppClient->generateCompletion($request);

        // each token is a chunk of text, usually a few characters, returned
        // from the model you are using
        foreach ($completion as $token) {
            swoole_error_log(SWOOLE_LOG_DEBUG, (string) $token);

            if ($token->isLast) {
                // ...do something else
            }
        }
    }
}

Summary

In this tutorial, we went through how to start the llama.cpp server and connect to it from Resonance.


If you like Resonance, check us out on GitHub and give us a star! :)

https://github.com/distantmagic/resonance

Top comments (4)

Kevin Naidoo

Good article. Even with "cont-batching" - the server can crash because it will accept the connection, but your server resources may already be maxed out with existing requests.

I wouldn't use llama.cpp in prod in this "raw" form for anything critical.

It's not smart enough to balance resources, so I had to build an API in front that efficiently sends requests to llama.cpp depending on server resource availability.

Instead of one big box, I also set up a cluster of small nodes and load balanced between them, spinning up new nodes on demand, etc.

Mateusz Charytoniuk

Thanks for sharing your concerns and tips. I am building a framework around it to address those resource issues, so hopefully I will be able to mitigate most of them.

My understanding is you can also tweak it by adjusting slot sizes to not ever exceed the maximum context, so in general I’m more optimistic. :D I will remember your advice, however.

Kevin Naidoo

Cool thanks! That sounds great, and glad it's working for you. We need more experimentation in this area beyond just using OpenAI.

Małgorzata Zagajewska

Looks very promising, congrats on the integration!