Locally running LLaMA2 'finish_reason': 'length'

A very small problem-solving share about running the LLaMA model locally!

Previously I had used the llama-cpp-python library directly to build the chat model for my local LLaMA deployment, focusing mostly on issues such as RAG pipeline building, data preprocessing, and M1 Mac GPU acceleration. I never had the patience to dig into the quality and configuration issues of the basic text-generation model itself.

Today I was reading the llama-cpp-python GitHub documentation to try out the simple chat-completion helper llm.create_chat_completion(), and I noticed that on my M1 Mac the model's inference would suddenly stop, as if the answer had been cut off mid-sentence.
I had run into this problem often before while building LLM backend and chat frontend projects.
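
Roughly, the setup looked like the sketch below. The model path and prompt are just placeholders for whatever GGUF file and question you use locally, and how aggressively the reply gets cut off depends on your llama-cpp-python version's defaults:

```python
from llama_cpp import Llama

# Placeholder path -- point this at whatever GGUF model you run locally.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf")

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a context window is."},
    ]
)

# With a small default token budget the reply can come back truncated,
# and finish_reason is 'length' instead of 'stop'.
print(response["choices"][0]["message"]["content"])
print(response["choices"][0]["finish_reason"])
```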

The workaround I had found before was to keep tweaking parameters such as `n_ctx`, `n_batch`, `n_threads`, `n_parts`, and `n_gpu_layers`.

Even adjusting max_tokens to a different value such as 512 could sometimes fix the problem (see the sketch below).
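
For reference, here is a rough sketch of those knobs. The values are only illustrative, the file path is a placeholder, and I leave out `n_parts` since it comes from older releases:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,       # context window size in tokens
    n_batch=512,      # batch size for prompt processing
    n_threads=8,      # CPU threads used for inference
    n_gpu_layers=1,   # layers offloaded to the M1 GPU via Metal
)

# Raising max_tokens on the call itself sometimes hid the truncation.
output = llm("Q: Name the planets in the solar system. A:", max_tokens=512)
print(output["choices"][0]["text"])
print(output["choices"][0]["finish_reason"])
```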

According to the author abetlen and GitHub user zhangchn:

The `__call__` of Llama has an unfortunate default value of 16, which was changed to be consistent with the OpenAI API, but I didn't realize until later that a lot of people call the completions function without setting max_tokens.

(Screenshot: the generated text cutting off abruptly mid-answer.)

So, to avoid the situation shown in the screenshot above, where the generated text suddenly stops, we can override that default by setting max_tokens to -1 when we call the llm for inference.
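
Concretely, something like this minimal sketch, reusing the llm object from above:

```python
# max_tokens=-1 (or None) removes the per-call cap, so generation only
# stops at a stop token or when the context window is exhausted.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain what a context window is."},
    ],
    max_tokens=-1,
)

print(response["choices"][0]["message"]["content"])
print(response["choices"][0]["finish_reason"])  # now 'stop' instead of 'length'
```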

