Locally running LLaMA2 'finish_reason': 'length'

A very small problem-solving share about running the LLaMA model locally!

Previously I had used the llama-cpp-python library directly to build the chat model for my local LLaMA deployment, focusing mostly on issues such as RAG pipeline building, data preprocessing, and M1 Mac GPU acceleration. I never had the patience to dig into the quality and configuration issues of the basic text-generation model itself.

Today I was reading the llama-cpp-python GitHub documentation to try out the simple chat-completion helper llm.create_chat_completion(), and I noticed that on my M1 Mac the model's inference would suddenly stop, as if the answer had been cut off mid-sentence.
I had run into this problem often before while building LLM backend and chat frontend projects.
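
Roughly, the setup looked like the sketch below. The model path and prompt are just placeholders for whatever GGUF file and question you use locally, and how aggressively the reply gets cut off depends on your llama-cpp-python version's defaults:

```python
from llama_cpp import Llama

# Placeholder path -- point this at whatever GGUF model you run locally.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf")

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a context window is."},
    ]
)

# With a small default token budget the reply can come back truncated,
# and finish_reason is 'length' instead of 'stop'.
print(response["choices"][0]["message"]["content"])
print(response["choices"][0]["finish_reason"])
```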

The workaround I had found before was to keep tweaking parameters such as `n_ctx`, `n_batch`, `n_threads`, `n_parts`, and `n_gpu_layers`.

Even adjusting max_tokens to a different value such as 512 could sometimes fix the problem (see the sketch below).
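
For reference, here is a rough sketch of those knobs. The values are only illustrative, the file path is a placeholder, and I leave out `n_parts` since it comes from older releases:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,       # context window size in tokens
    n_batch=512,      # batch size for prompt processing
    n_threads=8,      # CPU threads used for inference
    n_gpu_layers=1,   # layers offloaded to the M1 GPU via Metal
)

# Raising max_tokens on the call itself sometimes hid the truncation.
output = llm("Q: Name the planets in the solar system. A:", max_tokens=512)
print(output["choices"][0]["text"])
print(output["choices"][0]["finish_reason"])
```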

According to the author abetlen and GitHub user zhangchn:

The `__call__` of Llama has an unfortunate default value of 16, which was changed to be consistent with the OpenAI API, but I didn't realize until later that a lot of people call the completions function without setting max_tokens.

(Screenshot: the generated text cutting off abruptly mid-answer.)

So, to avoid the situation shown in the screenshot above, where the generated text suddenly stops, we can override that default by setting max_tokens to -1 when we call the llm for inference.
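
Concretely, something like this minimal sketch, reusing the llm object from above:

```python
# max_tokens=-1 (or None) removes the per-call cap, so generation only
# stops at a stop token or when the context window is exhausted.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain what a context window is."},
    ],
    max_tokens=-1,
)

print(response["choices"][0]["message"]["content"])
print(response["choices"][0]["finish_reason"])  # now 'stop' instead of 'length'
```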

