Ray

Locally running LLaMA2 'finish_reason': 'length'

A quick problem-solving share about running the LLaMA model locally!

Previously I used the llama-cpp-python library directly to build the chat model for my local LLaMA deployment, focusing mostly on issues such as RAG pipeline construction, data preprocessing, and M1 Mac GPU acceleration. I never had the patience to dig into the quality and setup issues of the underlying text-generation model itself.

Today I was reading the GitHub documentation for llama-cpp-python to try out the high-level chat completion function llm.create_chat_completion(), and I noticed that on my M1 Mac the model's inference would stop abruptly, as if mid-sentence.
I had run into this problem often before while building LLM backend and frontend chat interface projects.
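
For context, here is a minimal sketch of the kind of call where the truncation showed up for me (the model path is just a placeholder; substitute whatever GGUF file you have downloaded):

```python
from llama_cpp import Llama

# Hypothetical local model path; use your own GGUF file.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf")

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a context window is."}]
)

print(response["choices"][0]["message"]["content"])  # on my setup, the text cut off mid-sentence
print(response["choices"][0]["finish_reason"])       # 'length' instead of 'stop'
```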

The solution I found before was to keep changing the model parameters:

- `n_ctx`
- `n_batch`
- `n_threads`
- `n_parts`
- `n_gpu_layers`
Even adjusting `max_tokens` to a different value such as 512 could sometimes fix the problem.
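
As a rough sketch of what that trial-and-error looked like (the values here are illustrative, not recommendations; I leave out `n_parts` because I believe it has been removed from recent llama-cpp-python releases):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,       # context window size
    n_batch=512,      # prompt-processing batch size
    n_threads=8,      # CPU threads to use
    n_gpu_layers=1,   # layers offloaded to the M1 GPU via Metal
)

# Bumping max_tokens to 512 sometimes "fixed" the early cutoff.
output = llm("Q: Name the planets in the solar system. A: ", max_tokens=512)
```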

According to the library author abetlen and GitHub user zhangchn:

> The __call__ of Llama has an unfortunate default max_tokens value of 16, which was set to be consistent with the OpenAI API, but I didn't realize until later that a lot of people call the completions function without setting max_tokens.

[Screenshot: the generated text stops abruptly mid-sentence, with finish_reason reported as 'length']

So, to avoid the situation shown in the image above, where the generated text suddenly stops, we can override this default by setting max_tokens to -1 when we call the model for inference.
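
A minimal sketch of the fix, reusing the hypothetical llm object from above:

```python
# With max_tokens=-1 (or None), llama-cpp-python lifts the 16-token cap and
# generates until an end-of-sequence token appears or the context window fills.
output = llm(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=-1,
)

print(output["choices"][0]["text"])
print(output["choices"][0]["finish_reason"])  # 'stop' once the model finishes naturally
```

The same max_tokens=-1 override can also be passed to llm.create_chat_completion().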

[Screenshot: the full, untruncated generation after setting max_tokens to -1]
