
Teruo Kunihiro

Cha Cha Chat with AI in Local

Hello everyone. I've recently joined a generative AI team at my current company. Although I don't have much experience with generative AI yet, I've been experimenting with running a Large Language Model (LLM) locally to prepare for any future request to develop an AI chat app like ChatGPT. Since I'm a Japanese speaker, this article focuses on LLMs that can handle Japanese.

Let's get started.

About PC Specifications

All experiments in this article were run on the following environment.

  • Model: MacBook Pro 14-inch, 2023
  • Chip: Apple M2 Max
  • Memory: 64GB
  • OS: macOS 14.1

About Large Language Models

There are various types of Large Language Models (LLMs), like the well-known GPT, BERT, LLaMA, etc. I won't dive into their differences or specifics given my current knowledge, but for this article I chose LLaMA, which is popular among third parties for its accuracy and commercial viability.

Just Want to Get It Running

I knew that publicly available LLMs could be found on a site called Hugging Face, but I had no idea how to run them locally. My aim was to create something like ChatGPT to gather ideas for a future app implementation.

After some research, I came across open-source software (OSS) projects called FastChat and Text generation web UI. With these repositories, I was able to run llama2 locally and chat with it.

For those who just want to try llama2, Hugging Face has a demo page, which is probably the quickest way to experience it: Hugging Face Demo for Llama2

About Japanese Models

While llama2 performs well in English, it seems far from the level of ChatGPT in Japanese. The responses in Japanese often include English words or are expressed in romanized Japanese. So, I looked for Japanese models.

About Youri7B

This is a model pre-trained on Japanese text by rinna Co., Ltd., based on llama2. I tried running it using the 'Text generation web UI' mentioned earlier. Rinna Youri-7B

However, it didn't work as expected. The model seemed to load correctly in the UI, but all responses came back in English, and I couldn't figure out why.

Running Python Files

I tried running the Python script described on the Hugging Face Youri-7B page. It looked simpler than using third-party UIs, and I figured I could wrap it in an API once it worked, but between my limited Python knowledge and the script consuming about 30GB of memory, my PC crashed.
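
For reference, the script on the Hugging Face page follows the standard transformers pattern, roughly like the minimal sketch below. This is reconstructed from memory rather than copied from the model card, so the model id, prompt format, and generation settings here are assumptions; check the Youri-7B page itself before relying on any of them.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model id assumed to be the chat variant published by rinna on Hugging Face.
model_id = "rinna/youri-7b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# torch_dtype=torch.float16 should roughly halve the memory footprint
# compared to the default full precision, which is what ate ~30GB on my machine.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# The prompt format is an assumption; the actual template is on the model card.
prompt = "ユーザー: 日本の首都はどこですか？\nシステム: "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))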

Discovering Ollama

There were several reasons why I couldn't get some LLMs running in my local environment.

  • I lack Python knowledge
  • The many dependencies caused difficulties and frustration
  • I wanted to ignore runtime environments
  • I wanted to avoid troubleshooting

Summing up these points, what I was looking for was an OSS chat UI that doesn't require specific knowledge of Python or its dependencies, and that has clear documentation on how to use models from Hugging Face.

Meanwhile, while drifting around the internet, I stumbled upon Ollama. Its documentation seemed minimal but sufficient for my needs. Ollama operates a bit like Docker, with model configuration files and instructions for using models downloaded from Hugging Face. That's exactly what I wanted!

Trying Ollama

Run an LLM for Japanese

I wanted to run the Japanese model Youri, so I set up a Modelfile as suggested in the documentation, like this:

FROM ./models/rinna-youri-7b-chat-q6_K.gguf

TEMPLATE """[INST] {{ .Prompt }} [/INST] """
PARAMETER num_ctx 4096
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"

Additionally, I used a GGUF model converted by a volunteer, available on this Hugging Face page.
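
Once the Modelfile and the GGUF file are in place, the model gets registered with ollama create (for example, ollama create youri -f Modelfile) and can then be chatted with via ollama run youri. For completeness, here is a minimal sketch of doing the same from Python with the official ollama client package; I didn't actually use it in this article, and the model name "youri" is just the hypothetical name given to ollama create.

import ollama  # pip install ollama

# "youri" is the hypothetical name registered with `ollama create youri -f Modelfile`.
response = ollama.chat(
    model="youri",
    messages=[{"role": "user", "content": "日本の首都はどこですか？"}],
)
print(response["message"]["content"])

Under the hood, the client simply talks to the local HTTP API exposed by the server described in the next section.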

Running as a server

Ollama runs a local server while the app is running, and setting it up is totally easy. Take a look at the README.md to see how to launch the server. I tried one of the community-provided UIs called ollama-ui and asked it a question about Japanese history, but the quality of the answers in Japanese was lower than in English.

Asking about the history of Japan in Japanese: the AI responds with only a short answer.

Asking about the history of Japan in English: the AI responds with a brief but sufficient overview of Japan's history.
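
To see what a UI like ollama-ui is doing under the hood, I found it helpful to hit the HTTP API directly. Below is a minimal sketch using Python and requests against the default local endpoint; the model name "youri" is again the hypothetical name from ollama create, and the other values are just assumptions for illustration.

import requests  # pip install requests

# Ollama listens on port 11434 by default while the app is running.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "youri",                    # hypothetical name from `ollama create`
        "prompt": "日本の歴史を教えてください。",
        "stream": False,                     # return one JSON object instead of a stream
    },
    timeout=300,
)
print(resp.json()["response"])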

Insights Gained While Running Ollama

While exploring the Ollama repository, I noticed it is written in Go, which piqued my interest in how it runs LLaMA. It turns out that Ollama uses llama.cpp for execution, which appears to be a project designed to run LLaMA smoothly on a Mac. llama.cpp itself doesn't seem to depend on Python; it's written in C++ and wraps up the complex parts, making all of this accessible even to someone with as little understanding as me.

Exploring Frontend LLM

I had heard rumors about running LLaMA as WebAssembly (WASM) on the frontend, so I looked into some ambitious projects like llama2.c-web and WebLLM, which run LLMs on WASM. Running LLMs on the frontend is fascinating because it allows immediate responses without a network dependency, which seems particularly useful for quick-response cases like voice input or text summarization. A configuration where lightweight, rapid-response tasks are handled at the edge while relatively heavier tasks are managed by server-based LLMs appears to have high potential for scalability. I tried both projects, and they worked impressively.

Chat with llama2 in a web browser: the image depicts a chat interface where the user asks about the capital of Japan.

Check those demos out! They are fantastic.

https://webllm.mlc.ai/#chat-demo
https://diegomarcos.com/llama2.c-web

Try WebLLM

WebLLM is one of the MLC-LLM projects that compiles LLMs for web execution. By compiling the models, it enables them to run on various device runtimes prepared by MLC-LLM. This means you can create LLMs that run in the browser's WASM runtime without depending on Python modules. For users, it's quite amazing that simply loading the model in the browser can start a chat like magic.

Reference: MLC-LLM Project Overview

To run youri7b-chat, as described above, the model needs to be compiled first. For this, I referred to the following documentation and proceeded with the compilation:
Compile Models - MLC-LLM

While going through the documentation, I realized that emscripten also needs to be installed, so I prepared that as well:
Emscripten Installation Instructions

Once everything was ready and the compilation was done, I found something called simple-chat in the examples directory of WebLLM, which I decided to run locally:
Simple-Chat Example - WebLLM

The compilation and web server setup went smoothly, but in the end it didn't work, and I have no idea how to fix it.

The image depicts a chat interface where the user asks about the capital of Japan, but an error associated with WebGPU occurs.

Wrap-up

This journey was solely about exploring and running OSS in my local environment; I didn't write a single line of code myself. It highlighted the power of the OSS community and deepened my respect for everyone developing OSS. I hope to contribute to the LLM ecosystem in some way in the future.

In conclusion, while there were many challenges, it was a learning experience. M2 Macs can handle these models surprisingly well, encouraging me to keep experimenting. Goodbye for now.
