#SemanticKernel – 📎Chat Service demo running Llama2 LLM locally in Ubuntu

#englishpost #codesample #llama2

Hi!

Today’s post is a demo on how to interact with a local LLM using Semantic Kernel. In my previous post, I wrote about how to use LM Studio to host a local server. Today we will use ollama in Ubuntu to host the LLM.

Ollama

Ollama is an open-source language model platform designed for local interaction with large language models (LLMs). It provides developers with a convenient way to run LLMs on their own machines, allowing experimentation, fine-tuning, and customization. With Ollama, you can create and execute scripts directly, without relying on external tools. Notable features include Python and JavaScript libraries, integration of vision models, session management, and improved CPU support. Whether you’re a researcher, developer, or enthusiast, Ollama empowers you to explore and harness the capabilities of language models locally.

Run a local inference LLM server using Ollama

In their latest post, the Ollama team describes how to download and run locally a Llama2 model in a docker container, now also supporting the OpenAI API schema for chat calls (see OpenAI Compatibility).

They also describe the necessary steps to run this in a linux distribution. So, I got back to life on my Ubuntu using Windows Subsystem for Linux.

And if you want to know more, here are my Ubuntu specs:

Now time to install ollama, run the server, and start a live journal track in a separate window using the following commands:


# install ollama
curl -fsSL https://ollama.com/install.sh | sh

# run ollama
ollama run llama2

/# show journal / logs in live model
journalctl -u ollama -f

The ollama server is up and running, hosting a llama2 model in the endpoint: http://localhost:11434/v1/chat/completions

Llama 2

In my previous post, I used Phi-2 as the LLM to test with Semantic Kernel. Ollama allows us to use a different set of models, this time I decided to test Llama 2.

Llama 2 is a family of transformer-based autoregressive causal language models. These models take a sequence of words as input and recursively predict—the next word(s).

Here are some key points about Llama 2 :

Open Source : Llama 2 is Meta’s open-source large language model (LLM). Unlike some other language models, it is freely available for both research and commercial purposes.
Parameters and Features : Llama 2 comes in many sizes, with 7 billion to 70 billion parameters. It is designed to empower developers and researchers by providing access to state-of-the-art language models.
Applications : Llama 2 can be used for a wide range of applications, including text generation , inference , and fine-tuning. Its versatility makes it valuable for natural language understanding and creative tasks.
Global Support : Llama 2 has garnered support from companies, cloud providers, and researchers worldwide. These supporters appreciate its open approach and the potential it holds for advancing AI innovation.

Source: Conversation with Microsoft Copilot:

Llama. https://llama.meta.com/
Llama 2 is here – get it on Hugging Face. https://huggingface.co/blog/llama2
Download Llama. https://ai.meta.com/resources/models-and-libraries/llama-downloads/

📎 Semantic Kernel and Custom LLMs

If you want to learn more about Semantic Kernel, check the official repository here: https://aka.ms/ebsk

The whole sample can be found in: https://aka.ms/repo-skcustomllm01

In this new iteration, I added a few changes:

Create a shared class library “sk-customllm”. This class implements the Chat Completion Service from Semantic Kernel.
Added a few more fields to the models to work with the OpenAI API specification.

The new solution looks like this one: