Sebastian

Posted on Mar 11

LLMs on your local Computer (Part 1)

#llm

The Cambrian explosion of Large Language Models (LLMs) is happening right now. Ever more astonishing models are published and used for text generation tasks ranging from question-answering to fact checking and knowledge inference. Models with sizes ranging from 100 million to 7 billion parameters and more are available with open source licenses. Using these models started with proprietary APIs and evolved to binaries that run on your computer. But which tools exactly can you use? What features do they have? And which models do they support?

This blog post is here to help you navigate the tool landscape for running LLM inference. You will learn about five different tools, each presented as a short description accompanied by copy-and-paste ready code snippets for installation and execution. Furthermore, you will see which tools can be used as a command-line interface, as an API, or even as a GUI. With this blog post, you will quickly find your preferred tool to run LLMs on your computer.

The technical context of this article is a computer with a recent Linux distribution or macOS, Python v3.11, and the most recent version of the tools. All instructions should work with newer versions of the tools as well.

This article originally appeared at my blog admantium.com.

Local LLM Inference in a Nutshell

To run an LLM locally, you need a combination of a tool, a model representation, and a computer. Let’s review these requirements.

Tool

Several developers are actively working on open-source tools published on GitHub and available as binary files or Python packages. All tools run on Linux and macOS, and several also support Windows natively. The following tools are investigated in this article: ggml, llama.cpp, ollama.ai, lit-gpt, and FastChat.

Model

An LLM is a neural network typically trained with the Python frameworks TensorFlow or PyTorch. The models are published as binary files that contain the network's architecture and its weights. File sizes range from 500MB to 4GB and larger, depending on the model's parameter count. All models will be given the following prompt, and then their answer and execution stats are measured:



PROMPT="You are a helpful AI assistant.
Please answer the following question.
Question: Which active space exploration missions are conducted by NASA?
Answer:"

Computer

All tools presented in this article can run on CPU and RAM only. A GPU is not required, but if you have one, some tools can either offload some model layers to the GPU or even run the complete model on the GPU. The reference computer used in this article is from 2017 and comes with the following hardware specs:

Ryzen 5 1600X (3.6 GHz, 6 cores, 12 threads)
GeForce GTX 970 (4GB, 1114 MHz)
16GB DDR4 RAM (1333 MHz)
256GB SSD

This hardware is suitable to run 7B models, but only at around 0.5 tokens/s. To compare: running 70B models on 2023 hardware with decent GPUs gives you about 7 tokens/s, and of course, with hosted LLMs like GPT-4, you get tokens near-instantaneously.

ggml

CLI	API	GUI	GPU Support	NVIDIA	AMD	Included Models
✅	❌	❌	❌	❌	❌	10 + 16 (from the community)

The ggml library is one of the first libraries for local LLM inference. It’s a pure C library that converts models to run on several devices, including desktops, laptops, and even mobile devices. It can therefore also be considered a tinkering tool for trying new optimizations that are then incorporated into downstream projects. This tool is at the heart of several other projects, powering LLM inference on desktops and even mobile phones. Subprojects for running specific LLMs or LLM families exist, such as whisper.cpp.

Installation & Model Loading



git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build
cd build
cmake ..

make -j4 gpt-j
../examples/gpt-j/download-ggml-model.sh 6B

Model Inference



time ./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "${PROMPT}"

Stats



gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 11542.79 MB / num tensors = 285

main: mem per token = 15552688 bytes
main:     load time = 34065.07 ms
main:   sample time =   508.65 ms
main:  predict time = 97095.84 ms / 420.33 ms per token
main:    total time = 133148.73 ms

real 2m14.283s
user 6m21.891s
sys 0m30.806s

Text Output



Answer: There are six active space exploration missions currently being conducted by NASA, including the following:

The Dawn Spacecraft

The Dawn Spacecraft was launched on September 27, 2007, on a mission to explore the large Vesta Asteroid and is currently in its second year of exploration. Dawn was named in honor of the Roman goddess of the dawn, and her name means "to know." The Dawn spacecraft is being orbited by the Mars-bound Dawn Orbiter, and is being followed by the Dawn Spacecraft which is performing a flyby of Vesta every year. The Dawn Orbiter is scheduled to be deorbited in June of next year.

The Dawn Spacecraft is currently in orbit around Vesta, a large protoplanet of approximately 600 miles in diameter. Vesta is the second largest protoplanet discovered, and is the third largest object ever discovered in the solar system. The Dawn Spacecraft is currently in orbit around Vesta, but Dawn is scheduled to [END OF TEXT]

Other Models

Several pre-defined scripts exist that show how to run any other HuggingFace model with ggml. Here is an example for MPT 7B Chat:



sudo apt-get install git-lfs
python3 -m pip install torch transformers

git clone --depth=1 https://huggingface.co/mosaicml/mpt-7b-chat
python3 ../examples/mpt/convert-h5-to-ggml.py ./mpt-7b-chat 1

./bin/mpt -m ./mpt-7b-chat/ggml-model-f16.bin -p "${PROMPT}"

Running this model generated the following stats.



mpt_model_load: ggml ctx size = 12939.11 MB
mpt_model_load: memory_size =   256.00 MB, n_mem = 16384
mpt_model_load: ........................ done
mpt_model_load: model size = 12683.02 MB / num tensors = 194

main: sampled tokens =      200
main:  mem per token =   395960 bytes
main:      load time = 36896.35 ms
main:    sample time = 15379.12 ms / 76.90 ms per token
main:      eval time = 114642.98 ms / 496.29 ms per token
main:     total time = 168488.67 ms

real 2m49.671s
user 7m46.547s
sys 0m31.411s

llama.cpp

CLI	API	GUI	GPU Support	NVIDIA	AMD	Models
✅	❌	❌	✅	cuBLAS	CLBlast	31

This library started as a hacking project for the LLaMA model only, but evolved into a flagship open-source project that runs over 30 models and is incorporated into many other projects. At the time of writing this article, 31 models are supported, including LLaMA, Alpaca, Vicuna, Falcon and WizardLM. It also supports all major computer platforms.

Installation & Model Loading



git clone --depth=1 https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake ..
cmake --build . --config Release

# -O (uppercase) sets the output file; lowercase -o would only write a log file
mkdir -p models
wget -c --show-progress -O models/llama-2-13b.Q4_0.gguf "https://huggingface.co/TheBloke/Llama-2-13B-GGUF/resolve/main/llama-2-13b.Q4_0.gguf?download=true"

Model Inference



time ./bin/main -m models/llama-2-13b.Q4_0.gguf -p "${PROMPT}"

Stats



llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 6.86 GiB (4.53 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_tensors: ggml ctx size       =    0.14 MiB
llm_load_tensors: system memory used  = 7024.03 MiB

llama_print_timings:        load time =    1303.44 ms
llama_print_timings:      sample time =     422.36 ms /  1206 runs   (    0.35 ms per token,  2855.38 tokens per second)
llama_print_timings: prompt eval time =    7158.69 ms /    36 tokens (  198.85 ms per token,     5.03 tokens per second)
llama_print_timings:        eval time =  332840.85 ms /  1205 runs   (  276.22 ms per token,     3.62 tokens per second)
llama_print_timings:       total time =  340843.92 ms

real 1m43.805s
user 9m42.562s
sys 0m26.427s

Text Output



NASA is conducting the Artemis mission to return humans to the Moon, and to establish sustainable lunar exploration.  It will send a crewed vehicle to the Gateway in lunar orbit.  The Gateway will serve as a staging point for missions to the surface of the moon.

The InSight lander mission is currently on Mars studying its interior, and the Perseverance rover will arrive there soon to further study the planet's surface and collect samples for future return to Earth.  The Psyche orbiter mission will be exploring a metallic asteroid in our solar system.

The Dragonfly quadcopter mission will land on Titan, one of Saturn’s moons, where it will perform a series of flights to different locations around the surface and study the moon's surface composition and geology.  The Mars 2020 rover, which has not yet launched, is slated to launch in July and will conduct in-depth studies at an ancient river delta in Jezero crater, and collect samples for future return to Earth.

The Hubble Space Telescope (HST) was the first of four Great Observatories, launched by NASA in 1990. HST is still operating after more than three decades in space, and has provided numerous groundbreaking insights about astronomy. The Compton Gamma Ray Observatory was also launched in 1990 and studied cosmic gamma rays, including the spectroscopy of several supernova remnants.

The Chandra X-ray Observatory was launched in 1999 and studies the universe with X-Ray photography, while the Spitzer Space Telescope was launched in 2003 and specialized in infrared imaging. The James Webb Space Telescope will soon replace Hubble as NASA's flagship space observatory when it is deployed later this year. It will operate from Lagrangian point 2 (L2) of the Sun-Earth system, and provide greatly improved capabilities over its predecessors.

The Voyager program consisted of two probes launched in 1977, both of which have entered interstellar space and are still transmitting data back to Earth. The twin Voyagers were designed for a five year mission to study Jupiter and Saturn and their moons, but continued on to Uranus and Neptune. The New Horizons probe was launched in 2006, and was used to provide the first close-up views of Pluto, along with detailed observations of several other Kuiper Belt objects.

The Cassini–Huygens spacecraft was a collaboration between NASA, ESA, and ASI that consisted of two parts. The Cassini orbiter studied Saturn and its moons from orbit, while the Huygens probe detached from it to study Titan's surface in 2005.

The Juno spacecraft was launched by NASA in 2011 to study Jupiter and its atmosphere. The Rosetta space probe was a collaborative effort between ESA and NASA to orbit the comet 67P/Churyumov–Gerasimenko from 2014-2016, providing numerous insights into the composition of its surface.

The Parker Solar Probe is scheduled to launch in 2018 by NASA to study solar wind and other phenomena associated with the Sun's corona, in an effort to better understand solar storms and their effects on Earth.

### Space probes

Main article: List of space probes

Space probes have been deployed into interplanetary space (that is outside of any celestial body's gravity field), or beyond the heliosphere and even into interstellar space, to study many different parts of the Solar System. Most space probes have not entered planetary atmospheres due to their nature as either flybys or probes in deep space, but some such as the Messenger probe did briefly enter Mercury's atmosphere.

### Landers and rovers on other celestial bodies

Main article: List of artificial objects on extraterrestrial bodies

Lander and rover missions are currently active in our solar system, with multiple lander or rover missions occurring over the past decades. In some cases a probe will enter orbit around the target celestial body, then send down either a lander or a rover to study it from closer range than possible by orbital means alone.

### Astromobiles and satellites in deep space

Main article: List of artificial objects on extraterrestrial bodies § Satellites and astromobiles

Astromobile is another term used for a satellite or probe that is traveling through the Solar System, but does not orbit any celestial body. This includes the Voyager program's spacecraft, as well as some of NASA's New Horizons and Dawn missions.

### Probes in transit

Main article: List of artificial objects on extraterrestrial bodies § Probes in transit

These are probes that were launched from the Earth but are not yet in orbit around any celestial body or are en route to do so, and are thus classified as "in transit". They include missions such as the Mars 2020 Rover. [END OF TEXT]

GPU Support

The llama.cpp library can be run with GPU support too. Essentially, you need to have the graphics card driver installed, and then pass a corresponding compile flag. For my reference computer with an NVIDIA graphics card:



sudo apt-get install nvidia-cuda-toolkit

mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CCACHE=OFF
cmake --build . --config Release

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
...
-- Found CUDAToolkit: /usr/include (found version "11.5.119")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 11.5.119
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
-- CUDA host compiler is GNU 11.4.0
// ...

Using the same model llama-2-13b.Q4_0 with GPU support generated the following stats.



llama_print_timings:        load time =    1497.28 ms
llama_print_timings:      sample time =      24.44 ms /    74 runs   (    0.33 ms per token,  3028.07 tokens per second)
llama_print_timings: prompt eval time =    2485.92 ms /    36 tokens (   69.05 ms per token,    14.48 tokens per second)
llama_print_timings:        eval time =   19191.20 ms /    73 runs   (  262.89 ms per token,     3.80 tokens per second)
llama_print_timings:       total time =   22114.99 ms /   109 tokens

real 0m24.709s
user 1m58.611s
sys 0m15.936s

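As you can see, the token generation rate improved only marginally when using a GPU, although prompt evaluation became almost three times faster (14.48 vs. 5.03 tokens per second).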



(276.22 ms per token, 3.62 tokens per second) //CPU only
(262.89 ms per token, 3.80 tokens per second) //CPU and GPU
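To make the comparison concrete, the following small Python sketch converts the per-token latencies reported above into throughput and a speedup factor. The dictionary and function names are made up for this illustration; only the four millisecond values are taken from the two runs.

```python
# Timings copied from the two llama.cpp runs above (milliseconds per token).
CPU_ONLY = {"prompt_eval": 198.85, "eval": 276.22}
CPU_AND_GPU = {"prompt_eval": 69.05, "eval": 262.89}


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds into a throughput."""
    return 1000.0 / ms_per_token


def speedup(before_ms: float, after_ms: float) -> float:
    """Factor by which the per-token latency shrank (values > 1 mean faster)."""
    return before_ms / after_ms


for phase in ("prompt_eval", "eval"):
    cpu = tokens_per_second(CPU_ONLY[phase])
    gpu = tokens_per_second(CPU_AND_GPU[phase])
    factor = speedup(CPU_ONLY[phase], CPU_AND_GPU[phase])
    print(f"{phase}: {cpu:.2f} -> {gpu:.2f} tokens/s ({factor:.2f}x)")
# prompt_eval: 5.03 -> 14.48 tokens/s (2.88x)
# eval: 3.62 -> 3.80 tokens/s (1.05x)
```

In this run, the GPU build mainly helped while the prompt was being processed, and much less while new tokens were generated.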

ollama.ai

CLI	API	GUI	GPU Support	NVIDIA	AMD	Models
✅	✅	❌	❌	❌	❌	61

This project started in July 2023 and provides a stable and efficient way to run LLMs. Its focus is to optimize models through quantization, e.g. going from 16-bit to 4-bit weights, which drastically improves the speed of token generation.
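To give an idea of what quantization means, here is a deliberately simplified Python sketch. It is not the scheme that ollama.ai or its model files actually use; it only assumes a generic approach in which a block of weights shares one scale factor and each weight is stored as a signed 4-bit integer between -8 and 7. The function names are hypothetical.

```python
def quantize_block(weights: list[float]) -> tuple[float, list[int]]:
    """Map floats to signed 4-bit integers (-8..7) that share one scale factor."""
    largest = max(abs(w) for w in weights)
    scale = largest / 7 if largest > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quantized


def dequantize_block(scale: float, quantized: list[int]) -> list[float]:
    """Recover approximate floats; the rounding error is the price of compression."""
    return [q * scale for q in quantized]


scale, q = quantize_block([3.5, -1.2, 0.6, 0.0])
print(scale, q)                    # 0.5 [7, -2, 1, 0]
print(dequantize_block(scale, q))  # [3.5, -1.0, 0.5, 0.0]
```

Each weight now needs 4 bits instead of 16, plus one shared scale per block, at the cost of small rounding errors such as -1.2 becoming -1.0.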

Of all the tools reviewed in this article, ollama.ai impressed me the most: it has the best overall token generation speed, and thanks to its use of 4-bit quantized models, it consumes very little RAM and mostly CPU. It shows how much a well-defined and well-designed open-source project can help to lower the barrier for using complex technologies.

Installation & Model Loading



mkdir ollama
cd ollama
wget -c https://ollama.ai/download/ollama-linux-amd64 -O ollama.bin
chmod +x ollama.bin

# start the server in the background, then pull and run the model
./ollama.bin serve &
./ollama.bin run llama2

Model Inference



./ollama.bin run llama2

Stats

Running ollama.ai models via CLI does not reveal any stats.

Text Output



NASA is currently conducting several active space exploration missions, including:

1. Mars Exploration Program: NASA's ongoing mission to explore Mars and gather data about the planet's geology, climate,
and potential habitability.
2. Cassini-Huygens Mission: A joint mission between NASA, the European Space Agency, and the Italian space agency, which
is exploring the Saturn system and its moons, including Titan.
3. Juno Mission: A mission to study the Jupiter system and gather data about the planet's atmosphere, magnetic field,
and interior.
4. New Horizons Mission: A mission to study the Pluto system and other objects in the Kuiper Belt, a region of icy
bodies beyond Neptune.
5. OSIRIS-REx Mission: A mission to study the asteroid Bennu and gather samples for return to Earth.
6. Parker Solar Probe Mission: A mission to study the Sun's corona and gather data about the solar wind and other
phenomena in the solar system.
7. Commercial Crew Program: A program to launch astronauts to the International Space Station using private spacecraft,
such as SpaceX's Dragon and Boeing's Starliner.
8. International Space Station Program: An ongoing program to operate and conduct research on the International Space
Station, a habitable artificial satellite in low Earth orbit.
9. Mars 2020 Mission: A mission to explore Jezero crater on Mars and gather data about the planet's geology and
potential biosignatures.
10. Artemis Program: A program to return humans to the Moon by 2024 and establish a sustainable presence on the lunar
surface, with the goal of eventually sending humans to Mars and other destinations in the solar system.

These are just a few examples of NASA's ongoing space exploration missions. The agency is constantly working on new
missions and projects to explore our solar system and beyond.

API

Another way to use ollama.ai is via its API. Once started, it creates a configurable REST API - and with a simple curl request, this endpoint can be queried.



curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "You are a helpful AI assistant. Please answer the following question. Question: Which active space exploration missions are conducted by NASA? Answer:"
 }'

The response will be streamed token-by-token to the caller, and interestingly, a final response also includes performance stats.



{"model":"llama2","created_at":"2024-01-21T11:45:20.691017501Z","response":"\n","done":false}
{"model":"llama2","created_at":"2024-01-21T11:45:20.839294214Z","response":"N","done":false}
{"model":"llama2","created_at":"2024-01-21T11:45:20.989470461Z","response":"AS","done":false}
{"model":"llama2","created_at":"2024-01-21T11:45:21.138173364Z","response":"A","done":false}
{"model":"llama2","created_at":"2024-01-21T11:45:21.289603804Z","response":" (","done":false}
{"model":"llama2","created_at":"2024-01-21T11:45:21.438224653Z","response":"National","done":false}
// ...

{"model":"llama2","created_at":"2024-01-21T11:47:21.407359314Z","response":" Earth","done":false}
{"model":"llama2","created_at":"2024-01-21T11:47:21.644278398Z","response":" orbit","done":false}
{"model":"llama2","created_at":"2024-01-21T11:47:21.833488632Z","response":" and","done":false}
{"model":"llama2","created_at":"2024-01-21T11:47:22.031820197Z","response":" explore","done":false}
{"model":"llama2","created_at":"2024-01-21T11:47:22.226019821Z","response":" deep","done":false}
{"model":"llama2","created_at":"2024-01-21T11:47:22.421290323Z","response":" space","done":false}
{"model":"llama2","created_at":"2024-01-21T11:47:22.618374168Z","response":".","done":false}

// ...
{
  "model": "llama2",
  "created_at": "2024-01-21T11:47:22.816675556Z",
  "response": "",
  "done": true,
  "context": [
    518,
    25580,
    29962,
    // ...
    29889
  ],
  "total_duration": 126549030083,
  "load_duration": 600226,
  "prompt_eval_count": 36,
  "prompt_eval_duration": 4569576000,
  "eval_count": 685,
  "eval_duration": 121977152000
}
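The streamed output above can also be processed programmatically. The following Python sketch joins the text fragments and derives the generation speed from the final record. The function name and the shortened sample lines are made up for this illustration, and the calculation relies on the durations being reported in nanoseconds.

```python
import json


def summarize_stream(lines):
    """Join streamed text fragments and derive tokens/s from the final record."""
    text_parts = []
    stats = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text_parts.append(record.get("response", ""))
        if record.get("done"):
            stats = record
    tokens_per_second = None
    if stats.get("eval_duration"):
        # Durations are given in nanoseconds, so convert to seconds first.
        tokens_per_second = stats["eval_count"] / (stats["eval_duration"] / 1e9)
    return "".join(text_parts), tokens_per_second


# Abbreviated records modelled on the output above (timestamps omitted).
sample = [
    '{"model":"llama2","response":"N","done":false}',
    '{"model":"llama2","response":"AS","done":false}',
    '{"model":"llama2","response":"A","done":false}',
    '{"model":"llama2","response":"","done":true,"eval_count":685,"eval_duration":121977152000}',
]
text, rate = summarize_stream(sample)
print(text, f"{rate:.2f} tokens/s")  # NASA 5.62 tokens/s
```

With the numbers from the final response, 685 tokens in roughly 122 seconds, this comes to about 5.6 tokens per second.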

lit-gpt

CLI	API	GUI	GPU Support	NVIDIA	AMD	Models
✅	❌	❌	✅	cuBLAS	❌	19

A pure Python library for running several open source LLMs locally, such as Mistral or StableLM. It also branched into subprojects for supporting specific LLM families, such as lit-llama. The project's goals are simplicity and optimization for a range of hardware. When used with 4-bit quantization, it needs less than 1GB to run a 3B model, and CPU usage remains low as well.

Model provision requires downloading the model weights from HuggingFace. For the built-in models, download scripts are included in the project. For running your own models, you may need to download and convert them manually. Another interesting feature is fine-tuning the models - see the finetune adapter documentation.

Installation & Model Loading



git clone --depth=1 https://github.com/Lightning-AI/lit-gpt
cd lit-gpt
pip install -r requirements.txt
pip install bitsandbytes==0.41.0 huggingface_hub

python scripts/download.py --repo_id stabilityai/stablelm-zephyr-3b --from_safetensors=True
python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/stabilityai/stablelm-zephyr-3b --dtype float32

Model Inference

The program expects to run on a GPU. To run it on a CPU, you need to pass a --quantize flag.



python chat/base.py --checkpoint_dir checkpoints/stabilityai/stablelm-zephyr-3b --quantize bnb.nf4

Stats



Time for inference: 290.24 sec total, 2.24 tokens/sec, 650 tokens

Text Output



>> Reply: NASA has been involved in numerous active space exploration missions throughout its history. Some of the significant and ongoing missions conducted by NASA include:

1. International Space Station (ISS): NASA collaborates with ESA (European Space Agency), Roscosmos (Russian Space Agency), JAXA (Japan Aerospace Exploration Agency), and CSA (Canadian Space Agency) in this long-duration research and experimentation outpost, orbiting Earth at an average altitude of about 408 kilometers (256 miles) above the Earth's surface. (Launched in 1998, currently ongoing)

2. Mars Exploration Program: Since 1996, NASA has been leading the United States' Mars exploration missions, with multiple spacecraft and rovers sent to the red planet. The most notable missions include the Mars Climate Orbiter (1999), Mars Exploration Rovers (Spirit and Opportunity in 2003), and Mars Reconnaisance Orbiter (Mars Global Remote Sensing Orbiter in 2006). NASA continues to develop and execute cutting-edge Mars missions, such as the Mars 2020 Mission (Pitchcombeat Rover in 2021).

3. Commercial Crew Program: NASA's Commercial Crew Program aims to develop and sustain a human-rated spacecraft to transport astronauts to and from the International Space Station (ISS). The first two missions (Artemis I and Atlas V Demo-2) were carried out by private companies (SpaceX and Northrop Grumman) with NASA's critical support and oversight. NASA aims to complete this program by 2024, with Starliner and SLS (Space Launch System) missions to carry astronauts to ISS.

4. Small Earth Observing Mission (SCOM): Launched in 2018, the SCO Mission (previously known as the Planetary Health and Life Survey Mission) is a NASA-led effort to study Earth's dynamic environment and its relationship with life. The spacecraft will conduct observations in multiple spectral bands covering the whole electromagnetic spectrum, providing detailed insights into our planet's health and potential hazards.

5. Juno: The Jet Propulsion Laboratory (JPL), a NASA center, launched the Juno spacecraft in 2011 to study Jupiter's deep atmosphere, magnetic field, and its role in shaping the solar system. This mission has greatly expanded our understanding of the iconic planet.

6. Cassini-Huygens Mission: Launched in 1997 as part of NASA's Galilean Satellite Survey, this mission provided extensive insight into Saturn and its moons. The Cassini-Huygens mission ended in September 2017 when it deliberately plunged into Saturn's atmosphere to extend the mission, and to clean up potential space debris.

7. James Webb Space Telescope (JWST): This major infrared observatory, scheduled for launch in late 2021, is NASA's most powerful space observatory. Its primary objectives include investigating the earliest galaxies, exoplanets, and the formation of stars and planets in our cosmic neighborhood and expanding our understanding of the universe.

These are just a few examples of NASA's active space exploration missions. The space agency constantly explores new frontiers and works towards achieving its goals in space exploration and scientific research.

FastChat

CLI	API	GUI	GPU Support	NVIDIA	AMD	Models
✅	✅	✅	✅	cuBLAS	ROCm	57

FastChat is a complex framework with several use cases - serving LLMs is only one, but I included it in this review to show in which direction LLM tools are evolving. This project provides a CLI, an API, and even a GUI, launched by robust Python programs that can even parallelize across multiple GPUs and servers. Interestingly, you can use it to run several LLMs at the same time and actively compare their text generation capabilities.

Because FastChat defaults to running models at their original trained precision, it requires more computing resources than any other tool explained before. For example, the Vicuna-7B model requires either 14GB of GPU memory or 30GB of RAM. However, it supports parallelizing across multiple GPUs and can alternatively load models with 8-bit quantization.
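The 14GB figure can be approximated with simple arithmetic. The Python sketch below assumes that the original precision is 16 bits per weight and counts only the weights themselves, so it ignores activations, caches, and framework overhead, and it does not explain the 30GB RAM figure. The function name is made up for this illustration.

```python
def weight_memory_gb(parameters: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return parameters * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
# 7B model at 16-bit: ~14.0 GB
# 7B model at 8-bit: ~7.0 GB
# 7B model at 4-bit: ~3.5 GB
```

Under these assumptions, 8-bit quantization halves the memory needed for the weights of a 7B model, and 4-bit quantization quarters it.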

Installation & Model Loading



mkdir fastchat
cd fastchat
python -m pip install "fschat[model_worker,webui]"

python -m fastchat.serve.cli --model-path lmsys/fastchat-t5-3b-v1.0

Inference



python -m fastchat.serve.cli --model-path lmsys/fastchat-t5-3b-v1.0 --device=cpu

Stats

The CLI mode does not expose any performance stats.

Text Output



NASA has several active space exploration missions that are currently underway. Some examples include:
1. Apollo 11: This mission took humans to the Moon in 1969 and is now on its way back to Earth.
2. Mars Exploration Rovers: These vehicles were used by NASA to explore the planet Mars from a new perspective and gather data on the surface of Mars.
3. Solar Probe B: This mission went to the moon in 2004 and returned in 2015 with scientific instruments to study the icy surface.
4. Voyager 1: This spacecraft was launched in 1998 but is still operational and is still exploring orbiting the Sun.
5. Cassini Space Mission: This mission was launched in 2013 with a crew of six astronauts and the first deep-space probe to be launched by NASA.
6. Hubble Space Telescope: This telescope is used by NASA to study the solar system and provides images of the Sun, the Moon, and other celestial bodies.
7. New Horizons: This mission was launched in 2013 and is the pristine gateway to our universe. It is ready to return humans to the moon and to explore the solar system again from a new perspective.
8. Europa Clipper: This mission was launched in 2015 and is returning astronauts to the surface of the moon with scientific instruments to study the composition of the atmosphere.
9. Rosetta: This mission is sending a [END OF TEXT]

API

FastChat can be started in API mode. You need to start three different Python programs, and then a REST API becomes available.



python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-path lmsys/fastchat-t5-3b-v1.0 --device=cpu
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

An API query needs to include both the message and the model name.



curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "fastchat-t5-3b-v1.0",
    "messages": [{"role": "user", "content": "You are a helpful AI assistant. Please answer the following question. Question: Which active space exploration missions are conducted by NASA? Answer:"}]
  }'

And here is the output:



{
  "id": "chatcmpl-3mi6zV3GSm2AA29FGoeHug",
  "object": "chat.completion",
  "created": 1705842246,
  "model": "fastchat-t5-3b-v1.0",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "NASA is the United States' space agency and is responsible for conducting a wide range of space-related missions, including:\n1. The International Space Station (ISS)\n2. The Apollo program (including missions to the Moon, Mars, and the Moon)\n3. The International Space Station (ISS)\n4. The Galileo mission (which is the first spacecraft to explore the Solar System)\n5. The Hubble Space Telescope\n6. The Voyager 1 and 2 missions to the Moon\n7. The Mars Exploration Rovers (MER) program\n8. The Europa mission\n9. The Viking program to explore the Moon\n10. The Mars Science Laboratory (MSL) mission to the surface of Mars\n11. The Hubble Space Telescope (HST) mission to the Moon\n12. The Transit program to explore the Moon and its moons\n13. The Deep Impact program to investigate the formation and evolution of the Moon's oceans\n14. The International Space Station (ISS) program to explore the satellite and robotic missions to the Moon, Mars, and other celestial bodies\nThese missions consist of scientific research and exploration of celestial bodies, as well as technology development and educational outreach programs.\n"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 903,
    "total_tokens": 1316,
    "completion_tokens": 413
  }
}
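The same request can be sent from Python. The following sketch uses only the standard library and mirrors the curl call and the response structure shown above. The function names are made up for this illustration, the URL assumes the host and port used when starting the API server, and `ask` only works while that server is running.

```python
import json
import urllib.request

# Host and port as passed to fastchat.serve.openai_api_server above.
API_URL = "http://localhost:8000/v1/chat/completions"


def build_request(model: str, question: str) -> urllib.request.Request:
    """Create the POST request that mirrors the curl call."""
    payload = {"model": model, "messages": [{"role": "user", "content": question}]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_answer(response_body: str) -> tuple[str, int]:
    """Return the assistant text and the number of completion tokens."""
    body = json.loads(response_body)
    answer = body["choices"][0]["message"]["content"]
    return answer, body["usage"]["completion_tokens"]


def ask(model: str, question: str) -> tuple[str, int]:
    """Send the question to a running API server and parse the reply."""
    with urllib.request.urlopen(build_request(model, question)) as response:
        return extract_answer(response.read().decode("utf-8"))
```

A call such as `ask("fastchat-t5-3b-v1.0", "Which active space exploration missions are conducted by NASA?")` should then return the answer text together with the token count from the `usage` field.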

GUI

FastChat also provides a GUI. As before, you need to start several Python programs.



python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-path lmsys/fastchat-t5-3b-v1.0 --device=cpu
python3 -m fastchat.serve.gradio_web_server

The GUI can be accessed on any configured host IP and port. Here is a screenshot:

While using the GUI, the model worker outputs a message that details which system prompt the UI uses.



{
  "model": "fastchat-t5-3b-v1.0",
  "prompt": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n### Human: Got any creative ideas for a 10 year old’s birthday?\n### Assistant: Of course! Here are some creative ideas for a 10-year-old's birthday party:\n1. Treasure Hunt: Organize a treasure hunt in your backyard or nearby park. Create clues and riddles for the kids to solve, leading them to hidden treasures and surprises.\n2. Science Party: Plan a science-themed party where kids can engage in fun and interactive experiments. You can set up different stations with activities like making slime, erupting volcanoes, or creating simple chemical reactions.\n3. Outdoor Movie Night: Set up a backyard movie night with a projector and a large screen or white sheet. Create a cozy seating area with blankets and pillows, and serve popcorn and snacks while the kids enjoy a favorite movie under the stars.\n4. DIY Crafts Party: Arrange a craft party where kids can unleash their creativity. Provide a variety of craft supplies like beads, paints, and fabrics, and let them create their own unique masterpieces to take home as party favors.\n5. Sports Olympics: Host a mini Olympics event with various sports and games. Set up different stations for activities like sack races, relay races, basketball shooting, and obstacle courses. Give out medals or certificates to the participants.\n6. Cooking Party: Have a cooking-themed party where the kids can prepare their own mini pizzas, cupcakes, or cookies. Provide toppings, frosting, and decorating supplies, and let them get hands-on in the kitchen.\n7. Superhero Training Camp: Create a superhero-themed party where the kids can engage in fun training activities. Set up an obstacle course, have them design their own superhero capes or masks, and organize superhero-themed games and challenges.\n8. Outdoor Adventure: Plan an outdoor adventure party at a local park or nature reserve. Arrange activities like hiking, nature scavenger hunts, or a picnic with games. Encourage exploration and appreciation for the outdoors.\nRemember to tailor the activities to the birthday child's interests and preferences. Have a great celebration!\n### Human: You are a helpful AI assistant. Please answer the following question. Question: Which active space exploration missions are conducted by NASA? Answer:\n### Assistant:",
  "temperature": 0.7,
  "repetition_penalty": 1.2,
  "top_p": 1.0,
  "max_new_tokens": 1024,
  "stop": "###",
  "stop_token_ids": None,
  "echo": False,
}
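The logged prompt follows a simple pattern: a system text, earlier example turns, the new question, and an open assistant turn. The Python sketch below reconstructs that pattern from the log above. It is not FastChat's own template code, and the function and constant names are made up for this illustration.

```python
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)


def build_prompt(system: str, turns: list[tuple[str, str]], question: str) -> str:
    """Concatenate system text, earlier turns, and the new question."""
    parts = [system]
    for human, assistant in turns:
        parts.append(f"### Human: {human}")
        parts.append(f"### Assistant: {assistant}")
    parts.append(f"### Human: {question}")
    parts.append("### Assistant:")  # left open so the model continues from here
    return "\n".join(parts)


prompt = build_prompt(
    SYSTEM_PROMPT,
    [("Got any creative ideas for a 10 year old's birthday?", "Of course! ...")],
    "Which active space exploration missions are conducted by NASA?",
)
print(prompt)
```

The `"stop": "###"` setting in the log fits this pattern: generation ends as soon as the model starts writing the next turn marker.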

Comparison

The following table summarizes all tools.

Lib	CLI	API	GUI	GPU Support	NVIDIA	AMD	Models
ggml	✅	❌	❌	❌	❌	❌	10
llama.cpp	✅	❌	❌	✅	cuBLAS	CLBlast	31
ollama.ai	✅	✅	❌	❌	❌	❌	61
lit-gpt	✅	❌	❌	✅	cuBLAS	❌	19
FastChat	✅	✅	✅	✅	cuBLAS	ROCm	57

Conclusion

For running Large Language Models locally, several libraries exist. In this article, you learned how to get started with five different tools. For each tool, copy-and-paste ready code snippets were shown to install the tool, load models, and start inference. You also saw how the models responded to a query about active space programs, along with their token generation performance. Of all the investigated tools, ollama.ai provided the easiest installation method and the best token generation performance.
