I recently dove into the world of local LLMs to see how to hook them up with Langroid, the agent-oriented open-source LLM framework. Here are some notes on what I learned, plus a pointer to a tutorial on how to use them with Langroid.
Why local models?
There are commercial, remotely served models that currently appear to beat all open/local models. So why care about local models? They are exciting for a number of reasons:
- cost: other than compute/electricity, there is no cost to use them.
- privacy: no concerns about sending your data to a remote server.
- latency: no network latency due to remote API calls, so faster response times, provided you can get fast enough inference.
- uncensored: some local models are not censored to avoid sensitive topics.
- fine-tunable: you can fine-tune them on private/recent data, which current commercial models don't have access to.
- sheer thrill: having a model running on your machine with no internet connection, and being able to have an intelligent conversation with it -- there is something almost magical about it.
The main appeal of local models is that, with careful enough prompting, they may behave well enough to be useful for specific tasks/domains, while bringing all of the above benefits. Some ideas on how you might use local LLMs:
- In a multi-agent system, you could have some agents use local models for narrow tasks with a lower bar for accuracy (and recover from bad responses with multiple tries).
- You could run many instances of the same or different models and combine their responses.
- Local LLMs can act as a privacy layer, to identify and handle sensitive data before passing to remote LLMs.
- Some local LLMs have intriguing features; for example, llama.cpp lets you constrain the model's output using grammars (see the sketch after this list).
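To make the grammar idea concrete, here is a minimal sketch using llama-cpp-python's grammar support. The model path is a placeholder, and the exact API may differ slightly across versions of the package:

```python
# Sketch: constrain llama.cpp output with a GBNF grammar via llama-cpp-python.
from llama_cpp import Llama, LlamaGrammar

# A tiny grammar that only allows the answers "yes" or "no"
YES_NO_GRAMMAR = 'root ::= "yes" | "no"'

llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf")  # placeholder path
grammar = LlamaGrammar.from_string(YES_NO_GRAMMAR)

out = llm(
    "Is Paris the capital of France? Answer yes or no:",
    grammar=grammar,   # generated text is forced to match the grammar
    max_tokens=3,
)
print(out["choices"][0]["text"])  # -> "yes" (or "no")
```

This kind of constrained decoding is handy when a downstream agent expects a strictly formatted answer (e.g. a single token, or JSON matching a schema).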
Running LLMs locally
There are several ways to run LLMs locally. See the r/LocalLLaMA subreddit for a wealth of information. There are open-source libraries that offer front-ends to run local models, for example oobabooga/text-generation-webui (or "ooba-TGW" for short), but the focus in this tutorial is on spinning up a server that mimics an OpenAI-like API, so that any Langroid code that works with the OpenAI API (say for GPT-3.5 or GPT-4) will also work with a local model, with just one simple change: set openai.api_base to the URL where the local API server is listening, typically http://localhost:8000/v1. It really is as simple as that!
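For example, with the (pre-1.0) openai Python package the switch looks roughly like this; the model name and port are placeholders for whatever your local server expects:

```python
import openai

# Point the openai client at the local OpenAI-compatible server
# (e.g. one started by ooba-TGW or llama-cpp-python) instead of api.openai.com.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "sk-dummy"  # most local servers ignore the key, but the client wants one

response = openai.ChatCompletion.create(
    model="local-model",  # placeholder: many local servers ignore the model name
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    temperature=0.2,
)
print(response["choices"][0]["message"]["content"])
```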
There are two libraries I'd recommend for setting up local models with OpenAI-like APIs:
- ooba-TGW mentioned above, for a variety of models, including llama2 models.
- llama-cpp-python (LCP for short), specifically for llama2 models.
If you have other recommendations, feel free to add them in the comments.
Building Applications with Local LLMs
We open-sourced Langroid to simplify building LLM-powered applications, whether with local or commercial LLMs. If you’re itching to play with local LLMs in simple Python scripts, head over to our tutorial on using Langroid with local LLMs.
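As a rough sketch of the pattern (exact Langroid config names may differ across versions; the tutorial is the authoritative reference), the idea is to redirect the OpenAI client to the local server as described above and then use Langroid as usual:

```python
# Rough sketch: point the openai client at the local server, then use Langroid
# exactly as you would with GPT-3.5/GPT-4. Names below are placeholders.
import openai
import langroid as lr
import langroid.language_models as lm

openai.api_base = "http://localhost:8000/v1"   # local OpenAI-compatible server
openai.api_key = "sk-dummy"                    # typically ignored by local servers

agent = lr.ChatAgent(
    lr.ChatAgentConfig(llm=lm.OpenAIGPTConfig())  # default OpenAI-style LLM config
)
task = lr.Task(agent, name="LocalChat")
task.run()  # interactive chat loop, now backed by the local model
```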