David Mezzetti for NeuML

Posted on May 31, 2024 • Edited on Dec 14 • Originally published at neuml.hashnode.dev

RAG with llama.cpp and external API services

#ai #llm #rag #vectordatabase

txtai has been and always will be a local-first framework. It was originally designed to run models on local hardware using Hugging Face Transformers. As the AI space has evolved over the last year, so has txtai. Additional LLM inference backends, llama.cpp and external API services (via LiteLLM), have been available for a while. Recent changes have added the ability to use these backends for vectorization and made them easier to use for LLM inference.

This article will demonstrate how to run retrieval-augmented-generation (RAG) processes (vectorization and LLM inference) with llama.cpp and external API services.

Install dependencies

Install txtai and all dependencies.

# Install txtai and dependencies
pip install llama-cpp-python[server] --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
pip install txtai[pipeline-llm]

Embeddings with llama.cpp vectorization

The first example will build an Embeddings database backed by llama.cpp vectorization.

The llama.cpp project states: The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

Let's give it a try.

from txtai import Embeddings

# Create Embeddings with llama.cpp GGUF model
embeddings = Embeddings(
    path="second-state/All-MiniLM-L6-v2-Embedding-GGUF/all-MiniLM-L6-v2-Q4_K_M.gguf",
    content=True
)

# Load dataset
wikipedia = Embeddings()
wikipedia.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

query = """
SELECT id, text FROM txtai
order by percentile desc
LIMIT 10000
"""

# Index dataset
embeddings.index(wikipedia.search(query))

Now that the Embeddings database is ready, let's run a search query.

embeddings.search("Inventors of electric-powered devices")

[{'id': 'Thomas Edison',
  'text': 'Thomas Alva Edison (February 11, 1847October 18, 1931) was an American inventor and businessman. He developed many devices in fields such as electric power generation, mass communication, sound recording, and motion pictures. These inventions, which include the phonograph, the motion picture camera, and early versions of the electric light bulb, have had a widespread impact on the modern industrialized world. He was one of the first inventors to apply the principles of organized science and teamwork to the process of invention, working with many researchers and employees. He established the first industrial research laboratory.',
  'score': 0.6758285164833069},
 {'id': 'Nikola Tesla',
  'text': 'Nikola Tesla (; , ;  1856\xa0– 7 January 1943) was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist. He is best-known for his contributions to the design of the modern alternating current (AC) electricity supply system.',
  'score': 0.6077840328216553},
 {'id': 'Alexander Graham Bell',
  'text': 'Alexander Graham Bell (, born Alexander Bell; March 3, 1847 – August 2, 1922) was a  Scottish-born Canadian-American inventor, scientist and engineer who is credited with patenting the first practical telephone. He also co-founded the American Telephone and Telegraph Company (AT&T) in 1885.',
  'score': 0.4573010802268982}]

As we can see, this Embeddings database works just like any other Embeddings database. The difference is that it's using a llama.cpp model for vectorization instead of PyTorch.
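
To make the `score` values above less opaque, here is a minimal sketch of vector similarity ranking. It assumes the score is a cosine similarity between the query vector and each document vector, which is our understanding of the txtai default rather than something shown in this article. The vectors are made-up 3-dimensional toys and `cosine_similarity` and `rank` are hypothetical helpers, not txtai functions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vector, documents, limit=3):
    """Rank (id, vector) pairs by similarity to the query, highest first."""
    scored = [(uid, cosine_similarity(query_vector, vector)) for uid, vector in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

# Toy vectors for illustration only; real embeddings have hundreds of dimensions
documents = [
    ("Thomas Edison", [0.9, 0.1, 0.0]),
    ("Nikola Tesla", [0.7, 0.3, 0.1]),
    ("Claude Monet", [0.0, 0.2, 0.9]),
]

results = rank([1.0, 0.0, 0.0], documents, limit=2)
print(results)
```

Whichever backend produces the vectors, the ranking step stays the same, which is why swapping PyTorch for llama.cpp does not change how search is called.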

RAG with llama.cpp

LLM inference with llama.cpp is not a new txtai feature. A recent change added support for conversational messages in addition to standard prompts. This removes the need to hand-write model-specific prompt formats.
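
To illustrate what a prompting format is, the sketch below renders a list of messages in ChatML, one common chat format. `to_chatml` is a hypothetical helper written for this illustration; it is not part of txtai, and in practice the backend applies whatever template the model expects.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of role/content messages as a ChatML prompt string."""
    prompt = ""
    for message in messages:
        prompt += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"

    # Leave an open assistant turn for the model to complete
    if add_generation_prompt:
        prompt += "<|im_start|>assistant\n"

    return prompt

messages = [
    {"role": "system", "content": "You are a friendly assistant."},
    {"role": "user", "content": "Who invented the phonograph?"},
]

prompt = to_chatml(messages)
print(prompt)
```

Other model families use different markers, so passing messages and letting the backend do the formatting avoids hard-coding any one of them.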

Let's run a retrieval-augmented-generation (RAG) process fully backed by llama.cpp models.

It's important to note that conversational messages work with all LLM backends supported by txtai (transformers, llama.cpp, litellm).

from txtai import LLM

# LLM instance
llm = LLM(path="unsloth/Qwen3-4B-Instruct-2507-GGUF/Qwen3-4B-Instruct-2507-Q4_K_M.gguf")

# Question and context
question = "Write a list of invented electric-powered devices"
context = "\n".join(x["text"] for x in embeddings.search(question))

# Pass messages to LLM
response = llm([
    {"role": "system", "content": "You are a friendly assistant. You answer questions from users."},
    {"role": "user", "content": f"""
Answer the following question using only the context below. Only include information specifically discussed.

question: {question}
context: {context}
"""}
])
print(response)

Based on the given context, here's a list of invented electric-powered devices:

1. Electric light bulb by Thomas Edison
2. Phonograph by Thomas Edison
3. Motion picture camera by Thomas Edison
4. Alternating current (AC) electricity supply system by Nikola Tesla
5. Telephone by Alexander Graham Bell

And just like that, RAG with llama.cpp🦙!
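
If the same prompt is reused across questions, the message construction can be pulled into a small function. This is only a sketch: `rag_messages` and `SYSTEM` are hypothetical names introduced here, and the prompt text is copied from the example above.

```python
SYSTEM = "You are a friendly assistant. You answer questions from users."

def rag_messages(question, passages, system=SYSTEM):
    """Build the system and user messages for a RAG request."""
    context = "\n".join(passages)
    user = f"""
Answer the following question using only the context below. Only include information specifically discussed.

question: {question}
context: {context}
"""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# Stand-in passages; with txtai these would come from
# [x["text"] for x in embeddings.search(question)]
messages = rag_messages(
    "Who invented the telephone?",
    ["Bell patented the telephone.", "Edison built the phonograph."],
)
print(messages[1]["content"])
```

The returned list can be passed directly to the LLM instance, for example `llm(rag_messages(question, passages))`.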

Embeddings with external vectorization

Next, we'll show how an Embeddings database can integrate with external API services via LiteLLM.

In the LiteLLM project's own words: LiteLLM handles loadbalancing, fallbacks and spend tracking across 100+ LLMs. All in the OpenAI format.

Let's first start up a local API service to use for this demo.

# Download models
wget https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/resolve/main/all-MiniLM-L6-v2-Q4_K_M.gguf
wget https://huggingface.co/unsloth/Qwen3-4B-Instruct-2507-GGUF/resolve/main/Qwen3-4B-Instruct-2507-Q4_K_M.gguf

# Start local API services
nohup python -m llama_cpp.server --n_gpu_layers -1 --model all-MiniLM-L6-v2-Q4_K_M.gguf --host 127.0.0.1 --port 8000 &> vector.log &
nohup python -m llama_cpp.server --n_gpu_layers -1 --model Qwen3-4B-Instruct-2507-Q4_K_M.gguf --chat_format chatml --host 127.0.0.1 --port 8001 &> llm.log &
sleep 30
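
A fixed `sleep 30` is a guess at how long the servers need to start. As an alternative, the sketch below polls the ports until they accept connections. `wait_for_port` is a hypothetical helper using only the Python standard library, and an open port is only a rough readiness signal, so treat it as a starting point rather than a guarantee that the model has loaded.

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet, wait and retry
            time.sleep(interval)

    return False

# Example usage for the two services started above:
# for port in (8000, 8001):
#     if not wait_for_port("127.0.0.1", port):
#         raise RuntimeError(f"Service on port {port} did not start")
```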

Now let's connect and use this local service to generate vectors for a new Embeddings database. Note that the local service responds in OpenAI's response format, hence the path setting below.

from txtai import Embeddings

# Create Embeddings instance with external vectorization
embeddings = Embeddings(
    path="openai/gpt-4-turbo",
    content=True,
    vectors={
        "api_base": "http://localhost:8000/v1",
        "api_key": "sk-1234"
    }
)

# Load dataset
wikipedia = Embeddings()
wikipedia.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

query = """
SELECT id, text FROM txtai
order by percentile desc
LIMIT 10000
"""

# Index dataset
embeddings.index(wikipedia.search(query))

embeddings.search("Inventors of electric-powered devices")

[{'id': 'Thomas Edison',
  'text': 'Thomas Alva Edison (February 11, 1847October 18, 1931) was an American inventor and businessman. He developed many devices in fields such as electric power generation, mass communication, sound recording, and motion pictures. These inventions, which include the phonograph, the motion picture camera, and early versions of the electric light bulb, have had a widespread impact on the modern industrialized world. He was one of the first inventors to apply the principles of organized science and teamwork to the process of invention, working with many researchers and employees. He established the first industrial research laboratory.',
  'score': 0.6758285164833069},
 {'id': 'Nikola Tesla',
  'text': 'Nikola Tesla (; , ;  1856\xa0– 7 January 1943) was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist. He is best-known for his contributions to the design of the modern alternating current (AC) electricity supply system.',
  'score': 0.6077840328216553},
 {'id': 'Alexander Graham Bell',
  'text': 'Alexander Graham Bell (, born Alexander Bell; March 3, 1847 – August 2, 1922) was a  Scottish-born Canadian-American inventor, scientist and engineer who is credited with patenting the first practical telephone. He also co-founded the American Telephone and Telegraph Company (AT&T) in 1885.',
  'score': 0.4573010802268982}]

As with the previous llama.cpp example, this Embeddings database behaves like any other. The main difference is that content is sent to an external service for vectorization.
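
For a rough idea of what travels over the wire, here is a sketch of the general shape of an OpenAI-style embeddings request and response. This is not the code txtai or LiteLLM runs; `embeddings_request` and `parse_embeddings` are hypothetical helpers, and the sample response is hand-written for illustration.

```python
import json

def embeddings_request(texts, model="gpt-4-turbo"):
    """Build the JSON body for an OpenAI-style POST /v1/embeddings call."""
    return json.dumps({"model": model, "input": texts})

def parse_embeddings(body):
    """Extract vectors from an OpenAI-style embeddings response, in input order."""
    data = json.loads(body)["data"]
    return [row["embedding"] for row in sorted(data, key=lambda row: row["index"])]

# Hand-written sample response, not real server output
sample = json.dumps({
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [0.3, 0.4]},
        {"object": "embedding", "index": 0, "embedding": [0.1, 0.2]},
    ],
    "model": "gpt-4-turbo",
})

vectors = parse_embeddings(sample)
print(vectors)
```

Because the local llama.cpp server speaks this same format, the client only needs a different `api_base` to switch between a local service and a hosted one.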

RAG with external API services

For our last task, we'll run a retrieval-augmented-generation (RAG) process fully backed by an external API service.

from txtai import LLM

# LLM instance
llm = LLM(path="openai/gpt-4-turbo", api_base="http://localhost:8001/v1", api_key="sk-1234")

# Question and context
question = "Write a list of invented electric-powered devices"
context = "\n".join(x["text"] for x in embeddings.search(question))

# Pass messages to LLM
response = llm([
    {"role": "system", "content": "You are a friendly assistant. You answer questions from users."},
    {"role": "user", "content": f"""
Answer the following question using only the context below. Only include information specifically discussed.

question: {question}
context: {context}
"""}
])
print(response)

Based on the given context, a list of invented electric-powered devices includes:

1. Phonograph by Thomas Edison
2. Motion Picture Camera by Thomas Edison
3. Early versions of the Electric Light Bulb by Thomas Edison
4. AC (Alternating Current) Electricity Supply System by Nikola Tesla
5. Telephone by Alexander Graham Bell

Wrapping up

txtai supports a number of different vector and LLM backends. The default method uses PyTorch models via the Hugging Face Transformers library. This article demonstrated how llama.cpp and external API services can also be used.

These additional vector and LLM backends add flexibility and make it easier to scale. For example, vectorization can be fully offloaded to an external API service or another local service. llama.cpp has great support for macOS devices and alternate accelerators such as AMD ROCm and Intel GPUs, and it has been known to run on Raspberry Pi devices.

It's exciting to see the confluence of all these new advances coming together. Stay tuned for more!