Retrieval-augmented generation (RAG) connects LLM answers to your own documents instead of relying on training data. This tutorial builds a complete pipeline with Runware handling generation on purpose-built inference infrastructure, which is faster and cheaper than commodity providers, through an OpenAI-compatible endpoint, and LangChain handling the indexing and retrieval layer.
Without retrieval, assistants either hallucinate details such as inventing API fields or policies that don't exist or go stale the moment your docs change. RAG fixes both by pulling the relevant passages before generation.
The RAG pattern
RAG retrieves relevant document chunks at query time and adds them to the prompt so the model answers using that specific context. Unlike memory, which tracks user info across sessions, RAG surfaces up-to-date docs (like APIs or policies). Mixing user chat logs into RAG often leads to stale or sensitive results, so use each for their purpose.
As you scale, remember these tips:
Chunk with purpose. Fixed-size splits can break code and headings. Use RecursiveCharacterTextSplitter by default. Adjust splitting if retrieval feels slightly off.
Retrieval quality matters most. A wrong top chunk leads to confident but incorrect answers. Increasing
khelps up to a point. For exact matches liketaskUUID, try hybrid keyword and vector search.Let the model refuse when unsure. It is better to have a short "I don’t know" than a wrong but convincing answer.
The stack here is local embeddings with FastEmbed, FAISS on disk, LangChain LCEL for the chain, and Runware for the LLM call. You can add reranking without changing how you call Runware.
Runware in the stack
Runware runs generation on its Sonic Inference Engine, custom-built for AI and fully optimized from hardware to software. This means faster, cheaper inference for open source models than general cloud providers, with pay-as-you-go pricing and no minimums.
For example, MiniMax M2.5 costs less on Runware ($0.27/$0.95 per million input/output tokens) than on MiniMax’s own API ($0.30/$1.20). At enterprise scale (think 10M+ tokens), this saves nearly $10K a year in inference for the same workload without any code changes.
This tutorial runs chunk embeddings locally with FastEmbed (no embedding bill) and sends only the generation call to Runware, which keeps costs low even at scale.
Pipeline overview
The end-to-end loop looks like this:
- Load source text (snippets, Markdown, PDFs).
- Split into chunks and embed them locally.
- Store vectors in FAISS on disk.
- For each question, retrieve the closest chunks.
- Send those chunks plus the question to Runware via LangChain.
- Optionally run the same question without retrieval to see the difference.
LangChain covers loaders, splitters, the vector store, the retriever, and the chain. Runware covers the model response. In this setup, swapping a Runware model id does not require re-indexing.
Prerequisites
- Python 3.10 or later
- A Runware account
- Enough disk for a small embedding model and FAISS (CPU is fine for a doc-sized corpus)
Set up the environment
Let's start by creating a new Python project. To keep your Python dependencies organized you should create a virtual environment.
First, create and navigate into a local directory:
# Create and move to the new directory
mkdir runware-rag
cd runware-rag
Afterwards, create and activate a new virtual environment:
# Create a virtual environment
python -m venv venv
# Active the virtual environment (Windows)
.\venv\Scripts\activate.bat
# Active the virtual environment (Linux)
source ./venv/bin/activate
Now, create a requirements.txt file with the following dependencies:
langchain
langchain-openai
langchain-community
langchain-text-splitters
fastembed
faiss-cpu
python-dotenv
Then, you can install the dependencies with the following command:
pip install -r requirements.txt
Finally, obtain the Runware API Key and create an .env file with the copied value:
RUNWARE_API_KEY="your_runware_api_key_here"
Step 1. Prepare your corpus
To fill your search index, you'll need some source text. For this tutorial, we'll use a set of brief snippets from the Runware documentation to make it easy to check your results. If you’re working with your own materials, you can use LangChain's various document loaders to ingest content such as Markdown files, PDFs, web pages, or even entire sitemaps.
Create corpus.py with those snippets:
RUNWARE_DOC_SNIPPETS = [
"Runware exposes an OpenAI-compatible chat completions endpoint at [https://api.runware.ai/v1/chat/completions](https://api.runware.ai/v1/chat/completions?utm_source=devto&utm_medium=community-site&utm_campaign=2026-06-rag-application-blog&utm_content=2026-06-11_devto-sponsored-article). Point existing OpenAI SDKs at that base URL with your Runware API key.",
"The Runware API accepts an array of tasks per POST. Each task includes taskType, taskUUID, and modality-specific fields. You can batch image, video, audio, and text in one call.",
"textInference is the task type for LLM chat on the native API. OpenAI-compatible clients use chat completions instead.",
"Runware supports deliveryMethod sync, stream, and async. Async tasks can include webhookURL for completion callbacks.",
"includeCost on a native task request returns the USD cost for that task when set to true.",
"DeepSeek-V4-Flash on Runware supports thinkingLevel off, high, and max, jsonSchema for structured output, and tool definitions on textInference.",
"Generated output URLs are retained for a limited time by default. Use ttl where supported to control retention.",
]
If a question has no match in these snippets, the assistant should say it does not know.
Step 2. Chunk and index
Before retrieval, split the docs into chunks and turn each chunk into a vector with FastEmbed. It's an ONNX embedding model that runs fine on CPU and downloads itself the first time you call it.
Define the embedding model once in embeddings.py so indexing and retrieval always use the same one. A mismatch here quietly wrecks your results.
from langchain_community.embeddings import FastEmbedEmbeddings
EMBEDDING_MODEL = "BAAI/bge-small-en-v1.5"
def get_embeddings():
return FastEmbedEmbeddings(model_name=EMBEDDING_MODEL)
Create index.py to build and save the FAISS index:
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from corpus import RUNWARE_DOC_SNIPPETS
from embeddings import get_embeddings
def build_vectorstore():
docs = [
Document(page_content=s, metadata={"source": f"snippet-{i}"})
for i, s in enumerate(RUNWARE_DOC_SNIPPETS)
]
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=80)
chunks = splitter.split_documents(docs)
return FAISS.from_documents(chunks, get_embeddings())
if __name__ == "__main__":
store = build_vectorstore()
store.save_local("faiss_runware_docs")
print("Saved faiss_runware_docs")
python index.py
Re-run this whenever your docs change. Use the same embedding model for indexing and querying. Treat faiss_runware_docs as a build artifact and version it like the rest of your code.
Step 3. Connect the LLM to Runware
Connecting to Runware looks identical to connecting to OpenAI (that's the point!). The OpenAI-compatible endpoint means you can drop Runware into an existing LangChain app by changing two lines. What you get in return is Runware's inference infrastructure with lower cost per token, faster response times, and access to hundreds of open source models under the same schema (all on pay-as-you-go billing).
Create llm.py:
import os
from langchain_openai import ChatOpenAI
def runware_chat():
return ChatOpenAI(
model="deepseek-v4-flash",
openai_api_key=os.environ["RUNWARE_API_KEY"],
openai_api_base="https://api.runware.ai/v1",
temperature=0.2,
max_tokens=512,
)
That model field is the only thing you change to switch models. Runware hosts 400K+ model variants under one consistent endpoint and schema, so when you swap deepseek-v4-flash for any other supported model, you just update that string and nothing else. You do not need to migrate or re-index. The same API key and base URL also work for image, video, audio, and text tasks, so this RAG pattern extends easily to multimodal pipelines without adding another vendor.
For taskUUID, webhooks, includeCost, or mixed-modality batches, see the native API docs. This tutorial uses the OpenAI-compatible endpoint because it drops into LangChain without extra code. The includeCost flag is covered in Production notes.
Step 4. Build the chain
Wire the retriever, prompt, and LLM together with LangChain Expression Language (LCEL).
Create chain.py:
from operator import itemgetter
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from embeddings import get_embeddings
from llm import runware_chat
def load_retriever():
store = FAISS.load_local(
"faiss_runware_docs",
get_embeddings(),
allow_dangerous_deserialization=True,
)
return store.as_retriever(search_kwargs={"k": 4})
def format_docs(docs):
return "\n\n".join(
f"[{d.metadata.get('source', 'doc')}] {d.page_content}" for d in docs
)
prompt = ChatPromptTemplate.from_messages([
(
"system",
"You answer questions about Runware using only the context below. "
"If the context is insufficient, say you do not know. "
"Do not invent API fields, prices, or behavior.",
),
("human", "Context:\n{context}\n\nQuestion: {question}"),
])
def build_rag_chain():
retriever = load_retriever()
llm = runware_chat()
return (
{
"context": itemgetter("question") | retriever | format_docs,
"question": itemgetter("question"),
}
| prompt
| llm
| StrOutputParser()
)
Keep the system prompt strict. If the retrieved context is empty or off-topic, the model should say so instead of guessing.
Step 5. Test with and without retrieval
In your project root, create a file named main.py.
from chain import build_rag_chain
from llm import runware_chat
from dotenv import load_dotenv
load_dotenv()
QUESTION = "How does Runware handle async text results and cost visibility?"
def main():
rag = build_rag_chain()
answer = rag.invoke({"question": QUESTION})
print("=== With retrieval ===")
print(answer)
print()
bare = runware_chat()
no_context = bare.invoke(
"Answer in one short paragraph: How does Runware handle async text results and cost visibility?"
)
print("=== Without retrieval ===")
print(no_context.content)
if __name__ == "__main__":
main()
python main.py
A good run surfaces terms from your corpus like webhookURL, includeCost, and async delivery. Ask something outside corpus.py and the assistant should refuse instead of making something up.
=== With retrieval ===
Based on the context provided, Runware supports async tasks with a `webhookURL` for completion callbacks. For cost visibility, setting `includeCost` to `true` on a native task request returns the USD cost for that task. However, the context does not specify how async text results are delivered beyond the webhook callback, nor does it detail cost visibility specifically for async text tasks.
=== Without retrieval ===
Runware handles async text results by returning a unique `taskId` immediately upon submission, allowing users to poll or receive webhook callbacks for the final output once processing completes. For cost visibility, Runware provides real-time pricing estimates before task execution and itemized cost breakdowns in the response, showing the exact credits or fees consumed per request.
Step 6. Add streaming (optional)
Runware supports "stream": true on the compatible endpoint. Stream the chain when a human reads the answer live.
for chunk in rag.stream({"question": QUESTION}):
print(chunk, end="", flush=True)
For batch work use the Runware native API directly as it supports webhooks and per-task cost breakdowns. See Production notes for cost attribution guidance.
Production notes
A few things to address before production:
Treat the vector store as a versioned build artifact, and log which embedding model built each version. Chasing down a model mismatch after the fact is painful.
Redact secrets before indexing. Chunking is local, but retrieved text goes to Runware’s API.
Tune
kagainst real user questions. If you get the right topic but the wrong paragraph, reach for reranking or hybrid search before you jump to a bigger model.Keep a set of golden questions with expected source snippets and run them in CI. This catches wrong citations and missing abstentions early.
On Runware's native API, includeCost helps attribute spend during pilots.
Wrap-up
The key thing to take away is the separation of concerns within RAG system. LangChain owns everything up to the retrieval step, and Runware handles generation through a standard chat completions interface. That boundary matters because it means you can tune retrieval without touching the LLM, and swap models without rebuilding the index.
Most quality problems in RAG come from the retrieval side, not the model. If answers are off, check what chunks actually got retrieved before assuming the model needs to be bigger. From here, the practical next steps are tighter chunking, a small set of real user questions to eval against, and clear rules for when the assistant should refuse to answer.
Top comments (0)