In the previous post, we turned our command-line RAG application into a web API using FastAPI. In this final post, we'll containerize our application with Docker and refactor our code to support multiple LLM providers and a unified configuration system.
The Goal for This Post
By the end of this post, you will have a flexible, scalable, and maintainable Q&A application that is containerized with Docker and supports multiple LLM providers.
Why Docker?
Docker is a platform for developing, shipping, and running applications in containers. Containers are lightweight, standalone, and executable packages that include everything needed to run an application: code, runtime, system tools, system libraries, and settings.
By containerizing our application, we can:
- Ensure consistency: Our application will run the same way in any environment, whether it's a developer's laptop or a production server.
- Simplify dependencies: We can package all of our application's dependencies (like Python and the required libraries) into the container, so we don't have to worry about installing them on the host machine.
- Improve scalability: Containers can be easily scaled up or down to meet demand.
The Code
Here is the complete code for our flexible and containerized application.
Dockerfile
# Start from a Python base image
FROM python:3.11-slim
# Set a working directory
WORKDIR /app
# Copy requirements.txt and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the rest of your app directory
COPY ./app /app/app
# Expose the port your FastAPI app runs on
EXPOSE 8000
# Define the CMD to run your application using uvicorn
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
docker-compose.yml
version: '3.8'

services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - LLM_PROVIDER=local
      - LLM_ENDPOINT=http://host.docker.internal:1234
      - LLM_API_KEY=your_api_key
      - LLM_MODEL_ID=google/flan-t5-xxl
      - VECTOR_DB_DIR=/app/vector_store
      - DATA_DIR=/app/data
    volumes:
      - ./vector_store:/app/vector_store
      - ./data:/app/data
.env.example
LLM_PROVIDER=local
LLM_ENDPOINT=http://localhost:1234
LLM_API_KEY=your_api_key
LLM_MODEL_ID=google/flan-t5-xxl
VECTOR_DB_DIR=vector_store
DATA_DIR=data
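If you run the app outside Docker during development, you can load these variables from the .env file at startup. Here's a minimal sketch, assuming the python-dotenv package has been added to requirements.txt (it isn't listed in the files above):

# Early in app/main.py or a small config module (hypothetical placement)
import os

from dotenv import load_dotenv  # provided by the python-dotenv package

# Reads key=value pairs from .env into os.environ; variables that are
# already set (e.g. by docker-compose) are left untouched by default.
load_dotenv()

llm_provider = os.getenv("LLM_PROVIDER", "local")
vector_db_dir = os.getenv("VECTOR_DB_DIR", "vector_store")

Docker Compose injects the same variables through its environment block, so the code path stays identical inside and outside the container.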
app/rag_logic.py
import os

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders.directory import DirectoryLoader
from langchain_community.document_loaders.pdf import PDFPlumberLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI
from langchain_community.llms import HuggingFaceHub


def get_llm():
    """Returns an LLM instance based on the LLM_PROVIDER environment variable."""
    llm_provider = os.getenv("LLM_PROVIDER")
    api_key = os.getenv("LLM_API_KEY")
    endpoint = os.getenv("LLM_ENDPOINT")
    model_id = os.getenv("LLM_MODEL_ID")

    if llm_provider == "openai":
        return OpenAI(api_key=api_key)
    elif llm_provider == "huggingface":
        return HuggingFaceHub(repo_id=model_id, huggingfacehub_api_token=api_key)
    elif llm_provider == "local":
        # Any OpenAI-compatible server (such as LM Studio) is reached via base_url.
        return OpenAI(base_url=endpoint, api_key=api_key)
    else:
        raise ValueError(f"Unsupported LLM provider: {llm_provider}")


def create_rag_chain(retriever):
    """Builds the final retrieval chain on top of whichever LLM get_llm() returns."""
    template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

Question: {input}
Context: {context}
Answer:
"""
    prompt = PromptTemplate(template=template, input_variables=["input", "context"])
    llm = get_llm()
    document_chain = create_stuff_documents_chain(llm, prompt)
    return create_retrieval_chain(retriever, document_chain)
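Because get_llm() reads everything from environment variables, you can sanity-check a provider switch without starting the API at all. A rough sketch (the values mirror .env.example; the prompt is arbitrary):

# quick_check.py -- illustrative only
import os

from app.rag_logic import get_llm

# Point the factory at a local OpenAI-compatible server (e.g. LM Studio).
os.environ["LLM_PROVIDER"] = "local"
os.environ["LLM_ENDPOINT"] = "http://localhost:1234"
os.environ["LLM_API_KEY"] = "your_api_key"

llm = get_llm()
print(llm.invoke("Say hello in one short sentence."))

Switching LLM_PROVIDER to openai or huggingface exercises the other branches without touching any code.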
app/main.py
@app.post("/ask")
def ask_question(query: Query):
"""Answers a question using the RAG pipeline."""
try:
vectordb_path = os.getenv("VECTOR_DB_DIR")
if not vectordb_path:
raise ValueError("VECTOR_DB_DIR environment variable not set.")
retriever = load_retriever(vectordb_path)
rag_chain = create_rag_chain(retriever)
response = rag_chain.invoke({"input": query.question})
return {"answer": response["answer"]}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
How to Run the Code
- Build and run the application:

docker-compose up --build

- Interact with the API:

You can interact with the API using curl:

curl -X POST "http://localhost:8000/ask" -H "Content-Type: application/json" -d '{ "question": "What is attention?" }'

Or, you can use the Swagger UI by navigating to http://localhost:8000/docs in your browser.
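If you would rather call the endpoint from Python than from the command line, a small client script (assuming the requests library is installed) could look like this:

import requests

# Send a question to the containerized RAG service and print the answer.
response = requests.post(
    "http://localhost:8000/ask",
    json={"question": "What is attention?"},
    timeout=120,  # local models can take a while on the first request
)
response.raise_for_status()
print(response.json()["answer"])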
The Final Output
Congratulations! You have successfully built a flexible, scalable, and maintainable Q&A application. You can now use this project as a starting point for your own RAG applications.
We hope this series has been informative and has inspired you to build your own intelligent Q&A system. Thanks for reading!
Full implementation: hadywalied/AskAttentionAI