Tests on monitoring an LLM-based application with Grafana/Prometheus
Introduction
Ever wonder what’s really happening under the hood of your applications? How many users are clicking that “Summarize” button? How long does it take your AI to respond to a query? In the world of software, answers to these questions are crucial for performance, reliability, and user experience. That’s where Grafana and Prometheus come in, forming a dynamic duo for application monitoring.
Grafana & Prometheus: Your Monitoring Power Couple 💑
Imagine you have a bustling restaurant. Prometheus is like the diligent waiter, meticulously noting down every order, how long each dish takes to prepare, and how many times someone asks for water. Grafana, on the other hand, is the head chef’s big-screen display in the kitchen, showing real-time trends: “Are we getting too many pasta orders?”, “Is the grill taking too long tonight?”, “Are the coffee machines constantly busy?”.
- Prometheus: An open-source monitoring system that collects and stores metrics as time-series data. It “scrapes” (pulls) data from your application endpoints at regular intervals. It’s fantastic at collecting, processing, and querying numerical data.
- Grafana: An open-source analytics and interactive visualization web application. It connects to data sources like Prometheus and allows you to create beautiful, insightful dashboards that bring your metrics to life with graphs, charts, and alerts.
Together, they provide a powerful, flexible, and scalable solution for understanding your application’s health and performance.
Is Monitoring Still Relevant for LLM-Based Applications & Agents? 🤔 Absolutely!
The rise of Large Language Models (LLMs) and intelligent agents has brought new complexities to application development. While the core monitoring principles remain, the types of metrics we care about expand significantly.
Here’s why Grafana and Prometheus are not just “interesting” but becoming essential for LLM-based applications:
1. Performance & Latency 🚀
LLMs can be resource-intensive, and their response times can vary based on model complexity, server load, and prompt length.
What to Monitor:
- Response Time: How long does it take Ollama (or your LLM) to generate a response? (e.g., ollama_app_summary_duration_seconds from our example).
- Token Generation Rate: How many tokens per second is the model generating?
- Concurrent Requests: How many active users or agents are interacting with the LLM at any given moment?
- Why it Matters: Slow responses degrade user experience. Monitoring helps identify bottlenecks, optimize model serving, and scale resources effectively.
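To make this concrete, here is a minimal sketch of how response time and concurrency could be instrumented with prometheus_client around an Ollama call; the metric names and the timed_chat helper are illustrative, not part of the example app shown later.
import ollama
from prometheus_client import Gauge, Histogram

# Illustrative metric names (not the ones used in the example app below)
LLM_RESPONSE_TIME = Histogram('llm_response_duration_seconds',
                              'Time spent waiting for the LLM response.')
LLM_IN_FLIGHT = Gauge('llm_requests_in_flight',
                      'LLM requests currently being processed.')

def timed_chat(model: str, messages: list) -> dict:
    """Wrap a single Ollama call with latency and concurrency tracking."""
    with LLM_IN_FLIGHT.track_inprogress(), LLM_RESPONSE_TIME.time():
        return ollama.chat(model=model, messages=messages)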
2. Usage & Cost Optimization 💸
LLM APIs (or even local models with varying resource demands) can incur costs based on token usage or computational resources. Understanding usage patterns is critical for cost management.
What to Monitor:
- Total API Calls/Local Invocations: How many times has your LLM been queried? (e.g., ollama_app_chat_messages_total, ollama_app_summary_requests_total).
- Input/Output Token Counts: How many tokens are being sent to and received from the LLM?
- Feature Usage: Which parts of your LLM-powered app (e.g., chat vs. summarization) are most popular?
- Why it Matters: Uncontrolled usage can lead to unexpected bills or resource strain. Monitoring helps forecast demand and identify opportunities for optimization (e.g., caching, prompt engineering to reduce token counts).
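A similar sketch for usage tracking with labeled counters; the metric names are again hypothetical, and prompt_eval_count / eval_count are the token counts Ollama normally reports on completed responses, treated as optional here.
from prometheus_client import Counter

# Illustrative labeled counters (names are hypothetical)
LLM_TOKENS = Counter('llm_tokens_total',
                     'Tokens exchanged with the LLM.',
                     ['direction'])   # 'input' or 'output'
FEATURE_USAGE = Counter('app_feature_usage_total',
                        'Invocations per application feature.',
                        ['feature'])  # e.g. 'chat', 'summarize'

def record_usage(feature: str, response: dict) -> None:
    """Record one LLM call: which feature triggered it and how many tokens it used."""
    FEATURE_USAGE.labels(feature=feature).inc()
    LLM_TOKENS.labels(direction='input').inc(response.get('prompt_eval_count', 0) or 0)
    LLM_TOKENS.labels(direction='output').inc(response.get('eval_count', 0) or 0)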
3. Reliability & Error Tracking 🚨
Even the most advanced LLMs can encounter issues — API errors, timeout failures, or unexpected responses.
What to Monitor:
- Error Rates: How often does the application fail to get a response from the LLM, or the LLM returns an error?
- Retry Attempts: Are certain LLM calls requiring multiple retries before succeeding?
- Model Health: Is the underlying LLM service (like Ollama) running and responsive?
- Why it Matters: Quick detection of errors means faster resolution, preventing downtime and maintaining a stable application.
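Error tracking can reuse the same primitives; in this sketch (illustrative names), a counter is incremented automatically whenever the Ollama call raises an exception.
import ollama
from prometheus_client import Counter

# Illustrative error counter (name is hypothetical)
LLM_ERRORS = Counter('llm_request_errors_total',
                     'LLM calls that raised an exception.')

def safe_chat(model: str, messages: list) -> dict:
    # count_exceptions() increments the counter and re-raises the exception,
    # so the caller can still handle it (e.g. show an st.error message).
    with LLM_ERRORS.count_exceptions():
        return ollama.chat(model=model, messages=messages)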
4. Agent Behavior & Interaction Quality 🤖
For LLM-powered agents, understanding their decision-making and interaction patterns is a new frontier in monitoring.
What to Monitor:
- Tool Usage: Which external tools (e.g., web search, calculators) is the agent using, and how frequently?
- Chain Length/Steps: How many steps does it take an agent to complete a task?
- Conversation Turns: How long are the conversations the LLM is having with users?
- Why it Matters: This helps in debugging agent behavior, improving its efficiency, and ensuring it performs tasks as intended, rather than hallucinating or getting stuck.
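For agents, the same building blocks apply; here is a sketch of tool-usage and chain-length metrics, where the metric names, tool labels, and the record_agent_run helper are purely illustrative.
from prometheus_client import Counter, Histogram

# Illustrative agent metrics (names are hypothetical)
TOOL_CALLS = Counter('agent_tool_calls_total',
                     'Tool invocations made by the agent.',
                     ['tool'])  # e.g. 'web_search', 'calculator'
CHAIN_STEPS = Histogram('agent_chain_steps',
                        'Number of steps an agent takes to finish a task.',
                        buckets=[1, 2, 3, 5, 8, 13, 21])

def record_agent_run(tool_sequence: list) -> None:
    """Record one finished agent run, given the ordered list of tools it used."""
    for tool in tool_sequence:
        TOOL_CALLS.labels(tool=tool).inc()
    CHAIN_STEPS.observe(len(tool_sequence))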
Getting Started is Easier Than You Think!
As I’ll show with a Streamlit example, integrating Prometheus metrics into a Python application is straightforward with libraries like prometheus_client. Once the metrics are exposed and Prometheus is scraping them, visualizing them in Grafana is an intuitive process of building queries and designing dashboards.
So, whether you’re running a traditional microservice or pioneering the next generation of AI agents, investing in a robust monitoring stack with Grafana and Prometheus is a smart move* (see my conclusion below). It transforms raw data into actionable insights, helping to build better, more reliable, and more performant applications.
Below is a small implementation I put together as a very basic test.
- Install and set up Grafana and Prometheus locally (I ran these tests on a macOS laptop).
# --- Grafana ---
# brew
brew update
brew install grafana
# or download the binaries
curl -O https://dl.grafana.com/grafana/release/12.3.0-17718666199/grafana_12.3.0-17718666199_17718666199_darwin_amd64.tar.gz
tar -zxvf grafana_12.3.0-17718666199_17718666199_darwin_amd64.tar.gz
# for arm
curl -O https://dl.grafana.com/grafana/release/12.1.1/grafana_12.1.1_16903967602_darwin_arm64.tar.gz
tar -zxvf grafana_12.1.1_16903967602_darwin_arm64.tar.gz
# --- Prometheus ---
# install Homebrew first if it is not already present
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install prometheus
- Once both Grafana and Prometheus are installed, they can be run as Homebrew services. To start/stop these services I wrote a little Bash script.
#!/bin/bash
# Check if an argument is provided
if [ -z "$1" ]; then
echo "Usage: $0 {install|uninstall|start|stop|prom_install|prom_uninstall|prom_start|prom_stop}"
exit 1
fi
# Use a case statement to handle different arguments
case "$1" in
start)
echo "Starting Grafana service..."
brew services start grafana
;;
stop)
echo "Stopping Grafana service..."
brew services stop grafana
;;
install)
echo "Installing Grafana service..."
brew install grafana
;;
uninstall)
echo "Uninstalling Grafana service..."
brew uninstall grafana
;;
prom_install)
echo "Installing Promotheus..."
curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh
brew install prometheus
;;
prom_uninstall)
echo "Uninstalling Promotheus..."
brew uninstall prometheus
;;
prom_start)
echo "Starting Promotheus..."
brew services start prometheus
;;
prom_stop)
echo "Stopping Promotheus..."
brew services stop prometheus
brew services cleanup
;;
*)
echo "Invalid option: $1"
echo "Usage: $0 {install|uninstall|start|stop|prom_install|prom_uninstall|prom_start|prom_stop}"
exit 1
;;
esac
- Prepare the application environment (for the LLM application and its Prometheus client).
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install prometheus_client
pip install requests
#
pip install streamlit
pip install ollama
pip install traceloop-sdk
pip install PyPDF2
pip install watchdog
- A sample LLM-based application, run locally with Ollama and a Granite model ⬇️
# streamlit_ollama_app.py
import streamlit as st
import ollama
import io
from PyPDF2 import PdfReader
from prometheus_client import start_http_server, Counter, Histogram
import threading
# --- Traceloop Initialization ---
try:
from traceloop.sdk import Traceloop
Traceloop.init(
disable_batch=True,
api_key="xxxxxxxxxx" # could be used in addition to the rest...
)
except ImportError:
st.warning("Traceloop SDK not found. Install with 'pip install traceloop-sdk' to enable tracing.")
Traceloop = None
# --- Prometheus Metrics Configuration & Initialization ---
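# NOTE: st.session_state is per browser session, while the prometheus_client
# registry and the HTTP server started below are per process. For this
# single-user local test that is fine; in a multi-session setup a process-level
# guard (e.g. st.cache_resource) would avoid re-registering the metrics or
# re-binding port 8000.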
def init_metrics():
# Only create and return metrics if they don't exist in session state
if 'metrics' not in st.session_state:
st.session_state.metrics = {
"chat_messages_count": Counter('ollama_app_chat_messages_total', 'Total number of chat messages sent to Ollama.'),
"summary_requests_count": Counter('ollama_app_summary_requests_total', 'Total number of document summarization requests.'),
"summary_duration": Histogram('ollama_app_summary_duration_seconds', 'Time spent summarizing documents.')
}
# Start the Prometheus server only once
if 'metrics_server_thread' not in st.session_state:
def start_metrics_server():
start_http_server(8000)
st.session_state.metrics_server_thread = threading.Thread(target=start_metrics_server, daemon=True)
st.session_state.metrics_server_thread.start()
print("Prometheus metrics server started on port 8000.")
return st.session_state.metrics
# Call the initialization function at the start of your script
metrics = init_metrics()
CHAT_MESSAGES_COUNT = metrics["chat_messages_count"]
SUMMARY_REQUESTS_COUNT = metrics["summary_requests_count"]
SUMMARY_DURATION = metrics["summary_duration"]
# --- Streamlit App Configuration & Session State ---
st.set_page_config(page_title="Ollama AI Assistant", layout="centered")
st.title("💬 Ollama AI Assistant")
st.markdown("Interact with `granite3.3:latest` locally or summarize documents!")
# --- Ollama Model & App-specific Configuration ---
OLLAMA_MODEL = "granite3.3:latest"
MAX_SUMMARY_INPUT_LENGTH = 4000
# Initialize session state for messages and document content
if "messages" not in st.session_state:
st.session_state.messages = []
if 'document_content' not in st.session_state:
st.session_state.document_content = None
# --- Document Summarizer Section ---
st.header("📄 Document Summarizer")
uploaded_file = st.file_uploader("Upload a document (TXT, MD, PDF)", type=["txt", "md", "pdf"])
# This block executes when a new file is uploaded
if uploaded_file is not None:
if 'last_uploaded_file_name' not in st.session_state or st.session_state.last_uploaded_file_name != uploaded_file.name:
st.session_state.last_uploaded_file_name = uploaded_file.name
try:
if uploaded_file.type in ["text/plain", "text/markdown"]:
st.session_state.document_content = uploaded_file.read().decode("utf-8")
elif uploaded_file.type == "application/pdf":
reader = PdfReader(io.BytesIO(uploaded_file.read()))
st.session_state.document_content = ""
for page in reader.pages:
st.session_state.document_content += page.extract_text() or ""
if not st.session_state.document_content.strip():
st.warning("Could not extract text from the PDF. It might be an image-based PDF or encrypted.")
st.session_state.document_content = None
else:
st.warning(f"Unsupported file type: {uploaded_file.type}.")
st.session_state.document_content = None
except Exception as e:
st.error(f"Error reading file: {e}")
st.session_state.document_content = None
# This part displays the content and the button if content exists in session state
if st.session_state.document_content:
st.subheader("Document Preview:")
display_content = st.session_state.document_content[:2000] + ("\n\n...[Document truncated for preview]..." if len(st.session_state.document_content) > 2000 else "")
st.text_area("Content", display_content, height=200, disabled=True)
if len(st.session_state.document_content) > MAX_SUMMARY_INPUT_LENGTH:
st.warning(f"Document is very long ({len(st.session_state.document_content)} characters). "
f"Only the first {MAX_SUMMARY_INPUT_LENGTH} characters will be sent for summarization.")
if st.button("Summarize Document"):
SUMMARY_REQUESTS_COUNT.inc() # Increment the Prometheus counter
with SUMMARY_DURATION.time(): # This will time the block and record in the Histogram
with st.spinner("Summarizing your document..."):
try:
content_for_llm = st.session_state.document_content[:MAX_SUMMARY_INPUT_LENGTH]
summarize_prompt = (
f"Please read the following text carefully and provide a concise, "
f"high-level summary of its main points and key information. "
f"Focus on the most important aspects and avoid unnecessary details.\n\n"
f"Document Content:\n{content_for_llm}"
)
summary_response = ollama.chat(model=OLLAMA_MODEL, messages=[{'role': 'user', 'content': summarize_prompt}])
summary = summary_response['message']['content']
st.subheader("Document Summary:")
st.write(summary)
except ollama.ResponseError as e:
st.error(f"Error summarizing document with Ollama: {e}")
st.info("Please ensure Ollama is running and the model is downloaded.")
except Exception as e:
st.error(f"An unexpected error occurred during summarization: {e}")
# --- Chat Interface Section ---
st.header("💬 Chat with Ollama")
# Display chat messages from history on app rerun
for message in st.session_state.messages:
with st.chat_message(message["role"]):
st.markdown(message["content"])
if prompt := st.chat_input("What's on your mind?"):
CHAT_MESSAGES_COUNT.inc() # Increment the Prometheus counter for each new chat message
with st.chat_message("user"):
st.markdown(prompt)
st.session_state.messages.append({"role": "user", "content": prompt})
with st.chat_message("assistant"):
with st.spinner("Ollama is thinking..."):
try:
ollama_messages = [{"role": m["role"], "content": m["content"]} for m in st.session_state.messages]
response = ollama.chat(model=OLLAMA_MODEL, messages=ollama_messages)
full_response = response['message']['content']
st.markdown(full_response)
except ollama.ResponseError as e:
st.error(f"Error communicating with Ollama: {e}")
st.info("Please ensure Ollama is running and the model is downloaded.")
full_response = "Error: Could not get a response from Ollama."
except Exception as e:
st.error(f"An unexpected error occurred: {e}")
full_response = "Error: An unexpected issue occurred."
st.session_state.messages.append({"role": "assistant", "content": full_response})
st.markdown("---")
st.caption(f"Powered by Ollama ({OLLAMA_MODEL}) and Traceloop.")
- To let Prometheus scrape the application's metrics, you'll need a YAML configuration file (prometheus.yml) that points at your application.
scrape_configs:
- job_name: 'ollama_streamlit_app' # <-- This is the job name
# Metrics for the Streamlit app exposed on port 8000
scrape_interval: 5s
static_configs:
- targets: ['localhost:8000']
- 🟥 Prometheus must be started with this YAML file 🟥
prometheus --config.file="prometheus.yml" --web.enable-lifecycle
curl -X POST http://localhost:9090/-/reload
- Run the application and execute some simple tasks.
streamlit run streamlit_ollama_app.py
Configure Prometheus and add a Dashboard in Grafana
Finding Events in Prometheus: first, we should confirm that the metrics are being collected by Prometheus. Since the Python application exposes metrics on port 8000, the Prometheus server needs to be configured to scrape this endpoint.
- Open the browser and navigate to the Prometheus UI, which is typically at http://localhost:9090.
- In the Prometheus UI, go to the Status menu and select Targets.
- We should see an entry for the application, likely with a job_name of ollama_streamlit_app (or whatever the job is named in the prometheus.yml file) and an endpoint of localhost:8000. The status should be UP. If it's not, there's an issue with the Prometheus configuration or the Streamlit app isn't running.
- Go to the Graph page (or use the search bar) in Prometheus. You can find the metrics there by typing their names into the expression field, such as ollama_app_chat_messages_total or ollama_app_summary_requests_total. Prometheus will display the raw data points.
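The same verification can be scripted against Prometheus's HTTP query API; this sketch assumes the default local setup on port 9090.
# query_prometheus.py - ask Prometheus for the chat-message counter via its HTTP API
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "ollama_app_chat_messages_total"},
    timeout=5,
)
data = resp.json()
for result in data.get("data", {}).get("result", []):
    print(result["metric"], result["value"])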
Visualize the Metrics in Grafana: once we’ve verified that Prometheus is collecting the data, we can create a dashboard in Grafana to visualize it.
- Open the Grafana dashboard at http://localhost:3000.
- Ensure that a Prometheus data source is already configured and connected to http://localhost:9090. If not, add it under Connections > Data sources.
- Create a new dashboard by clicking the “New Dashboard” button.
- Add a new panel. In the Query section of the panel, select the Prometheus data source.
- Type the name of one of the metrics, such as ollama_app_summary_requests_total, into the query field.
- For a counter like ollama_app_chat_messages_total, a good query is rate(ollama_app_chat_messages_total[5m]). This shows the rate of events per second over the last 5 minutes, which is more useful for a counter than the raw total.
- For the SUMMARY_DURATION histogram, we can use a query like rate(ollama_app_summary_duration_seconds_sum[5m]) / rate(ollama_app_summary_duration_seconds_count[5m]) to calculate the average duration. We can also visualize the histogram buckets to see the distribution of response times.
Conclusion
Admittedly, our sample is basic; true LLM agents are far more intricate and this simple application only scratches the surface. While starting with Grafana and Prometheus provides invaluable, cost-effective initial visibility, scaling quickly demands specialized tools. This is where industrialized observability platforms like Instana shine. They are engineered to automatically trace and instrument LLM calls, providing superior, contextual metrics. These powerful solutions empower more than just engineers: their advanced dashboards transform raw data into clear business intelligence, moving the organization from passively observing technical metrics to proactively making business and architectural decisions.
Links
- Using OpenLLMetry: https://alain-airom.medium.com/using-openllmetry-is-simple-comme-bonjour-104249ec737c
- Local LLM observability with Instana: https://developer.ibm.com/tutorials/llm-observability-instana-local/
- Cloud-native LLM observability with IBM Instana: https://developer.ibm.com/tutorials/llm-observability-instana-cloud/
- LLM observability with IBM Instana: https://medium.com/@IBMDeveloper/llm-observability-with-ibm-instana-3b392c03c3ef