Recently, I have been talking to a lot of LLM developers trying to understand the issues they face while building production-grade LLM applications. There's a certain similarity among all those interviews, most of them are not sure what to evaluate beside the extent of hallucinations.
To make that easy for you, here's a compiled list of the most important evaluation metrics you need to consider before launching your LLM application to production. I have also added notebooks for you to try them out:
Response Quality:
| Metrics | Usage |
|---|---|
| Response Completeness | Evaluate if the response completely resolves the given user query. |
| Response Relevance | Evaluate whether the generated response for the given question, is relevant or not. |
| Response Conciseness | Evaluate how concise the generated response is i.e. the extent of additional irrelevant information in the response. |
| Response Matching | Compare the LLM-generated text with the gold (ideal) response using the defined score metric. |
| Response Consistency | Evaluate how consistent the response is with the question asked as well as with the context provided. |
Quality of Retrieved Context and Response Groundedness:
| Metrics | Usage |
|---|---|
| Factual Accuracy | Evaluate if the facts present in the response can be verified by the retrieved context. |
| Response Completeness wrt Context | Grades how complete the response was for the question specified concerning the information present in the context. |
| Context Relevance | Evaluate if the retrieved context contains sufficient information to answer the given question. |
Prompt Security:
| Metrics | Usage |
|---|---|
| Prompt Injection | Identify prompt leakage attacks |
Language Quality of Response:
| Metrics | Usage |
|---|---|
| Tone Critique | Assess if the tone of machine-generated responses matches with the desired persona. |
| Language Critique | Evaluate LLM generated responses on multiple aspects - fluence, politeness, grammar, and coherence. |
Conversation Quality:
| Metrics | Usage |
|---|---|
| Conversation Satisfaction | Measures the user’s satisfaction with the conversation with the AI assistant based on completeness and user acceptance. |
Some other Custom Evaluations:
| Metrics | Usage |
|---|---|
| Guideline Adherence | Grade how well the LLM adheres to a given custom guideline. |
| Custom Prompt Evaluation | Evaluate by defining your custom grading prompt. |
| Cosine Similarity | Calculate cosine similarity between embeddings of two texts. |
BTW all these metrics are maintained by UpTrain, by far the best open-source tool that I have used for LLM evaluations.
Top comments (0)