A complete list of all the LLM evaluation metrics you need to care about!

#llm #ai #chatgpt #opensource

Recently, I have been talking to a lot of LLM developers trying to understand the issues they face while building production-grade LLM applications. There's a certain similarity among all those interviews, most of them are not sure what to evaluate beside the extent of hallucinations.

To make that easy for you, here's a compiled list of the most important evaluation metrics you need to consider before launching your LLM application to production. I have also added notebooks for you to try them out:

Response Quality:

Metrics	Usage
Response Completeness	Evaluate if the response completely resolves the given user query.
Response Relevance	Evaluate whether the generated response for the given question, is relevant or not.
Response Conciseness	Evaluate how concise the generated response is i.e. the extent of additional irrelevant information in the response.
Response Matching	Compare the LLM-generated text with the gold (ideal) response using the defined score metric.
Response Consistency	Evaluate how consistent the response is with the question asked as well as with the context provided.

Quality of Retrieved Context and Response Groundedness:

Metrics	Usage
Factual Accuracy	Evaluate if the facts present in the response can be verified by the retrieved context.
Response Completeness wrt Context	Grades how complete the response was for the question specified concerning the information present in the context.
Context Relevance	Evaluate if the retrieved context contains sufficient information to answer the given question.

Prompt Security:

Metrics	Usage
Prompt Injection	Identify prompt leakage attacks

Language Quality of Response:

Metrics	Usage
Tone Critique	Assess if the tone of machine-generated responses matches with the desired persona.
Language Critique	Evaluate LLM generated responses on multiple aspects - fluence, politeness, grammar, and coherence.

Conversation Quality:

Metrics	Usage
Conversation Satisfaction	Measures the user’s satisfaction with the conversation with the AI assistant based on completeness and user acceptance.

Some other Custom Evaluations:

Metrics	Usage
Guideline Adherence	Grade how well the LLM adheres to a given custom guideline.
Custom Prompt Evaluation	Evaluate by defining your custom grading prompt.
Cosine Similarity	Calculate cosine similarity between embeddings of two texts.

BTW all these metrics are maintained by UpTrain, by far the best open-source tool that I have used for LLM evaluations.

DEV Community

A complete list of all the LLM evaluation metrics you need to care about!

Top comments (0)