Recently, I had an interesting call with a user who wanted to evaluate the performance of OpenAI Assistants.
The Use Case:
- Similar to a RAG pipeline, the user built an Assistant to answer medical queries about diseases and medicines.
- Provided a prompt to instruct the Assistant, along with a set of files containing supporting information from which the Assistant was required to generate responses (a rough setup sketch follows this list).
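For context, here is a minimal sketch of how such a file-backed Assistant might be set up. It assumes the beta Assistants API with the retrieval tool; the file name, instructions, and model are placeholders, and exact parameter names may differ across API versions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a supporting document (placeholder file name)
reference_file = client.files.create(
    file=open("malaria_treatment_guidelines.pdf", "rb"),
    purpose="assistants",
)

# Create an Assistant instructed to answer only from the uploaded files
assistant = client.beta.assistants.create(
    name="Medical Query Assistant",
    instructions=(
        "Answer questions about diseases and medicines using only the "
        "attached reference documents. Be concise, complete, and polite."
    ),
    model="gpt-4-turbo-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[reference_file.id],
)
```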
Challenges Faced:
- Had to manually mock conversations with the chatbot while acting as different personas (e.g., a patient with malaria), which became time-consuming with over 100 personas.
- After mocking the conversations, had to manually rate each individual response on parameters like whether it was grounded in the supporting documents, concise, complete, and polite.
Solution Developed:
- Simulating Conversations: Built a tool that mocks conversations with the Assistant based on user personas (e.g., "A patient asking about the treatment of malaria"); a sketch of this loop follows the list.
- Evaluation of the OpenAI Assistant: The tool rates each conversation on parameters like user satisfaction, grounded facts, relevance, etc., using UpTrain's pre-configured metrics (20+ metrics covering response quality, tonality, grammar, and more); a sketch of the scoring step is shown after the simulation sketch below.
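To illustrate the simulation part, here is a minimal sketch of how a persona-driven conversation can be run against an Assistant. This is not the notebook's actual implementation: the persona prompt, number of turns, and polling loop are simplified assumptions, and ASSISTANT_ID is a placeholder.

```python
import time
from openai import OpenAI

client = OpenAI()
ASSISTANT_ID = "asst_..."  # placeholder: ID of the Assistant created earlier
persona = "A patient asking about the treatment of malaria"

def persona_message(history: list[dict]) -> str:
    """A second LLM plays the persona and generates the next user turn."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system",
                   "content": f"You are {persona}. Ask short, realistic questions."}] + history,
    )
    return completion.choices[0].message.content

thread = client.beta.threads.create()
history: list[dict] = []

for _ in range(3):  # simulate a three-turn conversation
    user_msg = persona_message(history)
    client.beta.threads.messages.create(thread_id=thread.id, role="user", content=user_msg)

    # Run the Assistant on the thread and poll until it finishes (simplified)
    run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=ASSISTANT_ID)
    while run.status not in ("completed", "failed", "expired", "cancelled"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

    # Messages are returned most recent first; grab the Assistant's latest text reply
    reply = client.beta.threads.messages.list(thread_id=thread.id).data[0].content[0].text.value

    # From the persona LLM's point of view, its own questions are "assistant"
    # turns and the medical Assistant's answers are "user" turns.
    history.append({"role": "assistant", "content": user_msg})
    history.append({"role": "user", "content": reply})
```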
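And here is a minimal sketch of the scoring step using UpTrain's open-source EvalLLM interface. The data fields and metric selection below are illustrative; the exact checks used in the notebook may differ.

```python
from uptrain import EvalLLM, Evals

eval_llm = EvalLLM(openai_api_key="sk-...")  # placeholder key

# One row per simulated question/answer pair, with the retrieved context
data = [{
    "question": "How is malaria treated?",
    "context": "Uncomplicated malaria is typically treated with artemisinin-based combination therapies ...",
    "response": "Malaria is usually treated with artemisinin-based combination therapies (ACTs) ...",
}]

results = eval_llm.evaluate(
    data=data,
    checks=[
        Evals.FACTUAL_ACCURACY,       # is the response grounded in the context?
        Evals.RESPONSE_RELEVANCE,     # does it address the question?
        Evals.RESPONSE_CONCISENESS,   # is it to the point?
        Evals.RESPONSE_COMPLETENESS,  # does it answer the question fully?
    ],
)
print(results)
```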
Currently seeking feedback on the tool. Would love it if you could check it out here: https://github.com/uptrain-ai/uptrain/blob/main/examples/assistants/assistant_evaluator.ipynb