Recently, I had an interesting call with a user who wanted to evaluate the performance of OpenAI Assistants.
The Use Case:
- Similar to a RAG pipeline, the user built an Assistant to answer medical queries about diseases and medicines.
- Provided a prompt to instruct the Assistant, along with a set of files containing supporting information from which the Assistant was required to generate responses (a rough setup sketch follows this list).
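For context, here is a minimal sketch of how such a file-backed Assistant might be set up. It assumes the beta Assistants API with the retrieval tool; the file name, instructions, and model are placeholders, and exact parameter names may differ across API versions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a supporting document (placeholder file name)
reference_file = client.files.create(
    file=open("malaria_treatment_guidelines.pdf", "rb"),
    purpose="assistants",
)

# Create an Assistant instructed to answer only from the uploaded files
assistant = client.beta.assistants.create(
    name="Medical Query Assistant",
    instructions=(
        "Answer questions about diseases and medicines using only the "
        "attached reference documents. Be concise, complete, and polite."
    ),
    model="gpt-4-turbo-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[reference_file.id],
)
```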
Challenges Faced:
- Had to manually mock conversations with the chatbot while acting as different personas (e.g., a patient with malaria), which became time-consuming with over 100 personas.
- After mocking the conversations, had to manually rate each individual response on parameters like whether it was grounded in the supporting documents, concise, complete, and polite.
Solution Developed:
- Simulating Conversations: Built a tool that mocks conversations with the Assistant based on user personas (e.g., "A patient asking about the treatment of malaria"); a sketch of this loop follows the list.
- Evaluation of the OpenAI Assistant: The tool rates each conversation on parameters like user satisfaction, grounded facts, relevance, etc., using UpTrain's pre-configured metrics (20+ metrics covering response quality, tonality, grammar, and more); a sketch of the scoring step is shown after the simulation sketch below.
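To illustrate the simulation part, here is a minimal sketch of how a persona-driven conversation can be run against an Assistant. This is not the notebook's actual implementation: the persona prompt, number of turns, and polling loop are simplified assumptions, and ASSISTANT_ID is a placeholder.

```python
import time
from openai import OpenAI

client = OpenAI()
ASSISTANT_ID = "asst_..."  # placeholder: ID of the Assistant created earlier
persona = "A patient asking about the treatment of malaria"

def persona_message(history: list[dict]) -> str:
    """A second LLM plays the persona and generates the next user turn."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system",
                   "content": f"You are {persona}. Ask short, realistic questions."}] + history,
    )
    return completion.choices[0].message.content

thread = client.beta.threads.create()
history: list[dict] = []

for _ in range(3):  # simulate a three-turn conversation
    user_msg = persona_message(history)
    client.beta.threads.messages.create(thread_id=thread.id, role="user", content=user_msg)

    # Run the Assistant on the thread and poll until it finishes (simplified)
    run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=ASSISTANT_ID)
    while run.status not in ("completed", "failed", "expired", "cancelled"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

    # Messages are returned most recent first; grab the Assistant's latest text reply
    reply = client.beta.threads.messages.list(thread_id=thread.id).data[0].content[0].text.value

    # From the persona LLM's point of view, its own questions are "assistant"
    # turns and the medical Assistant's answers are "user" turns.
    history.append({"role": "assistant", "content": user_msg})
    history.append({"role": "user", "content": reply})
```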
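And here is a minimal sketch of the scoring step using UpTrain's open-source EvalLLM interface. The data fields and metric selection below are illustrative; the exact checks used in the notebook may differ.

```python
from uptrain import EvalLLM, Evals

eval_llm = EvalLLM(openai_api_key="sk-...")  # placeholder key

# One row per simulated question/answer pair, with the retrieved context
data = [{
    "question": "How is malaria treated?",
    "context": "Uncomplicated malaria is typically treated with artemisinin-based combination therapies ...",
    "response": "Malaria is usually treated with artemisinin-based combination therapies (ACTs) ...",
}]

results = eval_llm.evaluate(
    data=data,
    checks=[
        Evals.FACTUAL_ACCURACY,       # is the response grounded in the context?
        Evals.RESPONSE_RELEVANCE,     # does it address the question?
        Evals.RESPONSE_CONCISENESS,   # is it to the point?
        Evals.RESPONSE_COMPLETENESS,  # does it answer the question fully?
    ],
)
print(results)
```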
Currently seeking feedback on the tool. Would love it if you could check it out here: https://github.com/uptrain-ai/uptrain/blob/main/examples/assistants/assistant_evaluator.ipynb