Carlos Fernández-Loría

Chronicles of SHARI - 002: An Evaluation Framework

Hey there, fellow adventurers! It's time for another exciting update on Chronicles of SHARI. If you need a quick recap, you can always check out our first post.

Progress and Challenges

Over the past month, our development team has been working on two key features of our text-based game master. First, we've focused on ensuring that SHARI stays true to its role as game master and never deviates from it, so that players get an immersive and engaging experience throughout the game.

Second, we've been working to ensure that SHARI respects the player's agency. We want players to have full control over their character's actions and choices: SHARI should never dictate or assume actions on the player's behalf, leaving them free to shape their own path within the game.

However, it hasn't been a smooth ride. We've encountered some challenges along the way, primarily due to the probabilistic nature of large language models (LLMs). Adjusting the prompt templates and model parameters (illustrated after the list below) has proven more difficult than anticipated, for two main reasons:

  • Trial and error: It's virtually impossible to predict the effectiveness of prompt templates and model parameters without actually trying them out. As a result, our development process has involved a lot of trial and error, which has been very time-consuming.

  • Inconsistent responses: LLMs can provide inconsistent responses even when presented with the same input, prompt template, and model parameters. Additionally, the performance of the system is highly sensitive to the player's input. For example, certain prompt templates may work well under normal circumstances but may not handle adversarial users effectively. Achieving a "100% error-free" system has proven to be elusive.
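To give a sense of what we mean by "prompt templates and model parameters", here is a rough illustration of the kind of configuration we keep tweaking. The wording and values below are made up for this post, not our actual setup:

# Purely illustrative: not SHARI's actual prompt or parameter values.
GAME_MASTER_TEMPLATE = """You are SHARI, the game master of a text-based adventure.
Stay in character as the game master at all times.
Describe the world and the consequences of the player's actions,
but never decide or narrate actions on the player's behalf.

Story so far:
{history}

Player: {player_input}
Game master:"""

MODEL_PARAMETERS = {
    "temperature": 0.7,   # lower values make responses more predictable
    "max_tokens": 300,    # keeps each game-master turn reasonably short
}

Small edits to either of these can change SHARI's behavior in ways that are hard to predict, which is exactly the problem described above.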

Given these challenges, we now recognize the importance of establishing a robust evaluation framework. We need a systematic approach to measure the effectiveness of prompt templates and LLM parameters. By implementing a solid benchmarking framework, we aim to make steady progress in the development of SHARI.

Why an Evaluation Framework?

While our primary focus is on developing a text-based game master, it's crucial to understand that any system based on LLMs requires a robust evaluation framework. Here's why an evaluation framework is critical in the development process:

  • Acknowledging Imperfection: LLM-based systems will not be error-free; believing we can build a 100% flawless system with LLMs is unrealistic. As George Box famously put it, "All models are wrong, but some are useful." The goal is not perfection but usefulness, and an evaluation framework helps us determine how useful the system is and identify areas for improvement.

  • Domain-Specific Evaluation: Evaluating LLM-based systems based on statistical measures or how well they respond to random questions in benchmark datasets is insufficient. Different contexts and applications have varying degrees of tolerance for errors. Some errors may be costlier or more prevalent in specific domains. Therefore, it's essential to incorporate domain-specific evaluation criteria into the framework. This allows us to assess the system's performance accurately within the intended context.

  • Handling Inconsistency: LLMs are inherently probabilistic models, and their output can vary even with identical inputs. This inconsistency adds complexity to the development process. An evaluation framework must account for this variability and provide a comprehensive assessment that considers different output possibilities.

  • Navigating Complex Design: Designing LLM-based systems is challenging. It's difficult to predict in advance how well prompt templates will work under various scenarios. Additionally, when the system relies on a chain of prompt calls, changes in one prompt call may interact unexpectedly with subsequent prompt calls. An evaluation framework becomes indispensable in testing a wide range of tweaks and scenarios to uncover optimal configurations and address unforeseen issues.

  • Addressing Costly Evaluation: While the end user is the gold standard for evaluating the system's performance, relying solely on user evaluation can be time-consuming, expensive, or even impractical, especially during the development phase. To overcome these limitations, the evaluation framework must provide measures that are scalable and cost-effective.

The Evaluation Framework

The framework we've developed draws heavy inspiration from a concise and informative video on evaluating large language models by Kevin Dewalt from Prolego. You can find the video here. Our framework builds upon the concepts covered in the video and offers an automated approach to evaluation.

Our evaluation framework comprises the following essential components:

  • Testers: These are LLM-based agents that simulate player behavior. We utilize them to replicate game sessions and gather data for evaluation.

  • Setups: Each Setup defines a specific system configuration that we wish to test. For instance, a Setup might consist of particular prompt templates or specific model parameters that we want to assess.

  • Assessments: Each Assessment provides performance measures of the system for a given game session. These measures may include factors such as response time, adherence to the role of a game master, avoidance of dictating actions on behalf of the player, and more.

  • Validator: We employ an LLM-based system as the Validator to generate Assessments of game sessions. This component helps ensure consistent and objective evaluation across various Setups.

  • Overall Evaluation Criterion (OEC): The OEC defines weights for aggregating Assessments into a single metric, which serves as the basis for comparing and benchmarking different Setups. It requires a weight for each type of measure in an Assessment, reflecting the relative importance of different types of errors, and a weight for each type of Tester, reflecting the prevalence of different player profiles. A minimal interface sketch of these components follows this list.
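To make these pieces more concrete, here is a minimal sketch of how we currently picture their interfaces in Python. The names, signatures, and example measures are our own illustration and will likely change as we build the real thing:

from typing import Protocol

# An Assessment is just a dictionary of named performance measures, e.g.
# {"stays_in_role": 1.0, "respects_player_agency": 0.5, "response_time": 2.3}
Assessment = dict[str, float]

class Setup(Protocol):
    def initialize_system(self):
        """Build the game master from this configuration's prompt templates and model parameters."""

class Tester(Protocol):
    profile: str  # the player profile this LLM-based agent simulates

    def play_with(self, system) -> list:
        """Simulate a full game session with the system and return its transcript."""

class Validator(Protocol):
    def evaluate(self, game_session: list) -> Assessment:
        """Score one game session on each performance measure."""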

Here's how we plan to utilize the evaluation framework for benchmarking various solutions:

def evaluate(testers, setups, validator, oec, k):
    """Benchmark every Setup against every Tester over k simulated game sessions."""
    assessments = {}
    for tester in testers:
        for setup in setups:
            # Build the game master with this Setup's prompt templates and model parameters
            system = setup.initialize_system()
            for i in range(k):
                # Simulate a game session and have the Validator assess it
                game_session = tester.play_with(system)
                assessments[(tester, setup, i)] = validator.evaluate(game_session)
    # Aggregate all Assessments into a single metric per Setup using the OEC
    return oec.calculate(assessments)

In the above process, we iterate through the available Testers and Setups. For each combination, we initialize the system with the corresponding Setup. We then simulate k game sessions using the Tester, assess each session using the Validator, and store the resulting Assessments in the dictionary. Finally, we calculate the overall evaluation metric using the OEC and return the results.
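The OEC is where the weights described above come into play. Here is one way its aggregation could be implemented; this is a sketch rather than our final code, and the per-profile weighting and the averaging over the k sessions are assumptions on our part about how we'll combine the numbers:

from collections import defaultdict

class OEC:
    def __init__(self, measure_weights, tester_weights):
        self.measure_weights = measure_weights  # {measure name: importance of that error type}
        self.tester_weights = tester_weights    # {player profile: prevalence of that profile}

    def score(self, assessment):
        # Collapse one Assessment (a dict of measures) into a single weighted sum
        return sum(self.measure_weights[m] * value for m, value in assessment.items())

    def calculate(self, assessments):
        # assessments maps (tester, setup, session index) -> Assessment
        per_pair = defaultdict(list)
        for (tester, setup, _), assessment in assessments.items():
            per_pair[(tester, setup)].append(self.score(assessment))
        # Average over the k sessions for each Tester-Setup pair, then weight
        # by the prevalence of each player profile to get one number per Setup
        results = defaultdict(float)
        for (tester, setup), scores in per_pair.items():
            results[setup] += self.tester_weights[tester.profile] * (sum(scores) / len(scores))
        return dict(results)

With results expressed as a single number per Setup, benchmarking becomes a matter of sorting the dictionary and inspecting the top candidates.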

Advantages of the Framework

Our proposed evaluation framework offers several advantages in addressing the points we described earlier:

  • Acknowledging Imperfection: The framework acknowledges that no system Setup can achieve perfection. Instead, we compare different Setups based on the Overall Evaluation Criterion (OEC). This allows us to (1) select the most suitable Setup based on our specific goals and (2) determine if a Setup performs well enough for deployment, considering the intended use case.

  • Domain-Specific Evaluation: The framework incorporates domain-specific evaluation through the use of Testers and the measures included in the Assessments. By tailoring the evaluation components to the specific domain of text-based game mastering, we can accurately assess the system's performance within its intended context.

  • Handling Inconsistency: To account for the inherent variability of large language models (LLMs), the framework produces k Assessments for each Tester-Setup pair. By generating multiple Assessments, we capture the range of possible system responses and mitigate the impact of LLM inconsistency.

  • Navigating Complex Design: Instead of relying solely on intuition or theoretical analysis, the framework takes a practical approach to complex design. By exploring multiple Setups with different Testers, we gain empirical insights into what performs best based on our application goals. This empirical approach helps us navigate the complexities of designing the system and uncover optimal Setups.

  • Addressing Costly Evaluation: Evaluating LLM-based systems solely through user testing can be time-consuming and costly. To address this, our framework utilizes the Validator component, which enables us to repeatedly evaluate the system's performance quickly and at a lower cost compared to extensive user testing. This cost-effective evaluation approach should allow us to scale the evaluation process efficiently.

Nice! But does it actually work?

Well, let's get to the heart of the matter: we don't have a definitive answer just yet, but we have a plan in motion. In the coming weeks, our main focus will be on implementing and testing the evaluation framework. Yep, that means we'll be evaluating the evaluation framework itself. Meta, right?

There are two key components that are absolutely vital for this whole thing to work smoothly. First up is the Validator. Its role is to generate high-quality Assessments, and that's a big deal. To determine its effectiveness, we're planning to generate a set of conversations and manually assess them ourselves. Then, we'll put the Validator through its paces and compare its Assessments with ours. If we find that the LLM-based Validator falls short, we're considering the option of training a classifier using our own assessments as training data.
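Concretely, the comparison could look something like the sketch below, where both the manual and the automated Assessments are dictionaries of measures keyed by session. This is a hypothetical agreement check, not a settled methodology:

def validator_agreement(manual, automated):
    """Mean absolute difference per measure between our Assessments and the Validator's.

    Both arguments map a session id to an Assessment (a dict of measure scores).
    """
    diffs = {}
    for session_id, human_assessment in manual.items():
        machine_assessment = automated[session_id]
        for measure, human_score in human_assessment.items():
            diffs.setdefault(measure, []).append(abs(human_score - machine_assessment[measure]))
    # Lower values mean the Validator agrees more closely with our manual judgments
    return {measure: sum(values) / len(values) for measure, values in diffs.items()}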

The second key component is the Testers. These little fellas need to accurately simulate how real players would interact with the game master. Now, here's the catch: we won't know for sure how good the simulations are until we put the system out there and gather real user feedback. But for now, our strategy is to generate Testers with a high temperature setting, which will increase the randomness of the responses. This approach should allow us to capture a wide variety of conversations, whether they're common scenarios or rare outliers. By doing so, we'll be able to tackle those tricky edge cases head-on during development.
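As a rough sketch of what such a Tester might look like (the llm_reply helper, the persona prompts, and the start_session/respond calls on the game master are all placeholders, not our actual code):

# Placeholder persona prompts; the real player profiles are still being designed.
PLAYER_PROFILES = {
    "cooperative": "You are a player who follows the game master's lead and stays on quest.",
    "adversarial": "You are a player who tries to break immersion and derail the story.",
}

class SimulatedPlayer:
    """LLM-based Tester that role-plays one player profile against the game master."""

    def __init__(self, profile, llm_reply, temperature=1.2, max_turns=10):
        self.profile = profile
        self.persona = PLAYER_PROFILES[profile]
        self.llm_reply = llm_reply        # placeholder for whatever chat-completion call we end up using
        self.temperature = temperature    # high temperature -> more varied simulated behavior
        self.max_turns = max_turns

    def play_with(self, system):
        transcript = [("game_master", system.start_session())]
        for _ in range(self.max_turns):
            action = self.llm_reply(self.persona, transcript, temperature=self.temperature)
            transcript.append(("player", action))
            transcript.append(("game_master", system.respond(action)))
        return transcript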

That's the lowdown for now. But hey, we're all ears for your thoughts, comments, and feedback. Stay tuned for our monthly updates where we'll spill all the beans about our progress and any exciting discoveries we make along the way. Hang tight and keep the excitement going!
