Thomas Ezan

LLM evaluation: a quick overview of Stax

The views and opinions expressed on this blog are my own and do not reflect those of my employer. Additionally, any solutions, APIs, or products mentioned are for informational and discussion purposes only and should not be considered an endorsement.

Developing applications with Generative AI and Large Language Models (LLMs) often feels more like an art than a precise science. Developers spend significant time manually testing prompts and rely on subjective assessments or intuition to decide whether the model meets their requirements.
What if you could skip that manual, subjective process and instead take a data-driven, repeatable approach? 🤔

That is what Stax is for! ✨

Stax landing page

What is Stax?

Stax is a developer tool from Google Labs designed to simplify LLM and AI application testing. It enables you to test models and prompts against your own use cases.

Who is Stax for?

Stax is built primarily for product managers and app developers building LLM-powered apps or features. With Stax you can:

  1. Create a dataset of prompts tailored to your specific use cases for testing various models.
  2. Use this dataset to evaluate various LLMs with clear metrics to determine whether your AI application works as intended before shipping it to real users.

The platform makes it easy to create and work with the two building blocks of effective AI evaluation: datasets and evaluators.

Create a dataset

In Stax, datasets are defined as collections of prompts. These prompts are used to generate responses from LLMs, which are then evaluated. First, select the model you want to evaluate.

Model selection

Then, type a few user prompts into the playground to build your evaluation benchmark. If you already have data, you can also upload it as a CSV.

Playground to input prompts
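
If you're assembling that CSV programmatically, here is a minimal sketch. The single `prompt` column is an assumption on my part; check Stax's upload dialog for the exact columns it expects.

```python
import csv

# Hypothetical news-summarization prompts, one per row.
# The single "prompt" header is an assumption; confirm the
# expected format in Stax's CSV upload dialog.
prompts = [
    "Summarize this article in two sentences: <article text>",
    "Give a one-paragraph summary of the following story: <article text>",
    "List the key facts of this report as bullet points: <article text>",
]

with open("eval_dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt"])
    for p in prompts:
        writer.writerow([p])
```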

Evaluate the model’s responses

Once you've produced outputs, you can start evaluating them. To do so, select the evaluator that best fits your use case. In our case, we are building a news summarization app, so we'll use the pre-defined Summarization Quality evaluator. You can also create a custom one.

Evaluator selection
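
To give a feel for what a custom evaluator might capture, here is a hypothetical LLM-as-judge rubric for the news-summarization use case. The criteria and scoring scale are illustrative only; they are not Stax's built-in Summarization Quality evaluator.

```python
# Illustrative rubric for a hypothetical custom evaluator.
# Stax's custom-evaluator setup may differ; this only sketches the
# kind of criteria an LLM-as-judge rubric typically encodes.
CUSTOM_SUMMARY_RUBRIC = """
You are grading a news summary. Score the response from 1 to 5 on:
- Faithfulness: it states no facts absent from the source article.
- Coverage: the main who/what/when/where of the story is present.
- Brevity: at most three sentences, with no filler.
Return only the integer score.
"""
```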

When the evaluation is done, you can visualize individual and aggregate metrics, letting you compare models or prompt iterations against each other on both quality and latency.

Evaluation results
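
Stax computes these metrics for you, but to make the comparison concrete, here is a rough sketch of the kind of aggregation behind that view. The models, scores, and latencies below are made up for illustration.

```python
# Toy per-response results for two models on the same dataset:
# each tuple is (quality score out of 5, latency in seconds).
results = {
    "model-a": [(4, 0.9), (5, 1.1), (3, 0.8)],
    "model-b": [(5, 1.6), (4, 1.8), (5, 1.5)],
}

# Aggregate per model, mirroring the quality-vs-latency comparison
# that Stax surfaces in its results view.
for model, rows in results.items():
    avg_score = sum(score for score, _ in rows) / len(rows)
    avg_latency = sum(latency for _, latency in rows) / len(rows)
    print(f"{model}: mean quality {avg_score:.2f}, mean latency {avg_latency:.2f}s")
```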

Conclusion

Stax enables you to make data-driven decisions, helping you select the optimal model for both quality and speed. With Stax, you can easily rerun your datasets and evaluators whenever you want to experiment with a new model. This makes it an excellent model-agnostic platform for building AI-enabled features with confidence. 💪
