<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dasha Maliugina</title>
    <description>The latest articles on DEV Community by Dasha Maliugina (@dasha_maliugina).</description>
    <link>https://dev.to/dasha_maliugina</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1724969%2Ff939e9a3-1fd2-4b40-86b6-21fec82cbf0b.jpg</url>
      <title>DEV Community: Dasha Maliugina</title>
      <link>https://dev.to/dasha_maliugina</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dasha_maliugina"/>
    <language>en</language>
    <item>
      <title>💡 10 learnings on LLM evaluations</title>
      <dc:creator>Dasha Maliugina</dc:creator>
      <pubDate>Wed, 14 May 2025 19:29:44 +0000</pubDate>
      <link>https://dev.to/dasha_maliugina/10-learnings-on-llm-evaluations-1lm6</link>
      <guid>https://dev.to/dasha_maliugina/10-learnings-on-llm-evaluations-1lm6</guid>
      <description>&lt;p&gt;If you are building with LLMs, you already know this: evaluation is hard. And it's nothing like testing regular software. You can’t just throw in some unit tests and call it a day. &lt;/p&gt;

&lt;p&gt;Here are 10 key ideas on LLM evaluations, sourced from &lt;a href="https://www.evidentlyai.com/llm-evaluation-course-practice" rel="noopener noreferrer"&gt;LLM evaluation for builders: applied course&lt;/a&gt; (It’s a free and open course with YouTube videos and code examples).&lt;/p&gt;

&lt;h2&gt;
  
  
  1. LLM evaluation ≠ Benchmarking
&lt;/h2&gt;

&lt;p&gt;Evaluating LLM systems is not the same as evaluating LLM models. Public &lt;a href="https://www.evidentlyai.com/llm-guide/llm-benchmarks" rel="noopener noreferrer"&gt;LLM benchmarks&lt;/a&gt; test general model capabilities, like math, code, or reasoning. But if you're building an LLM application, you need to evaluate it on your specific use cases.&lt;/p&gt;

&lt;p&gt;For example, for a customer support chatbot, you need to test how well it grounds its answers in the company’s policies or handles tricky queries, like comparing your product against competitors. What matters is not just how it performs on MMLU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60l566je3rexl78nv4k6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60l566je3rexl78nv4k6.png" alt="LLM benchmarks vs LLM evals" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Evaluation is a tool, not a task
&lt;/h2&gt;

&lt;p&gt;LLM evaluation isn’t just a checkbox. It is a tool that helps answer product questions and supports decision-making throughout the product lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;During the experiments&lt;/strong&gt;, evals help you compare different prompts, models, and settings to determine what works best. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Before deployment&lt;/strong&gt;, you can run stress-testing and red-teaming to check how your system handles edge cases.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Once in production&lt;/strong&gt;, you need to monitor how well your product is doing in the wild.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;When you make changes&lt;/strong&gt;, you need to run regression tests before shipping the updates.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, to design evaluations well, you first need to figure out what question you are trying to answer!&lt;/p&gt;
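&lt;p&gt;As an illustration of the regression-testing step, here is a minimal Python sketch that compares new outputs against approved baseline answers on a fixed test set. The questions, answers, and exact-match criterion are made up for this example:&lt;/p&gt;

```python
# Minimal regression check before shipping a prompt update:
# compare new outputs against approved baseline outputs on a fixed test set.
# The data below is illustrative, not from a real product.
baseline = {"q1": "Refunds are accepted within 30 days.",
            "q2": "Shipping takes 3-5 business days."}
new_outputs = {"q1": "Refunds are accepted within 30 days.",
               "q2": "Shipping usually takes 3 to 5 business days."}

def regression_pass_rate(baseline, new_outputs):
    """Share of test cases where the new output still matches the baseline."""
    matches = sum(baseline[k] == new_outputs.get(k) for k in baseline)
    return matches / len(baseline)

rate = regression_pass_rate(baseline, new_outputs)
print(f"pass rate: {rate:.0%}")  # flag runs below a chosen threshold for review
```

&lt;p&gt;In practice you would swap exact string comparison for fuzzier checks, like semantic similarity or an LLM judge, since correct wording can legitimately change between runs.&lt;/p&gt;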

&lt;h2&gt;
  
  
  3. LLM evaluation ≠ Software testing
&lt;/h2&gt;

&lt;p&gt;Unlike traditional software, LLM systems are non-deterministic. That means that the same input can yield different outputs. Also, LLM products often solve open-ended tasks, like writing an email, that do not have a single correct answer. &lt;/p&gt;

&lt;p&gt;In addition, LLM systems bring a whole new set of risks. They can hallucinate and confidently make things up. Malicious users can attempt to jailbreak LLM apps and bypass their security measures. And LLM apps can leak sensitive data through their inputs and outputs.&lt;/p&gt;

&lt;p&gt;Testing for functionality is no longer enough. You must also evaluate the quality and safety of your LLM system’s responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Combine manual and automated evaluations
&lt;/h2&gt;

&lt;p&gt;You should always start with manual review to build intuition. Your goal is to understand what “good” means for your use case and spot patterns: are there any common failure modes or unexpected behavior? &lt;/p&gt;

&lt;p&gt;Once you know what you’re looking for, you can add automation. But the important thing is that automated LLM evals are here to scale human judgment, not replace it.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Use both reference-based and reference-free evals
&lt;/h2&gt;

&lt;p&gt;There are two main types of &lt;a href="https://www.evidentlyai.com/llm-guide/llm-evaluation-metrics" rel="noopener noreferrer"&gt;LLM evaluation methods&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference-based evals&lt;/strong&gt; compare your system’s outputs to expected – or “ground-truth” – answers, which is great for regression testing or experiments. You can use methods such as exact match, semantic similarity, BERTScore, and LLM-as-a-judge.&lt;/p&gt;
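&lt;p&gt;Here is a minimal Python sketch of two reference-based checks. The example pairs are made up, and the token-overlap function is only a crude stand-in for real semantic similarity (which would normally use embeddings or BERTScore):&lt;/p&gt;

```python
def exact_match(output: str, reference: str) -> bool:
    """Strict reference-based check: normalized strings must be identical."""
    return output.strip().lower() == reference.strip().lower()

def token_overlap(output: str, reference: str) -> float:
    """Crude stand-in for semantic similarity: Jaccard overlap of word sets."""
    a, b = set(output.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

# Evaluate a batch of (system output, ground-truth answer) pairs
pairs = [
    ("The refund window is 30 days.", "Refunds are accepted within 30 days."),
    ("Paris", "Paris"),
]
for out, ref in pairs:
    print(exact_match(out, ref), round(token_overlap(out, ref), 2))
```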

&lt;p&gt;&lt;strong&gt;Reference-free evals&lt;/strong&gt; assess specific qualities of the response, like helpfulness or tone. These are useful in open-ended scenarios and production monitoring. You can use text statistics, regular expressions, ML models, and LLM judges. &lt;/p&gt;
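&lt;p&gt;Two of the simplest reference-free checks, regular expressions and text statistics, can be sketched in a few lines of Python. The forbidden phrases and the sample reply are invented for illustration:&lt;/p&gt;

```python
import re

def has_forbidden_phrase(text: str) -> bool:
    """Regex check: flag responses that claim to give medical or legal advice."""
    return bool(re.search(r"\b(medical advice|legal advice)\b", text, re.I))

def response_stats(text: str) -> dict:
    """Simple text statistics: word count and average word length."""
    words = text.split()
    return {
        "n_words": len(words),
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
    }

reply = "I can't give medical advice, but here is some general info."
print(has_forbidden_phrase(reply))  # the regex fires on "medical advice"
print(response_stats(reply))
```

&lt;p&gt;Checks like these run on every response with no ground truth needed, which is exactly what makes them useful for production monitoring.&lt;/p&gt;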

&lt;p&gt;As you’ve probably guessed, you’ll need both types!&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Think in datasets, not unit tests
&lt;/h2&gt;

&lt;p&gt;Traditional testing is built around unit tests. With LLMs, it’s more useful to think in datasets. You need to test for a range of acceptable behaviors, so it’s not enough to run evaluations on a single example. &lt;/p&gt;

&lt;p&gt;You may need to create diverse test sets, including happy paths, edge cases, and adversarial examples. &lt;a href="https://www.evidentlyai.com/llm-guide/llm-test-dataset-synthetic-data" rel="noopener noreferrer"&gt;Good evaluation datasets&lt;/a&gt; reflect both how users interact with your app in the real world and where things can go wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb48lq70k48kyz3430z40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb48lq70k48kyz3430z40.png" alt="Edge case example" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  7. LLM-as-a-judge is a key evaluation method
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.evidentlyai.com/llm-guide/llm-as-a-judge" rel="noopener noreferrer"&gt;LLM-as-a-judge&lt;/a&gt; is a common technique to evaluate LLM-powered products. The idea is simple: you can use another LLM (or the same one!) to evaluate your system’s response with a custom evaluation prompt. For example, you can ask the judge whether your chatbot’s response is polite or aligns with the brand image. &lt;/p&gt;

&lt;p&gt;This approach is scalable and surprisingly effective. Just remember, LLM judges aren’t perfect. You’ll need to assess and tune them, and invest in designing the evaluation criteria.&lt;/p&gt;
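&lt;p&gt;A judge boils down to an evaluation prompt plus a parser for the verdict. Here is a hedged Python sketch with an invented prompt template and a binary PASS/FAIL criterion; the actual call to a judge model is left out:&lt;/p&gt;

```python
# An illustrative judge prompt; in a real setup you would tune this wording
# and validate the judge against human labels.
JUDGE_TEMPLATE = """You are evaluating a customer support chatbot.
Question: {question}
Response: {response}
Is the response polite and on-brand? Answer with a single word: PASS or FAIL."""

def build_judge_prompt(question: str, response: str) -> str:
    """Fill the evaluation prompt for the judge model."""
    return JUDGE_TEMPLATE.format(question=question, response=response)

def parse_verdict(judge_output: str) -> bool:
    """Map the judge's raw text to a boolean; anything unclear counts as FAIL."""
    return judge_output.strip().upper().startswith("PASS")

prompt = build_judge_prompt("Where is my order?", "It shipped yesterday!")
# In a real pipeline you would send `prompt` to an LLM and parse its reply:
print(parse_verdict("PASS - friendly and accurate"))
```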

&lt;h2&gt;
  
  
  8. Use custom criteria, not just generic metrics
&lt;/h2&gt;

&lt;p&gt;Since LLMs often solve highly specific tasks, you must design quality criteria that map to your use case. You can’t blindly apply generic metrics like coherence or helpfulness without thinking critically about what actually matters.&lt;/p&gt;

&lt;p&gt;Instead, you must define what “good” means for your app. Then, customize your evaluation to your domain, users, and specific risks. For a legal assistant, you can check whether the answer cites the correct regulations. For a wellness chatbot, you may need to ensure it answers in a friendly manner and does not provide medical advice. &lt;/p&gt;

&lt;h2&gt;
  
  
  9. Start with analytics
&lt;/h2&gt;

&lt;p&gt;Evaluation is, at its core, an analytical task.&lt;/p&gt;

&lt;p&gt;To run LLM evals, you first need the data. So log everything: capture all the inputs and outputs, record metadata like model version and prompt, and track user feedback if you have it. If your app doesn’t have real users yet, you can start with synthetic data and grow from there.&lt;/p&gt;
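&lt;p&gt;“Log everything” can be as simple as writing one structured record per LLM call. The field names below are illustrative, not a fixed schema; adapt them to your own tracing setup:&lt;/p&gt;

```python
import json
import time
import uuid

def log_llm_call(prompt: str, response: str, model: str, prompt_version: str,
                 user_feedback=None) -> dict:
    """Build a structured log record for one LLM call."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "prompt": prompt,
        "response": response,
        "model": model,                   # metadata: which model served the call
        "prompt_version": prompt_version, # metadata: which prompt was active
        "user_feedback": user_feedback,   # e.g. thumbs up/down, may be None
    }
    # In practice you would write this to a database or tracing backend;
    # here we just serialize it to a JSON line.
    print(json.dumps(record))
    return record

log_llm_call("Where is my order?", "It shipped yesterday.",
             model="example-model", prompt_version="v3")
```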

&lt;p&gt;You also need to manually analyze the outputs you get to determine your criteria and understand the failure modes you observe.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Evaluation is a moat
&lt;/h2&gt;

&lt;p&gt;Building a solid LLM evaluation system is an investment. But it is also a competitive advantage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rapid iteration.&lt;/strong&gt; Evals speed up your AI product development cycle, letting you ship updates stress-free, switch models easily, and debug issues efficiently. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Safe deployment.&lt;/strong&gt; Evaluations allow you to test how an LLM system handles edge cases to avoid liability risks and protect customers from harmful outputs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Product quality at scale.&lt;/strong&gt; Finally, evals help ensure your LLM app works well and provides a good customer experience. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your competitors can’t copy that — even if they use the same model for their LLM app!&lt;/p&gt;

&lt;h2&gt;
  
  
  🔥 Free course on LLM evaluations
&lt;/h2&gt;

&lt;p&gt;Learn how to create LLM judges, evaluate RAG systems, and run adversarial tests. The course is designed for AI/ML Engineers and those building real-world LLM apps; basic Python skills are required. And yes, it’s free! &lt;a href="https://www.evidentlyai.com/llm-evaluation-course-practice" rel="noopener noreferrer"&gt;Learn more and sign up&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>genai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>20 examples of LLM-powered applications in the real world</title>
      <dc:creator>Dasha Maliugina</dc:creator>
      <pubDate>Wed, 03 Jul 2024 17:13:55 +0000</pubDate>
      <link>https://dev.to/dasha_maliugina/20-examples-of-llm-powered-applications-in-the-real-world-p8c</link>
      <guid>https://dev.to/dasha_maliugina/20-examples-of-llm-powered-applications-in-the-real-world-p8c</guid>
      <description>&lt;p&gt;The recent advancements in LLMs improved their performance and made them more affordable – this unlocked multiple possibilities for companies to integrate LLMs into their products. Indeed, there have been a lot of impressive demos. But how do companies actually use LLMs in production? &lt;/p&gt;

&lt;p&gt;We put together and regularly update a &lt;a href="https://www.evidentlyai.com/ml-system-design"&gt;database&lt;/a&gt; of 450 use cases from 100+ companies that detail real-world applications and insights from ML and LLM system design. In this blog, we share 20 selected examples of LLM-powered products from various industries. &lt;/p&gt;

&lt;p&gt;The database is maintained by the team behind Evidently, an open-source tool for LLM and ML evaluation and observability. &lt;a href="https://github.com/evidentlyai/evidently"&gt;Give us a star on GitHub&lt;/a&gt; to support the project! &lt;/p&gt;

&lt;h2&gt;
  
  
  👷 &lt;a href="https://www.linkedin.com/blog/engineering/skills-graph/extracting-skills-from-content"&gt;LinkedIn extracts skill information from texts&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;They extract skills from various content across the platform and map these skills to their Skills Graph to ensure accurate job and learning matches.&lt;/p&gt;

&lt;h2&gt;
  
  
  🗝 &lt;a href="https://security.googleblog.com/2024/04/accelerating-incident-response-using.html"&gt;Google speeds up security and privacy incidents workflows&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;They use LLMs to summarize incidents for different audiences, including executives, leads, and partner teams. It saves responders’ time and improves the quality of incident summaries. &lt;/p&gt;

&lt;h2&gt;
  
  
  🏪 &lt;a href="https://blog.picnic.nl/enhancing-search-retrieval-with-large-language-models-llms-7c3748b26d72"&gt;Picnic improves search relevance for product listings&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;They leverage LLMs to enhance product and recipe search retrieval for users across three countries, each with its own language and culinary preferences. &lt;/p&gt;

&lt;h2&gt;
  
  
  🙅 &lt;a href="https://engineeringblog.yelp.com/2024/03/ai-pipeline-inappropriate-language-detection.html"&gt;Yelp detects inappropriate language in reviews&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The company enhanced its content moderation system with LLMs to help identify egregious instances of threats, harassment, lewdness, personal attacks, or hate speech.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚗 &lt;a href="https://www.uber.com/en-GB/blog/generative-ai-for-high-quality-mobile-testing/?uclick_id=13598f30-43e0-466c-a42c-347f4bab3bbf"&gt;Uber tests mobile applications&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;They created DragonCrawl, a system that uses LLMs to execute mobile tests with the intuition of a human. It saves thousands of developer hours and reduces test maintenance costs.&lt;/p&gt;

&lt;h2&gt;
  
  
  #️⃣ &lt;a href="https://engineering.grab.com/llm-powered-data-classification"&gt;Grab automatically tags sensitive data&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;They use LLMs to classify data entities, identify sensitive data, and assign the most appropriate tag to each entity. &lt;/p&gt;

&lt;h2&gt;
  
  
  🛒 &lt;a href="https://tech.instacart.com/scaling-productivity-with-ava-instacarts-internal-ai-assistant-ed7f02558d84"&gt;Instacart builds an internal AI assistant&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Teams use an internal AI assistant called Ava to write, review and debug code, improve communications, and build AI-enabled internal tools on top of the company’s APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  🛍 &lt;a href="https://medium.com/whatnot-engineering/how-whatnot-utilizes-generative-ai-to-enhance-trust-and-safety-c7968eb6315e"&gt;Whatnot detects marketplace spam&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;They use LLMs to enhance trust and safety areas like multimodal content moderation, fulfillment, bidding irregularities, and general fraud protection. &lt;/p&gt;

&lt;h2&gt;
  
  
  💌 &lt;a href="https://engblog.nextdoor.com/let-ai-entertain-you-increasing-user-engagement-with-generative-ai-and-rejection-sampling-50a402264f56"&gt;Nextdoor generates engaging email subject lines&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The company aims to generate informative and engaging subject lines that will lead to more email opens, clicks, and eventually more sessions on the platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  🍿 &lt;a href="https://medium.com/vimeo-engineering-blog/from-idea-to-reality-elevating-our-customer-support-through-generative-ai-101a2c5ea680"&gt;Vimeo builds customer support AI assistant&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;They prototyped a help desk chatbot where customers input their questions and receive immediate, accurate, and personalized responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  🤖 &lt;a href="https://www.godaddy.com/resources/news/llm-from-the-trenches-10-lessons-learned-operationalizing-models-at-godaddy"&gt;GoDaddy classifies support inquiries&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;GoDaddy leverages LLMs to enhance customer experience in their messaging channels by classifying support inquiries. They share lessons learned operationalizing these models.&lt;/p&gt;

&lt;h2&gt;
  
  
  🗞 &lt;a href="https://tech.olx.com/extracting-job-roles-in-job-ads-a-journey-with-generative-ai-e8b8cf399659"&gt;OLX extracts information from job listings&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;They use Prosus AI Assistant, their generative AI (GenAI) model, to extract job roles in job ads and ensure a closer alignment between job seekers’ desired jobs and the relevant listings.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔢 &lt;a href="https://www.honeycomb.io/blog/we-shipped-ai-product"&gt;Honeycomb helps users write data queries&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The company built Query Assistant to accelerate users’ learning curve associated with queries. Users can describe or ask things in plain English like “slow endpoints by status code” and Query Assistant will generate a relevant Honeycomb query to iterate on.&lt;/p&gt;

&lt;h2&gt;
  
  
  📦 &lt;a href="https://doordash.engineering/2024/04/23/building-doordashs-product-knowledge-graph-with-large-language-models/"&gt;DoorDash extracts product attributes from unstructured SKU data&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;They use LLMs to extract and tag product attributes from raw merchant data. This makes it easy to match customer queries with relevant items on DoorDash and helps delivery drivers find the correct product in the store. &lt;/p&gt;

&lt;h2&gt;
  
  
  ⚠️ &lt;a href="https://incident.io/blog/lessons-learned-from-building-our-first-ai-product"&gt;Incident.io generates summaries of software incidents&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Incident.io helps teams collaborate on software incidents by suggesting and updating the incident summary. The suggestion considers the latest update, the conversation in the Slack channel, and the previous summary. Half of all summary updates in Incident.io are now written by AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  🪡 &lt;a href="https://multithreaded.stitchfix.com/blog/2023/03/06/expert-in-the-loop-generative-ai-at-stitch-fix/"&gt;StitchFix generates ad headlines and product descriptions&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The company combines algo-generated text with a human expert-in-the-loop approach to streamline crafting engaging advertisement headlines and producing high-fidelity product descriptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  💳 &lt;a href="https://digits.com/developer/posts/assisting-accountants-with-generative-machine-learning/"&gt;Digits suggests questions about banking transactions&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;They use generative models to assist their customers – accountants – by suggesting questions about a transaction to a client. The accountants can then send the question to their client as is, or edit it without typing everything from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧑‍🏫 &lt;a href="https://blog.duolingo.com/large-language-model-duolingo-lessons/"&gt;Duolingo generates content for lessons&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The company leverages LLMs to help their learning designers come up with relevant exercises for lessons. Human experts plan out the theme, grammar, vocabulary, and exercise types for a given lesson and the model outputs relevant exercises.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏠 &lt;a href="https://www.zillow.com/tech/using-ai-to-understand-the-complexities-and-pitfalls-of-real-estate-data/"&gt;Zillow detects discriminatory content in real-estate listings&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;The company uses LLMs to understand whether real-estate listings contain proxies for race and other remnants of historical inequalities in the real estate domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  🍲 &lt;a href="https://bytes.swiggy.com/improving-search-relevance-in-hyperlocal-food-delivery-using-small-language-models-ecda2acc24e6"&gt;Swiggy improves search relevance in hyperlocal food delivery&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;They use LLMs to match search queries in a variety of languages against millions of dish names with regional variations. &lt;/p&gt;

&lt;h2&gt;
  
  
  Want more examples of LLM systems in production?
&lt;/h2&gt;

&lt;p&gt;Check out our &lt;a href="https://www.evidentlyai.com/ml-system-design"&gt;database&lt;/a&gt; of 450 use cases from 100+ companies that share their learnings from implementing ML and LLM systems. Bookmark the list and enjoy the reading!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>genai</category>
    </item>
  </channel>
</rss>
