
Romina Mendez

GenAI Foundations – Chapter 4: Model Customization & Evaluation – Can We Trust the Outputs?

👉 “Measuring quality and adapting models for real-world use”

Introduction

Generative AI is not just about writing prompts; it is also about measuring whether outputs are useful, safe, and reliable. Evaluation is essential to detect limitations such as inaccuracies, hallucinations, or style mismatches.

When evaluation shows that base models are not enough, customization comes into play. Depending on the problem, this may mean refining prompts, extending knowledge with Retrieval-Augmented Generation (RAG), or adapting models through fine-tuning.


Fine-tuning and Efficient Adaptation Methods

Fine-tuning is the classic approach to specializing foundation models: it starts from a pretrained model with general knowledge and continues its training on data specific to the target domain.

How does it work?

  1. ➡️ Starting point: A pretrained model with general capabilities is used.
  2. 📚 Specialized training: The model is trained on a smaller, domain-specific dataset (e.g., medical texts, legal documents, or technical support conversations).
  3. 🎛️ Parameter adjustment: The model's internal parameters are updated to optimize its performance on the specific task (see the sketch after this list).
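
As a rough illustration of these three steps, here is a minimal sketch of full fine-tuning with the Hugging Face transformers and datasets libraries. The base model, hyperparameters, and the domain_corpus.txt file are illustrative assumptions, not a production recipe.

```python
# Minimal full fine-tuning sketch (assumes: transformers and datasets installed,
# and a hypothetical domain_corpus.txt with one training example per line).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "distilgpt2"                      # 1. pretrained general-purpose model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# 2. small, domain-specific dataset
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=["text"])

# 3. continue training: the model's parameters are updated on the new domain
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-domain", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=5e-5),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("ft-domain/final")
```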

The Fine-tuning Process

The following diagram illustrates the complete fine-tuning workflow, from initial planning to production deployment:

The process follows a systematic approach that begins with defining clear objectives and use cases, then moves to selecting an appropriate base model and preparing high-quality training datasets.

The training phase involves configuring hyperparameters and executing the fine-tuning process with proper validation and performance assessment.

Throughout this workflow, several key considerations are essential (a configuration sketch follows this list):

  • Prioritizing quality over quantity in training data, ensuring it accurately represents the target domain
  • Implementing early stopping and checkpoints to prevent overfitting
  • Using appropriate metrics for validation to measure effectiveness
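
Building on the previous sketch, the configuration below shows one way to wire early stopping, checkpoints, and a validation metric into the same Trainer workflow. It assumes the transformers library and a held-out validation split; all values are illustrative.

```python
# Illustrative TrainingArguments for the considerations above (assumed values).
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="ft-domain",
    num_train_epochs=10,
    eval_strategy="epoch",            # validate every epoch ("evaluation_strategy" in older versions)
    save_strategy="epoch",            # keep checkpoints so training can be rolled back
    load_best_model_at_end=True,      # restore the best checkpoint, not the last one
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                       # model and splits defined as in the previous sketch
    args=args,
    train_dataset=train_split,
    eval_dataset=validation_split,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # stop when eval loss stops improving
)
```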

The final phase involves deploying the model to production and establishing continuous monitoring and observability systems to track performance and detect any degradation over time.


When and Why to Fine-tune?

Fine-tuning is the process of specializing a pretrained model for a specific domain or task. While foundation models have impressive general capabilities, they often need adaptation for:

  • 💬 Specialized vocabulary: Medical, legal, or technical terms that the base model doesn't fully master
  • 🗂️ Specific response styles: Professional tone, structured formats, communication protocols
  • 🎯 Task-specific improvements: Sentiment analysis for product reviews, technical document classification
  • 📄 Regulatory compliance: Adaptation to industry-specific regulations

Fine-tuning Approaches

Full Fine-tuning

Traditional fine-tuning involves continuing the training of the pretrained model using domain-specific data. This process modifies the model's parameters to optimize performance for the specific task.

🧩 Characteristics:

  • Maximum level of specialization possible
  • Requires significant computational resources
  • Considerable training time (days to weeks)
  • Risk of overfitting with limited datasets

✅ Main advantages:
  • Achieves the highest possible level of customization
  • Delivers maximum performance when sufficient data and resources are available


Parameter-Efficient Tuning Methods (PETM)

Because full fine-tuning is often impractical for large-scale models, a set of alternatives known as Parameter-Efficient Tuning Methods (PETM) has been developed. These techniques adapt models with smaller training datasets, lower computational cost, and minimal modifications to the base model's architecture.


LoRA (Low-Rank Adaptation)

LoRA is based on the hypothesis that the changes needed for adaptation have low intrinsic dimensionality. Instead of modifying complete weight matrices, it decomposes the updates into pairs of low-rank matrices (see the sketch after the list below).

✅ Main advantages:
  • Significantly reduces memory requirements (60-80% less than full fine-tuning)
  • Maintains most of the performance of traditional fine-tuning
  • Enables faster and more economical training
  • Facilitates versioning and management of multiple adaptations
  • Especially effective for cases requiring specialization without losing the base model's general capabilities
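
As a rough sketch of the idea, the snippet below wraps a small base model with LoRA adapters using the Hugging Face peft library; the base model and target module names are assumptions and depend on the architecture you adapt.

```python
# LoRA sketch: wrap a frozen base model with small low-rank update matrices (peft assumed).
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("distilgpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                         # rank of the update matrices A (d x r) and B (r x d)
    lora_alpha=16,               # scaling factor applied to the low-rank update
    lora_dropout=0.05,
    target_modules=["c_attn"],   # attention projections for GPT-2-style models (architecture-specific)
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()   # typically well under 1% of the base model's parameters
# The wrapped model can be trained with the same Trainer workflow as before, and the
# resulting adapter weights are stored separately from the frozen base model.
```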

Adapter Tuning

This method introduces small, trainable modules (adapters) between existing transformer layers while keeping the original parameters frozen.

✅ Main advantages:
  • Modularity: allows switching between different specializations.
  • Training stability: provides consistent results.
  • Multi-capability implementation: facilitates deploying multiple specialized capabilities in a single system.
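
Conceptually, an adapter is just a small bottleneck network added on top of a frozen layer's output. The following plain PyTorch sketch shows the general idea (sizes are illustrative; it is not any specific library's implementation).

```python
# Conceptual bottleneck adapter: a small trainable module inserted between frozen layers.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # project down to a small bottleneck
        self.up = nn.Linear(bottleneck, hidden_size)     # project back up to the hidden size
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter only learns a small correction
        # on top of the frozen transformer layer's output.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Only the adapter parameters are trained; the surrounding transformer stays frozen.
adapter = Adapter(hidden_size=768)
x = torch.randn(2, 16, 768)          # (batch, sequence, hidden) activations from a frozen layer
print(adapter(x).shape)              # torch.Size([2, 16, 768])
```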

Prefix Tuning and Prompt Tuning

These methods focus on learning optimal representations that are prepended to inputs, guiding model behavior without modifying its internal parameters.

✅ Main advantages:
  • Minimal base model modification: preserves original capabilities
  • Extremely parameter-efficient: requires minimal additional resources
  • Suitable for resource-constrained scenarios: ideal when computational resources are very limited
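
A minimal sketch of prompt tuning with the peft library is shown below; the number of virtual tokens and the initialization text are illustrative assumptions.

```python
# Prompt tuning sketch: learn a handful of "virtual token" embeddings prepended to every
# input, while the base model's own parameters stay frozen (peft assumed).
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("distilgpt2")

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=16,                          # only 16 x hidden_size new parameters
    prompt_tuning_init=PromptTuningInit.TEXT,       # initialize from a natural-language hint
    prompt_tuning_init_text="Answer as a polite technical support agent:",
    tokenizer_name_or_path="distilgpt2",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()   # orders of magnitude fewer trainable parameters than the base model
```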

Choosing the Right Approach

The selection between full fine-tuning and PETM depends on several factors, including the size and quality of the available dataset, the computational budget, deployment constraints, and how much specialization the task actually requires.

Understanding these trade-offs is essential for making informed decisions about which fine-tuning approach will best serve your specific use case.


Open Source Models

Open source language models have grown rapidly, but running them locally can be challenging due to their large size and the computational resources they require. To address this limitation, the concept of SLMs (Small Language Models) has emerged: smaller, optimized models that can run in local environments with lower hardware requirements and simpler deployment.

A prominent example is the Gemma family, developed by Google DeepMind in collaboration with other Google teams. The 💎 Gemma models are built on the same underlying technology as the Gemini series and are released as open source models. The Gemma family includes different specialized variants, such as Med-Gemma (focused on medical text and image interpretation) and models oriented toward code-related tasks, among others. This diversity allows Gemma to adapt to different domains and research needs while remaining accessible in open source form.
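
To illustrate how accessible these smaller models are, the sketch below loads an instruction-tuned Gemma variant locally with the transformers pipeline. The model identifier is an assumption (Gemma checkpoints are gated and require accepting the license on Hugging Face), and any small open model can be substituted.

```python
# Running a small open model locally with the transformers pipeline.
# Assumes: transformers installed, enough RAM/VRAM for the chosen model,
# and access to the gated google/gemma-2-2b-it checkpoint on Hugging Face.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-2-2b-it")

prompt = "Explain in two sentences what parameter-efficient tuning is."
output = generator(prompt, max_new_tokens=128)
print(output[0]["generated_text"])
```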

As we mentioned before, Med-Gemma is a model specialized in the interpretation of medical text and images. In my article dedicated to Med-Gemma I presented a practical case of local implementation, along with a more detailed explanation of its capabilities and possible applications in hospital and research environments.


Comprehensive Guide to Generative AI Model Evaluation

Evaluating a generative AI system goes beyond checking whether it can produce text or images.

A rigorous evaluation must consider broader dimensions such as the accuracy of the information, the coherence and clarity of the outputs, and their reliability across different contexts.

To address these aspects, modern evaluation frameworks combine complementary approaches: automatic metrics, human review, and newer methods such as LLM-as-a-judge.

What is LLM-as-a-judge?

Within current evaluation practices, one emerging approach is known as LLM-as-a-judge: in this paradigm, a large language model is used to assess the outputs of other models.

Rather than relying exclusively on statistical metrics such as BLEU or ROUGE, or on human annotators, this method leverages the model’s capacity to analyze responses and provide quality judgments.

For instance, an evaluator LLM can be prompted to rate a response along dimensions such as coherence, relevance, truthfulness, or safety, using explicit criteria or reference examples as a baseline.
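
As a hedged sketch of what such an evaluator call can look like, the snippet below uses the openai Python SDK with an illustrative model name and rubric; the same pattern works with any provider or local model.

```python
# LLM-as-a-judge sketch: ask an evaluator model to grade a response against explicit criteria.
# Assumes: openai SDK installed and OPENAI_API_KEY set; model name and rubric are illustrative.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator. Rate the RESPONSE to the QUESTION
on coherence, relevance, truthfulness, and safety, each from 1 to 5.
Return only JSON: {{"coherence": n, "relevance": n, "truthfulness": n, "safety": n, "rationale": "..."}}

QUESTION: {question}
RESPONSE: {response}"""

def judge(question: str, response: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative evaluator model
        temperature=0,         # keep judgments as deterministic as possible
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, response=response)}],
    )
    return json.loads(completion.choices[0].message.content)

scores = judge("What is LoRA?", "LoRA adapts a model by training small low-rank update matrices.")
print(scores)
```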

🎯 Advantages:

  • Scales better than human evaluation.
  • Can capture semantic nuances that n-gram metrics do not detect.
  • Easily integrates into evaluation and continuous monitoring pipelines.

✏️ Challenges:

  • Risk of biases inherited from the evaluator LLM.
  • Dependence on proper prompting to obtain consistent judgments.
  • Need for cross-validation with human judgments for greater reliability.

Evaluation Journey

As we discussed earlier, models need to be evaluated to confirm they behave as expected before going into production, which means understanding how to carry out this process and what to take into account.
Below is a representation of the journey, showing the three fundamental questions you should ask yourself when designing your validation pipeline:

Key Considerations

Two points from the journey above deserve emphasis: always validate performance on your own domain, and log everything you do so that your results remain traceable. Models are trained on very large general-purpose corpora, and within a specific use case certain terms may carry specialized meanings, so validating domain-specific performance is always important.

  • ⚖️ Balance Automation and Manual Review: Find the optimal balance between automated and manual evaluation to ensure both model safety and economic viability.
  • 📚 Domain-Specific Considerations: Always account for domain-specific vocabulary and contexts. Models trained on general data may not understand specialized terminology relevant to your use case.
  • 🔍 Traceability and Logging: Maintain comprehensive logs of all evaluation processes to ensure result traceability and enable continuous improvement (see the sketch after this list).
  • 🎯 Hybrid Approach Benefits: While cloud providers offer convenient solutions, combining multiple frameworks and open-source tools often provides more comprehensive evaluation coverage.
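
For the traceability point, a simple pattern is to append every evaluation record as a JSON line. The sketch below uses an assumed record structure; field names and the file name are illustrative.

```python
# Minimal traceability sketch: append one JSON record per evaluation to a log file.
# Field names, file name, and metadata are illustrative assumptions.
import json
from datetime import datetime, timezone

def log_evaluation(model_id: str, prompt: str, response: str, scores: dict,
                   path: str = "eval_log.jsonl") -> None:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,          # which model/version produced the response
        "prompt": prompt,
        "response": response,
        "scores": scores,              # e.g., output of the LLM-as-a-judge sketch above
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_evaluation("my-finetuned-model-v1", "What is LoRA?",
               "LoRA trains low-rank update matrices.",
               {"coherence": 5, "relevance": 5, "truthfulness": 5, "safety": 5})
```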

What Is an AI Benchmark?

When evaluating generative AI systems, particularly large language models (LLMs), benchmarks serve three core purposes:

  1. Objectivity: They provide standardized metrics that reduce subjectivity in evaluation.
  2. Reproducibility: Results can be reliably reproduced by others using the same setup.
  3. Progress Tracking: They allow researchers to track the evolution of model performance over time.

Still, it's important to choose the right benchmark based on what you're trying to evaluate: reasoning, factuality, safety, coding ability, and so on.

Types of Evaluation Datasets

Depending on your evaluation goals, you might use:

🌍 Public Benchmarks

Designed for general-purpose evaluation (e.g., reasoning, truthfulness, math, safety), these provide standardized baselines across the research community.

Key Examples:

  1. MMLU (Massive Multitask Language Understanding): General knowledge and reasoning across 57 academic subjects. It covers a wide variety of language tasks, encourages model generalization, and provides standard metrics across the research community.
  2. TruthfulQA: Tests factuality and resistance to hallucination.
  3. HELM (Holistic Evaluation of Language Models): Comprehensive evaluation across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency
  4. ToxiGen: Measuring implicit hate speech detection across different demographic groups
  5. GSM8K (Grade School Math 8K): Mathematical reasoning with grade-school level word problems.
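
As a sketch of how public benchmarks like these are typically run, the snippet below uses the open-source lm-evaluation-harness package (imported as lm_eval). Treat it as an assumed outline: the exact API, argument names, and task identifiers vary between versions.

```python
# Running public benchmarks with lm-evaluation-harness (package name: lm_eval).
# The call below follows recent versions; argument names and task identifiers may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=distilgpt2",           # illustrative small model
    tasks=["mmlu", "truthfulqa_mc2", "gsm8k"],    # assumed task names for the benchmarks above
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])   # per-task metrics (accuracy, etc.)
```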

🏥 Domain-Specific Datasets

Tailored to specific industries such as 🩺 healthcare, ⚖️ law, or 🏦 finance.

These datasets help assess model suitability in professional or regulated environments.

One example is 🩺 HealthBench, an open benchmark for evaluating performance on real healthcare language data. Developed in collaboration with 262 physicians from 60 countries, HealthBench includes 5,000 realistic health conversations, each accompanied by a custom rubric created by a physician to grade model responses.


🛠️ Custom Datasets

Created by organizations from internal user data or specific use cases, these datasets offer contextually relevant insight into the real-world environment where the model will be deployed.
When to develop custom datasets:

  • Public benchmarks don't capture your specific domain or use case
  • You have sufficient high-quality internal data
  • The investment in dataset creation is justified by improved evaluation accuracy
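
A minimal sketch of what a custom evaluation set and scoring loop can look like is shown below; the records, the model_answer callable, and the exact-match metric are all illustrative assumptions.

```python
# Custom evaluation sketch: a few internal question/reference pairs plus a simple metric.
# `model_answer` is a hypothetical callable that wraps whatever model you are evaluating.
custom_eval_set = [
    {"prompt": "What is our refund window for enterprise plans?", "reference": "30 days"},
    {"prompt": "Which API version is deprecated this quarter?", "reference": "v1"},
]

def exact_match(prediction: str, reference: str) -> bool:
    return reference.strip().lower() in prediction.strip().lower()

def evaluate(model_answer, dataset) -> float:
    hits = sum(exact_match(model_answer(item["prompt"]), item["reference"]) for item in dataset)
    return hits / len(dataset)

# Example with a stand-in model; replace the lambda with a real inference call.
score = evaluate(lambda prompt: "Enterprise refunds are accepted within 30 days.", custom_eval_set)
print(f"exact-match accuracy: {score:.2f}")
```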

☁️ Service Providers

Cloud platforms like AWS (via Bedrock), Google (Vertex AI), and Microsoft (Azure AI) all emphasize dataset management and offer flexibility to use both public and custom data.
These platforms typically provide:

  • Pre-integrated access to popular public benchmarks
  • Tools for uploading and managing custom evaluation datasets
  • Automated evaluation pipelines that can run against multiple benchmark types
  • Comparative analysis across different model versions or configurations


🔮 What’s Next?

Evaluation and customization provide the foundation for scaling AI solutions responsibly. In the final step, we bring everything together into a structured framework for planning. You can continue with the next chapter in this series: Chapter 5: AI Project Planning – The Generative AI Canvas.


📖 Series Overview

You can find the entire series on my Profile:

  • ✏️ GenAI Foundations – Chapter 1: Prompt Basics – From Theory to Practice
  • 🧩 GenAI Foundations – Chapter 2: Prompt Engineering in Action – Unlocking Better AI Responses
  • 📚 GenAI Foundations – Chapter 3: RAG Patterns – Building Smarter AI Systems
  • ✅ GenAI Foundations – Chapter 4: Model Customization & Evaluation – Can We Trust the Outputs?
  • 🗂️ GenAI Foundations – Chapter 5: AI Project Planning – The Generative AI Canvas

