Feedback on “Domain-Specific Small Language Models” from Manning.com
I just wrapped up the latest chapters of “Domain-Specific Small Language Models” from Manning. It is a high-quality technical deep dive into building efficient AI, and it echoes IBM’s stance that the future isn’t just about “bigger” models, but about smarter, specialized ones that are cost-effective and transparent. (Note: this is an independent review, with no affiliation with the publisher or author.)
Introduction: IBM’s Strategic View on SLMs
IBM posits that for Generative AI to be successful in the enterprise, it must move beyond the “hype” of massive models and focus on three pillars: efficiency, transparency, and flexibility.
- Efficiency: Larger models are becoming prohibitively expensive and energy-intensive; SLMs (generally defined as models under 30 billion parameters) offer a cost-effective way to scale AI for specific tasks.
- Transparency: IBM emphasizes the need for “cleaned, filtered datasets” to reduce bias and ensure trustworthy results, as seen in their Granite™ family of models.
- Flexibility: Open-source SLMs allow organizations to infuse proprietary data into specialized models, achieving task-specific performance that rivals larger frontier models at 3 to 23 times lower cost.
The Economics of Size: Scaling Responsibly
Image from “Domain-Specific Small Language Models”
As IBM emphasizes, successful Generative AI requires performance that is cost-effective for scaling. We cannot ignore that the cost of training, inferencing, and interacting with an LLM is orders of magnitude higher than that of an SLM. For many specific tasks — like document classification, customer service, or internal HR tools — a general-purpose LLM is essentially “blasting a cannon to swat a mosquito.”
By utilizing an SLM or a fine-tuned model like IBM Granite, companies can achieve comparable performance on task-specific goals at a fraction of the cost. Beyond the financial benefits, this shift represents a “greener” path for the planet: SLMs require less energy both for the initial training phase and for day-to-day inference. Transitioning to specialized, smaller models allows organizations to unlock AI’s true value while meeting critical corporate environmental goals and reducing their overall carbon footprint.
Image from “Domain-Specific Small Language Models”
Book Synthesis: Building Domain-Specific SLMs
The book Domain-Specific Small Language Models by Guglielmo Iozzia serves as a technical deep dive into the practical application of the principles IBM advocates. It moves from theoretical foundations to production-ready implementations of SLMs.
1. Technical Foundations and Optimization
- The Architecture of “Small”: SLMs utilize the same Transformer architecture as LLMs but are optimized for a smaller memory footprint and lower computational requirements.
- Quantization: A core focus of the book is making SLMs run on commodity hardware (like laptops or edge devices) through techniques like 4-bit quantization (using libraries like ggml or GPTQ) to reduce memory usage without significantly sacrificing accuracy (a minimal 4-bit loading sketch follows this list).
- Inference Speed: The book explores frameworks like ONNX Runtime and FlexGen to accelerate inference, ensuring that models can operate in near-real-time environments.
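To make the quantization idea concrete, here is a minimal sketch of 4-bit loading. It is not a listing from the book (which works with ggml and GPTQ) but the equivalent bitsandbytes NF4 route exposed by the Hugging Face Transformers API; it assumes a CUDA GPU with the bitsandbytes and accelerate packages installed, and the Granite model id is only an illustrative choice.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization: roughly a 4x memory reduction compared to fp16 weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Illustrative model id (assumption): any small causal LM from the HF Hub works here.
model_id = "ibm-granite/granite-3.0-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

prompt = "Classify the sentiment (positive/negative): 'The onboarding process was smooth.'"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same quantized checkpoint can then be served with a fraction of the memory a full-precision deployment would need, which is exactly the "small memory footprint" argument the chapter makes.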
2. Domain Adaptation Strategies
- Fine-Tuning: The author details the process of adapting pre-trained models (like DistilBERT) to specific industry tasks, such as question answering, by adjusting model weights on specialized, labeled datasets.
- RAG and Agentic AI: The text explains that SLMs are often more valuable when integrated into complex systems like Retrieval-Augmented Generation (RAG) or Agentic AI, where they serve as efficient “reasoning engines” for narrow tasks (a minimal RAG sketch follows this list).
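To illustrate the RAG pattern only (this is a toy sketch, not the author's code), the snippet below uses a tiny lexical retriever (TF-IDF, standing in for a real vector store) to pick the most relevant passage and hands it to FLAN-T5 Small acting as the reasoning engine. The documents and the question are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

documents = [
    "Employees accrue 2 vacation days per month during their first year.",
    "Expense reports must be submitted within 30 days of purchase.",
    "The VPN client must be updated every quarter for security compliance.",
]
question = "How many vacation days do new employees get per month?"

# 1. Retrieve the most relevant document with a simple lexical similarity search.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
context = documents[scores.argmax()]

# 2. Let the small model answer, grounded in the retrieved context.
model_id = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
prompt = f"Answer the question using the context.\nContext: {context}\nQuestion: {question}"
inputs = tokenizer(prompt, return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```

In a production system the TF-IDF step would be replaced by an embedding model plus a vector database, but the division of labor is the same: retrieval narrows the context, and the SLM only has to reason over it.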
3. Real-World Industry Use Cases
- Coding Assistance: Examples include generating Python code and optimizing it for local execution.
- Biotech and Chemistry: The book showcases specialized SLMs for generating protein structures and crystal structures, proving that SLMs can handle complex, non-natural language data.
Focused Feedback: The Efficiency of ONNX
Image from “Domain-Specific Small Language Models”
In my review of ‘Domain-Specific Small Language Models,’ I found Chapter 5, ‘Exploring ONNX,’ to be the most critical for anyone serious about sustainable AI. While we often hear that we don’t necessarily need a massive LLM, this chapter explains the technical ‘magic’ that makes SLMs truly viable for the enterprise.
Iozzia demonstrates how converting models to the ONNX format allows them to run across different hardware — from specialized servers to standard office laptops — without losing performance. This directly addresses the need for economic and greener AI:
- Reduced Inference Costs: By using the ONNX Runtime, models are optimized to execute faster with fewer compute resources.
- Hardware Agnostic: You aren’t locked into expensive, power-hungry GPUs; you can run specialized models on CPUs or edge devices.
- Environmental Impact: Faster execution means less electricity consumed per query.
As I mentioned previously (with no affiliation to Manning or the author), this chapter transforms the theoretical “Small is Good” argument into a practical, deployable reality. If an organization wants to follow IBM’s lead in using models like Granite efficiently, the ONNX optimization workflows described here are the blueprint; a minimal conversion sketch follows below.
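As a rough illustration (not necessarily the exact workflow from Chapter 5), one common way to do this conversion is through the Hugging Face Optimum wrapper around ONNX Runtime (installed with pip install optimum[onnxruntime]); the sketch below exports FLAN-T5 Small to ONNX and runs it on CPU.

```python
# Minimal sketch: export FLAN-T5 Small to ONNX via Hugging Face Optimum and run it
# on CPU through ONNX Runtime.
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer, pipeline

model_id = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch checkpoint to ONNX on the fly.
ort_model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)
ort_model.save_pretrained("flan_t5_small_onnx")  # reusable, hardware-agnostic artifact

summarizer = pipeline("summarization", model=ort_model, tokenizer=tokenizer)
text = ("Anna: Are we still on for the demo tomorrow? "
        "Ben: Yes, 10 am, I booked the small meeting room.")
print(summarizer(text, max_new_tokens=30)[0]["summary_text"])
```

The saved ONNX folder can then be shipped to a CPU-only server or an edge device, which is the hardware-agnostic deployment story the chapter argues for.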
Image from “Domain-Specific Small Language Models”
Practical Implementation: Learning through Hands-on Examples
The book provides good code samples that reinforce its content through hands-on examples, bridging the gap between theoretical SLM concepts and real-world deployment. A prime example is the implementation of Parameter-Efficient Fine-Tuning (PEFT) using LoRA (Low-Rank Adaptation) on a FLAN-T5 Small model. This companion notebook demonstrates how to take a lightweight base model and tune it for a specialized task — in this case, text summarization using the samsum dataset.
LoRA on FLAN-T5 Small with the Hugging Face Transformers Library
This notebook is a companion to chapter 2 of the "Domain Specific LLMs in Action" book by Guglielmo Iozzia, Manning Publications, 2025.
The code in this notebook introduces readers to a PEFT (Parameter-Efficient Fine-Tuning) technique called LoRA (Low-Rank Adaptation). The pre-trained model used as the baseline is FLAN-T5 Small, loaded through the Hugging Face Transformers library. It is fine-tuned for text summarization on a subset of the samsum dataset. Code execution requires a free Colab VM with hardware acceleration (GPU).
More details about this code example can be found in the book's chapter.
Install the missing requirements in the Colab VM.
!pip install datasets peft accelerate bitsandbytes evaluate rouge_score py7zr
import locale
# Wrapper for locale.getpreferredencoding: ignore any argument from the caller and
# always delegate to the original function with its defaults (a common workaround
# for Colab's non-UTF-8 preferred-encoding issue that can break pip/subprocess output).
original_getpreferredencoding = locale.getpreferredencoding
def getpreferredencoding_wrapper(do_raise=True):
    return original_getpreferredencoding()
locale.getpreferredencoding = getpreferredencoding_wrapper
Data Preparation
Load the samsum dataset from the HF Hub.
from datasets import load_dataset
dataset = load_dataset("knkarthick/samsum", trust_remote_code=True)
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
Load the FLAN-T5 small model tokenizer from the HF Hub.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_id="google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
Some preprocessing of the training/test data is needed.
We need to truncate training and test sequences that are longer than the maximum input sequence length after tokenization and pad those that are shorter. This applies to both the input (dialogue) and the target (summary).
For the input, we take the 85th percentile of the tokenized input lengths as the maximum, for better utilization.
from datasets import concatenate_datasets
import numpy as np
combined_dataset = concatenate_datasets([dataset["train"], dataset["test"]])
tokenized_inputs = []
for example in combined_dataset:
    if example["dialogue"] is not None:
        tokenized_input = tokenizer(text=example["dialogue"], truncation=True)
        tokenized_inputs.append(tokenized_input["input_ids"])
input_lengths = [len(x) for x in tokenized_inputs]
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")
For the target, we take the 90th percentile of the tokenized target lengths as the maximum, for better utilization.
tokenized_targets = concatenate_datasets(
    [dataset["train"], dataset["test"]]).map(
        lambda x: tokenizer(x["summary"], truncation=True), batched=True,
        remove_columns=["dialogue", "summary"])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")
We can now define a single function that executes all the preprocessing steps (input tokenization, truncation and padding).
def preprocess_function(sample, padding="max_length"):
    # Filter out examples where dialogue is None and keep the corresponding summaries
    processed_samples = [(dialogue, summary) for dialogue, summary in zip(sample["dialogue"], sample["summary"]) if dialogue is not None]
    inputs = ["summarize: " + dialogue for dialogue, summary in processed_samples]
    labels_text = [summary for dialogue, summary in processed_samples]
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)
    labels = tokenizer(text_target=labels_text, max_length=max_target_length, padding=padding, truncation=True)
    # Replace padding token ids in the labels with -100 so they are ignored by the loss
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
Apply the function defined at the previous cell to the tokenized dataset.
tokenized_dataset = dataset.map(preprocess_function, batched=True,
remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")
Save the preprocessed datasets to disk to reuse them later.
tokenized_dataset["train"].save_to_disk("data/train")
tokenized_dataset["test"].save_to_disk("data/eval")
Fine-tuning with LoRA and bitsandbytes int8.
Load the FLAN-T5 Small model in 8-bit precision from the Hugging Face Hub.
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True,
device_map="auto")
Define the LoRA configuration, prepare the model for training, and add the LoRA adapter to it.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q", "v"],
lora_dropout=0.05,
bias="none",
task_type=TaskType.SEQ_2_SEQ_LM
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
At the end of the execution of the code cell above, the number of parameters to train should be < 1% of the total for the model.
The training process is the same as for regular LLM fine-tuning; the main difference is the model being trained, which is the one wrapped with the LoRA adapter.
Define a Data Collator for this training:
from transformers import DataCollatorForSeq2Seq
label_pad_token_id = -100
data_collator = DataCollatorForSeq2Seq(
tokenizer,
model=model,
label_pad_token_id=label_pad_token_id,
pad_to_multiple_of=8
)
Set the training arguments and use them to create a Trainer instance. For this use case, training for 3 epochs should be enough.
Model warnings have been silenced to make the output at training time less verbose and more readable.
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
output_dir="lora-flan-t5-small"
training_args = Seq2SeqTrainingArguments(
output_dir=output_dir,
auto_find_batch_size=True,
learning_rate=1e-3,
num_train_epochs=3,
logging_dir=f"{output_dir}/logs",
logging_strategy="steps",
logging_steps=500,
save_strategy="no",
report_to="tensorboard",
)
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=tokenized_dataset["train"],
)
model.config.use_cache = False
Start the training.
trainer.train()
Save the fine-tuned model to disk.
lora_model_id="flan_t5_lora"
trainer.model.save_pretrained(lora_model_id)
tokenizer.save_pretrained(lora_model_id)
Inference and Evaluation
Prepare the model for inference: load the LoRA configuration and checkpoints, reload the base model, and attach the fine-tuned LoRA weights to it.
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
config = PeftConfig.from_pretrained(lora_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, load_in_8bit=True, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, lora_model_id, device_map={"":0})
model.eval()
Perform inference (text summarization) on a random subset of the test samples.
from random import randrange
from datasets import load_dataset
sample = dataset['test'][randrange(len(dataset["test"]))]
input_ids = tokenizer(sample["dialogue"], return_tensors="pt",
truncation=True).input_ids.cuda()
outputs = model.generate(input_ids=input_ids, max_new_tokens=10,
do_sample=True, top_p=0.9)
print(f"input sentence: {sample['dialogue']}\n{'---'* 20}")
print(f"summary:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]}")
Define a function to evaluate the model.
import numpy as np
def evaluate_peft_model(sample, max_target_length=50):
    # Generate a summary for a single tokenized test sample
    outputs = model.generate(input_ids=sample["input_ids"].unsqueeze(0).cuda(),
                             do_sample=True, top_p=0.9,
                             max_new_tokens=max_target_length)
    prediction = tokenizer.decode(
        outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    # Replace -100 in the labels with the pad token id before decoding the reference
    labels = np.where(sample["labels"] != -100,
                      sample["labels"], tokenizer.pad_token_id)
    labels = tokenizer.decode(labels, skip_special_tokens=True)
    return prediction, labels
Evaluate the model (ROUGE score) on the test dataset.
import evaluate
from datasets import load_from_disk
from tqdm import tqdm
metric = evaluate.load("rouge")
test_dataset = load_from_disk("data/eval/").with_format("torch")
predictions, references = [], []
for sample in tqdm(test_dataset):
    p, l = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(l)
rouge = metric.compute(predictions=predictions,
                       references=references,
                       use_stemmer=True)
print(f"Rouge1: {rouge['rouge1'] * 100:.2f}%")
print(f"Rouge2: {rouge['rouge2'] * 100:.2f}%")
print(f"RougeL: {rouge['rougeL'] * 100:.2f}%")
print(f"RougeLsum: {rouge['rougeLsum'] * 100:.2f}%")
- Memory Optimization: The model is loaded in 8-bit precision using the bitsandbytes library, allowing for training on consumer-grade hardware like a free Colab GPU.
- LoRA Integration: By adding a LoRA adapter, the code reduces the number of trainable parameters to less than 1% of the model’s total size, drastically cutting down on the “compute tax” and energy consumption required for training.
- Validation: Finally, it provides automated evaluation scripts using the ROUGE metric, ensuring that the “small” model maintains high performance despite its reduced footprint.
By providing these executable scripts, the book ensures that the “Power of Small” is not just a strategic concept, but a reproducible technical workflow that is both economically viable and greener for the planet.
Comparison: Why IBM’s View and the Book Align
I think that there is a strong synergy between IBM’s corporate strategy and the technical roadmap provided in this book!
| Feature | IBM's Point of View | Iozzia's *Domain-Specific SLMs* |
| ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| **Primary Goal** | Delivering "trustworthy" and "cost-effective" AI for specific enterprise tasks. | Moving past "hype" to create "exhaustive, highly informative" local models for industry challenges. |
| **Data Integrity** | Use of "cleaned, filtered datasets" to reduce risk and bias. | Emphasis on "private data" and "domain expertise" to tackle industry-specific tasks where privacy is mandatory. |
| **Open Source** | Advocacy for open-source base models (Granite) to empower organizations. | A deep dive into the "valid Open Source ecosystem" as a necessary alternative to closed-source commercial models. |
| **Cost & Scaling** | Achieving high performance at 3–23x lower cost than LLMs. | Focus on "tight infrastructure budgets" and running models on "laptops or smartphones". |
| **Future Vision** | Using techniques like **InstructLab** for efficient data infusion. | Exploring the "future of Agentic AI" where SLMs act as economical, powerful task agents. |
Outcome: both sources agree that the “Power of Small” lies in specialization. While general-purpose LLMs are powerful, the future of enterprise AI belongs to SLMs that are small enough to be private, fast enough to be local, and specialized enough to outperform their larger counterparts in specific domains.
Conclusion: Navigating the Future of Specialized AI
This comprehensive synthesis of IBM’s strategic vision and Guglielmo Iozzia’s “Domain-Specific Small Language Models” reveals a critical shift in the AI landscape. The “bigger is better” era of Large Language Models is being supplemented — and in many cases, replaced — by a more surgical, efficient, and sustainable approach: the Small Language Model (SLM).
Summary of Key Pillars
- Economic and Environmental Sustainability: We have established that the massive costs associated with training and inferencing LLMs are no longer a mandatory “tax” for AI adoption. By leveraging SLMs, organizations can achieve task-specific performance that is 3 to 23 times more cost-effective and significantly greener for the planet.
- Technical Optimization via ONNX: As explored in Chapter 5, the transition to hardware-agnostic formats like ONNX is the technical bridge that allows these models to run efficiently on everything from edge devices to standard enterprise servers.
- Hands-on Mastery with PEFT and LoRA: The provided code samples, such as the FLAN-T5 Small fine-tuning notebook, offer a practical blueprint for developers. By using LoRA, it is possible to train high-performing models on a fraction of the parameters, making advanced AI accessible even on a tight infrastructure budget.
Final Recommendation
For those involved in the field of AI and LLMs — whether as developers, architects, or strategic decision-makers — this book is essential and highly recommended reading. It moves beyond the hype of frontier models to provide a grounded, technical roadmap for building AI that is:
- Trustworthy through cleaned and filtered datasets.
- Private through local deployment on modest hardware.
- Performant through domain-specific fine-tuning and modern optimization frameworks.
In an industry where the only constant is change, this book provides the “How” to IBM’s “Why”, equipping practitioners with the tools to build the next generation of specialized, efficient AI systems.
Links and resources
- The book on manning.com site: https://www.manning.com/books/domain-specific-small-language-models
- What are small language models?: https://www.ibm.com/think/topics/small-language-models
- The power of small: https://www.ibm.com/think/insights/power-of-small-language-models
- Small Language Models (SLMs): https://aisera.com/blog/small-language-models/
- Small Language Models for enterprise AI: Challenges, benefits, and deployment strategies (Justyna Gdowik): https://deviniti.com/blog/enterprise-software/small-language-models-for-enterprise-ai/
- IBM Granite: https://www.ibm.com/fr-fr/granite
- Granite on Hugging Face: https://huggingface.co/ibm-granite
- Granite on GitHub: https://github.com/ibm-granite
- Granite Community GitHub: https://github.com/ibm-granite-community




