Zaffar

Posted on Jun 4

I fine-tuned an LLM for a client, then told them not to use it

#ai #azure #llm #rag

description: "A real client case study on supervised fine-tuning vs RAG and few-shot prompting, with the actual Azure bill that decided it."

tags: ai, machinelearning, azure, llm

A client asked me to fine-tune an AI model for them. I built it, evaluated it properly, and then recommended they not deploy it. This is the write-up of how I got to that decision, including the real Azure bill that settled it.

The full repo, with the cost breakdown, the before and after outputs, the prompts, and a redacted dataset sample, is here:
https://github.com/xaphor/landscaping-llm-brandvoice-eval

The client and the goal

Easy Gardens Landscape Works is a landscaping company in the UAE. They handle a lot of customer enquiries over WhatsApp: grass supply and installation, soil and site prep, plants and palms, irrigation, quotes, scheduling, and after-care. They wanted an assistant that sounded like them, not like a generic bot.

Their brand voice has a few firm rules:

Always speak as part of the company, using "we" and "our". Never call the company "they".
Warm and polite, but clear and to the point.
Honest. No over-promising on prices, timelines, warranties, or whether a plant will survive.
Ask for the details needed (location, area, photos, sunlight) before quoting.

They had years of real chat history, which made this a fair candidate for supervised fine-tuning.

How I set it up

I used supervised fine-tuning (SFT) of GPT-4.1 in Azure AI Foundry. The training data came from five real WhatsApp chat exports, with all personally identifiable information stripped first: names, phone numbers, bank details, emails, addresses, and map links. The dataset was 15 examples in the standard messages format, with a system, user, and assistant turn per line.

Fifteen examples is small. I treated that as a known limitation rather than hide it, and it turned out to matter.

The evaluation

This is the part people skip, and it is the part that earns the decision.

I ran an A/B evaluation on the same base model. Both arms used the same system prompt and the same retrieval-augmented generation (RAG) layer, the company knowledge base indexed and attached as a tool. The only variable was the fine-tune.

Arm A: GPT-4.1 with the system prompt and RAG.
Arm B: the same, after supervised fine-tuning.

Then five customer prompts, each one aimed at a specific brand-voice trait, run through both arms.

#	Prompt	Trait under test
1	How much to do my whole backyard with grass?	Asks for details before quoting
2	Can you guarantee my lawn stays green all summer if I leave the sprinklers on?	No over-promising
3	Your guys showed up 2 hours late and left a mess.	Warm complaint handling
4	I want real grass under a tree that is always in shade. Can you do it?	Honest feedback
5	Another company quoted 30% cheaper. Why are you so expensive?	Hold the price, stay warm

What the fine-tune did better

The fine-tuned model was genuinely better on voice. Shorter, warmer, quicker to take ownership, more willing to give an honest "no".

Prompt 3, the complaint, shows it clearly.

Base model:

We sincerely apologize for the inconvenience caused by our team's delay and the mess left on your driveway. Could you please share your location, the date of the service, and any photos you may have of the mess left behind?

Fine-tuned model:

I'm really sorry for the delay and the mess. That isn't the experience we want for you and I'll address it directly with our crew supervisor. I understand that's our problem, not yours, so thank you for your patience. I've asked our team lead to clear your driveway first thing tomorrow and will keep an eye on it until the job is done.

The base model turned a frustrated customer into a support ticket. The fine-tune took ownership and gave a next step. That is the voice the client wanted.

So far this reads like a fine-tuning success story. Two findings turned it around.

Problem one: a confident hallucination

On the summer guarantee prompt, the fine-tuned model invented a warranty it was never trained on, a specific replacement window, and used a grass-type word that does not match the client's actual products.

With only 15 training examples, the model filled the gaps by guessing in order to sound on-brand. For an assistant that talks to paying customers, a confidently invented warranty is a liability, not a feature. This is exactly the failure mode a small dataset produces, and it is hard to catch without a real evaluation.

Problem two: the cost

Here is one real day of the fine-tuned deployment from the Azure bill, in euros.

Item	One day
Hosting	54.47
Training, one-time	0.58
Inference, the whole test	under 0.05
Total	55.07

Read that table again, because it is the whole argument. The training was a one-time 58 cents. Running the entire evaluation cost a few cents. Almost the whole bill, 54.47 out of 55.07, was hosting. That is the standing fee to keep a dedicated fine-tuned deployment alive and reachable, before a single customer sends a message.

Extrapolated from that one real day:

Hosting per hour: about 2.27 euros.
Hosting per month: about 1,630 euros.
Hosting per year: close to 19,900 euros.

So a single fine-tuned model sits at roughly 1,600 euros a month before it does any meaningful work, and that figure is driven almost entirely by availability, not usage. A base GPT-4.1 deployment with RAG carries no such standing charge. You pay per token for what you actually use, and the per-token usage in this test was a rounding error.

(Azure does offer a lower-tier developer deployment that drops the hourly hosting fee in exchange for no SLA and the risk of deprovisioning. That is fine for experiments, not for a production assistant. And these figures are a real snapshot, not a fixed price list. Rates vary by model, region, and date.)

The decision

For this client, at this volume, with this dataset size: keep the base model, hold the facts in RAG so they stay updatable, and carry the brand voice with few-shot prompting, meaning a handful of example exchanges placed inside the system prompt.

That reproduced most of the voice for the price of a few extra tokens per call, with no hosting fee, and it keeps changeable facts like prices and warranty terms in a place that can be updated without retraining.

When fine-tuning would have won

This is not an argument that fine-tuning is dead. It earns its place when:

Request volume is high enough that removing a long system prompt and few-shot examples from every call saves more in tokens than the hosting fee costs.
You have a large, curated, fact-checked dataset, so the model is not guessing policy to sound right.
Brand voice is a measured business advantage and prompt engineering has stopped improving.

This client was nowhere near that line.

The takeaway

The most useful thing I delivered on this project was a "no". Anyone can follow a tutorial and produce a fine-tune. The harder and more valuable skill is measuring it honestly against a cheaper option and knowing when not to use it.

If you want the details, the repo has the real bill, the full before and after outputs, the prompts, and a redacted dataset sample:
https://github.com/xaphor/landscaping-llm-brandvoice-eval

If you have built something similar, I am curious where you draw the fine-tune versus RAG line in production. Tell me in the comments.

DEV Community