Prince Raj
Part 2: The Dataset - Labels, Heuristics, Synthetic Data, and Why AI Starts Before the Model

Before we begin: if you have come directly to this post (Part 2 of 6), here is Part 1, where I explain the basics and set expectations for the series.

The part most people skip

When many developers first approach AI, they jump straight to the model.

They ask:

  • Which neural network should I use?
  • Should I use transformers?
  • How many layers should I add?

Those are fair questions, but not the first questions.

For this project, the first real job was:

define what the model's outputs are supposed to mean

That sounds obvious, but it is the foundation of everything else.

If your labels are vague, inconsistent, or impossible to infer from text, the model will struggle no matter how fancy the architecture is.

The five things this model predicts

This classifier does not output one label. It outputs five:

  • department
  • sentiment
  • lead_intent
  • churn_risk
  • intent

That means every training example needs a shape like this:

```json
{
  "text": "refund nahi mila yet",
  "department": "billing",
  "sentiment": "negative",
  "lead_intent": "low",
  "churn_risk": "high",
  "intent": "refund"
}
```

This is the canonical schema of the training set.

Plain-English version:

Every ticket must be translated into one consistent answer sheet.
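One cheap way to keep every example honest to that answer sheet is a schema check. This is a minimal sketch, not the project's actual code: the field names come from the JSON above, but the validator function itself is mine.

```python
# The five task fields plus the input text, taken from the canonical schema.
REQUIRED_FIELDS = {"text", "department", "sentiment", "lead_intent", "churn_risk", "intent"}

def validate_example(example: dict) -> dict:
    """Raise if a training example is missing any field of the canonical schema."""
    missing = REQUIRED_FIELDS - example.keys()
    if missing:
        raise ValueError(f"example missing fields: {sorted(missing)}")
    return example

ticket = {
    "text": "refund nahi mila yet",
    "department": "billing",
    "sentiment": "negative",
    "lead_intent": "low",
    "churn_risk": "high",
    "intent": "refund",
}
validate_example(ticket)  # passes; dropping any key would raise ValueError
```

Running this over every source before training catches schema drift early, the same way request validation catches bad payloads in a backend API.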

Why schema design matters

Imagine you have data from three places:

  • a banking support dataset
  • a sentiment dataset
  • a general intent dataset

None of them naturally match your product.

One dataset may have label=payment_issue. Another may only know positive vs negative sentiment. Yet another may say nothing about churn risk at all. So the job is not only "load data."

The job is:

convert different sources into one shared language

That is what the dataset pipeline in this project does.

The label strategy

Let’s go through each output the way a backend engineer would.

1. Department

This is a routing problem. The question is:

Which team should probably handle this?

Examples:

  • refund -> billing
  • password reset -> technical
  • tracking issue -> logistics
  • pricing request -> sales

This label is operational.
It exists to move work to the right queue.

2. Sentiment

This measures emotional tone:

  • positive
  • neutral
  • negative

This is not the same as intent.

  • A pricing question can be neutral.
  • A refund request can be negative.
  • A thank-you note can be positive.

This label helps downstream prioritization and messaging.

3. Lead intent

This is where business context starts to matter.

The question is:

Does this message look like a buying opportunity?

Examples:

  • demo request -> high
  • pricing inquiry -> high
  • feature request -> medium
  • complaint -> low

This label is not just language understanding. It is business interpretation.

That matters later, because it is one reason small custom models can beat general-purpose LLMs on narrow tasks.

4. Churn risk

This estimates whether the customer may leave.

Examples:

  • cancellation request -> high
  • repeated refund frustration -> high
  • neutral tracking question -> low

Again, this is partly semantic and partly business logic.

5. Intent

This is the most specific task.

Examples:

  • refund
  • cancellation
  • delivery_issue
  • pricing_inquiry
  • technical_issue

Turning messy data into this schema

The training pipeline pulls data from multiple sources:

  • Hugging Face datasets like banking77
  • Sentiment data like tweet_eval/sentiment
  • Intent datasets like clinc_oos
  • Local JSONL files
  • Synthetic examples
  • Manual correction data

But raw source labels do not line up nicely with our five-task schema. So we normalize them.

Technical term:

This is schema normalization.

Plain-English version:

We take many different spreadsheets and convert them into one house format.
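As a sketch of what that conversion looks like in code (the source label names and default values here are illustrative assumptions, not the project's exact mapping tables):

```python
# Map one source dataset's labels into the house schema.
# "payment_issue" / "card_not_working" and the fallback defaults are invented for illustration.
SOURCE_TO_HOUSE = {
    "payment_issue": {"department": "billing", "intent": "refund"},
    "card_not_working": {"department": "technical", "intent": "technical_issue"},
}

def normalize_row(source_row: dict) -> dict:
    """Convert one row from a foreign dataset into the five-task house format."""
    mapped = SOURCE_TO_HOUSE.get(source_row["label"], {})
    return {
        "text": source_row["text"],
        "department": mapped.get("department", "general"),
        "sentiment": source_row.get("sentiment", "neutral"),
        "lead_intent": "low",       # conservative default when the source says nothing
        "churn_risk": "low",
        "intent": mapped.get("intent", "other"),
    }
```

Each source dataset gets its own small mapping like this, and everything downstream only ever sees the house format.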

Where heuristics come in

Here is an important beginner lesson:

Not every training label has to come from a human manually writing every field.

Sometimes a dataset gives you only one known label. You can infer the others using domain rules.

For example:

  • if intent is refund, department is probably billing
  • if intent is pricing_inquiry, lead intent is probably high
  • if intent is complaint, sentiment is probably negative
  • if intent is cancellation, churn risk is probably high

That is exactly what this project does.

In plain language:

When we know one strong clue, we can responsibly fill in related labels.

This is not perfect.
But it is often very useful when building a practical system from mixed data sources.
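Those rules can be expressed as a simple lookup that fills in missing fields from a known intent. The rule table below is reconstructed from the examples above, not copied from the project:

```python
# Domain rules: a known intent implies likely values for other fields.
HEURISTICS = {
    "refund": {"department": "billing"},
    "pricing_inquiry": {"lead_intent": "high"},
    "complaint": {"sentiment": "negative"},
    "cancellation": {"churn_risk": "high"},
}

def fill_missing_labels(example: dict) -> dict:
    """Fill unset fields from domain rules, never overwriting an existing label."""
    for field, value in HEURISTICS.get(example.get("intent"), {}).items():
        example.setdefault(field, value)
    return example

fill_missing_labels({"text": "cancel my plan", "intent": "cancellation"})
# adds churn_risk="high"; any label already present is left untouched
```

Using `setdefault` is the important detail: a human-provided label always wins over a heuristic guess.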

Why synthetic data was necessary

This is one of my favorite parts of the project, because it is very relatable for backend engineers.

Real support data is usually messy in two ways:

  1. it is incomplete
  2. it is uneven

Maybe you have lots of billing messages but not many sales leads.
Maybe you have clean English examples but not Hinglish.
Maybe you do not have enough high-churn refund tickets.

So the pipeline generates synthetic tickets using templates.

Examples of synthetic patterns:

  • "I want a refund for my subscription"
  • "Refund nahi mila for my order"
  • "Can I get a demo for my team?"
  • "Payment failed but money got deducted"

Then it adds style noise:

  • typos
  • shorthand
  • uppercase
  • casual phrasing
  • Hinglish variants

Plain-English version:

We manufacture extra training examples for situations we care about but do not have enough of.

Technical term:

This is synthetic data generation or data augmentation.
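A toy version of that generator looks like this. The templates echo the patterns above, but the noise functions and probabilities are my own assumptions, not the project's actual implementation:

```python
import random

# Templates per intent, with slots to vary; items are swapped in at random.
TEMPLATES = {
    "refund": ["I want a refund for my {item}", "Refund nahi mila for my {item}"],
    "demo_request": ["Can I get a demo for my {item}?"],
}
ITEMS = ["subscription", "order", "team"]

def add_style_noise(text: str, rng: random.Random) -> str:
    """Randomly uppercase the text or delete one character to mimic real typing."""
    if rng.random() < 0.3:
        text = text.upper()
    if rng.random() < 0.3 and len(text) > 5:
        i = rng.randrange(len(text))
        text = text[:i] + text[i + 1:]  # simulate a typo
    return text

def generate(intent: str, n: int, seed: int = 0) -> list[str]:
    """Produce n noisy synthetic tickets for one intent."""
    rng = random.Random(seed)
    return [
        add_style_noise(rng.choice(TEMPLATES[intent]).format(item=rng.choice(ITEMS)), rng)
        for _ in range(n)
    ]
```

Seeding the generator keeps runs reproducible, so two training runs see the same synthetic set.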

Why Hinglish normalization matters

A lot of AI tutorials quietly assume clean English input. Real production systems do not get that luxury.

Users write things like:

  • refund chahiye
  • paisa mila nahi
  • app kharab hai
  • jaldi fix karo

If you ignore that kind of variation, your model will feel fragile in production.

So this project includes simple but valuable normalization rules that map common Hinglish words to normalized English equivalents:

  • nahi -> not
  • paisa -> money
  • kharab -> broken
  • chahiye -> want

This is not "full multilingual AI."
It is something more practical:

targeted robustness for the language patterns your users actually type
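The mapping above can be applied with a word-level replacement pass. This is a naive sketch; a production version would handle punctuation and tokenization more carefully:

```python
# Known Hinglish words mapped to normalized English equivalents (from the list above).
HINGLISH_MAP = {"nahi": "not", "paisa": "money", "kharab": "broken", "chahiye": "want"}

def normalize_hinglish(text: str) -> str:
    """Replace known Hinglish words with English equivalents, word by word."""
    return " ".join(HINGLISH_MAP.get(word.lower(), word) for word in text.split())

normalize_hinglish("refund chahiye")  # -> "refund want"
```

Unknown words pass through unchanged, so the rule set can grow incrementally as you see new patterns in production.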

The corrections loop is the most production-friendly part

This project also supports a corrections.jsonl file.

That means once the model is live, you can capture corrected labels and feed them back into training.

The workflow looks like this:

  1. Model makes a prediction in production
  2. Human or system corrects bad labels
  3. Corrected example gets appended to corrections.jsonl
  4. Next training run boosts those corrections

I love this because it feels very familiar to backend teams. It is not mystical. It is a feedback loop.

  • You ship.
  • You observe.
  • You correct.
  • You retrain.

That is how production systems grow up.

Training and validation split

After collecting all examples, the pipeline splits them into:

  • Training data
  • Validation data

Why do we need validation?

Because if we only measure performance on the same examples the model learned from, the scores can be misleading.

Plain-English version:

Training data is the study material. Validation data is the exam.

The project also tries to stratify by intent when splitting.

That means it attempts to preserve label balance, so the validation set does not accidentally miss important classes.
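A stratified split can be hand-rolled in a few lines: group by intent, then split each group separately so every class lands in both sets. This is a minimal sketch of the idea, not the project's actual splitting code:

```python
import random
from collections import defaultdict

def stratified_split(examples: list[dict], key: str, val_fraction: float = 0.2, seed: int = 42):
    """Split per class so the validation set keeps at least one example of every label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[key]].append(ex)
    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_val = max(1, int(len(group) * val_fraction))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Without stratification, a rare but important class (say, high-churn cancellations) could end up entirely in training, and your validation score would never measure it.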

A simple but important truth

At this point, we still have not talked about embeddings, dense layers, or PyTorch math.

And that is the point.

The AI project already contains a lot of engineering value before the neural network starts training:

  • Schema design
  • Label definitions
  • Heuristics
  • Dataset normalization
  • Synthetic example generation
  • Production corrections
  • Validation setup

This is why I keep telling backend engineers:

You already have a lot of the mindset needed for AI systems.

Good AI pipelines reward the same habits as good backend systems:

  • Consistent contracts
  • Thoughtful data modeling
  • Clear assumptions
  • Measurable feedback loops

If you can only remember one thing from this article

Remember this:

Training data is not "whatever text you found."
Training data is a product design decision.

You are deciding:

  • What the model should notice
  • What tradeoffs it should care about
  • What your labels really mean in the business

That is the real beginning of AI work.

What comes next

In Part 3, we will finally answer the question that makes many people feel like AI is magic:

How does text become numbers?

I will explain:

  • Bag-of-words
  • Keyword flags
  • Token IDs
  • Embeddings
  • Why this project combines all of them

And I’ll do it in plain language first, then connect each idea to the proper technical terms.

Disclosure: AI was used to frame the article.
