Prince Raj
Part 2: The Dataset - Labels, Heuristics, Synthetic Data, and Why AI Starts Before the Model

Before we begin: if you have come directly to this post (Part 2 of 6), here is Part 1, where I explain the basics and set expectations for the series.

The part most people skip

When many developers first approach AI, they jump straight to the model.

They ask:

  • Which neural network should I use?
  • Should I use transformers?
  • How many layers should I add?

Those are fair questions, but not the first questions.

For this project, the first real job was:

define what the model's outputs are supposed to mean

That sounds obvious, but it is the foundation of everything else.

If your labels are vague, inconsistent, or impossible to infer from text, the model will struggle no matter how fancy the architecture is.

The five things this model predicts

This classifier does not output one label. It outputs five:

  • department
  • sentiment
  • lead_intent
  • churn_risk
  • intent

That means every training example needs a shape like this:

```json
{
  "text": "refund nahi mila yet",
  "department": "billing",
  "sentiment": "negative",
  "lead_intent": "low",
  "churn_risk": "high",
  "intent": "refund"
}
```

This is the canonical schema of the training set.

Plain-English version:

Every ticket must be translated into one consistent answer sheet.
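One cheap way to keep every example honest to that answer sheet is a schema check. This is a minimal sketch, not the project's actual code: the field names come from the JSON above, but the validator function itself is mine.

```python
# The five task fields plus the input text, taken from the canonical schema.
REQUIRED_FIELDS = {"text", "department", "sentiment", "lead_intent", "churn_risk", "intent"}

def validate_example(example: dict) -> dict:
    """Raise if a training example is missing any field of the canonical schema."""
    missing = REQUIRED_FIELDS - example.keys()
    if missing:
        raise ValueError(f"example missing fields: {sorted(missing)}")
    return example

ticket = {
    "text": "refund nahi mila yet",
    "department": "billing",
    "sentiment": "negative",
    "lead_intent": "low",
    "churn_risk": "high",
    "intent": "refund",
}
validate_example(ticket)  # passes; dropping any key would raise ValueError
```

Running this over every source before training catches schema drift early, the same way request validation catches bad payloads in a backend API.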

Why schema design matters

Imagine you have data from three places:

  • a banking support dataset
  • a sentiment dataset
  • a general intent dataset

None of them naturally match your product.

One dataset may have label=payment_issue. Another may only know positive vs negative sentiment. Yet another may say nothing about churn risk at all. So the job is not only "load data."

The job is:

convert different sources into one shared language

That is what the dataset pipeline in this project does.

The label strategy

Let’s go through each output the way a backend engineer would.

1. Department

This is a routing problem. The question is:

Which team should probably handle this?

Examples:

  • refund -> billing
  • password reset -> technical
  • tracking issue -> logistics
  • pricing request -> sales

This label is operational.
It exists to move work to the right queue.

2. Sentiment

This measures emotional tone:

  • positive
  • neutral
  • negative

This is not the same as intent.

  • A pricing question can be neutral.
  • A refund request can be negative.
  • A thank-you note can be positive.

This label helps downstream prioritization and messaging.

3. Lead intent

This is where business context starts to matter.

The question is:

Does this message look like a buying opportunity?

Examples:

  • demo request -> high
  • pricing inquiry -> high
  • feature request -> medium
  • complaint -> low

This label is not just language understanding. It is business interpretation.

That matters later, because it is one reason small custom models can beat general-purpose LLMs on narrow tasks.

4. Churn risk

This estimates whether the customer may leave.

Examples:

  • cancellation request -> high
  • repeated refund frustration -> high
  • neutral tracking question -> low

Again, this is partly semantic and partly business logic.

5. Intent

This is the most specific task.

Examples:

  • refund
  • cancellation
  • delivery_issue
  • pricing_inquiry
  • technical_issue

Turning messy data into this schema

The training pipeline pulls data from multiple sources:

  • Hugging Face datasets like banking77
  • Sentiment data like tweet_eval/sentiment
  • Intent datasets like clinc_oos
  • Local JSONL files
  • Synthetic examples
  • Manual correction data

But raw source labels do not line up nicely with our five-task schema. So we normalize them.

Technical term:

This is schema normalization.

Plain-English version:

We take many different spreadsheets and convert them into one house format.
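As a sketch of what that conversion looks like in code (the source label names and default values here are illustrative assumptions, not the project's exact mapping tables):

```python
# Map one source dataset's labels into the house schema.
# "payment_issue" / "card_not_working" and the fallback defaults are invented for illustration.
SOURCE_TO_HOUSE = {
    "payment_issue": {"department": "billing", "intent": "refund"},
    "card_not_working": {"department": "technical", "intent": "technical_issue"},
}

def normalize_row(source_row: dict) -> dict:
    """Convert one row from a foreign dataset into the five-task house format."""
    mapped = SOURCE_TO_HOUSE.get(source_row["label"], {})
    return {
        "text": source_row["text"],
        "department": mapped.get("department", "general"),
        "sentiment": source_row.get("sentiment", "neutral"),
        "lead_intent": "low",       # conservative default when the source says nothing
        "churn_risk": "low",
        "intent": mapped.get("intent", "other"),
    }
```

Each source dataset gets its own small mapping like this, and everything downstream only ever sees the house format.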

Where heuristics come in

Here is an important beginner lesson:

Not every training label has to come from a human manually writing every field.

Sometimes a dataset gives you only one known label. You can infer the others using domain rules.

For example:

  • if intent is refund, department is probably billing
  • if intent is pricing_inquiry, lead intent is probably high
  • if intent is complaint, sentiment is probably negative
  • if intent is cancellation, churn risk is probably high

That is exactly what this project does.

In plain language:

When we know one strong clue, we can responsibly fill in related labels.

This is not perfect.
But it is often very useful when building a practical system from mixed data sources.
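Those rules can be expressed as a simple lookup that fills in missing fields from a known intent. The rule table below is reconstructed from the examples above, not copied from the project:

```python
# Domain rules: a known intent implies likely values for other fields.
HEURISTICS = {
    "refund": {"department": "billing"},
    "pricing_inquiry": {"lead_intent": "high"},
    "complaint": {"sentiment": "negative"},
    "cancellation": {"churn_risk": "high"},
}

def fill_missing_labels(example: dict) -> dict:
    """Fill unset fields from domain rules, never overwriting an existing label."""
    for field, value in HEURISTICS.get(example.get("intent"), {}).items():
        example.setdefault(field, value)
    return example

fill_missing_labels({"text": "cancel my plan", "intent": "cancellation"})
# adds churn_risk="high"; any label already present is left untouched
```

Using `setdefault` is the important detail: a human-provided label always wins over a heuristic guess.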

Why synthetic data was necessary

This is one of my favorite parts of the project, because it is very relatable for backend engineers.

Real support data is usually messy in two ways:

  1. it is incomplete
  2. it is uneven

Maybe you have lots of billing messages but not many sales leads.
Maybe you have clean English examples but not Hinglish.
Maybe you do not have enough high-churn refund tickets.

So the pipeline generates synthetic tickets using templates.

Examples of synthetic patterns:

  • "I want a refund for my subscription"
  • "Refund nahi mila for my order"
  • "Can I get a demo for my team?"
  • "Payment failed but money got deducted"

Then it adds style noise:

  • typos
  • shorthand
  • uppercase
  • casual phrasing
  • Hinglish variants

Plain-English version:

We manufacture extra training examples for situations we care about but do not have enough of.

Technical term:

This is synthetic data generation or data augmentation.
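A toy version of that generator looks like this. The templates echo the patterns above, but the noise functions and probabilities are my own assumptions, not the project's actual implementation:

```python
import random

# Templates per intent, with slots to vary; items are swapped in at random.
TEMPLATES = {
    "refund": ["I want a refund for my {item}", "Refund nahi mila for my {item}"],
    "demo_request": ["Can I get a demo for my {item}?"],
}
ITEMS = ["subscription", "order", "team"]

def add_style_noise(text: str, rng: random.Random) -> str:
    """Randomly uppercase the text or delete one character to mimic real typing."""
    if rng.random() < 0.3:
        text = text.upper()
    if rng.random() < 0.3 and len(text) > 5:
        i = rng.randrange(len(text))
        text = text[:i] + text[i + 1:]  # simulate a typo
    return text

def generate(intent: str, n: int, seed: int = 0) -> list[str]:
    """Produce n noisy synthetic tickets for one intent."""
    rng = random.Random(seed)
    return [
        add_style_noise(rng.choice(TEMPLATES[intent]).format(item=rng.choice(ITEMS)), rng)
        for _ in range(n)
    ]
```

Seeding the generator keeps runs reproducible, so two training runs see the same synthetic set.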

Why Hinglish normalization matters

A lot of AI tutorials quietly assume clean English input. Real production systems do not get that luxury.

Users write things like:

  • refund chahiye
  • paisa mila nahi
  • app kharab hai
  • jaldi fix karo

If you ignore that kind of variation, your model will feel fragile in production.

So this project includes simple but valuable normalization rules that map common Hinglish words to normalized English equivalents:

  • nahi -> not
  • paisa -> money
  • kharab -> broken
  • chahiye -> want

This is not "full multilingual AI."
It is something more practical:

targeted robustness for the language patterns your users actually type
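The mapping above can be applied with a word-level replacement pass. This is a naive sketch; a production version would handle punctuation and tokenization more carefully:

```python
# Known Hinglish words mapped to normalized English equivalents (from the list above).
HINGLISH_MAP = {"nahi": "not", "paisa": "money", "kharab": "broken", "chahiye": "want"}

def normalize_hinglish(text: str) -> str:
    """Replace known Hinglish words with English equivalents, word by word."""
    return " ".join(HINGLISH_MAP.get(word.lower(), word) for word in text.split())

normalize_hinglish("refund chahiye")  # -> "refund want"
```

Unknown words pass through unchanged, so the rule set can grow incrementally as you see new patterns in production.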

The corrections loop is the most production-friendly part

This project also supports a corrections.jsonl file.

That means once the model is live, you can capture corrected labels and feed them back into training.

The workflow looks like this:

  1. Model makes a prediction in production
  2. Human or system corrects bad labels
  3. Corrected example gets appended to corrections.jsonl
  4. Next training run boosts those corrections

I love this because it feels very familiar to backend teams. It is not mystical. It is a feedback loop.

  • You ship.
  • You observe.
  • You correct.
  • You retrain.

That is how production systems grow up.

Training and validation split

After collecting all examples, the pipeline splits them into:

  • Training data
  • Validation data

Why do we need validation?

Because if we only measure performance on the same examples the model learned from, the scores can be misleading.

Plain-English version:

Training data is the study material. Validation data is the exam.

The project also tries to stratify by intent when splitting.

That means it attempts to preserve label balance, so the validation set does not accidentally miss important classes.
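A stratified split can be hand-rolled in a few lines: group by intent, then split each group separately so every class lands in both sets. This is a minimal sketch of the idea, not the project's actual splitting code:

```python
import random
from collections import defaultdict

def stratified_split(examples: list[dict], key: str, val_fraction: float = 0.2, seed: int = 42):
    """Split per class so the validation set keeps at least one example of every label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[key]].append(ex)
    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_val = max(1, int(len(group) * val_fraction))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Without stratification, a rare but important class (say, high-churn cancellations) could end up entirely in training, and your validation score would never measure it.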

A simple but important truth

At this point, we still have not talked about embeddings, dense layers, or PyTorch math.

And that is the point.

The AI project already contains a lot of engineering value before the neural network starts training:

  • Schema design
  • Label definitions
  • Heuristics
  • Dataset normalization
  • Synthetic example generation
  • Production corrections
  • Validation setup

This is why I keep telling backend engineers:

You already have a lot of the mindset needed for AI systems.

Good AI pipelines reward the same habits as good backend systems:

  • Consistent contracts
  • Thoughtful data modeling
  • Clear assumptions
  • Measurable feedback loops

If you can only remember one thing from this article

Remember this:

Training data is not "whatever text you found."
Training data is a product design decision.

You are deciding:

  • What the model should notice
  • What tradeoffs it should care about
  • What your labels really mean in the business

That is the real beginning of AI work.

What comes next

In Part 3, we will finally answer the question that makes many people feel like AI is magic:

How does text become numbers?

I will explain:

  • Bag-of-words
  • Keyword flags
  • Token IDs
  • Embeddings
  • Why this project combines all of them

And I’ll do it in plain language first, then connect each idea to the proper technical terms.

Disclosure: AI was used to frame the article.
