Yodit Weldegeorgise

๐—ช๐—ต๐—ฎ๐˜ ๐—œ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ฒ๐—ฑ ๐—ณ๐—ฟ๐—ผ๐—บ ๐—–๐—ต๐—ฎ๐—ฝ๐˜๐—ฒ๐—ฟ ๐Ÿฎ ๐—ผ๐—ณ ๐—”๐—œ ๐—˜๐—ป๐—ด๐—ถ๐—ป๐—ฒ๐—ฒ๐—ฟ๐—ถ๐—ป๐—ด: ๐—ช๐—ต๐˜† ๐—ฆ๐—ฎ๐—บ๐—ฝ๐—น๐—ถ๐—ป๐—ด ๐—–๐—ต๐—ฎ๐—ป๐—ด๐—ฒ๐˜€ ๐—˜๐˜ƒ๐—ฒ๐—ฟ๐˜†๐˜๐—ต๐—ถ๐—ป๐—ด

When people talk about AI models, the focus is usually on training: how much data was used, how big the model is, or what architecture it uses.

That's where most of the attention goes.

But Chapter 2 of AI Engineering made me focus on something that has a direct impact on how models behave in practice: **sampling**.

Sampling is how a model selects one output from many possible options. It might seem like a small detail, but it explains a lot of what we see in real-world usage, especially inconsistency and hallucinations.

๐—ง๐—ฟ๐—ฎ๐—ถ๐—ป๐—ถ๐—ป๐—ด ๐—œ๐˜€๐—ปโ€™๐˜ ๐—๐˜‚๐˜€๐˜ ๐—”๐—ฏ๐—ผ๐˜‚๐˜ ๐— ๐—ผ๐—ฟ๐—ฒ ๐——๐—ฎ๐˜๐—ฎ

It's easy to assume that more data leads to better performance, but that assumption breaks down quickly.

A model trained on a smaller amount of high-quality data can outperform a larger model trained on low-quality data.

What matters is finding the right balance between quantity, quality, and diversity. The model needs enough exposure to learn patterns, the data needs to be reliable, and it needs enough variety to generalize well.

This aligns closely with how we think about data in backend systems. Clean and well-structured inputs tend to produce more reliable outputs.

๐—–๐—ผ๐—บ๐—ฝ๐˜‚๐˜๐—ฒ ๐—™๐—ผ๐—ฟ๐—ฐ๐—ฒ๐˜€ ๐—ง๐—ฟ๐—ฎ๐—ฑ๐—ฒ๐—ผ๐—ณ๐—ณ๐˜€

AI systems don't scale in isolation. They scale within constraints.

Larger models and datasets require more compute, and compute directly translates to cost.

In practice, teams don't start with the biggest possible model. They start with a budget and design within that limit.

This is where the **Chinchilla scaling law** becomes useful.

Before jumping into large numbers, it helps to understand one term:

A ๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐—บ๐—ฒ๐˜๐—ฒ๐—ฟ is a learned value inside the model that helps it make decisions. More parameters mean the model can learn more patterns, but it also needs more data to train properly.

Now, think of the scaling rule like this:

1 parameter → ~20 tokens

Then scale it up:

1B parameters → ~20B tokens
3B parameters → ~60B tokens

The pattern stays consistent. As the model grows, the data needs to grow with it. Otherwise, you end up with a larger model that isn't fully trained and doesn't use compute efficiently.
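The scaling rule above can be sketched directly in Python. The 20-tokens-per-parameter ratio comes from the Chinchilla result; the ~6 FLOPs per parameter per training token is a common back-of-the-envelope estimate for training compute, included here to show why under-training a large model wastes budget:

```python
def chinchilla_optimal_tokens(params: float) -> float:
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return 20 * params

def training_flops(params: float, tokens: float) -> float:
    """Rough estimate: ~6 FLOPs per parameter per training token."""
    return 6 * params * tokens

for params in (1e9, 3e9, 70e9):
    tokens = chinchilla_optimal_tokens(params)
    flops = training_flops(params, tokens)
    print(f"{params / 1e9:.0f}B params -> "
          f"{tokens / 1e9:.0f}B tokens, ~{flops:.1e} training FLOPs")
```

Running this makes the tradeoff concrete: a 70B model trained on only 20B tokens spends the same per-token compute as a 1B model but is far short of its ~1.4T-token budget.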

This matters because it directly affects how we make decisions when working with AI systems.

In practice, we are constantly choosing between models, deciding whether to fine-tune, and balancing cost with performance. This concept gives a way to reason about those choices.

For example, a larger model isn't automatically better if it wasn't trained with enough data. That explains why smaller, well-trained models can sometimes outperform larger ones.

The same applies when fine-tuning: adding capacity won't improve results unless there is enough high-quality data to support it.

Even when using APIs, this changes the mindset. Instead of defaulting to the biggest model, the focus shifts to whether the model was trained efficiently and whether it fits the use case.

So this is not just a scaling rule. It becomes a way to guide model selection, fine-tuning decisions, and cost vs performance tradeoffs.

๐—™๐—ฟ๐—ผ๐—บ ๐—ฃ๐—ฟ๐—ฒ-๐—ง๐—ฟ๐—ฎ๐—ถ๐—ป๐—ฒ๐—ฑ ๐— ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€ ๐˜๐—ผ ๐—ฅ๐—ฒ๐—ฎ๐—น ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ๐˜€

A pre-trained model is not ready for production use.

It is optimized for predicting the next token, not for producing useful, safe, or aligned responses.

That's where post-training comes in.

**Supervised fine-tuning** teaches the model how to respond using structured examples. However, that alone is not enough.

**RLHF (Reinforcement Learning from Human Feedback)** introduces a feedback loop that improves alignment.

A **reward model** is trained to evaluate how good a response is. Instead of relying on absolute scoring, reward models often learn from comparisons between multiple responses, which helps reduce inconsistency.

RLHF then closes the loop: the model generates responses, the reward model scores them, and the model is updated to favor better responses over time.
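As a toy illustration of that loop (not real RLHF, which updates a neural policy with an algorithm like PPO), the sketch below uses made-up stand-in functions for both the policy and the reward model, and just selects a preferred/rejected pair:

```python
def reward_model(response: str) -> float:
    # Stand-in scorer: a real reward model is itself a trained network.
    # This toy version prefers short responses that include "please".
    return -len(response) + (10 if "please" in response else 0)

def generate_candidates(prompt: str) -> list[str]:
    # Stand-in for sampling several responses from the policy model.
    fillers = ["", " please", " right now", " immediately, please"]
    return [prompt + f for f in fillers]

# One RLHF-style iteration: sample candidates, score them, form a pair.
prompt = "Summarize the report"
candidates = generate_candidates(prompt)
scored = sorted(candidates, key=reward_model, reverse=True)
best, worst = scored[0], scored[-1]
# A real pipeline would now update the policy so that outputs like
# `best` become more likely and outputs like `worst` less likely.
print("preferred:", best)
print("rejected: ", worst)
```

The comparison-based selection mirrors how reward models are trained on ranked response pairs rather than absolute scores.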

This process helps align models with human expectations, not just in correctness but also in tone, safety, and usefulness.

๐—” ๐—›๐—ถ๐—ฑ๐—ฑ๐—ฒ๐—ป ๐—ฅ๐—ถ๐˜€๐—ธ: ๐—ง๐—ต๐—ฒ ๐—ค๐˜‚๐—ฎ๐—น๐—ถ๐˜๐˜† ๐—ผ๐—ณ ๐˜๐—ต๐—ฒ ๐—œ๐—ป๐˜๐—ฒ๐—ฟ๐—ป๐—ฒ๐˜ ๐—œ๐˜๐˜€๐—ฒ๐—น๐—ณ

Models are trained on internet-scale data. That means whatever exists online, whether accurate or misleading, can influence how models behave.

As more AI-generated content is published, there is a growing risk that future models will be trained on synthetic or incorrect information.

It is also possible for bad actors to intentionally publish misleading content online so that future models learn from it.

This turns into a data integrity problem, not just a modeling problem.

As engineers, this means we need to be more mindful. Not all data sources are equally reliable, and blindly trusting model outputs becomes riskier over time.

๐—ช๐—ต๐˜† ๐—”๐—œ ๐—™๐—ฒ๐—ฒ๐—น๐˜€ ๐—œ๐—ป๐—ฐ๐—ผ๐—ป๐˜€๐—ถ๐˜€๐˜๐—ฒ๐—ป๐˜

One of the most important ideas in this chapter is that AI models are **probabilistic systems**.

That means the same input can produce different outputs, and even a small change in input can lead to a noticeably different response.

This behavior is driven by **sampling**.

It also explains **hallucinations**, where the model generates responses that sound correct but are not grounded in fact.
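A minimal sketch of what sampling looks like under the hood: a softmax over the model's logits, with a temperature knob that controls how random the pick is. The three-word vocabulary and logit values are made up; real models sample over tens of thousands of tokens.

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float = 1.0) -> str:
    """Sample one token: softmax over temperature-scaled logits."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())
    weights = {tok: math.exp(v - m) for tok, v in scaled.items()}  # stable softmax
    r = random.random() * sum(weights.values())
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # guard against float rounding

logits = {"Paris": 5.0, "Lyon": 2.0, "Rome": 1.0}
# Low temperature -> almost deterministic; high temperature -> more varied.
print(sample_token(logits, temperature=0.1))
print(sample_token(logits, temperature=2.0))
```

Run the high-temperature line a few times and the output changes between calls, which is exactly the inconsistency the chapter describes: the same input, different outputs.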

**Designing Around Probabilistic Systems**

This chapter didn't introduce completely new ideas to me, but it helped connect things more clearly.

In backend systems, I'm used to building deterministic workflows where the same input leads to the same output. This chapter reinforced that AI systems don't behave that way.

Instead, AI systems need to be designed with their probabilistic nature in mind.

That shows up in practice. Outputs need validation instead of blind trust. Prompting and constraints act as control mechanisms. Fine-tuning becomes a tool for consistency, not just improvement.
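The "validate instead of trust" point can be sketched concretely. This assumes a hypothetical setup where the model was prompted to return JSON with a `summary` string and a `confidence` number, a schema I invented for illustration:

```python
import json

def validated_summary(raw_output: str) -> dict:
    """Check a model response against an expected schema before using it."""
    data = json.loads(raw_output)  # raises ValueError if not valid JSON
    summary = data.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("missing or empty 'summary'")
    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError("'confidence' must be a number in [0, 1]")
    return data

ok = validated_summary('{"summary": "Q3 revenue grew 8%", "confidence": 0.92}')
print(ok["summary"])
```

The same pattern scales up to schema libraries, but even a hand-rolled check like this turns "blind trust" into a failure you can catch and retry.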

AI systems are shaped by the data they are trained on, the compute used during training, the post-training process, and the sampling strategy that generates outputs.

Sampling is what makes models flexible and useful, but it is also what introduces variability.

Understanding that tradeoff is what makes AI engineering more practical.
