Yodit Weldegeorgise

๐—ช๐—ต๐—ฎ๐˜ ๐—œ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ฒ๐—ฑ ๐—ณ๐—ฟ๐—ผ๐—บ ๐—–๐—ต๐—ฎ๐—ฝ๐˜๐—ฒ๐—ฟ ๐Ÿฎ ๐—ผ๐—ณ ๐—”๐—œ ๐—˜๐—ป๐—ด๐—ถ๐—ป๐—ฒ๐—ฒ๐—ฟ๐—ถ๐—ป๐—ด: ๐—ช๐—ต๐˜† ๐—ฆ๐—ฎ๐—บ๐—ฝ๐—น๐—ถ๐—ป๐—ด ๐—–๐—ต๐—ฎ๐—ป๐—ด๐—ฒ๐˜€ ๐—˜๐˜ƒ๐—ฒ๐—ฟ๐˜†๐˜๐—ต๐—ถ๐—ป๐—ด

When people talk about AI models, the focus is usually on training: how much data was used, how big the model is, or what architecture it uses.

That's where most of the attention goes.

But Chapter 2 of AI Engineering made me focus on something that has a direct impact on how models behave in practice: **sampling**.

Sampling is how a model selects one output from many possible options. It might seem like a small detail, but it explains a lot of what we see in real-world usage, especially inconsistency and hallucinations.

๐—ง๐—ฟ๐—ฎ๐—ถ๐—ป๐—ถ๐—ป๐—ด ๐—œ๐˜€๐—ปโ€™๐˜ ๐—๐˜‚๐˜€๐˜ ๐—”๐—ฏ๐—ผ๐˜‚๐˜ ๐— ๐—ผ๐—ฟ๐—ฒ ๐——๐—ฎ๐˜๐—ฎ

It's easy to assume that more data leads to better performance, but that assumption breaks down quickly.

A model trained on a smaller amount of high-quality data can outperform a larger model trained on low-quality data.

What matters is finding the right balance between quantity, quality, and diversity. The model needs enough exposure to learn patterns, the data needs to be reliable, and it needs enough variety to generalize well.

This aligns closely with how we think about data in backend systems. Clean and well-structured inputs tend to produce more reliable outputs.

๐—–๐—ผ๐—บ๐—ฝ๐˜‚๐˜๐—ฒ ๐—™๐—ผ๐—ฟ๐—ฐ๐—ฒ๐˜€ ๐—ง๐—ฟ๐—ฎ๐—ฑ๐—ฒ๐—ผ๐—ณ๐—ณ๐˜€

AI systems don't scale in isolation. They scale within constraints.

Larger models and datasets require more compute, and compute directly translates to cost.

In practice, teams don't start with the biggest possible model. They start with a budget and design within that limit.

This is where the **Chinchilla scaling law** becomes useful.

Before jumping into large numbers, it helps to understand one term:

A ๐—ฝ๐—ฎ๐—ฟ๐—ฎ๐—บ๐—ฒ๐˜๐—ฒ๐—ฟ is a learned value inside the model that helps it make decisions. More parameters mean the model can learn more patterns, but it also needs more data to train properly.

Now, think of the scaling rule like this:

1 parameter → ~20 tokens

Then scale it up:

1B parameters → ~20B tokens
3B parameters → ~60B tokens

The pattern stays consistent. As the model grows, the data needs to grow with it. Otherwise, you end up with a larger model that isn't fully trained and doesn't use compute efficiently.
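The scaling rule above can be sketched directly in Python. The 20-tokens-per-parameter ratio comes from the Chinchilla result; the ~6 FLOPs per parameter per training token is a common back-of-the-envelope estimate for training compute, included here to show why under-training a large model wastes budget:

```python
def chinchilla_optimal_tokens(params: float) -> float:
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return 20 * params

def training_flops(params: float, tokens: float) -> float:
    """Rough estimate: ~6 FLOPs per parameter per training token."""
    return 6 * params * tokens

for params in (1e9, 3e9, 70e9):
    tokens = chinchilla_optimal_tokens(params)
    flops = training_flops(params, tokens)
    print(f"{params / 1e9:.0f}B params -> "
          f"{tokens / 1e9:.0f}B tokens, ~{flops:.1e} training FLOPs")
```

Running this makes the tradeoff concrete: a 70B model trained on only 20B tokens spends the same per-token compute as a 1B model but is far short of its ~1.4T-token budget.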

This matters because it directly affects how we make decisions when working with AI systems.

In practice, we are constantly choosing between models, deciding whether to fine-tune, and balancing cost with performance. This concept gives a way to reason about those choices.

For example, a larger model isn't automatically better if it wasn't trained with enough data. That explains why smaller, well-trained models can sometimes outperform larger ones.

The same applies when fine-tuning: adding capacity won't improve results unless there is enough high-quality data to support it.

Even when using APIs, this changes the mindset. Instead of defaulting to the biggest model, the focus shifts to whether the model was trained efficiently and whether it fits the use case.

So this is not just a scaling rule. It becomes a way to guide model selection, fine-tuning decisions, and cost vs performance tradeoffs.

๐—™๐—ฟ๐—ผ๐—บ ๐—ฃ๐—ฟ๐—ฒ-๐—ง๐—ฟ๐—ฎ๐—ถ๐—ป๐—ฒ๐—ฑ ๐— ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€ ๐˜๐—ผ ๐—ฅ๐—ฒ๐—ฎ๐—น ๐—ฆ๐˜†๐˜€๐˜๐—ฒ๐—บ๐˜€

A pre-trained model is not ready for production use.

It is optimized for predicting the next token, not for producing useful, safe, or aligned responses.

That's where post-training comes in.

**Supervised fine-tuning** teaches the model how to respond using structured examples. However, that alone is not enough.

**RLHF (Reinforcement Learning from Human Feedback)** introduces a feedback loop that improves alignment.

A **reward model** is trained to evaluate how good a response is. Instead of relying on absolute scoring, reward models often learn from comparisons between multiple responses, which helps reduce inconsistency.

RLHF then closes the loop: the model generates responses, the reward model scores them, and the model is updated to favor better responses over time.
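As a toy illustration of that loop (not real RLHF, which updates a neural policy with an algorithm like PPO), the sketch below uses made-up stand-in functions for both the policy and the reward model, and just selects a preferred/rejected pair:

```python
def reward_model(response: str) -> float:
    # Stand-in scorer: a real reward model is itself a trained network.
    # This toy version prefers short responses that include "please".
    return -len(response) + (10 if "please" in response else 0)

def generate_candidates(prompt: str) -> list[str]:
    # Stand-in for sampling several responses from the policy model.
    fillers = ["", " please", " right now", " immediately, please"]
    return [prompt + f for f in fillers]

# One RLHF-style iteration: sample candidates, score them, form a pair.
prompt = "Summarize the report"
candidates = generate_candidates(prompt)
scored = sorted(candidates, key=reward_model, reverse=True)
best, worst = scored[0], scored[-1]
# A real pipeline would now update the policy so that outputs like
# `best` become more likely and outputs like `worst` less likely.
print("preferred:", best)
print("rejected: ", worst)
```

The comparison-based selection mirrors how reward models are trained on ranked response pairs rather than absolute scores.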

This process helps align models with human expectations, not just in correctness but also in tone, safety, and usefulness.

๐—” ๐—›๐—ถ๐—ฑ๐—ฑ๐—ฒ๐—ป ๐—ฅ๐—ถ๐˜€๐—ธ: ๐—ง๐—ต๐—ฒ ๐—ค๐˜‚๐—ฎ๐—น๐—ถ๐˜๐˜† ๐—ผ๐—ณ ๐˜๐—ต๐—ฒ ๐—œ๐—ป๐˜๐—ฒ๐—ฟ๐—ป๐—ฒ๐˜ ๐—œ๐˜๐˜€๐—ฒ๐—น๐—ณ

Models are trained on internet-scale data. That means whatever exists online, whether accurate or misleading, can influence how models behave.

As more AI-generated content is published, there is a growing risk that future models will be trained on synthetic or incorrect information.

It is also possible for bad actors to intentionally publish misleading content online so that future models learn from it.

This turns into a data integrity problem, not just a modeling problem.

As engineers, this means we need to be more mindful. Not all data sources are equally reliable, and blindly trusting model outputs becomes riskier over time.

๐—ช๐—ต๐˜† ๐—”๐—œ ๐—™๐—ฒ๐—ฒ๐—น๐˜€ ๐—œ๐—ป๐—ฐ๐—ผ๐—ป๐˜€๐—ถ๐˜€๐˜๐—ฒ๐—ป๐˜

One of the most important ideas in this chapter is that AI models are **probabilistic systems**.

That means the same input can produce different outputs, and even a small change in input can lead to a noticeably different response.

This behavior is driven by **sampling**.

It also explains **hallucinations**, where the model generates responses that sound correct but are not grounded in fact.
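A minimal sketch of what sampling looks like under the hood: a softmax over the model's logits, with a temperature knob that controls how random the pick is. The three-word vocabulary and logit values are made up; real models sample over tens of thousands of tokens.

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float = 1.0) -> str:
    """Sample one token: softmax over temperature-scaled logits."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())
    weights = {tok: math.exp(v - m) for tok, v in scaled.items()}  # stable softmax
    r = random.random() * sum(weights.values())
    for tok, w in weights.items():
        r -= w
        if r <= 0:
            return tok
    return tok  # guard against float rounding

logits = {"Paris": 5.0, "Lyon": 2.0, "Rome": 1.0}
# Low temperature -> almost deterministic; high temperature -> more varied.
print(sample_token(logits, temperature=0.1))
print(sample_token(logits, temperature=2.0))
```

Run the high-temperature line a few times and the output changes between calls, which is exactly the inconsistency the chapter describes: the same input, different outputs.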

**Designing Around Probabilistic Systems**

This chapter didn't introduce completely new ideas to me, but it helped connect things more clearly.

In backend systems, I'm used to building deterministic workflows where the same input leads to the same output. This chapter reinforced that AI systems don't behave that way.

Instead, AI systems need to be designed with their probabilistic nature in mind.

That shows up in practice. Outputs need validation instead of blind trust. Prompting and constraints act as control mechanisms. Fine-tuning becomes a tool for consistency, not just improvement.
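The "validate instead of trust" point can be sketched concretely. This assumes a hypothetical setup where the model was prompted to return JSON with a `summary` string and a `confidence` number, a schema I invented for illustration:

```python
import json

def validated_summary(raw_output: str) -> dict:
    """Check a model response against an expected schema before using it."""
    data = json.loads(raw_output)  # raises ValueError if not valid JSON
    summary = data.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("missing or empty 'summary'")
    conf = data.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError("'confidence' must be a number in [0, 1]")
    return data

ok = validated_summary('{"summary": "Q3 revenue grew 8%", "confidence": 0.92}')
print(ok["summary"])
```

The same pattern scales up to schema libraries, but even a hand-rolled check like this turns "blind trust" into a failure you can catch and retry.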

AI systems are shaped by the data they are trained on, the compute used during training, the post-training process, and the sampling strategy that generates outputs.

Sampling is what makes models flexible and useful, but it is also what introduces variability.

Understanding that tradeoff is what makes AI engineering more practical.
