Yaohua Chen for ImagineX

Key Breakthroughs in AI Engineering that Every AI Engineer Must Know

This blog post walks through the logic behind the technological evolution of AI engineering from 2017 to the present, showing how we went step by step from "this thing can run" to "this thing can actually do work." The key breakthroughs are grouped into four themes, covered in the sections below.

The Beginning of Everything: From "Architectural Revolution" to "Emergent Capabilities"

The starting point of this story is 2017. The famous paper "Attention Is All You Need" is the "birth certificate" of the Transformer architecture, the foundation of all modern large language models.

Before it, models such as RNNs processed text word by word, sequentially, which was not only slow but also struggled with long texts (by the end of a passage, they had often forgotten what was said at the beginning). The core contribution of the Transformer was the "Self-Attention" mechanism, which lets the model look at all the words in a sentence simultaneously and work out which words matter most to each other.
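To make "looking at all words simultaneously" concrete, here is a minimal, self-contained sketch of scaled dot-product self-attention in NumPy. The dimensions and random weights are purely illustrative, not those of any real model.

```python
# A minimal sketch of scaled dot-product self-attention using NumPy.
# Shapes and weights here are toy values for illustration only.
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_head)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v              # project every token at once
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how strongly each token attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all positions, in parallel
    return weights @ V                               # each output mixes information from the whole sequence

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                         # 5 tokens, toy embedding size 16
W = [rng.normal(size=(16, 16)) for _ in range(3)]
out = self_attention(x, *W)                          # (5, 16): every token "looked at" all 5 tokens at once
```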

This brought two huge benefits: training can be massively parallelized, and long-range dependencies are handled far better.

Then, in 2020, OpenAI's GPT-3 paper "Language Models are Few-Shot Learners" delivered the second key breakthrough. It showed that a sufficiently large model can learn to perform a wide variety of tasks from just a few examples (few-shot). When you scale the Transformer up far enough, a new capability "emerges": "In-Context Learning."

This means you no longer need to fine-tune or "customize" the model for specific tasks (like translation or summarization). You just need to give it a few examples (few-shot) in the prompt, and it can "learn by imitation" and understand what you actually want to do. This discovery completely changed the rules of the game. For practitioners like us, this means we can use a general-purpose foundation model and solve various problems through "prompt engineering" or "context engineering".
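As a concrete illustration, here is what few-shot in-context learning looks like in practice: all of the "teaching" lives inside the prompt, and no model weights are updated. The task and example reviews below are made up for illustration.

```python
# A hedged illustration of few-shot prompting: the "training" happens entirely
# inside the prompt; no weights are updated. The task and examples are invented.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day, love it."
Sentiment: Positive

Review: "Screen cracked after one week."
Sentiment: Negative

Review: "Shipping was fast and the fit is perfect."
Sentiment:"""

# Send `few_shot_prompt` to any general-purpose LLM; with just two in-context
# examples it typically completes the pattern ("Positive") without any fine-tuning.
```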

"Training" the Model: How to Make It Obedient, Professional, and Cost-Effective?

With GPT-3's "brute force creates miracles" approach, people quickly discovered new problems:

  1. It's powerful but doesn't "listen" - it often talks nonsense and sometimes even outputs toxic content. We need to make it "listen" to us and "obey" our instructions.
  2. It's "expensive" - if you want it to perform better in a specialized domain (like law or medicine), the cost of full fine-tuning is terrifyingly high. We need to find a way to make it more cost-effective.
  3. It's a "bookworm" - training data has a cutoff date, and it doesn't know about new knowledge from the outside world or company internal materials. We need to make it more "knowledgeable" on external knowledge and "professional" in the specialized domain.

So the following key breakthroughs were about solving these "usability" issues.

1. Making It "Obedient" - InstructGPT (2022)

The core of this breakthrough is solving the "Alignment" problem, that is, making the model "listen" to us and "obey" our instructions. The paper "Training language models to follow instructions with human feedback" introduced RLHF (Reinforcement Learning from Human Feedback) on a model called InstructGPT.

Simply put, the process is as follows: first have humans rank the model's different responses, then train a "Reward Model" to mimic those human preferences, and finally use the reward model as the training signal to fine-tune the large model with reinforcement learning.
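As a rough sketch of the middle step: the reward model in the InstructGPT recipe is trained with a pairwise ranking loss, so that it scores the human-preferred response higher than the rejected one. The `reward_model` below is a placeholder for any network that maps an encoded (prompt, response) pair to a scalar score.

```python
# A minimal PyTorch sketch of the reward-model step in RLHF. `reward_model`,
# `preferred_inputs`, and `rejected_inputs` are placeholders (assumptions).
import torch.nn.functional as F

def reward_ranking_loss(reward_model, preferred_inputs, rejected_inputs):
    r_pref = reward_model(preferred_inputs)    # scalar score for the human-preferred response
    r_rej = reward_model(rejected_inputs)      # scalar score for the rejected response
    # Pairwise ranking loss from the InstructGPT recipe: -log(sigmoid(r_pref - r_rej))
    return -F.logsigmoid(r_pref - r_rej).mean()

# In the full pipeline, the trained reward model then scores the policy model's outputs,
# and an RL algorithm (PPO in the paper) nudges the policy toward higher-reward responses.
```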

The biggest insight from this breakthrough is: "A smaller but 'aligned' model can outperform a much larger but unaligned model in user satisfaction." This made everyone realize that bigger isn't always better - the value of "alignment" or "obedience" is extremely high.

2. Making It "Cost-Effective" - LoRA (2021)

When we need to make the model "professional" in a specialized domain, full fine-tuning (updating all model parameters with new training data) used to be the only option - and it is terrifyingly expensive. Is there a cheaper way? That is where LoRA (Low-Rank Adaptation) comes in.

Its idea is particularly clever: during fine-tuning, we don't touch the billions of original parameters (they stay frozen), and instead "insert" small, trainable low-rank "adapter" matrices alongside different layers of the model. These adapters account for a tiny fraction of the total parameters (as little as roughly 0.01%) yet still achieve strong performance.
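Here is a minimal PyTorch sketch of that idea: the original linear layer is frozen, and only two small low-rank matrices are trained on top of it. The layer size, rank, and scaling are illustrative choices, not a reproduction of any specific LoRA setup.

```python
# A minimal sketch of the LoRA idea: the base weights stay frozen,
# and only the two small low-rank matrices A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                              # keep the original parameters frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)    # small trainable matrices;
        self.B = nn.Parameter(torch.zeros(d_out, rank))          # B starts at zero, so training
        self.scale = alpha / rank                                # begins from the unmodified model

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable share: {trainable / total:.4%}")               # a fraction of a percent of the layer
```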

The result is that it lowered the barrier to fine-tuning from "only big companies can afford it" to "you can run it on a single GPU." This was revolutionary for AI application deployment.

3. Making It "Professional" - RAG (2020)

Before RAG, the model was like a "bookworm" that only knew what it was trained on. It didn't know about new knowledge from the outside world or company internal materials. Moreover, it was prone to "hallucination", that is, making things up when it didn't know the answer. How do you solve the model's "outdated knowledge" and "hallucination" problems? The answer is RAG (Retrieval-Augmented Generation).

The idea is straightforward: "Before the model answers a question, don't rush to make things up. First go to an external knowledge base (like company internal databases, or the internet) to retrieve a batch of relevant documents, treat these documents as 'open-book exam' materials, feed them to the model, and let it answer you based on these materials."
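A minimal sketch of that retrieve-then-generate flow looks like this. The `embed` and `llm_generate` functions are placeholders for whatever embedding model and LLM client you actually use; a production system would add a vector database, document chunking, and reranking.

```python
# A minimal sketch of the RAG flow. `embed` and `llm_generate` are placeholders
# (assumptions), not a real API; plug in your own embedding model and LLM client.
import numpy as np

def retrieve(question, documents, embed, top_k=3):
    """Rank documents by cosine similarity to the question and return the best ones."""
    q = embed(question)
    scored = []
    for doc in documents:
        d = embed(doc)
        score = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))
        scored.append((score, doc))
    return [doc for _, doc in sorted(scored, reverse=True)[:top_k]]

def rag_answer(question, documents, embed, llm_generate):
    context = "\n\n".join(retrieve(question, documents, embed))
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm_generate(prompt)   # the model answers "open book" instead of from memory alone
```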

Today, RAG is practically standard for all production-grade LLM applications (like AI customer service, knowledge base Q&A). It is a key technology for making models "professional" and "useful".

Pushing Efficiency to the Limit: How to Make Models Run on Devices with Limited Resources?

When models really need to be deployed to consumer products, or even smartphones, "efficiency" and "cost" become matters of life and death. The next few breakthroughs revolve around "optimization."

1. Making Models "Smaller" - DistilBERT (2019)

This breakthrough uses "Knowledge Distillation". The idea is to have a large, smart "teacher" model (like BERT) teach a small "student" model (like DistilBERT), training the student to mimic the teacher's behavior. The result: the student retains about 97% of the teacher's language understanding capability while being roughly 40% smaller and 60% faster. This made running AI on smartphones and "edge devices" (like smart home gadgets and wearables) practical - a key technique for making models "small" and "efficient".
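The heart of knowledge distillation is the loss that pushes the student toward the teacher's softened output distribution. Below is a minimal PyTorch sketch of that loss; the temperature value and how it is combined with the regular task loss are simplified relative to the actual DistilBERT training recipe.

```python
# A minimal sketch of the distillation loss: the small "student" is trained to match
# the softened output distribution of the large, frozen "teacher". T is the temperature.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    soft_targets = F.softmax(teacher_logits / T, dim=-1)       # the teacher's "soft" predictions
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between student and teacher distributions, scaled by T^2 (Hinton et al.)
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)

# In practice this is combined with the ordinary training loss on the true labels,
# so the student learns both from the data and from the teacher's behavior.
```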

2. Making Models "Memory-Efficient" - LLM.int8 (2022)

This breakthrough is about "Quantization." Simply put, it stores model weights using fewer bits - for example, going from 16-bit floating point numbers down to 8-bit integers (int8) halves memory usage (and cuts it by 4x compared with 32-bit floats).

The challenge is that crude compression can cause severe accuracy degradation. The key insight of this breakthrough is that only a very small number of "outlier features" in the model cause trouble. So they used a mixed-precision approach: store the vast majority of weights as int8, but keep those critical outlier values in 16 bits. The result: almost no accuracy loss, while still keeping most of the memory savings.
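Here is a toy NumPy illustration of that mixed-precision idea: quantize the ordinary values to int8, but keep the rare large-magnitude outliers in 16 bits. The real LLM.int8() method works per column inside matrix multiplications; this sketch only shows the principle.

```python
# A toy illustration of mixed-precision quantization: most values go to int8,
# while rare large-magnitude "outliers" stay in float16. Simplified vs. the paper.
import numpy as np

def mixed_precision_quantize(w, outlier_threshold=6.0):
    outlier_mask = np.abs(w) > outlier_threshold          # the few values that break int8
    regular = np.where(outlier_mask, 0.0, w)
    scale = max(np.abs(regular).max(), 1e-8) / 127         # map the rest onto [-127, 127]
    q = np.round(regular / scale).astype(np.int8)          # stored in 8 bits
    outliers = w[outlier_mask].astype(np.float16)          # kept in 16 bits
    return q, scale, outlier_mask, outliers

def dequantize(q, scale, outlier_mask, outliers):
    w = q.astype(np.float32) * scale
    w[outlier_mask] = outliers.astype(np.float32)
    return w

w = np.random.randn(4, 8).astype(np.float32)
w[0, 0] = 20.0                                             # plant one "outlier feature"
print(np.abs(w - dequantize(*mixed_precision_quantize(w))).max())  # tiny reconstruction error
```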

3. Making Models "More Flexible" - Switch Transformers (2021)

This breakthrough is about the "MoE" (Mixture of Experts) architecture. The idea: instead of training one "jack-of-all-trades" dense model, you train a collection of "specialized" experts (say, one good at math, another good at writing poetry). For each prediction (i.e., each token), a "Router" first decides which expert is best suited to handle it, and only that expert is activated.

The benefit is that your model's total parameter count can be very large (like trillion-scale), but the actual computational cost is very low because you only use a small portion of it each time.
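A minimal PyTorch sketch of top-1 ("switch") routing is shown below. The layer sizes and number of experts are toy values; real MoE systems add load-balancing losses and spread experts across many devices.

```python
# A minimal sketch of top-1 "switch" routing: a router picks one expert per token,
# so only a fraction of the total parameters does any work for a given token.
import torch
import torch.nn as nn

class TinySwitchLayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)        # decides who handles each token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (num_tokens, d_model)
        gate = torch.softmax(self.router(x), dim=-1)         # routing probabilities
        top_prob, top_idx = gate.max(dim=-1)                 # top-1: a single expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():                                   # only the chosen expert computes
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = TinySwitchLayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)                                   # (10, 64), but each token used 1 of 4 experts
```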

The Future Puzzle: "Agents" and "Standards"

To make models truly "useful" and "capable of doing work" for us in the real world, we need to solve one problem: how do models interact with the outside world? The following three breakthroughs are about exactly that.

1. Making Models "Capable of Doing Work" - Agents (2023)

The line of work known as "LLM Agents" proposes a basic system framework with three parts: a Brain (the LLM, responsible for thinking and planning), Perception (responsible for reading external information, such as results returned by tools), and Action (responsible for calling APIs or tools).

This means models are no longer just "chatbots" - they can start helping you "do things," such as booking flights, analyzing financial reports, or executing code.
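At its core, an agent is a loop: the LLM decides on an action, a tool executes it, and the observation is fed back into the context. The sketch below is a hedged, framework-free illustration; `llm` and the tool functions are placeholders, and real agent frameworks add planning, memory, and error handling.

```python
# A hedged sketch of the basic agent loop (think -> act -> observe).
# The `llm` function and tool names are placeholders, not any real framework's API.
import json

def run_agent(task, llm, tools, max_steps=5):
    """tools: dict mapping a tool name to a Python callable (the agent's 'hands')."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # Brain: the LLM decides the next step, replying with JSON such as
        # {"action": "search_flights", "input": "..."} or {"action": "finish", "input": "answer"}
        decision = json.loads(llm("\n".join(history) +
                                  f"\nAvailable tools: {list(tools)}\nReply with JSON."))
        if decision["action"] == "finish":
            return decision["input"]
        # Action: call the chosen tool; Perception: feed its result back into the context.
        observation = tools[decision["action"]](decision["input"])
        history.append(f"Called {decision['action']}, got: {observation}")
    return "Gave up after too many steps."
```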

2. Making Models "Interconnected" - MCP (2024)

Before MCP, whenever an AI needed to call an external tool (your calendar, a database, etc.), you had to write a custom "one-to-one" integration for it, which was very troublesome. Anthropic (the developer of Claude) proposed the Model Context Protocol (MCP) in 2024 to solve this problem.

The core idea of MCP is an "open standard" for how AI models communicate with external tools and APIs. Just as the HTTP protocol unified communication between web browsers and servers, MCP aims to unify communication between AI models and the tools they use. If this standard gets widely adopted, the connectivity of the AI ecosystem will see a qualitative leap.
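To see why a shared standard helps, compare it with writing a bespoke integration per tool: with MCP, every tool server answers the same kinds of requests. The JSON-RPC-style messages below are paraphrased to show the shape of the idea, and the tool name is hypothetical; consult the official MCP specification for the exact formats.

```python
# A schematic illustration of the "one protocol instead of N integrations" idea.
# These are paraphrased JSON-RPC-style messages in the spirit of MCP, not copied
# from the spec; "search_calendar" is a hypothetical tool name.
list_tools_request = {
    "jsonrpc": "2.0", "id": 1,
    "method": "tools/list",        # "what can you do?" - the same question for every server
}
call_tool_request = {
    "jsonrpc": "2.0", "id": 2,
    "method": "tools/call",        # "do this" - the same verb for a calendar, a database, anything
    "params": {"name": "search_calendar", "arguments": {"query": "next Tuesday"}},
}
# Any MCP-compatible client can talk to any MCP server this way, so adding a new
# tool no longer requires a custom one-off integration.
```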

3. Making Agents "Interoperable" - A2A Protocol (2025)

While MCP connects AI models to tools, what happens when you have multiple AI agents that need to work together? Imagine you have one agent handling your calendar, another managing your emails, and a third analyzing your documents. How do they coordinate? The "Agent2Agent (A2A) Protocol" proposed in 2025 addresses exactly this problem.

Think of it like this: MCP is like giving each AI agent a phone to call different services. A2A is like giving all the agents a group chat so they can talk to each other. The protocol allows AI agents built on different technologies to communicate, share information securely, and coordinate their actions. This is complementary to MCP - together they create a complete ecosystem where AI can both use tools and collaborate with other AI.
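As a protocol-agnostic illustration of the coordination problem A2A targets, the sketch below shows one agent delegating a sub-task to another through a shared message shape. The message class and agent function are invented for illustration; the actual A2A specification defines its own formats (agent cards, tasks, and so on).

```python
# A protocol-agnostic sketch of agent-to-agent coordination: agents exchange a shared
# message shape. This is an illustration of the idea, not the A2A message format.
from dataclasses import dataclass, field

@dataclass
class AgentMessage:
    sender: str
    recipient: str
    task: str
    payload: dict = field(default_factory=dict)

def calendar_agent(msg: AgentMessage) -> AgentMessage:
    # A toy agent: receives a task, does its own work, and replies in the same shared shape.
    free_slots = ["Tue 10:00", "Wed 14:00"]          # pretend this came from a calendar tool
    return AgentMessage("calendar", msg.sender, "slots_found", {"slots": free_slots})

# The email agent doesn't need to know how the calendar agent works internally;
# it only needs to speak the common message format.
reply = calendar_agent(AgentMessage("email", "calendar", "find_free_slots", {"week": "next"}))
print(reply.payload["slots"])
```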

Summary

The evolution path of AI engineering is actually very clear. It's a chain of continuously solving key problems: first making the model "able to run" (Transformer), then making it "able to learn" (GPT-3), then making it "obedient" (InstructGPT), then making it "useful and affordable" (LoRA, RAG, Quantization), and finally making it "able to do work" (Agents, MCP, A2A). Each step here represents a huge leverage point.
