AI and robotics are evolving rapidly, and reinforcement learning (RL) has become foundational to modern AI systems. You see RL everywhere: in large language models, self-driving cars, and the emerging field of physical AI. Within RL, imitation learning stands out as a particularly practical technique. Instead of engineering complex reward functions, we can simply show an AI agent what to do by demonstrating the desired behavior.
Stanford's CS224R: Deep Reinforcement Learning course offers a rigorous, freely available introduction to these concepts. The course has a well-structured syllabus and covers industry-standard material that's directly applicable to real-world problems—making it an excellent resource for anyone looking to break into AI and robotics. Best of all, it's completely free and available on YouTube.
This walkthrough documents my experience with one of the early assignments: implementing and experimenting with two foundational imitation learning methods, Behavior Cloning (BC) and DAgger. While researching self-driving car systems and robotics applications, I noticed that imitation learning consistently appears as a foundational technique; companies like Nvidia, for instance, have published work showing its importance in autonomous driving. This assignment provided hands-on experience with the core concepts, implementation decisions, and the practical tradeoffs between the two approaches.
Why Imitation Learning?
There are certain behaviors that human experts know how to perform but that would require deep domain knowledge to program into a robotic system by hand. Instead, these expert behaviors can be demonstrated to an AI agent. The purpose of imitation learning is to efficiently learn a desired behavior by imitating an expert's behavior.
This is powerful for building autonomous behavior in systems that are meant to mimic a human approach, such as autonomous vehicles, computer games, and robotic applications. It's a go-to technique for many companies working on self-driving cars, robotics, and leading physical AI systems.
What is Imitation Learning?
The goal of reinforcement learning is to find closed-loop control policies that maximize an accumulated reward. RL methods are generally classified as model-based or model-free. In both cases, the usual assumption is that a reward function is known; the agent runs in an environment, collects data, and uses that data either to learn a model of the environment and improve the policy through it (model-based) or to update the learned policy directly (model-free).
Imitation learning is quite similar to RL; the difference is that there is no explicit reward function r = R(s_t, u_t). Instead, it is assumed that a set of demonstrations is provided by an expert. The goal is to learn a policy π that follows the closed-loop control law:
u_t = π(s_t)
There is an expert policy π* (derived from the demonstrations) that the learned policy aims to imitate.
The Goal of Imitation Learning: Find a policy π that best imitates the expert policy π*, effectively capturing the expert's behavior from the provided demonstrations.
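Written a bit more formally (a sketch of the standard objective, not the course's exact notation, where ℓ is a distance between actions and d is the distribution of states the loss is evaluated over):

```latex
\pi = \arg\min_{\pi'} \; \mathbb{E}_{s_t \sim d}\!\left[ \ell\big(\pi'(s_t),\, \pi^*(s_t)\big) \right]
```

For behavior cloning, d is the distribution of states in the expert demonstrations; DAgger's key change, as we'll see, is to evaluate this loss on the states visited by the learned policy instead.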
(Sounding complex? I'll keep it simple, I promise 😊)
Two Approaches to Imitation Learning
There are generally two approaches to imitation learning. The first is to directly learn to imitate the expert's policy; the second is to imitate it indirectly by learning the expert's reward function instead. In this blog, I will focus on two classical approaches that directly learn to imitate the expert: Behavior Cloning and the DAgger algorithm, both of which obtain the policy through standard supervised learning. Learning the expert's reward function (which I won't cover here; you can read more in the resources listed below) is known as inverse reinforcement learning. We shall discuss BC and DAgger, considering their intuition and limitations, and I'll also explain my architectural choices.
Behavior Cloning (BC)
The algorithm uses a set of demonstrated trajectories from an expert to determine a policy π that imitates the expert's actions. Standard supervised learning techniques apply: the policy is trained so that its actions match the expert's actions on the demonstrated states with respect to some loss metric. It is a classical optimization problem.
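As a concrete illustration, here is a minimal behavior cloning update in PyTorch. This is a sketch only (the assignment's actual code can't be shared); `expert_states` and `expert_actions` are assumed to be tensors built from the demonstration data, and the loss is a simple mean squared error between predicted and expert actions.

```python
import torch
import torch.nn as nn

def bc_update(policy: nn.Module, optimizer: torch.optim.Optimizer,
              expert_states: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """One behavior cloning step: regress the policy's actions onto the
    expert's actions for a mini-batch of demonstrated states."""
    pred_actions = policy(expert_states)                      # what the policy would do
    loss = nn.functional.mse_loss(pred_actions, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, this update is repeated over mini-batches sampled from the demonstration dataset until the loss converges.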
But this has a major shortcoming: the learned policy can drift into states that the expert data does not cover (often called "trajectory drift" or distribution shift). In those states the situation is no longer equivalent to what the policy was trained on, and its behavior deteriorates. This is a major challenge in imitation learning.
The core issue is distributional mismatch: during training, the policy only sees states from the expert's trajectories, but during deployment, small errors compound over time, leading the policy into states it has never encountered. These unfamiliar states cause the policy to make poor decisions, which leads to even more unfamiliar states—a cascading failure.
Figure 1: Behavior Cloning in action during early training. Notice how the Ant agent struggles.
DAgger: Dataset Aggregation
DAgger is a direct patch to the distributional mismatch problem. It collects additional data from the expert iteratively to update the policy.
How DAgger works: Rather than training once on a fixed dataset, DAgger runs the current learned policy in the environment, observes the states it reaches (which may differ from the expert's states), then queries the expert for the correct action at those states. This new data is aggregated with the previous training data, and the policy is retrained. This iterative process progressively reduces the distributional mismatch between training and deployment by ensuring the policy sees and learns from the states it actually encounters.
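In rough Python, the loop looks like this (illustrative only: `run_policy`, `train_supervised`, and `expert_policy` are hypothetical stand-ins for an environment rollout, the supervised update from behavior cloning, and the expert):

```python
def dagger(env, expert_policy, policy, n_iters: int = 10):
    """Sketch of the DAgger loop: roll out the current policy, relabel the
    visited states with the expert, aggregate, and retrain (supervised)."""
    all_states, all_actions = [], []
    for _ in range(n_iters):
        # 1. Run the *current* policy to see which states it actually reaches.
        states = run_policy(env, policy)                  # hypothetical rollout helper
        # 2. Query the expert for the correct action in those states.
        expert_actions = [expert_policy(s) for s in states]
        # 3. Aggregate with all previously collected data.
        all_states.extend(states)
        all_actions.extend(expert_actions)
        # 4. Retrain the policy on the aggregated dataset.
        policy = train_supervised(policy, all_states, all_actions)  # hypothetical trainer
    return policy
```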
Figure 2: DAgger in action. The same environments show significantly more stable behavior; the agent now maintains its balance much better.
My Experimental Setup
I implemented and evaluated both Behavior Cloning and DAgger on three MuJoCo continuous control environments: Ant-v2, HalfCheetah-v2, and Hopper-v2. Each environment presents different challenges: Ant requires coordinating multiple legs, HalfCheetah involves high-speed forward locomotion, and Hopper demands delicate balance control.
Implementation Details
Network Architecture:
I used a simple feedforward neural network with 2 hidden layers of 64 units each, using ReLU activations. The policy outputs continuous actions directly without any output activation. This architecture is deliberately simple—I wanted to see how much the algorithm itself (BC vs DAgger) matters compared to model capacity.
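In PyTorch, that architecture looks roughly like this (a sketch under my own assumptions about framework and naming; the observation and action dimensions depend on the environment):

```python
import torch.nn as nn

def build_policy(obs_dim: int, act_dim: int) -> nn.Module:
    """Feedforward policy: two hidden layers of 64 units with ReLU,
    and a linear output head for continuous actions."""
    return nn.Sequential(
        nn.Linear(obs_dim, 64),
        nn.ReLU(),
        nn.Linear(64, 64),
        nn.ReLU(),
        nn.Linear(64, act_dim),   # raw continuous actions, no output activation
    )
```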
Training Configuration:
- Learning rate: 0.005 with Adam optimizer
- Training steps per iteration: 1000
- Mini-batch size: 100
- For BC: Single iteration on expert data
- For DAgger: 10 iterations with expert relabeling
Data Collection:
Each iteration collected 1000 environment steps. For DAgger, the expert policy relabeled the states visited by the learned policy, and all data was accumulated in a replay buffer with a capacity of 1,000,000.
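Putting the configuration together, one DAgger iteration roughly follows the sketch below (again illustrative; `collect_rollout`, `expert`, and `buffer` are hypothetical stand-ins for the environment rollout, the expert policy, and a capacity-limited replay buffer):

```python
import torch
import torch.nn.functional as F

def dagger_iteration(env, policy, expert, buffer, optimizer):
    """One iteration with the settings above: collect 1000 environment steps,
    relabel them with the expert, aggregate into the buffer, then run
    1000 gradient steps on mini-batches of 100."""
    states = collect_rollout(env, policy, n_steps=1000)   # hypothetical helper
    actions = expert.get_actions(states)                  # expert relabels visited states
    buffer.add(states, actions)                           # hypothetical buffer, capacity 1,000,000

    for _ in range(1000):                                 # training steps per iteration
        s, a = buffer.sample(batch_size=100)              # mini-batch size
        loss = F.mse_loss(policy(s), a)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Here the optimizer would be created as `torch.optim.Adam(policy.parameters(), lr=0.005)` to match the configuration above.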
Results
The results were striking, particularly for Hopper:
| Environment | BC Return | BC % Expert | DAgger Return | DAgger % Expert | Expert Return | Improvement (pct. points) |
|---|---|---|---|---|---|---|
| Ant | 4216.1 ± 0.0 | 88.9% | 4845.1 ± 0.0 | 102.1% | 4744.3 | +13.3% |
| Hopper | 879.8 ± 345.1 | 23.7% | 3709.4 ± 0.0 | 99.8% | 3716.0 | +76.1% |
| HalfCheetah | 3835.3 ± 0.0 | 94.3% | 3905.8 ± 0.0 | 96.0% | 4067.9 | +1.7% |
Learning Curves
Figure 3: Training curves showing BC vs DAgger vs the expert baseline for the Ant, Hopper, and HalfCheetah environments. The curves illustrate how DAgger's iterative data collection leads to continued improvement, most visible in Hopper's dramatic performance gain.
Key Observations:
Hopper showed the most dramatic improvement. BC achieved only 23.7% of expert performance with high variance (±345.1), while DAgger reached 99.8% of expert performance. This makes sense. Hopper requires precise balance, and small deviations from the expert trajectory quickly lead to falls. BC had no way to recover from these states, but DAgger explicitly learned from them.
Ant performed well even with BC (88.9% of expert), but DAgger still provided meaningful gains, actually exceeding expert performance at 102.1%. This suggests Ant's dynamics are more forgiving. The policy can make small errors without catastrophic failure, reducing the distributional mismatch problem.
HalfCheetah saw minimal improvement from DAgger (94.3% → 96.0%). My hypothesis is that HalfCheetah's task is relatively stable once the policy learns the basic running gait, and the expert demonstrations already covered the relevant state space well enough for BC to succeed.
Performance Comparison
Figure 4: Final performance comparison across all three environments, showing the relative success of BC and DAgger as percentages of expert performance.
What I Learned
This assignment reinforced that the choice of algorithm matters immensely depending on the task dynamics. For sensitive, unstable tasks like Hopper, the distributional mismatch in BC is crippling. DAgger's iterative data collection isn't just a theoretical improvement; it's the difference between 24% and 100% of expert performance.
I also found that looking at variance is as important as looking at mean performance. BC's high variance on Hopper (±345.1) was a clear signal that something was fundamentally wrong, not just that the policy needed more training.
Not all tasks need DAgger. HalfCheetah's results suggest that for some problems, expert demonstrations alone provide sufficient coverage. Understanding when to use which approach requires careful consideration of your environment's dynamics.
Imitation learning doesn't solve all autonomy problems on its own. In practice, it's often used to create base models upon which different algorithms (including RL methods like PPO) are applied to fine-tune performance. This two-stage approach—imitation learning for initialization, RL for refinement—is common in real-world applications.
DAgger's benefits come with costs. While it solves the distributional mismatch problem effectively, it's computationally intensive. Each iteration requires running the policy in the environment and querying the expert for labels, which can be expensive or even infeasible in some domains (imagine needing a human expert to label thousands of states). This practical constraint is why BC remains popular despite its theoretical limitations.
Conclusion
Working through this imitation learning assignment gave me hands-on experience with the core challenge of learning from demonstrations: distributional mismatch. The contrast between BC and DAgger wasn't just academic; I saw firsthand how Hopper went from barely functioning to expert-level performance simply by changing which states the policy trains on.
The bigger lesson is about understanding your problem before choosing your algorithm. DAgger isn't always necessary (as HalfCheetah showed), and BC isn't always insufficient (as Ant demonstrated). The key is recognizing when your task's dynamics will punish distribution shift, and planning accordingly.
These experiments also highlighted that imitation learning is a starting point, not an ending point, for building robust autonomous systems. The path from expert demonstrations to production-ready policies often involves multiple stages—imitation learning for initialization, reinforcement learning for refinement, and careful engineering throughout.
If you're working on similar problems, I hope this walkthrough helps you think about when to use BC, when to invest in DAgger's computational cost, and how to evaluate whether distributional mismatch is your bottleneck.
Code and Course Information
Note on Code: Due to course policy, I cannot share the implementation code for this assignment. However, the concepts and approaches described here can be implemented following the algorithm descriptions in the referenced papers.
Course Resources: The Stanford CS224R lectures (freely available on YouTube) and the references below provide comprehensive coverage of imitation learning, reinforcement learning, and related topics.
References
- Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., & Peters, J. (2018). "An Algorithmic Perspective on Imitation Learning." Foundations and Trends in Robotics, 7(1-2), 1-179.
- Ross, S., Gordon, G., & Bagnell, D. (2011). "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning." Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS).
- Pomerleau, D. A. (1989). "ALVINN: An Autonomous Land Vehicle in a Neural Network." Advances in Neural Information Processing Systems (NIPS).
- Bojarski, M., et al. (2016). "End to End Learning for Self-Driving Cars." arXiv preprint arXiv:1604.07316.
- Todorov, E., Erez, T., & Tassa, Y. (2012). "MuJoCo: A physics engine for model-based control." IEEE/RSJ International Conference on Intelligent Robots and Systems.
You made it to the end 😊 Thanks for reading! If you found this helpful or have questions about imitation learning, feel free to reach out or leave a comment.



