<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: IBIYEMI Samuel O.</title>
    <description>The latest articles on DEV Community by IBIYEMI Samuel O. (@samdude).</description>
    <link>https://dev.to/samdude</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1188173%2F7c7774d1-cae2-4d73-a15a-80717f067653.png</url>
      <title>DEV Community: IBIYEMI Samuel O.</title>
      <link>https://dev.to/samdude</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samdude"/>
    <language>en</language>
    <item>
      <title>Setting Up Webots with Stable Baselines3 for Reinforcement Learning</title>
      <dc:creator>IBIYEMI Samuel O.</dc:creator>
      <pubDate>Sun, 15 Feb 2026 13:43:03 +0000</pubDate>
      <link>https://dev.to/samdude/setting-up-webots-with-stable-baselines3-for-reinforcement-learning-2ikc</link>
      <guid>https://dev.to/samdude/setting-up-webots-with-stable-baselines3-for-reinforcement-learning-2ikc</guid>
      <description>&lt;p&gt;Ever thought of building an actual robot? Only to be faced with the high price tags for hardware (with a high chance of equipment damage)?&lt;/p&gt;

&lt;p&gt;You're not alone. For most of us, physical robots aren't an option. A decent mobile robot platform costs hundreds or thousands of dollars, breaks often, and requires space we don't have. But here's the thing: hardware shouldn't stop you from learning robotics. You don't need an expensive setup to build those amazing projects you've always envisaged.&lt;/p&gt;

&lt;p&gt;Simulation gets you remarkably close to real-world environments; close enough to learn, experiment, and prototype effectively. And reinforcement learning (RL) in simulation shouldn't feel abstract. Sure, understanding policy gradients, PPO, SAC, and all those acronyms matters, but there's something uniquely satisfying about watching an agent you trained actually navigate a world that looks and behaves like reality.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;Webots&lt;/strong&gt; comes in: industry-grade physics, used by researchers and companies worldwide, completely free. In this tutorial, we're connecting Webots with &lt;strong&gt;Stable Baselines3&lt;/strong&gt;, pairing a professional simulator with battle-tested RL algorithms.&lt;/p&gt;

&lt;p&gt;By the end of this tutorial, you'll have a complete simulation environment ready for RL training. No hardware required, just Python and a dose of curiosity 😉.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jtj7br4risvf8rj1oek.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jtj7br4risvf8rj1oek.gif" alt="An example of a Train Car in webots" width="480" height="360"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;An example of a Trained Car in webots&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What You'll Build
&lt;/h2&gt;

&lt;p&gt;By the end of this tutorial, you'll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A working Webots simulation world with a robot and target&lt;/li&gt;
&lt;li&gt;A Python virtual environment with Stable Baselines3 installed&lt;/li&gt;
&lt;li&gt;An external controller setup for running RL code from your IDE&lt;/li&gt;
&lt;li&gt;A verified connection between Python and Webots&lt;/li&gt;
&lt;li&gt;A foundation ready for building a Gymnasium environment (next tutorial)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The task:&lt;/strong&gt; A robot that will learn to navigate toward a target from any starting position. The setup is intentionally simple but powerful—once you understand this foundation, you can extend it to complex scenarios like autonomous driving.&lt;/p&gt;


&lt;h2&gt;
  
  
  Background: RL and Simulation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Reinforcement Learning (RL)&lt;/strong&gt; is a branch of Artificial Intelligence that trains agents through trial and error. Mathematically, it can be framed as an optimization problem: we design closed-loop control policies that maximize accumulated reward over time. RL has proven successful in modern systems ranging from LLMs to robotics and autonomous vehicles.&lt;/p&gt;
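
&lt;p&gt;In standard notation, the objective is to find a policy π that maximizes the expected discounted return (this is the textbook formulation, with discount factor γ; nothing here is specific to Webots):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;J(π) = E[ Σ_t γ^t · r(s_t, a_t) ]        find π* = argmax_π J(π)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;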

&lt;p&gt;&lt;strong&gt;Simulation&lt;/strong&gt; involves using computer software to create virtual environments that mimic real-world physics and dynamics. Instead of testing your RL agent on expensive hardware that can break or cause safety issues, you train it in a controlled digital replica. Think of it as a sandbox where your agent can fail thousands of times without consequences, learning what works before ever touching physical hardware.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why This Stack?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Webots&lt;/strong&gt; gives you industry-standard, physics-accurate simulation that's completely free and robot-agnostic. Whether you're working with wheeled robots, drones, or manipulator arms, Webots handles the physics engine, sensors, and actuators so you can focus on your RL and control logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stable Baselines3&lt;/strong&gt; provides production-ready RL algorithms (PPO, SAC, TD3, etc.) with clean APIs, excellent documentation, and active maintenance. Instead of implementing DDPG from scratch and debugging it for weeks, you get reliable, tested implementations.&lt;/p&gt;

&lt;p&gt;By connecting Webots with Stable Baselines3, you get professional-grade tools on both ends: simulation realistic enough to matter, and algorithms robust enough to work.&lt;/p&gt;


&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Knowledge:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic Python programming&lt;/li&gt;
&lt;li&gt;Familiarity with RL concepts (agent, environment, reward, policy)&lt;/li&gt;
&lt;li&gt;A sprinkle of curiosity to learn is often all you need✨&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+ (I'm using Python 3.12.0)&lt;/li&gt;
&lt;li&gt;Webots R2023b or later&lt;/li&gt;
&lt;li&gt;Stable Baselines3 and dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any modern computer (Windows, macOS, or Linux)&lt;/li&gt;
&lt;li&gt;4GB+ RAM recommended&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1: Install Python
&lt;/h3&gt;

&lt;p&gt;Download and install Python from &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;python.org&lt;/a&gt;. Make sure to check "Add Python to PATH" during installation.&lt;/p&gt;

&lt;p&gt;Verify installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Install Webots
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Visit &lt;a href="https://cyberbotics.com/" rel="noopener noreferrer"&gt;https://cyberbotics.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Download the package for your operating system&lt;/li&gt;
&lt;li&gt;Run the installer and follow the prompts (agree to all defaults)&lt;/li&gt;
&lt;li&gt;Launch Webots to verify installation&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Project Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Create Your Webots World
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Open Webots&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File → New → New Project Directory&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the Project Creation Wizard:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Directory name: &lt;code&gt;Webots_SB3_Tutorial&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;World name: &lt;code&gt;robot_navigation&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Check "Add a rectangle arena"&lt;/li&gt;
&lt;li&gt;Click Finish&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Webots will create the project structure and open your new world with a basic arena.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwcngsqlknu3zkijpbm7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwcngsqlknu3zkijpbm7.png" alt="Creating a new project in webots" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Set Up Python Environment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Here's something important:&lt;/strong&gt; Webots uses its own Python environment. Traditional virtual environments don't work directly with Webots controllers. When you set a controller in Webots, it launches a subprocess using the system Python, completely ignoring your activated virtual environment.&lt;/p&gt;

&lt;p&gt;For RL/ML workflows with external libraries like Stable Baselines3, we use &lt;strong&gt;External Controllers&lt;/strong&gt;. This lets you run your code from your terminal or IDE (where your virtual environment is active) while connecting to the Webots simulation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Navigate to your project folder and create a virtual environment:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Navigate to your Webots project&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;path-to-your&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\W&lt;/span&gt;ebots_SB3_Tutorial

&lt;span class="c"&gt;# Create virtual environment in the project folder&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv webots_rl_env

&lt;span class="c"&gt;# Activate it&lt;/span&gt;
&lt;span class="c"&gt;# On Windows:&lt;/span&gt;
webots_rl_env&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate

&lt;span class="c"&gt;# On macOS/Linux:&lt;/span&gt;
&lt;span class="nb"&gt;source &lt;/span&gt;webots_rl_env/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Install Required Packages:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;stable-baselines3[extra] gymnasium numpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import stable_baselines3; print(stable_baselines3.__version__)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Set Webots Environment Variable:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For external controllers to work, Python needs to know where Webots is installed. Set this once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Windows PowerShell:&lt;/span&gt;
&lt;span class="nv"&gt;$env&lt;/span&gt;:WEBOTS_HOME &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"C:&lt;/span&gt;&lt;span class="se"&gt;\P&lt;/span&gt;&lt;span class="s2"&gt;rogram Files&lt;/span&gt;&lt;span class="se"&gt;\W&lt;/span&gt;&lt;span class="s2"&gt;ebots"&lt;/span&gt;

&lt;span class="c"&gt;# Windows CMD:&lt;/span&gt;
&lt;span class="nb"&gt;set &lt;/span&gt;&lt;span class="nv"&gt;WEBOTS_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;C:&lt;span class="se"&gt;\P&lt;/span&gt;rogram Files&lt;span class="se"&gt;\W&lt;/span&gt;ebots

&lt;span class="c"&gt;# macOS/Linux:&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;WEBOTS_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/Applications/Webots.app
&lt;span class="c"&gt;# or wherever you installed Webots&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make this permanent, add it to your system environment variables or shell profile.&lt;/p&gt;
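
&lt;p&gt;If you want to sanity-check the variable from Python before launching anything, a quick snippet like this works (the &lt;code&gt;lib/controller/python&lt;/code&gt; path is where recent Webots releases keep the Python API; adjust if your install differs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import sys

# Check that WEBOTS_HOME is visible to Python
webots_home = os.environ.get("WEBOTS_HOME")
if webots_home is None:
    sys.exit("WEBOTS_HOME is not set - see the commands above")

# Recent Webots releases ship the Python API here (adjust if your layout differs)
api_path = os.path.join(webots_home, "lib", "controller", "python")
print("WEBOTS_HOME:", webots_home)
print("Python API folder exists:", os.path.isdir(api_path))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;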

&lt;p&gt;Your project structure should now look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Webots_SB3_Tutorial/
├── webots_rl_env/          # Your virtual environment
├── controllers/
├── libraries/
├── plugins/
├── worlds/
│   └── robot_navigation.wbt
└── protos/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Building Your Simulation World
&lt;/h2&gt;

&lt;p&gt;Now we'll add the components our RL agent needs: a robot to control, a target to reach, and a supervisor to manage the training loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding the Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq30uiy3e059qsreqwgy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq30uiy3e059qsreqwgy.png" alt="Webots &amp;amp; Stable-Baseline3 implementation" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we build, let's understand how the pieces connect:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webots runs like this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Initialize world → Update physics → Read sensors → Control actuators → Repeat 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
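
&lt;p&gt;In controller code, that loop is the canonical Webots pattern (a minimal sketch using the standard controller API; &lt;code&gt;step()&lt;/code&gt; returns -1 when the simulation ends):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from controller import Robot

robot = Robot()
timestep = int(robot.getBasicTimeStep())

# Advance the physics one tick at a time; step() returns -1 when Webots quits
while robot.step(timestep) != -1:
    pass  # read sensors and drive actuators here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;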



&lt;p&gt;&lt;strong&gt;Gymnasium (the RL standard) expects:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;reset() → observation step(action) → 
observation, reward, done, info 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The bridge:&lt;/strong&gt; We create a Gymnasium-compatible environment that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Controls the Webots simulation timestep &lt;/li&gt;
&lt;li&gt;Reads sensor data and converts to observations &lt;/li&gt;
&lt;li&gt;Receives actions and sends to robot actuators&lt;/li&gt;
&lt;li&gt;Calculates rewards based on task progress &lt;/li&gt;
&lt;li&gt;Detects episode termination&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Webots &amp;amp; Stable-Baselines3 interaction. We'll implement this bridge in the next tutorial.&lt;/em&gt;&lt;/p&gt;
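
&lt;p&gt;To make the bridge concrete, here's a rough skeleton of the environment class we'll flesh out next time (the method names come from the Gymnasium API; the spaces and everything inside the methods are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import gymnasium as gym
import numpy as np

class WebotsNavigationEnv(gym.Env):
    """Sketch of the Webots/Gymnasium bridge (full version in the next tutorial)."""

    def __init__(self, supervisor):
        self.supervisor = supervisor
        self.timestep = int(supervisor.getBasicTimeStep())
        # Placeholder spaces: [distance, angle] observation, [left, right] wheel speeds
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(2,))
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,))

    def reset(self, seed=None, options=None):
        ...  # move the robot to a start pose, return (observation, info)

    def step(self, action):
        ...  # apply the action, advance one timestep, return the 5-tuple
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;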

&lt;h3&gt;
  
  
  The Navigation Task
&lt;/h3&gt;

&lt;p&gt;We're building a simple but powerful setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A robot starts at random positions in the arena&lt;/li&gt;
&lt;li&gt;A target (goal) is placed somewhere in the arena&lt;/li&gt;
&lt;li&gt;The robot learns to drive toward the target using relative observations (distance and angle), not absolute positions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach means once trained, you can move the target anywhere and the robot will adapt. The policy learns "navigate toward what I see" rather than "go to coordinates (x, y)."&lt;/p&gt;
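
&lt;p&gt;As a preview, computing those relative observations from the positions the Supervisor reports is just a few lines (a sketch; the exact axes depend on your world's coordinate system):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def relative_observation(robot_pos, robot_heading, target_pos):
    """Distance and bearing to the target, expressed in the robot's own frame."""
    dx = target_pos[0] - robot_pos[0]
    dy = target_pos[1] - robot_pos[1]
    distance = math.hypot(dx, dy)
    angle = math.atan2(dy, dx) - robot_heading
    angle = math.atan2(math.sin(angle), math.cos(angle))  # wrap to [-pi, pi]
    return distance, angle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;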

&lt;h3&gt;
  
  
  Add the Robot
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add a robot to your world:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;In &lt;strong&gt;Webots&lt;/strong&gt;, click the &lt;strong&gt;Add&lt;/strong&gt; button (+ icon) in the scene&lt;/li&gt;
&lt;li&gt;Navigate to: &lt;strong&gt;PROTO nodes (Webots Projects) → robots → gctronic → e-puck → E-puck (Robot)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;or you can search for "E-Puck" in the &lt;code&gt;Add a node&lt;/code&gt; pop-up.&lt;/li&gt;
&lt;li&gt;Click Add&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a05snqvw8e2y77zh75s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a05snqvw8e2y77zh75s.png" alt="Add E-Puck robot in Webots" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Give the robot a DEF name:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Click on the E-puck in the scene tree&lt;/li&gt;
&lt;li&gt;At the very top of the node properties, add &lt;code&gt;ROBOT&lt;/code&gt; to the &lt;code&gt;DEF:&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;This allows our Python code to reference this specific robot&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8khlkqfgzzo6psaxghp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8khlkqfgzzo6psaxghp.png" alt=" " width="364" height="695"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Set the robot controller to external:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;In the properties panel, find the &lt;code&gt;controller&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;Change it from &lt;code&gt;"e-puck"&lt;/code&gt; to &lt;code&gt;&amp;lt;extern&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;This tells Webots we'll control it from our Python script&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z02qiyclcc14s5vlyc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z02qiyclcc14s5vlyc4.png" alt="Making a robot to be controlled by python in Webots" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Add the Target
&lt;/h3&gt;

&lt;p&gt;We need a visible target for the robot to navigate toward. We'll use a &lt;strong&gt;Solid node&lt;/strong&gt; so it can be repositioned programmatically (for testing different positions), but we'll make it non-colliding so the robot can reach the exact center.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add a Solid node:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click the &lt;strong&gt;Add&lt;/strong&gt; button&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Base nodes → Solid&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Give the target a DEF name:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select the Solid node&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;TARGET&lt;/code&gt; to the &lt;code&gt;DEF:&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;This allows our Python code to reference and move this object&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add visual appearance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expand the Solid node in the scene tree&lt;/li&gt;
&lt;li&gt;Right-click on &lt;code&gt;children []&lt;/code&gt; → &lt;strong&gt;Add New&lt;/strong&gt; → Choose &lt;strong&gt;Shape&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Expand the Shape node&lt;/li&gt;
&lt;li&gt;Right-click on &lt;code&gt;geometry NULL&lt;/code&gt; → &lt;strong&gt;Add New&lt;/strong&gt; → Choose &lt;strong&gt;Cylinder&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Configure the Cylinder:

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;radius&lt;/code&gt; to &lt;code&gt;0.01&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;height&lt;/code&gt; to &lt;code&gt;0.05&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Right-click on &lt;code&gt;appearance NULL&lt;/code&gt; → &lt;strong&gt;Add New&lt;/strong&gt; → Choose &lt;strong&gt;PBRAppearance&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Expand PBRAppearance and set &lt;code&gt;baseColor&lt;/code&gt; to red: &lt;code&gt;1 0 0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Position the target:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find the &lt;code&gt;translation&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;Set it to: &lt;code&gt;0.3 0.025 0.3&lt;/code&gt; (x, y, z coordinates)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why use Solid without collision?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Solid nodes can be moved programmatically via the Supervisor API (useful for testing)&lt;/li&gt;
&lt;li&gt;We skip physics and boundingObject so the robot can drive through the marker&lt;/li&gt;
&lt;li&gt;The target is purely visual—a goal marker, not a physical obstacle&lt;/li&gt;
&lt;li&gt;Later, you can add physics if you want obstacle avoidance training&lt;/li&gt;
&lt;/ul&gt;
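
&lt;p&gt;Because the target is a DEF-named Solid, repositioning it from Python later takes two lines through the Supervisor field API (the same field calls our test script below exercises):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Assumes `supervisor` is an initialized Supervisor instance
target_node = supervisor.getFromDef("TARGET")
target_node.getField("translation").setSFVec3f([0.2, 0.025, -0.2])  # new x, y, z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;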

&lt;h3&gt;
  
  
  Add the Supervisor
&lt;/h3&gt;

&lt;p&gt;For RL to work, we need a "supervisor" that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reset the robot position between episodes&lt;/li&gt;
&lt;li&gt;Read positions of both robot and target&lt;/li&gt;
&lt;li&gt;Calculate rewards&lt;/li&gt;
&lt;li&gt;Control the simulation&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add a Robot node for supervision:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click &lt;strong&gt;Add&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Base nodes → Robot&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure it as a supervisor:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;name&lt;/code&gt; to &lt;code&gt;"supervisor_controller"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;supervisor&lt;/code&gt; field to &lt;code&gt;TRUE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;controller&lt;/code&gt; to &lt;code&gt;&amp;lt;extern&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Save Your World
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File → Save World&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your scene tree should now look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywj18y7ug99uqi73myf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywj18y7ug99uqi73myf5.png" alt="Scene tree" width="364" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your scene should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvhzlaqcbe5i9gqxyvfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvhzlaqcbe5i9gqxyvfl.png" alt="Complete Webots Scene" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we just built:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ROBOT (E-puck):&lt;/strong&gt; The agent that will learn to navigate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TARGET (red cylinder):&lt;/strong&gt; The goal position&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supervisor:&lt;/strong&gt; The "brain" that runs our RL training loop&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Verifying Your Setup
&lt;/h2&gt;

&lt;p&gt;Let's make sure everything is connected properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create the test controller:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In your project, create a new folder: &lt;code&gt;controllers/test_supervisor/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Inside that folder, create a file: &lt;code&gt;test_supervisor.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your folder structure should look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Webots_SB3_Tutorial/
├── webots_rl_env/
├── controllers/
│   └── test_supervisor/
│       └── test_supervisor.py
├── worlds/
│   └── robot_navigation.wbt
└── protos/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Add this code to &lt;code&gt;test_supervisor.py&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;controller&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Supervisor&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize supervisor
&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Supervisor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;timestep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getBasicTimeStep&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Test: Can we access our nodes?
&lt;/span&gt;&lt;span class="n"&gt;robot_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getFromDef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ROBOT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;target_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getFromDef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TARGET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;robot_node&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;target_node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✓ Setup successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Robot found at: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;robot_node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getPosition&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Target found at: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target_node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getPosition&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Test moving the target
&lt;/span&gt;    &lt;span class="n"&gt;trans_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;translation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;current_pos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trans_field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getSFVec3f&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Target can be moved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_pos&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✗ Setup error!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;robot_node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Missing: ROBOT (check DEF name on E-puck)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;target_node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Missing: TARGET (check DEF name on Solid)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run one simulation step
&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timestep&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✓ Simulation step successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;To run the test:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In Webots, open your &lt;code&gt;robot_navigation.wbt&lt;/code&gt; world&lt;/li&gt;
&lt;li&gt;Select the &lt;strong&gt;Robot (supervisor_controller)&lt;/strong&gt; node in the scene tree&lt;/li&gt;
&lt;li&gt;Change its &lt;code&gt;controller&lt;/code&gt; field from &lt;code&gt;&amp;lt;extern&amp;gt;&lt;/code&gt; to &lt;code&gt;test_supervisor&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Play&lt;/strong&gt; button (▶️) in Webots (you may need to click Restart first)&lt;/li&gt;
&lt;li&gt;Check the Webots console (bottom panel)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Expected output in the Webots console:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO: test_supervisor: Starting controller: python.exe -u test_supervisor.py
✓ Setup successful!
  Robot found at: [0.0, 0.0, 0.0]
  Target found at: [0.3, 0.025, 0.3]
  Target can be moved: [0.3, 0.025, 0.3]
✓ Simulation step successful!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After testing:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Important:&lt;/strong&gt; Change the supervisor's &lt;code&gt;controller&lt;/code&gt; field back to &lt;code&gt;&amp;lt;extern&amp;gt;&lt;/code&gt; (we'll need this for the next tutorial)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File → Save World&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Optional test:&lt;/strong&gt; Hold &lt;strong&gt;Shift + Left Click&lt;/strong&gt; and drag the target in the 3D view. It should move freely, confirming the physics setup is correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Accomplished
&lt;/h2&gt;

&lt;p&gt;🎉 Congratulations! You've built a complete foundation for RL training in Webots:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installed Webots and Python environment
&lt;/li&gt;
&lt;li&gt;Created a simulation world with robot and target
&lt;/li&gt;
&lt;li&gt;Configured external controller setup
&lt;/li&gt;
&lt;li&gt;Verified Python can communicate with Webots
&lt;/li&gt;
&lt;li&gt;Ready to build a Gymnasium environment (next tutorial)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Coming in the next tutorial:&lt;/strong&gt; "Building a Gymnasium Environment for Webots Robot Control"&lt;/p&gt;

&lt;p&gt;We'll write the code that bridges Stable Baselines3 and Webots:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating a custom Gymnasium environment class&lt;/li&gt;
&lt;li&gt;Implementing &lt;code&gt;reset()&lt;/code&gt; and &lt;code&gt;step()&lt;/code&gt; methods&lt;/li&gt;
&lt;li&gt;Defining observation and action spaces&lt;/li&gt;
&lt;li&gt;Designing a reward function&lt;/li&gt;
&lt;li&gt;Handling episode termination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 &lt;strong&gt;Complete code:&lt;/strong&gt; [&lt;a href="https://github.com/sam-dude/Webots_SB3_Tutorial" rel="noopener noreferrer"&gt;https://github.com/sam-dude/Webots_SB3_Tutorial&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;📚 &lt;a href="https://cyberbotics.com/doc/guide/index" rel="noopener noreferrer"&gt;Webots Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📚 &lt;a href="https://stable-baselines3.readthedocs.io/" rel="noopener noreferrer"&gt;Stable Baselines3 Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📚 &lt;a href="https://gymnasium.farama.org/" rel="noopener noreferrer"&gt;Gymnasium Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;You might find interacting with Webots a bit confusing at first. Getting introduced to a tool with this many features can feel daunting. But here's the thing: the best way to learn is by playing around.&lt;/p&gt;

&lt;p&gt;Go beyond what we've covered in this tutorial. Experiment with the "pre-made" robots available in Webots. Try out your own ideas by adding and customizing different nodes. Webots lets you create custom environments, and hands-on exploration is often the fastest way to get comfortable with any new tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You now have a professional-grade simulation setup ready for RL experimentation. This foundation uses the same tools researchers and companies use for real robotics projects—no expensive hardware required.&lt;/p&gt;

&lt;p&gt;The key insight we've established: by using relative observations (distance and angle to target) instead of absolute positions, our future trained agent will generalize. Move the target anywhere, and the robot will adapt.&lt;/p&gt;

&lt;p&gt;In the next tutorial, we will connect our Webots environment to Gymnasium.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thank you for reading this piece to the end. If you face any issues during implementation, drop a comment and I'll do my best to respond promptly.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>webots</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Imitation Learning: A Stanford Walkthrough</title>
      <dc:creator>IBIYEMI Samuel O.</dc:creator>
      <pubDate>Sat, 17 Jan 2026 01:26:24 +0000</pubDate>
      <link>https://dev.to/samdude/imitation-learning-a-stanford-walkthrough-22cc</link>
      <guid>https://dev.to/samdude/imitation-learning-a-stanford-walkthrough-22cc</guid>
      <description>&lt;p&gt;AI and robotics are evolving rapidly, and reinforcement learning (RL) has become foundational to modern AI systems. You see RL everywhere: in large language models, self-driving cars, and the emerging field of physical AI. Within RL, imitation learning stands out as a particularly practical technique. Instead of engineering complex reward functions, we can simply show an AI agent what to do by demonstrating the desired behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cs224r.stanford.edu/" rel="noopener noreferrer"&gt;&lt;strong&gt;Stanford's CS224R: Deep Reinforcement Learning&lt;/strong&gt;&lt;/a&gt; course offers a rigorous, freely available introduction to these concepts. The course has a well-structured syllabus and covers industry-standard material that's directly applicable to real-world problems—making it an excellent resource for anyone looking to break into AI and robotics. Best of all, it's completely free and available on &lt;a href="https://www.youtube.com/watch?v=EvHRQhMX7_w&amp;amp;list=PLoROMvodv4rPwxE0ONYRa_itZFdaKCylL" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This walkthrough documents my experience with one of the early assignments: implementing and experimenting with two foundational imitation learning methods; &lt;strong&gt;Behavior Cloning (BC)&lt;/strong&gt; and &lt;strong&gt;DAgger&lt;/strong&gt;. While researching self-driving car systems and robotics applications, I noticed that imitation learning consistently appears as a foundational technique. Companies like Nvidia, for instance, have published work showing its importance in autonomous driving. This assignment provided hands-on experience with the core concepts, implementation decisions, and the practical tradeoffs between different approaches.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Imitation Learning?
&lt;/h2&gt;

&lt;p&gt;There are certain behaviors that human experts know how to do but may require rigorous domain knowledge to program into a robot system. These expert behaviors can be demonstrated to the AI agent. The purpose of imitation learning is to efficiently learn a desired behavior by imitating an expert's behavior.&lt;/p&gt;

&lt;p&gt;This can be powerful for building autonomous behavior in systems that are meant to mimic a human approach. For instance, imitation learning can be deployed for autonomous behavior in vehicles, computer games, and robotic applications. It's a go-to technique for many companies working on self-driving cars, robotics, and leading physical AI systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Imitation Learning?
&lt;/h2&gt;

&lt;p&gt;The goal of reinforcement learning is to determine closed-loop control policies that maximize an accumulated reward. RL systems are generally classified as model-based or model-free. In both cases, there is a general assumption that a reward function is known; the agent runs in an environment, collects data, and uses it either to learn a model of the environment and then improve the policy (model-based) or to update the learned policy directly (model-free).&lt;/p&gt;

&lt;p&gt;Imitation learning is quite similar to RL; the difference is that there is no explicit reward function r = R(s_t, u_t). Instead, it is assumed that a set of demonstrations is provided by an expert. The goal is to learn a policy π that follows the closed-loop control law:&lt;/p&gt;

&lt;p&gt;u_t = π(s_t)&lt;/p&gt;

&lt;p&gt;There is an expert policy π* (derived from the demonstrations) that the learned policy aims to imitate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Goal of Imitation Learning&lt;/strong&gt;: Find a policy π that best imitates the expert policy π*, effectively capturing the expert's behavior from the provided demonstrations.&lt;/p&gt;

&lt;p&gt;(Sounding complex? I'll keep it simple, I promise 😊)&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Approaches to Imitation Learning
&lt;/h2&gt;

&lt;p&gt;There are two common approaches to imitation learning. The first is to directly learn to imitate the expert's policy; the second is to imitate the policy indirectly by learning the expert's reward function instead. In this blog, I will focus on classical approaches that directly learn to imitate the expert, in two ways. Such a policy can usually be obtained through standard supervised learning, as in behavior cloning and the DAgger algorithm. Learning the expert's reward function (which I will not cover here; you can read more in the resources listed) is known as &lt;em&gt;inverse reinforcement learning&lt;/em&gt;. We shall discuss BC and DAgger, considering their intuition and limitations, and I will also explain my architectural choices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Behavior Cloning (BC)
&lt;/h3&gt;

&lt;p&gt;The algorithm uses a set of demonstrated trajectories from an expert to determine a policy π that imitates the expert's actions. Supervised learning techniques can be applied directly: the learned policy is fit to the expert's state-action pairs with respect to some distance metric. It is a classical optimization problem.&lt;/p&gt;
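
&lt;p&gt;For continuous actions this usually reduces to plain regression. In the common mean-squared-error form (one choice of metric among several):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L(π) = Σ over (s_t, u_t) in D of ||π(s_t) − u_t||²
learn π by minimizing L, where D is the set of expert state-action pairs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;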

&lt;p&gt;&lt;strong&gt;But this has a major shortcoming:&lt;/strong&gt; the expert data often fails to cover the states the policy actually visits at deployment. Small deviations accumulate (often called "trajectory drift"), driving the policy into situations that are not equivalent to what it was trained on, where its performance deteriorates. This is a major challenge in imitation learning.&lt;/p&gt;

&lt;p&gt;The core issue is distributional mismatch: during training, the policy only sees states from the expert's trajectories, but during deployment, small errors compound over time, leading the policy into states it has never encountered. These unfamiliar states cause the policy to make poor decisions, which leads to even more unfamiliar states—a cascading failure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b4ainsdfiisewbx0115.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b4ainsdfiisewbx0115.gif" alt="Behaviour Cloning algorithm applied to Ant" width="760" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: Behavior Cloning in action during early training. Notice how the Ant agent struggles.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DAgger: Data Aggregation
&lt;/h3&gt;

&lt;p&gt;DAgger is a direct patch to the distributional mismatch problem. It collects additional data from the expert iteratively to update the policy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How DAgger works:&lt;/strong&gt; Rather than training once on a fixed dataset, DAgger runs the current learned policy in the environment, observes the states it reaches (which may differ from the expert's states), then queries the expert for the correct action at those states. This new data is aggregated with the previous training data, and the policy is retrained. This iterative process progressively reduces the distributional mismatch between training and deployment by ensuring the policy sees and learns from the states it actually encounters.&lt;/p&gt;
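
&lt;p&gt;In code form, the loop looks roughly like this (a sketch of the algorithm from Ross et al., 2011; &lt;code&gt;collect_trajectory&lt;/code&gt; and &lt;code&gt;train_policy&lt;/code&gt; are hypothetical helpers, not my assignment code, which I can't share):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def dagger(expert_policy, env, n_iterations=10):
    """Sketch of DAgger; collect_trajectory and train_policy are hypothetical."""
    dataset = []   # aggregated (state, expert_action) pairs
    policy = None  # current learned policy

    for i in range(n_iterations):
        # Roll out the current policy (the expert on the first pass)
        rollout_policy = expert_policy if policy is None else policy
        states = collect_trajectory(env, rollout_policy)

        # Relabel the visited states with the expert's actions, then aggregate
        dataset += [(s, expert_policy(s)) for s in states]

        # Retrain on everything gathered so far (plain supervised learning)
        policy = train_policy(dataset)

    return policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;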

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdyi3odlh5zbnqy8zwl4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdyi3odlh5zbnqy8zwl4.gif" alt="DAgger algorithm applied to Ant" width="720" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2: DAgger in action. The same environment shows significantly more stable behavior; the agent now maintains its balance.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  My Experimental Setup
&lt;/h2&gt;

&lt;p&gt;I implemented and evaluated both Behavior Cloning and DAgger on three MuJoCo continuous control environments: Ant-v2, HalfCheetah-v2, and Hopper-v2. Each environment presents different challenges: Ant requires coordinating multiple legs, HalfCheetah involves high-speed forward locomotion, and Hopper demands delicate balance control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Details
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Network Architecture:&lt;/strong&gt;&lt;br&gt;
I used a simple feedforward neural network with 2 hidden layers of 64 units each, using ReLU activations. The policy outputs continuous actions directly without any output activation. This architecture is deliberately simple—I wanted to see how much the algorithm itself (BC vs DAgger) matters compared to model capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learning rate: 0.005 with Adam optimizer&lt;/li&gt;
&lt;li&gt;Training steps per iteration: 1000&lt;/li&gt;
&lt;li&gt;Mini-batch size: 100&lt;/li&gt;
&lt;li&gt;For BC: Single iteration on expert data&lt;/li&gt;
&lt;li&gt;For DAgger: 10 iterations with expert relabeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Collection:&lt;/strong&gt;&lt;br&gt;
Each iteration collected 1000 environment steps. For DAgger, the expert policy relabeled the states visited by the learned policy, and all data was accumulated in a replay buffer with capacity of 1,000,000.&lt;/p&gt;
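
&lt;p&gt;The supervised update itself is a standard regression step (a sketch using the hyperparameters above; &lt;code&gt;policy&lt;/code&gt;, &lt;code&gt;states&lt;/code&gt;, and &lt;code&gt;expert_actions&lt;/code&gt; are assumed to already exist as tensors):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=0.005)
loss_fn = torch.nn.MSELoss()

for step in range(1000):                         # training steps per iteration
    idx = torch.randint(0, len(states), (100,))  # mini-batch of 100
    loss = loss_fn(policy(states[idx]), expert_actions[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;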

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;The results were striking, particularly for Hopper:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment&lt;/th&gt;
&lt;th&gt;BC Return&lt;/th&gt;
&lt;th&gt;BC % Expert&lt;/th&gt;
&lt;th&gt;DAgger Return&lt;/th&gt;
&lt;th&gt;DAgger % Expert&lt;/th&gt;
&lt;th&gt;Expert Return&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ant&lt;/td&gt;
&lt;td&gt;4216.1 ± 0.0&lt;/td&gt;
&lt;td&gt;88.9%&lt;/td&gt;
&lt;td&gt;4845.1 ± 0.0&lt;/td&gt;
&lt;td&gt;102.1%&lt;/td&gt;
&lt;td&gt;4744.3&lt;/td&gt;
&lt;td&gt;+13.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hopper&lt;/td&gt;
&lt;td&gt;879.8 ± 345.1&lt;/td&gt;
&lt;td&gt;23.7%&lt;/td&gt;
&lt;td&gt;3709.4 ± 0.0&lt;/td&gt;
&lt;td&gt;99.8%&lt;/td&gt;
&lt;td&gt;3716.0&lt;/td&gt;
&lt;td&gt;+76.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HalfCheetah&lt;/td&gt;
&lt;td&gt;3835.3 ± 0.0&lt;/td&gt;
&lt;td&gt;94.3%&lt;/td&gt;
&lt;td&gt;3905.8 ± 0.0&lt;/td&gt;
&lt;td&gt;96.0%&lt;/td&gt;
&lt;td&gt;4067.9&lt;/td&gt;
&lt;td&gt;+1.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Learning Curves
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01qopjvesesjuqgclmjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01qopjvesesjuqgclmjk.png" alt=" " width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 3: Training curves showing BC vs DAgger vs Expert baseline for Ant, Hopper, and HalfCheetah environments. The curves illustrate how DAgger's iterative data collection leads to continued improvement, particularly visible in Hopper's dramatic performance gain.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Observations:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hopper showed the most dramatic improvement.&lt;/strong&gt; BC achieved only 23.7% of expert performance with high variance (±345.1), while DAgger reached 99.8% of expert performance. This makes sense. Hopper requires precise balance, and small deviations from the expert trajectory quickly lead to falls. BC had no way to recover from these states, but DAgger explicitly learned from them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ant performed well even with BC&lt;/strong&gt; (88.9% of expert), but DAgger still provided meaningful gains, actually exceeding expert performance at 102.1%. This suggests Ant's dynamics are more forgiving. The policy can make small errors without catastrophic failure, reducing the distributional mismatch problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HalfCheetah saw minimal improvement&lt;/strong&gt; from DAgger (94.3% → 96.0%). My hypothesis is that HalfCheetah's task is relatively stable once the policy learns the basic running gait, and the expert demonstrations already covered the relevant state space well enough for BC to succeed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Comparison
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eh2l5no776um5vryo5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eh2l5no776um5vryo5v.png" alt="Performance comparison" width="800" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4: Final performance comparison across all three environments, showing the relative success of BC and DAgger as percentages of expert performance.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;This assignment reinforced that &lt;strong&gt;the choice of algorithm matters immensely depending on the task dynamics&lt;/strong&gt;. For sensitive, unstable tasks like Hopper, the distributional mismatch in BC is crippling. DAgger's iterative data collection isn't just a theoretical improvement, it's the difference between 24% and 100% success.&lt;/p&gt;

&lt;p&gt;Also, I figured that &lt;strong&gt;looking at variance is as important as looking at mean performance&lt;/strong&gt;. BC's high variance on Hopper was a clear signal that something was fundamentally wrong, not just that the policy needed more training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not all tasks need DAgger.&lt;/strong&gt; HalfCheetah's results suggest that for some problems, expert demonstrations alone provide sufficient coverage. Understanding when to use which approach requires careful consideration of your environment's dynamics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Imitation learning doesn't solve all autonomy problems on its own.&lt;/strong&gt; In practice, it's often used to create base models upon which different algorithms (including RL methods like PPO) are applied to fine-tune performance. This two-stage approach—imitation learning for initialization, RL for refinement—is common in real-world applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DAgger's benefits come with costs.&lt;/strong&gt; While it solves the distributional mismatch problem effectively, it's computationally intensive. Each iteration requires running the policy in the environment and querying the expert for labels, which can be expensive or even infeasible in some domains (imagine needing a human expert to label thousands of states). This practical constraint is why BC remains popular despite its theoretical limitations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Working through this imitation learning assignment gave me hands-on experience with the core challenge of learning from demonstrations: distributional mismatch. The contrast between BC and DAgger wasn't just academic: I saw firsthand how Hopper went from barely functioning to expert-level performance simply by addressing which states the policy trains on.&lt;/p&gt;

&lt;p&gt;The bigger lesson is about understanding your problem before choosing your algorithm. DAgger isn't always necessary (as HalfCheetah showed), and BC isn't always insufficient (as Ant demonstrated). The key is recognizing when your task's dynamics will punish distribution shift, and planning accordingly.&lt;/p&gt;

&lt;p&gt;These experiments also highlighted that imitation learning is a starting point, not an ending point, for building robust autonomous systems. The path from expert demonstrations to production-ready policies often involves multiple stages—imitation learning for initialization, reinforcement learning for refinement, and careful engineering throughout.&lt;/p&gt;

&lt;p&gt;If you're working on similar problems, I hope this walkthrough helps you think about when to use BC, when to invest in DAgger's computational cost, and how to evaluate whether distributional mismatch is your bottleneck.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code and Course Information
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Note on Code:&lt;/strong&gt; Due to course policy, I cannot share the implementation code for this assignment. However, the concepts and approaches described here can be implemented following the algorithm descriptions in the referenced papers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Course Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=EvHRQhMX7_w&amp;amp;list=PLoROMvodv4rPwxE0ONYRa_itZFdaKCylL" rel="noopener noreferrer"&gt;Stanford CS224R Deep Reinforcement Learning(YouTube Playlist)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cs224r.stanford.edu/" rel="noopener noreferrer"&gt;Course Website and Syllabus&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These resources provide comprehensive coverage of imitation learning, reinforcement learning, and related topics.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., &amp;amp; Peters, J. (2018). "An Algorithmic Perspective on Imitation Learning." &lt;em&gt;Foundations and Trends in Robotics&lt;/em&gt;, 7(1-2), 1-179.&lt;/li&gt;
&lt;li&gt;Ross, S., Gordon, G., &amp;amp; Bagnell, D. (2011). "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning." &lt;em&gt;Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS)&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Pomerleau, D. A. (1989). "ALVINN: An Autonomous Land Vehicle in a Neural Network." &lt;em&gt;Advances in Neural Information Processing Systems (NIPS)&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Bojarski, M., et al. (2016). "End to End Learning for Self-Driving Cars." &lt;em&gt;arXiv preprint arXiv:1604.07316&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Todorov, E., Erez, T., &amp;amp; Tassa, Y. (2012). "MuJoCo: A physics engine for model-based control." &lt;em&gt;IEEE/RSJ International Conference on Intelligent Robots and Systems&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;You made it to the end 😊 Thanks for reading! If you found this helpful or have questions about imitation learning, feel free to reach out or leave a comment.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>robotics</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>I've been exploring Robotics with Reinforcement Learning. 

In this blog, I wrote about how I confronted the common learning trap - basking in the quagmire of personal deception; posing as if I knew much when I had only acquired little.

...</title>
      <dc:creator>IBIYEMI Samuel O.</dc:creator>
      <pubDate>Mon, 01 Dec 2025 14:46:44 +0000</pubDate>
      <link>https://dev.to/samdude/ive-been-exploring-robotics-with-reinforcement-learning-in-this-blog-i-wrote-about-how-i-3fbg</link>
      <guid>https://dev.to/samdude/ive-been-exploring-robotics-with-reinforcement-learning-in-this-blog-i-wrote-about-how-i-3fbg</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/samdude" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1188173%2F7c7774d1-cae2-4d73-a15a-80717f067653.png" alt="samdude"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/samdude/i-trained-a-robot-arm-what-i-failed-to-learn-2cmf" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;I trained a Robot Arm: What I failed to learn.&lt;/h2&gt;
      &lt;h3&gt;IBIYEMI Samuel O. ・ Dec 1&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#deeplearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#learning&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>learning</category>
    </item>
    <item>
      <title>I trained a Robot Arm: What I failed to learn.</title>
      <dc:creator>IBIYEMI Samuel O.</dc:creator>
      <pubDate>Mon, 01 Dec 2025 08:06:34 +0000</pubDate>
      <link>https://dev.to/samdude/i-trained-a-robot-arm-what-i-failed-to-learn-2cmf</link>
      <guid>https://dev.to/samdude/i-trained-a-robot-arm-what-i-failed-to-learn-2cmf</guid>
      <description>&lt;p&gt;First, there is so much to learn.&lt;/p&gt;

&lt;p&gt;Understanding foundational ML concepts and having AI-accelerated workflows doesn't mean you can just jump in and skip the steep learning curve. I learned that the expensive way.&lt;/p&gt;

&lt;p&gt;Reinforcement Learning (RL) is distinct from other ML fields. Even though they share boundaries, RL has concepts that even hardcore ML engineers won't grasp immediately.&lt;/p&gt;

&lt;p&gt;My first mistake was trying to skip steps. I was ambitious; that was glaring. I wanted results ASAP (my self-destructive habit of posing to the world). I was too focused on seeing it work to apply my intuition to the hard details.&lt;/p&gt;

&lt;p&gt;After completing my first RL project with AI's help, I could feel it in my gut: I had learned nothing, or at least too little to justify the achievement I claimed. That's when I went back to relearn. It took time, but it was rewarding.&lt;/p&gt;

&lt;p&gt;Now that you've heard the story of my life, let's get technical.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;p&gt;The robot arm is the Pusher from Gymnasium, with 7 Degrees of Freedom (DOF). It's a multi-jointed robot arm similar to a human arm. The goal is to move a target cylinder (the object) to a goal position using the robot's end effector (the fingertip). The robot has shoulder, elbow, forearm, and wrist joints (see the &lt;a href="https://gymnasium.farama.org/environments/mujoco/pusher/#description" rel="noopener noreferrer"&gt;Gymnasium description&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In RL environments, we usually deal with discrete or continuous action spaces. A discrete action space has a finite set of actions; it's usually easier to learn, even with low compute. A continuous action space is effectively infinite: the gradient can easily explode, the agent can get stuck in a local optimum or not learn at all, and training takes a lot of time.&lt;/p&gt;
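
&lt;p&gt;To make that concrete, here's a quick sketch (assuming &lt;code&gt;gymnasium[mujoco]&lt;/code&gt; is installed) comparing the two kinds of spaces. Pusher exposes one torque per actuated joint, so its action space is a 7-dimensional continuous Box, while a classic control task like CartPole offers just two discrete actions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import gymnasium as gym

# Continuous control: Pusher applies one torque per actuated joint.
pusher = gym.make("Pusher-v4")
print(pusher.action_space)       # Box(-2.0, 2.0, (7,), float32)
print(pusher.observation_space)  # Box(-inf, inf, (23,), float64)

# Discrete control: CartPole only offers "push left" or "push right".
cartpole = gym.make("CartPole-v1")
print(cartpole.action_space)     # Discrete(2)
&lt;/code&gt;&lt;/pre&gt;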

&lt;p&gt;Our robot lives in a continuous action space. Our battle is long, and the environment is as unpredictable as a raging sea.&lt;/p&gt;

&lt;p&gt;A cool intuition here would be:&lt;/p&gt;

&lt;p&gt;Since discrete action spaces are easier and faster to learn, what if we discretise the action space so we can use algorithms that work well in discrete settings, like DQN? The problem is that the number of actions increases exponentially with the degrees of freedom (source: the &lt;a href="https://arxiv.org/abs/1509.02971" rel="noopener noreferrer"&gt;DDPG paper&lt;/a&gt;). Hence, the curse of dimensionality.&lt;/p&gt;
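
&lt;p&gt;The back-of-the-envelope math shows how fast this blows up: with K bins per joint and 7 joints, the discrete action set has K&lt;sup&gt;7&lt;/sup&gt; entries.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Discretising a 7-DOF continuous action space: bins_per_joint ** n_joints
n_joints = 7
for bins_per_joint in (3, 5, 10):
    n_actions = bins_per_joint ** n_joints
    print(f"{bins_per_joint} bins/joint: {n_actions:,} discrete actions")

# 3 bins/joint: 2,187 discrete actions
# 5 bins/joint: 78,125 discrete actions
# 10 bins/joint: 10,000,000 discrete actions
&lt;/code&gt;&lt;/pre&gt;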

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkz1rta86ahq9fzj065.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkz1rta86ahq9fzj065.gif" alt="Pusher Robot performance at 20k episode ~ 2 million timesteps" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Training Journey
&lt;/h3&gt;

&lt;p&gt;I initially tried SAC (Soft Actor-Critic). After around 2 million timesteps, it had failed to learn anything significant, and that was roughly 14 hours of training on my CPU. I introduced tricks: checkpointing (suggested by Mave), interval video recording, low-end optimisations, and free GPU runs on Google Colab.&lt;/p&gt;
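
&lt;p&gt;For context, a minimal version of that setup in Stable Baselines3 looks roughly like this (the paths and hyperparameters here are illustrative, not my exact run):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import CheckpointCallback

env = gym.make("Pusher-v4")

# Save a checkpoint every 50k steps so a crashed or terminated run
# can resume instead of losing hours of CPU time.
checkpoint_cb = CheckpointCallback(
    save_freq=50_000,
    save_path="./checkpoints/",
    name_prefix="sac_pusher",
)

model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=2_000_000, callback=checkpoint_cb)
model.save("sac_pusher_final")
&lt;/code&gt;&lt;/pre&gt;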

&lt;p&gt;I made some progress, but it was hard to get anything encouraging at that stage. The rewards were signaling that it wasn't learning, so I had to terminate the run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9a236zbuq1z4np184q6.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9a236zbuq1z4np184q6.gif" alt="Best model at 2 million timesteps" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Best model at 2 million timesteps, it's obvious it ain't learning.😓&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But after I dug further, I saw that I could improve my results with techniques like HER (Hindsight Experience Replay). HER is a replay technique for off-policy algorithms, designed for applications where admissible behaviors aren't necessarily known in advance. Previous approaches required careful reward shaping and in-depth domain knowledge; HER sidesteps this by learning from sparse, unshaped reward signals.&lt;/p&gt;
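
&lt;p&gt;Stable Baselines3 ships HER as a replay buffer rather than a standalone algorithm, and it needs a goal-conditioned environment (a Dict observation with &lt;code&gt;achieved_goal&lt;/code&gt; and &lt;code&gt;desired_goal&lt;/code&gt;, plus a &lt;code&gt;compute_reward()&lt;/code&gt; method), which the stock Pusher doesn't expose. A rough sketch, assuming a goal-conditioned task like FetchPush from Gymnasium-Robotics:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import gymnasium as gym
import gymnasium_robotics  # registers the Fetch environments
from stable_baselines3 import SAC, HerReplayBuffer

# A goal-conditioned env; the stock Pusher would need a wrapper
# exposing observation/achieved_goal/desired_goal to use HER.
env = gym.make("FetchPush-v2")

model = SAC(
    "MultiInputPolicy",  # handles Dict observations
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,                  # relabeled goals per transition
        goal_selection_strategy="future",  # pick goals from later in the episode
    ),
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
&lt;/code&gt;&lt;/pre&gt;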

&lt;h3&gt;
  
  
  What's Next?
&lt;/h3&gt;

&lt;p&gt;As Andrew Ng wrote in "The Batch" (huge fan😊):&lt;/p&gt;

&lt;p&gt;"The single biggest predictor of how rapidly a team makes progress building an AI agent lies in their ability to drive a disciplined process for evals (measuring the system's performance) and error analysis (identifying the causes of errors). It's tempting to shortcut these processes and quickly attempt fixes to mistakes rather than slowing down to identify root causes. But evals and error analysis can lead to much faster progress."&lt;/p&gt;

&lt;p&gt;He also emphasized that without understanding how computers work, you can't just "vibe code" your way to greatness. Fundamentals are essential.&lt;/p&gt;

&lt;p&gt;Being able to understand concepts and apply them matters more than producing results without adequate understanding. That's exactly what I had been doing: shortcutting the process, chasing results without grasping the fundamentals.&lt;/p&gt;

&lt;p&gt;So, what will I do differently? I'll check out the original algorithm papers before starting implementation. I'll digest first. I'll also improve my RL algorithm debugging skills - because understanding why something fails is just as important as making it work.&lt;/p&gt;

&lt;p&gt;Stay tuned for my learning updates.&lt;/p&gt;

&lt;p&gt;Till then,&lt;/p&gt;

&lt;p&gt;Keep Learning,&lt;br&gt;
Samuel Ibiyemi&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
