<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: IBIYEMI Samuel O.</title>
    <description>The latest articles on DEV Community by IBIYEMI Samuel O. (@samdude).</description>
    <link>https://dev.to/samdude</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1188173%2F7c7774d1-cae2-4d73-a15a-80717f067653.png</url>
      <title>DEV Community: IBIYEMI Samuel O.</title>
      <link>https://dev.to/samdude</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/samdude"/>
    <language>en</language>
    <item>
      <title>Setting Up Webots with Stable Baselines3 for Reinforcement Learning</title>
      <dc:creator>IBIYEMI Samuel O.</dc:creator>
      <pubDate>Sun, 15 Feb 2026 13:43:03 +0000</pubDate>
      <link>https://dev.to/samdude/setting-up-webots-with-stable-baselines3-for-reinforcement-learning-2ikc</link>
      <guid>https://dev.to/samdude/setting-up-webots-with-stable-baselines3-for-reinforcement-learning-2ikc</guid>
      <description>&lt;p&gt;Ever thought of building an actual robot? Only to be faced with the high price tags for hardware (with a high chance of equipment damage)?&lt;/p&gt;

&lt;p&gt;You're not alone. For most of us, physical robots aren't an option. A decent mobile robot platform costs hundreds or thousands of dollars, breaks often, and requires space we don't have. But here's the thing: hardware shouldn't stop you from learning robotics. You don't need an expensive setup to build those amazing projects you've always envisaged.&lt;/p&gt;

&lt;p&gt;Simulation gets you remarkably close to real-world environments; close enough to learn, experiment, and prototype effectively. And reinforcement learning (RL) in simulation shouldn't feel abstract. Sure, understanding policy gradients, PPO, SAC, and all those acronyms matters, but there's something uniquely satisfying about watching an agent you trained actually navigate a world that looks and behaves like reality.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;Webots&lt;/strong&gt; comes in: industry-grade physics, used by researchers and companies worldwide, completely free. In this tutorial, we're connecting Webots with &lt;strong&gt;Stable Baselines3&lt;/strong&gt;, pairing a professional simulator with battle-tested RL algorithms.&lt;/p&gt;

&lt;p&gt;By the end of this tutorial, you'll have a complete simulation environment ready for RL training. No hardware required, just Python and a dose of curiosity 😉.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jtj7br4risvf8rj1oek.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2jtj7br4risvf8rj1oek.gif" alt="An example of a Train Car in webots" width="480" height="360"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;An example of a Trained Car in webots&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What You'll Build
&lt;/h2&gt;

&lt;p&gt;By the end of this tutorial, you'll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A working Webots simulation world with a robot and target&lt;/li&gt;
&lt;li&gt;A Python virtual environment with Stable Baselines3 installed&lt;/li&gt;
&lt;li&gt;An external controller setup for running RL code from your IDE&lt;/li&gt;
&lt;li&gt;A verified connection between Python and Webots&lt;/li&gt;
&lt;li&gt;A foundation ready for building a Gymnasium environment (next tutorial)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The task:&lt;/strong&gt; A robot that will learn to navigate toward a target from any starting position. The setup is intentionally simple but powerful—once you understand this foundation, you can extend it to complex scenarios like autonomous driving.&lt;/p&gt;


&lt;h2&gt;
  
  
  Background: RL and Simulation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Reinforcement Learning (RL)&lt;/strong&gt; is a branch of Artificial Intelligence that trains agents through trial and error. Mathematically, it can be framed as an optimization problem: we design closed-loop control policies that maximize accumulated reward over time. RL has proven successful in modern systems ranging from LLMs to robotics and autonomous vehicles.&lt;/p&gt;
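
&lt;p&gt;In standard notation, the objective is to find a policy π that maximizes the expected discounted return (this is the textbook formulation, with discount factor γ; nothing here is specific to Webots):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;J(π) = E[ Σ_t γ^t · r(s_t, a_t) ]        find π* = argmax_π J(π)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;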

&lt;p&gt;&lt;strong&gt;Simulation&lt;/strong&gt; involves using computer software to create virtual environments that mimic real-world physics and dynamics. Instead of testing your RL agent on expensive hardware that can break or cause safety issues, you train it in a controlled digital replica. Think of it as a sandbox where your agent can fail thousands of times without consequences, learning what works before ever touching physical hardware.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why This Stack?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Webots&lt;/strong&gt; gives you industry-standard, physics-accurate simulation that's completely free and robot-agnostic. Whether you're working with wheeled robots, drones, or manipulator arms, Webots handles the physics engine, sensors, and actuators so you can focus on your RL and control logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stable Baselines3&lt;/strong&gt; provides production-ready RL algorithms (PPO, SAC, TD3, etc.) with clean APIs, excellent documentation, and active maintenance. Instead of implementing DDPG from scratch and debugging it for weeks, you get reliable, tested implementations.&lt;/p&gt;

&lt;p&gt;By connecting Webots with Stable Baselines3, you get professional-grade tools on both ends: simulation realistic enough to matter, and algorithms robust enough to work.&lt;/p&gt;


&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Knowledge:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic Python programming&lt;/li&gt;
&lt;li&gt;Familiarity with RL concepts (agent, environment, reward, policy)&lt;/li&gt;
&lt;li&gt;A sprinkle of curiosity to learn is often all you need✨&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Software:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.8+ (I'm using Python 3.12.0)&lt;/li&gt;
&lt;li&gt;Webots R2023b or later&lt;/li&gt;
&lt;li&gt;Stable Baselines3 and dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any modern computer (Windows, macOS, or Linux)&lt;/li&gt;
&lt;li&gt;4GB+ RAM recommended&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1: Install Python
&lt;/h3&gt;

&lt;p&gt;Download and install Python from &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;python.org&lt;/a&gt;. Make sure to check "Add Python to PATH" during installation.&lt;/p&gt;

&lt;p&gt;Verify installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Install Webots
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Visit &lt;a href="https://cyberbotics.com/" rel="noopener noreferrer"&gt;https://cyberbotics.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Download the package for your operating system&lt;/li&gt;
&lt;li&gt;Run the installer and follow the prompts (agree to all defaults)&lt;/li&gt;
&lt;li&gt;Launch Webots to verify installation&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Project Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Create Your Webots World
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Open Webots&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File → New → New Project Directory&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the Project Creation Wizard:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Directory name: &lt;code&gt;Webots_SB3_Tutorial&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;World name: &lt;code&gt;robot_navigation&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Check "Add a rectangle arena"&lt;/li&gt;
&lt;li&gt;Click Finish&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Webots will create the project structure and open your new world with a basic arena.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwcngsqlknu3zkijpbm7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpwcngsqlknu3zkijpbm7.png" alt="Creating a new project in webots" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Set Up Python Environment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Here's something important:&lt;/strong&gt; Webots uses its own Python environment. Traditional virtual environments don't work directly with Webots controllers. When you set a controller in Webots, it launches a subprocess using the system Python, completely ignoring your activated virtual environment.&lt;/p&gt;

&lt;p&gt;For RL/ML workflows with external libraries like Stable Baselines3, we use &lt;strong&gt;External Controllers&lt;/strong&gt;. This lets you run your code from your terminal or IDE (where your virtual environment is active) while connecting to the Webots simulation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Navigate to your project folder and create a virtual environment:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Navigate to your Webots project&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;path-to-your&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\W&lt;/span&gt;ebots_SB3_Tutorial

&lt;span class="c"&gt;# Create virtual environment in the project folder&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv webots_rl_env

&lt;span class="c"&gt;# Activate it&lt;/span&gt;
&lt;span class="c"&gt;# On Windows:&lt;/span&gt;
webots_rl_env&lt;span class="se"&gt;\S&lt;/span&gt;cripts&lt;span class="se"&gt;\a&lt;/span&gt;ctivate

&lt;span class="c"&gt;# On macOS/Linux:&lt;/span&gt;
&lt;span class="nb"&gt;source &lt;/span&gt;webots_rl_env/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Install Required Packages:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;stable-baselines3[extra] gymnasium numpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import stable_baselines3; print(stable_baselines3.__version__)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Set Webots Environment Variable:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For external controllers to work, Python needs to know where Webots is installed. Set this once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Windows PowerShell:&lt;/span&gt;
&lt;span class="nv"&gt;$env&lt;/span&gt;:WEBOTS_HOME &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"C:&lt;/span&gt;&lt;span class="se"&gt;\P&lt;/span&gt;&lt;span class="s2"&gt;rogram Files&lt;/span&gt;&lt;span class="se"&gt;\W&lt;/span&gt;&lt;span class="s2"&gt;ebots"&lt;/span&gt;

&lt;span class="c"&gt;# Windows CMD:&lt;/span&gt;
&lt;span class="nb"&gt;set &lt;/span&gt;&lt;span class="nv"&gt;WEBOTS_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;C:&lt;span class="se"&gt;\P&lt;/span&gt;rogram Files&lt;span class="se"&gt;\W&lt;/span&gt;ebots

&lt;span class="c"&gt;# macOS/Linux:&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;WEBOTS_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/Applications/Webots.app
&lt;span class="c"&gt;# or wherever you installed Webots&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make this permanent, add it to your system environment variables or shell profile.&lt;/p&gt;
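
&lt;p&gt;If you want to sanity-check the variable from Python before launching anything, a quick snippet like this works (the &lt;code&gt;lib/controller/python&lt;/code&gt; path is where recent Webots releases keep the Python API; adjust if your install differs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import os
import sys

# Check that WEBOTS_HOME is visible to Python
webots_home = os.environ.get("WEBOTS_HOME")
if webots_home is None:
    sys.exit("WEBOTS_HOME is not set - see the commands above")

# Recent Webots releases ship the Python API here (adjust if your layout differs)
api_path = os.path.join(webots_home, "lib", "controller", "python")
print("WEBOTS_HOME:", webots_home)
print("Python API folder exists:", os.path.isdir(api_path))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;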

&lt;p&gt;Your project structure should now look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Webots_SB3_Tutorial/
├── webots_rl_env/          # Your virtual environment
├── controllers/
├── libraries/
├── plugins/
├── worlds/
│   └── robot_navigation.wbt
└── protos/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Building Your Simulation World
&lt;/h2&gt;

&lt;p&gt;Now we'll add the components our RL agent needs: a robot to control, a target to reach, and a supervisor to manage the training loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding the Architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq30uiy3e059qsreqwgy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foq30uiy3e059qsreqwgy.png" alt="Webots &amp;amp; Stable-Baseline3 implementation" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we build, let's understand how the pieces connect:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webots runs like this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Initialize world → Update physics → Read sensors → Control actuators → Repeat 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
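
&lt;p&gt;In controller code, that loop is the canonical Webots pattern (a minimal sketch using the standard controller API; &lt;code&gt;step()&lt;/code&gt; returns -1 when the simulation ends):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from controller import Robot

robot = Robot()
timestep = int(robot.getBasicTimeStep())

# Advance the physics one tick at a time; step() returns -1 when Webots quits
while robot.step(timestep) != -1:
    pass  # read sensors and drive actuators here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;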



&lt;p&gt;&lt;strong&gt;Gymnasium (the RL standard) expects:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;reset() → observation step(action) → 
observation, reward, done, info 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The bridge:&lt;/strong&gt; We create a Gymnasium-compatible environment that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Controls the Webots simulation timestep &lt;/li&gt;
&lt;li&gt;Reads sensor data and converts to observations &lt;/li&gt;
&lt;li&gt;Receives actions and sends to robot actuators&lt;/li&gt;
&lt;li&gt;Calculates rewards based on task progress &lt;/li&gt;
&lt;li&gt;Detects episode termination&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Webots &amp;amp; Stable-Baselines3 interaction. We'll implement this bridge in the next tutorial.&lt;/em&gt;&lt;/p&gt;
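
&lt;p&gt;To make the bridge concrete, here's a rough skeleton of the environment class we'll flesh out next time (the method names come from the Gymnasium API; the spaces and everything inside the methods are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import gymnasium as gym
import numpy as np

class WebotsNavigationEnv(gym.Env):
    """Sketch of the Webots/Gymnasium bridge (full version in the next tutorial)."""

    def __init__(self, supervisor):
        self.supervisor = supervisor
        self.timestep = int(supervisor.getBasicTimeStep())
        # Placeholder spaces: [distance, angle] observation, [left, right] wheel speeds
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(2,))
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,))

    def reset(self, seed=None, options=None):
        ...  # move the robot to a start pose, return (observation, info)

    def step(self, action):
        ...  # apply the action, advance one timestep, return the 5-tuple
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;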

&lt;h3&gt;
  
  
  The Navigation Task
&lt;/h3&gt;

&lt;p&gt;We're building a simple but powerful setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A robot starts at random positions in the arena&lt;/li&gt;
&lt;li&gt;A target (goal) is placed somewhere in the arena&lt;/li&gt;
&lt;li&gt;The robot learns to drive toward the target using relative observations (distance and angle), not absolute positions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach means once trained, you can move the target anywhere and the robot will adapt. The policy learns "navigate toward what I see" rather than "go to coordinates (x, y)."&lt;/p&gt;
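
&lt;p&gt;As a preview, computing those relative observations from the positions the Supervisor reports is just a few lines (a sketch; the exact axes depend on your world's coordinate system):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def relative_observation(robot_pos, robot_heading, target_pos):
    """Distance and bearing to the target, expressed in the robot's own frame."""
    dx = target_pos[0] - robot_pos[0]
    dy = target_pos[1] - robot_pos[1]
    distance = math.hypot(dx, dy)
    angle = math.atan2(dy, dx) - robot_heading
    angle = math.atan2(math.sin(angle), math.cos(angle))  # wrap to [-pi, pi]
    return distance, angle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;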

&lt;h3&gt;
  
  
  Add the Robot
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add a robot to your world:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;In &lt;strong&gt;Webots&lt;/strong&gt;, click the &lt;strong&gt;Add&lt;/strong&gt; button (+ icon) in the scene&lt;/li&gt;
&lt;li&gt;Navigate to: &lt;strong&gt;PROTO nodes (Webots Projects) → robots → gctronic → e-puck → E-puck (Robot)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;or you can search for "E-Puck" in the &lt;code&gt;Add a node&lt;/code&gt; pop-up.&lt;/li&gt;
&lt;li&gt;Click Add&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a05snqvw8e2y77zh75s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4a05snqvw8e2y77zh75s.png" alt="Add E-Puck robot in Webots" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Give the robot a DEF name:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Click on the E-puck in the scene tree&lt;/li&gt;
&lt;li&gt;At the very top of the node properties, add &lt;code&gt;ROBOT&lt;/code&gt; to the &lt;code&gt;DEF:&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;This allows our Python code to reference this specific robot&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8khlkqfgzzo6psaxghp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8khlkqfgzzo6psaxghp.png" alt=" " width="364" height="695"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Set the robot controller to external:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;In the properties panel, find the &lt;code&gt;controller&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;Change it from &lt;code&gt;"e-puck"&lt;/code&gt; to &lt;code&gt;&amp;lt;extern&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;This tells Webots we'll control it from our Python script&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z02qiyclcc14s5vlyc4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1z02qiyclcc14s5vlyc4.png" alt="Making a robot to be controlled by python in Webots" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Add the Target
&lt;/h3&gt;

&lt;p&gt;We need a visible target for the robot to navigate toward. We'll use a &lt;strong&gt;Solid node&lt;/strong&gt; so it can be repositioned programmatically (for testing different positions), but we'll make it non-colliding so the robot can reach the exact center.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add a Solid node:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click the &lt;strong&gt;Add&lt;/strong&gt; button&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Base nodes → Solid&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Give the target a DEF name:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select the Solid node&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;TARGET&lt;/code&gt; to the &lt;code&gt;DEF:&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;This allows our Python code to reference and move this object&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add visual appearance:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expand the Solid node in the scene tree&lt;/li&gt;
&lt;li&gt;Right-click on &lt;code&gt;children []&lt;/code&gt; → &lt;strong&gt;Add New&lt;/strong&gt; → Choose &lt;strong&gt;Shape&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Expand the Shape node&lt;/li&gt;
&lt;li&gt;Right-click on &lt;code&gt;geometry NULL&lt;/code&gt; → &lt;strong&gt;Add New&lt;/strong&gt; → Choose &lt;strong&gt;Cylinder&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Configure the Cylinder:

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;radius&lt;/code&gt; to &lt;code&gt;0.01&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;height&lt;/code&gt; to &lt;code&gt;0.05&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Right-click on &lt;code&gt;appearance NULL&lt;/code&gt; → &lt;strong&gt;Add New&lt;/strong&gt; → Choose &lt;strong&gt;PBRAppearance&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Expand PBRAppearance and set &lt;code&gt;baseColor&lt;/code&gt; to red: &lt;code&gt;1 0 0&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Position the target:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Find the &lt;code&gt;translation&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;Set it to: &lt;code&gt;0.3 0.025 0.3&lt;/code&gt; (x, y, z coordinates)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why use Solid without collision?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Solid nodes can be moved programmatically via the Supervisor API (useful for testing)&lt;/li&gt;
&lt;li&gt;We skip physics and boundingObject so the robot can drive through the marker&lt;/li&gt;
&lt;li&gt;The target is purely visual—a goal marker, not a physical obstacle&lt;/li&gt;
&lt;li&gt;Later, you can add physics if you want obstacle avoidance training&lt;/li&gt;
&lt;/ul&gt;
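
&lt;p&gt;Because the target is a DEF-named Solid, repositioning it from Python later takes two lines through the Supervisor field API (the same field calls our test script below exercises):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Assumes `supervisor` is an initialized Supervisor instance
target_node = supervisor.getFromDef("TARGET")
target_node.getField("translation").setSFVec3f([0.2, 0.025, -0.2])  # new x, y, z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;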

&lt;h3&gt;
  
  
  Add the Supervisor
&lt;/h3&gt;

&lt;p&gt;For RL to work, we need a "supervisor" that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reset the robot position between episodes&lt;/li&gt;
&lt;li&gt;Read positions of both robot and target&lt;/li&gt;
&lt;li&gt;Calculate rewards&lt;/li&gt;
&lt;li&gt;Control the simulation&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Add a Robot node for supervision:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click &lt;strong&gt;Add&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Base nodes → Robot&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configure it as a supervisor:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;name&lt;/code&gt; to &lt;code&gt;"supervisor_controller"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;supervisor&lt;/code&gt; field to &lt;code&gt;TRUE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;controller&lt;/code&gt; to &lt;code&gt;&amp;lt;extern&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Save Your World
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File → Save World&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your scene tree should now look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywj18y7ug99uqi73myf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywj18y7ug99uqi73myf5.png" alt="Scene tree" width="364" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your scene should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvhzlaqcbe5i9gqxyvfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvhzlaqcbe5i9gqxyvfl.png" alt="Complete Webots Scene" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we just built:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ROBOT (E-puck):&lt;/strong&gt; The agent that will learn to navigate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TARGET (red cylinder):&lt;/strong&gt; The goal position&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supervisor:&lt;/strong&gt; The "brain" that runs our RL training loop&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Verifying Your Setup
&lt;/h2&gt;

&lt;p&gt;Let's make sure everything is connected properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create the test controller:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In your project, create a new folder: &lt;code&gt;controllers/test_supervisor/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Inside that folder, create a file: &lt;code&gt;test_supervisor.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Your folder structure should look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Webots_SB3_Tutorial/
├── webots_rl_env/
├── controllers/
│   └── test_supervisor/
│       └── test_supervisor.py
├── worlds/
│   └── robot_navigation.wbt
└── protos/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Add this code to &lt;code&gt;test_supervisor.py&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;controller&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Supervisor&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize supervisor
&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Supervisor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;timestep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getBasicTimeStep&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Test: Can we access our nodes?
&lt;/span&gt;&lt;span class="n"&gt;robot_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getFromDef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ROBOT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;target_node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getFromDef&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TARGET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;robot_node&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;target_node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✓ Setup successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Robot found at: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;robot_node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getPosition&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Target found at: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;target_node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getPosition&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Test moving the target
&lt;/span&gt;    &lt;span class="n"&gt;trans_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;target_node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;translation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;current_pos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trans_field&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getSFVec3f&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Target can be moved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_pos&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✗ Setup error!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;robot_node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Missing: ROBOT (check DEF name on E-puck)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;target_node&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Missing: TARGET (check DEF name on Solid)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run one simulation step
&lt;/span&gt;&lt;span class="n"&gt;supervisor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timestep&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✓ Simulation step successful!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;To run the test:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In Webots, open your &lt;code&gt;robot_navigation.wbt&lt;/code&gt; world&lt;/li&gt;
&lt;li&gt;Select the &lt;strong&gt;Robot (supervisor_controller)&lt;/strong&gt; node in the scene tree&lt;/li&gt;
&lt;li&gt;Change its &lt;code&gt;controller&lt;/code&gt; field from &lt;code&gt;&amp;lt;extern&amp;gt;&lt;/code&gt; to &lt;code&gt;test_supervisor&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Play&lt;/strong&gt; button (▶️) in Webots (you may need to click Restart first)&lt;/li&gt;
&lt;li&gt;Check the Webots console (bottom panel)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Expected output in the Webots console:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO: test_supervisor: Starting controller: python.exe -u test_supervisor.py
✓ Setup successful!
  Robot found at: [0.0, 0.0, 0.0]
  Target found at: [0.3, 0.025, 0.3]
  Target can be moved: [0.3, 0.025, 0.3]
✓ Simulation step successful!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After testing:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Important:&lt;/strong&gt; Change the supervisor's &lt;code&gt;controller&lt;/code&gt; field back to &lt;code&gt;&amp;lt;extern&amp;gt;&lt;/code&gt; (we'll need this for the next tutorial)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;File → Save World&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Optional test:&lt;/strong&gt; Hold &lt;strong&gt;Shift + Left Click&lt;/strong&gt; and drag the target in the 3D view. It should move freely, confirming the physics setup is correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You Accomplished
&lt;/h2&gt;

&lt;p&gt;🎉 Congratulations! You've built a complete foundation for RL training in Webots:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installed Webots and Python environment
&lt;/li&gt;
&lt;li&gt;Created a simulation world with robot and target
&lt;/li&gt;
&lt;li&gt;Configured external controller setup
&lt;/li&gt;
&lt;li&gt;Verified Python can communicate with Webots
&lt;/li&gt;
&lt;li&gt;Ready to build a Gymnasium environment (next tutorial)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Coming in the next tutorial:&lt;/strong&gt; "Building a Gymnasium Environment for Webots Robot Control"&lt;/p&gt;

&lt;p&gt;We'll write the code that bridges Stable Baselines3 and Webots:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating a custom Gymnasium environment class&lt;/li&gt;
&lt;li&gt;Implementing &lt;code&gt;reset()&lt;/code&gt; and &lt;code&gt;step()&lt;/code&gt; methods&lt;/li&gt;
&lt;li&gt;Defining observation and action spaces&lt;/li&gt;
&lt;li&gt;Designing a reward function&lt;/li&gt;
&lt;li&gt;Handling episode termination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📦 &lt;strong&gt;Complete code:&lt;/strong&gt; [&lt;a href="https://github.com/sam-dude/Webots_SB3_Tutorial" rel="noopener noreferrer"&gt;https://github.com/sam-dude/Webots_SB3_Tutorial&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;📚 &lt;a href="https://cyberbotics.com/doc/guide/index" rel="noopener noreferrer"&gt;Webots Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📚 &lt;a href="https://stable-baselines3.readthedocs.io/" rel="noopener noreferrer"&gt;Stable Baselines3 Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📚 &lt;a href="https://gymnasium.farama.org/" rel="noopener noreferrer"&gt;Gymnasium Documentation&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;You might find interacting with Webots a bit confusing at first. Getting introduced to a tool with this many features can feel daunting. But here's the thing: the best way to learn is by playing around.&lt;/p&gt;

&lt;p&gt;Go beyond what we've covered in this tutorial. Experiment with the "pre-made" robots available in Webots. Try out your own ideas by adding and customizing different nodes. Webots lets you create custom environments, and hands-on exploration is often the fastest way to get comfortable with any new tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You now have a professional-grade simulation setup ready for RL experimentation. This foundation uses the same tools researchers and companies use for real robotics projects—no expensive hardware required.&lt;/p&gt;

&lt;p&gt;The key insight we've established: by using relative observations (distance and angle to target) instead of absolute positions, our future trained agent will generalize. Move the target anywhere, and the robot will adapt.&lt;/p&gt;

&lt;p&gt;In the next tutorial, we will connect our Webots environment to Gymnasium.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Thank you for reading this piece to the end. If you face any issues during implementation, drop a comment and I'll do my best to respond promptly.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>robotics</category>
      <category>webots</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Imitation Learning: A Stanford Walkthrough</title>
      <dc:creator>IBIYEMI Samuel O.</dc:creator>
      <pubDate>Sat, 17 Jan 2026 01:26:24 +0000</pubDate>
      <link>https://dev.to/samdude/imitation-learning-a-stanford-walkthrough-22cc</link>
      <guid>https://dev.to/samdude/imitation-learning-a-stanford-walkthrough-22cc</guid>
      <description>&lt;p&gt;AI and robotics are evolving rapidly, and reinforcement learning (RL) has become foundational to modern AI systems. You see RL everywhere: in large language models, self-driving cars, and the emerging field of physical AI. Within RL, imitation learning stands out as a particularly practical technique. Instead of engineering complex reward functions, we can simply show an AI agent what to do by demonstrating the desired behavior.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cs224r.stanford.edu/" rel="noopener noreferrer"&gt;&lt;strong&gt;Stanford's CS224R: Deep Reinforcement Learning&lt;/strong&gt;&lt;/a&gt; course offers a rigorous, freely available introduction to these concepts. The course has a well-structured syllabus and covers industry-standard material that's directly applicable to real-world problems—making it an excellent resource for anyone looking to break into AI and robotics. Best of all, it's completely free and available on &lt;a href="https://www.youtube.com/watch?v=EvHRQhMX7_w&amp;amp;list=PLoROMvodv4rPwxE0ONYRa_itZFdaKCylL" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This walkthrough documents my experience with one of the early assignments: implementing and experimenting with two foundational imitation learning methods; &lt;strong&gt;Behavior Cloning (BC)&lt;/strong&gt; and &lt;strong&gt;DAgger&lt;/strong&gt;. While researching self-driving car systems and robotics applications, I noticed that imitation learning consistently appears as a foundational technique. Companies like Nvidia, for instance, have published work showing its importance in autonomous driving. This assignment provided hands-on experience with the core concepts, implementation decisions, and the practical tradeoffs between different approaches.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Imitation Learning?
&lt;/h2&gt;

&lt;p&gt;There are certain behaviors that human experts know how to do but may require rigorous domain knowledge to program into a robot system. These expert behaviors can be demonstrated to the AI agent. The purpose of imitation learning is to efficiently learn a desired behavior by imitating an expert's behavior.&lt;/p&gt;

&lt;p&gt;This can be powerful for building autonomous behavior in systems that are meant to mimic a human approach. For instance, imitation learning can be deployed for autonomous behavior in vehicles, computer games, and robotic applications. It's a go-to technique for many companies working on self-driving cars, robotics, and leading physical AI systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Imitation Learning?
&lt;/h2&gt;

&lt;p&gt;The goal of reinforcement learning is to determine closed-loop control policies that maximize an accumulated reward. RL systems are generally classified as model-based or model-free. In both cases, there is a general assumption that a reward function is known; the agent runs in an environment, collects data, and uses it either to learn a model of the environment and then improve the policy (model-based) or to update the learned policy directly (model-free).&lt;/p&gt;

&lt;p&gt;Imitation learning is quite similar to RL; the difference is that there is no explicit reward function r = R(s_t, u_t). Instead, it is assumed that a set of demonstrations is provided by an expert. The goal is to learn a policy π that follows the closed-loop control law:&lt;/p&gt;

&lt;p&gt;u_t = π(s_t)&lt;/p&gt;

&lt;p&gt;There is an expert policy π* (derived from the demonstrations) that the learned policy aims to imitate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Goal of Imitation Learning&lt;/strong&gt;: Find a policy π that best imitates the expert policy π*, effectively capturing the expert's behavior from the provided demonstrations.&lt;/p&gt;

&lt;p&gt;(Sounding complex? I'll keep it simple, I promise 😊)&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Approaches to Imitation Learning
&lt;/h2&gt;

&lt;p&gt;There are two common approaches to imitation learning. The first is to directly learn to imitate the expert's policy; the second is to imitate the policy indirectly by learning the expert's reward function instead. In this blog, I will focus on classical approaches that directly learn to imitate the expert, in two ways. Such a policy can usually be obtained through standard supervised learning, as in behavior cloning and the DAgger algorithm. Learning the expert's reward function (which I will not cover here; you can read more in the resources listed) is known as &lt;em&gt;inverse reinforcement learning&lt;/em&gt;. We shall discuss BC and DAgger, considering their intuition and limitations, and I will also explain my architectural choices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Behavior Cloning (BC)
&lt;/h3&gt;

&lt;p&gt;The algorithm uses a set of demonstrated trajectories from an expert to determine a policy π that imitates the expert's actions. Supervised learning techniques can be applied directly: the learned policy is fit to the expert's state-action pairs with respect to some distance metric. It is a classical optimization problem.&lt;/p&gt;
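
&lt;p&gt;For continuous actions this usually reduces to plain regression. In the common mean-squared-error form (one choice of metric among several):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;L(π) = Σ over (s_t, u_t) in D of ||π(s_t) − u_t||²
learn π by minimizing L, where D is the set of expert state-action pairs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;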

&lt;p&gt;&lt;strong&gt;But this has a major shortcoming:&lt;/strong&gt; the expert data often fails to cover the states the policy actually visits at deployment. Small deviations accumulate (often called "trajectory drift"), driving the policy into situations that are not equivalent to what it was trained on, where its performance deteriorates. This is a major challenge in imitation learning.&lt;/p&gt;

&lt;p&gt;The core issue is distributional mismatch: during training, the policy only sees states from the expert's trajectories, but during deployment, small errors compound over time, leading the policy into states it has never encountered. These unfamiliar states cause the policy to make poor decisions, which leads to even more unfamiliar states—a cascading failure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b4ainsdfiisewbx0115.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b4ainsdfiisewbx0115.gif" alt="Behaviour Cloning algorithm applied to Ant" width="760" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: Behavior Cloning in action during early training. Notice how the Ant agent struggles.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DAgger: Data Aggregation
&lt;/h3&gt;

&lt;p&gt;DAgger is a direct patch to the distributional mismatch problem. It collects additional data from the expert iteratively to update the policy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How DAgger works:&lt;/strong&gt; Rather than training once on a fixed dataset, DAgger runs the current learned policy in the environment, observes the states it reaches (which may differ from the expert's states), then queries the expert for the correct action at those states. This new data is aggregated with the previous training data, and the policy is retrained. This iterative process progressively reduces the distributional mismatch between training and deployment by ensuring the policy sees and learns from the states it actually encounters.&lt;/p&gt;
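
&lt;p&gt;In code form, the loop looks roughly like this (a sketch of the algorithm from Ross et al., 2011; &lt;code&gt;collect_trajectory&lt;/code&gt; and &lt;code&gt;train_policy&lt;/code&gt; are hypothetical helpers, not my assignment code, which I can't share):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def dagger(expert_policy, env, n_iterations=10):
    """Sketch of DAgger; collect_trajectory and train_policy are hypothetical."""
    dataset = []   # aggregated (state, expert_action) pairs
    policy = None  # current learned policy

    for i in range(n_iterations):
        # Roll out the current policy (the expert on the first pass)
        rollout_policy = expert_policy if policy is None else policy
        states = collect_trajectory(env, rollout_policy)

        # Relabel the visited states with the expert's actions, then aggregate
        dataset += [(s, expert_policy(s)) for s in states]

        # Retrain on everything gathered so far (plain supervised learning)
        policy = train_policy(dataset)

    return policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;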

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdyi3odlh5zbnqy8zwl4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frdyi3odlh5zbnqy8zwl4.gif" alt="DAgger algorithm applied to Ant" width="720" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2: DAgger in action. The same environment shows significantly more stable behavior; the agent now maintains its balance.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  My Experimental Setup
&lt;/h2&gt;

&lt;p&gt;I implemented and evaluated both Behavior Cloning and DAgger on three MuJoCo continuous control environments: Ant-v2, HalfCheetah-v2, and Hopper-v2. Each environment presents different challenges: Ant requires coordinating multiple legs, HalfCheetah involves high-speed forward locomotion, and Hopper demands delicate balance control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Details
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Network Architecture:&lt;/strong&gt;&lt;br&gt;
I used a simple feedforward neural network with 2 hidden layers of 64 units each, using ReLU activations. The policy outputs continuous actions directly without any output activation. This architecture is deliberately simple—I wanted to see how much the algorithm itself (BC vs DAgger) matters compared to model capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training Configuration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learning rate: 0.005 with Adam optimizer&lt;/li&gt;
&lt;li&gt;Training steps per iteration: 1000&lt;/li&gt;
&lt;li&gt;Mini-batch size: 100&lt;/li&gt;
&lt;li&gt;For BC: Single iteration on expert data&lt;/li&gt;
&lt;li&gt;For DAgger: 10 iterations with expert relabeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Collection:&lt;/strong&gt;&lt;br&gt;
Each iteration collected 1000 environment steps. For DAgger, the expert policy relabeled the states visited by the learned policy, and all data was accumulated in a replay buffer with capacity of 1,000,000.&lt;/p&gt;
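
&lt;p&gt;The supervised update itself is a standard regression step (a sketch using the hyperparameters above; &lt;code&gt;policy&lt;/code&gt;, &lt;code&gt;states&lt;/code&gt;, and &lt;code&gt;expert_actions&lt;/code&gt; are assumed to already exist as tensors):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=0.005)
loss_fn = torch.nn.MSELoss()

for step in range(1000):                         # training steps per iteration
    idx = torch.randint(0, len(states), (100,))  # mini-batch of 100
    loss = loss_fn(policy(states[idx]), expert_actions[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;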

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;The results were striking, particularly for Hopper:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment&lt;/th&gt;
&lt;th&gt;BC Return&lt;/th&gt;
&lt;th&gt;BC % Expert&lt;/th&gt;
&lt;th&gt;DAgger Return&lt;/th&gt;
&lt;th&gt;DAgger % Expert&lt;/th&gt;
&lt;th&gt;Expert Return&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ant&lt;/td&gt;
&lt;td&gt;4216.1 ± 0.0&lt;/td&gt;
&lt;td&gt;88.9%&lt;/td&gt;
&lt;td&gt;4845.1 ± 0.0&lt;/td&gt;
&lt;td&gt;102.1%&lt;/td&gt;
&lt;td&gt;4744.3&lt;/td&gt;
&lt;td&gt;+13.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hopper&lt;/td&gt;
&lt;td&gt;879.8 ± 345.1&lt;/td&gt;
&lt;td&gt;23.7%&lt;/td&gt;
&lt;td&gt;3709.4 ± 0.0&lt;/td&gt;
&lt;td&gt;99.8%&lt;/td&gt;
&lt;td&gt;3716.0&lt;/td&gt;
&lt;td&gt;+76.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HalfCheetah&lt;/td&gt;
&lt;td&gt;3835.3 ± 0.0&lt;/td&gt;
&lt;td&gt;94.3%&lt;/td&gt;
&lt;td&gt;3905.8 ± 0.0&lt;/td&gt;
&lt;td&gt;96.0%&lt;/td&gt;
&lt;td&gt;4067.9&lt;/td&gt;
&lt;td&gt;+1.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Learning Curves
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01qopjvesesjuqgclmjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01qopjvesesjuqgclmjk.png" alt=" " width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 3: Training curves showing BC vs DAgger vs Expert baseline for Ant, Hopper, and HalfCheetah environments. The curves illustrate how DAgger's iterative data collection leads to continued improvement, particularly visible in Hopper's dramatic performance gain.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Observations:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hopper showed the most dramatic improvement.&lt;/strong&gt; BC achieved only 23.7% of expert performance with high variance (±345.1), while DAgger reached 99.8% of expert performance. This makes sense. Hopper requires precise balance, and small deviations from the expert trajectory quickly lead to falls. BC had no way to recover from these states, but DAgger explicitly learned from them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ant performed well even with BC&lt;/strong&gt; (88.9% of expert), but DAgger still provided meaningful gains, actually exceeding expert performance at 102.1%. This suggests Ant's dynamics are more forgiving. The policy can make small errors without catastrophic failure, reducing the distributional mismatch problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HalfCheetah saw minimal improvement&lt;/strong&gt; from DAgger (94.3% → 96.0%). My hypothesis is that HalfCheetah's task is relatively stable once the policy learns the basic running gait, and the expert demonstrations already covered the relevant state space well enough for BC to succeed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance Comparison
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eh2l5no776um5vryo5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1eh2l5no776um5vryo5v.png" alt="Performance comparison" width="800" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4: Final performance comparison across all three environments, showing the relative success of BC and DAgger as percentages of expert performance.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;This assignment reinforced that &lt;strong&gt;the choice of algorithm matters immensely depending on the task dynamics&lt;/strong&gt;. For sensitive, unstable tasks like Hopper, the distributional mismatch in BC is crippling. DAgger's iterative data collection isn't just a theoretical improvement, it's the difference between 24% and 100% success.&lt;/p&gt;

&lt;p&gt;Also, I figured that &lt;strong&gt;looking at variance is as important as looking at mean performance&lt;/strong&gt;. BC's high variance on Hopper was a clear signal that something was fundamentally wrong, not just that the policy needed more training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not all tasks need DAgger.&lt;/strong&gt; HalfCheetah's results suggest that for some problems, expert demonstrations alone provide sufficient coverage. Understanding when to use which approach requires careful consideration of your environment's dynamics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Imitation learning doesn't solve all autonomy problems on its own.&lt;/strong&gt; In practice, it's often used to create base models upon which different algorithms (including RL methods like PPO) are applied to fine-tune performance. This two-stage approach—imitation learning for initialization, RL for refinement—is common in real-world applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DAgger's benefits come with costs.&lt;/strong&gt; While it solves the distributional mismatch problem effectively, it's computationally intensive. Each iteration requires running the policy in the environment and querying the expert for labels, which can be expensive or even infeasible in some domains (imagine needing a human expert to label thousands of states). This practical constraint is why BC remains popular despite its theoretical limitations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Working through this imitation learning assignment gave me hands-on experience with the core challenge of learning from demonstrations: distributional mismatch. The contrast between BC and DAgger wasn't just academic: I saw firsthand how Hopper went from barely functioning to expert-level performance simply by addressing which states the policy trains on.&lt;/p&gt;

&lt;p&gt;The bigger lesson is about understanding your problem before choosing your algorithm. DAgger isn't always necessary (as HalfCheetah showed), and BC isn't always insufficient (as Ant demonstrated). The key is recognizing when your task's dynamics will punish distribution shift, and planning accordingly.&lt;/p&gt;

&lt;p&gt;These experiments also highlighted that imitation learning is a starting point, not an ending point, for building robust autonomous systems. The path from expert demonstrations to production-ready policies often involves multiple stages—imitation learning for initialization, reinforcement learning for refinement, and careful engineering throughout.&lt;/p&gt;

&lt;p&gt;If you're working on similar problems, I hope this walkthrough helps you think about when to use BC, when to invest in DAgger's computational cost, and how to evaluate whether distributional mismatch is your bottleneck.&lt;/p&gt;




&lt;h2&gt;
  
  
  Code and Course Information
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Note on Code:&lt;/strong&gt; Due to course policy, I cannot share the implementation code for this assignment. However, the concepts and approaches described here can be implemented following the algorithm descriptions in the referenced papers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Course Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=EvHRQhMX7_w&amp;amp;list=PLoROMvodv4rPwxE0ONYRa_itZFdaKCylL" rel="noopener noreferrer"&gt;Stanford CS224R Deep Reinforcement Learning(YouTube Playlist)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cs224r.stanford.edu/" rel="noopener noreferrer"&gt;Course Website and Syllabus&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These resources provide comprehensive coverage of imitation learning, reinforcement learning, and related topics.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., &amp;amp; Peters, J. (2018). "An Algorithmic Perspective on Imitation Learning." &lt;em&gt;Foundations and Trends in Robotics&lt;/em&gt;, 7(1-2), 1-179.&lt;/li&gt;
&lt;li&gt;Ross, S., Gordon, G., &amp;amp; Bagnell, D. (2011). "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning." &lt;em&gt;Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS)&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Pomerleau, D. A. (1989). "ALVINN: An Autonomous Land Vehicle in a Neural Network." &lt;em&gt;Advances in Neural Information Processing Systems (NIPS)&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Bojarski, M., et al. (2016). "End to End Learning for Self-Driving Cars." &lt;em&gt;arXiv preprint arXiv:1604.07316&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Todorov, E., Erez, T., &amp;amp; Tassa, Y. (2012). "MuJoCo: A physics engine for model-based control." &lt;em&gt;IEEE/RSJ International Conference on Intelligent Robots and Systems&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;You made it to the end 😊 Thanks for reading! If you found this helpful or have questions about imitation learning, feel free to reach out or leave a comment.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>robotics</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>I've been exploring Robotics with Reinforcement Learning. 

In this blog, I wrote about how I confronted the common learning trap - basking in the quagmire of personal deception; posing as if I knew much when I had only acquired little.

...</title>
      <dc:creator>IBIYEMI Samuel O.</dc:creator>
      <pubDate>Mon, 01 Dec 2025 14:46:44 +0000</pubDate>
      <link>https://dev.to/samdude/ive-been-exploring-robotics-with-reinforcement-learning-in-this-blog-i-wrote-about-how-i-3fbg</link>
      <guid>https://dev.to/samdude/ive-been-exploring-robotics-with-reinforcement-learning-in-this-blog-i-wrote-about-how-i-3fbg</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/samdude" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1188173%2F7c7774d1-cae2-4d73-a15a-80717f067653.png" alt="samdude"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/samdude/i-trained-a-robot-arm-what-i-failed-to-learn-2cmf" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;I trained a Robot Arm: What I failed to learn.&lt;/h2&gt;
      &lt;h3&gt;IBIYEMI Samuel O. ・ Dec 1&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#ai&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#deeplearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#machinelearning&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#learning&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>learning</category>
    </item>
    <item>
      <title>I trained a Robot Arm: What I failed to learn.</title>
      <dc:creator>IBIYEMI Samuel O.</dc:creator>
      <pubDate>Mon, 01 Dec 2025 08:06:34 +0000</pubDate>
      <link>https://dev.to/samdude/i-trained-a-robot-arm-what-i-failed-to-learn-2cmf</link>
      <guid>https://dev.to/samdude/i-trained-a-robot-arm-what-i-failed-to-learn-2cmf</guid>
      <description>&lt;p&gt;First, there is so much to learn.&lt;/p&gt;

&lt;p&gt;Understanding foundational ML concepts and having AI-accelerated workflows doesn't mean you can just jump in and skip the steep learning curve. I learned that the expensive way.&lt;/p&gt;

&lt;p&gt;Reinforcement Learning (RL) is distinct from other ML fields. Even though they share boundaries, RL has concepts that even hardcore ML engineers won't grasp immediately.&lt;/p&gt;

&lt;p&gt;My first mistake was trying to skip steps. I was ambitious; that was glaring. I wanted results ASAP (my self-destructive habit of posing to the world). I was too focused on seeing it work to apply my intuition to the hard details.&lt;/p&gt;

&lt;p&gt;After completing my first RL project with AI's help, I could feel it in my gut: I had learned nothing, or at least too little to justify the achievement I claimed. That's when I went back to relearn. It took time, but it was rewarding.&lt;/p&gt;

&lt;p&gt;Now that you've heard the story of my life, let's get technical.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;p&gt;The robot arm is the Pusher from Gymnasium, with 7 Degrees of Freedom (DOF). It's a multi-jointed robot arm similar to a human arm. The goal is to move a target cylinder (the object) to a goal position using the robot's end effector (the fingertip). The robot has shoulder, elbow, forearm, and wrist joints (see the &lt;a href="https://gymnasium.farama.org/environments/mujoco/pusher/#description" rel="noopener noreferrer"&gt;Gymnasium description&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;In RL environments, we usually deal with discrete or continuous action spaces. A discrete action space has a finite set of actions; it's usually easier to learn, even with low compute. A continuous action space is effectively infinite: the gradient can easily explode, the agent can get stuck in a local optimum or not learn at all, and training takes a lot of time.&lt;/p&gt;
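
&lt;p&gt;To make that concrete, here's a quick sketch (assuming &lt;code&gt;gymnasium[mujoco]&lt;/code&gt; is installed) comparing the two kinds of spaces. Pusher exposes one torque per actuated joint, so its action space is a 7-dimensional continuous Box, while a classic control task like CartPole offers just two discrete actions.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import gymnasium as gym

# Continuous control: Pusher applies one torque per actuated joint.
pusher = gym.make("Pusher-v4")
print(pusher.action_space)       # Box(-2.0, 2.0, (7,), float32)
print(pusher.observation_space)  # Box(-inf, inf, (23,), float64)

# Discrete control: CartPole only offers "push left" or "push right".
cartpole = gym.make("CartPole-v1")
print(cartpole.action_space)     # Discrete(2)
&lt;/code&gt;&lt;/pre&gt;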

&lt;p&gt;Our robot lives in a continuous action space. Our battle is long, and the environment is as unpredictable as a raging sea.&lt;/p&gt;

&lt;p&gt;A cool intuition here would be:&lt;/p&gt;

&lt;p&gt;Since discrete action spaces are easier and faster to learn, what if we discretise the action space so we can use algorithms that work well in discrete settings, like DQN? The problem is that the number of actions increases exponentially with the degrees of freedom (source: the &lt;a href="https://arxiv.org/abs/1509.02971" rel="noopener noreferrer"&gt;DDPG paper&lt;/a&gt;). Hence, the curse of dimensionality.&lt;/p&gt;
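
&lt;p&gt;The back-of-the-envelope math shows how fast this blows up: with K bins per joint and 7 joints, the discrete action set has K&lt;sup&gt;7&lt;/sup&gt; entries.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Discretising a 7-DOF continuous action space: bins_per_joint ** n_joints
n_joints = 7
for bins_per_joint in (3, 5, 10):
    n_actions = bins_per_joint ** n_joints
    print(f"{bins_per_joint} bins/joint: {n_actions:,} discrete actions")

# 3 bins/joint: 2,187 discrete actions
# 5 bins/joint: 78,125 discrete actions
# 10 bins/joint: 10,000,000 discrete actions
&lt;/code&gt;&lt;/pre&gt;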

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkz1rta86ahq9fzj065.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falkz1rta86ahq9fzj065.gif" alt="Pusher Robot performance at 20k episode ~ 2 million timesteps" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Training Journey
&lt;/h3&gt;

&lt;p&gt;I initially tried SAC (Soft Actor-Critic). After around 2 million timesteps, it had failed to learn anything significant, and that was roughly 14 hours of training on my CPU. I introduced tricks: checkpointing (suggested by Mave), interval video recording, low-end optimisations, and free GPU runs on Google Colab.&lt;/p&gt;
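
&lt;p&gt;For context, a minimal version of that setup in Stable Baselines3 looks roughly like this (the paths and hyperparameters here are illustrative, not my exact run):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import CheckpointCallback

env = gym.make("Pusher-v4")

# Save a checkpoint every 50k steps so a crashed or terminated run
# can resume instead of losing hours of CPU time.
checkpoint_cb = CheckpointCallback(
    save_freq=50_000,
    save_path="./checkpoints/",
    name_prefix="sac_pusher",
)

model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=2_000_000, callback=checkpoint_cb)
model.save("sac_pusher_final")
&lt;/code&gt;&lt;/pre&gt;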

&lt;p&gt;I made some progress, but it was hard to get anything encouraging at that stage. The rewards were signaling that it wasn't learning, so I had to terminate the run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9a236zbuq1z4np184q6.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9a236zbuq1z4np184q6.gif" alt="Best model at 2 million timesteps" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Best model at 2 million timesteps, it's obvious it ain't learning.😓&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But after I dug further, I saw that I could improve my results with techniques like HER (Hindsight Experience Replay). HER is a replay technique for off-policy algorithms, designed for applications where admissible behaviors aren't necessarily known in advance. Previous approaches required careful reward shaping and in-depth domain knowledge; HER sidesteps this by learning from sparse, unshaped reward signals.&lt;/p&gt;
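
&lt;p&gt;Stable Baselines3 ships HER as a replay buffer rather than a standalone algorithm, and it needs a goal-conditioned environment (a Dict observation with &lt;code&gt;achieved_goal&lt;/code&gt; and &lt;code&gt;desired_goal&lt;/code&gt;, plus a &lt;code&gt;compute_reward()&lt;/code&gt; method), which the stock Pusher doesn't expose. A rough sketch, assuming a goal-conditioned task like FetchPush from Gymnasium-Robotics:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import gymnasium as gym
import gymnasium_robotics  # registers the Fetch environments
from stable_baselines3 import SAC, HerReplayBuffer

# A goal-conditioned env; the stock Pusher would need a wrapper
# exposing observation/achieved_goal/desired_goal to use HER.
env = gym.make("FetchPush-v2")

model = SAC(
    "MultiInputPolicy",  # handles Dict observations
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,                  # relabeled goals per transition
        goal_selection_strategy="future",  # pick goals from later in the episode
    ),
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
&lt;/code&gt;&lt;/pre&gt;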

&lt;h3&gt;
  
  
  What's Next?
&lt;/h3&gt;

&lt;p&gt;As Andrew Ng wrote in "The Batch" (huge fan😊):&lt;/p&gt;

&lt;p&gt;"The single biggest predictor of how rapidly a team makes progress building an AI agent lies in their ability to drive a disciplined process for evals (measuring the system's performance) and error analysis (identifying the causes of errors). It's tempting to shortcut these processes and quickly attempt fixes to mistakes rather than slowing down to identify root causes. But evals and error analysis can lead to much faster progress."&lt;/p&gt;

&lt;p&gt;He also emphasized that without understanding how computers work, you can't just "vibe code" your way to greatness. Fundamentals are essential.&lt;/p&gt;

&lt;p&gt;Being able to understand concepts and apply them matters more than producing results without adequate understanding. That's exactly what I had been doing: shortcutting the process, chasing results without grasping the fundamentals.&lt;/p&gt;

&lt;p&gt;So, what will I do differently? I'll check out the original algorithm papers before starting implementation. I'll digest first. I'll also improve my RL algorithm debugging skills - because understanding why something fails is just as important as making it work.&lt;/p&gt;

&lt;p&gt;Stay tuned for my learning updates.&lt;/p&gt;

&lt;p&gt;Till then,&lt;/p&gt;

&lt;p&gt;Keep Learning,&lt;br&gt;
Samuel Ibiyemi&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
