<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Luca Ostermann</title>
    <description>The latest articles on DEV Community by Luca Ostermann (@lukeo).</description>
    <link>https://dev.to/lukeo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3899472%2F6f1c9d7f-2248-4653-9ef5-0ee3f3ace539.png</url>
      <title>DEV Community: Luca Ostermann</title>
      <link>https://dev.to/lukeo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lukeo"/>
    <language>en</language>
    <item>
      <title>Setting up a local RL environment in 2026, and what I wish I knew</title>
      <dc:creator>Luca Ostermann</dc:creator>
      <pubDate>Mon, 27 Apr 2026 11:48:27 +0000</pubDate>
      <link>https://dev.to/lukeo/setting-up-a-local-rl-environment-in-2026-and-what-i-wish-i-knew-57fe</link>
      <guid>https://dev.to/lukeo/setting-up-a-local-rl-environment-in-2026-and-what-i-wish-i-knew-57fe</guid>
      <description>&lt;p&gt;I spent three days last month getting a reinforcement learning environment to run locally before I could write a single line of training code.&lt;/p&gt;

&lt;p&gt;Three days. For the setup.&lt;/p&gt;

&lt;p&gt;I'm writing this because I found almost no practical guide that covers the annoying parts, the ones that actually eat your time. So here's everything I wish someone had told me before I started.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Your "simple" environment is probably not simple
&lt;/h2&gt;

&lt;p&gt;I started with what I thought was a minimal setup: a custom browser-based environment for testing a web navigation agent. I figured I'd have something running in an afternoon.&lt;/p&gt;

&lt;p&gt;What I didn't account for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rendering backends.&lt;/strong&gt; If your env involves any visual observation (even a headless browser), you need a display server. On a Linux dev machine without a monitor, that means Xvfb (X Virtual Framebuffer) or a similar virtual display. This alone took me half a day to debug.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gym vs Gymnasium.&lt;/strong&gt; OpenAI Gym is deprecated, but a lot of tutorials still use it. Gymnasium is the maintained fork. The two are mostly, but not perfectly, compatible, especially around &lt;code&gt;reset()&lt;/code&gt; return signatures. If you're getting &lt;code&gt;too many values to unpack&lt;/code&gt; errors, this is probably why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step API changes.&lt;/strong&gt; Gymnasium introduced a new step API that returns 5 values instead of 4 (&lt;code&gt;terminated&lt;/code&gt; and &lt;code&gt;truncated&lt;/code&gt; are now separate). Half the example code online still uses the old API.&lt;/li&gt;
&lt;/ul&gt;
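
&lt;p&gt;While you're porting example code across the API split above, a small shim can paper over the difference. A minimal sketch; note that the old single &lt;code&gt;done&lt;/code&gt; flag conflated termination and truncation, so mapping it onto &lt;code&gt;terminated&lt;/code&gt; is an assumption, not a faithful conversion:&lt;/p&gt;

```python
def step_compat(result):
    """Normalize a step() return to the 5-tuple Gymnasium shape.

    Accepts the old 4-tuple (obs, reward, done, info) or the new
    5-tuple (obs, reward, terminated, truncated, info).
    """
    if len(result) == 4:
        obs, reward, done, info = result
        # Old API conflated the two flags; we assume `done` meant terminated.
        return obs, reward, done, False, info
    return result
```

&lt;p&gt;Wrap any third-party example's &lt;code&gt;env.step(...)&lt;/code&gt; result in this while migrating, then delete it once everything is on Gymnasium.&lt;/p&gt;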

&lt;p&gt;Lesson: read the Gymnasium migration docs before anything else. It takes 15 minutes and saves hours.&lt;/p&gt;
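
&lt;p&gt;For the display-server problem on a monitor-less Linux box, &lt;code&gt;xvfb-run&lt;/code&gt; was the fix for me. Assuming a Debian/Ubuntu machine and that &lt;code&gt;train.py&lt;/code&gt; stands in for your own entry point:&lt;/p&gt;

```shell
# Debian/Ubuntu: install the X virtual framebuffer
sudo apt-get install -y xvfb

# -a picks a free display number automatically
xvfb-run -a python train.py
```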




&lt;h2&gt;
  
  
  2. Dependency hell is real, and it's specifically bad for RL
&lt;/h2&gt;

&lt;p&gt;RL libraries have notoriously tangled dependencies. In my case:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stable-baselines3 → requires torch &amp;gt;= 1.11
ray[rllib] → pins its own torch version
my browser env → needs playwright which needs its own chromium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These don't always play nice together. My recommendation: &lt;strong&gt;one environment per project, managed with &lt;code&gt;uv&lt;/code&gt; or at minimum a fresh &lt;code&gt;venv&lt;/code&gt;&lt;/strong&gt;. Don't try to share an environment across RL projects. It will break.&lt;/p&gt;

&lt;p&gt;Also: pin your versions immediately. RL libraries update fast and breaking changes are common. Future-you will thank present-you.&lt;/p&gt;
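
&lt;p&gt;Concretely, per-project isolation plus immediate pinning looks something like this with &lt;code&gt;uv&lt;/code&gt; (version numbers are illustrative, not recommendations):&lt;/p&gt;

```shell
uv venv .venv                       # fresh, project-local environment
source .venv/bin/activate
uv pip install "gymnasium==1.0.0" "stable-baselines3==2.4.0"
uv pip freeze > requirements.txt    # pin everything now, not later
```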




&lt;h2&gt;
  
  
  3. Episode resets are where bugs hide
&lt;/h2&gt;

&lt;p&gt;The most subtle bugs I've hit are in &lt;code&gt;reset()&lt;/code&gt;, not &lt;code&gt;step()&lt;/code&gt;. Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;State leakage between episodes.&lt;/strong&gt; If your environment holds any mutable state (a browser session, a file handle, a DB connection), make sure &lt;code&gt;reset()&lt;/code&gt; actually clears it. I had an agent that looked like it was learning when it was just reusing the previous episode's state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seeding.&lt;/strong&gt; If you don't seed your environment properly, your results aren't reproducible. Gymnasium has a &lt;code&gt;seed&lt;/code&gt; parameter in &lt;code&gt;reset()&lt;/code&gt; now. Use it. Log the seed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slow resets kill training speed.&lt;/strong&gt; If your environment takes 2 seconds to reset and you're running 10,000 episodes, that's 5+ hours just in resets. Profile this early.&lt;/li&gt;
&lt;/ul&gt;
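
&lt;p&gt;A &lt;code&gt;reset()&lt;/code&gt; that takes all three bullets seriously looks roughly like this. The class and field names are made up for illustration; a real env would also time this method (&lt;code&gt;time.perf_counter&lt;/code&gt; around a few hundred calls) to catch slow resets early:&lt;/p&gt;

```python
import random


class BrowserNavEnv:
    """Sketch of a custom env whose reset() avoids state leakage.

    Everything here is illustrative: `session` stands in for any
    per-episode resource (browser page, file handle, DB connection).
    """

    def __init__(self):
        self.rng = random.Random()
        self.session = None

    def reset(self, seed=None):
        if seed is not None:
            self.rng.seed(seed)      # seed AND log it, for reproducibility
        # A real env would close the old browser session/connection here.
        self.session = {"steps": 0}  # fresh state, nothing carried over
        obs = self.rng.random()
        return obs, {"seed": seed}
```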




&lt;h2&gt;
  
  
  4. Observation and action spaces: be boring
&lt;/h2&gt;

&lt;p&gt;I made the mistake of designing a fancy observation space early on — nested dicts, variable-length sequences, mixed types. It looked elegant. It was a nightmare to work with.&lt;/p&gt;

&lt;p&gt;For a first pass: flatten everything. Use &lt;code&gt;gymnasium.spaces.Box&lt;/code&gt; with a fixed shape. Use &lt;code&gt;gymnasium.spaces.Discrete&lt;/code&gt; for actions. You can make it fancy later once the training loop actually runs.&lt;/p&gt;
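
&lt;p&gt;If your draft observation is already a nested dict, one boring way to get to a fixed-shape &lt;code&gt;Box&lt;/code&gt; is to flatten it yourself. A sketch; the field names in the example are hypothetical:&lt;/p&gt;

```python
import numpy as np


def flatten_obs(obs_dict):
    """Flatten a dict observation into one fixed-shape float32 vector.

    Keys are sorted so the layout is identical every step.
    """
    parts = [np.asarray(obs_dict[key], dtype=np.float32).ravel()
             for key in sorted(obs_dict)]
    return np.concatenate(parts)
```

&lt;p&gt;The length of the resulting vector gives you the &lt;code&gt;shape&lt;/code&gt; argument for your &lt;code&gt;Box&lt;/code&gt; space.&lt;/p&gt;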

&lt;p&gt;The goal at setup is to get &lt;em&gt;something&lt;/em&gt; training, not to get the &lt;em&gt;right&lt;/em&gt; thing training.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Validate your environment before training
&lt;/h2&gt;

&lt;p&gt;This saved me from a week of confused debugging. Before running any RL algorithm on your env, run this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;gymnasium.utils.env_checker&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;check_env&lt;/span&gt;
&lt;span class="nf"&gt;check_env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will catch observation/action space mismatches, incorrect reset signatures, and a bunch of other subtle issues. It's not perfect but it's fast and it catches the obvious stuff.&lt;/p&gt;

&lt;p&gt;Also manually step through a few episodes with random actions and print everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action_space&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;terminated&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;truncated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this breaks, your RL algorithm will too — but with much less helpful error messages.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Local is good, but know its limits
&lt;/h2&gt;

&lt;p&gt;Local setup is great for iteration speed and not burning cloud credits. But there are limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallelism is hard locally.&lt;/strong&gt; Most serious RL training benefits from running many environments in parallel. On a laptop or a single dev machine, you'll hit CPU/memory limits fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser-based environments are especially heavy.&lt;/strong&gt; Each environment instance might spin up its own browser process. 8 parallel envs = 8 browser processes. Your machine will notice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You'll eventually want to scale.&lt;/strong&gt; Whether that's a cloud VM, a university compute cluster, or an RL environment platform, local setup is a starting point — not the final destination.&lt;/li&gt;
&lt;/ul&gt;
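
&lt;p&gt;To make the parallelism cost concrete, here's the shape of a synchronous vectorized wrapper, with a toy env standing in. Every name here is illustrative:&lt;/p&gt;

```python
class CountEnv:
    """Toy stand-in env: observation is a step counter, episode ends at 3."""

    def reset(self, seed=None):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        return self.t, 0.0, self.t >= 3, False, {}


class SyncBatch:
    """Toy synchronous 'vectorized' wrapper: steps N env copies in a loop.

    Real vector APIs (e.g. Gymnasium's) move each copy into its own
    process, which is exactly why N browser envs means N browser processes.
    """

    def __init__(self, make_env, n):
        self.envs = [make_env() for _ in range(n)]

    def reset(self, seed=None):
        return [env.reset(seed=None if seed is None else seed + i)
                for i, env in enumerate(self.envs)]

    def step(self, actions):
        return [env.step(a) for env, a in zip(self.envs, actions)]
```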

&lt;p&gt;I'm still figuring out the scaling part myself. If you've solved this in an interesting way, I'd genuinely like to hear it in the comments.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR: the checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Gymnasium&lt;/strong&gt;, not Gym. Read the migration docs.&lt;/li&gt;
&lt;li&gt;Isolate dependencies. Use &lt;code&gt;uv&lt;/code&gt; or a fresh &lt;code&gt;venv&lt;/code&gt; per project.&lt;/li&gt;
&lt;li&gt;Profile your &lt;code&gt;reset()&lt;/code&gt;. State leakage and slow resets are silent killers.&lt;/li&gt;
&lt;li&gt;Start with flat, boring observation and action spaces.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;check_env()&lt;/code&gt; before you touch an RL algorithm.&lt;/li&gt;
&lt;li&gt;Local is fine to start but plan for the day you need to scale.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you're setting up your first RL environment and hit something I didn't cover, drop it in the comments. I'm definitely still learning and would appreciate the discussion.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>rlenvironment</category>
    </item>
  </channel>
</rss>
