800 Lines Later, I Finally Understood Policy Gradients
Most RL tutorials show you how to use Stable Baselines3. That's great for getting results fast, but it's terrible for understanding what's actually happening when your agent refuses to learn. I spent months using `PPO("MlpPolicy", env)` like a magic incantation before finally deciding to implement both DQN and PPO from scratch in a single, minimal codebase.
The result: SimpleRL, a ~500-line library that implements both algorithms with enough shared infrastructure to see exactly how they differ. Building it broke nearly every assumption I had about reinforcement learning.
Why Build Another RL Library?
This isn't about creating something production-ready. Stable Baselines3 exists. CleanRL exists. The point is pedagogical: when you implement the Bellman backup yourself, when you compute the GAE advantage yourself, the equations stop being abstract and become debugging targets.
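To make that concrete, here is a minimal sketch of the two quantities mentioned above: the one-step Bellman target used in DQN and the GAE advantage used in PPO. This is a hypothetical illustration, not SimpleRL's actual code; the function names and signatures are my own.

```python
import numpy as np

def bellman_target(reward, next_q_max, done, gamma=0.99):
    # One-step Bellman backup for DQN: r + gamma * max_a' Q(s', a'),
    # with the bootstrap term zeroed out at terminal states.
    return reward + gamma * next_q_max * (1.0 - done)

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation: a discounted sum of TD errors
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), accumulated backwards
    # through the trajectory. `values` carries one extra entry for the
    # bootstrap value V(s_T) at the end of the rollout.
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```

Writing the backward GAE loop by hand is exactly the kind of exercise where the equations turn into debugging targets: an off-by-one in the `values` indexing or a missed terminal mask shows up immediately as an agent that won't learn.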
What I wanted:
- Shared replay buffer and environment wrappers between algorithms