800 Lines Later, I Finally Understood Policy Gradients
Most RL tutorials show you how to use Stable Baselines3. That's great for getting results fast, but it's terrible for understanding what's actually happening when your agent refuses to learn. I spent months using `PPO("MlpPolicy", env)` like a magic incantation before finally deciding to implement both DQN and PPO from scratch in a single, minimal codebase.
The result: SimpleRL, a ~500-line library that implements both algorithms with enough shared infrastructure to see exactly how they differ. Building it broke nearly every assumption I had about reinforcement learning.
Why Build Another RL Library?
This isn't about creating something production-ready. Stable Baselines3 exists. CleanRL exists. The point is pedagogical: when you implement the Bellman backup yourself, when you compute the GAE advantage yourself, the equations stop being abstract and become debugging targets.
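To make that concrete, here is a minimal sketch of the two quantities mentioned above: the one-step Bellman target used in DQN and the GAE advantage used in PPO. This is a hypothetical illustration, not SimpleRL's actual code; the function names and signatures are my own.

```python
import numpy as np

def bellman_target(reward, next_q_max, done, gamma=0.99):
    # One-step Bellman backup for DQN: r + gamma * max_a' Q(s', a'),
    # with the bootstrap term zeroed out at terminal states.
    return reward + gamma * next_q_max * (1.0 - done)

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation: a discounted sum of TD errors
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), accumulated backwards
    # through the trajectory. `values` carries one extra entry for the
    # bootstrap value V(s_T) at the end of the rollout.
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```

Writing the backward GAE loop by hand is exactly the kind of exercise where the equations turn into debugging targets: an off-by-one in the `values` indexing or a missed terminal mask shows up immediately as an agent that won't learn.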
What I wanted:
- Shared replay buffer and environment wrappers between algorithms