On Optimization Objectives in Reinforcement Learning

#reinforcementlearning #machinelearning #algorithms #neuralnetwork

Reinforcement Learning: Optimization and Objective Methods

Machine learning has a few training paradigms. Each takes a different approach to how data is presented to a model and what the model is expected to do with it. Reinforcement learning is the one we will explore in more detail here.

In supervised learning, you give the model labeled example data and then test the model with examples it has not yet seen. Think of it like studying with access to an answer key. In unsupervised learning, you give the model unlabeled data and it finds structure. Imagine watching a painting in pointillism render from start to finish. There's no order to it in the beginning. Eventually though, you might start to recognize an image forming. From there you might have an easier time predicting what the finished painting will look like.

Reinforcement learning (RL) takes a different approach: there may be no explicit training examples at all, labeled or unlabeled. Instead, the data comes from the agent’s own experience. The agent makes decisions, interacts directly with an environment, and receives feedback as a reward or penalty signal. It is essentially training by trial and error.


By aiming to maximize cumulative reward over time, the agent 'learns' how to behave in a given environment. This seems simple in theory, but in practice it’s one of the harder optimization problems in ML. You’re not minimizing a fixed loss function over a static dataset. The data the agent trains on depends on the actions it took, which depend on the policy it’s currently learning, which depends on the data it collected. It’s circular by design, and that circularity is where most of the difficulty comes from.

What the Agent Is Actually Optimizing

The RL objective, usually written J(π), is the expected cumulative discounted reward. It has three ingredients: the policy π (the agent’s decision-making function), a discount factor γ between 0 and 1, and the reward received at each time step. The discount factor controls how much the agent weights future rewards versus immediate ones. If it is too low, the agent cares about little beyond the next reward. If it is too high, the agent treats rewards 100 steps away as nearly equivalent to rewards right now. The right value depends on how far ahead the task requires the agent to plan.
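To make the discount factor concrete, here is a minimal sketch. It assumes the standard discounted-return form of the objective, J(π) = E[Σ_t γ^t r_t], which this article describes in words but does not write out. The function name and the example discount values are illustrative choices, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory (one sample of the return)."""
    total = 0.0
    # Work backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Weight placed on a reward 100 steps in the future, for several discount factors.
for gamma in (0.5, 0.9, 0.99, 0.999):
    print(f"gamma={gamma}: weight at step 100 = {gamma ** 100:.3g}")
```

Running the loop shows how quickly the weight on a distant reward collapses as the discount factor drops, which is the trade-off described above.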

This works differently from supervised ML objectives like MSE or cross-entropy. Those are computed over a fixed dataset — you know what you’re optimizing against before training starts. In RL, the objective depends on trajectory data the agent generates during training. You’re optimizing a target that shifts as the agent improves, using data the agent collected under an earlier, 'worse' version of itself.

Three Approaches to Optimization

There are three broad families of RL algorithms, each taking a different approach to policy improvement.

Value-based methods estimate how good each state or state-action pair is, then derive a policy from those estimates. The foundational example here is Q-learning: the agent learns a Q-function, Q(s, a), that estimates the expected cumulative future reward for taking action a in state s. Policy improvement is implicit: pick the action with the highest Q-value. The optimization objective is minimizing the Bellman error*, the difference between the current Q-value estimate and a target built from the observed reward plus the discounted value of the best next action.
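The update behind that objective can be sketched for the tabular case. This is a minimal illustration, assuming a small discrete state and action space; the state names, action names, learning rate `alpha`, and discount `gamma` are all hypothetical values chosen for the example.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning step and return the TD (Bellman) error."""
    # Target: observed reward plus discounted value of the best next action.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    td_error = target - Q[(state, action)]
    # Move the current estimate a fraction alpha toward the target.
    Q[(state, action)] += alpha * td_error
    return td_error


Q = defaultdict(float)         # every Q-value starts at 0
actions = ["left", "right"]    # hypothetical action set
err = q_learning_update(Q, "s0", "right", reward=1.0,
                        next_state="s1", actions=actions)
```

The returned `td_error` is the quantity the paragraph above calls the Bellman error for a single transition. Deep Q-learning variants minimize a squared version of it with a neural network in place of the table.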

Policy gradient methods skip the value estimates and optimize the policy directly. They compute the gradient of the expected return J(π) with respect to the policy parameters and step in that direction. REINFORCE is the classic version. The problem is that policy gradient estimates have high variance, especially early in training when the agent has seen little of the environment. The gradient estimates are noisy, and that noise slows convergence.
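A stripped-down sketch of a REINFORCE-style gradient estimate may help here. It assumes the simplest possible setting: a stateless softmax policy over a handful of actions, with one return per sampled action. The function names and the two-action example are illustrative assumptions, and real implementations work with state-dependent policies and automatic differentiation.

```python
import math


def softmax(prefs):
    """Convert action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(prefs, episodes):
    """Monte Carlo policy gradient estimate for a stateless softmax policy.

    episodes: list of (action_index, return) pairs sampled from the policy.
    """
    probs = softmax(prefs)
    grad = [0.0] * len(prefs)
    for action, ret in episodes:
        for i in range(len(prefs)):
            # d/d prefs[i] of log pi(action) = 1[i == action] - pi(i)
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            grad[i] += grad_log_pi * ret
    return [g / len(episodes) for g in grad]


prefs = [0.0, 0.0]  # uniform policy over two hypothetical actions
grad = reinforce_gradient(prefs, [(0, 1.0), (1, 0.0)])
```

A gradient ascent step would add a small multiple of `grad` to `prefs`. With only a few sampled episodes the estimate swings widely from batch to batch, which is the variance problem described above.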

Actor-critic methods combine both. The actor is the policy and decides actions. The critic is a value function and evaluates how good those actions were. The critic’s estimates reduce variance in the actor’s gradient updates. Many widely used RL algorithms, including PPO, A3C, and SAC, are actor-critic architectures for exactly this reason.
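One common way the critic reduces variance is by replacing the raw return in the actor’s update with an advantage: how much better the outcome was than the critic expected. The sketch below assumes a one-step temporal-difference advantage and a scalar value estimate per state. The function names and step sizes are hypothetical, and PPO, A3C, and SAC each use their own, more elaborate variants.

```python
def td_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step advantage: observed outcome minus the critic's prediction."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


def critic_update(value_s, advantage, lr=0.1):
    """Move the critic's estimate for the state toward the one-step target."""
    return value_s + lr * advantage


# The critic predicted 0.5, the episode ended with reward 1.0.
adv = td_advantage(reward=1.0, value_s=0.5, value_next=0.0, done=True)
new_value = critic_update(0.5, adv)
```

The actor would then scale its log-probability gradient by `adv` instead of the full return, so actions are reinforced only to the extent that they beat the critic’s expectation.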

Problems in Practice

Credit assignment: An agent plays a game and wins 200 moves later. Which moves mattered? The reward is clear but the attribution is not. Discount factors and value functions both exist, in part, to address this.

Exploration vs. exploitation: An agent that only takes actions it already knows are good stops learning. An agent that constantly explores won’t converge on anything useful. Balancing these two is a design decision. Common approaches include epsilon-greedy exploration (taking a random action with probability ε) and entropy regularization (adding a term to the objective that rewards taking varied actions).
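Both approaches fit in a few lines. This is a minimal sketch, assuming a discrete action space with Q-value estimates for epsilon-greedy and an explicit action distribution for the entropy bonus. The weighting coefficient `beta` and the function names are assumptions made for the example.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def entropy(probs):
    """Shannon entropy of an action distribution (higher means more varied)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def regularized_objective(expected_return, probs, beta=0.01):
    """Expected return plus an entropy bonus; beta is a hypothetical weight."""
    return expected_return + beta * entropy(probs)
```

With `epsilon=0` the agent is purely greedy, and with `epsilon=1` it acts uniformly at random. The entropy bonus plays the same role for stochastic policies: a policy that puts all its probability on one action gets no bonus.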

Sample efficiency: RL needs a lot of experience to learn from. A person can pick up a new video game and achieve competence in under an hour. An RL agent usually needs millions of environment interactions to do something similar. Model-based RL addresses this by building an internal model of the environment and planning ahead from it, rather than learning purely from direct experience.

Reward Design as the Objective Function

The clean mathematical formulation glosses over the fact that in most real RL problems, you don’t have a natural reward function. You must design one, and designing an effective reward function can be painful.

The agent optimizes whatever you give it. If you reward a robot arm for moving toward a target but don’t penalize inefficient paths, it finds paths that increase the reward metric without doing what you wanted. This may not be a bug in the algorithm so much as it is the algorithm working correctly on a poorly defined objective.
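The robot-arm scenario can be mimicked with a toy calculation. The sketch below is entirely hypothetical: it assumes the reward is computed from the arm’s distance to the target at each step, and compares a reward that only pays for progress against one that also charges for moving away and for each step taken. The distance sequences are made up for illustration.

```python
def progress_only_reward(prev_dist, new_dist):
    """Pays for any step that reduces distance; steps away cost nothing."""
    return max(0.0, prev_dist - new_dist)


def signed_progress_reward(prev_dist, new_dist, step_cost=0.01):
    """Pays for net progress and charges a small cost per step."""
    return (prev_dist - new_dist) - step_cost


def total_reward(distances, reward_fn):
    """Sum a reward function over consecutive distance-to-target readings."""
    return sum(reward_fn(a, b) for a, b in zip(distances, distances[1:]))


direct = [3.0, 2.0, 1.0, 0.0]                       # straight to the target
wobble = [3.0, 2.0, 3.0, 2.0, 3.0, 2.0, 1.0, 0.0]   # back and forth first

for name, fn in [("progress only", progress_only_reward),
                 ("signed + step cost", signed_progress_reward)]:
    print(name, total_reward(direct, fn), total_reward(wobble, fn))
```

Under the progress-only reward, the back-and-forth path scores higher than the direct one, so an optimizer has every reason to prefer it. Under the signed reward with a step cost, the ranking flips to match what the designer wanted.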

A frequently cited example comes from research on simulated locomotion: an agent rewarded for moving fast learned to make itself tall and then fall over repeatedly, because falling generated high velocity without the actual cost of locomotion. The reward function said “go fast,” and it did.

This connects directly to a familiar idea from supervised ML, where the objective function and the performance metric are often different things. You optimize one to improve the other, and they don’t always agree on which model is best. In RL that gap is almost always present. The reward function you are optimizing is generally an approximation of the end goal, and the agent will find every corner case where the approximation breaks down.

How This Connects to Standard ML

RL uses the same building blocks covered in core ML: neural networks as function approximators, gradient descent for parameter updates, regularization (entropy terms play a role loosely similar to L2 penalties by discouraging extreme solutions), and held-out evaluation environments that parallel cross-validation. The main difference is the sequential, interactive data collection process.

Standard supervised learning assumes training data is independent and identically distributed. RL violates that assumption constantly. Each action changes the state of the environment, which affects what data gets collected next. The dataset and the model are coupled. That’s what makes RL its own subfield rather than just another model type — but the mathematical machinery underneath it is mostly the same machinery covered everywhere else in ML.

Source material: M. Clark, Models Demystified — Chapter 10: Core Concepts in Machine Learning and Chapter 11: Common Models in Machine Learning

*The Bellman error measures how far an estimate is from satisfying the Bellman equation. With the time subscripts dropped and the value of the next state plugged in, the equation for a state-value function V can be written as:

$V(x)=\max_{a\in \Gamma (x)}\{F(x,a)+\beta V(T(x,a))\}.$