The Hamilton-Jacobi-Bellman (HJB) equation stands as a cornerstone of the theory of optimal control, providing, under suitable regularity conditions, a necessary and sufficient condition for optimality in continuous-time dynamic systems. Its implications extend beyond classical control, with profound connections to contemporary fields such as reinforcement learning (RL) and generative modeling, particularly diffusion models. This analysis elaborates on these connections, providing a technical perspective on how the HJB equation underpins the theoretical foundations of these diverse areas.
The Hamilton-Jacobi-Bellman Equation in Optimal Control
The HJB equation is a first-order, non-linear partial differential equation (PDE) that characterizes the value function of an optimal control problem. Consider a system whose state evolves according to a deterministic ordinary differential equation (ODE):
dX_t = f(X_t, u_t, t) dt
where X_t is the state vector at time t, u_t is the control input, and f is the dynamics function. The objective is to find a control policy u_t = \pi(X_t, t) that minimizes a cost functional over a time horizon [t_0, T]:
J(X_{t_0}, t_0, u) = \int_{t_0}^{T} L(X_t, u_t, t) dt + M(X_T, T)
Here, L is the running cost (or instantaneous cost rate), and M is the terminal cost. Assuming u_t is chosen to minimize this cost, the optimal value function V(x, t) is defined as:
V(x, t) = \min_{u \in \mathcal{U}} \left\{ \int_{t}^{T} L(X_s, u_s, s) ds + M(X_T, T) \right\}
subject to X_t = x.
Bellman's principle of optimality states that an optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. Applying this principle over a sufficiently small time step dt, the value function can be expressed as:
V(x, t) = \min_{u} \left\{ L(x, u, t) dt + V(x + f(x, u, t) dt, t + dt) \right\}
Expanding V(x + f(x, u, t) dt, t + dt) using a Taylor series approximation:
V(x + f(x, u, t) dt, t + dt) \approx V(x, t) + \nabla_x V(x, t)^T f(x, u, t) dt + \frac{\partial V}{\partial t}(x, t) dt
Substituting this back and rearranging terms, and then dividing by dt as dt \to 0, yields the deterministic HJB equation:
-\frac{\partial V}{\partial t}(x, t) = \min_{u} \left\{ L(x, u, t) + \nabla_x V(x, t)^T f(x, u, t) \right\}
with the terminal condition V(x, T) = M(x, T). The optimal control u^*(x, t) is found by solving the minimization problem on the right-hand side. This HJB equation is a central tool for analytically determining optimal control laws.
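As a sanity check on this equation, consider the scalar linear-quadratic problem dX_t = u_t dt with running cost (x^2 + u^2)/2 and terminal cost x^2/2: the Riccati equation -\dot{P} = 1 - P^2 with P(T) = 1 has the constant solution P \equiv 1, so V(x, t) = x^2/2 and u^*(x, t) = -x for all t. A minimal numerical sketch, assuming a semi-Lagrangian backward sweep on a grid (the grid sizes and time step are arbitrary choices), recovers this value function:

```python
import numpy as np

# Scalar LQR check: dx = u dt, running cost (x^2 + u^2)/2, terminal cost
# x^2/2. The analytic value function is V(x, t) = x^2/2 for all t, with
# optimal control u* = -x (the Riccati solution is constant here).
xs = np.linspace(-2.0, 2.0, 201)      # state grid
us = np.linspace(-3.0, 3.0, 121)      # discretized control set
T, steps = 1.0, 200
dt = T / steps

V = 0.5 * xs**2                        # terminal condition V(x, T) = x^2/2
for _ in range(steps):                 # march backward from T to 0
    # Semi-Lagrangian Bellman backup: for each candidate control, step the
    # state forward one dt and interpolate the value at the new state.
    cand = np.empty((len(us), len(xs)))
    for i, u in enumerate(us):
        x_next = np.clip(xs + u * dt, xs[0], xs[-1])  # stay on the grid
        cand[i] = 0.5 * (xs**2 + u**2) * dt + np.interp(x_next, xs, V)
    V = cand.min(axis=0)

# Compare with the analytic value function on the interior of the grid,
# away from the artificial boundary.
interior = np.abs(xs) < 1.0
err = np.max(np.abs(V[interior] - 0.5 * xs[interior]**2))
print(f"max error vs analytic V: {err:.4f}")
```

The backup inside the loop is exactly the small-dt form of the value function above, with the Taylor expansion replaced by interpolation on the grid.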
HJB and Reinforcement Learning
Reinforcement Learning (RL) concerns an agent learning to make optimal decisions in an environment to maximize a cumulative reward. The HJB equation serves as the continuous-time, continuous-state counterpart to the discrete-time Bellman equation, which is fundamental to most RL algorithms. In RL, the objective is typically to maximize a return (the negative of a cost), J(X_{t_0}, t_0, u) = \int_{t_0}^{T} R(X_t, u_t, t) dt, where R is the instantaneous reward rate. The value function V(x, t) in this context represents the maximum expected future reward.
Transforming the HJB equation from cost minimization to reward maximization replaces L with R and min with max; the sign of the time derivative is unchanged:
-\frac{\partial V}{\partial t}(x, t) = \max_{u} \left\{ R(x, u, t) + \nabla_x V(x, t)^T f(x, u, t) \right\}
The challenge in RL is that the dynamics f and reward function R are often unknown or partially known. Furthermore, the state space and action space can be high-dimensional and continuous. Direct analytical solutions to the HJB equation are typically intractable for all but the simplest problems due to the curse of dimensionality.
Consequently, RL algorithms often resort to approximation methods. Value-based methods aim to approximate the value function V(x, t) (or the action-value function Q(x, u, t)) using function approximators like neural networks. Policy-based methods directly learn the optimal policy u^*(x, t) (or \pi(x, t)). Actor-Critic methods combine both, with an 'actor' learning the policy and a 'critic' learning the value function to guide the actor's updates. These methods effectively try to solve an approximate version of the HJB equation without explicitly formulating it as a PDE. For instance, in continuous-action spaces, methods like DDPG (Deep Deterministic Policy Gradient) or SAC (Soft Actor-Critic) attempt to find an optimal Q function and an optimal policy that satisfies conditions analogous to the HJB optimality condition.
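These approximations can be seen in miniature in tabular Q-learning, which replaces the HJB minimization with sampled Bellman backups. The sketch below uses a two-state MDP whose transitions and rewards are invented purely for illustration, and checks the learned Q against exact value iteration on the known model:

```python
import numpy as np

# A two-state, two-action deterministic MDP, made up for this toy example:
# P[s][a] is the next state and R[s][a] the reward for taking a in s.
P = np.array([[0, 1],
              [0, 1]])
R = np.array([[1.0, 0.0],
              [2.0, 0.0]])
gamma = 0.9

# Ground truth via Q-value iteration on the known model.
Q_star = np.zeros((2, 2))
for _ in range(500):
    Q_star = R + gamma * Q_star.max(axis=1)[P]

# Model-free tabular Q-learning: with deterministic dynamics, a learning
# rate of 1.0 turns each update into an exact asynchronous Bellman backup.
rng = np.random.default_rng(0)
Q = np.zeros((2, 2))
s = 0
for _ in range(5000):
    a = rng.integers(2)                # pure random exploration suffices here
    s_next, r = P[s, a], R[s, a]
    Q[s, a] = r + gamma * Q[s_next].max()
    s = s_next

print(np.abs(Q - Q_star).max())        # discrepancy shrinks toward zero
```

Deep RL methods replace the table Q with a neural network and the exact backup with a stochastic gradient step, but the fixed-point structure being approximated is the same Bellman (and, in continuous time, HJB) optimality condition.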
Stochastic Optimal Control and Diffusion Processes
Many real-world systems are subject to inherent randomness, necessitating the use of stochastic differential equations (SDEs) to model their dynamics. The HJB equation extends naturally to this domain. Consider a system evolving according to an Ito SDE:
dX_t = f(X_t, u_t, t) dt + \sigma(X_t, t) dW_t
where dW_t is a vector of Wiener processes, and \sigma(X_t, t) is the diffusion coefficient matrix, which can depend on the state and time. The cost functional remains similar, but now we seek to minimize its expected value:
J(X_{t_0}, t_0, u) = E \left[ \int_{t_0}^{T} L(X_t, u_t, t) dt + M(X_T, T) \right]
To derive the HJB equation for this stochastic system, we again employ Bellman's principle of optimality and Ito's Lemma for the increment of the value function V(X_t, t):
dV(X_t, t) = \frac{\partial V}{\partial t} dt + \nabla_x V^T dX_t + \frac{1}{2} \text{Tr}\left( \sigma \sigma^T \nabla_{xx}^2 V \right) dt
Substituting dX_t from the SDE, taking the expectation, and following a similar limiting procedure as in the deterministic case, we arrive at the stochastic HJB equation:
-\frac{\partial V}{\partial t}(x, t) = \min_{u} \left\{ L(x, u, t) + \nabla_x V(x, t)^T f(x, u, t) + \frac{1}{2} \text{Tr}\left( \sigma(x, t) \sigma(x, t)^T \nabla_{xx}^2 V(x, t) \right) \right\}
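The role of the new trace term can be seen in the scalar stochastic linear-quadratic problem; the following is a sketch of a standard textbook calculation:

```latex
% Scalar stochastic LQR: dX_t = u_t \, dt + \sigma \, dW_t, running cost
% L = (x^2 + u^2)/2, terminal cost M(x) = P_T x^2 / 2. With the ansatz
% V(x, t) = \tfrac{1}{2} P(t) x^2 + c(t), the stochastic HJB reads
-\frac{\partial V}{\partial t}
  = \min_u \left\{ \tfrac{1}{2}\left(x^2 + u^2\right) + u \, \partial_x V \right\}
    + \tfrac{1}{2} \sigma^2 \, \partial_{xx} V ,
% so u^* = -\partial_x V = -P(t) x, and matching powers of x gives
-\dot{P}(t) = 1 - P(t)^2 , \qquad -\dot{c}(t) = \tfrac{1}{2} \sigma^2 P(t) .
% The noise leaves the feedback law u^* = -P(t) x unchanged and only adds
% the state-independent cost c(t), generated entirely by the trace term.
```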
This equation is a non-linear parabolic PDE. The last term, involving the second derivative (Hessian) of V, accounts for the diffusion (randomness) in the system. The matrix \sigma \sigma^T is often denoted as D (diffusion matrix), so the term becomes \frac{1}{2} \text{Tr}\left( D(x, t) \nabla_{xx}^2 V(x, t) \right). This term highlights the fundamental connection between optimal control of stochastic systems and the theory of diffusion processes. The probability density function p(x, t) of the state X_t in such a system evolves according to the Fokker-Planck equation (also known as the Kolmogorov Forward equation):
\frac{\partial p}{\partial t} = -\nabla_x \cdot (f p) + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j} \left( D_{ij} p \right)
This equation describes how the probability distribution of the system's state changes over time under the influence of drift f and diffusion D.
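A minimal simulation illustrates this drift-diffusion balance. For the Ornstein-Uhlenbeck SDE below, the Fokker-Planck equation has the stationary Gaussian solution with variance \sigma^2 / (2\theta), which an Euler-Maruyama particle ensemble reproduces (the parameter values are arbitrary choices for the sketch):

```python
import numpy as np

# Euler-Maruyama simulation of an Ornstein-Uhlenbeck SDE,
#   dX_t = -theta * X_t dt + sigma * dW_t,
# whose Fokker-Planck equation has the stationary solution
# p(x) ∝ exp(-theta * x^2 / sigma^2), i.e. N(0, sigma^2 / (2 theta)).
rng = np.random.default_rng(1)
theta, sigma = 1.0, 1.0
dt, steps, n = 2e-3, 5_000, 20_000     # total horizon T = 10 >> 1/theta

x = np.zeros(n)                        # start all particles at the origin
for _ in range(steps):
    dw = rng.normal(0.0, np.sqrt(dt), n)
    x += -theta * x * dt + sigma * dw

var_empirical = x.var()
var_stationary = sigma**2 / (2 * theta)
print(var_empirical, var_stationary)   # should agree to a few percent
```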
Diffusion Models: A Probabilistic Generative Approach
Diffusion models (DMs) are a class of generative models that have gained prominence for their ability to generate high-fidelity samples from complex data distributions. They operate by defining a two-stage process: a forward diffusion process and a reverse denoising process.
Forward Diffusion Process
The forward process gradually transforms data samples x_0 \sim q(x_0) into pure noise over a fixed time horizon T. This is typically modeled as an SDE:
dX_t = f(X_t, t) dt + g(t) dW_t
where f(X_t, t) is the drift function and g(t) is the diffusion coefficient, often chosen such that g(t)^2 = \sigma^2(t) (variance schedule). This process perturbs the data x_0 towards an isotropic Gaussian distribution \mathcal{N}(0, I). The probability distribution q(X_t | x_0) is usually tractable, often being a Gaussian distribution.
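As a concrete instance, assume a variance-preserving forward SDE with a constant noise scale \beta (a simplification of the usual time-dependent schedule); then q(X_t | x_0) is Gaussian in closed form and can be sampled in one shot rather than by simulating the SDE:

```python
import numpy as np

# For a variance-preserving forward SDE with constant noise scale beta,
#   dX_t = -0.5 * beta * X_t dt + sqrt(beta) dW_t,
# the perturbation kernel is Gaussian in closed form:
#   q(x_t | x_0) = N(x_0 * exp(-beta t / 2), (1 - exp(-beta t)) I).
# The constant-beta schedule is a simplifying assumption for this sketch.
def perturb(x0, t, beta, rng):
    mean_coef = np.exp(-0.5 * beta * t)
    var = 1.0 - np.exp(-beta * t)
    return mean_coef * x0 + np.sqrt(var) * rng.normal(size=x0.shape)

rng = np.random.default_rng(2)
x0 = np.full(100_000, 2.0)            # a point mass of "data" at x = 2
beta, t = 1.0, 3.0
xt = perturb(x0, t, beta, rng)
print(xt.mean(), xt.var())            # ≈ 2 e^{-1.5} ≈ 0.446 and ≈ 1 - e^{-3} ≈ 0.950
```

As t grows, the mean coefficient decays to zero and the variance approaches one, i.e. the marginal converges to the isotropic Gaussian prior regardless of x_0.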
Reverse Denoising Process
The core idea of DMs is that the time reversal of such a diffusion is itself described by an SDE, one that starts from pure noise and gradually denoises it back to a data sample x_0. The reverse SDE for X_t evolving from T down to 0 is given by:
dX_t = \left[ f(X_t, t) - g(t)^2 \nabla_x \log q(X_t, t) \right] dt + g(t) d\bar{W}_t
where d\bar{W}_t is a Wiener process for time flowing backward, and q(X_t, t) is the marginal probability density of the forward process at time t. The term \nabla_x \log q(X_t, t) is known as the score function. It represents the direction in the state space that increases the probability density at X_t. Critically, f(X_t, t) here is the drift of the forward process. If the forward process is a simple Ornstein-Uhlenbeck process, then f(X_t, t) is usually a linear function of X_t. The reverse SDE essentially uses the score function to guide the noisy samples back towards regions of higher data density.
Training a diffusion model involves learning to estimate this score function s_\theta(X_t, t) \approx \nabla_x \log q(X_t, t) using a neural network, often via a score-matching objective or a denoising objective as in Denoising Diffusion Probabilistic Models (DDPMs). Once trained, samples are generated by initializing X_T from \mathcal{N}(0, I) and simulating the reverse SDE using the learned score network.
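A toy sketch makes the sampling loop explicit. For 1-D Gaussian data the marginal score is available in closed form, so it can stand in for the learned network s_\theta while we integrate the reverse SDE with Euler-Maruyama; the data distribution and all parameters here are invented for illustration:

```python
import numpy as np

# Toy 1-D score-based sampler with an analytically known score:
# data ~ N(mu0, s0^2); under the VP forward SDE with beta = 1 the marginal
# at time t is N(mu0 * e^{-t/2}, s0^2 * e^{-t} + 1 - e^{-t}), so
#   grad_x log q_t(x) = -(x - m_t) / v_t.
# A trained network s_theta(x, t) would replace this closed form.
mu0, s0 = 1.5, 0.5

def score(x, t):
    m = mu0 * np.exp(-0.5 * t)
    v = s0**2 * np.exp(-t) + 1.0 - np.exp(-t)
    return -(x - m) / v

rng = np.random.default_rng(3)
T, steps, n = 5.0, 1000, 50_000
dt = T / steps

x = rng.normal(0.0, 1.0, n)           # initialize from the prior N(0, 1)
t = T
for _ in range(steps):                # Euler-Maruyama on the reverse SDE
    drift = -0.5 * x - score(x, t)    # f(x, t) - g^2 * score, with g^2 = 1
    x = x - drift * dt + np.sqrt(dt) * rng.normal(size=n)
    t -= dt

print(x.mean(), x.std())              # should approach mu0 = 1.5 and s0 = 0.5
```

Note the sign pattern: integrating backward in time turns the reverse drift f - g^2 * score into a step of -drift * dt, which pulls samples toward regions of higher data density exactly as described above.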
Connecting HJB, Reinforcement Learning, and Diffusion Models
The connections between these three domains are deep, stemming from the underlying principles of optimal control and stochastic processes.
The Score Function as an Optimal Control
The reverse diffusion process can be framed as an optimal control problem. The goal is to steer the noisy distribution q(X_T, T) (pure noise) towards the data distribution q(X_0, 0) (real data) in an "optimal" way.
Consider the reverse SDE:
dX_t = \left[ f(X_t, t) - g(t)^2 \nabla_x \log q(X_t, t) \right] dt + g(t) d\bar{W}_t
Here, the term f(X_t, t) represents the 'natural' drift of the forward process, and g(t)^2 \nabla_x \log q(X_t, t) can be interpreted as an optimal control input that guides the process against its natural tendency (or purely random diffusion) to reach the desired terminal distribution.
More formally, this connection is solidified through the theory of Schrödinger Bridge Problems (SBPs) and Optimal Transport. An SBP seeks to find the most likely stochastic process that connects two marginal distributions at two different times, given a reference stochastic process (often a simple diffusion). The solution to an SBP involves finding a drift modification to the reference process. This optimal drift is precisely related to the score function.
In the context of optimal transport, one might seek to minimize the "energy" or "work" required to transform one probability distribution into another. For stochastic processes, this often leads to control problems where the cost is related to the deviation from a reference diffusion. The HJB equation, in turn, provides the value function for such minimum-cost control problems.
For instance, consider a value function V(x, t) that measures the "cost-to-go" from state x at time t to reach the target data distribution q(x_0). If we define a cost function L(x, u, t) related to the control u (e.g., \|u - f(x, t)\|^2 / (2 g(t)^2)), then the HJB equation describes this optimal value. The optimal control u^*(x, t) derived from the HJB equation corresponds to the drift required to achieve the optimal path, and the score function \nabla_x \log q(x, t) emerges as a crucial component of this optimal drift.
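This quadratic control cost can be carried through the stochastic HJB equation explicitly. The following sketch, assuming the controlled SDE dX_t = u_t dt + g(t) dW_t, uses the logarithmic (Cole-Hopf) transform to expose the log-density structure:

```latex
% Controlled SDE dX_t = u_t \, dt + g(t) \, dW_t with control cost
% L(x, u, t) = \| u - f(x, t) \|^2 / (2 g(t)^2). The stochastic HJB is
-\partial_t V = \min_u \Big\{ \tfrac{\|u - f\|^2}{2 g^2}
    + \nabla_x V^\top u \Big\} + \tfrac{1}{2} g^2 \Delta_x V ,
% minimized by u^* = f - g^2 \nabla_x V, which yields the quadratic HJB
-\partial_t V = \nabla_x V^\top f - \tfrac{1}{2} g^2 \|\nabla_x V\|^2
    + \tfrac{1}{2} g^2 \Delta_x V .
% The logarithmic (Cole--Hopf) transform \psi = e^{-V} linearizes this:
-\partial_t \psi = \nabla_x \psi^\top f + \tfrac{1}{2} g^2 \Delta_x \psi ,
% the backward Kolmogorov equation of the uncontrolled diffusion. Hence
% V = -\log \psi and u^* = f + g^2 \nabla_x \log \psi : the optimal drift
% is the reference drift plus a score-like log-density correction.
```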
Specifically, the Schrödinger system (a pair of coupled linear parabolic PDEs) describes the forward and backward marginal distributions of the optimal stochastic process in an SBP, and it has clear structural similarities to the HJB equation. In some formulations, the optimal drift solving the SBP is the gradient of a potential, and these potentials satisfy equations closely related to the HJB equation; the link runs through the logarithm of the probability density, which is precisely the quantity the score function differentiates.
If we define a value function related to the negative log-likelihood of reaching the target distribution, then its gradient would essentially be the optimal "force" to apply. In the case of score-based generative models, the learned score function s_\theta(X_t, t) is precisely this force.
Reinforcement Learning Perspective on Diffusion Models
From an RL standpoint, training a diffusion model can be viewed as learning an optimal policy (the score function s_\theta(X_t, t)) for a continuous-time stochastic control problem.
- Agent: The neural network that estimates the score function.
- Environment: The forward diffusion process, which introduces noise.
- State: The noisy data X_t.
- Action/Control: The estimated score s_\theta(X_t, t), which determines the drift in the reverse SDE.
- Reward: Implicitly defined. The ultimate "reward" is successfully generating samples that belong to the original data distribution q(x_0). The training objective (e.g., denoising score matching) quantifies how well the learned policy steers the samples towards the data manifold. It penalizes discrepancies between the true score and the estimated score, which can be interpreted as a penalty for suboptimal control.
In this context, the training of diffusion models can be seen as a form of Inverse Reinforcement Learning (IRL) or Model-Based RL.
- IRL: Instead of specifying a reward function, we observe "optimal" trajectories (the paths from noise to data) and try to infer the underlying reward (or cost) function that would lead to such behavior. Here, the "optimal trajectories" are implicitly encoded in the data distribution, and the score function helps reconstruct these optimal paths.
- Model-Based RL: The SDE for the reverse process acts as a dynamic model. The score network is learning to control this dynamic model to achieve a desired outcome.
The objective function in diffusion models, often an expectation over noise scales and data samples, can be conceptually linked to minimizing a divergence (e.g., KL divergence) between the generated data distribution and the true data distribution. Such divergences can sometimes be formulated as costs in an optimal control problem, making the learned score function, in effect, the gradient of an optimal value function described by an HJB equation.
For example, the continuous-time version of DDPMs often optimizes an objective that resembles a variational lower bound on the negative log-likelihood, which, when translated to an SDE setting, can be interpreted as minimizing a generalized energy or cost associated with the reverse path. The HJB equation, in this light, provides the necessary conditions for the value function of this underlying optimal control problem, whose optimal policy corresponds to the score function.
Challenges and Future Directions
Despite these theoretical connections, several challenges remain. The HJB equation's analytical intractability for high-dimensional problems persists. While RL techniques offer approximation strategies, they often struggle with sample efficiency and convergence guarantees. For diffusion models, the computational cost of simulating SDEs for sampling can be high, and while methods like ODE samplers or fewer sampling steps exist, they still rely on an accurate score function estimate.
Future research could explore:
- Direct HJB Solvers for Diffusion Models: Can we leverage advanced numerical methods for PDEs (e.g., neural PDE solvers, sparse grids) to directly approximate the value function or optimal control (score function) from an HJB formulation, rather than relying solely on score matching?
- Explicit RL Formulations for Diffusion: Can diffusion model training be framed as a more explicit RL problem with defined rewards and transitions, allowing the application of sophisticated RL algorithms (e.g., policy iteration, Q-learning variants) to learn the score function? This could potentially lead to more robust or efficient training paradigms.
- Optimal Control for Latent Spaces: Extending these ideas to diffusion in latent spaces could provide a principled way to navigate and generate complex data by controlling latent dynamics.
- Connecting Different Optimal Transport Costs: Investigate how different cost functions in optimal transport problems give rise to different forms of diffusion models and their corresponding HJB equations. This could lead to a richer family of generative models.
The Hamilton-Jacobi-Bellman equation provides a unified mathematical framework that deeply connects optimal control, reinforcement learning, and modern generative models like diffusion models. Understanding these underlying theoretical links not only enriches our comprehension of these fields but also opens avenues for cross-pollination of ideas and the development of more powerful algorithms. The HJB equation, originating in the quest for optimal decision-making, continues to be a fertile ground for innovation in artificial intelligence.
For inquiries regarding advanced control systems design, reinforcement learning applications, or high-dimensional generative modeling, please visit our website at https://www.mgatc.com for expert consulting services.
Originally published in Spanish at www.mgatc.com/blog/hamilton-jacobi-bellman-equation-reinforcement-learning-diffusion-models/