I. Introduction
Recently, reinforcement learning (RL) [1] has achieved impressive performance on a class of continuous control problems including locomotion [2] and robot manipulation [3]. Policy-based optimization is the main engine behind these RL applications [4]. Specifically, the natural policy gradient method [5] and several related methods, including TRPO [6], natural AC [7], and PPO [8], are among the most popular RL algorithms for continuous control tasks. These methods enable flexible policy parameterizations, and are end-to-end in the sense that the control performance metrics are directly optimized.
Despite the empirical successes of policy optimization methods, how to choose these algorithms for a specific control task is still more of an art than a science [9, 10]. This motivates a recent research trend focusing on understanding the performance of RL algorithms on simplified benchmarks. Specifically, significant research has recently been conducted to understand the performance of various model-free or model-based RL algorithms on the classic Linear Quadratic Regulator (LQR) problem [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]. In [11], it is shown that despite the nonconvexity of the objective function, policy gradient methods can still provably learn the optimal LQR controller. This provides a good sanity check for policy optimization on further control applications.
Built upon the good progress on understanding RL for the LQR problem, this paper moves one step further and studies policy optimization for Markov jump linear systems (MJLS) [22] from a theoretical perspective. MJLS form an important class of systems that arise in many control applications [23, 24, 25, 26, 27, 28]. Recently, stochastic methods in machine learning have also been modeled as jump systems [29, 30]. The research on MJLS has great practical value and at the same time provides many new and interesting theoretical problems. In the classic LQR problem, one aims to control a linear time-invariant (LTI) system whose state/input matrices do not change over time. In contrast, the state/input matrices of a Markov jump linear system are functions of a jump parameter that is sampled from an underlying Markov chain. Consequently, the behaviors of MJLS become very different from those of LTI systems. Controlling unknown MJLS poses many new challenges over traditional LQR due to the appearance of this Markov jump parameter. For example, in a model-based approach, one has to learn both the state/input matrices and the transition probabilities of the jump parameter; it is the coupling between the state/input matrices and the jump parameter distribution that causes the main difficulty. Therefore, the quadratic control of MJLS is a meaningful benchmark for further understanding of RL algorithms.
Obviously, studying policy optimization on MJLS control problems is important for further understanding of policy-based RL algorithms. In this paper, we present various convergence guarantees for policy optimization methods on the quadratic control of MJLS. First, we study the optimization landscape of direct policy optimization for MJLS, and demonstrate that despite the nonconvexity of the resultant problem, the unique stationary point is the global optimal solution. Next, we prove that the Gauss-Newton method and the natural policy gradient method converge to the optimal state feedback controller for MJLS at a linear rate if a stabilizing initial controller is used. We introduce a novel Lyapunov argument to fix a key stability issue in the convergence proof. Finally, numerical simulations are provided to support our theory.
The most relevant reference for our paper is [11]. Our results generalize the convergence theory of the Gauss-Newton method and the natural policy gradient method in [11] to the MJLS case. This extension is nontrivial. Specifically, one key issue in the convergence proof is to ensure that the iterates never wander into the region of instability. In [11], the system is LTI and the stability argument can be made by using the properties of the spectral radius of the state matrix. For MJLS, one cannot directly make such arguments any more due to the stochastic nature of the system. Instead, we propose a novel Lyapunov argument to show that the resultant controller is always stabilizing for the MJLS in the mean square sense along the optimization trajectory of the Gauss-Newton method and the natural policy gradient method, if the learning rates are chosen properly.
II. Background and Preliminaries
II-A. Notation
We denote the set of real numbers by $\mathbb{R}$. For a square matrix $M$, we use the notation $M^T$, $\|M\|$, $\mathrm{tr}(M)$, and $\sigma_{\min}(M)$ to denote its transpose, spectral norm, trace, and minimum singular value, respectively. We indicate positive definite and positive semidefinite matrices by $M > 0$ and $M \ge 0$, respectively. Given matrices $\{D_i\}_{i=1}^N$, let $\mathrm{diag}(D_1, \dots, D_N)$ denote the block diagonal matrix whose $i$-th diagonal block is $D_i$. Given a function $f$, we use $df$ to denote its total derivative.

II-B. Quadratic Control of Markovian Jump Linear Systems
A Markovian jump linear system is governed by the following discrete-time state-space model:

$$x_{t+1} = A_{\omega(t)} x_t + B_{\omega(t)} u_t, \qquad (1)$$

where $x_t \in \mathbb{R}^d$ is the system state at time $t$, and $u_t \in \mathbb{R}^k$ corresponds to the control action at time $t$. The initial state $x_0$ is assumed to have a distribution $\mathcal{D}$. The system matrices $A_{\omega(t)} \in \mathbb{R}^{d \times d}$ and $B_{\omega(t)} \in \mathbb{R}^{d \times k}$ depend on the switching parameter $\omega(t)$, which takes values in $\Omega := \{1, 2, \dots, N\}$ for each $t$. The jump parameter $\{\omega(t)\}_{t \ge 0}$ forms a discrete-time Markov chain. The transition probabilities and initial distribution of $\{\omega(t)\}$ are given by

$$\mathbb{P}\big(\omega(t+1) = j \mid \omega(t) = i\big) = p_{ij}, \qquad \mathbb{P}\big(\omega(0) = i\big) = \pi_i, \qquad (2)$$

respectively. The transition probabilities satisfy $p_{ij} \ge 0$ and $\sum_{j=1}^{N} p_{ij} = 1$ for each $i \in \Omega$. The initial distribution satisfies $\sum_{i=1}^{N} \pi_i = 1$.
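To make the dynamics (1) and the chain (2) concrete, the following sketch simulates an MJLS rollout. This is an illustrative aid, not code from the paper; every matrix, probability, and function name in it is a made-up example.

```python
import numpy as np

def simulate_mjls(A, B, P_trans, pi0, x0, u_seq, rng):
    """Simulate x_{t+1} = A[w(t)] x_t + B[w(t)] u_t, where the mode w(t)
    is a Markov chain: P(w(t+1)=j | w(t)=i) = P_trans[i][j], w(0) ~ pi0."""
    N = len(A)
    mode = rng.choice(N, p=pi0)                # sample omega(0)
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for u in u_seq:
        x = A[mode] @ x + B[mode] @ u          # mode-dependent dynamics
        mode = rng.choice(N, p=P_trans[mode])  # sample the next jump parameter
        traj.append(x.copy())
    return np.array(traj)
```

Even this toy rollout shows the key feature discussed above: which $(A_i, B_i)$ pair acts at time $t$ is random, so the state trajectory is a random process even under a deterministic policy.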
In this paper, we focus on the quadratic control problem whose objective is to choose the control actions $\{u_t\}$ to minimize the following cost function:

$$C := \mathbb{E}\left[\sum_{t=0}^{\infty} \big(x_t^T Q_{\omega(t)} x_t + u_t^T R_{\omega(t)} u_t\big)\right], \qquad (3)$$

where it is assumed that $Q_i > 0$ and $R_i > 0$ for each $i \in \Omega$. This problem can be viewed as the MJLS counterpart of the standard LQR problem, and hence is termed the "MJLS LQR problem." The optimal controller for this MJLS LQR problem, defined by the dynamics (1), the cost (3), and the switching probabilities (2), can be computed by solving a system of coupled algebraic Riccati equations [31], which we now describe. First, it is known that the optimal cost can be achieved by a linear state feedback of the form

$$u_t = -K_{\omega(t)} x_t, \qquad (4)$$

with $K_i \in \mathbb{R}^{k \times d}$ for each $i \in \Omega$. One can solve for the gains $\{K_i\}_{i \in \Omega}$ as follows. Let $\mathcal{E}_i(P) := \sum_{j=1}^{N} p_{ij} P_j$. Formally, let $\{P_i\}_{i \in \Omega}$ be the unique positive definite solution to the following equations:

$$P_i = Q_i + A_i^T \mathcal{E}_i(P) A_i - A_i^T \mathcal{E}_i(P) B_i \big(R_i + B_i^T \mathcal{E}_i(P) B_i\big)^{-1} B_i^T \mathcal{E}_i(P) A_i, \quad \forall i \in \Omega. \qquad (5)$$

It can be shown that the linear state feedback controller that minimizes the cost function (3) is given by

$$K_i = \big(R_i + B_i^T \mathcal{E}_i(P) B_i\big)^{-1} B_i^T \mathcal{E}_i(P) A_i, \quad \forall i \in \Omega. \qquad (6)$$
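As a numerical sanity check, the coupled equations (5)-(6) can be approximated by a simple fixed-point (Riccati-recursion) iteration. The sketch below is illustrative, not the paper's code; it assumes the recursion converges to the stabilizing solution, which holds under standard mean-square stabilizability conditions, and all test data are made up.

```python
import numpy as np

def solve_coupled_riccati(A, B, Q, R, P_trans, iters=500):
    """Fixed-point iteration on the coupled Riccati equations (5);
    returns the per-mode value matrices P and optimal gains K of (6)."""
    N, d = len(A), A[0].shape[0]
    P = [np.eye(d) for _ in range(N)]
    for _ in range(iters):
        # E_i(P) = sum_j p_ij P_j couples the modes through the Markov chain
        EP = [sum(P_trans[i][j] * P[j] for j in range(N)) for i in range(N)]
        P = [Q[i] + A[i].T @ EP[i] @ A[i]
             - A[i].T @ EP[i] @ B[i]
             @ np.linalg.solve(R[i] + B[i].T @ EP[i] @ B[i],
                               B[i].T @ EP[i] @ A[i])
             for i in range(N)]
    EP = [sum(P_trans[i][j] * P[j] for j in range(N)) for i in range(N)]
    K = [np.linalg.solve(R[i] + B[i].T @ EP[i] @ B[i],
                         B[i].T @ EP[i] @ A[i]) for i in range(N)]
    return P, K
```

With a single mode ($N = 1$, $p_{11} = 1$), the iteration reduces to the standard discrete-time Riccati recursion for LQR, which is a useful consistency check.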
II-C. Policy Optimization for Quadratic Control of LTI Systems
Before proceeding to policy optimization of MJLS, we review policy gradient methods for the quadratic control of LTI systems [11]. Consider the LTI system $x_{t+1} = A x_t + B u_t$ with an initial state distribution $x_0 \sim \mathcal{D}$ and a static state feedback controller $u_t = -K x_t$. We adopt a standard quadratic cost function, which can be calculated as

$$C(K) = \mathbb{E}_{x_0 \sim \mathcal{D}}\left[\sum_{t=0}^{\infty} x_t^T \big(Q + K^T R K\big) x_t\right]. \qquad (7)$$

Obviously, the cost in (7) can be computed as $C(K) = \mathbb{E}_{x_0 \sim \mathcal{D}}\big[x_0^T P_K x_0\big]$, where $P_K$ is the solution to the Lyapunov equation $P_K = Q + K^T R K + (A - BK)^T P_K (A - BK)$. It is also well known [34, 11] that the gradient of (7) with respect to $K$ can be calculated as

$$\nabla C(K) = 2\big[(R + B^T P_K B) K - B^T P_K A\big] \Sigma_K,$$

where $\Sigma_K$ is the state correlation matrix, i.e. $\Sigma_K := \mathbb{E}_{x_0 \sim \mathcal{D}} \sum_{t=0}^{\infty} x_t x_t^T$. Based on this gradient formula, one can optimize (7) using the (deterministic) policy gradient method $K' = K - \eta \nabla C(K)$, the natural policy gradient method $K' = K - \eta \nabla C(K) \Sigma_K^{-1}$, or the Gauss-Newton method $K' = K - \eta (R + B^T P_K B)^{-1} \nabla C(K) \Sigma_K^{-1}$. More explanations of these methods can be found in [11].

In [11], it is shown that the unique stationary point of $C(K)$ is the global minimum $K^*$ if $\mathbb{E}_{x_0 \sim \mathcal{D}}\big[x_0 x_0^T\big]$ is full rank. In addition, all the above methods are shown to converge to $K^*$ linearly if a stabilizing initial policy is used.
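The LTI formulas above can be sketched numerically as follows. This is an illustrative implementation, not the paper's code: the Lyapunov equations are solved by naive series summation, which assumes the relevant closed-loop matrices are stable, and the system data are arbitrary example values.

```python
import numpy as np

def dlyap(F, W, iters=2000):
    """Solve X = W + F^T X F by series summation (valid when rho(F) < 1)."""
    X = np.zeros_like(W)
    for _ in range(iters):
        X = W + F.T @ X @ F
    return X

def lqr_cost_and_grad(A, B, Q, R, K, Sigma0):
    """Cost C(K), gradient, value matrix P_K, and correlation Sigma_K for
    x_{t+1} = (A - B K) x_t with E[x_0 x_0^T] = Sigma0."""
    F = A - B @ K
    P = dlyap(F, Q + K.T @ R @ K)     # value matrix: C(K) = tr(P Sigma0)
    Sigma = dlyap(F.T, Sigma0)        # Sigma_K = sum_t E[x_t x_t^T]
    grad = 2 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Sigma
    return np.trace(P @ Sigma0), grad, P, Sigma
```

A useful observation about the Gauss-Newton update: with step size $\eta = 1/2$ it reduces to $K' = (R + B^T P_K B)^{-1} B^T P_K A$, i.e. classical policy iteration, so the cost is guaranteed to decrease.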
III. Policy Gradient and Optimization Landscape
Now we focus on the policy optimization of the MJLS LQR problem. Since we know the optimal cost can be achieved by a linear state feedback, it is reasonable to restrict the policy search to the class of linear state feedback controllers. Specifically, we set $u_t = -K_{\omega(t)} x_t$ and collect $K := (K_1, K_2, \dots, K_N)$, where each component $K_i$ is the feedback gain of the corresponding mode. With this notation, we consider the following policy optimization problem, whose decision variable is $K$.

Problem 1: Policy Optimization for MJLS.

minimize: $\; C(K) = \mathbb{E}\left[\sum_{t=0}^{\infty} x_t^T \big(Q_{\omega(t)} + K_{\omega(t)}^T R_{\omega(t)} K_{\omega(t)}\big) x_t\right]$

subject to: $\; x_{t+1} = \big(A_{\omega(t)} - B_{\omega(t)} K_{\omega(t)}\big) x_t, \quad x_0 \sim \mathcal{D}, \quad \omega(0) \sim \pi.$

In this section, we present an explicit formula for the policy gradient and discuss the optimization landscape of the above problem. We want to emphasize that the above problem is indeed a constrained optimization problem. The feasible set consists of all $K$ stabilizing the closed-loop dynamics in the mean square sense (and hence yielding a finite $C(K)$); we denote this feasible set by $\mathcal{K}$. For $K \notin \mathcal{K}$, the cost in (3) can blow up to infinity, and differentiability is also an issue. For $K \in \mathcal{K}$, the cost is finite and differentiable. To obtain the formula for $\nabla C(K)$, we first rewrite the quadratic cost (3) as
$$C(K) = \mathbb{E}\big[x_0^T P_{\omega(0)}^K x_0\big], \qquad (8)$$

where $\{P_i^K\}_{i \in \Omega}$ is defined to be the solution to the coupled Lyapunov equations

$$P_i^K = Q_i + K_i^T R_i K_i + (A_i - B_i K_i)^T \mathcal{E}_i(P^K) (A_i - B_i K_i) \qquad (9)$$

for all $i \in \Omega$. Recall that we have $\mathcal{E}_i(P^K) = \sum_{j=1}^{N} p_{ij} P_j^K$.
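The cost representation (8)-(9) is also computable: iterate the coupled Lyapunov equations to a fixed point (valid only for a mean-square stabilizing $K$, i.e. $K \in \mathcal{K}$) and average over the initial mode and state distributions. The sketch below is illustrative and assumes $x_0$ and $\omega(0)$ are drawn independently.

```python
import numpy as np

def mjls_cost(A, B, Q, R, K, P_trans, pi0, Sigma0, iters=2000):
    """Iterate P_i = Q_i + K_i^T R_i K_i
                   + (A_i - B_i K_i)^T E_i(P) (A_i - B_i K_i)
    to a fixed point; then C(K) = sum_i pi0[i] * tr(P_i Sigma0)."""
    N, d = len(A), A[0].shape[0]
    P = [np.zeros((d, d)) for _ in range(N)]
    for _ in range(iters):
        EP = [sum(P_trans[i][j] * P[j] for j in range(N)) for i in range(N)]
        P = [Q[i] + K[i].T @ R[i] @ K[i]
             + (A[i] - B[i] @ K[i]).T @ EP[i] @ (A[i] - B[i] @ K[i])
             for i in range(N)]
    cost = sum(pi0[i] * np.trace(P[i] @ Sigma0) for i in range(N))
    return cost, P
```

If $K \notin \mathcal{K}$, the iteration diverges, which mirrors the fact that the cost is infinite outside the feasible set.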
We also make the following technical assumptions.
Assumptions.
Along with the standard assumption that $Q_i$ and $R_i$ are positive definite for all $i \in \Omega$, we assume that $\pi_i > 0$ for all $i \in \Omega$ and that $\mathbb{E}_{x_0 \sim \mathcal{D}}\big[x_0 x_0^T\big]$ is full rank. This indicates that there is a chance of starting from any mode $i$, and that the expected covariance of the initial state is full rank.
Now we are ready to present an explicit formula for the policy gradient $\nabla C(K)$.
Lemma 1. Suppose $K \in \mathcal{K}$. Then $C(K)$ is differentiable, and its gradient with respect to each $K_i$ is given by

$$\nabla_{K_i} C(K) = 2\big[\big(R_i + B_i^T \mathcal{E}_i(P^K) B_i\big) K_i - B_i^T \mathcal{E}_i(P^K) A_i\big] \Sigma_i^K,$$

where $\Sigma_i^K := \sum_{t=0}^{\infty} \mathbb{E}\big[x_t x_t^T \mathbf{1}_{\{\omega(t) = i\}}\big]$.
Proof.
The differentiability of $C(K)$ can be proved using the implicit function theorem; this step is similar to the proof of Lemma 3.1 in [35]. Now we derive the gradient formula by modifying the total derivative arguments in [35, 34]. Taking the total derivative of (9) shows, for each $i \in \Omega$, that $dP_i^K$ satisfies a coupled Lyapunov equation driven by $dK_i$. Hence, the total derivative of the cost (8) can be expressed in terms of $dK_i$ and $\Sigma_i^K$, and collecting terms leads to the desired result. ∎
Optimization Landscape for MJLS. LTI systems are a special case of MJLS. Since policy optimization for quadratic control of LTI systems is nonconvex, the same is true for the MJLS case. However, from our gradient formula in Lemma 1, we can see that as long as $\Sigma_i^K$ is full rank and $\pi_i > 0$ for all $i \in \Omega$, any stationary point $\nabla C(K) = 0$ must satisfy

$$K_i = \big(R_i + B_i^T \mathcal{E}_i(P^K) B_i\big)^{-1} B_i^T \mathcal{E}_i(P^K) A_i, \quad \forall i \in \Omega.$$

Substituting the above equation into the coupled Lyapunov equations (9) leads to the global solution defined by the coupled algebraic Riccati equations (5). Therefore, the only stationary point is the global optimal solution. Overall, the optimization landscape for the MJLS case is quite similar to that of the classic LQR case if we allow the initial mode to be sufficiently random, i.e. $\pi_i > 0$ for all $i \in \Omega$. Based on such similarity, it is reasonable to expect that local search procedures (e.g. policy gradient) will be able to find the unique global minimum for MJLS despite the nonconvex nature of the problem. Compared with the LTI case, the characterization of the feasible set $\mathcal{K}$ is more complicated for MJLS. Hence the main technical issue is how to show that gradient-based methods can handle the feasibility constraint without using projection. We will use a Lyapunov argument to tackle this issue.
IV. Main Convergence Results
As reviewed in Section II-C, the natural policy gradient method for the LTI case iterates as $K' = K - \eta \nabla C(K) \Sigma_K^{-1}$. For the MJLS case, the natural policy gradient method adopts a similar mode-wise update rule and iterates as

$$K_i^{n+1} = K_i^n - \eta \, \nabla_{K_i} C(K^n) \, \big(\Sigma_i^{K^n}\big)^{-1}, \quad \forall i \in \Omega. \qquad (13)$$

The initial policy is denoted by $K^0$. The Gauss-Newton method uses the following update rule:

$$K_i^{n+1} = K_i^n - \eta \, \big(R_i + B_i^T \mathcal{E}_i(P^{K^n}) B_i\big)^{-1} \nabla_{K_i} C(K^n) \, \big(\Sigma_i^{K^n}\big)^{-1}, \quad \forall i \in \Omega, \qquad (14)$$

where $P^{K^n}$ denotes the solution of the coupled Lyapunov equations (9) for $K = K^n$. In this section, we focus on the convergence guarantees of (13) and (14), and show that both converge to the global optimal solution at a linear rate if they are initialized at a policy in $\mathcal{K}$.
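The two updates (13) and (14) can be sketched as follows. One convenience, used implicitly in both methods, is that $\nabla_{K_i} C(K) (\Sigma_i^K)^{-1} = 2 E_i^K$ with $E_i^K = (R_i + B_i^T \mathcal{E}_i(P^K) B_i) K_i - B_i^T \mathcal{E}_i(P^K) A_i$, so the state correlation matrices never need to be formed. This is an illustrative sketch, not the paper's implementation; all test data are made up.

```python
import numpy as np

def coupled_lyap(A, B, Q, R, K, P_trans, iters=2000):
    """Fixed-point solve of the coupled Lyapunov equations (9)."""
    N, d = len(A), A[0].shape[0]
    P = [np.zeros((d, d)) for _ in range(N)]
    for _ in range(iters):
        EP = [sum(P_trans[i][j] * P[j] for j in range(N)) for i in range(N)]
        P = [Q[i] + K[i].T @ R[i] @ K[i]
             + (A[i] - B[i] @ K[i]).T @ EP[i] @ (A[i] - B[i] @ K[i])
             for i in range(N)]
    return P

def policy_step(A, B, Q, R, K, P_trans, eta, method="natural"):
    """One step of (13) (natural policy gradient) or (14) (Gauss-Newton)."""
    N = len(A)
    P = coupled_lyap(A, B, Q, R, K, P_trans)
    EP = [sum(P_trans[i][j] * P[j] for j in range(N)) for i in range(N)]
    E = [(R[i] + B[i].T @ EP[i] @ B[i]) @ K[i] - B[i].T @ EP[i] @ A[i]
         for i in range(N)]
    if method == "natural":   # K_i <- K_i - 2*eta*E_i
        return [K[i] - 2 * eta * E[i] for i in range(N)]
    # Gauss-Newton: K_i <- K_i - 2*eta*(R_i + B_i^T E_i(P) B_i)^{-1} E_i
    return [K[i] - 2 * eta * np.linalg.solve(R[i] + B[i].T @ EP[i] @ B[i],
                                             E[i]) for i in range(N)]
```

With $\eta = 1/2$, the Gauss-Newton update reduces to mode-wise policy iteration, which is consistent with the LTI case.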
To state the main convergence result, some additional problem-dependent notation built from the system data and the initial cost is needed.
Theorem 2. Suppose $K^0 \in \mathcal{K}$. Then, for suitably chosen constant step sizes $\eta$, the Gauss-Newton iterates (14) and the natural policy gradient iterates (13) remain in $\mathcal{K}$ and converge to the global optimal solution at a linear rate.
Proof Sketch.
We briefly outline the main proof steps for the Gauss-Newton case. The proof for the natural policy gradient case is similar. The detailed proofs are presented in the appendix.

1) Show that the one-step progress of the Gauss-Newton method gives a policy stabilizing the closed-loop dynamics and yielding a finite cost.

2) Apply the so-called "almost smoothness" condition to show that the cost associated with the one-step progress of the Gauss-Newton method decreases at a linear rate.

3) Use induction to show the final convergence result.
It is worth noting that the proof steps for the MJLS and LTI cases are quite similar. We can simply modify the proof arguments for the LTI case in [11] to finish the second and third steps. The main challenge in the MJLS case is the first step, since one cannot directly adapt the spectral radius argument of [11] due to the stochastic nature of MJLS. We develop a novel Lyapunov argument to address this issue. We will only present the details of the first step here, since that is the only part requiring new proof techniques. The remaining steps of the proof for both cases are deferred to the appendix.
How do the policy optimization methods ensure a finite cost along the iteration path? We need to show that for every $K \in \mathcal{K}$, we can choose a step size $\eta$ such that the new controller $K'$ obtained by a one-step update of the Gauss-Newton method or the natural policy gradient method also stabilizes the closed-loop dynamics in the mean-square sense. This step constitutes the main technical novelty compared with the argument for the LTI case in [11]. Notice that the "almost smoothness" condition is required in the second step of the proof outline, as it gives a useful upper bound on the cost difference $C(K') - C(K)$. However, to apply such a condition, one needs to ensure that both $K$ and $K'$ are stabilizing controllers, so that $C(K)$ and $C(K')$ are both finite in the first place. Hence, one has to prove that the iterates never wander into the region of instability before applying the "almost smoothness" condition.
To show that every controller computed by the Gauss-Newton method or the natural policy gradient method is stabilizing, we propose the following Lyapunov argument. The main idea is that the value function at the current step naturally serves as a Lyapunov function for the next move, due to the positive definiteness of the matrices $\{P_i^K\}$. This positive definiteness guarantees that there is a stability margin around every point along the optimization trajectory. The result for the Gauss-Newton case is formally stated below.
Lemma 3. Suppose $K \in \mathcal{K}$ and let $K'$ be obtained from $K$ by the one-step Gauss-Newton update (14). If the step size satisfies $0 \le \eta \le 1/2$, then $K' \in \mathcal{K}$.
Proof.
Recall from [36] that the controller $K'$ stabilizes (1) in the mean-square sense if and only if there exist matrices $X_i > 0$, $i \in \Omega$, such that

$$\big(A_i - B_i K_i'\big)^T \mathcal{E}_i(X) \big(A_i - B_i K_i'\big) - X_i < 0, \quad \forall i \in \Omega. \qquad (17)$$

We will show that the above condition can be satisfied by setting $X_i = P_i^K$, where $P^K$ solves the coupled MJLS Lyapunov equations (9). Notice that the existence of $P^K$ is guaranteed by the assumption $K \in \mathcal{K}$. Denote $\Delta_i := K_i' - K_i$. The Lyapunov equation (9) for $P^K$ can be rewritten in terms of the closed-loop matrices $A_i - B_i K_i$, and from this we can directly obtain an expression for the left-hand side of (17) with $X = P^K$.

Since $Q_i$ and $R_i$ are all positive definite, the sum of the first two terms on the right-hand side is negative definite. We only need the last two terms to be negative semidefinite. Substituting the Gauss-Newton update for $\Delta_i$, we obtain a quantity

which is positive semidefinite under the condition $\eta \le 1/2$. ∎
For the natural gradient method, we have a similar result.
Lemma 4. Suppose $K \in \mathcal{K}$ and let $K'$ be obtained from $K$ by the one-step natural policy gradient update (13). Then there exists a step size bound $\bar{\eta} > 0$, depending only on the problem data and the current policy $K$, such that $K' \in \mathcal{K}$ whenever $0 \le \eta \le \bar{\eta}$.
Proof.
The proof starts with the same steps as the proof of Lemma 3. We will show that condition (17) can be met by setting $X_i = P_i^K$, where $P^K$ solves the coupled MJLS Lyapunov equations (9) associated with the controller $K$. For the natural policy gradient method, we have $K_i' - K_i = -\eta \nabla_{K_i} C(K) (\Sigma_i^K)^{-1} = -2\eta E_i^K$, where $E_i^K := (R_i + B_i^T \mathcal{E}_i(P^K) B_i) K_i - B_i^T \mathcal{E}_i(P^K) A_i$. A direct calculation then shows that the last two terms are negative semidefinite provided that $\eta$ satisfies an explicit upper bound depending only on $R_i$, $B_i$, and $\mathcal{E}_i(P^K)$. Lastly, noticing that this bound can be chosen uniformly over all $i \in \Omega$ leads to the desired conclusion. ∎
From the above proof, we can clearly see that $P^K$ can be used to construct a Lyapunov function for $K'$ if the step size $\eta$ satisfies an appropriate bound. This leads to a novel proof of stability along the natural policy gradient iteration path. The idea even extends beyond the linear quadratic control case. Very recently, a similar idea has been used to show the convergence properties of policy optimization methods for the mixed $\mathcal{H}_2/\mathcal{H}_\infty$ control problem, where the cost function may not blow up to infinity on the boundary of the feasible set [37].
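The mean-square stability notion used throughout this section can also be checked spectrally: the second moments of the closed loop evolve linearly, so mean-square stability is equivalent to a spectral radius condition on a stacked $N d^2 \times N d^2$ matrix (this is the standard criterion associated with [36]). The sketch below implements that check on illustrative data.

```python
import numpy as np

def ms_stable(Acl, P_trans):
    """Mean-square stability test for x_{t+1} = Acl[w(t)] x_t.
    The second moments X_j(t) = E[x_t x_t^T 1{w(t)=j}] evolve as
    vec(X_j(t+1)) = sum_i p_ij (Acl_i kron Acl_i) vec(X_i(t)),
    so MSS holds iff the stacked linear map has spectral radius < 1."""
    N, d = len(Acl), Acl[0].shape[0]
    M = np.zeros((N * d * d, N * d * d))
    for i in range(N):
        for j in range(N):
            M[j*d*d:(j+1)*d*d, i*d*d:(i+1)*d*d] = \
                P_trans[i][j] * np.kron(Acl[i], Acl[i])
    return bool(max(abs(np.linalg.eigvals(M))) < 1)
```

Note that an individually unstable mode does not necessarily destroy mean-square stability: if the chain quickly leaves the bad mode, the stacked second-moment map can still be a contraction.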
V. Simulation Results
Consider a system with 100 states, 20 inputs, and 100 modes. The system matrices $\{A_i\}$ and $\{B_i\}$ were generated using drss in MATLAB in order to guarantee that the system would have a finite cost under the chosen initial policy. The probability transition matrix was sampled from a Dirichlet process. We also assumed equal probability of starting in any initial mode. For simplicity, we used fixed weight matrices $Q_i$ and $R_i$ for all $i \in \Omega$.
In Figure 1, we can see that both policy optimization methods converge to the optimal solution. As expected, Gauss-Newton converges much faster than the natural policy gradient method. The step size of the natural policy gradient method depends on various system parameters and requires some tuning effort for each problem instance.
VI. Conclusion
In this paper, we have studied policy optimization for the quadratic control of Markovian jump linear systems. We developed an exact formula for the gradient of the cost with respect to a given policy, and presented convergence guarantees for the Gauss-Newton method and the natural policy gradient method. The results rely on a novel Lyapunov argument that proves the stability of the iterates along the optimization trajectories.
The results obtained further suggest that one could use model-free methods, such as zeroth-order optimization or the REINFORCE algorithm, to learn the optimal controller from data. Such model-free techniques would allow us to control unknown MJLS without dealing with system identification. This would be particularly useful for large-scale systems, where the computational complexity grows as the system size increases. We will work on such extensions in the future.
References
 [1] R. Sutton and A. Barto, Reinforcement learning: An introduction. MIT press, 2018.

[2] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," in International Conference on Learning Representations, 2015.  [3] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
 [4] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International Conference on Machine Learning, 2016, pp. 1329–1338.
 [5] S. Kakade, “A natural policy gradient,” in Advances in neural information processing systems, 2002, pp. 1531–1538.
 [6] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
 [7] J. Peters and S. Schaal, "Natural actor-critic," Neurocomputing, vol. 71, no. 7–9, pp. 1180–1190, 2008.
 [8] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[9] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep reinforcement learning that matters," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.  [10] A. Rajeswaran, K. Lowrey, E. Todorov, and S. Kakade, "Towards generalization and simplicity in continuous control," in Advances in Neural Information Processing Systems, 2017, pp. 6550–6561.
 [11] M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80, 2018, pp. 1467–1476.
 [12] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,” Foundations of Computational Mathematics, pp. 1–47.
 [13] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. Bartlett, and M. Wainwright, "Derivative-free methods for policy optimization: Guarantees for linear quadratic systems," arXiv preprint arXiv:1812.08305, 2018.
 [14] S. Tu and B. Recht, "Least-squares temporal difference learning for the linear quadratic regulator," in International Conference on Machine Learning, 2018, pp. 5005–5014.
 [15] ——, "The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint," arXiv preprint arXiv:1812.03565, 2018.
 [16] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “Regret bounds for robust adaptive control of the linear quadratic regulator,” in Advances in Neural Information Processing Systems, 2018, pp. 4188–4197.
 [17] Y. Abbasi-Yadkori and C. Szepesvári, "Regret bounds for the adaptive control of linear quadratic systems," in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.
 [18] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvári, "Regret bounds for model-free linear quadratic control," arXiv preprint arXiv:1804.06021, 2018.
 [19] K. Krauth, S. Tu, and B. Recht, "Finite-time analysis of approximate policy iteration for the linear quadratic regulator," in Advances in Neural Information Processing Systems, 2019, pp. 8512–8522.
 [20] H. Mohammadi, A. Zare, M. Soltanolkotabi, and M. Jovanović, "Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem," arXiv preprint arXiv:1912.11899, 2019.
 [21] H. Mohammadi, A. Zare, M. Soltanolkotabi, and M. Jovanovic, “Global exponential convergence of gradient methods over the nonconvex landscape of the linear quadratic regulator,” in 2019 IEEE 58th Conference on Decision and Control (CDC), 2019.
 [22] O. Costa, M. Fragoso, and R. Marques, Discrete-Time Markov Jump Linear Systems. Springer London, 2006.
 [23] Y. Bar-Shalom and X. Li, Estimation and Tracking: Principles, Techniques, and Software. Norwood, MA: Artech House, 1993.
 [24] E. Fox, E. Sudderth, M. Jordan, and A. Willsky, "Bayesian nonparametric inference of switching dynamic linear models," IEEE Transactions on Signal Processing, vol. 59, no. 4, pp. 1569–1585, 2011.
 [25] K. Gopalakrishnan, H. Balakrishnan, and R. Jordan, “Stability of networked systems with switching topologies,” in IEEE Conference on Decision and Control, 2016, pp. 1889–1897.
 [26] V. Pavlovic, J. Rehg, and J. MacCormick, “Learning switching linear models of human motion,” in Advances in Neural Information Processing Systems, 2000.
 [27] D. Sworder and J. Boyd, Estimation problems in hybrid systems. Cambridge University Press, 1999.
 [28] A. N. Vargas, E. F. Costa, and J. B. R. do Val, "On the control of Markov jump linear systems with no mode observation: Application to a DC motor device," International Journal of Robust and Nonlinear Control, vol. 23, no. 10, pp. 1136–1150, 2013.
 [29] B. Hu, P. Seiler, and A. Rantzer, “A unified analysis of stochastic optimization methods using jump system theory and quadratic constraints,” in Conference on Learning Theory, 2017, pp. 1157–1189.
 [30] B. Hu and U. Syed, “Characterizing the exact behaviors of temporal difference learning algorithms using markov jump linear system theory,” in Advances in Neural Information Processing Systems, 2019, pp. 8477–8488.
 [31] M. Fragoso, "Discrete-time jump LQG problem," International Journal of Systems Science, vol. 20, no. 12, pp. 2539–2545, 1989.
 [32] S. Bittanti, P. Colaneri, and G. De Nicolao, The Periodic Riccati Equation. Springer Berlin Heidelberg, 1991, pp. 127–162.
 [33] J. J. Hench and A. J. Laub, "Numerical solution of the discrete-time periodic Riccati equation," IEEE Transactions on Automatic Control, vol. 39, no. 6, pp. 1197–1210, 1994.
 [34] K. Mårtensson and A. Rantzer, "Gradient methods for iterative distributed control synthesis," in Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with the 2009 28th Chinese Control Conference, 2009, pp. 549–554.
 [35] T. Rautert and E. Sachs, “Computational design of optimal output feedback controllers,” SIAM Journal on Optimization, vol. 7, no. 3, pp. 837–852, 1997.
 [36] O. Costa and M. Fragoso, "Stability results for discrete-time linear systems with Markovian jumping parameters," Journal of Mathematical Analysis and Applications, vol. 179, no. 1, pp. 154–178, 1993.
 [37] K. Zhang, B. Hu, and T. Başar, “Policy optimization for linear control with robustness guarantee: Implicit regularization and global convergence,” arXiv preprint arXiv:1910.09496, 2019.
Appendix
This appendix provides the detailed proofs of the convergence rate results presented in this paper. We first prove a few helper lemmas. Then we provide upper bounds on the cost associated with the one-step progress. Lastly, we show that both algorithms converge to the optimal policy. Most steps mimic their LTI counterparts.
Lemma 5 ("Almost smoothness"). Suppose $K \in \mathcal{K}$ and $K' \in \mathcal{K}$, and let $\Delta_i := K_i' - K_i$. Then

$$C(K') - C(K) = \sum_{i \in \Omega} \mathrm{tr}\Big(\Sigma_i^{K'} \big(2 \Delta_i^T E_i^K + \Delta_i^T \big(R_i + B_i^T \mathcal{E}_i(P^K) B_i\big) \Delta_i\big)\Big),$$

where $E_i^K := \big(R_i + B_i^T \mathcal{E}_i(P^K) B_i\big) K_i - B_i^T \mathcal{E}_i(P^K) A_i$.
Proof.
Next, we show that $C(K)$ is gradient dominated. Recall that we have $\nabla_{K_i} C(K) = 2 E_i^K \Sigma_i^K$, where $E_i^K := \big(R_i + B_i^T \mathcal{E}_i(P^K) B_i\big) K_i - B_i^T \mathcal{E}_i(P^K) A_i$.
Lemma 6 (Gradient Domination).
Suppose $K \in \mathcal{K}$, and let $K^*$ be the optimal policy. Given the definitions in Lemma 1, the following sequence of inequalities always holds: