TRPO
Introduction
The Trust Region Policy Optimization (TRPO) algorithm[1], introduced by Schulman et al. in 2015, addresses critical limitations of policy gradient methods in reinforcement learning by incorporating a trust region constraint to enable stable learning of complex policies.
Main Idea
In TRPO, an objective function is maximized subject to a constraint on the size of the policy update. Specifically,
\[\begin{aligned} &\underset{\theta}{\rm maximize}\ \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\rm old}}(a_t|s_t)}\hat{A}_t\right]\\ &{\rm subject\ to}\ \hat{\mathbb{E}}_t\left[{\rm KL}[\pi_{\theta_{\rm old}}(\cdot|s_t),\pi_\theta(\cdot|s_t)]\right]\leq\delta \end{aligned}\]Here, $\theta_{\rm old}$ is the vector of policy parameters before the update. This problem can be solved approximately and efficiently with the conjugate gradient algorithm, after making a linear approximation to the objective and a quadratic approximation to the constraint.
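As a concrete illustration, below is a minimal NumPy sketch of Monte-Carlo estimates for the two quantities above: the surrogate objective and the mean KL divergence. The array names (`logp_new`, `logp_old`, `advantages`, and the per-action log-probability matrices) are assumptions of this example, and the KL estimator assumes a discrete action space.

```python
import numpy as np

def surrogate_objective(logp_new, logp_old, advantages):
    """Estimate E_t[ pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) * A_hat_t ].

    logp_new, logp_old: log-probabilities of the sampled actions under the new
    and old policies, shape (T,); advantages: estimates A_hat_t, shape (T,).
    """
    ratio = np.exp(logp_new - logp_old)  # importance ratio pi_theta / pi_theta_old
    return np.mean(ratio * advantages)

def mean_kl(logp_old_full, logp_new_full):
    """Estimate E_t[ KL(pi_theta_old(.|s_t), pi_theta(.|s_t)) ] for a discrete
    action space, from log-probabilities over all actions, shape (T, n_actions)."""
    p_old = np.exp(logp_old_full)
    return np.mean(np.sum(p_old * (logp_old_full - logp_new_full), axis=1))
```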
From a theoretical perspective, the analysis behind TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem
\[\underset{\theta}{\rm maximize}\ \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\rm old}}(a_t|s_t)}\hat{A}_t-\beta\ {\rm KL}\left[\pi_{\theta_{\rm old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right]\]for some coefficient $\beta$. TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of $\beta$ that performs well across different problems, or even within a single problem, where the characteristics change over the course of learning. Hence, to achieve the goal of a first-order algorithm that emulates the monotonic improvement of TRPO, experiments show that it is not sufficient to simply choose a fixed penalty coefficient $\beta$ and optimize the penalized objective with SGD; additional modifications are required.
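For comparison, here is a minimal sketch of the penalized objective with a fixed coefficient $\beta$ (the quantity the paragraph above argues is hard to tune). The argument names and the precomputed `kl_estimate` are assumptions of this example, following the conventions of the sketch above.

```python
import numpy as np

def penalized_objective(logp_new, logp_old, advantages, kl_estimate, beta):
    """KL-penalized surrogate: E_t[ratio * A_hat_t] - beta * KL(pi_old, pi_new).

    kl_estimate is a precomputed estimate of the mean KL term (e.g. from the
    mean_kl sketch above); beta is the fixed penalty coefficient.
    """
    ratio = np.exp(logp_new - logp_old)
    return np.mean(ratio * advantages) - beta * kl_estimate
```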
Implementation
Below we introduce an efficient way to solve the TRPO optimization problem:
\[{\rm maximize}\ L(\theta)\quad {\rm subject\ to}\ \overline{D}_{\rm KL}(\theta_{\rm old}, \theta)\leq\delta.\]The method involves two steps:
- compute a search direction, using a linear approximation to the objective and a quadratic approximation to the constraint
- perform a line search in that direction, ensuring that we improve the nonlinear objective while satisfying the nonlinear constraint.
To enforce the KL divergence constraint \(\overline{D}_{\rm KL}(\theta_{\rm old},\theta)\leq\delta\), we solve $Ax = g$, where $A$ is the Fisher information matrix (the Hessian of the quadratic approximation to the KL divergence) and $g$ is the gradient of the surrogate objective. Using the conjugate gradient method, we approximate the search direction $s\approx A^{-1}g$. Since a step of size $\beta$ along $s$ changes the quadratic approximation of the KL divergence by $\frac{1}{2}\beta^2 s^TAs$, setting this equal to $\delta$ gives the maximal step size $\beta=\sqrt{2\delta/(s^TAs)}$. Finally, a backtracking line search is performed starting from the full step $\theta_{\rm old}+\beta s$, shrinking the step until the surrogate objective improves and the KL constraint is satisfied.
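Here is a minimal sketch of this update step, assuming the caller supplies the conjugate gradient direction `s` (see the CG sketch in the next section), a Fisher-vector product callable `fvp`, and evaluators `surrogate(theta)` and `mean_kl(theta)`; all of these names are placeholders rather than part of the original paper.

```python
import numpy as np

def trpo_update(theta_old, s, fvp, surrogate, mean_kl, delta=0.01,
                backtracks=10, decay=0.5):
    """Step size and backtracking line search along the conjugate gradient direction.

    theta_old : flat parameter vector before the update
    s         : approximate search direction A^{-1} g from conjugate gradient
    fvp       : callable v -> A v (Fisher-vector product)
    surrogate, mean_kl : callables theta -> scalar, user-supplied evaluators
    """
    beta = np.sqrt(2.0 * delta / (s @ fvp(s)))  # maximal step size from the quadratic constraint
    f_old = surrogate(theta_old)
    step = beta * s
    for _ in range(backtracks):
        theta_new = theta_old + step
        # accept only if the surrogate improves and the exact KL constraint holds
        if surrogate(theta_new) > f_old and mean_kl(theta_new) <= delta:
            return theta_new
        step *= decay  # shrink the step and try again
    return theta_old  # no acceptable step found; keep the old parameters
```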
Conjugate Gradient Method (CG Method)
To solve the equation
\[Ax =b\tag{1}\]where
- $A\in\mathbb{R}^{n\times n}$ is a symmetric positive-definite matrix
- $x\in\mathbb{R}^n$ is the unknown vector
- $b\in\mathbb{R}^n$ is a constant vector
We recast equation $(1)$ as a quadratic minimization problem:
\[\min_x\ f(x) = \frac{1}{2}x^TAx-b^Tx\tag{2}\]and the gradient is $\nabla f(x) = Ax -b$. When $\nabla f(x)=0$, we have $Ax=b$.
Tip: the implementation proceeds as follows; a runnable NumPy sketch is given after the iteration steps below.
Initialization:
\[x_0, r_0 = b - A x_0, p_0 = r_0\]Iteration:
- $\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}$
- $x_{k+1} = x_k + \alpha_k p_k$
- $r_{k+1} = r_k - \alpha_k A p_k$
- Check $\|r_{k+1}\|^2 < \epsilon$; if yes, stop
- $\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$
- $p_{k+1} = r_{k+1} + \beta_k p_k$
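Below is a minimal NumPy sketch of the iteration above for an explicit matrix $A$; in TRPO, the product $Ap_k$ would instead be supplied by a Fisher-vector product so that $A$ never has to be formed.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, eps=1e-10, max_iters=None):
    """Solve A x = b for a symmetric positive-definite A, following the steps above."""
    x = np.zeros_like(b, dtype=float) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x                   # r_0 = b - A x_0
    p = r.copy()                    # p_0 = r_0
    rr = r @ r
    for _ in range(max_iters or len(b)):
        Ap = A @ p
        alpha = rr / (p @ Ap)       # alpha_k = r_k^T r_k / (p_k^T A p_k)
        x = x + alpha * p           # x_{k+1} = x_k + alpha_k p_k
        r = r - alpha * Ap          # r_{k+1} = r_k - alpha_k A p_k
        rr_new = r @ r
        if rr_new < eps:            # stop once ||r_{k+1}||^2 < eps
            break
        p = r + (rr_new / rr) * p   # p_{k+1} = r_{k+1} + beta_k p_k
        rr = rr_new
    return x

# Quick check on a small symmetric positive-definite system:
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(np.allclose(A @ conjugate_gradient(A, b), b))  # True
```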
References
- Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. arXiv preprint arXiv:1502.05477.