TRPO

Introduction

The Trust Region Policy Optimization (TRPO) algorithm[1], introduced in 2015, addresses critical limitations of policy gradient methods in reinforcement learning by incorporating a trust region constraint that enables stable learning of complex policies.


Main Idea

In TRPO, an objective function is maximized subject to a constraint on the size of the policy update. Specifically,

\[\begin{aligned} &\underset{\theta}{\rm maximize}\ \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\rm old}}(a_t|s_t)}\hat{A}_t\right]\\ &{\rm subject\ to}\ \hat{\mathbb{E}}_t\left[{\rm KL}[\pi_{\theta_{\rm old}}(\cdot|s_t),\pi_\theta(\cdot|s_t)]\right]\leq\delta \end{aligned}\]

Here, $\theta_{\rm old}$ is the vector of policy parameters before the update and $\hat{A}_t$ is an estimator of the advantage function at timestep $t$. This problem can be solved approximately and efficiently with the conjugate gradient algorithm, after making a linear approximation to the objective and a quadratic approximation to the constraint.
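For concreteness, here is a minimal PyTorch-style sketch of the two quantities appearing in this problem: the surrogate objective and the average KL divergence. The function names, and the assumption that log-probabilities, advantages, and the policy distributions are already available as tensors, are illustrative rather than part of the original formulation.

```python
import torch
from torch.distributions import kl_divergence

def surrogate_objective(new_logp, old_logp, advantages):
    """E_t[ pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) * A_hat_t ], with the
    probability ratio computed from log-probabilities for numerical stability."""
    ratio = torch.exp(new_logp - old_logp)
    return (ratio * advantages).mean()

def mean_kl(old_dist, new_dist):
    """E_t[ KL(pi_theta_old(.|s_t), pi_theta(.|s_t)) ] averaged over a batch of states."""
    return kl_divergence(old_dist, new_dist).mean()
```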

From a theoretical perspective, the analysis behind TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem

\[\underset{\theta}{\rm maximize}\ \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\rm old}}(a_t|s_t)}\hat{A}_t-\beta\ {\rm KL}\left[\pi_{\theta_{\rm old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right]\]

for some coefficient $\beta$. TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of $\beta$ that performs well across different problems, or even within a single problem, where the characteristics change over the course of learning. Hence, to achieve the goal of a first-order algorithm that emulates the monotonic improvement of TRPO, experiments show that it is not sufficient to simply choose a fixed penalty coefficient $\beta$ and optimize the penalized objective with SGD; additional modifications are required.
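As a sketch, the penalized variant differs only in where the KL term enters; reusing the hypothetical helpers above, a fixed coefficient `beta` would be applied as follows, and it is exactly this single fixed value that is hard to choose well.

```python
def penalized_objective(new_logp, old_logp, advantages, old_dist, new_dist, beta):
    """Surrogate objective minus a fixed KL penalty: E[ratio * A_hat] - beta * E[KL]."""
    return (surrogate_objective(new_logp, old_logp, advantages)
            - beta * mean_kl(old_dist, new_dist))
```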


Implementation

Below, we describe an efficient way to solve the TRPO optimization problem:

\[{\rm maximize}\ L(\theta)\quad {\rm subject\ to}\ \overline{D}_{\rm KL}(\theta_{\rm old}, \theta)\leq\delta.\]

The method involves two steps:

  • compute a search direction, using a linear approximation to the objective and a quadratic approximation to the constraint
  • perform a line search in that direction, ensuring that we improve the nonlinear objective while satisfying the nonlinear constraint.

To enforce the KL divergence constraint \(\overline{D}_{\rm KL}(\theta_{\rm old},\theta)\leq\delta\), we first compute the search direction by solving $As = g$, where $A$ is the Fisher information matrix (the Hessian of the KL divergence at $\theta_{\rm old}$) and $g$ is the gradient of the surrogate objective. Using the conjugate gradient method, we approximate $s\approx A^{-1}g$ without forming $A$ explicitly. The maximal step size is then computed as $\beta=\sqrt{2\delta/(s^TAs)}$. Finally, a line search is performed along $\theta_{\rm old}+\beta s$, shrinking the step if necessary, to ensure improvement of the surrogate objective and satisfaction of the constraint.
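A rough NumPy sketch of this two-step update is given below. It assumes a `conjugate_gradient` routine like the one sketched in the next section, a `fisher_vector_product(v)` callback returning $Av$ without forming $A$, a `loss_and_kl(theta)` callback returning the surrogate objective and average KL at flattened parameters `theta`, and illustrative defaults for $\delta$ and the backtracking schedule; none of these names come from the paper itself.

```python
import numpy as np

def trpo_update(theta, g, fisher_vector_product, loss_and_kl,
                delta=0.01, backtrack_coef=0.5, max_backtracks=10):
    """One TRPO update: conjugate gradient for the search direction,
    then a backtracking line search on the step size."""
    s = conjugate_gradient(fisher_vector_product, g)        # s ~ A^{-1} g
    max_step = np.sqrt(2.0 * delta / (s @ fisher_vector_product(s)))
    old_loss, _ = loss_and_kl(theta)
    for i in range(max_backtracks):
        theta_new = theta + (backtrack_coef ** i) * max_step * s
        new_loss, kl = loss_and_kl(theta_new)
        # Accept the first step that improves the surrogate objective
        # (which we are maximizing) and satisfies the KL constraint.
        if new_loss > old_loss and kl <= delta:
            return theta_new
    return theta  # no acceptable step found; keep the old parameters
```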


Conjugate Gradient Method (CG Method)

To solve equation

\[Ax =b\tag{1}\]

where

  • $A\in\mathbb{R}^{n\times n}$ is a symmetric positive-definite matrix
  • $x\in\mathbb{R}^n$ is the unknown vector
  • $b\in\mathbb{R}^n$ is a constant vector

We transform equation $(1)$ into a quadratic function minimization problem:

\[\min_x\ f(x) = \frac{1}{2}x^TAx-b^Tx\tag{2}\]

and the gradient is $\nabla f(x) = Ax - b$. Since $A$ is positive definite, $f$ is strictly convex, so setting $\nabla f(x)=0$, i.e., minimizing $f$, recovers the solution of $Ax=b$.
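A tiny numerical check of this equivalence (purely illustrative): for a symmetric positive-definite $A$, the direct solution of $Ax=b$ is exactly the point where the gradient of $f$ vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)        # symmetric positive definite
b = rng.standard_normal(4)

x_star = np.linalg.solve(A, b)     # solve A x = b directly
grad = A @ x_star - b              # gradient of f(x) = 0.5 x^T A x - b^T x at x_star
print(np.allclose(grad, 0.0))      # True: the gradient vanishes at the solution
```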

The algorithm proceeds as follows:

Initialization:

Choose an initial guess $x_0$ and set \[r_0 = b - A x_0,\qquad p_0 = r_0.\]

Iteration:

  1. $\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}$

  2. $x_{k+1} = x_k + \alpha_k p_k$

  3. $r_{k+1} = r_k - \alpha_k A p_k$

  4. If $\|r_{k+1}\|^2 < \epsilon$, stop

  5. $\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$

  6. $p_{k+1} = r_{k+1} + \beta_k p_k$, then return to step 1 with $k\leftarrow k+1$
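The iteration above translates directly into a short NumPy routine. Here $A$ is accessed only through a matrix-vector product callback `A_dot`, mirroring how TRPO uses Fisher-vector products instead of building the Fisher matrix explicitly; the interface and the default tolerance are illustrative choices.

```python
import numpy as np

def conjugate_gradient(A_dot, b, x0=None, max_iters=100, eps=1e-10):
    """Solve A x = b for symmetric positive-definite A, where A_dot(v) returns A @ v."""
    x = np.zeros_like(b) if x0 is None else np.array(x0, dtype=float)
    r = b - A_dot(x)                  # r_0 = b - A x_0
    p = r.copy()                      # p_0 = r_0
    rr = r @ r
    for _ in range(max_iters):
        Ap = A_dot(p)
        alpha = rr / (p @ Ap)         # step length along p_k
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r
        if rr_new < eps:              # stop once the squared residual is small
            break
        p = r + (rr_new / rr) * p     # beta_k = rr_new / rr
        rr = rr_new
    return x

# Usage (illustrative): compare against a direct solve on a small SPD system.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)
b = rng.standard_normal(5)
x_cg = conjugate_gradient(lambda v: A @ v, b)
print(np.allclose(x_cg, np.linalg.solve(A, b)))   # True
```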


References

  1. Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. arXiv preprint arXiv:1502.05477.
