Summary of Notation
Capital letters are used for random variables, whereas lower case letters are used for the values of random variables and for scalar functions. Quantities that are required to be real-valued vectors are written in bold and in lower case (even if random variables). Matrices are bold capitals.
In a Markov Decision Process:
Symbol | Meaning |
---|---|
$s, s'$ | states |
$a$ | an action |
$r$ | a reward |
$\mathcal{S}$ | set of all nonterminal states |
$\mathcal{S}^+$ | set of all states, including the terminal state |
$\mathcal{A}(s)$ | set of all actions available in state $s$ |
$\mathcal{R}$ | set of all possible rewards, a finite subset of $\mathbb{R}$ |
$\subset$ | subset of (e.g., $\mathcal{R}\subset\mathbb{R}$) |
$\in$ | is an element of; e.g., $s\in\mathcal{S}$, $r\in\mathcal{R}$ |
$\vert\mathcal{S}\vert$ | number of elements in set $\mathcal{S}$ |
$t$ | discrete time step |
$T, T(t)$ | final time step of an episode, or of the episode including time step $t$ |
$A_t$ | action at time $t$ |
$S_t$ | state at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$ |
$R_t$ | reward at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$ |
$\pi$ | policy (decision-making rule) |
$\pi(s)$ | action taken in state $s$ under deterministic policy $\pi$ |
$G_t$ | return following time $t$ (see the examples following this table) |
$h$ | horizon, the time step one looks up to in a forward view |
$G_{t:t+n}, G_{t:h}$ | $n$-step return from $t+1$ to $t+n$, or to $h$ (discounted and corrected) |
$\overline{G}_{t:h}$ | flat return (undiscounted and uncorrected) from $t+1$ to $h$ |
$G_t^\lambda$ | $\lambda$-return |
$G_{t:h}^\lambda$ | truncated, corrected $\lambda$-return |
$G_t^{\lambda s}, G_t^{\lambda a}$ | $\lambda$-return, corrected by estimated state, or action, values |
$p(s',r \mid s,a)$ | probability of transition to state $s'$ with reward $r$, from state $s$ and action $a$ |
$p(s' \mid s,a)$ | probability of transition to state $s'$, from state $s$ taking action $a$ |
$r(s,a)$ | expected immediate reward from state $s$ after action $a$ |
$r(s,a,s')$ | expected immediate reward on transition from $s$ to $s'$ under action $a$ |
$v_\pi(s)$ | value of state $s$ under policy $\pi$ (expected return) |
$v_*(s)$ | value of state $s$ under the optimal policy |
$q_\pi(s,a)$ | value of taking action $a$ in state $s$ under policy $\pi$ |
$q_*(s,a)$ | value of taking action $a$ in state $s$ under the optimal policy |
$V, V_t$ | array estimates of state-value function $v_\pi$ or $v_*$ |
$Q, Q_t$ | array estimates of action-value function $q_\pi$ or $q_*$ |
$\overline{V}_t(s)$ | expected approximate action value; for example, $\overline{V}_t(s)\doteq\sum_a\pi(a \mid s)Q_t(s,a)$ |
$U_t$ | target for estimate at time $t$ |
$\delta_t$ | temporal-difference (TD) error at $t$ (a random variable) |
$\delta_t^s, \delta_t^a$ | state- and action-specific forms of the TD error |
$n$ | in $n$-step methods, $n$ is the number of steps of bootstrapping |
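For reference, the return-related quantities and the TD error listed above are conventionally defined as sketched below (continuing-task forms). The discount rate $\gamma$ is assumed here and is not listed in this excerpt; $V$ denotes the current array estimate used for bootstrapping.

$$
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
\qquad
G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})
$$

$$
G_t^{\lambda} \doteq (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}
\qquad
\delta_t \doteq R_{t+1} + \gamma V(S_{t+1}) - V(S_t)
$$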
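A minimal Python sketch (not from the text) of how a few of these quantities can be computed from tabular estimates; the names `gamma`, `td_error`, `n_step_return`, and `expected_approx_value` are illustrative, not part of the notation.

```python
def td_error(V, s, reward, s_next, gamma=0.9):
    """One-step TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)."""
    return reward + gamma * V[s_next] - V[s]


def n_step_return(rewards, V, s_boot, gamma=0.9):
    """Discounted, corrected n-step return G_{t:t+n}: n discounted rewards
    plus gamma**n times the bootstrap estimate V(S_{t+n})."""
    g = sum(gamma ** k * r for k, r in enumerate(rewards))  # rewards = [R_{t+1}, ..., R_{t+n}]
    return g + gamma ** len(rewards) * V[s_boot]


def expected_approx_value(Q, pi, s):
    """Expected approximate value V-bar_t(s) = sum over a of pi(a|s) * Q_t(s, a)."""
    return sum(pi[s][a] * Q[s][a] for a in pi[s])


# Toy tabular estimates (the arrays V and Q above) and a stochastic policy pi.
V = {"s0": 0.0, "s1": 0.5, "s2": 1.0}
Q = {"s0": {"left": 0.2, "right": 0.8}}
pi = {"s0": {"left": 0.5, "right": 0.5}}

print(td_error(V, "s0", 1.0, "s1"))        # delta_t for one sampled transition
print(n_step_return([1.0, 0.0], V, "s2"))  # G_{t:t+2}, bootstrapping from s2
print(expected_approx_value(Q, pi, "s0"))  # V-bar_t(s0) = 0.5*0.2 + 0.5*0.8 = 0.5
```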