3.1 Bellman optimality equation

1. Action-value function

Previously, we learned the state-value function $v_{\pi}(s)$, which denotes the expectation of the cumulative reward starting from state $s\in\mathcal{S}$ when we follow policy $\pi$. In this lecture, we study the action-value function, which provides similar information to $v_{\pi}(s)$ but conditions on a state-action pair $(s,a)\in\mathcal{S}\times \mathcal{A}$. Both value functions measure the expected cumulative reward obtained under policy $\pi$, but while the state-value function depends only on the state, the action-value function assigns a value to each action taken in that state.

$$ q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_{t}|S_{t}=s,A_{t}=a] $$
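To make the definition concrete, here is a minimal Python sketch that estimates $q_{\pi}(s,a)$ by averaging sampled discounted returns that start with action $a$ in state $s$ and follow $\pi$ afterwards. The toy MDP and the names `P`, `pi`, `mc_q` are hypothetical illustrations, not part of the lecture.

```python
import random

gamma = 0.9

# Hypothetical toy MDP (not from the lecture): P[(s, a)] lists (prob, next_state, reward).
P = {
    ("s0", "left"):  [(1.0, "s0", 0.0)],
    ("s0", "right"): [(1.0, "s1", 1.0)],
    ("s1", "stay"):  [(1.0, "s1", 0.0)],
}
pi = {"s0": {"left": 0.5, "right": 0.5}, "s1": {"stay": 1.0}}  # pi(a|s)

def sample_transition(s, a):
    """Sample (next_state, reward) from P(s', r | s, a)."""
    outcomes = P[(s, a)]
    probs = [p for p, _, _ in outcomes]
    _, s_next, r = random.choices(outcomes, weights=probs)[0]
    return s_next, r

def sample_action(s):
    """Sample an action from pi(. | s)."""
    actions = list(pi[s])
    return random.choices(actions, weights=[pi[s][a] for a in actions])[0]

def mc_q(s, a, episodes=5000, horizon=100):
    """Monte Carlo estimate of q_pi(s, a): average the discounted return
    obtained by taking action a in state s and following pi afterwards."""
    total = 0.0
    for _ in range(episodes):
        g, discount, s_t, a_t = 0.0, 1.0, s, a
        for _ in range(horizon):
            s_t, r = sample_transition(s_t, a_t)
            g += discount * r
            discount *= gamma
            a_t = sample_action(s_t)
        total += g
    return total / episodes

print(mc_q("s0", "right"))  # close to 1.0: immediate reward 1, then 0 forever
print(mc_q("s0", "left"))   # close to gamma * v_pi(s0), roughly 0.82
```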

The two value functions are connected by the following formulas:

$$ v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)q_{\pi}(s,a),\quad q_{\pi}(s,a)=\sum_{s'\in\mathcal{S},r\in\mathcal{R}} [r +\gamma v_{\pi}(s')]\mathcal{P}(s',r|s,a) $$

The above relations can be understood through the diagrams

*(Backup diagrams for $v_{\pi}(s)$ and $q_{\pi}(s,a)$.)*
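These formulas also give a direct way to compute one value function from the other. The following rough Python sketch (same hypothetical toy-MDP layout as above, with made-up names; not from the lecture) evaluates $v_{\pi}$ by iterating the Bellman equation, recovers $q_{\pi}$ from $v_{\pi}$ via the right-hand formula, and then checks the left-hand formula numerically.

```python
gamma = 0.9

# Hypothetical toy MDP: P[(s, a)] lists (prob, next_state, reward).
P = {
    ("s0", "left"):  [(1.0, "s0", 0.0)],
    ("s0", "right"): [(1.0, "s1", 1.0)],
    ("s1", "stay"):  [(1.0, "s1", 0.0)],
}
pi = {"s0": {"left": 0.5, "right": 0.5}, "s1": {"stay": 1.0}}  # pi(a|s)
states = ["s0", "s1"]

# Iterative policy evaluation:
# v(s) <- sum_a pi(a|s) sum_{s',r} [r + gamma * v(s')] P(s',r|s,a)
v = {s: 0.0 for s in states}
for _ in range(1000):
    v = {
        s: sum(
            p_a * sum(p * (r + gamma * v[s2]) for p, s2, r in P[(s, a)])
            for a, p_a in pi[s].items()
        )
        for s in states
    }

# Right-hand formula: q_pi(s,a) = sum_{s',r} [r + gamma * v_pi(s')] P(s',r|s,a)
q = {
    (s, a): sum(p * (r + gamma * v[s2]) for p, s2, r in P[(s, a)])
    for (s, a) in P
}

# Left-hand formula: v_pi(s) = sum_a pi(a|s) q_pi(s,a)
for s in states:
    assert abs(v[s] - sum(p_a * q[(s, a)] for a, p_a in pi[s].items())) < 1e-6

print(v)
print(q)
```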

How do we prove these formulas? Recall the following representation of the state-value function:

$$ v_{\pi}(s)=\mathbb{E}_{\pi}[G_{0}|S_{0}=s]=\sum_{k=1}^{\infty}\gamma^{k-1}\sum_{s_{k},r_{k},a}r_{k}\cdot\mathcal{P}_{k|0}(s_{k},r_{k}|s,a)\,\pi(a|s) $$

Since we can change the order of summation (this interchange is justified below), observe that

$$ \begin{aligned} &\sum_{k=1}^{\infty}\gamma^{k-1}\sum_{s_{k},r_{k},a}r_{k}\cdot\mathcal{P}_{k|0}(s_{k},r_{k}|s,a)\,\pi(a|s)\\ &=\sum_{a\in\mathcal{A}}\pi(a|s){\color{red}\left(\sum_{k=1}^{\infty}\gamma^{k-1}\sum_{s_{k},r_{k}}r_{k}\cdot\mathcal{P}_{k|0}(s_{k},r_{k}|s,a)\right)}\\ &=\sum_{a\in\mathcal{A}}\pi(a|s){\color{red}\sum_{k=1}^{\infty}\gamma^{k-1}\mathbb{E}_{\pi}[R_{k}|S_{0}=s,A_{0}=a]}\\ &=\sum_{a\in\mathcal{A}}\pi(a|s){\color{red}\mathbb{E}_{\pi}[G_{0}|S_{0}=s,A_{0}=a]}=\sum_{a\in\mathcal{A}}\pi(a|s){\color{red}q_{\pi}(s,a)} \end{aligned} $$
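The interchange of the two summations in the first step is not automatic. A standard sufficient condition, assumed here though not stated explicitly in the lecture, is that rewards are bounded, $|r|\le r_{\max}$, and $\gamma<1$; then the double sum converges absolutely and Tonelli's theorem allows the exchange of the order of summation:

$$ \sum_{k=1}^{\infty}\gamma^{k-1}\sum_{s_{k},r_{k},a}|r_{k}|\cdot\mathcal{P}_{k|0}(s_{k},r_{k}|s,a)\,\pi(a|s)\le\sum_{k=1}^{\infty}\gamma^{k-1}r_{\max}=\frac{r_{\max}}{1-\gamma}<\infty $$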

This completes the proof of the first (left-hand) formula. To obtain the second (right-hand) one, recall the Bellman equation we derived previously:

$$ v_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{s',r}[r+\gamma v_{\pi}(s')]\cdot\mathcal{P}(s',r|s,a) $$

Combining this with the first formula, it must hold that for every $s\in\mathcal{S}$,

$$ \sum_{a\in\mathcal{A}}\pi(a|s){\color{red}q_{\pi}(s,a)}=\sum_{a\in\mathcal{A}}\pi(a|s){\color{red}\sum_{s',r}[r+\gamma v_{\pi}(s')]\cdot\mathcal{P}(s',r|s,a)} $$

Hence the Bellman equation relates $v_{\pi}(s)$ and $q_{\pi}(s,a)$, although matching the two weighted sums does not by itself prove the term-wise identity. To verify it directly, expand the action-value function using the definition of the return:

$$ \begin{aligned} q_{\pi}(s,a)&=\mathbb{E}_{\pi}[G_{0}|S_{0}=s,A_{0}=a]\\ &=\mathbb{E}_{\pi}[R_{1}+\gamma G_{1}|S_{0}=s,A_{0}=a]\\ &=\mathbb{E}_{\pi}[R_{1}|S_{0}=s,A_{0}=a]+\gamma\mathbb{E}_{\pi}[G_{1}|S_{0}=s,A_{0}=a] \end{aligned} $$

Now we can compute the first expectation as follows