Suppose we have a Markov decision process with a finite state set and a finite action set, and we maximize the expected discounted return with discount factor $\gamma \in [0,1]$. The Sutton & Barto book states in chapter 3.8 that there always exists at least one optimal policy, but it doesn't prove why. Presumably all optimal policies yield the same optimal value function; at least, this is what would make sense, and it is also what the book assumes.
Can someone give me a proof for the above statement or a link to a proof?
Late answer, but here goes:
Suppose there exist two optimal policies π and π' with respective value functions V and V', and suppose that V(x) ≠ V'(x) for some state x. Without loss of generality, assume V(x) < V'(x). By definition, an optimal policy achieves a value at least as large as that of every policy in every state; in particular, π being optimal requires V(x) ≥ V'(x), which contradicts V(x) < V'(x). Hence V(x) = V'(x) for every state x, i.e., all optimal policies share the same optimal value function.
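As a small numeric sanity check, here is a sketch on a hypothetical two-state MDP (my own example, not from the question) in which two distinct deterministic policies are both optimal. Iterative policy evaluation shows they converge to the same value function:

```python
import numpy as np

gamma = 0.5

# Hypothetical MDP with deterministic transitions:
# state 0: action 0 ("stay") -> reward 1, next state 0
#          action 1 ("move") -> reward 0, next state 1
# state 1: either action     -> reward 2, next state 1
# P[s, a] = next state, R[s, a] = immediate reward
P = np.array([[0, 1], [1, 1]])
R = np.array([[1.0, 0.0], [2.0, 2.0]])

def evaluate(policy, tol=1e-12):
    """Iterative policy evaluation for a deterministic policy (list of actions)."""
    V = np.zeros(2)
    while True:
        V_new = np.array([R[s, policy[s]] + gamma * V[P[s, policy[s]]]
                          for s in range(2)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Two policies that disagree in state 0, yet both satisfy the
# Bellman optimality equation: max(1 + 0.5*2, 0 + 0.5*4) = 2.
pi1 = [0, 0]   # "stay" in state 0
pi2 = [1, 0]   # "move" to state 1

V1 = evaluate(pi1)
V2 = evaluate(pi2)
print(V1, V2)  # both converge to [2, 4]: distinct policies, one value function
```

This doesn't prove anything by itself, of course, but it illustrates the claim in the proof: the policies differ, while their value functions coincide.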