Uniqueness of the optimal value function for an MDP

by jakab922   Last Updated November 17, 2017 17:19 PM

Suppose we have a Markov decision process with a finite state set and a finite action set, where the expected return is discounted by $\gamma \in [0,1]$. Sutton & Barto state in Section 3.8 that there always exists at least one optimal policy, but they don't prove it. Presumably the various optimal policies all yield the same optimal value function; at least that is what would make sense, and it is also what the book assumes.

Can someone give me a proof of the above statement, or a link to one?
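For reference, by the optimal value function I mean $V^*$ as defined in the book, i.e. the function satisfying the Bellman optimality equation (my paraphrase of the book's notation):

$$V^*(s) = \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[r(s, a, s') + \gamma V^*(s')\bigr].$$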



1 Answer


Late answer, but here goes:

Assume there exist two optimal policies $\pi$ and $\pi'$ with respective value functions $V$ and $V'$, and suppose that $V(x) \neq V'(x)$ for some state $x$. Without loss of generality we can assume $V(x) < V'(x)$. But then $\pi$ is not optimal: starting from $x$, it is strictly better to follow $\pi'$. This contradicts the optimality of $\pi$, so by contradiction $V(x) = V'(x)$ for every state $x$.
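To make this concrete, here is a minimal sketch in Python (the toy two-state MDP below is my own invented example, not from the question): a value tie in state 0 creates two distinct optimal policies, and exact policy evaluation shows they share a single value function.

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions, discount gamma.
gamma = 0.9
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] transition probabilities
R = np.zeros((n_states, n_actions))            # R[s, a] expected rewards

# In state 0 both actions lead to state 1 with reward 1, so they are tied.
P[0, 0, 1] = 1.0; R[0, 0] = 1.0
P[0, 1, 1] = 1.0; R[0, 1] = 1.0
# In state 1, action 0 stays put (reward 2); action 1 returns to 0 (reward 0).
P[1, 0, 1] = 1.0; R[1, 0] = 2.0
P[1, 1, 0] = 1.0; R[1, 1] = 0.0

def evaluate(policy):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    P_pi = np.array([P[s, policy[s]] for s in range(n_states)])
    R_pi = np.array([R[s, policy[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

pi1 = [0, 0]  # optimal policy taking action 0 in the tied state
pi2 = [1, 0]  # a *different* optimal policy taking action 1 there
print(evaluate(pi1))  # [19. 20.]
print(evaluate(pi2))  # [19. 20.] -- same value function
```

Both calls print the same vector (here $[19, 20]$), which is exactly the uniqueness claim: the policies differ, but the value function does not.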

user3685312
November 17, 2017 16:32 PM
