Let's begin by giving a proper definition of MDPs:
A Markov Decision Process (MDP) is a 5-tuple (S, A, R, P, γ) such that:
- S is the set of states the agent can be in
- A is the set of actions the agent can take
- R : S × A → ℝ is the reward function
- P is the set of transition probability distributions, defined such that P(s′|s, a) is the probability of transitioning to state s′ if the agent takes action a in state s. Note that MDPs are Markov processes, meaning that the Markov property holds on the transition probabilities: $P(S_{t+1} \mid S_t, A_t, \dots, S_0, A_0) = P(S_{t+1} \mid S_t, A_t)$
- γ ∈ (0, 1] is the discount factor. While we usually deal with discounted problems (i.e. γ < 1), the formulations presented are also valid for undiscounted MDPs (γ = 1)
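As a concrete illustration, here is a minimal sketch (assuming NumPy) of how such a tuple can be encoded as arrays. The particular states, rewards and transition probabilities below are made up purely for illustration:

```python
import numpy as np

# A tiny, made-up MDP with |S| = 2 states and |A| = 2 actions.
n_states, n_actions = 2, 2

# R[s, a]: the reward function R : S x A -> R.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# P[s, a, s']: the transition probabilities P(s' | s, a);
# each slice P[s, a, :] must be a probability distribution.
P = np.array([[[0.9, 0.1],    # from state 0, action 0
               [0.2, 0.8]],   # from state 0, action 1
              [[0.5, 0.5],    # from state 1, action 0
               [0.1, 0.9]]])  # from state 1, action 1

gamma = 0.95  # discount factor in (0, 1]

# Sanity check: every P(. | s, a) sums to 1.
assert np.allclose(P.sum(axis=-1), 1.0)
```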
We then define the policy, i.e. what dictates the agent's behavior in an MDP:
A policy π is a probability measure over the action space, defined such that π(a|s) is the probability of taking action a when the agent is in state s.
We finally introduce the value function, i.e. the agent's objective in an MDP:
The value function of a policy π is the expected discounted reward under this policy, when starting at a given state s:

$$V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s \right]$$
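For a finite MDP, this definition can be evaluated in closed form: averaging the Bellman expectation equation over the policy gives the linear system V^π = R_π + γ P_π V^π. A sketch on a made-up two-state MDP (the arrays R, P and the policy pi below are arbitrary illustrative assumptions):

```python
import numpy as np

# Illustrative two-state, two-action MDP (arbitrary numbers).
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
gamma = 0.95

# An arbitrary stochastic policy: pi[s, a] = pi(a | s).
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

# Policy-averaged transition matrix and reward vector:
#   P_pi(s, s') = sum_a pi(a|s) P(s'|s, a)
#   R_pi(s)     = sum_a pi(a|s) R(s, a)
P_pi = np.einsum('sa,sat->st', pi, P)
R_pi = (pi * R).sum(axis=1)

# Solve (I - gamma * P_pi) V = R_pi for the exact value function V^pi.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
```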
In particular, the value function of the optimal policy π* satisfies the Bellman optimality equation:

$$V^*(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right]$$
which yields the deterministic optimal policy:

$$\pi^*(s) = \underset{a \in A}{\arg\max} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right]$$
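A standard way to compute V* and the greedy policy from these two equations is value iteration. Here is a hedged sketch on the same made-up two-state MDP (the arrays and the stopping tolerance are illustrative assumptions, not part of the text):

```python
import numpy as np

# Illustrative two-state, two-action MDP (arbitrary numbers).
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
gamma = 0.95

def bellman_operator(V):
    """(H*V)(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    Q = R + gamma * P @ V   # Q[s, a]; (P @ V)[s, a] = sum_s' P[s,a,s'] V[s']
    return Q.max(axis=1)

# Iterate H* until (numerical) convergence to its fixed point V*.
V = np.zeros(2)
for _ in range(10_000):
    V_new = bellman_operator(V)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
V_star = V_new

# Deterministic optimal policy: greedy argmax over actions.
pi_star = (R + gamma * P @ V_star).argmax(axis=1)
```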
Deriving the LP formulation of MDPs:
Given the above definitions, we can start by noticing that any value function V that satisfies

$$V(s) \geq R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \quad \forall s \in S, \, a \in A$$

is an upper bound on the optimal value function. To see this, we can start by noticing that such a value function also satisfies:

$$V(s) \geq \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \right] \quad \forall s \in S$$
We recognize the value iteration operator H* applied to V:

$$(H^* V)(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \right]$$
i.e.

$$V \geq H^* V \quad \text{(componentwise)}$$
Also noticing that the H* operator is monotone (U ≤ V implies H*U ≤ H*V), we can apply it iteratively to obtain:

$$V \geq H^* V \geq (H^*)^2 V \geq \dots \geq \lim_{n \to \infty} (H^*)^n V = V^*$$
where we used the property of V* being the fixed point of H* (value iteration converges to it from any starting point).
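This chain of inequalities can be checked numerically: starting from any V with V ≥ H*V, the iterates decrease monotonically toward V*. A sketch under the same illustrative assumptions as before (the constant starting vector R_max / (1 − γ) is one easy feasible choice):

```python
import numpy as np

# Illustrative two-state, two-action MDP (arbitrary numbers).
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
gamma = 0.95

def H_star(V):
    return (R + gamma * P @ V).max(axis=1)

# A large-enough constant vector satisfies V >= H*V:
# with V = R_max / (1 - gamma), H*V <= R_max + gamma * V = V.
V = np.full(2, R.max() / (1 - gamma))
assert np.all(V >= H_star(V))

# Iterating H* yields a componentwise-decreasing sequence
# bounded below by V*.
iterates = [V]
for _ in range(200):
    iterates.append(H_star(iterates[-1]))

for U, W in zip(iterates, iterates[1:]):
    assert np.all(W <= U + 1e-9)
```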
Therefore, finding V* comes down to finding the tightest upper bound V that obeys the above inequality, which yields the following formulation:

$$\min_V \; \sum_{s \in S} \mu(s) V(s) \quad \text{s.t.} \quad V(s) \geq \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \right] \quad \forall s \in S$$
Here we added a weight term μ(s) corresponding to the probability of starting in state s. Since each max constraint is equivalent to one constraint per action, the above problem is linear in V and can be rewritten as follows:

$$\min_V \; \sum_{s \in S} \mu(s) V(s) \quad \text{s.t.} \quad V(s) \geq R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \quad \forall s \in S, \, a \in A$$
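This LP can be handed to an off-the-shelf solver. The sketch below assumes SciPy's `linprog` and reuses the same illustrative two-state MDP; the uniform initial-state distribution mu is also an assumption. It builds one linear constraint per (s, a) pair, rearranged into `linprog`'s `A_ub @ V <= b_ub` form:

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative two-state, two-action MDP (arbitrary numbers).
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
gamma = 0.95
n_states, n_actions = R.shape
mu = np.full(n_states, 1.0 / n_states)     # assumed initial-state distribution

# V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) V(s')   for all (s, a)
# rearranges to:  (gamma * P(.|s,a) - e_s) . V <= -R(s,a)
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        A_ub.append(gamma * P[s, a] - np.eye(n_states)[s])
        b_ub.append(-R[s, a])

# Minimize mu^T V over the (unbounded-sign) variables V(s).
res = linprog(c=mu, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
V_lp = res.x  # the tightest upper bound, i.e. V* (since mu > 0 everywhere)
```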