Markov Decision Process

An MDP is defined by:
  • \(s \in S\) – a set of states
  • \(a \in A\) – a set of actions
  • \(T(s,a,s')\) – a transition function (also called the model or dynamics)
    • the probability that taking action \(a\) in state \(s\) leads to \(s'\), i.e., \(P(s' \mid s, a)\)
  • \(R(s,a,s')\) – a reward (or cost) function
    • aka \(R(s')\) or \(R(s)\)
    • We want to maximize the total reward (equivalently, minimize the total cost).
  • \(s_0\) – a start state
  • \(\gamma\) – a discount factor
  • Possibly a terminal state
  • MDPs are non-deterministic (stochastic) search problems; a minimal encoding is sketched below.
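
To make the components concrete, here is a minimal sketch of a hypothetical two-state MDP encoded as plain Python data; all states, actions, probabilities, and rewards below are made up for illustration.

```python
# A minimal, hypothetical two-state MDP encoded as plain Python data
S = ["s0", "s1"]                 # set of states
A = ["stay", "go"]               # set of actions
gamma = 0.9                      # discount factor
start_state = "s0"               # start state

# T[(s, a)] maps successor state s' -> P(s' | s, a)
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.6, "s1": 0.4},
}

# R[(s, a, s')] is the reward for the transition s --a--> s'
R = {
    ("s0", "stay", "s0"): 0.0,
    ("s0", "go", "s0"): -1.0,
    ("s0", "go", "s1"): 5.0,
    ("s1", "stay", "s1"): 1.0,
    ("s1", "go", "s0"): -2.0,
    ("s1", "go", "s1"): 0.0,
}

# Sanity check: every (state, action) pair has a successor distribution summing to 1
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in T.values())
```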

Markov Property

The future is independent of the past given the present. [RL_course_mdp_DeepMind]

A state \(S_t\) is Markov if and only if

\[\mathbb { P } \left[ S _ { t + 1} | S _ { t } \right] = \mathbb { P } \left[ S _ { t + 1} | S _ { 1} ,\dots ,S _ { t } \right]\]
  • The state \(S_t\) captures all relevant information from the history, i.e., the state \(S_{t+1}\) depends only on \(S_t\), not on earlier states.
  • Once the state is known, the history may be thrown away, i.e., the state is a sufficient statistic of the future.

State transition matrix

For a Markov state s and successor state s’, the state transition probability is defined by

\[\mathcal{P}_{ss'} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s \right]\]

The state transition matrix \(\mathcal{P}\) defines transition probabilities from all states \(s\) to all successor states \(s'\).

\[\begin{split}\mathcal{P} = \left[ \begin{array}{ccc} \mathcal{P}_{11} & \dots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \dots & \mathcal{P}_{nn} \end{array} \right]\end{split}\]

where each row of the matrix sums to 1.
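
For instance, a hypothetical 3-state transition matrix can be written with NumPy and checked for the row-sum property (the numbers are assumptions for illustration):

```python
import numpy as np

# Hypothetical 3-state transition matrix: P[s, s'] = P(S_{t+1} = s' | S_t = s)
P = np.array([
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.3, 0.3, 0.4],
])

# Each row is a probability distribution over successor states
assert np.allclose(P.sum(axis=1), 1.0)
```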

Markov Process

A Markov process is a memoryless random process, i.e., a sequence of random states \(S_1,S_2,...\) with the Markov property.

  • A Markov process (or Markov chain) is a tuple \(\langle \mathcal{S}, \mathcal{P} \rangle\)
    • \(\mathcal{S}\) is a finite set of states
    • \(\mathcal{P}\) is a state transition probability matrix, \(\mathcal{P}_{ss'} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s \right]\); a trajectory-sampling sketch follows this list
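
A minimal sketch of sampling a trajectory from such a chain; the 3-state matrix is the same hypothetical one used above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(P, start_state, n_steps):
    """Sample a state sequence S_1, S_2, ... from a Markov chain."""
    states = [start_state]
    for _ in range(n_steps):
        s = states[-1]
        # The next state depends only on the current state (Markov property)
        states.append(rng.choice(len(P), p=P[s]))
    return states

# Hypothetical 3-state chain
P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5],
              [0.3, 0.3, 0.4]])
print(sample_chain(P, start_state=0, n_steps=10))
```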

Markov Reward Process

A Markov reward process is a Markov chain with values, \(\langle \mathcal { S } ,\mathcal { P } ,\mathcal { R } ,\gamma \rangle\)

  • \(\mathcal{R}\) is a reward function, \(\mathcal { R } _ { s } = \mathbb { E } \left[ R _ { t + 1} | S _ { t } = s \right]\)
  • \(\gamma\) is a discount factor, \(\gamma \in [ 0,1]\)

Return

The return \(G_t\) is the total discounted reward from time-step t.

\[G_t = R_{t+1} + \gamma R_{t+2} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\]
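
A quick sketch of computing \(G_t\) from a finite sequence of sampled rewards (the reward values and discount factor are made up):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical sampled rewards R_{t+1}, R_{t+2}, ...
print(discounted_return([1.0, 0.0, -2.0, 5.0], gamma=0.9))
```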

Value Function

The value function \(v(s)\) gives the long-term value of state \(s\). It is the expected return starting from state \(s\):

\[v(s) = \mathbb{E}\left[ G_t \mid S_t = s \right]\]
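
One way to make the definition concrete is a Monte-Carlo estimate: sample many trajectories starting from \(s\) and average their discounted returns. The sketch below assumes a hypothetical 3-state MRP with per-state expected rewards; all numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MRP: transition matrix, per-state expected reward, discount factor
P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5],
              [0.3, 0.3, 0.4]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

def mc_value_estimate(s, n_episodes=2000, horizon=100):
    """Estimate v(s) = E[G_t | S_t = s] by averaging sampled returns."""
    returns = []
    for _ in range(n_episodes):
        state, g, discount = s, 0.0, 1.0
        # Truncate the infinite sum at a finite horizon (gamma**100 is negligible)
        for _ in range(horizon):
            g += discount * R[state]
            discount *= gamma
            state = rng.choice(len(P), p=P[state])
        returns.append(g)
    return np.mean(returns)

print(mc_value_estimate(0))
```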

Bellman Equation in Matrix Form

For an MRP, the Bellman equation decomposes the value function into the immediate reward plus the discounted value of the successor state, \(v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')\). It can be expressed concisely using matrices.

\[v = \mathcal{R} + \gamma \mathcal{P} v\]
\[\begin{split}\left[ \begin{array}{c} v(1) \\ \vdots \\ v(n) \end{array} \right] = \left[ \begin{array}{c} \mathcal{R}_1 \\ \vdots \\ \mathcal{R}_n \end{array} \right] + \gamma \left[ \begin{array}{ccc} \mathcal{P}_{11} & \dots & \mathcal{P}_{1n} \\ \vdots & & \vdots \\ \mathcal{P}_{n1} & \dots & \mathcal{P}_{nn} \end{array} \right] \left[ \begin{array}{c} v(1) \\ \vdots \\ v(n) \end{array} \right]\end{split}\]

You can solve the Bellman equation as a simple linear equation. The complexity is \(O \left( n ^ { 3} \right)\) for n states.

\[\begin{split}\begin{align} v &= \mathcal{R} + \gamma \mathcal{P} v \\ (I - \gamma \mathcal{P}) v &= \mathcal{R} \\ v &= (I - \gamma \mathcal{P})^{-1} \mathcal{R} \end{align}\end{split}\]
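
A minimal NumPy sketch of the direct solution for a small hypothetical MRP (the same 3-state numbers used earlier; \(\mathcal{R}_s\) is the expected immediate reward in state \(s\)):

```python
import numpy as np

# Hypothetical 3-state MRP
P = np.array([[0.7, 0.2, 0.1],
              [0.0, 0.5, 0.5],
              [0.3, 0.3, 0.4]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# v = (I - gamma * P)^{-1} R, solved as a linear system (O(n^3) in the number of states)
v = np.linalg.solve(np.eye(len(P)) - gamma * P, R)
print(v)
```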

Because the direct solution requires a matrix inversion, it is practical only for small MRPs. Iterative alternatives include

  • Dynamic programming
  • Monte-Carlo evaluation
  • Temporal-Difference Learning

Markov Decision Process

An MDP is an MRP with decisions. It is an environment in which all states are Markov.

Policies

A policy \(\pi\) is a distribution over actions given states. For a finite MDP it can be stored as a stochastic matrix with one row per state, each row a distribution over actions.

\[\pi ( a | s ) = \mathbb { P } \left[ A _ { t } = a | S _ { t } = s \right]\]
  • A policy defines the behaviour of an agent.
  • It depends on the current state.
  • Policies are stationary (time-independent); an action-sampling sketch follows this list.
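
A stochastic policy over a finite MDP can be stored as an \(|S| \times |A|\) matrix whose rows sum to 1; sampling an action is then a single draw. The numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical policy pi(a|s) for 3 states and 2 actions; each row sums to 1
pi = np.array([[0.9, 0.1],
               [0.5, 0.5],
               [0.2, 0.8]])

def act(pi, s):
    """Sample A_t ~ pi(. | S_t = s)."""
    return rng.choice(pi.shape[1], p=pi[s])

print(act(pi, s=0))
```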

Value Functions

The state-value function \(v_{\pi}(s)\) of an MDP is the expected return starting from state s, and then following policy \(\pi\).

\[v _ { \pi } ( s ) = \mathbb { E } _ { \pi } \left[ G _ { t } | S _ { t } = s \right]\]

The action-value function \(q _ { \pi } ( s,a )\) is the expected return starting from state s, taking action a, and then following policy \(\pi\).

\[q _ { \pi } ( s,a ) = \mathbb { E } _ { \pi } \left[ G _ { t } | S _ { t } = s ,A _ { t } = a \right]\]

Bellman expectation equation

The state-value function can again be decomposed into immediate reward plus discounted value of successor state.

\[\begin{split}\begin{align} v_{\pi}(s) &= \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s \right] \\ &= \sum_{a \in A} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a v_{\pi}(s') \Big) \end{align}\end{split}\]

The action-value function can similarly be decomposed.

\[\begin{split}\begin{align} q_{\pi}(s,a) &= \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right] \\ &= \mathcal{R}_s^a + \gamma \sum_{s' \in S} \mathcal{P}_{ss'}^a \sum_{a' \in A} \pi(a' \mid s') q_{\pi}(s', a') \end{align}\end{split}\]
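
These equations give the standard iterative policy-evaluation sweep. Below is a sketch for a hypothetical 3-state, 2-action MDP; the transition tensor, rewards, policy, and sweep count are all assumptions for illustration.

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] = P(s' | s, a), R[s, a] = expected immediate reward
P = np.array([[[0.7, 0.2, 0.1], [0.0, 0.5, 0.5], [0.3, 0.3, 0.4]],
              [[0.1, 0.8, 0.1], [0.4, 0.1, 0.5], [0.5, 0.0, 0.5]]])
R = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 0.5]])
pi = np.array([[0.9, 0.1],   # pi(a|s); each row sums to 1
               [0.5, 0.5],
               [0.2, 0.8]])
gamma = 0.9

def policy_evaluation(P, R, pi, gamma, n_sweeps=200):
    """Repeatedly apply v(s) <- sum_a pi(a|s) * (R[s,a] + gamma * sum_s' P[a,s,s'] * v(s'))."""
    v = np.zeros(R.shape[0])
    for _ in range(n_sweeps):
        q = R + gamma * np.einsum("asn,n->sa", P, v)   # q_pi[s, a] under the current v
        v = (pi * q).sum(axis=1)
    return v

print(policy_evaluation(P, R, pi, gamma))
```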

Optimal Value Function

The optimal state-value function \(v_*(s)\) is the maximum state-value function over all policies

\[v_*(s) = \max_{\pi} v_{\pi}(s)\]

The optimal action-value function \(q_* ( s,a )\) is the maximum action-value function over all policies

\[q_* ( s,a ) = \max _ { \pi } q _ { \pi } ( s,a )\]

An MDP is “solved” when we know the optimal value function.

Define a partial ordering over policies

\[\pi \geq \pi' \text{ if } v_{\pi}(s) \geq v_{\pi'}(s), \forall s\]

An optimal policy can be found by maximizing over \(q_* ( s,a )\),

\[\begin{split}\pi_*(a \mid s) = \left\{ \begin{array}{ll} 1 & \text{if } a = \arg\max_{a \in A} q_*(s, a) \\ 0 & \text{otherwise} \end{array} \right.\end{split}\]
  • There is always a deterministic optimal policy for any MDP.
  • If we know \(q_*(s,a)\), we immediately have the optimal policy, as the sketch below illustrates.
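
A sketch of reading off a deterministic optimal policy from a hypothetical table of \(q_*(s,a)\) values (the numbers are made up):

```python
import numpy as np

# Hypothetical optimal action-values q_*(s, a) for 3 states and 2 actions
q_star = np.array([[1.2, 3.4],
                   [0.7, 0.1],
                   [2.0, 2.5]])

# pi_*(s) = argmax_a q_*(s, a): act greedily in every state
pi_star = q_star.argmax(axis=1)
print(pi_star)   # -> [1 0 1]
```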

Solving the Bellman Optimality Equation

  • Bellman optimality equation is non-linear
  • No closed form solution in general
  • Many iterative solution methods
    • Value iteration (a minimal sketch follows this list)
    • Policy iteration
    • Q-learning
    • Sarsa
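
As a minimal value-iteration sketch: each sweep applies the Bellman optimality backup \(v(s) \leftarrow \max_a \big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a v(s') \big)\). The MDP below reuses the hypothetical numbers from the policy-evaluation sketch.

```python
import numpy as np

def value_iteration(P, R, gamma, n_sweeps=500):
    """P[a, s, s'] transition tensor, R[s, a] rewards; returns v_* and a greedy policy."""
    v = np.zeros(R.shape[0])
    for _ in range(n_sweeps):
        q = R + gamma * np.einsum("asn,n->sa", P, v)   # q[s, a] under the current v
        v = q.max(axis=1)                              # Bellman optimality backup
    greedy = (R + gamma * np.einsum("asn,n->sa", P, v)).argmax(axis=1)
    return v, greedy

# Hypothetical MDP (same layout as the policy-evaluation sketch)
P = np.array([[[0.7, 0.2, 0.1], [0.0, 0.5, 0.5], [0.3, 0.3, 0.4]],
              [[0.1, 0.8, 0.1], [0.4, 0.1, 0.5], [0.5, 0.0, 0.5]]])
R = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 0.5]])
v_star, pi_star = value_iteration(P, R, gamma=0.9)
print(v_star, pi_star)
```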

Q-learning

“Q” is for “quality” of a certain action in a given state. We define a function \(Q \left( s ,a \right)\) representing the maximum discounted future reward when we perform action a in state s, and continue optimally from that point on.

\[Q(s_t, a_t) = \max R_{t+1}\]

The Bellman equation expresses the optimal Q-value of state s and action a in terms of the next state s' as follows:

\[Q^*(s, a) = r + \gamma \max_{a'} Q^*(s', a')\]

In words: the optimal future reward for this state and action is the immediate reward plus the maximum discounted future reward from the next state.
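
A sketch of the tabular Q-learning update implied by this equation, applied to a single hypothetical observed transition; the table size, learning rate, and numbers are assumptions for illustration.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Hypothetical 3-state, 2-action table and one observed transition (s=0, a=1, r=5, s'=2)
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=0, a=1, r=5.0, s_next=2)
print(Q)
```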

Explore-exploit dilemma

Should an agent exploit the known working strategy or explore possibly better unknown strategies?

Reference

[RL_course_mdp_DeepMind] David Silver, RL Course, Lecture 2: Markov Decision Processes (DeepMind). https://www.youtube.com/watch?v=lfHX2hHRMVQ
