Markov Decision Processes

An MDP is a tuple – (S, s1, A, P, R)

S – finite set of states.

s1 – initial state.

A – finite set of actions.

P – Given a state s and an action a, what is the probability of ending up in a particular state s'? This information is provided by P, the state transition probability matrix, with dimensions |S| × |A| × |S|.

R – What is the reward for taking action a in state s?
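As a concrete illustration of these shapes, here is a minimal sketch (Python/NumPy) of how P and R for a made-up two-state, two-action MDP can be stored as arrays of size |S| × |A| × |S| and |S| × |A|; all the numbers are invented purely for the example.

```python
import numpy as np

# Hypothetical toy MDP with 2 states and 2 actions (numbers invented for illustration).
n_states, n_actions = 2, 2

# P[s, a, s2] = probability of moving to state s2 when taking action a in state s.
# Shape: |S| x |A| x |S|; each P[s, a] row is a probability distribution.
P = np.array([
    [[0.9, 0.1],   # state 0, action 0
     [0.2, 0.8]],  # state 0, action 1
    [[0.0, 1.0],   # state 1, action 0
     [0.5, 0.5]],  # state 1, action 1
])

# R[s, a] = immediate reward for taking action a in state s.
R = np.array([
    [0.0, 1.0],   # rewards in state 0
    [2.0, 0.0],   # rewards in state 1
])

assert P.shape == (n_states, n_actions, n_states)
assert np.allclose(P.sum(axis=2), 1.0)  # each (s, a) distribution sums to 1
```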

Pi – Policy: which action to take in which state? A mapping from the state space to the action space. The optimal policy is the policy that maximizes the expected cumulative reward.

Value function – the expected cumulative reward obtained by starting from state s and following policy pi, written V_pi(s).

Bellman equations –

For a fixed policy pi, the value function satisfies a recursive relation: the value of a state is its immediate reward plus the discounted expected value of the next state, i.e. V_pi(s) = R(s, pi(s)) + gamma * Σ_{s'} P(s' | s, pi(s)) * V_pi(s'), where gamma is the discount factor.
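Here is a minimal sketch (Python/NumPy, with a made-up two-state, two-action MDP and an invented discount factor gamma = 0.9) of how the Bellman expectation equation can be turned into an iterative policy-evaluation routine: the update V(s) ← R(s, pi(s)) + gamma * Σ_{s'} P(s' | s, pi(s)) * V(s') is applied until the value vector stops changing.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9, tol=1e-8):
    """Iteratively apply the Bellman expectation equation for a fixed policy.

    P: |S| x |A| x |S| transition probabilities, R: |S| x |A| rewards,
    policy: array of length |S| giving the action chosen in each state.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # V_new(s) = R(s, pi(s)) + gamma * sum_{s'} P(s' | s, pi(s)) * V(s')
        V_new = np.array([
            R[s, policy[s]] + gamma * P[s, policy[s]] @ V
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy 2-state, 2-action MDP (numbers invented for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
policy = np.array([1, 0])   # take action 1 in state 0, action 0 in state 1
print(evaluate_policy(P, R, policy))
```

For small state spaces the same system of equations can also be solved exactly as a linear system instead of iterating.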

Ways to find an optimal policy:

The optimal policy when starting in state s is the policy whose value function is maximum at state s.

Consider the case where one policy is at least as good as another for every initial state (its value function is no smaller in every state). Then the first policy is said to be greater than or equal to the second; this comparison defines a partial ordering over policies.

Theorem:

There always exists a policy that is greater than or equal to every other policy under this partial ordering. Such a policy (there can be more than one) is called an optimal policy.

Iterative methods:

  1. Policy iteration (a sketch is given after this list):
    1. Evaluate the current policy to get its value function, then at each state pick the action with the maximum expected value.
    2. You get a new policy.
    3. Go back to step 1 and loop until the new policy and the old policy are the same.
  2. Value iteration (also sketched after this list):
    1. Find the optimal value function directly rather than an explicit policy.
    2. In every iteration, improve the value vector by applying the Bellman optimality update.
    3. Once the value vector converges, use it to extract the optimal policy.
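Below is a minimal sketch of both methods on a made-up two-state, two-action MDP (the arrays P and R and the discount factor gamma = 0.9 are invented for illustration). Policy iteration alternates policy evaluation with greedy improvement until the policy stops changing; value iteration repeatedly applies the Bellman optimality update to the value vector and extracts the greedy policy once it has converged.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # 1. Policy evaluation: iterate the Bellman expectation equation for the current policy.
        V = np.zeros(n_states)
        while True:
            V_new = np.array([R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                              for s in range(n_states)])
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        # 2. Policy improvement: at each state pick the action with maximum expected value.
        Q = R + gamma * P @ V          # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s,a) V(s')
        new_policy = Q.argmax(axis=1)
        # 3. Stop when the new policy equals the old policy.
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman optimality update: V(s) <- max_a [ R(s, a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
        V_new = (R + gamma * P @ V).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # Extract the greedy (optimal) policy from the converged value vector.
    policy = (R + gamma * P @ V).argmax(axis=1)
    return policy, V

# Toy 2-state, 2-action MDP (numbers invented for illustration).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
print(policy_iteration(P, R))
print(value_iteration(P, R))
```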

Model-free methods:

Reinforcement learning is a Markov decision process with an unknown transition model and/or reward distribution.

The agent can observe samples and infer the transition model and reward distribution from them.
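As a sketch of what inferring the model from samples might look like (the estimate_model helper and the sample list below are hypothetical, made up for illustration): count the observed (s, a, s') transitions to estimate P and average the observed rewards to estimate R.

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Estimate P and R from observed (s, a, r, s2) samples by counting and averaging."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2)                            # times each (s, a) was tried
    P_hat = counts / np.maximum(visits, 1)[:, :, None]     # empirical transition probabilities
    R_hat = reward_sum / np.maximum(visits, 1)             # empirical mean rewards
    # Note: (s, a) pairs never visited keep all-zero estimates here.
    return P_hat, R_hat

# Hypothetical observed samples: (state, action, reward, next state).
samples = [(0, 1, 1.0, 1), (1, 0, 2.0, 1), (0, 1, 1.0, 0), (1, 0, 2.0, 1)]
P_hat, R_hat = estimate_model(samples, n_states=2, n_actions=2)
```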

Q-learning:

Uses Q-values (state-action values Q(s, a)) instead of the state value function.
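A minimal sketch of the tabular Q-learning update, assuming a hypothetical environment object env with reset() returning the initial state and step(action) returning (next_state, reward, done); those names are assumptions, not something from these notes. The agent never builds P or R explicitly; it updates Q(s, a) directly from sampled transitions and acts greedily with respect to the learned Q-values.

```python
import random
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: learn Q(s, a) directly from sampled transitions."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()               # hypothetical env interface (see lead-in)
        done = False
        while not done:
            # Epsilon-greedy exploration.
            a = random.randrange(n_actions) if random.random() < epsilon else int(Q[s].argmax())
            s2, r, done = env.step(a)
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = r + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    # Greedy policy with respect to the learned Q-values.
    return Q, Q.argmax(axis=1)
```

Under the usual conditions (every state-action pair visited infinitely often and a suitably decaying learning rate), tabular Q-learning converges to the optimal Q-values.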

To be continued……….