# Markov Decision Processes

An MDP is a tuple (S, s1, A, P, R):

S – finite set of states.

s1 – initial state.

A – finite set of actions.

P – the state transition probability matrix: given a state s1 and an action a, P gives the probability of ending up in a particular state s2. It has dimensions |S| × |A| × |S|.
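As a sketch, the transition matrix can be stored as a 3-D array indexed by (state, action, next state). The toy 2-state, 2-action numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; P[s, a, s2] is the probability of
# moving from state s to state s2 after taking action a.
P = np.array([
    [[0.9, 0.1],   # state 0, action 0
     [0.2, 0.8]],  # state 0, action 1
    [[0.0, 1.0],   # state 1, action 0
     [0.5, 0.5]],  # state 1, action 1
])

# For each (state, action) pair the row over next states is a
# probability distribution, so it must sum to 1.
assert np.allclose(P.sum(axis=2), 1.0)
print(P.shape)  # (2, 2, 2), i.e. |S| x |A| x |S|
```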

R – the reward for taking action a in state s.

Pi – the policy: which action to take in which state, i.e. a mapping from the state space to the action space. The optimal policy is the policy with the maximum expected reward.

Value function – the expected cumulative reward when starting from state s and following policy pi.

Bellman equations – the value function of a policy satisfies a recursive relationship: V(s) = R(s, pi(s)) + gamma * sum over s' of P(s' | s, pi(s)) * V(s'), where gamma in [0, 1) is a discount factor.
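A minimal sketch of how the Bellman equation is used in practice: repeatedly applying it as an update rule evaluates a fixed policy. The function name `evaluate_policy` and the convergence tolerance are my own choices:

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9, tol=1e-8):
    """Iteratively apply the Bellman equation for a fixed policy
    (pi[s] = action taken in state s) until the value vector stops changing."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # V_new(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) * V(s')
        V_new = np.array([
            R[s, pi[s]] + gamma * P[s, pi[s]] @ V
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

For example, a single state that pays reward 1 forever under gamma = 0.9 has value 1 / (1 - 0.9) = 10, which the iteration converges to.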

Ways to find an optimal policy:

The optimal policy when starting in state s is the policy whose value function is maximal at s.

Consider the case where one policy is at least as good as another for every initial state; then the first policy is said to be greater than or equal to the second. Since some pairs of policies are not comparable this way, this defines a partial ordering on policies.

Theorem:

There always exists a policy that is greater than or equal to every other policy under this partial ordering. Such a policy (there may be more than one) is called an optimal policy.

Iterative methods:

1. Policy iteration:
    1. Evaluate the value function of the current policy.
    2. At each state, pick the action with the maximum expected value under that value function; this gives a new policy.
    3. Go back to step 1 and loop until the new policy and the old policy are the same.
2. Value iteration:
    1. Find the optimal value function directly rather than an explicit policy.
    2. On every iteration, improve the value vector with a Bellman backup.
    3. Once the value vector converges, use it to read off the optimal policy.
    4. Each iteration is cheap compared to policy iteration, since no full policy evaluation is needed per step.
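The value-iteration steps above can be sketched as follows (function name, toy MDP, and tolerance are my own; `P` and `R` follow the |S| × |A| × |S| and |S| × |A| shapes from the definitions earlier):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Improve the value vector with Bellman optimality backups until it
    converges, then read off a greedy (optimal) policy."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')
        Q = R + gamma * np.einsum('sat,t->sa', P, V)
        V_new = Q.max(axis=1)                  # best achievable value per state
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # value vector and greedy policy
        V = V_new
```

On a one-state toy problem where action 1 pays reward 1 and action 0 pays nothing, the greedy policy picks action 1 and the value converges to 1 / (1 - gamma).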

Model-free methods:

Reinforcement learning is a Markov decision process with an unknown transition model and/or reward distribution.

The agent can observe samples and estimate the transition model and reward distribution from them.
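A small sketch of that estimation: count observed transitions and normalize. The sample tuples below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical transitions observed by the agent: (state, action, reward, next_state)
samples = [
    (0, 1, 0.0, 1), (0, 1, 0.0, 1), (0, 1, 0.0, 0),
    (1, 0, 1.0, 1), (1, 0, 1.0, 1),
]

# Count how often each (state, action) pair led to each next state.
counts = defaultdict(lambda: defaultdict(int))
for s, a, r, s2 in samples:
    counts[(s, a)][s2] += 1

# Empirical transition probabilities: count / total per (state, action) pair.
P_hat = {
    sa: {s2: c / sum(nexts.values()) for s2, c in nexts.items()}
    for sa, nexts in counts.items()
}
print(P_hat[(0, 1)])  # state 0, action 1 went to state 1 twice and state 0 once
```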

Q-learning:

Uses Q-values Q(s, a) – the expected return of taking action a in state s – instead of the state value function.
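A minimal sketch of one tabular Q-learning update (the step size `alpha` and the toy numbers are my own):

```python
def q_learning_step(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update: move Q(s, a) toward the sampled
    target r + gamma * max_a' Q(s2, a'). No transition model is needed."""
    target = r + gamma * max(Q[s2])
    Q[s][a] += alpha * (target - Q[s][a])

# Toy run on a 2-state, 2-action table with a single made-up transition.
Q = [[0.0, 0.0], [0.0, 0.0]]
q_learning_step(Q, s=0, a=1, r=1.0, s2=1)
print(Q[0][1])  # 0.1
```

Because the update only needs the observed tuple (s, a, r, s2), it works without knowing P or R, which is why Q-learning is a model-free method.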

To be continued…

Utility: the utility of a policy at a state is the expected long-term reward if we start running the policy from that state.

Reward gives us immediate feedback, but utility gives us long-term feedback. Utilities allow you to accept short-term negatives for long-term positives.

Credit assignment problem: when a reward arrives only after a long sequence of actions, which of those actions deserves the credit (or blame)?
