A tuple – (S,s1,A,P,R)

S – finite set of states.

s1 – initial state.

A – finite set of actions.

P – Given a state s1 and action a, what is the probability of ending up at a particular state s2? This information is provided by P. This is called a State transition probability matrix. S*A*S

R – What is the reward for taking an action a at the state s?

Pi – Policy, what action to take at which state? A mapping from state space to action space. The optimal policy is the policy with the maximum reward.

Value function – Expected reward starting from state s and following policy pi.

Bellman equations –

**Ways to find an optimal policy:**

Optimal policy when starting in state s is the policy whose value function is maximum starting from state s.

Think about the case where a particular policy is always better than another policy for all initial states. The first policy is greater than second policy and this is called **partial ordering**.

Theorem:

There always exists a policy such that the partial ordering of it with all other policies is greater or equal. Such a policy/policies is called optimal policy.

Iterative methods:

- Policy iteration:
- At each state, pick the action with max value function.
- you get the new policy.
- Again go to step 1 and loop until the new policy and old policy are same.

- Value iteration:
- Finding an optimal value function rather than explicit policy.
- For every iteration improve value vector.
- once you converge to a particular value vector, use it to find the optimal policy.
- This is cheap compared to policy iteration.

**Model-free methods**:

Reinforcement learning is Markov decision process with unknown transition model and/or reward distribution.

Agent can observe samples and induce transition model and reward distribution from that.

**Q-learning:**

Uses Q-values instead of the value function.

To be continued……….

Utility: Utility of the policy at a state is what happens if we start running from that state.

Reward gives us immediate feedback but utility gives us long-term feedback. utilities allow you to take short-term negatives for long-term positives.

Credit assignment problem: