Reinforcement Learning: an Easy Introduction to Value Iteration

Solving the example using Value Iteration

VI should make much more sense once we work through an example problem, so let’s get back to our golf MDP. We’ve formalised the problem as an MDP, but the agent doesn’t yet know the best strategy for playing golf, so let’s solve the golf MDP using VI.

We’ll start by defining our model parameters using fairly standard values:

γ = 0.9 // discount factor
θ = 0.01 // convergence threshold

We’ll then initialise our value table to 0 for all states in S:

// value table

V(s0) = 0
V(s1) = 0
V(s2) = 0
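
If you’d like to follow along in code, here is one way the golf MDP could be written down in Python. This is only a sketch: the dictionary layout and the names (T, GAMMA, THETA, STATES) are illustrative choices, but the probabilities and rewards are the ones used in the calculations below.

# A minimal Python encoding of the golf MDP (illustrative, not canonical).
# T[s][a] is a list of (next_state, probability, reward) triples.

GAMMA = 0.9    # discount factor γ
THETA = 0.01   # convergence threshold θ

STATES = ["s0", "s1", "s2"]

T = {
    "s0": {
        "hit to green": [("s0", 0.1, 0), ("s1", 0.9, 0)],
    },
    "s1": {
        "hit to fairway": [("s0", 0.9, 0), ("s1", 0.1, 0)],
        "hit in hole":    [("s1", 0.1, 0), ("s2", 0.9, 10)],
    },
    "s2": {},  # terminal state: no actions available
}

# value table initialised to 0 for every state
V = {s: 0.0 for s in STATES}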

We now start the outer loop:

Δ = 0

And do three passes of the inner loop, one for each state in S:

// Bellman update rule
// V(s) ← maxₐ Σₛ′, ᵣ p(s′, r|s, a) * [r + γV(s′)]

//******************* state s0 *******************//

v = 0

// we only evaluate one action here, as just one is available from s0
// we know the other actions are not possible and would therefore sum to 0

V(s0) = max[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]

V(s0) = max[0.1 * (0 + 0.9 * 0) +
0.9 * (0 + 0.9 * 0)]

V(s0) = max[0] = 0

// Delta update rule
// Δ ← max(Δ,| v - V(s)|)

Δ = max[Δ, |v - V(s0)|] = max[0, |0 - 0|] = 0

//******************* state s1 *******************//

v = 0

// there are 2 available actions here

V(s1) = max[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]

V(s1) = max[0.9 * (0 + 0.9 * 0) +
0.1 * (0 + 0.9 * 0),
0.1 * (0 + 0.9 * 0) +
0.9 * (10 + 0.9 * 0)]

V(s1) = max[0, 9] = 9

Δ = max[Δ, |v - V(s1)|] = max[0, |0 - 9|] = 9

//******************* state s2 *******************//

// terminal state with no actions

This gives us the following update to our value table:

V(s0) = 0
V(s1) = 9
V(s2) = 0

We don’t have to worry about s2 as it is a terminal state, meaning no actions are possible there.
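
To see how the update rule maps onto code, here is a minimal sketch of a single Bellman backup for one state, assuming the Python encoding sketched earlier (bellman_backup is just an illustrative name):

# One Bellman backup: V(s) ← maxₐ Σₛ′, ᵣ p(s′, r|s, a) * [r + γV(s′)]
def bellman_backup(s, V, T, gamma=GAMMA):
    if not T[s]:      # terminal state: its value stays at 0
        return V[s]
    return max(
        sum(p * (r + gamma * V[s_next]) for s_next, p, r in T[s][a])
        for a in T[s]
    )

# with the value table still at zero, bellman_backup("s1", V, T) returns 9.0,
# matching the first-iteration calculation above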

We now break out of the inner loop and continue the outer loop, performing the convergence check:

Δ < θ = 9 < 0.01 = False

Since we’ve not converged, we do a second iteration of the outer loop:

Δ = 0

And another three passes of the inner loop, using the updated value table:

//******************* state s0 *******************//

v = 0

V(s0) = max[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]

V(s0) = max[0.1 * (0 + 0.9 * 0) +
0.9 * (0 + 0.9 * 9)]

V(s0) = max[7.29] = 7.29

Δ = max[Δ, |v - V(s0)|] = max[0, |0 - 7.29|] = 7.29

//******************* state s1 *******************//

v = 9

V(s1) = max[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]

V(s1) = max[0.9 * (0 + 0.9 * 7.29) +
0.1 * (0 + 0.9 * 9),
0.1 * (0 + 0.9 * 9) +
0.9 * (10 + 0.9 * 0)]

V(s1) = max(6.7149, 9.81) = 9.81

Δ = max[Δ, |v - V(s1)|] = max[7.29, |9 - 9.81|] = 7.29

//******************* state s2 *******************//

// terminal state with no actions

At the end of the second iteration, our values are:

V(s0) = 7.29
V(s1) = 9.81
V(s2) = 0

Check convergence once more:

Δ < θ = 7.29 < 0.01 = False

Still no convergence, so we continue the same process as above until Δ < θ. I won’t show all of the calculations; the two iterations above are enough to understand the method.

After six iterations, our values converge. Here are the values and Δ as they change over each iteration:

Iteration   V(s0)          V(s1)          V(s2)   Δ              Converged
1           0              9              0       9              False
2           7.29           9.81           0       7.29           False
3           8.6022         9.8829         0       1.3122         False
4           8.779447       9.889461       0       0.177247       False
5           8.80061364     9.89005149     0       0.02116664     False
6           8.8029969345   9.8901046341   0       0.0023832945   True
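
The full loop is short enough to sketch in code. Assuming the MDP encoding and the bellman_backup helper sketched earlier, something like the following should reproduce the table above if you print V and Δ on each sweep:

# Value Iteration: sweep over the states until the largest update falls below θ
def value_iteration(T, gamma=GAMMA, theta=THETA):
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s in T:
            if not T[s]:                       # skip the terminal state
                continue
            v = V[s]                           # old value
            V[s] = bellman_backup(s, V, T, gamma)
            delta = max(delta, abs(v - V[s]))  # Δ ← max(Δ, |v - V(s)|)
        if delta < theta:                      # converged
            return V

V = value_iteration(T)
# V ≈ {"s0": 8.8030, "s1": 9.8901, "s2": 0.0}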

Now we will extract our policy:

// Policy extraction rule
// π(s) = argmaxₐ Σₛ′, ᵣ p(s′, r|s, a) * [r + γV(s′)]

//******************* state s0 *******************//

// we know there is only one possible action from s0, but let's compute it anyway

π(s0) = argmax[T(s0 | s0, hit to green) * (R(s0, hit to green, s0) + γ * V(s0)) +
T(s1 | s0, hit to green) * (R(s0, hit to green, s1) + γ * V(s1))]

π(s0) = argmax[0.1 * (0 + 0.9 * 8.8029969345) +
0.9 * (0 + 0.9 * 9.8901046341)]

π(s0) = argmax[8.80325447773]

π(s0) = hit to green

//******************* state s1 *******************//

π(s1) = argmax[T(s0 | s1, hit to fairway) * (R(s1, hit to fairway, s0) + γ * V(s0)) +
T(s1 | s1, hit to fairway) * (R(s1, hit to fairway, s1) + γ * V(s1)),
T(s1 | s1, hit in hole) * (R(s1, hit in hole, s1) + γ * V(s1)) +
T(s2 | s1, hit in hole) * (R(s1, hit in hole, s2) + γ * V(s2))]

π(s1) = argmax[0.9 * (0 + 0.9 * 8.8029969345) +
0.1 * (0 + 0.9 * 9.8901046341),
0.1 * (0 + 0.9 * 9.8901046341) +
0.9 * (10 + 0.9 * 0)]

π(s1) = argmax[8.02053693401, 9.89010941707]

π(s1) = hit in hole

Our final policy is:

π(s0) = hit to green
π(s1) = hit in hole
π(s2) = terminal state (no action)

So, when our agent is in the Ball on fairway state (s0), the best action is to hit to green. This seems pretty obvious, since it’s the only available action. However, in s1, where there are two possible actions, our policy has learned to hit in hole. We can now give this learned policy to other agents who want to play golf!
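
For completeness, here is a sketch of the same greedy policy extraction in code, again assuming the encoding and converged value table from the sketches above (extract_policy is an illustrative name):

# π(s) = argmaxₐ Σₛ′, ᵣ p(s′, r|s, a) * [r + γV(s′)]
def extract_policy(V, T, gamma=GAMMA):
    policy = {}
    for s in T:
        if not T[s]:          # terminal state: no action to take
            policy[s] = None
            continue
        policy[s] = max(
            T[s],             # the actions available in state s
            key=lambda a: sum(p * (r + gamma * V[s_next]) for s_next, p, r in T[s][a]),
        )
    return policy

print(extract_policy(V, T))
# {'s0': 'hit to green', 's1': 'hit in hole', 's2': None}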

And there you have it! We’ve just solved a very simple RL problem using Value Iteration.
