In the previous session you learned about the two-armed bandit, and specifically about the Q-learning algorithm, in which the outcome of an action is used to update the value of that action. We introduced two quantities, the prediction error $PE_t$ and the learning rate $\alpha$:
$$ PE_t = r_t - Q_t(a_t) \\ Q_{t+1}(a_t) = Q_{t}(a_t) + \alpha \cdot PE_t $$
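The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the session: the arm payout probabilities (0.3 and 0.7) and variable names are assumptions made for the example.

```python
import random

alpha = 0.1                # learning rate
q_values = [0.0, 0.0]      # one Q-value per arm of the bandit

random.seed(0)
for t in range(1000):
    action = random.randrange(2)  # pick an arm uniformly at random
    # hypothetical payout probabilities: arm 0 pays 30% of the time, arm 1 70%
    p_reward = 0.3 if action == 0 else 0.7
    reward = 1.0 if random.random() < p_reward else 0.0
    pe = reward - q_values[action]      # PE_t = r_t - Q_t(a_t)
    q_values[action] += alpha * pe      # Q_{t+1}(a_t) = Q_t(a_t) + alpha * PE_t

print(q_values)  # each estimate drifts toward its arm's true payout probability
```

Because $\alpha$ keeps the update small, each Q-value is a running, exponentially weighted average of the rewards that arm produced.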
However, most of our everyday actions don't lead directly to a desired outcome; rather, they are part of a sequence of actions we need to perform in order to achieve a specific goal. Using the Q-learning algorithm in a sequential scenario will still allow us to update the different actions, but it won't be very efficient.
For example, let's say we need to perform three actions, one after the other, in order to achieve an outcome. A more realistic example:

- Desired outcome: reward (a cup of coffee)
- Actions: action1 (turn on the machine), action2 (insert a coffee capsule), action3 (press the long espresso button)
The first time we perform this sequence, only action3 will be updated; in order to update the second action, we will need to perform the whole sequence all over again.
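We can see this slow, backward propagation of credit in a small sketch. The action names and the learning rate are illustrative, and the one-step update bootstraps from the value of the next action in the sequence:

```python
alpha = 0.5
q = {"turn_on": 0.0, "insert_capsule": 0.0, "press_button": 0.0}
sequence = ["turn_on", "insert_capsule", "press_button"]

# One pass through the sequence with a plain one-step update:
for i, action in enumerate(sequence):
    reward = 1.0 if i == len(sequence) - 1 else 0.0  # coffee only at the end
    # value of the following action; 0 once the sequence is over
    next_value = q[sequence[i + 1]] if i + 1 < len(sequence) else 0.0
    q[action] += alpha * (reward + next_value - q[action])

print(q)  # only "press_button" has moved; the two earlier actions stay at 0
```

Only on a second pass would `insert_capsule` pick up value (from the now-nonzero `press_button`), and only on a third pass would `turn_on` follow.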
Eligibility traces are used to speed up the learning process by combining information from multiple time steps. They can be thought of as a temporary record of the occurrence of state-action pairs during an episode. The idea is to keep track of these pairs and use this information to update the value function or policy more effectively.
The TD(λ) algorithm is used to learn the value function V(s) for state s. Here is a step-by-step outline of TD(λ) with eligibility traces:

1. Initialize: set V(s) arbitrarily (e.g., to 0) and the eligibility traces e(s) = 0 for all states s.
2. For each episode, and for each step t of the episode:
   - Take action a(t), transition to the next state s(t+1), and receive reward r(t+1).
   - Calculate the TD error: δ(t) = r(t+1) + γV(s(t+1)) − V(s(t)), where γ is the discount factor.
   - Increment the eligibility trace for the current state: e(s(t)) = e(s(t)) + 1
   - For all states s, update the value: V(s) = V(s) + α·δ(t)·e(s)
   - Then decay all traces: e(s) = γλ·e(s)
3. End of episode: reset the eligibility traces e(s) to 0.
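The steps above can be sketched as tabular TD(λ) on a tiny deterministic chain, s0 → s1 → s2 → terminal, with a reward of 1 on the final transition. The environment and parameter values are assumptions made for the example:

```python
alpha, gamma, lam = 0.1, 1.0, 0.9   # learning rate, discount, trace decay
n_states = 3
V = [0.0] * n_states                # value estimates, initialized to 0

for episode in range(200):
    e = [0.0] * n_states            # traces reset at the start of each episode
    for s in range(n_states):       # walk the chain s0 -> s1 -> s2
        reward = 1.0 if s == n_states - 1 else 0.0
        v_next = V[s + 1] if s + 1 < n_states else 0.0  # terminal value is 0
        delta = reward + gamma * v_next - V[s]          # TD error
        e[s] += 1.0                                     # bump current state's trace
        for i in range(n_states):
            V[i] += alpha * delta * e[i]                # update every traced state
            e[i] *= gamma * lam                         # decay every trace

print(V)  # all three values approach 1.0, the return from each state
```

Because the traces keep earlier states "eligible", the final reward updates s0 and s1 within the very same episode, instead of needing one extra episode per step as in the coffee example.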