In the last post, Overview of RL, we saw two different methodologies: Policy Gradient, which aims to train a policy (the Actor), and Q-Learning, which aims to train a state-action value function (the Critic).
We start this post by providing some intuition behind Actor-Critic. This part is inspired by this nice post: https://towardsdatascience.com/introduction-to-actor-critic-7642bdb2b3d2.
AC = Policy Gradient + Q-Learning
I suppose you all know what Policy Gradient is. Concisely, we want to learn a policy (the Actor), so we decide to play the game 1000 times and record the total reward of each episode. This gives us an estimate of the expected reward, and the aim is to maximize it by gradient ascent. Remember this equation:

$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\,\nabla \log p_\theta(a_t^n \mid s_t^n)$$
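To make this concrete, here is a minimal sketch of how that gradient is usually turned into a loss with an automatic-differentiation library (my own illustration, not code from the referenced post; `log_probs` and `episode_reward` are assumed to have been collected by playing one episode):

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, episode_reward: float) -> torch.Tensor:
    """Surrogate loss whose gradient is -R(tau) * sum_t grad log pi(a_t | s_t).

    log_probs      : 1-D tensor of log pi_theta(a_t | s_t) for every step t
    episode_reward : total reward R(tau) of that episode (a plain number)
    """
    # Gradient ascent on the expected reward == gradient descent on the negative.
    return -(episode_reward * log_probs).sum()
```

Averaging this loss over many sampled episodes recovers the $\frac{1}{N}\sum_n$ in the equation above.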
The $R(\tau^n)$ here is the same for all state-action pairs in a given trajectory, which is not a good thing. Some extra work is done:

$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \left(\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n - b\right)\nabla \log p_\theta(a_t^n \mid s_t^n)$$
$\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n$ here is the discounted reward from time $t$, and the baseline $b$ is the average reward over all actions taken in this state. Subtracting $b$ helps avoid sampling issues such as low-weighted actions being sampled less and less.
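Here is a small sketch of that extra work (again my own illustration; the `rewards` list and the mean-return baseline are just stand-ins for the quantities above):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = sum over t' >= t of gamma^(t'-t) * r_t', computed backwards."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [1.0, 0.0, 2.0, 1.0]          # rewards of one sampled episode
returns = discounted_returns(rewards)    # discounted reward from each time t
weights = returns - returns.mean()       # subtract a simple baseline b
# weights[t] is what multiplies grad log pi(a_t | s_t) in the gradient.
```

Here the mean return serves as a crude constant baseline; the text above describes a state-dependent baseline, which is exactly where the Critic will come in.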
Now the last part of the equation reminds us of Q-Learning:

$$\mathbb{E}\!\left[\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n\right] = Q^{\pi_\theta}(s_t^n, a_t^n), \qquad b = V^{\pi_\theta}(s_t^n)$$
And it is now clear that the $\nabla \log p_\theta(a_t^n \mid s_t^n)$ part is the Actor, while the $Q^{\pi_\theta}(s_t^n, a_t^n) - V^{\pi_\theta}(s_t^n)$ part is the Critic. The value given by the Critic, $A(s,a) = Q(s,a) - V(s)$, is called the Advantage.
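Putting the two parts together, below is a minimal advantage actor-critic sketch in PyTorch (my own illustration; the network sizes, the use of the discounted return $G_t$ as a stand-in for $Q(s_t, a_t)$, and the 0.5 critic weight are assumptions, not the post's code):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared body with a policy head (Actor) and a state-value head (Critic)."""

    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # Actor: logits over actions
        self.value_head = nn.Linear(hidden, 1)           # Critic: V(s)

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a2c_loss(model, obs, actions, returns):
    """Advantage actor-critic loss on a batch of transitions.

    returns : discounted returns G_t, used here as an estimate of Q(s_t, a_t).
    """
    logits, values = model(obs)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantage = returns - values                         # A(s,a) = Q(s,a) - V(s)
    actor_loss = -(advantage.detach() * chosen).mean()   # policy gradient weighted by A
    critic_loss = advantage.pow(2).mean()                # regress V(s) toward G_t
    return actor_loss + 0.5 * critic_loss
```

The `detach()` keeps the Critic's error from flowing back through the Actor term, so each head is trained only by its own part of the loss.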