Actor-Critic and DDPG

In the last post, Overview of RL, we saw two different methodologies: Policy Gradient, which aims at training a policy (the Actor), and Q-Learning, which aims at training a state-action value function (the Critic).

We start this post by building some intuition for Actor-Critic. This part is inspired by this nice post: https://towardsdatascience.com/introduction-to-actor-critic-7642bdb2b3d2.

AC = Policy Gradient + Q-Learning

I suppose you all know what Policy Gradient is. Concisely: we want to learn a policy (the Actor), so we play the game, say, 1000 times and record the total reward of each episode. This gives us an estimate of the expected reward, which we maximize by gradient ascent. Remember this equation:

\nabla_{\theta}J(\theta) = \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_\theta(a_t|s_t)R_i
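To make this concrete, here is a minimal sketch of the corresponding REINFORCE loss, assuming a small PyTorch policy network over discrete actions; the name `policy_net` and the trajectory format are illustrative choices, not something defined in this post.

```python
import torch
from torch.distributions import Categorical

def reinforce_loss(policy_net, trajectories):
    """trajectories: list of (states, actions, total_reward) tuples, one per
    episode, where states is a float tensor [T, obs_dim], actions a long
    tensor [T], and total_reward is the scalar R_i of that episode."""
    losses = []
    for states, actions, total_reward in trajectories:
        logits = policy_net(states)                               # [T, n_actions]
        log_probs = Categorical(logits=logits).log_prob(actions)  # [T]
        # Every time step of episode i is weighted by the same total reward R_i.
        losses.append(-(log_probs * total_reward).sum())
    # Average over the N sampled episodes; minimizing this loss is
    # gradient ascent on J(theta).
    return torch.stack(losses).mean()
```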

The R_i here is the same for every state-action pair in a given trajectory, which is not a good thing. So some extra work is done:

\nabla_{\theta}J(\theta) = \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log\pi_\theta(a_t|s_t)(R_t(\tau_i)-b_t)

R_t here is the discounted reward from time t onward, and b_t is a baseline, for example the average reward over all actions taken in this state. Subtracting b_t reduces the variance of the gradient estimate and helps avoid sampling issues such as low-weighted actions being sampled less and less often.
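As a sketch of this extra work, the snippet below computes the discounted reward-to-go R_t and subtracts a simple baseline (here, the average return over the episode); this particular choice of baseline is just an assumption for illustration, not the only option.

```python
import torch

def discounted_rewards_to_go(rewards, gamma=0.99):
    """rewards: float tensor [T] of per-step rewards for one episode."""
    out = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # R_t = r_t + gamma * R_{t+1}
        out[t] = running
    return out

def reinforce_with_baseline_loss(log_probs, rewards, gamma=0.99):
    """log_probs, rewards: float tensors [T] for one episode."""
    returns = discounted_rewards_to_go(rewards, gamma)
    baseline = returns.mean()          # a crude b_t: the average return
    weights = returns - baseline       # R_t(tau_i) - b_t
    return -(log_probs * weights).sum()
```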

Now the last part of the equation reminds us of Q-Learning:

\nabla_{\theta}J(\theta) = \frac{1}{N}\sum_{i=1}^N\sum_{t=1}^T\nabla_{\theta}\log{\color{Blue} \pi_\theta(a_t|s_t)}({\color{Red} Q(s_t,a_t)-V(s_t)})

And it is now clear that the blue part is the Actor and the red part is the Critic. The quantity the Critic provides, Q(s_t,a_t)-V(s_t), is called the Advantage.
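Putting the two parts together gives the basic (advantage) Actor-Critic update. Below is a sketch of one update step, assuming PyTorch networks `actor` (outputs action logits) and `critic` (outputs a scalar V(s)); the one-step TD error r + gamma*V(s') - V(s) is used as the usual estimate of Q(s_t,a_t) - V(s_t), and the names and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def actor_critic_losses(actor, critic, s, a, r, s_next, done, gamma=0.99):
    """One transition (s, a, r, s_next, done); s and s_next are 1-D float
    tensors, a is an int action, r a float reward, done a bool."""
    v_s = critic(s).squeeze()                              # V(s_t)
    with torch.no_grad():
        v_next = torch.zeros(()) if done else critic(s_next).squeeze()
        target = r + gamma * v_next                        # bootstrapped return
        advantage = target - v_s                           # estimate of Q - V
    # Actor: increase log pi(a_t|s_t) in proportion to the advantage.
    log_prob = Categorical(logits=actor(s)).log_prob(torch.tensor(a))
    actor_loss = -log_prob * advantage
    # Critic: regress V(s_t) towards the bootstrapped target.
    critic_loss = F.mse_loss(v_s, target)
    return actor_loss, critic_loss
```

In practice the two losses are optimized together, one optimizer per network, over batches of transitions rather than a single step.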

 

 
