In the last post, Overview of RL, we saw two different methodologies: Policy Gradient, which aims to train a policy (the Actor), and Q-Learning, which aims to train a state-action value function (the Critic).
We start this post by providing some intuition behind Actor-Critic. This part is inspired by this nice post: https://towardsdatascience.com/introduction-to-actor-critic-7642bdb2b3d2.
AC = Policy Gradient + Q-Learning
I suppose you all know what Policy Gradient is. Concisely, we want to learn a policy (the Actor), so we decide to play the game 1000 times and record the total reward of each episode. This gives us an estimate of the expected reward, and the aim is to maximize it by gradient ascent. Remember this equation:

$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^n)\,\nabla \log p_\theta(a_t^n \mid s_t^n)$$
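To make this concrete, here is a minimal sketch of how that gradient is usually turned into a loss with an automatic-differentiation library (my own illustration, not code from the referenced post; `log_probs` and `episode_reward` are assumed to have been collected by playing one episode):

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, episode_reward: float) -> torch.Tensor:
    """Surrogate loss whose gradient is -R(tau) * sum_t grad log pi(a_t | s_t).

    log_probs      : 1-D tensor of log pi_theta(a_t | s_t) for every step t
    episode_reward : total reward R(tau) of that episode (a plain number)
    """
    # Gradient ascent on the expected reward == gradient descent on the negative.
    return -(episode_reward * log_probs).sum()
```

Averaging this loss over many sampled episodes recovers the $\frac{1}{N}\sum_n$ in the equation above.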
The $R(\tau^n)$ here is the same for all state-action pairs in a given trajectory, which is not a good thing. Some extra work is done:

$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \left(\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n - b\right)\nabla \log p_\theta(a_t^n \mid s_t^n)$$
$\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n$ here is the discounted reward from time $t$, and the baseline $b$ is the average reward over all actions taken in this state. Subtracting $b$ helps avoid sampling issues such as low-weighted actions being sampled less and less.
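Here is a small sketch of that extra work (again my own illustration; the `rewards` list and the mean-return baseline are just stand-ins for the quantities above):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """G_t = sum over t' >= t of gamma^(t'-t) * r_t', computed backwards."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [1.0, 0.0, 2.0, 1.0]          # rewards of one sampled episode
returns = discounted_returns(rewards)    # discounted reward from each time t
weights = returns - returns.mean()       # subtract a simple baseline b
# weights[t] is what multiplies grad log pi(a_t | s_t) in the gradient.
```

Here the mean return serves as a crude constant baseline; the text above describes a state-dependent baseline, which is exactly where the Critic will come in.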
Now the last part of the equation reminds us of Q-Learning:

$$\mathbb{E}\!\left[\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n\right] = Q^{\pi_\theta}(s_t^n, a_t^n), \qquad b = V^{\pi_\theta}(s_t^n)$$
And it is now clear that the $\nabla \log p_\theta(a_t^n \mid s_t^n)$ part is the Actor, while the $Q^{\pi_\theta}(s_t^n, a_t^n) - V^{\pi_\theta}(s_t^n)$ part is the Critic. The value given by the Critic, $A(s,a) = Q(s,a) - V(s)$, is called the Advantage.
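Putting the two parts together, below is a minimal advantage actor-critic sketch in PyTorch (my own illustration; the network sizes, the use of the discounted return $G_t$ as a stand-in for $Q(s_t, a_t)$, and the 0.5 critic weight are assumptions, not the post's code):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared body with a policy head (Actor) and a state-value head (Critic)."""

    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)  # Actor: logits over actions
        self.value_head = nn.Linear(hidden, 1)           # Critic: V(s)

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def a2c_loss(model, obs, actions, returns):
    """Advantage actor-critic loss on a batch of transitions.

    returns : discounted returns G_t, used here as an estimate of Q(s_t, a_t).
    """
    logits, values = model(obs)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantage = returns - values                         # A(s,a) = Q(s,a) - V(s)
    actor_loss = -(advantage.detach() * chosen).mean()   # policy gradient weighted by A
    critic_loss = advantage.pow(2).mean()                # regress V(s) toward G_t
    return actor_loss + 0.5 * critic_loss
```

The `detach()` keeps the Critic's error from flowing back through the Actor term, so each head is trained only by its own part of the loss.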