Reinforcement Learning | COMA

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"本文首發於:","attrs":{}},{"type":"link","attrs":{"href":"https://xingzheai.cn/details/ec62be1696c","title":"","type":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"行者AI","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在多agent的強化學習算法中,前面我們講了QMIX,其實VDN是QMIX的一個特例,當求導都爲1的時候,QMIX就變成了VDN。QTRAN也是一種關於值分解的問題,在實際的問題中QTRAN效果沒有QMIX效果好,主要是QTRAN的約束條件太過於鬆散,導致實際沒有理論效果好。但是QTRAN有兩個版本,QTRAN_BASE和QTRAN_ALT,第二版本效果比第一要好,在大部分實際問題中和QMIX的效果差不多。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上述的算法都是關於值分解的,每個agent的回報都是一樣的。如果在一局王者榮耀的遊戲中,我方大順風,我方一名角色去1打5,導致陣亡,然後我方4打5,由於我方處於大優勢,我方團滅對方,我方所有的agent都獲得正的獎勵。開始去1打5的agnet也獲得了一個正的獎勵,顯然他的行爲是不能獲得正的獎勵。就出現了“喫大鍋飯”的情況,置信度分配不均。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"COMA算法就解決了這種問題,利用反事實基線來解決置信度分配的問題。COMA是一種“非中心化”的策略控制系統。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Actor-Critic","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"COMA主要採樣了Actor-Critic的主要思想,一種基於策略搜索的方法,中心式評價,邊緣式決策。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 
COMA","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"COMA主要使用反事實基線來解決置信分配問題。在協作智能體的系統中,判斷一個智能體執行一個動作的的貢獻有多少,智能體選取一個動作成爲默認動作(以一種特殊的方式確認默認動作),分別執行較默認動作和當前執行的動作,比較出這兩個動作的優劣性。這種方式需要模擬一次默認動作進行評估,顯然這種方式增加了問題的複雜性。在COMA中並沒有設置默認動作,就不用額外模擬這基線,直接採用當前策略計算智能體的邊緣分佈來計算這個基線。COMA採用這種方式大大減少了計算量。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基線的計算:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"katexblock","attrs":{"mathString":"\\sum_{u'a}\\pi^a(u^{'a}|\\tau^a)Q(s,(u^{-a},u^{'a}))"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"COMA網絡結構","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7b/7b0007b4564d2cd8cf410fd4791b1567.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中(a)表示COMA的集中式網絡結構,(b)表示actior的網絡結構,(c)表示Critic的網絡結構。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. 算法流程","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"初始化actor_network,eval_critic_network,target_critic_network,將eval_critic_network的網絡參數複製給target_critic_network。初始化buffer ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"D"}},{"type":"text","text":",容量爲","attrs":{}},{"type":"katexinline","attrs":{"mathString":"M"}},{"type":"text","text":",總迭代輪數","attrs":{}},{"type":"katexinline","attrs":{"mathString":"T"}},{"type":"text","text":",target_critic_network網絡參數更新頻率","attrs":{}},{"type":"katexinline","attrs":{"mathString":"p"}},{"type":"text","text":"。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"katexinline","attrs":{"mathString":"for"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"t"}},{"type":"text","text":"=","attrs":{}},{"type":"katexinline","attrs":{"mathString":"1"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"to"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"T"}},{"type":"text","text":" 
","attrs":{}},{"type":"katexinline","attrs":{"mathString":"do"}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)初始化環境","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)獲取環境的","attrs":{}},{"type":"katexinline","attrs":{"mathString":"S"}},{"type":"text","text":",每個agent的觀察值","attrs":{}},{"type":"katexinline","attrs":{"mathString":"O"}},{"type":"text","text":",每個agent的","attrs":{}},{"type":"katexinline","attrs":{"mathString":"avail"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"action"}},{"type":"text","text":",獎勵","attrs":{}},{"type":"katexinline","attrs":{"mathString":"R"}},{"type":"text","text":"。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3)","attrs":{}},{"type":"katexinline","attrs":{"mathString":"for"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"step=1"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"to"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"episode"}},{"type":"text","text":"_","attrs":{}},{"type":"katexinline","attrs":{"mathString":"limit"}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"a)每個agent通過actor_network,獲取每個動作的概率,隨機sample獲取動作","attrs":{}},{"type":"katexinline","attrs":{"mathString":"action"}},{"type":"text","text":"。actor_network,採用的GRU循環層,每次都要記錄上一次的隱藏層。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"b)執行","attrs":{}},{"type":"katexinline","attrs":{"mathString":"action"}},{"type":"text","text":",將","attrs":{}},{"type":"katexinline","attrs":{"mathString":"S"}},{"type":"text","text":",","attrs":{}},{"type":"katexinline","attrs":{"mathString":"S_{next}"}},{"type":"text","text":",每個agent的觀察值","attrs":{}},{"type":"katexinline","attrs":{"mathString":"O"}},{"type":"text","text":",每個agent的","attrs":{}},{"type":"katexinline","attrs":{"mathString":"avail"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"action"}},{"type":"text","text":",每個agent的","attrs":{}},{"type":"katexinline","attrs":{"mathString":"next"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"avail"}},{"type":"text","text":" 
","attrs":{}},{"type":"katexinline","attrs":{"mathString":"action"}},{"type":"text","text":",獎勵","attrs":{}},{"type":"katexinline","attrs":{"mathString":"R"}},{"type":"text","text":",選擇的動作","attrs":{}},{"type":"katexinline","attrs":{"mathString":"u"}},{"type":"text","text":",env是否結束","attrs":{}},{"type":"katexinline","attrs":{"mathString":"terminated"}},{"type":"text","text":",存入經驗池","attrs":{}},{"type":"katexinline","attrs":{"mathString":"D"}},{"type":"text","text":"。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"c)","attrs":{}},{"type":"katexinline","attrs":{"mathString":"if"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"len(D)"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":">="}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"M"}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"d)隨機從","attrs":{}},{"type":"katexinline","attrs":{"mathString":"D"}},{"type":"text","text":"中採樣一些數據,但是數據必須是不同的episode中的相同transition。因爲在選動作時不僅需要輸入當前的inputs,還要給神經網絡輸入hidden_state,hidden_state和之前的經驗相關,因此就不能隨機抽取經驗進行學習。所以這裏一次抽取多個episode,然後一次給神經網絡傳入每個episode的同一個位置的transition。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"e)","attrs":{}},{"type":"katexinline","attrs":{"mathString":"td_error =G_t-Q_eval"}},{"type":"text","text":"計算loss,更新Critic參數。","attrs":{}},{"type":"katexinline","attrs":{"mathString":"G_t"}},{"type":"text","text":"表示從狀態","attrs":{}},{"type":"katexinline","attrs":{"mathString":"S"}},{"type":"text","text":",到結束,獲得的總獎勵。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"f)通過當前策略計算每個agent的每個step的基線,基線計算公式:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"katexblock","attrs":{"mathString":"\\sum_{u'a}\\pi^a(u^{'a}|\\tau^a)Q(s,(u^{-a},u^{'a}))(邊緣分佈)"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"g)計算執行當前動作的優勢advantage:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"katexblock","attrs":{"mathString":"A^a(s,u) = 
f) Use the current policy to compute every agent's baseline at each step (the marginal expectation over that agent's own actions):

$$
\sum_{u'^a}\pi^a(u'^a\mid\tau^a)\,Q\bigl(s,(u^{-a},u'^a)\bigr)
$$

g) Compute the advantage of the action actually executed:

$$
A^a(s,u) = Q(s,u)-\sum_{u'^a}\pi^a(u'^a\mid\tau^a)\,Q\bigl(s,(u^{-a},u'^a)\bigr)
$$

h) Compute the loss and update the actor network parameters:

$$
loss=((advantage*select\_action\_pi\_log)*mask).sum()/mask.sum()
$$

i) if $t \,\%\, p == 0$:

j) Copy the parameters of eval_critic_network into target_critic_network.

### 4. Results comparison

![](https://static001.geekbang.org/infoq/36/367edefa686f7899b9b2c1181be718af.png)

The data come from my own runs, comparing QMIX, VDN and COMA in the same scenario.

![](https://static001.geekbang.org/infoq/f0/f030413402981f3c825a455e7004e14e.png)

![](https://static001.geekbang.org/infoq/e0/e0a2430f1936314c82cf9b061d6e40b6.png)

### 5. Summary

The algorithm as described in the COMA paper is appealing in principle, but in practice, as the two figures above show, COMA's performance is not ideal, and in ordinary scenarios it does not match QMIX. In real environments I suggest readers try VDN, QMIX and the like; COMA is not the one to "lead the charge".

### 6. References

1. COMA: https://arxiv.org/abs/1705.08926