Reinforcement Learning | COMA

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"本文首发于:","attrs":{}},{"type":"link","attrs":{"href":"https://xingzheai.cn/details/ec62be1696c","title":"","type":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"行者AI","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在多agent的强化学习算法中,前面我们讲了QMIX,其实VDN是QMIX的一个特例,当求导都为1的时候,QMIX就变成了VDN。QTRAN也是一种关于值分解的问题,在实际的问题中QTRAN效果没有QMIX效果好,主要是QTRAN的约束条件太过于松散,导致实际没有理论效果好。但是QTRAN有两个版本,QTRAN_BASE和QTRAN_ALT,第二版本效果比第一要好,在大部分实际问题中和QMIX的效果差不多。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上述的算法都是关于值分解的,每个agent的回报都是一样的。如果在一局王者荣耀的游戏中,我方大顺风,我方一名角色去1打5,导致阵亡,然后我方4打5,由于我方处于大优势,我方团灭对方,我方所有的agent都获得正的奖励。开始去1打5的agnet也获得了一个正的奖励,显然他的行为是不能获得正的奖励。就出现了“吃大锅饭”的情况,置信度分配不均。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"COMA算法就解决了这种问题,利用反事实基线来解决置信度分配的问题。COMA是一种“非中心化”的策略控制系统。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Actor-Critic","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"COMA主要采样了Actor-Critic的主要思想,一种基于策略搜索的方法,中心式评价,边缘式决策。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 
COMA","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"COMA主要使用反事实基线来解决置信分配问题。在协作智能体的系统中,判断一个智能体执行一个动作的的贡献有多少,智能体选取一个动作成为默认动作(以一种特殊的方式确认默认动作),分别执行较默认动作和当前执行的动作,比较出这两个动作的优劣性。这种方式需要模拟一次默认动作进行评估,显然这种方式增加了问题的复杂性。在COMA中并没有设置默认动作,就不用额外模拟这基线,直接采用当前策略计算智能体的边缘分布来计算这个基线。COMA采用这种方式大大减少了计算量。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基线的计算:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"katexblock","attrs":{"mathString":"\\sum_{u'a}\\pi^a(u^{'a}|\\tau^a)Q(s,(u^{-a},u^{'a}))"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"COMA网络结构","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/7b/7b0007b4564d2cd8cf410fd4791b1567.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"图中(a)表示COMA的集中式网络结构,(b)表示actior的网络结构,(c)表示Critic的网络结构。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. 算法流程","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"初始化actor_network,eval_critic_network,target_critic_network,将eval_critic_network的网络参数复制给target_critic_network。初始化buffer ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"D"}},{"type":"text","text":",容量为","attrs":{}},{"type":"katexinline","attrs":{"mathString":"M"}},{"type":"text","text":",总迭代轮数","attrs":{}},{"type":"katexinline","attrs":{"mathString":"T"}},{"type":"text","text":",target_critic_network网络参数更新频率","attrs":{}},{"type":"katexinline","attrs":{"mathString":"p"}},{"type":"text","text":"。","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"katexinline","attrs":{"mathString":"for"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"t"}},{"type":"text","text":"=","attrs":{}},{"type":"katexinline","attrs":{"mathString":"1"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"to"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"T"}},{"type":"text","text":" 
","attrs":{}},{"type":"katexinline","attrs":{"mathString":"do"}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"1)初始化环境","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2)获取环境的","attrs":{}},{"type":"katexinline","attrs":{"mathString":"S"}},{"type":"text","text":",每个agent的观察值","attrs":{}},{"type":"katexinline","attrs":{"mathString":"O"}},{"type":"text","text":",每个agent的","attrs":{}},{"type":"katexinline","attrs":{"mathString":"avail"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"action"}},{"type":"text","text":",奖励","attrs":{}},{"type":"katexinline","attrs":{"mathString":"R"}},{"type":"text","text":"。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"3)","attrs":{}},{"type":"katexinline","attrs":{"mathString":"for"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"step=1"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"to"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"episode"}},{"type":"text","text":"_","attrs":{}},{"type":"katexinline","attrs":{"mathString":"limit"}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"a)每个agent通过actor_network,获取每个动作的概率,随机sample获取动作","attrs":{}},{"type":"katexinline","attrs":{"mathString":"action"}},{"type":"text","text":"。actor_network,采用的GRU循环层,每次都要记录上一次的隐藏层。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"b)执行","attrs":{}},{"type":"katexinline","attrs":{"mathString":"action"}},{"type":"text","text":",将","attrs":{}},{"type":"katexinline","attrs":{"mathString":"S"}},{"type":"text","text":",","attrs":{}},{"type":"katexinline","attrs":{"mathString":"S_{next}"}},{"type":"text","text":",每个agent的观察值","attrs":{}},{"type":"katexinline","attrs":{"mathString":"O"}},{"type":"text","text":",每个agent的","attrs":{}},{"type":"katexinline","attrs":{"mathString":"avail"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"action"}},{"type":"text","text":",每个agent的","attrs":{}},{"type":"katexinline","attrs":{"mathString":"next"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"avail"}},{"type":"text","text":" 
","attrs":{}},{"type":"katexinline","attrs":{"mathString":"action"}},{"type":"text","text":",奖励","attrs":{}},{"type":"katexinline","attrs":{"mathString":"R"}},{"type":"text","text":",选择的动作","attrs":{}},{"type":"katexinline","attrs":{"mathString":"u"}},{"type":"text","text":",env是否结束","attrs":{}},{"type":"katexinline","attrs":{"mathString":"terminated"}},{"type":"text","text":",存入经验池","attrs":{}},{"type":"katexinline","attrs":{"mathString":"D"}},{"type":"text","text":"。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"c)","attrs":{}},{"type":"katexinline","attrs":{"mathString":"if"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"len(D)"}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":">="}},{"type":"text","text":" ","attrs":{}},{"type":"katexinline","attrs":{"mathString":"M"}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"d)随机从","attrs":{}},{"type":"katexinline","attrs":{"mathString":"D"}},{"type":"text","text":"中采样一些数据,但是数据必须是不同的episode中的相同transition。因为在选动作时不仅需要输入当前的inputs,还要给神经网络输入hidden_state,hidden_state和之前的经验相关,因此就不能随机抽取经验进行学习。所以这里一次抽取多个episode,然后一次给神经网络传入每个episode的同一个位置的transition。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"e)","attrs":{}},{"type":"katexinline","attrs":{"mathString":"td_error =G_t-Q_eval"}},{"type":"text","text":"计算loss,更新Critic参数。","attrs":{}},{"type":"katexinline","attrs":{"mathString":"G_t"}},{"type":"text","text":"表示从状态","attrs":{}},{"type":"katexinline","attrs":{"mathString":"S"}},{"type":"text","text":",到结束,获得的总奖励。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"f)通过当前策略计算每个agent的每个step的基线,基线计算公式:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"katexblock","attrs":{"mathString":"\\sum_{u'a}\\pi^a(u^{'a}|\\tau^a)Q(s,(u^{-a},u^{'a}))(边缘分布)"}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"g)计算执行当前动作的优势advantage:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"katexblock","attrs":{"mathString":"A^a(s,u) = 

### 4. Results Comparison

![](https://static001.geekbang.org/infoq/36/367edefa686f7899b9b2c1181be718af.png)

These are results from my own runs, comparing QMIX, VDN and COMA in the same scenario.

![](https://static001.geekbang.org/infoq/f0/f030413402981f3c825a455e7004e14e.png)

![](https://static001.geekbang.org/infoq/e0/e0a2430f1936314c82cf9b061d6e40b6.png)

### 5. Summary

The algorithm reads very well in the paper, but in practical scenarios, as the two figures above show, COMA's performance is not ideal and is generally worse than QMIX's. My suggestion for readers is to try VDN, QMIX and the like first in real environments; COMA is not the one to "lead the charge".

### 6. References

1. COMA: https://arxiv.org/abs/1705.08926