# Unity ML-Agents Parameter Settings Explained

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"本文首發於:","attrs":{}},{"type":"link","attrs":{"href":"https://xingzheai.cn/details/eccfa000888","title":"","type":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"行者AI","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Unity 是全球最受歡迎的遊戲開發引擎之一,有大量的遊戲開發者在使用Unity開發他們的遊戲。在這個AI、大數據等流行詞遍佈各行各業的時代,Unity也沒有被潮流拋下,推出了他們自己的基於深度強化學習來訓練遊戲AI的工具包Unity ML-agents。這個工具包功能豐富,十分強大。可以幫助你在你的遊戲內實現一個新的AI算法,並且快速的用到你的遊戲當中。這麼強大的工具包難以在一篇文章裏面概括其所有功能。本文就先拋磚引玉,稍微討論一下Unity ML-agents訓練的時候需要用到的各種參數的意義,其常用的取值又是如何。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文所有內容參考Unity ML-agents的官方文檔(地址:https://github.com/Unity-Technologies/ml-agents/tree/main/docs)","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 訓練參數設置","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在你開始你的訓練之前,你需要針對你的訓練任務設定一個訓練參數文件(一般是一個.yaml文件)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來就簡單介紹一下ml-agents環境裏的參數設置概要。本文主要參考ml-agents最新版本關於參數設置的官方文檔,做了一些概括性的翻譯並加入一定個人理解。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"具體文檔地址:https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"訓練參數主要分爲常用訓練參數(Common Trainer Configurations), 訓練方式專用參數(Trainer-specific Configurations),超參數(hyper-parameters) ,獎勵信號參數(Reward Signals), 行爲克隆(Behavioral Cloning),使用RNN增強智能體記憶能力的參數(Memory-enhanced Agents using Recurrent Neural Networks),以及自我對抗訓練參數(Self-Play)這幾個大的模塊。這幾個模塊下面又有一些小的模塊,在後文會進一步說明,而且這些模塊並不需要總是全都去設定。事實上,除了前三個模塊是幾乎每個環境訓練必須的參數之外,其他的模塊僅用在需要使用到對應功能的訓練任務。接下來具體說明每個參數的含義。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. 
### 2. Common Trainer Configurations

- **trainer_type**: (default = `ppo`) Selects the training algorithm. For now only Proximal Policy Optimization (PPO, specifically the OpenAI PPO2 variant), Soft Actor-Critic (SAC) and MA-POCA are supported; the first two are single-agent algorithms. Note: after switching algorithms, remember to revisit the parameters below. Many of them have different sensible ranges for different algorithms and do not carry over seamlessly; the differences are pointed out where relevant.
- **summary_freq**: (default = `50000`) The number of steps between each recording of training statistics.
- **time_horizon**: (default = `64`) How many steps of experience an agent collects before that segment is added to the experience buffer. It also determines how many steps of real samples are used when estimating the expected return of an action. Roughly speaking, the larger this value, the closer the estimate is to the true return of a whole episode, so the bias is smaller. But then an entire episode has to be played before the expected reward of an action can be updated, which takes a long time, and conditions can differ a lot between episodes: the same action may end up with very different final returns (often because the action does not actually influence the game that much), which raises the variance. Conversely, sampling too few steps can bias the estimate of the final return heavily while keeping the variance low. This is the same bias-variance trade-off as the classic simple-versus-complex-model (under/overfitting) question in machine learning. **The official advice is to lower this value when the environment is heavy (each step takes long to run) or the rewards are dense, and to raise it otherwise; for instance, in a task with very sparse rewards such as the soccer environment, the sample configuration sets it to 1000.** Note that this parameter controls how trajectories are sampled and is therefore related to batch_size, buffer_size, num_epoch and others; the relationships are explained when those parameters come up. **Typical range: 32 - 2048**
- **max_steps**: (default = `500000`) The total number of steps for this training run. If several agents share the same behavior, every one of their steps counts toward this total; likewise, if several environment instances run in parallel on multiple machines, the steps of all agents in all of them are counted. So when training many agents in parallel, be sure to set this value higher. **Typical range: 5e5 - 1e7**
- **keep_checkpoints**: (default = `5`) How many checkpoints produced during training are kept; a new checkpoint is written every checkpoint_interval steps.
- **checkpoint_interval**: (default = `500000`) As just mentioned, the number of steps between saved checkpoints of your model.
- **init_path**: (default = None) Whether the model should continue training from a previously trained checkpoint. It takes the exact path of the model, for example `./models/{run-id}/{behavior_name}`. In practice the `--initialize-from` CLI option is already enough to initialize every behavior from the same saved run; this parameter exists so that different behaviors can continue from different models (set individually). **Rarely used.**
- **threaded**: (default = `false`) Runs the trainer in a separate Python thread so the environment can keep stepping while the model is being updated, avoiding time lost waiting on updates and I/O. **The official advice is to avoid this when self-play is enabled.** A short example of these common fields follows this list.
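As a quick illustration, the common fields above all sit directly under the behavior name; the name `MyBehavior` and the values below are placeholders of my own rather than recommendations:

```yaml
behaviors:
  MyBehavior:                   # placeholder behavior name
    trainer_type: ppo
    summary_freq: 50000         # record training statistics every 50k steps
    time_horizon: 128           # steps collected per agent before writing to the buffer
    max_steps: 2000000          # total steps, summed over all agents and environments
    keep_checkpoints: 5
    checkpoint_interval: 500000
    threaded: false             # leave off when self-play is used
```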
### 3. Common Hyperparameters

- **hyperparameters → learning_rate**: (default = `3e-4`) The learning rate for gradient descent. If it is too large, training becomes unstable (the reward does not climb steadily and oscillates heavily). **Typical range: 1e-5 - 1e-3**
- **hyperparameters → batch_size**: The number of (state, action, reward, ...) tuples used for each gradient update. **Note: this should always be a fraction of buffer_size, because the buffer is first split into batches of batch_size, so buffer_size must be divisible by batch_size.** For a task with a continuous action space this should be on the order of 1000 or more (a continuous action space needs as many samples as possible to provide varied data); for a discrete action space, values from the tens to a few hundred are usually enough, depending on the size of the action space. **Typical ranges: (continuous, PPO) 512 - 5120; (continuous, SAC) 128 - 1024; (discrete, PPO & SAC) 32 - 512.**
- **hyperparameters → buffer_size**: (default = `10240` for PPO and `50000` for SAC) PPO: the trainer first collects buffer_size steps before a round of training starts. As described above, once buffer_size tuples have been gathered, the buffer is split into buffer_size / batch_size batches, each batch is used for one gradient update, and this pass over the buffer is repeated num_epoch times (num_epoch is another hyperparameter). **So after every buffer_size samples, the parameters are actually updated num_epoch × (buffer_size / batch_size) times** (see the worked example after this list). A larger buffer_size generally gives a more stable training result. SAC: **for SAC, the experience buffer should typically be thousands of times the episode length, so that training can draw on both old and recent samples. Typical ranges: (PPO) 2048 - 409600; (SAC) 50000 - 1000000**
- **hyperparameters → learning_rate_schedule**: (default = `linear` for PPO and `constant` for SAC) Whether learning-rate decay is used to stabilize training. **The two options are linear and constant: the former decays the learning rate linearly, the latter leaves it unchanged.** In general, a large learning rate tends to make training unstable, while a small one can leave training failing to converge for a long time. Decaying the learning rate therefore starts with a larger rate to quickly find the direction of an optimum, then gradually shrinks it so that training settles near the optimum instead of bouncing back and forth. **The official advice is to enable linear decay for PPO to converge faster, and to keep the learning rate constant for SAC.**
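To make the relationship between these hyperparameters explicit, the number of gradient updates performed each time the buffer fills works out as below; the concrete numbers are taken from the sample configuration at the end of this article:

```latex
\text{updates per buffer fill}
  = \text{num\_epoch} \times \frac{\text{buffer\_size}}{\text{batch\_size}},
\qquad \text{e.g. } 8 \times \frac{20480}{2048} = 80
```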
### 4. Common Network Settings

- **network_settings → hidden_units**: (default = `128`) The number of units in the hidden layers of the network, in other words its width. This value determines how much capacity the network has to represent the game state. Simply put, wider hidden layers tend to represent larger observation spaces better, so this value should grow as the observation space grows. (A combined sketch of these settings appears at the end of this section.) **Typical range: 32 - 512**
- **network_settings → num_layers**: (default = `2`) The number of layers in the network; the larger this number, the deeper the model. Although stacking more layers can improve the model's ability to represent the environment, deeper is not automatically better: too many layers tend to cause vanishing gradients and slow training down. Prefer increasing hidden_units before increasing this parameter. **Typical range: 1 - 3**
- **network_settings → normalize**: (default = `false`) Whether the input observation vectors are normalized, roughly as (obs - mean) / std, where mean and std are the mean and standard deviation of the observations. The official advice is that normalization can help on complex continuous-action tasks, but may be counterproductive on simpler discrete tasks.
- **network_settings → vis_encode_type**: (default = `simple`) The architecture of the encoder. Keep in mind that PPO or SAC is only the training framework; different tasks still need different encoders even with the same algorithm. In a card game, for example, the whole observation vector can be handled with plain fully connected layers, but in an action game where the observation is an image you are better off using a CNN as the feature-extraction encoder. This parameter selects that encoder. **simple** is a two-layer convolutional network; **nature_cnn** uses the implementation from the "human-level control through deep reinforcement learning" paper; **resnet** uses the IMPALA ResNet architecture, whose strengths are strong pattern extraction and resistance to vanishing gradients; **match3** is a CNN better suited to board-style games, with a smaller structure that can still capture local spatial information; **fully_connected** is a single fully connected layer. Because the convolution kernels have to fit, each encoder has a minimum input resolution: simple requires at least 20 × 20, nature_cnn at least 36 × 36, resnet at least 15 × 15, and match3 at least 5 × 5. Note: **using match3 on very large inputs can slow training down.**
- **network_settings → conditioning_type**: (default = `hyper`) Whether a HyperNetwork is used to process the task's goal information. A hyper network has a large number of parameters, so reduce the number of hidden layers when using it.
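A minimal sketch of how the network_settings options above fit together for a typical task; the values are illustrative only:

```yaml
network_settings:
  normalize: true            # (obs - mean) / std; often useful for continuous control
  hidden_units: 256          # width of each hidden layer
  num_layers: 2              # prefer widening before deepening
  vis_encode_type: simple    # only relevant when visual observations are used
```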
### 5. Trainer-specific Configurations

Next come the parameters that are specific to each training algorithm.

#### 5.1 PPO-specific Configurations

- **beta**: (default = 5.0e-3) The strength of the entropy regularization term that encourages exploring diverse policies. In short, entropy regularization adds a bonus that keeps the action policy more random, pushing it to explore different strategies and actions; the larger beta is, the more exploration is encouraged. Tune this parameter together with the entropy and reward curves in TensorBoard: if entropy is dropping while the reward still shows no improvement, increase beta so that entropy is sustained a while longer. **Typical range: 1e-4 - 1e-2**
- **epsilon**: (default = 0.2) Controls how fast the policy is allowed to change. This is the clip range from the PPO2 paper: it bounds how far each update may move the policy away from the old parameters, so that PPO can keep training as an on-policy method that relies on importance sampling. The smaller epsilon is, the less the policy can deviate from the old parameters θ' in one update, which makes training more stable at the cost of slower convergence; a larger value allows faster but less stable updates. **Typical range: 0.1 - 0.3**
- **lambd**: (default = 0.95) According to the official documentation, this controls how much the algorithm relies on the estimated value versus the rewards actually received. This lambda is in fact the λ hyperparameter of GAE (Generalized Advantage Estimation), a commonly used advantage estimator in policy-gradient methods that trades off bias and variance in the policy update. Put simply, GAE takes a weighted average of TD(0), TD(1), and so on up to TD(∞) (i.e. the Monte Carlo return). λ lies between 0 and 1: λ = 0 gives the TD(0) estimate, λ = 1 gives the Monte Carlo estimate (see the formula after this list). **Typical range: 0.9 - 0.95**
- **num_epoch**: (default = 3) The number of gradient-descent passes over the buffer per update; see the buffer_size discussion above. **Typical range: 3 - 10**
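For reference, a sketch of the GAE estimator that `lambd` parameterizes, written in the usual notation (γ is the reward-signal gamma from the next section, V the value estimate):

```latex
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}
```

With λ = 0 this collapses to the one-step TD error, and with λ = 1 it becomes the Monte Carlo return minus the value baseline.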
### 6. Reward Signals

Reward signals come in two kinds: extrinsic rewards that come from the environment, and intrinsic rewards generated internally (such as a curiosity reward). Whether intrinsic or extrinsic, every signal needs at least two parameters: a strength and a discount rate (gamma). You must define at least one reward signal, otherwise there is nothing to train on. (A combined example appears at the end of this section.)

#### 6.1 Extrinsic Rewards

- **extrinsic → strength**: (default = 1.0) The factor by which the reward received from the environment is scaled. **Typical value: 1.0**
- **extrinsic → gamma**: (default = 0.99) The discount rate for future rewards. Put simply, suppose the model receives a reward of 100 at some step: how much should the step before it be credited with? If no other reward was received at that earlier step, its discounted return is gamma × 100 = 99, the step before that gets gamma² × 100, and so on. Intuitively, the closer gamma is to 1, the more late rewards also feed back into early actions; the further it is from 1, the more the policy leans on short-term returns when learning. For tasks where rewards are sparse and only arrive after a whole sequence of actions, be sure to set this value closer to 1; otherwise it can be somewhat smaller. **Typical range: 0.8 - 0.995**

#### 6.2 Intrinsic Rewards

- **curiosity → strength**: (default = 1.0) The strength of the curiosity reward. It has to be balanced so that the curiosity signal neither drowns out the extrinsic reward nor becomes negligible next to it. **Typical range: 0.001 - 0.1**
- **curiosity → gamma**: (default = 0.99) The discount rate for future rewards, as described above.
- **curiosity → network_settings**: Mainly sets the hidden-layer size of the ICM model; it should be neither too large nor too small (64 - 256).
- **curiosity → learning_rate**: (default = 3e-4) The learning rate of the ICM model. Too large a value makes training unstable; too small a value makes convergence slow. **Typical range: 1e-5 - 1e-3**
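A minimal sketch of a reward_signals section that combines both kinds of signal; the values are illustrative, and the curiosity block is only needed when an intrinsic reward is actually wanted:

```yaml
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99            # closer to 1.0 for sparse-reward tasks
  curiosity:
    strength: 0.02         # keep small relative to the extrinsic signal
    gamma: 0.99
    learning_rate: 0.0003  # learning rate of the ICM model
```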
### 7. Memory-enhanced Agents Using Recurrent Neural Networks

#### 7.1 Memory parameters

A memory module can be added to increase the model's expressive power. Note that the memory section must be nested under network_settings (a sketch of the layout follows at the end of this section).

- **network_settings → memory → memory_size**: (default = 128) The size of the hidden state of the LSTM. This value must be even, and how large it needs to be depends on how complex the state is; it should be large enough for the agent to learn what it has to remember. **Typical range: 32 - 256**
- **network_settings → memory → sequence_length**: (default = 64) How many times the recurrent memory network is unrolled, in other words the sequence length. To train on sequences of this length, the collected experiences must be at least as long, so my guess is that, although the documentation does not say so explicitly, this value must not exceed time_horizon. If it is set too small the network cannot remember much; if it is too large, training becomes slow.

#### 7.2 Things to note when using a memory network

- LSTM networks do not perform well on continuous-action tasks; the official advice is to use them mostly on tasks with discrete action spaces.
- Adding an RNN layer increases the complexity of the network, so reduce the number of layers accordingly.
- Always remember to set memory_size to an even number.
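A sketch of how the memory block nests inside network_settings; the values are illustrative only:

```yaml
network_settings:
  hidden_units: 128
  num_layers: 1            # keep the network shallower when adding an RNN
  memory:
    memory_size: 128       # LSTM hidden-state size; must be even
    sequence_length: 64    # kept no larger than time_horizon here
```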
### 8. Self-Play

The self_play section is only needed for adversarial training. If there is only a single agent, or several agents with no meaningful interaction between them, this block does not need to be set. Adding the self_play block is also how Unity ML-Agents switches on its built-in adversarial-training machinery.

Reference: https://github.com/Unity-Technologies/ml-agents/blob/ddfe054e204415d76b39be93f5bcbec1b456d928/docs/Training-Configuration-File.md#self-play

- **trainer_steps and ghost_steps**: Before looking at the self-play parameters, two concepts need to be understood: trainer_steps and ghost_steps. In adversarial training we often freeze the parameters of some agents so that, for a certain number of steps, they act as opponents for the agents we are training. **The steps taken by the learning agents are trainer_steps, while the steps taken by those frozen-parameter opponents are ghost_steps. Why count them separately? Because some games are asymmetric (asymmetrical games)**: if we train a 2 v 1 scenario, the learning team has 2 agents while the opponent has only 1. In that case trainer_steps grows twice as fast as ghost_steps, **because steps are always counted as the total over all agents on a team**. With these two concepts in place, the parameter descriptions below become much clearer; without them they can be quite confusing.
- **save_steps**: Every save_steps trainer steps, a snapshot of the learning agents' current policy is saved. If save_steps is large enough, say 20480 with the hyperparameters of the sample configuration at the end of this article (batch_size 2048, buffer_size 20480, num_epoch 8), then at least 80 parameter updates happen before each snapshot is stored, so the difficulty curve between consecutive snapshots is steeper. That makes each training phase less stable, but it may lead to a better final result and better performance on complex tasks.
- **team_change and swap_steps**: We have already discussed what trainer steps and ghost steps are; now let's see how these two values determine how often the opponent changes during adversarial training. team_change sets how many trainer steps the same learning team keeps training for. For example, with a red and a blue football team and team_change = 10000, the red team trains for 10000 trainer steps before it is the blue team's turn. swap_steps then decides how many times the opponent is swapped within one team change: in the example above, how many different past snapshots of the blue team the red team faces during its 10000 trainer steps. There is a simple formula for this relationship: **if team_change is t, the desired number of opponent swaps is x, our team has a1 agents and the opponent team has a2 agents, then swap_steps = (a2 / a1) × (t / x); for a symmetric adversarial game this simplifies to t / x** (a worked example follows this list). **Together, team_change and swap_steps set the frequency at which opponents change. The higher that frequency, the more opponents of different levels and play styles the learning agents meet, so they can learn more, but, as analysed earlier, at the price of less stable training and parameters that are harder to converge; on the other hand, the policy each learning agent acquires is less tied to a particular opponent, more general, and less prone to overfitting.**
- **play_against_latest_model_ratio**: (default = `0.5`) The probability of playing against the current version of your own policy. The larger this value, the more often the agent plays against its own latest policy and the less often it plays against its past snapshots.
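As a quick worked example of the formula above, take a hypothetical 2 v 1 game (a1 = 2 learning agents, a2 = 1 opponent agent), team_change t = 200000 and a desired x = 4 opponent swaps per team change:

```latex
\text{swap\_steps} = \frac{a_2}{a_1} \times \frac{t}{x}
                   = \frac{1}{2} \times \frac{200000}{4}
                   = 25000
```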
Finally, here is a complete example configuration for the SoccerTwos environment with self-play enabled, with the comments translated:

```yaml
behaviors:
  SoccerTwos:                        # name of the behavior (the game) being trained
    trainer_type: ppo                # which training algorithm to use
    hyperparameters:                 # hyperparameters of the PPO algorithm
      batch_size: 2048
      buffer_size: 20480             # must be a multiple of batch_size (10x here)
      learning_rate: 0.0003
      beta: 0.005                    # entropy regularization strength
      epsilon: 0.2                   # PPO clip range
      lambd: 0.95                    # GAE lambda
      num_epoch: 8                   # training passes per buffer
      learning_rate_schedule: linear # learning-rate decay; linear here
    network_settings:                # network that processes the observations
      normalize: false
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:                  # reward settings
      extrinsic:
        gamma: 0.99
        strength: 1.0
    keep_checkpoints: 5              # keep the five most recent checkpoints
    checkpoint_interval: 100000      # save a checkpoint every 100000 steps
    max_steps: 50000000              # train for at most this many steps (summed over all agents)
    time_horizon: 1000
    summary_freq: 5000
    threaded: false                  # best left off when self-play is used
    self_play:                       # self-play settings
      save_steps: 50000
      team_change: 250000
      swap_steps: 2000
      window: 10                     # keep the ten most recent past snapshots
      play_against_latest_model_ratio: 0.5
      initial_elo: 1200.0
```