DeepMind提出引導式元學習算法,讓元學習器具備自學能力

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DeepMind的一個研究小組近期提出了一種引導式(Bootstrap)的元學習算法,用於解決元優化以及短視的元目標問題,並讓學習器具備自學能力。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大部分人類學會學習的過程都是應用過往經驗,再學習新任務。然而,將這種能力賦予人工智能時卻仍是頗具挑戰。自學意味着機器學習的學習器需要學會更新規則,而這件事一般都是由人類根據任務手動調整的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"元學習的目標是爲研究如何讓學習器學會學習,而自學也是提升人工代理效率的一個很重要的研究領域。自學的方法之一,便是讓學習器通過將新的規則應用於已完成的步驟,通過評估新規則的性能來進行學習。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了讓元學習的潛能得到全面的開發,我們需要先解決元優化和短視元目標的問題。針對這兩大問題,DeepMind的一個研究小組提出了一種新的算法,可以讓學習器學會自我學習。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/12\/126db4b2e46c60ac88602b8b6a138a98.png","alt":"Image1","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"元學習器需要先應用規則,並評估其性能才能學會更新的規則。然而,規則的應用一般都會帶來過高的計算成本。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"先前的研究中有一個假設情況:在K個應用中實施更新規則後再進行性能優化,會讓學習器在剩餘生命週期中的性能得到提升。然而,如果該假設失敗,那麼元學習器在一定週期內會存在短視偏見。除此之外,在K個更新之後再優化學習器的性能還可能會導致其無法計算到學習過程本身。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這類的元優化過程還會造成兩種瓶頸情況:"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一是曲率,元目標被限制在學習器相同類型的幾何範圍內。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"二是短視,元目標從根本上被侷限在這個評估K步驟的平面裏,從而無視掉後續的動態學習。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"論文中提出的算法包括了兩個主要特徵來解決這些問題。首先,爲減輕學習器短視的情況,算法通過bootstrap將動態學習的信息注入目標之中。至於曲率問題,論文是通過計算元目標到引導式目標的最小距離來控制曲率的。可以看出,論文中提出的算法背後的核心思想是,讓元學習器通過更少的步驟來匹配未來可能的更新,從而更效率地進行自我學習。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"  "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"該算法構建元目標有兩個步驟:"}]},{"type":"numberedlist","attrs":{"start":1,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","text":"從學習器的新參數中引導一個目標。在論文中,研究者在多個步驟中,依據元學習器的更新規則或其他的更新規則,不斷刷新元學習器的參數,從而生成新的目標。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","text":"將學習器的新參數,或者說包含元學習器參數的函數,與目標一同投射到一個匹配空間中,而這個匹配空間簡單來說可以是一個歐幾里得參數空間。爲控制曲率,研究者選擇使用另一個(僞)度量空間,舉例來說,概率模型中的一個常見選擇,KL散度(Kullback-Leibler divergence)。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/i2.wp.com\/syncedreview.com\/wp-content\/uploads\/2021\/09\/image-77.png","alt":null,"title":"引導式元梯度","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"總體來說,元學習器的目的是最小化到引導式目標的距離。爲此,研究團隊提出了一種新穎的引導式元梯度(BMG),在不新增反向傳播更新步驟的情況下將未來動態學習的信息注入。因此,BMG可以加快優化的進程,並且就如論文中展示的一樣,確保了性能的提升。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"研究團隊通過大量的實驗測試了BMG在標準元梯度下的性能。這些實驗是通過一個經典的強化學習馬爾可夫決策過程(MDP)任務,學習在特定期望下達到最優值的策略進行的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/f3\/4f\/f366cf735e7d3829fc9f9c9ec581db4f.png","alt":null,"title":"非穩態網格世界(第5.1節)左:在超過50個隨機種子之中,演員-評價者代理下的總回報對比。右:學習的熵值-正則化的時間表。","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/91\/bf\/91fe819c4328ceab2e50ebab0f6c3ebf.png","alt":null,"title":"在Atari ALE[8]的57種遊戲中,人類得分標準化。左:2億幀時,對比BMG與我們實現的STACX*的賽前得分。右:對比公佈的基準線與學習得分中位數。陰影部分表示3個隨機種子之間的標準偏差。","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/i0.wp.com\/syncedreview.com\/wp-content\/uploads\/2021\/09\/image-75.png?resize=950%2C484&ssl=1","alt":null,"title":"Atari的消融實驗。左:人類標準化得分分解,優化器(SGD,RMS),匹配函數(L2,KL,KL&V),以及引導式步驟(L)。BMG在(SGD,L2,L=1)的情況下與STACX相同。中:不同L下喫豆人關卡返回。右:在57種遊戲中關卡返回的分佈,按照平均值和標準偏差對每種遊戲進行標準化處理。所有結果均爲三個獨立隨機種子,1.9-2億幀之間觀察所得。","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在評估中,BMG在Atari ALE的基準測試中展現了大幅度的性能改進,到達了全新的技術水平。BMG同樣改善了在少數情況下模型診斷元學習(MAML)的表現,爲高效元學習探索開拓了新的可能性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"論文地址:https:\/\/arxiv.org\/abs\/2109.04504"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"},{"type":"link","attrs":{"href":"https:\/\/syncedreview.com\/2021\/09\/20\/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-107\/","title":null,"type":null},"content":[{"type":"text","text":"DeepMind’s Bootstrapped Meta-Learning Enables Meta Learners to Teach Themselves"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章