DeepMind Scientists: Reinforcement Learning Is Enough for General AI

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}},{"type":"strong"}],"text":"本文是我們對"},{"type":"link","attrs":{"href":"https:\/\/bdtechtalks.com\/tag\/ai-research-papers\/?fileGuid=qgprgqTXvgQxwTXJ","title":"","type":null},"content":[{"type":"text","marks":[{"type":"italic"}],"text":"AI研究論文的評論文章"}],"marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}},{"type":"strong"}]},{"type":"text","marks":[{"type":"italic"},{"type":"size","attrs":{"size":10}},{"type":"strong"}],"text":"之一,這個系列主要探索人工智能領域的最新發現。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在創造人工智能的長達數十年的旅途中,計算機科學家設計並開發了各種複雜的機制和技術來複制視覺、語言、推理、運動技能和其他與智慧生命相關的能力。雖然這些努力已經帶來了可以在有限環境中有效解決特定問題的AI系統,但他們還沒有開發出見於人類和動物中的那種通用智能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/57\/7f\/57ca0a510aab2be7791d6d3c55769f7f.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在《人工智能》期刊提交給同行評審的一篇新論文中,英國人工智能實驗室DeepMind的科學家認爲,智能及其相關能力不是通過形成和解決複雜問題而產生的,而是源於長期遵循一個簡單而強大的原則:獎勵最大化."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這篇題爲“"},{"type":"link","attrs":{"href":"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S0004370221000862?fileGuid=qgprgqTXvgQxwTXJ","title":"","type":null},"content":[{"type":"text","text":"獎勵就夠了"}]},{"type":"text","text":"”的論文(在本文撰寫時仍處於預證明階段)從自然智能進化的相關研究以及人工智能的最新成就中汲取了靈感。作者認爲,獎勵最大化和試錯經驗足以培養出可表現與智力相關能力的行爲。由此他們得出結論,強化學習這一基於獎勵最大化理念的人工智能分支,可以引領通用人工智能的發展。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"AI的兩條路徑"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/resource\/image\/97\/ea\/972401287b5c63bca2b4dd2d061220ea.jpg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"創建AI的一種常見方法是嘗試在計算機中複製智能行爲的元素。例如,我們對哺乳動物視覺系統的理解催生了各種視覺人工智能系統,這些系統可以對圖像分類、定位照片中的對象、定義對象之間的邊界等等。同樣,我們對語言的理解有助於開發各種自然語言處理系統,例如問答、文本生成和機器翻譯等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這些都是狹義人工智能的實例,這些系統旨在執行特定任務,不具備解決一般問題的能力。一些科學家認爲,拼裝多個狹義的人工智能模塊會製成更高級別的智能系統。例如,你可以發展一個軟件系統,其綜合運用單獨的計算機視覺、語音處理、NLP和電機控制模塊,以解決需要多種技能的複雜問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DeepMind研究人員提出的另一種創建人工智能的方法,是重新創建產生自然智能的簡單而有
This is basically how nature works. From a scientific standpoint, there is no top-down intelligent design in the complex organisms we see around us. Billions of years of natural selection and random variation have filtered out lifeforms fit for survival and reproduction. Living beings that were better equipped to handle the challenges and situations in their environments managed to survive and reproduce; the rest were eliminated.

This simple yet efficient mechanism has driven the evolution of living beings with all kinds of skills and abilities to perceive, navigate, modify their environments, and communicate with one another.

"The natural world faced by animals and humans, and presumably also the environments faced in the future by artificial agents, are inherently so complex that they require sophisticated abilities in order to succeed (for example, to survive) within those environments," the researchers write. "Thus, success, as measured by maximising reward, demands a variety of abilities associated with intelligence. In such environments, any behaviour that maximises reward must necessarily exhibit those abilities. In this sense, the generic objective of reward maximisation contains within it many or possibly even all the goals of intelligence."

For example, consider a squirrel that seeks the reward of reducing hunger. On the one hand, its sensory and motor skills help it locate and collect nuts when food is available. But a squirrel that can only find food is bound to starve when food becomes scarce. This is why it also has planning skills and memory to cache the nuts and store them for winter, as well as the social skills and knowledge to make sure other animals don't steal its nuts. At a larger scale, reducing hunger can be a subgoal of "staying alive," which also demands skills such as detecting and hiding from dangerous animals, protecting oneself from environmental threats, and seeking out better habitats as the seasons change.

"When abilities associated with intelligence arise as solutions to a singular goal of reward maximisation, this may in fact provide a deeper understanding, since it explains why such an ability arises," the researchers write. "In contrast, when each ability is understood as the solution to its own specialised goal, the why question is side-stepped in order to focus upon what that ability does."

Finally, the researchers argue that the "most general and scalable" way to maximize reward is through agents that learn through continual interaction with the environment.

Developing abilities through reward maximization

[Image: https://static001.geekbang.org/resource/image/65/d2/6549378914e4d2d8e58b3b6c647c71d2.jpg]

In the paper, the AI researchers provide some high-level examples of how "intelligence and associated abilities will implicitly arise in the service of maximising one of many possible reward signals, corresponding to the many pragmatic goals towards which natural or artificial intelligence may be directed."

For example, sensory skills serve the need to survive in complicated environments. Object recognition enables animals to detect food, prey, friends, and threats, or to find paths, shelters, and perches. Image segmentation enables them to tell the difference between different objects and to avoid fatal mistakes such as running off a cliff or falling off a branch. Meanwhile, hearing helps animals detect threats they cannot see or find prey that is camouflaged. Touch, taste, and smell give animals the advantage of a richer sensory experience of their habitat and a better chance of survival in dangerous environments.

Rewards and environments also shape the innate knowledge of animals. For instance, a dangerous habitat ruled by predators such as lions and cheetahs rewards ruminant species that are born with the innate knowledge to run away from threats. Meanwhile, animals are also rewarded for their ability to learn knowledge specific to their habitats, such as where to find food and shelter.
The researchers also discuss the reward-powered basis of language, social intelligence, imitation, and, finally, general intelligence, which they describe as "maximising a singular reward in a single, complex environment."

Here, they draw an analogy between natural intelligence and AGI: "An animal's stream of experience is sufficiently rich and varied that it may demand a flexible ability to achieve a vast variety of subgoals (such as foraging, fighting, or fleeing), in order to succeed in maximising its overall reward (such as hunger or reproduction). Similarly, if an artificial agent's stream of experience is sufficiently rich, then many goals (such as battery-life or survival) may implicitly demand the ability to achieve an equally wide variety of subgoals, and the maximisation of reward should therefore be enough to yield an artificial general intelligence."

Reinforcement learning for reward maximization

[Image: https://static001.geekbang.org/resource/image/80/a0/80d575d8cyyc7e9ab863172858fa4ba0.jpg]

Reinforcement learning is a special branch of AI algorithms composed of three key elements: an environment, agents, and rewards.

By performing actions, the agent changes its own state and that of the environment. Based on how much those actions affect the goal the agent must achieve, it is rewarded or penalized. In many reinforcement learning problems, the agent has no initial knowledge of the environment and starts by taking random actions. Based on the feedback it receives, the agent learns to tune its behavior and develop policies that maximize its reward.
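The sketch below makes that loop concrete. It is a minimal tabular Q-learning agent on a hypothetical one-dimensional corridor environment invented for illustration (not code from the paper), showing an agent that starts with zero knowledge, acts, observes rewards, and gradually shifts its policy toward higher cumulative reward:

```python
import random

# A hypothetical 1-D corridor: the agent starts at cell 0 and is
# rewarded only when it reaches the goal cell on the right.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]  # move left or right

def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# Q-table: the agent's learned estimate of long-term reward for each
# (state, action) pair, initialized to zero (no initial knowledge).
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(500):
    state, done = 0, False
    while not done:
        # Explore randomly some of the time; otherwise exploit
        # the current value estimates.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Nudge the estimate toward the observed reward plus the
        # discounted value of the best next action.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy policy should point toward the goal.
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)])
```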
In their paper, the DeepMind researchers suggest reinforcement learning as the main algorithm that can replicate the reward-maximization mechanism seen in nature and can eventually lead to artificial general intelligence.

"If an agent can continually adjust its behaviour so as to improve its cumulative reward, then any abilities that are repeatedly demanded by its environment must ultimately be produced in the agent's behaviour," the researchers write, adding that, in the course of maximizing its reward, a good reinforcement learning agent could eventually learn perception, language, social intelligence, and other abilities.

In the paper, the researchers provide several examples that show how reinforcement learning agents have been able to learn general skills in games and robotic environments.

However, the researchers also stress that some fundamental challenges remain unsolved. For instance, they say, "We do not offer any theoretical guarantee on the sample efficiency of reinforcement learning agents." Reinforcement learning is notorious for requiring huge amounts of data: an RL agent might need centuries' worth of gameplay to master a computer game. And AI researchers still don't know how to create reinforcement learning systems that can generalize their learning across multiple domains; small changes to the environment often require the full retraining of the model.

The researchers also acknowledge that the learning mechanisms for reward maximization remain an unsolved problem and a central question still to be studied in reinforcement learning.

Strengths and weaknesses of reward maximization

[Image: https://static001.geekbang.org/resource/image/ce/e6/ce54967f3235d1425f3c3fb06f1bc9e6.jpg]

Patricia Churchland, neuroscientist, philosopher, and professor emerita at the University of California, San Diego, described the ideas in the paper as "a very careful and insightful solution."

However, Churchland pointed to possible flaws in the paper's discussion of social decision-making. The DeepMind researchers focus on personal gains in social interactions. Churchland, who has recently written a book on the biological origins of moral intuitions, argues that attachment and bonding are a powerful factor in the social decision-making of mammals and birds, which is why animals put themselves in great danger to protect their children.

"I have tended to see bonding, and caring for others, as an extension of oneself: 'me-and-mine,'" Churchland said. "In that case, I think a small modification to the [paper's] hypothesis, to allow for reward maximization for 'me-and-mine,' would work very nicely. Of course, we social animals all have degrees of attachment: super strong to offspring, very strong to mates and kin, strong to friends and acquaintances, and so on. The strength of the types of attachment can vary with the environment and with developmental stage."

This is not a major criticism, Churchland said, and it could probably be worked into the hypothesis quite gracefully.

"I am impressed by the level of detail in the paper and how carefully they consider possible weaknesses," she said. "I may be wrong, but I tend to see this as a milestone."

Data scientist Herbert Roitblat challenged the paper's position that simple learning mechanisms and trial-and-error experience are enough to develop the abilities associated with intelligence. Roitblat argues that the theories presented in the paper face several challenges when it comes to implementing them in real life.

"If there are no time constraints, then trial-and-error learning might be enough. But with time constraints, we run into the problem of an infinite number of monkeys typing for an infinite amount of time," Roitblat said. The infinite monkey theorem states that a monkey hitting random keys on a typewriter for an infinite amount of time will eventually type any given text.

Roitblat is the author of Algorithms Are Not Enough, in which he explains why all current AI algorithms, including reinforcement learning, require the careful formulation of problems and representations created by humans.

"Once the model and its intrinsic representation are established, optimization or reinforcement could guide its evolution, but that does not mean that reinforcement alone is enough," Roitblat said.

In the same vein, Roitblat added, the paper makes no suggestions on how the rewards, actions, and other elements of reinforcement learning are defined.

"Reinforcement learning assumes that the agent has a finite set of potential actions. A reward signal and value function have also been specified. In other words, the problem of general intelligence is precisely how to supply the very things that reinforcement learning requires as prerequisites," Roitblat said. "So, if machine learning can all be reduced to some form of optimization to maximize some evaluative measure, then reinforcement learning is surely relevant, but that explanation is not very convincing."
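Roitblat's point can be made concrete with a sketch (mine, not his) of a Gym-style environment interface: before any learning happens, a human designer has already fixed the state representation, the action set, and the reward function. The class and method names below are illustrative assumptions, not taken from any particular library:

```python
class Environment:
    """A hypothetical Gym-style interface. Note how much is decided
    by the designer before the agent learns anything."""

    # The designer enumerates the finite action set in advance.
    ACTIONS = ["left", "right", "pick_up", "wait"]

    def reset(self):
        """Choose and return the initial state representation,
        also a human design decision, not something learned."""
        ...

    def step(self, action):
        """Apply an action and return (observation, reward, done).
        The reward signal itself is hand-specified here: the agent
        optimizes it but never invents it."""
        ...
```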
"type":null},"content":[{"type":"text","text":"https:\/\/bdtechtalks.com\/2021\/06\/07\/deepmind-artificial-intelligence-reward-maximization\/"}]}]}]}