微軟和谷歌的 AI 模型在 SuperGLUE 語言基準上超越了人類的表現

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"隸屬於 Facebook、紐約大學(NYU)、華盛頓大學和 DeepMind 的研究人員在 2019 年底推出了"},{"type":"link","attrs":{"href":"https:\/\/venturebeat.com\/2019\/08\/14\/ai-researchers-launch-superglue-a-rigorous-benchmark-for-language-understanding\/amp\/","title":"","type":null},"content":[{"type":"text","text":"SuperGLUE"}]},{"type":"text","text":",這是一種新的人工智能基準,用於總結各種語言任務的研究進展。基於去年發佈的 GLUE 基準,SuperGLUE 包含了一系列更難的語言理解挑戰、改進的資源以及"},{"type":"link","attrs":{"href":"https:\/\/super.gluebenchmark.com\/leaderboard\/","title":"","type":null},"content":[{"type":"text","text":"公開的排行榜"}]},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在 SuperGLUE 推出時,在排行榜上,表現最好的模型和人類的表現有近 20 分的差距。但截至 1 月初,有兩個模型,一個是來自微軟的 DeBERTa,另一個是來自谷歌的 T5+Meena,它們已經超越了人類的基準線,成爲第一批超越人類的模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"紐約大學數據科學中心助理教授 Sam Bowman 表示,這一成就反映了機器學習的創新,包括自監督學習,即模型從未標記的數據集中學習,並制定了將洞察力用於目標任務的方法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“這些數據集反映了一些最難的監督語言理解任務數據集,這些數據集在兩年前是免費提供的。沒有理由相信 SuperGLUE 將能夠檢測到自然語言處理的進一步進展,至少會超過剩下的一小部分”,Sam Bowman說。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是 SuperGLUE 並非人類語言能力的完美測試,也並非完整測試。DeBERTa 背後的微軟團隊在一篇博文中也指出,他們的模型“絕非”達到自然語言理解的人類級智能。他們表示,這需要研究突破,以及衡量它們及其效果的新基準。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"SuperGLUE"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正如研究人員在介紹 SuperGLUE 的論文《"},{"type":"link","attrs":{"href":"https:\/\/w4ngatang.github.io\/static\/papers\/superglue.pdf","title":"","type":null},"content":[{"type":"text","text":"SuperGLUE:通用語言理解系統更嚴格的基準"}]},{"type":"text","text":"》("},{"type":"text","marks":[{"type":"italic"}],"text":"SuperGLUE: A Stickier Benchmark forGeneral-Purpose Language Understanding Systems"},{"type":"text","text":")所寫的那樣,他們的基準旨在成爲一個簡單的而又有難度的衡量標準,用以衡量英語通用語言理解技術的進展。它包括 8 個語言理解任務,它們來自於已有的數據,並配有性能度量和分析工具包。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這些任務是:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"布爾問題"},{"type":"text","text":"(Boolean Questions,"},{"type":"text","marks":[{"type":"strong"}],"text":"BoolQ"},{"type":"text","text":"):要求模型回答一個關於維基百科文章中包含答案的短文的問題。這是一些谷歌用戶通過谷歌搜索提交的問題。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"CommitmentBank"},{"type":"text","text":"("},{"type":"text","marks":[{"type":"strong"}],"text":"CB"},{"type":"text","text":"):要求模型識別 文本中包含的假設,包括《華爾街日報》的信息來源,並確定該假設是否成立。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"合理選擇"},{"type":"text","text":"(Choice of plausible alternatives,"},{"type":"text","marks":[{"type":"strong"}],"text":"COPA"},{"type":"text","text":"): 提供了一個關於博客主題的前提語句,以及一本與攝影相關的百科全書,模型必須從中確定兩種可能選擇的因果關係。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"多句閱讀理解"},{"type":"text","text":"(Multi-Sentence Reading Comprehension,"},{"type":"text","marks":[{"type":"strong"}],"text":"MultiRC"},{"type":"text","text":"):這是一項問答式的任務,其中每個樣本都包含一段上下文段落、一個關於該段落的問題,以及一系列可能的答案。一種模型必須預測哪些答案是真的,哪些答案是假的。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"基於常識推理數據集的閱讀理解"},{"type":"text","text":"(Reading Comprehension with Commonsense Reasoning Dataset,"},{"type":"text","marks":[{"type":"strong"}],"text":"ReCoRD"},{"type":"text","text":"):模型根據 CNN 和《每日郵報》的選文列表中預測被掩蓋的單詞和短語,在這些選文中,同一單詞或短語可能以多種不同的形式表達,所有這些都被認爲是正確的。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"識別文本內容"},{"type":"text","text":"(Recognizing Textual Entailment,"},{"type":"text","marks":[{"type":"strong"}],"text":"RTE"},{"type":"text","text":"):挑戰自然語言模型,以確定一個文本摘錄的真實性是否來自另一個文本摘錄。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Word-in-Context"},{"type":"text","text":"("},{"type":"text","marks":[{"type":"strong"}],"text":"WiC"},{"type":"text","text":"):爲兩個文本片段和一個多義詞(即具有多重含義的單詞)提供模型,並要求它們判定這個單詞是否在兩個句子中有相同的含義。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Winograd 模式挑戰"},{"type":"text","text":"(Winograd Schema Challenge,"},{"type":"text","marks":[{"type":"strong"}],"text":"WSC"},{"type":"text","text":"):是一項任務,在這項任務中,模型給定小說書中的段落,必須回答關於歧義代詞先行詞的多項選擇題。它被設計爲圖靈測試的改進。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"SuperGLUE 也嘗試在 Winogender 圖式的模型中測量性別偏見,這些模型是僅由句子中某一代詞的性別不同的句對。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但,研究人員指出,這種方法有其侷限性,因爲它只能提供積極的預測值:較差的偏見分數清楚地表明模型顯示出性別偏見,而良好的分數並不意味着模型是無偏見的。而且,它並不包括一切形式的性別或社會偏見,因此它只是一種粗略的偏見衡量標準。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了建立人類表現的基線,研究人員借鑑了 WiC、MultiRC、RTE 和 ReCoRD 的現有文獻,並通過亞馬遜的 Mechanical Turk 平臺僱傭了衆包註釋員。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每個衆包人員每小時的平均工資爲 23.75 美元,他們完成了一個短期培訓階段,之後纔會使用說明和常見問題來對多達 30 個選定測試集樣本進行註釋。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"架構改進"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"儘管 DeBERTa 背後的微軟研究人員在 1 月 6 日發表的一篇題爲《"},{"type":"link","attrs":{"href":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark\/","title":"","type":null},"content":[{"type":"text","text":"微軟 DeBERT 在 SuperGLUE 基準上超越人類"}]},{"type":"text","text":"》("},{"type":"text","marks":[{"type":"italic"}],"text":"Microsoft DeBERTa surpasses human performance on the SuperGLUE benchmark"},{"type":"text","text":")的博文中提供了他們的工作細節,但是谷歌團隊還沒有提供關於其模型性能改進的細節。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DeBERTa 並非新鮮事,它是去年開源的,但研究人員表示,他們已經訓出練一個包含 15 億個參數(即模型用來進行預測的內部變量)的更大版本。它將以開源的方式發佈,並集成到下一個版本的微軟圖靈自然語言表示模型中,支持諸如 Bing、Office、Dynamics 和 Azure 認知服務等產品。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DeBERTa 是通過掩蔽語言建模進行預訓練的,這是一項填空任務,教會模型使用與被掩蔽標記相關的詞來預測被掩蔽的詞應該是什麼。DeBERTa 利用上下文詞的內容和位置信息來建立掩蔽語言模型,比如它能夠識別出“a new store opened beside the new mall”句子中的“store”和“mall”扮演着不同的句法角色。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"與其他一些模型不同的是,DeBERTa 在語言建模過程中將詞的絕對位置考慮在內。此外,它還對模型中轉換後的輸入數據進行參數計算,並根據詞的相對位置衡量詞之間依賴關係的強弱。舉例來說,DeBERTa 會理解“deep”和“learning”這兩個詞相鄰出現時,依賴關係要比單獨出現在不同句子中更強。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DeBERTa 還受益於對抗訓練,這種技術利用對訓練數據進行小幅度改變而獲得的對抗樣本。在訓練過程中,這些對抗樣本被輸入到模型中,提高了模型的泛化能力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"微軟研究人員希望下一步探索如何讓 DeBERTa 能夠泛化到新的子任務或基本的解決問題的能力,這個概念被稱爲“合成泛化”(compositional generalization)。未來的一條路可能是更加明確地融合所謂的合成結構,這可能需要將人工智能與符號推理,換句話說,按照數學和邏輯規則操縱符號和表達式。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“DeBERTa 在 SuperGLUE 上超越人類的表現標誌着向通用人工智能邁進的重要里程碑,”微軟研究人員寫道。“但與 DeBERTa 不同的是,人類非常善於利用從不同任務中學到的知識來解決一個新的任務,並不需要或很少需要特定任務的演示。”"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"新基準"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據 Bowman 的說法,SuperGLUE 的繼任者尚未出現,至少在短期內是如此。但是人工智能研究界越來越多的共識是,未來的基準,特別是在語言領域,要起作用,就必須考慮到更廣泛的倫理、技術和社會挑戰。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"例如,一些研究表明,流行的基準在評估現實中的人工智能性能方面效果不佳。一份最新"},{"type":"link","attrs":{"href":"https:\/\/venturebeat.com\/2020\/08\/12\/natural-language-benchmarks-dont-measure-ai-models-general-knowledge-well-research-shows\/","title":"","type":null},"content":[{"type":"text","text":"報告"}]},{"type":"text","text":"顯示,自然語言處理模型給出的 60%~70% 的答案都嵌入在基準訓練集中,這表明模型通常只是在記憶答案。在對超過 3000 篇人工智能論文進行的元分析中,另一項研究發現,用來衡量人工智能和機器學習模型的指標往往不一致,追蹤不規則,並且信息也不特別豐富。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"一部分原因是因爲諸如 OpenAI 的"},{"type":"link","attrs":{"href":"https:\/\/venturebeat.com\/2020\/07\/24\/ai-weekly-the-promise-and-shortcomings-of-openais-gpt-3\/","title":"","type":null},"content":[{"type":"text","text":"GPT-3"}]},{"type":"text","text":"、谷歌的 T5+Meena 和微軟的 DeBERTa 這樣的語言模型,通過將公共網絡中的樣本內化,學會了如何寫出與人類相似的文本。它們使用諸如電子書、維基百科和 Reddit 這樣的社會媒體平臺來推斷整句話甚至整段話。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"結果是,語言模型經常會放大這些公共數據中編碼的偏見;部分培訓數據"},{"type":"link","attrs":{"href":"https:\/\/venturebeat.com\/2020\/08\/07\/researchers-quantify-bias-in-reddit-content-sometimes-used-to-train-ai\/","title":"","type":null},"content":[{"type":"text","text":"並非不常見"}]},{"type":"text","text":",它們來自具有普遍性別、種族和宗教偏見的社區。 OpenAI 是一家人工智能研究公司,它指出,這可能導致把像“naughty”或“sucked”這樣的詞放在女性代詞旁邊,把“Islam”放在 terrorism 旁邊。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"今年 4 月,英特爾、麻省理工學院以及加拿大人工智能項目 CIFAR 的研究人員發表了一份研究報告,報告指出,一些最流行的模型存在着很強的刻板印象,包括谷歌的"},{"type":"link","attrs":{"href":"https:\/\/venturebeat.com\/2018\/11\/02\/google-open-sources-bert-a-state-of-the-art-training-technique-for-natural-language-processing\/","title":"","type":null},"content":[{"type":"text","text":"BERT"}]},{"type":"text","text":"和"},{"type":"link","attrs":{"href":"https:\/\/venturebeat.com\/2019\/06\/21\/google-brains-xlnet-bests-bert-at-20-nlp-tasks\/","title":"","type":null},"content":[{"type":"text","text":"XLNet"}]},{"type":"text","text":"、OpenAI 的"},{"type":"link","attrs":{"href":"https:\/\/venturebeat.com\/2019\/08\/20\/openai-releases-curtailed-version-of-gpt-2-language-model\/","title":"","type":null},"content":[{"type":"text","text":"GPT-2"}]},{"type":"text","text":"和 Facebook 的"},{"type":"link","attrs":{"href":"https:\/\/venturebeat.com\/2019\/07\/29\/facebook-ais-roberta-improves-googles-bert-pretraining-methods\/","title":"","type":null},"content":[{"type":"text","text":"RoBERTa"}]},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"據 Middlebury Institute of International Studies 稱,惡意行爲者可能會利用這種偏見,通過傳播錯誤信息、虛假信息和徹頭徹尾的謊言來煽動不和諧,從而“使個人處於極端的極右思想和行爲之中,成爲暴力的個人”。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大部分已有的語言基準不能捕捉到這一點。在 SuperGLUE 發表的兩年中,它的發現激發了人們,或許未來的基準可以做到這一點。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"作者介紹:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Kyle Wiggers,技術記者,現居美國紐約市,爲 VentureBeat 撰寫有關人工智能的文章。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文鏈接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/venturebeat.com\/2021\/01\/06\/ai-models-from-microsoft-and-google-already-surpass-human-performance-on-the-superglue-language-benchmark\/"}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章