Why Is Training Data the Bottleneck for Natural Language Processing?

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"人工智能並不像某些資料中宣傳的那樣真正理解人類的語言。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"本文最初發表於 Towards Data Science 博客,經原作者 Tolga Akiner 授權,InfoQ 中文站翻譯並分享。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在你看今天那些鋪天蓋地的關於“人工智能將會危害人類”的討論時,你有沒有問過“怎麼會呢”?據谷歌一位"},{"type":"link","attrs":{"href":"https:\/\/blog.google\/products\/search\/search-language-understanding-bert\/","title":"","type":null},"content":[{"type":"text","text":"副總裁"}]},{"type":"text","text":"說,當你對着最先進的語言模型詢問“內布拉斯加以南是哪個州”時,它會回答“南內布拉斯加”。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣的科技水平,真的能超越我們所知的宇宙中最聰明的物種,即智人,達到更聰明的程度嗎?嗯,我們從青銅時代一直到現在,所以答案可能是肯定的,指的是將來的某個時候。但是,要提升當前的人工智能,還需要克服太多的障礙。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不得不承認,當我第一次接觸 “遷移學習”時,我非常興奮。當有人看到像 ImageNet 和 BERT 這樣的模型,讀到一些關於這些模型上的“炒作”文章,我想,我們其實已經非常接近電影《她》("},{"type":"text","marks":[{"type":"italic"}],"text":"Her"},{"type":"text","text":")那樣的場景了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"之後,我開始在企業環境中作爲數據科學家工作,並接觸了醫療行業中的一些關鍵業務問題,我意識到,在現實世界中的應用涉及到一些不同於 SOTA 或 GLUE 這樣的標準任務。當我將他們的模型應用到一些不同的數據集時,我看到不同的遷移學習包、創業公司和潛在的廠商公司所報告的關於不同任務的 95% 以上的準確性,而所有這些花哨的模型都以某種角度和 \/ 
或某種方式失敗了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此,對於我來說,這幅圖景變得更清晰了,我從自然語言處理領域的遷移學習中得到了一個更實際的結論:那些令人着迷的人工智能的結果大多隻適用於非常特定的測試集,而這些測試集可能都是經過精心挑選的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在一個與語言學有關的例子中(我的知識主要涉及自然語言處理,對計算機視覺也不太熟悉,所以我將僅舉這一領域的例子),我們可以認爲,人工智能並不像某些資料中宣傳的那樣真正理解人類的語言,例如,一些新聞文章中,人工智能僅僅理解它之前看到的語料的某些方面,並試圖將其理解推論到一個新的數據點。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這一觀點可能被某些人認爲是多餘的,我們不應該等待人工智能給出一些與訓練集有很大不同的數據的神奇答案。我當然同意,但是,如果我們想要邁向更廣泛、更實用的人工智能應用,並且在遷移學習方面取得更好的成果,那麼作爲一個社區,我們最好有一個堅實的路線圖,並且對此提出堅實的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在最近幾個月,隨着遷移學習應用的興起,訓練數據的重要性越來越受到人們的關注,特別是在自然語言領域。你還可以通過觀察越來越多的衆包或數據標籤初創公司在市場中的業務來把握這個趨勢。"},{"type":"link","attrs":{"href":"https:\/\/acl2020.org\/","title":"","type":null},"content":[{"type":"text","text":"ACL2020"}]},{"type":"text","text":"上最近發表的一篇論文《"},{"type":"link","attrs":{"href":"https:\/\/www.aclweb.org\/anthology\/2020.acl-main.463.pdf","title":"","type":null},"content":[{"type":"text","text":"走向自然語言理解:數據時代的意義、形式與理解"}]},{"type":"text","text":"》("},{"type":"text","marks":[{"type":"italic"}],"text":"Climbing towards NLU:On Meaning, Form, and Understanding in the Age of Data"},{"type":"text","text":")提出了一種非常有趣的方法,我發現它與以往完全不同。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"儘管這一研究可能被認爲是在語言學領域對科學哲學和一些嚴格定義的術語(如意義、形式和交際意圖)的極大關注,它還是得出了一個非常明確的結論:BERTology 論文中有證據表明,大規模語言模型可以學習語言形式結構的各個方面,並使用訓練數據中的人工產物,但是它們並不瞭解人類如何溝通,也不記住事實知識,也不瞭解你的問題背後的意圖。一些快速失敗的例子(用 GPT2 生成),關於這句話,在另一篇有趣的文章《"},{"type":"link","attrs":{"href":"https:\/\/www.elementalcognition.com\/mental-models-blog","title":"","type":null},"content":[{"type":"text","text":"爲什麼人工智能會被語言所困惑?這都是心智模型的問題"}]},{"type":"text","text":"》("},{"type":"text","marks":[{"type":"italic"}],"text":"Why is AI so confused by language? 
It’s all about mental models."},{"type":"text","text":")中可以找到,作者提出了一個名爲“心智模型”的概念,它模仿人類大腦如何在遷移學習協議中消化語言。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種觀點是基於這樣一個事實:我們可以根據非常不同的因素給句子和短語賦予非常不同的意義。比方說,讓我們看一看這一句:“… Micheal Jordan now if he gets Bryon Russel with a quick crossover look at Bryon Russell slips and Micheal pulls and buries the shot…”我假設那個鈴聲已經響了,讓一些人想起了 1998 年的 NBA 總決賽,即使你對此並不感興趣,“Bryon Russel”、“Micheal Jordan”、“cross-over”和“shot”也許會告訴你,這個句子實際上是在描述一個發生在過去的事件,而這個事件發生在一座擠滿了成千上萬人的體育館裏,而且是在猶他州或者芝加哥。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"即便遷移學習模型通過從不同的角度看整句話,能夠在一定程度上理解每個詞和模式,但它們並不知道這些明顯的(對人來說肯定的) 細節和聯繫,這就是爲什麼“心智模型”能夠提供一些解決障礙的初步辦法。儘管如此,我還是希望能在另一篇文章中深入探討這個新的想法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在這篇文章《"},{"type":"link","attrs":{"href":"https:\/\/www.technologyreview.com\/2020\/11\/18\/1012234\/training-machine-learning-broken-real-world-heath-nlp-computer-vision\/","title":"","type":null},"content":[{"type":"text","text":"我們訓練人工智能的方式存在根本性的缺陷"}]},{"type":"text","text":"》("},{"type":"text","marks":[{"type":"italic"}],"text":"The way we train AI is fundamentally flawed"},{"type":"text","text":")中,作者討論了另外一個非常有趣的概念,即 “壓力測試”,它的理念是,除了標準的驗證和測試集外,對模型進行更廣泛的測試。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這聽起來很好,但是我個人更願意從真實世界的應用程序的角度來評價模型。如果我們對每個遷移學習模型都有詳細的實際效果報告的話,我想這會很棒。這聽起來似乎很需要數據,但這只是一個想法,我將在這篇文章中試着做一個壓力測試,希望能有點意思。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於不同的想法、觀點,以及人工智能對未來的影響,我喜歡用相對簡單易於理解的模型來進行討論,但是這可能已經夠多了,讓我們來討論一下這個模型。迄今爲止,我的重點是介紹並討論了訓練的重要性,以及如何將這一基本部分的各個方面引入當前的遷移學習研究與可能更廣泛的應用之間的鴻溝。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過對以上問題的閱讀和思考,我想對一些語言模型進行一次非常快速、特別的壓力測試(多虧了 HuggingFace 的強大功能,它可以確定不同訓練集對相同語言模型架構的影響,如果只使用 Tensorflow 或 PyTorch 的話)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文所要展示的就是,對於由不同語言的微調集合所引起的一些基於 BERT 的掩蔽語言模型的差異,我試圖給出一個直觀的解釋。通過研究不同語言的語言模型,我希望可以鞏固訓練集的重要性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲在不同的訓練集中,我需要訓練多少個 BERT 模型,所以我考慮了不同的語言。我們的目的是通過一些主觀的問題來對不同的語言模型(不同的語言)進行壓力測試,同時利用非常易於使用的 
Transformer"},{"type":"link","attrs":{"href":"https:\/\/huggingface.co\/transformers\/main_classes\/pipelines.html","title":"","type":null},"content":[{"type":"text","text":"管道"}]},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BERT 針對不同的任務進行了微調,包括但不限於掩蔽標記預測、文本分類、名稱實體識別和問題回答;但是,由於問題提取需要上下文輸入,我決定使用與掩蔽語言模型相似的程序。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣,我通過掩蔽其中一個標記(理想中引入句子主觀性的標記)生成了 15 個相對較短的句子,並將所有這些被掩蔽的句子分別輸入到基於 BERT 的掩蔽語言模型中,分別對英語、德語、法語和土耳其語進行訓練。我認爲,呈現代碼將是描述此工作流程的最好方法,因此,我只想通過顯示這些包和句子來了解:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"import pandas as pd\nfrom transformers import pipeline\nfrom transformers import BertTokenizer\nfrom google_trans_new import google_translator\n# Our 15 test sentences\nsentences = ['The most delicious food in the world is [MASK].',\n'The best vacation spot in the world is [MASK].',\n'When I grow up, I want to be [MASK].',\n'[MASK] won the Cold War.',\n'The most powerful nation in the world is [MASK].',\n'The cleanest energy source is [MASK].',\n'The most exciting artificial intelligence application is [MASK].',\n'The best smartphone in the market is [MASK].',\n'Weed is [MASK] for your health.',\n'Religions are [MASK] for society.',\n'The most cheerful color is [MASK].',\n'The most fascinating field of science is [MASK].',\n'The average temperature of the earth is going to [MASK] in the future.',\n'The highest paid job of the 21st century is [MASK].',\n'Mathematics is useful for [MASK].'\n]\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正如你所看到的,我試着給出一些或多或少的主觀臆斷和正在進行的辯論問題(如果不是全部,也是上面的大部分),它們可以用一個詞來回答。這個想法就是觀察不同的語言模型(在不同的語言上訓練)如何預測這些標記。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在訓練語料中,我最初的一個期望就是通過模型輸出是否反映出與文化、習慣或社會有關的差異。由於這是訓練集如何影響遷移學習預測的重要證據。雖然我沒有百分之百肯定我是否做到了,但是我希望你們能夠確定並告訴我!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我只在遷移學習部分使用了"},{"type":"link","attrs":{"href":"https:\/\/huggingface.co\/transformers\/index.html","title":"","type":null},"content":[{"type":"text","text":"Transformers"}]},{"type":"text","text":","},{"type":"link","attrs":{"href":"https:\/\/pypi.org\/project\/google-trans-new\/","title":"","type":null},"content":[{"type":"text","text":"google-translator"}]},{"type":"text","text":"只用於將預測的標記翻譯爲英語。第一個翻譯是用 Transformer 在局級級別完成的,除了我的母語土耳其語外,這對我來說更容易手動翻譯。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"還有一點很重要,那就是我已經和我的前同事和朋友 Emir Kocer 和 Umut Soysal 
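Before going further, it may help to see what a fill-mask pipeline actually returns for one of these sentences. The sketch below is my own illustration rather than part of the original workflow: by default the pipeline returns the top five candidates, each a dict with a `token_str` (the filled-in word) and a probability `score`, which is exactly the structure the code further down relies on.

```python
from transformers import pipeline

# A minimal sketch: run one test sentence through an English fill-mask
# pipeline and inspect the structure of the predictions.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("The most delicious food in the world is [MASK]."):
    # Each prediction carries the candidate word and its probability.
    print(pred["token_str"], round(pred["score"], 4))
```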
One of my initial expectations was to see whether the model outputs would reflect culture-, habit-, or society-related differences present in the training corpora, since that would be significant evidence of how the training set affects transfer-learning predictions. I am not one hundred percent sure I managed it, but I hope you can decide and let me know!

I used [Transformers](https://huggingface.co/transformers/index.html) only for the transfer-learning part, and [google-trans-new](https://pypi.org/project/google-trans-new/) only to translate the predicted tokens into English. The first translation pass was done with Transformers at the sentence level, except for my mother tongue, Turkish, which was easier for me to translate manually.

It is also worth mentioning that I worked through the German and French translations with my former colleagues and friends Emir Kocer and Umut Soysal, trying to minimize mispredictions of the masked token caused by translation errors.

```python
# We'll translate to German and French first
translator_de = pipeline('translation_en_to_de')
translator_fr = pipeline('translation_en_to_fr')

# Create De and Fr sentences
de_sents = []
fr_sents = []
for eng in sentences:
    de_sents.append(translator_de(eng)[0]['translation_text'])
    fr_sents.append(translator_fr(eng)[0]['translation_text'])

# Change [MASK] to <mask> for French (CamemBERT uses a different mask token)
fr_sents_mod = [sents.replace('[MASK]', '<mask>') for sents in fr_sents]

# I did not use (nor trust) machine translation for my mother tongue,
# with weird letters such as ü, ğ, ş, ı.
tr_sents = ['[MASK] dünyadaki en lezzetli yiyecektir.',
            '[MASK] dünyadaki en guzel tati yeridir.',
            'Ben büyüyünce [MASK] olmak istiyorum.',
            'Soğuk savaşı [MASK] kazandı.',
            '[MASK] dünyadaki en güçlü millettir.',
            '[MASK] en temiz enerji kaynağıdır.',
            '[MASK] en heyecan verici yapay zeka uygulamasıdır.',
            '[MASK] piyasadaki en iyi akıllı telefondur.',
            'Kenevir sağlığınız icin [MASK].',
            'Dinler toplumlar icin [MASK].',
            '[MASK] en neşeli renktir.',
            '[MASK] bilimin en büyüleyici alanıdır.',
            'Dünyanın ortalama sıcaklığı gelecekte [MASK].',
            '21. yüzyılın en yüksek maaşlı işi [MASK].',
            'Matematik [MASK] için kullanışlıdır.'
            ]
```
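The mask token is checkpoint-specific, which is why the French sentences above swap `[MASK]` for CamemBERT's `<mask>`. As a quick sanity check (my addition, not part of the original post), each tokenizer can be asked for its mask token directly:

```python
from transformers import AutoTokenizer

# Print the mask token each checkpoint expects: the three BERT-style
# models use [MASK], while camembert-base uses <mask>.
for name in ['bert-base-uncased', 'bert-base-german-cased',
             'camembert-base', 'dbmdz/bert-base-turkish-cased']:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, '->', tokenizer.mask_token)
```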
The next step is to feed these sentences into the corresponding masked language models, extract the masked-token predictions, and translate them into English so that we can evaluate the results more conveniently and comprehensively. Since this last translation stage works at the word level, I used Google Translate. You know, sometimes you just want to test a new package even if it does the same job…

```python
# Create the fill-mask objects
fill_mask_eng = pipeline(
    "fill-mask",
    model="bert-base-uncased",
    tokenizer='bert-base-uncased'
)
fill_mask_de = pipeline(
    "fill-mask",
    model="bert-base-german-cased",
    tokenizer='bert-base-german-cased'
)
fill_mask_fr = pipeline(
    "fill-mask",
    model="camembert-base",
    tokenizer="camembert-base"
)
fill_mask_tr = pipeline(
    "fill-mask",
    model="dbmdz/bert-base-turkish-cased",
    tokenizer='dbmdz/bert-base-turkish-cased'
)

# Run the fill-mask pipelines for each language and translate the German
# and French predictions back to English
translator = google_translator()
eng_res = []
de_res = []
fr_res = []
tr_res = []
for i, sents in enumerate(sentences):
    res_eng = fill_mask_eng(sents)
    res_de = fill_mask_de(de_sents[i])
    res_fr = fill_mask_fr(fr_sents_mod[i])
    res_tr = fill_mask_tr(tr_sents[i])

    eng_res.append(', '.join([j['token_str'] for j in res_eng]))
    de_res.append(', '.join([translator.translate(j['token_str'], lang_src='de', lang_tgt='en') for j in res_de]))
    fr_res.append(', '.join([translator.translate(j['token_str'].replace('▁', ''), lang_src='fr', lang_tgt='en') for j in res_fr]))
    tr_res.append(', '.join([translator.translate(j['token_str'], lang_src='tr', lang_tgt='en') for j in res_tr]))

# Push the results into a dataframe
result_df = pd.DataFrame(list(zip(sentences, eng_res, de_res, fr_res, tr_res)),
                         columns=['Sentence', 'English', 'German', 'French', 'Turkish'])
```
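As a possible extension (a sketch of my own, not part of the original code), you could keep each model's confidence next to its predicted token; that makes the low-confidence stopword predictions discussed below easier to spot:

```python
# Keep the probability next to each English prediction; the same
# pattern works for the German, French, and Turkish pipelines.
eng_res_scored = []
for sents in sentences:
    preds = fill_mask_eng(sents)
    eng_res_scored.append(
        ', '.join(f"{p['token_str']} ({p['score']:.2f})" for p in preds)
    )
```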
I cannot claim these results are groundbreaking, but they are debatable, and I would welcome further brainstorming and constructive criticism. Here are my findings and a few shallow observations from trying to make sense of this odd table. Since this post is already longer than I expected, I will only discuss a handful of data points. And don't forget, the [full code](https://github.com/Tolga28A/BERT-MLM-EN_DE_FR.git) is available for you to download and explore.

![Masked-token predictions of the four language models for the 15 test sentences](https://static001.geekbang.org/infoq/19/1902a550aaae543d1dab65b37f9a4322.jpeg)

The "here" token showing up for the food sentence in row 0 only in French seems like a very interesting start. I could not find supporting data quickly, so I cannot make this point data-driven (shame on me), but there is a [Quora question](https://www.quora.com/Why-do-the-French-think-they-have-the-best-cuisine-on-planet-Earth): "Why do the French think they have the best cuisine on planet Earth?"

For German, assuming the token "Italy" refers to Italian food in this sentence, I found a [survey](https://www.euronews.com/2018/05/15/which-foreign-cuisines-do-europeans-love-to-eat-) showing that Italian cuisine is more popular than German cuisine in Germany. Can you say that such social trends come through in transfer learning? Maybe…

For "chocolate" in English, my only explanation is that, according to [Statista](https://www.statista.com/forecasts/758627/revenue-of-the-snack-food-market-worldwide-by-country), the US has a clear lead in snack consumption, but that is probably a weak link, so I will just leave it at that…

In the vacation sentence in row 1, the token "here" appears for English, French, and Turkish, and according to the [*World Tourism Barometer*](https://www.e-unwto.org/doi/epdf/10.18111/wtobarometereng.2020.18.1.5), these countries are three of the world's top six destinations. So here is another hint that certain differences between these countries may have transferred from the training corpora into the model predictions.

Turkish BERT predicting the token "doctor" in row 2 immediately resonated with me, because my own experience tells me how obsessed Turkey is with the medical profession. I am not making this up, and it is not just because I grew up in Turkey; here is a [survey](https://www.dailysabah.com/turkey/doctors-judges-have-the-most-respected-profession-turkish-survey-shows/news) on it.

Looking at the religion-related sentence in row 9, the tokens "bad" and "dangerous" appear only for German and French, two countries that rank far below the US and Turkey in the [importance of religion](https://en.wikipedia.org/wiki/Importance_of_religion_by_country#/media/File:Countries_by_importance_of_religion.svg.png). This is another potential flow of information from societal thinking into machine-learning predictions, via training data and transfer learning.

There is plenty left to interpret in this results dataframe, along with some very strange predictions that I currently cannot explain. For instance, yellow is missing from the most cheerful colors only in Turkish; iPhone is the best-smartphone prediction for German; English BERT seems confident about the winner of the Cold War even though USA is not its prediction for the most powerful nation; and only the French language model shows a strong hatred of cannabis. Not to mention, of course, some stopword predictions for the masked token, which may stem from grammar or translation errors.

By comparing corpora across languages, I believe we could also discuss the role of data bias in these results. Training-data bias is in fact another big discussion topic, but I do not want to go too deep into it here; for now I just wanted to introduce the related perspective (not sure whether I will write another blog post about it, maybe…).

You may well arrive at an interpretation very different from mine; remember, I would love to hear more of them. So let me land this plane by summarizing a few key takeaways:

- NLP is a rapidly rising field of AI that has recently made remarkable progress in both foundational research and enterprise-grade applications. However, current research contains a great deal of hype and overlooks the importance of training-corpus selection, along with its dependencies and consequences.
- There are various viewpoints on and discussions of training corpora for supervised NLP, such as stress testing; the motivation behind this article was to apply a very short stress test to the same transfer-learning architecture in order to identify the differences that training data causes in the predictions.
- I wanted to provide a compelling transfer-learning example focused on the differences that arise from differences in training sets. To that end, I selected language models for different languages and tested them by feeding in a range of subjective masked sentences.
- For the same sentence, the token predictions of language models for different languages differ considerably. I found supporting data for some of these differences, though not all; I hope the data in this article gives you food for thought and leads you to the conclusion that training data bounds the effectiveness of transfer-learning models in broad applications.

**About the author:**

Tolga Akiner, PhD, is a data scientist and machine-learning practitioner focusing on NLP in healthcare. LinkedIn: https://www.linkedin.com/in/tolga-akiner/

**Original article:**

https://towardsdatascience.com/why-is-training-data-the-bottleneck-for-nlp-a-multilingual-bert-example-44b86c11f5a