爲什麼訓練數據是自然語言處理的瓶頸？

原創

2021-01-13 15:48

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"人工智能並不像某些資料中宣傳的那樣真正理解人類的語言。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"italic"},{"type":"strong"}],"text":"本文最初發表於 Towards Data Science 博客，經原作者 Tolga Akiner 授權，InfoQ 中文站翻譯並分享。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在你看今天那些鋪天蓋地的關於“人工智能將會危害人類”的討論時，你有沒有問過“怎麼會呢”？據谷歌一位"},{"type":"link","attrs":{"href":"https:\/\/blog.google\/products\/search\/search-language-understanding-bert\/","title":"","type":null},"content":[{"type":"text","text":"副總裁"}]},{"type":"text","text":"說，當你對着最先進的語言模型詢問“內布拉斯加以南是哪個州”時，它會回答“南內布拉斯加”。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣的科技水平，真的能超越我們所知的宇宙中最聰明的物種，即智人，達到更聰明的程度嗎？嗯，我們從青銅時代一直到現在，所以答案可能是肯定的，指的是將來的某個時候。但是，要提升當前的人工智能，還需要克服太多的障礙。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不得不承認，當我第一次接觸 “遷移學習”時，我非常興奮。當有人看到像 ImageNet 和 BERT 這樣的模型，讀到一些關於這些模型上的“炒作”文章，我想，我們其實已經非常接近電影《她》（"},{"type":"text","marks":[{"type":"italic"}],"text":"Her"},{"type":"text","text":"）那樣的場景了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"之後，我開始在企業環境中作爲數據科學家工作，並接觸了醫療行業中的一些關鍵業務問題，我意識到，在現實世界中的應用涉及到一些不同於 SOTA 或 GLUE 這樣的標準任務。當我將他們的模型應用到一些不同的數據集時，我看到不同的遷移學習包、創業公司和潛在的廠商公司所報告的關於不同任務的 95% 以上的準確性，而所有這些花哨的模型都以某種角度和 \/ 或某種方式失敗了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此，對於我來說，這幅圖景變得更清晰了，我從自然語言處理領域的遷移學習中得到了一個更實際的結論：那些令人着迷的人工智能的結果大多隻適用於非常特定的測試集，而這些測試集可能都是經過精心挑選的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在一個與語言學有關的例子中（我的知識主要涉及自然語言處理，對計算機視覺也不太熟悉，所以我將僅舉這一領域的例子），我們可以認爲，人工智能並不像某些資料中宣傳的那樣真正理解人類的語言，例如，一些新聞文章中，人工智能僅僅理解它之前看到的語料的某些方面，並試圖將其理解推論到一個新的數據點。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這一觀點可能被某些人認爲是多餘的，我們不應該等待人工智能給出一些與訓練集有很大不同的數據的神奇答案。我當然同意，但是，如果我們想要邁向更廣泛、更實用的人工智能應用，並且在遷移學習方面取得更好的成果，那麼作爲一個社區，我們最好有一個堅實的路線圖，並且對此提出堅實的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在最近幾個月，隨着遷移學習應用的興起，訓練數據的重要性越來越受到人們的關注，特別是在自然語言領域。你還可以通過觀察越來越多的衆包或數據標籤初創公司在市場中的業務來把握這個趨勢。"},{"type":"link","attrs":{"href":"https:\/\/acl2020.org\/","title":"","type":null},"content":[{"type":"text","text":"ACL2020"}]},{"type":"text","text":"上最近發表的一篇論文《"},{"type":"link","attrs":{"href":"https:\/\/www.aclweb.org\/anthology\/2020.acl-main.463.pdf","title":"","type":null},"content":[{"type":"text","text":"走向自然語言理解：數據時代的意義、形式與理解"}]},{"type":"text","text":"》（"},{"type":"text","marks":[{"type":"italic"}],"text":"Climbing towards NLU:On Meaning, Form, and Understanding in the Age of Data"},{"type":"text","text":"）提出了一種非常有趣的方法，我發現它與以往完全不同。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"儘管這一研究可能被認爲是在語言學領域對科學哲學和一些嚴格定義的術語（如意義、形式和交際意圖）的極大關注，它還是得出了一個非常明確的結論：BERTology 論文中有證據表明，大規模語言模型可以學習語言形式結構的各個方面，並使用訓練數據中的人工產物，但是它們並不瞭解人類如何溝通，也不記住事實知識，也不瞭解你的問題背後的意圖。一些快速失敗的例子（用 GPT2 生成），關於這句話，在另一篇有趣的文章《"},{"type":"link","attrs":{"href":"https:\/\/www.elementalcognition.com\/mental-models-blog","title":"","type":null},"content":[{"type":"text","text":"爲什麼人工智能會被語言所困惑？這都是心智模型的問題"}]},{"type":"text","text":"》（"},{"type":"text","marks":[{"type":"italic"}],"text":"Why is AI so confused by language? It’s all about mental models."},{"type":"text","text":"）中可以找到，作者提出了一個名爲“心智模型”的概念，它模仿人類大腦如何在遷移學習協議中消化語言。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種觀點是基於這樣一個事實：我們可以根據非常不同的因素給句子和短語賦予非常不同的意義。比方說，讓我們看一看這一句：“… Micheal Jordan now if he gets Bryon Russel with a quick crossover look at Bryon Russell slips and Micheal pulls and buries the shot…”我假設那個鈴聲已經響了，讓一些人想起了 1998 年的 NBA 總決賽，即使你對此並不感興趣，“Bryon Russel”、“Micheal Jordan”、“cross-over”和“shot”也許會告訴你，這個句子實際上是在描述一個發生在過去的事件，而這個事件發生在一座擠滿了成千上萬人的體育館裏，而且是在猶他州或者芝加哥。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"即便遷移學習模型通過從不同的角度看整句話，能夠在一定程度上理解每個詞和模式，但它們並不知道這些明顯的（對人來說肯定的）細節和聯繫，這就是爲什麼“心智模型”能夠提供一些解決障礙的初步辦法。儘管如此，我還是希望能在另一篇文章中深入探討這個新的想法。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在這篇文章《"},{"type":"link","attrs":{"href":"https:\/\/www.technologyreview.com\/2020\/11\/18\/1012234\/training-machine-learning-broken-real-world-heath-nlp-computer-vision\/","title":"","type":null},"content":[{"type":"text","text":"我們訓練人工智能的方式存在根本性的缺陷"}]},{"type":"text","text":"》（"},{"type":"text","marks":[{"type":"italic"}],"text":"The way we train AI is fundamentally flawed"},{"type":"text","text":"）中，作者討論了另外一個非常有趣的概念，即 “壓力測試”，它的理念是，除了標準的驗證和測試集外，對模型進行更廣泛的測試。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這聽起來很好，但是我個人更願意從真實世界的應用程序的角度來評價模型。如果我們對每個遷移學習模型都有詳細的實際效果報告的話，我想這會很棒。這聽起來似乎很需要數據，但這只是一個想法，我將在這篇文章中試着做一個壓力測試，希望能有點意思。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於不同的想法、觀點，以及人工智能對未來的影響，我喜歡用相對簡單易於理解的模型來進行討論，但是這可能已經夠多了，讓我們來討論一下這個模型。迄今爲止，我的重點是介紹並討論了訓練的重要性，以及如何將這一基本部分的各個方面引入當前的遷移學習研究與可能更廣泛的應用之間的鴻溝。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過對以上問題的閱讀和思考，我想對一些語言模型進行一次非常快速、特別的壓力測試（多虧了 HuggingFace 的強大功能，它可以確定不同訓練集對相同語言模型架構的影響，如果只使用 Tensorflow 或 PyTorch 的話）。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文所要展示的就是，對於由不同語言的微調集合所引起的一些基於 BERT 的掩蔽語言模型的差異，我試圖給出一個直觀的解釋。通過研究不同語言的語言模型，我希望可以鞏固訓練集的重要性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲在不同的訓練集中，我需要訓練多少個 BERT 模型，所以我考慮了不同的語言。我們的目的是通過一些主觀的問題來對不同的語言模型（不同的語言）進行壓力測試，同時利用非常易於使用的 Transformer"},{"type":"link","attrs":{"href":"https:\/\/huggingface.co\/transformers\/main_classes\/pipelines.html","title":"","type":null},"content":[{"type":"text","text":"管道"}]},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BERT 針對不同的任務進行了微調，包括但不限於掩蔽標記預測、文本分類、名稱實體識別和問題回答；但是，由於問題提取需要上下文輸入，我決定使用與掩蔽語言模型相似的程序。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣，我通過掩蔽其中一個標記（理想中引入句子主觀性的標記）生成了 15 個相對較短的句子，並將所有這些被掩蔽的句子分別輸入到基於 BERT 的掩蔽語言模型中，分別對英語、德語、法語和土耳其語進行訓練。我認爲，呈現代碼將是描述此工作流程的最好方法，因此，我只想通過顯示這些包和句子來了解："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"import pandas as pd\nfrom transformers import pipeline\nfrom transformers import BertTokenizer\nfrom google_trans_new import google_translator\n# Our 15 test sentences\nsentences = ['The most delicious food in the world is [MASK].',\n'The best vacation spot in the world is [MASK].',\n'When I grow up, I want to be [MASK].',\n'[MASK] won the Cold War.',\n'The most powerful nation in the world is [MASK].',\n'The cleanest energy source is [MASK].',\n'The most exciting artificial intelligence application is [MASK].',\n'The best smartphone in the market is [MASK].',\n'Weed is [MASK] for your health.',\n'Religions are [MASK] for society.',\n'The most cheerful color is [MASK].',\n'The most fascinating field of science is [MASK].',\n'The average temperature of the earth is going to [MASK] in the future.',\n'The highest paid job of the 21st century is [MASK].',\n'Mathematics is useful for [MASK].'\n]\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正如你所看到的，我試着給出一些或多或少的主觀臆斷和正在進行的辯論問題（如果不是全部，也是上面的大部分），它們可以用一個詞來回答。這個想法就是觀察不同的語言模型（在不同的語言上訓練）如何預測這些標記。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在訓練語料中，我最初的一個期望就是通過模型輸出是否反映出與文化、習慣或社會有關的差異。由於這是訓練集如何影響遷移學習預測的重要證據。雖然我沒有百分之百肯定我是否做到了，但是我希望你們能夠確定並告訴我！"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我只在遷移學習部分使用了"},{"type":"link","attrs":{"href":"https:\/\/huggingface.co\/transformers\/index.html","title":"","type":null},"content":[{"type":"text","text":"Transformers"}]},{"type":"text","text":"，"},{"type":"link","attrs":{"href":"https:\/\/pypi.org\/project\/google-trans-new\/","title":"","type":null},"content":[{"type":"text","text":"google-translator"}]},{"type":"text","text":"只用於將預測的標記翻譯爲英語。第一個翻譯是用 Transformer 在局級級別完成的，除了我的母語土耳其語外，這對我來說更容易手動翻譯。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"還有一點很重要，那就是我已經和我的前同事和朋友 Emir Kocer 和 Umut Soysal 一起做過德語和法語的翻譯工作，並且試着儘量減少因翻譯而產生的掩蔽標記的錯誤預測。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"# We'll translate to German and French first\ntranslator_de = pipeline('translation_en_to_de')\ntranslator_fr = pipeline('translation_en_to_fr')\n# Create De and Fr sentences\nde_sents = []\nfr_sents = []\nfor eng in sentences:\nde_sents.append(translator_de(eng)[0]('translation_text'))\nfr_sents.append(translator_fr(eng)[0]('translation_text'))\n\n# Change [MASK] to for French\nfr_sents_mod = [sents.replace('[MASK]','') for sents in fr_sents]\n# I did not use translation (neither did I trust) for my mother-tongue. With weird letters such as ü,ğ,ş,ı.\ntr_sents = ['[MASK] dünyadaki en lezzetli yiyecektir.',\n'[MASK] dünyadaki en guzel tati yeridir.',\n'Ben büyüyünce [MASK] olmak istiyorum.',\n'Soğuk savaşı [MASK] kazandı.',\n'[MASK] dünyadaki en güçlü millettir.',\n'[MASK] en temiz enerji kaynağıdır.',\n'[MASK] en heyecan verici yapay zeka uygulamasıdır.',\n'[MASK] piyasadaki en iyi akıllı telefondur.',\n'Kenevir sağlığınız icin [MASK].',\n'Dinler toplumlar icin [MASK].',\n'[MASK] en neşeli renktir.',\n'[MASK] bilimin en büyüleyici alanıdır.',\n'Dünyanın ortalama sıcaklığı gelecekte [MASK].',\n'21. yüzyılın en yüksek maaşlı işi [MASK].',\n'Matematik [MASK] için kullanışlıdır.'\n]\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下一步就是將這些句子輸入到相應的掩蔽語言模型中，提取掩蔽的標記預測，並將其轉換爲英文，這樣我們就可以更方便、更全面地評估結果。由於最後一個翻譯階段是在單詞級別，所以我使用了谷歌翻譯。要知道，有時候你只是想測試一個新的包，即使它做同樣的工作……"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"python"},"content":[{"type":"text","text":"# Create the fill mask objects\nfill_mask_eng = pipeline(\n\"fill-mask\",\nmodel=\"bert-base-uncased\",\ntokenizer='bert-base-uncased'\n)\nfill_mask_de = pipeline(\n\"fill-mask\",\nmodel=\"bert-base-german-cased\",\ntokenizer='bert-base-german-cased'\n)\nfill_mask_fr= pipeline(\n\"fill-mask\",\nmodel=\"camembert-base\",\ntokenizer=\"camembert-base\"\n)\nfill_mask_tr = pipeline(\n\"fill-mask\",\nmodel=\"dbmdz\/bert-base-turkish-cased\",\ntokenizer='dbmdz\/bert-base-turkish-cased'\n)\n# Run the fill-mask pipelines for each language and translate German and French back to English\ntranslator = google_translator()\neng_res = []\nde_res = []\nfr_res = []\ntr_res = []\nfor i,sents in enumerate(sentences):\nres_eng = fill_mask_eng(sents)\nres_de = fill_mask_de(de_sents[i])\nres_fr = fill_mask_fr(fr_sents_mod[i])\nres_tr = fill_mask_tr(tr_sents[i])\n\neng_res.append(', '.join([j['token_str'] for j in res_eng]))\nde_res.append(', '.join([translator.translate(j['token_str'],lang_src='de',lang_tgt='en') for j in res_de]))\nfr_res.append(', '.join([translator.translate(j['token_str'].replace('▁',''),lang_src='fr',lang_tgt='en') for j in res_fr]))\ntr_res.append(', '.join([translator.translate(j['token_str'],lang_src='tr',lang_tgt='en') for j in res_tr]))\n\n# Push the results into a dataframe\nresult_df = pd.DataFrame(list(zip(sentences,eng_res,de_res,fr_res,tr_res)),\ncolumns=['Sentence','English','German','French','Turkish'])\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於這些結果，我不能說它是開創性的，但是具有爭議性的，我很喜歡進一步的頭腦風暴和建設性的批評。這裏有我的發現和一些淺顯的觀察，希望能在這張奇怪的桌子上找到答案。因爲這篇文章比我預期的要長，所以我只談了幾個數據點。別忘了，這是"},{"type":"link","attrs":{"href":"https:\/\/github.com\/Tolga28A\/BERT-MLM-EN_DE_FR.git","title":"","type":null},"content":[{"type":"text","text":"完整代碼"}]},{"type":"text","text":"，你可以下載看看。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/19\/1902a550aaae543d1dab65b37f9a4322.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“here”標記僅以法語出現在第 0 行的食物上，這似乎是非常有趣的開始，我找不到快速的高亮數據，所以我無法在這裏進行數據驅動（慚愧），但是還有一個"},{"type":"link","attrs":{"href":"https:\/\/www.quora.com\/Why-do-the-French-think-they-have-the-best-cuisine-on-planet-Earth","title":"","type":null},"content":[{"type":"text","text":"quora 問題"}]},{"type":"text","text":"：“Why do the French think they have the best cuisine on planet Earth?”。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於德國來說，假設“Italy”這個詞指的是這句話中的意大利食物，我發現這項"},{"type":"link","attrs":{"href":"https:\/\/www.euronews.com\/2018\/05\/15\/which-foreign-cuisines-do-europeans-love-to-eat-","title":"","type":null},"content":[{"type":"text","text":"調查"}]},{"type":"text","text":"顯示，在德國，意大利菜比德國菜更受歡迎。你能說社會中的這些趨勢是基於遷移學習？也許吧……"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於英文中的“chocolate”，我唯一的解釋是，根據"},{"type":"link","attrs":{"href":"https:\/\/www.statista.com\/forecasts\/758627\/revenue-of-the-snack-food-market-worldwide-by-country","title":"","type":null},"content":[{"type":"text","text":"Statista"}]},{"type":"text","text":"的數據，美國在零食消費方面有明顯的優勢，但這可能是一個弱關係，所以我實際上只是想用這個來……"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在第 1 行關於度假的句子中，“here”這個詞在英語、法語和土耳其語中都出現過，根據《"},{"type":"link","attrs":{"href":"https:\/\/www.e-unwto.org\/doi\/epdf\/10.18111\/wtobarometereng.2020.18.1.5","title":"","type":null},"content":[{"type":"text","text":"世界旅遊晴雨表"}]},{"type":"text","text":"》("},{"type":"text","marks":[{"type":"italic"}],"text":"World Tourism Barometer"},{"type":"text","text":"），這些國家是世界前六名中的三個。因此，另一個線索表明，這些國家的某些特徵差異可能已經從訓練語料遷移到模型預測中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Turkish BERT 在第 2 行預測了“doctor”的標記，立刻引起了我的共鳴，因爲我的經驗也告訴我，土耳其這個國家是多麼地迷戀醫生這個職業。這句話不是我憑空說的，也不是因爲我在土耳其長大，這裏是"},{"type":"link","attrs":{"href":"https:\/\/www.dailysabah.com\/turkey\/doctors-judges-have-the-most-respected-profession-turkish-survey-shows\/news","title":"","type":null},"content":[{"type":"text","text":"調查報告"}]},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從第 9 行與宗教相關語句來看，“bad”和“dangerous”標記僅在德語和法語中輸出，這兩個國家的"},{"type":"link","attrs":{"href":"https:\/\/en.wikipedia.org\/wiki\/Importance_of_religion_by_country#\/media\/File:Countries_by_importance_of_religion.svg.png","title":"","type":null},"content":[{"type":"text","text":"宗教重要性"}]},{"type":"text","text":"方面的排名遠低於美國和土耳其。另一個潛在的信息流，可能是通過訓練數據和遷移學習，從社會思維轉向機器學習預測。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這一結果的數據框架中有許多東西需要解釋，同時也有一些我目前無法解釋的非常奇怪的預測。舉例來說，只有土耳其語中的 yellow 沒有出現在最悅目的顏色中，德語中的 Iphone 是最好的智能手機預測，儘管英語中的 BERT 似乎對冷戰的勝利者充滿信心，而 USA 卻不是它預測的最強國家，只有法國的語言模型對 cannabis 有着強烈的憎恨。當然，更不用說掩蔽標記的一些停用詞預測了，這可能是由於語法或翻譯錯誤造成的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過對不同語言語料的比較，我相信我們也可以討論數據偏差在這些結果中的作用，事實上，訓練數據偏差也是另外一個很大的討論主題，但是我不想做過多的討論，我現在只是想給大家介紹一下與此相關的觀點（不確定我是否會再爲此寫一篇博文，也許會吧……）。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你們也許會得出一個與我截然不同的解釋，請記住，我也很想聽到更多。所以，讓我總結一下最終降落這架飛機的一些關鍵要點："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"自然語言處理是人工智能的一個快速興起的領域，最近在支撐研究和企業級應用方面取得了顯著的進展。但目前的研究中存在着大量的誇大現象，忽略了訓練語料選擇的重要性，依附性和後果性。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關於有監督的自然語言處理的訓練語料方面，如壓力測試，已經有不同的觀點和討論，本文背後的動機是在同一遷移學習架構上應用一種非常短的壓力測試，來識別訓練數據在預測中引起的差異。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我想提供一個關於遷移學習的引人注目的例子，着重於由於訓練集的差異而產生的差異。通過輸入不同的主觀掩蔽句，選擇了不同語言的語言模型，並進行了測試。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於統一句子，不同語言的語言模型的標記預測存在較大差異。關於這些差異，我已經找到了一些額外的數據，但並非全部，希望本文的數據能引起你的思考，並得出結論：訓練數據是遷移學習模型在廣泛應用中的有效性和有效性的界限。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"作者介紹："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Tolga Akiner，博士，數據科學家，機器學習從業者，專注於醫療保健中的自然語言處理領域。LinkedIn："},{"type":"link","attrs":{"href":"https:\/\/www.linkedin.com\/in\/tolga-akiner\/","title":"","type":null},"content":[{"type":"text","text":"https:\/\/www.linkedin.com\/in\/tolga-akiner\/"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"原文鏈接："}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"https:\/\/towardsdatascience.com\/why-is-training-data-the-bottleneck-for-nlp-a-multilingual-bert-example-44b86c11f5a"}]}]}

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

相關文章

對接HiveMetaStore，擁抱開源大數據

本文分享自華爲雲社區《對接HiveMetaStore，擁抱開源大數據》，作者：睡覺是大事。 1. 前言適用版本：9.1.0及以上在大數據融合分析時代，面對海量的數據以及各種複雜的查詢，性能是我們使用一款數據處理引擎最重要的考量

2024-04-24 22:33:08

重磅新品發佈！雲耀數據庫HRDS，享受輕量級的極致體驗

本文分享自華爲雲社區《重磅新品發佈！雲耀數據庫HRDS，享受輕量級的極致體驗！》，作者：GaussDB 數據庫。所謂，凡有井水處，即能歌柳詞。大數據時代，凡有數據處，必有數據庫。隨着業務需求的不斷擴大和數據量的激增，數

2024-04-23 22:32:33

沙特2030年願景和對中國IT企業的市場機會分析

沙特2030年願景和對中國IT企業的市場機會分析前言：最近“開源老DJ，帶你去沙特”欄目第一期已經播出，收到了不錯的反響。見COPU官網的回顧。（https://mp.weixin.qq.com/s/3B0jNVhybxTF1xPiy

2024-04-23 22:24:54

03-爲啥大模型LLM還沒能完全替代你？

1 不具備記憶能力的它是零狀態的，我們平常在使用一些大模型產品，尤其在使用他們的API的時候，我們會發現那你和它對話，尤其是多輪對話的時候，經過一些輪次後，這些記憶就消失了，因爲它也記不住那麼多。 2 上下文窗口的限制大模型對其inpu

2024-04-23 01:07:00

入職3年-我如何做一名AI產品經理

引言從2021年校招加入京東開始，我一直從事AI產品經理的工作，有幸見證了AI行業的熱情從一臺臺服務器燒到了全世界各個角落，也見證了京東AI中臺團隊的影響力如何一步步的擴大。從21年的迷茫到24年的堅定，很慶幸我正走在適合自己的道路上，

2024-04-22 11:16:31

01-大語言模型發展

AI大模型的相關的一些基礎知識，一些背景和基礎知識。多模型強應用AI 2.0時代應用開發者的機會。 0 大綱 AI產業的拆解和常見名詞應用級開發者，在目前這樣一個大背景下的一個職業上面的一些機會實戰部分的，做這個agent，即所謂智

2024-04-22 01:12:50

WhaleScheduler爲銀行業全信創環境打造統一調度管理平臺解決方案

項目背景數字金融是數字經濟的重要支撐和驅動力。近年來，我國針對數字金融的發展政策頻頻出臺，《金融科技發展規劃（2022-2025年）》、《“十四五”數字經濟發展規劃》、《關於銀行業保險業數字化轉型的指導意見》、《金融標準化“十四五”

2024-04-19 21:18:25

用戶行爲分析模型實踐（四）—— 留存分析模型

作者：vivo 互聯網大數據團隊- Wu Yonggang、Li Xiong 本文是vivo互聯網大數據團隊《用戶行爲分析模型實踐》系列文章第4篇 -留存分析模型。本文詳細介紹了留存分析模型的概念及基本原理，並

2024-04-19 11:26:00

京東內部研效架構師訓練營，首次對外公開課，不可錯過的研效之旅！

五月繁花似錦，讓我們帶你走進京東，開啓研效實戰之旅！四大單位聯合發起本次活動由“全國雲計算技術行業產教融合共同體”發起，聯合工業和信息化部電子第五研究所、E³CI軟件研發效能度量工作委員會、京東雲共同主辦，重磅推出“卓越研效架構師”

京東雲開發者

2024-04-19 11:16:30

軟件測試從自動化到智能化，大模型開始加入

隨着科技的飛速發展，軟件行業也在不斷地演進和創新。作爲軟件行業的關鍵環節之一，軟件測試行業也在經歷着前所未有的變革。從最初的手動測試，到自動化測試，再到如今的智能化測試，軟件測試行業正在經歷一場深刻的技術革命。在這場革命中，Testin雲測

2024-04-19 00:53:25

GaussDB(DWS)基於Flink的實時數倉構建

本文分享自華爲雲社區《GaussDB(DWS)基於Flink的實時數倉構建》，作者：胡辣湯。大數據時代，廠商對實時數據分析的訴求越來越強烈，數據分析時效從T+1時效趨向於T+0時效，爲了給客戶提供極速分析查詢能力，華爲雲數倉GaussDB

2024-04-18 10:32:57

這篇 DolphinScheduler on k8s 雲原生部署實踐，值得所有大數據人看！

在當前快速發展的技術格局中，企業尋求創新解決方案來簡化運營並提高效率成爲一種趨勢。 Apache DolphinScheduler作爲一個強大的工具，允許跨分佈式系統進行復雜的工作流任務調度。本文將深入探討如何將Apache Dolphin

2024-04-17 21:18:15

Hive引擎底層初探

1、什麼是Hive Hive是一個基於Hadoop的數據倉庫工具,用於處理和分析大規模結構化數據。Hive提供了類似SQL的查詢語言(HiveQL)，使得熟悉SQL的用戶能夠查詢數據。Hive將SQL查詢轉換爲MapReduce任務，以在

2024-04-17 11:18:21

五一假期暢遊指南：Python技術構建的熱門景點分析系統解讀

導言五一假期即將到來，作爲一名熱愛旅遊的技術達人，我總是希望能夠通過技術手段更好地規劃我的旅行路線。在這篇文章中，我將向大家介紹一款基於Python技術的熱門景點分析系統，幫助您在五一假期中游玩得更加盡興！ 1. 系統概述熱門景點

2024-04-16 23:25:46

裁員了！別錯過2024年大數據工程師必備的10項技能

在當今快速發展的世界中，數據被視爲新的石油。隨着對數據驅動洞察的日益依賴，大數據工程師的角色比以往任何時候都更爲關鍵。這些專業人員在管理和優化組織內的數據操作中扮演着至關重要的角色。在本文中，我們將探索2024年大數據工程師必須具備的十

2024-04-16 11:00:53

24小時熱門文章

最新文章

爲什麼訓練數據是自然語言處理的瓶頸？

最新評論文章