阿里多語言翻譯模型的前沿探索及技術實踐

{"type":"doc","content":[{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文的主題爲阿里多語言翻譯模型的前沿探索及技術實踐,將分三個部分介紹阿里巴巴在機器翻譯方面的工作:首先是多語言神經網絡機器翻譯的動機、定義和挑戰;其次是阿里在機器翻譯方面的前沿探索;最後介紹多語言神經網絡機器翻譯在阿里巴巴的應用。"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/c5\/c5c48d58fb0f1d52f4157310b37b6cb8.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"多語言神經網絡機器翻譯"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"多語言機器翻譯的好處與挑戰"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"機器翻譯的目標是讓全世界沒有語言障礙。現在整個世界有6000種語言,但其實只有很少種語言(幾百種)能在網絡上搜索到,其中有200多種語言涵蓋了大多數國家。阿里巴巴想要通過技術快速支持這200多種語言的翻譯,然後擴展整個阿里巴巴的商業邊界。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/08\/082d3aa84a8bb9f5d6399ac2d06c1d07.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"200多種語言互譯的話就會有四萬多種語言對,對阿里巴巴團隊來說,同時handle四萬多種翻譯方向無論是在模型還是數據處理方面都是極其困難的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此阿里的策略是採用多語言機器翻譯,本質很簡單,就是用一個模型來處理所有的翻譯方向。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/84\/8403cd4a86f3c63fb418051601aba583.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣的話一個模型可以處理四萬多個語言對。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"多語言機器翻譯有哪些好處?"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/0d\/0dd64b7a4964a664f5c3817e66c4684b.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"① 首先多語言機器翻譯採用獨一的模型框架,它可以減少一些部署或訓練開銷。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"② 統一訓練一個模型會帶來一些知識的共享。a.一些rich language能夠把一些知識transfer到一些low source language的上面,能提升低資源語言對的翻譯效果。b.同時因爲有多語言混合,一些低資源語言對可以見到一些原來看不到的輸入,可以提升低資源語言對模型的泛化能力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"但是,多語言機器翻譯也帶來一個挑戰"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"① 如果同時做200多種語言的機器翻譯,需要收集和篩選出200多個語種的高質量數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"② 阿里內部擁有超過10 billion的句對數據,要同時訓練這些數據的話,需要很強的工程能力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"③ a.四萬多種語言方向用一個模型來做,這個模型的容量是不是夠?這是個很大的問題。b.在採用一個大模型的情況下,我們怎樣才能提升各個翻譯方向上的性能或提升它的解碼效率?"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"阿里在機器翻譯方面的前沿探索"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本次報告主要是針對算法方面,因爲數據和工程方面其實已經有一些現成的方案了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/02\/0284cd3bdd9c886b4015f794af03b78c.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. 機器翻譯主流框架"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"過去一年阿里在機器翻譯算法方面做了一些探索,首先是多語言NMT的框架選擇。目前有兩套主流框架:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"① 分離式模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設要做200多個語種的機器翻譯,我可以給每個語言對定義一個encoder和一個decoder,這樣的話我會有200多個encoder,200多個decoder,在翻譯其中某個語言對時,選擇對應的encoder和對應的decoder去進行翻譯。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種框架的好處是每個語言有單獨的encoder和decoder,但是有個很大的問題,還是需要有400個模塊來支持200多個語種的機器翻譯。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以後增加語言會比較困難。模型規模化程度比較低,部署難度也比較大。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"② Universal Model"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所有語種共享編碼器和解碼器,在輸入源語言的時候加一個語言標誌,告訴這個模型要翻譯到哪個語言去。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種模型有一個很好的優勢,就是說整個模型只有一個encoder和一個decoder,模型很優雅。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是它有個缺點,對模型容量有要求,200多種語言同時用encoder和一個decoder,模型表達能力肯定會受到挑戰,在統一的模型框架裏面會出現很多語言衝突問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/da\/da910a3d1ded6d08052e1136190433e2.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"阿里還是傾向於選擇統一的框架,採用shared encoder和shared decoder。單一模型具有很好的可擴展性,但是會面臨一個很大的一個問題,語言的通用性和語言的特殊性這二者怎麼能在模型裏很好地表現出來?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/52\/52c701223cb4b127e86bc752d1921ce4.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對這個問題,我們在ACL2020的時候發表了一篇論文,提出了一個新的模型解決方案。假定原來只用單個shared encoder就不能足夠的去充分地去model語言的相似性和特殊性。我們提出了編碼器-中間語-解碼器這樣一個新的模型框架。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原來的編碼器和解碼器不變,在中間插了一箇中間語模塊,採用固定長度和固定內存。假設我要做中譯英,中文進來的時候,先用編碼器給中文做編碼,再經過中間語模塊,可能把中文一些特殊的信號去除掉,獲得更通用的一個表示,最後採用這個表示來譯出英文。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"中間語模塊有個好處,不管輸入什麼語言,都用一個共用的表示去表達。具有相同意思的一箇中文或英文句子進入編碼器,然後通過中間語模塊可以抽取出很類似的表示,這樣的話我能得到一個更好的通用語言表示。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"同時引入一些跟語言相關的特性給解碼器去區分不同語言。通過這樣一個框架來協調語言的通用性和語言的特殊性。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/48\/48f41dfc960d957c6901c72d6bcd1eaa.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在訓練中間語模塊的時候,我們引入了兩個損失函數來約束它,圖右邊的損失函數有五項。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一部分是翻譯的條件概率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二部分我們希望中間語這個模塊capture的信息不會丟失掉,所以需要做一個編碼解碼的過程。假設源語言爲中文進入模型,通過中間語模塊之後得到一個表示,解碼的時候通過中間語能把原來中文的信息完整地表示出來,這樣的話能保證中間語模塊學到有用信息,這是一個reconstruction loss。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"model還需要跨語言之間semantic的一致性。無論是中文得到的中間語還是英文得到的中間語,在具有相同含義的情況下,中間語模塊給出的隱變量應該是很接近的。通過這兩個約束來訓練這個模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/47\/47e64b93923b610ac11b4db7fcef9e2f.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這是我們的實驗結果,我們在2013年WMT公開測試集和我們內部的一些測試集上做評測。我們的benchmark是一個九層的encoder,我們採用的模型結構是六層的encoder、六層的decoder、三層的中間語模塊。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/dc\/dca6c33616a7c2fb14b385bda1e79593.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"表一和表二是我們的實驗結果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的添加約束的多語言翻譯模型與直接單獨訓練單個模型的效果相當,達到comparable的效果。同時也比universal的模型有更好的效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們在訓練模型的時候是沒有法西或西法的數據的,但是因爲我們構建了一箇中間語模塊儘量去model更通用的語義的信息,而且用固定的一個隱變量去存儲,這樣能有更好的跨語言cross-lingual model能力,能更好的做zero-shot translation。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們在zero-shot translation上其實比universal model高出十個bleu,但是跟單獨的橋接的模型還是有差距的,還有四個bleu的差距。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"後續我們打算採用數據增強的方法去解決這種zero-shot的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5c\/5ce82897cc8170ee92b83b28bcdd1547.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"2. zero-shot translation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在訓練過程中我們可能採用中英或英法句對訓練,但我們沒有給模型喂中法句對數據,所以模型沒有見過中法翻譯句對。但是因爲我們用的multi-lingual模型具備這樣的翻譯能力,那麼這種中法的翻譯對於這個模型來說是就是zero-shot translation。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們想要做200個語種的互譯,其實我們統計過,大部分語言之間基本是沒有平行語料的,99%的翻譯方向上沒有平行語料。大部分平行語料都是其他語言譯爲英文以及英文譯爲其他語言,很多非英語語種相互之間基本沒有雙語平行數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然我們之前的框架已經能緩解一部分問題,但是我們還想繼續提升,接下來怎麼做?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最直接的方式就是加數據。沒有雙語數據,但是有很多單語數據。現在最有效的一個策略就是反向翻譯。訓練好多語言模型之後,假設沒有中法雙語數據,可以將法語的單語反向翻譯成中文,就得到了一批中法的雙語數據,可以加進去訓練。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是反向翻譯有個問題,因爲我們多語言模型的zero-shot翻譯效果很差。通過反向翻譯造出的數據其實質量很低,直接用於訓練可能不能充分發揮單語數據的效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/48\/48ecaa43d31b1771b09c0db49e055483.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然後我們在今年EMNLP 2020的時候提出了一個新的策略,是基於反向翻譯的一個改進,引入了一種修復模型,去對原本比較低質量的譯文做修復,從而提升整個僞語料的質量,來充分利用單語數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"① Repair Translation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先給一個問題定義,假設只考慮兩個語言,這兩種語言有很多單語數據,假設有一個預先訓練好的翻譯模型,用x到y和y到x表示不同的方向。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"引入了兩個階段,一個階段是做翻譯的repair,對譯文做修復,構建一個多語言修復模型去修復反向翻譯得到的語料。這個模型用DR來表示,DR也是一個多語言模型,可以用一個語言tag來標明要修復的語種。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"怎麼訓練修復模型?需要給出真實的原文、模型翻譯的譯文以及修復好的譯文,我們需要一個3元組。可以通過來回翻譯的策略來構建這樣一個3元組。假設給定中文原文,可以先翻譯成英文,再翻譯回中文,就可以得到一個在給定英文的情況下修復中文譯文的一個語料。基於這種語料就可以去訓練DR,即多語言修復模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"上圖右部可以看到整個流程。假設做中法zero-shot translation翻譯,預先基於中英和英法雙語語料訓練好一個多語言模型,給定一些中文和法文的單語數據,用訓練好的模型對這些單語數據做反向翻譯,將中文翻譯成法語或將法語翻譯成中文,得到一些僞語料,然後使用修復模型對僞語料做修復。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果把中文翻譯成法語,那麼法語譯文就是僞造的部分,我們用修復模型對法語譯文做進一步的修復,得到新的法語譯文,得到更好的僞語料。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有了更好的僞語料,就可以去更新多語言模型,這是第一個階段,即repair translation。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5b\/5bbaf2f1bfd326008795e653000803af.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"② Iterative Repaired Back-Translation"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二個階段,在NMT的訓練過程中也可以優化修復模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"給定單語,可以做兩次翻譯,得到修復語料,然後構建新的修復模型,這整個過程是一個持續迭代的過程。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"假設給定一個修復模型,先用兩種語言的單語數據得到一些僞語料,再用修復模型得到一些修復好的僞語料,然後去訓練NMT模型,然後用新的NMT模型重新做來回翻譯,得到一些新的修復數據,然後繼續訓練修復模型,這個過程會逐漸修復僞語料的質量。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多語言修復模型其實很簡單,有兩個輸入和一個輸出,採用雙源dual source transformer architecture來做。給出源語言X,然後得到翻譯,然後直接去預測目標語言句子Y。loss也很簡單,給定一個句對,去預測要修復的目標句子,然後得到一個loss。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/fc\/fcd1cfb99375955c94c183112290b7fb.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這套方法不止可以做多語言的翻譯,其實它也可以做一些model domain setting。在EMNLP 2020的時候,我們其實是在多領域的setting下進行實驗,最後一行是我們方法的結果,可以看出我們的方法優於包括反向翻譯或迭代反向翻譯在內的過去的所有方法。其實證明了我們可以藉助這種修復model去提前修復back-translation產生的僞語料中的錯誤,提高僞語料質量,從而提升整個訓練的效率,充分利用單語數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/bb\/bbb1a418300b15160d9d9d8ce4a20879.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然採用了修復模型去修復僞語料裏面的一些錯誤,但是訓練過程中的僞語料並不是越多越好,到達一定程度之後再加僞語料性能反而會下降。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"不管是採用修復模型還是直接用back-translation構建僞語料,得到的僞語料其實不可能完美,跟人工語料還是有差距。所以加很多僞語料其實有可能會污染到訓練數據。加的僞語料越多,整個語料的質量越可能會下降,從而影響到模型的訓練效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另一方面,採用反向翻譯這種策略去構造僞語料雖然能提升zero-shot translation的效果,但是它的cost很大,反向翻譯這個過程很耗時,需要很多GPU資源。雖然反向翻譯策略在實際過程中很好用,但是無法生成太多的僞語料,而且僞語料過多也會影響到模型的訓練。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/18\/186e0b4c1fbdeaf2d009cbdd4f2f4a27.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"3. 怎麼更好地去利用那些單語數據呢?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"現在一個主流的策略就是利用預訓練模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們也在整合預訓練模型上做了一些探索。預訓練模型在NLP領域已經成爲了一個新的範式,對一些NLU和NLG任務都有很顯著的提升。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"預訓練是怎麼做的?首先通過bert、RoBERTa或者MASS利用大量單語數據預訓練一個模型,然後在帶label的下游任務上做finetuning,就可以充分利用預訓練模型的效果。這個方法對於NLG任務很重要,但是在用到NLG或NMT上會有一些問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一,現在的預訓練模型finetuning的策略是直接用bert去初始化,這樣的話會有個災難性遺忘問題。初始化之後在下游任務做微調,如果下游任務數據量足夠大(類似於翻譯),可能下游任務有幾百萬上千萬的數據,這種情況下進行finetuning,預訓練的效果就很可能不存在。因爲整個參數被調整了,所以模型會遺忘掉之前訓練的一些知識。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二,現在一些比較好的現成模型很難在decoder上整合,因爲NMT是帶條件生成式語言模型,而bert的訓練往往是不帶條件的,很難用到decoder裏面去。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三,預訓練模型容量很大,往往很難finetuning,而且對學習率很敏感,稍微調一下學習率,模型效果就有很大波動。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這就是爲什麼現在一些主流pre-trained model很難在NMT上有很好的推廣。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/5e\/5e77c667f8ed4975265a25730e4315c8.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"4. 進一步的探索"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們其實還是沿着這個方向做了一些探索,我們在想怎麼更好地把bert用到multi-lingual NMT上。我們去年和今年分別做了兩個工作去整合預訓練模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一個工作是表示融合,把預訓練這種帶有上下文的表示融合到NMT裏面去。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二個工作就是在第一個工作的基礎上進一步簡化,提出了一個很輕量的適配器,將其插入到bert的層數裏面,在下游任務finetuning的時候,只要調一些調適配器的參數就能比較好地利用預訓練模型的效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/cb\/cbb68b9ecb185d1d97b4fca3110c1500.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一,動態融合的方法,如上圖,左下是一個預訓練模型,可以從中得到一堆向量表示,右邊是一個NMT的encoder或者decoder。在融合預訓練表示的時候,我會做兩個事情,首先我們要決定預訓練模型的哪些層是重要的(不同的層輸出不同表示)。對於當前encoder來說,需要做一個attention去找到預訓練模型不同層的重要程度。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二,我們得到attention之後,用gating的機制去選擇哪些信息對當前的encoder是有用的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其實這兩個事情的本質是把預訓練模型裏面的有用信息更好地萃取到NMT模型中。我們可以用這套方法在encoder裏面整合bert,在decoder裏面我們可以整合GPT。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/d3\/d361ffa46afd1a52774646af863a6662.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們在WMT一些主流的benchmark上做了實驗,通過動態融合的策略,我們可以很好地利用預訓練模型的一些知識來提升當前的NMT的效果。這種策略分別在WMT17中英、WMT14英德和德英上都有比較明顯的提升,比之前直接finetuning能高出一個多點(bleu值)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們今年在NIPS上重新反思了原來的萃取方法,做了進一步簡化。之前的這套方法還是很重的,因爲我們會有不同的模塊,存在預訓練模型和NMT模型兩個模型,沒有很好地融合在一起。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a6\/a690911f3630c6dfa121446ac82466fc.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們提出了一個更簡化的新方法,就是用適配器來用整合bert。採用用一個很輕量的適配器,這樣的話我們就可以插到bert層裏面,然後在下游任務finetuning的時候,只finetuning一些跟適配器相關的參數。這樣做有什麼好處?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一,其實跟之前的方法是一個思路,我們保證bert不變,這樣的話bert就不會存在遺忘問題,就可以儘量緩解之前提到的災難性遺忘問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二,如果要用bert,可以在decoder裏使用mask-predict來解決帶條件和不帶條件的問題。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三,因爲用了適配器這種很輕量的結構,所以不需要對整個bert做finetuning,所以整個模型的訓練對學習率很魯棒。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/7e\/7e505fececaa171da97623f669d448f3.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"拿非遞歸的NMT來舉例,假設encoder decoder都用bert做初始化,然後我們在bert上面插兩個layer,即兩個adapter。Source adapter插入到encoder部分,由兩個feed forward layer組成。Target adapter包括encoder-decoder attention和一個feed forward layer。訓練的時候bert的相關參數固定不變,只調adapter就能取得比較好的效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/38\/38cfd3d5ad515a8e91165d26da2a83f4.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"訓練的時候把Y裏面的某些詞屏蔽掉,然後預測這些詞。整個訓練的loss跟bert更相似,這樣的話也能更充分地保證預訓練和finetuning更接近。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們採用mask-predict的策略做解碼。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這套策略其實還能用到正常的NMT上。例如對源語言用bert做初始化,然後插入一個source adapter,對decoder端做完全隨機初始化,這樣去訓練,後面實驗也會說明這樣的策略效果也不錯。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"或者我們直接去訓練一個mbart,直接採用sequence to sequence的pre-trained model,在其上插入adapter也有效果。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/be\/be2fbf4b796ed01500c674d373837601.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面講一下我們內部的一些實驗,上圖是我們在非遞歸解碼NMT上取得的結果。我們在IWSLT14和WMT16和WMT14上都做了一些實驗。可以看到我們的這個策略比之前直接做bert fusion要好很多,而且這個策略不需要有兩個encoder,解碼速度也較快。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在一些正常的NMT上我們也做了實驗,例如直接在encoder上用bert做初始化,同時固定住bert,然後插入adapter,decoder隨機初始化,然後做訓練。這種方式在WMT14的英德和英法上都取得了不錯的效果,能達到之前微軟bert fusion策略的效果,但是我們整個模型容量更小,因爲我們的encoder部分不需要用額外的參數,只需要用adapter。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/f8\/f84cf1f725225ef923adf8e8bb1c3793.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"後續我們還探索了multi-lingual bert的效果。我們可以用multi-lingual bert做初始化,然後採用adapter。可我們在IWSLT上做了實驗,可以看出這種策略能比mask-predict方式高出五個點(bleu值)。這個策略也好於直接訓練的transformer-base模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"後續還有一個比較有意思的發現,back-translation生成的僞語料不是加得越多越好,加到一定程度之後反而翻譯性能會下降。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是藉助pre-training策略就能很好地利用這種單語信息,能比直接用back-translation高出兩個點(bleu值)。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"多語言神經網絡機器翻譯在阿里巴巴的應用"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有了預訓練模型之後,我們就可以很好的去提升多語言NMT場景下的低資源和零資源問題,因爲大部分的語言對(翻譯方向)的語料資源都很少,藉助預訓練模型和adapter策略,我們可以融合更多的單語數據來訓練,提升multi-lingual NMT的效果。這套策略其實已經用到了我們的線下系統上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ed\/ed52cf8ce628240ee556c360c2c5318d.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"後面會講一下落地的實現。使用預訓練模型會有個很大的問題,因爲模型很大,解碼速度會很慢。爲了上線我們會做一些優化,其實大部分時間開銷是在decoder部分,它佔了80%到90%的開銷。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們採用了deep encoder shallow decoder策略、share attention策略,最後我們還做一些工程上的改進,採用short list預測來提升速度。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/93\/93849360928dd3c16ed70b31c524dcc9.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲decoder解碼是一步一步執行的,所以整個瓶頸都在decoder端。之前有論文做了一些分析,如果把encoder加深,把decoder層數降低,基本不怎麼損失翻譯效果,但是速度上能有兩倍提升。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"論文裏面很極端,把decoder層數從六層減到了一層。實際過程中我們發現一層的decoder的表達能力很受限,語言模型能力會比較弱。我們的decoder一般用三層。我們把encoder的層數加得很深,這樣能在整個模型容量、表達能力不怎麼下降的情況下加速解碼。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/02\/0211713a6b071ec3f4c5edbe836377f5.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"根據肖桐博士在IJCAI的論文Sharing Attention Weights for Fast Transformer,我們也做了實驗去驗證,得到一些相似的結論,即decoder裏面有幾部分的attention其實沒有那麼重要。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"① encoder decoder attention這部分,假設用六層,每層都要做一次計算。後來發現後面五層encoder decoder attention其實可以被第一層複用,如果直接用第一層的encoder decoder attention,結果基本沒什麼差距。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"② self-attention裏面的權重,這裏也可以被簡化,我們發現每一層的self-attention權重其實差異不大,採用這個策略也能加速。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/b0\/b0e679603a8886abaf8bed93a95239ec.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"③ Shortlist Prediction。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"多語言機器翻譯模型會share一個大字典,但我們在向某個語種做翻譯的時候不需要採用整個字典做預測,只需要採用與這個語種相關的一些詞,這樣的話我們可以動態地調整字典,用一個字典子集也能對整個模型做提速。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有了這三個策略再結合一些工程上的圖優化改進,解碼效率能提升三到四倍。這樣能使我們基於大規模預訓練模型的NMT能上線運行。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/41\/41f5b8f41e66f5370f0e954120180622.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"阿里翻譯主要做的一個事情是讓商業沒有語言障礙,其實我們想要貫徹的是阿里巴巴的全球買、全球賣、全球旅遊、全球支付理念。爲什麼需要阿里翻譯,如果沒有解決語言問題,這些理念就無法實現。商業邊界會很受限。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/71\/715a924c969640470d6394b180a93ae4.png","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"阿里翻譯構建了一個大規模的知識挖掘系統,然後在上面構建了各種各樣的機器翻譯模型,包括多領域機器翻譯模型、多語言機器翻譯模型,以及人機協同平臺。阿里翻譯支持了阿里內部大部分跨語言相關應用。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/11\/11c8bccdd8dd2de7893cab53fe302552.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面簡單介紹一下多語言NMT在阿里巴巴的應用。我們最近九月份剛上線了支持200多個語種的機器翻譯服務,已經用於阿里巴巴速賣通,在速賣通上可以看到阿里翻譯可以支持200多個語種。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們的200多個語種的機器翻譯服務最近也在阿里雲上線了,可以直接在阿里雲上搜機器翻譯做一些產品體驗。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/1d\/1dec1fc2a01b9979c5a2903e79084fb4.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"簡單總結一下。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/40\/40637e586dbf5ecfec502c26e15fd4c7.jpeg","alt":"圖片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了支持阿里巴巴全球買全球賣的策略,我們想要直接構建一個NMT系統,能夠支持涵蓋全球大部分國家的200多種語言。爲了更好、更方便的部署,我們採用了multi-lingual NMT框架。我們設計了一個基於中間語的多語言模型結構,以及基於迭代修復的反向翻譯的數據增強方法,然後我們設計了兩種不同的策略去整合預訓練模型,最後爲了整個模型上線還做了一些加速。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"分享嘉賓:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"張志銳 博士"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"阿里巴巴達摩院 | 算法專家"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"演講者簡介: 現阿里巴巴達摩院算法專家,中國科學技術大學與微軟亞洲研究院聯合培養博士,主要研究方向是機器翻譯、自然語言生成、對話系統等,曾在微軟亞洲研究院、微軟雷德蒙德研究院實習,已在ACL\/EMNLP\/NAACL\/NeurIPS\/AAAI等國際頂級會議上發表相關論文10餘篇,並擔任多個國際頂級會議審稿人,Google Scholar的論文Citation達到500, H-index爲10。目前在阿里巴巴達摩院翻譯團隊負責基礎通用模型優化和先進翻譯技術研究。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文轉載自:DataFunTalk(ID:dataFunTalk)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"},{"type":"link","attrs":{"href":"https:\/\/mp.weixin.qq.com\/s\/bU8p52VAP3pr9VSqnDFFNg","title":"xxx","type":null},"content":[{"type":"text","text":"阿里多語言翻譯模型的前沿探索及技術實踐"}]}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章