愛奇藝多語言臺詞機器翻譯技術實踐

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"7月3日下午,愛奇藝技術產品團隊舉辦了","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"“i技術會”第16期","attrs":{}},{"type":"text","text":"技術沙龍,本次技術會的主題是“","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"NLP與搜索","attrs":{}},{"type":"text","text":"”。我們邀請到了來自","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"字節跳動、去哪兒和騰訊","attrs":{}},{"type":"text","text":"的技術專家,與愛奇藝技術產品團隊共同分享與探討NLP與搜索結合的魔力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其中,來自愛奇藝的技術專家","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"張軒瑋","attrs":{}},{"type":"text","text":"爲大家帶來了","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"愛奇藝多語言臺詞機器翻譯技術實踐","attrs":{}},{"type":"text","text":"的分享。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"福利!關注公衆號“愛奇藝技術產品團隊”,在後臺回覆關鍵詞“NLP”,就可以獲得本次i技術會嘉賓分享完整PPT和錄播視頻。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"以下爲“愛奇藝多語言臺詞機器翻譯技術實踐”分享精華內容,根據【i技術會】現場演講整理而成。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本次分享的第一部分是愛奇藝多語言臺詞機器翻譯實踐開展的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"相關背景","attrs":{}},{"type":"text","text":",第二部分是愛奇藝針對多語言臺詞機器翻譯","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"模型的一些探索和優化","attrs":{}},{"type":"text","text":",最後是該模型在愛奇藝的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"落地與應用情況","attrs":{}},{"type":"text","text":"。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"01 愛奇藝多語言臺詞機器翻譯實踐的相關背景","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"2019年6月,愛奇藝正式推出服務全球用戶的產品","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"iQIYI App","attrs":{}},{"type":"text","text":",並通過中臺系統爲iQIYI App提供全球化運營支持,由此開啓了","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"海外市場佈局之路","attrs":{}},{"type":"text","text":"。作爲影視內容服務商,其中必然涉及大量長視頻,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"而長視頻的出海,重要的一環就是臺詞翻譯。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"目前,愛奇藝已在","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"多個國家佈局","attrs":{}},{"type":"text","text":",涉及多種語言的臺詞翻譯,主要有泰語、越南語、印尼語、馬來語、西班牙語、阿拉伯語等等語言,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"這就使得多語言翻譯成爲了迫在眉睫的現實需求。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外,與通用翻譯相比,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"臺詞翻譯有一些獨有的特點如:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1)臺詞一般句子較短,上下文信息不足,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"歧義性大;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2)很多臺詞來","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"源於OCR或ASR識別的結果","attrs":{}},{"type":"text","text":",會有錯誤,可能影響翻譯質量;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(3)在臺詞對話中往往會涉及很多人物的指代,故而","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"角色名和代詞的翻譯","attrs":{}},{"type":"text","text":"對於臺詞翻譯來說尤爲重要;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(4)部分臺詞需要","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"結合視頻場景信息","attrs":{}},{"type":"text","text":"才能進行語義消歧。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正是","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"愛奇藝海外多國佈局的現實需要以及臺詞翻譯的獨有特點","attrs":{}},{"type":"text","text":"這兩大因素使得臺詞場景下多語言機器翻譯實踐成爲現實。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"02 多語言臺詞機器翻譯模型的探索和優化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}},{"type":"strong","attrs":{}}],"text":"1.one-to-many翻譯模型優化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先介紹一下什麼是","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"one-to-many模型","attrs":{}},{"type":"text","text":"。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"One-to-many顧名思義,即通過不同語言翻譯之間的參數共享,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"通過一種模型來達到翻譯多種目標語言的目的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個模型設計的初衷是","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"節約維護和訓練成本","attrs":{}},{"type":"text","text":"。前面已經講到,目前,愛奇藝已經佈局到海外多個國家,這就涉及到多種語言的翻譯,如果採用一種語言一個模型的方法,隨着目標語言的增多,我們需要訓練、部署和維護的模型也會越來越多,導致運營成本的增加。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"經調研,我們發現了one-to-many模型,它極大地減輕了模型的訓練和部署維護的成本,可以充分","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"利用不同語言之間遷移學習的特點,起到相互促進的作用,從而提高模型效果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖1是","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"transformer架構","attrs":{}},{"type":"text","text":",是目前大多數機器翻譯模型優化的主流框架,我們也是以此爲基礎,在上面進行優化。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0b/0b31b09cdd3551658d836b43ca486a75.jpeg","alt":null,"title":"","style":[{"key":"width","value":"50%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖1:transformer模型","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"而針對one-to-many模型,我們借鑑近期大家較爲熟悉的預訓練模型","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"BERT","attrs":{}},{"type":"text","text":",設計了一個特定的輸入形式。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/be/be39c8b1b6ad048b88cb7a442634b8bd.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖2","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每個輸入的token的表達都是由三種embedding組成,分別是:","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"token embeddings、segment embeddings、 position embeddings。","attrs":{}},{"type":"text","text":"我們把語言類型token作爲單獨的一個域,那它具有不同於內容的segment embeddings。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"segment embeddings由","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"兩部分組成","attrs":{}},{"type":"text","text":",一個叫EA,一個叫EB。","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"EA相當於前面語言的token的segment,後面的EB就是內容的embeddings,不同語言的L是不同的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外語言token表達也會作爲decoder的第一個輸入作爲指導模型的解碼。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}},{"type":"strong","attrs":{}}],"text":"2.融合臺詞上下文信息","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"剛纔提到,臺詞翻譯第一個顯著特點就是文本較短,上下文信息不足,容易產生歧義。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這裏舉個例子,比如“我想靜靜”就可能有兩種意思,一是let me alone,二是 I miss Jingjing。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"僅憑文本,我們很難區分究竟是哪一個意思。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"但我們如果能夠結合臺詞的上句和下句,就可以減少這種歧義性。","attrs":{}},{"type":"text","text":"比如,上下句分別是“你走吧”、“再見”,我們就可以知道他想說的是let me alone。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此,我們設計了用","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"BERT style","attrs":{}},{"type":"text","text":"的方式融合臺詞上下文的模型,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"輸入時將上文和下文分別與中心句進行拼接,以特定的分隔符做分隔,","attrs":{}},{"type":"text","text":"而在encoder輸出,我們還會","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"對上句和下句進行mask","attrs":{}},{"type":"text","text":",因爲在解碼這個時候,由於上下句在編碼時已經被中心句吸收了相關信息,上句和下句已經不起太多作用。並且,如果不進行mask,有可能還會引來一些翻譯錯位的問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"那麼我們是如何融合上下文的呢?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/07/0785090da5ee3be60a0af66b32c95422.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖3","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"也是在輸入端,我們可以看到圖3和圖2的不同就在於我們除了把語言token和中心句用三種embedding向量融合之外,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"還會將上句“你走吧”和下句“再見”放在中心句的前後,","attrs":{}},{"type":"text","text":"然後以同樣的方式,每個token也是三種embedding的相加融合,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"把上下文作爲輔助信息,幫助中心句進行消歧。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們把語言、上句、中心句、下句分別標記爲","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"EA、EB、EC、ED","attrs":{}},{"type":"text","text":"四種,對這四種信息進行區分,每一種標記都對應一種segment embedding。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個輸入經過encoder之後,我們會對“你走吧”和“再見”進行mask,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"也就是在解碼的時候隱藏上句和下句,減少它對解碼的影響。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}},{"type":"strong","attrs":{}}],"text":"3.增強編碼能力","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除此之外我們還對編碼端做了一些提高。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Transformer裏面一個比較主要的組件就是","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"attention","attrs":{}},{"type":"text","text":",","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"其中base版本包含8個head","attrs":{}},{"type":"text","text":"。我們爲了強化attention,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"鼓勵不同的head學習不同的特徵,從而豐富模型的表徵能力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下圖是這4種attention的示意圖。","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"我們通過不同的mask策略實現不同的attention,圖中黑色的方塊代表mask掉的部分。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6e/6e4cd32667fb2a3176a217aed20d45b2.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖4","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"global attention:","attrs":{}},{"type":"text","text":"建模任意詞之間的依賴關係;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"local attention:","attrs":{}},{"type":"text","text":"強制模型發掘局部的信息特徵;","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"forward and backward attention:","attrs":{}},{"type":"text","text":"代表建模模型序列順序信息。forward只能看到前面,backward只能看到後面。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"通過人爲設定特點的attention,我們強制不同的head學習不同的特徵,避免產生冗餘的情況。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除此之外我們還借鑑bert,使用","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Masked LM","attrs":{}},{"type":"text","text":"任務增強模型對文本的理解能力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先將輸入的某一個詞進行mask,然後在輸出端進行恢復。比如“你走吧”,“我想靜靜”,“再見”,其中,“走”,“見”,都會被mask,輸出的時候再被恢復。這就使得encoder在這種任務中充分地學習到文本的表達,增進它對文本的理解。同時","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"將mlm loss乘以一定的權重加到總體的loss上,進行聯合訓練。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/38/386c6ce602631c339421a56d38eacc5d.jpeg","alt":null,"title":"","style":[{"key":"width","value":"50%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖5:MLM模型","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}},{"type":"strong","attrs":{}}],"text":"4.增強解碼能力","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了增強模型解碼端的能力,在訓練階段,我們要求解碼端在預測每個token的同時,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"增加對全局信息的預測,同時增強模型解碼端的全局前瞻能力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/69/69082f571a99cc3b0ea1ba658e720e73.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖6","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"比如這裏的G代表let me alone的embedding的平均向量,每個token都會預測這個向量,從而產生GLOBAL loss。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個好處就是你在解碼每個token的時候,我們可以讓模型也預計我們將要解碼的信息,而不會","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"過分依賴於我們前面已經解碼的信息","attrs":{}},{"type":"text","text":",這樣就使得模型具有一定的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"未來規劃的能力","attrs":{}},{"type":"text","text":"。同樣這也會產生一個loss,這個loss會和總體loss進行加權求和,用了一個β,也是小於1的權重,和整體的模型進行聯合訓練。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}},{"type":"strong","attrs":{}}],"text":"5.欠翻譯和過翻譯問題的解決","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"欠翻譯和過翻譯是模型在做翻譯時可能會經常遇到的一些問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"欠翻譯是指翻譯的目標語言詞語缺失,過翻譯指的是目標語言詞語冗餘。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"比如上文提到的“你走吧、我想靜靜、再見”這個案例,就有可能在模型訓練不到位的時候產生let alone,缺少了me,這就是所謂的欠翻譯。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外,也有可能翻譯成Let me me alone, 重複翻譯me, 這就是所謂的過翻譯。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這些都是我們不希望在翻譯結果中出現的,而產生這兩大問題的一個","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"本質原因就在於解碼生成的信息和編碼的信息不夠對等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/84/8466c076fa8451b58ed210894d8b26fc.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖7","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以我們增加了一個重建的模塊,對其進行約束。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"重建模塊對解碼端的輸出通過","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"一個反向翻譯的decoder翻譯成source","attrs":{}},{"type":"text","text":",也就是","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"恢復輸入","attrs":{}},{"type":"text","text":",從而使得解碼端的信息和編碼端的信息保持一致,約束解碼端,從而減輕欠翻譯和過翻譯問題,同樣它也會產生一個loss,和之前一樣也是加到總體loss,進行聯合訓練。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}},{"type":"strong","attrs":{}}],"text":"6.增強容錯能力","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了以上的探索優化之外我們剛纔也提到了一點,就是我們的臺詞字幕有很大一部分是來源於OCR或者ASR識別的結果, 難免會出現一些詞識別錯誤的問題,如果我們不進行特定處理有可能影響最後的翻譯質量。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"所以我們針對這個臺詞識別錯誤的問題,設計了一個容錯模塊。這個容錯模塊可以認爲是糾錯模塊。我們借鑑了去年發表的一篇論文所提出的一個模型——","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"T-TA模型","attrs":{}},{"type":"text","text":"。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個模塊類似大家應該很熟悉的transformer結構,但是我們在裏面做了一些特定的處理:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先是它使用了一種叫做","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"language autoencoding","attrs":{}},{"type":"text","text":"的方式,輸出的每個token","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"只能看到其他的token,但看不到它自己。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"也就是說,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"它輸出的表達是由它周邊的token的意思來產生的。","attrs":{}},{"type":"text","text":"比如X1是錯的,但X2,X3,X4是對的,在你經過大量的數據訓練之後,可以通過X2、X3、X4去生成正確的X1,從而達到一種糾錯的能力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"那要怎麼才能讓每個token看不到它自己而只能看到周邊呢?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/8b/8beb967d0d56868235764808b77610b2.jpeg","alt":null,"title":"","style":[{"key":"width","value":"50%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖8","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其實也很簡單,就是使用了一個","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"對角線mask的方式。","attrs":{}},{"type":"text","text":"這樣,每次它就只能看到他其他的token,中間的黑色對角是看不見的,也就看不到它自己。通過這個方式,對深黃色那部分進行這樣的特定處理,從而實現一種糾錯能力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們可以注意到","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"它的Q也只是用到position embedding","attrs":{}},{"type":"text","text":",因爲如果QKV和Self attention是一樣的,再殘差連接的話就會把token embedding給加到輸出,相當於把你剛剛挖掉的部分又填補上了,會產生信息泄露的問題,這樣就訓練不出一種糾錯的模塊。所以Q只是position embedding。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個模塊大致就是這樣,但是它們是怎麼融合到機器翻譯模型中呢?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/9e/9e07ca04a17151c168e9412c9fd2199b.jpeg","alt":null,"title":"","style":[{"key":"width","value":"50%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖9","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其實,只要直接和我們之前介紹的那些encoder進行","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"相加融合","attrs":{}},{"type":"text","text":",兩個encoder輸入都是一樣的,輸出進行相加融合,融合之後再進入後面的兩個decoder的處理,這樣就可以對原始encoder的錯誤進行糾正。比如這裏的“靜”錯輸成了“淨”,但T-TA的encoder卻能輸出正確結果,起到了糾錯的作用。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}},{"type":"strong","attrs":{}}],"text":"7.代詞翻譯","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"剛纔我們也提到,在臺詞翻譯領域另一個重要問題就是","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"代詞的翻譯。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因爲在對話中我們會涉及到很多人物之間的指代,比如提到你、我、他等等,在不同的場景下,對應的翻譯是不同的,這就大大提高了臺詞翻譯的難度。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"遇到這種情況我們該怎麼辦呢?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對這個問題,我們首先可以看一下它的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"表達數量","attrs":{}},{"type":"text","text":"以及","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"表達場景","attrs":{}},{"type":"text","text":"。因爲代詞在中文裏面可能很簡單,就是你、我、他,可能也就最多3、4種或者4、5種,但在其他語言中未必是這樣。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"比如","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"泰語的代詞第一人稱就有12種表達,第二人稱代詞有15種表達,第三人稱有5種表達。","attrs":{}},{"type":"text","text":"對於第一人稱,這12個表達還會隨着性別和使用場合的不同而發生變化。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"對話人身份之間的差異,也會使得這種代詞表達有所區別。","attrs":{}},{"type":"text","text":"這對於臺詞機器翻譯來說,是一個巨大的挑戰。所有這些不同場合都需要我們將其區分出來,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"而這項工作很難僅僅只通過文本來完成。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e7/e7f6888d93aa67c957ccfdd3c27da051.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖10:中文-泰語人稱代詞對應表","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"因此,我們做了一個","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"融合視頻場景信息的代詞的語義增強。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先我們通過","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"人臉識別和聲紋識別","attrs":{}},{"type":"text","text":"對齊臺詞和角色,通過這種對齊可以使得","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"每一句臺詞定位到它所處的場景。","attrs":{}},{"type":"text","text":"再將","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"角色人物屬性","attrs":{}},{"type":"text","text":"比如性別,年齡,人物關係,身份等","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"標註好,使角色的信息更豐富、更加立體。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fe/fed09edee39662f40b399aa42cce2867.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖11","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"左邊的模型裏面有兩個代詞,就是“你”和“我”,右邊的模塊是對“我”和“你”的一些信息的編碼。比如“我”就屬於男性,年齡是青年,“我”和對話人之間的關係是朋友等等。這樣分別對“我”和“你”進行編碼,編碼後用這些信息做一個","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"變換和降維","attrs":{}},{"type":"text","text":",分別加到對應的代詞上,使得解碼的時候,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"知道這個代詞所處的場景及人物關係,從而使它能夠解碼出正確的代詞翻譯。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}},{"type":"strong","attrs":{}}],"text":"8.成語翻譯","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除代詞外,成語的翻譯在臺詞機器翻譯中也是比較困難的一個部分。這是因爲:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(1)隨着多年演變,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"很多成語都不再只是它字面的意思,而是包含了很多引申義。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這時候如果我們不做特定處理的話,極有可能僅將字面意思翻譯出來,影響翻譯準確度。所以,我們需要其他的輔助信息,比如釋義等。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"(2)","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"有些成語具有語義獨立的特點","attrs":{}},{"type":"text","text":",也就是說某個成語的含義和上下文沒有那麼大的關聯。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對這兩個特點,我們設計了針對成語翻譯的模塊,使用預訓練的BERT,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"對中文以及中文釋義進行編碼,直接替換encoder的成語輸入和添加到encoder的輸出","attrs":{}},{"type":"text","text":",來確保成語真正含義的表達能夠在模型中學習得到。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c8/c81c1e41c946477fa47ea6b20008dac1.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖12","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}},{"type":"strong","attrs":{}}],"text":"9.角色名翻譯","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"針對這一部分,我們是通過","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"增加特殊標識以及數據增強的方式","attrs":{}},{"type":"text","text":",使得模型學習到特定的","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"拷貝能力","attrs":{}},{"type":"text","text":"。大部分的臺詞從中文翻譯到對應的語言的時候,角色名都是以拼音作爲翻譯的。當然在一些不適宜拼音的語言中,也會有一些其他的對應關係,在這裏我們暫且以拼音爲例。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們首先將","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"人名替換成拼音","attrs":{}},{"type":"text","text":",因爲這時候它的真正的文本已經不重要了,最重要的是它將要翻譯的目標語言。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"比如在圖13這個例子中,“你認識李飛嗎?”,我們首先將李飛中文替換成拼音li fei,對其增加一個特殊的標識,這也就是想告訴模型:這部分是要拷貝過去的。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/49/4998bdf4445a036b34369950214a28f2.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖13","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外,爲了增加模型見過的拼音輸入表達的數量,我們","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"通過訓練集挖掘了人名和姓氏的模板將其與僞名字合併成增強的數據,將增強數據和原來的數據串在一起進行訓練","attrs":{}},{"type":"text","text":",使得模型能學到足夠的拷貝能力。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這種方式通過訓練模型,使得機器能夠識別這種標識以及裏面的拼音,將其複製到對應的位置。","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"03 多語言臺詞機器翻譯在愛奇藝的落地應用","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們在多語言臺詞機器翻譯模型上做了一些優化探索後,也對優化後模型的質檢差錯率做了一些評測,這裏列舉一部分。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/96/9606a1054fda611cd1193a3835037f0c.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null},"content":[{"type":"text","text":"圖14:各語言質檢差錯率","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":"center","origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"圖中的每種語言都有第三方機器、人工、自研機器三種翻譯,其中,自研的機器翻譯就是我們自己經過模型探索、優化後的效果。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從圖14可以看出,","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"我們自研的翻譯差錯率已經明顯低於第三方","attrs":{}},{"type":"text","text":",這個第三方指的目前市場上最好的第三方。在泰語、印尼語、英語等語言中,我們自研的機器翻譯已經接近於人工,而在馬來語、西班牙語、阿拉伯語的翻譯中,自研翻譯甚至已經超過人工。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"此外,我們做的翻譯主要應用在","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"國際站長視頻出海","attrs":{}},{"type":"text","text":"的項目中,目前已經支持從","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"簡體中文到印尼語,馬來語,泰語,越南語,阿拉伯語,繁體中文","attrs":{}},{"type":"text","text":"等多種語言的翻譯。","attrs":{}}]}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章