Ten Must-Read NMT Papers (10): BLEU: a Method for Automatic Evaluation of Machine Translation

Original article

2020-06-02 18:39

Original link: https://www.jianshu.com/p/15c22fadcba5

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine's output and that of a human: "the closer a machine translation is to a professional human translation, the better it is" – this is the central idea behind BLEU. BLEU was one of the first metrics to achieve a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics. -- Wikipedia

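BLEU is a text-evaluation algorithm that measures how closely a machine translation corresponds to a professional human translation. The core idea is that the closer the machine output is to a professional human translation, the better its quality, and the resulting score serves as one indicator of machine translation quality. The goal of BLEU is a fast, reasonably good automatic evaluation; what it measures is how close the machine translation is to the human one.

BLEU rests on two components: n-gram matching and a penalty factor (the brevity penalty).

BLEU uses n-gram matching to compute what proportion of the n-word sequences in the candidate translation also appear in the reference translation.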

The usual case

Example:

Source (Chinese): 貓坐在墊子上
Machine translation: The cat sat on the mat.
Human translation: The cat is on the mat.

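Here is how the 1-gram through 4-gram matches work out (punctuation is ignored):

1-gram

The machine translation has 6 words, and 5 of them (The, cat, on, the, mat) appear in the reference translation, so the match rate is 5/6.

2-gram

Of the 5 bigrams, 3 match (The cat, on the, the mat), so the match rate is 3/5.

3-gram

Of the 4 trigrams, only 1 matches (on the mat), so the match rate is 1/4.

4-gram

None of the 3 four-word sequences matches, so the match rate is 0/3.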
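The match rates above can be checked with a short script. This is a minimal sketch rather than a reference BLEU implementation: the helper names `ngrams` and `match_rate` are made up for illustration, tokens are split on whitespace with the final period dropped, and an n-gram counts as matched if it occurs anywhere in the reference (no clipping yet).

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def match_rate(candidate, reference, n):
    """Return (matched, total) for the candidate's n-grams against the reference."""
    reference_ngrams = set(ngrams(reference, n))
    candidate_ngrams = ngrams(candidate, n)
    matched = sum(1 for gram in candidate_ngrams if gram in reference_ngrams)
    return matched, len(candidate_ngrams)


candidate = "The cat sat on the mat".split()
reference = "The cat is on the mat".split()

for n in range(1, 5):
    matched, total = match_rate(candidate, reference, n)
    print(f"{n}-gram: {matched}/{total}")
# Prints 5/6, 3/5, 1/4 and 0/3 for n = 1..4
```

Returning the pair (matched, total) instead of a float keeps the 0/3 case visible, which matters later when the precisions are combined.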

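In general, the 1-gram score indicates how many words of the source were translated individually, so it reflects the adequacy of the translation. Scores for 2-grams and above reflect fluency: the higher they are, the more readable the translation. Both measures correspond to criteria used in human evaluation.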

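Special cases

Under the rule above, some wrong translations receive inflated scores. Consider the following two cases.

N-gram error case

For example:

Source (Chinese): 貓坐在墊子上
Machine translation: the the the the the the the.
Reference translation: The cat is on the mat.

For 1-grams, every "the" finds a match, giving a score of 7/7, which is clearly wrong. The calculation is therefore modified as follows:

For each n-gram, take the minimum of its count in the machine translation and its maximum count in any single reference translation:

Count_clip = min(Count, Max_Ref_Count)

Here Count = 7 and Max_Ref_Count = 2 (matching case-insensitively, "the" occurs twice in the reference), so the minimum is 2 and the modified 1-gram precision is 2/7.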
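The clipping rule can be sketched in a few lines of Python. The function name `clipped_unigram_precision` is hypothetical, the sketch handles a single reference only (with several references, the maximum count across references would be used), and lowercasing the reference is an assumption made so that "The" and "the" count as the same word.

```python
from collections import Counter


def clipped_unigram_precision(candidate, reference):
    """Return (clipped matches, candidate length) for unigrams."""
    candidate_counts = Counter(candidate)
    reference_counts = Counter(reference)
    # Each word is credited at most as many times as it occurs in the reference.
    clipped = sum(min(count, reference_counts[word])
                  for word, count in candidate_counts.items())
    return clipped, len(candidate)


# Lowercasing is an assumption, so that "The" and "the" are the same word.
candidate = "the the the the the the the".split()
reference = "The cat is on the mat".lower().split()

print(clipped_unigram_precision(candidate, reference))  # (2, 7)
```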

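Overly short sentences

Example:

Machine translation: The cat
Reference translation: The cat is on the mat.

The n-gram precision here is clearly 1, which is not acceptable, so translations that are too short are penalized with a brevity penalty (BP):

BP = 1            if c > r
BP = e^(1 - r/c)  if c <= r

Here c is the number of words in the machine translation and r is the number of words in the reference translation.

For this example, with c = 2 and r = 6, the penalty is:

BP = e^(1 - 6/2) = e^(-2) ≈ 0.135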
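As a quick check of the penalty, here is a minimal sketch of the brevity penalty as a function of the two lengths. The name `brevity_penalty` and the handling of an empty candidate are assumptions made for this illustration.

```python
import math


def brevity_penalty(c, r):
    """Brevity penalty for a candidate of length c and a reference of length r."""
    if c == 0:
        return 0.0  # assumption: an empty candidate gets no credit
    if c > r:
        return 1.0
    return math.exp(1 - r / c)


print(round(brevity_penalty(2, 6), 4))  # 0.1353
print(brevity_penalty(6, 6))            # 1.0
```

The second call shows that a candidate as long as the reference is not penalized.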

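Summary

Putting it all together, the final BLEU formula is:

BLEU = BP * exp( sum_{n=1}^{N} Wn * ln(Pn) )

where Pn is the modified n-gram precision, Wn is its weight, and BP is the brevity penalty.

A worked example:

Machine translation: The cat sat on the mat.
Human translation: The cat is on the mat.

1. Compute the precision for each n-gram order (usually up to 4-gram)

P1 = 5 / 6 ≈ 0.8333
P2 = 3 / 5 = 0.6
P3 = 1 / 4 = 0.25
P4 = 0 / 3 = 0

2. Take the weighted sum of the log precisions

Use uniform weights: Wn = 1 / 4 = 0.25

Terms whose precision is 0 are left out of the sum, since ln(0) is undefined. (Applied strictly, the formula would give BLEU = 0 for this sentence.)

0.25 * (ln(5/6) + ln(0.6) + ln(0.25)) ≈ -0.5199

3. Compute BP

Here c = r, so BP = 1

4. Compute BLEU

BLEU = 1 * e^(-0.5199) ≈ 0.5946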
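The four steps can be put together in a small sketch. The function name `combine` is hypothetical, and skipping zero precisions follows this article's rule rather than the original BLEU definition, under which a single zero precision makes the whole score 0.

```python
import math


def combine(precisions, weights, bp):
    """BP * exp(sum of w * ln(p)), skipping zero precisions as the article does."""
    log_sum = sum(w * math.log(p) for p, w in zip(precisions, weights) if p > 0)
    return bp * math.exp(log_sum)


precisions = [5 / 6, 3 / 5, 1 / 4, 0 / 3]

# Uniform weights from the worked example, with BP = 1 because c = r.
print(round(combine(precisions, [0.25] * 4, bp=1.0), 4))  # 0.5946

# Weights used in the article's NLTK call.
print(round(combine(precisions, [0.34, 0.33, 0.33, 0], bp=1.0), 4))  # 0.5026
```

The second call uses the weights passed to `sentence_bleu` in this article and lands on the same value, to four decimal places, as the output reported there.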

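When writing code, there is no need to implement the algorithm above by hand; an existing tool such as NLTK can be used:

# sentence_bleu takes a list of tokenized references and one tokenized candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['The', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['The', 'cat', 'sat', 'on', 'the', 'mat']
# Weights for the 1- to 4-gram precisions; the 4-gram weight is set to 0 here,
# which drops the zero 4-gram precision from the score
score = sentence_bleu(reference, candidate, weights=(0.34, 0.33, 0.33, 0))
print(score)
# Output: 0.5025606628115007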

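Strengths and weaknesses of BLEU

Strengths: convenient and fast, with results fairly close to human ratings.

Weaknesses:

It does not consider the correctness of the language (grammar);
Scores are distorted by common words;
Short translations sometimes receive high scores;
It does not account for synonyms or similar expressions, so reasonable translations may be penalized.

BLEU does not aim for 100% accuracy, nor could it achieve that; its goal is simply a fast, reasonably good automatic evaluation.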
