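Image Captioning in Practice | Part 4 | Model Evaluation

Welcome to Part 4 of the Image Captioning in Practice series. In this part we evaluate the trained image captioning model.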

Evaluation Metrics

Before implementing the evaluation itself, let's briefly discuss the metric used for image captioning models.

1. BLEU

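Caption generation in image captioning resembles target-language sequence generation in machine translation, so machine translation metrics can be used to assess the quality of generated captions. One of the most common is BLEU (BiLingual Evaluation Understudy). "Understudy" here means standing in for human evaluators when judging translations, since manual evaluation is slow and labor-intensive. BLEU was first proposed in 2002 by IBM research scientist Kishore Papineni and his co-authors in the paper "BLEU: A Method for Automatic Evaluation of Machine Translation".

When judging a machine translation from a source language (say English) into a target language (say Chinese), the criterion is simple: the closer the machine output is to a professional human translation, the better it is. BLEU is built on the same idea. In practice, BLEU measures how similar two sentences are: it compares a machine-translated sentence against one or more reference translations and computes a combined score, where a higher score means a better translation.

We call a sentence generated by the model a candidate and a sentence from the corpus a reference. BLEU computes a similarity score between the candidate and its reference(s). The score ranges from 0.0 to 1.0: a perfect match gives 1.0, and a perfect mismatch gives 0.0.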

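2. Strengths and Weaknesses

BLEU has clear strengths:

It is cheap and fast to compute;
It is easy to understand;
It is language independent (you can apply it to any language);
It correlates well with human evaluation;
It is widely adopted in academia and industry.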

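Its weaknesses should not be ignored, though:

It does not consider grammatical correctness;
Scores can be distorted by common words;
Short candidate translations can sometimes score higher than they should;
It ignores synonyms and paraphrases, so a reasonable translation may be penalized.

Keep in mind that BLEU is never 100% accurate. It only gives a rough judgment, and its goal is simply to provide a fast, reasonably good automatic evaluation method.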

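In the paper, Papineni et al. replace plain n-gram precision with modified n-gram precision, which takes into account how often each word occurs in the reference text, so that a candidate is not rewarded for generating an abundance of reasonable-looking words. To score a block of several sentences, the paper normalizes the matched (clipped) n-gram counts by the total number of n-grams in the candidates. See the paper for details.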
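To make the idea of modified n-gram precision concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the helper names ngrams and modified_precision are made up for illustration, and the sentences are the well-known "the the the" example from the BLEU paper. Each candidate n-gram count is clipped to the largest count of that n-gram in any single reference before dividing by the total number of candidate n-grams.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, candidate, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count so that repeated words are not rewarded.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


references = [
    'the cat is on the mat'.split(),
    'there is a cat on the mat'.split(),
]
candidate = 'the the the the the the the'.split()

print(modified_precision(references, candidate, 1))  # 2/7, about 0.2857
print(modified_precision(references, candidate, 2))  # 0.0, no bigram matches
```

Plain unigram precision would give this candidate 7/7, because "the" does occur in a reference; clipping its count to 2 (the most times it appears in one reference) brings the score down to 2/7.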

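3. Computing BLEU Scores

The Python Natural Language Toolkit (NLTK) provides an implementation of BLEU that you can use to evaluate generated text against reference text.

3.1 Sentence BLEU Score

NLTK provides the sentence_bleu() function for scoring a candidate sentence against one or more reference sentences.

from nltk.translate.bleu_score import sentence_bleu
# the references and the candidate are given as lists of tokens
reference = [['this', 'is', 'a', 'test'], ['this', 'is', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate)
print(score) # prints 1.0, a perfect match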

3.2 Corpus BLEU Score

NLTK also provides corpus_bleu(), which computes a BLEU score over multiple sentences, such as a paragraph or a document.

# two references for one document
from nltk.translate.bleu_score import corpus_bleu
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
candidates = [['this', 'is', 'a', 'test']]
score = corpus_bleu(references, candidates)
print(score) # prints 1.0

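3.3 Cumulative and Individual BLEU Scores

The BLEU functions in NLTK let you specify weights for the different n-gram orders. This gives you the flexibility to compute different kinds of BLEU scores, such as individual and cumulative n-gram scores.

Individual N-Gram Scores

An individual n-gram score evaluates matching n-grams of one specific order, such as single words (1-grams) or word pairs (2-grams, or bigrams). The weights are given as a tuple in which each position corresponds to one n-gram order. To compute the BLEU score for 1-gram matches only, set the 1-gram weight to 1 and the 2-, 3- and 4-gram weights to 0, that is, weights of (1, 0, 0, 0).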

# n-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

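Cumulative N-Gram Scores

A cumulative score combines the individual n-gram scores of all orders from 1 to n by taking their weighted geometric mean. By default, sentence_bleu() and corpus_bleu() compute the cumulative 4-gram score, also called BLEU-4, which gives the 1-gram, 2-gram, 3-gram and 4-gram scores a weight of 1/4 (25%, or 0.25) each.

# 4-gram cumulative BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
# No 3-gram or 4-gram of the candidate occurs in the reference, so the
# geometric mean is effectively 0; recent NLTK versions warn about this.
# The original run of this tutorial, on an older NLTK, printed 0.707106781187.
print(score)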
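The weighted geometric mean behind the cumulative score can be sketched in a few lines. This is a simplified illustration rather than NLTK's code: combine_precisions is a hypothetical helper, the n-gram precisions are worked out by hand, and the brevity_penalty argument stands for the length penalty defined in the BLEU paper (left at 1.0 here because candidate and reference have the same length).

```python
import math


def combine_precisions(precisions, weights, brevity_penalty=1.0):
    """Weighted geometric mean of n-gram precisions, scaled by a length penalty."""
    used = [(p, w) for p, w in zip(precisions, weights) if w > 0]
    # log(0) is undefined; a zero precision makes the geometric mean 0.
    if any(p == 0 for p, _ in used):
        return 0.0
    log_sum = sum(w * math.log(p) for p, w in used)
    return brevity_penalty * math.exp(log_sum)


# reference: this is small test / candidate: this is a test
# 1-gram precision is 3/4, 2-gram precision is 1/3, 3- and 4-gram precision is 0
bleu_2 = combine_precisions([3 / 4, 1 / 3], [0.5, 0.5])
bleu_4 = combine_precisions([3 / 4, 1 / 3, 0, 0], [0.25, 0.25, 0.25, 0.25])

print(round(bleu_2, 6))  # 0.5
print(bleu_4)            # 0.0
```

Because the mean is geometric, a single zero precision at any weighted order drives the whole score to zero, which is why very short sentences often need smoothing to get a meaningful BLEU-4.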

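The cumulative and individual 1-gram BLEU use the same weights, (1, 0, 0, 0). The cumulative 2-gram score gives 50% weight to each of the 1-gram and 2-gram scores, and the cumulative 3-gram score gives about 33% to each of the 1-, 2- and 3-gram scores. Let's make this concrete by computing the cumulative BLEU-1, BLEU-2, BLEU-3 and BLEU-4 scores: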

# cumulative BLEU scores
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))

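# Original output (older NLTK): 0.750000, 0.500000, 0.632878, 0.707107
# The candidate shares no 3-gram or 4-gram with the reference, so recent
# NLTK versions warn and report BLEU-3 and BLEU-4 as approximately 0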

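When reporting the performance of a text generation system, it is common to give the cumulative scores from BLEU-1 through BLEU-4.

Model Evaluation

Once the model is trained, we need to evaluate how well it predicts on the test set. The main goal of image captioning is to generate a text description for a given image, so evaluation measures how close the generated description is to the reference descriptions that come with that image.

First, we load the trained model and use it to generate a description for every image in the test set. Generation starts from the startseq token and repeatedly emits the next word until the end-of-sequence token endseq is produced or the maximum sequence length is reached.

The model is loaded as follows: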

from keras.models import load_model
filename = 'model-ep001-loss3.112-val_loss3.153.h5' # replace with your own file
model = load_model(filename)

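Strictly speaking, when the model predicts the next word it outputs a probability distribution over the vocabulary; taking the argmax gives the word's index in the vocabulary, which we then need to map back to the word itself.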

def word_for_id(integer, tokenizer):
  for word, index in tokenizer.word_index.items():
    if index == integer:
      return word
  return None
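word_for_id scans the whole vocabulary on every call. As an optional alternative, the mapping can be inverted once and reused. The sketch below is an assumption-laden illustration, not part of the original tutorial: build_index_word is a hypothetical helper, and a SimpleNamespace with a made-up vocabulary stands in for the fitted Keras tokenizer, relying only on its word_index attribute.

```python
from types import SimpleNamespace


def build_index_word(tokenizer):
    """Invert tokenizer.word_index once so that each lookup is a dict access."""
    return {index: word for word, index in tokenizer.word_index.items()}


# Stand-in for a fitted tokenizer (hypothetical three-word vocabulary).
toy_tokenizer = SimpleNamespace(word_index={'startseq': 1, 'endseq': 2, 'dog': 3})
index_word = build_index_word(toy_tokenizer)

print(index_word.get(3))   # dog
print(index_word.get(99))  # None, like word_for_id for an unknown index
```

With this table, index_word.get(yhat) could replace the word_for_id(yhat, tokenizer) call inside the generation loop without changing its behavior.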

Next, we define a function that generates a sequence:

from numpy import argmax
from keras.preprocessing.sequence import pad_sequences

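def generate_desc(model, tokenizer, photo, max_length):
  # seed the generation with the start token
  in_text = 'startseq'
  # iterate up to the maximum sequence length
  for i in range(max_length):
    # integer-encode the text so far and pad it to the model's input length
    sequence = tokenizer.texts_to_sequences([in_text])[0]
    sequence = pad_sequences([sequence], maxlen=max_length)
    # predict the next word and take the most probable index
    yhat = model.predict([photo, sequence], verbose=0)
    yhat = argmax(yhat)
    word = word_for_id(yhat, tokenizer)
    # stop if the index cannot be mapped to a word
    if word is None:
      break
    # append the word to form the input for the next prediction
    in_text += ' ' + word
    # stop when the end-of-sequence token is produced
    if word == 'endseq':
      break
  return in_text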
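The control flow of this greedy generation loop can be tried out without a trained model. The sketch below is only an illustration under stated assumptions: greedy_decode mirrors the loop structure of generate_desc, while toy_next_word and its fixed transition table are hypothetical stand-ins for model.predict followed by argmax and the index-to-word lookup.

```python
def greedy_decode(next_word, max_length):
    """Greedy decoding: keep appending the single most likely next word."""
    words = ['startseq']
    for _ in range(max_length):
        word = next_word(words)
        if word is None:       # no word could be produced
            break
        words.append(word)
        if word == 'endseq':   # end-of-sequence token reached
            break
    return ' '.join(words)


# Hypothetical "model": the next word depends only on the last word.
toy_transitions = {'startseq': 'dog', 'dog': 'runs', 'runs': 'endseq'}


def toy_next_word(words):
    return toy_transitions.get(words[-1])


print(greedy_decode(toy_next_word, max_length=10))  # startseq dog runs endseq
print(greedy_decode(toy_next_word, max_length=2))   # startseq dog runs
```

The second call shows the other stopping condition: when max_length is reached first, the output is cut off without an endseq token.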

Next, we generate a description for every image in the test set and then compute the BLEU scores.

from nltk.translate.bleu_score import corpus_bleu

def evaluate_model(model, descriptions, photos, tokenizer, max_length):
  actual, predicted = list(), list()
  for key, desc_list in descriptions.items():
    yhat = generate_desc(model, tokenizer, photos[key], max_length)
    references = [d.split() for d in desc_list]
    actual.append(references)
    predicted.append(yhat.split())
  # compute cumulative BLEU scores (equal weights over the n-gram orders used)
  print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
  print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
  print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(1/3, 1/3, 1/3, 0)))
  print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))
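The nesting that corpus_bleu() expects is easy to get wrong: one list of tokenized references per image, and one tokenized candidate per image, in the same order. The following sketch shows just that data layout on made-up data. build_bleu_inputs is a hypothetical helper, and a dictionary lookup stands in for generate_desc(); no model or NLTK call is involved.

```python
def build_bleu_inputs(descriptions, generate):
    """Collect tokenized references and predictions in matching order."""
    actual, predicted = [], []
    for key, desc_list in descriptions.items():
        # several tokenized references per image
        actual.append([d.split() for d in desc_list])
        # exactly one tokenized candidate per image
        predicted.append(generate(key).split())
    return actual, predicted


# Hypothetical data: two images and canned "model outputs".
toy_descriptions = {
    'img1': ['startseq dog runs endseq', 'startseq a dog is running endseq'],
    'img2': ['startseq child plays endseq'],
}
toy_outputs = {
    'img1': 'startseq dog runs endseq',
    'img2': 'startseq child sits endseq',
}

actual, predicted = build_bleu_inputs(toy_descriptions, toy_outputs.get)
print(len(actual), len(predicted))  # 2 2
print(len(actual[0]))               # 2 references for img1
print(predicted[1])                 # ['startseq', 'child', 'sits', 'endseq']
```

Note that the startseq and endseq tokens appear in both the references and the candidates here, so they count as matching unigrams; whether to strip them before scoring is a choice the tutorial leaves open.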

Complete Code

The complete code for the model evaluation section is as follows:

from numpy import argmax
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
from nltk.translate.bleu_score import corpus_bleu

def word_for_id(integer, tokenizer):
  for word, index in tokenizer.word_index.items():
    if index == integer:
      return word
  return None

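def generate_desc(model, tokenizer, photo, max_length):
  # seed the generation with the start token
  in_text = 'startseq'
  # iterate up to the maximum sequence length
  for i in range(max_length):
    # integer-encode the text so far and pad it to the model's input length
    sequence = tokenizer.texts_to_sequences([in_text])[0]
    sequence = pad_sequences([sequence], maxlen=max_length)
    # predict the next word and take the most probable index
    yhat = model.predict([photo, sequence], verbose=0)
    yhat = argmax(yhat)
    word = word_for_id(yhat, tokenizer)
    # stop if the index cannot be mapped to a word
    if word is None:
      break
    # append the word to form the input for the next prediction
    in_text += ' ' + word
    # stop when the end-of-sequence token is produced
    if word == 'endseq':
      break
  return in_text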

def evaluate_model(model, descriptions, photos, tokenizer, max_length):
  actual, predicted = list(), list()
  for key, desc_list in descriptions.items():
    yhat = generate_desc(model, tokenizer, photos[key], max_length)
    references = [d.split() for d in desc_list]
    actual.append(references)
    predicted.append(yhat.split())
  # compute cumulative BLEU scores (equal weights over the n-gram orders used)
  print('BLEU-1: %f' % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
  print('BLEU-2: %f' % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
  print('BLEU-3: %f' % corpus_bleu(actual, predicted, weights=(1/3, 1/3, 1/3, 0)))
  print('BLEU-4: %f' % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))


# load_set, load_clean_descriptions and load_photo_features, as well as
# tokenizer and max_length, are assumed to be defined as in the earlier
# parts of this series

# load the test set
filename = 'Flickr8k_text/Flickr_8k.testImages.txt'
test = load_set(filename)
print('Dataset: %d' % len(test))

# descriptions for the test set
test_descriptions = load_clean_descriptions('descriptions.txt', test)
print('Descriptions: test=%d' % len(test_descriptions))

# photo feature vectors for the test set
test_features = load_photo_features('features.pkl', test)
print('Photos: test=%d' % len(test_features))

# load the model
filename = 'model-ep001-loss3.112-val_loss3.153.h5' # replace with your own file
model = load_model(filename)

# evaluate the model
evaluate_model(model, test_descriptions, test_features, tokenizer, max_length)

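Closing Remarks

Thank you for taking the time to read this part of the tutorial, and stay tuned for the next one! I hope you got a lot out of it. Suggestions for improvement are very welcome in the comments, and if you spot any errors or unclear wording, please point them out!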

For more on the latest advances in natural language processing, technical write-ups and tutorials, follow the WeChat official account “語言智能技術筆記簿”.
