Source, with thanks to the original article: http://bjbsair.com/2020-04-01/tech-info/18508.html
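When you look at an image, your brain can easily tell what it shows, but can a computer do the same? Computer vision researchers worked on this problem for a long time, and until recently it was considered out of reach. With advances in deep learning, the availability of large datasets, and more computing power, we can now build models that generate captions for images.
That is what we will implement in this project, using two deep learning techniques together: a convolutional neural network (CNN) and a type of recurrent neural network, the LSTM.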
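What is an image caption generator?
Image caption generation is a task that combines computer vision and natural language processing: the model recognizes the content of an image and describes it in natural language.
The goal of this project is to learn the concepts behind CNN and LSTM models and to build a working image caption generator by combining a CNN with an LSTM.
In this project we implement the caption generator with a CNN (convolutional neural network) and an LSTM (long short-term memory) network. Image features are extracted with Xception, a CNN model trained on the ImageNet dataset, and then fed into the LSTM model, which generates the caption.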
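Preparing the dataset
For the image caption generator we use the Flickr_8K dataset. Larger datasets exist, such as Flickr_30K and MSCOCO, but training a network on them can take weeks, so we use the small Flickr8k dataset. The advantage of a larger dataset is that it lets us build a better model.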
Prerequisites
We need the following libraries:
- tensorflow
- keras
- pillow
- numpy
- tqdm
- jupyterlab
1. First, import all the required libraries
import string
import os
from pickle import dump, load
import numpy as np
from PIL import Image
from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
# in older Keras releases add was imported with: from keras.layers.merge import add
from keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from keras.models import Model, load_model
# small library for seeing the progress of loops.
from tqdm.notebook import tqdm
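2. Load and clean the data
The token file lists images and their captions, one pair per line (lines are separated by a newline, "\n").
Each image has 5 captions, and each caption is tagged with a number from 0 to 4 after a # sign.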
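To make the file format concrete, here is a minimal sketch that parses a single line. The sample line and the helper name parse_token_line are made up for illustration; the layout (image name, #, caption number, tab, caption) is inferred from the description above and from the parsing code in this section.

```python
# Hypothetical sample line in the Flickr8k.token.txt layout:
# <image name>#<caption number><TAB><caption text>
sample_line = "example_image.jpg#2\tA dog runs across the grass ."

def parse_token_line(line):
    """Split one token-file line into (image name, caption number, caption)."""
    image_id, caption = line.split("\t", 1)
    image_name, number = image_id.rsplit("#", 1)
    return image_name, int(number), caption

print(parse_token_line(sample_line))
```

The all_img_captions() function in this section gets the same image name by cutting off the last two characters ("#0" to "#4") instead of splitting on the # sign.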
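We will define 5 functions:
- load_doc(filename) – loads the document file and reads its contents into a string.
- all_img_captions(filename) – builds a descriptions dictionary that maps each image to a list of its 5 captions.
- cleaning_text(descriptions) – takes all descriptions and cleans them. This is an important step when working with text data, and the kind of cleaning depends on the goal. Here we remove punctuation, convert all text to lowercase, and drop words that contain digits.
- text_vocabulary(descriptions) – a simple function that collects all unique words across the descriptions into a vocabulary.
- save_descriptions(descriptions, filename) – builds a list of all preprocessed descriptions and stores them in a file. We create a descriptions.txt file to hold all the captions.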
# Loading a text file into memory
def load_doc(filename):
    # Opening the file as read only; the with-block closes it automatically
    with open(filename, 'r') as file:
        text = file.read()
    return text
# get all imgs with their captions
def all_img_captions(filename):
    file = load_doc(filename)
    captions = file.split('\n')
    descriptions = {}
    for caption in captions[:-1]:
        img, caption = caption.split('\t')
        # img looks like "name.jpg#0"; img[:-2] strips the "#0" suffix
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [caption]
        else:
            descriptions[img[:-2]].append(caption)
    return descriptions
#Data cleaning- lower casing, removing punctuation and words containing numbers
def cleaning_text(captions):
    table = str.maketrans('','',string.punctuation)
    for img,caps in captions.items():
        for i,img_caption in enumerate(caps):
            #str.replace returns a new string, so assign the result back
            img_caption = img_caption.replace("-"," ")
            desc = img_caption.split()
            #converts to lowercase
            desc = [word.lower() for word in desc]
            #remove punctuation from each token
            desc = [word.translate(table) for word in desc]
            #remove hanging 's and a
            desc = [word for word in desc if(len(word)>1)]
            #remove tokens with numbers in them
            desc = [word for word in desc if(word.isalpha())]
            #convert back to string
            img_caption = ' '.join(desc)
            captions[img][i]= img_caption
    return captions
def text_vocabulary(descriptions):
    # build vocabulary of all unique words
    vocab = set()
    for key in descriptions.keys():
        for d in descriptions[key]:
            vocab.update(d.split())
    return vocab
#All descriptions in one file
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc )
    data = "\n".join(lines)
    with open(filename,"w") as file:
        file.write(data)
# Set these paths according to the project folder on your system
# (raw strings stop the backslashes in Windows paths from being read as escape sequences)
dataset_text = r"D:\dataflair projects\Project - Image Caption Generator\Flickr_8k_text"
dataset_images = r"D:\dataflair projects\Project - Image Caption Generator\Flicker8k_Dataset"
#we prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
#loading the file that contains all data
#mapping them into descriptions dictionary img to 5 captions
descriptions = all_img_captions(filename)
print("Length of descriptions =" ,len(descriptions))
#cleaning the descriptions
clean_descriptions = cleaning_text(descriptions)
#building vocabulary
vocabulary = text_vocabulary(clean_descriptions)
print("Length of vocabulary = ", len(vocabulary))
#saving each description to file
save_descriptions(clean_descriptions, "descriptions.txt")
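As a quick check of what the cleaning step does, the sketch below applies the same rules to a single caption. The caption is made up and clean_caption is a hypothetical stand-alone helper that mirrors cleaning_text(); it is not part of the project code.

```python
import string

def clean_caption(caption):
    """Apply the cleaning rules from cleaning_text() to one caption."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.replace("-", " ").split()
    # lower-case every token, then strip punctuation from it
    words = [w.lower().translate(table) for w in words]
    # drop one-character leftovers (such as "a") and tokens containing digits
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return " ".join(words)

sample = "A man's 3rd jump - over the fence !"  # made-up caption
print(clean_caption(sample))  # mans jump over the fence
```

Note that "man's" becomes "mans" because the apostrophe is removed inside the token, while "3rd" is dropped entirely because it contains a digit.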
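3. Extract a feature vector from every image
This technique is known as transfer learning: instead of training a network from scratch, we take a model that has already been trained on a large dataset, extract features with it, and use those features for our own task. We use the Xception model, which was trained on the ImageNet dataset with its 1000 classes and can be imported directly from keras.applications. Because Xception was originally built for ImageNet classification, we make a small change to fit it into our model. Note that Xception expects input images of size 299×299×3. We remove the final classification layer and obtain a 2048-dimensional feature vector.
model = Xception(include_top=False, pooling='avg')
The function extract_features() extracts the features of all images and maps each image name to its feature array. We then dump the features dictionary into the "features.p" pickle file.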
def extract_features(directory):
    model = Xception( include_top=False, pooling='avg' )
    features = {}
    for img in tqdm(os.listdir(directory)):
        filename = directory + "/" + img
        image = Image.open(filename)
        image = image.resize((299,299))
        image = np.expand_dims(image, axis=0)
        #image = preprocess_input(image)
        image = image/127.5
        image = image - 1.0
        feature = model.predict(image)
        features[img] = feature
    return features
#2048 feature vector
features = extract_features(dataset_images)
dump(features, open("features.p","wb"))
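The two arithmetic lines inside extract_features() rescale pixel values from the range 0 to 255 into the range -1 to 1, which is the input range Keras documents for Xception's preprocess_input. A small worked sketch of that arithmetic (scale_pixel is a hypothetical helper name used only here):

```python
def scale_pixel(value):
    """Map a pixel value from [0, 255] to [-1, 1], as extract_features() does."""
    return value / 127.5 - 1.0

# darkest, mid-grey and brightest values
for v in (0, 127.5, 255):
    print(v, "->", scale_pixel(v))
```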
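Depending on your system, this step can take a long time. Once it has finished, the saved features can be loaded directly from features.p: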
features = load(open("features.p","rb"))
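4. Load the dataset for training the model
The Flickr_8k_text folder contains the Flickr_8k.trainImages.txt file, which lists the names of the 6000 images used for training.
To load the training dataset we need a few more functions:
- load_photos(filename) – loads the text file as a string and returns the list of image names.
- load_clean_descriptions(filename, photos) – builds a dictionary holding the captions for each photo in the list. It also wraps every caption in <start> and <end> markers, so that the LSTM model can recognize where a caption begins and ends.
- load_features(photos) – returns a dictionary that maps the image names to the feature vectors extracted earlier with the Xception model.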
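The wrapping step can be illustrated on one line of descriptions.txt. This is a minimal sketch with a made-up line; wrap_caption is a hypothetical helper that does for a single line what load_clean_descriptions() does inside its loop.

```python
def wrap_caption(line):
    """Turn 'image caption words...' into (image, '<start> ... <end>')."""
    words = line.split()
    image, caption_words = words[0], words[1:]
    return image, "<start> " + " ".join(caption_words) + " <end>"

sample_line = "example_image.jpg\tdog runs across the grass"  # made-up line
print(wrap_caption(sample_line))
```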
#load the data
def load_photos(filename):
    file = load_doc(filename)
    photos = file.split("\n")[:-1]
    return photos
def load_clean_descriptions(filename, photos):
    #loading clean_descriptions
    file = load_doc(filename)
    descriptions = {}
    for line in file.split("\n"):
        words = line.split()
        if len(words)<1 :
            continue
        image, image_caption = words[0], words[1:]
        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            desc = '<start> ' + " ".join(image_caption) + ' <end>'
            descriptions[image].append(desc)
    return descriptions
def load_features(photos):
    #loading all features
    all_features = load(open("features.p","rb"))
    #selecting only needed features
    features = {k:all_features[k] for k in photos}
    return features
filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"
#train = loading_data(filename)
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)
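5. Tokenize the vocabulary
We map every word in the vocabulary to a unique index value. Keras provides the Tokenizer class for this; we use it to build the tokens from the vocabulary and save it to the "tokenizer.p" pickle file.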
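To show the kind of mapping this produces, here is a plain-Python sketch that assigns indices by word frequency, starting at 1. It only imitates the behaviour of the Keras Tokenizer on toy data and is not the Keras implementation; build_word_index is a hypothetical helper name.

```python
from collections import Counter

def build_word_index(captions):
    """Map each word to an integer index, most frequent word first, starting at 1."""
    counts = Counter()
    for caption in captions:
        # lower-case and strip the angle brackets, so "<start>" is counted as "start"
        counts.update(w.strip("<>").lower() for w in caption.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

captions = ["<start> dog runs <end>", "<start> dog jumps <end>"]  # toy data
word_index = build_word_index(captions)
vocab_size = len(word_index) + 1  # +1 because index 0 is kept free for padding
print(word_index, vocab_size)
```

The Keras Tokenizer filters out punctuation such as < and > by default, which is why the markers end up in the vocabulary as plain "start" and "end", the forms the test script in section 9 uses.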
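#the original listing omits dict_to_list and the tokenizer setup; both are reconstructed here from how they are used
def dict_to_list(descriptions):
    return [d for caps in descriptions.values() for d in caps]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(dict_to_list(train_descriptions))
dump(tokenizer, open("tokenizer.p", "wb"))
vocab_size = len(tokenizer.word_index) + 1
#calculate maximum length of descriptions
def max_length(descriptions):
    return max(len(d.split()) for d in dict_to_list(descriptions))
max_length = max_length(descriptions)
max_length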
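Our vocabulary contains 7577 words.
We also compute the maximum description length, which is needed to set the model's structural parameters. The maximum description length is 32.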
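6. Create the data generator
First, let's look at what the model's inputs and outputs look like. To make this a supervised learning task, we have to give the model inputs and outputs to train on. We train on 6000 images; each image is represented by a feature vector of length 2048, and each caption is represented as numbers too. The data for all 6000 images does not fit into memory at once, so we use a generator to produce it in batches.
The generator yields the input and output sequences.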
#create input-output sequence pairs from the image description.
#data generator, used by model.fit_generator()
def data_generator(descriptions, features, tokenizer, max_length):
    while 1:
        for key, description_list in descriptions.items():
            #retrieve photo features
            feature = features[key][0]
            input_image, input_sequence, output_word = create_sequences(tokenizer, max_length, description_list, feature)
            #yield a tuple (inputs, targets), the form Keras documents for generators
            yield ([input_image, input_sequence], output_word)
def create_sequences(tokenizer, max_length, desc_list, feature):
    X1, X2, y = list(), list(), list()
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(feature)
            X2.append(in_seq)
            y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)
#You can check the shape of the input and output for your model
[a,b],c = next(data_generator(train_descriptions, features, tokenizer, max_length))
a.shape, b.shape, c.shape
#((47, 2048), (47, 32), (47, 7577))
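The way create_sequences() expands one caption into several training pairs is easier to see on a toy example. The sketch below uses plain lists instead of NumPy arrays and leaves out the one-hot encoding of the target; make_pairs and the encoded caption are made up for illustration.

```python
def make_pairs(seq, max_length):
    """Expand one encoded caption into (left-padded prefix, next word) pairs."""
    pairs = []
    for i in range(1, len(seq)):
        prefix, target = seq[:i], seq[i]
        # pad with zeros on the left, as pad_sequences does by default
        padded = [0] * (max_length - len(prefix)) + prefix
        pairs.append((padded, target))
    return pairs

encoded = [1, 4, 7, 2]  # toy caption: start marker, two words, end marker
for padded, target in make_pairs(encoded, max_length=5):
    print(padded, "->", target)
```

A caption of n tokens yields n - 1 pairs, so the first dimension in the shapes printed above (47) is the total number of pairs over the five captions of one image.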
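7. Define the CNN-RNN model
To define the structure of the model we use the Keras Model class from the Functional API. It consists of three main parts:
- Feature Extractor – the features extracted from the image have size 2048; a dense layer reduces them to 256 nodes.
- Sequence Processor – an embedding layer handles the text input, followed by an LSTM layer.
- Decoder – the outputs of the two parts above are merged and passed through a dense layer to make the final prediction. The last layer has as many nodes as there are words in our vocabulary.
The code below defines the model and saves a diagram of it to model.png: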
from keras.utils import plot_model
# define the captioning model
def define_model(vocab_size, max_length):
    # features from the CNN model squeezed from 2048 to 256 nodes
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # LSTM sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # Merging both models
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    print(model.summary())
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
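As a rough size check on this architecture, the trainable parameters can be counted by hand. The sketch below uses the standard formulas for dense, embedding and LSTM layers with biases and the vocabulary size of 7577 quoted in this article; the helper names are hypothetical, and the result is worth comparing with what model.summary() prints on your own run.

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases

def lstm_params(n_in, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((n_in + units) * units + units)

vocab_size = 7577  # vocabulary size reported in this article
counts = {
    "feature dense (2048 -> 256)": dense_params(2048, 256),
    "embedding (vocab -> 256)": vocab_size * 256,
    "lstm (256 -> 256)": lstm_params(256, 256),
    "decoder dense (256 -> 256)": dense_params(256, 256),
    "output dense (256 -> vocab)": dense_params(256, vocab_size),
}
total = sum(counts.values())
for name, n in counts.items():
    print(name, n)
print("total", total)
```

The dropout and add layers have no parameters, so they do not appear in the count. Most of the roughly five million parameters sit in the embedding and in the output layer, both of which scale with the vocabulary size.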
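8. Train the model
To train the model we use the 6000 training images, generating the input and output sequences in batches and fitting them to the model with model.fit_generator(). We also save the model to our models folder.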
# train our model
print('Dataset: ', len(train_imgs))
print('Descriptions: train=', len(train_descriptions))
print('Photos: train=', len(train_features))
print('Vocabulary Size:', vocab_size)
print('Description Length: ', max_length)
model = define_model(vocab_size, max_length)
epochs = 10
steps = len(train_descriptions)
# making a directory models to save our models (no error if it already exists)
os.makedirs("models", exist_ok=True)
for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
    # fit_generator is deprecated in recent TensorFlow releases, where model.fit accepts generators directly
    model.fit_generator(generator, epochs=1, steps_per_epoch= steps, verbose=1)
    model.save("models/model_" + str(i) + ".h5")
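9. Test the model
The model is now trained. Next we create a separate file, testing_caption_generator.py, which loads the model and generates predictions. The model predicts index values, one word at a time up to the maximum caption length, so we use the same tokenizer.p pickle file to map the index values back to words.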
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import argparse
from pickle import load
from keras.applications.xception import Xception
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True, help="Image Path")
args = vars(ap.parse_args())
img_path = args['image']
def extract_features(filename, model):
    try:
        image = Image.open(filename)
    except Exception:
        print("ERROR: Couldn't open image! Make sure the image path and extension is correct")
        # re-raise: without an image there is nothing to process
        raise
    image = image.resize((299,299))
    image = np.array(image)
    # for images that have 4 channels, we convert them into 3 channels
    if image.shape[2] == 4:
        image = image[..., :3]
    image = np.expand_dims(image, axis=0)
    image = image/127.5
    image = image - 1.0
    feature = model.predict(image)
    return feature
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
def generate_desc(model, tokenizer, photo, max_length):
    in_text = 'start'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        pred = model.predict([photo,sequence], verbose=0)
        pred = np.argmax(pred)
        word = word_for_id(pred, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'end':
            break
    return in_text
#path = 'Flicker8k_Dataset/111537222_07e56d5a30.jpg'
max_length = 32
tokenizer = load(open("tokenizer.p","rb"))
model = load_model('models/model_9.h5')
xception_model = Xception(include_top=False, pooling="avg")
photo = extract_features(img_path, xception_model)
img = Image.open(img_path)
description = generate_desc(model, tokenizer, photo, max_length)
print("\n\n")
print(description)
plt.imshow(img)
# when run as a script, show() is needed for the image window to appear
plt.show()
Example caption generated for a test image: two girls are playing in the grass
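The loop in generate_desc() is a greedy decoder: at every step it takes the single most likely next word and stops at the end marker or at the maximum length. The sketch below shows the same control flow with a toy lookup table standing in for the trained model; greedy_decode, toy_transitions and toy_next_word are made up for illustration.

```python
def greedy_decode(next_word, max_length):
    """Build a caption word by word, always taking the predicted next word."""
    words = ["start"]
    for _ in range(max_length):
        word = next_word(words)
        if word is None:
            break
        words.append(word)
        if word == "end":
            break
    return " ".join(words)

# Toy stand-in for the trained model: a fixed lookup from the last word to the next one.
toy_transitions = {"start": "two", "two": "girls", "girls": "play", "play": "end"}

def toy_next_word(words):
    return toy_transitions.get(words[-1])

caption = greedy_decode(toy_next_word, max_length=32)
print(caption)
# strip the start and end markers for display
print(" ".join(caption.split()[1:-1]))
```

The raw output still contains the start and end markers, so a display step like the last line is a reasonable addition when showing captions to users.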
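Conclusion
In this project we implemented a CNN-RNN model by building an image caption generator. A few key points to keep in mind: the model depends on its data, so it cannot predict words outside its vocabulary. We used a small dataset of 8000 images. A production-level model would need to be trained on a dataset of more than 100,000 images to reach better accuracy.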