Build an Image Caption Generator with CNN and LSTM

With thanks to the original article: http://bjbsair.com/2020-04-01/tech-info/18508.html

When you look at an image, your brain can easily tell what it shows, but can a computer? Computer vision researchers worked on this for a long time and thought it impossible until recently. With the advance of deep learning techniques, the availability of huge datasets, and more powerful computers, we can now build models that generate captions for images.

That is what we will achieve in this project, where we combine two deep learning techniques: a convolutional neural network and a kind of recurrent neural network, the LSTM.

What is an image caption generator?

Image caption generation is a task that involves computer vision and natural language processing concepts: recognizing the context of an image and describing it in natural language.

The purpose of our project is to learn the concepts behind the CNN and LSTM models and to build a working image caption generator by connecting a CNN to an LSTM.

In this project we implement the caption generator using a CNN (convolutional neural network) and an LSTM (long short-term memory network). The image features are extracted with Xception, a CNN model trained on the ImageNet dataset; we then feed those features into an LSTM model, which is responsible for generating the caption.

Preparing the dataset

For the image caption generator we will use the Flickr_8K dataset. There are larger datasets, such as Flickr_30K and MSCOCO, but training a network on them can take weeks, so we use the small Flickr8k dataset instead. The advantage of a huge dataset is that it lets us build a better model.

Prerequisites

We will need the following libraries:

  • tensorflow
  • keras
  • pillow
  • numpy
  • tqdm
  • jupyterlab

1. First, we import all the required libraries:

import string
import numpy as np
from PIL import Image
import os
from pickle import dump, load
from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers.merge import add
from keras.models import Model, load_model
from keras.layers import Input, Dense, LSTM, Embedding, Dropout
# small library for seeing the progress of loops
from tqdm import tqdm_notebook as tqdm
tqdm().pandas()


2. Getting and performing data cleaning

In our caption file, each record sits on its own line (separated by "\n"), with the image name and its caption separated by a tab ("\t").

Each image has 5 captions, and each caption is given a number from 0 to 4, appended to the image name after a # character.
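
A few lines from Flickr8k.token.txt illustrate the layout (the captions shown here are representative examples; check your own copy of the file for the exact wording):

1000268201_693b08cb0e.jpg#0	A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg#1	A girl going into a wooden building .
1000268201_693b08cb0e.jpg#2	A little girl climbing into a wooden playhouse .

This is also why all_img_captions() below slices off the last two characters (img[:-2]) to recover the plain image filename.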

We will define 5 functions:

  • load_doc(filename) – loads the document file and reads its contents into a string.
  • all_img_captions(filename) – creates a descriptions dictionary that maps each image to a list of its 5 captions.
  • cleaning_text(descriptions) – takes all the descriptions and performs data cleaning. This is an important step when we work with textual data; which kind of cleaning to perform depends on our goal. In our case, we remove punctuation, convert all text to lowercase, and remove words that contain numbers.
  • text_vocabulary(descriptions) – a simple function that separates all the unique words and creates the vocabulary from all the descriptions.
  • save_descriptions(descriptions, filename) – creates a list of all the preprocessed descriptions and stores them in a file. We will create a descriptions.txt file to store all the captions.
# Loading a text file into memory
def load_doc(filename):
    # Opening the file as read only
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text
# get all imgs with their captions
def all_img_captions(filename):
    file = load_doc(filename)
    captions = file.split('\n')
    descriptions = {}
    for caption in captions[:-1]:
        img, caption = caption.split('\t')
        # img[:-2] strips the "#n" suffix from the image name
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [caption]
        else:
            descriptions[img[:-2]].append(caption)
    return descriptions
# Data cleaning - lower casing, removing punctuation and words containing numbers
def cleaning_text(captions):
    table = str.maketrans('', '', string.punctuation)
    for img, caps in captions.items():
        for i, img_caption in enumerate(caps):
            img_caption = img_caption.replace("-", " ")
            desc = img_caption.split()
            # converts to lowercase
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [word.translate(table) for word in desc]
            # remove hanging 's and a
            desc = [word for word in desc if(len(word)>1)]
            # remove tokens with numbers in them
            desc = [word for word in desc if(word.isalpha())]
            # convert back to string
            img_caption = ' '.join(desc)
            captions[img][i] = img_caption
    return captions
def text_vocabulary(descriptions):
    # build vocabulary of all unique words
    vocab = set()
    for key in descriptions.keys():
        [vocab.update(d.split()) for d in descriptions[key]]
    return vocab
# All descriptions in one file
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc)
    data = "\n".join(lines)
    file = open(filename, "w")
    file.write(data)
    file.close()
# Set these paths according to the project folder on your system
dataset_text = r"D:\dataflair projects\Project - Image Caption Generator\Flickr_8k_text"
dataset_images = r"D:\dataflair projects\Project - Image Caption Generator\Flicker8k_Dataset"
# we prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
# loading the file that contains all data
# mapping it into the descriptions dictionary: img to 5 captions
descriptions = all_img_captions(filename)
print("Length of descriptions =", len(descriptions))
# cleaning the descriptions
clean_descriptions = cleaning_text(descriptions)
# building vocabulary
vocabulary = text_vocabulary(clean_descriptions)
print("Length of vocabulary = ", len(vocabulary))
# saving each description to file
save_descriptions(clean_descriptions, "descriptions.txt")


3. Extracting the feature vector from all images

This technique is also called transfer learning: instead of training a model from scratch ourselves, we use a pre-trained model that has already been trained on a large dataset and extract features from it for our task. We use the Xception model, which has been trained on the ImageNet dataset to classify 1000 different classes. We can import this model directly from keras.applications. Since the Xception model was originally built for ImageNet, we make only small changes to integrate it with our pipeline. One thing to note is that Xception takes images of size 299×299×3 as input. We remove the last classification layer and obtain the 2048-dimensional feature vector.

model = Xception(include_top=False, pooling='avg')

The function extract_features() extracts the features of all images and maps each image name to its feature array. We then dump the features dictionary into a "features.p" pickle file.

def extract_features(directory):
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for img in tqdm(os.listdir(directory)):
        filename = directory + "/" + img
        image = Image.open(filename)
        image = image.resize((299,299))
        image = np.expand_dims(image, axis=0)
        # scale pixels to [-1, 1], as Xception's preprocess_input would
        # image = preprocess_input(image)
        image = image/127.5
        image = image - 1.0
        feature = model.predict(image)
        features[img] = feature
    return features
# 2048 feature vector
features = extract_features(dataset_images)
dump(features, open("features.p","wb"))


Depending on your system, this process may take a lot of time. Once it has run, you can reload the saved features from the pickle file instead of extracting them again:

features = load(open("features.p","rb"))
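
As a quick sanity check, every entry in the dictionary should map an image filename to a (1, 2048) feature array:

# any entry will do; each value is the Xception output for one image
sample = next(iter(features.values()))
print(sample.shape)  # (1, 2048)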

4. Loading the dataset to train the model

In the Flickr_8k_text folder, the file Flickr_8k.trainImages.txt contains the list of the 6000 image names used for training.

To load the training dataset, we need a few more functions:

  • load_photos(filename) – loads the text file as a string and returns the list of image names.
  • load_clean_descriptions(filename, photos) – creates a dictionary containing the captions of each photo in the photos list. We also wrap every caption with the <start> and <end> identifiers; we need this so that our LSTM model can identify the beginning and the end of a caption.
  • load_features(photos) – returns the dictionary of image names and their feature vectors, previously extracted with the Xception model.
#load the data   
def load_photos(filename):  
    file = load_doc(filename)  
    photos = file.split("\n")[:-1]  
    return photos  
def load_clean_descriptions(filename, photos):   
    #loading clean_descriptions  
    file = load_doc(filename)  
    descriptions = {}  
    for line in file.split("\n"):  
        words = line.split()  
        if len(words)<1 :  
            continue  
        image, image_caption = words[0], words[1:]  
        if image in photos:  
            if image not in descriptions:  
                descriptions[image] = []  
            desc = '<start> ' + " ".join(image_caption) + ' <end>'  
            descriptions[image].append(desc)  
    return descriptions  
def load_features(photos):  
    #loading all features  
    all_features = load(open("features.p","rb"))  
    #selecting only needed features  
    features = {k:all_features[k] for k in photos}  
    return features  
filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"  
#train = loading_data(filename)  
train_imgs = load_photos(filename)  
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)  
train_features = load_features(train_imgs)


5. Tokenizing the vocabulary

Computers do not understand English words, so we map each word of the vocabulary to a unique index value. The Keras library provides us with the Tokenizer class, which we use to create tokens from our vocabulary and save the mapping to a "tokenizer.p" pickle file.
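
The tokenizer code itself is missing from this write-up; the following is a minimal sketch consistent with the rest of the listing (it defines the dict_to_list() helper used by max_length() below, as well as the tokenizer and vocab_size variables that the data generator and the model rely on):

# converting the dictionary of descriptions into a flat list of captions
def dict_to_list(descriptions):
    all_desc = []
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc
# creating the tokenizer: each word gets a unique integer index
def create_tokenizer(descriptions):
    desc_list = dict_to_list(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(desc_list)
    return tokenizer
tokenizer = create_tokenizer(train_descriptions)
dump(tokenizer, open('tokenizer.p', 'wb'))
# +1 because index 0 is reserved for sequence padding
vocab_size = len(tokenizer.word_index) + 1
vocab_size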

#calculate maximum length of descriptions  
def max_length(descriptions):  
    desc_list = dict_to_list(descriptions)  
    return max(len(d.split()) for d in desc_list)  

max_length = max_length(descriptions)  
max_length


Our vocabulary contains 7577 words.

We also calculate the maximum length of the descriptions, which is important for deciding the model's structure parameters. The maximum length of a description is 32.

6. Creating the data generator

Let us first look at what the input and output of our model will be. To make this a supervised learning task, we have to give the model inputs and outputs to train on. We train the model on the 6000 training images, where each image is represented by a 2048-dimensional feature vector and each caption by a sequence of numbers. The data for all 6000 images cannot be held in memory at once, so we use a generator method that yields batches.

The generator yields the input and output sequence pairs.
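
To see what those pairs look like, take a (hypothetical) cleaned caption "start two dogs drink water end". create_sequences() below expands it into one training pair per word, repeating the image feature vector each time:

X1 (image feature)   X2 (partial text sequence)         y (next word)
feature              start                              two
feature              start, two                         dogs
feature              start, two, dogs                   drink
feature              start, two, dogs, drink            water
feature              start, two, dogs, drink, water     end

Each X2 row is converted to token indices and padded to max_length, and each y word is one-hot encoded over the vocabulary.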

#create input-output sequence pairs from the image description.  
#data generator, used by model.fit_generator()  
def data_generator(descriptions, features, tokenizer, max_length):  
    while 1:  
        for key, description_list in descriptions.items():  
            #retrieve photo features  
            feature = features[key][0]  
            input_image, input_sequence, output_word = create_sequences(tokenizer, max_length, description_list, feature)  
            yield [[input_image, input_sequence], output_word]  
def create_sequences(tokenizer, max_length, desc_list, feature):  
    X1, X2, y = list(), list(), list()  
    # walk through each description for the image  
    for desc in desc_list:  
        # encode the sequence  
        seq = tokenizer.texts_to_sequences([desc])[0]  
        # split one sequence into multiple X,y pairs  
        for i in range(1, len(seq)):  
            # split into input and output pair  
            in_seq, out_seq = seq[:i], seq[i]  
            # pad input sequence  
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]  
            # encode output sequence  
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]  
            # store  
            X1.append(feature)  
            X2.append(in_seq)  
            y.append(out_seq)  
    return np.array(X1), np.array(X2), np.array(y)  
#You can check the shape of the input and output for your model  
[a,b],c = next(data_generator(train_descriptions, features, tokenizer, max_length))  
a.shape, b.shape, c.shape  
#((47, 2048), (47, 32), (47, 7577))


7. Defining the CNN-RNN model

To define the structure of the model, we use the Keras Model from the Functional API. It consists of three major parts:

  • Feature Extractor – the feature extracted from the image has a size of 2048; with a dense layer, we reduce the dimension to 256 nodes.
  • Sequence Processor – an embedding layer handles the textual input, followed by an LSTM layer.
  • Decoder – we merge the output of the above two layers and process it with a dense layer to make the final prediction. The final layer contains a number of nodes equal to our vocabulary size.

A visual representation of the final model looks like this:

[Figure: model architecture — the 2048-dimensional image feature input passes through Dropout and Dense(256); the text input of length max_length passes through Embedding(256), Dropout and LSTM(256); the two branches are merged with add, then go through Dense(256, relu) and a final Dense(vocab_size, softmax).]

from keras.utils import plot_model  
# define the captioning model  
def define_model(vocab_size, max_length):  
    # features from the CNN model squeezed from 2048 to 256 nodes  
    inputs1 = Input(shape=(2048,))  
    fe1 = Dropout(0.5)(inputs1)  
    fe2 = Dense(256, activation='relu')(fe1)  
    # LSTM sequence model  
    inputs2 = Input(shape=(max_length,))  
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)  
    se2 = Dropout(0.5)(se1)  
    se3 = LSTM(256)(se2)  
    # Merging both models  
    decoder1 = add([fe2, se3])  
    decoder2 = Dense(256, activation='relu')(decoder1)  
    outputs = Dense(vocab_size, activation='softmax')(decoder2)  
    # tie it together [image, seq] [word]  
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)  
    model.compile(loss='categorical_crossentropy', optimizer='adam')  
    # summarize model  
    print(model.summary())  
    plot_model(model, to_file='model.png', show_shapes=True)  
    return model


8. Training the model

To train the model, we use the 6000 training images, generating the input and output sequences in batches and fitting them to the model with the model.fit_generator() method. We also save the model to our models folder after every epoch.

# train our model  
print('Dataset: ', len(train_imgs))  
print('Descriptions: train=', len(train_descriptions))  
print('Photos: train=', len(train_features))  
print('Vocabulary Size:', vocab_size)  
print('Description Length: ', max_length)  
model = define_model(vocab_size, max_length)  
epochs = 10  
steps = len(train_descriptions)  
# making a directory "models" to save our models (exist_ok avoids an error on re-runs)
os.makedirs("models", exist_ok=True)
for i in range(epochs):  
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)  
    model.fit_generator(generator, epochs=1, steps_per_epoch= steps, verbose=1)  
    model.save("models/model_" + str(i) + ".h5")


9. Testing the model

The model has been trained; now we will make a separate file, testing_caption_generator.py, which loads the model and generates predictions. The predictions contain the indexed values of the words up to the maximum length, so we will use the same tokenizer.p pickle file to get the words back from their index values.

import numpy as np  
from PIL import Image  
import matplotlib.pyplot as plt  
import argparse  
ap = argparse.ArgumentParser()  
ap.add_argument('-i', '--image', required=True, help="Image Path")  
args = vars(ap.parse_args())  
img_path = args['image']  
def extract_features(filename, model):
    try:
        image = Image.open(filename)
    except:
        print("ERROR: Couldn't open image! Make sure the image path and extension is correct")
        return None
    image = image.resize((299,299))
    image = np.array(image)
    # for images that have 4 channels (RGBA), we convert them into 3 channels
    if image.shape[2] == 4:
        image = image[..., :3]
    image = np.expand_dims(image, axis=0)
    # same scaling to [-1, 1] as during feature extraction
    image = image/127.5
    image = image - 1.0
    feature = model.predict(image)
    return feature
def word_for_id(integer, tokenizer):
    # map a predicted index back to its word via the tokenizer's word_index
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
def generate_desc(model, tokenizer, photo, max_length):  
    in_text = 'start'  
    for i in range(max_length):  
        sequence = tokenizer.texts_to_sequences([in_text])[0]  
        sequence = pad_sequences([sequence], maxlen=max_length)  
        pred = model.predict([photo,sequence], verbose=0)  
        pred = np.argmax(pred)  
        word = word_for_id(pred, tokenizer)  
        if word is None:  
            break  
        in_text += ' ' + word  
        if word == 'end':  
            break  
    return in_text  
#path = 'Flicker8k_Dataset/111537222_07e56d5a30.jpg'  
max_length = 32  
tokenizer = load(open("tokenizer.p","rb"))  
model = load_model('models/model_9.h5')  
xception_model = Xception(include_top=False, pooling="avg")  
photo = extract_features(img_path, xception_model)  
img = Image.open(img_path)  
description = generate_desc(model, tokenizer, photo, max_length)  
print("\n\n")  
print(description)  
plt.imshow(img)
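
Run the script from the command line, passing the path of a test image with the -i flag (the filename below is just an example; any image on disk works):

python testing_caption_generator.py -i Flicker8k_Dataset/111537222_07e56d5a30.jpg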

[Example test image]

Predicted caption: two girls are playing in the grass

Conclusion

In this project, we implemented the CNN-RNN model by building an image caption generator. A key point to note is that our model depends on its data, so it cannot predict words that are outside its vocabulary. We used a small dataset of 8000 images. For production-level models, we would need to train on a dataset of more than 100,000 images to produce a more accurate model.
