Build an Image Caption Generator with CNN and LSTM

With thanks to the original article: http://bjbsair.com/2020-04-01/tech-info/18508.html

When you look at an image, your brain can easily tell what it shows, but can a computer? Computer vision researchers worked on this for a long time and thought it impossible until recently. With the advance of deep learning techniques, the availability of huge datasets, and more powerful computers, we can now build models that generate captions for images.

That is what we will achieve in this project, where we combine two deep learning techniques: a convolutional neural network and a kind of recurrent neural network, the LSTM.

What is an image caption generator?

Image caption generation is a task that involves computer vision and natural language processing concepts: recognizing the context of an image and describing it in natural language.

The purpose of our project is to learn the concepts behind the CNN and LSTM models and to build a working image caption generator by connecting a CNN to an LSTM.

In this project we implement the caption generator using a CNN (convolutional neural network) and an LSTM (long short-term memory network). The image features are extracted with Xception, a CNN model trained on the ImageNet dataset; we then feed those features into an LSTM model, which is responsible for generating the caption.

Preparing the dataset

For the image caption generator we will use the Flickr_8K dataset. There are larger datasets, such as Flickr_30K and MSCOCO, but training a network on them can take weeks, so we use the small Flickr8k dataset instead. The advantage of a huge dataset is that it lets us build a better model.

Prerequisites

We will need the following libraries:

  • tensorflow
  • keras
  • pillow
  • numpy
  • tqdm
  • jupyterlab

1. First, we import all the required libraries:

import string
import numpy as np
from PIL import Image
import os
from pickle import dump, load
from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers.merge import add
from keras.models import Model, load_model
from keras.layers import Input, Dense, LSTM, Embedding, Dropout
# small library for seeing the progress of loops
from tqdm import tqdm_notebook as tqdm
tqdm().pandas()


2. Getting and performing data cleaning

In our caption file, each record sits on its own line (separated by "\n"), with the image name and its caption separated by a tab ("\t").

Each image has 5 captions, and each caption is given a number from 0 to 4, appended to the image name after a # character.
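
A few lines from Flickr8k.token.txt illustrate the layout (the captions shown here are representative examples; check your own copy of the file for the exact wording):

1000268201_693b08cb0e.jpg#0	A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg#1	A girl going into a wooden building .
1000268201_693b08cb0e.jpg#2	A little girl climbing into a wooden playhouse .

This is also why all_img_captions() below slices off the last two characters (img[:-2]) to recover the plain image filename.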

We will define 5 functions:

  • load_doc(filename) – loads the document file and reads its contents into a string.
  • all_img_captions(filename) – creates a descriptions dictionary that maps each image to a list of its 5 captions.
  • cleaning_text(descriptions) – takes all the descriptions and performs data cleaning. This is an important step when we work with textual data; which kind of cleaning to perform depends on our goal. In our case, we remove punctuation, convert all text to lowercase, and remove words that contain numbers.
  • text_vocabulary(descriptions) – a simple function that separates all the unique words and creates the vocabulary from all the descriptions.
  • save_descriptions(descriptions, filename) – creates a list of all the preprocessed descriptions and stores them in a file. We will create a descriptions.txt file to store all the captions.
# Loading a text file into memory
def load_doc(filename):
    # Opening the file as read only
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text
# get all imgs with their captions
def all_img_captions(filename):
    file = load_doc(filename)
    captions = file.split('\n')
    descriptions = {}
    for caption in captions[:-1]:
        img, caption = caption.split('\t')
        # img[:-2] strips the "#n" suffix from the image name
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [caption]
        else:
            descriptions[img[:-2]].append(caption)
    return descriptions
# Data cleaning - lower casing, removing punctuation and words containing numbers
def cleaning_text(captions):
    table = str.maketrans('', '', string.punctuation)
    for img, caps in captions.items():
        for i, img_caption in enumerate(caps):
            img_caption = img_caption.replace("-", " ")
            desc = img_caption.split()
            # converts to lowercase
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [word.translate(table) for word in desc]
            # remove hanging 's and a
            desc = [word for word in desc if(len(word)>1)]
            # remove tokens with numbers in them
            desc = [word for word in desc if(word.isalpha())]
            # convert back to string
            img_caption = ' '.join(desc)
            captions[img][i] = img_caption
    return captions
def text_vocabulary(descriptions):
    # build vocabulary of all unique words
    vocab = set()
    for key in descriptions.keys():
        [vocab.update(d.split()) for d in descriptions[key]]
    return vocab
# All descriptions in one file
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc)
    data = "\n".join(lines)
    file = open(filename, "w")
    file.write(data)
    file.close()
# Set these paths according to the project folder on your system
dataset_text = r"D:\dataflair projects\Project - Image Caption Generator\Flickr_8k_text"
dataset_images = r"D:\dataflair projects\Project - Image Caption Generator\Flicker8k_Dataset"
# we prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
# loading the file that contains all data
# mapping it into the descriptions dictionary: img to 5 captions
descriptions = all_img_captions(filename)
print("Length of descriptions =", len(descriptions))
# cleaning the descriptions
clean_descriptions = cleaning_text(descriptions)
# building vocabulary
vocabulary = text_vocabulary(clean_descriptions)
print("Length of vocabulary = ", len(vocabulary))
# saving each description to file
save_descriptions(clean_descriptions, "descriptions.txt")


3. Extracting the feature vector from all images

This technique is also called transfer learning: instead of training a model from scratch ourselves, we use a pre-trained model that has already been trained on a large dataset and extract features from it for our task. We use the Xception model, which has been trained on the ImageNet dataset to classify 1000 different classes. We can import this model directly from keras.applications. Since the Xception model was originally built for ImageNet, we make only small changes to integrate it with our pipeline. One thing to note is that Xception takes images of size 299×299×3 as input. We remove the last classification layer and obtain the 2048-dimensional feature vector.

model = Xception(include_top=False, pooling='avg')

The function extract_features() extracts the features of all images and maps each image name to its feature array. We then dump the features dictionary into a "features.p" pickle file.

def extract_features(directory):
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for img in tqdm(os.listdir(directory)):
        filename = directory + "/" + img
        image = Image.open(filename)
        image = image.resize((299,299))
        image = np.expand_dims(image, axis=0)
        # scale pixels to [-1, 1], as Xception's preprocess_input would
        # image = preprocess_input(image)
        image = image/127.5
        image = image - 1.0
        feature = model.predict(image)
        features[img] = feature
    return features
# 2048 feature vector
features = extract_features(dataset_images)
dump(features, open("features.p","wb"))


Depending on your system, this process may take a lot of time. Once it has run, you can reload the saved features from the pickle file instead of extracting them again:

features = load(open("features.p","rb"))
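
As a quick sanity check, every entry in the dictionary should map an image filename to a (1, 2048) feature array:

# any entry will do; each value is the Xception output for one image
sample = next(iter(features.values()))
print(sample.shape)  # (1, 2048)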

4. Loading the dataset to train the model

In the Flickr_8k_text folder, the file Flickr_8k.trainImages.txt contains the list of the 6000 image names used for training.

To load the training dataset, we need a few more functions:

  • load_photos(filename) – loads the text file as a string and returns the list of image names.
  • load_clean_descriptions(filename, photos) – creates a dictionary containing the captions of each photo in the photos list. We also wrap every caption with the <start> and <end> identifiers; we need this so that our LSTM model can identify the beginning and the end of a caption.
  • load_features(photos) – returns the dictionary of image names and their feature vectors, previously extracted with the Xception model.
#load the data   
def load_photos(filename):  
    file = load_doc(filename)  
    photos = file.split("\n")[:-1]  
    return photos  
def load_clean_descriptions(filename, photos):   
    #loading clean_descriptions  
    file = load_doc(filename)  
    descriptions = {}  
    for line in file.split("\n"):  
        words = line.split()  
        if len(words)<1 :  
            continue  
        image, image_caption = words[0], words[1:]  
        if image in photos:  
            if image not in descriptions:  
                descriptions[image] = []  
            desc = '<start> ' + " ".join(image_caption) + ' <end>'  
            descriptions[image].append(desc)  
    return descriptions  
def load_features(photos):  
    #loading all features  
    all_features = load(open("features.p","rb"))  
    #selecting only needed features  
    features = {k:all_features[k] for k in photos}  
    return features  
filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"  
#train = loading_data(filename)  
train_imgs = load_photos(filename)  
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)  
train_features = load_features(train_imgs)


5. Tokenizing the vocabulary

Computers do not understand English words, so we map each word of the vocabulary to a unique index value. The Keras library provides us with the Tokenizer class, which we use to create tokens from our vocabulary and save the mapping to a "tokenizer.p" pickle file.
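
The tokenizer code itself is missing from this write-up; the following is a minimal sketch consistent with the rest of the listing (it defines the dict_to_list() helper used by max_length() below, as well as the tokenizer and vocab_size variables that the data generator and the model rely on):

# converting the dictionary of descriptions into a flat list of captions
def dict_to_list(descriptions):
    all_desc = []
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc
# creating the tokenizer: each word gets a unique integer index
def create_tokenizer(descriptions):
    desc_list = dict_to_list(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(desc_list)
    return tokenizer
tokenizer = create_tokenizer(train_descriptions)
dump(tokenizer, open('tokenizer.p', 'wb'))
# +1 because index 0 is reserved for sequence padding
vocab_size = len(tokenizer.word_index) + 1
vocab_size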

#calculate maximum length of descriptions  
def max_length(descriptions):  
    desc_list = dict_to_list(descriptions)  
    return max(len(d.split()) for d in desc_list)  

max_length = max_length(descriptions)  
max_length


Our vocabulary contains 7577 words.

We also calculate the maximum length of the descriptions, which is important for deciding the model's structure parameters. The maximum length of a description is 32.

6. Creating the data generator

Let us first look at what the input and output of our model will be. To make this a supervised learning task, we have to give the model inputs and outputs to train on. We train the model on the 6000 training images, where each image is represented by a 2048-dimensional feature vector and each caption by a sequence of numbers. The data for all 6000 images cannot be held in memory at once, so we use a generator method that yields batches.

The generator yields the input and output sequence pairs.
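
To see what those pairs look like, take a (hypothetical) cleaned caption "start two dogs drink water end". create_sequences() below expands it into one training pair per word, repeating the image feature vector each time:

X1 (image feature)   X2 (partial text sequence)         y (next word)
feature              start                              two
feature              start, two                         dogs
feature              start, two, dogs                   drink
feature              start, two, dogs, drink            water
feature              start, two, dogs, drink, water     end

Each X2 row is converted to token indices and padded to max_length, and each y word is one-hot encoded over the vocabulary.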

#create input-output sequence pairs from the image description.  
#data generator, used by model.fit_generator()  
def data_generator(descriptions, features, tokenizer, max_length):  
    while 1:  
        for key, description_list in descriptions.items():  
            #retrieve photo features  
            feature = features[key][0]  
            input_image, input_sequence, output_word = create_sequences(tokenizer, max_length, description_list, feature)  
            yield [[input_image, input_sequence], output_word]  
def create_sequences(tokenizer, max_length, desc_list, feature):  
    X1, X2, y = list(), list(), list()  
    # walk through each description for the image  
    for desc in desc_list:  
        # encode the sequence  
        seq = tokenizer.texts_to_sequences([desc])[0]  
        # split one sequence into multiple X,y pairs  
        for i in range(1, len(seq)):  
            # split into input and output pair  
            in_seq, out_seq = seq[:i], seq[i]  
            # pad input sequence  
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]  
            # encode output sequence  
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]  
            # store  
            X1.append(feature)  
            X2.append(in_seq)  
            y.append(out_seq)  
    return np.array(X1), np.array(X2), np.array(y)  
#You can check the shape of the input and output for your model  
[a,b],c = next(data_generator(train_descriptions, features, tokenizer, max_length))  
a.shape, b.shape, c.shape  
#((47, 2048), (47, 32), (47, 7577))


7. Defining the CNN-RNN model

To define the structure of the model, we use the Keras Model from the Functional API. It consists of three major parts:

  • Feature Extractor – the feature extracted from the image has a size of 2048; with a dense layer, we reduce the dimension to 256 nodes.
  • Sequence Processor – an embedding layer handles the textual input, followed by an LSTM layer.
  • Decoder – we merge the output of the above two layers and process it with a dense layer to make the final prediction. The final layer contains a number of nodes equal to our vocabulary size.

A visual representation of the final model looks like this:

[Figure: model architecture — the 2048-dimensional image feature input passes through Dropout and Dense(256); the text input of length max_length passes through Embedding(256), Dropout and LSTM(256); the two branches are merged with add, then go through Dense(256, relu) and a final Dense(vocab_size, softmax).]

from keras.utils import plot_model  
# define the captioning model  
def define_model(vocab_size, max_length):  
    # features from the CNN model squeezed from 2048 to 256 nodes  
    inputs1 = Input(shape=(2048,))  
    fe1 = Dropout(0.5)(inputs1)  
    fe2 = Dense(256, activation='relu')(fe1)  
    # LSTM sequence model  
    inputs2 = Input(shape=(max_length,))  
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)  
    se2 = Dropout(0.5)(se1)  
    se3 = LSTM(256)(se2)  
    # Merging both models  
    decoder1 = add([fe2, se3])  
    decoder2 = Dense(256, activation='relu')(decoder1)  
    outputs = Dense(vocab_size, activation='softmax')(decoder2)  
    # tie it together [image, seq] [word]  
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)  
    model.compile(loss='categorical_crossentropy', optimizer='adam')  
    # summarize model  
    print(model.summary())  
    plot_model(model, to_file='model.png', show_shapes=True)  
    return model


8. Training the model

To train the model, we use the 6000 training images, generating the input and output sequences in batches and fitting them to the model with the model.fit_generator() method. We also save the model to our models folder after every epoch.

# train our model  
print('Dataset: ', len(train_imgs))  
print('Descriptions: train=', len(train_descriptions))  
print('Photos: train=', len(train_features))  
print('Vocabulary Size:', vocab_size)  
print('Description Length: ', max_length)  
model = define_model(vocab_size, max_length)  
epochs = 10  
steps = len(train_descriptions)  
# making a directory "models" to save our models (exist_ok avoids an error on re-runs)
os.makedirs("models", exist_ok=True)
for i in range(epochs):  
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)  
    model.fit_generator(generator, epochs=1, steps_per_epoch= steps, verbose=1)  
    model.save("models/model_" + str(i) + ".h5")


9. Testing the model

The model has been trained; now we will make a separate file, testing_caption_generator.py, which loads the model and generates predictions. The predictions contain the indexed values of the words up to the maximum length, so we will use the same tokenizer.p pickle file to get the words back from their index values.

import numpy as np  
from PIL import Image  
import matplotlib.pyplot as plt  
import argparse  
ap = argparse.ArgumentParser()  
ap.add_argument('-i', '--image', required=True, help="Image Path")  
args = vars(ap.parse_args())  
img_path = args['image']  
def extract_features(filename, model):
    try:
        image = Image.open(filename)
    except:
        print("ERROR: Couldn't open image! Make sure the image path and extension is correct")
        return None
    image = image.resize((299,299))
    image = np.array(image)
    # for images that have 4 channels (RGBA), we convert them into 3 channels
    if image.shape[2] == 4:
        image = image[..., :3]
    image = np.expand_dims(image, axis=0)
    # same scaling to [-1, 1] as during feature extraction
    image = image/127.5
    image = image - 1.0
    feature = model.predict(image)
    return feature
def word_for_id(integer, tokenizer):
    # map a predicted index back to its word via the tokenizer's word_index
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
def generate_desc(model, tokenizer, photo, max_length):  
    in_text = 'start'  
    for i in range(max_length):  
        sequence = tokenizer.texts_to_sequences([in_text])[0]  
        sequence = pad_sequences([sequence], maxlen=max_length)  
        pred = model.predict([photo,sequence], verbose=0)  
        pred = np.argmax(pred)  
        word = word_for_id(pred, tokenizer)  
        if word is None:  
            break  
        in_text += ' ' + word  
        if word == 'end':  
            break  
    return in_text  
#path = 'Flicker8k_Dataset/111537222_07e56d5a30.jpg'  
max_length = 32  
tokenizer = load(open("tokenizer.p","rb"))  
model = load_model('models/model_9.h5')  
xception_model = Xception(include_top=False, pooling="avg")  
photo = extract_features(img_path, xception_model)  
img = Image.open(img_path)  
description = generate_desc(model, tokenizer, photo, max_length)  
print("\n\n")  
print(description)  
plt.imshow(img)
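
Run the script from the command line, passing the path of a test image with the -i flag (the filename below is just an example; any image on disk works):

python testing_caption_generator.py -i Flicker8k_Dataset/111537222_07e56d5a30.jpg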

[Example test image]

Predicted caption: two girls are playing in the grass

Conclusion

In this project, we implemented the CNN-RNN model by building an image caption generator. A key point to note is that our model depends on its data, so it cannot predict words that are outside its vocabulary. We used a small dataset of 8000 images. For production-level models, we would need to train on a dataset of more than 100,000 images to produce a more accurate model.
