Image Captioning for Remote Sensing Image Data, with Source Code

For the theory behind image captioning on remote sensing imagery, please see other articles; this post takes a purely engineering view, with the focus on source code.

References:

1. Image caption datasets for remote sensing images:

[Resource] Bringing remote sensing images to life: exploring models and datasets for remote sensing image caption generation (Remote Sensing (RS) board, 3s001 GIS forum)
http://bbs.3s001.com/thread-264038-1-1.html

Exploring Models and Data for Remote Sensing Image Caption Generation
GitHub - 201528014227051/RSICD_optimal: Datasets for remote sensing images (Paper:Exploring Models and Data for Remote Sensing Image Caption Generation)
https://github.com/201528014227051/RSICD_optimal

2. Source code of one image caption algorithm:

GitHub - TalentBoy2333/remote-sensing-image-caption: remote sensing image classification and image caption by PyTorch
https://github.com/TalentBoy2333/remote-sensing-image-caption

Building on the dataset and algorithm source code above, and to meet the needs of my own project, I modified several .py files as follows:

1. train.py was modified as follows; the main change is the ability to resume training from a checkpoint after an interruption:

import numpy as np 
import torch 
from config import Config
from model import Encoder, DecoderWithAttention
from dataloader import DataLoader 
from augmentations import Augmentation
from eval import val_eval

# One implementation of image captioning (Image Caption) for remote sensing imagery
#GitHub - TalentBoy2333/remote-sensing-image-caption: remote sensing image classification and image caption by PyTorch
#https://github.com/TalentBoy2333/remote-sensing-image-caption


cuda = True if torch.cuda.is_available() else False
cfg = Config()

def cal_loss(sentences, batch_label, alphas, alpha_c):
    loss_func = torch.nn.CrossEntropyLoss()
    for i in range(sentences.size(1)):
        label = batch_label[:, i]
        word = sentences[:,i,:]
        # print(label.size()[0])
        if i == 0:
            loss = loss_func(word, label)
        else:
            loss += loss_func(word, label)
    loss = loss / (i+1)
    # Add doubly stochastic attention regularization
    loss += alpha_c * ((1. - alphas.sum(dim=1)) ** 2).mean()
    return loss
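
# Editor's note (a sketch, not part of the original repo): the per-word loop in
# cal_loss is equivalent to a single vectorized call, because every timestep
# averages over the same batch dimension:
#   loss = torch.nn.functional.cross_entropy(
#       sentences.reshape(-1, sentences.size(-1)), batch_label.reshape(-1))
# followed by the same doubly stochastic attention term from "Show, Attend and Tell".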

def train(model_path=None):
    dataloader = DataLoader(Augmentation())
    encoder = Encoder()
    dict_len = len(dataloader.data.dictionary)
    decoder = DecoderWithAttention(dict_len)

    # Jerry: adding the following two lines resumes training from the saved checkpoint.
    encoder.load_state_dict(torch.load('./models/train/encoder_mobilenet_100.pkl', map_location='cpu'))
    decoder.load_state_dict(torch.load('./models/train/decoder_100.pkl', map_location='cpu'))

    if cuda:
        encoder = encoder.cuda()
        decoder = decoder.cuda()
    # if model_path:
    #   text_generator.load_state_dict(torch.load(model_path))
    # Iteration counter: each time train.py is restarted, set this by hand to match the
    # actual progress, especially when resuming from a checkpoint (see the
    # combined-checkpoint sketch after this listing).
    train_iter = 101
    encoder_optimizer = torch.optim.Adam(encoder.parameters(), lr=cfg.encoder_learning_rate)
    decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=cfg.decoder_learning_rate)

    val_bleu = list()
    losses = list()
    while True:
        batch_image, batch_label = dataloader.get_next_batch()
        batch_image = torch.from_numpy(batch_image).type(torch.FloatTensor)
        batch_label = torch.from_numpy(batch_label).type(torch.LongTensor)
        if cuda:
            batch_image = batch_image.cuda()
            batch_label = batch_label.cuda()
        # print(batch_image.size())
        # print(batch_label.size())

        print('Training')
        output = encoder(batch_image)
        # print('encoder output:', output.size())
        predictions, alphas = decoder(output, batch_label)

        loss = cal_loss(predictions, batch_label, alphas, 1)

        decoder_optimizer.zero_grad()
        encoder_optimizer.zero_grad()
        loss.backward()
        decoder_optimizer.step()
        encoder_optimizer.step()

        print(
            'Iter', train_iter, 
            '| loss:', loss.cpu().data.numpy(), 
            '| batch size:', cfg.batch_size,
            '| encoder learning rate:', cfg.encoder_learning_rate, 
            '| decoder learning rate:', cfg.decoder_learning_rate
        )
        losses.append(loss.cpu().data.numpy())
        if train_iter % cfg.save_model_iter == 0:
            val_bleu.append(val_eval(encoder, decoder, dataloader))
            # Jerry: state_dict() stores only the network's model parameters.
            torch.save(encoder.state_dict(), './models/train/encoder_'+cfg.pre_train_model+'_'+str(train_iter)+'.pkl')
            torch.save(decoder.state_dict(), './models/train/decoder_'+str(train_iter)+'.pkl')
            np.save('./result/train_bleu4.npy', val_bleu)
            np.save('./result/losses.npy', losses)

        if train_iter == cfg.train_iter:
            break
        train_iter += 1

if __name__ == '__main__':
    train()
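
Editing train_iter by hand on every restart is error-prone. As a minimal alternative, a combined checkpoint can carry both optimizer states and the iteration counter; the sketch below is not part of the original repo, and checkpoint_last.pth is a hypothetical file name.

import torch

CKPT = './models/train/checkpoint_last.pth'  # hypothetical combined checkpoint file

def save_checkpoint(encoder, decoder, enc_opt, dec_opt, train_iter):
    # Bundle model weights, optimizer states and the iteration counter into one file.
    torch.save({
        'encoder': encoder.state_dict(),
        'decoder': decoder.state_dict(),
        'encoder_optimizer': enc_opt.state_dict(),
        'decoder_optimizer': dec_opt.state_dict(),
        'train_iter': train_iter,
    }, CKPT)

def load_checkpoint(encoder, decoder, enc_opt, dec_opt):
    # Restore everything saved above and return the iteration to resume from.
    ckpt = torch.load(CKPT, map_location='cpu')
    encoder.load_state_dict(ckpt['encoder'])
    decoder.load_state_dict(ckpt['decoder'])
    enc_opt.load_state_dict(ckpt['encoder_optimizer'])
    dec_opt.load_state_dict(ckpt['decoder_optimizer'])
    return ckpt['train_iter'] + 1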

2. model.py

Several bugs came up while debugging; the solutions, with reference links, are listed below:

1) The mobilenet_v2.pth.tar model file mentioned in the original repo could not be found, so mobilenet_v2-b0353104.pth was used instead.

Loading a .pth model file from a local path in PyTorch (TomorrowAndTuture's blog, CSDN)
https://blog.csdn.net/TomorrowAndTuture/article/details/100219240
pre=torch.load(r'.\kaggle_dog_vs_cat\pretrain\vgg16-397923af.pth')
Jerry: reference for the model-loading call format.

PyTorch notes on loading pretrained models (spectre's blog, CSDN)
https://blog.csdn.net/weixin_41278720/article/details/80759933
Model definitions: https://github.com/pytorch/vision/tree/master/torchvision/models
Official docs: https://pytorch.org/docs/master/torchvision/models.html
vision/mobilenet.py at master · pytorch/vision · GitHub
https://github.com/pytorch/vision/blob/master/torchvision/models/mobilenet.py
Since mobilenet_v2.pth.tar could not be found, I used the .pth model listed in the file at the link above:
model_urls = {
    'mobilenet_v2': 'https://download.pytorch.org/models/mobilenet_v2-b0353104.pth',
}
After downloading the .pth file, place it in the project's models folder.
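
Loading the downloaded weights then looks roughly like this (a sketch, not from the repo; it assumes the third-party MobileNetV2 class uses the same parameter names as the torchvision checkpoint, otherwise the keys must be remapped as in item 3 below):

import torch
from MobileNetV2 import MobileNetV2

net = MobileNetV2(n_class=1000)
state_dict = torch.load('./models/mobilenet_v2-b0353104.pth', map_location='cpu')
net.load_state_dict(state_dict)  # raises an error if parameter names do not match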

2) Training reported insufficient memory. My machine has 8 GB of RAM; setting the maximum virtual memory to 24 GB solved the problem. (Reducing batch_size in config.py would also lower memory use.)

3) A "Missing key(s) in state_dict" error was raised; the fix:

Missing key(s) in state_dict (jacke121's blog, CSDN)
https://blog.csdn.net/jacke121/article/details/84184390
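
The usual cause is a checkpoint saved from a torch.nn.DataParallel-wrapped model, whose parameter names carry a module. prefix. A minimal sketch of the standard fix (model and path are placeholders):

import torch
from collections import OrderedDict

def strip_module_prefix(state_dict):
    # Remove the 'module.' prefix that DataParallel adds to every key.
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        new_state_dict[k[7:] if k.startswith('module.') else k] = v
    return new_state_dict

# Usage (hypothetical names):
# model.load_state_dict(strip_module_prefix(torch.load(path, map_location='cpu')))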

The source code after the modifications above:

import numpy as np 
import torch 
from config import Config
from MobileNetV2 import MobileNetV2
from dataloader import DataLoader

cfg = Config()
cuda = True if torch.cuda.is_available() else False

class Encoder(torch.nn.Module):
    """
    Encoder.
    """
    def __init__(self, encoded_image_size=7):
        super(Encoder, self).__init__()
        self.enc_image_size = encoded_image_size

        # resnet = resnet101(pretrained=True)  # pretrained ImageNet ResNet-101
        # # Remove linear and pool layers (since we're not doing classification)
        # modules = list(resnet.children())[:-2]
        # self.resnet = torch.nn.Sequential(*modules)

        mobilenet = MobileNetV2(n_class=1000)
        # state_dict = torch.load('./models/mobilenet_v2.pth.tar', map_location='cpu')  # add map_location='cpu' if no gpu
        # state_dict = torch.load('./models/mobilenet_v2-b0353104.pth', map_location='cpu') # add map_location='cpu' if no gpu
        state_dict = torch.load('./models/train/encoder_mobilenet_100.pkl', map_location='cpu')  # map_location='cpu' is needed when no GPU is available
        # mobilenet.load_state_dict(state_dict)
        try:
            from collections import OrderedDict
            new_state_dict = OrderedDict()
            for k, v in state_dict.items():
                name = 'module.' + k  # prepend 'module.'; adjust to the checkpoint's actual key names
                new_state_dict[name] = v
            # load params into the backbone (was `self.net.load_state_dict`, but Encoder has no `self.net` attribute)
            mobilenet.load_state_dict(new_state_dict)
        except Exception as e:
            print(e)
        self.mobilenet = mobilenet.features
           
        # Resize image to fixed size to allow input images of variable size
        self.adaptive_pool = torch.nn.AdaptiveAvgPool2d((encoded_image_size, encoded_image_size))

        # self.fine_tune()

    def forward(self, images):
        """
        Forward propagation.
        :param images: images, a tensor of dimensions (batch_size, 3, image_size, image_size)
        :return: encoded images
        """
        out = self.mobilenet(images)  # (batch_size, 1280, image_size/32, image_size/32)
        out = self.adaptive_pool(out)  # (batch_size, 1280, encoded_image_size, encoded_image_size)
        out = out.permute(0, 2, 3, 1)  # (batch_size, encoded_image_size, encoded_image_size, 1280)
        return out

    def fine_tune(self, fine_tune=True):
        """
        Allow or prevent the computation of gradients for the later blocks of the encoder backbone.
        :param fine_tune: Allow?
        """
        for p in self.mobilenet.parameters():
            p.requires_grad = False
        # If fine-tuning, only fine-tune the later convolutional blocks
        # (the original code referenced self.resnet, which does not exist in this class)
        for c in list(self.mobilenet.children())[5:]:
            for p in c.parameters():
                p.requires_grad = fine_tune

class Attention(torch.nn.Module):
    """
    Attention Network.
    """

    def __init__(self):
        super(Attention, self).__init__()
        self.encoder_att = torch.nn.Linear(cfg.feature_size, cfg.attention_size)  # linear layer to transform encoded image
        self.decoder_att = torch.nn.Linear(cfg.hidden_size, cfg.attention_size)  # linear layer to transform decoder's output
        self.full_att = torch.nn.Linear(cfg.attention_size, 1)  # linear layer to calculate values to be softmax-ed
        self.relu = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=1)  # softmax layer to calculate weights

    def forward(self, encoder_out, decoder_hidden):
        """
        Forward propagation.
        :param encoder_out: encoded images, a tensor of dimension (batch_size, num_pixels, encoder_dim)
        :param decoder_hidden: previous decoder output, a tensor of dimension (batch_size, decoder_dim)
        :return: attention weighted encoding, weights
        """
        att1 = self.encoder_att(encoder_out)  # (batch_size, num_pixels, attention_dim)
        att2 = self.decoder_att(decoder_hidden)  # (batch_size, attention_dim)
        att = self.full_att(self.relu(att1 + att2.unsqueeze(1))).squeeze(2)  # (batch_size, num_pixels)
        alpha = self.softmax(att)  # (batch_size, num_pixels)
        attention_weighted_encoding = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)  # (batch_size, encoder_dim)

        return attention_weighted_encoding, alpha
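
# Editor's note (a sketch of the computation above, in the notation of
# "Show, Attend and Tell"): for pixel features f_i and hidden state h,
#     e_i   = w^T relu(W_f f_i + W_h h)    # full_att / encoder_att / decoder_att
#     alpha = softmax(e)                   # weights over the num_pixels positions
#     z     = sum_i alpha_i * f_i          # attention-weighted encoding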

class DecoderWithAttention(torch.nn.Module):
    """
    Decoder.
    """

    def __init__(self, dict_length):
        """
        :param dict_length: size of data's dictionary.
        """
        super(DecoderWithAttention, self).__init__()
        self.dict_length = dict_length
        self.attention = Attention()  # attention network

        # nn.Embedding: maps an integer index in [0, dict_length) to a vector of size cfg.embed_size.
        self.embedding = torch.nn.Embedding(dict_length, cfg.embed_size)  # embedding layer
        self.dropout = torch.nn.Dropout(p=0.5)
        self.decode_step = torch.nn.LSTMCell(cfg.input_size, cfg.hidden_size, bias=True)  # decoding LSTMCell
        # encoder's output feature vector size(2048), change it to hidden_size.
        self.init_h = torch.nn.Linear(cfg.feature_size, cfg.hidden_size)  # linear layer to find initial hidden state of LSTMCell
        self.init_c = torch.nn.Linear(cfg.feature_size, cfg.hidden_size)  # linear layer to find initial cell state of LSTMCell
        # create a gate vector that selects the more important cells of the feature vector.
        self.f_beta = torch.nn.Linear(cfg.hidden_size, cfg.feature_size)  # linear layer to create a sigmoid-activated gate
        self.sigmoid = torch.nn.Sigmoid()
        self.fc = torch.nn.Linear(cfg.hidden_size, dict_length)  # linear layer to find scores over vocabulary
        self.init_weights()  # initialize some layers with the uniform distribution
        self.fine_tune_embeddings()

    def init_weights(self):
        """
        Initializes some parameters with values from the uniform distribution, for easier convergence.
        """
        self.embedding.weight.data.uniform_(-0.1, 0.1)
        self.fc.bias.data.fill_(0)
        self.fc.weight.data.uniform_(-0.1, 0.1)

    def fine_tune_embeddings(self, fine_tune=True):
        """
        Allow fine-tuning of embedding layer? (Only makes sense to not-allow if using pre-trained embeddings).
        :param fine_tune: Allow?
        """
        for p in self.embedding.parameters():
            p.requires_grad = fine_tune

    def init_hidden_state(self, encoder_out):
        """
        Creates the initial hidden and cell states for the decoder's LSTM based on the encoded images.
        :param encoder_out: encoded images, a tensor of dimension (batch_size, num_pixels, encoder_dim)
        :return: hidden state, cell state
        """
        mean_encoder_out = encoder_out.mean(dim=1)
        h = self.init_h(mean_encoder_out)  # (batch_size, decoder_dim)
        c = self.init_c(mean_encoder_out)
        return h, c

    def forward(self, encoder_out, encoded_captions, is_train=True):
        """
        Forward propagation.
        :param encoder_out: encoded images, a tensor of dimension (batch_size, enc_image_size, enc_image_size, encoder_dim)
        :param encoded_captions: encoded captions, a tensor of dimension (batch_size, max_caption_length)
        :return: scores for vocabulary, sorted encoded captions, decode lengths, weights, sort indices
        """

        # Flatten image
        if is_train:
            encoder_out = encoder_out.view(cfg.batch_size, -1, cfg.feature_size)  # (batch_size, num_pixels, encoder_dim)
        else:
            encoder_out = encoder_out.view(1, -1, cfg.feature_size)
        num_pixels = encoder_out.size(1)

        # Embedding
        sentence_length = encoded_captions.size(1)

        """
        # torch.ones(): create a number of [1].
        # label:
            [   9.  691.  241.   18.   41.    9.   66.   27.  262.   22.    9.   11.
            27.   32.   34.   35.    2.]
        # We want input is:
            [   1.    9.  691.  241.   18.   41.    9.   66.   27.  262.   22.    9. 
            11.   27.   32.   34.   35.]
        # So, we take label[:-1]:
            [   9.  691.  241.   18.   41.    9.   66.   27.  262.   22.    9.   11.
            27.   32.   34.   35.]
        # Then, concatenate torch.ones() and label[:-1].
        """
        if is_train:
            prewords_start = torch.ones(cfg.batch_size, 1).type(torch.LongTensor)
        else:
            prewords_start = torch.ones(1, 1).type(torch.LongTensor)
        prewords_start = prewords_start.cuda() if cuda else prewords_start
        prewords_behind = encoded_captions[:,:-1]
        prewords_label = torch.cat([prewords_start, prewords_behind], 1)
        embeddings = self.embedding(prewords_label)  # (batch_size, sentence_length, embed_dim)
        # print('embeddings:', embeddings.size())
        # print('sentence length:', sentence_length)

        # Initialize LSTM state
        h, c = self.init_hidden_state(encoder_out)  # (batch_size, decoder_dim)
        # print('h:', c.size())
        # print('c:', h.size())

        # Create tensors to hold word prediction scores and alphas
        if is_train:
            predictions = torch.zeros(cfg.batch_size, sentence_length, self.dict_length)
            alphas = torch.zeros(cfg.batch_size, sentence_length, num_pixels)
        else:
            predictions = torch.zeros(1, sentence_length, self.dict_length)
            alphas = torch.zeros(1, sentence_length, num_pixels)
        if cuda:
            predictions = predictions.cuda()
            alphas = alphas.cuda()

        # At each time-step, decode by
        # attention-weighing the encoder's output based on the decoder's previous hidden state output
        # then generate a new word in the decoder with the previous word and the attention weighted encoding
        for i in range(sentence_length):
            attention_weighted_encoding, alpha = self.attention(encoder_out, h)
            # print('attention output:', attention_weighted_encoding.size())
            # print('alpha:', alpha.size())
            gate = self.sigmoid(self.f_beta(h))  # gating scalar, (batch_size_t, encoder_dim)
            attention_weighted_encoding = gate * attention_weighted_encoding
            h, c = self.decode_step(torch.cat([embeddings[:, i, :], attention_weighted_encoding], 1), (h, c)) # (batch_size_t, decoder_dim)
            preds = self.fc(self.dropout(h))  # (batch_size_t, vocab_size)
            # print('predictions:', preds.size())
            predictions[:, i, :] = preds
            alphas[:, i, :] = alpha

        return predictions, alphas




if __name__ == '__main__':
    dataloader = DataLoader()
    batch_image, batch_label = dataloader.get_next_batch()
    print(batch_image.shape)
    print(batch_label.shape)
    batch_image = torch.from_numpy(batch_image).type(torch.FloatTensor)
    batch_label = torch.from_numpy(batch_label).type(torch.LongTensor)

    encoder = Encoder()
    dict_len = len(dataloader.data.dictionary)
    rnn = DecoderWithAttention(dict_len)

    output = encoder(batch_image)
    print('encoder output:', output.size())

    predictions, alphas = rnn(output, batch_label)
    print('prediction:', predictions.size())
    print('alphas:', alphas.size())
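
The __main__ block above only checks tensor shapes on a training batch. For single-image inference, the per-timestep loop of DecoderWithAttention.forward translates directly into a greedy decoder. The sketch below is not from the original repo; it assumes token 1 is <start> and token 2 is <end>, as the label example in the docstring suggests, and runs on CPU.

def greedy_caption(encoder, decoder, image, max_len=20):
    """Caption a single image tensor of shape (3, H, W); returns word indices."""
    encoder.eval()
    decoder.eval()
    with torch.no_grad():
        feats = encoder(image.unsqueeze(0))        # (1, enc_size, enc_size, 1280)
        feats = feats.view(1, -1, feats.size(-1))  # (1, num_pixels, 1280)
        h, c = decoder.init_hidden_state(feats)
        word = torch.ones(1, dtype=torch.long)     # assumed <start> token index 1
        indices = []
        for _ in range(max_len):
            emb = decoder.embedding(word)              # (1, embed_size)
            awe, _ = decoder.attention(feats, h)       # (1, feature_size)
            gate = decoder.sigmoid(decoder.f_beta(h))  # same gating as in forward()
            h, c = decoder.decode_step(torch.cat([emb, gate * awe], dim=1), (h, c))
            word = decoder.fc(h).argmax(dim=1)         # greedy word choice
            if word.item() == 2:                       # assumed <end> token index 2
                break
            indices.append(word.item())
    return indices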

3. config.py changes: simple edits, driven by my own hardware and training needs, to the image data paths, batch_size, learning rates, and save_model_iter (checkpoint saving frequency).

class Config():
    images_folder = './data/RSICD/RSICD_images/'
    annotations_name = './data/RSICD/dataset_rsicd.json'

    # pretrain model config
    pre_train_model = 'mobilenet'
    fix_pretrain_model = False
    feature_size = 1280 # number of feature maps in the pretrained model's final layer

    # Attention layer config
    attention_size = 1280

    # LSTM config 
    embed_size = 1280
    input_size = embed_size + feature_size # LSTMCell input size: word embedding (1280) concatenated with the attention-weighted feature (1280)
    hidden_size = 1280 # 4096
    num_layers = 1

    # training config
    batch_size = 100 # 64
    train_iter = 60001 # 100000
    # encoder_learning_rate = 1e-4 # Jerry: the original value
    # decoder_learning_rate = 1e-4 # Jerry: the original value
    encoder_learning_rate = 1e-3
    decoder_learning_rate = 1e-3
    # save_model_iter = 400 # Jerry: the original value
    save_model_iter = 50
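
To monitor a run, the losses.npy and train_bleu4.npy arrays that train.py writes can be plotted in a few lines (a sketch; the paths follow the np.save calls in train.py above):

import numpy as np
import matplotlib.pyplot as plt

losses = np.load('./result/losses.npy')     # one entry per training iteration
bleu = np.load('./result/train_bleu4.npy')  # one entry per saved checkpoint

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(losses)
ax1.set_xlabel('iteration')
ax1.set_ylabel('loss')
ax2.plot(bleu)
ax2.set_xlabel('checkpoint index')
ax2.set_ylabel('validation BLEU-4')
plt.tight_layout()
plt.show()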

