Image Caption Research for Remote Sensing Image Data (with Source Code)

For the theory behind image captioning on remote sensing imagery, please see other articles; this post covers only the engineering side, with the focus on source code.

Reference links:

1. Image caption datasets for remote sensing images:

[Useful] Bringing remote sensing images to life: exploring models and datasets for remote sensing image caption generation - Remote Sensing (RS) - 3s001 GIS forum
http://bbs.3s001.com/thread-264038-1-1.html

Exploring Models and Data for Remote Sensing Image Caption Generation
GitHub - 201528014227051/RSICD_optimal: Datasets for remote sensing images (Paper:Exploring Models and Data for Remote Sensing Image Caption Generation)
https://github.com/201528014227051/RSICD_optimal

2. Source code of an image caption algorithm:

GitHub - TalentBoy2333/remote-sensing-image-caption: remote sensing image classification and image caption by PyTorch
https://github.com/TalentBoy2333/remote-sensing-image-caption

Based on the data and source code above, and on the actual needs of my project, I modified some of the .py files as follows:

1. train.py was modified as follows, mainly to support resuming training from a checkpoint after an interruption:

import numpy as np 
import torch 
from torch.autograd import Variable
from config import Config
from model import Encoder, DecoderWithAttention
from dataloader import DataLoader 
from augmentations import Augmentation
from eval import val_eval

# One implementation of image captioning for remote sensing imagery:
#GitHub - TalentBoy2333/remote-sensing-image-caption: remote sensing image classification and image caption by PyTorch
#https://github.com/TalentBoy2333/remote-sensing-image-caption


cuda = True if torch.cuda.is_available() else False
cfg = Config()

def cal_loss(sentences, batch_label, alphas, alpha_c):
    loss_func = torch.nn.CrossEntropyLoss()
    for i in range(sentences.size(1)):
        label = batch_label[:, i]
        word = sentences[:,i,:]
        # print(label.size()[0])
        if i == 0:
            loss = loss_func(word, label)
        else:
            loss += loss_func(word, label)
    loss = loss / (i+1)
    # Add doubly stochastic attention regularization
    loss += alpha_c * ((1. - alphas.sum(dim=1)) ** 2).mean()
    return loss

def train(model_path=None):
    dataloader = DataLoader(Augmentation())
    encoder = Encoder()
    dict_len = len(dataloader.data.dictionary)
    decoder = DecoderWithAttention(dict_len)

    # Jerry: adding these two lines resumes training from a given checkpoint.
    encoder.load_state_dict(torch.load('./models/train/encoder_mobilenet_100.pkl', map_location='cpu'))
    decoder.load_state_dict(torch.load('./models/train/decoder_100.pkl', map_location='cpu'))

    if cuda:
        encoder = encoder.cuda()
        decoder = decoder.cuda()
    # if model_path:
    #   text_generator.load_state_dict(torch.load(model_path))
    # Iteration counter. Set it to (last checkpoint iteration + 1) each time train.py is
    # restarted, especially when resuming from a checkpoint; here the iteration-100
    # checkpoint is loaded, so training resumes at 101.
    train_iter = 101
    encoder_optimizer = torch.optim.Adam(encoder.parameters(), lr=cfg.encoder_learning_rate)
    decoder_optimizer = torch.optim.Adam(decoder.parameters(), lr=cfg.decoder_learning_rate)

    val_bleu = list()
    losses = list()
    while True:
        batch_image, batch_label = dataloader.get_next_batch()
        batch_image = torch.from_numpy(batch_image).type(torch.FloatTensor)
        batch_label = torch.from_numpy(batch_label).type(torch.LongTensor)
        if cuda:
            batch_image = batch_image.cuda()
            batch_label = batch_label.cuda()
        # print(batch_image.size())
        # print(batch_label.size())

        print('Training')
        output = encoder(batch_image)
        # print('encoder output:', output.size())
        predictions, alphas = decoder(output, batch_label)

        loss = cal_loss(predictions, batch_label, alphas, 1)

        decoder_optimizer.zero_grad()
        encoder_optimizer.zero_grad()
        loss.backward()
        decoder_optimizer.step()
        encoder_optimizer.step()

        print(
            'Iter', train_iter, 
            '| loss:', loss.cpu().data.numpy(), 
            '| batch size:', cfg.batch_size,
            '| encoder learning rate:', cfg.encoder_learning_rate, 
            '| decoder learning rate:', cfg.decoder_learning_rate
        )
        losses.append(loss.cpu().data.numpy())
        if train_iter % cfg.save_model_iter == 0:
            val_bleu.append(val_eval(encoder, decoder, dataloader))
            # Jerry: .state_dict() stores only the network's model parameters.
            torch.save(encoder.state_dict(), './models/train/encoder_'+cfg.pre_train_model+'_'+str(train_iter)+'.pkl')
            torch.save(decoder.state_dict(), './models/train/decoder_'+str(train_iter)+'.pkl')
            np.save('./result/train_bleu4.npy', val_bleu)
            np.save('./result/losses.npy', losses)

        if train_iter == cfg.train_iter:
            break
        train_iter += 1

if __name__ == '__main__':
    train()
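
The resume logic above relies on hand-editing train_iter and on whichever encoder/decoder .pkl files happen to be on disk. Below is a minimal sketch (not part of the original repo; the helper names and checkpoint path are my own) that bundles everything into a single checkpoint file so the iteration counter and optimizer states come back automatically:

import torch

def save_checkpoint(path, train_iter, encoder, decoder, encoder_optimizer, decoder_optimizer):
    # Store the iteration counter together with all four state dicts in one file.
    torch.save({
        'train_iter': train_iter,
        'encoder': encoder.state_dict(),
        'decoder': decoder.state_dict(),
        'encoder_optimizer': encoder_optimizer.state_dict(),
        'decoder_optimizer': decoder_optimizer.state_dict(),
    }, path)

def load_checkpoint(path, encoder, decoder, encoder_optimizer, decoder_optimizer):
    # Restore all state dicts and return the iteration to resume from.
    ckpt = torch.load(path, map_location='cpu')
    encoder.load_state_dict(ckpt['encoder'])
    decoder.load_state_dict(ckpt['decoder'])
    encoder_optimizer.load_state_dict(ckpt['encoder_optimizer'])
    decoder_optimizer.load_state_dict(ckpt['decoder_optimizer'])
    return ckpt['train_iter'] + 1

With this, train() would call load_checkpoint() once before the loop instead of the two load_state_dict lines, and save_checkpoint() inside the `if train_iter % cfg.save_model_iter == 0` branch.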

2.model.py

Several bugs came up during debugging; the fixes are linked below:

1) The mobilenet_v2.pth.tar model file mentioned in the repo could not be found, so mobilenet_v2-b0353104.pth was used instead.

Loading a local .pth model in PyTorch - TomorrowAndTuture's blog - CSDN
https://blog.csdn.net/TomorrowAndTuture/article/details/100219240
pre=torch.load(r'.\kaggle_dog_vs_cat\pretrain\vgg16-397923af.pth')
Jerry: reference for the model-loading syntax.

PyTorch study notes: loading pretrained models - spectre - CSDN blog
https://blog.csdn.net/weixin_41278720/article/details/80759933
Model definitions: https://github.com/pytorch/vision/tree/master/torchvision/models
Official docs: https://pytorch.org/docs/master/torchvision/models.html
vision/mobilenet.py at master · pytorch/vision · GitHub
https://github.com/pytorch/vision/blob/master/torchvision/models/mobilenet.py
Since mobilenet_v2.pth.tar could not be found, I used the .pth model listed in model_urls in the file linked above:
model_urls = {
    'mobilenet_v2': 'https://download.pytorch.org/models/mobilenet_v2-b0353104.pth',
}
After downloading the .pth file, just place it in the project's models folder.

2) Training reported insufficient memory. My machine has 8 GB of RAM; setting the maximum virtual memory (page file) to 24 GB solved the problem.

3) A "Missing key(s) in state_dict" error was raised; the fix is described here:

Missing key(s) in state_dict - jacke121's column - CSDN blog
https://blog.csdn.net/jacke121/article/details/84184390
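
For reference, a minimal sketch of the two usual remedies for this error (the weight file name and the `model` variable are placeholders, not files from this project): strip or add the 'module.' prefix that nn.DataParallel puts in front of every key, or load with strict=False and inspect what was skipped.

from collections import OrderedDict
import torch

state_dict = torch.load('some_weights.pth', map_location='cpu')  # placeholder path

# If the weights were saved from an nn.DataParallel model, every key starts with
# 'module.'; strip it so the keys match a plain (non-parallel) model.
stripped = OrderedDict(
    (k[len('module.'):] if k.startswith('module.') else k, v)
    for k, v in state_dict.items()
)

# Alternatively, load non-strictly and print which keys were missing or unexpected:
# missing, unexpected = model.load_state_dict(stripped, strict=False)
# print(missing, unexpected)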

The source code after the above modifications is as follows:

import numpy as np 
import torch 
from torch.autograd import Variable
from config import Config
from MobileNetV2 import MobileNetV2
from dataloader import DataLoader

cfg = Config()
cuda = True if torch.cuda.is_available() else False

class Encoder(torch.nn.Module):
    """
    Encoder.
    """
    def __init__(self, encoded_image_size=7):
        super(Encoder, self).__init__()
        self.enc_image_size = encoded_image_size

        # resnet = resnet101(pretrained=True)  # pretrained ImageNet ResNet-101
        # # Remove linear and pool layers (since we're not doing classification)
        # modules = list(resnet.children())[:-2]
        # self.resnet = torch.nn.Sequential(*modules)

        mobilenet = MobileNetV2(n_class=1000)
        # state_dict = torch.load('./models/mobilenet_v2.pth.tar', map_location='cpu')  # original ImageNet weights from the repo
        # state_dict = torch.load('./models/mobilenet_v2-b0353104.pth', map_location='cpu')  # torchvision's MobileNetV2 weights
        state_dict = torch.load('./models/train/encoder_mobilenet_100.pkl', map_location='cpu')  # add map_location='cpu' if no gpu
        # The checkpoint above is the Encoder's own state_dict, so its keys carry a
        # 'mobilenet.' prefix; remap it to MobileNetV2's 'features.' prefix and load
        # non-strictly (the classifier weights are not in the checkpoint). A 'module.'
        # prefix would only be needed for weights saved from an nn.DataParallel model.
        try:
            from collections import OrderedDict
            new_state_dict = OrderedDict()
            for k, v in state_dict.items():
                name = k.replace('mobilenet.', 'features.', 1)
                new_state_dict[name] = v
            mobilenet.load_state_dict(new_state_dict, strict=False)
        except Exception as e:
            print(e)
        self.mobilenet = mobilenet.features
           
        # Resize image to fixed size to allow input images of variable size
        self.adaptive_pool = torch.nn.AdaptiveAvgPool2d((encoded_image_size, encoded_image_size))

        # self.fine_tune()

    def forward(self, images):
        """
        Forward propagation.
        :param images: images, a tensor of dimensions (batch_size, 3, image_size, image_size)
        :return: encoded images
        """
        out = self.mobilenet(images)  # (batch_size, 1280, image_size/32, image_size/32)
        out = self.adaptive_pool(out)  # (batch_size, 1280, encoded_image_size, encoded_image_size)
        out = out.permute(0, 2, 3, 1)  # (batch_size, encoded_image_size, encoded_image_size, 1280)
        return out

    def fine_tune(self, fine_tune=True):
        """
        Allow or prevent the computation of gradients for the later blocks of the encoder.
        :param fine_tune: Allow?
        """
        for p in self.mobilenet.parameters():
            p.requires_grad = False
        # If fine-tuning, only fine-tune the later convolutional blocks
        for c in list(self.mobilenet.children())[5:]:
            for p in c.parameters():
                p.requires_grad = fine_tune

class Attention(torch.nn.Module):
    """
    Attention Network.
    """

    def __init__(self):
        super(Attention, self).__init__()
        self.encoder_att = torch.nn.Linear(cfg.feature_size, cfg.attention_size)  # linear layer to transform encoded image
        self.decoder_att = torch.nn.Linear(cfg.hidden_size, cfg.attention_size)  # linear layer to transform decoder's output
        self.full_att = torch.nn.Linear(cfg.attention_size, 1)  # linear layer to calculate values to be softmax-ed
        self.relu = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=1)  # softmax layer to calculate weights

    def forward(self, encoder_out, decoder_hidden):
        """
        Forward propagation.
        :param encoder_out: encoded images, a tensor of dimension (batch_size, num_pixels, encoder_dim)
        :param decoder_hidden: previous decoder output, a tensor of dimension (batch_size, decoder_dim)
        :return: attention weighted encoding, weights
        """
        att1 = self.encoder_att(encoder_out)  # (batch_size, num_pixels, attention_dim)
        att2 = self.decoder_att(decoder_hidden)  # (batch_size, attention_dim)
        att = self.full_att(self.relu(att1 + att2.unsqueeze(1))).squeeze(2)  # (batch_size, num_pixels)
        alpha = self.softmax(att)  # (batch_size, num_pixels)
        attention_weighted_encoding = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)  # (batch_size, encoder_dim)

        return attention_weighted_encoding, alpha

class DecoderWithAttention(torch.nn.Module):
    """
    Decoder.
    """

    def __init__(self, dict_length):
        """
        :param dict_length: size of data's dictionary.
        """
        super(DecoderWithAttention, self).__init__()
        self.dict_length = dict_length
        self.attention = Attention()  # attention network

        # nn.Embedding maps an integer index in [0, dict_length) to a vector of size cfg.embed_size.
        self.embedding = torch.nn.Embedding(dict_length, cfg.embed_size)  # embedding layer
        self.dropout = torch.nn.Dropout(p=0.5)
        self.decode_step = torch.nn.LSTMCell(cfg.input_size, cfg.hidden_size, bias=True)  # decoding LSTMCell
        # the encoder's output feature vectors (cfg.feature_size, 1280 for MobileNetV2) are projected to hidden_size.
        self.init_h = torch.nn.Linear(cfg.feature_size, cfg.hidden_size)  # linear layer to find initial hidden state of LSTMCell
        self.init_c = torch.nn.Linear(cfg.feature_size, cfg.hidden_size)  # linear layer to find initial cell state of LSTMCell
        # create a gate vector that weights the more important channels of the feature vector.
        self.f_beta = torch.nn.Linear(cfg.hidden_size, cfg.feature_size)  # linear layer to create a sigmoid-activated gate
        self.sigmoid = torch.nn.Sigmoid()
        self.fc = torch.nn.Linear(cfg.hidden_size, dict_length)  # linear layer to find scores over vocabulary
        self.init_weights()  # initialize some layers with the uniform distribution
        self.fine_tune_embeddings()

    def init_weights(self):
        """
        Initializes some parameters with values from the uniform distribution, for easier convergence.
        """
        self.embedding.weight.data.uniform_(-0.1, 0.1)
        self.fc.bias.data.fill_(0)
        self.fc.weight.data.uniform_(-0.1, 0.1)

    def fine_tune_embeddings(self, fine_tune=True):
        """
        Allow fine-tuning of embedding layer? (Only makes sense to not-allow if using pre-trained embeddings).
        :param fine_tune: Allow?
        """
        for p in self.embedding.parameters():
            p.requires_grad = fine_tune

    def init_hidden_state(self, encoder_out):
        """
        Creates the initial hidden and cell states for the decoder's LSTM based on the encoded images.
        :param encoder_out: encoded images, a tensor of dimension (batch_size, num_pixels, encoder_dim)
        :return: hidden state, cell state
        """
        mean_encoder_out = encoder_out.mean(dim=1)
        h = self.init_h(mean_encoder_out)  # (batch_size, decoder_dim)
        c = self.init_c(mean_encoder_out)
        return h, c

    def forward(self, encoder_out, encoded_captions, is_train=True):
        """
        Forward propagation.
        :param encoder_out: encoded images, a tensor of dimension (batch_size, enc_image_size, enc_image_size, encoder_dim)
        :param encoded_captions: encoded captions, a tensor of dimension (batch_size, max_caption_length)
        :return: prediction scores over the vocabulary, attention weights (alphas)
        """

        # Flatten image
        if is_train:
            encoder_out = encoder_out.view(cfg.batch_size, -1, cfg.feature_size)  # (batch_size, num_pixels, encoder_dim)
        else:
            encoder_out = encoder_out.view(1, -1, cfg.feature_size)
        num_pixels = encoder_out.size(1)

        # Embedding
        sentence_length = encoded_captions.size(1)

        """
        # torch.ones(): create a number of [1].
        # label:
            [   9.  691.  241.   18.   41.    9.   66.   27.  262.   22.    9.   11.
            27.   32.   34.   35.    2.]
        # We want input is:
            [   1.    9.  691.  241.   18.   41.    9.   66.   27.  262.   22.    9. 
            11.   27.   32.   34.   35.]
        # So, we take label[:-1]:
            [   9.  691.  241.   18.   41.    9.   66.   27.  262.   22.    9.   11.
            27.   32.   34.   35.]
        # Then, concatenate torch.ones() and label[:-1].
        """
        if is_train:
            prewords_start = torch.ones(cfg.batch_size, 1).type(torch.LongTensor)
        else:
            prewords_start = torch.ones(1, 1).type(torch.LongTensor)
        prewords_start = prewords_start.cuda() if cuda else prewords_start
        prewords_behind = encoded_captions[:,:-1]
        prewords_label = torch.cat([prewords_start, prewords_behind], 1)
        embeddings = self.embedding(prewords_label)  # (batch_size, sentence_length, embed_dim)
        # print('embeddings:', embeddings.size())
        # print('sentence length:', sentence_length)

        # Initialize LSTM state
        h, c = self.init_hidden_state(encoder_out)  # (batch_size, decoder_dim)
        # print('h:', c.size())
        # print('c:', h.size())

        # Create tensors to hold word predicion scores and alphas
        if is_train:
            predictions = torch.zeros(cfg.batch_size, sentence_length, self.dict_length)
            alphas = torch.zeros(cfg.batch_size, sentence_length, num_pixels)
        else:
            predictions = torch.zeros(1, sentence_length, self.dict_length)
            alphas = torch.zeros(1, sentence_length, num_pixels)
        if cuda:
            predictions = predictions.cuda()
            alphas = alphas.cuda()

        # At each time-step, decode by
        # attention-weighing the encoder's output based on the decoder's previous hidden state output
        # then generate a new word in the decoder with the previous word and the attention weighted encoding
        for i in range(sentence_length):
            attention_weighted_encoding, alpha = self.attention(encoder_out, h)
            # print('attention output:', attention_weighted_encoding.size())
            # print('alpha:', alpha.size())
            gate = self.sigmoid(self.f_beta(h))  # gating scalar, (batch_size_t, encoder_dim)
            attention_weighted_encoding = gate * attention_weighted_encoding
            h, c = self.decode_step(torch.cat([embeddings[:, i, :], attention_weighted_encoding], 1), (h, c)) # (batch_size_t, decoder_dim)
            preds = self.fc(self.dropout(h))  # (batch_size_t, vocab_size)
            # print('predictions:', preds.size())
            predictions[:, i, :] = preds
            alphas[:, i, :] = alpha

        return predictions, alphas




if __name__ == '__main__':
    dataloader = DataLoader()
    batch_image, batch_label = dataloader.get_next_batch()
    print(batch_image.shape)
    print(batch_label.shape)
    batch_image = torch.from_numpy(batch_image).type(torch.FloatTensor)
    batch_label = torch.from_numpy(batch_label).type(torch.LongTensor)

    encoder = Encoder()
    dict_len = len(dataloader.data.dictionary)
    rnn = DecoderWithAttention(dict_len)

    output = encoder(batch_image)
    print('encoder output:', output.size())

    predictions, alphas = rnn(output, batch_label)
    print('prediction:', predictions.size())
    print('alphas:', alphas.size())

3. config.py was modified to match my own hardware, software, and training needs: the image data paths, batch_size, learning rates, save_model_iter (checkpoint frequency), and other simple settings.

class Config():
    images_folder = './data/RSICD/RSICD_images/'
    annotations_name = './data/RSICD/dataset_rsicd.json'

    # pretrain model config
    pre_train_model = 'mobilenet'
    fix_pretrain_model = False
    feature_size = 1280 # number of feature maps in the pretrained model's final conv layer

    # Attention layer config
    attention_size = 1280

    # LSTM config 
    embed_size = 1280
    input_size = embed_size + feature_size # LSTMCell input: word embedding concatenated with the attention-weighted feature vector
    hidden_size = 1280 # 4096
    num_layers = 1

    # training config
    batch_size = 100 # 64
    train_iter = 60001 # 100000
    # encoder_learning_rate = 1e-4 # Jerry: originally 1e-4
    # decoder_learning_rate = 1e-4 # Jerry: originally 1e-4
    encoder_learning_rate = 1e-3
    decoder_learning_rate = 1e-3
    # save_model_iter = 400 # Jerry: originally 400
    save_model_iter = 50
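
A quick sketch (assuming the RSICD layout configured above, with dataset_rsicd.json in the Karpathy-style format that has a top-level 'images' list) for sanity-checking the paths and estimating how many iterations make up one epoch, which helps when choosing train_iter and save_model_iter:

import json
import os

from config import Config

cfg = Config()

# Count the images on disk and the annotated entries in the JSON file.
num_on_disk = len(os.listdir(cfg.images_folder))
with open(cfg.annotations_name, 'r') as f:
    num_annotated = len(json.load(f)['images'])

iters_per_epoch = num_annotated / float(cfg.batch_size)
print('images on disk:', num_on_disk)
print('annotated images:', num_annotated)
print('approx. iterations per epoch at batch_size', cfg.batch_size, ':', round(iters_per_epoch, 1))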

 
