Lessons learned from writing a picture-caption AI project with PyTorch (while studying GitHub projects)

 

Contents

1 Overall project structure

1.1 File names and their roles

1.2 Overall approach to building the project

2 Lessons from each file

2.1 Main entry point: main.py

2.1.1 Command-line arguments: argparse.ArgumentParser

2.1.2 General layout of main.py

2.1.3 Setting up the available GPU environment

2.1.4 The main function: carries the main logic

2.1.5 The train function: training

2.1.6 The validate function: validation

2.1.7 if __name__ == '__main__':  background setup outside the main logic

2.2 Model: model.py

2.3 Preprocessing: prepro.py

2.4 Data loading: data_loader.py

2.5 Utilities: utils.py

2.6 Building the validation dataset

2.7 Other general lessons

2.7.1 Comments and docstrings

2.7.2 nohup


1 Overall project structure

1.1 File names and their roles

Python files and their roles:
main.py: the main entry point
model.py: the model definitions
prepro.py: preprocessing
data_loader.py: dataset loading helper
flickr8k_dataloader.py: dataset loading helper specific to Flickr8k
compute_mean_val.py: computes the mean and standard deviation of the dataset images
utils.py: utility functions and classes
make_val_dataset.py: builds the validation dataset

 

1.2 Overall approach to building the project

  1. First write main.py and sort out the overall flow there. Whenever you hit a variable whose code has not been written yet, pretend it is already defined, leave a marker, skip the details, and keep writing; this lays out the overall logic and the markers keep main.py consistent with the helper files you will write next.
  2. While writing main.py you will discover which helper files you need (model, utilities, preprocessing, and so on) and get an overall picture of the required interfaces.
  3. Write the preprocessing file prepro.py, preparing the data according to the model input interfaces required by the relevant parts of main.py.
  4. Write the model file model.py: based on the format of the preprocessed data and the relevant theory (a paper you read or an idea of your own), build the model with PyTorch. When you need the data-loading class, do the same as in main.py: leave a marker, skip the details, and keep writing.
  5. Write data_loader.py: based on what the model in model.py expects as input, build your own data-loading class on top of PyTorch's torch.utils.data.DataLoader.
  6. Write the utilities file utils.py: collect the general-purpose functionality needed by the main, preprocessing, data-loading, and model files (especially the parts you skipped earlier) into this utility file.
  7. Write make_val_dataset.py to build the validation set. Usually this means picking some samples out of the training data; ideally the selected samples should no longer take part in training, so that validation stays objective.

 

2 Lessons from each file

2.1 Main entry point: main.py

2.1.1 Command-line arguments: argparse.ArgumentParser

First comes building the command-line arguments, defined as follows:

import argparse
parser = argparse.ArgumentParser()  # command-line argument parser
parser.add_argument(
    '--model_path',  # argument name
    type=str,  # type
    default='./models/',  # default value
    help='path for saving trained models')  # help text
# create the other command-line arguments...
args = parser.parse_args()  # parse the command-line arguments
print(args)  # print the parsed arguments for inspection

Suggestions:

  1. Define the parser at module (global) level, so it can be referenced from anywhere.
  2. Always build command-line arguments with argparse.ArgumentParser(): it is standardized and concise, and every option and its purpose is clear at a glance.

To use an argument:

model_path = args.model_path

and then work with model_path.

Of course, if the value is not needed again later, you can simply use args.model_path directly.
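With this in place the script can be launched from the shell and any default can be overridden, for example (a hypothetical invocation; arguments that are not given fall back to their defaults):

python main.py --model_path ./models/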

 

2.1.2 General layout of main.py

  • imports
  • setting up the available GPU environment
  • the main function, which carries the main logic
  • the train function: training
  • the validate function: validation
  • if __name__ == '__main__':  background setup outside the main logic
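Put together, a bare skeleton of main.py might look like the sketch below (only an outline; the names match the snippets used throughout this section):

import argparse
import torch

parser = argparse.ArgumentParser()
# ... add_argument calls as shown above ...
args = parser.parse_args()

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


def main(args):
    # load the vocabulary, build the models, optimizers, loss and data loaders,
    # then loop over epochs: train(...), validate(...), save_checkpoint(...)
    pass


def train(train_loader, encoder, decoder, criterion, encoder_optimizer,
          decoder_optimizer, epoch):
    pass  # one epoch of training


def validate(val_loader, encoder, decoder, criterion):
    pass  # one epoch of validation; returns the BLEU-4 score


if __name__ == '__main__':
    main(args)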

 

2.1.3 Setting up the available GPU environment

When a GPU is available there is usually only one; if there are several, pick GPU x by writing cuda:x. If no GPU is available, fall back to CPU mode.

device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')

When using it, move variables, models, or computations onto the GPU:

imgs = imgs.to(device)  # move the images to the GPU
decoder = decoder.to(device)  # move the decoder to the GPU
criterion = nn.CrossEntropyLoss().to(device)  # move the loss computation to the GPU
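If you are unsure how many GPUs are visible on the machine, a quick check (a small sketch) is:

import torch

num_gpus = torch.cuda.device_count()  # number of visible GPUs
print('Visible GPUs:', num_gpus)
device = torch.device('cuda:0' if num_gpus > 0 else 'cpu')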

 

2.1.4 The main function: carries the main logic

Part 1: Preloading

Usually a few things need to be loaded up front, such as the vocabulary or a saved model.

# load the vocabulary wrapper
with open(args.vocab_path, 'rb') as f:
    vocab = pickle.load(f)

This uses the pickle module:

import pickle

 

Part 2: Pre-defined variables

Before training starts, some variables need to be created. This splits into two cases, depending on whether you have trained before.

a) If you have trained before, load the previously saved checkpoint (usually a dict saved with torch).

An example of saving (this usually lives in utils.py as a standalone function):

def save_checkpoint(data_name, epoch, epochs_since_improvement, encoder,
                    decoder, encoder_optimizer, decoder_optimizer, bleu4,
                    is_best):
    """
    Saves model checkpoint.
    :param data_name: base name of processed dataset
    :param epoch: epoch number
    :param epochs_since_improvement: number of epochs since last improvement in BLEU-4 score
    :param encoder: encoder model
    :param decoder: decoder model
    :param encoder_optimizer: optimizer to update encoder's weights, if fine-tuning
    :param decoder_optimizer: optimizer to update decoder's weights
    :param bleu4: validation BLEU-4 score for this epoch
    :param is_best: is this checkpoint the best so far?
    """
    state = {
        'epoch': epoch,
        'epochs_since_improvement': epochs_since_improvement,
        'bleu-4': bleu4,
        'encoder': encoder,
        'decoder': decoder,
        'encoder_optimizer': encoder_optimizer,
        'decoder_optimizer': decoder_optimizer
    }
    filename = 'checkpoint_' + data_name + '.pth.tar'
    torch.save(state, filename)
    # if this checkpoint is the best so far, store a copy so it won't be overwritten by a worse one
    if is_best:
        torch.save(state, 'BEST_' + filename)

When loading, use torch.load; the result is a dict, so read it with ordinary key access:

checkpoint = torch.load(args.checkpoint)
start_epoch = checkpoint['epoch'] + 1
epochs_since_improvement = checkpoint['epochs_since_improvement']
best_bleu4 = checkpoint['bleu-4']
decoder = checkpoint['decoder']
decoder_optimizer = checkpoint['decoder_optimizer']
encoder = checkpoint['encoder']
encoder_optimizer = checkpoint['encoder_optimizer']
if fine_tune_encoder is True and encoder_optimizer is None:
    encoder.fine_tune(fine_tune_encoder)  # enable fine-tuning of the encoder
    encoder_optimizer = torch.optim.Adam(
        params=filter(lambda p: p.requires_grad, encoder.parameters()),
        lr=args.encoder_lr)  # encoder optimizer

b) If you have not trained before, create the variables from scratch:

decoder = AttnDecoderRNN(
    attention_dim=args.attention_dim,
    embed_dim=args.embed_dim,
    decoder_dim=args.decoder_dim,
    vocab_size=len(vocab),
    dropout=args.dropout)  # decoder
decoder_optimizer = torch.optim.Adam(
    params=filter(lambda p: p.requires_grad, decoder.parameters()),
    lr=args.decoder_lr)  # decoder optimizer
encoder = EncoderCNN()  # encoder
encoder.fine_tune(args.fine_tune_encoder)  # enable/disable encoder fine-tuning
encoder_optimizer = torch.optim.Adam(
    params=filter(lambda p: p.requires_grad, encoder.parameters()),
    lr=args.encoder_lr) if args.fine_tune_encoder else None  # encoder optimizer
best_bleu4 = args.best_bleu4

As you can see, filter with a lambda expression is used throughout to select only the trainable parameters, and the optimizer of choice is the common and robust Adam.

 

Part 3: Loss function

Then define the loss function, for example cross-entropy:

criterion = nn.CrossEntropyLoss().to(device)

This uses the import

import torch.nn as nn

 

Part 4: Dataset loader

As mentioned earlier, you generally build your own data loader on top of torch.utils.data.DataLoader, for example:

flickr = DataLoader(
    root=root, json=json, vocab=vocab, rank=rank, transform=transform)

data_loader = torch.utils.data.DataLoader(
    dataset=flickr,
    batch_size=batch_size,
    shuffle=shuffle,  # shuffle the data
    num_workers=num_workers,  # number of subprocesses used for data loading
    collate_fn=collate_fn)

Here the dataset argument is a dataset class that inherits from torch.utils.data.Dataset.

A subclass of torch.utils.data.Dataset must implement two methods:

  1. __getitem__(self, index): supports integer indexing from 0 to len(self) - 1, i.e. given an index it returns one data sample.
  2. __len__(self): returns the total number of samples.

The concrete implementation is as follows:

class DataLoader(data.Dataset):
    def __init__(self, root, json, vocab, rank, transform=None):

        self.root = root
        self.flickr = flickr8k(
            ann_text_location=json, imgs_location=root, ann_rank=rank)
        self.vocab = vocab
        self.rank = rank
        self.transform = transform

    # supports integer indexing from 0 to len(self) - 1
    def __getitem__(self, index):
        flickr = self.flickr
        vocab = self.vocab
        # ann:annotation
        caption = flickr.anns[index]['caption']
        img_id = flickr.anns[index]['image_id']
        path = flickr.loadImg(img_id)

        image = Image.open(path).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)

        tokens = nltk.tokenize.word_tokenize(str(caption).lower())  # tokenize
        caption = []
        caption.append(vocab('<start>'))
        caption.extend([vocab(token) for token in tokens])
        caption.append(vocab('<end>'))
        target = torch.Tensor(caption)
        return image, target

    def __len__(self):
        return len(self.flickr.anns)

 

The collate_fn argument is a custom function for assembling a batch, i.e. what each training iteration returns:

def collate_fn(data):
    data.sort(key=lambda x: len(x[1]), reverse=True)
    images, captions = zip(*data)

    images = torch.stack(images, 0)  # stack the image tensors along a new batch dimension

    lengths = [len(cap) for cap in captions]
    targets = torch.zeros(len(captions), max(lengths)).long()
    for i, cap in enumerate(captions):
        end = lengths[i]
        targets[i, :end] = cap[:end]
    return images, targets, lengths

Each call thus returns a batch of images, the corresponding captions, and the caption lengths.

 

With these pieces, wrap everything into our own get_loader, which returns a DataLoader object for loading data:

def get_loader(root, json, vocab, transform, batch_size, rank, shuffle,
               num_workers):
    flickr = DataLoader(
        root=root, json=json, vocab=vocab, rank=rank, transform=transform)

    # data loader for the flickr dataset
    # each iteration returns (images, captions, lengths)
    # images: tensor of shape (batch_size, 3, 224, 224).
    # captions: tensor of shape (batch_size, padded_length).
    # lengths: list of the valid length of each caption; length is (batch_size).
    data_loader = torch.utils.data.DataLoader(
        dataset=flickr,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=num_workers,
        collate_fn=collate_fn)  # merges a list of samples into a mini-batch
    return data_loader

Then the train and validation loaders can be created naturally:

train_loader = get_loader(
    args.image_dir,
    args.caption_path,
    vocab,
    transform,
    args.batch_size,
    args.rank,
    shuffle=True,
    num_workers=args.num_workers)  # training data loader

val_loader = get_loader(
    args.image_dir_val,
    args.caption_path_val,
    vocab,
    transform,
    args.batch_size,
    args.rank,
    shuffle=True,
    num_workers=args.num_workers)  # validation data loader

 

Part 5: The training and validation loop

A for loop usually sets the maximum number of epochs (the number of iterations could also be defined inside the train and validate functions). Each epoch trains, then validates and prints the intermediate information, and finally the model is saved; a sketch of the assembled loop is given at the end of this part.

However, training can overfit, or keep running for many epochs without any improvement, so it is worth considering the following two points.

1. Set an upper limit on the number of epochs without improvement over the historical best; once it is reached, break out of the loop automatically so no time is wasted.

if args.epochs_since_improvement == 20:  # no improvement for 20 epochs since the last best: stop
    break

# train

# validate

is_best = recent_bleu4 > best_bleu4  # is this the best performance so far?
best_bleu4 = max(recent_bleu4, best_bleu4)  # keep track of the best BLEU-4 value
if not is_best:  # still no improvement
    args.epochs_since_improvement += 1
    print("\nEpoch since last improvement: %d\n" %
            (args.epochs_since_improvement, ))  # print the number of epochs since the last improvement
else:  # this epoch improved the best score
    args.epochs_since_improvement = 0  # reset the epochs_since_improvement counter

2. If training has gone a certain number of epochs without improvement, but has not yet hit the stopping limit from point 1, consider lowering the learning rate.

if args.epochs_since_improvement > 0 and args.epochs_since_improvement % 8 == 0:
    adjust_learning_rate(decoder_optimizer, 0.8)  # shrink the decoder learning rate by a fixed factor
    if args.fine_tune_encoder:
        adjust_learning_rate(encoder_optimizer,
                            0.8)  # shrink the encoder learning rate by a fixed factor
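Putting the pieces above together, the epoch loop inside main() looks roughly like the following sketch (it reuses the variables and functions defined earlier; args.num_epochs is assumed to be one of the command-line arguments):

for epoch in range(start_epoch, args.num_epochs):
    # stop if there has been no improvement for 20 consecutive epochs
    if args.epochs_since_improvement == 20:
        break
    # decay the learning rates every 8 stagnant epochs
    if args.epochs_since_improvement > 0 and args.epochs_since_improvement % 8 == 0:
        adjust_learning_rate(decoder_optimizer, 0.8)
        if args.fine_tune_encoder:
            adjust_learning_rate(encoder_optimizer, 0.8)

    train(train_loader, encoder, decoder, criterion, encoder_optimizer,
          decoder_optimizer, epoch)
    recent_bleu4 = validate(val_loader, encoder, decoder, criterion)

    # check whether this epoch improved the best BLEU-4 so far
    is_best = recent_bleu4 > best_bleu4
    best_bleu4 = max(recent_bleu4, best_bleu4)
    if not is_best:
        args.epochs_since_improvement += 1
    else:
        args.epochs_since_improvement = 0

    save_checkpoint(args.data_name, epoch, args.epochs_since_improvement,
                    encoder, decoder, encoder_optimizer, decoder_optimizer,
                    recent_bleu4, is_best)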

 

Part 6: Saving the model

Finally, save the intermediate models. In the end you usually keep just two: the final model and the best model so far.

save_checkpoint(args.data_name, epoch, args.epochs_since_improvement,
                encoder, decoder, encoder_optimizer, decoder_optimizer,
                recent_bleu4, is_best)  # save the model checkpoint

The save_checkpoint function is defined as follows:

def save_checkpoint(data_name, epoch, epochs_since_improvement, encoder,
                    decoder, encoder_optimizer, decoder_optimizer, bleu4,
                    is_best):
    """
    Saves model checkpoint.
    :param data_name: base name of processed dataset
    :param epoch: epoch number
    :param epochs_since_improvement: number of epochs since last improvement in BLEU-4 score
    :param encoder: encoder model
    :param decoder: decoder model
    :param encoder_optimizer: optimizer to update encoder's weights, if fine-tuning
    :param decoder_optimizer: optimizer to update decoder's weights
    :param bleu4: validation BLEU-4 score for this epoch
    :param is_best: is this checkpoint the best so far?
    """
    state = {
        'epoch': epoch,
        'epochs_since_improvement': epochs_since_improvement,
        'bleu-4': bleu4,
        'encoder': encoder,
        'decoder': decoder,
        'encoder_optimizer': encoder_optimizer,
        'decoder_optimizer': decoder_optimizer
    }
    filename = 'checkpoint_' + data_name + '.pth.tar'
    torch.save(state, filename)
    # if this checkpoint is the best so far, store a copy so it won't be overwritten by a worse one
    if is_best:
        torch.save(state, 'BEST_' + filename)

 

2.1.5 The train function: training

def train(train_loader, encoder, decoder, criterion, encoder_optimizer,
          decoder_optimizer, epoch)

First, put the encoder and decoder into training mode:

decoder.train()  # set the decoder module to training mode
encoder.train()  # set the encoder module to training mode

The variables below use the AverageMeter class from utils.py, a helper that tracks a metric's latest value (val), average (avg), sum (sum), and count (count):

# AverageMeter tracks the latest value val, the average avg, the sum, and the count
batch_time = AverageMeter()
data_time = AverageMeter()
losses = AverageMeter()
top5accs = AverageMeter()

The AverageMeter class in utils is defined as follows:

class AverageMeter(object):
    """
    Tracks a metric's latest value, average, sum, and count.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

 

Then fetch data from the DataLoader defined earlier:

for i, (imgs, caps, caplens) in enumerate(train_loader):

The remaining logic is roughly:

  1. Move the data onto the GPU
  2. Run the forward pass to get the predictions
  3. Compute the loss
  4. Add the regularization term to the loss
  5. Zero the optimizers' gradients
  6. Backpropagate
  7. Step the optimizers
  8. Compute the top-5 accuracy (from the 5 largest scores at each position)
  9. Print the current progress every so many iterations

The complete train function for reference:

# training
def train(train_loader, encoder, decoder, criterion, encoder_optimizer,
          decoder_optimizer, epoch):
    decoder.train()  # set the decoder module to training mode
    encoder.train()  # set the encoder module to training mode

    # AverageMeter tracks the latest value val, the average avg, the sum, and the count
    batch_time = AverageMeter()
    data_time = AverageMeter()
    losses = AverageMeter()
    top5accs = AverageMeter()

    start = time.time()  # start timing

    for i, (imgs, caps, caplens) in enumerate(train_loader):
        data_time.update(time.time() - start)

        # move the images and captions to the GPU
        imgs = imgs.to(device)
        caps = caps.to(device)
        imgs = encoder(imgs)  # encoder forward pass

        scores, caps_sorted, decode_lengths, alphas = decoder(
            imgs, caps, caplens)  # decoder forward pass
        scores = pack_padded_sequence(
            scores, decode_lengths, batch_first=True)  # pack a padded batch of variable-length sequences

        targets = caps_sorted[:, 1:]
        targets = pack_padded_sequence(
            targets, decode_lengths, batch_first=True)

        scores = scores.data
        targets = targets.data

        loss = criterion(scores, targets)  # compute the loss with the chosen criterion
        loss += args.alpha_c * ((1. - alphas.sum(dim=1))**2).mean()  # add the regularization term

        decoder_optimizer.zero_grad()  # zero all decoder gradients
        if encoder_optimizer is not None:
            encoder_optimizer.zero_grad()  # zero all encoder gradients
        loss.backward()  # backpropagate the loss

        if args.grad_clip is not None:
            clip_gradient(decoder_optimizer,
                          args.grad_clip)  # clip gradients during backprop to avoid exploding gradients
            if encoder_optimizer is not None:
                clip_gradient(encoder_optimizer, args.grad_clip)

        decoder_optimizer.step()  # decoder optimizer takes a step
        if encoder_optimizer is not None:
            encoder_optimizer.step()  # encoder optimizer takes a step

        top5 = accuracy(scores, targets, 5)  # compute the top-5 accuracy
        losses.update(loss.item(), sum(decode_lengths))
        top5accs.update(top5, sum(decode_lengths))
        batch_time.update(time.time() - start)

        start = time.time()

        # time to print some logs
        if i % args.log_step == 0:
            print('Epoch: [{0}][{1}/{2}]\t'
                  'Batch Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'Data Load Time {data_time.val:.3f} ({data_time.avg:.3f})\t'
                  'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                  'Top-5 Accuracy {top5.val:.3f} ({top5.avg:.3f})'.format(
                      epoch,
                      i,
                      len(train_loader),
                      batch_time=batch_time,
                      data_time=data_time,
                      loss=losses,
                      top5=top5accs))

 

2.1.6 The validate function: validation

The validation function is similar; the main addition is computing the BLEU-4 score to evaluate the model.

The key code:

from nltk.translate.bleu_score import corpus_bleu
# compute the BLEU-4 score
bleu4 = corpus_bleu(references, hypotheses)

The complete validate function:

# evaluate performance on the validation set
def validate(val_loader, encoder, decoder, criterion):
    """
    Performs one epoch's validation.
    :param val_loader: DataLoader for validation data.
    :param encoder: encoder model
    :param decoder: decoder model
    :param criterion: loss layer
    :return: BLEU-4 score
    """
    decoder.eval()  # set the module to evaluation mode (no dropout or batchnorm)
    if encoder is not None:
        encoder.eval()

    batch_time = AverageMeter()
    losses = AverageMeter()
    top5accs = AverageMeter()

    start = time.time()

    references = list()  # references (ground-truth captions) for the BLEU-4 score
    hypotheses = list()  # hypotheses (predictions)

    # iterate over batches
    for i, (imgs, caps, caplens) in enumerate(val_loader):

        # move to the GPU
        imgs = imgs.to(device)
        caps = caps.to(device)

        # forward pass
        if encoder is not None:
            imgs = encoder(imgs)
        scores, caps_sorted, decode_lengths, alphas = decoder(
            imgs, caps, caplens)

        # since decoding starts from <start>, the targets are all the words after <start>, up to <end>
        targets = caps_sorted[:, 1:]

        # remove the timesteps we did not decode at, as well as the padding
        # pack_padded_sequence is a handy trick for doing exactly this
        scores_copy = scores.clone()
        scores = pack_padded_sequence(scores, decode_lengths, batch_first=True)
        targets = pack_padded_sequence(
            targets, decode_lengths, batch_first=True)

        scores = scores.data
        targets = targets.data

        loss = criterion(scores, targets)  # compute the loss

        # add the doubly stochastic attention regularization
        loss += args.alpha_c * ((1. - alphas.sum(dim=1))**2).mean()

        # track the metrics
        losses.update(loss.item(), sum(decode_lengths))
        top5 = accuracy(scores, targets, 5)
        top5accs.update(top5, sum(decode_lengths))
        batch_time.update(time.time() - start)

        start = time.time()

        if i % args.log_step == 0:
            print('Validation: [{0}/{1}]\t'
                  'Batch Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'Loss {loss.val:.4f} ({loss.avg:.4f})\t'
                  'Top-5 Accuracy {top5.val:.3f} ({top5.avg:.3f})\t'.format(
                      i,
                      len(val_loader),
                      batch_time=batch_time,
                      loss=losses,
                      top5=top5accs))

        # store the references (ground-truth captions) and hypotheses (predictions) for each image
        # if for n images we have n hypotheses, and references a, b, c... for each image,
        # then we need
        # references = [[ref1a, ref1b, ref1c], [ref2a, ref2b, ref2c], ...]
        # hypotheses = [hyp1, hyp2, ...]

        # References
        # caps = caps[sort_ind]  # because the images were sorted in the decoder
        for j in range(caps.shape[0]):
            img_caps = caps[j].tolist()
            img_captions = list(
                map(
                    lambda c: [
                        w for w in img_caps if w not in
                        {vocab.__call__('<start>'),
                         vocab.__call__('<end>')}
                    ], img_caps))  # strip the <start> and <end> tokens
            references.append(img_captions)

        # Hypotheses
        _, preds = torch.max(scores_copy, dim=2)
        preds = preds.tolist()
        temp_preds = list()
        for j, p in enumerate(preds):
            temp_preds.append(preds[j][:decode_lengths[j]])  # strip the padding at the end
        preds = temp_preds
        hypotheses.extend(preds)

        assert len(references) == len(hypotheses)

    # compute the BLEU-4 score
    bleu4 = corpus_bleu(references, hypotheses)

    print(
        '\n * LOSS - {loss.avg:.3f}, TOP-5 ACCURACY - {top5.avg:.3f}, BLEU-4 - {bleu}\n'
        .format(loss=losses, top5=top5accs, bleu=bleu4))

    return bleu4

 

2.1.7 if __name__ == '__main__':  background setup outside the main logic

You can change the process name here, so that when several people share a server everyone can see whose job is running and nothing gets killed by mistake.

if __name__ == '__main__':
    setproctitle.setproctitle("張晉豪的python caption flickr8k")
    main(args)
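Note that this uses the third-party setproctitle package (install it with pip install setproctitle and import setproctitle at the top of main.py).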

 

2.2 Model: model.py

This is where the neural network is defined with PyTorch. In the simplest case, inherit from the nn.Module base class and override the forward method: forward takes the input data as its arguments and returns the prediction.

You can of course also define other helper methods, such as fine_tune for fine-tuning.

Concrete examples follow.

The CNN encoder is defined as follows:

class EncoderCNN(nn.Module):
    def __init__(self, encoded_image_size=14):
        super(EncoderCNN, self).__init__()
        resnet = models.resnet101(pretrained=True)
        # children() returns an iterator over the direct child modules
        modules = list(resnet.children())[:-2]
        self.resnet = nn.Sequential(*modules)

        self.adaptive_pool = nn.AdaptiveAvgPool2d((encoded_image_size,
                                                   encoded_image_size))
        self.fine_tune()

    def forward(self, images):

        out = self.resnet(images)
        out = self.adaptive_pool(out)
        out = out.permute(0, 2, 3, 1)  # reorder the tensor axes
        return out

    def fine_tune(self, fine_tune=True):
        for p in self.resnet.parameters():
            p.requires_grad = False
        for c in list(self.resnet.children())[5:]:
            for p in c.parameters():
                p.requires_grad = fine_tune

The attention decoder is defined as follows:

class AttnDecoderRNN(nn.Module):
    def __init__(self,
                 attention_dim,
                 embed_dim,
                 decoder_dim,
                 vocab_size,
                 encoder_dim=2048,
                 dropout=0.5):
        super(AttnDecoderRNN, self).__init__()
        self.encoder_dim = encoder_dim
        self.attention_dim = attention_dim
        self.embed_dim = embed_dim
        self.decoder_dim = decoder_dim
        self.vocab_size = vocab_size
        self.dropout = dropout

        self.attention = Attention(encoder_dim, decoder_dim, attention_dim)

        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.dropout = nn.Dropout(p=self.dropout)
        self.decode_step = nn.LSTMCell(
            embed_dim + encoder_dim, decoder_dim, bias=True)
        self.init_h = nn.Linear(encoder_dim, decoder_dim)
        self.init_c = nn.Linear(encoder_dim, decoder_dim)
        self.f_beta = nn.Linear(
            decoder_dim,
            encoder_dim)  # linear layer to create a sigmoid-activated gate
        self.sigmoid = nn.Sigmoid()
        self.fc = nn.Linear(decoder_dim, vocab_size)
        self.init_weights()

    def init_weights(self):
        self.embedding.weight.data.uniform_(-0.1, 0.1)
        self.fc.bias.data.fill_(0)
        self.fc.weight.data.uniform_(-0.1, 0.1)

    def load_pretrained_embeddings(self, embeddings):
        # nn.Parameter turns the tensor into a model parameter,
        # so the optimizer can update it; after this conversion the embedding weights
        # become part of the model and can be changed during training,
        # which lets them keep moving towards the optimum while learning
        self.embedding.weight = nn.Parameter(embeddings)

    def fine_tune_embeddings(self, fine_tune=True):
        for p in self.embedding.parameters():
            p.requires_grad = fine_tune

    def init_hidden_state(self, encoder_out):
        mean_encoder_out = encoder_out.mean(dim=1)
        h = self.init_h(mean_encoder_out)
        c = self.init_c(mean_encoder_out)
        return h, c

    def forward(self, encoder_out, encoded_captions, caption_lengths):
        """
        :return: scores for vocabulary, sorted encoded captions, decode lengths, weights
        """
        batch_size = encoder_out.size(0)
        encoder_dim = encoder_out.size(-1)
        vocab_size = self.vocab_size

        encoder_out = encoder_out.view(batch_size, -1,
                                       encoder_dim)  # view is PyTorch's reshape
        num_pixels = encoder_out.size(1)

        embeddings = self.embedding(encoded_captions)

        h, c = self.init_hidden_state(encoder_out)

        decode_lengths = [c - 1 for c in caption_lengths]

        predictions = torch.zeros(batch_size, max(decode_lengths),
                                  vocab_size).to(device)
        alphas = torch.zeros(batch_size, max(decode_lengths),
                             num_pixels).to(device)

        # the whole batch is predicted together
        # each caption is predicted one word at a time
        # once the shorter captions are finished, only the remaining longer ones keep decoding
        # the captions were already sorted in the dataloader, so lengths decrease from first to last
        for t in range(max(decode_lengths)):
            batch_size_t = sum([l > t for l in decode_lengths])
            attention_weighted_encoding, alpha = self.attention(
                encoder_out[:batch_size_t], h[:batch_size_t])
            gate = self.sigmoid(self.f_beta(h[:batch_size_t]))
            attention_weighted_encoding = gate * attention_weighted_encoding
            h, c = self.decode_step(
                torch.cat([
                    embeddings[:batch_size_t, t, :],
                    attention_weighted_encoding
                ],
                          dim=1), (h[:batch_size_t], c[:batch_size_t]))
            preds = self.fc(self.dropout(h))
            predictions[:batch_size_t, t, :] = preds
            alphas[:batch_size_t, t, :] = alpha

        return predictions, encoded_captions, decode_lengths, alphas

The Attention helper class is defined as follows:

class Attention(nn.Module):
    def __init__(self, encoder_dim, decoder_dim, attention_dim):
        super(Attention, self).__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)
        self.full_att = nn.Linear(attention_dim, 1)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, encoder_out, decoder_hidden):
        att1 = self.encoder_att(encoder_out)
        att2 = self.decoder_att(decoder_hidden)
        # unsqueeze(arg) inserts a new dimension of size 1 at position arg
        # squeeze(arg) removes dimension arg if its size is 1
        att = self.full_att(self.relu(att1 + att2.unsqueeze(1))).squeeze(2)
        alpha = self.softmax(att)
        attention_weighted_encoding = (encoder_out * alpha.unsqueeze(2)).sum(
            dim=1)
        return attention_weighted_encoding, alpha

 

2.3 Preprocessing: prepro.py

Preprocessing depends on the task: for NLP it mainly means building the vocabulary, while for CV it mainly means resizing, denoising, and normalizing the images.

Since this is a picture-caption project, it needs both the NLP and the CV preprocessing.

Part 1: NLP (building the vocabulary)

from flickr8k_dataloader import flickr8k
class Vocabulary(object):
    """Simple vocabulary wrapper."""

    def __init__(self):
        self.word2idx = {}
        self.idx2word = {}
        self.idx = 0

    def add_word(self, word):
        if not word in self.word2idx:
            self.word2idx[word] = self.idx
            self.idx2word[self.idx] = word
            self.idx += 1

    def __call__(self, word):
        if not word in self.word2idx:
            return self.word2idx['<unk>']
        return self.word2idx[word]

    def __len__(self):
        return len(self.word2idx)


def build_vocab(json, threshold):
    """Build a simple vocabulary wrapper."""
    flickr = flickr8k(ann_text_location=json)
    counter = Counter()
    anns_length = len(flickr.anns)
    for id in range(anns_length):
        caption = str(flickr.anns[id]['caption'])
        tokens = nltk.tokenize.word_tokenize(caption.lower())
        counter.update(tokens)

        if id % 1000 == 0:
            print("[%d/%d] Tokenized the captions." % (id, anns_length))

    # words whose frequency is below 'threshold' are discarded
    words = [word for word, cnt in counter.items() if cnt >= threshold]

    # create the vocabulary and add some special tokens
    vocab = Vocabulary()
    vocab.add_word('<pad>')
    vocab.add_word('<start>')
    vocab.add_word('<end>')
    vocab.add_word('<unk>')

    # add the words to the vocabulary
    for i, word in enumerate(words):
        vocab.add_word(word)
    
    return vocab

This uses the helper class flickr8k from my flickr8k_dataloader.py.

The complete flickr8k_dataloader.py file is:

# coding=utf-8
'''
Reads the flickr8k dataset.
'''
import re
import os


class flickr8k():
    def __init__(
            self,
            ann_text_location='/mnt/disk2/flickr8k/Flickr8k_text/Flickr8k.lemma.token.txt',
            imgs_location='/mnt/disk2/flickr8k/Flickr8k_Dataset/Flickr8k_Dataset/',
            ann_rank=4):
        '''
        Helper class for reading the flickr8k dataset.
        :param ann_text_location: path of the annotation file
        :param imgs_location: path of the image folder
        :param ann_rank: which of the numbered annotations (#0-#4) to use
        '''
        self.ann_text_location = ann_text_location
        self.ann_rank = ann_rank
        self.imgs_location = imgs_location

        self.anns = self.read_anns()

    def read_anns(self):
        '''
        Reads the image ids (without .jpg) and the annotations.
        :returns: anns, a list whose elements are dicts: {'image_id': image_id, 'caption': image_annotation}
        '''
        anns = []
        with open(self.ann_text_location, 'r') as raw_ann_text:
            ann_text_lines = raw_ann_text.readlines()
        match_re = r'(.*)\.jpg#' + str(self.ann_rank) + r'\s+(.*)'
        for line in ann_text_lines:
            matchObj = re.match(match_re, line)
            if matchObj:
                image_id = matchObj.group(1)
                image_annotation = matchObj.group(2)
                image = {'image_id': image_id, 'caption': image_annotation}
                anns.append(image)
        return anns

    def loadImg(self, img_id):
        '''
        Returns the full path of one image.
        :param img_id: the image id (without .jpg)
        :returns: img_path, the full path of the image
        '''
        img_path = os.path.join(self.imgs_location, img_id + '.jpg')
        return img_path


# test
# if __name__ == "__main__":
#     f = flickr8k()
#     print('f.anns[0] ', f.anns[0])
#     print('len(f.anns)', len(f.anns))
#     id = f.anns[0]['image_id']
#     path = f.loadImg(id)
#     print('path', path)

 

Part 2: CV (adjusting the images)

from PIL import Image
def resize_image(image):
    width, height = image.size
    # use the shorter of the width and height as the base length
    # center-crop the longer side to that base length, giving a square image
    if width > height:
        left = (width - height) / 2
        right = width - left
        top = 0
        bottom = height
    else:
        top = (height - width) / 2
        bottom = height - top
        left = 0
        right = width
    image = image.crop((left, top, right, bottom))
    image = image.resize([224, 224], Image.ANTIALIAS)  # ANTIALIAS: high-quality resampling
    return image

 

Part 3: The main routine tying the two together (build the vocabulary, then resize and save the images)

def main(args):
    vocab = build_vocab(json=args.caption_path, threshold=args.threshold)
    vocab_path = args.vocab_path
    with open(vocab_path, 'wb') as f:
        pickle.dump(vocab, f)
    print("Total vocabulary size: %d" % len(vocab))
    print("Saved the vocabulary wrapper to '%s'" % vocab_path)

    folder = '/mnt/disk2/flickr8k/Flickr8k_Dataset/Flickr8k_Dataset/'
    resized_folder = '/mnt/disk2/flickr8k/Flickr8k_Dataset/Flickr8k_Dataset_resized/'
    if not os.path.exists(resized_folder):
        os.makedirs(resized_folder)

    print('Start resizing images.')
    image_files = os.listdir(folder)
    num_images = len(image_files)
    for i, image_file in enumerate(image_files):
        with open(os.path.join(folder, image_file), 'rb') as f:
            with Image.open(f) as image:
                image = resize_image(image)  # resize the image
                image.save(
                    os.path.join(resized_folder, image_file),
                    image.format)  # save the resized image
        if i % 100 == 0:
            print('Resized images: %d/%d' % (i, num_images))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--caption_path',
        type=str,
        default='/mnt/disk2/flickr8k/Flickr8k_text/Flickr8k.lemma.token.txt',
        help='path for train annotation file')
    parser.add_argument(
        '--vocab_path',
        type=str,
        default='/mnt/disk2/flickr8k/Flickr8k_Dataset/vocab.pkl',
        help='path for saving vocabulary wrapper')
    parser.add_argument(
        '--threshold',
        type=int,
        default=1,
        help='minimum word count threshold')
    args = parser.parse_args()
    main(args)

 

 

2.4 Data loading: data_loader.py

This was already covered in full in part 4 of section 2.1.4 (the dataset loader), so I will not repeat the explanation; the complete code is below.

# coding=utf-8
import os

import nltk
import torch
import torch.utils.data as data
from PIL import Image
from flickr8k_dataloader import flickr8k


class DataLoader(data.Dataset):
    def __init__(self, root, json, vocab, rank, transform=None):

        self.root = root
        self.flickr = flickr8k(
            ann_text_location=json, imgs_location=root, ann_rank=rank)
        self.vocab = vocab
        self.rank = rank
        self.transform = transform

    # supports integer indexing from 0 to len(self) - 1
    def __getitem__(self, index):
        flickr = self.flickr
        vocab = self.vocab
        # ann:annotation
        caption = flickr.anns[index]['caption']
        img_id = flickr.anns[index]['image_id']
        path = flickr.loadImg(img_id)

        image = Image.open(path).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)

        tokens = nltk.tokenize.word_tokenize(str(caption).lower())  # tokenize
        caption = []
        caption.append(vocab('<start>'))
        caption.extend([vocab(token) for token in tokens])
        caption.append(vocab('<end>'))
        target = torch.Tensor(caption)
        return image, target

    def __len__(self):
        return len(self.flickr.anns)


def collate_fn(data):
    data.sort(key=lambda x: len(x[1]), reverse=True)
    images, captions = zip(*data)

    images = torch.stack(images, 0)  # stack the image tensors along a new batch dimension

    lengths = [len(cap) for cap in captions]
    targets = torch.zeros(len(captions), max(lengths)).long()
    for i, cap in enumerate(captions):
        end = lengths[i]
        targets[i, :end] = cap[:end]
    return images, targets, lengths


def get_loader(root, json, vocab, transform, batch_size, rank, shuffle,
               num_workers):
    flickr = DataLoader(
        root=root, json=json, vocab=vocab, rank=rank, transform=transform)

    # data loader for the flickr dataset
    # each iteration returns (images, captions, lengths)
    # images: tensor of shape (batch_size, 3, 224, 224).
    # captions: tensor of shape (batch_size, padded_length).
    # lengths: list of the valid length of each caption; length is (batch_size).
    data_loader = torch.utils.data.DataLoader(
        dataset=flickr,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=num_workers,
        collate_fn=collate_fn)  # merges a list of samples into a mini-batch
    return data_loader

 

2.5 Utilities: utils.py

These are mostly small helpers that have already come up in the text and code above: clip_gradient (clips gradients during backpropagation to avoid exploding gradients), save_checkpoint (saves an intermediate model), AverageMeter (a helper class that tracks a metric's latest value, average, sum, and count), adjust_learning_rate (shrinks the learning rate by a given factor), and accuracy (computes the top-k accuracy from the predictions and the true labels). The docstrings are written fairly carefully, so the code should speak for itself.

# coding=utf-8
import numpy as np
import torch


def init_embedding(embeddings):
    """
    Fills the embedding tensor with values drawn from a uniform distribution.
    :param embeddings: embedding tensor
    """
    bias = np.sqrt(3.0 / embeddings.size(1))
    torch.nn.init.uniform_(embeddings, -bias, bias)


def load_embeddings(emb_file, word_map):
    """
    Creates an embedding tensor for the specified word map, to be loaded into the model.
    :param emb_file: file containing embeddings (stored in GloVe format)
    :param word_map: word map
    :return: embeddings (in the same order as the words in the word map), emb_dim (the embedding dimension)
    """

    # find the embedding dimension
    with open(emb_file, 'r') as f:
        emb_dim = len(f.readline().split(' ')) - 1

    vocab = set(word_map.keys())

    # create a tensor to hold the embeddings and initialize it
    embeddings = torch.FloatTensor(len(vocab), emb_dim)
    init_embedding(embeddings)

    # read the embedding file
    print("\nLoading embeddings...")
    for line in open(emb_file, 'r'):
        line = line.split(' ')

        emb_word = line[0]
        # process the word vector:
        # drop whitespace, then convert the strings to floats
        embedding = list(
            map(lambda t: float(t),
                filter(lambda n: n and not n.isspace(), line[1:])))

        # ignore words that are not in the training vocabulary
        if emb_word not in vocab:
            continue

        # record the word's vector in the embeddings tensor
        embeddings[word_map[emb_word]] = torch.FloatTensor(embedding)

    return embeddings, emb_dim


def clip_gradient(optimizer, grad_clip):
    """
    Clips gradients computed during backpropagation to avoid exploding gradients.
    :param optimizer: optimizer with the gradients to be clipped
    :param grad_clip: clip value
    """
    for group in optimizer.param_groups:
        for param in group['params']:
            if param.grad is not None:
                # clamp_ clips every element into the range [min, max] in place
                # values already inside the range are unchanged; values outside are replaced by min or max
                param.grad.data.clamp_(-grad_clip, grad_clip)


def save_checkpoint(data_name, epoch, epochs_since_improvement, encoder,
                    decoder, encoder_optimizer, decoder_optimizer, bleu4,
                    is_best):
    """
    Saves model checkpoint.
    :param data_name: base name of processed dataset
    :param epoch: epoch number
    :param epochs_since_improvement: number of epochs since last improvement in BLEU-4 score
    :param encoder: encoder model
    :param decoder: decoder model
    :param encoder_optimizer: optimizer to update encoder's weights, if fine-tuning
    :param decoder_optimizer: optimizer to update decoder's weights
    :param bleu4: validation BLEU-4 score for this epoch
    :param is_best: is this checkpoint the best so far?
    """
    state = {
        'epoch': epoch,
        'epochs_since_improvement': epochs_since_improvement,
        'bleu-4': bleu4,
        'encoder': encoder,
        'decoder': decoder,
        'encoder_optimizer': encoder_optimizer,
        'decoder_optimizer': decoder_optimizer
    }
    filename = 'checkpoint_' + data_name + '.pth.tar'
    torch.save(state, filename)
    # if this checkpoint is the best so far, store a copy so it won't be overwritten by a worse one
    if is_best:
        torch.save(state, 'BEST_' + filename)


class AverageMeter(object):
    """
    Tracks a metric's latest value, average, sum, and count.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def adjust_learning_rate(optimizer, shrink_factor):
    """
    Shrinks the learning rate by a specified factor.
    :param optimizer: optimizer whose learning rate must be shrunk.
    :param shrink_factor: factor in interval (0, 1) to multiply learning rate with.
    """

    print("\nDECAYING learning rate.")
    for param_group in optimizer.param_groups:
        param_group['lr'] = param_group['lr'] * shrink_factor
    print(
        "The new learning rate is %f\n" % (optimizer.param_groups[0]['lr'], ))


def accuracy(scores, targets, k):
    """
    Computes the top-k accuracy from the predicted scores and the true labels.
    :param scores: scores from the model
    :param targets: true labels
    :param k: k in top-k accuracy
    :return: top-k accuracy
    """

    batch_size = targets.size(0)
    _, ind = scores.topk(k, 1, True, True)
    correct = ind.eq(targets.view(-1, 1).expand_as(ind))
    correct_total = correct.view(-1).float().sum()  # 0D tensor
    return correct_total.item() * (100.0 / batch_size)

 

2.6 Building the validation dataset

This mainly means picking some images out of the original flickr8k data to use as a validation set. I made a small logical mistake here: the training set still contains these images. In future projects, always keep the validation data separate from the training data so that validation stays objective, haha.

The annotation file also needs some handling: a small regular expression pulls out the image ids and captions.

import re
import numpy as np
import os
import random

ann_text_location = '/mnt/disk2/flickr8k/Flickr8k_text/Flickr8k.lemma.token.txt'
val_text_folder = '/mnt/disk1/zjhao/show_attend_and_tell_pytorch_flickr8/val/'
ann_text_val_location = os.path.join(val_text_folder,
                                     'Flickr8k_val.lemma.token.txt')
pictures_img_resized_folder = '/mnt/disk2/flickr8k/Flickr8k_Dataset/Flickr8k_Dataset_resized/'
pictures_img_resized_val_folder = '/mnt/disk2/flickr8k/Flickr8k_Dataset/Flickr8k_Dataset_val_resized'
picture_rank = 5
item = 0
val_rate = 0.1
pictures_captions = []
picture_captions = []
with open(ann_text_location, 'r') as raw_ann_text:
    ann_text_lines = raw_ann_text.readlines()
    total = int(len(ann_text_lines) / picture_rank)
    val_total = int(total * val_rate)
    val_random = random.sample(range(total), val_total)
    if not os.path.exists(val_text_folder):
        os.makedirs(val_text_folder)
    if not os.path.exists(pictures_img_resized_val_folder):
        os.makedirs(pictures_img_resized_val_folder)
    for doc_path in os.listdir(pictures_img_resized_val_folder):
        if not os.path.isdir(doc_path):
            os.remove(os.path.join(pictures_img_resized_val_folder, doc_path))
    with open(ann_text_val_location, 'w+') as val_ann_text:
        count = 0
        for val_id in val_random:
            for line in ann_text_lines[val_id * 5:val_id * 5 + 5]:
                val_ann_text.write(line)
            match_line = ann_text_lines[val_id * 5]
            match_re = r'(.*).jpg#[0-9]\s+.*'
            matchObj = re.match(match_re, match_line)
            img_full_path = os.path.join(pictures_img_resized_folder,
                                         matchObj.group(1) + '.jpg')
            img_full_path_copy = os.path.join(pictures_img_resized_val_folder,
                                              matchObj.group(1) + '.jpg')
            os.system('cp %s %s' % (img_full_path, img_full_path_copy))
            count += 1
            print('%d/%d' % (count, val_total))

2.7 Other general lessons

2.7.1 Comments and docstrings

Writing good comments really is crucial, and it means more than just sprinkling # everywhere. The main thing is function docstrings: right under the function name, use a '''xxx''' long string (single or double quotes both work) to spell out what the function does, its inputs, and its outputs. This both keeps your own thinking clear and shows up at a glance in the VS Code tooltip.

The docstring format looks like this:

def function_name(parameters):
    '''
    What the function does
    :param param_1: meaning and format of the parameter
    :param param_2: meaning and format of the parameter
    :returns: the returned object and its format
    '''
    # start writing the function body

With this in place, hovering the mouse over any call to the function elsewhere in the code makes VS Code show the whole docstring in a tooltip, which is wonderfully clear.
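As a concrete example in this format (a hypothetical helper, just to illustrate the style):

from PIL import Image


def load_image(path, transform=None):
    '''
    Load an image from disk and optionally apply a torchvision transform.
    :param path: full path of the image file
    :param transform: optional callable applied to the PIL image
    :returns: the (possibly transformed) image
    '''
    image = Image.open(path).convert('RGB')
    if transform is not None:
        image = transform(image)
    return image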

 

2.7.2 nohup

nohup lets the Python job keep running on the server after you close the xshell connection, so you do not have to keep your own machine on and wait.

The usage is:

nohup python your_script.py > file_to_save_the_output 2>&1 &

For example:

nohup python main.py > nohup.log 2>&1 &

Breaking the command down:

nohup: run the command immune to hangups.

The first >: redirect standard output to nohup.log.

2>&1: redirect standard error (2) to standard output (1); since stdout has already been redirected, both the error messages and the normal output end up in nohup.log.

&: run in the background.

After launching the command you can close xshell and come back later to look at nohup.log.
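To watch the log live without reopening the file each time, tail -f works well:

tail -f nohup.log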

If the beginning of the log shows the normal training progress output, everything is running fine; if you see an error traceback instead, the job has already crashed and stopped.

PS. PyTorch makes me happy!!!!
