Softmax & Classification Models

Softmax

Contrast with candidate sampling.

The softmax function is an activation function that turns a vector of numbers (logits) into probabilities that sum to one. Its output is a vector representing the probability distribution over a list of potential outcomes.

It is a function that yields a probability for every possible class in a multi-class classification model. These probabilities sum to exactly 1.0.

Example: softmax might determine that the probabilities of an image being a dog, a cat, or a horse are 0.9, 0.08, and 0.02, respectively. (This is also called full softmax.)

Basic concepts of softmax

  • Classification problem
    Consider a simple image classification problem: the input image is 2 pixels high and 2 pixels wide, in grayscale.
    The four pixels of the image are denoted $x_1, x_2, x_3, x_4$.
    Suppose the true label is dog, cat, or chicken, and that these labels correspond to the discrete values $y_1, y_2, y_3$.
    We usually use discrete numerical values to represent categories, e.g. $y_1=1, y_2=2, y_3=3$.

  • Weight vectors

\begin{aligned} o_1 &= x_1 w_{11} + x_2 w_{21} + x_3 w_{31} + x_4 w_{41} + b_1, \\ o_2 &= x_1 w_{12} + x_2 w_{22} + x_3 w_{32} + x_4 w_{42} + b_2, \\ o_3 &= x_1 w_{13} + x_2 w_{23} + x_3 w_{33} + x_4 w_{43} + b_3. \end{aligned}

  • Neural network diagram
    The figure below depicts the computation above as a neural network. Like linear regression, softmax regression is a single-layer neural network. Since each output $o_1, o_2, o_3$ depends on all of the inputs $x_1, x_2, x_3, x_4$, the output layer of softmax regression is also a fully connected layer.

Figure: softmax regression is a single-layer neural network.

Since a classification problem requires a discrete prediction as output, a simple approach is to treat the output value $o_i$ as the confidence that the predicted class is $i$, and to take the class whose output value is largest as the prediction, i.e. output $\underset{i}{\arg\max} \, o_i$. For example, if $o_1, o_2, o_3$ are $0.1, 10, 0.1$ respectively, then since $o_2$ is largest, the predicted class is 2, which represents cat.

  • Problems with raw outputs
    Using the output layer's values directly has two problems:
    1. On the one hand, since the range of the output values is unbounded, it is hard to judge intuitively what the values mean. In the example above, the output value 10 indicates we are "very confident" the image is a cat, because it is 100 times larger than the other two outputs. But if $o_1=o_3=10^3$, the same output value of 10 would instead indicate a very low probability that the image is a cat.
    2. On the other hand, since the true labels are discrete values, it is hard to measure the error between these discrete labels and outputs of unbounded range.

The softmax operator solves both problems. It transforms the output values into a probability distribution whose entries are positive and sum to 1:

\hat{y}_1, \hat{y}_2, \hat{y}_3 = \text{softmax}(o_1, o_2, o_3)

where

\hat{y}_1 = \frac{\exp(o_1)}{\sum_{i=1}^3 \exp(o_i)},\quad \hat{y}_2 = \frac{\exp(o_2)}{\sum_{i=1}^3 \exp(o_i)},\quad \hat{y}_3 = \frac{\exp(o_3)}{\sum_{i=1}^3 \exp(o_i)}.

It is easy to see that $\hat{y}_1 + \hat{y}_2 + \hat{y}_3 = 1$ and $0 \leq \hat{y}_1, \hat{y}_2, \hat{y}_3 \leq 1$, so $\hat{y}_1, \hat{y}_2, \hat{y}_3$ form a valid probability distribution. Now, if $\hat{y}_2 = 0.8$, we know that the probability of the image being a cat is 80%, regardless of the values of $\hat{y}_1$ and $\hat{y}_3$. We also note that

\underset{i}{\arg\max} \, o_i = \underset{i}{\arg\max} \, \hat{y}_i

so the softmax operation does not change the predicted class.
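
A quick sketch in PyTorch illustrating these two properties (positive entries summing to 1, and an unchanged argmax); the logits below are made-up numbers:

import torch

o = torch.tensor([0.1, 10.0, 0.1])   # made-up logits o_1, o_2, o_3
y_hat = torch.softmax(o, dim=0)      # positive values that sum to 1
print(y_hat, y_hat.sum())            # probabilities, sum = 1.0
print(o.argmax(), y_hat.argmax())    # same index: softmax preserves the argmax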

  • Computational efficiency
    • Vectorized expression for a single example
      To improve computational efficiency, we can express the classification of a single example with vector operations. In the image classification problem above, suppose the weight and bias parameters of softmax regression are

\boldsymbol{W} = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix},\quad \boldsymbol{b} = \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix},

and that the features of image example $i$, whose height and width are both 2 pixels, are

\boldsymbol{x}^{(i)} = \begin{bmatrix} x_1^{(i)} & x_2^{(i)} & x_3^{(i)} & x_4^{(i)} \end{bmatrix},

The output of the output layer is

\boldsymbol{o}^{(i)} = \begin{bmatrix} o_1^{(i)} & o_2^{(i)} & o_3^{(i)} \end{bmatrix},

and the predicted probability distribution over dog, cat, and chicken is

\boldsymbol{\hat{y}}^{(i)} = \begin{bmatrix} \hat{y}_1^{(i)} & \hat{y}_2^{(i)} & \hat{y}_3^{(i)} \end{bmatrix}.

The vectorized expression for softmax regression's classification of example $i$ is

\begin{aligned} \boldsymbol{o}^{(i)} &= \boldsymbol{x}^{(i)} \boldsymbol{W} + \boldsymbol{b},\\ \boldsymbol{\hat{y}}^{(i)} &= \text{softmax}(\boldsymbol{o}^{(i)}). \end{aligned}

  • Vectorized expression for a mini-batch
    To further improve computational efficiency, we usually perform vector computations on mini-batches of data. In general, given a mini-batch of examples with batch size $n$, input (feature) dimension $d$, and number of outputs (classes) $q$, let the batch features be $\boldsymbol{X} \in \mathbb{R}^{n \times d}$. Suppose the weight and bias parameters of softmax regression are $\boldsymbol{W} \in \mathbb{R}^{d \times q}$ and $\boldsymbol{b} \in \mathbb{R}^{1 \times q}$. The vectorized expression for softmax regression is

\begin{aligned} \boldsymbol{O} &= \boldsymbol{X} \boldsymbol{W} + \boldsymbol{b},\\ \boldsymbol{\hat{Y}} &= \text{softmax}(\boldsymbol{O}), \end{aligned}

where the addition uses broadcasting, $\boldsymbol{O}, \boldsymbol{\hat{Y}} \in \mathbb{R}^{n \times q}$, and row $i$ of these two matrices is, respectively, the output $\boldsymbol{o}^{(i)}$ and the probability distribution $\boldsymbol{\hat{y}}^{(i)}$ of example $i$.
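
A minimal shape check of this mini-batch expression, using made-up sizes $n=2$, $d=4$, $q=3$:

import torch

n, d, q = 2, 4, 3                    # made-up batch size, feature count, class count
X = torch.randn(n, d)                # batch features, shape (n, d)
W = torch.randn(d, q)                # weights, shape (d, q)
b = torch.zeros(1, q)                # bias, shape (1, q), broadcast over the n rows
O = X.mm(W) + b                      # outputs, shape (n, q)
Y_hat = torch.softmax(O, dim=1)      # row-wise softmax, each row sums to 1
print(O.shape, Y_hat.sum(dim=1))     # torch.Size([2, 3]), tensor([1., 1.])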


Comparing two implementations

NumPy version: np.exp(x) / np.sum(np.exp(x), axis=0)
PyTorch version: torch.exp(x) / torch.sum(torch.exp(x), dim=1).view(-1, 1)

Note that the two snippets normalize along different axes: the NumPy one sums over axis 0 (fine for a single 1-D vector of logits, but it would normalize columns of a 2-D array), while the PyTorch one sums over dim 1 and reshapes, so each row of a (batch, classes) matrix is normalized separately.
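
A small sketch with a made-up 2×3 batch of logits showing both snippets producing the same row-wise probability distributions; the NumPy call here uses axis=1 with keepdims so that it matches the row-wise PyTorch form:

import numpy as np
import torch

x_np = np.array([[1.0, 2.0, 3.0],
                 [1.0, 1.0, 1.0]])                                   # made-up logits, shape (batch, classes)
p_np = np.exp(x_np) / np.sum(np.exp(x_np), axis=1, keepdims=True)    # row-wise softmax in NumPy

x_t = torch.tensor(x_np)
p_t = torch.exp(x_t) / torch.sum(torch.exp(x_t), dim=1).view(-1, 1)  # row-wise softmax in PyTorch

print(np.allclose(p_np, p_t.numpy()))                                # True: both give the same probabilities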

Introducing Fashion-MNIST

To make the introduction of softmax more concrete, and to observe the differences between algorithms more intuitively,
we introduce a somewhat more complex multi-class image classification dataset.

Import: the torchvision package (used for building computer vision models)

# import needed package
%matplotlib inline
from IPython import display
import matplotlib.pyplot as plt

import torch
import torchvision
import torchvision.transforms as transforms
import time

import sys
sys.path.append("path to file storge d2lzh1981")
import d2lzh1981 as d2l

# print(torch.__version__)
# print(torchvision.__version__)
# get dataset
mnist_train = torchvision.datasets.FashionMNIST(root='path to file storge FashionMNIST.zip', train=True, download=True, transform=transforms.ToTensor())
mnist_test = torchvision.datasets.FashionMNIST(root='path to file storge FashionMNIST.zip', train=False, download=True, transform=transforms.ToTensor())

def get_fashion_mnist_labels(labels):
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]
def show_fashion_mnist(images, labels):
    d2l.use_svg_display()
    # here the underscore denotes a variable we ignore (do not use)
    _, figs = plt.subplots(1, len(images), figsize=(12, 12))
    for f, img, lbl in zip(figs, images, labels):
        f.imshow(img.view((28, 28)).numpy())
        f.set_title(lbl)
        f.axes.get_xaxis().set_visible(False)
        f.axes.get_yaxis().set_visible(False)
    plt.show()
X, y = [], []
for i in range(10):
    X.append(mnist_train[i][0]) # append the i-th image (feature) to X
    y.append(mnist_train[i][1]) # append the i-th label to y
show_fashion_mnist(X, get_fashion_mnist_labels(y))
# read data
batch_size = 256
num_workers = 4
train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers)
test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=num_workers)

start = time.time()
for X, y in train_iter:
    continue
print('%.2f sec' % (time.time() - start))

Implementing softmax from scratch

Import packages and modules

import torch
import torchvision
import numpy as np
import sys
sys.path.append("path to file storge d2lzh1981")
import d2lzh1981 as d2l

print(torch.__version__)
print(torchvision.__version__)

Load the data

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

Initialize model parameters

# init module param 
num_inputs = 784
print(28*28)
num_outputs = 10

W = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_outputs)), dtype=torch.float)
b = torch.zeros(num_outputs, dtype=torch.float)
W.requires_grad_(requires_grad=True)
b.requires_grad_(requires_grad=True)

Defining softmax

# define softmax function
def softmax(X):
    X_exp = X.exp()
    partition = X_exp.sum(dim=1, keepdim=True)
    # print("X size is ", X_exp.size())
    # print("partition size is ", partition, partition.size())
    return X_exp / partition  # broadcasting is applied here

The softmax regression model

# define regression model
def net(X):
    return softmax(torch.mm(X.view((-1, num_inputs)), W) + b)

Loss function

# define loss function
def cross_entropy(y_hat, y):
    return - torch.log(y_hat.gather(1, y.view(-1, 1)))
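
A quick check of how gather picks out the predicted probability of the true class, using made-up predictions for two examples:

y_hat_demo = torch.tensor([[0.1, 0.3, 0.6],
                           [0.3, 0.2, 0.5]])        # made-up predicted distributions
y_demo = torch.LongTensor([0, 2])                   # true class indices
# gather(1, ...) selects y_hat_demo[0, 0] and y_hat_demo[1, 2]
print(y_hat_demo.gather(1, y_demo.view(-1, 1)))     # tensor([[0.1000], [0.5000]])
print(cross_entropy(y_hat_demo, y_demo))            # element-wise -log of those probabilities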

Accuracy

def accuracy(y_hat, y):
    return (y_hat.argmax(dim=1) == y).float().mean().item()
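
The training loop below calls evaluate_accuracy, which is not defined in this excerpt (d2lzh1981 provides an equivalent helper); a minimal sketch consistent with the accuracy function above:

def evaluate_accuracy(data_iter, net):
    # accumulate the number of correct predictions and the number of examples seen
    acc_sum, n = 0.0, 0
    for X, y in data_iter:
        acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
        n += y.shape[0]
    return acc_sum / n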

Train the model

num_epochs, lr = 5, 0.1

def train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size,
              params=None, lr=None, optimizer=None):
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
        for X, y in train_iter:
            y_hat = net(X)
            l = loss(y_hat, y).sum()
            
            # zero the gradients
            if optimizer is not None:
                optimizer.zero_grad()
            elif params is not None and params[0].grad is not None:
                for param in params:
                    param.grad.data.zero_()
            
            l.backward()
            if optimizer is None:
                d2l.sgd(params, lr, batch_size)
            else:
                optimizer.step() 
            
            
            train_l_sum += l.item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc))

train_ch3(net, train_iter, test_iter, cross_entropy, num_epochs, batch_size, [W, b], lr)

Model prediction

X, y = next(iter(test_iter))

true_labels = d2l.get_fashion_mnist_labels(y.numpy())
pred_labels = d2l.get_fashion_mnist_labels(net(X).argmax(dim=1).numpy())
titles = [true + '\n' + pred for true, pred in zip(true_labels, pred_labels)]

d2l.show_fashion_mnist(X[0:9], titles[0:9])

Concise implementation with PyTorch

# import package and module
import torch
from torch import nn
from torch.nn import init
import numpy as np
import sys
sys.path.append("path to file storge d2lzh1981")
import d2lzh1981 as d2l

Initialize parameters and load the data

# init param and get data
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

Define the network

num_inputs = 784
num_outputs = 10

class LinearNet(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(LinearNet, self).__init__()
        self.linear = nn.Linear(num_inputs, num_outputs)
    def forward(self, x): # x has shape (batch, 1, 28, 28)
        y = self.linear(x.view(x.shape[0], -1))
        return y
    
# net = LinearNet(num_inputs, num_outputs)

class FlattenLayer(nn.Module):
    def __init__(self):
        super(FlattenLayer, self).__init__()
    def forward(self, x): # x has shape (batch, *, *, ...)
        return x.view(x.shape[0], -1)

from collections import OrderedDict
net = nn.Sequential(
        # FlattenLayer(),
        # LinearNet(num_inputs, num_outputs) 
        OrderedDict([
           ('flatten', FlattenLayer()),
           ('linear', nn.Linear(num_inputs, num_outputs))]) # alternatively, our own LinearNet(num_inputs, num_outputs) could be used here
        )

Initialize model parameters

# init module param
init.normal_(net.linear.weight, mean=0, std=0.01)
init.constant_(net.linear.bias, val=0)

Loss function

loss = nn.CrossEntropyLoss() # its signature is shown below; it applies log-softmax internally, so the network outputs raw logits and needs no explicit softmax layer
# class torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')
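
A small sketch, on made-up logits, verifying that nn.CrossEntropyLoss applied to raw outputs matches the manual -log softmax cross-entropy from the scratch implementation:

logits = torch.tensor([[0.1, 10.0, 0.1]])               # made-up raw outputs for one example
target = torch.LongTensor([1])                           # true class index
manual = -torch.log(torch.softmax(logits, dim=1)[0, 1])  # -log of the predicted probability of the true class
print(loss(logits, target).item(), manual.item())        # the two values agree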

Optimizer

optimizer = torch.optim.SGD(net.parameters(), lr=0.1) # its signature is shown below
# class torch.optim.SGD(params, lr=, momentum=0, dampening=0, weight_decay=0, nesterov=False)

Training

num_epochs = 5
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, optimizer)

Analysis of the training results


At the start of training, accuracy on the training set is lower than accuracy on the test set.

Reason:

The training accuracy is accumulated while the epoch is running, as the parameters are still being updated.
The test accuracy is computed only after the epoch has finished.
Result: the test accuracy is evaluated with better (more fully updated) model parameters.
