A Detailed Derivation of Logistic Regression (with NumPy and PyTorch Implementations)


Overview

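The logistic regression (LR) model is a binary classifier. It assumes the labels follow a Bernoulli distribution (also called the 0-1 distribution) and applies the Sigmoid function $\sigma(x)$ to squash the linear output $X\theta$ into the interval $(0, 1)$, so that the result can be read as the probability that a sample belongs to a given class. The parameters $\theta$ are then fit by maximum likelihood: the negative log-likelihood serves as the objective and is minimized with gradient descent. At its core, LR is still a linear model.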

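The rest of this post derives LR in two ways:

An element-by-element (algebraic) derivation
A matrix-calculus derivation

The former is rather tedious, while the latter is much more concise. After the derivation, the LR model is implemented in both NumPy and PyTorch.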

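Introducing the LR Model

Notation

Before introducing the LR model, here is the notation used throughout this post:

$x_i\in \mathbb{R}^{n\times 1}$ is the $i$-th sample
$y_i\in \{0, 1\}$ is the label of the $i$-th sample
$X\in \mathbb{R}^{m\times n}$ is the matrix formed by the $m$ samples
$Y\in \mathbb{R}^{m\times 1}$ is the vector of labels corresponding to $X$
$\theta\in\mathbb{R}^{n\times 1}$ is the weight vector of the LR model
$E\in\mathbb{R}^{m\times 1}$ is the all-ones vector, i.e. $[1, 1, \ldots, 1]^T$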

The Sigmoid Function

The Sigmoid function is defined as:

$\sigma(x) = \frac{1}{1 + \exp{(-x)}}$

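Its graph is an S-shaped curve. As $x\rightarrow +\infty$, $\sigma(x)\rightarrow 1$; as $x\rightarrow -\infty$, $\sigma(x)\rightarrow 0$. Because $\sigma(x)$ takes values in the interval $(0, 1)$, it can be used to represent a probability.

Another useful property of the Sigmoid function is that its derivative can be written in terms of its own output: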

$\sigma^\prime(x) = \sigma(x)\left(1 - \sigma(x)\right)$

The derivation is as follows:

$\begin{aligned} \sigma^\prime(x) &= \left(\frac{1}{1 + \exp{(-x)}}\right)^\prime \\ &= -\frac{1}{\left(1 + \exp{(-x)}\right)^2}\cdot\exp{(-x)}\cdot(-x)^\prime \\ &= \frac{1}{1 + \exp{(-x)}}\cdot\frac{\exp{(-x)}}{1 + \exp{(-x)}} \\ &= \sigma(x)\left(1 - \sigma(x)\right) \end{aligned}$
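As a quick sanity check of this identity, the sketch below compares the closed form $\sigma(x)(1 - \sigma(x))$ against a central finite difference at a few points. The helper names, the test points, and the step size `eps` are choices made for this illustration only.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Closed-form derivative sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, eps=1e-6):
    """Numerical derivative (f(x + eps) - f(x - eps)) / (2 * eps)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# Arbitrary test points chosen for this illustration.
points = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_gap = max(abs(sigmoid_grad(x) - central_difference(sigmoid, x)) for x in points)
print("largest gap between closed form and finite difference:", max_gap)
```

The gap should be tiny compared with the derivative itself, whose largest value is $0.25$ at $x = 0$.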

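The LR Model

For a sample $x\in\mathbb{R}^{n\times 1}$ with class label $y$, let $\theta\in\mathbb{R}^{n\times 1}$ be the parameters of the linear model. Passing the linear output $x^T\theta$ through the Sigmoid function gives the general form of binary logistic regression: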

$h_\theta(x) = \frac{1}{1 + \exp{(-x^T\theta)}}$

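$h_\theta(x)$ can be read as the probability that the sample belongs to class 1. If $h_\theta(x) > 0.5$, we predict $y=1$ for sample $x$; if $h_\theta(x) < 0.5$, we predict $y=0$. If $h_\theta(x)$ is exactly $0.5$, which happens when $x^T\theta = 0$, the model cannot decide between the two classes; in practice, this case is simply merged into one of the two cases above.

Written in matrix form, binary logistic regression is: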

$h_\theta(X) = \frac{1}{1 + \exp{(-X\theta)}}$

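where $X = [x_1, x_2, \ldots, x_m]^T\in\mathbb{R}^{m\times n}$ is the input feature matrix, $\theta\in\mathbb{R}^{n\times 1}$ is the parameter vector, the exponential and the division are applied element-wise, and the output is $h_\theta(X)\in\mathbb{R}^{m\times 1}$.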
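The matrix form maps directly onto array code. The following is a minimal sketch, with made-up toy values for `X` and `theta`, that computes $h_\theta(X)$ for a batch and applies the 0.5 threshold; sending ties to class 1 is an arbitrary choice made here.

```python
import numpy as np

def h(X, theta):
    """Row-wise sigmoid(x_i^T theta); X is (m, n), theta is (n, 1)."""
    return 1.0 / (1.0 + np.exp(-(X @ theta)))

# Made-up toy data: m = 3 samples, n = 2 features.
X = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [-1.0, -3.0]])
theta = np.array([[0.5], [0.25]])

probs = h(X, theta)                   # shape (3, 1), one probability per sample
labels = (probs >= 0.5).astype(int)   # ties at exactly 0.5 go to class 1 here
print(probs.ravel(), labels.ravel())
```

The second sample has $x^T\theta = 0$, so its probability is exactly $0.5$, which is the tie case discussed above.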

The Optimization Objective of the LR Model

Likelihood and Loss Functions

For an input sample $(x, y)$, the LR model gives the probability of each class as:

$\begin{aligned} P(y = 1 | x, \theta) &= h_{\theta}(x) \\ P(y = 0 | x, \theta) &= 1 - h_{\theta}(x) \end{aligned}$

The two cases can be combined into a single expression:

$P(y | x, \theta) = h_{\theta}(x)^y\left(1 - h_{\theta}(x) \right)^{(1 - y)}$

For a dataset $\mathcal{T}=\{(x_i, y_i)\}_{i=1}^{m}$ of independent samples, the likelihood function is:

$J(\theta) = \prod_{i=1}^{m}h_{\theta}(x_i)^{y_i}\left(1 - h_{\theta}(x_i) \right)^{(1 - y_i)}$
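To make the combined expression and the product concrete, here is a small sketch using hypothetical predicted probabilities and labels (the numbers are invented for illustration). It also shows that the logarithm of the product equals the sum of the per-sample log terms.

```python
import math

def bernoulli_prob(y, h):
    """P(y | x, theta) = h^y * (1 - h)^(1 - y) for y in {0, 1}."""
    return (h ** y) * ((1.0 - h) ** (1 - y))

# Hypothetical predicted probabilities h_theta(x_i) and labels y_i.
h_values = [0.9, 0.2, 0.7]
y_values = [1, 0, 1]

# Likelihood: product of the per-sample probabilities.
likelihood = math.prod(bernoulli_prob(y, h) for y, h in zip(y_values, h_values))

# Log-likelihood: the product turns into a sum of logs.
log_likelihood = sum(
    y * math.log(h) + (1 - y) * math.log(1.0 - h)
    for y, h in zip(y_values, h_values)
)
print(likelihood, log_likelihood)
```

For $y = 1$ the expression reduces to $h$, and for $y = 0$ it reduces to $1 - h$, matching the two separate cases.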

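Maximum likelihood estimation seeks the maximum of this function. Negating it turns the problem into a minimization, which matches the usual machine learning convention of minimizing a loss function with gradient descent, since the negative gradient is the direction in which the function decreases fastest. Leaving the sign alone also works, but the parameter update then has to be changed to gradient ascent. We follow the convention here.

The negative log-likelihood, in algebraic form, is: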

$\begin{aligned} L(\theta) &=-\sum_{i=1}^{m}\left[y_{i} \log h_{\theta}\left(x_{i}\right)+\left(1-y_{i}\right) \log \left(1-h_{\theta}\left(x_{i}\right)\right)\right] \\ &=-\sum_{i=1}^{m}\left[y_{i} \log \frac{h_{\theta}\left(x_{i}\right)}{1-h_{\theta}\left(x_{i}\right)}+\log \left(1-h_{\theta}\left(x_{i}\right)\right)\right] \\ &= -\sum_{i=1}^{m}\left[y_{i} \log \frac{1}{\exp{(-x_i^T\theta)}}+\log \left(1-h_{\theta}\left(x_{i}\right)\right)\right] \\ &= -\sum_{i=1}^{m}\left[y_{i} \log \frac{1}{\exp{(-x_i^T\theta)}}+\log \left(1-\frac{1}{1 + \exp{(-x_i^T\theta)}}\right)\right] \\ &= -\sum_{i=1}^{m}\left[y_{i} \log \frac{1}{\exp{(-x_i^T\theta)}}+\log \left(\frac{\exp{(-x_i^T\theta)}}{1 + \exp{(-x_i^T\theta)}}\right)\right] \\ &= -\sum_{i=1}^{m}\left[y_{i} \log \frac{1}{\exp{(-x_i^T\theta)}}+\log \left(\frac{1}{1 + \exp{(x_i^T\theta)}}\right)\right] \\ &=-\sum_{i=1}^{m}\left[y_{i}\left(x_i^T\theta\right)-\log \left(1+\exp \left(x_i^T\theta\right)\right)\right] \end{aligned}$

Written in matrix form, $L(\theta)$ becomes:

$L(\theta) = -\left(Y^TX\theta - E^T\log\left(E + \exp{(X\theta)}\right)\right)$

where $E\in\mathbb{R}^{m\times 1}$ is the all-ones vector, i.e. $[1, 1, \ldots, 1]^T$, and $\log$ and $\exp$ are applied element-wise.
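The equivalence of the different forms of $L(\theta)$ can be checked numerically. The sketch below uses randomly generated data (an assumption of this example, not data from the post) and evaluates the first line of the derivation, its last line, and the matrix form. It also evaluates a variant based on `np.logaddexp`, which computes $\log(1 + \exp(z))$ without overflowing for large $z$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))                          # random features
Y = rng.integers(0, 2, size=(m, 1)).astype(float)    # random 0/1 labels
theta = rng.normal(size=(n, 1))
E = np.ones((m, 1))

z = X @ theta
h = 1.0 / (1.0 + np.exp(-z))

# First line of the derivation: the cross-entropy sum.
loss_cross_entropy = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h))
# Last line of the derivation.
loss_simplified = -np.sum(Y * z - np.log(1 + np.exp(z)))
# Matrix form: -(Y^T X theta - E^T log(E + exp(X theta))).
loss_matrix = -(Y.T @ X @ theta - E.T @ np.log(E + np.exp(z))).item()
# Same quantity, with log(1 + exp(z)) computed as logaddexp(0, z).
loss_stable = -np.sum(Y * z - np.logaddexp(0.0, z))

print(loss_cross_entropy, loss_simplified, loss_matrix, loss_stable)
```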

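Parameter Update: Gradient via the Algebraic Method

This subsection derives the gradient of binary logistic regression element by element, which is relatively tedious. The matrix approach is more compact in form, but it has a higher barrier to entry.

The loss function obtained above is: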

$\begin{aligned} L(\theta) &=-\sum_{i=1}^{m}\left[y_{i}\left(x_i^T\theta\right)-\log \left(1+\exp \left(x_i^T\theta\right)\right)\right] \end{aligned}$

Writing $x_{ij}$ for the $j$-th component of $x_i$, the partial derivative $\frac{\partial L(\theta)}{\partial \theta_j}$ is:

$\begin{aligned} \frac{\partial L(\theta)}{\partial \theta_j} &=-\frac{\partial}{\partial \theta_j}\sum_{i=1}^{m}\left[y_{i}\left(x_i^T\theta\right)-\log \left(1+\exp \left(x_i^T\theta\right)\right)\right] \\ &= -\sum_{i=1}^{m}\left[y_{i}x_{ij} - \frac{1}{1+\exp \left(x_i^T\theta\right)}\cdot \exp \left(x_i^T\theta\right)\cdot x_{ij} \right] \\ &= -\sum_{i=1}^{m}\left[y_{i} - \frac{1}{1+\exp \left(-x_i^T\theta\right)}\right] x_{ij} \\ &= -\sum_{i=1}^{m}\left[y_{i} - h_{\theta}(x_i)\right] x_{ij} \\ &= \sum_{i=1}^{m}\left[h_{\theta}(x_i) - y_{i}\right] x_{ij} \end{aligned}$

Stacking the components gives $\frac{\partial L(\theta)}{\partial \theta} = \left[\frac{\partial L(\theta)}{\partial \theta_1}, \frac{\partial L(\theta)}{\partial \theta_2}, \ldots, \frac{\partial L(\theta)}{\partial \theta_n}\right]^T = X^T\left(h_\theta{(X)} - Y\right)$

where $\frac{\partial L(\theta)}{\partial \theta} \in\mathbb{R}^{n\times 1}$, $X^T\in\mathbb{R}^{n\times m}$, and $\left(h_\theta{(X)} - Y\right)\in\mathbb{R}^{m\times 1}$.
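A finite-difference check is a simple way to confirm the result $X^T(h_\theta(X) - Y)$. The sketch below, again on random data invented for the example, perturbs one component $\theta_j$ at a time and compares the numerical slope of $L(\theta)$ with the analytic gradient.

```python
import numpy as np

def loss(theta, X, Y):
    """L(theta) = -sum_i [ y_i * z_i - log(1 + exp(z_i)) ] with z = X theta."""
    z = X @ theta
    return float(-np.sum(Y * z - np.logaddexp(0.0, z)))

def grad(theta, X, Y):
    """Analytic gradient X^T (h_theta(X) - Y)."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (h - Y)

rng = np.random.default_rng(1)
m, n = 20, 4
X = rng.normal(size=(m, n))
Y = rng.integers(0, 2, size=(m, 1)).astype(float)
theta = rng.normal(size=(n, 1))

eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(n):
    step = np.zeros_like(theta)
    step[j, 0] = eps   # perturb only theta_j
    numeric[j, 0] = (loss(theta + step, X, Y) - loss(theta - step, X, Y)) / (2 * eps)

grad_gap = float(np.max(np.abs(numeric - grad(theta, X, Y))))
print("largest gap between numerical and analytic gradient:", grad_gap)
```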

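Parameter Update: Gradient via the Matrix Method

Here the gradient is derived with matrix calculus, starting from the matrix form of the loss function obtained above: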

$L(\theta) = -\left(Y^TX\theta - E^T\log\left(E + \exp{(X\theta)}\right)\right)$

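where $E\in\mathbb{R}^{m\times 1}$ is the all-ones vector.

The derivation relies on the relationship between matrix derivatives, differentials, and the trace; for details, see 矩陣求導術（上） in the references.

The rules needed here are:

For a scalar function $f$ of a matrix $X$: $df = \sum\limits_{i=1}^{m}\sum\limits_{j=1}^{n}\frac{\partial f}{\partial X_{ij}}dX_{ij} = \text{tr}\left({\frac{df}{dX}}^TdX\right)$
If $x$ is a vector, then $df = {\frac{df}{dx}}^Tdx$
Element-wise functions: $\text{d}\sigma(X) = \sigma^\prime(X)\odot\text{d}X$, where $\sigma(X) = [\sigma(X_{ij})]$ applies a scalar function element by element and $\sigma^\prime(X) = [\sigma^\prime(X_{ij})]$ is its element-wise derivative.
Exchanging matrix multiplication and element-wise multiplication: $\text{tr}(A^T(B\odot C)) = \text{tr}((A\odot B)^TC)$, where $A, B, C$ have the same shape and both sides equal $\sum\limits_{ij}A_{ij}B_{ij}C_{ij}$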
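The last rule in the list is easy to verify numerically. The sketch below uses three random matrices of the same (arbitrarily chosen) shape and checks that both trace expressions agree with the plain element-wise sum.

```python
import numpy as np

rng = np.random.default_rng(2)
# Three random matrices of identical shape; the shape (4, 3) is arbitrary.
A, B, C = (rng.normal(size=(4, 3)) for _ in range(3))

lhs = np.trace(A.T @ (B * C))        # tr(A^T (B ⊙ C))
rhs = np.trace((A * B).T @ C)        # tr((A ⊙ B)^T C)
elementwise_sum = np.sum(A * B * C)  # sum_ij A_ij B_ij C_ij

print(lhs, rhs, elementwise_sum)
```

When the matrices are column vectors the trace is just a scalar product, which is the form in which this rule is applied when moving $\odot$ factors across the transpose.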

Since $Y$, $X$, and $E$ are constants, their differentials vanish, and the derivation proceeds as follows:

$\begin{aligned} \text{d}L(\theta) &= -\text{d}\left(Y^TX\theta - E^T\log\left(E + \exp{(X\theta)}\right)\right) \\ &= -\left(\text{d}Y^TX\theta + Y^T\text{d}X\theta + Y^TX\text{d}\theta - \text{d}E^T\log\left(E + \exp{(X\theta)}\right) - E^T\text{d}\log\left(E + \exp{(X\theta)}\right)\right) \\ &= -\left(Y^TX\text{d}\theta - E^T\text{d}\log\left(E + \exp{(X\theta)}\right)\right) \\ &= -\left(Y^TX\text{d}\theta - E^T\left(\frac{1}{E + \exp{(X\theta)}}\odot\text{d}\left(E + \exp{(X\theta)}\right)\right)\right) \\ &= -\left(Y^TX\text{d}\theta - \left(E\odot\frac{1}{E + \exp{(X\theta)}}\right)^T\text{d}\left(E + \exp{(X\theta)}\right)\right) \\ &= -\left(Y^TX\text{d}\theta - \left(E\odot\frac{1}{E + \exp{(X\theta)}}\right)^T\left(\exp{(X\theta)}\odot\text{d}\left(X\theta\right)\right)\right) \\ &= -\left(Y^TX\text{d}\theta - \left(E\odot\frac{1}{E + \exp{(X\theta)}}\odot\exp{(X\theta)}\right)^T\text{d}\left(X\theta\right)\right) \\ &= -\left(Y^TX\text{d}\theta - h_{\theta}(X)^T\text{d}\left(X\theta\right)\right) \\ &= -\left(Y^TX\text{d}\theta - h_{\theta}(X)^T\left(\text{d}X\theta + X\text{d}\theta\right)\right) \\ &= -\left(Y^TX\text{d}\theta - h_{\theta}(X)^TX\text{d}\theta\right) \\ &= -\left([Y^T- h_{\theta}(X)^T]X\text{d}\theta\right) \\ &= [h_{\theta}(X) - Y]^TX\text{d}\theta \end{aligned}$

By rule 2 above (if $x$ is a vector, then $df = {\frac{df}{dx}}^Tdx$), we can read off the gradient:

$\frac{\partial L(\theta)}{\partial \theta} = X^T(h_{\theta}(X) - Y)$

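(An aside: typing out all these formulas was exhausting… Also, when I said earlier that the matrix method is more concise, I had not typed them yet. Now that I have, it doesn't feel so concise after all … 🤣🤣🤣)

Parameter Update

With gradient descent and learning rate $\alpha$, the update rule for $\theta$ is: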

$\begin{aligned} \theta &= \theta - \alpha\frac{\partial L(\theta)}{\partial \theta} \\ &= \theta - \alpha X^T(h_{\theta}(X) - Y) \end{aligned}$
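The update rule can be tried out in a few lines. The following is only a sketch under assumptions made for the illustration: the data is synthetic and linearly separable, the generating weights `true_theta` are hypothetical, a column of ones is appended to `X` so that the last entry of `theta` acts as a bias, and the learning rate and iteration count are hand-picked.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, Y):
    """Negative log-likelihood L(theta), summed over the samples."""
    z = X @ theta
    return float(-np.sum(Y * z - np.logaddexp(0.0, z)))

rng = np.random.default_rng(3)
m = 200
features = rng.normal(size=(m, 2))
X = np.hstack([features, np.ones((m, 1))])      # last column of ones acts as a bias
true_theta = np.array([[2.0], [-1.0], [0.5]])   # hypothetical generating weights
Y = (X @ true_theta > 0).astype(float)          # linearly separable 0/1 labels

theta = np.zeros((3, 1))
alpha = 0.01                                     # hand-picked learning rate
losses = [loss(theta, X, Y)]
for _ in range(500):
    # theta <- theta - alpha * X^T (h_theta(X) - Y)
    theta = theta - alpha * X.T @ (sigmoid(X @ theta) - Y)
    losses.append(loss(theta, X, Y))

accuracy = float(np.mean((sigmoid(X @ theta) >= 0.5) == (Y == 1)))
print("loss: {:.3f} -> {:.3f}, training accuracy: {:.3f}".format(losses[0], losses[-1], accuracy))
```

Because the gradient here is a sum over all $m$ samples, its scale grows with $m$; averaging the gradient instead makes the choice of $\alpha$ less dependent on the dataset size.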

NumPy Implementation of LR

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from collections import Counter

def sigmoid(x):
    x = np.array(x)
    return 1. / (1. + np.exp(-x))

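## Objective: the log-likelihood, to be maximized (the negative of the loss L(theta) derived above)
## Note that it is averaged over the samples rather than summed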
def L(w, b, X, y):
    dot = np.dot(X, w) + b
    return np.mean(y * dot - np.log(1 + np.exp(dot)), axis=0)

## Gradients of the objective with respect to w and b
def dL(w, b, X, y):
    dot = np.dot(X, w) + b
    distance = y - sigmoid(dot)
    distance = distance.reshape(-1, 1)
    return np.mean(distance * X, axis=0), np.mean(distance, axis=0)

## Gradient ascent on the log-likelihood; despite the name sgd, every step uses all samples (full batch)
def sgd(w, b, X, y, epoch, lr):
    for i in range(epoch):
        dw, db = dL(w, b, X, y)
        w += lr * dw
        b += lr * db
    return w, b

## Prediction: a sample is labeled True (class 1) when its predicted probability is at least 0.5
def predict(w, b, X_test):
    return sigmoid(np.dot(X_test, w) + b) >= 0.5

## Plot the decision surface
def plot_surface(X, y, w, b):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
    X_test = np.c_[xx.ravel(), yy.ravel()]
    Z = predict(w, b, X_test)
    Z = Z.reshape(xx.shape)

    fig, ax = plt.subplots()
    counter = Counter(y)
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)
    ax.set_xlabel(feature_names[0])
    ax.set_ylabel(feature_names[1])
    ax.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

    ## Draw the decision boundary line
    #     i = np.linspace(x_min, x_max, 100)
    #     o = (w[0] * i + b) / -w[1]
    #     ax.plot(i, o)
    
    for label in counter.keys():
        ax.scatter(X[y==label, 0], X[y==label, 1])
    plt.show()

## Training script
iris = load_iris()
X = iris.data[:100, :2]
y = iris.target[:100] # y \in {0, 1}
feature_names = iris.feature_names[:2]  # names of the two features actually used (columns 0 and 1)
np.random.seed(123)
n = X.shape[1]
w = np.random.randn(n)
b = np.random.randn(1)
print('initial: w: {}, b: {}, L: {}'.format(w, b, L(w, b, X, y)))
w, b = sgd(w, b, X, y, 10000, 0.001)
print('final: w: {}, b: {}, L: {}'.format(w, b, L(w, b, X, y)))

plot_surface(X, y, w, b)

Result: the plot shows the two predicted-class regions with the training points overlaid.
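One caveat about the `sigmoid` helper in the listing above: `np.exp(-x)` overflows for large negative `x`, which NumPy reports as a runtime warning. A common workaround, sketched here as an optional variant rather than as part of the original implementation, is to pick the algebraically equivalent form $\exp(x) / (1 + \exp(x))$ for negative inputs. The name `stable_sigmoid` is hypothetical.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that never exponentiates a large positive number.

    For x >= 0 it uses 1 / (1 + exp(-x)); for x < 0 it uses the
    equivalent form exp(x) / (1 + exp(x)).
    """
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    exp_x = np.exp(x[~pos])          # x < 0 here, so exp(x) is at most 1
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

values = stable_sigmoid(np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0]))
print(values)
```

For moderate inputs the two versions agree, so it could replace `sigmoid` in the listing without changing the fitted model.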

PyTorch Implementation of LR

import numpy as np  # needed by plot_surface, sigmoid and predict below
import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

iris = load_iris()
X = iris.data[:100, :2]
y = iris.target[:100] # y \in {0, 1}
X = torch.tensor(X).float()
y = torch.tensor(y).float()
feature_names = iris.feature_names[:2]  # names of the two features actually used (columns 0 and 1)

class LR(nn.Module):
    def __init__(self, in_features):
        super(LR, self).__init__()
        self.linear = nn.Linear(in_features, 1, bias=True)
    
    def sigmoid(self, x):
        return 1. / (1 + torch.exp(-x))
    
    def predict(self, x):
        return (self(x) > 0.5).int()
    
    def forward(self, x):
        x = self.linear(x)
        x = self.sigmoid(x)
        return x

model = LR(X.size(1))
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
epoch = 100
batch_size = 10
N = X.size(0)

for i in range(epoch):
    order = torch.randperm(N)
    X = X[order]
    y = y[order]
    for n in range(N // batch_size):
        input = X[n * batch_size : (n + 1) * batch_size]
        label = y[n * batch_size : (n + 1) * batch_size]
        optimizer.zero_grad()
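        output = model(input).squeeze(1)  # (batch, 1) -> (batch,), nn.BCELoss needs the same shape as label
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()
        if n % 100 == 0:  # n only runs from 0 to 9, so this prints once per epoch
            print(loss.item())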


def plot_surface(X, y, w, b):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
    X_test = np.c_[xx.ravel(), yy.ravel()]
    Z = predict(w, b, X_test)
    Z = Z.reshape(xx.shape)

    fig, ax = plt.subplots()
    counter = Counter(y)
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)
    ax.set_xlabel(feature_names[0])
    ax.set_ylabel(feature_names[1])
    ax.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

    ## Draw the decision boundary line
#     i = np.linspace(x_min, x_max, 100)
#     o = (w[0] * i + b) / -w[1]
#     ax.plot(i, o)
    
    for label in counter.keys():
        ax.scatter(X[y==label, 0], X[y==label, 1])
    plt.show()

def sigmoid(x):
    x = np.array(x)
    return 1. / (1. + np.exp(-x))
        
def predict(w, b, X_test):
    return sigmoid(np.dot(X_test, w) + b) >= 0.5

## X and y are already float tensors, so they are converted straight to NumPy arrays for plotting
x = X.numpy()
t = y.numpy()
w = model.linear.weight.data.numpy().transpose()  # shape (2, 1)
b = model.linear.bias.data.numpy()[0]

plot_surface(x, t, w, b)

Result: the plot shows the two predicted-class regions with the training points overlaid.

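References

邏輯迴歸原理小結 (a summary of the principles of logistic regression): from the blog of 劉建平Pinard, an expert whose posts are very instructive.
矩陣求導術（上） (matrix calculus techniques, part 1): this feels like a secret manual of inner strength for machine learning. It comes in two parts, and I studied the first one today. How to describe the feeling? Total enlightenment? No, I'm not quite ready to become a monk 😂😂😂. In any case, it felt as if a light bulb 💡 had suddenly switched on in my head.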
