Deriving Logistic Regression by Hand, in Vector Form

The decision function of logistic regression (LR) is

$$h(\boldsymbol x)=\sigma(\boldsymbol \theta^T \boldsymbol x)=\frac{1}{1+e^{-\boldsymbol \theta^T \boldsymbol x}} \tag{1}$$

where $\sigma(z)=\frac{1}{1+e^{-z}}$ is called the sigmoid function.
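
A direct implementation of $\sigma(z)$ can overflow when $e^{-z}$ is evaluated for very negative $z$. Below is a minimal NumPy sketch (the function name `sigmoid` and the branching strategy are my own choices, not from the original text) that evaluates the sigmoid stably by rewriting it for negative inputs:

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: sigma(z) = 1 / (1 + exp(-z)).

    For z >= 0, exp(-z) <= 1, so the direct formula is safe.
    For z < 0, the equivalent form exp(z) / (1 + exp(z)) avoids
    computing exp(-z), which overflows for very negative z.
    """
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out
```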

$h(\boldsymbol x)$ represents the probability that the sample is a positive example. Viewing it as an estimate of the class posterior probability $p(y=1|\boldsymbol x;\boldsymbol \theta)$, we have:

$$p(y=1|\boldsymbol x;\boldsymbol \theta)=h(\boldsymbol x) \tag{2}$$

$$p(y=0|\boldsymbol x;\boldsymbol \theta)=1-h(\boldsymbol x) \tag{3}$$

Merging Eqs. (2) and (3) into a single expression gives (as a check, substituting $y=1$ recovers (2) and $y=0$ recovers (3)):

$$p(y|\boldsymbol x;\boldsymbol \theta)=h(\boldsymbol x)^y\,(1-h(\boldsymbol x))^{1-y} \tag{4}$$

We can obtain the parameter $\boldsymbol \theta$ by maximum likelihood estimation. The likelihood function is

$$L(\boldsymbol \theta)=\prod_{i=1}^m p(y^{(i)}|\boldsymbol x^{(i)};\boldsymbol \theta)=\prod_{i=1}^m h(\boldsymbol x^{(i)})^{y^{(i)}} (1-h(\boldsymbol x^{(i)}))^{1-y^{(i)}} \tag{5}$$

where $m$ is the number of samples in the dataset.

Since taking the logarithm does not affect monotonicity and avoids some numerical problems (a product of many probabilities in $[0,1]$ quickly underflows), taking the log gives

$$\log L(\boldsymbol \theta)= \sum_{i=1}^m \left[\, y^{(i)}\log(h(\boldsymbol x^{(i)})) + (1-y^{(i)})\log(1-h(\boldsymbol x^{(i)})) \,\right] \tag{6}$$

Maximizing Eq. (6) is equivalent to minimizing the following loss function, which turns out to be exactly the cross-entropy loss:

$$J(\boldsymbol \theta)= -\frac1m\sum_{i=1}^m \left[\, y^{(i)}\log(h(\boldsymbol x^{(i)})) + (1-y^{(i)})\log(1-h(\boldsymbol x^{(i)})) \,\right] \tag{7}$$
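
As a sanity check on Eq. (7), here is a vectorized sketch of the loss, reusing the `sigmoid` helper above; the names `X`, `y`, `theta` and the `eps` clipping are assumptions for illustration, with `X` an $m \times n$ design matrix holding one sample per row:

```python
def cross_entropy_loss(theta, X, y, eps=1e-12):
    """Eq. (7): J(theta) = -(1/m) * sum_i [y_i log h_i + (1 - y_i) log(1 - h_i)].

    X     : (m, n) design matrix, one sample x^(i) per row
    y     : (m,) labels in {0, 1}
    theta : (n,) parameter vector
    eps clips h away from 0 and 1 so the logarithms stay finite.
    """
    h = sigmoid(X @ theta)           # h^(i) = sigma(theta^T x^(i)), Eq. (1)
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```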

To simplify the derivation, let $J_i$ denote the $i$-th term of $J(\boldsymbol \theta)$, corresponding to the $i$-th sample, i.e.

$$J(\boldsymbol \theta)= -\frac1m\sum_{i=1}^m J_i(\boldsymbol \theta) \tag{8}$$

$$J_i(\boldsymbol \theta)=y^{(i)}\log(h(\boldsymbol x^{(i)})) + (1-y^{(i)})\log(1-h(\boldsymbol x^{(i)})) \tag{9}$$

We first derive $\frac{\partial J_i}{\partial \boldsymbol \theta}$. Dropping the superscript $(i)$ on $\boldsymbol x^{(i)}$, $y^{(i)}$, and $h^{(i)}$ in the expression for $J_i$, and writing $z=\boldsymbol \theta^T \boldsymbol x$ so that $h=\sigma(z)$, we have:

$$
\begin{aligned}
\frac{\partial J_i(\boldsymbol \theta)}{\partial \boldsymbol \theta} &=y\frac{\partial \log h}{\partial \boldsymbol \theta} + (1-y)\frac{\partial \log (1-h)}{\partial \boldsymbol \theta} \\
&=\frac yh \frac{\partial h}{\partial \boldsymbol \theta} +\frac{1-y}{1-h}\frac{\partial(1-h)}{\partial \boldsymbol \theta} \\
&=\frac{y-h}{h(1-h)} \frac{\partial h}{\partial \boldsymbol \theta} \\
&=\frac{y-h}{h(1-h)} \frac{\partial \sigma(z)}{\partial \boldsymbol \theta}\\
&=\frac{y-h}{h(1-h)} \frac{\partial \sigma(z)}{\partial z} \frac{\partial z}{\partial \boldsymbol \theta}\\
&=\frac{y-h}{h(1-h)}\, h(1-h)\, \frac{\partial \boldsymbol \theta^T \boldsymbol x}{\partial \boldsymbol \theta}\\
&=(y-h)\boldsymbol x
\end{aligned}
$$

Here the step combining the two terms uses $\frac{\partial(1-h)}{\partial \boldsymbol \theta}=-\frac{\partial h}{\partial \boldsymbol \theta}$, and the sigmoid derivative $\sigma'(z)=\sigma(z)(1-\sigma(z))=h(1-h)$ cancels the $h(1-h)$ denominator. Restoring the superscript $(i)$:

$$\frac{\partial J_i}{\partial \boldsymbol \theta}=(y^{(i)}-h^{(i)})\,\boldsymbol x^{(i)} \tag{10}$$

From Eqs. (8) and (10),

$$\frac{\partial J}{\partial \boldsymbol \theta}=-\frac1m\sum_{i=1}^m \frac{\partial J_i}{\partial \boldsymbol \theta}=\frac1m\sum_{i=1}^m (h^{(i)}-y^{(i)})\,\boldsymbol x^{(i)} \tag{11}$$
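
Stacking the samples $\boldsymbol x^{(i)}$ as the rows of a matrix $X$ turns the sum in Eq. (11) into the matrix product $\frac 1m X^T(\boldsymbol h-\boldsymbol y)$. The sketch below (the toy data and helper names are illustrative assumptions) implements this and verifies it against a finite-difference approximation of the loss in Eq. (7):

```python
def gradient(theta, X, y):
    """Eq. (11) in vector form: dJ/dtheta = (1/m) * X^T (h - y)."""
    m = X.shape[0]
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / m

# Finite-difference check of Eq. (11) against the loss of Eq. (7).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
theta = rng.normal(size=3)

num_grad = np.zeros_like(theta)
step = 1e-6
for j in range(theta.size):
    e = np.zeros_like(theta)
    e[j] = step
    num_grad[j] = (cross_entropy_loss(theta + e, X, y)
                   - cross_entropy_loss(theta - e, X, y)) / (2 * step)

print(np.max(np.abs(gradient(theta, X, y) - num_grad)))  # tiny, e.g. ~1e-9
```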

Hence the gradient-descent update rule is

$$\boldsymbol \theta \leftarrow \boldsymbol \theta-\alpha \frac 1 m\sum_{i=1}^m (h^{(i)}-y^{(i)})\,\boldsymbol x^{(i)} \tag{12}$$

where $\alpha$ is the learning rate.
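
Putting the pieces together, here is a minimal batch gradient-descent loop, reusing the helpers and toy data above (the learning rate `alpha=0.1` and iteration count are illustrative defaults, not values from the original text):

```python
def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent using the update rule of Eq. (12)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta -= alpha * gradient(theta, X, y)   # Eq. (12)
    return theta

theta_hat = fit_logistic_regression(X, y)
# Predict with the decision rule h(x) >= 0.5, i.e. theta^T x >= 0.
y_pred = (sigmoid(X @ theta_hat) >= 0.5).astype(float)
```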

