Matrix Differentiation

1. Preliminaries

Let $y=f(x)$ be a scalar function of the vector $x=[x_1 \ x_2 \ \dots \ x_n]^{T}$. From multivariable calculus,

$dy=\sum_{i=1}^{n} \frac{\partial y}{\partial x_i}dx_i=\left(\frac{\partial y}{\partial x}\right)^{T}dx$

Analogously, for a scalar $y$ and a matrix $X \in \mathbb{R}^{m \times n}$,

$dy=\sum_{i = 1}^{m}\sum_{j = 1}^{n}\frac{\partial y}{\partial X_{ij}}dX_{ij}=tr\left[\left(\frac{\partial y}{\partial X}\right)^{T}dX\right]$

3. Basic formulas

3.1 Basic properties of the matrix differential

• Sum/difference: $d(X \pm Y)=dX \pm dY$
• Product: $d(XY)=(dX)Y+X\,dY$ (the order matters, since matrices do not commute)
• Transpose: $d(X^{T})=(dX)^{T}$
• Inverse: $d(X^{-1})=-X^{-1}d(X)X^{-1}$
• Hadamard product: $d(X \odot Y) = X \odot dY + dX \odot Y$
• Elementwise function: $d\sigma(X) = \sigma^{'}(X) \odot dX$, where $\sigma$ is applied entrywise
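These rules are easy to sanity-check numerically. The sketch below (not part of the original notes; numpy, the matrix size, and the perturbation scale $10^{-6}$ are illustrative choices) verifies the product and inverse rules to first order in the perturbation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Well-conditioned X so the inverse is numerically stable (illustrative choice)
X = np.eye(n) + 0.1 * rng.standard_normal((n, n))
Y = rng.standard_normal((n, n))
dX = 1e-6 * rng.standard_normal((n, n))
dY = 1e-6 * rng.standard_normal((n, n))

# Product rule: d(XY) = (dX)Y + X(dY), up to second-order terms in dX, dY
lhs_prod = (X + dX) @ (Y + dY) - X @ Y
rhs_prod = dX @ Y + X @ dY
assert np.allclose(lhs_prod, rhs_prod, atol=1e-10)

# Inverse rule: d(X^{-1}) = -X^{-1} dX X^{-1}, up to second-order terms
Xinv = np.linalg.inv(X)
lhs_inv = np.linalg.inv(X + dX) - Xinv
rhs_inv = -Xinv @ dX @ Xinv
assert np.allclose(lhs_inv, rhs_inv, atol=1e-9)
```

The residual in each check is second order in the perturbation, which is why the tolerances can be set far below the size of the differentials themselves.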

3.2 Basic properties of the trace (used when differentiating a scalar with respect to a vector or matrix)

• $tr(x)=x$ for a scalar $x$
• $tr(A^{T})=tr(A)$
• $tr(AB)=tr(BA)$ (for $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times m}$)
• $tr(A \pm B) = tr(A) \pm tr(B)$
• $d[tr(X)]=tr(dX)$
• $tr[(A \odot B)^{T}C] = tr[A^{T}(B \odot C)]$ for same-sized $A$, $B$, $C$
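The two less obvious identities, cyclicity and the Hadamard–trace exchange, can be checked numerically. This is a quick sketch (not from the original notes; numpy and the shapes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Cyclicity tr(AB) = tr(BA) for a conformable, non-square pair (A: 3x4, B: 4x3)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# Hadamard exchange tr[(P ⊙ Q)^T R] = tr[P^T (Q ⊙ R)] for same-shaped matrices;
# both sides equal sum_{ij} P_ij * Q_ij * R_ij
P, Q, R = (rng.standard_normal((3, 3)) for _ in range(3))
lhs = np.trace((P * Q).T @ R)
rhs = np.trace(P.T @ (Q * R))
assert np.isclose(lhs, rhs)
```

Seeing both sides reduce to the same triple elementwise sum $\sum_{ij} P_{ij}Q_{ij}R_{ij}$ is also the cleanest way to prove the Hadamard identity.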

3.3 Proof of a property

（1）$d(X^{-1})=-X^{-1}d(X)X^{-1}$

\begin{aligned} d(X^{-1}) &= d(X^{-1}XX^{-1})=d(X^{-1})XX^{-1}+X^{-1}d(X)X^{-1}+X^{-1}Xd(X^{-1}) \\ &= 2d(X^{-1})+X^{-1}d(X)X^{-1} \end{aligned}
Moving $2d(X^{-1})$ to the left-hand side gives $-d(X^{-1})=X^{-1}d(X)X^{-1}$, i.e. $d(X^{-1})=-X^{-1}d(X)X^{-1}$.

3.4 Worked examples

(1) Given the scalar $y=a^{T}Xb$ with constant vectors $a$ and $b$, find $\frac{\partial y}{\partial X}$.

\begin{aligned} dy&=d[tr(a^{T}Xb)]=tr[d(a^{T}Xb)]=tr[d(a^{T})Xb+a^{T}d(X)b+a^{T}Xd(b)] \\&=tr[a^{T}d(X)b]=tr(ba^{T}dX) \end{aligned}
where $d(a^{T})=0$ and $d(b)=0$ because $a$ and $b$ are constant. $\because dy = tr\left[\left(\frac{\partial y}{\partial X}\right)^{T}dX\right]$, $\therefore \frac{\partial y}{\partial X} = ab^{T}$.
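The result $\frac{\partial y}{\partial X}=ab^{T}$ can be confirmed against a central finite-difference gradient. This is a sketch (numpy, the shapes, and the step size are illustrative assumptions, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = rng.standard_normal((m, n))

y = lambda M: a @ M @ b  # scalar y = a^T X b

# Central finite differences, entry by entry
eps = 1e-6
num_grad = np.zeros_like(X)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_grad[i, j] = (y(X + E) - y(X - E)) / (2 * eps)

# Analytic gradient: a b^T (outer product)
assert np.allclose(num_grad, np.outer(a, b), atol=1e-6)
```

Since $y$ is linear in $X$, the central difference is exact up to floating-point rounding.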

(2) Given the scalar $y=X^{T}AX$, where $X$ is a column vector, find $\frac{\partial y}{\partial X}$.

\begin{aligned} dy&=d[tr(X^{T}AX)]=tr[d(X^{T}AX)]=tr[d(X^{T})AX+X^{T}AdX] \\ &=tr[d(X^{T})AX]+tr(X^{T}AdX) \\ &=tr[d(X)^{T}AX]+tr(X^{T}AdX) \\ &=tr(X^{T}A^{T}dX)+tr(X^{T}AdX) \\ &=tr[X^{T}(A^{T}+A)dX] \end{aligned}
$\because dy = tr\left[\left(\frac{\partial y}{\partial X}\right)^{T}dX\right]$, $\therefore \frac{\partial y}{\partial X} = (A+A^{T})X$
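Again a finite-difference check of $\frac{\partial y}{\partial X}=(A+A^{T})X$ for the quadratic form, with numpy and the dimensions as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))  # not symmetric in general
x = rng.standard_normal(n)

y = lambda v: v @ A @ v  # scalar y = x^T A x

# Central differences along each coordinate direction e
eps = 1e-6
num_grad = np.array([(y(x + eps * e) - y(x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])

assert np.allclose(num_grad, (A + A.T) @ x, atol=1e-6)
```

The check uses a non-symmetric $A$ on purpose: it shows that the gradient is $(A+A^{T})x$, collapsing to $2Ax$ only in the symmetric case.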

(3) Given the scalar $y=a^{T}e^{Xb}$, where the exponential is applied elementwise, find $\frac{\partial y}{\partial X}$.

\begin{aligned} dy &= tr(dy)=tr(a^{T}de^{Xb})=tr[a^{T}(e^{Xb}\odot d(Xb))] \\ &= tr[(a \odot e^{Xb})^{T}d(Xb)]=tr[b(a \odot e^{Xb})^{T}dX] \end{aligned}
$\because dy = tr\left[\left(\frac{\partial y}{\partial X}\right)^{T}dX\right]$, $\therefore \frac{\partial y}{\partial X} = (a \odot e^{Xb})b^{T}$
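This example combines the elementwise rule with the Hadamard–trace identity, so a numerical check of $(a \odot e^{Xb})b^{T}$ is reassuring. A sketch under illustrative assumptions (numpy, small entries in $X$ to keep the exponential tame):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 4
a, b = rng.standard_normal(m), rng.standard_normal(n)
X = 0.1 * rng.standard_normal((m, n))

y = lambda M: a @ np.exp(M @ b)  # np.exp is elementwise, matching e^{Xb}

# Central finite differences, entry by entry
eps = 1e-6
num_grad = np.zeros_like(X)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_grad[i, j] = (y(X + E) - y(X - E)) / (2 * eps)

# Analytic gradient: (a ⊙ e^{Xb}) b^T
analytic = np.outer(a * np.exp(X @ b), b)
assert np.allclose(num_grad, analytic, atol=1e-5)
```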

4. Chain rules

4.1 Vector-by-vector chain rule

For column vectors $x \rightarrow y \rightarrow z$, the Jacobian matrices compose directly:

$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$

4.2 Scalar-by-vector chain rule

For a scalar $z$, naively writing $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$ is dimensionally inconsistent in the denominator layout, where $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$ are column vectors. The correct form transposes the Jacobian:

$\frac{\partial z}{\partial x} = \left(\frac{\partial y}{\partial x}\right)^{T} \frac{\partial z}{\partial y}$

and for a longer chain $x \rightarrow y_1 \rightarrow \dots \rightarrow y_n \rightarrow z$:

$\frac{\partial z}{\partial x} = \left(\frac{\partial y_{n}}{\partial y_{n-1}}\frac{\partial y_{n-1}}{\partial y_{n-2}}\dots\frac{\partial y_{1}}{\partial x}\right)^{T} \frac{\partial z}{\partial y_{n}}$

Example (least squares): let $loss=z^{T}z$ with $z=X\theta-y$, giving the chain $\theta \rightarrow z \rightarrow loss$. The chain rule then yields:
$\frac{\partial loss}{\partial \theta}=\left(\frac{\partial z}{\partial \theta}\right)^{T} \frac{\partial loss}{\partial z}=X^{T}(2z)=2X^{T}(X \theta - y)$
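The least-squares gradient $2X^{T}(X\theta-y)$ can be verified the same way. A sketch (numpy and the problem sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 10, 3
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)
theta = rng.standard_normal(p)

# loss = z^T z with z = X theta - y
loss = lambda t: np.sum((X @ t - y) ** 2)

# Central differences along each coordinate of theta
eps = 1e-6
num_grad = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                     for e in np.eye(p)])

analytic = 2 * X.T @ (X @ theta - y)
assert np.allclose(num_grad, analytic, atol=1e-4)
```

This is the same gradient used to derive the normal equations: setting it to zero gives $X^{T}X\theta = X^{T}y$.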

4.3 Scalar-by-matrix chain rule

Let $Y=AX$ and let $z=f(Y)$ be a scalar. Entrywise,

\begin{aligned} \frac{\partial z}{\partial X_{ij}}&=\sum_{k,l}\frac{\partial z}{\partial Y_{kl}}\frac{\partial Y_{kl}}{\partial X_{ij}} = \sum_{k,l}\frac{\partial z}{\partial Y_{kl}}\frac{\partial \sum_{s}A_{ks}X_{sl}}{\partial X_{ij}}\\ &=\sum_{k,l}\frac{\partial z}{\partial Y_{kl}}A_{ki}\delta_{lj}=\sum_{k}\frac{\partial z}{\partial Y_{kj}}A_{ki} \end{aligned}

In matrix form: $\frac{\partial z}{\partial X} = A^{T}\frac{\partial z}{\partial Y}$
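A concrete instance of this rule: take $z = tr(Y^{T}Y)$ with $Y=AX$, so $\frac{\partial z}{\partial Y}=2Y$ and the chain rule predicts $\frac{\partial z}{\partial X}=A^{T}(2Y)$. The check below is a sketch (numpy, the choice of $z$, and the shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 5))
Y = A @ X

# z = tr(Y^T Y) = sum of squared entries of Y; dz/dY = 2Y,
# so the chain rule gives dz/dX = A^T (2Y)
analytic = A.T @ (2 * Y)

z = lambda M: np.sum((A @ M) ** 2)
eps = 1e-6
num_grad = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        num_grad[i, j] = (z(X + E) - z(X - E)) / (2 * eps)

assert np.allclose(num_grad, analytic, atol=1e-5)
```

This pattern, multiplying the upstream gradient by $A^{T}$, is exactly the backward pass of a linear layer in backpropagation.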
