Matrix Differentiation

1. Preface

Let $y=f(x)$ be a scalar function of a vector $x=[x_1 \ x_2 \ \dots \ x_n]^{T}$. By multivariable calculus,
$$dy=\sum_{i=1}^{n} \frac{\partial y}{\partial x_i}\,dx_i=\left(\frac{\partial y}{\partial x}\right)^{T}dx$$
Extending the vector $x$ to a matrix $X$, we obtain
$$dy=\sum_{i=1}^{n}\sum_{j=1}^{m}\frac{\partial y}{\partial X_{ij}}\,dX_{ij}=\mathrm{tr}\left(\left(\frac{\partial y}{\partial X}\right)^{T}dX\right)$$

Note that throughout this article, all vectors are column vectors by default; lowercase letters denote scalars and vectors, while uppercase letters denote matrices.
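The differential identity above can be sanity-checked numerically. Below is a minimal NumPy sketch; the test function $f(x)=\sum_i x_i^2$ (whose gradient is $2x$), the dimension, and the seed are illustrative assumptions, not part of the text.

```python
import numpy as np

# Numerical sanity check of dy ≈ (∂y/∂x)^T dx in denominator layout.
# Test function (an illustrative assumption): y = f(x) = Σ x_i², so ∂y/∂x = 2x.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
dx = 1e-6 * rng.normal(size=5)   # small perturbation of x

f = lambda v: np.sum(v ** 2)
grad = 2 * x                     # analytic gradient ∂y/∂x

dy_exact = f(x + dx) - f(x)      # actual change in y
dy_linear = grad @ dx            # predicted first-order change (∂y/∂x)^T dx

assert abs(dy_exact - dy_linear) < 1e-10
```

The two quantities agree to first order in $dx$; the residual is the $O(\|dx\|^2)$ term that the differential discards.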

2. Layout Conventions

(Figures illustrating the numerator and denominator layout conventions are omitted here.) In general, we adopt a mixed layout: when differentiating a vector or matrix with respect to a scalar, we use the numerator layout; when differentiating a scalar with respect to a vector or matrix, we use the denominator layout.

3. Basic Formulas

3.1 Basic properties of the differential

  • Sum and difference: $d(X \pm Y)=dX \pm dY$
  • Product: $d(XY)=(dX)Y+X\,dY$
  • Transpose: $d(X^{T})=(dX)^{T}$
  • Inverse: $d(X^{-1})=-X^{-1}\,dX\,X^{-1}$
  • Hadamard product: $d(X \odot Y) = dX \odot Y + X \odot dY$
  • Element-wise function: $d\sigma(X) = \sigma'(X) \odot dX$
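The less obvious rules, such as the one for the matrix inverse, are easy to verify numerically. A minimal sketch (the matrix size, the diagonal shift used to keep $X$ well conditioned, and the perturbation scale are illustrative assumptions):

```python
import numpy as np

# Numerical check of the inverse rule d(X^{-1}) = -X^{-1} (dX) X^{-1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 4)) + 4 * np.eye(4)      # shifted to be well conditioned
dX = 1e-6 * rng.normal(size=(4, 4))              # small perturbation

lhs = np.linalg.inv(X + dX) - np.linalg.inv(X)   # actual change of X^{-1}
rhs = -np.linalg.inv(X) @ dX @ np.linalg.inv(X)  # first-order prediction

assert np.max(np.abs(lhs - rhs)) < 1e-10
```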

3.2 Basic properties of the trace (for scalar-by-vector and scalar-by-matrix derivatives)

  • $\mathrm{tr}(x)=x$ for a scalar $x$
  • $\mathrm{tr}(A^{T})=\mathrm{tr}(A)$
  • $\mathrm{tr}(AB)=\mathrm{tr}(BA)$
  • $\mathrm{tr}(A \pm B) = \mathrm{tr}(A) \pm \mathrm{tr}(B)$
  • $d[\mathrm{tr}(X)]=\mathrm{tr}(dX)$
  • $\mathrm{tr}[(A \odot B)^{T}C] = \mathrm{tr}[A^{T}(B \odot C)]$
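The last identity is the least familiar: both sides expand to $\sum_{ij}A_{ij}B_{ij}C_{ij}$, which a quick numerical check confirms (the shapes and seed are arbitrary choices):

```python
import numpy as np

# Check of tr[(A ⊙ B)^T C] = tr[A^T (B ⊙ C)]: both sides equal
# Σ_ij A_ij B_ij C_ij.
rng = np.random.default_rng(2)
A, B, C = (rng.normal(size=(3, 4)) for _ in range(3))

lhs = np.trace((A * B).T @ C)   # * is the element-wise (Hadamard) product
rhs = np.trace(A.T @ (B * C))

assert np.isclose(lhs, rhs)
```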

3.3 Proof of a property

(1) $d(X^{-1})=-X^{-1}\,dX\,X^{-1}$

Proof:
$$\begin{aligned} d(X^{-1}) &= d(X^{-1}XX^{-1})=d(X^{-1})XX^{-1}+X^{-1}\,dX\,X^{-1}+X^{-1}X\,d(X^{-1}) \\ &= 2\,d(X^{-1})+X^{-1}\,dX\,X^{-1} \end{aligned}$$
Subtracting $2\,d(X^{-1})$ from both sides gives $-d(X^{-1})=X^{-1}\,dX\,X^{-1}$, i.e. $d(X^{-1})=-X^{-1}\,dX\,X^{-1}$.

3.4 Worked Examples

(1) Given the scalar $y=a^{T}Xb$, find $\frac{\partial y}{\partial X}$.

Solution:
$$\begin{aligned} dy&=d[\mathrm{tr}(a^{T}Xb)]=\mathrm{tr}[d(a^{T}Xb)]=\mathrm{tr}[d(a^{T})Xb+a^{T}\,dX\,b+a^{T}X\,db] \\&=\mathrm{tr}[a^{T}\,dX\,b]=\mathrm{tr}(ba^{T}\,dX) \end{aligned}$$
Since $dy = \mathrm{tr}\left(\left(\frac{\partial y}{\partial X}\right)^{T}dX\right)$, it follows that $\frac{\partial y}{\partial X} = ab^{T}$.
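The result $\frac{\partial y}{\partial X} = ab^{T}$ can be verified against a central finite difference (shapes and seed are arbitrary):

```python
import numpy as np

# Finite-difference check of example (1): y = a^T X b  ⇒  ∂y/∂X = a b^T.
rng = np.random.default_rng(3)
a, X, b = rng.normal(size=3), rng.normal(size=(3, 4)), rng.normal(size=4)

y = lambda M: a @ M @ b
grad = np.outer(a, b)                 # analytic gradient a b^T

eps = 1e-6
num = np.zeros_like(X)
for i in range(3):
    for j in range(4):
        E = np.zeros_like(X)
        E[i, j] = eps
        num[i, j] = (y(X + E) - y(X - E)) / (2 * eps)  # central difference

assert np.allclose(num, grad, atol=1e-7)
```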

(2) Given the scalar $y=X^{T}AX$, where $X$ is a column vector, find $\frac{\partial y}{\partial X}$.

Solution:
$$\begin{aligned} dy&=d[\mathrm{tr}(X^{T}AX)]=\mathrm{tr}[d(X^{T}AX)]=\mathrm{tr}[d(X^{T})AX+X^{T}A\,dX] \\ &=\mathrm{tr}[d(X^{T})AX]+\mathrm{tr}(X^{T}A\,dX) \\ &=\mathrm{tr}[(dX)^{T}AX]+\mathrm{tr}(X^{T}A\,dX) \\ &=\mathrm{tr}(X^{T}A^{T}\,dX)+\mathrm{tr}(X^{T}A\,dX) \\ &=\mathrm{tr}[X^{T}(A^{T}+A)\,dX] \end{aligned}$$
Since $dy = \mathrm{tr}\left(\left(\frac{\partial y}{\partial X}\right)^{T}dX\right)$, it follows that $\frac{\partial y}{\partial X} = (A+A^{T})X$.
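A finite-difference check of $(A+A^{T})X$ for the quadratic form (size and seed are arbitrary):

```python
import numpy as np

# Finite-difference check of example (2): y = x^T A x for a column vector x
# ⇒  ∂y/∂x = (A + A^T) x.
rng = np.random.default_rng(4)
A = rng.normal(size=(5, 5))
x = rng.normal(size=5)

y = lambda v: v @ A @ v
grad = (A + A.T) @ x

eps = 1e-6
num = np.array([(y(x + eps * e) - y(x - eps * e)) / (2 * eps)
                for e in np.eye(5)])   # central difference along each coordinate

assert np.allclose(num, grad, atol=1e-7)
```

Note that the gradient only simplifies to $2AX$ when $A$ is symmetric.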

(3) Given the scalar $y=a^{T}e^{Xb}$, where the exponential is applied element-wise, find $\frac{\partial y}{\partial X}$.

Solution:
$$\begin{aligned} dy &= \mathrm{tr}(dy)=\mathrm{tr}(a^{T}\,de^{Xb})=\mathrm{tr}[a^{T}(e^{Xb}\odot d(Xb))] \\ &= \mathrm{tr}[(a \odot e^{Xb})^{T}d(Xb)]=\mathrm{tr}[b(a \odot e^{Xb})^{T}dX] \end{aligned}$$
Since $dy = \mathrm{tr}\left(\left(\frac{\partial y}{\partial X}\right)^{T}dX\right)$, it follows that $\frac{\partial y}{\partial X} = (a \odot e^{Xb})b^{T}$.
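This result can also be checked numerically (the $0.1$ scale on $X$ is only there to keep the exponentials tame; all shapes and seeds are arbitrary):

```python
import numpy as np

# Finite-difference check of example (3): y = a^T e^{Xb} (element-wise exp)
# ⇒  ∂y/∂X = (a ⊙ e^{Xb}) b^T.
rng = np.random.default_rng(5)
a = rng.normal(size=3)
X = 0.1 * rng.normal(size=(3, 4))
b = rng.normal(size=4)

y = lambda M: a @ np.exp(M @ b)
grad = np.outer(a * np.exp(X @ b), b)   # (a ⊙ e^{Xb}) b^T

eps = 1e-6
num = np.zeros_like(X)
for i in range(3):
    for j in range(4):
        E = np.zeros_like(X)
        E[i, j] = eps
        num[i, j] = (y(X + E) - y(X - E)) / (2 * eps)

assert np.allclose(num, grad, atol=1e-6)
```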

4. The Chain Rule

4.1 Vector-by-vector chain rule

Suppose the chain of dependence $x\,(m \times 1) \rightarrow y\,(n \times 1) \rightarrow z\,(p \times 1)$. Then the chain rule is
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$
In terms of matrix dimensions, $\frac{\partial z}{\partial x}$ is $p \times m$, $\frac{\partial z}{\partial y}$ is $p \times n$, and $\frac{\partial y}{\partial x}$ is $n \times m$, so the matrix product is well defined.
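The rule can be illustrated with two composed linear maps (an illustrative assumption, since their Jacobians are just the matrices themselves): $y=Ax$ and $z=By$, giving $\frac{\partial z}{\partial x}=BA$ in numerator layout.

```python
import numpy as np

# Vector-by-vector chain rule with linear maps: y = A x, z = B y,
# so ∂z/∂x = B A, a (p×n)(n×m) = p×m product in numerator layout.
rng = np.random.default_rng(6)
m, n, p = 4, 3, 2
A = rng.normal(size=(n, m))   # Jacobian ∂y/∂x
B = rng.normal(size=(p, n))   # Jacobian ∂z/∂y

J = B @ A                     # chain rule result

# Compare against a finite-difference Jacobian of x ↦ B(Ax).
x = rng.normal(size=m)
eps = 1e-6
num = np.column_stack([(B @ (A @ (x + eps * e)) - B @ (A @ (x - eps * e))) / (2 * eps)
                       for e in np.eye(m)])

assert J.shape == (p, m)
assert np.allclose(num, J, atol=1e-7)
```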

4.2 Scalar-by-vector chain rule

Suppose the chain $x\,(m \times 1) \rightarrow y\,(n \times 1) \rightarrow z\,(1 \times 1)$. In terms of matrix dimensions, $\frac{\partial z}{\partial x}$ is $m \times 1$, $\frac{\partial z}{\partial y}$ is $n \times 1$, and $\frac{\partial y}{\partial x}$ is $n \times m$, so we clearly cannot write
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$
For the dimensions to be compatible, the formula must be written as
$$\frac{\partial z}{\partial x} = \left(\frac{\partial y}{\partial x}\right)^{T} \frac{\partial z}{\partial y}$$
The transpose appears precisely because we use a mixed layout: scalar-by-vector derivatives are in the denominator layout, while vector-by-vector derivatives are in the numerator layout.

For a deeper chain $x \rightarrow y_{1} \rightarrow y_{2} \rightarrow \dots \rightarrow y_{n} \rightarrow z$, the chain rule is
$$\frac{\partial z}{\partial x} = \left(\frac{\partial y_{n}}{\partial y_{n-1}}\frac{\partial y_{n-1}}{\partial y_{n-2}}\dots\frac{\partial y_{1}}{\partial x}\right)^{T} \frac{\partial z}{\partial y_{n}}$$
Example: given $loss = (X\theta - y)^{T}(X\theta - y)$, find $\frac{\partial loss}{\partial \theta}$.

Solution:

Let $z=X\theta-y$, giving the chain $\theta \rightarrow z \rightarrow loss$. By the chain rule above,
$$\frac{\partial loss}{\partial \theta}=\left(\frac{\partial z}{\partial \theta}\right)^{T} \frac{\partial loss}{\partial z}=X^{T}(2z)=2X^{T}(X\theta - y)$$
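This is the familiar least-squares gradient, and it too can be checked against a finite difference (the problem sizes and seed are arbitrary):

```python
import numpy as np

# Finite-difference check of the worked example:
# loss = (Xθ - y)^T (Xθ - y)  ⇒  ∂loss/∂θ = 2 X^T (Xθ - y).
rng = np.random.default_rng(7)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)
theta = rng.normal(size=3)

loss = lambda t: (X @ t - y) @ (X @ t - y)
grad = 2 * X.T @ (X @ theta - y)

eps = 1e-6
num = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                for e in np.eye(3)])

assert np.allclose(num, grad, atol=1e-6)
```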

4.3 Scalar-by-matrix chain rule

Since matrix-by-matrix differentiation is rather involved to define, here we analyze only a simple linear relationship. Suppose the chain $X \rightarrow Y \rightarrow z$, i.e. $z=f(Y)$ and $Y=AX+B$, and we want $\frac{\partial z}{\partial X}$. Working element by element:
$$\begin{aligned} \frac{\partial z}{\partial X_{ij}}&=\sum_{k,l}\frac{\partial z}{\partial Y_{kl}}\frac{\partial Y_{kl}}{\partial X_{ij}} = \sum_{k,l}\frac{\partial z}{\partial Y_{kl}}\frac{\partial \sum_{s}A_{ks}X_{sl}}{\partial X_{ij}} =\sum_{k}\frac{\partial z}{\partial Y_{kj}}A_{ki} \end{aligned}$$
where the last step uses $\frac{\partial Y_{kl}}{\partial X_{ij}}=A_{ki}$ when $l=j$ and $0$ otherwise. Thus $\frac{\partial z}{\partial X_{ij}}$ is the inner product of the $i$-th row of $A^{T}$ with the $j$-th column of $\frac{\partial z}{\partial Y}$, so
$$\frac{\partial z}{\partial X} = A^{T}\frac{\partial z}{\partial Y}$$
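To check $\frac{\partial z}{\partial X} = A^{T}\frac{\partial z}{\partial Y}$ numerically, we need a concrete outer function; the choice $z=\sum_{k,l}Y_{kl}^2$ (so $\frac{\partial z}{\partial Y}=2Y$) is an illustrative assumption, as are the shapes and seed.

```python
import numpy as np

# Check of ∂z/∂X = A^T (∂z/∂Y) for Y = AX + B, with outer function
# z = Σ_kl Y_kl², whose gradient is ∂z/∂Y = 2Y.
rng = np.random.default_rng(8)
A = rng.normal(size=(3, 4))
X = rng.normal(size=(4, 5))
B = rng.normal(size=(3, 5))

z = lambda M: np.sum((A @ M + B) ** 2)
Y = A @ X + B
grad = A.T @ (2 * Y)          # A^T ∂z/∂Y

eps = 1e-6
num = np.zeros_like(X)
for i in range(4):
    for j in range(5):
        E = np.zeros_like(X)
        E[i, j] = eps
        num[i, j] = (z(X + E) - z(X - E)) / (2 * eps)

assert np.allclose(num, grad, atol=1e-6)
```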

