Matrix Differentiation Methods

1. Introduction

Let $y=f(x)$ be a scalar function of $x=[x_1 \ x_2 \ \dots \ x_n]^{T}$. From multivariable calculus,
$$dy=\sum_{i=1}^{n} \frac{\partial y}{\partial x_i}dx_i=\left(\frac{\partial y}{\partial x}\right)^{T}dx$$
Generalizing the vector $x$ to an $n \times m$ matrix $X$ gives
$$dy=\sum_{i = 1}^{n}\sum_{j = 1}^{m}\frac{\partial y}{\partial X_{ij}}dX_{ij}=tr\left(\left(\frac{\partial y}{\partial X}\right)^{T}dX\right)$$

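Note that throughout this article all vectors are column vectors by default. Lowercase letters denote scalars and vectors, and uppercase letters denote matrices.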
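The identity between the double sum and the trace form can be checked numerically. The sketch below is only an illustration: the function `f` (sum of squared entries) and the sizes are assumptions chosen for the check, not something taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4
X = rng.standard_normal((n, m))

def f(X):
    # Illustrative scalar function (an assumption): sum of squared entries.
    return np.sum(X ** 2)

grad = 2 * X  # dy/dX for this particular f, same shape as X (n x m)

dX = 1e-6 * rng.standard_normal((n, m))  # a small perturbation of X
dy_actual = f(X + dX) - f(X)             # actual change in y
dy_sum = np.sum(grad * dX)               # double sum over i and j
dy_trace = np.trace(grad.T @ dX)         # trace form tr((dy/dX)^T dX)

print(dy_actual, dy_sum, dy_trace)
```

The sum and trace forms agree to rounding error, and both match the actual change up to a second-order term in `dX`.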

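2. Layout Conventions

In general we use the so-called mixed layout: when a vector or matrix is differentiated with respect to a scalar, the numerator layout is used; when a scalar is differentiated with respect to a vector or matrix, the denominator layout is used.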
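As a small illustration of the shapes this convention produces (a sketch written out from the rule above; the sizes $n$ and $m$ are chosen only for illustration):

$$\begin{aligned} y \in \mathbb{R}^{n},\ x \in \mathbb{R}: &\quad \frac{\partial y}{\partial x} = \begin{bmatrix}\frac{\partial y_1}{\partial x} & \cdots & \frac{\partial y_n}{\partial x}\end{bmatrix}^{T} \in \mathbb{R}^{n \times 1} \quad \text{(numerator layout)} \\ y \in \mathbb{R},\ x \in \mathbb{R}^{m}: &\quad \frac{\partial y}{\partial x} = \begin{bmatrix}\frac{\partial y}{\partial x_1} & \cdots & \frac{\partial y}{\partial x_m}\end{bmatrix}^{T} \in \mathbb{R}^{m \times 1} \quad \text{(denominator layout)} \\ y \in \mathbb{R},\ X \in \mathbb{R}^{n \times m}: &\quad \left(\frac{\partial y}{\partial X}\right)_{ij} = \frac{\partial y}{\partial X_{ij}}, \quad \frac{\partial y}{\partial X} \in \mathbb{R}^{n \times m} \quad \text{(denominator layout)} \end{aligned}$$

In each of these cases the derivative takes the shape of whichever argument is not a scalar.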

3. Basic Formulas

3.1 Basic Properties of Differentials

  • Sum and difference: $d(X \pm Y)=dX \pm dY$
  • Product: $d(XY)=(dX)Y+X\,dY$
  • Transpose: $d(X^{T})=(dX)^{T}$
  • Inverse: $d(X^{-1})=-X^{-1}(dX)X^{-1}$
  • Hadamard product (elementwise product): $d(X \odot Y) = X \odot dY + dX \odot Y$
  • Elementwise function: $d\sigma(X) = \sigma'(X) \odot dX$
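These rules can be sanity-checked with small perturbations. The following is a minimal sketch, assuming random 3 x 3 matrices and using `tanh` as a stand-in for the elementwise function $\sigma$ (both are illustrative choices, not from the article).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 3))
Y = rng.standard_normal((3, 3))
dX = 1e-6 * rng.standard_normal((3, 3))
dY = 1e-6 * rng.standard_normal((3, 3))

# Matrix product: d(XY) = (dX) Y + X dY
change_prod = (X + dX) @ (Y + dY) - X @ Y
err_prod = np.max(np.abs(change_prod - (dX @ Y + X @ dY)))

# Hadamard product: d(X * Y) = X * dY + dX * Y
change_had = (X + dX) * (Y + dY) - X * Y
err_had = np.max(np.abs(change_had - (X * dY + dX * Y)))

# Elementwise function, here sigma = tanh with sigma' = 1 - tanh^2
change_elem = np.tanh(X + dX) - np.tanh(X)
err_elem = np.max(np.abs(change_elem - (1 - np.tanh(X) ** 2) * dX))

# Factor order matters: (dX) Y is generally not equal to Y dX
err_swapped = np.max(np.abs(change_prod - (Y @ dX + X @ dY)))

print(err_prod, err_had, err_elem, err_swapped)
```

The first three errors are second order in the perturbation, while the swapped ordering leaves a first-order mismatch because matrix multiplication does not commute.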

3.2 Basic Properties of the Trace (used when differentiating a scalar with respect to a vector or matrix)

  • $tr(x)=x$ (where $x$ is a scalar)
  • $tr(A^{T})=tr(A)$
  • $tr(AB)=tr(BA)$
  • $tr(A \pm B) = tr(A) \pm tr(B)$
  • $d[tr(X)]=tr(dX)$
  • $tr[(A \odot B)^{T}C] = tr[A^{T}(B \odot C)]$ (for $A$, $B$, $C$ of the same shape)
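A quick numerical check of the less obvious trace identities, as a sketch with randomly generated matrices (the shapes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Cyclic property: tr(AB) = tr(BA), even when A and B are not square
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
cyclic_ok = np.isclose(np.trace(A @ B), np.trace(B @ A))

# Transpose invariance: tr(S^T) = tr(S)
S = rng.standard_normal((3, 3))
transpose_ok = np.isclose(np.trace(S.T), np.trace(S))

# Hadamard identity: tr((P * Q)^T R) = tr(P^T (Q * R)) for same-shape P, Q, R
P = rng.standard_normal((3, 4))
Q = rng.standard_normal((3, 4))
R = rng.standard_normal((3, 4))
lhs = np.trace((P * Q).T @ R)
rhs = np.trace(P.T @ (Q * R))
hadamard_ok = np.isclose(lhs, rhs)

print(cyclic_ok, transpose_ok, hadamard_ok)
```

Both sides of the Hadamard identity equal the sum of $P_{ij}Q_{ij}R_{ij}$ over all entries, which is why it holds.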

3.3 Proof of a Property

(1) $d(X^{-1})=-X^{-1}(dX)X^{-1}$

Proof: since $X^{-1}=X^{-1}XX^{-1}$, applying the product rule gives
$$\begin{aligned} d(X^{-1}) &= d(X^{-1}XX^{-1})=d(X^{-1})XX^{-1}+X^{-1}(dX)X^{-1}+X^{-1}Xd(X^{-1}) \\ &= 2d(X^{-1})+X^{-1}(dX)X^{-1} \end{aligned}$$
Subtracting $2d(X^{-1})$ from both sides yields $d(X^{-1})=-X^{-1}(dX)X^{-1}$.
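The inverse rule can also be confirmed numerically. This is a minimal sketch; the matrix `X` below is an arbitrary well-conditioned example chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# An arbitrary invertible (diagonally dominant) matrix, assumed for illustration
X = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
dX = 1e-6 * rng.standard_normal((3, 3))

X_inv = np.linalg.inv(X)
actual_change = np.linalg.inv(X + dX) - X_inv   # true change in the inverse
predicted = -X_inv @ dX @ X_inv                 # first-order prediction

err = np.max(np.abs(actual_change - predicted))
print(err)
```

The remaining error is second order in `dX`, as expected for a first-order differential.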

3.4 Examples

(1) Given the scalar $y=a^{T}Xb$, find $\frac{\partial y}{\partial X}$.

Solution: since $a$ and $b$ are constant, $d(a^{T})=0$ and $d(b)=0$, so
$$\begin{aligned} dy&=d[tr(a^{T}Xb)]=tr[d(a^{T}Xb)]=tr[d(a^{T})Xb+a^{T}(dX)b+a^{T}Xd(b)] \\&=tr[a^{T}(dX)b]=tr(ba^{T}dX)=tr[(ab^{T})^{T}dX] \end{aligned}$$
Comparing with $dy = tr\left(\left(\frac{\partial y}{\partial X}\right)^{T}dX\right)$ gives $\frac{\partial y}{\partial X} = ab^{T}$.
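The result $ab^{T}$ can be compared against a finite-difference gradient. The helper `numerical_grad` below is a hypothetical utility written for this sketch, and the sizes are arbitrary.

```python
import numpy as np

def numerical_grad(f, X, h=1e-6):
    """Central-difference estimate of df/dX, with the same shape as X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h                      # perturb one entry at a time
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(4)
a = rng.standard_normal(3)
b = rng.standard_normal(4)
X = rng.standard_normal((3, 4))

def y(X):
    return a @ X @ b                    # scalar a^T X b

analytic = np.outer(a, b)               # a b^T, shape 3 x 4
numeric = numerical_grad(y, X)
max_err = np.max(np.abs(analytic - numeric))
print(max_err)
```

Because $y$ is linear in $X$, the gradient does not depend on $X$ at all, which the outer product $ab^{T}$ makes explicit.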

(2) Given the scalar $y=X^{T}AX$ (here $X$ is a column vector, so that $y$ is a scalar), find $\frac{\partial y}{\partial X}$.

Solution:
$$\begin{aligned} dy&=d[tr(X^{T}AX)]=tr[d(X^{T}AX)]=tr[d(X^{T})AX+X^{T}AdX] \\ &=tr[d(X^{T})AX]+tr(X^{T}AdX) \\ &=tr[(dX)^{T}AX]+tr(X^{T}AdX) \\ &=tr(X^{T}A^{T}dX)+tr(X^{T}AdX) \\ &=tr[X^{T}(A^{T}+A)dX] \end{aligned}$$
The fourth line uses $tr(M^{T})=tr(M)$. Comparing with $dy = tr\left(\left(\frac{\partial y}{\partial X}\right)^{T}dX\right)$ gives $\frac{\partial y}{\partial X} = (A+A^{T})X$.
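A directional-derivative check of the quadratic-form gradient, as a sketch with an arbitrary non-symmetric `A` (all values are randomly generated assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n))      # deliberately not symmetric
x = rng.standard_normal((n, 1))      # column vector

def y(x):
    return (x.T @ A @ x).item()      # scalar quadratic form

grad = (A + A.T) @ x                 # analytic gradient, shape n x 1

# Compare dy along a random direction D with tr(grad^T D)
D = rng.standard_normal((n, 1))
h = 1e-5
directional_fd = (y(x + h * D) - y(x - h * D)) / (2 * h)
directional_analytic = np.sum(grad * D)

print(directional_fd, directional_analytic)
```

When `A` is symmetric the gradient reduces to `2 * A @ x`.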

(3) Given the scalar $y=a^{T}e^{Xb}$, where the exponential is applied elementwise, find $\frac{\partial y}{\partial X}$.

Solution:
$$\begin{aligned} dy &= tr(dy)=tr(a^{T}de^{Xb})=tr[a^{T}(e^{Xb}\odot d(Xb))] \\ &= tr[(a \odot e^{Xb})^{T}d(Xb)]=tr[(a \odot e^{Xb})^{T}(dX)b]=tr[b(a \odot e^{Xb})^{T}dX] \end{aligned}$$
Comparing with $dy = tr\left(\left(\frac{\partial y}{\partial X}\right)^{T}dX\right)$ gives $\frac{\partial y}{\partial X} = (a \odot e^{Xb})b^{T}$.
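The same kind of directional check works for this example. The sketch below assumes small random inputs so that the exponentials stay moderate; the scaling factor `0.5` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(7)
a = rng.standard_normal(3)
b = rng.standard_normal(4)
X = 0.5 * rng.standard_normal((3, 4))

def y(X):
    return a @ np.exp(X @ b)                # np.exp acts elementwise on Xb

grad = np.outer(a * np.exp(X @ b), b)       # (a ⊙ e^{Xb}) b^T, shape 3 x 4

# Compare dy along a random direction D with tr(grad^T D)
D = rng.standard_normal((3, 4))
h = 1e-5
directional_fd = (y(X + h * D) - y(X - h * D)) / (2 * h)
directional_analytic = np.sum(grad * D)

print(directional_fd, directional_analytic)
```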

4. Chain Rule

4.1 Chain Rule for Vector-by-Vector Derivatives

Suppose there is a chain of dependencies $x(m \times 1) \rightarrow y(n \times 1) \rightarrow z(p \times 1)$. The chain rule then reads:
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$
In terms of dimensions, $\frac{\partial z}{\partial x}$ is a $p \times m$ matrix, $\frac{\partial z}{\partial y}$ is a $p \times n$ matrix, and $\frac{\partial y}{\partial x}$ is an $n \times m$ matrix, so the shapes are compatible for matrix multiplication.
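To make the shapes concrete, here is a minimal sketch that multiplies two Jacobians and compares the product with a finite-difference Jacobian. The maps `y_of_x` and `z_of_y`, the weight matrices, and the helper `numerical_jacobian` are all hypothetical choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n, p = 4, 3, 2
W1 = rng.standard_normal((n, m))
W2 = rng.standard_normal((p, n))
x = rng.standard_normal(m)

def y_of_x(x):
    return np.tanh(W1 @ x)          # y in R^n

def z_of_y(y):
    return W2 @ y                   # z in R^p

def numerical_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian in numerator layout (rows index outputs)."""
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

y = y_of_x(x)
dy_dx = (1 - y ** 2)[:, None] * W1  # n x m, equals diag(1 - y^2) W1
dz_dy = W2                          # p x n
dz_dx = dz_dy @ dy_dx               # p x m, by the chain rule

dz_dx_numeric = numerical_jacobian(lambda v: z_of_y(y_of_x(v)), x)
max_err = np.max(np.abs(dz_dx - dz_dx_numeric))
print(dz_dx.shape, max_err)
```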

4.2 Chain Rule for Scalar-by-Vector Derivatives

Suppose there is a chain $x(m \times 1) \rightarrow y(n \times 1) \rightarrow z(1 \times 1)$. In terms of dimensions, $\frac{\partial z}{\partial x}$ is an $m \times 1$ matrix, $\frac{\partial z}{\partial y}$ is an $n\times 1$ matrix, and $\frac{\partial y}{\partial x}$ is an $n \times m$ matrix, so the rule clearly cannot be written as
$$\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$$
To make the dimensions compatible, it should be written as
$$\frac{\partial z}{\partial x} = \left(\frac{\partial y}{\partial x}\right)^{T} \frac{\partial z}{\partial y}$$
The transpose appears because we use the mixed layout: scalar-by-vector derivatives use the denominator layout, while vector-by-vector derivatives use the numerator layout.

For a deeper chain $x \rightarrow y_{1} \rightarrow y_{2} \rightarrow \dots \rightarrow y_{n} \rightarrow z$, the chain rule becomes
$$\frac{\partial z}{\partial x} = \left(\frac{\partial y_{n}}{\partial y_{n-1}}\frac{\partial y_{n-1}}{\partial y_{n-2}}\cdots\frac{\partial y_{1}}{\partial x}\right)^{T} \frac{\partial z}{\partial y_{n}}$$
Example: given $loss = (X \theta - y)^{T}(X \theta - y)$, find $\frac{\partial loss}{\partial \theta}$.

Solution:

Let $z=X\theta-y$, which gives the chain $\theta \rightarrow z \rightarrow loss$ with $loss=z^{T}z$, $\frac{\partial z}{\partial \theta}=X$ and $\frac{\partial loss}{\partial z}=2z$. By the chain rule,
$$\frac{\partial loss}{\partial \theta}=\left(\frac{\partial z}{\partial \theta}\right)^{T} \frac{\partial loss}{\partial z}=X^{T}(2z)=2X^{T}(X \theta - y)$$
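The least-squares gradient can be verified numerically. The sketch below uses randomly generated data (an assumption for illustration) and also checks that the gradient vanishes at the least-squares solution returned by `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
theta = rng.standard_normal(3)

def loss(theta):
    z = X @ theta - y
    return z @ z                          # (X theta - y)^T (X theta - y)

def grad(theta):
    return 2 * X.T @ (X @ theta - y)      # result of the chain rule

# Central differences along each coordinate direction of theta
h = 1e-5
numeric = np.array([(loss(theta + h * e) - loss(theta - h * e)) / (2 * h)
                    for e in np.eye(theta.size)])
max_err = np.max(np.abs(grad(theta) - numeric))

# At the least-squares minimizer the gradient should be (numerically) zero
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
grad_at_min = grad(theta_star)

print(max_err, grad_at_min)
```

Setting the gradient to zero gives the normal equations $X^{T}X\theta = X^{T}y$, which is the condition the second check confirms.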

4.3 Chain Rule for Scalar-by-Matrix Derivatives

Because the derivative of a matrix with respect to a matrix has a fairly complicated definition, we only analyze a simple linear relationship here. Suppose there is a chain $X \rightarrow Y \rightarrow z$, that is, $z=f(Y)$ and $Y=AX+B$, and we want $\frac{\partial z}{\partial X}$. Working entry by entry:
$$\begin{aligned} \frac{\partial z}{\partial X_{ij}}&=\sum_{k,l}\frac{\partial z}{\partial Y_{kl}}\frac{\partial Y_{kl}}{\partial X_{ij}} = \sum_{k,l}\frac{\partial z}{\partial Y_{kl}}\frac{\partial \sum_{s}(A_{ks}X_{sl})}{\partial X_{ij}}\\ &=\sum_{k,l}\frac{\partial z}{\partial Y_{kl}}\frac{\partial (A_{ki}X_{il})}{\partial X_{ij}}=\sum_{k}\frac{\partial z}{\partial Y_{kj}}A_{ki} \end{aligned}$$
The last step holds because only the term with $l=j$ is nonzero. So $\frac{\partial z}{\partial X_{ij}}$ is the inner product of the $i$-th row of $A^{T}$ and the $j$-th column of $\frac{\partial z}{\partial Y}$, and therefore:
$$\frac{\partial z}{\partial X} = A^{T}\frac{\partial z}{\partial Y}$$
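A finite-difference check of this result, as a sketch: the outer function `z_of_Y` (sum of squared entries) is an assumption chosen so that $\frac{\partial z}{\partial Y}=2Y$ is known in closed form, and `numerical_grad` is a hypothetical helper.

```python
import numpy as np

def numerical_grad(f, X, h=1e-6):
    """Central-difference estimate of df/dX, with the same shape as X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(10)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((2, 4))
X = rng.standard_normal((3, 4))

def z_of_Y(Y):
    return np.sum(Y ** 2)            # illustrative f(Y), so dz/dY = 2Y

def z_of_X(X):
    return z_of_Y(A @ X + B)

Y = A @ X + B
dz_dY = 2 * Y                        # shape 2 x 4, same as Y
dz_dX = A.T @ dz_dY                  # shape 3 x 4, same as X

max_err = np.max(np.abs(dz_dX - numerical_grad(z_of_X, X)))
print(dz_dX.shape, max_err)
```

This is the same pattern that appears when backpropagating through a linear layer: the upstream gradient is multiplied on the left by $A^{T}$.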
