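This article supplements the earlier posts in this series with detailed derivations of the formulas they use, filling in steps that the papers omit. Readers interested in "diffusion models" may want to read those earlier posts first.
Gaussian distribution
Probability density function
If \(x \sim \mathcal{N}(\mu, \sigma^2)\), then: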
\[f(x ; \mu, \sigma)=\frac{1}{\sigma \sqrt{2 \pi}} \exp \left(-\frac{(x-\mu)^2}{2 \sigma^2}\right)
\]
KL divergence between two Gaussians
\[D_{\mathrm{KL}}\left(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\right) = \ln \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}
\]
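The closed form above is easy to evaluate numerically. The following is a minimal sketch for the univariate case only; the function name `kl_gaussians` is an illustrative choice, not something taken from the papers.

```python
import math

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) for univariate Gaussians.

    sigma1 and sigma2 are standard deviations and must be positive.
    """
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

print(kl_gaussians(0.0, 1.0, 0.0, 1.0))  # identical distributions -> 0.0
print(kl_gaussians(0.0, 1.0, 1.0, 1.0))  # unit shift in the mean -> 0.5
```

Swapping the two arguments generally changes the value, which reflects the asymmetry of the KL divergence.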
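Property 1
If a random variable follows a Gaussian distribution, \(x \sim \mathcal{N}(\mu, \sigma^2)\), then for any real numbers \(a \neq 0\) and \(b\):
\[ax + b \sim \mathcal{N}(a\mu + b, (a\sigma)^2)
\]
Consequently, any Gaussian random variable \(x \sim \mathcal{N}(\mu, \sigma^2)\) can be written as an affine transformation of a standard normal random variable \(\epsilon\):
\[x = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)
\]
The same holds componentwise for a vector \(\mathbf{x}\) with covariance \(\sigma^2 \mathbf{I}\), in which case \(\epsilon \sim \mathcal{N}(0, \mathbf{I})\).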
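Property 1 is what makes sampling by reparameterization possible. The sketch below draws scalar samples as \(\mu + \sigma\epsilon\) using only the Python standard library; the helper names and the values \(\mu = 2\), \(\sigma = 3\) are arbitrary choices for illustration.

```python
import random

def sample_gaussian(mu, sigma, rng):
    """Draw x ~ N(mu, sigma^2) as x = mu + sigma * eps with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

def mean_and_std(values):
    """Sample mean and (population) standard deviation."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_gaussian(2.0, 3.0, rng) for _ in range(100_000)]
print(mean_and_std(samples))  # expected to be close to (2.0, 3.0)
```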
Property 2
Suppose two random variables are Gaussian and mutually independent, \(x \sim \mathcal{N}(\mu_x, \sigma_x^2),\ \ y \sim \mathcal{N}(\mu_y, \sigma_y^2)\). Then their sum and their difference are also Gaussian:
\[\begin{aligned}
& U=x+y \sim \mathcal{N}\left(\mu_x+\mu_y, \sigma_x^2+\sigma_y^2\right) \\
& V=x-y \sim \mathcal{N}\left(\mu_x-\mu_y, \sigma_x^2+\sigma_y^2\right)
\end{aligned}\]
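A quick Monte Carlo check can make Property 2 concrete: the variances add for both the sum and the difference. This is only a sanity check under assumed parameters (\(\mu_x = 1, \sigma_x = 1, \mu_y = -2, \sigma_y = 2\)), not a proof.

```python
import random

def sample_variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)
n = 100_000
xs = [rng.gauss(1.0, 1.0) for _ in range(n)]   # x ~ N(1, 1^2)
ys = [rng.gauss(-2.0, 2.0) for _ in range(n)]  # y ~ N(-2, 2^2), drawn independently of x

sums = [x + y for x, y in zip(xs, ys)]
diffs = [x - y for x, y in zip(xs, ys)]
print(sample_variance(sums), sample_variance(diffs))  # both expected near 1 + 4 = 5
```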
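Derivation 1
In the forward process of diffusion, how can the state \(\mathbf{x}_t\) at an arbitrary step \(t\) be expressed in terms of \(\mathbf{x}_0\)?
Solution:
In the forward process, each transition between states is Gaussian: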
\[q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) = \mathcal{N}\left(\sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right)\tag{1}
\]
Reparameterize \(\beta_{t}\) by defining:
\[\begin{aligned}
\alpha_t & =1-\beta_t \\
\bar{\alpha}_t & =\prod_{i=1}^t \alpha_i
\end{aligned}\]
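In code, these two definitions amount to an elementwise subtraction and a running product. The sketch below assumes a linear \(\beta\) schedule from \(10^{-4}\) to \(0.02\) over \(T = 1000\) steps purely for illustration; the schedule is not specified in this article, and list index `i` corresponds to step \(t = i + 1\).

```python
def alphas_from_betas(betas):
    """Return (alphas, alpha_bars) with alpha_t = 1 - beta_t and
    alpha_bar_t = alpha_1 * alpha_2 * ... * alpha_t."""
    alphas = [1.0 - b for b in betas]
    alpha_bars = []
    running = 1.0
    for a in alphas:
        running *= a            # cumulative product up to the current step
        alpha_bars.append(running)
    return alphas, alpha_bars

# Assumed schedule for illustration only: linear from 1e-4 to 0.02, T = 1000.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas, alpha_bars = alphas_from_betas(betas)
print(alpha_bars[0], alpha_bars[-1])  # starts near 1 and decays toward 0
```

Because every \(\alpha_t\) lies in \((0, 1)\), \(\bar{\alpha}_t\) decreases monotonically with \(t\).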
Expanding \((1)\) with Property 1 gives:
\[\begin{aligned}
q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) & =\mathcal{N}\left(\sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right) \\
\mathbf{x}_t & =\sqrt{1-\beta_t} \mathbf{x}_{t-1}+\sqrt{\beta_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \\
& =\sqrt{\alpha_t} \mathbf{x}_{t-1}+\sqrt{1-\alpha_t} \epsilon
\end{aligned}
\tag{2}
\]
Since \(\mathbf{x}_t = \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t} \epsilon\), the same relation one step earlier gives \(\mathbf{x}_{t-1} = \sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}} \bar{\epsilon}\). Substituting this into \((2)\):
\[\begin{aligned}
& \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t} \epsilon \\
=& \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}} \bar{\epsilon} \right) + \sqrt{1 - \alpha_t} \epsilon \\
=& \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{\alpha_t \left(1 - \alpha_{t-1}\right)} \bar{\epsilon} + \sqrt{1 - \alpha_t} \epsilon
\end{aligned}
\tag{3}
\]
To distinguish it from \(\epsilon\), \(\bar{\epsilon}\) denotes a second random variable that also follows the standard Gaussian \(\mathcal{N}(0, \mathbf{I})\).
By Property 1, scaling a standard Gaussian variable yields a Gaussian with the correspondingly scaled variance, so:
\[\begin{aligned}
\epsilon \sim \mathcal{N}(0, \mathbf{I})
\quad &\Rightarrow \quad
\sqrt{1 - \alpha_t} \epsilon \sim \mathcal{N}(0, \left(1-\alpha_t\right)\mathbf{I})
\\
\bar{\epsilon} \sim \mathcal{N}(0, \mathbf{I})
\quad &\Rightarrow \quad
\sqrt{\alpha_t \left(1 - \alpha_{t-1}\right)} \bar{\epsilon} \sim \mathcal{N}(0, \alpha_t\left(1-\alpha_{t-1}\right)\mathbf{I})
\end{aligned} \tag{a}\]
Since \(\sqrt{1 - \alpha_t} \epsilon\) and \(\sqrt{\alpha_t \left(1 - \alpha_{t-1}\right)} \bar{\epsilon}\) are independent and both Gaussian, let \(U = \sqrt{1 - \alpha_t} \epsilon + \sqrt{\alpha_t \left(1 - \alpha_{t-1}\right)} \bar{\epsilon}\). By Property 2, \(U\) is also Gaussian:
\[\begin{aligned}
\sqrt{1 - \alpha_t} \epsilon + \sqrt{\alpha_t \left(1 - \alpha_{t-1}\right)} \bar{\epsilon} &\sim \mathcal{N}(0, \left(1-\alpha_t\right)\mathbf{I} +\alpha_t\left(1-\alpha_{t-1}\right)\mathbf{I}) \\ \Rightarrow U & \sim \mathcal{N}(0, \left(1-\alpha_t\alpha_{t-1}\right)\mathbf{I})
\end{aligned} \tag{b}\]
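By Property 1, \(U\) can be written in terms of a standard Gaussian. From here on, \(\epsilon\) is reused to denote this merged noise term, which is again distributed as \(\mathcal{N}(0, \mathbf{I})\):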
\[\begin{aligned}
U &\sim \mathcal{N}(0, \left(1-\alpha_t\alpha_{t-1}\right)\mathbf{I})
\Rightarrow
U = \sqrt{1 - \alpha_t\alpha_{t-1}} \epsilon
\end{aligned} \tag{c}\]
Substituting \((c)\) into \((3)\) gives:
\[\begin{aligned}
& \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t} \epsilon \\
=& \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1 - \alpha_{t-1}} \bar{\epsilon} \right) + \sqrt{1 - \alpha_t} \epsilon \\
=& \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{\alpha_t \left(1 - \alpha_{t-1}\right)} \bar{\epsilon} + \sqrt{1 - \alpha_t} \epsilon \\
=& \sqrt{\alpha_t \alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \epsilon
\end{aligned}\]
Repeating this argument by induction down to \(\mathbf{x}_0\) gives:
\[\begin{aligned}
q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) & =\mathcal{N}\left(\sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}\right) \\
\mathbf{x}_t & =\sqrt{1-\beta_t} \mathbf{x}_{t-1}+\sqrt{\beta_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \\
& =\sqrt{\alpha_t} \mathbf{x}_{t-1}+\sqrt{1-\alpha_t} \epsilon \\
& =\sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2}+\sqrt{1-\alpha_t \alpha_{t-1}} \epsilon \\
& =\ldots \\
& =\sqrt{\bar{\alpha}_t} \mathbf{x}_0+\sqrt{1-\bar{\alpha}_t} \epsilon
\end{aligned}
\]
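Therefore, \(q\left(\mathbf{x}_t \mid \mathbf{x}_{0}\right) = \mathcal{N}\left(\sqrt{\bar{\alpha}_t} \mathbf{x}_{0}, \left(1 - \bar{\alpha}_t\right) \mathbf{I}\right)\), with mean \(\sqrt{\bar{\alpha}_t} \mathbf{x}_{0}\) and covariance \(\left(1 - \bar{\alpha}_t\right) \mathbf{I}\).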
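The closed form can be checked without any sampling by tracking how one forward step changes the coefficient of \(\mathbf{x}_0\) and the accumulated noise variance. The sketch below does this for scalars; the \(\beta\) values are arbitrary and `forward_moments` is an illustrative helper name.

```python
import math

def forward_moments(betas):
    """Apply the one-step rule x_t = sqrt(alpha_t) x_{t-1} + sqrt(beta_t) eps repeatedly.

    Returns (c, v) such that x_t = c * x_0 + noise, where the noise is
    zero-mean Gaussian with variance v.
    """
    c, v = 1.0, 0.0
    for beta in betas:
        alpha = 1.0 - beta
        c = math.sqrt(alpha) * c   # the x_0 coefficient is scaled by sqrt(alpha_t)
        v = alpha * v + beta       # old noise is scaled, fresh independent noise is added
    return c, v

betas = [0.01, 0.02, 0.05, 0.1]    # arbitrary illustrative values
c, v = forward_moments(betas)
alpha_bar = math.prod(1.0 - b for b in betas)

# Both comparisons should hold up to floating-point rounding.
print(abs(c - math.sqrt(alpha_bar)) < 1e-12, abs(v - (1.0 - alpha_bar)) < 1e-12)  # True True
```

The variance update `v = alpha * v + beta` is the recursive version of step \((b)\): it keeps \(v = 1 - \bar{\alpha}_t\) at every step.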
Derivation 2
In diffusion models, \(q\) is defined to be Gaussian, so \(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)\) is written as:
\[\begin{aligned}
q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) & =\mathcal{N}\left(\mathbf{x}_{t-1} ;
\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right), \tilde{\beta}_t \mathbf{I}\right)
\end{aligned}
\]
How are \(\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right)\) and \(\tilde{\beta}_t\) obtained?
The result is stated first; the detailed derivation follows.
\[\begin{aligned}
\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right) &:= \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1-\bar{\alpha}_t} \mathbf{x}_0+\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{x}_t, \\ \tilde{\beta}_t &:= \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t
\end{aligned}
\]
Solution:
By Bayes' rule, \(q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)\) can be rewritten as:
\[q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)
=q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0\right) \frac{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_0\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)}\tag{1}\]
Because diffusion is modeled as a Markov chain, each state depends only on the previous state, so
\[q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0\right) = q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right)
\]
Equation \((1)\) therefore becomes \((2)\):
\[\begin{aligned}
q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right)
&=q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_0\right) \frac{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_0\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)} \\
&=q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) \frac{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_0\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)}
\end{aligned}
\tag{2}\]
From the result of Derivation 1:
\[\begin{aligned}
q\left(\mathbf{x}_t \mid \mathbf{x}_{0}\right) &= \mathcal{N}\left(\sqrt{\bar{\alpha}_t} \mathbf{x}_{0}, \left(1 - \bar{\alpha}_t\right) \mathbf{I}\right) \\
q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_{0}\right) &= \mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_{0}, \left(1 - \bar{\alpha}_{t-1}\right) \mathbf{I}\right)
\end{aligned}\]
Writing out the Gaussian probability density functions in \((2)\) gives:
\[\begin{aligned}
q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) & = q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) \frac{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_0\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)} \\ & \propto \exp \left(-\frac{1}{2}\left(\frac{\left(\mathbf{x}_t-\sqrt{\alpha_t} \mathbf{x}_{t-1}\right)^2}{\beta_t}+\frac{\left(\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0\right)^2}{1-\bar{\alpha}_{t-1}}-\frac{\left(\mathbf{x}_t-\sqrt{\bar{\alpha}_t} \mathbf{x}_0\right)^2}{1-\bar{\alpha}_t}\right)\right) \end{aligned}\tag{3}\]
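The normalizing constants depend only on \(\beta_t\) and \(\bar{\alpha}_t\), which are not random variables, so they are absorbed into the proportionality sign. The goal is to express the distribution of \(\mathbf{x}_{t-1}\) in terms of the random variables \(\mathbf{x}_0\) and \(\mathbf{x}_{t}\). Expanding \((3)\) further gives \((4)\):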
\[\begin{aligned} &q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0\right) = q\left(\mathbf{x}_t \mid \mathbf{x}_{t-1}\right) \frac{q\left(\mathbf{x}_{t-1} \mid \mathbf{x}_0\right)}{q\left(\mathbf{x}_t \mid \mathbf{x}_0\right)} \\ & \propto \exp \left(-\frac{1}{2}\left(\frac{\left(\mathbf{x}_t-\sqrt{\alpha_t} \mathbf{x}_{t-1}\right)^2}{\beta_t}+\frac{\left(\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0\right)^2}{1-\bar{\alpha}_{t-1}}-\frac{\left(\mathbf{x}_t-\sqrt{\bar{\alpha}_t} \mathbf{x}_0\right)^2}{1-\bar{\alpha}_t}\right)\right) \\
&=\exp \left(-\frac{1}{2}\left(\frac{\mathbf{x}_t^2-2 \sqrt{\alpha_t} \mathbf{x}_t \mathbf{x}_{t-1}+\alpha_t \mathbf{x}_{t-1}^2}{\beta_t}+\frac{\mathbf{x}_{t-1}^2-2 \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0 \mathbf{x}_{t-1}+\bar{\alpha}_{t-1} \mathbf{x}_0^2}{1-\bar{\alpha}_{t-1}}-\frac{\left(\mathbf{x}_t-\sqrt{\bar{\alpha}_t} \mathbf{x}_0\right)^2}{1-\bar{\alpha}_t}\right)\right) \\ &=\exp \left(-\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right) \mathbf{x}_{t-1}^2-\left(\frac{2 \sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t+\frac{2 \sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}} \mathbf{x}_0\right) \mathbf{x}_{t-1}+C\left(\mathbf{x}_t, \mathbf{x}_0\right)\right)\right) \end{aligned} \tag{4}\]
Here, the second-to-last equality expands the squares from the previous step. The last equality collects terms in \(\mathbf{x}_{t-1}\), treating \(\mathbf{x}_0\) and \(\mathbf{x}_{t}\) as parameters, so that completing the square yields the exponent of a Gaussian density, of the form \(-\frac{(\mathbf{x}_{t-1}-\tilde{\boldsymbol{\mu}}_t)^2}{2 \tilde{\beta}_t}\). Matching coefficients, the coefficient of \(\mathbf{x}_{t-1}^2\) equals \(1/\tilde{\beta}_t\) and the coefficient of \(\mathbf{x}_{t-1}\) equals \(-2\tilde{\boldsymbol{\mu}}_t/\tilde{\beta}_t\). Therefore:
\[\begin{aligned}
\tilde{\boldsymbol{\mu}}_t &= \frac{1}{\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}} \cdot \left(\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t+\frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}} \mathbf{x}_0\right) \\
&= \frac{\left(1 - \bar{\alpha}_{t-1}\right) \beta_{t}}{\alpha_t\left(1 - \bar{\alpha}_{t-1}\right) + \beta_t} \cdot \left(\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t+\frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}} \mathbf{x}_0\right) \\
& = \frac{\left(1 - \bar{\alpha}_{t-1}\right)\sqrt{\alpha_t}}{\alpha_t\left(1 - \bar{\alpha}_{t-1}\right) + \beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_{t}}{\alpha_t\left(1 - \bar{\alpha}_{t-1}\right) + \beta_t} \mathbf{x}_0
\end{aligned}\tag{5}\]
Since \(\alpha_t = 1 - \beta_t\):
\[\begin{aligned}
\alpha_t\left(1 - \bar{\alpha}_{t-1}\right) + \beta_t &= \alpha_t - \alpha_t \bar{\alpha}_{t-1} + \beta_t \\
&= 1 - \beta_t - \alpha_t \bar{\alpha}_{t-1} + \beta_t \\
&= 1 - \alpha_t \bar{\alpha}_{t-1} \\
&= 1 - \bar{\alpha}_{t}
\end{aligned}\tag{6}\]
Substituting \((6)\) into \((5)\) gives:
\[\tilde{\boldsymbol{\mu}}_t\left(\mathbf{x}_t, \mathbf{x}_0\right) :=\frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1-\bar{\alpha}_t} \mathbf{x}_0+\frac{\sqrt{\alpha_t}\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t} \mathbf{x}_t
\]
For \(\tilde{\beta}_t\):
\[\begin{aligned}
\tilde{\beta}_t &= \frac{1}{\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}} \\
&= \frac{\left(1 - \bar{\alpha}_{t-1}\right) \beta_{t}}{\alpha_t\left(1 - \bar{\alpha}_{t-1}\right) + \beta_t} \\
&= \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t
\end{aligned}\]
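As a numerical cross-check of the algebra, the sketch below computes the posterior mean and variance twice for scalar inputs: once from the simplified closed forms, and once directly from the coefficients of the completed square in \((4)\). The function names and test values are illustrative assumptions, and \(\bar{\alpha}_{t-1} < 1\) is required (that is, \(t \geq 2\)).

```python
import math

def posterior_params(x_t, x_0, alpha_t, alpha_bar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from the simplified closed forms."""
    beta_t = 1.0 - alpha_t
    alpha_bar_t = alpha_t * alpha_bar_prev
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    mean = coef_x0 * x_0 + coef_xt * x_t
    var = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t
    return mean, var

def posterior_params_from_square(x_t, x_0, alpha_t, alpha_bar_prev):
    """Same quantities read directly off the quadratic in x_{t-1}."""
    beta_t = 1.0 - alpha_t
    precision = alpha_t / beta_t + 1.0 / (1.0 - alpha_bar_prev)   # coefficient of x_{t-1}^2
    linear = (math.sqrt(alpha_t) / beta_t * x_t
              + math.sqrt(alpha_bar_prev) / (1.0 - alpha_bar_prev) * x_0)
    return linear / precision, 1.0 / precision

a = posterior_params(0.3, -1.2, alpha_t=0.98, alpha_bar_prev=0.5)
b = posterior_params_from_square(0.3, -1.2, alpha_t=0.98, alpha_bar_prev=0.5)
print(a, b)  # the two pairs should agree up to rounding
```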
This completes the mathematical derivations that are omitted in \(\text{DDPM}\).
Papers
- Deep unsupervised learning using nonequilibrium thermodynamics, 2015.
- Denoising diffusion probabilistic models, 2020.
- Improved denoising diffusion probabilistic models, 2021.