[Autograd] Understanding Backpropagation and Automatic Differentiation in Depth

"All numerical computation is ultimately a composition of a finite set of differentiable operators."
(from "An introduction to automatic differentiation")


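Derivatives of Symbolic Expressions

Deep Learning, Chap. 6.5.5

Algebraic expressions and computational graphs both operate on symbols, that is, variables that do not have specific values. These algebraic and graph-based expressions are called symbolic representations.
When we actually use or train a neural network, we must assign values to these symbols. We replace the symbolic input $x$ of the network with a specific numeric value, for example $[1.2, 3.765, -1.8]^\top$.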

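Symbol-to-number differentiation
Some approaches to backpropagation take a computational graph and a set of numeric values for the graph's inputs, then return a set of numeric values giving the gradient at those input values. We call this approach "symbol-to-number" differentiation. It is used in libraries such as Torch (Collobert et al., 2011b) and Caffe (Jia, 2013).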

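Symbol-to-symbol differentiation
Another approach is to take a computational graph and add extra nodes to it that provide a symbolic description of the desired derivatives. This is the approach taken by Theano (Bergstra et al., 2010b; Bastien et al., 2012b) and TensorFlow (Abadi et al., 2015). Figure 6.10 of the book gives an example of how it works. The main advantage of this approach is that the derivatives are described in the same language as the original expression.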

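Automatic differentiation (automatic gradient) as implemented in TensorFlow:
TensorFlow uses the backward pass and the chain rule to build a gradient graph that corresponds to the original computational graph. Because the derivatives are just another computational graph, backpropagation can be run again on them, differentiating the derivatives to obtain higher-order derivatives. (This is the approach we focus on here, so the next few sections briefly review the backpropagation algorithm and the chain rule.)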
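The idea that "the derivative is just another graph" can be illustrated with a minimal sketch in plain Python. The `Node`, `var`, `const`, `evaluate` and `grad` names are hypothetical and have nothing to do with TensorFlow's actual classes; only addition and multiplication are supported. `grad` returns a new graph, so it can be applied to its own output to get a second derivative.

```python
class Node:
    """A node in a tiny symbolic expression graph (hypothetical, for illustration)."""

    def __init__(self, op, args=(), value=None, name=None):
        self.op, self.args, self.value, self.name = op, args, value, name

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))


def var(name):
    return Node("var", name=name)


def const(value):
    return Node("const", value=value)


def evaluate(node, env):
    """Substitute numeric values for the symbols and compute the result."""
    if node.op == "var":
        return env[node.name]
    if node.op == "const":
        return node.value
    a, b = (evaluate(arg, env) for arg in node.args)
    return a + b if node.op == "add" else a * b


def grad(node, wrt):
    """Symbol-to-symbol step: build a NEW graph describing d(node)/d(wrt)."""
    if node.op == "var":
        return const(1.0 if node is wrt else 0.0)
    if node.op == "const":
        return const(0.0)
    a, b = node.args
    if node.op == "add":
        return grad(a, wrt) + grad(b, wrt)
    # Product rule for "mul".
    return grad(a, wrt) * b + a * grad(b, wrt)


x = var("x")
y = x * x * x              # y = x^3
dy = grad(y, x)            # a graph equivalent to 3 x^2
d2y = grad(dy, x)          # differentiate the derivative graph again: 6 x

first = evaluate(dy, {"x": 3.0})
second = evaluate(d2y, {"x": 3.0})
print(first, second)       # 27.0 18.0
```

This naive version never simplifies the graphs it builds, so the derivative graphs grow quickly; a real system would share and prune subexpressions.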

Backpropagation algorithm

http://neuralnetworksanddeeplearning.com/chap2.html

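Input $x$: Set the corresponding activation $a^1$ for the input layer.
Feedforward: For each $l = 2, 3, \dots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
Output error $\delta^L$: Compute the vector $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
Backpropagate the error: For each $l = L-1, L-2, \dots, 2$ compute $\delta^l = ((w^{l+1})^\top \delta^{l+1}) \odot \sigma'(z^l)$.
Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.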
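The five steps above can be sketched in plain Python as follows. This is only an illustration under assumptions not stated in the text: a sigmoid activation, the quadratic cost $C = \frac{1}{2}\lVert a^L - y \rVert^2$ (so $\nabla_a C = a^L - y$), layer sizes `[2, 3, 1]`, and helper names chosen for this example. The last lines compare one entry of the result with a central-difference estimate.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))


def feedforward(weights, biases, x):
    """Return the weighted inputs z^l and activations a^l of every layer."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, activations[-1])) + b_j
             for row, b_j in zip(w, b)]
        zs.append(z)
        activations.append([sigmoid(v) for v in z])
    return zs, activations


def cost(weights, biases, x, y):
    """Quadratic cost, an assumed choice for this sketch."""
    _, activations = feedforward(weights, biases, x)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(activations[-1], y))


def backprop(weights, biases, x, y):
    zs, activations = feedforward(weights, biases, x)
    # Output error: delta^L = grad_a C (elementwise) sigma'(z^L).
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(activations[-1], y, zs[-1])]
    grad_w, grad_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        # dC/dw_jk = a_k (previous layer) * delta_j, and dC/db_j = delta_j.
        grad_w.insert(0, [[a_k * d_j for a_k in activations[l]] for d_j in delta])
        grad_b.insert(0, list(delta))
        if l > 0:
            # Propagate the error: (w^T delta) (elementwise) sigma'(z) of the layer below.
            delta = [sum(weights[l][j][k] * delta[j] for j in range(len(delta)))
                     * sigmoid_prime(zs[l - 1][k]) for k in range(len(zs[l - 1]))]
    return grad_w, grad_b


random.seed(0)
sizes = [2, 3, 1]
weights = [[[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[random.gauss(0, 1) for _ in range(n_out)] for n_out in sizes[1:]]
x, y = [0.5, -0.2], [1.0]

grad_w, grad_b = backprop(weights, biases, x, y)

# Gradient check on a single weight with a central difference.
eps = 1e-6
weights[0][1][0] += eps
c_plus = cost(weights, biases, x, y)
weights[0][1][0] -= 2 * eps
c_minus = cost(weights, biases, x, y)
weights[0][1][0] += eps
numeric = (c_plus - c_minus) / (2 * eps)
print(abs(numeric - grad_w[0][1][0]) < 1e-7)
```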

Automatic Differentiation

CSE599G1: Deep Learning System (Tianqi Chen)

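There are roughly four ways to compute derivatives:

Manual Differentiation
- Derive the gradient formula by hand, write code for it, and plug in the actual values to obtain the gradient. With this approach, every change to the model requires a matching change to the gradient code, so there is no good way to free users from writing gradient code by hand.
Numerical Differentiation
- It cannot eliminate truncation error completely; it can only reduce it. It is so easy to implement, however, that it is often used to verify other algorithms. The "gradient check" used when implementing backprop is exactly numerical differentiation.
Symbolic Differentiation
- It suffers from "expression swell": without care, the expressions produced by symbolic differentiation grow rapidly and evaluation slows down, as shown in Table 1 at the end of this subsection.
Automatic Differentiation
- Automatic differentiation sits between symbolic and numerical differentiation. Numerical differentiation plugs in numbers from the start and approximates the result; symbolic differentiation manipulates the algebra directly and plugs in numbers only at the end. Automatic differentiation applies symbolic rules only to the most basic operators (constants, powers, exponentials, logarithms, trigonometric functions and so on), then plugs in numbers, keeps the intermediate results, and finally combines them for the whole function. This makes it very flexible, and the differentiation process can be hidden from the user entirely. Because symbolic rules are applied only to elementary functions and constants, it combines naturally with loops, conditionals and other control structures of the programming language, and code that uses automatic differentiation differs very little from code that does not. Since the computation is really a graph computation, many optimizations can be applied to it, which is why the method is so widely used in modern deep learning systems.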
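To make the contrast between the numerical and the automatic approach concrete, here is a small sketch in plain Python. It uses forward-mode automatic differentiation with dual numbers, which is one possible way to realize the idea and not necessarily what the lecture describes; the `Dual` class and the test function `f` are made up for this example. The derivative rule is hard-coded only for the elementary operations, and `f` uses an ordinary Python loop.

```python
import math


class Dual:
    """A value together with its derivative (forward-mode AD, illustrative)."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule applied to one elementary multiplication.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def sin(x):
    """Elementary operator with its symbolic rule d sin(u) = cos(u) du built in."""
    if isinstance(x, Dual):
        return Dual(math.sin(x.val), math.cos(x.val) * x.dot)
    return math.sin(x)


def f(x):
    # Ordinary control flow; the same code runs on floats and on Dual numbers.
    y = x
    for _ in range(3):
        y = y * x + sin(y)
    return y


x0 = 0.7
ad = f(Dual(x0, 1.0)).dot                      # automatic differentiation
h = 1e-5
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)    # numerical differentiation
print(ad, numeric)
```

The two numbers agree to several digits; the numerical one carries truncation and round-off error that depends on `h`, while the automatic one is exact up to floating-point arithmetic.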

Backpropagation vs AutoDiff (reverse)

CSE599G1 DeepLearning System Lecture4 —— [Slides View](https://okcd00.github.io/assets/CSE599G1 DeepLearning System Lecture4.pdf)

We can take derivative of derivative nodes in autodiff, while it’s much harder to do so in backprop.
In autodiff, there’s only a forward pass (vs. forward-backward in backprop). So it’s easier to apply graph and schedule optimization to a single graph.
In backprop, all intermediate results might be used in the future, so we need to keep these values in the memory. On the other hand, in autodiff, we already know the dependencies of the backward graph, so we can have better memory optimization.

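The Jacobian and the Chain Rule

Deep Learning, Chap. 6.5.2
This section quotes extensively from the open-source community's Chinese translation of the Deep Learning book:
https://github.com/exacity/deeplearningbook-chinese

The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of composite functions. Backpropagation is an algorithm that computes the chain rule using a specific, highly efficient order of operations.
Let $x$ be a real number, and let $f$ and $g$ be functions mapping real numbers to real numbers. Suppose $y = g(x)$ and $z = f(g(x)) = f(y)$. Then the chain rule states that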

$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx}.$$

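We can generalize this beyond the scalar case. Suppose $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$, $g$ maps from $\mathbb{R}^m$ to $\mathbb{R}^n$, and $f$ maps from $\mathbb{R}^n$ to $\mathbb{R}$. If $y = g(x)$ and $z = f(y)$, then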

$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}.$$

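In vector notation, this can be written equivalently as

$$\nabla_x z = \left(\frac{\partial y}{\partial x}\right)^\top \nabla_y z,$$

where $\frac{\partial y}{\partial x}$ is the $n \times m$ Jacobian matrix of $g$.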

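We usually apply backpropagation not just to vectors but to tensors of arbitrary dimensionality. Conceptually, this is exactly the same as backpropagation with vectors. The only difference is how the numbers are arranged in a grid to form a tensor.
We can imagine flattening each tensor into a vector before running backpropagation, computing a vector-valued gradient, and then reshaping that gradient back into a tensor. In this rearranged view, backpropagation is still just multiplying Jacobians by gradients.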

If $Y = g(X)$ and $z = f(Y)$, then

$$\nabla_X z = \sum_j \left(\nabla_X Y_j\right) \frac{\partial z}{\partial Y_j}.$$
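As a worked instance of the vector form of the chain rule, the sketch below multiplies the transposed Jacobian by the gradient with respect to $y$. The functions `g` and `f` are arbitrary choices made for this example ($m = 2$, $n = 3$), and both the Jacobian and the gradient of `f` are derived by hand.

```python
def g(x):
    """g: R^2 -> R^3 (an arbitrary example function)."""
    x1, x2 = x
    return [x1 * x2, x1 + x2, x1 ** 2]


def f(y):
    """f: R^3 -> R, the sum of squares."""
    return sum(v ** 2 for v in y)


def jacobian_g(x):
    """The n x m Jacobian dy/dx of g, derived by hand (row j, column i)."""
    x1, x2 = x
    return [[x2, x1],
            [1.0, 1.0],
            [2 * x1, 0.0]]


def grad_f(y):
    """Gradient of f with respect to y."""
    return [2 * v for v in y]


def chain_rule_grad(x):
    jac, grad_y = jacobian_g(x), grad_f(g(x))
    # Entry i of (dy/dx)^T grad_y z is sum_j (dy_j/dx_i) * (dz/dy_j).
    return [sum(jac[j][i] * grad_y[j] for j in range(len(grad_y)))
            for i in range(len(x))]


x = [1.5, -2.0]
analytic = chain_rule_grad(x)
print(analytic)   # [24.5, -10.0]
```

The result matches differentiating $z = x_1^2 x_2^2 + (x_1 + x_2)^2 + x_1^4$ directly at the same point.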

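The backpropagation algorithm then becomes very simple:
To compute the gradient of some scalar $z$ with respect to one of its ancestors $x$ in the graph, we first observe that the gradient of $z$ with respect to itself is $\frac{dz}{dz} = 1$. We can then compute the gradient with respect to each parent of $z$ in the graph by multiplying the current gradient by the Jacobian of the operation that produced $z$. We keep multiplying by Jacobians, travelling backwards through the graph in this way, until we reach $x$. For any node that can be reached from $z$ by going backwards along two or more paths, we simply sum the gradients arriving at that node from the different paths.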
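The procedure just described can be sketched for scalar graphs in a few lines of plain Python. The `Var` class and `backward` function are hypothetical names for this illustration; in the scalar case each "Jacobian" is a single local derivative. Note how `x` is reached from `z` along more than one path and its gradient is accumulated with `+=`.

```python
class Var:
    """Scalar graph node that records its parents and local derivatives."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # tuples of (parent node, d self / d parent)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(z):
    """Set node.grad = dz/dnode for z and every ancestor of z."""
    # Post-order DFS; reversed, it visits each node after all of its consumers.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(z)
    z.grad = 1.0                                   # dz/dz = 1
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local       # sum over incoming paths


x = Var(3.0)
y = Var(4.0)
a = x * y          # one path from x towards z
b = x + a          # x is used again, directly and through a
z = a * b          # z = x*y * (x + x*y)
backward(z)
print(x.grad, y.grad)   # 120.0 81.0
```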

TensorFlow's Automatic Differentiation Implementation

Symbolic differentiation in TensorFlow is implemented in tensorflow/python/ops/gradients_impl.py in the project tree:
“Constructs symbolic derivatives of sum of ys w.r.t. x in xs”
[db, dW, dx] = tf.gradients(C, [b,W,x])

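According to the Deep Learning book, Theano and TensorFlow build grad_table with a subroutine like the algorithm shown in the figure below. Section 5 of the TensorFlow white paper explains that grad_table stores the values $\partial u^{(n)} / \partial u^{(i)}$, which would otherwise be recomputed many times, in order to remove redundant computation and improve efficiency:

If a tensor $C$ in a TensorFlow graph depends, perhaps through a complex subgraph of operations, on some set of tensors $X_k$, then there is a built-in function that will return the tensors $dC/dX_k$.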

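Each operation op is also associated with a bprop operation. This bprop operation can compute a Jacobian-vector product as described by the equation above. This is how the backpropagation algorithm achieves great generality. Each operation is responsible for knowing how to backpropagate through the edges in the graph that it participates in. The backpropagation algorithm itself does not need to know any differentiation rules. It only needs to call each operation's bprop method with the right arguments. Formally, op.bprop(inputs, X, G) must return

$$\sum_i \left(\nabla_X\, \mathrm{op.f}(\mathrm{inputs})_i\right) G_i,$$

Here, inputs is the list of inputs supplied to the operation, op.f is the mathematical function that the operation implements, X is the input whose gradient we wish to compute, and G is the gradient on the output of the operation.

The op.bprop method should always pretend that all of its inputs are distinct from each other, even if they are not. For example, if the mul operator is passed two copies of $x$ to compute $x^2$, the op.bprop method should still return $x$ as the derivative with respect to both inputs. The backpropagation algorithm will later add both of these together to obtain $2x$, which is the correct total derivative on $x$.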
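The mul example above can be written out as a short sketch. The `Mul` class is hypothetical and simplifies the `op.bprop(inputs, X, G)` signature from the text: instead of the variable `X` it takes the position `index` of the input being differentiated, which is an assumption made to keep the example scalar and short.

```python
class Mul:
    """Hypothetical scalar op exposing f and bprop as described in the text."""

    @staticmethod
    def f(inputs):
        a, b = inputs
        return a * b

    @staticmethod
    def bprop(inputs, index, G):
        """Gradient w.r.t. inputs[index], treating the two inputs as distinct."""
        a, b = inputs
        return G * (b if index == 0 else a)


x = 3.0
inputs = (x, x)          # the same x is passed twice, so the op computes x * x
G = 1.0                  # gradient on the output of the op

# bprop does not know that both inputs are the same variable: it returns x twice.
per_input = [Mul.bprop(inputs, i, G) for i in range(2)]

# The surrounding backpropagation algorithm adds the contributions: 2 * x.
total = sum(per_input)
print(per_input, total)  # [3.0, 3.0] 6.0
```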

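Software implementations of backpropagation usually provide both the operations and their bprop methods, so that users of deep learning libraries can backpropagate through graphs built from common operations such as matrix multiplication, exponentials and logarithms. Software engineers who build a new implementation of backpropagation, or advanced users who need to add their own operation to an existing library, usually have to derive the op.bprop method for the new operation by hand.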

Take one TensorFlow commit, "Register log1p in math_ops.", as an example:

// File: tensorflow/core/ops/math_ops.cc
// Registers the Log1p op, declares it as a unary op, and attaches its doc string

REGISTER_OP("Log1p")
    .UNARY_COMPLEX()
    .Doc(R"doc(
Computes natural logarithm of (1 + x) element-wise.
I.e., \\(y = \log_e (1 + x)\\).
)doc");

# File: tensorflow/python/ops/math_grad.py
# log1p computes the natural logarithm of (1 + x)

@ops.RegisterGradient("Log1p")
def _Log1pGrad(op, grad):
  """Returns grad * (1/(1 + x))."""
  x = op.inputs[0]
  with ops.control_dependencies([grad.op]):
    x = math_ops.conj(x)
    return grad * math_ops.inv(1 + x)
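The formula registered above, `grad * 1 / (1 + x)`, can be sanity-checked for a real scalar without TensorFlow. This sketch uses only the Python standard library; `log1p_grad` is a hypothetical stand-in for `_Log1pGrad`, not the TensorFlow function itself, and the complex-conjugate step is omitted because the input is real.

```python
import math


def log1p_grad(x, grad):
    """Same formula as _Log1pGrad for a real scalar: grad * 1 / (1 + x)."""
    return grad * (1.0 / (1.0 + x))


x, upstream = 0.25, 2.0     # upstream plays the role of `grad` in the op
eps = 1e-6

# Central-difference estimate of upstream * d/dx log(1 + x).
numeric = upstream * (math.log1p(x + eps) - math.log1p(x - eps)) / (2 * eps)
analytic = log1p_grad(x, upstream)
print(analytic, numeric)
```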

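Because of repeated subexpressions, naive algorithms may have exponential runtime. Now that we have specified the backpropagation algorithm, we can understand its computational cost:
For platforms like Theano and TensorFlow, the backpropagation algorithm adds one Jacobian-vector product, which can be expressed with $O(1)$ nodes, per edge in the original graph. Because the computational graph is a directed acyclic graph, it has at most $O(n^2)$ edges.
For the kinds of graphs commonly used in practice, the situation is even better: most neural network cost functions are roughly chain-structured, so backpropagation costs only $O(n)$. This is far better than the naive approach, which might need to execute exponentially many nodes. This potentially exponential cost can be seen by expanding and rewriting the recursive chain rule non-recursively: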

$$\frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{\substack{\text{path } (u^{(\pi_1)}, u^{(\pi_2)}, \dots, u^{(\pi_t)}), \\ \text{from } \pi_1 = j \text{ to } \pi_t = n}} \; \prod_{k=2}^{t} \frac{\partial u^{(\pi_k)}}{\partial u^{(\pi_{k-1})}}.$$

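Since the number of paths from node $j$ to node $n$ can grow exponentially in the length of these paths, the number of terms in the sum above, which is the number of such paths, can grow exponentially with the depth of the forward propagation graph. This large cost arises because the same computation for $\partial u^{(i)} / \partial u^{(j)}$ would be repeated many times. To avoid such recomputation, we can think of backpropagation as a table-filling algorithm that takes advantage of the stored intermediate results $\partial u^{(n)} / \partial u^{(i)}$. Each node in the graph has a corresponding slot in the table that stores the gradient for that node. By filling in these table entries in order, backpropagation avoids repeating many common subexpressions. This table-filling strategy is sometimes called dynamic programming.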
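The difference between summing over every path and filling a table can be seen on a small synthetic graph. The sketch below is an illustration in plain Python with made-up names: `build_ladder` creates a DAG with two nodes per layer, each feeding both nodes of the next layer, so the number of paths doubles with every layer. All local derivatives are set to 0.5, an arbitrary choice.

```python
def build_ladder(depth):
    """Return (consumers, out) where consumers[j] lists (i, du_i/du_j) pairs.

    Node 0 is the input; layer k holds nodes 2k-1 and 2k; `out` is the output.
    """
    consumers = {0: [(1, 0.5), (2, 0.5)]}
    for layer in range(1, depth):
        lo, hi = 2 * layer - 1, 2 * layer
        for j in (lo, hi):
            consumers[j] = [(lo + 2, 0.5), (hi + 2, 0.5)]
    out = 2 * depth + 1
    consumers[2 * depth - 1] = [(out, 0.5)]
    consumers[2 * depth] = [(out, 0.5)]
    consumers[out] = []
    return consumers, out


def grad_naive(consumers, j, out, counter):
    """Sum over every path from j to out; shared sub-paths are recomputed."""
    counter[0] += 1
    if j == out:
        return 1.0
    return sum(local * grad_naive(consumers, i, out, counter)
               for i, local in consumers[j])


def grad_table(consumers, out):
    """Table filling: each node's gradient is computed once and stored."""
    table = {out: 1.0}
    # Node ids are topologically ordered, so descending order visits consumers first.
    for j in sorted(consumers, reverse=True):
        if j != out:
            table[j] = sum(local * table[i] for i, local in consumers[j])
    return table


consumers, out = build_ladder(depth=10)
counter = [0]
naive = grad_naive(consumers, 0, out, counter)
table = grad_table(consumers, out)
print(naive, table[0])          # both give the same gradient
print(counter[0], len(table))   # recursive calls vs. table entries
```

With 10 layers the recursive version makes thousands of calls while the table has only 22 entries, and the gap doubles with every extra layer.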

The AutoDiff figures above are taken from: http://dlsys.cs.washington.edu/pdf/lecture4.pdf

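Higher-Order Derivatives

Some software frameworks support higher-order derivatives. Among deep learning frameworks, this includes at least Theano and TensorFlow. These libraries use the same kind of data structure to describe the expressions for derivatives as they use to describe the original function being differentiated. This means that the symbolic differentiation machinery can be applied to derivatives, which yields higher-order derivatives.

The Hessian matrix is a square matrix of the second-order partial derivatives of a multivariate function, and it describes the local curvature of the function. It was first proposed in the 19th century by the German mathematician Ludwig Otto Hesse and is named after him. The Hessian matrix is commonly used in Newton's method for optimization problems, and it can be used to classify the extrema of multivariate functions. (Source: Baidu Baike)

In the context of deep learning, it is rare to compute a single second derivative of a scalar function. Instead, we are usually interested in properties of the Hessian matrix. If we have a function $f: \mathbb{R}^n \to \mathbb{R}$, then the Hessian matrix is of size $n \times n$. In typical deep learning applications, $n$ is the number of parameters in the model, which could easily reach the billions. The entire Hessian matrix is therefore infeasible to even represent.

Instead of explicitly computing the Hessian, the typical deep learning approach is to use Krylov methods. Krylov methods are a set of iterative techniques for performing operations such as approximately inverting a matrix or approximating its eigenvalues or eigenvectors, without using any operation other than matrix-vector products.

To use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix $H$ and an arbitrary vector $v$ (both gradient computations in this expression can be carried out automatically by an appropriate software library):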

$$H v = \nabla_x \left[ \left( \nabla_x f(x) \right)^\top v \right]$$

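Although computing the full Hessian is usually not advisable, it can be done with Hessian-vector products. One simply computes $H e^{(i)}$ for all $i = 1, \dots, n$, where $e^{(i)}$ is the one-hot vector with $e^{(i)}_i = 1$ and all other entries equal to 0. (From reading the source code, we found that the Hessian-vector product $Hv$ was not yet in an available state; the TensorFlow version current at the time of writing only implemented the full Hessian matrix.)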
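The identity $Hv = \nabla_x[(\nabla_x f(x))^\top v]$ can be tried out on a small function. The sketch below is not how a deep learning library would do it: the inner gradient is derived by hand and the outer gradient is approximated with central differences, both assumptions made so the example needs only the standard library. The function `f` and all helper names are invented for this illustration.

```python
def grad_f(x):
    """Hand-derived gradient of the example f(x) = x1^2 * x2 + x2^3."""
    x1, x2 = x
    return [2 * x1 * x2, x1 ** 2 + 3 * x2 ** 2]


def directional(x, v):
    """The scalar function g(x) = (grad f(x))^T v."""
    return sum(g_i * v_i for g_i, v_i in zip(grad_f(x), v))


def hvp(x, v, eps=1e-5):
    """H v as the gradient of g; the outer gradient uses central differences."""
    result = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += eps
        x_minus = list(x)
        x_minus[i] -= eps
        result.append((directional(x_plus, v) - directional(x_minus, v))
                      / (2 * eps))
    return result


x = [1.0, 2.0]
hv = hvp(x, [1.0, -1.0])

# Recover the full Hessian column by column with one-hot vectors e^(i).
n = len(x)
columns = [hvp(x, [1.0 if j == i else 0.0 for j in range(n)]) for i in range(n)]
print(hv)        # close to [2, -10]
print(columns)   # close to [[4, 2], [2, 12]]
```

At $x = (1, 2)$ the exact Hessian of this `f` is $\begin{pmatrix} 4 & 2 \\ 2 & 12 \end{pmatrix}$, which the recovered columns match up to round-off.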

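Other: Automatic Differentiation in PyTorch

PyTorch provides the torch.autograd package for automatic differentiation. During the forward pass, PyTorch builds a computational graph in which each node is represented by a Variable and each edge represents the function (a torch.autograd.Function object) from input nodes to output nodes. A Function object is responsible not only for the forward computation: during the backward pass, each Function object calls its .backward() function to compute the gradient of the output with respect to the input, and then passes the gradient on to the next Function object.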

How autograd encodes the history (PyTorch)

http://pytorch.org/docs/master/notes/autograd.html#how-autograd-encodes-the-history

Autograd is reverse automatic differentiation system. Conceptually, autograd records a graph recording all of the operations that created the data as you execute operations, giving you a directed acyclic graph whose leaves are the input variables and roots are the output variables. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

Internally, autograd represents this graph as a graph of Function objects (really expressions), which can be apply() ed to compute the result of evaluating the graph. When computing the forwards pass, autograd simultaneously performs the requested computations and builds up a graph representing the function that computes the gradient (the .grad_fn attribute of each Variable is an entry point into this graph). When the forwards pass is completed, we evaluate this graph in the backwards pass to compute the gradients.

An important thing to note is that the graph is recreated from scratch at every iteration, and this is exactly what allows for using arbitrary Python control flow statements, that can change the overall shape and size of the graph at every iteration. You don’t have to encode all possible paths before you launch the training - what you run is what you differentiate.

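Defining a New Operation in PyTorch

Defining a new operation means defining a subclass of Function, and the subclass must override forward() and backward(). Whether the initializer __init__() needs to be overridden depends on the use case.

forward()
forward() can take any number of inputs and return any number of outputs, but the inputs and outputs must be Variables.

backward()
The number of inputs and outputs of backward() equals the number of outputs and inputs of forward(), respectively. The inputs of backward() are the gradients with respect to the outputs of forward(), and the outputs of backward() are the gradients with respect to the inputs of forward(). When an input does not need a gradient (check the needs_input_grad attribute) or is not differentiable, backward() can return None for it.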

Reference: http://blog.csdn.net/victoriaw/article/details/72566249

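import torch
from torch.autograd import Function, Variable

# Inherit from Function
# Note: this is the legacy instance-method style; recent PyTorch releases
# expect static forward/backward methods that take a ctx argument.
class Linear(Function):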

    # bias is an optional argument
    def forward(self, input, weight, bias=None):
        self.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    # This function has only a single output, so it gets only one gradient
    def backward(self, grad_output):
        # This is a pattern that is very convenient - at the top of backward
        # unpack saved_tensors and initialize all gradients w.r.t. inputs to
        # None. Thanks to the fact that additional trailing Nones are
        # ignored, the return statement is simple even when the function has
        # optional inputs.
        input, weight, bias = self.saved_tensors
        grad_input = grad_weight = grad_bias = None

        # These needs_input_grad checks are optional and there only to
        # improve efficiency. If you want to make your code simpler, you can
        # skip them. Returning gradients for inputs that don't require it is
        # not an error.
        if self.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if self.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and self.needs_input_grad[2]:
            grad_bias = grad_output.sum(0).squeeze(0)

        return grad_input, grad_weight, grad_bias

# It is recommended to wrap the new operation in a function
def linear(input, weight, bias=None):
    # First braces create a Function object. Any arguments given here
    # will be passed to __init__. Second braces will invoke the __call__
    # operator, that will then use forward() to compute the result and
    # return it.
    return Linear()(input, weight, bias)  # invokes forward()

# Check that the backward() implementation is correct
from torch.autograd import gradcheck
# gradcheck takes a tuple of tensors as input, checks whether your gradients
# evaluated with these tensors are close enough to numerical
# approximations, and returns True if they all verify this condition.
# Linear needs both an input and a weight, so the tuple carries two tensors.
input = (Variable(torch.randn(20, 20).double(), requires_grad=True),
         Variable(torch.randn(30, 20).double(), requires_grad=True))
test = gradcheck(Linear(), input, eps=1e-6, atol=1e-4)
print(test)
