Theano 0.6 Documentation [6] - Derivatives in Theano [to be completed]

This article is translated and edited by Lancelod_Liu; please keep this line if you share it.

Original: .

Derivatives in Theano

Computing Gradients

Now let's use Theano for a slightly more sophisticated task: create a function which computes the derivative of some expression y with respect to its parameter x. To do this we will use the macro T.grad. For instance, we can compute the gradient of x^2 with respect to x. Note that: d(x^2)/dx = 2 \cdot x.

Here is the code to compute this gradient:

>>> import theano
>>> import theano.tensor as T
>>> from theano import function
>>> from theano import pp
>>> x = T.dscalar('x')
>>> y = x ** 2
>>> gy = T.grad(y, x)
>>> pp(gy)  # print out the gradient prior to optimization
'((fill((x ** 2), 1.0) * 2) * (x ** (2 - 1)))'
>>> f = function([x], gy)
>>> f(4)
array(8.0)
>>> f(94.2)
array(188.40000000000001)

In this example, we can see from pp(gy) that we are computing the symbolic gradient. fill((x ** 2), 1.0) means to make a matrix of the same shape as x ** 2 and fill it with 1.0.

Note

The optimizer simplifies the symbolic gradient expression. You can verify this as follows:

pp(f.maker.fgraph.outputs[0])
'(2.0 * x)'

After optimization there is only one Apply node left in the graph, which multiplies the input by two.

We can also compute the gradient of more complex expressions, such as the logistic function s(x) = 1 / (1 + e^{-x}). Its derivative is:

ds(x)/dx = s(x) \cdot (1 - s(x)).

[Figure ../_images/dlogistic.png: plot of the gradient of the logistic function]

>>> x = T.dmatrix('x')
>>> s = T.sum(1 / (1 + T.exp(-x)))
>>> gs = T.grad(s, x)
>>> dlogistic = function([x], gs)
>>> dlogistic([[0, 1], [-1, -2]])
array([[ 0.25      ,  0.19661193],
       [ 0.19661193,  0.10499359]])

In general, for any scalar expression s, T.grad(s, w) provides the Theano expression for computing the gradient \frac{\partial s}{\partial w}. In this way Theano can be used for doing efficient symbolic differentiation (the expression returned by T.grad will be optimized during compilation), even for functions with many inputs. (See automatic differentiation for a description of symbolic differentiation.)

Note

The second argument of T.grad can be a list, in which case the output is also a list. The order in both lists is important: element i of the output list is the gradient of the first argument of T.grad with respect to the i-th element of the list given as the second argument (in other words, they correspond one-to-one). The first argument of T.grad has to be a scalar (a tensor of size 1). For more information, see Extending Theano.
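
For instance, here is a minimal sketch of passing a list as the second argument; the parameter names w and b and the quadratic cost are illustrative only, not taken from the original tutorial:

>>> import theano
>>> import theano.tensor as T
>>> w = T.dvector('w')
>>> b = T.dscalar('b')
>>> x = T.dvector('x')
>>> cost = (T.dot(x, w) + b) ** 2   # a scalar cost depending on two parameters
>>> gw, gb = T.grad(cost, [w, b])   # one gradient per entry of the list, in order
>>> f = theano.function([x, w, b], [gw, gb])

Here gw is the gradient of cost with respect to w, and gb the gradient with respect to b.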

Computing the Jacobian

In Theano's parlance, the term Jacobian designates the tensor comprising the first partial derivatives of the output of a function with respect to its inputs (the Jacobian matrix of mathematics). Theano implements the theano.gradient.jacobian() macro that does all that is needed to compute the Jacobian. The following explains how to do it manually.
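
As a point of reference, here is a minimal sketch using the theano.gradient.jacobian() macro itself; it builds the same Jacobian as the manual scan version below:

>>> import theano
>>> import theano.tensor as T
>>> x = T.dvector('x')
>>> y = x ** 2
>>> J = theano.gradient.jacobian(y, x)   # the macro builds the Jacobian expression
>>> f = theano.function([x], J)
>>> f([4, 4])
array([[ 8.,  0.],
       [ 0.,  8.]])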

In order to manually compute the Jacobian of some function y with respect to some parameter x, we need to use scan. What we do is loop over the entries in y and compute the gradient of y[i] with respect to x.

Note

scan is a generic op in Theano that allows writing all kinds of recurrent equations in a symbolic manner. Although creating symbolic loops (and optimizing them for performance) is not an easy task, the performance of scan has been optimized. We will come back to scan later.

>>> x = T.dvector('x')
>>> y = x ** 2
>>> J, updates = theano.scan(lambda i, y,x : T.grad(y[i], x), sequences=T.arange(y.shape[0]), non_sequences=[y,x])
>>> f = function([x], J, updates=updates)
>>> f([4, 4])
array([[ 8.,  0.],
       [ 0.,  8.]])

What we do in this code is to generate a sequence of ints from 0 to y.shape[0] using T.arange. Then we loop through this sequence, and at each step, we compute the gradient of element y[i] with respect to x. scan automatically concatenates all these rows, generating a matrix which corresponds to the Jacobian.

Note

There are some pitfalls to be aware of regarding T.grad. One of them is that you cannot re-write the above expression of the Jacobian as theano.scan(lambda y_i, x: T.grad(y_i, x), sequences=y, non_sequences=x), even though from the documentation of scan this seems possible. The reason is that y_i will not be a function of x anymore, while y[i] still is.

Computing the Hessian

In Theano, the term Hessian has its usual mathematical meaning: it is the matrix comprising the second order partial derivatives of a function with scalar output and vector input. Theano implements the theano.gradient.hessian() macro that does all that is needed to compute the Hessian. The following explains how to do it manually.
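
For reference, a minimal sketch using the theano.gradient.hessian() macro itself; the manual construction follows below:

>>> import theano
>>> import theano.tensor as T
>>> x = T.dvector('x')
>>> cost = T.sum(x ** 2)                   # a scalar cost
>>> H = theano.gradient.hessian(cost, x)   # the macro builds the Hessian expression
>>> f = theano.function([x], H)
>>> f([4, 4])
array([[ 2.,  0.],
       [ 0.,  2.]])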

You can compute the Hessian manually in a way similar to the Jacobian. The only difference is that now, instead of computing the Jacobian of some expression y, we compute the Jacobian of T.grad(cost, x), where cost is some scalar.

>>> x = T.dvector('x')
>>> y = x ** 2
>>> cost = y.sum()
>>> gy = T.grad(cost, x)
>>> H, updates = theano.scan(lambda i, gy,x : T.grad(gy[i], x), sequences=T.arange(gy.shape[0]), non_sequences=[gy, x])
>>> f = function([x], H, updates=updates)
>>> f([4, 4])
array([[ 2.,  0.],
       [ 0.,  2.]])

Jacobian times a Vector

Sometimes we can express the algorithm in terms of Jacobians times vectors, or vectors times Jacobians. Compared to evaluating the Jacobian and then doing the product, there are methods that compute the desired results while avoiding actual evaluation of the Jacobian. This can bring about significant performance gains. A description of one such algorithm can be found here:

  • Barak A. Pearlmutter, “Fast Exact Multiplication by the Hessian”, Neural Computation, 1994

While in principle we would want Theano to identify these patterns automatically for us, in practice, implementing such optimizations in a generic manner is extremely difficult. Therefore, we provide special functions dedicated to these tasks.

R-operator

The R operator is built to evaluate the product between a Jacobian and a vector, namely \frac{\partial f(x)}{\partial x} v. The formulation can be extended even to the case where x is a matrix, or a tensor in general, in which case the Jacobian also becomes a tensor and the product becomes some kind of tensor product. Because in practice we end up needing to compute such expressions in terms of weight matrices, Theano supports this more generic form of the operation. In order to evaluate the R-operation of expression y with respect to x, multiplying the Jacobian by v, you need to do something similar to this:

>>> W = T.dmatrix('W')
>>> V = T.dmatrix('V')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)
>>> JV = T.Rop(y, W, V)
>>> f = theano.function([W, V, x], JV)
>>> f([[1, 1], [1, 1]], [[2, 2], [2, 2]], [0,1])
array([ 2.,  2.])

See the list of ops that implement Rop in the Theano documentation.

L-operator

Analogous to the R-operator, the L-operator computes a row vector times the Jacobian. The mathematical formula would be v \frac{\partial f(x)}{\partial x}. The L-operator is also supported for generic tensors (not only for vectors). Similarly, it can be implemented as follows:

>>> W = T.dmatrix('W')
>>> v = T.dvector('v')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)
>>> VJ = T.Lop(y, W, v)
>>> f = theano.function([v,x], VJ)
>>> f([2, 2], [0, 1])
array([[ 0.,  0.],
       [ 2.,  2.]])

Note

v, the point of evaluation, differs between the L-operator and the R-operator. For the L-operator, the point of evaluation needs to have the same shape as the output, whereas for the R-operator this point should have the same shape as the input parameter. Furthermore, the results of these two operations differ. The result of the L-operator is of the same shape as the input parameter, while the result of the R-operator has a shape similar to that of the output.
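
A minimal sketch restating this shape convention with the symbolic variables used in the two examples above (nothing new is computed here; only the shapes are the point):

>>> import theano.tensor as T
>>> W = T.dmatrix('W')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)       # the output is a vector
>>> V = T.dmatrix('V')    # point for the R-operator: same shape as the input W
>>> v = T.dvector('v')    # point for the L-operator: same shape as the output y
>>> JV = T.Rop(y, W, V)   # result has the shape of the output y
>>> vJ = T.Lop(y, W, v)   # result has the shape of the input W

Compiling and evaluating JV and vJ as in the examples above yields a vector and a matrix respectively, matching the outputs shown there.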

Hessian times a Vector

If you need to compute the Hessian times a vector, you can make use of the above-defined operators to do it more efficiently than actually computing the exact Hessian and then performing the product. Due to the symmetry of the Hessian matrix, you have two options that will give you the same result, though these options might exhibit differing performance. Hence, we suggest profiling the methods before using either one of the two:

>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x ** 2)
>>> gy = T.grad(y, x)
>>> vH = T.grad(T.sum(gy * v), x)
>>> f = theano.function([x, v], vH)
>>> f([4, 4], [2, 2])
array([ 4.,  4.])

or, making use of the R-operator:

>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x ** 2)
>>> gy = T.grad(y, x)
>>> Hv = T.Rop(gy, x, v)
>>> f = theano.function([x, v], Hv)
>>> f([4, 4], [2, 2])
array([ 4.,  4.])

Final Pointers

  • The grad function works symbolically: it receives and returns Theano variables.
  • grad can be compared to a macro, since it can be applied repeatedly (see the sketch after this list).
  • Only scalar costs can be handled directly by grad; arrays are handled through repeated applications.
  • Built-in functions allow computing vector times Jacobian and vector times Hessian efficiently.
  • Work is in progress on the optimizations required to compute efficiently the full Jacobian and the Hessian matrix, as well as the Jacobian times vector.
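
As a small sketch of the second point above, since T.grad returns a symbolic variable, it can be applied again to obtain a second derivative:

>>> import theano
>>> import theano.tensor as T
>>> x = T.dscalar('x')
>>> y = x ** 2
>>> gy = T.grad(y, x)      # first derivative: 2 * x
>>> ggy = T.grad(gy, x)    # applying grad again gives the second derivative: 2
>>> f = theano.function([x], ggy)
>>> f(4)
array(2.0)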