Theano 0.6 Documentation [6] - Derivatives in Theano [work in progress]

This article was translated and edited by Lancelod_Liu; please keep this line if you share it.

Original article: .

Derivatives in Theano

Computing Gradients

Now let's use Theano for a slightly more sophisticated task: create a function which computes the derivative of some expression y with respect to its parameter x. To do this we will use the macro T.grad. For instance, we can compute the gradient of x^2 with respect to x. Note that: \frac{d (x^2)}{d x} = 2 \cdot x.

Here is the code to compute this gradient:

>>> import theano
>>> import theano.tensor as T
>>> from theano import function, pp
>>> x = T.dscalar('x')
>>> y = x ** 2
>>> gy = T.grad(y, x)
>>> pp(gy)  # print out the gradient prior to optimization
'((fill((x ** 2), 1.0) * 2) * (x ** (2 - 1)))'
>>> f = function([x], gy)
>>> f(4)
array(8.0)
>>> f(94.2)
array(188.40000000000001)

In this example, we can see from pp(gy) the symbolic gradient that is being computed. fill((x ** 2), 1.0) means to make a matrix of the same shape as x ** 2 and fill it with 1.0.

Note

The optimizer simplifies the symbolic gradient expression. You can verify this by inspecting the compiled function:

>>> pp(f.maker.fgraph.outputs[0])
'(2.0 * x)'

After optimization, only one Apply node remains in the graph, which multiplies the input by two.

We can also compute the gradient of more complex expressions, such as the logistic function s(x) = \frac{1}{1 + e^{-x}}. It turns out that the derivative of the logistic is:

\frac{d s(x)}{d x} = s(x) \cdot (1 - s(x)).

[Figure dlogistic.png: plot of the gradient of the logistic function]

>>> x = T.dmatrix('x')
>>> s = T.sum(1 / (1 + T.exp(-x)))
>>> gs = T.grad(s, x)
>>> dlogistic = function([x], gs)
>>> dlogistic([[0, 1], [-1, -2]])
array([[ 0.25      ,  0.19661193],
       [ 0.19661193,  0.10499359]])

In general, for any scalar expression s, T.grad(s, w) provides the Theano expression for computing the gradient \frac{\partial s}{\partial w}. In this way Theano can be used for doing efficient symbolic differentiation (the expression returned by T.grad will be optimized during compilation), even for functions with many inputs. (See automatic differentiation for a description of symbolic differentiation.)

Note

The second argument of T.grad can be a list, in which case the output is also a list. The order in both lists is important: element i of the output list is the gradient of the first argument of T.grad with respect to the i-th element of the second list; in other words, the two lists correspond one-to-one. The first argument of T.grad has to be a scalar (a tensor of size 1). For more information, see Extending Theano.
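
As a quick illustration of the list form (a minimal sketch added here, not part of the original tutorial; the names a, b and g are made up), T.grad returns one gradient per entry of the list, in the same order:

>>> a = T.dscalar('a')
>>> b = T.dscalar('b')
>>> s = a * b + a ** 2
>>> ga, gb = T.grad(s, [a, b])  # gradients with respect to a and b, respectively
>>> g = function([a, b], [ga, gb])
>>> g(3, 2)
[array(8.0), array(3.0)]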

Computing the Jacobian

In Theano's parlance, the term Jacobian designates the tensor comprising the first partial derivatives of the output of a function with respect to its inputs (this is what is known as the Jacobian matrix in mathematics). Theano implements the theano.gradient.jacobian() macro that does all that is needed to compute the Jacobian. The following text explains how to do it manually.

In order to manually compute the Jacobian of some function y with respect to some parameter x, we need to use scan. What we do is loop over the entries of y and compute the gradient of y[i] with respect to x.

Note

scan is a generic op in Theano that allows writing all kinds of recurrent equations in a symbolic manner. Although creating symbolic loops (and optimizing them for performance) is not an easy task, the performance of scan has been optimized. We will come back to scan later in this tutorial.

>>> x = T.dvector('x')
>>> y = x ** 2
>>> J, updates = theano.scan(lambda i, y,x : T.grad(y[i], x), sequences=T.arange(y.shape[0]), non_sequences=[y,x])
>>> f = function([x], J, updates=updates)
>>> f([4, 4])
array([[ 8.,  0.],
       [ 0.,  8.]])

What we do in this code is to generate a sequence of ints from 0 to y.shape[0] using T.arange. Then we loop through this sequence, and at each step, we compute the gradient of element y[i] with respect to x. scan automatically concatenates all these rows, generating a matrix which corresponds to the Jacobian.

Note

There are some pitfalls to be aware of regarding T.grad. One of them is that you cannot re-write the above expression of the Jacobian as theano.scan(lambda y_i, x: T.grad(y_i, x), sequences=y, non_sequences=x), even though from the documentation of scan this seems possible. The reason is that y_i will not be a function of x anymore, while y[i] still is.
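
For comparison, the theano.gradient.jacobian() macro mentioned at the start of this section computes the same quantity directly; a minimal sketch, assuming the same imports as in the earlier examples:

>>> x = T.dvector('x')
>>> y = x ** 2
>>> J = theano.gradient.jacobian(y, x)  # wraps the scan-based computation shown above
>>> f = function([x], J)
>>> f([4, 4])
array([[ 8.,  0.],
       [ 0.,  8.]])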

Computing the Hessian

In Theano, the term Hessian has the usual mathematical sense: it is the matrix comprising the second order partial derivatives of a function with scalar output and vector input. Theano implements the theano.gradient.hessian() macro that does all that is needed to compute the Hessian. The following text explains how to do it manually.

You can compute the Hessian manually in a way similar to the Jacobian. The only difference is that now, instead of computing the Jacobian of some expression y, we compute the Jacobian of T.grad(cost, x), where cost is some scalar.

>>> x = T.dvector('x')
>>> y = x ** 2
>>> cost = y.sum()
>>> gy = T.grad(cost, x)
>>> H, updates = theano.scan(lambda i, gy,x : T.grad(gy[i], x), sequences=T.arange(gy.shape[0]), non_sequences=[gy, x])
>>> f = function([x], H, updates=updates)
>>> f([4, 4])
array([[ 2.,  0.],
       [ 0.,  2.]])
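
Similarly, the theano.gradient.hessian() macro mentioned above can be used instead of the manual scan; a minimal sketch:

>>> x = T.dvector('x')
>>> cost = T.sum(x ** 2)
>>> H = theano.gradient.hessian(cost, x)  # cost must be a scalar expression
>>> f = function([x], H)
>>> f([4, 4])
array([[ 2.,  0.],
       [ 0.,  2.]])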

Jacobian times a Vector

Sometimes we can express the algorithm in terms of Jacobians times vectors, or vectors times Jacobians. Compared to evaluating the Jacobian and then doing the product, there are methods that compute the desired results while avoiding actual evaluation of the Jacobian. This can bring about significant performance gains. A description of one such algorithm can be found here:

  • Barak A. Pearlmutter, “Fast Exact Multiplication by the Hessian”, Neural Computation, 1994

While in principle we would want Theano to identify these patterns automatically for us, in practice, implementing such optimizations in a generic manner is extremely difficult. Therefore, we provide special functions dedicated to these tasks.

R-operator

The R operator is built to evaluate the product between a Jacobian and a vector, namely \frac{\partial f(x)}{\partial x} v. The formulation can be extended even to the case where x is a matrix, or a tensor in general, in which case the Jacobian also becomes a tensor and the product becomes some kind of tensor product. Because in practice we end up needing to compute such expressions in terms of weight matrices, Theano supports this more generic form of the operation. In order to evaluate the R-operation of expression y with respect to x, multiplying the Jacobian with v, you need to do something similar to this:

>>> W = T.dmatrix('W')
>>> V = T.dmatrix('V')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)
>>> JV = T.Rop(y, W, V)
>>> f = theano.function([W, V, x], JV)
>>> f([[1, 1], [1, 1]], [[2, 2], [2, 2]], [0,1])
array([ 2.,  2.])

See the list of ops that implement Rop.

L-operator

In similitude to the R-operator, the L-operator would compute a row vector times the Jacobian. The mathematical formula would be v \frac{\partial f(x)}{\partial x}. The L-operator is also supported for generic tensors (not only for vectors). Similarly, it can be implemented as follows:

>>> W = T.dmatrix('W')
>>> v = T.dvector('v')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)
>>> VJ = T.Lop(y, W, v)
>>> f = theano.function([v,x], VJ)
>>> f([2, 2], [0, 1])
array([[ 0.,  0.],
       [ 2.,  2.]])

Note

v, the point of evaluation, differs between the L-operator and the R-operator. For the L-operator, the point of evaluation needs to have the same shape as the output, whereas for the R-operator this point should have the same shape as the input parameter. Furthermore, the results of these two operations differ. The result of the L-operator is of the same shape as the input parameter, while the result of the R-operator has a shape similar to that of the output.
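
As a small sanity check of these shape rules (a sketch added for illustration; the variable names here are made up), the symbolic results carry the expected number of dimensions:

>>> W = T.dmatrix('W')
>>> x = T.dvector('x')
>>> y = T.dot(x, W)             # the output y is a vector
>>> v_out = T.dvector('v_out')  # same shape as the output y -> valid for Lop
>>> V_in = T.dmatrix('V_in')    # same shape as the input W  -> valid for Rop
>>> T.Lop(y, W, v_out).ndim     # result has the shape of W (the input parameter)
2
>>> T.Rop(y, W, V_in).ndim      # result has the shape of y (the output)
1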

Hessian times a Vector

If you need to compute the Hessian times a vector, you can make use of the above-defined operators to do it more efficiently than actually computing the exact Hessian and then performing the product. Due to the symmetry of the Hessian matrix, you have two options that will give you the same result, though they might exhibit differing performance. Hence, we suggest profiling both methods before using either one of the two:

>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x ** 2)
>>> gy = T.grad(y, x)
>>> vH = T.grad(T.sum(gy * v), x)
>>> f = theano.function([x, v], vH)
>>> f([4, 4], [2, 2])
array([ 4.,  4.])

or, making use of the R-operator:

>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x ** 2)
>>> gy = T.grad(y, x)
>>> Hv = T.Rop(gy, x, v)
>>> f = theano.function([x, v], Hv)
>>> f([4, 4], [2, 2])
array([ 4.,  4.])

Final Pointers

  • The grad function works symbolically: it receives and returns Theano variables.
  • grad can be compared to a macro since it can be applied repeatedly.
  • Only scalar costs can be directly handled by grad. Arrays are handled through repeated applications.
  • Built-in functions allow efficient computation of vector times Jacobian and vector times Hessian.
  • Work is in progress on the optimizations required to compute efficiently the full Jacobian and the Hessian matrix, as well as the Jacobian times vector.