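The main purpose of this post is a brief review of some gradient-based approximation concepts and optimization methods. See the reference links for the details; only the key points are listed here, mainly so that I can catch up quickly after forgetting them…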

Jacobian and Hessian matrices

References:

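Jacobian matrix (the matrix of first-order derivatives of $f:R^n\rightarrow R^m$): $J_{ij}=\partial f_i/\partial x_j$. Just think of it as the first-order gradient. For example, in automatic differentiation the chain rule lets the derivative computation be written as a product of a sequence of Jacobian matrices.
Hessian matrix (the matrix of second-order derivatives of $f:R^n\rightarrow R$): $H_{ij}=\frac{\partial^2 f}{\partial x_i\partial x_j}$. Note also that for such a scalar-valued $f$ the Jacobian is the row vector $[\partial f/\partial x_1, \partial f/\partial x_2,...]$. Taking the Jacobian of that Jacobian (transposed and treated as a vector-valued function) gives the Hessian.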
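To make the "Jacobian of the Jacobian" remark concrete, here is a minimal sketch in plain Python. It uses central differences rather than automatic differentiation, purely for illustration; the function `f`, the point `x0`, and the step sizes are all made-up assumptions, not anything from the references.

```python
def jacobian(f, x, h=1e-5):
    """Approximate the m-by-n Jacobian of f: R^n -> R^m with central differences."""
    n = len(x)
    cols = []
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = f(xp), f(xm)
        # Column j holds the approximations of d f_i / d x_j for every output i.
        cols.append([(a - b) / (2 * h) for a, b in zip(fp, fm)])
    m = len(cols[0])
    # Transpose the columns so that J[i][j] approximates d f_i / d x_j.
    return [[cols[j][i] for j in range(n)] for i in range(m)]


def f(x):
    """Hypothetical scalar example f(x1, x2) = x1^2 * x2 + x2^3, returned as a 1-vector."""
    x1, x2 = x
    return [x1 ** 2 * x2 + x2 ** 3]


def grad_f(x):
    """The (transposed) Jacobian of f, viewed as a vector function R^2 -> R^2."""
    return jacobian(f, x)[0]


x0 = [1.0, 2.0]
J = jacobian(f, x0)               # 1x2, approximates [2*x1*x2, x1^2 + 3*x2^2]
H = jacobian(grad_f, x0, h=1e-3)  # 2x2, approximates [[2*x2, 2*x1], [2*x1, 6*x2]]
```

Nesting finite differences like this is numerically fragile in general; it is only meant to show the relationship between the two matrices.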

Automatic Differentiation

References:

Symbolic differentiation can lead to inefficient code and faces the difficulty of converting a computer program into a single expression. It first derives a symbolic formula for $\partial y/\partial x$ and then substitutes x to evaluate it, which gets complicated.
Numerical differentiation can introduce round-off errors in the discretization process and cancellation. It samples points on both sides of the target point and approximates the gradient numerically by $\frac{\Delta y}{\Delta x}$, so it easily runs into numerical problems.
Automatic differentiation exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program. Automatic differentiation is essentially an application of the chain rule.
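The round-off and cancellation problem of numerical differentiation mentioned above is easy to see in a few lines. This is only a sketch: the test function `math.sin`, the point `x0 = 1.0`, and the three step sizes are arbitrary choices for illustration.

```python
import math


def central_diff(f, x, h):
    """Two-sided finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = 1.0
exact = math.cos(x0)  # true derivative of sin at x0

# Absolute error for a large, a moderate, and a tiny step size.
errors = {h: abs(central_diff(math.sin, x0, h) - exact) for h in (1e-1, 1e-5, 1e-13)}

for h, err in errors.items():
    print(f"h = {h:g}  error = {err:.3e}")
```

With a large step the truncation error dominates; with a tiny step the subtraction of two nearly equal numbers loses most of the significant digits. The moderate step does best, and there is no single step size that is safe for every function, which is the practical weakness compared with automatic differentiation.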
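Consider $f:R^n\rightarrow R^m$; what we ultimately want is the $m\times n$ Jacobian matrix.
Forward accumulation: $\frac{\partial y}{\partial x_i} =\frac{\partial y}{\partial w_0}\frac{\partial w_0}{\partial x_i}=\frac{\partial y}{\partial w_1}(\frac{\partial w_1}{\partial w_0}\frac{\partial w_0}{\partial x_i})=...$ That is, the chain is expanded along the direction of the arrow of f, from x to y. Note the position of the parentheses: each pair of parentheses corresponds to one node in the computational graph. Each pass is carried out for one particular $x_i$, so n passes are needed to obtain the full Jacobian. When n<<m, forward mode in theory needs fewer computations.
Reverse accumulation: the opposite of forward. $\frac{\partial y}{\partial x_i} =\frac{\partial y}{\partial w_5}\frac{\partial w_5}{\partial x_i}=(\frac{\partial y}{\partial w_5}\frac{\partial w_5}{\partial w_4})\frac{\partial w_4}{\partial x_i}=...$ The chain is expanded against the direction of the arrow of f, from y to x. Again note the position of the parentheses: each pair corresponds to one node in the computational graph. Each pass is carried out for one particular output $y_j$, so m passes are needed to obtain the full Jacobian. When n>>m, reverse mode in theory needs fewer computations. Be aware, though, that because the chain is resolved top-down from the output, the intermediate values have to be stored, so memory use can be fairly large.
Forward and reverse are just the two extremes; finding the way to obtain the Jacobian with the fewest operations is an NP-hard problem.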
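The reverse accumulation described above can be sketched with a tiny hand-rolled computational graph. Everything here is hypothetical: the `Var` class, the `sin` and `backward` helpers, and the example function $y = x_1 x_2 + \sin(x_1)$ are assumptions made for illustration, not the API of any real library.

```python
import math


class Var:
    """Node in a computational graph that remembers how it was produced."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local partial derivative)
        self.grad = 0.0         # will hold dy/d(this node) after backward()

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))


def backward(y):
    """Propagate dy/d(node) from the output y back to every input in one sweep."""
    order, seen = [], set()

    def visit(node):
        # Post-order DFS: every node is appended after all of its inputs.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(y)
    y.grad = 1.0
    # Walk from the output towards the inputs, applying the chain rule per edge.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


x1, x2 = Var(2.0), Var(3.0)
y = x1 * x2 + sin(x1)
backward(y)
# x1.grad now holds dy/dx1 = x2 + cos(x1); x2.grad holds dy/dx2 = x1.
```

A single backward sweep yields the derivative of one output with respect to all inputs, which is why m sweeps cover the whole Jacobian. The graph of `Var` nodes kept alive until `backward` runs is exactly the stored intermediate state that makes reverse mode memory-hungry.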
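Dual numbers: this is one way to compute AD. Every scalar is written in a form similar to a complex number, $x\rightarrow(a+b\epsilon )$, where $\epsilon$ is understood as an infinitesimal that approaches zero, with $\epsilon ^2=0$. Then $f(x+\epsilon )=f(x)+f'(x)\epsilon$. In other words, once x is written as a dual number, we just compute y as usual, and the second component of the resulting dual number for y is the gradient. This is rather neat, so take an example: for $f(x)=x^2$ we get $f(x+\epsilon )=(x+\epsilon )^2=x^2+2x\epsilon +\epsilon ^2=x^2+(2x)\epsilon$, so $2x$ is $f'(x)$! I think this matches the definition of the gradient exactly, namely how much y changes when x changes by a tiny bit. The $\epsilon$ here can be read as that tiny change in x.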
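A minimal sketch of the dual-number idea in Python follows. The `Dual` class, the helper names `dsin`, `square`, `g`, and `jacobian`, and the example function `g` are all hypothetical and only support addition, multiplication, and sine, which is enough to reproduce the $x^2$ example and to build a small Jacobian one input at a time.

```python
import math


class Dual:
    """Dual number a + b*eps, with eps**2 == 0."""

    def __init__(self, a, b=0.0):
        self.a = a  # value part
        self.b = b  # coefficient of eps (the derivative part)

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, because eps^2 = 0
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

    __rmul__ = __mul__


def dsin(x):
    """sin extended to dual numbers: sin(a + b eps) = sin(a) + cos(a) * b eps."""
    return Dual(math.sin(x.a), math.cos(x.a) * x.b)


def square(x):
    return x * x


# Seed the eps coefficient with 1 to differentiate with respect to x.
d = square(Dual(3.0, 1.0))
# d.a is f(3) and d.b is f'(3) = 2 * 3.


def g(x1, x2):
    """Made-up example g: R^2 -> R^2."""
    return [x1 * x2 + dsin(x1), x1 * x1 + 3.0 * x2]


def jacobian(func, point):
    """Forward accumulation: one pass per input, each pass yields one column."""
    n = len(point)
    columns = []
    for i in range(n):
        args = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(point)]
        columns.append([out.b for out in func(*args)])
    m = len(columns[0])
    return [[columns[j][i] for j in range(n)] for i in range(m)]


J = jacobian(g, [2.0, 3.0])
```

Each pass seeds exactly one input with $\epsilon$ coefficient 1, so the loop in `jacobian` runs n times. This is the forward-accumulation cost discussed earlier.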