Conjugate Gradient

Before diving into Haskell, let’s go over exactly what the conjugate gradient method is and why it works. The “normal” conjugate gradient method is a method for solving systems of linear equations. However, this extends to a method for minimizing quadratic functions, which we can subsequently generalize to minimizing arbitrary functions $f : \mathbb{R}^n \to \mathbb{R}$. We will start by going over the conjugate gradient method for minimizing quadratic functions, and later generalize.

Suppose we have some quadratic function

$$f(x) = \frac{1}{2}x^T A x + b^T x + c$$

for $x \in \mathbb{R}^n$, with $A \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$, and $c \in \mathbb{R}$.

We can write any quadratic function in this form, as this generates all the quadratic terms $x_i x_j$ as well as the linear and constant terms. In addition, we can assume that $A = A^T$ ($A$ is symmetric). (If it were not, we could just rewrite this with a symmetric $A$, since we could take the coefficient on $x_i x_j$ and the coefficient on $x_j x_i$, sum them, and then have $A_{ij} = A_{ji}$ both be half of this sum.)
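To make the later steps concrete, here is a minimal Haskell sketch of these definitions, using plain lists as vectors and lists of rows as matrices. The type aliases and helper names (`dot`, `mulMV`, `symmetrize`) are illustrative choices, not anything fixed by the discussion.

```haskell
import Data.List (transpose)

-- Vectors as lists, matrices as lists of rows: simple (and slow), but enough
-- to keep these sketches self-contained.
type Vector = [Double]
type Matrix = [[Double]]

dot :: Vector -> Vector -> Double
dot u v = sum (zipWith (*) u v)

mulMV :: Matrix -> Vector -> Vector
mulMV m v = map (`dot` v) m

-- f(x) = 1/2 x^T A x + b^T x + c
quadratic :: Matrix -> Vector -> Double -> Vector -> Double
quadratic a b c x = 0.5 * dot x (mulMV a x) + dot b x + c

-- Replacing A with its symmetric part (A + A^T) / 2 defines the same function.
symmetrize :: Matrix -> Matrix
symmetrize a = zipWith (zipWith (\p q -> (p + q) / 2)) a (transpose a)
```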

Taking the gradient of $f$, we obtain

$$\nabla f(x) = Ax + b,$$

which you can verify by writing out the terms in summation notation.
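In code, reusing the list-based helpers from the sketch above, the gradient is a one-liner (assuming $A$ has already been symmetrized):

```haskell
-- ∇f(x) = A x + b, for symmetric A.
gradQuadratic :: Matrix -> Vector -> Vector -> Vector
gradQuadratic a b x = zipWith (+) (mulMV a x) b
```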

If we evaluate $-\nabla f$ at any given location, it will give us a vector pointing in the direction of steepest descent. This gives us a natural way to start our algorithm - pick some initial guess $x_0$, compute the gradient $\nabla f(x_0)$, and move in the direction of $-\nabla f(x_0)$ by some step size $\alpha$. Unlike normal gradient descent, however, we do not have a fixed step size $\alpha$ - instead, we perform a line search in order to find the best $\alpha$. This $\alpha$ is the value of $\alpha$ which brings us to the minimum of $f$ if we are constrained to move in the direction given by $d_0 = -\nabla f(x_0)$.

Note that computing $\alpha$ is equivalent to minimizing the function

$$\begin{aligned} g(\alpha) &= f(x_0 + \alpha d_0) \\ &= \frac{1}{2}(x_0 + \alpha d_0)^T A (x_0 + \alpha d_0) + b^T(x_0 + \alpha d_0) + c \\ &= \frac{1}{2} \alpha^2\, d_0^T A d_0 + d_0^T (A x_0 + b)\, \alpha + \left( \frac{1}{2} x_0^T A x_0 + b^T x_0 + c \right) \end{aligned}$$

Since this is a quadratic function in $\alpha$, it has a unique global minimum or maximum. Since we assume we are not already at the minimum of $f$ and not at a saddle point, it has a minimum.

The minimum of this function occurs when $g'(\alpha) = 0$, that is (writing $x_i$ and $d_i$ for the guess and direction at any given iteration, so that here $i = 0$), when

$$g'(\alpha) = (d_i^T A d_i)\,\alpha + d_i^T(A x_i + b) = 0.$$

Solving this for $\alpha$, we find that the minimum is at
$$\alpha = -\frac{d_i^T(A x_i + b)}{d_i^T A d_i}.$$

Note that since the direction is the negative of the gradient, a.k.a. the direction of steepest descent, $\alpha$ will be non-negative. These first steps give us our second point in our iterative algorithm:
$$x_1 = x_0 - \alpha \nabla f(x_0)$$
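As a sketch of these two steps, on top of the list-based helpers above (the names `exactStep` and `firstStep` are made up for illustration):

```haskell
-- α = -d^T (A x + b) / (d^T A d): the exact minimizer of f along direction d.
exactStep :: Matrix -> Vector -> Vector -> Vector -> Double
exactStep a b x d =
  negate (dot d (zipWith (+) (mulMV a x) b)) / dot d (mulMV a d)

-- One step with the exact step size: x1 = x0 + α d0, where d0 = -∇f(x0).
firstStep :: Matrix -> Vector -> Vector -> Vector
firstStep a b x0 =
  let d0    = map negate (gradQuadratic a b x0)
      alpha = exactStep a b x0 d0
  in  zipWith (+) x0 (map (alpha *) d0)
```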

If this were simple gradient descent, we would iterate this procedure, computing the gradient at each next point and moving in that direction. However, this has a problem - by moving $\alpha_0$ in direction $d_0$ (to find the minimum in direction $d_0$) and then moving $\alpha_1$ in direction $d_1$, we may ruin our work from the previous iteration, so that we are no longer at a minimum in direction $d_0$. In order to rectify this, we require that our directions be conjugate to one another.

We define two vectors $x$ and $y$ to be conjugate with respect to some semi-definite matrix $A$ if $x^T A y = 0$. (Semi-definite matrices are ones where $x^T A x \ge 0$ for all $x$, and are what we require for conjugate gradient.)

Since we have already moved in the $d_0 = -\nabla f(x_0)$ direction, we must find a new direction $d_1$ to move in that is conjugate to $d_0$. How do we do this? Well, let’s compute $d_1$ by starting with the gradient at $x_1$ and then subtracting off anything that would counteract the previous direction:

$$d_1 = -\nabla f(x_1) + \beta_0 d_0.$$

This leaves us with the obvious question: what is $\beta_0$? We can derive it from our definition of conjugacy. Since $d_0$ and $d_1$ must be conjugate, we know that $d_1^T A d_0 = 0$. Expanding $d_1$ using its definition, we get $d_1^T A d_0 = -\nabla f(x_1)^T A d_0 + \beta_0\, d_0^T A d_0 = 0$. Therefore, we must choose $\beta_0$ such that
$$\beta_0 = \frac{\nabla f(x_1)^T A d_0}{d_0^T A d_0}.$$

Choosing this $\beta$ gives us a direction conjugate to all previous directions. Interestingly enough, iterating this will keep giving us conjugate directions. After generating each direction, we find the best $\alpha$ for that direction and update the current estimate of position.
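In code, this direction update might look like the following sketch, built on the same helpers (`betaQuadratic` and `nextDirection` are again made-up names):

```haskell
-- β = ∇f(x1)^T A d0 / (d0^T A d0): the coefficient making d1 conjugate to d0.
betaQuadratic :: Matrix -> Vector -> Vector -> Vector -> Double
betaQuadratic a b x1 d0 =
  dot (gradQuadratic a b x1) (mulMV a d0) / dot d0 (mulMV a d0)

-- d1 = -∇f(x1) + β d0
nextDirection :: Matrix -> Vector -> Vector -> Vector -> Vector
nextDirection a b x1 d0 =
  let beta = betaQuadratic a b x1 d0
  in  zipWith (+) (map negate (gradQuadratic a b x1)) (map (beta *) d0)
```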

Thus, the full Conjugate Gradient algorithm for quadratic functions:

Let $f$ be a quadratic function $f(x) = \frac{1}{2}x^T A x + b^T x + c$ which we wish to minimize.
1. Initialize: Let $i = 0$ and $x_i = x_0$ be our initial guess, and compute $d_i = d_0 = -\nabla f(x_0)$.
2. Find best step size: Compute $\alpha$ to minimize the function $f(x_i + \alpha d_i)$ via the equation
   $$\alpha = -\frac{d_i^T(A x_i + b)}{d_i^T A d_i}.$$
3. Update the current guess: Let $x_{i+1} = x_i + \alpha d_i$.
4. Update the direction: Let $d_{i+1} = -\nabla f(x_{i+1}) + \beta_i d_i$, where $\beta_i$ is given by
   $$\beta_i = \frac{\nabla f(x_{i+1})^T A d_i}{d_i^T A d_i}.$$
5. Iterate: Repeat steps 2-4 until we have looked in $n$ directions, where $n$ is the size of your vector space (the dimension of $x$).
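Putting the five steps together, here is one possible Haskell rendering of the quadratic algorithm, using the helpers sketched earlier. It is only a sketch: it runs for at most $n$ iterations and stops early if the search direction vanishes.

```haskell
-- Conjugate gradient for f(x) = 1/2 x^T A x + b^T x + c, with A symmetric
-- positive definite.
conjGradQuadratic :: Matrix -> Vector -> Vector -> Vector
conjGradQuadratic a b x0 = go (length x0) x0 d0
  where
    d0 = map negate (gradQuadratic a b x0)                   -- step 1
    go 0 x _ = x
    go k x d
      | dot d d < 1e-20 = x                                  -- direction vanished
      | otherwise =
          let alpha = exactStep a b x d                      -- step 2
              x'    = zipWith (+) x (map (alpha *) d)        -- step 3
              beta  = betaQuadratic a b x' d                 -- step 4
              d'    = zipWith (+) (map negate (gradQuadratic a b x'))
                                  (map (beta *) d)
          in  go (k - 1) x' d'                               -- step 5
```

For example, with `a = [[2,0],[0,2]]`, `b = [-2,-4]`, and `x0 = [0,0]`, `conjGradQuadratic a b x0` converges to `[1.0,2.0]`, the point where $Ax + b = 0$.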

Nonlinear Conjugate Gradient

So, now that we’ve derived this for quadratic functions, how are we going to use this for general nonlinear optimization of differentiable functions? To do this, we’re going to reformulate the above algorithm in slightly more general terms.

First of all, we will revise step two. Instead of

Find best step size: Compute $\alpha$ to minimize the function $f(x_i + \alpha d_i)$ via the equation

$$\alpha = -\frac{d_i^T(A x_i + b)}{d_i^T A d_i}.$$

we will simply use a line search:

Find best step size: Compute $\alpha$ to minimize the function $f(x_i + \alpha d_i)$ via a line search in the direction $d_i$.

In addition, we must reformulate the computation of $\beta_i$. There are several ways to do this, all of which are the same in the quadratic case but which differ in the general nonlinear case. We reformulate this computation by generalizing. Note that the difference between $x_{k+1}$ and $x_k$ is entirely in the direction $d_k$, so that for some constant $c$, $x_{k+1} - x_k = c\, d_k$. Since $\nabla f(x) = Ax + b$,

$$\nabla f(x_{k+1}) - \nabla f(x_k) = (A x_{k+1} + b) - (A x_k + b) = A(x_{k+1} - x_k) = c\, A d_k.$$

Therefore, $A d_k = \frac{1}{c}\left(\nabla f(x_{k+1}) - \nabla f(x_k)\right)$. We can now plug this into the equation for $\beta$ and obtain

$$\beta_k = \frac{\nabla f(x_{k+1})^T \left(\nabla f(x_{k+1}) - \nabla f(x_k)\right)}{d_k^T \left(\nabla f(x_{k+1}) - \nabla f(x_k)\right)}.$$

Conveniently enough, the value of $c$ cancels, as it appears in both the numerator and the denominator. This gives us the new update rule:

Update the direction: Let $d_{k+1} = -\nabla f(x_{k+1}) + \beta_k d_k$, where $\beta_k$ is given by

$$\beta_k = \frac{\nabla f(x_{k+1})^T \left(\nabla f(x_{k+1}) - \nabla f(x_k)\right)}{d_k^T \left(\nabla f(x_{k+1}) - \nabla f(x_k)\right)}.$$

We can now apply this algorithm to any nonlinear, differentiable function! This reformulation of $\beta$ is known as the Polak-Ribière method; note that there are others, similar in form and also in use.
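A sketch of the nonlinear algorithm then only needs the gradient of $f$ and some line search, both passed in as parameters; `lineSearch x d` below is a stand-in for any of the line-search methods discussed next, and the vector helpers are the list-based ones from earlier.

```haskell
-- Nonlinear conjugate gradient with the β update derived above. A sketch:
-- no convergence test, and no safeguard against a zero denominator in β.
conjGrad :: Int                              -- number of iterations
         -> (Vector -> Vector)               -- gradient of f
         -> (Vector -> Vector -> Double)     -- line search: x -> d -> α
         -> Vector                           -- initial guess
         -> Vector
conjGrad n grad lineSearch x0 = go n x0 (map negate (grad x0))
  where
    go 0 x _ = x
    go k x d =
      let alpha = lineSearch x d
          x'    = zipWith (+) x (map (alpha *) d)
          gNew  = grad x'
          y     = zipWith (-) gNew (grad x)        -- ∇f(x_{k+1}) - ∇f(x_k)
          beta  = dot gNew y / dot d y
          d'    = zipWith (+) (map negate gNew) (map (beta *) d)
      in  go (k - 1) x' d'
```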

The one remaining bit of this process that we haven’t covered is step two: the line search. As you can see above, we are given a point $x$, some direction vector $v$, and a multivariate function $f : \mathbb{R}^n \to \mathbb{R}$, and we wish to find the $\alpha$ which minimizes $f(x + \alpha v)$. Note that a line search can be viewed simply as root finding, since we know that the directional derivative $v \cdot \nabla f(x + \alpha v)$ should be zero at the minimum. (If it were non-zero, we could move away from that point to a better location.)

There are many ways to do this line search, ranging from relatively simple linear methods (like the secant method) to more complex ones that use quadratic or cubic polynomial approximations.

One simple method for a line search is known as the bisection method. The bisection method is simply a binary search applied to root finding. It begins with two points, $a$ and $b$, such that $g(a)$ and $g(b)$ have opposite signs; by the intermediate value theorem, $g$ must have a root in $[a, b]$. (Note that in our case, $g(\alpha) = v \cdot \nabla f(x + \alpha v)$.) It then computes the midpoint, $c = \frac{a + b}{2}$, and evaluates $g(c)$. If $g(a)$ and $g(c)$ have opposite signs, the root must be in $[a, c]$; if $g(c)$ and $g(b)$ have opposite signs, then $[c, b]$ must contain the root. At this point, the method recurses, continuing its search until it has gotten close enough to the true $\alpha$.
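A direct transcription of this idea, as a hedged sketch (the tolerance and iteration cap are arbitrary, and the caller is assumed to supply a bracket where $g(a)$ and $g(b)$ have opposite signs):

```haskell
-- Bisection: shrink [a, b] around a root of g by repeatedly halving it.
bisection :: (Double -> Double) -> Double -> Double -> Double
bisection g a0 b0 = go (100 :: Int) a0 b0
  where
    go 0 a b = (a + b) / 2
    go k a b
      | abs (g c) < 1e-12 || abs (b - a) < 1e-12 = c
      | signum (g a) /= signum (g c)             = go (k - 1) a c  -- root in [a, c]
      | otherwise                                = go (k - 1) c b  -- root in [c, b]
      where c = (a + b) / 2
```

In the line-search setting, $g$ would be $\alpha \mapsto v \cdot \nabla f(x + \alpha v)$, e.g. `\alpha -> dot v (gradF (zipWith (+) x (map (alpha *) v)))` for the current point `x`, direction `v`, and some gradient function `gradF`.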

Another simple method is known as the secant method. Like the bisection method, the secant method requires two initial points $a$ and $b$ such that $g(a)$ and $g(b)$ have opposite signs. However, instead of doing a simple binary search, it does linear interpolation. It finds the line through $(a, g(a))$ and $(b, g(b))$:

$$g(x) \approx \frac{g(b) - g(a)}{b - a}(x - a) + g(a)$$

It then finds the root of this linear approximation, setting it to zero and finding that the root is at
$$\frac{g(b) - g(a)}{b - a}(x - a) + g(a) = 0 \implies x = a - \frac{b - a}{g(b) - g(a)}\, g(a).$$

It then evaluates $g$ at this location $x$. As with the bisection method, if $g(x)$ and $g(a)$ have opposite signs, then the root is in $[a, x]$, and if $g(x)$ and $g(b)$ have opposite signs, the root must be in $[x, b]$. As before, root finding continues via iteration, until some stopping condition is reached.
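A sketch of this bracket-keeping secant search, following the same conventions as the bisection sketch above:

```haskell
-- Secant-style search: replace the midpoint of bisection with the root of the
-- line through (a, g a) and (b, g b), keeping a sign-changing bracket.
-- Assumes g a0 and g b0 have opposite signs (so g b /= g a within the bracket).
secantSearch :: (Double -> Double) -> Double -> Double -> Double
secantSearch g a0 b0 = go (100 :: Int) a0 b0
  where
    go 0 a b = root a b
    go k a b
      | abs (g x) < 1e-12            = x
      | signum (g a) /= signum (g x) = go (k - 1) a x    -- root in [a, x]
      | otherwise                    = go (k - 1) x b    -- root in [x, b]
      where x = root a b
    root a b = a - (b - a) / (g b - g a) * g a           -- x = a - (b-a)/(g(b)-g(a)) g(a)
```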

There are more line search methods, but the last one we will examine is Dekker’s method. Dekker’s method is a combination of the secant method and the bisection method. Unlike the previous two methods, Dekker’s method keeps track of three points:

  • $a_k$: the current “contrapoint”
  • $b_k$: the current guess for the root
  • $b_{k-1}$: the previous guess for the root

Dekker’s method then computes two candidate next values: $m$ (by using the bisection method) and $s$ (by using the secant method with $b_k$ and $b_{k-1}$). (On the very first iteration, $b_{k-1} = a_k$, so it uses the bisection method.) If the secant result $s$ lies between $b_k$ and $m$, then let $b_{k+1} = s$; otherwise, let $b_{k+1} = m$.

After $b_{k+1}$ is chosen, it is checked for convergence. If the method has converged, iteration is stopped; if not, the method continues. A new contrapoint $a_{k+1}$ is chosen such that $g(a_{k+1})$ and $g(b_{k+1})$ have opposite signs. The two choices for $a_{k+1}$ are either to remain unchanged (stay $a_k$) or to become $b_k$; the choice depends on the signs of the function values involved. Before repeating, the values of $g(a_{k+1})$ and $g(b_{k+1})$ are compared, and $b_{k+1}$ is swapped with $a_{k+1}$ if its value is larger in magnitude, so that $b_{k+1}$ is always the better guess for the root. Finally, the method repeats with the new values of $a_k$, $b_k$, and $b_{k-1}$.

Dekker’s method is effectively a heuristic, but it is nice in practice; it has the reliability of the bisection method and gains a boost of speed from its use of the secant method.
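A loose Haskell sketch of this bookkeeping, in the same style as the previous two (the tolerances, iteration cap, and tie-breaking details are simplified guesses, not a faithful reproduction of Dekker’s original formulation):

```haskell
-- Dekker's method: track a contrapoint a, the current guess b, and the previous
-- guess bOld; prefer the secant candidate when it falls between b and the midpoint.
dekker :: (Double -> Double) -> Double -> Double -> Double
dekker g a0 b0 = go (100 :: Int) a0 b0 a0          -- initially, b_{k-1} = a_k
  where
    go 0 _ b _ = b
    go k a b bOld
      | abs (g b) < 1e-12 || abs (b - a) < 1e-12 = b
      | otherwise =
          let m = (a + b) / 2                                 -- bisection candidate
              s = if g b /= g bOld                            -- secant candidate
                    then b - (b - bOld) / (g b - g bOld) * g b
                    else m
              bNew = if between b m s then s else m
              -- new contrapoint: whichever of a, b still gives a sign change with bNew
              aNew = if signum (g a) /= signum (g bNew) then a else b
              -- keep b as the better (smaller-magnitude) guess, swapping if needed
              (a', b') = if abs (g aNew) < abs (g bNew)
                           then (bNew, aNew)
                           else (aNew, bNew)
          in  go (k - 1) a' b' b
    between p q s = (s - p) * (s - q) <= 0          -- is s between p and q?
```

Any of these three searches, applied to $g(\alpha) = v \cdot \nabla f(x + \alpha v)$ over some bracketing interval, could be wrapped into the `lineSearch` argument of the nonlinear `conjGrad` sketch above.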
