SHPC: Optimisation

The fourth of six posts covering content from SHPC4001 and SHPC4002


This post covers several numerical methods, mostly focused on optimisation. These are (1) Newton’s method (a.k.a. the Newton-Raphson method); (2) the conjugate gradient method; (3) Lagrange multipliers; and (4) sequential quadratic programming. The first two methods are used to find extrema (a.k.a. extreme points): points that minimise or maximise the value of a function. The last two are used for constrained optimisation: given some objective function, find the point which minimises or maximises its value subject to one or more constraint functions. These methods can be readily implemented in most programming languages; that said, languages with symbolic algebra manipulation or dedicated numerical libraries (such as Maple, Mathematica, MATLAB, or Python + SymPy) are advantageous. I will discuss the basics of each approach along with some examples; all that is required is some familiarity with calculus and matrices.

Newton’s Method

Suppose we have some function \( f(x) \) and want to find its roots (i.e. where \( f(x) = 0 \)). My previous post on Numerical Methods discussed the bisection method, but that method has a myriad of limitations and can only ever be used to find roots. Newton’s method, also known as the Newton-Raphson method, can be used to find both roots and (local) extrema (i.e. minima or maxima). It is an iterative method based on the derivatives of the function. Let’s consider the equation of the tangent line of the function \( f(x) \) at the point \( x_k \): \[ l(x) = f(x_k) + f^\prime (x_k)(x- x_k) \] Since we want to find a root (ideally near our first guess \( x_k \)), we want our next guess to be such that \( l(x_{k+1}) = 0 \). Thus \[ 0 = f(x_k) + f^\prime (x_k)(x_{k+1}-x_k) \] which rearranges to yield the iterative formula \begin{align*} \boxed{x_{k+1} = x_k-\frac{f(x_k)}{f^\prime(x_k)}} \end{align*}
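To make this concrete, here is a minimal sketch of the root-finding iteration in Python; the example function, starting guess and tolerance are illustrative choices, not anything from the derivation above.

```python
def newton_root(f, df, x0, tol=1e-10, max_iter=50):
    """Find a root of f using Newton's method, starting from x0."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)   # assumes df(x) is nonzero near the root
        x -= step
        if abs(step) < tol:   # stop once successive guesses agree
            return x
    return x

# Example: the positive root of f(x) = x^2 - 2, i.e. sqrt(2)
print(newton_root(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0))
```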

Assuming \( f \) has a second derivative, we can consider the Taylor series approximation \[ q(x) = f(x_k) + f^\prime(x_k)(x-x_k) + \frac{1}{2}f^{\prime\prime}(x_k)(x-x_k)^2 \] Now, since we want to find an extremum (ideally near our first guess \( x_k \)), we want our next guess to be such that \( q^\prime(x_{k+1}) = 0 \). It is important to keep in mind that all extrema are critical points but not all critical points are extrema (consider stationary points of inflection), so this method still relies on an appropriate initial guess \( x_k \). Thus we get \begin{align*} \boxed{x_{k+1} = x_k-\frac{f^\prime(x_k)}{f^{\prime\prime}(x_k)}} \end{align*} Let’s generalise this approach to multivariable functions. Suppose we have a function \( f(\mathbf{x}) \) where \( \mathbf{x} = (x_1, x_2, \ldots) \) is a vector. Then we can find the extrema using the iterative formula \begin{align*} \boxed{\mathbf{x}_{k+1} = \mathbf{x}_k-H^{-1}(\mathbf{x}_k)\nabla f(\mathbf{x}_k) }\end{align*} where \( H \) is the Hessian matrix (the matrix of second partial derivatives) and \( \nabla f(\mathbf{x}_k) \) is the gradient vector (i.e. the vector of first partial derivatives). This method relies on a very good first guess for the initial vector \( \mathbf{x}_k \), especially for vectors with many components.

Aside: Local vs. global extrema

Local extrema simply refer to extreme points located within some arbitrary interval (usually close to the starting value of \( x \)). These points may not necessarily be the highest/lowest value of the function over the entire domain. Extrema which are the highest/lowest possible value of the function are referred to as global extrema. Care must be taken to distinguish between the two, as algorithms like Newton’s method (and, as we’ll see, gradient descent) tend to converge to the nearest extremum.

Some examples

Let’s find the local minimum of the function \( f(x) = x^5-3x^4-5x^3+15x^2+4x-12 \) with the initial guess \( x_0 = 10 \). We first need to calculate the first and second derivatives: \[ f^\prime(x) = 5x^4-12x^3-15x^2+30x+4 \quad f^{\prime\prime}(x) = 20x^3- 36x^2-30x+30 \] Successive iterations of the formula yield the values \( 7.72, 6.03, 4.79, 3.90, 3.29, 2.90, 2.70, 2.63 \), so the desired local minimum \( x \approx 2.63 \) is reached after nine iterations.
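This example is easy to reproduce in Python; the following sketch (with an arbitrary tolerance) applies the boxed formula to the derivatives computed above.

```python
def newton_minimise(df, d2f, x0, tol=1e-6, max_iter=100):
    """Find a critical point of f by running Newton's method on f'."""
    x = x0
    for k in range(1, max_iter + 1):
        x_new = x - df(x) / d2f(x)
        if abs(x_new - x) < tol:
            return x_new, k
        x = x_new
    return x, max_iter

df  = lambda x: 5*x**4 - 12*x**3 - 15*x**2 + 30*x + 4   # f'(x)
d2f = lambda x: 20*x**3 - 36*x**2 - 30*x + 30           # f''(x)

x_min, n_iter = newton_minimise(df, d2f, x0=10.0)
print(x_min, n_iter)   # converges to roughly 2.63
```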

Let’s now find the local minimum of the multivariable function \( f(x,y) = x^4-4x^3+8x^2+4xy+2y^2+4y+6 \). We first need to calculate the gradient vector and the Hessian matrix: \[ \nabla f = \begin{bmatrix} 4x^3-12x^2+16x+4y \\ 4y + 4x + 4 \end{bmatrix} \quad \mathbf{H} = \begin{bmatrix} 12x^2 - 24x + 16 & 4 \\ 4 & 4 \end{bmatrix} \] Successive iterations of the formula find the solution \( (x,y) = (1,-2) \) after around 12 iterations.
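The multivariable version is just as direct with NumPy; the starting point below is an arbitrary choice for illustration, and the linear system \( H\,\delta = \nabla f \) is solved rather than explicitly inverting \( H \).

```python
import numpy as np

def grad(v):
    x, y = v
    return np.array([4*x**3 - 12*x**2 + 16*x + 4*y,
                     4*y + 4*x + 4])

def hess(v):
    x, _ = v
    return np.array([[12*x**2 - 24*x + 16, 4.0],
                     [4.0, 4.0]])

v = np.array([3.0, 1.0])   # arbitrary starting guess
for k in range(1, 51):
    step = np.linalg.solve(hess(v), grad(v))   # solve H step = grad f
    v = v - step
    if np.linalg.norm(step) < 1e-5:
        break
print(v, k)   # approaches (1, -2)
```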

Gradient Descent

Suppose you are walking along some hill overlooking a valley, and you spot a particularly picturesque glade at the valley’s lowest point. Suppose also that you want to reach the bottom of the valley (the minimum of the function) as quickly as possible (you don’t care if there are rocks or obstacles ahead; you simply want to take the fastest route now). A good approach is to assume that the quickest route to the bottom of the valley is reached by descending the steepest slope of the hill (i.e. moving in the direction of greatest change) at every step. Gradient descent (also known as hill climbing in contexts where the function is something you wish to maximise) is the iterative method \[ x_{k+1} = x_k - \alpha_k \nabla f(x_k) \] for some nonnegative scalar \( \alpha_k \). These constants \( \alpha_k \) can be arbitrary, but a good first choice is to choose \( \alpha_k \) such that \( f(x_k - \alpha \nabla f(x_k)) \) is minimised, which is essentially a line search in the direction of the negative gradient.
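A minimal fixed-step sketch in Python (the test function, step size and tolerance here are all illustrative choices):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Minimise f by repeatedly stepping against its gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # stop when the slope is (almost) flat
            break
        x = x - alpha * g
    return x

# Example: f(x, y) = (x - 1)^2 + 2(y + 3)^2, with minimum at (1, -3)
grad_f = lambda v: np.array([2*(v[0] - 1), 4*(v[1] + 3)])
print(gradient_descent(grad_f, [0.0, 0.0]))
```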

Conjugate Gradient Method

Now let’s improve on our choices of \( \alpha_k \). Consider the quadratic function given by \[ f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T Q\mathbf{x} - \mathbf{x}^T\mathbf{b} \] for some symmetric, positive definite matrix \( Q \). The gradient is thus given by \[ \nabla f = Q\mathbf{x} - \mathbf{b} \] and so it follows that the minimum is at \( \mathbf{x} = Q^{-1}\mathbf{b} \). Let’s use the shorthand \( \mathbf{g}_k = \nabla f(\mathbf{x}_k) \) and \( \mathbf{d}_k = -\mathbf{g}_k = \mathbf{b} - Q\mathbf{x}_k \). Now, substituting \( \mathbf{x}_k + \alpha\mathbf{d}_k \) into our quadratic function and setting its derivative with respect to \( \alpha \) to zero, we get \[ \alpha^\star = \frac{\mathbf{d}_k^T\mathbf{d}_k}{\mathbf{d}_k^TQ\mathbf{d}_k} \] So now, instead of merely iterating values of \( \mathbf{x} \), we are also computing new direction vectors: rather than always following the negative gradient, each new direction is made \( Q \)-conjugate to the previous one. Thus, starting with some initial guess \( \mathbf{x}_0 \) (and \( \mathbf{d}_0 = -\mathbf{g}_0 \)), our new iterative method is \[ \mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{d}_k \quad \quad \alpha_k = -\frac{\mathbf{g}_k^T\mathbf{d}_k}{\mathbf{d}_k^TQ\mathbf{d}_k} \] \[ \mathbf{d}_{k+1} = -\mathbf{g}_{k+1} + \beta_k \mathbf{d}_k \quad \quad \beta_k = \frac{\mathbf{g}_{k+1}^TQ\mathbf{d}_k}{\mathbf{d}_k^TQ\mathbf{d}_k} \]

An example

Suppose we have the function \[ f(x,y,z) = 4x^2 + 2y^2 + 6z^2 + 3xy - 6xz - 3yz + 4x - 3y + 2z + 2 \] and wish to determine its minimum starting from the point \( (1, 2, 0) \). We really only need to explicitly calculate the vector of first derivatives \[ \mathbf{g} = \nabla f = \begin{bmatrix} 8x+3y-6z+4 \\ 3x+4y-3z-3 \\ -6x-3y+12z+2 \end{bmatrix} \] It follows then that \[ Q = \begin{bmatrix} 8 & 3 & -6 \\ 3 & 4 & -3 \\ -6 & -3 & 12 \end{bmatrix} \quad \quad \mathbf{b} = \begin{bmatrix} -4 \\ 3 \\ -2 \end{bmatrix} \]
Plugging all of these into the iterative formulae yields the exact answer \[ (x,y,z) = \left(-\frac{10}{7},\frac{10}{7},-\frac{11}{21}\right) \] In fact, in exact arithmetic the conjugate gradient method converges in at most \( n \) steps for an \( n \times n \) system, so here at most three iterations are needed.
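The iteration translates almost line for line into NumPy; the following sketch simply applies the formulae above to this \( Q \) and \( \mathbf{b} \).

```python
import numpy as np

Q = np.array([[ 8.0,  3.0, -6.0],
              [ 3.0,  4.0, -3.0],
              [-6.0, -3.0, 12.0]])
b = np.array([-4.0, 3.0, -2.0])

x = np.array([1.0, 2.0, 0.0])   # starting point from the example
g = Q @ x - b                   # gradient at x
d = -g                          # initial search direction

for _ in range(3):              # at most n = 3 steps for a 3x3 system
    alpha = -(g @ d) / (d @ (Q @ d))
    x = x + alpha * d
    g = Q @ x - b
    beta = (g @ (Q @ d)) / (d @ (Q @ d))
    d = -g + beta * d

print(x)   # approximately (-10/7, 10/7, -11/21)
```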

Lagrange Multipliers

Let’s now turn our attention to constrained problems. In general, a constrained optimisation problem attempts to find the minimum (or maximum) value of some objective function subject to one or more constraint functions. For example, we may wish to find the minimum and maximum values of the function \( f(x,y) = x^2- y^2 \) subject to the constraint \( g(x,y) = x^2 + y^2 = 1 \). To do so, let’s look at what it means to be a minimum or maximum. Let’s suppose we have a curve \( C \) that is parametrised as follows: \[ C : \quad \mathbf{r}(t) = (x(t),y(t) ) \] and we have a function \( f(x(t),y(t)) \) on this curve. To find the min/max points of the function along the curve, we must solve \( \frac{df}{dt} = 0 \). By the chain rule, \[ \frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} = \left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y}\right)\cdot\left(\frac{dx}{dt},\frac{dy}{dt}\right) = \nabla f \cdot \mathbf{r}^\prime(t) \] Since \( \mathbf{r}^\prime(t) \) is tangent to \( C \), at an extremum \( \nabla f \) must be perpendicular to the curve, just as \( \nabla g \) is perpendicular to the level curve \( g = \text{const} \). Taking \( C \) to be the constraint curve, the level curves of \( f \) and \( g \) are tangent at the extremum; thus their normals (a fancy term for lines perpendicular to the tangent line at a given point) must be parallel. Hence we have \[ \nabla f = \lambda \nabla g \] where \( \lambda \) is the Lagrange multiplier.

For our example, we have \( (2x,-2y) = \lambda(2x,2y) \), which leads to the following system of equations \[ 2x = \lambda2x \quad\quad -2y = \lambda 2y \quad\quad x^2+y^2=1 \] from which we can deduce the extreme points \[ f(0,\pm 1)=-1 \quad\quad f(\pm 1,0) = 1 \]
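SymPy can solve this system for us directly; a small sketch (the symbol names are arbitrary):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x**2 - y**2
g = x**2 + y**2 - 1   # constraint written as g(x, y) = 0

# Stationarity (grad f = lambda * grad g) plus the constraint itself
eqs = [sp.Eq(sp.diff(f, x), lam * sp.diff(g, x)),
       sp.Eq(sp.diff(f, y), lam * sp.diff(g, y)),
       sp.Eq(g, 0)]

for sol in sp.solve(eqs, [x, y, lam], dict=True):
    print(sol, ' f =', f.subs(sol))
```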

What about multiple constraints?

We can simply daisy-chain the first derivative requirement (though the general theory is more subtle than that): for the constraints \( g_1, g_2, \ldots \) we get \[ \nabla f = \lambda_1\nabla g_1 + \lambda_2\nabla g_2 + \ldots \] assuming that there are more variables than constraints (or else we’d generally have an overdetermined system). For example, consider the function \( f(x,y,z) = x + 2y - 3z \) subject to the constraints \( x - y + z = 1 \) and \( x^2 - z^2 = 2 \). We thus need to solve \[ \nabla f = \lambda\nabla g + \mu\nabla h \] (it is common to use a separate symbol for each Lagrange multiplier). Together with the two constraints, this yields a system of five equations (try it!).
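Rather than solving the five equations by hand, we can again lean on SymPy; this sketch just assembles the stationarity conditions and the two constraints.

```python
import sympy as sp

x, y, z, lam, mu = sp.symbols('x y z lambda mu', real=True)
f = x + 2*y - 3*z
g = x - y + z - 1     # first constraint, g = 0
h = x**2 - z**2 - 2   # second constraint, h = 0

# Three stationarity equations plus the two constraints: five in total
eqs = [sp.Eq(sp.diff(f, v), lam*sp.diff(g, v) + mu*sp.diff(h, v))
       for v in (x, y, z)]
eqs += [sp.Eq(g, 0), sp.Eq(h, 0)]

for sol in sp.solve(eqs, [x, y, z, lam, mu], dict=True):
    print(sol, ' f =', f.subs(sol))
```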

Sequential Quadratic Programming

The previous discussion on Lagrange multipliers is just the tip of the iceberg of the (huge) field of convex optimisation. Among the many other techniques for constrained optimisation is a method known as sequential quadratic programming (or SQP). But first, we must introduce the Lagrangian, which in its simplest form is \[ \mathcal{L}(x,\lambda) = f(x) + \lambda g(x) \] Here we wish to approximate the Lagrangian by using a quadratic approximation for the function \( f(x) \) and a linear approximation for any constraints \( g(x) = a \). This is done using our good friend, the Taylor series \[ f(x) \approx f(x_0) + \nabla f(x_0)^T\delta x + \frac{1}{2} \delta x^T\nabla^2 \mathcal{L}(x_0,\lambda) \delta x \\ g(x) \approx g(x_0) + \nabla g(x_0)^T\delta x = a \] where we have used \( \nabla^2 \mathcal{L}(x_0,\lambda) \) (the Hessian of the Lagrangian) instead of \( \nabla^2 f(x_0) \), and where \( \delta x \equiv x - x_0 \). The Lagrangian can thus be constructed using these approximations. Once the resulting system of equations is solved, we can obtain the next iterate using \( x_1 = x_0 + \delta x_0 \). When it comes to approximating the change in the gradient of the Lagrangian, \( \delta g \), we use both the new and old values of \( x \): \[ \delta g \approx \nabla \mathcal{L} (x_1,\lambda)-\nabla\mathcal{L}(x_0,\lambda) \] We can therefore update the Hessian matrix \( H = \nabla^2 \mathcal{L}(x,\lambda) \) using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update: \[ H_{k+1} = H_k + \frac{\delta g_k \delta g_k^T}{\delta g_k^T \delta x_k} - \frac{H_k\delta x_k \delta x_k^T H_k}{\delta x_k^T H_k \delta x_k} \]
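To make the bookkeeping concrete, here is the BFGS update on its own in NumPy; this is only the rank-two update formula above (with made-up values for \( \delta x \) and \( \delta g \)), not a full SQP solver.

```python
import numpy as np

def bfgs_update(H, dx, dg):
    """One BFGS update of the Hessian approximation H, given the
    step dx = x1 - x0 and the gradient change dg."""
    Hdx = H @ dx
    return (H
            + np.outer(dg, dg) / (dg @ dx)
            - np.outer(Hdx, Hdx) / (dx @ Hdx))

# Toy usage: start from the identity (a common initial guess for H)
H  = np.eye(2)
dx = np.array([0.1, -0.2])   # made-up step
dg = np.array([0.3,  0.1])   # made-up gradient change
print(bfgs_update(H, dx, dg))
```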