
1.2.2.1 First-order necessary condition (Lagrange multipliers)

Let $ x^*\in D$ be a local minimum of $ f$ over $ D$ . We assume that $ x^*$ is a regular point of $ D$ in the sense that the gradients $ \nabla h_i$ , $ i=1,\dots,m$ are linearly independent at $ x^*$ . This is a technical assumption needed to rule out degenerate situations; see Exercise 1.2 below.

Instead of the line segments containing $ x^*$ that we used in the unconstrained case, we now consider curves in $ D$ passing through $ x^*$ . Such a curve is a family of points $ x(\alpha)\in D$ parameterized by $ \alpha\in\mathbb{R}$ , with $ x(0)=x^*$ . We require the function $ x(\cdot)$ to be $ \mathcal C^1$ , at least for $ \alpha$ near 0. Given an arbitrary curve of this kind, we can consider the function

$\displaystyle g(\alpha):=f(x(\alpha)).
$

Note that when there are no equality constraints, functions of the form (1.4) considered previously can be viewed as special cases of this more general construction. From the fact that $ \alpha=0$ is a local minimum of $ g$ , we derive exactly as before that (1.7) holds, i.e., $ g'(0)=0$ . To interpret this result in terms of $ f$ , note that

$\displaystyle g'(\alpha)=\nabla f(x(\alpha))\cdot x'(\alpha)
$

which for $ \alpha=0$ gives

$\displaystyle g'(0)=\nabla f(x^*)\cdot x'(0)= 0.$ (1.19)

The vector $ x'(0)\in\mathbb{R}^n$ is an important object for us here. From the first-order Taylor expansion $ x(\alpha)=x^*+x'(0)\alpha+o(\alpha)$ we see that $ x'(0)$ defines a linear approximation of $ x(\cdot)$ at $ x^*$ . Geometrically, it specifies the infinitesimal direction of the curve (see Figure 1.4). The vector $ x'(0)$ is a tangent vector to $ D$ at $ x^*$ . It lives in the tangent space to $ D$ at $ x^*$ , which is denoted by $ T_{x^*}D$ . (We can think of this space as having its origin at $ x^*$ .)

Figure 1.4: A tangent vector
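As a quick illustration, let $ D$ be the unit circle in $ \mathbb{R}^2$ , defined by the single constraint $ h_1(x)=x_1^2+x_2^2-1=0$ , and take $ x^*=(1,0)^T$ . The curve $ x(\alpha)=(\cos\alpha,\sin\alpha)^T$ lies in $ D$ and satisfies $ x(0)=x^*$ ; its tangent vector at $ x^*$ is $ x'(0)=(0,1)^T$ , which points along the circle, and the tangent space $ T_{x^*}D$ is the vertical line through $ x^*$ spanned by this vector.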

We want to have a more explicit characterization of the tangent space $ T_{x^*}D$ , which will help us understand it better. Since $ D$ was defined as the set of points satisfying the equalities (1.18), and since the points $ x(\alpha)$ lie in $ D$ by construction, we must have $ h_i(x(\alpha))=0$ for all $ \alpha$ and all $ i\in\{1,\dots,m\}$ . Differentiating this formula gives

$\displaystyle 0={\frac d{d\alpha}}h_i(x(\alpha))=\nabla
h_i(x(\alpha))\cdot x'(\alpha), \qquad i=1,\dots, m
$

for all $ \alpha$ (close enough to 0). Setting $ \alpha=0$ and remembering that $ x(0)=x^*$ , we obtain

$\displaystyle 0=\left.\frac d{d\alpha}\right\vert _{\alpha=0}h_i(x(\alpha))=\nabla
h_i(x^*)\cdot x'(0), \qquad i=1,\dots, m.
$

We have shown that for an arbitrary $ \mathcal C^1$ curve $ x(\cdot)$ in $ D$ with $ x(0)=x^*$ , its tangent vector $ x'(0)$ must satisfy $ \nabla h_i(x^*)\cdot x'(0)=0$ for each $ i$ . Actually, one can show that the converse is also true, namely, every vector $ d\in\mathbb{R}^n$ satisfying

$\displaystyle \nabla h_i(x^*)\cdot d=0, \qquad i=1,\dots, m$ (1.20)

is a tangent vector to $ D$ at $ x^*$ corresponding to some curve. (We do not give a proof of this fact but note that it relies on $ x^*$ being a regular point of $ D$ .) In other words, the tangent vectors to $ D$ at $ x^*$ are exactly the vectors $ d$ for which (1.20) holds. This is the characterization of the tangent space $ T_{x^*}D$ that we were looking for. It is clear from (1.20) that $ T_{x^*}D$ is a subspace of $ \mathbb{R}^n$ ; in particular, if $ d$ is a tangent vector, then so is $ -d$ (going from $ x'(0)$ to $ -x'(0)$ corresponds to reversing the direction of the curve).
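To see the characterization (1.20) at work on a concrete example, consider two constraints in $ \mathbb{R}^3$ : $ h_1(x)=x_1^2+x_2^2+x_3^2-1$ (the unit sphere) and $ h_2(x)=x_3$ (a plane), so that $ D$ is the equator circle, and take $ x^*=(1,0,0)^T$ . Then $ \nabla h_1(x^*)=(2,0,0)^T$ and $ \nabla h_2(x^*)=(0,0,1)^T$ are linearly independent, so $ x^*$ is a regular point, and (1.20) reads $ 2d_1=0$ , $ d_3=0$ . Hence $ T_{x^*}D=\{(0,t,0)^T:t\in\mathbb{R}\}$ , which is exactly the tangent line to the equator at $ x^*$ .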

Now let us go back to (1.19), which says that $ \nabla f(x^*)\cdot d=0$ for all $ d\in T_{x^*}D$ (since the curve $ x(\cdot)$ and thus the tangent vector $ x'(0)$ were arbitrary). In view of the characterization of $ T_{x^*}D$ given by (1.20), we can rewrite this condition as follows:

$\displaystyle \nabla f(x^*)\cdot d=0\qquad \forall\,d \ $    such that $\displaystyle \ \nabla h_i(x^*)\cdot d=0,\ i=1,\dots,m.$ (1.21)

The relation between $ \nabla f(x^*)$ and $ \nabla h_i(x^*)$ expressed by (1.21) looks somewhat clumsy, since checking it involves a search over $ d$ . Can we eliminate $ d$ from this relation and make it more explicit? A careful look at (1.21) should quickly lead the reader to the following statement.

Claim: The gradient of $ f$ at $ x^*$ is a linear combination of the gradients of the constraint functions $ h_1,\dots,h_m$ at $ x^*$ :

$\displaystyle \nabla f(x^*)\in\operatorname{span}\{\nabla h_i(x^*),\,i=1,\dots,m\}.$ (1.22)

Indeed, if the claim is not true, then $ \nabla f(x^*)$ has a component orthogonal to span$ \{\nabla h_i(x^*)\}$ , i.e., there exists a $ d \ne 0$ such that

$\displaystyle \nabla h_i(x^*)\cdot d=0, \qquad i=1,\dots, m$ (1.23)

and $ \nabla f(x^*)$ can be written in the form

$\displaystyle \nabla f(x^*)=d-\sum_{i=1}^m \lambda_i^* \nabla h_i(x^*)$ (1.24)

for some $ \lambda_1^*,\dots,\lambda_m^*\in\mathbb{R}$ . Taking the inner product with $ d$ on both sides of (1.24) and using (1.23) gives

$\displaystyle \nabla f(x^*)\cdot d=d\cdot d\ne 0
$

and we reach a contradiction with (1.21).

Geometrically, the claim says that $ \nabla f(x^*)$ is normal to $ D$ at $ x^*$ . This situation is illustrated in Figure 1.5 for the case of two constraints in $ \mathbb{R}^3$ . Note that if there is only one constraint, say $ h_1(x)=0$ , then $ D$ is a two-dimensional surface and $ \nabla f(x^*)$ must be proportional to $ \nabla h_1(x^*)$ , the normal direction to $ D$ at $ x^*$ . When the second constraint $ h_2(x)=0$ is added, $ D$ becomes a curve (the thick curve in the figure) and $ \nabla f(x^*)$ is allowed to live in the plane spanned by $ \nabla h_1(x^*)$ and $ \nabla h_2(x^*)$ , i.e., the normal plane to $ D$ at $ x^*$ . In general, the intuition behind the claim is that unless $ \nabla f(x^*)$ is normal to $ D$ , there are curves in $ D$ passing through $ x^*$ whose tangent vectors at $ x^*$ make both positive and negative inner products with $ \nabla f(x^*)$ , hence in particular $ f$ can be decreased by moving away from $ x^*$ while staying in $ D$ .

Figure 1.5: Gradient vectors and constrained optimality

The condition (1.22) means that there exist real numbers $ \lambda_1^*,\dots,\lambda_m^*$ such that

$\displaystyle \fbox{$\nabla f(x^*)+\lambda_1^*\nabla h_1(x^*)+\dots+ \lambda_m^*\nabla h_m(x^*)=0$}$ (1.25)

This is the first-order necessary condition for constrained optimality. The coefficients $ \lambda_i^*$ , $ i=1,\dots,m$ are called Lagrange multipliers.
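As a quick illustration of how (1.25) is used (the specific $ f$ and $ h_1$ here are chosen by us only as an example), let us minimize $ f(x)=x_1+x_2$ over the circle $ D=\{x\in\mathbb{R}^2: h_1(x)=x_1^2+x_2^2-2=0\}$ . The condition (1.25) becomes

$\displaystyle \begin{pmatrix}1\\ 1\end{pmatrix}+\lambda_1^*\begin{pmatrix}2x_1^*\\ 2x_2^*\end{pmatrix}=0,
$

so $ x_1^*=x_2^*=-1/(2\lambda_1^*)$ , and substituting into the constraint gives $ 1/(2\lambda_1^{*2})=2$ , i.e., $ \lambda_1^*=\pm 1/2$ . The two candidates are $ x^*=(-1,-1)^T$ with $ \lambda_1^*=1/2$ and $ x^*=(1,1)^T$ with $ \lambda_1^*=-1/2$ ; comparing the values of $ f$ shows that the constrained minimum is at $ (-1,-1)^T$ .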


\begin{Exercise}
Give an example where a local minimum $x^*$ is not a regular point and the above necessary condition is false (justify both of these claims).
\end{Exercise}

The above proof of the first-order necessary condition for constrained optimality involves geometric concepts. We also left a gap in it because we did not prove the converse implication in the equivalent characterization of the tangent space given by (1.20). We now give a shorter alternative proof which is purely analytic, and which will be useful when we study problems with constraints in calculus of variations. However, the geometric intuition behind the previous proof will be helpful for us later as well. We invite the reader to study both proofs as a way of testing the mathematical background knowledge that will be required in the subsequent chapters.

Let us start again by assuming that $ x^*$ is a local minimum of $ f$ over $ D$ , where $ D$ is a surface in $ \mathbb{R}^n$ defined by the equality constraints (1.18) and $ x^*$ is a regular point of $ D$ . Our goal is to rederive the necessary condition expressed by (1.25). For simplicity, we only give the argument for the case of a single constraint $ h(x)=0$ , i.e., $ m=1$ ; the extension to $ m>1$ is straightforward (see Exercise 1.3 below). Given two arbitrary vectors $ d_1,d_2\in\mathbb{R}^n$ , we can consider the following map from $ \mathbb{R}\times\mathbb{R}$ to itself:

$\displaystyle F: (\alpha_1,\alpha_2)\mapsto
\big(f(x^*+\alpha_1d_1+\alpha_2d_2),
{h}(x^*+\alpha_1d_1+\alpha_2d_2)\big).
$

The Jacobian matrix of $ F$ at $ (0,0)$ is

$\displaystyle \begin{pmatrix}\nabla f(x^*)\cdot d_1 & \nabla f(x^*)\cdot d_2 \\ \nabla h(x^*)\cdot d_1 & \nabla h(x^*)\cdot d_2 \end{pmatrix}.$ (1.26)

If this Jacobian matrix were nonsingular, then we could apply the Inverse Function Theorem (see, e.g., [Rud76, Theorem 9.24]) and conclude that there are neighborhoods of $ (0,0)$ and $ F(0,0)=(f(x^*),0)$ on which the map $ F$ is a bijection (has an inverse). This would imply, in particular, that there are points $ x$ arbitrarily close to $ x^*$ such that $ h(x)=0$ and $ f(x)< f(x^*)$ ; such points would be obtained by taking preimages of points on the ray directed to the left from $ F(0,0)$ in Figure 1.6. But this cannot be true, since $ h(x)=0$ means that $ x\in D$ and we know that $ x^*$ is a local minimum of $ f$ over $ D$ . Therefore, the matrix (1.26) is singular.

Figure 1.6: Illustrating the alternative proof

Regularity of $ x^*$ in the present case just means that the gradient $ \nabla h(x^*)$ is nonzero. Choose a $ d_1$ such that $ \nabla h(x^*)\cdot d_1\ne 0$ . With this $ d_1$ fixed, let $ \lambda^*:=-(\nabla f(x^*)\cdot d_1)/(\nabla h(x^*)\cdot d_1)$ , so that $ \nabla f(x^*)\cdot d_1=-\lambda^*\nabla h(x^*)\cdot d_1$ . Since the matrix (1.26) must be singular for all choices of $ d_2$ , its first row must be a constant multiple of its second row (the second row being nonzero by our choice of $ d_1$ ), and the first column shows that this constant is $ -\lambda^*$ . Thus we have $ \nabla f(x^*)\cdot d_2=-\lambda^*\nabla h(x^*)\cdot d_2$ , or $ (\nabla f(x^*)+\lambda^*\nabla h(x^*))\cdot d_2=0$ , and this must be true for all $ d_2\in\mathbb{R}^n$ . It follows that $ \nabla f(x^*)+\lambda^*\nabla h(x^*)=0$ , which proves (1.25) for the case when $ m=1$ .
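The singularity of the matrix (1.26) can also be checked numerically. The following sketch reuses the illustrative problem from above ($ f(x)=x_1+x_2$ , $ h(x)=x_1^2+x_2^2-2$ , with constrained minimum $ x^*=(-1,-1)^T$ ; the function names are our own) and evaluates the determinant of (1.26) for a few randomly chosen $ d_1,d_2$ :

\begin{verbatim}
# Numerical check that the matrix (1.26) is singular at a constrained minimum.
# Illustrative problem: f(x) = x1 + x2, h(x) = x1^2 + x2^2 - 2, whose
# constrained minimum is x* = (-1, -1); these choices are ours, not the text's.
import numpy as np

def grad_f(x):          # gradient of f(x) = x1 + x2
    return np.array([1.0, 1.0])

def grad_h(x):          # gradient of h(x) = x1^2 + x2^2 - 2
    return 2.0 * x

x_star = np.array([-1.0, -1.0])
rng = np.random.default_rng(0)
for _ in range(5):
    d1, d2 = rng.standard_normal(2), rng.standard_normal(2)
    J = np.array([[grad_f(x_star) @ d1, grad_f(x_star) @ d2],
                  [grad_h(x_star) @ d1, grad_h(x_star) @ d2]])
    print(np.linalg.det(J))   # zero up to floating-point rounding
\end{verbatim}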


\begin{Exercise}
Generalize the previous argument to an arbitrary number $m\ge 1$ of equality constraints, assuming as before that $x^*$ is a regular point.
\end{Exercise}

The first-order necessary condition for constrained optimality generalizes the corresponding result we derived earlier for the unconstrained case. The condition (1.25) together with the constraints (1.18) is a system of $ n+m$ equations in $ n+m$ unknowns: $ n$ components of $ x^*$ plus $ m$ components of the Lagrange multiplier vector $ \lambda^*=(\lambda_1^*,\dots,\lambda_m^*)^T$ . For $ m=0$ , we recover the condition (1.11) which consists of $ n$ equations in $ n$ unknowns. To make this relation even more explicit, consider the function $ \ell:\mathbb{R}^n\times \mathbb{R}^m\to \mathbb{R}$ defined by

$\displaystyle \ell(x,\lambda):=f(x)+\sum_{i=1}^m \lambda_i h_i(x)$ (1.27)

which we call the augmented cost function. If $ x^*$ is a local constrained minimum of $ f$ and $ \lambda^*$ is the corresponding vector of Lagrange multipliers for which (1.25) holds, then the gradient of $ \ell$ at $ (x^*,\lambda^*)$ satisfies

$\displaystyle \nabla \ell(x^*,\lambda^*)= \begin{pmatrix}{\ell}_{x}(x^*,\lambda^*)\\ {\ell}_{\lambda}(x^*,\lambda^*)\end{pmatrix}= \begin{pmatrix}\nabla f(x^*)+\sum_{i=1}^m \lambda_i^* \nabla h_i(x^*)\\ h(x^*)\end{pmatrix}=0$ (1.28)

where $ {\ell}_{x}$ , $ {\ell}_{\lambda}$ are the vectors of partial derivatives of $ \ell$ with respect to the components of $ x$ and $ \lambda$ , respectively, and $ h=(h_1,\dots,h_m)^T$ is the vector of constraint functions. We conclude that $ (x^*,\lambda^*)$ is a usual (unconstrained) stationary point of the augmented cost $ \ell$ . In other words, adding Lagrange multipliers basically converts a constrained problem into an unconstrained one, and the first-order necessary condition (1.25) for constrained optimality is recovered from the first-order necessary condition for unconstrained optimality applied to $ \ell$ .
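As a computational sketch of this last observation (again for the illustrative single-constraint example used above, so everything named below is our own choice rather than part of the text), one can form the augmented cost (1.27) symbolically and solve the stationarity condition (1.28) directly:

\begin{verbatim}
# Stationary points of the augmented cost (1.27) for the illustrative example
# f(x) = x1 + x2, h(x) = x1^2 + x2^2 - 2 (a single constraint, m = 1).
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lambda', real=True)
f = x1 + x2
h = x1**2 + x2**2 - 2
ell = f + lam * h                                         # augmented cost (1.27)
stationarity = [sp.diff(ell, v) for v in (x1, x2, lam)]   # components of nabla ell, cf. (1.28)
print(sp.solve(stationarity, [x1, x2, lam], dict=True))
# Two stationary points: (x1, x2, lambda) = (-1, -1, 1/2) and (1, 1, -1/2).
\end{verbatim}

Both stationary points of $ \ell$ satisfy (1.25); deciding which one is the constrained minimum of $ f$ requires a separate argument (here, simply comparing the values of $ f$ ), consistent with the fact that the condition is necessary but not sufficient.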

The idea of passing from constrained minimization of the original cost function to unconstrained minimization of the augmented cost function is due to Lagrange. If $ (x^*,\lambda^*)$ is a minimum of $ \ell$ , then we must have $ h(x^*)=0$ (because otherwise we could decrease $ \ell$ by changing $ \lambda^*$ ), and subject to these constraints $ x^*$ should minimize $ f$ (because otherwise it would not minimize $ \ell$ ). Also, it is clear that (1.28) must hold. However, this observation alone does not establish (1.28) as a necessary condition for $ x^*$ to be a constrained minimum of $ f$ , because a constrained minimum of $ f$ need not correspond to an unconstrained minimum of $ \ell$ . Unfortunately, there is no quick way to derive the first-order necessary condition for constrained optimality by working with the augmented cost--something that Lagrange originally attempted to do. Nevertheless, the basic form of the augmented cost function (1.27) is fundamental in constrained optimization theory, and will reappear in various forms several times in this book.

Even though the condition in terms of Lagrange multipliers is only necessary and not sufficient for constrained optimality, it is very useful for narrowing down candidates for local extrema. The next exercise illustrates this point for a well-known optimization problem arising in optics.


\begin{Exercise}
Consider a curve $D$\ in the plane described by
the equation $h...
...cessary condition for constrained optimality (1.25).
\end{Exercise}

