
1.2.2.1 First-order necessary condition (Lagrange multipliers)

Let $ x^*\in D$ be a local minimum of $ f$ over $ D$ . We assume that $ x^*$ is a regular point of $ D$ in the sense that the gradients $ \nabla h_i$ , $ i=1,\dots,m$ are linearly independent at $ x^*$ . This is a technical assumption needed to rule out degenerate situations; see Exercise 1.2 below.

Instead of the line segments containing $ x^*$ that we used in the unconstrained case, we now consider curves in $ D$ passing through $ x^*$ . Such a curve is a family of points $ x(\alpha)\in D$ parameterized by $ \alpha\in\mathbb{R}$ , with $ x(0)=x^*$ . We require the function $ x(\cdot)$ to be $ \mathcal C^1$ , at least for $ \alpha$ near 0. Given an arbitrary curve of this kind, we can consider the function

$\displaystyle g(\alpha):=f(x(\alpha)).
$

Note that when there are no equality constraints, functions of the form (1.4) considered previously can be viewed as special cases of this more general construction. From the fact that $ \alpha=0$ is a local minimum of $ g$ , we derive exactly as before that (1.7) holds, i.e., $ g'(0)=0$ . To interpret this result in terms of $ f$ , note that

$\displaystyle g'(\alpha)=\nabla f(x(\alpha))\cdot x'(\alpha)
$

which for $ \alpha=0$ gives

$\displaystyle g'(0)=\nabla f(x^*)\cdot x'(0)= 0.$ (1.19)

The vector $ x'(0)\in\mathbb{R}^n$ is an important object for us here. From the first-order Taylor expansion $ x(\alpha)=x^*+x'(0)\alpha+o(\alpha)$ we see that $ x'(0)$ defines a linear approximation of $ x(\cdot)$ at $ x^*$ . Geometrically, it specifies the infinitesimal direction of the curve (see Figure 1.4). The vector $ x'(0)$ is a tangent vector to $ D$ at $ x^*$ . It lives in the tangent space to $ D$ at $ x^*$ , which is denoted by $ T_{x^*}D$ . (We can think of this space as having its origin at $ x^*$ .)

Figure 1.4: A tangent vector
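As a quick illustration, let $ D$ be the unit circle in $ \mathbb{R}^2$ , defined by the single constraint $ h_1(x)=x_1^2+x_2^2-1=0$ , and take $ x^*=(1,0)^T$ . The curve $ x(\alpha)=(\cos\alpha,\sin\alpha)^T$ lies in $ D$ and satisfies $ x(0)=x^*$ ; its tangent vector at $ x^*$ is $ x'(0)=(0,1)^T$ , which points along the circle, and the tangent space $ T_{x^*}D$ is the vertical line through $ x^*$ spanned by this vector.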

We want to have a more explicit characterization of the tangent space $ T_{x^*}D$ , which will help us understand it better. Since $ D$ was defined as the set of points satisfying the equalities (1.18), and since the points $ x(\alpha)$ lie in $ D$ by construction, we must have $ h_i(x(\alpha))=0$ for all $ \alpha$ and all $ i\in\{1,\dots,m\}$ . Differentiating this formula gives

$\displaystyle 0={\frac d{d\alpha}}h_i(x(\alpha))=\nabla
h_i(x(\alpha))\cdot x'(\alpha), \qquad i=1,\dots, m
$

for all $ \alpha$ (close enough to 0). Setting $ \alpha=0$ and remembering that $ x(0)=x^*$ , we obtain

$\displaystyle 0=\left.\frac d{d\alpha}\right\vert _{\alpha=0}h_i(x(\alpha))=\nabla
h_i(x^*)\cdot x'(0), \qquad i=1,\dots, m.
$

We have shown that for an arbitrary $ \mathcal C^1$ curve $ x(\cdot)$ in $ D$ with $ x(0)=x^*$ , its tangent vector $ x'(0)$ must satisfy $ \nabla h_i(x^*)\cdot x'(0)=0$ for each $ i$ . Actually, one can show that the converse is also true, namely, every vector $ d\in\mathbb{R}^n$ satisfying

$\displaystyle \nabla h_i(x^*)\cdot d=0, \qquad i=1,\dots, m$ (1.20)

is a tangent vector to $ D$ at $ x^*$ corresponding to some curve. (We do not give a proof of this fact but note that it relies on $ x^*$ being a regular point of $ D$ .) In other words, the tangent vectors to $ D$ at $ x^*$ are exactly the vectors $ d$ for which (1.20) holds. This is the characterization of the tangent space $ T_{x^*}D$ that we were looking for. It is clear from (1.20) that $ T_{x^*}D$ is a subspace of $ \mathbb{R}^n$ ; in particular, if $ d$ is a tangent vector, then so is $ -d$ (going from $ x'(0)$ to $ -x'(0)$ corresponds to reversing the direction of the curve).
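To see the characterization (1.20) at work on a concrete example, consider two constraints in $ \mathbb{R}^3$ : $ h_1(x)=x_1^2+x_2^2+x_3^2-1$ (the unit sphere) and $ h_2(x)=x_3$ (a plane), so that $ D$ is the equator circle, and take $ x^*=(1,0,0)^T$ . Then $ \nabla h_1(x^*)=(2,0,0)^T$ and $ \nabla h_2(x^*)=(0,0,1)^T$ are linearly independent, so $ x^*$ is a regular point, and (1.20) reads $ 2d_1=0$ , $ d_3=0$ . Hence $ T_{x^*}D=\{(0,t,0)^T:t\in\mathbb{R}\}$ , which is exactly the tangent line to the equator at $ x^*$ .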

Now let us go back to (1.19), which says that $ \nabla f(x^*)\cdot d=0$ for all $ d\in T_{x^*}D$ (since the curve $ x(\cdot)$ and thus the tangent vector $ x'(0)$ were arbitrary). In view of the characterization of $ T_{x^*}D$ given by (1.20), we can rewrite this condition as follows:

$\displaystyle \nabla f(x^*)\cdot d=0\qquad \forall\,d \ $    such that $\displaystyle \ \nabla h_i(x^*)\cdot d=0,\ i=1,\dots,m.$ (1.21)

The relation between $ \nabla f(x^*)$ and $ \nabla h_i(x^*)$ expressed by (1.21) looks somewhat clumsy, since checking it involves a search over $ d$ . Can we eliminate $ d$ from this relation and make it more explicit? A careful look at (1.21) should quickly lead the reader to the following statement.

Claim: The gradient of $ f$ at $ x^*$ is a linear combination of the gradients of the constraint functions $ h_1,\dots,h_m$ at $ x^*$ :

$\displaystyle \nabla f(x^*)\in\operatorname{span}\{\nabla h_i(x^*),\,i=1,\dots,m\}.$ (1.22)

Indeed, if the claim is not true, then $ \nabla f(x^*)$ has a component orthogonal to span$ \{\nabla h_i(x^*)\}$ , i.e., there exists a $ d \ne 0$ such that

$\displaystyle \nabla h_i(x^*)\cdot d=0, \qquad i=1,\dots, m$ (1.23)

and $ \nabla f(x^*)$ can be written in the form

$\displaystyle \nabla f(x^*)=d-\sum_{i=1}^m \lambda_i^* \nabla h_i(x^*)$ (1.24)

for some $ \lambda_1^*,\dots,\lambda_m^*\in\mathbb{R}$ . Taking the inner product with $ d$ on both sides of (1.24) and using (1.23) gives

$\displaystyle \nabla f(x^*)\cdot d=d\cdot d\ne 0
$

and we reach a contradiction with (1.21).

Geometrically, the claim says that $ \nabla f(x^*)$ is normal to $ D$ at $ x^*$ . This situation is illustrated in Figure 1.5 for the case of two constraints in $ \mathbb{R}^3$ . Note that if there is only one constraint, say $ h_1(x)=0$ , then $ D$ is a two-dimensional surface and $ \nabla f(x^*)$ must be proportional to $ \nabla h_1(x^*)$ , the normal direction to $ D$ at $ x^*$ . When the second constraint $ h_2(x)=0$ is added, $ D$ becomes a curve (the thick curve in the figure) and $ \nabla f(x^*)$ is allowed to live in the plane spanned by $ \nabla h_1(x^*)$ and $ \nabla h_2(x^*)$ , i.e., the normal plane to $ D$ at $ x^*$ . In general, the intuition behind the claim is that unless $ \nabla f(x^*)$ is normal to $ D$ , there are curves in $ D$ passing through $ x^*$ whose tangent vectors at $ x^*$ make both positive and negative inner products with $ \nabla f(x^*)$ , hence in particular $ f$ can be decreased by moving away from $ x^*$ while staying in $ D$ .

Figure 1.5: Gradient vectors and constrained optimality

The condition (1.22) means that there exist real numbers $ \lambda_1^*,\dots,\lambda_m^*$ such that

$\displaystyle \fbox{$\nabla f(x^*)+\lambda_1^*\nabla h_1(x^*)+\dots+ \lambda_m^*\nabla h_m(x^*)=0$}$ (1.25)

This is the first-order necessary condition for constrained optimality. The coefficients $ \lambda_i^*$ , $ i=1,\dots,m$ are called Lagrange multipliers.
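As a quick illustration of how (1.25) is used (the specific $ f$ and $ h_1$ here are chosen by us only as an example), let us minimize $ f(x)=x_1+x_2$ over the circle $ D=\{x\in\mathbb{R}^2: h_1(x)=x_1^2+x_2^2-2=0\}$ . The condition (1.25) becomes

$\displaystyle \begin{pmatrix}1\\ 1\end{pmatrix}+\lambda_1^*\begin{pmatrix}2x_1^*\\ 2x_2^*\end{pmatrix}=0,
$

so $ x_1^*=x_2^*=-1/(2\lambda_1^*)$ , and substituting into the constraint gives $ 1/(2\lambda_1^{*2})=2$ , i.e., $ \lambda_1^*=\pm 1/2$ . The two candidates are $ x^*=(-1,-1)^T$ with $ \lambda_1^*=1/2$ and $ x^*=(1,1)^T$ with $ \lambda_1^*=-1/2$ ; comparing the values of $ f$ shows that the constrained minimum is at $ (-1,-1)^T$ .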


\begin{Exercise}
Give an example where a local minimum $x^*$ is not a regular point and the above necessary condition is false (justify both of these claims).
\end{Exercise}

The above proof of the first-order necessary condition for constrained optimality involves geometric concepts. We also left a gap in it because we did not prove the converse implication in the equivalent characterization of the tangent space given by (1.20). We now give a shorter alternative proof which is purely analytic, and which will be useful when we study problems with constraints in calculus of variations. However, the geometric intuition behind the previous proof will be helpful for us later as well. We invite the reader to study both proofs as a way of testing the mathematical background knowledge that will be required in the subsequent chapters.

Let us start again by assuming that $ x^*$ is a local minimum of $ f$ over $ D$ , where $ D$ is a surface in $ \mathbb{R}^n$ defined by the equality constraints (1.18) and $ x^*$ is a regular point of $ D$ . Our goal is to rederive the necessary condition expressed by (1.25). For simplicity, we only give the argument for the case of a single constraint $ h(x)=0$ , i.e., $ m=1$ ; the extension to $ m>1$ is straightforward (see Exercise 1.3 below). Given two arbitrary vectors $ d_1,d_2\in\mathbb{R}^n$ , we can consider the following map from $ \mathbb{R}\times\mathbb{R}$ to itself:

$\displaystyle F: (\alpha_1,\alpha_2)\mapsto
\big(f(x^*+\alpha_1d_1+\alpha_2d_2),
{h}(x^*+\alpha_1d_1+\alpha_2d_2)\big).
$

The Jacobian matrix of $ F$ at $ (0,0)$ is

$\displaystyle \begin{pmatrix}\nabla f(x^*)\cdot d_1 & \nabla f(x^*)\cdot d_2 \\ \nabla h(x^*)\cdot d_1 & \nabla h(x^*)\cdot d_2 \end{pmatrix}.$ (1.26)

If this Jacobian matrix were nonsingular, then we could apply the Inverse Function Theorem (see, e.g., [Rud76, Theorem 9.24]) and conclude that there are neighborhoods of $ (0,0)$ and $ F(0,0)=(f(x^*),0)$ on which the map $ F$ is a bijection (has an inverse). This would imply, in particular, that there are points $ x$ arbitrarily close to $ x^*$ such that $ h(x)=0$ and $ f(x)< f(x^*)$ ; such points would be obtained by taking preimages of points on the ray directed to the left from $ F(0,0)$ in Figure 1.6. But this cannot be true, since $ h(x)=0$ means that $ x\in D$ and we know that $ x^*$ is a local minimum of $ f$ over $ D$ . Therefore, the matrix (1.26) is singular.

Figure 1.6: Illustrating the alternative proof

Regularity of $ x^*$ in the present case just means that the gradient $ \nabla h(x^*)$ is nonzero. Choose a $ d_1$ such that $ \nabla h(x^*)\cdot d_1\ne 0$ . With this $ d_1$ fixed, let $ \lambda^*:=-(\nabla f(x^*)\cdot d_1)/(\nabla h(x^*)\cdot d_1)$ , so that $ \nabla f(x^*)\cdot d_1=-\lambda^*\nabla h(x^*)\cdot d_1$ . Since the matrix (1.26) must be singular for all choices of $ d_2$ , its first row must be a constant multiple of its second row (the second row being nonzero by our choice of $ d_1$ ), and the first column shows that this constant is $ -\lambda^*$ . Thus we have $ \nabla f(x^*)\cdot d_2=-\lambda^*\nabla h(x^*)\cdot d_2$ , or $ (\nabla f(x^*)+\lambda^*\nabla h(x^*))\cdot d_2=0$ , and this must be true for all $ d_2\in\mathbb{R}^n$ . It follows that $ \nabla f(x^*)+\lambda^*\nabla h(x^*)=0$ , which proves (1.25) for the case when $ m=1$ .
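The singularity of the matrix (1.26) can also be checked numerically. The following sketch reuses the illustrative problem from above ($ f(x)=x_1+x_2$ , $ h(x)=x_1^2+x_2^2-2$ , with constrained minimum $ x^*=(-1,-1)^T$ ; the function names are our own) and evaluates the determinant of (1.26) for a few randomly chosen $ d_1,d_2$ :

\begin{verbatim}
# Numerical check that the matrix (1.26) is singular at a constrained minimum.
# Illustrative problem: f(x) = x1 + x2, h(x) = x1^2 + x2^2 - 2, whose
# constrained minimum is x* = (-1, -1); these choices are ours, not the text's.
import numpy as np

def grad_f(x):          # gradient of f(x) = x1 + x2
    return np.array([1.0, 1.0])

def grad_h(x):          # gradient of h(x) = x1^2 + x2^2 - 2
    return 2.0 * x

x_star = np.array([-1.0, -1.0])
rng = np.random.default_rng(0)
for _ in range(5):
    d1, d2 = rng.standard_normal(2), rng.standard_normal(2)
    J = np.array([[grad_f(x_star) @ d1, grad_f(x_star) @ d2],
                  [grad_h(x_star) @ d1, grad_h(x_star) @ d2]])
    print(np.linalg.det(J))   # zero up to floating-point rounding
\end{verbatim}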


\begin{Exercise}
Generalize the previous argument to an arbitrary number $m\ge 1$ of equality constraints, assuming as before that $x^*$ is a regular point.
\end{Exercise}

The first-order necessary condition for constrained optimality generalizes the corresponding result we derived earlier for the unconstrained case. The condition (1.25) together with the constraints (1.18) is a system of $ n+m$ equations in $ n+m$ unknowns: $ n$ components of $ x^*$ plus $ m$ components of the Lagrange multiplier vector $ \lambda^*=(\lambda_1^*,\dots,\lambda_m^*)^T$ . For $ m=0$ , we recover the condition (1.11) which consists of $ n$ equations in $ n$ unknowns. To make this relation even more explicit, consider the function $ \ell:\mathbb{R}^n\times \mathbb{R}^m\to \mathbb{R}$ defined by

$\displaystyle \ell(x,\lambda):=f(x)+\sum_{i=1}^m \lambda_i h_i(x)$ (1.27)

which we call the augmented cost function. If $ x^*$ is a local constrained minimum of $ f$ and $ \lambda^*$ is the corresponding vector of Lagrange multipliers for which (1.25) holds, then the gradient of $ \ell$ at $ (x^*,\lambda^*)$ satisfies

$\displaystyle \nabla \ell(x^*,\lambda^*)= \begin{pmatrix}{\ell}_{x}(x^*,\lambda^*)\\ {\ell}_{\lambda}(x^*,\lambda^*)\end{pmatrix}= \begin{pmatrix}\nabla f(x^*)+\sum_{i=1}^m \lambda_i^* \nabla h_i(x^*)\\ h(x^*)\end{pmatrix}=0$ (1.28)

where $ {\ell}_{x}$ , $ {\ell}_{\lambda}$ are the vectors of partial derivatives of $ \ell$ with respect to the components of $ x$ and $ \lambda$ , respectively, and $ h=(h_1,\dots,h_m)^T$ is the vector of constraint functions. We conclude that $ (x^*,\lambda^*)$ is a usual (unconstrained) stationary point of the augmented cost $ \ell$ . In other words, adding Lagrange multipliers basically converts a constrained problem into an unconstrained one, and the first-order necessary condition (1.25) for constrained optimality is recovered from the first-order necessary condition for unconstrained optimality applied to $ \ell$ .
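As a computational sketch of this last observation (again for the illustrative single-constraint example used above, so everything named below is our own choice rather than part of the text), one can form the augmented cost (1.27) symbolically and solve the stationarity condition (1.28) directly:

\begin{verbatim}
# Stationary points of the augmented cost (1.27) for the illustrative example
# f(x) = x1 + x2, h(x) = x1^2 + x2^2 - 2 (a single constraint, m = 1).
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lambda', real=True)
f = x1 + x2
h = x1**2 + x2**2 - 2
ell = f + lam * h                                         # augmented cost (1.27)
stationarity = [sp.diff(ell, v) for v in (x1, x2, lam)]   # components of nabla ell, cf. (1.28)
print(sp.solve(stationarity, [x1, x2, lam], dict=True))
# Two stationary points: (x1, x2, lambda) = (-1, -1, 1/2) and (1, 1, -1/2).
\end{verbatim}

Both stationary points of $ \ell$ satisfy (1.25); deciding which one is the constrained minimum of $ f$ requires a separate argument (here, simply comparing the values of $ f$ ), consistent with the fact that the condition is necessary but not sufficient.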

The idea of passing from constrained minimization of the original cost function to unconstrained minimization of the augmented cost function is due to Lagrange. If $ (x^*,\lambda^*)$ is a minimum of $ \ell$ , then we must have $ h(x^*)=0$ (because otherwise we could decrease $ \ell$ by changing $ \lambda^*$ ), and subject to these constraints $ x^*$ should minimize $ f$ (because otherwise it would not minimize $ \ell$ ). Also, it is clear that (1.28) must hold. However, this observation alone does not establish (1.28) as a necessary condition for $ x^*$ to be a constrained minimum of $ f$ , because a constrained minimum of $ f$ need not correspond to an unconstrained minimum of $ \ell$ . Unfortunately, there is no quick way to derive the first-order necessary condition for constrained optimality by working with the augmented cost--something that Lagrange originally attempted to do. Nevertheless, the basic form of the augmented cost function (1.27) is fundamental in constrained optimization theory, and will reappear in various forms several times in this book.

Even though the condition in terms of Lagrange multipliers is only necessary and not sufficient for constrained optimality, it is very useful for narrowing down candidates for local extrema. The next exercise illustrates this point for a well-known optimization problem arising in optics.


\begin{Exercise}
Consider a curve $D$\ in the plane described by
the equation $h...
...cessary condition for constrained optimality (1.25).
\end{Exercise}

