Almost Sure Uniqueness of a Global Minimum
Without Convexity*
*A previous version of this paper was circulated under the title “Generic Uniqueness of a Global Minimum.” The author gratefully acknowledges help from conversations with Donald Andrews, Xiaohong Chen, Yuichi Kitamura, Ivana Komunjer, Simon Lee, Adam McCloskey, José Luis Montiel Olea, Serena Ng, Bernard Salanié, Tobias Salz, and Ming Yuan.
This paper provides a theorem establishing that the set of global minimizers, the argmin, of a random objective function is unique almost surely. The usual way to get uniqueness is to assume the function is strictly quasiconvex and the domain is convex. Outside of a few special cases, verifying uniqueness without convexity has not been done and is often just assumed. The main result of this paper establishes uniqueness without assuming convexity by relying on an easy-to-verify nondegeneracy condition. The main result has widespread application in econometrics and beyond. Six applications are discussed: uniqueness of M-estimators, utility maximization with a nonconvex budget set, uniqueness of the policy function in dynamic programming, envelope theorems, limit theory in weakly identified models, and functionals of Brownian motion.
Keywords: global optimization, nonconvex optimization
1 Introduction
This paper establishes that the argmin of a random objective function is unique almost surely. (Existence of a minimizer is not considered in this paper. The arguments used to prove existence are different from those used to prove uniqueness. Thus, the phrase “the argmin is unique” should be interpreted to mean the argmin does not contain two or more points.) The task of finding the argmin of a random function is a very general problem, and evaluating whether the argmin is unique is important in many applications. The main result of this paper, Theorem 1, holds under very weak conditions. In particular, it allows for nonconvexity, both of the objective function and of the domain.
The usual argument for uniqueness of the argmin relies on convexity assumptions. Without convexity, it is difficult to guarantee uniqueness of the argmin. At the same time, there is a popular intuition that multiple global minimizers occurring with positive probability requires a degenerate random function, in some sense. By considering almost sure uniqueness and relying on a type of nondegeneracy condition (see Assumption Generic in Section 2 and the remarks that follow it), Theorem 1 provides a systematic way to relax convexity conditions.
Intuitively, the proof of Theorem 1 eliminates potential global minimizers occurring at distinct points. If the first-order condition is nonzero at a point, then we can eliminate that point as a global minimizer. The novel idea is recognizing that we can do the same thing for derivatives with respect to $X$, a random vector. If the difference of the derivatives with respect to $X$ at two distinct points is nonzero, then the probability of two global minimizers occurring in neighborhoods of those two points is zero (see Lemma 4 in Section 3). This novel idea is very useful because the derivative with respect to $X$ is often more tractable than the derivative with respect to the domain of optimization.
At this level of generality, there are not many papers that seek to verify uniqueness of the argmin of a function without convexity. The closest is an approach based on a “Mountain Pass Lemma.” This applies if the Hessian of the objective function is positive definite whenever the gradient is zero. The intuition is that between any two minimizers there must exist a local maximum or saddle point. While this condition is sufficient in one dimension, Tarone and Gruenhage (1975) give a counterexample in multiple dimensions. A variety of papers, including Mäkeläinen, Schmidt, and Styan (1981), Demidenko (2008), and Mascarenhas (2010), supplement the Hessian condition with additional regularity conditions to prove uniqueness of the minimizer.
This approach has two disadvantages. First, it has narrow scope. The conclusion of the Mountain Pass Lemma is that the local minimizer is unique. This implies that the approach does not work for any function with multiple local minimizers but a unique global minimizer. Second, the Hessian condition can be difficult to verify if the derivatives of the objective function are intractable. In contrast, Theorem 1 applies to functions with multiple local minimizers, and the assumptions of Theorem 1 are easy to verify.
Theorem 1 has applications to a variety of fields in economics and, more broadly, optimization. In this paper, six applications are discussed. (See Section 4 for the literature related to each application.)
(1) M-estimators minimize a random objective function. The objective function is usually nonconvex and the estimator may be difficult to calculate numerically. The literature has proved uniqueness of an M-estimator for a nonconvex objective function only in isolated cases. More generally, Theorem 1 guarantees the estimator is unique with probability 1, including a new result for uniqueness of the maximum likelihood estimator in a finite mixture model.
(2) A utility maximizing consumer with a nonconvex budget set may not have single-valued demand, even with strictly concave preferences. In addition, nonconvexity may lead to a value function that is not differentiable and violates Roy’s identity. Theorem 1 can be used to verify single-valued demand almost surely with respect to a distribution of individual heterogeneity in preference parameters.
(3) Many dynamic programming problems in economics contain nonconcave aspects, such as fixed adjustment costs. The policy function is an object of interest in many settings, and uniqueness of the policy function is an important condition for policy function iteration to converge, a common way to solve dynamic programming problems. Theorem 1 provides the first general way to verify uniqueness of the policy function that does not require concavity.
(4) Envelope Theorems establish differentiability of the value function, together with a formula for the derivative. Under weak conditions, the value function is directionally differentiable. In the proof of the envelope theorem, uniqueness of the argmin is used to get the directional derivatives to coincide. Thus, Theorem 1 can be used in a key step in the proof of the envelope theorem.
(5) In weakly identified models, limit theory for an estimator relies on uniqueness of the argmin of a random function. This is stated as an assumption that can be difficult to verify. Theorem 1 reduces this assumption to two easily verified conditions: a generic identification condition and a rank condition. Examples are given that demonstrate the importance of these two conditions.
(6) Many random objective functions can be written as a functional of Brownian motion. Theorem 1 accommodates both the infinite dimensionality and nondifferentiability aspects of Brownian motion. For example, Theorem 1 provides a new proof that Chernoff’s distribution is well-defined.
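To make application (6) concrete, here is a minimal simulation sketch (not from the paper: the horizon, grid size, number of draws, and seed are arbitrary choices) of the random objective behind Chernoff's distribution, namely the argmax of two-sided Brownian motion minus a parabola, approximated on a discrete grid:

```python
import numpy as np

def chernoff_argmax_draw(rng, T=5.0, n=20_000):
    """Draw an approximate argmax of W(t) - t^2 over [-T, T], where W is a
    two-sided Brownian motion built from Gaussian increments on a grid."""
    dt = T / n
    t = np.linspace(dt, T, n)
    w_pos = np.cumsum(rng.normal(0.0, np.sqrt(dt), n))  # W on (0, T]
    w_neg = np.cumsum(rng.normal(0.0, np.sqrt(dt), n))  # W on [-T, 0)
    ts = np.concatenate([-t[::-1], [0.0], t])
    ws = np.concatenate([w_neg[::-1], [0.0], w_pos])
    return ts[np.argmax(ws - ts ** 2)]

rng = np.random.default_rng(0)
draws = np.array([chernoff_argmax_draw(rng) for _ in range(100)])
print(draws.mean(), draws.std())
```

Uniqueness of the continuous-time argmax, which Theorem 1 delivers, is what makes the distribution of these draws well-defined.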
Section 2 states Theorem 1. Section 3 provides intuition and states lemmas for the proof of Theorem 1. Section 4 discusses the applications. Section 5 concludes. An appendix contains the proofs.
2 Statement of Theorem 1
This paper studies minimizers of an objective function, $f(\theta, X)$, where $X$ is random. The following assumption eliminates mass points in the distribution of $X$.
Assumption Absolute Continuity.
Let $X$ be an absolutely continuous random $d_X$-vector with distribution $F_X$. Let $\mathcal{X} \subseteq \mathbb{R}^{d_X}$ be a measurable set such that $F_X(\mathcal{X}) = 1$.
The finite dimensionality of $X$ is not restrictive. Infinite dimensional sources of randomness can be accommodated by focusing on a finite dimensional marginal distribution, and conditioning on the remainder. Section 4.6 illustrates this in an application in which the randomness is Brownian motion.
Assumption Manifold.
Let $\Theta$ be a disjoint union of finitely or countably many second-countable Hausdorff manifolds, $\Theta_j$, possibly with boundary or corner.
$\Theta$ is the domain over which $f(\theta, X)$ is minimized.
Using manifolds, possibly with boundaries or corners, is a flexible way to allow for a variety of shapes to be minimized over. This is important because many applications require irregularly shaped $\Theta$. In the utility maximization application in Section 4.2, $\Theta$ is the budget set, which may have nonlinearities or kink points. In the weak identification application in Section 4.5, $\Theta$ is the identified set, which may have an irregular shape due to bounds.
$\Theta_j$ is a manifold with boundary or corner if each point, $\theta \in \Theta_j$, is locally diffeomorphic (a diffeomorphism is a continuously differentiable function with a continuously differentiable inverse) to a neighborhood in $\mathbb{R}_+^{d_j}$, where $\mathbb{R}_+$ denotes the nonnegative reals. For every $\theta \in \Theta_j$, the number $d_j$ is the same and denotes the dimension of the manifold, $\Theta_j$.
Each $\Theta_j$ is locally Euclidean, so, to simplify notation, we identify $\theta$ with its local coordinates, understanding that statements hold with respect to the local coordinate system. For example, if $\phi$ is a coordinate map, we write $f(\theta, X)$ to indicate $f(\phi^{-1}(\theta), X)$, and the derivative of $f$ taken with respect to $\theta$ indicates a derivative taken with respect to the coordinates of the composition, $f(\phi^{-1}(\cdot), X)$.
To accommodate the possibility of a nondifferentiable objective function in some directions, we write the local neighborhoods as subsets of $\mathbb{R}_+^{d_j^a} \times \mathbb{R}_+^{d_j^b}$. Then, for each $\theta$, write $\theta = (\theta^a, \theta^b)$, where $\theta^a$ collects the differentiable directions and $\theta^b$ collects the nondifferentiable directions. This rewriting assumes that $\Theta_j$ is locally a product space between the differentiable directions and the nondifferentiable directions. Let $T_\theta$ denote the tangent cone to $\Theta_j$ at the point $\theta$. (The tangent cone at a point $\theta$ to a set is the closure of the union of all rays that start at $\theta$ and intersect the set.)
Assumption Continuous Differentiability.
(a) $f(\theta, x)$ is a continuous function of $\theta$ and $x$.
(b) For every $\theta \in \Theta$ and $x \in \mathcal{X}$, $f(\theta, x)$ is differentiable with respect to $x$, and the derivative is continuous with respect to $\theta$ and $x$.
(c) For every $\theta = (\theta^a, \theta^b) \in \Theta$, $x \in \mathcal{X}$, and $\tau \in T_\theta$ (to simplify notation, we often write $\theta$ instead of $(\theta^a, \theta^b)$), $f(\theta, x)$ is differentiable with respect to $\theta^a$ in the direction $\tau$, and the derivative is continuous with respect to $(\theta, x)$.
Part (c) allows for nondifferentiability of the objective function in the $\theta^b$ directions. See Section 4.6 for an application with nondifferentiability.
Let $\bar{\Theta} = \{(\theta_1, \theta_2) \in \Theta \times \Theta : \theta_1 \neq \theta_2\}$. Let $f(\theta, x)$ be defined on $\Theta \times \mathcal{X}$.
Assumption Generic.
Assume $f$ is a generic function over $\bar{\Theta} \times \mathcal{X}$. That is, for every $(\theta_1, \theta_2, x) \in \bar{\Theta} \times \mathcal{X}$, at least one of the following is true:
(a) $f(\theta_1, x) \neq f(\theta_2, x)$,
(b) there exists $\tau \in T_{\theta_1}$ such that $\frac{\partial}{\partial \tau} f(\theta_1, x) \neq 0$,
(c) there exists $\tau \in T_{\theta_2}$ such that $\frac{\partial}{\partial \tau} f(\theta_2, x) \neq 0$, or
(d) $\frac{\partial}{\partial x} f(\theta_1, x) \neq \frac{\partial}{\partial x} f(\theta_2, x)$.
The key to Assumption Generic is condition (d). Often, derivatives with respect to $x$ are more tractable than derivatives with respect to $\theta$. In the applications, the general strategy for verifying Assumption Generic is to show that $\frac{\partial}{\partial x} f(\theta_1, x) = \frac{\partial}{\partial x} f(\theta_2, x)$ for all $x$ implies $\theta_1 = \theta_2$.
For interior points, $\theta_1$ or $\theta_2$, conditions (b) and (c) are related to first-order conditions for optimality. If the derivative with respect to $\theta$ is nonzero, then $\theta_1$ is not a global minimizer, and condition (b) is satisfied for some $\tau$. These conditions can be augmented with conditions on second derivatives to allow for saddle points and local maximizers.
Assumption Generic is a standard condition in differential topology. It is also called transversality or regularity. Intuitively, a function is generic if, whenever its value hits zero, it crosses zero (or is transverse to zero) with a nonzero derivative. In this context, Assumption Generic says that, whenever $f(\theta_1, x) = f(\theta_2, x)$, one of the derivatives must not be equal, whether with respect to $\theta_1$, $\theta_2$, or $x$. Furthermore, Assumption Generic is satisfied generically, in the sense of the transversality theorem (see Guillemin and Pollack (1974), Section 2.3).
Assumption Generic makes precise the type of degeneracy needed for a random function to have multiple global minimizers with positive probability. Specifically, for interior $\theta_1$ and $\theta_2$ and no nondifferentiable directions, the negation of Assumption Generic is a system of one more equation than there are unknowns, $(\theta_1, \theta_2, x)$. We can expect an arbitrarily chosen function, $f$, to have zero solutions to such a system, thus satisfying Assumption Generic.
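To make the counting explicit (the symbols follow this rewrite's notation: objective $f(\theta, x)$, with $d_1$, $d_2$, and $d_X$ the dimensions of $\theta_1$, $\theta_2$, and $x$), a failure of the nondegeneracy condition at interior points requires a simultaneous solution of

```latex
% Degenerate configurations: two distinct interior points and a realization x
% at which the values tie, both first-order conditions hold, and the
% x-derivatives coincide.
\begin{aligned}
f(\theta_1, x) - f(\theta_2, x) &= 0 && \text{(1 equation)} \\
\nabla_{\theta} f(\theta_1, x) &= 0 && (d_1 \text{ equations}) \\
\nabla_{\theta} f(\theta_2, x) &= 0 && (d_2 \text{ equations}) \\
\nabla_{x} f(\theta_1, x) - \nabla_{x} f(\theta_2, x) &= 0 && (d_X \text{ equations})
\end{aligned}
```

This is a system of $1 + d_1 + d_2 + d_X$ equations in the $d_1 + d_2 + d_X$ unknowns $(\theta_1, \theta_2, x)$: one more equation than unknowns, so an arbitrarily chosen $f$ has no solutions.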
Assumption Generic is useful in a wide variety of fields. A version of Assumption Generic is used to prove uniqueness of equilibria in a generic economy (see Section 17.D in Mas-Colell, Whinston, and Green (1995)).
Theorem 1.
Under Assumptions Absolute Continuity, Manifold, Continuous Differentiability, and Generic, the argmin of $f(\theta, X)$ over $\Theta$ is unique almost surely-$F_X$.
Intuitively, the proof of Theorem 1 eliminates potential global minimizers occurring at distinct points. Each condition in Assumption Generic eliminates the possibility that $\theta_1$ and $\theta_2$ are simultaneous global minimizers of $f(\cdot, X)$ with positive probability. If, with probability 1, all pairs of simultaneous global minimizers are eliminated, we can conclude that the argmin contains two or more points with probability zero. Section 3 discusses this intuition in more detail.
Theorem 1 is related to the transversality theorem (see Guillemin and Pollack (1974), Section 2.3). One difference is which derivatives on the boundary are permitted. The transversality theorem assumes that both the interior manifold and the boundary manifold are transversal, which implies that the derivatives at boundary points are only taken along the boundary. In contrast, Assumption Generic allows derivatives at boundary points to be taken into the interior. This generates a new boundary complication, discussed and solved in Section 3.3.
3 Intuition for the Proof of Theorem 1
This section gives intuition for the proof of Theorem 1. The first subsection reduces the global problem to a local problem, the second subsection states lemmas for the local problem, and the third subsection discusses and solves a boundary complication. All proofs are in the appendix.
3.1 Global to Local
The uniqueness of a global minimizer is a global condition. In order to verify it, we reduce the problem to a local one. Lemma 1 reduces from possibly noncompact $\Theta$ and $\mathcal{X}$ to compact subsets. Lemma 2 provides a condition to cover these compact sets with neighborhoods.
The following lemma allows a reduction from minimization over all of $\Theta$ to minimization over compact subsets of $\Theta$. We impose the following assumption, stating the properties of a collection, $\mathcal{K}$, of compact subsets, and, in the proof of Theorem 1, construct a $\mathcal{K}$ that satisfies it. Since each $\Theta_j$ is second-countable, we can let $\{\phi_\ell\}_{\ell \geq 1}$ denote a countable atlas for $\Theta$.
Assumption K.
The collection $\mathcal{K}$ satisfies the following conditions.
(a) $\mathcal{K}$ is a countable collection of compact subsets of $\Theta$.
(b) For every compact $C \subseteq \Theta$, there exist finitely many $K_1, \ldots, K_J \in \mathcal{K}$ such that $C \subseteq \bigcup_{i=1}^{J} K_i$.
(c) For every $K \in \mathcal{K}$, $K = (\bar{B}(\theta^a, r) \cap \Lambda) \times K^b$, where $\bar{B}(\theta^a, r)$ is a closed ball in the differentiable directions with center $\theta^a$ and radius $r$, $K^b$ is a compact subset of the nondifferentiable directions, and $\Lambda$ is a closed convex cone with vertex $\theta^a$.
(d) For every $K \in \mathcal{K}$ and $\delta \in (0, 1)$, let $K^\delta = (\bar{B}(\theta^a, \delta r) \cap \Lambda) \times K^b$, where $\bar{B}(\theta^a, \delta r)$ denotes the closed ball with center $\theta^a$ and radius $\delta r$. Assume for every distinct $\theta_1, \theta_2 \in \Theta$, there exist $K_1, K_2 \in \mathcal{K}$, disjoint, such that $\theta_1 \in K_1$ and $\theta_2 \in K_2$.
Condition (c) imposes special structure on $K$. If $K$ does not include a boundary or corner of $\Theta_j$, then $\Lambda$ is the whole space of differentiable directions. If $K$ includes a boundary or corner of $\Theta_j$, then, as constructed in the proof of Theorem 1, $\Lambda$ is an orthant.
The notation, $K^\delta$, denotes the fact that $K$ has been shrunk in the differentiable directions by a factor of $\delta$. Condition (d) is similar to a Hausdorff condition on $\Theta$ with respect to neighborhoods defined by $\mathcal{K}$. It is used when dealing with the boundary complication in Section 3.3.
Assumption K is assumed throughout the rest of this section and the proofs of the lemmas. In the proof of Theorem 1, we construct a collection $\mathcal{K}$ satisfying Assumption K.
For the first lemma, we consider $K_1, K_2 \in \mathcal{K}$, which may be the same. Let $E_1 \subseteq K_1$ and $E_2 \subseteq K_2$ be disjoint. We are interested in two things: (1) whether the minimum value of $f(\cdot, x)$ over $K_1$ is equal to the minimum value over $K_2$, and (2) whether the minimum is achieved on $E_1$ and $E_2$. To deal with these, define the value function, for compact $A \subseteq \Theta$,
$m(x, A) = \min_{\theta \in A} f(\theta, x).$
Notice that $f(\cdot, x)$ achieves its minimum over $K_i$ in $E_i$ if and only if $m(x, K_i) = m(x, E_i)$. With this notation, Lemma 1 shows that, for the argmin to be unique almost surely, it is sufficient that the minimum values over disjoint compact sets, $E_1$ and $E_2$, are almost surely not both equal to the overall minimum value.
Lemma 1.
Suppose that $\{C_n\}_{n \geq 1}$ is a sequence of compact subsets of $\Theta$ such that $C_n \uparrow \Theta$ as $n \to \infty$, and suppose that for every $n$, for every $K_1, K_2 \in \mathcal{K}$, and for every disjoint compact $E_1 \subseteq K_1 \cap C_n$ and $E_2 \subseteq K_2 \cap C_n$,
$F_X\big(\{x \in \mathcal{X} : m(x, E_1) = m(x, E_2) = m(x, C_n)\}\big) = 0,$
where $m(x, A)$ denotes $\min_{\theta \in A} f(\theta, x)$.
Then, the argmin of $f(\theta, X)$ over $\Theta$ is unique almost surely-$F_X$.
The condition in Lemma 1 is still not local. Lemma 2 reduces it to a local condition by finding neighborhoods of $\theta_1$, $\theta_2$, and $x$ that can be used to cover the compact sets $E_1$, $E_2$, and $\mathcal{X}$ in such a way that an appropriate probability is zero.
Lemma 2.
Fix $C \subseteq \Theta$, compact, fix $K_1, K_2 \in \mathcal{K}$, and fix disjoint compact $E_1 \subseteq K_1$ and $E_2 \subseteq K_2$. Suppose, for every $\theta_1 \in E_1$, $\theta_2 \in E_2$, and for every $x \in \mathcal{X}$, there exist neighborhoods, $U_1$, $U_2$, and $V$ of $\theta_1$, $\theta_2$, and $x$, respectively, such that
$F_X\big(\{\tilde{x} \in V : m(\tilde{x}, U_1 \cap E_1) = m(\tilde{x}, U_2 \cap E_2) = m(\tilde{x}, C)\}\big) = 0. \quad (3.1)$
Then, $F_X\big(\{x \in \mathcal{X} : m(x, E_1) = m(x, E_2) = m(x, C)\}\big) = 0$.
3.2 The Local Problem
For the rest of this section, fix $K_1, K_2 \in \mathcal{K}$, fix $\theta_1 \in K_1$ and $\theta_2 \in K_2$ such that $\theta_1 \neq \theta_2$, and fix $x \in \mathcal{X}$. We state lemmas that are useful for showing the existence of neighborhoods, $U_1$, $U_2$, and $V$, that satisfy (3.1), using properties of $f$ that follow from Assumption Generic. Assumption Generic implies one of three conditions:
(i) $f(\theta_1, x) \neq f(\theta_2, x)$, which occurs if condition (a) holds,
(ii) $\frac{\partial}{\partial x} f(\theta_1, x) \neq \frac{\partial}{\partial x} f(\theta_2, x)$, which occurs if condition (d) holds, and
(iii) there exists a $\tau \in T_{\theta_1}$ such that $\frac{\partial}{\partial \tau} f(\theta_1, x) \neq 0$, which occurs if condition (b) holds, and symmetrically for $\theta_2$ under condition (c).
Lemmas 3, 4, and 5, below, show the existence of neighborhoods that satisfy (3.1) for each of these cases, respectively.
Lemma 3.
If $f(\theta_1, x) \neq f(\theta_2, x)$, then there exist neighborhoods, $U_1$, $U_2$, and $V$, so that (3.1) is satisfied.
Lemma 3 follows from the intuitive notion that if the value of $f(\theta_1, x)$ is far from the value of $f(\theta_2, x)$, then the minimum value over $U_1$ is far from the minimum value over $U_2$, for small enough neighborhoods of $\theta_1$, $\theta_2$, and $x$.
Lemma 4.
If $\frac{\partial}{\partial x} f(\theta_1, x) \neq \frac{\partial}{\partial x} f(\theta_2, x)$, then there exist neighborhoods, $U_1$, $U_2$, and $V$, so that (3.1) is satisfied.
The fact that Lemma 4 holds for neighborhoods of $\theta_1$ and $\theta_2$ is surprising. It relies on a type of mean value bound for secants of the value function (despite the fact that the value function may not be differentiable with respect to $x$). The intuition is: if, in some $x$-direction, the derivative of $f(\theta, x)$ for $\theta$ near $\theta_1$ is always less than the derivative for $\theta$ near $\theta_2$, then any secants of $m(\cdot, U_1)$ and $m(\cdot, U_2)$ share that property, for sufficiently small neighborhoods, $U_1$ and $U_2$. Thus, $m(\cdot, U_1)$ is always increasing or decreasing at a rate which is less than the rate at which $m(\cdot, U_2)$ is increasing or decreasing. Thus, they cannot cross more than once along that direction, and the set of crossing points must have probability zero.
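The secant argument can be sketched as follows (in this rewrite's notation: $m_i(x)$ is the minimum of $f(\theta, x)$ over a small neighborhood $U_i$ of $\theta_i$, and $e_k$ is the $x$-direction that separates the derivatives):

```latex
% Suppose, uniformly over theta in U_1 and theta' in U_2,
%   d/dx_k f(theta, x) < d/dx_k f(theta', x)  on a neighborhood V of x.
% Then for any x in V and small t > 0, the secants of the value functions obey
\frac{m_1(x + t e_k) - m_1(x)}{t} \;<\; \frac{m_2(x + t e_k) - m_2(x)}{t},
```

so $m_2 - m_1$ is strictly monotone along the $e_k$ direction and can vanish at most once on each such line. The set where $m_1 = m_2$ is therefore Lebesgue-null, and absolute continuity of $X$ gives it probability zero.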
The next lemma uses the first-order conditions for optimality of $\theta_1$ or $\theta_2$. Without loss of generality, consider just $\theta_1$, where $\theta_1 \in K_1$ and $K_1 \in \mathcal{K}$. Let $T_{\theta_1}(K_1)$ denote the tangent cone to $K_1$ at the point $\theta_1$.
Lemma 5.
If there exists a $\tau \in T_{\theta_1}(K_1)$ such that $\frac{\partial}{\partial \tau} f(\theta_1, x) < 0$, then there exist neighborhoods of $\theta_1$ and $x$, $U_1$ and $V$, such that (3.1) is satisfied.
Lemma 5 uses the intuitive notion that if $\theta_1$ is not a relative minimum of $f(\cdot, x)$, then it is not a minimum of $f(\cdot, x)$ over $K_1$, and it can be bounded away from the minimum in a neighborhood of $x$.
3.3 Dealing with the Boundary
Lemmas 3 - 5 are useful for most cases. However, there is a gap between the conditions that the lemmas cover and the conditions in Assumption Generic. Specifically, if $T_{\theta_1}(K_1)$ is a strict subset of the tangent cone to $\Theta$ at $\theta_1$, then the $\tau$ satisfying condition (b) in Assumption Generic may not belong to $T_{\theta_1}(K_1)$. This complication occurs when $\theta_1$ is on the boundary of $K_1$.
If Lemmas 3 - 5 do not apply to $\theta_1$, then $\theta_1$ is called a problem point. Formally, problem points are defined by the following conditions. Let $K \in \mathcal{K}$.
$\theta_1$ is a problem point for $f(\cdot, x)$ over $K$ if the following hold.
(a) $\theta_1$ minimizes $f(\cdot, x)$ over $K$.
(b) For all $\tau \in T_{\theta_1}(K)$, $\frac{\partial}{\partial \tau} f(\theta_1, x) \geq 0$.
(c) There exists $\tau \in T_{\theta_1}(\Theta)$ such that $\frac{\partial}{\partial \tau} f(\theta_1, x) < 0$.
It follows from Assumption K, part (c), that $T_{\theta_1}(K)$ is always a closed convex cone. $T_{\theta_1}(K)$ coincides with $T_{\theta_1}(\Theta)$ if $\theta_1$ is on the interior of $K$. Thus, in order for conditions (b) and (c) in the definition of a problem point to be compatible, $\theta_1$ must belong to the boundary of $K$. A problem point is illustrated in Figure 1.
Neighborhoods of a problem point satisfying (3.1) may not be available. The following two lemmas provide a solution. Lemma 6 proves the existence of neighborhoods if there does not exist a minimizer of $f(\cdot, x)$ over $C$ on the interior of $K$. To satisfy equation (3.1) using Lemma 6, we can take $U_1$ and $U_2$ to be all of $K_1$ or $K_2$.
Lemma 6.
Fix $K \in \mathcal{K}$, and suppose $x$ is such that the interior of $K$ contains no minimizer of $f(\cdot, x)$ over $C$. Then, there exists a neighborhood of $x$, $V$, such that the same holds for every $\tilde{x} \in V$.
The fact that a problem point and a minimizer on the interior of $K$ take the exact same value seems unlikely. If one could change the radius of $K$ by a little bit, this would not happen, because the value at the boundary would be different. This gives the intuition for Lemma 7. For every $K \in \mathcal{K}$ and for every $\delta \in (0, 1)$, let $K^\delta$ denote $(\bar{B}(\theta^a, \delta r) \cap \Lambda) \times K^b$. This notation denotes the fact that $K$ has been shrunk in the differentiable directions by a factor of $\delta$. Also, for every $\delta$, let $P^\delta$ denote the set of $x \in \mathcal{X}$ such that a problem point for $f(\cdot, x)$ over $K^\delta$ occurs simultaneously with a minimizer of $f(\cdot, x)$ over $C$ on the interior of $K^\delta$.
Lemma 7.
For every $K \in \mathcal{K}$ and $\delta_0 \in (0, 1)$, there exists a $\delta \in (\delta_0, 1)$ such that $F_X(P^\delta) = 0$.
Lemma 7 is very different from Lemmas 3 - 6. Instead of proving the existence of neighborhoods that satisfy (3.1), Lemma 7 proves that, if we can adjust the radius of $K$ by a small amount, we can always choose $\delta$ in such a way that the probability of a problem point occurring simultaneously with an interior minimizer is zero. This avoids the boundary complication, confining it to a negligible set.
Together, Lemmas 1 - 7 provide the intuition for how Assumption Generic is used, as well as the basic structure for the proof of Theorem 1.
4 Applications
4.1 Nonconvex M-estimation
Theorem 1 can be applied to estimation methods that minimize a random objective function, also known as M-estimation. In this case, $f(\theta, X)$ is the negative of the likelihood or some other objective function, $\Theta$ is the parameter space, and $X$ is the sample. These optimization problems are known to be nonconvex in general. Whenever the objective function satisfies Assumption Generic, Theorem 1 guarantees the estimator is well-defined and unique with probability 1.
Uniqueness of the argmin is an important property in M-estimation, for a variety of reasons. (Uniqueness of an M-estimator is not necessary for asymptotic results, such as consistency. Rather, it is the finite sample property that the estimator is a point, a desirable property in itself.) Finding the global minimum is a very hard problem numerically, and there are many algorithms, such as multi-start or branch-and-bound, that are designed to find the global minimum. For all of these, uniqueness of the argmin is important for a well-defined convergence criterion. At the same time, uniqueness is a property that is often difficult to verify numerically because the objective function can be very flat or contain many local minimizers in a neighborhood of the global minimizer. Also, uniqueness is important for replication and communication in research. If a replication study calculates a different value of an M-estimator, the study may come to a different conclusion than the original. In addition, Hillier and Armstrong (1999) provide a formula for the exact density of the maximum likelihood estimator, under the assumption that it is unique, among other regularity conditions. For these reasons, it is useful to have a theorem that provides an analytic guarantee that the argmin is unique almost surely.
A lot of effort has been put into verifying uniqueness of the argmin in isolated cases of nonconvex objective functions. These examples include the truncated normal likelihood (Orme (1989), Orme and Ruud (2002)), the Cauchy likelihood (Copas (1975)), the Weibull likelihood (Cheng and Chen (1988)), the Tobit model (Olsen (1978), Wang and Bice (1997)), random coefficient regression models (Mallet (1986)), mixed proportional hazard models (Huh, Postert, and Sickles (1998)), k-monotone densities (Seregin (2010)), estimating a covariance matrix with a Kronecker product structure (Roś, Bijma, Munck, and Gunst (2016), Soloveychik and Trushin (2016)), and a variety of nonparametric mixture models (Simar (1976), Hill, Saunders, and Laud (1980), Lindsay (1981, 1983a, 1983b, 1995), Jewell (1982), Lindsay, Clogg, and Grego (1991), Lindsay and Roeder (1993), Wood (1999)). All of these examples require specific knowledge about the structure of the objective function.
The mixture model is an important example because of its widespread use and the presence of many local minimizers. The cases where uniqueness has been verified, such as in Lindsay (1983a), are for nonparametric mixture models, where the number of mixture components, $K$, is allowed to be as large as necessary to maximize the likelihood. In some cases, this can be as large as $n/2$, or half of the sample size. This contrasts with finite mixture models, where the number of mixture components is fixed and assumed known. To the author’s knowledge, uniqueness of the maximum likelihood estimator has not been verified in finite mixture models. Corollary 1, below, uses Theorem 1 to verify uniqueness of the maximum likelihood estimator for a finite mixture of normals.
The mixture of normals assumes the sample, $X = (X_1, \ldots, X_n)$, is drawn iid from a continuous distribution with a density, $p_0$, that is approximated by a mixture density, $p(\cdot; \pi, \mu)$, where $\pi$ is a $K$-vector of weights and $\mu$ is a $K$-vector of means. The mixture density satisfies:
$p(x; \pi, \mu) = \sum_{k=1}^{K} \pi_k \phi(x - \mu_k),$
where $\phi$ is the standard normal density. The weights, $\pi_k$, are assumed to be positive and sum to 1. The means, $\mu_k$, are assumed to be strictly increasing: $\mu_1 < \mu_2 < \cdots < \mu_K$. These assumptions are necessary to ensure that the components can be separately identified. In this case, the parameter to be optimized is $\theta = (\pi, \mu)$, while the random vector is the full sample, $X = (X_1, \ldots, X_n)$. We can write the negative of the log-likelihood as:
$f(\theta, X) = -\sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} \pi_k \phi(X_i - \mu_k) \right).$
Corollary 1.
If the number of mixture components, $K$, is sufficiently small relative to the sample size, $n$, then the argmin of $f(\theta, X)$ over the parameter space is unique almost surely-$F_X$.
The assumption that $K$ is small relative to $n$ is very weak. Practical uses of finite mixture models require very few components relative to the sample size.
The proof of Corollary 1 verifies condition (d) in Assumption Generic by taking derivatives with respect to $X$ and arguing that $\frac{\partial}{\partial X} f(\theta_1, X) = \frac{\partial}{\partial X} f(\theta_2, X)$ for all $X$ implies that the weights and the means coincide.
Corollary 1 does not require the model to be correctly specified. The proof only requires that the sample is continuously distributed.
Corollary 1 illustrates how Theorem 1 can be used to verify uniqueness of M-estimators. It is stated for a mixture of normals, but the proof also covers any mixture of a 1-parameter exponential family. In addition, Corollary 1 can be extended to any mixture of a multi-parameter exponential family, including a mixture of normals with unknown variance.
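The following numerical sketch (not from the paper: the data-generating process, fixed equal weights, starting points, and plain gradient descent are all illustrative choices) evaluates the mixture negative log-likelihood for $K = 2$ and shows multi-start local minimization reaching a single ordered argmin:

```python
import numpy as np

def nll(mu, x, pi=(0.5, 0.5)):
    """Average negative log-likelihood of a 2-component unit-variance
    normal mixture with fixed weights pi and means mu."""
    dens = sum(p * np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
               for p, m in zip(pi, mu))
    return -np.mean(np.log(dens))

def grad_nll(mu, x, pi=(0.5, 0.5)):
    """Gradient of the average negative log-likelihood with respect to mu."""
    comps = [p * np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
             for p, m in zip(pi, mu)]
    dens = comps[0] + comps[1]
    return np.array([-np.mean(comps[k] * (x - mu[k]) / dens) for k in range(2)])

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])

# Multi-start gradient descent: the objective is nonconvex in mu, with
# label-permuted stationary points, but one best value.
limits = []
for start in ([-5.0, 0.0], [0.0, 5.0], [-1.0, 1.0], [4.0, -4.0]):
    mu = np.array(start)
    for _ in range(5000):
        mu = mu - 0.1 * grad_nll(mu, x)
    limits.append(np.sort(mu))  # impose the ordering restriction mu_1 < mu_2

spread = max(np.max(np.abs(limits[0] - lim)) for lim in limits[1:])
print(limits[0], spread)
```

All starting points land on the same ordered argmin, consistent with Corollary 1, even though without the ordering restriction the objective has permuted local minimizers.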
4.2 Utility Maximization with a Nonconvex Budget Set
A nonconvex budget set introduces many problems for the standard utility maximization framework. Even with strictly concave preferences, the demand correspondence may not be single-valued. Single-valued demand is an important condition in utility maximization. For example, without convexity, Hausman and Newey (2016) assume single-valued demand in order to derive welfare formulas with unrestricted heterogeneity. In addition, Szabó (2015) truncates the support of heterogeneity so that demand is single-valued in order to consider well-defined counterfactual experiments.
The most common type of nonconvex budget set is a piecewise linear budget set, which can occur due to decreasing block-rate pricing. (See Moffitt (1986) for an exposition of piecewise linear budget sets. A closely related literature is that of discrete/continuous models, where nonconvex budget sets can also occur. See Hanemann (1984) and Hausman (1985) for an exposition of discrete/continuous models as well as a survey of examples. More recently, Dalton (2014) and Kowalski (2015) allow for nonconvex piecewise linear budget sets when analyzing the market for health insurance.) Decreasing block-rate pricing implies nonconvexity in the budget set, and therefore the demand correspondence may not be single-valued. This is especially relevant when the pricing schedule is set by a monopolist such that the optimal schedule makes the consumers indifferent between blocks. If there are only finitely many types of consumers, this occurs whenever an incentive compatibility constraint is binding for the problem of the monopolist. (For a discussion of the problem of the monopolist in setting price schedules, see Varian (1989) and Wilson (1993).)
Burtless and Hausman (1978) analyze a government subsidy program that induces a nonconvex kink in the budget set. (Blundell, MaCurdy, and Meghir (2007) survey the literature on labor supply models, including approaches that account for nonconvex budget sets.) Their approach specifies a constant income elasticity functional form for indirect utility. Burtless and Hausman (1978) use the specified functional form together with the assumption that the income elasticity is heterogeneous across the population with a continuous distribution to verify that the supply correspondence (in the market for labor, the consumer chooses supply rather than demand) is single-valued. (Another approach to verifying single-valued demand used in discrete/continuous models is to assume each consumer has a full set of continuously distributed choice-specific tastes, which is additively separable with the characteristics of that choice. This is done in nested logit models, for example. In contrast, Corollary 2 allows non-additively separable specifications of heterogeneity.) Single-valued supply implies differentiability of the indirect utility function, and consequently, Roy’s identity. (See Section 4.4, on envelope theorems. Roy’s identity is the key equation for estimating preference parameters from demand functions. See Chapter 3 of Mas-Colell, Whinston, and Green (1995).) They use Roy’s identity to write labor supply as a function of preference parameters, which they estimate by maximum likelihood.
We generalize Burtless and Hausman (1978), allowing more flexible functional forms for the indirect utility function by using Theorem 1 to verify that demand is single-valued. Suppose the budget set is defined by decreasing block-rate pricing with $J$ blocks. Each block is associated with a price, $p_j$, and a “virtual income,” $y_j$. (Virtual income, $y_j$, is the value of the bundle if the price, $p_j$, is the same for all quantities, as defined in Burtless and Hausman (1978).) The budget set can be written as the union of $J$ linear budget sets.
Suppose the indirect utility function, conditional on the $j$th linear budget set, takes the form $v(p_j, y_j, \beta, \varepsilon)$, where $\beta$ is a vector of preference parameters and $\varepsilon$ is a vector of individual specific tastes, assumed to be finite dimensional and continuously distributed. (Below, we allow $\varepsilon$ to be infinite dimensional, incorporating nonparametric random utility models.) Denote the conditional demand correspondence by $x(p_j, y_j, \beta, \varepsilon)$. The overall indirect utility function can be written as:
$v^*(\beta, \varepsilon) = \max_{1 \leq j \leq J} v(p_j, y_j, \beta, \varepsilon).$
Denote the overall demand correspondence by $x^*(\beta, \varepsilon)$. The following corollary gives sufficient conditions for the overall demand correspondence to be single-valued almost surely.
Corollary 2.
Assume the following conditions.
(a) For every $j$, the conditional demand correspondence, $x(p_j, y_j, \beta, \varepsilon)$, is almost surely single-valued.
(b) For every $j$, $v(p_j, y_j, \beta, \varepsilon)$ is continuously differentiable with respect to $\varepsilon$.
(c) For every $j \neq k$ such that $v(p_j, y_j, \beta, \varepsilon) = v(p_k, y_k, \beta, \varepsilon)$,
$\frac{\partial}{\partial \varepsilon} v(p_j, y_j, \beta, \varepsilon) \neq \frac{\partial}{\partial \varepsilon} v(p_k, y_k, \beta, \varepsilon).$
Then, the overall demand correspondence, $x^*(\beta, \varepsilon)$, is almost surely single-valued.
Condition (a) follows from strict quasiconcavity of preferences because the conditional budget set is linear, or, if preferences are not strictly quasiconcave, from a prior application of Theorem 1.
Condition (c) has the interpretation that for any two budget sets, there exists some dimension of heterogeneity that varies the indirect utility at those two budget sets by different amounts. Notice that $\varepsilon$ is a vector, and condition (c) can be satisfied by any component of $\varepsilon$.
Corollary 2 is more general than the approach in Burtless and Hausman (1978) because it: (i) allows for an arbitrary number of blocks, (ii) applies to other functional forms for indirect utility, and (iii) can use any kind of continuously distributed heterogeneity in taste preferences to verify single-valued demand almost surely.
Infinite dimensional heterogeneity, as considered in Hausman and Newey (2016) and Blomquist, Kumar, Liang, and Newey (2015), can be accommodated by verifying condition (c) with respect to an absolutely continuous finite dimensional functional of the heterogeneity. In fact, the condition is easier to verify the higher the dimension of the heterogeneity. In this sense, Corollary 2 covers nonparametric random utility models with totally unrestricted, possibly infinite dimensional, heterogeneity.
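As a concrete two-block illustration of Corollary 2's mechanism (entirely this sketch's assumptions: the quasilinear utility, prices, and virtual incomes below are not from the paper), demand jumps between blocks at a single knife-edge taste value, and that tie has probability zero for continuously distributed tastes:

```python
import numpy as np

# Two-block decreasing block-rate pricing: block j has price p[j] and virtual
# income y[j]. Conditional on block j, the consumer has quasilinear utility
#   u(q, z; eps) = eps * log(q) + z,   with budget  z = y[j] - p[j] * q.
p = [2.0, 1.0]    # decreasing marginal price across blocks
y = [10.0, 8.0]   # virtual incomes

def conditional_demand(j, eps):
    # FOC: eps / q = p[j]  =>  q = eps / p[j]; then plug back in for v.
    q = eps / p[j]
    v = eps * np.log(q) + (y[j] - p[j] * q)
    return q, v

def demand(eps):
    # Overall demand maximizes indirect utility across blocks (a nonconvex set).
    choices = [conditional_demand(j, eps) for j in range(2)]
    return max(choices, key=lambda qv: qv[1])[0]

eps_grid = np.linspace(0.5, 6.0, 1101)
q = np.array([demand(e) for e in eps_grid])
jump = np.max(np.abs(np.diff(q)))  # the discontinuity at the tie point
print(jump)
```

Here the derivative difference in condition (c) is $\log 2 \neq 0$, so the two blocks tie at exactly one value of the taste, and demand is single-valued for almost every realization.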
4.3 Unique Policy Functions in Dynamic Programming
Many dynamic programming problems are not concave. The utility functions (or cost functions) may be nonconcave (nonconvex). For example, fixed adjustment costs imply a nonconvex cost function that appears in many settings, including investment choice, choice over consumer durables, and price setting with menu costs, among others. In addition, the choice set may be nonconvex, as in discrete choice models.
Consider the general dynamic programming problem,
$V(s, \varepsilon) = \max_{c \in \mathcal{C}} \; u(s, \varepsilon, c) + \beta\, \mathbb{E}\left[ V(s', \varepsilon') \mid s, \varepsilon, c \right],$
where $s$ denotes a vector of state variables, $\varepsilon$ denotes a vector of shocks, and $c$ denotes a vector of choice variables. The expectation is taken with respect to some distribution, $F(s', \varepsilon' \mid s, \varepsilon, c)$, which defines the state transition probabilities. In many settings, the object of interest is the policy function, $c^*(s, \varepsilon)$, which maps state variables to choice variables. Uniqueness is an important condition for a well-defined convergence criterion for policy function iteration, a common way to solve dynamic programming problems.
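A minimal discretized sketch of this kind of problem (the quadratic payoff, fixed adjustment cost, grid, discount factor, and shock scale are all illustrative assumptions of this sketch, not the paper's): value iteration on the deterministic part, then one draw of a linearly additive, continuously distributed choice shock to illustrate why ties in the argmax, and hence multiplicity of the policy, occur with probability zero.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, F = 0.9, 0.5                 # discount factor, fixed adjustment cost
grid = np.linspace(0.0, 1.0, 21)   # capital grid; a choice is next-period capital

def payoff(s, c):
    # Nonconcave period payoff: quadratic benefit minus a fixed cost of adjusting.
    return -(c - 0.6) ** 2 - F * (c != s)

# Value iteration on the deterministic part of the problem.
V = np.zeros(len(grid))
for _ in range(500):
    Q = np.array([[payoff(s, c) + beta * V[j] for j, c in enumerate(grid)]
                  for s in grid])
    V = Q.max(axis=1)

# A linearly additive, continuously distributed shock eps_c attached to each
# choice breaks ties: the argmax below is unique with probability one, because
# a tie would require the shock difference to hit one exact value.
eps = rng.normal(0.0, 0.1, size=len(grid))
policy = np.array([int(np.argmax(Q[i] + eps)) for i in range(len(grid))])
print(policy)
```

The fixed cost makes the choice problem nonconcave (an inaction region around 0.6 plus a jump to adjustment), which is exactly the setting where concavity-based uniqueness arguments fail.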
The typical argument for uniqueness of the policy function follows Theorems 4.8 and 9.8 in Stokey, Lucas, and Prescott (1989), which require the problem to be strictly concave. The author is unaware of attempts to verify uniqueness of the policy function without concavity in a general setting.32Some papers analyze uniqueness of the value function without concavity, such as Rincón-Zapatero and Rodríguez-Palmero (2003) and Martins-da-Rocha and Vailakis (2010), but they do not analyze uniqueness of the policy function. There are very special cases where the policy function has been fully characterized. For example, Majumdar and Mitra (1983) characterize the optimal policy for a linear objective function and a particular type of nonconvex constraint. Outside such cases, characterizing the optimal policy can be very difficult. For example, Caballero and Engel (1999) study investment with adjustment costs and characterize many aspects of the optimal policy, but cannot prove uniqueness.
Assume the following hold.
(a) The choice set, , is a finite or countable union of manifolds, possibly with boundary or corner.
(b) The shocks, , are continuously distributed.
(c) and are differentiable with respect to , and the derivative is continuous with respect to and .
(d) For every , and , where ,
Then, the policy function, , is unique almost surely.
Condition (a) incorporates adjustment costs by allowing one manifold for adjustment and another manifold for no adjustment.
Condition (c) is a regularity condition that is easy to check. If the shocks are independent over time, so that does not depend on , condition (c) simplifies to differentiability of with respect to .
If the shocks are independent over time, condition (d) has the interpretation that for any two choices, there exists a shock that affects the payoffs differently for those two choices. For dependent shocks, the interpretation is the same except the contribution to future payoffs must be considered.
Condition (d) is satisfied in a wide variety of dynamic programming problems.
For example, if one of the ’s is a linearly additive transitory shock, which means that it is independent over time and satisfies , then condition (d) is satisfied. Adding this type of shock to any nonconcave dynamic programming problem implies uniqueness of the policy function almost surely.
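A minimal numerical illustration of this tie-breaking effect (the nonconcave base objective and the normal shock below are assumptions made for the sketch, not part of the corollary): adding a linearly additive, continuously distributed shock to a nonconvex objective leaves the global minimizer unique almost surely.

```python
import numpy as np

# Hypothetical illustration: a nonconcave objective on a finite grid plus a
# linearly additive, continuously distributed transitory shock eps. Both the
# base objective and the shock distribution are made up for this sketch.
rng = np.random.default_rng(1)
x = np.linspace(-2.0, 2.0, 401)
f = np.cos(5 * x) + 0.1 * x**2          # nonconvex base objective, several local minima

ties = 0
n_draws = 10_000
for _ in range(n_draws):
    eps = rng.normal()                  # continuous shock
    g = f + eps * x                     # linearly additive shock term
    m = g.min()
    # Count draws where the minimum is attained at more than one grid point.
    if (np.abs(g - m) < 1e-9).sum() > 1:
        ties += 1

# With a continuously distributed eps, an exact tie between two global
# minimizers requires eps to hit a measure-zero set, so ties stays at 0.
```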
Condition (d) is also satisfied in discrete choice models that incorporate a full set of additively separable choice-specific shocks, as in Rust (1987), among many others. In this context, Corollary 3 also covers shocks that are not additively separable and may be dependent over time.
Random constraints can be incorporated in a limited capacity: either solve the constraints and substitute them into the objective function, or condition on the constraints and verify Assumption Generic using the shocks that enter only the objective function.
4.4 Envelope Theorems
Envelope theorems establish differentiability of the value function, together with a formula for the derivative. Envelope theorems were originally developed for concave problems,33For example, see Benveniste and Scheinkman (1979). and later extended to nonconcave problems. Milgrom and Segal (2002) prove an envelope theorem for general choice sets and nonconcave objective functions using equi-differentiability of the objective function. The equi-differentiability assumption is relaxed in subsequent work by Morand, Reffett, and Tarafdar (2015), hereafter MRT, which provides the most general envelope theorem available.
Uniqueness of the optimum is a key step in proving differentiability of the value function. In the setup of MRT, the value function is directionally differentiable under very weak conditions.34Theorem 7 in MRT gives directional differentiability of the value function assuming differentiability of the objective function and a constraint qualification. However, to get differentiability of the value function, MRT need uniqueness of the optimum so that directional derivatives coincide. Correspondingly, Corollary 11 in MRT proves differentiability of the value function assuming the objective function is strictly quasi-concave and the domain is convex. These are overly strong assumptions considering Theorem 1 provides uniqueness under much weaker conditions.
An alternative approach to envelope theorems for nonconcave problems is Clausen and Strub (2017), which assumes the existence of a differentiable lower support function to the policy function, together with a “Sandwich Lemma,” to prove differentiability. In this context, the existence of a differentiable lower support function requires, as a necessary condition, uniqueness of the policy function.
Applied to utility maximization with a nonconvex budget set, the envelope theorem implies differentiability of the indirect utility function. This, in turn, implies that Roy’s identity holds, the key equation relating demand to preference parameters, which simplifies estimation of those parameters.
Applied to dynamic programming with a nonconcave problem, the envelope theorem implies differentiability of the value function. If the choice variable is continuous, this can be used to derive an Euler equation, which is useful for solving the model as well as estimating parameters in the model.
Uniqueness of the policy function is a useful condition for proving envelope theorems, and Theorem 1 provides almost sure uniqueness in nondegenerate cases.
4.5 Limit Theory in Weakly Identified Models
Concerns about weakly identified parameters arise in a growing number of models. Limit theory for estimators of weakly identified parameters often relies on the assumption that a random function, the limit of the profiled objective function, has a unique minimizer almost surely. Examples include Stock and Wright (2000), Andrews and Cheng (2012), Cheng (2015), Han and McCloskey (2017), and Cox (2017). Andrews and Cheng (2012) provide sufficient conditions in the special case that the key parameter, , which determines the strength of identification, is scalar. However, examples that require vector , including Cheng (2015) and Cox (2017), can benefit from the low-level sufficient conditions stated in this paper.
Cox (2017) defines two types of parameters, and . is identified, but the identification of depends on a function, , that maps structural parameters to reduced form parameters. In particular, for some value, , the function is not injective as a function of .35This structure is closely connected to the definition of identification, incorporating many models including the linear IV model, models estimated by minimum distance, and two examples given below. Let parameters and have dimensions and , respectively. Cox (2017) considers a sequence of true values, converging to , and characterizes the limiting distribution of an estimator for . A typical sequence satisfies the following assumption, which says that influences the value of at the rate. The parameter measures the strength of identification. The case is an important special case corresponding to total unidentification. It derives from for all .
Assumption Weak Id.
(a) is twice continuously differentiable, and
(b) for some , , uniformly on compact sets over , where denotes the derivative of with respect to .
In this application, the parameter serves the same purpose as , indexing the domain of the random function. The domain is the identified set for under identification failure. Allowing for a variety of identified sets is useful because the identified set often has an unusual shape that may be difficult to characterize exactly. Let be a continuous random vector with dimension . If is a symmetric and positive definite matrix, then the limit of the profiled objective function is
where is a continuous function of ,36The formula for is , but the following corollary only uses the continuity of .
We show that the argmin of over is almost surely unique by verifying the conditions of Theorem 1. Assumption Manifold is satisfied as long as the identified set can be written as a finite or countable union of second-countable Hausdorff manifolds, possibly with boundary or corner. The fact that is a continuous random vector implies Assumption Absolute Continuity. Assumption Continuous Differentiability is satisfied by Assumption Weak Id (a). The only assumption we need to verify is Assumption Generic. We place low-level conditions on .
Let and the sequence satisfy Assumption Weak Id. Let be defined in equation (4.1). If
(c) for all , the rank of is , and
(d) there exists an open set containing , such that for almost every , is an injective function of .
Then, satisfies Assumption Generic. Therefore, by Theorem 1, the argmin of over is unique almost surely.
Conditions (c) and (d) eliminate degeneracy in as a function of so that Assumption Generic can be verified by taking derivatives with respect to . Condition (c) is a rank condition guaranteeing that varies enough as a function of . A necessary condition is that . Condition (d) says that is generically identified locally around . Below, two examples are given that demonstrate the importance of these two conditions.
This example demonstrates the importance of condition (d), that is generically identified in a neighborhood of . Consider the model,
where . In this case, identification of is determined by injectivity of in a neighborhood of . Condition (d) is not satisfied because, for any and , there exists an such that the quadratic equation, , has multiple solutions in . We can calculate , which satisfies condition (c). If is estimated by nonlinear least squares, the profiled objective function, , has multiple minimizers with positive probability. Figure 2 gives some simulations of this function.
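The flavor of such simulations can be reproduced with a hypothetical non-injective link; the model below, with link h(theta) = theta**2 and a standard normal draw Z, is made up for illustration and is not the example's model.

```python
import numpy as np

# Hypothetical illustration of multiple minimizers under a non-injective link.
# Take h(theta) = theta**2, which is not injective around theta = 0, and the
# made-up limiting profiled objective Q(theta) = (Z - h(theta))**2, Z ~ N(0,1).
rng = np.random.default_rng(2)
theta = np.linspace(-2.0, 2.0, 4001)

both_signs = 0
n_draws = 1000
for _ in range(n_draws):
    Z = rng.normal()
    Q = (Z - theta**2) ** 2
    argmins = theta[np.abs(Q - Q.min()) < 1e-12]
    # When Z > 0, Q is minimized at both +sqrt(Z) and -sqrt(Z).
    if argmins.min() < -1e-3 and argmins.max() > 1e-3:
        both_signs += 1

# Roughly half the draws (those with Z > 0) produce two global minimizers,
# so uniqueness fails with positive probability.
```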
This example demonstrates the importance of condition (c), which states that has full rank, , for all . Consider the model,
where . In this case, identification of is determined by injectivity of
in a neighborhood of . Condition (d) is satisfied because for any , is injective as a function of . We can calculate
which does not satisfy condition (c) because the rank is zero whenever and are both roots of .
If is estimated by nonlinear least squares, the profiled objective function, , has multiple minimizers with positive probability. Figure 3 gives some simulations of this function. The key components of this example are the two functions in that depend nonlinearly on , and contain different amounts of information about . In the case , the more informative function is weaker, and therefore cannot satisfy condition (c). This example is concerning because it seems likely that these key components are present in more complicated weakly identified models.
4.6 Functionals of Brownian Motion
Many objective functions can be written as functionals of Brownian motion. For example, Chernoff’s distribution is defined to be the distribution of
where is a two-sided Wiener Process satisfying . Kim and Pollard (1990) show that Chernoff’s distribution is well-defined in the sense that the argmin is almost surely unique, and that Chernoff’s distribution characterizes the asymptotic distribution of many estimators at the cube-root rate. Theorem 1 can be used to provide a new proof that the argmin in Chernoff’s distribution is almost surely unique.
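A discretized simulation of a Chernoff-type functional is straightforward; the sign convention (minimizing the path plus a parabola), horizon, and grid below are illustrative assumptions rather than the exact formulation in the text.

```python
import numpy as np

# Discretized simulation of a Chernoff-type functional: minimize W(s) + s**2
# over a grid, where W is a two-sided Wiener process with W(0) = 0. The sign
# convention, horizon T, and step dt are illustrative choices.
rng = np.random.default_rng(3)
dt, T = 0.001, 2.0
n = int(T / dt)

def two_sided_wiener():
    # Two independent Brownian paths glued at W(0) = 0 give W on [-T, T].
    right = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), n))])
    left = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(dt), n))])
    s = np.concatenate([-np.arange(n, 0, -1) * dt, np.arange(n + 1) * dt])
    W = np.concatenate([left[:0:-1], right])
    return s, W

s, W = two_sided_wiener()
G = W + s**2                  # the parabola eventually dominates the path
argmin = s[G.argmin()]        # a single grid point for each draw
```

Collecting `argmin` across many independent draws traces out a discretized approximation to the distribution of the minimizer, whose almost sure uniqueness Corollary 5 establishes.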
The argmin of over is unique almost surely.
The proof of Corollary 5 illustrates the key aspect of accommodating an infinite dimensional source of randomness or nondifferentiability in : only derivatives that are used to verify Assumption Generic are needed. In this case, Assumption Generic is verified by taking derivatives with respect to , for fixed values of .
Corollary 5 is stated only for one example of a functional of a Wiener process, but the proof can be applied to more general functionals of Brownian motion.
This paper states and proves Theorem 1, establishing that the argmin of a random objective function is unique almost surely. The conditions of Theorem 1 are very weak and easy to verify. In particular, Theorem 1 relies on convexity of neither the objective function nor the domain. Instead, it relies on a nondegeneracy condition based on taking derivatives with respect to , a random vector. Theorem 1 is widely applicable, as the six applications discussed here illustrate.
-  Andrews, D., and X. Cheng (2012): “Estimation and Inference with Weak, Semi-Strong, and Strong Identification,” Econometrica, 80, 2153-2211.
-  Benveniste, L., and J. Scheinkman (1979): “On the Differentiability of the Value Function in Dynamic Models of Economics,” Econometrica, 47, 727-732.
-  Blomquist, S., A. Kumar, C. Liang, and W. Newey (2015): “Individual Heterogeneity, Nonlinear Budget Sets, and Taxable Income,” Unpublished Manuscript.
-  Blundell, R., T. MaCurdy, and C. Meghir (2007): “Labor Supply Models: Unobserved Heterogeneity, Nonparticipation and Dynamics,” Handbook of Labor Economics, 6, 4667-4775.
-  Burtless, G., and J. Hausman (1978): “The Effect of Taxation on Labor Supply: Evaluating the Gary Negative Income Tax Experiment,” Journal of Political Economy, 86, 1103-1130.
-  Caballero, R., and E. Engel (1999): “Explaining Investment Dynamics in U.S. Manufacturing: A Generalized (S,s) Approach,” Econometrica, 67, 783-826.
-  Cheng, K., and C. Chen (1988): “Estimation of the Weibull Parameters with Grouped Data,” Communications in Statistics - Theory and Methods, 17, 325-341.
-  Cheng, X. (2015): “Robust Inference in Nonlinear Models with Mixed Identification Strength,” Journal of Econometrics, 189, 207-228.
-  Clausen, A., and C. Strub (2017): “A General and Intuitive Envelope Theorem,” Unpublished Manuscript.
-  Copas, J. (1975): “On the Unimodality of the Likelihood for the Cauchy Distribution,” Biometrika, 62, 701-704.
-  Cox, G. (2017): “Weak Identification in a Class of Generically Identified Models with an Application to Factor Models,” Unpublished Manuscript.
-  Dalton, C. (2014): “Estimating Demand Elasticities Using Nonlinear Pricing,” International Journal of Industrial Organization, 37, 178-191.
-  Demidenko, E. (2008): “Criteria for Unconstrained Global Optimization,” Journal of Optimization Theory and Applications, 136, 375-395.
-  Guillemin, V., and A. Pollack (1974): Differential Topology, Prentice-Hall, Inc., Englewood Cliffs, New Jersey.
-  Han, S., and A. McCloskey (2017): “Estimation and Inference with a (Nearly) Singular Jacobian,” Unpublished Manuscript.
-  Hanemann, M. (1984): “Discrete/Continuous Models of Consumer Demand,” Econometrica, 52, 541-561.
-  Hausman, J. (1985): “The Econometrics of Nonlinear Budget Sets,” Econometrica, 53, 1255-1282.
-  Hausman, J., and W. Newey (2016): “Individual Heterogeneity and Average Welfare,” Econometrica, 84, 1225-1248.
-  Hill, D., R. Saunders, and P. Laud (1980): “Maximum Likelihood Estimation for Mixtures,” The Canadian Journal of Statistics, 8, 87-93.
-  Hillier, G., and M. Armstrong (1999): “The Density of the Maximum Likelihood Estimator,” Econometrica, 67, 1459-1470.
-  Huh, K., A. Postert, and R. Sickles (1998): “Maximum Penalized Likelihood Estimation of Mixed Proportional Hazard Models,” Communications in Statistics - Theory and Methods, 27, 2143-2164.
-  Jewell, N. (1982): “Mixtures of Exponential Distributions,” The Annals of Statistics, 10, 479-484.
-  Karlin, S. (1968): Total Positivity, Vol. 1, Stanford University Press, Stanford.
-  Kim, J., and D. Pollard (1990): “Cube Root Asymptotics,” The Annals of Statistics, 18, 191-219.
-  Kowalski, A. (2015): “Estimating the Tradeoff Between Risk Protection and Moral Hazard with a Nonlinear Budget Set Model of Health Insurance,” International Journal of Industrial Organization, 43, 122-135.
-  Lindsay, B. (1981): “Properties of the Maximum Likelihood Estimator of a Mixing Distribution,” Statistical Distributions in Scientific Work, 5, 95-110.
-  Lindsay, B. (1983a): “The Geometry of Mixture Likelihoods: A General Theory,” The Annals of Statistics, 11, 86-94.
-  Lindsay, B. (1983b): “The Geometry of Mixture Likelihoods, Part II: The Exponential Family,” The Annals of Statistics, 11, 783-792.
-  Lindsay, B. (1995): Mixture Models: Theory, Geometry and Applications, Institute of Mathematical Statistics, United States of America.
-  Lindsay, B., C. Clogg, and J. Grego (1991): “Semiparametric Estimation in the Rasch Model and Related Exponential Response Models, Including a Simple Latent Class Model for Item Analysis,” Journal of the American Statistical Association, 86, 96-107.
-  Lindsay, B., and K. Roeder (1993): “Uniqueness of Estimation and Identifiability in Mixture Models,” The Canadian Journal of Statistics, 21, 139-147.
-  Majumdar, M., and T. Mitra (1983): “Dynamic Optimization with a Non-Convex Technology: The Case of a Linear Objective Function,” The Review of Economic Studies, 50, 143-151.
-  Mäkeläinen, T., K. Schmidt, and G. Styan (1981): “On the Existence and Uniqueness of the Maximum Likelihood Estimate of a Vector-Valued Parameter in Fixed-Size Samples,” The Annals of Statistics, 9, 758-767.
-  Mallet, A. (1986): “A Maximum Likelihood Estimation Method for Random Coefficient Regression Models,” Biometrika, 73, 645-656.
-  Martins-da-Rocha, V., and Y. Vailakis (2010): “Existence and Uniqueness of a Fixed Point for Local Contractions,” Econometrica, 78, 1127-1141.
-  Mas-Colell, A., M. Whinston, and J. Green (1995): Microeconomic Theory, Oxford University Press, New York.
-  Mascarenhas, W. (2010): “A Mountain Pass Lemma and its Implications Regarding the Uniqueness of Constrained Minimizers,” Optimization, 60, 1121-1159.
-  Milgrom, P., and I. Segal (2002): “Envelope Theorems for Arbitrary Choice Sets,” Econometrica, 70, 583-601.
-  Moffitt, R. (1986): “The Econometrics of Piecewise-Linear Budget Constraints,” Journal of Business and Economic Statistics, 4, 317-328.
-  Morand, O., K. Reffett, and S. Tarafdar (2015): “A Nonsmooth Approach to Envelope Theorems,” Journal of Mathematical Economics, 61, 157-165.
-  Olsen, R. (1978): “Note on the Uniqueness of the Maximum Likelihood Estimator for the Tobit Model,” Econometrica, 46, 1211-1215.
-  Orme, C. (1989): “On the Uniqueness of the Maximum Likelihood Estimator in Truncated Regression Models,” Econometric Reviews, 8, 217-222.
-  Orme, C., and P. Ruud (2002): “On the Uniqueness of the Maximum Likelihood Estimator,” Economics Letters, 75, 209-217.
-  Rincón-Zapatero, J., and C. Rodríguez-Palmero (2003): “Existence and Uniqueness of Solutions to the Bellman Equation in the Unbounded Case,” Econometrica, 71, 1519-1555.
-  Roś, B., F. Bijma, J. de Munck, and M. de Gunst (2016): “Existence and Uniqueness of the Maximum Likelihood Estimator for Models with a Kronecker Product Covariance Structure,” Journal of Multivariate Analysis, 143, 345-361.
-  Rust, J. (1987): “Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher,” Econometrica, 55, 999-1033.
-  Seregin, A. (2010): “Uniqueness of the Maximum Likelihood Estimator for k-Monotone Densities,” Proceedings of the American Mathematical Society, 138, 4511-4515.
-  Simar, L. (1976): “Maximum Likelihood Estimation of a Compound Poisson Process,” The Annals of Statistics, 4, 1200-1209.
-  Soloveychik, I., and D. Trushin (2016): “Gaussian and Robust Kronecker Product Covariance Estimation: Existence and Uniqueness,” Journal of Multivariate Analysis, 149, 92-113.
-  Stock, J., and J. Wright (2000): “GMM with Weak Identification,” Econometrica, 68, 1055-1096.
-  Stokey, N., R. Lucas, and E. Prescott (1989): Recursive Methods in Economic Dynamics, Harvard University Press, Cambridge.
-  Szabó, A. (2015): “The Value of Free Water: Analyzing South Africa’s Free Basic Water Policy,” Econometrica, 83, 1913-1961.
-  Tarone, R., and G. Gruenhage (1975): “A Note on the Uniqueness of Roots of the Likelihood Equations for Vector-Valued Parameters,” Journal of the American Statistical Association, 70, 903-904.
-  Varian, H. (1989): “Price Discrimination,” Handbook of Industrial Organization, 597-654.
-  Wang, W., and D. Bice (1997): “A Model with Mixed Binary Responses and Censored Observations,” Communications in Statistics - Theory and Methods, 26, 921-941.
-  Wilson, R. (1993): Nonlinear Pricing, Oxford University Press, New York.
-  Wood, G. (1999): “Binomial Mixtures: Geometric Estimation of the Mixing Distribution,” The Annals of Statistics, 27, 1706-1721.
Appendix A Proofs
A.1 Global to Local
Proof of Lemma 1.
Let for some . Suppose is not uniquely minimized over . Then, there exist , , and , , so that
Furthermore, by Assumption K(d), there exist sets, and , such that , and . It follows that
This implies that for every ,
This implies, by countability of and , that
where denotes the complement of in , the equality follows by assumption, and the convergence follows as by the assumption on . ∎
Proof of Lemma 2.
is an open cover of
Thus, there is a finite subcover, which we index by . Then,