Computational Complexity in algebraic regression

Oliver Gäfvert, Department of Mathematics, KTH, 10044 Stockholm, Sweden. oliverg@math.kth.se, https://people.kth.se/~oliverg/
Abstract.

We analyze the complexity of fitting a variety, coming from a prescribed class of varieties, to a configuration of points in $\mathbb{R}^n$. The complexity measure, called the algebraic complexity, computes the Euclidean Distance Degree (EDdegree) of a certain variety, called the hypothesis variety, as the number of points in the configuration increases.

For the problem of fitting a sphere to a configuration of $m$ points in $\mathbb{R}^n$, we give a closed formula for the algebraic complexity of the hypothesis variety as $m$ grows in a special case. For the general case we conjecture a generalization of this formula, supported by numerical experiments.

1. Introduction

A fundamental problem in data analysis is recovering model parameters from noisy measurements. This is a basic problem in manifold learning/dimensionality reduction [Lee:2007:NDR:1557216], statistical regression and learning theory [hastie01statisticallearning, mlbook]. We phrase this problem in the setting of algebraic geometry and use tools from this field to study it. The model is in our case an algebraic variety $X$, the parameters are the coefficients of polynomials defining an ideal cutting out $X$, and the measurements are points sampled from $X$ with noise from a specified distribution.

To recover the unknown variety $X$ we assume that the points are sampled from $X$ with Gaussian noise and that $X$ comes from a prescribed class of varieties. We then look for the varieties lying closest to the points, in the sense that small perturbations of the points lie on a variety in the class. If the class is that of all hypersurfaces of degree $d$, this is called polynomial regression, which is a special case of linear regression [mlbook]. In this paper we consider more general classes of varieties and therefore use the broader name algebraic regression. The goal of the paper is to analyze the computational complexity of finding the variety in a class that best fits a given set of samples. To this end, we develop a complexity measure, called the algebraic complexity of the class, based on the Euclidean Distance Degree (EDdegree) [edd].

In the theory of Probably Approximately Correct (PAC) learning [mlbook], the class of varieties is called a hypothesis class. The two fundamental invariants of a hypothesis class are the sample complexity and the computational complexity. The sample complexity tells you how many points you need to sample in order to recover the variety with some probability. It measures the richness, or expressibility, of a hypothesis class, while the computational complexity measures the complexity of implementing the learning rule, which in our case means solving an optimization problem. It tells you the amount of work you need to perform in order to obtain an optimal hypothesis. To analyze the sample complexity, one may use tools such as the Vapnik–Chervonenkis (VC) dimension and the Rademacher complexity [mlbook]. Analyzing the computational complexity is much harder, as many learning problems are NP-hard [mlbook], and it would be desirable to characterize their computational complexity relative to each other. We propose using the Euclidean Distance Degree (EDdegree) for this purpose. The EDdegree measures the algebraic complexity of a polynomial optimization problem, and we show in this paper how it can be used to say something about the computational complexity of regression problems.

In our setting, the hypothesis class is defined by polynomial equations and is thus a variety. We therefore refer to it as the hypothesis variety. Rather than measuring the expressibility of the hypothesis variety, like the VC-dimension does, the EDdegree characterizes the complexity of finding an optimal hypothesis for a learning rule, given a set of samples. The EDdegree is the degree of the polynomial describing the optimal solutions to the regression problem of fitting a variety from the class to the given samples. By optimal we mean local minima, local maxima or saddle points of an optimization problem (see Section 3); these are called the critical points. The number of critical points depends on the number of samples, and we therefore consider the growth of the EDdegree of the hypothesis variety as the number of samples grows. We call this function the algebraic complexity. The following example illustrates the meaning of the critical configurations of a specific point configuration. The varieties passing through the critical configurations are the varieties that best fit the configuration.

Example 1.1.

Consider a configuration of four points in the plane. The following figure shows four circles passing through real critical configurations of Problem (2), out of a total of 26 critical configurations:

Figure 1. Four of the critical circles of the point configuration.

The procedure in the above example provides a non-linear generalization of principal component analysis (PCA) (see Section 2.2). For the class of hyperplanes we get the usual (linear) principal components, while for other choices of the class we get something else. For the non-linear hypothesis varieties considered in this paper, the number of principal components we get depends on the number of points in the configuration.

In this paper we investigate the algebraic complexity of a series of classes of varieties. We start from the simplest classes, the class of hyperplanes in $\mathbb{R}^n$ and the class of affine subspaces of a fixed codimension, both of which have closed-form expressions for their algebraic complexity. We then continue by analyzing the class of spheres, for which we prove the following closed formula for the algebraic complexity in a special case:

Theorem 1.

Moreover, if the configuration is real, then the critical points with respect to it are all real.

For the general case we conjecture the following formula, based on the numerical experiments in Table 1:

Conjecture 1.

The number of critical spheres of a generic configuration of $m$ points in $\mathbb{R}^n$ is given by:

Finally, we consider paraboloids and ellipsoids, for which we provide numerical results on their algebraic complexity. The main contributions of the paper are the following:

  • defining the algebraic complexity (Definition 3.3) as a complexity measure for algebraic regression problems.

  • developing tools, such as a weighted EDdegree $\mathrm{EDdeg}_\pi$ obtained by pulling back the Euclidean distance along a projection, for computing and approximating the algebraic complexity (Theorem 3.6, Corollary 3.10, Corollary 3.13 and Proposition 3.11).

  • proving and conjecturing closed formulas for the algebraic complexity of prescribed classes of varieties (Theorem 1 and Conjecture 1) and relating the algebraic complexity to the generalized Eckart-Young theorem (Theorem 4.4).

The computation of the algebraic complexity relies on computing the EDdegree. We compute the EDdegree directly from the defining equations of the critical ideal (see Section 2.2) using numerical methods. Other methods are possible, such as computing it using Chern classes, polar classes and their extensions to singular varieties [edd, ZHANG201855], or the Euler characteristic and the Euler obstruction function [aluffi2018, 2018arXiv181205648M, 2019arXiv190105550M]. For small examples we compute the EDdegree using Macaulay2 [M2], and for all other examples we use Bertini [BHSW06] and the method of regeneration [10.2307/41104703]. Another alternative is the method of Martín del Campo and Rodriguez [MARTINDELCAMPO2017559], which uses monodromy loops to compute the EDdegree, but we leave this for future work.

1.1. Future work

For future work we would like to investigate more complicated classes of varieties, such as polynomial neural networks, which are neural networks [mlbook] with polynomial activation functions. A special type of polynomial neural network was recently studied by Kileel et al. in [2019arXiv190512207K], where they compute the dimension of the corresponding hypothesis variety (see Section 3) for these types of networks. They also note that the EDdegree, and by extension the algebraic complexity, could be useful to characterize the complexity of different architectures of these networks and of polynomial neural networks in general. The algebraic complexity measures the number of critical points of the Euclidean distance function, which in the setting of polynomial neural networks coincides with the mean-squared-error (MSE) objective function, one of the standard objective functions used in practice.

1.2. Organization

The paper is organized as follows. In Section 2 we review background material, such as multivariate polynomial interpolation and the Euclidean distance degree, and define the relevant terms we will use. In Section 3 we define the hypothesis variety, which is our main object of study, and the algebraic complexity. We describe a strategy for computing the EDdegree of the hypothesis variety and determine its dimension. In Section 4 we give results on prescribed classes of varieties and numerical computations of the algebraic complexity of their hypothesis varieties. For the class of spheres, we provide a closed formula for the algebraic complexity in a special case and conjecture a generalization of the formula supported by numerical results.

1.3. Acknowledgments

This work was partly carried out during the author's visit to ICERM during the fall of 2018, as part of the semester program on Nonlinear Algebra. The author would like to thank Sandra Di Rocco and David Eklund, who have been of enormous help during the course of this project, Kathlén Kohn for suggesting to look at the EDdegree, and Bernd Sturmfels and Paul Breiding for useful discussions at the beginning of the project.

2. Preliminaries

Definition 2.1.

A point configuration $U = (u_1, \dots, u_m)$ is a configuration of $m$ points in $\mathbb{R}^n$. When appropriate, we will interpret $U$ as an $m \times n$ matrix in which each point $u_i$ is a row.

Definition 2.2.

The Veronese map $\nu_d$ of degree $d$ is the map sending a point to its $d$th symmetric power. The affine Veronese map $\nu_d^a$ of degree $d$ is the dehomogenization of $\nu_d$, in the sense that it sends $x \in \mathbb{R}^n$ to the vector of all monomials in $x$ of degree at most $d$, which is a vector in $\mathbb{R}^{\binom{n+d}{d}}$.
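As a concrete illustration, the affine Veronese map simply lists all monomials of degree at most $d$ evaluated at a point. The following is a minimal Python sketch (the helper name affine_veronese and the monomial order are our own choices, not code from the paper):

```python
from itertools import combinations_with_replacement
import numpy as np

def affine_veronese(x, d):
    """All monomials of degree <= d in the coordinates of x, ordered by degree.

    The result has length binom(n + d, d), where n = len(x).
    """
    x = np.asarray(x, dtype=float)
    values = []
    for k in range(d + 1):
        for idx in combinations_with_replacement(range(len(x)), k):
            values.append(np.prod(x[list(idx)]))  # empty product = 1
    return np.array(values)

# Example: the affine Veronese map of degree 2 applied to the point (2, 3)
print(affine_veronese([2.0, 3.0], 2))  # [1. 2. 3. 4. 6. 9.]
```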

Definition 2.3.

The Euclidean distance function is defined as $d(u, x) = \sum_i (u_i - x_i)^2$, where $u$ and $x$ are points in either $\mathbb{R}^n$ or $\mathbb{C}^n$. Let $X$ be a variety and $u$ a point. A critical point of the Euclidean distance function with respect to $X$ and $u$ is a point $x \in X$ such that the line from $u$ to $x$ lies in the normal space of $X$ at $x$.

2.1. Multivariate Polynomial Interpolation

Let $U$ be a point configuration of $m$ points in $\mathbb{R}^n$ and assume that each point is sampled from a variety $X$ without noise. It is then straightforward to proceed as in [Breiding2018] to numerically estimate the coefficients of a set of polynomials defining $X$. The main tool for doing this is the Vandermonde matrix, defined as follows:

Definition 2.4.

The Vandermonde matrix of degree $d$ of a point configuration $U = (u_1, \dots, u_m)$ is defined as

$$ V_d(U) = \begin{pmatrix} \nu_d^a(u_1) \\ \vdots \\ \nu_d^a(u_m) \end{pmatrix}, $$

where $\nu_d^a$ is the affine Veronese map of degree $d$.

Example 2.5.

For two points $u_1, u_2$ in the plane and $d = 2$, the Vandermonde matrix $V_2(U)$ is the $2 \times 6$ matrix whose rows are $\nu_2^a(u_1)$ and $\nu_2^a(u_2)$.

Note that the coefficients of any polynomial of degree at most 2 that vanishes on both $u_1$ and $u_2$ lie in the kernel of $V_2(U)$.

In fact, it is true in general that the coefficients of any polynomial of degree at most $d$ that vanishes on $U$ lie in $\ker V_d(U)$. A generating set for an ideal cutting out $X$ can thus be obtained from a choice of basis for this kernel. The problem one now faces is dealing with the numerical errors associated with computing the kernel of the Vandermonde matrix. For this we refer to [Breiding2018], where the authors explore ways of numerically estimating a generating set for the ideal cutting out $X$.
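Numerically, the kernel of the Vandermonde matrix is read off from its SVD. The sketch below is our own illustration (the sample points, monomial order and tolerance are choices made here): it recovers the defining equation of the unit circle, up to scale, from noise-free samples.

```python
from itertools import combinations_with_replacement
import numpy as np

def vandermonde(U, d):
    """Stack the affine Veronese map of degree d of each point (row) of U."""
    rows = []
    for u in U:
        row = []
        for k in range(d + 1):
            for idx in combinations_with_replacement(range(len(u)), k):
                row.append(np.prod(u[list(idx)]))   # empty product = 1
        rows.append(row)
    return np.array(rows)

# Noise-free samples from the unit circle x^2 + y^2 = 1
t = np.linspace(0.1, 2 * np.pi, 8, endpoint=False)
U = np.c_[np.cos(t), np.sin(t)]

V = vandermonde(U, 2)                 # 8 x 6 matrix, columns: 1, x, y, x^2, xy, y^2
_, svals, Vt = np.linalg.svd(V)
kernel = Vt[svals < 1e-8]             # rows of Vt with (numerically) zero singular value
print(np.round(kernel[0] / kernel[0][0], 6))   # ~ [1, 0, 0, -1, 0, -1], i.e. 1 - x^2 - y^2
```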

2.2. Euclidean Distance Degree

The Euclidean Distance Degree (EDdegree) [edd] counts the number of critical points of the squared Euclidean distance function from a generic point $u$ to a variety $X$. The EDdegree thus counts the local minima, maxima and saddle points of this distance function. In this sense, the EDdegree is an algebraic complexity measure for polynomial optimization problems: it equals the degree of the ideal describing the critical points of the objective function subject to polynomial constraints.

The most straightforward way of computing the EDdegree is by describing it as the degree of a certain ideal. Let

$$ M(x) = \begin{pmatrix} u - x \\ J(f)(x) \end{pmatrix}, $$

where $c$ is the codimension of $X$ and $J(f)$ is the Jacobian matrix of the defining equations $f = (f_1, \dots, f_s)$ of $X$. We assume that $I_X = \langle f_1, \dots, f_s \rangle$ is the radical ideal of an irreducible variety $X$ for which we want to compute the EDdegree. The EDdegree is then equal to the degree of the following critical ideal, which is defined as the saturation:

(1) $\quad \big( I_X + \langle (c+1) \times (c+1) \text{ minors of } M(x) \rangle \big) : \big\langle c \times c \text{ minors of } J(f)(x) \big\rangle^{\infty}$

By [edd, Theorem 2.7], the critical ideal is always zero-dimensional if $u$ is a generic point, and the points of the critical ideal are exactly the critical points of the squared Euclidean distance function from $u$ to $X$.
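As a small illustration of the critical ideal (our own SymPy sketch, not code from the paper): for the smooth plane conic $x^2/4 + y^2 = 1$ and a fixed data point, assumed generic, the equations of (1) have four solutions, matching the EDdegree of a smooth conic that is not a circle. Since the curve is smooth, no saturation is needed here.

```python
import sympy as sp

x, y = sp.symbols('x y')
u = (sp.Rational(1), sp.Rational(2))     # a fixed data point, assumed generic

f = x**2 / 4 + y**2 - 1                  # a smooth conic (ellipse), codimension c = 1

# Critical equations from (1): f = 0 together with the vanishing of the
# (c+1) x (c+1) = 2 x 2 minor of the matrix with rows u - (x, y) and grad f.
M = sp.Matrix([[u[0] - x, u[1] - y],
               [sp.diff(f, x), sp.diff(f, y)]])
system = [f, M.det()]

sols = sp.solve_poly_system(system, x, y)
print(len(sols))                          # 4 = EDdegree of the ellipse
for s in sols:
    print(tuple(sp.N(v, 6) for v in s))   # numerical critical points (some complex)
```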

The following result gives a general upper bound of the EDdegree:

Proposition 2.6 ([edd]).

Let $X$ be a variety of codimension $c$ in $\mathbb{C}^n$ that is cut out by polynomials of degrees $d_1 \geq d_2 \geq \cdots \geq d_c \geq \cdots \geq d_s$. Then

$$ \mathrm{EDdeg}(X) \; \leq \; d_1 d_2 \cdots d_c \sum_{i_1 + i_2 + \cdots + i_c \leq n - c} (d_1 - 1)^{i_1} (d_2 - 1)^{i_2} \cdots (d_c - 1)^{i_c}. $$

Equality holds when $X$ is a general complete intersection of codimension $c$.

Example 2.7.

(Eckart-Young Theorem) Let $U$ be a configuration of $m$ points in $\mathbb{R}^n$ and suppose we want to approximate $U$ with a hyperplane in $\mathbb{R}^n$. The objective function that we optimize is the sum of the squared Euclidean distances from each point in the configuration to its closest point on the hyperplane. Note that this is the squared Frobenius norm of the matrix representation of $U - X$, where $X$ is a configuration lying on the hyperplane. It is a consequence of the Eckart-Young Theorem that the critical hyperplanes can be computed analytically using the singular value decomposition (SVD). They are computed by first centering the configuration around the origin and then computing the SVD of $U$ represented as a matrix. Suppose $m \geq n$ and

$$ U = W \Sigma V^T $$

is the SVD of $U$ when represented as an $m \times n$ matrix. Then the critical hyperplanes are given by kernel elements of the matrices

$$ U_i = W \Sigma_i V^T, \qquad i = 1, \dots, n, $$

where $\Sigma_i$ is $\Sigma$ with the $i$th singular value set to zero. Each $U_i$ has a one-dimensional kernel and thus the number of critical hyperplanes equals $n$. This means that the EDdegree of the set of configurations lying on a hyperplane in $\mathbb{R}^n$ equals $n$ when $m \geq n$. The kernel elements of the $U_i$ are the principal components of $U$ in the sense of principal component analysis (PCA) [mlbook].
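A minimal numerical check of this example, written in numpy (variable names are ours): after centering, zeroing the $i$th singular value produces the $i$th critical configuration, whose points lie on the hyperplane orthogonal to the $i$th right singular vector, at Frobenius distance equal to that singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 7, 3
U = rng.standard_normal((m, n))     # a generic configuration of m points in R^n
U = U - U.mean(axis=0)              # center the configuration around the origin

W, svals, Vt = np.linalg.svd(U, full_matrices=False)   # U = W diag(svals) Vt

for i in range(n):
    s = svals.copy()
    s[i] = 0.0                      # drop the i-th singular value
    Xi = W @ np.diag(s) @ Vt        # the i-th critical configuration
    normal = Vt[i]                  # normal vector of the i-th critical hyperplane
    # The points of Xi lie on the hyperplane with this normal, and the distance
    # from U to Xi equals the dropped singular value.
    print(i, np.abs(Xi @ normal).max(), np.linalg.norm(U - Xi), svals[i])
```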

Remark 2.8.

It is possible to replace the Euclidean distance function with a generic positive definite quadratic form; the resulting count is called the generic EDdegree. In the next section we will see what happens when we replace the Euclidean distance function with its pull-back along a projection map, which results in a positive semi-definite quadratic form on the domain.

3. The algebraic complexity of the Hypothesis variety

In this section we define the optimization problem we want to analyze and the complexity measure used to study it, namely the algebraic complexity. The optimization problem is associated to a variety, called the hypothesis variety. We develop the tools we use to compute the algebraic complexity of this variety.

Let $F = (F_1, \dots, F_s)$ be a system of polynomials in the variables $x = (x_1, \dots, x_n)$ whose coefficients are polynomials in parameters $a = (a_1, \dots, a_k)$, and let $U = (u_1, \dots, u_m)$ be a point configuration in $\mathbb{R}^n$. For a fixed parameter value $a$, let $V(F_a)$ denote the zero locus of the polynomial system $F_a = F(\,\cdot\,; a)$. The zero loci $V(F_a)$ define a class of varieties in $\mathbb{C}^n$; in fact, this class is parametrized by a variety in the space of coefficients. Our goal is to analyse the complexity of fitting a variety coming from this class to the point configuration $U$, in the sense of the following optimization problem:

(2) $\quad \displaystyle \min_{x_1, \dots, x_m,\; a} \; \sum_{i=1}^m \| u_i - x_i \|^2 \quad \text{subject to} \quad F(x_i; a) = 0, \quad i = 1, \dots, m.$

The objective function above is the squared Euclidean distance function in the configuration space $\mathbb{R}^{mn}$. The optimization problem finds the closest configuration $x = (x_1, \dots, x_m)$ to $U$ such that there is a variety in the class passing through each $x_i$. One may view the problem as being handed noisy samples $u_1, \dots, u_m$, sampled from a variety coming from the class, and the goal is to recover the true values, which is done by finding the smallest perturbation of $U$ that lies on such a variety. Consider for instance the case of Example 2.7, where the class is that of hyperplanes in $\mathbb{R}^n$. Finding the minimal perturbation of $U$ is resolved by the Eckart-Young theorem and computed using the SVD. The singular values correspond to critical values of the objective function in Problem (2), under the assumption that $U$ is centered around the origin.
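For a non-linear class there is no SVD shortcut, but one critical configuration of Problem (2), typically a local minimum, can be found with a local solver. The sketch below is our own and assumes the class of circles in the plane; the EDdegree counts all complex critical configurations, of which such a solver finds only one.

```python
import numpy as np
from scipy.optimize import least_squares

# Noisy samples, assumed to come from a circle in the plane
rng = np.random.default_rng(1)
t = rng.uniform(0.0, 2.0 * np.pi, 20)
U = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.standard_normal((20, 2))

# For the circle class, the closest configuration projects each u_i radially onto
# the circle, so the objective of Problem (2) reduces to sum_i (|u_i - c| - r)^2
# over the center c and radius r.
def residuals(p):
    c, r = p[:2], p[2]
    return np.linalg.norm(U - c, axis=1) - r

fit = least_squares(residuals, x0=np.array([0.1, -0.1, 2.0]))
c, r = fit.x[:2], fit.x[2]

# The corresponding critical configuration: radial projection of the samples
X = c + r * (U - c) / np.linalg.norm(U - c, axis=1, keepdims=True)
print(c, r)   # close to center (0, 0) and radius 1 for this data
```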

Example 3.1.

Fix a configuration of four points in the plane, centered around the origin, and let $F$ be an affine-linear polynomial in $x$ and $y$, some of whose coefficients are fixed to randomly chosen constants in order to dehomogenize the equation. The following figures illustrate the two lines passing through the two critical configurations of Problem (2), which are both real and computed using the SVD:

Figure 2. The two critical lines to the point configuration.

As noted in Example 2.7, the two critical lines correspond to the principal components of the point configuration.

Example 3.2.

In Example 1.1 we have the same configuration as in the previous example, but we instead take $F$ to define the class of circles in the plane. We then get the critical circles shown in that example.

The optimization problem (2) is in general non-linear and has many local minima, local maxima and saddle points. Our goal is to develop a complexity measure to study the complexity of Problem (2) as $m$ and $n$ grow. We will study this complexity by considering the Euclidean Distance Degree of a certain variety with respect to the point configuration $U$. The variety we consider is the zero locus of the optimization problem (2), by which we mean the set of configurations for which the global minimum of Problem (2) is zero. This zero locus is the image of a variety under a projection, and its closure is the hypothesis variety $\mathcal{H}_m$.

Definition 3.3.

Consider the incidence variety:

$$ Z_m = \{ (x_1, \dots, x_m, a) \in \mathbb{C}^{mn} \times \mathbb{C}^k \; : \; F(x_i; a) = 0 \text{ for } i = 1, \dots, m \}. $$

Define the hypothesis variety, denoted $\mathcal{H}_m$, to be the algebraic closure of the image of $Z_m$ under the projection $\pi \colon \mathbb{C}^{mn} \times \mathbb{C}^k \to \mathbb{C}^{mn}$ onto the first $mn$ coordinates.

Note that for a generic $x = (x_1, \dots, x_m) \in \mathcal{H}_m$ there exists an $a$ such that $F(x_i; a) = 0$ for all $i$. It is clear that any configuration in the zero locus of Problem (2) lies in $\mathcal{H}_m$. It follows from the above definition that $\mathcal{H}_m$ is the algebraic closure of the zero locus of Problem (2).

The algebraic complexity of finding the optimal solution of Problem (2) may be characterized by the complexity of writing down the polynomial defining its solutions. The degree of this polynomial is the EDdegree of $\mathcal{H}_m$. We refer to the function

$$ m \longmapsto \mathrm{EDdeg}(\mathcal{H}_m) $$

as the algebraic complexity of the class defined by $F$.

Computing the EDdegree of $\mathcal{H}_m$ is however not straightforward, since $\mathcal{H}_m$ is defined as the closure of the image of $Z_m$ under the projection $\pi$. The defining equations of $\mathcal{H}_m$ can be computed via elimination ideals of $I_{Z_m}$. This computation is very costly, since the result is a Gröbner basis for $I_{\mathcal{H}_m}$, and Gröbner basis computation is known to have doubly exponential complexity in the number of variables in the worst case. To remedy this we make the following definition.
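Before turning to that definition, here is a toy illustration of the elimination step (our own SymPy sketch): for the class of non-vertical lines in the plane and $m = 3$, eliminating the line parameters with a lex Gröbner basis recovers the classical collinearity determinant as the defining equation of $\mathcal{H}_3$. Even this tiny computation hints at why elimination does not scale to larger classes and configurations.

```python
import sympy as sp

# Incidence variety Z_3 for the class of lines y = k*x + b in the plane:
# three points (x_i, y_i) together with the parameters (k, b).
k, b = sp.symbols('k b')
xs = sp.symbols('x1 x2 x3')
ys = sp.symbols('y1 y2 y3')

eqs = [ys[i] - k * xs[i] - b for i in range(3)]

# Eliminate the parameters k, b using a lex Groebner basis (k, b ordered first).
G = sp.groebner(eqs, k, b, *xs, *ys, order='lex')
eliminant = [g for g in G.exprs if not g.has(k) and not g.has(b)]
print(eliminant)
# Expected (up to sign and scaling): the collinearity determinant
# det [[x1, y1, 1], [x2, y2, 1], [x3, y3, 1]].
```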

Definition 3.4.

Let $\mathrm{EDdeg}_\pi(Z_m)$ denote the EDdegree of $Z_m$ computed using the Euclidean distance on $\mathbb{C}^{mn}$ pulled back to $\mathbb{C}^{mn} \times \mathbb{C}^k$ along $\pi$. This means that the distance between $(x, a)$ and $(x', a')$ is given by $\| x - x' \|^2$. Note that this may be zero even if $(x, a) \neq (x', a')$, and thus the pulled-back distance is a pseudometric. Let $g$ denote the system of defining equations of $Z_m$; the critical ideal is in this case defined as follows:

(3) $\quad \Big( I_{Z_m} + \big\langle (c+1) \times (c+1) \text{ minors of } \begin{pmatrix} (u - x, \; 0) \\ J(g) \end{pmatrix} \big\rangle \Big) : \big\langle c \times c \text{ minors of } J(g) \big\rangle^{\infty},$

where $u \in \mathbb{C}^{mn}$ denotes the configuration regarded as a single point, $c$ is the codimension of $Z_m$ in $\mathbb{C}^{mn+k}$ and $I_{Z_m}$ is the ideal of $Z_m$ generated by its defining equations as given by Definition 3.3. Finally, $J(g)$ denotes the Jacobian matrix of $g$.

We might expect the EDdegree of $\mathcal{H}_m$ to equal $\mathrm{EDdeg}_\pi(Z_m)$, but this is not the case, as shown by the following example.

Example 3.5.

Consider configurations of three points in the plane and let $F$ define the class of circles, with $m = 3$. Then $\mathcal{H}_3 = \mathbb{C}^{6}$, since any three (distinct) points in the plane have a unique circle passing through them, so $\mathrm{EDdeg}(\mathcal{H}_3) = 1$. However, $\mathrm{EDdeg}_\pi(Z_3) = 4$. Its critical points consist of the unique circle passing through the points and 3 additional circles illustrated by the following figures:

Note that in each critical configuration above, exactly two points are mapped to the same point on the circle. This would normally yield a singular point, but not in this case.

We are interested in the generic behavior of the critical points, in the sense of generic choices of the configuration $u$ and generic fibers of $\pi$. What is interesting is thus the components of $Z_m$ dominating $\mathcal{H}_m$ under the projection $\pi$ and the components dominating the parameter space under the projection onto the parameters. We will throughout this section assume that there is a component dominating both. For convenience we assume that $Z_m$ has only one such component and is irreducible. Note that if $V(F_a)$ is irreducible for a generic $a$, then so is $Z_m$, and thus it is reasonable to assume that $Z_m$ is irreducible.

The following result shows that $\mathrm{EDdeg}(\mathcal{H}_m)$ is bounded from above by $\mathrm{EDdeg}_\pi(Z_m)$ and that the critical points of $\mathcal{H}_m$ correspond to a subset of the critical points of $Z_m$.

Theorem 3.6.

$\mathrm{EDdeg}(\mathcal{H}_m) \leq \mathrm{EDdeg}_\pi(Z_m)$. Moreover, suppose that $x \in \mathcal{H}_m$ is critical with respect to some configuration $u$; then any element of the fibre $\pi^{-1}(x) \cap Z_m$ is critical with respect to $u$ in the pulled-back distance.

Proof.

First note that by [edd, Lemma 2.1] it follows that $\mathrm{EDdeg}(\mathcal{H}_m)$ is finite, and thus the inequality holds even if $\mathrm{EDdeg}_\pi(Z_m)$ is infinite. We may compute the image of the projection $\pi$ by computing a Gröbner basis of $I_{Z_m}$ with an appropriate monomial ordering. A subset of this basis describes the image and is a Gröbner basis for $I_{\mathcal{H}_m}$. When we compute the EDdegree of $\mathcal{H}_m$ we consider the following Jacobian matrix:

Note that the upper right block is the Jacobian matrix appearing in the critical ideal of $Z_m$. Consequently, any critical point of $\mathcal{H}_m$ is also a critical point of $Z_m$ in the pulled-back distance. It remains to prove two things. The first is that such a critical point does not yield a singular point of $\mathcal{H}_m$. Generically, this would only happen if $\mathcal{H}_m$ had a singular component, but this cannot be the case since it is assumed to be irreducible. The second is that a critical point of $\mathcal{H}_m$ might fail to lie in the image $\pi(Z_m)$, but generically this does not happen either, since the image is constructible and thus contains a Zariski open subset of its closure $\mathcal{H}_m$. ∎

Remark 3.7.

Note that if $F$ is homogeneous in the parameters $a$, so that $F(x; \lambda a) = \lambda^{e} F(x; a)$ for some $e$, then the set of critical points is not finite, since for any critical point $(x, a)$ the point $(x, \lambda a)$ is also a critical point. Thus we have to assume that the map $\pi|_{Z_m}$ is generically finite.

Consider the variety of pairs $(u, x)$, denoted $\mathcal{E}_X$, where $x \in X$ is a critical point of the Euclidean distance function with respect to $u$. This is called the ED-correspondence in [edd]. Let $pr_1$ denote the projection onto the first component. Then $\mathrm{EDdeg}(X) = \# pr_1^{-1}(u)$, where $u$ is generic. If $X$ is irreducible, then $\mathcal{E}_X$ is an irreducible variety whose dimension equals that of the ambient space. This means that the projection $pr_1$ has finite fibers over generic points, which is the same as saying that the critical ideal of $X$ with respect to a generic $u$ is zero-dimensional (see the proof of [edd, Lemma 2.1]).

We can construct an analogous object for $Z_m$. Consider the variety

where denotes the normal space of at . Assume that is generically finite, then . Let denote the projection onto the first component. From the proof of Theorem 3.6 we note that . This then means that if is generically finite and , then . Thus is an irreducible variety of dimension , which means that the projection onto the second component has finite fibers and thus that is finite.

Note that if for a generic . Then this implies that , which for dimensionality reasons forces . This means that the condition is equivalent to . We will now show that this condition is also equivalent to being generically finite.

Lemma 3.8.

is generically finite if and only if for a generic .

Proof.

Suppose we fix a generic and consider the system of equations in variables. The condition that is equivalent to saying that the associated variety is zero-dimensional, since we assumed that is irreducible. Conversely, if the system does not have full rank which means that the variety describing the solutions is not zero-dimensional, which implies that is not generically finite. If is generically finite then the variety describing the solutions to the system is zero-dimensional, which implies that the system has full rank and thus that . ∎

Proposition 3.9.

The critical ideal corresponding to $\mathrm{EDdeg}_\pi(Z_m)$ is zero-dimensional for a generic configuration $u$ if and only if $\pi|_{Z_m}$ is generically finite.

Proof.

Suppose that is zero-dimensional for a generic . By Theorem 3.6 this means that is generically finite and that . Now assume that . This forces and thus . Now consider the projection onto the second component. Since is an irreducible variety of dimension it follows that the fibers of are not generically finite, which implies that the critical ideal of with respect to is not zero-dimensional. Thus it has to hold that which by Lemma 3.8 implies that is generically finite.

Conversely, assume that is generically finite, which by Lemma 3.8 implies that rank for a generic . Then the differential is surjective at . To see this, note that if and , then the vector has only non-zero values for coordinates corresponding to . The fact that is surjective implies that it is an isomorphism, since dim, and thus any in the normal space of at is in the normal space of at under the projection . The result then follows from the fact that the critical ideal of with respect to is zero-dimensional. ∎

Corollary 3.10.

Let be a critical configuration of with respect to some . Then is a critical configuration of with respect to if has full rank.

Proof.

As noted in the proof of Proposition 3.9: if has full rank, then for any in the normal space of it holds that is in the normal space of at , which means that it is critical. ∎

Note that if , then the normal space of at any is a point, since . This means that the only critical point of with respect to is itself. For however we have a different situation. Note that in this case is a square matrix. Thus any critical point of with respect to , which is not itself, would have . In Example 1.1 these critical points all come from the subvariety of consisting of points where is such that for some . This means that all critical points for which lie on this subvariety of and are degenerate in this way. This also means that any lie in the normal space of this subvariety on .

For the remaining cases we have observed through numerical experiments, and proved for two special cases (see Sections 4.2 and 4.1), that this does not happen in general. Therefore we state the following conjecture:

Conjecture 2.

If and is generically finite of degree , then

3.1. Complete intersection

In this section we assume that $Z_m$ is a complete intersection. From the first statement below we will see that it suffices to assume that the zero locus $V(F_a)$ is a complete intersection for a generic choice of $a$. This allows us to read off the dimension of $\mathcal{H}_m$ when the projection $\pi$ is generically finite.

The fact that $Z_m$ is a complete intersection is computationally advantageous, since it allows using Lagrange multipliers instead of computing the minors in the critical ideal (see Section 2). We end the section by showing that $\mathrm{EDdeg}_\pi(Z_m)$ is monotone in $m$ in a special case.

Proposition 3.11.

Suppose that the zero locus $V(F_a)$ is a complete intersection for a generic choice of $a$. Then $Z_m$ is a complete intersection.

Proof.

By definition, the ideal of is generated by equations . To show that is a complete intersection we will show that its codimension in is . To show that has codimension we consider the rank of its Jacobian matrix:

Note that the right part of the above matrix is block-diagonal and is of rank if rank for all for any smooth point on . Note that since is irreducible, this condition is satisfied by the fact that is a complete intersection for a generic and that if is a singular point of , then is a singular point of . Consequently, has codimension and is thus a complete intersection. ∎

Corollary 3.12.

If the conditions of Proposition 3.11 hold, then codim.

When $Z_m$ is a complete intersection we can utilize a trick for computing $\mathrm{EDdeg}_\pi(Z_m)$ more efficiently. In this case, the Jacobian $J(g)$ has full rank at smooth points and we can replace the minors in the critical ideal (3) with Lagrange multipliers. This yields the following critical ideal:

Corollary 3.13.

If $Z_m$ is a complete intersection, then the critical ideal for computing $\mathrm{EDdeg}_\pi(Z_m)$ is given by:

$$ \Big\langle \; g, \;\; \begin{pmatrix} u - x \\ 0 \end{pmatrix} - J(g)^{T} \lambda \; \Big\rangle \subset \mathbb{C}[x, a, \lambda], $$

where $\lambda$ is a vector of new Lagrange multiplier variables, one for each defining equation in $g$.

Lemma 3.14.

Suppose that $Z_m$ is a complete intersection and has non-singular critical points for some specific point configuration $u$. Then the number of these critical points is at most $\mathrm{EDdeg}_\pi(Z_m)$.

Proof.

To show this we will construct a square system of equations describing the critical ideal, parametrized by a start configuration $u$. The claim then follows from [bertini, Theorem 7.1.1(2)]. First note that $Z_m$ is cut out by $ms$ equations in $mn + k$ variables. Since $Z_m$ is assumed to be a complete intersection we may use the critical ideal of Corollary 3.13. This critical ideal uses Lagrange multipliers and thus introduces $ms$ new variables. There are $mn + k$ equations coming from the Jacobian $J(g)$. In total this yields $mn + k + ms$ equations in $mn + k + ms$ variables, which thus yields a square system of polynomial equations. This system is parametrized by the start configuration $u$ and can thus be described by a polynomial map. Let $N(u)$ denote the number of non-singular solutions of the system for a specific choice of $u$. Then by [bertini, Theorem 7.1.1(2)] it follows that $N(u) \leq N(u')$ for a generic choice of $u'$, by which the statement follows. ∎
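The square system in this proof can be assembled explicitly. The sketch below is our own illustration for the circle class ($n = 2$, $k = 3$, one defining polynomial per point, so $mn + k + m = 12$ equations and unknowns for $m = 3$); it finds one critical point numerically with a local root solver, whereas in practice one would track all solutions with homotopy continuation software such as Bertini, as done in the paper.

```python
import numpy as np
import sympy as sp
from scipy.optimize import fsolve

m = 3                                           # number of points in the plane
u = [(1.2, 0.1), (-0.3, 1.0), (0.2, -1.1)]      # start configuration (3 data points)

xs = sp.symbols(f'x1:{m+1}')                    # coordinates of the fitted points
ys = sp.symbols(f'y1:{m+1}')
a1, a2, a3 = sp.symbols('a1 a2 a3')             # circle: x^2 + y^2 + a1*x + a2*y + a3
lam = sp.symbols(f'l1:{m+1}')                   # one Lagrange multiplier per point

f = [xs[i]**2 + ys[i]**2 + a1 * xs[i] + a2 * ys[i] + a3 for i in range(m)]
L = sum((u[i][0] - xs[i])**2 + (u[i][1] - ys[i])**2 for i in range(m)) \
    + sum(lam[i] * f[i] for i in range(m))

vars_ = list(xs) + list(ys) + [a1, a2, a3] + list(lam)
eqs = [sp.diff(L, v) for v in list(xs) + list(ys) + [a1, a2, a3]] + f
assert len(eqs) == len(vars_)                   # square system: mn + k + m on both sides

F = sp.lambdify(vars_, eqs, 'numpy')
start = np.array([p[0] for p in u] + [p[1] for p in u] + [0.0, 0.0, -1.0] + [0.0] * m)
sol = fsolve(lambda v: np.array(F(*v), dtype=float), start)
print(np.round(sol, 4))                         # one of the finitely many critical points
```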

Theorem 3.15.

Suppose is such that . Then,

Proof.

Let be the map that takes

and let be the map that takes

Let denote the corresponding polynomial for . Note that if , then since:

Suppose that is a critical point of with respect to a point configuration . We will show that is a critical point of with respect to . We do this by showing that is in the row space of . is a matrix which means that it has more columns than . We will show that the added columns are just duplicates of columns from and that they do not add any equation to the critical ideal.

Note that the column of corresponding to is duplicated and the column corresponding to is duplicated and scaled in . The scaling is fine since it is supposed to sum up to zero when multiplied by in the critical ideal. The other columns which differ are the columns corresponding to for . These columns are also duplicated and scaled by . Note that the derivative of with respect to is linear in and . Also note that the -coordinate of is also scaled by . This means that the whole equation in the critical ideal corresponding to this column is scaled by , which means that its solutions are the same. We can view it as if the column of corresponding to is mapped to the following columns in :

Thus we may conclude that is a critical point of with respect to . This means that any critical point of with respect to is a critical point of with respect to under . The statement then follows from Lemma 3.14. ∎

Corollary 3.16.

Suppose is a system of equations of the form

such that for a generic choice of $a$, the zero locus of $F_a$ is a complete intersection. Then,

Proof.

We use the same map as in the proof of Theorem 3.15. Let be the map that takes

where denotes the tuple and takes

Since the zero locus is a complete intersection for a generic choice of the parameters, it follows from Proposition 3.11 that the corresponding incidence variety is a complete intersection. The argument is then analogous to the proof of Theorem 3.15 and the statement follows from Lemma 3.14. ∎

4. Results for prescribed classes

In this section we investigate the algebraic complexity of a number of classes of varieties.

4.1. Linear

Suppose that each polynomial in the system $F$ has coefficients which are linear in the parameters $a$. Without loss of generality we may assume that $F$ is the system:

where