
# Spurious Vanishing Problem in Approximate Vanishing Ideal

Hiroshi Kera and Yoshihiko Hasegawa
Department of Information and Communication Engineering, Graduate School of Information Science and Technology, The University of Tokyo. Corresponding author: Hiroshi Kera (e-mail: kera@biom.t.u-tokyo.ac.jp).
###### Abstract

The approximate vanishing ideal, a recent concept from computer algebra, is a set of polynomials that almost take a zero value on a set of given data points. Introducing approximation into the exact vanishing ideal has played a critical role in capturing the nonlinear structures of noisy data through approximate vanishing polynomials. However, approximate vanishing has a theoretical flaw, the spurious vanishing problem: any polynomial can be turned into an approximate vanishing polynomial by coefficient scaling. In this paper, we propose the first general method that enables many recent basis construction algorithms to overcome the spurious vanishing problem. In particular, we integrate coefficient normalization with polynomial-based basis constructions, which, unlike early basis construction algorithms, do not need a proper ordering of monomials to process. We further propose a method that takes advantage of the iterative nature of basis construction so that the computationally costly operations for coefficient normalization can be circumvented. Moreover, a coefficient truncation method is proposed for further acceleration. Experiments show that the proposed method overcomes the spurious vanishing problem and significantly increases the accuracy of classification.

## 1 Introduction

Discovering the nonlinear structure behind data is a common task across fields such as machine learning, computer vision, and systems biology. An emerging concept from computer algebra for this task is the approximate vanishing ideal [heldt2009approximate, robbiano2010approximate], which is defined as a set of polynomials that almost take a zero value, i.e., approximately vanish, for any point in the data. Roughly, for a set of $n$-dimensional points $X$,

 $\mathcal{I}_{\mathrm{app}}(X) = \{g \in \mathcal{P}_n \mid \forall \mathbf{x} \in X,\ g(\mathbf{x}) \approx 0\},$

where $\mathcal{P}_n$ is the set of all $n$-variate polynomials over the real numbers. An approximate vanishing polynomial $g$ holds $X$ as its approximate roots, which implies that $g$ reflects the nonlinear structure underlying $X$. In particular, computing a basis set of the approximate vanishing ideal has attracted considerable interest [heldt2009approximate, livni2013vanishing, limbeck2014computation, reza2017principal]; such basis vanishing polynomials describe a polynomial system that has $X$ as approximate common roots, implying that the nonlinear structure of the data is captured in the system. Various basis construction algorithms have been proposed and exploited in applications. For instance, discriminative nonlinear features of data are extracted for classification [livni2013vanishing, shao2016nonlinear, hou2016discriminative]; independent signals are estimated for the blind source separation task [kiraly2012regression, wang2018nonlinear]; nonlinear dynamical systems are reconstructed from noisy observations [kera2016noise], and so forth [torrente2009application, kera2016vanishing].

The essential ingredient of the notion of approximate vanishing is the error tolerance $\epsilon \ge 0$. A polynomial $g$ approximately vanishes for a point $\mathbf{x}$ if $|g(\mathbf{x})| \le \epsilon$. An exact vanishing ideal, where $\epsilon = 0$, can result in a corrupted model that overfits to noisy data and is far from the actual data structure. By setting a proper $\epsilon > 0$, the basis set of approximate vanishing polynomials is expected to be a polynomial system that extracts the informative structure from noisy data.

However, this approximation gives rise to a new theoretical question that never arises in the exact case: how can we fairly evaluate approximate vanishing across polynomials? Approximate vanishing polynomials change their extent of vanishing just by rescaling. For example, a nonvanishing polynomial $g$ whose evaluations at the data points have magnitude $\epsilon$ is easily turned into an $\epsilon/2$-vanishing polynomial by rescaling its coefficients by $1/2$, i.e., $g/2$. In other words, approximate vanishing can be achieved by small coefficients regardless of the roots of polynomials; such spurious approximate vanishing polynomials do not hold any fruitful structure of the data. The reverse is also true: polynomials that describe the data well can be rejected as nonvanishing polynomials because of their large coefficients. This spurious vanishing problem has lain unnoticed behind recent basis construction algorithms for the approximate vanishing ideal.
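This rescaling effect is easy to check numerically. The sketch below uses hypothetical random points and a hypothetical tolerance `eps`; it shows that scaling the coefficients of a polynomial scales its extent of vanishing by the same factor, so any polynomial can be made `eps`-vanishing without saying anything about its roots:

```python
import numpy as np

# Hypothetical data points and a clearly nonvanishing polynomial
# g(x, y) = x^2 + y^2 - 1, evaluated at every point.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

def g(points, scale=1.0):
    # `scale` multiplies every coefficient of g
    return scale * (points[:, 0] ** 2 + points[:, 1] ** 2 - 1.0)

eps = 0.1
# The extent of vanishing ||g(X)|| is linear in the coefficient scale:
print(np.linalg.norm(g(X)), np.linalg.norm(g(X, 0.5)))

# Any polynomial becomes "eps-vanishing" under a small enough scale:
s = eps / (2 * np.linalg.norm(g(X)))
assert np.linalg.norm(g(X, s)) <= eps
```

The rescaled polynomial has exactly the same roots as `g`, which is why such vanishing is spurious.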

In this paper, we deal with the spurious vanishing problem in recent basis construction algorithms (here called polynomial-based algorithms) by forcing basis polynomials to be normalized before evaluating their approximate vanishing, so that the extent of vanishing can be fairly measured and compared across polynomials. As a particular normalization, we consider coefficient normalization, which constrains the sum of squares of the coefficients of a polynomial to be unity. Coefficient normalization is intuitive; indeed, it has been naturally taken into account in early basis construction algorithms [heldt2009approximate, limbeck2014computation, fassino2010almost], which are here referred to as monomial-based algorithms. However, introducing coefficient normalization into the polynomial-based basis construction algorithms is not straightforward, although these algorithms are much more commonly used in applications [livni2013vanishing, zhao2014hand, hou2016discriminative, yan18deep, wang2018nonlinear]. In addition, coefficient normalization is computationally costly to introduce; to be precise, extracting the coefficients of the monomials in a polynomial is the costly part. The polynomial-based algorithms generate basis polynomials in a nested sum-product form, which needs to be expanded into a sum of monomials, resulting in a high computational cost. For the first challenge (optimal generation of basis polynomials under normalization), we formulate it as a constrained optimization problem in the form of a generalized eigenvalue problem. We theoretically prove that a general polynomial-based basis construction with this generalized eigenvalue problem can still output a basis set of the vanishing ideal. We also provide a rigorous theoretical analysis of its optimality and stability.
For the second challenge (the high computational cost of coefficient normalization), we propose a method that sidesteps the costly polynomial expansion otherwise needed to obtain the coefficients of polynomials, by exploiting the iterative nature of basis construction and precomputation. Furthermore, we propose a coefficient truncation method, which makes coefficient normalization work much faster at the cost of exact coefficient calculation.

In the experiments, we integrate the proposed methods into Vanishing Component Analysis (VCA; [livni2013vanishing]), which is the most widely used basis construction method of the approximate vanishing ideal in applications [zhao2014hand, yan18deep, wang2018nonlinear]. We show that VCA encounters severe coefficient growth and decay during the computation, resulting in spurious vanishing polynomials with small coefficients and spurious nonvanishing polynomials with large coefficients. When such polynomials are normalized to have a unit coefficient norm, spurious vanishing polynomials turn into nonvanishing polynomials, and spurious nonvanishing polynomials turn into approximate vanishing polynomials. In contrast, VCA unified with the proposed methods does not produce any spurious vanishing or nonvanishing polynomials. In classification tasks, our approach extracts better feature vectors than VCA, resulting in higher classification accuracy.

Our contributions are summarized as follows:

• We propose the first general method that can introduce normalization into recent basis construction algorithms to avoid the spurious vanishing problem, which has gone unnoticed in the polynomial-based basis construction algorithms.

• We perform rigorous theoretical analysis on the validity, optimality, and stability of the proposed algorithm.

• We propose two efficient methods (an exact one and an approximate one) for coefficient normalization, which is otherwise computationally costly to introduce into the polynomial-based basis construction algorithms.

## 2 Related Work

Coefficient normalization has been used in early basis construction algorithms (monomial-based algorithms [moller1982construction, heldt2009approximate, fassino2010almost, limbeck2014computation]), where monomials are linearly combined by a unit vector to construct a polynomial, naturally leading to a polynomial with a unit coefficient norm. On the other hand, more recent basis construction algorithms (polynomial-based algorithms), where polynomials are linearly combined by a unit vector to construct a polynomial, fail to consider this normalization. Linearly combining polynomials by a unit vector does not imply that the resulting polynomial is normalized in its coefficients, because terms can merge or cancel out. As a consequence, the polynomial-based algorithms have been suffering from the spurious vanishing problem without it being noticed.

Nevertheless, polynomial-based algorithms have been more commonly used in applications across different fields because, unlike monomial-based algorithms, they do not require a monomial order. A monomial order defines an ordering of monomials to process [cox1992ideals]. Different monomial orders can yield different results, and there are exponentially many possible monomial orders. In many applications, the proper order of monomials is unknown, or monomials (in particular, variables) are expected to be handled fairly. To the best of our knowledge, the only exception among the monomial-based algorithms is the one proposed in [sauer2007approximate], which considers homogeneous polynomials for the basis without a monomial order. However, this method is not as general as ours, which can be combined with existing polynomial-based algorithms to circumvent the spurious vanishing problem. Additionally, their method has to manipulate huge matrices that, for each degree $t$, contain the evaluations of all degree-$t$ monomials at all the points, which is more costly than our approach.

## 3 Preliminaries

### 3.1 Definitions and Notations

###### Definition 1 (Vanishing Ideal).

Given a set of $n$-dimensional points $X$, the vanishing ideal of $X$ is the set of $n$-variate polynomials that take a zero value (i.e., vanish) for any point in $X$. Formally,

 $\mathcal{I}(X) = \{g \in \mathcal{P}_n \mid \forall \mathbf{x} \in X,\ g(\mathbf{x}) = 0\}.$
###### Definition 2 (Evaluation vector).

Given a set of data points $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{|X|}\}$, the evaluation vector of a polynomial $h$ is defined as:

 $h(X) = \begin{pmatrix} h(\mathbf{x}_1) & h(\mathbf{x}_2) & \cdots & h(\mathbf{x}_{|X|}) \end{pmatrix}^{\top} \in \mathbb{R}^{|X|},$

where $|\cdot|$ denotes the cardinality of a set. For a set of polynomials $H = \{h_1, h_2, \ldots, h_{|H|}\}$, its evaluation matrix is $H(X) = \begin{pmatrix} h_1(X) & h_2(X) & \cdots & h_{|H|}(X) \end{pmatrix} \in \mathbb{R}^{|X| \times |H|}$.

###### Definition 3 (ϵ-vanishing polynomial).

A polynomial $g$ is an $\epsilon$-vanishing polynomial for a set of points $X$ when its evaluation vector satisfies $\|g(X)\| \le \epsilon$, where $\|\cdot\|$ denotes the Euclidean norm. Polynomials that do not satisfy this condition are called $\epsilon$-nonvanishing polynomials.

In the context of the approximate vanishing ideal, we are interested in the evaluation values of polynomials at the given points $X$. Polynomials are thus identified with their $|X|$-dimensional evaluation vectors. Two polynomials $h_1$ and $h_2$ are regarded as equivalent if they have the same evaluation vector, i.e., $h_1(X) = h_2(X)$. When $h(X) = \mathbf{0}$, then $h$ is a vanishing polynomial for $X$. It is worth noting that the sum of the evaluations of polynomials equals the evaluation of the sum of the polynomials; that is, for a set of polynomials $H$ and a weight vector $\mathbf{v} \in \mathbb{R}^{|H|}$,

 $H(X)\mathbf{v} = (H\mathbf{v})(X),$

where $H\mathbf{v} := \sum_{i=1}^{|H|} v_i h_i$ defines the inner product between a set $H$ and a vector $\mathbf{v}$. This special inner product will be used hereafter. Similarly, $HV := \{H\mathbf{v}_1, H\mathbf{v}_2, \ldots\}$ denotes the multiplication of a set $H$ and a matrix $V$, where $\mathbf{v}_i$ denotes the $i$-th column vector of $V$. In this way, polynomials and their sums are mapped to finite-dimensional vectors, and linear algebra can be used for the basis construction of the approximate vanishing ideal.
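As a small illustration (with a hypothetical point set and polynomial set, not taken from the paper), the evaluation matrix stacks the evaluation vectors as columns, and the identity between $H(X)\mathbf{v}$ and the evaluation of the combined polynomial is just matrix-vector multiplication:

```python
import numpy as np

# Hypothetical points and polynomial set H = {1, x, y, x*y}.
X = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 0.5]])
polys = [
    lambda p: np.ones(len(p)),    # h1 = 1
    lambda p: p[:, 0],            # h2 = x
    lambda p: p[:, 1],            # h3 = y
    lambda p: p[:, 0] * p[:, 1],  # h4 = x*y
]
# Evaluation matrix: column i is the evaluation vector h_i(X).
H_X = np.column_stack([h(X) for h in polys])   # shape (|X|, |H|) = (3, 4)

# H(X) v equals the evaluation of the combined polynomial
# Hv = 1 - 2x + 0.5y + 3xy at every point.
v = np.array([1.0, -2.0, 0.5, 3.0])
combined = lambda p: 1.0 - 2.0 * p[:, 0] + 0.5 * p[:, 1] + 3.0 * p[:, 0] * p[:, 1]
assert np.allclose(H_X @ v, combined(X))
```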

### 3.2 Minimal Basis Construction Algorithm

Given a set of data points $X$ and an error tolerance $\epsilon$, the goal of the basis construction is to output $F$ and $G$, where the former is a basis set of $\epsilon$-nonvanishing polynomials and the latter is a basis set of $\epsilon$-vanishing polynomials of an approximate vanishing ideal. In the exact case ($\epsilon = 0$), any vanishing polynomial $g \in \mathcal{I}(X)$ can be generated by $G$ as

 $g = \sum_{g' \in G} h_{g'} g', \qquad (1)$

where $h_{g'} \in \mathcal{P}_n$. This is similar to the basis of a linear subspace, except that the coefficients here are polynomials. We define $\langle G \rangle$ as the set of polynomials generated by $G$. Also, for any polynomial $f \in \mathcal{P}_n$,

 $f = f' + g', \qquad (2)$

where $f' \in \mathrm{span}(F)$ and $g' \in \langle G \rangle$. Here, $\mathrm{span}(F)$ denotes the set of linear combinations of polynomials in $F$.

There are several basis construction algorithms for the approximate vanishing ideal. Although our idea of normalization and its realization method work with most of them, to avoid making the discussion unnecessarily abstract, we here fix a simple polynomial-based basis construction algorithm, called the minimal basis construction algorithm. The input to the minimal basis construction algorithm is a set of points $X$ and an error tolerance $\epsilon$. The algorithm proceeds from degree-0 polynomials to higher-degree polynomials. At each degree $t$, a set of nonvanishing polynomials $F_t$ and a set of vanishing polynomials $G_t$ are generated. We use the notations $F^t = \bigcup_{s=0}^{t} F_s$ and $G^t = \bigcup_{s=0}^{t} G_s$. For $t = 0$, $F_0 = \{m\}$ and $G_0 = \emptyset$, where $m$ is any nonzero constant. At each degree $t$, the following procedures are conducted.

##### Step 1: Generate a set of candidate polynomials Ct

Pre-candidate polynomials of degree $t$ are generated by multiplying nonvanishing polynomials between $F_1$ and $F_{t-1}$:

 $C_t^{\mathrm{pre}} = \{pq \mid p \in F_1, q \in F_{t-1}\}.$

At $t = 1$, we use $C_1^{\mathrm{pre}} = \{x_1, x_2, \ldots, x_n\}$, where $x_1, \ldots, x_n$ are the variables. The candidate basis is then generated via the orthogonalization procedure

 $C_t = C_t^{\mathrm{pre}} - F^{t-1} F^{t-1}(X)^{\dagger} C_t^{\mathrm{pre}}(X), \qquad (3)$

where $\cdot^{\dagger}$ denotes the pseudo-inverse of a matrix.
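At the level of evaluation matrices, the orthogonalization (3) is a projection onto the orthogonal complement of the column space of $F^{t-1}(X)$. A minimal numerical sketch with toy matrices (not tied to any particular data set):

```python
import numpy as np

# Toy evaluation matrices: columns are evaluation vectors.
rng = np.random.default_rng(1)
F_prev_X = rng.normal(size=(10, 3))   # F^{t-1}(X), shape (|X|, |F^{t-1}|)
C_pre_X = rng.normal(size=(10, 4))    # C_t^pre(X), shape (|X|, |C_t^pre|)

# C_t(X) = C_pre(X) - F F^+ C_pre(X) = (I - F F^+) C_pre(X):
# the part of the candidates not spanned by lower-degree polynomials.
C_X = C_pre_X - F_prev_X @ np.linalg.pinv(F_prev_X) @ C_pre_X

# The result is orthogonal to the span of F^{t-1}(X).
assert np.allclose(F_prev_X.T @ C_X, 0.0)
```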

##### Step 2: Solve an eigenvalue problem for Ct(X)

Solve the following eigenvalue problem for the evaluation matrix $C_t(X)$:

 $C_t(X)^{\top} C_t(X) V = V \Lambda, \qquad (4)$

where $V$ is a matrix that has the eigenvectors $\mathbf{v}_i$ in its columns and $\Lambda$ is a diagonal matrix with the eigenvalues $\lambda_i$ along its diagonal.

##### Step 3: Construct sets of basis polynomials

Basis polynomials are generated by linearly combining the polynomials in $C_t$ with the eigenvectors $\mathbf{v}_i$:

 $F_t = \{C_t \mathbf{v}_i \mid \sqrt{\lambda_i} > \epsilon,\ i = 1, 2, \ldots, |C_t|\},$
 $G_t = \{C_t \mathbf{v}_i \mid \sqrt{\lambda_i} \le \epsilon,\ i = 1, 2, \ldots, |C_t|\}.$

If $F_t = \emptyset$, the algorithm terminates with output $F = F^t$ and $G = G^t$.

###### Remark 1.

At Step 1, the orthogonalization procedure (3) makes the column space of $C_t(X)$ orthogonal to that of $F^{t-1}(X)$, aiming to focus on the subspace of $\mathbb{R}^{|X|}$ that cannot be spanned by the evaluation vectors of polynomials of degree less than $t$ (note that $C_t(X) = (I - F^{t-1}(X)F^{t-1}(X)^{\dagger})C_t^{\mathrm{pre}}(X)$, where $I$ is the identity matrix).

###### Remark 2.

At Step 3, a polynomial $C_t\mathbf{v}_i$ is classified as an $\epsilon$-vanishing polynomial if $\sqrt{\lambda_i} \le \epsilon$, because $\sqrt{\lambda_i}$ equals the extent of vanishing of the polynomial $C_t\mathbf{v}_i$. Actually,

 $\|(C_t\mathbf{v}_i)(X)\| = \sqrt{\mathbf{v}_i^{\top} C_t(X)^{\top} C_t(X) \mathbf{v}_i} = \sqrt{\lambda_i}.$

The abovementioned algorithm comprises the minimal procedures of polynomial-based basis construction. This algorithm is quite similar to the pioneering polynomial-based algorithm, VCA [livni2013vanishing], implying the elegance of VCA. Existing polynomial-based algorithms, including VCA, are based on this framework, slightly changing each step and/or introducing additional procedures for various properties, such as stability, scalability, and a compact basis [livni2013vanishing, kiraly2014dual, kera18approximate]. For this paper, it is sufficiently general to work with the minimal basis construction algorithm. We will note when some algorithm-specific consideration is necessary. Henceforth, we refer to each step of the algorithm above as Step1, Step2, and Step3.
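To make the three steps concrete, here is a compact numerical sketch of the minimal basis construction in which each polynomial is represented only by its evaluation vector (in the spirit of the numerical implementations discussed later in Section 4.2.1). The data and tolerance are hypothetical, and this is a sketch rather than the authors' reference implementation:

```python
import numpy as np

def minimal_basis_construction(X, eps, max_degree=5):
    m = X.shape[0]
    F = [np.ones(m)]                   # degree 0: a nonzero constant
    G = []
    F1 = None
    Ft_prev = list(F)
    for t in range(1, max_degree + 1):
        # Step 1: pre-candidates; at t=1 the variables, else products F1 x F_{t-1}
        if t == 1:
            C_pre = [X[:, k] for k in range(X.shape[1])]
        else:
            C_pre = [p * q for p in F1 for q in Ft_prev]
        C = np.column_stack(C_pre)
        FX = np.column_stack(F)
        C = C - FX @ np.linalg.pinv(FX) @ C        # orthogonalization (3)
        # Step 2: eigendecomposition of C(X)^T C(X)  (4)
        lam, V = np.linalg.eigh(C.T @ C)
        # Step 3: split by the extent of vanishing sqrt(lambda_i)
        Ft = [C @ V[:, i] for i in range(len(lam)) if np.sqrt(max(lam[i], 0)) > eps]
        Gt = [C @ V[:, i] for i in range(len(lam)) if np.sqrt(max(lam[i], 0)) <= eps]
        G.extend(Gt)
        if not Ft:
            break
        F.extend(Ft)
        if t == 1:
            F1 = Ft
        Ft_prev = Ft
    return F, G

# Points on the unit circle: x^2 + y^2 - 1 vanishes exactly at degree 2.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 30)
X = np.c_[np.cos(theta), np.sin(theta)]
F, G = minimal_basis_construction(X, eps=1e-6)
# Every vanishing "polynomial" (evaluation vector) nearly vanishes on X.
assert len(G) >= 1 and all(np.linalg.norm(g) <= 1e-5 for g in G)
```

Note that this evaluation-vector representation already exhibits the spurious vanishing problem discussed above, since no coefficient information is kept.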

###### Theorem 1.

Given a set of points $X$ and $\epsilon \ge 0$, the minimal basis construction outputs basis sets $F = F^{T}$ and $G = G^{T}$, where $T$ is the degree at which the algorithm terminates. When the algorithm runs with $\epsilon = 0$, then:

• $G$ is a basis set of the vanishing ideal $\mathcal{I}(X)$, which satisfies (1): any vanishing polynomial $g \in \mathcal{I}(X)$ can be generated by $G$ as $g = \sum_{g' \in G} h_{g'} g'$.

• $F$ and $G$ satisfy (2): any polynomial $f \in \mathcal{P}_n$ can be represented as $f = f' + g'$, where $f' \in \mathrm{span}(F)$ and $g' \in \langle G \rangle$.

• $F^t$ and $G^t$ for any $t$ satisfy the following: any polynomial $f$ of degree at most $t$ can be represented as $f = f' + g'$, where $f' \in \mathrm{span}(F^t)$ and $g' \in \langle G^t \rangle$.

This theorem can be readily proved by comparing the minimal basis construction with VCA and using the fact that VCA also satisfies Theorem 1 (see Appendix A.1 and [livni2013vanishing]).

## 4 Proposed Method

A polynomial can be approximately vanishing merely because of its small coefficients; this is the spurious vanishing problem. To sidestep this problem, we would like approximate vanishing polynomials (and nonvanishing polynomials) to be normalized by some scale, so that spurious vanishing polynomials are properly rescaled and their actual behavior on the input points becomes evident.

Here, we describe the proposed methods that address the challenges of introducing normalization. We intend to answer the following questions: how do we optimally generate basis polynomials under a normalization, such as the coefficient normalization (Section 4.1)? How do we efficiently extract coefficients from polynomials and manipulate them for the coefficient normalization (Section 4.2)?

### 4.1 Polynomial-based basis construction with Normalization

Here, we describe a proposed method that enables the minimal basis construction algorithm to construct nonvanishing and vanishing polynomials under a given normalization. This method is general enough to be applied to other polynomial-based basis construction algorithms, and has the following advantages: (i) it only requires rewriting a few lines of the original algorithms (aside from calculating the values necessary for normalization, which depends on which normalization is used); (ii) it is not limited to the coefficient normalization; (iii) it can inherit most properties of the original algorithms. For simplicity, we first focus on the special case of the coefficient normalization and then give the general description.

The coefficient vector of a polynomial is defined as the vector that lists the coefficients of the monomials of the polynomial. Let $n_c(\cdot)$ be the mapping that gives the coefficient vector of a given polynomial. Then, we evaluate the approximate vanishing of $g$ for $X$ after normalizing $g$ with respect to the norm of its coefficient vector as follows:

 $\frac{g}{\|n_c(g)\|}.$

Since spurious vanishing polynomials have coefficient vectors with small norms, such polynomials are scaled up by the normalization above. Similarly, spurious nonvanishing polynomials, which are polynomials that are nonvanishing only because of their unreasonably large coefficients, are rescaled to a moderate scale of coefficients.

As already mentioned, the coefficient normalization has been considered in the monomial-based algorithms but not in the polynomial-based algorithms, leading the latter to suffer from the spurious vanishing problem. One reason the polynomial-based algorithms fail to consider the coefficient normalization is that it has been unknown how to optimally generate the combination vectors ($V$ of Step2) under this normalization. Recall that, in Step3, a new polynomial is generated by linearly combining the candidate polynomials in $C_t$. This can be formulated as

 $g = \sum_{i=1}^{|C_t|} v_i c_i = C_t \mathbf{v},$

where $\mathbf{v} = (v_1\ v_2\ \cdots\ v_{|C_t|})^{\top}$ is a combination vector to be sought and $c_i \in C_t$. The coefficient vector of $g$ is

 $n_c(g) = \sum_{i=1}^{|C_t|} v_i n_c(c_i) = n_c(C_t)\mathbf{v},$

where we use the slight abuse of notation that $n_c(C_t)$ is a matrix whose $i$-th column is $n_c(c_i)$. Now, suppose that we want to find a polynomial that achieves the tightest vanishing under the coefficient normalization:

 $\min_{\mathbf{v}} \|C_t(X)\mathbf{v}\|^2, \quad \text{s.t.}\ \|n_c(C_t)\mathbf{v}\|^2 = 1.$

This type of minimization problem is well-known to be solved by a generalized eigenvalue problem,

 $C_t(X)^{\top} C_t(X) \mathbf{v}_{\min} = \lambda_{\min}\, n_c(C_t)^{\top} n_c(C_t)\, \mathbf{v}_{\min}, \qquad (5)$

where $\lambda_{\min}$ is the smallest generalized eigenvalue and $\mathbf{v}_{\min}$ is the corresponding generalized eigenvector. We will later show that the generalized eigenvectors of the $s$ smallest generalized eigenvalues generate polynomials that minimize the sum of the extents of vanishing under the normalization. Therefore, to introduce the coefficient normalization, we only need to replace the eigenvalue problem (4) in Step2 with the generalized eigenvalue problem (5).
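Assuming toy stand-ins for the candidate evaluation matrix $C_t(X)$ and the coefficient matrix $n_c(C_t)$, this replacement of (4) by (5) can be sketched with SciPy's generalized symmetric eigensolver:

```python
import numpy as np
from scipy.linalg import eigh

# Toy stand-ins (not real basis-construction output).
rng = np.random.default_rng(0)
CX = rng.normal(size=(20, 4))   # C_t(X): evaluation vectors as columns
NC = rng.normal(size=(6, 4))    # n_c(C_t): coefficient vectors as columns

A = CX.T @ CX                   # Gram matrix of evaluations
B = NC.T @ NC                   # Gram matrix of coefficients
lam, V = eigh(A, B)             # generalized eigenpairs, lam ascending

# scipy normalizes eigenvectors so that v^T B v = 1: every polynomial
# C_t v_i has unit coefficient norm ...
for i in range(V.shape[1]):
    assert np.isclose(np.linalg.norm(NC @ V[:, i]), 1.0)
    # ... and sqrt(lam_i) is exactly its extent of vanishing.
    assert np.isclose(np.linalg.norm(CX @ V[:, i]), np.sqrt(lam[i]))
```

The eigenvector of the smallest eigenvalue thus gives the tightest-vanishing polynomial among all unit-coefficient-norm combinations.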

We now provide a general description of our method.

###### Definition 4 (Normalization component).

Let $n(\cdot)$ be a mapping of polynomials that satisfies the following:

• $n$ is a linear mapping; i.e., $n(ah_1 + bh_2) = a\,n(h_1) + b\,n(h_2)$ for any $a, b \in \mathbb{R}$ and any polynomials $h_1, h_2$.

• The dot product is defined between normalization components; that is, $\langle n(h_1), n(h_2) \rangle$ is defined for any polynomials $h_1$ and $h_2$.

• $\|n(h)\|$ takes a zero value if and only if $h$ is the zero polynomial.

Then, $n(h)$ is called the normalization component of $h$, and $\|n(h)\|$ is simply called the norm (or $n$-norm) of $h$.

Note that, from the first requirement, the norm of the zero polynomial needs to be zero. The third requirement insists that the reverse also hold, and this is the case for the coefficient normalization; that is, if $\|n_c(h)\| = 0$, then $h$ is the zero polynomial. Let us consider $|C_t|$-dimensional vectors $\mathbf{v}_1$ and $\mathbf{v}_2$. Using the first and second properties of $n$,

 $\langle n(C_t\mathbf{v}_1), n(C_t\mathbf{v}_2) \rangle = \sum_{i,j} \langle n(c_i), n(c_j) \rangle v_1^{(i)} v_2^{(j)} = \mathbf{v}_1^{\top} N(C_t)\, \mathbf{v}_2,$

where $N$ is a mapping that gives a matrix $N(C_t) \in \mathbb{R}^{|C_t| \times |C_t|}$ for $C_t$, the $(i, j)$-th entry of which is $\langle n(c_i), n(c_j) \rangle$. With the constraints $\mathbf{v}_i^{\top} N(C_t) \mathbf{v}_j = \delta_{ij}$ for every $i$ and $j$, where $\delta_{ij}$ is the Kronecker delta, basis polynomials are generated by solving the following generalized eigenvalue problem:

 Ct(X)⊤Ct(X)V=N(Ct)VΛ, (6)

where $\Lambda$ is a diagonal matrix containing the generalized eigenvalues $\lambda_i$, and $V$ is a matrix whose $i$-th column is the generalized eigenvector corresponding to $\lambda_i$. To summarize, Step2 is replaced with the following Step2′ to introduce a normalization.

##### Step 2′: Solve a generalized eigenvalue problem for Ct(X)

Solve the generalized eigenvalue problem (6) to obtain the generalized eigenvectors $\mathbf{v}_i$ and the generalized eigenvalues $\lambda_i$.

###### Remark 3.

In addition to replacing Step2 with Step2′, we set $m = 1$ (i.e., $F_0 = \{1\}$) for consistency with the coefficient normalization.

The following theorem supports the validity of this replacement of Step2 with Step2′.

###### Theorem 2.

The minimal basis construction algorithm with normalization, where Step2 is replaced with Step2′, satisfies Theorem 1.

The proof is provided in Appendix A.2. An intuitive explanation is as follows. Let us consider two processes of basis construction, one with Step2 and the other with Step2′. If a symbol is used for the former process, we put a tilde on it for the latter process (e.g., $F_t$ and $\widetilde{F}_t$). Now, at each degree $t$, Step2′ finds basis sets $\widetilde{F}_t$ and $\widetilde{G}_t$. These basis sets span part of the space that is spanned by the basis sets $F_t$ and $G_t$ found by Step2. This is because Step2′ solves the generalized eigenvalue problem (6), where additional constraints regarding normalization are imposed on the original eigenvalue problem (4) of Step2. The key claim is that the basis polynomials dropped from $F_t$ and $G_t$ are redundant. The redundant basis polynomials (let one of them be $g$) are those whose norm is zero (i.e., $\|n(g)\| = 0$). In the case of the coefficient norm, $\|n_c(g)\| = 0$ implies that the coefficient vector of $g$ is the zero vector; thus, $g$ is the zero polynomial, which is obviously a redundant basis element. Note that existing methods do not exclude even such a zero polynomial from the basis polynomials. The following theorem shows the optimality of Step2′. See Appendix A.3 for the proof.

###### Theorem 3.

Let $s$ be an integer such that $1 \le s \le |C_t|$. The generalized eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_s$ of (6), which correspond to the $s$ smallest generalized eigenvalues, generate polynomials $C_t\mathbf{v}_1, \ldots, C_t\mathbf{v}_s$ whose squared sum of the extent of vanishing achieves the minimum under the orthonormality constraint on the normalization components of the polynomials.

It is known that, in practice, we need to solve the following problem instead of (6) for numerical stability:

 $C_t(X)^{\top} C_t(X) V = (N(C_t) + \alpha I) V \Lambda, \qquad (7)$

where $\alpha$ is a small positive constant. Such $\alpha$ is typically set to a small multiple of the average eigenvalue of $N(C_t)$, i.e., $\alpha \propto \mathrm{tr}(N(C_t))/|C_t|$ [friedman1989regularized]. It can be shown that the addition of $\alpha I$ causes only a slight change in both the extent of vanishing and the normalization measure of the obtained polynomials.
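The effect of the regularizer can be sketched as follows: the toy coefficient matrix below is deliberately rank-deficient, so the Gram matrix $N(C_t)$ is only positive-semidefinite and the unregularized problem (6) would be ill-posed, while (7) is well-posed. The multiplier `1e-6` is a hypothetical choice:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
CX = rng.normal(size=(15, 5))   # toy C_t(X)
NC = rng.normal(size=(4, 5))    # only 4 rows: N(C_t) = NC^T NC is rank-4, PSD
A = CX.T @ CX
B = NC.T @ NC

# alpha = small multiple of the average eigenvalue of N(C_t) = tr(B)/|C_t|
alpha = 1e-6 * np.trace(B) / B.shape[0]
lam, V = eigh(A, B + alpha * np.eye(B.shape[0]))   # now strictly PD

assert np.all(np.isfinite(lam)) and len(lam) == B.shape[0]
```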

###### Theorem 4.

Let $\mathbf{v}_k^{\alpha}$ be the $k$-th generalized eigenvector of (7) for $\alpha > 0$, and let $\mathbf{v}_k^{0}$ be that for $\alpha = 0$. Both the extent of vanishing and the norm of $C_t\mathbf{v}_k^{\alpha}$ differ only by $O(\alpha)$ from those of the original polynomial $C_t\mathbf{v}_k^{0}$. Specifically,

 $\lambda_k^{0} - \lambda_k^{\alpha} = \frac{\alpha \|\mathbf{v}_k^{0}\|^2 \lambda_k^{0}}{1 + \alpha \|\mathbf{v}_k^{0}\|^2},$
 $-\frac{\alpha \|\mathbf{v}_k^{0}\|^2 \lambda_k^{0}}{\lambda_{\min}^{0}} + O(\alpha^2) \le \sqrt{(\mathbf{v}_k^{0})^{\top} N(C_t) \mathbf{v}_k^{0}} - \sqrt{(\mathbf{v}_k^{\alpha})^{\top} N(C_t) \mathbf{v}_k^{\alpha}} \le -\frac{\alpha \|\mathbf{v}_k^{0}\|^2 \lambda_k^{0}}{\lambda_{\max}^{0}} + O(\alpha^2),$

where $\lambda_{\min}^{0}$ and $\lambda_{\max}^{0}$ are the smallest and the largest generalized eigenvalues of (7) for $\alpha = 0$, respectively.

###### Proof.

To simplify the notation, let $A = C_t(X)^{\top} C_t(X)$ and $B = N(C_t)$. Let us consider

 $A\mathbf{v}_k^{\alpha} = \lambda_k^{\alpha} (B + \alpha I)\mathbf{v}_k^{\alpha}, \qquad (8)$

where $\alpha I$ is a small perturbation on $B$, and $\lambda_k^{\alpha}$ and $\mathbf{v}_k^{\alpha}$ are the perturbed $k$-th generalized eigenvalue and eigenvector. We cannot directly apply standard matrix perturbation theory, which assumes a positive-definite $B$ and describes $\mathbf{v}_k^{\alpha}$ by a linear combination of the unperturbed generalized eigenvectors. In our case, $B$ is positive-semidefinite, and thus there are only $\mathrm{rank}(B)$ generalized eigenvectors. Hence, the generalized eigenvectors are not a complete basis of $\mathbb{R}^{|C_t|}$, and $\mathbf{v}_k^{\alpha}$ cannot always be described by the unperturbed generalized eigenvectors.

Fortunately, the theorem above holds using the fact that $\mathrm{null}(B) \subseteq \mathrm{null}(A)$, where $\mathrm{null}(\cdot)$ denotes the nullspace of a given matrix. This relation holds because any vector $\mathbf{v} \in \mathrm{null}(B)$ corresponds to the zero polynomial $C_t\mathbf{v}$ according to the third requirement for $n$. From Lemma 1 in Appendix A.4, we conclude that the claim holds. ∎

### 4.2 Coefficient normalization

Introducing the coefficient normalization into the basis construction incurs a large computational cost, which has two sources: (i) we need to expand polynomials to obtain their coefficient vectors, since in our case polynomials are in a nested sum-product form due to the repetition of Step1 and Step3 along the degree. In general, such an expansion is computationally expensive because exponentially many monomials can appear in the expansion. (ii) Even after the polynomial expansion, the obtained coefficient vectors are in general significantly long. Specifically, a degree-$T$ $n$-variate polynomial has a coefficient vector of length $\binom{n+T}{T}$. For a fixed degree $T$, this length grows as $O(n^T)$, and for a fixed number of variables $n$, it grows as $O(T^n)$. We here propose a method for each of the two challenges above.

#### 4.2.1 Circumventing polynomial expansion

The main idea for circumventing polynomial expansion is to hold the coefficient vectors of polynomials separately and to update these vectors by applying to them transformations equivalent to those applied to the corresponding polynomials. For example, let us consider a weighted sum $ah_1 + bh_2$ of two polynomials $h_1$ and $h_2$ with weights $a, b \in \mathbb{R}$. Then, the coefficient vector of $ah_1 + bh_2$ is also the weighted sum $a\,n_c(h_1) + b\,n_c(h_2)$. In contrast to the weighted-sum case, it is not straightforward to calculate the coefficient vector of a product of polynomials, e.g., $h_1h_2$. We encounter such a case at Step1, where the candidate polynomials are generated from multiplications across linear polynomials and nonlinear polynomials. We will now deal with this problem.

Let us consider $n$-variate polynomials. Let $\sigma_t$ and $\sigma_{\le t}$ be the number of $n$-variate monomials of degree $t$ and of degree up to $t$, respectively. For a simple description, we assume that monomials and coefficients are indexed in the degree-lexicographic order. For instance, in the two-variate case with variables $x_1 \prec x_2$, the degree-lexicographic order is $1 \prec x_1 \prec x_2 \prec x_1^2 \prec x_1x_2 \prec x_2^2$, and so forth. We will refer to "the $i$-th monomial" according to this ordering. Now, we consider a matrix that extends the coefficient vector of a degree-$t$ polynomial to that of a degree-$(t+1)$ polynomial after multiplication by a linear polynomial.

###### Remark 4.

Given a linear polynomial $p$, there is a matrix $R_p^{\le t}$ such that $n_c(ph) = R_p^{\le t}\, n_c(h)$, where $h$ is any polynomial of degree at most $t$.

The existence of such a matrix will become obvious soon (see [vidal2016gpca] for the case of homogeneous polynomials). Suppose a linear polynomial $p$ is described by $p = \sum_{k=0}^{n} b_k x_k$, where $b_k \in \mathbb{R}$ are coefficients and $x_1, \ldots, x_n$ are the variables; for convenience, we use the notation $x_0 = 1$. Then, $R_p^{\le t}$ can be described as

 $R_p^{\le t} = \sum_{k=0}^{n} b_k R_{x_k}^{\le t},$

because, as we have seen above, the coefficient vector of a weighted sum of polynomials is the weighted sum of their coefficient vectors. Now, the existence of $R_{x_k}^{\le t}$ is obvious. Actually, the $(i, j)$-th entry of $R_{x_k}^{\le t}$ takes the value one if the $j$-th monomial becomes the $i$-th monomial after multiplication by $x_k$, and otherwise the entry takes the value zero. $R_{x_k}^{\le t}$ does not depend on the input data (except for the number of variables), and thus we can compute these matrices beforehand. Note that different monomials are mapped to different monomials after multiplication by $x_k$. Thus, each column of $R_{x_k}^{\le t}$ has exactly one nonzero entry (and it is 1), implying that $R_{x_k}^{\le t}$ is a sparse matrix with only $\sigma_{\le t}$ nonzero entries, which can be handled efficiently despite its size $\sigma_{\le t+1} \times \sigma_{\le t}$. Moreover, we can represent $R_{x_k}^{\le t}$ as a block diagonal matrix,

 $R_{x_k}^{\le t} = \begin{pmatrix} R_{x_k}^{\le t-1} & O \\ O & R_{x_k}^{t} \end{pmatrix},$

where $O$ is the zero matrix and $R_{x_k}^{t}$ is the submatrix of $R_{x_k}^{\le t}$ that represents the mapping from degree-$t$ monomials to degree-$(t+1)$ monomials.

In summary, in the basis construction, we first hold the coefficient vectors of the polynomials in $F_1$ alongside the polynomials themselves. Then, for each $p \in F_1$, we load the precomputed $R_{x_k}^{\le t}$ and calculate $R_p^{\le t}$. Using these matrices, we can obtain the coefficient vectors of the pre-candidates $C_t^{\mathrm{pre}}$. We then extend $R_p^{\le t}$ to $R_p^{\le t+1}$, which is a linear combination of the precomputed $R_{x_k}^{\le t+1}$. In this way, we can skip the costly polynomial expansion otherwise required before manipulating the coefficient vectors.
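A sketch of constructing the matrices $R_{x_k}$ for $n = 2$ variables follows. The exponent-tuple indexing below is our own degree-compatible ordering (sorted tuples within each degree), not necessarily the paper's exact degree-lexicographic order, but the construction principle is the same: column $j$ has a single 1 in the row of $x_k$ times the $j$-th monomial.

```python
import numpy as np
from itertools import combinations_with_replacement
from scipy.sparse import lil_matrix

n, t = 2, 2  # two variables, extend coefficient vectors from degree <=2 to <=3

def monomials_upto(deg):
    # Exponent tuples of all n-variate monomials of degree <= deg,
    # grouped by degree (a degree-compatible ordering).
    out = []
    for d in range(deg + 1):
        out.extend(sorted(tuple(sum(1 for v in c if v == k) for k in range(n))
                          for c in combinations_with_replacement(range(n), d)))
    return out

rows = monomials_upto(t + 1)
cols = monomials_upto(t)
row_index = {m: i for i, m in enumerate(rows)}

def R(k):
    # R_{x_k}: column j (monomial m) has a single 1 in the row of x_k * m
    # (k = 0 denotes x_0 = 1, i.e., the identity embedding).
    M = lil_matrix((len(rows), len(cols)))
    for j, m in enumerate(cols):
        e = list(m)
        if k > 0:
            e[k - 1] += 1       # multiply by variable x_k
        M[row_index[tuple(e)], j] = 1.0
    return M.tocsr()

# R_p for p = 1 + 2*x_1 is the corresponding combination of the R_{x_k}.
Rp = R(0) + 2.0 * R(1)
# Check on h = x_1 + x_2: p*h = x_1 + x_2 + 2*x_1^2 + 2*x_1*x_2.
c_h = np.zeros(len(cols))
c_h[cols.index((1, 0))] = 1.0   # coefficient of x_1
c_h[cols.index((0, 1))] = 1.0   # coefficient of x_2
c_ph = Rp @ c_h
assert c_ph[rows.index((2, 0))] == 2.0   # 2*x_1^2
assert c_ph[rows.index((1, 1))] == 2.0   # 2*x_1*x_2
```

Each `R(k)` has exactly one nonzero per column, matching the sparsity argument above.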

In addition to its lower computational cost, our approach of skipping the polynomial expansion has another practical advantage: it can work with fast numerical implementations of basis construction. In a numerical implementation, a polynomial is expressed by its evaluation vector instead of a symbolic entity (for example, see the authors' web page of [livni2013vanishing]). Since only the evaluation vector of a polynomial is known in a numerical implementation, the "polynomial" cannot be expanded because its symbolic form is unknown. Numerical implementations work much faster because, in practice, symbolic operations are much slower than the same number of numerical operations (matrix-vector operations). Also, evaluating symbolic entities is slow, although many evaluations are necessary to obtain the evaluation vectors of polynomials.

#### 4.2.2 Coefficient truncation for acceleration

We here describe the coefficient truncation method to deal with significantly long coefficient vectors. We propose to truncate coefficient vectors based on the importance of the corresponding monomials. In particular, at each degree $t$, we only keep the degree-$t$ monomials that have large coefficients in the degree-$t$ nonvanishing polynomials $F_t$. Although this strategy is simple, our coefficient truncation method has an interesting contrast to a monomial-based algorithm, as will be shown shortly.

The specific procedure of the proposed coefficient truncation is as follows. Let $n_c^t(\cdot)$ be a mapping that gives the coefficient vector of the degree-$t$ monomials; thus, $n_c^t(h)$ is a subvector of $n_c(h)$. With the same abuse of notation as for $n_c$, we define $n_c^t(F_t)$ as a matrix whose $i$-th column is the $n_c^t$ coefficient vector of the $i$-th polynomial of $F_t$. Note that the $j$-th row of $n_c^t(F_t)$ corresponds to the coefficients of the $j$-th degree-$t$ monomial across the polynomials of $F_t$. Let $\Delta_j$ be the norm of the $j$-th row of $n_c^t(F_t)$. Then, setting a threshold parameter $\theta$, we select significant monomials one by one, from larger $\Delta_j$, as long as the following holds:

 $\sum_{j \in B_t} \Delta_j^2 \le \theta^2, \qquad (9)$

where $B_t$ is the index set of the selected degree-$t$ monomials. Once a monomial $m$ is discarded, we also discard multiples of $m$, which may appear at higher degrees. Using $B_t$, we can also truncate the $R_{x_k}^{\le t}$ of the previous section to the rows and columns of the selected monomials, which yields an even smaller sparse matrix.
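The selection rule (9) can be sketched as follows, with a toy coefficient matrix and a hypothetical threshold:

```python
import numpy as np

# Toy n_c^t(F_t): rows = degree-t monomials, columns = polynomials of F_t.
nc_t = np.array([[0.8, 0.4],
                 [0.5, 0.5],
                 [0.1, 0.1],
                 [0.1, 0.0]])
delta = np.linalg.norm(nc_t, axis=1)   # row norms Delta_j
theta = 1.145                          # hypothetical threshold

# Select monomials one by one, from larger Delta_j, while (9) holds.
order = np.argsort(-delta)
B_t, total = [], 0.0
for j in order:
    if total + delta[j] ** 2 > theta ** 2:
        break
    B_t.append(j)
    total += delta[j] ** 2
print(sorted(B_t))   # → [0, 1]: the two monomials with large coefficients
```

Monomials outside `B_t` (here the two rows with small coefficients) are truncated, along with their multiples at higher degrees.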

The proposed coefficient truncation is similar to a monomial-based algorithm, approximate Buchberger–Möller algorthim (ABM algorithm; [limbeck2014computation]). This algorithm proceeds from lower to higher degree monomials, while updating a set of important monomials , which corresponds to the basis set of nonvanishing polynomials of the minimal basis construction algorithm. Given a new monomial , if the evaluation vector of cannot be well approximated by a linear combination of monomials in , ABM algorithm assorts into . More specifically, if for some coefficients , then is an approximate vanishing polynomial and is discarded; otherwise is appended to . Importantly, monomials divisible by (i.e., multiples of ) need not be considered at a higher degree, which reduces the number of monomials to handle. It is shown ; thus, the number of monomials to handle does not explode.

The proposed coefficient truncation is distinct from the strategy of the ABM algorithm in that it is fully data-driven, whereas the ABM algorithm relies on a specific monomial ordering.

###### Example 1.

Suppose we have two degree-$t$ monomials $m_1$ and $m_2$, where $m_2 \prec m_1$ for some monomial order. Suppose $m_1(X) = k\,m_2(X)$ for a constant $k > 0$. The ABM algorithm considers $m_2$ as the more important monomial to append because $m_2 \prec m_1$ and $m_1(X)$ is linearly dependent on $m_2(X)$.

By contrast, our method based on the generalized eigenvalue problem computes a nonvanishing polynomial

 $\dfrac{k}{\sqrt{1+k^2}}\, m_1 + \dfrac{1}{\sqrt{1+k^2}}\, m_2$. (10)

Therefore, both $m_1$ and $m_2$ are preserved when $k \approx 1$. For $k > 1$, only $m_1$ is preserved and $m_2$ is discarded because $m_1$ is more nonvanishing, or more important, than $m_2$ for the data points; for $k < 1$, $m_2$ is preserved and $m_1$ is discarded. Since $k$ depends on the given data points, our strategy of removing monomials with minor coefficients is fully data-driven. By contrast, due to the predefined monomial order, the ABM algorithm consistently preserves $m_2$ regardless of $k$.

Next, we introduce an additional degree-$t$ monomial $m_3$.

###### Example 2.

Let us consider a degree-$t$ monomial $m_3$ with an evaluation vector that is orthogonal to those of $m_1$ and $m_2$ of Example 1. Two nonvanishing polynomials, $m_3$ and (10), are obtained by our method. The magnitude of the coefficient of $m_3$ is larger than that of $m_1$ and $m_2$. This is because $m_3$ takes the full coefficient norm 1, whereas $m_1$ and $m_2$ share the coefficient norm 1 in (10) due to their mutually linearly dependent evaluation vectors.

Example 2 implies that our strategy of keeping monomials with large coefficients gives priority to monomials that have unique evaluation vectors, such as $m_3$. Monomials whose evaluation vectors are similar to those of others, such as $m_1$ and $m_2$, tend to have moderate coefficients. Hence, our coefficient truncation method retains monomials in order, from those with unique evaluation vectors to those with less unique ones. It is worth noting that this property is realized because our method restricts the coefficient norm of nonvanishing polynomials to unity.
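Examples 1 and 2 can be checked numerically. The sketch below is illustrative only: it fabricates evaluation vectors with $m_1(X) = k\,m_2(X)$ and an orthogonal $m_3(X)$, and reads off the unit-coefficient-norm nonvanishing polynomials from the eigenvectors of the Gram matrix of the evaluation vectors (a simplification of the generalized eigenvalue problem when no lower-degree terms are involved).

```python
import numpy as np

k = 2.0
u = np.array([1.0, 0.0, 0.0])        # common direction of the m1, m2 evaluations
w = np.array([0.0, 1.0, 0.0])        # evaluation of m3, orthogonal to u
M = np.column_stack([k * u, u, w])   # columns: m1(X), m2(X), m3(X)

# Eigenvectors of the Gram matrix = coefficient vectors of unit-norm polynomials.
eigvals, eigvecs = np.linalg.eigh(M.T @ M)   # ascending eigenvalues: 0, 1, 1+k^2
p_top = np.abs(eigvecs[:, -1])   # (k, 1, 0)/sqrt(1+k^2): the polynomial of Eq. (10)
p_mid = np.abs(eigvecs[:, -2])   # (0, 0, 1): m3 alone takes the full norm 1

# Per-monomial importance Delta_j: m3 > m1 > m2 for k > 1.
delta = np.linalg.norm(np.column_stack([p_top, p_mid]), axis=1)
```

The row norms confirm the claimed priority: the monomial with a unique evaluation vector gets the largest coefficient, while the two dependent monomials share a unit norm.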

As a consequence of truncating coefficient vectors, the coefficient matrix in (5) is replaced with the one built from the truncated coefficient vectors. The coefficient norm of the obtained polynomials is then no longer exactly equal to unity but only close to it. Note that although this is an approximation method, we can still calculate the exact evaluation of polynomials because we hold the polynomials and their coefficient vectors separately. Thus, the generalized eigenvalues at Step2 maintain the exact value of the squared extent of vanishing.

It is difficult to estimate the error caused by this truncation because the basis construction proceeds iteratively and errors accumulate. The following theorem gives a theoretical lower bound for coefficient truncation without any loss.

###### Theorem 5.

For an exact calculation of coefficients, we need at least monomials for each degree .

See Appendix A.5 for the proof. Theorem 5 states the minimal number of monomials required for an exact calculation of coefficient vectors. The equality holds when the evaluation vectors of the monomials remain mutually orthogonal until termination. Of course, this is too optimistic in practice. Instead, let us consider the case where a fixed number of coefficients is kept at each degree, and suppose the basis construction terminates at some degree. Then, the length of the coefficient vectors, the sparsity of the matrices used to calculate them, and the number of new monomials at each step all remain bounded, because the evaluation vectors of the kept monomials (approximately) span the evaluation space. Therefore, the coefficient truncation yields coefficient vectors of polynomial-order length. As a consequence, the cost of computing the matrix in (6) is acceptable when one considers the cost of solving (6) itself.

Lastly, we consider another idea for approximating the coefficient norm of polynomials: how about calculating the evaluation vectors of polynomials at randomly sampled points? One may expect that the norm of the coefficient vector of a polynomial can be inferred from the norm of such a random evaluation vector. Let $V_n^t$ be the Veronese map, which gives the evaluations of the $n$-variate monomials of degree up to $t$. For a set of points, we define $V_n^t(Y)$ as the matrix whose $i$-th row is the Veronese map of the $i$-th point. Now, let us consider a polynomial $g$ and its evaluation at randomly sampled points $Y$.

 $\|g(Y)\|^2 = \|C_t(Y)\,v\|^2 = \|V_n^t(Y)\,\mathrm{nc}(C_t)\,v\|^2 = v^\top \mathrm{nc}(C_t)^\top V_n^t(Y)^\top V_n^t(Y)\,\mathrm{nc}(C_t)\,v.$

Note that $\mathrm{nc}(C_t)\,v$ is the coefficient vector of $g$. Thus, if $V_n^t(Y)^\top V_n^t(Y)$ were proportional to the identity matrix, then we could estimate the coefficient norm of $g$ from the random evaluation vector $g(Y)$. However, this cannot be achieved. For instance, when $n = t = 2$,

 $V_2^2(Y)^\top V_2^2(Y) = \begin{pmatrix} 1 & & & & \\ & s_1^2 & & & \\ & & \ddots & & \\ & & & s_1^2 s_2^2 & s_1^2 s_2^2 \\ & & & s_1^2 s_2^2 & \ddots \end{pmatrix},$

which has the same entry $s_1^2 s_2^2$ both on and off the diagonal. Therefore, $V_2^2(Y)^\top V_2^2(Y)$ cannot be the identity matrix (or any scalar multiple of it) regardless of the sampled points.
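This obstruction can be verified empirically. The following sketch (an assumed setup, not taken from the paper: two-dimensional standard Gaussian samples and degree-$\leq 2$ monomial features) shows that the Gram matrix necessarily has equal entries on the diagonal (for $x_1 x_2$) and off the diagonal (between $x_1^2$ and $x_2^2$), since both equal the sample average of $x_1^2 x_2^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((100000, 2))     # random sample points
x1, x2 = Y[:, 0], Y[:, 1]

# Degree-<=2 Veronese features: 1, x1, x2, x1^2, x1*x2, x2^2
V = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2])
G = V.T @ V / len(Y)                     # (scaled) Gram matrix

diag_entry = G[4, 4]   # mean of (x1*x2)^2
off_entry = G[3, 5]    # mean of x1^2 * x2^2 -- the very same quantity
# Equal entries on and off the diagonal: G cannot be a multiple of I.
```

No choice of sampling distribution changes this, because the two entries are built from the identical sum over the sample.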

## 5 Results

We here compare VCA and the minimal basis construction algorithm with coefficient normalization (MBC-). We chose VCA as the baseline because the minimal basis construction (with Step2) is problematic in practice; it may have to deal with huge values when evaluating polynomials (e.g., for large-magnitude input). In the first experiment, we show that VCA encounters a severe spurious vanishing problem even on simple datasets, whereas MBC- does not. In the second experiment, we demonstrate on classification tasks that MBC- achieves a lower classification error. All experiments were performed using a Julia implementation on a desktop machine with an eight-core processor and 32 GB of memory. We emphasize that the proposed methods (coefficient normalization with the generalized eigenvalue problem and the coefficient truncation) can be easily unified with other basis construction methods because these methods are all based on the minimal basis construction framework. However, those methods are less commonly used than VCA, and they also need more hyperparameters to control, which would make the analysis unnecessarily complicated.

### 5.1 Analysis of Coefficient Norm and Extent of Vanishing with Simple Datasets

We perform basis construction with VCA and MBC-. The coefficient norm and the extent of vanishing of the obtained polynomials are compared between the two methods. We also compare MBC- with and without the coefficient truncation.

We use three algebraic varieties: (D1) double concentric circles (radii 1 and 2), (D2) triple concentric ellipses with rotation, and (D3) a third variety. We randomly sampled 50, 70, and 100 points from these algebraic varieties, respectively, and perturbed the sampled points with additive Gaussian noise. The mean of the noise is set to zero, and the standard deviation is set to 5% of the average absolute value of the points in each dataset. The threshold is set to the mean of the range over which the same number of vanishing polynomials is obtained at the lowest degree as with the Gröbner basis of the generating polynomials of each dataset. We further consider two datasets obtained by adding variables to D2 and D3: (D2′) five additional variables, defined from the variables of D2, and (D3′) nine additional variables, defined from the variables of D3. For these datasets, we used the same threshold as for the corresponding original datasets (e.g., the threshold for D2 is used for D2′). This is because the newly introduced variables are dependent on the original variables; thus, these variables only add new approximate vanishing polynomials with a relatively small extent of vanishing.

In Fig. 1, the coefficient norm of nonvanishing polynomials (upper row) and that of vanishing polynomials (bottom row) are plotted against the degree. The mean values are represented by solid lines and dots, and the maximum and minimum are represented by dotted lines. As can be seen from the figure, the mean coefficient norm tends to grow sharply with the degree (note that the vertical axes are in logarithmic scale). Even within a degree, there can be a huge gap, as in the degree-5 vanishing polynomials of D1 (bottom-left panel), the degree-7 vanishing polynomials of D2 (bottom-middle panel), and so forth. These results imply that some vanishing (or nonvanishing) polynomials from VCA might be vanishing (or nonvanishing) merely due to their small (or large) coefficients; such polynomials might become nonvanishing (or vanishing) once they are normalized to have a unit coefficient norm. This is corroborated by the result shown in Fig. 2(a). The raw extent of vanishing (blue dots and lines) is contrasted with the normalized extent of vanishing (red squares and lines). Although the average raw extent of vanishing (blue solid lines) is mostly similar across degrees, it becomes uneven by a few orders of magnitude after the normalization (red dashed lines). Moreover, after the normalization, some nonvanishing polynomials show an extent of vanishing below the threshold (dashed line), and some vanishing polynomials show an extent of vanishing above the threshold. This means that a reversal occurs between nonvanishing and vanishing polynomials in VCA. By contrast, as shown in Fig. 2(b), the extent of vanishing of the polynomials from MBC- is consistent before and after the normalization, simply because these polynomials are generated under the coefficient normalization.
Moreover, we can see that under the coefficient normalization, the extent of vanishing shows considerably lower variance for both nonvanishing and vanishing polynomials (note that the vertical axes are in linear scale in Fig. 2(b)). In other words, the original VCA overestimates the extent of vanishing due to the bloat in the coefficient norm.

Next, we evaluate MBC- with the coefficient truncation. The results are summarized in Table 1. We change the truncation threshold in (9) from 0.0 to 1.0. Following Theorem 5, we keep at least the minimal number of coefficients at each degree regardless of the threshold. Thus, one end of the range corresponds to the case where we keep exactly this minimal number of coefficients for each degree, and the other end corresponds to MBC- without the coefficient truncation. We here analyze the nonvanishing polynomials in terms of the length of the coefficient vectors (length), the actual coefficient norm (median and min/max), the runtime of the basis construction (runtime), and the termination degree (max degree). Note that vanishing polynomials at each degree do not affect the operations at the succeeding degrees; thus, the runtime of the algorithm mainly depends on how efficiently nonvanishing polynomials are handled. This is why we focus on the statistics of the nonvanishing polynomials. As can be seen in Table 1, the truncated coefficient vectors are approximately 50 times shorter. Nevertheless, the median and minimum of the coefficient norm are 1 and 0.9, respectively, for both datasets. This means that only 1–3% of the monomials and coefficients have a significant contribution to the basis polynomials. Even in the extreme case, the coefficient norm of the polynomials still lies in a moderate range, while the coefficient vectors are significantly shortened (around 4000 times shorter in D3). With the truncation, MBC- is accelerated by about a factor of two. Beyond a certain threshold, the coefficient vectors are no longer the main factor of the runtime, and thus any further acceleration must come from reducing other costs. On the other hand, VCA remains faster. However, its coefficient norm varies largely (e.g., a large gap between the minimum and maximum for D3). Also, note that coefficient vectors are not accessible for VCA in the numerical implementation. Thus, one cannot normalize or discard polynomials by weighing their coefficient norms as we did in this analysis.
For the above analysis, we calculated the coefficient vectors for VCA in the same manner as in MBC-, which incurs additional cost. The runtime was therefore measured by independently running VCA without the coefficient calculation.

### 5.2 Classification

Here, we extract feature vectors from data using vanishing polynomials and train a linear classifier on these vectors. In the training stage, we compute vanishing polynomials for each class. Let $G_i$ be the set of vanishing polynomials of the $i$-th class data. A feature vector of a data point $x$ is given by

 $F(x) = \left(\cdots,\, \bigl|g_1^{(i)}(x)\bigr|,\, \cdots,\, \bigl|g_{|G_i|}^{(i)}(x)\bigr|,\, \cdots\right)^{\!\top},$

where $g_j^{(i)}$ is the $j$-th vanishing polynomial of the $i$-th class. Intuitively, $F(x)$ for an $i$-th class data point takes small values in the part corresponding to the $i$-th class and large values in the rest. The extent of vanishing of the vanishing polynomials obtained by MBC- tends not to be balanced across degrees because the coefficient norm is constrained to be unity. Hence, we rescaled the vanishing polynomials based on the training data points of the corresponding class. Specifically, degree-$t$ polynomials are divided by the mean absolute value of the evaluations of these degree-$t$ polynomials. This rescaling was not applied to the VCA vanishing polynomials because the classification error increased when it was used. We employed regularized logistic regression and the one-versus-the-rest strategy using LIBLINEAR [rong09liblinear]. We used three datasets (Iris, Vowel, and Vehicle) from the UCI dataset repository [Lichman2013machine]. The regularization parameter was selected by 3-fold cross-validation. Since Iris and Vehicle do not have prespecified training and test sets, we randomly split each dataset into a training set (60%) and a test set (40%), which were mean-centralized and normalized so that the mean norm of the data points equals one.
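The feature extraction above can be sketched as follows. This is a minimal sketch with hypothetical vanishing polynomials supplied as callables; the per-degree rescaling and LIBLINEAR training are omitted.

```python
import numpy as np

def feature_vector(x, classes):
    """classes[i] is the list of vanishing polynomials g_j^(i) of class i,
    each given as a callable; F(x) stacks their absolute evaluations."""
    return np.array([abs(g(x)) for G_i in classes for g in G_i])

# Toy setup: class 0 vanishes on the unit circle, class 1 on the line x1 = x2.
classes = [
    [lambda x: x[0] ** 2 + x[1] ** 2 - 1.0],  # vanishing polynomial of class 0
    [lambda x: x[0] - x[1]],                  # vanishing polynomial of class 1
]
x_on_circle = np.array([1.0, 0.0])
F = feature_vector(x_on_circle, classes)      # -> [0.0, 1.0]
```

The class-0 entry nearly vanishes while the class-1 entry does not, which is exactly the pattern the linear classifier exploits.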

As can be seen from Table 2, MBC- significantly outperforms VCA in terms of classification error on all the datasets. The coefficient truncation slightly increased the classification error. Due to the rescaling, the coefficient norm of the vanishing polynomials of MBC- is no longer exactly equal to unity. However, the difference between the largest and smallest coefficient norms is moderate. In contrast, the VCA vanishing polynomials show a large variance in the coefficient norm. It is worth noting that neither VCA nor MBC- utilizes discriminative information across classes at the feature extraction stage. Therefore, the decrease in classification error suggests that MBC- better captures the nonlinear structure of the data. Indeed, MBC- obtained higher-degree vanishing polynomials than VCA on average.

## 6 Conclusion and Future Work

In this paper, we discussed the spurious vanishing problem in the approximate vanishing ideal, which has been an unnoticed theoretical flaw of existing polynomial-based basis constructions. To circumvent the spurious vanishing problem, polynomial-based basis constructions need to introduce a normalization so that the extent of approximate vanishing can be fairly evaluated across polynomials. We proposed a method to optimally generate basis polynomials under a given normalization. The proposed method is general enough to extend existing basis construction algorithms and to accommodate various types of normalization. In particular, we considered the intuitive but costly coefficient normalization. We proposed two methods to ease the computational cost: one is an exact method that takes advantage of the iterative nature of the basis construction framework, and the other is an approximation method, which empirically but drastically shortens the coefficient vectors while keeping the coefficient norm of the polynomials in a moderate range.

The experiments show the severity of the spurious vanishing problem in VCA and the effectiveness of the proposed method in avoiding it. In the classification tasks, VCA improved the classification accuracy when combined with the proposed method. An important future direction is to design a more scalable algorithm. Our experiments suggest that the coefficient norm of polynomials is well regularized even when only a small proportion of the monomials is considered. This can be a key observation for reducing the runtime of new algorithms. Another interesting direction is to consider different types of normalization.

## Acknowledgement

This work was supported by JSPS KAKENHI Grant Number 17J07510.

## Appendix A Proofs

### a.1 Proof of Theorem 1

###### Proof.

The minimal basis construction with Step2 is identical to VCA up to constant factors in the basis polynomials; VCA merely normalizes the polynomials differently at each degree. Thus, from Theorem 5.2 in [livni2013vanishing], which shows that VCA satisfies Theorem 1, the minimal basis construction also satisfies Theorem 1. ∎

### a.2 Proof of Theorem 2

###### Proof.

We prove the claim by induction on the degree $t$. Let us denote the basis sets obtained at the degree-$t$ iteration of the minimal basis construction with Step2; for the corresponding items in the basis construction with Step2, we put a bar on the symbols. From Theorem 1, we know that collecting these sets gives complete basis sets for both nonvanishing and vanishing polynomials. Here, we prove the claim by comparing the two families of basis sets. The inclusions in one direction are obvious because the barred sets are generated by imposing additional normalization constraints on the original generation. Thus, the main goal is to prove the reverse inclusions.

At , it is obvious that and . For , we assume and . Then, we can show