Hierarchical Data Reduction and LearningSubmitted to the editors June 26, 2019. \fundingThis work was funded by the grants NSF1821311, NSF1645053, NSF1621853.

# Hierarchical Data Reduction and Learning††thanks: Submitted to the editors June 26, 2019. \fundingThis work was funded by the grants NSF1821311, NSF1645053, NSF1621853.

Prashant Shekhar Computational and Data Enabled Sciences, University at Buffalo ().    Abani Patra Computational and Data Enabled Sciences and Department of Mechanical and Aerospace Engineering, University at Buffalo ().
###### Abstract

Paper proposes a hierarchical learning strategy for generation of sparse representations which capture the information content in large datasets and act as a model. The hierarchy arises from the approximation spaces considered at successively finer data dependent scales. Paper presents a detailed analysis of stability, convergence and behavior of error functionals associated with the approximations and well chosen set of applications. Results show the performance of the approach as a data reduction mechanism on both synthetic (univariate and multivariate) and real datasets (geo-spatial, computer vision and numerical model outcomes). The sparse model generated is shown to efficiently reconstruct data and minimize error in prediction.

S
\newsiamremark

remarkRemark \newsiamremarkhypothesisHypothesis \newsiamthmclaimClaim \headersHierarchical Data Reduction and LearningP.Shekhar and A.Patra \externaldocumentex_supplement \pdfstringdefDisableCommands

parse representation and learning, Data Reduction, Approximation error analysis

{AMS}

68W25, 65D15, 33F05

## 1 Introduction

Hierarchical modeling is widely used both in data driven and physics based models [27, 14, 18, 23] and is deeply influenced by early and ongoing work in multigrid methods [5, 17]. The hierarchical model we consider builds on the ability to analyze data at different resolutions rapidly minimizing the error in fitting. In this paper, we focus on a strategy which processes the data at different scales by constructing corresponding basis representations and inferring if these scale and data dependent bases are able to approximate the observed data () in the sense

 minCs∈Rls||f|X−BsCs||2≤TOL (1)

Here, is a scale dependent quantity chosen by the algorithm, is a user defined tolerance level, with being the number of observations in a dimensional space (, ) . in (1) is computed by the algorithm as optimal coordinates of projection on . Besides constructing these basis at each scale, this strategy also helps us in identifying “representative” data points from the complete dataset which we refer to as the corresponding sparse data (). along with the corresponding scale of convergence help us to reconstruct not only the original dataset but also predict (model) at any new point in the domain of interest. Therefore the approach discussed here not only contributes to data reduction (by sparsifying it)[20, 10, 25] but also provides an efficient model for approximation of the underlying data [13, 7].

Mathematically, data sparsification can be illustrated as follows. Suppose we have data taken at scattered points ( for dimensional inputs). Then our aim is to find a subset of the original data points (), which is sufficient for acceptable reconstruction or approximation of as in equation (1). This subset with the corresponding projection coordinate , is regarded as our obtained sparse representation. This compressed representation of the dataset acts as a model for new prediction points . The sparse representation also enables reconstruction of the original dataset at when needed. While this is somewhat related to the statistical emulators [15], the notion of scale is rarely explained in that domain.

The approximation problem [22, 8] is a more traditional one. With the data configuration as explained above, the objective is to compute an approximation for the underlying function which was discretely observed () such that it minimizes the normed error measure . Following the standard notation, for a normed space and any approximation to such that , the quality of approximation is usually quantified by the error norm . here is usually constructed by first fixing a finite dimension approximation space and then computing the optimal approximation by minimizing normed distance between and :

 min~f∈Q||f−~f||V=||f−(Af)||V (2)

For having a general basis formulation strategy for any finite dimension, we have implemented a data dependent basis () [11]. This directly follows from Mairhuber-Curtis theorm (theorem 2 in [16]) which states that for and , there is no Chebyshev system on . Therefore for some data , the vandermonde matrix can be singular affecting the quality of approximation. As a remedy, we choose a data dependent continuous kernel function which is positive definite and symmetric on . Thus, for design points , the kernel function becomes . From the literature available for these functions [3, 6, 29] we know that, these kernels are reproducing in the native Hilbert Space of functions on in the sense

 H=f(x)x∈Ω,f∈H (3)

The functionals are the Reisz representers of the linear point functionals in the dual space of . Therefore we have the relationship

 K(x,y)=H=<δx,δy>H∗x,y∈Ω (4)

For observed data at , one can consider the space of trial function as a choice for the approximation space. However from work such as [11, 12] we know that the bases formed by translates of the kernel functions are highly dependent on the distribution of data points in X and hence ill-conditioned for many problems. For dealing with this issue, at each scale s, our approach considers a subset of original trial functions as bases such that

 Γs=span{Ks(.,xi):xi∈Xs}=span{Ks(.,xj):xj∈X}Xs⊆X (5)

This is equivalent to identifying a lower dimension manifold in the space (as compared to original dimension being n for n data points, represented by the kernel matrix ). These linearly independent data based basis functions are sampled using a Pivoted QR strategy as detailed in the following section (Algorithm 1). It should be noted here that at each scale we have a different kernel Function . Hence, effectively we are working in a different RKHS at every scale (owing to the uniqueness property [3] of kernel functions). Let this scale dependent RKHS be represented as . Therefore at any scale s, the function to be approximated () is assumed to be in the RKHS . However, our approach is aimed at finding an approximation to as in (5). Hence, if we look at the problem formulation (2), then is the space and is our approximation space .

The contributions of this paper can be summarized as follows. Firstly we introduce an approach which computes a scale () dependent sparse representation () of a large dataset. Besides data reduction, the algorithm also provides a model for approximations at new data points in the observation space using just , enabling any sort of learning from a compressed version of the dataset. We then provide detailed analysis on convergence, establishing the optimality of the solution obtained by the algorithm, computation of point-wise error functionals and their behavior which governs the performance of the approach. At each scale, besides providing confidence bounds which quantify our belief in the estimation, we also provide prediction intervals which estimates the bounds in which the algorithm believes any observations not included in the sparse representation as well as any new samples to be made in the future will lie. This also makes the sparse representation more useful. Towards the end, we have also provided stability estimates to further establish the dependability of the approach.

## 2 The Hierarchical Approach

In this research, developing on the work of [4], we introduce a methodology of data reduction through efficient basis construction exploiting the multilevel nature of the correlation structure present in the data. The basic steps involved in the approach are presented as Algorithm 1, here.

### 2.1 Proposed Algorithm

The proposed approach (Algorithm 1) constructs a sequence of scale (s) dependent approximations to the unknown function . Each of these approximations only use a subset of dataset respectively. Algorithm begins by taking a dataset, where a data point is mapped to a functional value . In matrix form values are obtained at data points ( and ). The scalars are the algorithmic parameters defined by the user. TOL is the simple - error tolerance, P is assumed to be 2 (Based on [4]). This choice of P reduces the length scale of the gaussian kernel by 0.5 at each scale increment, providing an intuitive understanding of how the support of basis functions is modified. If we assume the diameter of the dataset to be distance between the most distant pair of datapoints, then T is given by

 T=2(Diameter(X)/2)2 (6)

Besides these parameters, the algorithm also accepts the prediction points , which represent the data points at which the user wants to approximate the underlying function. One other choice which needs to be made before moving further is the choice of the positive definite function (). For this paper we use the squared exponential kernel (7) for mapping the covariance structure and generating the space of trial functions (equation 5) at each scale s.

 Gs(x,y)=exp(−||x−y||2ϵs) ; ϵs=TPs (7)

Here (also known as the length scale parameter) determines the width of the correlation structure at a particular scale. This squared exponential kernel is very widely used in the gaussian process literature [24]. The learning phase (for generating ) of the proposed algorithm has been explained in and below. constitutes the prediction phase which uses the produced sparse representation for the dataset. It should be noted that, it is not required to wait till the convergence of the algorithm to go to the prediction phase. In fact, it is possible to predict from the sparse representation at each scale (). We have shown the prediction phase separately for the purpose of clarity.

STEP-1 (Getting sparse data-): Given the Dataset , the algorithm begins with the computation of the covariance operator (7). However, based on [11, 12], the distribution of the dataset might lead to ill-conditioning of this covariance kernel. Therefore we carry out a column pivoted QR decomposition to identify the space (at each scale) which represents the span of the trial functions at some scale s. The QR decomposition is carried out on W for obtaining the Permutation matrix . is produced by the product of a random normal matrix with the . Here we have with . For our experiments we have assumed (as in [4]) which means we sample 8 additional rows to account for numerical round-offs during the QR decomposition. However, this is a conservative step and even without any additional sampled columns, the algorithm was found to perform well. The permutation matrix produced by the decomposition captures the information content of each column of W. is then used to extract independent columns with the biggest norm contributions along with the observation points these columns correspond to in the covariance kernel (). This set of sampled observations from the original dataset is termed as the corresponding sparse data(). The number of columns sampled from come from its numerical rank estimated by using strategies such as a or a decomposition.

STEP-2 (Getting projection coordinate-): Once, we have the relevant columns of () which also represent the approximation subspace (in the native RKHS) which spans , the algorithm proceeds to solve the over-constrained system (). We can think of it as an orthogonal projection problem where needs to be projected on the column space of and then it is required to compute the specific weighting of vectors in the basis matrix which produces this projection. The Algorithm computes the orthogonal projection () given the required coordinates and basis vectors in .

Once we have the scale at which the algorithm satisfies the - condition, it produces the sparse representation and proceeds to the prediction stage if required by the user. Here is called the convergence scale as detailed in the following subsection.

STEP-3 (Prediction at from ): For computing functional values at unobserved location , basis functions are constructed by computing (gaussian kernel for the prediction points with the sparse data - ). The Prediction step weighs the constructed basis with coordinates of orthogonal projection (, if the prediction is being made at the convergence scale) and linearly combines them to produce the required approximation.

Finally, before moving forward, it is worth mentioning here that the - criteria is just one of the many possible kinds of norms which can be used to measure the scale dependent fidelity of the model.

### 2.2 Critical Scale (Sc) and Convergence Scale (Sa)

In this work, the scale at which the kernel matrix becomes well conditioned and numerically full rank is referred to as Critical Scale. Working with proposition 3.7 in [4], if represents the precision of rank for the gaussian kernel matrix, then we can define the numerical rank of the kernel as

 Rδ(Gs)=#(j:σj(Gs)σ0(Gs)≥δ) (8)

where is the largest singular value of . Also let represents the length of the bounding box of the data in () dimension. Then given the length scale parameter , the rank of the gaussian kernel can be bounded above as

 Rδ(Gs)≤d∏i=1(2|Ii|π√ϵ−1sln(δ−1)+1) (9)

Proposition 3.7 in [4] states that numerical rank of the gaussian kernel matrix is proportional to the volume of the minimum bounding box and to . Therefore for a fixed data distribution, following relation holds

 Rδ(Gs)∝ϵ−d/2∝Psd∝s;for % P=2 (10)

Hence numerical rank of the gaussian covariance kernel is directly proportional to the scale of study. Therefore, there exists a minimum scale , at which the kernel becomes full and continue to stay full rank as the scale is further increased.This scale is referred to as the Critical Scale (). Hence, based on the overall Algorithm 1, If and so on, are the sampled sparse representation at scale s, they satisfy the relation

 |X1|≤|X2|≤|X3|≤.....≤|XSc| (11)

where denotes the cardinality of a set. Therefore with increasing scales more and more data points are added leading to better approximation. Thus for any scales j and i satisfying the relation , the following relation holds

 ||f|X−(Ajf)|X||2≤||f|X−(Aif)|X||2 (12)

From the approximation theory literature [9], we can now make following statements about the kernel at the Critical Scale. For a finite point set

• is positive definite

• If vanishes on , then

• The approximation satisfying the following relation is unique

 lims→Sc(Asf)|X=f|X

Now, coming to Convergence scale (). We define it as the minimum scale s which satisfies

 ||f|X−(Asf)|X||2≤TOL

Therefore it is the scale at which Algorithm 1, stops. It is worth noting here that based on the definition of and , we have and

## 3 Convergence Properties

Following the notations introduced in the previous sections, again let be the function which needs to be approximated and be the approximation produced by the Algorithm 1 at scale s. Correspondingly, let be the error in approximation as scale s. Then the - for the error in can be stated as

 ||Es|X||2=||f|X−(Asf)|X||2

However could be written as the projection of the on the basis at scale . Here the projection operator is given as , where is the pseudo-inverse of basis matrix . We will now denote this projection operator as .Therefore

 ||Es|X||2=||f|X−Rsf|X||2=||(I−Rs)f|X||2

 ||Es|X||2≤||I−Rs||2||f|X||2 (13)

Now we know from Equation (10), that numerical rank of basis increases monotonically with scale s. Hence as

 Rs→In

Thus and hence . Therefore if we denote the operator norm by , then by (13)

 ||Es|X||2≤βs||f|X||2

and, therefore

 lims→Sc||Es|X||2→0 (14)

This establishes the convergence for the proposed approach in - at some finite scale . Now, we will generalize the convergence guarantee to all possible norm in .

###### Lemma 1

For any choice of and any norm: implemented by the user, the proposed algorithm (Algorithm 1) converges in the sense

 ||Es|X||=||f|X−(Asf)|X||≤TOL (15)

for some finite and the corresponding produced approximation

###### Proof

The proof directly follows from equivalence of norms on finite dimensional vector space [19].

 g1||x||a≤||x||b≤g2||x||a

where and . Therefore under certain conditions if , then has to go to 0. Similarly for our case (equation 14), convergence in , shows convergence in any other possible norm in .

In the following theorem we analyze a particular updating scheme for the proposed hierarchical approach

###### Theorem 1

If the projection update of the proposed algorithm is written in an iterative form

 (As+1f)|X=(Asf)|X+αs⋅Es|X (16)

where . Then follows the bounds

 0<αs<2 (17)

and Convergence rate can be expressed as a function of as

 ρs=|1−αs| (18)

###### Proof

Since at each scale s, is generated as projection of on space

 Γs=span{Ks(.,xi):xi∈Xs}Xs⊆X

Therefore using the notation , we write this update in an iterative form

 (As+1f)|X=(Asf)|X+αs⋅Es|X (19)

Firstly it should be easy to see that if is positive (direction wise), that means should be obtained after adding a positive quantity to the (if the algorithm has to converge) and hence should be non-negative. In the second scenario if is negative, that would mean should be obtained after subtracting some quantity from and hence again . Therefore overall based on the type of update defined in (19), is established. Coming back to (19)

 (As+1f−Asf)|X=αs⋅Es|X

and taking an inner product with respect to

 <(As+1f−Asf)|X,Es|X>=<αs⋅Es|X,Es|X>

Therefore,

 αs=<(As+1f−Asf)|X,Es|X>||Es|X||2

Now, we know

 αs ≤||(As+1f−Asf)|X||⋅||Es|X||||Es|X||2(Cauchy Schwarz) =||(As+1f−Asf)|X||||Es|X||≤1+||Es+1|X||||Es|X||(Triangle inequality)

Therefore follows the bounds

 0≤αs≤1+||Es+1|X||||Es|X|| (20)

Now, since based on the nature in which approximations are constructed . Therefore for all scales, . However, we note that the end member imply the iterative scheme has converged and no improvement are needed. The stopping criterion in Algorithm 1 makes sure of that. Hence we remove these end members obtaining our desired bounds.

Coming back to equation (19)

 (As+1f−f)|X=(Asf−f)|X+αs(f−Asf)|X

 ρs=||(f−As+1f)|X||||(f−Asf)|X||=|1−αs| (21)

Now we analyze the inner product of the error of the projection () with respect to projection at scale in the native RKHS. The major power of the following theorem lies in the fact that as the proposed algorithm converges, we are able to upper bound this inner product as a function of the user defined tolerance (). This provides the user direct control on the approximation properties on the algorithm even in the native RKHS. It should be noted again that at each scale , it is assumed that the function to be approximated belongs to the same Hilbert space and is approximated in , justifying the use of the reproducing property.

###### Theorem 2

For observed data on , the approximation produced by the hierarchical algorithm and its corresponding prediction error satisfies the following bounds

 lims→Sa∣∣⟨(Asf),Es⟩Hs∣∣ ≤||CSa||∞√n(TOL) (22)

where contains the modulus of coefficients of the basis vector at . n is the number of observation and TOL is the - convergence error tolerance in the hierarchical algorithm.

###### Proof

Beginning with the Inner products in the RKHS

 ∣∣⟨(Asf),Es⟩Hs∣∣ =∣∣⟨∑xj∈XsKs(⋅,xj)cj,f−Asf⟩Hs∣∣ =∣∣∑xj∈Xsf(xj)cj−∑xj∈Xs(Asf)(xj)cj∣∣(Reproducing Property) =∣∣∑xj∈Xscj(f(xj)−(Asf)(xj))∣∣ ≤∑xj∈Xs∣∣cj(f(xj)−(Asf)(xj))∣∣(Triangle Inequality) ≤||Cs||∞∑xj∈Xs∣∣(f(xj)−(Asf)(xj))∣∣ ≤||Cs||∞∑xj∈X∣∣(f(xj)−(Asf)(xj))∣∣(Xs⊆X) =||Cs||∞||Es|X||1

Here is the norm of .

 |<(Asf),Es>Hs| ≤||Cs||∞||Es|X||1 ≤||Cs||∞√n||Es|X||2(Cauchy-% Schwarz Inquality)

However, from the convergence criteria in the proposed Algorithm, we know as . Hence the theorem follows

## 4 Approximation Properties and Confidence Intervals

This section provides estimates quantifying the quality of approximations generated by the proposed algorithm at each scale. The first result is a direct application of theorem (2)

### 4.1 Approximation Properties

###### Corollary 1

For observed data on , the approximation produced by the hierarchical algorithm at the convergence scale (), follows the Pythagoras Theorem in the limit . i.e,

 ||f||2HSa=||ASaf||2HSa+||f−ASaf||2HSa (23)

###### Proof

Let be the function to be approximated. As stated earlier, it was discretely observed at leading to the restricted function . Starting with the norm of the function in the Hilbert Space

 ||f||2Hs =||(Asf)||2Hs+||f−(Asf)||2Hs+2|<(Asf),f−(Asf)>Hs| =||(Asf)||2Hs+||f−(Asf)||2Hs+2|<(Asf),Es>Hs|

Using theorem 2, at in the limit , the result follows

Now, we will provide results related to uniqueness and quality of solution.

###### Theorem 3

For observed data on , the approximation in the limit produced by the proposed algorithm at the convergence scale is

1. the unique orthogonal projection to

2. the unique best approximation to with respect to

###### Proof

(1): would be a unique orthogonal projection of on if . Now again using the reproducing property of RKHS

 =∣∣∑xj∈Xscjf(xj)−∑xj∈Xscj(Asf)(xj)∣∣

The rest of the proof is similar to steps in the proof of theorem 2 showing the considered native Hilbert space norm vanishes as

(2): Considering . Then at , as (by part 1) Therefore,

 ||d−f||2Hs=||d−Asf+Asf−f||2Hs=||d−Asf||2Hs+||Asf−f||2Hs

This implies

 ||Asf−f||2Hs<||d−f||2Hsds≠Asf

Which establishes the optimality of the approximation

Now, we will move towards the analysis of Error functional and pointwise error bounds. As stated earlier, at scale s, we are searching for a solution in the space . Thus our approximation is of the form

 (Asf)(x)=∑xj∈XsKs(x,xj)cj

Now, any can also be expressed as

 (Asf)(x)=λyKs(x,y); Where λ=∑xj∈Xscjδxj

Here is the evaluational functional for , i.e. and is application of linear functional on . Thus we have the dual space

 H∗s={∑xj∈Xscjδxj}

Now, we present a result which provides an inner product representation for approximation at a point . The objective here is to show, that we can even show the optimality of the approximation by requiring vanishing gradient for the norm of the Error Functional in .

###### Theorem 4

If we represent the error functional at scale s in a form such that

 |εsx(f)|=|f(x)−(Asf)(x)|=|δsx(f)−Ms(x)TδsX(f)| (24)

Then the optimal value of which minimizes the error functional norm satisfies the inner product

 =(Asf)(x) (25)

Thereby, establishing the optimality of approximation generated by the proposed algorithm at each scale s

###### Proof

We begin by expressing the error functional norm

 ||εsx||2Hs =<δsx−Ms(x)TδsX,δsx−Ms(x)TδsX>Hs =||δsx||2Hs−2Ms(x)TδsXδsx+Ms(x)TδsXδsTXMs(x)

Now, based on the property of dual space, we know at scale s,

 <δsa,δsb>Hs=Ks(a,b)

Also, let

 Rs(x)=δsXδsx=(Ks(x,x1),Ks(x,x2),....,Ks(x,xn))T∈Rn
 Gs=δsXδsTX(gaussian kernel at scale s)

Therefore we have

 ||εsx||2Hs =||δx||2Hs−2Ms(x)TRs(x)+Ms(x)TGsMs(x) (26)

On differentiation we find that the optimal that minimizes , is the the solution of the equation

 GsMs(x)=Rs(x) (27)

However, from the Algorithm 1, we know need not be full rank. Hence, again using matrix A (as in Algorithm 1) and constructing . Now, we again do a Column pivoted QR decomposition and have

 WPs=QR

Applying the permutation operator on equation (27)

 PsRs(x)=PsGsMs(x) (28)

Now we use the fact that is a symmetric operator, therefore if represents the numerical rank of , then sampling first rows of will produce , i.e. transpose of the basis considered at scale s.

Correspondingly sampling the respective values in produces the restriction of to set (represented as ). Therefore, restricting system (28) to equations only corresponding to

 Rs(x)|Xs=BsTMs(x) (29)

using MooreâPenrose inverse for getting the optimal projection of . Thus optimal solution for