Multiple Tensor on Tensor Regression: An approach for modeling processes with heterogeneous sources of data

# Multiple Tensor on Tensor Regression: An approach for modeling processes with heterogeneous sources of data

## Abstract

With advancements in sensor technology, a heterogeneous set of data, containing samples of scalar, waveform signal, image, or even structured point cloud are becoming increasingly popular. Developing a statistical model, representing the behavior of the underlying system based upon such a heterogeneous set of data can be used in monitoring, control, and optimization of the system. Unfortunately, available methods only focus on the scalar and curve data and do not provide a general framework that can integrate different sources of data to construct a model. This paper poses the problem of estimating a process output, measured by a scalar, curve, an image, or a point cloud by a set of heterogeneous process variables such as scalar process setting, sensor readings, and images. We introduce a general multiple tensor on tensor regression (MTOT) approach in which each set of input data (predictor) as well as the output measurements are represented by tensors. We formulate a linear regression model between the input and output tensors and estimate the parameters by minimizing a least square loss function. In order to avoid overfitting and to reduce the number of parameters to be estimated, we decompose the model parameters using several bases, spanning the input and output spaces. Next, we learn both the bases and their spanning coefficients when minimizing the loss function using an alternating least square (ALS) algorithm. We show that such a minimization has a closed-form solution in each iteration and can be computed very efficiently. Through several simulation and case studies, we evaluate the performance of the proposed method. The results reveal the advantage of the proposed method over some benchmarks in the literature in terms of the mean square prediction error.

## 1 Introduction

With advancements in sensor technology, heterogeneous sets of data containing scalars, waveform signals, images, etc., are more and more available. For example in semiconductor manufacturing, machines/process settings (scalar variables), sensor readings in chambers (waveform signals), and wafer shape measurements (images) might be available to represent the process. Analysis and evaluation of such diverse data can benefit many applications including manufacturing processes [1, 2], food industries [3], medical decision-making [4], and structural health monitoring [5]. Specifically, such data can be used to produce a statistical model that estimates a value of interest based upon several other input variables (regression). Unfortunately, most works in regression modeling only consider scalars and waveform signals as the inputs and output of a process [6, 7, 8, 9], and are not able to include image or structured point cloud measurements. However, in several applications image or structured point cloud data is available. For example, material scientists are interested in developing a link between the process variables, e.g., the temperature and pressure under which a material is being used, and the microstructure of the material [10]. Usually, the microstructure of a material is represented by an image, or a variation of the original image obtained by two-point statistics. To generate such a linkage model, one requires to regress between scalar and profile process variables as inputs and an image data. For more details on such a model and a case study please refer to [11].

As another example, in semiconductor manufacturing, overlay errors, defined as the difference between in-plane displacements of two layers of a wafer, are directly influenced by the shape of the wafer before the lithographic process. In this process, both the wafer shape and the overlay error (in x and y directions) can be represented as images as shown in Figure b. Prediction of the overlay error across a wafer based on the wafer shape can be fed forward to exposure tools for specific corrections [12]. In order to predict the overlay error based on the wafer shape deformation, an image-on-image statistical model is required to capture the correlation between these two objects.

Developing an accurate, interpretable model for a process with a heterogeneous group of variables (i.e., variables of different forms such as scalar, curve, and images, etc.) is challenging due the size and complex structure of each data point, as well as the integration of different forms of inputs. Regular regression approaches that take each observation within an input image/profile as an independent covariate, excessively increase the number of covariates in comparison to the sample size () and ignore the correlation between the observations. Principle component regression (PCR) alleviates the problem by first reducing the dimension of both the input variables and the output. Nevertheless, PCR fails to exploit the structure of, for example, images, or point clouds. Furthermore, PCR determines the principle components (PCs) of the inputs and the outputs separate from each other without considering the interrelationship between the input variables and the output. Functional Data analysis, specifically functional regression models (FRM), has become popular in recent years due to their built-in data reduction functionality, and their capability in capturing nonlinear correlation structures as well as exploiting the data structure (see for example, [6, 7, 13, 8, 14, 9]). Unfortunately, these approaches are difficult to extend to data forms other than curves, and require specifying a set of basis functions that are extremely domain dependent. Furthermore, combining a scalar data with other forms of data generates an extra challenge, and in many situations requires a separate treatment. Recently, [9] proposed an approach that can combine several profile and scalar inputs to predict a curve output, while learning the bases. Unfortunately, it is not clear how to extend this approach to other forms of data effectively. As a result, the FRM may not be suitable for the processes with heterogeneous set of variables.

In the past few years, multilinear algebra, and in particular tensor analysis, has shown promising results in many applications from network analysis to process monitoring [15, 16, 17]. Unfortunately, the literature on tensor regression is limited. These approaches mostly depend on tensor factorization methods, e.g., PARAFAC/CANDECOMP (CP) and Tucker decompositions that approximate a tensor using several bases or lower rank tensors [18]. The CP decomposition approximates a tensor as a sum of several rank-1 tensors [19]. The Tucker decomposition is a form of higher-order PCA. It decomposes a tensor into a core tensor multiplied by a matrix along each mode [20]. The CP decomposition is a special case of a Tucker decomposition and assumes the same rank for all the basis matrices. [21] has successfully employed tensor regression using CP decomposition to estimate a scalar variable based upon an image input. They further extend their approach to the generalized linear model for the tensor regression in which the scalar output follows any exponential-family distribution. [22] performed tensor regression with scalar output using Tucker decomposition. [23] performed the opposite regression and estimated a point cloud data using a set of scalar process variables. Recently, [24] developed a tensor on tensor regression (TOT) approach that can estimate a tensor using a tensor input, while learning the decomposition bases. However, several limitations in their approach should be addressed. First, TOT uses the CP decomposition which restricts both the input and output bases to have exactly the same rank (say ). This restriction may cause over or under estimation because in many applications the input and the output have different ranks. For example, when estimating an image based on few scalar inputs, the rank of the output can be far larger than the input matrix. Second and more importantly, their approach can only take into account a single tensor input and cannot be used effectively when multiple sources of input data with different dimensions and forms (e.g., a combination of scalar, curve, and image data) are available. The extension of their approach to multiple tensor inputs generates a significant challenge due to the first limitation mentioned before. Because the output and the inputs should have the same rank, extending the TOT approach to multiple tensor inputs as well requires all the inputs to have the same rank (which is equal to the rank of the output). But this means in situations when for example a scalar and image inputs exists, one of the inputs should take the rank of the other, causing a severe underfitting or overfitting problem. One approach to allow multiple forms of data in the TOT is to combine all the data into one single tensor input, e.g., an image and scalar input can be merged by transforming each scalar value into a constant image. But this approach generates some issues: first, it significantly increases the size of data and second, it masks the separate effect of each input on the output due to the fusion of inputs. Furthermore, in the situations when the dimension of modes does not match (for example a curve input with 60 observed points, and an image of size 50x50), merging data into one tensor requires a method for dimension matching. Finally, their approach fails to work on moderate size tensors (e.g., on the image data of size 20000 pixels we used in our case study) due to its high space complexity.

The overarching goal of this paper is to overcome the limitations of the previous methods, such as the PCR, FRMs, and the TOT, by developing a unified regression framework that estimates a scalar, curve, image, or structured point cloud output based upon a heterogeneous set of input variables. This will be achieved by representing the output and each group of input variables as a separate tensor, and developing a multiple tensor-on-tensor regression (MTOT). To avoid overfitting and estimating a large number of parameters, we perform a Tucker decomposition over each group of inputs’ parameters, using a set of bases that expands each of the input spaces and another set of bases that spans the output space. We obtain the input bases by performing a Tucker decomposition on the input tensors and then define a least square loss function to estimate the decomposition coefficients and output bases. To avoid identifiability issues, we impose an orthonormality constraint over the output bases when minimizing the loss function, and show a closed-form solution for both bases and decomposition coefficient in each iteration of our algorithm. This approach not only performs dimension reduction similar to PCR, but also does it assure that the learned output bases are those that correlate with the input space. Furthermore, because we use tensors to represent the data, we can better exploit the structure of image or structured point cloud data.

The rest of the article is organized as follows: In Section 2, we introduce our notation and explain the multilinear algebra that is used in the paper. Next, in Section 3, we first formulate the multiple tensor regression model and illustrate the closed form solution for estimating the parameters. In Section 4, we describe four simulation studies. The first simulation study combines a profile and scalar data to estimate a profile output. This is particularly selected to be compared with the available methods in functional regression. The second and third simulation studies contain images or point clouds either as the input or output. Finally, the last simulation considers estimating a nonsmooth profile using a set of scalar inputs. In each simulation study, we compare the performance of the proposed method with benchmarks in terms of (standardized) prediction mean square errors. A case study on predicting the overlay errors based on the wafer shape is conducted in Section 5. Finally, we summarize the paper in Section 6.

## 2 Tensor Notation and Multilinear Algebra

In this section, we introduce the notations and basic tensor algebra used in this paper. Throughout the paper, we denote a scalar by a lower or upper case letter, e.g., or , a vector by a boldface lowercase letter and a matrix by a boldface uppercase letter, e.g., and, and a tensor by a calligraphic letter, e.g., . For example, we denote an order– tensor with , where is the dimension of the mode of tensor . We also denote a mode– matricization of tensor as whose columns are the mode– fibers of the corresponding tensor , and . We also define a more general matricization of a tensor as follows. Let and be two sets partitioning the set of dimension indices of the tensor . Then, the matricized tensor is specified by , where and , and

 (T(I×Q))ij=Tp1⋯plq1⋯qd,

where and . For simplicity of notation, we will denote as .

The Frobenius norm of a tensor can be defined as the Frobenius norm on any matricized tensor, e.g., . The mode– product of a tensor by a matrix is a tensor in and is defined as . The Tucker decomposition of a tensor decomposes the tensor into a core tensor and orthogonal matrices so that . The dimensions of the core matrix is smaller than , i.e., . Furthermore, the Kronecker product of two matrices and is denoted as and is obtained by multiplying each element of matrix to the entire matrix :

 A⊗B=⎡⎢ ⎢⎣a11B...a1nB⋮⋱⋮am1B...amnB⎤⎥ ⎥⎦

We link the tensor multiplication with the Kronecker product using the proposition 1.

###### Proposition 1.

Let and , and let and , then

 T=C×1U1×U2×3⋯×lUl×l+1V1×l+2⋯×l+dVd⟺ T=(Ul⊗Ul−1⊗⋯⊗U1)C(Vd⊗Vd−1⊗⋯⊗V1)T,

where , and is an unfold of the core tensor .

The proof of proposition can be found in [18]. Finally, the contraction product of two tensors and is denoted as and is defined as,

 (X∗B)q1⋯qd=∑p1,⋯,plXp1,⋯,plBp1,⋯,pl,q1,⋯,qd.

## 3 Tensor on Tensor Regression Framework

In this section, we discuss the tensor on tensor framework as an approach for combining multiple sources of data with different dimensions. Assume a set of training data of size is available and includes response tensors and input tensors , where is the number inputs. The goal of the tensor regression is to model the relationship between the input tensors and the responses using a linear form,

 Yi=p∑j=1Xji∗Bj+Ei,i=1,⋯M, (1)

where is the model parameter to be estimated and is an error tensor whose elements are from a random process. To achieve a more compact representation of the model (1), we can combine tensors , , and , into a one–mode larger tensors , , and , and write,

 Y=p∑j=1Xj∗Bj+E, (2)

The matricization of (2) gives,

 Y(1)=p∑j=1Xj(1)Bj+E(1), (3)

where and are mode–1 unfolding of tensors and , respectively, and the first mode corresponds to the sample mode. is an unfold of tensor with and . It is intuitive that the parameters of (3) can be estimated by minimizing a mean squared loss function, . However, this requires estimating parameters. For example, in the situation when , minimizing the loss function gives a closed-form solution that requires estimating parameters. Estimating such a large number of parameters is prone to the severe overfitting and is often intractable. In reality, due to the structured correlation between and , we can assume that the parameter lies in a much lower dimensional space and can be expanded using a set of basis matrices via tensor product. That is for each , we can write,

 Bj=Cj×1Uj1×2Uj2×3⋯×ljUjlj×lj+1V1×lj+2⋯×lj+dVd, (4)

where , and is a core tensor, is a basis matrix expanding the input space, and is a basis matrix expanding the output space. With this low dimensional representation, the estimation of reduces to learning the core tensor and the basis matrices, and . In this paper, we allow to be selected by the users. Two important choices of are truncated identity matrix (i.e., no transformation on the inputs), or the bases obtained from the Tucker decomposition of the input tensor , i.e.,

 {Dj,Uj1,⋯,Ujlj}=argminDj,{Uji}∣∣∣∣Xj−Dj×1Uj1×⋯×ljUjl∣∣∣∣2F

In a particular case that an input tensor is a matrix, the bases are the principle components (PCs) of that input if one uses the Tucker decomposition. Allowing be selected is reasonable because is an independent variable and its basis matrices can be obtained separately from the output space. Furthermore, learning the core tensor and the bases provides enough degree of freedom to learn . Next, we iteratively estimate the core tensor and the basis matrices by solving the following optimization problem:

 {Cj,V1,⋯,Vd}=argmin∣∣ ∣∣∣∣ ∣∣Y(1)−p∑j=1Xj(1)Bj∣∣ ∣∣∣∣ ∣∣2Fs.t.VTiVi=I~Qi;i=1,⋯,d, (5)

where is a identity matrix. The orthogonality constraint is to assure the uniqueness of both the bases and the core tensor. In general, the problem of estimating functional data through a set of functions may not be identifiable under some conditions. That is, assuming , we can find , such that , i.e., and both estimate same mean value for the output. [25, 26, 24] discuss the identifiability problem in functional and tensor regression. Because the main purpose of this paper is to estimate and predict the output, we do not deal with the identifiability problem here, as leaning any correct set of parameter will eventually lead to the same estimation of the output.

In order to solve (5), we combine the alternating least square (ALS) approach with the block coordinate decent (BCD) method. The advantage of the ALS algorithms that leads to their widespread use are the conceptual simplicity, noise robustness, and computational efficiency [27]. In tensor decomposition and regression, due to the non-convex nature of the problem, finding the global optimum is often intractable and it is well-known that the ALS algorithm also has no global convergence guarantee and may be trapped in a local optima [18, 27]. However, ALS has been shown great promise in the literature to solve the tensor decomposition and regression applications with satisfying result on the local optima. To be able to employ ALS-BCD, we first demonstrate proposition 1:

###### Proposition 2.

When ,, and are known, a reshaped form of the the core tensor can be estimated as

 ~Cj=Rj×1(ZTjZj)−1ZTj×2(VT1V1)−1VT1×3(VT2V2)−1VT2⋯×d+1(VTdVd)−1VTd (6)

where, , and . Note that has fewer modes () than the original core tensor in (4), but it can be transformed into by a simple reshape operation.

The simplified proof of this proposition is given in appendix A. Furthermore, if ’s are orthogonal, the core tensor can be obtained efficiently by the tensor product as,

 Extra open brace or missing close brace

Furthermore, one can estimate the basis matrices as follows:

###### Proposition 3.

With known , , and , we can solve by

 Vi=RWT

where and are obtained from the singular value decomposition of and , , and is the mode–i matricization of tensor . Note that is truncated.

The simplified proof of this proposition is shown in Appendix B. First, we do not require to calculate the Kronecker product explicitly to find . In real implementation, we can use proposition (1) to calculate the complete matrix using tensor products efficiently. Second, unlike the principle component regression (PCR) in which the principle components of the output are learned independent of the inputs, the estimated basis matrices directly depend on the input tensors, assuring correlation between the basis and the inputs. By combining propositions 2 and 3, Algorithm 1 summarizes the estimation procedure for multiple tensor on tensor regression. This algorithm in fact combines the block coordinate decent (BCD) algorithm with the ALS algorithm.

### 3.1 Special cases

Let us turn to a special case in which inputs and outputs are scalars. We can combine the samples and inputs into a vector (a first order tensor) and a matrix . Let us start by choosing , a identity matrix. That is, we do not perform any transformation on . Proposition 2 gives the core tensor as, . Note that is an scalar, which reduces the core tensor to . Proposition 3 gives, , which results in . This is precisely, the results of a regular regression. If we assume that is obtained from the Tucker decomposition of , the result is equivalent to the principle component regression.

Lets us also consider another special case, solution to which is offered by [23]. Let be a tensor to be estimated through a set of scalars, combined in an input tensor . Then, , which is precisely the solution offered by [23] when the bases are known. Unlike their work which assumes that ’s are obtained from the (regularized) tucker decomposition of the output tensor , we learn these bases in accordance to the input. As a result our approach not only generalize that work in terms of the input forms, but also it strengthens it by learning the bases.

### 3.2 Selection of tuning parameters

The proposed approach requires the selection of the values , ; and , . For simplicity and practicality, we assume that for each predictor and the response the rank is fixed, i.e., and . As a result, we only need to select parameters. For this purpose, we use the k-fold cross-validation method on a -D grid of parameters and find the tuple of parameters that minimizes the mean squared error. As a result, we should define a grid over the rank values. This is achieved as following: First, we unfold each tensor and along their first mode. Next, we find the rank of each unfolded matrix, denoted as and . Then, for each and , we select the tuning parameters from and . Next, for each tuple , we calculate the average sum square error (RSS) and finally take the one that minimizes the RSS. For all the studies in the next sections we perform 5-fold CV.

## 4 Performance Evaluation Using Simulation

This section contains two parts. In the first part, we only consider curve-on-curve regression and compare our proposed method to function-on-function regression approach proposed by [9], designated as sigComp. The reason that we compare our approach to sigComp is that it can handle multiple functional inputs (curves) and can learn the basis functions similar to our approach. In the second part, we conduct a set of simulation studies to evaluate the performance of the proposed method when the inputs or outputs are in the form of images or structured point clouds, or curves with jumps. In this part, we compare the proposed method with two benchmarks: 1) The first benchmark is the TOT approach proposed by [24]. Because this approach can only handle single input tensor, when multiple inputs exist, we perform some transformations to merge the inputs into one single tensor. 2) The second benchmark is based upon the principle component regression (PCR) similar to a benchmark considered in [8]. In this approach, we first matricize all the input and output tensors, and then perform principle component analysis to reduce the dimension of the problem by computing the PCA score of the first few principle components that explain at least percent of the variation in the data. Next, we perform the linear regression between the low dimensional PC–score of both inputs and output. More formally, let and denote the mode–1 matricization of the inputs and output, and be a concatenation of all the input matrices. We first compute the first and principle components of the and the response . Next, the PC scores of the input is calculated (a matrix in ) and is used to predict the matrix of scores of the response function (a matrix in ). Then, given the PC scores of the new inputs, we use the fitted regression model to predict the response scores. Finally, we multiply the predicted response scores by the principle components to obtain the original responses. The number of principle components and can be identified through a cross–validation procedure. In this paper instead of cross-validating over and directly, we perform CV over the percentage of variation the PCs explain, i.e., . For this purpose, we take the value of from and take the that minimizes the CV error. The standardized mean square prediction error (SMSPE) is used as a performance measure to compare the proposed method with the benchmarks. The SMSPE is defined as

### 4.1 Simulation studies for curve on curve regression

In this simulation, we consider multiple functional (curve) and multiple scalar predictors similar to the simulation study in [9]. We first randomly generate as follows:

 Missing or unrecognized delimiter for \right

Where and , and are Gaussian processes with covariance function . Next, we generate functional predictors using the following procedure: Let be a matrix with the element be equal to for and equal to one for diagonal elements. Next, we decompose , where is a matrix and generate a set of curves using a Gaussian process with covariance function . Finally, we generate the predictors at any given point as

 (x1(s),⋯,xp(s))=(w1(s),w2(s),⋯,wp(s))ΔT.

With this formulation each curve of is a Gaussian process with covariance function , and for each , the vector is a multivariate normal distribution with covariance . When , this vector becomes and independent vector of normally distributed variables. Figure b illustrates examples of the predictors when for . We also generate the scalar predictors from a multivariate normal distribution with mean vector zero and the covariance matrix with diagonal elements equal to and off-diagonal elements equal to . The coefficients of the the scalar variables denoted by are generated from a Gaussian process with covariance function Finally, we generate the response curves as

 y(t)=5∑i=1αi(t)ui+p∑i=1∫Bi(s,t)xi(s)ds+ϵ(t),

where is generated from a normal distribution with zero mean and . We generate all the input and output curves over and , and take the samples over an equidistant grid of size .

For each combination of , we compare the performance of the proposed method with the methods in [9] based on the mean square prediction error (MSPE) and the mean square estimation error (MSEE). We do not compare our approach to PCR in this simulation because sigComp already demonstrated the superiority over PCR in simulation studies in [9]. We implement the sigComp benchmark method using the R package FRegSigCom in which we use 50 spline bases for both the inputs and output and default convergence tolerance. To calculate the MSPE and MSEE, we first generate a sample data of size that is used to learn the model parameters. Next, we generate a testing sample of size and calculate MSPE as,

 MSPE=1MtestMtest∑j=1(1100100∑i=1(ytestj(ti)−^ytestj(ti))2),

and

 Missing or unrecognized delimiter for \left

We repeat this procedure for 50 times to find the mean and standard deviation of the MSPE and MSEE for each method. Table 1 reports the results at different values of and different number of predictors, . As reported, our proposed approach is superior to the sigComp method in terms of MSPE and MSEE for. For example, when and , the average MSPE and MSEE of the sigcomp are and that are much larger of the corresponding values ( and ) achieved by MTOT. However, the performance of the sigComp may outperfom MTOT for larger values of . For example, when and , the average MSPE over 30 replications is for the proposed method and it is for the benchmark. The reason that the sigComp performs slightly better for larger is that it imposes sparsity when estimating the parameters, reducing the chance of overfitting.

### 4.2 Simulation studies for image/ structured point-cloud or non-smooth output

#### Case III–Curve response with jump simulation: We simulate a response function with jump using a group of B–spline bases. Let ti=iI with i=0,1,⋯,I;I=200, and let B1∈RI×5 and B2∈RI×51 be two matrices of fourth–order B–spline bases obtained by one and 47 knots over [0,1]. We generate a response profile by combining these two bases as follows. yj(ti)=B1(ti)x1j+B2(ti)x2j+e(ti), where B1(ti) and B2(ti) are basis evaluations at the point ti, and e(ti) is a random error simulated from N(0,σ2). The input vector x1j is dense and its elements are generated from a uniform distribution over [0,1]. x2j is a sparse vector with five consecutive elements equal to one and the rest equal to zero. The location of five consecutive elements is selected at random. Figure c illustrates examples of response functions. For this simulation study, we first generate a set of M=500 data points, i.e., {(yj,x1j,x2j)}Mj=1. Then, we randomly divide the data into a set of size 400 for training and a set of size 100 for testing. We perform CV and train the model using the training data set. Next, we calculate the SMSPE for the proposed and the benchmark based on the testing data. We repeat this procedure for 50 times to capture the variance of the SMSPE.

In each case, we compare the proposed method with benchmarks based on the SMSPE calculated at different levels of noise . Tables 2, 3, and 4 report the average and the standard deviation of SMSPE (or its logarithm) along with the average running time of each algorithm for the simulation cases I, II, and III, respectively. In all the cases, the MTOT has the smallest prediction errors, reflecting the advantage of our method in terms of prediction. Furthermore, with the increase in , all methods illustrate a larger SMSPE in all cases. In the first case, the TOT illustrates a prediction performance comparable to our method at a cost of a much longer running time. For example, when the TOT requires about 154.98 seconds to reach the SMSPE of 0.0046, obtained in 1.05 seconds by MTOT. The performance of both PCR and the TOT is significantly worse than the MTOT in the second case. The inferior performance of the TOT is because of both its restriction on selecting the same rank for both the input and output and the fact that the CP decomposition it uses does not consider the correlation between multiple modes.

In the third case, the prediction performance of all three methods are comparable, indicating all three are capable in predicting a functional output with discontinuity. However, our approach shows slightly smaller prediction errors. Although, the running time of the PCR is significantly lower than the other two approaches, The MTOT running time is reasonable and within two-tenth of a second. The TOT shows slightly larger prediction error in all cases with much longer running time, making this approach less appealing. Recall that in this simulation, we used B-spline bases as the coefficients of the input to generate the output curve. Figure a illustrates the plot of the columns of the learned coefficient matrix that corresponds to . As can be seen the learned bases are vary similar to B-spline bases used originally as the coefficients. Figure b illustrates some of the columns of the learned parameter that corresponds to . Unlike the first set of parameters, these parameters are slightly different from the B-spline bases that originally used for data generation purpose. This is due to the identifiability issue. Our approach imposes orthogonality restriction that may generate a set of parameters (when the identifiability issue exists) different from those the data is originally generated from, but can still produce accurate predictions in terms of the mean value.