Abstract
We analyze the problem of regression when both input covariates and output responses are functions from a nonparametric function class. Function-to-function regression (FFR) covers a large range of interesting applications, including time-series prediction problems, as well as more general tasks like studying a mapping between two separate types of distributions. However, previous nonparametric estimators for FFR-type problems scale badly computationally with the number of input/output pairs in a dataset. Given the complexity of a mapping between general functions, it may be necessary to consider large datasets in order to achieve a low estimation risk. To address this issue, we develop a novel scalable nonparametric estimator, the Triple-Basis Estimator (3BE), which is capable of operating over datasets with many instances. To the best of our knowledge, the 3BE is the first nonparametric FFR estimator that can scale to massive datasets. We analyze the 3BE's risk and derive an upper bound on its rate. Furthermore, we show an improvement of several orders of magnitude in terms of prediction speed and a reduction in error over previous estimators on various real-world datasets.
Fast Function to Function Regression
Junier Oliva Willie Neiswanger
Barnabas Poczos Eric Xing Jeff Schneider
Machine Learning Department
Carnegie Mellon University
1 Introduction
Modern datasets are growing not only in the number of instances but also in the complexity and dimensionality of the instances themselves. The goal of this paper is to perform regression with datasets that are massive not only in terms of the number of instances but also in terms of the complexity of instances; specifically, we consider functional data. We study function-to-function regression (FFR), where one aims to learn a mapping $F$ that takes in an input functional covariate $f$ and outputs a functional response $g = F(f)$. In general, functions are infinite-dimensional objects; hence, the problem of FFR is not immediately solvable by traditional regression methods on finite vectors. Furthermore, unlike with typical regression problems, neither the covariate nor the response will be directly observed (since it is infeasible to directly observe functions). Previous nonparametric estimators for FFR do not scale computationally to large datasets. However, large datasets are often needed to achieve a low risk; to mitigate this issue we introduce the Triple-Basis Estimator (3BE).
The FFR framework is quite general and includes many interesting problems. For instance, one may consider input/output functions that are probability density functions (pdfs). An example of a financial FFR problem with density functions is learning the mapping that takes in the pdf of stock prices in a specific industry and outputs the pdf of stock prices in another industry. Additionally, in cosmology one may be interested in regressing a mapping that takes in the pdf of simulated particles from a computationally inexpensive but inaccurate simulation and outputs the corresponding pdf of particles from a computationally expensive but accurate simulation. In essence, one would be enhancing the inaccurate simulation using previously seen data from accurate simulations. There are also many non-distributional FFR problems. For example, one may view foreground/background segmentation as an FFR problem that maps an image's function $f$ to a segmentation's function $g$, where $f$ takes in a pixel's position and outputs the corresponding pixel's intensity, and $g$ takes in a pixel's position and outputs $1$ if the pixel is in the foreground and $0$ otherwise.
Moreover, several time-series tasks may be posed in the FFR framework (see Figure 1). Suppose, for example, that one is interested in predicting the next unit interval of a time-series given the previous unit interval; then, one may frame this as an FFR problem by letting the input function be the function representing the time-series during the first unit interval and the output function be the function representing the time-series during the next unit interval (Figure 1(a)). A related problem is that of predicting co-occurring functions (Figure 1(b)). An interesting application of predicting co-occurring functions arises with motion capture data, where one may be interested in predicting the movement of joints that are occluded given the movement of observed joints.
As stated previously, the problem of FFR boils down to the study of a mapping between infinite dimensional objects. Thus, the regression task would benefit greatly from learning on datasets with a large number of input/output pairs. However, many nonparametric estimators for regression problems do not scale well in the number of instances in a dataset. Thus, if the number of instances is in the many thousands, millions, or even more, then it will be infeasible to use such an estimator. This leads to a paradox: one wants many instances in a dataset in order to effectively learn the FFR mapping, but one also wants a low number of instances in order to avoid a high computational cost. We resolve this issue through the 3BE, which we will show can perform FFR in a scalable manner.
The datasets we consider are as follows. Since general functions are infinite-dimensional, we cannot work over a dataset of the function pairs themselves, $\{(f_i, g_i)\}_{i=1}^N$. Instead we shall work with a dataset of $N$ instances that are (inexact) observation pairs from input/output functions, $\mathcal{D} = \{(\mathcal{D}_{f_i}, \mathcal{D}_{g_i})\}_{i=1}^N$, where $\mathcal{D}_{f_i}$ and $\mathcal{D}_{g_i}$ are some form of empirical observations of $f_i$ and $g_i$ (see Figure 2). For example, one may consider the functional observations to be a set of noisy function evaluations at uniformly distributed points, or a sample of points drawn from $f_i$ and $g_i$ respectively (when $f_i, g_i$ are distributions). Using $\mathcal{D}$ we will make an estimate $\hat{F}$ of $F$, where functional estimates $\tilde{f}_i, \tilde{g}_i$ are created using $\mathcal{D}_{f_i}, \mathcal{D}_{g_i}$ respectively. The task then is to estimate $g = F(f)$ as $\hat{F}(\tilde{f})$ when given a functional observation, $\mathcal{D}_f$, of an unseen input function $f$.
Our approach will be as follows. First, we convert the infinite-dimensional task of estimating the output function $g$ into a finite-dimensional problem by projecting $g$ onto a finite number of basis functions (focusing on the crucial characteristics of $g$, roughly speaking). Then, to estimate the projections onto the basis functions, we embed the input functions into a nonlinear space where linear operations are approximately evaluations of a nonlinear mapping in a broad function class. Finally, $F$ is estimated by empirically minimizing the risk of a linear operation in the nonlinear embedding of input functions for predicting the basis projections of output functions in a dataset.
Our Contribution
We develop the Triple-Basis Estimator (3BE), a novel nonparametric estimator for FFR that scales to large datasets. The 3BE is the first estimator of its kind, allowing one to regress functional responses given functional covariates in massive datasets. Furthermore, we analyze the risk of the 3BE under nonparametric assumptions. Lastly, we show an improvement of several orders of magnitude over existing estimators in terms of prediction time, as well as a reduction in error, on various real-world datasets.
2 Related Work
A previous nonparametric FFR estimator was proposed in [4], which attempts to perform FFR on a functional RKHS. That is, considering a Hilbert space of operators mapping input functions to output functions, $F$ is estimated by a regularized empirical risk minimizer over this space. However, when each function is observed through noisy function evaluations, this estimator requires the inversion of a large matrix whose size grows with both the number of instances and the number of function evaluations, which will be computationally infeasible for datasets of even a modest size.
In addition, [8] provides an estimator for performing FFR in the special case where both input and output functions are probability density functions. The estimator, henceforth referred to as the linear smoother estimator (LSE), works as follows when given a training dataset of empirical functional observations $\{(\mathcal{D}_{f_i}, \mathcal{D}_{g_i})\}_{i=1}^N$ with corresponding function estimates $\{(\tilde{f}_i, \tilde{g}_i)\}_{i=1}^N$, and a function estimate $\tilde{f}$ of a new query input function:
(1)  $\hat{G}(\tilde{f}) = \sum_{i=1}^{N} W_i(\tilde{f})\, \tilde{g}_i,$
(2)  $W_i(\tilde{f}) = \dfrac{K\!\left(D(\tilde{f}, \tilde{f}_i)/h\right)}{\sum_{j=1}^{N} K\!\left(D(\tilde{f}, \tilde{f}_j)/h\right)}.$
Here $K$ is taken to be a symmetric kernel with bounded support, $h > 0$ is a bandwidth, and $D$ is some metric over functions. However, while such an estimator is useful for smaller FFR problems, it may not be used on larger datasets. Clearly, the LSE must perform a kernel evaluation with all $N$ input distributions in one's dataset to produce a prediction; this leads to a per-prediction computational cost that grows linearly in $N$, on top of the cost of each metric evaluation between estimated functions. This implies, for example, that obtaining estimates for each training instance scales quadratically in $N$, which will be prohibitive for big datasets.
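To make the LSE's per-prediction cost concrete, here is a minimal sketch in Python. All names are illustrative: functions are represented by vectors of basis-projection coefficients (so, by Parseval, $L_2$ distances between functions become Euclidean distances between coefficient vectors), and a Gaussian kernel stands in for the bounded-support kernel used in the paper.

```python
import numpy as np

def lse_predict(query_coeffs, train_in_coeffs, train_out_coeffs, h):
    """Linear smoother estimate: a kernel-weighted average of training
    output functions. Functions are represented by vectors of basis
    projection coefficients, so the L2 distance between functions is the
    Euclidean distance between coefficient vectors."""
    # The bottleneck: O(N) distance/kernel evaluations per prediction.
    dists = np.linalg.norm(train_in_coeffs - query_coeffs, axis=1)
    w = np.exp(-0.5 * (dists / h) ** 2)   # Gaussian kernel (illustrative)
    if w.sum() == 0.0:                    # degenerate case: empty neighborhood
        return train_out_coeffs.mean(axis=0)
    return (w / w.sum()) @ train_out_coeffs
```

Note that every call touches all $N$ training inputs, which is exactly the scaling bottleneck discussed above.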
Previous work on nonparametric estimation in large datasets with functional inputs includes [7], where an estimator was proposed for scalable learning in a regression problem with distribution input covariates and real-valued output responses. We note, however, that it is not immediately clear how to achieve a scalable estimator for regressing functional responses on functional covariates, nor how to analyze such an estimator's risk, since general functional responses are infinite-dimensional.
3 Model
We expound upon our model of input/output functions and the mapping between them. Later, we introduce the 3BE and its risk for the case when one has a dataset of pairs of input/output functional observations that are a set of noisy function evaluations at uniformly distributed points. However, the following is generalizable for the case where one observes function evaluations at a fixed grid of points or function observations of samples from distributions. In short, we assume smooth input/output functions that are well approximated by a finite number of basis functions. Further, we consider a nonparametric mapping between them, where the projection of the output function onto each basis function may be written as an infinite linear combination of RBF kernel evaluations between the input function and unknown functions (see below).
We take our dataset to be input/output empirical function observation pairs:
(3)  $\mathcal{D}_{f_i} = \{(x_{ij}, y_{ij})\}_{j=1}^{n}, \quad y_{ij} = f_i(x_{ij}) + \epsilon_{ij},$
(4)  $\mathcal{D}_{g_i} = \{(u_{ij}, v_{ij})\}_{j=1}^{n}, \quad v_{ij} = g_i(u_{ij}) + \xi_{ij},$
with sample points $x_{ij}, u_{ij}$ drawn uniformly at random from the domain, and noise $\epsilon_{ij}, \xi_{ij}$ drawn i.i.d. from error distributions with mean zero and finite variance. Furthermore, $f_i \in \mathcal{I}$, $g_i \in \mathcal{O}$, $f_i \stackrel{iid}{\sim} \Xi$, and $g_i = F(f_i)$,^{1} where $\mathcal{I}$ and $\mathcal{O}$ are some classes of input/output functions and $\Xi$ is some measure over $\mathcal{I}$. Furthermore, we shall assume that input and output functions are bounded in norm. We shall use $\mathcal{D} = \{(\mathcal{D}_{f_i}, \mathcal{D}_{g_i})\}_{i=1}^N$ to make estimates $\tilde{f}_i, \tilde{g}_i$ of the true input/output functions, which will then be used to estimate the output function corresponding to an unseen input function $f$.

^{1} Similarly, one may consider a model $g_i = F(f_i) + W_i$, where $W_i$ is a standard Wiener process. This, however, would be akin to adding variance to our noisy function evaluations; hence we omit it for simplicity.
3.1 Basis Functions and Projections
Let $\{\varphi_k\}_{k \in \mathbb{Z}}$ be an orthonormal basis for $L_2([0,1])$. Then the tensor product of $\{\varphi_k\}$ serves as an orthonormal basis for $L_2([0,1]^d)$; that is, the following is an orthonormal basis for $L_2([0,1]^d)$: $\{\varphi_\alpha\}_{\alpha \in \mathbb{Z}^d}$, where $\varphi_\alpha(x) = \prod_{i=1}^{d} \varphi_{\alpha_i}(x_i)$.
So we have that each $f \in L_2([0,1]^d)$ may be written as $f = \sum_{\alpha \in \mathbb{Z}^d} \langle f, \varphi_\alpha \rangle\, \varphi_\alpha$. Let $a_\alpha(f) \equiv \langle f, \varphi_\alpha \rangle$; then
(5)  $a_\alpha(f) = \int_{[0,1]^d} f(x)\, \varphi_\alpha(x)\, dx,$
(6)  $f(x) = \sum_{\alpha \in \mathbb{Z}^d} a_\alpha(f)\, \varphi_\alpha(x).$
As previously mentioned, a dataset of estimated input/output function pairs, $\{(\tilde{f}_i, \tilde{g}_i)\}_{i=1}^N$, will be constructed from the dataset of input/output function evaluation sets $\{(\mathcal{D}_{f_i}, \mathcal{D}_{g_i})\}_{i=1}^N$. Suppose function $f_i$ has a corresponding set of evaluations $\mathcal{D}_{f_i} = \{(x_{ij}, y_{ij})\}_{j=1}^{n}$, where $x_{ij} \sim \mathrm{Unif}([0,1]^d)$ and $y_{ij} = f_i(x_{ij}) + \epsilon_{ij}$. Then $\tilde{f}_i$, the estimate of $f_i$, will be as follows:
(7)  $\tilde{a}_\alpha(f_i) = \frac{1}{n} \sum_{j=1}^{n} y_{ij}\, \varphi_\alpha(x_{ij}),$
(8)  $\tilde{f}_i = \sum_{\alpha \in M} \tilde{a}_\alpha(f_i)\, \varphi_\alpha,$
where $M$ is a finite set of indices for basis functions.
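The empirical projections (7)–(8) are straightforward to compute; below is a small illustrative sketch using the cosine basis on $[0,1]$ (the basis choice and function names are our own, not prescribed by the paper):

```python
import numpy as np

def cosine_basis(k, x):
    """Orthonormal cosine basis on [0, 1]: phi_0 = 1, phi_k = sqrt(2) cos(pi k x)."""
    return np.ones_like(x) if k == 0 else np.sqrt(2.0) * np.cos(np.pi * k * x)

def estimate_coeffs(x, y, num_basis):
    """Empirical projection coefficients a_k ~= (1/n) sum_j y_j phi_k(x_j):
    a Monte Carlo estimate of <f, phi_k> from noisy evaluations
    y_j = f(x_j) + eps at uniformly distributed sample points x_j."""
    return np.array([np.mean(y * cosine_basis(k, x)) for k in range(num_basis)])
```

Because the sample points are uniform on $[0,1]$, the sample mean of $y_j \varphi_k(x_j)$ is an unbiased estimate of the integral in (5).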
3.1.1 Crossvalidation
In practice, one would choose the indices $M$ in (8) through cross-validation. The number of projection coefficients one chooses will depend on the smoothness of the function as well as on the number of points $n$. Typically, a larger $|\alpha_i|$ will correspond to a higher-frequency one-dimensional basis function $\varphi_{\alpha_i}$; thus, a natural way of selecting $M$ is to consider sets
(9)  $M_t = \left\{ \alpha \in \mathbb{Z}^d : \max_{1 \le i \le d} |\alpha_i| \le t \right\},$
with $t \in \mathbb{N}$. One would then choose the value of $t$ (setting $M = M_t$) that minimizes a loss, such as the mean squared error between $\tilde{f}$'s evaluations and the observed values in $\mathcal{D}_f$. We shall see below that considering $M$ in this manner corresponds to a smoothness assumption on the class of input/output functions.
3.2 Function to Function Mapping
Let $f \in \mathcal{I}$ and $g = F(f) \in \mathcal{O}$, with projection coefficients as in (32); then we have that
(10)  $g = F(f) = \sum_{\alpha \in \mathbb{Z}^d} a_\alpha(F(f))\, \varphi_\alpha$
(11)  $\quad = \sum_{\alpha \in \mathbb{Z}^d} F_\alpha(f)\, \varphi_\alpha,$
where $F_\alpha(f) \equiv a_\alpha(F(f))$. Hence, we may think of $F$ as consisting of countably many functionals $F_\alpha : \mathcal{I} \to \mathbb{R}$, where each $F_\alpha$ is responsible for the mapping of $f$ to the projection of $g = F(f)$ onto $\varphi_\alpha$. We take the functionals $F_\alpha$ to be nonparametric linear smoothers on a possibly infinite set of functions, weighted by a kernel:
(12)  $F_\alpha(f) = \sum_{j=1}^{\infty} \theta_{\alpha j}\, K(f, \xi_{\alpha j}),$
(13)  $K(f, f') = \exp\!\left( -\frac{\|f - f'\|_2^2}{2\sigma^2} \right),$
where the $\xi_{\alpha j}$ are unknown functions and the $\theta_{\alpha j}$ are unknown weights. We shall consider the following class of functionals:
(14)  $\mathcal{F}_C = \Big\{ F_\alpha(\cdot) = \sum_{j=1}^{\infty} \theta_j\, K(\cdot, \xi_j) \;:\; \sum_{j=1}^{\infty} |\theta_j| \le C \Big\}.$
4 Triple-Basis Estimator
If the tail-frequency behavior of output functions is controlled, then we may effectively estimate output functions using a finite number of projection coefficients; thus, we only need to estimate a finite number of the $F_\alpha$ functionals. The 3BE consists of two orthonormal bases for estimating input and output functions respectively, and a random basis to estimate the mapping between them. To efficiently estimate the $F_\alpha$ functionals, we shall use random basis functions from Random Kitchen Sinks (RKS) [10]. We shall show that to approximate $F_\alpha$, we need only estimate a linear mapping in the random RKS features. [10] shows that if one has a shift-invariant kernel (in particular we consider the RBF kernel $K(x, y) = \exp(-\|x - y\|_2^2 / (2\sigma^2))$), then for fixed $x, y \in \mathbb{R}^p$:
(15)  $K(x, y) \approx z(x)^T z(y),$
(16)  $z(x) \equiv \sqrt{\tfrac{2}{D}} \left[ \cos(\omega_1^T x + b_1), \ldots, \cos(\omega_D^T x + b_D) \right]^T,$
where $\omega_j \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^{-2} I)$, $b_j \stackrel{iid}{\sim} \mathrm{Unif}[0, 2\pi]$, and $D$ is the number of random basis functions (see [10] for approximation quality). Let $M_{in}$ and $M_{out}$ be sets of indices for basis functions used to project input and output functions respectively:
(17)  $M_{in} = M_{t_{in}}, \qquad M_{out} = M_{t_{out}},$ with $M_t$ as in (9).
In practice, one would choose $t_{in}$ and $t_{out}$ through cross-validation (see 3.1.1). First note that:
(18)  $\|f - f'\|_2^2 = \sum_{\alpha \in \mathbb{Z}^d} \left( a_\alpha(f) - a_\alpha(f') \right)^2$
(19)  $\quad \approx \sum_{\alpha \in M_{in}} \left( a_\alpha(f) - a_\alpha(f') \right)^2$
(20)  $\quad = \|\bar{a}(f) - \bar{a}(f')\|_2^2,$
where $\bar{a}(f) \equiv (a_\alpha(f))_{\alpha \in M_{in}}$ is the vector of projection coefficients over $M_{in}$. Thus $K(f, f') \approx K(\bar{a}(f), \bar{a}(f'))$, where the norm inside the kernel on the LHS is the $L_2$ norm over functions and the norm on the RHS is the Euclidean norm on $\mathbb{R}^{|M_{in}|}$.
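The random-feature approximation (15)–(16) can be sketched numerically as follows; this is the standard RKS construction for the RBF kernel, with illustrative parameter names:

```python
import numpy as np

def rks_features(X, num_feats, sigma, rng):
    """Random kitchen sink feature map z such that
    z(x)^T z(y) ~= exp(-||x - y||^2 / (2 sigma^2)).
    Frequencies are drawn from N(0, sigma^{-2} I), phases from Unif[0, 2 pi]."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, num_feats))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_feats)
    return np.sqrt(2.0 / num_feats) * np.cos(X @ W + b)
```

Applied to coefficient vectors $\bar{a}(f)$, inner products of these features approximate the kernel between the underlying functions, which is what the next derivation exploits.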
Consider a fixed $\alpha$, and let the random draws $\{(\omega_j, b_j)\}_{j=1}^{D}$ be fixed. Then
(21)  $F_\alpha(f) = \sum_{j=1}^{\infty} \theta_{\alpha j}\, K(f, \xi_{\alpha j})$
(22)  $\quad \approx \sum_{j=1}^{\infty} \theta_{\alpha j}\, K(\bar{a}(f), \bar{a}(\xi_{\alpha j}))$
(23)  $\quad \approx \sum_{j=1}^{\infty} \theta_{\alpha j}\, z(\bar{a}(f))^T z(\bar{a}(\xi_{\alpha j}))$
(24)  $\quad = z(\bar{a}(f))^T \psi_\alpha,$
where $\psi_\alpha \equiv \sum_{j=1}^{\infty} \theta_{\alpha j}\, z(\bar{a}(\xi_{\alpha j}))$. Hence, by (24), $F_\alpha(f)$ is approximately linear in $z(\bar{a}(f))$; so, we consider linear estimators in the nonlinear space induced by $z$. In particular, we take the OLS estimator using the dataset $\{(z(\tilde{a}(f_i)),\, \tilde{a}_\alpha(g_i))\}_{i=1}^N$, and for each $\alpha \in M_{out}$ we estimate $\psi_\alpha$:
(25)  $\hat{\psi}_\alpha = \operatorname*{argmin}_{\psi \in \mathbb{R}^D} \sum_{i=1}^{N} \left( \tilde{a}_\alpha(g_i) - \psi^T z(\tilde{a}(f_i)) \right)^2$
(26)  $\quad = (Z^T Z)^{-1} Z^T \bar{v}_\alpha,$
where $\bar{v}_\alpha \equiv (\tilde{a}_\alpha(g_1), \ldots, \tilde{a}_\alpha(g_N))^T$ and the matrix $Z \equiv [z(\tilde{a}(f_1)), \ldots, z(\tilde{a}(f_N))]^T \in \mathbb{R}^{N \times D}$. Suppose that the set of indices of basis functions we project output functions onto is $M_{out}$ (as in (17)); then the set of functionals we estimate is $\{F_\alpha\}_{\alpha \in M_{out}}$. Let $\Psi \in \mathbb{R}^{|M_{out}| \times D}$ be the matrix whose rows are the $\hat{\psi}_\alpha^T$; then, given $\tilde{f}$, the 3BE estimate of $g = F(f)$ is:
(27)  $\hat{a}_\alpha(g) = \hat{\psi}_\alpha^T\, z(\tilde{a}(f)),$
(28)  $\hat{F}(\tilde{f}) = \sum_{\alpha \in M_{out}} \left[ \Psi\, z(\tilde{a}(f)) \right]_\alpha\, \varphi_\alpha.$
4.1 Evaluation Computational Complexity
We see that after computing $z(\tilde{a}(f))$, evaluating the estimated projection coefficients for a new function amounts to performing a matrix multiplication of a $|M_{out}| \times D$ matrix with a vector. Including the time required for computing $z(\tilde{a}(f))$, the computation required for the evaluation (28) is: 1) the time for evaluating the projection coefficients $\tilde{a}_\alpha(f)$ for $\alpha \in M_{in}$, which is $O(n |M_{in}|)$; 2) the time to compute the RKS features $z(\tilde{a}(f))$, which is $O(D |M_{in}|)$; 3) the time to compute the matrix multiplication $\Psi\, z(\tilde{a}(f))$, which is $O(D |M_{out}|)$. Hence, the total time is $O(n |M_{in}| + D(|M_{in}| + |M_{out}|))$.
We'll see that we may choose $|M_{in}|$, $|M_{out}|$, and $D$ to depend only on $n$, the number of points per functional observation, and not on $N$. If we assume further that $N \gg n$, the total runtime for evaluating (28) is free of $N$. Since we are considering datasets where the number of instances far outnumbers the number of points per sample set, this is a substantial improvement over the LSE; indeed, the LSE requires a metric evaluation with every training-set input function (2), where the 3BE does not. Furthermore, the space complexity is much improved for the 3BE, since we only need to store the $|M_{out}| \times D$ matrix $\Psi$ and the RKS parameters $\{(\omega_j, b_j)\}_{j=1}^{D}$. Contrast this with the space required for the LSE, which grows with $N$ and is much larger in our case of $N \gg n$. Lastly, note that to evaluate $\hat{F}(\tilde{f})$ at a point once one has computed $\hat{a} = \Psi\, z(\tilde{a}(f))$, one only needs to compute $\sum_{\alpha \in M_{out}} \hat{a}_\alpha \varphi_\alpha(\cdot)$.
Ridge Triple-Basis Estimator
We note that a straightforward extension to the 3BE is to use a ridge regression estimate on the RKS features rather than an OLS estimate. That is, for each $\alpha \in M_{out}$ and a penalty parameter $\lambda > 0$, let
(29)  $\hat{\psi}_\alpha = \operatorname*{argmin}_{\psi \in \mathbb{R}^D} \sum_{i=1}^{N} \left( \tilde{a}_\alpha(g_i) - \psi^T z(\tilde{a}(f_i)) \right)^2 + \lambda \|\psi\|_2^2$
(30)  $\quad = (Z^T Z + \lambda I)^{-1} Z^T \bar{v}_\alpha.$
The Ridge-3BE is still evaluated via a matrix multiplication, and our complexity analysis holds.
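A sketch of the ridge variant's closed form (30), under the same illustrative naming as before:

```python
import numpy as np

def fit_3be_ridge(Z, A_out, lam):
    """Ridge variant of the 3BE fit: Psi = (Z^T Z + lam I)^{-1} Z^T A_out.
    Prediction remains a single matrix multiply, so evaluation cost is
    unchanged relative to the OLS variant."""
    D = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ A_out)
```

As $\lambda \to 0$ this recovers the OLS solution on well-conditioned data, while larger $\lambda$ shrinks the coefficients, which can help when $D$ is large relative to $N$.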
4.2 Algorithm
We summarize the basic steps for training the 3BE in practice, given a dataset of empirical functional observations $\{(\mathcal{D}_{f_i}, \mathcal{D}_{g_i})\}_{i=1}^N$, parameters $D$ and $\sigma$ (which may be cross-validated), and an orthonormal basis $\{\varphi_\alpha\}$.

1. Determine the sets of basis functions $M_{in}$ and $M_{out}$ (17) for approximating the $\tilde{f}_i$ and $\tilde{g}_i$ respectively. For each $f_i$ in a subset $S$ of the training instances,* one can select a set $M_{t_i}$ (9) by cross-validating a loss as described in 3.1.1. One may then set $M_{in} = M_{t_{in}}$, where $t_{in} = \max_{i \in S} t_i$. Similarly, one may set $M_{out}$ by cross-validating $M_t$'s for the $\tilde{g}_i$'s. (*Empirically it has been observed that the resulting $M_{in}$ and $M_{out}$ perform well even when $|S|$ is much smaller than $N$.)

2. Draw $\omega_j \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^{-2} I)$ and $b_j \stackrel{iid}{\sim} \mathrm{Unif}[0, 2\pi]$ for $j = 1, \ldots, D$; keep the set $\{(\omega_j, b_j)\}_{j=1}^{D}$ fixed henceforth.

3. Generate the dataset of (random kitchen sink feature, output projection coefficient vector) pairs $\{(z_i, \bar{a}_i)\}_{i=1}^N$, where $z_i \equiv z(\tilde{a}(f_i))$ and $\bar{a}_i \equiv (\tilde{a}_\alpha(g_i))_{\alpha \in M_{out}}$. Let $\Psi^T \equiv (Z^T Z)^{-1} Z^T A$, where $Z \equiv [z_1, \ldots, z_N]^T$ and $A \equiv [\bar{a}_1, \ldots, \bar{a}_N]^T$. Note that $Z^T Z$ and $Z^T A$ can be computed efficiently using parallelism.

4. For all future query input functional observations $\mathcal{D}_f$, estimate the projection coefficients of the corresponding output function as $\hat{a} = \Psi\, z(\tilde{a}(f))$.
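The steps above can be exercised end-to-end on synthetic data. The following toy run uses our own simplified setup: "functions" stand in as finite coefficient vectors, the true map from input to output coefficients is a smooth nonlinear function, and we train the RKS + OLS stage and compare test error against a mean-prediction baseline.

```python
import numpy as np

# Toy end-to-end run of the RKS + OLS stage of the 3BE (simplified setup).
rng = np.random.default_rng(0)
d_in, d_out, D, N = 5, 3, 500, 2000
sigma = 1.0                                        # illustrative bandwidth

A_in = rng.uniform(-1.0, 1.0, size=(N, d_in))      # input projection coeffs
M_map = rng.normal(size=(d_in, d_out))
A_out = np.sin(A_in @ M_map)                       # smooth nonlinear mapping

W = rng.normal(scale=1.0 / sigma, size=(d_in, D))  # fixed RKS draws
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def z(A):
    return np.sqrt(2.0 / D) * np.cos(A @ W + b)    # RKS feature map

tr, te = slice(0, 1500), slice(1500, N)
Psi, *_ = np.linalg.lstsq(z(A_in[tr]), A_out[tr], rcond=None)
pred = z(A_in[te]) @ Psi                           # prediction: one matmul
mse = np.mean((pred - A_out[te]) ** 2)
baseline = np.mean((A_out[tr].mean(axis=0) - A_out[te]) ** 2)
```

Despite the mapping being nonlinear in the input coefficients, a linear fit in the random nonlinear features captures it well, which is the core idea of the estimator.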
5 Theory
We analyze the risk of the 3BE below. We assume that input/output functions belong to Sobolev ellipsoid function classes and that the mapping between input and output functions lies in the class (14).
5.1 Assumptions
5.1.1 Sobolev Ellipsoid Function Classes
We shall make a Sobolev ellipsoid assumption for the classes $\mathcal{I}$ and $\mathcal{O}$. Suppose that the projection coefficients $a_\alpha(f)$ and $a_\alpha(g)$ satisfy, for all $f \in \mathcal{I}$ and $g \in \mathcal{O}$:
(31)  $\sum_{\alpha \in \mathbb{Z}^d} c_\alpha^2\, a_\alpha(f)^2 \le Q_{\mathcal{I}},$
(32)  $\sum_{\alpha \in \mathbb{Z}^d} c_\alpha'^{\,2}\, a_\alpha(g)^2 \le Q_{\mathcal{O}},$
where $Q_{\mathcal{I}}, Q_{\mathcal{O}} < \infty$ and the weights $c_\alpha, c_\alpha'$ grow with the frequency of $\varphi_\alpha$; for smoothness parameters $\gamma, \gamma' > 0$,
(33)  $c_\alpha^2 = \sum_{i=1}^{d} |\alpha_i|^{2\gamma},$
(34)  $c_\alpha'^{\,2} = \sum_{i=1}^{d} |\alpha_i|^{2\gamma'}.$
See [3, 5] for other work using similar Sobolev ellipsoid assumptions. The assumption in (32) controls the tail behavior of projection coefficients and allows one to effectively estimate $f$ and $g$ using a finite number of projection coefficients on the empirical functional observation.
Suppose as before that function $f$ has a corresponding set of evaluations $\mathcal{D}_f = \{(x_j, y_j)\}_{j=1}^{n}$, where $x_j \sim \mathrm{Unif}([0,1]^d)$ and $y_j = f(x_j) + \epsilon_j$. Then $\tilde{f}$, the estimate of $f$, is:
(35)  $\tilde{a}_\alpha(f) = \frac{1}{n} \sum_{j=1}^{n} y_j\, \varphi_\alpha(x_j),$
(36)  $\tilde{f} = \sum_{\alpha \in M_t} \tilde{a}_\alpha(f)\, \varphi_\alpha.$
Choosing $t$ optimally^{†} can be shown to make the estimation error $\mathbb{E}\|\tilde{f} - f\|_2^2$ vanish as $n \to \infty$, at a rate governed by the smoothness parameter. Thus, we can represent $f$ using a finite number of projection coefficients $|M_t|$; this allows one to approximate the FFR problem as a regression problem over the finite vectors $\tilde{a}(f)$ and $\tilde{a}(g)$. Note that our choice of the sets (9) in 3.1.1 corresponds to the estimator in (36). Varying $t$ in this case will still be adaptive to the smoothness of $f$ and to the number of points $n$. (^{†}See appendix for details.)
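The finite-coefficient representation this section relies on can be illustrated numerically: for a smooth function, the $L_2$ error of truncating the cosine expansion decays rapidly with the number of retained coefficients (the test function and basis below are our own choices, not from the paper):

```python
import numpy as np

# For a smooth function on [0, 1], the cosine-expansion tail is small:
# truncating to the first t coefficients leaves little L2 energy behind,
# which is the behavior the Sobolev ellipsoid assumption encodes.
x = np.linspace(0.0, 1.0, 20001)
dx = x[1] - x[0]
f = np.exp(-((x - 0.4) ** 2) / 0.02)      # smooth test function

def trap(y):
    """Trapezoid rule on the uniform grid x."""
    return (y.sum() - 0.5 * (y[0] + y[-1])) * dx

def coeff(k):
    phi = np.ones_like(x) if k == 0 else np.sqrt(2.0) * np.cos(np.pi * k * x)
    return trap(f * phi)                   # a_k = <f, phi_k>

coeffs = np.array([coeff(k) for k in range(60)])
total = trap(f * f)                        # ||f||_2^2 (Parseval target)
trunc_err = lambda t: total - np.sum(coeffs[:t] ** 2)  # L2 truncation error
```

By Parseval, the truncation error equals the energy in the discarded coefficients, so its rapid decay is exactly the tail control assumed above.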
5.1.2 Function to Function Mapping
Recall that we take output functions to be
$g = F(f) = \sum_{\alpha} F_\alpha(f)\, \varphi_\alpha,$
where $F_\alpha(f) = \sum_{j=1}^{\infty} \theta_{\alpha j}\, K(f, \xi_{\alpha j})$ as in (12). And our assumption on the class of mappings is that each $F_\alpha \in \mathcal{F}_{C_\alpha}$, as in (14). Suppose further that:
(37)  $\sum_{\alpha} c_\alpha'^{\,2}\, C_\alpha^2 < \infty.$
Hence, if $f \in \mathcal{I}$ then $F(f) \in \mathcal{O}$, since $|F_\alpha(f)| \le C_\alpha$ (as $K \le 1$) and (37) holds.
5.2 Risk Upper Bound
Below we state our main theorem, upper bounding the risk of the 3BE.
Theorem 5.1.
See the appendix for the proof. The rate (39) yields consistency for our estimator so long as one is in the large-dataset regime, where the number of instances grows faster than the number of points in the function observations. Note that the first summand in (39) is similar to typical functional estimation rates, and it stems from our approximation with bases; the second summand is akin to a linear regression rate, and it stems from our OLS estimation (26).
6 Experiments
Below we show the improvement of the 3BE over previous FFR approaches on several real-world datasets. Empirically, the 3BE proves to be the most general, fastest, and most effective estimator. Unlike previous time-series FFR approaches, the 3BE easily lends itself to working over distributions. Moreover, unlike previous nonparametric FFR estimators, the 3BE does not need to compute pairwise kernel evaluations, making it much more scalable. All differences in MSE were statistically significant under paired t-tests.
6.1 Rectifying 2LPT Simulations
Numerical simulations have become an essential tool for studying cosmological structure formation. Astrophysicists use N-body simulations [15] to study the gravitational evolution of collisionless particles such as dark matter particles. Unfortunately, N-body simulations require forces among particles to be recomputed over multiple time intervals, leading to a large number of time steps to complete a single simulation. In order to mitigate the large computational costs of running N-body simulations, simulations based on Second Order Lagrange Perturbation Theory (2LPT) [13] are often used instead.
Although 2LPT simulations are several orders of magnitude faster, they prove to be inaccurate, especially at smaller scales. In this experiment we bridge the gap between the speed of 2LPT simulations and the accuracy of N-body simulations using FFR and the 3BE. Namely, we regress the mapping between the distribution of particles in an area coming from a 2LPT simulation and the distribution of the particles in the same area under an equivalent N-body simulation.
Method   MSE     MPT (s)
3BE      4.958   0.009
LSE      6.816   4.977
2LPT     6.424   NA
AD       9.289   NA

Table: MSE and mean prediction time (MPT, in seconds) results.
We regress the distribution of 3d (spatial) N-body simulation particles in cubes of space when given the distribution of particles of the 2LPT simulation in the same cube (note that each distribution is estimated through the set of particles in each cube). A training set of over 900K pairs of 2LPT cube sample-set/N-body cube sample-set instances was used, along with a test set of 5K pairs. The numbers of projection coefficients used to represent input and output distributions were 365 and 401 respectively, chosen by cross-validating the density estimates. We chose the number of RKS features to be 15K based on rules of thumb. We cross-validated the bandwidth and penalty parameters of the ridge-variant 3BE (30) and the smoothing parameter of the LSE, and report the MSE and mean prediction time (MPT, in seconds) of our FFR estimates relative to the distributions coming directly from the N-body simulation (Table 6.1); we also report the MSE of predicting the average output distribution (AD).
We see that the 3BE is over 500 times faster than the LSE in terms of prediction time, and that it achieved a reduction in MSE of over 20% relative to using the distribution coming directly from the 2LPT simulation (2LPT). Note also that the LSE does not achieve an improvement in MSE over 2LPT.
6.2 Timeseries Data
We compared the performance of the 3BE on time-series prediction problems to the LSE and to widely used time-series prediction methods: Dynamics Mining with Missing values (DynaMMo) [6] and Kernel Embedded HMMs (SHMM) [14]. DynaMMo is a latent-variable probabilistic model trained with EM, aimed at predicting data that is missing in chunks and not just in a single time step (as we also attempt with our functional responses).
6.2.1 Forward Prediction with Music Data
Music data presents a particularly interesting application of forward prediction for time-series. That is, given a short segment of audio data from a piece of music, can we predict the audio data in the short segment that follows? Uses for forward prediction with music include compression and music similarity.
In this experiment, we use a 30-second clip, sampled at 44.1 kHz, from the song "I Turn To You" by the artist Melanie C. We extract a mono signal of the sound clip and use an initial portion for training and holdout, and the final portion for testing. To perform forward prediction in the test set, we take a fixed-length segment of the (true) music time-series as input and use it to predict the segment that follows. We repeat this sequentially over consecutive disjoint segments in the test set until we have made predictions for the entire test set. In total, our dataset consisted of a large number of training instances. For audio prediction with the 3BE we use the ridge variant (30). We use 150 trigonometric basis functions for both input and output functions, and 5000 RKS basis functions (both quantities chosen via rules of thumb). We then cross-validate the bandwidth and penalty parameters.
Method    MSE
3BE       0.0327
LSE       0.0351
DynaMMo   0.0492
SHMM      0.1082

Table: Audio MSE.
We cross-validated the number of hidden-state dimensions for DynaMMo, and the bandwidth parameter for the LSE. The mean squared error (MSE) on the test set is reported in Table 6.2.1 for each method. The 3BE achieves the lowest estimation error. Furthermore, looking at Figure 4, it is apparent that the 3BE outperforms the other methods in terms of capturing the structure of the audio data. The quality of the audio predicted with the 3BE is also superior to the other methods (hear predicted sound clips in the supplemental materials). Furthermore, DynaMMo takes over 4 hours to learn a model given a fixed hidden-state dimensionality with no missing data (and even longer if also predicting missing data), whereas the 3BE takes only about 2 minutes to cross-validate and perform predictions (a speedup of over 100x). Similarly, the 3BE was substantially faster than SHMM for predictions. Additionally, even though this dataset is of a smaller scale, the 3BE still enjoys a speedup over the LSE in prediction time.
6.2.2 Co-occurring Predictions with Joint Motion Capture Data
Next, we explore predicting co-occurring time-series with motion capture (MoCap) data. We use the MSRC-12 dataset [2]. The 3d positions are provided for 20 total joints. We look to predict the time-series of the position of an unobserved joint over a segment of timesteps, given time-series data (one function for each observed joint's x, y, or z position) for the observed joints over the same segment.
We performed co-occurring time-series prediction with MoCap data of a subject performing the gesture "duck" (Figure 5). We randomly chose 10 joints to designate as occluded, and used the other 10 as our non-occluded joints. We then solved 30 separate FFR problems, where each problem had one of the missing joints' coordinate time-series as the output response function (e.g., missing joint 1's y position or missing joint 4's x position). In each of the problems, the 30 functions corresponding to the time-series for non-occluded joint spatial positions were used as inputs (by concatenating the projection coefficients of each input function). We considered segments of 24 timesteps for time-series functions. In total we used a training set of about 1100 instances. The number of projection coefficients per function was taken to be 10, while the number of RKS features was 250. The same parameters as before were cross-validated for all estimators.
Method    MSE
3BE       7.78E-4
LSE       1.3E-3
DynaMMo   2.40E-4

Table: MoCap MSE.
DynaMMo performs the best (Table 6.2.2), which is perhaps not surprising given that MoCap occlusion prediction was a point of emphasis for DynaMMo. However, the differences in prediction quality among the different methods are not as pronounced on this dataset (Figure 6). We again see a speedup of over 1000x using the 3BE over DynaMMo, and there was also a substantial speedup in prediction time over the LSE.
7 Conclusion
In conclusion, this paper presents a new estimator, the Triple-Basis Estimator (3BE), for performing function-to-function regression in a scalable manner. Since functional data is complex, it is important to have an estimator that is capable of using massive datasets in order to achieve a low estimation risk. To the best of our knowledge, the 3BE is the first nonparametric FFR estimator that is capable of scaling to big datasets. The 3BE achieves this through the use of basis representations of input and output functions and random kitchen sink basis functions. We analyzed the risk of the 3BE under nonparametric assumptions. Furthermore, we showed an improvement of several orders of magnitude in prediction speed and a reduction in error over previous estimators on various real-world datasets.
References
 [1] F. Ferraty and P. Vieu. Nonparametric functional data analysis: theory and practice. Springer, 2006.
 [2] Simon Fothergill, Helena Mentis, Pushmeet Kohli, and Sebastian Nowozin. Instructing people for training gestural interactive systems. In Proceedings of the 2012 ACM annual conference on Human Factors in Computing Systems, pages 1737–1746. ACM, 2012.
 [3] Y. Ingster and N. Stepanova. Estimation and detection of functions from anisotropic sobolev classes. Electronic Journal of Statistics, 5:484–506, 2011.
 [4] Hachem Kadri, Emmanuel Duflos, Philippe Preux, Stéphane Canu, Manuel Davy, et al. Nonlinear functional regression: a functional rkhs approach. In JMLR Workshop and Conference Proceedings, volume 9, pages 374–380, 2010.
 [5] B. Laurent. Efficient estimation of integral functionals of a density. The Annals of Statistics, 24(2):659–681, 1996.
 [6] Lei Li, James McCann, Nancy S. Pollard, and Christos Faloutsos. DynaMMo: mining and summarization of co-evolving sequences with missing values. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 507–516. ACM, 2009.
 [7] Junier B Oliva, Willie Neiswanger, Barnabas Poczos, Jeff Schneider, and Eric Xing. Fast distribution to real regression. AISTATS, 2014.
 [8] Junier B Oliva, Barnabás Póczos, and Jeff Schneider. Distribution to distribution regression. ICML, 2013.
 [9] Junier B. Oliva, Barnabas Poczos, Timothy Verstynen, Aarti Singh, Jeff Schneider, Fang-Cheng Yeh, and Wen-Yih Tseng. FuSSO: functional shrinkage and selection operator. AISTATS, 2014.
 [10] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2007.
 [11] James O Ramsay and B.W. Silverman. Functional data analysis. Wiley Online Library, 2006.
 [12] J.O. Ramsay and B.W. Silverman. Applied functional data analysis: methods and case studies, volume 77. Springer New York, 2002.
 [13] Roman Scoccimarro. Transients from initial conditions: a perturbative analysis. Monthly Notices of the Royal Astronomical Society, 299(4):1097–1118, 1998.
 [14] Le Song, Byron Boots, Sajid M. Siddiqi, Geoffrey J. Gordon, and Alex J. Smola. Hilbert space embeddings of hidden Markov models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 991–998, 2010.
 [15] Hy Trac and Ue-Li Pen. Out-of-core hydrodynamic simulations for cosmological applications. New Astronomy, 11(4):273–286, 2006.