Simultaneous Parameter Learning and BiClustering
for MultiResponse Models
Abstract
We consider multiresponse and multitask regression models, where the parameter matrix to be estimated is expected to have an unknown grouping structure. The groupings can be along tasks, or features, or both, the last one indicating a bicluster or “checkerboard” structure. Discovering this grouping structure along with parameter inference makes sense in several applications, such as multiresponse GenomeWide Association Studies. This additional structure can not only can be leveraged for more accurate parameter estimation, but it also provides valuable information on the underlying data mechanisms (e.g. relationships among genotypes and phenotypes in GWAS). In this paper, we propose two formulations to simultaneously learn the parameter matrix and its group structures, based on convex regularization penalties. We present optimization approaches to solve the resulting problems and provide numerical convergence guarantees. Our approaches are validated on extensive simulations and real datasets concerning phenotypes and genotypes of plant varieties.
Simultaneous Parameter Learning and BiClustering
for MultiResponse Models
Ming Yu Booth School of Business The University of Chicago Chicago, IL ming93@uchicago.edu Karthikeyan Natesan Ramamurthy IBM Research Yorktown Heights, NY knatesa@us.ibm.com Addie Thompson Dept. of Plant, Soil and Microbial Sciences Michigan State University East Lansing, MI thom1718@msu.edu Aurélie Lozano IBM Research Yorktown Heights, NY aclozano@us.ibm.com
1 Introduction
We consider multiresponse and multitask regression models, which generalize singleresponse regression to learn predictive relationships between multiple input and multiple output variables, also referred to as tasks [2]. The parameters to be estimated form a matrix instead of a vector. In many applications, there exist grouping structures among input variables and output variables, and the model parameters belonging to the same inputoutput group tend to be close to each other. A motivating example is that of multiresponse GenomeWide Association Studies (GWAS) [25], where for instance a group of Single Nucleotide Polymorphisms or SNPs (input variables or features) might influence a group of phenotypes (output variables or tasks) in a similar way, while having little or no effect on another group of phenotypes. It is therefore desirable to uncover and exploit such inputoutput structures in estimating the parameter matrix. See figure 1 for an example.
Contributions:
In this work, we develop formulations that simultaneously learn: (a) the parameters of multiresponse/task regression models, and, (b) the grouping structure in the parameter matrix (row or column or both) that reflects the group relationship between input and output variables. We present optimization approaches to efficiently solve the resulting convex problems, and show their numerical convergence. We describe and justify various options we take and choices we make in the optimization algorithms. Our proposed methods are validated empirically on synthetic data and on a realworld datasets concerning phenotypes and genotypes of plant varieties. From the synthetic data experiments, we find that our methods provide a much better and more stable (i.e., lesser standard error) recovery of the underlying group structure, and improved estimates of parameters. In realworld data experiments, our approaches reveal natural groupings of phenotypes and checkerboard patterns of phenotypeSNP groups that inform us of the joint relationship between them. In all experiments, we demonstrate better RMSEs.
We emphasize that the parameters as well as the grouping structures are fully unknown apriori, and inferring them simultaneously is our major contribution. This is in contrast to the naive way of estimating the parameters first and then clustering. This naive approach has the danger of propagating the estimation error into clustering, particularly in high dimensions, where the estimator is usually inaccurate due to lack of sufficient samples. Moreover, the clustering step of the naive approach does not use the full information of the data. The joint estimationclustering procedure we propose naturally promotes sharing of information within groups. Our formulations adopt the convex biclustering cost function [4] as the regularizer to encourage groupings between columns (tasks) and rows (features) in the parameter matrix. Note that [4] assume that the data matrix to be used for biclustering is known apriori, which is obviously not the case for our setting.
Related Work:
Multitask learning attempts to learn several of the inference tasks simultaneously, and the assumption here is that an appropriate sharing of information can benefit all the tasks [3, 22, 33]. The implicit assumption that all tasks are closely related can be excessive as it ignores the underlying specificity of the mappings. There have been several extensions to multitask learning that address this problem. The authors in [13] propose a dirty model for feature sharing among tasks, wherein a linear superposition of two sets of parameters  one that is common to all tasks, and one that is taskspecific  are used. [16] leverages a predefined tree structure among the output tasks (e.g. using hierarchical agglomerative clustering) and imposes group regularizations on the task parameters based on this tree. The approach proposed in [17] learns to share by defining a set of basis task parameters and posing the taskspecific parameters as a sparse linear combination of these. The approaches of [12] and [15] assume that the tasks are clustered into groups and proceed to learn the group structure along with the task parameters using a convex and an integer quadratic program respectively. However, these approaches do not consider joint clustering of the features. In addition, the mixed integer program of [15] is computationally intensive and greatly limits the maximum number of tasks that can be considered. Another pertinent approach is the Network Lasso formulation presented in [7]. This formulation, however, is limited to settings where only clustering among the tasks is needed. A straightforward special case of the proposed approach to column or rowonly clustering (a.k.a. Uniclustering) is presented in [32]. Portions of [32] are reproduced in this paper also for comprehensive coverage.
Roadmap.
In Section 2, we will discuss the proposed joint estimationclustering formulations, and in Section 3, we will present the optimization approaches. The choice of hyperparameters used and their significance is discussed in Section 4. We illustrate the solution path for one of the formulations in Section 5. We will provide results for estimation with synthetic data, and two case studies using multiresponse GWAS with real data in Sections 6 and 7 respectively. We conclude in Section 8. Additional details, convergence proofs, solution paths, and experiments are provided in the supplementary material.
2 Proposed Formulations
We will motivate and propose two distinct formulations for simultaneous parameter learning and clustering with general supervised models involving matrix valued parameters. Our formulations will be developed around multitask regression in this paper.
We let be the design matrices and be the response vectors for each task , in multitask regression. is the parameter or coefficient matrix for the tasks. We wish to simultaneously estimate and discover the bicluster structure among features and tasks, respectively the rows and columns of . Note that discovering groupings just along rows or columns is a special case of this.
2.1 Formulation 1:
We begin with the simplest formulation, which, as we shall see, is a special case of the latter one.
(1) 
Here is the loss function, is a regularizer, and and is the column of . is inspired by the convex biclustering objective [4, eqn. 2.1] and it encourages sparsity in differences between columns of . Similarly, encourages sparsity in the differences between the rows of . When the overall objective is optimized, we can expect to see a checkerboard pattern in the model parameter matrix. Note that and are nonnegative weights that reflect our prior belief on the closeness of the rows and columns of .
The degree of sharing of parameters and hence that of biclustering, is controlled using the tuning parameter . When is small, each element of will be its own bicluster. As increases, more elements of fuse together, the number of rectangles in the checkerboard pattern will reduce. See Figure 2 for the change of the checkerboard structure as increases. Further, by varying we get a solution path instead of just a point estimate of . This will be discussed more in Section 5. In the remainder of the paper, we will use the same design matrix across all tasks for simplicity, without loss of generality.
For sparse multitask linear regression, formulation 1 can be instantiated as,
(2) 
Here the rows of correspond to the features, i.e. the columns of , and the columns of correspond to the tasks, i.e., the columns of . Therefore, the checkerboard pattern in provides us insights on the groups of features that go together with the groups of tasks.
2.2 Formulation 2
Formulation 1 is natural and simple, but it forces the parameters belonging to the same row or column cluster to be equal, and this may be limiting. To relax this requirement, we introduce a surrogate parameter matrix that will be used for biclustering. This will be mandated to be close to . This yields the following objective
(3)  
Remark 1.
To interpret this more carefully, let us assume that in (3). In other words, has a global component , and the component that participates in the clustering. As , , and hence . Now, formulation 2 reduces to formulation 1. Further, if and are held constant while only increases, , since for all . The key difference between formulation 2 and 1 is the presence of a taskspecific global component , which lends additional flexibility in modeling the individual tasks even when . Whereas, in (2), when , for all , and the tasks are forced to share the same coefficients without any flexibility.
Remark 2.
In certain applications, it might make sense to cluster together features / tasks whose effects have the same amplitude but different signs. This can be accommodated by considering where are predefined constants reflecting whether the features or tasks are expected to be negatively or positively correlated.
3 Optimization Approaches
We describe the optimization procedures to solve the two proposed formulations. Note that as long as the loss function and the regularization are convex, our formulations are also convex in and , and hence can be solved using modern convex optimization approaches. Here we adopt two computationally efficient approaches.
3.1 Formulation 1
For our formulation 1 we use the proximal decomposition method introduced in [6]. This is an efficient algorithm for minimizing the sum of several convex functions. Our general objective function (1) involves such functions: being , being and being the term that multiplies . At a high level, the algorithm iteratively applies proximal updates with respect to these functions until convergence.
We stack the regression matrix into a column vector . The proximal operator is given by:
(4) 
where and are dimensional vectors. The proximal operator of the regularized loss can be computed according to the specific and functions. The overall optimization procedure is given in Algorithm 1 and update rules are given in the supplementary material.
3.2 Formulation 2
For our formulation 2 we use an alternating minimization method on and ; i.e., we alternatively minimize over and with the other fixed.
The first step in the alternating minimization is to estimate while fixing . This minimization problem is separable for each column and each subproblem can be easily written as a standard Lasso problem:
(5) 
by defining
(6) 
and hence can be solved efficiently and in parallel for each column.
In the second step, we fix and optimize for . The optimization is
(7) 
which is a standard biclustering problem on and can be solved efficiently using the COnvex BiclusteRing Algorithm (COBRA) introduced in [4], and described in Algorithm 3 in supplementary material for completeness. The overall procedure is given in Algorithm 2.
3.3 Convergence
We establish the following convergence result for our algorithms, when the loss function is convex in . The proofs are given in the supplementary material.
Proposition 1.
The algorithm described in Section 3.1 converges to the global minimizer.
Proposition 2.
The algorithm described in Section 3.2 converges to the global minimizer.
4 Hyperparameter Choices and Variations
We will describe and justify the various choices for hyperparameters while optimizing formulations 1 and 2.
4.1 Weights and Sparsity Regularization
The choice of the column and row similarity weights and can dramatically affect the quality of the clustering results and we follow the suggestion in [4] to set these. However, we need an estimate of to obtain the weights and this can be found by solving
(8) 
where is tuned using crossvalidation (CV) and reused in the rest of the algorithm. From our multitask regression experiments, we find that the clustering results are quite robust to the choice of .
With the estimated we use the approach suggested in [4] to compute and . The weights for the columns and are computed as where is 1 if is among ’s nearestneighbors or vice versa and otherwise. is nonnegative and corresponds to uniform weights. In our synthetic and real data experiments we fix . is computed analogously. It is important to keep the two penalty terms and on the same scale, else the row or column objective will dominate the solution. We require that the column weights sum to and the row weights sum to , following [4]. More rationale on the weight choices is provided in [4] and [5].
4.2 Penalty Multiplier Tuning
We set the penalty multipliers (, , and ) for both the formulations using a CV approach. We randomly split our samples into a training set and a holdout validation set, fitting the models on the training set and then evaluating the rootmeansquared error (RMSE) on the validation set to choose the best values. In order to reduce the computational complexity, we estimate the multipliers greedily, one or two at a time. From our simulations, we determined that this is a reasonable choice. We recognize that these can be tuned further on a casebycase basis with additional computational complexity.
is set to the reasonable value as determined in Section 4.1 for both formulations, since the clustering results are quite robust to this choice. For formulation 1, we estimate the best by CV using (1). For formulation 2, the tuning process is similar, however we pick a sequence of and . We estimate both and , but calculate RMSE with , since it directly participates in the clustering objective. When the path of biclusterings is computed, we fix to the CV estimate and vary only .
4.3 Biclustering Thresholds
It is well known that LASSO tends to select too many variables [19]. Hence may not be exactly zero in most cases, and we may end up identifying too many clusters as well. In [4] the authors defined the measure and placed the and columns in the same group if for some threshold , inspired by [19]. In our formulation 1, after selecting the best tuning parameters and estimating , we place the and rows in the same group if . Similarly, if we place the and columns in the same group. For formulation 2, we repeat the same approach using instead of .
To compute the thresholds and , we first calculate and stack this matrix to vector ; similarly we calculate and stack to vector . In the case of sparse linear regression, should be on the order of the noise [19]: , where is typically estimated using the standard deviation of residuals. In [4] the authors set proportional to the standard deviation of or .
However in our case, we have an additional regression loss term for estimating the parameters and hence there are two sources of randomness, the regression residual and the error in . Taking these into account, we set and . We set the multiplier to , following the conservative suggestion in [4].
4.4 Specializing to Column or Rowonly Clustering (a.k.a. UniClustering)
5 Solution Path
Since we are able to obtain the entire solution path for the coefficients by varying the penalty multipliers, we provide an example of the solution paths for estimated using formulation 1. The dataset is generated using the same procedure described in Section 6.2 except that we set , , and . We use relatively small values for and since there will be a total of solution paths to visualize. We will illustrate this only for formulation 1 since it is simpler. The solution paths for formulation 2 are provided in the supplementary material.
We first fix a reasonable and vary to get solution paths for all the coefficients. In our experiment, we chose based on crossvalidation as described in Section 4.1. These paths are shown in Figure 3. We can see that as increases, the coefficients begin to merge and eventually for large enough they are all equal. The solution paths are smooth in . Also note that the coefficient values are not monotonically increasing or decreasing, similar to the LASSO solution path [26].
Similarly, we fix based on the crossvalidation scheme described in Section 4.2 and vary to get solution paths for all the coefficients. This is shown in Figure 4. It is wellknown that the solution paths for LASSO are piecewise linear [24], when is least squares loss. Here, we see that the solution paths are not piecewise linear, but rather a smoothed version of it. This smoothness is imparted by the convex clustering regularization, the third term in (2).
6 Synthetic Data Experiments
We demonstrate our approach using experiments with synthetic data on the problem of multitask learning. Detailed experiments on row or columnalone clustering (uniclustering) are provided in supplementary materials.
We begin by describing the performance measures used to evaluate the clustering and estimation performance.
6.1 Performance Measures
The estimation accuracy is measured by calculating the RMSE on an independent test set, and also the parameter recovery accuracy, where and are the estimated and true coefficient matrices. Assessing the clustering quality can be hard. In this paper, we use the following three measures to evaluate the quality of clustering: the adjusted Rand index [11] (ARI), the F1 score (F1), and the Jaccard index (JI). The definitions of F1 and JI are given in the supplementary material. For all these three measures, a value of 1 implies the best possible performance, and a value of 0 means that we are doing poorly.
In order to compute ARI, F1, and JI, we choose the value of the multiplier in formulation 1, and in formulation 2 using the approach described in Section 4.2, and obtain the estimated clusterings.
6.2 Simulation Setup and Results
We focus on multitask regression: with . All the entries of design matrix are generated as independent standard normal. The true regression parameter has a bicluster (checkerboard) structure. To simulate sparsity, we set the coefficients within many of the blocks in the checkerboard to . For the nonzero blocks, we follow the generative model recommended in [4]: the coefficients within each cluster are generated as with to make them close but not identical, where is the mean of the cluster defined by the row partition and column partition. We set , , and in our experiment. For the nonzero blocks, we set and set . We try the lownoise setting (), where it is relatively easy to estimate the clusters, and the highnoise setting (), where it is harder to obtain them.
We compare our formulation 1 and formulation 2 with a 2step estimatethencluster approach: (a) Estimate first using LASSO, and (b) perform convex biclustering on . is estimated by solving (8) while selecting the best as discussed in Section 4.1, and the convex biclustering step is implemented using COBRA algorithm in [4]. Our baseline clustering performance is the best of the following: (a) letting each coefficient be its own group, and (b) imposing a single group for all coefficients.
The average clustering quality results on 50 replicates are shown in Table 1 and Table 3 for low and high noise settings, respectively. In both tables, the first, second, and third blocks correspond to performances of row, column and rowcolumn biclusterings, respectively. We optimize only for biclusterings, but the row and the column clusterings are obtained as byproducts. Note that this could lead to different results compared to directly performing uniclustering.
The RMSEs evaluated on the test set and the parameter recovery accuracy are provided in Table 2 and Table 4. Most performance measures are reported in the format
From Table 1 and Table 3 we see that both our formulation 1 and 2 give better results on row clustering, column clusterings, and rowcolumn biclustering compared to the 2step procedure. Moreover, the clustering results given by our formulations are more stable, with lesser spread in performance.
We compare the RMSE and the parameter recovery accuracy of the proposed formulations with other approaches and report the results in Table 2 and Table 4. The oracle RMSE (with known) is for Table 2 and for Table 4, and we can see that the proposed methods provide improvements over the others. We also observe improvements in the parameter recovery accuracy.
The performance boost obtained with high noise is much higher compared to that with low noise. This makes sense because when noise level is low, the estimation step in the 2step approach is more accurate and the error propagated into the clustering step is relatively small. However at high noise levels, the estimation can be inaccurate. This estimation error propagates into the clustering step and makes the clustering result of 2step approach unreliable. However, our formulations are able to jointly do the estimation and clustering, and hence have more reliable and stable results.
Baseline  2step  Form1  Form2  

ARI  0  0.6790.157  0.8690.069  0.9000.046 
F1  0.446  0.7570.128  0.9070.052  0.9310.022 
JI  0.287  0.6250.161  0.8340.081  0.8710.042 
ARI  0  0.8770.043  0.9140.020  0.9150.013 
F1  0.446  0.9080.037  0.9330.023  0.9340.012 
JI  0.287  0.8470.048  0.8760.031  0.8870.025 
ARI  0  0.7080.118  0.8410.059  0.8630.035 
F1  0.172  0.7340.110  0.8570.052  0.8770.026 
JI  0.094  0.5910.134  0.7530.077  0.7810.035 
Lasso  2step  Form1  Form2  

RMSE  1.6270.02  1.6220.02  1.6130.02  1.6120.02 
Rec. acc.  0.2340.03  0.2310.03  0.2230.03  0.2220.03 
Baseline  2step  Form1  Form2  

ARI  0  0.5770.163  0.8030.104  0.8040.096 
F1  0.446  0.6740.138  0.8740.093  0.8740.075 
JI  0.287  0.5250.159  0.7930.097  0.7920.098 
ARI  0  0.7340.132  0.9050.077  0.9050.046 
F1  0.446  0.7990.107  0.9240.054  0.9330.039 
JI  0.287  0.6890.120  0.8720.078  0.8670.065 
ARI  0  0.5550.187  0.8010.125  0.8120.105 
F1  0.172  0.5860.152  0.8240.104  0.8210.086 
JI  0.094  0.4370.179  0.7140.118  0.7130.104 
Lasso  2step  Form1  Form2  

RMSE  3.340.02  3.300.02  3.230.02  3.160.02 
Rec. acc.  0.3640.06  0.3620.06  0.3270.05  0.3250.06 
7 Real Data Experiments
We demonstrate the proposed approaches using real datasets obtained from experiments with sorghum crops [27]. Accurate phenotyping of different crop varieties is a crucial yet traditionally a timeconsuming step in crop breeding, requiring manual survey of hundreds of plant varieties for the traits of interest. Typical workflow of recent automated, highthroughput remote sensing systems for trait development and GWAS is shown in Figure 5. We consider two specific problems from this pipeline: (a) predictive modeling of plant traits using features from remote sensed data (Section 7.1), (b) GWAS using the reference traits (Section 7.2). Additional experiments are provided in the supplementary material.
7.1 Phenotypic Trait Prediction from Remote Sensed Data
The experimental data was obtained from sorghum varieties planted in replicate plot locations, and we considered three different traits: plant height, stalk diameter, and stalk volume. We report results only for plant height here and the results for other traits, along with variety names are given in the supplementary material.
From the RGB and hyperspectral images of each plot, we extract features of length . Hence , , and the number of tasks , for each trait considered. The presence of multiple varieties with replicates much smaller in number than predictors poses a major challenge: building separate models for each variety is unrealistic, while a single model does not fit all. This is where our proposed simultaneous estimation and clustering approach provides the flexibility to share information among tasks that leads to learning at the requisite level of robustness. Note that here we use the columnonly clustering variant of formulation 1.
The dendrogram for task clusters obtained by sweeping the penalty multiplier is given in Figure 6. This provides some interesting insights from a plant science perspective. As highlighted in Figure 6, the predictive models (columns of ) for thicker medium dark plants are grouped together. Similar grouping is seen for thinner tall dark plants, and thick tall plants with many light leaves.
To compute RMSE, we perform 6folds CV where each fold consists of at least one example from each variety. As we only have samples per variety (i.e. per task), it is unrealistic to learn separate models for each variety. For each CV split, we first learn a grouping using one of the compared methods, treat all the samples within a group as i.i.d, and estimate their regression coefficients using Lasso. The methods compared with our approach include: (a) single model, which learns a single predictive model using Lasso, treating all the varieties as i.i.d., (b) No group multitask learning, which learns a traditional multitask model using Group Lasso where each variety forms a separate group, and (c) Kang et al. [15], which uses a mixed integer program to learn shared feature representations among tasks, while simultaneously determining “with whom” each task should share. Results reported in Table 5, indicate the superior quality of our groupings in terms of improved predictive accuracy.
Method  RMSE 

Single model  44.396.55 
No group multitask learning  36.946.10 
Kang et al.  37.557.60 
Proposed  33.315.10 
7.2 MultiResponse GWAS
We apply our approach in a multiresponse GenomeWide Association Study (GWAS). While traditional GWAS focuses on associations to single phenotypes, we would like to automatically learn the grouping structure between the phenotypes as well as the features (columns and rows of ) using our proposed method. We use the proposed formulations 1 and 2 (biclustering variant) in this experiment.
The design matrix consisted of SNPs of Sorghum varieties. We consider varieties and over SNPs. We remove duplicate SNPs and also SNPs that do not have significantly high correlation to at least one response variable. Finally, we end up considering SNPs. The output data contains the following response variables (columns) for all the varieties collected by hand measurements:

Height to panicle (h1): The height of the plant up to the panicle of the sorghum plant.

Height to top collar (h2): The height of the plant up to the top most leaf collar.

Diameter top collar (d1): The diameter of the stem at the top most leaf collar.

Diameter at 5 cm from base (d2): The diameter of the stem at 5 cm from the base of the plant.

Leaf collar count (l1): The number of leaf collars on the plant.

Green leaf count (l2): The total number of green leaves. This will be less than l1 since some leaves may have senesced and will not be green anymore.
Each trait can be an average of measurements from up to plants, in every variety.
Lasso  2step  Form1  Form2  
RMSE  2.181  2.206  2.105  2.119 
We split our data set into 3 parts: 70% training, 15% validation, and 15% test. We estimate the coefficient matrices by optimizing our formulations on the training set, select the tuning parameters based on the validation set (Sections 4.2, 4.3), and then calculate the RMSE on the test set. Table 6 shows the RMSE on test set. The coefficient matrix given by our formulations are visualized in Figure 8. To make the figure easier to interpret, we exclude the rows with all zero coefficients and take the average over the coefficients within each bicluster. The light yellow areas in the figures are zero coefficients; red and blue areas are positive and negative coefficients, respectively. The rows and columns are reordered to best show the checkerboard patterns. We wish to emphasize again that this checkerboard pattern is automatically discovered using our proposed procedures, and is not evidently present in the data.
The two formulations reveal similar biclustering patterns up to reordering. For column clusters, the plant height tasks (h1 and h2), the stem diameter tasks (d1 and d2), and the leaf tasks (l1 and l2) group together. Also, the stem diameter and leaf tasks are more related to each other compared to the height tasks. The biclustering patterns reveal the group of SNPs that influence similar phenotypic traits. Coefficients for height features in the GWAS (Figure 7) study show SNPs with strong effects coinciding with locations of Dwarf 3 [20] and especially Dwarf 1 [9] genes known to control plant height that are segregating and significant in the population. The lack of any effect at the Dwarf 2 [10] locus supports previous work indicating that this gene is not a strong contributing factor in this population.
We also estimate the RMSE of the proposed formulations and compare it with the RMSE provided by a simple Lasso model and 2step procedure. This is shown in Table 6. We see that the RMSE of our formulations are slightly less than that of the Lasso and 2step procedure. Hence, for similar estimation performance, we are able to discover additional interesting structure in the inputoutput relationship using our proposed methods.
8 Concluding Remarks
In this paper we introduced and studied formulations for joint estimation and clustering (row or column or both) of the parameter matrix in multiresponse models. By design, our formulations imply that coefficients belonging to the same (bi)cluster are close to one another. By incorporating different notions of closeness between the coefficients, we can tremendously increase the scope of applications in which similar formulations can be used. Some future applications could include sparse subspace clustering and community detection.
Recently there has been a lot of research on nonconvex optimization formulations, both from theoretical and empirical perspectives [29, 30]. It would be of interest to see the performance of our formulations on nonconvex loss functions. Another extension would be to construct confidence intervals and perform hypothesis testing for the coefficients in each cluster. There are two kinds of bias in our formulations: (a) shrinkage bias due to the regularization, and (b) bias due to the biclustering regularization, because of the fact that it forces coefficients to be close to each other. Many debiasing methods have been proposed to handle the first type of bias [28, 31, 14, 21], while debiasing methods for the second type of bias is a potential research area.
Acknowledgment
The information, data, or work presented herein was funded in part by the Advanced Research Projects AgencyEnergy (ARPAE), U.S. Department of Energy, under Award Number DEAR0000593. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
References
 [1] Amir Beck. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1):185–209, 2015.
 [2] Hanen Borchani, Gherardo Varando, Concha Bielza, and Pedro Larrañaga. A survey on multioutput regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(5):216–233, 2015.
 [3] Rich Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.
 [4] Eric C Chi, Genevera I Allen, and Richard G Baraniuk. Convex biclustering. arXiv preprint arXiv:1408.0856, 2014.
 [5] Eric C Chi and Kenneth Lange. Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24(4):994–1013, 2015.
 [6] Patrick L Combettes and JeanChristophe Pesquet. A proximal decomposition method for solving convex variational inverse problems. Inverse problems, 24(6):065014, 2008.
 [7] David Hallac, Jure Leskovec, and Stephen Boyd. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 387–396. ACM, 2015.
 [8] Christina Heinze, Brian McWilliams, Nicolai Meinshausen, and Gabriel Krummenacher. Loco: Distributing ridge regression with random projections. arXiv preprint arXiv:1406.3469, 2014.
 [9] Josie Hilley, Sandra Truong, Sara Olson, Daryl Morishige, and John Mullet. Identification of Dw1, a regulator of sorghum stem internode length. PLoS One, 11(3):e0151271, 2016.
 [10] Josie L Hilley, Brock D Weers, Sandra K Truong, Ryan F McCormick, Ashley J Mattison, Brian A McKinley, Daryl T Morishige, and John E Mullet. Sorghum Dw2 encodes a protein kinase regulator of stem internode length. Scientific Reports, 7(1):4616, 2017.
 [11] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2(1):193–218, 1985.
 [12] Laurent Jacob, Jeanphilippe Vert, and Francis R Bach. Clustered multitask learning: A convex formulation. In Advances in Neural Information Processing Systems, pages 745–752, 2009.
 [13] Ali Jalali, Sujay Sanghavi, Chao Ruan, and Pradeep K Ravikumar. A dirty model for multitask learning. In Advances in Neural Information Processing Systems, pages 964–972, 2010.
 [14] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for highdimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909, 2014.
 [15] Zhuoliang Kang, Kristen Grauman, and Fei Sha. Learning with whom to share in multitask feature learning. In Proceedings of the 28th International Conference on Machine Learning (ICML11), pages 521–528, 2011.
 [16] Seyoung Kim and Eric P. Xing. Treeguided group lasso for multitask regression with structured sparsity. In Proceedings of the 27th International Conference on Machine Learning (ICML10), pages 543–550, 2010.
 [17] Abhishek Kumar and Hal Daume III. Learning task grouping and overlap in multitask learning. arXiv preprint arXiv:1206.6417, 2012.
 [18] Yichao Lu and Dean P Foster. Fast ridge regression with randomized principal component analysis and gradient descent. arXiv preprint arXiv:1405.3952, 2014.
 [19] Nicolai Meinshausen and Bin Yu. Lassotype recovery of sparse representations for highdimensional data. The Annals of Statistics, pages 246–270, 2009.
 [20] Dilbag S Multani, Steven P Briggs, Mark A Chamberlin, Joshua J Blakeslee, Angus S Murphy, and Gurmukh S Johal. Loss of an MDR transporter in compact stalks of maize br2 and sorghum dw3 mutants. Science, 302(5642):81–84, 2003.
 [21] Yang Ning and Han Liu. A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1):158–195, 2017.
 [22] Guillaume Obozinski, Ben Taskar, and Michael Jordan. Multitask feature selection. Statistics Department, UC Berkeley, Tech. Rep, 2, 2006.
 [23] Karthikeyan Natesan Ramamurthy, Zhou Zhang, Addie Thompson, Fangning He, Melba Crawford, Ayman Habib, Clifford Weil, and Mitchell Tuinstra. Predictive modeling of sorghum phenotypes with airborne image features. In Proc. of KDD Workshop on Data Science for Food, Energy, and Water, 2016.
 [24] Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. The Annals of Statistics, pages 1012–1030, 2007.
 [25] Elizabeth D Schifano, Lin Li, David C Christiani, and Xihong Lin. Genomewide association analysis for multiple continuous secondary phenotypes. The American Journal of Human Genetics, 92(5):744–759, 2013.
 [26] Ryan J. Tibshirani and Jonathan Taylor. The solution path of the generalized lasso. Ann. Statist., 39(3):1335–1371, 06 2011.
 [27] Mitch Tuinstra, Cliff Weil, Addie Thompson, Chris Boomsma, Melba Crawford, Ayman Habib, Edward Delp, Keith Cherkauer, Larry Biehl, Naoki Abe, Meghana Kshirsagar, Aurelie Lozano, Karthikeyan Natesan Ramamurthy, Peder Olsen, and Eunho Yang. Automated sorghum phenotyping and trait development platform. In Proc. of KDD Workshop on Data Science for Food, Energy, and Water, 2016.
 [28] Sara Van de Geer, Peter Bühlmann, Yaâacov Ritov, Ruben Dezeure, et al. On asymptotically optimal confidence regions and tests for highdimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.
 [29] Zhaoran Wang, Huanran Lu, and Han Liu. Nonconvex statistical optimization: Minimaxoptimal sparse pca in polynomial time. arXiv preprint arXiv:1408.5352, 2014.
 [30] Ming Yu, Varun Gupta, and Mladen Kolar. An influencereceptivity model for topic based information cascades. 2017 IEEE International Conference on Data Mining (ICDM), pages 1141–1146, 2017.
 [31] Ming Yu, Mladen Kolar, and Varun Gupta. Statistical inference for pairwise graphical models using score matching. In Advances in Neural Information Processing Systems, pages 2829–2837, 2016.
 [32] Ming Yu, Addie M. Thompson, Karthikeyan Natesan Ramamurthy, Eunho Yang, and Aurelie C. Lozano. Multitask Learning using Task Clustering with Applications to Predictive Modeling and GWAS of Plant Varieties. ArXiv eprints, October 2017.
 [33] Ming Yu, Zhaoran Wang, Varun Gupta, and Mladen Kolar. Recovery of simultaneous low rank and twoway sparse coefficient matrices, a nonconvex approach. arXiv preprint arXiv:1802.06967, 2018.
Supplementary Material
This material provides additional details to support the main paper. In particular, we provide additional solution paths, define the clustering error measures used, provide detailed optimization steps and convergence proofs for formulations 1 and 2. From the experimental side, we provide additional experiments to compare the columnalone (uni)clustering variant of the proposed approach with other methods, using synthetic data. We also provide additional experiments for phenotypic trait prediction and GWAS using image features rather than phenotypic traits as responses.
Appendix A Solution path for formulation 2
Similar to the solution path we showed for formulation 1 in Section 5, we can obtain the solution path for formulation 2 as functions of two variables.
We first fix a reasonable and vary to get solution paths for all the coefficients. These paths are shown in Figure 10. The solution paths are smooth in and . Similarly, we fix a reasonable and vary to get solution paths for all the coefficients. These paths are shown in Figure 10. The solution paths are smooth in and . The reasonable values are obtained using crossvalidation.
Appendix B Definition of measures
We provide definitions for the measures of quality of clustering used in synthetic data experiments (Section 6).
F1 score (F1).
Assume is the true clustering, define to be the number of pairs of elements in that are in the same subset in and in the same subset in . This is the true positive and similarly we can define , , as true negative, false negative, and false positive, respectively. Define and , the F1 score is defined as:
(9) 
Jaccard Index (JI).
Using the same notation as F1 score, the Jaccard Index is defined as:
(10) 
Appendix C Optimization of the two formulations
We provide the detailed update rules for optimizing formulation 1 described in Section 3.1 here.
c.1 Formulation 1 for multitask regression

Update for : Let For each , ,
c.2 Proof of Proposition 1
Proof.
We need to check that the conditions in Theorem 3.4 in [6] are satisfied in our case:
Let be the domain of which can be set as . Let be a nonempty convex subset of , the strong relative interior of is
where , and is the closure of span .
Now we check the conditions. For (i), goes to infinity means some goes to infinity, and then we know goes to infinity. Therefore (i) holds.
For (ii), we do not have any restriction on , so the right hand side is just , hence (ii) holds.
Therefore, the proposition follows according to Theorem 3.4 of [6].
∎
c.3 Proof of Proposition 2
Proof.
The optimization step for is solved by COBRA and it converges to global minimizer according to Proposition 4.1 in [4]. Define with and , it is clear that and are both Lipschitzcontinuous in and , respectively. Since the optimization step for is also assumed to find global minimizer, Theorem 3.9 in [1] guarantees that our algorithm in Section 3.2 converges to the global minimizer. ∎
Appendix D Synthetic data experiments for uniclustering
In this section we run additional experiments on synthetic data for columnalone clustering (uniclustering) variant of formulation 1. To display the hierarchical tree structure, we use relatively small scale data here. We compare our methods with the following approaches.

Single task: Each task is learned separately via Lasso.

No group MTL [22]: The traditional multitask approach using group lasso penalty, where all tasks are learned jointly and the same features are selected across tasks.

Pregroup MTL: Given the true number of groups, first partition the tasks purely according to the correlation among responses and then apply No group MTL in each cluster.

Treeguided group Lasso [16]: Employs a structured penalty function induced from a predefined tree structure among responses, that encourages multiple correlated responses to share a similar set of covariates. We used the code provided by the authors of [16] where the tree structure is obtained by running a hierarchical agglomerative clustering on the responses.
We consider samples, features, and three groups of tasks. Within each group there are 5 tasks whose parameter vectors are sparse and identical to each other. We generate independent train, validation, and test sets. For each method, we select the regularization parameters using the validation sets, and report the rootmeansquarederror (RMSE) of the resulting models on the test sets. We repeat this procedure 5 times. The results are reported in Table 7. From the table we can see that the Single task method has the largest RMSE as it does not leverage task relatedness; No group MTL has slightly better RMSE; both Kang et al. and Tree guided group Lasso get some improvement by considering structures among tasks; Pregroup MLT achieves a better result, mainly because it is given the true number of groups and for these synthetic datasets it is quite easy to obtain good groups via response similarity, which might not necessarily be the case with real data and when the predictors differ from task to task. Our method achieves the smallest RMSE, outperforming all approaches compared. We also report the running times of each method, where we fix the error tolerance to for fair comparisons. Though the timing for each method could always be improved using more effective implementations, the timing result reflect the algorithmic simplicity of the steps in our approach compare to e.g. the mixed integer program of [15].
Method  RMSE  std  time 

Single task  5.665  0.131  0.02 
No group multitask learning  5.520  0.115  0.05 
Pregroup multitask learning  5.256  0.117  0.10 
Kang et al  5.443  0.096  
Tree guided group Lasso  5.448  0.127  0.03 
Ours  4.828  0.117  0.16 
An example of tree structure learnt by our approach is shown in Figure 12. It is contrasted with Figure 12, which depicts the structure obtained aposteriori by performing hierarchical clustering on the regression matrix learnt by Single Task. The simulation scenario is the same as before except that the nonzero coefficients in each true group are not equal but sampled as where denotes the normal distribution. As can be seen from Figure 12, no matter what is, our approach never makes a mistake in the sense that it never puts tasks from different true groups in the same cluster. For our approach recognizes the true groups. As becomes very large there are no intermediary situations where two tasks are merged first. Instead all tasks are put in the same cluster. We see tasks merge first in group and merge first in group This corresponds to the fact that tasks have largest correlation among group and has largest correlation among group We can see in Figure 12 that for Single Task post clustering, task 10 does not merge with
Impact of the weights
One might argue that our approach relies on “good” weights among tasks. However, it turns out that it is fairly robust to the weights. Recall that we select the weight by nearestneighbors. In this synthetic dataset, we have 5 tasks in each group so the most natural way is to set . We also try setting and see how this affects the result. The test RMSEs for different ’s are given in Table 8. From the table we see that although the best performance is when we select , our method is quite robust to the choice of weights, especially when is smaller than the natural one. When is large the result gets slightly worse, because now we cannot avoid positive weights across groups. But even in this case, our method is still competitive.
2  3  4  5  6  

RMSE  4.847  4.836  4.828  4.896  4.928 
Appendix E Real Data Analysis on uniclustering structure
We provide additional real data experiments with the columnalone (uni)clustering variant of our approach to augment those in Section 7. The varieties and their names used in the trait prediction experiments (Section 7.1 and Section E.1) are given in Table 9. We provide an experiment that directly tries to associate remotely sensed image features with SNP data. In this case, since the image features are likely multidimensional, our approach of learning the group structure among the feature dimensions is intuitive. Note that this setting is different from GWAS, where we associate the traits with SNP data.
Variety number  Variety name 

1  RS 392x105 BMR FS 
2  RS 400x38 BMR SG 
3  RS 341x10 FG white 
4  RS 374x66 FS 
5  RS 327x36 BMR FS 
6  RS 400x82 BMR SG 
7  RS 366x58 FG white 
8  SP NK5418 GS 
9  SP NK8416 GS 
10  SP SS405 FS 
11  SP Trudan Headless FS PS 
12  SP Trudan 8 FS 
13  SP HIKANE II FS 
14  SP NK300 FS 
15  SP Sordan Headless FS PS 
16  SP Sordan 79 FS 
17  PH 849F FS 
18  PH 877F FS 
e.1 Phenotypic Trait Prediction from Remote Sensed Data
We repeat the experiment described in Section 7.1 for stalk diameter and stalk volume traits and provide the tree structure for tasks in Figures 13 and 14. They look very similar to each other and this makes sense since stalk diameter and stalk volume are highly correlated. For these two traits, we can see that variety 12 is very different from others. It corresponds to tall thin plants with few small dark leaves.
e.2 GWAS dataset
In this experiment, we use SNP data from 850 varieties as input (). We considered SNPs (features). There are plots (observations), each containing a single variety. The output data () is the histogram of photogrammetrically derived heights obtained from RGB images of the plots. We consider bins, and each bin is treated as a task. Therefore, . It has been demonstrated that height histograms describe the structure of the plants in the plot and are hence powerful predictors of various traits [23]. Therefore it is worthwhile performing genomic mapping using them bypassing trait prediction. Since it is reasonable to expect the neighboring bins of the histograms to be correlated, our approach for hierarchical task grouping will result in an improved association discovery.
For this dataset, Kang et al.’s method [15] did not scale to handle the amount of features. In general, we noticed that this algorithm is quite unstable, namely the task membership kept changing at each iteration especially as the dimensionality increases.
The tree structure obtained by our method is given in Figure 15. Please note that the axis in the figure is We notice that bins merge quickly while bins {2,3,4} merge when is extremely large. Note that the distance from bin 5 to bin 4 is much larger, compared to the distance from bin 7 to bin 5. In the figure these look similar due to logarithmic scale. Bins are rarely populated. They all have small coefficients and merge together quickly, while more populated bins tend to merge later.