Interactive Prior Elicitation of Feature Similarities for Small Sample Size Prediction
Abstract.
Regression under the “small $n$, large $p$” conditions, of small sample size $n$ and large number of features $p$ in the learning data set, is a recurring setting in which learning from data is difficult. With prior knowledge about relationships of the features, $p$ can effectively be reduced, but explicating such prior knowledge is difficult for experts. In this paper we introduce a new method for eliciting expert prior knowledge about the similarity of the roles of features in the prediction task. The key idea is to use an interactive multidimensional-scaling (MDS) type scatterplot display of the features to elicit the similarity relationships, and then use the elicited relationships in the prior distribution of prediction parameters. Specifically, for learning to predict a target variable with Bayesian linear regression, the feature relationships are used to construct a Gaussian prior with a full covariance matrix for the regression coefficients. Evaluation of our method in experiments with simulated and real users on text data confirms that prior elicitation of feature similarities improves prediction accuracy. Furthermore, elicitation with an interactive scatterplot display outperforms straightforward elicitation where the users choose feature pairs from a feature list.
1. Introduction
Regression analysis becomes difficult when the sample size is substantially smaller than the number of features. “Small $n$, large $p$” refers to the generic class of such problems, which arise in different fields of applied statistics such as personalized medicine (costello2014community; tian2014simple) and text data analysis (forman2003extensive; qu2010bag). The problem poses several challenges to standard statistical methods (johnstone2009statistical) and demands new concepts and models to cope with them. An important challenge is that prediction by fitting regression models using traditional techniques is an ill-posed task in “small $n$, large $p$” settings and is unlikely to be accurate and reliable. Regularization methods (tibshirani1996regression; zou2005regularization) have been proposed to cope with this challenge; however, the improvement they can give is limited. Additionally, modelling could use prior information, i.e. information available about the problem prior to observing the learning data. Prior information is often available only as the experience and knowledge of experts. The process of quantifying and extracting a user’s prior knowledge is known as prior elicitation. The extracted knowledge can be used to improve an underlying model. The two main questions in the process are how to quantify the prior knowledge, and how to plug the extracted prior knowledge into the model.
Garthwaite et al. (garthwaite2013prior) proposed a method of defining the full prior distribution for a generalized linear model by quantifying experts’ opinions on different statistics such as the median and the lower and upper quantiles. Interactive Principal Component Analysis (iPCA) (jeong2009ipca) supports analysis of multivariate data sets through modification of the model parameters by the user. The drawback of these types of prior elicitation is that they assume users are experts in the underlying model and not just domain experts. To solve this problem, observation-level interaction has been proposed, where the focus is on interaction between the user and the data rather than the model parameters (brown2012dis; endert2011observation). Using the knowledge extracted from the interaction, the parameters of the underlying model are tuned to reflect the user’s knowledge. In recent work, Daee et al. (daee2016knowledge) proposed a method of eliciting a user’s knowledge on single features to improve the predictions in a sparse linear regression problem. The user’s knowledge is assumed to be about feature relevance and/or feature weight values. Similarly, Micallef et al. (micallef2016interactive) proposed an interactive visualization to extract a user’s knowledge on the relevance of individual features for a prediction task.
In this paper, we present a novel approach to interactive prior elicitation of pairwise similarities of features in “small $n$, large $p$” prediction tasks. The proposed approach uses an interactive MDS-type scatterplot of the features to let users give feedback on their pairwise similarities, in the sense of how similarly they would affect the predictions. Based on this input, the system learns a new similarity metric for the features and redraws the scatterplot. Finally, the learned metric is used to define a prior distribution for the prediction parameters. The proposed approach shields users from the technicalities of the underlying model. The contributions of this paper can be summarized as:

The user’s prior knowledge is quantified as the prior covariance of the regression coefficients in a Bayesian linear regression model. Using this interpretation, our system lets the user manipulate the prior distribution of the model parameters indirectly through feedback, without having to understand modelling details.

Feedback is collected on pairwise similarities of the features rather than the data, parameters or single features. This type of feedback is complementary to all earlier approaches.

The prior is elicited with an MDS-type interactive visualization of a kind that has earlier been used for visualizing similarities of data items.
Our simulation results and preliminary user study demonstrate that when collecting pairwise similarity knowledge using the proposed interactive intelligent interface, users are able to provide more informative feedback, and the performance of the underlying model increases in prediction tasks.
2. Overview
To motivate our algorithm and for clarity, we illustrate the basic idea with a simple use case. We used the sentiment data set (blitzer2007biographies), which contains text reviews and the corresponding rating values (taken from www.amazon.com) for four product categories. Each review is represented using a vector of unigram and bigram keywords that appear in at least 100 reviews within the same category. We focus on the kitchen appliances category, in which each review is represented by a feature vector of size $p = 824$ (Hernandez2015Expectation). The task is to learn a model that linearly relates the keywords (which here are the features) to the ratings (outputs), in order to predict the ratings from the textual content of the reviews. This is a supervised learning task where we have a training set of inputs $\mathbf{X} \in \mathbb{R}^{n \times p}$ and outputs $\mathbf{y} \in \mathbb{R}^{n}$. To simulate the “small $n$, large $p$” paradigm, we randomly select $n = 100$ reviews and their corresponding ratings as the training set.
A linear regression model for this task can be defined using a parameter vector $\mathbf{w} \in \mathbb{R}^{p}$. Mathematically, the model is
(1) $\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\varepsilon}$
where $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ is the residual noise. Equation 1 induces a Gaussian distribution for the likelihood as $\mathbf{y} \mid \mathbf{X}, \mathbf{w}, \sigma^2 \sim \mathcal{N}(\mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I})$. The goal is to learn the posterior distribution of $\mathbf{w}$ given the training data.
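Why the prior matters in this setting can be seen from a small numerical sketch. The snippet below is an illustration only: it draws synthetic Gaussian data (not the sentiment data set) with 100 samples and 824 features, and the noise scale of 0.5 is an arbitrary assumption. It shows that the design matrix cannot have rank above the sample size, so ordinary least squares can fit the training targets essentially exactly without identifying the true coefficients.

```python
import numpy as np

# Synthetic stand-in for a "small n, large p" problem (all values are assumptions).
rng = np.random.default_rng(0)
n, p = 100, 824
X = rng.normal(size=(n, p))                          # n samples, p features
w_true = rng.normal(size=p)                          # hypothetical true coefficients
y = X @ w_true + rng.normal(scale=0.5, size=n)       # linear model with Gaussian noise

# The rank of X is at most n, so X^T X (p x p) is singular when n < p.
rank = np.linalg.matrix_rank(X)

# Minimum-norm least-squares solution: interpolates the noisy training targets.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
train_residual = np.linalg.norm(X @ w_ls - y)
coef_error = np.linalg.norm(w_ls - w_true)

print(rank, train_residual, coef_error)
```

Because infinitely many coefficient vectors reproduce the training data equally well, the data alone cannot choose among them; this is the role the elicited prior is meant to play.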
Inferring the posterior of the parameters in the Bayesian setting requires a prior distribution. In data sets with large sample sizes, the choice of the prior distribution has a minor effect on the posterior inferences; however, since we assume a “small $n$, large $p$” data set, the role of the prior distribution becomes more important. Setting prior distributions is a difficult task and requires knowledge of both the domain and the model parameters. In this paper, we introduce a method for helping in this task, by learning and refining a good prior distribution for the prediction parameters using feedback given by a user. The user’s knowledge is assumed to be about the pairwise similarities of the keywords with regard to the role they have in the prediction task. By similar, we mean that the keywords have a similar effect on the rating values (the values of their regression coefficients are similar). As an example, the keywords “good” and “excellent” have a similar role in the prediction, since both of them convey information that the user will give a high rating to the product, while the keywords “bad” and “good” are dissimilar.
Figure 1 illustrates an example interaction between the user and our system. Keywords (features) are visualized to the user on the scatterplot, where she can zoom in and out by scrolling the mouse wheel down and up. The user investigates the distances among keywords and decides, based on her prior knowledge, whether two keywords should be closer to each other (similar) or farther away from each other (dissimilar). As an example, the user concluded that, according to her prior knowledge, the distance between the keywords “love_it” and “perfectly” should be less than what is shown in the scatterplot. She selects these keywords by clicking on them (their color changes to green, as shown in Figure 1(a)), selects the similar/dissimilar box in the menu bar, and then clicks on the submit button. The user can then ask for a new visualization (New Visualization button in Figure 1) to see the effect of her feedback on the distances between keywords (Figure 1(b)), or she can continue giving more feedback according to the current distances. As shown in Figure 1(b), the single feedback given by the user modifies the distances between keywords; however, it was not informative enough to make the distances perfect. This process iterates until the user is satisfied with the visualization. The knowledge extracted from the user is used to build a proper covariance matrix for the prior distribution of the prediction parameter $\mathbf{w}$. Finally, using the obtained prior, we compute the posterior of the prediction parameters.
3. Interactive Prior Elicitation of Pairwise Similarities
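We reformulate the Interactive Neighbor Retrieval Visualizer (peltonen2013information), which visualizes data items, into a method for prior elicitation on features. To visualize the features for the user, we use the original data space as the representation for the features in the high-dimensional space. More precisely, we define $\mathbf{x}_i = [x_{1i}, \dots, x_{ni}]^\top$ as the original representation of the $i$-th feature, where $x_{ji}$ is the $i$-th element of the $j$-th sample. With this definition, we have $p$ features, each of which is an $n$-dimensional vector. We define $\mathbf{z}_i$ as the corresponding low-dimensional projection of $\mathbf{x}_i$, to be learned from user feedback.
At each iteration $t$, we define the similarity matrix of the features in the high-dimensional space as
(2) $p^{t}_{j|i} = \dfrac{\exp\left(-(\mathbf{x}_i-\mathbf{x}_j)^\top \mathbf{A}^{t} (\mathbf{x}_i-\mathbf{x}_j)/\sigma_i^2\right)}{\sum_{k\neq i}\exp\left(-(\mathbf{x}_i-\mathbf{x}_k)^\top \mathbf{A}^{t} (\mathbf{x}_i-\mathbf{x}_k)/\sigma_i^2\right)}$
where $\mathbf{A}^{t}$ is the unknown similarity metric between the features, and $\sigma_i$ is a scaling parameter. The unknown similarity metric encodes the user feedback and is learned iteratively by interaction with the user. The metric is initialized to the identity matrix.
To find the location of the points in the visualization space at iteration $t$, an analogous matrix is defined for the low-dimensional projections:
(3) $q^{t}_{j|i} = \dfrac{\exp\left(-\lVert\mathbf{z}^{t}_i-\mathbf{z}^{t}_j\rVert^2/\sigma_i^2\right)}{\sum_{k\neq i}\exp\left(-\lVert\mathbf{z}^{t}_i-\mathbf{z}^{t}_k\rVert^2/\sigma_i^2\right)}$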
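The two similarity matrices can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the softmax-normalized neighbor probabilities used in neighbor-embedding methods, a Mahalanobis-type squared distance under a metric matrix in the high-dimensional space, a plain Euclidean distance in the display space, and a single shared scaling parameter `sigma`. The function name `neighbor_probabilities` and all data are hypothetical.

```python
import numpy as np

def neighbor_probabilities(points, metric=None, sigma=1.0):
    """Return a row-stochastic matrix P where P[i, j] is the probability
    that point j is picked as a neighbor of point i (P[i, i] = 0)."""
    points = np.asarray(points, dtype=float)
    m, dim = points.shape
    if metric is None:
        metric = np.eye(dim)                      # identity metric = Euclidean distance
    diff = points[:, None, :] - points[None, :, :]            # shape (m, m, dim)
    sq_dist = np.einsum("ijk,kl,ijl->ij", diff, metric, diff)  # (x_i-x_j)^T A (x_i-x_j)
    logits = -sq_dist / sigma**2
    np.fill_diagonal(logits, -np.inf)             # a point is never its own neighbor
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the exponentials
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))          # 20 samples, 6 features (synthetic)
features = X.T                        # each feature is a 20-dimensional vector
P = neighbor_probabilities(features, metric=np.eye(20), sigma=5.0)  # high-dimensional side
Z = rng.normal(size=(6, 2))           # hypothetical 2-D display coordinates
Q = neighbor_probabilities(Z)         # low-dimensional side
```

In an interactive loop, the metric passed to the first call would be updated from user feedback, while the display coordinates would be re-optimized so that `Q` matches `P` as closely as possible.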
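Finally, the locations of the points in the low-dimensional space are obtained by optimizing the following expected cost function (peltonen2013information):
(4) $\mathbb{E}_{\mathbf{A}}\Big[\lambda\, \mathbb{E}_i\big[D_{\mathrm{KL}}(p^{t}_i \,\|\, q^{t}_i)\big] + (1-\lambda)\, \mathbb{E}_i\big[D_{\mathrm{KL}}(q^{t}_i \,\|\, p^{t}_i)\big]\Big]$
where $p^{t}_i$ and $q^{t}_i$ are the neighborhood distributions of feature $i$ in the high-dimensional and low-dimensional spaces, $\mathbb{E}_{\mathbf{A}}$ denotes the expectation over the posterior distribution of the learned metric given the feedback $\mathcal{F}$, and $\mathbb{E}_i$ is the expectation over the training set points. Since the high-dimensional distributions are functions of the unknown metric $\mathbf{A}$, the cost function is represented as the expectation over the possible metrics. The parameter $\lambda$ controls the relative importance of recall and precision of the display (venna2010information). The final similarity metric $\mathbf{A}$, learned in the last iteration of user interaction, is used to define a prior distribution for the regression weights according to equations 5 and 6: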
(5) 
(6) 
where and are scalar scale parameters. In our implementation, the value of is set by cross-validation.
With this prior distribution for the regression coefficients, and with gamma prior distributions on two scalar hyperparameters, the posterior distribution is analytically intractable, but it can be efficiently approximated using Variational Bayes (e.g., (Bishop2006, Chapter 10)). This gives a Gaussian posterior approximation for the regression coefficients $\mathbf{w}$. Finally, the prediction is done using the posterior mean. Pseudocode of the proposed method is presented in Algorithm 1.
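To make the role of the learned metric concrete, the following sketch builds a full prior covariance over the coefficients from a metric and computes a Gaussian posterior. Several parts are assumptions rather than the paper's stated method: the covariance is taken to be a squared-exponential kernel of the metric-weighted distance between feature columns, the noise variance is held fixed, and the closed-form conjugate posterior is used in place of the Variational Bayes approximation with gamma hyperpriors. The names `feature_kernel`, `gaussian_posterior`, and `length_scale` are hypothetical.

```python
import numpy as np

def feature_kernel(X, metric, length_scale=1.0):
    """Assumed prior covariance: squared-exponential kernel between the
    columns of X (features), using a metric-weighted squared distance."""
    feats = X.T                                               # shape (p, n)
    diff = feats[:, None, :] - feats[None, :, :]              # shape (p, p, n)
    sq_dist = np.einsum("ijk,kl,ijl->ij", diff, metric, diff)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gaussian_posterior(X, y, prior_cov, noise_var=1.0):
    """Posterior of w for y = Xw + noise, with w ~ N(0, prior_cov) and
    noise ~ N(0, noise_var * I). Only an n x n system is solved, which
    suits the case of fewer samples than features."""
    n = X.shape[0]
    gram = X @ prior_cov @ X.T + noise_var * np.eye(n)        # shape (n, n)
    gain = np.linalg.solve(gram, X @ prior_cov).T             # prior_cov X^T gram^{-1}
    mean = gain @ y
    cov = prior_cov - gain @ X @ prior_cov
    return mean, cov

rng = np.random.default_rng(1)
n, p = 15, 40                                   # small synthetic problem
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
K = feature_kernel(X, metric=np.eye(n), length_scale=np.sqrt(n))
w_mean, w_cov = gaussian_posterior(X, y, prior_cov=K, noise_var=0.5)
y_hat = X @ w_mean                              # prediction with the posterior mean
```

The pairwise difference tensor in `feature_kernel` needs memory proportional to the squared number of features times the number of samples, so a larger feature set would call for a chunked or distance-matrix-based computation.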
4. Simulation Experiment
We conducted a simulated study on the data set introduced in Section 2 with two scenarios, where a simulated user (i) gives all feedback at once, and (ii) gives feedback sequentially. As baselines, we used Bayesian linear regression with unit prior covariance and Bayesian linear regression with the prior covariance used in the first round of our method (“Without Feedback” in the following, since the prior is obtained by setting $\mathbf{A} = \mathbf{I}$ and without using feedback). We used a set of 3149 randomly selected reviews with their corresponding ratings to construct the simulated user. This is done by using the mean of the posterior distribution of the regression coefficient vector of a Bayesian linear regression model trained on the randomly selected data. The simulated user assumes two similarity clusters: (i) the features with the 30 highest regression coefficients and (ii) the features with the 30 lowest regression coefficients. Features in these two clusters are dissimilar to each other. Since there are enough samples (3149) compared to the dimensionality of the data (824), the posterior mean of the regression coefficients is a good representative of the true values of the feature weights and, consequently, of the similarity of the roles of the features in the prediction task.
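The simulated user described above can be sketched as follows. The cluster construction (top 30 and bottom 30 coefficients) follows the text; how individual pairs are drawn is not specified in the source, so uniform random sampling within and across the clusters is an assumption, as are the function name `simulated_feedback` and the synthetic coefficient vector.

```python
import numpy as np

def simulated_feedback(coefs, cluster_size=30, n_similar=10, n_dissimilar=10, seed=None):
    """Draw similar pairs from within the high or low coefficient cluster,
    and dissimilar pairs with one feature from each cluster.
    Requires len(coefs) >= 2 * cluster_size so the clusters are disjoint."""
    rng = np.random.default_rng(seed)
    order = np.argsort(coefs)
    low, high = order[:cluster_size], order[-cluster_size:]
    similar = []
    for _ in range(n_similar):
        cluster = high if rng.random() < 0.5 else low       # pick one cluster at random
        i, j = rng.choice(cluster, size=2, replace=False)   # two distinct features
        similar.append((int(i), int(j)))
    dissimilar = [(int(rng.choice(high)), int(rng.choice(low)))
                  for _ in range(n_dissimilar)]
    return similar, dissimilar, set(low.tolist()), set(high.tolist())

# Stand-in for the posterior mean learned from the large reference sample.
rng = np.random.default_rng(0)
reference_coefs = rng.normal(size=824)
similar, dissimilar, low_set, high_set = simulated_feedback(reference_coefs, seed=1)
```

Calling the function once per round with 10 similar and 10 dissimilar pairs would mimic the sequential scenario, while a single call with larger counts would mimic giving all feedback at once.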
The remaining samples are randomly partitioned into training and test sets. The results reported in this section are averaged over 10 simulated user constructions and 50 random training data selections. Figure 2(a) shows simulation results for the first scenario, in which the proposed method is evaluated with an increasing number of randomly selected training samples, from 50 to 500. Figure 2(b) shows the change in Mean Squared Error (MSE) on the test data with 100 randomly selected training samples when the simulated user gives feedback sequentially in 60 rounds; round 0 works without feedback. The simulated user gives 10 similarity and 10 dissimilarity feedback items in each round.
From Figure 2, it can be concluded that assuming pairwise similarity/dissimilarity knowledge from the user, the proposed method improves the predictions by extracting prior knowledge.
5. User Study
We conducted a user study with 10 naive university students to empirically evaluate our two hypotheses that (i) by collecting prior knowledge on the pairwise similarity of the features we can improve predictions, and (ii) the interactive interface helps users to give better feedback and consequently improves the system’s predictions. To evaluate the first hypothesis, we consider the same baselines used in the previous section. To evaluate the second hypothesis, we implemented two different versions of our system, both with the same underlying model but with different interfaces: the proposed interactive interface and a simple non-interactive list visualization of the features. In the list visualization, the order of the features is random and fixed during the whole experiment for a user. The user goes through the list, selects the pairs which are similar or dissimilar according to her prior knowledge, and gives feedback on them. This very simple interface was designed for testing hypothesis (ii). As far as we know, there are no earlier methods for the same task.
We designed a between-subject study, where each participant performed two prior elicitation tasks with different interfaces and different data collections: the sentiment data set introduced in Section 2 and the reviews from the Yelp data set challenge (www.yelp.com/dataset_challenge). Users were asked to give feedback on the pairwise similarity/dissimilarity of the words with respect to the role they have in the prediction. For the Yelp data, we used a subset with 4086 reviews. In both data sets, we set a threshold on the tf-idf values (a standard technique in information retrieval, see (sparck1972statistical)) of the words to choose 300 words. To simulate “small $n$, large $p$” training data, we randomly selected five subsets of each data set with 100 samples each, and used the rest for testing. Therefore, the training set for each task contains 100 samples and 300 features. Each of the five selected subsets (from each data set) was used once for the interactive interface and once for the non-interactive interface, with different users. Users interacted with each interface for 20 rounds and gave 5 feedback items (similarity/dissimilarity) per round. The study was balanced with respect to the combination of the type of interface, task and order. After both tasks, a short semi-structured interview was conducted with each participant.
Figures 3(a) and 3(b) show MSEs on test data as a function of the number of feedback iterations for the two data sets. According to the figures, the extracted prior knowledge of the user improves the mean squared errors of the predictions compared to both baselines. Moreover, the difference between the MSE values obtained with the interactive interface and the non-interactive interface shows the amount of improvement made to the predictions by using the interface. To test the statistical significance of the improvements made by our method compared to each of the other methods, we used the same procedure introduced in (micallef2016interactive). The distance between the average curves in the last round in Figure 3 is used as the test statistic. Under the assumption that there is no difference between the results obtained by the interactive interface and the other methods, we compute the distribution of the test statistic by performing permutations of the labels, e.g. interactive interface, non-interactive interface, etc. Finally, the proportion of the permutations which have higher values of the test statistic than the value obtained with the true labels is used as the $p$-value of the significance test. Based on this test, the improvements over the “Unit Prior Covariance” and the “Without Feedback” baselines are statistically significant in both data sets, while for the non-interactive interface, the differences between MSEs are not statistically significant, which might be due to the small number of users.
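The permutation procedure can be sketched as below. This is a simplified illustration, not the exact procedure of the cited work: it assumes one final-round MSE value per user and interface, treats the two groups as unpaired, and counts permutations whose statistic is at least as large as the observed one (a slightly conservative variant of "higher"). The function name, the number of permutations, and the example MSE values are all hypothetical.

```python
import numpy as np

def permutation_p_value(errors_method, errors_other, n_permutations=10000, seed=0):
    """One-sided permutation test on the difference of mean final-round errors.
    The statistic is mean(errors_other) - mean(errors_method), which is
    positive when the method of interest has the lower error."""
    rng = np.random.default_rng(seed)
    a = np.asarray(errors_method, dtype=float)
    b = np.asarray(errors_other, dtype=float)
    observed = b.mean() - a.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(pooled)                 # shuffle the method labels
        stat = perm[len(a):].mean() - perm[:len(a)].mean()
        count += stat >= observed
    return count / n_permutations

# Made-up final-round MSEs for five users per condition (illustration only).
mse_interactive = [1.00, 1.10, 0.90, 1.05, 0.95]
mse_baseline = [2.00, 2.10, 1.90, 2.05, 1.95]
p_clear = permutation_p_value(mse_interactive, mse_baseline)
p_same = permutation_p_value(mse_interactive, mse_interactive)
```

With only five values per group there are 252 distinct label assignments, so the smallest achievable $p$-value is about 0.004; this illustrates why a small number of users limits the power of such a test.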
It should be noted that only a small portion of the words are meaningfully related to the rating prediction task; most of the words are verbs (am, is, etc.) or subjects (I, he, etc.), which are difficult for the user to give feedback on. As a result, users gave their best feedback in the first couple of rounds, which causes prior elicitation to improve the prediction errors most in the first half of the rounds. In the second half of the rounds, prediction errors either improve slowly because of repetitive feedback given by the user, or sometimes even worsen, since some users started to give feedback on irrelevant words.
In the interview, 8 out of 10 users reported that they felt the intelligent interface helped them to accomplish the task. However, 5 users stated that they preferred the simple interface over the intelligent one. This is not surprising since people often prefer simpler systems over more complex ones (hearst2009search) even if the complex system benefits them in accomplishing the required task.
6. Discussion and Conclusion
In this paper, we presented a new method and a prototype implementation of an interactive prior elicitation system which elicits an expert’s prior knowledge on feature similarities to improve prediction accuracy. The system involves an intelligent user interface which helps the user in the interaction. We believe that this is an important step toward more efficient interactive prior elicitation methods. The main novelties are the type of feedback assumed from the user and the interpretation of the extracted knowledge as prior covariance for the parameter of the linear regression model.
In the current implementation, we pruned the number of features to avoid overwhelming the user; however, for the general case of a large number of features, we are working on an active learning version of the method to prioritize the feature pairs and allow scaling up to a much larger number of features.
References
 (1) Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
 (2) Blitzer, J., Dredze, M., and Pereira, F. Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. In ACL, vol. 7 (2007), 440–447.
 (3) Brown, E. T., Liu, J., Brodley, C. E., and Chang, R. Disfunction: Learning distance functions interactively. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology (VAST) (2012), 83–92.
 (4) Costello, J. C., Heiser, L. M., Georgii, E., Gönen, M., Menden, M. P., Wang, N. J., Bansal, M., Hintsanen, P., Khan, S. A., Mpindi, J.-P., et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nature Biotechnology 32, 12 (2014), 1202–1212.
 (5) Daee, P., Peltola, T., Soare, M., and Kaski, S. Knowledge elicitation via sequential probabilistic inference for high-dimensional prediction. arXiv preprint arXiv:1612.03328 (2016).
 (6) Endert, A., Han, C., Maiti, D., House, L., and North, C. Observation-level interaction with statistical models for visual analytics. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology (VAST) (2011), 121–130.
 (7) Forman, G. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3 (2003), 1289–1305.
 (8) Garthwaite, P. H., Al-Awadhi, S. A., Elfadaly, F. G., and Jenkinson, D. J. Prior distribution elicitation for generalized linear and piecewise-linear models. Journal of Applied Statistics 40, 1 (2013), 59–75.
 (9) Hearst, M. Search user interfaces. Cambridge University Press, 2009.
 (10) Hernández-Lobato, J. M., Hernández-Lobato, D., and Suárez, A. Expectation propagation in linear regression models with spike-and-slab priors. Machine Learning 99, 3 (2015), 437–487.
 (11) Jeong, D. H., Ziemkiewicz, C., Fisher, B., Ribarsky, W., and Chang, R. iPCA: An interactive system for PCA-based visual analytics. In Computer Graphics Forum, vol. 28, Wiley Online Library (2009), 767–774.
 (12) Johnstone, I. M., and Titterington, D. M. Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 367, 1906 (2009), 4237–4253.
 (13) Micallef, L., Sundin, I., Marttinen, P., Ammad-ud-din, M., Peltola, T., Soare, M., Jacucci, G., and Kaski, S. Interactive elicitation of knowledge on feature relevance improves predictions in small data sets. arXiv preprint arXiv:1612.02487 (2016).
 (14) Peltonen, J., Sandholm, M., and Kaski, S. Information retrieval perspective to interactive data visualization. EuroVis - Short Papers (2013), 49–53.
 (15) Qu, L., Ifrim, G., and Weikum, G. The bag-of-opinions method for review rating prediction from sparse text patterns. In Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics (2010), 913–921.
 (16) Sparck Jones, K. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation 28, 1 (1972), 11–21.
 (17) Tian, L., Alizadeh, A. A., Gentles, A. J., and Tibshirani, R. A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association 109, 508 (2014), 1517–1532.
 (18) Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) (1996), 267–288.
 (19) Venna, J., Peltonen, J., Nybo, K., Aidos, H., and Kaski, S. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research 11 (2010), 451–490.
 (20) Yang, L., Jin, R., and Sukthankar, R. Bayesian active distance metric learning. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, AUAI Press (2007), 442–449.
 (21) Zou, H., and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 2 (2005), 301–320.