Visual Feature Fusion and its Application to Support Unsupervised Clustering Tasks

Gladys Hilasaca and Fernando Paulovich

In visual analytics applications, the concept of putting the user in the loop refers to the ability to replace heuristics with user knowledge in machine learning and data mining tasks. In supervised tasks, user engagement occurs via the manipulation of the training data. However, in unsupervised tasks, user involvement is limited to changes in the algorithm parametrization or the input data representation, also known as features. Depending on the application domain, different types of features can be extracted from the raw data. Therefore, the result of unsupervised algorithms heavily depends on the type of feature employed. Since there is no perfect feature extractor, combining different features has been explored in a process called feature fusion. Feature fusion is straightforward when the machine learning or data mining task has a cost function. However, when such a function does not exist, user support for the combination needs to be provided; otherwise, the process is impractical. In this paper, we present a novel feature fusion approach that uses small data samples to allow users not only to effortlessly control the combination of different feature sets but also to interpret the attained results. The effectiveness of our approach is confirmed by a comprehensive set of qualitative and quantitative tests, opening up different possibilities of user-guided analytical scenarios not yet covered. The ability of our approach to provide real-time feedback for the feature fusion is exploited in the context of unsupervised clustering techniques, where the composed groups reflect the semantics of the feature combination.


Keywords: Feature fusion, dimensionality reduction, visual analytics, user interaction

Affiliations: University of Sao Paulo, Brazil; Dalhousie University, Canada
Corresponding author: Fernando Paulovich, Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, NS, Canada.


Machine learning and data mining techniques are, in general, split into supervised and unsupervised approaches, or a combination of both. In supervised approaches, user knowledge is added to the analytical process through sets of already analyzed instances. In unsupervised approaches, knowledge can be added by changing the algorithms' parameters or the input data representation, also known as features. The challenge is, therefore, not only to define the most appropriate set of parameters but also to find the data representation that best expresses the knowledge of the user or analyst.

Depending on the application domain (e.g., text or image), there exist several approaches to construct features, each providing complementary information about the original or raw data. Since there is no perfect feature, the idea of joining different representations is straightforward. This process is called data or feature fusion [1] and can occur through the combination of features (vectors) or the merging of distances calculated from the features. When the machine learning or data mining task involves a cost function, for instance, classification accuracy, such a function can be used to guide the combination. However, for tasks such as clustering [2, 3] or multidimensional projection [4, 5], where such a function does not exist, user support for the combination needs to be provided. Otherwise, in practice, data fusion is impossible or useless given the abundance of different combinations.

In this paper, we present a novel feature fusion approach that allows users to control and understand the fusion of different feature sets. Starting with a small sample, users employ a simple widget to define the weights for the combination and observe the outcome in real-time through a scatterplot-based visualization. Once the user finds the weighted combination that best matches his or her point of view regarding similarity, the same weights are used to combine the complete dataset. In this way, we not only allow users to effortlessly test different combinations but also enable the interpretation of the attained results.

In summary, the main contributions of this paper are:

  • A novel feature fusion technique that allows users to explore and understand different combinations of features in real-time;

  • An approach to inputting user knowledge into unsupervised tasks that is much more interpretable than parameter tweaking;

  • An interactive visualization-assisted tool to explore large image collections that allows real-time tuning of the similarity between images to match users' expectations.

Related Work

The process of integrating information from multiple sources to produce a unified, enhanced data model is called data fusion [1]. The goal is to combine different data representations into a single model that incorporates properties of the various sources. Data fusion can occur in different ways, including combining features, that is, the vectorial data representations, or merging distances calculated from the various sources.

The concept of merging features is called feature fusion. Feature fusion aims at generating a unified vectorial data representation based on different sets of features (vectorial representations) [6, 7]. The most straightforward approach is feature concatenation [8, 9]. In the concatenation, given the sets of features , the unified representation is given by . Despite its simplicity, the literature reports several examples. In [10], Local Binary Pattern (LBP) [11] and Histograms of Oriented Gradients (HOG) [12] features are concatenated to improve performance in pedestrian detection. In [13], Scale-Invariant Feature Transform (SIFT) [14] and boundary-based shape [15] features are concatenated to improve object recognition. In [16], high-, medium-, and low-layer features of a deep neural network are united to support object detection, and in [17] color and texture features are progressively concatenated, aiming at reducing model complexity in a content-based retrieval framework. Feature concatenation has also been used in the text domain. In [18], the authors extract seven types of lexical, syntactic, and semantic features and combine subsets of them to improve text classification.

Weights can be used in the concatenation process to control the influence of the different features. In this process, the unified representation is given by  [6], where are the weights. In [19], weighted concatenation was used to improve text classification by combining lexical, syntactic, and semantic features. In [20], a neural network was used to learn the weights of a concatenation, combining different image features, such as color, shape, and texture, to improve classification accuracy. In [21], the authors use a saliency detection model to fuse color and texture features through a weighting strategy. They first transform the color and texture features into saliency features and then linearly combine the saliency features. Different from the previous weighted techniques, in this case they linearly combine the features instead of concatenating them, that is, the unified representation is given by . This is possible since the saliency representations have the same dimensionality.

In practice, feature concatenation is not recommended since it may result in huge feature vectors, leading to the curse of dimensionality [6]. One solution is to apply a dimensionality reduction after the concatenation [22], or to perform a distance fusion. In the distance fusion, instead of combining the vectorial representations, the distances calculated from the representations are combined. If represents the distance matrix calculated from , the resulting distance matrix is given by . In [23], a simple normalized combination of distances computed from different types of features is used for cover song identification. The distance fusion can also be performed using weights. In [24], weights are used to combine distances calculated from color and texture features to improve the results of a content-based image retrieval system. In [25], distances calculated from color and texture features are also combined to support content-based image retrieval applications. Finally, in [26] and [16], distances calculated from features extracted from different layers of a deep neural network are combined, seeking to improve retrieval tasks.
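As an illustration of the distance fusion idea discussed above, the following Python sketch combines distance matrices computed from two toy feature sets. The normalization by the largest entry, the default equal weights, and all function names are assumptions made for this example; they are not taken from the cited works.

```python
import math

def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length vectors."""
    return [[math.dist(a, b) for b in points] for a in points]

def normalize(matrix):
    """Scale a distance matrix so that its largest entry is 1 (assumed normalization)."""
    peak = max(max(row) for row in matrix)
    return [[(v / peak if peak else 0.0) for v in row] for row in matrix]

def fuse_distances(matrices, weights=None):
    """Weighted sum of normalized distance matrices; equal weights by default."""
    k = len(matrices)
    weights = weights or [1.0 / k] * k
    normed = [normalize(m) for m in matrices]
    n = len(normed[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, normed)) for j in range(n)]
            for i in range(n)]

# Toy data: the same three instances described by two hypothetical feature types.
color = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
texture = [[1.0], [1.0], [2.0]]
fused = fuse_distances([pairwise_distances(color), pairwise_distances(texture)])
```

Because each matrix is normalized before the sum, neither feature type dominates the fused distances merely because of its scale or dimensionality.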

Different from data fusion, model fusion combines computational models instead of data. Such a combination can be performed in two different ways: by combining different models (parametrizations) processing a single feature set (data set), or by combining different models processing different feature sets [27]. The former is called ensemble learning and has been extensively used for classification tasks. The idea is to combine the predictions of different models using some voting strategy to improve model diversity and classification accuracy [28, 29]. Ensembles of classifiers typically outperform single classifiers [30] and have been used in different domains, including remote sensing, computer security, financial risk assessment, fraud detection, recommender systems, medical computer-aided diagnosis, and others [27, 31]. Similarly, the latter also employs a (weighted) voting strategy to combine different models, but in this case the models use different sets of features as input. Examples of applications include fruit classification [9] and sentiment analysis [32].

Common to all these data and model fusion approaches is that the combinations can only be appropriately performed when a loss function is available to guide the process, as in classification. If such a function does not exist, or there is a degree of subjectivity in the process, performing the combination without proper user support hampers its applicability in real scenarios, and none of the mentioned approaches offers such support. In this paper, we devise an approach to aid the process of feature combination, allowing users to control the process to match individual expectations and enabling applications where user judgment is crucial.

Proposed Methodology

Our approach for feature fusion employs a two-phase strategy to support users in defining combinations that reflect a particular point of view regarding similarity relationships. In the first phase, samples are extracted from each different set of features and merged so that each set presents the same objects but represented using the different types of features. Each sample is then mapped to a vectorial representation preserving as much as possible the distance relationships between the instances. These vectorial representations are then combined to generate a single representation , which is visualized.

The user can then change the feature weights and observe the outcome. Once the sample visualization reflects the user's expectations, that is, once the proper weights are found, the second phase takes place and the defined weights are used to combine the complete sets of features. In this process, the vectorial sample representations and the samples are used to construct models to map each set of features to a vectorial representation . Since these vectorial representations are embedded in the same space, they can be combined using the weights , obtaining the final vectorial representation that matches the user's expectations defined by the sample visualization. Figure 1 outlines our approach, showing the involved steps. Next, we detail these steps, starting with the sampling and the dimensionality reduction.

Figure 1: Overview of our process for feature fusion. Initially, a sample is extracted, combined, and visualized. Based on that, the user can test different weights to fuse the features and observe the outcome. Once the sample combination reflects the user's expectation, the same weights are used to combine the complete sets of features, which can then be used in subsequent tasks, such as clustering.

Sampling and Mapping

The first step of our process is sampling. Since users employ the sample visualization to guide the feature fusion process, it is important to have all possible data structures of the different features represented. Therefore, we recover samples from each different set of features so as to faithfully represent the distribution of each individual set.

In this process, we extract samples from each set separately using a cluster-based strategy. We employ the k-means algorithm to create clusters, getting the medoid of each cluster as a sample, where is the number of instances in the raw dataset . We set the number of clusters to since this is considered a good heuristic for the upper-bound number of clusters in a data set [33]. After extracting the sample sets , we merge their indexes, defining a unified set of indexes. Then we recreate the sets to have the instances with the indexes contained in the unified set of indexes. Therefore, all sample sets have the same instances, which is mandatory for the sample visualization given that we visualize the combination of all features . We also guarantee that the structures defined by the different types of features are represented by the samples. Notice that the combined sample features will have at most instances, enhancing the probability of having samples that represent the distribution of each individual set of features while not hampering the computational complexity of the overall process since .
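The index-merging step described above can be sketched as follows. The example assumes that the per-feature sample indexes (for instance, the cluster medoids) have already been computed; the function name and the toy data are hypothetical.

```python
def unify_samples(sample_indexes, feature_sets):
    """Merge per-feature sample indexes and rebuild every sample set from them.

    sample_indexes: one list of instance indexes per feature type.
    feature_sets:   one list of feature vectors per feature type, all in the
                    same instance order.
    """
    unified = sorted(set().union(*sample_indexes))
    samples = [[features[i] for i in unified] for features in feature_sets]
    return unified, samples

# Toy data: six instances described by two hypothetical feature types.
color = ["c0", "c1", "c2", "c3", "c4", "c5"]
texture = ["t0", "t1", "t2", "t3", "t4", "t5"]
unified, samples = unify_samples([[0, 3], [3, 5]], [color, texture])
```

After this step, every sample set contains the same instances in the same order, which is what allows the different representations to be combined row by row.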

After recovering the samples, we map them to a common -dimensional space, obtaining their vectorial representation so that we can combine them to obtain (for the sample visualization). In this process, each set of samples is mapped to preserving as much as possible the distance relationships in . We do this by minimizing


where and are instances in , is the distance between them, and and are the vectorial representations in the -dimensional space of and , respectively.

Besides preserving distance relationships, our mapping process aims to align the vectorial representations so that is placed as close as possible to without affecting the distance preservation of the individual mappings. This is necessary since the unified sample representation is calculated as a convex combination of these representations, that is, , with , and misalignments could result in meaningless unified representations. We first calculate the normalized average distance matrix combining the distance matrices of all sets of features, where is the distance matrix calculated from . Then we map to the -dimensional space using Equation (1). The idea is to use this average representation as a guide to align the vectorial representations, minimizing


where is the distance between two instances of the average vectorial representation.

Joining Equations (1) and (2), we define the function we optimize in our mapping process, seeking to preserve, as much as possible, the distance relationships of the original features in the vectorial representations while aligning them. This function is given by


where is used to control the relative importance of distance preservation and alignment in the produced vectorial representations. is a hyperparameter and can be changed to define a good trade-off between distance preservation and alignment.
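To make the trade-off concrete, the sketch below evaluates a combined objective on toy data. The exact forms of Equations (1) to (3) are not reproduced here, so the normalized stress, the mean squared alignment error, and the convex weighting by `lam` are assumptions chosen to agree with the description (larger values favor distance preservation, smaller values favor alignment). All names are hypothetical.

```python
import math

def stress(dist, y):
    """Squared error between target and embedded distances, normalized by the
    sum of squared target distances (assumed form of the stress)."""
    n = len(y)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    num = sum((dist[i][j] - math.dist(y[i], y[j])) ** 2 for i, j in pairs)
    den = sum(dist[i][j] ** 2 for i, j in pairs)
    return num / den

def alignment(y, guide):
    """Mean squared distance between each point and its counterpart in the guide."""
    return sum(math.dist(a, b) ** 2 for a, b in zip(y, guide)) / len(y)

def objective(dist, y, guide, lam):
    """Assumed convex combination of distance preservation and alignment."""
    return lam * stress(dist, y) + (1.0 - lam) * alignment(y, guide)

# The guide is a 3-4-5 triangle; y is the same triangle shifted by one unit.
guide = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
dist = [[math.dist(a, b) for b in guide] for a in guide]
y = [[1.0, 0.0], [4.0, 0.0], [1.0, 4.0]]
values = [objective(dist, y, guide, lam) for lam in (1.0, 0.5, 0.0)]
```

In this toy case the shifted triangle preserves all distances perfectly but is misaligned with the guide, so the objective grows as `lam` decreases and the alignment term gains importance.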

To minimize Equation (3), we use a stochastic gradient descent approach with a polynomial decay learning rate. We set the initial learning rate to and the decay power to following common choices found in the literature [34]. Algorithm 1 outlines our mapping process. Function random selects samples from randomly, and function init() initializes the mapping, also randomly. We have tested a deterministic initialization using Fastmap [35], but the gain in quality does not justify the computational overhead. Notice that we normalize all features before this process, so that the Euclidean norm . Given the triangle inequality property (), this guarantees an upper limit for the maximum pairwise distance between features. Therefore, the distances are in the same range regardless of the type of feature or its dimensionality, avoiding biasing the process towards the type of feature with the largest maximum distance. In addition, we define the desired dimensionality of the resulting mappings as the largest intrinsic dimensionality of , calculated using maximum likelihood estimation [36]. Such dimensionality can also be defined by the user if the target dimensionality is known, for example, for visualization purposes.

calculate the average distance matrix
compute the dimensionality reduction of
for  do
end for
function mapping(, , )
    initialize the dimensionality reduction
   for  do
       polynomial decay of the learning rate
       get random samples from
      for  do
         for  do
         end for
      end for
   end for
end function
Algorithm 1: Algorithm for mapping different feature sets to a common vectorial space.
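For readers who prefer code, a mapping of this kind could be sketched in Python as below. This is not a transcription of Algorithm 1: the per-pair gradient, the default values of `lr0`, `power`, and `epochs`, and the parameter name `lam` are assumptions made for illustration, and the loss is the assumed form lam * (distance error) + (1 - lam) * (alignment error), up to constant factors.

```python
import math
import random

def sgd_mapping(dist, guide, lam=0.5, epochs=200, lr0=0.1, power=2.0, seed=0):
    """Embed n instances so that pairwise distances follow `dist` while each
    point stays close to its counterpart in `guide` (an n x p list of lists)."""
    rng = random.Random(seed)
    n, p = len(guide), len(guide[0])
    # Random initialization of the mapping.
    y = [[rng.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(n)]
    for epoch in range(epochs):
        lr = lr0 * (1.0 - epoch / epochs) ** power  # polynomial decay
        order = list(range(n))
        rng.shuffle(order)  # visit instances in random order
        for i in order:
            for j in range(n):
                if i == j:
                    continue
                diff = [a - b for a, b in zip(y[i], y[j])]
                norm = math.sqrt(sum(d * d for d in diff)) or 1e-12
                pull = (norm - dist[i][j]) / norm  # distance-preservation term
                for k in range(p):
                    grad = lam * pull * diff[k] + (1.0 - lam) * (y[i][k] - guide[i][k])
                    y[i][k] -= lr * grad
    return y

guide = [[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
dist = [[math.dist(a, b) for b in guide] for a in guide]
aligned = sgd_mapping(dist, guide, lam=0.0)  # pure alignment: converges to the guide
```

With `lam=0.0` only the alignment term is active, so the result collapses onto the guide; intermediate values trade alignment against distance preservation.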

Weighted Feature Combination

Given the sample vectorial representations , we build a set of functions using the process defined in [37] to map each feature set into its vectorial representation , preserving as much as possible the distance relationships while obeying the geometry defined in . In this process, each instance is mapped to the -dimensional space through an orthogonal local affine transformation , where is the dimensionality of .

The affine transformation associated with is defined so as to minimize:


where , with the original feature representation of the -th sample in .

Equation (4) can be re-written in the matrix form , where denotes the Frobenius norm, is a diagonal matrix with entries , and and are matrices with the -th row given by the vectors

Based on that, is computed as where and are obtained from the singular value decomposition of . Then the vectorial representation of is given by


Equation (4) is subject to , which avoids scale and shearing effects, therefore preserving the distance relationships of the input features. Also, notice that the sample vectorial representations dictate the geometry of the embeddings . Since they are aligned by the mapping process defined in the previous section, the linear combination can be performed to obtain the final embedding that incorporates the structures defined by each set of features, weighted according to the user's point of view. For more information about this affine transformation and how the sample vectorial representation controls the final results, please refer to [37].
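The weighted linear combination of aligned embeddings can be sketched as follows. The example assumes that all embeddings share the same instances and dimensionality and that the weights are non-negative; the weights are rescaled to sum to one so that the result is a convex combination. The function name and data are hypothetical.

```python
def combine(embeddings, weights):
    """Convex combination of aligned embeddings (one n x p list of lists per feature type)."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and not all zero")
    weights = [w / total for w in weights]
    n, p = len(embeddings[0]), len(embeddings[0][0])
    return [[sum(w * emb[i][k] for w, emb in zip(weights, embeddings))
             for k in range(p)] for i in range(n)]

# Two aligned toy embeddings of the same two instances.
emb_color = [[0.0, 0.0], [2.0, 2.0]]
emb_texture = [[2.0, 0.0], [4.0, 2.0]]
fused = combine([emb_color, emb_texture], [0.75, 0.25])
```

Each fused point lies on the segment between its per-feature positions, which is why misaligned embeddings would produce meaningless intermediate positions.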

Figure 2: Feature Combination Widget. Using the orange “dial” users can control the contributions of the different types of features to the final feature combination.

Feature Combination Widget

To visually support the feature sample combination, we create a widget inspired by the strategy presented in [38]. The idea is to position anchors (circles) representing each different set of features over a circumference, computing the weights according to their distances to a “dial” contained in the circumference. If are the coordinates of the anchor representing the feature on the plane and the coordinates of the “dial”, the weight related to is calculated as


To help the perception of the weights, we change the transparency level of the anchors and fonts according to . Figure 2 shows the combination widget. In this example, the “dial”, in orange, is closer to the anchor representing the feature , so the corresponding anchor is more opaque than the other anchors.
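A possible way to derive weights from the dial position is sketched below. The weight equation is not reproduced here, so the inverse-distance weighting, the normalization to a unit sum, and the equally spaced anchor layout are assumptions made for illustration; the function names are hypothetical.

```python
import math

def anchor_positions(k, radius=1.0):
    """Place k anchors at equally spaced angles on a circle centered at the origin."""
    return [(radius * math.cos(2 * math.pi * i / k),
             radius * math.sin(2 * math.pi * i / k)) for i in range(k)]

def dial_weights(anchors, dial, eps=1e-9):
    """Assumed inverse-distance weights, normalized so that they sum to one."""
    inverse = [1.0 / (math.dist(a, dial) + eps) for a in anchors]
    total = sum(inverse)
    return [v / total for v in inverse]

anchors = anchor_positions(4)
centered = dial_weights(anchors, (0.0, 0.0))   # dial at the center: equal weights
near_first = dial_weights(anchors, (0.9, 0.0))  # dial close to the first anchor
```

Under this assumption, moving the dial toward an anchor increases the weight of the corresponding feature, and the same normalized values could drive the opacity of the anchors.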

Results and Evaluation

In this section, we evaluate our mapping and feature combination processes using different datasets aiming at showing that the sample manipulation effectively controls the complete feature fusion. Next, we describe the employed datasets, detail how we extract features, and present our quantitative and qualitative evaluation.


Datasets

We use five datasets in our tests, named STL-10 [39], Animals [40], Zappos [41], CIFAR-10 [42], and Photographers [43]. These datasets come from a variety of different domains. STL-10 consists of images split into classes of different objects. Similarly, CIFAR-10 contains images of commonly seen object categories (e.g., animals, vehicles, and such) in lower resolution. The Animals dataset is more specific and is composed of images of animals in categories. Zappos is a dataset for shoes with images from split into shoe categories. Finally, the Photographers dataset consists of photos taken by well-known photographers. Table 1 summarizes the datasets, showing the number of instances and classes.

Name           Size      Classes
STL-10         13,000    10
Animals        30,475    50
Zappos         50,025    4
CIFAR-10       60,000    10
Photographers  181,948   41
Table 1: Datasets employed in the evaluations. We report the size and number of classes of each dataset.


We use distinct methods to extract features, representing low-level and high-level image components. Low-level means that the dimensions of the feature vector have no inherent meaning but capture basic image properties such as edges or color. High-level features have semantic meaning. For example, they denote the presence or absence of an object in the image.

For the low-level features, we represent (1) color with a LAB color histogram; (2) texture with Gabor filters [44] with orientations and scales; and (3) shape with the HOG technique [12] with a window size of . For the high-level features, we extract deep features from the pool5 layer using the pre-trained CaffeNet CNN [45]. This network was trained on approximately images to classify images into object categories.

We believe that these features are discriminative for our datasets. For example, we can differentiate a leopard from a panda using a texture extractor. Texture can identify the spots on a leopard and differentiate it from other animals. Similarly, color features can be helpful to recognize pandas, whose most common colors are black and white. Also, HOG is helpful to differentiate types of animals by their shape, e.g., quadrupeds from birds. Finally, object recognition can complement the HOG descriptor. These examples can be generalized to other datasets as well.

Quantitative Evaluation

Figure 3: Comparison for distance preservation and alignment error varying . The best trade-off is achieved in the range . The lines connect the average values of the boxplots.

To confirm the quality of our approach, we quantitatively evaluate our mapping and feature combination processes. For the mapping process evaluation, the five datasets of Table 1 are randomly sampled times, reducing them to of their original sizes. We sample the data because the mapping process cannot be executed on large datasets, given its O() memory footprint. Due to the random initialization (see Algorithm 1), we repeat the mapping process test times. Each different feature from the dataset has its own dimensionality. To ensure a common dimensional space, we calculate the intrinsic dimensionality for each of them and choose the smallest value. This value is used to do the mapping. The minimum values of intrinsic dimensionality are , , , , and for the STL-10, Animals, Zappos, CIFAR-10, and Photographers datasets, respectively.

We use stress and alignment error to evaluate the mapping process (see Equation (1) and Equation (2), respectively). We summarize our results in Figure 3 varying the value of . The stress boxplots (in orange) decrease as increases. On the other hand, alignment boxplots (in blue) have the opposite behavior. This is the expected outcome since larger values of preserve the distance relationships, whereas small values align the data.

Setting preserves as much as possible the original distance relationships. This is reflected in an average stress of , but it does not ensure a good alignment (average alignment of ). On the other hand, delivered an almost perfect alignment (average alignment of ), but it does not enforce the distance preservation (average stress of ). In this paper, we are interested in the best trade-off between distance preservation and alignment, so that the alignment is obtained without penalizing the overall distance preservation of the mappings. According to our experiments, we achieved this in the range , where both stress and alignment errors are nearly (see Figure 3).

For a qualitative evaluation, we generate two-dimensional representations of the samples using our mapping process, setting the target dimensionality to two. We show the results for the STL-10, Zappos, and CIFAR-10 datasets in Figures 4, 5, and 6, respectively. In these figures, the points are colored according to image classes. The stress and alignment error values are shown in the top-left corner of each scatterplot. To show the influence of different in the mapping process, we vary it in the range . The first column shows the result produced using , best preserving the original distance relationships. Notice that the visual representations of the different features are misaligned among themselves. The second column depicts results with . Now, the 2D mappings start to align (points representing images of the same class are placed in close positions). We observe a small increase in the stress error, but the alignment error decreases considerably compared to the first column (see the second measure in the top-left corner). The same behavior is verified in the remaining columns. The last column aligns almost all features completely. As expected, as lambda decreases, the distance preservation also decreases (stress increases), and the alignment improves (alignment error decreases). However, the stress changes are minimal. Hence, our approach is capable of producing a good alignment between features while preserving distance relationships. Similar behavior can be observed in Figure 5 and Figure 6.

Figure 4: Resulting 2D mapping process for the STL-10 dataset. As decreases, the features get more aligned (see column 5). Top-left numbers correspond to stress and alignment error.
Figure 5: Resulting 2D mapping process for the Zappos dataset. As decreases, the features get more aligned (see column 5). Top-left numbers correspond to stress and alignment error.
Figure 6: Resulting 2D mapping process for the CIFAR-10 dataset. As decreases, the features get more aligned (see column 5). Top-left numbers correspond to stress and alignment error.

For the feature combination, we assess the degree to which the distance relationships of the sample are preserved in the feature fusion of the whole dataset, intending to demonstrate the effect of the user's sample manipulation on the produced dataset. In this evaluation, we first randomly generate different weight combinations summing up to and apply them to the sample data. Then, we reuse these weights for the whole data fusion and measure whether the distance relationships induced by the weights on the sample are present in the whole dataset. We use the Nearest Neighbor Measure (NNM) [46] in this analysis.

NNM quantifies the similarity of each instance in the whole data with its nearest neighbor in the sampled data. NNM is given by Equation (7), where is the smallest distance between the instance in the complete dataset and the instances in the sample, and denotes the number of instances. The authors normalized each dimension of the data to the range . However, this results in the loss of the magnitude of the dimensions. So, we replace the per-dimension normalization with a unit-vector normalization per instance to avoid such an effect. The output of NNM is in the interval , with larger values indicating better results.
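A nearest-neighbor score in the spirit of NNM can be sketched as follows. Equation (7) is not reproduced here, so the specific form used below (one minus the mean nearest-neighbor distance, divided by 2 because unit vectors are at most 2 apart) is an assumption that only mirrors the description: per-instance unit normalization, nearest neighbor in the sample, and larger values indicating better results. The function names are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def nearest_neighbor_score(full, sample):
    """Assumed NNM-like score in [0, 1]; 1 means every instance coincides
    with some sample instance after unit normalization."""
    full_u = [unit(v) for v in full]
    sample_u = [unit(v) for v in sample]
    nearest = [min(math.dist(v, s) for s in sample_u) for v in full_u]
    return 1.0 - sum(nearest) / (2.0 * len(nearest))

perfect = nearest_neighbor_score([[1, 0], [0, 2], [3, 0]], [[1, 0], [0, 1]])
partial = nearest_neighbor_score([[1, 0], [-1, 0]], [[1, 0]])
```

In the second call, one instance coincides with the sample and the other is diametrically opposed, so the score falls halfway between the best and worst cases.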


We compare the NNM values of our feature fusion with two baselines: feature concatenation and distance fusion (see Section Related Work). The boxplots in Figure 7 show that our approach outperforms the two baselines by at least . The mean value for our method is , and the baselines achieve and , respectively. Hence, our method more accurately preserves the data distribution of the sample in the whole dataset fusion.

Figure 7: NNM evaluation. We compare our approach of user-guided feature fusion (light green box), with two baselines: feature concatenation, and feature distance combination. Our feature fusion strategy surpasses current state-of-the-art strategies, indicating that the similarity patterns observed in the sample data combination are preserved on the complete dataset combination.

Qualitative Evaluation

Besides the quantitative evaluation, we also present an example based on projections for qualitative evaluation. The reasoning is to project the complete combined dataset (), showing that the patterns observed in the sample projection () are preserved in the complete projection. In this example, we use our approach to explore large photo collections considering different user perspectives about similarity among images. We use the Photographers dataset. In addition to the features described in Section Datasets, we create a new set of features to describe each photographer. We use Wikipedia articles about each photographer and construct a bag-of-words vector to represent them. Photos of the same photographer share the same feature vector, and the similarity among photos is defined as the similarity between the texts describing the photographers.

As explained before, based on a sample, users can combine different features employing the combination widget (see Figure 2) until the sample visualization reflects a particular understanding regarding the similarity among photos. Figure 8 shows three different combinations. The first (Figure 8(a)) gives more importance to color and objects contained in photos and little importance to information about photographers. The second (Figure 8(b)) is defined taking the idea of photographic style from [43], fusing objects and Wikipedia features. Finally, the third (Figure 8(c)) shows the result of combining texture, borders, and a small amount of color.

Figure 8: User-defined similarity configurations. Based on a small sample, users can interactively combine different features, seeking the combination that best approaches a particular point of view. This combination is then propagated to the entire dataset for a complete projection.

Once the feature combination has been defined reflecting the user's point of view, a projection representing the complete photo collection is constructed. Figure 9 shows the produced layout using the weights defined in Figure 8(a). In this figure, since color is an important feature, we observe a clear separation between black-and-white and colorful images. Also, given the weight assigned to the features representing objects, it is possible to notice a separation among photos of people, landscapes, and houses in certain regions of the figure. We zoom in on two small portions of the projection (at the top and at the right side) to show this effect. In the colored images (right), we observe images with sky and forest. In the gray images (top), we observe houses, sky, and forest.

Figure 10 depicts the final projection using the weights defined in Figure 8(b). In this figure, we zoom in on a region at the bottom-left. We mainly find portrait images in the zoomed region. Remember that in this weight combination, our goal was to represent the photographic style. The selected photos are from two well-known photographers, Van Vechten and Curtis, who mostly work with portraits, presenting similar styles [43]. These examples qualitatively attest that the similarity patterns observed in the sample projection are present in the complete projection, corroborating the quantitative results measured using the NNM index.

Figure 9: Photographers dataset projection using the weight combination of Figure 8(a). Since a larger weight is assigned to the color feature, a clear global separation between black-and-white and colorful photos can be observed. This configuration also considers the presence of objects and photographer information.
Figure 10: Photographers dataset projection using the weight combination of Figure 8(b). A larger weight is assigned to the object and photographer features. Photos with similar visual features are grouped. The zoomed-in region (bottom-left) shows photos of well-known photographers that share similar styles (portrait photos).

User-guided Clustering

One of the most appealing application scenarios for our approach is assisting unsupervised strategies, such as clustering techniques. Clustering techniques seek to split sets of data instances into groups so that instances belonging to the same group are more similar to each other than to those in other groups. Therefore, clustering is a subjective task that depends on the way similarity is computed, and the ability to explicitly control and understand similarity is the benefit our approach offers.

Next, we present an example of using our approach to control the clustering results for a sample of the photographers dataset. In this example, we define different weights for the features and observe how they influence the composed groups. In Figure 11, we analyze a transition between the color and Wikipedia features: the sequence starts with the color feature only, and the color weight decreases as the Wikipedia weight increases, until only the Wikipedia feature is considered. We generate new fused features at each intermediate state and, for each combination, compute clusters using the Mean-shift algorithm 47. We opt for this algorithm because it does not require the number of clusters as input, so the produced results directly reflect the provided similarity (or combination of features).
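The weight transition and per-state clustering can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from this paper: the weights are assumed to lie in [0, 1] and sum to one, the fusion is replaced by a simple weighted concatenation (a stand-in, not the fusion method proposed here), and Mean-shift is implemented with a flat kernel and a hypothetical `bandwidth` value. The toy `color` and `wiki` vectors are invented for illustration only.

```python
import math

def weight_schedule(steps=9):
    """(color_weight, wikipedia_weight) pairs from color-only to Wikipedia-only.
    Assumption: weights lie in [0, 1] and sum to one."""
    return [(1 - k / (steps - 1), k / (steps - 1)) for k in range(steps)]

def fuse(color, wiki, w_color, w_wiki):
    """Stand-in fusion: weighted concatenation of two feature vectors per instance."""
    return [[w_color * v for v in c] + [w_wiki * v for v in w]
            for c, w in zip(color, wiki)]

def mean_shift(points, bandwidth, iters=50, tol=1e-6):
    """Flat-kernel mean shift; returns one cluster label per point."""
    modes = [list(p) for p in points]
    for _ in range(iters):
        moved = 0.0
        new_modes = []
        for m in modes:
            # Shift each mode to the centroid of the points inside its window.
            near = [p for p in points if math.dist(m, p) <= bandwidth]
            centroid = [sum(c) / len(near) for c in zip(*near)]
            moved = max(moved, math.dist(m, centroid))
            new_modes.append(centroid)
        modes = new_modes
        if moved < tol:
            break
    # Modes that converged close to each other form one cluster.
    centers, labels = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels

# Toy data (hypothetical): four photos, one color value and one text value each.
color = [[0.0], [0.1], [1.0], [1.1]]
wiki = [[0.0], [1.0], [0.0], [1.0]]

for w_color, w_wiki in weight_schedule(9):
    labels = mean_shift(fuse(color, wiki, w_color, w_wiki), bandwidth=0.5)
    print(f"color={w_color:.3f} wikipedia={w_wiki:.3f} -> {labels}")
```

Because the number of clusters is not an input, the grouping at each step depends only on the fused representation and the bandwidth, which is the property that motivates the choice of Mean-shift in the text.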

We display the different clustering configurations (one for each combination) using parallel sets 48. On the parallel sets, the vertical axes represent different clusterings , where indicates a different weight combination of features. All axes contain a set of groups, where different colors represent different groups. Curves between axis and are colored using the colors of the groups in . This coloring scheme improves the perception of membership changes between different clustering results. To reduce clutter, we implement a simple filtering strategy to remove non-relevant curves. For each group in , we evaluate how many instances from this group are redistributed in the groups in . If the quantity is less than a percentage threshold, the curves representing these instances are removed. This threshold is a user parameter and can be adjusted accordingly.
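The filtering strategy can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes that the percentage is computed relative to the size of the source group, and the function name `filtered_flows` and the label lists are hypothetical.

```python
from collections import Counter

def filtered_flows(labels_prev, labels_next, threshold):
    """Count instances moving between groups of two consecutive clusterings,
    keeping only flows that carry at least `threshold` (a fraction in [0, 1])
    of their source group. Assumption: the share is relative to the source group."""
    if len(labels_prev) != len(labels_next):
        raise ValueError("both clusterings must label the same instances")
    group_sizes = Counter(labels_prev)
    flows = Counter(zip(labels_prev, labels_next))
    return {
        (src, dst): count
        for (src, dst), count in flows.items()
        if count / group_sizes[src] >= threshold
    }

# Ten instances in group 0 and four in group 1; one instance leaves group 0.
labels_prev = [0] * 10 + [1] * 4
labels_next = [0] * 9 + [1] + [1] * 4

# The single migrating instance is 10% of group 0, below the 20% threshold,
# so its curve would not be drawn.
print(filtered_flows(labels_prev, labels_next, threshold=0.2))
```

Each remaining key corresponds to one band of curves between two adjacent axes, and its count gives the band thickness.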

Figure 11 shows parallel sets with nine axes representing clustering results, with a filtering threshold equal to . axis represents the results for the color feature only (no Wikipedia feature is considered). It has two groups, one with colorful and the other with black-and-white photos. shows fused features with weight to color and weight to Wikipedia. Most of the two groups presented in remain in , but some instances change their membership.

From , more groups are composed, and the division between colorful and black-and-white photos is lost. Finally, represents the clustering for the Wikipedia feature only (no color features). Note that from clusterings to , the groups are more stable, that is, most of the items in a certain group tend to be assigned to the same group as increases. To analyze the semantic meaning of the groups, we select the purple group () from , and we check its corresponding instances backward. Photos of that group were taken by Brumfield, Gottscho, and Horydczak, three iconic American photographers. We map the data from to the visual space using the force-scheme technique 49. Figure 12(d) shows the result, where each photo border is colored with its group color. As can be observed, the photos are similar in content and appearance: the work of Brumfield, Gottscho, and Horydczak focuses on architectural photography. We also observe that there is a mixture of colorful and black-and-white photos in this group. However, clustering shows a clear separation between these two types of photos (Figure 12(a)). Looking at the sequence of curves from clustering to , it is possible to analyze the group and see when these photos are merged backward. We highlight the path in the parallel sets with darker colors for easy navigation. From to , the groups are stable. Instances of those groups are also projected and depicted in Figures 12(d) and 12(c). In , is formed by instances from , and groups. Corresponding instances from are mapped in Figure 12(b). Note that in , colorful and black-and-white photos are mixed.

Figure 11: Using parallel sets to visualize cluster formation. The parallel sets visualization shows nine clustering results computed using the Mean-shift algorithm. Axes and represent the clustering results for color and Wikipedia features, respectively. Intermediate clusterings denote combinations of these features.
Figure 12: Projections produced by the Force-scheme technique for the purple group () of instances in . The color of the border indicates the group the photo belongs to. The purple group instances are selected from , and and mapped in (a), (b) and (c), respectively. The projection of the purple group in is shown in (d). In (a), two groups are visible. In (b), these groups are less separated. In both (c) and (d), there is only one group according to the clustering technique. However, in (d) there is a small group inside this group that distinguishes three photographers with similar styles.
Figure 13: Similarity Matrix for different weight combinations. represents the color feature only, showing two groups (brown areas on the main diagonal). represents the Wikipedia feature only and has several small groups in the diagonal and two major intersecting groups. and have two major brown areas but with different sizes. These visualizations show how the different weight combinations influence the similarity calculation between instances, matching the group formation presented by the parallel sets.

Parallel sets are useful tools to show the difference between clustering results. However, they do not show the similarity relationships between instances. To explore clusters and the relationships between instances, we also visualize the pairwise dissimilarity matrix produced from a given feature combination as a heatmap. In our representation, similar items are rendered in brown, whereas dissimilar ones are rendered in pale orange. The order of the rows and columns is given by the position of the leaves in a dendrogram generated by average-linkage hierarchical clustering 50, 51, 52.
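The reordering step can be illustrated with a naive sketch of average-linkage agglomerative clustering that records the leaf order and permutes the matrix accordingly. It is a cubic-time illustration for small samples, not the implementation used in this work, and the function names and toy values are hypothetical. A dendrogram admits several valid leaf orders; this sketch returns one of them.

```python
def average_linkage_order(dist):
    """Leaf order produced by naive average-linkage agglomerative clustering.
    `dist` is a symmetric list-of-lists dissimilarity matrix."""
    clusters = [[i] for i in range(len(dist))]

    def average(a, b):
        # Average pairwise dissimilarity between two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = average(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        # Concatenating the members keeps each sub-tree contiguous in the order.
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return clusters[0] if clusters else []

def reorder(dist, order):
    """Apply the same permutation to the rows and columns of the matrix."""
    return [[dist[i][j] for j in order] for i in order]

# Toy data: four instances on a line, forming two tight pairs.
values = [0.0, 10.0, 0.5, 10.5]
dist = [[abs(a - b) for b in values] for a in values]

order = average_linkage_order(dist)
for row in reorder(dist, order):
    print(row)
```

After reordering, instances of the same group occupy adjacent rows and columns, so groups appear as dark blocks along the main diagonal of the heatmap.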

Figure 13 shows dissimilarity matrices using the same weight combinations that generate , , and on the parallel sets. In Figure 13(a), we can spot two groups (two brown areas on the main diagonal). The colored margins indicate the groups assigned to the instances by the clustering algorithm. Note that some instances from the green group appear in the other group, denoting a potential problem with the clustering algorithm. Similar behavior can also be observed in Figure 12(a). In Figure 13(b), the two major groups remain, but sub-groups can now be noticed inside the larger ones. Figure 13(c) also shows two big brown areas on the diagonal. However, these groups have the same size. In the previous matrix, one group is bigger than the other because the color feature has more weight and the dataset has more black-and-white than colorful photos. In Figure 13(c), the Wikipedia feature begins to contribute more to the combination, forming groups that gather photos according to style and color. Finally, in Figure 13(d), there are several groups on the main diagonal and two major groups that intersect. A possible explanation is that some photographers tend to shoot similar object categories but come from different schools of thought 43. Looking at the purple part, we can see some sub-groups, each representing a photographer with a similar style. These sub-groups are also shown in Figure 12(d).


In this paper, we proposed a novel approach for feature fusion that allows users to incorporate their knowledge into the fusion process. It is a two-step strategy where, starting from a small sample of the input data, users can easily test different feature combinations and check the resulting similarity relationships in real time. Once a combination that matches the user's expectation is defined, it is propagated to the whole dataset through an affine transformation. Our experiments show that the combination over the complete dataset preserves the similarities of the sample configuration, making our approach a very flexible mechanism to assist the feature fusion process.

We have applied the proposed feature fusion approach to allow users to control and understand the results of clustering techniques. Clustering is one of the most attractive application scenarios for our approach given the subjectiveness involved in unsupervised tasks. Currently, visualization-assisted clustering techniques only allow users to add knowledge by changing technique parameters 53, 54, 55, 56. Enabling users to guide the input feature configuration provides much more flexible control, since users can explicitly steer the semantics of the input data and the similarity relationships (e.g., images are similar due to color vs. images are similar due to the presence of objects), consequently controlling the reason for the cluster formation while allowing an easy interpretation of the composed groups.


  • 1 Bostrom H, Andler SF, Brohede M et al. On the definition of information fusion as a field of research. 2007.
  • 2 Xu R and Wunsch D II. Survey of clustering algorithms. Trans Neur Netw 2005; 16(3): 645–678. DOI:10.1109/TNN.2005.845141.
  • 3 Tan PN, Steinbach M and Kumar V. Introduction to Data Mining, (First Edition). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2005. ISBN 0321321367.
  • 4 Nonato LG and Aupetit M. Multidimensional projection for visual analytics: Linking techniques with distortions, tasks, and layout enrichment. IEEE Transactions on Visualization and Computer Graphics 2018; : 1–1. DOI:10.1109/TVCG.2018.2846735.
  • 5 Sacha D, Zhang L, Sedlmair M et al. Visual interaction with dimensionality reduction: A structured literature analysis. IEEE Transactions on Visualization and Computer Graphics 2017; 23(1): 241–250. DOI:10.1109/TVCG.2016.2598495.
  • 6 Mangai UG, Samanta S, Das S et al. A survey of decision fusion and feature fusion strategies for pattern classification. IETE Technical Review 2010; 27(4): 293–307. DOI:10.4103/0256-4602.64604.
  • 7 Sudha D and Ramakrishna M. Comparative study of features fusion techniques. 2017 International Conference on Recent Advances in Electronics and Communication Technology (ICRAECT) 2017; : 235–239.
  • 8 Anne KR, Kuchibhotla S and Vankayalapati HD. Acoustic Modeling for Emotion Recognition. Springer Publishing Company, Incorporated, 2015. ISBN 3319155296, 9783319155296.
  • 9 Kuang H, Chan LL, Liu C et al. Fruit classification based on weighted score-level feature fusion. J Electronic Imaging 2016; 25(1): 013009. DOI:10.1117/1.JEI.25.1.013009.
  • 10 Wang X, Han TX and Yan S. An hog-lbp human detector with partial occlusion handling. 2009 IEEE 12th International Conference on Computer Vision 2009; : 32–39.
  • 11 Ahonen T, Hadid A and Pietikäinen M. Face recognition with local binary patterns. In Proc. of 9th Euro15 We. pp. 469–481.
  • 12 Dalal N and Triggs B. Histograms of oriented gradients for human detection. In CVPR. pp. 886–893.
  • 13 Manshor N, Rahiman AR, Mandava R et al. Feature fusion in improving object class recognition 2012; 8: 1321–1328.
  • 14 Lowe DG. Object recognition from local scale-invariant features. In Proc. of the International Conference on Computer Vision ICCV, Corfu.
  • 15 C Gonzalez R, E Woods R and L Eddins S. Digital image processing using matlab 2004; 1.
  • 16 Chu J, Guo Z and Leng L. Object detection based on multi-layer convolution feature fusion and online hard example mining. IEEE Access 2018; : 1–1.
  • 17 Chun YD, Kim NC and Jang IH. Content-based image retrieval using multiresolution color and texture features. IEEE Transactions on Multimedia 2008; 10(6): 1073–1084. DOI:10.1109/TMM.2008.2001357.
  • 18 Loni B, Khoshnevis SH and Wiggers P. Latent semantic analysis for question classification with neural networks. In 2011 IEEE Workshop on Automatic Speech Recognition Understanding. pp. 437–442.
  • 19 Loni B, Van Tulder G, Wiggers P et al. Question classification by weighted combination of lexical, syntactic and semantic features. In Proceedings of the 14th International Conference on Text, Speech and Dialogue. TSD 11, Berlin, Heidelberg: Springer-Verlag. ISBN 978-3-642-23537-5, pp. 243–250.
  • 20 Ma G, Yang X, Zhang B et al. Multi-feature fusion deep networks. Neurocomput 2016; 218(C): 164–171. DOI:10.1016/j.neucom.2016.08.059.
  • 21 You T and Tang Y. Visual saliency detection based on adaptive fusion of color and texture features. In 2017 3rd IEEE International Conference on Computer and Communications (ICCC). pp. 2034–2039. DOI:10.1109/CompComm.2017.8322894.
  • 22 Yu W and Zhu Q. Quick retrieval method of massive face images based on global feature and local feature fusion. In 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI). pp. 1–6. DOI:10.1109/CISP-BMEI.2017.8301987.
  • 23 Degani A, Dalai M, Leonardi R et al. A heuristic for distance fusion in cover song identification. In 2013 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS). pp. 1–4.
  • 24 Huang ZC, Chan PPK, Ng WWY et al. Content-based image retrieval using color moment and gabor texture feature. In 2010 International Conference on Machine Learning and Cybernetics, volume 2. pp. 719–724.
  • 25 Vadivel A, Majumdar AK and Sural S. Characteristics of weighted feature vector in content-based image retrieval applications. In International Conference on Intelligent Sensing and Information Processing, 2004. Proceedings of. pp. 127–132. DOI:10.1109/ICISIP.2004.1287638.
  • 26 Liu P, Guo JM, Wu CY et al. Fusion of deep learning and compressed domain features for content-based image retrieval. IEEE Transactions on Image Processing 2017; 26(12): 5706–5717.
  • 27 Kim K, Lin H, Choi JY et al. A design framework for hierarchical ensemble of multiple feature extractors and multiple classifiers. Pattern Recogn 2016; 52(C): 1–16. DOI:10.1016/j.patcog.2015.11.006.
  • 28 Mendes-Moreira J, Soares C, Jorge AM et al. Ensemble approaches for regression: A survey. ACM Comput Surv 2012; 45(1): 10:1–10:40. DOI:10.1145/2379776.2379786.
  • 29 Dietterich TG. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems. MCS ’00, London, UK: Springer-Verlag. ISBN 3-540-67704-6, pp. 1–15.
  • 30 Schneider B, Jackle D, Stoffel F et al. Visual Integration of Data and Model Space in Ensemble Learning. In Symposium on Visualization in Data Science (VDS) at IEEE VIS 2017 (BEST PAPER award).
  • 31 Woźniak M, Graña M and Corchado E. A survey of multiple classifier systems as hybrid systems. Inf Fusion 2014; 16: 3–17. DOI:10.1016/j.inffus.2013.04.006.
  • 32 Xia R, Zong C and Li S. Ensemble of feature sets and classification algorithms for sentiment classification. Inf Sci 2011; 181: 1138–1152.
  • 33 Pal NR and Bezdek JC. On cluster validity for the fuzzy c-means model. IEEE Transactions on Fuzzy Systems 1995; 3(3): 370–379. DOI:10.1109/91.413225.
  • 34 Wilson DR and Martinez TR. The need for small learning rates on large problems. In IJCNN’01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222), volume 1. pp. 115–119 vol.1. DOI:10.1109/IJCNN.2001.939002.
  • 35 Faloutsos C and Lin KI. Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. SIGMOD Rec 1995; 24(2): 163–174. DOI:10.1145/568271.223812.
  • 36 Levina E and Bickel PJ. Maximum likelihood estimation of intrinsic dimension. In Proceedings of the 17th International Conference on Neural Information Processing Systems. NIPS’04, Cambridge, MA, USA: MIT Press, pp. 777–784.
  • 37 Joia P, Coimbra D, Cuminato JA et al. Local affine multidimensional projection. IEEE Transactions on Visualization and Computer Graphics 2011; 17(12): 2563–2571. DOI:10.1109/TVCG.2011.220.
  • 38 Pagliosa P, Paulovich FV, Minghim R et al. Projection inspector: Assessment and synthesis of multidimensional projections. Neurocomputing 2015; 150, Part B: 599–610.
  • 39 Coates A, Ng AY and Lee H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011. pp. 215–223.
  • 40 Lampert C, Nickisch H and Harmeling S. Learning to detect unseen object classes by between-class attribute transfer. In CVPR 2009. Max-Planck-Gesellschaft, Piscataway, NJ, USA: IEEE Service Center, pp. 951–958.
  • 41 Yu A and Grauman K. Fine-grained visual comparisons with local learning. In Computer Vision and Pattern Recognition (CVPR).
  • 42 Krizhevsky A. Learning multiple layers of features from tiny images. Technical report, 2009.
  • 43 Thomas C and Kovashka A. Seeing behind the camera: Identifying the authorship of a photograph. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • 44 Chen L, Lu G and Zhang D. Effects of different gabor filter parameters on image retrieval by texture. In Proceedings of the 10th International Multimedia Modelling Conference. MMM ’04, Washington, DC, USA: IEEE Computer Society. ISBN 0-7695-2084-7, pp. 273–.
  • 45 Jia Y, Shelhamer E, Donahue J et al. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. MM ’14, New York, NY, USA: ACM. ISBN 978-1-4503-3063-3, pp. 675–678. DOI:10.1145/2647868.2654889.
  • 46 Cui Q, Ward M, Rundensteiner E et al. Measuring data abstraction quality in multiresolution visualizations. IEEE Transactions on Visualization and Computer Graphics 2006; 12(5): 709–716.
  • 47 Comaniciu D and Meer P. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002; 24(5): 603–619. DOI:10.1109/34.1000236.
  • 48 Kosara R, Bendix F and Hauser H. Parallel sets: interactive exploration and visual analysis of categorical data. IEEE Transactions on Visualization and Computer Graphics 2006; 12: 558–568.
  • 49 Tejada E, Minghim R and Nonato LG. On improved projection techniques to support visual exploration of multidimensional data sets. Information Visualization 2003; 2(4): 218–231.
  • 50 Sokal RR and Michener CD. A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin 1958; 28: 1409–1438.
  • 51 Day W and Edelsbrunner H. Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification 1984; 1(1): 7–24. DOI:10.1007/BF01890115.
  • 52 Sander J, Qin X, Lu Z et al. Automatic extraction of clusters from hierarchical clustering representations. In Proceedings of the 7th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining. PAKDD ’03, Berlin, Heidelberg: Springer-Verlag. ISBN 3-540-04760-3, pp. 75–87.
  • 53 Kwon BC, Eysenbach B, Verma J et al. Clustervision: Visual supervision of unsupervised clustering. IEEE Transactions on Visualization and Computer Graphics 2018; 24(1): 142–151. DOI:10.1109/TVCG.2017.2745085.
  • 54 Kern M, Lex A, Gehlenborg N et al. Interactive visual exploration and refinement of cluster assignments. BMC Bioinformatics 2017; 18(1): 406. DOI:10.1186/s12859-017-1813-7.
  • 55 Bruneau P, Pinheiro P, Broeksema B et al. Cluster Sculptor, an interactive visual clustering system. Neurocomputing 2015; 150: 627 – 644. DOI:10.1016/j.neucom.2014.09.062.
  • 56 Jentner W, Sacha D, Stoffel F et al. Making Machine Intelligence Less Scary for Criminal Analysts: Reflections on Designing a Visual Comparative Case Analysis Tool. The Visual Computer Journal 2018; DOI:10.1007/s00371-018-1483-0.