# Algorithms for Item Categorization Based on Ordinal Ranking Data

## Abstract

We present a new method for identifying the latent categorization of items based on their rankings. Complementing a recent work that uses a Dirichlet prior on preference vectors and variational inference, we show that this problem can be effectively dealt with using existing community detection algorithms, with the communities corresponding to item categories. In particular, we convert the bipartite ranking data to a unipartite graph of item affinities and apply community detection algorithms. In this context we modify an existing algorithm, namely the label propagation algorithm, to a variant that uses the distance between nodes to weight the label propagation, in order to identify the categories. We propose and analyze a synthetic ordinal ranking model and show its relation to the recently much-studied stochastic block model. We test our algorithms on synthetic data and compare performance with several popular community detection algorithms. We also test the method on real data sets of movie categorization from the MovieLens database. In all of these cases our algorithm is able to identify the categories for a suitable choice of tuning parameter.

## 1 Introduction

In this paper we consider the problem of item categorization based on choice, preference, or ranking data from a number of voters. The literature using the voter-rating matrix has so far focused on voter categorization rather than item categorization [1]. We are partly motivated by a recent work [2], where item categorization based on choice statistics was considered: using a Dirichlet prior on the preferences of each user, coupled with a random utility model for making choices, the authors use a variational algorithm to infer the categories. In contrast to these approaches, in this paper a new ranking (choice) model among the categories is presented, followed by the use of community detection algorithms [3] for category discovery after converting the bipartite graph of ratings to a unipartite graph of item similarities.

In this context our contribution is two-fold: (a) we analyze the expected connectivity in the similarity graph and link our model to the recently much-studied stochastic block model, which implies that one can understand information-theoretic limits on the discovery of categories by directly using recent results in [5].^{1}

The rest of the paper is organized as follows. In the following section we introduce the generative ranking model used in the analysis of the algorithm, and the relation to the standard block model is explored through an analysis of the chosen similarity function. In Section 3 we present a modification of the label propagation algorithm that introduces a weighting of the labels to be selected at each round; the new algorithm gives larger weight to labels that are closer to their source vertices and less to labels that are farther away. Finally, in Section 4 we present the experimental results of the algorithm, both on synthetic data produced by the generative model and on real data from the MovieLens database [8].

## 2 Generation of a Synthetic Rating Model

Algorithm ? outlines the model used to generate the ranking data among categories. The parameters for this data generation are: the number of categories; the number of items in each category; the expected number of items each category will swap with every other category (referred to as the mixing parameter), which allows for choice variability; and the number of voters.

The main idea behind the synthetic ranking generation algorithm (Algorithm ?) is as follows, with the whole process repeated for each voter. First, the categories are ordered uniformly at random. Then each item of each category is given a unique random integer rank inside the contiguous block of values determined by the category's location in the ordering. This process produces an ordinal ranking of the items in which items of the same category initially sit in close proximity to one another.

Once all of the items are placed as such, *the mixing process begins*. In a new random order, generated separately from the ordering above, the categories each swap items with each of the other categories. These items are chosen uniformly at random from the items currently in the category. It is worth noting that if three categories swap in sequence, an item that swapped into the second category from the first could be the very one that is then swapped into the third. This is to say that the items swapped out of a category need not have originated in that category.
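A minimal Python sketch may make the generative process concrete. All names here are illustrative (the paper's symbols are not reproduced), and `num_swaps` stands in for the mixing parameter:

```python
import random

def generate_ranking(num_categories, items_per_category, num_swaps, rng=random):
    """One voter's ordinal ranking: categories are ordered at random,
    items get unique ranks inside their category's block, then every
    pair of categories exchanges `num_swaps` uniformly chosen items."""
    n = items_per_category
    order = list(range(num_categories))      # random category ordering
    rng.shuffle(order)

    rank = {}
    for pos, cat in enumerate(order):        # block of ranks for this slot
        block = list(range(pos * n, (pos + 1) * n))
        rng.shuffle(block)
        for offset, item in enumerate(range(cat * n, (cat + 1) * n)):
            rank[item] = block[offset]

    # Mixing: in a fresh random order, each pair of categories swaps the
    # ranks of items chosen uniformly from whatever currently occupies
    # each block (so previously swapped-in items may be swapped onward).
    mix_order = list(range(num_categories))
    rng.shuffle(mix_order)
    for i in range(num_categories):
        for j in range(i + 1, num_categories):
            pos_a = order.index(mix_order[i])
            pos_b = order.index(mix_order[j])
            for _ in range(num_swaps):
                in_a = [x for x, r in rank.items() if pos_a * n <= r < (pos_a + 1) * n]
                in_b = [x for x, r in rank.items() if pos_b * n <= r < (pos_b + 1) * n]
                x, y = rng.choice(in_a), rng.choice(in_b)
                rank[x], rank[y] = rank[y], rank[x]
    return rank
```

Repeating this once per voter yields the full ordinal dataset; the output is always a permutation of the ranks, since mixing only exchanges ranks between items.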

As mentioned, the above ranking generation is repeated for each of the voters to create a complete dataset of *ordinal* rankings. Once this is completed, the bipartite ranking data is converted into a unipartite graph using Algorithm ?. This algorithm is based on the idea of collapsing the paths between voters and items into direct edges between items. For each voter, each pair of items is compared using a similarity function, and the values of this similarity function determine whether or not there is to be an edge between the two items.

As edges in the final graph are to represent a strong relationship between elements, the similarity function should have a higher value for items that are rated similarly and a lower value for those whose ratings differ more. The reasoning is that users typically prefer similar things and will thus give similar ratings to items of the same category, while items of different categories should be rated differently due to preference. A variety of functions can be chosen, such as the cosine similarity measure [1] or the Pearson correlation coefficient [9]. In this paper a different function, introduced below, is used that is more amenable to analysis, as shown in the next section.
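The conversion step can be sketched as follows. The similarity used here, the inverse absolute rank difference accumulated over voters and compared against a threshold `tau`, is an assumption standing in for the paper's function, and the names are illustrative:

```python
from itertools import combinations

def build_item_graph(rankings, tau):
    """Collapse voter-item rankings into a unipartite item graph.

    `rankings`: one dict per voter mapping item -> unique ordinal rank.
    An edge {i, j} is added when the accumulated similarity (here the
    inverse absolute rank difference, summed over voters) exceeds `tau`.
    """
    items = sorted(rankings[0])
    score = {pair: 0.0 for pair in combinations(items, 2)}
    for rank in rankings:
        for i, j in score:
            # Ranks are unique per voter, so the difference is never zero.
            score[(i, j)] += 1.0 / abs(rank[i] - rank[j])
    return {pair for pair, s in score.items() if s > tau}
```

Items that most voters rank close together accumulate large scores and end up connected, which is exactly the affinity structure the community detection step relies on.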

### 2.1 Relation to the Stochastic Block Model (SBM)

In this section, we aim to discover the relationship between this model and the stochastic block model (SBM).^{2}

Since these probabilities are implicit, in the following sections we will analyze their expected behaviors.

### 2.2 Analysis of the Similarity Function

As the calculation of the intra- and intercategory edge probabilities is implicit, it is useful to calculate the expected values of the similarity function for elements in the same category and for those in different categories as a proxy for examining these measures. We first introduce the similarity function used in the algorithm and the analysis:

As discussed above, this function is chosen because it gives higher weight to elements that are closely ranked, such as those in the same category, than to elements whose ranks differ greatly. As the threshold should lead to the addition of only those edges that represent strong relationships, the expected value of the similarity function is used to set it. In this calculation it is assumed that every pair of elements is equally likely to be chosen.

Please see Appendix A for this calculation.

#### Analysis of Similarity Function for Elements of the Same Category

As mentioned above, since the edge probabilities are implicit, it is useful to consider the expected value of the similarity function between elements in the same category. Only the expected distance needs to be considered, as the value of the similarity function is inversely proportional to the distance. The expected distance is calculated by averaging the distance over all possible pairs of elements.

Situations involving mixing must also be considered. The total distance over all possible pairs of elements can be calculated regardless of where a newly swapped-in element lands; it can swap in at the position of any of the elements already in the category. To count the combinations, the possible locations swapped into the category are multiplied by the number of distinct element pairs, yielding the total number of distance combinations.

To get the total distance, we recognize that the new element will be paired with each element already in the category once per arrangement, except for the arrangement in which it replaces that element itself. Similarly, each of the standard intracategory pairings still occurs in every arrangement, with two exceptions: when either of its two elements is the one swapped out. Thus we multiply the standard sum of intracategory distances by the corresponding count. These values are combined to obtain the expected distance expressed below.

This idea extends to the general case of multiple swaps, with a set of elements swapping into the category.

This lemma follows by extending the ideas discussed in the formation of Lemma II.2.

Clearly the two equations are equivalent under the appropriate substitution, and the single-swap expression is recovered as a special case.

#### Analysis of Similarity Function for Elements in Different Categories

The analysis of the distance function for elements in different categories is slightly more difficult. As categories can lie at different distances from each other, a new variable is introduced to signify this category distance: it is 1 for two elements in adjacently ranked categories and 2 if the categories are two apart, e.g. category 1 and category 3.

Please see Appendix B.

We can also extend this formulation to the situation where several elements swap. As all the elements swapping in come from the other category, there is no need to introduce the extra variables seen in the intracategory comparison.

Please see Appendix B.

For the majority of our comparisons, the case of adjacent categories is considered, in which the excess category distance is always zero, as a greater separation is impossible. Fixing it at zero, the terms involving it can be disregarded, as they become the multiplicative identity.

#### Comparisons

In order to achieve a graph in which the community structure is discernible, we would like the expected intracategory similarity to exceed the expected intercategory similarity, so that edges within a category are more likely to be added than edges joining elements of different categories. Recalling that as a proxy we can instead use the relationship between the expected distances (which reverses the direction of the inequality), we require the expected intracategory distance to be smaller than the expected intercategory distance. As a base case, we compare the expected distance values without any mixing:

Remembering that we are primarily considering the adjacent-category case allows both sides to be simplified, after which the inequality clearly holds.

It is now important to consider the cases where the number of swaps is not zero. In Figure ? the results of Lemmas II.1–II.5 are combined to show the expected distance functions as a function of the number of elements swapped.

As the number of swaps increases, the expected distance between elements of different categories decreases and the expected distance between elements of the same category increases. Only once the mixing becomes large enough does the expected distance between elements of different categories dip beneath the threshold. Regardless of the amount of mixing, the expected intracategory distance remains lower than the intercategory distance. In the next section we introduce the weighted label propagation algorithm, which is used to uncover the community structure of the constructed graph.

## 3 Weighted Label Propagation Algorithm

In this section we introduce a modification of the traditional label propagation algorithm of [6]. As in the original, the algorithm starts with each node carrying its own unique label. At each iteration, a node's label is updated to the most common label among its neighbors, with ties broken uniformly at random. As the iterations continue, most labels disappear as many nodes take on the same label. The algorithm converges when no node's label changes between iterations, which is equivalent to every node having the most common label among its neighbors. At the conclusion, all nodes sharing the same label are grouped into communities.

In the original version, all neighboring labels have equal weight regardless of their location in the graph; the distance of a label from its source vertex is not considered at all. In the weighted version, we incorporate this distance in order to give labels that occur close to their source vertex more weight than labels that have traveled far from their source. This modification serves to better localize the clustering. A secondary benefit is that it prevents the entire graph from being classified as one community purely due to an increase in the frequency of a label.

As stated, this algorithm is very similar to the unweighted version, except that before assigning the label with the largest count, the label counts are re-weighted based on their distance from their source vertex. This weighting should be a decreasing function of the distance and can be chosen in a variety of ways; in this paper, a linear and an exponential weighting function were used.

The weighted algorithm can be performed in either a synchronous or asynchronous manner. The synchronous manner is the one described above and the asynchronous version differs only in the update step. Rather than always updating based on the label from the previous time step, the asynchronous version uses the new label of any nodes that have already been updated at the current time step.

The time complexity of this algorithm is also linear per iteration. The only difference is that it requires a preprocessing step to find the distances between all pairs of vertices in the graph, done by running Dijkstra's algorithm from each vertex at a cost of O(|V|(|E| + |V| log |V|)). However, since the graphs we are considering are relatively sparse, we have |E| = O(|V|), so this reduces to O(|V|^2 log |V|). We have omitted the analysis of the convergence of the weighted label propagation algorithm and intend to explore it in future work. Please see Appendix C for a comparison of the weighted label propagation algorithm with other algorithms.
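A compact sketch of the weighted variant, under illustrative naming, might look as follows. An exponential weighting 2^(-d) stands in for the linear and exponential choices mentioned above, and since the graphs are unweighted, BFS plays the role of the Dijkstra preprocessing step:

```python
import random
from collections import defaultdict, deque

def weighted_label_propagation(adj, weight=lambda d: 2.0 ** -d,
                               max_iter=100, seed=0):
    """Weighted label propagation on `adj` (node -> set of neighbours).

    Each neighbour's vote for its label is scaled by weight(d), where d
    is the shortest-path distance from the label's source vertex (labels
    are named after the node they started on). Updates are asynchronous
    and ties are broken uniformly at random.
    """
    rng = random.Random(seed)

    def bfs(src):                      # distances from src in an unweighted graph
        dist, q = {src: 0}, deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist

    dist = {u: bfs(u) for u in adj}    # preprocessing step
    label = {u: u for u in adj}        # every node starts with its own label

    for _ in range(max_iter):
        changed = False
        for u in sorted(adj, key=lambda _: rng.random()):  # random visit order
            votes = defaultdict(float)
            for v in adj[u]:
                src = label[v]
                votes[src] += weight(dist[src].get(u, len(adj)))
            if not votes:
                continue
            best = max(votes.values())
            new = rng.choice([l for l, w in votes.items() if w == best])
            if new != label[u]:
                label[u], changed = new, True
        if not changed:
            break
    return label
```

The asynchronous update described above corresponds to reading `label` in place during the sweep; a synchronous variant would instead compute all new labels from a frozen copy of the previous iteration.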

## 4 Experimental Results

### 4.1 Synthetic Data

In order to judge the success of the algorithm on the synthetic data, we compared the resulting community labeling to the initial categorization from the beginning of the algorithm. The metric used for this comparison was normalized mutual information (NMI) [10]. This metric ranges from 0 to 1, where 1 signifies a perfect match with the true community structure and 0 signifies no relationship with the true structure.
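For reference, NMI can be computed directly from the contingency counts. The sketch below uses the common normalization 2·I(A;B)/(H(A)+H(B)); implementations differ in the normalization chosen, so this is one reasonable variant rather than necessarily the one used in [10]:

```python
from collections import Counter
from math import log

def nmi(a, b):
    """Normalized mutual information between two labelings of the same
    items: 1.0 for identical partitions (up to renaming of labels),
    near 0.0 for independent ones."""
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    cab = Counter(zip(a, b))                      # joint label counts
    mi = sum((nij / n) * log(n * nij / (ca[i] * cb[j]))
             for (i, j), nij in cab.items())      # mutual information I(A;B)
    ha = -sum((c / n) * log(c / n) for c in ca.values())  # entropy H(A)
    hb = -sum((c / n) * log(c / n) for c in cb.values())  # entropy H(B)
    return 1.0 if ha + hb == 0 else 2 * mi / (ha + hb)
```

Note that NMI is invariant to renaming the communities, which is why it is a suitable score when the detected labels carry no intrinsic meaning.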

We can examine Figure 4 to see the effects of the voting multiplier and the mixing parameter on the accuracy of the model. Here the voting multiplier is the ratio of the number of voters to the total number of elements. As expected, the larger the voting multiplier, the higher the accuracy, regardless of the mixing parameter; increasing the mixing parameter shows a general trend of decreasing the NMI with the true categorization.

### 4.2 Real Data

In order to test the algorithm on real data, we used the MovieLens dataset [8], which consists of user ratings on a variety of movies. Users rate the movies on a 0–5 scale and may leave some movies unrated. We explored different thresholds for the similarity function and present the results below. In every experiment, the dataset consists of the union of all the categories considered.

Comparing Amityville Horror Movies and Kid’s Movies

Category 1:

Toy Story (1995), Lion King, The (1994), Aladdin (1992), Snow White and the Seven Dwarfs (1937), Alice in Wonderland (1951)

Category 2:

Aladdin and the King of Thieves (1996), Jungle Book, The (1994), Pocahontas (1995)

Category 3:

Amityville 1992: It’s About Time (1992), Amityville 3-D (1983), Amityville: A New Generation (1993), Amityville II: The Possession (1982), Amityville Horror, The (1979), Amityville Curse, The (1990)

Comparing Amityville Horror Movies and Star Trek Movies

Category 1:

Star Trek VI: The Undiscovered Country (1991), Star Trek: The Wrath of Khan (1982), Star Trek III: The Search for Spock (1984), Star Trek IV: The Voyage Home (1986), Star Trek: Generations (1994), Star Trek: The Motion Picture (1979)

Category 2:

Star Trek V: The Final Frontier (1989), Amityville 1992: It’s About Time (1992), Amityville 3-D (1983), Amityville: A New Generation (1993), Amityville II: The Possession (1982), Amityville Horror, The (1979), Amityville Curse, The (1990)

Comparing Star Wars and Star Trek Movies

Category 1:

Star Wars (1977), Empire Strikes Back, The (1980), Return of the Jedi (1983)

Category 2:

Star Trek VI: The Undiscovered Country (1991), Star Trek: The Wrath of Khan (1982), Star Trek III: The Search for Spock (1984), Star Trek IV: The Voyage Home (1986), Star Trek: Generations (1994), Star Trek: The Motion Picture (1979)

Category 3:

Star Trek V: The Final Frontier (1989)

Category 1:

Star Wars (1977), Empire Strikes Back, The (1980), Return of the Jedi (1983), Star Trek: The Wrath of Khan (1982)

Category 2:

Star Trek VI: The Undiscovered Country (1991), Star Trek III: The Search for Spock (1984), Star Trek IV: The Voyage Home (1986), Star Trek: Generations (1994), Star Trek: The Motion Picture (1979), Star Trek V: The Final Frontier (1989)

We are able to form a categorization of the chosen movies by examining the relationships among the ratings of all the voters. In each of the above situations, a clear distinction can be seen between the groups of movies under examination. Although an additional group is sometimes introduced, the groups reflect the separation implied by the chosen films.

## 5 Appendix

### 5.1 Calculation of the Expected Value of the Similarity Function

Here we calculate the expected value of the similarity function introduced in Section 2.2.

### 5.2 Calculation of the Expected Intercategory Distance

Below we present the formulation of the expected intercategory distance with multiple swaps; the single-swap version follows by direct substitution. As with the derivation of the intracategory distance, the intercategory distance is derived by examining the possible combinations of the elements to be compared.

The first quantity is the total sum of distances when the two elements are in separate categories: a cumulative distance over all possible pairs of elements in each arrangement of this type, multiplied by the total number of times comparisons between elements of different categories occur. Its first term counts arrangements in which neither of the elements swaps out of its own category; we can think of this as fixing the element we are considering and picking the elements to swap from the other elements of the category, for both categories. Its second term counts arrangements in which the two elements actually swap with each other, so that neither is in its true category; the number of ways of doing this requires each element to swap, with the remaining swapped elements picked from the rest of each category.

The second quantity is the total sum of distances when the two elements end up in the same category. The total distance for each occurrence of this comparison is as derived in Appendix A, multiplied by the number of category arrangements in which the comparison arises. We fix two elements from the category (one to swap and one to stay) and then pick the other swapped elements from the remaining ones. Such an occurrence happens regardless of what happens in the other category, leading to a factor equal to the number of possible outcomes of the second category. We then multiply by 2 twice: once because either element can be the one that stays, and once because this summation of distances must be considered for both categories.

Finally, the denominator is formed from the product of the number of possible swap arrangements and the number of comparisons per arrangement: each cluster picks its elements to swap, and each resulting arrangement contributes a fixed number of comparisons, as can be seen by simple counting.

### 5.3 Comparison of Weighted Label Propagation with Other Algorithms

In order to judge the weighted label propagation algorithm, we compared its ability to recover the community structure of a series of planted partition models against that of the CNM algorithm [7] and the regular label propagation algorithm [6]:

| | CNM | Regular | Weighted |
|---|---|---|---|
| Avg. num. of categories | 8.81 | 9.35 | 9.99 |
| Avg. NMI with truth | 0.9568 | 0.9764 | 0.9940 |
| Avg. modularity | 0.6998 | 0.6992 | 0.7020 |

| | CNM | Regular | Weighted |
|---|---|---|---|
| Avg. num. of categories | 8.97 | 9.29 | 10.02 |
| Avg. NMI with truth | 0.9644 | 0.9754 | 0.9970 |
| Avg. modularity | 0.7003 | 0.6969 | 0.7024 |

| | CNM | Regular | Weighted |
|---|---|---|---|
| Avg. num. of categories | 8.99 | 9.61 | 10.01 |
| Avg. NMI with truth | 0.9646 | 0.9873 | 0.9976 |
| Avg. modularity | 0.7066 | 0.7069 | 0.7089 |

As seen in all of the above examples, the weighted label propagation algorithm outperforms both the CNM algorithm and the regular label propagation algorithm in the accuracy of the number of categories found, the normalized mutual information (NMI) with the true partition, and the modularity of the resulting partition.

## 6 Acknowledgements

This research was supported by the NSF Research Experiences for Undergraduates (REU) program via grant NSF CCF-1319653.

### Footnotes

- We note that this conversion can also be applied to other latent category choice/ranking models and similarity measures, but the ensuing analysis, at present, seems quite complicated.
- Also equivalent to the planted partition model.

### References

1. G. Beigi, M. Jalili, H. Alvari, and G. Sukthankar, "Leveraging community detection for accurate trust prediction," in *Academy of Science and Engineering (ASE), USA*, 2014.
2. S. Agarwal, "On ranking and choice models," in *Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI)*, 2016. Available: http://www.shivani-agarwal.net/Publications/2016/ijcai16-ranking-choice-models-invited.pdf
3. S. Fortunato, "Community detection in graphs," *Physics Reports*, vol. 486, pp. 75–174, Feb. 2010.
4. M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," *Proceedings of the National Academy of Sciences*, vol. 99, no. 12, pp. 7821–7826, 2002. Available: http://www.pnas.org/content/99/12/7821.abstract
5. E. Abbe, A. S. Bandeira, and G. Hall, "Exact recovery in the stochastic block model," *IEEE Transactions on Information Theory*, vol. 62, no. 1, pp. 471–487, Jan. 2016.
6. U. N. Raghavan, R. Albert, and S. Kumara, "Near linear time algorithm to detect community structures in large-scale networks," *Physical Review E*, vol. 76, no. 3, p. 036106, Sep. 2007.
7. A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks," *Physical Review E*, vol. 70, no. 6, p. 066111, Dec. 2004.
8. F. M. Harper and J. A. Konstan, "The MovieLens datasets: History and context," *ACM Transactions on Interactive Intelligent Systems*, vol. 5, no. 4, pp. 19:1–19:19, Dec. 2015. Available: http://doi.acm.org/10.1145/2827872
9. M. MacMahon and D. Garlaschelli, "Community detection for correlation matrices," *arXiv e-prints*, Nov. 2013.
10. L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas, "Comparing community structure identification," *Journal of Statistical Mechanics: Theory and Experiment*, vol. 2005, no. 09, p. P09008, 2005. Available: http://stacks.iop.org/1742-5468/2005/i=09/a=P09008