Finding Desirable Objects under Group Categorical Preferences*

*To appear in Knowledge and Information Systems Journal (KAIS), 2015

Nikos Bikakis, National Technical University of Athens, Greece & ATHENA Research Center, Greece (bikakis@dblab.ntua.gr)
Karim Benouaret, Inria Nancy, France (karim.benouaret@inria.fr)
Dimitris Sacharidis, Technische Universität Wien, Austria (dimitris@ec.tuwien.ac.at)
Abstract

Considering a group of users, each specifying individual preferences over categorical attributes, the problem of determining a set of objects that are objectively preferable by all users is challenging on two levels. First, we need to determine the preferable objects based on the categorical preferences for each user, and second we need to reconcile possible conflicts among users’ preferences. A naïve solution would first assign degrees of match between each user and each object, by taking into account all categorical attributes, and then for each object combine these matching degrees across users to compute the total score of an object. Such an approach, however, performs two series of aggregation, among categorical attributes and then across users, which completely obscure and blur individual preferences. Our solution, instead of combining individual matching degrees, is to directly operate on categorical attributes, and define an objective Pareto-based aggregation for group preferences. Building on our interpretation, we tackle two distinct but relevant problems: finding the Pareto-optimal objects, and objectively ranking objects with respect to the group preferences. To increase the efficiency when dealing with categorical attributes, we introduce an elegant transformation of categorical attribute values into numerical values, which exhibits certain nice properties and allows us to use well-known index structures to accelerate the solutions to the two problems. In fact, experiments on real and synthetic data show that our index-based techniques are an order of magnitude faster than baseline approaches, scaling up to millions of objects and thousands of users.

Keywords:
Group recommendation Rank aggregation Preferable objects Skyline queries Collective dominance Ranking scheme Recommender systems


1 Introduction

Recommender systems have the general goal of proposing objects (e.g., movies, restaurants, hotels) to a user based on her preferences. Several instances of this generic problem have appeared over the past few years in the Information Retrieval and Database communities; e.g., AT05; IBS08; BOHG13; SKP11. More recently, there is an increased interest in group recommender systems, which propose objects that are well-aligned with the preferences of a set of users JS07; MAS11; CC12; BC11. Our work deals with a class of these systems, which we term Group Categorical Preferences (GCP), with the following characteristics. (1) Objects are described by a set of categorical attributes. (2) User preferences are defined on a subset of the attributes. (3) There are multiple users with distinct, possibly conflicting, preferences. The GCP formulation may appear in several scenarios; for instance, colleagues arranging a dinner at a restaurant, or friends selecting a vacation plan for a holiday break.

Table 1: New York Restaurants
Restaurant | Cuisine | Attire | Place | Price | Parking
 | Eastern | Business casual | Clinton Hill | $$$ | Street
 | French | Formal | Time Square | | Valet
 | Brazilian | Smart Casual | Madison Square | $$ | No
 | Mexican | Street wear | Chinatown | $ | No

Table 2: User preferences
User | Cuisine | Attire | Place | Price | Parking
 | European | Casual | Brooklyn | $$$ | Street
 | French, Chinese | | | | Valet
 | Continental | | Time Square, Queens | |

Figure 1: Attribute hierarchies. (a) Attire attribute, (b) Cuisine attribute, (c) Parking attribute, (d) Place attribute (part of New York places).

To illustrate GCP, consider the following example. Assume that three friends in New York are looking for a restaurant to arrange a dinner. Suppose that the three friends are going to use a Web site (e.g., Yelp, www.yelp.com) in order to search and filter restaurants based on their preferences. Note that in this setting, as well as in other Web-based recommendation systems, categorical descriptions are prevalent compared to numerical attributes. Assume a list of available New York restaurants, shown in Table 1, where each is characterized by five categorical attributes: Cuisine, Attire, Place, Price and Parking. In addition, Figure 1 depicts the hierarchies for these attributes. Attire and Parking are three-level hierarchies, Cuisine and Place are four-level hierarchies, and Price (not shown in Figure 1) is a two-level hierarchy with four leaf nodes. Finally, Table 2 shows the three friends’ preferences. For instance, the first user prefers European cuisine, likes to wear casual clothes, and prefers a moderately expensive restaurant in the Brooklyn area offering also street parking. On the other hand, the second user likes French and Chinese cuisine, and prefers restaurants offering valet parking, without expressing any preference on attire, price and place.

Observe that if we look at a particular user, it is straightforward to determine his ideal restaurant. For instance, clearly prefers , while clearly favors . These conclusions per user can be reached using the following reasoning. Each preference attribute value is matched with the corresponding object attribute value using a matching function, e.g., the Jaccard coefficient, and a matching degree per preference attribute is derived. Given these degrees, the next step is to “compose” them into an overall matching degree between a user and an object. Note that several techniques have been proposed for “composing” matching degrees; e.g., MAS11; JM04; K02; SKP11; C03. The simplest option is to compute a linear combination, e.g., the sum, of the individual degrees. Finally, alternative aggregation models (e.g., Least-Misery, Most-Pleasure) could also be considered.

Returning to our example, assume that the matching degrees of user are: for restaurant , for , for , and for (these degrees correspond to Jaccard coefficients computed as explained in Section 3). Note that for almost any “composition” method employed (except those that only, or strongly, consider the Attire attribute), is the most favorable restaurant for user . Using similar reasoning, restaurant , is ideal for both users , .

When all users are taken into consideration, as required by the GCP formulation, several questions arise. Which is the best restaurant that satisfies the entire group? And more importantly, what does it mean to be the best restaurant? A simple answer to the latter would be the restaurant that has the highest “composite” degree of match to all users. Using a similar method as before, one can define a collective matching degree that “composes” the overall matching degrees for each user. This interpretation, however, enforces an additional level of “composition”, the first being across attributes, and the second across users. These compositions obscure and blur the individual preferences per attribute of each user.

To some extent, the problem at the first “composition” level can be mitigated by requiring each user to manually define an importance weight among his specified attribute preferences. On the other hand, it is not easy, if possible at all, to assign weights to users, so that the assignment is fair. There are two reasons for this. First, users may specify different sets of preference attributes, e.g., specifies all five attributes, while only Cuisine and Parking. Second, even when considering a particular preference attribute, e.g., Cuisine, users may specify values at different levels of the hierarchy, e.g., specifies a European cuisine, while French cuisine, which is two levels beneath. Similarly, objects can also have attribute values defined at different levels. Therefore, any “composition” is bound to be unfair, as it may favor users with specific preferences and objects with detailed descriptions, and disfavor users with broader preferences and objects with coarser descriptions. This is an inherent difficulty of the GCP problem.

In this work, we introduce the double Pareto-based aggregation, which provides an objective and fair interpretation to the GCP formulation without “compositing” across preference attributes and users. Under this concept, the matching between a user and an object forms a matching vector. Each coordinate of this vector corresponds to an attribute and takes the value of the corresponding matching degree. The first Pareto-based aggregation is defined over attributes and induces a partial order on these vectors. Intuitively, for a particular user, the first partial order objectively establishes that an object is better, i.e., more preferable, than another, if it is better on all attributes. Then, the second Pareto-based aggregation, defined across users, induces the second and final partial order on objects. According to this order, an object is better than another, if it is more preferable according to all users.

Based on the previous interpretation of the GCP formulation, we seek to solve two distinct problems. The first, which we term the Group-Maximal Categorical Objects (GMCO) problem, is finding the set of maximal, or Pareto-optimal, objects according to the final partial order. Note that since this order is only partial, i.e., two objects may not be comparable, there may exist multiple objects that are maximal; recall that an object is maximal if there exists no other object that succeeds it in the order considered. In essence, it is the fact that this order is partial that guarantees objectiveness. The GMCO problem has been tackled in our previous work BBS14.

The second problem, which we term the Group-Ranking Categorical Objects (GRCO) problem, consists of determining an objective ranking of objects. Recall that the double Pareto-based aggregation, which is principal in guaranteeing objectiveness, induces only a partial order on the objects. On the other hand, ranking implies a total order among objects. Therefore, it is impossible to rank objects without introducing additional ordering relationships among objects, which however would sacrifice objectiveness. We address this contradiction by introducing an objective weak order on objects. Such an order allows objects to share the same tier, i.e., to be ranked at the same position, but defines a total order among tiers, so that between any two tiers, it is always clear which is better.

The GMCO problem has at its core the problem of finding maximal elements according to some partial order. Therefore, it is possible to adapt an existing algorithm to solve the core problem, as we discuss in Section 4.2. While there exists a plethora of main-memory algorithms, e.g., Kung1975; Bentley1990, and more recently of external memory algorithms (termed skyline query processing methods), e.g., BKS01; CGGL03; PTFS05, they all suffer from two performance limitations. First, they need to compute the matching degrees and form the matching vectors for all objects, before actually executing the algorithm. Second, it makes little sense to apply index-based methods, which are known to be the most efficient, e.g., the state-of-the-art method of PTFS05. The reason is that the entries of the index depend on the specific instance, and need to be rebuilt from scratch when the user preferences change, even though the description of objects persists.

To address these limitations, we introduce a novel index-based approach for solving GMCO, which also applies to GRCO. The key idea is to index the set of objects that, unlike the set of matching vectors, remains constant across instances, and defer expensive computation of matching degrees. To achieve this, we apply a simple transformation of the categorical attribute values to intervals, so that each object translates to a rectangle in the Euclidean space. Then, we can employ a space partitioning index, e.g., an R-Tree, to hierarchically group the objects. We emphasize that this transformation and index construction is a one-time process, whose cost is amortized across instances, since the index requires no maintenance, as long as the collection of objects persists. Based on the transformation and the hierarchical grouping, it is possible to efficiently compute upper bounds for the matching degrees for groups of objects. Therefore, for GMCO, we introduce an algorithm that uses these bounds to guide the search towards objects that are more likely to belong to the answer set, avoiding the computation of unnecessary matching degrees.

For the GRCO problem, i.e., finding a (weak) order among objects, there has been a plethora of works on the related topic of combining/fusing multiple ranked lists, e.g., FS93; DKNS01; AM01; MA02; FV07; MAS11. However, such methods are not suitable for our GCP formulation. Instead, we take a different approach. We first relax the unanimity requirement in the second Pareto-based aggregation, and require only a percentage of users to agree, resulting in the p-GMCO problem. This introduces a pre-order instead of a partial order, i.e., the induced relation lacks antisymmetry (an object may at the same time be before and after another). Then, building on this notion, we define tiers based on p values, and rank objects according to the tier they belong to, which results in an objective weak order. To support the effectiveness of our ranking scheme, we analyze its behaviour in the context of rank aggregation and show that it possesses several desirable theoretical properties.

Contributions. The main contributions of this paper are summarized as follows.

  • We introduce and propose an objective and fair interpretation of group categorical preference (GCP) recommender systems, based on double Pareto-based aggregation.

  • We introduce three problems in GCP systems: finding the group-maximal objects (GMCO), finding relaxed group-maximal objects (p-GMCO), and objectively ranking objects (GRCO).

  • We present a method for transforming the hierarchical domain of a categorical attribute into a numerical domain.

  • We propose index-based algorithms for all problems, which employ a space partitioning index to hierarchically group objects.

  • We theoretically study the behaviour of our ranking scheme and present a number of theoretical properties satisfied by our approach.

  • We present several extensions involving the following issues: multi-valued attributes, non-tree hierarchies, subspace indexing, and objective attributes.

  • We conduct a thorough experimental evaluation using both real and synthetic data.

Outline. The remainder of this paper is organized as follows. Section 3 contains the necessary definitions for the GCP formulation. Then, Section 4 discusses the GMCO problem, Section 5 the p-GMCO problem, and Section 6 the GRCO problem. Section 7 discusses various extensions. Section 8 contains a detailed experimental study. Section 2 reviews related work, while Section 9 concludes this paper.

2 Related Work

This section reviews work on recommender systems and algorithms for Pareto aggregation.

2.1 Recommender Systems

There exist several techniques to specify preferences on objects SKP11; LL87. The quantitative preferences, e.g., AW00; HKP01; KI04, assign a numeric score to attribute values, signifying importance. For example, values , , are assigned scores 0.9, 0.7, 0.1, respectively, which implies that is more preferable than , which in turn is more preferable than . There also exist qualitative preferences, e.g., K02; C03, which are relatively specified using binary relationships. For example, value is preferred over and , but , are indifferent. This work assumes the case of boolean quantitative preferences, where a single attribute value is preferred, while others are indifferent.

The general goal of recommendation systems AT05; BOHG13; YuHSD14; KannanIP14 is to identify those objects that are most aligned to a user’s preferences. Typically, these systems provide a ranking of the objects by aggregating user preferences. Particularly, the work in AW00 defines generic functions that merge quantitative preferences. The works in CBC+00; HKP01 deal with linear combinations of preference scores and propose index and view based techniques for ranking tuples. For preferences in general, K02; C03 introduce a framework for composing or accumulating interests. Among the discussed methods is the Pareto composition, which is related to the skyline computation, discussed below.

Recently, several methods for group recommendations have been proposed JS07; MAS11; CC12; BC11. These methods recommend items to a group of users, trying to satisfy all the group members. The existing methods are classified into two approaches. In the first, the preferences of each group member are combined to create a virtual user; the recommendations to the group are proposed w.r.t. the virtual user. In the second, individual recommendations for each member are computed; the recommendations of all members are then merged into a single recommendation. A large number of group recommendation methods have been developed in several domains such as: music SWT08; CBH02; PiliponyteRK13; McCarthyA98; ChaoBF05; ZXD05, movies CCKR01, TV programs JM04; YZHG06; VKHA09; gb02, restaurants PPC08; McCarthy2002, sightseeing tours GSO11; ArdissonoGPST03; KW05, vacation packages McCarthySCMSN06; Jameson04, food ElahiGRMB14, news PCCA05, and online communities GXLBHMS10; KKYY10; BPD08. Finally, several works study the problem of rank aggregation in the context of group recommendations RACDY10; BMR10; BF10; CMS07; NSNK12b.

Several methods to combine different ranked lists have been presented in the IR literature, where the data fusion problem is defined: given a set of ranked lists of documents returned by different search engines, construct a single ranked list combining the individual rankings DKNS01. Data fusion techniques can be classified based on whether they require knowledge of the relevance scores AM01. The simplest method based solely on the documents’ ranks is the Borda-fuse model. It assigns as score to each document the summation of its ranks in the individual lists. The Condorcet-fuse method MA02 is based on a majoritarian voting algorithm, which specifies that a document is ranked higher in the fused list than another document if it is ranked higher than the other in more input lists than not. The approach in FV07 assumes that a document is ranked better than another if the majority of input rankings is in concordance with this fact and, at the same time, only a few input rankings refute it. When the relevance scores are available, other fusion techniques, including CombSUM, CombANZ and CombMNZ, can be applied FS93. In CombSUM, the fused relevance score of a document is the summation of the scores assigned by each source. In CombANZ (resp. CombMNZ), the final score of a document is calculated as that of CombSUM divided (resp. multiplied) by the number of lists in which the document appears.

2.2 Algorithms for Pareto Aggregation

The work of BKS01 rekindled interest in the problem of finding the maximal objects Kung1975 and re-introduced it as the skyline operator. An object is dominated if there exists another object before it according to the partial order enforced by the Pareto-based aggregation. The maximal objects are referred to as the skyline. The authors propose several external memory algorithms. The most well-known method is Block Nested Loops (BNL) BKS01, which checks each point for dominance against the entire dataset.

The work in CGGL03 observes that examining points according to a monotone (in all attributes) preference function reduces the average number of dominance checks. Based on this fact, the Sort-first Skyline algorithm (SFS) is introduced, along with some variations (i.e., LESS GSG07 and SaLSa BCP08) belonging to the class of sort-based skyline algorithms, which improve performance (see BikakisSS14 for more details).

In SLNX09 the multi-pass randomized algorithm RAND is proposed. Initially, RAND selects a random sample; then, multiple passes over the dataset are performed in order to prune points and find the skyline.

In other approaches, multidimensional indexes are used to guide the search for skyline points and prune large parts of the space. The most well-known algorithm is the Branch and Bound Skyline (BBS) method PTFS05, which uses an R-tree, and is shown to be I/O optimal with respect to this index. Similarly, the Nearest Neighbor algorithm (NN) KRR02 also uses an R-tree, performing multiple nearest neighbor searches to identify skyline objects. A bitmap structure is used by the Bitmap TEO01 algorithm to encode the input data. In the Index TEO01 algorithm, several B-trees are used to index the data, one per dimension. Other methods, e.g., LZLL07; LC10, employ a space-filling curve, such as the Z-order curve, and use a single-dimensional index. The Lattice Skyline (LS) algorithm MPJ07 builds a specialized data structure for low-cardinality domains.

In partitioning-based approaches, the algorithms divide the initial space into several partitions. The first algorithm in this category, D&C BKS01, computes the skyline objects adopting the divide-and-conquer paradigm. A similar approach with stronger theoretical guarantees is presented in ST12. More recently, partitioning-based skyline algorithms were proposed in ZMC09; JS10. OSP ZMC09 attempts to reduce the number of checks between incomparable points by recursively partitioning the skyline points. BSkyTree JS10 enhances ZMC09 by considering both the notions of dominance and incomparability while partitioning the space.

Finally, specific algorithms are proposed to efficiently compute the skyline over partially ordered domains CET05; WFP+08; SPP09; ZMKC10, metric spaces CL09, non-metric spaces PDMK09, or anticorrelated distributions SK13.

Several lines of research attempt to address the issue that the size of the skyline cannot be controlled, by introducing new concepts and/or ranking the skyline (see LB13 for a survey). YM07 ranks tuples based on the number of records they dominate. CJT+06 deals with high-dimensional skylines, and relaxes the notion of dominance to -dominance, according to which a record is -dominated if it is dominated in a subspace of dimensions. LJZ11 uses a skyline-based partitioning to rank tuples. The most representative skyline operator is proposed in LYZZ07, which selects a set of skyline points, so that the number of points dominated by at least one of them is maximized. In a similar spirit, TDLP09 tries to select the skyline points that best capture the trade-offs among the parameters. Finally, LYHSB12 attempts to find a small and focused skyline set. The size of the skyline is reduced by asking users to state additional preferences.

3 Group Categorical Preferences

Table 3 shows the most important symbols and their definition. Consider a set of categorical attributes . The domain of each attribute is a hierarchy . A hierarchy defines a tree, where a leaf corresponds to a lowest-level value, and an internal node corresponds to a category, i.e., a set, comprising all values within the subtree rooted at this node. The root of a hierarchy represents the category covering all lowest-level values. We use the symbol (resp. ) to denote the number of leaf (resp. all hierarchy) nodes. With reference to Figure 1, consider the “Cuisine” attribute. The node “Eastern” is a category and is essentially a shorthand for the set “Greek”, “Austrian”, since it contains the two leaves, “Greek” and “Austrian”.

Assume a set of objects . An object is defined over all attributes, and the value of attribute is one of the nodes of the hierarchy . For instance, in Table 1, the value of the “Cuisine” attribute of object , is the node “Eastern” in the hierarchy of Figure 1.

Symbol Description
, Set of attributes, number of attributes ()
, Attribute, number of distinct values in
, Hierarchy of , number of hierarchy nodes
, Set of objects, an object
, Set of users, a user
, Value of attribute in object , user
, Interval representation of the value of in ,
Matching vector of object to user
Matching degree of to user on attribute
Object is collectively preferred over
The R-Tree that indexes the set of objects
, R-Tree node, the entry for in its parent node
, The pointer to node , the MBR of
Maximum matching vector of entry to user
Maximum matching degree of to user on
Table 3: Notation

Further, assume a set of users . A user is defined over a subset of the attributes, and for each specified attribute , its value is one of the hierarchy nodes. For all unspecified attributes, we say that user is indifferent to them. Note that an object (resp. a user) may have (resp. specify) multiple values for each attribute (see Section 7.1).

Given an object , a user , and a specified attribute , the matching degree of to with respect to , denoted as , is specified by a matching function . The matching function defines the relation between the user’s preferences and the object’s attribute values. For an indifferent attribute of a user , we define .

Note that different matching functions can be defined per attribute and user; for ease of presentation, we assume a single matching function. Moreover, note that this function can be any user-defined function operating on the cardinalities of intersections and unions of hierarchy attributes. For example, it can be the Jaccard coefficient, i.e., . The numerator counts the number of leaves in the intersection, while the denominator counts the number of leaves in the union, of the categories and . Other popular choices are the Overlap coefficient: , and the Dice coefficient: .

In our running example, we assume the Jaccard coefficient. Hence, the matching degree of restaurant to user w.r.t. “Attire” is , where we substituted “Casual” with the set “Business casual”, “Smart casual”.
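
To make this computation concrete, the following sketch evaluates the three coefficients on leaf sets; the set representation and function names are illustrative, not part of the paper's formulation.

def jaccard(user_val, obj_val):
    # |intersection| / |union| of the two leaf sets
    return len(user_val & obj_val) / len(user_val | obj_val)

def overlap(user_val, obj_val):
    # |intersection| / min(|user_val|, |obj_val|)
    return len(user_val & obj_val) / min(len(user_val), len(obj_val))

def dice(user_val, obj_val):
    # 2 * |intersection| / (|user_val| + |obj_val|)
    return 2 * len(user_val & obj_val) / (len(user_val) + len(obj_val))

# The Attire example above: the user value "Casual" stands for its two leaves
casual = {"Business casual", "Smart casual"}
business_casual = {"Business casual"}
print(jaccard(casual, business_casual))   # 0.5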

Table 4: Matching vectors of each restaurant to each user.

Given an object and a user , the matching vector of to , denoted as , is a -dimensional point in , where its -th coordinate is the matching degree with respect to attribute . Furthermore, we define the norm of the matching vector to be . In our example, the matching vector of restaurant to user is . All matching vectors of this example are shown in Table 4.

4 The Group-Maximal Categorical Objects (GMCO) Problem

Section 4.1 introduces the GMCO problem, and Section 4.2 describes a straightforward baseline approach. Then, Section 4.3 explains a method to convert categorical values into intervals, and Section 4.4 introduces our proposed index-based solution.

4.1 Problem Definition

We first consider a particular user and examine the matching vectors. The first Pareto-based aggregation across the attributes of the matching vectors, induces the following partial and strict partial “preferred” orders on objects. An object is preferred over , for user , denoted as iff for every specified attribute of the user it holds that . Moreover, object is strictly preferred over , for user , denoted as iff is preferred over and additionally there exists a specified attribute such that . Returning to our example, consider user and its matching vector for , and for . Observe that is strictly preferred over .

We now consider all users in . The second Pareto-based aggregation across users, induces the following strict partial “collectively preferred” order on objects. An object is collectively preferred over , if is preferred over for all users, and there exists a user for which is strictly preferred over . From Table 4, it is easy to see that restaurant is collectively preferred over , because is preferred by all three users, and strictly preferred by user .
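
A minimal sketch of the two aggregation levels follows, assuming each matching vector is given as a list of per-attribute degrees (attributes the user is indifferent to carry the fixed degree defined in Section 3, so they never affect the comparison); all names are illustrative.

def preferred(m1, m2):
    # m1 is preferred over m2 for a single user: at least as good on every attribute
    return all(a >= b for a, b in zip(m1, m2))

def strictly_preferred(m1, m2):
    # preferred, and strictly better on at least one attribute
    return preferred(m1, m2) and any(a > b for a, b in zip(m1, m2))

def collectively_preferred(vecs1, vecs2):
    # vecs1[j], vecs2[j]: matching vectors of the two objects to user j
    return (all(preferred(v1, v2) for v1, v2 in zip(vecs1, vecs2)) and
            any(strictly_preferred(v1, v2) for v1, v2 in zip(vecs1, vecs2)))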

Given the two Pareto-based aggregations, we define the collectively maximal objects in with respect to users , as the set of objects for which there exists no other object that is collectively preferred over them. In our running example, and objects are both collectively preferred over and . There exists no object which is collectively preferred over and , and thus are the collectively maximal objects. We next formally define the GMCO problem.

Problem 1. [GMCO] Given a set of objects and a set of users defined over a set of categorical attributes , the Group-Maximal Categorical Objects (GMCO) problem is to find the collectively maximal objects of with respect to .

4.2 A Baseline Algorithm (BSL)

The GMCO problem can be transformed to a maximal elements problem, or a skyline query, where the input elements are the matching vectors. Note, however, that the GMCO problem is different from computing the conventional skyline, i.e., the skyline over the objects’ attribute values.

The Baseline (BSL) method, whose pseudocode is depicted in Algorithm 1, takes advantage of this observation. The basic idea of BSL is for each object (loop in line 1) and for all users (loop in line 2), to compute the matching vectors (line 3). Subsequently, BSL constructs a -dimensional tuple (line 4), so that its -th entry is a composite value equal to the matching vector of object to user . When all users are examined, tuple is inserted in the set (line 5).

The next step is to find the maximal elements, i.e., compute the skyline over the records in . It is easy to prove that tuple is in the skyline of iff object is a collectively maximally preferred object of w.r.t. . Notice, however, that due to the two Pareto-based aggregations, each attribute of a record is also a record that corresponds to a matching vector, and thus is partially ordered according to the preferred orders defined in Section 3. Therefore, in order to compute the skyline of , we need to apply a skyline algorithm (line 6), such as BKS01; PTFS05; GSG07.

Computational Complexity. The computational cost of BSL is the sum of two parts. The first is computing the matching degrees, which takes time. The second is computing the skyline, which requires comparisons, assuming a quadratic time skyline algorithm is used. Therefore, BSL takes time.

Input: objects O, users U
Output: CM, the collectively maximal objects
Variables: R, set of intermediate records
1 foreach object o in O do
2     foreach user u in U do
3         compute the matching vector of o to u
4         set the entry of record r_o for user u to this matching vector
5     insert r_o into R
6 CM := SkylineAlgo(R)
Algorithm 1 BSL
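
The following is a runnable sketch of the BSL idea, using a naive quadratic scan in place of a dedicated skyline algorithm; matching_vector is an assumed helper implementing the matching function of Section 3, and collectively_preferred is the helper from the earlier sketch.

def bsl(objects, users):
    # one record per object: its matching vectors to every user, in user order
    records = [[matching_vector(o, u) for u in users] for o in objects]
    maximal = []
    for i, o in enumerate(objects):
        dominated = any(collectively_preferred(records[j], records[i])
                        for j in range(len(objects)) if j != i)
        if not dominated:
            maximal.append(o)
    return maximal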

4.3 Hierarchy Transformation

This section presents a simple method to transform the hierarchical domain of a categorical attribute into a numerical domain. The rationale is that numerical domains can be ordered, and thus tuples can be stored in multidimensional index structures. The index-based algorithm of Section 4.4 takes advantage of this transformation.

Consider an attribute and its hierarchy , which forms a tree. We assume that any internal node has at least two children; if a node has only one child, then this node and its child are treated as a single node. Furthermore, we assume that there exists an ordering, e.g., the lexicographic, among the children of any node that totally orders all leaf nodes.

The hierarchy transformation assigns an interval to each node, similar to labeling schemes such as ABJ89. The -th leaf of the hierarchy (according to the ordering) is assigned the interval . Then, each internal node is assigned the smallest interval that covers the intervals of its children. Figure 1 depicts the assigned intervals for all nodes in the example hierarchies.

Following this transformation, the value on the attribute of an object becomes an interval . The same holds for a user . Therefore, the transformation translates the hierarchy into the numerical domain .
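
A small sketch of the transformation is given below, assuming the i-th leaf receives the interval [i, i+1) and each internal node the smallest covering interval; the tuple-based tree representation and the exact shape of the Attire hierarchy are assumptions made for illustration.

def assign_intervals(node, next_leaf=0, out=None):
    # node = (label, children); returns (intervals dict, next free leaf index)
    if out is None:
        out = {}
    label, children = node
    if not children:                          # leaf: gets [i, i+1)
        out[label] = (next_leaf, next_leaf + 1)
        return out, next_leaf + 1
    lo = next_leaf
    for child in children:
        out, next_leaf = assign_intervals(child, next_leaf, out)
    out[label] = (lo, next_leaf)              # smallest interval covering the children
    return out, next_leaf

# e.g. an assumed shape for part of the Attire hierarchy
attire = ("Attire", [("Casual", [("Business casual", []), ("Smart casual", [])]),
                     ("Formal", []), ("Street wear", [])])
intervals, _ = assign_intervals(attire)
# intervals["Casual"] == (0, 2), intervals["Attire"] == (0, 4)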

An important property of the transformation is that it becomes easy to compute matching degrees for metrics that are functions on the cardinalities of intersections or unions of hierarchy attributes. This is due to the following properties, which use the following notation: for a closed-open interval , define .

Proposition 1. For objects/users , , and an attribute , let , denote the intervals associated with the value of , on . Then the following hold:

Proof. For a leaf value , it holds that . By construction of the transformation, . For a non-leaf value , is equal to the number of leaves under . Again, by construction of the transformation, is equal to the smallest interval that covers the intervals of the leaves under , and hence equal to . Therefore for any hierarchy value, it holds that .

Then, the last two properties trivially follow. The third holds since . ◻
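
Reading the proof together with the transformation, the three properties can be written as follows (a reconstruction from the proof, with $|\cdot|$ denoting the number of leaves of a hierarchy value on the left-hand sides and interval length on the right-hand sides):

$|o.A_k| = |I_{o.A_k}|$
$|o.A_k \cap o'.A_k| = |I_{o.A_k} \cap I_{o'.A_k}|$
$|o.A_k \cup o'.A_k| = |I_{o.A_k}| + |I_{o'.A_k}| - |I_{o.A_k} \cap I_{o'.A_k}|$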

4.4 An Index-based Algorithm (IND)

This section introduces the Index-based GMCO (IND) algorithm. The key ideas of IND are: (1) apply the hierarchy transformation, previously described, and index the resulting intervals, and (2) define upper bounds for the matching degrees of a group of objects, so as to guide the search and quickly prune unpromising objects.

We assume that the set of objects and the set of users are transformed so that each attribute value is an interval . Therefore, each object (and user) defines a (hyper-)rectangle on the -dimensional Cartesian product of the numerical domains, i.e., .

Figure 2: Transformed objects and users

Figure 2 depicts the transformation of the objects and users shown in Tables 1 & 2, considering only the attributes Cuisine and Attire. For instance, object is represented as the rectangle in the “Cuisine”“Attire” plane. Similarly, user is represented as two intervals, , , on the transformed “Cuisine”, “Attire” axes, respectively.

The IND algorithm indexes the set of objects in this -dimensional space. In particular, IND employs an R-Tree BKSS90, which is well suited to index rectangles. Each node corresponds to a disk page, and contains a number of entries. Each entry comprises (1) a pointer , and (2) a Minimum Bounding Rectangle (MBR) . A leaf entry corresponds to an object , its pointer is null, and is the rectangle defined by the intervals of . A non-leaf entry corresponds to a child node , its pointer contains the address of , and is the MBR of (i.e., the tightest rectangle that encloses) the MBRs of the entries in .

Due to its enclosing property, the MBR of an entry encloses all objects that are stored at the leaf nodes within the subtree rooted at node . It is often helpful to associate an entry with all the objects it encloses, and thus treat as a group of objects.

Consider a entry and a user . Given only the information within entry , i.e., its MBR, and not the contents, i.e., its enclosing objects, at the subtree rooted at , it is impossible to compute the matching vectors for the objects within this subtree. However, it is possible to derive an upper bound for the matching degrees of any of these objects.

We define the maximum matching degree of entry on user w.r.t. specified attribute as the highest attainable matching degree of any object that may reside within . To do this we first need a way to compute lower and upper bounds on unions and intersections of a user interval with an MBR.

Proposition 2. Fix an attribute . Consider an object/user , and let , denote the interval associated with its value on . Also, consider another object/user whose interval on is contained within a range . Given an interval , returns if is empty, and otherwise. Then the following hold:

Proof. Note that for the object/user with interval on , it holds that .

(1) For the left inequality of the first property, observe that value is a node that contains at least one leaf, hence . Furthermore, for the right inequality, .

(2) For the left inequality of the second property, observe that the value contains either at least one leaf when the intersection is not empty, and no leaf otherwise. The right inequality follows from the fact that .

(3) For the left inequality of the third property, assume first that ; hence . In this case, it holds that . By the first property, we obtain . Combining the three relations, we obtain the left inequality. Now, assume that ; hence . In this case, it also holds that , and the left inequality follows.

For the right inequality of the third property, observe that . By the first property, we obtain , and , by the second. The right inequality follows from combining these three relations. ◻

Then, defining the maximum matching degree reduces to appropriately selecting the lower/upper bounds for the specific matching function used. For example, consider the case of the Jaccard coefficient, . Assume is a non-leaf entry, and let denote the range of the MBR on the attribute. We also assume that and overlap. Then, we define , where we have used the upper bound for the intersection in the enumerator and the lower bound for the union in the denominator, according to Section 4.4. For an indifferent to the user attribute , we define . Now, assume that is a leaf entry, that corresponds to object . Then the maximum matching degree is equal to the matching degree of to w.r.t. .
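
As an illustration, the following sketch computes one admissible upper bound on the Jaccard degree of any object whose interval lies within an MBR range on a specified attribute; it uses the simple bounds |I_u ∩ I_o| ≤ |I_u ∩ R| and |I_u ∪ I_o| ≥ |I_u|, which may be looser than the bounds derived in Proposition 2, so treat it as illustrative only.

def interval_len(lo, hi):
    return max(0, hi - lo)

def max_jaccard(user_iv, mbr_range):
    # user_iv: the user's interval on the attribute; mbr_range: the MBR's range on it
    (ul, uh), (rl, rh) = user_iv, mbr_range
    inter_ub = interval_len(max(ul, rl), min(uh, rh))   # upper bound on |I_u ∩ I_o|
    union_lb = interval_len(ul, uh)                     # lower bound on |I_u ∪ I_o|
    return 0.0 if union_lb == 0 else min(1.0, inter_ub / union_lb)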

Computing maximum matching degrees for other metrics is straightforward. In any case, the next proposition shows that an appropriately defined maximum matching degree is an upper bound to the matching degrees of all objects enclosed in entry .

Proposition 3. The maximum matching degree of entry on user w.r.t. specified attribute is an upper bound to the highest matching degree in the group that defines.

Proof. The maximum matching degree is an upper bound from Section 4.4. ◻

In analogy to the matching vector, the maximum matching vector of entry on user is defined as a -dimensional vector whose -th coordinate is the maximum matching degree . Moreover, the norm of the maximum matching vector is .

Next, consider a entry and the entire set of users . We define the score of an entry as . This score quantifies how well the enclosed objects of match against all users’ preferences. Clearly, the higher the score, the more likely that contains objects that are good matches to users.

Input: R-Tree T, users U
Output: CM, the collectively maximal objects
Variables: H, a heap with entries sorted by their score
 1 CM := empty list
 2 read the root node of T
 3 compute the score of each root entry and insert all root entries in H
 4 while H is not empty do
 5     pop the entry e with the highest score
 6     if e is a non-leaf entry then
 7         read the node N pointed to by e
 8         foreach child entry e_i of N do
 9             pruned := false
10             foreach user u in U do
11                 compute the maximum matching degree of e_i on u
12             foreach object o in CM do
13                 if o's matching vector is at least as good as the maximum matching vector of e_i for every user, and strictly better for some user then
14                     pruned := true
15                     break
16             if not pruned then
17                 compute the score of e_i and insert e_i in H
18     else
19         o := the object corresponding to e
20         dominated := false
21         foreach object o' in CM do
22             if o' is collectively preferred over o then
23                 dominated := true
24                 break
25         if not dominated then
26             insert o in CM
Algorithm 2 IND

Algorithm Description. Algorithm 2 presents the pseudocode for IND. The algorithm maintains two data structures: a heap which stores entries sorted by their score, and a list of collectively maximal objects discovered so far. Initially the list is empty (line 1), and the root node of the R-Tree is read (line 2). The score of each root entry is computed and all entries are inserted in (line 3). Then, the following process (loop in line 4) is repeated as long as has entries.

The entry with the highest score, say , is popped (line 5). If is a non-leaf entry (line 6), it is expanded, which means that the node identified by is read (line 7). For each child entry of (line 8), its maximum matching degree with respect to every user is computed (lines 10–11). Then, the list is scanned (loop in line 12). If there exists an object in such that (1) for each user , the matching vector of is better than , and (2) there exists a user so that the matching vector of is strictly better than , then entry is discarded (lines 13–15). It is straightforward to see (from Section 4.4) that if this condition holds, cannot contain any object that is in the collectively maximal objects, which guarantees IND’s correctness. When the condition described does not hold (line 16), the score of is computed and is inserted in (line 17).

Now, consider the case that is a leaf entry (line 18), corresponding to object (line 19). The list is scanned (loop in line 21). If there exists an object that is collectively preferred over (line 22), it is discarded. Otherwise (lines 25–26), is inserted in .

The algorithm terminates when is empty (loop in line 4), at which time the list contains the collectively maximal objects.
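
A compact best-first sketch of this traversal is given below; the node and entry layout, and the helpers score, max_vectors and vectors, are assumed placeholders for the structures and definitions of Section 4.4, while collectively_preferred is the helper from the earlier sketch.

import heapq
from itertools import count

def ind(root, users):
    cm = []                                   # collectively maximal objects found so far
    tie = count()                             # tie-breaker so the heap never compares entries
    heap = [(-score(e, users), next(tie), e) for e in root["entries"]]
    heapq.heapify(heap)
    while heap:
        _, _, e = heapq.heappop(heap)         # entry with the highest score
        if e["child"] is not None:            # non-leaf entry: expand its node
            for ce in e["child"]["entries"]:
                ub = max_vectors(ce, users)   # per-user upper-bound matching vectors
                # prune ce if some found object is collectively preferred over its bound
                if not any(collectively_preferred(vectors(o, users), ub) for o in cm):
                    heapq.heappush(heap, (-score(ce, users), next(tie), ce))
        else:                                 # leaf entry: an actual object
            o = e["object"]
            if not any(collectively_preferred(vectors(o2, users), vectors(o, users))
                       for o2 in cm):
                cm.append(o)
    return cm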

Computational Analysis. IND performs object-to-object comparisons, as well as comparisons between objects and non-leaf entries. Since there are at most non-leaf entries, IND performs comparisons in the worst case. Further, it computes matching degrees on the fly at a cost of . Overall, IND takes time, the same as BSL. However, in practice IND is more than an order of magnitude faster than BSL (see Section 8).

Example. We demonstrate IND, using our running example, as depicted in Figure 2. The four objects are indexed by an R-Tree, whose nodes are drawn as dashed rectangles. Objects , are grouped in entry , while , in entry . Entries and are the entries of the root . Initially, the heap contains the two root entries, . Entry has the highest score (the norm of its maximum matching vector is the largest), and is thus popped. The two child entries and are obtained. Since the list is empty, no child entry is pruned and both are inserted in the heap, which becomes . In the next iteration, has the highest score and is popped. Since this is a leaf entry, i.e., an object, and is empty, is inserted in the result list, . Subsequently, is popped and since is not collectively preferred over it, is also placed in the result list, . In the final iteration, entry is popped, but the objects in are collectively preferred over both of its children. Algorithm IND concludes, finding the collectively maximal .

5 The p-Group-Maximal Categorical Objects (p-GMCO) Problem

Section 5.1 introduces the p-GMCO problem, and Section 5.2 presents an adaptation of the BSL method, while Section 5.3 introduces an index-based approach.

5.1 Problem Definition

As the number of users increases, it becomes more likely that the users express very different and conflicting preferences. Hence, it becomes difficult to find a pair of objects such that the users unanimously agree that one is worse than the other. Ultimately, the number of maximally preferred objects increases. This means that the answer to a GMCO problem with a large set of users becomes less meaningful.

The root cause of this problem is that we require unanimity in deciding whether an object is collectively preferred by the set of users. The following definition relaxes this requirement. An object is p-collectively preferred over , denoted as , iff there exists a subset of at least users such that for each user in this subset is preferred over , and there exists a user in the subset for which is strictly preferred over . In other words, we require the votes of only a fraction p of the users to decide whether an object is universally preferred. Similarly, the p-collectively maximal objects of with respect to users , are defined as the set of objects in for which there exists no other object that is p-collectively preferred over them. The above definitions give rise to the p-GMCO problem.
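
A sketch of the relaxed dominance test follows, assuming the threshold is interpreted as ⌈p·|U|⌉ users with p given as a fraction; preferred and strictly_preferred are the per-user helpers from the earlier sketch.

from math import ceil

def p_collectively_preferred(vecs1, vecs2, p):
    # vecs1[j], vecs2[j]: matching vectors of the two objects to user j
    flags = [(preferred(a, b), strictly_preferred(a, b)) for a, b in zip(vecs1, vecs2)]
    supporters = [strict for pref, strict in flags if pref]
    return len(supporters) >= ceil(p * len(flags)) and any(supporters)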

Problem 2. [p-GMCO] Given a set of objects and a set of users defined over a set of categorical attributes , the p-Group-Maximal Categorical Objects (p-GMCO) problem is to find the p-collectively maximal objects of with respect to .

Following the definitions, we can make a number of important observations, similar to those for the -dominance notion CJT+06. First, if an object is collectively preferred over some other object, it is also p-collectively preferred over that same object for any . As a result, an object that is p-collectively maximal is also collectively maximal for any . In other words, the answer to the p-GMCO problem is a subset of the answer to the corresponding GMCO.

Second, consider an object that is not p-collectively maximal. Note that it is possible that no p-collectively maximal object is p-collectively preferred over it. As a result, checking whether an object is a result by considering only the p-collectively maximal objects may lead to false positives. Fortunately, it holds that there must exist a collectively maximal object that is p-collectively preferred over it. So it suffices to check against the collectively maximal objects only (and not just the subset that is p-collectively maximal).

Example. Consider the example in Tables 1 & 2. If we consider , we require all users to agree if an object is collectively preferred. So, the p-collectively maximal objects are the same as the collectively maximal objects (i.e., , ). Let’s assume that ; i.e., users. In this case, only the restaurant is p-collectively maximal, since is p-collectively preferred over , if we consider the set of users and . Finally, if , we consider only one user in order to decide if an object is collectively preferred. In this case, the set of p-collectively maximal objects is empty, since is p-collectively preferred over , if we consider either user or , and also is p-collectively preferred over , if we consider user .

Input: objects O, users U
Output: p-CM, the p-collectively maximal objects
Variables: CM, the collectively maximal objects
 7 foreach object o in CM do
 8     dominated := false
 9     foreach object o' in CM do
10         if o' is p-collectively preferred over o then
11             dominated := true
12             break
13     if not dominated then
14         insert o into p-CM
Algorithm 3 p-BSL

5.2 A Baseline Algorithm (p-BSL)

Based on the above observations, we describe a baseline algorithm for the p-GMCO problem, based on BSL. Algorithm 3 shows the changes with respect to the BSL algorithm; all omitted lines are identical to those in Algorithm 1. The p-BSL algorithm first computes the collectively maximal objects by applying BSL (lines 1–6). Then, each collectively maximal object is compared with all other collectively maximal objects (lines 7–14). Particularly, each object is checked for whether there exists another collectively maximal object that is p-collectively preferred over it (lines 10–12). If there is no such object, the object is inserted in the result set (line 14). When the algorithm terminates, this set contains the p-collectively maximal objects.
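
A sketch of this post-filtering step is shown below, reusing the bsl and p_collectively_preferred sketches above; as noted earlier, it suffices to test each collectively maximal object against the other collectively maximal objects only.

def p_bsl(objects, users, p):
    cm = bsl(objects, users)                 # collectively maximal objects (Algorithm 1 sketch)
    recs = {id(o): [matching_vector(o, u) for u in users] for o in cm}
    return [o for o in cm
            if not any(p_collectively_preferred(recs[id(o2)], recs[id(o)], p)
                       for o2 in cm if o2 is not o)]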

Computational Analysis. Initially, the algorithm computes the collectively maximal set using the BSL algorithm (lines 1–6), which requires . Then, it finds the p-collectively maximal objects (lines 7–14), performing in the worst case comparisons, since in the worst case we have that . Therefore, the computational cost of Algorithm 3 is .

5.3 An Index-based Algorithm (p-IND)

We also propose an extension of IND for the p-GMCO problem, termed p-IND. Algorithm 4 shows the changes with respect to the IND algorithm; all omitted lines are identical to those in Algorithm 2.

Input: R-Tree T, users U
Output: p-CM, the p-collectively maximal objects
Variables: H, a heap with entries sorted by their score; CM, the collectively maximal objects
 1 CM := empty list; p-CM := empty list
 4 while H is not empty do
18     else
19         o := the object corresponding to e
20         inCM := true; inPCM := true
21         foreach object o' in CM do
22             if o' is collectively preferred over o then
23                 inCM := false; inPCM := false
24                 break
25             if o' is p-collectively preferred over o then
26                 if inPCM then
27                     inPCM := false
28             if o' is in p-CM then
29                 if o is p-collectively preferred over o' then
30                     remove o' from p-CM
31         if inCM then
32             insert o into CM
33             if inPCM then
34                 insert o into p-CM
Algorithm 4 p-IND

In addition to the set of collectively maximal objects, p-IND maintains the set of p-collectively maximal objects discovered so far (line 1). The latter is a subset of the former; therefore, an object may appear in both sets. When a leaf entry is popped (line 19), it is compared against each object in the collectively maximal set (lines 21–30) in three checks. First, the algorithm checks if is collectively preferred over (lines 22–24). In that case, object is neither collectively maximal nor p-collectively maximal. Second, it checks if is p-collectively preferred over (lines 25–27). In that case, object is not p-collectively maximal, but is collectively maximal. Third, the algorithm checks if the object is p-collectively preferred over (lines 28–30). In that case, object is removed from the p-collectively maximal objects (line 30), but remains collectively maximal.

After the three checks, if is collectively maximal (line 31) it is inserted in the collectively maximal set (line 32). Further, if it is p-collectively maximal (line 33) it is also inserted in the p-collectively maximal set (line 34). When the p-IND algorithm terminates, the latter set contains the answer to the p-GMCO problem.

Computational Analysis. p-IND performs at most three times more object-to-object comparisons than IND. Hence, its running time complexity remains .

6 The Group-Ranking Categorical Objects (GRCO) Problem

Section 6.1 introduces the GRCO problem, and Section 6.2 describes an algorithm for GRCO. Then, Section 6.3 discusses some theoretical properties of our proposed ranking scheme.

6.1 Problem Definition

As discussed in Section 1, it is possible to define a ranking among objects by “composing” the degrees of match for all users. However, any “composing” ranking function is unfair, as there is no objective way to aggregate individual degrees of match. In contrast, we propose an objective ranking method based on the concept of p-collective preference. The obtained ranking is a weak order, meaning that it is possible for objects to share the same rank (ranking with ties). We define the rank of an object to be the smallest integer , where , such that is p-collectively maximal for any . The non-collectively maximal objects are assigned the lowest possible rank . Intuitively, rank for an object means that any group of at least users (i.e., ) would consider to be preferable, i.e., would be collectively maximal for these users. At the highest rank , an object is preferred by each user individually, meaning that appears in all possible p-collectively maximal object sets.

Problem 3. [GRCO] Given a set of objects and a set of users defined over a set of categorical attributes , the Group-Ranking Categorical Objects (GRCO) problem is to find the rank of all collectively maximal objects of with respect to .

Example. Consider the restaurants and the users presented in Tables 1 & 2. In our example, the collectively maximal objects are the restaurants and . As described in the previous example (Section 5), the restaurant is collectively maximal for any group of two users. Hence, its rank is equal to two. In addition, requires all three users in order to be considered collectively maximal; so its rank is equal to three. Therefore, the restaurant is ranked higher than .

6.2 A Ranking Algorithm (RANK-CM)

The RANK-CM algorithm (Algorithm 5) computes the rank for all collectively maximal objects. The algorithm takes as input the collectively maximal objects , as well as the number of users . Initially, each object is assigned the highest rank; i.e., (line 2). Then, each object is compared against all other objects in (loop in line 3). Throughout the object comparisons, we increase (lines 5–11) from the current rank (i.e., ) (line 4) up to . If is not -collectively maximal (line 7), for (line 6), then cannot be in the - and can only have rank at most (line 8). Finally, each object is inserted in the based on its rank (line 12).

Computational Analysis. The algorithm compares each collectively maximal object with all other collectively maximal objects. Between two objects the algorithm performs at most comparisons. Since, in the worst case, we have that , the computational cost of Algorithm 5 is .

Input: CM, the collectively maximal objects; the number of users
Output: RL, the ranked collectively maximal objects
 1 foreach object o in CM do
 2     rank(o) := 1
 3     foreach object o' in CM, o' ≠ o do
 4         k := rank(o)
 5         while k is at most the number of users do
 6             consider groups of k users
 7             if o is not k-collectively maximal with respect to o' then
 8                 rank(o) := k + 1
 9                 k := k + 1
10             else
11                 break
12     insert o in RL at position rank(o)
Algorithm 5 RANK-CM
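
The sketch below computes the ranks directly rather than by incrementing the group size as Algorithm 5 does: for each object, the rank is one more than the largest number of users that prefer some other collectively maximal object over it (counting only cases with at least one strict preference), which should yield the same ranks under the definitions of Section 6.1; helper names are assumed from the earlier sketches.

def rank_cm(cm, users):
    recs = {id(o): [matching_vector(o, u) for u in users] for o in cm}

    def support(v1, v2):
        # users preferring the first object; counts only if at least one of them is strict
        pref = [preferred(a, b) for a, b in zip(v1, v2)]
        strict = [strictly_preferred(a, b) for a, b in zip(v1, v2)]
        return sum(pref) if any(strict) else 0

    ranks = {}
    for o in cm:
        worst = max((support(recs[id(o2)], recs[id(o)]) for o2 in cm if o2 is not o),
                    default=0)
        ranks[id(o)] = worst + 1              # smallest group size for which o stays maximal
    return ranks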

6.3 Ranking Properties

In this section, we discuss some theoretical properties in the context of the rank aggregation problem. These properties have been widely used in voting theory as evaluation criteria for the fairness of a voting system Taylor05; A63; R88. We show that the proposed ranking scheme satisfies several of these properties.

Property 1. [Majority] If an object is strictly preferable over all other objects by the majority of the users, then this object is ranked above all other objects.

Proof. Assume that users strictly prefer over all other objects, where . We will prove that the rank of the object is lower than the rank of any other object.

Since, users strictly prefer over all other objects, any group of at least users, will consider as collectively maximal. This holds since, any group of at least users, contains at least one user which strictly prefers over all other objects. Note that, may not be the smallest group size. That is, it may hold that, for any group of less than users, is collectively maximal.

Recall the definition of the ranking scheme: if the rank of an object is , then is the smallest integer such that, for any group of at least users, the object will be collectively maximal (for this group). Therefore, in any case we have that the rank of is at most , i.e., (1).

On the other hand, consider an object . Then, is not collectively maximal for any group with users. This holds since we have that . So, there is a group of users for which each user strictly prefers over . As a result, in order for to be considered as collectively maximal for any group of a specific size, we have to consider groups with more than users. From the above, it is apparent that, in any case, the rank of an object is greater than , i.e., (2).

Therefore, from (1) and (2), in any case the rank of the object will be lower than the rank of any other object. This concludes the proof of the property. ◻

Property 2. [Independence of Irrelevant Alternatives] The rank of each object is not affected if non-collectively maximal objects are inserted or removed.

Proof. According to the definition of the ranking scheme, if the rank of an object is , then is the smallest integer such that, for any group of at least users, the object will be collectively maximal (for this group).

As a result, the rank of an object is determined by the minimum group size for which, for any group of that size, the object is collectively maximal. Therefore, it is apparent that the rank of each object is not affected by the non-collectively maximal objects. Note that the non-collectively maximal objects are ranked with the lowest possible rank, i.e., . ◻

Property 3. [Independence of Clone Alternatives] The rank of each object is not affected if non-collectively maximal objects similar to an existing object are inserted.

Proof. The argument is similar to that for the previous property: based on the ranking scheme definition, the non-collectively maximal objects do not affect the ranking. ◻

Property 4. [Users Equality] The result will remain the same if two users switch their preferences. This property is also known as Anonymity.

Proof. According to the definition of the ranking scheme, if the rank of an object is , then is the smallest integer such that, for any group of at least users, the object will be collectively maximal (for this group).

As a result, the rank of an object is determined by the minimum group size for which, for any group of that size, the object is collectively maximal. Hence, if two users switch preferences, it is apparent that the minimum group size for which an object is collectively maximal remains the same, for all objects. Therefore, the rank of all objects remains the same. ◻

Consider an object and a user . Also, let be the matching vector between and . We say that the user increases his interest over , if , where is the matching degree resulting from the interest change.

Property 5. [Monotonicity] If an object is ranked above an object , and a user increases his interest over , then maintains its position above .

Proof. Let and be the ranks of objects and , respectively. Since is ranked above the object , we have that .

According to the definition of the ranking scheme, if the rank of an object is , then is the smallest integer such that, for any group of at least users, the object will be collectively maximal (for this group). So, we have that for any group of at least and members, and will be collectively maximal, respectively.

Assume a user increases his interest over the object . Further, assume that and are the new ranks of the objects and , resulting from the interest change. We show that in any case and .

First, let us study what holds for the new rank of the object . After the interest change, is the smallest group size such that is collectively maximal for any group of that size. We suppose for the sake of contradiction that . Hence, after the interest change, we should consider larger group sizes in order to ensure that will be collectively maximal for any group of that size. This means that, after the interest change, there is a group of users for which is not collectively maximal. Hence, since is not collectively maximal, there must exist an object that is collectively preferred over . To sum up, considering users, we have that: before the interest change, there is no object that is collectively preferred over ; and, after the interest change, there is an object that is collectively preferred over . This cannot hold, since the matching degrees between all other users and objects remain the same, while some matching degrees between and have increased (due to the interest change). So, for any group of users, there cannot exist an object which is collectively preferred over . Hence, we have proved by contradiction that in any case .

Now, let us study what holds for the new rank of the object . After the interest change, is the smallest group size such that, for any group of that size, is collectively maximal. For the sake of contradiction, we assume that . Hence, after the interest change, we should consider smaller group sizes, in order to ensure that