Bias Disparity in Recommendation Systems
Abstract.
Recommender systems have been applied successfully in a number of different domains, such as entertainment, commerce, and employment. Their success lies in their ability to exploit the collective behavior of users in order to deliver highly targeted, personalized recommendations. Given that recommenders learn from user preferences, they incorporate different biases (Pitoura et al., 2017) that users exhibit in the input data. More importantly, there are cases where recommenders may amplify such biases, leading to the phenomenon of bias disparity. In this short paper, we present a preliminary experimental study on synthetic data, where we investigate different conditions under which a recommender exhibits bias disparity, and the long-term effect of recommendations on data bias. We also consider a simple re-ranking algorithm for reducing bias disparity, and present some observations on bias disparity in real data.
1. Introduction
Recommender systems have found applications in a wide range of domains, including e-commerce, entertainment, social media, news portals, and employment sites (Su and Khoshgoftaar, 2009). One of the most popular classes of recommendation systems is collaborative filtering. Collaborative Filtering (CF) uses the collective behavior of all users over all items to infer the preferences of individual users for specific items (Su and Khoshgoftaar, 2009). However, because CF algorithms rely on the input preferences, they are susceptible to biases that may appear in the input data. In this work, we consider biases with respect to the preferences of specific groups of users (e.g., men and women) towards specific categories of items (e.g., different movie genres).
Bias in recommendations is not necessarily always problematic. For example, it is natural to expect gender bias when recommending clothes. However, gender bias is undesirable when recommending job postings, or information content. Furthermore, we want to avoid the case where the recommender system introduces bias in the data, by amplifying existing biases and reinforcing stereotypes. We refer to this phenomenon, where input and recommendation bias differ, as bias disparity.
The problem of algorithmic bias, and its flip side, fairness in algorithms, has attracted considerable attention in recent years (Hajian et al., 2016; Dwork et al., 2012). Most existing work focuses on classification systems, while there is limited work on recommendation systems. One type of recommendation bias that has been considered in the literature is popularity bias (Celma and Cano, 2008). It has been observed that under some conditions popular items are more likely to be recommended, leading to a rich-get-richer effect, and there are some attempts to correct this bias (Kamishima et al., 2014). Related to this is also the quest for diversity (Kunaver and Požrl, 2017), where the goal is to include different types of items in the recommendations.
These notions of fairness do not take into account the presence of different (protected) groups of users and different item categories that we consider in this work. Burke et al. (2018) assume different groups of users and items, define two types of bias, and propose a modification of the SLIM recommendation algorithm (Ning and Karypis, 2011) to ensure a fair output. Their work focuses on fairness, rather than bias disparity, and works with a specific algorithm. The notion of bias disparity is examined by Zhao et al. (2017), but in a classification setting. Fairness in terms of correcting rating errors for specific groups of users was studied by Yao and Huang (2017) for a matrix factorization CF recommender.
In this paper, we consider the problem of bias disparity in recommendation systems. More specifically:

We define notions of bias and bias disparity for recommender systems.

Using synthetic data we study different conditions under which bias disparity may appear. We consider the effect of the iterative application of recommendation algorithms on the bias of the data.

We present some observations on bias disparity on real data, using the MovieLens 1M dataset (https://grouplens.org/datasets/movielens/1m/).

We consider a simple reranking algorithm for correcting bias disparity and study it experimentally.
2. Model
2.1. Definitions
We consider a set of $n$ users $U$ and a set of $m$ items $I$. We are given implicit feedback in an $n \times m$ matrix $S$, where $S(u,i) = 1$ if user $u$ has selected item $i$, and zero otherwise. Selection may mean that user $u$ liked post $i$, purchased product $i$, or watched video $i$.
We assume that users are associated with an attribute $A_U$, e.g., the gender of the user. The attribute $A_U$ partitions the users into groups, that is, subsets of users with the same attribute value, e.g., men and women. We will typically assume that we have two groups and one of the groups is the protected group. Similarly, we assume that items are associated with an attribute $A_I$, e.g., the genre of a movie, which partitions the items into categories, that is, subsets of items with the same attribute value, e.g., action and romance movies.
Given the association matrix $S$, we define the input preference ratio $PR_S(G,C)$ of group $G$ for category $C$ as the fraction of selections from group $G$ that are in category $C$. Formally:
$$PR_S(G,C) = \frac{\sum_{u \in G} \sum_{i \in C} S(u,i)}{\sum_{u \in G} \sum_{i \in I} S(u,i)} \tag{1}$$
This is essentially the conditional probability that a selection is in category $C$ given that it comes from a user in group $G$.
To assess the importance of this probability we compare it against the probability $P(C) = |C|/m$ of selecting from category $C$ when selecting uniformly at random. We define the bias $B_S(G,C)$ of group $G$ for category $C$ as:
$$B_S(G,C) = \frac{PR_S(G,C)}{P(C)} \tag{2}$$
Bias values less than 1 denote negative bias, that is, the group $G$ on average tends to select less often from category $C$, while bias values greater than 1 denote positive bias, that is, group $G$ favors category $C$ disproportionately to its size.
We assume that the recommendation algorithm outputs for each user $u$ a ranked list of $r$ items $R_u$. The collection of all recommendations can be represented as a binary matrix $R$, where $R(u,i) = 1$ if item $i$ is recommended for user $u$, and zero otherwise. Given matrix $R$, we can compute the output preference ratio of the recommendation algorithm, $PR_R(G,C)$, of group $G$ for category $C$ using Eq. (1), and the output bias $B_R(G,C)$ of group $G$ for category $C$ using Eq. (2).
To compare the bias of a group $G$ for a category $C$ in the input data $S$ and the recommendations $R$, we define the bias disparity, that is, the relative change of the bias value:
$$BD(G,C) = \frac{B_R(G,C) - B_S(G,C)}{B_S(G,C)} \tag{3}$$
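The three quantities defined in this section can be computed directly from two binary matrices. The following is a minimal sketch in plain Python, not the authors' code: the function names, the list-of-lists matrix layout, and the toy data are assumptions made for illustration only.

```python
def preference_ratio(M, group, category):
    """Fraction of the group's selections in M that fall in the category (Eq. 1)."""
    in_category = sum(M[u][i] for u in group for i in category)
    total = sum(sum(M[u]) for u in group)
    return in_category / total


def bias(M, group, category):
    """Preference ratio divided by P(C) = |C| / m (Eq. 2)."""
    m = len(M[0])
    return preference_ratio(M, group, category) / (len(category) / m)


def bias_disparity(S, R, group, category):
    """Relative change of the bias from input S to recommendations R (Eq. 3)."""
    b_in = bias(S, group, category)
    b_out = bias(R, group, category)
    return (b_out - b_in) / b_in


# Toy example (hypothetical data): 2 users, 6 items, category C = items 0..2.
S = [[1, 1, 0, 1, 0, 0],   # user 0 selected items 0, 1, 3
     [1, 0, 0, 0, 1, 0]]   # user 1 selected items 0, 4
R = [[0, 0, 1, 0, 0, 0],   # user 0 is recommended item 2
     [0, 1, 0, 0, 0, 0]]   # user 1 is recommended item 1
group, category = [0, 1], [0, 1, 2]

pr_in = preference_ratio(S, group, category)   # 3 of 5 selections are in C
b_in = bias(S, group, category)                # ratio scaled by 1 / P(C) = 2
bd = bias_disparity(S, R, group, category)     # positive: bias is amplified
print(pr_in, b_in, bd)
```

In this toy case every recommendation falls in the favored category, so the output bias exceeds the input bias and the disparity is positive.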
2.2. The Recommendation Algorithm
For the recommendations, in our experiments, we use a user-based $K$-Nearest-Neighbors (UserKNN) algorithm. The UserKNN algorithm first computes for each user, $u$, the set $N_K(u)$ of the $K$ most similar users to $u$. For similarity, it uses the Jaccard similarity, $JSim(u, u')$, computed using the matrix $S$. For user $u$ and item $i$ not selected by $u$, the algorithm computes a utility value
$$V(u,i) = \frac{\sum_{u' \in N_K(u)} JSim(u,u')\, S(u',i)}{\sum_{u' \in N_K(u)} JSim(u,u')} \tag{4}$$
The utility value $V(u,i)$ is the fraction of the similarity scores of the top $K$ most similar users to $u$ that have selected item $i$. To recommend $r$ items to a user, the $r$ items with the highest utility values are selected.
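A compact way to read the utility computation is as a similarity-weighted vote among the nearest neighbors. The sketch below illustrates this under stated assumptions: the function names, the tie-breaking among equally similar neighbors, and the decision to drop zero-utility items are choices made here, not details taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of selected items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def user_knn_recommend(S, u, K, r):
    """Return up to r (item, utility) pairs for user u, highest utility first."""
    m = len(S[0])
    selected = [{i for i in range(m) if row[i]} for row in S]
    # The K most similar users to u (ties broken arbitrarily by user index).
    neighbors = sorted(
        ((jaccard(selected[u], selected[v]), v) for v in range(len(S)) if v != u),
        reverse=True,
    )[:K]
    total_sim = sum(sim for sim, _ in neighbors)
    if total_sim == 0:
        return []
    # Utility: share of the neighbors' similarity mass that selected item i.
    utility = {
        i: sum(sim for sim, v in neighbors if i in selected[v]) / total_sim
        for i in range(m)
        if i not in selected[u]
    }
    ranked = sorted(utility, key=lambda i: (-utility[i], i))
    return [(i, utility[i]) for i in ranked[:r] if utility[i] > 0]


# Toy example (hypothetical data): 4 users, 5 items.
S = [[1, 1, 0, 0, 0],
     [1, 1, 1, 0, 0],
     [1, 0, 0, 1, 0],
     [0, 0, 0, 0, 1]]
recs = user_knn_recommend(S, u=0, K=2, r=2)
print(recs)
```

Here user 0's two nearest neighbors are users 1 and 2, so item 2 (selected by the more similar neighbor) ranks above item 3.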
3. Bias Disparity on Synthetic Data
In this section, we present experiments with synthetic data. Our goal is to study the conditions under which the UserKNN algorithm exhibits bias disparity.
3.1. Synthetic data generation
Users are split into two groups $G_1$ and $G_2$ of size $n_1$ and $n_2$ respectively, and items are partitioned into two categories $C_1$ and $C_2$ of size $m_1$ and $m_2$ respectively. We assume that users in $G_1$ tend to favor items in category $C_1$, while users in group $G_2$ tend to favor items in category $C_2$. To quantify this preference, we give as input to the data generator two parameters $\rho_1, \rho_2$, where parameter $\rho_i$ determines the preference ratio $PR_S(G_i, C_i)$ of group $G_i$ for category $C_i$. For example, $\rho_1 = 0.7$ means that 70% of the ratings of group $G_1$ are in category $C_1$.
The datasets we create consist of 1,000 users and 1,000 items. We assume that each user selects 5% of the items in expectation and we recommend $r$ items per user. The presented results are average values over 10 experiments.
We perform two different sets of experiments. In the first set, we examine the role of the preference ratios and in the second set the role of group and category sizes.
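One possible reading of the generation procedure is sketched below. The sampling scheme is an assumption (the text does not spell it out): each user selects every item independently, with per-category probabilities chosen so that the expected number of selections is 5% of the items and the expected share in the favored category equals the given preference ratio. The names `generate`, `phi`, and `theta` are hypothetical.

```python
import random


def generate(n=1000, m=1000, phi=0.5, theta=0.5, rho1=0.7, rho2=0.7,
             density=0.05, seed=0):
    """Return (S, G1, G2, C1, C2) for two groups and two categories.

    phi is the fraction of users in G1, theta the fraction of items in C1.
    Assumes the resulting selection probabilities stay below 1.
    """
    rng = random.Random(seed)
    n1, m1 = round(phi * n), round(theta * m)
    G1, G2 = range(n1), range(n1, n)
    C1, C2 = range(m1), range(m1, m)
    expected = density * m  # expected number of selections per user
    S = []
    for u in range(n):
        # Share of this user's selections that should land in C1.
        share_c1 = rho1 if u < n1 else 1.0 - rho2
        p1 = share_c1 * expected / len(C1)
        p2 = (1.0 - share_c1) * expected / len(C2)
        S.append([1 if rng.random() < (p1 if i < m1 else p2) else 0
                  for i in range(m)])
    return S, G1, G2, C1, C2


def preference_ratio(S, group, category):
    """Fraction of the group's selections that fall in the category."""
    in_category = sum(S[u][i] for u in group for i in category)
    return in_category / sum(sum(S[u]) for u in group)


S, G1, G2, C1, C2 = generate()
pr1 = preference_ratio(S, G1, C1)
pr2 = preference_ratio(S, G2, C2)
observed_density = sum(map(sum, S)) / (len(S) * len(S[0]))
print(round(pr1, 3), round(pr2, 3), round(observed_density, 4))
```

With these defaults the empirical preference ratios land close to the requested 0.7 and the matrix density close to 5%, up to sampling noise.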
3.2. Varying the preference ratios
In these experiments, we create datasets with equal-size groups $G_1$ and $G_2$, and equal-size item categories $C_1$ and $C_2$, and we vary the preference ratios of the groups.
3.2.1. Symmetric Preferences:
In the first experiment, we assume that the two groups $G_1$ and $G_2$ have the same preference ratios by setting $\rho_1 = \rho_2 = \rho$, where $\rho$ takes values from 0.5 to 1, in increments of 0.05. In Figure 1(a), we plot the output preference ratio $PR_R(G_1, C_1)$ (computed with Eq. (1)) as a function of $\rho$. Note that in this experiment, bias is the preference ratio scaled by a factor of two, since each category contains half of the items. We report preference ratios because they are easier to interpret. The dashed line shows where the output ratio equals the input ratio and thus there is no bias disparity. We consider different values for $K$, the number of neighbors. A first observation is that when the input bias is small, the output bias decreases or stays the same. In this case, users have neighbors from both groups. For higher input bias, we have a sharp increase of the output bias until it reaches its peak. In these cases, the recommender polarizes the two groups, recommending items only from their favored category.
In Figure 1(b), we report the preference ratio for all candidate items for recommendation for each user (i.e., all items having nonzero utility). Surprisingly, the candidate items are less biased even for high values of the input bias. This shows that (a) utility proportional to user similarity increases bias, and (b) re-ranking may help in decreasing bias.
Increasing the value of $K$ increases the output bias. Adding neighbors increases the strength of the signal, and the algorithm discriminates better between the items in the different categories. Understanding the role of $K$ is a subject for future study.
3.2.2. Asymmetric Preferences:
In this experiment, group $G_1$ has preference ratio $\rho_1$ ranging from 0.5 to 1 while $G_2$ has fixed preference ratio $\rho_2 = 0.5$, that is, $G_2$ is unbiased. In Figure 1, we show the recommendation preference ratio for groups $G_1$ (Figure 1(c)) and $G_2$ (Figure 1(d)) as a function of $\rho_1$.
We observe that the output bias of group $G_1$ is amplified at a rate much higher than in Figure 1(a), while group $G_2$ becomes biased towards category $C_1$. Surprisingly, the presence of the unbiased group $G_2$, rather than moderating the overall bias, has an amplifying effect on the bias of $G_1$, more so than an oppositely biased group. Furthermore, the unbiased group $G_2$ (Figure 1(d)) adopts the bias of the biased group. This is because the users in the unbiased group $G_2$ provide a stronger signal in favor of category $C_1$ than in the symmetric case, where group $G_2$ is biased in favor of $C_2$. This reinforces the overall bias in favor of category $C_1$.
3.3. Varying group and category sizes
In this experiment we examine bias disparity with unbalanced groups and categories.
3.3.1. Varying Group Sizes:
We first consider groups of uneven size. We set the size $n_1$ of $G_1$ to be a fraction $\phi$ of the number of all users $n$, ranging from 5% to 95%. Both groups have fixed preference ratio $\rho_1 = \rho_2 = \rho$. Figure 2(a) shows the output recommendation preference ratio $PR_R(G_1, C_1)$ as a function of $\phi$. The plot of $PR_R(G_2, C_2)$ is the mirror image of this one, so we do not report it.
We observe that for small $\phi$, group $G_1$ has negative bias disparity ($BD(G_1, C_1) < 0$). That is, the small group is drawn by the larger group. For medium values of $\phi$, the bias of both groups is amplified, despite the fact that $G_1$ is smaller than $G_2$. The increase is larger for the larger group, but there is an increase for the smaller group as well.
We also experimented with the case where $G_2$ is unbiased. In this case $G_2$ becomes biased towards $C_1$ even when $G_1$ is small, while the bias disparity for $G_1$ becomes positive at a much smaller value of $\phi$. This indicates that a small biased group can have a stronger impact than a large unbiased one.
3.3.2. Varying Category Sizes:
We now consider categories of uneven size. We set the size $m_1$ of $C_1$ to be a fraction $\theta$ of the number of items $m$, ranging from 10% to 90%. We assume that both groups have fixed preference ratio $\rho_1 = \rho_2 = \rho$. Figure 2(b) shows the recommendation preference ratio $PR_R(G_1, C_1)$ as a function of $\theta$. The plot of $PR_R(G_2, C_2)$ is again the mirror image of this one.
Note that as long as $\theta < \rho$, group $G_1$ has positive bias (greater than 1) for category $C_1$, since the bias is equal to $\rho/\theta$. However, the bias decreases as the size of the category increases. When the category size is not very large, the output bias is amplified regardless of the category size. For $\theta > \rho$, $G_1$ is actually biased in favor of $C_2$, and this is reflected in the output. There is an interesting range where $G_1$ is positively biased towards $C_1$ but its bias is weak, and thus the recommendation output is drawn to category $C_2$ by the more biased group.
3.4. Iterative Application of Recommendations
We observed bias disparity in the output of the recommendation algorithm. However, how does this affect the bias in the data? To study this we consider a scenario where the users accept (some of) the recommendations of the algorithm, and we study the long-term effect of the iterative application of the algorithm on the bias of the data. More precisely, at each iteration, we consider the top $r$ recommendations of the algorithm for a user $u$, and we normalize their utility values by the utility value of the top recommendation. We then assume that the user accepts a recommendation with probability equal to the normalized score. The accepted recommendations are added to the data, and they are fed as input to the next iteration of the recommendation algorithm.
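The acceptance step of one iteration can be sketched as follows. This is an illustrative reading of the description above, not the authors' code; the function name and the data layout (a dictionary mapping each user to a list of `(item, utility)` pairs sorted by decreasing utility) are assumptions.

```python
import random


def accept_recommendations(S, recs, rng):
    """Add probabilistically accepted recommendations to S; return how many."""
    accepted = 0
    for u, ranked in recs.items():
        if not ranked or ranked[0][1] <= 0:
            continue  # nothing useful to offer this user
        top = ranked[0][1]
        for item, utility in ranked:
            # Acceptance probability: utility normalized by the top utility.
            if rng.random() < utility / top:
                S[u][item] = 1
                accepted += 1
    return accepted


# Toy example (hypothetical data): 2 users, 4 items.
S = [[1, 0, 0, 0],
     [0, 1, 0, 0]]
recs = {
    0: [(1, 0.8), (2, 0.4), (3, 0.0)],
    1: [(0, 0.5), (3, 0.0)],
}
n_accepted = accept_recommendations(S, recs, random.Random(0))
print(S, n_accepted)
```

Under this scheme the top recommendation of each user is always accepted (normalized score 1), while zero-utility items never are. A full simulation would recompute the recommendations from the updated matrix before each new iteration.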
We apply this iterative algorithm on a dataset with two equally but oppositely biased groups, as described in Section 3.2.1. The results of this iterative experiment are shown in Figure 3(a), where we plot the average preference ratio for each iteration. Iteration 0 corresponds to the input data. In our experiment a user accepts on average 7 recommendations. For this experiment we set the number to 50.
We observe that even with the probabilistic acceptance of recommendations, there is a clear long-term effect of the recommendation bias. For small values of input bias, we observe a decrease, in line with the observations in Figure 1(a). For these values of bias, the recommender reduces bias and smooths out differences. The value of preference ratio 0.6 remains more or less constant, while for larger values the bias in the data increases. Therefore, for large values of bias the recommender has a reinforcing effect, which in the long term will lead to polarized groups of users.
4. Bias disparity on Real Data
In this experiment, we use the MovieLens 1M dataset (https://grouplens.org/datasets/movielens/1m/). We consider as categories the genres Action and Romance, with 468 and 463 movies respectively. We extract a subset of users that have at least 90 ratings in these categories, resulting in 1,259 users. These users consist of 981 males and 278 females.
In Table 1, we show the input/output bias and in parentheses the bias disparity for each group-category combination. The right part of the table reports these numbers when the user groups are balanced, by selecting a random sample of 278 males. We observe that males are biased in favor of Action movies while females prefer Romance movies. The application of UserKNN increases the output bias for males, whose input bias is strong. Females are only moderately biased in favor of Romance movies; hence, their output bias is drawn towards Action items. We observe a very similar picture for balanced data, indicating that the changes in bias are not due to the group imbalance.
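Table 1. Input/output bias, with bias disparity in parentheses.

Group | Unbalanced: Action | Unbalanced: Romance | Balanced: Action | Balanced: Romance
M     | 1.39/1.67 (0.2)    | 0.58/0.28 (-0.51)   | 1.40/1.66 (0.18) | 0.57/0.29 (-0.49)
F     | 0.97/1.14 (0.17)   | 1.03/0.85 (-0.17)   | 0.97/1.08 (0.11) | 1.03/0.92 (-0.10)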
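As a quick arithmetic check of how the parenthesized values relate to the input/output pairs, the relative change of the bias can be recomputed from the tabulated entries for two cells of the unbalanced setting. Small deviations from the reported numbers are expected, since the tabulated biases are themselves rounded.

```latex
\[
BD(\text{M}, \text{Action}) = \frac{B_R - B_S}{B_S} = \frac{1.67 - 1.39}{1.39} \approx 0.20
\]
\[
BD(\text{F}, \text{Romance}) = \frac{B_R - B_S}{B_S} = \frac{0.85 - 1.03}{1.03} \approx -0.17
\]
```

A positive value means the recommendations amplify the input bias for that cell, while a negative value means the output bias is lower than the input bias.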
5. Correcting Bias Disparity
To address the problem of bias disparity, we consider an algorithm that performs post-processing of the recommendations. Our goal is to adjust the set of items recommended to users so as to ensure that there is no bias disparity. In addition, we would like the new recommendation set to have the maximum possible utility.
Abusing the notation, let $R$ denote the set of user-item pairs produced by our recommendation algorithm, where $(u,i) \in R$ denotes that user $u$ was recommended item $i$. We will refer to the pair $(u,i)$ as a recommendation. The set $R$ contains $r$ recommendations for each user, thus $rn$ recommendations in total. Let $V(R) = \sum_{(u,i) \in R} V(u,i)$ denote the total utility of the recommendations in set $R$. Since $R$ contains for each user $u$ the top $r$ items with the highest utility, $R$ has the highest possible total utility, that is, the minimum utility loss.
We want to adjust the set $R$ so as to ensure that the bias of each group in $R$ is the same as the one in the input data. Since we have two categories, it suffices to have $B_R(G,C) = B_S(G,C)$ for one category $C$. Without loss of generality assume that $B_R(G,C) > B_S(G,C)$. Let $\bar{C}$ denote the category other than $C$.
We decrease the output bias by swapping recommendations $(u,i)$ of category $C$ with recommendations $(u,j)$ of category $\bar{C}$. We use a simple greedy algorithm that at each step performs the swap that incurs the minimum utility loss. The utility loss incurred by swapping $(u,i)$ with $(u,j)$ is $V(u,i) - V(u,j)$. The candidate swaps can be computed by pairing for each user $u$ the lowest-ranked recommendation $(u,i)$ in $R$ from category $C$ with the highest-ranked recommendation $(u,j)$ not in $R$ from category $\bar{C}$. We repeat such swaps until the desired number of swaps has been performed. This algorithm is efficient, and it is easy to show that it is optimal, in the sense that it will produce the set of recommendations with the highest utility among all sets with no bias disparity. We refer to this algorithm as the GULM (Group Utility Loss Minimization) algorithm.
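The greedy swapping procedure described above can be sketched with a priority queue of candidate swaps. This is an illustrative sketch for a single group, not the authors' implementation: the function name `gulm`, its arguments, and the data layout are assumptions, and the number of swaps is taken as given (for example, the difference between the current and the target number of recommendations in the over-represented category).

```python
import heapq


def gulm(recs, candidates, utility, in_c, num_swaps):
    """Greedily swap category-C recommendations for items of the other category.

    recs[u]: items recommended to u; candidates[u]: items not recommended to u;
    utility[(u, i)]: utility value; in_c(i): True if item i is in category C.
    Returns the adjusted recommendations and the total utility loss.
    """
    new_recs = {u: list(items) for u, items in recs.items()}
    removable, addable, heap = {}, {}, []

    def push(u, k):
        # k-th candidate swap of user u, if both sides still have items.
        if k < len(removable[u]) and k < len(addable[u]):
            i, j = removable[u][k], addable[u][k]
            heapq.heappush(heap, (utility[u, i] - utility[u, j], u, k))

    for u in recs:
        # Category-C recommendations, lowest utility first.
        removable[u] = sorted((i for i in recs[u] if in_c(i)),
                              key=lambda i: utility[u, i])
        # Other-category candidates, highest utility first.
        addable[u] = sorted((j for j in candidates[u] if not in_c(j)),
                            key=lambda j: -utility[u, j])
        push(u, 0)

    total_loss = 0.0
    for _ in range(num_swaps):
        if not heap:
            break  # no feasible swaps left
        loss, u, k = heapq.heappop(heap)
        i, j = removable[u][k], addable[u][k]
        new_recs[u][new_recs[u].index(i)] = j
        new_recs[u].sort(key=lambda x: -utility[u, x])
        total_loss += loss
        push(u, k + 1)
    return new_recs, total_loss


# Toy example (hypothetical data): category C = items 0..2.
utility = {("a", 0): 0.9, ("a", 1): 0.5, ("a", 3): 0.4, ("a", 2): 0.3,
           ("b", 2): 0.8, ("b", 4): 0.7, ("b", 5): 0.2}
recs = {"a": [0, 1], "b": [2, 4]}
candidates = {"a": [3, 2], "b": [5]}
new_recs, loss = gulm(recs, candidates, utility, lambda i: i <= 2, num_swaps=1)
print(new_recs, loss)
```

In the toy case the cheapest swap replaces item 1 with item 3 for user "a" (loss of about 0.1), rather than the costlier swap available for user "b". Keeping one candidate swap per user in the heap and advancing it after each swap keeps each step logarithmic in the number of users.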
By design, when we apply the GULM algorithm on the output of the recommendation algorithm, we eliminate bias disparity (modulo rounding errors) in the recommendations. We consider the iterative application of the recommendation algorithm, in the setting described in Section 3.4, again assuming that the probability of a recommendation being accepted depends on its utility. The results are shown in Figure 3(b). For values of preference ratio up to 0.65, we observe that bias remains more or less constant after re-ranking. For larger values, there is some noticeable increase in the bias, albeit significantly smaller than before re-ranking. The increase is due to the fact that the recommendations introduced by GULM have a low probability of being accepted.
6. Conclusions
In this short paper, we performed a preliminary study of bias disparity in recommender systems, and the conditions under which it may appear. We view this analysis as a first step towards a systematic analysis of the factors that cause bias disparity. We intend to investigate more recommendation algorithms, and the case of numerical, rather than unary, ratings. We also want to better understand how the conditions we studied appear in real data.
References
 Burke et al. (2018) Robin Burke, Nasim Sonboli, and Aldo Ordonez-Gauger. 2018. Balanced Neighborhoods for Multi-sided Fairness in Recommendation. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency (Proceedings of Machine Learning Research), Sorelle A. Friedler and Christo Wilson (Eds.), Vol. 81. PMLR.
 Celma and Cano (2008) Òscar Celma and Pedro Cano. 2008. From Hits to Niches?: Or How Popular Artists Can Bias Music Recommendation and Discovery. In Proceedings of the 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition (NETFLIX ’08). ACM, 5:1–5:8.
 Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness Through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS ’12). ACM, 214–226.
 Hajian et al. (2016) Sara Hajian, Francesco Bonchi, and Carlos Castillo. 2016. Algorithmic Bias: From Discrimination Discovery to Fairness-aware Data Mining. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, 2125–2126.
 Kamishima et al. (2014) Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. 2014. Correcting Popularity Bias by Enhancing Recommendation Neutrality. In Poster Proceedings of the 8th ACM Conference on Recommender Systems, RecSys 2014, Foster City, Silicon Valley, CA, USA, October 6-10, 2014.
 Kunaver and Požrl (2017) Matevž Kunaver and Tomaž Požrl. 2017. Diversity in Recommender Systems: A Survey. Know.-Based Syst. 123, C (May 2017), 154–162.
 Ning and Karypis (2011) Xia Ning and George Karypis. 2011. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining (ICDM ’11). IEEE Computer Society, 497–506.
 Pitoura et al. (2017) Evaggelia Pitoura, Panayiotis Tsaparas, Giorgos Flouris, Irini Fundulaki, Panagiotis Papadakos, Serge Abiteboul, and Gerhard Weikum. 2017. On Measuring Bias in Online Information. CoRR abs/1704.05730 (2017).
 Su and Khoshgoftaar (2009) Xiaoyuan Su and Taghi M. Khoshgoftaar. 2009. A Survey of Collaborative Filtering Techniques. Adv. in Artif. Intell. 2009 (Jan. 2009).
 Yao and Huang (2017) Sirui Yao and Bert Huang. 2017. Beyond Parity: Fairness Objectives for Collaborative Filtering. CoRR abs/1705.08804 (2017).
 Zhao et al. (2017) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. CoRR abs/1707.09457 (2017).