Unifying Ensemble Methods for Q-learning
via Social Choice Theory
Abstract
Ensemble methods have been widely applied in Reinforcement Learning (RL) in order to enhance stability, increase convergence speed, and improve exploration. These methods typically work by employing an aggregation mechanism over the actions of different RL algorithms. We show that a variety of these methods can be unified by drawing parallels from committee voting rules in Social Choice Theory. We map the problem of designing an action aggregation mechanism in an ensemble method to a voting problem which, under different voting rules, yields popular ensemble-based RL algorithms like Majority Voting Q-learning or Bootstrapped Q-learning. Our unification framework, in turn, allows us to design new ensemble-RL algorithms with better performance. For instance, we map two diversity-centered committee voting rules, namely the Single Non-Transferable Voting rule and the Chamberlin-Courant rule, into new RL algorithms that demonstrate excellent exploratory behavior in our experiments.
1 Introduction
Ensemble methods such as bagging or boosting are used to improve a variety of Machine Learning predictors Zhang and Ma (2012) in a manner that decreases bias and variance or improves accuracy. In Reinforcement Learning (RL), ensemble methods have been used in a variety of ways: for achieving stability Anschel et al. (2016), robustness to perturbation Rajeswaran et al. (2016), or for improving exploration Osband et al. (2016). One popular method is to maintain a set of quality (Q) function estimates that serves as an approximation to the Q-function's posterior distribution, conditioned on the interaction history Osband et al. (2013). Quite similar to bagging Opitz and Maclin (1999), these Q-function estimates are trained on different state sequences in order to represent a close sample from the actual Q-function distribution Osband et al. (2016) which, upon aggregation, reduces variance. The mechanism used to aggregate the estimates can affect the algorithm's characteristics considerably Wiering and van Hasselt (2008).
Different ensemble RL algorithms employ alternate aggregation mechanisms: Majority Voting and Rank Voting Q-learning Faußer and Schwenker (2015); Harutyunyan et al. (2014) use ordinal aggregation mechanisms that depend on the order, rather than the value, of Q-function estimates; on the other hand, Boltzmann Addition, Boltzmann Multiplication, and Averaged Q-learning Marivate and Littman (2013); Wiering and van Hasselt (2008) use cardinal aggregation mechanisms. While a few empirical comparisons among these varied techniques have been made Wiering and van Hasselt (2008), to the best of our knowledge no broad unification of these variants, aimed at identifying the key differences that cause significant deviations in learning characteristics, has been attempted. Based on recent work on representing committee voting rules via committee scoring functions Elkind (2017), we attempt this unification in the paper and, in the process, port properties of election rules to RL algorithms.
By viewing the task of aggregating Q-function estimates for action selection as a multi-winner voting problem from Social Choice Theory (SCT), we present an abstraction over several ensemble RL algorithms based on an underlying voting rule. We consider the individual units (which we call heads) in the ensemble as the voters, the action choices as the candidates, and the heads' Q-values over the actions as the ballot preferences of the respective voters. With this perspective on the setting, the paper makes the following contributions.

We develop a generic mechanism based on multi-winner voting Elkind (2017) in SCT that exactly emulates the aggregation method of several of the aforementioned algorithms when plugged with an appropriate committee scoring function.
We believe that our mapping method helps correlate properties of voting rules with the characteristics of the respective RL algorithms. For instance, our experiments suggest a strong correlation between the Proportional Representation (PR) Bogdanor (1984) property of a voting rule and the consequent RL algorithm's exploration ability.
2 Background
Here, we provide a brief introduction to the RL problem, describe some ensemble-based RL algorithms, and discuss committee voting strategies. For brevity, we denote $\{1, \dots, n\}$ by $[n]$ and the uniform distribution over a set $X$ as $\mathcal{U}(X)$. Some notation is overloaded between Sections 2.2 and 2.3 to aid understanding.
2.1 Preliminaries
We consider a typical RL environment as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and $\gamma \in [0, 1)$ is the discount factor. $R(s, a)$ gives the finite reward of taking action $a$ in state $s$. $P(s' \mid s, a)$ gives the environment's transition probability to state $s'$ on taking action $a$ in $s$. We denote the space of all bounded real functions on $\mathcal{S} \times \mathcal{A}$ as $\mathcal{Q}$. A policy $\pi$ can be a deterministic mapping over actions, i.e. $\pi: \mathcal{S} \to \mathcal{A}$, or a stochastic distribution over the same. Every policy $\pi$ has a corresponding Q-function $Q^\pi \in \mathcal{Q}$ computed from the following recursive equation:
$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, Q^\pi(s', \pi(s')) \qquad (1)$$
An optimal policy $\pi^*$ is defined as a policy mapping that receives the maximum expected discounted return on the MDP, i.e.
$$Q^{\pi^*}(s, a) = \max_{\pi} Q^{\pi}(s, a) \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A} \qquad (2)$$
The vanilla Q-learning algorithm starts with a random Q-function estimate $Q_0$ and follows an off-policy exploratory strategy to generate observation tuples $(s_t, a_t, r_t, s_{t+1})$. The tuples are used to improve the estimate following the update
$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \Big( r_t + \gamma \max_{a' \in \mathcal{A}} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \Big) \qquad (3)$$
where $\alpha$ is the learning rate. A popular exploratory policy is the $\epsilon$-soft policy over the one-step greedy policy $\arg\max_{a} Q_t(s_t, a)$.
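As a concrete sketch, the update (3) and the $\epsilon$-soft behavior policy can be written as follows; the state/action counts and hyperparameter values here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 10, 4        # illustrative sizes (hypothetical environment)
gamma, alpha, eps = 0.9, 0.2, 0.1  # discount, learning rate, exploration rate

Q = np.zeros((n_states, n_actions))  # initial Q-function estimate

def epsilon_greedy(s):
    """Epsilon-soft policy over the one-step greedy policy argmax_a Q(s, a)."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    """Update (3): Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```

Each head of an ensemble (Section 2.2) runs exactly this update on its own estimate; only the action selection differs across the algorithms below.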
2.2 Ensemble Qlearning Algorithms
Ensemble Q-learning algorithms employ a set of Q-function estimates for making action decisions. For an ensemble with $n$ heads, we denote the Q-function estimates at time $t$ as $\{Q^1_t, \dots, Q^n_t\}$. While the update rule for each estimate remains the same as (3), the action aggregation strategy is what differs across ensemble algorithms.
The Majority Voting Q-learning strategy selects the action with the highest number of votes, where every ensemble head votes for its greedy action (ties broken arbitrarily).
$$a_t = \arg\max_{a \in \mathcal{A}} \sum_{i=1}^{n} \mathbb{1}\Big[a = \arg\max_{a' \in \mathcal{A}} Q^i_t(s_t, a')\Big] \qquad (4)$$
In Rank Voting Q-learning, every head supplies its integer preferences $p^i_t: \mathcal{A} \to [|\mathcal{A}|]$ on actions, assigned in an order-preserving manner based on the Q-function estimates. The action with the highest cumulative preference is selected.
$$a_t = \arg\max_{a \in \mathcal{A}} \sum_{i=1}^{n} p^i_t(a) \qquad (5)$$
Average Q-learning Faußer and Schwenker (2015) uses the average of the ensemble Q-function estimates to decide the greedy action.
$$a_t = \arg\max_{a \in \mathcal{A}} \frac{1}{n} \sum_{i=1}^{n} Q^i_t(s_t, a) \qquad (6)$$
Bootstrapped Q-learning, similar to Bootstrapped DQN Osband et al. (2016), samples a head from the ensemble per episode and uses its greedy action to make decisions. Let $B(t)$ give the bootstrap head sampled for the episode that timestep $t$ falls in.
$$a_t = \arg\max_{a \in \mathcal{A}} Q^{B(t)}_t(s_t, a) \qquad (7)$$
Boltzmann Addition Q-learning Wiering and van Hasselt (2008) averages the Boltzmann probabilities of actions over the ensemble and uses the result as the distribution for action choices.
$$\pi_t(a \mid s_t) = \frac{1}{n} \sum_{i=1}^{n} \frac{e^{Q^i_t(s_t, a)/\tau}}{\sum_{a' \in \mathcal{A}} e^{Q^i_t(s_t, a')/\tau}} \qquad (8)$$
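The aggregation rules (4)-(6) and (8) can be sketched for tabular estimates as below; `Qs` is a list of per-head arrays of shape (states, actions), and the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def majority_vote(Qs, s):
    """Majority Voting (4): each head votes for its greedy action."""
    votes = np.bincount([int(np.argmax(Q[s])) for Q in Qs],
                        minlength=Qs[0].shape[1])
    return int(np.argmax(votes))

def rank_vote(Qs, s):
    """Rank Voting (5): order-preserving integer preferences, summed over heads."""
    prefs = sum(np.argsort(np.argsort(Q[s])) for Q in Qs)
    return int(np.argmax(prefs))

def average_q(Qs, s):
    """Average Q-learning (6): greedy action of the mean Q-function."""
    return int(np.argmax(np.mean([Q[s] for Q in Qs], axis=0)))

def boltzmann_addition(Qs, s, tau=1.0):
    """Boltzmann Addition (8): sample from the mean of per-head softmaxes."""
    def softmax(q):
        z = np.exp((q - q.max()) / tau)  # shifted for numerical stability
        return z / z.sum()
    probs = np.mean([softmax(Q[s]) for Q in Qs], axis=0)
    return int(rng.choice(len(probs), p=probs))
```

Note that the first three rules are deterministic given the estimates, while Boltzmann Addition defines a stochastic policy, which is why Proposition 2 treats it separately.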
2.3 Committee Voting
Committee voting in Social Choice Theory Aziz et al. (2017a); Elkind et al. (2017) deals with scenarios where a society, represented by a set of voters $V$, needs to elect a representative committee $W$ from among candidates $C$, with $|C| = m$. Every voter $v$ has a preference order over the candidates, denoted by a utility function $u_v: C \to \mathbb{R}$. For a voter $v$, let $\mathrm{rank}_v: C \to [m]$ be a function that maps candidates to their ranks based on the utility function $u_v$, from higher utility to lower. Elkind et al. (2017) classified several ordinal voting rules as committee scoring rules, which can be succinctly represented by a committee scoring function $f(u_v, W)$ that maps utility-committee pairs to a real number. Scoring functions can be used to describe even some cardinal voting rules. The following is a list of such rules with their respective scoring functions.
The Plurality or Single Non-Transferable Voting (SNTV) Grofman et al. (1999) rule uses a scoring function that returns $1$ if the committee contains voter $v$'s most preferred candidate, and $0$ otherwise. Let $\mathrm{top}(v) = \arg\max_{c \in C} u_v(c)$.
$$f_{\mathrm{SNTV}}(u_v, W) = \mathbb{1}\big[\mathrm{top}(v) \in W\big] \qquad (9)$$
The Bloc system's Ball (1951) scoring function returns the number of candidates in the committee that are among voter $v$'s top $k$ preferred candidates, where $k$ is the committee size.
$$f_{\mathrm{Bloc}}(u_v, W) = \big|\{c \in W : \mathrm{rank}_v(c) \le |W|\}\big| \qquad (10)$$
The Chamberlin-Courant rule Chamberlin and Courant (1983) depends on a satisfaction function $\phi_v$, which is an order-preserving map over the utility values $u_v$. The rule's scoring function outputs the maximum satisfaction score from a candidate in the committee.
$$f_{\mathrm{CC}}(u_v, W) = \max_{c \in W} \phi_v(c) \qquad (11)$$
The Borda rule's de Borda (1953) scoring function uses the Borda score $\beta_v$ which, for a voter $v$, assigns the same value as the satisfaction $\phi_v$. The committee's score with respect to a voter is simply the sum of the individual scores.
$$f_{\mathrm{Borda}}(u_v, W) = \sum_{c \in W} \beta_v(c) \qquad (12)$$
Majority Judgment Felsenthal and Machover (2008) is a cardinal voting rule whose scoring function outputs the sum of the utility values of the candidates in the committee.
$$f_{\mathrm{MJ}}(u_v, W) = \sum_{c \in W} u_v(c) \qquad (13)$$
The Lottery rule or Random Ballot Amar (1984) is a stochastic voting rule in which a voter is selected at random and their most preferred candidate is elected. Let $\hat{v}$ be a random voter, i.e. $\hat{v} \sim \mathcal{U}(V)$. Then, for a masking function $\mathbb{1}_{\hat{v}}$, defined as $\mathbb{1}_{\hat{v}}(v) = 1$ if $v = \hat{v}$ and $0$ otherwise, the scoring function with respect to a voter is the masked greedy utility.
$$f_{\mathrm{Lottery}}(u_v, W) = \mathbb{1}_{\hat{v}}(v) \cdot \max_{c \in W} u_v(c) \qquad (14)$$
Given an election pair $E = (V, C)$, a committee size $k$, and a committee scoring rule $f$, the election result Elkind et al. (2017) is given by
$$W^*_k = \arg\max_{W \subseteq C,\, |W| = k} \sum_{v \in V} f(u_v, W) \qquad (15)$$
and the winning score is the maximum score. Voting rules may or may not satisfy certain desirable properties such as committee monotonicity, proportional representation (PR), consistency, etc. In (15), we break ties to preserve committee monotonicity whenever possible, i.e. we try to ensure $W^*_k \subseteq W^*_{k+1}$ for all $k$.
While the election result for the Plurality, Borda, and Bloc rules can be computed by a polynomial-time greedy algorithm, the result for the Chamberlin-Courant rule has been shown to be NP-complete to compute Procaccia et al. (2008). We can, however, get approximate results within a factor of $1 - \frac{1}{e}$, for a fixed satisfaction function, through greedy algorithms Lu and Boutilier (2011).
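The scoring rules above admit a direct brute-force sketch of (9)-(12) and (15); this is exponential in the number of candidates and is only meant to make the definitions concrete for small elections. Utilities are rows of a voters-by-candidates array, and all function names are ours:

```python
from itertools import combinations

import numpy as np

def sntv_score(u, committee):
    """Plurality/SNTV (9): 1 iff the voter's top candidate is in the committee."""
    return 1.0 if int(np.argmax(u)) in committee else 0.0

def bloc_score(u, committee, k):
    """Bloc (10): committee members among the voter's top-k candidates."""
    top_k = set(np.argsort(u)[::-1][:k])
    return float(len(top_k & set(committee)))

def cc_score(u, committee):
    """Chamberlin-Courant (11), taking satisfaction = utility."""
    return max(u[c] for c in committee)

def borda_score(u, committee):
    """Borda (12): summed order-preserving integer scores (0 = least preferred)."""
    ranks = np.argsort(np.argsort(u))
    return float(sum(ranks[c] for c in committee))

def elect(utilities, k, score):
    """Election result (15): the size-k committee maximizing the summed score."""
    return max(combinations(range(utilities.shape[1]), k),
               key=lambda W: sum(score(u, W) for u in utilities))
```

For example, with two voters preferring candidate 0 and one preferring candidate 2, `elect(U, 2, cc_score)` returns the committee covering both blocs; Bloc's extra `k` can be bound with a lambda, e.g. `elect(U, 2, lambda u, W: bloc_score(u, W, 2))`.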
3 Unification Framework
The intuition behind the unification is the similarity between the voting rules and the action aggregation mechanisms in RL ensembles. Let's consider ensemble heads as voters that, on perceiving a state $s_t$ at time $t$, cast preferences over the action set $\mathcal{A}$ in the form of their Q-function estimates, or a softmax over them, i.e.
$$u_i(a) = Q^i_t(s_t, a) \qquad (16)$$
$$\tilde{u}_i(a) = \frac{e^{Q^i_t(s_t, a)/\tau}}{\sum_{a' \in \mathcal{A}} e^{Q^i_t(s_t, a')/\tau}} \qquad (17)$$
For the sake of simplicity, we have omitted the inputs $s_t$ and $t$ from the utility function notation, as they are identical across the ensemble for each decision reconciliation instance. Applying the committee voting rules described in Section 2.3 to these utility functions is identical to the action selection methods of several ensemble algorithms described in Section 2.2. Figure 1 describes this aggregation mechanism.
The following propositions establish the equivalence between voting rules on the election pair $(V, \mathcal{A})$, with voter utilities given by the Q-function estimates (16), and action reconciliation in several ensemble Q-learning algorithms.
Proposition 1.
Plurality and Bloc voting rules map to Majority Voting Q-learning; Chamberlin-Courant and Borda voting rules map to Rank Voting Q-learning; the Majority Judgment voting rule maps to Average Q-learning. In brief,
$$W^{*,\mathrm{SNTV}}_1 = W^{*,\mathrm{Bloc}}_1 = \{a^{(4)}_t\}, \quad W^{*,\mathrm{CC}}_1 = W^{*,\mathrm{Borda}}_1 = \{a^{(5)}_t\}, \quad W^{*,\mathrm{MJ}}_1 = \{a^{(6)}_t\} \qquad (18)$$
where $a^{(j)}_t$ denotes the action selected by equation $(j)$.
Even in the case of Boltzmann Addition Q-learning, which uses a stochastic policy, the probabilities of the actions can be represented using the committee voting aggregation mechanism, as stated by the following proposition.
Proposition 2.
The Majority Judgment voting rule on the softmax utilities (17) gives Boltzmann Addition Q-learning's policy, i.e., if ties are broken in a manner that preserves committee monotonicity and $W^*_0 = \emptyset$ with winning score $s^*_0 = 0$, then for the action $a_k$ defined as
$$\{a_k\} = W^*_k \setminus W^*_{k-1} \qquad (19)$$
for an integer $k \in [|\mathcal{A}|]$, the Boltzmann Addition probability for $a_k$ is
$$\pi_t(a_k \mid s_t) = \frac{s^*_k - s^*_{k-1}}{n} \qquad (20)$$
where $s^*_k$ denotes the winning score for committee size $k$.
While the action selection strategy in Bootstrapped Q-learning is quite similar to the Lottery rule, as the decisions in both are made based on a sampled head's / ballot's utility, they differ in sampling frequency. Lottery voting samples a new ballot every election, whereas Bootstrapped Q-learning uses a sampled agent for multiple steps (an episode). This can be accounted for in the mechanism by modifying the masking function used in (14) to use the bootstrap head (kept fixed throughout an episode) for the episode ongoing at timestep $t$.
Proposition 3.
The Lottery voting rule maps to Bootstrapped Q-learning when ballots are drawn randomly every episode, i.e. if $\hat{v} = B(t)$, where $B(t)$ gives the sampled head for the episode at timestep $t$, then
$$W^*_1 = \Big\{\arg\max_{a \in \mathcal{A}} Q^{B(t)}_t(s_t, a)\Big\} \qquad (21)$$
4 Multiwinner RL Algorithms
With the exception of Boltzmann Addition Q-learning, all the RL algorithms discussed so far mapped to committee voting rules with committee sizes restricted to $k = 1$. A single candidate does not satisfactorily represent a community of voters. These RL algorithms, therefore, can be improved by extending them to allow multi-winner action committees. Unfortunately, the modelling of the typical RL paradigm forbids the selection of multiple actions at a state. Some environments do support backtracking, i.e. they allow the agent to retrace to a state multiple times in order to try multiple actions De Asis et al. (2018). However, we instead follow a straightforward approach of randomly sampling an action from the winning committee, i.e.
$$a_t \sim \mathcal{U}(W^*_k) \qquad (22)$$
This uniform sampling promotes diversity in action selection, as all the winning candidate actions receive an equal chance of selection regardless of the voters' backing.
4.1 Dynamic Committee Resizing
Deciding a suitable committee size for the multi-winner extension of ensemble RL algorithms is crucial. Since every policy decision involves conducting an election over action choices, a static committee size cannot be expected to be optimal for all observations throughout learning. We therefore propose dynamic committee resizing based on a satisfaction threshold hyperparameter, which elects committees of varying sizes across elections. Described as the ELECTION procedure in Algorithm 1, this subroutine is similar to the classic heuristic analyzed by Nemhauser et al. (1978) for optimizing submodular set functions. We start with an empty committee and populate it iteratively with the greedy candidate action that best improves the satisfaction score until the threshold is reached or all candidates are included.
Using this subroutine, we extend classic Q-learning to a generic ENSEMBLE Q-LEARNING algorithm described in Algorithm 1. For a zero threshold, this procedure mimics existing ensemble algorithms when executed with the respective scoring functions mentioned in Section 3, and it extends to novel ensemble RL algorithms with interesting properties for nonzero thresholds.
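A sketch of the greedy ELECTION subroutine under our reading of the description above (the threshold semantics and names are our assumptions): grow the committee with the action giving the largest marginal gain in total score until the satisfaction threshold is met or every action is included:

```python
import numpy as np

def election(utilities, score, threshold):
    """Greedy dynamic-resizing election: iteratively add the candidate action
    with the best marginal improvement in summed score; stop once the total
    reaches `threshold` or all actions are included."""
    n_actions = utilities.shape[1]
    committee, total = [], 0.0
    while len(committee) < n_actions:
        # Marginal gain of each remaining candidate action.
        gains = {a: sum(score(u, committee + [a]) for u in utilities) - total
                 for a in range(n_actions) if a not in committee}
        best = max(gains, key=gains.get)
        committee.append(best)
        total += gains[best]
        if total >= threshold:
            break
    return committee
```

The action is then sampled uniformly from the returned committee as in (22). Note that with a threshold of zero the loop stops after a single candidate, recovering the single-winner behaviour.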
4.2 Properties of Voting Rules in RL
A wide variety of axiomatic studies of multi-winner voting rules Elkind et al. (2017); Faliszewski et al. (2017) have been conducted, providing analysis of several properties such as consistency, solid coalitions, unanimity, etc. When voting rules are mapped to an ensemble RL algorithm via the aforementioned procedure, these properties affect the characteristics of the algorithm. Therefore, analyzing the effect of voting properties on their RL counterparts helps in understanding existing algorithms as well as proposing improved ones. Effective exploration is a highly desirable characteristic of RL algorithms that has been well studied Thrun (1992) for improving the sample complexity of any given task. Using our generic unified ensemble RL algorithm, we performed a study on the correlation between efficient exploration and the proportional representation property for various single- and multi-winner voting rules. While there are several PR-centric voting rules, such as proportional approval voting, reweighted approval voting Aziz et al. (2017a), and the Monroe rule Monroe (1995), we limited ourselves to committee scoring rules due to the design of our generic algorithm, which expects a scoring function. The Chamberlin-Courant rule Chamberlin and Courant (1983) and Random Ballot Amar (1984) are two such rules that have been shown to promote PR. We also consider SNTV (although it does not promote PR) because it has been shown to favour minorities Cox (1994), which increases diversity. Our experiments suggest that rules that demonstrate PR and diversity manifest as excellent exploratory RL algorithms.
5 Experiments
5.1 Environments
The first evaluation was done on a combination of the corridor MDP from Dueling Architectures Wang et al. (2015) and an MDP from Bootstrapped DQN Osband et al. (2016). As shown in Figure 2, this combined corridor MDP consists of a linear chain of states, out of which only two states confer rewards. The start state is randomly selected based on a Binomial distribution biased towards the low-rewarding state. Every non-terminal state has several actions, out of which randomly selected actions are non-operational while the other two lead to transitions to the adjacent left and right states. Every episode has a fixed length, after which it is terminated.
On five instances of this environment with increasing action-set sizes, we evaluated two sets of similar ensemble algorithms: (i) Bloc, Majority Voting, and SNTV Q-learning, and (ii) Borda, CCR, and Rank Voting Q-learning. The results are shown in Figure 4. SNTV and Bloc Q-learning are multi-winner extensions of Majority Voting Q-learning using the dynamic resizing method discussed in Section 4.1. Similarly, CCR and Borda Q-learning are multi-winner extensions of Rank Voting Q-learning.
Next, we evaluated the ensemble algorithms on a test bed of gridworld puzzles Chevalier-Boisvert and Willems (2018), shown in Figure 3. The environments are partially observable, with varying objectives such as navigating to the goal, finding the correct key, and unlocking the correct door. The action set is discrete, and the rewards in all of these environments are extremely sparse: the agent receives a positive reward on successful completion of the task and zero otherwise. Several ensemble RL algorithms were trained on these environments for 2 million steps each across 48 different seeds (i.e. 48 random gridworld maps). For all the runs, the ensemble size was fixed, the discount factor was 0.9, the learning rate was 0.2, and the exploration schedule was linear (annealed from 1 to 0.001 over 1 million steps). The scoring function thresholds for the multi-winner algorithms were manually tuned, with one value shared by SNTV and Bloc and another by CCR and Borda.
5.2 Results
Table 1: Maximum mean score on the gridworld environments for single-winner (Average, Rank, Majority, Lottery) and multi-winner (Bloc, Borda, SNTV, CCR) voting rules.

Environment                       Average   Rank   Majority   Lottery   Bloc   Borda   SNTV   CCR
MiniGrid-DoorKey-16x16-v0           0.08    0.06     0.37       0.38    0.41   0.11    0.49   0.58
MiniGrid-MultiRoom-N6-v0            0.00    0.01     0.13       0.47    0.40   0.00    0.47   0.47
MiniGrid-KeyCorridorS4R3-v0         0.10    0.09     0.53       0.90    0.22   0.13    0.92   0.92
MiniGrid-ObstructedMaze-2Dl-v0      0.07    0.08     0.20       0.52    0.16   0.06    0.63   0.64
Figure 4 shows the results of the evaluation on the corridor MDP. Table 1 lists the performance comparison for the evaluations on the gridworld environments. The metric used is the maximum value of the exponential moving averages, averaged across the 48 runs. To compute it, we form exponential moving average estimates for each run and use them to sample 100 re-estimates at equidistant points (a step difference of 20,000). The samples are then averaged across runs and the maximum value is selected.
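A sketch of this metric under our reading (EMA initialized at zero, with an assumed smoothing factor; both are our choices, not stated in the paper):

```python
import numpy as np

def max_mean_ema(returns_per_run, alpha=0.01, n_points=100):
    """Per-run exponential moving average, re-sampled at n_points equidistant
    steps, averaged across runs; the maximum of the mean curve is reported."""
    curves = []
    for returns in returns_per_run:
        ema, smoothed = 0.0, []
        for r in returns:
            ema = alpha * r + (1 - alpha) * ema
            smoothed.append(ema)
        # Re-estimate the smoothed curve at equidistant indices.
        idx = np.linspace(0, len(smoothed) - 1, n_points).astype(int)
        curves.append(np.asarray(smoothed)[idx])
    return float(np.max(np.mean(curves, axis=0)))
```

Averaging the smoothed curves before taking the maximum rewards algorithms that succeed consistently across seeds, rather than ones with a single lucky run.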
The results suggest that, in general, the multi-winner voting RL algorithms fare much better than the single-winner variants, highlighting the efficacy of the threshold-based dynamic resizing method. Moreover, we see a consistent pattern of the PR- and diversity-based algorithms (CCR, SNTV, and Lottery Q-learning) beating other ensemble Q-learning techniques and being more resilient to increasing exploration difficulty. We believe this provides an alternate explanation of why Bootstrapped DQN Osband et al. (2016) does deep exploration: the random sampling of heads is akin to the Lottery voting rule, which exhibits PR.
6 Concluding Discussion
In this paper, we presented a committee-voting-based unified aggregation mechanism that generalizes several ensemble-based Q-learning algorithms. By proposing a dynamically resizing committee election mechanism, we extended classical Q-learning to a generic ensemble RL algorithm. On plugging a multi-winner committee voting rule into this generic procedure, we see that the resulting RL algorithm manifests the underlying voting rule's properties. For instance, proportional-representation-centric voting rules such as Chamberlin-Courant and Random Ballot exhibit an improvement in exploration when used with the generic algorithm, as seen in our experiments on fabricated MDPs as well as complex gridworld environments.
While our analysis focused only on the exploratory behaviour of ensemble RL algorithms, one may investigate other properties, such as stability under environmental perturbation, via the application of Gehrlein-stable voting rules Gehrlein (1985); Aziz et al. (2017b) such as the Minimal Size of External Opposition (SEO) and Minimal Number of External Defeats (NED) rules Coelho (2005). Several other multi-winner voting rules could potentially be of interest for modifying an RL algorithm's traits, and our work provides a method to study them.
References
Choosing representatives by lottery voting. The Yale Law Journal 93 (7), pp. 1283–1308.
Averaged-DQN: variance reduction and stabilization for deep reinforcement learning. arXiv:1611.01929.
Justified representation in approval-based committee voting. Social Choice and Welfare 48 (2), pp. 461–485.
The Condorcet principle for multi-winner elections: from shortlisting to proportionality. arXiv:1701.08023.
Bloc voting in the General Assembly. International Organization 5 (1), pp. 3–31.
What is Proportional Representation. Martin Robertson & Company Ltd.: Oxford.
Representative deliberations and representative decisions: proportional representation and the Borda rule. American Political Science Review 77 (3), pp. 718–733.
Minimalistic gridworld environment for OpenAI Gym. GitHub. https://github.com/maximecb/gymminigrid
Understanding, evaluating and selecting voting rules through games and axioms. Universitat Autònoma de Barcelona.
Strategic voting equilibria under the single non-transferable vote. American Political Science Review 88 (3), pp. 608–621.
Multi-step reinforcement learning: a unifying algorithm. In AAAI.
Mémoire sur les élections au scrutin, 1781. Histoire de l'Académie Royale des Sciences, Paris.
Properties of multiwinner voting rules. Social Choice and Welfare 48 (3), pp. 599–632.
Justified representation in multiwinner voting: axioms and algorithms. In FSTTCS.
Multiwinner rules on paths from k-Borda to Chamberlin–Courant. In IJCAI, pp. 192–198.
Neural network ensembles in reinforcement learning. Neural Processing Letters 41 (1), pp. 55–69.
The majority judgement voting procedure: a critical evaluation. Homo Oeconomicus 25 (3/4), pp. 319–334.
The Condorcet criterion and committee selection. Mathematical Social Sciences 10 (3), pp. 199–209.
Elections in Japan, Korea, and Taiwan under the single non-transferable vote: the comparative study of an embedded institution. University of Michigan Press.
Off-policy shaping ensembles in reinforcement learning. arXiv:1405.5358.
Budgeted social choice: from consensus to personalized decision making. In IJCAI, pp. 280–286.
An ensemble of linearly combined reinforcement-learning agents. In AAAI Workshop, pp. 77–79.
Fully proportional representation. American Political Science Review 89 (4), pp. 925–940.
An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming 14 (1), pp. 265–294.
Electoral engineering: voting rules and political behavior. Cambridge University Press.
Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research 11, pp. 169–198.
(More) efficient reinforcement learning via posterior sampling. In NIPS.
Deep exploration via bootstrapped DQN. In NIPS, pp. 4026–4034.
On the complexity of achieving proportional representation. Social Choice and Welfare 30 (3), pp. 353–362.
EPOpt: learning robust neural network policies using model ensembles. arXiv:1610.01283.
Efficient exploration in reinforcement learning.
Dueling network architectures for deep reinforcement learning. arXiv:1511.06581.
Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B 38 (4), pp. 930–936.
Ensemble machine learning: methods and applications. Springer.
Appendix A Appendix
A.1 Proof of Proposition 1
From the definitions of the voting rules, we can see that in the case when the committee size is $k = 1$, (9) is identical to (10) and (11) is identical to (12). Therefore, showing equivalence for either rule in a pair is sufficient.
For plurality, this equivalence is established as follows.
$$W^*_1 = \arg\max_{\{a\}:\, a \in \mathcal{A}} \sum_{i=1}^{n} \mathbb{1}\big[\mathrm{top}(i) \in \{a\}\big] = \Big\{\arg\max_{a \in \mathcal{A}} \sum_{i=1}^{n} \mathbb{1}\big[a = \arg\max_{a'} Q^i_t(s_t, a')\big]\Big\} = \{a_t\} \qquad (23)$$
The proof for the remaining rules follows a similar flow.
A.2 Proof of Proposition 2
Let $g(a) = \sum_{i=1}^{n} \tilde{u}_i(a)$. Let the size-$k$ winning committee be denoted as $W^*_k$ and the winning score as $s^*_k$. We can express $s^*_k$ as
$$s^*_k = \max_{W \subseteq \mathcal{A},\, |W| = k} \sum_{i=1}^{n} \sum_{a \in W} \tilde{u}_i(a) = \max_{W \subseteq \mathcal{A},\, |W| = k} \sum_{a \in W} g(a) \qquad (24)$$
A voting rule is a best-$k$ rule if there exists a preference function such that, for each election $E$, the result for any committee size $k$ is the same as the top-$k$ ranked candidates under that function. One may easily verify using (24) that for the Majority Judgment voting rule the preference function is $g$, and therefore it is a best-$k$ rule. Elkind et al. (2017) showed that all best-$k$ voting rules follow committee monotonicity and vice versa. Coupled with the tie-breaking constraint, we can therefore ensure that $W^*_{k-1} \subseteq W^*_k$ holds.
Let the extra element be $a_k$, i.e. $\{a_k\} = W^*_k \setminus W^*_{k-1}$. The difference in scores is
$$s^*_k - s^*_{k-1} = g(a_k) = \sum_{i=1}^{n} \tilde{u}_i(a_k) = n\, \pi_t(a_k \mid s_t) \qquad (25)$$
A.3 Proof of Proposition 3
Except for the inclusion of the bootstrapping mask, the proof follows along the lines of Proposition 1.