A Roadmap for End-to-End Robust Alignment

A Roadmap for End-to-End Robust Alignment


As algorithms are becoming more and more data-driven, the greatest lever we have left to make them robustly beneficial to mankind lies in the design of their objective functions. Robust alignment aims to address this design problem. Arguably, the growing importance of social medias’ recommender systems makes it an urgent problem, for instance to adequately automate hate speech moderation. In this paper, we propose a preliminary research program for robust alignment. This roadmap aims at decomposing the end-to-end alignment problem into numerous more tractable subproblems. We hope that each subproblem is sufficiently orthogonal to others to be tackled independently, and that combining the solutions to all such subproblems may yield a solution to alignment.

1 Introduction

As they are becoming more capable and ubiquitous, algorithms are raising numerous concerns, including fairness, privacy, filter bubbles, addiction, job displacement or even existential risks [74, 85]. It has been argued that aligning the goals of algorithmic systems with humans’ preferences would be an efficient way to make them reliably beneficial and to avoid potential catastrophic risks [75]. In fact, given the global influence of today’s large-scale recommender systems [50], alignment has already become urgent. Indeed, such systems currently do not seem to be robustly beneficial, since they seem to recommend unscientific claims to hundreds of millions of users on important topic like climate change [1], and to be favoring the radicalization of many users [72].

Crucially, in practice, the alignment of algorithms needs to be made robust. This means that it ought to be resilient to algorithmic bugs, flawed world models [37], reward hacking [2], biased data [60], unforeseen side effects [7], distributional shift [78], adversarial attacks [35] and moral uncertainty [57]. Robust alignment seems crucial in complex environments with diverging human preferences, like hate speech algorithmic moderation on social medias.

Unfortunately, robust alignment has been argued to be an extremely difficult problem [12, 29]. To address it, there have been a few attempts to list key research directions towards robust alignment [79, 82]. This paper aims to contribute to this line of work by outlining in a structured and compelling manner the main challenges posed by robust alignment. Given that the robustness of a system is often limited by its weakest component, robust alignment demands that we consider our algorithmic systems in their entirety. This motivates the proposal of a roadmap for robust end-to-end alignment, from data collection to algorithmic output.

While much of our proposal is speculative, we believe that several of the ideas presented here will be critical for the safety and alignment of algorithms. More importantly, we hope that this will be a useful roadmap to better estimate how they can best contribute to the effort. Given the complexity of the problem, our roadmap here will likely be full of gaps and false good ideas. Our proposal is nowhere near definite or perfect. Rather, we aim at presenting a sufficiently good starting point for others to build upon.

2 The Roadmap

Our roadmap consists of identifying key steps to robust alignment. For the sake of exposition, these steps will be personified by 5 characters, called Alice, Bob, Charlie, Dave and Erin. Roughly speaking, Erin will be collecting data from the world, Dave will use these data to infer the likely states of the world, Charlie will compute the desirability of the likely states of the world, Bob will derive incentive-compatible rewards to motivate Alice to take the right decision, and Alice will optimize decision-making. This decomposition is graphically represented in Figure 1.

Figure 1: We decompose the robust alignment problem into 5 key steps: data collection, world model inference, desirability learning, incentive design and reinforcement learning.

Evidently, Alice, Bob, Charlie, Dave and Erin need not be 5 different algorithms. Typically, it may be much more computationally efficient to merge Charlie and Dave. Nevertheless, at least for pedagogical reasons, it seems useful to first dissociate the different roles that these components have.

In the sequel, we shall further detail the challenges posed by each of the 5 algorithms. We shall also argue that, for robustness and scalability reasons, these algorithms will need to be further divided into many more algorithms. We will see that this raises additional challenges. We shall also make a few non-technical remarks, before concluding.

3 Alice’s Reinforcement learning

Today’s most promising framework for large-scale algorithms seems to be reinforcement learning [84, 44]. In reinforcement learning, at each time , the algorithm observes some state of the world . Depending on its inner parameters , it then takes (possibly randomly) some action .

The decision then influences the next state and turns it into . The transition from to given action is usually assumed to be nondeterministic. In any case, the algorithm then receives a reward . The internal parameters of the algorithm may then be updated into , depending on previous parameters , action , state and reward .

Note that this is a very general framework. In fact, we humans are arguably (at least partially) subject to this framework. At any point in time, we observe new data that informs us about the world. Using an inner model of the world , we then infer what the world probably is like, which motivates us to take some action . This may affect what likely next data will be observed, and may be accompanied with a rewarding (or painful) feeling , which will motivate us to update our inner model of the world into .

Let us call Alice the algorithm in charge of performing this reinforcement learning reasoning. Alice can thus be viewed as an algorithm, which inputs observed states and rewards , and undertakes actions so as to typically maximize some discounted sum of expected future rewards.

In the case of large-scale algorithms, such actions are mostly of the form of messages sent through the Internet. This may sound benign. But it is not. The YouTube recommender system might suggest billions of antivax videos, causing a major decrease of vaccination and an uprise of deadly diseases. On a brighter note, if an algorithm now promotes convincing eco-friendly messages every day to billions of people, the public opinion on climate change may greatly change. Similarly, thoughtful compassion and cognitive empathy could be formidably enhanced by a robustly aligned recommender algorithm.

Note that, as opposed to all other components, in some sense, Alice is the real danger. Indeed, in our framework, she is the only one that really undertakes actions. More precisely, only her actions will be unconstrained (although others highly influence her decision-making and are thus critical as well). It is thus of the utmost importance that Alice be well-designed.

Some past work [69, 23] proposed to restrict the learning capabilities of Alice to provide provable desirable properties. Typically, they proposed to allow only a subclass of learning algorithms, i.e. of update rules of as a function of . However, such restrictions might be too costly, and thus may not be adopted. Yet a safety measure will be useful only if the most powerful algorithms are all subject to the safety measure. As a result, an effective safety measure cannot hinder too much the performance of the algorithms. In other words, there are constraints on the safety constraints that can be imposed. This is what makes the research to make algorithms safe so challenging.

To design safe reinforcement learning algorithms, it seems more relevant to guarantee that they will do sufficient planning to avoid dangerous actions. One interesting idea by [2] is model lookahead. This essentially corresponds to Alice simulating many likely scenarii before undertaking any action. More generally, Alice faces a safe exploration problem.

Another key feature of safe reinforcement learning algorithms is their ability to deal adequately with uncertainty. While recent results advanced the capability of algorithms to verify the robustness of neural networks [20, 88], we should not expect algorithms to be correct all the time. Just like humans, algorithms will likely be sometime wrong. But then, even if an algorithm is right 99.9999% of the time, it will still be wrong one time out of a million. Yet, algorithms like recommender systems or autonomous cars take billions of decisions every day. In such cases, thousands of algorithmic decisions may be unboundedly wrong every day.

This problem can become even more worrysome if we take into account the fact that hackers may attempt to take advantage of the algorithms’ deficiencies. Such hackers may typically submit only data that corresponds to cases where the algorithms are wrong. This is known as evasion attacks [55, 83, 32]. To avoid evasion attacks, it is crucial for an algorithm to never be unboundedly wrong, e.g. by reliably measuring its own confidence in its decisions and to ask for help in cases of great uncertainty [38].

Now, even if Alice is well-designed, she will only be an effective optimization algorithm. Unfortunately, this is no guarantee of safety or alignment. Typically, because of humans’ well-known confirmation bias [39], a watch-time maximization YouTube recommender algorithm may amplify filter bubbles, which may lead to worldwide geopolitical tensions. Both misaligned and unaligned algorithms will likely cause very undesirable side effects [12, 27].

To make sure that Alice will want to behave as we want her to, it seems critical to at least design appropriately the reward . This is similar to the way children are taught to behave. We do so by punishing them when the sequence is undesirable, and by rewarding them when the sequence is desirable. In particular, rewards are Alice’s incentives. They will determine her decision-making. Unfortunately, determining the adequate rewards to be given to Alice is an extremely difficult problem. It is, in fact, the key to robust alignment. Our roadmap to solve it identifies 4 key steps incarnated by Erin, Dave, Charlie and Bob.

4 Erin’s data collection problem

Evidently, data are critical to do good. Indeed, even the most brilliant mind will be unable to know anything about the world if it does not have any data from that world. Now, much data is already available on the Internet, especially on a platform like YouTube. However, it is important to take into account the fact that the data on the Internet is not always fully reliable. It may be full of fake news, fraudulent entries, misleading videos, hacked posts and corrupted files.

It may then be relevant to invest in more reliable and relevant data collection. This would be Erin’s job. Typically, Erin may want to collect economic metrics to better assess needs. Recently, it has been shown that satellite images combined with deep learning allow to compute all sorts of useful economic indicators [46], including poverty risks and agricultural productivity. The increased use of sensors may further enable us to improve life standards, especially in developing countries.

To guarantee the reliability of such data, cryptographic and distributed computing solutions are likely to be useful as well, as they already are on the web. Electronic signatures may eventually be the key to distinguish reliable data from deep fakes [47, 66], while distributed computing, combined with recent Byzantine-resilient consensus algorithms like Blockchain [63], Hashgraph [8] or AT-2 [36], could guarantee the reliable storage and traceability of critical information.

Note though that such data collection mechanisms could pose major privacy issues. It is a major current challenge to balance the usefulness of collected data and the privacy violation they inevitably cause. Some possible solutions include differential privacy [21], or weaker versions like generative-adversarial privacy [43]. It could also be possible to combine these with more cryptographic solutions, like homomorphic encryption or multi-party computation. It is interesting that such cryptographic solutions may be (essentially) provably robust to any attacker, including a superintelligence1.

5 Dave’s world model problem

Unfortunately, raw data are usually extremely messy, redundant, incomplete, unreliable, poisoning and even hacked. To tackle these issues, it is necessary to infer the likely actual states of the world, given Erin’s collected data. This will be Dave’s job.

Dave will probably heavily rely on deep representation learning. This typically corresponds to determining low-dimensional representations of high-dimensional data through deep neural networks. This basic idea has given rise to today’s most promising unsupervised machine learning algorithms, e.g. word vectors [61], autoencoders [53], generative adversarial networks (GANs) [34] and transformers [87, 71].

Given how crucial it is for Dave to have an unbiased representation of the world, much care will be needed to make sure that Dave’s inference will foresee selection biases. For instance, when asked to provide images of CEOs, Google Image may return a greater ratio of male CEOs than the actual ratio. More generally, such biases can be regarded as instances of Simpson’s paradox [77], and boil down to the saying “correlation is not causation”. It seems crucial that Dave does not fall into this trap.

In fact, data can be worse than unintentionally misleading. Given how influential Alice may be, there will likely be great incentives for many actors to bias Erin’s data gathering to fool Dave. This is known as poisoning attacks. It seems extremely important that Dave anticipate the fact that the data he was given may be purposely biased, if not hacked. Recent breakthroughs have enabled efficient performant robust high-dimensional statistics [19, 51] with different attack models, as well as applications to robust distributed learning [10, 24, 18], though future work will probably need to combine such results with data certification guarantees provided by cryptographical signatures. Eventually, like any good journalist, Dave will likely need to cross information from different sources to infer the most likely states of the world.

This inference approach is well captured by the Bayesian paradigm [42]. In particular, Bayes rule is designed to infer the likely causes of the observed data . These causes can also be regarded as theories (and such theories may assume that some of the data were hacked). Bayes rule tells us that the reliability of theory given data can be derived formally by .

One typical instance of Dave’s job is the problem of inferring global health from a wide variety of collected data. This is what has been done by [45], using a sophisticated Bayesian model that reconstructed the likely causes of deaths in countries where data were lacking.

Importantly, Bayes rule also tells us that we should not fully believe any single theory. This simply corresponds to saying that data can often be interpreted in many different mutually incompatible manners. It seems important to reason with all possible interpretations rather than isolating a single interpretation that may be flawed.

When the space of possible states of the world is large, which will surely be the case of Dave, it is often computationally intractable to reason with the full posterior distribution . Bayesian methods often rather propose to approximate it through variational methods, or to sample from the posterior distribution to identify a reasonable number of good interpretations of the data, typically through Markov-Chain Monte-Carlo (MCMC).

In some sense, Dave’s job can be regarded as writing a compact report of all likely states of the world, given Erin’s collected data. It is unclear what language Dave’s report will be in. It might be useful to make it understandable by humans. But this might be too costly as well. Indeed, Dave’s report might be billions of pages long. It could be unreasonable or undesirable to make it humanly readable.

Note also that Erin and Dave are likely to gain cognitive capabilities over time. It is surely worthwhile to anticipate the complexification of Erin’s data and of Dave’s world models. It seems unclear so far how to do so. Some high-level (purely descriptive) language to describe world models may be needed. This high-level language probably should be flexible enough to be reshaped and redesigned over time. This may be dubbed the world description problem. It is arguably still a very open and uncharted area of research.

6 Charlie’s desirability learning problem

Given any of Dave’s world models, Charlie’s job will then be to compute how desirable this world model is. This is the desirability learning problem [80], also known as value learning. This is the problem of assigning desirability scores to different world models. These desirability scores can then serve as the basis for any agent to determine beneficial actions.

Unfortunately, determining what, say, the median human considers desirable is an extremely difficult problem. But it should be stressed that we should not aim at deriving an ideal inference of what people desire. This is likely to be a hopeless endeavor. Rather, we should try our best to make sure that Charlie’s desirability scores will be good enough to avoid catastrophic outcomes, e.g. increased polarization, severe misinformation or major discrimination.

One proposed solution is to build upon comparison-based preference inputs of users. For instance, in the Thurstone-Mosteller [86, 62], the Bradley-Terry [13] or the Plackett-Luce models [56, 70], a user is assumed to assign some intrinsic preference to each alternative . Given a dilemma between options and , they will have a probability to choose over , where the function is an increasing function from 0 to 1. For instance, we may have . Given a dataset of expressed preferences of a user, we may then infer the likely values of ’s.

It is even possible to address the case where the set of alternatives is infinite, as would be the case if alternatives are vectors, or vector representations of alternatives drawn for an unknown large set. One way to do so is to assume a priori that the ’s are a Gaussian process, and that we can define some kernel for the set of alternatives. Bayesian inference (or approximations of Bayesian inference) could then allow to infer the likely preferences of an individual over from a finite sample of binary preferences [15, 16, 17].

A perhaps more computationally effective approach could consist in learning a preference vector in feature space for each user , so that would represent the probability that user will say to prefer to on a given query, where and are vector representations of alternatives and , as is done in [67]. By maximizing over the likelihood of observed data, with some addditional appropriate regularization term on , this approach can be argued to compute some maximum a posteriori of user ’s preferences.

Another proposed solution to infer human preferences is so-called inverse reinforcement learning [65, 26]. Assuming that humans perform reinforcement learning to choose their actions, and given examples of actions taken by humans in different contexts, inverse reinforcement learning infers what were the humans’ likely implicit rewards that motivated their decision-making. Assuming we can somehow separate humans’ selfish rewards from altruistic ones, inverse reinforcement learning seems to be a promising first step towards inferring humans’ preferences from data.

There are, however, many important considerations to be taken into account, which we discuss below. First, despite Dave’s effort and because of Erin’s limited and possibly biased data collection, Dave’s world model is fundamentally uncertain. In fact, as discussed previously, Dave would probably rather present a distribution of likely world models. Charlie’s job should be regarded as a scoring of all such likely world models. In particular, she should not assign a single number to the current state of the world, but, rather, a distribution of likely scores of the current state of the world. This distribution should convey the uncertainty about the actual state of the world.

Another challenging aspect of Charlie’s job will be to provide a useful representation of potential human disagreements about the desirability of different states of the world. Humans’ preferences are diverse and may never converge. This should not be swept under the rug. Instead, we need to agree on some way to mitigate disagreement.

This is known as a social choice problem. In its general form, it is the problem of aggregating the preferences of a group of disagreeing people into a single preference for the whole group that, in some sense, fairly well represents the individuals’ preferences. Unfortunately, social choice theory is plagued with impossibility results, e.g. Arrow’s theorem [5] or the Gibbard-Satterthwaite theorem [31, 76]. Again, we should not be too demanding regarding the properties of our preference aggregation. Besides, this is the path taken by social choice theory, e.g. by proposing randomized solutions to preserve some desirable properties [40].

One particular proposal, known as majority judgment [9], may be of particular interest to us here. Its basic idea is to choose some deciding quantile (often taken to be ). Then, for any possible state of the world, consider all individuals’ desirability scores for that state. This yields a distribution of humans’ preferences for the state of the world. Majority judgment then concludes that the group’s score is the quantile of this distribution. If , this corresponds to the score chosen by the median individual of the group.

While majority judgment seems promising, it does raise the question of how to compare two different individuals’ scores. It is not clear that given by John has a meaning comparable to Jane’s . In fact, according to a theorem by von Neumann and Morgenstern [64], within their framework, utility functions are only defined up to a positive affine transformation. More work is probably needed to determine how to scale different individuals’ utility functions appropriately, despite previous attempts in special cases [41].

In practice, another challenge to apply social choice to real-world problem, such as hate speech moderation, is the fact the number of parameters of a hate speech moderation algorithm may be too large for humans to express their preferences. In fact, an algorithm that must decide based on the visual content of a video will likely have to rely on some sophisticated neural network. However, if the hate speech moderation algorithm has parameters, then the number of options for voters to vote on would be of the order . Deciding what should be moderated through classical social choice would then seem impractical.

This problem has been underlined by [58], where they studied social choice algorithms with a logarithmic-sized elicitation of voters. In other words, if there were options, then the social choice algorithms could only query bits of information from each voter. Clearly, we would then obtain suboptimal decision-making. Interestingly, [58] gave upper and lower bounds on the worst-case ratio between such a social choice algorithm’s decision and the optimal decision. Note though that this worst-case ratio may be too conservative, as it excludes reasonable priors we may have on individuals’ preferences, or on the similarity of individuals’ preferences.

Another fascinating line of research proposed to infer from elicitations of pairwise preferences a model for each user’s preferences, and to apply social choice to users’ preference models. This is known as virtual democracy [49]. [67] addressed the case of autonomous car dilemmas, [30] faced kidney donation, while [52] tackled food donation. Interestingly, in [52], voters were given the chance to interact in their model and to receive feedbacks about the social choice algorithm, so as to gain trust in the system that was implemented. This voting-based approach seems to be a promising avenue to alignment.

Now, arguably, humans’ current preferences are almost surely undesirable. Indeed, over the last decades, psychology has been showing again and again that human thinking is full of inconsistencies, fallacies and cognitive biases [48]. We tend to first have instinctive reactions to stories or facts [11], which quickly becomes the position we will want to defend at all costs [39]. Worse, we are unfortunately largely unaware of why we believe or want what we believe or want. This means that our current preferences are unlikely to be what we would prefer, if we were more informed, thought more deeply, and tried to make sure our preferences were as well-founded as possible.

In fact, arguably, we should prefer what we would prefer to prefer, rather than what we instinctively prefer. Typically, one might prefer to watch a cat video, even though one might prefer to prefer mathematics videos over cat videos. This is sometimes known as humans’ volitions [89], as opposed to humans’ preferences.

It might even be relevant to consider the volition of much more thoughtful and insightful versions of ourselves, as the currents selves may be blind to some phenomenons that will be of great concerns to our better selves. This can be illustrated by the fact that past standards are often no longer regarded as desirable. Our intuitions about the desirability of slavery, homosexuality and gender discrimination have been completely upset over the last century, if not over the last few decades. It seems unlikely that all of our other intuitions will never change. In particular, it seems unlikely that better versions of ourselves will fully agree with current selves. And it seems reasonable to argue that our better selves would be “more right” than current selves.

It is noteworthy that we clearly have epistemic uncertainty about our better selves. Algorithms probably should also take into account their epistemic uncertainty about our better selves’ volitions. Computing a given person’s volition may be called the volition learning problem. Interestingly, this is (mostly) a prediction problem, as it is the prediction of the preferences of alternative versions of ourselves. Bayes rule seems able to address this counterfactual reality.

Unfortunately, distinguishing preferences from volitions seems to be a neglected research direction so far. One approach to do so could be to assume that preferences obtained through elicitations better reflect volitions that preferences inferred from inverse reinforcement learning. This could serve as a basis to distinguish humans’ behaviors motivated by preferences from those motivated by volitions. Further research in this direction seems critical.

Such scores could also be approximated using a large number of proxies, as is done by boosting methods [3]. The use of several proxies could avoid the overfitting of any proxy. Typically, rather than relying solely on DALYs [68], we probably should invoke machine learning methods to combine a large number of similar metrics, especially those that aim at describing other desirable economic metrics, like human development index (HDI) or gross national happiness (GNH). Still another approach may consist of analyzing “typical” human preferences, e.g. by using collaborative filtering techniques [73].

Computing the desirability of a given world state according to a virtual democracy of our better selves’ volitions is Charlie’s job. In some sense, Charlie’s job would thus be to remove cognitive biases from our intuitive preferences, so that they still basically reflect what we really regard as preferable, but in a more coherent and informed manner. She would then need to aggregate such preferences, which may be computationally challenging if the computations must be done in milliseconds as in the case of autonomous cars [67] or of recommender systems.

But she might have to do more. Indeed, it seems critical for us to be able to trust Charlie’s computations. But this will arguably be very hard. Indeed, we should expect that our better selves’ volitions may find desirable things that our current selves might find repelling. Unfortunately though, we humans tend to react poorly to disagreeing jugments. And this is likely to hold even when the oppositions are our better selves. This poses a great scientific and engineering challenge. How can one be best convinced of the judgments that he or she will eventually embrace but does not yet? In other words, how can we quickly agree with better versions of ourselves? What could be said to an individual to get him closer to his better self? This may be dubbed the individual improvement problem, which is arguably much more a psychology problem than a computer science one. But educative feedbacks from algorithms may be essential, as seems to have been the case in [52].

7 Bob’s incentive design

The last piece of the jigsaw is Bob’s job. Bob is in charge of computing the rewards that Alice will receive, based on the work of Erin, Dave and Charlie. Evidently he could simply compute the expectation of Charlie’s scores for the likely states of the world. But this is probably a bad idea, as it opens the door to reward hacking [2].

Recall that Alice’s goal is to maximize her discounted expected future rewards. But given that Alice knows (or is likely to eventually guess) how her rewards are computed, instead of undertaking the actions that we would want her to, Alice could hack Erin, Dave or Charlie’s computations, so that such hacked computations yield large rewards. This is sometimes called the wireheading problem [28].

To avoid wireheading, Bob’s role will be to make sure that, while Alice’s rewards do correlate with Charlie’s scores, they also give Alice the incentives to guarantee that Erin, Dave and Charlie perform as reliably as possible the job they were given.

In fact, it even seems desirable that Alice be incentivized to constantly upgrade Erin, Dave and Charlie for the better. After all, it is high unlikely that early versions of Erin, Dave and Charlie will have no flaws. In fact, because of Goodhart’s law [59], even the slight imperfection in their design might result in highly counter-productive optimization by Alice. Indeed, [25] showed that when the discrepancy between a measure and the true goal has a heavy-tailed distribution, then as the measure gets maximized, the correlation between measure and goal becomes negative.

This suggests that robust alignment must rely on the continuous upgrade of the reward system, so that the computation of Alice’s rewards keep approaching the objective function that we really want to maximize. This is sometimes known as corrigibility [81, 54], though for efficient corrections, it seems essential that this corrigibility be pre-programmed. Bob’s job will be to solve the programmed corrigibility problem, by incentivizing Alice to upgrade adequately the components of her reward system.

Unfortunately, it seems unclear how Bob can best make sure that Alice has such incentives. Perhaps a good idea is to penalize Dave’s reported uncertainty about the likely states of the world. Typically, Bob should make sure Alice’s rewards are affected by the reliability of Erin’s data. The more reliable Erin’s data, the larger Alice’s rewards. Similarly, when Dave or Charlie feel that their computations are unreliable, Bob should take note of this and adjust Alice’s rewards accordingly to motivate Alice to provide larger resources for Charlie’s computations. All of this seems to demand a way for Bob to estimate the performance of the different components of the reward system. Doing so seems to be an open problem so far.

Now, Bob should also mitigate the desire for more reliable data and for more trustworthy computations with the fact that such efforts will necessarily require the exploitation of more resources, probably at the expense of Charlie’s scores. It is this non-trivial trade-off that Bob will need to take care of.

Bob’s work might be simplified by some (partial) control of Alice’s action or world model. Although it seems unclear so far how, techniques like interactive proofs (IP) [6, 33] or probabilistically checkable proofs (PCP) [4] might be useful to force Alice to prove its correct behavior. By requesting such proofs to yield large rewards, Bob might be able to incentivize Alice’s transparency. All such considerations make up Bob’s incentive problem.

8 Decentralization

We have decomposed robust alignment into 5 components for the sake of exposition. However, any component will likely have to be decentralized to gain reliability and scalability. In other words, instead of having a single Alice, a single Bob, a single Charlie, a single Dave and a single Erin, it seems crucial to construct multiple Alices, Bobs, Charlies, Daves and Erins.

This is key to crash-tolerance. Indeed, a single computer doing Bob’s job could crash and leave Alice without reward nor penalty. But if Alice’s rewards are an aggregate of rewards given by a large number of Bobs, then even if some Bobs crash, Alice’s rewards will remain mostly the same. But crash-tolerance is likely to be insufficient. Instead, we should design Byzantine-resilient mechanisms, that is, mechanisms that still perform correctly despite hacked or malicious Bobs. Robust estimators [19, 51] for both distributed workers [10] and parameter servers [22] may be useful for this purpose.

Figure 2: We decompose robust alignment into 5 steps. Each step is associated with further substeps or techniques. Also, there are critical subproblems that will likely be useful for several of the 5 steps.

Evidently, in this Byzantine environment, cryptography, especially (postquantum?) cryptographical signatures and hashes, are likely to play a critical role. Typically, Bobs’ rewards will likely need to be signed. More generally, the careful design of secure communication channels between the components of the algorithms seems key. This may be called the secure messaging problem.

Another difficulty is the addition of more powerful and precise Bobs, Charlies, Daves and Erins to the pipeline. It is not yet clear how to best integrate reliable new comers, especially given that such new comers may be malicious. In fact, they may want to first act benevolent to gain admission. But once they are numerous enough, they could take over the pipeline and, say, feed Alice with infinite rewards. This is the upgrade problem, which was recently discussed by [14] who proposed using numerous weaker algorithms to supervise stronger algorithms. More research in this direction is probably needed.

Now, in addition to reliability, decentralization also enables different Alices, Bobs, Charlies, Daves and Erins to focus on specific tasks. This would allow for more optimized solutions at lower costs. To this end, it may be relevant to adapt different Alices’ rewards to their specific tasks. Note though that this could also be a problem, as Alices may enter in competition with one another like in the prisoner’s dilemma. We may call it the specialization problem.

Another open question is the extent to which algorithms should be exposed to Bobs’ rewards. Typically, if a small company creates its own algorithm, to what extent should this algorithm be aligned? It should be noted that this may be computationally very costly, as it may be hard to separate the signal of interest to the algorithm from the noise of Bobs’ rewards. Intuitively, the more influential an algorithm is, the more it should be influenced by Bobs’ rewards. But even if this algorithm is small, it may be important to demand that it be influenced by Bobs to avoid any diffusion of responsibility, i.e. many small algorithms that disregard safety concerns on the ground that they each hardly have any global impact on the world.

What makes this nontrivial is that any algorithm may gain capability and influence over time. An unaligned weak algorithm could eventually become an unaligned human-level algorithm. To avoid this, even basic, but potentially unboundedly self-improving2 algorithms should perhaps be given at least a seed of alignment, which may grow as algorithms become more powerful. More generally, algorithms should strike a balance between some original (possibly unaligned) objective and the importance they give to alignment. This may be called the alignment burden assignment problem.

Figure 2 recapitulates our complete roadmap.

9 Conclusion

This paper discussed the robust alignment problem, that is, the problem of aligning the goals of algorithms with human preferences. It presented a general roadmap to tackle this issue. Interestingly, this roadmap identifies 5 critical steps, as well as many relevant aspects of these 5 steps. In other words, we have presented a large number of hopefully more tractable subproblems that readers are highly encouraged to tackle. Hopefully, this combination allows to better highlight the most pressing problems, how every expertise can be best used to, and how combining the solutions to subproblems might add up to solve robust alignment.


  1. The possible use of quantum computers may require postquantum cryptography.
  2. In particular, nonparametric algorithms should perhaps be treated differently from parametric ones.


  1. Joachim Allgaier. Science and environmental communication via online video: Strategically distorted communications on climate change and climate engineering on youtube. Frontiers in Communication, 4:36, 2019.
  2. Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
  3. Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
  4. Sanjeev Arora, Carsten Lund, Rajeev Motwani, Madhu Sudan, and Mario Szegedy. Proof verification and the hardness of approximation problems. Journal of the ACM, 45(3):501–555, 1998.
  5. Kenneth J Arrow. A difficulty in the concept of social welfare. Journal of political economy, 58(4):328–346, 1950.
  6. László Babai. Trading group theory for randomness. In Proceedings of the 17th Annual ACM Symposium on Theory of Computing, May 6-8, 1985, Providence, Rhode Island, USA, pages 421–429, 1985.
  7. Christopher A Bail, Lisa P Argyle, Taylor W Brown, John P Bumpus, Haohan Chen, MB Fallin Hunzaker, Jaemin Lee, Marcus Mann, Friedolin Merhout, and Alexander Volfovsky. Exposure to opposing views on social media can increase political polarization. Proceedings of the National Academy of Sciences, 115(37):9216–9221, 2018.
  8. Leemon Baird. Hashgraph consensus: fair, fast, byzantine fault tolerance. Technical report, Swirlds Tech Report, 2016.
  9. Michel Balinski and Rida Laraki. Majority judgment: measuring, ranking, and electing. MIT press, 2011.
  10. Peva Blanchard, El Mahdi El-Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 119–129, 2017.
  11. Paul Bloom. Against Empathy: The Case for Rational Compassion. Ecco, 2016.
  12. Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. OUP Oxford, 2014.
  13. Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  14. Paul Francis Christiano, Buck Shlegeris, and Dario Amodei. Supervising strong learners by amplifying weak experts. CoRR, abs/1810.08575, 2018.
  15. Wei Chu and Zoubin Ghahramani. Extensions of gaussian processes for ranking: semisupervised and active learning. Learning to Rank, page 29, 2005.
  16. Wei Chu and Zoubin Ghahramani. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019–1041, 2005.
  17. Wei Chu and Zoubin Ghahramani. Preference learning with gaussian processes. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, pages 137–144, 2005.
  18. Georgios Damaskinos, El Mahdi El-Mhamdi, Rachid Guerraoui, Rhicheek Patra, and Mahsa Taziki. Asynchronous byzantine machine learning (the case of SGD). In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1153–1162, 2018.
  19. Ilias Diakonikolas and Daniel M. Kane. Recent advances in algorithmic high-dimensional robust statistics. CoRR, abs/1911.05911, 2019.
  20. Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Timothy A. Mann, and Pushmeet Kohli. A dual approach to scalable verification of deep networks. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018, pages 550–559, 2018.
  21. Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
  22. El Mahdi El-Mhamdi, Rachid Guerraoui, Arsany Guirguis, and Sebastien Rouault. SGD: decentralized byzantine resilience. CoRR, abs/1905.03853, 2019.
  23. El Mahdi El-Mhamdi, Rachid Guerraoui, Hadrien Hendrikx, and Alexandre Maurer. Dynamic safe interruptibility for decentralized multi-agent reinforcement learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 130–140, 2017.
  24. El Mahdi El-Mhamdi, Rachid Guerraoui, and Sébastien Rouault. The hidden vulnerability of distributed learning in byzantium. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 3518–3527, 2018.
  25. El Mahdi El-Mhamdi, Lê Nguyên Hoang, and Sébastien Rouault. On goodhart’s law, with an application to alignment. CoRR, 2020.
  26. Owain Evans, Andreas Stuhlmüller, and Noah D. Goodman. Learning the preferences of ignorant, inconsistent agents. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 323–329, 2016.
  27. Tom Everitt. Towards Safe Artificial General Intelligence. PhD thesis, PhD thesis, Australian National University, 2018.
  28. Tom Everitt and Marcus Hutter. Reward tampering problems and solutions in reinforcement learning: A causal influence diagram perspective. CoRR, abs/1908.04734, 2019.
  29. Jaime F Fisac, Monica A Gates, Jessica B Hamrick, Chang Liu, Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S Shankar Sastry, Thomas L Griffiths, and Anca D Dragan. Pragmatic-pedagogic value alignment. In Robotics Research, pages 49–57. Springer, 2020.
  30. Rachel Freedman, Jana Schaich Borg, Walter Sinnott-Armstrong, John P. Dickerson, and Vincent Conitzer. Adapting a kidney exchange algorithm to align with human values. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 1636–1643, 2018.
  31. Allan Gibbard. Manipulation of voting schemes: a general result. Econometrica: journal of the Econometric Society, pages 587–601, 1973.
  32. Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian J. Goodfellow. Adversarial spheres. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings, 2018.
  33. Shafi Goldwasser, Silvio Micali, and Charles Rackoff. The knowledge complexity of interactive proof systems. SIAM Journal of Computing, 18(1):186–208, 1989.
  34. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680, 2014.
  35. Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  36. Rachid Guerraoui, Petr Kuznetsov, Matteo Monti, Matej Pavlovic, and Dragos-Adrian Seredinschi. Scalable byzantine reliable broadcast. In 33rd International Symposium on Distributed Computing, DISC 2019, October 14-18, 2019, Budapest, Hungary, pages 22:1–22:16, 2019.
  37. David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 2455–2467, 2018.
  38. Dylan Hadfield-Menell, Anca D. Dragan, Pieter Abbeel, and Stuart J. Russell. The off-switch game. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 220–227, 2017.
  39. Jonathan Haidt. The righteous mind: Why good people are divided by politics and religion. Vintage, 2012.
  40. Lê Nguyên Hoang. Strategy-proofness of the randomized condorcet voting system. Social Choice and Welfare, 48:679–701, 2017.
  41. Lê Nguyên Hoang, François Soumis, and Georges Zaccour. Measuring unfairness feeling in allocation problems. Omega, 65:138–147, 2016.
  42. Lê Nguyên Hoang. The Equation of Knowledge: From Bayes’ Rule to a Unified Philosophy of Science. Chapman and Hall/CRC, 2020.
  43. Chong Huang, Peter Kairouz, Xiao Chen, Lalitha Sankar, and Ram Rajagopal. Context-aware generative adversarial privacy. Entropy, 19(12):656, 2017.
  44. Marcus Hutter. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media, 2004.
  45. Institute for Health Metrics and Evaluation (IHME), University of Washington. Gbd compare data visualization, 2016.
  46. Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. Combining satellite imagery and machine learning to predict poverty. Science, 353(6301):790–794, 2016.
  47. Youngjoo Jo and Jongyoul Park. Sc-fegan: Face editing generative adversarial network with user’s sketch and color. In Proceedings of the IEEE International Conference on Computer Vision, pages 1745–1753, 2019.
  48. Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux New York, 2011.
  49. Anson Kahng, Min Kyung Lee, Ritesh Noothigattu, Ariel D. Procaccia, and Christos-Alexandros Psomas. Statistical foundations of virtual democracy. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 3173–3182, 2019.
  50. Adam DI Kramer, Jamie E Guillory, and Jeffrey T Hancock. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, page 201320040, 2014.
  51. Guillaume Lecué and Jules Depersin. Robust subgaussian estimation of a mean vector in nearly linear time. arXiv preprint arXiv:1906.03058, 2019.
  52. Min Kyung Lee, Daniel Kusbit, Anson Kahng, Ji Tae Kim, Xinran Yuan, Allissa Chan, Daniel See, Ritesh Noothigattu, Siheon Lee, Alexandros Psomas, and Ariel D. Procaccia. Webuildai: Participatory framework for algorithmic governance. Proceedings of the ACM on human-computer interaction, 3(CSCW):181:1–181:35, 2019.
  53. Cheng-Yuan Liou, Jau-Chi Huang, and Wen-Chie Yang. Modeling word perception using the elman network. Neurocomputing, 71(16-18):3150–3157, 2008.
  54. Yat Long Lo, Chung Yu Woo, and Ka Lok Ng. The necessary roadblock to artificial general intelligence: corrigibility. AI Matters, 5(3):77–84, 2019.
  55. Daniel Lowd and Christopher Meek. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, USA, August 21-24, 2005, pages 641–647, 2005.
  56. Robert Luce. Individual Choice Behavior: A Theoretical Analysis. John Wiley and sons, 1959.
  57. William MacAskill. Normative uncertainty. PhD thesis, University of Oxford, 2014.
  58. Debmalya Mandal, Ariel D. Procaccia, Nisarg Shah, and David P. Woodruff. Efficient and thrifty voting by any means necessary. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 7178–7189, 2019.
  59. David Manheim and Scott Garrabrant. Categorizing variants of goodhart’s law. CoRR, abs/1803.04585, 2018.
  60. Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635, 2019.
  61. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013.
  62. Frederick Mosteller. Remarks on the method of paired comparisons: Ii. the effect of an aberrant standard deviation when equal standard deviations and equal correlations are assumed. Psychometrika, 16(2):203–206, 1951.
  63. Satoshi Nakamoto. Bitcoin: A peer-to-peer electronic cash system, 2008.
  64. J von Neumann and Oskar Morgenstern. Theory of games and economic behavior. Princeton: Princeton, 1944.
  65. Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 663–670, 2000.
  66. Yuval Nirkin, Yosi Keller, and Tal Hassner. Fsgan: Subject agnostic face swapping and reenactment. In Proceedings of the IEEE International Conference on Computer Vision, pages 7184–7193, 2019.
  67. Ritesh Noothigattu, Snehalkumar (Neil) S. Gaikwad, Edmond Awad, Sohan Dsouza, Iyad Rahwan, Pradeep Ravikumar, and Ariel D. Procaccia. A voting-based system for ethical decision making. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 1587–1594, 2018.
  68. World Health Organization et al. Death and daly estimates for 2004 by cause for who member states, 2009.
  69. Laurent Orseau and MS Armstrong. Safely interruptible agents. In Uncertainty in Artificial Intelligence: 32nd Conference (UAI 2016), edited by Alexander Ihler and Dominik Janzing, page 557–566, 2016.
  70. Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society: Series C (Applied Statistics), 24(2):193–202, 1975.
  71. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
  72. Manoel Horta Ribeiro, Raphael Ottoni, Robert West, Virgílio A. F. Almeida, and Wagner Meira Jr. Auditing radicalization pathways on youtube. In FAT* ’20: Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, January 27-30, 2020, pages 131–141, 2020.
  73. Francesco Ricci, Lior Rokach, and Bracha Shapira. Recommender systems: Introduction and challenges. In Recommender Systems Handbook, pages 1–34. 2015.
  74. Stuart J. Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4), 2015.
  75. Stuart Jonathan Russell. Human compatible: Artificial intelligence and the problem of control. Penguin, 2019.
  76. Mark Allen Satterthwaite. Strategy-proofness and arrow’s conditions: Existence and correspondence theorems for voting procedures and social welfare functions. Journal of economic theory, 10(2):187–217, 1975.
  77. Edward H Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society. Series B (Methodological), pages 238–241, 1951.
  78. Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D. Sculley, Joshua V. Dillon, Jie Ren, and Zachary Nado. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pages 13969–13980, 2019.
  79. Nate Soares. Aligning superintelligence with human interests: An annotated bibliography. Intelligence, 17(4):391–444, 2015.
  80. Nate Soares. The value learning problem. In Ethics for Artificial IntelligenceWorkshop at 25th International Joint Conference on Artificial Intelligence, 2016.
  81. Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky. Corrigibility. In Artificial Intelligence and Ethics, Papers from the 2015 AAAI Workshop, Austin, Texas, USA, January 25, 2015, 2015.
  82. Nate Soares and Benya Fallenstein. Agent foundations for aligning machine intelligence with human interests: a technical research agenda. In The Technological Singularity, pages 103–125. Springer, 2017.
  83. Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 23(5):828–841, 2019.
  84. Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
  85. Max Tegmark. Life 3.0. Being Human in the Age of Artificial Intelligence. NY: Allen Lane, 2017.
  86. Louis L Thurstone. A law of comparative judgment. Psychological review, 34(4):273, 1927.
  87. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
  88. Min Wu, Matthew Wicker, Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. A game-based approximate verification of deep neural networks with provable guarantees. Theoretical Computer Science, 807:298–329, 2020.
  89. Eliezer Yudkowsky. Coherent extrapolated volition. Singularity Institute for Artificial Intelligence, 2004.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description