A Brandom-ian view of Reinforcement Learning towards strong-AI

Atrisha Sarkar
David R Cheriton School of Computer Science
University of Waterloo
July 12, 2019
Abstract

The analytic philosophy of Robert Brandom, based on the ideas of pragmatism, paints a picture of sapience through inferentialism. In this paper, we present a theory that utilizes essential elements of Brandom’s philosophy towards the objective of achieving strong-AI. We do this by connecting the constitutive elements of reinforcement learning and the Game Of Giving and Asking For Reasons. Further, following Brandom’s prescriptive thoughts, we restructure the popular reinforcement learning algorithm A3C, and show that RL algorithms can be tuned towards the objective of strong-AI.

1 Introduction

The analytic philosophy of Robert Brandom situates itself with the demarcation question: what does it mean to be us? In the process, Brandom presents a layer-cake picture of sapience, categorizing agents into one of three categories: simple performers, rational beings, and logical beings, in increasing order of their sapience. Brandom’s approach to evaluating a being’s sapience is through the Game Of Giving and Asking For Reasons (GOGAR). Brandom claims that this game is a prototypical representation of a social practice, and that it is possible to evaluate a being’s sapience just by the nature of their participation within that game. It is also possible to draw parallels between Brandom’s treatment of sapience and John Searle’s distinction between weak and strong AI, where the highest level of sapience in Brandom’s philosophy (logical beings) corresponds to strong-AI [searle1980minds]. In relation to artificial intelligence, this presents a practical approach to re-imagine models with the purpose of advancing them towards strong-AI.

In this paper, we draw the theoretical links between one such class of models, i.e. reinforcement learning, and Brandom’s philosophy of sapience. We do this by connecting the constitutive elements of reinforcement learning and Markov Decision Processes (MDPs) to Brandom’s ideas of inferentialism. Further, with the objective of exploring a trajectory towards strong-AI, we restructure the popular reinforcement learning algorithm Asynchronous Advantage Actor-Critic (A3C) in accordance with Brandom’s philosophy. Finally, looking through the lens of that philosophy, we show that it is in principle possible to re-imagine reinforcement learning algorithms towards that objective.

2 Reinforcement Learning

Reinforcement learning (RL) is loosely inspired by the psychology of reward and punishment. This is based on the neurological evidence of dopamine release in the mammalian brain, which functions to shape reward-driven behavior. RL is one of the popular machine learning paradigms, along with supervised and unsupervised learning, and its focus is on goal-directed learning. Mathematically, reinforcement learning is modeled using a Markov Decision Process (MDP) [sutton1998reinforcement]. An MDP is a discrete state transition model (most principles of an MDP translate directly to its continuous version), which consists of:

  • States (S): The set of states the agent can be in. This helps the agent form a representation of the environment based on its perception, e.g. the sensor readings of a robot.

  • Actions (A): The set of actions the agent can take.

  • Transition function (T): This is the environment model, expressed as a probability distribution T(s' | s, a) over the states s' that the agent can land in if it takes action a in state s.

  • Reward function (R): The real-valued reward that the agent receives as a result of taking action a in state s. Popular notation represents the reward function as being dependent on the next state s' as well as on the current state and action, i.e. R(s, a, s').

  • Discount factor (γ): This controls the preference the agent gives to immediate rewards as compared to rewards in the future.

Figure 1: A Markov Decision Process with state transition probabilities, and reward values shown within braces.

Figure 1 shows an MDP with the state transition probabilities, and the corresponding reward values that the agent receives for each state-action pair. RL being a goal-oriented learning mechanism, the agent progressively learns a sequence of actions (policy) to take in order to achieve a pre-defined objective. Often, the learning process is framed in a manner such that the sequence of actions that maximizes the rewards gathered along the way (the optimal policy) results in achieving that objective.
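To make the components above concrete, the following is a minimal sketch (in Python, not taken from the paper) of a toy MDP in the spirit of Figure 1; the state names, transition probabilities, rewards, and discount factor are all made up for illustration.

import random

# A toy MDP in the spirit of Figure 1; all values are illustrative.
STATES = ["s0", "s1"]
ACTIONS = ["a0", "a1"]

# T[(s, a)] -> list of (next_state, probability)
T = {
    ("s0", "a0"): [("s0", 0.7), ("s1", 0.3)],
    ("s0", "a1"): [("s1", 1.0)],
    ("s1", "a0"): [("s0", 0.4), ("s1", 0.6)],
    ("s1", "a1"): [("s1", 1.0)],
}

# R[(s, a, s')] -> real-valued reward
R = {
    ("s0", "a0", "s0"): 0.0, ("s0", "a0", "s1"): 1.0,
    ("s0", "a1", "s1"): 0.5,
    ("s1", "a0", "s0"): -1.0, ("s1", "a0", "s1"): 2.0,
    ("s1", "a1", "s1"): 0.0,
}

GAMMA = 0.9  # discount factor

def step(state, action):
    """Sample the next state and the reward from the transition model."""
    outcomes = T[(state, action)]
    next_state = random.choices([s2 for s2, _ in outcomes],
                                weights=[p for _, p in outcomes])[0]
    return next_state, R[(state, action, next_state)]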

To facilitate the learning process, the agent maintains one or more of the following functions:

  • a policy function π(s) that maps states to actions the agent can take in that state.

  • a state-value function V(s) that represents the goodness of the state (a note on notation: capitalized V and Q are used for the tabular representation, whereas v and q are used for the functional representation). It is the expected discounted sum of future rewards the agent can receive from that state onward:

    V^π(s) = E_π[ G_t | S_t = s ] = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s ]

    where E_π[·] is the expected value of a random variable given that the agent follows the policy π, t is the time step, and G_t is the discounted sum of returns from state s onward.

  • an action-value function Q(s, a) that is similar to the state-value function, but represents the goodness of taking an action a in a given state s (a small computational sketch of the value functions follows below).
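Continuing the toy MDP sketch above, the following minimal policy-evaluation loop illustrates how a tabular state-value function V can be computed for a fixed policy; the function name and the hypothetical policy are my own choices, and the sketch reuses the STATES, T, R, and GAMMA structures defined earlier.

def evaluate_policy(policy, n_iters=200):
    """Iterative policy evaluation: V(s) <- sum_s' T(s'|s, pi(s)) [R + GAMMA * V(s')]."""
    V = {s: 0.0 for s in STATES}
    for _ in range(n_iters):
        V = {s: sum(p * (R[(s, policy[s], s2)] + GAMMA * V[s2])
                    for s2, p in T[(s, policy[s])])
             for s in STATES}
    return V

fixed_policy = {"s0": "a0", "s1": "a1"}   # hypothetical deterministic policy
print(evaluate_policy(fixed_policy))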

2.1 Algorithms in Reinforcement Learning

Figure 2: Categorization of various algorithms for reinforcement learning.

Figure 2 shows the main categories into which popular reinforcement learning algorithms fall (see David Silver's course material: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html). This section gives a short description of each category, and subsequently elaborates on the algorithms that are relevant for the development of the philosophical connections in the later sections.

Value-function based algorithms like Q-learning [watkins1992q] and SARSA [sutton1996generalization] maintain an explicit representation of the V or Q functions, and use them to find an optimal policy of behavior (π*). On the other hand, direct policy based algorithms like REINFORCE [williams1988use] find an optimal policy of behavior without maintaining an explicit representation of the V or Q functions. Actor-critic methods [sutton1998reinforcement] are a class of algorithms that combine the value-function based and direct policy based methods into a single algorithm.

Model-based and model-free is another categorization of RL algorithms. Model-based methods maintain and learn an explicit representation of the transition function (T), and use it to calculate the value functions, subsequently deriving the optimal policy, whereas model-free approaches do not maintain an explicit representation of the transition function.

Q-learning

Q-learning is a model-free algorithm that iteratively learns, as the name suggests, the Q-function. The optimal policy can then be derived by a one-step search.

1:procedure Q-learning(α, γ, ε)
2:     Initialize Q(s, a) arbitrarily
3:     for each episode do
4:         Initialize s; repeat the following until s is a terminal state
5:         Choose action a from the policy derived from Q (ε-greedy)
6:         Take action a, observe reward r, and next state s'
7:         Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
8:         s ← s'
9:     end for
10:end procedure
Algorithm 1 Q-learning

Algorithm 1 shows the pseudocode for Q-learning. The algorithm is initialized with an initial estimate of the Q function. At each step, the agent takes an action that is derived from the current Q value (Step 5), and observes the reward (Step 6). The derivation of an action from a given Q function is based on the equation a = argmax_a Q(s, a). The essence of Q-learning is step 7, which incrementally updates the older Q function estimate based on the reward received in step 6. The difference between the new and old estimates of the V or Q function in this form, r + γ max_a' Q(s', a') − Q(s, a), is commonly referred to as the TD-error.
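As an illustration of Algorithm 1, the following sketch implements tabular Q-learning over the toy MDP defined earlier (it reuses the step function, STATES, and ACTIONS from that sketch); the hyperparameters are arbitrary, and since the toy MDP has no terminal state, episodes are truncated after a fixed number of steps.

import random
from collections import defaultdict

def q_learning(step_fn, states, actions, episodes=500, max_steps=50,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; step_fn(s, a) returns (next_state, reward)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(max_steps):
            # epsilon-greedy action selection (step 5)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r = step_fn(s, a)                              # step 6
            td_target = r + gamma * max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (td_target - Q[(s, a)])       # step 7 (TD-error update)
            s = s2
    return Q

Q = q_learning(step, STATES, ACTIONS)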

Direct policy search - REINFORCE

REINFORCE is a direct policy search algorithm which maintains an approximation of the policy function π. The actual policy function is approximated with a parameterized representation of the form π(a | s, θ), where θ is the parameter vector of the function. Algorithm 2 presents the pseudocode of the algorithm.

In every iteration, similar to Q-learning, the algorithm updates the policy through stochastic gradient ascent on the parameter θ, thus improving the estimate of the policy representation π(a | s, θ).

1:procedure REINFORCE(α, γ)
2:     repeat:
3:     Generate an episode s_0, a_0, r_1, …, s_{T−1}, a_{T−1}, r_T, following π(·|·, θ)
4:     for each step t of the episode do
5:         G ← return from step t
6:         θ ← θ + α γ^t G ∇_θ log π(a_t | s_t, θ)
7:     end for
8:end procedure
Algorithm 2 REINFORCE
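For concreteness, a minimal REINFORCE sketch with a tabular softmax policy is given below; the parameterization π(a | s, θ) ∝ exp(θ[(s, a)]) is my own illustrative choice, and the sketch again reuses the step function, STATES, and ACTIONS from the toy MDP.

import math
import random
from collections import defaultdict

def softmax_policy(theta, s, actions):
    """pi(a | s, theta) proportional to exp(theta[(s, a)])."""
    prefs = [math.exp(theta[(s, a)]) for a in actions]
    total = sum(prefs)
    return [p / total for p in prefs]

def reinforce(step_fn, states, actions, episodes=1000, horizon=20,
              alpha=0.01, gamma=0.9):
    """Monte-Carlo policy gradient (REINFORCE) with a tabular softmax policy."""
    theta = defaultdict(float)
    for _ in range(episodes):
        # generate an episode following pi(.|., theta)
        s, trajectory = random.choice(states), []
        for _ in range(horizon):
            probs = softmax_policy(theta, s, actions)
            a = random.choices(actions, weights=probs)[0]
            s2, r = step_fn(s, a)
            trajectory.append((s, a, r))
            s = s2
        # for each step t: compute the return G and take a gradient-ascent step on theta
        for t, (s_t, a_t, _) in enumerate(trajectory):
            G = sum(gamma ** k * r for k, (_, _, r) in enumerate(trajectory[t:]))
            probs = softmax_policy(theta, s_t, actions)
            for b, p_b in zip(actions, probs):
                grad_log = (1.0 if b == a_t else 0.0) - p_b   # gradient of log pi
                theta[(s_t, b)] += alpha * (gamma ** t) * G * grad_log
    return theta

theta = reinforce(step, STATES, ACTIONS)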

2.1.1 Actor-critic algorithm

Algorithm 3 presents the pseudocode of the algorithm. Actor-critic combines the above algorithms into a single algorithm, where a temporal-difference error is calculated to update the value function. This is in principle similar to TD-learning, the only difference being that the value function is represented in a parametric form v(s, w). Thus, the value function is updated through a stochastic gradient update of the weights w (step 7), instead of a step update as in TD-learning. This temporal-difference error is also used to update the parameters θ of the policy function (step 8), similar to REINFORCE. A detailed derivation and description of the notation is provided in the linked draft: http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf#page=288.

1:procedure Actor-critic(α_w, α_θ, γ)
2:     Repeat forever:
3:     while terminal state not reached do
4:         Choose action a according to the actor's current policy π(·|s, θ)
5:         Take action a, observe reward r, and next state s'
6:         δ ← r + γ v(s', w) − v(s, w)    ▷ temporal-difference error
7:         w ← w + α_w δ ∇_w v(s, w)    ▷ gradient update of critic's value function parameters
8:         θ ← θ + α_θ δ ∇_θ log π(a|s, θ)    ▷ gradient update of actor's policy parameters
9:         s ← s'
10:     end while
11:end procedure
Algorithm 3 Actor-critic algorithm
Figure 3: Schematic representation of actor-critic algorithm.
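A corresponding one-step actor-critic sketch, combining a tabular critic with the softmax actor from the previous sketch (it reuses softmax_policy and the toy MDP helpers), might look as follows; it is illustrative only and not the paper's implementation.

def actor_critic(step_fn, states, actions, episodes=1000, horizon=50,
                 alpha_w=0.1, alpha_theta=0.01, gamma=0.9):
    """One-step actor-critic with a tabular critic v(s) and a softmax actor."""
    w = defaultdict(float)        # critic: state-value estimates
    theta = defaultdict(float)    # actor: policy preferences
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(horizon):
            probs = softmax_policy(theta, s, actions)
            a = random.choices(actions, weights=probs)[0]
            s2, r = step_fn(s, a)
            delta = r + gamma * w[s2] - w[s]              # temporal-difference error (step 6)
            w[s] += alpha_w * delta                       # critic update (step 7)
            for b, p_b in zip(actions, probs):            # actor update (step 8)
                theta[(s, b)] += alpha_theta * delta * ((1.0 if b == a else 0.0) - p_b)
            s = s2
    return theta, w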

A3C: Asynchronous Advantage Actor-Critic algorithm

The Asynchronous Advantage Actor-Critic (A3C) algorithm is a parallel, asynchronous, multi-threaded implementation of the actor-critic algorithm (Figure 4) [mnih2016asynchronous]. The pseudocode of A3C is shown in Algorithm 4. Each thread is identical to the actor-critic algorithm; however, instead of updating gradients at each step as in actor-critic, the A3C algorithm accumulates the gradients (steps 6, 7) and performs an asynchronous update to the global actor-critic unit after a fixed number of steps (steps 9, 10).

1:procedure A3C(α_w, α_θ, γ, n)
2:     repeat:
3:     Generate an episode (of at most n steps) s_0, a_0, r_1, …, following the thread's policy π(·|·, θ)
4:     for each step t of the episode do
5:         δ_t ← r_{t+1} + γ v(s_{t+1}, w) − v(s_t, w)    ▷ temporal-difference error
6:         dw ← dw + δ_t ∇_w v(s_t, w)    ▷ gradient accumulation of critic's value function parameters
7:         dθ ← dθ + δ_t ∇_θ log π(a_t|s_t, θ)    ▷ gradient accumulation of actor's policy parameters
8:     end for
9:     Perform asynchronous update of the global critic parameters using dw
10:     Perform asynchronous update of the global actor parameters using dθ
11:end procedure
Algorithm 4 A3C pseudocode for each thread
Figure 4: Schematic representation of the Asynchronous Advantage Actor-Critic (A3C) algorithm.
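The following sketch mimics the per-thread structure of Algorithm 4 with tabular parameters and Python threads; it reuses softmax_policy and the toy MDP helpers from the earlier sketches, and the locking scheme and hyperparameters are my own simplifications rather than the original A3C implementation.

import threading
import random
from collections import defaultdict

def a3c_worker(theta_g, w_g, lock, step_fn, states, actions, n_updates=200,
               n_steps=5, alpha_w=0.1, alpha_theta=0.01, gamma=0.9):
    """One A3C-style worker thread with tabular parameters; illustrative only."""
    for _ in range(n_updates):
        theta = defaultdict(float, theta_g)   # local snapshot of the global actor
        w = defaultdict(float, w_g)           # local snapshot of the global critic
        d_theta, d_w = defaultdict(float), defaultdict(float)
        s = random.choice(states)
        for _ in range(n_steps):
            probs = softmax_policy(theta, s, actions)
            a = random.choices(actions, weights=probs)[0]
            s2, r = step_fn(s, a)
            delta = r + gamma * w[s2] - w[s]               # TD error (step 5)
            d_w[s] += delta                                # accumulate critic gradients (step 6)
            for b, p_b in zip(actions, probs):             # accumulate actor gradients (step 7)
                d_theta[(s, b)] += delta * ((1.0 if b == a else 0.0) - p_b)
            s = s2
        with lock:                                         # asynchronous global updates (steps 9, 10)
            for k, g in d_w.items():
                w_g[k] = w_g.get(k, 0.0) + alpha_w * g
            for k, g in d_theta.items():
                theta_g[k] = theta_g.get(k, 0.0) + alpha_theta * g

theta_g, w_g, lock = {}, {}, threading.Lock()
workers = [threading.Thread(target=a3c_worker,
                            args=(theta_g, w_g, lock, step, STATES, ACTIONS))
           for _ in range(4)]
for t in workers: t.start()
for t in workers: t.join()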

3 Philosophy of Robert Brandom

Background

Robert Brandom (b. 1950) is a Distinguished Professor of Philosophy at the University of Pittsburgh, where he has been a member of the faculty since 1976. He completed his PhD at Princeton University under the guidance of Richard Rorty. Brandom's most notable works include Making It Explicit and a set of lecture publications delivered at the University of Oxford titled Between Saying and Doing. Brandom's philosophy is grounded in the idea of pragmatism (https://plato.stanford.edu/entries/pragmatism/), and his work falls under the philosophy of language, philosophy of mind, and philosophy of logic. His main contribution is inferentialism, where he shows how the process of making inferences imparts semantic meaning to things [brandom1994making][brandom2008between].

3.1 Sapience

The starting point of Brandom’s philosophy is the question:

What is the difference between a parrot who is disposed reliably to respond differentially to the presence of red things by saying "Raawk, that's red", and a human reporter who makes the same noise under the same circumstances? [brandom1995knowledge]

Simply put, this question asks what it means to be us. Asking this question has significant value in the philosophy of AI, where recent advances and the widespread acceptance of artificial intelligence technologies have drawn society to introspect on what it means to be human. Brandom responds to this question by arguing that a parrot's behavior is not part of a special kind of norm-governed social practice. Building on the same argument, he presents a layer-cake picture of sapience, leading him to list three types of beings (the terminology of logical and rational is of Brandom's own choosing, and might not translate directly to a colloquial understanding of these words), ordered by their sapience [wanderer2014robert].

Simple performers: Simple performers, or the type Brandom calls Reliable Differential Responsive Dispositions, are reactive systems. They can be imagined to have a read and write head where, based on an input signal, they have the ability to differentially produce certain types of response outputs.

Rational beings: One step higher in the chain of sapience are rational beings. Brandom says that what differentiates rational beings from simple performers is their ability to make moves in a Game Of Giving and Asking for Reasons (GOGAR). Brandom presents a detailed elaboration of what this game constitutes, and we will look into this in subsequent sections.

Brandom further claims that it is possible for a simple performer to achieve the ability of rational beings through a more complex deployment of its abilities - the ability to draw inferences. For example, if someone is presented with an offer of employment, they have the option of signing it and taking the job. To take this all-important decision, a rational being might consider all the consequences of signing that offer, for example, having to wake up at 6 a.m. every weekday, or being able to earn money. Thus, this interaction of signing the offer might be imagined to be an input-output relation where the input is the presentation of the offer, and the output is signing it. Although both simple performers and rational beings can take part in this interaction with the help of their abilities, what sets rational beings apart is the ability to draw inferences as a consequence of the output action.

Logical Beings: Further in the chain of sapience are logical beings. They distinguish themselves from rational beings by being able to deploy a special kind of vocabulary, which Brandom terms elaboration and explication.

Elaboration is a relation between two practical abilities, for instance the relation between the ability to do multiplication and subtraction (P1) and the ability to do long-form division (P2) [wanderer2014robert]. The relation between P1 and P2 is such that it is possible to achieve the ability P2 entirely from the ability P1 through a step-wise process. Elaboration is this process of step-by-step algorithmic derivation of one ability from another. Besides algorithmic derivation, elaboration also includes another form of derivation, i.e. elaboration-by-training. This can be understood as the derivation of one ability from another where the process of derivation is more complex than just a step-by-step algorithmic process, for example the ability to draw a passable picture of a human from the ability to draw a passable picture of a stick figure.

Explication is the act of making explicit the principles that codify a practical know-how. For example, expressing in some linguistic form the ability to ride a bicycle, or to swing a baseball bat, is an act of explication.

Elaboration and explication in conjunction are referred to as the LX-vocabulary. This vocabulary can contain normative expressions like '…is committed to…', ascriptional phrases like '…says…', or conditional phrases of the form 'if…then…'. Brandom states that the list of LX-vocabulary is unbounded, and any vocabulary that assists in the process of elaboration and explication can be considered a logical vocabulary.

Figure 5: Layer-cake picture of sapience in an increasing order from bottom (simple performers) to top (logical beings)

Figure 5 shows the three types of beings in increasing order of sapience (bottom to top). Critics of Brandom question whether it is at all possible to conceptualize rational beings as separate from logical beings. My understanding is that rational beings are a stepping stone from simple performers towards logical beings. In relation to Searle's Chinese room thought experiment [hauser2001chinese] that passes a Turing test (or what he calls weak AI), Brandom's rational beings would be an instantiation of weak-AI that passes that Turing test. Thus, whereas a rational being can perform a series of statements and inferences, thereby passing a Turing test, it is only logical beings who have the ability to understand that performance (strong-AI).

3.2 Game of Giving and Asking For Reasons (GOGAR)

An essential aspect of Brandom's philosophy is his account of sapience. Whereas the previous section presented his work on the three categories and their hierarchy in the ladder of sapience, this section presents his account of the necessary and sufficient conditions that qualify a sapient being (rational or logical). This can be understood through the interaction of a being in a given social practice (a social practice can be imagined to be any interaction among beings, human or not; in relation to our previous example, engaging in the process of employment is a social practice), or through what form those interactions should take to qualify as sapient (rational or logical). Brandom does this by mapping a being's interaction within a social practice to a specific game - the Game of Giving and Asking For Reasons (GOGAR) (Figure 6).

The main elements of the GOGAR game are as follows (for the sake of purity, I have retained the original terminologies of the game without modification; these terminologies sometimes seem unintuitive, so whenever possible I have added subjective descriptions for clarity):

  • Players and Scorekeepers: Each individual in the game acts as a player (while making a move), as well as a scorekeeper for other players' moves.

  • counters: There are infinitely many distinct counters in the game. The counters (c ∈ C) can be thought of as tokens. Each counter is related to other counters through a relation of committive consequence (cc): cc : c_i → c_j, with c_i, c_j ∈ C.

  • commitment move: This is an act of the player taking a counter and placing it in their commitment box. As part of the move, the player also has to place all other counters that participate in the committive consequence relation with the chosen counter. This can be interpreted as an act of making a claim (either through statements or through actions). Forcing the player to place the counters in relation can be interpreted as being held to the consequences of making that claim. For example, making the claim of 'signing an employment offer letter' might draw the committive consequences of 'coming to work at 9am' and 'wearing a bowtie'.

  • entitlement move: Similar to the commitment move, players can place copies of the counters already in their commitment boxes into entitlement boxes. The difference between entitlement and commitment is that a player is obligated to defend an entitlement if it is challenged by a scorekeeper. A challenge by the scorekeeper can be understood as the scorekeeper not accepting a claim. For example, if the employee claims 'receiving travel allowance' as a committive consequence of 'signing an employment offer letter', then the HR manager (acting as a scorekeeper) can choose not to accept that claim. The player would then have to remove that claim from their entitlement box.

  • A-ing (asserting): A scorekeeper can consider a claim by the player as an a-ing move iff the claim is an entitled one and the player is committed to defending that entitlement if challenged by any other scorekeeper. (A minimal data-structure sketch of these game elements is given after Figure 6.)

Figure 6: A representation of the Game Of Giving and Asking for Reasons
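The data-structure sketch below is one possible way to encode the game elements described above; the class, method, and counter names are hypothetical and purely illustrative, not Brandom's or the paper's formalization.

from dataclasses import dataclass, field

@dataclass
class Participant:
    """An illustrative GOGAR participant with commitment and entitlement boxes."""
    name: str
    commitments: set = field(default_factory=set)   # commitment box
    entitlements: set = field(default_factory=set)  # entitlement box

    def commit(self, counter, cc):
        """Commitment move: place the counter and all its committive consequences."""
        self.commitments |= {counter} | cc.get(counter, set())

    def entitle(self, counter):
        """Entitlement move: copy an existing commitment into the entitlement box."""
        if counter in self.commitments:
            self.entitlements.add(counter)

    def challenge(self, other, counter):
        """Acting as a scorekeeper, reject another player's entitlement claim."""
        other.entitlements.discard(counter)

# committive-consequence relation (cc) over counters, illustrative only
cc = {"sign offer letter": {"come to work at 9am", "wear a bowtie"}}

player, scorekeeper = Participant("employee"), Participant("HR manager")
player.commit("sign offer letter", cc)
player.entitle("sign offer letter")
scorekeeper.challenge(player, "sign offer letter")  # the entitlement is withdrawn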

4 Reinforcement learning semantics and Brandom-ian philosophy

As we discussed in the previous section, both commitments and entitlements use the relation of committive consequence as a mode of expressing claims. Even the scorekeeper's challenges to players are expressed through the modality of placing counters (and the counters that stand in a cc relation to them) that are in opposition to the counters placed by the player. Thus, the counters and the relation of committive consequence are the basic structure on which the GOGAR game is built. It is also relevant here to refresh the intuitive explanation of the counter and the act of placing a counter. Whereas a counter represents one of the set of all possible claims, the act of placing a counter is the pragmatic action of executing that claim (through saying or doing something) by the player.

Thus, one way we relate reinforcement learning semantics to Brandomian philosophy is by connecting the structural elements of RL, to the structural elements of GOGAR.

An MDP distinguishes between a state (s) and an action (a), whereas GOGAR does not have such an explicit distinction. It uses counters and the committive consequence relation among them. Reproducing our earlier notation, this structure in GOGAR can be represented as:

    cc : c_i → c_j,   c_i, c_j ∈ C        (1)

In reinforcement learning, a policy function maps a state s to an action a (a stochastic policy function π(a | s) is a probability distribution over actions given a state; a deterministic policy is a special case where, given a state, the probability is 1 for at most one action and 0 for all others; we consider only deterministic policies here, and the extension to stochastic policies is left as future work), which can be represented in the form (refer to Figure 7):

    π : s_t → a_t        (2)

If we combine states and actions into a tuple such that

    x_t = (s_t, a_t)        (3)
    s_{t+1} = T(s_t, a_t)        (4)

where T is the transition function, then the same policy function can be represented as

    π : x_t → x_{t+1}        (5)
(a) Original policy representation
(b) Restructured policy representation
Figure 7: Policy function representation in an MDP

Thus, with the above reformulation, we can see that formulation (1) of GOGAR and formulation (5) of reinforcement learning are structurally equivalent. This brings us to our link 1:

Link 1: The committive consequence relation among tokens in GOGAR is structurally equivalent to a policy in reinforcement learning.

Our treatment of the above link between RL and Brandom's philosophy also gives us a fresh perspective on reinforcement learning and MDPs. We often view an MDP as a barebones mathematical model devoid of any innate meaning; it gets its meaning only through how the MDP and reinforcement learning are applied to a specific application. However, our conjecture is that this need not be the case. Since the state-action tuple (x_t) is equivalent to Brandom's use of counters (c) in GOGAR, we can use state-action tuples as basic building blocks towards strong-AI, just as the counters build up towards the sapience of logical beings (this is directly derived by drawing parallels between Searle's use of strong-AI and Brandom's logical beings being an instance of it).
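The structural equivalence claimed in Link 1 can be illustrated with a few lines of code over the toy MDP from section 2: a deterministic policy π: s → a is lifted into a map over state-action tuples, π': x_t → x_{t+1}, mirroring the counter-to-counter form of the committive consequence relation. The helper name and the sample policy below are illustrative.

def tuple_policy(policy, transition_sample):
    """Lift pi: s -> a into pi': (s, a) -> (s', a'), as in formulation (5)."""
    def pi_prime(x):
        s, a = x
        s2, _ = transition_sample(s, a)   # s' sampled from the transition model T
        return (s2, policy[s2])           # the next state-action "counter"
    return pi_prime

pi = {"s0": "a0", "s1": "a1"}             # hypothetical deterministic policy
pi_prime = tuple_policy(pi, step)         # 'step' is the sampler from the toy MDP
print(pi_prime(("s0", "a0")))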

Next, we delve into the semantics of the counters and the committive consequence relation in GOGAR. Brandom's position on the semantics of the committive consequence is related to a constitutive view of a social practice. In other words, the specific social practice in which the counters and relations appear is the source of the semantics. Intuitively, it helps to understand this in reference to our previous example. The committive consequence of 'signing an employment offer' to 'waking up at 6am' has meaning only because of the social practice of 'engaging in the process of employment'. Thus, the social practice has a constitutive function in relation to committive relationships [evans2016computer].

More formally, this means that any committive consequence in the formulation (1) needs to be indexed with reference to the social practice p ∈ P, where P is the superset of all social practices:

    cc_p : c_i → c_j,   p ∈ P        (6)

Taking a similar line of argument for reinforcement learning, we can claim that an RL policy formulation of the form (5) has semantic meaning only through the transition function T. We have already established this structurally in equation (4). However, semantically we can say that

    π_T : x_t → x_{t+1},   T ∈ 𝒯        (7)

where 𝒯 is the superset of all possible transition functions. This equivalence between equations (6) and (7) brings us to the second link:

Link 2: The transition function, or the underlying model of the environment in reinforcement learning can be understood as the social practice in a GOGAR.

Re-imagined actor-critic and A3C

Having drawn structural and semantic links between reinforcement learning and Brandom’s philosophy, we now connect the two on the basis of the reinforcement learning algorithms discussed in section 2.1. We do this by asking the following questions:

  • 1. If being able to play the GOGAR is a necessary condition of sapience, can a reinforcement learning algorithm play the game?

  • 2. If the path towards a logical being, and strong-AI, is through the deployment of a set of logical vocabulary, can that algorithm deploy that vocabulary?

To address the first question, we restructure the A3C algorithm, already introduced in section 2.1, in line with the GOGAR game, thereby showing that threads participating in A3C can play the game (we consider the case where all the participants in the game, players and scorekeepers, are reinforcement learning based agents; the extension to cases where a portion of the population is non-AI based is left as future work).

Figure 8: Restructured actor-critic unit in A3C

First, we restructure the basic actor-critic units of the A3C. Figure 8 shows the new restructured actor-critic unit in an A3C algorithm. Instead of a static assignment of actor and critic, as in the original algorithm, we introduce the idea of a participant unit (pu) (shown within the red box in the figure), with actor and critic being dynamic roles played by a pu. Each participant unit holds a value function representation v(s, w), which helps it play the role of a critic, as well as a policy representation π(a | s, θ), which helps the same participating unit play the role of an actor. This is in accordance with GOGAR, where a participant plays the dual roles of player and scorekeeper.
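A minimal sketch of such a participant unit, holding both a tabular value function (critic role) and a tabular softmax policy (actor role), is given below; it reuses softmax_policy, random, and defaultdict from the earlier sketches, and the class name and methods are my own illustrative choices.

class ParticipantUnit:
    """A participant unit (pu) that can play both the actor and critic roles."""
    def __init__(self, name, actions):
        self.name = name
        self.actions = actions
        self.w = defaultdict(float)       # value function v(s, w): critic role
        self.theta = defaultdict(float)   # policy pi(a | s, theta): actor role

    def act(self, s):
        """Actor role (player): sample an action from the softmax policy."""
        probs = softmax_policy(self.theta, s, self.actions)
        return random.choices(self.actions, weights=probs)[0]

    def td_error(self, s, r, s2, gamma=0.9):
        """Critic role (scorekeeper): evaluate a transition with the TD error."""
        return r + gamma * self.w[s2] - self.w[s]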

1:procedure GOGAR-A3C(PU, k, L)
2:     initialize a population PU of participant units
3:     initialize k threads
4:     for each thread do
5:         pu_actor ← sample from PU without replacement
6:         pu_critic ← sample from PU without replacement
7:         while length of interaction < L do
8:             Choose action a according to the actor's current policy π(·|s, θ)
9:             Take action a, observe reward r, and next state s'
10:             δ ← r + γ v(s', w) − v(s, w)    ▷ temporal-difference error
11:             w ← w + α_w δ ∇_w v(s, w)    ▷ gradient update of critic's value function parameters
12:             θ ← θ + α_θ δ ∇_θ log π(a|s, θ)    ▷ gradient update of actor's policy parameters
13:             s ← s'
14:             increment the interaction length counter
15:         end while
16:     end for
17:end procedure
Algorithm 5 GOGAR A3C

Next, we restructure the complete algorithm as shown in Figure 9. Algorithm 5 shows the pseudocode of the algorithm. The algorithm is initialized with a population (PU) of participating units. Next, we initialize the threads, each simulating an interaction between two participating units sampled without replacement from the population (steps 5, 6). Until a predefined length of interaction (L) is reached, each thread then executes an actor-critic algorithm (steps 8-14) with the assigned actor and critic. Further elements of the game, like commitments and entitlements, can then be designed as primitives depending on the specific application.
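The following sketch mirrors the pairing structure of Algorithm 5 using the ParticipantUnit class and toy MDP helpers defined earlier; for simplicity the interactions are run sequentially rather than in threads, and all names and hyperparameters are illustrative.

def gogar_a3c(population, step_fn, states, n_threads=3, interaction_len=20,
              alpha_w=0.1, alpha_theta=0.01, gamma=0.9):
    """Pair up participant units and run actor-critic interactions between them."""
    for _ in range(n_threads):
        actor, critic = random.sample(population, 2)   # sample without replacement (steps 5, 6)
        s = random.choice(states)
        for _ in range(interaction_len):               # interaction of length L (steps 8-14)
            a = actor.act(s)
            s2, r = step_fn(s, a)
            delta = critic.td_error(s, r, s2, gamma)
            critic.w[s] += alpha_w * delta             # critic update
            probs = softmax_policy(actor.theta, s, actor.actions)
            for b, p_b in zip(actor.actions, probs):   # actor update
                actor.theta[(s, b)] += alpha_theta * delta * ((1.0 if b == a else 0.0) - p_b)
            s = s2

population = [ParticipantUnit("pu%d" % i, ACTIONS) for i in range(4)]
gogar_a3c(population, step, STATES)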

Figure 9: Restructured A3C with 4 participating units and 3 interactions

Having addressed the first question by restructuring the A3C algorithm so that the agents can participate in the game of GOGAR, we now focus on the second question.

As mentioned in section 3.1, participants in the game of GOGAR can advance themselves from rational beings to logical beings by deploying logical vocabulary. One category of logical vocabulary is the conditional locution of if…then. These conditional locutions can also be interpreted as prediction questions of the form if I do this…then does that happen? In our restructured implementation of GOGAR-A3C, each participating unit holds a value function representation. These value functions can, in principle, also be substituted by general value functions (GVFs). GVFs are a generalization of the standard value function in reinforcement learning, such that agents are able to answer prediction questions of the aforementioned form [sutton2011horde]. Although a step-by-step re-implementation of GOGAR-A3C with GVFs instead of value functions is left as future work, with reference to our second question we feel confident in answering that it is in principle possible to deploy logical vocabulary in GOGAR through reinforcement learning.
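As a rough indication of what substituting GVFs would look like, the sketch below learns a general value function in the spirit of Horde [sutton2011horde], where the prediction target is defined by an arbitrary cumulant and continuation function rather than the environment reward; the specific cumulant (an "if I keep doing this, do I reach s1?" question) and all names are illustrative, and this is not the re-implementation left as future work. It reuses the toy MDP helpers from section 2.

def gvf_td_learn(step_fn, states, behavior_policy, cumulant, continuation,
                 episodes=500, horizon=30, alpha=0.1):
    """TD-learn a GVF: the expected discounted sum of a cumulant signal."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = random.choice(states)
        for _ in range(horizon):
            a = behavior_policy(s)
            s2, r = step_fn(s, a)
            c = cumulant(s, a, s2, r)     # *what* is being predicted
            g = continuation(s2)          # *how far* the prediction extends
            V[s] += alpha * (c + g * V[s2] - V[s])
            s = s2
    return V

# e.g. an "if I keep taking a0, do I end up in s1?" style prediction question
V_pred = gvf_td_learn(step, STATES,
                      behavior_policy=lambda s: "a0",
                      cumulant=lambda s, a, s2, r: 1.0 if s2 == "s1" else 0.0,
                      continuation=lambda s2: 0.0 if s2 == "s1" else 0.9)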

Conclusion

In this paper, we presented a Brandom-ian view of reinforcement learning with the objective of advancing it towards strong-AI. We introduced the main elements of reinforcement learning and MDPs, and presented some popular RL algorithms. Next, we presented Brandom's philosophy and described in detail the mechanism of the Game Of Giving and Asking For Reasons (GOGAR). As our main contribution, we drew links between the constitutive elements of reinforcement learning and GOGAR. Further, we theorized on two important questions that we believe lie on the path towards strong-AI. Finally, we showed that it is possible to re-imagine reinforcement learning, with the help of Brandom's philosophy, for the objective of achieving strong-AI.

References
