A Programming Language With a POMDP Inside

A Programming Language With a POMDP Inside


We present Poaps, a novel planning system for defining Partially-Observable Markov Decision Processes (POMDPs) that abstracts away from POMDP details for the benefit of non-expert practitioners. Poaps includes an expressive adaptive programming language based on Lisp that has constructs for choice points that can be dynamically optimized. Non-experts can use our language to write adaptive programs that have partially observable components without needing to specify belief/hidden states or reason about probabilities. Poaps is also a compiler that defines and performs the transformation of any program written in our language into a POMDP with control knowledge. We demonstrate the generality and power of Poaps in the rapidly growing domain of human computation by describing its expressiveness and simplicity by writing several Poaps programs for common crowdsourcing tasks.

Key Words.:
POMDPs, Planning, Decision Theory, Adaptive Programming, Crowdsourcing

3 \toappear


I.2Artificial IntelligenceProgramming Languages and Software


Algorithms, Languages

1 Introduction

Although optimal decision-making is widely applicable across many aspects of human life, ultimately the ability to construct and use intelligent agents to make optimal decisions has been restricted to those who understand to some degree the theory of decision processes. Those who would greatly like access to such tools, like many in the crowdsourcing community weld-hcomp11 () for example, must either resort to sub-optimal techniques that use approximate heuristics or hire a planning expert to formally define and solve their domain-specific problems.

This paper presents Poaps (Partially Observable Adaptive Programming System), which is a step toward bringing the power of decision theory to the masses. Poaps makes available the power of Partially-Observable Markov Decision Processes (POMDPs) to non-expert users through the interface of an adaptive programming language that provides an abstraction over POMDPs so that non-experts can write POMDPs without knowing anything about them.

Like many previous approaches to proceduralizing decision processesandre-aaai02 (); boutilier-aaai00 (); pinto-icmla10 (), we create a language that includes choice points, which allow the system to make optimal decisions adaptively. However, unlike previous approaches, we do not expect the programmer to explicitly reason about the state space or world dynamics. Such an expectation for our target users is impractical. In an informal experiment, we recruited scientists unfamiliar with artificial intelligence and introduced them to (PO)MDPs. We then asked them to define a state space for some simple crowdsourcing problems. However, all of them were unable to define a satisfactory state space, let alone an entire POMDP. In particular, they had trouble grasping the meaning and mathematical formalism of a POMDP hidden “state.” Therefore, the challenge is to create a language that can express all the details of POMDPs, yet hides these details from the programmer, but is still flexible enough to represent programs for a variety of scenarios.

One of our key contributions is a division of work between experts and non-experts that achieves our desiderata. Poaps asks experts to define primitives. A primitive consists of a function and a mathematical model of that function. The mathematical model describes some hidden state underlying the function and its dynamics. Non-expert programmers understand the primitives as procedures in terms of easily understood inputs and outputs and may not appreciate the hidden mathematical models. They can, however, compose the primitives into novel programs for their needs. Poaps then compiles their programs into completely new POMDPs.

For instance, suppose a user would like to write a program that uses crowdsourcing to label training data. An adaptive program that achieves this goal might be the following. For each datum, poll crowd workers for labels until the system is confident it can stop and return a label. For an adaptive program to make optimal decisions, it needs to both maintain some state that represents a current belief about what the correct label is, and know how to update this belief after every label observation. Instead of hiring a planning expert to handcraft a custom POMDP for this simple voting problem dai-aaai10 (); kamar-aamas12 (), users, and in particular, non-experts, should be able to write a very simple program that abstracts away from state variables and probabilities: either ask another worker for another label and recurse, or return the label with the most number of votes. Poaps achieves this goal. Figure 1 shows a Poaps program for labeling (voting) that implements the algorithm we just described. It assumes there are two possible labels and reposes the problem as one of discovering if the first label is better than the second. Notice that the program makes no reference to any POMDP components in its definition. The user does not need to specify some hidden state that represents the correct answer. Instead, an expert has previously defined the primitive crowd-vote, which contains a mathematical model describing the dynamics of its inputs and ouputs, and we see that all the user needs to do in excess of providing the program logic is to use that primitive (by finding it in a library) and provide a choice point in the program. Then, Poaps will automatically determine the optimal branch to take at runtime.

(d efine (vote-better? q a0 a1 c0 c1)
(c hoose
(if (crowd-vote q a0 a1)
(vote-better? q a0 a1 (+ c0 1) c1)
(vote-better? q a0 a1 c0 (+ c1 1)))
(if (> c0 c1) #t #f)))
Figure 1: A Poaps program for labeling that manages uncertainty without exposing it to the user. q is an input question, a0,a1 are two possible answers, and c0,c1 count the number of votes for each choice.

We show the value of Poaps by writing many useful crowdsourcing POMDPs as Poaps programs. The simplicity of the language and programs leads us to believe that our system will be easily usable by non-experts.

In summary, our contributions are (1) a system that exploits a separation of experts and non-experts that allows non-experts to write POMDPs while being isolated from the mathematical description of the decision process, and (2) an implementation that will likely help non-expert POMDP practitioners in their ability to write decision processes in a variety of domains.

2 Related Work

Various languages have been proposed in the literature for representing POMDPs. Several of those are declarative representations, which ask the user to explicitly declare each component (state, actions, etc.) of a POMDP. Examples include Cassandra-style format 1, Probabilistic PDDL younes-ipc04 (), and RDDL sanner11 ().

Several procedural languages have also been developed including ABL simpkins-oopsla08 (), ALisp andre-aaai02 () and concurrent ALisp marthi-ijcai05 (), Hierarchical Abstract Machines (HAMs) parr-nips98 (), and Programmable Hierarchical Abstract Machines (PHAMs) andre-nips01 (). All these representations allow a user to provide control knowledge in the form of a procedural program for an existing and explicitly specified MDP. In other words, when one writes an ALisp program, one must additionally explicitly specify an MDP that the program is tied to and constraining.

DTGolog boutilier-aaai00 () is a situation calculus-based procedural language that can both define decision problems (by defining a set of axioms) and specify control. While this unification is useful for experts, users must still explicitly specify MDPs before they can write control programs. While non-expert users can write control policies for expert-written MDPs, they cannot write their own MDPs. Additionally, they must explicitly use the components of the MDP in their control policies.

Stochastic Programs (SP) mcallester-99 () provide a language for experts to write world models and primitive actions and then compose those primitive actions to create control policies for the corresponding world model. While they do no consider non-expert use of the language, one can imagine non-experts using the primitive actions to create control policies. However, these non-expert users must explicitly use POMDP components. In particular, the primitives do not abstract away from the state space; they take state variables as arguments, and return state variables and observations. SP also does not allow for learning policies. The language only provides for specifying a complete control policy and evaluating its utility.

The work on adaptive programs pinto-icmla10 () allows non-expert users to quickly construct observable decision processes by writing programs that can contain optimizable choice points. However, construction of POMDPs still requires the user to explicitly model POMDP details like belief states and belief updates.

The key difference between our work and all the related work is that we do not ask the programmer to explicitly define or use POMDP components. Instead we leverage a division of work between experts and non-experts whereby non-experts can glue together expert-provided POMDP components to create entirely new POMDPS for their own problems.

3 Crowdsourcing Background

We are motivated by and describe our system using many examples taken from the crowdsourcing literature. Crowdsourcing requesters, those who hire crowdsourced workers, often design workflows to complete their tasks. An example of a simple workflow is the labeling workflow we described in the Introduction (Figure 1). Another example of a workflow is the iterative improvement workflow little09 (). Suppose a requester wants to generate some artifact, like a text description of an image. In the iterative improvement workflow, he first hires one worker to improve an existing caption. Then, he asks several workers to vote on whether the new caption is better than the old caption. Finally, he repeats this process until he is satisfied with the caption. Previous work has hand-crafted a POMDP for this particular workflow in order to make dynamic decisions like when to vote and when to stop, showing significant savings over static policies dai-aaai10 (). Our system would allow these requesters, who are likely not planning experts, to easily write a program for this workflow (and others) that implicitly defines a POMDP, which our system can then optimize and control.

4 Primitives

In order to interpret a program like the one for iterative improvement as a POMDP, Poaps needs mathematical models for function calls, like the one that hires a worker to improve an artifact. Poaps asks experts to define primitives to bootstrap this process. A primitive is a ten-tuple , where:

  • is a set of domain states.

  • is a set of range states.

  • is a set of observations.

  • is a transition function.

  • is an observation function.

  • is an -dimensional indicator vector indicating which of the are observable.

  • is a cost function.

  • is a set of user domain states.

  • is a set of user range states.

  • is a user function.

An expert defines all 10 of these components. Intuitively, a primitive is a function (the user function ), and a model of that function (the rest of the components). We note that in the special case when , is a one-action POMDP. The best way to understand the purpose of primitives is through an example. In particular, we discuss how we would define the primitive c-imp, which would be a function used in iterative improvement to improve an artifact.

First, we define the function part. c-imp should take an artifact as input, call some crowdsourcing API, and return an improved artifact . Therefore, we define the user function to be = calltoAPI(), where are the set of all possible artifacts (e.g., the set of all strings). Now we want to define the model part of the primitive, which will track the quality of artifacts being input and output by . We define to represent the hidden quality of the artifact. The transition function needs to encode the probability of getting an artifact of quality if a worker improves an artifact of quality . Therefore, we define using some conditional distribution like a Beta distribution. We set to be the amount of money paid to a worker, which can be some constant like 5 cents. c-imp produces no observations so is empty, and hence there is no observation function . Finally, we set indicating that is not observable.

This primitive combines a model for the improvement of an artifact with a function that outputs an improvement of the artifact. We can view each as a model for and is a model for . So, for a non-expert user who does not care about or cannot understand the model, a primitive is simply the function . These non-experts can call primitives in their programs, and when they do so, they expect they are calling the function .

5 The Language

We now describe the Poaps language, which users use to express adaptive programs using primitives. We define a Poaps program to be a function definition written in the Poaps language. The Poaps language is an extension of Lisp, because Lisp is both easy to write and easy to interpret. Following previous work andre-aaai02 (); boutilier-aaai00 (); pinto-icmla10 (), we add the special form (choose <> <> ... ).

The choose special form is a construct for dynamic execution. It takes a variable number of arguments, each of which is a Lisp S-expression. When used in a program, it describes a choice point in the program, meaning that at runtime, Poaps will dynamically decide the optimal argument expression to execute.

A key contribution of Poaps is how function calls are interpreted. However, we first emphasize that for a non-expert user, Poaps behaves just as an ordinary programming language. When a non-expert user calls a primitive: , the expression evaluates to where . A function call is just a function call, regardless of whether the function is a Poaps primitive or a user-defined function. Figure 2 shows a Poaps program that a crowdsourcing expert might write for improving a piece of text using crowdsourcing. It is a simplified version of iterative-improvement that removes voting.

(d efine (improve text)
(c hoose
(improve (c-imp text))
Figure 2: A Poaps program for improving a piece of text. text is the current text.

To the non-expert user, the argument text is bound to a string, . During execution, there are two execution paths. Suppose the program chooses the first path. When the string, , is passed to c-imp, a primitive we described in the previous section, a function, is called to hire a crowdworker to improve the string. c-imp returns the improved string and the program recurses. If the program chooses the second execution path, the string is returned.

However, the semantics of the Poaps language are more complex. The expert user understands that in Poaps, all variables are actually bound to two values, and thus all expressions evaluate to two values. The first value, the Normal value, is the usual value that the non-expert user sees and understands, and is the same as it would be in any other programming language. For example, text is bound to a string. The second value is a Poaps value that can be unobservable, and hence, represented by a distribution in our system. This value is the value of a state variable in the POMDP Poaps constructs. Let represent this possibly hidden Poaps value of some expression, .

For the expert user, calling a primitive is everything that it is for the non-expert user. However, the expert user knows that in addition to being bound to the Normal value, the result of an expression is bound to a Poaps value with probability , where . Furthermore, when is called, an observation is produced with probability . The Poaps agent reasons about the Poaps values in the program using observations in order to make decisions.

Consider the program in Figure 2. The argument text is actually bound to two values. The first value, the string, is what the programmer cares about. The second Poaps value can be thought of as some unobservable measure of the quality of the text . The domain of this second value was implicitly specified by an expert when he defined the primitive c-imp. When c-imp is called, in addition to returning an improved string, a Poaps value is also returned with probabilities defined by . Then, the program recurses and text is now bound to both the new string and . In this example, no observation is produced.

We emphasize that the expert, the program, and the Poaps agent, may not know the Poaps values. The Poaps values may be unobservable, so the best an expert and an agent can do is hold a belief about what they might be, using the observations as hints. Therefore, the next step in Poaps is to compile a Poaps program into a POMDP, and then solve the POMDP to generate a policy that controls the program based on the beliefs about the hidden values of the variables.

As another example, we provide a description of the voting program we present in Figure 1. The program uses three primitives: +, >, and crowd-vote. The Poaps values of q, a0, and a1 can be thought of as unobservable measures of the difficulty of the question and the quality of the two answers, respectively. The Poaps values of c0 and c1 are observed, and are the same as their Normal values. crowd-vote’s range states are the same as its observations (), and its observation function is defined as if and only if . So, when it is called, it returns a Poaps value with probability defined by , and the observation it emits is the same Poaps value. The Normal value it returns is also the same as the Poaps value. The primitives + and > are defined in the expected way. 2

Of course, Poaps programs do not restrict users to calling primitives. Users can also call their own user-defined functions. For example, they can use their program for voting (Figure 1) in a program for iterative-improvement (Figure 3). We note that in the program for voting, the operators > and + are primitives. When a function calls another user-defined function, the semantics are “call-by-poaps-value.” Quite simply, in contrast to the normal “call-by-value” semantics where only one value is copied and passed, in our language, both the normal value and Poaps value are copied and passed.

(d efine (it-i image worse-text better-text)
(c hoose
(it-i image better-text
(c-imp better-text))
(if (vote-better? image better-text
worse-text 0 0)
(it-i image worse-text better-text)
(it-i image better-text worse-text))
Figure 3: A Poaps program for iterative-improvement on descriptions for images.

We now define a compiler for the Poaps programming language, which converts the language into a POMDP.

6 The Compiler

Before we delve into the technical details of the compiler, we provide a high-level description of the process.

The whole point of converting a Poaps adaptive program into a POMDP is to enable construction of an optimal policy for the program, but this requires an optimality criterion. Since optimality is different for every user, we need the flexibility to construct different utility functions or goals for individual users. In light of these challenges, we assume that executing a primitive incurs a cost defined by the primitive, but that an externally-provided and expert-defined mechanism for goal or utility elicitation (e.g. chajewska-aaai00 ()) is used to guide the overall program objective. For example, consider the voting program of Figure 1. It might cost $0.05 to execute the crowd-vote primitive, but learning a given user’s desired target accuracy in order to guide the execution of the program requires additional information. A reasonable goal elicitation module for a POMDP compiled from this program is one that simply asks the user for a desired accuracy and budget, and converts the desired accuracy into a goal belief state and the budget into a horizon to ensure no dead ends. Such a goal elicitation module could be used for any program that outputs “correct answers” and uses crowd-vote.

Alternatively, users can forego providing their goals/utilities with an external mechanism and simply write goals into their programs. For example, they can simply write their own termination conditions that rely only on the visible parts of their programs. Whether or not we have elicited goals/utilities, our goal is to execute the branches that minimize the expected sum of costs.

From our description of the Poaps language, we have a very natural, but unbounded, decision process that emerges. This decision problem can be posed as a history-based MDP. The state of the MDP consists of all the branches taken and observations received so far. An action in the MDP is choosing a branch in the program. Taking an action produces observations and costs, so the transition function (from a list of actions and observations to another list of actions and observations) is completely determined by the dynamics of the underlying primitives and our “call-by-poaps-value” semantics.

However, we do not want to define such an MDP because solution methods will not scale. Instead we define an equivalent POMDP. We now define the Poaps compiler that produces this POMDP by describing in detail the process that converts any Poaps program into a POMDP. Given an input program , the compiler converts into a POMDP by the following steps:

  1. Define a set of states by statically analyzing . Each state variable of will represent the Poaps value of some variable or expression in or a function called by . So, we call a state a control state, because it is the part of the state that determines what action should be taken.

  2. Construct a Hierarchical Abstract Machine (HAM) parr-nips98 () by evaluating the program under a set of operational semantics. A HAM is a type of nondeterministic finite state machine. Each state will be a representation of the current program counter. In other words, it tells the agent where in the program it currently is. So, we call this state the machine state. This state will be fully observable, and provides information about what actions are available to take.

  3. Following the insights of parr-nips98 (), merge with to create a POMDP with state space , and define the actions, transition function and observation function of by traversing and applying a set of rules. Therefore, a single state in our POMDP will be a tuple , where one part of the state is the machine state, and the other part of the state is the control state. Finally, using a separate goal/utility elicitation module, integrate the goal or rewards into the POMDP.

6.1 Step 1: Creation of a State Space S

First, we need to define the state variables for the arguments of . Let be the set of all arguments of . In order to construct , the compiler needs to know the state space of each argument. The state space of an argument is defined by the domain state space of the primitives that use 3.

Next, we need to define state variables for all subexpressions in . A program in our language can be viewed as an evaluation tree of expressions. For example, Figure 4 shows the corresponding tree for the improve program (Figure 2). In order to remember all state necessary to control, we have a state variable for each subexpression in . We denote this state space .

Figure 4: The tree for improve

Let be the set of Poaps programs corresponding to user-defined functions that are called in . Then, we abuse notation for ease of understanding, and recursively define . Thus, the state space that we have constructed is a cross product of the state spaces of all the functions that calls, the state spaces of all the arguments of , and the state space which consists of all possible evaluations of every expression in . This state space is the control state space. Since we use Monte-Carlo solution methods to solve our POMDPs, we do not need to express the state in a closed, non-recursive form.

6.2 Step 2: Construction of a HAM

The second step in the compilation process is to construct the machine state space by constructing a HAM parr-nips98 () given and . The HAM’s states will be used in our constructed POMDP as observable state variables that represent the current program counter. Each HAM state represents the evaluation of an expression. In other words, the HAM will be the part of the POMDP that says where in the evaluation tree we are for a program .

The five types of states of a HAM are Action, Call, Choice, Start, and Stop. Call states represent a call to a user-defined function. They will execute the corresponding HAM. Choice states can transition to one of many HAMs. Stop states signify the end of execution of a HAM and return control to the next state of the parent calling HAM. Start states denote the initial HAM state. Action states represent the evaluation of a symbol or constant, or the execution of a primitive.

Finally, we add a sixth type of state: an Observation State. Obs states do not represent the evaluation of any expression in . These states will do nothing except emit an observation. These states are inserted after conditionals so that an agent can eliminate inconsistent beliefs. These states were not necessary in parr-nips98 () because their world was fully observable.

We evaluate the program to a HAM by using inference rules in the same way computer programs are evaluated by interpreters. We recursively evaluate subexpressions to HAMs using inferences rules and then combine them into larger and larger HAMs for each parent expression until we have a HAM for .

Consider the improve program in Figure 2. We first use an inference rule for define, which leads to using a choice inference rule. We provide a simplified version of the choice inference rule here.


[choice]H ; e_i ⇓ M_iH ; (choose e_1 ⋯e_n) ⇓ Choice(M_1,…,M_n)

The rule says that if each expression evaluates to a HAM under the heap , then the expression evaluates to a HAM that contains a Choice node that can transition to any of the HAMs . Thus, for the improve program, when we evaluate the (choose...) subexpression, there are two expressions that needs to be recursively evaluated. The result is the HAM in Figure 5.

After we construct a HAM, we post-process by adding a Start state to the beginning and a Stop state to the end. Additionally, if we see any tail calls (Call states that are leaf nodes), we can add an edge from the call state to the beginning of the HAM, and change the semantics of that call state so that it transitions to the next state instead of executing another HAM as a subroutine.

Figure 5: A HAM for the program improve in Figure 2. Circles are action states, diamonds are call states, rectangles are choose states.

6.3 Step 3: Putting It All Together

Letting denote the set of states of a HAM , the state space of the POMDP, , that we construct is .

The actions depend only on the current machine state , which is fully observable. In any machine state that is not a Choice state, the agent can only take one action. If the machine state is a Choice state, then the actions are the branches of the Choice state.

We define the transition function of such that values are passed around correctly between states to enforce call-by-poaps-value semantics.

The observation function is simple. Observations are only received in two scenarios. First, observations can be received when executing a primitive and they are defined by the primitive. Second, observations can be received when transitioning to a HAM observation state.

7 Optimality

We now show that the POMDP we construct is correct, in that its optimal policies result in the optimal executions of its corresponding program.

Lemma 1

For any POMDP for a program , let be the set of choice belief states, which are the belief states in which the machine component of every possible world state is a choice node. There exists a semi-MDP with state space , such that an optimal policy for corresponds to an optimal policy for , in that simply augments by mapping belief states not in the domain of to their single, default actions.


Consider the belief-MDP that corresponds to . Consider the states that are not choice beliefs. We can remove these states to produce an equivalent belief-Semi-MDP. The optimal policy for this belief-Semi-MDP is the same as the optimal policy for the belief-MDP, and thus the same as the optimal policy for .

Theorem 1

Let be the history-based MDP associated with a program . Then an optimal policy for the POMDP we generate, , can be used as an optimal policy for .


and are stochastically bisimilar givan-aij03 () (we map each history to its corresponding belief state), and the optimal policy of corresponds to that of (Lemma 1), so the optimal policy for can be used as an optimal policy for (Lemma 1).

This theorem also affirms that an optimal policy for our constructed POMDP is not just a “recursively optimal” dietterich-jair00 () policy. For example, suppose a user has written a program that calls some other user-defined program that also calls some other user-defined program . Then, the optimal policy’s actions while in consider not only the state local to , but also the state local to and the state local to . In other words, an optimal policy for does not solve lots of primitive POMDPs in isolation. Notably, a POMDP solver that produces an optimal policy will update beliefs about state variables local to even when observations are made about correlating state variables local to .

8 Monte Carlo Planning

We solve the POMDP when a user runs a program. The POMDP that we construct can potentially have many unreachable states. Additionally, we do not want to construct the entire state space or the full matrix representation of the transition and observation functions, since these can be very large or infinite. Therefore, we choose to use online Monte-Carlo methods to solve the POMDP. While we try using a UCT-based solver, POMCP (without a rollout policy) 4 silver-nips10 (), we find that the value function can take an extraordinary amount of time to converge.

Instead, we modify RTDP-Bel barto-ai95 () to create C-RTDP, an algorithm similar to HAMQ-learning parr-nips98 () that takes advantage of the fact that the actual complexity of the POMDP is determined by the number of choice points. C-RTDP modifies RTDP-Bel by only performing backups on choice beliefs. In the POMDP that we construct, all the states that have non-zero probability in a reachable belief state will always have the same machine state. A choice belief is one in which the machine component of every state is a choice node.

We fully specify our algorithm below. means the belief state given that the agent had belief state , then took action and received observations . is the expected cost of taking action in belief and receiving observations .

Initialize belief Sample if  is terminal then
end if
if  is a choice node then
       for every action  do
             is choice belief
       end for
       default action
end if
Update Sample , Update and repeat
Algorithm 1 C-RTDP (One simulation)

We can show that C-RTDP converges to the optimal policy:

Theorem 2

For any POMDP for a program , C-RTDP will converge to the optimal policy with probability 1.


For a POMDP , C-RTDP solves the corresponding belief-Semi-MDP. Then by Lemma 1, we have that C-RTDP solves .

By using a Monte-Carlo, online approach, we gain several advantages. Initializing the initial belief is easy. Suppose a user runs a program for a function f with arguments . None of the expressions have been evaluated yet, so we only need to initialize our belief of the arguments . If is observable, as defined by the primitives that use it inside , then we simply define the Poaps value of to be equal to the Normal value. Therefore, if a state is observable, its space can be infinite. If is unobservable, then we initialize a uniform belief over the state space defined by the primitives that use it.

9 Proof-of-Concept

As a proof of concept that Poaps can run programs, we implement the Poaps system and write the voting program from the introduction. We also implement a goal eliciation module. The primitive crowd-vote has a cost of 1 cent and we specify a goal accuracy of 90%. We run the program on Mechanical Turk with 1100 named entity recognition tasks. Each named entity recognition task begins by providing the worker with a body of text and an entity, like “Washington led the troops into battle.” Then it uses Wikification milnewitten08 (); ratinov11 () to find a set of possible Wikipedia articles describing the entity, such as “Washington (state)” and “George Washington,” and asks workers to choose the article that best describes the entity. Poaps achieves an overall accuracy of 87.73% with an average cost of 4.33 cents. This result is consistent with those in dai-aij13 (), showing that our general purpose implementation can perform at par compared to an expert-written problem-specific POMDP, suggesting the value of our system to end-users.

10 A Larger Example

Throughout this paper, we have expressed several practical crowdsourcing problems in our language. Note that all our programs have only used two non-trivial expert-defined primitives: c-imp and c-vote (trivial Lisp-primitives like + and > whose hidden behaviors are identical to their visible behaviors come with Poaps). We demonstrate the versatility of our paradigm by writing find-fix-verify bernstein10 (), a more complex and popular workflow that can be used for crowdsourcing edits to text.

For example, given a piece of text as input (like a term paper), it first asks crowdworkers to find patches in the text that need attention. Then, given these subsets of text, it asks workers to revise them. Finally, given the revisions, it asks workers to verify that the revisions are better, through some voting mechanism.

In Figure 6, we show the Poaps program for find-fix-verify (ffv). In addition to using c-imp and c-vote, we only need to use one more non-trivial expert-defined primitive: c-find, which asks a worker to provide an interval of text that requires attention. However, note that we can potentially use only c-imp and c-vote by eliminating c-find and replacing it with c-vote where the possible answers are a set of intervals. These primitives provide all the information we need to construct the POMDP for the program. The domain state spaces of the primitives provides the state spaces of the Poaps values of the arguments to those primitives, and the transition functions describe the probabilities of the Poaps values of the returns.

The program also uses trivial string and list manipulation primitives like get-relevant-text, replace-text, and merge, which like + and >, do not need to be expertly defined. worse-text represents what we think is the worse-text and better-text represents what we think is the better text. We call ffv with worse-text bound to an empty string and better-text bound to the text we want to improve.

There are three choices. We can: 1) find mistakes in and fix the better text (find-fix and then recurse, or 2) verify which version of the text is better and recurse, or 3) return the better text. find-fix first calls find to repeatedly ask workers to provide intervals in the text that require work. Then, it calls fix with those intervals and repeatedly asks workers to improve the text in those intervals.

A simple goal elicitation module for this program could simply ask the user whether they would like a “Almost-Perfect” text, an “Excellent” text, or a “Satisfactory” text. It would translate the choice into a goal belief on the hidden quality of the text, and then minimize the expected cost of achieving that goal.

(d efine (ffv worse-text better-text)
(c hoose
(ffv better-text
(find-fix better-text))
(if (vote-better? ’which is better?’ better-text
worse-text 0 0)
(ffv worse-text better-text)
(ffv better-text worse-text))
(d efine (find-fix text)
(fix text (find text ’()))
(d efine (fix text intervals)
(ch oose
(let ((next-int (choose intervals))
(next-text (get-relevant-text text next-int))
(better-text (c-imp next-text)))
(fix (replace-text text next-int better-text)
(d efine (find text intervals)
(ch oose
(find text (merge (c-find text) intervals))
Figure 6: A Poaps program for the find-fix-verify workflow.

For even more examples of programs we can write, please refer to the Appendix.

11 Conclusion

We have presented Poaps, a system that provides a language for writing decision processes that provides an abstraction over POMDPs. Knowledge of POMDPs is not a prerequisite for being able to use decision-theory in everyday applications. In particular, the states and dynamics of POMDPs are hidden from users. We have shown how crowdsourcing experts can use Poaps to build and optimally control many of their workflows. We have also implemented Poaps and conducted a proof-of-concept experiment that shows that Poaps can run the voting program of Figure 1 and achieve results comparable to an expert-written problem-specific POMDP. The complete Poaps system that we have built will be available at the authors’ websites.

12 Future Work

Our work on Poaps is just the beginning. We imagine many future directions:

1) A key question is whether or not Poaps is easy to use by people who are not planning experts. A comprehensive answer to this question requires a user study of our complete system.

2) Poaps does not allow access to underlying POMDP details through its language. But ideally, we would like to allow users to modify whatever aspects they understand (e.g., a subset of state variables or costs). We imagine an extension of Poaps that allows users to work with POMDP/primitive components in their programs.

3) Poaps allows users to specify hard constraints on policies. By writing an adaptive program, they are exactly specifying the policies that may be chosen by a planner. In other words, they not only specify a POMDP, but they also specify a partial policy on that POMDP, by limiting the actions that can be taken in a given state. For example, in the voting program (Figure 1), if an agent decides to stop asking the crowd for more votes, it can only return the answer that received more votes. It is not allowed to return the answer that received fewer votes. We imagine a non-trivial extension of Poaps that allows users to specify soft constraints, in the same way that UCT allows users to specify a rollout policy to guide search. In this framework, Poaps would be able to deviate from the program. For example, instead of always returning the answer with more votes, it might return the answer with fewer votes, because maybe the user did not foresee that sometimes the answer with fewer votes is more likely to be correct.

4) Poaps assumes that all the primitives that use the same variable specify the same state space for that variable. Such a restriction makes life more difficult for non-experts. In particular, this assumption may lead to unforeseen crashing or unexpected behavior. However, we envision at least two methods for ameliorating these scenarios. The first is to type our language, making it impossible to write programs that would crash the compiler or planner. The second is to use polymorphic typing or subtyping so that primitives are more flexible.

5) Solving large POMDPs is a hard problem, and Poaps creates large POMDPs. The scalability of our system is a weakness that we hope to address. One way to reduce the size of the POMDPs that Poaps creates is to use state abstraction. If we can analyze the programs to determine the states that are irrelevant for making decisions, we can eliminate them and significantly increase the size of the programs that we can write.

6) While a procedural Lisp-like language is easy for us to compile and interpret, we believe that most users prefer a more imperative C-like language. Converting the Poaps language to one with a more familiar syntax should increase usability and adoption.

7) Finally, Poaps assumes that experts either know, or can write down the models for their primitives. However, we can easily extend the language to allow experts to direct Poaps to learn the models using reinforcement learning, thereby reducing the amount of work that experts need to put into the system.

We have shown how to write a program that polls workers in order to find the best answer to some question. However, requesters can do better by asking the question in multiple ways [12]. Figure 7 shows how to write a voting program if you have two methods for asking the same question.

While the primary goal of Poaps is to enable non-experts to write POMDPs, experts can also use Poaps to quickly build large and complex POMDPs. Figure 8 shows how one can use Poaps to write a goal-based rocksample. We note that the program we write constrains the possible policies to ones that are more likely to be optimal (though it may not include the most optimal policy).

(d efine (m-vote q0 q1 a0 a1 c0 c1)
(c hoose
(if (crowd-vote q0 a0 a1))
(m-vote q0 q1 a0 a1 (+ c0 1) c1)
(m-vote q0 q1 a0 a1 c0 (+ c1 1)))
(if (crowd-vote q1 a0 a1))
(m-vote q0 q1 a0 a1 (+ c0 1) c1)
(m-vote q0 q1 a0 a1 c0 (+ c1 1)))
(if (> c0 c1) #t #f)))
Figure 7: A Poaps program for multiple workflows. q0,q1 are the two ways of asking the same question, a0,a1 are the two possible answers, and c0,c1 count the number of votes for each choice.
(d efine (move start end)
(if  (= start end)
(choose (move (move-north start) end)
(move (move-south start) end)
(move (move-east start) end)
(move (move-west start) end))))
(d efine (r-s pos rocks exit-pos)
(c hoose
(move pos exit-pos)
(let ((good-rock (find-good-rock rocks)))
(r-s (sample (move pos good-rock))
(remove good-rock rocks)
Figure 8: A Poaps program for rocksample. pos is the initial position, rocks is a list of rocks, and exit-pos is the exit position.

move-* and sample are the primitives that need to expertly-defined. These definitions boostrap the creation of the POMDP. For example, The Poaps value of each rock can be defined by the primitives as a pair, where the first element is a binary indicator of whether or not the rock is good, and the second element is the position of the rock. The Normal value of each rock can also be a pair, where the first element is a rock id and the second element is its position. Then, the behavior of the Poaps values of the return values of these primitives (and thereby all subexpressions that use the return values), are given by the expert-defined transition probabilities. find-good-rock is a user-defined function (not shown) that can be viewed as a generalization of the voting program.

r-s is the rocksample program. It contains two choice points. The first choice is to simply move to the exit. The second choice is to first find a good rock, move to where the good rock is, sample it (a primitive), remove the rock from our list of rocks, and recurse.

To define the policy, we can write a goal elicitation module that asks the user how many rocks should be sampled before quitting. For example, if the user specifies that all rocks be sampled, then the agent should find the expected minimum cost policy to sample all the rocks, where the costs are given by the primitives.


  1. http://pomdp.org
  2. We note here that for planning purposes, the if construct in our language uses the Poaps value of its test expression to determine which branch to take instead of the Normal value. The next section will show how during execution we insert observations to tell our agent what branch was actually taken.
  3. We assume that all the primitives that use a variable have the same state space . We can relax this assumption by using typing techniques.
  4. Since Poaps is a general representation language, the existence of a rollout policy cannot be assumed.


  1. D. Andre and S. J. Russell. Programmable reinforcement learning agents. In NIPS, 2001.
  2. D. Andre and S. J. Russell. State abstraction for programmable reinforcement learning agents. In AAAI, 2002.
  3. A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995.
  4. M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman, D. R. Karger, D. Crowell, and K. Panovich. Soylent: A word processor with a crowd inside. In UIST, 2010.
  5. C. Boutilier, R. Reiter, M. Soutchanski, and S. Thrun. Decision-theoretic, high-level agent programming in the situation calculus. In AAAI, 2000.
  6. U. Chajewska, D. Koller, and R. Parr. Making rational decisions using adaptive utility elicitation. In AAAI, 2000.
  7. P. Dai, C. H. Lin, Mausam, and D. S. Weld. Pomdp-based control of workflows for crowdsourcing. Artificial Intelligence, 202:52–85, 2013.
  8. P. Dai, Mausam, and D. S. Weld. Decision-theoretic control of crowd-sourced workflows. In AAAI, 2010.
  9. T. G. Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
  10. R. Givan, T. Dean, and M. Greig. Equivalence notions and model minimization in markov decision processes. Artificial Intelligence, 147:163–223, 2003.
  11. E. Kamar, S. Hacker, and E. Horvitz. Combining human and machine intelligence in large-scale crowdsourcing. In AAMAS, 2012.
  12. C. H. Lin, Mausam, and D. S. Weld. Dynamically switching between synergistic workflows for crowdsourcing. In AAAI, 2012.
  13. G. Little, L. B. Chilton, M. Goldman, and R. C. Miller. Turkit: tools for iterative tasks on mechanical turk. In KDD-HCOMP, pages 29–30, 2009.
  14. B. Marthi, S. Russell, D. Latham, and C. Guestrin. Concurrent hierarchical reinforcement learning. In IJCAI, 2005.
  15. D. McAllester. Bellman equations for stochastic programs, 1999.
  16. D. Milne and I. H. Witten. Learning to link with wikipedia. In Proceedings of the ACM Conference on Information and Knowledge Management, 2008.
  17. R. Parr and S. Russell. Reinforcement learning with hierarachies of machines. In NIPS, 1998.
  18. J. Pinto, A. Fern, T. Bauer, and M. Erwig. Robust learning for adaptive programs by leveraging program structure. In ICMLA, 2010.
  19. L. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to wikipedia. In Proceedings of the Annual Meeting of the Association of Computational Linguistics, 2011.
  20. S. Sanner. Relational dynamic influence diagram language (rddl): Language description. Technical report, NICTA and the Australian National University, 2011.
  21. D. Silver and J. Veness. Monte-carlo planning in large pomdps. In NIPS, 2010.
  22. C. Simpkins, S. Bhat, C. I. Jr., and M. Mateas. Towards adaptive programming: Integrating reinforcement learning into a programming language. In OOPSLA, 2008.
  23. D. S. Weld, Mausam, and P. Dai. Human intelligence needs artificial intelligence. In HCOMP, 2011.
  24. H. L. S. Younes and M. L. Littman. Ppddl1.0: The language for the probabilistic part of ipc-4. In IPC, 2004.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description