# Advantage Amplification in Slowly Evolving Latent-State Environments

###### Abstract

Latent-state environments with long horizons, such as those faced by recommender systems, pose significant challenges for reinforcement learning (RL). In this work, we identify and analyze several key hurdles for RL in such environments, including belief state error and small action advantage. We develop a general principle called advantage amplification that can overcome these hurdles through the use of temporal abstraction. We propose several aggregation methods and prove they induce amplification in certain settings. We also bound the loss in optimality incurred by our methods in environments where latent state evolves slowly and demonstrate their performance empirically in a stylized user-modeling task.

Martin Mladenov , Ofer Meshi , Jayden Ooi , Dale Schuurmans , Craig Boutilier

Google AI, Mountain View, CA, USA

Department of Computer Science, University of Alberta, Edmonton, AB, Canada

{mmldanenov,meshi,jayden,schuurmans,cboutilier}@google.com

## 1 Introduction

Long-term value (LTV) estimation and optimization is of increasing importance in the design of recommender systems (RSs) and other user-facing systems. Often the problem is framed as a Markov decision process (MDP) and solved using MDP algorithms or reinforcement learning (RL) [?; ?; ?; ?; ?; ?]. Typically, actions are the set of recommendable items (item slates are often recommended, but we ignore this here); states reflect information about the user (e.g., static attributes, past interactions, context/query); and rewards measure some form of user engagement (e.g., clicks, views, time spent, purchase). Such event-level models have seen some success, but the current state of the art is limited to very short horizons.

When dealing with long-term user behavior, it is vital to consider the impact of recommendations on user latent state (e.g., satisfaction, latent interests, or item awareness) which often governs both immediate and long-term behavior. Indeed, the main promise of using RL/MDP models for RSs is to (a) identify latent state (e.g., uncover topic interests via exploration) and (b) influence the latent state (e.g., create new interests or improve awareness and satisfaction). That said, evidence is emerging that at least some aspects of user latent state evolve very slowly. For example, Hohnhold et al. [?] show that varying ad quality and ad load induces slow, but inexorable (positive or negative) changes in user click propensity over a period of months, while Wilhelm et al. [?] show that explicitly diversifying recommendations in YouTube induces similarly slow, persistent changes in user engagement (see such slow “user learning” curves in Fig. 1).

Event-level RL in such settings is challenging for several reasons. First, the effective horizon over which an RS policy influences the latent state can extend up to state transitions. Indeed, the cumulative effect of recommendations is vital for LTV optimization, but the long-term impact of any single recommendation is often dwarfed by immediate reward differences. Second, the MDP is partially observable, requiring some form of belief state estimation. Third, the impact of latent state on immediate observable behavior is often small and very noisy—the problems have a low signal-to-noise ratio (SNR). We detail below how these factors interact.

Given the importance of LTV optimization in RSs, we propose a new technique called advantage amplification to overcome these challenges. Intuitively, amplification seeks to overcome the error induced by state estimation by introducing (explicit or implicit) temporal abstraction across policy space. We require that policies take a series of actions, thus allowing more accurate value estimation by mitigating the cumulative effects of state-estimation error. We first consider temporal aggregation, where an action is held fixed for a short horizon. We show that this can lead to significant amplification of the advantage differences between abstract actions (relative to event-level actions). This is a form of MDP/RL temporal abstraction as used in hierarchical RL [?; ?] and can be viewed as options or macros designed for the purpose of allowing distinction of good and bad behaviors in latent-state domains with low SNR (rather than, say, for subgoal achievement). We generalize this by analyzing policies with (artificial) action switching costs, which induces similar amplification with more flexibility.

Limiting policies to temporally abstract actions induces potential sub-optimality [?; ?]. However, since the underlying latent state often evolves slowly w.r.t. the event horizon in RS settings, we identify a “smoothness” property that is used to bound the induced error of advantage amplification.

Our contributions are as follows. We introduce a stylized model of slow user learning in RSs (Sec. 2) and formalize this as a POMDP (Sec. 3), defining several novel concepts, and show how low SNR interacts poorly with belief-state approximation (Sec. 4.1). We develop advantage amplification as a principle and prove that action aggregation (Sec. 4.2) and switching cost regularization (Sec. 4.3) provide strong amplification guarantees with minimal policy loss under suitable conditions. Experiments with stylized models show the effectiveness of these techniques. (Proofs, auxiliary lemmas and additional experiments are available in an extended version of the paper.)

## 2 User Satisfaction: An Illustrative Example

Before formalizing our problem, we describe a stylized model reflecting the dynamics of user satisfaction as a user interacts with an RS. The model is intentionally stylized to help illustrate key concepts underlying the formal model and analysis developed in the sequel (hence ignores much of the true complexity of user satisfaction). Though focused on user-RS engagement, the principles apply more broadly to any latent-state system with low SNR and slowly evolving latent state.

Our model captures the relationship between a user and an RS over an extended period (e.g., a content recommender of news, video, or music) through overall user satisfaction, which is not known to the RS. We hypothesize that satisfaction is one (of several) key latent factors that impacts user engagement; and since new treatments often induce slow-moving or delayed effects on user behavior, we assume this latent variable evolves slowly as a function of the quality of the content consumed [?] (and see Fig. 1 (left)). Finally, the model captures the tension between (often low-quality) content that encourages short-term engagement (e.g., manipulative, provocative or distracting content) at the expense of long-term engagement; and high-quality content that promotes long-term usage but can sacrifice near-term engagement.

Our model includes two classes of recommendable items. Some items induce high immediate engagement, but degrade user engagement over the long run. We dub these “Chocolate” (Choc): immediately appealing but not very “nutritious.” Other items, dubbed “Kale,” are less attractive but more “nutritious”: they induce lower immediate engagement but tend to improve long-term engagement. (Our model allows a real-valued continuum of items, e.g., a continuous degree of Choc as in our experiments, like measures of ad quality; we use the binary form to streamline our initial exposition.) We call this the Choc-Kale model (CK). A stationary, stochastic policy can be represented by a single scalar representing the probability of taking action Choc. We sometimes refer to Choc as a “negative” and Kale as a “positive” recommendation.

We use a single latent variable to capture a user’s overall satisfaction with the RS. Satisfaction is driven by net positive exposure , which measures total positive-less-negative recommendations, with a discount applied to ensure that is bounded: . We view as a user’s learned perception of the RS and as how this influences gradual changes in engagement.

A user response to a recommendation is her degree of engagement , and depends stochastically on both the quality of the recommendation, and her latent state . is a random function, e.g., responses might be normally distributed: . We use also to denote expected engagement. We require that Choc results in greater immediate (expected) engagement than Kale, , for any fixed .

The dynamics of is straightforward. A Kale exposure increases by 1 and Choc decreases it by 1 (with discounting): with Kale (and with Choc). Satisfaction is a user-learned function of and follows a sigmoidal learning curve: , where is a temperature/learning rate parameter. Other learning curves are possible, but the sigmoidal model captures both positive and negative exponential learning as hypothesized in psychological-learning literature [?; ?] and as observed in the empirical curves in Fig. 1. (Such learning curves are often reflective of aggregate behavior, obscuring individual differences that are much less “smooth.” However, unless cues are available that allow us to model such individual differences, the aggregate model serves a valuable role even when optimizing for individual users.)
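
The dynamics described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the discount `beta`, the temperature `tau`, the engagement scales (1.0 for Choc, 0.8 for Kale) and the multiplicative form of mean engagement are all hypothetical choices, not the parameterization used in our experiments.

```python
import random
from typing import Optional

KALE, CHOC = +1, -1  # sign of each recommendation's effect on exposure


def step_exposure(exposure: float, action: int, beta: float = 0.9) -> float:
    """Discounted net positive exposure: Kale adds 1, Choc subtracts 1.

    With 0 <= beta < 1 the exposure stays within +/- 1 / (1 - beta).
    """
    return beta * exposure + action


def satisfaction(exposure: float, tau: float = 0.25) -> float:
    """Sigmoidal learning curve mapping exposure to a value in (0, 1)."""
    import math
    return 1.0 / (1.0 + math.exp(-tau * exposure))


def engagement(sat: float, action: int, noise_std: float = 0.1,
               rng: Optional[random.Random] = None) -> float:
    """Normally distributed engagement (assumed mean: scale * satisfaction).

    The scales are illustrative; they only encode that Choc yields higher
    expected immediate engagement than Kale at any fixed satisfaction.
    """
    rng = rng or random.Random()
    scale = 1.0 if action == CHOC else 0.8
    return rng.gauss(scale * sat, noise_std)
```

Repeatedly calling `step_exposure` with `KALE` drives the exposure toward its upper bound, so satisfaction rises slowly along the sigmoid, while each individual step changes expected engagement only slightly.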

We compute the Q-values of Choc and Kale for each satisfaction level and plot them in Fig. 1(a). We observe that when satisfaction is low, Kale is a better recommendation, and above some level Choc becomes preferable, as expected. We also see that for any the difference in Q-values is rather small. With additional noise, the Q-values become practically indistinguishable for a large range of satisfaction levels (Fig. 1(b)), which illustrates the hardness of RL in this setting.
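
Q-values of this kind can be obtained by value iteration on a discretized, fully observed version of the model. The sketch below is a hedged illustration, not the code used to produce the figures: the grid size, the parameters `beta`, `gamma` and `tau`, the engagement scales, and the nearest-neighbor discretization are all assumptions made for the example.

```python
import math


def ck_q_values(beta=0.9, gamma=0.95, tau=0.5, n_grid=201, sweeps=500):
    """Value iteration for a discretized, fully observed Choc-Kale model."""
    p_max = 1.0 / (1.0 - beta)  # bound on |net positive exposure|
    grid = [-p_max + 2 * p_max * i / (n_grid - 1) for i in range(n_grid)]
    deltas = {"kale": +1, "choc": -1}
    scale = {"kale": 0.8, "choc": 1.0}  # illustrative engagement scales

    def nearest(p):
        """Index of the grid point closest to exposure p."""
        i = round((p + p_max) / (2 * p_max) * (n_grid - 1))
        return min(max(i, 0), n_grid - 1)

    def reward(p, name):
        """Expected engagement: assumed to be scale * sigmoid(tau * p)."""
        return scale[name] / (1.0 + math.exp(-tau * p))

    values = [0.0] * n_grid
    q = {name: list(values) for name in deltas}
    for _ in range(sweeps):
        q = {name: [reward(p, name) + gamma * values[nearest(beta * p + d)]
                    for p in grid]
             for name, d in deltas.items()}
        values = [max(q["kale"][i], q["choc"][i]) for i in range(n_grid)]
    return grid, q
```

Plotting `q["kale"]` and `q["choc"]` against the satisfaction level of each grid point gives curves analogous to those discussed above; the size of the gap between them depends on the assumed parameters.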

## 3 Problem Statement

We outline a basic latent-state control problem as a partially observable MDP (POMDP) that encompasses the notions above. We highlight several properties that play a key role in the analysis of latent-state RL we develop in the next section.

We consider environments that can be modeled as a POMDP [?]. States reflect user latent state and other observable aspects of the domain: in the CK model, this is simply a user’s current satisfaction . Actions are recommendable items: in CK, we distinguish only Choc from Kale. The transition kernel in the CK model is if , where is 1 (resp., -1) for action Kale (resp., Choc). (This is easily randomized if desired.) Observations reflect observable user behavior and the probability of when is taken at state . In CK, is the observed engagement with a recommendation while reflects the random realization of . The immediate reward is (expected) user engagement (we let ), the initial state distribution, and the discount factor.

In this POMDP, an RS does not have access to the true state , but must generate policies that depend only on the sequence of past action-observation pairs—let be the set of all finite such sequences . Any such history can be summarized, via optimal Bayes filtering, as a distribution or belief state . More generally, this “belief state” can be any summarization of used to make decisions. It may be, say, a collection of sufficient statistics, or a deep recurrent embedding of history. Let denote the set of (realizable) belief states. We also require a mapping that describes the update of any given . The pair defines our representation.

A (stochastic) policy is a mapping that selects an action distribution for execution given belief ; we write to indicate the probability of action . Deterministic policies are defined in the usual way. The value of a policy is given by the standard recurrence below. (Here and are given by expectations of and , respectively, w.r.t. if . The interpretation for other representations is discussed below.)

(1) |

We define by fixing in Eq. 1 (rather than taking an expectation). An optimal policy over has value (resp., Q) function (resp., ). Optimal policies and values can be computed using dynamic programming or learned using (partially observable) RL methods. When we learn a Q-function , whether exactly or approximately, the policy induced by is the greedy policy and its induced value function is . The advantage function reflects the difference in the expected value of doing at (and then acting optimally) vs. acting optimally at [?]. If is the second-best action at , the advantage of that belief state is .
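
As a concrete reading of the last definition, the advantage of a belief state is the gap between its best and second-best Q-values. A minimal sketch, assuming the Q-values at that belief state are available as a plain sequence (the function name is hypothetical):

```python
from typing import Sequence


def advantage_gap(q_values: Sequence[float]) -> float:
    """Gap between the best and second-best Q-values at one belief state."""
    if len(q_values) < 2:
        raise ValueError("need at least two actions")
    best, second = sorted(q_values, reverse=True)[:2]
    return best - second
```

A small gap means that modest errors in the Q-estimates are enough to change which action looks best.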

Eq. 1 assumes optimal Bayesian filtering, i.e., the representation must be such that the (implicit) expectations over and are exact for any history that maps to . Unfortunately, exact recursive state estimation is intractable, except for special cases (e.g., linear-Gaussian control). As a consequence, approximation schemes are used in practice (e.g., variational projections [?]; fixed-length histories, incl. treating observations as state [?]; learned PSRs [?]; recursive policy/Q-function representations [?]). Approximate histories render the process non-Markovian; as such, a counterfactually estimated Q-value of a policy (e.g., using offline data) differs from its true value due to modified latent-state dynamics (not reflected in the data). In this case, any RL method that treats as (Markovian) state induces a suboptimal policy. We can bound the induced suboptimality using -sufficient statistics [?]. A function is an -sufficient statistic if, for all ,

where is the total variation distance. If is -sufficient, then any MDP/RL algorithm that constructs an “optimal” value function over incurs a bounded loss w.r.t. [?]:

(2) |

The errors in Q-value estimation induced by limitations of are irresolvable (i.e., they are a form of model bias), in contrast to error induced by limited data. Moreover, any RL method relying only on offline data is subject to the above bound, regardless of whether the Q-values are estimated directly or not. The impact of this error on model performance can be related to certain properties of the underlying domain as we outline below. A useful quantity for this purpose is the signal-to-noise ratio (SNR) of a POMDP, defined as:

(the denominator is treated as if no meets the condition).

As discussed above, many aspects of user latent state, such as satisfaction, evolve slowly. We say a POMDP is -smooth if, for all , and s.t. , we have

Smoothness ensures that for any state reachable under an action , the optimal Q-value of does not change much.

## 4 Advantage Amplification

We now detail how low SNR causes difficulty for RL in POMDPs, especially with long horizons (Sec. 4.1). We introduce the principle of advantage amplification to address it (Sec. 4.2) and describe two realizations, temporal aggregation (Sec. 4.2) and switching cost (Sec. 4.3).

### 4.1 The Impact of Low SNR on RL

The bound Eq. (2) can help assess the impact of low SNR on RL. Assume that policies, values or Q-functions are learned using an approximate belief representation that is -sufficient. We first show that the error induced by is tightly coupled to optimal action advantages in the domain.

Consider an RL agent that learns Q-values using a behavior (data-generating) policy . The non-Markovian nature of means that: (a) the resulting estimated-optimal policy will have estimated values that differ from its true values ; and (b) the estimates (hence, the choice of itself) will depend on . We bound the loss of w.r.t. the optimal (with exact filtering) as follows. First, for any (belief) state-action pair , suppose the maximum difference between its inferred and optimal Q-values is bounded for any : . By Eq. (2) we set

(3) |

If has small advantage , under behavior policy , the estimate (the second-best action) can exceed that of ; hence executes . If visits (or states with similarly small advantages) at a constant rate, the loss w.r.t. compounds, inducing error.
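
The flipping argument above can be checked numerically. The sketch below is an illustration only: it assumes each Q-value is perturbed independently by uniform noise in `[-eps, eps]`, a hypothetical stand-in for the per-entry estimation error, and measures how often the greedy action changes.

```python
import random
from typing import Sequence


def greedy(q_values: Sequence[float]) -> int:
    """Index of the highest Q-value (first index on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def flip_rate(q_values: Sequence[float], eps: float,
              trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of trials in which bounded noise changes the greedy action."""
    rng = random.Random(seed)
    true_best = greedy(q_values)
    flips = 0
    for _ in range(trials):
        noisy = [q + rng.uniform(-eps, eps) for q in q_values]
        flips += greedy(noisy) != true_best
    return flips / trials
```

Under this noise model, the greedy action can never flip when the advantage exceeds `2 * eps`, whereas an advantage far below `eps` leaves the choice close to a coin toss.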

The tightness of the second part of the argument depends on the structure of the advantage function . To illustrate, consider two extreme regimes. First, if at all , i.e., if SNR , state estimation error has no impact on the recovered policy and incurs no loss. In the second regime, if all are less than (but on the order of) , i.e., if , then the inequality is tight provided saturates the state-action error bound. We will see below that low-SNR environments with long horizons (e.g., practical RSs, the stylized CK model) often have such small (but non-trivial) advantages across a wide range of state space.

The latter situation is illustrated in Fig. 2. In Fig. 1(a), the Q-values of the CK model are plotted against the level of satisfaction (as if satisfaction were fully observable). The small advantages are notable. Fig. 1(b) shows the Q-value estimates for independent tabular Q-learning reruns (the thin lines show the individual runs, the thick lines show the average) where noise is added to . The corrupted Q-values at all but the highest satisfaction levels are essentially indistinguishable, leading to extremely poor policies.

### 4.2 Temporal Abstraction: Action Aggregation

There is a third regime in which state error is relatively benign. Suppose the advantage at each state is either small, , or large, for some constants . The induced policy incurs a loss of at small-advantage states, and no loss on states with large advantages. This leads to a compounded loss of at most , which may be much smaller than the error (Eq. 3) depending on .

If the environment is smooth, action aggregation can be used to restructure a problem falling in the second regime to one in this third regime, with depending on the level of smoothness. This can significantly reduce the impact of estimation error on policy quality by turning the problem into one that is essentially Markovian. More specifically, if at state , we know that the optimal (stationary) policy takes action for the next decision periods, we consider a reparameterization of the belief-state MDP where, at , all actions are executed times in a row, no matter what the subsequent states are. In this new problem, the Q-value of the optimal repeated action is the same as that of its event-level counterpart , since the same sequence of expected rewards will be generated. Conversely, all suboptimal actions incur a cumulative reduction in Q-value in since their suboptimality compounds over periods. Thus, in , the optimal policy generates the same cumulative discounted return as the event-level optimal policy, while the advantage of over any other repeated action at is larger than that of over in the event-level problem.
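
The reparameterization described above amounts to wrapping the event-level step function so that one abstract action repeats the same event-level action a fixed number of times. A minimal sketch, assuming a hypothetical `step(action)` callable that returns an `(observation, reward)` pair; the names `aggregated_step`, `k` and `gamma` are illustrative:

```python
from typing import Callable, Tuple

Step = Callable[[int], Tuple[object, float]]


def aggregated_step(step: Step, action: int, k: int, gamma: float):
    """Repeat `action` for k event-level steps.

    Returns the last observation, the discounted reward accumulated over the
    k steps, and gamma**k, the discount to apply to the next abstract step.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    total, discount, obs = 0.0, 1.0, None
    for _ in range(k):
        obs, reward = step(action)
        total += discount * reward
        discount *= gamma
    return obs, total, discount
```

A tabular Q-learning update on the aggregated problem would then use the target `total + discount * max(Q[next_state])` in place of the usual one-step target.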

To derive bounds, note that, for an -smooth POMDP, at any state where the advantage is at least , the optimal action persists for the next periods (its Q-value can decrease by at most while that of the second-best can at most increase by ). If we apply aggregation only at such states, the advantage increases to some value , putting us in regime 3 (i.e., the advantage is either less than or more than ). Of course, we cannot “cherry-pick” only states with high advantage for aggregation; but aggregating over all states induces some loss due to the inability to switch actions quickly. We bound that cost in computing and . This allows us to first lower bound the regret of the best -aggregate policy:

###### Theorem 1.

Let be a fixed horizon, and let (the event-level optimal Q-function) be -smooth. Then for all , , where is the value of state under an optimal -aggregate policy. (The reparameterized problem is also an MDP, so the optimal value function and deterministic policy are well-defined.)

This theorem is proved by constructing a policy which switches actions every events and showing that it has bounded regret. This policy, at the start of any -event period, adopts the optimal action from the unaggregated MDP at the initiating state. Due to smoothness, Q-values cannot drift by more than during the period, after which the policy corrects itself. This, together with the reasoning above, offers an amplification guarantee:

###### Theorem 2.

In an -smooth MDP, let be a fixed repetition horizon. For any belief state where , the -aggregate-horizon advantage is bounded below:

This result is especially useful when the event-level advantage is more than . In this case, an aggregation horizon of can mitigate the adverse effects of approximating belief state with an -sufficient statistic for for an up to:

at the cost of the aggregation loss of .

Figs. 1(c) and 1(d) illustrate the benefit of action aggregation: they show the Q-values of the -aggregated CK model with with both perfect and imperfect state estimation, respectively (the amount of noise is the same as in Fig. 1(b)). As we show in Sec. 4.4, the recovered policies incur very little loss due to state estimation error.

We conclude with the following observation.

###### Corollary 1.

Optimal repeating policies are near-optimal for the event-level problem as and amplification at every state is guaranteed.

### 4.3 Temporal Regularization: Switching Cost

As discussed above, temporal aggregation is guaranteed to improve learning in slow environments. It has, however, certain practical drawbacks due to its inflexibility. One such drawback is that, in the non-Markovian setting induced by belief-state approximation, training data should ideally be collected using a -aggregated behavior policy. (This is unnecessary if the system is Markovian, since tuples may be reordered to emulate any behavioral policy.) Another drawback arises if the -smoothness assumption is partially violated. For example, if certain rare events cause large changes in state or reward for short periods, the changes in Q-values may be abrupt, but are harmless from an SNR perspective if they induce large advantage gaps. An agent “committed” to a constant action during an aggregation period is unable to react to such events. We thus propose a more flexible advantage amplification mechanism, namely, a switching-cost regularizer. Intuitively, instead of fixing an aggregation horizon, we impose a fictitious cost (or penalty) on the agent whenever it changes its action.

More formally, the goal in the switching-cost (belief-state) MDP is to find an optimal policy defined as:

(4) |

This problem is Markovian in the extended state space representing the current (belief) state and the previously executed action. This state space allows the switching penalty to be incorporated into the reward function as .
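
The extended state and penalized reward can be sketched as follows. The function names and the convention that the very first action incurs no penalty are assumptions made for illustration.

```python
from typing import Optional, Tuple


def switching_cost_reward(reward: float, action: int,
                          prev_action: Optional[int], cost: float) -> float:
    """Subtract `cost` whenever the action differs from the previous one.

    Assumption: the first action (prev_action is None) is never penalized.
    """
    switched = prev_action is not None and action != prev_action
    return reward - cost if switched else reward


def extend_state(belief, prev_action: Optional[int]) -> Tuple[object, Optional[int]]:
    """Extended state: the current belief paired with the last executed action."""
    return (belief, prev_action)
```

Any standard RL method can then be run on the extended state with the penalized reward; setting `cost` to zero recovers the event-level problem.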

The switching cost induces an implicit adaptive action aggregation—after executing action , the agent will keep repeating until the cumulative advantage of switching to a different action exceeds the switching cost . We can use this insight to bound the maximum regret of such a policy (relative to the optimal event-level policy) and also provide an amplification guarantee, as in the case with action aggregation.

In the case of problems with actions, we can analyze the action of the switching cost regularizer in a relatively intuitive way. As with Thm. 1, we bound the regret induced by the switching cost by constructing a policy that behaves as if it had to pay with every action switch. In particular, the optimal policy under this penalty adopts the action of the event-level optimal policy at some state , then holds it fixed until its expected regret for not switching to a different action dictated by the event-level optimal policy exceeds . Suppose the time at which this switch occurs is (). The regret of this agent is no more than the regret of an agent with the option of paying upfront in order to follow the event-level optimal policy for steps. We can show that the same regret bound holds if the agent were paying to switch to the best fixed action for steps instead of following the event-level optimal policy. This allows derivation of the following bound:

###### Theorem 3.

The regret of the optimal switching cost policy for a -action MDP is less than , where

and where is the Lambert W-function [?].

This leads to an amplification result, analogous to Thm. 2:

###### Theorem 4.

Let be as in Thm. 3. Any state whose advantage in the event-level optimal policy is at least has an advantage of at least in the switching-cost regularized optimal policy.

### 4.4 Empirical Illustration

We experiment with synthetic models to demonstrate the theoretical results above. In a first experiment, we apply both action aggregation and switching cost regularization to the simple Choc-Kale POMDP, with parameters , and . To illustrate the effects of faulty state estimation, we corrupt the satisfaction level with noise drawn from a Gaussian (mean , stdev. ), truncated on . As we increase , state estimation becomes worse. To mitigate this effect, we apply aggregations of actions at discounts of and (Fig. 3a,b) and switching costs of (Fig. 3c). For each parameter setting, we train tabular policies with Q-learning, discretizing state space into buckets. For each training run, we roll out event-level transitions, exploring using actions taken uniformly at random (aggregated actions in the aggregation setting), then evaluate the discounted return of each policy using Monte Carlo rollouts of length . Figs. 3a, b and c show the average performance across the training runs (with the 95% confidence interval) as a function of the . We see that action aggregation has a profound effect on solution quality, improving the performance of the policy up to a factor of almost (for ). Switching cost regularization has a more subtle effect, providing more modest improvements in performance. We observe that over-regularized policies actually perform worse than the unregularized policy. We conjecture that this stark difference in performance is due to action aggregation having a double effect on the value estimates: apart from amplification, it also provides a more favorable behavioral policy.

A second experiment takes a more “options-oriented” perspective on the problem. Here, recommendable items have a continuous “kaleness” score between 0 and 1, with item ’s score denoted . At each time step, a set of items is drawn from a -truncated Gaussian with mean equal to the kaleness score of the previously consumed item. The RL agent sets a target kaleness score (its action space). This translates to a specific “presentation” of the items to the user such that the user is nudged to consume an item whose score is closer to the target. Specifically, the user chooses an item using a softmax distribution: , with temperature . The results are shown in Fig. 3d and exhibit a comparable level of improvement as in the binary-action case.
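
One way to realize such a choice model is a softmax over the negative distance between each item's kaleness score and the target. The sketch below is an assumption made for illustration: the absolute-distance logit and the function name are hypothetical and need not match the exact form used in the experiment.

```python
import math
from typing import List, Sequence


def choice_probabilities(scores: Sequence[float], target: float,
                         temperature: float) -> List[float]:
    """Softmax over negative distance to the target kaleness score."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [-abs(s - target) / temperature for s in scores]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

Lower temperatures concentrate the probability mass on the item closest to the target, while higher temperatures make the user's choice nearly uniform over the presented items.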

## 5 Related Work

The study of time series at different scales of granularity has a long-standing history in econometrics, where the main object of interest seems to be the behavior of various autoregressive models under aggregation [?]. However, the behavior of aggregated systems under control does not seem to have been investigated in that field.

In RL, time granularity arises in several contexts. Classical semi-MDP/options theory employs temporal aggregation to organize the policy space into a hierarchy, where a pre-specified sub-policy, or option, is executed for some period of time (termination is generally part of the option specification) [?]. That options might help with partial observability (“state noise”) has been suggested, e.g., by Daniel et al. [?], who also informally suggest that reduced control frequency can improve SNR; however, to the best of our knowledge, the task of characterizing this phenomenon formally has not been addressed. The learning to repeat framework (see [?] and references therein) provides a modeling perspective that allows an agent to choose an action-repetition granularity as part of the action space itself, but does not study these models theoretically. SNR has played a role in RL, but in different ways than studied here, e.g., as applied to policy gradient (rather than as a property of the domain) [?].

The effect of the advantage magnitude (also called action gap) on the quality and convergence of reinforcement learning algorithms was first studied by Farahmand [?]. Bellemare et al. [?] observed that the action gap can be manipulated to improve the quality of learned policies by introducing local policy consistency constraints to the Bellman operator. Their considerations are, however, not bound to specific environment properties.

Finally, our framework is closely related to the study of regularization in RL and its benefits when dealing with POMDPs. Typically, an entropy-based penalty (or KL-divergence w.r.t. a behavioral policy) is added to the reward to induce a stochastic policy. This is usually justified in one of several ways: inducing exploration [?]; accelerating optimization by making improvements monotone [?]; and smoothing the Bellman equation and improving sample efficiency [?]. Of special relevance is the work of Thodoroff et al. [?], who, akin to this work, exploit the sequential dependence of Q-values for better Q-value estimation. In all this work, however, regularization is simply a price to pay to achieve a side goal (e.g., better optimization/statistical efficiency). While stochastic policies often perform better than deterministic ones when state estimation is deficient [?], and methods that exploit this have been proposed in restricted settings (e.g., corrupted rewards [?]), the connection to regularization has not been made explicitly to the best of our knowledge.

## 6 Concluding Remarks

We have developed a framework for studying the impact of belief-state approximation in latent-state RL problems, especially suited to slowly evolving, highly noisy (low SNR) domains like recommender systems. We introduced advantage amplification and proposed and analyzed two conceptually simple realizations of it. Empirical study on a stylized domain demonstrated the tradeoffs and gains they might offer.

There are a variety of interesting avenues suggested by this work: (i) the study of soft-policy regularization for amplification (preliminary results are presented in the long version of this paper); (ii) developing techniques for constructing more general “options” (beyond aggregation) for amplification; (iii) developing amplification methods for arbitrary sources of modeling error; (iv) conducting more extensive empirical analysis on real-world domains.

## References

- [Archak et al., 2012] N. Archak, V. Mirrokni, and S. Muthukrishnan. Budget optimization for online campaigns with positive carryover effects. WINE-12, pp.86–99, Liverpool, 2012.
- [Baird III, 1999] L. C. Baird III. Reinforcement Learning Through Gradient Descent. PhD thesis, US Air Force Academy, 1999.
- [Barto and Mahadevan, 2003] A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77, 2003.
- [Boyen and Koller, 1998] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. UAI-98, pp.33–42, 1998.
- [Choi et al., 2018] S. Choi, H. Ha, U. Hwang, C. Kim, J. Ha, and S. Yoon. Reinforcement learning-based recommender system using biclustering technique. arXiv:1801.05532, 2018.
- [Corless et al., 1996] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth. On the Lambert W function. Adv. in Comp. Math., 5(1):329–359, 1996.
- [Daniel et al., 2016] C. Daniel, G. Neumann, O. Kroemer, and J. Peters. Hierarchical relative entropy policy search. JMLR, 17(93):1–50, 2016.
- [Downey et al., 2017] C. Downey, A. Hefny, B. Boots, G. J. Gordon, and B. Li. Predictive state recurrent neural networks. NIPS-17, pp.6053–6064, 2017.
- [Everitt et al., 2017] T. Everitt, V. Krakovna, L. Orseau, and S. Legg. Reinforcement learning with a corrupted reward channel. IJCAI-17, pp.4705–4713, Melbourne, 2017.
- [Fox et al., 2016] R. Fox, A. Pakman, and N. Tishby. Taming the noise in reinforcement learning via soft updates. UAI-16, pp.1889–1897, New York, 2016.
- [Francois-Lavet et al., 2017] V. Francois-Lavet, G. Rabusseau, J. Pineau, D. Ernst, and R. Fonteneau. On overfitting and asymptotic bias in batch reinforcement learning with partial observability, 2017. To appear, JAIR.
- [Hauskrecht et al., 1998] M. Hauskrecht, N. Meuleau, L. P. Kaelbling, T. Dean, and C. Boutilier. Hierarchical solution of Markov decision processes using macro-actions. UAI-98, pp.220–229, Madison, WI, 1998.
- [Hohnhold et al., 2015] H. Hohnhold, D. O’Brien, and D. Tang. Focusing on the long-term: It’s good for users and business. KDD-15, pp.1849–1858, Sydney, 2015.
- [Jaber, 2006] M. Y. Jaber. Learning and forgetting models and their applications. Handbook of Industrial and Systems Engineering, 30(1):30–127, 2006.
- [Littman and Sutton, 2002] M. L. Littman and R. S. Sutton. Predictive representations of state. NIPS-02, pp.1555–1561, Vancouver, 2002.
- [Mladenov et al., 2017] M. Mladenov, C. Boutilier, D. Schuurmans, O. Meshi, G. Elidan, and T. Lu. Logistic Markov decision processes. IJCAI-17, pp.2486–2493, Melbourne, 2017.
- [Nachum et al., 2017] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. Bridging the gap between value and policy based reinforcement learning. NIPS-17, pp.1476–1483, Long Beach, CA, 2017.
- [Parr, 1998] R. Parr. Flexible decomposition algorithms for weakly coupled Markov decision processes. UAI-98, pp.422–430, Madison, WI, 1998.
- [Roberts and Tedrake, 2009] J. W. Roberts and R. Tedrake. Signal-to-noise ratio analysis of policy gradient algorithms. NIPS-09, pp.1361–1368, Vancouver, 2009.
- [Schulman et al., 2015] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. ICML-15, pp.1889–1897, Lille, 2015.
- [Shani et al., 2005] G. Shani, D. Heckerman, and R. I. Brafman. An MDP-based recommender system. JMLR, 6:1265–1295, 2005.
- [Sharma et al., 2017] S. Sharma, A. S. Lakshminarayanan, and B. Ravindran. Learning to repeat: Fine-grained action repetition for deep reinforcement learning. ICLR-17, Toulon, France, 2017.
- [Silvestrini and Veredas, 2008] A. Silvestrini and D. Veredas. Temporal aggregation of univariate and multivariate time series models: A survey. J. Econ. Surveys, 22(3):458–497, 2008.
- [Singh et al., 1994] S. P. Singh, T. Jaakkola, and M. I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. ICML-94, pp.284–292, New Brunswick, NJ, 1994.
- [Smallwood and Sondik, 1973] R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Op. Res., 21:1071–1088, 1973.
- [Sutton et al., 1999] R. S. Sutton, D. Precup, and S. P. Singh. Between MDPs and Semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales. Artif. Intel., 112:181–211, 1999.
- [Taghipour et al., 2007] N. Taghipour, A. Kardan, and S. S. Ghidary. Usage-based web recommendations: A reinforcement learning approach. RecSys-07, pp.113–120, Minneapolis, 2007.
- [Thodoroff et al., 2018] P. Thodoroff, A. Durand, J. Pineau, and D. Precup. Temporal regularization for Markov decision process. NeurIPS-18, pp.1784–1794, Montreal, 2018.
- [Thurstone, 1919] L. L. Thurstone. The learning curve equation. Psychological Monographs, 26(3):i, 1919.
- [Wilhelm et al., 2018] M. Wilhelm, A. Ramanathan, A. Bonomo, S. Jain, E. H. Chi, and J. Gillenwater. Practical diversified recommendations on YouTube with determinantal point processes. CIKM-18, pp.2165–2173, Torino, 2018.
- [Zhao et al., 2017] X. Zhao, L. Zhang, Z. Ding, D. Yin, Y. Zhao, and J. Tang. Deep reinforcement learning for list-wise recommendations. arXiv:1801.00209, 2017.

## Appendix A Technical Results and Additional Material

### A.1 Analysis of Action Aggregation

We start with an observation that frames the subsequent analysis.

###### Lemma 1 (Counterfactual Q-values).

Let and be two deterministic policies and let the -function of , be known. Then, for any state :

where the expectation is over trajectories generated by starting at ().

###### Proof.

Let be a non-stationary policy that starting at executes exactly once, then follows forever. Then,

Linearity of expectation and the boundedness of both functions ensures that recursive application to converges to the desired quantity. ∎
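As a minimal sketch, assume Lemma 1 takes the form of the standard performance-difference identity, $V^{\pi'}(s) - V^{\pi}(s) = \mathbb{E}_{\pi'}\big[\sum_{t\ge 0} \gamma^t \big(Q^{\pi}(s_t, \pi'(s_t)) - Q^{\pi}(s_t, \pi(s_t))\big)\big]$. The check below evaluates both sides on a small random tabular MDP. The helper names (`make_mdp`, `evaluate`, `q_values`, `counterfactual_gap`) and the random MDP itself are illustrative assumptions and do not come from the paper.

```python
import random

def make_mdp(n_states, n_actions, seed=0):
    """Random tabular MDP: P[s][a] is a next-state distribution, R[s][a] a reward."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        rows, rewards = [], []
        for _ in range(n_actions):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
            rewards.append(rng.random())
        P.append(rows)
        R.append(rewards)
    return P, R

def evaluate(P, R, policy, gamma, signal=None, iters=2000):
    """Expected discounted sum of signal[s] along trajectories of `policy`.
    With signal=None the signal is the reward, which yields V^policy."""
    n = len(P)
    if signal is None:
        signal = [R[s][policy[s]] for s in range(n)]
    v = [0.0] * n
    for _ in range(iters):
        v = [signal[s] + gamma * sum(p * v[t] for t, p in enumerate(P[s][policy[s]]))
             for s in range(n)]
    return v

def q_values(P, R, v, gamma):
    """One-step lookahead Q-values with respect to the value function v."""
    n, m = len(P), len(P[0])
    return [[R[s][a] + gamma * sum(p * v[t] for t, p in enumerate(P[s][a]))
             for a in range(m)] for s in range(n)]

def counterfactual_gap(P, R, pi, pi_prime, gamma):
    """Return (lhs, rhs) of the assumed identity, one entry per start state."""
    v_pi = evaluate(P, R, pi, gamma)
    q_pi = q_values(P, R, v_pi, gamma)
    # Advantage of pi_prime's action at each state, measured with Q^pi.
    adv = [q_pi[s][pi_prime[s]] - q_pi[s][pi[s]] for s in range(len(P))]
    lhs = [a - b for a, b in zip(evaluate(P, R, pi_prime, gamma), v_pi)]
    rhs = evaluate(P, R, pi_prime, gamma, signal=adv)
    return lhs, rhs
```

Calling `counterfactual_gap(P, R, pi, pi_prime, 0.9)` on an MDP from `make_mdp` should return two lists that agree up to numerical tolerance, since both sides are fixed points of the same linear recursion under $\pi'$.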

This allows us to measure the behavior of -aggregate policies (in terms of advantages) relative to atomic ones.

###### Lemma 2.

Let be a state for which the optimal action does not change within steps. In other words, for all , , where denotes all states reachable from in steps under some action sequence. Then for any other action ,

where the expectation is over the trajectory of states that follow after when taking action and for notational convenience.

###### Proof.

Let be a policy that executes for steps and then reverts to the optimal policy and be a policy that executes for steps and then reverts to the optimal policy. We apply Lemma 1 noting that coincides with the optimal policy, hence . ∎

Lemma 2 establishes that the advantage of -aggregate actions is the compound discounted advantage of the atomic ones. This, combined with the smoothness of the optimal Q-function, allows us to analyze the effects of action aggregation. In particular, suppose that we have found a state such that . The smoothness of the Q-function allows us to infer that, for any action taken, the advantage at the next state will be at least , after steps and so on (this guarantees that the advantage gap cannot close in fewer than steps). Hence, if we were to replace atomic actions with -repeated actions at states with an advantage of more than , the following can be observed:

###### Lemma 3.

Consider an MDP with an L-smooth optimal Q-function , and the reparametrization obtained by holding actions fixed for steps whenever the advantage gap is greater than . Then, at each state, either:

or .

###### Proof.

As established, . Smoothness of the Q-function guarantees that , resp. as discussed above. Hence

[Displayed equations (5)–(7)]

Here, the first expression comes from the finite sum geometric series formula, and the second term reflects the fact that

∎
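The geometric-series step in the proof of Lemma 3 can be illustrated numerically. The sketch below assumes a per-step worst case in which an initial advantage gap `delta` shrinks by at most `2 * L` per step (the optimal action's Q-value falling by `L` while a competitor's rises by `L`, as in the worst case described in the proof of Theorem 1). Under that assumption, the compounded discounted advantage over `k` held steps is at least $\sum_{i=0}^{k-1} \gamma^i (\Delta - 2iL)$. The function names and the symbols `delta` and `k` are illustrative choices, not notation taken from the paper.

```python
def amplified_lower_bound(delta, L, gamma, k):
    """Direct sum of gamma^i * (delta - 2*i*L) for i = 0..k-1."""
    return sum(gamma**i * (delta - 2 * i * L) for i in range(k))

def amplified_lower_bound_closed(delta, L, gamma, k):
    """Same quantity in closed form (requires gamma != 1)."""
    # Finite geometric series: sum_{i<k} gamma^i
    geometric = (1 - gamma**k) / (1 - gamma)
    # Arithmetico-geometric series: sum_{i<k} i * gamma^i
    arith_geo = gamma * (1 - k * gamma**(k - 1) + (k - 1) * gamma**k) / (1 - gamma)**2
    return delta * geometric - 2 * L * arith_geo
```

Under these assumptions, whenever `delta >= 2 * k * L` every term of the sum is positive, so the `k`-step lower bound is at least the atomic gap `delta`; this is the sense in which holding the action amplifies the advantage.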

We have thus far considered the situation where aggregation is state-dependent, based on knowledge of the advantage amplitude. A more realistic implementation of aggregation would fix some global static a priori. While the reasoning is similar to the above, there is the added complexity that aggregation itself comes at a price: the values of certain states will be reduced (relative to the best atomic policy) due to the inability to switch actions rapidly. This additional cost must then be factored into the bounds.

We first bound the cost of the best slow policy under smoothness assumptions.

###### Theorem 1.

Let be a fixed horizon and , the event-level optimal Q-function, be L-smooth. Then for all , , where is the value of state under an optimal -aggregate policy. (Note that the reparametrized problem, in which a decision can be made every atomic steps, is also an MDP, so the notions of an optimal value function, i.e., one that provides the largest possible value at every state, and of an optimal deterministic policy are well-defined.)

###### Proof.

We lower-bound the return of the optimal -aggregate policy by exhibiting a (not necessarily optimal) -aggregate policy with sufficient return. In particular, let be a deterministic optimal atomic policy, and be the policy that at state executes for the next steps, then executes , etc. Following Lemma 1, we have that

Let be the states at which may switch actions. At any , if , due to smoothness of , we know that remains optimal for the next steps thus and until the next decision point for (i.e. for ). Conversely, if , then in the worst case, assuming that the Q-value of the optimal action decreases by with every step and the Q-value of a suboptimal action increases respectively by , then for any of the next steps, . Hence, for all , is either or less than , leading to

∎
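The cost of aggregation can also be inspected numerically. The sketch below treats the problem in which a decision is made only every `k` atomic steps as an MDP of its own and solves it by value iteration, so that its optimal values can be compared with the atomic optimum (`k = 1`). The random MDP, the helper names, and the use of `k` for the repetition horizon are illustrative assumptions; the code does not compute the bound of Theorem 1, only the quantity that the theorem bounds.

```python
import random

def make_mdp(n_states, n_actions, seed=0):
    """Random tabular MDP: P[s][a] is a next-state distribution, R[s][a] a reward."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        rows, rewards = [], []
        for _ in range(n_actions):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
            rewards.append(rng.random())
        P.append(rows)
        R.append(rewards)
    return P, R

def hold_backup(P, R, v, a, gamma, k):
    """Value of holding action `a` for k atomic steps, then collecting `v`."""
    u = list(v)
    for _ in range(k):
        u = [R[s][a] + gamma * sum(p * u[t] for t, p in enumerate(P[s][a]))
             for s in range(len(P))]
    return u

def aggregate_value_iteration(P, R, gamma, k, iters=500):
    """Optimal values when an action, once chosen, is repeated for k steps."""
    n, m = len(P), len(P[0])
    v = [0.0] * n
    for _ in range(iters):
        # Each macro-action is "repeat atomic action a for k steps".
        per_action = [hold_backup(P, R, v, a, gamma, k) for a in range(m)]
        v = [max(per_action[a][s] for a in range(m)) for s in range(n)]
    return v

def aggregation_loss(P, R, gamma, k):
    """Largest per-state gap between atomic and k-aggregate optimal values."""
    v_atomic = aggregate_value_iteration(P, R, gamma, 1)
    v_slow = aggregate_value_iteration(P, R, gamma, k)
    return max(a - b for a, b in zip(v_atomic, v_slow))
```

Because every `k`-aggregate policy is also a (non-stationary) atomic policy, `aggregation_loss` should be non-negative up to numerical tolerance; how quickly it grows with `k` depends on how fast the underlying Q-values change.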

We now factor the aggregation loss from Thm. 1 into Lemma 3; together, these results characterize the amplification properties of action aggregation.

###### Theorem 2.

In an L-smooth MDP, let be a fixed repetition horizon. For any state where the advantage gap is greater than , the fixed-horizon advantage is lower-bounded as follows:

###### Proof.

Let be the losslessly amplified -function as in Lemma 3. Since , action is optimal under an atomic optimal policy for the next periods and due to Thm. 1,

Moreover, for any

as the expected return following the aggregation period of can only be lower than that of the losslessly amplified problem. Thus:

The bound from the expectation term comes from Lemma 3. ∎

### A.2 Analysis of Switching Cost

Let be an L-smooth optimal Q-function of a POMDP with an optimal deterministic policy , and let be a switching cost. Consider the following policy : at , adopts the optimal atomic action . Also at , calculates the time until its regret for repeating relative to following the optimal policy exceeds , i.e.,

(where the expectation is, as before, over trajectories of belief states generated by executing , conditioned on the realization of ) and then repeats until . At , queries the optimal policy for , executes it until time and so on. In other words, repeats an action adopted from the optimal policy until its regret for not having followed the optimal policy exceeds and then switches.

The return of is equivalent to the return of a policy that would pay upfront to switch to and follow the optimal policy for steps. We first bound the loss of this policy and then argue that it also upper-bounds the loss of the optimal switching cost policy.

###### Lemma 4.

The total regret of relative to is no more than

###### Proof.

Starting from Lemma 1, the total loss of relative to is

We now argue that is bounded at any step . Observe that the action taken by the policy at time depends only on the belief state realization , being the most recent time point at which the policy was allowed to switch actions. Thus, by the law of total probability,

In the above, we have just rewritten the loss relative to the last switching point . We now show that the conditional expectation is bounded for any realization . Recall that at time , picks the action and executes it until time , the time when its expected regret exceeds , and then switches. Thus, the highest achievable per-event expected regret conditioned on occurs when at time , there exists an alternative action having the same Q-value, i.e. , and then, in subsequent time steps, the expected Q-value of , , starts increasing at the maximum rate of while the Q-value of starts decreasing at the same rate. The maximum rate of increase of the expectation is justified since, if the Q-value of is L-smooth over individual sample paths, i.e. , then the sequence of expectations is also L-smooth. By the same reasoning as in Lemma 3 (but this time for the sequence of expectations, rather than the sequence of realizations), the regret for not following after steps is bounded by . Let us calculate the minimum (under assumptions of monotone regret increase at the maximum rate) for which this regret exceeds as , where is the solution of .

where is the Lambert W-function. Knowing , we can deduce that the maximum per-event expected regret, is no more than

under worst-case assumptions. This leads to an overall regret bound of . ∎

So far, we have produced an agent that pays a cost to switch to the optimal policy for some time-specific horizon and bounded its regret. This is not quite what we need, since gets more than one action switch for free within the horizon over which it is allowed to follow the optimal policy. We now argue that for a slow environment, this bound also holds for an agent which pays on every switch. To do so, we need to produce a reasonable policy that pays for every action switch.

Let be the advantage of following the optimal policy for turns over repeating for the same time and then switching to the optimal policy (resp. the regret for repeating over following the optimal policy). That is, following Lemma 1,

where the expectation is over trajectories generated by taking action and .

Furthermore, let s.t. if the former is feasible, else . In words, is the shortest time until the regret of having repeated the current action exceeds the regret of having repeated some other action by . Note that this does not need to be finite, since in some state, repeating any action for any amount of time might yield similar returns.

Based on this, we construct a policy that behaves as if it would pay on every switch. At , executes the atomic optimal action . At the next state, , if is infinite, repeats once. Otherwise, repeats times and then switches to , the action whose regret exceeds after steps. This policy can do no better than a policy that pays upfront to switch to and maintain it for steps. By extension, can generate no more return than the best switching cost policy. Now we can bound its loss.

###### Theorem 3.

The regret of the optimal switching cost policy for a -action MDP is less than , where is the same as in the previous theorem.

###### Proof.

Again we must scrutinize the sequence showing that it is bounded at any step . We consider the contrapositive: suppose there existed , such that

with as in the previous theorem. By the law of total probability,

The expectation over in the second line above implies that for at least one possible realization of , the expected Q difference after periods exceeds , i.e. . Let be such a realization of and let resp. . Due to the -smoothness of the sequences of expected Q-values , , we know that for all ,