Learning Adaptive Display Exposure
for Real-Time Advertising
Abstract.
In E-commerce advertising, where product recommendations and product ads are presented to users simultaneously, the traditional setting is to display ads at fixed positions. However, under such a setting, the advertising system loses the flexibility to control the number and positions of ads, resulting in suboptimal platform revenue and user experience. Consequently, major e-commerce platforms (e.g., Taobao.com) have begun to consider more flexible ways to display ads. In this paper, we investigate the problem of advertising with adaptive exposure: can we dynamically determine the number and positions of ads for each user visit under certain business constraints so that the platform revenue can be increased? More specifically, we consider two types of constraints: a request-level constraint ensures user experience for each user visit, and a platform-level constraint controls the overall platform monetization rate. We model this problem as a Constrained Markov Decision Process with per-state constraint (psCMDP) and propose a constrained two-level reinforcement learning approach that decomposes the original problem into two relatively independent subproblems. To accelerate policy learning, we also devise a constrained hindsight experience replay mechanism. Experimental evaluations on industry-scale real-world datasets demonstrate that our approach obtains higher revenue under the constraints and confirm the effectiveness of the constrained hindsight experience replay mechanism.
Previous title: Learning to Advertise with Adaptive Exposure via Constrained Two-Level Reinforcement Learning.
1. Introduction
With the advances of deep neural networks (LeCun et al., 2015; Goodfellow et al., 2016), Deep Reinforcement Learning (DRL) approaches have made significant progress in a number of applications including Atari games (Mnih et al., 2015) and robot locomotion and manipulation (Schulman et al., 2015b; Levine et al., 2016). Recently, we have also witnessed successful applications of DRL techniques to optimize decision-making in E-commerce from different aspects, including online recommendation (Chen et al., 2018), impression allocation (Cai et al., 2018; Zhao et al., 2018b), advertising bidding strategies (Jin et al., 2018; Wu et al., 2018; Zhao et al., 2018a) and product ranking (Hu et al., 2018).
In traditional online advertising, the ad positions are fixed, and we only need to determine which ads to show in these positions for each user request (Mehta and others, 2013). This can be modeled as an ad position bidding problem, and DRL techniques have been shown to be effective in learning bidding strategies for advertisers (Jin et al., 2018; Wu et al., 2018; Zhao et al., 2018a). However, fixing ad positions limits the flexibility of the advertising system. Intuitively, if a user has high monetization value (e.g., likes to click ads), it is reasonable for the advertising platform to display more ads when this user visits. On the other hand, we are also concerned about displaying too many ads, for two reasons. First, it might lead to a poor user experience and have a negative impact on user retention. Second, the monetization rate is an important business index for a company to moderate. Therefore, in this paper, we consider two levels of constraints: (1) request-level: the number of ads on each request (a.k.a. user visit, i.e., the user's access to the platform, e.g., opening the mobile app or swiping the screen) cannot exceed a threshold; and (2) platform-level: the average number of ads over all the requests (within a time window) cannot exceed a threshold. Under the above constraints, we investigate whether we can dynamically determine the set of ads and their positions for each user visit so that the platform revenue is maximized. We refer to this as the advertising with adaptive exposure problem.
Fig. 1 illustrates the flexible adaptive exposure mechanism adopted by Taobao, one of the largest e-commerce companies in China. For each user visit, the platform presents a dynamic mixture of product recommendations and product ads. The ad positions are not fixed a priori; they are determined by the user's profile and behaviors. The adaptive exposure problem can be formalized as a sequential decision problem. In each step, the recommendation and the advertising systems first select some items based on their scoring systems independently. Then these commodities are sorted altogether by their scores, and the top few items are exposed to the request (user).
We model the above problem as a Constrained Markov Decision Process (CMDP) (Altman, 1999). Although optimal policies for small-sized CMDPs can be derived using linear programming (Altman, 1999), it is difficult to construct such policies for large-scale and complex real-world e-commerce platforms. Thus, we resort to model-free RL approaches to learn approximately optimal solutions (Achiam et al., 2017; Tessler et al., 2018). Existing model-free RL approaches for solving CMDPs are trajectory-based: they update policies by propagating constraint-violation signals over the entire trajectory (Achiam et al., 2017; Prashanth and Ghavamzadeh, 2016). Unfortunately, most of them fail to meet the constraints (Tessler et al., 2018). To address this issue, Tessler et al. (2018) propose Reward Constrained Policy Optimization (RCPO), which decomposes the trajectory constraints into per-state penalties and dynamically adjusts their weights. To ensure that the overall penalty of a trajectory satisfies the given constraint, the constraint-violation signals are also propagated back along the entire trajectory. However, in the advertising with adaptive exposure problem, we need to satisfy both state-level (request-level) and trajectory-level (platform-level) constraints. RCPO only considers trajectory-level constraints and thus cannot be directly applied here.
In this paper, we first model the advertising with adaptive exposure problem as a CMDP with per-state constraint (psCMDP). Then we propose a constrained two-level reinforcement learning framework to learn optimal advertising policies satisfying both state-level and trajectory-level constraints. In our framework, the trajectory-level constraint and the state-level constraint are handled at different levels of the learning process. The higher-level policy breaks a trajectory into multiple sub-trajectories and tackles the problem of selecting constraints for each sub-trajectory to maximize total revenue under the trajectory-level constraint. Each sub-trajectory then defines an independent optimization problem with both a sub-trajectory constraint and the state-level constraint. We simplify the sub-trajectory optimization problem, at the cost of sacrificing some policy optimality, by treating the sub-trajectory constraint as another state-level constraint. In this way, we can easily combine the sub-trajectory constraint with the original state-level constraint and use off-policy methods such as Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) with an auxiliary task (Jaderberg et al., 2016) to train the lower-level policy. We also propose Constrained Hindsight Experience Replay (CHER) to accelerate the lower-level policy training.
Note that our framework can be naturally extended to more levels by further decomposing each sub-trajectory into a number of sub-trajectories. The quality of the learned policy is expected to improve as we increase the number of levels, since the length of each sub-trajectory at the lower levels is reduced. Thus our framework is flexible enough to trade off training efficiency against policy optimality. In this paper, we set our framework to two levels. One additional benefit of the two-level framework is that we can easily reuse the lower-level policies to train the higher-level constraint selection policy in case the trajectory-level constraint is adjusted. We evaluate our approach using real-world datasets from the Taobao platform both offline and online. Our approach can improve the advertising revenue and the advertisers' income while satisfying the constraints at both levels. At the lower level, we verify that the CHER mechanism can significantly improve the training speed and reduce the deviation from the per-state constraint. Moreover, at the higher level, our method can make good use of the lower-level policy set to learn higher-level policies with respect to different platform-level constraints.
2. Preliminary: Constrained Reinforcement Learning
Reinforcement learning (RL) allows agents to interact with the environment by sequentially taking actions and observing rewards to maximize the cumulative reward (Sutton et al., 1998). RL can be modeled as a Markov Decision Process (MDP), which is defined as a tuple $(S, A, R, P, \gamma)$. $S$ is the state space and $A$ is the action space. $R: S \times A \to \mathbb{R}$ is the immediate reward function and $P(s'|s,a)$ is the state transition probability. A policy $\pi: S \to A$ defines the agent's behavior. The agent uses its policy to interact with the environment and generates a trajectory $\tau = (s_0, a_0, r_0, s_1, \dots)$. Its goal is to learn an optimal policy $\pi^*$ which maximizes the expected return given the initial state:
$J^{\pi} = \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{T}\gamma^{t} r_{t} \mid s_{0}\right]$  (1)
Here $\gamma \in [0, 1]$ is the discount factor and $T$ is the length of the trajectory $\tau$. The Constrained Markov Decision Process (CMDP) (Altman, 1999) is generally used to deal with situations in which the feasible policies are restricted. Specifically, a CMDP is augmented with an auxiliary cost function $c: S \times A \to \mathbb{R}$ and an upper-bound constraint $\alpha$. Let $J_{c}^{\pi}$ be the cumulative discounted cost of policy $\pi$, defined as follows:
$J_{c}^{\pi} = \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{T}\gamma^{t} c(s_{t}, a_{t})\right]$  (2)
The set of feasible stationary policies for a CMDP is then:
$\Pi_{c} = \{\pi : J_{c}^{\pi} \le \alpha\}$  (3)
The policy is then optimized by restricting the optimization in Equation (1) to the feasible set $\Pi_{c}$. For DRL methods, Achiam et al. (2017) propose an approach that replaces the optimization objective and constraints with surrogate functions and uses Trust Region Policy Optimization (Schulman et al., 2015a) to learn the policy, achieving near-constraint satisfaction in each iteration. Tessler et al. (2018) use a method similar to WeiMDP (Geibel, 2006). WeiMDP introduces a weight parameter $w$ and a derived weighted reward function $r'$, which is defined as:
$r'(w, s_{t}, a_{t}) = r(s_{t}, a_{t}) - w \cdot c(s_{t}, a_{t})$  (4)
where $r(s_{t}, a_{t})$ and $c(s_{t}, a_{t})$ are the reward and the auxiliary cost under the transition $(s_{t}, a_{t})$, respectively. For a fixed $w$, this new unconstrained MDP can be solved with standard methods, e.g., Q-Learning (Sutton et al., 1998). Tessler et al. (2018) use the weight $w$ as an input to the value function and dynamically adjust it by backpropagation.
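To make the weighted-reward construction concrete, here is a toy tabular sketch. The 2-state MDP, its reward/cost tables, and all hyperparameters are illustrative assumptions, not from the paper: for a fixed $w$, Q-learning on $r' = r - w \cdot c$ flips the greedy action once the penalized cost outweighs the extra reward.

```python
import numpy as np

def q_learning_weighted(r, c, P, w, gamma=0.9, alpha=0.1, episodes=2000, seed=0):
    """Tabular Q-learning on the unconstrained MDP with r' = r - w * c (Eq. (4))."""
    rng = np.random.default_rng(seed)
    n_s, n_a = r.shape
    Q = np.zeros((n_s, n_a))
    for _ in range(episodes):
        s = int(rng.integers(n_s))
        for _ in range(20):
            # epsilon-greedy action selection
            a = int(rng.integers(n_a)) if rng.random() < 0.2 else int(Q[s].argmax())
            s_next = int(rng.choice(n_s, p=P[s, a]))
            target = (r[s, a] - w * c[s, a]) + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

# Action 1 earns more reward but incurs an auxiliary cost; with a large enough
# weight w, the costly action stops being greedy-optimal.
r = np.array([[1.0, 3.0], [1.0, 3.0]])
c = np.array([[0.0, 2.5], [0.0, 2.5]])
P = np.full((2, 2, 2), 0.5)  # uniform transition probabilities
Q_free = q_learning_weighted(r, c, P, w=0.0)
Q_pen = q_learning_weighted(r, c, P, w=2.0)
```

With $w = 0$ the greedy policy picks the high-reward action; with $w = 2$ the weighted reward of that action becomes $3 - 2 \times 2.5 = -2$, so the greedy choice flips.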
Table 1. Notations.

| Notation | Description |
| --- | --- |
| $\{u_1, \dots, u_N\}$ | The sequence of incoming requests; $N$ is the total number of requests visiting the platform within one day (different days have different $N$). |
| $u_i$ | The $i$-th request in the day. |
| $N_{ad}$ | The number of candidate ads. |
| $N_{rec}$ | The number of recommended products. |
| $K$ | The number of commodities shown for each request; usually $K < N_{ad} + N_{rec}$. |
| $x_i$ | The number of ads exposed for $u_i$. |
| $pvr_{day}$ | The total percentage of ads exposed in one day. |
| $pvr_i$ | The percentage of ads exposed for $u_i$; $pvr_i = x_i / K$. |
| $C_{plt}$ | The maximum percentage of the total ads exposed in one day. |
| $C_{req}$ | The maximum percentage of the ads exposed for each request. |
| $\mathcal{A}_i$ | The candidate ad set for $u_i$. |
3. Advertising With Adaptive Exposure
3.1. Adaptive Exposure Mechanism
In an E-commerce platform, user requests arrive in order (Table 1 summarizes the notations; we generally use subscript $i$ to refer to the $i$-th request and subscript $j$ to refer to the $j$-th ad in a request). When a user sends a shopping request $u_i$, $K$ commodities are exposed to the request based on the user's shopping history and personal preferences. The commodities are composed of advertising and recommendation products. Exposing more ads may increase the advertising revenue. However, the exposed ads are not necessarily the user's favorite or needed products. Therefore, we should limit the number of exposed ads for each user request.
For each request $u_i$, traditional E-commerce systems expose ads at a fixed set of positions, where the number of fixed positions is predetermined. However, this advertising mechanism is clearly not optimal. Different consumers have different shopping habits and preferences for different products and ads. Therefore, we can expose more advertising products to those consumers who are more likely to click and purchase them (thus increasing the advertising revenue) and vice versa.
To this end, Taobao (one of the largest Chinese E-commerce platforms) has recently begun to adopt a more flexible, mixed mechanism for exposing advertising and recommended products (Fig. 1). Specifically, for each request $u_i$, the recommendation and the advertising systems first select their top $N_{rec}$ and $N_{ad}$ items based on their scoring systems independently (Fig. 1, Steps (2) and (3)). Then these commodities are sorted altogether by score in descending order (Fig. 1, Step (5)), and the top $K$ items are exposed to this request (Fig. 1, Step (6)).
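The shuffle step described above can be sketched in a few lines; the item tuples and scores below are toy assumptions used only to illustrate the merge-sort-and-truncate behavior.

```python
def expose(rec_items, ad_items, K):
    """rec_items / ad_items: lists of (item_id, score, is_ad). Returns the
    top-K of the merged pool, sorted by score in descending order."""
    pool = sorted(rec_items + ad_items, key=lambda it: it[1], reverse=True)
    return pool[:K]

recs = [("r1", 0.9, False), ("r2", 0.7, False), ("r3", 0.4, False)]
ads = [("a1", 0.8, True), ("a2", 0.5, True)]
shown = expose(recs, ads, K=4)
x_i = sum(1 for it in shown if it[2])  # number of ads exposed for this request
pvr_i = x_i / 4                        # request-level ad ratio
```

Note that the number of exposed ads is an outcome of the mixed sort rather than a fixed quota: raising or lowering an ad's score changes how many ads land in the top $K$.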
Meanwhile, to ensure users' shopping experience, we need to impose the following constraints:

- platform-level constraint: the total percentage of ads exposed in one day should not exceed a certain threshold $C_{plt}$:

$pvr_{day} = \frac{\sum_{i=1}^{N} x_{i}}{N \cdot K} \le C_{plt}$  (5)
- request-level constraint: the percentage of ads exposed for each request should not exceed a certain threshold $C_{req}$:

$pvr_{i} = \frac{x_{i}}{K} \le C_{req}, \quad \forall i \in \{1, \dots, N\}$  (6)
where $C_{plt} \le C_{req}$. This means that we can exploit the inequality to expose different numbers of ads to different requests according to users' profiles, e.g., more ads to users who are more interested in ads (which can increase the average advertising revenue) and fewer ads to the others. On one hand, for each request, the number of exposed ads can be automatically adjusted according to the quality of the candidate products and ads. On the other hand, the positions of items are determined by the quality of the products and ads, which can further optimize the user experience. In this way, we can increase the total advertising revenue while satisfying both the request-level and platform-level constraints. The scoring systems of both the recommendation and advertising sides can be viewed as black boxes. However, from the advertising perspective, we can adaptively adjust the score of each ad to change the ads' relative rankings and, eventually, the number of ads to be exposed (Fig. 1, Step (4)).
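The two constraints above are straightforward to check given the per-request exposure counts; the slot count $K$ and the exposure numbers below are toy values for illustration.

```python
def satisfies_constraints(x, K, C_req, C_plt):
    """x: list of per-request ad counts; K: slots per request.
    Checks the request-level cap (Eq. (6)) and the day-level cap (Eq. (5))."""
    pvr = [xi / K for xi in x]
    pvr_day = sum(x) / (len(x) * K)
    return all(p <= C_req for p in pvr) and pvr_day <= C_plt

# Different requests may get different numbers of ads, as long as each stays
# under the request cap and the day-level average stays under the platform cap.
ok = satisfies_constraints(x=[5, 1, 3, 2], K=10, C_req=0.5, C_plt=0.3)
bad = satisfies_constraints(x=[6, 1, 3, 2], K=10, C_req=0.5, C_plt=0.3)
```

The first allocation averages 0.275 with every request at or below 0.5, so both constraints hold; the second violates the request-level cap (0.6 > 0.5) even though its day-level average would still be feasible.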
The adaptive exposure mechanism can potentially improve the advertising revenue; however, it faces a number of challenges. First, the ad score adjusting strategy is highly sensitive to the dynamics of the recommendation system (e.g., system upgrades) and of other components of the advertising system (e.g., the candidate ad selection mechanism may be upgraded), and the advertising system needs to be adjusted accordingly to keep meeting the constraints. Second, actual business needs change from time to time (e.g., adjustments to the platform-level and request-level constraints), and so does our advertising system. These challenges call for a more flexible algorithm. (Due to space constraints, we further discuss the novelty of our setup and related work in the appendix.)
3.2. Formulation
3.2.1. Problem Description
From the advertising perspective, the above adaptive exposure problem can be seen as a bidding problem: the products displayed in each user request are determined by the scores (bid prices) of the advertising items and the recommended items (ranked by bid price: the higher the price, the more likely to be displayed), and the advertising system adjusts the score of each original advertisement (the auction bid) to satisfy the constraints and increase revenue (the auction results). We follow the settings of the bidding problem in Display Advertising (Cai et al., 2017; Zhang et al., 2014) and extend them to the advertising with adaptive exposure problem. Formally, for the $j$-th ad in request $u_i$, its score is adjusted as follows:
$score^{*}_{i,j} = f(score_{i,j}; \theta)$  (7)
where $f$ is a bidding function, $\theta$ are the parameters of $f$, and $score_{i,j}$ is the original score given by the advertising system for the $j$-th ad in request $u_i$. Within the advertising system alone, we cannot directly determine whether an ad (whose score has been adjusted) will finally be exposed to the request; we can only observe the final displayed results from the Shuffle System (Fig. 1, Step (5)). So we define $w(a_{i,j}; \theta_{rec})$ as the probability of winning the bid request with bid adjustment ratio $a_{i,j}$, where $\theta_{rec}$ denotes the parameters of the recommendation system, i.e., whether the advertisement is finally displayed in request $u_i$ depends on $\theta_{rec}$. We use $v_{i,j}$ to denote the expected revenue value of the $j$-th advertising product under request $u_i$ ($v_{i,j}$ can be computed in a truthful or Generalized Second Price (GSP) fashion). Then, under the premise of satisfying the constraints (5) and (6), the optimization goal of the advertising system can be written as follows:
$\max_{\theta} \ \sum_{i=1}^{N} \sum_{j=1}^{N_{ad}} w(a_{i,j}; \theta_{rec}) \cdot v_{i,j} \quad \text{s.t. constraints (5) and (6)}$  (8)
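The score adjustment in Eq. (7) can be illustrated with a toy example. A multiplicative adjustment form and all numbers here are assumptions: the point is that rescaling an ad's score is the only lever the advertising side has over the final mixed ranking.

```python
def adjust_scores(scores, ratios):
    """Apply per-ad adjustment ratios to the original ad scores."""
    return [s * a for s, a in zip(scores, ratios)]

ad_scores = adjust_scores([0.8, 0.5, 0.3], [1.2, 0.9, 1.0])
rec_score = 0.9
# Before adjustment the best ad (0.8) ranked below the rec item (0.9);
# boosted by a 1.2 ratio it ranks above, so one more ad may be exposed.
beats_rec = ad_scores[0] > rec_score
```

Whether the boosted ad is actually displayed still depends on the recommendation side's scores, which is exactly why the win probability $w(\cdot)$ above is parameterized by the recommendation system.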
Requests arrive in chronological order. To satisfy the platform-level constraint (the maximum proportion of ads displayed during a day, $C_{plt}$), if the system exposes too many ads during the early period, it should expose fewer ads later. Hence the above problem is naturally a sequential decision-making problem.
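The sequential nature of the budget can be seen in a toy greedy pacer, which is not the paper's method but a minimal illustration under assumed demand numbers: exposing many ads early consumes the day budget that later requests would otherwise use, and a low-demand request frees budget for the next one.

```python
import math

def pace(request_caps, K, C_plt):
    """Greedy pacing: expose as many ads as desired for each request without
    letting the running day-level ad ratio exceed C_plt."""
    shown_ads, shown_items, x = 0, 0, []
    for cap in request_caps:  # cap = ads the system would like to show
        budget_left = C_plt * (shown_items + K) - shown_ads
        xi = max(0, min(cap, math.floor(budget_left + 1e-9)))
        x.append(xi)
        shown_ads += xi
        shown_items += K
    return x

# With K=10 slots and a 0.3 day cap, heavy demand is throttled to 3 ads per
# request, but the low-demand second request (1 ad) frees budget for the third.
x = pace([5, 1, 5, 5], K=10, C_plt=0.3)
```

This greedy pacer is myopic; the point of the RL formulation below is precisely to make such exposure decisions with the whole day in view rather than request by request.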
3.2.2. Problem Formulation
To solve such a sequential decision-making problem, one typical method is to model it as an MDP (Cai et al., 2017) or a CMDP (Wu et al., 2018) and then use reinforcement learning techniques to solve it. In practice, we cannot acquire or accurately predict environmental information such as $w(\cdot)$ and $\theta_{rec}$ beforehand, so we resort to model-free reinforcement learning techniques. However, since there exist both platform-level and request-level constraints, the traditional CMDP (Altman, 1999) cannot be directly applied here. We propose a special CMDP which we term CMDP with per-state constraint (psCMDP). Formally, a psCMDP can be defined as a tuple $(S, A, R, P, C_{tra}, C_{sta}, \gamma)$. Compared to the original CMDP (Altman, 1999), the difference is that for each trajectory $\tau$, a psCMDP needs to satisfy not only the trajectory-level constraint:
$\sum_{t=0}^{T} C_{tra}(s_{t}, a_{t}) \le \bar{C}_{tra}$  (9)
but also the per-state constraint over each request:
$C_{sta}(s_{t}, a_{t}) \le \bar{C}_{sta}, \quad \forall t \in \{0, \dots, T\}$  (10)
where $\bar{C}_{tra}$ and $\bar{C}_{sta}$ are the upper bounds of the trajectory-level and per-state costs, respectively. So the set of feasible stationary policies for a psCMDP is:
$\Pi_{ps} = \left\{\pi : \sum_{t=0}^{T} C_{tra}(s_{t}, a_{t}) \le \bar{C}_{tra} \ \text{and} \ C_{sta}(s_{t}, a_{t}) \le \bar{C}_{sta} \ \forall t\right\}$  (11)
The components of a psCMDP are described in detail as follows:

- $S$: The state should in principle reflect both the environment and the constraints. In our setting, we consider the following statistics for $s_t$: 1) information related to the current request $u_i$, e.g., features of the candidate ads; 2) system context information, e.g., the number of ads exposed up to time $t$.

- $A$: Since the system picks out products for each request based on the scores of all the products, we adjust the scores of all of a request's candidate ads at once. Accordingly, we denote $a_t = (a_{i,1}, \dots, a_{i,N_{ad}})$, where $a_{i,j}$ is the adjustment coefficient of the $j$-th ad for request $u_i$.

- $R$: $r(s_t, a_t) = \sum_{j \in \mathcal{E}_i} v_{i,j}$, where $a_t$ is the score adjustment action in state $s_t$, $\mathcal{E}_i$ is the set of ads finally exposed in $u_i$, and $v_{i,j}$ is the revenue value of displaying the $j$-th ad in $u_i$. We set $v_{i,j}$ as the Generalized Second Price after the actual sorting of the advertising and recommended items.

- $P$: The state transition models the dynamics of the request visiting sequence and of system information changes. The effect of $a_t$ on state transitions is that different $a_t$ lead to different exposed ad sets $\mathcal{E}_i$, which also affects the total number of ads shown so far (a component of $s_{t+1}$). Similar ways of modeling state transitions have been adopted previously in Cai et al. (2017), Jin et al. (2018) and Wu et al. (2018).
Specifically, for the constraints:

- $C_{tra}$: the platform-level constraint, i.e., the advertising exposure constraint over a day (trajectory), with the discount factor set to 1.

- $C_{sta}$: the request-level constraint, i.e., the advertising exposure constraint over each request (state), with the discount factor set to 1.
With all the definitions above, an optimal policy is defined as follows:
$\pi^{*} = \arg\max_{\pi \in \Pi_{ps}} \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{T}\gamma^{t} r_{t}\right]$  (12)
Our problem also shares similarities with the contextual bandit problem, well-studied in the e-commerce literature. One major difference is that contextual bandits mainly study choosing an action from a fixed and known set of possible actions, such as deciding whether to display advertisements and at which locations (Badanidiyuru et al., 2014; Wu et al., 2015; Agrawal et al., 2016; Tang et al., 2013). In our case, however, we have to adjust the scores (which are continuous variables) of hundreds of millions of items to maximize the total reward, which drives us to model the state transitions explicitly. The other reasons for adopting RL instead of contextual bandits are as follows: 1) Wu et al. (2018) show that modeling trajectory constraints in RL leads to higher profits, since RL can naturally track the changes of constraints in the long run and make longer-term decisions; 2) Hu et al. (2018) further confirm that RL methods can bring higher long-term returns than contextual bandit methods for ranking recommended products in e-commerce.
3.3. Solution: Constrained Two-Level Reinforcement Learning
We propose a constrained two-level reinforcement learning framework to address the constrained advertising optimization problem. The overall structure is shown in Fig. 2 and the framework is described in Algorithm 1. We split the entire trajectory into a number of sub-trajectories. The optimization task of the higher level is to learn the optimal policy for selecting constraints for different sub-trajectories to maximize long-term revenue while ensuring that the constraint over the whole trajectory is satisfied (Algorithm 1, Line 4). Given a sub-trajectory constraint from the higher level, the lower level is responsible for learning the optimal policy over its sub-trajectory while ensuring that both the sub-trajectory constraint and the per-state constraints are satisfied (Algorithm 1, Lines 2-3). In this way, the original psCMDP optimization problem is simplified by decoupling it into two independent optimization subproblems.
By decoupling the adaptive exposure learning problem in such a two-level manner, we gain adaptability and can respond quickly to dynamically changing e-commerce environments. This property is critical in online e-commerce environments, since a slow response would result in a significant monetary loss for the company. First, the platform-level constraint may vary frequently due to changes in the company's business strategy. In this case, the lower-level policies we have learned can be reused and only the higher-level policy needs to be retrained. Second, the recommendation system or other components of the advertising system may change frequently, and our adjustment policy needs to be updated accordingly. In this case, we only need to retrain the lower-level policies while the higher-level part can be retained.
3.3.1. Lower Level Control
In the lower level, we address the subproblem of learning an optimal advertising policy under a particular sub-trajectory constraint provided by the higher level. In our approach, we convert each sub-trajectory constraint into a state-level constraint to allow more precise control, simplifying the sub-trajectory optimization problem at the cost of sacrificing some policy optimality. Since the converted sub-trajectory constraint and the original state-level constraint $C_{sta}$ are both per-state, we can easily combine them into a single state-level constraint (the tighter of the two); once this state-level constraint is satisfied, the sub-trajectory constraint is satisfied as well. Thus, given a sub-trajectory constraint from the higher-level policy, we can directly optimize the lower-level policy at the state level. One natural approach in CMDPs is to guide the agent's policy update by adding an auxiliary value related to the per-state constraint to each immediate reward (Equation (4)); during the policy update, both the current and the future penalty values are then considered. However, in our lower level, since each transition satisfies its constraint independently, each action selection does not need to consider future per-state constraints.
Enforce Constraints with Auxiliary Tasks
Considering the above, we propose a method similar to auxiliary tasks (Jaderberg et al., 2016), adding an auxiliary loss function based on the per-state constraints. We use $L_{RL}$ and $L_{C}$ to denote the RL loss function and the per-state constraint loss function, respectively. During training, the policy is updated in the direction that minimizes the weighted sum of the two:
$L(\theta^{Q}) = \lambda_{1} L_{RL}(\theta^{Q}) + \lambda_{2} L_{C}(\theta^{Q})$  (13)
where $\lambda_{1}$ and $\lambda_{2}$ are the weights and $\theta^{Q}$ are the parameters of the value network. For example, for the critic in DDPG (Lillicrap et al., 2015), the original critic loss function is:
$L_{RL}(\theta^{Q}) = \mathbb{E}\left[\left(r_{t} + \gamma Q'(s_{t+1}, \mu'(s_{t+1}|\theta^{\mu'})|\theta^{Q'}) - Q(s_{t}, a_{t}|\theta^{Q})\right)^{2}\right]$  (14)
and the additional loss function for the per-state constraints can be defined as follows:
$L_{C}(\theta^{Q}) = \mathbb{E}\left[\left(r_{t} - g + \gamma Q'(s_{t+1}, \mu'(s_{t+1}|\theta^{\mu'})|\theta^{Q'}) - Q(s_{t}, a_{t}|\theta^{Q})\right)^{2}\right]$  (15)
where $\theta^{Q}$, $\theta^{\mu'}$ and $\theta^{Q'}$ are the online critic network parameters, the actor network parameters and the target critic network parameters, respectively, and $g$ is a function of the per-state constraint. The value of $g$ controls the degree to which the Q-function is penalized when the constraint is violated. For example, when we want to keep the pvr of each request close to 0.4, we can set $g = \beta \cdot |pvr_{i} - 0.4|$, where $pvr_{i}$ is the pvr value of request $u_i$. Intuitively, the more $pvr_{i}$ deviates from the target pvr, the more the corresponding Q-value is decreased. Similar techniques have been used to ensure the optimality of expert demonstrations via a margin classification loss (Hester et al., 2018).
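A small numeric sketch of the combined loss, under the assumption (one plausible reading of Eq. (15)) that the auxiliary term lowers the TD target by the penalty $g$; the linear shapes, $\beta$, and the weights $\lambda_1$, $\lambda_2$ are illustrative, not the paper's settings.

```python
import numpy as np

def combined_critic_loss(q, td_target, pvr, target_pvr,
                         beta=10.0, lam1=1.0, lam2=0.5):
    """Weighted sum of the standard TD loss and a per-state constraint loss
    whose target is lowered in proportion to the pvr deviation."""
    l_rl = np.mean((td_target - q) ** 2)     # standard critic loss
    g = beta * np.abs(pvr - target_pvr)      # penalty for constraint deviation
    l_c = np.mean((td_target - g - q) ** 2)  # constraint loss: target minus g
    return lam1 * l_rl + lam2 * l_c

# One sample satisfies the target pvr (0.4), the other deviates (0.6), so
# only the second contributes a constraint penalty.
loss = combined_critic_loss(np.array([1.0, 1.0]), np.array([1.0, 2.0]),
                            pvr=np.array([0.4, 0.6]), target_pvr=0.4)
```

Minimizing this loss pulls the Q-value of constraint-violating transitions below their pure TD target, which is the "decrease the Q-value when violated" behavior described above.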
Constrained Hindsight Experience Replay
To increase sample efficiency, we leverage the idea of hindsight experience replay (HER) (Andrychowicz et al., 2017) to accelerate the training of optimal policies for different sub-trajectory constraints. HER alleviates sample inefficiency in DRL training by reusing transitions, modifying the reward under different goals. We extend this idea and propose constrained hindsight experience replay (CHER). Different from HER, CHER does not directly revise the reward; instead, it uses different constraints to define the extra loss during training. The overall algorithm for training lower-level policies with CHER is given in Algorithm 2. When we learn a policy to satisfy constraint $c$ (a specific constraint on each state), it obtains the transition $(s_t, a_t, r_t, s_{t+1}, c)$ (Algorithm 2, Lines 5-8). We can replace $c$ with another constraint $c'$ and then reuse those samples to train a policy satisfying $c'$ (Algorithm 2, Line 12).
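The CHER relabeling step can be sketched as follows. Each stored transition keeps the constraint it was collected under, and on replay we relabel it with the other constraints so one rollout provides training samples for every constraint's network; the reward is left untouched, only the constraint (and hence the auxiliary loss) changes. The tuple layout and the goal set here are assumptions.

```python
GOALS = [1, 2, 3, 4, 5]  # target average ads-per-request constraints

def relabel(transition, goals=GOALS):
    """Return one copy of the transition per alternative constraint."""
    s, a, r, s_next, pvr, _orig_goal = transition
    return [(s, a, r, s_next, pvr, g) for g in goals]

t = ("s0", 0.3, 1.7, "s1", 0.4, 2)  # collected under constraint g=2
augmented = relabel(t)
```

Each relabeled copy feeds the network (or network head) of its constraint, so a single interaction with the environment trains all constraint-specific policies at once.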
3.3.2. Higher Level Control
The higher-level task is to determine the sub-trajectory constraints that maximize the expected long-term advertising revenue while satisfying the original trajectory constraint. (By satisfying the state-level constraints in the lower level, we reduce the higher-level optimization to a problem that only needs to consider the trajectory-level constraint.) At each decision point, the higher-level policy (which we term the constraint choice policy, CCP) selects a constraint for the next sub-trajectory, and the corresponding lower-level policy takes over and determines the ad adjustment scores for each request within that sub-trajectory. After the lower-level policy execution finishes, the accumulated revenue over that sub-trajectory is returned to the higher-level policy as its abstracted immediate reward. These steps repeat until we reach the end of the trajectory, and we then obtain the actual percentage of ads displayed over the whole trajectory, which is compared with the trajectory constraint to form an additional reward with weight $w$:
$r_{T} = -w \cdot |pvr_{day} - C_{plt}|$  (16)
Similar to WeiMDP (Geibel, 2006), we use DQN (Mnih et al., 2015) for the higher-level policy optimization; more advanced RL techniques (such as CPO (Achiam et al., 2017) and SPSA (Prashanth and Ghavamzadeh, 2016)) can be applied as well. Note that our higher-level control is similar to the temporal abstraction of hierarchical reinforcement learning (HRL) (Bacon et al., 2017; Kulkarni et al., 2016). However, in contrast to learning how to switch options (Bacon et al., 2017) or alleviating the sparse reward problem (Kulkarni et al., 2016) in HRL, our work leverages the idea of hierarchy to decompose different constraints into different levels.
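The higher-level loop can be sketched end to end. Everything concrete here is a toy assumption (the per-constraint revenue table standing in for the lower-level policies, 100 requests per hour, the alternating CCP): the point is the interface, where the CCP picks a constraint per segment, the matching lower-level policy runs, and the final reward subtracts a weighted penalty for missing the day-level target as in Eq. (16).

```python
def run_day(ccp, lower_revenue, K, n_hours, C_plt, w=100.0):
    """Simulate one trajectory: ccp(hour, pvr_so_far) -> chosen constraint;
    lower_revenue[g] stands in for the revenue of lower-level policy pi_g."""
    total_rev, ads, items = 0.0, 0, 0
    for h in range(n_hours):
        pvr_so_far = ads / items if items else 0.0
        g = ccp(h, pvr_so_far)         # constraint chosen for this segment
        total_rev += lower_revenue[g]  # segment revenue under pi_g
        ads += g * 100                 # assume 100 requests per hour
        items += K * 100
    pvr_day = ads / items
    return total_rev - w * abs(pvr_day - C_plt), pvr_day

ccp = lambda h, pvr: 4 if h % 2 == 0 else 2  # toy alternating CCP
reward, pvr_day = run_day(ccp, {2: 50.0, 4: 90.0}, K=10, n_hours=4, C_plt=0.3)
```

Alternating a loose constraint (4 ads) with a tight one (2 ads) averages exactly 0.3, so the penalty term vanishes; a CCP that always chose the loose constraint would earn more raw revenue but pay the day-level penalty.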
4. Experiments
4.1. Experimental Setup
Our experiments are conducted on a real dataset from Taobao, the largest Chinese e-commerce platform, and the data collection scenario is consistent with the problem description in Section 3. As shown in Fig. 1, the score adjustment produced by the advertising system does not affect the selection and scoring of the candidate products produced by the recommendation system; it only influences the relative ranking of the ads compared with the recommendation products and thus the final mixed sorting results. Therefore, the online data collected from the platform can be reused to evaluate the effect of score adjusting by re-sorting the mixture of the original recommended products and the re-scored ad products. Similar settings can be found in related work (Cai et al., 2017; Jin et al., 2018; Perlich et al., 2012; Zhang et al., 2014). Specifically, we replay the users' access logs in chronological order to simulate the users' requests. The state of our psCMDP is represented by integrating the features of all candidate ads and the system contextual information. The action is defined as the score adjusting ratios for the candidate ads. Finally, the reward and the satisfaction condition of each constraint are calculated following the definitions in Section 3.2. All details can be found in the appendix.
4.2. Does CHER improve performance?
To verify the effectiveness of CHER, we compare its impact on learning speed and stability against a baseline DDPG (Lillicrap et al., 2015) under the same network structure and parameters. Since the number of exposed ads for each request cannot exceed 5, we set the goal set to consist of 5 constraints; each goal represents the expected average number of ads exposed per request. Intuitively, we could use the constraint as part of the input and train a single network to satisfy all constraints (Andrychowicz et al., 2017); however, for learning stability, we use 5 different networks to satisfy the different constraints. Since we use DDPG, we add the constraint loss to the critic loss during critic training:
$g_{i} = \beta \cdot |pvr_{i} - g_{k}|$  (17)

$L_{C}(\theta^{Q}) = \mathbb{E}\left[\left(r_{i} - g_{i} + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'}) - Q(s_{i}, a_{i}|\theta^{Q})\right)^{2}\right]$  (18)
where $\beta$ is set to 10 and $pvr_{i}$ is the percentage of ads exposed for the $i$-th request $u_i$. We set up 4 different random seeds, and the experimental results are shown in Fig. 3 (we only show part of the results due to the space limit). An algorithm is considered better if its result is closer to the target constraint. We find that, under different constraints, DDPG with CHER is better than plain DDPG in terms of training speed and in reaching and stabilizing around the constraints. To understand the rationality of the learned policy, we randomly sampled some user visits in the dataset. By recording the score adjustment actions on the advertisements in these user visits (Fig. 4), the learned behavior can be intuitively understood as follows: if an advertising product has a higher value (eCPM, price), its score is adjusted higher.
Table 2. Performance of different policies under different platform-level targets.

| Target PVR | Policy | PVR | Revenue |
| --- | --- | --- | --- |
| 0.35 | manual | 0.3561 | 143121 (100%) |
| 0.35 | CHER | 0.3558 | 290260 (202.8%) |
| 0.35 | CCP | 0.3576 | 308108 (215.3%) |
| 0.41 | manual | 0.4179 | 157120 (100%) |
| 0.41 | CHER | 0.4100 | 362676 (230.8%) |
| 0.41 | CCP | 0.4141 | 370914 (236.1%) |
| 0.46 | manual | 0.4640 | 167489 (100%) |
| 0.46 | CHER | 0.4608 | 399712 (238.6%) |
| 0.46 | CCP | 0.4673 | 420119 (250.8%) |
4.3. Verify the Effectiveness of Constrained Two-Level Reinforcement Learning
To verify that the two-level structure can increase revenue, we compare the performance of different methods under different platform-level constraints $C_{plt}$ = 0.35, 0.41 and 0.46 (i.e., the upper bound of the advertising rate of each day is 0.35, 0.41 or 0.46, respectively), with the state-level constraint fixed at $C_{req}$ = 0.5 (the upper bound of the advertising rate of each request is 0.5). Since we consider a new adaptive exposure mechanism, there are no existing approaches suitable for comparison. We therefore consider the following two baselines: 1) manual: the score of an advertisement is manually adjusted by human experts; 2) CHER+DDPG: a model trained as in Section 4.2, corresponding to a policy that uses a fixed request-level constraint for the whole trajectory without adaptive adjustment. Since the performance of plain DDPG varies a lot, we add CHER to DDPG and use this optimized approach (CHER+DDPG) to attain a more stable baseline.
Hour |        Revenue           |             PVR                    |      Revenue / PVR
     | DDPG+CHER  CCP     Δ     | DDPG+CHER  CCP         Δ           | DDPG+CHER  CCP      Δ
8    | 11556      15845   4289  | 0.01280837  0.01590289  0.00309452 | 902222   996359   94137
9    | 15595      23422   7827  | 0.0162089   0.02192751  0.00571861 | 962125   1068155  106030
10   | 20157      28979   8822  | 0.0184266   0.02321300  0.00478640 | 1093907  1248395  154487
11   | 18221      24739   6518  | 0.01880709  0.02246692  0.00365983 | 968836   1101130  132293
12   | 16777      18646   1869  | 0.01794808  0.01895375  0.00100567 | 934751   983763   49011
...  |                          |                                    |
15   | 18129      16023  -2106  | 0.02096899  0.01851524 -0.00245375 | 864562   865395   832
16   | 22913      20450  -2463  | 0.02233828  0.01964052 -0.00269776 | 1025727  1041214  15486
...  |                          |                                    |
17   | 12919      11432  -1487  | 0.01914268  0.01718366 -0.00195901 | 674879   665283  -9596
18   | 11424      9786   -1638  | 0.01633943  0.01428198 -0.00205745 | 699167   685199  -13968
19   | 11586      10081  -1505  | 0.01570854  0.01391865 -0.00178989 | 737560   724280  -13280
...  |                          |                                    |
22   | 18362      15391  -2971  | 0.02780465  0.02398777 -0.00381688 | 660393   641618  -18774
23   | 12720      10584  -2136  | 0.02291417  0.01988751 -0.00302666 | 555115   532193  -22921
Note that Δ is the performance difference between CCP and DDPG+CHER under each evaluation indicator (Revenue, PVR, Revenue/PVR).
Does higher-level control improve performance?
To distinguish the different policies in the behaviour policy set, we use separate symbols to refer to the lower-level policies (DDPG+CHER) previously trained in Section 4.2 under different platform-level constraints. The temporal abstraction value is set to 1 hour, i.e., the higher-level CCP makes one decision per hour.^7 (^7: In fact, a more fine-grained decomposition can lead to better performance; we simply set the minimum temporal unit to 1 hour here to make the analysis of CCP's improvement easier.) After the sub-trajectory constraint is selected, the corresponding lower-level behavior policy is activated to adjust the ads' scores in the following hour with that constraint fixed. In our experiments, we combine the double DQN architecture with the dueling structure to train CCP. The state of CCP consists of hourly information, such as the timestamp, hourly eCPM, and PVR from 00:00 to the current time. The objectives of the higher-level policy are: (1) exposing approximately the same number of ads as the target; (2) improving revenue as much as possible. Detailed results are shown in Table 2 and Fig. 7, where CCP increases the daily revenue compared to the manual and DDPG+CHER policies under the same constraint. This demonstrates that our approach learns to expose different numbers of ads in different time periods: more ads are exposed when the value of the ads in a request is higher, and fewer ads are shown in other time slots.
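Assuming hypothetical environment and policy interfaces (`FakeEnv`, `FakeLowerPolicy`, and the per-hour decision function are ours, not part of the paper's system), the two-level loop described above can be sketched as:

```python
class FakeEnv:
    """Minimal stand-in environment, for illustration only."""
    def __init__(self, requests_per_hour):
        self.requests_per_hour = requests_per_hour
    def hourly_state(self, hour):
        # real state: timestamp, hourly eCPM, PVR from 00:00 so far, ...
        return {"hour": hour}
    def requests_in(self, hour):
        return self.requests_per_hour.get(hour, [])

class FakeLowerPolicy:
    """Stands in for a DDPG+CHER policy trained under one constraint."""
    def __init__(self):
        self.handled = 0
    def adjust_scores(self, request):
        self.handled += 1  # real policy: output score-adjustment actions

def higher_level_control(hours, behavior_policies, choose_constraint, env):
    """Once per hour the higher-level CCP picks a sub-trajectory
    constraint; the matching pre-trained lower-level policy then
    adjusts ad scores for every request in that hour."""
    log = []
    for hour in hours:
        constraint = choose_constraint(env.hourly_state(hour))
        policy = behavior_policies[constraint]
        for request in env.requests_in(hour):
            policy.adjust_scores(request)
        log.append((hour, constraint))
    return log
```

The key design point is that the lower-level policies are frozen during this loop; only the constraint selection changes hour by hour.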
Why does higher-level control improve performance?
We analyze the finally exposed ads and all the corresponding candidate ads of each request within a day. First, we set the advertising rate of each request to a fixed value of 0.35 and compute, for each hour, the proportion of the finally exposed ads to the total number of ads in that day; this is shown as the Fix policy in Fig. 8. Keeping the total number of ads displayed in a day exactly the same, the Oracle policy is computed by re-sorting all the candidates of all requests together by score and picking the top 35% of ads to display. Note that this Oracle policy, shown in Fig. 8, is the best available strategy for displaying ads in one day. We can clearly see that during hour 8 - hour 12, the advertising rate of the Oracle policy exceeds 35%, which means we should display more ads within this period to enlarge revenue. Conversely, during hour 17 - hour 20 and hour 22 - hour 23, the advertising rate of the Oracle policy is below 35%, which means we should reduce the number of unpromising ads and leave this opportunity to more valuable ones. The detailed advertising performance of each hour is shown in Table 3. The revenue gap between the baseline policy and our approach mainly appears in hour 8 - hour 12; moreover, our method obtains more cost-effective advertising exposure within hour 8 - hour 12 and hour 15 - hour 16. In short, our method dynamically adjusts the number of ads across time periods while keeping the daily PVR constraint satisfied.
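The Oracle policy described above can be sketched as follows; candidates are represented as dicts with `hour` and `score` keys (our naming, for illustration):

```python
def oracle_policy(candidates, daily_ratio):
    """Re-sort all candidate ads of the whole day by score and expose the
    top `daily_ratio` fraction, ignoring request boundaries. Returns, for
    each hour, the share of the day's exposed ads falling in that hour."""
    k = int(len(candidates) * daily_ratio)
    top = sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
    per_hour = {}
    for c in top:
        per_hour[c["hour"]] = per_hour.get(c["hour"], 0) + 1
    return {h: n / k for h, n in per_hour.items()}
```

Hours whose share exceeds the uniform per-hour rate are exactly those where more ads should be displayed.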
4.4. Online Results
Lastly, we report production A/B test experiments comparing our approach to the currently deployed baseline (which displays a fixed number of ads to every user^8) on Taobao's online platform. (^8: In the online test, we also tried the manual approach, as in the offline experimental setup. However, the manual method could not maintain a stable satisfaction of the PVR constraint, so we omit its results for fair comparison.) We conduct the experiments on the "Guess What You Like" section, where a mixture of recommendations and advertisements is displayed to users. Our method does not fix the numbers or positions of ads; instead, it adaptively adjusts the score of each ad for each user so that different numbers of ads are displayed to different users in different positions. More details are given in Section 3.1. For a fair comparison, we keep the platform-level constraint the same for all approaches. We find that our approach indeed presents different numbers of ads to different users in different positions while satisfying the preset constraint overall. Moreover, we observe 9%, 3%, and 2% improvements in RPM (Revenue Per Mille), CTR (Click-Through Rate), and GMV (Gross Merchandise Value) respectively, indicating that our adaptive exposure mechanism significantly increases not only the platform's revenue (RPM) but also the advertisers' revenue (GMV). The detailed online implementation process is described in the appendix.
5. Conclusion
We first investigate the flaws of traditional e-commerce systems that expose ads at fixed positions, and propose a more flexible advertising method, the Adaptive Exposure Mechanism, to alleviate these defects. We then highlight a series of challenges in applying the Adaptive Exposure Mechanism in real scenarios, model the problem as a psCMDP with constraints at different levels, and propose a constrained two-level reinforcement learning framework to solve it. Our framework offers high adaptability and quick response to dynamically changing e-commerce environments. We also propose a novel replay buffer mechanism, CHER, to accelerate the policy training of the lower level. Through offline simulation experiments and online verification, we demonstrate that the Adaptive Exposure Mechanism provides more flexible ad display while satisfying a series of constraints, and that the constrained two-level reinforcement learning framework can effectively exploit this mechanism to improve platform revenue and user experience under the constraints.
References
 Constrained policy optimization. arXiv preprint arXiv:1705.10528.
 An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In Proceedings of COLT, pp. 4–18.
 Constrained Markov decision processes. Vol. 7, CRC Press.
 Safe policy search for lifelong reinforcement learning with sublinear regret. In Proceedings of ICML, pp. 2361–2369.
 Hindsight experience replay. In Proceedings of NIPS, pp. 5048–5058.
 Optimising trade-offs among stakeholders in ad auctions. In Proceedings of EC, pp. 75–92.
 The option-critic architecture. In Proceedings of AAAI, pp. 1726–1734.
 Resourceful contextual bandits. In Proceedings of COLT, pp. 1109–1134.
 Real-time bidding by reinforcement learning in display advertising. In Proceedings of WSDM, pp. 661–670.
 Reinforcement mechanism design for e-commerce. In Proceedings of WWW, pp. 1339–1348.
 Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of SIGKDD, pp. 1187–1196.
 Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research 18 (167), pp. 1–51.
 Reinforcement learning for MDPs with constraints. In Proceedings of ECML, pp. 646–653.
 Deep Learning. Vol. 1, MIT Press, Cambridge.
 Deep Q-learning from demonstrations. In Proceedings of AAAI.
 Reinforcement learning to rank in e-commerce search engine: formalization, analysis, and application. In Proceedings of SIGKDD.
 Advertising in a stream. In Proceedings of WWW, pp. 29–38.
 Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397.
 Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of CIKM, pp. 2193–2201.
 Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Proceedings of NIPS, pp. 3675–3683.
 Deep learning. Nature 521 (7553), pp. 436–444.
 Estimating conversion rate in display advertising from past performance data. In Proceedings of SIGKDD, pp. 768–776.
 End-to-end training of deep visuomotor policies. JMLR 17 (1), pp. 1334–1373.
 Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
 Ad click prediction: a view from the trenches. In Proceedings of SIGKDD, pp. 1222–1230.
 Online matching and ad allocation. Foundations and Trends in Theoretical Computer Science 8 (4), pp. 265–368.
 Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
 Bid optimizing and inventory scoring in targeted online advertising. In Proceedings of SIGKDD, pp. 804–812.
 Variance-constrained actor-critic algorithms for discounted and average reward MDPs. Machine Learning 105 (3), pp. 367–417.
 Trust region policy optimization. In Proceedings of ICML, pp. 1889–1897.
 High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
 Reinforcement Learning: An Introduction. MIT Press.
 Automatic ad format selection via contextual bandits. In Proceedings of CIKM, pp. 1587–1594.
 Reward constrained policy optimization. arXiv preprint arXiv:1805.11074.
 Constrained reinforcement learning from intrinsic and extrinsic rewards. In Proceedings of ICDL, pp. 163–168.
 Display advertising with real-time bidding (RTB) and behavioural targeting. arXiv preprint arXiv:1610.03013.
 Budget constrained bidding by model-free reinforcement learning in display advertising. In Proceedings of CIKM, pp. 1443–1451.
 Algorithms with logarithmic or sublinear regret for constrained contextual bandits. In Proceedings of NIPS, pp. 433–441.
 Optimal real-time bidding for display advertising. In Proceedings of SIGKDD, pp. 1077–1086.
 Deep reinforcement learning for sponsored search real-time bidding. In Proceedings of SIGKDD.
 Impression allocation for combating fraud in e-commerce via deep reinforcement learning with action norm penalty. In Proceedings of IJCAI, pp. 3940–3946.
Appendix A Appendix
a.1. Discussion: Adaptive Exposure
Current research on dynamic ad exposure focuses on sponsored search (Bachrach et al., 2014), stream advertising in news feeds (Ieong et al., 2014), etc. In these works, dynamic ad exposure mainly refers to selecting the appropriate location and quantity of ads among fixed optional ad positions (Bachrach et al., 2014), or to dynamically inserting ads into feeds based on the user's previous browsing process (Ieong et al., 2014). Compared with sponsored search (Bachrach et al., 2014), our mechanism does not limit the positions of advertisements; instead, it chooses the number and locations of ads by score sorting, which brings more flexibility. Compared with stream advertising in news feeds (Ieong et al., 2014), we consider the mixed-display scenario where recommended products and advertised products are displayed together to customers, with their display order determined by their relative rankings.
a.2. Related Work
a.2.1. Bidding Optimization in Real-Time Bidding
Under the Real-Time Bidding (RTB) setting in e-commerce advertising, a large body of work has focused on estimating impression values, e.g. click-through rate (CTR) (McMahan et al., 2013) and conversion rate (CVR) (Lee et al., 2012), which improves bidding effectiveness by predicting impression values more precisely. Beyond impression value estimation, bidding optimization is another central problem in RTB, whose goal is to dynamically set an appropriate price for each auction so as to maximize key performance indicators (KPIs, e.g. CTR) (Wang et al., 2016). However, constraints are inevitable when solving optimization problems in real-world bidding, so smarter bidding strategies are needed to attain higher KPI values (e.g. the cumulative impression value), which can be achieved through reinforcement learning techniques (Perlich et al., 2012; Zhang et al., 2014; Cai et al., 2017). These approaches optimize the bidding strategy under a fixed budget constraint, with the budget reset at the beginning of each episode. Perlich et al. (2012) and Zhang et al. (2014) propose static bid optimization frameworks based on distribution analysis of previously collected log data. However, such approaches do not transfer well to settings where the data distribution is unstable and, in extreme circumstances, changes from day to day. For this reason, Cai et al. (2017) model the bidding problem as an MDP and treat budget allocation as a sequential decision problem; their experimental results show the robustness of the reinforcement learning approach under non-stationary auction environments.
By contrast, we propose a more general deep reinforcement learning framework that takes more realistic business constraints into consideration. In our setting, we concentrate on the practical problem of advertising with adaptive exposure, and we consider not only the trajectory-level constraint but also the state-level constraint. This is the main reason why the previous approaches are not applicable to our setting.
a.2.2. Constrained Reinforcement Learning
We focus on a constrained optimization problem, on which a considerable body of research exists; one typical solution is constrained reinforcement learning. Uchibe and Doya (2007) propose a policy gradient algorithm that uses gradient projection to enforce the active constraints; however, their approach cannot prevent the policy from becoming unsafe at the beginning of training. Later, Ammar et al. (2015) propose a theoretically motivated policy gradient method for lifelong learning under safety constraints. Unfortunately, it involves an expensive inner loop containing the optimization of a semidefinite program, making it unsuitable for DRL settings. Similarly, Chow et al. (2017) propose a primal-dual subgradient method for risk-constrained reinforcement learning, which takes policy gradient steps trading off return for lower risk while simultaneously learning the trade-off coefficients (dual variables).
More recently, a number of DRL-based approaches have been proposed to address the constrained optimization problem. Achiam et al. (2017) use the conjugate gradient method to optimize the policy; however, the computational cost rises significantly as the number of constraints increases, rendering such approaches inapplicable. Tessler et al. (2018) propose Reward Constrained Policy Optimization (RCPO), which converts trajectory constraints into per-state penalties and dynamically adjusts the weight of each per-state penalty during learning by propagating the constraint-violation signal over the entire trajectory.
Our work tackles the multi-constraint problem from a different point of view and takes the relationship between the different constraints into account. We decouple the original multi-constraint optimization problem into relatively independent single-constraint optimization problems and propose a constrained two-level reinforcement learning framework. More importantly, our two-level framework is quite general: any state-of-the-art RL algorithm can be flexibly applied to the learning procedure of either level.
a.3. Network structure and training parameters
a.3.1. CHER
Both the actor network and the critic network are four-layer fully connected neural networks, where each of the two hidden layers consists of 20 neurons with a ReLU activation applied to its outputs. A tanh function is applied to the output layer of the actor network to bound the size of the adjusted scores. The input of both networks is a 46-dimensional tensor: representative feature vectors of the request's candidate ad items plus the number of currently exposed items. The outputs of the actor network and the critic network are the 15 actions and the corresponding Q-values, respectively. The learning rate of the actor is 0.001, the learning rate of the critic is 0.0001, and the size of the replay buffer is 50000. The exploration rate starts from 1 and decays linearly to 0.001 over 50,000 steps. It is worth pointing out that the environment applies certain adjustments to the action, such as adding an offset or applying a scaling factor, to ensure that the score adjustment conforms to the business logic. Therefore, the output action is not the actual adjusted score; we consider this adjustment part of the environment logic, and it does not affect the training of the network.
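A minimal NumPy sketch of the actor's forward pass under the layout described above (46 inputs, two ReLU hidden layers of 20 units, tanh over 15 outputs); the weight-initialization helper and its scale are our assumptions, not from the paper:

```python
import numpy as np

def init_weights(sizes, rng):
    """Random (W, b) pairs for a fully connected net, e.g. [46, 20, 20, 15].
    The 0.1 scale is an illustrative choice."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_actor(state, weights):
    """Forward pass of the four-layer actor: two ReLU hidden layers of
    20 units, then tanh on the 15-dimensional output so the
    score-adjustment actions stay bounded in (-1, 1)."""
    h = state
    for W, b in weights[:-1]:
        h = np.maximum(0.0, h @ W + b)   # ReLU hidden layers
    W, b = weights[-1]
    return np.tanh(h @ W + b)            # bounded action vector
```

The critic has the same hidden structure but maps state features to Q-values instead of bounded actions.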
a.3.2. Higher Level Control
The DQN is a three-layer neural network: the hidden layer consists of 20 neurons with a ReLU activation applied to its outputs. The hidden layer output is then connected to: 1) as many nodes as there are actions, which estimate the action advantage values, and 2) a single node, which estimates the state value. Finally, we combine the two streams to obtain the Q-values. The size of the replay buffer is 5000, and we use prioritized replay to sample from it. The learning rate is 0.0006. The exploration rate starts from 1 and decays linearly to 0.001 over 1000 steps. We also use a fixed discount factor.
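The dueling aggregation that combines the two streams above is conventionally implemented with a mean-subtraction term for identifiability; a minimal sketch (the exact aggregation used in the paper is not spelled out, so this is the standard formulation):

```python
import numpy as np

def dueling_q(value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a): the standard dueling
    aggregation; subtracting the mean keeps V and A identifiable."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()
```

Without the mean subtraction, adding a constant to V and subtracting it from every A would leave Q unchanged, so the two streams would not be uniquely determined.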
a.4. Experimental Setup
Based on the log data, for each request we collect the 15 recommended products and the scores assigned by the recommendation system as candidates, together with the information of 15 advertising products, such as eCPM (effective cost per mille), price, predicted click-through rate (pCTR), and initial score. Since the actual amount of data is very large, we sample a part of it for empirical evaluation and verify that the sampled data is representative of the real data set. In both training and evaluation, we split the collected data by day and replay the requests in chronological order for simulation. We treat the data flow from 00:00 AM to the next day as one trajectory. At the beginning of each day, the number of ads displayed and the number of requests are reset to 0. At the end of each day, we count the number of ads displayed that day and judge whether the trajectory-level constraint has been satisfied.
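The day-by-day replay with counter resets can be sketched as follows (the field names are ours, for illustration):

```python
from itertools import groupby

def replay_by_day(requests):
    """Replay logged requests chronologically, resetting the per-day
    counters at each day boundary; yields one (day, n_requests, n_ads)
    summary per trajectory, against which the trajectory-level
    constraint can be checked."""
    requests = sorted(requests, key=lambda r: (r["day"], r["timestamp"]))
    for day, reqs in groupby(requests, key=lambda r: r["day"]):
        n_requests, n_ads = 0, 0          # counters reset at 00:00
        for r in reqs:
            n_requests += 1
            n_ads += r["n_ads_exposed"]
        yield day, n_requests, n_ads
```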
A state consists of 46 dimensions, including the characteristics of the 15 candidate ads (eCPM, price, pCTR) and the number of exposed ads. The action is the coefficient vector used to adjust the scores of the 15 ads. After the scores are adjusted, the 15 candidate advertising commodities and 15 candidate recommended commodities are sorted by the new scores, and the reward is calculated over the ads among the first 10 exposed items. Replaying the data, we can train and test our algorithm offline in two respects: 1) whether, after score adjustment, the quantity of ads among the 10 exposed items satisfies the request-level and platform-level constraints, and 2) the rewards of the exposed ads. In practice, the positions of ads affect user behaviors; e.g., ads in front positions are more likely to be clicked. Hence the reward of a request is defined as:
(1)  $r = \sum_{i \in \mathcal{A}_{\text{exposed}}} \beta_{pos(i)} \cdot \mathrm{ecpm}_i$
where $\mathrm{ecpm}_i$ is the eCPM value of ad $i$, and $\beta_{pos(i)}$ corrects the eCPM by considering the influence of position $pos(i)$; $\beta$ is fitted using the real data (Fig. 9).
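Assuming the exposed items are given as a list of slots, each recording its eCPM and whether it holds an ad (field names ours), the position-corrected reward can be computed as:

```python
def position_corrected_reward(exposed_items, position_coeff):
    """Reward of one request: sum of eCPM over the exposed ads, each
    weighted by a position coefficient (in the paper this coefficient
    is fitted from real data; here it is a stand-in list indexed by
    display position)."""
    return sum(position_coeff[pos] * item["ecpm"]
               for pos, item in enumerate(exposed_items) if item["is_ad"])
```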
a.5. Online Experiment
Due to the architecture of the online engine in Taobao, we replaced the gradient-based DDPG with the widely used gradient-free Cross-Entropy Method (CEM, an evolution-based genetic algorithm) within the platform. When deploying our algorithm to the online environment, we run two separate processes: (1) online serving and data collection; (2) offline training. For (1), we use Blink (an open-source stream processing framework specially designed and optimized for e-commerce scenarios) to record the constantly updated online data. Moreover, to fully explore the parameter space of CEM, we split the online traffic into a number of buckets and deploy different sets of parameter configurations at the same time; different buckets are controlled by different parameters. After each user request is processed, the newly produced data is recorded into the corresponding data tables by Blink. For (2), a centralized learner periodically updates its parameters based on the latest recorded data, generates different sets of parameters for the different buckets, and synchronizes them to the parameter server. At the same time, the online search engine deployed at each bucket regularly requests the latest parameters from the parameter server.
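One CEM iteration of this gradient-free parameter search can be sketched as follows; in deployment the score function would be estimated from bucketed online traffic, and the std floor preventing premature convergence is our addition:

```python
import numpy as np

def cem_step(mean, std, score_fn, n_samples=50, elite_frac=0.2, rng=None):
    """One Cross-Entropy Method iteration: sample parameter vectors from
    an axis-aligned Gaussian, score each candidate, and refit the
    Gaussian to the elite (top-scoring) fraction."""
    rng = rng or np.random.default_rng()
    samples = mean + std * rng.standard_normal((n_samples, mean.size))
    scores = np.array([score_fn(s) for s in samples])
    n_elite = max(1, int(n_samples * elite_frac))
    elite = samples[np.argsort(scores)[-n_elite:]]        # highest scores
    # small std floor (our addition) keeps exploration from collapsing
    return elite.mean(axis=0), np.maximum(elite.std(axis=0), 0.05)
```

Iterating this step moves the sampling distribution toward high-scoring parameter regions without requiring any gradient from the serving engine.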