Multiobjective Reinforcement Learning Based Approach for User-Centric Power Optimization in Smart Home Environments
Abstract
Smart homes require every device inside them to be connected at all times, which leads to considerable power wastage on a daily basis. As the number of devices inside a smart home increases, it becomes difficult for the user to control or operate every individual device optimally. Therefore, users generally rely on power management systems for such optimization but are often not satisfied with the results. In this paper, we present a novel multiobjective reinforcement learning framework with the twofold objectives of minimizing power consumption and maximizing user satisfaction. The framework explores the tradeoff between the two objectives and converges to a better power management policy when both objectives are considered while finding an optimal policy. We experiment on real-world smart home data, and show that the multiobjective approaches: i) establish a tradeoff between the two objectives, and ii) achieve better combined user satisfaction and power consumption than single objective approaches. We also show that the devices that are used regularly, and that fluctuate between device modes at regular intervals, should be targeted for optimization. Experiments on data from other smart homes yield similar results, ensuring the transferability of the proposed framework.
I Introduction
The amount of power consumed by households (residential consumption) is among the top three sectors of world electricity consumption [4], and is ever increasing with the growing demand for smart homes and IoT (Internet of Things) devices. According to the United States Department of Energy (DoE), the average household consumes 90 million units of power a year, and much of that power is wasted [2]. Habits like leaving lights on when we leave rooms, or forgetting to turn off televisions or computers when not in use, are major reasons behind such wastage [3]. Therefore, there is a need for power controllers that can take actions like turning devices on and off, or changing devices' modes of operation, on behalf of users to achieve a goal like optimized consumption.
In the past, researchers have used traditional reinforcement learning for several power optimization tasks. For example, [11] proposed a model-free constrained RL approach for online power management. [9] presented a similar algorithm that requires no prior information about the workload and dynamically adapts to the environment to achieve autonomous power management. [12] proposed an RL-based technique that performs simultaneous online management of both performance and power consumption; the authors applied RL in a realistic laboratory testbed to find the optimal policy. None of these techniques is applied to smart home power management, and none considers user satisfaction while finding optimal policies.
However, power management in a smart home is a problem that needs to solve two tasks with different rewards simultaneously: minimizing power consumption and maximizing user satisfaction. It is important for a power controller to consider user preferences as well, i.e., the goal of minimal power consumption must be achieved, but not at the expense of user satisfaction. The scenario can be formulated as a multiobjective reinforcement learning (MORL) problem, where sequential decision making is required with multiple objectives.
Our contribution: In this paper, we propose, for the first time, a novel multiobjective reinforcement learning (MORL) approach for power management inside a smart home with two objectives: minimizing power consumption and maximizing user satisfaction. In a MORL problem, an action on the environment results in multiple rewards. The agent (power controller) learns an optimal policy from these rewards using a variation of Q-learning [14]. Since the objectives are contrasting, there is a tradeoff between the two, and based on their importance, optimization priorities are set. We use an overall reward function to incorporate these optimization priorities, defined as a weighted sum of two rewards: $R_P$, representing power consumption, and $R_U$, representing user satisfaction. We specifically focus on the weighted-sum method [5] for multiobjective optimization and compare the results with single objective strategies.
We evaluate our proposed methods on the Smart* data set for sustainability [1]. The data samples include device-level real-world power consumption values in several smart homes, named A, B, C, …, H, recorded every 30 minutes. We show the effectiveness of our approach on data from smart home A, and demonstrate transferability through experiments on smart homes B and C. We use Q-learning with individual objectives (single policy single objective approaches) as a baseline reference against which to compare the proposed single policy multiobjective approaches. We also define a metric, "clash rate", for evaluating user satisfaction in the predicted policy at each episode.
The remainder of this paper is organized as follows. Section II gives some background on traditional and multiobjective reinforcement learning (MORL). Section III explains our problem formulation, followed by our algorithms for solving the optimization problem in Section IV. The experiments and results are presented in Sections V through VII. We conclude our work in Section VIII.
II Background
In this section, we discuss traditional reinforcement learning along with Q-learning, an algorithm widely used to solve traditional RL problems. Then we introduce the concepts of multiobjective reinforcement learning (MORL) and how it differs from traditional RL.
II-A Traditional Reinforcement Learning
Traditional reinforcement learning [10] mimics the natural learning style of trial-and-error by interacting with an environment (static or dynamic) and receiving feedback based on an action. The components of reinforcement learning are:

An agent;

A finite state space $S$;

A set of actions $A$ available to the agent;

A reward function $R: S \times A \rightarrow \mathbb{R}$.
The agent's objective is to maximize its average long-term reward. This is achieved by learning a policy $\pi: S \rightarrow A$, which is a mapping between the states and the actions. In our problem, one goal is to minimize the power consumption of a smart home, and the other is to maximize user satisfaction. But in a traditional reinforcement learning setting, the two goals are independent: an agent can either minimize power consumption, or it can maximize user satisfaction.
Q-learning [14] is a widely known algorithm used to solve sequential decision-making RL problems. In each step, on the successful execution of an action $a$, the environment yields a reward $r(s,a)$, which indicates the value of a state transition. The issued reward can be positive or negative. The agent keeps a value function $Q(s,a)$ for each state-action pair. Learning to act in the environment will make the agent choose actions to maximize long-term rewards; based on this value function, the agent decides its immediate action. The Q-value for each state-action pair is initialized during the problem formulation, and later it is updated with each taken action and its issued reward. The value function is given by the following Bellman equation:
$$Q(s,a) = r(s,a) + \gamma \max_{a'} Q(s', a') \qquad (1)$$

where $r(s,a)$ is the reward issued after taking action $a$ in state $s$, $s'$ is the successive state of $s$, and $\gamma \in [0,1]$ is the discount factor used due to the different influences of future rewards on the present value.
The optimal state-action value function is defined as:

$$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a) \qquad (2)$$

When $Q^{*}(s,a)$ is obtained, the optimal policy can be computed by:

$$\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)$$
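As an illustration of the update above, here is a minimal tabular Q-learning sketch on a toy two-state environment. The environment, episode length, and hyperparameter values are illustrative assumptions, not taken from the paper:

```python
import random

def q_learning(num_states, num_actions, step, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: learn Q(s, a) from (reward, next_state) feedback."""
    rng = random.Random(seed)
    Q = [[0.0] * num_actions for _ in range(num_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(20):  # bounded episode length
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda x: Q[s][x])
            r, s_next = step(s, a)
            # Bellman backup toward r + gamma * max_a' Q(s', a')  (Eqs. 1-2)
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

# Toy environment: action 1 in state 0 yields reward 1 and leads to state 1;
# everything else yields 0 and returns to state 0.
def toy_step(s, a):
    return (1.0, 1) if (s == 0 and a == 1) else (0.0, 0)

Q = q_learning(2, 2, toy_step)
policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
```

The greedy policy extracted at the end corresponds to the $\arg\max$ in the equation above.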
II-B Multiobjective Reinforcement Learning
Reinforcement learning is a machine learning paradigm that helps with sequential decision making under several uncertainties and aims to achieve a single long-term objective. However, due to the complex requirements of real-world control systems, there are often two (or more) conflicting objectives. For example, in our case of a smart home power management system, the controller has two goals: i) to minimize the energy consumption of the smart home, and ii) to maximize user comfort by moving to states preferred by the user. In reinforcement learning, problems of this nature, having more than one conflicting objective, are called multiobjective reinforcement learning (MORL) problems.
MORL differs from traditional RL in that there are two or more objectives to be optimized simultaneously by the learning agent. [6] provides an architecture for a MORL problem, where a vector of rewards is provided to the learning agent at each step. Figure 1 shows the difference between the architectures of traditional RL (Figure 1(a)) and multiobjective RL (Figure 1(b)). In MORL (Figure 1(b)), there are $N$ objectives, and $r_i$ is the reward signal provided by the environment for objective $i$. The architecture illustrates a single agent that has to find an optimal policy for a set of multiple objectives simultaneously. The objectives can be conflicting, as in our case, or they can be independent.
For each objective $i$ and a stationary policy $\pi$, there is a corresponding state-action value function $Q_i^{\pi}(s,a)$, which satisfies Equation 1.
Let the combined value function for MORL be:

$$\mathbf{Q}^{\pi}(s,a) = \left[\, Q_1^{\pi}(s,a), \ldots, Q_N^{\pi}(s,a) \,\right]^{T}$$

where $\mathbf{Q}^{\pi}(s,a)$ is a vector and each of its components satisfies the Bellman equation (1). Then the optimal state-action function will be given as:

$$\mathbf{Q}^{*}(s,a) = \left[\, Q_1^{*}(s,a), \ldots, Q_N^{*}(s,a) \,\right]^{T} \qquad (3)$$

and the optimal policy can be obtained by:

$$\pi^{*}(s) = \arg\max_{a} f\!\left(\mathbf{Q}^{*}(s,a)\right) \qquad (4)$$

where $f$ is a function (e.g., a weighted sum) that scalarizes the vector of Q-values according to the optimization priorities.
MORL is a combination of multiobjective optimization methods and RL techniques to solve sequential decision making problems with multiple conflicting objectives. We will justify why we formulate smart home power management as a MORL problem in the next section.
III Problem Formulation
The case of a smart home power management system is a multiobjective problem with two objectives, viz., minimizing power consumption and maximizing user satisfaction. Ideally, a controller will try to reduce the power consumption as much as it can, given an optimization goal. The trivial solution for the controller would be to turn off all the devices operating in the smart home. However, this state might not be desirable to the user. Therefore, it is important for a controller to consider user preferences as well. Hence, the goal of minimal power consumption must be achieved by establishing a tradeoff with user satisfaction, and not at its expense.
Based on the importance of each objective, optimization priorities must be ensured while designing the policies. After appropriately expressing these preferences, we have to design an efficient algorithm that can solve the sequential decision-making problem based on observed state transition data.
III-A Environment
Smart homes usually have smart meters to measure the power consumption of each device operating within them. The power consumption values for each device are independent and take a fixed number of discrete values. This is because each device operates only in a fixed number of modes, and its power consumption in a specific mode remains the same. For example, a simple furnace has only two modes, ON and OFF. In OFF mode it consumes no power, while in ON mode it consumes, say, $x$ units of power. We assume that the consumption remains the same and that no degradation of the device, which would cause more energy consumption, happens over time. Similarly, a washing machine can have three modes of operation, viz., standby, wash, and dry. Let's assume a smart home has $D$ devices. A state in the environment is a vector of the energy consumption values (in whatever device modes they are in) of these $D$ devices, as depicted by the state blocks $s$ and $s'$ in Figure 2.
Let's assume the numbers of modes the devices operate in are given by a set $M = \{m_1, m_2, \ldots, m_D\}$, where $m_i$ is the number of modes of device $i$. Increasing the number of devices, or just their modes of operation, can lead to state-space explosion. Therefore, in our techniques we choose devices selectively and use data processing (explained in Section V) to avoid state-space explosion.
III-B Power Controller (Agent)
The agent is a power controller that can change the mode of operation of any of the devices, consequently changing the energy consumption value. For example, the power controller can turn off the furnace if it is on, or switch the washing machine to dry mode from some other mode. However, the agent can also choose to do nothing. Therefore, the agent can perform either of two actions, i.e., change or no-op, on each device in a state $s$ to move the environment to a state $s'$. For example, in Figure 2, the controller changes the modes of operation of device 1 and device 2 by change actions, and chooses to keep the remaining devices in their current states by no-op actions.
III-C Reward
Whenever the proposed agent takes an action on the environment, a reward is calculated on the basis of the state chosen by the controller and the ground truth state from the Smart* data set. Since there are two distinct objectives, we formulate the reward functions to incorporate both power consumption and user satisfaction. Every update of the state-action value function (Equation 1) depends on the reward; therefore, by integrating the optimizations into the reward function, the agent learns the tradeoff between the optimization priorities for an optimal state. First we introduce the two rewards separately, and then we combine them to form a single reward, as shown in Figure 2.
Minimizing Power Consumption
Let's say the power consumption of device $i$ in the predicted state and the ground truth state is $p_i^{pred}$ and $p_i^{gt}$, respectively. The reward $R_P$ is given as:

$$R_P = -\frac{1}{D} \sum_{i=1}^{D} \left( p_i^{pred} - p_i^{gt} \right) \qquad (5)$$

which is the negated average difference of power consumed by the $D$ devices between the predicted state and the desired state. As the agent always tries to maximize the reward, we negate the sum in order to favor states which consume less power than what the user had chosen. By negating, the state with the least electricity consumption becomes the goal state for the power controller.
Maximizing User Satisfaction
To model user behavior, we compute the Euclidean distance between the predicted state and the ground truth state. The reward $R_U$ is given as:

$$R_U = -\sqrt{\sum_{i=1}^{D} \left( p_i^{pred} - p_i^{gt} \right)^2} \qquad (6)$$

where $p_i^{pred}$ and $p_i^{gt}$ are the power consumption values of device $i$ in the predicted and ground truth states, respectively. The distance is negated so that predicted states closer to the state the user prefers yield a higher reward.
Overall reward
We take a weighted combination of both rewards, $R_P$ (Equation 5) and $R_U$ (Equation 6), and define the overall reward as:

$$R = w_P R_P + w_U R_U \qquad (7)$$

where $w_P$ and $w_U$ are the weights used to manipulate the optimization priorities of the two objectives. These weights are treated as hyperparameters during experimentation.
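The three reward functions above can be sketched as follows, assuming states are represented as per-device power vectors. The function names and example values are hypothetical:

```python
import math

def reward_power(pred, gt):
    """R_P (Eq. 5): negated average power difference between the
    predicted state and the ground-truth (user-chosen) state."""
    d = len(pred)
    return -sum(p - g for p, g in zip(pred, gt)) / d

def reward_user(pred, gt):
    """R_U (Eq. 6): negated Euclidean distance to the user's preferred state."""
    return -math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)))

def overall_reward(pred, gt, w_p=0.5, w_u=0.5):
    """R (Eq. 7): weighted combination encoding the optimization priorities."""
    return w_p * reward_power(pred, gt) + w_u * reward_user(pred, gt)

# A state predicted to consume less power than the user's choice earns a
# positive R_P but a negative R_U: the two objectives pull in opposite
# directions, which is exactly the tradeoff the controller must learn.
gt = [50.0, 30.0, 10.0, 5.0, 60.0]    # user's next state (5 devices, in watts)
pred = [0.0, 30.0, 10.0, 5.0, 60.0]   # controller turned device 0 off
```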
III-D Evaluation
The evaluation of the power controller's performance is twofold due to the multiobjective nature of the optimization problem. The reward $R_P$ represents the negation of power consumption; therefore, a policy with a more positive value is desired. Hence, as we increase the number of iterations, the value of $R_P$ should increase.
Similarly, the reward $R_U$ represents how closely the next state predicted by the controller, $s'_{pred}$, matches the next state that the user prefers, $s'_{gt}$. However, to evaluate $R_U$, we introduce a term called "clash rate" to get a device-level view of clashes. The clash rate is calculated as:

$$\text{clash rate} = \frac{1}{D} \sum \left( s'_{pred} \oplus s'_{gt} \right) \qquad (8)$$

where "$\oplus$" is an element-wise comparison that assigns 1 if values do not match and 0 otherwise, and returns an array of 1's and 0's.

For example, let us say the user wants the next state to be $s'_{gt} = (m_0, m_1, m_2, m_3, m_4)$, where the $m_i$ represent the device modes at this state. Now, the controller takes an action on the environment to change its state to $s'_{pred}$, whose modes differ from $s'_{gt}$ at indices 0, 1, and 3 (assume the vectors are indexed starting from 0). The clash rate in this case is $3/5$. As we train the power controller over more iterations, the clash rate should decrease.
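The clash-rate computation can be sketched as follows, with the mode vectors as plain lists; the example values are hypothetical but mirror the 3-out-of-5 mismatch described above:

```python
def clash_rate(pred_modes, gt_modes):
    """Clash rate (Eq. 8): fraction of devices whose predicted mode
    differs from the mode the user actually chose."""
    clashes = [1 if p != g else 0 for p, g in zip(pred_modes, gt_modes)]
    return sum(clashes) / len(clashes)

# Modes differ at indices 0, 1, and 3 -> clash rate 3/5, as in the text.
gt = [0, 1, 2, 0, 1]
pred = [1, 0, 2, 1, 1]
```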
IV Solutions
MORL approaches can be divided into two groups based on the number of policies to be learned [13]: single policy and multiple policy approaches. In our case, the objectives are contrasting, and the availability of data allows us to create a sufficiently good representation of the environment. Therefore, we focus on a single policy approach.
The aim of single policy approaches is to obtain the best policy which satisfies the optimization priorities as set by the designer, or as defined by the application domain. Therefore, based on varying optimization priorities, we implemented four variations of a single policy algorithm to find an optimal policy for our twofold objectives of minimum power consumption and maximum user satisfaction. A single policy approach to solving MORL problems is to formalize a synthetic objective function $TQ(s,a)$, which can represent the overall preferences in optimization. The approach is very similar to Q-learning with a few modifications, as shown in Algorithm 1. The objective function is given as the summation of the Q-values for all the objectives:

$$TQ(s,a) = \sum_{i=1}^{N} Q_i(s,a) \qquad (9)$$

As discussed in Section III-C, we incorporate the optimization priorities using the weights $w_P$ and $w_U$ in the reward function. Since $Q_i(s,a)$ is dependent on the reward function, and $TQ(s,a)$ on $Q_i(s,a)$, any change in the weight values in Equation 7 will result in a change of values in $TQ(s,a)$. The four variations are described below.
IV-A Single Policy Single Objective
As a baseline reference, we implement the single policy approach with single objectives. Recall that in Equation 7 the overall reward is defined as the sum of two weighted rewards, one for minimizing power consumption and the other for maximizing user satisfaction. Therefore, the single policy approach with one objective taken at a time has two cases:
Power Consumption Minimization
To implement this, we give 100% optimization priority to power consumption, and set $w_P$ and $w_U$ to $1$ and $0$, respectively, in Equation 7.
User Satisfaction Maximization
To implement this, we give 100% optimization priority to user satisfaction, and set $w_P$ and $w_U$ to $0$ and $1$, respectively, in Equation 7.
IV-B Single Policy Multiobjective
The goal of the power controller is to achieve a multiobjective optimization. Therefore, we consider two cases:
Equal weights
This is the case where both objectives are equally important, and the power controller tries to optimize both. Both $w_P$ and $w_U$ are set to $0.5$. Based on the policy calculated by the power controller, the action with the maximum summed Q-values is chosen to be executed.
Weighted Sum
The weighted sum approach has proven effective with multiple objectives in the past. [7] used it to combine seven vehicle-overtaking objectives, and [15] used it with a combination of three objectives, viz., the degree of crowding in an elevator, the waiting time, and the number of start-ends. The approach modifies Equation 9 as:

$$TQ(s,a) = \sum_{i=1}^{N} w_i \, Q_i(s,a)$$

In our case, the weights $w_i$ are $w_P$ and $w_U$, and we experiment with different values of both to get the best results.
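A small sketch of weighted-sum action selection over per-objective Q-values; the Q-values below are made up purely to show how the weights steer the chosen action between the two objectives:

```python
def tq(q_values, weights):
    """Synthetic objective TQ(s, a) = sum_i w_i * Q_i(s, a), for one state.
    q_values[i][a] is objective i's Q-value for action a."""
    num_actions = len(q_values[0])
    return [sum(w * q[a] for w, q in zip(weights, q_values))
            for a in range(num_actions)]

def best_action(q_values, weights):
    """Greedy action under the weighted-sum objective."""
    scores = tq(q_values, weights)
    return max(range(len(scores)), key=lambda a: scores[a])

# Hypothetical Q-values in one state for 3 actions: objective P (power)
# prefers action 0, objective U (user satisfaction) prefers action 2.
q_p = [5.0, 2.0, 0.0]
q_u = [0.0, 4.0, 5.0]

a_power = best_action([q_p, q_u], [1.0, 0.0])  # power-only priority
a_user = best_action([q_p, q_u], [0.0, 1.0])   # satisfaction-only priority
a_both = best_action([q_p, q_u], [0.5, 0.5])   # equal weights
```

With equal weights the compromise action 1 wins, illustrating how the weights trade one objective off against the other.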
V Environment Setup
We evaluate the proposed solutions of Section IV using the Smart* data set for sustainability [1]. As baseline references, we consider Q-learning with the single objectives of power consumption minimization and user satisfaction maximization. We plot the reward and the clashes for all four proposed algorithms to contrast the results. In this section, we first briefly explain the data set, then the environment design, and finally the experiments and results.
V-A Smart* Data Set for Sustainability
The data set includes real power consumption readings of multiple devices, such as a furnace, fridge, washing machine, etc., inside seven smart homes. Each device has sensors attached to record its power consumption at regular intervals of 30 minutes.
V-B Designing the Environment
The data set has power consumption values from more than 20 devices for each smart home. We have considered only 5 devices from a smart home: furnace, washing machine, fridge, heater, and kitchen lights. The reasons for doing so are:

In a real-world scenario, a user does not want the controller to operate on all of the devices in their smart home.

Formalizing an optimization problem with only the top few devices with maximum power consumption is more realistic and helpful than taking all possible devices and constraints into consideration.

Simplicity of experimentation.
Data Processing
In our data set, the power consumption reading for each device takes many distinct values. For example, the power consumption values for the Furnace have on the order of 17,000 unique entries, and the Fridge likewise takes thousands of unique values. However, a lot of these values are very close and differ only in the later decimal places, representing a data collection glitch. Since we chose 5 devices, a state in this environment is represented by a vector of size 5, whose $i$-th element is the power consumption of device $i$ in its current mode.

The size of the state space is the cross product of all the unique values taken by each device. By this convention, even if we consider only the furnace and the fridge, the size of the state space runs into the millions. With such a big state space, the problem becomes very complex to solve, and therefore, to avoid state space explosion, we cluster the energy consumption values of each device separately to find a fixed number of modes of operation for each device. Intuitively, in real life, a furnace cannot have 17,000 modes of operation. Therefore, finding device modes with clustering is a fair assumption to make.
Clustering to assign the modes of operation for each device
We wanted to find cluster centers of the power consumption values for each device individually, which can represent its different modes of operation. The modes of operation may be readily available from the manufacturer's end, but they might not be ideal for our case. For example, suppose a washing machine consumes $p_1$ units of power in standby mode, $p_2$ in wash mode, and $p_3$ in dry mode, and the values $p_2$ and $p_3$ are very close. The manufacturer can say that the wash and dry modes are different, but we have similar readings for the two states, and hence the distinction does not affect our objective. Therefore, we cluster the readings such that each device mode represents a significant change in power consumption from one mode to another. The clustering helps us reduce the state space to a very good extent.
First, we performed silhouette analysis [8] to find the optimal number of clusters for each device. We vary $k$, the number of clusters, from 2 to 6, assuming it is rare that a device has more than 6 modes of operation. Figure 3(a) shows the silhouette plot for the clusters formed from the Duct Heater's electricity consumption data with $k = 3$. The clusters are well formed, with silhouette coefficient values above the average score, as can be seen in Figure 3(b). The plot is shown for $k = 3$ as it yielded the best silhouette score. Similar experiments are performed with the remaining 4 devices.
After clustering, the optimal numbers of clusters for the chosen devices (Furnace, Washing Machine, Fridge, Duct Heater, and Kitchen Lights) are 2, 3, 3, 3, and 5, respectively. Clustering reduces the size of the state space from over a million states to just 270 ($2 \times 3 \times 3 \times 3 \times 5$), preventing the state space explosion.
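The mode-finding step can be sketched with a minimal 1-D k-means in place of the paper's clustering pipeline (the readings below are illustrative; the mode counts 2, 3, 3, 3, 5 and the resulting 270 states come from the text):

```python
def kmeans_1d(values, k, iters=50):
    """Minimal 1-D k-means: group a device's power readings into k modes."""
    lo, hi = min(values), max(values)
    # Spread initial centers evenly across the observed range.
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            buckets[nearest].append(v)
        # Move each center to the mean of its bucket (keep it if empty).
        centers = [sum(b) / len(b) if b else centers[j]
                   for j, b in enumerate(buckets)]
    return sorted(centers)

# Noisy furnace-like readings around two real modes (OFF ~0 W, ON ~600 W):
# tiny decimal differences collapse into two clean modes.
readings = [0.01, 0.02, 0.0, 599.8, 600.1, 600.3, 0.03, 600.0]
modes = kmeans_1d(readings, k=2)

# State-space size after clustering, using the paper's mode counts.
state_space = 1
for m in [2, 3, 3, 3, 5]:
    state_space *= m
```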
VI Experiments
The first objective is to minimize the total power consumption, and the second objective is to maximize user satisfaction. Our algorithm takes into account four kinds of hyperparameters: the learning rate ($\alpha$), the discount factor ($\gamma$), the exploration rate ($\epsilon$), and the reward prioritization weights $w_P$ and $w_U$. As a baseline, we use Algorithm 1 with single objectives. Note that if we run the algorithm with a single objective, it becomes the traditional Q-learning algorithm. The clash rate as defined in Equation 8 will be maximum for the single policy with the power consumption minimization objective, and minimum with the user satisfaction maximization objective. However, with multiple objectives, the clash rate should lie between the two. The overall reward is given as a weighted sum; therefore, the reward will be maximum for the multiobjective approach. We implemented the solutions discussed in Section IV as follows:
VI-A Single Policy Single Objective
The overall reward has two weighted terms, $R_P$ and $R_U$, representing power consumption and user satisfaction, respectively. For the first set of experiments, we focus only on optimizing a single objective by initialising $(w_P, w_U)$ as $(1, 0)$ for the power consumption minimization objective, and $(0, 1)$ for the user satisfaction maximization objective. Hence, in the single policy single objective Q-learning formulation, our agent only receives the reward $R_P$ in the former case and the reward $R_U$ in the latter.
We experimented with more than 100 combinations of $\alpha$, $\gamma$, and $\epsilon$ to find the best hyperparameters. The agent calculates the average total reward and the clash rate for every combination of hyperparameters over episodes covering 463 unique states, learned over 300 epochs. We decayed the value of $\epsilon$ by a factor of 1.4 every 20 epochs. The set of parameters which gives us the highest average reward and the least number of clashes is chosen. The hyperparameters shown in Table I achieve the best results when our aim is to minimize the average number of clashes while meeting each objective individually.
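The decay schedule described above can be sketched as follows (the initial value of $\epsilon$ is an assumption for illustration):

```python
def decayed_epsilon(eps0, epoch, factor=1.4, every=20):
    """Exploration rate after `epoch` epochs, decayed by `factor`
    every `every` epochs, matching the schedule described above."""
    return eps0 / (factor ** (epoch // every))

# Epsilon over the 300 training epochs, starting from an assumed eps0 = 1.0.
eps_history = [decayed_epsilon(1.0, e) for e in range(300)]
```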
VI-B Single Policy Multiobjective
We divide the experiments for the multiobjective approaches into the two approaches discussed in Section IV: Equal Weights and Weighted-Sum. As shown in Line 11 of Algorithm 1, the update function for the Q-values differs from the normal Q-learning formulation: each objective keeps its own Q-values, and the greedy action is chosen with respect to the synthetic objective $TQ$. The Q-value update is given as:

$$Q_i(s,a) \leftarrow Q_i(s,a) + \alpha \left[ r_i + \gamma \, Q_i(s', a^{*}) - Q_i(s,a) \right], \quad a^{*} = \arg\max_{a'} TQ(s', a') \qquad (10)$$
For the equal weights approach, the weights $w_P$ and $w_U$ are both assigned the same value of $0.5$, representing equal priority for both objectives. For the Weighted-Sum approach, we perform experiments by taking a large number of combinations of $\alpha$, $\gamma$, $w_P$, and $w_U$, with their values within the range (0,1]. The best hyperparameters for the multiobjective approaches are listed in Table II.
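Putting the pieces together, the following is a hedged sketch of a single-policy multiobjective Q-learning loop on a toy two-objective environment. Since Algorithm 1 is not reproduced in the text, the structure (per-objective Q-tables with greedy actions chosen via the weighted sum $TQ$) is a reconstruction under that assumption, and all environment details and hyperparameter values are illustrative:

```python
import random

def morl_q_learning(num_states, num_actions, step, weights,
                    episodes=400, alpha=0.1, gamma=0.9,
                    epsilon=0.2, seed=0):
    """Single-policy multiobjective Q-learning sketch: one Q-table per
    objective, greedy actions via TQ = sum_i w_i * Q_i (Eq. 9), and each
    Q_i updated from its own reward r_i (the Eq. 10 style update)."""
    rng = random.Random(seed)
    n_obj = len(weights)
    Q = [[[0.0] * num_actions for _ in range(num_states)]
         for _ in range(n_obj)]

    def tq(s, a):
        return sum(w * Q[i][s][a] for i, w in enumerate(weights))

    for _ in range(episodes):
        s = 0
        for _ in range(10):
            # epsilon-greedy over the weighted-sum objective
            if rng.random() < epsilon:
                a = rng.randrange(num_actions)
            else:
                a = max(range(num_actions), key=lambda x: tq(s, x))
            rewards, s_next = step(s, a)   # one reward per objective
            a_star = max(range(num_actions), key=lambda x: tq(s_next, x))
            for i in range(n_obj):
                Q[i][s][a] += alpha * (rewards[i]
                                       + gamma * Q[i][s_next][a_star]
                                       - Q[i][s][a])
            s = s_next
    # Greedy policy under the weighted-sum objective.
    return [max(range(num_actions), key=lambda a: tq(s, a))
            for s in range(num_states)]

# Toy single-state environment with two conflicting objectives:
# action 0 pleases objective P, action 1 pleases objective U.
def toy_step(s, a):
    return ([1.0, 0.0] if a == 0 else [0.0, 1.0]), 0

policy_p = morl_q_learning(1, 2, toy_step, weights=[1.0, 0.0])
policy_u = morl_q_learning(1, 2, toy_step, weights=[0.0, 1.0])
```

Shifting the weight mass from one objective to the other flips the learned policy, which is the tradeoff behavior the experiments examine.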
VII Results
VII-A Average Power
To compare the four algorithms proposed to find an optimal policy, we ran them for an equal number of epochs using the best hyperparameters obtained for each. Each epoch consists of training steps and validation steps, and we plotted the average power for each epoch for comparison. Figure 4 shows the average reward for each algorithm.
The average reward is maximum for power consumption minimization because the policy is rewarded based on the difference between the predicted state and the user's next state, and it is lowest for user satisfaction maximization, due to the fact that moving from a high-power state to a low-power state would hurt the user's satisfaction, which is indeed the desired behavior. The plots for the multiobjective approaches always end up between the two single objective ones, representing the tradeoff between the two contrasting objectives.
VII-B Average Number of Clashes
Figure 5 shows the combined clash rate for all four algorithms. The experimental parameters are kept the same as in the previous section. Note that for power consumption minimization the clash rate is highest because no weightage is given to user satisfaction. If we deploy an agent with such a policy, the user will get agitated and will try to override the agent's actions, rendering it useless. On the other hand, an agent with user satisfaction maximization policies will not be helpful in optimizing power consumption. However, the clash rates for the multiobjective techniques lie between the two single objective approaches, and this clash rate can be adjusted using the weights based on user preferences.
VII-C Appliance-wise Clashes
We calculate the average number of clashes for each of the five appliances and plot them separately to observe the behavior of the proposed approaches. Figure 6 shows the clash rate for each device. The experimental parameters are kept the same as in Section VII-A.
The results for the appliances with three device modes are consistent with the overall results, with two exceptions: the furnace (two device modes) and the kitchen lights (five device modes).
For the furnace, the power consumption minimization approach does not behave as expected. The reason could be the irregular usage and collection of data, as a furnace is used only during colder seasons and the data we used for the experiments were collected over a span of three years. For the lights, all algorithms fetch nearly the same results. The reason may be that lights are used for prolonged periods, and there are not many fluctuations in the lights' modes of operation. Therefore, the clash rate coincides for user satisfaction, power consumption, and the combination of the two. Hence, the furnace and the lights have very little to contribute to the overall optimization. The results therefore suggest that devices that are used regularly, with several fluctuations in device modes at regular intervals, should be targeted for optimization.
Transferability on Other Smart Homes' Consumption Data
To show that the proposed framework can be applied to the power consumption data of multiple smart homes, we choose the best algorithm (the weighted sum approach) and run it on smart homes B and C from the same Smart* data set discussed in Section V-A.
Figure 7 shows that the rewards increase and the clash rate decreases as the number of episodes increases. Figures 7(a) and 7(b) show that the behavior is similar on all three smart homes' data.
VIII Conclusion
In this paper, we present a novel multiobjective reinforcement learning technique for power consumption optimization with the contrasting objectives of minimizing power consumption and maximizing user satisfaction. We show that both objectives, when considered together, yield the best policy. Our experimental results show that the proposed multiobjective techniques establish a tradeoff between the two objectives: the resulting policy achieves better user satisfaction than the power optimization policy, and better power consumption than the user satisfaction maximization policy. We show that the devices used regularly in smart homes should be the ones targeted for such optimization. Finally, we show that the experiments can be performed on other smart home data sets with similar results.
Footnotes
 http://traces.cs.umass.edu/index.php/Smart/Smart
References
 (2012) Smart*: an open data set and tools for enabling research in sustainable homes. SustKDD, August 2012. Cited by: §I, §V.
 (2019) Energy data facts. U.S. Department of Energy. External Links: Link Cited by: §I.
 (2019) Energysaving strategies for smart homes. Constellation. External Links: Link Cited by: §I.
 (2019) World electricity final consumption by sector, 19742017. IEA, Paris. External Links: Link Cited by: §I.
 (2006) Adaptive weighted sum method for multiobjective optimization: a new method for pareto front generation. Structural and multidisciplinary optimization 31 (2), pp. 105–116. Cited by: §I.
 (2014) Multiobjective reinforcement learning: a comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems 45 (3), pp. 385–398. Cited by: §IIB.
 (2011) A multiplegoal reinforcement learning method for complex vehicle overtaking maneuvers. IEEE Transactions on Intelligent Transportation Systems 12 (2), pp. 509–522. Cited by: §IVB2.
 (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, pp. 53–65. External Links: ISSN 0377-0427, Document Cited by: §VB2.
 (2013) Achieving autonomous power management using reinforcement learning. ACM Transactions on Design Automation of Electronic Systems (TODAES) 18 (2), pp. 1–32. Cited by: §I.
 (2018) Reinforcement learning: an introduction. MIT press. Cited by: §IIA.
 (2009) Adaptive power management using reinforcement learning. In 2009 IEEE/ACM International Conference on Computer-Aided Design - Digest of Technical Papers, pp. 461–467. Cited by: §I.
 (2008) Managing power consumption and performance of computing systems using reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1497–1504. Cited by: §I.
 (2011) Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning 84 (1–2), pp. 51–80. Cited by: §IV.
 (1992) Q-learning. Machine Learning 8 (3–4), pp. 279–292. Cited by: §I, §IIA.
 (2010) Selfadaptive multiobjective optimization method design based on agent reinforcement learning for elevator group control systems. In 2010 8th World Congress on Intelligent Control and Automation, pp. 2577–2582. Cited by: §IVB2.