Reinforcement Learning-Driven Test Generation for Android GUI Applications using Formal Specifications


Yavuz Koroglu (yavuz.koroglu@boun.edu.tr, ORCID 0000-0001-9376-0698) and Alper Sen (alper.sen@boun.edu.tr, ORCID 0000-0002-5508-6484), Bogazici University, Istanbul, Turkey
Abstract.

There have been many studies on automated test generation for mobile Graphical User Interface (GUI) applications. These studies successfully demonstrate how to detect fatal exceptions and achieve high code and activity coverage with fully automated test generation engines. However, it is unclear how many GUI functions these engines manage to test. Furthermore, these engines implement only implicit test oracles. We propose Fully Automated Reinforcement LEArning-Driven Specification-Based Test Generator for Android (FARLEAD-Android). FARLEAD-Android accepts a GUI-level formal specification as a Linear-time Temporal Logic (LTL) formula. By dynamically executing the Application Under Test (AUT), it learns how to generate a test that satisfies the LTL formula using Reinforcement Learning (RL). The LTL formula does not just guide the test generation but also acts as a specified test oracle, enabling the developer to define automated test oracles for a wide variety of GUI functions by changing the formula. Our evaluation shows that FARLEAD-Android is more effective and achieves higher performance in generating tests for specified GUI functions than three known approaches, Random, Monkey, and QBEa. To the best of our knowledge, FARLEAD-Android is the first fully automated mobile GUI testing engine that uses formal specifications.

software testing, mobile applications, specification-based testing, reinforcement learning, temporal logic, test oracle
arXiv, 2019. CCS Concepts: Theory of computation → Modal and temporal logics; Software and its engineering → Software testing and debugging; Computing methodologies → Reinforcement learning; Software and its engineering → Functionality

1. Introduction

Today, mobile GUI applications are ubiquitous, as there are more than 2.6 billion smartphone users worldwide (Piejko, 2016). A recent survey shows that 78% of mobile GUI application users regularly encounter bugs that cause the GUI application to fail at performing some of its functions (Bolton, 2017). Testing whether the GUI application performs its intended functions correctly is essential for mitigating this problem.

More than 80% of the applications in the mobile market are Android GUI applications (Gartner, 2018), and there are 2 million Android GUI applications in Google Play (GPLAYDATA, [n.d.]). Therefore, many studies propose fully automated test generation engines for Android GUI applications (Azim and Neamtiu, 2013; Anand et al., 2012; Moran et al., 2016; Cao et al., 2018; Li et al., 2017; Machiry et al., 2013; Mahmood et al., 2014; Yan et al., 2018; Amalfitano et al., 2015; Google, [n.d.]; Linares-Vásquez et al., 2015; Yang et al., 2013; Hao et al., 2014; Koroglu et al., 2018; Mao et al., 2016b; Su et al., 2017; Choi et al., 2013; Choi, [n.d.]; Koroglu and Sen, 2018; Mirzaei et al., 2016; Cao et al., 2019; Eler et al., 2018; Zaeem et al., 2014; Liu et al., 2014). These engines are adept at detecting fatal exceptions and achieving high code and activity coverage. However, it is unclear how many of the GUI functions they manage to test. In practice, an automated test generation engine may achieve high coverage but fail to test many essential GUI functions. Furthermore, engines in (Azim and Neamtiu, 2013; Moran et al., 2016; Cao et al., 2018; Li et al., 2017; Machiry et al., 2013; Yan et al., 2018; Amalfitano et al., 2015; Koroglu et al., 2018; Mao et al., 2016b; Su et al., 2017; Koroglu and Sen, 2018) only check fatal exceptions, (Mirzaei et al., 2016; Cao et al., 2019; Anand et al., 2012; Mahmood et al., 2014; Google, [n.d.]; Linares-Vásquez et al., 2015; Yang et al., 2013; Hao et al., 2014; Choi et al., 2013; Choi, [n.d.]) only target coverage, and (Eler et al., 2018; Zaeem et al., 2014; Liu et al., 2014) only focus on accessibility issues, app-agnostic problems, and energy bugs, respectively. All these approaches ignore other possible functional bugs.

To test if a GUI function works correctly, we must have an automated test oracle. All the test oracles mentioned until now (fatal exceptions, accessibility issues, app-agnostic problems, and energy bugs) are implicit (Barr et al., 2014), meaning that they are implemented using some implied assumptions and conclusions. Since there can be many different GUI functions, implementing implicit oracles for every GUI function is impractical. Therefore, we must use specified test oracles where each test oracle is associated with a formal specification. This way, the problem reduces to checking if the generated test satisfies the formal specification. However, checking if a test satisfies a formal specification is not enough to fully automate the test generation process. We must also develop an efficient method to generate the test.

In this study, we use Reinforcement Learning (RL) to generate a test that satisfies a given GUI specification. RL is a semi-supervised machine learning methodology that has driven impressive advances in artificial intelligence in recent years, from resource management (Mao et al., 2016a), traffic light control (Arel et al., 2010), and chemistry (Zhou et al., 2017) to exceeding human performance in sophisticated games such as chess (Silver et al., 2017) and Atari (Mnih et al., 2013). In RL, an RL agent dynamically learns to perform its task by trial and error. After every action, the RL agent receives an immediate reward from the environment. This reward can be positive, negative, or zero, meaning that the last decision of the RL agent was good, bad, or neutral, respectively. Decisions made according to the RL agent’s experience are said to follow the agent’s policy. After enough iterations, the RL agent becomes proficient in its task. At this point, the RL agent is said to have converged to its optimal policy. Upon convergence, the RL agent is said to have minimized its expected number of bad decisions in the future. The RL agent requires no prepared training data, which decreases the manual effort spent preparing it. This makes RL attractive among machine learning methods.

Through dynamic execution, the RL agent learns from positive and negative rewards, on-the-fly for every action taken. Typically, an RL agent is trained to keep getting positive rewards and avoid the negative ones, indefinitely. Instead, our goal is to generate one satisfying test for a specified test oracle and terminate, which requires much less training than typical RL use cases. We exploit this fact to develop a test generator with low execution costs, which is crucial for dynamic execution tools.

Figure 1. FARLEAD-Android Overview

We call our test generator Fully Automated Reinforcement LEArning-Driven Specification-Based Test Generator for Android (FARLEAD-Android) and show its overview in Figure 1. First, FARLEAD-Android requires an Android Device where the Application Under Test (AUT) is installed. Second, the developer provides a Linear-time Temporal Logic (LTL) formula as a formal specification. rlearner takes this LTL formula as input and performs RL by sending replayable actions to the Android Device and then receiving observations from the Android Device. The replayable action sequence generated during this process is called a test. rlearner checks if the generated test satisfies the LTL formula and accordingly calculates a reward after every action. Finally, as soon as rlearner finds a satisfying test, it outputs the test and terminates.

FARLEAD-Android generates tests through dynamic execution. Hence, there is an execution cost for every action. To reduce the execution cost as much as possible, we implement several improvements which we describe in the following paragraphs.

Traditional RL approaches get positive rewards only when the objective is satisfied. However, since we are interested in test generation we terminate as soon as we find a satisfying test, which restricts a typical RL agent to learning only from negative rewards. We propose LTL-based Reward Shaping, which generates positive intermediate rewards. To the best of our knowledge, ours is the first work that combines LTL and Reward Shaping.

Actions enabled in a GUI state directly affect the RL performance. If there are too many enabled actions, it takes a long time for FARLEAD-Android to generate a satisfying test. We demonstrate that if the LTL formula specifies properties about not just the observations but also the actions, we can predict whether the LTL formula is going to be satisfied before action execution, saving time. We call this improvement Reward Prediction.

We argue that the satisfaction of an LTL formula may not just depend on the current GUI state, but also on previous GUI states. In that case, the same state-action pair may get conflicting rewards and confuse a typical RL agent. To address this issue, we propose to encode GUI states as a sequence of action-state pairs which we call tails. Then, we encode actions as tail-action pairs which we call decisions. We put a maximum tail length to avoid state explosion.

It is common to start RL with indifference to all available decisions. This indifference leads to the random exploration of all new GUI states. We propose to use the knowledge about previously encountered actions in new GUI states by learning what we call stateless values. We call this novel stateless learning improvement Action Label Learning.

In short, our main contributions in this paper are as follows.

  1. To the best of our knowledge, FARLEAD-Android is the first fully automated mobile GUI test generation engine that uses formal specifications.

  2. We implement four novel improvements in RL: LTL-based Reward Shaping, Reward Prediction, Tails/Decisions, and Action Label Learning.

  3. We evaluate FARLEAD-Android via experiments on two applications from F-Droid. We show that our approach is more effective and achieves higher performance in generating satisfying tests than three known test generation approaches, namely Monkey, Random, and QBEa.

We give background information, describe our approach, provide an example, make evaluations, explain related work, discuss issues, and conclude in Sections 2-8, respectively.

2. Background

We now provide information on Android Devices and Linear-time Temporal Logic (LTL) in Sections 2.1 and 2.2, respectively.

2.1. Android Devices

We analyze an Android Device as a system that takes GUI actions as input and produces GUI states as output. A GUI state, or state, is the set of observed attributes about (1) the active package, (2) the focused activity, (3) the widgets on the screen, and (4) contextual properties such as the screen being on or off. In any GUI state, the user interacts with the Android Device using click and other gestures, which we collectively call GUI actions or actions. A GUI action is either (1) widget-dependent, such as click and text, or (2) widget-independent, such as back, which presses the hardware back button. Due to widget-dependency, every GUI state has its own set of enabled actions. Note that a GUI action may have parameters. For example, a click requires two coordinates.

Initially, we assume that the Android Device is in a don’t care state. In a don’t care state, we allow only reinitialize actions. A reinitialize action reinstalls the AUT and starts one of its launchable activities, given as a parameter. We do not allow any reinitialize action in any other GUI state. We describe how we decide to restart the AUT at a GUI state other than the don’t care state in Section 3.

A dynamic execution trace, or just trace, is a finite sequence of action-state pairs where the first action is a reinitialize action. Note that, in this definition, we omit the don’t care state, which is always the initial state. A test is the finite action sequence that generates the GUI states in the trace.

In Figure 1, we make three kinds of observations on the Android Device. We observe (1) the set of currently enabled actions, (2) the resulting state after executing the selected action, and (3) the set of atomic propositions that currently evaluate to true. An atomic proposition that evaluates to true is called a label. A label is an action label if it is an observation on a GUI action; otherwise, it is an observation on a state, so it is a state label. The sets of all observed state labels and action labels are called the state labeling and the action labeling, respectively.

Formally, we represent the observations as an Observation Triplet consisting of (1) a function that returns the set of enabled actions in the current state, (2) a function that returns the next state after executing the selected action, and (3) a labeling function for GUI states and actions. The Observation Triplet defines the conditions necessary to perform RL and serves as the interface between the Android Device and rlearner.
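
To make this interface concrete, the following Python sketch shows one possible shape of the Observation Triplet; the class and method names are our own illustration, not the actual FARLEAD-Android API.

from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Action:
    """A GUI action such as click, text, back, or reinitialize."""
    kind: str                        # the action type, e.g. "click"
    params: Tuple[str, ...] = ()     # the action parameters, e.g. coordinates


@dataclass(frozen=True)
class State:
    """A GUI state: the observed attributes of the device."""
    package: str                     # the active package
    activity: str                    # the focused activity
    widgets: FrozenSet[str] = frozenset()
    screen_on: bool = True           # a contextual property


class ObservationTriplet:
    """Hypothetical interface between the Android Device and rlearner."""

    def enabled_actions(self, state: State) -> List[Action]:
        """Return the actions enabled in the current state."""
        raise NotImplementedError

    def execute(self, action: Action) -> State:
        """Execute the action on the device and return the resulting state."""
        raise NotImplementedError

    def labeling(self, state: State, action: Action) -> FrozenSet[str]:
        """Return the labels (true atomic propositions) for the state and action."""
        raise NotImplementedError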

2.2. Linear-time Temporal Logic (LTL)

For a finite trace $\sigma$ of length $n$, where $L_i$ denotes the labeling observed at position $i$:

$\sigma, i \models p$ iff $p \in L_i$
$\sigma, i \models \neg\varphi$ iff $\sigma, i \not\models \varphi$
$\sigma, i \models \varphi_1 \wedge \varphi_2$ iff $\sigma, i \models \varphi_1$ and $\sigma, i \models \varphi_2$
$\sigma, i \models \mathbf{X}\,\varphi$ iff $i < n$ and $\sigma, i+1 \models \varphi$
$\sigma, i \models \varphi_1\,\mathbf{U}\,\varphi_2$ iff $\exists k$, $i \le k \le n$, s.t. $\sigma, k \models \varphi_2$ and $\forall j$, $i \le j < k$, $\sigma, j \models \varphi_1$
$\sigma \models \varphi$ iff $\sigma, 1 \models \varphi$

Table 1. Pointwise LTL Semantics

Equation (1) defines the syntax of an LTL formula $\varphi$, where $p$ is an atomic proposition; disjunction ($\vee$) and the eventually ($\mathbf{F}$) and globally ($\mathbf{G}$) operators are derived from this core syntax as usual.

$\varphi ::= \top \mid p \mid \neg\varphi \mid \varphi_1 \wedge \varphi_2 \mid \mathbf{X}\,\varphi \mid \varphi_1\,\mathbf{U}\,\varphi_2$   (1)

We interpret an LTL formula over a (finite) trace using the pointwise semantics in Table 1. A test is satisfying if and only if its execution trace satisfies the formula.
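
As a minimal sketch of these finite-trace semantics, the following Python function evaluates a formula at a position of a trace; the tuple-based encoding of formulae is our own assumption.

from typing import FrozenSet, Sequence, Tuple

Formula = Tuple                    # e.g. ("U", ("ap", "main"), ("ap", "about")) encodes main U about
Trace = Sequence[FrozenSet[str]]   # the labeling observed at each position


def holds(phi: Formula, trace: Trace, i: int = 0) -> bool:
    """Pointwise satisfaction of phi by a finite trace at (0-based) position i."""
    op = phi[0]
    if op == "ap":                               # atomic proposition: check the labeling
        return phi[1] in trace[i]
    if op == "not":
        return not holds(phi[1], trace, i)
    if op == "and":
        return holds(phi[1], trace, i) and holds(phi[2], trace, i)
    if op == "or":
        return holds(phi[1], trace, i) or holds(phi[2], trace, i)
    if op == "X":                                # next: there must be a next position
        return i + 1 < len(trace) and holds(phi[1], trace, i + 1)
    if op == "U":                                # until: phi2 eventually, phi1 before that
        return any(holds(phi[2], trace, k) and
                   all(holds(phi[1], trace, j) for j in range(i, k))
                   for k in range(i, len(trace)))
    raise ValueError("unknown operator: " + str(op))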

3. Methodology

1: Require:
2:     the Observation Triplet
3:     the maximum number of episodes and the maximum number of steps
4: Ensure:
5:     a satisfying test, or none if no satisfying test is found
6:
7: initializeVariables()
8: set the episode index to one
9: repeat
10:     set the position index to one
11:     put the Android Device into a don’t care state
12:     initializeEpisode()
13:     repeat
14:         decide the next action with decideNextAction
15:         execute the action and observe the resulting state
16:         calculate the immediate reward with calculateImmediateReward
17:         learn from the immediate reward with learn
18:         increment the position index
19:     until the immediate reward is one or minus one, or the position index exceeds the maximum number of steps
20:     makePostEpisodeUpdates()
21:     record the action sequence of this episode as the test
22:     increment the episode index
23: until the immediate reward is one or the episode index exceeds the maximum number of episodes
Algorithm 1 A General RL Framework for rlearner

Reinforcement Learning (RL) (Sutton and Barto, 1998) is a semi-supervised machine learning methodology. The main idea of RL is to calculate a reward at every state, so previous rewards drive future decisions.

In Figure 1, rlearner is responsible for calculating rewards from observations and generating a satisfying test. We describe a general RL framework for rlearner in Algorithm 1, with six procedures explained in later sections. Algorithm 1 has three requirements. First, it requires the Observation Triplet, which is the interface between the Android Device and rlearner. Second, it requires a maximum number of episodes, which is equal to the maximum number of tests generated before termination. If this limit is too small, rlearner may often terminate without finding a satisfying test. If it is too large, rlearner may spend too much time, especially on unsatisfiable specifications. The third requirement is the maximum number of steps. At every step, rlearner executes exactly one GUI action, so this limit is equal to the maximum trace length. If it is too small, rlearner may not find any satisfying test. If it is too large, rlearner may generate unnecessarily long tests.

We initialize the variables required for RL in Line 7 of Algorithm 1. We also initialize an episode index and a position index in Lines 8 and 10, respectively. Every episode starts from a don’t care state, as shown in Line 11. Before every episode, we update RL-related variables in Line 12. We divide an episode into at most the maximum number of steps. At every step, we (1) decide an action in Line 14, (2) execute the action and obtain the resulting state in Line 15, (3) calculate a reward from the labelings observed at the current position in Line 16, (4) learn from the immediate reward in Line 17, and finally (5) increment the position index in Line 18. If the immediate reward is one, we understand that the test satisfies the given specification. If the immediate reward is minus one, we understand that the latest action made it impossible to satisfy the given specification, no matter which actions we choose from now on. We say that a test is a dead-end if one of its actions gets a reward of minus one. We end the episode prematurely if we get either one or minus one as a reward in Line 19. Otherwise, we continue until we reach the maximum number of steps. After every episode, we update RL-related variables in Line 20. The test is the action sequence generated during the last episode, as shown in Line 21. We increment the episode index in Line 22. We continue to a new episode until either the immediate reward is one, which means that the test is satisfying, or we reach the maximum number of episodes.
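
For concreteness, the following Python skeleton mirrors the structure of Algorithm 1; the helper objects it takes stand in for the procedures of Sections 3.1-3.3, and all names are our own assumptions.

def generate_satisfying_test(obs, formula, reinit_action, policy, reward_fn,
                             learner, max_episodes, max_steps):
    """Sketch of Algorithm 1 (names assumed): return a satisfying test or None.

    obs       -- the Observation Triplet (enabled_actions / execute / labeling)
    policy    -- callable(state, enabled_actions) -> action            (Section 3.1)
    reward_fn -- callable(formula, labels) -> (reward, new_formula)    (Section 3.2)
    learner   -- object with start_episode / learn / end_of_episode    (Section 3.3)
    """
    for episode in range(max_episodes):
        state = obs.execute(reinit_action)        # every test starts with a reinitialize action
        learner.start_episode()
        phi, test, reward = formula, [reinit_action], 0.0
        for step in range(max_steps):
            action = policy(state, obs.enabled_actions(state))
            next_state = obs.execute(action)
            reward, phi = reward_fn(phi, obs.labeling(next_state, action))
            learner.learn(state, action, reward, next_state)
            test.append(action)
            state = next_state
            if reward in (1.0, -1.0):             # satisfied, or a dead-end was reached
                break
        learner.end_of_episode()                  # annealing and decay (Line 21 of Algorithm 4)
        if reward == 1.0:
            return test                           # the episode's action sequence satisfies the formula
    return None                                   # give up after max_episodes episodes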

We now explain FARLEAD-Android by implementing the procedures in Algorithm 1, through Sections 3.1-3.3.

3.1. Policy

1: procedure decideNextAction(the current state)
2:     for all actions enabled in the current state do
3:         compute the action’s selection probability from a softmax (Boltzmann) distribution over its Q-values with the current temperature
4:     end for
5:     return a uniformly random enabled action with probability ε; otherwise, sample an action from the softmax distribution
6: end procedure
Algorithm 2 ε-greedy Softmax Policy with Temperature

A policy is a procedure that decides on a next action from the set of currently enabled actions in the current state. Traditionally, the ε-greedy and the softmax policies are the most common policies in RL (Alpaydin, 2014, p.525). We unify the softmax and ε-greedy policies in Algorithm 2. This policy implements the procedure given in Line 14 of Algorithm 1. In the policy, ε is the probability of deciding the action completely at random, and the temperature variable controls how much the policy trusts the learned values. High temperatures indicate low trust and vice versa. Intuitively, it makes sense to start with a high temperature and gradually decrease it (increase trust) as the learning continues. This approach is called annealing. We describe quality functions, Q-functions in short, to explain our policy. At every step, RL learns from the immediate reward by updating a quality function that maps state-action pairs to Q-values. Using two Q-functions instead of one is an improvement known as Double Q-Learning, which improves learning performance by reducing maximization bias (overestimation). The Q-values of these Q-functions imply a probability distribution over the enabled actions, as shown in Line 3. We sample a GUI action from this probability distribution in Line 5.
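
A minimal Python sketch of this combined policy is given below; averaging the two Q-functions before the softmax is our assumption, since the exact combination is not restated here.

import math
import random


def decide_next_action(q1, q2, state, enabled_actions, epsilon, temperature):
    """epsilon-greedy softmax policy with temperature (sketch; details assumed).

    q1, q2 -- look-up tables mapping (state, action) pairs to Q-values
    """
    if random.random() < epsilon:                        # explore completely at random
        return random.choice(enabled_actions)
    combined = [(q1.get((state, action), 0.0) + q2.get((state, action), 0.0)) / 2.0
                for action in enabled_actions]
    weights = [math.exp(q / max(temperature, 1e-6)) for q in combined]   # Boltzmann weights
    return random.choices(enabled_actions, weights=weights, k=1)[0]      # sample from the softmax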

Reward Prediction.

It is trivial to see that the number of enabled actions directly affects learning performance, so it is essential to keep this number as small as possible. Therefore, we modify Algorithm 1 by adding three operations immediately before Line 14. First, thanks to the Observation Triplet, we know the action labeling of every enabled action before executing it. We use this fact to eliminate from the set of enabled actions all GUI actions that would make the specification false. Second, if the set of enabled actions becomes empty, we understand that all actions in the current state lead to dead-ends, so we conclude that the previous state-action pair leads to a dead-end. Therefore, we terminate the episode and give a reward of minus one to the previous state-action pair. Finally, if the set was not empty and we find an action that makes the specification true, we immediately take that action and terminate FARLEAD-Android, since we have found a satisfying test. If we cannot find such an action, we continue from Line 14. We call this new improvement Reward Prediction.
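
The following sketch illustrates Reward Prediction as a filter applied before Line 14; the reward function signature follows the skeleton above, and predicting from action labels alone is a simplification of what the Observation Triplet provides.

def filter_actions_by_predicted_reward(phi, enabled_actions, action_labeling, reward_fn):
    """Reward Prediction (sketch): inspect action labels before executing anything.

    Returns (satisfying_action, filtered_actions, dead_end):
      satisfying_action -- an action predicted to satisfy the formula on its own, or None
      filtered_actions  -- the actions not predicted to make the formula false
      dead_end          -- True if every enabled action is predicted to be a dead-end
    """
    filtered = []
    for action in enabled_actions:
        predicted, _ = reward_fn(phi, action_labeling(action))
        if predicted == 1.0:             # take it immediately: a satisfying test is found
            return action, filtered, False
        if predicted > -1.0:             # keep only actions that do not falsify the formula
            filtered.append(action)
    return None, filtered, len(filtered) == 0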

3.2. Rewards

1: procedure calculateImmediateReward(the current LTL formula, the observed labeling)
2:     obtain a new LTL formula with projection(formula, labeling)
3:     return one if the new formula is true, minus one if it is false, and otherwise a shaped reward between zero and one based on the distance between the old and the new formula
4: end procedure
5:
6: procedure projection(formula, labeling)
7:     return advance(restrict(expand(formula), labeling))
8: end procedure
9:
10: procedure expand(a formula in LTL)
11:     return the formula with the expansion law (2) applied to all until operators
12: end procedure
13:
14: procedure restrict(formula, labeling)
15:     return the formula with its unprotected atomic propositions substituted by true or false according to the labeling
16: end procedure
17:
18: procedure advance(a formula in LTL)
19:     return the formula with one round of next operators removed
20: end procedure
Algorithm 3 LTL-based Reward Calculation with Reward Shaping

The immediate reward is typically either one, zero, or minus one, depending on whether the RL agent achieved, has not yet achieved, or cannot achieve its objective anymore in the current episode, respectively. The fact that FARLEAD-Android terminates after receiving the positive reward restricts a typical agent to learning only from negative rewards.

Reward Shaping.

We develop an LTL-based Reward Shaping approach that enables us to produce intermediate rewards between zero and one in Algorithm 3, so FARLEAD-Android can receive positive rewards before termination. This algorithm implements the procedure in Line 16 of Algorithm 1.

First, in Line 2, we obtain a new LTL formula from the current LTL formula by using the labeling. We describe how we get this new formula with an example in Section 4. We return one or minus one if the new LTL formula is true or false, respectively. Otherwise, we return a reward between zero and one, depending on a distance metric between the current and the new formula. We measure the distance using a function that returns the number of atomic propositions in a formula. The distance reflects the amount of change between the two formulae and is zero if there is no change. As a final note, an enabling parameter, which takes either one or zero, enables or disables Reward Shaping, respectively.

We now explain the projection procedure, which we divide into three sub-procedures: (1) expand, (2) restrict, and (3) advance. In expand, we apply the expansion law to all until operators in the formula. We present the expansion law in Equation (2). In restrict, we substitute the atomic propositions with true or false according to the labeling. Finally, in advance, we eliminate one round of next operators.

$\varphi_1\,\mathbf{U}\,\varphi_2 \;\equiv\; \varphi_2 \,\vee\, \bigl(\varphi_1 \wedge \mathbf{X}(\varphi_1\,\mathbf{U}\,\varphi_2)\bigr)$   (2)
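
A compact Python sketch of the projection step and the shaped reward is given below; it reuses the tuple encoding from Section 2.2, and the exact form of the distance-based reward is our assumption, since the text only requires it to lie between zero and one.

TRUE, FALSE = ("true",), ("false",)


def expand(phi):
    """Apply the expansion law (2) to every until operator."""
    if phi[0] == "U":
        left, right = expand(phi[1]), expand(phi[2])
        return ("or", right, ("and", left, ("X", ("U", left, right))))
    if phi[0] in ("not", "X"):
        return (phi[0], expand(phi[1]))
    if phi[0] in ("and", "or"):
        return (phi[0], expand(phi[1]), expand(phi[2]))
    return phi                                   # atomic propositions and constants


def _fold(op, left, right):
    """Constant-fold a binary and/or node."""
    if op == "and":
        if FALSE in (left, right):
            return FALSE
        if left == TRUE:
            return right
        if right == TRUE:
            return left
    else:
        if TRUE in (left, right):
            return TRUE
        if left == FALSE:
            return right
        if right == FALSE:
            return left
    return (op, left, right)


def restrict(phi, labels):
    """Substitute atomic propositions that are not protected by a next operator."""
    if phi[0] == "ap":
        return TRUE if phi[1] in labels else FALSE
    if phi[0] == "X":
        return phi                               # the next state's sub-formula stays untouched
    if phi[0] == "not":
        sub = restrict(phi[1], labels)
        if sub in (TRUE, FALSE):
            return FALSE if sub == TRUE else TRUE
        return ("not", sub)
    if phi[0] in ("and", "or"):
        return _fold(phi[0], restrict(phi[1], labels), restrict(phi[2], labels))
    return phi


def advance(phi):
    """Remove one round of next operators."""
    if phi[0] == "X":
        return phi[1]
    if phi[0] == "not":
        return ("not", advance(phi[1]))
    if phi[0] in ("and", "or"):
        return _fold(phi[0], advance(phi[1]), advance(phi[2]))
    return phi


def atom_count(phi):
    """Number of atomic propositions in a formula (the distance measure)."""
    if phi[0] == "ap":
        return 1
    return sum(atom_count(sub) for sub in phi[1:] if isinstance(sub, tuple))


def calculate_immediate_reward(phi, labels, shaping=1.0):
    """Progress phi under the labels; return the reward and the progressed formula."""
    new_phi = advance(restrict(expand(phi), labels))
    if new_phi == TRUE:
        return 1.0, new_phi                      # the specification is satisfied
    if new_phi == FALSE:
        return -1.0, new_phi                     # dead-end: the specification became false
    progress = max(atom_count(phi) - atom_count(new_phi), 0)
    return shaping * progress / max(atom_count(phi), 1), new_phi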

3.3. Learning

1: procedure initializeVariables
2:     initialize the two Q-functions to all zeros and set the temperature, the random decision probability, and the learning rate to their initial values
3: end procedure
4:
5: procedure initializeEpisode
6:     initialize the eligibility trace to all zeros
7: end procedure
8:
9: procedure learn(the state, the executed action, the immediate reward, the resulting state)
10:     set the update value to the immediate reward    ▷ Myopic Update Equation
11:     set the eligibility of the current state-action pair to one
12:     for all state-action pairs with non-zero eligibility do
13:         add the update value multiplied by the learning rate and the eligibility to the pair’s Q-value, keeping the Q-value bounded
14:         mix the second Q-function with the first according to the doubleness ratio
15:         multiply the pair’s eligibility by the eligibility discount rate
16:     end for
17:     swap the two Q-functions with probability one half
18: end procedure
19:
20: procedure makePostEpisodeUpdates
21:     decrease the temperature, the random decision probability, and the learning rate
22: end procedure
Algorithm 4 Learning Procedures

Traditionally, RL stores the quality function as a look-up table. Whenever it executes an action at a state, it adds an update value multiplied by a learning rate to the corresponding Q-value. The update value commonly depends on three terms: (1) the immediate reward, (2) the previous Q-value of the current state-action pair, and (3) a Q-value for a next state-action pair. The third term allows future Q-values to backpropagate one step at a time. Instead of the third term, an eligibility trace, which is a well-known improvement in RL, allows for faster backpropagation. The main idea of using the eligibility trace is to remember previous state-action pairs and update all of them at once, directly backpropagating multiple steps at a time.

We describe all the remaining procedures of Algorithm 1 in Algorithm 4. We initialize five variables in Line 2: (1) the first Q-function, (2) the second Q-function, (3) the temperature, (4) the random decision probability, and (5) the learning rate. Note that both Q-functions initially map every state-action pair to zero. Before every episode, we initialize an eligibility trace to all zeros in Line 6.

The learn procedure in Line 9 requires four parameters: (1) the GUI state, (2) the executed action, (3) the immediate reward, and (4) the resulting state. We give our update equation in Line 10. Our update equation does not take any future Q-values into account, so it does not use the resulting state. This is why our approach is also called myopic learning.

We ensure that the current state-action pair has non-zero eligibility in Line 11. We perform the update on all eligible state-action pairs by adding the update value multiplied by the learning rate and the eligibility to the Q-value in Line 13. We force all Q-values to stay within a fixed range. This idea is known as vigilance in adaptive resonance theory, and it prevents the accumulation of too large Q-values. Our experience shows that too large Q-values may prevent further learning. A doubleness ratio controls how the two Q-functions relate in Line 14. If it is zero, we ensure that the second Q-function equals the first, and therefore Double-Q is disabled. If it is one, the second Q-function remains completely disjoint from the first, and therefore Double-Q is enabled. We can also take a value between zero and one to perform mixed Double-Q learning. We ensure that the eligibility trace updates closer state-action pairs more by multiplying it with an eligibility discount rate in Line 15. We swap the two Q-functions half of the time in Line 17 to perform Double-Q learning. Note that if the doubleness ratio is zero, the two Q-functions are equal, and therefore this swap becomes redundant.
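
The following Python sketch captures the learn procedure with an eligibility trace and (mixed) Double-Q updates; the clamping range and the mixing form of the doubleness ratio are our assumptions based on the description above.

import random


class MyopicLearner:
    """Sketch of the learn procedure in Algorithm 4 (details are assumptions)."""

    def __init__(self, learning_rate=0.5, eligibility_discount=0.9, doubleness=1.0):
        self.q1, self.q2 = {}, {}         # look-up tables: (state, action) -> Q-value
        self.eligibility = {}             # (state, action) -> eligibility
        self.alpha = learning_rate
        self.discount = eligibility_discount
        self.doubleness = doubleness      # 0 disables, 1 enables full Double-Q

    def start_episode(self):
        self.eligibility = {}             # Line 6: reset the eligibility trace

    def learn(self, state, action, reward, next_state):
        update = reward                   # Line 10: myopic update, next_state is unused
        self.eligibility[(state, action)] = 1.0          # Line 11
        for pair, e in self.eligibility.items():         # Line 12: all eligible pairs
            q = self.q1.get(pair, 0.0) + self.alpha * e * update
            self.q1[pair] = max(-1.0, min(1.0, q))       # Line 13: update and clamp (assumed range)
            self.q2[pair] = (self.doubleness * self.q2.get(pair, 0.0)   # Line 14: doubleness mix
                             + (1.0 - self.doubleness) * self.q1[pair])
            self.eligibility[pair] = e * self.discount   # Line 15: decay the eligibility
        if random.random() < 0.5:                        # Line 17: swap for Double-Q
            self.q1, self.q2 = self.q2, self.q1

    def end_of_episode(self):
        self.alpha *= 0.99                # Line 21: decay (temperature and epsilon decay in the policy)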

Finally, after every episode, in Line 21, we decrease (1) the temperature by a stepwise temperature difference to perform annealing, (2) the random decision probability by an update factor to follow the traditional ε-greedy strategy, and (3) the learning rate by another update factor to prevent forgetting previously learned Q-values.

Action Label Learning.

A Q-value encodes information about only a single state-action pair, and therefore it is useful only if that state is the current state. Action Label Learning allows us to reuse Q-values in new GUI states. The main idea is to learn a pair of stateless Q-values for action labelings immediately before Line 13 of Algorithm 4. Whenever we are about to observe a new GUI state in Line 10, we initialize the Q-value of the new state-action pair from the stateless Q-value of its action labeling immediately before that line. Immediately before Line 14, we update the stateless Q-values in the same way as the regular Q-values. In Line 17, we also swap the two stateless Q-functions with the same probability.
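
A minimal sketch of this stateless look-up, assuming that unseen state-action pairs are seeded from Q-values keyed only by the action labeling:

def q_value_with_action_labels(q, q_labels, state, action, action_labeling):
    """Return a Q-value, falling back to the stateless (label-keyed) table.

    q        -- regular table keyed by (state, action)
    q_labels -- stateless table keyed by the action labeling alone
    """
    key = (state, action)
    if key not in q:                             # a new GUI state: reuse label knowledge
        q[key] = q_labels.get(action_labeling(action), 0.0)
    return q[key]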

Tails/Decisions.

So far, we assumed that we always produce the same immediate reward at a state regardless of how we reached that particular state, which is not always true. We propose the Tails/Decisions approach to mitigate this issue. In this approach, we replace the set of states with a finite set of tails, bounded by a maximum tail length, and the set of actions with a set of decisions. This way, we can learn different Q-values for the same state-action pair reached through different paths. Note that the maximum tail length must be small to avoid state explosion.
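
One way to realize tails is to key the Q-tables by a bounded history instead of a single state, as in the sketch below; the concrete encoding and the bound are our assumptions.

from collections import deque

MAX_TAIL_LENGTH = 2        # keep small to avoid state explosion (assumed value)


def make_tail_tracker(max_len=MAX_TAIL_LENGTH):
    """Track the last few action-state pairs and expose them as a hashable tail."""
    history = deque(maxlen=max_len)

    def push(action, state):
        history.append((action, state))
        return tuple(history)          # the tail: usable as a Q-table key

    return push


# A decision is then simply a (tail, action) pair used as the Q-table key:
# q[(tail, action)] instead of q[(state, action)].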

4. Example

We now describe how FARLEAD-Android works with a small example. Our Application Under Test (AUT) for this example is ChessWalk, a chess game. The GUI function that we are testing is the ability to go from MainActivity to AboutActivity and then return to MainActivity. We specify this GUI function as an LTL formula over two atomic propositions that are true if and only if the name of the current activity matches the word Main and the word About, respectively. According to this formula, the activity of the second GUI state must be MainActivity until it is AboutActivity, and then it must be AboutActivity until it is MainActivity. Note that we impose this condition on the second GUI state and not the first because the first GUI state is a don’t care state.

In our example, FARLEAD-Android finds a satisfying test in three episodes. We provide the details of these episodes in Table 2 and give the screenshots for these episodes in Figures 2-4. For every episode, Table 2 shows the episode index and the initial LTL formula, and, for every step, the position index, the selected action, the labeling observed after executing it, the formula projected from the previous formula, and the immediate reward. The reinitialize action is the only choice at a don’t care state, so every episode begins with reinit chesswalk MainActivity. The labeling after this action contains only the Main proposition because the resulting activity matches the word Main. We calculate the projection of the formula as in Algorithm 3. Because the outermost operator of the formula is a next operator, the formula states nothing about the current state and starts from the next state; we only advance one step by removing a single level of next operators. The projection does not reduce the number of atomic propositions in the formula, so the immediate reward is zero. The zero reward means that the reinitialize action is neither good nor bad for satisfying the formula.

Figure 2. Episode i = 1

Figure 3. Episode i = 2

In the first episode, FARLEAD-Android has no idea which action is going to satisfy the specification, so it chooses a random action, pauseresume. This action only pauses and then resumes the AUT, so we reach the same state, as shown in Figure 2. The labeling again contains only the Main proposition. This time, we expand the formula as in Equation (3).

(3)

The labeling means that Main holds and About does not hold in the current state. Hence, we restrict the formula accordingly, as in Equation (4). Note that we do not replace atomic propositions protected by a next operator.

(4)

After we minimize the resulting formula and advance one step, we obtain a formula whose number of atomic propositions is unchanged, so we again calculate the immediate reward as zero.

Finally, after randomly executing back, we leave the AUT, as shown in Figure 2, so the current activity is neither MainActivity nor AboutActivity. Therefore, the labeling is empty and the projected formula becomes false. We calculate the immediate reward as minus one and terminate this episode.

In the second episode, FARLEAD-Android opens AboutActivity but fails to return to MainActivity, as shown in Figure 3. Though it gets minus one in the end, it also obtains a positive intermediate reward for opening AboutActivity. This intermediate reward instructs FARLEAD-Android to explore the second action again in the next episode.

In the final episode, FARLEAD-Android again opens AboutActivity and gets the positive intermediate reward as before. This time, it finds the correct action, back, and returns to MainActivity, as shown in Figure 4. Therefore, it receives a reward of one and terminates. The action sequence generated in the final episode is the satisfying test for the specification.

Figure 4. Episode i = 3
Episode 1: reinit chesswalk MainActivity; pauseresume; back
Episode 2: reinit chesswalk MainActivity; click 239 669; click 143 32
Episode 3: reinit chesswalk MainActivity; click 239 669; back
Table 2. FARLEAD-Android Example

5. Evaluation

In this section, we demonstrate the effectiveness and the performance of FARLEAD-Android through experiments on a VirtualBox guest with the Android 4.4 operating system and a 480x800 screen resolution. In evaluation, a virtual machine is better than a physical Android device because (1) anyone can reproduce our experiments without the physical device, and (2) even if a physical device is available, it must be identical to the original device to produce the same results.

We downloaded two applications from F-Droid, namely ChessWalk and Notes. F-Droid (Gultnieks, 2010) is an Android GUI application database, and many Android testing studies use it. We find F-Droid useful because it provides old versions and bug reports of the applications.

AUT: ChessWalk
App-Agnostic: The user must be able to go to AboutActivity and return back. Levels: (a), (b), (c).
App-Agnostic: The user must be able to go to SettingsActivity and return back. Levels: (a), (b), (c).
App-Agnostic: Pausing and resuming the AUT should not change the screen. Levels: (a) N/A, (b), (c).
Bug Reports: The AUT should prevent the device from sleeping but it does not. Levels: (a), (b), (c).
App-Agnostic: When the user changes some settings, the AUT should remember it later. Levels: (a), (b), (c).
Manually Created: The user must be able to start a game and make a move. Levels: (a), (b), (c).
Novel Bug: When the user starts a second game, the moves of the first game are shown. Levels: (a), (b), (c).

AUT: Notes
Novel Bug: The sketch note palette does not show one of the 20 color choices (black is missing). Levels: (a), (b), (c).
Bug Reports: Even if the user cancels a note, a dummy note is still created. Levels: (a) N/A, (b), (c).

Table 3. Our Experimental Specifications

Table 3 shows the GUI-level specifications we obtained for the ChessWalk and Notes applications. These specifications come from four sources: (1) app-agnostic test oracles (Zaeem et al., 2014), (2) bug reports in the F-Droid database, (3) novel bugs we found, and (4) specifications we manually created. Note that we can specify the same GUI function with different LTL formulae. In an LTL formula, an action label starts with a designated keyword; otherwise, it is a state label. There are two kinds of action labels, type and detail. An action type label constrains the action type, while an action detail label constrains the action parameters. Using these label categories, we define three levels of detail for LTL formulae with (a) only state labels, (b) state labels and action type labels, and (c) all labels. Intuitively, FARLEAD-Android should be more effective as the level of detail goes from (a) to (c). Note that the two specifications marked N/A at level (a) in Table 3 are inexpressible with a level (a) formula because they explicitly depend on action labels.

We investigate FARLEAD-Android in three categories, FARLEADa, FARLEADb, and FARLEADc, indicating that we use a level (a), (b), or (c) formula, respectively. For two of the specifications, the LTL formula does not change from level (b) to (c) because the specified action does not take any parameters; hence, we combine FARLEADb and FARLEADc as FARLEADb/c for these specifications. Other than FARLEAD-Android, we perform experiments on three known approaches: (1) random exploration (Random), (2) Google’s built-in monkey tester (Monkey) (Google, [n.d.]), and (3) Q-Learning Based Exploration optimized for activity coverage (QBEa) (Koroglu et al., 2018). Random explores the AUT with completely random actions using the same action set as FARLEAD-Android. Monkey also explores the AUT randomly, but with its own action set. QBEa chooses actions according to a pre-learned probability distribution optimized for traversing activities. We implement these approaches in FARLEAD-Android so we can check on-the-fly whether they satisfy our specifications.

For every specification in Table 3, we execute Random, Monkey, QBEa, FARLEADa, FARLEADb, and FARLEADc 100 times each, with a fixed maximum number of episodes per execution. The maximum number of steps depends on the specification, so every execution runs up to the episode limit with at most six steps per episode. We keep the remaining parameters of FARLEAD-Android fixed throughout our experiments. We use these experiments to evaluate the effectiveness and the performance of FARLEAD-Android in Sections 5.1 and 5.2, respectively. In Section 5.3, we discuss the impact of the detail levels on effectiveness and performance. In Section 5.4, we compare the number of steps taken in our experiments with known RL-LTL methods.

Engine Total
Random 6
Monkey 3
QBEa 4
FARLEADa 7
FARLEADb 9
FARLEADc 9
Table 4. Effectiveness of Engines per Specification

5.1. Effectiveness

We say that an engine was effective at satisfying a GUI-level specification if it generated a satisfying test at least once in our experiments.

Table 4 shows the total number of specifications that our engines were effective at satisfying. Our results show that FARLEADa, FARLEADb, and FARLEADc were effective at more specifications than Random, Monkey, and QBEa. Hence, we conclude that FARLEAD-Android is more effective than other engines.

Figure 5. Number of Failures Across 100 Executions

Figure 6. Test Times Across 100 Executions

Figure 7. Number of Steps Across 100 Executions

5.2. Performance

We say that an engine failed to satisfy a GUI-level specification if it could not generate a satisfying test in one execution. We consider that an engine achieved higher performance than another if it failed fewer times in our experiments.

Figure 5 shows the number of failures of all the engines across 100 executions on a logarithmic scale. According to this figure, FARLEADa, FARLEADb, and FARLEADc failed fewer times than Random, Monkey, and QBEa at three of the specifications, indicating that FARLEAD-Android achieved higher performance for these specifications. Only FARLEAD-Android was effective at two of the specifications, so we ignore those specifications in evaluating performance.

Figure 6 shows the average and the maximum times required to terminate for all the engines with every specification across 100 executions on a logarithmic scale. According to this figure, FARLEADb and FARLEADc spent less time on average and in the worst case than Random, Monkey, and QBEa for the remaining specifications, indicating that FARLEAD-Android achieved higher performance when it used a level (b) or (c) formula. However, our results show that QBEa and Monkey spent less time than FARLEADa for two specifications because those specifications are about traversing activities only, a task in which QBEa explicitly specializes and at which Monkey excels. As a last note, FARLEADa outperformed QBEa for one of these specifications even though QBEa spent less time than FARLEADa, because QBEa failed more often than FARLEADa. Hence, we conclude that FARLEAD-Android achieves higher performance than Random, Monkey, and QBEa unless the specification is at level (a) and about traversing activities only.

5.3. Impact of LTL Formula Levels

Table 4 shows that FARLEADb and FARLEADc were effective at more specifications than FARLEADa. Hence, we conclude that FARLEAD-Android becomes more effective when the level of detail goes from (a) to (b) or (c).

Figure 5 shows that FARLEADa failed more times than FARLEADb and FARLEADc on several specifications, indicating that FARLEADb and FARLEADc outperformed FARLEADa. FARLEADa was not effective at two of the specifications, so we ignore those specifications in evaluating performance. Figure 6 shows that FARLEADa spent more time than FARLEADb and FARLEADc on the remaining specifications, indicating that FARLEADb and FARLEADc again outperformed FARLEADa. Furthermore, Figure 6 shows that FARLEADc spent less time than FARLEADb on all specifications except the two for which FARLEADb and FARLEADc are equivalent. Hence, we conclude that the performance of FARLEAD-Android increases as the level of detail goes from (a) to (c).

Overall, both the effectiveness and the performance of FARLEAD-Android increase with the level of detail in the LTL formula. Hence, we recommend that developers provide as much detail as possible to get the highest effectiveness and performance from FARLEAD-Android.

5.4. Number of Steps to Terminate

Figure 7 shows the average and the maximum number of steps required to terminate for all the engines with every specification across 100 executions on a logarithmic scale. The number of steps is a known measure used to compare RL methods logically constrained with LTL formulae (Hasanbeig et al., 2019b, a; Toro Icarte et al., 2018; Wen et al., 2015). Known RL-LTL methods take a high number of steps, on the order of hundreds of thousands, because these methods aim to converge to an optimal policy. FARLEAD-Android took fewer than four thousand steps in all experiments since it terminates as soon as it finds a satisfying test, which occurs before convergence. Figures 6 and 7 show that the number of steps behaves similarly to the test times, indicating that the number of steps is a suitable metric for evaluating performance. We evaluate performance using the test times instead of the number of steps because the test times directly affect the developer’s ability to test more.

6. Related Work

Several studies (Hasanbeig et al., 2019b, a; Toro Icarte et al., 2018; Wen et al., 2015) use an LTL specification as a high-level guide for an RL agent. The RL agent in these studies never terminates and has to avoid violating a given specification indefinitely. For this purpose, they define the LTL semantics over infinite traces and train the RL agent until it converges to an optimal policy. Convergence ensures that the RL agent will continue to satisfy the specification. The goal of FARLEAD-Android is not to train an RL agent until it converges to an optimal policy but to generate one satisfying test and terminate before convergence occurs. Hence, we define the LTL semantics over finite traces. As a result, FARLEAD-Android requires far fewer steps, which is crucial due to the high execution cost of Android devices. We also develop many improvements to the RL algorithm, so FARLEAD-Android generates a satisfying test with as few steps as possible. To the best of our knowledge, FARLEAD-Android is the first engine that combines RL and LTL for test generation.

Two studies (Behjati et al., 2009; Araragi and Cho, 2006) use RL for model checking LTL properties. These studies do not prove LTL properties but focus on finding counterexamples. This type of model checking can replace testing given an appropriate model. However, such models are not readily available for Android applications, and it is hard to generate them.

Two studies (Koroglu et al., 2018; Mariani et al., 2012) use RL to generate tests for GUI applications. Although the engines proposed in these studies are automated, they do not check whether they test a target GUI function. In the literature, this is a known issue called the oracle problem. We propose a solution to this problem by specifying GUI functions as monitorable LTL formulae.

Monitoring LTL properties dynamically from an Android device is not a new idea. To the best of our knowledge, there are three Android monitoring tools, RV-Droid (Falcone et al., 2012), RV-Android (Daian et al., 2015), and ADRENALIN-RV (Sun et al., 2017). All these tools monitor LTL properties at the source code level. Instead, our monitoring is at the GUI level. Furthermore, these tools do not generate tests. They assume that a test is given and focus only on monitoring properties. Instead, FARLEAD-Android performs test generation and monitoring at the same time.

FARLEAD-Android monitors an LTL specification by making changes to it at every step using the projection procedure. This kind of LTL monitoring is called LTL progression (formula rewriting). Icarte et al. (Toro Icarte et al., 2018) and RV-Droid (Falcone et al., 2012) use LTL progression. As an alternative, RV-Android (Daian et al., 2015) first translates an LTL specification into a Past Time Linear Temporal Logic (PTLTL) formula and monitors past actions using that formula. Using LTL progression enables us to use Reward Shaping (Laud, 2004). To the best of our knowledge, FARLEAD-Android is the first engine that uses Reward Shaping in RL-LTL.

7. Discussion

Effectiveness of FARLEAD-Android depends heavily on the exact definitions of atomic propositions, GUI states, and actions. We believe FARLEAD-Android will be able to generate tests for more GUI functions as these definitions get richer. To the best of our abilities, we made these definitions as comprehensive as possible.

Android devices and applications are non-deterministic. As a result, the same test sometimes gets different rewards. RL is known to be robust against non-determinism, so this was not a problem in our experiments. However, due to the non-determinism, a satisfying test may not always generate the same execution trace and therefore may not always satisfy a given LTL formula. We say that a satisfying test is reliable under an LTL formula if and only if all possible execution traces of the test satisfy that formula. Replaying the test several times may help to establish confidence in test reliability.
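
One simple way to establish such confidence is to replay the test several times and measure how often its trace still satisfies the formula, as in this sketch (the helper names follow our earlier sketches and are assumptions):

def estimate_reliability(test, obs, formula, holds, runs=10):
    """Replay a test several times; return the fraction of runs whose trace satisfies the formula.

    obs   -- the Observation Triplet used to execute actions
    holds -- a checker: holds(formula, trace) -> bool (e.g., the evaluator in Section 2.2)
    """
    satisfied = 0
    for _ in range(runs):
        trace = []
        for action in test:
            state = obs.execute(action)              # non-determinism may change the trace
            trace.append(obs.labeling(state, action))
        satisfied += 1 if holds(formula, trace) else 0
    return satisfied / runs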

FARLEAD-Android is trivially sound because any test it reports before reaching the maximum number of episodes is guaranteed to satisfy the specification. However, it is not complete because it cannot decide unsatisfiability. When FARLEAD-Android terminates without producing a satisfying test, we could either use a model checker or warn the developer to investigate the associated GUI function manually.

FARLEAD-Android is fully automated, does not need the source code of the AUT, and does not instrument the AUT. FARLEAD-Android requires the developer to have experience in LTL. A converter that transforms a user-friendly language, for example, Gherkin Syntax, to LTL could make FARLEAD-Android even more practical for the developer.

When the LTL formula is of level (c), it usually takes a single episode to generate a satisfying test. It is possible to use this fact to store tests as level (c) LTL formulae instead of action sequences. Storing tests as LTL has two advantages. First, it is more portable, since GUI actions may be too device-specific. For example, a click requires two coordinates that could differ across devices, whereas one can abstract these details in a single LTL formula and obtain tests for as many devices as possible. Second, the LTL formula encodes the test oracle in addition to the test, whereas the test only specifies the action sequence and not what we should check at the end. We believe these advantages are side benefits of our approach.

Although an RL agent typically stores the quality function as a look-up table, in theory, any regression method (function approximator) can replace the look-up table (Alpaydin, 2014, p.533). If the function approximator is an artificial neural network, then this approach is called Deep Reinforcement Learning (Deep RL) (Mnih et al., 2013). Deep Double SARSA and Deep Double Expected SARSA are two example variants of the original Deep RL (Ganger et al., 2016). Deep RL requires large amounts of data to train and therefore, may result in high execution cost.

We could add fatal exceptions to FARLEAD-Android as a contextual attribute of the current state, exposed as an atomic proposition. An LTL formula over that proposition would then effectively look for a crash. However, this is not the intended use of FARLEAD-Android. With such a formula, we expect FARLEAD-Android to be less effective than a test generation engine optimized for crash detection. However, FARLEAD-Android can be effective at reproducing known bugs, as we show in our experiments. We have also found two new bugs while investigating our experimental AUTs. We have reported these bugs to the respective developers of these AUTs using the issue tracking system.

We can specify the reachability of every activity in the AUT using LTL. Then, the tests FARLEAD-Android generates will achieve 100% activity coverage, assuming all activities of the AUT are indeed reachable. However, this approach may fail to test many essential GUI functions of the AUT, as we state in Section 1.
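
For instance, assuming an atomic proposition $about$ that holds whenever AboutActivity is focused, the reachability of that activity could be specified with the derived eventually operator:

$\mathbf{F}\,about \;\equiv\; \top\,\mathbf{U}\,about$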

Though we have experimented only on virtual machines, FARLEAD-Android also supports physical Android devices, the Android Emulator, and any abstract model that receives the actions and returns the observations described in Figure 1.

To evaluate a test generation tool in terms of its effectiveness at testing a GUI function, we must modify it for monitoring the LTL specification, on-the-fly. Such a modification requires a high amount of engineering, so we implemented the simple Random and Monkey approaches in FARLEAD-Android. We also included QBEa since FARLEAD-Android already had the support for RL, which QBEa uses. Since Monkey and QBEa are known to be efficient in achieving high activity coverage, we expect that comparisons with other test generation tools will also yield similar results.

Finally, many external factors may affect our experimental results. Examples are the Android OS version, issues with the Android Debugging Bridge (ADB), the size of the AUT, and the complexity of the specified GUI function. Therefore, we believe that FARLEAD-Android would benefit from replication studies under various conditions.

8. Conclusion

In this study, we proposed the Fully Automated Reinforcement LEArning-Driven Specification-Based Test Generator for Android (FARLEAD-Android). FARLEAD-Android uses Reinforcement Learning (RL) to generate replayable tests that satisfy given Linear-time Temporal Logic (LTL) specifications. To the best of our knowledge, FARLEAD-Android is the first test generation engine that combines RL and LTL. Our evaluation shows that FARLEAD-Android has been more effective and has achieved higher performance in generating tests that satisfy given specifications than three known test generation approaches, Random, Monkey, and QBEa. We also demonstrated that the developer should provide as much detail as possible to get the highest effectiveness and performance from FARLEAD-Android. Finally, we showed that the early termination of FARLEAD-Android allows us to take far fewer steps than known RL-LTL methods.

In the future, with FARLEAD-Android, we aim to generate tests for larger applications and more specifications. Currently, we cannot specify GUI functions that depend on timings. Hence, we aim to support Metric Temporal Logic (MTL), which will increase the number of GUI functions that we can specify. Finally, we will improve FARLEAD-Android so that it will support atomic propositions on sensory inputs, energy bugs, and security issues.

References

  • Alpaydin (2014) Ethem Alpaydin. 2014. Introduction to Machine Learning (3rd ed.). The MIT Press.
  • Amalfitano et al. (2015) D. Amalfitano, A. R. Fasolino, P. Tramontana, B. D. Ta, and A. M. Memon. 2015. MobiGUITAR: Automated Model-Based Testing of Mobile Apps. IEEE Software 32, 5 (2015), 53–59.
  • Anand et al. (2012) Saswat Anand, Mayur Naik, Mary Jean Harrold, and Hongseok Yang. 2012. Automated concolic testing of smartphone apps. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE). https://github.com/saswatanand/acteve.
  • Araragi and Cho (2006) Tadashi Araragi and Seung Mo Cho. 2006. Checking liveness properties of concurrent systems by reinforcement learning. In International Workshop on Model Checking and Artificial Intelligence. Springer, 84–94.
  • Arel et al. (2010) Itamar Arel, Cong Liu, T Urbanik, and AG Kohls. 2010. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems 4, 2 (2010), 128–135.
  • Azim and Neamtiu (2013) Tanzirul Azim and Iulian Neamtiu. 2013. Targeted and Depth-first Exploration for Systematic Testing of Android Apps. In ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA). 641–660.
  • Barr et al. (2014) Earl T Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. 2014. The oracle problem in software testing: A survey. IEEE transactions on software engineering 41, 5 (2014), 507–525.
  • Behjati et al. (2009) Razieh Behjati, Marjan Sirjani, and Majid Nili Ahmadabadi. 2009. Bounded rational search for on-the-fly model checking of LTL properties. In International Conference on Fundamentals of Software Engineering. Springer, 292–307.
  • Bolton (2017) David Bolton. 2017. 88 percent of people will abandon an app because of bugs. https://www.applause.com/blog/app-abandonment-bug-testing.
  • Cao et al. (2019) C. Cao, J. Deng, P. Yu, Z. Duan, and X. Ma. 2019. ParaAim: Testing Android Applications Parallel at Activity Granularity. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC).
  • Cao et al. (2018) Yuzhong Cao, Guoquan Wu, Wei Chen, and Jun Wei. 2018. CrawlDroid: Effective Model-based GUI Testing of Android Apps. In Tenth Asia-Pacific Symposium on Internetware. https://github.com/sy1121/CrawlDroid.
  • Choi ([n.d.]) Wontae Choi. [n.d.]. SwiftHand2: Android GUI Testing Framework.
    https://github.com/wtchoi/swifthand2.
  • Choi et al. (2013) Wontae Choi, George Necula, and Koushik Sen. 2013. Guided GUI Testing of Android Apps with Minimal Restart and Approximate Learning. In ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA). 623–640.
  • Daian et al. (2015) Philip Daian, Yliès Falcone, Patrick O’Neil Meredith, Traian-Florin Serbanuta, Shinichi Shiraishi, Akihito Iwai, and Grigore Rosu. 2015. RV-Android: Efficient Parametric Android Runtime Verification, a Brief Tutorial. In Runtime Verification - 6th International Conference, RV 2015 Vienna, Austria, September 22-25, 2015. Proceedings (Lecture Notes in Computer Science), Vol. 9333. Springer, 342–357. https://doi.org/10.1007/978-3-319-23820-3_24
  • Eler et al. (2018) M. M. Eler, J. M. Rojas, Y. Ge, and G. Fraser. 2018. Automated Accessibility Testing of Mobile Apps. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST).
  • Falcone et al. (2012) Ylies Falcone, Sebastian Currea, and Mohamad Jaber. 2012. Runtime verification and enforcement for Android applications with RV-Droid. In International Conference on Runtime Verification. Springer, 88–95.
  • Ganger et al. (2016) Michael Ganger, Ethan Duryea, and Wei Hu. 2016. Double Sarsa and double expected Sarsa with shallow and deep learning. Journal of Data Analysis and Information Processing 4, 04 (2016), 159.
  • Gartner (2018) Gartner 2018. Market Share: Final PCs, ultramobiles and mobile phones, all countries, 4q17 update.
  • Google ([n.d.]) Google. [n.d.]. Android UI/application exerciser monkey.
    http://developer.android.com/tools/help/monkey.html.
  • GPLAYDATA ([n.d.]) GPLAYDATA [n.d.]. https://en.wikipedia.org/wiki/Google_Play.
  • Gultnieks (2010) Ciaran Gultnieks. 2010. F-Droid Benchmarks. https://f-droid.org/.
  • Hao et al. (2014) Shuai Hao, Bin Liu, Suman Nath, William G.J. Halfond, and Ramesh Govindan. 2014. PUMA: Programmable UI-automation for Large-scale Dynamic Analysis of Mobile Apps. In 12th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys). 204–217.
  • Hasanbeig et al. (2019a) Mohammadhosein Hasanbeig, Alessandro Abate, and Daniel Kroening. 2019a. Certified Reinforcement Learning with Logic Guidance. arXiv preprint arXiv:1902.00778 (2019).
  • Hasanbeig et al. (2019b) Mohammadhosein Hasanbeig, Alessandro Abate, and Daniel Kroening. 2019b. Logically-Constrained Neural Fitted Q-Iteration. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2012–2014.
  • Koroglu and Sen (2018) Yavuz Koroglu and Alper Sen. 2018. TCM: Test Case Mutation to Improve Crash Detection in Android. In Fundamental Approaches to Software Engineering.
  • Koroglu et al. (2018) Yavuz Koroglu, Alper Sen, Ozlem Muslu, Yunus Mete, Ceyda Ulker, Tolga Tanriverdi, and Yunus Donmez. 2018. QBE: QLearning-Based Exploration of Android Applications. In IEEE International Conference on Software Testing, Verification and Validation (ICST).
  • Laud (2004) Adam Daniel Laud. 2004. Theory and application of reward shaping in reinforcement learning. Technical Report.
  • Li et al. (2017) Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2017. DroidBot: a lightweight UI-guided test input generator for Android. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C). https://github.com/honeynet/droidbot.
  • Linares-Vásquez et al. (2015) Mario Linares-Vásquez, Martin White, Carlos Bernal-Cárdenas, Kevin Moran, and Denys Poshyvanyk. 2015. Mining Android App Usages for Generating Actionable GUI-based Execution Scenarios. In 12th Working Conference on Mining Software Repositories (MSR). 111–122.
  • Liu et al. (2014) Yepang Liu, Chang Xu, Shing-Chi Cheung, and Jian Lu. 2014. GreenDroid: Automated Diagnosis of Energy Inefficiency for Smartphone Applications. IEEE Transactions on Software Engineering (2014).
  • Machiry et al. (2013) Aravind Machiry, Rohan Tahiliani, and Mayur Naik. 2013. Dynodroid: An Input Generation System for Android Apps. In 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE). https://dynodroid.github.io.
  • Mahmood et al. (2014) Riyadh Mahmood, Nariman Mirzaei, and Sam Malek. 2014. EvoDroid: Segmented Evolutionary Testing of Android Apps. In 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE). 599–609.
  • Mao et al. (2016a) Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. 2016a. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks. ACM, 50–56.
  • Mao et al. (2016b) Ke Mao, Mark Harman, and Yue Jia. 2016b. Sapienz: Multi-objective Automated Testing for Android Applications. In 25th International Symposium on Software Testing and Analysis (ISSTA). 94–105.
  • Mariani et al. (2012) Leonardo Mariani, Mauro Pezze, Oliviero Riganelli, and Mauro Santoro. 2012. Autoblacktest: Automatic black-box testing of interactive applications. In 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation. IEEE, 81–90.
  • Mirzaei et al. (2016) Nariman Mirzaei, Joshua Garcia, Hamid Bagheri, Alireza Sadeghi, and Sam Malek. 2016. Reducing combinatorics in GUI testing of android applications. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE). IEEE, 559–570.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
  • Moran et al. (2016) Kevin Moran, Mario Linares Vásquez, Carlos Bernal-Cárdenas, Christopher Vendome, and Denys Poshyvanyk. 2016. Automatically Discovering, Reporting and Reproducing Android Application Crashes. In IEEE International Conference on Software Testing, Verification and Validation (ICST). 33–44. https://www.android-dev-tools.com/crashscope-home.
  • Piejko (2016) Pawel Piejko. 2016. 16 mobile market statistics you should know in 2016. https://deviceatlas.com/blog/16-mobile-market-statistics-you-should-know-2016.
  • Silver et al. (2017) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2017. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815 (2017).
  • Su et al. (2017) Ting Su, Guozhu Meng, Yuting Chen, Ke Wu, Weiming Yang, Yao Yao, Geguang Pu, Yang Liu, and Zhendong Su. 2017. Guided, stochastic model-based GUI testing of Android apps. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. https://tingsu.github.io/files/stoat.html.
  • Sun et al. (2017) Haiyang Sun, Andrea Rosa, Omar Javed, and Walter Binder. 2017. ADRENALIN-RV: android runtime verification using load-time weaving. In 2017 IEEE International Conference on Software Testing, Verification and Validation (ICST). IEEE, 532–539.
  • Sutton and Barto (1998) Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning (1st ed.). MIT Press, Cambridge, MA, USA.
  • Toro Icarte et al. (2018) Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, and Sheila A. McIlraith. 2018. Teaching Multiple Tasks to an RL Agent Using LTL. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS ’18). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 452–461. http://dl.acm.org/citation.cfm?id=3237383.3237452
  • Wen et al. (2015) Min Wen, Rüdiger Ehlers, and Ufuk Topcu. 2015. Correct-by-synthesis reinforcement learning with temporal logic constraints. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  • Yan et al. (2018) Jiwei Yan, Linjie Pan, Yaqi Li, Jun Yan, and Jian Zhang. 2018. LAND: A User-friendly and Customizable Test Generation Tool for Android Apps. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA).
  • Yang et al. (2013) Wei Yang, Mukul R. Prasad, and Tao Xie. 2013. A Grey-box Approach for Automated GUI-model Generation of Mobile Applications. In 16th International Conference on Fundamental Approaches to Software Engineering (FASE). 250–265.
  • Zaeem et al. (2014) Razieh Nokhbeh Zaeem, Mukul R. Prasad, and Sarfraz Khurshid. 2014. Automated Generation of Oracles for Testing User-Interaction Features of Mobile Apps. In IEEE International Conference on Software Testing, Verification, and Validation (ICST).
  • Zhou et al. (2017) Zhenpeng Zhou, Xiaocheng Li, and Richard N Zare. 2017. Optimizing chemical reactions with deep reinforcement learning. ACS central science 3, 12 (2017), 1337–1344.