Towards Quality Assurance of Software Product Lines with Adversarial Configurations

Paul Temple, NaDI, PReCISE, Faculty of Computer Science, University of Namur, Namur, Belgium; Mathieu Acher, Univ Rennes, IRISA, Inria, CNRS, Rennes, France; Gilles Perrouin, NaDI, PReCISE, Faculty of Computer Science, University of Namur, Namur, Belgium; Battista Biggio, University of Cagliari, Cagliari, Italy; Jean-Marc Jézéquel, Univ Rennes, IRISA, Inria, CNRS, Rennes, France; and Fabio Roli, University of Cagliari, Cagliari, Italy
Abstract.

Software product line (SPL) engineers put a lot of effort into ensuring that, through the setting of a large number of possible configuration options, products are acceptable and well-tailored to customers’ needs. Unfortunately, options and their mutual interactions create a huge configuration space which is intractable to explore exhaustively. Instead of testing all products, machine learning (ML) techniques are increasingly employed to approximate the set of acceptable products out of a small training sample of configurations. ML techniques can refine a software product line through learned constraints and prevent a priori non-acceptable products from being derived. In this paper, we use adversarial ML techniques to generate adversarial configurations that fool ML classifiers and pinpoint incorrect classifications of products (videos) derived from an industrial video generator. Our attacks yield (up to) a 100% misclassification rate and a drop in accuracy of 5%. We discuss the implications these results have on SPL quality assurance.

Conference: 23rd International Systems and Software Product Line Conference; 9–13 September 2019; Paris, France. Copyright: none.

1. Introduction

Testers don’t like to break things; they like to dispel the illusion that things work.  (Kaner:2001:LLS:559964)

Software Product Line Engineering (SPLE) aims at delivering massively customized products within shortened development cycles (pohl-etal2005; clements2001). To achieve this goal, SPLE systematically reuses software assets realizing the functionality of one or more features, which we loosely define as units of variability. Users can specify products matching their needs by selecting/deselecting features and providing additional values for their attributes. Based on such configurations, the corresponding products can be obtained as a result of the product derivation phase. A long-standing issue for developers and product managers is to gain confidence that all possible products are functionally viable, e.g., all products compile and run. This is a hard problem, since modern software product lines (SPLs) can involve thousands of features and practitioners cannot test all possible configurations and corresponding products due to combinatorial explosion. Research efforts rely on variability models (e.g., feature diagrams) and solvers (SAT, CSP, SMT) to compactly define how features can and cannot be combined (batory2005; schobbens2007; berger2013; benavides2010). Together with advances in model-checking, software testing and program analysis techniques, it is conceivable to assess the functional validity of configurations and their associated combination of assets within a product of the SPL (DBLP:conf/pldi/BoddenTRBBM13; DBLP:journals/fac/StruberRACTP18; classen2010; classen2011; DBLP:journals/jlp/BeekFGM16; DBLP:conf/icse/NadiBKC14).
Yet, when dealing with qualities of the derived products (performance, costs, etc.), several unanswered challenges remain, from the specification of feature-aware quantities to the best trade-offs between products and family-based analyses (e.g., (Legay:2017:QRP:3023956.3023970; terBeek:2019:QVM:3302333.3302349)). In our industrial case study, the MOTIV video generator (galindoISSTA2014), one can approximately generate video variants. Furthermore, it takes about minutes to create a new video: a non-acceptable (e.g., too noisy or dark) video can lead to a tremendous waste of resources. A promising approach is to sample a number of configurations and predict the quantitative or qualitative properties of the remaining configurations using Machine Learning (ML) techniques (SGKA:ESECFSE15; guo2015; guo2013; DBLP:conf/isola/BeekFGS16; siegmund2013; DBLP:conf/sigsoft/OhBMS17; temple:hal-01323446). These techniques create a predictive model (a classifier) from such sampled configurations and infer the properties of yet unseen configurations based on their similarity to the training distribution. This way, unseen configurations that do not match specific properties can be automatically discarded and constraints can be added to the feature diagram in order to avoid them permanently (DBLP:journals/software/TempleAJB17; temple:hal-01323446). However, we need to trust the ML classifier (barreno2006can; nelson2008) to avoid costly misclassifications. In the ML community, it has been demonstrated that some forged instances, called adversarial examples, can fool a given classifier (biggio2018wild). Adversarial machine learning (advML) thus refers to techniques designed to fool (e.g., (biggio2013poisoning; biggio2013evasion; nelson2008)), evaluate the security of (e.g., (biggio2014security)) and even improve the quality of learned classifiers (gan2014). Our overall goal is to study how advML techniques can be used to support the quality assurance of ML classifiers employed in SPL activities. In this paper, we design a generator of adversarial configurations for SPLs and measure how the prediction ability of the classifier is affected by such adversarial configurations. We also provide usage scenarios of advML for quality assurance of SPLs. We discuss how adversarial configurations raise questions about the quality of the variability model or the testing oracle of SPL products. This paper makes the following contributions:

  1. An adversarial attack generator, based on evasion attacks and dedicated to SPLs;

  2. An assessment of its effectiveness and a comparison against a random strategy, showing that up to of the attacks are valid with respect to the variability model and successful in fooling the prediction of acceptable/non-acceptable videos, leading to a 5% loss in accuracy;

  3. A qualitative discussion of the generated adversarial configurations w.r.t. the classifier’s training set, its potential improvement and the practical impact of advML on the quality assurance workflow of SPLs.

  4. The public availability of our implementation and empirical results at https://github.com/templep/SPLC_2019

The rest of this paper is organized as follows: Section 2 presents the case study and gives background information about ML and advML; Section 3 shows how advML is used in the context of MOTIV; Section 4 describes the experimental procedure and discusses the results; Sections 5 and 6 present potential threats that could mitigate our conclusions and provide a qualitative discussion of how adversarial configurations could be leveraged by SPL developers; Section 7 covers related work and Section 8 wraps up the paper with conclusions.

2. Background

2.1. Motivating case: MOTIV generator

MOTIV is an industrial video generator whose purpose is to provide synthetic videos that can be used to benchmark computer-vision-based systems. Video sequences are generated out of configurations specifying the content of the scenes to render (temple:hal-01323446). MOTIV relies on a variability model that documents the possible values of more than configuration options, each of them affecting the perception of generated videos and the achievement of subsequent tasks, such as recognizing moving objects. Perception variability relates to changes in the background (e.g., a forest or buildings), objects passing in front of the camera (with varying distances to the camera and different trajectories), blur, etc. There are Boolean options, categorical options encoded as enumerations (e.g., to define predefined trajectories) and real-value options (e.g., dealing with blur or noise). Precisely, enumerations contain about elements each and, on average, real-value options vary between and with a precision of . Excluding the (very few) constraints in the variability model, we over-estimate the video variants’ space size: . Concretely, MOTIV takes as input a text file describing the scene to be captured by a synthetic camera as well as the recording conditions. Then, Lua (Ierusalimschy:2006:PLS:1200583) scripts are called to compose the scene and apply the desired visual effects, resulting in a video sequence. To realize variability, the Lua code uses parameters in functions to activate or deactivate options and to take into account the values (enumerations or real values) defined in the configuration file. A highly challenging problem is to identify feature values and interactions that make the identification of moving objects extremely difficult if not impossible. Typically, some of the generated videos contain too much noise or blur. In other words, they are not acceptable as they cannot be used to benchmark object tracking techniques. Another class of non-acceptable videos is composed of those in which pixel values do not change, resulting in a succession of images for which all pixels have the same color: nothing can be perceived. As mentioned in Section 1, non-acceptable videos represent a waste of time and resources: minutes per video, not including the time to run benchmarks related to object tracking (several minutes depending on the computer vision algorithm). We therefore need to constrain our variability model to avoid such cases.

2.2. Previous work: ML and MOTIV

We previously used ML classification techniques to predict the acceptability of unseen video variants (temple:hal-01323446). We summarise this process in Figure 1.

Figure 1. Refining the variability model of MOTIV video generator via an ML classifier.

We first sample valid configurations using a random strategy (see Temple et al. (temple:hal-01323446) for details) and generate the associated video sequences. A computer program playing the role of a testing oracle then labels videos as acceptable (in green) or non-acceptable (in red). This oracle implements an image quality assessment (IQA) defined by the authors via an analysis of the frequency distribution given by Fourier transformations. An ML classifier (in our case, a decision tree) can be trained on such labelled videos. "Paths" (traversals from the root to the leaves) leading to non-acceptable videos can easily be transformed into new constraints and injected into the variability model. An ML classifier can make errors, preventing acceptable videos (false negatives) or allowing non-acceptable videos (false positives). Most of these errors can be attributed to the confidence of the classifier, which comes from both its design (i.e., the set of approximations used to build its decision model) and the training set (and more specifically the distribution of the classes). Areas of low confidence exist where configurations are very dissimilar to those already seen or at the frontier between two classes. We use advML to quantify these errors and their impact on MOTIV.
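To make the decision-tree step concrete, the following sketch (using scikit-learn, with hypothetical feature names and a synthetic stand-in for the oracle; it is not the MOTIV implementation) shows how root-to-leaf paths predicting non-acceptable videos can be read back as constraint candidates.

# Minimal sketch: train a decision tree on labelled configurations and
# print the root-to-leaf paths that predict "non-acceptable" (-1).
# Feature names and data are hypothetical placeholders for MOTIV options.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
feature_names = ["noise_level", "blur", "camera_distance"]
X = rng.random((300, len(feature_names)))          # sampled configurations
y = np.where(X[:, 0] + X[:, 1] > 1.2, -1, +1)      # stand-in for the oracle

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
tree = clf.tree_

def print_paths(node=0, conditions=()):
    if tree.children_left[node] == -1:             # leaf node
        label = clf.classes_[np.argmax(tree.value[node])]
        if label == -1:                            # non-acceptable leaf
            # Each path is a conjunction of conditions, i.e. a constraint candidate.
            print(" AND ".join(conditions) + "  =>  non-acceptable")
        return
    name = feature_names[tree.feature[node]]
    thr = tree.threshold[node]
    print_paths(tree.children_left[node], conditions + (f"{name} <= {thr:.2f}",))
    print_paths(tree.children_right[node], conditions + (f"{name} > {thr:.2f}",))

print_paths()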

2.3. ML and advML

Figure 2. Adversarial configurations (stars) are at the limit of the separating function learned by the ML classifier

ML classification. Formally, a classification algorithm builds a function f that associates a label from a set Y of predefined classes with configurations represented in a feature space (noted X). In MOTIV, only two classes are defined: Y = {+1, -1}, respectively representing acceptable and non-acceptable videos. X represents the set of configurations and the configuration space is defined by the configuration options of the underlying feature model (and their definition domains). The classifier is trained on a data set D constituted of pairs (x_i, y_i), where the x_i are valid configurations from the variability model and the y_i their associated labels. To label the configurations in D, we use an oracle (see Figure 1). Once the classifier is trained, f induces a separation in the feature space (shown as the transition from the blue/left to the white/right area in Figure 2) that mimics the oracle: when an unseen configuration occurs, the classifier instantly determines to which class this configuration belongs. Unfortunately, the separation can make prediction errors since the classifier is based on statistical assumptions and a (small) training sample. We can see in Figure 2 that the separation diverges from the solid black line representing the target oracle. As a result, two squares are misclassified as being triangles. Classification algorithms realise trade-offs between the necessity to classify the labelled data correctly (taking into account the fact that it can be noisy or biased) and the ability to generalise to unseen data. Such trade-offs lead to approximations that can be leveraged by adversarial configurations (shown as stars in Figure 2).

AdvML and evasion attacks. According to Biggio et al. (biggio2018wild), deliberately attacking an ML classifier with crafted malicious inputs was proposed in 2004. Today, it is called adversarial machine learning and can be seen as a sub-discipline of machine learning. Depending on the attackers’ access to various aspects of the ML system (dataset, ability to update the training set) and their goals, various kinds of attacks (biggio2013poisoning; biggio2012poisoning; biggio2013evasion; biggio2014security; biggio2014pattern) are available: they are organised in a taxonomy (barreno2006can; biggio2018wild). In this paper, we focus on evasion attacks: these attacks move labelled data to the other side of the separation (putting it in the opposite class) via successive modifications of feature values. Since areas close to the separation are of low confidence, such adversarial configurations can have a significant impact if added to the training set. To determine the direction in which to move the data towards the separation, a gradient-based method has been proposed by Biggio et al. (biggio2013evasion). This method requires the attacked ML algorithm to be differentiable. One such differentiable classifier is the Support Vector Machine (SVM), parameterizable with a kernel function (the most common kernel functions are linear, radial basis functions and polynomial).
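To illustrate why differentiability matters, the sketch below (scikit-learn, toy data; not MOTIV code) trains a linear SVM and shows that the gradient of its discriminant function g(x) = w·x + b is simply the weight vector w, which is the direction a gradient-based evasion attack follows.

# For a linear SVM, g(x) = w.x + b, so the gradient of g w.r.t. x is the
# constant weight vector w: moving against w decreases g(x), i.e. pushes
# a configuration towards the opposite class. Toy data for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, +1, -1)  # toy labels in {-1, +1}

svm = SVC(kernel="linear").fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]

x = X[0]
g_x = float(np.dot(w, x) + b)                      # discriminant value g(x)
print("g(x) =", g_x, "predicted label =", int(np.sign(g_x)))
print("gradient of g at x (constant for a linear kernel):", w)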

3. Evasion attacks for MOTIV

3.1. A dedicated Evasion Algorithm

Input: x_0, the initial configuration; t, the step size; nb_disp, the number of displacements; g, the discriminant function
Output: x*, the final attack point

  (1) m = 0;
  (2) Set x_0 to a copy of a configuration of the class from which the attack starts;
  while m < nb_disp do
     (3) m = m+1;
     (4) Let u be a unit vector, the normalisation of ∇g(x_(m-1));
     (5) x_m = x_(m-1) - t * u;
  end while
  (6) return x* = x_m;
Algorithm 1 Our algorithm conducting the gradient-descent evasion attack inspired by (biggio2013evasion)

Algorithm 1 presents our adaptation of Biggio et al.’s evasion attack (biggio2013evasion). First, we select an initial configuration to be moved (x_0): selection trade-offs are discussed in the next section. Then, we need to set the step size (t), a parameter controlling the convergence of the algorithm. Large steps induce difficulties to converge, while small steps may trap the algorithm in a local optimum. While the original algorithm introduced a termination criterion based on the impact of the attack on the classifier between each move (if this impact was smaller than a threshold ε, the algorithm stopped, assuming an optimal attack), we fixed the maximal number of displacements (nb_disp) in advance. This allows for a controllable computation budget, as we observed that for small step sizes the number of displacements required to meet the termination criterion was too large. The function g is the discriminant function and is defined by the ML algorithm that is used; it maps a configuration to a real number. In fact, only the sign of g(x) is used to assign a label to a configuration x. Thus, the classifier f can be decomposed into two successive functions: first g, which maps a configuration to a real value, and then the sign function, with f(x) = sign(g(x)). However, |g(x)| (the absolute value of g(x)) intuitively reflects the confidence the classifier has in its assignment of x: it increases when x is far from the separation and surrounded by other configurations from the same class, and is smaller when x is close to the separation. Using this discriminant function has been proposed by Biggio et al. (biggio2013evasion) and should not be confused with the unrelated discriminator component of GANs by Goodfellow et al. (gan2014). In GANs, the discriminator is part of the "robustification process". It is an ML classifier striving to determine whether an input has been artificially produced by the other GAN component, called the generator. Its responses are then exploited by the generator to produce increasingly realistic inputs. In this work, we only generate adversarial configurations, though GANs are envisioned as follow-up work.

Concretely, the core of the algorithm consists of the loop that iterates over the number of displacements. Statement (4) determines the direction towards the area of maximum impact with respect to the classifier (explaining why only a unit vector is needed): ∇g(x) is the gradient of g and gives the direction of interest towards which the adversarial configuration should move. This vector is then multiplied by the step size and subtracted from the previous position (5). The final position is returned after the number of displacements has been reached. For statements (4) and (5) we simplified the initial algorithm (biggio2013evasion): we do not try to mimic existing configurations as closely as possible, since we look for some diversity. In an open-ended feature space, gradient-driven displacements can grow indefinitely, possibly preventing the algorithm from terminating. Biggio et al. (biggio2013evasion) set a maximal distance representing the boundary of the feasible region to keep the exploration under control. In MOTIV, this boundary is represented by the hard constraints in the variability model. Because of the heterogeneity of MOTIV features, cross-tree constraints and domain values are difficult to specify and enforce in the attack algorithm, and calls to SAT/SMT solvers would slow down the attack process. We only take care of the type of feature values (natural integers, floats, Booleans). For example, we reset to zero natural integer values that could become negative due to displacements, and we ensure that Boolean values are either 0 or 1.
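The following Python sketch illustrates Algorithm 1 against a linear SVM surrogate (as in Section 3.2). It is a simplified illustration under our own assumptions (toy data, hypothetical feature bounds and a naive type/domain fixing step); the actual scripts are available on the companion website.

# Sketch of the gradient-descent evasion attack of Algorithm 1 against a
# linear SVM surrogate. Data, bounds and feature types are hypothetical
# placeholders, not the actual MOTIV configuration space.
import numpy as np
from sklearn.svm import SVC

def evasion_attack(x0, svm, t, nb_disp, lower, upper, is_integer):
    """Move x0 by nb_disp displacements of size t along -gradient of g."""
    x = x0.astype(float).copy()
    grad = svm.coef_[0]                      # gradient of g (constant for a linear kernel)
    for _ in range(nb_disp):
        u = grad / np.linalg.norm(grad)      # unit vector (statement 4)
        x = x - t * u                        # displacement (statement 5)
        # Basic type/domain fixing, mimicking what we do for MOTIV features:
        x = np.clip(x, lower, upper)
        x[is_integer] = np.maximum(np.round(x[is_integer]), 0)
    return x

# Toy setup for illustration only.
rng = np.random.default_rng(2)
X = rng.random((300, 4))
y = np.where(X[:, 0] + X[:, 2] > 1.0, -1, +1)
svm = SVC(kernel="linear").fit(X, y)

x0 = X[y == +1][0]   # start from a configuration on the positive side of g
x_adv = evasion_attack(x0, svm, t=0.05, nb_disp=20, lower=0.0, upper=1.0,
                       is_integer=np.array([False, False, False, True]))
print("label before:", svm.predict([x0])[0], "-> label after:", svm.predict([x_adv])[0])

In this toy setup, subtracting the gradient direction moves a configuration from the positive side of g towards the other class; attacking in the opposite direction simply flips the sign of the step (cf. RQ1.4).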

Figure 8. Examples of generated videos using evasion attacks (panels (a)–(e))

As introduced in Section 2, decision trees are not directly compatible with evasion attacks: the underlying mathematical model is highly non-linear and non-differentiable, forbidding the computation of a gradient. We therefore learn another classifier (i.e., a Support Vector Machine) on which we can perform evasion attacks directly (biggio2013evasion; biggio2018wild). We rely upon evidence that attacks conducted on a specific ML model can be transferred to others (Demontis2019Transfer; Demontis2018Transfer; brownadversarial; brown2017adversarial).

3.2. Implementation

We implemented the above procedure in Python 3 (scripts available on the companion website). Figure 8 depicts some images of videos generated out of adversarial configurations.

MOTIV’s variability model embeds enumerations, which are usually encoded via integers. The main difference between the two is the logical order that is inherent to integers but not encoded into enumerations. As a result, some ML techniques have difficulties dealing with them. The solution is to "dummify" enumerations into a set of Boolean features, whose truth values take into account the exclusion constraints of the original enumerations. Conveniently, Python provides the get_dummies function from the pandas library, which takes as input a set of configurations and the feature indexes to dummify. For each feature index, the function creates and returns a set of Boolean features representing the literal indexes encountered while running through the given configurations: if the get_dummies function detects values in the integer range for a feature associated with an enumeration, it will return a set of Boolean features representing the literal indexes in that range. It also takes care of preserving the semantics of enumerations. However, dummification is not without consequences for the ML classifier. First, it increases the number of dimensions: our initial enumerations would be transformed into features. Doing so may expose the ML algorithm to the curse of dimensionality (Bellman:1957): as the number of features increases in the feature space, configurations that look alike (i.e., with close feature values and the same label) tend to move away from each other, making the learning process more complex. This curse has also been recognised to have an impact on SPL activities (davril:hal-01243571). Second, dummification implies that we would operate our attacks in a feature space that is essentially different from the one induced by the real SPL. This means that we would need to transpose attacks generated in the dummified feature space back to the original SPL one, raising one main issue: there is no guarantee that an attack relevant in the dummified space is still effective in the original space (the separation may simply not be the same). Additionally, gradient methods operate per feature only, meaning that exclusion constraints in dummified enumerations are ignored. That is, when transposed back to the original configuration space, invalid configurations would need to be "fixed", potentially moving these adversarial configurations away from the optimum computed by the gradient method. For all these reasons, we decided to operate on the initial feature space, acknowledging the threat of considering enumerations as ordered. We conducted a preliminary analysis (available on the companion webpage: https://github.com/templep/SPLC_2019) showing that the ranking of feature importance was preserved whether we use the dummified or the initial feature space, so this threat is minor in comparison with the pitfalls of dummification. In the remainder, we do not make any further distinction between the configuration space and the feature space, since we use configurations without any transformation.
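For illustration, a dummification with pandas could look as follows (the column names and values are hypothetical, not actual MOTIV options; the usual call passes the columns to expand explicitly):

# Sketch of dummification with pandas: an enumeration feature encoded as
# integers is expanded into one Boolean column per literal. Values are
# hypothetical and do not correspond to actual MOTIV options.
import pandas as pd

configs = pd.DataFrame({
    "trajectory": [0, 2, 1, 2],    # enumeration encoded as integers
    "blur": [0.1, 0.0, 0.35, 0.2]  # real-valued option, left untouched
})

dummified = pd.get_dummies(configs, columns=["trajectory"])
print(dummified)
# Columns trajectory_0, trajectory_1, trajectory_2 are mutually exclusive,
# which preserves the exclusion semantics of the original enumeration,
# but the number of dimensions grows with the number of literals.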

Since the attack cannot be conducted directly on decision trees, we first learn another classifier. We chose a support vector machine with a linear kernel, since it was faster according to a preliminary experiment. Scripts as well as the data used to compare predictions can be found on the companion webpage.
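A minimal sketch of this surrogate step (scikit-learn, toy data standing in for the labelled configurations; not our actual scripts): train the decision tree and a linear SVM on the same training set and check how often their predictions agree before attacking the SVM.

# Sketch: learn a linear SVM as a differentiable surrogate of the decision
# tree and measure prediction agreement on held-out configurations.
# Toy data stands in for the labelled MOTIV configurations.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.random((500, 6))
y = np.where(X[:, 0] * X[:, 1] > 0.25, -1, +1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
tree = DecisionTreeClassifier().fit(X_tr, y_tr)
svm = SVC(kernel="linear").fit(X_tr, y_tr)

agreement = np.mean(tree.predict(X_te) == svm.predict(X_te))
print(f"tree/SVM agreement on unseen configurations: {agreement:.2%}")
# Attacks are computed against the SVM; they are expected to (partially)
# transfer to the tree (Demontis et al.).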

4. Evaluation

4.1. Research questions

We address the following research questions:
  RQ1: How effective is our adversarial generator to synthesize adversarial configurations? Effectiveness is measured through the capability of our evasion attack algorithm to generate configurations that are misclassified:

  • RQ1.1: Can we generate adversarial configurations that are wrongly classified?

  • RQ1.2: Are all generated adversarial configurations valid w.r.t. constraints in the VM?

  • RQ1.3: Is using the evasion algorithm more effective than generating adversarial configurations with random modifications?

  • RQ1.4: Are attacks effective regardless of the targeted class?

RQ2: What is the impact of adding adversarial configurations to the training set regarding the performance of the classifier? The intuition is that adding adversarial configurations to the training set could improve the performance of the classifier when evaluated on a test set.

4.2. Evaluation protocol

Our evaluation dataset is composed of randomly sampled and valid video configurations that we used in previous work (temple:hal-01323446). We selected configurations to train the classifier, keeping a similar representation of non-acceptable configurations (, i.e., configurations) compared to the whole set. The remaining configurations are used as a test set and also have a similar representation of acceptable/non-acceptable configurations. This setting contrasts with the common practice of using a high percentage (i.e., around 66%) of available examples to train the classifier; however, due to the low number of non-acceptable configurations, such a setting is impossible. k-fold cross-validation is another common practice used when few data points are available for training ( configurations is an arguably low number with respect to the size of the variant space), but it is used to validate/select a classifier when several are created (which is not our case here); in addition, separating our configurations into smaller sets is likely to create many folds without any non-acceptable configurations. None of these practices seems adapted to our case.
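For reference, a class-ratio-preserving split of this kind can be obtained with a stratified split; the sketch below uses illustrative sizes and synthetic labels (about 10% non-acceptable, as in our dataset), not the actual configurations.

# Sketch of a class-ratio-preserving (stratified) train/test split.
# Sizes and data are illustrative, not the actual MOTIV dataset.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.random((4000, 10))
y = np.where(rng.random(4000) < 0.1, -1, +1)   # ~10% non-acceptable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0)

print("non-acceptable ratio, train:", np.mean(y_train == -1))
print("non-acceptable ratio, test: ", np.mean(y_test == -1))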

The key point is that only about 10% of configurations are non-acceptable. This is a ratio that we cannot control exactly, as it depends on the targeted non-functional property. In order to reduce imbalance, several data augmentation techniques exist, like SMOTE (SMOTE). Usually, they create artificial configurations while maintaining the configurations’ distribution in the feature space. In our case, we compute the centroid between two configurations and use it as a new configuration. Thanks to the centroid method, we can bring perfect balance between the two classes (i.e., of acceptable configurations and non-acceptable configurations). Technically, we compute how many configurations are needed to have perfectly balanced sets (i.e., training and test sets): we randomly select two non-acceptable configurations, compute the centroid between them, check that it is a never-seen-before configuration and add it to the available configurations. The process is repeated until the required number of configurations is reached. Once a centroid is added to the set of available configurations, it can itself be used to create the next centroids.
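A minimal sketch of this centroid-based augmentation (numpy, synthetic data): centroids of randomly chosen pairs of minority-class configurations are added, and each new centroid becomes available as a parent for later centroids, until both classes have the same size.

# Sketch of the centroid-based data augmentation: add midpoints of random
# pairs of non-acceptable configurations until both classes have equal size.
# Synthetic data; the ~10% minority ratio mirrors our dataset.
import numpy as np

def balance_with_centroids(X, y, minority_label=-1, seed=0):
    rng = np.random.default_rng(seed)
    minority = list(X[y == minority_label])
    n_needed = int(np.sum(y != minority_label)) - len(minority)
    if n_needed <= 0:
        return X, y
    seen = {tuple(x) for x in minority}
    added = []
    while len(added) < n_needed:
        i, j = rng.choice(len(minority), size=2, replace=False)
        centroid = (minority[i] + minority[j]) / 2.0
        if tuple(centroid) not in seen:        # never-seen-before configuration
            seen.add(tuple(centroid))
            minority.append(centroid)          # reusable as a parent for later centroids
            added.append(centroid)
    X_aug = np.vstack([X, np.array(added)])
    y_aug = np.concatenate([y, np.full(len(added), minority_label)])
    return X_aug, y_aug

rng = np.random.default_rng(5)
X = rng.random((1000, 8))
y = np.where(rng.random(1000) < 0.1, -1, +1)
X_aug, y_aug = balance_with_centroids(X, y)
print("non-acceptable before:", np.sum(y == -1), "after:", np.sum(y_aug == -1),
      "acceptable:", np.sum(y_aug == +1))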

In the remainder, we present the results with both original and balanced data sets in order to assess the impact of class imbalance on adversarial attacks. We configured our evasion attack generator with the following settings: i) we set the number of attack points to generate to configurations for RQ1 and configurations for RQ2, as explained hereafter; ii) the considered step size (t) values are successive powers of 10; iii) the number of displacements is fixed to , or . To mitigate randomness, we repeat the experiments ten times. All results discussed in this paper can also be found on our companion webpage (https://github.com/templep/SPLC_2019).

4.3. Results

4.3.1. RQ1: How effective is our adversarial generator to synthesize adversarial configurations?

To answer this question, we assess the number of wrongly classified adversarial configurations over generations (and about configurations when the training set is balanced) and compare them to a random baseline: there is, to the best of our knowledge, no comparable evasion attack.

RQ1.1: Can we generate adversarial configurations that are wrongly classified?

(a) Number of misclassified adversarial configurations (20 displacements)
(b) Number of misclassified adversarial configurations (100 displacements)
Figure 11. Number of successful attacks on the class acceptable; the X-axis represents different step size values while the Y-axis is the number of adversarial configurations misclassified by the classifier. For each value, results with balanced and imbalanced training sets are shown (in blue and orange, respectively).

For each run, a newly created adversarial configuration (i.e., once the maximum number of displacements is reached) is added to the set of initial configurations that can be selected to start an evasion attack. We thereby give previous adversarial configurations a chance to continue their displacement towards the global optimum of the gradient descent.

Figure 11 shows box-plots resulting from ten runs for each attack setting. We also show results when the training set is imbalanced (i.e., using the previous training set containing configurations with about of non-acceptable configurations) and when it is balanced (i.e., increasing the number of non-acceptable configurations using the data augmentation technique described above). Both Figure 11(a) and Figure 11(b) indicate that we can always achieve 100% of misclassified configurations with our attacks. Regarding Figure 11(a), all generated configurations become misclassified when the step size is set to or a higher value. When displacements are allowed (see Figure 11(b)), the limit appears earlier, i.e., when t equals . Similar results are obtained when the maximum number of displacements is set to ; the only difference is that with t set to , not all adversarial configurations are misclassified, but about () when the training set is imbalanced and about () with a balanced set.

Discussion. Increasing the number of displacements requires lower step sizes to reach the misclassification goal, but it comes at the cost of more computation. However, increasing the number of displacements when the step size is already large results in extremely large displacements, leading to invalid configurations in the MOTIV case.

RQ1.2: Are all generated adversarial configurations valid w.r.t. constraints in the VM?

As discussed in Section 3, we perform a basic type check on features. However, this check does not cover specific constraints such as cross-tree ones. To ensure the full compliance of our adversarial configurations, we run the analysis realised by the MOTIV video generator. This includes, among others, checking the correctness of feature values with respect to their specified intervals.
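Conceptually, the per-feature part of this check is a domain check of the following kind (hypothetical feature domains; the real validation is performed by the MOTIV tool chain and also covers cross-tree constraints):

# Simplified sketch of a per-feature domain check on adversarial
# configurations. The domains below are hypothetical; the real validation
# is done by the MOTIV generator and also covers cross-tree constraints.
from typing import Dict, Tuple

# feature name -> (lower bound, upper bound, is_integer)
DOMAINS: Dict[str, Tuple[float, float, bool]] = {
    "noise_level": (0.0, 1.0, False),
    "blur": (0.0, 1.0, False),
    "nb_distractors": (0, 10, True),
}

def is_valid(config: Dict[str, float]) -> bool:
    for name, value in config.items():
        low, high, integer = DOMAINS[name]
        if not (low <= value <= high):
            return False
        if integer and value != int(value):
            return False
    return True

print(is_valid({"noise_level": 0.3, "blur": 0.9, "nb_distractors": 4}))   # True
print(is_valid({"noise_level": 1.7, "blur": 0.9, "nb_distractors": 4}))   # False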

Figure 12. Number of valid attacks on the class acceptable; the X-axis represents different step size values; the Y-axis reports the number of valid configurations. In red and blue are, respectively, results with an imbalanced and a balanced training set in terms of class representation.

Figure 12 shows on the X-axis the different step sizes while the Y-axis depicts the number of adversarial configurations that are valid w.r.t. constraints. Regardless of the number of displacements and of whether the training set is balanced, all results are the same, except for one difference in Figure 12 for a displacement step size of . One possible explanation is that when the training set is balanced, more configurations can be taken as a starting point of the evasion algorithm: the gradient descent procedure might lead the current attack towards a slightly different area in which configurations remain valid. Overall, regardless of the number of authorized displacements, we can see a clear drop in the number of valid configurations from to between step sizes and .

Discussion: We thus can scope the parameters such that adversarial configurations are both successful and valid: when the step size is set to a value between and , regardless of the number of displacements. Increasing the step size leads to invalid configurations, while with smaller step sizes, adversarial configurations have not moved enough to cross the separation of the classifier (leading to unsuccessful attacks).

RQ1.3: Is using the evasion algorithm more effective than generating adversarial configurations with random displacements? The results of RQ1.1 and RQ1.2 show that we are able to craft valid adversarial configurations that are misclassified by the ML classifier, but is our algorithm better than a random baseline? The baseline algorithm consists in: i) for each feature, choosing randomly whether to modify it; ii) choosing randomly whether to follow the slope of the gradient or to go against it (the role of the ‘-’ sign in line 5 of Algorithm 1, which can be changed into a ‘+’); iii) choosing randomly a degree of displacement (in place of the gradient-based term u of line 5 in Algorithm 1). Both the step size and the number of displacements are the same as in the previous experiments.
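A sketch of this random baseline (numpy, illustrative values): each feature is randomly selected for modification or not, the sign of the move is chosen at random, and the magnitude is drawn at random instead of following the gradient.

# Sketch of the random baseline: random subset of features, random sign and
# random magnitude, with the same step size and number of displacements as
# the evasion attack. Illustrative values; not the actual experiment scripts.
import numpy as np

def random_attack(x0, t, nb_disp, lower=0.0, upper=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for _ in range(nb_disp):
        mask = rng.random(x.shape[0]) < 0.5              # i) which features to modify
        sign = rng.choice([-1.0, 1.0], size=x.shape[0])  # ii) follow or oppose the '-' of line 5
        magnitude = rng.random(x.shape[0])               # iii) random degree of displacement
        x = x - t * mask * sign * magnitude
        x = np.clip(x, lower, upper)                     # same basic domain fixing as before
    return x

x0 = np.array([0.2, 0.8, 0.5, 0.1])
print(random_attack(x0, t=0.05, nb_disp=20))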

(a) Number of successful random attacks after 20 displacements
(b) Number of successful random attacks after 100 displacements
Figure 15. Number of successful random attacks on the class acceptable; the X-axis represents different step size values while the Y-axis is the number of adversarial configurations misclassified by the classifier. In red and blue are, respectively, results with an imbalanced and a balanced training set in terms of class representation.

Figure 15 shows the ability of random attacks to successfully mislead the classifier. Random modifications are not able to exceed misclassifications (regardless of the number of displacements, the step size or whether the training set is balanced or not), which corresponds to more than half of the generated configurations but is a lower effectiveness than that of our evasion attack. The maximum number of misclassified configurations after random modifications is reached from step size onwards, regardless of the studied number of displacements.

Considering the validity of these configurations, results are similar to what can be observed in Figure 12. The only difference is that the transition from to in the number of valid configurations is smoother and happens when t is in .

Discussion: The previous results lead us to state that evasion attacks are more effective than random modifications since i) evasion attacks are able to craft configurations that are always misclassified by the ML classifier, while fewer than out of generated configurations are misclassified using random modifications; ii) evasion attacks support a larger set of parameter values for which the generated configurations are valid; iii) we were able to identify sweet spots for which evasion attacks generate configurations that are both misclassified and valid.

RQ1.4: Are attacks effective regardless of the targeted class? Previously, we generated evasion attacks from the class non-acceptable and tried to make them acceptable for the ML classifier; but is our attack symmetric? We thus configure our adversarial configuration generator so that it moves configurations from the class +1 (acceptable configurations) to the class -1 (non-acceptable).

(a) Number of successful adversarial attacks after 20 displacements
(b) Number of successful adversarial attacks after 100 displacements
Figure 18. Number of successful adversarial attacks on class non-acceptable; X-axis represents different step size values while Y-axis is the number of misclassified adversarial configurations by the classifier; In orange and blue are respectively shown results when the training set is not balanced and when it is.

Overall, the attack is symmetric: all generated adversarial configurations can be misclassified. Figure 18(a) shows that all generated configurations are misclassified when the step size is set to or higher with a number of displacements of , while, when the number of displacements is set to (see Figure 18(b)), the step size can be set to or higher. These observations are the same regardless of the balance in the training set.

Regarding the validity of adversarial configurations, a transition from to can still be observed. However, when the number of displacements is set to or when the training set is balanced, the transition is abrupt and occurs when the step size belongs to the range. With a higher number of displacements (i.e., and , and no balance), the transition is smoother but happens with smaller step sizes (i.e., with t in between and ). Adversarial configurations can be generated regardless of the targeted class, even if targeting the least represented class seems promising.

Our generated adversarial attacks are effective (always misclassified, RQ1.1), do not depend on the target class (RQ1.4) and yield valid configurations (RQ1.2). In contrast, our random baseline was only able to achieve of effectiveness at best (RQ1.3). The balance in the training set does not affect these results, and the targeted classes show the same trends despite small differences (RQ1.4).

4.3.2. RQ2: What is the impact of adding adversarial configurations to the training set regarding the performance of the classifier?

In our previous experiments, we only evaluated the impact of generated attacks on the test set. Yet, some ML techniques (e.g., GANs) take advantage of adversarial instances by incorporating them in the training set to improve the classifier confidence and possibly its performance. In our context, we want to assess the impact of our attacks when our classifier includes them in the dataset used during the training phase, especially with less "aggressive" configurations of the attacks (e.g., small step sizes and a low number of displacements).

To do so, we allowed displacements in order to avoid configurations moving too far from their initial positions, and we restricted the step size to every power of 10 between and . For each step size, we generate adversarial configurations that are added all at once to the training set; we then retrain the classifier and evaluate it on the configurations that constitute the initial test set (without any adversarial configurations in it). Every retraining process was repeated ten times in order to mitigate the effects of the random configuration selection and of the starting configurations. We also present results when the training set is balanced, in which case we have also augmented the test set to bring balance and to follow the same data distribution. In this case, the test set does not contain configurations but about , in which of the configurations are considered acceptable and the remaining are considered non-acceptable.
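The retraining protocol can be sketched as follows (toy data, a simplified stand-in for the attack of Algorithm 1, and illustrative step sizes; the label given to adversarial configurations, here the class they start from, is one possible choice):

# Sketch of the RQ2 protocol: append adversarial configurations to the
# training set, retrain the SVM, and measure accuracy on the untouched test
# set. Data, step sizes and counts are illustrative; make_adversarial is a
# simplified stand-in for the attack of Algorithm 1.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.random((2000, 5))
y = np.where(X[:, 0] + X[:, 3] > 1.0, -1, +1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75,
                                           stratify=y, random_state=0)

def make_adversarial(svm, X_start, t, nb_disp=20):
    """Simplified evasion step: move nb_disp * t along -gradient of g."""
    u = svm.coef_[0] / np.linalg.norm(svm.coef_[0])
    return np.clip(X_start - nb_disp * t * u, 0.0, 1.0)

baseline = SVC(kernel="linear").fit(X_tr, y_tr)
print("baseline accuracy:", round(baseline.score(X_te, y_te), 3))

for t in [1e-3, 1e-2, 1e-1]:                    # illustrative step sizes
    X_adv = make_adversarial(baseline, X_tr[y_tr == +1][:50], t)
    y_adv = np.full(len(X_adv), +1)             # assumption: keep the starting-class label
    retrained = SVC(kernel="linear").fit(np.vstack([X_tr, X_adv]),
                                         np.concatenate([y_tr, y_adv]))
    print(f"t={t:g}: accuracy after retraining = {retrained.score(X_te, y_te):.3f}")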

Figure 19. Accuracy of the classifier after retraining with adversarial configurations in the training set, over a test set of configurations ( configurations when the training set is balanced). In red are results when no balance is forced between the classes; in blue, both training and test sets are balanced. The initial accuracy of the classifier is represented by the horizontal line ( for the red line and for the blue one). The X-axis represents different step size values while the Y-axis is the accuracy of the classifier (zoomed between and ).

Figure 19 shows the accuracy of the retrained classifiers over a test set composed of configurations for the red part and configurations for the blue one.

The initial accuracy of the classifier was over the same configurations and is shown as the horizontal red line. We make the following observations: i) using adversarial configurations in the training set, even with low step sizes, tends to decrease the accuracy of the retrained classifier; ii) starting from step sizes of , every run gives the same result.

Specifically, with the step size equal to , the median of the boxplot is very close to the initial accuracy (i.e., ) of the classifier and the interquartile range is small, suggesting that the impact of adding adversarial configurations to the training set is marginal. Between and , the median slightly decreases and the interquartile range tends to increase. At , the accuracy of the classifier drops to : the adversarial configurations are especially effective, forcing the ML classifier to change its separation drastically, resulting in many prediction errors. The last two step sizes show that all the runs with the same step size give the same results in terms of accuracy. Focusing on these runs, the adversarial configurations had features with the same values: the amounts of heat haze, blur, compression artefacts and static noise are all equal across the adversarial configurations. All of these features are directly related to the global quality of images and are key for the classifier accuracy. We explain the evolution of the classifier’s accuracy as a combination of the contribution of the most important features and of the constraints of the VM. For low step sizes (), displacements are modest and therefore perturbations are very limited, though slightly observable. The sweet spot is at : the resulting displacement is large enough to move feature values so that the associated configurations move effectively towards the separation and fool the classifier. We computed the means and standard deviations of features between the initial and adversarial configurations, and their differences witness the impact of adversarial configurations on the classifier. For larger values of (i.e., to ), these features lose their impact because their values are limited by constraints (so that they do not exceed the bounds).
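The per-feature analysis mentioned above boils down to comparing the distribution of each feature before and after the attack; a sketch with pandas and hypothetical arrays follows (the feature names mirror the options discussed above, but the values are synthetic):

# Sketch of the per-feature shift analysis: compare mean and standard
# deviation of each feature before and after the attack. Arrays are
# hypothetical stand-ins for the initial and adversarial configurations.
import numpy as np
import pandas as pd

feature_names = ["heat_haze", "blur", "compression_artefact", "static_noise"]
rng = np.random.default_rng(7)
initial = rng.random((100, 4))
adversarial = np.clip(initial + rng.normal(0.2, 0.05, size=(100, 4)), 0.0, 1.0)

shift = pd.DataFrame({
    "mean_before": initial.mean(axis=0),
    "mean_after": adversarial.mean(axis=0),
    "std_before": initial.std(axis=0),
    "std_after": adversarial.std(axis=0),
}, index=feature_names)
shift["mean_shift"] = shift["mean_after"] - shift["mean_before"]
print(shift.round(3))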

In the case where training and test sets are balanced (in blue in Figure 19), results follow the same tendency. Since most of the configurations added to provide balance are well classified, the accuracy is a bit higher than in the non-balanced setting. Values remain close to the baseline; however, when , results are worse than for the other executions, as in the non-balanced setting. Yet, we cannot conclude on the classifier's robustness, and more experiments should be conducted to take into account the fact that the balanced and non-balanced datasets do not contain the same number of configurations.

Our attacks cannot improve the classifier's accuracy but can make it significantly worse: adversarial configurations over can make the accuracy drop by . Successful attacks also pinpoint visual features that do influence the videos' acceptability and that do make sense from the SPL perspective (computer vision).

5. Threats to validity

Internal threats. The choice of parameter values for our experiments may constitute a threat. The step size has been set to different powers of ; we only used different numbers of allowed displacements (i.e., , and ). From our perspective, using step sizes of in a highly dimensional space seems extremely small while, on the contrary, step sizes of are tremendously large, which motivates our choice not to go beyond these boundaries. However, the lower boundary could have been extended, which might have affected the results regarding RQ2. Still, given the design of our attack generator, it is likely that the performance of the classifier would never have increased. Regarding the number of displacements, we could have used finer-grained values. We sought a compromise between allowing a lot of small steps and a few big steps. Regarding the choice of evasion attacks, as presented in Section 2, several other attack techniques exist. Evasion attacks showed interesting results and open new perspectives that we discuss in Section 6.

We rely on centroids to deal with class imbalance (see Section 4.2). The centroid method has pros and cons: centroids are easy and quick to compute, and new configurations tend to follow the same distribution as they result in more densely populated clusters and, on rare occasions, make clusters expand a little bit. However, new configurations may not be realistic, and they do not provide much diversity: centroids, by definition, lie in the middle of a cluster of points. Since our goal is only to limit imbalance in the available configurations while maintaining their initial distribution, this technique is appropriate. However, we are aware that other data augmentation techniques could be used.

External threats. We only assessed our adversarial attack generator on one case study, namely MOTIV. Yet, MOTIV is a complex and industrial case exhibiting various challenges for SPL analysis, including heterogeneous features, a large variability space and non-trivial non-functional aspects. The x264 encoder has been studied (e.g., (DBLP:conf/sigsoft/NairMSA17)) but is relatively small in terms of features (only 16 were selected), heterogeneity (only Boolean features) and number of configurations. It can nevertheless be a candidate for replicating our study. Our adversarial approach is not specific to the video domain and is, in principle, applicable to any SPL. Generating adversarial configurations without taking all constraints of the variability model into account directly in the attack algorithm may threaten the applicability of our approach to other SPLs. Calls to SAT/SMT solvers are impractical due to feature heterogeneity and the frequency of validity checks. Benchmarks of large and real-world feature models can be considered if we are only interested in sampling aspects (DBLP:conf/se/KnuppelTMMS18; Siegmund:2017:AVM:3106237.3106251). Finally, open-source configurable systems like JHipster (DBLP:journals/ese/HalinNADPB19) can be of interest to study non-functional properties like binary sizes or testing predictions. We also considered accuracy as the main performance measure. Accuracy is the standard measure used in the advML literature (barreno2006can; nelson2008; biggio2013poisoning; biggio2014security; biggio2014pattern; gan2014) to assess the impact of attacks.

6. Discussions

Adversarial configurations pinpoint areas of the configuration space where the ML classifier fails or has low confidence in its prediction. We qualitatively discuss what the existence of adversarial configurations suggests for an SPL and to what extent the knowledge of adversarial configurations is actionable for MOTIV developers.

#1 Adversarial training. A first reaction is simply to seek improvements of the ML classifier, making it more robust to attacks. Previous work on advML (biggio2018wild; barreno2006can; guo2017countering; dhillon2018stochastic; madry2017towards) proposed different defense strategies in the presence of adversarial examples. Adversarial training is a specific category of defense: the training sample is augmented with adversarial examples to make the ML model more robust against them. In our case study, it consists in applying our attack generator and re-injecting adversarial configurations into the original training set. We saw in RQ2 that, when adversarial configurations are introduced in the training set, even moderately aggressive attacks affect the ML classifier's performance. Our adversarial training is not adequate: our adversarial generator has simply not been designed for this defensive task and rather excels at finding misclassifications. This opens two perspectives. The first is to apply other, more effective defense mechanisms (manifold projections, stochasticity, preprocessing, etc. (biggio2018wild; barreno2006can; guo2017countering; dhillon2018stochastic; madry2017towards)). The second and most important perspective is to leverage adversarial ML knowledge to improve the SPL itself with "friendly" rather than malicious attacks; fooling the classifier is a means to this end.

#2 Improvement of the testing oracle. The labelling of videos as acceptable or non-acceptable – the testing oracle – is approximated by the ML classifier. If the oracle is not precise enough, it is likely that the approximation performs even worse. In the MOTIV case, oracles are an approximation of the human perception system, which in turn could be seen as an approximation of the real separation between acceptable and non-acceptable images for a specific task. Object recognition should potentially work on an infinite number of input images, which makes the construction of a "traditional" oracle (a function that is able to give the nature of every single input) challenging. Testing oracles for an SPL are programs that may fail on some specific configurations. Adversarial configurations can lead to "cases" (videos) for which the oracle has not been designed and tested, and may provide insights to improve such oracles.

MOTIV's developers may revise the visual assessment procedure to determine what a video of sufficient quality means (IQA; temple:hal-01323446). Adversarial configurations can help understand the bugs (if any) of this procedure over specific videos (see Figure 8). Based on this knowledge, a first option is to fix the procedure – adversarial configurations would then act as good test cases for ensuring non-regression of the oracle. In our context, one can also envision crowd-sourcing the labelling effort (e.g., with Amazon Mechanical Turk (Mechanical_Turk)). However, asking human beings to check whether a video is acceptable or not is costly and hardly scalable – we have derived more than videos. Crowd-sourcing is also prone to errors made by humans due to fatigue or disagreements on the task. To decrease the effort, adversarial configurations can be specifically reviewed as part of the labelling process. An open problem is to find a way to control adversarial displacements such that we are able to ensure that the generated adversarial configuration does not cross the ML separation. This level of control is left for future work. Overall, the choice of an adequate testing oracle strategy in the MOTIV case is beyond the scope of this paper. Several factors are involved, including cost (e.g., manually labelling videos has a significant cost) and reliability.

#3 Improvement of the variability model. While generating adversarial configurations, SPL practitioners can gain insights on whether the feature model is under- or over-constrained. Looking at the modified features of adversarial configurations (see RQ2), practitioners can observe that the same patterns arise, involving some features or combinations of features. Such behavior typically indicates that constraints are missing: some configurations are allowed although they should not be, but this was never specified in the variability model. Conversely, adversarial configurations can also help identify which constraints can be relaxed.

#4 Improvement of the variability implementation. Features of MOTIV are implemented in Lua (Ierusalimschy:2006:PLS:1200583). An incorrect implementation can be the cause of non-acceptable configurations, either because of bugs in individual features or because of undesired feature interactions. In the case of MOTIV, we did not find variability-related bugs. We rather considered that the cause of non-acceptable videos was the variability model and that the solution was to add constraints preventing them.

7. Related Work

Our contribution is at the crossroad of (adversarial) ML, constraint mining, variability modeling, and testing.

Testing and learning SPLs. Testing all configurations of an SPL is most of the time challenging and sometimes impossible, due to the exponential number of configurations (thuem2014; MKRGA:ICSE16; Legay:2017:QRP:3023956.3023970; terBeek:2019:QVM:3302333.3302349; DBLP:conf/splc/VarshosazATRMS18; halin:hal-01829928). ML techniques have been developed to reduce the cost, time and energy of deriving and testing new configurations using inference mechanisms. For instance, regression models can be used to predict the performance of configurations that have not been generated yet (SGKA:ESECFSE15; guo2015; guo2013; DBLP:conf/isola/BeekFGS16; siegmund2013; DBLP:conf/sigsoft/OhBMS17). In (temple:hal-01323446; DBLP:journals/software/TempleAJB17), we proposed to use supervised ML to discover and retrieve constraints that were not originally expressed in a variability model. We used decision trees to create a boundary between the configurations that should be discarded and the ones that are allowed. In this paper, we build upon these previous works and follow a new research direction with SVM-based adversarial learning.

Siegmund et al. (Siegmund:2017:AVM:3106237.3106251) reviewed ML approaches on variability models and proposed THOR, a tool for synthesizing realistic attributed variability models. An important issue in this line of research is to assess the robustness of ML on variability models. Our work specifically aims to improve ML classifiers for SPLs. None of these works uses adversarial ML nor studies the possible impact that adversarial configurations could have on predictions.

Adversarial ML can be seen as a set of security assessment and reinforcement techniques helping to better understand flaws and weaknesses of ML algorithms. Typical scenarios in which adversarial learning is used are network traffic monitoring, spam filtering, malware detection (barreno2006can; biggio2013poisoning; biggio2014security; biggio2014pattern; biggio2013evasion; biggio2012poisoning) and, more recently, autonomous cars and object recognition (deeproad2018; deepXplore2017; canFoolBothGoodfellow2018; limitDLPapernot2016; accessorize2016; advInPhysical2016; physWorldAttacks2017). In such works, the authors suppose that a system uses ML to perform a classification task (e.g., differentiating spam from non-spam emails) and that malicious people try to fool this classification system. These attackers can have knowledge of the system, such as the dataset used, the kind of ML technique employed, the description of the data, etc. The attack then consists in crafting a data point in the description space that the ML algorithm will misclassify. Recent work (gan2014) used adversarial techniques to strengthen the classifier by specifically creating data that would induce such misclassifications. In this paper, we propose a similar approach adapted to SPL engineering: adversarial techniques may be used to strengthen the SPL (including the variability model, the implementation and the testing oracle over products) while analyzing a small set of configurations. To our knowledge, no adversarial technique has been experimented with in this context.

8. Conclusion

Machine learning techniques are increasingly used in software product line engineering as they are able to predict whether a configuration (and its associated program variant) meets quality requirements. ML techniques can make prediction errors in areas where the confidence in the classification is low. We adapted adversarial techniques to our MOTIV case and generated successful and valid attacks that can fool a classifier with a low number of adversarial configurations and decrease its performance by . The analysis of the attacks exhibits the influence of important features and of variability model constraints. This is a first and promising step towards using adversarial techniques as a novel framework for quality assurance of software product lines. As future work, we plan to compare adversarial learning with traditional learning and sampling techniques (e.g., random, t-wise). More generally, we want to use adversarial ML to support quality assurance of SPLs.

References
