A case study: Influence of dimension reduction on regression trees-based algorithms - Predicting aeronautics loads of a derivative aircraft
Abstract
In the aircraft industry, market needs evolve quickly in a highly competitive context. This requires adapting a given aircraft model in minimum time, considering for example an increase in range or in the number of passengers (cf. the A330 NEO family). The computation of loads and stress to resize the airframe is on the critical path of this aircraft variant definition: it is a time-consuming and costly process, one of the reasons being the high dimensionality and the large amount of data. This is why Airbus has been investing for a couple of years in Big Data approaches (from statistical methods up to machine learning) to improve the speed, the data value extraction and the responsiveness of this process. This paper presents recent advances in this work, made in cooperation between Airbus, ENAC and Institut de Mathématiques de Toulouse in the framework of a proof of value sprint project. It compares the influence of three dimensional reduction techniques (PCA, polynomial fitting, and both combined) on the extrapolation capabilities of regression trees-based algorithms for loads prediction. It shows that AdaBoost with Random Forest offers promising results on average, in terms of accuracy and computational time, for estimating loads when a PCA is applied only on the outputs.
Keywords: Regression trees, Aeronautics, Dimensional reduction, Extrapolation
MSC Classification: 62J02, 62-07, 63P30
1 Introduction
In the aircraft industry, market needs evolve quickly in a highly competitive context. This requires adapting a given aircraft model in minimum time, considering for example an increase in range or in the number of passengers, as for the A330 family in [air]. In our case study, variants concern the maximum takeoff weight of a given aircraft model. Depending on the configuration, the computation of loads and stress, as defined in [aiaa88, hje05], to resize the airframe is on the critical path of this aircraft variant definition: it is a time-consuming (approximately a year for a new aircraft variant) and costly process, one of the reasons being the high dimensionality and the large amount of data. Big Data approaches such as those defined by [gan07] are mandatory to improve the speed, the data value extraction and the responsiveness of the overall process. This study was carried out during a proof of value sprint project within Airbus to demonstrate the usefulness of statistics and machine learning approaches in the Engineering field. In a previous internal project, it was shown that the family of regression trees [brei84] works well to predict loads for different aircraft missions in an interpolation context. Thus, we can formulate our problem as follows: is it possible to use dimensional reduction and regression trees-based algorithms to predict loads in an extrapolation context (i.e. outside the design space of a certain weight variant) in order to improve the current process?
1.1 Industrial context
An airframe structure is a complex system and its design is a complex task which today involves many simulation activities generating massive amounts of data. Such is the case for the process of loads and stress computations for an aircraft (that is to say, the calculation of the forces and mechanical strains undergone by the structure), which can be represented as follows:
The overall process shown in Figure 1 is run to identify the load cases (i.e. aircraft missions and configurations: maneuvers, speed, loading, stiffness…) that are critical in terms of stress endured by the structure and, of course, the parameters which make them critical. The final aim is to size and design the structure (and potentially to reduce loads in order to reduce the weight of the structure). Typically, for an overall aircraft structure, millions of load cases can be generated, and for each of these load cases millions of structural responses (i.e. how structural elements react under such conditions) have to be computed. As a consequence, computational times can be significant.
For a derivative aircraft, we can give some rough order of magnitudes in terms of quantities of produced data: External loads ( of bytes); Weights: number of elements ( of bytes); Internal loads: number of components by the number of external loads by the number of elements ( of bytes); Reserve Factors: number of internal loads by the number of failure modes ( of bytes). Hence, we easily reach to of bytes for a single derivative aircraft.
In an effort to continuously improve methods, tools and ways of working, Airbus has invested heavily in digital transformation and in the development of infrastructures able to process data (newly or already produced). The objective here is to exploit and adapt Machine Learning and optimization tools in the right places of the computational process. As pointed out by [man11], these techniques cover a large number of fields such as the Internet and Business Intelligence, but they can also benefit the manufacturing industry (here aeronautics). The main industrial challenge for Airbus is to reduce lead time in the computation of loads and the preliminary sizing of an airframe.
1.2 A simplistic load and stress model computation process example
In order to illustrate the process exposed in the previous subsection, let us consider a simplistic load model completed with equations calculating thickness used to correct the weight distribution of a wing structure similar to [doh09].
The structure contains a fuel tank at the wing tip with the dimensions Lf, Ctf, Cof, as shown in Figure 2. The length of the wing is L, the chord length at the wing root is Co and at the tip Ct. Three different types of loads affect the wing: the aerodynamic lift (i.e. the force which allows the aircraft to lift off and to maintain altitude), which depends on the length of the wing, the load factor and the total weight of the aircraft; the loads due to the fuel and the fuel tank, which depend on the fuel weight and the dimensions of the fuel tank; and the loads due to the wing structure, which depend on the weight and the dimensions of the wing. Adding these three types of loads, given the weight of the wing structure, the weights of the tank and of the fuel it contains, the total weight of the aircraft and the load factor, provides the basis for calculating the shear force (transverse, nearly vertical forces arising from aerodynamic pressure and inertia) and the bending moment (resulting from the shear forces) of the wing. The relations between these quantities are:
,
,
where is the position along the wing. We consider that the wing is represented by a simplified rectangular box schematized by two parallel panels representing the covers (see Figure 3) : This is enough to distribute the fluxes induced by the bending moment.
We can complete equations calculating thickness. Indeed, by considering the box has height supposed linearly decreasing along the span, considering we must not exceed an allowable of tension and compression. Considering the fluxes in the wing covers are given by thus we have the thickness distribution defined by:
And by integrating we get the weight of the cover given by:
Indeed, by considering that the wing takes the form of a box presented in Figure 3, by integrating along and by multiplying by , where is the density of the material used to fabricate the wing panels, we get the weight of the wing cover. More precisely, we obtain the minimum weight of the wing cover able to resist an allowable tension and compression. We assume that , then we can extract the minimum weight of the wing structure able to resist an allowable tension and compression.
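The chain described in this subsection (distributed load, shear force, bending moment, cover thickness, cover weight) can be sketched numerically. The sketch below is illustrative only and rests on stated assumptions: the standard beam relations $V(x) = \int_x^L q(s)\,ds$ and $M(x) = \int_x^L V(s)\,ds$, a uniform net load $q$, a cover flux equal to $M / (c\,h)$ for a chord $c$ and a box height $h$, a thickness equal to the flux divided by the allowable stress, and two identical covers. All function names and numerical values are hypothetical and do not come from the study.

```python
import numpy as np


def integrate_to_tip(x, f):
    """Trapezoidal integral of f from each x[i] to the wing tip x[-1]."""
    steps = 0.5 * (f[1:] + f[:-1]) * np.diff(x)
    from_root = np.concatenate(([0.0], np.cumsum(steps)))
    return from_root[-1] - from_root


def wing_cover_sizing(span=30.0, n_points=301, net_load=4.0e4,
                      root_chord=8.0, tip_chord=3.0,
                      root_height=1.2, tip_height=0.4,
                      sigma_allow=3.0e8, rho=2800.0):
    """Hypothetical sizing chain; every default value is an assumption."""
    x = np.linspace(0.0, span, n_points)       # position along the wing [m]
    q = np.full_like(x, net_load)              # assumed uniform net load [N/m]
    shear = integrate_to_tip(x, q)             # shear force V(x) [N]
    moment = integrate_to_tip(x, shear)        # bending moment M(x) [N.m]

    # Chord and box height assumed to decrease linearly along the span.
    chord = root_chord + (tip_chord - root_chord) * x / span
    height = root_height + (tip_height - root_height) * x / span

    flux = moment / (chord * height)           # load per unit chord [N/m]
    thickness = flux / sigma_allow             # minimum cover thickness [m]
    # Two covers; the result is a mass in kg because rho is in kg/m^3.
    cover_mass = 2.0 * rho * integrate_to_tip(x, chord * thickness)[0]
    return {"x": x, "shear": shear, "moment": moment,
            "thickness": thickness, "cover_mass": cover_mass}


result = wing_cover_sizing()
```

With a uniform load the trapezoidal rule is exact, so the root values can be checked by hand: the root shear equals the load times the span, and the root bending moment equals the load times the squared span divided by two.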
1.3 Data presentation
The data at our disposal are the aircraft parameters (features) used in the computing chain for calculating loads (outputs, which correspond to moments and forces). We have data coming from the 238-ton weight variant (aircraft parameters and loads distribution along the wing), and we would like to predict those of the 242t and other weight variants (247t and 251t). All these datasets have been previously computed, and we use them to assess the capability of the methods defined in the following sections to predict loads in such a context. By doing so, we hope to answer the question: "What would the results have been if we had applied such a methodology to calculate the loads instead of the normal process for new weight variants?".
Twenty-five aircraft (A.C.) parameters play the role of features (lying in $\mathbb{R}^{25}$) of a load case, and we would like to predict the associated loads (outputs). To simplify, we will focus on predicting the bending moment along the wing which is, in our data, represented by a vector of size 29. In other words, each load case (i.e. observation) is defined by its 25 features and its bending moment (output). The features are used to identify a typical aircraft event (maneuvers, gusts, continuous turbulence) with specific aerodynamic and weight conditions. Gusts are loads produced by environmental perturbations: sudden vertical or lateral wind blasts, which are required by certification bodies such as EASA and derived from statistical meteorological histories. Continuous turbulence cases are linked to the cumulative energy stored by the structure under a spectrum of random gusts. A typical maneuver is a 2.5g pull-up, which consists in increasing the aerodynamic lift by deflecting the elevator and increasing the angle of attack of the aircraft. This gives a bending moment close to the maximum value, in competition with gust cases. The database consists mainly of gusts (90% of all load cases) and we will focus on them. To begin, we shall focus on the 238t and 242t data before generalizing our results to other weight variants. A quick summary of the size of our different datasets is presented in Table 1:
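|                         | 238t (Train & Test)  | 242t (Validation)    |
|-------------------------|----------------------|----------------------|
| Dimension data features | 28391 rows x 25 col. | 28391 rows x 25 col. |
| Dimension data outputs  | 28391 rows x 29 col. | 28391 rows x 29 col. |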
In a more formal way, let $X = (X^1, \dots, X^{25})$ be the 238t database of features, where each $X^j \in \mathbb{R}^n$ is a quantitative variable (i.e. an A.C. parameter) and $n = 28391$ is the number of load cases. The 238t database of outputs is then defined by $Y = (Y^1, \dots, Y^{29})$ with $Y^j \in \mathbb{R}^n$. The aircraft parameters X (inputs) at our disposal in the 238t training database are described in Table 2:
Contrary to the simplistic load calculation example, real simulations need much more information: the first ten variables are linked to the orientation of the ailerons, spoilers and rudder, which are directional control surfaces (see Figure 4); the x-location of the gravity center is an indicator of the location of the gravity center along the x-axis; the thrust is a calculated variable corresponding to the force which moves the aircraft forward (as opposed to the drag force); and the load factors are global indicators which express the "amount of loads" the structure can withstand. All these features are processed by dynamic flight equations that account for the flexible body behaviour of the aircraft through finite element models (Lagrange's equations): for further reading, we refer to [fp09].
The bending moment is calculated at 29 points along the wing: each point represents a station, and stations are not equidistant (two more stations are located in the center wing box; we prefer to focus here on the stations of the wing only). Thus $Y^j$ represents the values of the bending moment taken at the $j$-th station. Through a change of coordinate system (aircraft system to wing system), we can easily plot the bending moments (Figure 5):
1.4 Industrial problem
Aircraft (A.C.) have been developed for different maximum takeoff weights (the maximum takeoff weight being one of the many aircraft parameters used in the computing chain to calculate the loads). Because the computation process described above can easily take a year for a new aircraft variant (a new weight variant in our case), the use of metamodels, optimization and statistical approaches such as those defined by [gan07] is mandatory to improve the speed and responsiveness of the overall process.
From this standpoint, we can state the following problem: for each combination of A.C. parameters corresponding to a load case, each load case being categorized into a load condition (a family of load cases: gusts or maneuvers), can we give an estimation of the loads for different A.C. parameters for new weight variants (242t, 247t and 251t), knowing the loads of the 238t weight variant?
The mathematical problem of this project is an extrapolation problem. Is it possible to "extrapolate" the loads of the 242t, 247t and 251t, knowing the loads of the 238t, by using machine learning? To be more precise, can we find a function depending on aircraft parameters that allows us to estimate/extrapolate the loads of the 242t and other weight variants by learning from those of the 238t? In a previous project concerning loads, it was shown that the family of regression trees works well on the data we have to deal with. As a consequence, different algorithms based on decision trees will be investigated. Besides, because of the dimension of our outputs, how do dimensional reduction techniques affect the extrapolation capability of machine learning algorithms based on regression trees?
This paper is organized as follows: Section 2 is dedicated to the description of the three dimension reduction techniques used in our study. Then, in Section 3, we present the different algorithms based on regression trees, and finally we present our results in Section 4.
2 Three Dimensional Reduction Techniques
In order to improve the efficiency and speed of the modeling process, we compare several dimensional reduction techniques. We start by using a classical PCA on the inputs and also on the outputs. Then we consider a polynomial fitting, and finally we combine the two methods. Apart from the PCA on the inputs, these techniques reduce the dimension of the output space. Each technique is fitted on the 238t data and can be inverted, which allows us to come back easily to the original output space.
2.1 Principal Components Analysis
In a few words, Principal Components Analysis (PCA), developed by [pea01] and formalized by [hot93], is a statistical method used to compress a matrix of quantitative variables into a matrix of smaller rank. This method uses the variance-covariance matrix (or the correlation matrix) to extract important factors (few in general) in order to represent observations in a smaller subspace. As a consequence, each observation is represented by its coordinates on new components linked to these factors (this approach is closely related to the singular value decomposition).
We apply the PCA in the space defined by the outputs (centered and scaled to unit variance), and Figure 6 shows the decay of the variance explained by each component as well as the cumulative percentage of explained variance:
The study of the eigenvalues shows that the first six components explain 99.99% of the total variance. When we look more closely at the correlation of the original variables with the principal components, we see that all original variables have a similar correlation coefficient with the first two principal components.
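As an illustration of this step, the following sketch standardizes a set of output curves, looks at the cumulative explained variance, keeps the smallest number of components reaching a 99.99% threshold, and maps the reduced coordinates back to the original space. It uses scikit-learn; the curves are synthetic stand-ins (an assumption), so the number of components it retains has no reason to match the six reported above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_cases, n_stations = 500, 29
s = np.linspace(0.0, 1.0, n_stations)          # normalized station position

# Hypothetical smooth curves standing in for bending moments.
amp = rng.normal(1.0, 0.3, size=(n_cases, 1))
shape = rng.normal(0.0, 0.2, size=(n_cases, 1))
Y = amp * (1.2 - s) ** 2 + shape * s * (1.2 - s)
Y += rng.normal(0.0, 1e-4, size=Y.shape)       # small measurement-like noise

# Center and scale each station, then inspect the explained variance.
scaler = StandardScaler()
Y_std = scaler.fit_transform(Y)
full_pca = PCA().fit(Y_std)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 99.99%.
n_keep = min(int(np.searchsorted(cumulative, 0.9999)) + 1, n_stations)

pca = PCA(n_components=n_keep).fit(Y_std)
scores = pca.transform(Y_std)                  # reduced outputs

# Reverse the reduction to come back to the original output space.
Y_back = scaler.inverse_transform(pca.inverse_transform(scores))
rel_error = np.linalg.norm(Y - Y_back) / np.linalg.norm(Y)
```

The inverse mapping is what makes the reduction usable for prediction: a model can be trained on `scores` and its predictions converted back to one value per station.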
2.2 Polynomial fitting
As we can see in Figure 5, a discontinuity always appears at the 12th station along the wing. Besides, the curves we observe are extremely regular. Consequently, it seems reasonable to fit one polynomial on the first part of the curve and another on the second. In order to choose the degree of each polynomial properly, we assess the quality of the fit by calculating an R-squared score for each curve.
Thus, we consider that it exists a polynomial function of degree for each part of the curve such as:
The coefficients are obtained by minimizing the squared error by the least squares method.
To obtain an R-squared score greater than 99.9% for each curve while avoiding the overfitting caused by degrees that are too high, the degree is set to 2 for both polynomials. Each curve is then described by two sets of three coefficients, so the dimension of the output space is 6 instead of 29.
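A minimal sketch of this piecewise fit is given below, using NumPy least squares. It assumes that the first 12 stations form the first part of the curve and the remaining 17 the second (the exact split around the discontinuity is an assumption), and it works on synthetic curves with a jump at that position. The helper names are hypothetical.

```python
import numpy as np

SPLIT = 12    # assumed: the first 12 stations belong to the first polynomial
DEGREE = 2    # degree retained in the text for both polynomials


def fit_piecewise(stations, Y, split=SPLIT, degree=DEGREE):
    """Least-squares fit; returns 2 * (degree + 1) coefficients per curve."""
    left = np.polyfit(stations[:split], Y[:, :split].T, degree)
    right = np.polyfit(stations[split:], Y[:, split:].T, degree)
    return np.vstack([left, right]).T


def rebuild(stations, coefs, split=SPLIT, degree=DEGREE):
    """Evaluate both polynomials to come back to one value per station."""
    k = degree + 1
    left = np.array([np.polyval(c[:k], stations[:split]) for c in coefs])
    right = np.array([np.polyval(c[k:], stations[split:]) for c in coefs])
    return np.hstack([left, right])


def r2_per_curve(Y, Y_fit):
    """R-squared computed along each curve (one score per load case)."""
    ss_res = np.sum((Y - Y_fit) ** 2, axis=1)
    ss_tot = np.sum((Y - Y.mean(axis=1, keepdims=True)) ** 2, axis=1)
    return 1.0 - ss_res / ss_tot


# Synthetic curves with a jump after the assumed split (illustrative only).
rng = np.random.default_rng(0)
stations = np.linspace(0.0, 1.0, 29)
a = rng.uniform(0.5, 1.5, size=(200, 1))
jump = rng.uniform(0.05, 0.15, size=(200, 1))
Y = a * (1.0 - stations) ** 2 + jump * (np.arange(29) >= SPLIT)
Y += rng.normal(0.0, 1e-3, size=Y.shape)

coefs = fit_piecewise(stations, Y)             # shape (200, 6)
curve_scores = r2_per_curve(Y, rebuild(stations, coefs))
```

Checking `curve_scores.min()` against a threshold such as 0.999 is one way to apply the degree-selection rule described above.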
2.3 Polynomial fitting & Principal Components Analysis
By first applying polynomial fitting on the curves and then applying a PCA on the coefficients of the polynomials, we can further decrease the dimension of the output space from 6 to 4.
By keeping 4 principal components, the output space goes from 6 to 4 dimensions, and the precision is greater than 99.9% for at least 99% of the observations. The decay of the explained variance per component as well as the cumulative percentage of explained variance are shown in Figure 7:
In the following, we test the dimensional reduction techniques above and compare them with the case without dimensional reduction.
3 Regression based on Trees
In this section, different algorithms based on decision trees will be investigated. More precisely, Classification and Regression Trees have been the source of numerous ensemble methods such as Bagging, Random Forest, Gradient Boosting and AdaBoost, and we explain how they work on the data we deal with. Recall that we have at our disposal the 238t database of inputs $X = (X^1, \dots, X^{25})$, where the $X^j$ are quantitative variables (i.e. A.C. parameters), and that the outputs are defined by $Y = (Y^1, \dots, Y^{29})$. For each individual $i$, we observe a couple $(x_i, y_i)$ where $x_i \in \mathbb{R}^{25}$ and $y_i \in \mathbb{R}^{29}$. We thus have a sample of observations of size $n = 28391$. The aim is to explain Y by a function of X. For the sake of simplicity, we will consider the univariate regression of $Y^j$ (that is to say, the value of the bending moment at the $j$-th station) by a function of X.
3.1 Classification and Regression Trees (CART)
Classification and Regression Trees were formalized by [brei84] and are decision trees. They consist in approximating a function F such that $Y \approx F(X)$. This algorithm considers all of the 28391 observations and all of the 25 inputs. In non-technical terms, the algorithm partitions the data into smaller and smaller subsamples until all subsamples are homogeneous in terms of the output variable. Let us recall how the method works (see [brei84], [quin93]):
The construction of a tree is the successive partitioning of the observations, using the features, in the form of a sequence of nodes. At the beginning, the full dataset is attached to the initial node (also called the root) and is divided into two classes (two child nodes, left and right) according to a division criterion. Thus, each child node represents a subsample of the dataset of its parent node and, recursively, each child node gives rise to two other children; if a node has no child, it is a terminal node, also called a leaf. The observations belonging to a node must be as homogeneous as possible, and the two children of a node must be as heterogeneous as possible. At each node, a feature $X^j$ is selected and the algorithm finds the threshold of $X^j$ (by means of an impurity measure, also called heterogeneity function or split function) which leads to the most homogeneous children and the most heterogeneous pair of classes. The division criterion determines whether a node must be a leaf or not, and finally each leaf is associated with a value of Y.
A tree stops growing at a given node for two reasons: the subsample contains too few observations according to a threshold set by the user, or the sample attached to the node is homogeneous and no other division is acceptable (that is to say, the possible divisions lead to an empty child node). Figure 8 shows an example of the construction of a tree.
$t_0$ is the node containing all observations of X, and the other nodes or leaves contain a subsample of X. Let $t$ be a leaf containing $n_t$ observations. Then, the value of Y associated to $t$ is defined by:
\[
\hat{y}_t = \frac{1}{n_t} \sum_{i \,:\, x_i \in t} y_i \tag{1}
\]
The value of Y associated to each leaf is then the average of the $y_i$ associated to the subsample of the leaf.
At the end, this algorithm provides a huge tree with many leaves, which can lead to overfitting. To avoid this effect, the tree must be pruned: we have to extract a subtree. Among a sequence of subtrees, we keep the one which minimizes a criterion that most of the time depends on the generalization error and on the complexity (the number of leaves): this method is called cost-complexity pruning. In our case, the generalization error (i.e. the mean squared error) is estimated by cross-validation.
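The growing and pruning steps can be sketched with scikit-learn as follows. The sketch grows a full regression tree, asks for the sequence of cost-complexity parameters defining the nested subtrees, and keeps the parameter with the lowest cross-validated mean squared error. The data are synthetic and the subsampling of the candidate parameters is only there to keep the example fast; both are assumptions, not choices made in the study.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 5))
y = np.sin(3.0 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0.0, 0.1, size=400)

# Fully grown tree: many leaves, prone to overfitting.
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Each alpha of the pruning path corresponds to one subtree of the sequence.
path = full_tree.cost_complexity_pruning_path(X, y)
step = max(1, len(path.ccp_alphas) // 20)
# Clip guards against tiny negative values caused by rounding.
alphas = np.clip(path.ccp_alphas[::step], 0.0, None)


def cv_mse(alpha):
    """5-fold cross-validated mean squared error of the pruned tree."""
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    return -scores.mean()


best_alpha = min(alphas, key=cv_mse)
pruned_tree = DecisionTreeRegressor(ccp_alpha=best_alpha,
                                    random_state=0).fit(X, y)
```

Comparing `full_tree.get_n_leaves()` with `pruned_tree.get_n_leaves()` shows how much of the tree the criterion removes.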
3.2 Bagging with regression trees
Bagging is an algorithm which aggregates trees and was introduced by [brei96]. Let us consider the full sample X of size $n$. For $b = 1, \dots, B$, we denote by $X_b$ a sample of size $n$ obtained by sampling X with replacement. For each $b$, we train a predictor $\hat{f}_b$ on $X_b$. $\{\hat{f}_1, \dots, \hat{f}_B\}$ is therefore an ensemble of predictors, which are defined on different samples and are tree-based algorithms. Each individual $x_i$ belongs to $B$ different leaves (one for each tree), denoted by $t_1(x_i), \dots, t_B(x_i)$. So, by equation (1), we have $B$ different values for the prediction of $y_i$, i.e. $\hat{f}_1(x_i), \dots, \hat{f}_B(x_i)$. The aggregated prediction of $y_i$ is then defined by:
\[
\hat{y}_i = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x_i) \tag{2}
\]
Sampling with replacement is most of the time associated with bootstrap sampling. The method explained above is named Bagging (which stands for Bootstrap AGGregatING). Bagging improves prediction capabilities because it introduces differences between training samples, which leads to variability among the predictors. Breiman has shown that good candidates for bagging are classification and regression trees and neural networks.
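The aggregation principle can be written in a few lines. The sketch below draws bootstrap samples by hand, fits one regression tree per sample and averages the predictions; it is a didactic version on synthetic data (scikit-learn also ships a ready-made `BaggingRegressor`), and the function names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_bagged_trees(X, y, n_estimators=25, seed=0):
    """Fit one tree per bootstrap sample drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)       # bootstrap indices, size n
        tree = DecisionTreeRegressor(random_state=0)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_bagged(trees, X):
    """Aggregated prediction: average of the individual tree predictions."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = np.sin(3.0 * X[:, 0]) + rng.normal(0.0, 0.1, size=200)

trees = fit_bagged_trees(X, y)
pred = predict_bagged(trees, X)
```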
3.3 Random Forest
Random Forests, introduced by [brei01], are based on bootstrap sampling and CART. As in Section 3.2, we first construct subsamples drawn with replacement. When a tree is built, at each node of the tree, we randomly (and independently) draw a subset of the 25 inputs, and the optimal split is sought among these drawn variables only. Trees are grown to their maximal size and are not necessarily pruned.
Each tree is an estimator of the underlying function, built on a variation of the training set. As a consequence, each estimator leads to different results. Nevertheless, because of the large number of estimators, the ensemble of trees (the forest) leads to a stable model. For a new observation, the prediction is then the average of the predictions of all the predictors, as in Bagging.
3.4 Gradient Boosting
Gradient boosting, first suggested by [brei97] and developed by [frie99], is like every other boosting method: it combines weak learners. The goal stays the same, to explain Y by a function of X, but instead of tuning the parameters of a single model, we iteratively add a new model to the previous ones to increase the capabilities of the ensemble. The name "gradient" comes from the fact that the gradient of the squared error is the negative residual (see [frie99] and [lii]). In our case, we use regression trees (CART). Here follows a simplified version of the Gradient Boosting Machine algorithm (for more details, see [frie99]):
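As an illustration of the idea, and not as a reproduction of the published listing, the sketch below implements gradient boosting for the squared error with shallow regression trees: each stage fits a tree to the current residuals and adds a shrunken version of it to the model. The learning rate, tree depth, number of stages and data are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gbm(X, y, n_stages=100, learning_rate=0.1, max_depth=3):
    """Stage-wise additive model for the squared-error loss."""
    init = float(np.mean(y))                   # constant initial model
    current = np.full(len(y), init)
    trees = []
    for _ in range(n_stages):
        # For the loss 0.5 * (y - F)^2, the negative gradient is the residual.
        residual = y - current
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        current = current + learning_rate * tree.predict(X)
        trees.append(tree)
    return {"init": init, "learning_rate": learning_rate, "trees": trees}


def predict_gbm(model, X):
    """Sum the initial constant and the shrunken tree contributions."""
    pred = np.full(X.shape[0], model["init"])
    for tree in model["trees"]:
        pred = pred + model["learning_rate"] * tree.predict(X)
    return pred


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 4))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0.0, 0.05, size=300)

model = fit_gbm(X, y)
mse_constant = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - predict_gbm(model, X)) ** 2)
```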
3.5 AdaBoost
One thing that Bagging does not take into account is that observations need not all be equally likely to be drawn from the training set: Bagging samples them uniformly, whatever the difficulty of predicting them. In most cases, this uniformity assumption is not justified. As explained by [druc97], "in boosting, the probability of a particular example being in the training set of a particular machine depends on the performance of the prior machines on that example". In other words, if a machine (a model) is able to learn and predict an observation properly, we do not need to learn more about it, but rather about the observations which are difficult to learn. Thus, the latter will be more likely to be picked in a boosting sample. AdaBoost was first introduced by [freu95, freu96], and the following is a slightly modified version by [druc97] called AdaBoost.R2:
Initially, each observation is assigned a weight $w_i = 1$, $i = 1, \dots, n$. The algorithm is defined as follows and continues as long as the average loss stays below 0.5:
Although this algorithm is sensitive to noise and outliers, it does not need to be calibrated. This ensemble technique can be used with Random Forest and Decision Tree regressors.
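The weighting mechanism can be made concrete with a short sketch. It follows the AdaBoost.R2 steps as they are commonly stated (weighted resampling, linear loss normalized by the largest error, confidence $\beta = \bar{L} / (1 - \bar{L})$, weight update $w_i \leftarrow w_i \beta^{1 - L_i}$, weighted-median prediction); the choice of the linear loss, the tree depth, the number of rounds and the data are assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_adaboost_r2(X, y, n_rounds=30, max_depth=4, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    weights = np.ones(n)                       # initial weights w_i = 1
    models, betas = [], []
    for _ in range(n_rounds):
        p = weights / weights.sum()            # sampling probabilities
        idx = rng.choice(n, size=n, replace=True, p=p)
        model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        model.fit(X[idx], y[idx])
        abs_err = np.abs(model.predict(X) - y)
        max_err = abs_err.max()
        if max_err == 0.0:                     # perfect machine: keep and stop
            models.append(model)
            betas.append(1e-10)
            break
        loss = abs_err / max_err               # linear loss in [0, 1]
        avg_loss = float(np.sum(loss * p))
        if avg_loss >= 0.5:                    # stopping rule
            break
        beta = avg_loss / (1.0 - avg_loss)     # small beta = confident machine
        weights = weights * beta ** (1.0 - loss)
        models.append(model)
        betas.append(beta)
    return models, betas


def predict_adaboost_r2(models, betas, X):
    """Weighted median of the machine predictions, weights log(1 / beta)."""
    if not models:
        raise ValueError("no machine was kept during training")
    preds = np.array([m.predict(X) for m in models]).T    # (n_samples, T)
    log_inv_beta = np.log(1.0 / np.asarray(betas))
    order = np.argsort(preds, axis=1)
    cum = np.cumsum(log_inv_beta[order], axis=1)
    pick = np.argmax(cum >= 0.5 * cum[:, -1:], axis=1)
    rows = np.arange(X.shape[0])
    return preds[rows, order[rows, pick]]


# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(300, 3))
y = np.sin(3.0 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0.0, 0.05, size=300)

models, betas = fit_adaboost_r2(X, y)
pred = predict_adaboost_r2(models, betas, X)
r2_train = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```

Observations that are badly predicted keep a loss close to 1, so their weight is almost unchanged while the weights of well-predicted observations shrink, which is how difficult cases become more likely to be drawn in the next round.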
4 Prediction of loads for a new weight variant
In this section, we apply the techniques we described in Section 3 to our database and present the results we obtain.
4.1 Data preparation
Several options are possible to improve the prediction capability of machine learning algorithms. For example, some of them are sensitive to the homogeneity of the data they learn from, to the number of input variables, or to outliers. Concerning the last point, we cannot discard outliers: every load case has been validated, so we must consider all of them. In the first part, we focus on the clustering of our gust load cases to improve the ML performance. In the second part, we analyze the influence of different dimensional reduction techniques on the generalization capabilities of several algorithms based on regression trees.
To improve the capability of machine learning algorithms, clustering has been performed on the gust cases. From one weight variant to another, loads experts are able to roughly estimate the shape and intensity of the bending moments. To represent this knowledge a priori, we add the coefficients of the polynomials to the features, and the K-means algorithm is run on these data (features and coefficients). The number of clusters was chosen with the experts and with the elbow method, using a Euclidean distance. A PCA has been performed and, on the first two components, two clusters can be clearly distinguished (see Figure 9). In the following, these two clusters will be referred to as Cluster 0 and Cluster 1:
As we can see in Figure 9, the average bending moment of Cluster 0 is more linear than that of Cluster 1. Besides, Cluster 1 consists of bending moments which are mainly positive and have higher values at the wing root. By looking more closely at the A.C. parameters, we can see that most variables have the same distribution with a slightly different mean value. Nevertheless, some of them are really different (see Table 3): this is the case for DQ_DEGL1 (Deflection left inboard Elevator), DSP_DEG1L (Deflection Spoiler 1 Left Wing), DP_DEGIL (Deflection all speed Inner Aileron), DP_DEGOL (Deflection low speed Outer Aileron) and even more so for ENXF (X-Load Factor Body Axis), especially regarding the distribution (see Figures 10 and 11):
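|           | DQ_DEGL1 | DSP_DEG1L | DP_DEGIL | DP_DEGOL | ENXF   |
|-----------|----------|-----------|----------|----------|--------|
| Cluster 1 | 0.0043   | 0.00025   | 0.0082   | 0.0079   | 0.0587 |
| Cluster 0 | 0.0258   | 0.4363    | 0.0495   | 0.0488   | 0.0173 |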
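The clustering step described in this subsection can be sketched as follows with scikit-learn: the features and the polynomial coefficients are stacked in one table, K-means is run for several numbers of clusters to draw the elbow curve (the inertia is the sum of squared Euclidean distances to the cluster centers), and a PCA gives the plane in which the clusters are inspected. The table below is synthetic and contains two well-separated groups by construction, so it only illustrates the mechanics; the standardization step is also an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical table: 25 features followed by 6 polynomial coefficients.
group_a = rng.normal(0.0, 1.0, size=(300, 31))
group_b = rng.normal(4.0, 1.0, size=(300, 31))
table = StandardScaler().fit_transform(np.vstack([group_a, group_b]))

# Elbow method: inertia as a function of the number of clusters.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(table)
    inertias.append(km.inertia_)

# Final clustering with the retained number of clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(table)

# Projection on the first two principal components for visual inspection.
plane = PCA(n_components=2).fit_transform(table)
```

The elbow is read as the number of clusters after which the inertia stops decreasing sharply; the scatter of `plane` colored by `labels` plays the role of the two-component view mentioned above.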
4.2 From 238t to 242t
Before presenting the results, it is important to explain the R-squared score used in this project and why it is relevant in an engineering context. The R-squared, also known as the coefficient of determination, is a number that measures the share of the variance of the observations that is explained by the predictions. In other words, it is a measure of how well the model fits the data:
\[
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]
where $\hat{y}_i$ is the prediction of $y_i$ and $\bar{y}$ is the mean of the observed values.
In our case, we calculate an $R^2$ at each station of the wing. By doing so, we preserve the engineering sense of the accuracy of a curve. The variance along one curve can be extremely high (for example, the value is about 8 000 000 at the root and close to 0 at the wing tip), so calculating a single $R^2$ on all the values at the same time would lead to overestimating the accuracy of our models: the total variance is higher and thus the ratio between the squared error and the variance is very low.
The industrial goal was to obtain the highest possible $R^2$: this sprint project is part of a bigger project aiming to deliver models that accelerate the pre-development of aircraft. Thus, the necessary condition is to have models that are precise enough and able to generalize previously computed simulations in order to approximate, in our case, the computing chain of loads and stress. We agree that the $R^2$ can be misleading if the variance of the output is very high. This is why we calculate an R-squared at each station of the wing (that is to say, for each predictor): we then consider only the variance of values of the same kind in the outputs. The R-squared score reported is the average of the R-squared values calculated at each station.
To compare the results properly, we randomly draw from the 238t dataset a sample representing 80% of the observations as the training set; the remaining 20% form the test set, and the 242t, 247t and 251t are our validation datasets. We repeat the process several times to see whether a modification of the training set leads to unstable results in forecasting and generalizing.
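The difference between the two ways of computing the score can be checked on a small example. The sketch below builds synthetic curves whose magnitude decreases from root to tip, adds the same absolute error at every station, and compares the average of the per-station R-squared with a single R-squared computed on all values pooled together. The data and the error level are assumptions made for the illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_cases, n_stations = 2000, 29

# Hypothetical curves: large values at the root, small values at the tip.
scale = np.linspace(1000.0, 1.0, n_stations)
amp = rng.normal(1.0, 0.3, size=(n_cases, 1))
Y_true = amp * scale
Y_pred = Y_true + rng.normal(0.0, 0.1, size=Y_true.shape)

# One R-squared per station (column), then the average over stations.
ss_res = np.sum((Y_true - Y_pred) ** 2, axis=0)
ss_tot = np.sum((Y_true - Y_true.mean(axis=0)) ** 2, axis=0)
r2_station = 1.0 - ss_res / ss_tot
r2_average = r2_station.mean()

# Single R-squared on all values at once: dominated by the root stations.
r2_pooled = r2_score(Y_true.ravel(), Y_pred.ravel())
```

In scikit-learn, `r2_score(Y_true, Y_pred, multioutput="uniform_average")` returns the same station-wise average. On this example the pooled score is higher than the averaged one, because the poorly predicted tip stations contribute almost nothing to the total variance.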
To perform the comparison of the algorithms presented above, we used the scikit-learn library. Unfortunately, because we are trying to predict a vector-valued output (one value per station along the wing), only Random Forest is natively implemented to do so and to take advantage of the links which could exist between the outputs. Simply speaking, when we fit a multi-output model with Random Forest, the impurity measure used at each node has a "covariance" form such as defined in [seg]. For the other algorithms, we used the MultiOutputRegressor, which fits an independent predictor per output (i.e. per station): the MultiOutputRegressor is then an object containing as many predictors as outputs. As a reminder, here are the algorithms whose generalization capabilities we tested: AdaBoost based on decision tree regressors (ADBDT), AdaBoost based on Random Forest regressors (ADBRF), Random Forest (RF), Bagging and Gradient Boosting (GBM). First, before checking the influence of dimensional reduction techniques, we check which algorithms work best on raw data:
As we can see in Table 4, even if AdaBoost is not able to predict several outputs jointly, the version based on decision tree regressors gets the best results. Random Forest combined with AdaBoost has scores 3% higher, with a lower variability, than Random Forest alone. It is important to notice that GBM has the smallest drop from the test score to the validation score but the poorest scores. Since AdaBoost (based on decision trees or Random Forest) has the best results and the second smallest drop from the test score to the validation score (from 97.56% to 95.6%), we will focus on this algorithm to see the impact of dimensional reduction techniques.
To quantify the influence of dimensional reduction techniques on extrapolation capabilities, we compare the following configurations:
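A minimal sketch of this setup is shown below. Random Forest is fitted directly on the matrix of outputs, while Gradient Boosting and AdaBoost over Random Forest are wrapped in `MultiOutputRegressor`, which trains one predictor per output column. The data are synthetic, only four outputs are used to keep the example fast, and the hyperparameters are placeholders rather than those of the study.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 25))     # 25 hypothetical features
s = np.linspace(0.0, 1.0, 4)                   # 4 stations only, for speed
Y = (1.0 + X[:, :1]) * (1.2 - s) ** 2 + 0.1 * X[:, 1:2] * s

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

# The base learner is passed positionally to AdaBoostRegressor because the
# keyword name of that argument has changed across scikit-learn versions.
base_forest = RandomForestRegressor(n_estimators=5, random_state=0)
models = {
    "RF": RandomForestRegressor(n_estimators=20, random_state=0),
    "GBM": MultiOutputRegressor(
        GradientBoostingRegressor(n_estimators=50, random_state=0)),
    "ADBRF": MultiOutputRegressor(
        AdaBoostRegressor(base_forest, n_estimators=5, random_state=0)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    # score() averages the R-squared over the output columns.
    scores[name] = model.score(X_test, Y_test)
```

After fitting, `models["GBM"].estimators_` holds one fitted regressor per output column, which is the "as many predictors as outputs" structure described above.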

(1) Raw inputs + raw outputs: no data transformation.

(2) Raw inputs + PCA outputs: we keep the original input space and we perform a PCA on the output space.

(3) Raw inputs + polynomial fitting: we keep the original input space and replace the outputs by polynomial coefficients.

(4) Raw inputs + polynomial fitting and PCA: we keep the original input space and replace the outputs by polynomial coefficients on which we perform a PCA.

(5) PCA inputs + Raw outputs: we keep the original bending moment and we perform a PCA on the input space.

(6) PCA inputs + PCA outputs: we perform a PCA on the design space, and another on the output space.

(7) PCA inputs + polynomial fitting: we perform a PCA on the design space and replace the outputs by polynomial coefficients.

(8) PCA inputs + polynomial fitting and PCA: we perform a PCA on the design space and replace the outputs by polynomial coefficients on which we perform a PCA.
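To make the configurations concrete, here is a sketch of configuration (2), raw inputs with a PCA on the outputs: the PCA is fitted on the training outputs only, the regressor learns the principal component scores, and predictions are mapped back to one value per station. A plain Random Forest is used to keep the example short; the data, the number of retained components and the hyperparameters are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 25))     # raw inputs (hypothetical)
s = np.linspace(0.0, 1.0, 29)
Y = (1.0 + X[:, :1]) * (1.2 - s) ** 2 + 0.3 * X[:, 1:2] * s * (1.2 - s)

X_train, Y_train = X[:300], Y[:300]            # stands in for the 238t data
X_new = X[300:]                                # stands in for a new variant

# Reduction fitted on the training outputs only.
scaler = StandardScaler().fit(Y_train)
pca = PCA(n_components=2).fit(scaler.transform(Y_train))
Z_train = pca.transform(scaler.transform(Y_train))

# The regressor predicts the component scores instead of the 29 stations.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, Z_train)

# Predictions are brought back to the original output space.
Z_new = model.predict(X_new)
Y_new = scaler.inverse_transform(pca.inverse_transform(Z_new))
```

Configurations (5) to (8) would add a second, independent PCA fitted on `X_train` and applied to `X_new` before the regressor.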
Methods involving the polynomial fitting are not shown because of their lack of generalization and poor results. Other results are shown in Table 5.
Remark 3
Parameters of algorithms can be consulted in the Appendix A.
A PCA performed on the inputs does not improve the results, but it reduces their variability for Random Forest. Nevertheless, a PCA applied only to the outputs slightly improves the average results when predicting the 242t, for all algorithms. It is not surprising that applying a PCA does not greatly improve the results, since Random Forest and AdaBoost are natively able to deal with a large number of variables.
The results of ADBRF are similar to those of ADBDT. One major difference is the variability of the validation scores, which is lower than for the other methods. From one cluster to another, the conclusions about variability and about the type of algorithm are the same; only the scores change.
AdaBoost with Random Forest and AdaBoost with Decision Trees give similar results; only the variability of the scores differs. Indeed, due to the stable behavior of Random Forests, it is not surprising that AdaBoost performs better on Decision Trees than on Random Forests. Nevertheless, we can now assume that a PCA on the outputs improves the results, and from here on we investigate how the errors are distributed in order to better understand the lack of generalization capabilities of our model. In the following, only AdaBoost with Random Forest, with a PCA applied to the outputs, is investigated for extrapolation.
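As a rough illustration of what ADBRF with a PCA on the outputs can look like in scikit-learn, the sketch below boosts a small Random Forest. Two points are assumptions rather than facts from this study: `AdaBoostRegressor` only accepts a one-dimensional target, so one boosted model is fitted per principal component through `MultiOutputRegressor`, and all data, sizes and hyperparameters are toy values chosen for speed.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(1)

# Toy data: 120 cases, 4 inputs, curves sampled at 10 stations.
X = rng.uniform(-1.0, 1.0, size=(120, 4))
s = np.linspace(0.0, 1.0, 10)
Y = np.outer(X[:, 0], 1.0 - s) + np.outer(X[:, 1] ** 2, s)

X_train, X_test = X[:100], X[100:]
Y_train, Y_test = Y[:100], Y[100:]

# PCA on the outputs, fitted on the training curves only.
pca_out = PCA(n_components=2)
Z_train = pca_out.fit_transform(Y_train)

# Base learner: a small Random Forest (passed positionally so the call does
# not depend on the keyword name used by a given scikit-learn version).
base = RandomForestRegressor(n_estimators=10, min_samples_leaf=2, random_state=0)
boosted = AdaBoostRegressor(base, n_estimators=5, learning_rate=1.0, random_state=0)

# ASSUMPTION: one boosted model per principal component.
adbrf = MultiOutputRegressor(boosted)
adbrf.fit(X_train, Z_train)

# Back to full curves.
Y_pred = pca_out.inverse_transform(adbrf.predict(X_test))
```

Replacing `base` with a `DecisionTreeRegressor` gives the ADBDT counterpart under the same assumptions.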
4.3 From 238t to 251t
The R-squared alone is not sufficient to assess the quality of the fit: depending on the data, this score can hide poor results. To assess the goodness of fit of our models, we define, for a curve of bending moment, the error rate as follows:
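For $i \in \{1, \dots, n\}$, where $n$ is the size of the sample on which we calculate the error rates, and where $p$ is the number of stations along the wing. This gives a physical idea of how far off our predictions are. From this standpoint, we can easily compute the empirical cumulative distribution function (CDF) of the error rates: for $t \in \mathbb{R}$, let $e_1, \dots, e_n$ denote the error rates of the $n$ curves. The empirical CDF is defined as:
$$\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{e_i \le t\}}.$$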
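The following sketch shows how a per-curve error rate and its empirical CDF can be computed with NumPy. The exact error-rate formula is not reproduced in this text, so the definition used here (mean relative absolute error over the stations, in percent) is an assumption, and the function names and toy curves are hypothetical. The empirical CDF itself is the standard one: the fraction of curves whose error rate does not exceed a threshold.

```python
import numpy as np

def error_rates(y_true, y_pred):
    """One error rate per curve, in percent.

    ASSUMED definition: mean over the stations of |prediction - truth| / |truth|.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_pred - y_true) / np.abs(y_true), axis=1)

def empirical_cdf(errors, t):
    """Fraction of curves whose error rate is at most t."""
    errors = np.asarray(errors, dtype=float)
    return float(np.mean(errors <= t))

# Four toy curves sampled at two stations each.
y_true = np.array([[10.0, 20.0], [10.0, 20.0], [10.0, 20.0], [10.0, 20.0]])
y_pred = np.array([[10.0, 20.0], [11.0, 22.0], [12.0, 24.0], [15.0, 30.0]])

rates = error_rates(y_true, y_pred)      # about 0, 10, 20 and 50 percent
share_below_25 = empirical_cdf(rates, 25.0)
```

Reading the CDF at a given threshold then answers questions of the form "which share of the curves is predicted within 25%?".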
As soon as we try to generalize far from the training dataset, results drop. This is easily explained by the fact that some variables in the 247t and the 251t are, on average, far from their values in the 238t: for example, the quantity of fuel in the first tank is 50% larger in the 242t, 117% larger in the 247t and 270% larger in the 251t. The feature Deflection left inboard elevator differs by up to 50% in the 247t and 251t compared with the 238t and 242t. Unfortunately, these features have a low importance according to Random Forest (see Appendix B). Besides, it is known that in some cases slight changes in the features (especially the load factor along the Z-axis) can lead to very different behaviours.
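A simple way to see how far a weight variant sits from the training variant is to compare feature means, as in the percentages quoted above. The sketch below is one possible diagnostic; the function name and the toy arrays are hypothetical and the numbers are not taken from the study.

```python
import numpy as np

def relative_mean_shift(reference, variant):
    """Percentage change of each feature mean, relative to the reference set."""
    ref_mean = np.mean(reference, axis=0)
    var_mean = np.mean(variant, axis=0)
    return 100.0 * (var_mean - ref_mean) / np.abs(ref_mean)

# Toy design matrices: rows are load cases, columns are two features.
reference = np.array([[1.0, 4.0], [3.0, 6.0]])   # feature means: 2 and 5
variant = np.array([[2.0, 5.0], [4.0, 5.0]])     # feature means: 3 and 5

shift = relative_mean_shift(reference, variant)  # first feature +50%, second unchanged
```

A large shift on a feature signals that the model is being queried outside the region covered by the training data for that feature.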
5 Conclusion
Let us now highlight the contribution of this case study. As mentioned above, AdaBoost associated with Random Forest gives excellent results for observations that are not far from the training set. Accuracy is even better when the outputs have similar shapes for nearby design points and for load cases that are not strongly affected by the weight change. As soon as we try to generalize to observations far from the learning data set, or to load cases which lead to different behaviour, results drop. If we controlled the design space from the start, added information about the shape of the load to predict, or placed ourselves in an interpolation context, results would be even better.
A PCA on the outputs improves the results on average, which can be explained by the high collinearity of the outputs. Because of the presence of outliers, and especially because all inputs matter, a PCA on the input space does not improve our results on average.
By predicting a vector (the training matrix has shape 28 931 × 53) rather than a point (it would have been 838 999 × 25), the training time is greatly reduced, and we keep the engineering meaning of the mathematical object.
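The difference between the two layouts can be sketched with NumPy on small hypothetical sizes. The quoted shapes are consistent with 29 stations per curve, since 28 931 × 29 = 838 999, and with 24 inputs (53 = 24 + 29, and 25 = 24 + 1 station coordinate); this reading is an inference from the numbers, not a statement from the study.

```python
import numpy as np

# Small hypothetical sizes standing in for 28 931 cases, 24 inputs, 29 stations.
n_cases, n_inputs, n_stations = 5, 3, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n_cases, n_inputs))
Y = rng.normal(size=(n_cases, n_stations))

# "Vector" layout: one row per load case, inputs and the whole curve side by side.
wide = np.hstack([X, Y])                          # shape (5, 3 + 4)

# "Point" layout: one row per (load case, station) pair with a scalar target.
station_idx = np.tile(np.arange(n_stations), n_cases)
X_long = np.column_stack([np.repeat(X, n_stations, axis=0), station_idx])
y_long = Y.ravel()                                # shape (5 * 4,)

# Row k of the long layout corresponds to case k // n_stations
# and station k % n_stations.
k = 6
same_value = y_long[k] == Y[k // n_stations, k % n_stations]
```

The long layout multiplies the number of training rows by the number of stations and treats each station value as an independent sample, whereas the wide layout keeps each curve as a single object.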
Future work on this project should investigate the following points: define a reliable method for extrapolation; test other dimensional reduction techniques, such as the shape-invariant model approach defined in [ser12], which has been used in the petroleum industry; produce data in subspaces where information is lacking; investigate whether the optimal parameters obtained are really optimal in terms of generalization; consider machine learning algorithms other than those based on regression trees, because the latter are known to be suboptimal in a generalization problem, are considered as "black boxes", and do not give uncertainties; consider online learning, so that as soon as a new observation is available the model keeps learning sequentially.
Airbus continues to capitalize on knowledge and to develop new methods and tools for Research and Engineering through Big Data initiatives. The promising results of the sprint project in which this case study was carried out are among the roots of upcoming, larger Machine Learning projects in the loads and stress process.
Appendix A: Models parameters
Models have been optimized through cross-validation (5 folds). The parameters which do not appear in the following table are set to the default values of the algorithms in scikit-learn. AdaBoost and Bagging use Decision Trees as base estimators: due to time constraints, we first optimized the parameters of the Decision Trees alone on the data, and then optimized the AdaBoost and Bagging parameters. The following table gives the parameters of the models presented in the previous sections.
One can notice that applying a PCA on the outputs significantly increases the number of estimators in almost all cases, while min_samples_leaf and min_samples_split remain stable for RF and ADBRF. Naturally, the number of estimators increases when the depth of the trees grows. A learning_rate above 1.0 seems to compensate for too large a number of estimators, and the more transformations we apply to the data, the deeper the underlying trees are.
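The two-stage tuning described in this appendix can be sketched with scikit-learn's `GridSearchCV`. The parameter grids, the toy data and the single-column target are hypothetical choices made to keep the example short; they are not the grids used in the study.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(150, 4))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + 0.05 * rng.normal(size=150)

# Stage 1: tune the Decision Tree alone with 5-fold cross-validation.
tree_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    cv=5,
)
tree_search.fit(X, y)

# Stage 2: freeze the tuned tree as base estimator and tune AdaBoost itself.
# The base estimator is passed positionally so the call does not depend on
# the keyword name used by a given scikit-learn version.
boost_search = GridSearchCV(
    AdaBoostRegressor(tree_search.best_estimator_, random_state=0),
    param_grid={"n_estimators": [10, 30], "learning_rate": [0.5, 1.0]},
    cv=5,
)
boost_search.fit(X, y)

best_tree_params = tree_search.best_params_
best_boost_params = boost_search.best_params_
```

Tuning in two stages is cheaper than a joint grid, at the cost of possibly missing combinations where a shallower or deeper tree would suit the boosted model better.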
Appendix B: Features importance in Random Forests
The following table gives the feature importances in Random Forest for the cases (1) Raw inputs + raw outputs and (2) Raw inputs + PCA outputs:
Feature importances are stable from one method to another, and the two most important features are identified: the mass and the Z load factor. As said in Section 4.3, the importance of variables such as the Deflection left inboard elevator or the quantity of fuel in the first tank is small compared to that of these two variables: thus, even if they change sharply for the other weight variants, they have a low impact on the prediction of loads.
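For reference, impurity-based importances of this kind can be read from a fitted scikit-learn forest through its `feature_importances_` attribute. In the sketch below the feature labels are hypothetical, and the toy target is dominated by the first two columns by construction, so the ranking it produces illustrates the mechanics only and says nothing about the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
names = ["mass", "nz", "fuel_tank_1", "elevator_deflection"]  # hypothetical labels
X = rng.uniform(0.0, 1.0, size=(300, 4))

# Toy target: strong dependence on the first two columns, weak on the third,
# none on the fourth.
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] + 0.1 * X[:, 2] + 0.05 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Importances are non-negative and sum to one; sort them in decreasing order.
ranking = sorted(zip(names, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
total = float(np.sum(forest.feature_importances_))
```

These importances are computed on the training data, so they describe what the forest used within the training range and do not measure how a feature behaves outside it.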
Acknowledgements
We are very much indebted to the referees and the Associate Editor for their constructive criticisms, comments and remarks that resulted in a major improvement of the original manuscript. We would also like to thank Fabrice Gamboa for careful rereadings.