Fast Construction of Correcting Ensembles for Legacy Artificial Intelligence Systems: Algorithms and a Case Study
Abstract
This paper presents a technology for simple and computationally efficient improvements of a generic Artificial Intelligence (AI) system, including Multilayer and Deep Learning neural networks. The improvements are, in essence, small network ensembles constructed on top of the existing AI architectures. Theoretical foundations of the technology are based on Stochastic Separation Theorems and the ideas of the concentration of measure. We show that, subject to mild technical assumptions on statistical properties of internal signals in the original AI system, the technology enables instantaneous and computationally efficient removal of spurious and systematic errors with probability close to one on the datasets which are exponentially large in dimension. The method is illustrated with numerical examples and a case study of ten digits recognition from American Sign Language.
keywords:
Neural Ensembles, Stochastic Separation Theorems, Perceptron, Artificial Intelligence, Measure Concentration
MSC:
[2010] 68T05, 68T42, 97R40, 60F05
1 Introduction
In recent years, Artificial Intelligence (AI) systems have risen dramatically from being the subject of mere academic and narrowly specialized practical interest to the level of a commonly accepted and widespread technology. Industrial giants such as Google, Amazon, IBM, and Microsoft offer a broad range of AI-based services, including intelligent image and sound processing and recognition.
As a rule of thumb, Deep Learning and related computational technologies [8], [32] are currently perceived as state-of-the-art systems, capable of handling large volumes of data and delivering unprecedented accuracy [48] at reasonable computational costs, albeit after some optimization [34]. Despite these advances, several fundamental challenges hinder further progress of the technology.
Regardless of how well an AI system is trained, all data-driven AI systems make mistakes, and some examples of these have already received global public attention [6], [16]. Mistakes may arise due to uncertainty that is inherently present in empirical data, data misrepresentation, and imprecise or inaccurate training. Conventional approaches aimed at tackling inevitable errors include altering training data and improving design procedures [38], [42], [45], [59]. Transfer learning [10], [44], [58] and privileged learning [53] constitute a viable way to reduce generalization errors and hence improve performance. These approaches, however, invoke extensive training procedures. The latter, whilst eradicating some errors, may introduce new errors by the very nature of the steps involved (e.g. mini-batches, randomized training sets, etc.).
In this work, we propose an alternative. Instead of trying to solve the issue of inevitable spurious errors arising in iterative trial-and-error retraining of state-of-the-art large AI systems, we suggest a technology whereby up-training of the original legacy AI is ensured, with high probability, via incorporation of small neuronal ensembles and their cascades into the original AI’s decision-making. Ensemble methods and classifier cascades (see e.g. [17], [18], [46] and references therein) constitute a well-known framework for improving performance of classifiers. Here we demonstrate that the geometry of high dimension, in agreement with the “blessing of dimensionality” [13], [23], [36], enables a simple technology for efficient construction of AI error correctors and their ensembles with guaranteed performance bounds.
The technology bears similarity with neurogenesis deep learning [14], classical cascade correlation [15] and its more recent variants [54], greedy approximation [5], and deep stochastic configuration networks [55], [56] in that it relies upon new nodes to fix errors. The proposed technology, however, does not require computationally expensive training or preconditioning, and in some instances can be set up as a non-iterative one-off procedure. In the latter case the procedure’s computational complexity scales at most linearly with the size of the training set. Similar to [55], [56], the technology is built on the ideas of randomization [40], [49]. In our case, though, we take advantage of the intrinsic randomness in the data itself, in addition to randomized procedures aimed at handling the data.
At the core of the technology is the concentration of measure phenomenon [19], [21], [29], [30], [39], a view on the problem of learning in AI systems that stems from conventional probabilistic settings [11], [52], and stochastic separation theorems [22], [23], [26]. The main building blocks of the technology are simple threshold, perceptron-type [41], [47] classifiers. The original AI system, however, need not be a classifier itself. We show that, subject to mild assumptions on statistical properties of “internal” signals in the original AI system, small neuronal ensembles comprising cascades of simple linear classifiers are an efficient tool for learning away spurious and systematic errors. These cascades can be used for learning new skills too.
The paper is organized as follows. Section 2 contains a formal statement of the problem. Section 3 presents the main results, namely the algorithms for constructing correcting ensembles (Section 3.2), preceded by their mathematical justification (Section 3.1). Section 4 presents a case study on American Sign Language recognition illustrating the concept, and Section 5 concludes the paper. Proofs of theorems and other statements as well as additional technical results are provided in Appendices A and B, respectively.
Notation
The following notational agreements are used throughout the paper:

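$\mathbb{R}$ denotes the field of real numbers;

$\mathbb{N}$ is the set of natural numbers;

$\mathbb{R}^n$ stands for the $n$-dimensional real space; unless stated otherwise symbol $n$ is reserved to denote dimension of the underlying linear space;

let $x, y$ be elements of $\mathbb{R}^n$, then $\langle x, y\rangle = \sum_{i=1}^{n} x_i y_i$ is their inner product;

let $x \in \mathbb{R}^n$, then $\|x\|$ is the Euclidean norm of $x$: $\|x\| = \sqrt{\langle x, x\rangle}$;

$\mathbb{B}_n(R)$ denotes an $n$-ball of radius $R$ centered at $0$: $\mathbb{B}_n(R) = \{x \in \mathbb{R}^n \mid \|x\| \leq R\}$;

if $\mathcal{D}$ is a finite set then $|\mathcal{D}|$ stands for its cardinality;

$\mathcal{V}(\Xi)$ is the Lebesgue volume of $\Xi \subset \mathbb{R}^n$;

if $X$ is a random variable then $E[X]$ is its expected value.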
2 Problem Statement
In what follows we suppose that an AI system is an operator mapping elements of its input set, , to the set of outputs, . Examples of inputs are images, temporal or spatiotemporal signals, and the outputs correspond to labels, classes, or some quantitative characteristics of the inputs. Inputs , outputs , and internal variables of the system represent the system’s state. The state itself may not be available for observation but some of its variables or relations may be accessed. In other words, we assume that there is a process which assigns an element of to the triple . A diagram illustrating the setup for a generic AI system is shown in Fig. 1.
In addition to the core AI system we consider an add-on ensemble mapping samples into auxiliary signals (shown as improvement signals in Fig. 1) which are then used to improve performance of the original core AI. An example of such an improvement signal could be an indication that the current state of the core AI system represented by corresponds to an erroneous decision of the core AI. At the integration stage, this information will alter the overall response.
With regard to the operations performed at the individual nodes (circles in the diagram in Fig. 1), we will focus on the following relevant class
(1) 
where the function is piecewise continuous, , and . This class of functions is broad enough to enable the ensemble to function as a universal approximation device (e.g. by choosing the function in the class satisfying conditions discussed in [5], [12], [20]) as well as to fit into a wide range of common AI architectures including decision trees, multilayer neural networks and perceptrons, and deep learning convolutional neuronal networks. Additional insight into “isolation” properties of elements (1) in comparison with more natural choices such as e.g. ellipsoids or balls (cf. [1]) is provided in Appendix B.
Over a relevant period of time, the AI system generates a finite but large set of measurements . This set is assessed by an external supervisor and is partitioned into the union of the sets and :
The set may contain measurements corresponding to expected operation of the AI, whereas elements from constitute core AI’s performance singularities. These singularities may be both desired (related e.g. to “important” inputs ) and undesired (related e.g. to errors of the AI). The function of the ensemble is to respond to these singularities selectively by producing improvement signals in response to elements from the set .
A straightforward approach to derive such ensembles is to use standard optimization facilities such as error back-propagation [57] and its various modifications or other iterative procedures similar to greedy approximation [5], stochastic configuration networks [56] (cf. [14]). This approach, however, requires computational resources and time that may not necessarily be available. Even in the simplest case of a single-element add-on system that is to separate the entire set from , determining the best possible solution, in the form of e.g. support vector machines [52], is computationally nontrivial. Indeed, theoretical worst-case estimates of computational complexity for determining parameters of the single-element system are of the order [9].
The question, therefore, is whether there exists a computationally efficient procedure for determining ensembles or add-on networks such that

they are able to “improve” performance of a generic AI system with guaranteed probability, and

they consist of elements (1)?
Preferably, computational complexity of the procedure is to be linear or even sublinear in .
In the next sections we show that, subject to some mild technical assumptions on the way the sets and are generated, a family of simple algorithms with the required characteristics can indeed be derived. These algorithms are motivated by Stochastic Separation Theorems [22], [23], [26]. For consistency, adapted versions of these theorems are presented and discussed in Section 3.1. The algorithms themselves are presented and discussed in Section 3.2.
3 Main Results
3.1 Mathematical Preliminaries
Following standard assumptions (see e.g. [11], [52]), we suppose that all are generated in accordance with some distribution, and the actual measurements are samples from this distribution. For simplicity, we adopt a traditional setting in which all such samples are independent and identically distributed (i.i.d.) [11]. With regard to the elements of , the following technical condition is assumed:
Assumption 1
Elements are random i.i.d. vectors drawn from a product measure distribution:

their th components are independent and bounded random variables : , ,

, and .
The distribution itself, however, is supposed to be unknown. Let
Then the following result holds (cf. [22]).
Theorem 1
Let be i.i.d. random points from the product distribution satisfying Assumption 1, , and . Then

for any ,
(2) 
for any ,
(3) 
for any given and any
(4)
Remark 2
The following Theorem is now immediate.
Theorem 3 (Element separation)
Let elements of the set be i.i.d. random points from the product distribution satisfying Assumption 1, , and . Consider and let
(5) 
Then
(6) 
(The proof is provided in A.)
Remark 4
Theorem 3 not only establishes the fact that the set can be separated away from by a linear functional with reasonably high probability. It also specifies the separating hyperplane, (5), and provides an estimate from below of the probability of such an event, (6). The estimate, as a function of , approaches exponentially fast. Note that the result is intrinsically related to the work [37] on quasi-orthogonal dimension of Euclidean spaces.
Let us now move to the case when the set contains more than one element. Theorem 5 below summarizes the result.
Theorem 5 (Element separation. Case 1)
Let elements of the set be i.i.d. random points from the product distribution satisfying Assumption 1, and . Let, additionally,
(7) 
for all , . Pick
and consider
(8) 
Then
(9) 
(The proof is provided in A.)
The value of in estimate (7) may not necessarily be available a priori. If this is the case then the following corollaries from Theorems 3 and 5 may be invoked.
Corollary 6 (Element separation. Case 1)
Let elements of the set be i.i.d. random points from the product distribution satisfying Assumption 1, and . Pick
and let
(10) 
Then
(11) 
Corollary 7 (Element separation. Case 2)
Let elements of the set be i.i.d. random points from the product distribution satisfying Assumption 1, , and . Pick and consider
(12) 
Then
(13) 
Proofs of the corollaries are provided in A.
Theorems 1 – 5 and Corollaries 6, 7 suggest that simple elements (1) with being mere threshold elements
possess remarkable selectivity. For example, according to Theorem 3, the element
assigns “” to and “” to all with probability that is exponentially (in ) close to . To illustrate this point, consider a simple test case with a set comprising i.i.d. samples from the (uniform) product distribution in . For this distribution, , , and estimate (6) becomes:
(14) 
The right-hand side rapidly approaches as grows. The estimate, however, could be rather conservative, as illustrated in Fig. 2.
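The conservativeness of the bound can be probed numerically. The following is a minimal sketch, assuming points sampled uniformly from the cube [-1, 1]^n and using the functional h(x) = <x, x_M> - (1 - eps) <x_M, x_M> as a stand-in for the separating hyperplane of Theorem 3; the function name, the value of `eps`, and the number of trials are illustrative choices and are not taken from the paper.

```python
import numpy as np

def separation_frequency(n, M, eps=0.2, trials=50, seed=0):
    """Fraction of trials in which the last of M random points is linearly
    separated from the other M - 1 points.

    Points are i.i.d. uniform in the cube [-1, 1]^n. The test functional
    h(x) = <x, x_M> - (1 - eps) * <x_M, x_M> is an assumed stand-in for the
    hyperplane of Theorem 3; it is positive at x_M by construction.
    """
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(trials):
        X = rng.uniform(-1.0, 1.0, size=(M, n))
        x_last, rest = X[-1], X[:-1]
        threshold = (1.0 - eps) * (x_last @ x_last)
        # Separation succeeds when every other point lies strictly below the threshold.
        if np.all(rest @ x_last < threshold):
            successes += 1
    return successes / trials

if __name__ == "__main__":
    for n in (2, 10, 50, 100):
        print(n, separation_frequency(n, M=1000))
```

Running the sketch for increasing n gives an empirical counterpart of the separation probability that can be compared with the theoretical lower bound.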
A somewhat more relaxed approach could be to allow a small margin of error by determining a hyperplane separating the set from “nearly all” elements . An estimate of success for the latter case follows from Lemma 8 below.
Lemma 8
Let elements of the set be i.i.d. random points, and let
and be such that
for an arbitrary element . Then
(The proof is provided in A.)
According to Lemma 8 and Theorem 1,
(15) 
Observe, however, that random points (in general position) are linearly separable with probability . Thus, with probability , spurious points can be separated away from by a second hyperplane. Fig. 3 compares separation probability estimate for such a pair with that of for (Theorem 3, (6)).
3.2 Fast AI uptraining algorithms
Theorem 5 as well as Corollaries 6, 7 and Lemma 8 motivate a simple computational framework for construction of networks and cascades of elements (1). Following the setup presented in Section 2, recall that the data is sampled from the core AI system and is partitioned into the sets and . The latter set corresponds to singular events which are to be picked by the cascade. Their union, , is the entire data that is available for the cascade construction.
We are now ready to proceed with the algorithms for fast construction of the correcting ensembles. The first algorithm is a recursion in which each iteration is a multi-step process. The algorithm is provided below.
Algorithm 1 (Correcting Ensembles)
Initialization: set , , and , define a list or a model of correcting actions – formalized alterations of the core AI in response to an error/error type.

Input to the th iteration: Sets , , and ; the number of clusters, , and the value of filtering threshold, .

1. Centering. All current data available is centered. The centered sets are denoted as and and are formed by subtracting the mean from the elements of and , respectively:

2. Regularization. The covariance matrix, , of is calculated along with the corresponding eigenvalues and eigenvectors. New regularized sets , are produced as follows. All eigenvectors that correspond to the eigenvalues which are above a given threshold are combined into a single matrix . The threshold could be chosen on the basis of the Kaiser-Guttman test [35] or otherwise (e.g. by keeping the ratio of the maximal to the minimal in the set of retained eigenvalues within a given bound). The new sets are defined as:

3. Whitening. The two sets then undergo a whitening coordinate transformation ensuring that the covariance matrix of the transformed data is the identity matrix:

4. Projection (optional). Project elements of , onto the unit sphere by scaling them to the unit length: .

5. Training: Clustering. The set (the set of errors) is then partitioned into clusters whose elements are pairwise positively correlated.

6. Training: Nodes creation and aggregation. For each , and its complement we construct the following separating hyperplanes:
Retain only those hyperplanes for which . For each retained hyperplane, create a corresponding element (1)
(16) with being a function that satisfies: for all and for all , and add it to the ensemble. If optional step 4 was used then the functional definition of becomes:
(17) where is the mapping implementing projection onto the unit sphere.

7. Integration/Deployment. Any that is generated by the original core AI is put through the ensemble of . If for some any of the values of then a correcting action is performed on the core AI. The action is dependent on the purpose of correction as well as on the problem at hand. It could include label swapping, signalling an alarm, ignoring/not reporting etc. The combined system becomes the new core AI.

8. Testing. Assess performance of the new core AI on a relevant data set. If needed, generate new sets , , (with possibly different error types and definitions), and repeat the procedure.
Steps 1–3 of the algorithm are standard preprocessing routines. In the context of Theorems 1–5, step 1 aims to ensure that the first part of A2) in Assumption 1 holds, and step 2, in addition to regularization, ensures that all components of are uncorrelated (a necessary condition for part A1) of Assumption 1 to hold). The whitening transformation, step 3, normalizes the data in the sense that for all relevant. Step 4 (optional) is introduced to account for data irregularities and clustering that may negatively affect data separability. An illustration of the potential utility of this step is shown in Fig. 4.
Note that individual components of the data vectors may no longer satisfy the independence assumption. Nevertheless, if the data is reasonably distributed, different versions of separation theorems may apply to this case too [23], [26]. In Section 4 we illustrate the effect of this step in an example application.
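The preprocessing steps 1–4 can be sketched in a few lines of NumPy. The sketch below rests on two assumptions that are not fixed by the text: the mean and the covariance matrix are estimated on the union of the two sets, and the Kaiser-Guttman rule is implemented as "retain the eigenvalues above the average eigenvalue". The function name `preprocess` and its arguments are hypothetical.

```python
import numpy as np

def preprocess(X, Y, project=True):
    """Steps 1-4 of Algorithm 1 for row-wise data X (background) and Y (errors).

    Assumptions of this sketch: mean and covariance are estimated on the
    union of X and Y, and the Kaiser-Guttman rule keeps the eigenvalues
    that exceed the average eigenvalue.
    """
    S = np.vstack([X, Y])
    mean = S.mean(axis=0)
    Xc, Yc = X - mean, Y - mean                      # step 1: centering

    eigvals, eigvecs = np.linalg.eigh(np.cov(S, rowvar=False))
    keep = eigvals > eigvals.mean()                  # step 2: Kaiser-Guttman-style cut
    H, lam = eigvecs[:, keep], eigvals[keep]
    Xr, Yr = Xc @ H, Yc @ H                          # coordinates in retained components

    Xw, Yw = Xr / np.sqrt(lam), Yr / np.sqrt(lam)    # step 3: whitening

    if project:                                      # step 4 (optional): unit sphere
        Xw = Xw / np.linalg.norm(Xw, axis=1, keepdims=True)
        Yw = Yw / np.linalg.norm(Yw, axis=1, keepdims=True)
    return Xw, Yw
```

With `project=False` the sample covariance of the transformed union is the identity matrix up to rounding, which is the property step 3 is meant to deliver.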
Step 5, clustering, is motivated by Theorem 5 and Corollaries 6, 7 suggesting that the rate of success in isolating multiple points from the background set increases when these multiple points are positively correlated or are spatially close to each other.
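As an illustration of step 5, the sketch below partitions the error set with a plain k-means loop seeded in the k-means++ manner. It is a generic stand-in written for this exposition, not the routine used in the experiments, and it does not by itself guarantee that the elements of a cluster are pairwise positively correlated; the function name and parameters are hypothetical.

```python
import numpy as np

def cluster_errors(Y, k, n_iter=50, seed=0):
    """Partition the rows of Y into k clusters; returns one label per row.

    k-means++ style seeding followed by Lloyd iterations. This is an
    illustrative stand-in for the clustering routine of step 5.
    """
    rng = np.random.default_rng(seed)
    # Seeding: each new centre is drawn with probability proportional to
    # the squared distance to the nearest centre chosen so far.
    centres = [Y[rng.integers(len(Y))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((Y - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(Y[rng.choice(len(Y), p=d2 / d2.sum())])
    centres = np.array(centres)

    for _ in range(n_iter):
        dists = np.linalg.norm(Y[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Empty clusters keep their previous centre.
        new_centres = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels
```

After the optional projection onto the unit sphere, points that are close in the Euclidean sense also have large inner products, which is why a distance-based routine is a reasonable proxy here.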
Step 6 is a version of (8), albeit with a different normalization and a slight perturbation of weights. The choice of this particular normalization is motivated by the experiments shown in Fig. 2. As for the choice of weights, these hyperplanes are standard Fisher discriminants. Yet, they are not far from (8). In view of the previous steps, is diagonally dominant and close to the identity matrix. When , term is nearly zero, and hence is approximately at the centroid of the cluster. Filtering of the nodes is necessitated by the concentration of measure effects as exemplified by Theorem 1.
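A hedged sketch of the node construction in step 6 is given below. It assumes that the weight vector of each node is the standard Fisher discriminant w = (Cov(S \ Y_i) + Cov(Y_i))^{-1} (mean(Y_i) - mean(S \ Y_i)) scaled to unit length, that the offset c is the smallest projection of a cluster member onto w, and that a node is retained when c exceeds the filtering threshold; the names `fisher_nodes`, `ensemble_fires` and `theta` are introduced here for illustration only.

```python
import numpy as np

def fisher_nodes(X, Y, labels, theta):
    """Build one linear node per error cluster and filter by threshold theta.

    X: background points (rows), Y: error points (rows), labels: cluster
    index per row of Y. Weights are Fisher discriminants of unit length and
    the offset c is the smallest projection of a cluster member (assumed
    formulas). Returns a list of (w, c); a node fires on x if <w, x> - c >= 0.
    """
    labels = np.asarray(labels)
    d = X.shape[1]
    nodes = []
    for i in np.unique(labels):
        cluster = Y[labels == i]
        rest = np.vstack([X, Y[labels != i]])
        cov_rest = np.atleast_2d(np.cov(rest, rowvar=False))
        # A single-point cluster has no covariance of its own.
        cov_cl = (np.atleast_2d(np.cov(cluster, rowvar=False))
                  if len(cluster) > 1 else np.zeros((d, d)))
        w = np.linalg.solve(cov_rest + cov_cl,
                            cluster.mean(axis=0) - rest.mean(axis=0))
        w /= np.linalg.norm(w)
        c = np.min(cluster @ w)
        if c > theta:              # keep only hyperplanes far enough from the origin
            nodes.append((w, c))
    return nodes

def ensemble_fires(nodes, x):
    """True when at least one retained node responds to x."""
    return any(w @ x - c >= 0 for w, c in nodes)
```

The cost of each node is dominated by one covariance estimate and one linear solve, so for a fixed dimension the construction is linear in the number of data points.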
Functions in step 7 can be implemented using a range of ReLU, threshold, , sigmoidal functions etc.; these are available in the majority of standard core AI systems.
Computational complexity of each step in the recursion, except for step 5, is at most . Complexity of the clustering step may generally be superpolynomial (worst case), as is the case e.g. for standard k-means clustering [3]. If, however, suboptimal solutions are accepted [31] then the complexity of this step scales linearly with . Further speed-ups are possible through randomized procedures such as e.g. k-means++ [2].
Algorithm 1 is based primarily on the intuition and rationale stemming from Theorems 3, 5 and their corollaries. It can be modified to take advantage of the possibility offered by Lemma 8 and the subsequent discussion. The modification is that each node added in step 6 is assessed and “corrected” by an additional hyperplane if required. The modified algorithm is summarized as Algorithm 2 below.
Algorithm 2 (With cascaded pairs)
Initialization: set , , and , define a list or a model of correcting actions – formalized alterations of the core AI in response to an error/error type.

Input to the th iteration: Sets , , and ; the number of clusters, , and the value of filtering threshold, .

1–5. As in Algorithm 1.

6. Training: Nodes creation and aggregation. For each , and its complement construct the following separating hyperplanes:
Retain only those hyperplanes for which . For each retained hyperplane, create a corresponding element in accordance with (16) or, if step 4 was used, then (17). For each retained hyperplane,

determine the complementary set comprised of elements for which (the set of points that are accidentally picked up by the “by mistake”)

project elements of the set orthogonally onto the hyperplane as

determine a hyperplane
separating projections of from that of so that for all projections from and for all projections from . If no such planes exist, use linear Fisher discriminant or any other relevant computational procedure

create a node
or, in case step 4 was used

create the pair’s response node, , so that a non-negative response is generated only when both nodes produce a non-negative response

add the combined to the ensemble.

7. Integration/Deployment. Any that is generated by the original core AI is put through the ensemble of . If for some any of the values of then a correcting action is performed on the core AI. The action is dependent on the purpose of correction as well as on the problem at hand. It could include label swapping, signalling an alarm, ignoring/not reporting etc. The combined system becomes the new core AI.

8. Testing. Assess performance of the new core AI on a relevant data set. If needed, generate new sets , , (with possibly different error types and definitions), and repeat the procedure.
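The cascaded-pair construction in step 6 of Algorithm 2 can be sketched as follows. The sketch assumes a unit-length weight vector `w1` with offset `c1` for the first node, applies the second node to the orthogonal projection of its input onto the first hyperplane, and uses a batch perceptron to find the second hyperplane; the fallback to a Fisher discriminant for non-separable projections is omitted. All function names are hypothetical.

```python
import numpy as np

def perceptron(P, N, epochs=200):
    """Batch perceptron on augmented vectors; on convergence returns (w, b)
    with <w, p> + b > 0 for rows of P and < 0 for rows of N."""
    Z = np.vstack([np.hstack([P, np.ones((len(P), 1))]),
                   -np.hstack([N, np.ones((len(N), 1))])])
    v = np.zeros(Z.shape[1])
    for _ in range(epochs):
        wrong = Z[Z @ v <= 0]
        if len(wrong) == 0:
            break
        v += wrong.sum(axis=0)          # add all misclassified samples at once
    return v[:-1], v[-1]

def cascaded_pair(w1, c1, X, cluster):
    """Return the pair's response node as a function of a single point x.

    w1 is assumed to have unit length; X holds background points and
    cluster holds the error points the first node was built for.
    """
    def first(x):
        return x @ w1 - c1

    spurious = X[X @ w1 - c1 >= 0]      # background points picked up by the first node
    if len(spurious) == 0:
        return first

    def project(A):
        # Orthogonal projection of the rows of A onto the hyperplane <w1, x> = c1.
        return A - np.outer(A @ w1 - c1, w1)

    w2, b2 = perceptron(project(cluster), project(spurious))

    def pair(x):
        second = project(np.atleast_2d(x))[0] @ w2 + b2
        # The minimum is non-negative only when both nodes respond.
        return min(first(x), second)

    return pair
```

Taking the minimum of the two responses is one simple way to realize the requirement that the pair responds only when both nodes do; a product of threshold functions would serve equally well.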
In the next section we illustrate how the proposed algorithms work in a case study example involving a core AI in the form of a reasonably large convolutional network trained on a moderate-size dataset.
4 Distinguishing the Ten Digits in American Sign Language: a Case Study
4.1 Setup and Datasets
In this case study we investigated and tested the approach on a challenging problem of gesture recognition in the framework of distinguishing ten digits in American Sign Language. To apply the approach a core AI had to be generated first. As our core AI we picked a version of Inception deep neural network model [50] whose architecture is shown in Fig. 5. The model was trained (see https://www.tensorflow.org/tutorials/image_retraining) on ten sets of images that correspond to the American Sign Language pictures for 0-9 (see Fig. 6).
Each set contained 1000 unique images consisting of profile shots of the person’s hand, along with 3/4 profiles and shots from above and below.
The states are the vectors containing the values of pre-softmax layer bottlenecks of size for however many neurons are in the penultimate layer. Schematically the network’s layer whose outputs are is shown in the diagram in Fig. 5 as the fully connected (bottleneck) layer in Unit H. Elements of this set that gave incorrect readings are noted and copied into the set .
4.2 Experiments and results
Once the network was trained, an additional 10000 images of the same ratio were evaluated using the trained system. For these experiments the classification decision rule was to return a gesture number that corresponds to the network output with the highest score (winner-takes-all) if the highest score exceeds a given threshold, . Ties are broken at random. If the highest score is smaller than or equal to the threshold then no response is returned.
To assess performance of the resulting classifier, we used the following indicators as functions of threshold :
(18) 
where is the number of true positives, is the number of false negatives, and is the number of false positives. The definitions of these variables are provided in Table 1.
Presence of a gesture | System’s response | Error type
Yes | Correctly classified | True Positive
Yes | Incorrectly classified | False Positive
Yes | Not reported | False Negative
No | Reported | False Positive*
No | Not reported | True Negative*
In our experiments we did not add any negatives to the test set, as the original focus was to illustrate how the approach may cope with misclassification errors. This implies that the number of True Negatives is 0 for any threshold and, consequently, the rate of false positives is always 1. Therefore, instead of using traditional ROC curves showing the rate of true positives against the rate of false positives, we employed a different family of curves in which the rate of false positives is replaced with the rate of misclassification as defined by (18).
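The decision rule and the bookkeeping of Table 1 can be illustrated with a short sketch. It assumes that every test sample shows a gesture, encodes "no response" as -1, and resolves ties by the lowest index rather than at random. The names `classify`, `counts` and `gamma` are hypothetical. Only the true positive rate TP / (TP + FN) is computed, since it follows from the standard definition; the misclassification rate of (18) is left to the reader.

```python
import numpy as np

def classify(scores, gamma):
    """Winner-takes-all with rejection: index of the top score per row,
    or -1 (no response) when the top score does not exceed gamma."""
    scores = np.asarray(scores, dtype=float)
    top = scores.argmax(axis=1)          # argmax resolves ties by lowest index
    return np.where(scores.max(axis=1) > gamma, top, -1)

def counts(scores, labels, gamma):
    """TP, FP, FN counts for a test set in which every sample shows a gesture."""
    pred = classify(scores, gamma)
    labels = np.asarray(labels)
    tp = int(np.sum(pred == labels))                     # correctly classified
    fn = int(np.sum(pred == -1))                         # not reported
    fp = int(np.sum((pred != labels) & (pred != -1)))    # incorrectly classified
    return tp, fp, fn

def true_positive_rate(tp, fn):
    """Standard definition TP / (TP + FN); returns 0.0 for an empty denominator."""
    return tp / (tp + fn) if tp + fn else 0.0
```

Sweeping `gamma` over a grid and recording the counts at each value yields the points of a performance curve of the kind described above.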
A corresponding performance curve for this Core AI system is shown in Fig. 7.
At the result was an 82.4% success rate for the adapted algorithm. At this end of the curve, the observed performance was comparable to that reported in e.g. [7] (see also references therein). The numbers of errors per gesture in the trained system are shown in Table 2.
Gesture | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Errors | 10 | 52 | 235 | 62 | 410 | 80 | 269 | 327 | 207 | 108
The variance of errors is mostly consistent among the ten classes with very few errors for the “0” gesture, likely due to its unique shape among the classes.
Once the errors were isolated, we used Algorithms 1 and 2 to construct correcting ensembles improving the original core AI. To train the ensembles, the testing data set of 10000 images that had been used to assess performance of Inception was split into two non-overlapping subsets. The first subset, comprised of 6592 records of data points corresponding to correct responses and 1408 records corresponding to errors, was used to train the correcting ensembles. This subset was the ensemble’s training set, and it accounted for 80% of the data. The second subset, the ensemble’s testing set, combining 1648 data points of correct responses and 352 elements labelled as errors, was used to test the ensemble.
Both algorithms have been run on the first subset, the ensemble’s training set. For simplicity of comparison and evaluation, we iterated the algorithms only once (i.e. did not build cascades of ensembles). In the regularization step, step 2, we used the Kaiser-Guttman test. This returned principal components reducing the original dimensionality more than 10 times. After the whitening transformation, step 3, we assessed the values of (shown in Fig. 8).
According to Fig. 8, data points labelled as errors are largely orthogonal to each other apart from a few modestly sized groups.
The number of clusters, parameter in step 5, was varied from 2 to 1408 in regular increments. As a clustering algorithm we used the standard k-means routine (k-means++) supplied with MATLAB 2016a. For each value of , we ran the k-means algorithm 10 times. For each clustering pass we constructed the corresponding nodes (or their pairs for Algorithm 2) as prescribed in step 6, and combined them into a single correcting ensemble in accordance with step 7.
Before assessing performance of the ensemble, we evaluated filtering properties of the ensemble as a function of the number of clusters used. For consistency with predictions provided in Theorems 1 – 5, Algorithm 1 was used in this exercise. Results of the test are shown in Fig. 9.
Note that as the number of clusters increases, the True Positive Rate at as a function of , approaches 1 regardless of the projection step. This is in agreement with theoretical predictions stemming from Theorem 3. We also observed that performance drops rapidly with the average number of elements assigned to a cluster. In view of our earlier observation that vectors labelled as “errors” appear to be nearly orthogonal to each other, this drop is consistent with the bound provided in Corollary 6.
Next we assessed performance of Algorithms 1, 2 and resulting ensembles on the training and testing sets for . In both algorithms, the optional projection step was used. The value of threshold was set to . In general, the value of threshold can be selected arbitrarily in an interval of feasible values of . When the optional projection step is used, this interval is . Here we set the value of so that the hyperplanes in step 6 of Algorithm 2 produced by the standard perceptron algorithm [41], [47] (cf. [43]) consistently yielded perfect separation. Results are shown in Figs. 10 and 11.
According to Fig. 10, performance on the training set grows with , as expected. This is aligned with theoretical results presented in Section 3.1. The ensembles rapidly remove misclassification errors from the system, albeit at some cost to True Positive Rate. The latter cost for Algorithm 2 appears to be negligible. On the testing set, the picture changes. We see that, as the number of clusters/nodes grows, performance of the ensemble saturates, with Algorithm 1 catching up with Algorithm 2. This signals an overfit and indicates a point where further increases in the number of nodes do not translate into expected improvements of the combined system’s performance. Apparent lack of correlations between feature vectors corresponding to errors (illustrated with Fig. 8) may also contribute to the observed performance saturation on the testing set.
5 Conclusion
In this work we presented a novel technology for computationally efficient improvements of generic AI systems, including sophisticated Multilayer and Deep Learning neural networks. These improvements are easy-to-train ensembles of elementary nodes. After the ensembles are constructed an additional natural step might be to prune and further optimize the system via e.g. error back-propagation [57]. Mathematical operations at the nodes involve computations of inner products and, possibly, normalization followed by standard non-linear operations like ReLU, step functions, sigmoids etc. The technology was illustrated with a simple case study confirming its viability.
AI improvements can be employed for both learning new skills as well as for “learning errors away” in existing AI architectures. Theoretical results are not limited to the realm of AI. Similar to [28], [51], the results can be employed to explain extreme selectivity of neurons and reveal simple mechanisms of learning in stratified brain structures.
The proposed concept builds on our previous work [24] and extends results for equidistributions in a unit ball to product measure distributions. When the clustering structure is fixed, the method is inherently one-shot and non-iterative, and its computational complexity scales at most linearly with the size of the training set. The at most linear computational complexity of the ensembles’ construction makes the technology particularly suitable for large AI systems that have already been deployed and are in operation.
An intriguing feature of the approach is that training of the ensembles is largely achieved via Fisher linear discriminants. This, combined with the ideas from [26], paves the way for a potential relaxation of the i.i.d. assumption for the sampled data. Dealing with this as well as exploring the proposed technology on a range of different AI architectures and applications [27] is the focus of our future work.
6 Acknowledgement
The work is supported by Innovate UK Technology Strategy Board (Knowledge Transfer Partnership grants KTP009890 and KTP010522) and by the Ministry of Education and Science of the Russian Federation (Project No. 14.Y26.31.0022). The authors are thankful to Tatiana Tyukina for help and assistance with preparing the manuscript and artwork.
References
 Anderson et al. [2014] J. Anderson, M. Belkin, N. Goyal, L. Rademacher, and J. Voss. The more, the merrier: the blessing of dimensionality for learning large Gaussian mixtures. Journal of Machine Learning Research: Workshop and Conference Proceedings, 35:1–30, 2014.
 Arthur and Vassilvitskii [2007] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
 Arthur and Vassilvitskii [2006] D. Arthur and S. Vassilvitskii. How slow is the k-means method? In Proceedings of the twenty-second annual symposium on Computational geometry, pages 144–153. ACM, 2006.
 Ball [1997] K. Ball. An elementary introduction to modern convex geometry. Flavors of geometry, 31:1–58, 1997.
 Barron [1993] A. R. Barron. Universal approximation bounds for superposition of a sigmoidal function. IEEE Trans. on Information Theory, 39(3):930–945, 1993.
 Beene et al. [2018] R. Beene, A. Levin, and E. Newcomer. Uber self-driving test car in crash wasn’t programmed to brake. https://www.bloomberg.com/news/articles/20180524/uberselfdrivingsystemsawpedestriankilledbutdidntstop, 2018.
 Bheda and Radpour [2017] V. Bheda and D. Radpour. Using deep convolutional networks for gesture recognition in American Sign Language. arXiv preprint arXiv:1710.06836, 2017.
 Brahma et al. [2016] P. P. Brahma, D. Wu, and Y. She. Why deep learning works: a manifold disentanglement perspective. IEEE Transactions On Neural Networks And Learning Systems, 27(10):1997–2008, 2016.
 Chapelle [2007] O. Chapelle. Training a support vector machine in the primal. Neural computation, 19(5):1155–1178, 2007.
 Chen et al. [2015] T. Chen, I. Goodfellow, and J. Shlens. Net2net: Accelerating learning via knowledge transfer. ICLR 2016, 2015.
 Cucker and Smale [2002] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1):1–49, 2002.
 Cybenko [1989] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. of Control, Signals and Systems, 2:303–314, 1989.
 Donoho [2000] D. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. AMS math challenges lecture, 1(2000):32, 2000.
 Draelos et al. [2016] T. J. Draelos, N. E. Miner, C. C. Lamb, C. M. Vineyard, K. D. Carlson, C. D. James, and J. B. Aimone. Neurogenesis deep learning. arXiv preprint arXiv:1612.03770, 2016.
 Fahlman and Lebiere [1990] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In Advances in neural information processing systems, pages 524–532, 1990.
 Foxx [2018] C. Foxx. Face recognition police tools “staggeringly inaccurate”. http://www.bbc.co.uk/news/technology44089161, 2018.
 Freund and Schapire [1997] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
 Gama and Brazdil [2000] J. Gama and P. Brazdil. Cascade generalization. Machine learning, 41(3):315–343, 2000.
 Gibbs [1960 (1902)] J. Gibbs. Elementary Principles in Statistical Mechanics, developed with especial reference to the rational foundation of thermodynamics. Dover Publications, New York, 1960 (1902).
 Gorban [1998] A. N. Gorban. Approximation of continuous functions of several variables by an arbitrary nonlinear continuous function of one variable, linear functions, and their superpositions. Applied Mathematics Letters, 11(3):45–49, 1998.
 Gorban [2007] A. N. Gorban. Order-disorder separation: Geometric revision. Physica A, 374:85–102, 2007.
 Gorban and Tyukin [2017] A. N. Gorban and I. Y. Tyukin. Stochastic separation theorems. Neural Networks, 94:255–259, 2017.
 Gorban and Tyukin [2018] A. N. Gorban and I. Y. Tyukin. Blessing of dimensionality: mathematical foundations of the statistical physics of data. Philosophical Transactions of the Royal Society A, 376, 2018.
 Gorban et al. [2016a] A. N. Gorban, R. Burton, I. Romanenko, and I. Y. Tyukin. One-trial correction of legacy AI systems and stochastic separation theorems. arXiv preprint arXiv:1610.00494, 2016a.
 Gorban et al. [2016b] A. N. Gorban, I. Y. Tyukin, and I. Romanenko. The blessing of dimensionality: Separation theorems in the thermodynamic limit, 09 2016b. URL https://arxiv.org/abs/1610.00494v1. A talk given at TFMST 2016, 2nd IFAC Workshop on Thermodynamic Foundations of Mathematical Systems Theory. September 28-30, 2016, Vigo, Spain.
 Gorban et al. [2018a] A. N. Gorban, A. Golubkov, B. Grechuk, E. M. Mirkes, and I. Y. Tyukin. Correction of AI systems by linear discriminants: Probabilistic foundations. Information Sciences, 466:303–322, 2018a.
 Gorban et al. [2018b] A. N. Gorban, B. Grechuk, and I. Y. Tyukin. Augmented artificial intelligence. arXiv preprint arXiv:1802.02172, 2018b.
 Gorban et al. [2018c] A. N. Gorban, V. A. Makarov, and I. Y. Tyukin. The unreasonable effectiveness of small neural ensembles in high-dimensional brain. Physics of Life Reviews, 2018c. doi: 10.1016/j.plrev.2018.09.005.
 Gromov [1999] M. Gromov. Metric Structures for Riemannian and non-Riemannian Spaces. With appendices by M. Katz, P. Pansu, S. Semmes. Translated from the French by Sean Michael Bates. Birkhäuser, Boston, MA, 1999.
 Gromov [2003] M. Gromov. Isoperimetry of waists and concentration of maps. GAFA, Geometric and Functional Analysis, 13:178–215, 2003.
 Hartigan and Wong [1979] J. A. Hartigan and M. A. Wong. A K-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.
 He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 Hoeffding [1963] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963.
 Iandola et al. [2016] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
 Jackson [1993] D. Jackson. Stopping rules in principal components analysis: A comparison of heuristical and statistical approaches. Ecology, 74(8):2204–2214, 1993.
 Kainen [1997] P. C. Kainen. Utilizing geometric anomalies of high dimension: When complexity makes computation easier. In Computer Intensive Methods in Control and Signal Processing, pages 283–294. Springer, 1997.
 Kainen and Kurkova [1993] P. C. Kainen and V. Kurkova. Quasiorthogonal dimension of Euclidean spaces. Appl. Math. Lett., 6(3):7–10, 1993.
 Kuznetsova et al. [2015] A. Kuznetsova, S. Hwang, B. Rosenhahn, and L. Sigal. Expanding object detector’s horizon: Incremental learning framework for object detection in videos. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 28–36, 2015.
 Lévy [1951] P. Lévy. Problèmes concrets d’analyse fonctionnelle. Gauthier-Villars, Paris, second edition, 1951.
 Li and Wang [2017] M. Li and D. Wang. Insights into randomized algorithms for neural networks: Practical issues and common pitfalls. Information Sciences, 382:170–178, 2017.
 Minsky and Papert [1969] M. Minsky and S. A. Papert. Perceptrons: An introduction to computational geometry. MIT press, 1969.
 Misra et al. [2015] I. Misra, A. Shrivastava, and M. Hebert. Semi-supervised learning for object detectors from video. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3594–3602, 2015.
 Motzkin and Schoenberg [1954] T. Motzkin and I. Schoenberg. The relaxation method for linear inequalities. Canadian Journal of Mathematics, 6(3):393–404, 1954.
 Pratt [1992] L. Pratt. Discriminability-based transfer between neural networks. Advances in Neural Information Processing, (5):204–211, 1992.
 Prest et al. [2012] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3282–3289, 2012.
 Rokach [2010] L. Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39, 2010.
 Rosenblatt [1962] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, 1962.
 Russakovsky et al. [2014] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis., pages 1–42, 2014. DOI:10.1007/s11263-015-0816-y.
 Scardapane and Wang [2017] S. Scardapane and D. Wang. Randomness in neural networks: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(2), 2017.
 Szegedy et al. [2015] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 Tyukin et al. [2018] I. Y. Tyukin, A. N. Gorban, C. Calvo, J. Makarova, and V. A. Makarov. High-dimensional brain. A tool for encoding and rapid learning of memories by single neurons. Bulletin of Mathematical Biology, pages 1–33, 2018.
 Vapnik and Chapelle [2000] V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 12(9):2013–2036, 2000.
 Vapnik and Izmailov [2017] V. Vapnik and R. Izmailov. Knowledge transfer in SVM and neural networks. Annals of Mathematics and Artificial Intelligence, pages 1–17, 2017.
 Wang and Cui [2017] D. Wang and C. Cui. Stochastic configuration networks ensemble with heterogeneous features for large-scale data analytics. Information Sciences, 417:55–71, 2017.
 Wang and Li [2017a] D. Wang and M. Li. Deep stochastic configuration networks with universal approximation property. arXiv preprint arXiv:1702.05639, 2017a.
 Wang and Li [2017b] D. Wang and M. Li. Stochastic configuration networks: Fundamentals and algorithms. IEEE transactions on cybernetics, 47(10):3466–3479, 2017b.
 Werbos [1974] P. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph. D. thesis. Harvard University, Cambridge, MA, 1974.
 Yosinski et al. [2014] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
 Zheng et al. [2016] S. Zheng, Y. Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via stability training. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4480–4488, 2016. https://arxiv.org/abs/1604.04326.
Appendix A Proofs of theorems and other technical statements
Proof of Theorem 1
The proof follows immediately from Theorem 2 in [22] and Hoeffding inequality [33]. Indeed, if , are independent bounded random variables, i.e. , , and then (Hoeffding inequality)
Given that where are independent random variables with , (Assumption 1), we observe that
Denoting and recalling that we conclude that (2) holds true.
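For reference, the textbook form of the Hoeffding inequality [33] for independent random variables bounded in the unit interval can be written as below. The symbols X_i, n and t are the standard ones and are an assumption of this note; they need not coincide with the notation of the original derivation.

```latex
% Standard Hoeffding bound for independent X_1, ..., X_n with 0 <= X_i <= 1.
\[
  \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
  P\bigl(\bar{X} - E[\bar{X}] \ge t\bigr) \le \exp\bigl(-2 n t^{2}\bigr), \qquad
  P\bigl(E[\bar{X}] - \bar{X} \ge t\bigr) \le \exp\bigl(-2 n t^{2}\bigr),
  \quad t > 0 .
\]
% Combining the two one-sided bounds gives the two-sided estimate.
\[
  P\bigl(\lvert \bar{X} - E[\bar{X}] \rvert \ge t\bigr) \le 2\exp\bigl(-2 n t^{2}\bigr).
\]
```

Applied to bounded, independent squared coordinates, bounds of this type control how far the squared norm of a sample deviates from its expectation, which is the role the inequality plays in the proof above.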
Proof of Theorem 3
Proof of Theorem 5
Proof of Corollary 6
Proof of Corollary 7
Proof of Lemma 8
The proof follows that of Theorem 13 in [25]. Let be an element of and be the probability of the event . Then the probability of the event that for at most elements from is
Observe that , as a function of , is monotone and nonincreasing on the interval , with at and at . Hence
Given that for , we obtain
Expanding at with the Lagrange remainder term:
results in the estimate
Hence
Appendix B Hyperplanes vs ellipsoids for error isolation
Consider
(20) 
Let