
# Unsupervised and Semi-supervised Anomaly Detection with LSTM Neural Networks

Tolga Ergen, Ali H. Mirza, and Suleyman S. Kozat, Senior Member, IEEE. This work is supported in part by the Outstanding Researcher Programme of the Turkish Academy of Sciences and by TUBITAK Contract No. 117E153. The authors are with the Department of Electrical and Electronics Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey, Tel: +90 (312) 290-2336, Fax: +90 (312) 290-1223 (contact e-mail: {ergen, mirza, kozat}@ee.bilkent.edu.tr).
###### Abstract

We investigate anomaly detection in an unsupervised framework and introduce Long Short Term Memory (LSTM) neural network based algorithms. In particular, given variable length data sequences, we first pass these sequences through our LSTM based structure and obtain fixed length sequences. We then find a decision function for our anomaly detectors based on the One Class Support Vector Machines (OC-SVM) and Support Vector Data Description (SVDD) algorithms. For the first time in the literature, we jointly train and optimize the parameters of the LSTM architecture and the OC-SVM (or SVDD) algorithm using highly effective gradient and quadratic programming based training methods. To apply the gradient based training method, we modify the original objective criteria of the OC-SVM and SVDD algorithms, where we prove the convergence of the modified objective criteria to the original criteria. We also provide extensions of our unsupervised formulation to the semi-supervised and fully supervised frameworks. Thus, we obtain anomaly detection algorithms that can process variable length data sequences while providing high performance, especially for time series data. Our approach is generic, so we also apply it to the Gated Recurrent Unit (GRU) architecture by directly replacing our LSTM based structure with the GRU based structure. In our experiments, we illustrate significant performance gains achieved by our algorithms with respect to the conventional methods.

**Keywords:** Anomaly detection, Support Vector Machines, Support Vector Data Description, LSTM, GRU.

## I Introduction

### I-A Preliminaries

Anomaly detection [1] has attracted significant interest in the contemporary learning literature due to its applications in a wide range of engineering problems, e.g., sensor failure [2], network monitoring [3], cybersecurity [4] and surveillance [5]. In this paper, we study the variable length anomaly detection problem in an unsupervised framework, where we seek to find a function to decide whether each unlabeled variable length sequence in a given dataset is anomalous or not. Note that although this problem is extensively studied in the literature and there exist different methods, e.g., supervised (or semi-supervised) methods, that require the knowledge of data labels, we employ an unsupervised method due to the high cost of obtaining accurate labels in most real life applications [1], such as cybersecurity [4] and surveillance [5]. However, we also extend our derivations to the semi-supervised and fully supervised frameworks for completeness.

In the current literature, a common and widely used approach for anomaly detection is to find a decision function that defines the model of normality [1]. In this approach, one first defines a certain decision function and then optimizes the parameters of this function with respect to a predefined objective criterion, e.g., the One Class Support Vector Machines (OC-SVM) and Support Vector Data Description (SVDD) algorithms [6, 7]. However, algorithms based on this approach must examine time series data over a sufficiently long time window to achieve an acceptable performance [1, 8, 9]. Thus, their performance significantly depends on the length of this time window, so this approach requires careful selection of the time window length to provide a satisfactory performance [10, 8]. To enhance performance for time series data, approaches based on neural networks, especially Recurrent Neural Networks (RNNs), have been introduced thanks to their inherent memory structure that can store "time" or "state" information [1, 11]. However, since the basic RNN architecture does not have control structures (gates) to regulate the amount of information to be stored [12, 13], a more advanced RNN architecture with several control structures, i.e., the Long Short Term Memory (LSTM) network, was introduced [14, 13]. However, neural network based approaches do not directly optimize an objective criterion for anomaly detection [15, 1]. Instead, they first predict a sequence from its past samples and then determine whether the sequence is an anomaly or not based on the prediction error, i.e., an anomaly is an event that cannot be predicted from the past nominal data [1]. Thus, they require a probabilistic model for the prediction error and a threshold on the probabilistic model to detect anomalies, which results in challenging optimization problems and restricts their performance accordingly [1, 15, 16].
Furthermore, both the common and neural networks based approaches can process only fixed length vector sequences, which significantly limits their usage in real life applications [1].

In order to circumvent these issues, we introduce novel LSTM based anomaly detection algorithms for variable length data sequences. In particular, we first pass variable length data sequences through an LSTM based structure to obtain fixed length representations. We then apply our OC-SVM [6] and SVDD [7] based algorithms for detecting anomalies in the extracted fixed length vectors as illustrated in Fig. 1. Unlike the previous approaches in the literature [1], we jointly train the parameters of the LSTM architecture and the OC-SVM (or SVDD) formulation to maximize the detection performance. For this joint optimization, we propose two different training methods, i.e., a quadratic programming based and a gradient based algorithm, where the merits of each approach are detailed in the paper. For our gradient based training method, we modify the original OC-SVM and SVDD formulations and then provide the convergence results of the modified formulations to the original ones. Thus, instead of following the prediction based approaches [1, 15, 16] in the current literature, we define proper objective functions for anomaly detection using the LSTM architecture and optimize the parameters of the LSTM architecture via these well defined objective functions. Hence, our anomaly detection algorithms are able to process variable length sequences and provide high performance for time series data. Furthermore, since we introduce a generic approach in the sense that it can be applied to any RNN architecture, we also apply our approach to the Gated Recurrent Unit (GRU) architecture [17], i.e., an advanced RNN architecture like the LSTM architecture, in our simulations. Through an extensive set of experiments, we demonstrate significant performance gains with respect to the conventional methods [6, 7].

### I-B Prior Art and Comparisons

Several different methods have been introduced for the anomaly detection problem [1]. Among these methods, the OC-SVM [6] and SVDD [7] algorithms are generally employed due to their high performance in real life applications [18]. However, these algorithms provide inadequate performance for time series data due to their inability to capture time dependencies [8, 9]. In order to improve the performance of these algorithms for time series data, in [9], the authors convert time series data into a set of vectors by replicating each sample so that they obtain two dimensional vector sequences. However, even though they obtain two dimensional vector sequences, the second dimension does not provide additional information, so this approach still yields inadequate performance for time series data [8]. As another approach, the OC-SVM based method in [8] acquires a set of vectors from time series data by unfolding the data into a phase space using a time delay embedding process [19]. More specifically, for a certain sample, they create a fixed dimensional vector by using the previous samples along with the sample itself [8]. However, in order to obtain a satisfactory performance from this approach, the embedding dimensionality should be carefully tuned, which restricts its usage in real life applications [20]. On the other hand, even though LSTM based algorithms provide high performance for time series data, we have to solve highly complex optimization problems to get an adequate performance [1]. As an example, the LSTM based anomaly detection algorithms in [10, 21] first predict time series data and then fit a multivariate Gaussian distribution to the error, where they also select a threshold for this distribution. Here, they allocate a different set of sequences to learn the parameters of the distribution and the threshold via the maximum likelihood estimation technique [10, 21].
Thus, the conventional LSTM based approaches require careful selection of several additional parameters, which significantly degrades their performance in real life [1, 10]. Furthermore, both the OC-SVM (or SVDD) and LSTM based methods are able to process only fixed length sequences [6, 7, 10]. To circumvent these issues, we introduce generic LSTM based anomaly detectors for variable length data sequences, where we jointly train the parameters of the LSTM architecture and the OC-SVM (or SVDD) formulation via a predefined objective function. Therefore, we not only obtain high performance for time series data but also enjoy joint and effective optimization of the parameters with respect to a well defined objective function.

### I-C Contributions

Our main contributions are as follows:

• We introduce LSTM based anomaly detection algorithms in an unsupervised framework, where we also extend our derivations to the semi-supervised and fully supervised frameworks.

• For the first time in the literature, we jointly train the parameters of the LSTM architecture and the OC-SVM (or SVDD) formulation via a well defined objective function, where we introduce two different joint optimization methods. For our gradient based joint optimization method, we modify the OC-SVM and SVDD formulations and then prove the convergence of the modified formulations to the original ones.

• Thanks to our LSTM based structure, the introduced methods are able to process variable length data sequences. Additionally, unlike the conventional methods [6, 7], our methods effectively detect anomalies in time series data without requiring any preprocessing.

• Through an extensive set of experiments involving real and simulated data, we illustrate significant performance improvements achieved by our algorithms with respect to the conventional methods [6, 7]. Moreover, since our approach is generic, we also apply it to the recently proposed GRU architecture [17] in our experiments.

### I-D Organization of the Paper

The organization of this paper is as follows. In Section II, we first describe the variable length anomaly detection problem and then introduce our LSTM based structure. In Section III-A, we introduce anomaly detection algorithms based on the OC-SVM formulation, where we also propose two different joint training methods in order to learn the LSTM and SVM parameters. The merits of each approach are also detailed in the same section. In a similar manner, we introduce anomaly detection algorithms based on the SVDD formulation and provide two different joint training methods to learn the parameters in Section III-B. In Section IV, we demonstrate performance improvements on several real life datasets. In the same section, thanks to our generic approach, we also introduce GRU based anomaly detection algorithms. Finally, we provide concluding remarks in Section V.

## II Model and Problem Description

In this paper, all vectors are column vectors and denoted by boldface lower case letters. Matrices are represented by boldface uppercase letters. For a vector $\mathbf{u}$, $\mathbf{u}^T$ is its ordinary transpose and $\|\mathbf{u}\|_2 = \sqrt{\mathbf{u}^T\mathbf{u}}$ is the $\ell_2$-norm. The time index is given as a subscript, e.g., $\mathbf{u}_t$ is the $t$th vector. Here, $\mathbf{1}$ (and $\mathbf{0}$) is a vector of all ones (and zeros) and $\mathbf{I}$ represents the identity matrix, where the sizes are understood from the context.

We observe data sequences $\{\mathbf{X}_i\}_{i=1}^{n}$, defined as

$$\mathbf{X}_i = [\mathbf{x}_{i,1}\ \mathbf{x}_{i,2}\ \ldots\ \mathbf{x}_{i,d_i}],$$

where $\mathbf{x}_{i,j} \in \mathbb{R}^p$, $\forall j$, and $d_i$ is the number of columns in $\mathbf{X}_i$, which can vary with respect to $i$. Here, we assume that the bulk of the observed sequences are normal and the remaining sequences are anomalous. Our aim is to find a scoring (or decision) function to determine whether $\mathbf{X}_i$ is anomalous or not based on the observed data, where $+1$ and $-1$ represent the outputs of the desired scoring function for nominal and anomalous data, respectively. As an example application for this framework, in host based intrusion detection [1], the system handles operating system call traces, where the data consists of system calls that are generated by users or programs. All traces contain system calls that belong to the same alphabet; however, the co-occurrence of the system calls is the key issue in detecting anomalies [1]. For different programs, these system calls are executed in different sequences, where the length of the sequence may vary for each program. After observing such a set of binary encoded call sequences, our aim is to find a scoring function that successfully distinguishes the anomalous call sequences from the normal sequences.

In order to find a scoring function such that

$$l(\mathbf{X}_i) = \begin{cases} -1, & \text{if } \mathbf{X}_i \text{ is anomalous} \\ +1, & \text{otherwise}, \end{cases}$$

one can use the OC-SVM algorithm [6] to find a hyperplane that separates the anomalies from the normal data, or the SVDD algorithm [7] to find a hypersphere enclosing the normal data while leaving the anomalies outside the hypersphere. However, these algorithms can only process fixed length sequences. Hence, we use the LSTM architecture [14] to obtain a fixed length vector representation for each $\mathbf{X}_i$. Although there exist several different versions of the LSTM architecture, we use the most widely employed one, i.e., the LSTM architecture without peephole connections [13]. We first feed $\mathbf{X}_i$ to the LSTM architecture as demonstrated in Fig. 2, where the internal LSTM equations are as follows [14]:

$$\begin{aligned}
\mathbf{z}_{i,j} &= g(\mathbf{W}^{(z)} \mathbf{x}_{i,j} + \mathbf{R}^{(z)} \mathbf{h}_{i,j-1} + \mathbf{b}^{(z)}) \quad &(1)\\
\mathbf{s}_{i,j} &= \sigma(\mathbf{W}^{(s)} \mathbf{x}_{i,j} + \mathbf{R}^{(s)} \mathbf{h}_{i,j-1} + \mathbf{b}^{(s)}) \quad &(2)\\
\mathbf{f}_{i,j} &= \sigma(\mathbf{W}^{(f)} \mathbf{x}_{i,j} + \mathbf{R}^{(f)} \mathbf{h}_{i,j-1} + \mathbf{b}^{(f)}) \quad &(3)\\
\mathbf{c}_{i,j} &= \mathbf{s}_{i,j} \odot \mathbf{z}_{i,j} + \mathbf{f}_{i,j} \odot \mathbf{c}_{i,j-1} \quad &(4)\\
\mathbf{o}_{i,j} &= \sigma(\mathbf{W}^{(o)} \mathbf{x}_{i,j} + \mathbf{R}^{(o)} \mathbf{h}_{i,j-1} + \mathbf{b}^{(o)}) \quad &(5)\\
\mathbf{h}_{i,j} &= \mathbf{o}_{i,j} \odot g(\mathbf{c}_{i,j}), \quad &(6)
\end{aligned}$$

where $\mathbf{c}_{i,j}$ is the state vector, $\mathbf{x}_{i,j}$ is the input vector and $\mathbf{h}_{i,j}$ is the output vector for the $j$th LSTM unit in Fig. 2. Additionally, $\mathbf{s}_{i,j}$, $\mathbf{f}_{i,j}$ and $\mathbf{o}_{i,j}$ are the input, forget and output gates, respectively. Here, $g(\cdot)$ is set to the hyperbolic tangent function, i.e., $\tanh$, and applies to input vectors pointwise. Similarly, $\sigma(\cdot)$ is set to the sigmoid function. $\odot$ is the operation for elementwise multiplication of two same sized vectors. Furthermore, $\mathbf{W}^{(\cdot)}$, $\mathbf{R}^{(\cdot)}$ and $\mathbf{b}^{(\cdot)}$ are the parameters of the LSTM architecture, where the size of each is selected according to the dimensionality of the input and output vectors. After applying the LSTM architecture to each column of our data sequences as illustrated in Fig. 2, we take the average of the LSTM outputs for each data sequence, i.e., the mean pooling method. By this, we obtain a new set of fixed length representations, denoted as $\bar{\mathbf{h}}_i$, $i = 1, \ldots, n$. Note that we also use the same procedure to obtain the state information $\bar{\mathbf{c}}_i$ for each $\mathbf{X}_i$ as demonstrated in Fig. 2.
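As a concrete illustration of (1)-(6) followed by mean pooling, the sketch below maps a variable length sequence to a fixed length vector. This is our own minimal NumPy rendering for illustration; the function name `lstm_pool` and the dictionary layout of the parameters are assumptions, not notation from the paper.

```python
import numpy as np

def lstm_pool(X, W, R, b, m):
    """Pass a variable-length sequence X (columns x_{i,j}, shape p x d_i)
    through one LSTM layer without peepholes, eqs. (1)-(6), then
    mean-pool the outputs h_{i,j} into a single fixed-length vector.

    W, R, b are dicts keyed by 'z', 's', 'f', 'o' holding the input
    weights (m x p), recurrent weights (m x m) and biases (m,)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    h = np.zeros(m)                                      # h_{i,0}
    c = np.zeros(m)                                      # c_{i,0}
    outputs = []
    for x in X.T:                                        # columns x_{i,1}, ..., x_{i,d_i}
        z = np.tanh(W['z'] @ x + R['z'] @ h + b['z'])    # block input, eq. (1)
        s = sigmoid(W['s'] @ x + R['s'] @ h + b['s'])    # input gate,  eq. (2)
        f = sigmoid(W['f'] @ x + R['f'] @ h + b['f'])    # forget gate, eq. (3)
        c = s * z + f * c                                # cell state,  eq. (4)
        o = sigmoid(W['o'] @ x + R['o'] @ h + b['o'])    # output gate, eq. (5)
        h = o * np.tanh(c)                               # output,      eq. (6)
        outputs.append(h)
    return np.mean(outputs, axis=0)                      # mean pooling: fixed length m
```

Two sequences of different lengths $d_i$ are thus mapped to vectors of the same dimension $m$, which is exactly what the OC-SVM and SVDD stages require.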

###### Remark 1.

We use the mean pooling method in order to obtain the fixed length representations, i.e., $\bar{\mathbf{h}}_i = \frac{1}{d_i}\sum_{j=1}^{d_i} \mathbf{h}_{i,j}$. However, we can also use other pooling methods. As an example, for the last and max pooling methods, we use $\bar{\mathbf{h}}_i = \mathbf{h}_{i,d_i}$ and the elementwise maximum of $\{\mathbf{h}_{i,j}\}_{j=1}^{d_i}$, respectively. Our derivations can be straightforwardly extended to these different pooling methods.
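The three pooling choices mentioned in Remark 1 differ only in how the LSTM outputs are aggregated. A small numerical sketch (the matrix `H` and its values are arbitrary, chosen by us for illustration):

```python
import numpy as np

# H stacks the LSTM outputs h_{i,1}, ..., h_{i,d_i} as rows (d_i x m).
H = np.array([[0.1, -0.4],
              [0.3,  0.2],
              [0.2, -0.1]])

h_mean = H.mean(axis=0)   # mean pooling: average over j
h_last = H[-1]            # last pooling: h_{i,d_i}
h_max  = H.max(axis=0)    # max pooling: elementwise maximum over j
```

All three produce an $m$-dimensional vector regardless of $d_i$, so any of them can feed the subsequent OC-SVM or SVDD stage.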

## III Novel Anomaly Detection Algorithms

In this section, we first formulate the anomaly detection approaches based on the OC-SVM and SVDD algorithms. We then provide joint optimization updates to train the parameters of the overall structure.

### III-A Anomaly Detection with the OC-SVM Algorithm

In this subsection, we provide an anomaly detection algorithm based on the OC-SVM formulation and derive the joint updates for both the LSTM and SVM parameters. For the training, we first provide a quadratic programming based algorithm and then introduce a gradient based training algorithm. To apply the gradient based training method, we smoothly approximate the original OC-SVM formulation and then prove the convergence of the approximated formulation to the actual one in the following subsections.

In the OC-SVM algorithm, our aim is to find a hyperplane that separates the anomalies from the normal data [6]. We formulate the OC-SVM optimization problem over the fixed length representations $\bar{\mathbf{h}}_i$ as follows [6]

$$\min_{\boldsymbol{\theta} \in \mathbb{R}^{n_\theta},\, \mathbf{w} \in \mathbb{R}^m,\, \boldsymbol{\xi} \in \mathbb{R}^n,\, \rho \in \mathbb{R}} \ \frac{\|\mathbf{w}\|_2^2}{2} + \frac{1}{n\lambda}\sum_{i=1}^{n} \xi_i - \rho \quad (7)$$
$$\text{subject to: } \mathbf{w}^T \bar{\mathbf{h}}_i \ge \rho - \xi_i, \ \xi_i \ge 0, \ \forall i \quad (8)$$
$$\mathbf{W}^{(\cdot)T} \mathbf{W}^{(\cdot)} = \mathbf{I},\ \mathbf{R}^{(\cdot)T} \mathbf{R}^{(\cdot)} = \mathbf{I} \ \text{ and } \ \mathbf{b}^{(\cdot)T} \mathbf{b}^{(\cdot)} = 1, \quad (9)$$

where $\mathbf{w}$ and $\rho$ are the parameters of the separating hyperplane, $\lambda > 0$ is a regularization parameter, $\xi_i$ is a slack variable to penalize misclassified instances, and we group the LSTM parameters into $\boldsymbol{\theta} \in \mathbb{R}^{n_\theta}$. Since the LSTM parameters are unknown and $\bar{\mathbf{h}}_i$ is a function of these parameters, we also minimize the cost function in (7) with respect to $\boldsymbol{\theta}$.

After solving the optimization problem in (7), (8) and (9), we use the scoring function

$$l(\mathbf{X}_i) = \text{sgn}(\mathbf{w}^T \bar{\mathbf{h}}_i - \rho) \quad (10)$$

to detect the anomalous data, where the $\text{sgn}(\cdot)$ function returns the sign of its input.

We emphasize that while minimizing (7) with respect to $\boldsymbol{\theta}$, we might suffer from overfitting and impotent learning of the time dependencies in the data [22], i.e., the optimization forcing the parameters to null values. To circumvent these issues, we introduce (9), which constrains the norm of each LSTM parameter to avoid overfitting and trivial solutions, e.g., $\boldsymbol{\theta} = \mathbf{0}$, while boosting the ability of the LSTM architecture to capture time dependencies [22, 23].

###### Remark 2.

In (9), we use an orthogonality constraint for each LSTM parameter. However, we can also use other constraints instead of (9) and solve the optimization problem in (7), (8) and (9) in the same manner. As an example, a common choice of constraint for neural networks is based on the Frobenius norm [24], defined as

$$\|\mathbf{A}\|_F = \sqrt{\sum_i \sum_j \mathbf{A}_{ij}^2} \quad (11)$$

for a real matrix $\mathbf{A}$, where $\mathbf{A}_{ij}$ represents the element at the $i$th row and $j$th column of $\mathbf{A}$. In this case, we can directly replace (9) with a Frobenius norm constraint for each LSTM parameter as in (11) and then solve the optimization problem in the same manner. Such approaches only aim to regularize the parameters [23]. However, for RNNs, we may also encounter exponential growth or decay in the norm of the gradients while training the parameters, which significantly degrades the capability of these architectures to capture time dependencies [22, 23]. Thus, in this paper, we impose the constraint (9) in order to regularize the parameters while improving the capability of the LSTM architecture in capturing time dependencies [22, 23].

#### III-A1 Quadratic Programming Based Training Algorithm

Here, we introduce a training approach based on quadratic programming for the optimization problem in (7), (8) and (9), where we perform consecutive updates for the LSTM and SVM parameters. For this purpose, we first convert the optimization problem to a dual form in the following. We then provide the consecutive updates for each parameter.

We have the following Lagrangian for the SVM parameters

$$L(\mathbf{w}, \boldsymbol{\xi}, \rho, \boldsymbol{\nu}, \boldsymbol{\alpha}) = \frac{\|\mathbf{w}\|_2^2}{2} + \frac{1}{n\lambda}\sum_{i=1}^{n}\xi_i - \rho - \sum_{i=1}^{n}\nu_i \xi_i - \sum_{i=1}^{n}\alpha_i \big(\mathbf{w}^T \bar{\mathbf{h}}_i - \rho + \xi_i\big), \quad (12)$$

where $\alpha_i \ge 0$ and $\nu_i \ge 0$, $\forall i$, are the Lagrange multipliers. Taking the derivative of (12) with respect to $\mathbf{w}$, $\boldsymbol{\xi}$ and $\rho$ and then setting the derivatives to zero gives

$$\mathbf{w} = \sum_{i=1}^{n}\alpha_i \bar{\mathbf{h}}_i \quad (13)$$
$$\sum_{i=1}^{n}\alpha_i = 1 \ \text{ and } \ \alpha_i = 1/(n\lambda) - \nu_i, \ \forall i. \quad (14)$$

Note that at the optimum, the inequalities in (8) become equalities if $\alpha_i$ and $\nu_i$ are nonzero, i.e., $0 < \alpha_i < 1/(n\lambda)$ [6]. With this relation, we compute $\rho$ as

$$\rho = \sum_{j=1}^{n}\alpha_j \bar{\mathbf{h}}_j^T \bar{\mathbf{h}}_i \ \text{ for } \ 0 < \alpha_i < 1/(n\lambda). \quad (15)$$

By substituting (13) and (14) into (12), we obtain the following dual problem for the constrained minimization in (7), (8) and (9)

$$\min_{\boldsymbol{\theta} \in \mathbb{R}^{n_\theta},\, \boldsymbol{\alpha} \in \mathbb{R}^n} \ \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j \bar{\mathbf{h}}_i^T \bar{\mathbf{h}}_j \quad (16)$$
$$\text{subject to: } \sum_{i=1}^{n}\alpha_i = 1 \ \text{ and } \ 0 \le \alpha_i \le 1/(n\lambda), \ \forall i \quad (17)$$
$$\mathbf{W}^{(\cdot)T} \mathbf{W}^{(\cdot)} = \mathbf{I},\ \mathbf{R}^{(\cdot)T} \mathbf{R}^{(\cdot)} = \mathbf{I} \ \text{ and } \ \mathbf{b}^{(\cdot)T} \mathbf{b}^{(\cdot)} = 1, \quad (18)$$

where $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_n]^T$ is a vector representation for the $\alpha_i$'s. Since the LSTM parameters are unknown, we also include the minimization over $\boldsymbol{\theta}$ in (16) as in (7). By substituting (13) into (10), we have the following scoring function for the dual problem

$$l(\mathbf{X}_i) = \text{sgn}\Big(\sum_{j=1}^{n}\alpha_j \bar{\mathbf{h}}_j^T \bar{\mathbf{h}}_i - \rho\Big), \quad (19)$$

where we calculate $\rho$ using (15).

In order to find the optimal $\boldsymbol{\theta}$ and $\boldsymbol{\alpha}$ for the optimization problem in (16), (17) and (18), we employ the following procedure. We first select an initial set of LSTM parameters $\boldsymbol{\theta}_0$. Based on this, we find the minimizing $\boldsymbol{\alpha}$ using the Sequential Minimal Optimization (SMO) algorithm [25]. We then fix $\boldsymbol{\alpha}$ at this value and update $\boldsymbol{\theta}$ using the algorithm for optimization with orthogonality constraints in [26]. We repeat these consecutive update procedures until $\boldsymbol{\alpha}$ and $\boldsymbol{\theta}$ converge [27]. Then, we use the converged values in order to evaluate (19). In the following, we explain the update procedures for $\boldsymbol{\alpha}$ and $\boldsymbol{\theta}$ in detail.

Based on $\boldsymbol{\theta}_k$, i.e., the LSTM parameter vector at the $k$th iteration, we update $\boldsymbol{\alpha}_k$, i.e., the $\boldsymbol{\alpha}$ vector at the $k$th iteration, using the SMO algorithm due to its efficiency in solving quadratic constrained optimization problems [25]. In the SMO algorithm, we choose a subset of the parameters to minimize over and fix the rest. In the extreme case, we would choose only one parameter to minimize over; however, due to (17), we must choose at least two. To illustrate how the SMO algorithm works in our case, we choose $\alpha_1$ and $\alpha_2$ to update and fix the rest of the parameters in (16). From (17), we have

$$\alpha_1 = 1 - S - \alpha_2, \ \text{ where } \ S = \sum_{i=3}^{n}\alpha_i. \quad (20)$$

We first replace $\alpha_1$ in (16) with (20). We then take the derivative of (16) with respect to $\alpha_2$ and equate the derivative to zero. Thus, we obtain the following update for $\alpha_2$ at the $(k+1)$th iteration

$$\alpha_{k+1,2} = \frac{(\alpha_{k,1} + \alpha_{k,2})(K_{11} - K_{12}) + M_1 - M_2}{K_{11} + K_{22} - 2K_{12}}, \quad (21)$$

where $K_{lj} \triangleq \bar{\mathbf{h}}_l^T \bar{\mathbf{h}}_j$, $M_l \triangleq \sum_{j=3}^{n} \alpha_{k,j} K_{lj}$ and $\alpha_{k,i}$ represents the $i$th element of $\boldsymbol{\alpha}_k$. Due to (17), if the updated value of $\alpha_{k+1,2}$ is outside of the region $[0, 1/(n\lambda)]$, we project it to this region. Once $\alpha_2$ is updated as $\alpha_{k+1,2}$, we obtain $\alpha_{k+1,1}$ using (20). For the rest of the parameters, we repeat the same procedure, which eventually converges to a certain set of parameters [25]. By this way, we obtain $\boldsymbol{\alpha}_{k+1}$, i.e., the minimizing $\boldsymbol{\alpha}$ for $\boldsymbol{\theta}_k$.
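The pairwise update (21) with the projection onto the box constraint can be sketched as follows. This is our own illustrative code, not the paper's implementation; `smo_step` is a hypothetical name and `C` stands for the box bound $1/(n\lambda)$.

```python
import numpy as np

def smo_step(alpha, K, i, j, C):
    """One SMO update for the pair (alpha_i, alpha_j) of the dual (16):
    minimize (1/2) a^T K a subject to sum(a) = 1 and 0 <= a_k <= C,
    holding all other coordinates fixed (eq. (21))."""
    a = alpha.copy()
    mask = np.ones(len(a), dtype=bool)
    mask[[i, j]] = False
    c = a[i] + a[j]                     # preserved by the equality constraint
    M_i = K[i, mask] @ a[mask]          # M_1 in eq. (21)
    M_j = K[j, mask] @ a[mask]          # M_2 in eq. (21)
    denom = K[i, i] + K[j, j] - 2 * K[i, j]
    if denom <= 0:                      # degenerate pair: leave unchanged
        return a
    a_j = (c * (K[i, i] - K[i, j]) + M_i - M_j) / denom
    a_j = np.clip(a_j, max(0.0, c - C), min(C, c))  # keep both coords in [0, C]
    a[j] = a_j
    a[i] = c - a_j
    return a
```

Sweeping such steps over all pairs drives the dual objective down while keeping $\boldsymbol{\alpha}$ feasible at every iteration.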

Following the update of $\boldsymbol{\alpha}$, we update $\boldsymbol{\theta}$ based on the updated vector $\boldsymbol{\alpha}_{k+1}$. For this purpose, we employ the optimization method in [26]. Since we have an $\boldsymbol{\alpha}_{k+1}$ that satisfies (17), we reduce the dual problem to

$$\min_{\boldsymbol{\theta}} \ \kappa(\boldsymbol{\theta}, \boldsymbol{\alpha}_{k+1}) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{k+1,i}\, \alpha_{k+1,j}\, \bar{\mathbf{h}}_i^T \bar{\mathbf{h}}_j \quad (22)$$
$$\text{s.t.: } \mathbf{W}^{(\cdot)T} \mathbf{W}^{(\cdot)} = \mathbf{I},\ \mathbf{R}^{(\cdot)T} \mathbf{R}^{(\cdot)} = \mathbf{I} \ \text{ and } \ \mathbf{b}^{(\cdot)T} \mathbf{b}^{(\cdot)} = 1. \quad (23)$$

For (22) and (23), we update each LSTM coefficient matrix $\mathbf{W}^{(\cdot)}$ as follows

$$\mathbf{W}^{(\cdot)}_{k+1} = \Big(\mathbf{I} + \frac{\mu}{2}\mathbf{A}_k\Big)^{-1}\Big(\mathbf{I} - \frac{\mu}{2}\mathbf{A}_k\Big)\mathbf{W}^{(\cdot)}_k, \quad (24)$$

where the subscripts represent the iteration index, $\mu$ is the learning rate, $\mathbf{A}_k \triangleq \mathbf{G}_k \mathbf{W}^{(\cdot)T}_k - \mathbf{W}^{(\cdot)}_k \mathbf{G}_k^T$ is the skew-symmetric matrix constructed as in [26], and the element at the $i$th row and the $j$th column of $\mathbf{G}_k$, i.e., $\mathbf{G}_{ij}$, is defined as

$$\mathbf{G}_{ij} \triangleq \frac{\partial \kappa(\boldsymbol{\theta}, \boldsymbol{\alpha}_{k+1})}{\partial \mathbf{W}^{(\cdot)}_{ij}}. \quad (25)$$
###### Remark 3.

For $\mathbf{R}^{(\cdot)}$ and $\mathbf{b}^{(\cdot)}$, we first compute the gradient of the objective function with respect to the chosen parameter as in (25). We then obtain $\mathbf{A}_k$ according to the chosen parameter. Using $\mathbf{A}_k$, we update the chosen parameter as in (24).

With these updates, we obtain a quadratic programming based training algorithm (see Algorithm 1 for the pseudocode) for our LSTM based anomaly detector.
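The Cayley-transform update of (24) can be sketched as follows, using the skew-symmetric construction $\mathbf{A} = \mathbf{G}\mathbf{W}^T - \mathbf{W}\mathbf{G}^T$ from [26]. The function name `cayley_update` is our own; this is an illustrative sketch, not the paper's code.

```python
import numpy as np

def cayley_update(W, G, mu):
    """Orthogonality-preserving update of eq. (24): since
    A = G W^T - W G^T is skew-symmetric, the Cayley transform
    (I + mu/2 A)^{-1} (I - mu/2 A) is an orthogonal matrix, so
    the constraint W^T W = I of eq. (9) is maintained exactly."""
    A = G @ W.T - W @ G.T               # skew-symmetric by construction
    I = np.eye(W.shape[0])
    return np.linalg.solve(I + (mu / 2) * A, (I - (mu / 2) * A) @ W)
```

Because the transform is orthogonal, no separate re-projection onto the constraint set is needed after each step.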

#### III-A2 Gradient Based Training Algorithm

Although the quadratic programming based training algorithm directly optimizes the original OC-SVM formulation without requiring any approximation, since it depends on separate consecutive updates of the LSTM and OC-SVM parameters, it might not converge to even a local minimum [27]. In order to resolve this issue, in this subsection we introduce a training method based on only first order gradients, which updates all the parameters simultaneously. However, since this method requires an approximation to the original OC-SVM formulation, we also prove the convergence of the approximated formulation to the original OC-SVM formulation in this subsection.

Considering (8), we write the slack variable $\xi_i$ in a different form as follows

$$G(\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)) \triangleq \max\{0, \beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)\}, \ \forall i, \quad (26)$$

where

$$\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i) \triangleq \rho - \mathbf{w}^T \bar{\mathbf{h}}_i.$$

By substituting (26) into (7), we remove the constraint (8) and obtain the following optimization problem

$$\min_{\mathbf{w} \in \mathbb{R}^m,\, \rho \in \mathbb{R},\, \boldsymbol{\theta} \in \mathbb{R}^{n_\theta}} \ \frac{\|\mathbf{w}\|_2^2}{2} + \frac{1}{n\lambda}\sum_{i=1}^{n} G(\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)) - \rho \quad (27)$$
$$\text{s.t.: } \mathbf{W}^{(\cdot)T} \mathbf{W}^{(\cdot)} = \mathbf{I},\ \mathbf{R}^{(\cdot)T} \mathbf{R}^{(\cdot)} = \mathbf{I} \ \text{ and } \ \mathbf{b}^{(\cdot)T} \mathbf{b}^{(\cdot)} = 1. \quad (28)$$

Since (26) is not differentiable, we are unable to solve the optimization problem in (27) using gradient based optimization algorithms. Hence, we employ the differentiable function

$$S_\tau(\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)) = \frac{1}{\tau}\log\big(1 + e^{\tau \beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)}\big) \quad (29)$$

to smoothly approximate (26), where $\tau > 0$ is the smoothing parameter and $\log$ represents the natural logarithm. In (29), as $\tau$ increases, $S_\tau(\cdot)$ converges to $G(\cdot)$ (see Proposition 1 at the end of this section); hence, we choose a large value for $\tau$. With (29), we modify our optimization problem as follows

$$\min_{\mathbf{w} \in \mathbb{R}^m,\, \rho \in \mathbb{R},\, \boldsymbol{\theta} \in \mathbb{R}^{n_\theta}} F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta}) \quad (30)$$
$$\text{s.t.: } \mathbf{W}^{(\cdot)T} \mathbf{W}^{(\cdot)} = \mathbf{I},\ \mathbf{R}^{(\cdot)T} \mathbf{R}^{(\cdot)} = \mathbf{I} \ \text{ and } \ \mathbf{b}^{(\cdot)T} \mathbf{b}^{(\cdot)} = 1, \quad (31)$$

where $F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta})$ is the objective function of our optimization problem, defined as

$$F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta}) \triangleq \frac{\|\mathbf{w}\|_2^2}{2} + \frac{1}{n\lambda}\sum_{i=1}^{n} S_\tau(\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)) - \rho.$$
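To see the smoothing numerically, the sketch below compares $G$ in (26) with $S_\tau$ in (29); the gap is largest at zero and shrinks like $\log(2)/\tau$, consistent with the bound established in Proposition 1. This is our own illustrative code; the names `hinge` and `smoothed` are not from the paper.

```python
import numpy as np

def hinge(beta):
    """G in eq. (26): max{0, beta}."""
    return np.maximum(0.0, beta)

def smoothed(beta, tau):
    """S_tau in eq. (29): (1/tau) * log(1 + exp(tau * beta))."""
    return np.log1p(np.exp(tau * beta)) / tau
```

For example, at $\beta = 0$ the gap is exactly $\log(2)/\tau$, so increasing $\tau$ from 1 to 100 shrinks the worst-case approximation error by two orders of magnitude.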

To obtain the optimal parameters for (30) and (31), we update $\mathbf{w}$, $\rho$ and $\boldsymbol{\theta}$ until they converge to a local or global optimum [28, 26]. For the updates of $\mathbf{w}$ and $\rho$, we use the SGD algorithm [28], where we compute the first order gradient of the objective function with respect to each parameter. We first compute the gradient for $\mathbf{w}$ as follows

$$\nabla_{\mathbf{w}} F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta}) = \mathbf{w} + \frac{1}{n\lambda}\sum_{i=1}^{n} \frac{-\bar{\mathbf{h}}_i\, e^{\tau\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)}}{1 + e^{\tau\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)}}. \quad (32)$$

Using (32), we update $\mathbf{w}$ as

$$\mathbf{w}_{k+1} = \mathbf{w}_k - \mu \nabla_{\mathbf{w}} F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta})\Big|_{\mathbf{w} = \mathbf{w}_k,\, \rho = \rho_k,\, \boldsymbol{\theta} = \boldsymbol{\theta}_k}, \quad (33)$$

where the subscript $k$ indicates the value of a parameter at the $k$th iteration. Similarly, we calculate the derivative of the objective function with respect to $\rho$ as follows

$$\frac{\partial F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta})}{\partial \rho} = \frac{1}{n\lambda}\sum_{i=1}^{n} \frac{e^{\tau\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)}}{1 + e^{\tau\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)}} - 1. \quad (34)$$

Using (34), we update $\rho$ as

$$\rho_{k+1} = \rho_k - \mu \frac{\partial F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta})}{\partial \rho}\Big|_{\mathbf{w} = \mathbf{w}_k,\, \rho = \rho_k,\, \boldsymbol{\theta} = \boldsymbol{\theta}_k}. \quad (35)$$
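One joint gradient step for $\mathbf{w}$ and $\rho$, i.e., (32)-(35) with the LSTM outputs $\bar{\mathbf{h}}_i$ held fixed, can be sketched as follows. This is our own illustrative code (the name `sgd_step` and the row-stacked layout of `Hbar` are assumptions), not the paper's implementation.

```python
import numpy as np

def sgd_step(w, rho, Hbar, lam, tau, mu):
    """One step of eqs. (33) and (35) on F_tau, with the LSTM outputs
    Hbar (n x m, rows are h_bar_i) held fixed.  lam is the regularization
    parameter lambda, tau the smoothing parameter, mu the learning rate."""
    n = Hbar.shape[0]
    beta = rho - Hbar @ w                       # beta_{w,rho}(h_bar_i), all i
    sig = 1.0 / (1.0 + np.exp(-tau * beta))     # e^{tau*beta} / (1 + e^{tau*beta})
    grad_w = w - (Hbar * sig[:, None]).sum(axis=0) / (n * lam)   # eq. (32)
    grad_rho = sig.sum() / (n * lam) - 1.0                       # eq. (34)
    return w - mu * grad_w, rho - mu * grad_rho
```

Since $F_\tau$ is convex in $(\mathbf{w}, \rho)$ for fixed $\boldsymbol{\theta}$ (see Theorem 1), repeated steps with a small enough $\mu$ drive the objective down monotonically.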

For the LSTM parameters, we use the method for optimization with orthogonality constraints in [26] due to (31). To update each element of $\mathbf{W}^{(\cdot)}$, we calculate the gradient of the objective function as

$$\frac{\partial F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta})}{\partial \mathbf{W}^{(\cdot)}_{ij}} = \frac{1}{n\lambda}\sum_{l=1}^{n} \frac{-\mathbf{w}^T \big(\partial \bar{\mathbf{h}}_l / \partial \mathbf{W}^{(\cdot)}_{ij}\big)\, e^{\tau\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_l)}}{1 + e^{\tau\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_l)}}. \quad (36)$$

We then update $\mathbf{W}^{(\cdot)}$ using (36) as

$$\mathbf{W}^{(\cdot)}_{k+1} = \Big(\mathbf{I} + \frac{\mu}{2}\mathbf{B}_k\Big)^{-1}\Big(\mathbf{I} - \frac{\mu}{2}\mathbf{B}_k\Big)\mathbf{W}^{(\cdot)}_k, \quad (37)$$

where $\mathbf{B}_k \triangleq \mathbf{M}_k \mathbf{W}^{(\cdot)T}_k - \mathbf{W}^{(\cdot)}_k \mathbf{M}_k^T$ and

$$\mathbf{M}_{ij} \triangleq \frac{\partial F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta})}{\partial \mathbf{W}^{(\cdot)}_{ij}}. \quad (38)$$
###### Remark 4.

For $\mathbf{R}^{(\cdot)}$ and $\mathbf{b}^{(\cdot)}$, we first compute the gradient of the objective function with respect to the chosen parameter as in (38). We then obtain $\mathbf{B}_k$ according to the chosen parameter. Using $\mathbf{B}_k$, we update the chosen parameter as in (37).

###### Remark 5.

In the semi-supervised framework, we have the following optimization problem for our SVM based algorithms [29]

$$\min_{\boldsymbol{\theta},\, \mathbf{w},\, \rho,\, \boldsymbol{\eta},\, \boldsymbol{\xi},\, \boldsymbol{\gamma}} \ \frac{\|\mathbf{w}\|_2^2}{2} + C\Big(\sum_{i=1}^{l}\eta_i + \sum_{j=l+1}^{l+k}\min(\xi_j, \gamma_j)\Big) \quad (39)$$
$$\text{s.t.: } y_i(\mathbf{w}^T \bar{\mathbf{h}}_i + \rho) \ge 1 - \eta_i, \ \eta_i \ge 0, \ i = 1, \ldots, l \quad (40)$$
$$\mathbf{w}^T \bar{\mathbf{h}}_j - \rho \ge 1 - \xi_j, \ \xi_j \ge 0, \ j = l+1, \ldots, l+k \quad (41)$$
$$-\mathbf{w}^T \bar{\mathbf{h}}_j + \rho \ge 1 - \gamma_j, \ \gamma_j \ge 0, \ j = l+1, \ldots, l+k \quad (42)$$
$$\mathbf{W}^{(\cdot)T} \mathbf{W}^{(\cdot)} = \mathbf{I},\ \mathbf{R}^{(\cdot)T} \mathbf{R}^{(\cdot)} = \mathbf{I} \ \text{ and } \ \mathbf{b}^{(\cdot)T} \mathbf{b}^{(\cdot)} = 1, \quad (43)$$

where $\eta_i$, $\xi_j$ and $\gamma_j$ are slack variables, $C$ is a trade-off parameter, $l$ and $k$ are the numbers of labeled and unlabeled data instances, respectively, and $y_i \in \{-1, +1\}$ represents the label of the $i$th data instance.

For the quadratic programming based training method, we modify all the steps from (12) to (25) with respect to (39)-(43). In a similar manner, we modify the equations from (26) to (38) according to (39)-(43) in order to obtain the gradient based training method in the semi-supervised framework. For the fully supervised implementation, we follow the same procedure as the semi-supervised implementation for the $k = 0$ case, i.e., when all the data instances are labeled.

Hence, we complete the required updates for each parameter. The complete algorithm is also provided in Algorithm 2 as a pseudocode. Moreover, we illustrate the convergence of our approximation (29) to (26) in Proposition 1. Using Proposition 1, we then demonstrate the convergence of the optimal values for our objective function (30) to the optimal values of the actual SVM objective function (27) in Theorem 1.

###### Proposition 1.

As $\tau$ increases, $S_\tau(\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i))$ uniformly converges to $G(\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i))$. As a consequence, our approximation $F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta})$ converges to the SVM objective function $F(\mathbf{w}, \rho, \boldsymbol{\theta})$, defined as

$$F(\mathbf{w}, \rho, \boldsymbol{\theta}) \triangleq \frac{\|\mathbf{w}\|_2^2}{2} + \frac{1}{n\lambda}\sum_{i=1}^{n} G(\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)) - \rho.$$
###### Proof of Proposition 1.

In order to simplify our notation, for any given $\mathbf{w}$, $\rho$ and $\bar{\mathbf{h}}_i$, we denote $\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)$ as $\Omega$. We first show that $S_\tau(\Omega) \ge G(\Omega)$, $\forall \Omega$. Since

$$S_\tau(\Omega) = \frac{1}{\tau}\log(1 + e^{\tau\Omega}) \ge \frac{1}{\tau}\log(e^{\tau\Omega}) = \Omega$$

and $S_\tau(\Omega) \ge \frac{1}{\tau}\log(1) = 0$, we have $S_\tau(\Omega) \ge \max\{0, \Omega\} = G(\Omega)$. Then, for any $\Omega > 0$, we have

$$\frac{\partial S_\tau(\Omega)}{\partial \tau} = -\frac{1}{\tau^2}\log(1 + e^{\tau\Omega}) + \frac{1}{\tau}\frac{\Omega e^{\tau\Omega}}{1 + e^{\tau\Omega}} < -\frac{\Omega}{\tau} + \frac{1}{\tau}\frac{\Omega e^{\tau\Omega}}{1 + e^{\tau\Omega}} \le 0$$

and for any $\Omega \le 0$, we have

$$\frac{\partial S_\tau(\Omega)}{\partial \tau} = -\frac{1}{\tau^2}\log(1 + e^{\tau\Omega}) + \frac{1}{\tau}\frac{\Omega e^{\tau\Omega}}{1 + e^{\tau\Omega}} < 0,$$

thus, we conclude that $S_\tau(\Omega)$ is a monotonically decreasing function of $\tau$. As the last step, we derive an upper bound for the difference $S_\tau(\Omega) - G(\Omega)$. For $\Omega > 0$, the derivative of the difference is as follows

$$\frac{\partial \big(S_\tau(\Omega) - G(\Omega)\big)}{\partial \Omega} = \frac{e^{\tau\Omega}}{1 + e^{\tau\Omega}} - 1 < 0,$$

hence, the difference is a decreasing function of $\Omega$ for $\Omega > 0$. Therefore, the maximum value is $S_\tau(0) - G(0) = \log(2)/\tau$ and it occurs at $\Omega = 0$. Similarly, for $\Omega < 0$, the derivative of the difference is positive, which shows that the maximum of the difference again occurs at $\Omega = 0$. With this result, we obtain the following bound

$$\frac{\log(2)}{\tau} = \max_{\Omega}\big(S_\tau(\Omega) - G(\Omega)\big). \quad (44)$$

Using (44), for any $\epsilon > 0$, we can choose a sufficiently large $\tau$ so that $S_\tau(\Omega) - G(\Omega) \le \epsilon$, $\forall \Omega$. Hence, as $\tau$ increases, $S_\tau(\Omega)$ uniformly converges to $G(\Omega)$. By averaging (44) over all the data points and multiplying by $1/\lambda$, we obtain

$$\frac{\log(2)}{\lambda\tau} = \max_{\mathbf{w},\, \rho,\, \boldsymbol{\theta}}\big(F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta}) - F(\mathbf{w}, \rho, \boldsymbol{\theta})\big),$$

which proves the uniform convergence of $F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta})$ to $F(\mathbf{w}, \rho, \boldsymbol{\theta})$. ∎

###### Theorem 1.

Let $\mathbf{w}_\tau$ and $\rho_\tau$ be the solutions of (30) for any fixed $\boldsymbol{\theta}$ and $\tau$. Then, $\mathbf{w}_\tau$ and $\rho_\tau$ are unique and $F_\tau(\mathbf{w}_\tau, \rho_\tau, \boldsymbol{\theta})$ converges to the minimum of $F(\mathbf{w}, \rho, \boldsymbol{\theta})$.

###### Proof of Theorem 1.

We have the following Hessian matrix of $F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta})$ with respect to $\mathbf{w}$

$$\nabla^2_{\mathbf{w}} F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta}) = \mathbf{I} + \frac{\tau}{n\lambda}\sum_{i=1}^{n} \frac{e^{\tau\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)}}{\big(1 + e^{\tau\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)}\big)^2}\, \bar{\mathbf{h}}_i \bar{\mathbf{h}}_i^T,$$

which satisfies $\mathbf{u}^T \nabla^2_{\mathbf{w}} F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta})\, \mathbf{u} > 0$ for any nonzero column vector $\mathbf{u}$. Hence, the Hessian matrix is positive definite, which shows that $F_\tau$ is a strictly convex function of $\mathbf{w}$. Consequently, the solution $\mathbf{w}_\tau$ is both global and unique given any $\rho$ and $\boldsymbol{\theta}$. Additionally, we have the following second order derivative with respect to $\rho$

$$\frac{\partial^2 F_\tau(\mathbf{w}, \rho, \boldsymbol{\theta})}{\partial \rho^2} = \frac{\tau}{n\lambda}\sum_{i=1}^{n} \frac{e^{\tau\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)}}{\big(1 + e^{\tau\beta_{\mathbf{w},\rho}(\bar{\mathbf{h}}_i)}\big)^2} > 0,$$

which implies that $F_\tau$ is a strictly convex function of $\rho$. As a result, the solution $\rho_\tau$ is both global and unique for any given $\mathbf{w}$ and $\boldsymbol{\theta}$.

Let $\mathbf{w}^*$ and $\rho^*$ be the solutions of (27) for any fixed $\boldsymbol{\theta}$. From the proof of Proposition 1, we have

$$F_\tau(\mathbf{w}^*, \rho^*, \boldsymbol{\theta}) \ge F_\tau(\mathbf{w}_\tau, \rho_\tau, \boldsymbol{\theta}) \ge F(\mathbf{w}_\tau, \rho_\tau, \boldsymbol{\theta}) \ge F(\mathbf{w}^*, \rho^*, \boldsymbol{\theta}). \quad (45)$$

Using the convergence result in Proposition 1 and (45), we have

$$\lim_{\tau \to \infty} F_\tau(\mathbf{w}_\tau, \rho_\tau, \boldsymbol{\theta}) \le \lim_{\tau \to \infty} F_\tau(\mathbf{w}^*, \rho^*, \boldsymbol{\theta}) = F(\mathbf{w}^*, \rho^*, \boldsymbol{\theta})$$
$$\lim_{\tau \to \infty} F_\tau(\mathbf{w}_\tau, \rho_\tau, \boldsymbol{\theta}) \ge F(\mathbf{w}^*, \rho^*, \boldsymbol{\theta}),$$

which proves the following equality

$$\lim_{\tau \to \infty} F_\tau(\mathbf{w}_\tau, \rho_\tau, \boldsymbol{\theta}) = F(\mathbf{w}^*, \rho^*, \boldsymbol{\theta}). \ \blacksquare$$

### III-B Anomaly Detection with the SVDD Algorithm

In this subsection, we introduce an anomaly detection algorithm based on the SVDD formulation and provide the joint updates in order to learn both the LSTM and SVDD parameters. However, since the generic formulation is the same as in the OC-SVM case, we only provide the required distinct parameter updates and the proof of convergence of the approximated SVDD formulation to the actual one.

In the SVDD algorithm, we aim to find a hypersphere that encloses the normal data while leaving the anomalous data outside the hypersphere [7]. For the fixed length representations $\bar{\mathbf{h}}_i$, we have the following SVDD optimization problem [7]

$$\min_{\boldsymbol{\theta} \in \mathbb{R}^{n_\theta},\, \tilde{\mathbf{c}} \in \mathbb{R}^m,\, \boldsymbol{\xi} \in \mathbb{R}^n,\, R \in \mathbb{R}} \ R^2 + \frac{1}{n\lambda}\sum_{i=1}^{n}\xi_i \quad (46)$$
$$\text{subject to: } \|\bar{\mathbf{h}}_i - \tilde{\mathbf{c}}\|_2^2 - R^2 \le \xi_i, \ \xi_i \ge 0, \ \forall i \quad (47)$$
$$\mathbf{W}^{(\cdot)T} \mathbf{W}^{(\cdot)} = \mathbf{I},\ \mathbf{R}^{(\cdot)T} \mathbf{R}^{(\cdot)} = \mathbf{I} \ \text{ and } \ \mathbf{b}^{(\cdot)T} \mathbf{b}^{(\cdot)} = 1,$$