The SWAG Algorithm; a Mathematical Approach that Outperforms Traditional Deep Learning. Theory and Implementation

Saeid Safaei, Vahid Safaei, Solmazi Safaei, Zerotti Woods,
Hamid R. Arabnia*, Juan B. Gutierrez*
Department of Computer Science, University of Georgia
Mechanical Engineering, Yasouj University
Department of Mathematics, University of Georgia
Institute of Bioinformatics, University of Georgia
{ssa,zerotti.woods25,hra,jgutierr}@uga.edu
* Joint corresponding authors.
Abstract

The performance of artificial neural networks (ANNs) is influenced by weight initialization, the nature of the activation functions, and the network architecture. There is a wide range of activation functions traditionally used to train a neural network, e.g. sigmoid, tanh, and the Rectified Linear Unit (ReLU). A widespread practice is to use the same type of activation function for all neurons in a given layer. In this manuscript, we present a type of neural network in which the activation functions in every layer form a polynomial basis; we name this method SWAG after the initials of the last names of the authors. We tested SWAG on three complex, highly non-linear functions as well as the MNIST handwriting data set. SWAG outperforms, and converges faster than, state-of-the-art fully connected neural networks. Given the low computational complexity of SWAG, and the fact that it was capable of solving problems current architectures cannot, it has the potential to change the way we approach deep learning.

1 Introduction

Deep learning allows computational models composed of multiple processing layers to learn very abstract representations of data [LBH15]. There have been reports of many successes using deep neural networks (DNNs) in areas such as computer vision, speech recognition, language processing, drug discovery, genomics, and a host of other areas [JSA15]. DNNs have allowed us to solve difficult problems and have motivated extensive work to understand their theoretical properties [HDR18].

Training a DNN effectively is a complicated task and has been proven to be an NP-Complete problem [BR89]. Features such as weight initialization, the nature of the activation functions, and the network architecture can affect the training process of a neural network [Sch15, HDR18, RZL18]. In particular, some choices of activation functions or network architectures can cause loss of information or may increase the amount of time needed to train a DNN [HDR18, ZL18, CLP16, LTR17].

The question of how to effectively find the best set of nonlinear activation functions is challenging [CLP16]. Some of the well-known nonlinear activation functions are:

$\mathrm{sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$   (1)
$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$   (2)
$\mathrm{ReLU}(x) = \max(0, x)$   (3)

The activation function in equation (3), the Rectified Linear Unit (ReLU), is the most popular and widely used activation function; while some hand-designed activation functions have been introduced to replace ReLU, none has gained the popularity that ReLU has [MHN13, CUH15, KUMH17, HSM00, JKL09, NH10].

Trainable nonlinear activation functions have been proposed in [CLP16, TGY04]. Chung et al. [CLP16] used Taylor series approximations of standard activation functions as the initialization point for their activation functions, and trained the coefficients of the Taylor series approximation to optimize training. This implementation used the same polynomial function on each neuron of a given layer. The results were comparable to the state of the art.

In this manuscript, we present a type of neural network in which the activation functions in every layer form a polynomial basis, i.e. groups of neurons are assigned to unique monomials in a given layer. We also propose a new architecture in which we vertically concatenate many fully connected layers to form one layer, which makes computation more efficient. We do not train the activation functions; they are fixed and form a polynomial basis. The structure of the hidden layers follows the pattern of: (i) a layer with polynomials as the activation functions, and (ii) a layer with a linear activation function.

The remainder of this manuscript is organized as follows: Section 2 describes the mathematical foundations and the architecture of SWAG, Section 3 describes the experiments that were conducted, and Section 4 discusses results and future work.

2 Methods

2.1 Representation of functions with a basis

Suppose that we have a data set $X = \{x_i\}_{i=1}^{N}$, with $x_i \in \mathbb{R}^{m}$, and labels $Y = \{y_i\}_{i=1}^{N}$ corresponding to our data set. We would like to find a function $f$ such that $f(x_i) = y_i$ for all $i$. The Stone-Weierstrass approximation theorem states that any continuous real-valued function on a compact set can be uniformly approximated by a polynomial. Formally:

Theorem 2.1 (Stone-Weierstrass Approximation Theorem).

Suppose $f$ is a continuous real-valued function defined on a closed and bounded subset $K \subset \mathbb{R}^{m}$ for any $m \geq 1$. For every $\epsilon > 0$, there exists a polynomial $p$ such that $|f(x) - p(x)| < \epsilon$ for every $x \in K$.
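As a quick numerical illustration of the theorem (our own toy example, not taken from the paper), fitting polynomials of increasing degree to a continuous function on the compact set [0, 1] drives the maximum error over a dense grid toward zero:

    import numpy as np

    f = lambda x: np.exp(np.sin(3 * x))   # any continuous function works here
    x = np.linspace(0, 1, 1000)           # dense grid on the compact set K = [0, 1]
    for degree in (2, 5, 10, 15):
        # Least-squares polynomial fit of the given degree.
        p = np.polynomial.Polynomial.fit(x, f(x), degree)
        # The maximum error over the grid shrinks as the degree grows.
        print(degree, np.max(np.abs(f(x) - p(x))))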

The simplicity of polynomial systems makes them very attractive both analytically and computationally: they are easy to form and have well-understood properties. However, the use of polynomials of a given degree as activation functions for all neurons in a single layer is mathematically discouraged in the traditional neural network setting, because such networks are not universal approximators. In particular, Leshno et al. (1993) [LLPS93] proved the following theorem:

Theorem 2.2.

Let $M$ be the set of functions in $L^{\infty}_{\mathrm{loc}}(\mathbb{R})$ with the property that the closure of the set of points of discontinuity of any function in $M$ has zero Lebesgue measure. Let $\sigma \in M$. Then, for a fixed $n$,

$$\Sigma_n = \mathrm{span}\{\, \sigma(w \cdot x + b) : w \in \mathbb{R}^{n},\ b \in \mathbb{R} \,\}$$

is dense in $C(\mathbb{R}^{n})$ if and only if $\sigma$ is not an algebraic polynomial (a.e.).

This theorem implies that fully connected feedforward neural networks with a sufficient number of neurons are universal approximators if and only if the activation functions are not polynomials. We note that in this traditional setting it is assumed that the activation function is the same for every neuron in a given layer. We now give the following extension of the Stone-Weierstrass approximation theorem.

Corollary 2.3.

Let $\sigma_i(x) = x^{i}$ for $i \in \{0, 1, 2, \dots\}$. Then

$$\mathrm{span}\{\, \sigma_i : i = 0, 1, 2, \dots \,\}$$

is dense in $C(K)$, where $K \subset \mathbb{R}$ is a compact set.

Proof.

Notice that $\{\sigma_i\}_{i \geq 0}$ is a basis for the vector space of polynomials over $\mathbb{R}$. Since polynomials are dense in $C(K)$ by the Stone-Weierstrass approximation theorem, the result follows. ∎

This corollary implies that if we allow a diverse set of polynomial activation functions in a particular layer, we still retain the universal approximation capabilities of feedforward neural networks. Using the same framework as Leshno et al. (1993) [LLPS93], in which the output was assumed to be in $\mathbb{R}$, an extension to higher dimensions can be obtained by re-defining $\sigma_i$ as a pointwise operation that takes each element of a vector $x$ and raises it to the $i$-th power, e.g. given $x = (x_1, x_2)$, then $\sigma_2(x) = (x_1^{2}, x_2^{2})$.
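For instance, this pointwise operation can be written in a couple of lines (a toy illustration, not part of the SWAG code):

    import numpy as np

    def sigma(i, x):
        # Raise every component of x to the i-th power.
        return np.power(x, i)

    x = np.array([0.5, 0.2, 0.9])
    print(sigma(2, x))   # [0.25 0.04 0.81]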

2.2 Architecture of the SWAG Algorithm

Let $x \in \mathbb{R}^{m}$ be a data point in our data set $X$.

  • Normalize data to be in the interval [0,1].

  • Create the first polynomial layer as follows:

    • Choose $n$, the number of polynomial terms to use ($n$ is a hyperparameter of the model).

    • Choose $k$, the number of neurons that correspond to each monomial of the layer ($k$ is a hyperparameter of the model).

    • Create $n$ fully-connected layers with $k$ neurons in each layer, all with $x$ as their input.

      • The $i$-th fully-connected layer is defined by $\sigma_i(W_i x)$ for $i = 1, \dots, n$, $W_i \in \mathbb{R}^{k \times m}$, and $\sigma_i$ as defined above.

      • Weights are initialized randomly, drawn from $\mathcal{N}(0, 1)$, a Gaussian distribution with mean 0 and standard deviation 1.

    • Vertically concatenate the $n$ layers to form a vector of length $nk$.

  • Create a layer with a linear activation function. This is considered the second layer of SWAG.

  • To add a third and fourth layer, repeat the structure of the previous two layers, with the input of the third layer being the output of the second layer. If a third and fourth layer are added, then the first dimension of the matrix used in the second layer is a hyperparameter of the model.

  • Continue to add layers in this pattern as desired.

  • The matrix used for the final linear activation layer will have its first dimension be the dimension of the output vector.

Figure 1 is a diagram of an example of SWAG using two layers, and Figure 2 is a diagram of an example of SWAG with four layers. A minimal code sketch of this construction is given below.

Figure 1: Implementation of the SWAG architecture with three groups of monomials of powers 1 through 3, and two layers
Figure 2: Implementation of the SWAG architecture with three groups of monomials of powers 1 through 3, and four layers
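
To make the construction concrete, the following is a minimal sketch of a SWAG block in Keras. It is not the authors' released implementation (available at the GitHub link in the appendix); the function name swag_block and the values of n, k, and the layer widths below are illustrative choices.

    import tensorflow as tf
    from tensorflow.keras import layers, Model, initializers

    def swag_block(inputs, n, k, out_dim):
        # One fully connected group per monomial; the "activation" raises each
        # pre-activation elementwise to the i-th power (sigma_i in the text).
        init = initializers.RandomNormal(mean=0.0, stddev=1.0)
        groups = []
        for i in range(1, n + 1):
            h = layers.Dense(k, kernel_initializer=init)(inputs)
            groups.append(layers.Lambda(lambda t, p=i: t ** p)(h))
        # Vertically concatenate the n groups into a vector of length n*k,
        # then apply the layer with a linear activation function.
        concat = layers.Concatenate()(groups)
        return layers.Dense(out_dim, activation='linear')(concat)

    # Example: a four-layer SWAG model for scalar regression
    # (n = 5, k = 50, and the width 50 of the second layer are illustrative).
    x_in = layers.Input(shape=(1,))
    h = swag_block(x_in, n=5, k=50, out_dim=50)   # layers 1 and 2
    y_out = swag_block(h, n=5, k=50, out_dim=1)   # layers 3 and 4
    model = Model(x_in, y_out)
    model.compile(loss='mean_squared_error', optimizer='adam')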

3 Results

3.1 Representation of Non-Linear Functions

To test our model, we generated a random data set for training and a separate random data set for testing. We selected three functions for which traditional DNNs either do not converge at all, or require a number of epochs orders of magnitude larger than SWAG to converge.

(4)
(5)
(6)

We trained five traditional DNNs of various architectures (code in appendix). We also trained SWAG with four layers; the first dimension of the second layer in this implementation was fixed as a hyperparameter. We used the standard mean squared error loss function with the Adam optimizer [KB14] to assess model accuracy.
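
A sketch of this training setup is shown below. The target function g is a stand-in (the functions actually used are those in equations (4)-(6)), the data set sizes are illustrative, and model refers to the four-layer SWAG sketch from Section 2.2; the epoch count and batch size mirror the experiments and the appendix baselines.

    import numpy as np

    def g(x):
        # Stand-in nonlinear target; the experiments use equations (4)-(6).
        return np.sin(20 * x) * np.exp(-x)

    train_x = np.random.uniform(0, 1, size=(10000, 1))   # sizes are illustrative
    test_x  = np.random.uniform(0, 1, size=(2000, 1))
    train_y, test_y = g(train_x), g(test_x)

    # `model` is the four-layer SWAG sketch from Section 2.2, already
    # compiled with mean squared error and the Adam optimizer.
    model.fit(train_x, train_y, epochs=50, batch_size=10,
              validation_data=(test_x, test_y))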

We conducted a first experiment, with results shown in Figure 3. SWAG was the only model whose cost function converged to zero after 50 epochs of training. Figure 4 gives a visual representation of how the different architectures reconstructed the target function.

We conducted a second experiment in which the test and training sets have almost the same number of points.

We repeated the process of the first experiment to train and test the various models. The results of the two experiments are shown in Figures 5-14 in the appendix.

Figure 3: Experiment 1 shape
Figure 4: Experiment 1 shape

3.2 MNIST Handwriting Data Set

For our final experiment, we ran SWAG on the MNIST handwriting data set [LC10]. The data set is composed of a total of 70,000 images, each a unique handwriting sample of a digit from 0 to 9. We flattened these images into vectors of size 784 and used them as inputs to a traditional DNN as well as to SWAG. The traditional DNN had three hidden layers: the first and second layers had 1024 neurons each, and the third layer had 10 neurons (code in appendix). For our implementation of SWAG, we used two layers. We used a training set of 60,000 images and a test set of 10,000 images. The traditional method reached a test accuracy of 97.67% after 4 epochs; SWAG achieved a test accuracy of 97.87% after 4 epochs. The results are shown in the appendix in Figure 15 and Figure 16.
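
A minimal sketch of the fully connected baseline for this MNIST experiment is given below; the relu/softmax activation choices and the batch size are our assumptions, since they are not fully specified in the text.

    from tensorflow.keras.datasets import mnist
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.utils import to_categorical

    (train_x, train_y), (test_x, test_y) = mnist.load_data()
    # Flatten the 28x28 images into vectors of size 784 and scale to [0, 1].
    train_x = train_x.reshape(-1, 784).astype('float32') / 255.0
    test_x  = test_x.reshape(-1, 784).astype('float32') / 255.0
    train_y, test_y = to_categorical(train_y, 10), to_categorical(test_y, 10)

    # Baseline: hidden layers of 1024, 1024, and 10 neurons, as described in the text.
    # The relu/softmax activations and the batch size are assumptions.
    model = Sequential([
        Dense(1024, activation='relu', input_dim=784),
        Dense(1024, activation='relu'),
        Dense(10, activation='softmax'),
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    model.fit(train_x, train_y, epochs=4, batch_size=128,
              validation_data=(test_x, test_y))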

4 Discussion

In this work, we introduced a set of activation functions and a new architecture, which we named SWAG. The first layer of our architecture has at least $n$ neurons, where $n$ is the degree of the polynomial used to estimate the function $f$ such that $f(x_i) = y_i$ for all $i$; this layer has $n$ different activation functions $\sigma_1, \dots, \sigma_n$. The second layer is a fully connected layer with a linear activation function. To add additional layers, the pattern of the first two layers is repeated. Using the backpropagation algorithm, we can find the set of weights that optimizes the predictions.

We created a random data set from highly complicated nonlinear functions. We evaluated the effectiveness of SWAG and found that it approximated the functions better than traditional deep learning methods; it also converged faster. Finally, we tested SWAG on the MNIST handwriting data set. Our method replicated the state of the art for fully connected architectures while converging in only 4 epochs.

We note that many basis sets are able to estimate a function with arbitrary accuracy. In future work, it will be important to compare the performance of different basis sets and function approximations to determine which has better performance in specific situations. Our conjecture is that orthogonal bases will provide an advantage in some cases. Another interesting question is how to set the initial weights of the system more effectively. We believe that initializing from a Taylor estimation of our data set will increase the performance of SWAG.

In addition, we find the question of how to implement this architecture in convolutional and recursive neural networks especially interesting. Convolutional neural networks have surpassed the accuracy achieved with fully connected neural networks on the MNIST data set, and also reduce the number of parameters needed relative to a fully connected neural network. We also reduced the number of parameters in a fully connected network, but we were not able to surpass the state of the art in convolutional neural networks with our current implementation. We hypothesize that implementing the SWAG framework in convolutional and recursive neural networks will allow us to further reduce the number of parameters, make our model converge even faster, and obtain better accuracy than is currently possible.

References

  • [BR89] Avrim Blum and Ronald L. Rivest. Training a 3-Node Neural Network is NP-Complete. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 494–501. Morgan-Kaufmann, 1989.
  • [CLP16] Hoon Chung, Sung Joo Lee, and Jeon Gue Park. Deep neural network using trainable activation functions. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 348–352. IEEE, 2016.
  • [CUH15] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
  • [HDR18] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. On the selection of initialization and activation function for deep neural networks. arXiv preprint arXiv:1805.08266, 2018.
  • [HSM00] Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and H Sebastian Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947, 2000.
  • [JKL09] Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on, pages 2146–2153. IEEE, 2009.
  • [JSA15] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods. arXiv:1506.08473 [cs, stat], June 2015. arXiv: 1506.08473.
  • [KB14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [KUMH17] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 971–980, 2017.
  • [LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
  • [LC10] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
  • [LLPS93] Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, January 1993.
  • [LTR17] Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.
  • [MHN13] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 30, page 3, 2013.
  • [NH10] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
  • [RZL18] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2018.
  • [Sch15] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.
  • [TGY04] Fevzullah Temurtas, Ali Gulbag, and Nejat Yumusak. A study on neural networks using taylor series expansion of sigmoid activation function. In International Conference on Computational Science and Its Applications, pages 389–397. Springer, 2004.
  • [ZL18] Guoqiang Zhang and Haopeng Li. Effectiveness of scaled exponentially-regularized linear units (serlus). arXiv preprint arXiv:1807.10117, 2018.

Appendix A Appendix

Figure 5: Experiment 1 loss
Figure 6: Experiment 1 shape
Figure 7: Experiment 1 loss
Figure 8: Experiment 1 shape
Figure 9: Experiment 2 loss
Figure 10: Experiment 2 shape
Figure 11: Experiment 2 loss
Figure 12: Experiment 2 shape
Figure 13: Experiment 2 loss
Figure 14: Experiment 2 shape
Figure 15: Traditional Deep Learning on MNIST Data Set. Test loss: 0.08366 Test accuracy: 0.9767
Figure 16: SWAG on MNIST Data Set. Test loss: 0.07297 Test accuracy: 0.9787

A.1 Source Code

All the source code is available at the following link: https://github.com/DeepLearningSaeid/New-Type-of-Deep-Learning/

In [23]:
        # Imports required by these snippets (added for completeness).
        from keras.models import Sequential
        from keras.layers import Dense, Dropout

        model = Sequential()
        model.add(Dense(10, input_dim=input_dim, activation='relu'))
        model.add(Dense(20, activation='sigmoid'))
        model.add(Dense(30, activation='tanh'))
        model.add(Dense(20, activation='relu'))
        model.add(Dense(15, activation='sigmoid'))
        model.add(Dense(25, activation='relu'))
        model.add(Dense(10, activation='relu'))
        model.add(Dense(output_dim, activation='tanh'))
        model.add(Dropout(0.2))
        model.summary()
        model.compile(loss='mean_squared_error', optimizer='adam')
        model.fit(train_x, train_y, epochs=number_epo, verbose=0, batch_size=10,
                  validation_data=(test_x, test_y))

Layer (type)                 Output Shape              Param #
=================================================================
dense_132 (Dense)            (None, 10)                20
dense_133 (Dense)            (None, 20)                220
dense_134 (Dense)            (None, 30)                630
dense_135 (Dense)            (None, 20)                620
dense_136 (Dense)            (None, 15)                315
dense_137 (Dense)            (None, 25)                400
dense_138 (Dense)            (None, 10)                260
dense_139 (Dense)            (None, 1)                 11
dropout_17 (Dropout)         (None, 1)                 0
=================================================================
Total params: 2,476
Trainable params: 2,476
Non-trainable params: 0
Run Time: 15.135070


In [24]:
        model = Sequential()
        model.add(Dense(5, input_dim=input_dim, activation='relu'))
        model.add(Dense(10, activation='relu'))
        model.add(Dense(50, activation='tanh'))
        model.add(Dense(18, activation='relu'))
        model.add(Dense(15, activation='tanh'))
        model.add(Dense(18, activation='sigmoid'))
        model.add(Dropout(0.2))
        model.add(Dense(8, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(output_dim, activation='relu'))
        model.summary()
        model.compile(loss='mean_squared_error', optimizer='adam')
        model.fit(train_x, train_y, epochs=number_epo, verbose=0, batch_size=10,
                  validation_data=(test_x, test_y))

Layer (type)                 Output Shape              Param #
=================================================================
dense_140 (Dense)            (None, 5)                 10
dense_141 (Dense)            (None, 10)                60
dense_142 (Dense)            (None, 50)                550
dense_143 (Dense)            (None, 18)                918
dense_144 (Dense)            (None, 15)                285
dense_145 (Dense)            (None, 18)                288
dropout_18 (Dropout)         (None, 18)                0
dense_146 (Dense)            (None, 8)                 152
dropout_19 (Dropout)         (None, 8)                 0
dense_147 (Dense)            (None, 1)                 9
=================================================================
Total params: 2,272
Trainable params: 2,272
Non-trainable params: 0
Run Time: 14.923352


In [25]:
        model = Sequential()
        model.add(Dense(5, input_dim=input_dim, activation='relu'))
        model.add(Dense(10, activation='relu'))
        model.add(Dense(20, activation='tanh'))
        model.add(Dense(15, activation='relu'))
        model.add(Dense(25, activation='tanh'))
        model.add(Dense(20, activation='sigmoid'))
        model.add(Dense(25, activation='relu'))
        model.add(Dense(20, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(8, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(output_dim, activation='relu'))
        model.summary()
        model.compile(loss='mean_squared_error', optimizer='adam')
        model.fit(train_x, train_y, epochs=number_epo, verbose=0, batch_size=10,
                  validation_data=(test_x, test_y))

Layer (type)                 Output Shape              Param #
=================================================================
dense_148 (Dense)            (None, 5)                 10
dense_149 (Dense)            (None, 10)                60
dense_150 (Dense)            (None, 20)                220
dense_151 (Dense)            (None, 15)                315
dense_152 (Dense)            (None, 25)                400
dense_153 (Dense)            (None, 20)                520
dense_154 (Dense)            (None, 25)                525
dense_155 (Dense)            (None, 20)                520
dropout_20 (Dropout)         (None, 20)                0
dense_156 (Dense)            (None, 8)                 168
dropout_21 (Dropout)         (None, 8)                 0
dense_157 (Dense)            (None, 1)                 9
=================================================================
Total params: 2,747
Trainable params: 2,747
Non-trainable params: 0
Run Time: 16.343230


In [26]:
        model = Sequential()
        model.add(Dense(40, input_dim=input_dim, activation='relu'))
        model.add(Dense(25, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(output_dim, activation='relu'))
        model.add(Dropout(0.2))
        model.summary()
        model.compile(loss='mean_squared_error', optimizer='adam')
        model.fit(train_x, train_y, epochs=number_epo, verbose=0, batch_size=10,
                  validation_data=(test_x, test_y))

Layer (type)                 Output Shape              Param #
=================================================================
dense_158 (Dense)            (None, 40)                80
dense_159 (Dense)            (None, 25)                1025
dropout_22 (Dropout)         (None, 25)                0
dense_160 (Dense)            (None, 1)                 26
dropout_23 (Dropout)         (None, 1)                 0
=================================================================
Total params: 1,131
Trainable params: 1,131
Non-trainable params: 0
Run Time: 14.620049


In [27]:
        # 'soft_plus_te' is a custom activation (not a built-in Keras activation).
        model = Sequential()
        model.add(Dense(5, input_dim=input_dim, activation='soft_plus_te'))
        model.add(Dense(10, activation='soft_plus_te'))
        model.add(Dense(20, activation='tanh'))
        model.add(Dense(15, activation='relu'))
        model.add(Dense(25, activation='tanh'))
        model.add(Dense(20, activation='sigmoid'))
        model.add(Dense(25, activation='relu'))
        model.add(Dense(output_dim, activation='soft_plus_te'))
        model.add(Dropout(0.2))
        model.compile(loss='mean_squared_error', optimizer='adam')
        model.fit(train_x, train_y, epochs=number_epo, verbose=0, batch_size=10,
                  validation_data=(test_x, test_y))
        model.summary()

Layer (type)                 Output Shape              Param #
=================================================================
dense_161 (Dense)            (None, 5)                 10
dense_162 (Dense)            (None, 10)                60
dense_163 (Dense)            (None, 20)                220
dense_164 (Dense)            (None, 15)                315
dense_165 (Dense)            (None, 25)                400
dense_166 (Dense)            (None, 20)                520
dense_167 (Dense)            (None, 25)                525
dense_168 (Dense)            (None, 1)                 26
dropout_24 (Dropout)         (None, 1)                 0
=================================================================
Total params: 2,076
Trainable params: 2,076
Non-trainable params: 0
