Multiplierless and Sparse Machine Learning based on Margin Propagation Networks


Nazreen P.M., Department of Electronic Systems Engineering, Indian Institute of Science, Bangalore 560012
Shantanu Chakrabartty, Department of Electrical and Systems Engineering, Washington University in St. Louis, USA, 63130
Chetan Singh Thakur, Department of Electronic Systems Engineering, Indian Institute of Science, Bangalore 560012
Abstract

The new generation of machine learning processors has evolved from multi-core and parallel architectures (for example, graphics processing units) that were designed to efficiently implement matrix-vector multiplications (MVMs). This is because, at a fundamental level, neural network and machine learning operations make extensive use of MVMs, and hardware compilers exploit the inherent parallelism in MVM operations to achieve hardware acceleration on GPUs, TPUs and FPGAs. A natural question to ask is whether MVM operations are even necessary to implement ML algorithms and whether simpler hardware primitives can be used to implement an ultra-energy-efficient ML processor/architecture. In this paper we propose an alternate hardware-software codesign of ML and neural network architectures where, instead of MVM operations and non-linear activation functions, the architecture only uses simple addition and thresholding operations to implement inference and learning. At the core of the proposed approach is margin-propagation based computation, which maps multiplications into additions and additions into dynamic rectified-linear-unit (ReLU) operations. This mapping results in a significant reduction in computational cost and hence in energy cost. The training of a margin-propagation (MP) network involves optimizing a cost function that, in conjunction with the ReLU operations, leads to network sparsity and to weight updates that use only Boolean predicates. In this paper, we show how the MP network formulation can be applied to design linear classifiers, multi-layer perceptrons and support vector networks.

Margin Propagation, Low Power, Machine Learning, Multi-layer Perceptron, Support Vector Machine, Approximate Computing

1 Introduction

Reducing the energy footprint is one of the major goals in the design of current and future machine learning (ML) systems. This applies not only to deep-learning platforms that run on data servers, consuming megawatts of power [al2015efficient], but also to Internet-of-things and edge computing platforms that are highly energy-constrained [li2018learning]. Computation in most of these ML systems is highly regular and involves repeated use of matrix-vector multiplication (MVM) and non-linear activation and pooling operations. Therefore, current hardware compilers achieve performance acceleration and energy-efficiency by optimizing these fundamental operations on parallel hardware like graphics processing units (GPUs) or tensor processing units (TPUs). This mapping onto hardware accelerators can be viewed as a top-down approach, where the goal from the perspective of the hardware designer is to efficiently but faithfully map well-established ML algorithms without modifying the basic MVM or the activation functions. However, if the MVMs and the non-linear activation functions could be combined in a manner that makes the resulting architecture multiplier-less and based on much simpler computational primitives, then significant energy-efficiency could potentially be achieved at the system level. In this paper we argue that margin-propagation (MP) based computation can achieve this simplification by mapping multiplications into additions and additions into dynamic rectified-linear-unit (ReLU) operations.

Fig. 1: Hardware-software co-design using the margin-propagation design framework to map multiplications into additions, and additions into dynamic rectified-linear (ReLU) operations: (a) Learning in the conventional architecture using a loss function E, where parameter updates are estimated as the product of the gradient and the input; (b) Learning in the margin-propagation (MP) architecture, where parameter updates are just Boolean up/down flags with no products; (c) Mapping of a real-time learning architecture into the margin-propagation architecture, where parameter updates can be implemented using simple feedback paths.

The consequence of this mapping is a significant reduction in the complexity of inference and training, which in turn leads to a significant improvement in system energy-efficiency. To illustrate this, consider the simple example shown in Fig. 1(a) and (b), comprising a single training parameter w and a one-dimensional input x. In a conventional architecture, minimizing a loss function E(.) as in Fig. 1(a) results in a learning/parameter update step that requires modulating the gradient with the input. In the equivalent margin-propagation formulation, shown in Fig. 1(b), the absence of multiplication implies that each parameter update is independent, and the use of ReLU operations leads to learning updates that involve only Boolean predicates. Rather than modulating the gradient with the input (as shown in Fig. 1(a)), the new updates are based on comparing the sum of w and x with a dynamic threshold z, as shown in Fig. 1(b). This significantly simplifies the learning phase and the storage of the parameter w. This is illustrated in Fig. 1(c) using a single-layer network with three-dimensional inputs/parameters. The margin nodes not only implement the forward computation but also provide continuous feedback to update the parameters. For a digital implementation, this could be a simple up/down flag; for an analog implementation, this could be equivalent to charging or discharging a capacitor storing the values of $w_{11}$-$w_{13}$.

Margin propagation (MP) was originally proposed in [chakrabartty2004margin] and was subsequently used in [gu2009sparse, gu2012theory] in the context of approximate computing and the synthesis of piece-wise linear circuits. In [chakrabartty2007gini, chakrabartty2005sub, kucher2007energy, gu2009sparse, gu2012theory] the MP formulation was used to synthesize ML algorithms by replacing the MVM operation with simple addition and thresholding operations. However, in all the previous formulations, MP was used to approximate the log-sum-exp function, and any approximation error would propagate and accumulate as the size of the network increased. The formulation presented in this paper views MP as an independent computational paradigm, and the networks presented here are trained using the exact form of the MP function.

The paper is organized as follows: Section 2 discusses the margin propagation (MP) algorithm and compares its computational complexity with that of the traditional MVM. Section 3 presents the MP-based perceptron and its simulation results. Similarly, Sections 4 and 5 discuss the MP-based MLP and SVM, respectively, along with their simulation results. Section 6 concludes the paper.

A perceptron [freund1999large, bishop2006pattern] is a single-layer neural network used as a linear binary classifier, as shown in Fig. 2. Let the input vector to the perceptron be $\mathbf{x} = [x_0, x_1, \ldots, x_d]$, where $x_0$ is the bias input. The weighted sum of these inputs (including the bias) with the weights $w_i$ is computed and then fed into an activation function that maps the input into one of the two classes. To learn the perceptron weights, standard gradient descent can be used with the sum of squared errors as the cost function, as given below:

$E = \frac{1}{2}\sum_{n}\left(y_n - \hat{y}_n\right)^2 \qquad (1)$

where $y_n$ is the actual output for sample $n$ and $\hat{y}_n$ is the estimated output.

Fig. 2: Perceptron as a binary classifier
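For reference, the following Python/NumPy sketch illustrates this conventional perceptron: a weighted sum followed by a hard threshold, trained by gradient descent on the squared-error cost of eq. (1) (using the linear output for the gradient, as in the delta rule). The function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Weighted sum of the inputs plus bias, passed through a hard threshold."""
    return 1.0 if np.dot(w, x) + b > 0 else -1.0

def perceptron_train(X, y, lr=0.01, epochs=100):
    """Gradient descent on the squared-error cost of eq. (1)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            err = y_n - (np.dot(w, x_n) + b)   # (y_n - y_hat_n) term of eq. (1)
            w += lr * err * x_n                # note: the update multiplies the error by the input
            b += lr * err
    return w, b
```

The update `w += lr * err * x_n` is precisely the multiply-dependent step that the MP formulation of Section 3 replaces with Boolean up/down predicates.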

The support vector machine (SVM) is a supervised machine learning algorithm used mostly for classification problems [cortes1995support]. Given labeled training data, the SVM outputs an optimal hyperplane that categorizes any new test input into one of the classes. For a test input $\mathbf{x}$, the decision function of the SVM is given as

$f(\mathbf{x}) = \mathrm{sign}\left(\sum_{i} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right) \qquad (2)$

where $K(\cdot,\cdot)$ is the kernel function, $\mathbf{x}_i$ is the $i^{\mathrm{th}}$ support vector and $\mathbf{x}$ is the input sample.
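A minimal Python sketch of this decision function, assuming a Gaussian (RBF) kernel and pre-computed support vectors, coefficients `alpha` (absorbing the labels) and bias `b`; the names are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def rbf_kernel(x_i, x, gamma=1.0):
    """Gaussian (RBF) kernel between a support vector and a test input."""
    return np.exp(-gamma * np.sum((x_i - x) ** 2))

def svm_decision(x, support_vectors, alpha, b, gamma=1.0):
    """Kernel expansion as in eq. (2): sign of the weighted sum of kernel evaluations."""
    s = sum(a_i * rbf_kernel(sv, x, gamma) for a_i, sv in zip(alpha, support_vectors))
    return np.sign(s + b)
```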

Fig. 3: A three-layer multilayer perceptron (MLP) for a two-class problem, with a layer of hidden nodes fed by the inputs.

In order to learn complex functions, a group of perceptrons can be stacked in multiple layers [bishop2006pattern] to form a multilayer perceptron (MLP). A three-layer MLP for a two-class problem is shown in Fig. 3. The weighted sum of the input vector with the weights of the hidden layer forms the input to the activation functions in the hidden layer. In the figure, the bias inputs to the nodes in the hidden layer and the output layer are indicated; their weights are usually set to 1. Using the weights from the hidden layer to the output layer, the weighted sum of the hidden-layer outputs is computed and fed into the activation function of the final output node to obtain the output. The weights of such a feed-forward multilayer network are learned using the backpropagation algorithm, again with a squared-error cost function:

$E = \frac{1}{2}\sum_{n}\left(y_n - \hat{y}_n\right)^2 \qquad (3)$

where $\hat{y}_n$ is, in this case, the output of the final activation function for sample $n$.
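To make the conventional computation concrete, a three-layer MLP forward pass might look as follows; the sigmoid activation and the variable names are assumptions made for the example, not specifics from the paper.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def mlp_forward(x, W_hidden, b_hidden, w_out, b_out):
    """Three-layer MLP: two matrix-vector multiplications with non-linear activations."""
    h = sigmoid(W_hidden @ x + b_hidden)   # hidden layer: MVM followed by activation
    return sigmoid(w_out @ h + b_out)      # output node: MVM followed by activation
```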

2 Margin propagation computation and complexity

The MP algorithm is based on the reverse water-filling procedure [gu2009sparse, gu2012theory], as shown in Fig. 4. Given a set of scores $L_i$, $i = 1, \ldots, N$, the algorithm computes the normalization factor $z$ using the constraint

$\sum_{i=1}^{N}\left[L_i - z\right]_{+} = \gamma \qquad (4)$

where $[\cdot]_{+} = \max(\cdot, 0)$ is the rectification operation and $\gamma$ is the algorithm parameter.

Fig. 4: Reverse water-filling procedure

This is a recursive algorithm which computes $z$ such that the net balance of the scores in excess of $z$ equals $\gamma$ [gu2009sparse, gu2012theory]. Thus, given a set of input scores $L_i$, the factor $z$ can be obtained as

$z = \frac{\sum_{i \in \mathcal{S}} L_i - \gamma}{|\mathcal{S}|} \qquad (5)$

where $\mathcal{S} = \{\, i : L_i > z \,\}$ denotes the set of scores that exceed $z$.
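The following Python sketch shows one straightforward way to solve the constraint of eq. (4) for $z$, evaluating the closed form of eq. (5) over candidate active sets; it is an illustrative implementation, not the authors' code.

```python
import numpy as np

def margin_propagation(scores, gamma):
    """Compute z such that sum_i max(L_i - z, 0) = gamma (reverse water-filling, eq. 4)."""
    L = np.sort(np.asarray(scores, dtype=float))[::-1]   # scores in descending order
    cumsum = np.cumsum(L)
    for m in range(1, len(L) + 1):
        z = (cumsum[m - 1] - gamma) / m                   # eq. (5) using the top-m scores
        if m == len(L) or L[m] <= z:                      # top-m scores are exactly the active set
            return z
```

For example, `margin_propagation([1.0, 2.0, 4.0], gamma=1.0)` returns 3.0: only the largest score exceeds z, and 4.0 - 3.0 = 1.0 = γ, satisfying eq. (4).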

2.1 Complexity

As mentioned before, replacing the MVM operations in the perceptron, SVM and MLP with simple addition and thresholding operations in the log-likelihood domain using the MP algorithm, during both inference and learning, significantly reduces the complexity. If $d$ is the dimension of the input vector $\mathbf{x}$, then the overall complexity for an MVM operation,

$y = \sum_{i=1}^{d} w_i x_i \qquad (6)$

is

$C_{MVM} = d\, C_{mult} + (d-1)\, C_{add} \qquad (7)$

$\approx O(d\, n^2) + O(d\, n) \qquad (8)$

where $C_{MVM}$, $C_{mult}$ and $C_{add}$ denote the complexity of the MVM, a multiplication and an addition respectively, and $n$ is the number of digits.

whereas for the margin propagation algorithm

$z = \mathrm{MP}\left(\{\, x_i + w_i \,\}_{i=1}^{d},\; \gamma\right) \qquad (9)$

the overall complexity is given as,

$C_{MP} = d\, C_{add} + \alpha d\, C_{add} \qquad (10)$

$\approx O(d\, n) \qquad (11)$

where $\alpha$ is the sparsity factor of the thresholding operation, determined by $\gamma$. This also results in a significant improvement in energy cost, since the energy per multiplication is much higher than the energy per addition, as explored in [horowitz20141]: the reported energy and relative area costs of an 8-bit integer multiplication are several times those of an 8-bit addition, and the gap widens further at larger bit widths. The cost function, used in conjunction with the ReLU operation, ensures network sparsity.
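As a rough illustration of this comparison, the sketch below counts primitive operations for a d-dimensional dot product versus an MP evaluation over the same d scores. The counting model (one addition per score to form x_i + w_i, one comparison per score, and roughly α·d additions over the sparse active set) is an assumption made for the example.

```python
def mvm_op_count(d):
    """Operations for one d-dimensional dot product: d multiplies and d-1 adds."""
    return {"mult": d, "add": d - 1}

def mp_op_count(d, alpha):
    """Operations for one MP evaluation over d scores: d additions to form the
    scores (x_i + w_i), plus roughly alpha*d additions on the active set
    selected by gamma, and d comparisons; no multiplications."""
    return {"mult": 0, "add": d + int(alpha * d), "compare": d}

print(mvm_op_count(256))             # {'mult': 256, 'add': 255}
print(mp_op_count(256, alpha=0.2))   # {'mult': 0, 'add': 307, 'compare': 256}
```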

3 Perceptron using MP algorithm

A single-layer perceptron using the MP algorithm is shown in Fig. 5. We learn the network parameters by minimizing the cost function given in eq. (16). The inputs and weights are in the log-likelihood domain so that the network can be implemented using the MP algorithm, as described in [gu2012theory].

Fig. 5: Perceptron using margin propagation (MP) algorithm, as a binary classifier for linearly separable data.

3.1 Inference

Let the input vector to the perceptron in the log-likelihood domain be $\mathbf{x}$ and let $\mathbf{w}$ denote the learned weights.

From Fig. 5, the perceptron output in differential form is

(12)

For the output node,

(13)

where the normalization factor is estimated such that the reverse water-filling constraint of eq. (4) is satisfied, and the two components of the differential output are computed using the constraints

(14)
(15)

where $x_i$ is the input sample and $w_i$ is the corresponding weight in the log-likelihood domain.
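The following sketch shows MP-based inference for a single output node, assuming (consistent with the differential form above) that the weights are stored as positive and negative log-likelihood components, that each component of the output is the MP normalization factor of the sums x_i + w_i on its path, and that the class decision is the sign of their difference. It reuses the `margin_propagation` helper from the sketch in Section 2; the splitting into `w_pos`/`w_neg` is an illustrative assumption.

```python
import numpy as np

def mp_perceptron_infer(x, w_pos, w_neg, gamma):
    """Differential MP inference: additions and reverse water-filling only, no multiplies."""
    scores_pos = x + w_pos                            # additions in the log-likelihood domain
    scores_neg = x + w_neg
    z_pos = margin_propagation(scores_pos, gamma)     # eq. (4)/(5) on the positive path
    z_neg = margin_propagation(scores_neg, gamma)     # eq. (4)/(5) on the negative path
    y = z_pos - z_neg                                 # differential output
    return (1 if y > 0 else -1), y
```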

3.2 Training: evaluation of error-function derivatives

Considering a two-class problem, the error function can be written as

(16)

where the two target labels indicate, for each sample, membership in class 1 and class 2 respectively.

From eq. (16)

(17)

If $\mathbf{L}$ is the input to the MP algorithm such that $z = \mathrm{MP}(\mathbf{L}, \gamma)$, where $L_k$ denotes each element of $\mathbf{L}$, then

$\frac{\partial z}{\partial L_k} = \frac{1}{m}\,\mathbb{1}\left[L_k > z\right] \qquad (18)$

where $m$ indicates the number of $L_k$ such that $L_k > z$ and $\mathbb{1}[\cdot]$ is the indicator function. Also

(19)

Using equations (13), (14), (18) and (19)

(20)

Similarly using (13), (15), (18) and (19)

(21)

Substituting (20) and (21) in (17) we get,

Similarly,

(22)

where,

(23)
(24)

3.2.1 Derivatives with respect to bias

From eq. (16)

(25)

As

Using equations (13), (14), (18) and (19)

(26)

Similarly,

(27)

Using (13), (15), (18) and (19)

(28)

3.3 Parameter update rule

Using the error gradients obtained above, the weights and biases are updated during each iteration as follows:

(29)
(30)
(31)
(32)

where $\eta$ denotes the learning rate and $t$ indicates the iteration step.
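The sketch below illustrates the flavour of such an update for the MP perceptron, under the assumption (consistent with Fig. 1(b) and eq. (18)) that each weight's gradient reduces to a signed up/down step determined by Boolean predicates: whether the corresponding score x_i + w_i exceeds the normalization factor on its path, combined with the sign of the output error. The specific update form is an illustrative reconstruction, not the paper's exact eqs. (29)-(32); it reuses the `margin_propagation` helper from Section 2.

```python
import numpy as np

def mp_perceptron_update(x, w_pos, w_neg, y_target, gamma, lr=0.01):
    """One illustrative MP training step: updates depend only on Boolean
    predicates (score above/below z) and the sign of the output error."""
    z_pos = margin_propagation(x + w_pos, gamma)
    z_neg = margin_propagation(x + w_neg, gamma)
    err = y_target - (z_pos - z_neg)                  # output error

    active_pos = (x + w_pos) > z_pos                  # Boolean predicates, cf. eq. (18)
    active_neg = (x + w_neg) > z_neg

    # Up/down steps: no multiplication of the gradient by the input is required.
    w_pos += lr * np.sign(err) * active_pos
    w_neg -= lr * np.sign(err) * active_neg
    return w_pos, w_neg
```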

Fig. 6: (a) Synthetic two-class training and test data; (b) perceptron training curve; (c) contour plot of the perceptron classification result.

3.4 Implementation and results

TABLE I: Perceptron classification accuracy (%) for the synthetic train and test data, reported overall and per class (Class 1, Class 2) for both sets.

The formulation in Sec. 3 is implemented and the results are evaluated using MATLAB. Linearly separable Markovian data is simulated using MATLAB functions for training and testing. We use 100 data samples as the train set and 100 samples as the test set.

3.4.1 Results and discussion

Figure 6(a) shows the scatter plot of the linearly separable two-class training and test data. The training curve in Fig. 6(b) shows that the cost function value decreases with each iteration. The algorithm performs well, as can be seen from the classification results in Table I. The contour plot of the inference results is shown in Fig. 6(c).

4 Multilayer perceptron based on MP algorithm

Figure 7 shows an MLP synthesized using the MP algorithm. The network consists of an input layer, a hidden layer with 2 nodes and an output layer. The network parameters are learned by minimizing the cost function shown in (40). We use an algorithm similar to backpropagation to evaluate the error gradient in order to update the network parameters. The red arrows indicate the backward propagation of error information with respect to the hidden-layer and output-layer weights.

Fig. 7: A three-layer multilayer perceptron (MLP) using the MP algorithm as a binary classifier for non-linearly separable XOR data. For the present work we use two-dimensional input data and a hidden layer with two nodes.

4.1 Inference

Let the input vector in the log-likelihood domain be $\mathbf{x}$. Let the two sets of learned weights connect each node of the input layer to each node of the hidden layer, and each node of the hidden layer to the output layer, respectively.

From Fig. 7, the output in differential form is

(33)

For the output layer,

(34)

where the normalization factor is estimated such that the MP constraint of eq. (4) is satisfied, and the two differential components are computed using

(35)
(36)

Similarly

For the hidden layer,

(37)

where the corresponding normalization factor is estimated such that the MP constraint is satisfied;

where,

(38)
(39)
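Extending the earlier single-node sketch, a two-layer MP forward pass might be organized as below: each hidden node produces a differential pair of MP normalization factors from its input scores, and the output node applies MP again to the hidden activations. The wiring (how hidden outputs are combined with the output-layer weights) is an assumption for illustration and is not the paper's exact eqs. (33)-(39); the `margin_propagation` helper from Section 2 is reused.

```python
import numpy as np

def mp_mlp_forward(x, W_pos, W_neg, v_pos, v_neg, gamma):
    """Two-layer MP network: additions and reverse water-filling only.
    W_pos/W_neg: (n_hidden, d) hidden-layer weights; v_pos/v_neg: (n_hidden,) output weights."""
    hidden = np.array([
        margin_propagation(x + wp, gamma) - margin_propagation(x + wn, gamma)
        for wp, wn in zip(W_pos, W_neg)
    ])                                                # differential hidden activations
    z_pos = margin_propagation(hidden + v_pos, gamma)
    z_neg = margin_propagation(hidden + v_neg, gamma)
    return z_pos - z_neg                              # differential network output
```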

4.2 Training: evaluation of error-function derivatives

Considering a two-class problem, the error function can be written as

(40)

where the two target labels indicate, for each sample, membership in class 1 and class 2 respectively.

Output layer

From eq. (40)

(41)

Using equations (34), (35), (18) and (19)

(42)

Similarly using (34), (36), (18) and (19)

(43)

Substituting (42) and (43) in (41) we get,

Similarly,

(44)

where,

(45)
(46)

4.2.1 Derivatives with respect to bias

From eq. (40)

(47)

As

Using equations (34), (35), (18) and (19)

(48)

Similarly,

(49)

Using (34), (36), (18) and (19)

(50)

Hidden layer

From (40)

(51)

Using equations (34), (35), (37), (38) and (39) we get,

Using equations (34), (36), (37), (38) and (39) we get,

Similarly,

(54)

where,

4.2.2 Derivatives with respect to bias

From (40)

(57)

Using equations (34), (35), (37), (38) and (39) we get,

Using equations (34), (36), (37), (38) and (39) we get,

Similarly,

(60)

where,

4.3 Parameter update rule

The weights and biases are updated using the obtained error gradients during each iteration as follows:

(63)