Learning to Recognize Actions from Limited Training Examples Using a Recurrent Spiking Neural Model

Priyadarshini Panda (School of Electrical & Computer Engineering, Purdue University, West Lafayette, IN, USA 47907) and Narayan Srinivasa
Abstract

A fundamental challenge in machine learning today is to build a model that can learn from few examples. Here, we describe a reservoir based spiking neural model for learning to recognize actions with a limited number of labeled videos. First, we propose a novel encoding, inspired by how microsaccades influence visual perception, to extract spike information from raw video data while preserving the temporal correlation across different frames. Using this encoding, we show that the reservoir generalizes its rich dynamical activity toward signature actions/movements, enabling it to learn from few training examples. We evaluate our approach on the UCF-101 dataset. Our experiments demonstrate that our proposed reservoir achieves 81.3%/87% Top-1/Top-5 accuracy, respectively, on the 101-class data while requiring just 8 video examples per class for training. Our results establish a new benchmark for action recognition from limited video examples for spiking neural models while yielding competitive accuracy with respect to state-of-the-art non-spiking neural models.

Introduction

The exponential increase in digital data from online media, surveillance cameras and other sources creates a growing need for intelligent models capable of complex spatio-temporal processing. Recent efforts in deep learning and computer vision have focused on building systems that learn and think like humans [1, 2]. Despite the biological inspiration and remarkable performance of such models, which even beat humans in certain cognitive tasks [3], the gap between humans and artificially engineered intelligent systems remains wide. One important divide is the size of the required training datasets. Studies on mammalian concept understanding have shown that humans and animals can rapidly learn complex visual concepts from single training examples [4]. In contrast, state-of-the-art intelligent models require vast quantities of labeled data with extensive, iterative training to learn suitably and yield high performance. While the proliferation of digital media has made massive amounts of raw, unstructured data available, it is often impractical and expensive to gather annotated training sets for all of it. This motivates our work on learning to recognize from few labeled data. Specifically, we propose a reservoir based spiking neural model that reliably learns to recognize actions in video data by extracting long-range structure from a small number of training examples.

Reservoir or Liquid Computing approaches have shown surprising success in recent years on a variety of temporal recognition problems (though effective vision applications are still scarce) [5, 6, 7]. This success can be attributed to populations of recurrently connected spiking neurons, similar to the anatomy of the mammalian neocortex [8], which create a high-dimensional dynamic representation of an input stream. The advantage of such an approach is that the reservoir (with sparse/random internal connectivity) implicitly encodes the temporal information in the input and provides a unique non-linear description of each input sequence that can be read out by a set of linear output neurons. In fact, in our view, the non-linear integration of input data by the high-dimensional dynamical reservoir results in generic stable internal states that generalize over common aspects of the input, thereby allowing rapid learning from limited examples. While the random recurrent connectivity generates favorable complex dynamic activity within a reservoir, it poses unique challenges that are generally not encountered when constructing feedforward/deep learning networks.

A pressing problem with recurrently connected networks is determining a connection matrix that makes the reservoir perform well on a particular task, because it is not entirely obvious how individual neurons should spike while interacting with other neurons. To address this, Abbott et al. [9] proposed an elegant method for constructing recurrent spiking networks, termed Autonomous (A) models, from continuous-variable (rate) networks, called Driven (D) models. The basic idea of the D/A approach is to construct a network (Auto) that performs a particular task by copying the statistics of another network (Driven) that performs the same task. The copying provides an estimate of the internal dynamics of the autonomous network, including the currents at individual neurons in the reservoir, required to produce the same outputs. Here, we adopt the D/A based reservoir construction approach and modify it further by introducing approximate delay apportioning to develop a recurrent spiking model for practical action recognition from limited video data. Please note, we use Auto/Autonomous interchangeably in the remainder of the text.

Further, to facilitate limited-example training in a spiking reservoir framework, we propose a novel spike based encoding/pre-processing method that converts the raw pixel-valued videos of the dataset into spiking information while preserving the temporal statistics and correlation across different frames. This approach is inspired by how MicroSaccades (MS) cause retinal ganglion cells to fire synchronously at contrast edges after each MS. The emitted synchronous spike volley thus rapidly transmits the most salient edges of the stimulus, which often constitute the most crucial information [10]. In addition, our encoding eliminates irrelevant spiking information arising from ambiguity, such as noisy background activity or jitter (due to unsteady camera movement), further making our approach more robust and reliable. Hence, our encoding, in general, captures the signature action or movement of a subject across different videos as spiking information. This, in turn, enables the reservoir to recognize/generalize over motion cues from spiking data and thereby learn various types of actions from a few video samples per class. Our proposed spiking model is adept at learning object/activity types, due in part to its ability to analyze activity dynamically as it takes place over time.

Reservoir Model: Framework & Implementation

Model architecture

Let us first briefly discuss the Driven/Autonomous model approach [9]. The general architecture of the Driven/Autonomous reservoir model is shown in Fig. 1 (a). Both driven and auto models are 3-layered networks. The input layer is connected to a reservoir that consists of nonlinear Leaky-Integrate-and-Fire (LIF) spiking neurons recurrently connected to each other in a random, sparse manner (generally with a connection probability of 10%), and the spiking output produced by the reservoir is then projected to the output/readout neurons. In the case of the driven model, the output neurons are rate neurons that simply integrate the spiking activity of the reservoir neurons. The auto model, on the other hand, is an entirely spiking model with LIF neurons at the output as well as within the reservoir. Please note, each reservoir neuron here is excitatory and the reservoir does not have any inhibitory neuronal dynamics, in a manner consistent with conventional Liquid or Reservoir Computing.

Essentially, the role of the driven network is to provide targets ($f_{out}(t)$) for the auto model, which is the spiking network we are trying to construct for a given task. To achieve this, the driven network is trained on an input ($f_{in}(t)$) that is a high-pass filtered version of the desired output ($f_{out}(t)$),

$$f_{in}(t) = f_{out}(t) + \tau_s \frac{d f_{out}(t)}{dt} \qquad (1)$$

The connections from the reservoir to the readout neurons ($w_{out}$) of the driven model are then trained to produce the desired activity using the Recursive Least Squares (RLS) rule [11, 12]. Once trained, the driven network connections ($J^{fast}$) are imported into the autonomous model and a new set of slow network connections ($W^{slow}$) is added for each fast connection. The combination of the two sets of synapses, $J^{fast}$ and $W^{slow}$, allows the auto model to match its internal dynamics to those of the driven network that is already tuned to produce the desired $f_{out}$. In fact, the connections $J^{fast}$ and $W^{slow}$ in the auto model are derived from the driven network as $J^{fast}_{auto} = A\,J^{fast}_{driven}$ and $W^{slow} = B\,w_{out}$, where $A$ and $B$ are randomly chosen. Note that the time constants of the two sets of synapses differ ($\tau_{slow} \gg \tau_{fast}$), which allows the autonomous network to slowly adjust its activity to produce $f_{out}$ during training. Please refer to [9] for more details and insights on the D/A approach. After the autonomous model is constructed, we train its output layer (consisting of LIF spiking neurons), $W^{out}$, using a supervised Spike Timing Dependent Plasticity (STDP) rule [13] based on the target $f_{out}$.

Figure 1: (a) Structure of Driven (Autonomous) networks with rate (spiking) output neurons and fast (fast-slow) synaptic connections. Here, all fast connections have an equivalent delay time constant. (b) (Top) Structure of the Driven-Autonomous model with the proposed apportioning to construct variable input (denoted as $X$)/output (denoted as $Y$) reservoir models. Here, each fast connection has a different delay time constant. (Bottom) Desired output produced by an autonomous model and its corresponding Driven model constructed using approximate apportioning. Note that the input spiking activity (blue curve) and output (red/green curve) in the driven model are similar since the input is a filtered version of the output, while the autonomous model works on random input data.

The motive behind choosing $f_{in}$ as a filtered version of $f_{out}$ is to compensate for the synaptic filtering at the output, characterized by the time constant $\tau_s$ [14, 9]. However, it is evident that the dependence of $f_{in}$ on $f_{out}$ in Eqn. 1 imposes a restriction on the reservoir construction: the number of input and output neurons must be the same. To address this limitation and extend the D/A construction approach to variable input-output reservoir models, we propose an approximate definition of the driven input using a delay apportioning method as,

$$f_{in,i}(t) = f_{out,j}(t) + \frac{\tau_s}{X/Y} \frac{d f_{out,j}(t)}{dt} \qquad (2)$$

Here, $i$ varies from 1 to $X$ and $j$ varies from 1 to $Y$, where $X$/$Y$ are the numbers of input/output neurons in our reservoir model, respectively, and $X \geq Y$. Fig. 1 (b, Top) shows the architecture of the D/A model with the proposed delay apportioning. For each output function ($f_{out,j}$), we derive the driven inputs by distributing the time constant $\tau_s$ across the different inputs. We are essentially compensating the phase delay due to $\tau_s$ in an approximate manner. Nevertheless, it turns out that this approximate version of $f_{in}$ still drives the driven network to produce the desired output and derive the connection matrix of the autonomous model, which is then trained on the real input (video data in our case). Fig. 1 (b, Bottom) illustrates the desired output generated by a D/A model of a 2 (input)-400 (reservoir neurons)-1 (output) configuration constructed using Eqn. 2. Note, for convenience of representation and for comparative purposes, the outputs of both driven/auto models are shown as rates instead of spiking activity. It is clearly seen in the driven case that the input spiking activity for a particular input neuron, even with apportioning, approximates the target quite well. Also, the autonomous model derived from the driven network in Fig. 1 (b, Bottom) produces the desired spiking activity in response to random input upon adequate training of its connections.
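To make the delay apportioning concrete, here is a minimal Python/NumPy sketch of how driven inputs could be built from a desired output per Eqn. 2 (the paper's implementation is in MATLAB; the function name, the equal split of the derivative term across the $X/Y$ inputs per output, and the sinusoidal example target are our illustrative assumptions):

```python
import numpy as np

def make_driven_inputs(f_out, tau_s, X, dt=1.0):
    """Construct driven-network inputs from desired outputs (sketch of Eqn. 2).

    f_out : (Y, T) array of desired output traces, one row per output neuron.
    tau_s : slow synaptic time constant (ms) being compensated.
    X     : number of driven input neurons (assumes X divisible by Y); the
            delay term tau_s * df/dt is apportioned equally across the X/Y
            inputs assigned to each output.
    """
    Y, T = f_out.shape
    n_per_out = X // Y                       # inputs apportioned per output
    df = np.gradient(f_out, dt, axis=1)      # d f_out / dt
    f_in = np.empty((X, T))
    for j in range(Y):
        for k in range(n_per_out):
            i = j * n_per_out + k
            # each input carries the output plus a 1/n share of the delay term
            f_in[i] = f_out[j] + (tau_s / n_per_out) * df[j]
    return f_in

# Example: the 2-input / 1-output configuration of Fig. 1 (b)
t = np.arange(0, 300, 1.0)                           # 300 ms, 1 ms steps
f_out = np.sin(2 * np.pi * t / 100)[None, :]         # one target trace
f_in = make_driven_inputs(f_out, tau_s=100.0, X=2)   # two apportioned inputs
```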

Supervised STDP & One-Hot Encoding

Figure 2: (a) Supervised STDP rule (defined by Eqn. 3) using the pre-synaptic trace to perform potentiation/depression of weights. $t_{target}/t_{actual}$ are the time instants at which target/actual spiking activity occurs, resulting in a weight change. The extent of potentiation or depression of the weights is proportional to the value of the pre-synaptic trace at the given instant. (b) One-hot spike encoding activity for a 3-class problem: based on the class of the training input, the corresponding output neuron is trained to produce high activity while the remaining neurons yield near-zero activity.

The reservoir framework relaxes the burden of training by fixing the connectivity within the reservoir and from the input to the reservoir. Only the output layer of readout neurons ($W^{out}$ in Fig. 1 (a)) is trained (generally in a supervised manner) to match the reservoir activity to the target pattern. In our D/A based reservoir construction, the readout of the driven network is trained using standard RLS learning [12], while we use a modified version of the supervised STDP rule proposed in [13] to train the autonomous model. An illustration of our STDP model is shown in Fig. 2 (a). For a given target pattern, the synaptic weights are potentiated or depressed as

$$\Delta w(t) = \eta \, x_{pre}(t) \left[ S_{target}(t) - S_{actual}(t) \right] \qquad (3)$$

where $t$ denotes the time of occurrence of actual/desired spiking activity at the post-synaptic neuron during the simulation ($S_{actual}/S_{target}$ are the corresponding binary spike indicators) and $x_{pre}$, namely the presynaptic trace, models the pre-neuronal spiking history. Every time a pre-synaptic neuron fires, $x_{pre}$ increases by 1; otherwise it decays exponentially with a time constant of the same order as the synaptic time constants. Fundamentally, as illustrated in Fig. 2 (a), Eqn. 3 depresses (or potentiates) the weights at those time instants when actual (or target) spike activity occurs. This ensures that the actual spiking converges toward the desired activity as training progresses. Once the desired and actual activity become similar, learning stops, since at time instants where desired and actual spikes occur simultaneously, the weight update per Eqn. 3 becomes 0.
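The following NumPy sketch illustrates one discrete-time step of this rule, assuming a simple learning rate `eta` and a trace time constant `tau_pre`; the compact error form (target spike minus actual spike) is our reading of Eqn. 3 and reproduces the behavior described above, i.e., potentiation at target-spike instants, depression at actual-spike instants, and a zero update once the two coincide:

```python
import numpy as np

def supervised_stdp_step(w, x_pre, target_spk, actual_spk, eta=0.01):
    """One time step of the supervised STDP update (our sketch of Eqn. 3).

    w          : (n_out, n_res) readout weights.
    x_pre      : (n_res,) pre-synaptic traces.
    target_spk : (n_out,) binary desired spikes at this instant.
    actual_spk : (n_out,) binary actual spikes at this instant.
    """
    err = target_spk.astype(float) - actual_spk.astype(float)  # +1, 0 or -1
    return w + eta * np.outer(err, x_pre)

def update_trace(x_pre, pre_spk, tau_pre=20.0, dt=1.0):
    """Pre-synaptic trace: +1 on a spike, exponential decay otherwise
    (tau_pre is an assumed value of the same order as the other constants)."""
    return x_pre * np.exp(-dt / tau_pre) + pre_spk
```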

Since the main aim of our work is to develop a spiking model for action recognition in videos, we model the desired spiking activity at the output layer using one-hot encoding. As in standard machine learning applications, the output neuron assigned to a particular class is trained to produce high spiking activity while the remaining neurons are trained to generate zero activity. Fig. 2 (b) shows the one-hot spike encoding for a 3-class problem, wherein the reservoir's readout has 3 output neurons and, based on the training input's class, the desired spiking activity (a 'bump' of activity) for each output neuron varies. Again, we use the supervised STDP rule described above to train the output layer of the autonomous model. Since the desired spiking activity of the auto model ($f_{out}$, referring to Fig. 1) follows this one-hot spike encoding, the driven network (from which we derive our auto model) must be constructed to produce the equivalent target output activity. To reiterate, the auto model (entirely spiking) works on the real-world input data and is trained to perform the required task, such as classification. The connection matrix ($J^{fast}$) of the auto model is derived from a driven network (with continuous rate-based output neurons) that, in turn, provides the target activity for the output neurons of the auto model.
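For illustration, target spike trains of this form for the 3-class case of Fig. 2 (b) could be generated as below; the per-step firing probabilities `rate_hi`/`rate_lo` are placeholders, not values from the paper:

```python
import numpy as np

def one_hot_spike_targets(label, n_classes=3, T=300, rate_hi=0.5, rate_lo=0.0):
    """Target spike trains for one training pattern (Fig. 2 (b) style).

    The output neuron of the pattern's class fires densely (a 'bump' of
    activity) while all other outputs stay silent.
    """
    rng = np.random.default_rng(0)
    rates = np.full(n_classes, rate_lo)
    rates[label] = rate_hi
    return (rng.random((n_classes, T)) < rates[:, None]).astype(np.uint8)

targets = one_hot_spike_targets(label=1)  # class-2 neuron active, others silent
```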

Input Processing

In this work, we use the UCF101 dataset [15] to perform action recognition with our reservoir approach. UCF101 consists of 13320 realistic videos (collected from YouTube) that fall into 101 different categories or classes, two of which are shown in Fig. 3. Since the reservoir computes using spiking inputs, we convert the pixel-valued action information to spike data in the steps outlined below.

Pixel Motion Detection

Figure 3: Fixed threshold spiking input (denoted by Spike Difference) and variable threshold weighted input data (denoted by Weighted spike) shown for selected frames for two videos of the UCF101 dataset. As time progresses, the correlation across spiking information is preserved. Encircled regions show the background captured in the spiking data.

Unlike static images, video data are a natural target for event-based learning algorithms due to their implicit time-based structure. First, we track the moving pixels across different frames so that the basic action/signature is captured. This is done by monitoring the difference of the pixel intensity values ($I_t(x,y)$) between consecutive frames and comparing the difference against a user-defined threshold ($\theta$) to detect the moving edges at each location $(x,y)$ within a frame, as

$$S_t^{\theta}(x,y) = \begin{cases} 1, & \left| I_t(x,y) - I_{t-1}(x,y) \right| > \theta \\ 0, & \text{otherwise} \end{cases} \qquad W_t(x,y) = \sum_{n=1}^{N} \theta_n \, S_t^{\theta_n}(x,y) \qquad (4)$$

The thresholding detects the moving pixels and yields an equivalent spike pattern ($S_t^{\theta}$) for each frame. It is evident that the conversion of the moving edges into spikes preserves the temporal correlation across different frames. While $S_t^{\theta}$ encodes the signature action adequately, we make the representation more robust by varying the threshold ($\theta_n = \{1, 2, 4, 8, 16, 32\}$; $N = 6$ in our experiments) and computing the weighted summation of the spike differences for each threshold, as shown in the last part of Eqn. 4. The resultant weighted spike difference $W_t$ is then binarized to yield the final spike pattern for each frame. This kind of pixel-change based motion detection to generate spike output is now possible with a new breed of energy-efficient sensors inspired by the retina [16].
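A sketch of this encoding, following our reconstruction of Eqn. 4; since the text does not specify the binarization threshold, it is left as a free parameter `bin_th` here:

```python
import numpy as np

def frame_to_spikes(prev, curr, thresholds=(1, 2, 4, 8, 16, 32), bin_th=1):
    """Variable-threshold motion encoding (sketch of Eqn. 4).

    prev, curr : consecutive grayscale frames (2-D arrays).
    Each threshold yields a binary map of moving edges; the maps are summed
    with the threshold value as weight (our reading of the 'weighted
    summation'), then re-binarized into the final spike frame.
    """
    diff = np.abs(curr.astype(float) - prev.astype(float))
    weighted = sum(th * (diff > th) for th in thresholds)   # weighted spike map
    return (weighted >= bin_th).astype(np.uint8)            # binarize
```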

Fig. 3 shows the spiking inputs captured from the moving pixels for the fixed-threshold spike input and the varying-threshold weighted input, for selected frames. It is clearly seen that the weighted input captures more relevant edges and even subtle movements; for instance, in the PlayingGuitar video, the delicate body movement of the person is captured along with the significant hand gesture. Due to the realistic nature of the videos in UCF101, there exists considerable intra-class variation along with ambiguous noise. Hence, a detailed input pattern that captures subtle gestures helps the reservoir differentiate the signature motion from noise. Note that we convert the RGB video data to grayscale before performing motion detection.

Scan based Filtering

We observe from Fig. 3, for the BabyCrawling case, that along with the baby's movement, some background activity is also captured. This background spiking occurs on account of camera movement or jitter. Additionally, we see that the majority of pixels across the different frames in Fig. 3 are black, since the background is largely static and hence does not yield any spiking activity. To avoid wasteful computation due to non-spiking pixels, as well as to circumvent insignificant activity due to camera motion, we create a Bounding Box (BB) around the Centre of Gravity (CoG) of the spiking activity in each frame and scan across 5 directions as shown in Fig. 4. The CoG is calculated as the average of the active pixel locations. In general, the dimensionality of the video frames in the UCF101 dataset is 320x240. We create a square BB centered at the CoG, capture the spiking activity enclosed within the BB in what we refer to as the Center (C) scan, and then stride the BB along different directions (by a number of pixels equal to half the distance between the CoG and the edge of the BB), namely Left (L), Right (R), Top (T), Bottom (B), to capture the activity in the corresponding directions. These five windows of spiking activity can be interpreted as capturing the salient edge pixels (derived from the pixel motion detection above) after each microsaccade [10]. In our model, the first fixation is at the center of the image. The fixation is then shifted from the center of the image to the CoG by the first microsaccade. The remaining four microsaccades are assumed to occur in the four directions (i.e., top, bottom, left and right) with respect to the CoG of the image, by the stride amount. Note that this sequence of four microsaccades is always fixed in its direction; the stride amount can, however, vary based on the size of the BB.
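A minimal sketch of the CRLTB scan extraction, assuming a square bb-by-bb bounding box (set to the reservoir's input dimension in the paper) and frames at least that large; the function and parameter names are ours:

```python
import numpy as np

def crltb_scans(spike_frame, bb=64):
    """Center/Left/Right/Top/Bottom scans around the CoG (sketch)."""
    ys, xs = np.nonzero(spike_frame)
    if len(xs) == 0:                               # no activity: use frame center
        cy, cx = np.array(spike_frame.shape) // 2
    else:
        cy, cx = int(ys.mean()), int(xs.mean())    # centre of gravity

    def window(cy, cx):
        # clip so the bb x bb box stays inside the frame
        h, w = spike_frame.shape
        y0 = np.clip(cy - bb // 2, 0, h - bb)
        x0 = np.clip(cx - bb // 2, 0, w - bb)
        return spike_frame[y0:y0 + bb, x0:x0 + bb]

    s = bb // 4   # stride: half the distance from the CoG to the box edge
    return {'C': window(cy, cx), 'L': window(cy, cx - s), 'R': window(cy, cx + s),
            'T': window(cy - s, cx), 'B': window(cy + s, cx)}
```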

Fig. 4 (a) clearly shows that the CRLTB scans retain only the relevant region of interest across different frames of the BabyCrawling video while filtering out the irrelevant spiking activity. We also observe that in the PlayingGuitar video, shown in Fig. 4 (b), different scans capture diverse aspects of the relevant spiking activity; for instance, the hand gesture in the C scan and the guitar shape in the B scan shown in the 3rd row. This ensures that all significant movements for a particular action in a given frame are acquired in a holistic manner.

Figure 4: Scan based filtering from the original spike data resulting in Center, Left, Right, Top, Bottom scans per frame for (a) BabyCrawling video (b) PlayingGuitar video. Each row shows the original frame as well as the different scans captured.

In addition to the scan based filtering, we also delete certain frames to further eliminate the effect of jitter or camera movement on the spike input data. We delete the frames based on two conditions:

  • If the overall number of active pixels in a given spiking frame is greater than a maximum threshold or less than a minimum threshold, that frame is deleted. This eliminates frames with flashing activity due to unsteady camera movement. In our experiments, we use min/max threshold values of 8%/75% of the total number of pixels in a given frame. Note, we perform this step before we obtain the CRLTB scans.

  • Once the CRLTB scans are obtained, we monitor the activity change of a given frame in each of the R/L/T/B scans as compared to the C scan. If the activity change is greater than the mean activity across the scans, then we delete that frame for the scan that yields the high variance. (A sketch implementing both deletion conditions follows this list.)
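The two deletion conditions can be sketched as follows; `keep_frame_global` implements the activity band of the first condition, while `keep_scans` is our reading of the loosely specified second condition:

```python
import numpy as np

def keep_frame_global(frame, lo=0.08, hi=0.75):
    """Condition 1: reject frames whose fraction of active pixels falls
    outside the [8%, 75%] band (flashes from unsteady camera motion)."""
    frac = frame.mean()          # binary frames, so mean = active fraction
    return lo <= frac <= hi

def keep_scans(scans):
    """Condition 2 (our interpretation): compare each R/L/T/B scan against
    the C scan and drop a scan whose activity deviates from C by more than
    the mean activity across the five scans."""
    acts = {k: v.sum() for k, v in scans.items()}
    mean_act = np.mean(list(acts.values()))
    return {k: v for k, v in scans.items()
            if k == 'C' or abs(acts[k] - acts['C']) <= mean_act}
```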

It is worth mentioning that the above filtering methods eliminate, to a large extent, the ambiguous motion or noise in videos with minimal background activity or jitter (such as PlayingGuitar). However, for videos where the background has significant motion (such as HorseRacing) or where multiple human subjects are interacting (such as Fencing, IceDancing), the extracted spike pattern does not encode a robust and reliable signature motion, due to extreme variation in movement. For future implementations, we are studying the use of a DVS camera [17, 16] to extract reliable spike data despite ego motion. Nonetheless, our proposed input processing method yields reasonable accuracy (shown later in the Results section) with the D/A model architecture for action recognition with limited video examples. Supplementary Videos 1 and 2 give a better visualization of the spiking data obtained with the pixel motion detection and scan based filtering scheme.

Experimental Methodology

Network Parameters

The proposed reservoir model and learning scheme were implemented in MATLAB. The dynamics of the reservoir neurons in the Driven/Autonomous models are described by

$$\tau_m \frac{dV_i}{dt} = -\left(V_i - V_{rest}\right) + \sum_j J^{fast}_{ij} r^{fast}_j + \sum_j W^{slow}_{ij} r^{slow}_j + \sum_k W^{in}_{ik} \, s_k(t) \qquad (5)$$

where $\tau_m$ = 20 ms, $s(t)$ denotes the real input spike data in the auto model (in the driven model, the input is instead $f_{in}(t)$, the derived version of the target output, refer to Eqn. 2), and $W^{in}$ are the weights connecting the input to the reservoir neurons in the driven/auto models, respectively. Each neuron fires when its membrane potential, $V$, reaches a threshold of -50 mV and is then reset to -70 mV. Following a spike event, the membrane potential is held at the reset value for a refractory period of 5 ms. The traces $r^{fast}/r^{slow}$ correspond to the fast/slow synaptic currents observed in the reservoir due to the inherent recurrent activity. In the driven model, only fast synaptic connections are present, while the auto model has both slow and fast connections. The synaptic currents are encoded as traces; that is, when a neuron in the reservoir spikes, $r^{fast}$ (and $r^{slow}$ in the case of the auto model) is increased by 1, otherwise it decays exponentially as

$$\tau_{fast} \frac{dr^{fast}}{dt} = -r^{fast}, \qquad \tau_{slow} \frac{dr^{slow}}{dt} = -r^{slow} \qquad (6)$$

where $\tau_{slow}$ = 100 ms and $\tau_{fast}$ = 5 ms. Both the $W^{in}$ and the recurrent connections within the reservoir of the D/A model are fixed during the course of training. The $W^{in}$ of the driven model are randomly drawn from a normal distribution. $J^{fast}$ and $W^{slow}$ (refer to Fig. 1 (a)) of the auto network are derived from the driven model using $A$ = 1 and $B$ = a random number drawn from a uniform distribution in the range [0, 0.15]. We discuss the impact of variation of $B$ on the decoding capability of the autonomous reservoir in a later section. If the reservoir outputs a spike pattern, say $r(t)$, the weights from the reservoir to the readout ($w_{out}$) are trained to produce the desired activity $f_{out}$. The continuous rate output neurons of the driven network are trained using RLS such that $w_{out}\, r(t) \approx f_{out}(t)$. The spiking output neurons of the auto model (with neuronal parameters similar to Eqn. 5) integrate the net current to fire action potentials that converge toward the desired spiking activity with supervised STDP learning.
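A single Euler-integration step of the autonomous reservoir (Eqns. 5-6) could look like the sketch below; the driven model would drop the `W_slow` term and replace the spike input with `f_in`. Refractory handling is omitted for brevity (the paper holds $V$ at reset for 5 ms after a spike):

```python
import numpy as np

def reservoir_step(V, rf, rs, spikes_in, J_fast, W_slow, W_in,
                   dt=1.0, tau_m=20.0, tau_f=5.0, tau_s=100.0,
                   v_rest=-70.0, v_th=-50.0):
    """One Euler step of the autonomous reservoir (sketch of Eqns. 5-6).

    V         : (N,) membrane potentials (mV).
    rf, rs    : (N,) fast/slow synaptic traces (+1 on a spike, then decay).
    spikes_in : (X,) binary input spike frame.
    """
    I = J_fast @ rf + W_slow @ rs + W_in @ spikes_in   # net synaptic current
    V = V + (dt / tau_m) * (-(V - v_rest) + I)
    spk = V >= v_th                                    # threshold at -50 mV
    V[spk] = v_rest                                    # reset to -70 mV
    rf = rf * np.exp(-dt / tau_f) + spk                # fast trace, tau = 5 ms
    rs = rs * np.exp(-dt / tau_s) + spk                # slow trace, tau = 100 ms
    return V, rf, rs, spk
```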

D/A model for Video Classification

The input processing technique yields 5 different CRLTB scan spike patterns per frame for each video. In the D/A based reservoir construction, each of the 5 scans for every video is processed by one of 5 different autonomous models. Fig. 5 shows an overview of our proposed D/A construction approach for video classification. Each scan is processed separately by the corresponding autonomous model. An interesting observation here is that since the internal topology across all auto models is equivalent, they can all be derived from a single driven model, as shown in Fig. 5. In fact, there is no limit to the number of autonomous models that can be extracted from a driven network, which is the principal advantage of our approach. Thus, in case more scans based on depth filtering or ego-motion monitoring are obtained during input processing, our model can easily be extended with additional autonomous models corresponding to the new scans.

Figure 5: Proposed Driven/Autonomous construction scheme for processing the CRLTB scans obtained from the scan based filtering

During training, the driven network is first trained to produce the target pattern (similar to the one-hot spike encoding described in Fig. 2 (b)). Then, each autonomous model is trained separately on the corresponding input scan to produce the target pattern. In all our experiments, we train each auto model for 30 epochs. In each training epoch, we present the training input patterns corresponding to all classes sequentially (for instance, Class 1 → Class 2 → Class 3) to produce target patterns similar to Fig. 2 (b). Each input pattern is presented for a time period of 300 ms (or 300 time steps). In each time step, a particular spike frame for a given scan is processed by the reservoir. For patterns where the total number of frames is less than 300, the network still continues to train (even in the absence of input activity) on the inherent reservoir activity generated by the recurrent dynamics. In fact, this enables the reservoir to generalize its dynamical activity over similar input patterns belonging to the same class while discriminating patterns from other classes. Before presenting a new input pattern, the membrane potentials of all neurons in the reservoir are reset to their resting values.

After training, we pass the CRLTB scans of a test instance to the corresponding auto models simultaneously and observe the output activity. The test instance is assigned a class label based on the output neuron that generates the highest spiking response; this assignment is done for each auto model. Finally, we take a majority vote across all the assignments (i.e., across all 5 auto models) for each test input to predict its class. For instance, if the majority of the auto models (i.e., at least 3 of the 5) predict a class (say Class 1), then the predicted output class of the model is Class 1. A single auto model might not have sufficient information to recognize the test input correctly; the remaining scans or auto models, in that case, can compensate for the insufficient or inaccurate predictions of the individual models. While we reduce the computational complexity of training by using the scan based filtering approach, which partitions the input space into smaller scans, the parallel voting during inference enhances the collective decision making of the proposed architecture, thereby improving the overall performance.
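The inference procedure can be summarized by the sketch below, where each trained autonomous model is abstracted as a callable returning per-class output spike counts (a stand-in for a full reservoir simulation):

```python
import numpy as np
from collections import Counter

def predict(scan_spikes, auto_models):
    """Majority vote over the 5 autonomous models (sketch).

    scan_spikes : dict mapping 'C','L','R','T','B' to spike inputs.
    auto_models : dict of callables returning per-class output spike counts.
    """
    votes = [int(np.argmax(auto_models[k](scan_spikes[k])))   # per-model label
             for k in ('C', 'L', 'R', 'T', 'B')]
    return Counter(votes).most_common(1)[0][0]                # majority class
```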

The reservoir (both driven/auto) topology for the video classification experiments consists of $X$ input neurons with sparse recurrent connectivity within the reservoir, where $X$ is equivalent to the size of the bounding box used during scan based filtering. The number of output neurons varies based on the classification problem, that is, 5 output neurons for a 5-class problem. In most experiments below, we use 10% connectivity to maintain the sparseness of the reservoir. In the Results section below, we further elaborate on the role of connectivity in the performance of the reservoir. The number of reservoir neurons is varied based upon the complexity of the problem in order to achieve maximum classification accuracy.

Figure 6: Training/testing video clips of the 2-class problem (a) GolfSwing (b) BenchPress: Column 1 shows the training video; Columns 2/3 show test video examples correctly/incorrectly classified by our D/A model.

Classification Accuracy on UCF101

First, we considered a 2-class problem of recognizing the BenchPress/GolfSwing classes of UCF101 to test the effectiveness of our overall approach. We specifically chose these 2 classes for our preliminary experiment since the majority of videos in these categories have a static background and the variation in action across different videos is minimal. We simulated an 800N-2Output, 10% connectivity D/A reservoir (N: number of reservoir neurons) and trained the autonomous models with a single training video from each class, as shown in Fig. 6. It is clearly seen that even with single-video training, the reservoir correctly classifies a host of other examples, irrespective of the diverse subjects performing the same action. For instance, in the golf examples in Fig. 6 (a), the reservoir does not discriminate between a person wearing shorts or trousers while recognizing the action. Hence, we can gather that the complex transient dynamics within the reservoir latch onto a particular action signature (the swing gesture in this case), enabling it to classify several test examples even under diverse conditions. Investigating the incorrect predictions of the reservoir for both classes, we found that the reservoir is sensitive to variation in views and inconsistencies in action, which result in inaccurate predictions. Fig. 6 shows that incorrect results are observed when the view angle of the training/test video changes from frontal to adjacent views. In fact, in the case of BenchPress, even videos with a view similar to the training video are incorrectly classified, due to disparity in the motion signature (for instance, the number of times the lifting action is performed). Note, we only show a selected number of correctly/incorrectly classified videos in Fig. 6.

To account for such variations, we increased the number of training examples from 1 to 8. In the extended training set, we tried to incorporate all kinds of view/signature based variations for each action class, as shown in Fig. 7 (a). For the 2-class problem with 8 training examples per class, we simulated a 2000N-10% connectivity reservoir that yields an accuracy of 93.25%. Fig. 7 (b) shows the confusion matrix for the 2-class problem. BenchPress has more misses than GolfSwing, as the former has more variation in action signature across different videos. It is well-known that the inherent chaotic or spontaneous activity of a reservoir enables it to produce a wide variety of complex output patterns in response to an input stimulus. In our opinion, this behavior enables data-efficient learning, as the reservoir learns to generalize over a small set of training patterns and produce diverse yet converging internal dynamics that transform different test inputs toward the same output. Note, testing is done on all the remaining videos in the dataset that were not selected for training.

Figure 7: (a) Clips of the 8 training videos for each class (BenchPress/GolfSwing) incorporating multiple views/variation in action signature (b) Confusion Matrix showing the overall as well as class-wise accuracy for 2-class BenchPress/GolfSwing classification

Next, we simulated reservoirs of 4K/6K/8K/9K neurons (with 10% connectivity) to learn the entire 101 classes of the UCF101 dataset. For each of the 101 classes, we used 8 different training videos randomly chosen from the dataset. To ensure that more variation is captured for training, we imposed a simple constraint during selection: each training example must have a different subject/person performing the action. Fig. 8 (a) illustrates the Top-1, Top-3 and Top-5 accuracy observed for each topology. We also report the average number of tunable parameters in each case. The reservoir has fixed connectivity from the input to the reservoir as well as within the reservoir, which is not affected during training. Hence, we only compare the total number of trainable parameters (comprising the weights from the reservoir to the output layer) as the reservoir size varies in Fig. 8 (a). We observe that the reservoir with 8K neurons yields maximum accuracy. Also, the classification accuracy improves as the reservoir size increases from 4K → 8K neurons. This is expected, since larger reservoirs with sparse random connectivity generate richer and more heterogeneous dynamics that boost the decoding capability of the network. Thus, for the same task difficulty (in this case, the 101-class problem), increasing the reservoir size improves the performance. However, beyond a certain point, the accuracy saturates. Here, the reservoir with 9K neurons achieves accuracy similar to that of 8K; in fact, there is a slight drop in accuracy from 8K → 9K. This might be the result of a sort of overfitting phenomenon, as a reservoir with more parameters than necessary for a given task might begin to assign, instead of generalizing, its dynamic activity toward a certain input. Note, all reservoir models discussed so far have 10% connectivity. Hence, the reservoirs in Fig. 8 (a) have an increasing number of connections per neuron (400 → 900) as the number of neurons in the reservoir increases from 4000 to 9000. Please refer to the supplementary information (see Supplementary Fig. S1, S2) for the confusion matrix of the 101-class problem with 8K neurons and 10% connectivity.

Figure 8: (a) Top-1/3/5 accuracy on 101-classes obtained with 8-training example (per class) learning for D/A models of different topologies (b) Effect of variation in sparsity (for fixed number of connections per neuron across different reservoir topologies) on the Top-1 accuracy achieved by the reservoir. Note, all results are performed on the D/A reservoir scheme with 8 training videos per class processed with the input spike transformation technique mentioned earlier.

Impact of Variation of Reservoir Sparsity on Accuracy

Since sparsity plays an important role in determining the reservoir dynamics, we simulated several topologies in an iso-connectivity scenario to see the effect of increasing sparseness on the reservoir performance. Fig. 8 (b) shows the variation in Top-1 accuracy on the 101 classes as the reservoir sparseness changes. From Fig. 8 (a), we observe that the 8000N-10% connectivity reservoir (with 800 connections per neuron) gave the best results. Hence, in the sparsity analysis, we fixed the number of connections per neuron to 800 across the different reservoir topologies. Initially, for a 2000N-40% connectivity reservoir, the observed accuracy is quite low. This is expected, as lower sparsity limits the richness of the complex dynamics that can be produced by the reservoir, which in turn reduces its learning capability. Increasing the sparseness improves the accuracy, as expected. We note that the 4000N-20% connectivity reservoir yields lower accuracy than the sparser 4000N-10% connectivity reservoir of Fig. 8 (a). Another interesting observation is that the 10,000N-8% reservoir yields higher accuracy than the 8000N-10% reservoir. This result supports the fact that, in an iso-connectivity scenario, more sparseness yields better results. However, the accuracy again saturates, with a risk of overfitting or losing generalization capability beyond a certain level of sparseness, as observed in Fig. 8 (b). Hence, there is no fixed rule to determine the exact connectivity/sparsity ratio for a given task; we arrived at the results empirically by simulating a varying set of reservoir topologies. However, in most cases, we observed that for an N-neuron reservoir model, 10% connectivity generally yields reasonable results, as seen from Fig. 8 (a). It is worth mentioning that while the 10K neuron reservoir yields the highest accuracy, it consists of a larger number of tunable parameters (total connections/weights from reservoir to output) than the 8K-10% connectivity reservoir (Fig. 8 (a)). In fact, the best accuracy achieved with our approach corresponds to the 10K-8% connectivity reservoir, with Top-1/Top-3/Top-5 accuracy of 81.3%/85.2%/87%, respectively. Hence, depending upon the requirements, the connectivity/size of the reservoir can be varied to achieve a favorable efficiency-accuracy tradeoff. Please note that the total number of tunable parameters in the reservoir model includes all 5 autonomous models.

EigenValue Spectra Analysis of Reservoir Connections

We analysed the EigenValue (EV) spectra of the synaptic connections (both $J^{fast}$ and $W^{slow}$ of the autonomous models) to study the complex dynamics of the reservoir. The EV spectrum has been shown to characterize the memory/learning ability of a dynamical system [18]. Rajan et al. [18, 19] have shown that stimulating a reservoir with random recurrent connections activates multiple modes in which neurons oscillate with individual frequencies. These modes then superpose in a highly non-linear manner, resulting in the complex chaotic/persistent activity observed in a reservoir. The EVs of the synaptic matrix typically encode the oscillatory behavior of each neuron in the reservoir: the real part of an EV corresponds to the decay rate of the associated mode, while the frequency is given by the imaginary part. If the real part of a complex EV exceeds 1 (Re(EV) > 1), the activated mode leads to long-lasting oscillatory behavior. If there are only a few modes with Re(EV) > 1, the network is at a fixed point, meaning the reservoir exhibits the same pattern of activity for a given input. It is desirable to construct a reservoir with multiple fixed points so that each one can be used to retain a different memory. In a recognition scenario, each of the fixed points latches onto a particular input pattern, or video in our case, thereby yielding a good memory model. As the number of modes with Re(EV) > 1 increases, the reservoir starts generating chaotic patterns that are complex and non-repeating. In order to construct a reservoir model with good memory, we must operate in a region between singular fixed-point activity (few modes with Re(EV) > 1) and complete chaos (many modes with Re(EV) > 1).
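As a worked illustration of this analysis, the sketch below computes and plots the EV spectrum of a random recurrent matrix; the Gaussian statistics are chosen only so that some modes satisfy Re(EV) > 1 and are not the paper's trained weights:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch: eigenvalue spectrum of a random recurrent matrix. The standard
# deviation 1.5/sqrt(N) is an illustrative choice giving a spectral radius
# of roughly 1.5, so a subset of modes cross Re(EV) = 1.
N = 800
rng = np.random.default_rng(1)
J = rng.normal(0.0, 1.5 / np.sqrt(N), (N, N))    # random recurrent connections
evals = np.linalg.eigvals(J)                     # complex eigenvalue spectrum

print((evals.real > 1).sum(), "modes with Re(EV) > 1")  # sustained/chaotic modes
plt.scatter(evals.real, evals.imag, s=2)
plt.axvline(1.0, ls='--')    # boundary between decaying and sustained modes
plt.xlabel('Re(EV)'); plt.ylabel('Im(EV)')
plt.show()
```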

Figure 9: (a) Change in the EigenValue spectra of the $J^{fast}$ and $W^{slow}$ connections of an autonomous network. Note, the $J^{fast}$ spectrum (corresponding to the connections imported from the driven model) remains constant, while the $W^{slow}$ spectrum changes with varying $B$ values. (b) Impact of variation in $B$, representative of stable/chaotic reservoir dynamics, on the accuracy for the 2-class (BenchPress/GolfSwing) 1-training-example problem (Fig. 6). The minimum and maximum accuracy observed across 5 different trials, where each trial uses a different video for training, is also shown.

Since our autonomous models have two kinds of synapses with different time constants, we observe two different EV spectra that contribute concurrently to the dynamical state of the auto network. While the overall reservoir dynamics are dictated by the effective contribution of both connections, $W^{slow}$ (due to its larger synaptic decay time constant) has a slightly more dominant effect. Fig. 9 (a) illustrates the EV spectra corresponding to the bottom-scan autonomous 800N reservoir model simulated earlier for the 2-class problem (BenchPress/GolfSwing) with just 1 training video (refer to Fig. 6). The EVs of $J^{fast}$ have a considerable number of modes with Re(EV) > 1, which is characteristic of chaotic dynamics. Hence, as discussed above, we need to compensate for the overwhelming chaotic activity with modes that exhibit fixed points, to obtain a reliable memory model. This can be attained by adjusting the $W^{slow}$ connections. From Fig. 9 (a), we see that as the range of $B$ (the random constant used to derive the $W^{slow}$ connections from the driven model) increases, the EV spectral radius of the $W^{slow}$ connections grows. For the smallest range of $B$, the EVs are concentrated at the center. Such networks have almost no learning capability, as the dynamics converge to a single point that produces the same time-independent pattern of activity for any input stimulus. As the spectral circle enlarges, the number of modes with Re(EV) > 1 also increases. For an intermediate range of $B$ ([0, 0.15] in our experiments), the number of fixed-point modes reaches a reasonable level that, in coherence with the chaotic activity due to $J^{fast}$, results in generalized reservoir learning for variable inputs. Increasing $B$ further expands the spectral radius of $W^{slow}$ (with more modes at Re(EV) > 1), which adds to the chaotic activity generated by $J^{fast}$. In such scenarios, the reservoir will not be able to discern between inputs that look similar, due to persistent chaos. This is akin to the observations about the need for a reservoir to strike a balance between pattern approximation and separability [18].

Fig. 9 (b) demonstrates the accuracy variation of the same 2-class 1-training-example learning model for different values of $B$. We repeated the experiment 5 times, choosing a different training video per class in each case. As expected, the accuracy for a fixed-state reservoir with the smallest $B$ is very low. As the number of modes increases, the accuracy improves, justifying the relevance of both stable and chaotic activity for better learning. As the chaotic activity begins to dominate with increasing $B$, the accuracy starts declining. We also observe that the overall accuracy trend (for the 2-class 1-training-example model), represented by the min-max accuracy bar, is consistent even when the training videos are varied. This indicates that the reservoir accuracy is sensitive to the $B$-initialization and is not biased toward any particular training video in our limited-training-example learning scheme. While this empirical result corroborates our theoretical analysis and justifies the effectiveness of the D/A approach for reservoir model construction, further investigation is needed to gauge the other advantages/disadvantages of the proposed approach. Please refer to [19, 18] for more clarification on random matrices and the implications of EV spectra for dynamical reservoir models.

Comparison with State-of-the-Art Recognition Models

Deep Learning Networks (DLNs) are the current state-of-the-art learning models and have made major advances in several recognition tasks [20, 2]. While these large-scale networks are very powerful, they inevitably require large training datasets and enormous computational resources. Spiking networks offer an alternative solution by exploiting event-based, data-driven computation, which makes them attractive for deployment on real-time neuromorphic hardware where power consumption and speed are vital constraints [21, 22]. While research efforts in spiking models have soared in the recent past [23, 24, 25, 26], such models are not as accurate as their artificial counterparts (or DLNs). From our perspective, this work provides a new standard establishing the effectiveness of spiking models and their inherent timing-based processing for action recognition in videos.

Fig. 10 compares our reservoir model to state-of-the-art recognition results [27, 28, 29]. We simulated a VGG-16 [30] DLN architecture that only processes the spatial aspects of the video input; that is, we ignore the temporal features of the video data and attempt to classify each clip by looking at a single frame. We observe that the spatial VGG-16 model yields reduced accuracy even when presented with the full training dataset (and lower accuracy still with just 8 training examples per class), implying the detrimental effect of disregarding the temporal statistics of the data. Please refer to the supplementary information (see Supplementary Fig. S3) for the efficiency vs. accuracy tradeoff between our proposed reservoir and the spatial VGG-16 model. The static hierarchical/feedforward nature of DLNs is a major drawback for sequential (such as video) data processing. While deep networks extract good, flexible representations from static images, the addition of a temporal dimension to the input requires the model to incorporate recurrent connections that can capture the underlying temporal correlation in the data.

Recent efforts have integrated Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units with DLNs to process temporal information [28, 31]. In fact, the state-of-the-art accuracy on UCF-101 is reported in [29], as illustrated in Fig. 10, where a two-stream convolutional architecture with two VGG-16 models is used, one for spatial and another for temporal information processing. While the spatial VGG model is trained on the real-valued RGB data, the temporal VGG model uses optical flow stacking to capture the correlated motion. Here, both models are trained on the entire UCF-101 dataset, which significantly increases the overall training complexity. All models noted in Fig. 10 (including the LSTM model) use the temporal information in the input data by computing flow features and subsequently training the deep learning/recurrent models on such inputs. The overall accuracy shown in Fig. 10 is the combined prediction obtained from the spatial RGB and temporal flow models. The incorporation of temporal information drastically improves a DLN's accuracy.

Fig. 10 also quantifies the benefits of our reservoir approach compared to the state-of-the-art models from a computational efficiency perspective. We quantify efficiency in terms of the total number of resources utilized while training a particular model. The overall resource utilization is defined as the product Number of (tunable) Parameters × Number of training Data Examples. The number of data examples is calculated as Number of Frames per video × Number of Training videos (9537 total videos for the full training data). Referring to the results for the spatial VGG-16 model, we observe that the reservoir (10K-8% connectivity) is more efficient, while yielding significantly higher accuracy, in the iso-data scenario where both models are shown 8 video samples per class. The efficiency improves further in the iso-accuracy scenario, wherein the reservoir still uses 8 video samples per class (808 total videos) while the VGG-16 model requires the entire training dataset (9537 total videos) during training. Compared with the models that include temporal information, our reservoir (10K-8% connectivity), although lower in accuracy, continues to yield significantly higher efficiency benefits than the Two-stream architecture and the Temporal Convolutional network, on account of being trained on limited data. However, in the limited-training-example scenario, the reservoir (81.3% accuracy) continues to outperform an LSTM based machine learning model (67% accuracy) while requiring less training data. It is worth mentioning that our reservoir, due to its inherent temporal processing capability, requires only 30 epochs for training, while the spatial VGG-16 model needs 250 epochs. As the number of epochs required for training increases, the overall training cost (which can be defined as Number of Epochs × Resource Utilization) increases correspondingly. This further demonstrates the potential of reservoir computing for temporal data processing.
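As a back-of-the-envelope illustration of the resource-utilization metric, the sketch below counts the reservoir's trainable readout weights across the 5 autonomous models (10K neurons, 101 outputs, all from the text) and assumes the 300-step presentation window as the per-video frame count; the exact frame counts behind Fig. 10 are not reproduced here:

```python
# Resource utilization = parameters x data examples (sketch, assumed numbers).
n_reservoir, n_classes, n_models = 10_000, 101, 5
params = n_reservoir * n_classes * n_models        # 5.05M trainable weights
videos = 8 * n_classes                             # 808 training videos
frames_per_video = 300                             # assumption: presentation window
resource_util = params * frames_per_video * videos
print(f"{resource_util / 1e9:.1f}B")               # in billions, as in Fig. 10
```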

Figure 10: Accuracy and efficiency comparison with state-of-the-art recognition models. Resource Utilization (B) is the product Number of Parameters × Number of Data Examples, where B denotes billions. The model efficiency is calculated by normalizing the net resource utilization of each model against our reservoir model (10K-8% connectivity). Full Training (in the column corresponding to # Training Videos) denotes the entire training dataset, which comprises 9537 total videos across the 101 classes.

Discussion

In this paper, we demonstrated the effectiveness of a spiking reservoir computing architecture for learning from limited video examples on the UCF101 action recognition task. We adopted the Driven/Autonomous reservoir construction approach, originally proposed for neuroscientific use, and gave it a new dimension to perform multi input-output transformations for practical recognition tasks. In fact, the D/A approach opens up novel possibilities for multi-modal recognition. For instance, consider a 5-class face recognition scenario. Here, the driven network is trained to produce 5 different desired activity patterns. The auto models derived from the driven network process the facial data to perform classification. Now, the same driven model can also be used to derive autonomous models that work on, say, voice data. Combining the output results across an ensemble of autonomous models would boost the overall confidence of the classification. Apart from performance, the D/A model also has several hardware-oriented advantages. Besides sparse, power-efficient spike communication, the reservoir's internal topology across all autonomous models for a particular problem remains equivalent. This provides the huge advantage of reusability of network structures when scaling from one task to another.

Our eigenvalue spectra analysis of the reservoir provides a key insight for D/A based reservoir construction: the complex dynamics can converge to fixed memory states. We believe that this internal stabilization enables the reservoir to generalize its memory from a few action signatures. We also compared our reservoir framework with state-of-the-art models and reported the cost/accuracy benefits gained from our model. We would like to note that our work delivers a new benchmark for action recognition from limited training videos in the spiking scenario. We believe that we can further improve the accuracy of our proposed approach by employing better input encoding techniques. The scan-based filtering approach discussed in this paper does not account for depth variations and is susceptible to extreme spatial/rotational variation. In fact, we observe that our reservoir model yields higher accuracy for classes with more consistent action signatures and finds it difficult to recognize classes with more variation in movement (refer to Supplementary Fig. S1 for action-specific accuracies). Using an adequate spike processing technique (encompassing depth based filtering, motion based tracking, or a fusion of these) will be key to achieving improved accuracy. Another intriguing possibility is to combine our spike based model, which can rapidly recognize simpler and repeatable actions (e.g., GolfSwing) with minimal training samples, with DLN models for those actions that have several variations (e.g., Knitting), actions that appear different depending on the viewpoint, or both. We are also studying the extension of our SNN model to a hierarchical model that could capture more complex and subtle aspects of the spatio-temporal signatures of these actions.

Finally, our reservoir model, although biologically inspired, has only excitatory components. The brain's recurrent circuitry consists of both excitatory and inhibitory neurons that operate in a balanced regime, giving it the staggering ability to make sense of a complex and ever-changing world in the most energy-efficient manner [32]. In the future, we will concentrate on incorporating an inhibitory scheme that will further augment the sparseness of the reservoir activity, thereby improving the overall performance. An interesting line of research in this regard can be found in [7, 33]. Although there have been multiple efforts in the spiking domain exploring hierarchical feedforward and reservoir architectures with biologically plausible notions, most of them have focused on static images (and some on voice recognition) and use the entire training dataset to yield reasonable results [34, 35]. To that effect, this work provides a new perspective on the power of temporal processing for rapid recognition in cognitive applications using spiking networks. In fact, we are currently exploring the use of this model for learning at the edge (such as on cell phones and sensors, where energy is of paramount importance) using a recently developed neural chip called Loihi [36], which enables energy-efficient, on-chip, spike-timing based learning. We believe that the results of this work could be leveraged to enable Loihi based applications such as video analytics, video annotation and rapid searches for actions in video databases.

References

  • 1. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
  • 2. Mnih, V. et al. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
  • 3. Silver, D. et al. Mastering the game of go with deep neural networks and tree search. Nature 529, 484–489 (2016).
  • 4. Lake, B., Salakhutdinov, R., Gross, J. & Tenenbaum, J. One shot learning of simple visual concepts. In Proceedings of the Cognitive Science Society, vol. 33 (2011).
  • 5. Maass, W. Liquid state machines: motivation, theory, and applications. Computability in context: computation and logic in the real world 275–296 (2010).
  • 6. Lukoševičius, M. & Jaeger, H. Reservoir computing approaches to recurrent neural network training. Computer Science Review 3, 127–149 (2009).
  • 7. Srinivasa, N. & Cho, Y. Unsupervised discrimination of patterns in spiking neural networks with excitatory and inhibitory synaptic plasticity. Frontiers in computational neuroscience 8, 159 (2014).
  • 8. Wehr, M. & Zador, A. M. Balanced inhibition underlies tuning and sharpens spike timing in auditory cortex. Nature 426, 442–446 (2003).
  • 9. Abbott, L., DePasquale, B. & Memmesheimer, R.-M. Building functional networks of spiking model neurons. Nature neuroscience 19, 350 (2016).
  • 10. Masquelier, T., Portelli, G. & Kornprobst, P. Microsaccades enable efficient synchrony-based coding in the retina: a simulation study. Scientific reports 6, 24086 (2016).
  • 11. Haykin, S. S. Adaptive filter theory (Pearson Education India, 2008).
  • 12. Sussillo, D. & Abbott, L. F. Generating coherent patterns of activity from chaotic neural networks. Neuron 63, 544–557 (2009).
  • 13. Ponulak, F. & Kasiński, A. Supervised learning in spiking neural networks with resume: sequence learning, classification, and spike shifting. Neural Computation 22, 467–510 (2010).
  • 14. Eliasmith, C. A unified approach to building and controlling spiking attractor networks. Neural computation 17, 1276–1314 (2005).
  • 15. Soomro, K., Zamir, A. R. & Shah, M. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
  • 16. Hu, Y., Liu, H., Pfeiffer, M. & Delbruck, T. Dvs benchmark datasets for object tracking, action recognition, and object recognition. Frontiers in Neuroscience 10, 405 (2016). URL http://journal.frontiersin.org/article/10.3389/fnins.2016.00405.
  • 17. Li, H., Liu, H., Ji, X., Li, G. & Shi, L. Cifar10-dvs: An event-stream dataset for object classification. Frontiers in Neuroscience 11 (2017).
  • 18. Rajan, K. Spontaneous and stimulus-driven network dynamics (Columbia University, 2009).
  • 19. Rajan, K. & Abbott, L. Eigenvalue spectra of random matrices for neural networks. Physical review letters 97, 188104 (2006).
  • 20. Cox, D. D. & Dean, T. Neural networks and neuroscience-inspired computer vision. Current Biology 24, R921–R929 (2014).
  • 21. Han, S., Pool, J., Tran, J. & Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 1135–1143 (2015).
  • 22. Merolla, P. A. et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 345, 668–673 (2014).
  • 23. Maass, W. Energy-efficient neural network chips approach human recognition capabilities. Proceedings of the National Academy of Sciences 113, 11387–11389 (2016).
  • 24. Masquelier, T. & Thorpe, S. J. Unsupervised learning of visual features through spike timing dependent plasticity. PLoS computational biology 3, e31 (2007).
  • 25. Srinivasa, N. & Cho, Y. Self-organizing spiking neural model for learning fault-tolerant spatio-motor transformations. IEEE transactions on neural networks and learning systems 23, 1526–1538 (2012).
  • 26. Panda, P. & Roy, K. Unsupervised regenerative learning of hierarchical features in spiking deep networks for object recognition. In Neural Networks (IJCNN), 2016 International Joint Conference on, 299–306 (IEEE, 2016).
  • 27. Simonyan, K. & Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, 568–576 (2014).
  • 28. Srivastava, N., Mansimov, E. & Salakhutdinov, R. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 843–852 (2015).
  • 29. Feichtenhofer, C., Pinz, A. & Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1933–1941 (2016).
  • 30. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • 31. Donahue, J. et al. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2625–2634 (2015).
  • 32. Herculano-Houzel, S. The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost. Proceedings of the National Academy of Sciences 109, 10661–10668 (2012).
  • 33. Brendel, W., Bourdoukan, R., Vertechi, P., Machens, C. K. & Denève, S. Learning to represent signals spike by spike. arXiv preprint arXiv:1703.03777 (2017).
  • 34. Kheradpisheh, S. R., Ganjtabesh, M. & Masquelier, T. Bio-inspired unsupervised learning of visual features leads to robust invariant object recognition. Neurocomputing 205, 382–392 (2016).
  • 35. Diehl, P. U. et al. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In Neural Networks (IJCNN), 2015 International Joint Conference on, 1–8 (IEEE, 2015).
  • 36. Morris, J. http://www.zdnet.com/article/why-intel-built-a-neuromorphic-chip/ (2017).

Acknowledgements

We gratefully acknowledge Intel Corporation for supporting Priyadarshini Panda through a summer internship in the Microarchitecture Research Labs. We also thank Michael Mayberry and Daniel Hammerstrom for their helpful comments and suggestions.

Supplementary Information

Fig. S1 shows the class-wise accuracy obtained with our reservoir approach. Certain classes are learned more accurately than others. This is generally the case for classes with more consistent action signatures and less variation, such as GolfSwing and PommelHorse. On the other hand, classes with more variation in the action, the viewpoint, or both, such as TennisSwing and CliffDiving, yield lower accuracy. Likewise, classes containing many moving subjects, such as MilitaryParade and HorseRacing, exhibit lower accuracy: in such cases, the input processing technique may fail to extract motion from all relevant moving subjects, and irrelevant clutter or jitter from camera motion gets captured in the spiking information, leaving the reservoir unable to recognize a consistent signature. Such limitations could be mitigated by a more capable processing technique that incorporates depth- and/or motion-based tracking. Data augmentation with simple transformations can also improve accuracy. For instance, applying a 180° rotation about the vertical axis (a left/right mirror) to all the frames of a training video yields a different view of the same action, covering both left- and right-handed variants; a minimal sketch of this transformation is given below. Such augmentation accounts for more variation in views/angles, enabling the reservoir to learn better.
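As an illustration of the mirroring transformation just mentioned, the following is a minimal NumPy sketch (our own illustration, not the input-processing pipeline used in the paper) that produces the left/right mirrored copy of a video clip:

```python
# Minimal illustration of mirror-based video augmentation; this is our own
# sketch, not the paper's input-processing pipeline.
import numpy as np

def mirror_video(frames: np.ndarray) -> np.ndarray:
    """Flip every frame about the vertical axis.

    frames: array of shape (num_frames, height, width, channels).
    The mirrored clip shows the same action with the opposite
    handedness, giving a second training view for free.
    """
    return frames[:, :, ::-1]

# Usage: train on both the original clip and its mirrored copy.
clip = np.zeros((16, 240, 320, 3))            # dummy 16-frame clip
training_views = [clip, mirror_video(clip)]
```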

Fig. S2 shows the confusion matrix for the 101-class problem with the same reservoir topology as in Fig. S1.

Figure S1: Class-wise accuracy obtained with the 8000-neuron (10% connectivity) model on the UCF-101 dataset when trained with 8 videos per class. A few select classes with high/low accuracy are highlighted in green/red font to indicate the classes with more/less consistent videos.
Figure S2: Confusion matrix for the 101-class problem corresponding to Fig. S1 above. The X/Y axes denote class numbers, with the same notation as in Fig. S1. The colormap indicates the accuracy for a given class label: the highest intensity (red) corresponds to a 100% match between target and output class, and the lowest intensity (blue) to a 0% match.

Here, we further quantify the advantages of our reservoir approach over the spatial VGG-16 model described in the main manuscript by gauging the efficiency vs. accuracy tradeoff of both models in equivalent training scenarios. Fig. S3 (a) shows the accuracy obtained in the different training scenarios. For scenario 1, wherein the entire training dataset is used, the Top-1 accuracy obtained is , implying the detrimental effect of disregarding the temporal statistics in the data. We then test the VGG model in two limited-example scenarios: one equivalent to our reservoir learning, with 8 training examples per class, and the other with 40 training examples. In both cases the accuracy drops drastically, with the 8-example scenario yielding merely  accuracy. Increasing the number of training examples to 40 improves the accuracy to , but the performance remains much lower than that of the reservoir model.

A noteworthy observation in Fig. S3 (a) is that the Top-5 accuracy of the VGG-16 model trained on the entire dataset (89.8%) is comparable to that of the 8000N, 10%-connectivity reservoir trained on 8 videos per class (86.1%; refer to Fig. 8 (a) in the main manuscript). In contrast, in the limited-training scenarios, the Top-5 accuracy of VGG is drastically lower than that of the reservoir models shown in Fig. 8 (a) of the main manuscript. This is indicative of the DLN's inferior generalization capability on temporal data.

Fig. S3 (b) shows the normalized benefits of the different reservoir topologies (Fig. 8 (a) in the main manuscript) relative to the spatial VGG-16. Here, we quantify efficiency in terms of the total number of trainable parameters (or weights) in a given model. The VGG-16 model for 101 classes has 97.5M tunable parameters, against which all other models in Fig. S3 (b) are normalized. We observe an improvement in efficiency with the reservoir model in the regime where VGG and the reservoir (a 4000-neuron reservoir with 20% connectivity) have almost similar accuracy . Notably, VGG-16 in this case was trained on the entire UCF-101 training data, while our reservoir had only 8 samples per class for training; this underscores the potential of reservoir computing for temporal data processing. In an iso-data scenario (i.e., 8 videos per class for both the VGG and reservoir models), the reservoirs yielding the maximal accuracy of  continue to be more efficient by  than the VGG model. A toy calculation illustrating this parameter-count comparison is sketched below.
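To make the efficiency metric concrete, the following toy calculation compares trainable-parameter counts. It assumes the reservoir trains only a linear readout of size reservoir_size × num_classes, which is the standard liquid-state-machine setup but may not match the paper's exact readout parameterization, so the resulting ratio is illustrative only:

```python
# Toy comparison of trainable parameters; assumes the reservoir trains only
# a linear readout (reservoir_size x num_classes), which may not match the
# paper's exact readout parameterization.
VGG16_PARAMS = 97.5e6  # tunable parameters reported for the 101-class VGG-16

def readout_params(reservoir_size: int, num_classes: int = 101) -> int:
    """Trainable weights of a linear readout; reservoir weights stay fixed."""
    return reservoir_size * num_classes

def efficiency_vs_vgg(model_params: float) -> float:
    """How many times fewer trainable weights a model has than VGG-16."""
    return VGG16_PARAMS / model_params

print(efficiency_vs_vgg(readout_params(4000)))  # ~241x for a 4000-neuron reservoir
```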

Figure S3: (a) Top-1/3/5 accuracy obtained with the VGG-16 deep learning model on the 101 classes for the different training scenarios. (b) Efficiency (quantified as the total number of tunable parameters) comparison between the VGG-16 model and D/A reservoir models of different topologies (from Fig. 8 (a) in the main manuscript) in the iso-accuracy and iso-data (same number of training examples) scenarios. The accuracy/topology of each model is noted on the graph.

To implement the VGG-16 convolutional network model, we used the machine learning tool Torch [37], along with the standard Dropout [38] regularization technique to optimize training for maximum performance. For the DLN model, we feed raw pixel-valued RGB data as inputs for frame-by-frame training, thereby providing the model with all available information about the data. The input video frames, originally of dimensions , are resized to . The macro-architecture of the VGG-16 model was imported from [39] and altered for 101-class classification. To ensure a sufficient number of training examples for the deep VGG-16 model in the limited-example learning cases, we augmented the data via two random transformation schemes: rotations in the range of , and width/height shifts within  of the total width/height. In the 8/40-training-example scenarios, each frame of a training video was augmented 10/2 times, respectively, using the above transformations (see the sketch below). The VGG-16 model in each training scenario described above is trained for 250 epochs, where in each epoch the entire limited/full training dataset (i.e., the individual frames for all classes) is presented to the VGG model. Note that UCF-101 provides three recommended train/test splits; we use only split #1 for all of our experiments.
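The following is a minimal sketch of the frame-augmentation scheme just described, written with Keras' ImageDataGenerator purely for illustration: the original experiments used Torch, and the numeric ranges below are hypothetical placeholders rather than the values used in the experiments.

```python
# Illustrative sketch only: the original experiments used Torch, and the
# transformation ranges below are hypothetical placeholders, not the
# values used in the paper.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=10,       # hypothetical: random rotations within +/-10 degrees
    width_shift_range=0.1,   # hypothetical: horizontal shifts up to 10% of width
    height_shift_range=0.1,  # hypothetical: vertical shifts up to 10% of height
)

def augment_frame(frame, copies):
    """Return `copies` randomly transformed versions of a single video frame
    (10 copies in the 8-example scenario, 2 in the 40-example scenario)."""
    return [augmenter.random_transform(frame) for _ in range(copies)]
```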

References

  • 37. Collobert, R., Kavukcuoglu, K. & Farabet, C. Torch7: A MATLAB-like environment for machine learning. In BigLearn, NIPS Workshop, EPFL-CONF-192376 (2011).
  • 38. Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research 15, 1929–1958 (2014).
  • 39. Zagoruyko, S. http://torch.ch/blog/2015/07/30/cifar.html (2015).