Layer Flexible Adaptive Computation Time for Recurrent Neural Networks
Abstract
Deep recurrent neural networks perform well on sequence data and are the model of choice for such tasks. However, it is a daunting task to decide the number of layers, especially considering the different computational needs of tasks of different difficulties within a sequence. We propose a layer flexible recurrent neural network with adaptive computation time and expand it to a sequence to sequence model. Contrary to the adaptive computation time model, our model has a dynamic number of transmission states which vary by step and sequence. We evaluate the model on a financial data set and Wikipedia language modeling. Experimental results show performance improvements of 8% to 12% and demonstrate the model's ability to dynamically change the number of layers.
1 Introduction
Recurrent neural networks (RNN) are widely used in supervised machine learning tasks due to their superior performance on sequence data, such as machine translation [1, 22], speech recognition [13, 14], image description generation [18, 24], and music generation [7]. The design of the underlying network is always a daunting task requiring substantial computational resources and experimentation. Many recent breakthroughs hinge on multilayer neural networks' ability to increase model accuracy [16, 26, 32], leading to the important decision in RNNs of the number of layers to use. First, the right choice requires running several very expensive training processes to try many different numbers of layers. Even if a reinforcement learning algorithm is used to determine a good number of layers [3, 36], a substantial training effort is still required. The second issue with the number of layers in RNNs is the fact that the same number of layers is applied to each step in each sequence, and the number is the same for every sample. It is conceivable that some samples are harder to classify than others, and thus such harder samples should employ more layers. A similar argument holds for steps: certain steps in a sample can bear less predictive power and thus should use fewer layers in order to decrease the computational burden. The goal of our work is to introduce a network that automatically determines the number of layers, and with it the number of hidden vectors, to use in training and inference, dynamically with respect to samples and step number.
To resolve the inherent problems of fixed structure neural networks, Graves [12] provides an Adaptive Computation Time (ACT) model for RNNs. In Graves' model, a sigmoidal halting unit is utilized to calculate a halting probability for each intermediate round within a step, and computation stops when the accumulated halting probability reaches or exceeds a threshold. ACT can utilize multiple computation rounds within each individual step and can dynamically adapt to different samples and steps. The model is appealing due to its modeling flexibility and its advantage in increasing model accuracy [10]. However, unlike multilayer networks, ACT utilizes a single hidden vector and thus lacks the information transmission abilities of deep networks. With the ACT mechanism, when a step of computation is halted, all intermediate states and outputs are used to calculate one mean-field state and output. As a result, ACT cannot efficiently represent functions of former hidden states and inputs as a multilayer network can, due to its limited capacity. Our experimental results show that ACT has marginal benefits over basic RNN or sequence to sequence (seq2seq) models. Therefore, in order to obtain the benefits of both ACT and a multilayer network, we develop a layer flexible RNN model with adaptive computation time. The model uses several rounds in each step, similar to ACT, but it also uses a flexible number of hidden states between two consecutive steps.
The novelty of our proposed model is its focus on learning the rules of transmitting states of different layers between two consecutive steps. Similar to Graves' work, we also utilize a unit to determine the action of each round within a step by calculating halting probabilities. Instead of using a single hidden vector, a step in our model produces multiple hidden states (one state per round within the step). These multiple hidden states are then combined into a possibly different number of hidden states for the next step using attention ideas [2, 23] (the number of new hidden states equals the number of rounds in the next step). The network can thus have a flexible number of layers according to adaptive computation time in each step. We also develop several strategies for combining hidden states between two steps. Our model improves accuracy by 7% to 8% on a financial data set and by 12% on Wikipedia language modeling, which attests to its robustness.
Our main contributions are as follows.

An RNN model with flexible hidden layers is proposed with adaptive computation time.

A seq2seq model applying the layer flexible RNN in each step. We note that ACT has been developed in the RNN setting but not in the seq2seq setting.
The rest of the manuscript is structured as follows. In Section 2 we review the literature. In Section 3, the layer flexible adaptive computation time RNN model is presented, including all of the alternative options. In Section 4 we introduce the data sets and discuss the experimental results. Section 5 concludes.
2 Literature Review
Deep learning models and algorithms have many hyperparameters. In an RNN, one of the problems is deciding the amount of computation for a given input sequence. A simple solution is comparing networks of different depths and manually selecting the best option, but a series of expensive training processes is required to make the right decision. Hyperparameter optimization [6, 5] and Bayesian optimization [31, 25, 29] have been proposed to select an efficient architecture of a network. Based on these concepts, Zoph [36] and Baker [3] propose mechanisms for network configuration using reinforcement learning. However, massive training efforts are still required. Another problem of such approaches is the assumption of a fixed structure of the network, irrespective of the underlying sample and step. The difficulty of classification varies in each data set and sample, and it is understandable that harder samples would require more computation. Therefore, applying networks with the same number of layers is inflexible and cannot achieve the goal of flexible computation time among different samples. Conditional computation provides general ideas for alleviating the weaknesses of a fixed-structure deep network by establishing a learning policy [9, 4]. A halt neuron is designed and used as an activation threshold in self-delimiting neural networks [30, 33] to stop an ongoing computation whenever it reaches or exceeds the halting threshold. Work [34] shows that conditional computation helps networks obtain adaptive depth and thus yield higher accuracy than fixed-depth structures. Graves [12] introduces an Adaptive Computation Time (ACT) mechanism for RNNs to dynamically calculate the computation time of each input step and determine the halting condition.
These lines of work focus on formulating the policies of halting conditions and use a single hidden vector in each cell; none of them contribute to designing flexible multilayer networks or study learning the rules of state transmission.
The ACT mechanism [12] has been shown to improve performance and has been applied to a few different problems. Universal Transformers [10] apply ACT on a self-attentive RNN to automatically halt computation. A dynamic time model for visual attention [21] accelerates processing time by adding a binary action at each step to determine whether to continue or stop. Figurnov et al. [11] show that applying ACT to Residual Networks can dynamically choose the number of evaluated layers, and propose spatially adaptive computation time for Residual Networks in image processing to adapt the computation amount between spatial positions. Similarly, Neumann et al. [27] extend ACT to a recognizing textual entailment task. In addition, ACT is also applied to reduce computation cost and calculate computation time in speech recognition [20], image classification [19], natural language processing [35], and highway networks [28]. These models simply apply the ACT mechanism on top of other models to achieve adaptive halting of computation. They focus on solving their specific problems but do not make any change to the structure of ACT cells. In contrast, our work concentrates on the inner design of a layer flexible ACT cell, which can automatically and dynamically adapt the number of layers.
3 Model
We start with an explanation of RNN and ACT. A standard RNN contains three layers: the input layer, the hidden layer, and the output layer. The input layer receives the input sequence $x = (x_1, \dots, x_T)$ and transmits it to the hidden layer to compute the hidden states $s_t$. The output layer calculates the output $y_t$ based on the updated state of each step. The equations are as follows:

$$s_t = \mathcal{S}(s_{t-1}, x_t), \qquad y_t = \phi(W_y s_t + b_y).$$

In step $t$, input $x_t$ from the input sequence is delivered to the network. A cell $\mathcal{S}$ in the hidden layer uses the input $x_t$ and the state $s_{t-1}$ from the previous step to update the hidden state $s_t$ in the current step. Long Short-Term Memory (LSTM) [17] and Gated Recurrent Unit (GRU) [8] cells are frequently applied in the hidden layer as $\mathcal{S}$; they contain the dynamic computation information and the activation of the hidden cells. The output $y_t$ is computed utilizing an output weight $W_y$, an output bias $b_y$, and an activation function $\phi$.
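As an illustration, one such RNN step can be sketched in NumPy as follows (a minimal sketch; the weight names and the choice of tanh and sigmoid activations are our own, not prescribed by the text):

```python
import numpy as np

def rnn_step(x_t, s_prev, W_x, W_s, b, W_y, b_y):
    """One step of a vanilla RNN: update the hidden state from the
    current input and the previous state, then compute the step output."""
    s_t = np.tanh(W_x @ x_t + W_s @ s_prev + b)        # state update S(s_{t-1}, x_t)
    y_t = 1.0 / (1.0 + np.exp(-(W_y @ s_t + b_y)))     # output with sigmoid activation
    return s_t, y_t
```

In practice, the cell body (here a plain tanh update) would be replaced by an LSTM or GRU cell as described above.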
ACT extends the standard RNN. The hidden layer performs several rounds of computation within each step, and each round produces an intermediate state and output. The intermediate states and intermediate outputs are represented as follows:

$$s_t^n = \begin{cases} \mathcal{S}(s_{t-1}, x_t^1) & n = 1, \\ \mathcal{S}(s_t^{n-1}, x_t^n) & n > 1, \end{cases} \qquad y_t^n = \phi(W_y s_t^n + b_y).$$

The first hidden cell in step $t$ receives the state $s_{t-1}$ from the previous step and computes the first intermediate state. All the following rounds of computation use the previous intermediate state $s_t^{n-1}$ and produce an updated state $s_t^n$. To distinguish different rounds of computation, a flag is augmented to the input for the first round ($x_t^1$) and another flag is added for all others ($x_t^n$, $n > 1$). Each intermediate output $y_t^n$ is computed from the intermediate state in the same round.
To determine the halting condition of a series of rounds of computation, halting units $h_t^n$ are introduced in each computation round as $h_t^n = \sigma(W_h s_t^n + b_h)$. Here $W_h$ is the halting weight and $b_h$ is the halting bias.
The total computation time $N(t)$ in a step is decided by the halting units and the maximum threshold $L$. Whenever the accumulated halting units' value in a step reaches $1 - \varepsilon$ or the computation time reaches $L$, the computation halts. The total computation time is defined as

$$N(t) = \min\left\{ \min\left\{ n' : \sum_{n=1}^{n'} h_t^n \ge 1 - \varepsilon \right\},\; L \right\}, \tag{1}$$

where $\varepsilon$ is a small hyperparameter.
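A minimal sketch of how (1) can be evaluated for a single step, given the per-round halting units (function and argument names are ours):

```python
import numpy as np

def halting_time(h, eps=0.01, L=5):
    """Computation time N(t) per (1): halt at the first round where the
    cumulative halting units reach 1 - eps, capped at the maximum L."""
    cum = np.cumsum(np.asarray(h, dtype=float)[:L])
    reached = np.nonzero(cum >= 1.0 - eps)[0]
    return int(reached[0]) + 1 if reached.size else L
```

For example, halting units (0.6, 0.5, ...) halt after two rounds, since 0.6 + 0.5 already exceeds the threshold.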
ACT uses all the intermediate states and outputs to calculate one mean-field state and output for each step, as represented in (2) and (3) below. A probability $p_t^n$ produced from the halting units is introduced into ACT for calculating the mean-field state and output according to the contribution of each intermediate computation round in a step:

$$s_t = \sum_{n=1}^{N(t)} p_t^n s_t^n, \tag{2}$$

$$y_t = \sum_{n=1}^{N(t)} p_t^n y_t^n, \tag{3}$$

where $p_t^n = h_t^n$ for $n < N(t)$ and $p_t^{N(t)} = R(t) = 1 - \sum_{n=1}^{N(t)-1} h_t^n$ is the remainder. The updated mean-field state $s_t$ is transmitted to the next input step and the output $y_t$ is delivered to the output layer as the current step's output.
Given an input sequence $x$, the ACT model tends to compute as much as possible in each step to avoid making erroneous predictions and incurring errors. This causes extra computational expense and impedes the goal of adapting computation time. Therefore, training the model to decrease the amount of computation becomes necessary. ACT introduces the ponder cost $\mathcal{P}(x) = \sum_t \left( N(t) + R(t) \right)$ to represent the total computation time during the input sequence. The loss function, with $y$ being the ground truth, is modified to encourage the network to also minimize $\mathcal{P}(x)$:

$$\hat{\mathcal{L}}(x, y) = \mathcal{L}(x, y) + \tau \mathcal{P}(x), \tag{4}$$

where $\tau$ is a hyperparameter (the time penalty) that balances the ponder cost against prediction errors.
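The mean-field combination in (2)-(3), the remainder, and the ponder cost penalized in (4) can be sketched together as follows (a hedged sketch; array layouts and names are ours):

```python
import numpy as np

def mean_field(h, states, outputs):
    """Mean-field state and output for one step per (2)-(3): weights are
    p_t^n = h_t^n for n < N(t), and the remainder R(t) = 1 - sum(h) for
    the halting round. Returns the combined state, output, and R(t)."""
    N = len(states)
    R = 1.0 - float(np.sum(h[:N - 1]))
    p = np.append(np.asarray(h[:N - 1], dtype=float), R)  # halting distribution
    s = p @ np.asarray(states)                            # mean-field state (2)
    y = p @ np.asarray(outputs)                           # mean-field output (3)
    return s, y, R

def ponder_cost(Ns, Rs):
    """P(x) = sum_t (N(t) + R(t)), the term weighted by tau in (4)."""
    return float(np.sum(Ns) + np.sum(Rs))
```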
3.1 Layer Flexible Adaptive Computation Time Recurrent Neural Network
In this section, our Layer Flexible Adaptive Computation Time (LFACT) model is introduced. The main idea of LFACT is to dynamically adjust the number of layers according to the inherent characteristics of different inputs and to efficiently transmit each layer's information to the same layer in the next step. Differing from ACT, where only the mean-field state in (2) is transmitted to the next step, which can be viewed as a single-layer network, LFACT is designed to transmit each layer's state individually between every pair of consecutive steps. In LFACT we compute $N(t)$ and $h_t^n$ as in ACT. Each cell (layer $n$) in step $t$ takes $x_t$ and a combined state $d_t^n$ as input and creates the primary state $s_t^n$ for $n = 1, \dots, N(t)$. Vector $d_t^n$ is computed from the output of the previous cell and the hidden state from the previous step in the same layer $n$. The problem is that at step $t$ we produce $s_t^n$ for $n = 1, \dots, N(t)$, but at step $t+1$ we need states for $n = 1, \dots, N(t+1)$. The key of our model is to use the attention principle to create the transmission states $\tilde{s}_{t+1}^n$ from $s_t^1, \dots, s_t^{N(t)}$. Figure 1 depicts the model.
The representation of the LFACT model is as follows:

$$d_t^n = g(s_t^{n-1}, \tilde{s}_t^n), \qquad s_t^n = \mathcal{S}(d_t^n, x_t), \qquad n = 1, \dots, N(t).$$

The LFACT model contains two types of states. One is the primary state $s_t^n$, the main output of each hidden cell, which corresponds to the states in a standard RNN. The other is the transmission state $\tilde{s}_t^n$, which is used for transmitting layer information to the next step. The primary state $s_t^{n-1}$ from the previous layer and the transmission state $\tilde{s}_t^n$ from the same layer in the previous time step are combined through function $g$. The combined state $d_t^n$ is delivered to the current cell. Possible options for $g$ are a multilayer fully connected neural network, or an affine transformation of the concatenated states followed by an activation function. In our experiments, we use the affine transformation option.
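A sketch of the affine-transformation option for the combination function $g$ (the concatenation layout and the tanh activation are our assumptions; the text specifies only an affine transformation followed by an activation function):

```python
import numpy as np

def combine_states(s_prev_layer, s_trans, W_g, b_g):
    """Combined state d = g(s^{n-1}, s~^n): affine map of the concatenated
    primary state (previous layer) and transmission state (same layer,
    previous step), followed by an activation."""
    z = np.concatenate([s_prev_layer, s_trans])
    return np.tanh(W_g @ z + b_g)
```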
In step $t$, the hidden layer cell in layer $n$ uses the input $x_t$ and the combined state $d_t^n$ from function $g$ to compute and update the primary state $s_t^n$. The primary states are used to compute the transmission states for the next step. To avoid possible errors caused by the previous layer, the input $x_t$ is directly delivered to each layer as an input. For $n = 1, \dots, N(t+1)$, the equations governing the relationship between the primary and transmission states read

$$\tilde{s}_{t+1}^n = \sum_{m \in M_n} \alpha_t^{n,m} s_t^m, \qquad \alpha_t^{n,m} = \frac{\exp(e_t^{n,m})}{\sum_{m' \in M_n} \exp(e_t^{n,m'})}, \tag{5}$$

where $M_n$ is the set of layers attended to and $e_t^{n,m}$ is an attention unit.
To compute the transmission states $\tilde{s}_{t+1}^n$, the attention unit $e_t^{n,m}$ is introduced to represent the relationship between the primary states in a certain layer and the primary states in other layers. We propose two choices for the set $M_n$: option (a) only considers the relationship between the state of the current layer and the states from the lower layers, i.e. $M_n = \{1, \dots, \min(n, N(t))\}$, called limited (LTD); alternative (b) utilizes all computed primary states, i.e. $M_n = \{1, \dots, N(t)\}$, called ALL. When strategy LTD is applied, primary states in layers deeper than $n$ cannot be used even when they have been computed. Strategy ALL aims to include the computed information of all the layers. To distinguish different layers, extra weights are utilized to compute $e_t^{n,m}$; the weights used in (5) to compute $e_t^{n,m}$ are vectors.
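The construction in (5) can be sketched as follows; the dot-product scoring here is a simplification standing in for the learned vector attention weights, so treat it as illustrative only:

```python
import numpy as np

def transmission_states(primary, n_next, strategy="ALL"):
    """Build n_next transmission states for step t+1 by attending over
    this step's N(t) primary states (rows of `primary`). Under LTD,
    layer n may attend only to layers m <= n; under ALL, to every
    computed layer."""
    primary = np.asarray(primary, dtype=float)
    N_t, d = primary.shape
    trans = np.zeros((n_next, d))
    for n in range(n_next):
        q = primary[min(n, N_t - 1)]                 # query: this layer's own state
        scores = primary @ q                         # illustrative dot-product scores
        if strategy == "LTD":
            scores[min(n, N_t - 1) + 1:] = -np.inf   # mask layers deeper than n
        w = np.exp(scores - scores.max())
        w = w / w.sum()                              # softmax over attended layers
        trans[n] = w @ primary                       # weighted combination (5)
    return trans
```

Note that `n_next` may differ from the number of rows of `primary`, which is exactly the layer-count mismatch between consecutive steps that the attention resolves.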
We use the same method as ACT to compute $N(t)$, the computation time of each step, as represented in (1). But unlike ACT, the halting unit is computed based on the output and the transmission state of each layer, $h_t^n = \sigma(W_h s_t^n + \tilde{W}_h \tilde{s}_t^n + b_h)$. In addition, instead of computing a mean-field output, we directly take the output of the deepest layer as the step output, $y_t = y_t^{N(t)}$.
When applying loss function (4) to LFACT, the shallow layers have limited involvement in calculating gradients. Therefore, to make the prediction of each layer as accurate as possible, we introduce all of the intermediate outputs into the loss function:

$$\hat{\mathcal{L}}(x, y) = \mathcal{L}(x, y) + \tau \mathcal{P}(x) + \beta \sum_{n} \mathcal{L}_n(x, y), \tag{6}$$

where $\mathcal{L}_n(x, y)$ is the loss computed from the intermediate outputs $y_t^n$ of layer $n$ and $\beta$ is a hyperparameter weighting the intermediate losses. In the experiments we report results for several values of $\beta$.
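The combined loss can be sketched in one line (names and defaults are ours; $\tau = 0.001$ and $\beta = 0.1$ follow the values used in the experiments of Section 4):

```python
def lfact_loss(task_loss, ponder, layer_losses, tau=0.001, beta=0.1):
    """Loss (6): task loss + tau * ponder cost + beta * sum of losses
    computed from the intermediate (shallower-layer) outputs, so that
    shallow layers receive direct gradient signal."""
    return task_loss + tau * ponder + beta * sum(layer_losses)
```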
3.2 Sequence to Sequence Model with LFACT
In order to deal with sequence tasks, we propose a combination of a seq2seq (encoder-decoder) model and our LFACT model, as Figure 5 in Appendix A.1 shows. In the seq2seq model, the cell in each step is replaced with our LFACT model to form a deep and flexible network. The seq2seq encoder part accepts a sequence input, and in the decoder part, we use the last ground truth as input.
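The control flow of this encoder-decoder arrangement can be sketched generically; `encode_step` and `decode_step` stand in for the LFACT cell, and all names are ours:

```python
def seq2seq_predict(encode_step, decode_step, inputs, last_truth, n_pred):
    """Encoder consumes the input sequence step by step; the decoder is
    then unrolled n_pred times with the last ground truth as its
    constant input, as described in the text."""
    state = None
    for x in inputs:                 # encoder: outputs are not used
        state = encode_step(x, state)
    preds = []
    for _ in range(n_pred):          # decoder: constant last-ground-truth input
        y, state = decode_step(last_truth, state)
        preds.append(y)
    return preds
```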
4 Computational Experiments
All the models are trained starting from random weights, i.e., no pretraining. Training the LFACT model takes 20% to 30% more time than a typical ACT model. Most experiments are based on a single seed, but in Section 4.2 we show that the variance is low when the seed is varied.
4.1 Financial Data Set
We test our LFACT models on a financial data set from [15]. The data set consists of the tick prices of twenty-two ETFs at five-minute intervals. The data is labeled into five classes to represent the significance of the price changes, e.g., one class corresponds to the price change being within one standard deviation. We have 22 softmax classification layers in each step, one per ETF. We have three test instances; in each one we train our model on 50 weeks of returns (45,950 samples), use the next week (905 samples) as validation data to save the best performing weights, and test the model based on the saved weights using the following week (905 samples). Sequences have length 20. The financial data set is tested on both RNN and seq2seq frameworks.
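A hedged sketch of the described labeling scheme (the exact thresholds used in [15] may differ; the one- and two-standard-deviation bands are our reading of the example in the text):

```python
import numpy as np

def label_price_changes(returns):
    """Map price changes to five classes by how many standard deviations
    they fall from zero: below -2 sd, between -2 and -1 sd, within 1 sd,
    between 1 and 2 sd, above 2 sd."""
    r = np.asarray(returns, dtype=float)
    sd = r.std()
    return np.digitize(r, [-2 * sd, -sd, sd, 2 * sd])  # classes 0..4
```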
RNN Based Models:
RNN based models predict the next step's price changes at each time step. The LFACT model utilizes the affine transformation option for $g$ and strategy ALL for computing the transmission states. We use plain ACT and RNN, tuned with respect to all hyperparameters, as our baseline models, and compare them with the RNN based LFACT model. We apply 0.001 as the ponder time penalty $\tau$ for LFACT and ACT (the generally optimal value in the experiments of Graves [12]), and use the Adam optimizer with learning rate 0.0005 to train the models. The maximum number of layers is $L = 5$, and GRU cells with hidden vectors of size 128 are utilized in all the models.
Figure 1(a) shows the F1 score improvements of LFACT and ACT over RNN. We test all models on three different instances: INS1, INS2, and INS3. Each bar indicates the average F1 score for all prediction steps in an instance. The results of LFACT are based on setting $\beta = 0.1$ in loss (6). The F1 scores of RNN are 0.475, 0.461, and 0.447 for INS1, INS2, and INS3, respectively. From Figure 1(a), LFACT improves 14.1% over RNN on average and ACT improves 6.3%. We introduced the new loss function (6) in order to directly update the weights of each layer from the intermediate outputs. Figure 1(b) provides the performance comparison for different values of $\beta$; the results are the average F1 score improvements over RNN across all three instances. The best range for $\beta$ in (6) is 0.01 to 0.1, which outperforms the original loss (4) by 1.2%. The comparison across different $\beta$ values shows that our new loss function yields improvements.
Figure 1(c) provides the F1 score distribution over steps 1 to 20 on INS1. LFACT consistently performs better than ACT, indicating that multiple layers of hidden vectors are more effective than a single one. The difficulty of a sequential prediction task is higher in early steps than in late ones, because the early steps have limited information from the input. LFACT and ACT are both stable across all prediction steps, but RNN performs poorly on early predictions. This benefit of LFACT and ACT implies that adaptive computation can contribute to hard tasks. Figure 1(d) gives the average computation time $N(t)$ of each step on the test set of INS1. The higher averages in early steps demonstrate LFACT's ability to compute deeply on hard tasks, and further explain why LFACT is so effective on early predictions.
Seq2seq Based Models (10 Prediction Steps):
In addition to the RNN framework, we also use the seq2seq versions of the models to predict the following ten steps. The raw sequence data with input length 20 is delivered to the seq2seq models as the input of the encoder part. All hyperparameters are the same as in the RNN based experiments, and the same strategies for $g$ and the transmission states as in the RNN based LFACT are applied to the seq2seq framework. Considering that the encoder part does not have outputs, we apply loss function (4) in this task.
In Figure 3a, we present the relative F1 score changes over seq2seq alone for each instance. The F1 scores of seq2seq are 0.439, 0.481, and 0.447. The ACT model is worse than seq2seq on INS3, so the improvement there is negative. From the results, the seq2seq based LFACT improves F1 by 7.4% over seq2seq, while ACT performs similarly to seq2seq. In Figure 3b, we provide the F1 scores for the ten individual prediction steps in the decoder on INS1. All three models degrade over time, but LFACT and ACT are more stable than seq2seq. In seq2seq based models, the decoder part has the constant input of the last ground truth, which can cause information deterioration as time passes. Thus, the benefits of LFACT on late predictions over seq2seq alone imply better abilities of LFACT in information transmission and memorization. Surprisingly, the first prediction of seq2seq is better than LFACT's, which conflicts with the results from the RNN setting. This may be caused by LFACT requiring a delay when transforming from input to predictions, since it has more trainable weights than seq2seq. However, the whole point of the seq2seq framework is multiple steps of predictions, and LFACT catches up quickly at the second prediction, so this disadvantage of LFACT should not be concerning.
Figure 3 also presents the computation time $N(t)$ results for INS1: Figures 3c and 3d show the training and validation processes based on the optimized weights, and Figure 3e shows the test set. The results show how $N(t)$ changes across steps, indicating that the LFACT model has the ability to adapt computation time dynamically according to its input. Because of the constant input in the decoder, $N(t)$ values are the same from step 21 to 30 within each set. In addition, the low $N(t)$ values on the test set imply that LFACT has a low computation requirement in the decoder part. Thus, the multiple-computation ability of LFACT is not the reason for the good performance in the seq2seq setting, as it is for the early predictions in the RNN setting. Compared to seq2seq alone, which likewise performs only one round of computation in the decoder, the significant benefits in late predictions for LFACT further confirm that LFACT has excellent abilities for information transmission and memorization.
We also conduct similar experiments by making 5 predictions. These are shown in Appendix A.2. The observations are very similar.
4.2 Wikipedia Language Modeling
This task focuses on predicting characters from the Hutter Prize Wikipedia data set, which is also used in Graves' ACT paper [12]. The original unicode text is used without any preprocessing. Every character is represented as a one-hot vector and constitutes one time step. Due to computation resource limitations, 10,240 sequences are randomly selected as the training set, and 1,280 sequences are chosen as validation and test sets without repetition. Each sequence includes 50 consecutive characters, and the next character is predicted at each time step (RNN setting). GRU cells with hidden size 128 are used in all models. The maximum number of layers is set to $L = 3$, and a softmax layer of size 256 is added to each step. We apply the optimized ponder time penalty $\tau = 0.06$ from Graves' experiments [12] for this task. The models are evaluated using bits per character (BPC); lower BPC values reflect better performance. All results are based on the affine transformation option for $g$ and strategy ALL.
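BPC is the average negative log2 probability the model assigns to the correct next character; a minimal sketch:

```python
import numpy as np

def bits_per_character(p_true):
    """Bits per character: mean of -log2 of the probability assigned to
    each ground-truth character. Lower is better."""
    return float(np.mean(-np.log2(np.asarray(p_true, dtype=float))))
```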
In Figure 3(a), we present the experimental results of LFACT and the two baseline models, ACT and RNN, on the language modeling task. The reported BPC values for LFACT are from different settings of hyperparameter $\beta$ in loss (6). Three different random seeds are applied for ACT and RNN to test the stability of the models; maximum, minimum, and average BPC values are provided. The bars in Figure 3(a) represent average BPC values, and the error bars indicate maximum and minimum BPC. From the experiment, ACT does not have a significant benefit over RNN, but LFACT improves by 11.9% over ACT and by 12.6% over standard RNN. From the error bars, LFACT has the smallest variance and ACT varies the most. The strong stability of LFACT reflects its better ability to deal with complex situations. To test the influence of hyperparameter $\beta$ in loss function (6), we compare different settings of $\beta$ in Figure 3(b). When $\beta = 0$, the loss function is equal to the original one in (4). From Figure 3(b), the best range for $\beta$ is from 0.01 to 0.1. However, when $\beta$ is set to a larger value, the new loss function does not bring any improvement over the original loss function.
In addition, we test the fully connected network option for $g$ and strategy LTD. The fully connected network for $g$ yields 1.074 BPC, and LTD yields 1.678. Neither is better than our main experimental settings. Therefore, the affine transformation for $g$ and strategy ALL are the better choices for LFACT.
In Figure 3(c), we provide the maximum and the average of the per-step computation time $N(t)$ during training on the Wikipedia language modeling task. We observe a clear decrease during the early training epochs, which eventually stabilizes. Note that during epochs 5 to 10, the maximum $N(t)$ increases while the average still decreases. We postulate that the LFACT model has already obtained the ability to predict most samples during this period and is putting more effort into the difficult samples. Figure 8 in Appendix A.3 shows the maximum $N(t)$ distributions of training, validation, and test based on the optimized weights. We only present the last 25 steps; in the first 25 steps all values equal 1. The distributions show that the LFACT model keeps the computation time as low as possible, but also retains the ability to compute deeply for certain samples. With the optimized weights, only 0.03% of the sequences in the training set use more than one round of computation, while the validation and test sets have 0.24% and 0.16% of sequences with multiple rounds. This difference arises because the model is trained on the training set and should have learned the most efficient way to predict the characters in the training set.
5 Conclusion
Deciding the structure of recurrent neural networks, in particular the number of layers, has been a persistent problem in deep learning applications. A halting unit is applied in previous work to adapt the computation time to inputs, but a single hidden vector structure leads to weaknesses in information transmission. We propose LFACT, which utilizes an attention strategy to design an information transmission policy, leading to a flexible multilayer recurrent neural network with adaptive computation time. LFACT can automatically adjust computation time according to the computational complexity of inputs and has outstanding dynamic information transmission abilities between consecutive time steps. We apply LFACT in an RNN and a seq2seq setting and evaluate the model on a financial data set and Wikipedia language modeling. The experimental results show significant improvements of LFACT over RNN, seq2seq, and ACT on both data sets. The varying number of layers observed in practice indicates LFACT's ability to adapt computation time and information transmission.
References
 [1] Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig. Joint language and translation modeling with recurrent neural networks. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013.
 [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 [3] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. International Conference on Learning Representations, 2017.
 [4] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. International Conference on Learning Representations, 2015.
 [5] J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on International Conference on Machine Learning, 2013.
 [6] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems, 2011.
 [7] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. Proceedings of the 29th International Conference on Machine Learning, 2012.
 [8] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.
 [9] George E Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.
 [10] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
 [11] Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry P Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [12] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.
 [13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
 [14] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
 [15] Mark Harmon and Diego Klabjan. Dynamic prediction length for time series with sequence to sequence networks. arXiv preprint arXiv:1807.00425, 2018.
 [16] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
 [17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [18] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 [19] Sam Leroux, Pavlo Molchanov, Pieter Simoens, Bart Dhoedt, Thomas Breuel, and Jan Kautz. IamNN: Iterative and adaptive mobile neural network for efficient image classification. Workshop on International Conference on Learning Representations, 2018.
 [20] Mohan Li and Min Liu. End-to-end speech recognition with adaptive computation steps. arXiv preprint arXiv:1808.10088, 2018.
 [21] Zhichao Li, Yi Yang, Xiao Liu, Feng Zhou, Shilei Wen, and Wei Xu. Dynamic computational time for visual attention. IEEE International Conference on Computer Vision Workshop, 2017.
 [22] Shujie Liu, Nan Yang, Mu Li, and Ming Zhou. A recursive recurrent neural network for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
 [23] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
 [24] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (mRNN). International Conference on Learning Representations, 2015.
 [25] Hector Mendoza, Aaron Klein, Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Towards automatically-tuned neural networks. In Workshop on Automatic Machine Learning, 2016.
 [26] Abdel-rahman Mohamed, George E Dahl, Geoffrey Hinton, et al. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14–22, 2012.
 [27] Mark Neumann, Pontus Stenetorp, and Sebastian Riedel. Learning to reason with adaptive computation. NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems, 2016.
 [28] Hyunsin Park and Chang D Yoo. Early improving recurrent elastic highway network. arXiv preprint arXiv:1708.04116, 2017.
 [29] Shreyas Saxena and Jakob Verbeek. Convolutional neural fabrics. In Advances in Neural Information Processing Systems, 2016.
 [30] Jürgen Schmidhuber. Self-delimiting neural networks. arXiv preprint arXiv:1210.0118, 2012.
 [31] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2012.
 [32] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems, 2015.
 [33] Rupesh Kumar Srivastava, Bas R Steunebrink, and Jürgen Schmidhuber. First experiments with powerplay. The 2nd Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics, 2013.
 [34] Chris Ying and Katerina Fragkiadaki. Depthadaptive computational policies for efficient visual tracking. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 109–122. Springer, 2017.
 [35] Adams Wei Yu, Hongrae Lee, and Quoc Le. Learning to skim text. The 55th Annual Meeting of the Association for Computational Linguistics, 2018.
 [36] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. International Conference on Learning Representations, 2017.
A Appendix
A.1 Seq2seq Model with LFACT
A.2 Seq2seq Based Models (5 Prediction Steps)
To examine the stability of the LFACT model, we further test the seq2seq based models with 5 prediction steps. The setting is the same as in the 10-prediction case except that we make only 5 predictions. Figure 6 shows the relative F1 scores of LFACT and ACT with respect to seq2seq alone. The F1 scores of seq2seq on the three instances are 0.492, 0.534, and 0.498. The seq2seq based LFACT performs better than both ACT and seq2seq in the 5-prediction task, and the benefit over ACT is significant. However, the improvement of LFACT over seq2seq is not as pronounced as in the 10-prediction task, and ACT is even worse than seq2seq. Figure 7 shows the F1 score distributions of the three models on INS1. The results match the 10-prediction task and show that the advantage of LFACT is more likely to affect late predictions in the seq2seq framework.