# EcoRNN: Fused LSTM RNN Implementation with Data Layout Optimization

###### Abstract

The Long Short-Term Memory Recurrent Neural Network (LSTM RNN) is a state-of-the-art (SOTA) model for analyzing sequential data. Current implementations of LSTM RNN in machine learning frameworks usually lack either performance or flexibility. For example, default implementations in Tensorflow and MXNet invoke many tiny GPU kernels, leading to excessive overhead in launching GPU threads. Although cuDNN, NVIDIA’s deep learning library, can accelerate performance by around , it is closed-source and inflexible, hampering further research and performance improvements in frameworks, such as PyTorch, that use cuDNN as their backend. In this paper, we introduce a new RNN implementation called EcoRNN that is significantly faster than the SOTA open-source implementation in MXNet and is competitive with the closed-source cuDNN. We show that (1) fusing tiny GPU kernels and (2) applying data layout optimization can give us a maximum performance boost of over MXNet default and over cuDNN implementations. Our optimizations also apply to other RNN cell types such as LSTM variants and Gated Recurrent Units (GRUs). We integrate EcoRNN into the MXNet Python library and open-source it to benefit machine learning practitioners.

## 1 Introduction

LSTM Hochreiter and Schmidhuber (1997) RNN (Figure 1) is one of the most important machine learning models for analyzing sequential data. It has applications in areas such as speech recognition Graves et al. (2013); Graves and Jaitly (2014), language modeling Zaremba et al. (2014); Sundermeyer et al. (2012), and machine translation Bahdanau et al. (2014); Sutskever et al. (2014). However, it has also been shown to have much lower computation throughput than other types of networks such as Convolutional Neural Networks (CNNs) Lei et al. (2017). Although cuDNN NVIDIA (2017b); Chetlur et al. (2014), NVIDIA’s proprietary deep learning library, makes several efforts to accelerate RNN training, it is closed-source and therefore cannot be modified by its users. Greff et al. (2017), in a large-scale analysis of LSTM architectures, show that there are now at least 8 variants of this single cell type used in the machine learning community. None of these variants, however, can be implemented with cuDNN or with machine learning frameworks, such as PyTorch Paszke et al. (2017), that use cuDNN as their backend Goel (2017). This is possibly one of the reasons that led framework developers, such as those of Tensorflow Abadi et al. (2015) and MXNet Chen et al. (2015), to develop their own implementations Tensorflow (2018a); (incubating) (2017a). Although they gain flexibility, they lose performance by around compared with cuDNN (based on our results described later in this Section and also in Section 3). The primary reason, as addressed in the previous work by Appleyard et al. (2016), is that these open-source implementations slice the computation of "" block (shown in Figure 1), which could be done in one single GPU kernel, into multiple small GPU kernels. This slicing causes performance overhead because a group of GPU threads must be launched repeatedly (Figure 2), each time through the cudaLaunch function call.

Figure 3 shows the runtime profile comparison
between MXNet Default and CuDNN
(footnote 1: the word Default is used to differentiate between MXNet’s own implementation and the cuDNN implementation under the MXNet framework;
for convenience, we refer to the former as Default and the latter as CuDNN).
The profile is obtained by measuring both of them running on a 1-layer LSTM RNN
with a batch size of 64, hidden dimension of 512, and sequence length of 50,
for 1 iteration that includes both forward and backward passes
on a Titan Xp NVIDIA (2017g) GPU card using nvprof NVIDIA (2017f), the NVIDIA profiling tool for GPU programs
(Section 3.1 has a more detailed description of experimental settings).
We observe that cudaLaunch time spent in the case of Default is almost that of CuDNN,
and it also exceeds the amount of actual compute time (GPU Kernels in Figure 3).
This negative effect also exists in the Tensorflow implementation of LSTM RNN and
can be exacerbated as the number of layers or sequence length increases.
Clearly, there is room for improvement by fusing kernels together to get rid of the cudaLaunch overhead.
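To make the effect of kernel fusion concrete, the following is a toy cost model, not a measurement. It assumes every kernel launch pays a fixed overhead on top of an unchanged amount of compute; the function name and all numbers are hypothetical and chosen only for illustration.

```python
def total_time_us(num_kernels, launch_overhead_us, compute_us):
    """Toy model: each kernel launch adds a fixed cost on top of the compute time."""
    return num_kernels * launch_overhead_us + compute_us


# Hypothetical numbers for illustration only (not taken from any profile).
LAUNCH_US = 5.0      # assumed fixed cost of a single kernel launch
COMPUTE_US = 400.0   # assumed total compute time, identical in both cases

# Many tiny kernels versus a few fused kernels doing the same work.
unfused = total_time_us(num_kernels=500, launch_overhead_us=LAUNCH_US, compute_us=COMPUTE_US)
fused = total_time_us(num_kernels=50, launch_overhead_us=LAUNCH_US, compute_us=COMPUTE_US)

print(f"unfused: {unfused:.0f} us, fused: {fused:.0f} us")
```

In this model the launch term grows with the number of kernels, so it also grows with the number of layers and the sequence length, which is consistent with the trend described above.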

Although Appleyard et al. (2016) managed to solve the above issue, their work fails to identify the runtime bottleneck of GPU Kernels in Figure 3. In this paper, we build on this prior work to further speed up LSTM RNN by applying data layout optimization Kennedy and Kremer (1998), a technique that originates from compiler research. We introduce a new implementation called EcoRNN, which stands for Efficient Computing of LSTM RNN, and we highlight its major contributions as follows:

EcoRNN can be up to faster than MXNet Default and 50% faster than CuDNN, while making no changes to the LSTM RNN algorithm.

It has a complete design that includes both forward and backward passes, and also supports dropout Srivastava et al. (2014), a technique that has been proven to be useful to avoid overfitting.

It has been integrated into MXNet ver. 0.12.1, one of the SOTA open-source machine learning frameworks. Implementations of EcoRNN propagate from the MXNet C++ core library to the Python interface, making it directly usable to machine learning researchers.

## 2 Data Layout Optimization

### 2.1 What is Data Layout Optimization?

Data layout is a term that is used to specify how a piece of data (e.g., a two-dimensional array of size ) resides in memory. A row-major data layout means that data in the same row sits together in memory ( is adjacent to ), and a column-major data layout indicates that data in the same column is contiguous ( sits next to ). The idea behind data layout optimization is that changing data layout (usually from row-major to column-major or vice versa) can result in better locality in the data access pattern (Figure 5). The reason why this is preferable is because GPUs have caches NVIDIA (2017e) that temporarily store copies of memory data (Figure 4). Caches are faster to access compared to main memory and they are designed based on the observation that, when a memory address is accessed, the same memory address or nearby addresses will likely be accessed in the near future (i.e. memory accesses should exhibit locality for caches to be useful) Hennessy and Patterson (2017). Therefore, better locality yields higher cache utilization (hit rate), which leads to faster memory accesses on average, and eventually, better runtime performance.
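As a small illustration of the two layouts, the sketch below computes the linear memory offset of element (i, j) of a two-dimensional array under each layout. The function names and the array size are hypothetical; only the standard offset formulas are used.

```python
def row_major_offset(i, j, n_rows, n_cols):
    """Offset of element (i, j) when each row is stored contiguously."""
    return i * n_cols + j


def col_major_offset(i, j, n_rows, n_cols):
    """Offset of element (i, j) when each column is stored contiguously."""
    return j * n_rows + i


ROWS, COLS = 3, 4  # hypothetical array size

# Row-major: horizontal neighbours are adjacent in memory.
row_step = row_major_offset(1, 3, ROWS, COLS) - row_major_offset(1, 2, ROWS, COLS)
# Row-major: vertical neighbours are a full row apart.
row_stride = row_major_offset(2, 2, ROWS, COLS) - row_major_offset(1, 2, ROWS, COLS)
# Column-major: vertical neighbours are adjacent in memory.
col_step = col_major_offset(2, 2, ROWS, COLS) - col_major_offset(1, 2, ROWS, COLS)

print(row_step, row_stride, col_step)
```

An access pattern that walks down a column touches consecutive addresses under the column-major layout but jumps by a full row under the row-major layout, which is why the choice of layout affects cache behaviour.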

We make the following two observations that justify why applying data layout optimization can be beneficial in the context of LSTM RNN.

### 2.2 Observation 1: Data Layout Optimization can speed up Fully-Connected (FC) layers

Suppose that we have matrix of dimension and matrix of dimension , and we want to compare the runtimes of matrix multiply and (Figure 6). This setup mimics the FC layer of an LSTM RNN cell whose batch size is 64 and hidden dimension is 512 (2048 comes from the fact that an LSTM cell has 4 nonlinear gates). Although mathematical intuition says that those two runtimes should by no means be different from each other because and are doing exactly the same amount of computation, actual measurements disagree. Figure 7 is obtained by measuring and on a Titan Xp GPU. The matrix multiply is carried out using cuBLAS 8.0 NVIDIA (2017a), the proprietary library owned by NVIDIA for doing basic linear algebraic operations, and is used by both MXNet and cuDNN for implementing FC layers Inc. (2015). The Runtime measurements have been averaged over 100 iterations. The Cache and Compute Units bars represent the utilization percentage of the corresponding hardware resources and come directly from the nvprof tool (the Cache here means GPU -cache). We see that is almost twice as fast as under this parameter setting, and the reason is that the former has better cache utilization. Therefore, it can feed data faster into compute units, and thus ends up spending more time in actual compute rather than waiting for data to arrive from main memory.
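The claim that both multiplication orders perform the same amount of computation can be checked on a small example. The sketch below is written under assumptions: the names `X` and `W` are placeholders, the matrices are tiny stand-ins rather than the sizes used in the measurement, and the two products are assumed to be a product and its transpose (X·Wᵀ and W·Xᵀ). It only counts multiply-adds and says nothing about cache behaviour.

```python
def matmul(a, b):
    """Naive matrix multiply on nested lists; also returns the multiply-add count."""
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    out = [[0] * m for _ in range(n)]
    macs = 0
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
                macs += 1
    return out, macs


def transpose(a):
    """Return the transpose of a nested-list matrix."""
    return [list(col) for col in zip(*a)]


# Tiny hypothetical stand-ins: X plays the role of the input (batch x hidden),
# W the role of the weights (gates*hidden x hidden).
X = [[1, 2, 3], [4, 5, 6]]                         # 2 x 3
W = [[1, 0, 2], [0, 1, 1], [3, 1, 0], [2, 2, 2]]   # 4 x 3

Y, macs_y = matmul(X, transpose(W))     # 2 x 4 result
Yt, macs_yt = matmul(W, transpose(X))   # 4 x 2 result

print(macs_y, macs_yt, transpose(Y) == Yt)
```

Both orders perform the same number of multiply-adds and yield the same values up to a transpose, so any runtime difference has to come from how the data is accessed rather than from the arithmetic.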

We observe that in LSTM RNN, FC layers usually have the following properties:

, of which the dimension is given by , often has more columns than rows, because the batch size (ranging between ) is usually smaller than the hidden dimension (ranging between ) (incubating) (2017c); Hieber et al. (2017); Luong et al. (2017); Britz et al. (2017).

has more rows than columns. Since an LSTM cell has 4 gates, the ratio between ’s width and height is always 4. The aforementioned properties make usually perform better than , in terms of both cache utilization and runtime.

### 2.3 Observation 2: The Runtime Bottleneck of LSTM RNN is FC layers

We continue on the previous experiment in Figure 3 and dive deeper into the GPU Kernels portion. We obtain the detailed runtime breakdown of CuDNN and the result is shown in Figure 8 (due to the fact that Default slices the "" block in Figure 1 into small pieces, its result is difficult to interpret). Figure 8 shows that more than 85% of the time spent on compute has been allocated to matrix multiplies (sgemm is the name for single-precision matrix multiply kernels in cuBLAS library NVIDIA (2017a)). Despite the fact that we do not know the exact one-to-one correspondence, it can be inferred that the top two kernels with the longest runtime come from the forward and backward passes of FC layers, and the third one performs aggregation of weight gradients along the time dimension. The annotations beside the stacked bar in Figure 8 group GPU kernels together according to their counterpart in Figure 1, which explains why FC layers in LSTM RNN should be the top candidate for optimization.
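A rough way to see why the dominant kernels are the natural optimization target is Amdahl's law. The sketch below is illustrative only: it takes the 85% share of compute time reported above as the accelerated fraction, and the local speedup factors passed in are hypothetical inputs, not measured values.

```python
def amdahl_speedup(fraction, local_speedup):
    """Overall speedup when `fraction` of the time is accelerated by `local_speedup`."""
    if not 0.0 <= fraction <= 1.0 or local_speedup <= 0.0:
        raise ValueError("fraction must be in [0, 1] and local_speedup positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


MATMUL_FRACTION = 0.85  # share of compute time spent in matrix multiplies (from the text)

# Hypothetical local speedups of the matrix multiplies.
for s in (1.5, 2.0, 4.0):
    print(f"matmul {s}x faster -> compute {amdahl_speedup(MATMUL_FRACTION, s):.2f}x faster")

# Upper bound if the matrix multiplies took no time at all.
upper_bound = 1.0 / (1.0 - MATMUL_FRACTION)
print(f"upper bound: {upper_bound:.2f}x")
```

Note that the 85% figure refers to time spent on GPU compute, so the result bounds the speedup of the compute portion rather than of the end-to-end iteration.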

### 2.4 Applying Data Layout Optimization in LSTM RNN and Generalization to Other Cell Types

The previous two observations justify why data layout optimization can be helpful in improving LSTM RNN performance. We therefore apply this optimization by transposing the input data from into , where , , and stand respectively for batch size, sequence length, and hidden dimension, before feeding it into the network. Such a transpose operation introduces almost no extra runtime cost, because input data needs to become time-major first before being sliced along the time dimension. The question is therefore whether the data layout should be or . Runtime measurements recommend as the better choice. We implement LSTM RNN using the layout and define this new implementation as EcoRNN.
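The layout change can be sketched on nested lists. This is a minimal illustration under assumptions: the helper names are hypothetical, `B`, `T`, and `H` stand for batch size, sequence length, and hidden dimension, and the two time-major orders shown are simply the two ways of arranging the remaining axes. The sketch makes no claim about which of the two is faster.

```python
def to_time_major(batch_major):
    """Reorder [B][T][H] into [T][B][H] so each time step is one contiguous slice."""
    return [list(step) for step in zip(*batch_major)]


def swap_inner_axes(time_major):
    """Reorder [T][B][H] into [T][H][B] by transposing every time step."""
    return [[list(col) for col in zip(*step)] for step in time_major]


B, T, H = 2, 3, 4  # hypothetical sizes
# Encode the indices in the value so the reordering is easy to verify.
x = [[[b * 100 + t * 10 + h for h in range(H)] for t in range(T)] for b in range(B)]

tbh = to_time_major(x)       # tbh[t][b][h] == x[b][t][h]
thb = swap_inner_axes(tbh)   # thb[t][h][b] == x[b][t][h]

print(tbh[1][0])     # all hidden units of sample 0 at time step 1
print(thb[2][3][1])  # hidden unit 3 of sample 1 at time step 2
```

In both time-major forms, slicing along the time dimension is a single index on the outermost axis, so the remaining choice is only how each per-step matrix is laid out.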

Figure 10 shows the runtime and cache utilization comparison between Default, CuDNN, EcoRNN on the forward pass under the hyperparameter setting . The Runtime is averaged over 100 iterations, and the Cache utilization is computed as the weighted average of utilization percentage by each sgemm kernel runtime Zhu et al. (2018), which is given by

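$$\text{Cache Utilization} = \frac{\sum_{k} u_k \, t_k}{\sum_{k} t_k} \qquad (1)$$

where $u_k$ and $t_k$ denote the cache utilization percentage and the runtime of the $k$-th sgemm kernel, respectively.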
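The runtime-weighted average described above can be sketched as follows. The function name and the sample values are hypothetical; in practice the per-kernel utilization and runtime would come from the profiler output.

```python
def weighted_cache_utilization(kernels):
    """Runtime-weighted average of per-kernel cache utilization.

    `kernels` is a list of (utilization_percent, runtime) pairs, one per sgemm kernel.
    """
    total_runtime = sum(runtime for _, runtime in kernels)
    if total_runtime <= 0:
        raise ValueError("total runtime must be positive")
    return sum(util * runtime for util, runtime in kernels) / total_runtime


# Hypothetical profile: (cache utilization in %, runtime in ms).
sgemm_kernels = [(60.0, 3.0), (40.0, 1.0)]

cache_util = weighted_cache_utilization(sgemm_kernels)
print(cache_util)
```

Weighting by runtime keeps a short kernel with unusual utilization from dominating the figure, since kernels contribute in proportion to the time they actually occupy.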

We can see from Figure 10 that the performance benefits we observe at the low level for FC layers reappear at the high level for LSTM RNN: EcoRNN is and faster compared with Default and CuDNN respectively, because it utilizes cache resources better. We can also derive two more conclusions from Figure 10:

The cache utilization in Default is almost the same as that in CuDNN. This is a good indication that data layout optimization is orthogonal to the techniques that are currently applied in the cuDNN implementation of LSTM RNN, which involves other unknown optimizations. We cannot apply data layout optimization in CuDNN directly because it is closed-source, but clearly this optimization can bring more benefits than those hidden optimizations in this hyperparameter setting.

The speedup and cache utilization comparison between CuDNN and Default in Figure 10 matches the same comparison between and in Figure 7, which not only proves the correctness of our observations, but also means that any benefits seen at the FC layers can directly translate into the level of LSTM RNN Amdahl (1967).

Although in this work we focus primarily on LSTMs, the fact that data layout optimization works on FC layers rather than the "" block in Figure 1 means that the same idea applies equally well to different LSTM variants as long as the 4 nonlinear gates are preserved (such as LSTM with peephole connections Gers and Schmidhuber (2000)), and potentially to other RNN cell types. Figure 10 does similar analysis to Figure 7, except that and are now of dimension and respectively, which mimics the FC layer of a GRU cell Cho et al. (2014) with 3 nonlinear gates. We observe that is faster than , which justifies the potential of similar data layout optimization in GRU RNN.

## 3 Experiments and Results

### 3.1 Experimental Settings

All the experiments in this paper are done on a single machine with an Intel® Core™ i5-3570 Intel (2012) CPU and a Titan Xp NVIDIA (2017g) GPU. We use the CUDA 8.0 NVIDIA (2017c) toolkit and cuDNN 6.0 NVIDIA (2017b) for our experiments in MXNet ver. 0.12.1 Chen et al. (2015). All runtime measurements are taken over 101 iterations, with the first one always discarded to avoid framework tuning or warmup overhead and the remaining 100 averaged, and all profiling results (hardware utilization, runtime breakdown) are obtained from the nvprof tool NVIDIA (2017f). To provide a fair comparison against Default and CuDNN, we integrate EcoRNN into MXNet and propagate it from MXNet’s core library to the MXNet C++ and Python interfaces.
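The measurement protocol (run 101 iterations, discard the first, average the rest) can be sketched with a small timing helper. This is a host-side illustration with hypothetical names, not the harness used for the reported numbers; timing GPU work this way would additionally require waiting for the device to finish before each timestamp is read.

```python
import time


def average_runtime(fn, iterations=101, discard=1):
    """Time `fn` for `iterations` runs and average all but the first `discard` runs."""
    if iterations <= discard:
        raise ValueError("need more iterations than discarded runs")
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()  # assumption: fn() returns only after its work has completed
        samples.append(time.perf_counter() - start)
    kept = samples[discard:]  # drop warmup iterations
    return sum(kept) / len(kept)


# Hypothetical workload standing in for one forward and backward pass.
calls = []
avg_seconds = average_runtime(lambda: calls.append(sum(range(1000))))

print(len(calls), avg_seconds)
```

Discarding the first iteration keeps one-time costs, such as lazy initialization or autotuning, out of the average.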

### 3.2 Microbenchmark

To observe the pure benefits of EcoRNN for RNN layers, we implement a microbenchmark that uses MXNet C++ interface and only includes RNN layers (i.e. there are no other layers such as embedding or softmax). We traverse through the set of hyperparameters which is defined as the cartesian product of batch size , hidden dimension , and number of layers (we kept the sequence length fixed at 50 as we observe in our experiments that runtime always scales linearly with respect to sequence length). Figure 11 shows the results of runtime comparison on the microbenchmark between Default, CuDNN, and EcoRNN. We observe that EcoRNN is always significantly better than Default and in most cases better than CuDNN. Even in a few cases where CuDNN slightly outperforms EcoRNN, the performance difference is below 20%. We believe this happens when CuDNN’s optimizations that are aimed at multi-layer LSTM RNN outweigh the benefits of data layout optimization.
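The sweep over the cartesian product of hyperparameters can be sketched with `itertools.product`. The value lists below are hypothetical placeholders for the batch sizes, hidden dimensions, and layer counts; only the fixed sequence length of 50 is taken from the text.

```python
from itertools import product

# Hypothetical value sets, for illustration only.
BATCH_SIZES = [32, 64, 128]
HIDDEN_DIMS = [256, 512, 1024]
NUM_LAYERS = [1, 2, 4]
SEQ_LENGTH = 50  # kept fixed, as in the microbenchmark

configs = [
    {"batch_size": b, "hidden_dim": h, "num_layers": l, "seq_length": SEQ_LENGTH}
    for b, h, l in product(BATCH_SIZES, HIDDEN_DIMS, NUM_LAYERS)
]

print(len(configs))
print(configs[0])
```

Each entry in `configs` would then be run once per implementation, so that the implementations are compared at identical settings.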

### 3.3 Word-Level Language-Modeling Benchmark

Performance benefits that we observe at the C++ level are helpful to estimate the potential speedup, but machine learning researchers usually use high-level programming languages such as Python or R when building RNN models. To see the performance benefits we can get from EcoRNN in this context, we integrate it into the Python interface and test it on the word-level language modeling task in the MXNet repository (incubating) (2017c, d) on the Penn TreeBank (PTB) Zaremba et al. (2014) and Wikitext-2 Merity (2016) datasets. To avoid picking hyperparameter settings that are biased towards EcoRNN, we keep the default set of hyperparameters that is chosen by MXNet developers ( denotes the input dropout probability of LSTM cells).

We verify the correctness of EcoRNN by plotting training and validation quality versus the global number of training steps and training checkpoints respectively. The quality is measured by perplexity (lower perplexity means better quality). The first two graphs in Figure 12 show that the training curve of EcoRNN almost completely overlaps with that of Default and CuDNN. Although all three implementations are the same from the algorithmic perspective, MXNet speedometer (incubating) (2017b) tells us that they are different in terms of speed. The rightmost graph in Figure 12 demonstrates that EcoRNN is and faster than Default and CuDNN respectively under this hyperparameter selection. Table 1 and Figure 13 expand the scope of the evaluation by testing on other sets of hyperparameters, of which are suggested by MXNet developers (incubating) (2017d) and are what we added to complete the sweep (other hyperparameters are kept unchanged, except for the dropout probability, which scales accordingly with ). We observe that across all the hyperparameters, EcoRNN does as well as Default and CuDNN in terms of final achieved test perplexity, yet it clearly has the advantage of better training throughput compared with both Default and CuDNN in all but a few cases where the performance difference is minimal (within 20%).

| | Hidden Dimension | 200 | 256 | 512 | 650 | 1024 | 1500 |
|---|---|---|---|---|---|---|---|
| Test Perplexity | Default | 109.97 | 103.02 | 92.80 | 89.93 | 88.32 | 85.66 |
| Test Perplexity | EcoRNN | 107.40 | 99.80 | 89.90 | 87.41 | 86.45 | 84.47 |

### 3.4 Correlation between Microbenchmark and Real Application

Some RNN models (e.g., Sockeye Hieber et al. (2017)) have an argument -fused or other equivalents that indicate the switch between the Default and CuDNN implementations, which hence require manual effort from both the model users and the model programmers side. We argue that such switching should be done automatically for machine learning users and it is part of our future plan to build a runtime tool (shown in Figure 14) on top of the microbenchmark that selects the best LSTM RNN implementation depending on the hyperparameter selection. To achieve this, the microbenchmark must be representative of the actual workload. We compute the correlation coefficient between (where stands for the runtime on microbenchmark) and average throughput measurements in Figure 13 and the results are shown in Table 2. We observe that the microbenchmark runtime is highly correlated with the throughput in both the language modeling task of PTB and that of Wikitext-2 and can therefore serve as an efficient predictor for selecting the best LSTM RNN implementation.
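The correlation check can be sketched with a plain Pearson coefficient. The sketch rests on assumptions: the quantity derived from the microbenchmark runtime is taken to be its inverse (since throughput falls as runtime grows), and the sample values are hypothetical rather than taken from Table 2 or Figure 13.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: microbenchmark runtime (ms) and application throughput (samples/s).
runtimes_ms = [10.0, 20.0, 40.0]
throughput = [200.0, 100.0, 50.0]

# Assumption: correlate the inverse runtime with the throughput.
r = pearson([1.0 / t for t in runtimes_ms], throughput)
print(round(r, 6))
```

A coefficient close to 1 indicates that ranking implementations by microbenchmark runtime would produce the same ordering as ranking them by application throughput, which is the property an automatic selector needs.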

## 4 Related Works

EcoRNN is an open-source LSTM RNN implementation that does not impose any restrictions either at the software level (hyperparameters) or at the hardware level (CPU and GPU). Diamos et al. (2016) (Persistent RNN) show that they can achieve substantial speedup by using persistent computational kernels that exploit the GPU’s inverted memory hierarchy. However, their implementation puts significant restrictions on its users. For example, the number of RNN layers must be a multiple of 4, the input data must be 16-byte aligned, and only limited GPU hardware is supported Research (2016). All these restrictions make their implementation a less desirable design for machine learning developers.

EcoRNN also makes no changes to the LSTM RNN algorithm. This is in contrast to those approaches taken by some machine learning researchers, who try to address the inefficiency of LSTM RNN from the algorithmic perspective by either getting rid of the RNN components completely (e.g., Transformer Vaswani et al. (2017), ByteNet Kalchbrenner et al. (2016), and PixelCNN van den Oord et al. (2016)) or simplifying the RNN architecture to speed up computation (e.g., RAN Lee et al. (2017), T-RNN Balduzzi and Ghifary (2016), and Miao et al. (2016)) or relieving the burden on recurrent connections to improve model parallelism (e.g., QRNN Bradbury et al. (2016) and SRU Lei et al. (2017)). These approaches are mostly orthogonal to EcoRNN and can be used in conjunction with our approach.

Compiler optimization techniques that were previously developed for high performance computing are important for achieving peak performance in machine learning workloads (such as kernel fusion Bacon et al. (1994); Padua and Wolfe (1986) and data layout transformation Kennedy and Kremer (1998)). Many researchers introduce new compiler frameworks that target DNN workloads, such as XLA Tensorflow (2018b), TVM Chen et al. (2018), Tensor Comprehensions Vasilache et al. (2018), DLVM Wei et al. (2017), nGraph Cyphers et al. (2018), and Glow Rotem et al. (2018). Unfortunately, all prior works either do not have performance evaluations Tensorflow (2018b); Wei et al. (2017); Cyphers et al. (2018) or only have evaluations that are based on Multi-Layer Perceptrons (MLP) Minsky and Papert (2017)Vasilache et al. (2018) and CNN models (such as Resnet He et al. (2015), VGG Simonyan and Zisserman (2014), and MobileNet Howard et al. (2017)) Chen et al. (2018); Vasilache et al. (2018); Rotem et al. (2018). We aim at integrating our optimizations as a part of the existing compiler frameworks and push for optimizations beyond those that are specific for MLP and CNN models.

## 5 Conclusion and Discussion

In this paper, we introduce EcoRNN, a new implementation of LSTM RNN with kernel fusion and data layout optimization. We show the potential of those two optimizations in multiple SOTA machine learning frameworks and RNN cell types other than LSTM. EcoRNN is always significantly better than the MXNet Default, and also the closed-source CuDNN implementations under most hyperparameter settings. We develop a microbenchmark that consists of pure LSTM RNNs and demonstrate that it is representative of the actual workload as the runtime on the microbenchmark is highly correlated with the average throughput measurements reported by MXNet speedometer.

The successful application of data layout optimization in LSTM RNN gives rise to the question as to whether or not it is universally applicable to all FC layers, or matrix multiplies in general. Matrix multiplies are ubiquitous in machine learning models, which is one of the reasons why companies such as Google and NVIDIA introduce hardware dedicated specifically for them Jouppi et al. (2017); NVIDIA (2017d). In this paper, we show how data layout optimization can give a speedup on matrix multiplies. However, applying it universally can be challenging. Figure 15 (left) explains the reason – having a single piece of data and matrix multiplies can already give us a total number of possible execution paths to consider, under the condition that those matrix multiplies are all different in terms of dimensions of matrices. However, if all matrix multiplies are the same, the NP-complete problem will be reduced to simply selecting between either row-major or column-major (i.e. a binary problem, shown in Figure 15 (right)), which is exactly the case of LSTM RNN and other recurrent models, where all FC layers share the same dimension across different layers and time steps.

## 6 Acknowledgements

We express our sincere gratitude to Professor Roger Grosse, Andrew Pelegris, and Shang (Sam) Wang from the University of Toronto for kindly giving us feedback on this paper.

## References

- Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
- Amdahl (1967) Gene M. Amdahl. 1967. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS ’67 (Spring)). ACM, New York, NY, USA, 483–485. https://doi.org/10.1145/1465482.1465560
- Appleyard et al. (2016) Jeremy Appleyard, Tomás Kociský, and Phil Blunsom. 2016. Optimizing Performance of Recurrent Neural Networks on GPUs. CoRR abs/1604.01946 (2016). arXiv:1604.01946 http://arxiv.org/abs/1604.01946
- Bacon et al. (1994) David F. Bacon, Susan L. Graham, and Oliver J. Sharp. 1994. Compiler Transformations for High-performance Computing. ACM Comput. Surv. 26, 4 (December 1994), 345–420. https://doi.org/10.1145/197405.197406
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014). arXiv:1409.0473 http://arxiv.org/abs/1409.0473
- Balduzzi and Ghifary (2016) David Balduzzi and Muhammad Ghifary. 2016. Strongly-Typed Recurrent Neural Networks. CoRR abs/1602.02218 (2016). arXiv:1602.02218 http://arxiv.org/abs/1602.02218
- Bradbury et al. (2016) James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2016. Quasi-Recurrent Neural Networks. CoRR abs/1611.01576 (2016). arXiv:1611.01576 http://arxiv.org/abs/1611.01576
- Britz et al. (2017) Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. 2017. Massive Exploration of Neural Machine Translation Architectures. CoRR abs/1703.03906 (2017). arXiv:1703.03906 http://arxiv.org/abs/1703.03906
- Chen et al. (2015) Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. CoRR abs/1512.01274 (2015). arXiv:1512.01274 http://arxiv.org/abs/1512.01274
- Chen et al. (2018) Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: End-to-End Optimization Stack for Deep Learning. CoRR abs/1802.04799 (2018). arXiv:1802.04799 http://arxiv.org/abs/1802.04799
- Chetlur et al. (2014) Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning. CoRR abs/1410.0759 (2014). arXiv:1410.0759 http://arxiv.org/abs/1410.0759
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. CoRR abs/1406.1078 (2014). arXiv:1406.1078 http://arxiv.org/abs/1406.1078
- Cyphers et al. (2018) Scott Cyphers, Arjun K. Bansal, Anahita Bhiwandiwalla, Jayaram Bobba, Matthew Brookhart, Avijit Chakraborty, William Constable, Christian Convey, Leona Cook, Omar Kanawi, Robert Kimball, Jason Knight, Nikolay Korovaiko, Varun Kumar, Yixing Lao, Christopher R. Lishka, Jaikrishnan Menon, Jennifer Myers, Sandeep Aswath Narayana, Adam Procter, and Tristan J. Webb. 2018. Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning. CoRR abs/1801.08058 (2018). arXiv:1801.08058 http://arxiv.org/abs/1801.08058
- Diamos et al. (2016) Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. 2016. Persistent RNNs: Stashing Recurrent Weights On-Chip. In Proceedings of The 33rd International Conference on Machine Learning (Proceedings of Machine Learning Research), Maria Florina Balcan and Kilian Q. Weinberger (Eds.), Vol. 48. PMLR, New York, New York, USA, 2024–2033. http://proceedings.mlr.press/v48/diamos16.html
- Gers and Schmidhuber (2000) Felix A Gers and Jürgen Schmidhuber. 2000. Recurrent nets that time and count. In Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, Vol. 3. IEEE, 189–194.
- Goel (2017) Hardik Goel. 2017. Add Peephole connections for LSTMs? https://github.com/pytorch/pytorch/issues/630
- Graves and Jaitly (2014) Alex Graves and Navdeep Jaitly. 2014. Towards End-to-End Speech Recognition with Recurrent Neural Networks. In International Conference on Machine Learning. 1764–1772.
- Graves et al. (2013) Alex Graves, Abdelrahman Mohamed, and Geoffrey Hinton. 2013. Speech Recognition with Deep Recurrent Neural Networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6645–6649.
- Greff et al. (2017) Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28, 10 (2017), 2222–2232.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015). arXiv:1512.03385 http://arxiv.org/abs/1512.03385
- Hennessy and Patterson (2017) John L. Hennessy and David A. Patterson. 2017. Computer Architecture, Sixth Edition: A Quantitative Approach (6th ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
- Hieber et al. (2017) Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2017. Sockeye: A Toolkit for Neural Machine Translation. ArXiv e-prints (December 2017). arXiv:cs.CL/1712.05690 https://arxiv.org/abs/1712.05690
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Howard et al. (2017) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR abs/1704.04861 (2017). arXiv:1704.04861 http://arxiv.org/abs/1704.04861
- Inc. (2015) Vuno Inc. 2015. Implementing Deep Learning using cuDNN. http://images.nvidia.com/content/gtc-kr/part_2_vuno.pdf
- (incubating) (2017a) Apache MXNet (incubating). 2017a. LSTMCell. https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/rnn/rnn_cell.py
- (incubating) (2017b) Apache MXNet (incubating). 2017b. Speedometer. https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/callback.py
- (incubating) (2017c) Apache MXNet (incubating). 2017c. Word Level Language Modeling. https://github.com/apache/incubator-mxnet/tree/master/example/rnn/word_lm
- (incubating) (2017d) Apache MXNet (incubating). 2017d. Word-level language modeling RNN. https://github.com/apache/incubator-mxnet/tree/master/example/gluon/word_language_model
- Intel (2012) Intel. 2012. Intel® Core™ i5-3570 Processor. https://ark.intel.com/products/65702/Intel-Core-i5-3570-Processor-6M-Cache-up-to-3_80-GHz
- Jouppi et al. (2017) Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA ’17). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/3079856.3080246
- Kalchbrenner et al. (2016) Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural Machine Translation in Linear Time. CoRR abs/1610.10099 (2016). arXiv:1610.10099 http://arxiv.org/abs/1610.10099
- Kennedy and Kremer (1998) Ken Kennedy and Ulrich Kremer. 1998. Automatic data layout for distributed-memory machines. ACM Transactions on Programming Languages and Systems (TOPLAS) 20, 4 (1998), 869–916.
- Lee et al. (2017) Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2017. Recurrent Additive Networks. CoRR abs/1705.07393 (2017). arXiv:1705.07393 http://arxiv.org/abs/1705.07393
- Lei et al. (2017) Tao Lei, Yu Zhang, and Yoav Artzi. 2017. Training RNNs as Fast as CNNs. CoRR abs/1709.02755 (2017). arXiv:1709.02755 http://arxiv.org/abs/1709.02755
- Luong et al. (2017) Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. 2017. Neural Machine Translation (seq2seq) Tutorial. (2017). https://github.com/tensorflow/nmt
- Merity (2016) Stephen Merity. 2016. The WikiText Long Term Dependency Language Modeling Dataset.
- Miao et al. (2016) Yajie Miao, Jinyu Li, Yongqiang Wang, Shi-Xiong Zhang, and Yifan Gong. 2016. Simplifying Long Short-Term Memory Acoustic Models for Fast Training and Decoding. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2284–2288.
- Minsky and Papert (2017) Marvin L. Minsky and Seymour A. Papert. 2017. Perceptrons: An Introduction to Computational Geometry. MIT Press.
- NVIDIA (2017a) NVIDIA. 2017a. cuBLAS Library v8.0.
- NVIDIA (2017b) NVIDIA. 2017b. cuDNN v6.0.
- NVIDIA (2017c) NVIDIA. 2017c. NVIDIA CUDA Toolkit v8.0.
- NVIDIA (2017d) NVIDIA. 2017d. NVIDIA Tesla V100 GPU Architecture. (2017). http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
- NVIDIA (2017e) NVIDIA. 2017e. Parallel Thread Execution ISA v8.0.
- NVIDIA (2017f) NVIDIA. 2017f. Profiler User’s Guide v8.0.
- NVIDIA (2017g) NVIDIA. 2017g. Titan Xp User Guide. http://www.nvidia.com/content/geforce-gtx/NVIDIA_TITAN_Xp_User_Guide.pdf
- Padua and Wolfe (1986) David A Padua and Michael J Wolfe. 1986. Advanced compiler optimizations for supercomputers. Commun. ACM 29, 12 (1986), 1184–1201.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. (2017).
- Research (2016) Baidu Research. 2016. PRNN. https://github.com/baidu-research/persistent-rnn
- Rotem et al. (2018) Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Satish Nadathur, Jakob Olesen, Jongsoo Park, Artem Rakhov, and Misha Smelyanskiy. 2018. Glow: Graph Lowering Compiler Techniques for Neural Networks. arXiv preprint arXiv:1805.00907 (2018).
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014). arXiv:1409.1556 http://arxiv.org/abs/1409.1556
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
- Sundermeyer et al. (2012) Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3104–3112. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
- Tensorflow (2018a) Tensorflow. 2018a. BasicLSTMCell. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/rnn_cell_impl.py
- Tensorflow (2018b) Tensorflow. 2018b. XLA Overview. (2018). https://www.tensorflow.org/performance/xla/
- van den Oord et al. (2016) Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. 2016. Conditional Image Generation with PixelCNN Decoders. CoRR abs/1606.05328 (2016). arXiv:1606.05328 http://arxiv.org/abs/1606.05328
- Vasilache et al. (2018) Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. CoRR abs/1802.04730 (2018). arXiv:1802.04730 http://arxiv.org/abs/1802.04730
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
- Wei et al. (2017) Richard Wei, Vikram S. Adve, and Lane Schwartz. 2017. DLVM: A modern compiler infrastructure for deep learning systems. CoRR abs/1711.03016 (2017). arXiv:1711.03016 http://arxiv.org/abs/1711.03016
- Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).
- Zhu et al. (2018) Hongyu Zhu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Amar Phanishayee, Bianca Schroeder, and Gennady Pekhimenko. 2018. TBD: Benchmarking and Analyzing Deep Neural Network Training. arXiv preprint arXiv:1803.06905 (2018).