
# 92¢/MFlops/s, Ultra-Large-Scale Neural-Network Training on a PIII Cluster

*Gordon Bell Price/Performance winner; Student Paper Award finalist.*

**Keywords:** neural network, Linux cluster, matrix multiply

###### Abstract

Artificial neural networks with millions of adjustable parameters and a similar number of training examples are a potential solution for difficult, large-scale pattern recognition problems in areas such as speech and face recognition, classification of large volumes of web data, and finance. The bottleneck is that neural network training involves iterative gradient descent and is extremely computationally intensive. In this paper we present a technique for distributed training of Ultra Large Scale Neural Networks (ULSNN; following the convention with integrated circuits, we take ULSNN to mean a neural network with in excess of one million parameters and one million training examples) on Bunyip, a Linux-based cluster of 196 Pentium III processors. To illustrate ULSNN training we describe an experiment in which a neural network with 1.73 million adjustable parameters was trained to recognize machine-printed Japanese characters from a database containing 9 million training patterns. The training ran with an average performance of 163.3 GFlops/s (single precision). With a machine cost of $150,913, this yields a price/performance ratio of 92.4¢/MFlops/s (single precision). For comparison purposes, training using double precision and the ATLAS DGEMM produces a sustained performance of 70 MFlops/s, or $2.16/MFlops/s (double precision).

**Douglas Aberdeen** (corresponding, presenting and student author) — Douglas.Aberdeen@anu.edu.au
**Jonathan Baxter** — Jonathan.Baxter@anu.edu.au
Research School of Information Sciences and Engineering, Australian National University, Canberra, Australia, 0200

**Robert Edwards** — Robert.Edwards@anu.edu.au
Department of Computer Science, Australian National University, Canberra, Australia, 0200

## 1 Introduction

Artificial neural networks are a class of parametric, non-linear statistical models that have found widespread use in many pattern recognition domains, including speech recognition, character recognition, signal processing, medical diagnosis and finance. The typical network in such an application has 100–100,000 adjustable parameters and requires a similar number of training patterns in order to generalize well to unseen test data. Provided sufficient training data is available, the accuracy of the network is limited only by its representational power, which in turn is essentially proportional to the number of adjustable parameters. Thus, in domains where large volumes of data can be collected — such as speech, face and character recognition, and web page classification — improved accuracy can often be obtained by training a much larger network.

In this paper we describe a method for distributed training of Ultra Large Scale Neural Networks (ULSNN), or networks with more than one million adjustable parameters and a similar number of training examples. At its core, the algorithm uses Emmerald, a single-precision (32 bit) general matrix-matrix multiply (SGEMM) based on the Pentium III SIMD Streaming Extensions (SSE), with a peak performance in excess of 1090 MFlops/s on a single 550 MHz Pentium III. The use of single-precision floating-point operations is justified by the fact that we have found it sufficient for gradient-based training of ULSNNs. For medium to large scale neural networks, as little as 16-bit precision is sufficient [2].
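The kernel underlying this training scheme is a dense single-precision matrix-matrix multiply. The sketch below shows a plain-C cache-blocked SGEMM of the general form Emmerald accelerates; it is illustrative only — the actual Emmerald kernel is hand-tuned with Pentium III SSE instructions, and the block size `NB` here is a hypothetical tuning parameter, not Emmerald's.

```c
#include <stddef.h>

/* Illustrative cache-blocked single-precision GEMM: C += A * B for
 * n-by-n row-major matrices. Blocking keeps working sets in cache;
 * the real Emmerald kernel adds SSE vectorization on top of this
 * structure. NB is a hypothetical tuning parameter. */
#define NB 32

void sgemm_blocked(size_t n, const float *A, const float *B, float *C)
{
    for (size_t ii = 0; ii < n; ii += NB)
        for (size_t kk = 0; kk < n; kk += NB)
            for (size_t jj = 0; jj < n; jj += NB)
                /* multiply one NB x NB block of A against B */
                for (size_t i = ii; i < ii + NB && i < n; i++)
                    for (size_t k = kk; k < kk + NB && k < n; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + NB && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The `ii`/`kk`/`jj` loop order reuses each loaded element of `A` across a full row block of `B`, which is the same cache-reuse idea that tuned SGEMM libraries exploit.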

To illustrate the use of our ULSNN training code, we describe an experiment in which a neural network with 1.73 million adjustable parameters is being trained to recognize machine-printed Japanese characters from a database containing 9 million training patterns. The training is running on Bunyip, a 196 processor, Linux-based Intel Pentium III cluster consisting of 98 dual 550 MHz processor PC’s, each containing 384 MBytes of RAM, 13 GBytes of hard disk and 3x100 Mb/s fast Ethernet cards. All components in Bunyip are “COTS” (Commodity-Off-The-Shelf), and were sourced from a local PC manufacturer (see http://tux.anu.edu.au/Projects/Beowulf/).
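The essence of the distributed scheme is pattern parallelism: each node computes the gradient over its own shard of the training set, and the per-shard gradients are summed to recover the full-data gradient. The toy sketch below (our own names; a trivial quadratic loss stands in for the network, and the paper's implementation uses LAM/MPI rather than this serial loop) shows why the decomposition is exact.

```c
#include <stddef.h>

/* Gradient of the loss sum_i (w - x[i])^2 over patterns [lo, hi).
 * Each node would evaluate one such shard; this is a stand-in for
 * the per-node backpropagation pass. */
float shard_gradient(float w, const float *x, size_t lo, size_t hi)
{
    float g = 0.0f;
    for (size_t i = lo; i < hi; i++)
        g += 2.0f * (w - x[i]);
    return g;
}

/* Sum the shard gradients, as an all-reduce over the cluster would.
 * Assumes nodes divides n evenly, for simplicity of illustration. */
float distributed_gradient(float w, const float *x, size_t n, size_t nodes)
{
    float total = 0.0f;
    size_t shard = n / nodes;
    for (size_t node = 0; node < nodes; node++)
        total += shard_gradient(w, x, node * shard, (node + 1) * shard);
    return total;
}
```

Because the loss is a sum over patterns, the summed shard gradients equal the serial full-data gradient exactly (up to floating-point rounding), so adding nodes changes only where the work is done, not the training trajectory.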

Our longest experiment took hours and minutes, requiring a total of Peta Flops ( single-precision floating-point operations), with an average performance of 152 GFlops/s (single precision) while under load. With no other user processes running, the performance increases to 163.3 GFlops/s, which was sustained for a four hour test before returning access to other users. Total memory usage during training was 32.37 GBytes. The total machine cost, including the labor cost in construction, was AUD$253,000, or USD$150,913 at the exchange rate of AUD$1 = USD$0.5965 on the day of the final and largest payment. This gives a final price/performance ratio of USD 92.4¢/MFlops/s (single precision). For comparison purposes, training using double precision and the ATLAS DGEMM [11] produced a sustained performance of 70 MFlops/s, or $2.16/MFlops/s (double precision).
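The headline figure is one line of arithmetic; the helper below (names are ours) reproduces it from the machine cost and the sustained single-precision rate.

```c
/* Price/performance: machine cost in USD divided by sustained rate
 * in MFlops/s, expressed in US cents per MFlops/s. */
double cents_per_mflops(double cost_usd, double gflops)
{
    double mflops = gflops * 1000.0;   /* GFlops/s -> MFlops/s */
    return 100.0 * cost_usd / mflops;  /* USD -> cents */
}
```

With the paper's numbers, `cents_per_mflops(150913.0, 163.3)` gives approximately 92.4, matching the quoted 92.4¢/MFlops/s.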

## 2 “Bunyip” Hardware Details

The machine used for the experiments in this paper is “Bunyip”, a 98-node, dual Pentium III Beowulf-class system running Linux kernel 2.2.14. Our main design goals for this machine were to maximise CPU and network performance for the given budget of AUD $250,000 (about USD $149,125). Secondary factors to be balanced into the equation were: amount of memory and disk; reliability; and the overall size of the machine. All dollar figures quoted in the remainder of this paper are US dollars.

The Intel Pentium III processors were chosen over Alpha or SPARC processors for price/performance reasons. Dual-CPU systems were preferable as overall cost and size per CPU is lower than single-CPU or quad-CPU systems. Unfortunately, at the time of designing this machine AMD Athlon and Motorola/IBM G4 systems were not available in dual-CPU configurations. We were also keen to use the SSE floating point instructions of the Pentium III range. 550 MHz CPUs were eventually selected as having the best price/performance available in the Pentium III range at that time.

For the networking requirements, we decided to go with a commodity solution rather than a proprietary one. Gigabit Ethernet was considered, but deemed too expensive at around $300 per node for the NIC and around $1800 per node for the switch. Instead, a novel arrangement of multiple 100 Mb/s NICs was selected, with each node having three NICs, which contributed some $65 per node (plus switch costs; see below). The configuration for each node is dual Intel Pentium III 550 MHz CPUs on an EPoX KP6-BS motherboard with 384 MBytes RAM, a 13 GByte UDMA66 (IDE) hard disk and three DEC Tulip compatible 100 Mb/s network interfaces, one of which has Wake-On-LAN capability and provision for a Boot ROM. The nodes have no removable media, no video capability and no keyboards. Each node cost $1282.

## 8 Conclusion

We have shown how a COTS (Commodity-Off-The-Shelf) Linux Pentium III cluster costing under \$151,000 can be used to achieve sustained, Ultra-Large-Scale Neural-Network training at a performance in excess of 160 GFlops/s (single precision), for a price/performance ratio of 92.4¢/MFlops/s.

Part of the reason for the strong performance is the use of very large training sets. With the current networking set-up, performance degrades significantly with less data per processor, as communication of gradient information starts to dominate over the computation of the gradient.
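The scaling remark above can be made concrete with a back-of-envelope model: per-step computation time shrinks with the number of patterns per node, while the cost of communicating the gradient is roughly fixed by the number of parameters. The constants below are hypothetical, chosen only to show how the communication-to-computation ratio grows as the per-node training set shrinks.

```c
/* Toy model of communication vs. computation per gradient step.
 * All parameters are hypothetical illustration values, not measured
 * Bunyip figures. */
double comm_to_compute_ratio(double patterns_per_node,
                             double flops_per_pattern,
                             double flops_rate,      /* node Flops/s   */
                             double comm_seconds)    /* fixed per step */
{
    double compute_seconds =
        patterns_per_node * flops_per_pattern / flops_rate;
    return comm_seconds / compute_seconds;
}
```

Halving the patterns per node doubles the ratio, so with too little data per processor the fixed gradient-exchange cost dominates, which is the degradation observed above.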

## Acknowledgements

This project was supported by the Australian Research Council, an Australian National University Major Equipment Grant, and LinuxCare Australia. Thanks are also due to several people who made valuable contributions to the establishment and installation of Bunyip: Peter Christen, Chris Johnson, John Lloyd, Paul McKerras, Peter Strazdins and Andrew Tridgell.

## References

• [1] D. Aberdeen and J. Baxter (1999) Emmerald: a fast matrix-matrix multiply using Intel SIMD technology. Technical report, Research School of Information Sciences and Engineering, Australian National University. http://csl.anu.edu.au/daa/files/emmerald.ps
• [2] K. Asanović and N. Morgan (1991) Experimental determination of precision requirements for back-propagation training of artificial neural networks. Technical report, International Computer Science Institute. ftp://ftp.ICSI.Berkeley.EDU/pub/techreports/1991/tr-91-036.ps.gz
• [3] (1998) Basic Linear Algebra Subroutines. Netlib.
• [4] J. Bilmes, K. Asanović, C. Chin and J. Demmel (1997) Using PHiPAC to speed error back-propagation learning. In ICASSP.
• [5] T. L. Fine (1999) Feedforward Neural Network Methodology. Springer, New York.
• [6] B. Greer and G. Henry (1997) High performance software on Intel Pentium Pro processors or Micro-Ops to TeraFLOPS. Technical report, Intel. http://www.cs.utk.edu/ghenry/sc97/paper.htm
• [7] J. Bilmes, K. Asanović, J. Demmel, D. Lam and C. W. Chin (1996) PHiPAC: a portable, high-performance, ANSI C coding methodology and its application to matrix multiply. Technical report, University of Tennessee. http://www.icsi.berkeley.edu/bilmes/phipac
• [8] LAM Team. LAM/MPI source code v6.3.2. University of Notre Dame.
• [9] V. Strassen (1969) Gaussian elimination is not optimal. Numerische Mathematik 13, pp. 354–356.
• [10] M. Thottethodi, S. Chatterjee and A. R. Lebeck (1996) Tuning Strassen's matrix multiplication for memory efficiency. In Supercomputing.
• [11] R. C. Whaley and J. J. Dongarra (1997) Automatically Tuned Linear Algebra Software. Technical report, Computer Science Department, University of Tennessee.
• [12] R. C. Whaley, A. Petitet and J. J. Dongarra (2000) Automated empirical optimizations of software and the ATLAS project. Technical report, Dept. of Computer Sciences, University of Tennessee, Knoxville. http://www.cs.utk.edu/rwhaley/ATLAS/atlas.html