Tab. 1: Characteristics of machine learning I/O frameworks.

| Approach | System scalability | Dataset scalability | Full randomization | Hardware independence | Ease of use |
|---|---|---|---|---|---|
| Double-buffering (e.g., Paszke et al. (2019)) | ✗ | ✓ | ✓ | ✗ | ✓ |
| tf.data (Abadi et al. (2015)) | ✗ | ✓ | ✗ | ✗ | ✓ |
| Data sharding (e.g., Kurth et al. (2018)) | ✓ | ✗ | ✗ | ✗ | ✓ |
| DeepIO (Zhu et al. (2018)) | ✓ | ✗ | ✗ | ✗ | ✓ |
| LBANN data store (Jacobs et al. (2019)) | ✓ | ✗ | ✓ | ✗ | ✗ |
| Locality-aware loading (Yang and Cong (2019)) | ✓ | ✓ | ✓ | ✗ | ✗ |
| HDMLP (this paper) | ✓ | ✓ | ✓ | ✓ | ✓ |
Training deep neural networks is an increasingly important workload on systems, from small research clusters to large, dedicated training clusters, as machine learning is adopted by more fields. Given the high cost of training, it is critical that every aspect be as efficient as possible OpenAI (2018); Strubell et al. (2019). Extensive work has been done to optimize training Ben-Nun and Hoefler (2019), including dedicated hardware Jouppi et al. (2017); Markidis et al. (2018), compilers Google (2020); Chen et al. (2018), optimized primitives Chetlur et al. (2014), and communication infrastructure Nvidia (2020a); Dryden et al. (2018).
However, because of these improvements in compute and communication, the performance bottleneck in training is shifting to I/O—loading data samples from storage Pumma et al. (2019). Due to Amdahl’s law, training is increasingly becoming I/O-bound. Indeed, we find that when training ImageNet at scale, up to 52% of the runtime is spent in I/O. As trends in compute capability continue and datasets reach hundreds of millions Sun et al. (2017) to billions Mahajan et al. (2018) of samples and terabytes Mathuriya et al. (2018); Abu-El-Haija et al. (2016) to petabytes Abodo et al. (2018) in size, the I/O bottleneck will only be exacerbated.
It is challenging to optimize training I/O, as stochastic gradient descent (SGD) randomly accesses (typically small) data samples. This problem is especially acute for distributed training, where shared filesystem contention can be detrimental to performance. Existing frameworks often overlap I/O with computation to reduce its overhead, but this is no longer sufficient. Beyond this, ad hoc solutions such as limited lookahead and double-buffering Abadi et al. (2015); Paszke et al. (2019); Chien et al. (2018), data sharding Goyal et al. (2017); Kurth et al. (2018), prestaging Jacobs et al. (2019); Oyama et al. (2020), or modified access patterns Yang and Cong (2019); Zhu et al. (2018) are used. These often have significant limitations, including requiring extra hardware, poor scalability, or deviating from full-dataset randomization. All of these approaches can fail to fully utilize a machine’s I/O subsystem.
To address the I/O bottleneck, we introduce a new I/O middleware framework, HDMLP. The key idea behind HDMLP is to exploit clairvoyance Bélády (1966): Given the seed for the pseudorandom number generator (PRNG) that generates an access stream, we can exactly predict which process will access a given sample arbitrarily far in the future. HDMLP analyzes the access stream to perform integrated prefetching and caching, rather than always reading from storage. It uses a performance model-driven architecture to transparently manage data across a cluster, using both on-node storage hierarchies (e.g., RAM, node-local SSDs) and distributed memory, resulting in much improved I/O performance. Using this, we are able to reduce I/O stall times by at least for training ResNet-50 He et al. (2016) on ImageNet. Further, because HDMLP avoids unnecessary accesses to shared filesystems for data, we significantly reduce runtime impacts from filesystem contention, improving scalability.
Using HDMLP requires no changes to deep learning frameworks and changes to only a few lines of code in existing training scripts, as it presents an iterator-style interface to accessing data. It can also automatically adapt to different datasets and machines, being applicable both to small-scale research clusters and large, dedicated training clusters.
We additionally develop an I/O performance simulator to compare different I/O strategies in a variety of scenarios. Beyond evaluating performance, this simulator can also be used to help design future systems for training workloads by analyzing which components (e.g., larger SSDs) have the largest impact on runtime.
When using HDMLP, I/O resources are fully utilized, such that I/O is either never a bottleneck or necessarily limited by the dataset and hardware. Our key contributions are:
- We identify clairvoyance as a key insight for optimizing I/O and use it to perform a probabilistic analysis of access patterns and produce a near-optimal mapping of data to cache hierarchies.
- We combine this with a performance model and implement it in HDMLP, an easy-to-use I/O middleware that optimizes I/O during training.
- We significantly reduce I/O stall time, improving overall training performance. We achieve stall time reductions of – for training at large scale, contributing to – overall training time performance improvement.
Deep neural networks are almost always trained with variants of mini-batch SGD Bottou et al. (2018). The samples that make up a given mini-batch are randomly selected without replacement from the entire training dataset. Training typically involves many epochs (complete passes over the dataset), each of which accesses the dataset in a different random order. For distributed training, we assume a data-parallel regime Ben-Nun and Hoefler (2019), where a mini-batch is partitioned among workers. Hence, a given sample is accessed exactly once in each epoch.
A practical relaxation of this regime is to allow workers to sample data from a local shard, a subset of the dataset that may share samples with other workers (e.g., Kurth et al. (2018)). This approach is typically adopted to mitigate the I/O overheads that would otherwise result from the inefficient I/O utilization of existing systems. To stay as close as possible to the guarantees of SGD, we focus on ensuring full-dataset randomization in this work.
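Because fully-randomized SGD samples without replacement, the entire multi-epoch access stream is a deterministic function of the PRNG seed. A minimal sketch (the round-robin partitioning of each permutation among workers is illustrative, not necessarily the scheme a given framework uses):

```python
import random

def access_streams(seed, num_samples, num_workers, num_epochs):
    """Reproduce every worker's access stream from a shared PRNG seed.

    Each epoch draws a fresh permutation of the dataset (sampling
    without replacement) and partitions it round-robin among workers,
    so every sample is accessed exactly once per epoch.
    """
    rng = random.Random(seed)
    streams = [[] for _ in range(num_workers)]
    for _ in range(num_epochs):
        perm = list(range(num_samples))
        rng.shuffle(perm)
        for pos, sample in enumerate(perm):
            streams[pos % num_workers].append(sample)
    return streams
```

Any worker holding the seed can compute all workers' streams arbitrarily far into the future, which is the property exploited for clairvoyant prefetching.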
2.1 Machine Learning I/O Frameworks
I/O for training deep neural networks is a complex task and typically consists of multiple stages. At the highest level, “I/O” involves reading a sample from storage and then preprocessing it, where preprocessing may entail multiple stages itself, such as decoding images, tokenizing text, normalization, or data augmentation (see Fig. 1). There are a number of solutions for optimizing the preprocessing stage, such as DALI Nvidia (2020b) or memory mapping optimizations Pumma et al. (2019). In general, these are orthogonal to optimizations for the read stage. We focus primarily on optimizing this first stage, which we will refer to as “I/O”. In this setting, we will assume that the dataset is available in a shared storage location such as a parallel filesystem (PFS), and that training workers run on compute nodes that all have access to the storage.
There are several key characteristics of I/O frameworks:
- System scalability
Whether additional resources can be productively used when scaling to many nodes.
- Dataset scalability
Whether arbitrarily large datasets (much larger than aggregate node storage) are supported.
- Full randomization
Whether sample selection is randomized over the entire dataset.
- Hardware independence
Whether special hardware (e.g., node-local SSDs) is used if present, but is not required.
- Ease of use
Whether the framework can be incorporated into existing workflows without significant effort.
We summarize existing approaches, along with HDMLP, according to these characteristics in Tab. 1. All of these approaches are capable of double-buffering, where fetching the next mini-batch is overlapped with computation, and of using multiple threads to fetch and preprocess samples. This is the approach taken by PyTorch’s built-in DataLoader. The tf.data module in TensorFlow extends this with longer-range lookahead, but typically performs data shuffling only in a limited-size buffer instead of over the entire dataset. These two approaches have poor system scalability, as workers contend for access to shared storage. Data sharding is widely used in practice, generally with ad hoc data staging scripts, but is necessarily limited by available system storage. None of the existing approaches is hardware independent: either they require additional hardware, such as SSDs, or they fail to exploit such hardware when it is available.
3 I/O Access Patterns
We begin by analyzing the I/O access patterns in training.
3.1 Integrated Prefetching and Caching
We first review prior work on prefetching and caching algorithms. For caches, given the full access sequence, the optimal schedule is given by Bélády’s clairvoyant algorithm Bélády (1966), which discards the data whose next use is furthest in the future. However, it is much more challenging to derive an optimal schedule for integrated prefetching and caching Cao et al. (1995). There exist efficient algorithms for the optimal schedule in the case of a single processor and disk, and approximation algorithms for the case of a single processor with a parallel disk system Kimbrel and Karlin (2000); Albers et al. (2000); Albers and Witt (2001); Jin et al. (2002); Albers and Büttner (2005). Unfortunately, finding an optimal schedule for the parallel disk case is NP-hard Ambühl and Weber (2003). Our case, with multiple processors, each possibly with multiple levels of cache, is even more challenging.
Nevertheless, we can adapt ideas from these algorithms to our situation. It can be shown that any optimal prefetching and caching strategy for a single processor and disk must follow four rules Cao et al. (1995):
1. Optimal prefetching: Every prefetch should fetch the next sample in the access sequence that is not in the cache.
2. Optimal replacement: Every prefetch should discard the sample whose next use is furthest in the future.
3. Do no harm: Never discard sample A to prefetch sample B when A will be used before B.
4. First opportunity: Never prefetch-and-replace when the same operation could have been done previously.
Some of these rules can be generalized to the case of multiple disks Kimbrel and Karlin (2000), or relaxed while still producing good approximations Jin et al. (2002). HDMLP implements Rule 1 exactly and approximates the remaining rules within a limited time horizon, exploiting the fact that a sample is accessed exactly once per epoch.
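For intuition, Rule 2 taken in isolation is Bélády's clairvoyant replacement policy. A small sketch (illustrative only, not HDMLP's implementation):

```python
def belady_evictions(accesses, cache_size):
    """Simulate Bélády's clairvoyant policy: on a miss with a full
    cache, evict the resident sample whose next use is furthest in
    the future (or that is never used again). Returns the miss count."""
    cache, misses = set(), 0
    for i, sample in enumerate(accesses):
        if sample in cache:
            continue
        misses += 1
        if len(cache) >= cache_size:
            # Next use index for each cached sample; infinity if unused.
            def next_use(s):
                try:
                    return accesses.index(s, i + 1)
                except ValueError:
                    return float('inf')
            cache.remove(max(cache, key=next_use))
        cache.add(sample)
    return misses
```

With clairvoyance, the full future access sequence needed by `next_use` is available to every worker, which is exactly what the known PRNG seed provides.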
3.2 Distribution of Accesses
HDMLP utilizes a second key observation about the access pattern: although each sample is read exactly once per epoch by some worker, the number of times a particular worker reads that sample over all epochs varies depending on the random seed. Exploiting this access frequency disparity allows us to better cache data.
To formalize this, fix a worker and a sample, and let $X_e \in \{0, 1\}$ indicate whether the worker accesses the sample in epoch $e$. For $N$ workers and $E$ epochs, we have $\Pr[X_e = 1] = 1/N$, and the access frequency of this sample is $F = \sum_{e=1}^{E} X_e$. As the $X_e$ are independent Bernoulli random variables with the same success probability, we have that $F \sim \operatorname{Binomial}(E, 1/N)$. Thus the mean of the distribution is $E/N$, and the probability that the access frequency is greater than the mean by a factor $c > 1$ (and hence the sample will be accessed more often by the worker) is
$$\Pr[F > cE/N] = \sum_{k = \lfloor cE/N \rfloor + 1}^{E} \binom{E}{k} \left(\frac{1}{N}\right)^{k} \left(1 - \frac{1}{N}\right)^{E-k}.$$
However, we are primarily interested in the number of samples that will be accessed more often by a worker, which is the sum over all samples of the indicator that $F > cE/N$. Then, using that expectation is linear, the expected value is given by $n \Pr[F > cE/N]$, where $n$ is the size of the dataset. We verified that this works well using Monte Carlo simulations. As an example, take a configuration where each sample is read 250 times on average by a worker (i.e., $E/N = 250$) and $c = 1.1$, giving a threshold of 275 accesses. Our theoretical estimate gives an expected value of approximately 323. Hence, although each sample is read 250 times on average by a worker, around 323 samples will be accessed more than 275 times. The distribution from a Monte Carlo simulation is shown in Fig. 2. The number of samples accessed more than 275 times is 329, closely agreeing with the calculations.
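This estimate is straightforward to compute from the binomial tail. A sketch (the parameter values in the test are illustrative assumptions, not necessarily the configuration used in the text):

```python
import math

def tail_prob(E, N, threshold):
    """P[Binomial(E, 1/N) > threshold]: probability that a fixed worker
    accesses a fixed sample more than `threshold` times in E epochs."""
    p = 1.0 / N
    return sum(math.comb(E, k) * p**k * (1 - p)**(E - k)
               for k in range(threshold + 1, E + 1))

def expected_heavy_samples(n, E, N, factor):
    """Expected number of samples (out of n) that a worker accesses
    more than `factor` times the mean E/N, by linearity of
    expectation: n * P[F > factor * E / N]."""
    threshold = math.floor(factor * E / N)
    return n * tail_prob(E, N, threshold)
```

The same routine can double-check a Monte Carlo simulation such as the one behind Fig. 2.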
As each sample is accessed exactly once per epoch by fully-randomized SGD without replacement, if some worker accesses a sample more (or less) frequently than average, then some other worker must access it less (or more) frequently. We formalize this as follows (for proofs, see Sec. A.1):
Lemma 1. Over $E$ epochs with $N$ workers, if a worker accesses a sample more than $E/N$ times (resp. fewer than $E/N$ times), at least one other worker will access the sample less than or equal to $E/N$ times (resp. more than or equal to $E/N$ times).
4 Performance Modeling
We now turn to our performance model of training I/O. This, combined with our analysis of access patterns (Sec. 3), forms the basis for the design and implementation of HDMLP (Sec. 5). It also enables us to develop a simulator to compare I/O frameworks, identify bottlenecks, and help design future systems for training workloads (Sec. 6).
First we introduce notation defining our environment. All quantities can be measured with simple performance benchmarks. Tab. 2 summarizes all our notation. We will assume there is one worker per compute node. This is not a strong assumption, as that worker could use multiple GPUs or multiple workers can coordinate via shared memory.
Let there be $N$ workers, where each worker $w$ is characterized by:
- $c_w$ [MB/s]: Compute throughput for training. This heavily depends on the details of the neural network, hardware, and software. We model $c_w$ in MB/s as it directly relates to I/O subsystem parameters; if it is known only in terms of samples per second, it can be approximated by multiplying this by the average file size. If samples are resized during preprocessing, the original size should be used.
- $p_w$ [MB/s]: Data preprocessing rate.
- $n_w$ [MB/s]: Inter-worker network bandwidth. We will assume there is no network congestion.
- $r_{\mathrm{PFS}}(k)$ [MB/s]: Random aggregate read throughput of the PFS, as a function of the number of readers $k$. We model this as a function of $k$ because the bandwidth of a PFS is heavily dependent on the number of clients Chowdhury et al. (2019).

To account for the storage diversity present in today’s datacenters, we will assume there are $L$ distinct storage classes which group storage that behaves similarly. For example, a storage class can represent RAM, SSDs, HDDs, shared global burst buffers, or emerging NVRAM technologies. Storage class $0$ is defined to be the staging buffer, a (usually small) in-memory buffer that is shared with the machine learning framework. A storage class $j$ is characterized by:
- $s_j$ [MB]: Capacity of storage class $j$. The total local storage of a worker is therefore $\sum_j s_j$.
- $r_j(t_j)$ and $w_j(t_j)$ [MB/s]: Random aggregate read and write throughput for storage class $j$ with $t_j$ threads.
- $t_j$: Number of threads used for prefetching data to storage class $j$. We assume there is always at least one thread for prefetching to the staging buffer, i.e., $t_0 \geq 1$. We model throughput as a function of $t_j$ because, for many storage devices, a single thread cannot saturate its bandwidth.
Let our training dataset consist of $n$ samples, where sample $i$ has size $d_i$ [MB]. Note that each sample may have a different size. The size of the whole dataset is $D = \sum_{i=1}^{n} d_i$. We can have that $D > \sum_j s_j$, where it is not possible for the dataset to be stored on a single worker, or $D > N \sum_j s_j$, where the dataset cannot be stored across all workers. We consider mini-batches of size $B$ and let there be $E$ epochs. One epoch therefore consists of $\lfloor n/B \rfloor$ iterations (or $\lceil n/B \rceil$ if we keep the last, small iteration).

At iteration $i$, $0 \leq i < E \lceil n/B \rceil$, we process a batch $B_i$ and worker $w$ processes its local batch $B_i^w \subseteq B_i$. We write $b = |B_i^w|$ for the per-worker batch size. As each sample is read exactly once in an epoch, the sets $B_i$ belonging to a single epoch are pairwise disjoint, which implies the same for the $B_i^w$. For data-parallelism, we have that the $B_i^w$ partition $B_i$. (Adapting this to other training regimes, e.g., model-parallelism, is straightforward.)

Lastly, we write $a_{i,k}^w$ to be the $k$th sample in worker $w$’s $i$th batch. Then the access sequence of worker $w$ is $A_w = (a_{0,0}^w, a_{0,1}^w, \ldots, a_{1,0}^w, a_{1,1}^w, \ldots)$.
We now define the key metric of our model, $T_w(m)$, the time elapsed when worker $w$ consumes $A_w(m)$, the $m$th entry of $A_w$:
$$T_w(m) = \max\left(T_w(m-1) + \frac{d_{A_w(m)}}{c_w},\; t_{\mathrm{stage}}(m)\right),$$
where $t_{\mathrm{stage}}(m)$ is the time $A_w(m)$ is available in the staging buffer. This is illustrated in Fig. 3. Assuming threads are load balanced, we have
$$t_{\mathrm{stage}}(m) \approx \frac{1}{t_0} \sum_{l=1}^{m} T_{\mathrm{read}}(l).$$
We define $T_{\mathrm{read}}(l) = t_{\mathrm{fetch}}(l) + t_{\mathrm{store}}(l)$ as the time to read the $l$th sample of $A_w$ (of size $d_l$) into the staging buffer. Here, $t_{\mathrm{fetch}}(l)$ is the time to fetch the sample into memory and $t_{\mathrm{store}}(l)$ the time to preprocess and store it in the staging buffer. $t_{\mathrm{store}}(l)$ does not depend on the data source, and is
$$t_{\mathrm{store}}(l) = \max\left(\frac{d_l}{p_w},\; \frac{d_l}{w_0(t_0)}\right),$$
where we assume preprocessing and writing can be pipelined in parallel. For fetching data, there are three cases, and we use the fastest applicable one:
1. Reading from the PFS, while $k-1$ other workers do as well: $t_{\mathrm{fetch}}(l) = k\,d_l / r_{\mathrm{PFS}}(k)$.
2. Reading from another worker’s storage class $j$: $t_{\mathrm{fetch}}(l) = d_l / \min(r_j(t_j), n_w)$.
3. Reading from its local storage class $j$ (assuming the sample is present): $t_{\mathrm{fetch}}(l) = d_l / r_j(t_j)$.
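In code, the fastest-option selection might look as follows. This is a sketch under two assumptions stated in the text's spirit but not verbatim: the aggregate PFS bandwidth is shared evenly among its readers, and remote reads pipeline the storage read with the network transfer; all names are illustrative.

```python
def fetch_time(size_mb, pfs_bw, readers, remote=None, local=None, net_bw=None):
    """Time [s] to fetch one sample, taking the fastest applicable
    option: the PFS (aggregate bandwidth shared among `readers`), a
    remote worker's storage class (pipelined with the network), or a
    local storage class. `remote`/`local` are read bandwidths [MB/s],
    or None if the sample is not cached at that location."""
    options = [size_mb * readers / pfs_bw]             # shared PFS read
    if remote is not None and net_bw is not None:
        options.append(size_mb / min(remote, net_bw))  # remote cache
    if local is not None:
        options.append(size_mb / local)                # local cache
    return min(options)
```

For example, a 100 MB sample with 10 PFS readers at 1000 MB/s aggregate costs 1 s from the PFS, but only 0.02 s from a 5000 MB/s local class, so the local copy wins.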
We will use this performance model to drive runtime selection of data fetching in HDMLP.
We now present the overall design and implementation of HDMLP, which combines our performance model with our analysis of access patterns.
As we know the PRNG seed, we can exactly compute , and with this prefetch data in the correct access order into the staging buffer (satisfying Rule 1). Once a sample is read, a worker will access it again at the earliest in the next epoch, and every sample that follows in the current epoch is necessarily accessed earlier. Therefore, we can approximate Rules 2–4 by immediately dropping samples from the staging buffer after access, freeing up space for samples that (with high probability) will be accessed sooner.
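The resulting staging-buffer behavior can be sketched as follows (illustrative, counting buffer slots rather than bytes):

```python
from collections import deque

def staging_buffer_trace(access_order, capacity):
    """Clairvoyant staging buffer: prefetch strictly in access order
    (Rule 1) and drop each sample immediately after it is consumed,
    freeing a slot for the next sample in the sequence. Returns the
    buffer occupancy observed just before each consumption."""
    buffered = deque()
    next_fetch = 0
    occupancy = []
    for consumed in range(len(access_order)):
        # Fill every free slot with the next samples in access order.
        while len(buffered) < capacity and next_fetch < len(access_order):
            buffered.append(access_order[next_fetch])
            next_fetch += 1
        occupancy.append(len(buffered))
        assert buffered.popleft() == access_order[consumed]  # in-order
    return occupancy
```

Because eviction happens exactly at consumption, the buffer always holds the samples that will be needed soonest, approximating Rules 2–4 without any search.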
While this determines which samples to fetch to the staging buffer at what time, we need to use our performance model to decide from where to fetch samples. Because we know for each worker, every worker knows where every sample is cached, and we can thus select the fetch option with minimal time.
Finally, we define the strategy used by the other storage classes. Suppose the worst case, where a worker always waits before consuming a sample. Then the total training time is
$$\sum_{m} \left(\frac{d_{A_w(m)}}{c_w} + t_{\mathrm{fetch}}(m) + t_{\mathrm{store}}(m)\right).$$
We fill the other storage classes to minimize this. If we ignore the fixed terms (the compute and store times do not depend on the data source), we need to minimize $\sum_m t_{\mathrm{fetch}}(m)$. As we can select the fastest location to fetch from, this becomes
$$\sum_{m} d_{A_w(m)} \min\left(\frac{k}{r_{\mathrm{PFS}}(k)},\; \frac{1}{\min(r_{j_r}(t_{j_r}), n_w)},\; \frac{1}{r_{j_l}(t_{j_l})}\right),$$
where $j_r$ and $j_l$ are the fastest remote and local storage class holding sample $A_w(m)$, respectively. If a file is not available locally or at any remote worker, we define the respective throughput to be 0. Letting $f_i$ be the access frequency of sample $i$ for this worker and assuming a static system (i.e., samples are already loaded in storage classes and no parameters change), this becomes a sum over all samples:
$$\sum_{i=1}^{n} f_i\, d_i \min\left(\frac{k}{r_{\mathrm{PFS}}(k)},\; \frac{1}{\min(r_{j_r}(t_{j_r}), n_w)},\; \frac{1}{r_{j_l}(t_{j_l})}\right).$$
Assuming that samples are similarly sized, we can conclude:
- When $f_i$ is large (i.e., a worker accesses a sample frequently), we want $r_{j_l}(t_{j_l})$ to be large, and therefore should cache the sample in a fast local storage class.
- As $r_{\mathrm{PFS}}(k)$ is often constant or decreasing with many readers, we want to minimize the number of PFS readers $k$ to reduce PFS contention. We also want $r_{j_r}(t_{j_r})$ to be large for samples where $r_{j_l}(t_{j_l})$ is small (i.e., samples not cached locally, or cached only in slow storage). Both imply samples should be well-distributed among workers.
Recall that the access frequency of a sample is expected to vary across workers, and Lemma 1 implies that when it is small on one worker, it will be large on at least one other worker (and vice versa). We therefore use access frequencies to make the caching decision: a worker fetches the samples it will access most frequently to its fastest storage class, and so on for slower classes, until either it has cached the entire dataset or filled its local storage.
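A greedy sketch of this placement policy (assuming similarly sized samples; names and the simple spill rule, which never back-fills a faster class, are illustrative):

```python
def assign_to_storage(frequencies, sizes, classes):
    """Greedy cache placement: samples with the highest local access
    frequency go to the fastest storage class, spilling to slower
    classes as capacities fill. `classes` is a list of
    (name, capacity_mb) pairs ordered fastest first. Returns a
    {sample_id: class_name} map for everything that fits locally."""
    order = sorted(frequencies, key=frequencies.get, reverse=True)
    placement, level = {}, 0
    remaining = [cap for _, cap in classes]
    for sample in order:
        while level < len(classes) and sizes[sample] > remaining[level]:
            level += 1          # this class is full; spill to the next
        if level == len(classes):
            break               # local storage exhausted
        remaining[level] -= sizes[sample]
        placement[sample] = classes[level][0]
    return placement
```

Samples left unplaced are those with the lowest local frequency, which, by the lemma above, are exactly the ones most likely to be cached well on some other worker.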
The last step is to define the fetch order. Our analysis has thus far assumed all storage classes have already been filled, but this would require a potentially costly prestaging step that cannot be overlapped with training. We instead follow Rule 1 and use the access sequence to ensure we always fetch samples in order of access. We summarize the overall design in Fig. 4.
We now briefly describe the implementation of this design, summarized in Fig. 5. HDMLP consists of a core backend implemented in C++ and a generic Python interface that exposes the functionality for integration with existing deep learning frameworks.
The Python interface provides the Job class, which represents the execution of a machine learning job on a particular dataset. A single Python process can run multiple jobs at the same time (e.g., training and validation). This only requires the user to specify a few parameters, such as the dataset, batch size, and the number of epochs. The random seed that generates the access sequence can either be specified manually by the caller or generated by the library.
Once initialized, the Job exposes two key aspects: an attribute buffer_p that is a pointer to HDMLP’s underlying staging buffer, allowing zero-copy access to samples; and a get method, which returns samples and their labels, enabling iterator-style access to data.
It is easy to incorporate this into existing training pipelines. We provide convenience wrappers to replace the data loader and commonly used datasets in PyTorch. Using these, minimal changes are required to integrate HDMLP with existing PyTorch codebases, as we demonstrate in Fig. 6.
The core of HDMLP comprises a central manager, generic backends for storage and prefetching, and utilities. For simplicity, the parameters for our performance model are specified in a system-wide configuration file, with parameterized values (e.g., PFS bandwidth for a given number of readers) inferred using linear regression when the exact value is not available. This could be generalized to determine performance parameters automatically.
Storage backends need only implement a generic interface, and HDMLP currently supports filesystem- and memory-based storage backends, which are sufficient to support most storage classes (including RAM, SSDs, and HDDs). Additional backends (e.g., for key-value stores or databases) can easily be added.
For tracking samples, a metadata store keeps a catalog of locally cached samples. A distributed manager class handles all distributed operations among workers, using MPI for the underlying communication infrastructure. During setup, it is responsible for distributing a worker’s access sequence to all other workers (with an allgather operation).
It also provides functionality for serving locally cached samples to and requesting samples from remote nodes. While it is always possible for a worker to know that a sample is not cached by any other worker, it is not possible (without additional metadata traffic) for a worker to know whether a worker that will cache a sample has successfully prefetched it. As requesting a remote sample that has not yet been cached results in wasted communication and increased stall time, we use a heuristic to estimate when a sample has been cached. Assuming samples are of comparable size and prefetching is load balanced, if the local prefetching has reached the corresponding access frequency, then the remote worker likely has, too. Note that the failure of this heuristic is not an error, as HDMLP detects such cases, but we wish to minimize such occurrences due to their performance impact. We confirmed that, in practice, there are very few false positives.
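The heuristic can be stated compactly; a sketch (function and argument names are ours, not HDMLP's):

```python
def likely_prefetched_remotely(remote_sample_freq, local_prefetched_freqs):
    """Workers prefetch in decreasing order of access frequency. If
    this worker's own (load-balanced) prefetching has already reached
    a sample of frequency <= remote_sample_freq, the remote worker has
    likely prefetched its sample of that frequency too. A false
    positive is detected and retried, not treated as an error."""
    return bool(local_prefetched_freqs) and \
        min(local_prefetched_freqs) <= remote_sample_freq
```

The check needs no extra metadata traffic: it uses only local prefetch progress and the frequencies already exchanged during setup.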
The core prefetching logic is managed by prefetcher backends, which implement all the logic for prefetching to a particular storage class. Adding a new prefetcher again only requires implementing a simple, generic interface. HDMLP provides a memory-based prefetcher and a filesystem-based prefetcher (which uses mmap to access files). We also implement a special prefetcher for the staging buffer, which is filled in a circular manner. This prefetcher coordinates with the Python interface via a producer/consumer queue to ensure that the consumer knows when samples are available, and that the prefetcher knows when samples have been consumed (and therefore can be replaced).
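The staging-buffer prefetcher's producer/consumer coordination can be sketched with a bounded queue (a minimal illustration; HDMLP's actual implementation is in C++ and shares the buffer memory with the Python interface):

```python
import threading
from collections import deque

class StagingBuffer:
    """Minimal sketch of a circular staging buffer: a prefetcher
    thread produces samples in access order, and the consumer signals
    each consumption so the slot can be reused (bounded capacity)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        lock = threading.Lock()
        self.not_full = threading.Condition(lock)
        self.not_empty = threading.Condition(lock)

    def put(self, sample):            # called by the prefetcher
        with self.not_full:
            while len(self.queue) >= self.capacity:
                self.not_full.wait()
            self.queue.append(sample)
            self.not_empty.notify()

    def get(self):                    # called by the consumer
        with self.not_empty:
            while not self.queue:
                self.not_empty.wait()
            sample = self.queue.popleft()
            self.not_full.notify()    # slot freed for the prefetcher
            return sample
```

Because both conditions share one lock, the prefetcher blocks whenever the buffer is full and wakes as soon as the consumer frees a slot, matching the circular-fill behavior described above.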
Finally, the configuration, storage classes, and prefetchers are managed by a prefetcher manager class, which coordinates the different components. We also provide convenience utilities, including an optimized, OpenCV-based Bradski (2000) image preprocessing pipeline that avoids unnecessary copies, which we observed could be a bottleneck otherwise.
6 Performance Simulator
We developed a Python performance simulator based on our performance model to evaluate different data loading strategies. The simulator supports arbitrary dataset, system, and I/O strategy configurations. We focus on four cases:
$D \leq s_1$: The dataset fits into the first storage class (typically RAM) of each worker. This should not be a challenging situation, but is nevertheless important, as it occurs with small datasets or workers with large amounts of RAM.
$s_1 < D \leq \sum_j s_j$: The dataset fits in the aggregate storage of a worker. This scenario is interesting, as while a worker can cache the entire dataset, it must use multiple storage classes to do so.
$\sum_j s_j < D \leq N \sum_j s_j$: The dataset can be cached in the aggregate storage of all workers. This requires workers to exploit distributed caching and to minimize the number of PFS accesses.
$D > N \sum_j s_j$: The dataset is too large to be cached even by the aggregate storage of all workers. While this is an uncommon scenario today when using many workers, it is interesting to examine, especially as dataset sizes grow in the future. Further, this scenario already occurs when large datasets are used on small training clusters.
We study the following policies:
Perfect: This simulates the case where no stalls occur and provides a lower bound, although it is not realistic in practice.
Naive: This simulates loading from the PFS with no prefetching or caching.
StagingBuffer: This fills a staging buffer according to the reference string, fetching data from a given location and dropping it after it is consumed. When configured to prefetch data from the PFS, this simulates the double buffering or tf.data policies.
DeepIO: This simulates both the ordered and optimistic modes of DeepIO Zhu et al. (2018). In the latter mode, this may change the access order.
ParallelStaging: This simulates parallel data staging, which also changes the access order, as only locally-available samples are accessed by a worker.
LBANN: This simulates the LBANN data store Jacobs et al. (2019), supporting both its dynamic and preloading approaches. As this only caches data in memory, it will fail if the dataset exceeds the aggregate worker memory.
LocalityAware: This simulates the locality-aware approach of Yang and Cong (2019). When using this policy, we reorder batches at the beginning of the simulation to correspond to the logic described in their paper.
Frequency: This simulates our proposed prefetching and caching strategy (Sec. 5).
6.1 Simulation Results
For brevity, we mainly report simulation results for a single setup simulating a small cluster in Fig. 7. Each plot summarizes the execution time, and the stacked bars give the proportion of I/O time spent fetching from a particular storage class. We use workers, each on a dedicated node with a compute throughput of MB/s, a preprocessing rate MB/s, and an inter-worker bandwidth MB/s. We configured a MB staging buffer, and two further storage levels representing 50 GB of RAM and a 100 GB local SSD. Each storage level used two prefetcher threads, and we set MB/s and MB/s. For PFS read throughput, we set MB/s, MB/s, MB/s, and MB/s. These choices were based on benchmarks of the Ault cluster at CSCS CSCS (2020b).
We simulate datasets with normally distributed file sizes, varying the mean and standard deviation of the distribution and the number of samples based on different representative datasets. A global batch size of was used.
Scenario 1 (, , , , expected size MB): This case is representative of many small research datasets commonly used in practice. As expected, there is relatively little difference in performance for most policies, and they are close to the lower bound. The exception is the naive policy, which is longer than the best-performing policy, illustrating the importance of proper I/O handling.
Scenario 2 (, MB, , , GB, ImageNet-1k Russakovsky et al. (2015)): Here, HDMLP is the best-performing policy, and is very close to the theoretical lower bound. There are several key factors behind this performance: HDMLP does not require an initialization phase (in contrast to data staging), reduces PFS reads (whereas StagingBuffer policies always read from the PFS), and utilizes all available resources (in contrast to the LBANN data store, which uses only RAM).
Scenario 3 (, MB, , , GB, OpenImages Kuznetsova et al. (2018), and MB, , GB, Places 365-Challenge Zhou et al. (2017)): HDMLP once again performs the best and is close to the lower bound. Note that beginning with this scenario, the LBANN data store no longer supports the datasets because they are larger than the aggregate RAM. We observe that the DeepIO ordered mode performs poorly here, since it fetches uncached samples from the PFS and does not consider access frequency for assigning samples.
Scenario 4 (, MB, , , GB, ImageNet-22k Deng et al. (2009)): In this case, the dataset exceeds the aggregate storage capacity of the cluster. HDMLP again performs well and approaches the lower bound. DeepIO and parallel data staging are able to perform better, as they never access the PFS, mitigating the impact of the large dataset size. However, they no longer access the entire dataset, significantly impacting potential accuracy during training.
Scenario 5 (, , MB, , , TB, CosmoFlow Oyama et al. (2020)): Here we simulate the large CosmoFlow dataset, representative of emerging scientific datasets, on a cluster of 32 workers. We observe similar trends to Scenario 4, but at larger scale, indicating that HDMLP scales well with dataset size and cluster resources while still providing access to the entire dataset.
6.2 Environment Evaluation
In addition to comparing the performance of I/O policies, our simulator can also be used to quantify the impact of changes to a system on training time. This can be used to identify promising hardware upgrades or when designing new clusters to meet particular training requirements.
To illustrate this, we consider Scenario 3 from above with HDMLP as the policy. In this case, the lower bound on runtime is hours. We then simulated setups with staging buffers of size GB, GB, and GB, which all resulted in runtimes of hours, indicating that the staging buffer is not a limiting factor in this configuration. We subsequently fixed the staging buffer size at GB. We next considered configurations with an added GB, GB, or GB of RAM as an additional storage class, and an added GB, GB, or GB of slower (random read bandwidth of MB/s for two threads) storage. The performance for these configurations is illustrated in Fig. 8.
We observe that, while the best performance is achieved by maximizing total storage, different combinations of storage can often be used to achieve a given performance level if other factors (e.g., cost) need to be optimized for. For example, the performance with GB of slower storage is comparable to using GB of faster RAM. This demonstrates why it is critical that an I/O framework be able to automatically adapt to many different storage backends.
We now experimentally validate HDMLP and compare it to both PyTorch’s data loader and TensorFlow’s tf.data. Our experiments use the CSCS Piz Daint supercomputer CSCS (2020a), with Cray XC50 nodes, each equipped with an Intel Xeon E5-2690 CPU, 64 GB of RAM, and one Nvidia P100 GPU. Nodes are interconnected via a Cray Aries network with a Dragonfly topology and a bandwidth of 10.5 GB/s. We start with small- and medium-scale I/O benchmarks, followed by a full large-scale training run.
Frameworks For our benchmarks, we measure the stall time, or the time spent waiting for I/O without any additional work to do. The lower the stall time, the less a framework’s runtime is impacted by I/O. We evaluate four frameworks:
DataLoader: The single-threaded PyTorch data loader, without any prefetching. The Torchvision ImageFolder is used for the dataset.
Multi-threaded Prefetching: The PyTorch data loader using four threads for performing I/O and prefetching enabled, again using ImageFolder.
TensorFlow: TensorFlow using tf.data and reading from TFRecords, using four threads.
HDMLP: Our HDMLP implementation, integrated into PyTorch and using HDMLPImageFolder. HDMLP was configured with a MB staging buffer and a MB memory storage class. Four threads per worker were used for prefetching to the staging buffer and two threads for the memory storage class.
Dataset and File Structure. As a benchmark dataset, we use ImageNet-1k, which consists of samples totaling approximately GB. The data initially resides on a shared PFS. The ImageFolder and HDMLP implementations follow the standard folder structure, where each sample is stored in a separate image file, organized into one folder per class. TensorFlow, on the other hand, uses the TFRecord format, which groups samples from the dataset together for a total of files (each of which is MB). The shuffling procedure used in tf.data first randomizes the sequence of files to read, then sequentially prefetches contiguous samples within each file (with an automatic buffer size) and shuffles that buffer.
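The two-level tf.data shuffle can be mimicked in a few lines of plain Python. The sketch below is a simulation of the scheme just described, not the tf.data implementation: it randomizes the file order, then passes the sequential sample stream through a bounded shuffle buffer.

```python
import random

def simulate_tfdata_shuffle(num_files, samples_per_file, buffer_size, seed=0):
    """Two-level shuffle: level 1 shuffles the file read order; level 2
    routes the resulting sequential stream through a bounded buffer,
    emitting a random element whenever the buffer overflows."""
    rng = random.Random(seed)
    files = list(range(num_files))
    rng.shuffle(files)  # level 1: randomize file order
    stream = [(f, s) for f in files for s in range(samples_per_file)]
    out, buf = [], []
    for item in stream:  # level 2: bounded shuffle buffer
        buf.append(item)
        if len(buf) > buffer_size:
            out.append(buf.pop(rng.randrange(len(buf))))
    while buf:  # drain the buffer at the end
        out.append(buf.pop(rng.randrange(len(buf))))
    return out
```

A small buffer only reorders samples locally within the stream, which is why this scheme keeps reads mostly sequential but deviates from full-dataset randomization.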
In all cases, we use a per-worker batch size of 128 and standard data preprocessing (random cropping, resizing, and flips, plus normalization). Note that on Piz Daint, as there is no local storage and only GB of RAM per node, the dataset cannot be cached in its entirety by any single worker. Therefore, to get good performance, a worker must make aggressive use of prefetching and, to fully utilize I/O resources, of the storage of other workers.
7.1 I/O Time
We first examined training SqueezeNet Iandola et al. (2016) on a 100-class subset of ImageNet using up to eight workers. SqueezeNet is a small model, hence there is little opportunity for compute to hide I/O. This case is representative of many research situations, where smaller clusters are often used. We present results in Fig. 9. Stall times are relatively large, due to the inability of most frameworks to hide I/O. We observe that the single-threaded data loader consistently performs poorly; in comparison, the multi-threaded version performs much better. This confirms our assumption that prefetching is critical to good performance, and that multiple threads are needed to utilize the PFS bandwidth. HDMLP significantly outperforms the other frameworks. In fact, we observe superlinear speedups in several cases: this is because, as additional resources are provided, frameworks can better utilize the I/O subsystem. We particularly observe this with HDMLP, which can then use more memory for caching and shift away from costly PFS accesses.
We then trained ResNet-50 He et al. (2016) as a representative of standard, compute-intensive training regimes. Results are presented in Fig. 10. TensorFlow performs several times faster than PyTorch in this case, as tf.data is able to prefetch further ahead. HDMLP has by far the lowest stall time, being lower than TensorFlow on 8 workers. This is because it avoids costly accesses to the PFS by caching frequently used samples locally and accessing samples cached by remote workers.
We also see good scalability, as stall time decreases with additional nodes (and hence additional I/O capacity) for every framework. HDMLP observes only a small performance improvement when scaling from 16 to 32 workers, which we hypothesize is because it was able to cache the entire dataset in distributed memory using 16 workers, and is instead limited by available inter-worker network bandwidth.
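For a worker to fetch samples cached by its peers, the sample-to-worker mapping must be computable locally, without coordination. As a simplified illustration only — hash-based assignment, not HDMLP's actual placement, which is driven by its access-frequency analysis and performance model:

```python
import hashlib

def cache_owner(sample_id, num_workers):
    """Deterministically map a sample to the worker that caches it, so
    any worker can locate remote copies without asking a directory
    service. Uses a stable hash (not Python's salted hash())."""
    digest = hashlib.sha1(str(sample_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_workers
```

With such a mapping, each of 16 workers holds roughly 1/16 of the dataset, and lookups for non-local samples go over the interconnect rather than the PFS, consistent with the network-bandwidth-limited behavior observed above.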
7.2 End-to-end Training
Fig. 11 presents large-scale training and stall times for a full training run of ResNet-50 on ImageNet-1k. Note that these are separate runs from those reported in Fig. 10, hence stall times are not exactly comparable. At 32 workers, I/O is a relatively small part of the overall training time. However, as the scale increases, I/O begins to rapidly dominate the runtime of TensorFlow and PyTorch, due to increased pressure on the PFS from many concurrent reads. This is especially the case for TensorFlow, where the more aggressive prefetching results in a higher load being placed on the PFS. In every case, HDMLP is able to keep stall times low and relatively constant by avoiding accesses to the PFS.
Overall, HDMLP is able to better exploit I/O resources for realistic applications by caching data across the entire system in order to minimize stall times.
8 Related Work
We have already discussed existing approaches to optimizing training I/O in Sec. 2.1 and to integrated prefetching and caching in Sec. 3.1. Others have studied the performance of training I/O in specialized situations, such as Pumma et al. (2019) for LMDB in Caffe, Chowdhury et al. (2019) for BeeGFS, and Chien et al. (2018) for TensorFlow I/O.
Beyond these, efficient distributed I/O has long been studied in the context of scientific computing applications Thakur et al. (1999); Li et al. (2003); Howison et al. (2010). Additionally, distributed and hierarchical caching has been studied in other contexts, such as video-on-demand content Koch et al. (2018) and content delivery networks Borst et al. (2010). Insights from these works informed the design of HDMLP.
Clairvoyance has long been used in theoretical studies of prefetching and caching, but has been difficult to translate to practical applications with complex I/O access patterns. Machine learning, where the access pattern is random but predictable given the random seed that generates it, is now an application that fully benefits from it. Using clairvoyance, we make a probabilistic analysis of access patterns and show that there is almost always an imbalance in the frequency with which a worker accesses a particular sample, which we combine with a performance model to drive a hierarchical caching and prefetching policy. HDMLP provides a simple, powerful interface that can be used in existing training pipelines to improve their I/O performance.
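The key observation can be made concrete in a few lines: given the shared seed, any worker can reconstruct the entire future access sequence. The contiguous per-worker partitioning of each epoch below is one common convention, assumed here for illustration:

```python
import random
from collections import Counter

def access_sequence(seed, num_samples, epochs):
    """Reconstruct the global shuffled access order from the seed alone."""
    rng = random.Random(seed)
    return [rng.sample(range(num_samples), num_samples) for _ in range(epochs)]

def worker_access_counts(seed, num_samples, epochs, num_workers, worker):
    """How often this worker sees each sample, assuming each epoch's
    permutation is split into contiguous per-worker blocks."""
    per = num_samples // num_workers
    counts = Counter()
    for epoch in access_sequence(seed, num_samples, epochs):
        counts.update(epoch[worker * per:(worker + 1) * per])
    return counts
```

With the full sequence known in advance, a prefetcher can rank samples by how soon and how often they will be needed, and the per-sample counts make the access-frequency imbalance directly visible.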
As the compute throughput of accelerators continues to grow faster than that of data movement, the cost of I/O—and the importance of optimizing it—will only increase. Our work here serves as an initial step toward incorporating more detailed analyses of I/O into runtime frameworks. We expect that by building on these insights, the I/O bottleneck can be addressed.
Appendix A Supplementary Material
|Unit||Description|
| ||Number of workers|
|MB/s||Network bandwidth for inter-worker communication|
|MB/s||Network bandwidth for communication with the PFS|
| ||Number of threads for prefetching to a storage class|
|MB||Capacity of a storage class|
|MB||Total local storage of a worker|
|MB/s||Random aggregate read throughput of a storage class (with reader threads)|
|MB/s||Random aggregate write throughput of a storage class (with writer threads)|
|MB/s||Random aggregate read throughput of the PFS (with clients)|
| ||Number of files|
|MB||Size of a sample|
|MB||Size of the whole dataset|
| ||Number of epochs|
| ||Number of iterations|
| ||Subset of samples processed at an iteration|
| ||Subset of samples processed by a worker at an iteration|
| ||Access sequence of a worker|
Proof of Lemma 1.
Assume for the sake of contradiction that every other node accesses the sample times for some , , and . Then the total accesses to this sample are
This contradicts the fact that every sample is accessed exactly times during training.
The proof of the other bound proceeds similarly. Assume for the sake of contradiction that every other node accesses the sample times. Then the total accesses to this sample are
which again contradicts that every sample is accessed exactly times. ∎
- TensorFlow: large-scale machine learning on heterogeneous systems.
- Detecting work zones in SHRP 2 NDS videos using deep learning based computer vision. In 17th IEEE International Conference on Machine Learning and Applications (ICMLA).
- YouTube-8M: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.
- Integrated prefetching and caching in single and parallel disk systems. Information and Computation 198 (1).
- Minimizing stall time in single and parallel disk systems. Journal of the ACM (JACM) 47 (6).
- Minimizing stall time in single and parallel disk systems using multicommodity network flows. In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques.
- Parallel prefetching and caching is NP-hard.
- A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal 5 (2).
- Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Computing Surveys 52 (4).
- Distributed caching algorithms for content distribution networks. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM).
- Optimization methods for large-scale machine learning. SIAM Review 60 (2).
- The OpenCV library. Dr. Dobb's Journal of Software Tools.
- A study of integrated prefetching and caching strategies. ACM SIGMETRICS Performance Evaluation Review 23 (1).
- TVM: an end-to-end optimization stack for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI).
- cuDNN: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.
- Characterizing deep-learning I/O workloads in TensorFlow. In International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS).
- I/O characterization and performance evaluation of BeeGFS for deep learning. In Proceedings of the 48th International Conference on Parallel Processing (ICPP), pp. 1–10.
- Piz Daint.
- Swiss National Supercomputing Centre.
- ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Aluminum: an asynchronous, GPU-aware communication library optimized for large-scale training of deep neural networks on HPC systems. In Workshop on Machine Learning in HPC Environments (MLHPC).
- XLA: optimizing compiler for machine learning.
- Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Tuning HDF5 for Lustre file systems. In Workshop on Interfaces and Abstractions for Scientific Data Storage.
- SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.
- Parallelizing training of deep generative models on massive scientific datasets. In IEEE International Conference on Cluster Computing (CLUSTER).
- A simple characterization of provably efficient prefetching algorithms. In Proceedings of the International Conference on Dependable Systems and Networks.
- In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA).
- Near-optimal parallel prefetching and caching. SIAM Journal on Computing 29 (4).
- Category-aware hierarchical caching for video-on-demand content on YouTube. In Proceedings of the 9th ACM Multimedia Systems Conference (MMSys).
- Exascale deep learning for climate analytics. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
- The Open Images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982.
- Parallel netCDF: a high-performance scientific I/O interface. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
- Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV).
- NVIDIA tensor core programmability, performance & precision. In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
- CosmoFlow: using deep learning to learn the universe at scale. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
- NVIDIA Collective Communications Library.
- NVIDIA Data Loading Library.
- AI and compute.
- The case for strong scaling in deep learning: training large 3D CNNs with hybrid parallelism. arXiv preprint arXiv:2007.12856.
- PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS).
- Scalable deep learning via I/O analysis and optimization. ACM Transactions on Parallel Computing (TOPC) 6 (2).
- ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3).
- Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL).
- Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Data sieving and collective I/O in ROMIO. In Seventh Symposium on the Frontiers of Massively Parallel Computation.
- Accelerating data loading in deep neural network training. In International Conference on High Performance Computing, Data, and Analytics (HiPC).
- Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6).
- Entropy-aware I/O pipelining for large-scale deep learning on HPC systems. In International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).