This is a crosspost from Mark Nelson's blog, "I like to make distributed systems go fast." See the original post here.

An Initial Look at Deep Learning IO Performance

Abstract

This blog post describes an investigation of the IO behavior of TensorFlow and PyTorch during resnet50 training on Lambda Labs' 8x V100 GPU instances. Both ephemeral local NVMe storage and network attached persistent storage were tested. The local NVMe storage was fast enough to sustain the throughput needed to match the synthetic test results. The network attached persistent storage may not be able to fully saturate 8 V100 GPUs during training, though it can achieve nearly the same level of performance as the local storage so long as TFRecords are utilized. Further, there are specific behaviors and bottlenecks in TensorFlow and PyTorch that can reduce training performance when using real data from ImageNet.

Acknowledgements

Thank you to Michael Balaban at Lambda Labs for providing access to their GPU cloud for this testing. Thank you to Chuan Li for the creation of his TensorFlow benchmarking tools. Thank you also to Andrej Karpathy, Toby Boyd, Yanan Cao, Sanjoy Das, Thomas Joerg, and Justin Lebar for their excellent blog posts on deep learning and XLA performance that helped inform this article. I hope that this post will be as useful for others as your work and writing was for me.

Introduction

…just because you can formulate your problem as RL doesn’t mean you should. If you insist on using the technology without understanding how it works you are likely to fail.

        Andrej Karpathy, A Recipe for Training Neural Networks, 2019

That was the phrase that stuck in my head when I first started this project. What project you may ask? I want to understand how deep learning experiments utilize fast storage devices. Not just any experiments either: real ones, preferably big. That’s how I happened upon Andrej Karpathy’s blog. He is the former Sr. Director of AI at Tesla and knows a thing or two about training big neural networks. I’ve spent the last decade working on Ceph and have worked on distributed systems and distributed storage for nearly 2 decades at this point. But training neural nets? The closest I’ve come was back in the early 2000s when I tried to build a tool to predict video game framerates. I scraped benchmark numbers from review websites and built M5 decision trees based on hardware and video card settings. It sort of worked, but was terribly overtrained on a small (~4000 sample) dataset. Training with petabytes of data to teach an AI how to responsibly drive a car? I can already feel a bit of imposter syndrome setting in.

Thankfully my goal is comparatively modest. I don't need to build a cutting-edge classifier or explore the intricacies of manually implementing back-propagation. I simply want to understand the IO patterns that are involved when training on big datasets with fast GPUs so I can help researchers speed up their work. Up until now, my ability to do this was fairly limited. At the day job I've had access to a small group of nodes with extremely modest GPUs. I set up runs with MLPerf, but the datasets (WMT G-E and COCO) easily fit into memory. Other than a short burst of read traffic at the very beginning of training there was very little IO. Recently I had the opportunity to meet Michael Balaban, Co-Founder of Lambda Labs. I told him what I wanted to do and he gave me access to Lambda's GPU cloud and beta persistent storage to give it a try. I was able to grab one of Lambda's 8x Tesla V100 instances (these things are incredibly popular, so it's best to grab one early in the morning!). Not all of Lambda's instance types currently have access to the persistent storage, but the V100 instances in the Texas zone do. Once secured, I got to work.

TensorFlow - Synthetic

Before even attempting to run tests with real data, I realized I needed a baseline to start with. Luckily, Chuan Li, Lambda’s Chief Scientific Officer, wrote a tool for running TensorFlow benchmarks and made it available on github here. One of the advantages of Lambda’s cloud is that they’ve already bundled up many popular tools for running deep-learning workloads into one package called Lambda Stack which comes pre-installed when you start an instance. This made it fast to get started, though I did run into one issue. Lambda Stack comes standard with TensorFlow 2, but Chuan Li’s tool relies on a TensorFlow benchmark submodule that is designed to work with TensorFlow 1. Luckily, the parent repository was unofficially updated to work with Tensorflow 2 (with a warning that it is no longer being maintained). A quick “git checkout master” in the “benchmarks” submodule directory got everything working. Chuan Li’s tool makes it simple to run tests with several preconfigured templates already included. I chose the fp16 resnet50 configuration as it should be fast at processing images and is fairly standard.
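
Roughly, the setup looked like the following sketch (treat the repository URL as approximate; the key step is the checkout inside the "benchmarks" submodule):

git clone --recursive https://github.com/lambdal/lambda-tensorflow-benchmark.git   # URL assumed
cd lambda-tensorflow-benchmark/benchmarks
git checkout master   # unofficial TensorFlow 2 compatible code in the parent benchmarks repo
cd ..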

TF_XLA_FLAGS=--tf_xla_auto_jit=2 ./batch_benchmark.sh X X 1 100 2 config/config_resnet50_replicated_fp16_train_syn

Using the invocation provided in the benchmark README.md file, I was able to quickly run benchmarks with synthetic data on up to 8 V100 GPUs in the node. At one point I got stuck, hitting what at first appeared to be an unexplainable 25% performance loss. I reran the tests multiple times and even monitored GPU clockspeeds/temperatures in nvidia-smi with no luck. Ultimately I discovered my error: in the slow cases, I had inadvertently left out the "TF_XLA_FLAGS=--tf_xla_auto_jit=2" environment variable. It turns out that setting this allows TensorFlow to compile and execute functions with XLA (Accelerated Linear Algebra) support, which is a pretty big win for these tests.

At this point I decided that I needed to understand how Chuan Li's tool works. It turns out that he is using the same base tf_cnn_benchmarks.py benchmark code that companies like Nvidia and Dell also use for benchmarking their GPU solutions. I spent some time running it directly with Dell's settings from their deep learning overview here. Unfortunately those tests had mixed results, even after various tweaks. While researching the XLA issues I mentioned earlier, however, I made an even better discovery on the TensorFlow website: an excellent blog post with performance data written by some of the core TensorFlow developers. It's now 4 years old, but still appears to be quite valid. The tuning options used were simpler and delivered higher performance than the other configurations I had come across.

Training with synthetic data in Lambda's cloud resulted in similar performance to what the TensorFlow developers reported. In fact, using their own settings yielded slightly faster results when running on Lambda's 8x V100 instance! It was incredibly encouraging to me that even in Lambda's cloud environment with virtual machine instances I could achieve performance that was as fast or faster than what the TensorFlow developers were reporting.

Choosing a Real Data Set

The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data.

        Andrej Karpathy, A Recipe for Training Neural Networks, 2019

Having convinced myself that I had Tensorflow operating reasonably efficiently in synthetic tests, it was time to start thinking about what dataset to use for “real” training. The largest and most obvious choice is ImageNet. ImageNet is composed of over 1.2 million categorized images that form a roughly 160GB training dataset. It is also the largest dataset I could find that was publicly accessible. Downloading it isn’t so easy however. The only version that I could access is the ImageNet Object Localization Challenge dataset hosted on kaggle.
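
For anyone else going down this path, the download via the Kaggle CLI looks something like this (you need a Kaggle account, API credentials in ~/.kaggle/kaggle.json, and to accept the competition rules first; the competition slug below is my best recollection, so verify it on the Kaggle page):

pip install kaggle
kaggle competitions download -c imagenet-object-localization-challenge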

After finally figuring out how to download the data, it was time to follow Andrej’s advice and try to learn something about it. While ImageNet is curated and annotated, it has many images of different sizes, dimensions, and pixel counts. Images also come from many sources with different levels of quality. Through the power of stack-exchange I was able to find a bash one-liner script to generate a histogram of image sizes:

find . -type f -print0 | xargs -0 ls -l | awk '{size[int(log($5)/log(2))]++}END{for (i in size) printf("%10d %3d\n", 2^i, size[i])}' | sort -n

Roughly 80% of the images are in the 64KB or 128KB size bins. Almost all of the remaining images are smaller. That gives us a pretty good idea of what kind of IOs to expect during classification. Or at least…it does for frameworks that read those images directly. In Tensorflow’s case, there’s an alternative format called TFRecord. TFRecords are basically collections of image data sequentially laid out in much larger files. Instead of iterating over thousands or millions of individual image files, TFRecords allow Tensorflow to instead stream fewer, larger files that each house multiple images. It’s a one time cost to pre-process the data so Tensorflow has less work to do during training. After I downloaded the ImageNet data I took a shot at converting the ImageNet LOC data into TensorFlow records. Luckily, the TensorFlow tpu github repository already has a tool that can do this. I had to manipulate the dataset slightly, but ultimately this process worked (at least for the training data):

pip install gcloud google-cloud-storage
pip install protobuf==3.20.1

mkdir ~/data/ImageNetFoo
ln -s ~/data/ImageNet/ILSVRC/Data/CLS-LOC/train ~/data/ImageNetFoo/train
ln -s ~/data/ImageNet/ILSVRC/Data/CLS-LOC/val ~/data/ImageNetFoo/val
ln -s ~/data/ImageNet/ILSVRC/Data/CLS-LOC/test ~/data/ImageNetFoo/test
ln -s ~/data/ImageNet/LOC_synset_mapping.txt ~/data/ImageNetFoo/synset_labels.txt
python imagenet_to_gcs.py --raw_data_dir=/home/ubuntu/data/ImageNetFoo --local_scratch_dir=/home/ubuntu/ExaltedOrbs/ImageNet/tf_records --nogcs_upload

Perhaps I should say that this worked so long as the original dataset was located on the local NVMe drive. The persistent storage didn’t fare as well. Attempting to decompress ImageNet on the persistent storage resulted in blowing past the max number of open files allowed with errors like:

OSError: [Errno 24] Too many open files.

Unfortunately this couldn’t be fixed on the instance. It appeared to be passed through from the host and the persistent storage was completely unusable until the instance was rebooted. Recently I spoke to one of Lambda’s engineers and they are working on a fix. (It may already be implemented by the time you read this!) I also want to note that the persistent storage is still in beta so issues like this are not entirely unexpected. Having said that, before hitting the error it was significantly slower extracting ImageNet on the persistent storage vs on the local NVMe storage. It’s probably best to extract ImageNet locally and then write the large TFRecords to the persistent storage during the conversion process. Luckily extracting ImageNet to local storage was fine, and storing the original archive and the resulting TFRecords on the persistent storage worked perfectly fine as well.

FIO - Baseline IO Results

Next, I turned my attention to running baseline tests on Lambda’s local and persistent storage using fio. Fio is a highly configurable and well respected benchmark in the storage community and perfect for generating baseline results. I decided to use a dataset size that is roughly similar to ImageNet (200GB), the libaio engine in fio with direct IO, and an appropriately high IO depth to let the NVMe drives stretch their legs a bit.
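
The large-read throughput and 16K random read baselines were all variations on the same recipe; the commands below are a reconstruction rather than a copy of the exact runs:

# Large sequential reads for raw throughput (parameters reconstructed)
fio --ioengine=libaio --direct=1 --bs=1M --iodepth=128 --rw=read --size=200G --numjobs=1 --runtime=300 --time_based --name=throughput
# 16K random reads for IOPS (parameters reconstructed)
fio --ioengine=libaio --direct=1 --bs=16k --iodepth=128 --rw=randread --norandommap --size=200G --numjobs=1 --runtime=300 --time_based --name=iops16k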

Throughput with the local NVMe drive(s) is surprisingly good. The persistent storage is slower, but still might be fast enough at a little over 1GB/s for large reads. The 16K random read IOPS results were somewhat less impressive in both cases. I chose 16K so that I could quickly compare to the tests I ran in my Ceph QEMU/KVM performance blog post here. Without getting into the details, I suspect there's still some room for improved IOPS with Lambda's setup. Luckily though, converting into TFRecords should make TensorFlow throughput-bound instead of latency-bound. What about PyTorch or other tools that want to read images directly though? Fio gives us the ability to simulate that by using its 'bssplit' feature. We can take the size ranges and percentiles generated when examining ImageNet and give fio a similar distribution:

fio --ioengine=libaio --direct=1 --bssplit=2K/1:4K/2:8K/4:16K/8:32K/13:64K/38:128K/33:256K/1 --iodepth=128 --rw=randread --norandommap --size=200G --numjobs=1 --runtime=300 --time_based --name=foo

This isn’t exactly right as we are not reading data spread across millions of files, but it should provide something of an upper bound on what to expect. It looks like the persistent storage can do approximately 10K reads/second at a throughput rate of around 750MB/s. The local storage is about 3-4 times faster. Local storage should be fast enough to support the kind of images/second throughput rates we want to hit in Tensorflow on 8 V100 GPUs, but the jury is still out for the persistent storage.

Tensorflow - ImageNet

Running benchmarks with real data rather than synthetic data is fairly straightforward in Tensorflow. You simply append data_dir and data_name flags to the CLI invocation to let it know where the TFRecords are located:

sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
python ./tf_cnn_benchmarks.py --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --num_gpus=8 --loss_type_to_report=base_loss --data_dir=/home/ubuntu/ImageNet-TF/train --data_name=imagenet

Ouch. Much lower performance with the ImageNet data vs synthetic! This is especially unfortunate given that 4 years ago the TensorFlow developers reported much better results. I spent some time reading and experimenting with different settings. Ultimately, the one setting that made a substantial difference was "datasets_num_private_threads". In the TensorFlow benchmark source code, this setting is described as: "[The] number of threads for a private threadpool created for all datasets computation." I'll go into more detail about what these threads are doing in a bit. For now, let's see how increasing the number of threads affects the results:

Increasing the number of private threads has a dramatic effect on performance, though I was unable to fully match the performance achieved in the synthetic tests on either the local or persistent storage. The local storage fared better at high thread counts, gradually topping out at around 8600 images/second. At high private thread counts the persistent storage topped out between 7000 and 8000 images/second, with a higher degree of variability between runs. I suspect that in this case the persistent storage has likely hit its (per-instance) limit.
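
For reference, each point in that sweep simply appends the flag to the earlier invocation; for example, with 32 private threads (the thread count here is purely illustrative):

python ./tf_cnn_benchmarks.py --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --num_gpus=8 --loss_type_to_report=base_loss --data_dir=/home/ubuntu/ImageNet-TF/train --data_name=imagenet --datasets_num_private_threads=32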

In addition to having a dramatic effect on performance, changing the private thread count also had a large effect on the CPU consumption of the TensorFlow process. CPU usage increases almost linearly with additional private threads up to around 30 cores. What exactly are these private threads doing? To answer that question, I utilized two tools that I often deploy when diagnosing CPU usage in Ceph. When testing with a lower number of private threads, I used Linux's perf tool to look at where cycles are being consumed while the private threads are fully saturated. At higher private thread counts, I used my wallclock profiler, uwpmp, to look at how the private threads spend their time once increasing the thread count no longer improves performance.
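
For those who want to try this kind of analysis themselves, a perf invocation along these lines will produce the cycle breakdown shown below (the exact flags here are illustrative rather than a record of my runs):

# Sample on-CPU call stacks of the running benchmark for 30 seconds, then summarize
sudo perf record -F 99 -g -p $(pgrep -f tf_cnn_benchmarks | head -1) -- sleep 30
sudo perf report -g --stdio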

In the first case with perf, we can get a good view of the work that these private threads are doing:

--77.31%--tensorflow::ThreadPoolDevice::Compute
          |          
          |--51.19%--0x7f511a00c7d8
          |          |          
          |           --51.18%--tensorflow::jpeg::Uncompress
          |--14.48%--tensorflow::ResizeBilinearOp<Eigen::ThreadPoolDevice, unsigned char>::Compute
          |--5.47%--tensorflow::CastOpBase::Compute
          |--2.66%--tensorflow::ReverseV2Op<Eigen::ThreadPoolDevice, unsigned char, int>::Compute

The majority of the cycles are consumed in JPEG decompression and resize operations, along with a smattering of other work. What happens if we look at a case with a higher private thread count, but now look at wallclock time instead of cycles? I had some trouble getting the profiler to consistently produce clean callgraphs, but I was able to get at least one run in that revealed some interesting information. First, I saw time spent in the same functions that perf told us we were spending cycles in:

+ 100.00% Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int)
 + 99.90% ???
 |+ 97.30% ???
 ||+ 92.40% ???
 |||+ 77.10% _PyEval_EvalFrameDefault
 ||||+ 47.20% ???
 |||||+ 38.10% tensorflow::jpeg::Uncompress(void const*, int, tensorflow::jpeg::UncompressFlags const&, long*, std::function<unsigned char* (int, int, int)>)
 ||||+ 12.20% tensorflow::ResizeBilinearOp<Eigen::ThreadPoolDevice, unsigned char>::Compute(tensorflow::OpKernelContext*)
 ||||+ 4.40% tensorflow::CastOpBase::Compute(tensorflow::OpKernelContext*)
 ||||+ 1.70% tensorflow::ReverseV2Op<Eigen::ThreadPoolDevice, unsigned char, int>::Compute(tensorflow::OpKernelContext*)

But the wallclock profile also exposed that there may be contention in multiple areas in the private threads around some of the nsync synchronization primitives being used:

 |||||||    |  + 4.50% nsync::nsync_mu_semaphore_p(nsync::nsync_semaphore_s_*)
 |||||||    |   + 4.50% syscall

This almost always appeared nested deep inside:

tensorflow::BFCAllocator::AllocateRaw(unsigned long, unsigned long, tensorflow::AllocationAttributes const&)

Sadly I was missing a number of debug symbols and don't 100% trust the wallclock trace. For now I'll just say that the private threads are doing a significant amount of work decompressing and manipulating the image data to keep the GPUs fed. I suspect that with newer and faster GPUs the image retrieval pipeline could become an even bigger issue when training with real image data. The mystery for me is how the TensorFlow developers achieved such good results 4 years ago without using dedicated private threads at all. Perhaps they had a significantly faster JPEG decompression mechanism that I am unaware of?

PyTorch - ImageNet

After running TensorFlow, I also ran some benchmarks in PyTorch using NVIDIA's "DeepLearningExamples" github repo. First, I installed the prerequisites and set up the repository:

pip install 'git+https://github.com/NVIDIA/dllogger'
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110
git clone https://github.com/NVIDIA/DeepLearningExamples

Then I prepared ImageNet for use in PyTorch:

cd ~/data/ImageNet/ILSVRC/Data/CLS-LOC/val
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash

And finally, I ran a test:

cd DeepLearningExamples/PyTorch/Classification/ConvNets
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
python ./multiproc.py --nproc_per_node 1 ./main.py --arch resnet50 --label-smoothing 0.1 --run-epoch 1 --amp --static-loss-scale 256 --workspace /home/ubuntu/data/ImageNet-Scratch /home/ubuntu/data/ImageNet-Orig/ILSVRC/Data/CLS-LOC/

There are a couple of differences here versus the TensorFlow tests. First, I'm using the raw ImageNet archive instead of a preprocessed TFRecord dataset, so the read behavior is different. Because I was unable to extract or copy the raw ImageNet archive onto the persistent storage, I'm also only testing the local NVMe drive. Finally, I didn't see any specific examples for running with fp16 in NVIDIA's documentation, so I'm using AMP (automatic mixed precision), which may be slightly slower.

Given the number of differences it's tough to draw direct comparisons with TensorFlow. AMP is one difference, but it's quite possible that there are tuning options I don't know about that could improve performance here. I did notice that PyTorch, like TensorFlow, uses quite a bit of CPU to keep the GPUs working. I suspect that there are ways to tweak the IO pipeline that could improve performance. For now though, let's compare the IO patterns on the local NVMe drive during the TensorFlow and PyTorch runs. I was hoping to use blktrace to do this, but unfortunately was unable to get any data from the virtual devices in the instance. I was able to collect more general statistics using collectl, however.
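
Statistics like those below came from a collectl invocation along the following lines (flags illustrative; the tables only show the read-related columns):

# Disk detail (-sD) once per second (-i 1) with timestamps (-oT)
collectl -sD -oT -i 1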

Disk Read Statistics During PyTorch 8 GPU run:
Time      Name  KBytes   Merged  IOs   Size  Wait  QLen  SvcTim
00:29:18  vda   761136   0       6746  113   58    431   0
00:29:19  vda   752172   0       6648  113   112   810   0
00:29:20  vda   747824   0       6595  113   84    604   0
00:29:21  vda   735964   0       6583  112   73    551   0
00:29:22  vda   695636   0       6237  112   102   760   0

Disk Read Statistics During TensorFlow 8 GPU run:
Time      Name  KBytes   Merged  IOs   Size  Wait  QLen  SvcTim
00:38:45  vda   1081324  0       8440  128   0     7     0
00:38:46  vda   927512   0       7241  128   0     7     0
00:38:47  vda   913512   0       7130  128   0     7     0
00:38:48  vda   1047444  0       8186  128   0     6     0
00:38:49  vda   968776   0       7560  128   0     6     0


When just looking at the IO sizes, both runs appear similar, but that doesn't tell the whole story. It is likely that TensorFlow is doing much larger reads that are broken up into contiguous 128KB chunks by the block layer based on the underlying device's max_sectors_kb setting. The tells here are the very low queue lengths and wait times for the TensorFlow run versus the PyTorch run. In both cases the device service times are low (0), but in the PyTorch case IOs are still backing up in the device queue.
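
For reference, the block layer's maximum split size can be checked directly in sysfs (the device name here matches the vda device in the tables above):

cat /sys/block/vda/queue/max_sectors_kb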

Interestingly, it appears that it may be possible to use NVIDIA's DALI (Data Loading Library) package to read TFRecords into PyTorch. I didn't have time to attempt it, but potentially that could have a big effect on IO behavior and performance as well.
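
If I had tried it, the first step would likely have been generating DALI index files for each TFRecord shard using the tfrecord2idx script that DALI provides; a rough, untested sketch (paths made up):

# Generate a DALI index file for each TFRecord shard (paths hypothetical)
mkdir -p ~/data/tf_records_idx
for shard in ~/data/tf_records/train/train-*; do
    tfrecord2idx "$shard" ~/data/tf_records_idx/$(basename "$shard").idx
done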

Conclusion

As I’ve been writing this post, I realize just how complicated it is to understand the performance characteristics of training of neural networks. Even as we talk about metrics like images/second, the options that are used (batch size for instance) can also affect convergence. It’s very difficult to come up with a common methodology that is always better than others. I wonder if another metric, like reaching a desired level of convergence, would be better in the end. Having said that, I am glad for having done this exercise as I learned some valuable things:

  1. Pre-processing data into a format like TFRecords on fast local storage is a big win from an IO perspective. It lets storage systems with slow metadata performance succeed so long as they have enough sequential read throughput to keep the machine learning framework busy. This is especially valuable for distributed file systems that may have substandard metadata performance (and even the good ones may still benefit).

  2. To train on a dataset like ImageNet, you need somewhere around 1-1.3GB/s of raw disk throughput to keep 8 V100 GPUs busy when training in fp16. For AMP or fp32 the requirements are likely lower since the GPUs can't work quite as fast. With modern GPUs that are faster than the V100, the disk throughput requirements could be significantly higher.

  3. Lambda’s local NVMe storage is likely fast enough to saturate 8 GPUs, even newer ones, so long as the rest of the IO path can keep up. The persistent storage appears to become a bottleneck with sufficient GPUs and TensorFlow private threads, though can still function fairly well so long as TFRecords are used. A concern going forward is how to ensure that the data pipeline in TensorFlow and PyTorch are fast enough to keep the GPUs fed. The Tensorflow benchmark required a large number of private threads and showed potential evidence of contention at high thread counts. PyTorch did not appear to natively support TFRecords, but NVidia DALI or other 3rd party code might help improve the IO path.

  4. If it’s necessary to train directly with images rather than TFRecords, it may not make sense to host them on shared file systems. It appears that Tensorflow and possibly PyTorch give users the ability to specify a separate training data and work directory. If all operations against the training data are reads, it may be better to host datasets on read-only block device snapshots. For instance with Ceph, perhaps you could create a read/write RBD volume where you put a certain dataset, take a snapshot, and then map that snapshot as read only on multiple instances that all need access to the same image set.

  5. Even with a training set as large as ImageNet, Lambda’s instances have so much memory that eventually the entire dataset becomes cached. It was necessary to sync and drop caches before each test and keep tests short enough that they didn’t re-read the same data from buffer cache. I was able to watch as long running tests eventually stopped performing reads and got faster as time went on. This could make apples-to-apples comparison between different storage vendors difficult if not carefully controlled.

  6. I’m almost certainly missing additional tweaks that can help speed up both Tensorflow and PyTorch. This post shouldn’t be seen as the be-all/end-all for how to achieve high performance with these frameworks, but I hope it may at least help showcase some of the areas that are valuable to investigate when trying to train with real data and achieve high performance.

This wraps up my initial work looking at Deep Learning IO behavior. I hope that next time I can come armed with a bit more knowledge about the internals of how PyTorch and Tensorflow work, focus a bit more on the quality of the training, find even larger datasets to work with, and maybe actually accomplish something useful rather than just play with ImageNet.

Thanks for reading!