This is a crosspost from Glenn K. Lockwood's blog, Personal thoughts and opinions of a supercomputing enthusiast. See the original post here.

IOPS are dumb

This post is a long-form dump of some thoughts I've had while testing all-flash file systems this past year, and bits of this appear in a presentation and paper I'm presenting at PDSW'21 about new benchmarking techniques for testing all-flash file systems.

"How many IOPS do you need?"

I'm often asked this by storage vendors, and the question drives me a little bonkers.  I assume they ask it because their other customers bring them black-and-white IOPS requirements, but I argue that anyone would be hard-pressed to explain the scientific value of one I/O operation (versus one gigabyte) if ever called on it.  And yet, IOPS are undeniably important; the illustrious Rob Ross dedicated a whole slide to this at a recent ASCAC meeting:

Rob Ross' perspective on why IOPS are now important for HPC I/O

I agree with all of Rob's bullets and yet I disagree with the title of his slide; IOPS are dumb, and yet ignoring them when designing a performance-optimized parallel file system is even more dumb in contemporary times.  So let's talk about the grey area in between that creates this dichotomy.

First, bandwidth is pretty dumb

If there's one constant in HPC, it's that everyone hates I/O.  And there's a good reason: it's a waste of time because every second you wait for I/O to complete is a second you aren't doing the math that led you to use a supercomputer in the first place.  I/O is the time you are doing zero computing amidst a field called "high performance computing."

That said, everyone appreciates the product of I/O--data.  I/O is a necessary part of preserving the results of your calculation, so nobody ever says they wish there was no I/O.  Instead, infinitely fast I/O is what people want since it implies that 100% of a scientist's time using an HPC is spent actually performing computations while still preserving the results of that computation after the job has completed.

Peeling back another layer of that onion, the saved results of that computation--data--have intrinsic value.  In a typical simulation or data analysis, every byte of input or output is the hard-earned product of a lot of work performed by a person or machine, and it follows that if you want to both save a lot of bytes and spend as little time as possible performing I/O, the true value of a parallel storage system's performance is in how many bytes per second it can read or write.  At a fundamental level, this is why I/O performance has long been gauged in terms of megabytes per second, gigabytes per second, and now terabytes per second.  To the casual observer, a file system that can deliver 100 GB/s is more valuable than a file system that can deliver only 50 GB/s, all else being equal, for this very reason.  Easy.

This singular metric of storage system "goodness" quickly breaks down once you start trying to set expectations around it though.  For example, let's say your HPC job generates 21 TB of valuable data that must be stored, and it must be stored so frequently that we really can't tolerate more than 30 seconds writing that data out before we start feeling like "too much time" is being spent on I/O instead of computation.  This turns out to be 700 GB/s--a rather arbitrary choice since that 30 seconds is a matter of subjectivity, but one that reflects the value of your 21 TB and the value of your time.  It should follow that any file system that claims 700 GB/s of write capability should meet your requirements, and any vendor who can deliver such a system should get your business, right?
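As an arithmetic aside, here's a minimal sketch of that back-of-the-envelope calculation (the 21 TB and 30 seconds are just the example values above):

```python
# Back-of-the-envelope bandwidth requirement: bytes that must be stored divided
# by the time you're willing to spend not computing.
data_volume_bytes = 21e12     # 21 TB of results per checkpoint (example value)
tolerable_io_secs = 30        # the subjective "too much time" threshold

required_bw_gbs = data_volume_bytes / tolerable_io_secs / 1e9
print(f"required write bandwidth: {required_bw_gbs:.0f} GB/s")   # -> 700 GB/s
```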

Of course not.  It's no secret that obtaining those hero bandwidths, much like obtaining Linpack-level FLOPS, requires you (the end-user) to perform I/O in exactly the right way.  In the case of the aforementioned 700 GB/s file system, this means

  1. Having each MPI process write to its own file (a single shared file will get slowed down by file system lock traffic)
  2. Writing 4 MiB at a time (to exactly match the size of the network transmission buffers, remote memory buffers, RAID alignment, ...)
  3. Using 4 processes per node (enough parallelism to drive the NIC, but not too much to choke the node)
  4. Using 960 nodes (enough parallelism to drive all the file system drives, but not too much to choke the servers)
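As a concrete (and hedged) illustration, here is a minimal mpi4py sketch of that file-per-process, 4 MiB-at-a-time pattern; the scratch path and per-rank data volume are made up, and the real thing would be launched with 4 ranks per node across 960 nodes:

```python
# Hero-bandwidth I/O pattern: one file per MPI process, large aligned writes.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

xfer_size = 4 * 1024 * 1024        # 4 MiB per write to match buffer/RAID alignment
writes_per_rank = 256              # ~1 GiB per rank; arbitrary for this sketch
buf = bytearray(xfer_size)         # stand-in for real checkpoint data

# Each rank writes its own file, so there is no lock contention on a shared file.
with open(f"/scratch/ckpt/rank{rank:05d}.dat", "wb") as f:
    for _ in range(writes_per_rank):
        f.write(buf)

comm.Barrier()   # the hero number is set by when the slowest rank finishes
```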

I've never seen a scientific application perform this exact pattern, and consequently, I don't expect that any scientific application has ever gotten that 700 GB/s of performance from a "700 GB/s file system" in practice.  In that sense, this 700 GB/s bandwidth metric is pretty dumb since nobody actually achieves its rated performance.  Of course, that hasn't prevented me from saying these same dumb things when I stump for file systems.  The one saving grace of using bandwidth as a meaningful metric of I/O performance, though, is that I/O patterns are a synthetic construct and can be squished, stretched, and reshaped without affecting the underlying scientific data being transmitted.

The value of data is in its contents, not the way it is arranged or accessed.  There's no intrinsic scientific reason why someone should or shouldn't read their data 4 MiB at a time as long as the bits eventually get to the CPU that will perform calculations on it in the correct order.  The only reason HPC users perform nice, 1 MiB-aligned reads and writes is because they learn (either in training or on the streets) that randomly reading a few thousand bytes at a time is very slow and works against their own interests of minimizing I/O time.   This contrasts sharply with the computing side of HPC where the laws of physics generally dictate the equations that must be computed, and the order in which those computations happen dictates whether the final results accurately model some physical process or just spit out a bunch of unphysical garbage results.

Because I/O patterns are not intrinsically valuable, we are free to rearrange them to best suit the strengths and weaknesses of a storage system to maximize the GB/s we can get out of it.  This is the entire foundation of MPI-IO, which receives I/O patterns that are convenient for the physics being simulated and reorders them into patterns that are convenient for the storage system.  So while saying a file system can deliver 700 GB/s is a bit disingenuous on an absolute scale, it does indicate what is possible if you are willing to twist your I/O pattern to exactly match the design optimum.
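To sketch what that looks like from the application's side, here is a hedged mpi4py example of a collective write to a single shared file; the MPI-IO layer underneath is free to aggregate these per-rank pieces into a few large, well-aligned requests (the file name and sizes are illustrative, not from any real benchmark):

```python
# Collective MPI-IO write: each rank contributes its slice of one shared file,
# and MPI-IO may reorder/aggregate the pieces into storage-friendly requests.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

count = 1 << 20                                  # 1 Mi doubles (8 MiB) per rank
data = np.full(count, rank, dtype=np.float64)    # stand-in for simulation output

fh = MPI.File.Open(comm, "shared_output.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY)
offset = rank * data.nbytes                      # contiguous, non-overlapping slices
fh.Write_at_all(offset, data)                    # collective call enables aggregation
fh.Close()
```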

But IOPS are particularly dumb

IOPS are what happen when you take the value out of a value-based performance metric like bandwidth.  Rather than expressing how many valuable bytes a file system can move per second, IOPS express how many arbitrary I/O operations a file system can service per second.  And since the notion of an "I/O operation" is completely synthetic and can be twisted without compromising the value of the underlying data, you might already see why IOPS are a dumb metric of performance.  They measure how quickly a file system can do something meaningless, where that meaningless thing (an I/O operation) is itself a function of the file system.  It's like saying you can run a marathon at five steps per second--it doesn't actually indicate how long it will take you to cover the twenty-six miles.

IOPS as a performance measure was relatively unknown to HPC for most of history.  Until 2012, HPC storage was dominated by hard drives, which only delivered high-value performance for large, sequential reads and writes, and the notion of an "IOP" was antithetical to performance.  The advent of flash introduced a new dimension of performance in its ability to read and write a lot of data at discontiguous (or even random) positions within files or across entire file systems.  Make no mistake: you still read and write more bytes per second (i.e., get more value) from flash with a contiguous I/O pattern.  Flash just raised the bottom end of performance in the event that you are unable or unwilling to contort your application to perform I/O in a way that is convenient for your storage media.

To that end, when a vendor advertises how many IOPS they can deliver, they really are advertising how many discontiguous 4 KiB reads or writes they can deliver under the worst-case I/O pattern (fully random offsets).  You can convert a vendor's IOPS performance back into a meaningful value metric simply by multiplying it by 4 KiB; for example, I've been presenting a slide that claims I measured 29,000 write IOPS and 1,400,000 read IOPS from a single ClusterStor E1000 OST array:

Performance measurements of a single ClusterStor E1000 NVMe Lustre OST

In reality, I was able to write data at 0.12 GB/s and read data at 5.7 GB/s, and stating these performance metrics as IOPS makes it clear that these data rates reflect the worst-case scenario of tiny I/Os happening at random locations rather than the best-case scenario of sequential I/Os which can happen at 27 GB/s and 41 GB/s, respectively.
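The conversion itself is just multiplication by the transfer size; a quick sketch with the numbers above:

```python
# Turn 4 KiB random IOPS back into the bytes per second they actually move.
KiB = 1024

def iops_to_gbs(iops, io_size=4 * KiB):
    return iops * io_size / 1e9

print(f"writes: {iops_to_gbs(29_000):.2f} GB/s")      # ~0.12 GB/s
print(f"reads:  {iops_to_gbs(1_400_000):.1f} GB/s")   # ~5.7 GB/s
```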

Where IOPS get particularly stupid is when we try to cast them as some sort of hero number analogous to the 700 GB/s bandwidth metric discussed above.  Because IOPS reflect a worst-case performance scenario, no user should ever be asking "how can I get the highest IOPS" because they'd really be asking "how can I get the best worst-case performance?"  Relatedly, trying to measure the IOPS capability of a storage system gets very convoluted because it often requires twisting your I/O pattern in such unrealistic ways that heroic effort is required to get such terrible performance.  At some point, every I/O performance engineer should find themselves questioning why they are putting so much time into defeating every optimization the file system implements to avoid this worst-case scenario.

To make this a little more concrete, let's look at this slide I made in 2019 to discuss the IOPS projections of this exact same ClusterStor E1000 array:

Projected performance of a ClusterStor E1000 NVMe Lustre OST based on a PCIe Gen3 platform

Somehow the random read rate went from a projected 600,000 to an astonishing 1,400,000 read IOPS--which one is the correct measure of read IOPS?

It turns out that they're both correct; the huge difference in measured read IOPS is the result of the 600 kIOPS estimate coming from a measurement that

  1. ran for a much longer sustained period (180 seconds vs. 69 seconds)
  2. used fewer client nodes (21 nodes vs. 32 nodes)
  3. wrote smaller files (1,008× 8 GiB files vs. 1,024× 384 GiB files)

Unlike IOPS measurements on individual SSDs, which are made using a standard tool (fio with libaio from a single node), there is no standard method for measuring the IOPS of a parallel file system.  And just as the hero bandwidth number we discussed above is unattainable by real applications, any standardized IOPS test for a parallel file system would result in a relatively meaningless number.  And yes, this includes IO-500; its numbers have little quantitative value if you want to design a parallel file system the right way.

So who's to say whether a ClusterStor E1000 OST is capable of 600 kIOPS or 1,400 kIOPS?  I argue that 1,400 kIOPS is more accurate since I/O is bursty, and a three-minute-long burst of completely random reads is less likely on a production system than a one-minute-long one.  If I worked for a vendor, though, I'm sure this would be taken to be a dishonest marketing number since it doesn't reflect an indefinitely sustainable level of performance.  And perhaps courageously, the official Cray ClusterStor E1000 data sheet doesn't even wade into these waters and avoids quoting any kind of IOPS performance expectation.  Ultimately, the true value of the random read capability is the bandwidth achievable by all of the most random workloads that will realistically be run at the same time on a file system.  Good luck figuring that out.

Write IOPS are really dumb

As I said at the outset, I cannot disagree with any of the bullets in the slide Rob presented at ASCAC.  That first one is particularly salient--there is a new class of HPC workloads, particularly in AI, whose primary purpose is to randomly sample large datasets to train statistical models.  If these datasets are too large to fit into memory, you cannot avoid some degree of random read I/O without introducing biases into your weights.  For this reason, there is a legitimate need for HPC to demand high random read performance from their file systems.  Casting this requirement in terms of 4 KiB random read rates to have a neat answer to the "how many IOPS do you need" question is dubious, but whatever.  There's little room for intellectual purity in HPC.

The same can't be said for random write rates.  Write IOPS are a completely worthless and misleading performance metric in parallel file systems.

In most cases, HPC applications approximate some aspect of the physical world, and mathematics and physics were created to describe this physical world in a structured way.  Whether you're computing over atoms, meshes, or matrices, there is structure to the data you are writing out and the way your application traverses memory to write everything out.  You may not write data out in a perfectly ordered way; you may have more atoms on one MPI process than another, or you may be traversing an imbalanced graph.  But there is almost always enough structure to scientific data to squish it into a non-random I/O pattern using middleware like MPI-IO.

Granted, there are a few workloads where this is not true.  Out-of-core sorting of short-read DNA sequences and in-place updates of telescope mosaics are two workloads that come to mind where you don't know where to write a small bit of data until you've computed on that small bit of data.  In both these cases though, the files are never read and written at the same time, meaning that these random-ish writes can be cached in memory, reordered to be less random, and written out to the file asynchronously.  And the effect of write-back caching on random write workloads is staggering.

To illustrate this, consider three different ways in which IOR can be run against an all-NVMe file system to measure random 4 KiB writes:

  1. the naïve way: write through each client's page cache and let the final file close flush everything, untimed
  2. the same buffered writes, but with an fsync included in the timed region so that flushing the cache counts against the measurement
  3. unbuffered writes using O_DIRECT, so every 4 KiB write goes straight to the file system

These seemingly minor changes result in write IOPS rates that differ by over 30x:

Random write IOPS measured using IOR on an all-NVMe parallel file system

Again we ask: which one is the right value for the file system's write IOPS performance?

If we split apart the time spent in each phase of this I/O performance test, we immediately see that the naïve case is wildly deceptive:

Breakdown of time spent in I/O calls for 4K random write IOR workload

The reason IOR reported a 2.6 million write IOPS rate is that all those random writes actually got cached in each compute node's memory, and I/O didn't actually happen until the file was closed and all cached dirty pages were flushed.  By the time that flush happens, the cache flushing process doesn't result in random writes anymore; the client reordered all of those cached writes into large, 1 MiB network requests and converted our random write workload into a sequential write workload.

The same thing happens in the case where we include fsync; the only difference is that we're including the time required to flush caches in the denominator of our IOPS measurement.  Rather frustratingly, we actually stopped issuing write(2) calls after 45 seconds, but so many writes were cached in memory during those 45 seconds that it took almost 15 minutes to reorder and write them all out during that final fsync and file close.  What should've been 45 seconds of random writes to the file system turned into 45 seconds of random writes to memory and 850 seconds of sequential writes to the file system.

The O_DIRECT case is the most straightforward since we don't cache any writes, and every one of our random writes from the application turns into a random write out to the file system.  This cuts our measured IOPS almost in half, but otherwise leaves no surprises when we expect to only write for 45 seconds.  Of course, we wrote far fewer bytes overall in this case since the effective bytes/sec during this 45 seconds was so low.
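To make the three cases concrete, here is a minimal single-node sketch of the same idea (this is not the actual IOR experiment; the path, file size, and write count are made up), timing 4 KiB random writes through the page cache, through the page cache with the fsync counted, and through O_DIRECT:

```python
import os, mmap, time, random

PATH = "/mnt/flash/ior_sketch.dat"   # hypothetical file on the all-flash file system
BLOCK = 4096                         # 4 KiB writes
NWRITES = 200_000
FILESIZE = 8 * 2**30                 # spread writes across an 8 GiB file

def random_write_iops(extra_flags=0, do_fsync=False):
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | extra_flags, 0o644)
    os.ftruncate(fd, FILESIZE)
    buf = mmap.mmap(-1, BLOCK)       # page-aligned buffer, required for O_DIRECT
    buf[:] = b"x" * BLOCK
    start = time.time()
    for _ in range(NWRITES):
        offset = random.randrange(FILESIZE // BLOCK) * BLOCK
        os.pwrite(fd, buf, offset)
    if do_fsync:
        os.fsync(fd)                 # count cache-flush time against the measurement
    elapsed = time.time() - start
    os.close(fd)                     # naive case: the real flush happens here, untimed
    return NWRITES / elapsed

print("buffered, no fsync:", random_write_iops())
print("buffered + fsync:  ", random_write_iops(do_fsync=True))
print("O_DIRECT:          ", random_write_iops(extra_flags=os.O_DIRECT))  # Linux-only
```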

Based on all this, it's tempting to say that the O_DIRECT case is the correct way to measure random write IOPS since it avoids write-back caches--but is it really?  In the rare case where an application intentionally does random writes (e.g., out-of-core sort or in-place updates), what are the odds that two MPI processes on different nodes will try to write to the same part of the same file at the same time and therefore trigger cache flushing?  Perhaps more directly, what are the odds that a scientific application would be using O_DIRECT and random writes at the same time?  Only the most masochistic HPC user would ever purposely do something like this since it results in worst-case I/O performance; it doesn't take long for a user to realize this I/O pattern is terrible and that reformulating it would increase their productive use of the supercomputer.

So if no user in their right mind does truly unbuffered random writes, what's the point in measuring it in the first place?  There is none.  Measuring write IOPS is dumb.  Using O_DIRECT to measure random write performance is dumb, and measuring write IOPS through write-back cache, while representative of most users' actual workloads, isn't actually doing 4K random I/Os and therefore isn't even measuring IOPS.

Not all IOPS are always dumb

This all being said, measuring IOPS can be valuable in contexts outside of parallel file systems.  Two cases come to mind where measuring IOPS can be a rational yardstick.

1. Serving up LUNs to containers and VMs

By definition, infrastructure providers shouldn't be responsible for the applications that run inside black-box containers and VMs because they are providing storage infrastructure (block devices) and not storage services (file systems).  Blocks in and blocks out are measured in IOPS, so the fit is natural.  That said, HPC users care about file systems (that is, scientific applications do not perform I/O using SCSI commands directly!), so worrying about LUN performance isn't meaningful in the HPC context.

2. Measuring the effect of many users doing many things

While individual HPC workloads rarely perform random I/Os on purpose, if you have enough users doing many small tasks all at once, the file system itself sees a workload that approaches something random.  The more small, independent tasks run in parallel and the farther back you stand from the overall I/O load timeline, the more random it looks.  So, I argue that it is fair to measure the IOPS of a parallel file system for the purposes of measuring how much abuse a file system can take before it begins to impact everybody.

Take, for example, this IOPS scaling measurement I made on a small all-flash file system using IOR:

Scale-up IOPS benchmarking to demonstrate the saturation point of an all-flash file system


It looks like it takes about 4,096 concurrent random readers or writers to max out the file system.  This alone isn't meaningful until you consider what this means in the context of the whole compute and storage platform.


What fraction of the cluster's compute nodes corresponds to 4,096 cores?  If you've got, say, 728 dual-socket nodes built on 64-core AMD Epyc processors (128 cores per node), it would only take 32 compute nodes to max out this file system.  And if another user wanted to use any of the remaining 696 compute nodes to, say, run a Python script that needed to read in random packages scattered across the file system, there would be no IOPS capacity left for them, and everyone would experience perceptible lag.
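The arithmetic behind those node counts, as a quick sketch:

```python
# How many compute nodes does it take to saturate the file system's IOPS?
saturating_streams = 4096          # concurrent random readers/writers at saturation
cores_per_node = 2 * 64            # dual-socket, 64-core AMD Epyc nodes
total_nodes = 728

nodes_to_saturate = saturating_streams // cores_per_node
print(nodes_to_saturate)                  # -> 32 nodes max out the file system
print(total_nodes - nodes_to_saturate)    # -> 696 nodes left with no IOPS headroom
```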

Of course, this is the most extreme case--purely random IOPS--but you can measure the IOPS that a real workload does generate on the server side when, say, sampling a deep learning training dataset. With this, you can then figure out how much headroom that application leaves for every other random-ish workload that needs to run on the same system.

Once you realize that a lot of the unglamorous parts of scientific computing--reading dotfiles when you log in, loading shared objects when you launch a dynamically linked executable, or even just editing source code--are full of random-like reads, you can establish a quantitative basis for figuring out how badly an IOPS-intensive data analysis application may affect everyone else's interactive accesses on the same file system.

This is not to say that we can easily answer the question of "How many IOPS do you need?" though.  How many IOPS a workload can drive is not how many IOPS that workload needs--it really reflects how fast the workload can compute before it runs out of data to process and needs to read more in.  The faster your compute nodes are, generally, the more data they can consume.  They still want all the IOPS you can give them so they can spend as much time computing (and not waiting for I/O) as possible, and how many IOPS your application can drive is a function of how quickly it runs given the full stack between it and the storage, including CPU, memory, and networking.

If everything is dumb, now what?

Give up trying to reduce I/O performance down to a single IOPS number, because it's two degrees away from being useful.  Bandwidth is a better metric in that it's only one degree away from what actually matters, but at the end of the day, the real metric of I/O performance is how much time an application has to wait on I/O before it can resume performing meaningful computations.  Granted, most storage vendors will give you a blank stare if you take this angle to them; telling them that your application spends 50% of its time waiting on I/O isn't going to get you a better file system from a storage company alone, so think about what the real problem could be.

Is the application doing I/O in a pattern (random or otherwise) that prevents the storage system from delivering as many bytes/second as possible?  If so, ask your vendor for a storage system that delivers more bandwidth to a wider range of I/O patterns than just perfectly aligned 1 MiB reads and writes.

Is the storage system already running as well as it can, but it only takes a few compute nodes to max it out?  If so, your storage system is too small relative to your compute system, and you should ask your vendor for more servers and drives to scale out.

Is the storage system running at 100% CPU even though it's not delivering full bandwidth?  Servicing a small I/O requires a lot more CPU than a large I/O since there are fixed computations that have to happen on every read or write regardless of how big it is.  Ask your vendor for a better file system that doesn't eat up so much CPU, or ask for more capable servers.

Alternatively, if you have a lot of users all doing different things and the file system is giving poor performance to everyone, ask your vendor for a file system with better quality of service.  This will ensure that one big job doesn't starve out all the small ones.

Is the storage system slow but you don't have the time to figure out why?  If so, it sounds like you work for an organization that doesn't actually value data because it's not appropriately staffed.  This isn't a storage problem!

Ultimately, if solving I/O problems were as easy as answering how many IOPS you need, storage wouldn't be the perpetual pain point in HPC that it has been.  As with all things in computing, there is no shortcut; the proper way to approach this is to roll up your sleeves and start ruling out problems.  You can (and should!) ask for a lot from your storage vendors--flexibility in delivering bandwidth, CPU-efficient file systems, and quality of service controls are all valid requests when buying storage.  But IOPS are not.