This is a crosspost from Glenn K. Lockwood, Personal perspectives of a supercomputing enthusiast. See the original post here.

SC'23 Recap

The largest high-performance computing industry conference of the year, SC23, was held in Denver last week. This year's conference attracted over 14,000 attendees and 438 exhibitors, finally breaking pre-pandemic records, and it solidly felt like the old days of the conference in terms of breadth of attendees, the technical program, and overall engagement and interaction across the community.

This was the second time I've attended the conference as a vendor instead of a customer, and this meant I spent a fair amount of time running to and from meetings instead of walking the show floor or attending technical sessions. I'm sure I missed some major announcements and themes as a result, but I thought it still might be valuable to contribute my observations based on this narrow lens of an AI-minded storage product manager for a major cloud service provider. If you're interested in a more well-rounded perspective, check out the HPC Social Supercomputing 2023 Summary and contribute your own thoughts!

I don't know the best way to organize the notes that I took, so I grouped them into a few broad categories:

  1. Big news on the Top500
  2. What's new in storage for HPC and AI
  3. The emergence of pure-play GPU clouds
  4. Other technological dribs and drabs
  5. Personal thoughts and reflections on the conference and community

I must also disclose that I am employed by Microsoft and I attended SC23 in that capacity. However, everything in this post is my own personal viewpoint, and my employer had no say in what I did or didn't write here. Everything below is written from my perspective as an enthusiast, not an employee, although my day job probably colors my outlook on the HPC industry.

With all that being said, let's dive into the big news of the week!

Big news on the Top500

Unveiling the new Top500 list is the tentpole event of SC every year regardless of how much people (including myself!) deride HPL, and unlike the lists over the past year, this newest listing had two big surprises. Many of us went into the SC23 season wondering if the Aurora system, whose hardware was delivered this past June, would be far enough along in installation and shakeout to unseat Frontier as the second listed exascale system. At the same time, nobody had expected another >500 PF supercomputer to appear on the list, much less one operated privately and for-profit. But both systems made big debuts in the top 5, carrying with them interesting implications.

The new #2: Argonne's Aurora

The Aurora exascale system has a storied history going back to 2015; first conceived of as a 180 PF supercomputer to be delivered in 2018, it evolved into a GPU-based exascale supercomputer that was supposed to land in 2021. Now two years late and a few executives short, Intel and Argonne were stuck between a rock and a hard place in choosing whether to list their HPL results at SC23:

  1. If Aurora wasn't listed on SC23's Top500 list, it risked going up against El Capitan at ISC'24 and being completely overshadowed by the simultaneous launch of a newer, bigger exascale system.
  2. If Aurora was listed on SC23's Top500 list but in an incomplete form, it would fall short of its long-awaited debut as the #1 system and would require a careful narrative to avoid being seen as a failed system.

Intel and Argonne ultimately chose option #2 and listed an HPL run that used only 5,439 of Aurora's 10,624 nodes (51.2% of the total machine), and as expected, people generally understood that this sub-exaflop score was not an indictment of the whole system underdelivering, but more a reflection that the system was still not stable at its full scale. Still, headlines in trade press were dour, and there was general confusion about how to extrapolate Aurora's HPL submission to the full system. Does the half-system listing of 585.34 PF Rmax at 24.7 MW power mean that the full system will require 50 MW to achieve an Rmax that's still lower than Frontier? Why is the efficiency (Rmax/Rpeak = 55%) so low?
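For reference, the naive extrapolation behind those questions is just linear scaling by node count. The sketch below uses the figures quoted above; the Frontier Rmax of 1,194 PF is an assumption added here for comparison and does not come from this post, and linear scaling is itself a simplification, since HPL performance and power rarely scale perfectly with node count.

```python
# Figures quoted in the post for Aurora's SC23 HPL submission.
NODES_USED = 5_439
NODES_TOTAL = 10_624
RMAX_PF = 585.34      # measured HPL performance, in PFLOP/s
POWER_MW = 24.7       # power drawn during the run, in MW
EFFICIENCY = 0.55     # Rmax / Rpeak, as quoted

# Assumption (not from the post): Frontier's Rmax on the same list.
FRONTIER_RMAX_PF = 1_194.0


def scale_to_full_system(value: float) -> float:
    """Scale a half-system measurement linearly by node count."""
    return value * NODES_TOTAL / NODES_USED


fraction_used = NODES_USED / NODES_TOTAL
full_rmax_pf = scale_to_full_system(RMAX_PF)
full_power_mw = scale_to_full_system(POWER_MW)
implied_rpeak_pf = RMAX_PF / EFFICIENCY
# PFLOP/s per MW is numerically the same as GFLOP/s per watt.
gflops_per_watt = RMAX_PF / POWER_MW

print(f"Fraction of nodes used:      {fraction_used:.1%}")
print(f"Extrapolated full Rmax:      {full_rmax_pf:.0f} PF")
print(f"Extrapolated full power:     {full_power_mw:.1f} MW")
print(f"Implied Rpeak of the run:    {implied_rpeak_pf:.0f} PF")
print(f"Energy efficiency:           {gflops_per_watt:.1f} GF/W")
print(f"Below Frontier (assumed)?    {full_rmax_pf < FRONTIER_RMAX_PF}")
```

Under these assumptions the full machine lands a little above an exaflop at just under 50 MW, which is where the pessimistic reading came from. It says nothing about what a tuned, full-scale run would actually achieve.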

Interestingly, about half the people I talked to thought that Argonne should've waited until ISC'24 to list the full system, and the other half agreed that listing half of Aurora at SC'23 was the better option. There was clearly no right answer here, and I don't think anyone can fault Argonne for doing the best they could given the Top500 submission deadline and the state of the supercomputer. In talking to a couple folks from ALCF, I got the impression that there's still plenty of room to improve the score since their HPL run was performed under a time crunch, and there were known issues affecting performance that couldn't have been repaired in time. With any luck, Aurora will be ready to go at full scale for ISC'24 and have its moment in the sun in Hamburg.

The new #3: Microsoft's Eagle

The other new Top500 entry near the top of the list was Eagle, Microsoft's surprise 561 PF supercomputer. Like Aurora, it is composed of GPU-heavy nodes, and like Aurora, the HPL run utilized only part (1,800 nodes) of the full system. Unlike Aurora though, the full size of Eagle is not publicly disclosed by Microsoft, and its GPU-heavy node architecture was designed for one specific workload: training large language models for generative AI.

At the Top500 BOF, Prabhat Ram gave a brief talk about Eagle where he emphasized that the system wasn't a custom-built, one-off stunt machine. Rather, it was built from publicly available ND H100 v5 virtual machines on a single 400G NDR InfiniBand fat tree fabric, and Microsoft had one of the physical ND H100 v5 nodes at its booth.  Here's the back side of it:

From top to bottom, you can see it has eight E1.S NVMe drives, 4x OSFP ports which support 2x 400G NDR InfiniBand each, a Microsoft SmartNIC, and a ton of power.  A view from the top shows the HGX baseboard and fans:


Logically, this node (and the ND H100 v5 VM that runs on it) looks a lot like the NVIDIA DGX reference architecture. Physically, it is an air-cooled, Microsoft-designed OCP server, and Eagle's Top500 run used 1,800 of these servers.

Big HPL number aside, the appearance of Eagle towards the top of Top500 has powerful implications for the supercomputing industry at large. Consider the following.

Microsoft is a for-profit, public enterprise whose success is ultimately determined by how much money it makes for its shareholders. Unlike government agencies who have historically dominated the top of the list to show their supremacy in advancing science, the Eagle submission shows that there is now a huge financial incentive to build giant supercomputers to train large language models. This is a major milestone in supercomputing; up to this point, the largest systems built by private industry have come from the oil & gas industry, and they have typically deployed at scales below the top 10.

Eagle is also built on the latest and greatest technology--NVIDIA's H100 and NDR InfiniBand--rather than previous-generation technology that's already been proven out by the national labs.  SC23 was the first time Hopper GPUs have appeared anywhere on the Top500 list, and Eagle is likely the single largest installation of both H100 and NDR InfiniBand on the planet. Not only does this signal that it's financially viable to stand up a leadership supercomputer for profit-generating R&D, but industry is now willing to take on the high risk of deploying systems using untested technology if it can give them a first-mover advantage.

Eagle also shows us that the potential upside of bringing a massive new AI model to market is worth both buying all the infrastructure required to build a half-exaflop system and hiring the talent required to shake out what is literally a world-class supercomputer. And while the US government can always obtain a DPAS rating to ensure it gets dibs on GPUs before AI companies can, there is no DPAS rating for hiring skilled individuals to stand up gigantic systems. This all makes me wonder: if Aurora were a machine sitting in some cloud data center instead of Argonne, and its commissioning was blocking the development of the next GPT model, would it have been able to take the #1 spot from Frontier this year?

The appearance of such a gigantic system on Top500, motivated by and paid for as part of the AI land grab, also raises some existential questions for the US government. What role should the government have in the supercomputing industry if private industry now has a strong financial driver to invest in the development of leadership supercomputing technologies? Historically, government has always incubated cutting-edge HPC technologies so that they could stabilize enough to be palatable to commercial buyers. Today's leadership supercomputers in the national labs have always wound up as tomorrow's midrange clusters that would be deployed for profit-generating activities like seismic imaging or computer-aided engineering. If the AI industry is now taking on that mantle of incubating and de-risking new HPC technologies, perhaps government now needs to focus on ensuring that the technologies developed and matured for AI can still be used to solve scientific problems.

What's new in storage for HPC and AI?

Since I spent much of my career working in HPC storage, and I now focus largely on AI, it should be no surprise that I heard a lot about the intersection of AI and storage. AI remains high in the hype cycle, so it's natural that just about every storage vendor and discussion had some talk of AI forced into it regardless of whether it was really relevant. However, there were a few places where AI and storage topics intersect that I found noteworthy.

The AI-storage echo chamber

I was asked a lot of questions about storage from journalists, VCs, and even trusted colleagues that followed a common theme: What storage technologies for AI excite me the most? What's the future of storage for AI?

I don't fault people for asking such a broad question because the HPC/AI storage industry is full of bombastic claims. For example, two prominent storage vendors emblazoned their booths with claims of what their products could do for AI:

These photos illustrate the reality that, although there is general agreement that good storage is needed for GPUs and AI, what constitutes "good storage" is muddy and confusing. Assuming the above approach to marketing (10x faster! 20x faster!) is effective for someone out there, there appears to be a market opportunity in just capitalizing on this general confusion by (1) asserting what the I/O problem that's jamming up all AI workloads is, and (2) showing that your storage product does a great job at solving that specific problem.

For example, the MLPerf Storage working group recently announced the first MLPerf Storage benchmark. At the Workshop on Software and Hardware Co-Design of Deep Learning Systems in Accelerators, Huihuo Zheng from Argonne (co-author of the underlying DLIO tool on which MLPerf Storage was built) described how the benchmark reproduces the I/O characteristics of model training:

When I saw this premise, I was scratching my head--my day job is to develop new storage products to meet the demands of large-scale AI model training and inferencing, and I have never had a customer come to me claiming that they need support for small and sparse I/O or random access. In fact, write-intensive checkpointing and fine-tuning, not read-intensive data loading, is the biggest challenge faced by those training large language models in my experience. It wasn't until a few slides later that I realized where these requirements may be coming from:

Storage and accelerator vendors are both defining and solving the I/O problems of the AI community, which seems counterproductive--shouldn't a benchmark be set by the practitioners and not the solution providers?

What I learned from talking to attendees, visiting storage vendor booths, and viewing talks like Dr. Zheng's underscores a reality that I've faced in my own work with production AI workloads: AI doesn't actually have an I/O performance problem, so storage vendors are struggling to define ways in which they're relevant in the AI market.

I outlined the ways in which LLM training uses storage in my HDF5 BOF talk, and their needs are easy to meet with some local storage and basic programming. So easy, in fact, that a reasonably sophisticated AI practitioner can duct tape their way around I/O problems very quickly and move on to harder problems. There's no reason for them to buy into a sophisticated Rube Goldberg storage system, because it still won't fundamentally get them away from having to resort to local disk to achieve the scalability needed to train massive LLMs.
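As a rough illustration of what that duct tape can look like, here is a minimal sketch of one common pattern: write a checkpoint to node-local disk so training only blocks on the fast local write, then copy it to shared storage in the background. This is an assumption about the general shape of such workarounds, not a description of any particular framework or of the approach in the HDF5 BOF talk; the function name, file naming scheme, and directory arguments are all hypothetical.

```python
import shutil
import tempfile
import threading
from pathlib import Path


def save_checkpoint(state: bytes, step: int,
                    local_dir: Path, shared_dir: Path) -> threading.Thread:
    """Write a checkpoint to local disk, then drain it to shared storage."""
    local_path = local_dir / f"ckpt-{step:08d}.bin"
    # The training loop blocks only for this local write.
    local_path.write_bytes(state)

    def drain() -> None:
        # Copy under a temporary name and rename at the end so that
        # readers of the shared directory never see a partial file.
        partial_path = shared_dir / (local_path.name + ".partial")
        shutil.copyfile(local_path, partial_path)
        partial_path.rename(shared_dir / local_path.name)

    worker = threading.Thread(target=drain, daemon=True)
    worker.start()
    return worker


if __name__ == "__main__":
    # Two temporary directories stand in for local NVMe and shared storage.
    with tempfile.TemporaryDirectory() as local, \
            tempfile.TemporaryDirectory() as shared:
        worker = save_checkpoint(b"\x00" * 1024, step=1,
                                 local_dir=Path(local), shared_dir=Path(shared))
        worker.join()  # a real job would join before the next checkpoint
        print(sorted(p.name for p in Path(shared).iterdir()))
```

A production version would also need to handle failures during the drain and clean up old local checkpoints, but the core idea fits in a few dozen lines.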

So yes, I've got no doubt that there are storage products that can deliver 10x or 20x higher performance for some specific AI workload. And MLPerf Storage is probably an excellent way to measure that 20x performance boost. But the reality I've experienced is that half a day of coding will deliver 19x higher performance when compared to the most naive approach, and every AI practitioner knows and does this already. That's why there are a lot of storage vendors fishing in this AI storage pond, but none of them seem to be reeling in any whoppers.

This isn't to say that there's nothing interesting going on in high-performance storage though. If the most common question I was asked was "what's the future of storage for AI," the second most common question was "what do you think about VAST and WEKA?"

VAST & WEKA

Both companies seem to be doing something right since they were top of mind for a lot of conference attendees, and it probably grinds their respective gears that the field still groups them together in the same bucket of "interesting parallel storage systems that we should try out." Rather than throw my own opinion in the pot though (I work with and value both companies and their technologies!), I'll note the general sentiments I observed.

WEKA came into the week riding high on their big win as U2's official technology partner in September. Their big booth attraction was a popular Guitar Hero game and leaderboard, and an oversized Bono, presumably rocking out to how much he loves WEKA, presided over one of their seating areas:

Much of their marketing centered around accelerating AI and other GPU workloads, and the feedback I heard from the WEKA customers I bumped into during the week backed this up. One person shared that the WEKA client does a great job with otherwise difficult small-file workloads, which are particularly common in life sciences, and this anecdote is supported by the appearance of a very fast WEKA cluster owned by MSK Cancer Center on the IO500 Production list. People also remarked about WEKA's need for dedicated CPU cores and local storage to deliver the highest performance; this, combined with its client scalability, lends itself well to smaller clusters of fat GPU nodes. I didn't run into anyone using WEKA in the cloud though, so I assume the feedback I gathered had a bias towards more conventional, on-prem styles of architecting storage for traditional HPC.

Whereas WEKA leaned into its rock 'n' roll theme this year, VAST doubled down on handing out the irresistibly tacky light-up cowboy hats they introduced last year (which I'm sure their neighbors at the DDN booth absolutely loved). They were all-in on promoting their new identity as a "data platform" this year, and although I didn't hear anyone refer to VAST as anything but a file system, I couldn't throw a rock without hitting someone who either recently bought a VAST system or tried one out.

Unlike last year though, customer sentiment around VAST wasn't all sunshine and rainbows, and I ran into a few customers who described their presales engagements as more formulaic than the white-glove treatment everyone seemed to be getting a year ago. This isn't surprising; there's no way to give all customers the same royal treatment as a business scales. But it does mean that the honeymoon period between VAST and the HPC industry is probably at an end, and they will have to spend the time between now and SC24 focusing on consistent execution to maintain the momentum they've gotten from the light-up cowboy hats.

The good news for VAST is that they've landed some major deals this past year, and they came to SC with customers and partners in hand. They co-hosted a standing-room-only party with CoreWeave early in the week and shared a stage with Lambda at a customer breakfast, but they also highlighted two traditional, on-prem HPC customers (TACC and NREL) at the latter event.

VAST clearly isn't letting go of the on-prem HPC market as it also pursues partnerships with emerging GPU cloud service providers; this contrasted with WEKA's apparent focus on AI, GPUs, and the cloud. Time will tell which strategy (if either, or both) proves to be the better approach.

DAOS

Though commercial buyers were definitely most interested in VAST and WEKA, folks from the more sophisticated HPC shops around the world also tossed a few questions about DAOS my way this year.

I usually make it a point to attend the annual DAOS User Group meeting since it is always attended by all the top minds in high-performance I/O research, but I had to miss it this year on account of it running at the same time as my I/O tutorial. Fortunately, DAOS was pervasive throughout the conference, and there was no shortage of opportunity to find out what the latest DAOS news was. For example, check out the lineup for PDSW 2023 this year:

Three out of thirteen talks were about DAOS, which is more than any other single storage product or project. DAOS also won big at this year's IO500, taking the top two spots in the production storage system list:



In fact, DAOS underpinned every single new awardee this year, and DAOS is now the second most represented storage system on the list behind Lustre:

Why is DAOS at the top of so many people's minds this year? Well, DAOS reached a few major milestones in the past few months that have thrust it into the public eye.

First, Aurora is finally online and running jobs, and while the compute system is only running at half its capability, the full DAOS system (all 220 petabytes of it, all of which is TLC NVMe) is up and running--a testament to the scalability of DAOS that many parallel storage systems--including VAST and WEKA--have not publicly demonstrated. Because DAOS is open-source software and Aurora is an open-science system, all of DAOS' at-scale warts are also on full display to the community in a way that no competitive storage system besides Lustre is.

Second, Google Cloud cast a bold vote of confidence in DAOS by launching Parallelstore, its high-performance parallel file service based on DAOS, in August. Whereas AWS and Azure have bet on Lustre to fill the high-performance file gap (via FSx Lustre and Azure Managed Lustre), GCP has planted a stake in the ground by betting that DAOS will be the better foundation for a high-performance file service for HPC and AI workloads.

Parallelstore is still in private preview and details are scant, but GCP had DAOS and Parallelstore dignitaries at all the major storage sessions in the technical program to fill in the gaps. From what I gathered, Parallelstore is still in its early stages and is intended to be a fast scratch tier; it's using DRAM for metadata which means it relies on erasure coding across servers to avoid data loss on a single server reboot, and there's no way to recover data if the whole cluster goes down at once. This lack of durability makes it ineligible for the IO500 list right now, but the upcoming metadata-on-NVMe feature (which previews in upstream DAOS in 1H2024) will be the long-term solution to that limitation.

Finally, the third major bit of DAOS news was about the formation of the DAOS Foundation. First announced earlier this month, this initiative lives under the umbrella of the Linux Foundation and is led by its five founding members:

I see this handoff of DAOS from Intel to this new foundation as a positive change that makes DAOS a more stable long-term bet; should Intel choose to divest itself of DAOS once its obligations to the Aurora program end, DAOS now can live on without the community having to fork it. The DAOS Foundation is somewhat analogous to OpenSFS (one of the nonprofits backing Lustre) in that it is a vendor-neutral organization around which the DAOS community can gather.

But unlike OpenSFS, the DAOS Foundation will also assume the responsibility of releasing new versions of DAOS after Intel releases its final version (2.6) in March 2024. The DAOS Foundation will also steer feature prioritization, but seeing as how the DAOS Foundation doesn't fund developers directly, it's not clear that contributors like Intel or GCP are actually at the mercy of the foundation's decisions. It's more likely that the DAOS Foundation will just have authority to decide what features will roll up into the next formal DAOS release, and developers contributing code to DAOS will still prioritize whatever features their employers tell them to.

So, DAOS was the talk of the town at SC23. Does this all mean that DAOS is ready for prime time?

While Intel and Argonne may say yes, the community seems to have mixed feelings.  Consider this slide presented by László Szűcs from LRZ at the DAOS Storage Community BOF:

DAOS is clearly crazy fast and scales to hundreds of petabytes in production--Aurora's IO500 listing proves that. However, that performance comes with a lot of complexity that is currently being foisted on application developers, end-users, and system administrators. The "opportunities" listed in László's slide are choices that people running at leadership HPC scale may be comfortable making, but the average HPC user is not equipped to make thoughtful choices about container types and library interfaces.

The fact that DAOS was featured so prominently at PDSW--a research workshop--probably underscores this as well. This slide from Adrian Jackson's lightning talk sums up the complexity along two different dimensions:

His results showed that your choice of DAOS object class and I/O library atop the DAOS POSIX interface can result in wildly different checkpoint bandwidth. It's hard enough to teach HPC users about getting optimal performance out of a parallel file system like Lustre; I can't imagine those same users will embrace the idea that they should be mindful of which object class they use as they generate data.

The other DAOS-related research talk, presented by Greg Eisenhauer, was a full-length paper that caught me by surprise and exposed how much performance varies when using different APIs into DAOS. This slide is one of many that highlighted this:

I naively thought that the choice of native userspace API (key-value or array) would have negligible effects on performance, but Eisenhauer's talk showed that this isn't true. The reality appears to be that, although DAOS is capable of handling unaligned writes better than Lustre, aligning arrays on large, power-of-two boundaries still has a significant performance benefit.
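To make the alignment point concrete, here is a minimal sketch of padding variable-sized arrays so that each one starts on a power-of-two boundary. It is plain offset arithmetic and does not call any DAOS API; the 1 MiB default alignment and the function names are assumptions chosen for illustration, not values taken from Eisenhauer's talk.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


def layout(sizes: list[int], alignment: int = 1 << 20) -> list[int]:
    """Return a start offset for each array, padding so every start is aligned."""
    offsets = []
    cursor = 0
    for size in sizes:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    return offsets


# Three arrays of 1,500,000 bytes, 10 bytes, and 3 MiB.
print(layout([1_500_000, 10, 3 << 20]))  # → [0, 2097152, 3145728]
```

The cost is some wasted space between arrays; whether that trade is worthwhile depends on the array sizes and on how much the storage system actually rewards aligned access.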

Based on these sorts of technical talks about DAOS presented this year, the original question--is DAOS ready for prime time--can't be answered with a simple yes or no yet. The performance it offers is truly best in class, but achieving that performance doesn't come easy right now. Teams who are already putting heroic effort into solving high-value problems will probably leap at the opportunity to realize the I/O performance that DAOS can deliver. Such high-value problems include things like training the next generation of foundational LLMs, and GCP's bet on DAOS probably adds differentiated value to their platform as a place to train such models as efficiently as possible. But the complexity of DAOS at present probably limits its appeal to the highest echelons of leadership HPC and AI, and I think it'll be a while before DAOS is in a place where a typical summer intern will be able to appreciate its full value.

Infinia

It would be unfair of me to give all this regard to WEKA, VAST, and DAOS without also mentioning DDN's brand new Infinia product, launched right before SC23. Those in the HPC storage industry have been awaiting its launch for years now, but despite the anticipation, it really didn't come up in any conversations in which I was involved. I did learn that the engineering team developing Infinia inside DDN is completely separate from the Whamcloud team that is developing Lustre, but this could be a double-edged sword. On the good side, it means that the open-source Lustre development effort isn't competing with DDN's proprietary product for engineering priorities on a day-to-day basis. On the bad side though, I still struggle to see how Infinia and Lustre can avoid eventually competing for the same business.

For the time being, Infinia does seem to prioritize more enterprisey features like multitenancy and hands-free operation while Lustre is squarely aimed at delivering maximum performance to a broadening range of workloads. Their paths may eventually cross, but that day is probably a long way off, and Lustre has the benefit of being deeply entrenched across the HPC industry.

The emergence of pure-play GPU clouds

In addition to chatting with people about what's new in storage, I also went into SC23 wanting to understand how other cloud service providers are structuring end-to-end solutions for large-scale AI workloads. What I didn't anticipate was how many smaller cloud service providers (CSPs) showed up to SC for the first time this year, all waving the banner of offering NVIDIA H100 GPUs. These are predominantly companies that either didn't exist a few years ago or have historically focused on commodity cloud services like virtual private servers and managed WordPress sites, so it was jarring to suddenly see them at an HPC conference. How did so many of these smaller CSPs suddenly become experts in deploying GPU-based supercomputers in the time between SC22 and SC23? 

I got to talking to a few folks at these smaller CSPs to figure out exactly what they were offering to customers, and their approach is quite different from how AWS, Azure, and GCP operate. Rather than defining a standard cluster architecture and deploying copies of it all over to be consumed by whoever is willing to pay, these smaller CSPs deploy clusters of whitebox GPU nodes to customer specification and sell them as dedicated resources for fixed terms. If a customer wants a bunch of HGX H100s interconnected with InfiniBand, that's what they get. If they want RoCE, the CSP will deploy that instead. And the same is true with storage: if a customer wants EXAScaler or WEKA, they'll deploy that too.

While this is much closer to a traditional on-prem cluster deployment than a typical elastic, pay-as-you-go infrastructure-as-a-service offering, this is different from being a fancy colo. The end customer still consumes those GPUs as a cloud resource and never has to worry about the infrastructure that has to be deployed behind the curtain, and when the customer's contract term is up, their cluster is still owned by the CSP. As a result, the CSP can either resell that same infrastructure via pay-as-you-go or repurpose it for another dedicated customer. By owning the GPUs and selling them as a service, these CSPs can also do weird stuff like take out giant loans to build more data centers using GPUs as collateral. Meanwhile, NVIDIA can sell GPUs wholesale to these CSPs, book the revenue en masse, and let the CSPs deal with making sure they're maintained in production and well utilized.

It also seems like the services that customers of these smaller CSPs get are often more barebones than what they'd get from a Big 3 CSP (AWS, Azure, and GCP). They get big GPU nodes and an RDMA fabric, but managed services beyond that are hit and miss.

For example, one of these smaller CSPs told me that most of their storage is built on hundreds of petabytes of open-source Ceph. Ceph fulfills the minimum required storage services that any cloud must provide (object, block, and file), but it's generally insufficient for large-scale model training. As a result, all the smaller CSPs with whom I spoke said they are also actively exploring VAST and WEKA as options for their growing GPU-based workloads. Since both VAST and WEKA offer solid S3 and file interfaces, either could conceivably act as the underpinnings of these GPU clouds' first-party storage services as well.

As I said above though, it seems like the predominant model is for these CSPs to just ship whatever dedicated parallel storage the customer wants if something like Ceph isn't good enough. This, and the growing interest in storage from companies like VAST and WEKA, suggest a few things:

None of these observations are terribly surprising; at the price these smaller CSPs are offering GPUs compared to the Big 3 CSPs, their gross margin (and therefore their ability to invest in developing services on top of their IaaS offerings) has got to be pretty low. In the short term, it's cheaper and easier to deploy one-off high-performance storage systems alongside dedicated GPU clusters based on customer demand than develop and support a standard solution across all customers.

Of course, building a low-cost GPU service opens the doors for other companies to develop their own AI services on top of inexpensive GPU IaaS that is cost-competitive with the Big 3's native AI platforms (AWS SageMaker, Azure Machine Learning, and Google AI Platform). For example, I chatted with some folks at together.ai, a startup whose booth caught my eye with its bold claim of being "the fastest cloud for [generative] AI:"

Contrary to their banner, they aren't a cloud; rather, they provide AI services--think inferencing and fine-tuning--that are accessible through an API much like OpenAI's API. They've engineered their backend stack to be rapidly deployable on any cloud that provides basic IaaS like GPU-equipped VMs, and this allows them to actually run their computational backend on whatever cloud can offer the lowest-cost, no-frills GPU VMs. In a sense, companies like together.ai develop and sell the frills that these new GPU CSPs lack, establishing a symbiotic alternative to the vertically integrated AI platforms on bigger clouds.

I did ask a few of these smaller CSPs what their overall pitch was. Why would I choose GPU cloud X over their direct competitor GPU cloud Y? The answers went in two directions:

  1. They offer lower cost per GPU hour than their competition
  2. They are faster to get GPUs off a truck and into production than their competition

There's a big caveat here: I didn't talk to many representatives at these CSPs, so my sample size was small and not authoritative. However, taking these value propositions at face value struck me as being quite precarious since their value is really a byproduct of severe GPU shortages driven by the hyped-up AI industry. What happens to these CSPs (and the symbionts whose businesses depend on them) when AMD GPUs appear on the market in volume? What happens if NVIDIA changes course and, instead of peanut-buttering its GPUs across CSPs of all sizes, it focuses its attention on prioritizing deliveries to just a few blessed CSPs?

There is no moat around generative AI, and I left SC23 feeling like there's a dearth of long-term value being generated by some of these smaller GPU CSPs. For those CSPs whose primary focus is buying and deploying as many GPUs in as short a time as possible, not everyone can survive. They'll either come out of this GPU shortage having lost a lot of money building data centers that will go unused, or they'll be sold for parts.

More importantly to me though, I learned that I should give less credence to the splashy press events of hot AI-adjacent startups if their successes lie exclusively with smaller GPU CSPs. Some of these CSPs are paying to make their problems go away in an effort to keep their focus on racking and stacking GPUs in the short term, and I worry that there's a lack of long-term vision and strong opinions in some of these companies. Some of these smaller CSPs seem much more like coin-operated GPU cluster vending machines than platform providers, and that business model doesn't lend itself to making big bets and changing the industry.

Put another way, my job--both previous and current--has always been to think beyond short-term band aids and make sure that my employer has a clear and opinionated view of the technical approach that will be needed to address the challenges of HPC ten years in the future. I know who my peers are at the other Big 3 CSPs and leadership computing facilities across the world, and I know they're thinking hard about the same problems that I am. What worries me is that I do not know who my peers are at these smaller CSPs, and given their speed of growth and smaller margins, I worry that they aren't as prepared for the future as they will need to be. The AI industry as a whole will be better off when GPUs are no longer in such short supply, but the ecosystem surrounding some of these smaller GPU CSPs is going to take some damage when that day comes.

Other dribs and drabs

I also had a lot of interesting conversations and noticed a few subtle themes last week that don't neatly fit into any other category, but I'd love to hear more from others if they noticed the same or have more informed opinions.

APUs and superchips - are they really that useful?

Because I spent my booth duty standing next to one of Eagle's 8-way HGX H100 nodes, a lot of people asked me if I thought the Grace Hopper superchip would be interesting. I'm not an expert in either GPUs or AI, but I did catch up with a few colleagues who are smarter than me in this space last week, and here's the story as I understand it:

The Grace Hopper superchip (let's just call it GH100) is an evolution of the architecture developed for Summit, where V100 GPUs were cache-coherent with the CPUs through a special widget that converted NVLink to the on-chip coherence protocol for Power9. With GH100, the protocol used to maintain coherence across the CPU is directly compatible with the ARM AMBA coherence protocol, eliminating one bump in the path that Power9+V100 had. Grace also has a much more capable memory subsystem and NOC that makes accessing host memory from the GPU more beneficial.

Now, do AI workloads really need 72 cores per H100 GPU? Probably not.

What AI (and HPC) will need are some high-performance cores to handle all the parts of application execution that GPUs are bad at--divergent code paths, pointer chasing, and I/O. Putting capable CPU cores (Neoverse V2, not the N2 used in CPUs like Microsoft's new Cobalt 100) on a capable NOC that is connected to the GPU memory subsystem at 900 GB/s opens doors for using hierarchical memory to train LLMs in clever ways.

For example, naively training an LLM whose weights and activations are evenly scattered across both host memory and GPU memory won't go well since that 900 GB/s of NVLink C2C would be on the critical path of many computations. However, techniques like activation checkpointing could become a lot more versatile when the cost of offloading certain tensors from GPU memory is so much lower. In essence, the presence of easily accessible host memory will likely allow GPU memory to be used more efficiently since the time required to transfer tensors into and out of HBM is easier to hide underneath other computational steps during training.
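To put rough numbers on that argument, here is a back-of-the-envelope sketch. Only the 900 GB/s figure comes from the text above; the 64 GB/s PCIe Gen5 x16 comparison point, the 4 GB tensor, and the 10 ms compute window are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope: can offloading a tensor hide underneath compute?
NVLINK_C2C_GBPS = 900.0     # from the text: Grace <-> Hopper bandwidth
PCIE_GEN5_X16_GBPS = 64.0   # assumption: approximate per-direction peak


def transfer_ms(tensor_gb: float, link_gbps: float) -> float:
    """Milliseconds needed to move tensor_gb gigabytes at link_gbps GB/s."""
    return tensor_gb / link_gbps * 1000.0


def is_hidden(tensor_gb: float, link_gbps: float, compute_ms: float) -> bool:
    """True if the transfer fits entirely within an overlapping compute step."""
    return transfer_ms(tensor_gb, link_gbps) <= compute_ms


if __name__ == "__main__":
    tensor_gb = 4.0    # hypothetical activation tensor size
    compute_ms = 10.0  # hypothetical compute time available for overlap
    for name, bw in [("NVLink C2C", NVLINK_C2C_GBPS),
                     ("PCIe Gen5 x16", PCIE_GEN5_X16_GBPS)]:
        t = transfer_ms(tensor_gb, bw)
        print(f"{name}: {t:.1f} ms, hidden={is_hidden(tensor_gb, bw, compute_ms)}")
```

Under these assumed numbers, the offload fits inside the compute window over NVLink C2C but not over PCIe, which is the sense in which cheaper offloading makes checkpointing-style techniques more versatile.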

Pairing an over-specified Grace CPU with a Hopper GPU also allows the rate of GPU development to proceed independently of CPU development. Even if workloads that saturate an H100 GPU might not also need all 72 cores of the Grace CPU, H200 or other future-generation GPUs can grow into the capabilities of Grace without having to rev the entire superchip.

I didn't get a chance to talk to any of my colleagues at AMD to get their perspective on the MI300 APU, but I'd imagine their story is a bit simpler since their memory space is flatter than NVIDIA's superchip design. This will make training some models undoubtedly more straightforward but perhaps leave less room for sophisticated optimizations that can otherwise cram more of a model into a given capacity of HBM. I'm no expert though, and I'd be happy to reference any explanations that real experts can offer! 

What about quantum?

Quantum computing has been a hot topic for many years of SC now, but it feels like a topic that is finally making its way out of pure CS research and into the minds of the everyday HPC facility leaders. I talked to several people last week who asked me for my opinion on quantum computing because they have come to the realization that they need to know more about it than they do, and I have to confess, I'm in the same boat as they are. I don't follow quantum computing advancements very closely, but I know an increasing number of people who do--and they're the sort who work in CTOs' offices and have to worry about risks and opportunities more than intellectual curiosities.

It's hard to say there've been any seismic shifts in the state of the art in quantum computing at SC23; as best I can tell, there's still a rich ecosystem of venture capital-backed startups who keep cranking out more qubits. But this year felt like the first year where HPC facilities who haven't yet started thinking about their position on quantum computing are now behind. Not everyone needs a quantum computer, and not everyone even needs a quantum computing researcher on staff. But everyone should be prepared with a strong point of view if they are asked "what will you be doing with quantum computing?" by a funding agency or chief executive.

NextSilicon

One of the least-stealthy stealth-mode startups in the HPC industry has been NextSilicon, a company that emerged from stealth mode at SC23, launched their new Maverick accelerator, and announced their first big win with Sandia National Lab's Vanguard II project.

What's notable about NextSilicon is that, unlike just about every other accelerator startup out there, they are not trying to go head-to-head with NVIDIA in the AI acceleration market. Rather, they've created a dataflow accelerator that aims to accelerate challenging HPC workloads that GPUs are particularly bad at--things like irregular algorithms and sparse data structures. They've paired this hardware with a magical runtime that continually optimizes the way the computational kernel is mapped to the accelerator's reconfigurable units to progressively improve the throughput of the accelerator as the application is running.

The concept of dataflow accelerators has always been intriguing since they're the only alternative to improving computational throughput besides making larger and larger vectors. The challenge has always been that these accelerators are more like FPGAs than general-purpose processors, and they require similar amounts of hardcore CS expertise to use well. NextSilicon claims to have cracked that nut with their runtime, and it seems like they're hiring the right sorts of people--real HPC folks with respectable pedigrees--to make sure their accelerator can really deliver value to HPC workloads.

I/O benchmarking developments

At the IO500 BOF, there was rich discussion about adding new benchmarking modes to IOR and IO500 to represent a wider range of patterns.

More specifically, there's been an ongoing conversation about including a 4K random read test, and it sounds like its most outspoken critics have finally softened their stance. I've not been shy about why I think using IOPS as a measure of file system performance is dumb, but 4K random IOPS do establish a lower bound of performance for what a real application might experience. Seeing as how IO500 has always been problematic as any representation of how a file system will perform in real-world environments, adding the option to run a completely synthetic, worst-case workload will give IO500 the ability to define a complete bounding box around the lower and upper limits of I/O performance for a file system.
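For a sense of what a 4K random read test actually does, here is a minimal single-process sketch. It is not IOR or IO500 code; the function names, file size, and read count are all hypothetical, and it relies on `os.pread`, which is only available on Unix-like systems. On a small, freshly written file the reads will be served from the page cache, so the number it reports is nowhere near a true worst case; a real test would bypass the cache and run from many clients.

```python
import os
import random
import tempfile
import time

BLOCK = 4096  # 4 KiB per read


def make_test_file(directory, n_blocks=256):
    """Create a file of n_blocks 4 KiB blocks filled with random bytes."""
    path = os.path.join(directory, "iops_test.dat")
    with open(path, "wb") as f:
        f.write(os.urandom(n_blocks * BLOCK))
    return path


def random_4k_read_iops(path, n_reads=2000, seed=0):
    """Issue n_reads block-aligned 4 KiB reads at random offsets; return IOPS."""
    n_blocks = os.path.getsize(path) // BLOCK
    rng = random.Random(seed)
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(n_reads):
            offset = rng.randrange(n_blocks) * BLOCK
            if len(os.pread(fd, BLOCK, offset)) != BLOCK:
                raise IOError("short read")
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return n_reads / elapsed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(f"{random_4k_read_iops(make_test_file(d)):.0f} IOPS (page-cached)")
```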

Hendrik Nolte from GWDG also proposed a few new and appealing IOR modes that approach more realistic workload scenarios. The first was a new locally random mode where data is randomized within IOR segments but segments are repeated.

Compared to globally randomized reads (which is what IOR normally does), this is a much closer representation of parallel workloads that are not bulk-synchronous; for example, NCBI BLAST uses thread pools and work sharing to walk through files, and the resulting I/O pattern is similar to this new mode.
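The difference between the two access patterns can be sketched by generating the offsets each would visit. This is only one reading of the proposal as summarized here (shuffle the blocks inside each segment, but walk the segments in order), not IOR's implementation, and the function names and parameters are hypothetical.

```python
import random


def global_random_offsets(n_segments, blocks_per_segment, block_size, seed=0):
    """Shuffle every block offset in the file together."""
    rng = random.Random(seed)
    offsets = [i * block_size for i in range(n_segments * blocks_per_segment)]
    rng.shuffle(offsets)
    return offsets


def local_random_offsets(n_segments, blocks_per_segment, block_size, seed=0):
    """Visit segments in order, shuffling block offsets only within each one."""
    rng = random.Random(seed)
    offsets = []
    for seg in range(n_segments):
        base = seg * blocks_per_segment * block_size
        in_segment = [base + b * block_size for b in range(blocks_per_segment)]
        rng.shuffle(in_segment)
        offsets.extend(in_segment)
    return offsets


if __name__ == "__main__":
    print("global:", global_random_offsets(3, 4, 4096))
    print("local: ", local_random_offsets(3, 4, 4096))
```

Both functions touch exactly the same set of blocks; the locally random version simply never jumps outside the current segment until that segment is finished.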

He also described a proposal to run concurrent, mixed workloads in a fashion similar to how fio currently works.  Instead of performing a bulk-synchronous parallel write followed by a bulk-synchronous parallel read, his proposal would allow IOR to perform reads and writes concurrently, more accurately reflecting the state of multitenant storage systems. I actually wrote a framework to do exactly this and quantify the effects of contention using IOR and elbencho, but I left the world of research before I could get it published. I'm glad to see others seeing value in pursuing this idea.
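As a toy illustration of what running reads and writes concurrently means, the sketch below runs one sequential writer and one sequential reader at the same time in a single process. It is not the IOR proposal, fio, or the framework mentioned above; the names, the 1 MiB transfer size, and the file sizes are hypothetical, and a meaningful contention test would use many clients against a shared file system.

```python
import os
import tempfile
import threading
import time

XFER = 1 << 20  # 1 MiB transfer size (illustrative)


def writer(path, total_bytes, results):
    """Sequentially write at least total_bytes to path and record MB/s."""
    buf = b"\0" * XFER
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            written += f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # include the cost of reaching stable storage
    results["write_MBps"] = written / (time.perf_counter() - start) / 1e6


def reader(path, results):
    """Sequentially read all of path and record MB/s."""
    nread = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(XFER):
            nread += len(chunk)
    results["read_MBps"] = nread / (time.perf_counter() - start) / 1e6


def run_mixed(workdir, total_bytes=8 * XFER):
    """Run one reader and one writer at the same time; return both rates."""
    read_path = os.path.join(workdir, "read.dat")
    write_path = os.path.join(workdir, "write.dat")
    with open(read_path, "wb") as f:  # pre-populate the file to be read
        f.write(os.urandom(total_bytes))
    results = {}
    threads = [
        threading.Thread(target=writer, args=(write_path, total_bytes, results)),
        threading.Thread(target=reader, args=(read_path, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(run_mixed(d))
```

Comparing these rates against the same reader and writer run one after the other is the basic way to quantify how much each side suffers from contention.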

The other noteworthy development in I/O benchmarking was presented by Sven Breuner at the Analyzing Parallel I/O BOF where he described a new netbench mode for his excellent elbencho benchmark tool. This netbench mode behaves similarly to iperf in that it is a network-level throughput test, but because it is part of elbencho, it can generate the high-bandwidth incasts and broadcasts that are typically encountered between clients and servers of parallel storage systems:

This is an amazing development because it makes elbencho a one-stop shop for debugging the entire data path of a parallel storage system. For example, if you're trying to figure out why the end-to-end performance of a file system is below expectation, you can use elbencho to test the network layer, the object or file layer, the block layer, and the overall end-to-end path separately to find out which layer is underperforming. Some file systems include specialized tools to perform the same network tests (e.g., nsdperf for IBM Spectrum Scale), but elbencho now has a nice generic way to generate these network patterns for any parallel storage system.

Some personal thoughts

As with last year, I couldn't attend most of the technical program due to a packed schedule of customer briefings and partner meetings, but the SC23 Digital Experience was excellently done, and I wound up watching a lot of the content I missed during the mornings and after the conference (at 2x speed!). In that sense, the hybrid nature of the conference is making it easier to attend as someone who has to juggle business interests with technical interests; while I can't jump into public arguments about the definition of storage "QOS", I can still tell that my old friends and colleagues are still fighting the good fight and challenging conventional thinking across the technical program.

My Parallel I/O in Practice tutorial

This was the sixth year that I co-presented the Parallel I/O in Practice tutorial with my colleagues Rob Latham, Rob Ross, and Brent Welch. A conference photographer got this great photo of me in the act:

Presenting this tutorial is always an incredibly gratifying experience; I've found that sharing what I know is one of the most fulfilling ways I can spend my time, and being able to start my week in such an energizing way is what sustains the sleep deprivation that always follows. Giving the tutorial is also an interesting window into what the next generation of I/O experts is worrying about; for example, we got a lot of questions and engagement around the low-level hardware content in our morning half, and the I/O benchmarking material in the late afternoon seemed particularly well received. The majority of attendees came from the systems side rather than the user/dev side as well, perhaps suggesting that the growth in demand for parallel storage systems (and experts to run them) is outstripping the demand for new ways to perform parallel I/O. Guessing wildly, perhaps this means new developers are coming into the field higher up the stack, using frameworks like fsspec that abstract away low-level I/O.

Since I've jumped over to working in industry, it's been hard to find the business justification to keep putting work hours into the tutorial despite how much I enjoy it.  I have to confess that I didn't have time to update any of the slides I presented this year even though the world of parallel I/O has not remained the same, and I am going to have to figure out how to better balance these sorts of community contributions with the demands of a day job in the coming years.

An aside on COVID safety

At SC22, I fastidiously wore a KN95 mask while indoors and avoided all after-hours events and indoor dining to minimize my risk of catching COVID. At that time, neither my wife nor I had ever gotten COVID before, and I had no desire to bring it home to my family since my father died of COVID-related respiratory failure two years prior. Staying fully masked at SC22 turned out to be a great decision at the time since a significant number of other attendees, including many I spoke with, contracted COVID at SC22. By comparison, I maintained my COVID-free streak through 2022.

This year I took a more risk-tolerant approach for two reasons:

  1. My wife and I both broke our streaks this past summer and contracted COVID while on vacation, so if I got sick, we knew what to expect, and
  2. I got my gazillionth COVID and flu shots in October in anticipation of attending SC.

Part of my approach to managing risk was bringing my trusty Aranet4 CO2 sensor with me so that I could be aware of areas where there was poor air circulation and the risk of contracting an airborne illness would be higher. I only wore a KN95 at the airport gates and while on the airplane at SC23, and despite going all-in on after-hours events, indoor dining, and copious meetings and tours of booth duty, I'm happy to report that I made it through the conference without getting sick.

I have no doubt that being vaccinated helped, as I've had several people tell me they tested positive for COVID after we had dinner together in Denver. But it's also notable that the Denver Convention Center had much better ventilation than the Kay Bailey Hutchison Convention Center in Dallas where SC22 was held last year. To show this quantitatively, let's compare air quality measurements from SC22 to SC23.

My schedule for the day on which I give my tutorial is always the same: the tutorial runs from 8:30am to 5:00pm with breaks at 10:00, 12:00, and 3:00. Because of this consistent schedule, comparing the CO2 readings (which are a proxy for re-breathed air) for my tutorial day at SC22 versus SC23 shows how different the air quality was in the two conference centers. Here's what that comparison looks like:

What the plot shows is that CO2 (re-breathed air) steadily increased at the start of the tutorial at both SC22 and SC23, but Denver's convention center kicked on fresh air ventilation after an hour while Dallas simply didn't. Air quality remained poor (over 1,000 ppm) throughout the day in Dallas, whereas Denver was pretty fresh (below 700 ppm) even during the breaks and the indoor luncheon. This relatively good air circulation inside the convention center at SC23 made me much more comfortable about going maskless throughout the week.

This isn't to say that I felt there was no risk of getting sick this year; there was at least one busy, upscale restaurant/bar in which I dined where the air circulation was no better than in a car or airplane. For folks who just don't want to risk being sick over Thanksgiving, wearing a mask and avoiding crowded bars was probably still the best option this year. And fortunately, Denver's weather was gorgeous, so outdoor dining was completely viable during the week.

AI's effects on the HPC community

Although AI has played a prominent role in previous SC conferences, this was the first year where I noticed that the AI industry is bleeding into the HPC community in weird ways.

For example, I had a bunch of journalists and media types accost me and start asking rather pointed questions while I was on booth duty. Talking to journalists isn't entirely unusual since I've always been supportive of industry press, but the social contract between practitioners like me and journalists has always been pretty formal--scheduling a call in advance, being invited to speak at an event, and things like that have long been the norm. If I was being interviewed on the record, I knew it.

This year though, it seemed like there was a new generation of younger journalists who approached me no differently than a casual booth visitor. Some did introduce themselves as members of the press after we got chatting (good), but others did not (not good) which led me to take away a learning: check names and affiliations before chatting with strangers, because the days where I could assume that all booth visitors would act in good faith are gone.

Now, why the sudden change?  I can think of three possible reasons:

  1. I'm getting older, and there are now tech industry journalists who are younger than me and think I am worth talking to since I've always been around. Maybe the old-school HPC folks that predate me have always had to deal with this.
  2. The proliferation of platforms like Substack makes it financially viable to be an independent journalist, and conversely, anyone can be a journalist without editorial oversight.
  3. The spotlight on the massive AI industry is also illuminating the HPC industry. HPC and AI are both built on the same foundational technologies (GPUs, RDMA fabrics, HBM, and the like) so AI journalists now have a reason to start showing up at HPC community events.

It'd be fair to argue that #3 is a stretch and that this isn't an AI phenomenon if not for the fact that I was also accosted by a few venture capitalists for the first time this year. HPC has never been an industry that attracted the attention of venture capital in the way that AI does, so I have to assume being asked specific questions about the viability of some startup's technology is a direct result of the AI market opportunity.

While it's nice to have a broader community of attendees and more media coverage, the increasing presence of AI-focused media and VC types in the SC community means I can't be as open and honest as I once was. Working for a corporation (with secrets of its own to protect) doesn't help there either, so maybe getting cagier when talking to strangers is just a part of growing up.

SC23 as a milestone year

Attending SC23 this year coincided with two personal milestones for me as well.

This is the tenth year I've been in the HPC business, and the first SC I ever attended was SC13.  I can't say that this is my eleventh SC because I didn't attend in 2014 (on account of working at a biotech startup), but I've been to SC13, SC15 through SC19, SC20 and SC21 virtually, and SC22 and SC23 in-person.  At SC13 ten years ago, the weather was a lot colder:

But I still have the fondest memories of that conference because that was the week where I felt like I had finally found my community after having spent a decade as an unhappy materials science student.

SC23 is also a milestone year because it may be the last SC I attend as a storage and I/O guy. I recently signed on for a new position within Microsoft to help architect the next generation of supercomputers for AI, and I'll probably have to trade in the time I used to spend at workshops like PDSW for opportunities to follow the latest advancements in large-scale model training, RDMA fabrics, and accelerators. But I think I am OK with that.

I never intended to become an I/O or storage expert when I first showed up at SC13; it wasn't until I joined NERSC that I found that I could learn and contribute the most by focusing on storage problems. The world has changed since then, and now that I'm at Microsoft, it seems like the problems faced at the cutting edge of large language models, generative AI, and the pursuit of AGI are where the greatest need lies. As I said earlier in this post, AI has bigger problems to deal with than storage and I/O, and those bigger problems are what I'll be chasing. With any luck, I'll be able to say I had a hand in designing the supercomputers that Microsoft builds after Eagle. And as has been true for my last ten years in this business, I'll keep sharing whatever I learn with whoever wants to know.