Life and leaving NERSC

When word started to spread that I was leaving my job at NERSC for Microsoft, a lot of people either directly or indirectly attributed my decision to being one motivated by money. Rationalizing my decision to leave is certainly a lot easier with this "Glenn was lured away with bags of cash" narrative, but that wasn't really a factor when I chose to move on. Rather, my decision is a reflection of where I see the world of HPC going in the coming decade and where I personally wanted to position myself. For my own therapeutic reasons (and perhaps the benefit of anyone interested in what it's like to work within, and subsequently leave, the DOE HPC complex), I'll try to write it all out here.

Working at NERSC

First things first: NERSC has been a wonderful place to work.

<div style="text-align: center;">A typical view from outside NERSC’s facility in Berkeley after work during the winter months. Yes, it really does look like this.</div> <p>When I started in mid-2015, I came in with about three years of prior work experience (two at SDSC doing user support and one at a biotech startup) and knew a little bit about a lot of things in HPC. But I didn’t really know the basics of I/O or storage–I couldn’t tell you what “POSIX I/O” really meant or how GPFS worked. The fact that I got to help author NERSC’s ten-year strategy around storage in just two years, was invited to present my view on how to bridge the gap between HPC and enterprise storage at Samsung’s North American headquarters a year later, and was trusted to oversee the design and execution of the world’s first 35 petabyte all-flash Lustre file system through my first four years is a testament to how much opportunity is available to learn and grow at NERSC.</p>

There are a couple of reasons for this.

Stable funding

Perhaps foremost, NERSC (and DOE's Leadership Computing Facilities, ALCF and OLCF) enjoy healthy budgets and financial stability since worldwide leadership in scientific advancement is generally a national priority by both major political parties in the US. This means that, regardless of who is president and which party holds majorities in Congress, the DOE HPC facilities can pay their employees and deploy new supercomputers. This solid funding makes it much easier to invest in staff development and long-term planning; I was able to become a resident I/O expert at NERSC because I was never forced to chase after the funding du jour to make ends meet. Congress trusts NERSC to allocate its funding responsibly, and NERSC prioritized letting me learn as much as I could without distraction.

Instant credibility and access

Second, having a NERSC affiliation gives you instant credibility and access in many cases. It's not necessarily fair, but it's definitely true. Within my first year at NERSC, I was invited to give a presentation about I/O performance monitoring in Paris because the organizer wanted a lineup of speakers from all the big players in HPC. I had never been to Europe at that point in my life, but being the I/O guy from NERSC (and being able to present well!) was enough to get me there. And it was during that trip to Paris that I got to meet--and literally have conversation over dinner with--more industry bigshots that I can remember. And that trip to Paris was not an outlier; pandemic aside, NERSC let me go to Europe at least once or twice every year I've worked there.

<div style="text-align: center;">The first photo I ever took of Notre Dame on the first day I’d ever set foot in Europe. NERSC sent me there less than a year after I started.</div> <p>Of course, this is not to say that every employee at a DOE HPC facility is wining and dining in Paris every summer. Many of these opportunities are earned by showing the value of the work you’re doing, just like at any job. But owing to healthy budgets, travel expenses are rarely the limiting factor in chasing after these opportunities. In addition, going out into the world and talking about what you do is part of the job at a DOE facility; being a leader in the field of HPC is part of the mission of NERSC, ALCF, and OLCF, so doing high-risk, first-of-a-kind work and telling the world about it is uniquely valued within DOE in a way that it is not in industry.</p>

Smart people

A product of these two factors (stable budget and instant credibility) results in coworkers and colleagues who are generally very experienced and capable. There's an interesting mix of laissez-faire management and rigorous process-driven management as a result.

Staff are generally given the freedom to choose their own destiny and focus on work that they enjoy much like in any academic environment; it's not hard to pick up passion projects or even move between groups if things get stale on a day-to-day basis. Since everyone is working on their own slices of HPC, there's also easy access to world experts in different areas of technology if you need one. For example, I recall once reviewing a storage system that appeared to rely on multiplexing two 12G SAS links over a single 24G SAS. After one email and a few hours, a coworker confirmed, complete with a citation to the SCSI standards, that this was totally possible. Even if someone in-house didn't know the answer, I had direct access to an engineering manager at a leading storage vendor who owed me a favor and definitely would've known the answer. It's really, really hard to find as many smart people in arm's reach in most other HPC centers.

At the same time, there is rigorous federal oversight on major projects and procurements to ensure that taxpayer dollars are responsibly spent. This is a double-edged sword because all of the reporting and reviews that go into massive capital projects make forward progress very slow at times. All DOE HPC facilities review and re-review everything about these giant supercomputers before making a decision, so by the time the public sees a press release about a new supercomputer, lab staff have spent literal years going over every detail and risk. It sometimes may not seem that way (how many problems has Aurora had?), but rest assured that every schedule slip or technology change the public hears was preceded by countless hours of meetings about risk and cost minimization. On the flip-side though, you have the opportunity to learn every gory detail about the system directly from the people who designed it.

Pay

In true millennial fashion, I think it's important to have an open discussion about the pay. DOE labs pay more than any other HPC facility in the world as far as I am aware, and even in the San Francisco Bay Area, salary at NERSC is comparable to the base salaries offered by all the big tech companies. You can get an idea of what entry-level salaries (think: first job after postdoc or a few years out of undergrad) by searching H1B Visa postings, and anecdotally, I'd wager that a typical HPC job at NERSC pays about 2x that of the same job at a typical US university and 3x-4x that of the same job at a British or European university. All the labs pay about the same to boot, so an HPC job at somewhere like Oak Ridge can afford you a relatively luxurious lifestyle.

Don't get me wrong though; affording to buy a Bay Area house on a single NERSC salary alone would be tough in the same way that buying a Bay Area house on any single salary would be. And while NERSC's compensation is comparable to the base salary of the big tech companies, that base is about all you can get since DOE labs cannot offer equity or substantial bonuses. This is less of a gap if you're just starting out, but anyone who's looked at compensation structures in tech knows that stock-based compensation, not base salary, dominates total compensation as you move up.

So, if money wasn't an issue for me and NERSC is such a great place to work, why would I ever leave?

The road ahead for HPC

On one hand, HPC's future has never been brighter thanks to how much life (and money!) the AI industry is bringing to the development of HPC technologies. We have new all-flash file systems, gigantic GPUs, awesome CPU memory technologies, and mixed-precision techniques in the HPC space that were all directly driven by developments primarily intended for AI workloads. On the other hand, leadership HPC appears to be engaging in unsustainable brinkmanship while midrange HPC is having its value completely undercut by cloud vendors. I've not been shy about my overall anxiety about where HPC is going because of this, but I'll elaborate now that the exascale race has been won.

The future of leadership HPC

Without some monumental breakthrough in transistor technology, there is only one path forward in continuing to build faster and faster supercomputers in the next decade: pour more and more energy (and dissipate more and more heat) into larger and larger (and more and more) GPUs.

The goal post for exascale power keeps moving because that's been the easiest way to hit the mythical exaflop milestone; while the original goal was 20 MW, Frontier is coming in at 29 MW and Aurora at "under 60 MW." Not only is this just a lot of power to feed into a single room, but the cost and effort of actually building this infrastructure is newsworthy in and of itself these days. At the current trajectory, the cost of building a new data center and extensive power and cooling infrastructure for every new leadership supercomputer is going to become prohibitive very soon.

HPC data centers situated in places where the cost of electricity and real estate (stacked atop the risk of earthquake or wildfire) further skew the economics of just adding more power are going to run up against this first. It used to be easy to dismiss these practicality concerns by arguing that colocating scientists with supercomputers created immeasurable synergy and exchange of ideas, but the fact that science never stopped during the work-from-home days of the pandemic have taken a lot of air out of that argument.

My guess is that all the 50-60 MW data centers being built for the exascale supercomputers will be the last of their kind, and that there will be no public appetite to keep doubling down.

Given this, DOE's leadership computing facilities are facing an existential threat: how do you define leadership computing after exascale if you can't just add another 50% more power into your facility? How do you justify spending another $600 million for a supercomputer that uses the same power but only delivers 15% more performance? You can pour similarly huge amounts of money into application modernization to accelerate science, but at the end of the day, you'd still be buying a lot of hardware that's not a lot faster.

The future of places like NERSC

NERSC is probably a little better off since its lack of an exascale machine today gives it at least one more turn of the crank before it hits a hard power limit in its data center. That gives it the ability to deploy at least one more system after Perlmutter that is significantly (at least 2x) more capable but draws significantly more power. However, compared to Frontier and Aurora, such a system may still look rather silly when it lands in the same way that Perlmutter looks a bit silly compared Summit, which was funded by the same agency but deployed years earlier.

And therein lies the dilemma of centers like NERSC--how do you position yourself now so that by the time you deploy an HPC system that is close to maxing out on power, it is sufficiently different from a pure-FLOPS leadership system that it can solve problems that the leadership systems cannot?

The easy go-to solution is to craft a story around "data-centric" supercomputing. We did this when I was at the San Diego Supercomputer Center when we were budget-limited and had to differentiate our $12 million Comet supercomputer from TACC's $30 million Stampede. You invest more in the file system than you would for a pure-FLOPS play, you provide low-cost but high-value onramps like Jupyter and science gateways to enable new science communities that have modest computing needs, and you fiddle with policies like allocations and queue priority to better suit interactive and urgent computing workloads. From a productivity standpoint, this is can be a great story since users will always respond well to lower queue wait times and less frustrations with the file system. From a system architect's standpoint, though, this is really boring. The innovation happens in policies and software, not clever hardware or design, so there's very little that's new for a system designer to think about in this case.

A more innovative approach is to start thinking about how to build a system that does more than just run batch jobs. Perhaps it gives you a private, fast file system where you can store all your data in a way indistinguishable from your personal laptop. Perhaps it gives you a convenient place to run a Jupyter notebook that has immediate access to a powerful GPU. Or perhaps it gives you all the tools to set up an automated process where all you have to do is upload a file to trigger an automatic data analysis and reduction pipeline that returns its output to a shiny HTTP interface. Such a system may not be able to crank out an exaflop using HPL, but does that matter if it's the only system in the country that supports such automation?

There are interesting system architecture questions in the latter case, so as a system designer, I much prefer it over the "data-centric" angle to non-exaflop supercomputing strategies. But there remains a problem.

The problem: cloud

Such a "more than just batch jobs" supercomputer actually already exists. It's called the cloud, and it's far, far ahead of where state-of-the-art large-scale HPC is today--it pioneered the idea of providing an integrated platform where you can twist the infrastructure and its services to exactly fit what you want to get done. Triggering data analysis based on the arrival of new data has been around for the better part of a decade in the form of serverless computing frameworks like Azure Functions. If you need to run a Jupyter notebook on a server that has a beefy GPU on it, just pop a few quarters into your favorite cloud provider. And if you don't even want to worry about what infrastructure you need to make your Jupyter-based machine learning workload go fast, the cloud providers all have integrated machine learning development environments that hide all of the underlying infrastructure.

And therein lies the problem: the definition of "innovation" as non-exaflop HPC runs up against this power wall might actually mean "catching up to the cloud."

This is not to say that NERSC-like HPC centers are entirely behind the cloud; all the DOE HPC facilities have bigger, faster, and more convenient parallel file systems that are generally always on and where data is always somewhere "fast." They also provide familiar, managed software environments and more egalitarian support to small- to mid-scale science projects. DOE HPC also takes the most risk in deploying unproven technologies to shake them out before they become available to the wide market.

However, those gaps are beginning to close. You can stick a full Cray EX system, identical to what you might find at NERSC or OLCF, inside Azure nowadays and avoid that whole burdensome mess of building out a 50 MW data center. You can also integrate such a system with all the rich infrastructure features the cloud has to offer like triggered functions. And when it comes to being first to market for risky HPC hardware, the cloud has already caught up in many ways--Microsoft deployed AMD Milan-X CPUs in their data centers before any HPC shop did, and more recently, Microsoft invested in AMD MI-200 GPUs before Frontier had a chance to shake them out.

Given this steep trajectory, I see only two scenarios for large-scale, non-exaflop HPC facilities in the 10+ year horizon:

They develop, adopt, steal, or squish cloud technologies into their supercomputers to make them functionally equivalent to cloud HPC deployments. They may be a little friendlier to scientific users since cloud functionality wasn't designed for scientific computing alone, but they also may not be as stable, mature, or feature-rich as their cloud cousins.
They find better overall economics in eventually moving to massive, long-term, billion-dollar deals where flagship HPC systems and their "more than just batch jobs" features are colocated inside cloud datacenters sited at economically advantageous (that is, cheap power, cooling, and labor) locations in the country.

There's also grey area in between where national HPC facilities consolidate their physical infrastructure in cheap areas to manage costs but still self-manage their infrastructure rather than fully outsource to a commercial cloud. CSCS has hinted at this model as their future plan since they cannot build 100 MW datacenters in Switzerland, and this is proof that leading HPC facilities around the world see the writing on the wall and need to maneuver now to ensure they remain relevant beyond the next decade. Unfortunately, the politics of consolidating the physical infrastructure across the DOE HPC sites would likely be mired in Congressional politics and take at least a decade to work out. Since serious work towards this hasn't started yet, I don't envision such a grey-area solution emerging before all the DOE facilities hit their power limit.

Hopefully I've painted a picture of how I perceive the road ahead for large-scale HPC facilities and you can guess which one I think will win out.

Final thoughts

I have every confidence that there will still be DOE HPC facilities in ten years and that they will still be staffed by some of the brightest minds in HPC. And even if a cloud-based HPC facility ultimately consumes centers like NERSC, I don't think many people would be out of work. The vast majority of what DOE's HPC people do is think carefully about technology trends, maintain a deep understanding of user requirements, provide excellent support to its thousands of users, and keep complex supercomputers running well. Those jobs don't go away if the supercomputer is in the cloud; it's just the physical location, the hands doing physical hardware swaps, and the breadth of vendor interactions that may change.

For me as a system architect though, it's become too hard for me to catch up to all the new technologies and techniques HPC needs for the future while also building up other staff to be masters of today's I/O challenges. I found myself at a fork in the road. One path would mean catching up on a technical level and then getting in front of where the future of HPC lies before it gets there. The other path would mean trying to steer the entire DOE HPC ship in the right direction, as long as that may take, and have faith that the people I bring along can race far enough ahead to tell me if we're still going where we need to go. Perhaps a bit selfishly, I chose the former. I'm just not ready to give up on racing ahead myself yet, and the only way I could hope to catch up was to make it a full-time job.

I don't claim to know the future, and a lot of what I've laid out is all speculative at best. NERSC, ALCF, or OLCF very well may build another round of data centers to keep the DOE HPC party going for another decade. However, there's no denying that the stakes keep getting higher with every passing year.

That all said, DOE has pulled off stranger things in the past, and it still has a bunch of talented people to make the best of whatever the future holds.

hpc.social