Exascale's long shadow and the HPC being left behind
The delivery of Japan’s all-CPU Fugaku machine and the disclosure of the UK’s all-CPU ARCHER 2 system in the news, both solidly “pre-exascale” machines with pre-exascale budgets, have reopened old wounds around the merits of deploying all-CPU systems in the context of leadership HPC. Whether a supercomputer can truly be “leadership” if it addresses the needs of today using power-inefficient, low-throughput technologies (rather than the needs of tomorrow, optimized for efficiency) is a very fair question to ask, and Filippo took it head-on:
<blockquote class="twitter-tweet" style="display: block; margin: auto;"><div dir="ltr" lang="en">Unfortunately take codes from Tier-2 with GPU to Tier-1 without GPU is a huge step backward. These calls are holding back the true potential of #GPU computing in accelerating scientific discovery! https://t.co/qVVEWFDXt1</div>
— Filippo Spiga (@filippospiga) May 20, 2020</blockquote>
Of course, the real answer depends on your definition of “leadership HPC.” Does a supercomputer qualify as “leadership” by definition if its budget is leadership-level? Or does it need to enable science at a scale that was previously unavailable? And does that science necessarily have to require dense floating point operations, as the Gordon Bell Prize has historically incentivized? Does simulation size even have anything to do with the actual impact of the scientific output?
While I do genuinely believe that the global exascale effort has brought nearly immeasurable good to the HPC industry, it’s now casting a stark shadow that highlights the growing divide between energy-efficient, accelerated computing (and the science that can make use of it) and all the applications and science domains that do not map neatly onto dense linear algebra. This growing divide causes me to lose sleep at night because it’s splitting the industry into two parts with unequal shares of capital. The future of publicly funded infrastructure for long-tail HPC is not bright, especially since the cloud is aggressively eating up this market.
Because this causes me a lot of personal anxiety about the future of the industry in which I am employed, I submitted the following whitepaper in response to an NSCI RFI issued in 2019 titled “Request for Information on Update to Strategic Computing Objectives.” To be clear, I wrote this entirely on my personal time and without the permission or knowledge of anyone who pays me; to that extent, I did not write this as a GPU- or DOE-apologist company man, and I did not use it as a springboard to advance my own research agenda, as often happens with these things. I just care about my own future and am continually trying to figure out how much runway I’ve got.
The TL;DR is that I am very supportive of efforts such as Fugaku and Crossroads (contrary to accusations otherwise), which are trying to do the hard thing and advance the state of the art in HPC technology without leaving wide swaths of traditional HPC users and science domains behind. Whether efforts like Fugaku and Crossroads are enough to keep the non-exascale HPC industry afloat remains unclear. For what it’s worth, I never heard any follow-up to my response to this RFI and expect it fell on deaf ears.
<h2>Response to “Request for Information on Update to Strategic Computing Objectives”</h2>G. K. Lockwood
August 17, 2019
<h3>Preface</h3>This document was written as a direct response to the Request for Information on Update to Strategic Computing Objectives (Document Number 2019-12866) published on June 18, 2019. All views expressed within are the personal opinion of its author and do not represent the views or opinions of any individuals or organizations with whom the author may or may not be associated in any professional or personal capacities. This document was authored without the support, knowledge, or input of any such individuals or organizations, and any similarity between the opinions expressed here and any other individuals or organizations is purely coincidental.
<h3>Question 1. What are emerging and future scientific and technical challenges and opportunities that are central to ensuring American leadership in Strategic Computing (SC), and what are effective mechanisms for addressing these challenges?</h3>
While the NSCI Strategic Plan identified four overarching principles that are undeniably required to maintain continued American leadership, its five strategic objectives are, in many ways, mutually incompatible.
In the three years following the initial NSCI plan towards delivering capable exascale, the outcomes of the Aurora and CORAL-2 procurements within DOE have made it undeniably clear that the definition of “capable exascale” necessarily requires the use of GPU technologies. Because GPUs are, in many ways, accelerators specifically suited for scientific problems that can be reduced to dense linear algebra, this has effectively signaled that scientific challenges which are not reducible to dense linear algebra (and are therefore incompatible with GPU technologies) are, by definition, no longer of strategic significance.
By bifurcating science domains based on whether or not they are compatible with GPU-based acceleration, we are now at a crossroads where entire classes of domain science research that have historically run at scale on CPU-based leadership computing systems will be left behind. To be clear, this is not simply a matter of engineering: many important classes of scientific challenges are fundamentally incompatible with the GPU accelerator model of computation, and no amount of code modernization will change this fact. Yet these same science domains, which rely on complex multiphysics applications that are core to strategic areas such as stockpile stewardship and climate science, are of undeniably critical importance to both national security and society at large.
Thus, there is now a clear and growing gap between NSCI’s ambition to deliver capable exascale and the larger mission to maintain leadership across the entirety of strategically important computing in the nation. Closing this gap presents its own technical challenges, including research into hardware and software technologies that approach strategic computing holistically rather than exclusively from a FLOPS perspective. The community has long acknowledged that the scope of HPC has surpassed simply performing floating point operations, and the definition of capability computing now includes enabling science that, for example, may require tremendous data analysis capabilities (e.g., moving, transforming, and traversing massive data sets) but has relatively low floating point requirements. The DOE Crossroads procurement and the Japanese leadership program with its Fugaku system embody this more balanced approach, and there is little doubt that both Crossroads and Fugaku will demonstrate a number of world firsts and, by definition, demonstrate leadership in strategic computing without making all of the sacrifices required to meet today’s definition of capable exascale.
Both Crossroads and Fugaku have required significant R&D investment to enable these dimensions of capability, and the NSCI would do well to explicitly call out the need for continued investment in directions that are orthogonal to exaflop-level capability.
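To make this distinction concrete, the following minimal Python sketch (illustrative only; the matrix sizes and the random graph are arbitrary placeholders) contrasts the two motifs: a dense matrix-matrix multiply, whose regular access pattern and high arithmetic intensity map cleanly onto GPUs, and a breadth-first graph traversal, whose data-dependent memory accesses are bound by latency rather than floating point throughput and gain little from dense linear algebra hardware.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(seed=0)

# GPU-friendly motif: a dense matrix-matrix multiply (GEMM). Regular, predictable
# memory access and high arithmetic intensity map cleanly onto accelerators.
A = rng.standard_normal((2048, 2048))
B = rng.standard_normal((2048, 2048))
C = A @ B

# Accelerator-unfriendly motif: breadth-first search over an irregular graph.
# Every step chases references whose targets are unknown until runtime, so the
# work is bound by memory latency rather than floating point throughput.
n_vertices = 100_000
adjacency = {v: rng.integers(0, n_vertices, size=8).tolist() for v in range(n_vertices)}

def bfs(adjacency, source):
    """Return the set of vertices reachable from source."""
    visited = {source}
    frontier = deque([source])
    while frontier:
        vertex = frontier.popleft()
        for neighbor in adjacency[vertex]:  # scattered, data-dependent reads
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append(neighbor)
    return visited

reachable = bfs(adjacency, source=0)
print(f"GEMM result shape: {C.shape}, vertices reached by BFS: {len(reachable)}")
```

No amount of refactoring turns the second motif into the first; the access pattern is a property of the problem, not of the code.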
<h3>Question 2. What are appropriate models for partnerships between government, academia and industry in SC, and how can these partnerships be effectively leveraged to advance the objectives of SC?</h3>
The most impactful models for industry-government partnership in HPC have come in the form of close collaboration between the HPC facilities that deploy extreme-scale systems and the technology providers in industry that create and support the required hardware and software solutions. Strategy necessarily involves taking input from user requirements, workload characterization, and technology trends to inform future directions, and HPC facilities are uniquely qualified to speak to both user requirements (by virtue of the fact that they directly interact with users in support of HPC systems) and workload characterization (by virtue of the fact that they manage HPC systems). Complementarily, industry technology providers (vendors) are uniquely qualified to speak to technology directions, marketability, and sustainability in the larger technology market.
This effective collaboration can take the form of non-recurring engineering, such as the contracts associated with large system procurements (often addressing more tactical challenges towards strategic computing), or standalone programs such as DOE PathForward (which addresses longer-term technology development towards strategic computing). In both cases, though, industry (not HPC facilities or academic researchers) proposes the initial scope of work based on its own understanding of both (1) HPC-specific requirements and (2) the larger market and profit prospects. This latter point is critical because the HPC market alone is simply not large enough to sustain purpose-built technologies, and sustaining new technologies and their peripheral enabling ecosystems requires buy-in from multiple markets.
The role of academia in research is more complex, as academic research in HPC can be either basic or applied in nature. Basic research (such as in applied mathematics and algorithm development) has stood on its own historically since such work results in a larger base of knowledge from which specific technology solutions (whether developed by industry or HPC facilities) can be composed both today and in the future. The federal agencies participating in NSCI can claim credit for funding the basic research outcomes that have been incorporated into innumerable software and hardware technologies in use today.
On the other hand, applied research (such as developing new software systems that may implement the outcomes of basic research) has had very mixed outcomes. It is often the case that applied researchers who have no direct relationship with either HPC facilities or technology providers formulate research projects based on second-hand HPC requirements and technology trends. It follows that their interpretation of such requirements is incomplete, and their research outcomes are misaligned with the actual needs of HPC facilities and industry. Barring cases where academic applied research outcomes are so valuable that they stand on their own (of which there are many examples, including OpenMPI and Tau), applied research in the absence of such a sustainability path results in a tremendous amount of software that has virtually no long-term (i.e., strategic) value to SC.
This speaks to a gap, which must be closed, between applied research in academia and those who apply research in practice. This gap has been perpetuated by a lack of HPC practitioners (domain scientists and applied researchers directly attached to HPC facilities or technology providers) on the committees that evaluate the merit of research proposals. Thus, a more effective engagement model would couple the academic research pipeline to HPC facilities and industry more closely. This may range from something as informal as increasing the diversity of review panels and program committees to include representatives from facilities and industry, to a formal requirement that successful research proposals have a clearly defined connection to a specific industry or facility partner. Regardless of the solution, though, funding applied research that will be “thrown over the wall” to HPC facilities and vendors without their input is not compatible with SC.
<h3>Question 3. How do we develop and nurture the capable workforce with the necessary skill and competencies to ensure American leadership in SC? What are effective nontraditional approaches to lowering the barriers to knowledge transfer?</h3>
Although virtually every report discussing the strategic directions and future requirements of HPC calls for knowledge transfer and building a larger workforce through training and outreach (e.g., see the complete set of DOE Exascale Requirements Reviews), such reports generally neglect two critical realities of employing and retaining a talented workforce at production HPC facilities and in industry.
The first reality is that the problems intrinsic to modern HPC (solving problems at extreme scales) are no longer exclusive to HPC. The ubiquity of technology in modern life now means that the entire technology industry must deal with problems at scale as a matter of course. As such, the HPC community is now competing with well-capitalized commercial entities that have increased the absolute value of a skilled engineer to levels that the scientific research community simply cannot afford.
Thus, the perceived lack of a skilled workforce in HPC is not a failing of the workforce development strategy in place; in fact, it may be a great indicator of its success, as it has created a workforce whose skills have a value that far outstrips the investment put into workforce development. However, this also means that the talented individuals who eschew the higher pay and amenities of working in the larger technology industry do so for non-monetary reasons (work-life balance, attraction to the science mission, geographic locality). It is therefore critically important that strategic computing identify these motivators and build upon them to the greatest possible degree to maintain an edge in an extremely competitive hiring landscape.
The second reality is that the key to an exceptional workforce is not simply a matter of technical knowledge. There is no shortage of individuals who understand parallel programming in the world, and it is of little strategic value to pursue workforce development strategies that prioritize knowledge transfer as the principal outcome. Rather, strategic computing requires a workforce that is capable of critical thinking and has a natural drive to solve problems that have never been solved before. These traits should be emphasized to a far greater degree than the current pedagogical emphasis on material that can be learned from a manual by anyone with a curious mind.
By definition, very few people in the world have prior experience in world-class HPC. There are very limited opportunities to build a credible work history in extreme-scale HPC for individuals who are ineligible for student internships or postdoctoral appointments. As a result, world-class HPC facilities rarely see qualified applicants for open positions when “qualified” is defined on the basis of relevant work experience; a mid-career developer or systems engineer working in a campus-scale HPC organization simply has no opportunities to demonstrate his or her intellectual capability in a way that stands out to the facilities that deliver strategic computing resources.
Thus, an integrative approach to workforce development that (1) emphasizes problem-based learning rather than rote reiteration of manuals and standards documents in an environment where (2) representatives from NSCI constituent agencies can engage with trainees (i.e., potential employees) in a fashion with less formality and pretense than a typical “CV-phone screen-interview” pipeline may reveal a much broader potential workforce whose strengths more closely align with strategic computing. Such an approach may manifest in the form of intensive boot camps such as the DOE ATPESC program, grants for mid-career retraining in partnership with a leadership computing facility, or sabbatical support for technical staff at the nation’s mid-scale computing facilities.
<h3>Question 4. How can technical advances in SC and other large government and private initiatives, including infrastructure advances, provide new knowledge and mechanisms for executing next generation research?</h3>
No response.
<h3>Question 5. What are the future national-level use cases that will drive new computing paradigms, and how will new computing paradigms yield new use cases?</h3>It is easy to claim that artificial intelligence will be the most important future national use case driving new computing paradigms. However, this is a very dangerous statement to make without qualification, as the actual level of readiness for applying AI to solve scientific problems is very low, and the actual scales, aggregate demand, and algorithmic motifs required by such workloads for scientific discovery are poorly defined. More generally, the requirements of AI workloads at large remain uncertain; for example, Facebook uses a variety of AI techniques in production and has found that each application area requires different computational, storage, and network resources (see Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective). Outside of the large hyperscale datacenters, industry consensus suggests that production AI workloads remain largely at single-server scales. As such, it is difficult to confidently assert how rapidly scale-out AI will be adopted in strategic computing.
The current leading technique for AI at scale is deep learning, yet scientific discovery is at odds with the black-box nature of this method. Alternative methods such as decision trees offer much more insight into why a trained model behaves as it does and are more compatible with applying physical constraints to the systems being modeled (e.g., see Iterative random forests to discover predictive and stable high-order interactions). However, the relative importance of such non-black-box learning techniques in HPC is completely unknown, as are the general optimization points for such techniques in the context of scientific computing. There is a danger that the similarities between deep learning and many HPC problems (GEMM-heavy workloads) place an artificially high importance on the role of deep learning in SC. It may be the case that deep learning is the most effective method for applying AI to address problems in scientific computing, but caution must be taken to ensure that major challenges in SC do not all come to look like deep-learning nails simply because GPUs are a very effective hammer.
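As a minimal sketch of this interpretability argument (using scikit-learn, with synthetic data standing in for a real scientific dataset; the feature names and functional form below are arbitrary), a tree-based model can report which inputs actually drive its predictions in a way a black-box deep network does not directly offer:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=42)

# Synthetic data standing in for a real scientific dataset: by construction, only
# features 0 and 3 carry any signal.
X = rng.standard_normal((1000, 5))
y = 3.0 * X[:, 0] + 0.5 * X[:, 3] ** 2 + 0.1 * rng.standard_normal(1000)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Unlike a black-box deep network, the trained forest exposes which inputs
# actually drive its predictions.
for name, importance in zip(["x0", "x1", "x2", "x3", "x4"], model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```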
From a domain science perspective, there are very few domain sciences where AI can replace traditional simulation-driven workflows wholesale. As such, the role of AI in SC will be largely supplementary; scientific workflows may integrate an AI component to generate starting conditions, replace humans in the loop during steering, or identify areas of interest in the results of a primary simulation. However, it is very unlikely that AI will grow to be of greater significance to scientific computing than modeling and simulation. Instead, it will be the source of new computational resource requirements that simply did not exist in the past because those tasks were carried out by humans. The road towards integrating AI into scientific workflows will also be a long and tortuous one, as the field is evolving far more rapidly in industry than scientific computing traditionally has. Care must be taken that SC not tie itself too closely to a method (and its associated hardware configurations) that may be deprecated in short order.
<h3>Question 6. What areas of research or topics of the 2016 NSCI Strategic Plan should continue to be a priority for federally funded research and require continued Federal R&D investments? What areas of research or topics of the 2016 Strategic Plan no longer need to be prioritized for federally funded research?</h3>
The five objectives outlined in the 2016 NSCI Strategic Plan all touch on topics that require continued federal R&D investment, but they require realignment with the technological, scientific, and economic landscape as it exists now.
<h4>Objective 1: accelerating the development of capable exascale by the mid-2020s</h4>The 2016 NSCI report correctly stated that capable exascale technologies would not be available until the mid-2020s, but DOE pulled its exascale system deliveries into the early 2020s. As a result, the delivery of exascale had to be accelerated at significantly higher cost: capital costs (the first US exascale systems will cost between 2x and 10x their immediate predecessors, either setting a new bar for the cost of future leadership HPC systems or resulting in a funding bubble relative to all post-exascale machines), operational costs (the power budgets may exceed the original 20 MW goal by 50%), and opportunity costs (only two of the three CORAL labs actually deployed a CORAL-1 machine).
Notably absent here is a commensurate increase (2x-10x, 1.5x, or 1.3x as above) in R&D efforts towards making these exascale systems widely accessible to applications that do not fall under the umbrella of ECP funding. As such, NSCI must continue to emphasize the importance of funding R&D to enable the “capable” component of this objective through the mid-2020s at minimum.
<h4>Objective 2: Developing a coherent platform for modeling, simulation, and data analytics</h4>The convergence of HPC and Big Data was a popular point of discussion when the 2016 report was written, but there has yet to be a compelling, quantitative analysis that demonstrates the difference between a “Big Data” system and an “HPC” system despite the best efforts of several leadership-scale HPC facilities. The challenge is not one of technology or system architecture; rather, the principal design point for “Big Data” systems outside of the HPC world has simply been cost (e.g., scaling out cheap hardware over a cheap network for a very well-defined bulk data access pattern) over performance. There is absolutely nothing that stops the typical “Big Data” application stacks, both old (e.g., Hadoop and Spark; see this paper) and new (e.g., TensorFlow; see this paper), from running at scale on modern HPC systems, and both have been demonstrated at scale on systems that were sensibly designed.
As such, this objective need not be emphasized in the future. Rather, engineering work is required to enable the “Big Data” stacks in use outside of HPC to work efficiently on the HPC systems of tomorrow. This remains a software, not architectural, problem, and very much an engineering, not research, challenge.
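To make the point above concrete, here is a minimal sketch of running data-parallel TensorFlow training across the nodes of a conventional HPC cluster. It assumes TensorFlow is installed on the compute nodes and that the batch scheduler exports a node list and a per-process rank; the NODELIST and RANK variable names below are placeholders rather than a real scheduler interface, and the synthetic data stands in for a real workload.

```python
import json
import os

import numpy as np
import tensorflow as tf

# Hypothetical scheduler interface: NODELIST and RANK are placeholders for whatever
# the site's batch system actually exports (they are not standard variable names).
nodes = os.environ.get("NODELIST", "localhost").split(",")
rank = int(os.environ.get("RANK", "0"))

# TF_CONFIG must describe the full worker set and this process's place in it, and
# it must be set before the distribution strategy is created.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": [f"{node}:23456" for node in nodes]},
    "task": {"type": "worker", "index": rank},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Synthetic data standing in for a real training set; the strategy handles gradient
# aggregation across workers over whatever interconnect the cluster provides.
x = np.random.rand(4096, 32).astype("float32")
y = np.random.rand(4096, 1).astype("float32")
model.fit(x, y, epochs=1, batch_size=256)
```

The work left to do is integration of this sort with schedulers, file systems, and interconnects, which is engineering rather than research.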
<h4>Objective 3: R&D towards post-CMOS technologies and new paradigms</h4>It is not the role of NSCI constituent agencies to fund the development of new materials systems explicitly for post-CMOS computing, because these agencies, their review committees, and the academic researchers they fund do not have the insight into the realities of logistics, material costs, and manufacturing required to predict what combination of materials and microarchitectures could actually be turned into a marketable product that can be sustained by the larger technology industry. In the absence of this insight, R&D towards post-CMOS technologies is likely to produce interesting demonstrations that are impractical for the purposes of actually developing leadership-scale computing systems. Instead, such research should be funded using facility-industry partnerships as discussed previously in Question 2.
Investing in R&D towards new paradigms in computing should also be considered not with respect to enabling new scientific applications, but rather with respect to accelerating existing scientific workloads that are incompatible with exascale technologies (GPUs). As discussed in the response to Question 1, there is a very real risk of leaving entire domains of computational science behind as the definition of leadership computing (when equated to exascale) becomes increasingly narrow in scope. Developing new accelerator technologies that benefit complex application workflows (e.g., multiphysics simulations) is of critical importance in the coming years, lest missions such as stockpile stewardship and climate science fall by the wayside.
<h4>Objective 4: Improving application development and workforce development</h4>The DOE Exascale Computing Project (ECP) has demonstrated a highly effective way of integrating researchers, application code teams, and facilities to improve application development. Providing a coherent ecosystem of recommended methods (such as its IDEAS project; e.g., see ECP-IDEAS), development tools (funded under its Software Technologies area), algorithm-application partnerships (through its co-design centers), and application integration efforts (funded under its Hardware and Integration area) is an excellent blueprint for improving application development. Developing a more generic model for establishing and supporting this style of development beyond the timeline of ECP funding should be pursued.
Workforce development should focus less on basic technical training and more on cultivating critical thinking, as described in the response to Question 3 above.
<h4>Objective 5: Broadening public-private partnership</h4>As described in the response to Question 2 above, public-private partnership is absolutely critical to sustaining SC in the coming years. The financial incentives driving technology development outside of HPC have come to far outstrip the resources available within HPC to develop technology independently. SC efforts must engage with both technology providers and the primary market forces (the enterprise and hyperscale computing industries) to better understand where technologies, solutions, and opportunities can be pursued in partnership rather than in parallel.
<h3>Question 7. What challenges or objectives not included in the 2016 NSCI Strategic Plan should be strategic priorities for the federally funded SC R&D? Discuss what new capabilities would be desired, what objectives should guide such research, and why those capabilities and objective should be strategic priorities?</h3>The mission of providing capable exascale as described in the 2016 NSCI Strategic Plan is proving not to be a sustainable long-term path. As described in the response to Question 1 above, the first exascale machines stand to accelerate scientific problems that can be cast as dense matrix-matrix multiplication problems, but there are large swaths of scientific problems to which this does not apply. If one considers the Graph500 BFS list, three of the top five systems are over seven years old and will be retired in 2019. While graph problems are not prolific in SC, the fact that so little progress has been made in accelerating extreme-scale graph traversal during the seven years in which exascale has been aggressively pursued is indicative of some classes of HPC problems being abjectly left behind.
Thus, a primary objective towards capable exascale must be examining the opportunity costs of the current strategic direction. If it is determined that there is simply no way to bring forward those types of computational problems that are incompatible with GPU-based acceleration, then a clearer strategy must be formulated to ensure that the scientific challenges being solved by those computational problems do not stagnate. As it stands, the public discourse surrounding the first-generation US exascale architectures is not universally positive because of this perceived scientific exclusivity of the chosen architectures, and such exclusivity is at odds with both capable computing and computing leadership.