This is a crosspost from Surfing the Singularity. See the original post here.

Surfing the Singularity - "the Workflow is the App"

Hello and happy fall holidays to you and yours.

As I wrote about in the last blog post [1], as quantum computing hardware matures over the next 5 to 10 years - from experimental toy to utility, and then perhaps to advantage over classical computing for some applications - it will join an already diverse and hybrid computing and applications landscape: on-prem computing, mobile and edge, cloud, and now novel types of computing devices which require new thinking and wholly new means of addressing them. How to deal with the burgeoning heterogeneity of this landscape - how to write and run apps which produce and consume data across a widening array of devices - is the topic of this post.


Language Landscape

The Java programming language, once touted in the glory days of "the World Wide Web" as "write once, run anywhere", and in its heyday representing 25% of new application development, is now down below 10%. What's hot? Python (23%), and "the C's" - a collection of C, C++, C# and their kin (>24% in total) - which are traditionally recompiled for specific hardware platforms. [2] And while Python provides portability, for performance in math operations it often depends on native libraries built in, you guessed it, the C's. Into this mix wades the US government, which has recently come out with a potentially disruptive statement against the use of the C's, citing security concerns due to their free-wheeling memory management. In spite of efforts like Safe C++, the government is recommending movement to memory safe languages like Rust, currently with just 1% market share, but "with a bullet". [3] Whether it is better to port to Rust or just update to Safe C++ depends on many factors - for example, how good are your docs and test cases - and while conceptual impedance mismatches may exist between languages, modern AI coding assistants will only increase in capability, especially for more rote tasks like porting.

Add to this mix the programming of Graphics Processing Units (GPUs) - originally intended for visualization but now used in applications for almost anything involving matrix math (turns out, that's lots of stuff). GPUs today are mostly sold by NVIDIA and are programmed in the C's (sometimes with a Python interface) using the NVIDIA CUDA library. These pieces of the application, the "kernels", are hardware dependent, and while many attempts have been made to create hardware-portable frameworks for GPU programming (see SYCL, for example [4]), nearly always the newest, fastest GPU features are available in the native, non-portable form first, leading to vendor lock-in. (This might be a good time to remember that NVIDIA does not itself manufacture chips - it designs chips which others produce.)

The manner in which we program GPUs is similar to the way we program quantum computers, i.e. QPUs - we delegate to them the portions of the application to which they are best suited, program them using device-specific instructions, and weave the results back into the holistic solution. Rather than wielding the Java hammer where everything is a virtualized nail, we use the best tool for the job. In quantum computing, for example, "variational" hybrid algorithms are a common theme: some of the work and preparation are performed on classical hardware as setup for a quantum step, and the results are then post-processed back on classical hardware, potentially iterating toward an optimal solution.
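
To make the shape of that loop concrete, here is a minimal sketch of a variational-style hybrid iteration. Every function in it is a hypothetical placeholder standing in for real classical pre/post-processing and a real QPU call, not any particular SDK's API.

```python
# A minimal sketch of a variational hybrid loop; every function here is a
# hypothetical placeholder, not a real quantum SDK's API.

def prepare_parameters(previous=None):
    # Classical pre-processing: choose an initial or updated parameter set.
    return [0.1, 0.2] if previous is None else [p * 0.9 for p in previous]

def run_quantum_step(params):
    # Stand-in for executing a parameterized circuit on a QPU (or simulator)
    # and measuring a cost value.
    return sum(p * p for p in params)

def converged(cost, tolerance=1e-3):
    # Classical post-processing: decide whether to stop iterating.
    return cost < tolerance

params, cost = None, float("inf")
while not converged(cost):
    params = prepare_parameters(params)   # classical setup
    cost = run_quantum_step(params)       # quantum step
print(f"final cost: {cost:.6f}")
```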

Two of several emerging patterns for integrating quantum computing into an application solution. [5]

This pattern is analogous to what is common in classical high performance computing (HPC) for applications like weather modeling and other complex simulations - pre-process on commodity hardware, run an HPC job on the big box, and post-process the results. Adding AI models to steer parts of the process only increases the heterogeneity of the complete solution.

A blended computing landscape, enabling for example, quantum computing to produce highly precise data to train AI to steer a classical HPC simulation. [6]

All these hardware-dependent application pieces for an ever widening array of hardware means that compilers are cool again, and compiler pipelines like LLVM are critical to application development and deployment. [7] Included in this class of development tools are circuit transpilers for quantum hardware which must take into consideration not only the architectural differences between QPUs (e.g. which gates are supported, what's the inter-qubit connectivity like, etc.), but also the changes which can occur in a quantum data center on a daily basis as these new, noisy, and fragile qubits simply fail and go offline, potentially altering the machine's topology. Just-in-time compilation is needed, and compiler optimization is therefore also cool again. Thank you, Frances Allen. [8]


Parts is Parts

What emerges from this landscape is not a singular executable running on one computer, but rather, multiple application piece parts, written in different languages, running on radically different hardware in sequence and simultaneously, being orchestrated into a complete solution.

In other words, a workflow. Back in the day Java's Sun Microsystems (remember them?) asserted "the network is the computer". Now we assert "the workflow is the app".

Or more likely, a workflow of workflows. We like to think of these nested workflows in three types (sketched in code below the list): [9]

  1. in-situ: the workflow is running all on the same machine (e.g. a local process, an HPC job)
  2. intra-site: the workflow is running on different machines within the same connected enterprise (e.g. within the same data center, virtual network, etc.)
  3. inter-site: the workflow is running across different machines in different enterprises (e.g. hybrid on-prem and perhaps multi-vendor cloud)
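
A tiny sketch of that three-way taxonomy as it might appear in code; the enum name and values are purely illustrative, not part of any particular framework.

```python
from enum import Enum

class WorkflowScope(Enum):
    # Illustrative labels for the three nested workflow types described above.
    IN_SITU = "in-situ"        # all on one machine, e.g. a local process or an HPC job
    INTRA_SITE = "intra-site"  # different machines within one connected enterprise
    INTER_SITE = "inter-site"  # machines spanning enterprises, e.g. on-prem plus cloud
```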

With all these compute types, languages, and locations working together to realize the workflow and solution, loose coupling is key - components connected but not dependent - each part minding its own business. In other words, to paraphrase the poet, good interfaces make good neighbors. [10]

We use the convenience term "Site" to mean a provider of secure compute and data services. What interfaces must a Site provide? The interface or API can include lots of things, but it must at least provide: 1) authentication and authorization, 2) a means to run components through their lifecycle, 3) a means to manage data being operated on and produced, perhaps being moved into and out of the Site, and 4) some way to get an inventory of the Site's service offerings and provision them for the purposes of running components or holding data. We call these by four functional nicknames: Auth, Run, Repo, and Spin.
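
As a rough illustration of that contract - the class and method names below are assumptions for this post, not the API of any particular framework - a minimal Site interface might be sketched in Python like so:

```python
from abc import ABC, abstractmethod

class Site(ABC):
    """Illustrative contract for a provider of secure compute and data services."""

    @abstractmethod
    def login(self, credentials) -> bool:
        """Auth: authenticate and authorize against the Site."""

    @abstractmethod
    def submit(self, job_definition) -> str:
        """Run: launch a component and return a handle for tracking its lifecycle."""

    @abstractmethod
    def put(self, local_path: str, site_ref: str) -> None:
        """Repo: move data into the Site (a matching get() would do the reverse)."""

    @abstractmethod
    def list_resources(self) -> list:
        """Spin: inventory the Site's service offerings for provisioning."""
```

A real interface would of course carry more - status queries, metadata, cancellation - but these four pillars form the skeleton.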


Four functional pillars of an interoperable computing site.

We can see in each of the three types of workflows the need for each of these four functional pillars, albeit some as a no-op or inherited from a higher order workflow. For example, in a "type 1" workflow of components running on a single machine or within an HPC allocation, the Auth aspects may already be implicitly addressed - i.e. the user is already logged into the machine or authorized to run on the HPC cluster. But a workflow which utilizes compute resources both on-prem and in the cloud will have to interact at runtime with the "auth" aspects of the cloud provider before it can "run" workloads or put and get data to various "repos". Most cloud providers offer a means to list available computing resources and to "spin" them up and down. This provisioning itself can be part of an end-to-end workflow: authenticate, get an inventory of available services, spin some up, run jobs on them storing the results, and spin them down.
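
That end-to-end provisioning flow, written against the hypothetical Site interface sketched above (the provision, wait, and release methods are additional assumptions for illustration):

```python
def provision_and_run(site, credentials):
    """Illustrative end-to-end workflow against a hypothetical Site driver.
    Every method name here is an assumption for this sketch, not a real API."""
    site.login(credentials)                    # Auth
    offerings = site.list_resources()          # Spin: inventory available services
    node = site.provision(offerings[0])        # Spin up a compute resource
    job_id = site.submit({"command": "simulate.sh", "resource": node})  # Run
    site.wait(job_id)                          # block until a terminal state
    site.put("results.dat", "run-001/results.dat")  # Repo: persist the output
    site.release(node)                         # Spin down
```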

Stuck in the Middle

Most cloud providers - from Amazon to the IBM Quantum cloud - provide a callable API which can be viewed through the lens of Auth, Run, Repo, Spin. So do some of the supercomputers and cutting edge resources offered by the Federal government, most notably those at the National Energy Research Scientific Computing Center (NERSC). [11]

As Sites, these providers expose their offerings to internal and external workflows; however, they do not themselves promote a means to author these cross-site workflows, to manage them, track them, or keep tabs on all that distributed data. What else is needed? First, since cloud and other service providers have no motivation to standardize their interfaces, a framework super-interface could exist with the ability to plug in drivers for specific service providers. This, in theory, is the Auth, Run, Repo, Spin interface. Second, since each provider defines its own service and runtime component lifecycle (loosely: start, run, and stop with success or fail end states), there needs to be a way to normalize the status terminology - a "fail" on one site is the same as an "error" on another, and "success" means the same thing as "done". This permits the third aspect of a middleware framework - the ability to track running jobs on Sites and trigger other jobs on any Site to run accordingly - i.e. the control flow of the workflow.
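
A minimal sketch of that second point, status normalization - the provider-specific strings below are invented examples mapped onto one common lifecycle vocabulary:

```python
from enum import Enum

class JobStatus(Enum):
    # One common lifecycle vocabulary for the middleware to reason about.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETE = "complete"
    FAILED = "failed"

# Invented examples of Site-native status strings, normalized to the common set.
_STATUS_MAP = {
    "queued":  JobStatus.PENDING,
    "running": JobStatus.RUNNING,
    "done":    JobStatus.COMPLETE,
    "success": JobStatus.COMPLETE,
    "error":   JobStatus.FAILED,
    "fail":    JobStatus.FAILED,
}

def normalize_status(site_status: str) -> JobStatus:
    """Translate a Site-native status string into the canonical vocabulary."""
    return _STATUS_MAP[site_status.lower()]
```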

What about the data? Commonly we need the ability to put data to a Site and get some back - this is the Repo interface of the Site. And while most (but not all) Sites provide some means to store and retrieve data, be it filesystem or S3 object store or database or something else, it would also be nice to be able to say something "meta" about the data - which Site did it come from, what job or application produced it, what other workflow steps on this Site or others consumed it? Some Sites provide storage with metadata (e.g. Amazon S3), but most don't. This metadata comprises the provenance of the data - like a Civil War sword on the Antiques Roadshow, it's the paper trail showing where the item came from, proving the item is legit. In a workflow which produces many pieces of data, perhaps iteratively as it converges on a solution, keeping track of all those pieces seems, well, important. The acronym FAIR - findable, accessible, interoperable, reusable - seems a good starting point. [12]
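
For a feel of what that paper trail might carry, here is a sketch of a provenance record riding along with one piece of workflow data; the field names are illustrative, loosely guided by the FAIR principles rather than any standard schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Illustrative metadata accompanying one piece of workflow data."""
    data_ref: str                  # where the data lives (path, S3 key, etc.)
    produced_by_site: str          # which Site produced it
    produced_by_job: str           # which job or workflow step produced it
    consumed_by_jobs: list = field(default_factory=list)   # downstream consumers
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```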


Open Says Me

Our open source project lwfm, the "local workflow manager", attempts to render these concepts as a reference implementation. [13] It's small, with minimal Python lib dependencies, and can be taken anywhere easily as a single runnable component, its provenance metadata also easily portable and importable. A typical Site driver - a Python class which implements the Site interface - weighs in at around 200 lines of code including the whitespace. Armed with a Site driver for a cloud service, you can author long-running workflows which utilize a mix of compute resources, storage, and data infrastructures, and automatically track the provenance paper trail.
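
For flavor, here is a much-abbreviated, purely hypothetical driver in that spirit - it implements the illustrative Site interface sketched earlier in this post and is not the actual lwfm API:

```python
import shutil
import subprocess

class LocalSite(Site):  # hypothetical driver; the real lwfm interface differs
    """A toy 'Site' that runs components as local processes."""

    def login(self, credentials) -> bool:
        return True                        # Auth: already logged into this machine

    def submit(self, job_definition) -> str:
        proc = subprocess.Popen(job_definition["command"], shell=True)
        return str(proc.pid)               # Run: the PID serves as the job handle

    def put(self, local_path: str, site_ref: str) -> None:
        shutil.copy(local_path, site_ref)  # Repo: a local copy stands in for a data move

    def list_resources(self) -> list:
        return ["localhost"]               # Spin: only one 'resource' on offer
```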

The lwfm middleware component provides some very recognizable services.

Should you use this tooling? I wouldn't recommend it. (Huh? Did I hear you correctly?) How many people are maintaining it? (Two?) What about the community? (Next to none.) The software would fare poorly on a "spider web" analysis of its overall quality - you would not want to recommend it to your boss.

A convenient multi-axis assessment framework for software model maturity. [14]

The lwfm is a reference implementation of a workflow interop framework, at best. Are there alternatives? OMG are there alternatives! The workflow landscape is notoriously rich, fragmented, and super-niched. But portability and interoperability are often neglected, as is data provenance. Government or university projects, while well meaning and sometimes directionally correct, quickly go stale when the funding elapses [15], and commercial solutions, while often suffering some of the same deficiencies, offer the added trap of vendor lock-in and can come with a hefty price tag.

Order, Order

So it's back to committee. [16] Next week the high performance computing community will be meeting again at the SC Conference Series' Supercomputing 2024 (SC24), this year in Atlanta. Hybrid workflows for scientific and engineering applications - involving classical HPC, AI-focused clusters, and now also quantum computers - will be among the very many topics discussed. [17] And we should expect some surprises in the new rankings of the top machines on the planet - at least, the ones they want us to know about. [18]

Perhaps I'll report back on some of those returns in a future blog. Best regards. - andy


References & Amusements

[0] Banner photo by Ben Wicks on Unsplash

[1] "Surfing the Singularity: The Universe Computes", A. Gallo, https://www.linkedin.com/pulse/surfing-singularity-universe-computes-andy-gallo-6fgle

[2] TIOBE ranking of programming language popularity: https://www.tiobe.com/tiobe-index/

[3] Safe C++, with some chronology of the government statements: https://safecpp.org/

[4] SYCL: https://www.khronos.org/sycl/

[5] "Post-variational quantum neural networks", https://pennylane.ai/qml/demos/tutorial_post-variational_quantum_neural_networks

[6] "Hope Versus Hype: Quantum, AI and the Path to Commercial Advantage", Matthias Troyer, presentation at IEEE Quantum Week, Montreal, September 2024.

[7] LLVM: https://llvm.org/

[8] https://amturing.acm.org/award_winners/allen_1012327.cfm

[9] "Industrial Experience Deploying Heterogeneous Platforms for Use in Multi-Modal Power Systems Design Workflows", A. Gallo et al, https://drive.google.com/file/d/1c3YEVmEAUjbI5urj4PiV2TtjzBUzLlws

[10] "Mending Wall, Robert Frost, https://www.poetryfoundation.org/poems/44266/mending-wall

[11] NERSC SuperFacility API: https://docs.nersc.gov/services/sfapi/

[12] "The FAIR Guiding

Principles for scientific data management and stewardship", Mark D. Wilkinson et al., https://pmc.ncbi.nlm.nih.gov/articles/PMC4792175/pdf/sdata201618.pdf

[13] lwfm, https://github.com/lwfm-proj/lwfm

[14] "Model Maturity Web", https://richardarthur.medium.com/co-design-web-6f37664ac1e1

[15] Them's fighting words, and I expect to be roasted for it. But it seems to me that even the most popular software tool kits (no names) which emerged from the massively government funded Exascale Computing Project failed to gain traction outside of a narrow community, failed to provide sustainable maintenance once ECP funding ended, and would thus fare similarly poorly on a spider web analysis of their sustainability, their recommendability.

[16] "Workflows Community Summit 2024: Future Trends and Challenges in Scientific Workflows", da Silva et al, "https://zenodo.org/records/13844759. I participated in the event, as well as the prior in 2022, and you can compare to that report as well: "Workflows Community Summit 2022: A Roadmap Revolution", also da Silva et al, https://zenodo.org/records/7750670.

[17] SC24, https://sc24.conference-program.com/

[18] TOP 500 supercomputers, June 2024, https://top500.org/lists/top500/list/2024/06/ - to be updated again before Thanksgiving.